One of the pressing problems in web data extraction is separating the meaningful content that pertains to the subject from cruft such as navigation links, ads, footnotes, and sidebars that promote links to other pages within the site or to other sites.
This is crucial because the later processing stages then work only on the right content, with a much better signal-to-noise ratio. There has been extensive research in this area, and many papers have been published.
decruft is a fork of python-readability that aims to make it faster. It also includes some logic corrections and improvements made along the way.
python-readability uses BeautifulSoup to parse the HTML, whereas decruft uses python-lxml.
The advantage of using BeautifulSoup is that it is pure Python and is bundled with python-readability. Hence there is no outside dependency, and it can run on any platform, including Google App Engine.
The advantage of decruft is that, because lxml is built on libxml2, processing is one to two orders of magnitude faster in the tests below, and it can parse broken HTML that even BeautifulSoup cannot handle. However, it needs lxml and is hence restricted to Linux systems.
decruft is available on Google Code.
Six web pages with reasonable article-like content were run through both python-readability and decruft to compare performance. The time taken by each is given in seconds in the table below. Note also that while BeautifulSoup failed to parse parts of the National Geographic page because of a malformed start tag, lxml had no issues.
(In the arc90 links, readability will be applied after the page loads.)
| Page | Size | arc90 link | python-readability (s) | decruft (s) |
|---|---|---|---|---|
| Endangered Asian ‘unicorn’ captured, first sighting in decade – CNN.com | 108.56K | link | 15.9949080944 | 0.187147140503 |
| BBC News – Cacao genome ‘may help produce tastier chocolate’ | 59.27K | link | 12.1847000122 | 0.383187055588 |
| Arctic species under threat, report warns – CNN.com | 123.34K | link | 16.9275019169 | 0.208026885986 |
| Nature – Wikipedia, the free encyclopedia | 272.23K | link | 151.132905006 | 2.32351303101 |
| National Geographic: Tyrannosaurs Were Human-size for 80 Million Years | 134.14K | link | 17.4031939507 | 0.997216939926 |
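To put the timings in perspective, the speedup per page can be worked out directly from the table. The short Python 3 sketch below does only that arithmetic. It assumes the first timing column is python-readability and the second is decruft, and the shortened page labels are mine.

```python
# Timings copied from the table above, in seconds.
# Assumption: the first timing column is python-readability, the second is decruft.
timings = {
    "CNN: Asian 'unicorn'": (15.9949080944, 0.187147140503),
    "BBC: Cacao genome": (12.1847000122, 0.383187055588),
    "CNN: Arctic species": (16.9275019169, 0.208026885986),
    "Wikipedia: Nature": (151.132905006, 2.32351303101),
    "National Geographic: Tyrannosaurs": (17.4031939507, 0.997216939926),
}

# Speedup = python-readability time divided by decruft time.
speedups = {page: slow / fast for page, (slow, fast) in timings.items()}

# Print the pages from smallest to largest speedup.
for page, factor in sorted(speedups.items(), key=lambda item: item[1]):
    print("%-35s %6.1fx" % (page, factor))
```

On these figures decruft comes out roughly 17 to 85 times faster, depending on the page.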
Improvements in decrufting logic
The following are improvements in logic over readability/python-readability.
- More patterns that suggest meaningful content are checked for in tags’ id and class attributes.
- One part of the logic treats divs that do not have containers such as img, table, or div as children as paragraph tags. python-readability (probably because of a typo in the condition) did the reverse: if the div had a container child, it was converted to a paragraph. This has been fixed. The fix incidentally increases the processing time. Interestingly, the reversed condition gives better results than the actual arc90 readability. Case in point: for Nature – Wikipedia, the See Also section is removed by arc90 but neatly retained by python-readability. Applying this fix brings back the irregularity in the See Also section. The next fix takes care of that.
- Once the main section of meaningful text has been determined, a function iterates through all the descendants of the main section and prunes unnecessary cruft. One such rule removes lists whose link density is > 0.2. This also prunes away valid links to relevant content. The main purpose of this rule is to remove navigational links, so I’ve added an additional constraint: a list is pruned only if its content length is also less than 300, to avoid pruning meaningful content.
- Some images embedded in the text are lost because of the pruning logic (e.g. the first img in Wikipedia Nature, Hopetoun Falls, Australia, is lost after applying arc90’s readability). Extra logic has been added to retain images based on their size.
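The div-to-paragraph check and the list-pruning rule described above can be sketched roughly as follows. This is a minimal Python 3 illustration, not decruft's actual code: the function names are hypothetical, the container set only covers the three tags named in the text, measuring content length in characters is an assumption, and it uses the standard library's xml.etree on well-formed snippets instead of lxml.

```python
import xml.etree.ElementTree as ET

CONTAINER_TAGS = {"img", "table", "div"}  # the text says "img, table, div etc"
LINK_DENSITY_LIMIT = 0.2                  # threshold quoted in the text
MIN_CONTENT_LENGTH = 300                  # quoted in the text; characters is an assumption


def div_is_paragraph(div):
    """A div is treated as a paragraph only if no direct child is a container."""
    return not any(child.tag in CONTAINER_TAGS for child in div)


def text_length(elem):
    """Length of all text inside elem, ignoring leading/trailing whitespace."""
    return len("".join(elem.itertext()).strip())


def link_density(elem):
    """Fraction of elem's text that sits inside <a> tags."""
    total = text_length(elem)
    if total == 0:
        return 0.0
    link_chars = sum(text_length(a) for a in elem.iter("a"))
    return link_chars / total


def should_prune_list(elem):
    """Prune a list only if it is link-heavy AND short."""
    return (link_density(elem) > LINK_DENSITY_LIMIT
            and text_length(elem) < MIN_CONTENT_LENGTH)


# A short navigation list: all links, very little text.
nav = ET.fromstring(
    '<ul><li><a href="/">Home</a></li><li><a href="/news">News</a></li></ul>'
)

# A "See also" style list: also all links, but with plenty of text.
see_also = ET.fromstring(
    "<ul>" + "".join(
        '<li><a href="/wiki/%d">Related article number %d on the natural world</a></li>'
        % (i, i)
        for i in range(10)
    ) + "</ul>"
)

plain_div = ET.fromstring("<div>Some <b>plain</b> text</div>")
nested_div = ET.fromstring("<div><div>nested block</div></div>")

print(should_prune_list(nav))        # short and link-heavy
print(should_prune_list(see_also))   # link-heavy but long
print(div_is_paragraph(plain_div), div_is_paragraph(nested_div))
```

Both lists have a link density well above 0.2, so the density check alone would drop them. With the length constraint added, only the short navigation list is pruned.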
Currently the installation option is to download it from Google Code, unpack it, and add it to PYTHONPATH:
    tar -zxf decruft-x.y.tgz
    export PYTHONPATH=$PYTHONPATH:$PWD/decruft
    python
    >>> from decruft import Document
    >>> import urllib2
    >>> f = urllib2.urlopen(url)
    >>> print Document(f.read()).summary()
I’m looking forward to writing a pip/easy_install installer for the same.