Decruft

2010-09-20 15:13:51 » Machine Learning, Python

One of the pressing problems in web data extraction is separating meaningful content pertaining to the subject from cruft like navigation links, ads, footnotes and sidebar content promoting links to other pages within the site or on other sites.

This is crucial: extracting only the right content from a page greatly improves the signal-to-noise ratio in the following processing stages. There has been extensive research in this area and many papers have been presented.

One of the significant steps forward has been Arc90’s Readability project. Readability is a piece of JavaScript code that resides as a bookmarklet in your browser. When clicked/activated, it extracts the main article content, hiding the rest of the cruft and making it easy on your eyes and your left brain.


As with any novel project, Readability has been ported to PHP, Ruby, Python, etc., propelled even more by the fact that JavaScript engines run code in a highly constrained browser environment and are thus hard to leverage. The Python port, python-readability by gfx-monk, closely follows Readability’s logic but is too slow.

decruft is a fork of python-readability that makes it faster. It also includes some logic corrections and improvements along the way.

python-readability uses BeautifulSoup to parse the HTML, whereas decruft uses python-lxml.

The advantage of using BeautifulSoup is that it’s pure Python and bundled with python-readability. Hence there is no outside dependency, and it can run on any platform, including Google App Engine.

The advantage of decruft is that, since it uses libxml2, processing is orders of magnitude faster, and it can parse broken HTML that even BeautifulSoup cannot handle. But it needs lxml and is hence restricted to Linux systems.
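To illustrate the robustness point, here is a minimal sketch (assuming lxml is installed; the broken markup below is a made-up example, not the National Geographic page) showing lxml quietly recovering from malformed HTML:

```python
# Minimal sketch: lxml/libxml2 recovering from broken markup.
# The unclosed <b> tag here is a hypothetical example of malformed HTML.
from lxml import html

broken = "<html><body><p>one <b>two</p> three</body></html>"
doc = html.fromstring(broken)  # libxml2 silently repairs the bad nesting
print(doc.text_content())      # all three words survive
```

BeautifulSoup (at least the 2010-era version bundled with python-readability) gives up on some of these malformations, while libxml2’s recovery mode keeps going.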

Decruft is available on Google Code.

Comparison

Six web pages with reasonable article-like content were tested for performance with both python-readability and decruft. The time taken by each is given in seconds in the table below. It should also be noted that while BeautifulSoup failed to parse parts of the National Geographic page due to a malformed start tag, lxml had no issues.

(In the Arc90 links, Readability is applied after the page loads.)

Original Page                                                            Size     Arc90  python-readability  decruft
Endangered Asian ‘unicorn’ captured, first sighting in decade – CNN.com  108.56K  link   15.99 s             0.19 s
BBC News – Cacao genome ‘may help produce tastier chocolate’             59.27K   link   12.18 s             0.38 s
Arctic species under threat, report warns – CNN.com                      123.34K  link   16.93 s             0.21 s
Nature – Wikipedia, the free encyclopedia                                272.23K  link   151.13 s            2.32 s
National Geographic: Tyrannosaurs Were Human-size for 80 Million Years   134.14K  link   17.40 s             1.00 s

Improvements in decrufting logic

The following are improvements in logic over readability/python-readability.

  1. More patterns indicating meaningful content are checked for in tags’ id and class attributes.
  2. Part of the logic treats divs that have no container children (img, table, div, etc.) as paragraph tags. python-readability (probably due to a typo in the condition check) did the reverse: if the div had a container child, it was converted to a paragraph. This has been fixed, though the fix incidentally increases processing time. The interesting part is that the reversed condition sometimes gives a better result than the actual Arc90 Readability. Case in point: for Nature – Wikipedia, the See Also section is removed by Arc90 but neatly retained by python-readability. Applying this fix alone reintroduces irregularity in the See Also section; the next fix takes care of that.
  3. Once the main section of meaningful text has been determined, a function iterates through all descendants of the main section and prunes unnecessary cruft. One such rule removes lists whose link density is > 0.2. This also prunes away valid links to relevant content, whereas the rule’s main purpose is only to remove navigational links. So I’ve added an additional constraint: pruning happens only when the content length is less than 300 characters, to avoid pruning meaningful content.
  4. Some images embedded in the text were lost to the pruning logic (e.g. the first image in Wikipedia’s Nature article, Hopetoun Falls, Australia, is lost after applying Arc90’s Readability). Extra logic was added to retain images based on their size.
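The link-density rule in point 3 can be sketched roughly as follows. This is a simplified stand-alone version: the real decruft code walks an lxml tree, and the function names and the 0.2/300 thresholds shown here follow the description above, not decruft’s actual API.

```python
# Simplified sketch of the link-density pruning heuristic described above.
# Helper names are illustrative; decruft operates on lxml elements instead.

def link_density(text_len, link_text_len):
    """Fraction of an element's text that sits inside <a> tags."""
    if text_len == 0:
        return 0.0
    return link_text_len / float(text_len)

def should_prune(text_len, link_text_len, max_density=0.2, min_keep_len=300):
    """Prune a list only if it is link-heavy AND short.

    The extra length constraint (point 3) keeps long, link-rich but
    meaningful sections from being thrown away as navigation.
    """
    return link_density(text_len, link_text_len) > max_density \
        and text_len < min_keep_len

# A short nav menu (mostly link text, little content) gets pruned:
print(should_prune(text_len=80, link_text_len=70))    # True
# A long, link-heavy but substantial section is kept:
print(should_prune(text_len=900, link_text_len=400))  # False
```

Without the length constraint, the second case (density 0.44 > 0.2) would have been pruned along with the navigation menus.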

Installation

Currently the only installation option is to download it from Google Code.

tar -zxf decruft-x.y.tgz
export PYTHONPATH=$PYTHONPATH:$PWD/decruft
python

>>> from decruft import Document
>>> import urllib2
>>> f = urllib2.urlopen(url)
>>> print Document(f.read()).summary()

I’m looking forward to writing a pip/easy_install installer for it.
