List of resources: Article text extraction from HTML documents

11 Mar 2011

UPDATE 21/3/2011: Added reader-contributed links to the software and API sections

Following up on my overview of article text extractors, I'll try to compile a list of research papers, articles, web APIs, libraries and other software that I encountered during my research.

Research papers and Articles

Just to summarize the ones mentioned in my previous post:

Boilerplate Detection using Shallow Text Features by Kohlschütter et al.
Extracting Article Text from the Web with Maximum Subsequence Segmentation by Pasternack & Roth
Text Extraction from the Web via Text-to-Tag Ratio by Weninger & Hsu
and the extension of the main algorithm to 2D histogram clustering: Web Content Extraction Through Histogram Clustering (another version)
VIPS: a Vision-based Page Segmentation Algorithm
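To give a rough feel for the maximum subsequence idea behind the Pasternack & Roth paper listed above, here is a minimal Python sketch. It is not the authors' implementation: as far as I understand it, the paper scores tokens with a trained classifier, while the fixed scores below (`word_score`, `tag_score`) and all function names are assumptions I made up purely for illustration. The only part taken from the technique itself is the search for the contiguous run of tokens with the highest total score.

```python
import re

# A token is either a whole HTML tag or a run of non-space, non-tag characters.
TOKEN_RE = re.compile(r"<[^>]*>|[^<\s]+")


def tokenize(html):
    """Split an HTML string into tag tokens and word tokens."""
    return TOKEN_RE.findall(html)


def score(token, word_score=1.0, tag_score=-2.0):
    """Hypothetical fixed scores: words count for, tags count against."""
    return tag_score if token.startswith("<") else word_score


def max_subsequence(scores):
    """Return (start, end) of the contiguous run with the largest sum.

    Indices are half-open, so the run is scores[start:end].
    """
    best_sum, best = 0.0, (0, 0)
    cur_sum, cur_start = 0.0, 0
    for i, s in enumerate(scores):
        if cur_sum <= 0:
            # A non-positive prefix can never help, so restart here.
            cur_sum, cur_start = s, i
        else:
            cur_sum += s
        if cur_sum > best_sum:
            best_sum, best = cur_sum, (cur_start, i + 1)
    return best


def extract_article(html):
    """Keep the words inside the best-scoring token run."""
    tokens = tokenize(html)
    start, end = max_subsequence([score(t) for t in tokens])
    return " ".join(t for t in tokens[start:end] if not t.startswith("<"))
```

Calling `extract_article(page_html)` on a page whose navigation is dense with links and whose article is mostly plain words should return roughly the article body, although real pages need far better token scores than these two constants.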

Others not mentioned in the overview:

Fairly old, but cited by nearly every paper with a similar topic: Roadrunner is a wrapper induction technique based on pattern discovery within similar HTML documents.
Automatic Web News Extraction Using Tree Edit Distance: This algorithm uses a tree comparison metric analogous to Levenshtein distance to detect relevant content in a set of HTML documents.
Discovering Informative Content Blocks from Web Documents: A bit outdated due to some of its assumptions, but interesting because it uses entropy as a threshold metric to predict informative blocks of content.
Web Page Cleaning with Conditional Random Fields: If you read any of the articles above, sooner or later you'll notice that everybody cites the CleanEval shared task, which took place in 2007. This paper presents the best-performing algorithm, which uses a CRF to label blocks of content as text or noise based on block-level features.
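The entropy idea from the content blocks paper above can be sketched in a few lines of Python. This is my own simplified reading and not the paper's exact formulation: I assume that several pages from the same site have already been split into blocks of terms, that a term spread evenly across pages (menus, footers) gets high entropy, and that a block is informative when the mean entropy of its terms falls below a cut-off. The function names and the `threshold=0.5` default are hypothetical.

```python
import math
from collections import Counter


def term_entropies(pages):
    """pages: list of pages; each page is a list of blocks; each block a list of terms.

    Returns a dict mapping each term to its entropy across pages, normalised
    to the range [0, 1] by taking logarithms to base len(pages).
    """
    n_pages = len(pages)
    if n_pages < 2:
        raise ValueError("need at least two pages to compare")
    # Count how often each term occurs on each page.
    per_page = [Counter(t for block in page for t in block) for page in pages]
    vocab = set().union(*per_page)
    entropies = {}
    for term in vocab:
        counts = [c[term] for c in per_page]
        total = sum(counts)
        h = 0.0
        for c in counts:
            if c:
                p = c / total
                h -= p * math.log(p, n_pages)
        entropies[term] = h
    return entropies


def block_entropy(block, entropies):
    """Mean term entropy; an empty block is treated as uninformative (assumption)."""
    if not block:
        return 1.0
    return sum(entropies[t] for t in block) / len(block)


def informative_blocks(pages, page_index, threshold=0.5):
    """Blocks of one page whose entropy is at or below the hypothetical threshold."""
    entropies = term_entropies(pages)
    return [b for b in pages[page_index] if block_entropy(b, entropies) <= threshold]
```

With two pages that share a navigation block but have different article blocks, the shared block scores an entropy of 1 and is dropped, while the page-specific block scores 0 and is kept.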

I've also stumbled upon these:

The only good blog article I came across is the one at ai-depot.com: The Easy Way to Extract Useful Text from Arbitrary HTML. The author uses examples written in Python to demonstrate a technique fairly similar to the one described in the text-to-tag ratio paper listed above.
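For readers who have not met the text-to-tag ratio before, here is a minimal Python sketch of the basic idea, assuming the ratio is computed per line as the number of non-tag characters divided by the number of tags (with zero tags counted as one). The regular expression, the function names and the fixed `threshold=10.0` are my own assumptions for illustration; neither the paper nor the ai-depot.com article is reproduced here, and both do more than apply a hard cut-off.

```python
import re

# Naive tag matcher; good enough for a sketch, not a real HTML parser.
TAG_RE = re.compile(r"<[^>]*>")


def text_to_tag_ratios(html):
    """Return one ratio per line: visible characters divided by tag count."""
    ratios = []
    for line in html.splitlines():
        tags = TAG_RE.findall(line)
        text = TAG_RE.sub("", line).strip()
        # A line without tags is divided by 1 so that plain text scores high.
        ratios.append(len(text) / max(len(tags), 1))
    return ratios


def extract_text(html, threshold=10.0):
    """Keep the text of lines whose ratio reaches the hypothetical threshold."""
    lines = html.splitlines()
    ratios = text_to_tag_ratios(html)
    kept = [
        TAG_RE.sub("", line).strip()
        for line, ratio in zip(lines, ratios)
        if ratio >= threshold
    ]
    return "\n".join(kept)
```

Lines that are mostly markup, such as menus and share buttons, get a low ratio, while a long paragraph wrapped in a single pair of tags gets a high one. A fixed threshold is the crudest possible way to separate the two groups and will need tuning per site.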

Software

There is only a small amount of competition when it comes to software capable of [removing boilerplate text / extracting article text / cleaning web pages / predicting informative content blocks] or whatever terms authors are using to describe the capabilities of their product.

The common denominator for all the software listed below is the following criterion: given an HTML document with an article in it, the system should yield the article itself (and maybe its title and other metadata):

Boilerpipe library: an open source Java library. The library itself is the official implementation of the overall algorithm presented in the previously mentioned paper by Kohlschütter et al.
Readability bookmarklet by arc90labs is open source. Originally written in JavaScript, it has also been ported to other languages:
- python-readability - using BeautifulSoup (slow)
- fork of python-readability employing lxml for faster parsing
- ruby-readability
- PHP port
- jReadability
- C# port
Project Goose by Gravity labs
Perl module HTML::Feature
Webstemmer is a web crawler and page layout analyzer with a text extraction utility
Demo of VIPS packaged in a .dll (its use is limited to research purposes only)

Web APIs

After a short inquiry I came across some very decent web APIs:

Alchemy API Web Page Cleaning - a well known commercial API with a limited free service
ViewText.org - they're asking you to be kind to their servers, so this is not your typical commercial service
DiffBot API - describes itself as: "Statistical machine learning algorithms are run over all of the visual elements on the page to extract out the article text and associated metadata, such as its images, videos, and tags."
Purifry - promises high performance and good accuracy. It's also available as a binary.
Extractiv - text extraction is just a side feature
Repustate API - includes a clean html call

Related

There is a ton of stuff out there that is somehow related to the items listed above. Perhaps you'll find some of it interesting or at least useful:

Full text RSS feed builder at fulltextrssfeed.com - a very neat example of putting article text extraction into practice
FiveFilters.org - another full text RSS builder. According to their FAQ, they're using a PHP port of Readability
Boilerpipe has a nice demo running on App Engine
Demo of the previously mentioned extraction algorithm using maximum subsequence optimization

That's it. Hopefully you now have a better view of the sparse literature and software on this topic. If you're aware of an item that should be added to any of the lists in this post, please drop me a line in the comments. Don't forget to subscribe to my RSS feed for related updates.

My Tech Blog.

Related Posts

Evaluating Text Extraction Algorithms 09 Jun 2011

Feature-wise Comparison of HTML Article Text Extractors 19 Apr 2011

Evaluation Metrics for Text Extraction Algorithms 30 Mar 2011