List of resources: Article text extraction from HTML documents
11 Mar 2011
UPDATE 21/3/2011: Added reader-contributed links to the software and API sections
Following up on my overview of article text extractors, I'll try to compile a list of research papers, articles, web APIs, libraries and other software that I encountered during my research.
Research papers and Articles
Just to summarize the ones mentioned in my previous post:
- Boilerplate Detection using Shallow Text Features by Kohlschütter et al
- Extracting Article Text from the Web with Maximum Subsequence Segmentation by Pasternack & Roth
- Text Extraction from the Web via Text-to-Tag Ratio by Weninger & Hsu
  and the extension of the main algorithm to 2D histogram clustering: Web Content Extraction Through Histogram Clustering (another version)
- VIPS: a Vision-based Page Segmentation Algorithm
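To give a feel for the maximum subsequence idea in the Pasternack & Roth paper listed above: each token of the page gets a score (positive if it looks like article text, negative otherwise), and the extractor keeps the contiguous run of tokens with the largest total score. The sketch below only shows that last step. The scores are made up for illustration; in the paper they come from a trained classifier, which is not reproduced here.

```python
def max_subsequence(scores):
    """Return (start, end, total) for the contiguous run of scores
    with the largest sum. `end` is exclusive. If every score is
    non-positive, the empty run (0, 0, 0.0) is returned."""
    best_sum, best_start, best_end = 0.0, 0, 0
    cur_sum, cur_start = 0.0, 0
    for i, score in enumerate(scores):
        if cur_sum <= 0:
            # A non-positive running total can never help: restart here.
            cur_sum, cur_start = score, i
        else:
            cur_sum += score
        if cur_sum > best_sum:
            best_sum, best_start, best_end = cur_sum, cur_start, i + 1
    return best_start, best_end, best_sum


# Hypothetical per-token scores: navigation, article body, footer.
token_scores = [-1, -1, 2, 1, -1, 2, -3, -1]
start, end, total = max_subsequence(token_scores)
```

With these scores the selected run covers tokens 2 to 5, so a few weak tokens inside the article body do not split it in two.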
Others not mentioned in the overview:
- Fairly old, but cited by nearly every paper with a similar topic: Roadrunner is a wrapper induction technique based on pattern discovery within similar HTML documents.
- Automatic Web News Extraction Using Tree Edit Distance: This algorithm uses a tree comparison metric analogous to Levenshtein distance to detect relevant content in a set of HTML documents.
- Discovering Informative Content Blocks from Web Documents: A bit outdated due to some of its assumptions, but interesting because it uses entropy as a threshold metric to predict informative blocks of content.
- Web Page Cleaning with Conditional Random Fields: If you read any of the articles above, sooner or later you'll notice that everybody cites the CleanEval shared task, which took place in 2007. This paper presents the best-performing algorithm, which uses a CRF to label blocks of content as text or noise based on block-level features.
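The entropy idea mentioned in the list above can be illustrated roughly like this: a term that shows up evenly on every page of a site (menus, footers) has high entropy across pages, while a term concentrated on one page has low entropy, so a block made of low-entropy terms is more likely to be informative. This is my own simplified reading, not the paper's exact formulation; the function names, the averaging step and the 0.5 threshold are all assumptions.

```python
import math
from collections import Counter


def term_entropies(pages):
    """pages: list of token lists, one per page (at least two pages).
    Returns {term: entropy of the term's distribution over pages},
    normalized to [0, 1] by using log base len(pages)."""
    n = len(pages)
    counts = [Counter(tokens) for tokens in pages]
    vocab = set().union(*pages)
    entropies = {}
    for term in vocab:
        freqs = [c[term] for c in counts]
        total = sum(freqs)
        h = 0.0
        for f in freqs:
            if f:
                p = f / total
                h -= p * math.log(p, n)
        entropies[term] = h
    return entropies


def block_entropy(block_tokens, entropies):
    """Mean entropy of the known terms in a block (0.0 if none are known)."""
    known = [entropies[t] for t in block_tokens if t in entropies]
    return sum(known) / len(known) if known else 0.0


def informative_blocks(blocks, entropies, threshold=0.5):
    """Keep blocks whose mean term entropy is below the (assumed) threshold."""
    return [b for b in blocks if block_entropy(b, entropies) < threshold]


# Two toy pages sharing a menu but differing in their article terms.
pages = [
    ["home", "about", "contact", "cats", "purr"],
    ["home", "about", "contact", "dogs", "bark"],
]
entropies = term_entropies(pages)
kept = informative_blocks([["home", "about", "contact"], ["cats", "purr"]], entropies)
```

Here the shared menu terms end up with entropy 1 and the page-specific terms with entropy 0, so only the second block is kept.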
I've also stumbled upon these:
- Hierarchical wrapper induction for semistructured information sources
- Template detection for large scale search engines
- Web Page Cleaning for Web Mining through Feature Weighting
- Eliminating noisy information in Web pages for data mining
The only good blog article I came across is the one at ai-depot.com: The Easy Way to Extract Useful Text from Arbitrary HTML. The author uses examples written in Python to demonstrate a technique fairly similar to the one described in the text-to-tag ratio paper listed above.
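For readers who want a quick feel for the text-to-tag ratio idea, here is a minimal sketch: for every line of HTML, divide the number of text characters by the number of tags on that line, and keep the lines with a high ratio. It is neither the ai-depot author's code nor the paper's full method (which, as far as I understand it, also smooths and clusters the ratios). The regex-based tag matching and the threshold of 10 are assumptions chosen for illustration.

```python
import re

# Crude tag matcher; a real extractor would use an HTML parser and
# drop scripts, styles and comments first.
TAG_RE = re.compile(r"<[^>]*>")


def text_to_tag_ratios(html):
    """Return one ratio per line: text characters / number of tags.
    Lines without tags are divided by 1 to avoid division by zero."""
    ratios = []
    for line in html.splitlines():
        tags = TAG_RE.findall(line)
        text = TAG_RE.sub("", line).strip()
        ratios.append(len(text) / max(len(tags), 1))
    return ratios


def extract_text(html, threshold=10.0):
    """Keep the text of lines whose ratio reaches the (assumed) threshold."""
    kept = []
    for line, ratio in zip(html.splitlines(), text_to_tag_ratios(html)):
        if ratio >= threshold:
            kept.append(TAG_RE.sub("", line).strip())
    return "\n".join(kept)


sample = (
    '<div><a href="/">Home</a> <a href="/about">About</a></div>\n'
    "<p>This is a long paragraph of article text that should survive.</p>\n"
    "<div><span>(c)</span></div>"
)
article = extract_text(sample)
```

On this sample only the paragraph line survives, since the navigation and footer lines carry far more markup than text.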
Software
There is only a small amount of competition when it comes to software capable of [removing boilerplate text / extracting article text / cleaning web pages / predicting informative content blocks] or whatever terms authors are using to describe the capabilities of their product.
The common denominator for all the software listed below is the following criterion: given an HTML document with an article in it, the system should yield the article itself (and maybe its title and other metadata):
- Boilerpipe library: an open source Java library. The library itself is the official implementation of the overall algorithm presented in the previously mentioned paper by Kohlschütter et al.
- Readability bookmarklet by arc90labs is open sourced. Originally written in JavaScript, it has also been ported to other languages:
  - python-readability - using BeautifulSoup (slow)
  - fork of python-readability employing lxml for faster parsing
  - ruby-readability
  - PHP port
  - jReadability
  - C# port
- Project Goose by Gravity labs
- Perl module HTML::Feature
- Webstemmer is a web crawler and page layout analyzer with a text extraction utility
- Demo of VIPS packaged in a .dll (its use is limited to research purposes only)
Web APIs
After a short inquiry I came across some very decent web APIs:
- Alchemy API Web Page Cleaning - a well known commercial API with a limited free service
- ViewText.org - they're asking you to be kind to their servers, so this is not your typical commercial service
- DiffBot API - describes itself as: "Statistical machine learning algorithms are run over all of the visual elements on the page to extract out the article text and associated metadata, such as its images, videos, and tags."
- Purifry - promises high performance and good accuracy. It's also available as a binary.
- Extractiv - text extraction is just a side feature
- Repustate API - includes a clean HTML call
Related
There is a ton of stuff out there that is somehow related to items listed above. Perhaps you'll find them interesting or at least useful:
- Full text RSS feed builder at fulltextrssfeed.com - a very neat example of putting article text extraction into practice
- FiveFilters.org - another full text RSS builder. According to their FAQ, they use a PHP port of Readability
- Boilerpipe has a nice demo running on appengine
- Demo of the previously mentioned extraction algorithm using maximum subsequence optimization
This is it. Hopefully you now have a better perspective of the sparse literature and software on the topic in question. If you're aware of an item that should be added to any of the lists in this post, please drop me a line in the comments. Don't forget to subscribe to my RSS feed for related updates.