List of resources: Article text extraction from HTML documents

UPDATE 21/3/2011: Added reader contributed links to software and API section

Following up on my overview of article text extractors, I'll try to compile a list of the research papers, articles, web APIs, libraries and other software that I encountered during my research.

Research papers and Articles

Just to summarize the ones mentioned in my previous post:

Others not mentioned in the overview:

  • Fairly old, but cited by nearly every paper with a similar topic: Roadrunner is a wrapper induction technique based on pattern discovery within similar HTML documents.
  • Automatic Web News Extraction Using Tree Edit Distance: This algorithm uses a tree comparison metric analogous to Levenshtein distance to detect relevant content in a set of HTML documents.
  • Discovering Informative Content Blocks from Web Documents: A bit outdated due to some of its assumptions, but interesting because it employs entropy as a threshold metric to predict informative blocks of content.
  • Web Page Cleaning with Conditional Random Fields: If you read any of the articles above, sooner or later you'll notice that everybody cites the CleanEval shared task, which took place in 2007. This paper presents the best-performing algorithm, which uses conditional random fields (CRF) to label blocks of content as text or noise based on block-level features.
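
To make the "Levenshtein distance for trees" idea from the tree edit distance paper a bit more concrete, here is a minimal sketch of a simplified top-down tree edit distance. It is not the algorithm from the paper: the Node class, the unit cost for relabeling and the rule that whole subtrees are inserted or deleted at a cost equal to their size are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for an HTML element: a tag name plus children."""
    label: str
    children: list = field(default_factory=list)


def size(node):
    """Number of nodes in the subtree rooted at `node`."""
    return 1 + sum(size(child) for child in node.children)


def tree_distance(a, b):
    """Simplified top-down edit distance between two ordered trees.

    Relabeling a node costs 1; inserting or deleting a subtree costs its
    size. The child lists are aligned with the same dynamic program that
    Levenshtein distance uses for strings.
    """
    relabel = 0 if a.label == b.label else 1
    m, n = len(a.children), len(b.children)

    # d[i][j]: cost of turning the first i children of a
    # into the first j children of b.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(b.children[j - 1])

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(a.children[i - 1]),      # delete subtree
                d[i][j - 1] + size(b.children[j - 1]),      # insert subtree
                d[i - 1][j - 1]
                + tree_distance(a.children[i - 1], b.children[j - 1]),
            )
    return relabel + d[m][n]


# Two toy page skeletons that differ by a single list item.
page_a = Node("body", [Node("div", [Node("p")]),
                       Node("ul", [Node("li"), Node("li")])])
page_b = Node("body", [Node("div", [Node("p")]),
                       Node("ul", [Node("li")])])
print(tree_distance(page_a, page_b))
```

Pages generated from the same template should end up with a small distance, and the subtrees that do differ are the candidates for the actual content.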

I've also stumbled upon these:

The only good blog article I came across is the one at ai-depot.com: The Easy Way to Extract Useful Text from Arbitrary HTML. The author uses examples written in Python to demonstrate a technique fairly similar to the one described in the text-to-tag ratio paper listed above.
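
For readers who haven't seen the idea before, a very rough sketch of a per-line text-to-tag ratio filter could look like the following. This is neither the code from the blog article nor the method from the paper; the regular expression, the function names and the fixed threshold of 10 are assumptions chosen only to keep the example short.

```python
import re

# Naive tag matcher; an assumption that is good enough for a sketch,
# but not a substitute for a real HTML parser.
TAG_RE = re.compile(r"<[^>]*>")


def text_to_tag_ratio(line):
    """Characters of visible text divided by the number of tags in the line."""
    tag_count = len(TAG_RE.findall(line))
    text_length = len(TAG_RE.sub("", line).strip())
    # Lines without tags are divided by 1 to avoid a division by zero.
    return text_length / max(tag_count, 1)


def extract_text(html, threshold=10.0):
    """Keep the text of every line whose ratio reaches the (assumed) threshold."""
    kept = []
    for line in html.splitlines():
        if text_to_tag_ratio(line) >= threshold:
            kept.append(TAG_RE.sub("", line).strip())
    return "\n".join(kept)


sample = "\n".join([
    '<div><a href="/">Home</a> <a href="/about">About</a></div>',
    "<p>This is a long paragraph of article text that should be kept by the filter.</p>",
    "<div><span>ad</span></div>",
])
print(extract_text(sample))
```

Navigation and ad markup is dense in tags and short on text, so those lines score low, while article paragraphs score high. A hard-coded threshold is the crudest possible way to separate the two.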

Software

There is only a small amount of competition when it comes to software capable of [removing boilerplate text / extracting article text / cleaning web pages / predicting informative content blocks] or whatever terms authors are using to describe the capabilities of their product.

The least common denominator for all the software listed below is the following criterion: given an HTML document with an article in it, the system should yield the article itself (and maybe its title and other metadata):

Web APIs

After a short inquiry I came across some very decent web APIs:

  • Alchemy API Web Page Cleaning - a well known commercial API with a limited free service
  • ViewText.org - they're asking you to be kind to their servers, so this is not your typical commercial service
  • DiffBot API - describes itself as: "Statistical machine learning algorithms are run over all of the visual elements on the page to extract out the article text and associated metadata, such as its images, videos, and tags."
  • Purifry - promises high performance and good accuracy. It's also available as a binary.
  • Extractiv - text extraction is just a side feature
  • Repustate API - includes a clean HTML call

Related

There is a ton of stuff out there that is somehow related to the items listed above. Perhaps you'll find some of it interesting or at least useful:

  • Full text RSS feed builder at fulltextrssfeed.com - a very neat example of putting article text extraction into practice
  • FiveFilter.org - another full text RSS builder. According to their FAQ, they use a PHP port of Readability
  • Boilerpipe has a nice demo running on appengine
  • Demo of the previously mentioned extraction algorithm using maximum subsequence optimization
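
The core of the maximum subsequence idea fits in a few lines: give every token of the page a score and pick the contiguous run of tokens with the highest total. The sketch below is only an illustration under loud assumptions. The tokenizer is a naive regular expression, and the fixed scores (+1 per word, -3.25 per tag) are hand-picked values rather than scores produced by a trained model.

```python
import re

# Naive tokenizer: either a whole tag or a run of non-space, non-"<" characters.
TOKEN_RE = re.compile(r"<[^>]*>|[^<\s]+")


def max_subsequence_text(html, word_score=1.0, tag_score=-3.25):
    """Return the words inside the highest-scoring contiguous token run."""
    tokens = TOKEN_RE.findall(html)

    best_sum, best_start, best_end = 0.0, 0, 0
    current_sum, current_start = 0.0, 0

    # Standard maximum-sum subarray scan over the token scores.
    for i, token in enumerate(tokens):
        if current_sum <= 0:
            # A run with a non-positive total can never help; restart here.
            current_sum, current_start = 0.0, i
        current_sum += tag_score if token.startswith("<") else word_score
        if current_sum > best_sum:
            best_sum, best_start, best_end = current_sum, current_start, i + 1

    words = [t for t in tokens[best_start:best_end] if not t.startswith("<")]
    return " ".join(words)


page = (
    "<div><a>Home</a><a>About</a></div>"
    "<p>The quick brown fox jumps over the lazy dog again and again</p>"
    "<div><a>Footer</a></div>"
)
print(max_subsequence_text(page))
```

With scores like these, tag-heavy navigation and footer regions drag the running total down, so the best run tends to settle on the longest stretch of mostly uninterrupted text.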

This is it. Hopefully you now have a better perspective of the sparse literature and software on the topic in question. If you're aware of an item that should be added to any of the lists in this post, please drop me a line in the comments. Don't forget to subscribe to my RSS feed for related updates.