Feature-wise Comparison of HTML Article Text Extractors

19 Apr 2011

In one of my previous posts I compiled a fairly decent list of software (and other resources) capable of extracting article content from an arbitrary HTML document. While gathering the relevant papers and software, I kept a handwritten spreadsheet that compares the listed software feature by feature. I have now updated it and decided to post it on my blog. Hopefully this comparison table will ease the decision-making process for developers whose products depend on such software.

First, let's review the features examined for each piece of software in the table:

structure retainment - Articles are usually formatted using various HTML tags: paragraphs, lists, hyperlinks and others. Some text extractors remove this structure and yield only the plain text of the article.
inner content cleaning - The article content is sometimes broken into non-consecutive text blocks by ads and other boilerplate structures. We're interested in the ability to remove such inline boilerplate.
implementation - How the software is delivered, e.g. as an open source library or a commercial web API.
source parameter - Can we fetch the document ourselves, or does the extractor fetch it internally?
language dependency - Some extractors are limited to only one language.
additional features (and remarks)
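
To make the "structure retainment" distinction concrete, here is a minimal Python sketch that renders the same, already extracted, article fragment both ways. It is not taken from any of the tools compared here: the `STRUCTURAL_TAGS` whitelist, the `ArticleRenderer` class and the `render` function are hypothetical names chosen for illustration, and only the standard library is used.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist of tags that count as "article structure".
STRUCTURAL_TAGS = {"p", "ul", "ol", "li", "a"}


class ArticleRenderer(HTMLParser):
    """Render an article fragment as plain text or with basic structure kept."""

    def __init__(self, retain_structure):
        super().__init__()
        self.retain_structure = retain_structure
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.retain_structure and tag in STRUCTURAL_TAGS:
            # Keep only href on links; every other attribute is dropped.
            href = dict(attrs).get("href") if tag == "a" else None
            if href:
                self.parts.append(f'<{tag} href="{escape(href, quote=True)}">')
            else:
                self.parts.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if self.retain_structure and tag in STRUCTURAL_TAGS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is always kept, whatever the output mode.
        self.parts.append(data)


def render(fragment, retain_structure):
    renderer = ArticleRenderer(retain_structure)
    renderer.feed(fragment)
    renderer.close()
    return "".join(renderer.parts)


fragment = '<div class="x"><p>Hello <a href="http://example.com" class="c">world</a></p></div>'
plain = render(fragment, retain_structure=False)
structured = render(fragment, retain_structure=True)
```

An extractor listed as "plain text" in the table behaves like the first call, while one that "retains original structure" behaves more like the second, which matters if you want to re-render the article with its paragraphs and links intact.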

	structure retainment	inner content cleaning	implementation	source parameter	language dependency	additional features and remarks
Boilerpipe	plain text only	uses a classifier to determine whether or not the atomic text block holds useful content	open source java library	you can fetch documents by yourself or use built-in utilities to fetch them for you	should be language independent since the text block classifier observes language independent text features	implements many extractors with different classification rules trained on different datasets
Alchemy API	text only (has an option to include relevant hyperlinks)	n/a	commercial web api	include the whole document in the post request or provide a url	observation: returns an error for non-English content, e.g. the document contains "unsupported text language"	extra API call to extract the title
Diffbot	plain text or html	an option to remove inline ads	web api (private beta)	does fetching for you via provided url	n/a	extracts: relevant media, title, tags, xpath descriptor for wrappers, comments and comment count, article summary
Readability	retains original structure	uses hardcoded heuristics to extract content divided by ads	open source javascript bookmarklet	via browser	language independent but it relies on language dependent regular expressions to match id and class labels
Goose	plain text	n/a	open source java library	url only (my fork enables you to fetch the document by yourself)	language independent but it relies on language dependent regular expressions to match id and class labels	uses hardcoded heuristics to search for related images and embedded media
Extractiv	depends on the chosen output format, e.g. xml format breaks the content into paragraphs	n/a	commercial web api	include the whole document in the post request or provide a url	n/a	capable of enriching the extracted text with semantic entities and relationships
Repustate API	plain text	n/a	commercial web api	url only	n/a
Webstemmer	plain text	n/a	open source python library	first runs a crawler to obtain seed pages, then it learns layout patterns that are later put to work to extract article content	language independent	the only piece of software on this list that requires a cluster of similar documents obtained by crawling
NCleaner (paper)	plain text	uses character level n-grams to detect content text blocks	open source perl library	arbitrary html document	depends on the training language	reliant on lynx browser for converting html to structured plain text

Some cells in the table are marked as "n/a" because the table was built by inspecting the respective software documentation or research papers, and in those cases information about the observed feature was absent.
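
The Readability and Goose rows note that both rely on language dependent regular expressions matched against id and class labels. The following Python sketch shows why such a heuristic is tied to a language. The patterns and the `label_score` function are hypothetical and written for illustration only; they are not the actual expressions used by either project.

```python
import re

# Hypothetical English-only label patterns, in the spirit of hardcoded heuristics.
POSITIVE = re.compile(r"article|body|content|entry|post|text", re.I)
NEGATIVE = re.compile(r"comment|footer|sidebar|sponsor|advert", re.I)


def label_score(class_and_id):
    """Return a net score for an element's class/id string.

    +1 if it looks like content, -1 if it looks like boilerplate,
    0 if neither pattern (or both) matched.
    """
    score = 0
    if POSITIVE.search(class_and_id):
        score += 1
    if NEGATIVE.search(class_and_id):
        score -= 1
    return score


english_content = label_score("article-content")
english_boilerplate = label_score("comments")
german_boilerplate = label_score("kommentare")
```

With these assumed patterns, an English label such as "comments" is penalised, while the German "kommentare" gets a neutral score because nothing matches. The scoring logic itself is language independent, but its usefulness depends on markup authors naming their elements in the language the patterns were written for.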

