Feature-wise Comparison of HTML Article Text Extractors

In one of my previous posts I compiled a fairly comprehensive list of software (and other resources) capable of extracting article content from an arbitrary HTML document. While gathering the relevant papers and software I kept a handwritten spreadsheet comparing the listed tools feature by feature. I have now updated it and decided to publish it on my blog. Hopefully this comparison will simplify decision making for developers whose products depend on such software.

First, let's review the shared functionality features examined for each piece of software in the table:

  • structure retention - Articles are usually formatted with various HTML tags - paragraphs, lists, hyperlinks and others. Some text extractors strip this structure and yield only the plain text of the article.
  • inner content cleaning - The article content is sometimes broken into non-consecutive text blocks by ads and other boilerplate structures. We're interested in the ability to remove such inline boilerplate.
  • implementation
  • language dependency - Some extractors are limited to a single language.
  • source parameter - Can we fetch the document ourselves, or does the extractor fetch it internally?
  • additional features (and remarks)
And now the comparison itself, one entry per tool:
Boilerpipe
  • structure retention: plain text only
  • inner content cleaning: uses a classifier to determine whether or not an atomic text block holds useful content
  • implementation: open-source Java library
  • source parameter: you can fetch documents yourself or use built-in utilities that fetch them for you
  • language dependency: should be language independent, since the text block classifier observes language-independent text features
  • additional features and remarks: implements several extractors with different classification rules trained on different datasets
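Boilerpipe's shallow-feature approach is easy to picture with a toy sketch: classify each text block as content or boilerplate from language-independent features such as word count and link density. The thresholds below are made up for illustration and are not Boilerpipe's trained values.

```python
import re

def link_density(html_block):
    """Fraction of the block's words that sit inside <a> tags."""
    text = re.sub(r"<[^>]+>", " ", html_block)
    anchor_text = " ".join(re.findall(r"<a[^>]*>(.*?)</a>", html_block, re.S))
    words = len(text.split())
    return len(anchor_text.split()) / words if words else 0.0

def is_content(html_block, min_words=15, max_link_density=0.33):
    """Toy rule: long blocks with few links are probably article text."""
    text = re.sub(r"<[^>]+>", " ", html_block)
    return len(text.split()) >= min_words and link_density(html_block) <= max_link_density

blocks = [
    "<p>The quick brown fox jumps over the lazy dog, again and again, "
    "in a long paragraph that clearly carries the article's content.</p>",
    '<div><a href="/home">Home</a> <a href="/news">News</a> <a href="/about">About</a></div>',
]
print([is_content(b) for b in blocks])  # -> [True, False]
```

Because these features say nothing about the words themselves, the same rule works for any language - which is exactly why Boilerpipe can claim language independence.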
Alchemy API
  • structure retention: text only (with an option to include relevant hyperlinks)
  • inner content cleaning: n/a
  • implementation: commercial web API
  • source parameter: include the whole document in the POST request or provide a URL
  • language dependency: observed to return an error for non-English content, e.g. "unsupported text language"
  • additional features and remarks: title extraction requires an extra API call
Diffbot
  • structure retention: plain text or HTML
  • inner content cleaning: an option to remove inline ads
  • implementation: web API (private beta)
  • source parameter: fetches the document for you via a provided URL
  • language dependency: n/a
  • additional features and remarks: extracts relevant media, title, tags, an XPath descriptor for wrappers, comments and comment count, and an article summary
Readability
  • structure retention: retains the original structure
  • inner content cleaning: uses hardcoded heuristics to extract content divided by ads
  • implementation: open-source JavaScript bookmarklet
  • source parameter: via the browser
  • language dependency: language independent, but relies on language-dependent regular expressions to match id and class labels
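The language-dependent regular expressions mentioned for Readability and Goose are worth a concrete illustration: elements are scored up or down according to English-centric id/class names. The pattern lists and score values below are illustrative only, not the actual ones used by either tool.

```python
import re

# English-centric label patterns, in the spirit of Readability-style heuristics
# (illustrative lists, not the real ones).
NEGATIVE = re.compile(r"comment|footer|sidebar|ad-|banner|share", re.I)
POSITIVE = re.compile(r"article|content|entry|main|post|body", re.I)

def score_element(id_attr, class_attr):
    """Crude score: +25 for content-like labels, -25 for boilerplate-like ones."""
    label = f"{id_attr} {class_attr}"
    score = 0
    if POSITIVE.search(label):
        score += 25
    if NEGATIVE.search(label):
        score -= 25
    return score

print(score_element("main-article", "post-body"))  # -> 25
print(score_element("", "sidebar ad-widget"))      # -> -25
```

A page whose markup labels its sidebar "colonne-laterale" instead of "sidebar" would slip past such patterns - hence the "language independent, but" caveat.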
Goose
  • structure retention: plain text
  • inner content cleaning: n/a
  • implementation: open-source Java library
  • source parameter: URL only (my fork lets you fetch the document yourself)
  • language dependency: language independent, but relies on language-dependent regular expressions to match id and class labels
  • additional features and remarks: uses hardcoded heuristics to find related images and embedded media
Extractiv
  • structure retention: depends on the chosen output format - e.g. the XML format breaks the content into paragraphs
  • inner content cleaning: n/a
  • implementation: commercial web API
  • source parameter: include the whole document in the POST request or provide a URL
  • language dependency: n/a
  • additional features and remarks: can enrich the extracted text with semantic entities and relationships
Repustate API
  • structure retention: plain text
  • inner content cleaning: n/a
  • implementation: commercial web API
  • source parameter: URL only
Webstemmer
  • structure retention: plain text
  • inner content cleaning: n/a
  • implementation: open-source Python library
  • source parameter: first runs a crawler to obtain seed pages, then learns layout patterns that are later used to extract article content
  • language dependency: language independent
  • additional features and remarks: the only tool on this list that requires a cluster of similar documents obtained by crawling
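Webstemmer's crawl-then-learn approach rests on a simple intuition that can be caricatured in a few lines: across several pages sharing one layout, text that repeats on every page is template boilerplate, and what remains is article content. This sketch ignores Webstemmer's actual layout-pattern model entirely; it only shows the intuition.

```python
from collections import Counter

def extract_unique(pages):
    """pages: list of pages, each a list of text lines from the same site.
    Lines appearing on every page are treated as template and dropped."""
    counts = Counter(line for page in pages for line in set(page))
    template = {line for line, c in counts.items() if c == len(pages)}
    return [[line for line in page if line not in template] for page in pages]

pages = [
    ["Example News", "Story A body text", "Copyright 2011"],
    ["Example News", "Story B body text", "Copyright 2011"],
]
print(extract_unique(pages))  # -> [['Story A body text'], ['Story B body text']]
```

This also makes clear why Webstemmer needs a cluster of similar documents: with a single page there is nothing to compare against, so the template cannot be identified.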
NCleaner (paper)
  • structure retention: plain text
  • inner content cleaning: uses character-level n-grams to detect content text blocks
  • implementation: open-source Perl library
  • source parameter: arbitrary HTML document
  • language dependency: depends on the training language
  • additional features and remarks: relies on the lynx browser to convert HTML to structured plain text
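NCleaner's character-level n-gram idea can be shown in miniature: build simple character trigram frequency profiles from clean text and from boilerplate, then label each block by which profile fits better. This naive log-probability version bears no resemblance to the paper's actual models, but it shows why the approach depends on the training language.

```python
from collections import Counter
import math

def ngrams(text, n=3):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train(samples, n=3):
    """Character n-gram counts over a list of training strings."""
    counts = Counter()
    for s in samples:
        counts.update(ngrams(s.lower(), n))
    return counts

def score(text, counts, n=3):
    """Add-one smoothed log-likelihood of the text under an n-gram profile."""
    total = sum(counts.values()) + len(counts) + 1
    return sum(math.log((counts[g] + 1) / total) for g in ngrams(text.lower(), n))

content = train(["the committee announced that the report was published today"])
boiler = train(["click here | home | contact us | privacy policy | login"])

block = "the minister announced the report"
print("content" if score(block, content) > score(block, boiler) else "boilerplate")
```

Profiles trained on English character sequences will fit, say, German or Czech text poorly, so - unlike Boilerpipe's structural features - a new training corpus is needed per language.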

Some cells in the table are marked "n/a" because the table was built by inspecting each tool's documentation or research papers, and information about the feature in question was simply absent.