In one of my previous posts I compiled quite a decent list of software (and other resources) capable of extracting article content from an arbitrary HTML document. While gathering the relevant papers and software I kept a handwritten spreadsheet comparing the listed tools feature by feature. I have now updated it and decided to publish it on my blog. Hopefully this comparison table will simplify the decision-making of developers whose products depend on such software.
First, let’s review the shared functionality features examined for each piece of software in the table:
- structure retainment – Articles are usually formatted using various HTML tags: paragraphs, lists, hyperlinks, and others. Some text extractors remove this structure and yield only the plain text of the article.
- inner content cleaning – The article content is sometimes broken into non-consecutive text blocks by ads and other boilerplate structures. We’re interested in capabilities to remove such inline boilerplate.
- language dependency – Some extractors are limited to a single language.
- source parameter – Can we fetch the document by ourselves or does the extractor fetch it internally?
- additional features (and remarks)
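To make the “structure retainment” distinction concrete, here is a minimal illustration (my own sketch, not taken from any of the listed tools) of what a plain-text-only extractor does to a structured article fragment:

```python
from html.parser import HTMLParser

class PlainTextExtractor(HTMLParser):
    """Flattens HTML to plain text, discarding paragraph, list,
    and hyperlink structure along the way."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)

    def text(self):
        return " ".join(self.chunks)

article = ("<p>First paragraph with a <a href='/x'>link</a>.</p>"
           "<ul><li>item one</li><li>item two</li></ul>")

p = PlainTextExtractor()
p.feed(article)
# The paragraph boundaries, the list items, and the hyperlink
# target are all gone; only the raw text survives.
print(p.text())
```

An extractor that retains structure would instead keep the `<p>`, `<ul>`/`<li>`, and `<a>` markup around the same text.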
| Software | Structure retainment | Inner content cleaning | Implementation | Source parameter | Language dependency | Additional features and remarks |
|---|---|---|---|---|---|---|
| Boilerpipe | plain text only | uses a classifier to determine whether an atomic text block holds useful content | open-source Java library | fetch documents yourself or use built-in utilities to fetch them for you | should be language-independent, since the text-block classifier observes language-independent text features | implements many extractors with different classification rules trained on different datasets |
| Alchemy API | text only (with an option to include relevant hyperlinks) | n/a | commercial web API | include the whole document in the POST request or provide a URL | observation: returns an error for non-English content, e.g. “unsupported text language” | extra API call needed to extract the title |
| Diffbot | plain text or HTML | option to remove inline ads | web API (private beta) | fetches the document for you via a provided URL | n/a | extracts relevant media, title, tags, an XPath descriptor for wrappers, comments and comment count, and an article summary |
| Goose | plain text | n/a | open-source Java library | URL only (my fork lets you fetch the document yourself) | language-independent, but relies on language-dependent regular expressions to match id and class labels | uses hardcoded heuristics to search for related images and embedded media |
| Extractiv | depends on the chosen output format, e.g. the XML format breaks the content into paragraphs | n/a | commercial web API | include the whole document in the POST request or provide a URL | n/a | can enrich the extracted text with semantic entities and relationships |
| Repustate API | plain text | n/a | commercial web API | URL only | n/a | n/a |
| Webstemmer | plain text | n/a | open-source Python library | first runs a crawler to obtain seed pages, then learns layout patterns that are later used to extract article content | language-independent | the only piece of software on this list that requires a cluster of similar documents obtained by crawling |
| NCleaner (paper) | plain text | uses character-level n-grams to detect content text blocks | open-source Perl library | arbitrary HTML document | depends on the training language | relies on the lynx browser to convert HTML to structured plain text |
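Boilerpipe’s central idea — classifying each atomic text block by shallow, language-independent features such as word count and link density — can be sketched roughly as follows. This is a toy heuristic of my own, not Boilerpipe’s actual trained classifier, and the threshold values are illustrative assumptions:

```python
def link_density(text: str, linked_words: int) -> float:
    """Fraction of the block's words that sit inside hyperlinks."""
    words = text.split()
    return linked_words / len(words) if words else 1.0

def is_content(text: str, linked_words: int = 0) -> bool:
    """Toy rule: long blocks with few linked words look like article
    content; short, link-heavy blocks (menus, 'related posts' widgets)
    look like boilerplate. Thresholds are arbitrary for illustration."""
    words = text.split()
    return len(words) >= 10 and link_density(text, linked_words) < 0.3

# (block text, number of words inside links)
blocks = [
    ("Home | News | Sports | Contact", 5),  # navigation menu
    ("The committee voted on Tuesday to approve the new budget "
     "after a lengthy debate.", 0),          # article sentence
    ("Read more: related article here", 5),  # inline boilerplate
]
kept = [text for text, linked in blocks if is_content(text, linked)]
```

Because these features are purely structural rather than lexical, a classifier built on them works across languages — which is exactly the claim made for Boilerpipe in the table above.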
Cells marked “n/a” reflect gaps in the sources: the table was built by inspecting each tool’s documentation or research papers, and for those features no information was available.