Feature-wise Comparison of HTML Article Text Extractors
19 Apr 2011
In one of my previous posts I compiled a fairly comprehensive list of software (and other resources) capable of extracting article content from an arbitrary HTML document. While gathering the relevant papers and software, I kept a hand-maintained spreadsheet comparing the listed software feature by feature. I have now updated it and decided to post it on my blog. Hopefully this comparison table will ease the decision-making process for developers whose products depend on such software.
First, let's review the features examined for each piece of software in the table:
- structure retainment - Articles are usually formatted using various HTML tags: paragraphs, lists, hyperlinks and others. Some text extractors remove this structure and yield only the plain text of the article.
- inner content cleaning - The article content is sometimes broken into non-consecutive text blocks by ads and other boilerplate structures. We're interested in capabilities to remove such inline boilerplate.
- implementation
- language dependency - Some are limited to only one language.
- source parameter - Can we fetch the document ourselves, or does the extractor fetch it internally?
- additional features (and remarks)
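To make the "structure retainment" distinction concrete, here is a minimal Python sketch. It is not taken from any of the extractors listed below: the names `ArticleRenderer`, `render` and `KEPT_TAGS` are hypothetical, and it assumes the article fragment has already been isolated from the page. It only shows the difference between yielding plain text and keeping paragraphs, lists and hyperlinks.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: the tags we consider "structure" worth retaining.
KEPT_TAGS = {"p", "ul", "ol", "li", "a"}


class ArticleRenderer(HTMLParser):
    """Renders an already-extracted article fragment, with or without structure."""

    def __init__(self, retain_structure):
        super().__init__(convert_charrefs=True)
        self.retain_structure = retain_structure
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.retain_structure and tag in KEPT_TAGS:
            href = dict(attrs).get("href")
            if tag == "a" and href:
                # Keep the link target, drop every other attribute.
                self.parts.append('<a href="%s">' % escape(href, quote=True))
            else:
                self.parts.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if self.retain_structure and tag in KEPT_TAGS:
            self.parts.append("</%s>" % tag)
        elif not self.retain_structure and tag in {"p", "li"}:
            # In plain-text mode a block boundary becomes a line break.
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def render(fragment, retain_structure):
    """Return the fragment as cleaned HTML or as plain text."""
    parser = ArticleRenderer(retain_structure)
    parser.feed(fragment)
    parser.close()
    return "".join(parser.parts).strip()
```

Calling `render(fragment, retain_structure=False)` corresponds to the "plain text" entries in the table, while `retain_structure=True` corresponds to extractors that keep the original markup around the article text.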
software | structure retainment | inner content cleaning | implementation | source parameter | language dependency | additional features and remarks |
Boilerpipe | plain text only | uses a classifier to determine whether or not the atomic text block holds useful content | open source java library | you can fetch documents by yourself or use built-in utilities to fetch them for you | should be language independent since the text block classifier observes language independent text features | implements many extractors with different classification rules trained on different datasets |
Alchemy API | text only (has an option to include relevant hyperlinks) | n/a | commercial web api | include the whole document in the POST request or provide a URL | observation: returns an error for non-English content (e.g. "unsupported text language") | extra API call to extract the title |
Diffbot | plain text or html | an option to remove inline ads | web api (private beta) | does fetching for you via provided url | n/a | extracts: relevant media, title, tags, xpath descriptor for wrappers, comments and comment count, article summary |
Readability | retains original structure | uses hardcoded heuristics to extract content divided by ads | open source javascript bookmarklet | via browser | language independent but it relies on language dependent regular expressions to match id and class labels | |
Goose | plain text | n/a | open source java library | url only (my fork enables you to fetch the document by yourself) | language independent but it relies on language dependent regular expressions to match id and class labels | uses hardcoded heuristics to search for related images and embedded media |
Extractiv | depends on the chosen output format - e.g. xml format breaks the content into paragraphs | n/a | commercial web api | include the whole document in the POST request or provide a URL | n/a | capable of enriching the extracted text with semantic entities and relationships |
Repustate API | plain text | n/a | commercial web api | url only | n/a | |
Webstemmer | plain text | n/a | open source python library | first runs a crawler to obtain seed pages, then it learns layout patterns that are later put to work to extract article content | language independent | the only piece of software on this list that requires a cluster of similar documents obtained by crawling |
NCleaner (paper) | plain text | uses character level n-grams to detect content text blocks | open source perl library | arbitrary html document | depends on the training language | reliant on lynx browser for converting html to structured plain text |
Some cells in the table are marked "n/a" because the table was built by inspecting each project's documentation or research papers, and those sources gave no information on the feature in question.
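The "source parameter" column matters in practice because an extractor that only accepts a URL takes fetching (timeouts, caching, authentication) out of your hands. The following Python sketch shows one way a wrapper could accept either source. The names `load_document` and `default_fetch` are hypothetical and do not belong to any library in the table; the sketch only illustrates the two input styles.

```python
from urllib.request import urlopen


def default_fetch(url):
    """Fetch a page and decode it using the charset announced by the server."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def load_document(html=None, url=None, fetch=None):
    """Return HTML text from exactly one source: a ready string or a URL.

    `fetch` lets the caller plug in their own downloader (assumption: it
    takes a URL and returns the document as a string).
    """
    if (html is None) == (url is None):
        raise ValueError("pass exactly one of html= or url=")
    if html is not None:
        # Caller fetched the document themselves.
        return html
    # The wrapper fetches the document internally.
    return (fetch or default_fetch)(url)
```

With a wrapper like this, a "url only" tool can still be compared fairly against one that accepts an arbitrary HTML document, since both end up receiving the same document text.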