Evaluating Text Extraction Algorithms

UPDATE 11/6/2011: Added the summary and the results table

Lately I’ve been working on evaluating and comparing algorithms capable of extracting useful content from arbitrary html documents. Before continuing, I encourage you to read through some of my previous posts to get a better feel for what we’re dealing with: I’ve written a short overview, compiled a list of resources if you want to dig deeper, and made a feature-wise comparison of related software and APIs.

Summary

  • The evaluation environment presented in this post consists of 2 datasets with approximately 650 html documents each.
  • The gold standard of both datasets was produced by human annotators.
  • 14 different algorithms were evaluated in terms of precision, recall and F1 score.
  • The results show that the best open source solution is the boilerpipe library.
  • Commercial APIs included in the evaluation environment produced consistent results on both datasets. Diffbot and Repustate API performed best, while others follow very closely.
  • Readability’s performance is surprisingly poor and inconsistent between the two ports used in the evaluation setup.
  • Further work will include: adding more APIs and libraries to the setup, working on a new extraction technique and assembling a new dataset.

Evaluation Data

My evaluation setup consists of 2 fairly well known datasets:

  • The final cleaneval evaluation dataset, with 681 documents (created by the ACL web-as-corpus community).
  • The google news dataset, with 621 documents (harvested by the authors of the boilerpipe library).

The former was harvested from all sorts of web pages, including articles, blog posts, mailing list archives, forum conversations, dedicated form submission pages etc. The latter was gathered by scraping the google news stream over a longer period of time (circa 6 months). The final dataset of 621 documents from 408 different news web sites is a random sample drawn from a larger set of 250k documents assembled during the scraping process.

From their description and related documentation we can conclude that they represent 2 somewhat distinct domains: the google news dataset represents the specific domain of news articles, while the cleaneval dataset represents a cross-domain collection of documents.

The gold standard counterpart of both datasets was created by human annotators who assessed each document individually. The annotators who assembled the cleaneval gold standard first used the lynx text-based browser to filter out all the non-text building blocks. Each annotator then cleaned out the redundant non-content text based on their own visual interpretation of the website. The annotators of the google news dataset, on the other hand, inserted additional span tags (using class names as labels) into the original html document. The class names of the span tags indicate the following content labels: headline, main text, supplemental text, user comments, related content. In my experimental setup I use only the content labeled as headline and main text as the gold standard.
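As a rough illustration of how the headline and main text could be pulled out of such an annotated document, here is a minimal sketch using only the Python standard library. The class names "headline" and "main-text" are hypothetical placeholders (the dataset uses its own labels), and the sketch treats any text nested inside a labeled span as part of the gold standard, which is an assumption rather than the procedure used in this setup.

```python
from html.parser import HTMLParser

# Hypothetical class names; substitute the labels actually used by the dataset.
GOLD_CLASSES = {"headline", "main-text"}


class GoldTextParser(HTMLParser):
    """Collects text that sits inside <span> tags carrying a gold label."""

    def __init__(self, gold_classes=GOLD_CLASSES):
        super().__init__(convert_charrefs=True)
        self.gold_classes = set(gold_classes)
        self.stack = []    # one bool per open <span>: does it carry a gold label?
        self.chunks = []   # collected text fragments

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            classes = (dict(attrs).get("class") or "").split()
            self.stack.append(any(c in self.gold_classes for c in classes))

    def handle_endtag(self, tag):
        if tag == "span" and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Keep the text if any enclosing span is labeled as gold content.
        if any(self.stack):
            self.chunks.append(data)


def gold_text(html):
    """Return the gold standard text with whitespace collapsed."""
    parser = GoldTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())
```

Text outside labeled spans (navigation, comments, related content) is simply ignored, so the result can be compared directly against an extractor's output.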

As you might have noticed, these datasets are not as large as we would like them to be, and there is a fairly good reason behind that: assessing and labeling html documents for content is a tedious task. So was the preprocessing of the cleaneval dataset on my part. The methodology of storing raw html documents for the cleaneval shared task included inserting a special <text> tag around the whole html markup to hold some meta data about the document itself. I had to remove these tags from all documents and insert the basic <html> and <body> tags into those that were obviously stripped of some leading and trailing boilerplate html markup. The reason for inserting such tags back into the raw html document is that some extraction software implementations assume they exist.
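The preprocessing step described above can be sketched roughly as follows. This is not the script used for the evaluation: the function name is made up, the sketch assumes the <text ...> wrapper is the very first and very last thing in the file, and it only re-wraps documents that lack an <html> tag altogether.

```python
import re

# Assumption: the wrapper opens the file as <text ...attributes...> and closes it as </text>.
TEXT_OPEN = re.compile(r"\A\s*<text\b[^>]*>", re.IGNORECASE)
TEXT_CLOSE = re.compile(r"</text\s*>\s*\Z", re.IGNORECASE)
HAS_HTML = re.compile(r"<html\b", re.IGNORECASE)
HAS_BODY = re.compile(r"<body\b", re.IGNORECASE)


def prepare_cleaneval(raw):
    """Strip the <text> wrapper and restore missing <html>/<body> tags."""
    doc = TEXT_OPEN.sub("", raw, count=1)
    doc = TEXT_CLOSE.sub("", doc, count=1)
    if not HAS_HTML.search(doc):
        # The document lost its outer markup; add a minimal skeleton back.
        if not HAS_BODY.search(doc):
            doc = "<body>" + doc + "</body>"
        doc = "<html>" + doc + "</html>"
    return doc
```

Documents that already contain an <html> tag pass through untouched, so the function can be applied to the whole collection without checking each file first.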

Metrics

I’m using precision, recall and f1 score to evaluate the performance.  I covered these in one of my previous posts, but here is a short recap:

Definitions:

Given an HTML document, we predict/extract/retrieve a sequence of words and denote it as S_ret. Correspondingly, the sequence of words representing the relevant text is denoted as S_rel.

Both sequences are constructed by the following procedure:

  1. Remove any sort of remaining inline html tags
  2. Remove all punctuation characters
  3. Remove all control characters
  4. Remove all non ascii characters (due to unreliable information of the document encoding)
  5. Normalize to lowercase
  6. Split on whitespace
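The six steps above can be sketched in a few lines of Python. This is an approximation rather than the exact code behind the evaluation: the function name is made up, tags are replaced by a space so that neighbouring words don't fuse, "punctuation" is taken to mean ASCII punctuation, and whitespace control characters (tab, newline, ...) are kept until the final split.

```python
import re
import string

TAG_RE = re.compile(r"<[^>]+>")
# Control characters except the whitespace ones (\t \n \v \f \r), which step 6 needs.
CONTROL_RE = re.compile(r"[\x00-\x08\x0e-\x1f\x7f]")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def to_word_sequence(text):
    """Turn extracted or gold standard text into a normalized list of words."""
    text = TAG_RE.sub(" ", text)                       # 1. leftover inline tags
    text = text.translate(PUNCT_TABLE)                 # 2. punctuation
    text = CONTROL_RE.sub("", text)                    # 3. control characters
    text = text.encode("ascii", "ignore").decode("ascii")  # 4. non-ascii characters
    text = text.lower()                                # 5. lowercase
    return text.split()                                # 6. split on whitespace
```

For example, to_word_sequence("<b>Hello</b>, World!") yields ['hello', 'world'].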

The intersection of the retrieved and the relevant sequence of words is calculated using the Ratcliff-Obershelp algorithm (python difflib).
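A minimal sketch of this computation with difflib.SequenceMatcher is shown below. The helper names are made up, and the formulas are the standard ones (precision = matched words / retrieved words, recall = matched words / relevant words, F1 = their harmonic mean), which I assume match the definitions from the earlier post. Degenerate inputs are rejected here instead of being scored.

```python
from difflib import SequenceMatcher


def matched_words(retrieved, relevant):
    """Number of words in the matching blocks found by difflib."""
    matcher = SequenceMatcher(None, retrieved, relevant, autojunk=False)
    return sum(block.size for block in matcher.get_matching_blocks())


def precision_recall_f1(retrieved, relevant):
    """Score one document; both arguments are lists of normalized words."""
    if not retrieved or not relevant:
        raise ValueError("empty sequence: degenerate case, handle separately")
    matched = matched_words(retrieved, relevant)
    if matched == 0:
        raise ValueError("no intersection: degenerate case, handle separately")
    precision = matched / len(retrieved)
    recall = matched / len(relevant)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Setting autojunk=False switches off difflib's heuristic that ignores very frequent items in long sequences, which would otherwise drop common words from the match on large documents.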

As we’re dealing with raw data, we have to account for 4 special cases of these metrics:

Precision  Recall  F1-score  Case
0          0       inf       Mismatch – both retrieved and relevant sequences are not empty, but they don’t intersect.
inf        0       nan       Set of retrieved words is empty – the extractor predicts that the observed document does not contain any content.
0          inf     nan       Set of relevant words is empty – the document itself does not contain anything useful.
inf        inf     nan       Both the retrieved and the relevant set are empty – the document does not contain anything useful and the extractor comes back empty-handed.
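One way to keep these cases apart from regular measurements is to label every per document measurement before computing any scores. The sketch below does only the labeling; the function and the label strings are made up for illustration and are not taken from the evaluation code.

```python
def classify_measurement(retrieved, relevant, matched):
    """Label one measurement.

    retrieved, relevant: lists of normalized words
    matched: number of words shared by both sequences
    """
    if not retrieved and not relevant:
        return "both_empty"        # nothing useful, and nothing extracted
    if not retrieved:
        return "retrieved_empty"   # extractor came back empty-handed
    if not relevant:
        return "relevant_empty"    # gold standard has no content
    if matched == 0:
        return "mismatch"          # both non-empty, but no intersection
    return "successful"
```

Only measurements labeled "successful" have well-defined precision, recall and F1 values, so the other four labels can be counted and reported separately.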

Results

Currently my evaluation setup includes the following content extractors. All of them were either reviewed or mentioned in my preceding blog posts, so I won’t get into the details of every single one in this writeup:

  • Boilerpipe – using two similar variants; the default extractor and the one especially tuned for article type of content.
  • NCleaner – again using two of its variants; the non-lexical, language-independent character n-gram model and the model trained on english corpora.
  • Readability – using its python port and readability ported to node.js on top of jsdom.
  • Pasternack & Roth algorithm (MSS) – authors provided me with access to the implementation presented in the www2009 paper.
  • Goose – using my fork to expose the core content extraction functionality.
  • Zextractor – internally used service developed by my friends/mentors at Zemanta.
  • Alchemy API – using their text extraction API.
  • Extractiv – using their semantic on demand RESTful API.
  • Repustate API – using the clean html API call.
  • Diffbot – using the article API.

Admittedly, they are not all specialized for solving problems in a specific domain, say news article content extraction, but they all share one common property: they’re capable of telling useful textual content apart from boilerplate and useless content when observing an arbitrary html document.

Results were obtained by calculating precision, recall and F1 score for each document in the dataset for every content extractor. Then we calculate the arithmetic mean of all three metrics for every extractor. Bar charts are employed to visualize the results.
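The averaging step could look like the sketch below, assuming the per document scores of one extractor are available as (precision, recall, f1) tuples; the function name is made up. Note that the F1 mean is the average of the per document F1 scores, not the F1 of the averaged precision and recall, which is why the F1 column in the results cannot be derived from the other two columns.

```python
from statistics import fmean


def mean_scores(per_document):
    """Average per document (precision, recall, f1) tuples for one extractor."""
    if not per_document:
        raise ValueError("no measurements to average")
    precisions, recalls, f1s = zip(*per_document)
    return fmean(precisions), fmean(recalls), fmean(f1s)
```

For two documents scored (1.0, 0.5, 2/3) and (0.5, 1.0, 2/3), the means are 0.75, 0.75 and roughly 0.667.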

Precision, recall & F1 score mean - Google news dataset

Precision, recall & F1 score mean - Cleaneval dataset

Cleaneval           precision  recall  f1 score
Boilerpipe DEF      0.931      0.856   0.872
Boilerpipe ART      0.949      0.764   0.804
Goose               0.934      0.719   0.770
MSS                 0.911      0.699   0.718
Python Readability  0.841      0.833   0.803
Node Readability    0.704      0.878   0.753
Alchemy API         0.950      0.828   0.854
Diffbot             0.932      0.890   0.891
Extractiv           0.935      0.871   0.887
Repustate           0.940      0.889   0.896
Zextractor          0.916      0.763   0.800
NCleaner En         0.923      0.895   0.897
NCleaner NonLex     0.882      0.927   0.892

Google news         precision  recall  f1 score
Boilerpipe DEF      0.863      0.938   0.887
Boilerpipe ART      0.931      0.955   0.939
Goose               0.934      0.887   0.901
MSS                 0.907      0.882   0.875
Python Readability  0.638      0.825   0.681
Node Readability    0.688      0.968   0.789
Alchemy API         0.936      0.876   0.888
Diffbot             0.924      0.968   0.935
Extractiv           0.824      0.956   0.870
Repustate           0.917      0.907   0.898
Zextractor          0.850      0.806   0.803
NCleaner En         0.742      0.947   0.812
NCleaner NonLex     0.613      0.959   0.723

Next we explore the distribution of per document measurements. Instead of just calculating the mean of precision, recall and F1 score we can explore the frequencies of per document measurements. In the following chart we visualize frequencies for each metric for all extractors. Notice that we’re splitting the [0,1] interval of each metric into 20 bins with equidistant intervals.
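The binning itself is simple enough to sketch without a plotting library. The function below is illustrative only (the name is made up) and counts how many per document values of one metric fall into each of the 20 equally wide bins, with a value of exactly 1.0 assigned to the last bin.

```python
def metric_histogram(values, bins=20):
    """Count values from [0, 1] in `bins` equally wide intervals."""
    counts = [0] * bins
    for value in values:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"value outside [0, 1]: {value!r}")
        # int() floors non-negative numbers; clamp 1.0 into the last bin.
        index = min(int(value * bins), bins - 1)
        counts[index] += 1
    return counts
```

The resulting list can be handed to any bar chart routine, one bar per bin.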

Metric distributions - Google news dataset

Metric distributions - Cleaneval dataset

To account for previously mentioned 4 special cases, we employ a stacked bar chart to inspect the margins of useful/successful per document measurements for every dataset.

Special cases of per document measurements - Cleaneval dataset

Special cases of per document measurements - Google news dataset

Every per document measurement can fall into one of the “special cases” category, “successful” category, or an extra “failed” category. The latter stands for measurement instances that failed due to: parsing errors, implementation specific errors, unsupported language errors etc. The right hand side of both figures depicts the same categories without the “successful” part.

This chart is important because we only make use of measurements that fall under the “successful” category to obtain the distribution and the mean of each metric.

Observations

The foremost observation is the varying performance of NCleaner outside the cleaneval domain, since both variants seem to perform poorly on the google news collection. The cause of such behavior might be the likelihood that NCleaner was trained on the cleaneval corpus.

Readability’s poor performance came as a surprise, and so did the difference in results between the two ports. Relatively low precision and high recall indicate that readability tends to include large portions of useless text in its output. I’m not quite satisfied with how readability is represented in this experiment, as I’m not making use of the original implementation. The node.js port seems to differ from the original less than the python port does, but I still worry I’ll have to use the original implementation on top of a headless webkit browser engine like PhantomJS to get a better representation.

Notice the consistent performance of commercial APIs (and zemanta’s internal service), Alchemy API and Diffbot in particular. According to diffbot’s co-founder Michael Tung, their article API is only a small portion of their technology stack. They’re using common visual features of websites to train models in order to understand various types of web pages (forum threads, photo galleries, contact info pages and about 50 other types). The article API used in this setup is just an additionally exposed model, trained on visual features of article type pages. On the other hand, zextractor (zemanta’s internal service) relies on libsvm to classify atomic text blocks of the newly observed document.

What I find especially interesting are the bimodal distributions of precision and/or recall for MSS and readability’s python port. I suspect that they either fail to produce relevant content on really big documents or are capable of extracting only a limited amount of content. It’ll be interesting to explore this phenomenon with some additional result visualizations.

Conclusion and Further Work

According to my evaluation setup and personal experience, the best open source solution currently available is the boilerpipe library. If we treat precision with an equal amount of importance as recall (reflected in the F1 score) and take into account the performance consistency across both domains, then boilerpipe performs best. Performance aside, its codebase seems to be quite stable and it works really fast.

If you’re looking for an API and go by the same criteria (equal importance of precision and recall, performance consistency across both domains), diffbot seems to perform best, although alchemy, repustate and extractiv follow closely. If speed plays a greater role in your decision making process, Alchemy API seems to be a fairly good choice, since its response time can be measured in tenths of a second, while the others rarely finish in under a second.

My further work will revolve around adding new software to my evaluation environment (I was recently introduced to the justext tool), compiling a new (larger) dataset and working on my own project, which will hopefully yield some competitive results.

Stay tuned: I’ll be releasing all the data and code used in this evaluation setup in the near future.

  • Venkat
  • http://twitter.com/jplehmann John Lehmann

    I’m grateful for your in-depth article Tomaz! As with diffbot, text “cleaning” is a non-central part of LCC/Extractiv’s tech stack, so I’m pleased to see it rank in 3rd in terms of average Fm (along with Boilerpipe). And, if one disabled all the other services it’s providing through the API (entities, relations, entity linking, topics,…) it would easily come back with the sub-second responses you noted. Thanks for including us, and I look forward to how we can improve still further based on these results!

  • http://fredeaker.com/2011/06/text-extraction-and-file-type-detection/ Text Extraction and File Type Detection – Fred Eaker | Fred Eaker

    [...] see what tools are available now that would have been part of the evaluation process. I also like Tomaž Kovačič’s thorough explanation of his testing methods and results. This entry was posted in Data Analytics and Visualization. Bookmark the [...]

  • http://tomazkovacic.com/blog Tomaž Kovačič

    Thank you Christian. I’ve already read your thesis and found it quite useful. 

  • http://tomazkovacic.com/blog Tomaž Kovačič

    Thanks John.

    Just wondering … have you considered making a separate API call that would do just the extraction part? 

  • Darin

    Search engines (e.g. Autonomy, Fast, Nutch) seem to have built in html to text extraction capabilities. Recently faced with a need to extract text from a large number of company homepages, I utilized Nutch to crawl and spit out the text of these pages. It appeared to be fairly effective, however, it would be interesting to have a sense of how the Nutch text extraction compares to the tools you measured. Any thoughts on this?

  • http://robbie.robnrob.com/2011/06/goose-wins-2nd-place-in-text-extraction/ Goose Wins 2nd Place in Text Extraction » Robbie Coleman

    [...] graph below from Tomaž Kovačič‘s study shows only a small amount of the data he collected in his analysis. If you are curious of how he [...]

  • http://robbie.robnrob.com/ Robbie Coleman

    Thank you for including Goose in your analysis! This provides us with new insight into the project, and your deep analysis and detailed results are great assets to the furthering of our efforts.

  • http://www.quora.com/Whats-the-best-method-to-extract-article-text-from-HTML-documents Quora

    What’s the best method to extract article text from HTML documents?…

    I’ve actually written a blogpost and summarized some of your answers: http://tomazkovacic.com/blog/56/list-of-resources-article-text-extraction-from-html-documents/ Update: Here is a performance comparison of 13 libraries and services http://tomazkova...

  • http://tomazkovacic.com/blog Tomaž Kovačič

    That’s good news. Best of luck on improving goose. 

  • http://tomazkovacic.com/blog Tomaž Kovačič

    Interesting question. I’ll have to inspect Nutch to see if its extraction capability could be exposed as a service of some sort. 

    It would be very helpful if you could provide me with a link to the relevant parts of the code that do this for Nutch. 

  • Darin

    My experience with Nutch has been to use its text extraction capabilities indirectly, by crawling home pages with this command

    bin/nutch crawl urls -dir crawl -depth 1 >& crawl/crawl.log

    then dumping the resulting text via this command

    bin/nutch readseg -dump C:/nutch/apache-nutch-1.1-bin/crawl/segments/20110526204856 crawl/dump -nocontent -nofetch -nogenerate -noparse -noparsedata

    However, I am pretty sure you can utilize the Nutch text extraction capabilities directly via its java API, possibly utilizing the HtmlParser class – http://nutch.apache.org/apidocs-1.1/index.html.

    Note that it looks like Nutch is moving to utilizing Tika for this functionality and as such it might be easier to directly utilize Tika. I don’t have much experience with Tika so can only point you to the Tika homepage – http://tika.apache.org/.

  • Darin

    FYI, I took a closer look at Tika and you can download ‘tika-app-0.9.jar’ then run simple commands like this to extract text from a web page.

    java -jar tika-app-0.9.jar -t http://www.zibb.com > zibbout.txt

    http://tika.apache.org/0.9/gettingstarted.html (see the ‘using Tika as a command line utility’ section).

  • http://twitter.com/jplehmann John Lehmann

    Letting the user specify which services to run through the input is a great idea, and we do have it on our list. I think it will probably get added in the fall if not sooner.

  • http://tomazkovacic.com/blog Tomaž Kovačič

    Thanks. I’ll give it a try. 

  • http://enile8.journalgin.com/2011/19 Document Clustering with Python – enile8

    [...] this problem yet, but I ran across Tomaž Kovačič blog the other day and he was covering which data extraction algorithm actually worked best and the different situations in which they fell down (very helpful post at the [...]

  • http://enile8.com/?p=22 Finding the Main Webpage Content | enile8

    [...] apis. The other day Tomaž Kovačič wrote a blog post outlining the performance of the different data extraction algorithm’s and which situations they excelled in and which they didn’t.Below is a very simple example of [...]

  • http://tm.durusau.net/?p=11765 Evaluating Text Extraction Algorithms « Another Word For It

    [...] Evaluating Text Extraction Algorithms [...]

  • Sharmila G Sivakumar

    Hi Tomaz,
        Thank you very much, this is a great evaluation of text extraction tools and will be of great use.  Actually there are 2 python ports for readability.  python-readability and decruft.  Which one did you use?  I worked on decruft.  This would be a great opportunity for me to improve it.

  • http://tomazkovacic.com/blog Tomaž Kovačič

    The former - https://github.com/gfxmonk/python-readability 

  • http://daniel.hepper.net/blog Daniel Hepper

    From my understanding, Tika can extract plaintext and metadata from multiple formats, but doesn’t do any pruning. Just compare the output of e.g. Goose to Tika.

  • Jan Pomikálek

    Hi Tomaž,

    thanks for the evaluation. I’m the author of jusText which you reference in your conclusion. (Now I know where the jusText website visitors come from. :) ) I would be very interested to see the results of my tool in your evaluation environment. Have you had time to try it out already?

    If you would like to compare with my own evaluation you can check my PhD thesis (http://is.muni.cz/th/45523/fi_d/phdthesis.pdf). I used the same data sets as you (plus one more), but mostly different list of algorithms.

    Feel free to contact me if you need any support with using jusText.

    Cheers,
    Jan

  • Sarav

    Thanks for this fantastic post,

    Could you please throw some light on Readability as a standalone server side implementation?

    Saravanan

  • http://tomazkovacic.com/blog Tomaž Kovačič

    Hey Jan, thanks for linking your thesis. I found it to be a crucial resource for my further work. I’ll get back to you via email for jusText evaluation. 

  • http://www.facebook.com/tonyx Tony Xiao

    Hey Tomaz, 

    Do you have any ETA on when you plan to release the data you used for this evaluation setup?

  • Ilya Klintsov

    Hi, thanks for great article. Is it possible to get the gold standard documents used for this analysis? We have our own text extraction tool and I would like to compare the quality of extraction…

  • Omarshariffdontlikeit

    Any progress on evaluating JusText? I’m keen to port a boilerplate detection algo to Go, and JusText (in my simple experiments) has produced the best results for me.
