Evaluation Metrics for Text Extraction Algorithms

In my two previous posts (both featured on Hacker News, ReadWriteWeb and O'Reilly Radar) I covered a fairly wide range of text extraction methods and related software, so before reading this one I encourage you to read them to get a better feel for the topic. I bet those of you who have read both posts are eager to find out which method performs best. Before getting into performance and accuracy evaluation, though, we first have to lay the groundwork by defining evaluation metrics.

Precision, Recall & F1-score

Both precision and recall are widely used metrics in various fields of study, with classification, pattern recognition and information retrieval being the most notable. When evaluating text extraction algorithms, we feel most comfortable using their information retrieval interpretation.

Let's pretend for a while that we are evaluating a search engine that retrieves a set of documents for a given search term. Using this generic example, we will review the formal definitions of precision and recall in their information retrieval context.

Formally precision and recall are defined as:
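
\[
\text{precision} = \frac{|S_{ret} \cap S_{rel}|}{|S_{ret}|}
\qquad
\text{recall} = \frac{|S_{ret} \cap S_{rel}|}{|S_{rel}|}
\]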

where $S_{ret}$ denotes the set of retrieved documents and $S_{rel}$ the set of documents relevant to the provided search term.

This interpretation of both measures can easily be applied to text extraction. Instead of retrieved and relevant documents for a given search term, we now have the following situation: given an HTML document, we predict/extract/retrieve a set of words and denote it as $S_{ret}$. Consequently, the set of words of the relevant text is denoted as $S_{rel}$.

Formal definitions aside, we must now explore how to construct these sets of words. There are many representations for text documents, but not all are suited for determining such sets. The most common model for representing text documents in NLP, the bag-of-words model, has a tendency to produce biased results when used to construct the sets of relevant and retrieved words. Since BOW discards all grammatical information and even the word order of the original text, we have no way of determining whether multiple instances of a retrieved word are part of the article content or part of the website's boilerplate structure (navigation, menus, related articles ...).

A quick observation leads us to an improved model (for representing retrieved and relevant text) that retains word order - the array-of-words. This model is constructed by the following procedure (a short sketch in Python follows the list):

  1. remove punctuation
  2. split the text into an array of words using whitespace as a delimiter
  3. lowercase all words
  4. (optional) replace all numbers and dates with neutral tokens
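
A minimal sketch of this procedure (the regular expression and the <num> placeholder token are illustrative choices, not prescribed by the procedure above):

```python
import re

def array_of_words(text, normalize_numbers=False):
    """Turn raw text into an array of normalized word tokens."""
    # 1. remove punctuation (keep word characters and whitespace)
    text = re.sub(r"[^\w\s]", " ", text)
    # 2. split the text into words using whitespace as a delimiter
    # 3. lowercase all words
    words = [w.lower() for w in text.split()]
    # 4. (optional) collapse numbers into a neutral token
    if normalize_numbers:
        words = ["<num>" if w.isdigit() else w for w in words]
    return words
```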

Calculating the denominator of both precision and recall using this model is trivial. But what about the numerator - the intersection between the retrieved and relevant segments of text? The array-of-words model allows us to make use of delta-calculation procedures. The idea is to compute a list of the longest contiguous subsequences of matching words within the two sequences of words.

Luckily for us Python programmers, there is a module in the standard library that provides helpers for comparing sequences. Here is a quick example:
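
A minimal reconstruction, assuming the module in question is difflib and its SequenceMatcher class; the word arrays are made up for illustration:

```python
import difflib

# Gold-standard article text vs. text returned by a hypothetical extractor,
# both already converted to arrays of words.
relevant  = "the quick brown fox jumps over the lazy dog".split()
retrieved = "home news the quick brown fox jumps over a lazy dog".split()

matcher = difflib.SequenceMatcher(None, retrieved, relevant)

# Sum the lengths of the longest contiguous runs of matching words.
overlap = sum(block.size for block in matcher.get_matching_blocks())

precision = overlap / len(retrieved)  # |S_ret ∩ S_rel| / |S_ret|
recall    = overlap / len(relevant)   # |S_ret ∩ S_rel| / |S_rel|

print(precision, recall)
```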

With precision and recall in the bag, we now review a measure that combines both. F1-score is the harmonic mean of precision and recall and is computed using the following equation:
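
\[
F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
\]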

F1-score is derived from a generalized F-beta-measure equation:
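
\[
F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}
\]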

Wikipedia has a quote that perfectly articulates the relationship between beta, precision and recall in the F-beta-measure equation:

Fβ measures the effectiveness of retrieval with respect to a user who attaches β times as much importance to recall as precision

Levenshtein Edit Distance

Unlike precision and recall, Levenshtein distance comes from the world of information theory. It is used to determine the number of edit operations needed to transform one string into another. The edit operations in the original version of Levenshtein distance are insertion, deletion and substitution of a single character.

We could employ this measure to compare the retrieved and relevant text represented as single strings. The downfall: it is computationally intensive and produces biased results due to whitespace noise. So rather than using the single-string representation, we use the array-of-words to represent both texts and then count word-level edit operations to compute their mutual similarity.
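
A minimal sketch of a word-level Levenshtein distance (a plain dynamic-programming implementation; the similarity normalization at the end is just one possible choice):

```python
def word_edit_distance(a, b):
    """Levenshtein distance between two arrays of words, counting
    insertions, deletions and substitutions of whole words."""
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        curr = [i] + [0] * len(b)
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr[j] = min(prev[j] + 1,         # delete word_a
                          curr[j - 1] + 1,     # insert word_b
                          prev[j - 1] + cost)  # substitute word_a with word_b
        prev = curr
    return prev[-1]

def word_similarity(a, b):
    """Normalize the distance into a similarity score between 0 and 1."""
    return 1.0 - word_edit_distance(a, b) / max(len(a), len(b), 1)
```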

A fairly similar technique was put into action during the CLEANEVAL competition, which was dedicated to the competitive evaluation of cleaning arbitrary web pages. Their scoring was based on:

Levenshtein edit distance: the smallest number of ‘insert word’ and ‘delete word’ steps to get from the one text to the other. Prior to calculating edit distance, we normalised both files by lowercasing, deleting punctuation and other non-alphanumeric characters and normalising whitespace.

My Ideas & Conclusion

The metrics discussed above are based solely on text, and that's fine. But what if you're also interested in the retained DOM structure of the content and other related attributes such as links, images and videos? Some implementations of text/content extractors have this capability, but it is widely ignored in research-based experimental evaluation setups, since they focus on pure text only.

Testing whether the retrieved content includes the relevant links, videos and images is somewhat trivial, but what about the retained DOM structure? Assuming text extractors retain only the basic blocks (paragraphs, lists, anchor tags) of the original structure, comparing them back to the cleaned relevant segment makes the problem even harder.

Challenged by this issue, I've been forging a technique to evaluate the correctness of the retained DOM structure. It roughly follows these guidelines (a rough sketch follows the list):

  • Flatten the DOM trees of the retrieved and relevant segments
  • Compare the flattened DOM trees with a tree edit distance. Analogous to the Levenshtein distance, this measure computes the number of operations needed to transform one tree structure into the other. It was originally developed for comparing ASTs of source code to detect plagiarism, but was later applied to other areas of study (see chapter 4.2 in this paper)
  • Employ Levenshtein distance to compare the contents of each block
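
A rough sketch of the first two steps, assuming lxml for parsing and, for simplicity, reusing the word_edit_distance helper from the sketch above on the flattened tag sequences instead of a full tree edit distance (the tag whitelist and the example fragments are illustrative):

```python
from lxml import html

BASIC_BLOCKS = {"p", "ul", "ol", "li", "a"}  # illustrative set of retained tags

def flatten_dom(fragment):
    """Flatten an HTML fragment into a pre-order sequence of basic block tags."""
    tree = html.fragment_fromstring(fragment, create_parent="div")
    return [el.tag for el in tree.iter() if el.tag in BASIC_BLOCKS]

retrieved = flatten_dom("<p>Intro</p><p>Body with a <a href='#'>link</a></p>")
relevant  = flatten_dom("<p>Intro</p><ul><li>Item</li></ul>"
                        "<p>Body with a <a href='#'>link</a></p>")

# Compare the flattened structures, treating tag names as "words".
print(word_edit_distance(retrieved, relevant))
```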

Hopefully I'll be able to report on the success (or failure due to the expected time complexity) in one of my next posts. I'll also do my best to reveal some early evaluation results from my experimental setup using the metrics discussed in this post. So feel free to subscribe to my RSS feed or follow me on Twitter for more updates.