About
I'm Tomaž Kovačič, a Backend Software Engineer at Zemanta with substantial experience building web-based applications on various technology stacks.
My thesis research was dedicated to web content extraction algorithms, where I conducted a comprehensive study of existing solutions in this sparse field.
Categories
- text extraction (5)
Category Archives: text extraction
Evaluating Text Extraction Algorithms
UPDATE 11/6/2011: Added the summary and the results table. Lately I’ve been working on evaluating and comparing algorithms capable of extracting useful content from arbitrary HTML documents. Before continuing I encourage you to go through some of my previous posts, just to … Continue reading
Feature-wise Comparison of HTML Article Text Extractors
In one of my previous posts I compiled quite a decent list of software (and other resources) all capable of extracting article content from an arbitrary HTML document. While I was gathering all the relevant papers and software I kept … Continue reading
Posted in text extraction
Tagged comparison, information retrieval, semantic web, software, text extraction, web api
Evaluation Metrics for Text Extraction Algorithms
In my two previous posts (both were featured on Hacker News, ReadWriteWeb and O’Reilly Radar) I’ve covered quite a decent array of text extraction methods and related software. So before reading this one I encourage you to read them to get … Continue reading
Posted in text extraction
Tagged evaluation, information retrieval, metrics, text extraction
List of resources: Article text extraction from HTML documents
UPDATE 21/3/2011: Added reader-contributed links to the software and API section. Following up on my overview of article text extractors, I’ll try to compile a list of research papers, articles, web APIs, libraries and other software that I encountered during … Continue reading
Overview: Extracting article text from HTML documents
In the world of web scraping, text mining and article reading utilities (such as the Readability bookmarklet) there is an ever-growing demand for utilities that are capable of distinguishing parts of an HTML document which represent an article apart from other common … Continue reading
Posted in text extraction