PAPER REVIEW – by Mohit Gupta
Table Header Detection and Classification
Jing Fang, Prasenjit Mitra, Zhi Tang, C. Lee Giles
The goal of the authors is to detect tables in documents. Table styles and structures are highly diverse, and the authors assume that a table's style and structure are largely conveyed by its headers, i.e. the top rows and the leftmost columns of a generic table. The problem therefore reduces to table header detection and classification, which is the focus of the paper.
The paper assumes that data rows normally share similar data types, cell alignment, character spacing, font, size, etc., and that these characteristics differ from those of the header (or header rows, in the case of a multi-dimensional table). It does not consider other attributes that could differentiate header and data cells, such as background color and ruling lines.
First, a weighted average score analysis is used to calculate the similarity between each pair of consecutive rows. Several scores (font size score, string length score, overlap score, data type score, alignment score, etc.) are computed for corresponding cells of two adjacent rows, and their weighted mean defines the match score. The first local minimum from the top is marked as the separation between header and data, on the assumption that headers generally appear at the beginning.
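The heuristic described above can be sketched roughly as follows. This is only an illustration, not the paper's implementation: the cell representation (a dict with `text`, `font_size` and `align`), the three scores chosen, and the weight values in `WEIGHTS` are all assumptions, since the review does not state the scores' exact definitions or weights.

```python
from typing import Dict, List, Optional, Sequence

# Hypothetical weights; the actual values used in the paper are not given here.
WEIGHTS = {"font_size": 0.3, "data_type": 0.4, "alignment": 0.3}

Cell = Dict[str, object]  # assumed keys: "text", "font_size", "align"


def data_type(text: str) -> str:
    """Very coarse data type: 'numeric' for plain numbers, else 'text'."""
    stripped = text.replace(".", "", 1).replace("-", "", 1)
    return "numeric" if stripped.isdigit() else "text"


def cell_match(a: Cell, b: Cell) -> float:
    """Weighted mean of simple 0/1 scores for two vertically adjacent cells."""
    scores = {
        "font_size": 1.0 if a["font_size"] == b["font_size"] else 0.0,
        "data_type": 1.0 if data_type(str(a["text"])) == data_type(str(b["text"])) else 0.0,
        "alignment": 1.0 if a["align"] == b["align"] else 0.0,
    }
    total = sum(WEIGHTS[k] for k in WEIGHTS)
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS) / total


def row_similarity(row_a: Sequence[Cell], row_b: Sequence[Cell]) -> float:
    """Average match score over corresponding cells of two adjacent rows."""
    pairs = list(zip(row_a, row_b))
    if not pairs:
        return 0.0
    return sum(cell_match(a, b) for a, b in pairs) / len(pairs)


def first_local_minimum(values: Sequence[float]) -> Optional[int]:
    """Index of the first value lower than its left neighbour and not
    higher than its right neighbour (missing neighbours are ignored)."""
    for i, v in enumerate(values):
        left_ok = i == 0 or v < values[i - 1]
        right_ok = i == len(values) - 1 or v <= values[i + 1]
        if left_ok and right_ok:
            return i
    return None


def header_row_count(rows: List[Sequence[Cell]]) -> Optional[int]:
    """Number of leading rows treated as header by the heuristic."""
    sims = [row_similarity(rows[i], rows[i + 1]) for i in range(len(rows) - 1)]
    idx = first_local_minimum(sims)
    return None if idx is None else idx + 1
```

Note that a sketch like this always reports at least one header row whenever the table has two or more rows, which is one way such a heuristic can fall short.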
The results are found not to be satisfactorily accurate. To improve them, supervised learning algorithms are used to classify table content into header and data classes.
Row features such as row length, cell size, number of characters, fraction of numeric, alphabetic and symbolic characters, font size, font type (bold/italic), etc. are defined. Another set of scores, based on the comparison of neighboring rows, is also defined; it includes average content repetition (using Levenshtein distance), average alignment, average overlap proportion, etc. Analogously, features are defined for columns, and these are a subset of the row features. The paper uses three types of classifiers for separating header cells from data cells: an SVM (using LIBSVM), logistic regression (using R) and a random forest (using the Weka toolkit).
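To make a few of these features concrete, here is a minimal sketch that computes some single-row features and a content-repetition score for two neighboring rows. The function names, the feature subset, and in particular the normalisation of the Levenshtein distance (one minus distance divided by the longer string's length) are assumptions for illustration; the review does not give the paper's exact definitions.

```python
from typing import Dict, Sequence


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def row_features(cells: Sequence[str]) -> Dict[str, float]:
    """A few single-row features computed from the row's cell texts."""
    text = "".join(ch for cell in cells for ch in cell if not ch.isspace())
    n = len(text)
    numeric = sum(ch.isdigit() for ch in text)
    alphabetic = sum(ch.isalpha() for ch in text)
    symbolic = n - numeric - alphabetic
    return {
        "num_cells": float(len(cells)),
        "num_chars": float(n),
        "frac_numeric": numeric / n if n else 0.0,
        "frac_alphabetic": alphabetic / n if n else 0.0,
        "frac_symbolic": symbolic / n if n else 0.0,
    }


def content_repetition(row_a: Sequence[str], row_b: Sequence[str]) -> float:
    """Average normalised string similarity of corresponding cells
    (assumed definition: 1 - edit distance / longer length)."""
    pairs = list(zip(row_a, row_b))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        longest = max(len(a), len(b))
        total += 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
    return total / len(pairs)
```

Feature dictionaries of this kind could then be turned into numeric vectors and passed to any of the three classifiers mentioned above.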
CiteSeerX, a public scientific search engine and digital library, is used as the source of actual scientific PDF documents. TableSeerX, an existing table detection tool, is used to extract tables from the PDF documents. The paper does not consider other document media such as images, handwritten documents or web pages. Two samples of 200 documents each are selected at random, and the tables detected in error are removed manually, which leaves 135 tables in sample 1 and 120 in sample 2; these tables form the test sample. The proportion of each header type (one-dimensional when only a row or a column header exists, two-dimensional when both exist, and multi-dimensional) is determined for both samples. Tables are also classified by layout complexity as simple or complex, with complex tables further divided into those having multi-line cells, multi-level headers, long and folded tables, multi-dimensional tables and other irregular layouts. Both samples are found to have similar class proportions, from which it is concluded that the dataset is stable.
Among the machine learning methods, which are evidently more accurate than the heuristic method applied first, random forest outperforms the other two classifiers. The classifiers are compared on three measures: precision, recall and F-measure (the harmonic mean of precision and recall).
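For reference, the three measures can be computed from true and predicted labels as in the short sketch below. The label strings (`"header"`, `"data"`) and the function name are illustrative assumptions; the measures themselves follow the standard definitions.

```python
from typing import Sequence, Tuple


def precision_recall_f(y_true: Sequence[str],
                       y_pred: Sequence[str],
                       positive: str = "header") -> Tuple[float, float, float]:
    """Precision, recall and F-measure for one class treated as positive."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F-measure is the harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

Precision penalises data rows wrongly labelled as headers, while recall penalises headers that were missed; the F-measure is high only when both are high.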
The impact of the feature sets (the row feature set and the neighboring-row feature set) is tested on the random forest classifier. Using both feature sets together gives the most accurate results (97.4%). The InfoGain attribute evaluator of the Weka toolkit is used to identify the most effective features: number of characters, fraction of alphabetic characters, font size, average alignment and consistency of data type.
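Information gain ranks a feature by how much knowing its value reduces the entropy of the class label. The sketch below shows the idea for a discrete feature only; it is not Weka's code, the function names are hypothetical, and numeric features such as font size would first need to be discretised into bins.

```python
from collections import Counter
from math import log2
from typing import Dict, Hashable, List, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values: Sequence[Hashable],
                     labels: Sequence[Hashable]) -> float:
    """H(class) - H(class | feature) for a discrete feature."""
    n = len(labels)
    groups: Dict[Hashable, List[Hashable]] = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Entropy of the class within each feature value, weighted by frequency.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional
```

For example, a hypothetical "is bold" feature that is true for every header row and false for every data row would have the maximum gain, whereas a feature distributed identically across both classes would have a gain of zero.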