
PAPER REVIEW – by Mohit Gupta

Table Header Detection and Classification

Jing Fang, Prasenjit Mitra, Zhi Tang, C. Lee Giles

The authors' goal is to detect tables in documents. Table styles and structures vary widely, and the paper assumes that this structure is largely conveyed by the table headers, i.e. the top rows and leftmost columns of a generic table. The problem therefore reduces to table header detection and classification, which is the focus of the paper.

The paper assumes that data rows normally share a similar data type, cell alignment, character spacing, font, and font size, and that these characteristics differ from those of the header (or header rows, in the case of a multi-dimensional table). It does not consider other attributes that could distinguish header cells from data cells, such as background color and ruling lines.

First, a weighted-average score analysis is applied: font size, string length, overlap, data type, and alignment scores for corresponding cells of two adjacent rows are combined with a weighted mean into a match score, and this similarity is computed for every pair of consecutive rows. The first local minimum from the top is marked as the header/data separation, on the assumption that headers generally appear at the beginning of a table.
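As a rough sketch of this heuristic in Python (not the authors' implementation; the cell schema, the individual scoring rules, and the weights below are made-up placeholders), assuming each table row arrives as a list of cell dictionaries:

def cell_similarity(a, b):
    # Crude per-cell match score between vertically adjacent cells.
    # Assumed cell schema: dict with 'text', 'font_size' and 'alignment' keys.
    font = 1.0 if a["font_size"] == b["font_size"] else 0.0
    align = 1.0 if a["alignment"] == b["alignment"] else 0.0
    length = min(len(a["text"]), len(b["text"])) / max(len(a["text"]), len(b["text"]), 1)
    is_num = lambda s: s.replace(".", "", 1).replace("-", "", 1).isdigit()
    dtype = 1.0 if is_num(a["text"]) == is_num(b["text"]) else 0.0
    # Weighted mean of the individual scores; the weights are illustrative, not the paper's.
    return 0.3 * font + 0.2 * align + 0.2 * length + 0.3 * dtype

def row_similarity(row_a, row_b):
    # Average the cell-level scores over the aligned columns of two adjacent rows.
    n = min(len(row_a), len(row_b))
    return sum(cell_similarity(row_a[i], row_b[i]) for i in range(n)) / max(n, 1)

def header_boundary(rows):
    # Similarity of each consecutive row pair; the first local minimum from the top
    # is taken as the header/data separation, since headers are assumed to sit on top.
    sims = [row_similarity(rows[i], rows[i + 1]) for i in range(len(rows) - 1)]
    for i in range(len(sims)):
        lower_than_prev = i == 0 or sims[i] < sims[i - 1]
        lower_than_next = i == len(sims) - 1 or sims[i] < sims[i + 1]
        if lower_than_prev and lower_than_next:
            return i + 1          # rows[:i + 1] are header rows, data starts at i + 1
    return 1                      # fall back to a single header row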

The results of this heuristic are found not to be satisfactorily accurate, so supervised learning algorithms are used instead to classify table content into header and data classes.

Row features such as row length, cell size, number of characters, fractions of numeric, alphabetic, and symbolic characters, font size, and font type (bold/italic) are defined. A second set of features based on the comparison of neighboring rows is also defined, including average content repetition (measured using Levenshtein distance), average alignment, and average overlap proportion. Analogous features are defined for columns, which form a subset of the row features. The paper uses three classifiers to separate header cells from data cells: an SVM (using LibSVM), logistic regression (using R), and a random forest (using the Weka toolkit).
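A minimal sketch of the classification step, substituting scikit-learn's RandomForestClassifier for the Weka implementation used in the paper; the feature vector and the toy labelled rows below are illustrative, not the paper's full feature set or data:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def row_features(cells):
    # Single-row features: cell count, character count, and the fractions of
    # numeric, alphabetic and symbolic characters (a small subset of the paper's set).
    text = "".join(cells)
    n = max(len(text), 1)
    return [
        len(cells),
        len(text),
        sum(c.isdigit() for c in text) / n,
        sum(c.isalpha() for c in text) / n,
        sum((not c.isalnum()) and (not c.isspace()) for c in text) / n,
    ]

# Toy labelled rows (1 = header, 0 = data); in the paper, training data comes from
# annotated tables extracted by TableSeerX.
rows = [
    (["Country", "Population", "GDP"], 1),
    (["France", "67.8", "2.9"], 0),
    (["Germany", "83.2", "4.1"], 0),
    (["Year", "Revenue"], 1),
    (["2021", "1.2"], 0),
]
X = np.array([row_features(cells) for cells, label in rows])
y = np.array([label for cells, label in rows])

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict([row_features(["Name", "Age", "Score"])]))  # likely labelled as header (1)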

CiteSeerX, a public scientific search engine and digital library, is used as the source of actual scientific PDF documents, and TableSeerX, an existing table detection tool, is used to extract tables from them. The paper does not consider other document media such as images, handwriting, or web pages. Tables detected in error are manually removed from two randomly selected samples of 200 documents each, leaving 135 tables in sample 1 and 120 in sample 2; these form the test set. The proportion of header types (one-dimensional when only a row or column header exists, two-dimensional when both exist, and multi-dimensional) is determined for both samples. A second classification by table layout complexity distinguishes simple and complex tables, with the complex ones further divided into tables with multi-line cells, multi-level headers, long and folded tables, multi-dimensional tables, and other irregular layouts. Both samples show similar class proportions, and the dataset is concluded to be stable.

Among the machine learning methods, which are clearly more accurate than the heuristic applied first, the random forest outperforms the other two classifiers. The classifiers are compared using three metrics: precision, recall, and F-measure (the harmonic mean of precision and recall).
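For reference, the three metrics can be computed as follows; the counts in the example are invented for illustration and are not figures from the paper:

def precision_recall_f1(tp, fp, fn):
    # Precision, recall and F-measure for one class (e.g. header cells),
    # computed from true-positive, false-positive and false-negative counts.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: 95 header cells found correctly, 3 data cells mislabelled as header,
# 5 header cells missed.
print(precision_recall_f1(95, 3, 5))   # -> (0.969..., 0.95, 0.959...)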

The impact of the two feature sets (single-row features and neighboring-row features) is tested on the random forest classifier; using both sets together gives the most accurate results (97.4%). The InfoGain attribute evaluator of the Weka toolkit identifies the most effective features as the number of characters, the fraction of alphabetic characters, font size, average alignment, and consistency of data type.
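For readers without Weka, a comparable feature ranking can be sketched with scikit-learn's mutual-information estimator, which plays a role similar to the InfoGain evaluator; this is a stand-in, not the paper's setup, and X and y are assumed to come from the earlier classification sketch:

from sklearn.feature_selection import mutual_info_classif

# Feature names match the row_features() vector defined above.
feature_names = ["num_cells", "num_chars", "frac_digits", "frac_alpha", "frac_symbols"]
scores = mutual_info_classif(X, y, random_state=0)
for name, score in sorted(zip(feature_names, scores), key=lambda p: p[1], reverse=True):
    print(f"{name}: {score:.3f}")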
