
A Mathematical Measurement For Korean Text Mining and Its Application


The proposed method can overcome the shortcomings of the conventional system based on the term-document matrix. A characteristic of exponential growth is that the last term is greater than the sum of all preceding terms.

Figure 1-1: Structure of the thesis

Retrieval models for text information

Information extraction from a single document

Long ago, the well-known company Microsoft offered the feature "AutoSummarize" (automatically summarize a document) in Word 2007.

Information extraction from multiple documents

Information extraction from related documents

Deciding on the importance of each sentence is difficult, even from a human point of view. Nowadays we face various information overload issues, which can be overcome by smart system design.

Fuzzy Logic

The union of A and B is the smallest fuzzy set containing both A and B, such that f_A∪B(x) = max(f_A(x), f_B(x)). The intersection of A and B is the largest fuzzy set contained in both A and B, such that f_A∩B(x) = min(f_A(x), f_B(x)).
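For concreteness, a minimal Python sketch of these two operators over discrete membership functions follows; the sets A and B below are illustrative, not taken from the thesis.

```python
# Fuzzy union and intersection over discrete membership functions.
# The elements and membership degrees are illustrative placeholders.

def fuzzy_union(f_a, f_b):
    """Membership of A ∪ B: pointwise maximum."""
    return {x: max(f_a.get(x, 0.0), f_b.get(x, 0.0)) for x in set(f_a) | set(f_b)}

def fuzzy_intersection(f_a, f_b):
    """Membership of A ∩ B: pointwise minimum."""
    return {x: min(f_a.get(x, 0.0), f_b.get(x, 0.0)) for x in set(f_a) | set(f_b)}

A = {"s1": 0.2, "s2": 0.8}
B = {"s1": 0.5, "s2": 0.4}
print(fuzzy_union(A, B))         # s1 -> 0.5, s2 -> 0.8
print(fuzzy_intersection(A, B))  # s1 -> 0.2, s2 -> 0.4
```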

Summarization

Survey of Summarization

System Framework

  • Source Document
  • Preprocessing
  • Extraction of Features
    • Ratio between title and content (V1)
    • Ratio of Part of Speech (V2)
  • Text Summarization based on Fuzzy Logic
  • Refinement
  • Summary Document

In the fuzzifier, crisp inputs are translated into linguistic values using membership functions for the input linguistic variables. In the final step, the output linguistic variables from the inference are converted into final crisp values by the defuzzifier, using the membership function to represent the final sentence score [7]. This completes the process of Section 2.5.3, i.e., the calculation of the sentence score.

Implementation

  • Fuzzy Sets and Fuzzy Operators
  • Membership Functions
  • Defuzzification
  • Korean News Summarization

A triangle function returns 0 if x is less than x0 or greater than x2, a maximum value of 1 if x is equal to x1, with straight lines connecting these points. A trapezoid function returns 0 if x is less than x0 or greater than x3, a value of 1 if x is between x1 and x2, with straight lines connecting these points. In the Mamdani-type fuzzy inference system, we use the center of area (centroid) for the defuzzification process.

The center of gravity is calculated using the standard formula found in any calculus book, x* = ∫ x·μ(x) dx / ∫ μ(x) dx.
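A minimal sketch of the triangle and trapezoid membership functions and the discrete centroid computation follows; the breakpoints x0 to x3 match the description above, while the test values are illustrative.

```python
import numpy as np

def triangle(x, x0, x1, x2):
    # 0 outside [x0, x2], 1 at x1, linear in between
    if x <= x0 or x >= x2:
        return 0.0
    return (x - x0) / (x1 - x0) if x <= x1 else (x2 - x) / (x2 - x1)

def trapezoid(x, x0, x1, x2, x3):
    # 0 outside [x0, x3], 1 on [x1, x2], linear in between
    if x <= x0 or x >= x3:
        return 0.0
    if x < x1:
        return (x - x0) / (x1 - x0)
    return 1.0 if x <= x2 else (x3 - x) / (x3 - x2)

def centroid(xs, mus):
    # Discrete center of gravity: sum(x * mu(x)) / sum(mu(x))
    xs, mus = np.asarray(xs), np.asarray(mus)
    return float((xs * mus).sum() / mus.sum())

xs = np.linspace(0.0, 1.0, 101)
mus = [triangle(x, 0.2, 0.5, 0.8) for x in xs]
print(centroid(xs, mus))  # ~0.5 for this symmetric triangle
```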

Figure 2-2: Rule matrix and Fuzzy sets of V1, V2

Conclusions

Further Work

As we have seen, we can easily access a vast number of news articles today, far more than we can actually read. Clustering is an unsupervised learning technique that has been widely used to find subgroups, or clusters, in a data set. The number of words/terms used by the system determines the dimension of the vector space in which the analysis is performed.

Dimensionality reduction can lead to significant savings in computing resources and processing time. In this context, our research focuses on modeling with linguistic features, especially Korean among the Asian languages. Because of this characteristic, we choose the syllable, a feature of Asian languages, for dimensionality reduction.

Syllable vector

Vector space model

If t_i is a term and d_j is a document of C, the frequency of t_i in d_j, denoted tf_ij, is the number of occurrences of t_i in d_j. However, the proposed method can solve this problem by reducing the matrix dimension.
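As a point of reference, here is a minimal sketch of building the term-document frequency matrix tf_ij with scikit-learn; the two-document corpus is an illustrative placeholder.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Build a term-by-document count matrix: rows are terms t_i, columns are
# documents d_j, and each entry is tf_ij.
corpus = ["korea economy growth", "korea culture news"]
vec = CountVectorizer()
M = vec.fit_transform(corpus).T.toarray()  # transpose: terms as rows
print(vec.get_feature_names_out())         # the terms t_i
print(M)                                   # the tf_ij counts
```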

Definition of Syllable Vector

  • Syllable-n Vector
  • Syllable-n-All Vector

Each term t_i generates a row vector (a_i1, a_i2, ..., a_in) called a term vector, and each document d_j generates a column vector. As we can see, the length of the rows depends on the number of distinct links. A syllable in Korean carries more information, even though its length is only one character.

For example, the one syllable 한 (han) in 한국 (hankuk) indicates Korea, which is transcribed with the Chinese character 韓.
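The following sketch illustrates the syllable-n idea, assuming a document is indexed by its syllable n-grams rather than by whole words; the sample documents are illustrative.

```python
from collections import Counter

def syllable_ngrams(text, n):
    # Treat each non-space Hangul character as one syllable and slide an
    # n-length window over the syllable sequence.
    syllables = [ch for ch in text if not ch.isspace()]
    return ["".join(syllables[i:i + n]) for i in range(len(syllables) - n + 1)]

docs = ["한국 경제 성장", "한국 문화 소식"]
for d in docs:
    print(Counter(syllable_ngrams(d, 1)))  # syllable-1 frequency vector
```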

Figure 3-2: Dimension reduction using Syllables-n vector

Measurement of similarity

Cosine Similarity

Thus the cosine distance, 1 − cos θ, ranges from 0, meaning exactly the same, to 2, meaning exactly opposite.
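A minimal sketch of cosine similarity and the derived distance follows; the vectors are illustrative.

```python
import numpy as np

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| |v|)
    u, v = np.asarray(u, float), np.asarray(v, float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

u, v = [1.0, 2.0, 0.0], [2.0, 4.0, 0.0]
print(cosine_similarity(u, v))        # 1.0: same direction
print(1.0 - cosine_similarity(u, v))  # 0.0: cosine distance in [0, 2]
```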

Pearson Correlation

Empirical analysis and results

If we ran it for all documents, the computation would take 138 days in a single process. We therefore calculate the similarity for samples, as shown in the figure, and Pearson correlation is applied to all cases. To show the performance of the proposed method, we compare the computation times on a real data set of Korean news collected by daily web crawling (2015).

In Figure 3-13, the computation time for the word-based matrix increases exponentially, while the proposed method scales almost linearly.

Figure 3-5: Percentage of document words and total words

Latent semantic indexing

  • TF-IDF
  • SVD
  • LSI
  • Empirical analysis and results with Syllable Vector

The goal of inverse document frequency is to reduce the influence of terms that appear very often in a given corpus and are therefore empirically less informative than features that appear in a small part of the corpus. In industrial implementations, to prevent division by zero, a constant 1 is added to the numerator and denominator of the idf. The calculation of the SVD of a matrix A is related to the eigenvalue decomposition of the matrix A^T A.

LSI uses the SVD matrices to identify a linear subspace in the term-frequency document space. The derived LSI features, which are linear combinations of the original TF-IDF features, can capture some linguistic features. LSI captures structures such as polysemy or synonymy in the data that are obscured when raw term overlap is used.
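Since the thesis later compares results from scikit-learn, here is a minimal illustrative LSI pipeline built from that library; the toy corpus and the number of components are assumptions for illustration, not the thesis configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

corpus = [
    "korea economy growth",
    "korea culture news",
    "economy market growth",
]
# TF-IDF weighting; scikit-learn adds the constant 1 to the idf by default
# (smooth_idf), matching the zero-division safeguard described above.
tfidf = TfidfVectorizer()
A = tfidf.fit_transform(corpus)

# Truncated SVD of the TF-IDF matrix gives the LSI representation.
lsi = TruncatedSVD(n_components=2, random_state=0)
X = lsi.fit_transform(A)  # documents in the 2-dimensional latent subspace
print(X.shape)            # (3, 2)
```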

Table 3-5: Variants of term frequency (TF) weight

Non-negative matrix factorization

Definition of Non-negative Matrix Factorization

Non-negative Matrix Factorization with document clustering

The matrix B, in which each row represents a document and each column represents a term, is factorized into two matrices W and H. Each row of W is indexed by a document, and each value in the row is the weight of a feature group. In our example, document 1 has weight 1.58 for the first feature and weight 0 for the second.

Through the matrix W, we see that documents 1 and 2 have a strong value for feature 1, so they have similar feature vectors. Similarly, we derive the meaning of words for each feature from the matrix H. To summarize, documents 1 and 2 are similar while document 3 is not related to them; terms 3 and 4 are important expressions of feature 1, and terms 3 and 5 are keywords of feature 2.
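A minimal sketch of this factorization with scikit-learn's NMF follows; the toy matrix mirrors the rows-as-documents layout above, but its values are illustrative placeholders rather than the thesis example.

```python
import numpy as np
from sklearn.decomposition import NMF

# B: rows are documents, columns are terms.
B = np.array([
    [1, 1, 2, 2, 0],   # document 1
    [2, 1, 3, 3, 0],   # document 2
    [0, 0, 1, 0, 2],   # document 3
], dtype=float)

model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(B)  # document-by-feature weights
H = model.components_       # feature-by-term weights
print(np.round(W, 2))       # which features dominate each document
print(np.round(H, 2))       # which terms dominate each feature
```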

Empirical analysis and results with Syllable Vector

For convenience, NMF denotes the result of scikit-learn and NMF* denotes the result of nimfa.

Figure 3-26: Syllable-1-All; Figure 3-27: Syllable-2-All; Figure 3-28: Word

Text Clustering

  • Evaluation of Text Clustering
  • Purity
  • Precision and Recall
  • Evaluation
  • Top Ranking Matching

Measuring the performance of supervised learning can start by building the confusion matrix, which records the observed class labels and the predicted class labels. Consider the positive class of the predicted data in the first column of the confusion matrix. In information retrieval contexts, the data can be divided into two types: the retrieved documents (e.g., the list of documents produced by a web search engine for a query) and the relevant documents (e.g., the list of all documents for a specific topic).

Recall, on the other hand, measures the proportion of actually positive cases that are predicted positive. Figure 3-31 visualizes the rule: 'target' is a target document, and the arrows indicate a match for each document from the target document (values near 0 are best). First look at the case of the counting threshold (Fig 3-33): Syl-1-All and Syl-1 missed some correct answers.
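A minimal sketch of the purity and precision/recall measures follows; the cluster assignments and document lists are illustrative.

```python
from collections import Counter

def purity(clusters, labels):
    # Purity: for each cluster, count its majority true class, then divide
    # the total of those majorities by the number of items.
    by_cluster = {}
    for c, y in zip(clusters, labels):
        by_cluster.setdefault(c, []).append(y)
    majority = sum(Counter(ys).most_common(1)[0][1] for ys in by_cluster.values())
    return majority / len(labels)

def precision_recall(retrieved, relevant):
    # Precision: fraction of retrieved documents that are relevant.
    # Recall: fraction of relevant documents that are retrieved.
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)
    return tp / len(retrieved), tp / len(relevant)

print(purity([0, 0, 1, 1], ["a", "a", "a", "b"]))  # 0.75
print(precision_recall([1, 2, 3], [2, 3, 4]))      # (0.667, 0.667)
```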

Figure 3-30: Example clustering set for purity

Results and discussion

In other words, metadata is defined as data that provides information about one or more aspects of the data [46]. Word2Vec takes into account the meaning and context of a word when expressing it as a vector, within the framework of the distributional hypothesis in linguistics. Skip-gram, on the other hand, takes target expressions as input and trains a neural network to predict the contextual structure surrounding a particular expression (Figure 4-2). C is the window parameter that determines how far from the target expression the context will be considered.

First, we analyze the morphemes of the selected data and divide the text into units of meaningful terms. The term in the middle of each window of words becomes the target term, and the terms next to the target term become context terms. This is then used to train a model under the CBOW or Skip-gram architecture.
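A minimal sketch of generating Skip-gram (target, context) training pairs with window parameter C follows; the tokenized sentence is illustrative and would come from the morphological analysis step described above.

```python
def skipgram_pairs(tokens, window):
    # For each position i, pair the target token with every token within
    # `window` positions on either side.
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = ["한국", "경제", "성장", "전망"]
print(skipgram_pairs(tokens, window=1))
# [('한국', '경제'), ('경제', '한국'), ('경제', '성장'), ...]
```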

Figure 3-41: Corr Basic with tf-idf

Heterogeneous Word2Vec

The target word is the object word for learning, and the context word is the supervision data for the object word. We follow a general neural network (NN) with one hidden layer and set six nodes for the hidden layer in this example. The dot product of a one-hot encoding vector and the weight matrix yields the embedded word.
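A minimal sketch of why the hidden layer acts as an embedding lookup follows; the vocabulary size of five words and six hidden nodes match the example, while the weights are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden = 5, 6
W1 = rng.normal(size=(vocab_size, hidden))  # input-to-hidden weight matrix

one_hot = np.zeros(vocab_size)
one_hot[2] = 1.0                            # one-hot encoding of word index 2

# Multiplying a one-hot vector by W1 simply selects the matching row of W1,
# which is exactly the embedded word.
embedding = one_hot @ W1
print(np.allclose(embedding, W1[2]))        # True
```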

Figure 4-3: Example of Skip-gram model (left: initial NN structure, right: learned weighted NN structure)

Law2Vec

  • Case - Legislations (CL)
  • Case - Cases (CC)
  • Case - Legislations, Cases (CLC)
  • Results
    • Task description
    • CL
    • CC
    • CLC
    • Comparison of model architectures

In particular, calculating the similarity of documents within legal data is necessary for developing effective legal data processing tools such as legal search engines. However, existing studies are still focused on keywords and have difficulty taking into account the semantics of each legal document, especially case law. We propose a new architecture capable of showing the similarities between the fundamental elements of a case, namely jurisprudence and legislation.

We call it Law2Vec, and it is composed of three different forms using case law and legislation. A reference to legislation will always exist in jurisprudence, and we want to learn this connection between jurisprudence and legislation through the architecture of Word2Vec. In this section, we demonstrate the possibility of a case search that takes into account the semantics of jurisprudence, through contextual learning of the relationship between jurisprudence and legislation.
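As a hedged sketch of this idea (not the thesis implementation), one could treat each case as a "sentence" whose tokens are the case identifier plus the legislation sections it cites, and let gensim's Word2Vec learn the case-legislation co-occurrence; the identifiers and hyperparameters below are illustrative placeholders.

```python
from gensim.models import Word2Vec

# Each "sentence" couples a case with the legislation it cites, so the model
# learns which cases and sections appear in the same context.
corpus = [
    ["case:2014Da1234", "law:CivilAct-750", "law:CivilAct-751"],
    ["case:2015Da5678", "law:CivilAct-750", "law:CommercialAct-24"],
]
model = Word2Vec(corpus, vector_size=50, window=5, min_count=1, sg=1, epochs=100)
print(model.wv.most_similar("law:CivilAct-750", topn=2))
```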

Figure 4-15: Distribution of the W1 matrix for a large set

Link Prediction

CL Model for large-scale set

When the CL model was used for learning, we could see that the answer sets gave very good results. The distribution of responses at iteration 100 was close to 0 while at iteration 35,800 the distribution of responses was closer to 1.

CC Model for large-scale set

CLC Model for large-scale set

Results and discussion

This means that it would be difficult to analyze the meaning and written context of each piece of legislation using word-based methods. However, by learning the citation relationships of legal cases, we can calculate the similarity of legislation and gain the advantage of computing semantic similarity between sections. It is a learning method specialized for the structure of legal data (which consists of legal reasoning and material facts).

Figure 5-1: Demo of News Summarization System

Link prediction for Legal Data

References

  • "Intelligent Extractive Text Summarization Using Fuzzy Inference Systems." In: Intelligent Systems Engineering, 2006 IEEE International Conference on.
  • "Fuzzy Genetic Semantic Text Summarization." In: Dependable, Autonomic and Secure Computing (DASC), 2011 IEEE Ninth International Conference on.
  • "Dynamic Analysis Based Multiangle Test Case Evaluations." In: International Conference on Advanced Data Mining and Applications.
  • "Using Latent Semantic Analysis to Improve Access to Textual Information." In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.
  • "Document Clustering Based on Non-negative Matrix Factorization." In: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
  • "Beyond Accuracy, F-score and ROC: A Family of Discriminant Measures for Performance Evaluation." In: Australian Conference on Artificial Intelligence.
