2 Bag-of-Words Model with Word Embedding Clusters

In this chapter, we construct an entirely different method that addresses both the sparsity problem and the semantic problem of the BOW model. We call our model the bag-of-word clusters model. Unlike the traditional bag-of-words model, which treats each word as an independent item, we group semantically related words into clusters using pre-trained word2vec word embeddings (see Mikolov et al. 2013a,b). For example, consider the phrases “old bike” and “used bicycle,” which contain four unique words: “old,” “bike,” “used,” and “bicycle.” With the traditional BOW model, we represent “old bike” as the vector $[\tfrac{1}{2}, \tfrac{1}{2}, 0, 0]$ and “used bicycle” as $[0, 0, \tfrac{1}{2}, \tfrac{1}{2}]$. These two vectors are orthogonal because they have no nonzero elements in common. In reality, however, “used bicycle” and “old bike” are semantically related.

Using our method, we first group similar words such as “bicycle” and “bike” into the same cluster and treat them as the same item. For example, suppose we have two clusters: cluster #1: “used,” “old”; cluster #2: “bike,” “bicycle.” Then we represent “old bike” as $[\tfrac{1}{2}, \tfrac{1}{2}]$, where each dimension corresponds to one cluster, and the bag-of-word clusters representation of “used bicycle” is also $[\tfrac{1}{2}, \tfrac{1}{2}]$. As this example shows, our method alleviates the BOW model’s sparsity problem because the number of clusters is significantly smaller than the number of unique words. It also retrieves semantic information from text: the similar phrases “old bike” and “used bicycle” are represented by the same vector in our model. Moreover, unlike Zhang’s method, which only considers keywords and treats all other words as noise, our method keeps most words and treats related words as the same item. We believe our method can extract more information from text and thus achieve better ranking performance. More detail about the bag-of-word clusters model is given in the next section.
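The “old bike” versus “used bicycle” comparison can be checked numerically. The following is a minimal sketch, assuming relative-frequency weights and the two hand-made clusters from the example; the function and variable names are illustrative and do not come from the chapter.

```python
import math

def relative_frequency_vector(tokens, index_of, size):
    """Add 1/len(tokens) to the dimension that each token maps to."""
    vec = [0.0] * size
    for tok in tokens:
        vec[index_of[tok]] += 1.0 / len(tokens)
    return vec

def cosine(a, b):
    """Cosine similarity of two equal-length, nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Traditional BOW: one dimension per unique word.
word_index = {"old": 0, "bike": 1, "used": 2, "bicycle": 3}
# Bag-of-word clusters: cluster #1 = {used, old}, cluster #2 = {bike, bicycle}.
cluster_index = {"used": 0, "old": 0, "bike": 1, "bicycle": 1}

phrase_a = ["old", "bike"]
phrase_b = ["used", "bicycle"]

bow_a = relative_frequency_vector(phrase_a, word_index, 4)
bow_b = relative_frequency_vector(phrase_b, word_index, 4)
boc_a = relative_frequency_vector(phrase_a, cluster_index, 2)
boc_b = relative_frequency_vector(phrase_b, cluster_index, 2)

bow_similarity = cosine(bow_a, bow_b)  # orthogonal word-level vectors
boc_similarity = cosine(boc_a, boc_b)  # identical cluster-level vectors
```

At the word level the two phrases share no dimension, while at the cluster level they coincide, which is the behavior the example describes.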

The rest of this chapter is organized as follows. In Sect. 2, we propose a new text representation method: the bag-of-word clusters model. In Sect. 3, we give a detailed description of our ranking algorithm. Section 4 introduces our experiment with a real Amazon product using the Amazon product dataset (see He and McAuley 2016; McAuley et al. 2015).

and a measure of the presence of known words. Given the document collection $D = \{d_i,\ i = 1, 2, 3, \ldots, n\}$ and $m$ unique words in these documents, each document $d_i$ is represented mathematically by an $m \times 1$ vector $v_i \in \mathbb{R}^{m \times 1}$. For instance, suppose there are three documents in this collection $D$:

d₁: I like learning text mining.

d₂: What is text mining?

d₃: Apple tastes good.

Now we make a list of all words in our model’s vocabulary. The unique words here (ignoring case and punctuation) are: {I, like, learning, text, mining, what, is, apple, tastes, good}. Thus we have 10 unique words and 12 words in total within this collection $D$.

The next step is to score each word in the document. There are many scoring methods. Let us consider the simple Boolean score first: if a word appears in a document, its corresponding weight is 1; otherwise, it is 0. Since our vocabulary has 10 words, we use a fixed-length vector representation of length 10, with each position in the vector scoring one word. The vector representations of these three documents are then

$$v_1 = [1,1,1,1,1,0,0,0,0,0], \quad v_2 = [0,0,0,1,1,1,1,0,0,0], \quad v_3 = [0,0,0,0,0,0,0,1,1,1].$$

Notice that the word indices follow the order of the unique word list above.
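The Boolean vectors above can be reproduced with a few lines of code. This is a minimal sketch, assuming a simple tokenizer that lowercases the text and strips ASCII punctuation; the helper names `tokenize` and `boolean_vector` are illustrative and not part of the chapter.

```python
import string

documents = [
    "I like learning text mining.",
    "What is text mining?",
    "Apple tastes good.",
]

def tokenize(text):
    """Lowercase the text, remove ASCII punctuation, and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

# Build the vocabulary in order of first appearance, so that the word
# indices follow the order of the unique word list in the text.
vocabulary = []
for doc in documents:
    for word in tokenize(doc):
        if word not in vocabulary:
            vocabulary.append(word)

def boolean_vector(text):
    """Score 1 for every vocabulary word present in the text, 0 otherwise."""
    present = set(tokenize(text))
    return [1 if word in present else 0 for word in vocabulary]

vectors = [boolean_vector(doc) for doc in documents]
total_words = sum(len(tokenize(doc)) for doc in documents)
```

Because the vocabulary is built in order of first appearance, `vectors` lines up position by position with the vectors $v_1$, $v_2$, and $v_3$ given in the text.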

The intuition behind this model is that the information within a document comes from its content, which in this case is its words. Documents are similar if they contain similar words. Since all three vectors have the same fixed length, we use cosine similarity to measure their similarity. For two vectors $A$ and $B$ of fixed length $N$, cosine similarity is defined as follows:

$$\text{Cosine Similarity} = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{N} A_i B_i}{\sqrt{\sum_{i=1}^{N} A_i^2}\,\sqrt{\sum_{i=1}^{N} B_i^2}}, \tag{4}$$

where $A_i$ and $B_i$ are the components of vectors $A$ and $B$, respectively.

The cosine value lies in $[-1, 1]$: 1 for vectors pointing in the same direction, 0 for orthogonal vectors, and $-1$ for vectors pointing in opposite directions. For documents, the term weights are usually non-negative, so the cosine similarity lies in $[0, 1]$, and the higher the value, the more similar the two documents are.

Now we calculate the similarity between documents $d_1$, $d_2$, and $d_3$ using this formula:

$$\cos(d_1, d_2) = 0.4472; \quad \cos(d_1, d_3) = 0; \quad \cos(d_2, d_3) = 0.$$

This result is consistent with our observation: $d_1$ and $d_2$ are similar to each other because both are about text mining, whereas $d_3$ has no relation to $d_1$ or $d_2$, and thus its cosine similarity with each of them is zero.
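The three similarity values can be verified directly from definition (4). The sketch below assumes the Boolean vectors given earlier; returning 0 for a zero vector is a convention chosen here for illustration, not something the chapter specifies.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity as in (4): dot product divided by the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

v1 = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
v2 = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
v3 = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

sim_12 = cosine_similarity(v1, v2)  # 2 / (sqrt(5) * sqrt(4))
sim_13 = cosine_similarity(v1, v3)  # no shared words
sim_23 = cosine_similarity(v2, v3)  # no shared words
```

Here $d_1$ and $d_2$ share the two words “text” and “mining,” so the numerator is 2 and the norms are $\sqrt{5}$ and $\sqrt{4}$, which gives the value 0.4472 quoted in the text.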

The BOW model is straightforward and easy to implement. For the word weights in the BOW model, besides the simple Boolean score, we can also use word counts, frequencies, or term frequency-inverse document frequency (tf-idf); for more information, see Salton and Buckley (1988).
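The alternative weights can be illustrated on the same three documents. This is a rough sketch only: tf-idf has many variants, and the one used here (relative term frequency multiplied by $\log(n/\mathrm{df})$) is an assumption made for illustration, not necessarily the variant used in the chapter.

```python
import math
from collections import Counter

# Tokenized versions of the three example documents.
docs = [
    ["i", "like", "learning", "text", "mining"],
    ["what", "is", "text", "mining"],
    ["apple", "tastes", "good"],
]
vocabulary = ["i", "like", "learning", "text", "mining",
              "what", "is", "apple", "tastes", "good"]

def count_vector(tokens):
    """Raw number of occurrences of each vocabulary word."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

def frequency_vector(tokens):
    """Counts divided by the document length (relative frequency)."""
    return [c / len(tokens) for c in count_vector(tokens)]

# Document frequency: number of documents that contain each word.
doc_freq = {word: sum(1 for d in docs if word in d) for word in vocabulary}

def tfidf_vector(tokens):
    """One common tf-idf variant: relative frequency times log(n / df)."""
    n = len(docs)
    return [tf * math.log(n / doc_freq[word])
            for tf, word in zip(frequency_vector(tokens), vocabulary)]

tfidf_d1 = tfidf_vector(docs[0])
```

Under this variant, a word such as “text,” which occurs in two of the three documents, receives a lower weight in $d_1$ than a word such as “I,” which occurs in only one.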

Despite the simplicity of this representation method, we face two significant disadvantages if we want to apply it to our comments data: (1) The vocabulary size of the comments data is about 10,000, while each comment has only 10–200 words, so using the BOW model leads to high-dimensional and sparse vectors. (2) The BOW representation does not consider the semantic relations between words; it assumes that all words are independent. This assumption can be problematic: for example, “used bicycle” and “old bike” are treated as entirely different phrases because they have no words in common.

Next we develop a new text representation method based on the BOW model to overcome these disadvantages. First, we introduce word embeddings, a learned representation of words in which words with similar meanings have similar representations (Manning et al. 2008). Word2vec (see Mikolov et al. 2013a,b) is a very effective algorithm for training word embeddings on the local documents. With these word embeddings, we group similar words into clusters using a clustering method. Finally, instead of representing documents (comments) as “bags of words,” we represent them as “bags of word clusters.” In this way, we retrieve semantic information from the text; for example, “bike” and “bicycle” are grouped together because they have the same or similar meaning. Moreover, the BOW model’s sparsity problem is alleviated since the number of clusters is significantly smaller than the vocabulary size.

Word2vec was created and published in 2013 by a team of researchers led by Tomas Mikolov at Google (Mikolov et al. 2013a,b). Word2vec is a group of related models used to produce word vectors (also called word embeddings). The term usually refers to two model architectures and two related training techniques:

– 2 model architectures: continuous bag-of-words (CBOW) and skip-gram (SG). CBOW aims to predict a center word from the word vectors of its surrounding context. Skip-gram does the opposite and predicts the probability of context words from a center word.

– 2 training techniques: negative sampling and hierarchical softmax. Negative sampling defines an objective by sampling negative examples, while hierarchical softmax defines an objective using an efficient tree structure to compute the probabilities of all words in the vocabulary.
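The difference between the two architectures is easiest to see in the training examples each one consumes. The sketch below only builds those examples from a tokenized sentence with a fixed context window; it does not implement the neural networks, negative sampling, or hierarchical softmax, and the function names and window size are illustrative assumptions.

```python
def context_positions(length, center, window):
    """Positions within `window` of `center`, excluding `center` itself."""
    start = max(0, center - window)
    end = min(length, center + window + 1)
    return [p for p in range(start, end) if p != center]

def skipgram_pairs(tokens, window):
    """Skip-gram: one (center word, context word) pair per context position."""
    pairs = []
    for center, word in enumerate(tokens):
        for p in context_positions(len(tokens), center, window):
            pairs.append((word, tokens[p]))
    return pairs

def cbow_examples(tokens, window):
    """CBOW: one (list of context words, center word) example per position."""
    examples = []
    for center, word in enumerate(tokens):
        context = [tokens[p] for p in context_positions(len(tokens), center, window)]
        examples.append((context, word))
    return examples

sentence = ["i", "like", "learning", "text", "mining"]
sg_pairs = skipgram_pairs(sentence, window=1)
cbow = cbow_examples(sentence, window=1)
```

For the center word “learning” with a window of 1, skip-gram yields the two pairs (learning, like) and (learning, text), whereas CBOW yields the single example ([like, text], learning).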

In our application, the skip-gram model with negative sampling is used. The detailed steps are given in the first author’s master’s thesis. For a detailed explanation of the other models in word2vec, see Rong (2014). Moreover, since skip-gram is a neural network model, readers unfamiliar with neural networks may refer to Goodfellow et al. (2016).

Fig. 1 Word2vec example

Training with the word2vec model produces an $N$-dimensional vector for each word in our vocabulary. These word embeddings have many good properties. Since all word embeddings have the same size $N$, it is easy to measure the distance between a pair of word embeddings. Another property of these pre-trained word embeddings is that semantically related words are usually close to each other. Here we use the cosine similarity defined by (4): the more similar two words are, the higher the cosine similarity of their word embeddings. Figure 1 shows three example words, “baby,” “apple,” and “well,” with their top 10 most similar words in the whole vocabulary of around $10^4$ words. More detail about how these word embeddings are trained is given in Sect. 4.
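A nearest-neighbor lookup of the kind shown in Figure 1 can be sketched as follows. The three-dimensional embeddings below are invented toy values used purely for illustration; they are not the chapter’s trained word2vec vectors, and `most_similar` is a hypothetical helper name.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two nonzero vectors, as in (4)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(word, embeddings, top_n):
    """Rank all other words by cosine similarity to `word`, highest first."""
    query = embeddings[word]
    scored = [(other, cosine_similarity(query, vec))
              for other, vec in embeddings.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Hypothetical toy embeddings (N = 3), made up for this example.
toy_embeddings = {
    "bike":    [0.9, 0.1, 0.0],
    "bicycle": [0.8, 0.2, 0.1],
    "apple":   [0.0, 0.9, 0.3],
    "pear":    [0.1, 0.8, 0.4],
    "well":    [0.2, 0.1, 0.9],
}

bike_neighbors = most_similar("bike", toy_embeddings, top_n=2)
apple_neighbors = most_similar("apple", toy_embeddings, top_n=1)
```

With real embeddings the same ranking would run over the full vocabulary, with `top_n=10` to match the figure.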

In our method, we adopt the K-means algorithm (Hartigan & Wong 1979) to cluster the word embeddings. K-means is a straightforward and efficient algorithm for general clustering. Before describing how K-means is applied in our method, let us first review the algorithm.

In this clustering problem, we are given $n$ word embeddings as our training set $\{w^{(1)}, w^{(2)}, \ldots, w^{(n)}\}$ with $w^{(i)} \in \mathbb{R}^N$. The number of clusters is a pre-set parameter $K$, and the K-means algorithm is as follows:

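1. Initialize cluster centroids $\mu_1, \mu_2, \ldots, \mu_K \in \mathbb{R}^N$ randomly.

2. Repeat until convergence: {

for every $i \in \{1, \ldots, n\}$, set

$$c^{(i)} := \arg\min_{j} \|w^{(i)} - \mu_j\|^2; \tag{5}$$

for every $j \in \{1, \ldots, K\}$, set

$$\mu_j = \frac{\sum_{i=1}^{n} 1\{c^{(i)} = j\}\, w^{(i)}}{\sum_{i=1}^{n} 1\{c^{(i)} = j\}}. \tag{6}$$

}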

Each repetition consists of two steps. The first assigns each training sample $w^{(i)}$ to its nearest cluster centroid $\mu_j$ and updates its assigned cluster index $c^{(i)}$. The second updates each cluster centroid $\mu_j$ to the mean of the points assigned to it. The K-means algorithm can also be regarded as coordinate descent on the distortion function $J$,

$$J(c, \mu) = \sum_{i=1}^{n} \|w^{(i)} - \mu_{c^{(i)}}\|^2. \tag{7}$$

The distortion function is non-convex, so the K-means algorithm can easily get stuck in local minima. One common remedy is to run K-means many times with different random initializations of $\mu$ and, out of all the clusterings found, use the one with the lowest distortion $J$ as the final solution.
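The assignment step, the update step, the distortion function, and the restart strategy can be put together in a short sketch. Several choices here are assumptions rather than details from the chapter: centroids are initialized by sampling $K$ distinct data points, an empty cluster keeps its previous centroid, and the data set and function names are illustrative.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, rng, max_iter=100):
    """One K-means run; returns (assignments, centroids, distortion)."""
    # Assumed initialization: pick k distinct data points as centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its points.
        for j in range(k):
            members = [p for p, c in zip(points, assignments) if c == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    # Distortion J: sum of squared distances to the assigned centroids.
    distortion = sum(squared_distance(p, centroids[c])
                     for p, c in zip(points, assignments))
    return assignments, centroids, distortion

def kmeans_with_restarts(points, k, n_restarts, seed=0):
    """Run K-means several times and keep the run with the lowest distortion."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_restarts):
        result = kmeans(points, k, rng)
        if best is None or result[2] < best[2]:
            best = result
    return best

# Toy data: two well-separated groups of three points each.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
assignments, centroids, distortion = kmeans_with_restarts(points, k=2, n_restarts=10)
```

On this toy data every run is expected to find the same two groups, so the restarts matter little; on real word embeddings, different initializations can end in different local minima, which is why the lowest-distortion run is kept.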

Normally, we use cosine similarity to measure the distance between word embeddings, as mentioned before, but K-means uses only Euclidean distance as its distance measure. Although other clustering methods such as K-medoids (Kaufmann 1987) can use cosine as a distance measure, we still use K-means because of its much higher computational efficiency. We justify this choice by noting that, for normalized vectors, squared Euclidean distance is a linear function of cosine similarity. For two normalized vectors $A = \{A_i\}$, $B = \{B_i\}$ ($\sum_i A_i^2 = \sum_i B_i^2 = 1$), the squared Euclidean distance between $A$ and $B$ is

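$$\begin{aligned}
\|A - B\|^2 &= \sum_i (A_i - B_i)^2 \\
&= \sum_i (A_i^2 + B_i^2 - 2 A_i B_i) \\
&= \sum_i A_i^2 + \sum_i B_i^2 - 2 \sum_i A_i B_i \\
&= 1 + 1 - 2\cos(A, B) \\
&= 2(1 - \cos(A, B)).
\end{aligned} \tag{8}$$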

Note that for normalized vectors $\cos(A, B) = \frac{\sum_i A_i B_i}{\sqrt{\sum_i A_i^2}\,\sqrt{\sum_i B_i^2}} = \sum_i A_i B_i$. The higher the cosine similarity of two word embeddings, the smaller their Euclidean distance, which is consistent with our objective. Thus in our application, we perform K-means on the normalized pre-trained word embeddings.
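The relation between squared Euclidean distance and cosine similarity for normalized vectors can be checked numerically. This is a small sanity check on arbitrary example vectors, which are made up for illustration.

```python
import math

def normalize(v):
    """Scale a nonzero vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity of two nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def squared_euclidean(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Arbitrary example vectors, normalized before comparison.
a = normalize([1.0, 2.0, 3.0])
b = normalize([-2.0, 0.5, 4.0])
c = normalize([1.0, 2.0, 2.5])

lhs = squared_euclidean(a, b)
rhs = 2.0 * (1.0 - cosine(a, b))

# The ordering is preserved: the vector with the higher cosine similarity
# to `a` is also the one at the smaller Euclidean distance from `a`.
closer_by_cosine = cosine(a, c) > cosine(a, b)
closer_by_distance = squared_euclidean(a, c) < squared_euclidean(a, b)
```

Up to floating-point error, `lhs` and `rhs` agree, and the two orderings coincide, so running K-means on normalized embeddings groups words consistently with cosine similarity.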

So instead of representing text documents as “bags of words,” we represent them as “bags of word clusters.” Given our pre-trained word embeddings, we run the K-means algorithm, which assigns each word to exactly one cluster index. The bag-of-word clusters representation of a text document is then constructed in the following steps:

1. Preprocess and tokenize the text, so that each text is represented as a list of words.

2. Given the $K$ pre-trained word clusters, replace each word in the list with its cluster index; replace any unknown word with $K+1$. The word list is thereby transformed into a numerical list with entries from 1 to $K+1$.

3. Calculate each cluster’s frequency in the list, and construct a vector of length $K+1$ whose $j$-th entry is the frequency of the cluster with index $j$. This new vector is our “bag of clusters” representation of the text.

With the above steps, we transform each text into a $(K+1)$-dimensional vector.

For example, we have a short text:

"I love eating apples, they are delicious"

and we have four pre-trained word clusters: $C_1$ = {“I,” “they,” “you”}, $C_2$ = {“apple,” “pear,” “banana”}, $C_3$ = {“is,” “are,” “was”}, and $C_4$ = {“delicious,” “good,” “tasty”}. First, we tokenize this text as a vector $v$ = [“I,” “love,” “eating,” “apples,” “they,” “are,” “delicious”]. Then we replace these words with the corresponding cluster indices, $v = [1, 5, 5, 2, 1, 3, 4]$ (assuming that preprocessing maps “apples” to “apple” in $C_2$), replacing unknown words with $K+1$, which is 5 here. Next we calculate each cluster’s relative frequency, $f_1 = \tfrac{2}{7}$, $f_2 = \tfrac{1}{7}$, $f_3 = \tfrac{1}{7}$, $f_4 = \tfrac{1}{7}$, $f_5 = \tfrac{2}{7}$, and represent this text with the new vector $v = [f_1, f_2, f_3, f_4, f_5] = [\tfrac{2}{7}, \tfrac{1}{7}, \tfrac{1}{7}, \tfrac{1}{7}, \tfrac{2}{7}]$.
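The three construction steps can be traced on this example in code. The sketch below makes two labeled assumptions: the `preprocess` function is a hypothetical stand-in for real preprocessing and only handles the single plural “apples,” and exact fractions are used so that the result can be compared with the values in the text.

```python
from collections import Counter
from fractions import Fraction

# The four pre-trained word clusters from the example (lowercased).
clusters = {
    1: {"i", "they", "you"},
    2: {"apple", "pear", "banana"},
    3: {"is", "are", "was"},
    4: {"delicious", "good", "tasty"},
}
K = len(clusters)
word_to_cluster = {word: idx for idx, words in clusters.items() for word in words}

def preprocess(text):
    """Step 1: lowercase and tokenize.

    Hypothetical stand-in for real preprocessing: it only maps the plural
    "apples" to "apple", which is all this example needs.
    """
    tokens = text.lower().replace(",", " ").split()
    return ["apple" if t == "apples" else t for t in tokens]

def bag_of_clusters(text):
    """Steps 2 and 3: map words to cluster indices, then take frequencies."""
    tokens = preprocess(text)
    # Unknown words receive the extra index K + 1.
    indices = [word_to_cluster.get(t, K + 1) for t in tokens]
    counts = Counter(indices)
    vector = [Fraction(counts[j], len(tokens)) for j in range(1, K + 2)]
    return indices, vector

indices, vector = bag_of_clusters("I love eating apples, they are delicious")
```

The words “love” and “eating” belong to no cluster, so both fall into the extra index $K+1 = 5$, which accounts for the last entry of the vector.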

Text digitization, that is, representing text as numerical vectors, is an essential part of every text mining application. Our “bag of word clusters” model can extract not only statistical information but also part of the semantic information from text.

Source: (ICSA Book Series in Statistics) Wenqing He, Liqun Wang, Jiahua Chen, Chunfang Devon Lin - Advances and Innovations in Statistics and Data Science, Springer (2022), pages 117-122