
(1)

The Vector Space Models (VSM)

(2)

Documents as Vectors

- Terms are axes of the space
- Documents are points or vectors in this space
- So we have a |V|-dimensional vector space

(3)

The Matrix

(4)

The Matrix

Doc 1: makan makan
Doc 2: makan nasi

Incidence Matrix (Binary TF)

Term     Doc 1   Doc 2

(5)

The Matrix: Binary

Doc 1: makan makan
Doc 2: makan nasi

Incidence Matrix (Binary TF)

Term     Doc 1   Doc 2
makan      1       1


(7)

The Matrix: Binary

Doc 1: makan makan
Doc 2: makan nasi

Incidence Matrix (Binary TF)

(8)

The Matrix: Binary -> Count

Doc 1: makan makan
Doc 2: makan nasi

        Incidence Matrix (Binary TF)   Inverted Index (Raw TF)
Term        Doc 1   Doc 2                  Doc 1   Doc 2
makan         1       1

(9)

The Matrix: Binary -> Count

Doc 1: makan makan
Doc 2: makan nasi

        Incidence Matrix (Binary TF)   Inverted Index (Raw TF)
Term        Doc 1   Doc 2                  Doc 1   Doc 2
makan         1       1                      2       1
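
The binary and raw-TF columns above can be computed directly from the two toy documents. A minimal sketch in Python (the variable names are illustrative, not from the slides):

```python
from collections import Counter

# The running example: two tiny documents.
docs = {
    "Doc 1": "makan makan".split(),
    "Doc 2": "makan nasi".split(),
}

vocab = sorted({term for tokens in docs.values() for term in tokens})
counts = {name: Counter(tokens) for name, tokens in docs.items()}

for term in vocab:
    raw_tf = [counts[name][term] for name in docs]   # raw term frequency
    binary = [1 if tf > 0 else 0 for tf in raw_tf]   # binary incidence
    print(term, binary, raw_tf)
# makan [1, 1] [2, 1]
# nasi [0, 1] [0, 1]
```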

(10)

The Matrix: Binary -> Count

Doc 1: makan makan
Doc 2: makan nasi

(11)

The Matrix: Binary -> Count -> Weight

Doc 1: makan makan
Doc 2: makan nasi

         Binary TF       Raw TF      Logarithmic TF
Term    Doc 1  Doc 2   Doc 1  Doc 2   Doc 1  Doc 2
makan     1      1       2      1

(12)

The Matrix: Binary -> Count -> Weight

Doc 1: makan makan
Doc 2: makan nasi

         Binary TF       Raw TF      Logarithmic TF
Term    Doc 1  Doc 2   Doc 1  Doc 2   Doc 1  Doc 2
makan     1      1       2      1      1.3     1
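
The 1.3 for Doc 1 is consistent with the usual sublinear TF scaling, w = 1 + log10(tf) for tf > 0 and 0 otherwise (1 + log10(2) ≈ 1.3). A minimal sketch, assuming that formula:

```python
import math

def log_tf(tf: int) -> float:
    # Sublinear TF scaling: 1 + log10(tf) for tf > 0, else 0.
    return 1 + math.log10(tf) if tf > 0 else 0.0

print(round(log_tf(2), 2))  # 1.3  ("makan" in Doc 1)
print(log_tf(1))            # 1.0  ("makan" in Doc 2)
print(log_tf(0))            # 0.0  (absent term)
```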

(13)

The Matrix: Binary -> Count -> Weight

Doc 1: makan makan
Doc 2: makan nasi

Inverted Index (Logarithmic TF)

(14)

The Matrix: Binary -> Count -> Weight

Doc 1: makan makan
Doc 2: makan nasi

         Binary TF       Raw TF      Logarithmic TF      TF-IDF
Term    Doc 1  Doc 2   Doc 1  Doc 2   Doc 1  Doc 2    Doc 1  Doc 2
makan     1      1       2      1      1.3     1

(15)

The Matrix: Binary -> Count -> Weight

Doc 1: makan makan
Doc 2: makan nasi

          Raw TF      Logarithmic TF          TF-IDF
Term    Doc 1  Doc 2   Doc 1  Doc 2    IDF   Doc 1  Doc 2
makan     2      1      1.3     1       0

(16)

The Matrix: Binary -> Count -> Weight

Doc 1: makan makan
Doc 2: makan nasi

          Raw TF      Logarithmic TF          TF-IDF
Term    Doc 1  Doc 2   Doc 1  Doc 2    IDF   Doc 1  Doc 2
makan     2      1      1.3     1       0      0      0

(17)

The Matrix: Binary -> Count -> Weight

Doc 1: makan makan
Doc 2: makan nasi

(18)

The Matrix: Binary -> Count -> Weight

Doc 1: makan makan jagung
Doc 2: makan nasi

          Raw TF      Logarithmic TF          TF-IDF
Term    Doc 1  Doc 2   Doc 1  Doc 2    IDF   Doc 1  Doc 2
makan     2      1      1.3     1
nasi      0      1       0      1

(19)

The Matrix: Binary -> Count -> Weight

Doc 1: makan makan jagung
Doc 2: makan nasi

          Raw TF      Logarithmic TF          TF-IDF
Term    Doc 1  Doc 2   Doc 1  Doc 2    IDF   Doc 1  Doc 2
makan     2      1      1.3     1       0      0      0
nasi      0      1       0      1      0.3     0     0.3
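
The IDF column matches idf(t) = log10(N / df(t)) with N = 2 documents: "makan" occurs in both (log10(2/2) = 0) while "nasi" occurs in one (log10(2/1) ≈ 0.3), and the TF-IDF weight is the logarithmic TF times the IDF. A minimal sketch under that assumption:

```python
import math

N = 2  # documents in the toy collection

def idf(df: int) -> float:
    # Inverse document frequency: log10(N / df).
    return math.log10(N / df)

def tf_idf(tf: int, df: int) -> float:
    # Logarithmic TF times IDF, matching the table above.
    wtf = 1 + math.log10(tf) if tf > 0 else 0.0
    return wtf * idf(df)

print(round(tf_idf(tf=2, df=2), 1))  # makan in Doc 1 -> 0.0
print(round(tf_idf(tf=1, df=1), 1))  # nasi in Doc 2  -> 0.3
```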

(20)

Documents as Vectors

- Terms are axes of the space
- Documents are points or vectors in this space
- So we have a |V|-dimensional vector space
- The weight can be anything: Binary, TF, TF-IDF

(21)

Documents as Vectors

- Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine
- These are very sparse vectors - most entries are zero

(24)

VECTOR SPACE MODEL

Key idea 1: Represent documents as vectors in the space
Key idea 2: Do the same for queries: represent them as vectors in the space
Key idea 3: Rank documents according to their proximity to the query in this space

(26)

Proximity

Proximity = similarity of vectors
Proximity ≈ inverse of distance

The documents with the greatest proximity to the query will be ranked highest.

(29)

Proximity

First cut: distance between two points (= distance between the end points of the two vectors)

Euclidean distance?

Euclidean distance is a bad idea . . . because Euclidean distance is large for vectors of different lengths.

(30)

Distance Example

Doc 1: gossip
Doc 2: jealous

(31)

Distance Example

Query: gossip jealous

             Logarithmic TF                        TF-IDF
Term    Doc 1  Doc 2  Doc 3  Query    IDF   Doc 1  Doc 2  Doc 3  Query
gossip    1      0     2.95    1      0.17   0.17    0     0.50   0.17

(32)

Distance Example

Query: gossip jealous

Inverted Index (TF-IDF)

Term     Doc 1  Doc 2  Doc 3  Query
gossip    0.17    0     0.50   0.17
jealous     0    0.17   0.48   0.17
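
A quick check of plain Euclidean distance on these TF-IDF vectors (a minimal sketch; the vectors are read off the table above) shows why it misleads: Doc 3 is the only document containing both query terms, yet it comes out farthest from the query.

```python
import math

# TF-IDF vectors on the (gossip, jealous) axes, from the table above.
q = (0.17, 0.17)   # query: gossip jealous
docs = {"Doc 1": (0.17, 0.00), "Doc 2": (0.00, 0.17), "Doc 3": (0.50, 0.48)}

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

for name, d in docs.items():
    print(name, round(euclidean(q, d), 2))
# Doc 1 0.17
# Doc 2 0.17
# Doc 3 0.45  <- "farthest", despite matching both query terms
```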

(33)

Why Distance is a Bad Idea

The Euclidean distance between the query and Doc 3 is large, even though the distribution of terms in the query and in Doc 3 is very similar.

(34)

So, instead of Distance?

- Thought experiment: take a document d and append it to itself; call this document d′
- "Semantically" d and d′ have the same content
- The Euclidean distance between the two documents can be quite large

(35)

So, instead of Distance?

- The angle between the two documents is 0, corresponding to maximal similarity

(36)

Use angle instead of distance

(37)

From angles to cosines

- The following two notions are equivalent:
  - Rank documents in increasing order of the angle between query and document
  - Rank documents in decreasing order of cosine(query, document)
- Cosine is a monotonically decreasing function of the angle for the interval [0°, 180°]

(40)

a · b = |a| × |b| × cos(θ)

Where:
|a| is the magnitude (length) of vector a
|b| is the magnitude (length) of vector b
θ is the angle between a and b

(41)

cos(q, d) = (q · d) / (|q| × |d|) = Σ (i = 1 to |V|) qi × di / (√(Σ qi²) × √(Σ di²))

qi is the tf-idf weight (or whatever) of term i in the query
di is the tf-idf weight (or whatever) of term i in the document
cos(q, d) is the cosine similarity of q and d, or, equivalently, the cosine of the angle between q and d.
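
Applying this to the gossip/jealous example (a minimal sketch reusing the TF-IDF vectors above), the ranking flips: Doc 3 is now the best match.

```python
import math

def cosine(q, d):
    # cos(q, d) = (q . d) / (|q| * |d|)
    dot = sum(qi * di for qi, di in zip(q, d))
    return dot / (math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(x * x for x in d)))

q = (0.17, 0.17)
for name, d in [("Doc 1", (0.17, 0.0)), ("Doc 2", (0.0, 0.17)), ("Doc 3", (0.50, 0.48))]:
    print(name, round(cosine(q, d), 3))
# Doc 1 0.707
# Doc 2 0.707
# Doc 3 1.0   <- highest cosine: ranked first
```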

(42)

Length normalization

- A vector can be (length-) normalized by dividing each of its components by its length
- Dividing a vector by its length makes it a unit (length) vector (on the surface of the unit hypersphere)
- Unit vector = a vector whose length is equal to 1

(43)

Remember this Case

Vectors d and d′ on the Gossip/Jealous axes (d has weight 0.4 on Gossip, 0 on Jealous).

(44)

Length normalization

(45)

Remember this Case

After normalization, d and d′ coincide on the Gossip/Jealous axes: Gossip = 1, Jealous = 0.

(46)

Length normalization

- Effect on the two documents d and d′ (d appended to itself) from the earlier slide: they have identical vectors after length-normalization, as the sketch below shows
- Long and short documents now have comparable weights
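
A minimal sketch of the thought experiment, assuming d has weight 0.4 on the gossip axis as in the earlier figure (so d′, d appended to itself, has 0.8):

```python
import math

def normalize(v):
    # Divide each component by the vector's Euclidean length -> unit vector.
    length = math.sqrt(sum(x * x for x in v))
    return tuple(x / length for x in v)

d       = (0.4, 0.0)   # d on the (gossip, jealous) axes
d_prime = (0.8, 0.0)   # d' = d appended to itself: every weight doubles

print(normalize(d))        # (1.0, 0.0)
print(normalize(d_prime))  # (1.0, 0.0) -- identical after normalization
```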

(48)

After Normalization:

cos(q, d) = q · d = Σ (i = 1 to |V|) qi × di

for q, d length-normalized.

(50)

Cosine similarity illustrated

(52)

TF-IDF Example

Query: best car insurance
Document: car insurance auto insurance

Query
Term        tf-raw  tf-wt    df     idf   tf.idf  n'lize
auto           0      0      5000   2.3     0
best           1      1     50000   1.3    1.3
car            1      1     10000   2.0    2.0
insurance      1      1      1000   3.0    3.0

(53)

TF-IDF Example

Query: best car insurance
Document: car insurance auto insurance

Query
Term        tf-raw  tf-wt    df     idf   tf.idf  n'lize
auto           0      0      5000   2.3     0       0
best           1      1     50000   1.3    1.3     0.34
car            1      1     10000   2.0    2.0     0.52
insurance      1      1      1000   3.0    3.0     0.78

(54)

TF-IDF Example

Query: best car insurance
Document: car insurance auto insurance

Document
Term        tf-raw  tf-wt   idf   tf.idf  n'lize
auto           1      1     2.3    2.3
best           0      0     1.3     0
car            1      1     2.0    2.0
insurance      2     1.3    3.0    3.9

(55)

TF-IDF Example

Query: best car insurance
Document: car insurance auto insurance

Document
Term        tf-raw  tf-wt   idf   tf.idf  n'lize
auto           1      1     2.3    2.3     0.46
best           0      0     1.3     0       0
car            1      1     2.0    2.0     0.40
insurance      2     1.3    3.0    3.9     0.79


(57)

TF-IDF Example

Query: best car insurance
Document: car insurance auto insurance

               Query            Document
Term        tf.idf  n'lize   tf.idf  n'lize   Dot Product
auto           0      0       2.3    0.46         0
best          1.3    0.34      0       0          0
car           2.0    0.52     2.0    0.40        0.21
insurance     3.0    0.78     3.9    0.79        0.62

Doc length = √(2.3² + 0² + 2.0² + 3.9²) = √24.5 ≈ 4.95

Score = 0 + 0 + 0.21 + 0.62 = 0.83

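An end-to-end sketch that reproduces this score with the weighting used above (log TF, the given idf values, cosine of length-normalized tf-idf vectors):

```python
import math

terms = ["auto", "best", "car", "insurance"]
idf   = {"auto": 2.3, "best": 1.3, "car": 2.0, "insurance": 3.0}
q_tf  = {"auto": 0, "best": 1, "car": 1, "insurance": 1}   # best car insurance
d_tf  = {"auto": 1, "best": 0, "car": 1, "insurance": 2}   # car insurance auto insurance

def log_tf(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

def normalize(v):
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

q = normalize([log_tf(q_tf[t]) * idf[t] for t in terms])
d = normalize([log_tf(d_tf[t]) * idf[t] for t in terms])

score = sum(qi * di for qi, di in zip(q, d))
print(round(score, 2))  # 0.83
```
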
(58)

Summary – vector space ranking

- Represent the query as a weighted tf-idf vector
- Represent each document as a weighted tf-idf vector
- Compute the cosine similarity score for the query vector and each document vector
- Rank documents with respect to the query by score

(59)

Cosine similarity amongst 3 documents

How similar are the novels?
SaS: Sense and Sensibility
PaP: Pride and Prejudice
WH: Wuthering Heights

Term frequencies (counts)

term        SaS   PaP   WH
affection   115    58   20
jealous      10     7   11
gossip        2     0    6
wuthering     0     0   38

(60)

3 documents example contd.

Log frequency weighting

term        SaS    PaP    WH
affection   3.06   2.76   2.30
jealous     2.00   1.85   2.04
gossip      1.30    0     1.78
wuthering    0      0     2.58

After length normalization

term        SaS     PaP     WH
affection   0.789   0.832   0.524
jealous     0.515   0.555   0.465
gossip      0.335    0      0.405
wuthering    0       0      0.588

cos(SaS, PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94
cos(SaS, WH) ≈ 0.79
cos(PaP, WH) ≈ 0.69

Why do we have cos(SaS, PaP) > cos(SaS, WH)?
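
A minimal sketch that reproduces all three similarities from the raw counts (log TF, length-normalize, then dot products):

```python
import math

counts = {
    "SaS": {"affection": 115, "jealous": 10, "gossip": 2, "wuthering": 0},
    "PaP": {"affection": 58,  "jealous": 7,  "gossip": 0, "wuthering": 0},
    "WH":  {"affection": 20,  "jealous": 11, "gossip": 6, "wuthering": 38},
}
terms = ["affection", "jealous", "gossip", "wuthering"]

def log_tf(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

def normalize(v):
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

vecs = {name: normalize([log_tf(c[t]) for t in terms]) for name, c in counts.items()}

for a, b in [("SaS", "PaP"), ("SaS", "WH"), ("PaP", "WH")]:
    print(a, b, round(sum(x * y for x, y in zip(vecs[a], vecs[b])), 2))
# SaS PaP 0.94  -- SaS and PaP share the affection/jealous profile,
# SaS WH  0.79  -- while WH's weight is spread over gossip/wuthering too.
# PaP WH  0.69
```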
