The Vector Space Model (VSM)
|
Documents as Vectors
- Terms are axes of the space
- Documents are points or vectors in this space
- So we have a |V|-dimensional vector space
|
The Matrix : Binary
Doc 1 : makan makan   Doc 2 : makan nasi

Incidence Matrix (Binary TF)
Term     Doc 1  Doc 2
Makan      1      1
|
The Matrix : Binary -> Count
Doc 1 : makan makan   Doc 2 : makan nasi

         Incidence Matrix   Inverted Index
           (Binary TF)         (Raw TF)
Term      Doc 1  Doc 2      Doc 1  Doc 2
Makan       1      1          2      1
|
The Matrix : Binary -> Count -> Weight
Doc 1 : makan makan   Doc 2 : makan nasi

          Binary TF       Raw TF      Logarithmic TF
Term     Doc 1 Doc 2   Doc 1 Doc 2    Doc 1  Doc 2
Makan      1     1       2     1       1.3     1
|
The Matrix : Binary -> Count -> Weight
Doc 1 : makan makan   Doc 2 : makan nasi

           Raw TF      Logarithmic TF           TF-IDF
Term     Doc 1 Doc 2    Doc 1  Doc 2    IDF   Doc 1  Doc 2
Makan      2     1       1.3     1       0      0      0
|
The Matrix : Binary -> Count -> Weight
Doc 1 : makan makan jagung   Doc 2 : makan nasi

           Raw TF      Logarithmic TF           TF-IDF
Term     Doc 1 Doc 2    Doc 1  Doc 2    IDF   Doc 1  Doc 2
Makan      2     1       1.3     1       0      0      0
Nasi       0     1        0      1      0.3     0     0.3
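The weighting progression in this table (raw TF -> logarithmic TF -> TF-IDF, base-10 logs) can be sketched in Python for the two toy documents. The function names (`raw_tf`, `log_tf`, `idf`, `tf_idf`) are illustrative, not from any library.

```python
import math

# Toy documents from the slides
doc1 = "makan makan jagung".split()
doc2 = "makan nasi".split()
docs = [doc1, doc2]
N = len(docs)  # number of documents in the collection

def raw_tf(term, doc):
    """Raw term frequency: how often the term occurs in the document."""
    return doc.count(term)

def log_tf(term, doc):
    """Logarithmic tf weight: 1 + log10(tf), or 0 if the term is absent."""
    tf = raw_tf(term, doc)
    return 1 + math.log10(tf) if tf > 0 else 0.0

def idf(term):
    """Inverse document frequency: log10(N / df)."""
    df = sum(1 for d in docs if term in d)
    return math.log10(N / df)

def tf_idf(term, doc):
    return log_tf(term, doc) * idf(term)
```

A term like "makan" that occurs in every document gets IDF 0, so its TF-IDF weight vanishes no matter how frequent it is.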
|
Documents as Vectors
- Terms are axes of the space
- Documents are points or vectors in this space
- So we have a |V|-dimensional vector space
- The weight can be anything : Binary, TF, TF-IDF
|
Documents as Vectors
- Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine
- These are very sparse vectors - most entries are zero
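Sparsity is why real systems never store a dense |V|-dimensional array per document. A minimal sketch of a dict-based sparse term-frequency vector (illustrative names, not a library API):

```python
# A sparse document vector keeps only the nonzero coordinates as a dict;
# every other vocabulary dimension is implicitly zero.
def sparse_tf_vector(tokens):
    vec = {}
    for t in tokens:
        vec[t] = vec.get(t, 0) + 1
    return vec

vec = sparse_tf_vector("makan makan nasi".split())
# Terms absent from the document simply have no key.
```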
|
VECTOR SPACE MODEL
• Key idea 1: Represent documents as vectors in the space
• Key idea 2: Do the same for queries: represent them as vectors in the space
• Key idea 3: Rank documents according to their proximity to the query in this space
|
Proximity
• Proximity = similarity of vectors
• Proximity ≈ inverse of distance
• The document with the greatest proximity to the query is ranked highest
|
Proximity
• First cut: distance between two points
  ( = distance between the end points of the two vectors)
• Euclidean distance?
• Euclidean distance is a bad idea . . .
• . . . because Euclidean distance is large for vectors of different lengths
|
Distance Example
Doc 1 : gossip Doc 2 : jealous
|
Distance Example
Query : gossip jealous

               Logarithmic TF                        TF-IDF
Term      Doc 1  Doc 2  Doc 3  Query   IDF   Doc 1  Doc 2  Doc 3  Query
Gossip      1      0    2.95     1     0.17   0.17    0    0.50   0.17
|
Distance Example
Query : gossip jealous

            TF-IDF
Term      Doc 1  Doc 2  Doc 3  Query
Gossip     0.17    0    0.50   0.17
Jealous     0     0.17  0.48   0.17
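With the TF-IDF values from this table, a quick check (hypothetical helper `euclidean`) shows why raw distance misleads: Doc 3 covers the same topics as the query, only with larger weights, yet Euclidean distance puts it farther from the query than Doc 1.

```python
import math

# TF-IDF vectors over the axes (gossip, jealous), taken from the table
doc1 = (0.17, 0.00)
doc3 = (0.50, 0.48)
query = (0.17, 0.17)

def euclidean(a, b):
    """Straight-line distance between the end points of two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Doc 3 points in nearly the same direction as the query,
# yet euclidean(doc3, query) > euclidean(doc1, query).
```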
|
Why Distance is a Bad Idea?
The Euclidean distance between a query and a document can be large even though the distributions of terms in the query and in the document are very similar.
|
So, instead of Distance?
- Thought experiment: take a document d and append it to itself. Call this document d′.
- "Semantically" d and d′ have the same content
- The Euclidean distance between the two documents can be quite large
|
So, instead of Distance?
- The angle between the two documents is 0, corresponding to maximal similarity.
|
Use angle instead of distance
|
From angles to cosines
- The following two notions are equivalent:
  - Rank documents in decreasing order of the angle between query and document
  - Rank documents in increasing order of cosine(query, document)
- Cosine is a monotonically decreasing function of the angle for the interval [0°, 180°]

a · b = |a| × |b| × cos(θ)

cos(q, d) = (q · d) / (|q| × |d|) = (Σ_{i=1..|V|} q_i d_i) / (√(Σ_i q_i²) × √(Σ_i d_i²))

Where:
|a| is the magnitude (length) of vector a
|b| is the magnitude (length) of vector b
θ is the angle between a and b
q_i is the tf-idf weight (or whatever) of term i in the query
d_i is the tf-idf weight (or whatever) of term i in the document
cos(q, d) is the cosine similarity of q and d ... or, equivalently, the cosine of the angle between q and d.
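The formula translates directly into code. A minimal sketch, assuming plain Python lists as vectors:

```python
import math

def cosine(q, d):
    """cos(q, d) = (q . d) / (|q| |d|)."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(x * x for x in q))
    norm_d = math.sqrt(sum(x * x for x in d))
    return dot / (norm_q * norm_d)

# A document with all weights doubled (d appended to itself) keeps
# cosine 1 with the original, even though the points are far apart.
```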
|
Length normalization
- A vector can be (length-) normalized by dividing each of its components by its length
- Dividing a vector by its length makes it a unit (length) vector (on the surface of the unit hypersphere)
- Unit Vector = A vector whose length is 1
|
Remember this Case

[Figure: document vectors d and d′ plotted on the Gossip/Jealous axes]
|
Length normalization
|
Remember this Case

[Figure: after length normalization, d and d′ coincide at the same point on the Gossip/Jealous axes]
|
Length normalization
- Effect on the two documents d and d′ (d appended to itself) from the earlier slide: they have identical vectors after length-normalization.
- Long and short documents now have comparable weights

After Normalization:
cos(q, d) = q · d = Σ_{i=1..|V|} q_i d_i    for q, d length-normalized.
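A short sketch of normalization, assuming plain Python lists; `d_prime` models d appended to itself, which doubles every weight:

```python
import math

def normalize(v):
    """Divide each component by the vector's L2 length -> unit vector."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

d = [0.4, 0.2, 0.1]            # hypothetical tf weights for document d
d_prime = [2 * x for x in d]   # d appended to itself

# After normalization the two vectors are identical, and the cosine of
# two unit vectors is just their dot product.
```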
|
Cosine similarity illustrated
|
TF-IDF Example
Document: car insurance auto insurance   Query: best car insurance

Query vector:
Term        tf-raw  tf-wt    df     idf   tf.idf  n'lize
auto          0       0      5000   2.3     0       0
best          1       1     50000   1.3    1.3     0.34
car           1       1     10000   2.0    2.0     0.52
insurance     1       1      1000   3.0    3.0     0.78
|
TF-IDF Example
Document: car insurance auto insurance   Query: best car insurance

Document vector:
Term        tf-raw  tf-wt   idf   tf.idf  n'lize
auto          1       1     2.3    2.3    0.46
best          0       0     1.3     0      0
car           1       1     2.0    2.0    0.40
insurance     2      1.3    3.0    3.9    0.79
|
TF-IDF Example
Document: car insurance auto insurance   Query: best car insurance

Term          Query             Document        Dot Product
            tf.idf  n'lize    tf.idf  n'lize
auto           0      0        2.3    0.46         0
best          1.3    0.34       0      0           0
car           2.0    0.52      2.0    0.40        0.21
insurance     3.0    0.78      3.9    0.79        0.62

Score = 0 + 0 + 0.21 + 0.62 = 0.83
Doc length = √(2.3² + 0² + 2.0² + 3.9²) = √24.5 ≈ 4.95
Sec. 6.4
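As a cross-check, the cosine score can be recomputed from the tf.idf columns alone: normalize both vectors to unit length and take their dot product. A sketch; the term order auto, best, car, insurance is assumed.

```python
import math

# tf.idf weights read off the tables (terms: auto, best, car, insurance)
query_tfidf = [0.0, 1.3, 2.0, 3.0]   # query: best car insurance
doc_tfidf = [2.3, 0.0, 2.0, 3.9]     # doc: car insurance auto insurance

def normalize(v):
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

# Cosine score = dot product of the two length-normalized vectors
score = sum(q * d for q, d in zip(normalize(query_tfidf), normalize(doc_tfidf)))
# score ≈ 0.83
```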
|
Summary – vector space ranking
- Represent the query as a weighted tf-idf vector
- Represent each document as a weighted tf-idf vector
- Compute the cosine similarity score for the query vector and each document vector
- Rank documents with respect to the query by score
|
Cosine similarity amongst 3 documents
How similar are the novels SaS (Sense and Sensibility), PaP (Pride and Prejudice), and WH (Wuthering Heights)?

Term frequencies (counts)
term        SaS   PaP   WH
affection   115    58   20
jealous      10     7   11
gossip        2     0    6
wuthering     0     0   38
Sec. 6.3
3 documents example contd.

Log frequency weighting
term        SaS    PaP    WH
affection   3.06   2.76   2.30
jealous     2.00   1.85   2.04
gossip      1.30    0     1.78
wuthering    0      0     2.58

After length normalization
term        SaS     PaP     WH
affection   0.789   0.832   0.524
jealous     0.515   0.555   0.465
gossip      0.335    0      0.405
wuthering    0       0      0.588

cos(SaS,PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94
cos(SaS,WH) ≈ 0.79
cos(PaP,WH) ≈ 0.69
Why do we have cos(SaS,PaP) > cos(SaS,WH)?
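The whole pipeline for the three novels (log-frequency weighting, length normalization, dot product) fits in a few lines; the counts are copied from the table above, the function names are illustrative.

```python
import math

counts = {  # raw term frequencies from the slide
    "SaS": {"affection": 115, "jealous": 10, "gossip": 2, "wuthering": 0},
    "PaP": {"affection": 58, "jealous": 7, "gossip": 0, "wuthering": 0},
    "WH":  {"affection": 20, "jealous": 11, "gossip": 6, "wuthering": 38},
}

def log_weight(tf):
    """1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def normalized(doc):
    """Apply log weighting, then divide by the vector's L2 length."""
    w = {t: log_weight(tf) for t, tf in doc.items()}
    length = math.sqrt(sum(x * x for x in w.values()))
    return {t: x / length for t, x in w.items()}

def cosine(name_a, name_b):
    a, b = normalized(counts[name_a]), normalized(counts[name_b])
    return sum(a[t] * b[t] for t in a)
```

Running this reproduces the slide's values: SaS and PaP share a heavy "affection"/"jealous" profile, while WH spends much of its weight on "wuthering", which the Austen novels lack entirely.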