The Vector Space Models v2
(1) The Vector Space Models (VSM)

(2) Documents as Vectors

- Terms are axes of the space
- Documents are points or vectors in this space
- So we have a |V|-dimensional vector space

(3) The Matrix: Binary

Doc 1: makan makan
Doc 2: makan nasi

Incidence Matrix (Binary TF):

Term     Doc 1   Doc 2
Makan      1       1
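To make the construction concrete, here is a minimal Python sketch (not from the slides) that builds this binary incidence matrix from the example documents:

```python
# Build the binary incidence matrix for the slide's two example documents.
docs = {
    "Doc 1": "makan makan".split(),
    "Doc 2": "makan nasi".split(),
}

# The vocabulary: one axis of the vector space per distinct term.
vocab = sorted({term for words in docs.values() for term in words})

# Entry is 1 if the term occurs in the document, 0 otherwise.
incidence = {
    term: {name: int(term in words) for name, words in docs.items()}
    for term in vocab
}

for term in vocab:
    print(term, incidence[term])
# makan {'Doc 1': 1, 'Doc 2': 1}
# nasi {'Doc 1': 0, 'Doc 2': 1}
```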

(8) The Matrix: Binary -> Count

Doc 1: makan makan
Doc 2: makan nasi

         Incidence Matrix   Inverted Index
         (Binary TF)        (Raw TF)
Term     Doc 1   Doc 2      Doc 1   Doc 2
Makan      1       1          2       1
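A sketch of the counting step, again on the slide's example; collections.Counter yields the raw term frequencies directly:

```python
from collections import Counter

docs = {
    "Doc 1": "makan makan".split(),
    "Doc 2": "makan nasi".split(),
}

# Raw TF: number of occurrences of each term in each document.
raw_tf = {name: Counter(words) for name, words in docs.items()}

print(raw_tf["Doc 1"]["makan"])  # 2
print(raw_tf["Doc 2"]["makan"])  # 1
print(raw_tf["Doc 1"]["nasi"])   # 0 (Counter returns 0 for absent keys)
```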

(11) The Matrix: Binary -> Count -> Weight

Doc 1: makan makan
Doc 2: makan nasi

         Incidence Matrix   Inverted Index   Inverted Index
         (Binary TF)        (Raw TF)         (Logarithmic TF)
Term     Doc 1   Doc 2      Doc 1   Doc 2    Doc 1   Doc 2
Makan      1       1          2       1       1.3      1
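The slides do not spell out the weighting formula, but the 1.3 entry is consistent with the standard log-frequency weight w = 1 + log10(tf) for tf > 0 (and 0 otherwise), since 1 + log10(2) ≈ 1.30. A sketch under that assumption:

```python
import math

def log_tf(tf):
    # Log-frequency weight: 1 + log10(tf) for tf > 0, else 0.
    return 1 + math.log10(tf) if tf > 0 else 0.0

print(round(log_tf(2), 2))  # 1.3 -> "makan" in Doc 1
print(round(log_tf(1), 2))  # 1.0 -> "makan" in Doc 2
print(log_tf(0))            # 0.0
```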

(14) The Matrix: Binary -> Count -> Weight

Doc 1: makan makan
Doc 2: makan nasi

         Inverted Index   Inverted Index            Inverted Index
         (Raw TF)         (Logarithmic TF)          (TF-IDF)
Term     Doc 1   Doc 2    Doc 1   Doc 2     IDF     Doc 1   Doc 2
Makan      2       1       1.3      1        0        0       0

(18) The Matrix: Binary -> Count -> Weight

Doc 1: makan makan jagung
Doc 2: makan nasi

         Inverted Index   Inverted Index            Inverted Index
         (Raw TF)         (Logarithmic TF)          (TF-IDF)
Term     Doc 1   Doc 2    Doc 1   Doc 2     IDF     Doc 1   Doc 2
Makan      2       1       1.3      1        0        0       0
Nasi       0       1        0       1       0.3       0      0.3
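The IDF column likewise matches the standard idf = log10(N/df) with N = 2 documents: "makan" occurs in both (df = 2, idf = 0), while "nasi" occurs in one (df = 1, idf = log10(2) ≈ 0.3). A sketch under that assumption:

```python
import math

N = 2                            # total number of documents
df = {"makan": 2, "nasi": 1}     # document frequencies from the example

def idf(term):
    # idf = log10(N / df): terms that occur in every document get weight 0.
    return math.log10(N / df[term])

print(round(idf("makan"), 1))       # 0.0
print(round(idf("nasi"), 1))        # 0.3
print(round(1.0 * idf("nasi"), 1))  # 0.3 -> tf-idf of "nasi" in Doc 2 (log-tf 1.0)
```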

(20) Documents as Vectors

- Terms are axes of the space
- Documents are points or vectors in this space
- So we have a |V|-dimensional vector space
- The weight can be anything: binary, raw TF, logarithmic TF, TF-IDF, etc.

(21) Documents as Vectors

- Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine
- These are very sparse vectors: most entries are zero
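Because of this sparsity, practical systems store only the non-zero components. A minimal dict-based sketch (an implementation choice of mine, not something the slides prescribe):

```python
# Sparse vector: store only the non-zero (term, weight) pairs.
doc_vector = {"makan": 1.3, "jagung": 1.0}   # illustrative weights for Doc 1

def weight(vec, term):
    # Any term not stored implicitly has weight 0.
    return vec.get(term, 0.0)

print(weight(doc_vector, "makan"))  # 1.3
print(weight(doc_vector, "nasi"))   # 0.0
```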

(24) VECTOR SPACE MODEL

Key idea 1: Represent documents as vectors in the space
Key idea 2: Do the same for queries: represent them as vectors in the space
Key idea 3: Rank documents by the proximity of their vectors to the query vector

(26) Proximity

Proximity = similarity of vectors
Proximity ≈ inverse of distance

The document whose proximity to the query is greatest will be ranked highest.

(29) Proximity

First cut: distance between two points (i.e., between the end points of the two vectors)

Euclidean distance?

Euclidean distance is a bad idea... because Euclidean distance is large for vectors of different lengths.

(30) Distance Example

Doc 1: gossip
Doc 2: jealous

(31) Distance Example

Query: gossip jealous

          Logarithmic TF                          TF-IDF
Term      Doc 1  Doc 2  Doc 3  Query    IDF      Doc 1  Doc 2  Doc 3  Query
Gossip      1      0     2.95    1      0.17      0.17    0     0.50   0.17

(32) Distance Example

Query: gossip jealous

TF-IDF:

Term      Doc 1   Doc 2   Doc 3   Query
Gossip     0.17     0      0.50    0.17
Jealous     0      0.17    0.48    0.17

(33) Why Distance is a Bad Idea

The Euclidean distance between the query and Doc 3 is large, even though the distribution of terms in them is very similar.

(34) So, instead of Distance?

- Thought experiment: take a document d and append it to itself. Call this document d′.
- "Semantically" d and d′ have the same content
- The Euclidean distance between the two documents can be quite large (see the sketch below)
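A quick numeric check of the thought experiment (the vector values here are hypothetical): doubling d leaves the angle at 0 while the Euclidean distance grows.

```python
import math

d = [0.4, 0.0]                 # hypothetical weights on the (gossip, jealous) axes
d_prime = [2 * x for x in d]   # d appended to itself: every count doubles

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

print(round(math.dist(d, d_prime), 2))  # 0.4 -> as large as |d| itself
print(round(cosine(d, d_prime), 2))     # 1.0 -> angle 0: maximal similarity
```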

(35) So, instead of Distance?

- The angle between the two documents d and d′ is 0, corresponding to maximal similarity.

(36) Use angle instead of distance

(37) From angles to cosines

- The following two notions are equivalent:
- Rank documents in increasing order of the angle between query and document
- Rank documents in decreasing order of cosine(query, document)
- Cosine is a monotonically decreasing function of the angle for the interval [0°, 180°]

(40)

a · b = |a| × |b| × cos(θ)

Where:

|a| is the magnitude (length) of vector a

|b| is the magnitude (length) of vector b

θ is the angle between a and b

(41)

- q_i is the tf-idf weight (or whatever) of term i in the query
- d_i is the tf-idf weight (or whatever) of term i in the document
- cos(q, d) is the cosine similarity of q and d... or, equivalently, the cosine of the angle between q and d.
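Written out, the cosine similarity these definitions refer to is the standard one:

cos(q, d) = (q · d) / (|q| × |d|) = Σ_{i=1}^{|V|} q_i d_i / ( √(Σ_{i=1}^{|V|} q_i²) × √(Σ_{i=1}^{|V|} d_i²) )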

(42) Length normalization

- A vector can be (length-) normalized by dividing each of its components by its length
- Dividing a vector by its length makes it a unit (length) vector (on the surface of the unit hypersphere)
- Unit vector = a vector whose length is 1

(43) Remember this Case

d and d′ plotted on the Gossip/Jealous axes: d = (0.4, 0), with d′ pointing in the same direction.

(44) Length normalization

(45) Remember this Case

After length normalization: (Gossip, Jealous) = (1, 0).

(46) Length normalization

- Effect on the two documents d and d′ (d appended to itself) from the earlier slide: they have identical vectors after length-normalization.
- Long and short documents now have comparable weights.
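A minimal sketch of the normalization step (the helper name is mine); note that d and d′ from the thought experiment map to the same unit vector:

```python
import math

def normalize(vec):
    # Divide each component by the vector's length (L2 norm).
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]

d = [0.4, 0.0]
d_prime = [0.8, 0.0]   # d appended to itself

print(normalize(d))        # [1.0, 0.0]
print(normalize(d_prime))  # [1.0, 0.0] -> identical unit vectors
```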

(48) After Normalization

For q, d length-normalized:

cos(q, d) = q · d = Σ_{i=1}^{|V|} q_i d_i

(50) Cosine similarity illustrated

(52) TF-IDF Example

Document: car insurance auto insurance
Query: best car insurance

Query vector:

Term        tf-raw   tf-wt   df      idf   tf.idf   n'lize
auto          0        0     5000    2.3     0        0
best          1        1     50000   1.3    1.3      0.34
car           1        1     10000   2.0    2.0      0.52
insurance     1        1     1000    3.0    3.0      0.78

(54) TF-IDF Example

Document: car insurance auto insurance
Query: best car insurance

Document vector (length = √(2.3² + 0² + 2.0² + 3.9²) = √24.5 ≈ 4.95):

Term        tf-raw   tf-wt   idf   tf.idf   n'lize
auto          1        1     2.3    2.3      0.46
best          0        0     1.3     0        0
car           1        1     2.0    2.0      0.40
insurance     2       1.3    3.0    3.9      0.79

(56) After Normalization

For q, d length-normalized:

cos(q, d) = q · d = Σ_{i=1}^{|V|} q_i d_i

(57) TF-IDF Example

Document: car insurance auto insurance
Query: best car insurance

Doc length = √(2.3² + 0² + 2.0² + 3.9²) = √24.5 ≈ 4.95

                 Query             Document
Term        tf.idf   n'lize   tf.idf   n'lize   Product
auto          0        0       2.3      0.46      0
best         1.3      0.34      0        0        0
car          2.0      0.52     2.0      0.40     0.21
insurance    3.0      0.78     3.9      0.79     0.62

Score = 0 + 0 + 0.21 + 0.62 = 0.83

Sec. 6.4
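A short sketch that recomputes this example from the tf.idf columns above (function names are mine):

```python
import math

query_wt = [0.0, 1.3, 2.0, 3.0]   # auto, best, car, insurance (query tf.idf)
doc_wt = [2.3, 0.0, 2.0, 3.9]     # auto, best, car, insurance (document tf.idf)

def normalize(vec):
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]

q, d = normalize(query_wt), normalize(doc_wt)
score = sum(qi * di for qi, di in zip(q, d))

print(round(math.sqrt(sum(x * x for x in doc_wt)), 2))  # 4.95 -> doc length
print([round(x, 2) for x in d])   # [0.46, 0.0, 0.4, 0.79]
print(round(score, 2))            # 0.83
```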



(58) Summary: vector space ranking

- Represent the query as a weighted tf-idf vector
- Represent each document as a weighted tf-idf vector
- Compute the cosine similarity score for the query vector and each document vector
- Rank documents with respect to the query by score
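Putting the four steps together, a compact, self-contained sketch of vector-space ranking; it assumes the log-TF × IDF weighting used earlier in the deck, and all names are mine:

```python
import math
from collections import Counter

def tf_idf_vector(words, df, n_docs, vocab):
    tf = Counter(words)
    vec = []
    for t in vocab:
        w = (1 + math.log10(tf[t])) if tf[t] else 0.0      # log TF
        w *= math.log10(n_docs / df[t]) if df[t] else 0.0  # IDF
        vec.append(w)
    return vec

def normalize(vec):
    length = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / length for x in vec]

def rank(query, docs):
    tokenized = [d.split() for d in docs]
    vocab = sorted({t for d in tokenized for t in d} | set(query.split()))
    df = {t: sum(t in d for d in tokenized) for t in vocab}
    n = len(docs)
    dvecs = [normalize(tf_idf_vector(d, df, n, vocab)) for d in tokenized]
    qvec = normalize(tf_idf_vector(query.split(), df, n, vocab))
    scores = [sum(q * x for q, x in zip(qvec, v)) for v in dvecs]
    # Highest cosine similarity first.
    return sorted(enumerate(scores), key=lambda p: -p[1])

docs = ["makan makan jagung", "makan nasi"]
print(rank("nasi", docs))  # [(1, 1.0), (0, 0.0)] -> Doc 2 ranks first
```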

(59) Cosine similarity amongst 3 documents

How similar are the novels SaS (Sense and Sensibility), PaP (Pride and Prejudice), and WH (Wuthering Heights)?

Term frequencies (counts):

term        SaS   PaP   WH
affection   115    58   20
jealous      10     7   11
gossip        2     0    6
wuthering     0     0   38

Sec. 6.3

(60) 3 documents example, contd.

Log frequency weighting:

term        SaS    PaP    WH
affection   3.06   2.76   2.30
jealous     2.00   1.85   2.04
gossip      1.30    0     1.78
wuthering    0      0     2.58

After length normalization:

term        SaS     PaP     WH
affection   0.789   0.832   0.524
jealous     0.515   0.555   0.465
gossip      0.335    0      0.405
wuthering    0       0      0.588

cos(SaS, PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94
cos(SaS, WH) ≈ 0.79
cos(PaP, WH) ≈ 0.69

Why do we have cos(SaS, PaP) > cos(SaS, WH)?
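A quick check of these three cosines from the normalized table (since the vectors are unit length, cosine is just the dot product):

```python
norm = {
    "SaS": [0.789, 0.515, 0.335, 0.0],
    "PaP": [0.832, 0.555, 0.0,   0.0],
    "WH":  [0.524, 0.465, 0.405, 0.588],
}

def cos(a, b):
    # Vectors are already length-normalized, so cosine = dot product.
    return sum(x * y for x, y in zip(norm[a], norm[b]))

print(round(cos("SaS", "PaP"), 2))  # 0.94
print(round(cos("SaS", "WH"), 2))   # 0.79
print(round(cos("PaP", "WH"), 2))   # 0.69
```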
