The proposed method has the following steps:
– ODP classification (or data preparation).
– Partition.
– Aggregation.
These steps are described in the following sections.
3.1 ODP Classification
When querying the ODP, the returned categories can be divided into depth levels.
Let l be a parameter of our system that identifies the maximum depth level of the ODP hierarchy with which our system works. For example, if we have the classification Sports:Soccer:UEFA:Spain:Clubs:Barcelona and l = 1, we work with the root category Sports; when l = 2 we work with Sports:Soccer; and so on.
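For illustration, truncating a category path to a given depth can be sketched as follows (the colon-separated string form is taken from the running example; the helper name is an assumption):

```python
# Hypothetical helper: ODP categories are written as colon-separated
# strings, as in the running example; the function name is an assumption.
def truncate_category(path, level):
    """Keep only the first `level` components of an ODP category path."""
    return ":".join(path.split(":")[:level])

path = "Sports:Soccer:UEFA:Spain:Clubs:Barcelona"
print(truncate_category(path, 1))  # Sports
print(truncate_category(path, 2))  # Sports:Soccer
```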
In the ODP classification step we know the set of users {u1, ..., un} and their sets of queries Q = {Q1, ..., Qn}, where Qi = {q1, ..., qm} is the set of queries of the user ui. Every query qj can have several terms ts, i.e. qj = {t1, ..., tr}.
For every term ts, we obtain its classification at level L ∈ {1, ..., l} using the ODP. Next, we create the classification matrix M_UxC_L, which contains the number of queries for each user and category at level L. Note that we obtain one matrix for every level L ∈ {1, ..., l}. Finally, we use the M_UxC_L matrices to compute the incidence matrix M_UxU, which contains the semantic similarity of the users. The process works as follows:
1. Obtain the classification matrices M_UxC_L using Algorithm 1.
2. Obtain the incidence matrix M_UxU using Algorithm 2, i.e. add up all the coincidences between each pair of users in the classification matrices.
Semantic Microaggregation for the Anonymization of Query Logs 131
Algorithm 1. Algorithm for computing the matrices M_UxC_L, where L ∈ {1, ..., l}
Require: the maximum depth l for the ODP categories
Require: the set of users U = {u1, ..., un}
Require: the set of queries Qi = {q1, ..., qm} of each user ui
Require: the set of terms {t1, ..., tr} of each query qj
Ensure: {M_UxC_1, ..., M_UxC_l}, i.e. for every level L, the matrix M_UxC_L with the number of queries for each category and user at depth L
for L = 1 to L = l do
  for ui ∈ {u1, ..., un} do
    for qj ∈ Qi = {q1, ..., qm} do
      for ts ∈ qj = {t1, ..., tr} do
        obtain the categories c at depth L for the term ts using the ODP;
        if c ∈ M_UxC_L then
          add one to the cell (ui, c) of M_UxC_L;
        else
          add the column c to M_UxC_L;
          initialize the cell (ui, c) of M_UxC_L to one;
        end if
      end for
    end for
  end for
end for
return {M_UxC_1, ..., M_UxC_l}.
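Under the assumption that an ODP lookup returns a full category path per term, Algorithm 1 can be sketched in Python; `ODP`, `odp_lookup` and the sample terms below are toy stand-ins, not part of the method:

```python
# Sketch of Algorithm 1. `ODP` and `odp_lookup` are toy stand-ins for a
# real ODP/DMOZ query; the matrices are stored sparsely as dictionaries.
from collections import defaultdict

ODP = {
    "barcelona": "Sports:Soccer:UEFA:Spain:Clubs:Barcelona",
    "madrid": "Sports:Soccer:UEFA:Spain:Clubs",
    "guitar": "Arts:Music:Instruments",
}

def odp_lookup(term, level):
    """Return the category of `term` truncated at depth `level` (or None)."""
    path = ODP.get(term)
    return None if path is None else ":".join(path.split(":")[:level])

def classification_matrices(queries, l):
    """`queries` maps each user to her queries (lists of terms).
    Returns {L: {(user, category): count}} for L = 1, ..., l."""
    matrices = {L: defaultdict(int) for L in range(1, l + 1)}
    for L in range(1, l + 1):
        for user, user_queries in queries.items():
            for query in user_queries:
                for term in query:
                    c = odp_lookup(term, L)
                    if c is not None:
                        # Covers both branches of the pseudocode: a missing
                        # column simply starts at zero in a defaultdict.
                        matrices[L][(user, c)] += 1
    return matrices

Q = {"u1": [["barcelona"], ["madrid"]], "u2": [["barcelona", "guitar"]]}
M = classification_matrices(Q, 2)
print(M[1][("u1", "Sports")])      # 2
print(M[2][("u2", "Arts:Music")])  # 1
```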
Algorithm 2. Algorithm for computing the matrix M_UxU
Require: the classification matrices {M_UxC_1, ..., M_UxC_l}
Ensure: M_UxU
for M_UxC_L ∈ {M_UxC_1, ..., M_UxC_l} do
  for each column cj ∈ M_UxC_L do
    for each row ui ∈ M_UxC_L do
      for each row uρ ∈ M_UxC_L do
        add x to the cell (ui, uρ) of the matrix M_UxU, where x = min((ui, cj), (uρ, cj));
      end for
    end for
  end for
end for
return M_UxU.
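Assuming the classification matrices are stored sparsely as dictionaries mapping (user, category) pairs to counts (an implementation choice, not prescribed by the paper), Algorithm 2 can be sketched as:

```python
# Sketch of Algorithm 2. Each matrix maps (user, category) -> query count;
# the diagonal (a user paired with herself) is skipped, which the
# pseudocode leaves implicit.
from collections import defaultdict

def incidence_matrix(matrices, users):
    """Sum min(count_i, count_j) over all levels and categories for every
    pair of users; returns {(ui, uj): semantic similarity}."""
    M_uu = defaultdict(int)
    for M in matrices.values():
        for c in {cat for (_, cat) in M}:
            for ui in users:
                for uj in users:
                    if ui != uj:
                        M_uu[(ui, uj)] += min(M.get((ui, c), 0),
                                              M.get((uj, c), 0))
    return M_uu

# Toy level-1 matrix: u1 asked 2 Sports queries, u2 asked 1.
matrices = {1: {("u1", "Sports"): 2, ("u2", "Sports"): 1}}
print(incidence_matrix(matrices, ["u1", "u2"])[("u1", "u2")])  # 1
```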
3.2 Partition
The partition step creates groups of k users with similar interests using Algorithm 3.
Let us assume that ui and uρ are the most similar users in the set. We calculate the users' similarity using the incidence matrix M_UxU (see Section 3.1).
The most similar users are those with the highest value in the matrix.
Next, we include ui and uρ in the cluster. If the group size k is two, we delete ui
132 A. Erola et al.
Algorithm 3. Algorithm for computing the clusters Z = {z1, ..., zγ} of users
Require: the set of users U = {u1, ..., un}
Require: the incidence matrix M_UxU
Require: the cluster size k
Ensure: the clusters Z = {z1, ..., zγ} of users
U′ ← U;
while size(U′) ≥ k do
  obtain the cluster zj of k users using Algorithm 4 and U′;
  remove the users ui ∈ zj from U′;
  add zj to the set Z;
end while
if 0 < size(U′) < k then
  create zγ with the users ui ∈ U′;
end if
return Z = {z1, ..., zγ}.
and uρ records from the incidence matrix and we repeat the process to obtain a new cluster. When the group size is bigger than two, we merge the columns and rows of ui and uρ, creating a new user u. u is the addition of both users, ui and uρ. Let us assume that uξ is the most similar user to u. Next, we include uξ in the cluster with ui and uρ. The method executes this process k − 2 times.
Note that if there are at least k users in the matrix we repeat the previous steps, and if there are fewer than k users we include them in a new group.
3.3 Aggregation
For every groupzj formed in the partition step, we compute its aggregation by selecting specific queries from each user in the group. That is, given the group of users zj = {u1, . . . , uk} we obtain a new user uzj as the representative (or centroid) of the cluster, which summarizes the queries of all the users of the cluster. To select such queries we define the following priorities:
– Semantically close queries have higher priority.
– Deeper levels in the ODP tree are more important.
The contribution of a user ui (Contrib_i) to the centroid depends on her number of queries card(Qi), and can be calculated as follows:

Contrib_i = card(Qi) / Σ_{i=1}^{k} card(Qi)    (1)

The number of queries of the centroid is the average of the number of queries of each user ui of the cluster zj. Thus, the quota of each user ui in the new centroid uzj can be computed as:

Quota_i = card(Qi) / k    (2)
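As a numeric illustration of Eqs. (1) and (2), take a cluster of k = 2 users with 6 and 4 queries (the counts are illustrative):

```python
# Worked example of Eqs. (1) and (2); the query counts are illustrative.
counts = [6, 4]              # card(Q_i) for the k users of one cluster
k = len(counts)

# Eq. (1): each user's share of the cluster's queries.
contrib = [c / sum(counts) for c in counts]
# Eq. (2): how many queries each user places in the centroid.
quota = [c / k for c in counts]

print(contrib)      # [0.6, 0.4]
print(quota)        # [3.0, 2.0]
print(sum(quota))   # 5.0, the centroid's (average) number of queries
```

Note that the quotas sum to the average log size of the cluster, which is exactly the number of queries assigned to the centroid.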
The aggregation method runs the following protocol for each user:
Algorithm 4. Algorithm for computing a cluster z of k users
Require: the incidence matrix M_UxU
Require: the cluster size k
Ensure: a cluster z of k users
M′_UxU ← M_UxU;
z ← ∅;
obtain the two most similar users (ui, uρ), i.e. the cell of M′_UxU with the highest value;
add (ui, uρ) to the set z;
while (size(z) < k) and (columns(M′_UxU) > 0) do
  for each column cs ∈ M′_UxU do
    add the cell (cs, uρ) to the cell (cs, ui) of the matrix M′_UxU;
  end for
  for each row rs ∈ M′_UxU do
    add the cell (uρ, rs) to the cell (ui, rs);
  end for
  delete the column uρ of the matrix M′_UxU;
  delete the row uρ of the matrix M′_UxU;
  obtain the new most similar user uρ to ui, i.e. the cell of the row ui with the highest value;
  add uρ to the set z;
end while
return z.
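A minimal sketch of Algorithm 4 and the merging step of Section 3.2, assuming the incidence matrix is kept as a symmetric dict-of-dicts; the user names and similarity values are illustrative:

```python
# Sketch of Algorithm 4. `sim[a][b]` is the (symmetric) incidence value
# between users a and b; all names and values below are illustrative.
import copy

def build_cluster(sim, k):
    """Greedily build one cluster of k users by repeatedly merging the
    most similar rows/columns of the incidence matrix."""
    sim = copy.deepcopy(sim)
    # Seed the cluster with the globally most similar pair.
    ui, up = max(((a, b) for a in sim for b in sim[a] if a != b),
                 key=lambda p: sim[p[0]][p[1]])
    z = {ui, up}
    while len(z) < k and len(sim) > 2:
        # Merge up into ui: add up's row and column to ui's.
        for v in sim:
            if v not in (ui, up):
                sim[ui][v] += sim[up][v]
                sim[v][ui] += sim[v][up]
        del sim[up]
        for v in sim:
            sim[v].pop(up, None)
        # Pick the user most similar to the merged user ui.
        up = max((v for v in sim[ui] if v != ui and v not in z),
                 key=lambda v: sim[ui][v])
        z.add(up)
    return z

sim = {
    "u1": {"u2": 5, "u3": 1, "u4": 0},
    "u2": {"u1": 5, "u3": 2, "u4": 1},
    "u3": {"u1": 1, "u2": 2, "u4": 4},
    "u4": {"u1": 0, "u2": 1, "u3": 4},
}
print(sorted(build_cluster(sim, 3)))  # ['u1', 'u2', 'u3']
```

Here u1 and u2 (similarity 5) seed the cluster; after merging them, the combined row is most similar to u3 (1 + 2 = 3), so u3 joins the cluster.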
1. Sort the logs of all users in descending order of query repetitions.
2. For each cluster zj and for each user ui ∈ zj:
   (a) While Quota_i is not reached:
      i. Add the first query of her sorted list with probability Contrib_i × (number of repetitions of qj). For example, if ui has a query repeated 3 times and Contrib_i is 0.4, then, since 3 · 0.4 = 1.2, the method adds one copy of the query to the new log and randomly decides whether to add it one more time, with probability 0.2.
      ii. Delete the first query from the list.
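The selection protocol above can be sketched for one user as follows; the log format, the function name, and the use of expected counts with probabilistic rounding are assumptions drawn from the example in step i:

```python
# Sketch of the aggregation protocol for one user. A log entry is a
# (query, repetitions) pair; names and numbers are illustrative.
import random

def aggregate_user(log, contrib, quota, rng=None):
    """Pick queries for the centroid: each query contributes
    contrib * repetitions copies in expectation, until the quota is met."""
    rng = rng or random.Random(0)
    # Step 1: sort the log in descending order of repetitions.
    log = sorted(log, key=lambda e: e[1], reverse=True)
    selected = []
    for query, reps in log:
        if len(selected) >= quota:       # step (a): quota reached
            break
        expected = contrib * reps        # e.g. 0.4 * 3 = 1.2
        n = int(expected)                # add the integer part...
        if rng.random() < expected - n:  # ...plus one more with prob 0.2
            n += 1
        selected.extend([query] * n)     # step i
        # step ii: the query is consumed; the loop moves to the next one
    return selected

log = [("barcelona tickets", 3), ("guitar tabs", 1)]
print(aggregate_user(log, contrib=0.4, quota=2))
```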