The proposed method has the following steps:
– ODP classification (or data preparation).
– Partition.
– Aggregation.
These steps are described in the following sections.
3.1 ODP Classification
When querying the ODP, the returned categories can be divided into depth levels.
Let l be a parameter of our system that identifies the maximum depth level of the ODP hierarchy with which our system works. For example, if we have the classification Sports:Soccer:UEFA:Spain:Clubs:Barcelona and l = 1, we work with the root category Sports; when l = 2 we work with Sports:Soccer; and so on.
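For illustration, truncating a category path to a given depth can be sketched as follows (the colon-separated string form is taken from the running example; the helper name is an assumption):

```python
# Hypothetical helper: ODP categories are written as colon-separated
# strings, as in the running example; the function name is an assumption.
def truncate_category(path, level):
    """Keep only the first `level` components of an ODP category path."""
    return ":".join(path.split(":")[:level])

path = "Sports:Soccer:UEFA:Spain:Clubs:Barcelona"
print(truncate_category(path, 1))  # Sports
print(truncate_category(path, 2))  # Sports:Soccer
```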
In the ODP classification step we know the set of users {u1, ..., un} and their sets of queries Q = {Q1, ..., Qn}, where Qi = {q1, ..., qm} is the set of queries of the user ui. Every query qj can have several terms ts, i.e. qj = {t1, ..., tr}.
For every term ts, we obtain its classification at level L ∈ {1, ..., l} using the ODP. Next, we create the classification matrix M_UxC_L, which contains the number of queries for each user and category at level L. Note that we obtain one matrix for every level L ∈ {1, ..., l}. Finally, we use the M_UxC_L matrices to compute the incidence matrix M_UxU, which contains the semantic similarity of the users. The process works as follows:
1. Obtain the classification matrices M_UxC_L using Algorithm 1.
2. Obtain the incidence matrix M_UxU using Algorithm 2, i.e. add up all the coincidences between each pair of users in the classification matrices.
Semantic Microaggregation for the Anonymization of Query Logs 131
Algorithm 1. Algorithm for computing the matrices M_UxC_L, where L ∈ {1, ..., l}
Require: the maximum depth l for the ODP categories
Require: the set of users U = {u1, ..., un}
Require: the set of queries Qi = {q1, ..., qm} of each user ui
Require: the set of terms {t1, ..., tr} of each query qj
Ensure: {M_UxC_1, ..., M_UxC_l}, i.e. for every level L, the matrix M_UxC_L with the number of queries for each category and user at depth L
for L = 1 to L = l do
  for ui ∈ {u1, ..., un} do
    for qj ∈ Qi = {q1, ..., qm} do
      for ts ∈ qj = {t1, ..., tr} do
        obtain the categories c at depth L for the term ts using the ODP;
        if c ∈ M_UxC_L then
          add one to the cell (ui, c) of M_UxC_L;
        else
          add the column c to M_UxC_L;
          initialize the cell (ui, c) of M_UxC_L to one;
        end if
      end for
    end for
  end for
end for
return {M_UxC_1, ..., M_UxC_l}.
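Under the assumption that an ODP lookup returns a full category path per term, Algorithm 1 can be sketched in Python; `ODP`, `odp_lookup` and the sample terms below are toy stand-ins, not part of the method:

```python
# Sketch of Algorithm 1. `ODP` and `odp_lookup` are toy stand-ins for a
# real ODP/DMOZ query; the matrices are stored sparsely as dictionaries.
from collections import defaultdict

ODP = {
    "barcelona": "Sports:Soccer:UEFA:Spain:Clubs:Barcelona",
    "madrid": "Sports:Soccer:UEFA:Spain:Clubs",
    "guitar": "Arts:Music:Instruments",
}

def odp_lookup(term, level):
    """Return the category of `term` truncated at depth `level` (or None)."""
    path = ODP.get(term)
    return None if path is None else ":".join(path.split(":")[:level])

def classification_matrices(queries, l):
    """`queries` maps each user to her queries (lists of terms).
    Returns {L: {(user, category): count}} for L = 1, ..., l."""
    matrices = {L: defaultdict(int) for L in range(1, l + 1)}
    for L in range(1, l + 1):
        for user, user_queries in queries.items():
            for query in user_queries:
                for term in query:
                    c = odp_lookup(term, L)
                    if c is not None:
                        # Covers both branches of the pseudocode: a missing
                        # column simply starts at zero in a defaultdict.
                        matrices[L][(user, c)] += 1
    return matrices

Q = {"u1": [["barcelona"], ["madrid"]], "u2": [["barcelona", "guitar"]]}
M = classification_matrices(Q, 2)
print(M[1][("u1", "Sports")])      # 2
print(M[2][("u2", "Arts:Music")])  # 1
```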
Algorithm 2. Algorithm for computing the matrix M_UxU
Require: the classification matrices {M_UxC_1, ..., M_UxC_l}
Ensure: M_UxU
for M_UxC_L ∈ {M_UxC_1, ..., M_UxC_l} do
  for each column cj ∈ M_UxC_L do
    for each row ui ∈ M_UxC_L do
      for each row uρ ∈ M_UxC_L do
        add x to the cell (ui, uρ) of the matrix M_UxU, where x = min((ui, cj), (uρ, cj));
      end for
    end for
  end for
end for
return M_UxU.
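Assuming the classification matrices are stored sparsely as dictionaries mapping (user, category) pairs to counts (an implementation choice, not prescribed by the paper), Algorithm 2 can be sketched as:

```python
# Sketch of Algorithm 2. Each matrix maps (user, category) -> query count;
# the diagonal (a user paired with herself) is skipped, which the
# pseudocode leaves implicit.
from collections import defaultdict

def incidence_matrix(matrices, users):
    """Sum min(count_i, count_j) over all levels and categories for every
    pair of users; returns {(ui, uj): semantic similarity}."""
    M_uu = defaultdict(int)
    for M in matrices.values():
        for c in {cat for (_, cat) in M}:
            for ui in users:
                for uj in users:
                    if ui != uj:
                        M_uu[(ui, uj)] += min(M.get((ui, c), 0),
                                              M.get((uj, c), 0))
    return M_uu

# Toy level-1 matrix: u1 asked 2 Sports queries, u2 asked 1.
matrices = {1: {("u1", "Sports"): 2, ("u2", "Sports"): 1}}
print(incidence_matrix(matrices, ["u1", "u2"])[("u1", "u2")])  # 1
```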
3.2 Partition
The partition step creates groups of k users with similar interests using Algorithm 3.
Let us assume that ui and uρ are the most similar users in the set. We calculate the users' similarity using the incidence matrix M_UxU (see Section 3.1).
The most similar users are those with the highest value in the matrix.
Next, we include ui and uρ in the cluster. If the group size k is two, we delete ui
132 A. Erola et al.
Algorithm 3. Algorithm for computing the clusters Z = {z1, ..., zγ} of users
Require: the set of users U = {u1, ..., un}
Require: the incidence matrix M_UxU
Require: the cluster size k
Ensure: the clusters Z = {z1, ..., zγ} of users
U′ ← U;
while size(U′) ≥ k do
  obtain the cluster zj of k users using Algorithm 4 and U′;
  remove the users ui ∈ zj from U′;
  add zj to the set Z;
end while
if 0 < size(U′) < k then
  create zγ with the users ui ∈ U′;
end if
return Z = {z1, ..., zγ}.
and uρ records from the incidence matrix and we repeat the process to obtain a new cluster. When the group size is bigger than two, we merge the columns and rows of ui and uρ, creating a new user u. u is the addition of both users, ui and uρ. Let us assume that uξ is the most similar user to u. Next, we include uξ in the cluster with ui and uρ. The method executes this process k − 2 times.
Note that if there are at least k users in the matrix we repeat the previous steps, and if there are fewer than k users we include them in a new group.
3.3 Aggregation
For every groupzj formed in the partition step, we compute its aggregation by selecting specific queries from each user in the group. That is, given the group of users zj = {u1, . . . , uk} we obtain a new user uzj as the representative (or centroid) of the cluster, which summarizes the queries of all the users of the cluster. To select such queries we define the following priorities:
– Semantically close queries have higher priority.
– Deeper levels in the ODP tree are more important.
The contribution of a user ui (Contrib_i) to the centroid depends on her number of queries card(Qi), and can be calculated as follows:

Contrib_i = card(Qi) / Σ_{i=1}^{k} card(Qi)    (1)

The number of queries of the centroid is the average of the number of queries of each user ui of the cluster zj. Thus, the quota of each user ui in the new centroid uzj can be computed as:

Quota_i = card(Qi) / k    (2)
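As a numeric illustration of Eqs. (1) and (2), take a cluster of k = 2 users with 6 and 4 queries (the counts are illustrative):

```python
# Worked example of Eqs. (1) and (2); the query counts are illustrative.
counts = [6, 4]              # card(Q_i) for the k users of one cluster
k = len(counts)

# Eq. (1): each user's share of the cluster's queries.
contrib = [c / sum(counts) for c in counts]
# Eq. (2): how many queries each user places in the centroid.
quota = [c / k for c in counts]

print(contrib)      # [0.6, 0.4]
print(quota)        # [3.0, 2.0]
print(sum(quota))   # 5.0, the centroid's (average) number of queries
```

Note that the quotas sum to the average log size of the cluster, which is exactly the number of queries assigned to the centroid.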
The aggregation method runs the following protocol for each user:
Algorithm 4. Algorithm for computing a cluster z of k users
Require: the incidence matrix M_UxU
Require: the cluster size k
Ensure: a cluster z of k users
M′_UxU ← M_UxU;
z ← ∅;
obtain the two most similar users (ui, uρ), i.e. the cell of M′_UxU with the highest value;
add (ui, uρ) to the set z;
while (size(z) < k) and (columns(M′_UxU) > 0) do
  for each column cs ∈ M′_UxU do
    add the cell (cs, uρ) to the cell (cs, ui) of the matrix M′_UxU;
  end for
  for each row rs ∈ M′_UxU do
    add the cell (uρ, rs) to the cell (ui, rs);
  end for
  delete the column uρ of the matrix M′_UxU;
  delete the row uρ of the matrix M′_UxU;
  obtain the new most similar user uρ to ui, i.e. the cell of the row ui with the highest value;
  add uρ to the set z;
end while
return z.
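A minimal sketch of Algorithm 4 and the merging step of Section 3.2, assuming the incidence matrix is kept as a symmetric dict-of-dicts; the user names and similarity values are illustrative:

```python
# Sketch of Algorithm 4. `sim[a][b]` is the (symmetric) incidence value
# between users a and b; all names and values below are illustrative.
import copy

def build_cluster(sim, k):
    """Greedily build one cluster of k users by repeatedly merging the
    most similar rows/columns of the incidence matrix."""
    sim = copy.deepcopy(sim)
    # Seed the cluster with the globally most similar pair.
    ui, up = max(((a, b) for a in sim for b in sim[a] if a != b),
                 key=lambda p: sim[p[0]][p[1]])
    z = {ui, up}
    while len(z) < k and len(sim) > 2:
        # Merge up into ui: add up's row and column to ui's.
        for v in sim:
            if v not in (ui, up):
                sim[ui][v] += sim[up][v]
                sim[v][ui] += sim[v][up]
        del sim[up]
        for v in sim:
            sim[v].pop(up, None)
        # Pick the user most similar to the merged user ui.
        up = max((v for v in sim[ui] if v != ui and v not in z),
                 key=lambda v: sim[ui][v])
        z.add(up)
    return z

sim = {
    "u1": {"u2": 5, "u3": 1, "u4": 0},
    "u2": {"u1": 5, "u3": 2, "u4": 1},
    "u3": {"u1": 1, "u2": 2, "u4": 4},
    "u4": {"u1": 0, "u2": 1, "u3": 4},
}
print(sorted(build_cluster(sim, 3)))  # ['u1', 'u2', 'u3']
```

Here u1 and u2 (similarity 5) seed the cluster; after merging them, the combined row is most similar to u3 (1 + 2 = 3), so u3 joins the cluster.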
1. Sort the logs of all users in descending order of query repetitions.
2. For each cluster zj and for each user ui ∈ zj:
   (a) While Quota_i is not reached:
      i. Add the first query of her sorted list with probability Contrib_i × (number of repetitions of qj). For example, if ui has a query repeated 3 times and Contrib_i is 0.4, then, since 3 · 0.4 = 1.2, the method adds one copy of the query to the new log and randomly decides whether to add it one more time, with probability 0.2.
      ii. Delete the first query from the list.
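The selection protocol above can be sketched for one user as follows; the log format, the function name, and the use of expected counts with probabilistic rounding are assumptions drawn from the example in step i:

```python
# Sketch of the aggregation protocol for one user. A log entry is a
# (query, repetitions) pair; names and numbers are illustrative.
import random

def aggregate_user(log, contrib, quota, rng=None):
    """Pick queries for the centroid: each query contributes
    contrib * repetitions copies in expectation, until the quota is met."""
    rng = rng or random.Random(0)
    # Step 1: sort the log in descending order of repetitions.
    log = sorted(log, key=lambda e: e[1], reverse=True)
    selected = []
    for query, reps in log:
        if len(selected) >= quota:       # step (a): quota reached
            break
        expected = contrib * reps        # e.g. 0.4 * 3 = 1.2
        n = int(expected)                # add the integer part...
        if rng.random() < expected - n:  # ...plus one more with prob 0.2
            n += 1
        selected.extend([query] * n)     # step i
        # step ii: the query is consumed; the loop moves to the next one
    return selected

log = [("barcelona tickets", 3), ("guitar tabs", 1)]
print(aggregate_user(log, contrib=0.4, quota=2))
```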