4.3 Approximate Inference
4.3.2 Stochastic Variational Inference
Algorithm 3: Variational Bayesian Factorization Machine (VBFM)
Require: $\alpha$, $\sigma_0$, $\sigma^w_{c_i}$, $\sigma^v_{c_i k}$ $\forall i, k$
Ensure: Randomly initialize $\sigma'_0$, $\mu'_0$, $\sigma'_{w_i}$, $\mu'_{w_i}$, $\sigma'_{v_{ik}}$, $\mu'_{v_{ik}}$ $\forall i, k$
Ensure: Compute $R_n$ for all the training data points.
 1: for $t = 1$ to $M$ do
 2:   // Update $w_0$'s parameters
 3:   $\sigma_{old} \leftarrow \sigma'_0$
 4:   $\mu_{old} \leftarrow \mu'_0$
 5:   $\sigma'_0 \leftarrow (\sigma_0 + \alpha N)^{-1}$
 6:   $\mu'_0 \leftarrow \sigma'_0\,\alpha \sum_{n=1}^{N} (R_n + \mu'_0)$
 7:   for $n = 1$ to $N$ do
 8:     $R_n \leftarrow R_n + \mu_{old} - \mu'_0$
 9:   end for
10:   // Update $w_i$'s parameters
11:   for $i = 1$ to $D$ do
12:     $\sigma_{old} \leftarrow \sigma'_{w_i}$
13:     $\mu_{old} \leftarrow \mu'_{w_i}$
14:     $\sigma'_{w_i} \leftarrow \left(\sigma^w_{c_i} + \alpha \sum_{n=1}^{N} x_{ni}^2\right)^{-1}$
15:     $\mu'_{w_i} \leftarrow \sigma'_{w_i}\,\alpha \sum_{n=1}^{N} x_{ni}\left(R_n + x_{ni}\mu'_{w_i}\right)$
16:     for $n \in \Omega_i$ do
17:       $R_n \leftarrow R_n + x_{ni}(\mu_{old} - \mu'_{w_i})$
18:     end for
19:   end for
20:   // Update $v_{ik}$'s parameters
21:   for $k = 1$ to $K$ do
22:     for $i = 1$ to $D$ do
23:       $\sigma_{old} \leftarrow \sigma'_{v_{ik}}$
24:       $\mu_{old} \leftarrow \mu'_{v_{ik}}$
25:       $\sigma'_{v_{ik}} \leftarrow \left(\sigma^v_{c_i k} + \alpha \sum_{n=1}^{N} x_{ni}^2 \left(S_1(i,k)^2 + S_2(i,k)\right)\right)^{-1}$
26:       $\mu'_{v_{ik}} \leftarrow \sigma'_{v_{ik}}\,\alpha \sum_{n=1}^{N} x_{ni} S_1(i,k) \left[R_n + x_{ni}\mu'_{v_{ik}} S_1(i,k)\right]$
27:       for $n \in \Omega_i$ do
28:         $R_n \leftarrow R_n + x_{ni} S_1(i,k)(\mu_{old} - \mu'_{v_{ik}})$
29:       end for
30:     end for
31:   end for
32:   // Update hyperparameters
33:   $\alpha \leftarrow N \big/ \sum_{n=1}^{N} \left(R_n^2 + T_n\right)$
34:   $\sigma_0 \leftarrow 1 \big/ \left({\mu'_0}^2 + \sigma'_0\right)$
35:   for $i = 1$ to $|c|$ do
36:     $\sigma^w_{c_i} \leftarrow \sum_{j \in c_i} 1 \big/ \sum_{j \in c_i} \left({\mu'_{w_j}}^2 + \sigma'_{w_j}\right)$
37:   end for
38:   for $k = 1$ to $K$ do
39:     for $i = 1$ to $|c|$ do
40:       $\sigma^v_{c_i k} \leftarrow \sum_{j \in c_i} 1 \big/ \sum_{j \in c_i} \left({\mu'_{v_{jk}}}^2 + \sigma'_{v_{jk}}\right)$
41:     end for
42:   end for
43: end for
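To illustrate the residual-caching pattern that makes these coordinate updates efficient, the following Python sketch implements the $w_i$ sweep (lines 11-19 of Algorithm 3). It is a minimal illustration, not the reference implementation: the dense matrix `X`, the function name, and the array layout are assumptions for readability, and a practical version would use a sparse design matrix.

    import numpy as np

    def update_linear_weights(X, R, mu_w, sigma_w, prec_c_w, alpha):
        """One sweep of the w_i updates (lines 11-19 of Algorithm 3).

        X        : (N, D) design matrix (dense here for brevity)
        R        : (N,)   cached residuals R_n, updated in place
        mu_w     : (D,)   variational means mu'_{w_i}, updated in place
        sigma_w  : (D,)   variational variances sigma'_{w_i}, updated in place
        prec_c_w : (D,)   prior precision sigma^w_{c_i} of each feature's group
        alpha    : float  noise precision
        """
        N, D = X.shape
        for i in range(D):
            mu_old = mu_w[i]
            xi = X[:, i]
            # Line 14: sigma'_{w_i} <- (sigma^w_{c_i} + alpha * sum_n x_ni^2)^(-1)
            sigma_w[i] = 1.0 / (prec_c_w[i] + alpha * np.sum(xi ** 2))
            # Line 15: mu'_{w_i} <- sigma'_{w_i} alpha sum_n x_ni (R_n + x_ni mu'_{w_i})
            mu_w[i] = sigma_w[i] * alpha * np.sum(xi * (R + xi * mu_old))
            # Lines 16-18: patch the cached residuals for n in Omega_i
            nz = np.nonzero(xi)[0]
            R[nz] += xi[nz] * (mu_old - mu_w[i])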
Therefore, in the implementation, after sub-sampling a data instance $n$ uniformly at random from the given dataset, the noisy estimate of $\mathcal{L}(q, \theta)$ can be computed as follows:
$\mathcal{L}_{noisy}(q, \theta) = s_n^{-1} F_n + F_0 + \sum_{i=1;\, i \in n}^{D} F_i^w + \sum_{i=1;\, i \in n}^{D} \sum_{k=1}^{K} F_{ik}^v,$   (4.25)
where $s_n$ is the rescaling constant. Eq. (4.25) is the rescaled version of Eq. (4.12). The rescale factors for $w_0$, $w_i$, and $v_{ik}$ are set to $N$, $|\Omega_i|$, and $|\Omega_i|$, respectively. The variational parameters associated with $q(Z)$ are updated by making a small step in the direction of the gradient of Eq. (4.25). Since the natural gradient leads to faster convergence [Amari, 1998; Hoffman et al., 2013], the natural parameters of $q(Z)$ are considered for the updates. The natural gradient of a function accounts for the information geometry of its parameter space.
The classical gradient method for maximization tries to find a maximum of a function by taking small steps in the direction of the gradient. The gradient (when it exists) points in the direction of steepest ascent. However, the Euclidean metric might not capture a meaningful notion of distance [Hoffman et al., 2013]. The natural gradient corrects for this issue by redefining the basic definition of the gradient [Amari, 1998]. While the Euclidean gradient points in the direction of steepest ascent in Euclidean space, the natural gradient points in the direction of steepest ascent in Riemannian space, that is, the space where local distance is defined by the KL divergence rather than the L2 norm.
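As a toy numerical illustration of this difference (purely illustrative, not part of the derivation), consider fitting the mean of a univariate Gaussian with a fixed, large variance: the Euclidean gradient of the average log-likelihood shrinks with $1/\sigma^2$, while the natural gradient, which is the Euclidean gradient preconditioned by the inverse Fisher information $\sigma^2$, stays well scaled.

    import numpy as np

    # Fit the mean mu of N(mu, sigma2) with sigma2 fixed and large.
    # Euclidean gradient of the average log-likelihood w.r.t. mu:
    #   (x_bar - mu) / sigma2          -> vanishingly small steps
    # Fisher information is 1 / sigma2, so the natural gradient is
    #   sigma2 * (x_bar - mu) / sigma2 = x_bar - mu -> well scaled
    rng = np.random.default_rng(0)
    x = rng.normal(loc=3.0, scale=10.0, size=10_000)
    sigma2 = 100.0
    mu = 0.0
    for _ in range(20):
        euclid_grad = (x.mean() - mu) / sigma2
        natural_grad = sigma2 * euclid_grad   # precondition by inverse Fisher
        mu += natural_grad                    # unit step: converges immediately
    print(mu)                                 # ~= x.mean() ~= 3.0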
Natural parameters are represented as $\bar{v}_{ik} = \mu'_{v_{ik}} / \sigma'_{v_{ik}}$, $\hat{v}_{ik} = 1 / \sigma'_{v_{ik}}$, and $\mathring{v}_{ik} = \{\bar{v}_{ik}, \hat{v}_{ik}\}$, with $\mathring{v}_{ik}$ denoting the natural parameter corresponding to $v_{ik}$. Also, the natural gradient of $\mathcal{L}_{noisy}(q, \theta)$ with respect to $\mathring{v}_{ik}$ is given by $\nabla \mathcal{L}'(\mathring{v}_{ik})$. As the model is conditionally conjugate [Hoffman et al., 2013], $\nabla \mathcal{L}'(\mathring{v}_{ik}) = \mathring{v}_{ik}^* - \mathring{v}_{ik}$, where $\mathring{v}_{ik}^* = \{\bar{v}_{ik}^*, \hat{v}_{ik}^*\}$ is the value of $\mathring{v}_{ik}$ that maximizes Eq. (4.25). Therefore, the update equation for $\mathring{v}_{ik}$ can be written as:
$\mathring{v}_{ik}^{new} = \mathring{v}_{ik}^{old} + \eta_i^v (\mathring{v}_{ik}^* - \mathring{v}_{ik}^{old}) = (1 - \eta_i^v)\,\mathring{v}_{ik}^{old} + \eta_i^v\,\mathring{v}_{ik}^*,$   (4.26)
where $\eta_i^v$ is the step size corresponding to $\mathring{v}_{ik}$. The step sizes $\eta_0^w$, $\eta_i^w$, and $\eta_i^v$ are updated each time the corresponding parameters are updated, following the Robbins-Monro conditions [Hoffman et al., 2013], which ensure convergence. In particular, let $t_{w_0}$, $t_{w_i}$, and $t_{v_{ik}}$ be the number of times the corresponding parameters have been updated.
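A minimal sketch of the corresponding bookkeeping (the helper names are hypothetical) converts between $(\mu', \sigma')$ and the natural parameters and applies the step of Eq. (4.26):

    def to_natural(mu, sigma):
        # (mu', sigma') -> (v_bar, v_hat) = (mu'/sigma', 1/sigma')
        return mu / sigma, 1.0 / sigma

    def from_natural(v_bar, v_hat):
        # inverse map: sigma' = 1/v_hat, mu' = v_bar/v_hat
        return v_bar / v_hat, 1.0 / v_hat

    def natural_step(current, optimum, eta):
        # Eq. (4.26): since the natural gradient is (optimum - current) for a
        # conditionally conjugate model, the update is a convex combination
        # of the old natural parameters and the maximizing ones.
        v_bar, v_hat = current
        v_bar_opt, v_hat_opt = optimum
        return ((1 - eta) * v_bar + eta * v_bar_opt,
                (1 - eta) * v_hat + eta * v_hat_opt)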
Algorithm 4: Online Variational Bayesian Factorization Machine (OVBFM)
Require: $\alpha$, $\sigma_0$, $\sigma^w_{c_i}$, $\sigma^v_{c_i k}$, $\eta_i$ $\forall i, k$
Ensure: Randomly initialize $\sigma'_0$, $\mu'_0$, $\sigma'_{w_i}$, $\mu'_{w_i}$, $\sigma'_{v_{ik}}$, $\mu'_{v_{ik}}$ $\forall i, k$
 1: for $t = 1$ to $M$ do
 2:   for $s \in B$ do
 3:     Compute $R_n$ $\forall n \in s$
 4:     // Update $w_0$'s parameters
 5:     for $n \in s$ do
 6:       Update $\mathring{w}_0^{avg}$ using (4.31)
 7:     end for
 8:     Update $\mathring{w}_0$ using (4.29)
 9:     Update $R_n$ as in Algorithm 3 for the current batch
10:     // Update $w_i$'s parameters
11:     for $i = 1$ to $D$ do
12:       for $n \in \Omega_i$ do
13:         Update $\mathring{w}_i^{avg}$ using (4.32)
14:       end for
15:       Update $\mathring{w}_i$ using (4.29)
16:       Update $R_n$ as in Algorithm 3 for the current batch
17:     end for
18:     // Update $v_{ik}$'s parameters
19:     for $k = 1$ to $K$ do
20:       for $i = 1$ to $D$ do
21:         for $n \in \Omega_i$ do
22:           Update $\mathring{v}_{ik}^{avg}$ using (4.33)
23:         end for
24:         Update $\mathring{v}_{ik}$ using (4.29)
25:         Update $R_n$ as in Algorithm 3 for the current batch
26:       end for
27:     end for
28:     Update $\eta_0^w$, $\eta_i^w$, $\eta_i^v$ using (4.27) and (4.28)
29:     // Update hyperparameters
30:     $\alpha \leftarrow (1 - \eta_0^w)\alpha + \eta_0^w \left( |s| \big/ \sum_{n=1}^{|s|} \left(R_n^2 + T_n\right) \right)$
31:     $\sigma_0 \leftarrow (1 - \eta_0^w)\sigma_0 + \eta_0^w \left( 1 \big/ \left({\mu'_0}^2 + \sigma'_0\right) \right)$
32:     for $i = 1$ to $|c|$ do
33:       $\sigma^w_{c_i} \leftarrow (1 - \eta_i^w)\sigma^w_{c_i} + \eta_i^w \left( \sum_{j \in c_i} 1 \big/ \sum_{j \in c_i} \left({\mu'_{w_j}}^2 + \sigma'_{w_j}\right) \right)$
34:     end for
35:     for $k = 1$ to $K$ do
36:       for $i = 1$ to $|c|$ do
37:         $\sigma^v_{c_i k} \leftarrow (1 - \eta_i^v)\sigma^v_{c_i k} + \eta_i^v \left( \sum_{j \in c_i} 1 \big/ \sum_{j \in c_i} \left({\mu'_{v_{jk}}}^2 + \sigma'_{v_{jk}}\right) \right)$
38:       end for
39:     end for
40:   end for
41: end for
Then the update rules can be written as follows:
$\eta_0^w = (1 + t_{w_0})^{-\lambda}, \qquad \eta_i^w = (1 + t_{w_i})^{-\lambda} \quad \forall i \in \{1, 2, \cdots, D\},$   (4.27)
$\eta_i^v = (1 + t_{v_{ik}})^{-\lambda} \quad \forall i \in \{1, 2, \cdots, D\} \text{ and } \forall k \in \{1, 2, \cdots, K\},$   (4.28)
where $\lambda \in (0.5, 1)$. In all the experiments, the minimum value of $\lambda$ produced the best results; therefore $\lambda$ is set to 0.5.
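A minimal sketch of this schedule (names hypothetical):

    def step_size(t, lam=0.5):
        # Eqs. (4.27)-(4.28): eta_t = (1 + t)^(-lam). For lam in (0.5, 1),
        # sum_t eta_t diverges while sum_t eta_t^2 converges, which are the
        # Robbins-Monro conditions that guarantee convergence.
        return (1.0 + t) ** (-lam)

    # e.g., after the 10th update of w_0:
    # eta_w0 = step_size(10)   # ~0.30 for lam = 0.5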
To reduce the variance due to the noisy estimate of the gradient, a mini-batch version is considered with a batch size of $|s|$ points. To update a parameter, for example $\mathring{v}_{ik}$, the values $\{\bar{v}_{ik}^*, \hat{v}_{ik}^*\}$ are computed and stored for all the data instances with non-zero feature values in the $i$th column. The update of $\mathring{v}_{ik}$ can then be derived as follows:
$\mathring{v}_{ik}^{new} = (1 - \eta_i^v)\,\mathring{v}_{ik}^{old} + \eta_i^v\,\mathring{v}_{ik}^{avg},$   (4.29)
where
$\mathring{v}_{ik}^{avg} = \frac{1}{n_i} \sum_{n=1}^{n_i} \mathring{v}_{ik}^{*,n}.$   (4.30)
Here $n_i$ is the number of non-zero entries in the $i$th column of the design matrix constructed from the current batch, and $\mathring{v}_{ik}^{*,n}$ is the value of $\mathring{v}_{ik}^*$ produced when the $n$th data point is considered. The detailed update equations for the parameter set $\{\mathring{w}_0^*, \mathring{w}_i^*, \mathring{v}_{ik}^*\}$, which are used in Eq. (4.29) to calculate the variational parameters $\{\mathring{w}_0, \mathring{w}_i, \mathring{v}_{ik}\}$, are as follows (a code sketch follows the list).
• The update rule for the parameters of $\mathring{w}_0^* = \{\bar{w}_0^*, \hat{w}_0^*\}$ given the $n$th data point is:
$\hat{w}_0^* = \sigma_0 + N\alpha, \qquad \bar{w}_0^* = N\alpha\,(R_n + \mu'_0).$   (4.31)
• The update rule for the parameters of $\mathring{w}_i^* = \{\bar{w}_i^*, \hat{w}_i^*\}$ given the $n$th data point is:
$\hat{w}_i^* = \sigma^w_{c_i} + |\Omega_i|\,\alpha\, x_{ni}^2, \qquad \bar{w}_i^* = |\Omega_i|\,\alpha\, x_{ni}\left(R_n + x_{ni}\mu'_{w_i}\right).$   (4.32)
• The update rule for the parameters of $\mathring{v}_{ik}^* = \{\bar{v}_{ik}^*, \hat{v}_{ik}^*\}$ given the $n$th data point is:
$\hat{v}_{ik}^* = \sigma^v_{c_i k} + |\Omega_i|\,\alpha\, x_{ni}^2 \left(S_1(i,k)^2 + S_2(i,k)\right), \qquad \bar{v}_{ik}^* = |\Omega_i|\,\alpha\, x_{ni} S_1(i,k) \left(R_n + x_{ni}\mu'_{v_{ik}} S_1(i,k)\right).$   (4.33)
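To make these updates concrete for the linear weights $w_i$, the following sketch combines Eqs. (4.29), (4.30), and (4.32) on one mini-batch. It is a minimal illustration under the notation above; the function name, the dense batch matrix `Xb`, and the in-place arrays are assumptions, not the thesis implementation.

    import numpy as np

    def batch_update_w_i(i, Xb, Rb, mu_w, sigma_w, prec_c_w, alpha, n_omega_i, eta):
        """One mini-batch update of w_i's natural parameters.

        Xb, Rb    : design matrix and cached residuals of the current batch
        n_omega_i : |Omega_i|, the rescaling factor for w_i over the full data
        eta       : step size eta_i^w from Eqs. (4.27)-(4.28)
        """
        nz = np.nonzero(Xb[:, i])[0]      # batch rows with x_ni != 0
        if len(nz) == 0:
            return
        x = Xb[nz, i]
        # Per-point optima (Eq. 4.32), then average over the batch (Eq. 4.30)
        w_hat_star = prec_c_w[i] + n_omega_i * alpha * x ** 2
        w_bar_star = n_omega_i * alpha * x * (Rb[nz] + x * mu_w[i])
        w_hat_avg, w_bar_avg = w_hat_star.mean(), w_bar_star.mean()
        # Current natural parameters, then the step of Eq. (4.29)
        w_bar = mu_w[i] / sigma_w[i]
        w_hat = 1.0 / sigma_w[i]
        w_bar = (1 - eta) * w_bar + eta * w_bar_avg
        w_hat = (1 - eta) * w_hat + eta * w_hat_avg
        # Map back to (mu', sigma') and keep the batch residuals consistent
        mu_old = mu_w[i]
        sigma_w[i] = 1.0 / w_hat
        mu_w[i] = w_bar / w_hat
        Rb[nz] += x * (mu_old - mu_w[i])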
Algorithm 4 describes the detailed procedure of OVBFM. In each iteration, the dataset is partitioned into $B$ random batches, through which OVBFM loops sequentially. For a given batch, lines 3-27 update all the natural parameters, and lines 28-39 update all the model hyperparameters. Repeating these steps over the $B$ batches completes one full iteration of OVBFM, and the whole procedure is run for $M$ iterations.
Table 4.1: Description of the datasets.

Dataset         No. of Users   No. of Movies   No. of Entries
Movielens 1m    6040           3900            1m
Movielens 10m   71567          10681           10m
Netflix         480189         17770           100m
KDD Music       1000990        624961          263m