

4.3 Approximate Inference

4.3.2 Stochastic Variational Inference

Algorithm 3 Variational Bayesian Factorization Machine (VBFM)

Require: $\alpha,\ \sigma_0,\ \sigma^w_{c_i},\ \sigma^v_{c_i,k}\ \forall i,k$
Ensure: Randomly initialize $\sigma'_0,\ \mu'_0,\ \sigma'_{w_i},\ \mu'_{w_i},\ \sigma'_{v_{ik}},\ \mu'_{v_{ik}}\ \forall i,k$
Ensure: Compute $R_n$ for all the training data points.

1: for $t = 1$ to $M$ do
2:   // Update $w_0$'s parameters
3:   $\sigma_{old} \leftarrow \sigma'_0$
4:   $\mu_{old} \leftarrow \mu'_0$
5:   $\sigma'_0 \leftarrow (\sigma_0 + \alpha N)^{-1}$
6:   $\mu'_0 \leftarrow \sigma'_0\,\alpha \sum_{n=1}^{N} (R_n + \mu'_0)$
7:   for $n = 1$ to $N$ do
8:     $R_n \leftarrow R_n + (\mu_{old} - \mu'_0)$
9:   end for
10:  // Update $w_i$'s parameters
11:  for $i = 1$ to $D$ do
12:    $\sigma_{old} \leftarrow \sigma'_{w_i}$
13:    $\mu_{old} \leftarrow \mu'_{w_i}$
14:    $\sigma'_{w_i} \leftarrow \left(\sigma^w_{c_i} + \alpha \sum_{n=1}^{N} x_{ni}^2\right)^{-1}$
15:    $\mu'_{w_i} \leftarrow \sigma'_{w_i}\,\alpha \sum_{n=1}^{N} x_{ni}\,(R_n + x_{ni}\mu'_{w_i})$
16:    for $n \in \Omega_i$ do
17:      $R_n \leftarrow R_n + x_{ni}(\mu_{old} - \mu'_{w_i})$
18:    end for
19:  end for
20:  // Update $v_{ik}$'s parameters
21:  for $k = 1$ to $K$ do
22:    for $i = 1$ to $D$ do
23:      $\sigma_{old} \leftarrow \sigma'_{v_{ik}}$
24:      $\mu_{old} \leftarrow \mu'_{v_{ik}}$
25:      $\sigma'_{v_{ik}} \leftarrow \left(\sigma^v_{c_i,k} + \alpha \sum_{n=1}^{N} x_{ni}^2\,\big(S_1(i,k)^2 + S_2(i,k)\big)\right)^{-1}$
26:      $\mu'_{v_{ik}} \leftarrow \sigma'_{v_{ik}}\,\alpha \sum_{n=1}^{N} x_{ni}\,S_1(i,k)\,\big[R_n + x_{ni}\mu'_{v_{ik}} S_1(i,k)\big]$
27:      for $n \in \Omega_i$ do
28:        $R_n \leftarrow R_n + x_{ni}\,S_1(i,k)\,(\mu_{old} - \mu'_{v_{ik}})$
29:      end for
30:    end for
31:  end for
32:  // Update hyperparameters
33:  $\alpha \leftarrow N \Big/ \sum_{n=1}^{N} (R_n^2 + T_n)$
34:  $\sigma_0 \leftarrow 1 \big/ (\mu'^2_0 + \sigma'_0)$
35:  for $i = 1$ to $|c|$ do
36:    $\sigma^w_{c_i} \leftarrow \sum_{j\in c_i} 1 \Big/ \sum_{j\in c_i} (\mu'^2_{w_j} + \sigma'_{w_j})$
37:  end for
38:  for $k = 1$ to $K$ do
39:    for $i = 1$ to $|c|$ do
40:      $\sigma^v_{c_i,k} \leftarrow \sum_{j\in c_i} 1 \Big/ \sum_{j\in c_i} (\mu'^2_{v_{jk}} + \sigma'_{v_{jk}})$
41:    end for
42:  end for
43: end for
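To make the flow of Algorithm 3 concrete, the sketch below implements the $w_0$ and $w_i$ updates together with the residual bookkeeping of lines 3–19. It is a minimal dense NumPy illustration under our own naming, not the reference implementation; the $v_{ik}$ updates, which additionally need the $S_1/S_2$ caches, follow the same pattern.

```python
import numpy as np

def vbfm_update_w(X, R, alpha, sigma_0, sigma_w_c,
                  mu_w0, sigma_w0, mu_w, sigma_w):
    """One sweep of the w_0 and w_i updates of Algorithm 3 (dense sketch).

    X        : (N, D) design matrix
    R        : (N,) residuals R_n, kept consistent in place
    alpha    : noise precision
    sigma_0  : prior precision hyperparameter for w_0
    sigma_w_c: (D,) prior precision for each w_i (group value expanded per feature)
    mu_w0, sigma_w0 : variational mean/variance of w_0
    mu_w, sigma_w   : (D,) variational means/variances of the w_i
    """
    N, D = X.shape

    # --- update q(w_0): lines 3-9 ---------------------------------------
    mu_old = mu_w0
    sigma_w0 = 1.0 / (sigma_0 + alpha * N)
    mu_w0 = sigma_w0 * alpha * np.sum(R + mu_old)
    R += mu_old - mu_w0                      # keep residuals consistent

    # --- update each q(w_i): lines 11-19 --------------------------------
    for i in range(D):
        x_i = X[:, i]
        mu_old = mu_w[i]
        sigma_w[i] = 1.0 / (sigma_w_c[i] + alpha * np.sum(x_i ** 2))
        mu_w[i] = sigma_w[i] * alpha * np.sum(x_i * (R + x_i * mu_old))
        R += x_i * (mu_old - mu_w[i])        # rows with x_ni = 0 are unchanged

    return mu_w0, sigma_w0, mu_w, sigma_w, R
```

In a sparse implementation only the rows in $\Omega_i$ need to be touched in the inner loop, which is what keeps each coordinate update cheap.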

tion. Therefore, in the implementation, after sub-sampling a data instance $n$ uniformly at random from the given dataset, the noisy estimate of $\mathcal{L}(q,\theta)$ can be computed as follows:

$$\mathcal{L}_{noisy}(q,\theta) \;=\; s_n^{-1} F_n + F_0 + \sum_{i=1;\, i\in n}^{D} F^w_i + \sum_{i=1;\, i\in n}^{D} \sum_{k=1}^{K} F^v_{ik}, \qquad (4.25)$$

where $s_n$ is the rescaling constant. Eq. (4.25) is the rescaled version of Eq. (4.12). The rescaling factors for $w_0$, $w_i$ and $v_{ik}$ are set to $N$, $|\Omega_i|$, and $|\Omega_i|$ respectively. The variational parameters associated with $q(Z)$ are updated by making a small step in the direction of the gradient of Eq. (4.25). Since the natural gradient leads to faster convergence [Amari, 1998; Hoffman et al., 2013], the natural parameters of $q(Z)$ are considered for the updates. The natural gradient of a function accounts for the information geometry of its parameter space.

The classical gradient method for maximization tries to find a maximum of a function by taking small steps in the direction of the gradient. The gradient (when it exists) points in the direction of steepest ascent. However, the Euclidean metric might not capture a meaningful notion of distance [Hoffman et al., 2013]. The natural gradient corrects for this issue by modifying the definition of the gradient [Amari, 1998]. While the Euclidean gradient points in the direction of steepest ascent in Euclidean space, the natural gradient points in the direction of steepest ascent in Riemannian space, that is, the space where local distance is defined by the KL divergence rather than the L2 norm.
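As a standard way of making this precise (a textbook identity rather than a result specific to this chapter), the natural gradient is obtained by preconditioning the ordinary gradient with the inverse Fisher information matrix of the variational distribution:

$$\tilde{\nabla}_{\theta}\,\mathcal{L}(\theta) \;=\; F(\theta)^{-1}\,\nabla_{\theta}\,\mathcal{L}(\theta), \qquad F(\theta) \;=\; \mathbb{E}_{q_{\theta}(z)}\!\big[\nabla_{\theta}\log q_{\theta}(z)\,\nabla_{\theta}\log q_{\theta}(z)^{\top}\big],$$

which, for the conditionally conjugate exponential-family case considered here, reduces to the simple difference of natural parameters given below.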

The natural parameters are represented as $\bar{v}_{ik} = \mu'_{v_{ik}}/\sigma'_{v_{ik}}$, $\hat{v}_{ik} = 1/\sigma'_{v_{ik}}$ and $\mathring{v}_{ik} = \{\bar{v}_{ik}, \hat{v}_{ik}\}$, with $\mathring{v}_{ik}$ denoting the natural parameter corresponding to $v_{ik}$. Also, the natural gradient of $\mathcal{L}_{noisy}(q,\theta)$ with respect to $\mathring{v}_{ik}$ is given by $\nabla\mathcal{L}'(\mathring{v}_{ik})$. As the model is conditionally conjugate [Hoffman et al., 2013], $\nabla\mathcal{L}'(\mathring{v}_{ik}) = \mathring{v}^*_{ik} - \mathring{v}_{ik}$, where $\mathring{v}^*_{ik} = \{\bar{v}^*_{ik}, \hat{v}^*_{ik}\}$ is the value of $\mathring{v}_{ik}$ that maximizes Eq. (4.25). Therefore, the update equation for $\mathring{v}_{ik}$ can be written as:

$$\mathring{v}^{new}_{ik} \;=\; \mathring{v}^{old}_{ik} + \eta^v_i\big(\mathring{v}^*_{ik} - \mathring{v}^{old}_{ik}\big) \;=\; (1-\eta^v_i)\,\mathring{v}^{old}_{ik} + \eta^v_i\,\mathring{v}^*_{ik}, \qquad (4.26)$$

where $\eta^v_i$ is the step size corresponding to $\mathring{v}_{ik}$. The step sizes $\eta^w_0$, $\eta^w_i$ and $\eta^v_i$ are updated each time the corresponding parameters get updated, using the Robbins-Monro conditions [Hoffman et al., 2013], which ensure convergence. In particular, let $t_{w_0}$, $t_{w_i}$ and $t_{v_{ik}}$ be the number of times the corresponding parameters have been updated. Then the update rules can be written as

Algorithm 4 Online Variational Bayesian Factorization Machine (OVBFM)

Require: $\alpha,\ \sigma_0,\ \sigma^w_{c_i},\ \sigma^v_{c_i,k},\ \eta_i\ \forall i,k$
Ensure: Randomly initialize $\sigma'_0,\ \mu'_0,\ \sigma'_{w_i},\ \mu'_{w_i},\ \sigma'_{v_{ik}},\ \mu'_{v_{ik}}\ \forall i,k$

1: for $t = 1$ to $M$ do
2:   for $s \in B$ do
3:     Compute $R_n\ \forall n \in s$
4:     // Update $w_0$'s parameters
5:     for $n \in s$ do
6:       Update $\mathring{w}^{avg}_0$ using (4.31)
7:     end for
8:     Update $\mathring{w}_0$ using (4.29)
9:     Update $R_n$ as in Algorithm 3 for the current batch
10:    // Update $w_i$'s parameters
11:    for $i = 1$ to $D$ do
12:      for $n \in \Omega_i$ do
13:        Update $\mathring{w}^{avg}_i$ using (4.32)
14:      end for
15:      Update $\mathring{w}_i$ using (4.29)
16:      Update $R_n$ as in Algorithm 3 for the current batch
17:    end for
18:    // Update $v_{ik}$'s parameters
19:    for $k = 1$ to $K$ do
20:      for $i = 1$ to $D$ do
21:        for $n \in \Omega_i$ do
22:          Update $\mathring{v}^{avg}_{ik}$ using (4.33)
23:        end for
24:        Update $\mathring{v}_{ik}$ using (4.29)
25:        Update $R_n$ as in Algorithm 3 for the current batch
26:      end for
27:    end for
28:    Update $\eta^w_0, \eta^w_i, \eta^v_i$ using (4.27) and (4.28)
29:    // Update hyperparameters
30:    $\alpha \leftarrow (1-\eta^w_0)\,\alpha + \eta^w_0 \Big(|s| \Big/ \sum_{n=1}^{|s|} (R_n^2 + T_n)\Big)$
31:    $\sigma_0 \leftarrow (1-\eta^w_0)\,\sigma_0 + \eta^w_0 \big(1/(\mu'^2_0 + \sigma'_0)\big)$
32:    for $i = 1$ to $|c|$ do
33:      $\sigma^w_{c_i} \leftarrow (1-\eta^w_i)\,\sigma^w_{c_i} + \eta^w_i \Big(\sum_{j\in c_i} 1 \Big/ \sum_{j\in c_i}(\mu'^2_{w_j} + \sigma'_{w_j})\Big)$
34:    end for
35:    for $k = 1$ to $K$ do
36:      for $i = 1$ to $|c|$ do
37:        $\sigma^v_{c_i,k} \leftarrow (1-\eta^v_i)\,\sigma^v_{c_i,k} + \eta^v_i \Big(\sum_{j\in c_i} 1 \Big/ \sum_{j\in c_i}(\mu'^2_{v_{jk}} + \sigma'_{v_{jk}})\Big)$
38:      end for
39:    end for
40:  end for
41: end for

follows:

$$\eta^w_0 = (1 + t_{w_0})^{-\lambda}, \qquad \eta^w_i = (1 + t_{w_i})^{-\lambda}\ \ \forall i \in \{1,2,\cdots,D\}, \qquad (4.27)$$
$$\eta^v_i = (1 + t_{v_{ik}})^{-\lambda}\ \ \forall i \in \{1,2,\cdots,D\}\ \text{and}\ \forall k \in \{1,2,\cdots,K\}, \qquad (4.28)$$

where $\lambda \in (0.5, 1]$. For all the experiments, the minimum value of $\lambda$ produced the best results; therefore, $\lambda$ is set to $0.5$.
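The following sketch shows how Eqs. (4.26)–(4.28) fit together in code. It is illustrative only; the function names and the flat tuple representation of the natural parameters are our own.

```python
def step_size(t, lam=0.5):
    """Robbins-Monro step size of Eqs. (4.27)-(4.28): eta = (1 + t)^(-lam)."""
    return (1.0 + t) ** (-lam)

def natural_params(mu, sigma):
    """Natural parameterization used in the text: (mu/sigma, 1/sigma)."""
    return mu / sigma, 1.0 / sigma

def svi_update(nat_old, nat_star, eta):
    """Eq. (4.26): interpolate towards the batch-optimal natural parameters."""
    v_bar_old, v_hat_old = nat_old
    v_bar_star, v_hat_star = nat_star
    return ((1.0 - eta) * v_bar_old + eta * v_bar_star,
            (1.0 - eta) * v_hat_old + eta * v_hat_star)
```

Converting back to the variational parameters after a step is immediate: $\sigma' = 1/\hat{v}$ and $\mu' = \bar{v}/\hat{v}$.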

To reduce the variance due to the noisy estimate of the gradient, a mini-batch version is considered with a batch size of $|s|$ points. To update a parameter, for example $\mathring{v}_{ik}$, the optimal values $\mathring{v}^*_{ik,n} = \{\bar{v}^*_{ik}, \hat{v}^*_{ik}\}$ are computed and stored for all the data instances with non-zero feature values in the $i$th column. The update of $\mathring{v}_{ik}$ can then be derived as follows:

$$\mathring{v}^{new}_{ik} \;=\; (1-\eta^v_i)\,\mathring{v}^{old}_{ik} + \eta^v_i\,\mathring{v}^{avg}_{ik}, \qquad (4.29)$$

where

$$\mathring{v}^{avg}_{ik} \;=\; \frac{1}{n_i}\sum_{n=1}^{n_i} \mathring{v}^*_{ik,n}. \qquad (4.30)$$

Here $n_i$ is the number of non-zero entries in the $i$th column of the design matrix constructed from the current batch, and $\mathring{v}^*_{ik,n}$ is the value of $\mathring{v}^*_{ik}$ produced when the $n$th data point is considered. The detailed update equations for the parameter set $\{\mathring{w}^*_0, \mathring{w}^*_i, \mathring{v}^*_{ik}\}$, which are used in Eq. (4.29) to calculate the variational parameters $\{\mathring{w}_0, \mathring{w}_i, \mathring{v}_{ik}\}$, are as follows.

• The update rule for the parameters of $\mathring{w}^*_0 = \{\bar{w}^*_0, \hat{w}^*_0\}$ given the $n$th data point is as follows:

$$\hat{w}^*_0 = \sigma_0 + N\alpha, \qquad \bar{w}^*_0 = N\alpha\,(R_n + \mu'_0). \qquad (4.31)$$

• The update rule for the parameters of $\mathring{w}^*_i = \{\bar{w}^*_i, \hat{w}^*_i\}$ given the $n$th data point is as follows:

$$\hat{w}^*_i = \sigma^w_{c_i} + |\Omega_i|\,\alpha\, x^2_{ni}, \qquad \bar{w}^*_i = |\Omega_i|\,\alpha\, x_{ni}\,(R_n + x_{ni}\mu'_{w_i}). \qquad (4.32)$$

• The update rule for the parameters of $\mathring{v}^*_{ik} = \{\bar{v}^*_{ik}, \hat{v}^*_{ik}\}$ given the $n$th data point is as follows:

$$\hat{v}^*_{ik} = \sigma^v_{c_i,k} + |\Omega_i|\,\alpha\, x^2_{ni}\,\big(S_1(i,k)^2 + S_2(i,k)\big), \qquad \bar{v}^*_{ik} = |\Omega_i|\,\alpha\, x_{ni}\, S_1(i,k)\,\big(R_n + x_{ni}\mu'_{v_{ik}} S_1(i,k)\big). \qquad (4.33)$$
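To make the mini-batch step concrete, the sketch below assembles Eqs. (4.30), (4.33) and (4.29) for a single factor parameter $v_{ik}$. It is a simplified illustration under our own naming; the per-point caches $S_1(i,k)$ and $S_2(i,k)$ and the residuals $R_n$ are assumed to be available as arrays for the current batch.

```python
import numpy as np

def ovbfm_update_vik(x_col, R, S1, S2, alpha, sigma_v_c, omega_i,
                     v_bar_old, v_hat_old, mu_vik, eta):
    """One mini-batch natural-gradient update of v_ik (sketch of Eqs. 4.29-4.33).

    x_col    : (B,) i-th feature column for the current batch
    R, S1, S2: (B,) residuals and pairwise caches S1(i,k), S2(i,k) per point
    alpha    : noise precision;  sigma_v_c : prior precision for this group
    omega_i  : |Omega_i|, number of non-zero entries of feature i in the data
    v_bar_old, v_hat_old : current natural parameters of q(v_ik)
    mu_vik   : current variational mean of v_ik;  eta : step size
    """
    nz = np.nonzero(x_col)[0]                     # points with x_ni != 0
    if nz.size == 0:
        return v_bar_old, v_hat_old

    x, r, s1, s2 = x_col[nz], R[nz], S1[nz], S2[nz]
    # Per-data-point optimal natural parameters, Eq. (4.33)
    v_hat_n = sigma_v_c + omega_i * alpha * x**2 * (s1**2 + s2)
    v_bar_n = omega_i * alpha * x * s1 * (r + x * mu_vik * s1)
    # Mini-batch average, Eq. (4.30)
    v_hat_avg, v_bar_avg = v_hat_n.mean(), v_bar_n.mean()
    # Interpolated natural-gradient step, Eq. (4.29)
    v_bar_new = (1.0 - eta) * v_bar_old + eta * v_bar_avg
    v_hat_new = (1.0 - eta) * v_hat_old + eta * v_hat_avg
    return v_bar_new, v_hat_new
```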

Algorithm 4 describes the detailed procedure of OVBFM. In each iteration, the dataset is partitioned into $B$ random batches, and OVBFM loops through these batches sequentially. For a given batch, lines 3–27 update all the natural parameters, and lines 28–39 update all the model hyperparameters. Repeating these steps over the $B$ batches completes one full iteration of OVBFM, and the whole procedure is run for $M$ iterations.

Table 4.1: Description of the datasets.

Dataset          No. of Users   No. of Movies   No. of Entries
Movielens 1M     6,040          3,900           1M
Movielens 10M    71,567         10,681          10M
Netflix          480,189        17,770          100M
KDD Music        1,000,990      624,961         263M
