4.3 Approximate Inference
4.3.2 Stochastic Variational Inference
Algorithm 3: Variational Bayesian Factorization Machine (VBFM)
Require: $\alpha$, $\sigma_0$, $\sigma^w_{c_i}$, $\sigma^v_{c_i k}$ $\forall i, k$
Ensure: Randomly initialize $\sigma'_0$, $\mu'_0$, $\sigma'_{w_i}$, $\mu'_{w_i}$, $\sigma'_{v_{ik}}$, $\mu'_{v_{ik}}$ $\forall i, k$
Ensure: Compute $R_n$ for all the training data points.
 1: for $t = 1$ to $M$ do
 2:   // Update $w_0$'s parameters
 3:   $\sigma_{old} \leftarrow \sigma'_0$
 4:   $\mu_{old} \leftarrow \mu'_0$
 5:   $\sigma'_0 \leftarrow (\sigma_0 + \alpha N)^{-1}$
 6:   $\mu'_0 \leftarrow \sigma'_0\,\alpha \sum_{n=1}^{N} (R_n + \mu'_0)$
 7:   for $n = 1$ to $N$ do
 8:     $R_n \leftarrow R_n + \mu_{old} - \mu'_0$
 9:   end for
10:   // Update $w_i$'s parameters
11:   for $i = 1$ to $D$ do
12:     $\sigma_{old} \leftarrow \sigma'_{w_i}$
13:     $\mu_{old} \leftarrow \mu'_{w_i}$
14:     $\sigma'_{w_i} \leftarrow \left(\sigma^w_{c_i} + \alpha \sum_{n=1}^{N} x_{ni}^2\right)^{-1}$
15:     $\mu'_{w_i} \leftarrow \sigma'_{w_i}\,\alpha \sum_{n=1}^{N} x_{ni}\left(R_n + x_{ni}\mu'_{w_i}\right)$
16:     for $n \in \Omega_i$ do
17:       $R_n \leftarrow R_n + x_{ni}(\mu_{old} - \mu'_{w_i})$
18:     end for
19:   end for
20:   // Update $v_{ik}$'s parameters
21:   for $k = 1$ to $K$ do
22:     for $i = 1$ to $D$ do
23:       $\sigma_{old} \leftarrow \sigma'_{v_{ik}}$
24:       $\mu_{old} \leftarrow \mu'_{v_{ik}}$
25:       $\sigma'_{v_{ik}} \leftarrow \left(\sigma^v_{c_i k} + \alpha \sum_{n=1}^{N} x_{ni}^2 \left(S_1(i,k)^2 + S_2(i,k)\right)\right)^{-1}$
26:       $\mu'_{v_{ik}} \leftarrow \sigma'_{v_{ik}}\,\alpha \sum_{n=1}^{N} x_{ni} S_1(i,k) \left[R_n + x_{ni}\mu'_{v_{ik}} S_1(i,k)\right]$
27:       for $n \in \Omega_i$ do
28:         $R_n \leftarrow R_n + x_{ni} S_1(i,k)(\mu_{old} - \mu'_{v_{ik}})$
29:       end for
30:     end for
31:   end for
32:   // Update hyperparameters
33:   $\alpha \leftarrow N \big/ \sum_{n=1}^{N} \left(R_n^2 + T_n\right)$
34:   $\sigma_0 \leftarrow 1 \big/ \left({\mu'_0}^2 + \sigma'_0\right)$
35:   for $i = 1$ to $|c|$ do
36:     $\sigma^w_{c_i} \leftarrow \sum_{j \in c_i} 1 \big/ \sum_{j \in c_i} \left({\mu'_{w_j}}^2 + \sigma'_{w_j}\right)$
37:   end for
38:   for $k = 1$ to $K$ do
39:     for $i = 1$ to $|c|$ do
40:       $\sigma^v_{c_i k} \leftarrow \sum_{j \in c_i} 1 \big/ \sum_{j \in c_i} \left({\mu'_{v_{jk}}}^2 + \sigma'_{v_{jk}}\right)$
41:     end for
42:   end for
43: end for
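To illustrate the residual-caching pattern that makes these coordinate updates efficient, the following Python sketch implements the $w_i$ sweep (lines 11-19 of Algorithm 3). It is a minimal illustration, not the reference implementation: the dense matrix `X`, the function name, and the array layout are assumptions for readability, and a practical version would use a sparse design matrix.

    import numpy as np

    def update_linear_weights(X, R, mu_w, sigma_w, prec_c_w, alpha):
        """One sweep of the w_i updates (lines 11-19 of Algorithm 3).

        X        : (N, D) design matrix (dense here for brevity)
        R        : (N,)   cached residuals R_n, updated in place
        mu_w     : (D,)   variational means mu'_{w_i}, updated in place
        sigma_w  : (D,)   variational variances sigma'_{w_i}, updated in place
        prec_c_w : (D,)   prior precision sigma^w_{c_i} of each feature's group
        alpha    : float  noise precision
        """
        N, D = X.shape
        for i in range(D):
            mu_old = mu_w[i]
            xi = X[:, i]
            # Line 14: sigma'_{w_i} <- (sigma^w_{c_i} + alpha * sum_n x_ni^2)^(-1)
            sigma_w[i] = 1.0 / (prec_c_w[i] + alpha * np.sum(xi ** 2))
            # Line 15: mu'_{w_i} <- sigma'_{w_i} alpha sum_n x_ni (R_n + x_ni mu'_{w_i})
            mu_w[i] = sigma_w[i] * alpha * np.sum(xi * (R + xi * mu_old))
            # Lines 16-18: patch the cached residuals for n in Omega_i
            nz = np.nonzero(xi)[0]
            R[nz] += xi[nz] * (mu_old - mu_w[i])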
Therefore, in the implementation, after sub-sampling a data instance $n$ uniformly at random from the given dataset, the noisy estimate of $\mathcal{L}(q, \theta)$ can be computed as follows:
$\mathcal{L}_{noisy}(q, \theta) = s_n^{-1} F_n + F_0 + \sum_{i=1;\, i \in n}^{D} F_i^w + \sum_{i=1;\, i \in n}^{D} \sum_{k=1}^{K} F_{ik}^v,$   (4.25)
where $s_n$ is the rescaling constant. Eq. (4.25) is the rescaled version of Eq. (4.12). The rescale factors for $w_0$, $w_i$, and $v_{ik}$ are set to $N$, $|\Omega_i|$, and $|\Omega_i|$, respectively. The variational parameters associated with $q(Z)$ are updated by making a small step in the direction of the gradient of Eq. (4.25). Since the natural gradient leads to faster convergence [Amari, 1998; Hoffman et al., 2013], the natural parameters of $q(Z)$ are considered for the updates. The natural gradient of a function accounts for the information geometry of its parameter space.
The classical gradient method for maximization tries to find a maximum of a function by taking small steps in the direction of the gradient. The gradient (when it exists) points in the direction of steepest ascent. However, the Euclidean metric might not capture a meaningful notion of distance [Hoffman et al., 2013]. The natural gradient corrects for this issue by redefining the basic definition of the gradient [Amari, 1998]. While the Euclidean gradient points in the direction of steepest ascent in Euclidean space, the natural gradient points in the direction of steepest ascent in Riemannian space, that is, the space where local distance is defined by the KL divergence rather than the L2 norm.
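As a toy numerical illustration of this difference (purely illustrative, not part of the derivation), consider fitting the mean of a univariate Gaussian with a fixed, large variance: the Euclidean gradient of the average log-likelihood shrinks with $1/\sigma^2$, while the natural gradient, which is the Euclidean gradient preconditioned by the inverse Fisher information $\sigma^2$, stays well scaled.

    import numpy as np

    # Fit the mean mu of N(mu, sigma2) with sigma2 fixed and large.
    # Euclidean gradient of the average log-likelihood w.r.t. mu:
    #   (x_bar - mu) / sigma2          -> vanishingly small steps
    # Fisher information is 1 / sigma2, so the natural gradient is
    #   sigma2 * (x_bar - mu) / sigma2 = x_bar - mu -> well scaled
    rng = np.random.default_rng(0)
    x = rng.normal(loc=3.0, scale=10.0, size=10_000)
    sigma2 = 100.0
    mu = 0.0
    for _ in range(20):
        euclid_grad = (x.mean() - mu) / sigma2
        natural_grad = sigma2 * euclid_grad   # precondition by inverse Fisher
        mu += natural_grad                    # unit step: converges immediately
    print(mu)                                 # ~= x.mean() ~= 3.0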
Natural parameters are represented as $\bar{v}_{ik} = \mu'_{v_{ik}} / \sigma'_{v_{ik}}$, $\hat{v}_{ik} = 1 / \sigma'_{v_{ik}}$, and $\mathring{v}_{ik} = \{\bar{v}_{ik}, \hat{v}_{ik}\}$, with $\mathring{v}_{ik}$ denoting the natural parameter corresponding to $v_{ik}$. Also, the natural gradient of $\mathcal{L}_{noisy}(q, \theta)$ with respect to $\mathring{v}_{ik}$ is given by $\nabla \mathcal{L}'(\mathring{v}_{ik})$. As the model is conditionally conjugate [Hoffman et al., 2013], $\nabla \mathcal{L}'(\mathring{v}_{ik}) = \mathring{v}_{ik}^* - \mathring{v}_{ik}$, where $\mathring{v}_{ik}^* = \{\bar{v}_{ik}^*, \hat{v}_{ik}^*\}$ is the value of $\mathring{v}_{ik}$ that maximizes Eq. (4.25). Therefore, the update equation for $\mathring{v}_{ik}$ can be written as:
$\mathring{v}_{ik}^{new} = \mathring{v}_{ik}^{old} + \eta_i^v (\mathring{v}_{ik}^* - \mathring{v}_{ik}^{old}) = (1 - \eta_i^v)\,\mathring{v}_{ik}^{old} + \eta_i^v\,\mathring{v}_{ik}^*,$   (4.26)
where $\eta_i^v$ is the step size corresponding to $\mathring{v}_{ik}$. The step sizes $\eta_0^w$, $\eta_i^w$, and $\eta_i^v$ are updated each time the corresponding parameters are updated, following the Robbins-Monro conditions [Hoffman et al., 2013], which ensure convergence. In particular, let $t_{w_0}$, $t_{w_i}$, and $t_{v_{ik}}$ be the number of times the corresponding parameters have been updated.
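A minimal sketch of the corresponding bookkeeping (the helper names are hypothetical) converts between $(\mu', \sigma')$ and the natural parameters and applies the step of Eq. (4.26):

    def to_natural(mu, sigma):
        # (mu', sigma') -> (v_bar, v_hat) = (mu'/sigma', 1/sigma')
        return mu / sigma, 1.0 / sigma

    def from_natural(v_bar, v_hat):
        # inverse map: sigma' = 1/v_hat, mu' = v_bar/v_hat
        return v_bar / v_hat, 1.0 / v_hat

    def natural_step(current, optimum, eta):
        # Eq. (4.26): since the natural gradient is (optimum - current) for a
        # conditionally conjugate model, the update is a convex combination
        # of the old natural parameters and the maximizing ones.
        v_bar, v_hat = current
        v_bar_opt, v_hat_opt = optimum
        return ((1 - eta) * v_bar + eta * v_bar_opt,
                (1 - eta) * v_hat + eta * v_hat_opt)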
Algorithm 4: Online Variational Bayesian Factorization Machine (OVBFM)
Require: $\alpha$, $\sigma_0$, $\sigma^w_{c_i}$, $\sigma^v_{c_i k}$, $\eta_i$ $\forall i, k$
Ensure: Randomly initialize $\sigma'_0$, $\mu'_0$, $\sigma'_{w_i}$, $\mu'_{w_i}$, $\sigma'_{v_{ik}}$, $\mu'_{v_{ik}}$ $\forall i, k$
 1: for $t = 1$ to $M$ do
 2:   for $s \in B$ do
 3:     Compute $R_n$ $\forall n \in s$
 4:     // Update $w_0$'s parameters
 5:     for $n \in s$ do
 6:       Update $\mathring{w}_0^{avg}$ using (4.31)
 7:     end for
 8:     Update $\mathring{w}_0$ using (4.29)
 9:     Update $R_n$ as in Algorithm 3 for the current batch
10:     // Update $w_i$'s parameters
11:     for $i = 1$ to $D$ do
12:       for $n \in \Omega_i$ do
13:         Update $\mathring{w}_i^{avg}$ using (4.32)
14:       end for
15:       Update $\mathring{w}_i$ using (4.29)
16:       Update $R_n$ as in Algorithm 3 for the current batch
17:     end for
18:     // Update $v_{ik}$'s parameters
19:     for $k = 1$ to $K$ do
20:       for $i = 1$ to $D$ do
21:         for $n \in \Omega_i$ do
22:           Update $\mathring{v}_{ik}^{avg}$ using (4.33)
23:         end for
24:         Update $\mathring{v}_{ik}$ using (4.29)
25:         Update $R_n$ as in Algorithm 3 for the current batch
26:       end for
27:     end for
28:     Update $\eta_0^w$, $\eta_i^w$, $\eta_i^v$ using (4.27) and (4.28)
29:     // Update hyperparameters
30:     $\alpha \leftarrow (1 - \eta_0^w)\alpha + \eta_0^w \left( |s| \big/ \sum_{n=1}^{|s|} \left(R_n^2 + T_n\right) \right)$
31:     $\sigma_0 \leftarrow (1 - \eta_0^w)\sigma_0 + \eta_0^w \left( 1 \big/ \left({\mu'_0}^2 + \sigma'_0\right) \right)$
32:     for $i = 1$ to $|c|$ do
33:       $\sigma^w_{c_i} \leftarrow (1 - \eta_i^w)\sigma^w_{c_i} + \eta_i^w \left( \sum_{j \in c_i} 1 \big/ \sum_{j \in c_i} \left({\mu'_{w_j}}^2 + \sigma'_{w_j}\right) \right)$
34:     end for
35:     for $k = 1$ to $K$ do
36:       for $i = 1$ to $|c|$ do
37:         $\sigma^v_{c_i k} \leftarrow (1 - \eta_i^v)\sigma^v_{c_i k} + \eta_i^v \left( \sum_{j \in c_i} 1 \big/ \sum_{j \in c_i} \left({\mu'_{v_{jk}}}^2 + \sigma'_{v_{jk}}\right) \right)$
38:       end for
39:     end for
40:   end for
41: end for
Then the update rules can be written as follows:
$\eta_0^w = (1 + t_{w_0})^{-\lambda}, \qquad \eta_i^w = (1 + t_{w_i})^{-\lambda} \quad \forall i \in \{1, 2, \cdots, D\},$   (4.27)
$\eta_i^v = (1 + t_{v_{ik}})^{-\lambda} \quad \forall i \in \{1, 2, \cdots, D\} \text{ and } \forall k \in \{1, 2, \cdots, K\},$   (4.28)
where $\lambda \in (0.5, 1)$. In all the experiments, the minimum value of $\lambda$ produced the best results; therefore $\lambda$ is set to 0.5.
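A minimal sketch of this schedule (names hypothetical):

    def step_size(t, lam=0.5):
        # Eqs. (4.27)-(4.28): eta_t = (1 + t)^(-lam). For lam in (0.5, 1),
        # sum_t eta_t diverges while sum_t eta_t^2 converges, which are the
        # Robbins-Monro conditions that guarantee convergence.
        return (1.0 + t) ** (-lam)

    # e.g., after the 10th update of w_0:
    # eta_w0 = step_size(10)   # ~0.30 for lam = 0.5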
To reduce the variance due to the noisy estimate of the gradient, a mini-batch version is considered with a batch size of $|s|$ points. To update a parameter, for example $\mathring{v}_{ik}$, the values $\{\bar{v}_{ik}^*, \hat{v}_{ik}^*\}$ are computed and stored for all the data instances with non-zero feature values in the $i$th column. The update of $\mathring{v}_{ik}$ can then be derived as follows:
$\mathring{v}_{ik}^{new} = (1 - \eta_i^v)\,\mathring{v}_{ik}^{old} + \eta_i^v\,\mathring{v}_{ik}^{avg},$   (4.29)
where
$\mathring{v}_{ik}^{avg} = \frac{1}{n_i} \sum_{n=1}^{n_i} \mathring{v}_{ik}^{*,n}.$   (4.30)
Here $n_i$ is the number of non-zero entries in the $i$th column of the design matrix constructed from the current batch, and $\mathring{v}_{ik}^{*,n}$ is the value of $\mathring{v}_{ik}^*$ produced when the $n$th data point is considered. The detailed update equations for the parameter set $\{\mathring{w}_0^*, \mathring{w}_i^*, \mathring{v}_{ik}^*\}$, which are used in Eq. (4.29) to calculate the variational parameters $\{\mathring{w}_0, \mathring{w}_i, \mathring{v}_{ik}\}$, are as follows (a code sketch follows the list).
• The update rule for the parameters of $\mathring{w}_0^* = \{\bar{w}_0^*, \hat{w}_0^*\}$ given the $n$th data point is:
$\hat{w}_0^* = \sigma_0 + N\alpha, \qquad \bar{w}_0^* = N\alpha\,(R_n + \mu'_0).$   (4.31)
• The update rule for the parameters of $\mathring{w}_i^* = \{\bar{w}_i^*, \hat{w}_i^*\}$ given the $n$th data point is:
$\hat{w}_i^* = \sigma^w_{c_i} + |\Omega_i|\,\alpha\, x_{ni}^2, \qquad \bar{w}_i^* = |\Omega_i|\,\alpha\, x_{ni}\left(R_n + x_{ni}\mu'_{w_i}\right).$   (4.32)
• The update rule for the parameters of $\mathring{v}_{ik}^* = \{\bar{v}_{ik}^*, \hat{v}_{ik}^*\}$ given the $n$th data point is:
$\hat{v}_{ik}^* = \sigma^v_{c_i k} + |\Omega_i|\,\alpha\, x_{ni}^2 \left(S_1(i,k)^2 + S_2(i,k)\right), \qquad \bar{v}_{ik}^* = |\Omega_i|\,\alpha\, x_{ni} S_1(i,k) \left(R_n + x_{ni}\mu'_{v_{ik}} S_1(i,k)\right).$   (4.33)
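To make these updates concrete for the linear weights $w_i$, the following sketch combines Eqs. (4.29), (4.30), and (4.32) on one mini-batch. It is a minimal illustration under the notation above; the function name, the dense batch matrix `Xb`, and the in-place arrays are assumptions, not the thesis implementation.

    import numpy as np

    def batch_update_w_i(i, Xb, Rb, mu_w, sigma_w, prec_c_w, alpha, n_omega_i, eta):
        """One mini-batch update of w_i's natural parameters.

        Xb, Rb    : design matrix and cached residuals of the current batch
        n_omega_i : |Omega_i|, the rescaling factor for w_i over the full data
        eta       : step size eta_i^w from Eqs. (4.27)-(4.28)
        """
        nz = np.nonzero(Xb[:, i])[0]      # batch rows with x_ni != 0
        if len(nz) == 0:
            return
        x = Xb[nz, i]
        # Per-point optima (Eq. 4.32), then average over the batch (Eq. 4.30)
        w_hat_star = prec_c_w[i] + n_omega_i * alpha * x ** 2
        w_bar_star = n_omega_i * alpha * x * (Rb[nz] + x * mu_w[i])
        w_hat_avg, w_bar_avg = w_hat_star.mean(), w_bar_star.mean()
        # Current natural parameters, then the step of Eq. (4.29)
        w_bar = mu_w[i] / sigma_w[i]
        w_hat = 1.0 / sigma_w[i]
        w_bar = (1 - eta) * w_bar + eta * w_bar_avg
        w_hat = (1 - eta) * w_hat + eta * w_hat_avg
        # Map back to (mu', sigma') and keep the batch residuals consistent
        mu_old = mu_w[i]
        sigma_w[i] = 1.0 / w_hat
        mu_w[i] = w_bar / w_hat
        Rb[nz] += x * (mu_old - mu_w[i])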
Algorithm 4 describes the detailed procedure of OVBFM. In each iteration, the dataset is partitioned into $B$ random batches, through which OVBFM loops sequentially. For a given batch, lines 3-27 update all the natural parameters, and lines 28-39 update all the model hyperparameters. Repeating these steps over the $B$ batches completes one full iteration of OVBFM, and the whole procedure is run for $M$ iterations.
Table 4.1: Description of the datasets.

Dataset         No. of Users   No. of Movies   No. of Entries
Movielens 1m    6040           3900            1m
Movielens 10m   71567          10681           10m
Netflix         480189         17770           100m
KDD Music       1000990        624961          263m