

Chapter II: Background

2.2 Probabilistic Models

Set-Up: Probability Distributions & Maximum Likelihood

Consider a random variable, X ∈ ℝ^M, with a corresponding distribution, p_data(X), defining the probability of observing each possible observation, X = x. We will use the shorthand notation p_data(x) to denote the probability p_data(X = x). This distribution is the result of an underlying data generation process, e.g. the emission and scattering of photons. While we do not have direct access to p_data, we can sample observations, x ∼ p_data(x), yielding an empirical distribution, p̂_data(x).

Often, we wish to model p_data, for instance, to predict or compress observations of X. We refer to this model as p_θ(x), with parameters θ. A natural approach for estimating the model involves minimizing some divergence measure between p_data(x) and p_θ(x) with respect to the model parameters, θ. The Kullback-Leibler (KL) divergence, denoted D_KL, is a common choice of divergence measure:

πœƒβˆ— ← arg min

πœƒ

𝐷KL(𝑝data(x) ||π‘πœƒ(x)), (2.1) πœƒβˆ— ← arg min

πœƒ ExβˆΌπ‘data(x) [log𝑝data(x) βˆ’logπ‘πœƒ(x)], (2.2) πœƒβˆ— ← arg max

πœƒ ExβˆΌπ‘data(x) [logπ‘πœƒ(x)]. (2.3) Here,Edenotes the expectation operator. We have gone from Eq. 2.1 to Eq. 2.2 using the definition of the KL divergence. Because𝑝data(x) does not depend onπœƒ, we can omit it from the optimization, expressing the equivalent maximization in Eq. 2.3.

This is the standard maximum log-likelihood objective, which is found throughout machine learning and probabilistic modeling (Murphy, 2012; Goodfellow, Y. Bengio, and Courville, 2016). In practice, we do not have access to p_data(x) and must instead approximate the objective using data samples, i.e. using p̂_data(x).

Figure 2.1: Maximum Likelihood Estimation. A data distribution, p_data, produces samples, x, of a random variable X. Maximum likelihood estimation attempts to fit a model distribution, p_θ, to the empirical distribution of samples by maximizing the log-likelihood of data samples under the model (Eq. 2.3).

With x^(i) as the i-th observation sample out of N total samples, we can approximate the objective as the following empirical average:

ExβˆΌπ‘data(x) [logπ‘πœƒ(x)] β‰ˆExβˆΌπ‘bdata(x) [logπ‘πœƒ(x)] = 1 𝑁

𝑁

Γ•

𝑖=1

logπ‘πœƒ(x(𝑖)). (2.4) In summary, we can frame probabilistic modeling as the process of maximizing the log-likelihood, logπ‘πœƒ(x), of data observations under our model. A diagram of this setup is shown in Figure 2.1.

Model Formulation

Formulating a probabilistic model involves considering the dependency structure of the model and the parameterization of these dependencies. As with all models, there is a bias-variance decomposition of modeling error. Without considering a sufficiently flexible dependency structure and parameterization, the model will be inherently limited, i.e. biased, in its capacity to accurately model the data. Yet, with an overly expressive model, containing many dependencies and parameters, there is a risk of overfitting to the training examples, the outcome of a model with high variance.

Dependency Structure

The dependency structure of a probabilistic model is the set of conditional dependencies between variables. One common form of dependency structure is that of autoregression (Frey, G. E. Hinton, and Dayan, 1996; Y. Bengio and S. Bengio, 2000), which utilizes the chain rule of probability to model dependencies between variables:

π‘πœƒ(x) =

𝑀

Γ–

𝑗=1

π‘πœƒ(π‘₯𝑗|π‘₯< 𝑗). (2.5)

Here, we have induced an arbitrary ordering over the M dimensions of x, allowing us to factor the joint distribution over dimensions, p_θ(x), into a product of M conditional distributions, each conditioned on the previous dimensions, x_{<j}. Slightly abusing notation, at j = 1, we have p_θ(x_j | x_{<j}) = p_θ(x_1). A natural use-case for this dependency structure arises in modeling sequential data, where time provides an ordering over a sequence of T observed variables, x_{1:T}:

π‘πœƒ(x1:𝑇)=

𝑇

Γ–

𝑑=1

π‘πœƒ(x𝑑|x<𝑑). (2.6)

Although such models are conventionally formulated in forward temporal order, this is truly a modeling assumption. Likewise, while the chain rule of probability dictates that we must consider all previous variables, there may be cases where it is safe to assume conditional independence outside of some window. In the extreme case, in which we only consider pairwise dependencies, we arrive at a Markov chain:

π‘πœƒ(x1:𝑇) =ΓŽπ‘‡

𝑑=1π‘πœƒ(x𝑑|xπ‘‘βˆ’1).

Autoregressive models are also referred to as “fully-visible” models (Frey, G. E. Hinton, and Dayan, 1996), as dependencies are only explicitly modeled between observed variables. However, we can also model such dependencies by introducing latent variables, denoted as Z. Formally, a latent variable model is defined by the joint distribution

π‘πœƒ(x,z) = π‘πœƒ(x|z)π‘πœƒ(z), (2.7) where π‘πœƒ(x|z)is theconditional likelihoodand π‘πœƒ(z) is theprior. Again, we have used the shorthand notation π‘πœƒ(x,z)to denote π‘πœƒ(𝑋 =x, 𝑍 =z). Introducing latent variables is one of, if not,theprimary technique for increasing the flexibility of a probabilistic model. This is because evaluating the probability of an observation now requires marginalizing over the latent variables. If𝑍 is a continuous variable, this involves integration, π‘πœƒ(x) = ∫

π‘πœƒ(x,z)𝑑z, and if𝑍 is discrete, this involves

34

Figure 2.2: Dependency Structures. Circles denote random variables, with gray denoting observed variables and white denoting latent variables. Arrows denote probabilistic conditional dependencies. From left to right: autoregressive model (Eqs. 2.5 & 2.6), latent variable model (Eq. 2.7), hierarchical latent variable model (Eq. 2.10), autoregressive or sequential latent variable model (Eq. 2.11). For clarity, we have drawn a subset of the possible dependencies in the final model.

summation, π‘πœƒ(x)=Í

zπ‘πœƒ(x,z). In either case, we have

π‘πœƒ(x) =EzβˆΌπ‘πœƒ(z) [π‘πœƒ(x|z)], (2.8) which illustrates that π‘πœƒ(x)is amixturedistribution, with each mixture component,

π‘πœƒ(x|z), weighted according toπ‘πœƒ(z). Thus, even when restrictingπ‘πœƒ(x|z)to simple distribution forms, such as Gaussian distributions, π‘πœƒ(x)can take on flexible forms that do not have closed form analytical expressions. In this way, 𝑍 can implicitly model dependencies in 𝑋, assigning higher probability to particular regions of the observation space.

However, increasing flexibility through latent variables comes with increased computational overhead. In general, marginalizing over Z is not analytically tractable, particularly with continuous latent variables or complex conditional likelihoods. This requires us to either 1) adopt approximation techniques, which we discuss in Section 2.3, or 2) restrict the form of the model to ensure computationally tractable evaluation of p_θ(x). This latter approach is the basis of flow-based models (Tabak and Turner, 2013; Rippel and Adams, 2013; Dinh, Krueger, and Y. Bengio, 2015), which define the conditional dependency between X and Z in terms of an invertible transform, x = f_θ(z) and z = f_θ^{-1}(x). We can then express p_θ(x) using the change of variables formula:

p_θ(x) = p_θ(z) |det ∂x/∂z|^{−1},    (2.9)

where πœ•z is the Jacobian of the transform and det denotes matrix determinant.

The term det

πœ•x

πœ•z

βˆ’1

can be interpreted as the local scaling of space when moving from 𝑍 to 𝑋, conserving probability mass in the transform. Flow-based models, also referred to asnormalizing flows(Rezende and Mohamed, 2015), are the basis of the classical technique of independent components analysis (ICA) (Bell and Sejnowski, 1997; HyvΓ€rinen and Oja, 2000) and non-linear generalizations (S. S.

Chen and Gopinath, 2001; Laparra, Camps-Valls, and Malo, 2011). As such, these models can serve as a general-purpose mechanism for adding and removing statistical dependencies between variables. Although flow-based models avoid the intractability of marginalization, their requirement of invertibility may be overly restrictive or undesirable in some contexts (Cornish et al., 2020). And while the change of variables formula can also be applied to non-invertible transforms (Cvitkovic and Koliander, 2019), it raises computational intractabilities. This motivates the use of approximate techniques for training latent variables models (Section 2.3).
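The sketch below illustrates Eq. 2.9 for the simplest invertible transform, an affine map x = Az + b with a standard Gaussian prior on z; the particular A and b are arbitrary. In this linear-Gaussian case the marginal is available in closed form, N(x; b, AA^⊤), so the change-of-variables computation can be checked against it.

import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1.5, 0.3], [0.0, 0.8]])   # invertible transform x = A z + b
b = np.array([1.0, -2.0])

def log_prob_flow(x):
    z = np.linalg.solve(A, x - b)                     # z = f_theta^{-1}(x)
    log_prior = -0.5 * z @ z - len(z) / 2 * np.log(2 * np.pi)
    _, logdet = np.linalg.slogdet(A)                  # log |det dx/dz|
    return log_prior - logdet                         # Eq. 2.9 in log form

def log_prob_direct(x):
    # For this affine case the marginal is exactly Gaussian: N(x; b, A A^T).
    cov = A @ A.T
    diff = x - b
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (diff @ np.linalg.solve(cov, diff)
                   + logdet + len(x) * np.log(2 * np.pi))

x = rng.normal(size=2)
assert np.isclose(log_prob_flow(x), log_prob_direct(x))

More expressive flows compose many such invertible transforms, accumulating the log-determinant terms along the way.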

While we have presented autoregression and latent variables separately, these techniques can, in fact, be combined in numerous ways to model dependencies. For instance, one can create hierarchical latent variable models (Dayan et al., 1995), incorporating autoregressive dependencies between latent variables. Considering L levels of latent variables, Z_{1:L} = {Z_1, . . . , Z_L}, we can express the joint distribution as

p_θ(x, z_{1:L}) = p_θ(x | z_{1:L}) ∏_{ℓ=1}^{L} p_θ(z_ℓ | z_{ℓ+1:L}).    (2.10)

From this perspective, hierarchical latent variable models are a repeated application of the latent variables technique in order to create increasingly complex empirical priors (Efron and Morris, 1973). We can also consider incorporating latent variables within sequential (autoregressive) probabilistic models, giving rise to sequential latent variable models. Considering a single level of latent variables in a corresponding sequence, Z_{1:T}, we generally have the following joint distribution:

π‘πœƒ(x1:𝑇,z1:𝑇) =

𝑇

Γ–

𝑑=1

π‘πœƒ(x𝑑|x<𝑑,z≀𝑑)π‘πœƒ(z𝑑|x<𝑑,z<𝑑), (2.11) where we have again assumed a forward sequential ordering. By restricting the dependency structure and distribution forms in sequential latent variable models, we recover familiar special cases, such as hidden Markov models or linear Gaussian state-space models (Murphy, 2012). Beyond hierarchical and sequential latent variable models, there are a variety of other ways to combine autoregression and

latent variables (Gulrajani et al., 2017; Razavi, Aaron van den Oord, and Vinyals, 2019). The remaining chapters focus on hierarchical (Eq. 2.10) and sequential (Eq. 2.11) latent variable models, though models with more flexible hierarchical, sequential, and spatial dependencies will be required to advance the frontier of probabilistic modeling.
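The sketch below performs ancestral sampling from a sequential latent variable model of the form in Eq. 2.11, alternating draws of z_t and x_t at each step. For brevity, the conditionals are restricted to linear-Gaussian functions of only the most recent x and z; the coefficients and noise scales are arbitrary illustrative values.

import numpy as np

rng = np.random.default_rng(0)
T = 20
sigma_z, sigma_x = 0.2, 0.1

def sample_sequence():
    xs, zs = [], []
    x_prev, z_prev = 0.0, 0.0
    for t in range(T):
        # p_theta(z_t | x_{<t}, z_{<t}): here a linear function of the last values
        z_t = rng.normal(0.9 * z_prev + 0.1 * x_prev, sigma_z)
        # p_theta(x_t | x_{<t}, z_{<=t}): conditioned on the current latent as well
        x_t = rng.normal(0.5 * x_prev + 1.0 * z_t, sigma_x)
        zs.append(z_t)
        xs.append(x_t)
        x_prev, z_prev = x_t, z_t
    return np.array(xs), np.array(zs)

x, z = sample_sequence()
print(x[:5], z[:5])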

Parameterizing the Model

In the previous section, we discussed the dependency structure of probabilistic models. The probability distributions that define these dependencies are ultimately functions. In this section, we discuss possible forms that these functions may take.

We restrict our focus here to parametric distributions, which are defined by one or more distribution parameters. The canonical example is the Gaussian (or Normal) distribution, N(x; μ, σ²), which is defined by a mean, μ, and variance, σ². This can be extended to the multivariate setting, where x ∈ ℝ^M is modeled with a mean vector, µ, and covariance matrix, Σ, with the probability density written as

N(x; µ, Σ) = 1 / ( (2π)^{M/2} det(Σ)^{1/2} ) exp( −(1/2) (x − µ)^⊤ Σ^{−1} (x − µ) ).    (2.12)

For convenience, we may also consider diagonal covariance matrices, Σ = diag(σ²), simplifying the parameterization and resulting calculations. In particular, in the special case where Σ = I_M, the M × M identity matrix, the log-density, up to a constant, becomes the familiar mean squared error,

log N(x; µ, I) = −(1/2) ||x − µ||_2^2 + const.    (2.13)

With a parametric distribution, conditional dependencies are mediated by the distribution parameters, which are functions of the conditioning variables. For example, we can express an autoregressive Gaussian distribution (of the form in Eq. 2.5) through conditional densities, p_θ(x_j | x_{<j}) = N(x_j; μ_θ(x_{<j}), σ²_θ(x_{<j})), where μ_θ and σ²_θ are functions taking x_{<j} as input. A similar form applies to autoregressive models on sequences of vector inputs (Eq. 2.6), with p_θ(x_t | x_{<t}) = N(x_t; µ_θ(x_{<t}), Σ_θ(x_{<t})). Likewise, in a latent variable model (Eq. 2.7), we can express a Gaussian conditional likelihood as p_θ(x|z) = N(x; µ_θ(z), Σ_θ(z)). Note that in the above examples, we have overloaded notation, simply using a subscript θ for all functions. In practice, while it is not uncommon to share parameters across functions, this is ultimately a modeling choice.

The functions supplying each of the distribution parameters can range in complexity, from constant to highly non-linear. Classical modeling techniques often employ linear functions. For instance, in a latent variable model, we could parameterize the mean as a linear function of z:

µ_θ(z) = Wz + b,    (2.14)

where W is a matrix of weights and b is a bias vector. Models of this form underlie factor analysis, probabilistic principal components analysis (Tipping and Bishop, 1999), independent components analysis (Bell and Sejnowski, 1997; Hyvärinen and Oja, 2000), and sparse coding (Olshausen and Field, 1996). Linear autoregressive models are also the basis of many classical time-series models and are common across fields that use statistical methods. While linear models are relatively computationally efficient, they are often too limited to accurately model complex data distributions, e.g. those found in natural images or audio.
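The sketch below generates data from a linear Gaussian latent variable model with mean µ_θ(z) = Wz + b, in the spirit of probabilistic PCA; the weights, bias, and noise scale are arbitrary illustrative values. For this linear case the implied marginal covariance is WW^⊤ + σ²I, which the sample covariance should approximate.

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 2))          # weights mapping a 2-d latent to 5-d observations
b = np.zeros(5)                      # bias vector
noise_std = 0.1

def generate(n):
    z = rng.normal(size=(n, 2))                          # z ~ p_theta(z) = N(0, I)
    mean = z @ W.T + b                                   # mu_theta(z) = W z + b
    return mean + noise_std * rng.normal(size=(n, 5))    # x ~ N(mu_theta(z), sigma^2 I)

x = generate(1000)
# The residual between the sample covariance and W W^T + sigma^2 I should be small.
print(np.round(np.cov(x.T) - (W @ W.T + noise_std**2 * np.eye(5)), 2))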

Recent improvements in deep learning (Goodfellow, Y. Bengio, and Courville, 2016) have provided probabilistic models with a more expressive class of non-linear functions, improving their modeling capacity. In these models, the distribution parameters are parameterized with deep networks, which are then trained by backpropagating (Rumelhart, G. E. Hinton, and Williams, 1986) the gradient of the log-likelihood objective, ∇_θ E_{x∼p̂_data} [log p_θ(x)], back through the layers of the network. Typically, this gradient is estimated using a mini-batch of data examples, rather than the full empirical distribution as in Eq. 2.4. Deep autoregressive models and deep latent variable models have enabled recent advances across an array of areas, including speech (Graves, 2013; Aäron van den Oord et al., 2016), natural language (Sutskever, Vinyals, and Le, 2014; Radford et al., 2019), images (Razavi, Aaron van den Oord, and Vinyals, 2019), video (Kumar et al., 2020), reinforcement learning (Chua et al., 2018; Ha and Schmidhuber, 2018), and many others.

We visualize a simplified probabilistic computation graph for a deep autoregressive model in Figure 2.3. This diagram translates the autoregressive dependency structure from Figure 2.2, which only depicts the variables and their dependencies in the model, into a more detailed visualization, breaking the variables into their corresponding distributions and terms in the log-likelihood objective. Here, green circles denote the conditional likelihood at each step, containing a Gaussian mean and standard deviation, which are parameterized by a deep network. The log-likelihood, log p_θ(x_t | x_{<t}), evaluated at the data observation, x_t ∼ p_data(x_t | x_{<t}) (gray circle), provides the objective (red dot). The gradient of this objective w.r.t. the network parameters is calculated through backpropagation (red dotted line). This general depiction of dependencies, distributions, and log-probabilities (or differences of log-probabilities) is used throughout this thesis to describe the various computational setups.

Figure 2.3: Model Parameterization & Computation Graph. The diagram depicts a simplified computation graph for a deep autoregressive Gaussian model. Green circles denote the conditional likelihood distribution at each step, while gray circles again denote the (distribution of) data observations. Smaller red circles denote each of the log-likelihood terms in the objective. Gradients w.r.t. these terms are backpropagated through the networks parameterizing the model's distribution parameters (red dotted lines).

Purely autoregressive models (without latent variables) have proven useful in many domains, often obtaining better log-likelihoods as compared with latent variable models. However, there are a number of reasons to prefer latent variable models in some contexts. First, autoregressive sampling is inherently sequential, and this linear computational scaling becomes costly in high-dimensional domains. Second, latent variables provide a representational space for downstream tasks, compression, and overall data analysis. Finally, latent variables provide added flexibility, which is particularly useful for modeling continuous random variables with relatively simple, e.g. Gaussian, conditional distributions. For these reasons, we require methods for handling the latent marginalization in Eq. 2.8. Variational inference is one such method.

2.3 Variational Inference