Chapter II: Background
2.2 Probabilistic Models
Set-Up: Probability Distributions & Maximum Likelihood
Consider a random variable, $\mathbf{X} \in \mathbb{R}^M$, with a corresponding distribution, $p_\text{data}(\mathbf{X})$, defining the probability of observing each possible observation, $\mathbf{X} = \mathbf{x}$. We will use the shorthand notation $p_\text{data}(\mathbf{x})$ to denote the probability $p_\text{data}(\mathbf{X} = \mathbf{x})$. This distribution is the result of an underlying data generation process, e.g., the emission and scattering of photons. While we do not have direct access to $p_\text{data}$, we can sample observations, $\mathbf{x} \sim p_\text{data}(\mathbf{x})$, yielding an empirical distribution, $\widehat{p}_\text{data}(\mathbf{x})$.
Often, we wish to model $p_\text{data}$, for instance, to predict or compress observations of $\mathbf{X}$. We refer to this model as $p_\theta(\mathbf{x})$, with parameters $\theta$. A natural approach for estimating the model involves minimizing some divergence measure between $p_\text{data}(\mathbf{x})$ and $p_\theta(\mathbf{x})$ with respect to the model parameters, $\theta$. The Kullback-Leibler (KL) divergence, denoted $D_\text{KL}$, is a common choice of divergence measure:
$$\theta^* \equiv \arg\min_\theta \; D_\text{KL}\left( p_\text{data}(\mathbf{x}) \, \| \, p_\theta(\mathbf{x}) \right), \qquad (2.1)$$
$$\theta^* \equiv \arg\min_\theta \; \mathbb{E}_{\mathbf{x} \sim p_\text{data}(\mathbf{x})} \left[ \log p_\text{data}(\mathbf{x}) - \log p_\theta(\mathbf{x}) \right], \qquad (2.2)$$
$$\theta^* \equiv \arg\max_\theta \; \mathbb{E}_{\mathbf{x} \sim p_\text{data}(\mathbf{x})} \left[ \log p_\theta(\mathbf{x}) \right]. \qquad (2.3)$$
Here, $\mathbb{E}$ denotes the expectation operator. We have gone from Eq. 2.1 to Eq. 2.2 using the definition of the KL divergence. Because $p_\text{data}(\mathbf{x})$ does not depend on $\theta$, we can omit it from the optimization, expressing the equivalent maximization in Eq. 2.3.
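As a minimal numerical illustration of this equivalence (a sketch assuming NumPy; the categorical family and parameter grid below are illustrative choices, not taken from the text), we can check that the minimizer of the KL divergence in Eq. 2.1 and the maximizer of the expected log-likelihood in Eq. 2.3 coincide:

```python
# Numerical check that argmin of the KL divergence (Eq. 2.2) equals argmax of
# the expected log-likelihood (Eq. 2.3). The 3-outcome data distribution and
# the one-parameter model family are illustrative assumptions.
import numpy as np

p_data = np.array([0.2, 0.5, 0.3])            # "true" distribution over 3 outcomes

# Model family: p_theta = [theta, (1 - theta) / 2, (1 - theta) / 2]
thetas = np.linspace(0.01, 0.99, 999)
p_theta = np.stack([thetas, (1 - thetas) / 2, (1 - thetas) / 2], axis=1)

kl = np.sum(p_data * (np.log(p_data) - np.log(p_theta)), axis=1)   # Eq. 2.2
expected_ll = np.sum(p_data * np.log(p_theta), axis=1)             # Eq. 2.3

assert np.argmin(kl) == np.argmax(expected_ll)   # same optimizer, theta*
print(thetas[np.argmin(kl)])                     # approximately 0.2
```

The two objectives differ only by the (constant) negative entropy of $p_\text{data}$, so they select the same parameter.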
This is the standard maximum log-likelihood objective, which is found throughout machine learning and probabilistic modeling (Murphy, 2012; Goodfellow, Y. Bengio, and Courville, 2016). In practice, we do not have access to $p_\text{data}(\mathbf{x})$ and must instead approximate the objective using data samples, i.e., using $\widehat{p}_\text{data}(\mathbf{x})$.
Figure 2.1: Maximum Likelihood Estimation. A data distribution, $p_\text{data}$, produces samples, $\mathbf{x}$, of a random variable $\mathbf{X}$. Maximum likelihood estimation attempts to fit a model distribution, $p_\theta$, to the empirical distribution of samples by maximizing the log-likelihood of data samples under the model (Eq. 2.3).
With $\mathbf{x}^{(i)}$ as the $i$th observation sample out of $N$ total samples, we can approximate the objective as the following empirical average:
$$\mathbb{E}_{\mathbf{x} \sim p_\text{data}(\mathbf{x})} \left[ \log p_\theta(\mathbf{x}) \right] \approx \mathbb{E}_{\mathbf{x} \sim \widehat{p}_\text{data}(\mathbf{x})} \left[ \log p_\theta(\mathbf{x}) \right] = \frac{1}{N} \sum_{i=1}^{N} \log p_\theta(\mathbf{x}^{(i)}). \qquad (2.4)$$
In summary, we can frame probabilistic modeling as the process of maximizing the log-likelihood, $\log p_\theta(\mathbf{x})$, of data observations under our model. A diagram of this setup is shown in Figure 2.1.
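The empirical average in Eq. 2.4 is straightforward to compute directly. The following sketch (assuming a univariate Gaussian model family; the data-generating parameters are illustrative, not from the text) evaluates it for two candidate models:

```python
# Evaluating the empirical average log-likelihood of Eq. 2.4 under a univariate
# Gaussian model. The data-generating mean and scale are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.5, scale=0.7, size=10_000)     # samples from p_data

def avg_log_likelihood(x, mu, sigma):
    """(1/N) sum_i log N(x_i; mu, sigma^2), the empirical average in Eq. 2.4."""
    return np.mean(-0.5 * np.log(2 * np.pi * sigma**2)
                   - 0.5 * (x - mu)**2 / sigma**2)

# The maximum-likelihood Gaussian uses the sample mean and standard deviation.
print(avg_log_likelihood(x, x.mean(), x.std()))      # higher value
print(avg_log_likelihood(x, 0.0, 1.0))               # lower, misspecified model
```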
Model Formulation
Formulating a probabilistic model involves considering the dependency structure of the model and the parameterization of these dependencies. As with all models, there is a bias-variance decomposition of modeling error. Without considering a sufficiently flexible dependency structure and parameterization, the model will be inherently limited, i.e. biased, in its capacity to accurately model the data. Yet, with an overly expressive model, containing many dependencies and parameters, there is a risk of overfitting to the training examples, the outcome of a model with high variance.
Dependency Structure
The dependency structure of a probabilistic model is the set of conditional dependencies between variables. One common form of dependency structure is that of autoregression (Frey, G. E. Hinton, and Dayan, 1996; Y. Bengio and S. Bengio, 2000), which utilizes the chain rule of probability to model dependencies between variables:
$$p_\theta(\mathbf{x}) = \prod_{j=1}^{M} p_\theta(x_j | x_{<j}). \qquad (2.5)$$
Here, we have induced an arbitrary ordering over the $M$ dimensions of $\mathbf{x}$, allowing us to factor the joint distribution over dimensions, $p_\theta(\mathbf{x})$, into a product of $M$ conditional distributions, each conditioned on the previous dimensions, $x_{<j}$. Slightly abusing notation, at $j = 1$, we have $p_\theta(x_j | x_{<j}) = p_\theta(x_1)$. A natural use-case for this dependency structure arises in modeling sequential data, where time provides an ordering over a sequence of $T$ observed variables, $\mathbf{x}_{1:T}$:
$$p_\theta(\mathbf{x}_{1:T}) = \prod_{t=1}^{T} p_\theta(\mathbf{x}_t | \mathbf{x}_{<t}). \qquad (2.6)$$
Although such models are conventionally formulated in forward temporal order, this is truly a modeling assumption. Likewise, while the chain rule of probability dictates that we must consider all previous variables, there may be cases where it is safe to assume conditional independence outside of some window. In the extreme case, in which we only consider pairwise dependencies, we arrive at a Markov chain: $p_\theta(\mathbf{x}_{1:T}) = \prod_{t=1}^{T} p_\theta(\mathbf{x}_t | \mathbf{x}_{t-1})$.
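As a concrete illustration of the factorized log-likelihood in Eq. 2.6, the following sketch evaluates $\log p_\theta(\mathbf{x}_{1:T})$ for the Markov (first-order autoregressive) Gaussian special case; the coefficient, noise scale, initial distribution, and sequence below are illustrative assumptions:

```python
# Log-likelihood of a sequence under a first-order autoregressive Gaussian
# model, p_theta(x_t | x_{t-1}) = N(x_t; a * x_{t-1}, s^2), summed as in Eq. 2.6.
import numpy as np

def gaussian_log_prob(x, mean, std):
    return -0.5 * np.log(2 * np.pi * std**2) - 0.5 * (x - mean)**2 / std**2

def ar1_log_likelihood(x_seq, a=0.9, s=0.5):
    """log p(x_{1:T}) = log p(x_1) + sum_{t>1} log p(x_t | x_{t-1})."""
    log_p = gaussian_log_prob(x_seq[0], 0.0, 1.0)     # unconditional first step
    for t in range(1, len(x_seq)):
        log_p += gaussian_log_prob(x_seq[t], a * x_seq[t - 1], s)
    return log_p

print(ar1_log_likelihood(np.array([0.1, 0.2, 0.05, -0.1])))
```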
Autoregressive models are also referred to as "fully-visible" models (Frey, G. E. Hinton, and Dayan, 1996), as dependencies are only explicitly modeled between observed variables. However, we can also model such dependencies by introducing latent variables, denoted as $\mathbf{Z}$. Formally, a latent variable model is defined by the joint distribution
$$p_\theta(\mathbf{x}, \mathbf{z}) = p_\theta(\mathbf{x} | \mathbf{z}) \, p_\theta(\mathbf{z}), \qquad (2.7)$$
where $p_\theta(\mathbf{x} | \mathbf{z})$ is the conditional likelihood and $p_\theta(\mathbf{z})$ is the prior. Again, we have used the shorthand notation $p_\theta(\mathbf{x}, \mathbf{z})$ to denote $p_\theta(\mathbf{X} = \mathbf{x}, \mathbf{Z} = \mathbf{z})$. Introducing latent variables is one of, if not the, primary technique for increasing the flexibility of a probabilistic model. This is because evaluating the probability of an observation now requires marginalizing over the latent variables. If $\mathbf{z}$ is a continuous variable, this involves integration, $p_\theta(\mathbf{x}) = \int p_\theta(\mathbf{x}, \mathbf{z}) \, d\mathbf{z}$, and if $\mathbf{z}$ is discrete, this involves summation, $p_\theta(\mathbf{x}) = \sum_{\mathbf{z}} p_\theta(\mathbf{x}, \mathbf{z})$. In either case, we have
$$p_\theta(\mathbf{x}) = \mathbb{E}_{\mathbf{z} \sim p_\theta(\mathbf{z})} \left[ p_\theta(\mathbf{x} | \mathbf{z}) \right], \qquad (2.8)$$
which illustrates that $p_\theta(\mathbf{x})$ is a mixture distribution, with each mixture component, $p_\theta(\mathbf{x} | \mathbf{z})$, weighted according to $p_\theta(\mathbf{z})$. Thus, even when restricting $p_\theta(\mathbf{x} | \mathbf{z})$ to simple distribution forms, such as Gaussian distributions, $p_\theta(\mathbf{x})$ can take on flexible forms that do not have closed-form analytical expressions. In this way, $\mathbf{z}$ can implicitly model dependencies in $\mathbf{x}$, assigning higher probability to particular regions of the observation space.

Figure 2.2: Dependency Structures. Circles denote random variables, with gray denoting observed variables and white denoting latent variables. Arrows denote probabilistic conditional dependencies. From left to right: autoregressive model (Eqs. 2.5 & 2.6), latent variable model (Eq. 2.7), hierarchical latent variable model (Eq. 2.10), autoregressive or sequential latent variable model (Eq. 2.11). For clarity, we have drawn a subset of the possible dependencies in the final model.
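The mixture view in Eq. 2.8 also suggests a simple Monte Carlo estimator of $p_\theta(\mathbf{x})$. The sketch below assumes a linear Gaussian latent variable model, $z \sim \mathcal{N}(0, 1)$, $x | z \sim \mathcal{N}(wz, s^2)$, whose marginal is also available in closed form, so the estimate can be checked; the parameters are illustrative:

```python
# Monte Carlo estimate of p(x) = E_z[p(x | z)] (Eq. 2.8) for a 1-D linear
# Gaussian latent variable model, compared against the exact marginal
# x ~ N(0, w^2 + s^2). Parameter values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
w, s = 2.0, 0.5
x = 1.3                                           # observation to evaluate

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * (x - mean)**2 / std**2) / np.sqrt(2 * np.pi * std**2)

z = rng.normal(size=100_000)                      # samples from the prior p(z)
p_x_mc = np.mean(gaussian_pdf(x, w * z, s))       # E_z[p(x | z)]
p_x_exact = gaussian_pdf(x, 0.0, np.sqrt(w**2 + s**2))

print(p_x_mc, p_x_exact)                          # close for enough samples
```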
However, increasing flexibility through latent variables comes with increasing computational overhead. In general, marginalizing over $\mathbf{z}$ is not analytically tractable, particularly with continuous latent variables or complex conditional likelihoods.
This requires us to either 1) adopt approximation techniques, which we discuss in Section 2.3, or 2) restrict the form of the model to ensure computationally tractable evaluation of $p_\theta(\mathbf{x})$. This latter approach is the basis of flow-based models (Tabak and Turner, 2013; Rippel and Adams, 2013; Dinh, Krueger, and Y. Bengio, 2015), which define the conditional dependency between $\mathbf{x}$ and $\mathbf{z}$ in terms of an invertible transform, $\mathbf{x} = f_\theta(\mathbf{z})$ and $\mathbf{z} = f_\theta^{-1}(\mathbf{x})$. We can then express $p_\theta(\mathbf{x})$ using the change of variables formula:
$$p_\theta(\mathbf{x}) = p_\theta(\mathbf{z}) \left| \det \frac{\partial \mathbf{x}}{\partial \mathbf{z}} \right|^{-1}, \qquad (2.9)$$
where $\frac{\partial \mathbf{x}}{\partial \mathbf{z}}$ is the Jacobian of the transform and $\det$ denotes the matrix determinant. The term $\left| \det \frac{\partial \mathbf{x}}{\partial \mathbf{z}} \right|^{-1}$ can be interpreted as the local scaling of space when moving from $\mathbf{z}$ to $\mathbf{x}$, conserving probability mass in the transform. Flow-based models, also referred to as normalizing flows (Rezende and Mohamed, 2015), are the basis of the classical technique of independent components analysis (ICA) (Bell and Sejnowski, 1997; Hyvärinen and Oja, 2000) and non-linear generalizations (S. S. Chen and Gopinath, 2001; Laparra, Camps-Valls, and Malo, 2011). As such, these models can serve as a general-purpose mechanism for adding and removing statistical dependencies between variables. Although flow-based models avoid the intractability of marginalization, their requirement of invertibility may be overly restrictive or undesirable in some contexts (Cornish et al., 2020). And while the change of variables formula can also be applied to non-invertible transforms (Cvitkovic and Koliander, 2019), it raises computational intractabilities. This motivates the use of approximate techniques for training latent variable models (Section 2.3).
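To make Eq. 2.9 concrete, the following sketch applies the change of variables formula to an elementwise affine transform, whose Jacobian is diagonal and therefore cheap to evaluate; the transform parameters and the standard Gaussian base distribution are illustrative assumptions, not a particular model from the literature cited above:

```python
# Change of variables (Eq. 2.9) for an elementwise affine flow, x = exp(a) * z + b.
# The Jacobian dx/dz = diag(exp(a)), so log |det dx/dz| = sum(a).
import numpy as np

def standard_normal_log_prob(z):
    return np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * z**2)

a = np.array([0.3, -0.2])      # log-scales (illustrative)
b = np.array([1.0, -1.0])      # shifts (illustrative)

def forward(z):                # x = f_theta(z)
    return np.exp(a) * z + b

def inverse(x):                # z = f_theta^{-1}(x)
    return (x - b) * np.exp(-a)

def flow_log_prob(x):
    # log p(x) = log p(z) - log |det dx/dz|, evaluated at z = f^{-1}(x)
    z = inverse(x)
    return standard_normal_log_prob(z) - np.sum(a)

print(flow_log_prob(np.array([0.5, 0.2])))
```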
While we have presented autoregression and latent variables separately, these techniques can, in fact, be combined in numerous ways to model dependencies. For instance, one can create hierarchical latent variable models (Dayan et al., 1995), incorporating autoregressive dependencies between latent variables. Considering $L$ levels of latent variables, $\mathbf{z}_{1:L} = \{\mathbf{z}_1, \ldots, \mathbf{z}_L\}$, we can express the joint distribution as
$$p_\theta(\mathbf{x}, \mathbf{z}_{1:L}) = p_\theta(\mathbf{x} | \mathbf{z}_{1:L}) \prod_{\ell=1}^{L} p_\theta(\mathbf{z}_\ell | \mathbf{z}_{\ell+1:L}). \qquad (2.10)$$
From this perspective, hierarchical latent variable models are a repeated application of the latent variables technique in order to create increasingly complex empirical priors (Efron and Morris, 1973). We can also consider incorporating latent variables within sequential (autoregressive) probabilistic models, giving rise to sequential latent variable models. Considering a single level of latent variables in a corresponding sequence, $\mathbf{z}_{1:T}$, we generally have the following joint distribution:
$$p_\theta(\mathbf{x}_{1:T}, \mathbf{z}_{1:T}) = \prod_{t=1}^{T} p_\theta(\mathbf{x}_t | \mathbf{x}_{<t}, \mathbf{z}_{\leq t}) \, p_\theta(\mathbf{z}_t | \mathbf{x}_{<t}, \mathbf{z}_{<t}), \qquad (2.11)$$
where we have again assumed a forward sequential ordering. By restricting the dependency structure and distribution forms in sequential latent variable models, we recover familiar special cases, such as hidden Markov models or linear Gaussian state-space models (Murphy, 2012). Beyond hierarchical and sequential latent variable models, there are a variety of other ways to combine autoregression and latent variables (Gulrajani et al., 2017; Razavi, Aaron van den Oord, and Vinyals, 2019). The remaining chapters focus on hierarchical (Eq. 2.10) and sequential (Eq. 2.11) latent variable models, though models with more flexible hierarchical, sequential, and spatial dependencies will be required to advance the frontier of probabilistic modeling.
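As a small illustration of the generative process implied by Eq. 2.11, the following sketch performs ancestral sampling from one of the special cases mentioned above, a (one-dimensional) linear Gaussian state-space model; the coefficients, noise scales, and initial condition are illustrative assumptions:

```python
# Ancestral sampling from a sequential latent variable model (Eq. 2.11),
# specialized to a 1-D linear Gaussian state-space model:
# z_t ~ N(A z_{t-1}, Q^2), x_t ~ N(C z_t, R^2). All parameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)
A, C = 0.95, 1.0        # latent transition and emission coefficients
Q, R = 0.3, 0.2         # transition and emission noise standard deviations
T = 5

xs, zs = [], []
z_prev = 0.0            # illustrative initial condition
for t in range(T):
    z_t = A * z_prev + Q * rng.normal()   # sample p(z_t | z_{<t})
    x_t = C * z_t + R * rng.normal()      # sample p(x_t | z_{<=t})
    zs.append(z_t)
    xs.append(x_t)
    z_prev = z_t

print(np.round(xs, 3))
```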
Parameterizing the Model
In the previous section, we discussed the dependency structure of probabilistic models. The probability distributions that define these dependencies are ultimately functions. In this section, we discuss possible forms that these functions may take.
We restrict our focus here to parametric distributions, which are defined by one or more distribution parameters. The canonical example is the Gaussian (or Normal) distribution, $\mathcal{N}(x; \mu, \sigma^2)$, which is defined by a mean, $\mu$, and variance, $\sigma^2$. This can be extended to the multivariate setting, where $\mathbf{x} \in \mathbb{R}^M$ is modeled with a mean vector, $\boldsymbol{\mu}$, and covariance matrix, $\boldsymbol{\Sigma}$, with the probability density written as
$$\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{M/2} \det(\boldsymbol{\Sigma})^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right). \qquad (2.12)$$
For convenience, we may also consider diagonal covariance matrices, $\boldsymbol{\Sigma} = \text{diag}(\boldsymbol{\sigma}^2)$, simplifying the parameterization and resulting calculations. In particular, in the special case where $\boldsymbol{\Sigma} = \mathbf{I}_M$, the $M \times M$ identity matrix, the log-density, up to a constant, becomes the familiar mean squared error,
$$\log \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \mathbf{I}) = -\frac{1}{2} \|\mathbf{x} - \boldsymbol{\mu}\|_2^2 + \text{const}. \qquad (2.13)$$
With a parametric distribution, conditional dependencies are mediated by the distribution parameters, which are functions of the conditioning variables. For example, we can express an autoregressive Gaussian distribution (of the form in Eq. 2.5) through conditional densities, $p_\theta(x_j | x_{<j}) = \mathcal{N}(x_j; \mu_\theta(x_{<j}), \sigma^2_\theta(x_{<j}))$, where $\mu_\theta$ and $\sigma^2_\theta$ are functions taking $x_{<j}$ as input. A similar form applies to autoregressive models on sequences of vector inputs (Eq. 2.6), with $p_\theta(\mathbf{x}_t | \mathbf{x}_{<t}) = \mathcal{N}(\mathbf{x}_t; \boldsymbol{\mu}_\theta(\mathbf{x}_{<t}), \boldsymbol{\Sigma}_\theta(\mathbf{x}_{<t}))$. Likewise, in a latent variable model (Eq. 2.7), we can express a Gaussian conditional likelihood as $p_\theta(\mathbf{x} | \mathbf{z}) = \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_\theta(\mathbf{z}), \boldsymbol{\Sigma}_\theta(\mathbf{z}))$. Note that in the above examples, we have overloaded notation, simply using a subscript $\theta$ for all functions. In practice, while it is not uncommon to share parameters across functions, this is ultimately a modeling choice.
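The correspondence between the identity-covariance log-density (Eq. 2.13) and the mean squared error can be verified directly; the sketch below assumes a diagonal-covariance Gaussian and uses illustrative test vectors:

```python
# Diagonal-covariance Gaussian log-density (cf. Eq. 2.12) and the
# identity-covariance special case of Eq. 2.13.
import numpy as np

def diag_gaussian_log_prob(x, mu, sigma):
    """log N(x; mu, diag(sigma^2)) for vectors x, mu, sigma."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - 0.5 * (x - mu)**2 / sigma**2)

x = np.array([0.2, -1.0, 0.5])
mu = np.array([0.0, -0.8, 0.3])

# With sigma = 1, the log-density equals -0.5 * squared error plus a constant.
log_p = diag_gaussian_log_prob(x, mu, np.ones(3))
mse_term = -0.5 * np.sum((x - mu)**2)
const = -0.5 * 3 * np.log(2 * np.pi)
print(np.isclose(log_p, mse_term + const))      # True
```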
The functions supplying each of the distribution parameters can range in complexity, from constant to highly non-linear. Classical modeling techniques often employ linear functions. For instance, in a latent variable model, we could parameterize the mean as a linear function of $\mathbf{z}$:
$$\boldsymbol{\mu}_\theta(\mathbf{z}) = \mathbf{W} \mathbf{z} + \mathbf{b}, \qquad (2.14)$$
where $\mathbf{W}$ is a matrix of weights and $\mathbf{b}$ is a bias vector. Models of this form underlie factor analysis, probabilistic principal components analysis (Tipping and Bishop, 1999), independent components analysis (Bell and Sejnowski, 1997; Hyvärinen and Oja, 2000), and sparse coding (Olshausen and Field, 1996). Linear autoregressive models are also the basis of many classical time-series models and are common across fields that use statistical methods. While linear models are relatively computationally efficient, they are often too limited to accurately model complex data distributions, e.g., those found in natural images or audio.
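The following sketch samples from a latent variable model with the linear mean parameterization of Eq. 2.14, in the spirit of probabilistic principal components analysis; the dimensions, weights, and noise scale are illustrative assumptions, and the sample covariance of the observations should approach $\mathbf{W}\mathbf{W}^\top + s^2 \mathbf{I}$:

```python
# Sampling x ~ N(Wz + b, s^2 I) with z ~ N(0, I), using the linear mean
# parameterization of Eq. 2.14. All parameter values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
M, K = 4, 2                                   # observation and latent dimensions
W = rng.normal(size=(M, K))
b = rng.normal(size=M)
s = 0.1

z = rng.normal(size=(100_000, K))             # samples from the prior p(z)
x = z @ W.T + b + s * rng.normal(size=(100_000, M))   # mu_theta(z) = Wz + b

print(np.round(np.cov(x, rowvar=False), 2))
print(np.round(W @ W.T + s**2 * np.eye(M), 2))        # approximately equal
```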
Recent improvements in deep learning (Goodfellow, Y. Bengio, and Courville, 2016) have provided probabilistic models with a more expressive class of non-linear functions, improving their modeling capacity. In these models, the distribution parameters are parameterized with deep networks, which are then trained by backpropagating (Rumelhart, G. E. Hinton, and Williams, 1986) the gradient of the log-likelihood objective, $\nabla_\theta \mathbb{E}_{\mathbf{x} \sim \widehat{p}_\text{data}} \left[ \log p_\theta(\mathbf{x}) \right]$, back through the layers of the network. Typically, this is estimated using a mini-batch of data examples, rather than the full empirical distribution as in Eq. 2.4. Deep autoregressive models and deep latent variable models have enabled recent advances across an array of areas, including speech (Graves, 2013; Aäron van den Oord et al., 2016), natural language (Sutskever, Vinyals, and Le, 2014; Radford et al., 2019), images (Razavi, Aaron van den Oord, and Vinyals, 2019), video (Kumar et al., 2020), reinforcement learning (Chua et al., 2018; Ha and Schmidhuber, 2018), and many others.
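The following sketch (assuming PyTorch is available) shows the basic training loop described above: a small network produces Gaussian distribution parameters conditioned on the previous observation, and the mini-batch log-likelihood gradient is backpropagated through the network. The architecture and toy data are illustrative assumptions, not a specific model from the literature above:

```python
# Maximizing a mini-batch estimate of the log-likelihood for a Gaussian model
# whose mean and log-standard-deviation are output by a network conditioned on
# the previous observation (a simple, Markov-style autoregressive sketch).
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 2))  # -> [mu, log_std]
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

# Toy sequence data: batch of sequences; predict x_t from x_{t-1}.
x = torch.randn(64, 20, 1)                    # (batch, time, features)

for step in range(100):
    params = net(x[:, :-1])                   # condition on previous observations
    mu, log_std = params.chunk(2, dim=-1)
    dist = torch.distributions.Normal(mu, log_std.exp())
    log_likelihood = dist.log_prob(x[:, 1:]).mean()   # mini-batch estimate
    loss = -log_likelihood                    # maximize the log-likelihood
    optimizer.zero_grad()
    loss.backward()                           # backpropagation through the network
    optimizer.step()
```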
We visualize a simplified probabilistic computation graph for a deep autoregressive model in Figure 2.3. This diagram translates the autoregressive dependency structure from Figure 2.2, which only depicts the variables and their dependencies in the model, into a more detailed visualization, breaking the variables into their corresponding distributions and terms in the log-likelihood objective. Here, green circles denote the conditional likelihood at each step, containing a Gaussian mean and standard deviation, which are parameterized by a deep network.
Figure 2.3: Model Parameterization & Computation Graph. The diagram depicts a simplified computation graph for a deep autoregressive Gaussian model. Green circles denote the conditional likelihood distribution at each step, while gray circles again denote the (distribution of) data observations. Smaller red circles denote each of the log-likelihood terms in the objective. Gradients w.r.t. these terms are backpropagated through the networks parameterizing the model's distribution parameters (red dotted lines).
The log-likelihood, $\log p_\theta(\mathbf{x}_t | \mathbf{x}_{<t})$, evaluated at the data observation, $\mathbf{x}_t \sim p_\text{data}(\mathbf{x}_t | \mathbf{x}_{<t})$ (gray circle), provides the objective (red dot). The gradient of this objective w.r.t. the network parameters is calculated through backpropagation (red dotted line). This general depiction of dependencies, distributions, and log-probabilities (or differences of log-probabilities) is used throughout this thesis to describe the various computational setups.
Purely autoregressive models (without latent variables) have proven useful in many domains, often obtaining better log-likelihoods as compared with latent variable models. However, there are a number of reasons to prefer latent variable models in some contexts. First, autoregressive sampling is inherently sequential, and this linear computational scaling becomes costly in high-dimensional domains. Second, latent variables provide a representational space for downstream tasks, compression, and overall data analysis. Finally, latent variables provide added flexibility, which is particularly useful for modeling continuous random variables with relatively simple, e.g. Gaussian, conditional distributions. For these reasons, we require methods for handling the latent marginalization in Eq. 2.8. Variational inference is one such method.
2.3 Variational Inference