Chapter II: Background
2.2 Probabilistic Models
Set-Up: Probability Distributions & Maximum Likelihood
Consider a random variable, $\mathbf{X} \in \mathbb{R}^M$, with a corresponding distribution, $p_\text{data}(\mathbf{X})$, defining the probability of observing each possible observation, $\mathbf{X} = \mathbf{x}$. We will use the shorthand notation $p_\text{data}(\mathbf{x})$ to denote the probability $p_\text{data}(\mathbf{X} = \mathbf{x})$. This distribution is the result of an underlying data generation process, e.g., the emission and scattering of photons. While we do not have direct access to $p_\text{data}$, we can sample observations, $\mathbf{x} \sim p_\text{data}(\mathbf{x})$, yielding an empirical distribution, $\widehat{p}_\text{data}(\mathbf{x})$.
Often, we wish to model $p_\text{data}$, for instance, to predict or compress observations of $\mathbf{X}$. We refer to this model as $p_\theta(\mathbf{x})$, with parameters $\theta$. A natural approach for estimating the model involves minimizing some divergence measure between $p_\text{data}(\mathbf{x})$ and $p_\theta(\mathbf{x})$ with respect to the model parameters, $\theta$. The Kullback-Leibler (KL) divergence, denoted $D_\text{KL}$, is a common choice of divergence measure:
$$\theta^* \equiv \arg\min_\theta \; D_\text{KL}\left( p_\text{data}(\mathbf{x}) \, \| \, p_\theta(\mathbf{x}) \right), \qquad (2.1)$$
$$\theta^* \equiv \arg\min_\theta \; \mathbb{E}_{\mathbf{x} \sim p_\text{data}(\mathbf{x})} \left[ \log p_\text{data}(\mathbf{x}) - \log p_\theta(\mathbf{x}) \right], \qquad (2.2)$$
$$\theta^* \equiv \arg\max_\theta \; \mathbb{E}_{\mathbf{x} \sim p_\text{data}(\mathbf{x})} \left[ \log p_\theta(\mathbf{x}) \right]. \qquad (2.3)$$
Here, $\mathbb{E}$ denotes the expectation operator. We have gone from Eq. 2.1 to Eq. 2.2 using the definition of the KL divergence. Because $p_\text{data}(\mathbf{x})$ does not depend on $\theta$, we can omit it from the optimization, expressing the equivalent maximization in Eq. 2.3.
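As a minimal numerical illustration of this equivalence (a sketch assuming NumPy; the categorical family and parameter grid below are illustrative choices, not taken from the text), we can check that the minimizer of the KL divergence in Eq. 2.1 and the maximizer of the expected log-likelihood in Eq. 2.3 coincide:

```python
# Numerical check that argmin of the KL divergence (Eq. 2.2) equals argmax of
# the expected log-likelihood (Eq. 2.3). The 3-outcome data distribution and
# the one-parameter model family are illustrative assumptions.
import numpy as np

p_data = np.array([0.2, 0.5, 0.3])            # "true" distribution over 3 outcomes

# Model family: p_theta = [theta, (1 - theta) / 2, (1 - theta) / 2]
thetas = np.linspace(0.01, 0.99, 999)
p_theta = np.stack([thetas, (1 - thetas) / 2, (1 - thetas) / 2], axis=1)

kl = np.sum(p_data * (np.log(p_data) - np.log(p_theta)), axis=1)   # Eq. 2.2
expected_ll = np.sum(p_data * np.log(p_theta), axis=1)             # Eq. 2.3

assert np.argmin(kl) == np.argmax(expected_ll)   # same optimizer, theta*
print(thetas[np.argmin(kl)])                     # approximately 0.2
```

The two objectives differ only by the (constant) negative entropy of $p_\text{data}$, so they select the same parameter.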
This is the standard maximum log-likelihood objective, which is found throughout machine learning and probabilistic modeling (Murphy, 2012; Goodfellow, Y. Bengio, and Courville, 2016). In practice, we do not have access to $p_\text{data}(\mathbf{x})$ and must instead approximate the objective using data samples, i.e., using $\widehat{p}_\text{data}(\mathbf{x})$.
Figure 2.1: Maximum Likelihood Estimation. A data distribution, $p_\text{data}$, produces samples, $\mathbf{x}$, of a random variable $\mathbf{X}$. Maximum likelihood estimation attempts to fit a model distribution, $p_\theta$, to the empirical distribution of samples by maximizing the log-likelihood of data samples under the model (Eq. 2.3).
With $\mathbf{x}^{(i)}$ as the $i$th observation sample out of $N$ total samples, we can approximate the objective as the following empirical average:
$$\mathbb{E}_{\mathbf{x} \sim p_\text{data}(\mathbf{x})} \left[ \log p_\theta(\mathbf{x}) \right] \approx \mathbb{E}_{\mathbf{x} \sim \widehat{p}_\text{data}(\mathbf{x})} \left[ \log p_\theta(\mathbf{x}) \right] = \frac{1}{N} \sum_{i=1}^{N} \log p_\theta(\mathbf{x}^{(i)}). \qquad (2.4)$$
In summary, we can frame probabilistic modeling as the process of maximizing the log-likelihood, $\log p_\theta(\mathbf{x})$, of data observations under our model. A diagram of this setup is shown in Figure 2.1.
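The empirical average in Eq. 2.4 is straightforward to compute directly. The following sketch (assuming a univariate Gaussian model family; the data-generating parameters are illustrative, not from the text) evaluates it for two candidate models:

```python
# Evaluating the empirical average log-likelihood of Eq. 2.4 under a univariate
# Gaussian model. The data-generating mean and scale are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.5, scale=0.7, size=10_000)     # samples from p_data

def avg_log_likelihood(x, mu, sigma):
    """(1/N) sum_i log N(x_i; mu, sigma^2), the empirical average in Eq. 2.4."""
    return np.mean(-0.5 * np.log(2 * np.pi * sigma**2)
                   - 0.5 * (x - mu)**2 / sigma**2)

# The maximum-likelihood Gaussian uses the sample mean and standard deviation.
print(avg_log_likelihood(x, x.mean(), x.std()))      # higher value
print(avg_log_likelihood(x, 0.0, 1.0))               # lower, misspecified model
```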
Model Formulation
Formulating a probabilistic model involves considering the dependency structure of the model and the parameterization of these dependencies. As with all models, there is a bias-variance decomposition of modeling error. Without considering a sufficiently flexible dependency structure and parameterization, the model will be inherently limited, i.e. biased, in its capacity to accurately model the data. Yet, with an overly expressive model, containing many dependencies and parameters, there is a risk of overfitting to the training examples, the outcome of a model with high variance.
Dependency Structure
The dependency structure of a probabilistic model is the set of conditional dependencies between variables. One common form of dependency structure is that of autoregression (Frey, G. E. Hinton, and Dayan, 1996; Y. Bengio and S. Bengio, 2000), which utilizes the chain rule of probability to model dependencies between variables:
$$p_\theta(\mathbf{x}) = \prod_{j=1}^{M} p_\theta(x_j | x_{<j}). \qquad (2.5)$$
Here, we have induced an arbitrary ordering over the $M$ dimensions of $\mathbf{x}$, allowing us to factor the joint distribution over dimensions, $p_\theta(\mathbf{x})$, into a product of $M$ conditional distributions, each conditioned on the previous dimensions, $x_{<j}$. Slightly abusing notation, at $j = 1$, we have $p_\theta(x_j | x_{<j}) = p_\theta(x_1)$. A natural use-case for this dependency structure arises in modeling sequential data, where time provides an ordering over a sequence of $T$ observed variables, $\mathbf{x}_{1:T}$:
$$p_\theta(\mathbf{x}_{1:T}) = \prod_{t=1}^{T} p_\theta(\mathbf{x}_t | \mathbf{x}_{<t}). \qquad (2.6)$$
Although such models are conventionally formulated in forward temporal order, this is truly a modeling assumption. Likewise, while the chain rule of probability dictates that we must consider all previous variables, there may be cases where it is safe to assume conditional independence outside of some window. In the extreme case, in which we only consider pairwise dependencies, we arrive at a Markov chain: $p_\theta(\mathbf{x}_{1:T}) = \prod_{t=1}^{T} p_\theta(\mathbf{x}_t | \mathbf{x}_{t-1})$.
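As a concrete illustration of the factorized log-likelihood in Eq. 2.6, the following sketch evaluates $\log p_\theta(\mathbf{x}_{1:T})$ for the Markov (first-order autoregressive) Gaussian special case; the coefficient, noise scale, initial distribution, and sequence below are illustrative assumptions:

```python
# Log-likelihood of a sequence under a first-order autoregressive Gaussian
# model, p_theta(x_t | x_{t-1}) = N(x_t; a * x_{t-1}, s^2), summed as in Eq. 2.6.
import numpy as np

def gaussian_log_prob(x, mean, std):
    return -0.5 * np.log(2 * np.pi * std**2) - 0.5 * (x - mean)**2 / std**2

def ar1_log_likelihood(x_seq, a=0.9, s=0.5):
    """log p(x_{1:T}) = log p(x_1) + sum_{t>1} log p(x_t | x_{t-1})."""
    log_p = gaussian_log_prob(x_seq[0], 0.0, 1.0)     # unconditional first step
    for t in range(1, len(x_seq)):
        log_p += gaussian_log_prob(x_seq[t], a * x_seq[t - 1], s)
    return log_p

print(ar1_log_likelihood(np.array([0.1, 0.2, 0.05, -0.1])))
```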
Autoregressive models are also referred to as "fully-visible" models (Frey, G. E. Hinton, and Dayan, 1996), as dependencies are only explicitly modeled between observed variables. However, we can also model such dependencies by introducing latent variables, denoted as $\mathbf{Z}$. Formally, a latent variable model is defined by the joint distribution
$$p_\theta(\mathbf{x}, \mathbf{z}) = p_\theta(\mathbf{x} | \mathbf{z}) \, p_\theta(\mathbf{z}), \qquad (2.7)$$
where $p_\theta(\mathbf{x} | \mathbf{z})$ is the conditional likelihood and $p_\theta(\mathbf{z})$ is the prior. Again, we have used the shorthand notation $p_\theta(\mathbf{x}, \mathbf{z})$ to denote $p_\theta(\mathbf{X} = \mathbf{x}, \mathbf{Z} = \mathbf{z})$. Introducing latent variables is one of, if not the, primary technique for increasing the flexibility of a probabilistic model. This is because evaluating the probability of an observation now requires marginalizing over the latent variables. If $\mathbf{z}$ is a continuous variable, this involves integration, $p_\theta(\mathbf{x}) = \int p_\theta(\mathbf{x}, \mathbf{z}) \, d\mathbf{z}$, and if $\mathbf{z}$ is discrete, this involves summation, $p_\theta(\mathbf{x}) = \sum_{\mathbf{z}} p_\theta(\mathbf{x}, \mathbf{z})$. In either case, we have
$$p_\theta(\mathbf{x}) = \mathbb{E}_{\mathbf{z} \sim p_\theta(\mathbf{z})} \left[ p_\theta(\mathbf{x} | \mathbf{z}) \right], \qquad (2.8)$$
which illustrates that $p_\theta(\mathbf{x})$ is a mixture distribution, with each mixture component, $p_\theta(\mathbf{x} | \mathbf{z})$, weighted according to $p_\theta(\mathbf{z})$. Thus, even when restricting $p_\theta(\mathbf{x} | \mathbf{z})$ to simple distribution forms, such as Gaussian distributions, $p_\theta(\mathbf{x})$ can take on flexible forms that do not have closed-form analytical expressions. In this way, $\mathbf{z}$ can implicitly model dependencies in $\mathbf{x}$, assigning higher probability to particular regions of the observation space.

Figure 2.2: Dependency Structures. Circles denote random variables, with gray denoting observed variables and white denoting latent variables. Arrows denote probabilistic conditional dependencies. From left to right: autoregressive model (Eqs. 2.5 & 2.6), latent variable model (Eq. 2.7), hierarchical latent variable model (Eq. 2.10), autoregressive or sequential latent variable model (Eq. 2.11). For clarity, we have drawn a subset of the possible dependencies in the final model.
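The mixture view in Eq. 2.8 also suggests a simple Monte Carlo estimator of $p_\theta(\mathbf{x})$. The sketch below assumes a linear Gaussian latent variable model, $z \sim \mathcal{N}(0, 1)$, $x | z \sim \mathcal{N}(wz, s^2)$, whose marginal is also available in closed form, so the estimate can be checked; the parameters are illustrative:

```python
# Monte Carlo estimate of p(x) = E_z[p(x | z)] (Eq. 2.8) for a 1-D linear
# Gaussian latent variable model, compared against the exact marginal
# x ~ N(0, w^2 + s^2). Parameter values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
w, s = 2.0, 0.5
x = 1.3                                           # observation to evaluate

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * (x - mean)**2 / std**2) / np.sqrt(2 * np.pi * std**2)

z = rng.normal(size=100_000)                      # samples from the prior p(z)
p_x_mc = np.mean(gaussian_pdf(x, w * z, s))       # E_z[p(x | z)]
p_x_exact = gaussian_pdf(x, 0.0, np.sqrt(w**2 + s**2))

print(p_x_mc, p_x_exact)                          # close for enough samples
```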
However, increasing flexibility through latent variables comes with increasing computational overhead. In general, marginalizing over $\mathbf{z}$ is not analytically tractable, particularly with continuous latent variables or complex conditional likelihoods.
This requires us to either 1) adopt approximation techniques, which we discuss in Section 2.3, or 2) restrict the form of the model to ensure computationally tractable evaluation of $p_\theta(\mathbf{x})$. This latter approach is the basis of flow-based models (Tabak and Turner, 2013; Rippel and Adams, 2013; Dinh, Krueger, and Y. Bengio, 2015), which define the conditional dependency between $\mathbf{x}$ and $\mathbf{z}$ in terms of an invertible transform, $\mathbf{x} = f_\theta(\mathbf{z})$ and $\mathbf{z} = f_\theta^{-1}(\mathbf{x})$. We can then express $p_\theta(\mathbf{x})$ using the change of variables formula:
$$p_\theta(\mathbf{x}) = p_\theta(\mathbf{z}) \left| \det \frac{\partial \mathbf{x}}{\partial \mathbf{z}} \right|^{-1}, \qquad (2.9)$$
where $\frac{\partial \mathbf{x}}{\partial \mathbf{z}}$ is the Jacobian of the transform and $\det$ denotes the matrix determinant. The term $\left| \det \frac{\partial \mathbf{x}}{\partial \mathbf{z}} \right|^{-1}$ can be interpreted as the local scaling of space when moving from $\mathbf{z}$ to $\mathbf{x}$, conserving probability mass in the transform. Flow-based models, also referred to as normalizing flows (Rezende and Mohamed, 2015), are the basis of the classical technique of independent components analysis (ICA) (Bell and Sejnowski, 1997; Hyvärinen and Oja, 2000) and non-linear generalizations (S. S. Chen and Gopinath, 2001; Laparra, Camps-Valls, and Malo, 2011). As such, these models can serve as a general-purpose mechanism for adding and removing statistical dependencies between variables. Although flow-based models avoid the intractability of marginalization, their requirement of invertibility may be overly restrictive or undesirable in some contexts (Cornish et al., 2020). And while the change of variables formula can also be applied to non-invertible transforms (Cvitkovic and Koliander, 2019), it raises computational intractabilities. This motivates the use of approximate techniques for training latent variable models (Section 2.3).
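To make Eq. 2.9 concrete, the following sketch applies the change of variables formula to an elementwise affine transform, whose Jacobian is diagonal and therefore cheap to evaluate; the transform parameters and the standard Gaussian base distribution are illustrative assumptions, not a particular model from the literature cited above:

```python
# Change of variables (Eq. 2.9) for an elementwise affine flow, x = exp(a) * z + b.
# The Jacobian dx/dz = diag(exp(a)), so log |det dx/dz| = sum(a).
import numpy as np

def standard_normal_log_prob(z):
    return np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * z**2)

a = np.array([0.3, -0.2])      # log-scales (illustrative)
b = np.array([1.0, -1.0])      # shifts (illustrative)

def forward(z):                # x = f_theta(z)
    return np.exp(a) * z + b

def inverse(x):                # z = f_theta^{-1}(x)
    return (x - b) * np.exp(-a)

def flow_log_prob(x):
    # log p(x) = log p(z) - log |det dx/dz|, evaluated at z = f^{-1}(x)
    z = inverse(x)
    return standard_normal_log_prob(z) - np.sum(a)

print(flow_log_prob(np.array([0.5, 0.2])))
```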
While we have presented autoregression and latent variables separately, these techniques can, in fact, be combined in numerous ways to model dependencies. For instance, one can create hierarchical latent variable models (Dayan et al., 1995), incorporating autoregressive dependencies between latent variables. Considering $L$ levels of latent variables, $\mathbf{z}_{1:L} = \{\mathbf{z}_1, \ldots, \mathbf{z}_L\}$, we can express the joint distribution as
$$p_\theta(\mathbf{x}, \mathbf{z}_{1:L}) = p_\theta(\mathbf{x} | \mathbf{z}_{1:L}) \prod_{\ell=1}^{L} p_\theta(\mathbf{z}_\ell | \mathbf{z}_{\ell+1:L}). \qquad (2.10)$$
From this perspective, hierarchical latent variable models are a repeated application of the latent variables technique in order to create increasingly complex empirical priors (Efron and Morris, 1973). We can also consider incorporating latent variables within sequential (autoregressive) probabilistic models, giving rise to sequential latent variable models. Considering a single level of latent variables in a corresponding sequence, $\mathbf{z}_{1:T}$, we generally have the following joint distribution:
$$p_\theta(\mathbf{x}_{1:T}, \mathbf{z}_{1:T}) = \prod_{t=1}^{T} p_\theta(\mathbf{x}_t | \mathbf{x}_{<t}, \mathbf{z}_{\leq t}) \, p_\theta(\mathbf{z}_t | \mathbf{x}_{<t}, \mathbf{z}_{<t}), \qquad (2.11)$$
where we have again assumed a forward sequential ordering. By restricting the dependency structure and distribution forms in sequential latent variable models, we recover familiar special cases, such as hidden Markov models or linear Gaussian state-space models (Murphy, 2012). Beyond hierarchical and sequential latent variable models, there are a variety of other ways to combine autoregression and latent variables (Gulrajani et al., 2017; Razavi, Aaron van den Oord, and Vinyals, 2019). The remaining chapters focus on hierarchical (Eq. 2.10) and sequential (Eq. 2.11) latent variable models, though models with more flexible hierarchical, sequential, and spatial dependencies will be required to advance the frontier of probabilistic modeling.
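As a small illustration of the generative process implied by Eq. 2.11, the following sketch performs ancestral sampling from one of the special cases mentioned above, a (one-dimensional) linear Gaussian state-space model; the coefficients, noise scales, and initial condition are illustrative assumptions:

```python
# Ancestral sampling from a sequential latent variable model (Eq. 2.11),
# specialized to a 1-D linear Gaussian state-space model:
# z_t ~ N(A z_{t-1}, Q^2), x_t ~ N(C z_t, R^2). All parameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)
A, C = 0.95, 1.0        # latent transition and emission coefficients
Q, R = 0.3, 0.2         # transition and emission noise standard deviations
T = 5

xs, zs = [], []
z_prev = 0.0            # illustrative initial condition
for t in range(T):
    z_t = A * z_prev + Q * rng.normal()   # sample p(z_t | z_{<t})
    x_t = C * z_t + R * rng.normal()      # sample p(x_t | z_{<=t})
    zs.append(z_t)
    xs.append(x_t)
    z_prev = z_t

print(np.round(xs, 3))
```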
Parameterizing the Model
In the previous section, we discussed the dependency structure of probabilistic models. The probability distributions that define these dependencies are ultimately functions. In this section, we discuss possible forms that these functions may take.
We restrict our focus here to parametric distributions, which are defined by one or more distribution parameters. The canonical example is the Gaussian (or Normal) distribution, $\mathcal{N}(x; \mu, \sigma^2)$, which is defined by a mean, $\mu$, and variance, $\sigma^2$. This can be extended to the multivariate setting, where $\mathbf{x} \in \mathbb{R}^M$ is modeled with a mean vector, $\boldsymbol{\mu}$, and covariance matrix, $\boldsymbol{\Sigma}$, with the probability density written as
$$\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{M/2} \det(\boldsymbol{\Sigma})^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right). \qquad (2.12)$$
For convenience, we may also consider diagonal covariance matrices, $\boldsymbol{\Sigma} = \text{diag}(\boldsymbol{\sigma}^2)$, simplifying the parameterization and resulting calculations. In particular, in the special case where $\boldsymbol{\Sigma} = \mathbf{I}_M$, the $M \times M$ identity matrix, the log-density, up to a constant, becomes the familiar mean squared error,
$$\log \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \mathbf{I}) = -\frac{1}{2} \|\mathbf{x} - \boldsymbol{\mu}\|_2^2 + \text{const}. \qquad (2.13)$$
With a parametric distribution, conditional dependencies are mediated by the distribution parameters, which are functions of the conditioning variables. For example, we can express an autoregressive Gaussian distribution (of the form in Eq. 2.5) through conditional densities, $p_\theta(x_j | x_{<j}) = \mathcal{N}(x_j; \mu_\theta(x_{<j}), \sigma^2_\theta(x_{<j}))$, where $\mu_\theta$ and $\sigma^2_\theta$ are functions taking $x_{<j}$ as input. A similar form applies to autoregressive models on sequences of vector inputs (Eq. 2.6), with $p_\theta(\mathbf{x}_t | \mathbf{x}_{<t}) = \mathcal{N}(\mathbf{x}_t; \boldsymbol{\mu}_\theta(\mathbf{x}_{<t}), \boldsymbol{\Sigma}_\theta(\mathbf{x}_{<t}))$. Likewise, in a latent variable model (Eq. 2.7), we can express a Gaussian conditional likelihood as $p_\theta(\mathbf{x} | \mathbf{z}) = \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_\theta(\mathbf{z}), \boldsymbol{\Sigma}_\theta(\mathbf{z}))$. Note that in the above examples, we have overloaded notation, simply using a subscript $\theta$ for all functions. In practice, while it is not uncommon to share parameters across functions, this is ultimately a modeling choice.
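The correspondence between the identity-covariance log-density (Eq. 2.13) and the mean squared error can be verified directly; the sketch below assumes a diagonal-covariance Gaussian and uses illustrative test vectors:

```python
# Diagonal-covariance Gaussian log-density (cf. Eq. 2.12) and the
# identity-covariance special case of Eq. 2.13.
import numpy as np

def diag_gaussian_log_prob(x, mu, sigma):
    """log N(x; mu, diag(sigma^2)) for vectors x, mu, sigma."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - 0.5 * (x - mu)**2 / sigma**2)

x = np.array([0.2, -1.0, 0.5])
mu = np.array([0.0, -0.8, 0.3])

# With sigma = 1, the log-density equals -0.5 * squared error plus a constant.
log_p = diag_gaussian_log_prob(x, mu, np.ones(3))
mse_term = -0.5 * np.sum((x - mu)**2)
const = -0.5 * 3 * np.log(2 * np.pi)
print(np.isclose(log_p, mse_term + const))      # True
```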
The functions supplying each of the distribution parameters can range in complexity, from constant to highly non-linear. Classical modeling techniques often employ linear functions. For instance, in a latent variable model, we could parameterize the mean as a linear function of $\mathbf{z}$:
$$\boldsymbol{\mu}_\theta(\mathbf{z}) = \mathbf{W} \mathbf{z} + \mathbf{b}, \qquad (2.14)$$
where $\mathbf{W}$ is a matrix of weights and $\mathbf{b}$ is a bias vector. Models of this form underlie factor analysis, probabilistic principal components analysis (Tipping and Bishop, 1999), independent components analysis (Bell and Sejnowski, 1997; Hyvärinen and Oja, 2000), and sparse coding (Olshausen and Field, 1996). Linear autoregressive models are also the basis of many classical time-series models and are common across fields that use statistical methods. While linear models are relatively computationally efficient, they are often too limited to accurately model complex data distributions, e.g., those found in natural images or audio.
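The following sketch samples from a latent variable model with the linear mean parameterization of Eq. 2.14, in the spirit of probabilistic principal components analysis; the dimensions, weights, and noise scale are illustrative assumptions, and the sample covariance of the observations should approach $\mathbf{W}\mathbf{W}^\top + s^2 \mathbf{I}$:

```python
# Sampling x ~ N(Wz + b, s^2 I) with z ~ N(0, I), using the linear mean
# parameterization of Eq. 2.14. All parameter values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
M, K = 4, 2                                   # observation and latent dimensions
W = rng.normal(size=(M, K))
b = rng.normal(size=M)
s = 0.1

z = rng.normal(size=(100_000, K))             # samples from the prior p(z)
x = z @ W.T + b + s * rng.normal(size=(100_000, M))   # mu_theta(z) = Wz + b

print(np.round(np.cov(x, rowvar=False), 2))
print(np.round(W @ W.T + s**2 * np.eye(M), 2))        # approximately equal
```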
Recent improvements in deep learning (Goodfellow, Y. Bengio, and Courville, 2016) have provided probabilistic models with a more expressive class of non-linear functions, improving their modeling capacity. In these models, the distribution parameters are parameterized with deep networks, which are then trained by backpropagating (Rumelhart, G. E. Hinton, and Williams, 1986) the gradient of the log-likelihood objective, $\nabla_\theta \mathbb{E}_{\mathbf{x} \sim \widehat{p}_\text{data}} \left[ \log p_\theta(\mathbf{x}) \right]$, back through the layers of the network. Typically, this is estimated using a mini-batch of data examples, rather than the full empirical distribution as in Eq. 2.4. Deep autoregressive models and deep latent variable models have enabled recent advances across an array of areas, including speech (Graves, 2013; Aäron van den Oord et al., 2016), natural language (Sutskever, Vinyals, and Le, 2014; Radford et al., 2019), images (Razavi, Aaron van den Oord, and Vinyals, 2019), video (Kumar et al., 2020), reinforcement learning (Chua et al., 2018; Ha and Schmidhuber, 2018), and many others.
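The following sketch (assuming PyTorch is available) shows the basic training loop described above: a small network produces Gaussian distribution parameters conditioned on the previous observation, and the mini-batch log-likelihood gradient is backpropagated through the network. The architecture and toy data are illustrative assumptions, not a specific model from the literature above:

```python
# Maximizing a mini-batch estimate of the log-likelihood for a Gaussian model
# whose mean and log-standard-deviation are output by a network conditioned on
# the previous observation (a simple, Markov-style autoregressive sketch).
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 2))  # -> [mu, log_std]
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

# Toy sequence data: batch of sequences; predict x_t from x_{t-1}.
x = torch.randn(64, 20, 1)                    # (batch, time, features)

for step in range(100):
    params = net(x[:, :-1])                   # condition on previous observations
    mu, log_std = params.chunk(2, dim=-1)
    dist = torch.distributions.Normal(mu, log_std.exp())
    log_likelihood = dist.log_prob(x[:, 1:]).mean()   # mini-batch estimate
    loss = -log_likelihood                    # maximize the log-likelihood
    optimizer.zero_grad()
    loss.backward()                           # backpropagation through the network
    optimizer.step()
```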
We visualize a simplified probabilistic computation graph for a deep autoregressive model in Figure 2.3. This diagram translates the autoregressive dependency structure from Figure 2.2, which only depicts the variables and their dependencies in the model, into a more detailed visualization, breaking the variables into their corresponding distributions and terms in the log-likelihood objective. Here, green circles denote the conditional likelihood at each step, containing a Gaussian mean and standard deviation, which are parameterized by a deep network.
Figure 2.3: Model Parameterization & Computation Graph. The diagram depicts a simplified computation graph for a deep autoregressive Gaussian model. Green circles denote the conditional likelihood distribution at each step, while gray circles again denote the (distribution of) data observations. Smaller red circles denote each of the log-likelihood terms in the objective. Gradients w.r.t. these terms are backpropagated through the networks parameterizing the model's distribution parameters (red dotted lines).
The log-likelihood, $\log p_\theta(\mathbf{x}_t | \mathbf{x}_{<t})$, evaluated at the data observation, $\mathbf{x}_t \sim p_\text{data}(\mathbf{x}_t | \mathbf{x}_{<t})$ (gray circle), provides the objective (red dot). The gradient of this objective w.r.t. the network parameters is calculated through backpropagation (red dotted line). This general depiction of dependencies, distributions, and log-probabilities (or differences of log-probabilities) is used throughout this thesis to describe the various computational setups.
Purely autoregressive models (without latent variables) have proven useful in many domains, often obtaining better log-likelihoods as compared with latent variable models. However, there are a number of reasons to prefer latent variable models in some contexts. First, autoregressive sampling is inherently sequential, and this linear computational scaling becomes costly in high-dimensional domains. Second, latent variables provide a representational space for downstream tasks, compression, and overall data analysis. Finally, latent variables provide added flexibility, which is particularly useful for modeling continuous random variables with relatively simple, e.g. Gaussian, conditional distributions. For these reasons, we require methods for handling the latent marginalization in Eq. 2.8. Variational inference is one such method.
2.3 Variational Inference