The Identification of Discrete Mixture Models

Introduction

Organization

The set of all linear combinations of the matrices P(i) is a commutative algebra, the B(v) projection algebra, which we call AB(v). Together with the algorithm, references to the most important steps of the analysis are given in Appendix 4.5.

The k-Mix IID Problem

Introduction

(Except for the case k = 2, but here we are concerned with the complexity of the problem as a function of k.) Similar results for somewhat more general models were obtained at around the same time in [17]. k) on the task sample size.

Mixture Models and Other Definitions

We associate to each vector q ∈ R^k a degree k−1 polynomial q̂(x) = Σ_{j=0}^{k−1} q_j x^j. (For this reason, we use zero indexing for the vector.) For a closed convex set S ⊆ R^k and any point x ∉ S, the Euclidean projection of x onto S is Proj_S(x) := arg min_{y∈S} ∥y − x∥_2. This projection is unique.

Properties of Hankel Matrices

By the Courant-Fischer-Weyl min-max principle, the smallest eigenvalue of H_k is obtained by minimizing the Rayleigh-Ritz quotient.
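
For reference, the min-max characterization specializes as follows, assuming H_k is a real symmetric k × k matrix; the name q for the test vector is chosen here only for illustration.

```latex
\lambda_{\min}(H_k) \;=\; \min_{q \in \mathbb{R}^k,\; q \neq 0} \frac{q^{\top} H_k \, q}{q^{\top} q}
```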

The Empirical Moments

By eliminating from the product all the factors whose absolute value is below 2, we get, for some r ≤ k′ − ℓ, |β_1 β_2 ⋯ β_r| ≥ ∥q∥_2. We can convert between the standard moments of the distribution and the normalized histogram using the observation (Lemma 1 in [25]) that for any ∈ R,.

Learning the Source

After sampling, LearnCoinMixture computes the approximate model M̃ using O(k^2 log k + k log^2 k · log(log ζ^{−1} + log π_min^{−1} + γ)) arithmetic operations. We can get a relative guarantee ∥π̃ − π∥_∞ ≤ π_min · 2^{−γ} by increasing the sample size by a factor of π_min^{−2}.

Implications for Topic Models

Note that solving the k-coin instance produced by either of the two reductions using either of the two previous algorithms requires a sample size of at least k^{O(k^2)} (due to the required precision). Our algorithm solves the instances produced by these reductions using a sample size of k^{O(k)} (and a total running time of O(k^{3+o(1)})).

Analysis

We need to show that the first eigenvector computed from the empirical Hankel matrix is close to the kernel eigenvector of the true Hankel matrix. We then use the roots of the polynomial r̂ as our estimates of the coin biases (after the roots are projected back onto [0,1]).
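
The idea can be illustrated with a minimal sketch, under stated assumptions: the Hankel matrix is taken to be the (k+1) × (k+1) matrix with entries µ_{i+j}, exact moments are used in place of empirical ones, and the eigenvector for the smallest eigenvalue stands in for the kernel vector. The function names `hankel_from_moments` and `estimate_biases` are hypothetical and are not the routines of the text.

```python
import numpy as np

def hankel_from_moments(mu, size):
    """Return the size x size Hankel matrix H with H[i, j] = mu[i + j]."""
    idx = np.arange(size)
    return mu[idx[:, None] + idx[None, :]]

def estimate_biases(mu, k):
    """Estimate k coin biases from the moments mu[0..2k] (illustrative only)."""
    H = hankel_from_moments(np.asarray(mu, dtype=float), k + 1)
    # eigh returns eigenvalues in ascending order, so column 0 belongs to the smallest.
    _, vecs = np.linalg.eigh(H)
    r = vecs[:, 0]                 # zero-indexed coefficients r_0, ..., r_k
    roots = np.roots(r[::-1])      # np.roots expects the highest degree first
    # Project the (real parts of the) roots back onto [0, 1].
    return np.sort(np.clip(roots.real, 0.0, 1.0))

# Hypothetical 3-coin mixture used only to exercise the sketch.
alpha = np.array([0.2, 0.5, 0.9])   # coin biases
pi = np.array([0.3, 0.3, 0.4])      # mixing weights
k = len(alpha)
mu = np.array([np.sum(pi * alpha**i) for i in range(2 * k + 1)])
print(estimate_biases(mu, k))
```

With exact moments the kernel vector is proportional to the coefficients of the polynomial whose roots are the biases, which is why the roots recover them; with empirical moments the recovered roots are only approximate and may leave [0,1], hence the clipping step.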

Computing the Weights

The weights produced before RectifyWeights is called sum to 1 (since they satisfy the equation 1 = µ_0 = Σ_j π̃_j), but they may lie outside [0,1]. To fix this, we simply set all negative weights to zero and scale all non-negative weights so that the sum of the weights does not change. Note that in Algorithm 2, I^− denotes the indices of the negative weights, and I^+ those of the positive weights.

We know that the true weights lie in [0,1], so increasing the negative weights to 0 only brings them closer to their true values. To see that the running time is O(k), we note that we can compute I^− and I^+ in linear time, and likewise W^− and W^+.
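
The rectification step described above might look as follows. This is a sketch, not the text's Algorithm 2: it assumes that W^− and W^+ denote the total negative and total positive weight, and the function name `rectify_weights` is hypothetical.

```python
import numpy as np

def rectify_weights(w):
    """Zero out negative weights and rescale the rest so the total is unchanged."""
    w = np.asarray(w, dtype=float)
    neg = w < 0                 # I^-: indices of the negative weights
    pos = w > 0                 # I^+: indices of the positive weights
    w_minus = w[neg].sum()      # assumed meaning of W^-: total negative mass (<= 0)
    w_plus = w[pos].sum()       # assumed meaning of W^+: total positive mass
    out = np.zeros_like(w)
    if w_plus > 0:
        # Scaling by (W^+ + W^-) / W^+ keeps the overall sum equal to W^+ + W^-.
        out[pos] = w[pos] * (w_plus + w_minus) / w_plus
    return out

rectified = rectify_weights([0.6, -0.1, 0.5])
print(rectified, rectified.sum())
```

Each of the four quantities is obtained in a single pass over the k weights, which matches the linear running time claimed above.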

Deferred Proofs

Using the submultiplicativity of the operator norm and the fact that the Frobenius norm upper-bounds the operator norm, we get that. Since H̃_{k+1} is a Hankel matrix, a similarity transformation A = T H̃_{k+1} T^{−1}, where A is tridiagonal, can be calculated in time O(k^2 log k). Choose an initial guess v^(0) uniformly at random on the unit sphere (i.e., from the Haar measure on the unit sphere).
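
Two of the ingredients above are easy to check numerically. The sketch below draws a uniform point on the unit sphere by normalizing a standard Gaussian vector (a standard technique, assumed here rather than taken from the text) and evaluates the two norm inequalities on arbitrary test matrices; the helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_unit_vector(dim, rng):
    """Draw a vector uniformly from the unit sphere in R^dim."""
    g = rng.standard_normal(dim)    # the Gaussian law is rotation invariant
    return g / np.linalg.norm(g)

def op_norm(M):
    """Operator (spectral) norm: the largest singular value."""
    return np.linalg.norm(M, 2)

def fro_norm(M):
    """Frobenius norm: the root of the sum of squared entries."""
    return np.linalg.norm(M, "fro")

v0 = random_unit_vector(5, rng)

# Arbitrary test matrices, used only to illustrate the two inequalities.
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))
print(op_norm(A @ B) <= op_norm(A) * op_norm(B), op_norm(A) <= fro_norm(A))
```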

Useful Theorems

In the algorithm and the analysis, we will make extensive use of the singular values of a matrix M, which we denote σ_max(M) = σ_1(M) ≥ σ_2(M) ≥.

Empirical counterparts. In the pseudocode for Algorithm 3, we work exclusively with empirically calculated (thus approximate) versions of the above quantities.

While we have a log-linear dependence of the probability on the parameters, these are unnormalized, in that the probability of X = 1|u1,.

“On the Learnability of Discrete Distributions,” in Proceedings of the 26th Annual ACM Symposium on Theory of Computing, 1994, p.
Schulman, “Source Identification for Mixtures of Product Distributions,” in Proceedings of the 34th Annual Conference on Learning Theory - COLT, ser.

Sufficient Conditions for the Identifiability of Mixtures of

Introduction

If H(m) has full column rank, then there exists a set R of at most k−1 rows of m such that H(m|R) has full column rank. Turning to the more combinatorial second question, note that if m has two equal columns, then the same is true for H(m), so the latter cannot have full column rank. Apparently, the only well-known case of the NAE condition is when m contains k−1 rows that are identical to one another and whose entries are all distinct.

For another example where the NAE condition yields rank H(m) = k, take the (k−1)-row matrix with m_ji = 1 for i ≤ j and m_ji = 1/2 for i > j. Here the NAE condition is only minimally met, in that for every ℓ ≤ k there are ℓ columns C s.t. ε(m|C) = −1. For k > 3 the NAE condition is no longer necessary to ensure that H(m) has full column rank.
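
The rank claims above can be checked numerically for small k. The sketch assumes that H(m) has one row per subset S of the rows of m, namely the entrywise product of the rows in S (with the all-ones row for the empty set); the names `hadamard_extension` and `staircase` are hypothetical.

```python
import itertools
import numpy as np

def hadamard_extension(m):
    """Stack, over all subsets S of the rows of m, the entrywise product of the rows in S."""
    n, _ = m.shape
    rows = []
    for size in range(n + 1):
        for S in itertools.combinations(range(n), size):
            idx = np.array(S, dtype=int)
            rows.append(np.prod(m[idx, :], axis=0))   # the empty product is the all-ones row
    return np.array(rows)

def staircase(k):
    """(k-1) x k matrix with m[j, i] = 1 for i <= j and 1/2 for i > j (1-indexed)."""
    j = np.arange(1, k)[:, None]
    i = np.arange(1, k + 1)[None, :]
    return np.where(i <= j, 1.0, 0.5)

for k in range(2, 6):
    print(k, np.linalg.matrix_rank(hadamard_extension(staircase(k))))
```

Copying one column of m onto another produces two equal columns in the extension as well, so the rank drops below k in that case.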

Motivation

The problem of source identification (or parameter estimation) is the problem of computing (m, π) from the joint statistics of the X_i. In general, µ is not injective (even allowing for permutations among the values of π and the columns of m). Obviously, for example, it is not injective if m has two equal columns (unless π assigns weight to them).

In general, and assuming all π_j > 0, it cannot be injective unless H(m) has full column rank. (So for small enough δ > 0, π + δα is a mixture distribution, distinct from π, with identical statistics.) A weaker and still sufficient condition for the injectivity of µ, due to [34], is that for every i ∈ [n] there exist two disjoint sets A, B ⊆ [n] − {i} such that H(m|A) and H(m|B) have full column rank. (It is not known whether two such disjoint A, B are strictly necessary.) For any data, if it is large enough and satisfies a certain non-singularity condition, the mixture learning problem becomes easier; this observation is due to [6].

Some Theory for Hadamard Products, and a Proof of Theorem 45

One of the main motivations for our work is the characterization of interventional distributions in Bayesian networks (causal DAGs). As in previous works, the algorithm and its analysis make extensive use of the Hadamard extension of a matrix. To avoid clutter, we will describe most of the steps of the algorithm as if we had access to exactly the moments we need (i.e.

Swamy, “Learning arbitrary statistical mixtures of discrete distributions,” in Proceedings of the 47th Annual ACM Symposium on Theory of Computing, 2015, p.
Rao, “Learning mixtures of product distributions using correlations and independence,” in Proceedings of the 21st Annual Conference on Learning Theory - COLT, Omnipress, 2008, p.

Figure 3.2: Argument for Theorem 47(a). Upper-left region is white. Entries (t, f(t)) (indicated with black dots) are not white.

Source Identification for Mixtures of Products

Introduction

In the notation of Bayesian networks, this situation is represented graphically by a single unobservable random variable U with edges to each of the variables X_i ∈ X. Clearly, bounding the sample size requires a further quantitative assumption on the separation; this is the role of the ζ parameter in our work. Since our algorithm computes, as part of its operation, the condition numbers of the matrices it uses to invert the model → statistics map, it returns an output only under conditions that ensure that the result is indeed unique (within the allowed error).

The role of the current work is to provide and analyze the necessary algorithm for that problem. Some separation condition, on at least some of the variables, is necessary, because we certainly cannot identify the distribution of the latent variable if it does not have sufficient effect on the observable variables. Hence the key role played by the improved sample and runtime complexity of the current work.

Preliminaries

An approach introduced in [44] is to create synthetic copies of a single variable through linear combinations of other variables; since we need to modify the method, we describe below how this is done.

Notation for Subsets and Collections of Subsets. We will reserve calligraphic fonts for collections of subsets of [n], i.e., S ⊆ 2^[n].

Finally, for a matrix M, M^+ will denote the Moore-Penrose inverse (i.e., pseudoinverse), given by (M^T M)^{−1} M^T in the case that M has full column rank.
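
As a quick sanity check of the pseudoinverse formula in the full-column-rank case, the following sketch compares the explicit expression with NumPy's SVD-based `np.linalg.pinv`. The test matrix is arbitrary and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((6, 3))            # a generic tall matrix has full column rank

explicit = np.linalg.inv(M.T @ M) @ M.T    # (M^T M)^{-1} M^T
library = np.linalg.pinv(M)                # Moore-Penrose inverse computed via the SVD

print(np.allclose(explicit, library))
print(np.allclose(library @ M, np.eye(3)))   # M^+ is a left inverse of M
```

In numerical work the SVD-based routine is usually preferable, since forming M^T M squares the condition number.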

We will make use of information about the dependencies between observables through measurements of E[X_i X_i′]. The ℓth power moment is defined as E[X_1^{⊙ℓ}], the expectation of the product of ℓ copies of X_1 that are i.i.d. conditioned on U.

The Empirical Multi-linear Moments. For a finite sample drawn from the model, let g̃(S) be the empirical estimate of E[X_S], that is, the fraction of samples for which Π_{i∈S} X_i = 1.
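
For binary observables, the empirical multilinear moment is just a counting statistic. The sketch below assumes the sample is stored as a 0/1 array with one row per draw and one column per variable; the function name `empirical_moment` and the toy sample are hypothetical.

```python
import numpy as np

def empirical_moment(samples, S):
    """Fraction of samples (rows of a 0/1 array) in which every X_i with i in S equals 1."""
    samples = np.asarray(samples)
    cols = samples[:, np.asarray(list(S), dtype=int)]
    return float(np.mean(np.all(cols == 1, axis=1)))

# A toy sample of four draws of three binary variables.
samples = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
    [0, 1, 1],
])
print(empirical_moment(samples, [0, 1]))      # two of the four rows have X_0 = X_1 = 1
print(empirical_moment(samples, [0, 1, 2]))   # only one row has all three equal to 1
```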

The Algorithm

That is, each entry of the vector v_i will be a mixed moment, with a multilinear part given by X_A and a power moment part given by X_1^{⊙ℓ}. To avoid this we replace C̃_BA (resp. Ĉ_B′A), the rank-k approximation obtained by truncating to the first k singular values. There are lg k steps in the iteration, each taking 2^{O(k)} time, because in each iteration we multiply a constant number of times by matrices of size 2^{O(k)}.
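
Truncation to the leading singular values can be sketched as follows. The matrix here is a random stand-in rather than one of the moment matrices of the text, and `rank_k_approximation` is a hypothetical helper name.

```python
import numpy as np

def rank_k_approximation(C, k):
    """Best rank-k approximation of C, keeping only the k largest singular values."""
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(2)
C = rng.standard_normal((8, 6))
C_hat = rank_k_approximation(C, 3)

sigma = np.linalg.svd(C, compute_uv=False)
# By the Eckart-Young theorem the spectral-norm error equals the (k+1)-st singular value.
print(np.linalg.matrix_rank(C_hat), np.isclose(np.linalg.norm(C - C_hat, 2), sigma[3]))
```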

We calculate by multiplying the matrix of the prime v_i vectors by the inverse of the matrix Vdm(m̃_1), the Vandermonde matrix generated from the empirical version of row m_1. Note the advantage of the repetition in line 12; we can get away with performing at most 1 + lg k iterations to compute any of ṽ_1, . We note that a simpler version of the above procedure uses only 2k − 1 ζ-separated variables, but requires 2k − 1 iterations to compute the ṽ_i-s, so the required initial accuracy would be exponential in k^2 rather than log k.
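
Multiplying by the inverse of a Vandermonde matrix amounts to solving a linear system, which is sketched below. The convention that entry (i, j) of the matrix is the i-th power of the j-th node is an assumption made for this illustration, as are the helper name `solve_vandermonde` and the numerical values.

```python
import numpy as np

def solve_vandermonde(nodes, rhs):
    """Solve sum_j nodes[j]**i * x[j] = rhs[i] for i = 0, ..., k-1."""
    nodes = np.asarray(nodes, dtype=float)
    V = np.vander(nodes, increasing=True).T   # V[i, j] = nodes[j] ** i
    return np.linalg.solve(V, rhs)            # avoids forming the inverse explicitly

# Hypothetical nodes and coefficients, used only to exercise the sketch.
nodes = np.array([0.2, 0.5, 0.9])
coeffs = np.array([0.3, 0.3, 0.4])
rhs = np.array([np.sum(coeffs * nodes**i) for i in range(3)])
print(solve_vandermonde(nodes, rhs))
```

The system is solvable exactly when the nodes are distinct, and it becomes badly conditioned as nodes approach each other, which is one way to see why a separation parameter enters the accuracy requirements.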

The Condition Number Bound

Analysis of the Algorithm

First, we assume that the ideal versions of the matrices in question behave well, relying heavily on Theorem 71. This assumption of “independent influence” of the latent variables on the observed variables makes the resulting model a log-linear model; we have. Each U_i is uniform over {0,1}. Finally, the probability that X = 1 given U = u is given by the product of terms corresponding to each of the hidden variables, namely P(X = x|U = u) = Π_ℓ.

Kakade, “A method of moments for mixture models and hidden Markov models,” in Proceedings of the 25th Annual Conference on Learning Theory - COLT, ser.
Swamy, “Learning mixtures of arbitrary distributions over large discrete domains,” in Proceedings of the 5th Conference on Innovations in Theoretical Computer Science, 2014, pp.
Moitra, “Learning topic models – going beyond SVD,” in Proceedings of the 53rd Annual IEEE Symposium on Foundations of Computer Science, 2012.

Mansour, “Estimating a mixture of two product distributions,” in Proceedings of the 12th Annual Conference on Computational Learning Theory, July.
Moitra, “Beyond the low-degree algorithm: Mixtures of subcubes and their applications,” in Proceedings of the 51st Annual ACM Symposium on Theory of Computing, 2019, p.

The Identifiability of Uniform Mixtures of Binomial Distri-