
then take of order $(8\times10^5)^3 / (2\times10^9\ \mathrm{s^{-1}}) \sim 10$ years to invert the matrix, and would require $\sim 5$ terabytes of memory to store it! Clearly, great care must be taken when creating $C$ to make it as small as possible, and then one must work with it as efficiently as possible.
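For reference (assuming 8-byte double-precision matrix elements and a machine sustaining of order $2\times10^9$ floating-point operations per second), the arithmetic behind these estimates is

$$(8\times10^5)^2 \times 8~\mathrm{bytes} \approx 5\times10^{12}~\mathrm{bytes} \approx 5~\mathrm{TB}, \qquad \frac{(8\times10^5)^3}{2\times10^9~\mathrm{s^{-1}}} \approx 2.6\times10^8~\mathrm{s} \approx 8~\mathrm{years}.$$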

Provided the multidimensional search method used is relatively efficient, simply varying the $q_B$ is not a bad way of reaching the peak, and in fact is what we use in Chapter 3. Because to evaluate the likelihood we need only factor $C$ into the triangular matrix $L$ such that $LL^T = C$ (a Cholesky factorization; see below for how to obtain the likelihood from it), a single calculation of the likelihood can be very much faster than iterations of more sophisticated methods that converge in fewer steps. For instance, using the LAPACK linear algebra library (Anderson et al., 1999) on a Pentium IV, factoring $C$ is about six times faster than inverting it. To see how to get the likelihood from the factorization, note that what we really need is $C^{-1}\Delta$ and $\log|C|$. To get the determinant we need merely multiply the diagonal elements of $L$ (since $|C| = |L|^2$), and to get $C^{-1}\Delta$ we solve the system of equations $Cy = \Delta$, which is done in $O(n^2)$ time once $C$ is factored.
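As a minimal sketch of this scheme (not the actual pipeline code; the use of NumPy/SciPy and all names here are my own assumptions):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def log_likelihood(delta, C):
    """Gaussian log-likelihood -0.5*(delta^T C^-1 delta + log|C|), constants dropped,
    evaluated from a single Cholesky factorization C = L L^T."""
    factor = cho_factor(C, lower=True)          # O(n^3/3); no explicit inverse needed
    y = cho_solve(factor, delta)                # solves C y = delta in O(n^2)
    chi2 = delta @ y                            # delta^T C^-1 delta
    L = factor[0]
    logdet = 2.0 * np.sum(np.log(np.diag(L)))   # log|C| = 2 sum(log L_ii), since |C| = |L|^2
    return -0.5 * (chi2 + logdet)
```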

We can do better than that, though, especially if we are fitting many bins. If we could characterize the likelihood surface around a point, in addition to being able to converge to the maximum more quickly (through, for instance, Newton-Raphson iteration), we could also directly estimate quantities of interest such as errors. Many authors have advocated calculating or approximating the gradient and curvature of the likelihood (e.g., Bond et al., 1998; Borrill, 1999), then using Newton-Raphson iteration to find the zero of the gradient. In order to do this, we need to be able to calculate gradients and curvatures of the likelihood. I show here the calculation of the gradient, with the curvature discussed in Section 2.4.

Recall the formula for the derivative of the likelihood of uncorrelated data under these assumptions, Equation 2.8. First let us analyze the second term, originating from the log of the determinant of $C$:

$$-\sum_i \frac{S_i}{2(qS_i + N_i)} \qquad (2.22)$$

The denominator contains the total variance $\Lambda_i$, which enters as $\Lambda_i^{-1}$ (inverse, since it is in the denominator), while the coefficient $S_i$ is the change in $\Lambda_i$ with respect to the parameter in question, $q$. So, we would like a matrix operation that will multiply those two sets of numbers and sum them. Fortunately there is such an operation: the trace of a matrix. The trace is the sum of the diagonal elements of a matrix, and has the nice property that it is the sum of the eigenvalues, and hence is unchanged when we rotate the matrix. So, we can write the term as follows

$$-\sum_i \frac{S_i}{2(qS_i + N_i)} = -\frac{1}{2}\sum_i \frac{\Lambda_{i,q}}{\Lambda_i} = -\frac{1}{2}\,\mathrm{Tr}\left[\Lambda_{,q}\Lambda^{-1}\right] \qquad (2.23)$$

where $\Lambda_{,q}$ is the derivative of $\Lambda$ with respect to the band power $q$. We can now rotate from $\Lambda$ to $C$, since the trace is unaffected, giving the general expression

$$-\frac{1}{2}\,\mathrm{Tr}\left[C_{,q}C^{-1}\right] \qquad (2.24)$$
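A quick numerical check of this chain of equalities, with toy values of $S_i$, $N_i$, and a random rotation chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 50, 1.3
S = rng.uniform(0.5, 2.0, n)
Nd = rng.uniform(0.1, 1.0, n)

# Eq. 2.22, diagonal sum form
term_sum = -0.5 * np.sum(S / (q * S + Nd))

# Eq. 2.23, trace form in the diagonal basis
Lam = np.diag(q * S + Nd)      # Lambda
Lam_q = np.diag(S)             # dLambda/dq
term_diag = -0.5 * np.trace(Lam_q @ np.linalg.inv(Lam))

# Eq. 2.24, after rotating to a general basis: the trace is unchanged
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthogonal rotation
Cmat = Q @ Lam @ Q.T
C_q = Q @ Lam_q @ Q.T
term_rot = -0.5 * np.trace(C_q @ np.linalg.inv(Cmat))

print(term_sum, term_diag, term_rot)   # all three agree to machine precision
```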

The first term, which is the $\chi^2$ of the data,

$$\sum_i \frac{x_i^2\, S_i}{2(qS_i + N_i)^2} \qquad (2.25)$$

is rather more interesting, since there are two ways it can be transformed into matrix notation, both of which are useful. It is reasonably straightforward to process it in the diagonal case and then rotate, but it is not trivial, because some care must be taken when rotating multiple matrices that do not have the same eigenvectors. Instead, I will proceed directly from the matrix description, $\frac{1}{2}\Delta^T C^{-1}\Delta$. We will need the derivative of the inverse of a matrix, which is as follows

$$\frac{d}{dq}\left(C^{-1}C\right) = \frac{dC^{-1}}{dq}\,C + C^{-1}\,\frac{dC}{dq} = 0 \qquad (2.26)$$

where it is equal to zero because the initial product is the identity matrix (by definition of the inverse), whose derivative is clearly zero. We can then solve for the derivative of the inverse

$$\frac{dC^{-1}}{dq} = -C^{-1}\,\frac{dC}{dq}\,C^{-1} \qquad (2.27)$$

We can use this to calculate the derivative (Bond et al., 1998)

$$\frac{d}{dq}\left(\Delta^T C^{-1}\Delta\right) = -\Delta^T C^{-1} C_{,q} C^{-1}\Delta = -\Delta^T C^{-1} W_q C^{-1}\Delta \qquad (2.28)$$

where the final step follows from the parameterization of the spectrum, Equation 2.21. This form has appeared in the literature before (Oh et al., 1999; Borrill, 1999). Since the data vector is constant, it has no derivative.
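A finite-difference sanity check of Equations 2.27 and 2.28, using a toy model $C(q) = qW + N$ of my own construction (so that $C_{,q} = W$ plays the role of $W_q$):

```python
import numpy as np

rng = np.random.default_rng(1)
n, q, eps = 30, 0.8, 1e-6
A = rng.standard_normal((n, n))
W = A @ A.T + n * np.eye(n)              # stands in for W_q = dC/dq
Nmat = np.diag(rng.uniform(1.0, 2.0, n)) # q-independent noise term
delta = rng.standard_normal(n)
C = lambda qq: qq * W + Nmat             # C(q) = q*W + N, so C_{,q} = W

# Eq. 2.27: d(C^-1)/dq = -C^-1 W C^-1
dCinv_fd = (np.linalg.inv(C(q + eps)) - np.linalg.inv(C(q - eps))) / (2 * eps)
dCinv_an = -np.linalg.inv(C(q)) @ W @ np.linalg.inv(C(q))
print(np.max(np.abs(dCinv_fd - dCinv_an)))   # small compared to the matrix entries

# Eq. 2.28: d/dq (Delta^T C^-1 Delta) = -Delta^T C^-1 W C^-1 Delta
chi2 = lambda qq: delta @ np.linalg.solve(C(qq), delta)
print((chi2(q + eps) - chi2(q - eps)) / (2 * eps),   # finite difference
      delta @ dCinv_an @ delta)                      # analytic form, Eq. 2.28
```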

The other expression for the derivative comes from noting that we can rewrite the first term in the likelihood as $\mathrm{Tr}\left[\Delta\Delta^T C^{-1}\right]$. An element-by-element comparison with the standard formula shows that the operations are identical. We can then take the derivative using Equation 2.27, yielding

$$\frac{d}{dq}\,\mathrm{Tr}\left[\Delta\Delta^T C^{-1}\right] = -\mathrm{Tr}\left[\Delta\Delta^T C^{-1} C_{,q} C^{-1}\right] \qquad (2.29)$$

Combining these with Equation 2.24 and evaluating $C_{,q}$ gives the final, numerically equivalent expressions for the gradient of the likelihood

$$\frac{d\log\mathcal{L}}{dq} = \frac{1}{2}\,\Delta^T C^{-1} W_q C^{-1}\Delta - \frac{1}{2}\,\mathrm{Tr}\left[W_q C^{-1}\right] \qquad (2.30)$$

$$\frac{d\log\mathcal{L}}{dq} = \frac{1}{2}\,\mathrm{Tr}\left[\Delta\Delta^T C^{-1} W_q C^{-1} - W_q C^{-1}\right] \qquad (2.31)$$
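The two expressions can likewise be checked against a brute-force numerical derivative of the full log-likelihood; again this is an illustrative toy model of my own, with a single band so that $C(q) = qW + N$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, q, eps = 40, 1.1, 1e-6
A = rng.standard_normal((n, n))
W = A @ A.T + n * np.eye(n)               # single-band W_q, so C_{,q} = W
Nmat = np.diag(rng.uniform(1.0, 2.0, n))
delta = rng.standard_normal(n)
C = lambda qq: qq * W + Nmat

def loglike(qq):
    Cq = C(qq)
    _, logdet = np.linalg.slogdet(Cq)
    return -0.5 * (delta @ np.linalg.solve(Cq, delta) + logdet)

Ci = np.linalg.inv(C(q))
# Eq. 2.30: one matrix-vector chain plus a trace -- the cheap form
grad_230 = 0.5 * (delta @ Ci @ W @ Ci @ delta) - 0.5 * np.trace(W @ Ci)
# Eq. 2.31: pure trace form -- same value, but full matrix-matrix products
grad_231 = 0.5 * np.trace(np.outer(delta, delta) @ Ci @ W @ Ci - W @ Ci)
# Brute-force derivative of the log-likelihood itself
grad_fd = (loglike(q + eps) - loglike(q - eps)) / (2 * eps)
print(grad_230, grad_231, grad_fd)        # all three agree
```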

We are now in a position to see the different utilities of the two expressions. The first is important because it is fast to calculate, once we have the inverse. The $\chi^2$ term requires only matrix-times-vector operations, which are fast. The determinant term looks like it should require an $n^3$ operation, but because we take the trace, we need only calculate the diagonal elements of the product, which is an $n^2$ operation. In fact, the trace of a product can be performed very quickly indeed for symmetric matrices. The $jj$th element of $AB$ is $\sum_i A_{ji}B_{ij}$, and the trace is the sum of that over $j$. If the matrices are symmetric, $A_{ji} = A_{ij}$ (and likewise for $B$), and the trace is simply $\sum_i\sum_j A_{ij}B_{ij}$. If the matrices are stored, as is usually the case, in a contiguous stretch of memory, then we are simply taking the dot product of two $n^2$-long vectors. This is an extremely efficient way of accessing computer memory for the trace, especially on multiprocessor machines (Sievers, 2004, in prep).
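To illustrate the point, a small sketch (mine, using NumPy in place of the hand-tuned code referred to above): for symmetric matrices, $\mathrm{Tr}(AB)$ reduces to a single dot product over the flattened arrays.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
A = rng.standard_normal((n, n)); A = A + A.T     # symmetric
B = rng.standard_normal((n, n)); B = B + B.T     # symmetric

# O(n^3): form the full product and take its diagonal
slow = np.trace(A @ B)
# O(n^2): Tr(AB) = sum_ij A_ij B_ij for symmetric matrices --
# a single dot product over the contiguous n^2-element arrays
fast = A.ravel() @ B.ravel()
print(slow, fast)                                # agree to machine precision
```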

The usefulness of the second expression becomes clear if we introduce an extra factor of $CC^{-1}$ into the determinant term, giving

$$\frac{d\log\mathcal{L}}{dq} = \frac{1}{2}\,\mathrm{Tr}\left[\left(\Delta\Delta^T - C\right) C^{-1} W_q C^{-1}\right] \qquad (2.32)$$

We can see that we reach the maximum of the likelihood, where the gradient is zero, at the point where the matrix formed by the data, $\Delta\Delta^T$, “most closely” matches the covariance matrix $C$. In addition, we can see how the gradient will respond to the addition of an expected signal, which usually requires a matrix to describe rather than a vector. This is the key to understanding the contribution to the power spectrum from other signals, discussed in Section 2.5. Unfortunately, calculating the gradient using this expression is computationally expensive, requiring $n_{\mathrm{bin}}$ matrix-matrix multiplications. We can get one matrix multiplication for free because of the trace, but we have to pay for the others. Since we need the derivative for each bin, this requires a factor of order the number of bins more work to calculate the gradient using this formula rather than Equation 2.30. When the number of bins becomes large (for the CBI, we typically have around 20), this factor can be the difference between being able to run on a typical desktop machine and having to run on a supercomputer, or the difference between being able to run on a supercomputer and not being able to extract a power spectrum at all.