COMPUTER MEMORIES
Thesis by Mario Blaum
In Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
California

The results in Chapter III were obtained during a two-week stay at the University of Manchester, England, at the kind invitation of Prof. Other professors who had a major influence on my mathematical training during my time at Caltech were R.

In Chapter II, we present accurate and easily evaluated estimates for the average lifetime of solid-state RAM protected by a single-error-correcting, double-error-detecting (SEC-DED) code.

As an application, we analyze the benefits of "scrubbing" soft errors when both hard and soft errors are present. In Chapter IV, we extend the construction to codes that can correct a larger number of track errors and erasures. In Appendix II, we discuss the Poisson approximation used in Chapter II, assuming a simplified situation.

Being aware of the logical contradiction of the concept, we emphasize that this thesis is unique.

Extensions of the Birthday

Introduction

A generalization of the "birthday surprise" problem

The recent survey article by Chen and Hsiao ([4]) gives a very good overview of SEC-DED memory encoding; here we summarize only the important aspects of the coding architecture. Each n-bit codeword consists of one bit from each of the n chips in a row (Figure 3). In the following discussion, a chip failure is understood to be a situation in which one or more bits written to a chip cannot be reliably recovered.

These errors are traditionally classified as either "hard" (meaning that the memory cells involved are permanently damaged, e.g., "stuck-at" errors) or "soft" (meaning that a given bit has somehow been complemented, but the chip itself has not suffered structural damage). In the next section, we present a model for the occurrence of the different types of chip failures and use it to derive an estimate of the MTBF based on a Poisson approximation. Although a hardware implementation of the methods in Sections 5 and 6 is possible, we are not aware of any commercial application.

Since rows fail independently, the reliability of the entire series of M rows is $R(t)^M$. Note that we assume that each row fails according to a Poisson process, i.e., the probability of exactly j failures in any row by time t is $\frac{(\lambda n t)^j}{j!} e^{-\lambda n t}$. To find $R_3(t)$, note that when a row-column failure occurs, single-cell failures may still occur in the remaining $(l-1)^2$ cells. The last term must be subtracted since events I and II are not disjoint: event C lies in their intersection.

Since the MTBF of the unprotected memory is $1/(\lambda k M)$, dividing the expression in (9) by this value, we get the coding gain. In this section we prove one of the statements made at the beginning, namely that "scrubbing" soft errors is useful when ECC is present. Since most single-cell faults are soft faults (see [6]), techniques such as scrubbing of soft faults, combined with ECC, are useful to extend the lifetime of the system.

In this section, we discuss the methods described in Sections 5 and 6 for extending the life of computer memory. The diagonal readout of the entries $a_{i,j}$ starts at $a_{0,0}$ and continues with the subsequent entries along the directed path determined by rule (1). The directed cyclic interpretation of the readout allows us to associate with a string of length b a directed path of length (i.e., number of edges) b - 1, where the first and last bits of the string correspond to the first and last points of the path, respectively.

We can now prove our main result relating the burst-error capability of the code with conditions on $k_1$ and $k_2$. We will do that in the next section. We must show that no other burst of length less than or equal to $k_1$ has the same syndrome. We distinguish two cases:
i) The path of length b - 1 associated with the burst does not skip rows.
ii) The path associated with the burst skips rows.
The rows and the first 8 bits in each column will be considered elements of the Galois field of order $2^8$, GF($2^8$).
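To make the GF($2^8$) interpretation concrete, here is a minimal sketch of arithmetic on 8-bit symbols. The text does not say which irreducible polynomial defines the field, so the polynomial $x^8 + x^4 + x^3 + x^2 + 1$ and the function names below are assumptions chosen only for illustration.

```python
# Assumption: the field is built modulo x^8 + x^4 + x^3 + x^2 + 1 (0x11D).
# The source does not specify a polynomial; this is a common primitive choice.
PRIMITIVE_POLY = 0x11D


def gf256_add(a: int, b: int) -> int:
    """Addition in GF(2^8) is bitwise XOR of the two 8-bit symbols."""
    return a ^ b


def gf256_mul(a: int, b: int) -> int:
    """Shift-and-add multiplication, reducing modulo PRIMITIVE_POLY."""
    result = 0
    while b:
        if b & 1:            # current bit of b set: add the shifted a
            result ^= a
        a <<= 1              # multiply a by x
        if a & 0x100:        # degree reached 8: reduce
            a ^= PRIMITIVE_POLY
        b >>= 1
    return result


def gf256_pow(a: int, e: int) -> int:
    """Repeated multiplication; adequate for small illustrative exponents."""
    result = 1
    for _ in range(e):
        result = gf256_mul(result, a)
    return result
```

Under this assumed polynomial, the element 2 (the polynomial x) generates all 255 nonzero field elements, which is the property that syndrome computations over GF($2^8$) typically rely on.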

Figure 5: A computer memory with s rows of spare chips

Asymptotic estimates

Models. Formula for MTBF

The reliability of a given chip (the probability that there will be no failure during the first t hours) is given by $e^{-\lambda t}$. We assume that in a given chip these four events occur independently and that errors in one chip are independent of errors in all other chips. Thus, for example, the probability that after t hours a given line in the chip has not yet failed is $e^{-\lambda a t / l}$.

Only row or single cell errors occur, so there is no more than one error in a. But we have l rows, all of which fail independently, so we need to raise expression (*) to the power l and multiply the result by $e^{-(b+d)\lambda n t}$, the probability that neither column nor row-column errors occur. Finally, by substituting (5) into (1) and changing the variable to $x = \lambda n t$, we obtain equation (8). Although the integral in (8) is somewhat complicated, if M is small it can be easily evaluated using numerical methods.

On the other hand, if M is large, asymptotic methods can be used to estimate it.
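The numerical route can be sketched as follows. This is not the integral (8) itself, whose integrand is not reproduced here; it assumes the simplified situation in which every chip failure is catastrophic and each row fails at its second failure under a Poisson process of rate $\lambda n$, so that the row reliability is $(1 + x)e^{-x}$ with $x = \lambda n t$. The function names, the integration cutoff, and the leading-order large-M estimate $\sqrt{\pi/(2M)}$ (a standard Laplace-type approximation) are assumptions of this sketch.

```python
import math


def row_reliability(x: float) -> float:
    """Assumed row reliability in the variable x = lambda*n*t:
    probability of at most one failure of a Poisson process."""
    return (1.0 + x) * math.exp(-x)


def mtbf(M: int, lam: float, n: int, upper: float = 60.0, steps: int = 20000) -> float:
    """MTBF = (1/(lam*n)) * integral_0^inf R(x)^M dx, via composite Simpson.
    'upper' truncates the integral; 'steps' must be even."""
    h = upper / steps
    total = row_reliability(0.0) ** M + row_reliability(upper) ** M
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * row_reliability(i * h) ** M
    return (total * h / 3.0) / (lam * n)


def mtbf_large_m(M: int, lam: float, n: int) -> float:
    """Leading-order estimate for large M under the same assumed model."""
    return math.sqrt(math.pi / (2.0 * M)) / (lam * n)
```

For a single row (M = 1) the integral equals 2, so the sketch returns 2/(λn), the expected time to the second failure in the row. For large M the numerical value and the leading-order estimate agree to within a few percent.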

The case of large M. Asymptotic approximations

We are interested in finding the coding gain (CG) with respect to the unprotected memory.
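As a worked form of this definition, and taking the MTBF of the unprotected memory to be $1/(\lambda k M)$ as stated elsewhere in the text, the coding gain can be written as the ratio below. The plain symbol MTBF stands for the mean time between failures of the protected memory.

```latex
\[
  \mathrm{CG} \;=\; \frac{\mathrm{MTBF}}{1/(\lambda k M)} \;=\; \lambda k M \cdot \mathrm{MTBF}.
\]
```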

Numerical examples

That is, row and column faults, while much less frequent than single-cell faults, are in typical cases roughly twice as likely to account for system failures (note that in Example 4 of [3] M = 4, while here M is a large number). Therefore, the actual MTBF is about 40% smaller than the value obtained if single-cell errors are ignored.

Error protection when s rows of spare chips are added

From (1) and the fact that the mean time between failures of the unprotected memory is $1/(\lambda M k)$, we have. Denote by $(MTBF)_s$ the mean time between failures of the error-correcting-coded memory when s rows of spare chips are added; $(MTBF)_0$ denotes the usual case without spares. Similarly, $N_s$ is the number of failures that cause the entire memory to fail, and $(CG)_s$ is the coding gain with respect to the unprotected memory. $(CG)_0$ is given by Equation (22) and was found in the previous sections, where the SEC-DED code is implemented.

If we assume that spare chips do not fail when disconnected, then by Wald's identity we get

$(MTBF)_s = \frac{1}{\lambda n M} E(N_s)$    (23)

Since the best case occurs when all spare chips are used, and the worst case when a memory failure occurs as soon as the first two irreplaceable chips fail, we have.
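The relation (23) can be illustrated for the case s = 0 under a strongly simplified model that is an assumption of this sketch: every chip failure is catastrophic, each new failure lands in one of the M rows uniformly at random, and the memory fails as soon as some row receives its second failure. In that model $N_0$ is the waiting time of the "birthday surprise" problem with M days, and its mean can be summed exactly. The function names are hypothetical.

```python
def expected_failures_no_spares(M: int) -> float:
    """E(N_0) = sum over j >= 0 of P(N_0 > j), where P(N_0 > j) is the
    probability that the first j failures hit j distinct rows."""
    expectation = 0.0
    survive = 1.0          # P(N_0 > 0)
    j = 0
    while survive > 0.0:
        expectation += survive
        survive *= 1.0 - j / M   # P(N_0 > j+1) = P(N_0 > j) * (M - j) / M
        j += 1
    return expectation


def mtbf_no_spares(M: int, lam: float, n: int) -> float:
    """Equation (23) with s = 0: MTBF = E(N_0) / (lam * n * M)."""
    return expected_failures_no_spares(M) / (lam * n * M)
```

With M = 1 the memory fails at the second failure, so E(N_0) = 2. With M = 365 the value is about 24.6, the classical expected number of people needed before two share a birthday. Spare rows (s > 0) are not modeled here.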

Double error protection

We will prove that, when the right conditions on $k_1$ and $k_2$ are met, each burst of length less than or equal to $k_1$ corresponds to a unique syndrome (vertical and horizontal). Equation (5) shows that there are bursts of length $k_1 + 1$ that cannot be corrected. is skipped by the path. The code defined by the diagonal readout (1) can correct any burst of length at most $k_1$ if and only if.

Suppose there is another burst b of length at most $k_1$ with the same syndrome as a, and $a \neq b$. So the claim is true, and without loss of generality we can assume that the initial point $(\alpha, \beta)$ lies in the rectangle, i.e.

Patel and Hong ([2], [4]) devised an error correction scheme that was successfully used on IBM 3420 series tape drives with a recording density of 6250 bits per inch.

This will be achieved by the family of B(n, m) codes described next. As stated in the introduction, we consider the rows and the first n bits in each column as elements of GF($2^n$). We want to prove that the code is an MDS code and that its minimum distance is m + 1.

If the syndrome is the zero vector, we conclude that the codeword was transmitted without errors. Circuits that solve (26) are more complicated than in the case of two erasures, but still completely feasible. An important assumption when we found the reliability R(t) of a row of chips was that the failures in the row are distributed according to a Poisson process.

Figure 1 shows a simple array-code in which the last row and the last column are parity-check bits
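A minimal sketch of such an array code follows, assuming even parity over GF(2) and a data array given as a list of rows of bits. The function names are hypothetical, and the decoder shown handles only the single-error case: a single flipped bit is located at the intersection of the one failing row check and the one failing column check.

```python
def encode(data):
    """Append a parity bit to every row, then a parity row over all columns.
    'data' is a list of equal-length lists of 0/1 bits."""
    rows = [row + [sum(row) % 2] for row in data]
    parity_row = [sum(col) % 2 for col in zip(*rows)]
    return rows + [parity_row]


def syndromes(word):
    """Return the indices of rows and of columns whose parity check fails."""
    bad_rows = [i for i, row in enumerate(word) if sum(row) % 2]
    bad_cols = [j for j, col in enumerate(zip(*word)) if sum(col) % 2]
    return bad_rows, bad_cols


def correct_single_error(word):
    """Flip the bit at the unique failing (row, column) pair, if there is one.
    Any other syndrome pattern is returned unchanged (not handled here)."""
    bad_rows, bad_cols = syndromes(word)
    fixed = [row[:] for row in word]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        fixed[bad_rows[0]][bad_cols[0]] ^= 1
    return fixed
```

For example, encoding the 2 x 3 array [[1, 0, 1], [0, 1, 1]] yields a 3 x 4 codeword; flipping any one of its bits produces exactly one failing row check and one failing column check, and the decoder restores the codeword.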

Comparison between the two methods

Basic properties of the code

For this readout to make sense, we need the directed graph to be a directed cycle. Of course, they can be considered as vertices of a cycle directed according to the rule $l \to l + 1$.

The main result

From (a), (b), and (c), the result is true for bursts whose associated path does not skip rows.

Figure 6: Case (i): The path (black points) does not skip rows

A Class of Error-Correcting Codes

Encoding and decoding

The encoding is as follows: first $B_{n-1}$ is accepted, then $B_{n-2}$, etc., up to $B_m$. $S_P$ is easy to obtain, one bit at a time, while circuits that find $S_i$ for $0 \le i \le m-1$ are also implemented without difficulty ([2]). Therefore, it is necessary to build circuits that solve the system (12) in order to complete the decoding.

The strategy for choosing the decoding mode is as follows: count the number t of erasures that have occurred, and then choose the maximum s such that $2s + t \le m + 1$. We then assume that s errors have occurred; this choice of s determines the decoding mode.
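The selection rule above can be sketched in a few lines. The function name and the decision to raise an exception when the erasure count alone exceeds the bound are assumptions of this sketch.

```python
def choose_decoding_mode(t: int, m: int) -> int:
    """Return the largest s with 2*s + t <= m + 1, i.e. the number of
    track errors the decoder will try to correct given t erasures."""
    if t < 0 or m < 0:
        raise ValueError("t and m must be non-negative")
    if t > m + 1:
        raise ValueError("too many erasures for this code")
    return (m + 1 - t) // 2
```

With m = 1 this gives s = 1 for t = 0 and s = 0 for t = 2, matching a code that corrects either one track error or two track erasures. With m = 2 and t = 1 it gives s = 1, that is, one track error together with one track erasure.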

Examples

The minimum distance of B(8, 1) is 3, so it can correct either one track error or two track erasures. If the error occurs in track 8, then $S_0 = 0$ and we do not need to retrieve any information. Now the parity-check bits are contained in $B_0$, $B_1$ and $Z_8$. Let us describe the encoding and decoding in detail.

Since B(8, 2) has minimum distance 4, it can correct either one track error together with one track erasure, or three track erasures. Suppose an error pattern $e_i$ occurs in track i and $e_j$ occurs in track j, but j is known and all the other tracks are correctly transmitted. I = ai S1. So when i = 8, this fact is easy to determine and we fix track j after finding $e_i$.

In Chapter II we found the MTBF and CG of single- and double-error-protected computer memories. To answer this, we need to find the MTBF and CG for both single- and double-error-protected computer memories in a simplified situation: we assume that all the chip failures are catastrophic.

Asymptotic Estimates

How Good is the Poisson