### Fitness distributions in evolutionary computation: motivation and examples in the continuous domain

Kumar Chellapilla ᵃ, David B. Fogel ᵇ,*

ᵃ Department of Elect. Comp. Engg., UCSD, La Jolla, CA 92037, USA

ᵇ Natural Selection, Inc., 3333 N. Torrey Pines Ct., Ste. 200, La Jolla, CA 92037, USA

Received 25 January 1999; accepted 11 June 1999

**Abstract**

Evolutionary algorithms are, fundamentally, stochastic search procedures. Each next population is a probabilistic function of the current population. Various controls are available to adjust the probability mass function that is used to sample the space of candidate solutions at each generation. For example, the step size of a single-parent variation operator can be adjusted with a corresponding effect on the probability of finding improved solutions and the expected improvement that will be obtained. Examining these statistics as a function of the step size leads to a ‘fitness distribution’, a function that trades off the expected improvement at each iteration for the probability of that improvement. This paper analyzes the effects of adjusting the step size of Gaussian and Cauchy mutations, as well as a mutation that is a convolution of these two distributions. The results indicate that fitness distributions can be effective in identifying suitable parameter settings for these operators. Some comments on the utility of extending this protocol toward the general diagnosis of evolutionary algorithms are also offered. © 1999 Elsevier Science Ireland Ltd. All rights reserved.

*Keywords*: Fitness distributions; Evolutionary computation; Continuous domain

www.elsevier.com/locate/biosystems

**1. Introduction**

When used for function optimization, evolutionary computation relies on a population of contending solutions to a problem at hand, where each individual is subject to random variation (mutation, recombination, etc.) and placed in competition with other extant solutions. Random variation provides a means for discovering novelty while selection serves to eliminate those trials that do not appear worthwhile in the context of the given criterion. Thus evolutionary algorithms can be seen as performing a search over a state space *S* of possible solutions.

In essence, most evolutionary algorithms can be described by the difference equation

$$x[t+1] = s(v(x[t])) \tag{1}$$

* Corresponding author. Tel.: +1-619-4556449; fax: +1-619-4551560.

*E-mail addresses*: kchellap@ece.ucsd.edu (K. Chellapilla), dfogel@natural-selection.com (D.B. Fogel)

where *x*[*t*] is the population at time *t* under representation *x*, *v* is the variation operator(s), and *s* is the selection operator. The stochastic elements of this difference equation include *v*, and often *s*, and the initialization mechanism for choosing *x*[0]. The choices made in terms of representation, variation, selection, and initialization shape the probability mass function that describes the likelihood of choosing solutions from *S* at the next iteration. Alternative choices can lead to dramatically different rates and probabilities of improvement at each iteration.
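Eq. (1) can be read as a loop in which variation and selection are composed once per generation. The following is a minimal sketch, not the implementation used in this paper: the function names, the one-dimensional real-valued representation, the Gaussian variation, and the truncation selection that retains parents are all assumptions chosen for illustration.

```python
import random


def variation(population, rng, sigma=0.5):
    """v: each parent yields one Gaussian-perturbed offspring; parents are kept."""
    offspring = [x + rng.gauss(0.0, sigma) for x in population]
    return population + offspring


def selection(candidates, size, fitness):
    """s: keep the `size` candidates with the lowest error."""
    return sorted(candidates, key=fitness)[:size]


def evolve(fitness, size=10, generations=50, seed=0):
    """Iterate x[t+1] = s(v(x[t])) and record the best error per generation."""
    rng = random.Random(seed)
    x = [rng.uniform(-4.0, 4.0) for _ in range(size)]   # initialization of x[0]
    history = [min(map(fitness, x))]
    for _ in range(generations):
        x = selection(variation(x, rng), size, fitness)  # x[t+1] = s(v(x[t]))
        history.append(min(map(fitness, x)))
    return x, history


population, history = evolve(lambda z: z * z)
print(history[0], history[-1])
```

Because this particular selection keeps the parents among the candidates, the best error in `history` can never increase; a different choice of `selection` or `variation` changes the sampling distribution over *S* without changing the shape of the loop.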

The question of how to design evolutionary search algorithms to improve their optimization performance has been given significant consideration, but few practical answers have been identified. Indeed, many supposed answers have in fact been false leads, dogmatically repeated over many years such that they have almost become ‘conventional wisdom’. Only recently have several of these generally accepted central tenets of evolutionary algorithms been questioned and exposed as being misleading or simply incorrect. Four of these erroneous tenets are detailed here.

1.1. *Binary representations and maximizing implicit parallelism*

Holland (1975, pp. 70–71) speculated that binary representations would provide an advantage to an evolutionary algorithm. The rationale underlying this claim relied on the notion of maximizing implicit parallelism in schemata. To review the concept, a schema is a template from some alphabet of symbols, *A*, and a wild card symbol # that matches any symbol in *A*. For example, given a binary alphabet *A*, the schema [10##] is a template for the strings [1000], [1001], [1010], and [1011]. Holland (1975, pp. 64–74) offered that any evaluated string (encoding a possible solution to a task at hand) actually offers partial information about the expected fitness of all possible schemata in which that string resides. That is, if string [0000] is evaluated to have some fitness, then partial information is also received about the worth of sampling from variations in [####], [0###], [#0##], [#00#], [#0#0], and so forth. This characteristic is termed *intrinsic parallelism* (or *implicit parallelism*), in that through a single sample, information is gained with respect to many schemata.
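The schema concept can be made concrete with a few lines of code. This is an illustrative sketch only: the wild card is written as `#` (an assumed notation), and the helper names `matches`, `instances`, and `schemata_containing` are hypothetical.

```python
from itertools import product

WILDCARD = "#"  # assumed notation for the wild card symbol


def matches(schema: str, string: str) -> bool:
    """Return True if `string` is an instance of `schema`."""
    return len(schema) == len(string) and all(
        s == WILDCARD or s == c for s, c in zip(schema, string)
    )


def instances(schema: str, alphabet: str = "01") -> list[str]:
    """Enumerate every string over `alphabet` that the schema matches."""
    slots = [alphabet if s == WILDCARD else s for s in schema]
    return ["".join(chars) for chars in product(*slots)]


def schemata_containing(string: str) -> list[str]:
    """Enumerate every schema of which `string` is an instance."""
    slots = [(c, WILDCARD) for c in string]
    return ["".join(chars) for chars in product(*slots)]


print(instances("10##"))                 # ['1000', '1001', '1010', '1011']
print(len(schemata_containing("0000")))  # 16
```

A single evaluated string of length 4 is an instance of 2^4 = 16 schemata, which is the sense in which one sample is said to carry information about many schemata at once.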

If information were actually gained by this process of implicit sampling, it would seem reasonable to expect that maximizing the number of schemata that are processed in parallel would be beneficial (Holland 1975). For any representation that is a bijective mapping from the state space of possible solutions to the encoded individuals, implicit parallelism is maximized for the smallest cardinality of alphabet. That is, given a choice between using a binary encoding or any other cardinality, binary encodings should be preferred. The emphasis on binary representations, specifically within genetic algorithms (GAs), was so strong that Antonisse (1989) wrote ‘‘the bit string representation has been raised beyond a common feature to almost a necessary precondition for serious work in the GA’’.¹

More careful inspection of this issue, however, indicates immediate problems with the claim that there should be an advantage to binary encodings. Of primary concern is the choice of binary encoding. Radcliffe (1992) noted that there are many binary representations of solutions in *S*. For example, if *S* = {1, 2, 3, 4}, then there are 4! different binary representations of these solutions. One would be {1, 2, 3, 4} → {00, 01, 10, 11}, while another would be {1, 2, 3, 4} → {11, 01, 00, 10}. It is easy to show empirically that evolutionary algorithms that employ different binary representations for the same problem do not exhibit similar performance (see De Jong et al., 1995, described below), and for some binary representations, performance is worse than a completely random search. This provides a counterexample to the claim of optimality of binary encodings (and moreover the ‘principle of minimum encoding’, proposed in Goldberg, 1989).
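The count of 4! representations can be checked by enumeration. The sketch below simply lists every bijection from the four solutions to the four 2-bit code words; the variable names are illustrative.

```python
from itertools import permutations

solutions = [1, 2, 3, 4]
codes = ["00", "01", "10", "11"]

# Each permutation of the code words gives one bijective binary representation.
mappings = [dict(zip(solutions, perm)) for perm in permutations(codes)]

print(len(mappings))   # 24
print(mappings[0])     # {1: '00', 2: '01', 3: '10', 4: '11'}
```

Both example mappings given in the text appear in this list; the search operators see only the code words, so each mapping presents them with a differently arranged problem.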

Fortunately, binary encodings are often clumsy, and many researchers abandoned the practice of using binary encodings in the late 1980s and early 1990s, finding more convenient representations for their problems, and also better results (Antonisse, 1989; Koza, 1989; Davis, 1991, p. 63; Michalewicz, 1992; Bäck and Schwefel, 1993; Fogel and Stayton, 1994). More recently, Fogel and Ghozeil (1997a) proved that it is possible to create completely equivalent evolutionary algorithms on any problem regardless of the cardinality of a bijective representation. Thus there is provably no information gained or lost as a consequence of simply altering the cardinality of representations (cf. Holland, 1975).

¹ The fundamental impact of binary representations on

1.2. *Crossover and building blocks*

In addition to advocating the use of binary representations, Holland (1975) also strongly advocated the use of one-point crossover as a mechanism to combine ‘building blocks’ from different solutions, and simultaneously minimized the importance of random mutation. More recently, Holland (1992) offered that early efforts to simulate evolution ‘‘fared poorly because they… relied on mutation rather than mating’’. The supposed importance of crossover and relative unimportance of mutation has been echoed in Goldberg (1989), Davis (1991), Koza (1992), and many others. And yet, several problems with this approach have been discovered.

With regard to processing building blocks (i.e. schemata that are associated with above-average performance), suppose that one particular building block was [1#…#0], that is, determining the first and last positions in a string to be 1 and 0, respectively, defines the building block. When one-point crossover is applied, it will always disrupt this building block. A natural solution to this problem is to view individuals not as strings, but as rings, and use two-point crossover to cut and splice segments of solutions. Once two-point crossover is invented, it is an easy step to move to uniform crossover, where each component in an offspring is selected at random from either of two parents.
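The three crossover forms described above can be sketched as follows. This is an illustration under stated assumptions: cut points are passed explicitly so the behavior is easy to follow, the wild card is written `#`, only one of the two possible children is produced, and the second parent is assumed to carry neither defining bit of the building block.

```python
import random


def one_point(a: str, b: str, cut: int) -> str:
    """Child takes a[:cut] and b[cut:], with 1 <= cut < len(a)."""
    return a[:cut] + b[cut:]


def two_point(a: str, b: str, lo: int, hi: int) -> str:
    """Child takes the segment [lo, hi) from b and the rest of the ring from a."""
    return a[:lo] + b[lo:hi] + a[hi:]


def uniform(a: str, b: str, rng: random.Random) -> str:
    """Each position is copied from either parent with equal probability."""
    return "".join(rng.choice(pair) for pair in zip(a, b))


def in_block(s: str) -> bool:
    """True if s is an instance of the building block [1#...#0]."""
    return s[0] == "1" and s[-1] == "0"


a, b = "100000", "011111"   # a carries the block; b carries neither defining bit
one_point_children = [one_point(a, b, cut) for cut in range(1, len(a))]
print(any(in_block(c) for c in one_point_children))   # False: every cut separates the ends

child = two_point(a, b, 2, 4)
print(child, in_block(child))                          # 101100 True
```

With one-point crossover every admissible cut falls between the first and last positions, so the two defining bits are always inherited from different parents; two-point crossover can exchange an interior segment and leave both ends intact.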

If evolution proceeds best when it combines building blocks, it would be easy to speculate that uniform crossover would not perform as well as one- or two-point crossover because rather than preserve such blocks of code, it tends to disrupt them. Yet, the empirical evidence offered in Syswerda (1989) showed uniform crossover outperforming both one- and two-point crossover on several problems including the traveling salesman (TSP) and the onemax (counting ones) problems. Fogel and Angeline (1998) observed similar results comparing these operators in solving linear systems of equations. Moreover, there have been many studies generating empirical evidence that evolutionary algorithms which do not rely on crossover can outperform or perform comparably with those that do (Reed et al., 1967; Fogel and Atmar, 1990; Bäck and Schwefel, 1993; Angeline, 1997; Chellapilla, 1997, 1998a; Fuchs, 1998; Luke and Spector, 1998). Jones (1995) even demonstrated that crossing extant parents with completely random solutions (dubbed ‘‘headless chicken crossover’’) could outperform structured recombination on several problems (see Fogel and Angeline (1998) for further supportive evidence).

De Jong et al. (1995) showed unequivocally that there is an important synergy between operator and representation. They first considered the problem of assigning the four values {1, 2, 3, 4} to binary strings {00, 01, 10, 11}, respectively, under the fitness function:

$$f(y) = \mathrm{integer}(y) + 1 \tag{2}$$

For a population of *n* = 5, De Jong et al. (1995) enumerated a Markov chain that completely describes the probabilistic behavior of an evolutionary algorithm on this problem. Fig. 1 shows the probability of having the evolving population contain the best possible solution as a function of the number of generations both when using or not using crossover. The performance of random search is provided for comparison. Here, … version of evolutionary algorithm for at least the first ten generations. In the first equivalence class, applying crossover to the second- and third-best solutions can generate the global optimum. In the other two classes, it cannot. Thus the chosen

Fig. 3. The exact probability of the population containing the best solution to the problem shown in Fig. 2 for the third equivalence class of representations (from De Jong et al., 1995). Here, no crossover outperforms the use of crossover by a wider margin across all generations consistently. Note too that random search alone outperforms both crossover and the absence of crossover for about the first ten generations. The results indicate the importance of matching the variation operator with the representation.

Fig. 1. The exact probability of containing the best solution as a function of the number of generations when using a simple genetic algorithm to optimize *f*(*y*) = integer(*y*) + 1 under representation {00, 01, 10, 11} mapped to {1, 2, 3, 4} (from De Jong et al., 1995). Here crossover is seen to outperform the absence of crossover regardless of the number of generations. However, this mapping is only one of 4! different possible permutations. Of these possibilities, three equivalence classes emerge. The other two classes of behavior are shown in Figs. 2 and 3.

representation cannot be considered in isolation from the search operator.

This has been more broadly proved in the ‘no free lunch’ theorems of Wolpert and Macready (1997). For algorithms that do not resample points in *S*, across all possible problems, all algorithms perform the same on average.² When an algorithm is tailored to a particular problem, it will of necessity perform worse than random search on some other problem. Thus there cannot be one best variation operator across all problems. Crossover, in all its forms, can be seen in this context simply as one possible choice that the human operator can make. Its effectiveness is problem and representation dependent.

1.3. *The schema theorem: the fundamental theorem of genetic algorithms*

Holland (1975) offered a theorem that describes the average propagation of schemata from one generation to the next under the influence of

Fig. 2. The exact probability of the population containing the best solution to the problem shown in Fig. 1 for the second equivalence class of representations (from De Jong et al., 1995). Here, no crossover outperforms the use of crossover by a small margin across all generations consistently.

² Salomon (1996) studied the relevance of resampling in

proportional selection and variation operators such as one-point crossover and mutation. Omitting the effects of variation operators, the formula is:

$$E\,P(H,t+1) = P(H,t)\,\frac{f(H,t)}{\bar{f}} \tag{3}$$

where *H* is a particular schema (hyperplane), *f*(*H*,*t*) is the mean fitness of solutions that contain *H* at time *t*, $\bar{f}$ is the mean fitness of all solutions in the population, and *P*(*H*,*t*) is the proportion of solutions that contain *H* at time *t*. Thus the expected frequency of *H* in the next time step is proportional to its current frequency and its relative fitness. Extrapolating, Goldberg (1989, p. 33) concluded that above-average schemata receive exponentially increasing trials in subsequent generations and offered this result as being of such importance that it is named the ‘Fundamental Theorem of Genetic Algorithms’.
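As a worked illustration of Eq. (3), the sketch below evaluates the right-hand side, *P*(*H*,*t*) times *f*(*H*,*t*) divided by the population mean fitness, for a small hypothetical population and compares it with the total selection probability of the schema's members under proportional selection. The population, the fitness values (the integer value of each string), and the helper names are all assumptions made for this example, and the wild card is written `#`.

```python
def matches(schema, string):
    """True if `string` is an instance of `schema` ('#' is the wild card)."""
    return all(s == "#" or s == c for s, c in zip(schema, string))


def expected_proportion(population, schema):
    """Right-hand side of Eq. (3): P(H,t) * f(H,t) / f_bar."""
    fitness = [f for _, f in population]
    in_schema = [f for s, f in population if matches(schema, s)]
    if not in_schema:
        return 0.0
    p_h = len(in_schema) / len(population)   # proportion of solutions containing H
    f_h = sum(in_schema) / len(in_schema)    # mean fitness of those solutions
    f_bar = sum(fitness) / len(fitness)      # mean fitness of the population
    return p_h * f_h / f_bar


# Hypothetical population of (string, fitness) pairs.
population = [("1000", 8.0), ("0111", 7.0), ("0001", 1.0), ("1100", 12.0)]

by_theorem = expected_proportion(population, "1###")
by_selection = (8.0 + 12.0) / (8.0 + 7.0 + 1.0 + 12.0)
print(f"{by_theorem:.4f} {by_selection:.4f}")   # 0.7143 0.7143
```

The two numbers agree because, without variation, Eq. (3) is just a restatement of proportional selection for a single generation; it says nothing about where new strings come from.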

Radcliffe (1992) noted that the theorem applies to all schemata in a population, even when the schemata defined by a representation may not capture the properties that determine fitness. For example, if the objective is to maximize the integer value of a binary string, then the strings [1000] and [0111] are as close as possible in terms of fitness (8 vs. 7), and yet they share no schemata. Thus the intuition that proportional selection will tend to emphasize those schemata that share important features related to fitness may not hold in practice.

Most work in genetic algorithms no longer uses proportional selection, so the ‘fundamental’ importance of the schema theorem can immediately be questioned. Moreover, the conclusion that above-average schemata will continue to receive exponentially increasing attention omits the consideration that the theorem only describes the *expected* behavior in a *single* generation. There is no reason to believe that the equation can be extrapolated over successive generations without giving explicit consideration to the variance of the process as well as its expectation. But more significantly, Fogel and Ghozeil (1997b) proved that the schema theorem does not apply when the fitness of schemata is described by random variables, as is often the case in real-world applications. The theorem only applies to the specific population in question and the specific fitness values assigned to each individual in that population.

Even more importantly, the theorem cannot address the issue of how new solutions are discovered; it can only indicate the statistical expectation of reproducing already existing solutions in proportion to their relative fitness. It cannot estimate long-term proportions of schemata with reliability because this depends strongly on the likelihood of new solutions being generated by variation.

1.4. *Proportional selection and the k-armed bandit*

Holland (1975) made an analogy between the problem of how best to sample from competing schemata within a population and how best to sample from a *k*-armed bandit (i.e. a slot machine with *k* arms). The payoff from each arm of the bandit has a mean and variance, and the analysis centered on how best to sample the arms so as to minimize expected losses over those samples. The conclusion was essentially to sample in proportion to the observed payoff from each arm, which led to the use of proportional selection in genetic algorithms, and the resulting focus on the schema theorem.

Unfortunately, insufficient attention was given to this analysis in two regards. The first is the choice of criterion: Minimizing expected losses does not correspond with the typical problem of function optimization that demands discovering the single best solution. In order to minimize expected losses between two choices, the proper sampling is to devote all trials to the choice with the greater average payoff. But this choice may prohibit discovering the best possible solution. Consider the case where there are four possible solutions to a problem with corresponding fitness values as shown:

[00]=19, [01]=0, [10]=11, [11]=9

The average payoff of schema [0#] is 9.5, while that of [1#] is 10. To minimize expected losses, trials should be allocated to [1#], but this would then preclude discovering the best solution [00].
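The arithmetic behind this example is short enough to check directly. The sketch below uses the four fitness values given above; the helper name `schema_mean` is illustrative, and the two schemata are identified by their first bit.

```python
payoff = {"00": 19, "01": 0, "10": 11, "11": 9}


def schema_mean(first_bit):
    """Mean payoff of the schema that fixes only the first bit."""
    values = [v for s, v in payoff.items() if s[0] == first_bit]
    return sum(values) / len(values)


print(schema_mean("0"), schema_mean("1"))   # 9.5 10.0
print(max(payoff, key=payoff.get))          # 00
```

The schema with the higher mean payoff does not contain the single best solution, which is why a criterion of minimizing expected losses can conflict with the goal of finding the optimum.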

The second respect is more fundamental: The claim that the analysis in Holland (1975) leads to an optimal sampling plan has been shown to be mathematically flawed both by counterexample (Rudolph, 1997) and direct analysis (Macready and Wolpert 1998). Proportional selection does not minimize expected losses, so even if this criterion is given preference, the development in Holland (1975) does not support the use of proportional selection in evolutionary algorithms. This form of selection is just one among many options, and the choice should be based on the dependencies posed by the particular problem.

1.5. *A new direction*

Certainly, the above list could be extended (e.g. inversion was offered to reorder schemata for effective processing as building blocks by one-point crossover (Holland, 1975, pp. 106–109), but this has had no general empirical support (Davis, 1991; Mitchell, 1996; Lobo et al., 1998)). In light of these missteps, it would appear appropriate to investigate new methods for assessing the fundamental nature of evolutionary search and optimization. The formulation in Eq. (1) leads directly to a Markov chain view of evolutionary algorithms in which a time-invariant, memoryless probability transition matrix describes the likelihood of transitioning to each possible population configuration given each possible configuration (Fogel, 1994; Rudolph, 1994 and others). Such a description immediately leads to answers regarding questions about the asymptotic behavior of various algorithms (e.g. typical instances of evolution strategies and evolutionary programming exhibit asymptotic global convergence (Fogel, 1995a), whereas the canonical genetic algorithm (Holland, 1975) is not convergent due to its reliance on proportional selection (Rudolph, 1994)). Further, as shown above, De Jong et al. (1995) used Markov chains and brute force computation to analyze the exact transient behavior of genetic algorithms under small populations (e.g. size five) and small chromosomes (e.g. two or three bits) concentrating on the expected waiting time until the global optimum is found for the first time. But this procedure appears at present to be too computationally intensive to be useful in designing more effective (in terms of quality of evolved solution) and efficient (in terms of rate of convergence) evolutionary algorithms for real problems.

The description offered by Eq. (1), however, suggests that some level of understanding of the behavior of an evolutionary algorithm can be garnered by examining the stochastic effects of the operators *s* and *v* on a population *x* at time *t*. Of interest is the probabilistic description of the fitness of the solutions contained in *x*[*t*+1]. Recent efforts (Altenberg, 1995; Fogel, 1995a; Grefenstette, 1995; Fogel and Ghozeil, 1996) have been directed at generalized expressions describing the relationship between offspring and parent fitness under particular variation operators, or empirical determination of the fitness of offspring for a given random variation technique. This paper offers evidence that this approach to describing the behavior of an evolutionary algorithm can be used to design more efficient and effective optimization techniques.

**2. Background on methods to relate parent and offspring fitness**

Altenberg (1995) offered the conjecture that, rather than rely on the schema theorem, the performance of an evolutionary algorithm could be better estimated by examining the probability mass function:

$$\Pr(W = w(x) \mid w(y), w(z)) \tag{4}$$

Grefenstette (1995) offered a similar notion for assessing the suitability of various genetic operators. Attention was focused on the mean fitness of the offspring generated by applying a genetic operator to a parent conditioned on the parents’ fitness. That is, the fitness distribution of an operator was defined as:

$$FD_{\mathrm{op}}(F_p) = \Pr(F_c \mid F_p) \tag{5}$$

where the fitness distribution of an operator $FD_{\mathrm{op}}$ is the family of probability distributions of the fitness of the offspring $F_c$, indexed by the mean fitness of the parents $F_p$. It was shown that the mean of the fitness distribution for some genetic operators could be described by simple linear functions of $F_p$.

This analysis, although potentially insightful, suffered from two important drawbacks. First, attention was unfortunately limited to the case of proportional selection and second, and more importantly, the analysis turns on the relevance of the correlation in fitness between parent and offspring. For example, Grefenstette (1995) offered that if the fitness distribution of an operator were shown to be independent of the parent’s fitness then poor performance (‘failure’) should be expected. But this can be contradicted by counterexample. For a Newton–Gauss search on a quadratic bowl, regardless of the position of the parent, and therefore its fitness, the offspring generated will be at the global optimum and have minimum error. Thus offspring fitness is independent of parental fitness, yet the algorithm is as successful as possible on this function.
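The counterexample can be illustrated numerically. In this sketch a plain Newton step on a separable quadratic bowl stands in for the Newton–Gauss search named in the text; the bowl's coefficients `a`, its center `c`, and the two parents are arbitrary values assumed for the example.

```python
def bowl(x, a, c):
    """Quadratic bowl F(x) = sum(a_i * (x_i - c_i)**2), minimized at x = c."""
    return sum(ai * (xi - ci) ** 2 for xi, ai, ci in zip(x, a, c))


def newton_step(x, a, c):
    """One Newton step; the Hessian of this bowl is diagonal with entries 2*a_i."""
    grad = [2.0 * ai * (xi - ci) for xi, ai, ci in zip(x, a, c)]
    hess = [2.0 * ai for ai in a]
    return [xi - g / h for xi, g, h in zip(x, grad, hess)]


a, c = [1.0, 3.0], [0.5, -2.0]
for parent in ([10.0, -3.0], [0.4, -1.9]):
    child = newton_step(parent, a, c)
    print(bowl(parent, a, c), bowl(child, a, c))
```

The two parents have very different errors, yet both offspring land at the minimum (up to floating-point rounding), so offspring fitness carries no trace of parental fitness even though the search could not be more successful.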

The emphasis on correlation between parental fitness and offspring fitness goes back at least to Manderick et al. (1991). The utility of this approach can suffer when attention is focused on the correlation between mean parental fitness and mean offspring fitness. For example, for the case of linear fitness functions, under real-valued representations the use of zero mean Gaussian mutations yields zero mean difference between parent and offspring fitness regardless of the setting for the step size control parameter σ (the standard deviation). But the expected rate of convergence for these methods depends crucially on the setting of σ, as summarized in Bäck (1996) and Fogel (1995a).
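The zero-mean claim for linear fitness functions follows in one line. As a sketch, assume a linear fitness $F(x) = a^{\top}x + b$ and an offspring $x' = x + \sigma z$ with $z \sim N(0, I)$; the symbols $a$, $b$, and $z$ are introduced here for illustration only.

$$E[F(x') - F(x)] = E[a^{\top}(x + \sigma z) + b] - (a^{\top}x + b) = \sigma\, a^{\top} E[z] = 0$$

The spread, by contrast, is not zero: $\mathrm{Var}[F(x') - F(x)] = \sigma^{2} \lVert a \rVert^{2}$. The mean difference therefore carries no information about σ, while the size of the improving steps that selection can exploit grows with it.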

In contrast, rather than examine mean parental fitness and how it correlates to mean offspring fitness for a particular search operator, attention can be more fruitfully given to the expected rate of improvement (in terms of mean progress toward the optimum), as was offered in Rechenberg (1973). For the case of searching in $\mathbb{R}^n$ using zero mean Gaussian mutations, Rechenberg (1973) noted that the maximum expected rate of convergence was attained for two simple functions, the sphere and corridor models, when the probability of a successful mutation was approximately 0.2. Thus, the 1/5 rule was suggested:

The ratio of successful mutations to all mutations should be 1/5. If this ratio is greater than 1/5, increase the variance; if it is less, decrease the variance.

Schwefel (1981) suggested measuring the success probability on-line over 10*n* trials (where there are *n* dimensions) and adjusting σ at iteration *t* by:

… successes in 10*n* trials divided by 10*n*. This allowed for a general solution to setting the step size, but the robustness of this procedure remains unknown in general.
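The 1/5 rule can be sketched as a simple single-parent search. This is an illustration under stated assumptions, not the update equation from Schwefel (1981), which is not reproduced in the text above: the multiplicative factor of 0.85, the elitist acceptance of offspring, the sphere objective, the initial range, and all names are choices made here for the example.

```python
import random


def sphere(x):
    """Sum of squares, assumed as the objective for this sketch."""
    return sum(xi * xi for xi in x)


def one_fifth_search(n=5, sigma=1.0, generations=2000, factor=0.85, seed=0):
    """Single-parent search with the step size adapted by the 1/5 success rule."""
    rng = random.Random(seed)
    parent = [rng.uniform(-4.0, 4.0) for _ in range(n)]
    parent_fit = sphere(parent)
    start_fit = parent_fit
    window = 10 * n          # success ratio is measured over 10n trials
    successes = 0
    for t in range(1, generations + 1):
        child = [xi + sigma * rng.gauss(0.0, 1.0) for xi in parent]
        child_fit = sphere(child)
        if child_fit < parent_fit:           # a successful mutation
            parent, parent_fit = child, child_fit
            successes += 1
        if t % window == 0:
            ratio = successes / window
            if ratio > 0.2:
                sigma /= factor              # too many successes: widen the search
            elif ratio < 0.2:
                sigma *= factor              # too few successes: narrow the search
            successes = 0
    return start_fit, parent_fit, sigma


start, final, sigma = one_fifth_search()
print(start, final, sigma)
```

The rule uses only the observed success ratio, so it requires no knowledge of the objective; whether a target ratio of 1/5 is appropriate for a given function is exactly the question taken up by the experiments that follow.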

Fogel (1995b) and Fogel and Ghozeil (1996) empirically examined the distribution of fitness scores attained under different variation operators for specific parameter settings on three continuous optimization problems (sphere, Rosenbrock, Bohachevsky) and one discrete problem (travelling salesman problem (TSP)). For the continuous problems, the variation operators included zero mean Gaussian mutations and different forms of recombination (one-point, intermediate). In contrast, a variable-length reversal of a segment of the list of cities to be visited was tested for the TSP. Experiments involved repeated Monte Carlo application of variation operators to parents from an initial generation, with the probability of improvement and expected amount of improvement (i.e. the reduction in error) being recorded for each trial. The mean behavior of the operators as a function of their parametrization was depicted graphically (see Fig. 4), and consistently showed the potential for maximizing the expected progress that could be obtained with a particular operator by adjusting its control parameter and its associated probability of improvement. The results demonstrated the possibility for optimizing variation operators even when no analytic derivation for optimal parameter settings may be possible.

Following Fogel and Ghozeil (1996), the method is further developed here and used to examine the appropriate settings for scaling three different types of single-parent variation operators across a set of four continuous function optimization problems in 2, 5, and 10 dimensions. The results indicate that the expected improvement of an operator can be estimated for various control parameters; however, in contrast to the 1/5 rule it may be insufficient to use the probability of improvement as a surrogate variable to maximize the expected improvement.

**3. Methods**

To begin, three sets of experiments were performed to investigate the properties of three variation operations that are common in the application of evolutionary algorithms to continuous function optimization problems. The framework for selection is based on the (1, 100) model, indicating that a single parent generates 100 offspring, and then the best of these offspring is selected to be the parent for the next generation. Attention was given to the probability of improvement (*PI*) and the expected improvement (*EI*) attained by the application of a particular variation operator. The algorithm proceeded as follows:

(i) The trial number, *t*, was set to 1.

(ii) 100 initial solutions (parents) $x_i(t)$, $i = 1, \ldots, 100$, were sampled uniformly from an interval $[a, b]^n$.

(iii) Each parent was evaluated in light of the objective function *F*(*x*) (defined below).

(iv) The best parent $x(t)$, with the lowest objective value, was used to generate 100 offspring, $x'_i(t)$, $i = 1, \ldots, 100$, through 100 independent applications of a variation operator $v$. The variation $v$ was accomplished in the form:

$$x'_i(t) = x(t) + v \tag{6}$$

where $v$ was a random variable with one of three possible probability density functions (pdfs): (1) a zero mean Gaussian random variable with standard deviation (scaling parameter) *s*; (2) a standard Cauchy random variable scaled by *s*; (3) a convolution of (1) and (2). These variation operators follow typical implementations in evolutionary computation for real parameter optimization (Bäck, 1996; Rudolph, 1997; Chellapilla, 1998b). Fig. 5 indicates the pdf for each choice.

(v) Each offspring was evaluated in light of *F*(*x*).

(vi) The fraction of offspring, *f*(*t*), that were strictly better (i.e. lower error) than $x(t)$ was computed.

(vii) The offspring with lowest error, $x'(t)$, was used to compute the improvement during trial *t* using

$$I(t) = F(x(t)) - F(x'(t)) \tag{7}$$

Note that *I*(*t*) could be negative if the best of the 100 offspring was worse than the parent that generated it.

Steps (i)–(vii) were repeated for 5000 trials in a Monte Carlo fashion whereupon the mean *f*(*t*) and *I*(*t*) were recorded as estimates of *PI* and *EI* for the variation operator $v$ with scaling term *s*. For convenience, *PI* and *EI* are used to denote these estimates in the following discourse. The values of *s* are identified later in this section.

Fig. 5. The probability density function (pdf) of the standard Gaussian and Cauchy pdfs in comparison with their convolution. The convolution of the pdfs is equivalent to taking the mean of the random variables. There exists a trade-off among the three pdfs between the probabilities of generating very small (0.0–0.6); small (0.6–1.2); medium (1.2–2.0); large (2.0–4.8); and very large (>4.8) mutations. These were the pdfs used for generating 100 offspring from the best parents $x(t)$ (Eq. (6)).

Each experiment was conducted on four test functions, $F_1$–$F_4$, given by:

Function $F_1$ is the sphere, $F_2$ is a modified version of the Ackley function, $F_3$ is the Rastrigin function, and $F_4$ is the generalized step function.

The three different variation operators were performed as:

$$x'_{i,j} = x_j + s\,N_j(0,1) \tag{12}$$

$$x'_{i,j} = x_j + s\,C_j(0,1) \tag{13}$$

$$x'_{i,j} = x_j + 0.5\,s\,(N_j(0,1) + C_j(0,1)) \tag{14}$$

where *N*(0,1) is a standard normal RV, *C*(0,1) is a standard Cauchy RV, *j* is an index for the *j*th dimension, and *i* is an index for the *i*th offspring from *x*. For the case of Gaussian mutation, *s* is the standard deviation σ, but recall that the standard deviation of a Cauchy pdf is undefined, thus *s* is best viewed simply as a scaling factor. Throughout the remainder of the paper, these three variations are described as Gaussian, Cauchy, and mean mutation operators (GMO, CMO, and MMO, respectively).

For each function $F_1$–$F_4$, 200 separate experiments (each of 5000 trials) were conducted by stepping the value of *s* from 0.01 to 4.00 by increments of 0.02. Initial solutions were distributed uniformly over $[-4, 4]^n$ ($n$ = 2, 5, and 10), which is symmetric about the optimum solution.

Fig. 8. The *PI* for the GMO, CMO, and MMO across the settings of the step size, *s*, for the 10-dimensional sphere function ($F_1$). For all three operators, the *PI* values decrease with increasing sigma. Since the quadratic function is continuous and unimodal, the *PI* value attains a peak value of 0.5 as *s* tends to 0, which is the *PI* value on an inclined plane. The *PI* curve for the CMO drops the fastest and is followed by those for the MMO and GMO. Paralleling the estimates for *EI* (Fig. 7), the GMO offers the greatest *PI* for any fixed value of *s*, followed by MMO and CMO, respectively. For *s* → 0 the *PI* → 0.5, and as *s* becomes large the *PI* tends to zero.

Fig. 6. The *EI* for the GMO, CMO, and MMO across the settings of the step size, *s*, for the 2-dimensional sphere function ($F_1$). The CMO *EI* curve peaks first, followed by those for the GMO and MMO. The maximum *EI* occurs at *s* = 0.47, 0.49, and 0.21 for the GMO, MMO, and CMO, respectively. The corresponding peak *EI* values were 0.20, 0.19, and 0.20 for the GMO, MMO, and CMO, respectively. The GMO curve has the largest bandwidth, followed by those for the MMO and CMO.

**4. Results**

Figs. 6 and 7 show the mean *EI* as a function of *s* for the GMO, CMO, and MMO on the 2- and 10-dimensional sphere (*F*1). Certain similar patterns are evident immediately. First, in both cases, the GMO offers a greater peak *EI* than does the MMO or CMO. In addition, the peak *EI* occurs for GMO, MMO, and CMO with decreasing values of *s*. That is, larger *s* values are required to generate the peak *EI* for GMO than are required for either MMO or CMO. As expected, as *s* → 0 the *EI* → 0, and as *s* becomes large the *EI* turns negative (i.e. the typical step size of the variation operator is larger than twice the distance to the optimum and the resulting offspring are worse than their parent). Fig. 8 shows the *PI* for each operator on the 10-dimensional *F*1. Paralleling the estimates for *EI*, the GMO offers the greatest *PI* for any fixed value of *s*, followed by MMO and CMO, respectively. As *s* → 0 the *PI* → 0.5, and as *s* becomes large the *PI* tends to zero.
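The limiting behaviour described above (*PI* near 0.5 for very small *s*, negative *EI* for large *s*) can be checked with a small Monte Carlo sweep. The following is only a sketch under assumptions: the helper names `estimate_ei_pi`, `gaussian`, and `cauchy` are hypothetical, the trial counts are far smaller than in the study, and the MMO is omitted because its exact form is not restated in this section.

```python
import numpy as np

def gaussian(rng, shape):
    """Standard normal perturbations (scaled by the step size by the caller)."""
    return rng.standard_normal(shape)

def cauchy(rng, shape):
    """Standard Cauchy perturbations (scaled by the step size by the caller)."""
    return rng.standard_cauchy(shape)

def estimate_ei_pi(step_size, sampler, rng, dim=10, n_trials=200,
                   n_parents=100, n_offspring=100):
    """Estimate (EI, PI) on the sphere function for a single step size."""
    improvements = np.empty(n_trials)
    fractions = np.empty(n_trials)
    for t in range(n_trials):
        # Parents are drawn uniformly from [-4, 4]^dim; the best one is mutated.
        parents = rng.uniform(-4.0, 4.0, size=(n_parents, dim))
        parent_errors = np.sum(parents ** 2, axis=1)
        best = parents[np.argmin(parent_errors)]
        best_error = parent_errors.min()
        offspring = best + step_size * sampler(rng, (n_offspring, dim))
        errors = np.sum(offspring ** 2, axis=1)
        fractions[t] = np.mean(errors < best_error)   # fraction strictly better
        improvements[t] = best_error - errors.min()   # improvement of best offspring
    return float(improvements.mean()), float(fractions.mean())

rng = np.random.default_rng(0)
for name, sampler in (("Gaussian", gaussian), ("Cauchy", cauchy)):
    for s in (0.01, 0.5, 1.0, 5.0):
        ei, pi = estimate_ei_pi(s, sampler, rng, n_trials=50)
        print(f"{name:8s} s={s:4.2f}  EI={ei:10.3f}  PI={pi:.3f}")
```

Plotting the two returned values against a fine grid of step sizes would give empirical curves of the kind shown in the figures, up to sampling noise.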
Of particular interest is the corresponding relationship between *PI* and *EI* for each operator on the 10-dimensional *F*1 (see Fig. 9). For very small *PI* there is a corresponding negative *EI* regardless of the variation operator, but as *PI* increases there is a wide range of values that correspond to essentially similar values of *EI*. Table 1 shows the *PI* associated with the peak *EI* for each operator for the 10-dimensional *F*1 (and other functions). As indicated, regardless of the variation operator the best choice for *PI* is considerably less than 0.2 (cf. the 1/5 rule).

Fig. 9. The relationship between the *PI* and *EI* for the GMO, CMO, and MMO on the 10-dimensional sphere function (*F*1). For very small *PI* there is a corresponding negative *EI* regardless of the variation operator, but as *PI* increases there is a wide range of values that correspond to essentially similar values of *EI*.

One method for assessing the robustness of a particular variation operator concerns the range of values for *s* that will yield reasonable values of *EI*. For the case of unimodal fitness distribution curves, define the bandwidth to be the range of values *s* such that

$$EI(s) > 0.5\,EI^{*} \qquad (21)$$

where *EI*(*s*) is the expected improvement for a scaling value *s* and *EI** is the peak *EI*. Table 2 indicates that the bandwidth for the GMO was almost twice as wide as for the CMO. In this sense, the GMO is less sensitive to particular values of *s* than the CMO.

In a similar manner, Figs. 10 and 11 offer the results for the 10-dimensional Ackley and Rastrigin functions. Despite the presence of multiple local minima in these functions, the *EI* and *PI* curves appear essentially similar to those obtained for *F*1. GMO offers a greater peak *EI* and has a larger bandwidth than MMO or CMO. Interestingly, for the step function *F*4, Fig. 12 shows that *EI* is not a function of *PI* (i.e. it is one-to-many). This illustrates that assessing the appropriate value of *PI* alone may not be sufficient to maximize *EI*. Note also that the maximum *PI* for this case never exceeds 0.2, thus it would be impossible to apply the 1/5 rule to this problem.

Table 1

The scale factor (*s*), expected improvement (*EI*), and probability of improvement (*PI*) values at the peaks of the *EI* curves as a function of the step size *s* for the test functions *F*1–*F*4. Here *s*p is the value of *s* at the *EI* peak, *EI*p is the peak *EI* value, and *PI* is the value when *EI* peaks.

| Function (Dim = 10) | *s*p GMO | *s*p MMO | *s*p CMO | *EI*p GMO | *EI*p MMO | *EI*p CMO | *PI* GMO | *PI* MMO | *PI* CMO |
|---|---|---|---|---|---|---|---|---|---|
| *F*1 Sphere | 1.01 | 0.69 | 0.27 | 9.43 | 8.01 | 7.79 | 0.15 | 0.10 | 0.15 |
| *F*2 Ackley | 0.99 | 0.53 | 0.25 | 1.22 | 1.01 | 1.01 | 0.14 | 0.13 | 0.15 |
| *F*3 Rastrigin | 1.05 | 0.63 | 0.33 | 203.94 | 180.35 | 175.48 | 0.13 | 0.12 | 0.13 |
| *F*4 Step | 0.99 | 0.55 | 0.27 | 2.88 | 2.51 | 2.47 | 0.09 | 0.08 | 0.09 |

Table 2

The left and right *EI* bandwidths for the test functions *F*1–*F*4

| Function (Dim = 10) | Left half GMO | Left half MMO | Left half CMO | Right half GMO | Right half MMO | Right half CMO | Overall GMO | Overall MMO | Overall CMO |
|---|---|---|---|---|---|---|---|---|---|
| *F*1 Sphere | 0.74 | 0.66 | 0.26 | 0.98 | 0.92 | 0.68 | 1.72 | 1.58 | 0.94 |
| *F*2 Ackley | 0.82 | 0.50 | 0.24 | 0.92 | 0.98 | 0.64 | 1.74 | 1.48 | 0.88 |
| *F*3 Rastrigin | 0.86 | 0.60 | 0.32 | 1.00 | 1.04 | 0.68 | 1.86 | 1.64 | 1.00 |
| *F*4 Step | 0.80 | 0.52 | 0.26 | 0.94 | 0.96 | 0.62 | 1.74 | 1.48 | 0.88 |
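The bandwidth of Eq. (21), and the left and right halves reported in Table 2, could be computed from a sampled *EI* curve roughly as follows. This is a sketch under assumptions: the function name `ei_bandwidth` is hypothetical, the curve is assumed unimodal with a positive peak, the result is only as fine as the grid of step sizes, and the example curve is synthetic rather than data from the study.

```python
import numpy as np

def ei_bandwidth(step_sizes, ei_values, fraction=0.5):
    """Left, right, and overall bandwidth of a sampled, unimodal EI curve.

    The bandwidth is the extent of step sizes with EI(s) > fraction * EI_peak,
    split at the step size where the sampled EI is largest.
    """
    s = np.asarray(step_sizes, dtype=float)
    ei = np.asarray(ei_values, dtype=float)
    peak = int(np.argmax(ei))
    # Indices of all sampled step sizes that exceed the threshold.
    above = np.flatnonzero(ei > fraction * ei[peak])
    left = s[peak] - s[above.min()]
    right = s[above.max()] - s[peak]
    return left, right, left + right

# Synthetic, unimodal stand-in for an EI curve (not data from the paper).
grid = np.linspace(0.02, 3.0, 150)
curve = grid * np.exp(-2.0 * grid)
left, right, overall = ei_bandwidth(grid, curve)
print(f"left={left:.2f}, right={right:.2f}, overall={overall:.2f}")
```

Splitting at the peak makes the asymmetry of a curve visible: a steep rise and a slow decay give a right half that is wider than the left half.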

**5. Discussion**

The results indicate that the expected improvement and probability of improvement when applying a particular search operator to a parent, or parents, can be estimated empirically. These measures were seen to vary as a function of a parameterization of the search operator (e.g. as a function of the standard deviation of a Gaussian mutation), and across alternative operators. The procedure of examining the fitness distribution that results from a variation operator appears to offer a general method for assessing and improving the performance of evolutionary algorithms, based on the expected results of applying arbitrary search operators to candidate solutions in light of objective criteria and selection mechanisms.

The graphs of *EI* and *PI* as a function of the
control parameter for GMO, MMO, and CMO
exhibited sufficient regularity to be interpreted
easily. The degree to which this clarity will extend
to other functions remains unknown. Further, it is
also unknown in general if the improvements that
are measured in each trial represent progress in
the direction of the optimum. This is certainly
true for the sphere, but for the Ackley, Rastrigin,
and step functions (which possess multiple
minima), local improvements may occur in a direction
opposite to the global optimum. This suggests
further investigation into the fitness distribution
of a series of offspring generated over several
iterations.

The method of analysis by fitness distribution developed in this paper (also see Fogel, 1995b) differs subtly from that proposed in Grefenstette (1995). In particular, rather than give concern to the correlation (a linear function) between *mean* […] the parent population in light of the selection criterion. The best variation operator to use may vary as a function, for example, of whether the selection criterion is a plus strategy (in which the parents compete with the offspring) or a comma strategy (in which the parents are removed every generation). Analyses that assess search operators without considering the effects of selection, particularly on individuals rather than on the average, omit a requisite piece of the puzzle.

Fig. 10. The *EI* for the GMO, CMO, and MMO across the settings of the step size, *s*, for the 10-dimensional Ackley function. The CMO *EI* curve peaks first, followed by those for the MMO and GMO. The maximum *EI* occurs at *s* = 0.99, 0.53, and 0.25 for the GMO, MMO, and CMO, respectively. The corresponding peak *EI* values were 1.22, 1.01, and 1.01 for the GMO, MMO, and CMO, respectively.

Fig. 11. The *EI* for the GMO, CMO, and MMO across the settings of the step size, *s*, for the 10-dimensional Rastrigin function. The CMO *EI* curve peaks first (barely noticeable), followed by those for the MMO and GMO. The maximum *EI* occurs at *s* = 1.03, 0.65, and 0.31 for the GMO, MMO, and CMO, respectively. The corresponding peak *EI* values were 5276.93, 4645.37, and 4487.84 for the GMO, MMO, and CMO, respectively.

Fig. 12. (a) The *PI* for the GMO, CMO, and MMO across the settings of the step size, *s*, for the 10-dimensional step function. For all three operators, the *PI* values start out close to zero for *s* = 0.01, climb to a peak value and then gradually tend to zero as *s* tends to infinity. This is in contrast to the *PI* curves for all the other test functions. As might be expected, the *PI* for the CMO peaked first, followed by those for the MMO and GMO. Further, the peak *PI* values were largest for the GMO, followed by those for the MMO and CMO; (b) the *EI* as a function of the *PI* on the 10-dimensional step function (*F*4) for the GMO, CMO, and MMO. Note that the *EI* is not a function of *PI* (i.e. it is one-to-many). This illustrates that assessing the appropriate value of *PI* alone may not be sufficient to maximize *EI*. Note also that the maximum *PI* for this case never exceeds 0.2, thus it would be impossible to apply the 1/5 rule to this problem. The shape of the *EI* versus *PI* graph appears to be the same for all three operators. However, most of the CMO points are scattered to the left of the graph (due to a large fraction of the points having relatively low *PI* values and negative *EI* values) whereas the corresponding GMO points are located above and to the right with relatively higher *EI* and *PI* values. The MMO points were in-between those for the CMO and GMO. Since the *PI* values started out close to zero, increased to a peak, and gradually decreased to zero as a function of *s* (see part (a)), a loop is observed in the *EI* versus *PI* graph.

One limitation of the approach is its computational requirements. To estimate the fitness distribution empirically requires function evaluations, and in online applications these might be better used for testing alternative solutions. Thus the method appears better suited currently for offline testing of alternative approaches to classes of problems. For example, different mutation operators, such as Cauchy and Gaussian, could be compared across a suite of test functions. Perhaps such analysis would also yield information on how to provide for more efficient self-adaptation of independently adjusted strategy parameters.

Indeed, it would be myopic to restrict the use of fitness distributions to only assess the utility of search operators. The interaction between representation, variation, and selection operators in light of an objective function suggests that any of these facets could be explicated by similar inquisition.

Recent work (Chellapilla and Fogel, 1999) indicated that reasonable estimates of *EI* and *PI* can be obtained on benchmark functions without requiring as large a sample as was conducted here. Chellapilla and Fogel (1999) offered data based on samples of 1, 10, 100, and 1000 sets of trials (recall that each ‘trial’ is the selection of the best of 100 parents followed by the generation of 100 offspring from the best parent). Although the variability of the results was too high to be reliable when using only a single trial, the *EI* and *PI* […] determined if a series of (single) trials over several generations can yield estimates that are as reliable as multiple trials in a single generation.

Fitness distributions offer a practical tool for assessing the utility of variation operators in a variety of contexts. Importantly, the empirical evidence obtained when determining the fitness distribution for a particular operator is just that: hard evidence. It provides a basis for statistical hypothesis tests to compare different operators on a range of optimization problems. The conclusions derived from comparing different variation (or other) operators using fitness distributions are statistical in nature, thus they have the weight of the framework and foundation of statistics as a buttress. This stands in marked contrast to the speculations that have been proposed about the utility of various aspects of evolutionary algorithms for optimization (e.g. binary strings, proportional selection, and one-point crossover), none of which has evidentiary support. Through the use of fitness distributions and the statistics that can be derived from them, statistical evidence can be garnered to suggest when to use a particular operator and how to set its associated parameters. Equally important, it can suggest when not to use a particular operator. It may also provide a tool for generalizing about the appropriateness of certain operators across a range of functions. But this speculation remains for future work.
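As one illustration of the kind of hypothesis test mentioned above, per-trial improvements from two operators could be compared with a two-sided permutation test on the difference in means. The paper does not prescribe a particular test, so this choice is an assumption; the function name `permutation_test` is hypothetical and the samples below are synthetic stand-ins, not results from the study.

```python
import numpy as np

def permutation_test(sample_a, sample_b, rng, n_permutations=10_000):
    """Two-sided permutation test for a difference in mean improvement."""
    a = np.asarray(sample_a, dtype=float)
    b = np.asarray(sample_b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_permutations):
        # Randomly reassign the pooled improvements to the two operators.
        shuffled = rng.permutation(pooled)
        diff = shuffled[:a.size].mean() - shuffled[a.size:].mean()
        if abs(diff) >= abs(observed):
            count += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (count + 1) / (n_permutations + 1)
    return float(observed), p_value

rng = np.random.default_rng(0)
# Synthetic per-trial improvements for two hypothetical operators.
improvements_a = rng.normal(loc=1.0, scale=1.0, size=200)
improvements_b = rng.normal(loc=0.8, scale=1.0, size=200)
diff, p = permutation_test(improvements_a, improvements_b, rng, n_permutations=2000)
print(f"mean difference = {diff:.3f}, p = {p:.4f}")
```

A permutation test is used here only because it makes no distributional assumption about the improvements; any standard two-sample test could be substituted.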

**References**

Altenberg, L., 1995. The Schema Theorem and Price’s Theorem. In: Foundations of Genetic Algorithms 3. Morgan Kaufmann, San Mateo, CA, pp. 23–49.

Angeline, P.J., 1997. Subtree Crossover: Building Block Engine or Macromutation? Genetic Programming 1997. Proceedings of the Second Annual Conference on Genetic Programming, Morgan Kaufmann, San Francisco, CA, pp. 9–17.

Antonisse, J., 1989. A New Interpretation of Schema Notation that Overturns the Binary Encoding Constraint. Proceedings of the Third International Conference on Genetic Algorithms, Morgan Kaufmann, San Mateo, CA, pp. 86–91.

Bäck, T., 1996. Evolutionary Algorithms in Theory and Practice. Oxford Univ. Press, NY.

Bäck, T. (Ed.), 1997. Proceedings of the Seventh International Conference on Genetic Algorithms, Morgan Kaufmann, San Francisco, CA.

Bäck, T., Schwefel, H.-P., 1993. An overview of evolutionary algorithms for parameter optimization. Evol. Comp. J. 1 (1), 1–24.

Belew, R.K., Booker, L.B. (Eds.), 1991. Proceedings of the Fourth International Conference on Genetic Algorithms, Morgan Kaufmann, San Mateo, CA.

Chellapilla, K., 1997. Evolving computer programs without subtree crossover. IEEE Trans. Evol. Comp. 1 (3), 209–216.

Chellapilla, K., 1998a. A preliminary investigation into evolving modular programs without subtree crossover. Genetic Programming 98: Proceedings of the Third Annual Genetic Programming Conference, Morgan Kaufmann, San Francisco, CA, pp. 23–31.

Chellapilla, K., 1998b. Combining mutation operators in evolutionary programming. IEEE Trans. Evol. Comp. 2 (3), 91–96.

Chellapilla, K., Fogel, D.B., 1999. Fitness distributions in evolutionary computation: analysis of noisy functions. In: Priddy, K., Keller, P., Fogel, D.B., Bezdek, J.C. (Eds.), Proceedings of Symposium Applications and Science of Computational Intelligence II, SPIE Vol. 3722, SPIE, Bellingham, WA, pp. 313–323.

Davis, L. (Ed.), 1991. Handbook of Genetic Algorithms. Van Nostrand Reinhold, NY.

De Jong, K.A., Spears, W.M., et al., 1995. Using Markov Chains to Analyze GAFOs. Foundations of Genetic Algorithms 3, Morgan Kaufmann, San Mateo, CA, pp. 115–137.

Eshelman, L.J. (Ed.), 1995. Proceedings of the Sixth International Conference on Genetic Algorithms, Morgan Kaufmann, San Mateo, CA.

Fogel, D.B., 1994. Evolutionary programming: an introduction and some current directions. Stat. Comp. 4, 113–129.

Fogel, D.B., 1995a. Evolutionary Computation: Toward a New Philosophy of Machine Intelligence. IEEE Press, Piscataway, NJ.

Fogel, D.B., 1995b. Phenotypes, Genotypes, and Operators. Proceedings of the 1995 IEEE International Conference on Evolutionary Computation, IEEE, Perth, Australia, pp. 193–198.

Fogel, D.B., Angeline, P.J., 1998. Evaluating Alternative Forms of Crossover in Evolutionary Computation on Linear Systems of Equations. SPIE Symposium on Neural, Fuzzy, and Evolutionary Computation, SPIE, San Diego, CA.

Fogel, D.B., Atmar, J.W., 1990. Comparing genetic operators with Gaussian mutations in simulated evolutionary processes using linear systems. Biol. Cybernetics 63 (2), 111–114.

Fogel, D.B., Ghozeil, A., 1997a. A note on representations and variation operators. IEEE Trans. Evol. Comp. 1 (2), 159–161.

Fogel, D.B., Ghozeil, A., 1997b. Schema processing under proportional selection in the presence of random effects. IEEE Trans. Evol. Comp. 1 (4), 290–293.

Fogel, D.B., Stayton, L.C., 1994. On the effectiveness of crossover in simulated evolutionary optimization. BioSystems 32 (3), 171–182.

Forrest, S. (Ed.), 1993. Proceedings of the Fifth International Conference on Genetic Algorithms, Morgan Kaufmann, San Mateo, CA.

Fuchs, M., 1998. Crossover versus mutation: an empirical and theoretical case study. Genetic Programming 98: Proceedings of the Third Annual Genetic Programming Conference, Morgan Kaufmann, San Francisco, CA, pp. 78–85.

Goldberg, D.E., 1989. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading, MA.

Grefenstette, J.J., 1995. Predictive Models Using Fitness Distributions of Genetic Operators. In: Foundations of Genetic Algorithms 3. Morgan Kaufmann, San Mateo, CA, pp. 139–161.

Holland, J.H., 1975. Adaptation in Natural and Artificial Systems. Univ. Michigan Press, Ann Arbor, MI.

Holland, J.H., 1992. Genetic Algorithms. Sci. Am. (July), 66–72.

Jones, T., 1995. Crossover, Macromutation, and population-based search. Proceedings of the Sixth International Conference on Genetic Algorithms, Morgan Kaufmann, San Mateo, CA, pp. 73–80.

Koza, J.R., 1989. Hierarchical genetic algorithms operating on populations of computer programs. Proceedings of the 11th International Joint Conference on Artificial Intelligence, Morgan Kaufmann, San Mateo, CA, pp. 768–774.

Koza, J.R., 1992. Genetic Programming. MIT Press, Cambridge, MA.

Lobo, F.G., Deb, K., et al., 1998. Compressed introns in a linkage learning genetic algorithm. Genetic Programming 98: Proceedings of the Third Annual Genetic Programming Conference, Morgan Kaufmann, San Francisco, CA, pp. 551–558.

Luke, S., Spector, L., 1998. A revised comparison of crossover and mutation in genetic programming. Genetic Programming 98: Proceedings of the Third Annual Genetic Programming Conference, Morgan Kaufmann, San Francisco, CA.

Macready, W.G., Wolpert, D.H., 1998. Bandit problems and the exploration/exploitation tradeoff. IEEE Trans. Evol. Comp. 2 (1), 2–22.

Manderick, B., deWeger, M., et al., 1991. The genetic algorithm and the structure of the fitness landscape. Proceedings of the Fourth International Conference on Genetic Algorithms, Morgan Kaufmann, San Mateo, CA, pp. 143–150.

Michalewicz, Z., 1992. Genetic Algorithms + Data Structures = Evolution Programs. Springer, Berlin.

Mitchell, M., 1996. An Introduction to Genetic Algorithms. MIT Press, Cambridge, MA.

Radcliffe, N.J., 1992. Non-linear genetic representations. In: Parallel Problem Solving from Nature 2. North-Holland, Amsterdam, pp. 259–268.

Rechenberg, I., 1973. Evolutionsstrategie: Optimierung Technischer Systeme nach Prinzipien der Biologischen Evolution. Frommann-Holzboog, Stuttgart.

Reed, J., Toombs, R., et al., 1967. Simulation of biological evolution and machine learning. J. Theor. Biol. 17, 319–342.

Rudolph, G., 1994. Convergence analysis of canonical genetic algorithms. IEEE Trans. Neural Networks 5 (1), 96–101.

Rudolph, G., 1997. Reflections on bandit problems and selection methods in uncertain environments. Proceedings of the Seventh International Conference on Genetic Algorithms, Morgan Kaufmann, San Francisco, CA, pp. 166–173.

Salomon, R., 1996. Reevaluating genetic algorithm performance under coordinate rotation of benchmark functions, a survey of some theoretical and practical aspects of genetic algorithms. BioSystems 39 (3), 263–278.

Schaffer, J.D. (Ed.), 1989. Proceedings of the Third International Conference on Genetic Algorithms, Morgan Kaufmann, San Mateo, CA.

Schwefel, H.-P., 1981. Numerical Optimization of Computer Models. John Wiley, Chichester, UK.

Syswerda, G., 1989. Uniform crossover in genetic algorithms. Proceedings of the Third International Conference on Genetic Algorithms, Morgan Kaufmann, San Mateo, CA, pp. 2–9.