Implementing adaptive nonlinear models
Arnold F. Shapiro a,∗, R. Paul Gorman b
a Smeal College of Business, Penn State University, University Park, PA 16802, USA
b Neuristics Corporation, 200 International Circle, Hunt Valley, MD 21030, USA
Received 1 November 1998; received in revised form 1 December 1999
Abstract
This paper addresses that class of complex problems where there is little or no underlying theory upon which to build a model and the situation dictates the use of an adaptive approach based on the observed data. The field of study is known as adaptive nonlinear models (ANMs), and its goal is to quantify interaction terms without imposing assumptions on the solution. The purpose of this paper is to discuss, in conceptual rather than technical terms, the issues related to the implementation of these ANMs. The topics covered include: a short overview of technologies used in adaptive nonlinear modeling; modeling considerations; the model development process; and a comparison of linear and nonlinear models. © 2000 Elsevier Science B.V. All rights reserved.
Keywords: Adaptive; Nonlinear; Models; Heuristic
1. Introduction
This paper addresses that class of complex problems where there is little or no underlying theory upon which to build a model and the situation dictates the use of an adaptive approach based on the observed data. The field of study is known as adaptive nonlinear models (ANMs).
The phrase “adaptive nonlinear” encompasses a wide variety of technologies, such as neural networks (NNs) and genetic algorithms (GAs). The overriding characteristic is that the approach is very strongly adaptive in the sense that, not only is it not pre-programmed, but the goal is to derive relationships among a large number of variables adaptively using some rather generic nonlinear basis functions.1 The methodology involves balancing implicit information derived from adaptive technologies and explicit information2 to develop an effective model. Another distinguishing characteristic is that a good deal of the modeling is done on the basis of sparse sample data.

∗ Corresponding author. Tel.: +1-814-865-3961. E-mail address: [email protected] (A.F. Shapiro).

1 NNs, for example, involve connection networks which are mathematically represented by a basis function u(w, x), where w stands for the weight matrix and x for the input vector. A (first-order) linear basis function takes the form $\sum_j w_{ij} x_j$, while a (second-order) nonlinear basis function takes the form $[\sum_j (x_j - w_{ij})^2]^{0.5}$.

2 Also referred to as top-down information or domain expertise.
Much of the methodology of ANMs has its roots in the defense industry, where it was used to develop target recognition systems (Gorman and Sejnowski, 1988a,b; Gorman, 1991). In recent years, however, it has been applied to financial modeling problems, including such areas as insurance company solvency, where the focus has been on improving early warning signals (Brockett et al., 1994); credit card risk and profitability, where the focus has been on the modeling of response characteristics, profitability and fraud (Gorman, 1996); and claim fraud in bodily injury claims (Brockett et al., 1998).

Fig. 1. ANM technologies.
The purpose of this paper is to discuss, in conceptual rather than technical terms, the issues related to the implementation of ANMs. The topics covered include: a short overview of ANM technologies; modeling considerations; the model development process; and a comparison of linear and nonlinear models.
2. The underlying technologies
Adaptive nonlinear modeling involves the development of crafted solutions driven by the nature of the modeling problem and requires the integration of complementary technologies. The diversity of these ANM technologies is depicted in Fig. 1,3 and they are briefly described in the statements that follow:
• Evolutionary optimization (EV) is an approach to the design of learning algorithms that is structured along the lines of the theory of evolution. While EV includes GAs, genetic programming, and evolution strategies, the primary focus in this paper is GAs.
• Fuzzy logic (FL) is a superset of conventional logic extended to handle the concept of partial truths.
• Intelligent agents are software applications that automate tasks. They recognize events and use domain knowledge to take appropriate actions based on those events.
3 A simple overview of many of these technologies is found in Shapiro (2000).
• Expert systems are designed to replicate the problem-solving capability associated with a human expert in a specialized domain.
• Bayesian (belief) networks are systems that represent cause and effect relationships among variables, along with probabilities that each cause variable will influence each effect variable. They are an alternative to fuzzy expert systems for combining expert knowledge with inferences derived from historical data.
• Case-based reasoning is an approach to problem solving based on the retrieval and adaptation of cases.
• Rule induction induces logical rules from historical data and then applies the rules to make predictions on other given data.
• Learning vector quantization (LVQ) (Kohonen, 1988, Section 7.5) is a neural computing paradigm used to improve the classification accuracy in pattern recognition problems.
• Statistical inference is the process of drawing inferences from functions on samples (statistics) to functions on populations (parameters).
• Neural-fuzzy systems are combinations of NNs with expert fuzzy systems.
• NNs are nonlinear predictive models that learn both structure and parameter values through training, and that superficially resemble biological neural networks in structure.
• Operations research (OR), from a methodology orientation, is the application of quantitative methods to solve practical problems.4
Some of these technologies, like OR and statistical inference, are well-known. Others, while of somewhat more recent vintage, are relatively common. These include expert systems, Bayesian (belief) networks, intelligent agents, case-based reasoning, and rule induction. Still others, like the soft computing technologies5 (EV, NNs, FL, and hybrids of these), have only recently been added to the actuary's arsenal. The next section provides a brief introduction to the soft computing technologies, and is intended for those readers who are unfamiliar with the area. Following that, the discussion turns to the grouping of ANM technologies into functional classes and the team approach.

4 Some would regard this methodology oriented view of OR as too narrow. Jewell (1980, p. 113), for example, would prefer to stress the system building opportunities and areas for constructive interaction, rather than the tools and techniques of OR.

5 Soft computing is a concept that was introduced by Zadeh (1994).
3. The soft computing technologies
This section gives a cursory overview of the soft computing technologies: NNs, FL and GAs. The reader is referred to the references for more detail on each of these technologies.
3.1. Neural networks
NNs (Bishop, 1995) are software programs that emulate the biological structure of the human brain and its associated neural complex and are used for pattern classification, prediction and financial analysis, and control and optimization. The core of an NN is the neural processing unit, a representation of which is shown in Fig. 2.

Fig. 2. Neural processing unit.
The inputs to the neuron, $x_j$, are multiplied by their respective weights, $w_j$, and aggregated. The weight $w_0$ serves the same function as the intercept in a regression formula. The weighted sum is then passed through an activation function, F, to produce the output of the unit. Often, the activation function takes the form of the logistic function $F(z) = (1 + e^{-z})^{-1}$, where $z = \sum_j w_j x_j$, as shown in the figure.
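To make this computation concrete, the following minimal Python sketch (our own illustration; the weight and input values are arbitrary and not taken from the paper) implements such a processing unit with a logistic activation:

import math

def logistic(z):
    # F(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(x, w):
    # x: inputs x_1..x_n; w: weights w_0 (intercept), w_1..w_n
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return logistic(z)

# example: two inputs, an intercept weight and two connection weights
print(neuron_output([0.5, -1.2], [0.1, 0.8, 0.3]))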
NNs can be either supervised or unsupervised. The distinguishing feature of a supervised NN is that its inputs and outputs are known and its objective is to discover a relationship between the two. Insurance applications using supervised NNs include Tu (1993), who compared NNs and logistic regression models for predicting length of stay in the intensive care unit following cardiac surgery, and Brockett et al. (1994), who sought to improve the early warning signals associated with property-liability insurance company insolvency. The distinguishing feature of an unsupervised NN is that only the input is known and the goal is to uncover patterns in the features of the input data. Insurance applications involving unsupervised NNs include Jang (1997), who investigated insolvencies in the life insurance industry, and Brockett et al. (1998), who investigated automobile bodily injury claims fraud. The remainder of this section is devoted to an overview of supervised and unsupervised NNs.
3.1.1. Supervised neural networks
A sketch of the operation of a supervised NN is shown in Fig. 3.
Since supervised learning is involved, the system will attempt to match a known output, such as firms that have become insolvent or claims which are fraudulent. The process begins by assigning random weights to the connection between each set of neurons in the network. These weights represent the intensity of the connection between any two neurons and will contain the memory of the network. Given the weights, the intermediate values (a hidden layer) and the output of the system are computed. If the output is optimal, the process is halted; if not, the weights are adjusted and the process is continued until an optimal solution is obtained or an alternate stopping rule is reached.
If the flow of information through the network is from the input to the output, it is known as a feed forward network. The NN is said to involve back-propagation since inadequacies in the output are fed back through the network so that the algorithm can be improved.
Fig. 4. Three-layer neural network.
3.1.2. A three-layer neural network
An NN is composed of layers of neurons, an example of which is the three-layer NN depicted in Fig. 4. Extending the notation associated with Fig. 2, the first layer, the input layer, has three neurons (labeled $x_{0j}$, j = 0, 1, 2), the second layer, the hidden processing layer, has three neurons (labeled $x_{1j}$, j = 0, 1, 2), and the third layer, the output layer, has one neuron (labeled $x_{21}$). There are two inputs, $I_1$ and $I_2$.

The neurons are connected by the weights $w_{ijk}$, where the subscripts i, j, and k refer to the ith layer, the jth node of the ith layer, and the kth node of the (i+1)th layer, respectively. Thus, for example, $w_{021}$ is the weight connecting node 2 of the input layer (layer 0) to node 1 of the hidden layer (layer 1). It follows that the aggregation in the neural processing associated with the hidden neuron $x_{11}$ results in $z = x_{00}w_{001} + x_{01}w_{011} + x_{02}w_{021}$, which is the input to the activation function.
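The following sketch (our own, with arbitrary illustrative weight values) traces this forward pass through the three-layer network of Fig. 4, with $x_{00}$ and $x_{10}$ treated as bias nodes fixed at 1:

import math

def F(z):                       # logistic activation
    return 1.0 / (1.0 + math.exp(-z))

# w[i][j][k]: weight from node j of layer i to node k of layer i+1
w = {0: {0: {1: 0.2, 2: -0.4}, 1: {1: 0.7, 2: 0.1}, 2: {1: -0.5, 2: 0.9}},
     1: {0: {1: 0.3}, 1: {1: 0.6}, 2: {1: -0.8}}}

I1, I2 = 0.5, 1.5
x0 = {0: 1.0, 1: I1, 2: I2}     # input layer (x00 is the bias node)

# hidden layer, e.g. x11 = F(x00*w001 + x01*w011 + x02*w021)
x1 = {0: 1.0}
for k in (1, 2):
    x1[k] = F(sum(x0[j] * w[0][j][k] for j in x0))

# output layer
x21 = F(sum(x1[j] * w[1][j][1] for j in x1))
print(x21)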
3.1.3. The learning rules
The weights of the network serve as its memory, and so the network “learns” when its weights are updated. The updating is done using a learning rule, a common example of which is the Delta rule (Shepherd, 1997, p. 15), under which each weight adjustment is the product of a learning rate, which controls the speed of convergence, an error signal, and the value associated with the jth node of the ith layer. The choice of the learning rate is critical: if its value is too large, the error term may not converge at all, and if it is too small, the weight updating process may get stuck in a local minimum and/or be extremely time intensive.
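A minimal sketch of a Delta-rule update for a single linear output unit is given below; the learning rate of 0.1, the inputs and the target are arbitrary choices for illustration only:

def delta_rule_update(w, x, target, learning_rate=0.1):
    # w, x: weight and input vectors of equal length
    output = sum(wj * xj for wj, xj in zip(w, x))
    error = target - output                     # error signal
    # each weight changes by learning_rate * error * input value
    return [wj + learning_rate * error * xj for wj, xj in zip(w, x)]

w = [0.0, 0.0, 0.0]
for _ in range(100):                            # repeated presentations
    w = delta_rule_update(w, [1.0, 0.5, -0.3], target=0.8)
print(w)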
3.1.4. The learning strategy of a neural network
The characteristic feature of NNs is their ability to learn, and the strategy by which this takes place involves training, testing, and validation. Briefly, the clean and scrubbed data is randomly subdivided into three subsets: T1, which is used for training the network; T2, which is used for testing the stopping rule; and T3, which is used for validating the resulting network. For example, T1, T2 and T3 may be 50, 25 and 25% of the database, respectively. The stopping rule reduces the likelihood that the network will become overtrained, by stopping the training on T1 when the predictive ability of the network, as measured on T2, is no longer improved.
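The strategy can be sketched as follows; the function names are ours, the data is assumed to be a NumPy array of records, and train_one_epoch and error stand for whatever training and error-measurement routines the modeler supplies:

import numpy as np

rng = np.random.default_rng(0)

def split_50_25_25(data):
    # randomly partition the cleaned data into T1 (training),
    # T2 (testing the stopping rule) and T3 (validation)
    data = data[rng.permutation(len(data))]
    n = len(data)
    return data[: n // 2], data[n // 2 : 3 * n // 4], data[3 * n // 4 :]

def train_with_stopping_rule(model, T1, T2, train_one_epoch, error, patience=5):
    # train_one_epoch and error are placeholders for the modeler's own routines;
    # training on T1 stops once the error measured on T2 no longer improves
    best_err, best_model, waited = float("inf"), model, 0
    while waited < patience:
        model = train_one_epoch(model, T1)
        err = error(model, T2)
        if err < best_err:
            best_err, best_model, waited = err, model, 0
        else:
            waited += 1
    return best_model

T1, T2, T3 = split_50_25_25(rng.normal(size=(1000, 5)))
print(len(T1), len(T2), len(T3))   # 500 250 250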
3.1.5. Unsupervised neural networks
This section discusses one of the most common unsupervised NNs, the Kohonen network (Kohonen, 1988), which often is referred to as a self-organizing feature map (SOFM). The purpose of the network is to emulate our understanding of how the brain uses spatial mappings to model complex data structures. Specifically, the learning algorithm develops a mapping from the input patterns to the output units that embodies the features of the input patterns.
In contrast to the supervised network, where the neurons are arranged in layers, in the Kohonen network they are arranged in a planar configuration and the inputs are connected to each unit in the network. The configuration is depicted in Fig. 5.
Fig. 6. Operation of a 2D Kohonen network.
As indicated, the Kohonen SOFM is a two-layered network consisting of a set of input units in the input layer and a set of output units arranged in a grid called a Kohonen layer. The input and output layers are totally interconnected and there is a weight associated with each link, which is a measure of the intensity of the link.
The sketch of the operation of an unsupervised NN is shown in Fig. 6.
The first step in the process is to initialize the parameters and organize the data. This entails setting the iteration index, t, to 0, the interconnecting weights to small positive random values, and the learning rate to a value smaller than but close to 1. Each unit has a neighborhood of units associated with it, and empirical evidence suggests that the best approach is to have the neighborhoods fairly broad initially and then to have them decrease over time. Similarly, the learning rate is a decreasing function of time.
Each iteration begins by randomizing the training sample, which is composed of P patterns, each of which is represented by a numerical vector. For example, the patterns may be composed of solvent and insolvent insurance companies and the input variables may be financial ratios. Until the number of patterns used (p) exceeds the number available (p > P), the patterns are presented to the units on the grid, each of which is assigned the Euclidean distance between its connecting weight to the input unit and the value of the input. This distance is given by $[\sum_j (x_j - w_{ij})^2]^{0.5}$, where $w_{ij}$ is the connecting weight between the jth input unit and the ith unit on the grid and $x_j$ the input from unit j. The unit which is the best match to the pattern, the winning unit, is used to adjust the weights of the units in its neighborhood. The process continues until the number of iterations exceeds some predetermined value (T).

In the foregoing training process, the winning units in the Kohonen layer develop clusters of neighbors which represent the class types found in the training patterns. As a result, patterns associated with each other in the input space will be mapped onto output units which also are associated with each other. Since the class of each cluster is known, the network can be used to classify the inputs.
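A compact sketch of this training loop, under the simplifying assumptions of a square grid, a box-shaped neighborhood, and linearly decaying learning rate and radius (the particular decay schedules are ours, chosen only for illustration), might look as follows:

import numpy as np

rng = np.random.default_rng(1)

def train_sofm(patterns, grid_size=5, T=1000, lr0=0.9, radius0=2.5):
    # weights: one vector per unit on a grid_size x grid_size Kohonen layer
    n_features = patterns.shape[1]
    w = rng.random((grid_size, grid_size, n_features)) * 0.1
    coords = np.stack(np.meshgrid(np.arange(grid_size), np.arange(grid_size),
                                  indexing="ij"), axis=-1)
    for t in range(T):
        lr = lr0 * (1 - t / T)                    # learning rate decreases over time
        radius = max(radius0 * (1 - t / T), 1.0)  # neighborhood shrinks over time
        for x in patterns[rng.permutation(len(patterns))]:
            dist = np.sqrt(((x - w) ** 2).sum(axis=-1))   # Euclidean distance to each unit
            winner = np.unravel_index(dist.argmin(), dist.shape)
            # units within the winner's neighborhood move toward the pattern
            in_nbhd = np.abs(coords - np.array(winner)).max(axis=-1) <= radius
            w[in_nbhd] += lr * (x - w[in_nbhd])
    return w

Here the patterns would be a NumPy array of numerical vectors (for example, financial ratios); after training, the clusters of winning units can be inspected to classify the inputs.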
3.2. Fuzzy logic
FL6 was developed as a response to the fact that most of the parameters we encounter in the real world are not precisely defined. For example, a particular investor may have a “high risk capacity” or the rate of return on an investment might be “around 6%”; the first of these is known as a linguistic variable while the second is known as a fuzzy number. These concepts and the structure of an FL system are discussed in this section.
3.2.1. The structure of a fuzzy logic system
The essential structure of an FL system is depicted in the flow chart shown in Fig. 7, which was adapted from Von Altrock (1997, p. 37).

Fig. 7. An FL system.
6 Following Zadeh (1994, p. 192), in this paper the term FL is used in its wide sense.
Fig. 8. (Fuzzy) set of clients with high risk capacity.
In the figure, numerical variables are the input of the system. These variables are passed through a fuzzification stage, where they are transformed to linguistic variables and subjected to inference rules. The linguistic results are then transformed by a defuzzification stage into numerical values which become the output of the system.
3.2.2. Linguistic variables
A linguistic variable (Zadeh, 1975a,b, 1981) is a variable whose values are expressed as words or sentences. Risk capacity, for example, may be viewed both as a numerical value ranging over the interval [0, 100%], and as a linguistic variable that can take on values like high, not very high, and so on. Each of these linguistic values may be interpreted as a label of a fuzzy subset of the universe of discourse X = [0, 100%], whose base variable, x, is the generic numerical value risk capacity. Such a set, an example of which is shown in Fig. 8, is characterized by a membership function, $\mu_{high}(x)$, which assigns to each object a grade of membership ranging between zero and one. In this case, which represents the set of clients with a high risk capacity, individuals with a risk capacity of 50%, or less, are assigned a membership grade of zero and those with a risk capacity of 80%, or more, are assigned a grade of one. Between those risk capacities, (50, 80%), the grade of membership is fuzzy.
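A minimal sketch of such a membership function, assuming (purely for illustration) a linear rise between the 50 and 80% risk capacities, is:

def mu_high(x):
    # membership grade for the fuzzy set "high risk capacity" of Fig. 8:
    # 0 at or below 50%, 1 at or above 80%, linear in between (the linearity
    # is our assumption; the figure only indicates a monotone rise)
    if x <= 50.0:
        return 0.0
    if x >= 80.0:
        return 1.0
    return (x - 50.0) / 30.0

print(mu_high(65.0))   # 0.5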
Fuzzy sets are implemented by extending many of the basic identities that hold for ordinary sets. Thus, for example, the union of fuzzy sets A and B is the smallest fuzzy set containing both A and B, and the intersection of A and B is the largest fuzzy set which is contained in both A and B.
Representative insurance papers involving linguistic variables include DeWit (1982), the first FL paper in the area, which dealt with individual underwriting, and
Young (1993, 1996), who modeled the selection and rate changing process in group health insurance.
3.2.3. Fuzzy numbers
The general characteristics of a fuzzy number (Zadeh, 1975a,b; Dubois and Prade, 1980) are represented in Fig. 9.

This shape of fuzzy number is referred to as a “flat” fuzzy number; if $m_2$ were equal to $m_3$, it would be referred to as a “triangular” fuzzy number. The fuzzy number M is characterized by the points $m_j$, j = 1, 2, 3, 4, and the functions $f_j(y|M)$, j = 1, 2, which are inverse functions mapping the membership function onto the real line. As indicated, a fuzzy number is usually taken to be a convex fuzzy subset of the real line.
As one would anticipate, fuzzy arithmetic can be applied to fuzzy numbers. Using the extension principle (Zadeh, 1975a,b), the nonfuzzy arithmetic operations can be extended to incorporate fuzzy sets and fuzzy numbers. Briefly, if ∗ is a binary operation such as addition (+) or min (∧), the fuzzy number z, defined by z = x ∗ y, is given as a fuzzy set by

$\mu_z(w) = \vee_{u,v}\, \mu_x(u) \wedge \mu_y(v)$, u, v, w ∈ R,

subject to the constraint that w = u ∗ v, where $\mu_x$, $\mu_y$, and $\mu_z$ denote the membership functions of x, y, and z, respectively, and $\vee_{u,v}$ denotes the supremum over u, v.
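For discretized fuzzy numbers, the extension principle can be sketched directly from this definition; the dictionary representation used here is our own simplification, with each fuzzy number given as a map from support values to membership grades:

from collections import defaultdict

def extend(op, mu_x, mu_y):
    # sup-min extension of a crisp binary operation op to fuzzy sets
    # represented as dictionaries {support value: membership grade}
    mu_z = defaultdict(float)
    for u, mx in mu_x.items():
        for v, my in mu_y.items():
            w = op(u, v)
            mu_z[w] = max(mu_z[w], min(mx, my))   # supremum of the minima
    return dict(mu_z)

# "about 2" + "about 3" for coarsely discretized triangular fuzzy numbers
about2 = {1: 0.5, 2: 1.0, 3: 0.5}
about3 = {2: 0.5, 3: 1.0, 4: 0.5}
print(extend(lambda u, v: u + v, about2, about3))   # roughly "about 5"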
Representative insurance papers that focused on fuzzy numbers include Lemaire (1990), who showed how to compute a fuzzy premium for a pure endowment policy, Ostaszewski (1993), who extended Lemaire, and Cummins and Derrig (1997), who addressed the financial pricing of property-liability insurance contracts.
A large number of potential FL applications in insurance are mentioned in Ostaszewski (1993). Readers interested in a grand tour of the first 30 years of FL are urged to read the collection of Zadeh's papers contained in Yager et al. (1987) and Klir and Yuan (1996).
3.3. Genetic algorithms
GAs are automated heuristics that perform optimization by emulating biological evolution. They are particularly well suited for solving problems that involve loose constraints, such as discontinuity, noise, high dimensionality, and multimodal objective functions. Examples of GA applications in the insurance area include Wendt (1995), who used a GA to build a portfolio efficient frontier (a set of portfolios with optimal combinations of risk and returns), and Tan (1997), who developed a flexible framework to measure the profitability, risk, and competitiveness of insurance products.
GAs can be thought of as an automated, intelligent approach to trial and error, based on principles of natural selection. In this sense, they are modern successors to Monte Carlo search methods. The flow chart in Fig. 10 gives a representation of the process.
As indicated, GAs are iterative procedures, where each iteration (g) represents a generation. The process starts with an initial population of solutions, P(0), which are randomly generated. From this initial population, the best solutions are “bred” with each other and the worst are discarded. The process ends when the termination criterion is satisfied.
Fig. 10. Flow chart of GA.

For a simple example, suppose that the problem is to find, by trial and error, the value of x, x = 0, 1, ..., 31, which maximizes f(x), where f(x) is the output of a black box. Using the methodology of Holland (1975), an initial population of potential solutions $\{y_j \mid j = 1, \ldots, N\}$ would be randomly generated, where each solution would be represented in binary form. Thus, if 0 and 31 were in this initial population of solutions, they would be represented as 00000 and 11111, respectively. A simple measure of the fitness of $y_j$ is $p_j = f(y_j)/\sum_j f(y_j)$, and the solutions with the highest $p_j$'s would be bred with one another.
There are three ways to develop a new generation of solutions: reproduction, crossover and mutation. Reproduction adds a copy of a fit individual to the next generation. In the previous example, reproduction would take place by randomly choosing a solution from the population, where the probability a given solution would be chosen depends on its $p_j$ value. Crossover emulates the process of creating children, and involves the creation of new individuals (children) from two fit parents by a recombination of their genes (parameters). In the example, crossover would take place in two steps: first, the fit parents would be randomly chosen on the basis of their $p_j$ values; second, there would be a recombination of their genes. If, for example, the fit parents were 11000 and 01101, crossover might result in the two children 11001 and 01100. Under mutation, there is a small probability that some of the gene values in the population will be replaced with randomly generated values. This has the potential effect of introducing good gene values that may not have occurred in the initial population or which were eliminated during the iterations. In this illustration, the process is repeated until the new generation has the same number of individuals (M) as the current one.
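The following self-contained sketch (our own; the fitness function, population size and mutation rate are arbitrary) combines fitness-proportional selection of parents, single-point crossover and mutation for the 5-bit example above; a reproduction step could be added by occasionally copying a fit parent unchanged:

import random

random.seed(0)

def f(x):                     # the "black box"; here simply f(x) = x**2
    return x ** 2

def evolve(pop_size=6, n_bits=5, generations=20, p_mutation=0.01):
    pop = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for g in range(generations):
        fitness = [f(int("".join(map(str, y)), 2)) for y in pop]
        total = sum(fitness)
        # p_j = f(y_j) / sum_j f(y_j); fall back to uniform if all fitnesses are zero
        p = [fi / total for fi in fitness] if total > 0 else [1.0 / pop_size] * pop_size
        new_pop = []
        while len(new_pop) < pop_size:
            # crossover: choose two fit parents by their p_j values, recombine genes
            mother, father = random.choices(pop, weights=p, k=2)
            cut = random.randint(1, n_bits - 1)
            child = mother[:cut] + father[cut:]
            # mutation: occasionally replace a gene with a random value
            child = [1 - b if random.random() < p_mutation else b for b in child]
            new_pop.append(child)
        pop = new_pop
    return max(pop, key=lambda y: f(int("".join(map(str, y)), 2)))

print(evolve())   # tends toward 11111, i.e. x = 31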
3.4. Hybrid systems
While the foregoing discussions focused on each technology separately, a natural evolution in soft computing has been the emergence of hybrid systems, where the technologies are used simultaneously. FL based technologies can be used to design NNs or GAs, with the effect of increasing their capability to display good performance across a wide range of complex problems with imprecise data. Thus, for example, a fuzzy NN can be constructed where the NN possesses fuzzy signals and/or has fuzzy weights. Conversely, FL can use technologies from other fields, like NNs or GAs, to deduce or to tune, from observed data, the membership functions in fuzzy rules, and may also structure or learn the rules themselves.
4. Functional classes
ANM technologies also can be grouped into functional classes. Broadly, these classes include:
• Generalized nonlinear function approximation, which adaptively constructs a nonlinear relationship between observed variables and dependent variables. Technologies in this class include NNs and generalized regression networks.
• Domain segmentation, which involves finding subdomains in the observation space where the relationships between observed variables and dependent variables are consistent. This is a very crucial step in the development of nonlinear models. This class includes rule induction and technologies for analyzing data and for inducing decision trees from data.
• Generalized knowledge encoding, which uses domain knowledge to creatively bias more general adaptive methods. This is an important component in the evolution of robust nonlinear models. Technologies in this category include expert systems, neuro-fuzzy models and case-based reasoning technologies.
• Dimension reduction, which uses variable selection and aggregation to lower the dimensionality of the problem. This group includes Kohonen networks and fuzzy clustering (Zimmermann, 1991, Section 11.2), which seeks to divide objects into categorically homogeneous subsets called “clusters”.
• Numerical optimization, which allows the numerical estimation of model parameters based on a sample of data. These optimization technologies are an indispensable element of an adaptive nonlinear modeling toolkit. Technologies in this class include gradient descent, simulated annealing (Aarts and Van Laarhoven, 1987), and GAs. Gradient descent iteratively updates the weight vector in the direction of the greatest decrease in the network error, and simulated annealing is a stochastic algorithm that minimizes numerical functions, whose distinguishing feature is that it uses a random process to elude local minima.
5. The team approach
Given this wide range of technologies, it often is advantageous to approach a problem with a team whose members have diverse backgrounds and experiences. Thus, for example, an ideal team may be composed of members whose backgrounds include not only mathematics, economics and statistics, but physics, and computational and computer science, as well. It is the ability of the team to craft a solution by integrating complementary advanced technologies that makes this methodology so powerful.
6. Modeling considerations
Modeling considerations include the heuristic nature of the approach, data issues, the emphasis on nonlinear relationships, and domain knowledge.
6.1. Heuristic approach
There is no canonically optimal approach to the development of models for nonlinear problems, so solutions can vary considerably from problem to problem. In this sense, the technology is highly heuristic and often ad hoc, and since the field is still in its embryonic stage, there is considerable room for improvement. Thus, the formulation of a science of adaptive nonlinear modeling should be considered as a work in progress.
Moreover, while the focus is on ANM technologies, this cannot always be to the exclusion of more traditional approaches. Regardless of the sophistication, if the signal-to-noise ratios8 are so poor that reasonable relationships cannot be derived, it may be necessary to resort to more conventional technologies to attain adequate performance.
8If it is assumed that a system has a given pattern, µ, and
6.2. Data issues
There are many issues with data if there is no theoretical framework to constrain the solution, since the resolution of the problem depends on and is highly sensitive to the nature of the sample data. As a consequence, considerable resources are devoted to processing the data, with an emphasis on missing and corrupted data and the removal of bias from the sample. Additionally, where multiple sources of data are involved, the consistency of the differential semantics across these sources has to be verified.
6.3. Emphasis on nonlinear relationships
A distinguishing assumption of this approach is that there are important nonlinearities both between the observables (independent variables) and the dependent variable, as well as nonlinearities among the observables. The emphasis is on not making unjustified assumptions about the nature of those nonlinearities, and technologies are used that have the capacity, in theory at least, to extract the appropriate interaction terms adaptively.
6.4. Domain knowledge
As mentioned previously, the technologies do not always achieve their ends because of the signal-to-noise ratios in the sample data. Of necessity, in these instances, the approach is to constrain the solution by introducing expert knowledge into the process. So, it is not quite a theoretical framework, but rather a heuristic framework to help constrain the solution space.
7. The model development process
An overview of the key features of the general model development process is shown in Fig. 11,9 and previewed in this section. The process involves data preprocessing, domain segmentation, variable selection, model development, and benchmarking and validation.
9 Adopted from Gorman (1996), Slide 7.
Fig. 11. Model development process.
7.1. Data preprocessing
The data preprocessing stage focuses on the reduction of inconsistencies and bias in the data, and the development of aggregate information as a proxy for relevant individual characteristics.
7.2. Domain segmentation
Domain segmentation is actually a part of data preprocessing, but because of its importance, it is shown here as a separate step. In addition to rule induction, which was mentioned previously, it involves such technologies as supervised and unsupervised clustering and gated architectures, each of which is discussed below. If it is appropriate, models are attempted within these domains.
7.3. Variable selection
One way to reduce the amount of variability in the model is to constrain the number of predictors. Of course, this process must be balanced with the need to preserve information, and this can be accomplished using traditional approaches like regression analysis and sensitivity analysis. The essence of this process was discussed by Brockett et al. (1994, pp. 411–412). In some cases, it is possible to use technologies that prune the parameters as the model learns the problem.10 Similarly, weight pruning11 and decay12 can be used.
10 An example of parameter pruning would be the discarding of subordinate solutions during the training stage of a GA.

11 Weight pruning refers to the adjustment of weights in a weighted procedure. An example is the adjustment of weights that takes place through the back-propagation algorithm of NNs.

12 Decay emulates the process of forgetting over time.
The technologies mentioned earlier in this section were listed in the order of the amount of manual heuristic knowledge inherent in each stage. Ideally, tasks are pushed down to where the development is automatic, and the structure in the data is used to extract domain boundaries and information in the data is used to extract the interaction terms. Again, however, the process can be thwarted by small sample size and poor signal-to-noise ratios.
7.4. Model development
Once the best performance predictors have been identified, the next step is the development of the nonlinear model, and a considerable portion of this paper is devoted to that topic. Related issues that need to be reconciled are the advantages and disadvantages of both the linear paradigm and nonlinear paradigm, and the reasons for taking on the complexities of trying to extract nonlinearities.
7.5. Benchmarking and validation
The final step in the model development process is benchmarking and model validation. The latter is a part of comparative performance testing and is done iteratively during model development to verify that if the approach adds complexity, it also adds comparable value.13
It should be clear that the approach is very empirical and that the nature of the problem determines the approach. This is even more apparent in the remainder of the paper, where the details of each of these steps are discussed.
8. Data preprocessing
The primary considerations in data preprocessing are to reconcile disparate sources of data, to reduce or eliminate intrinsic data bias, and to aggregate variables, when appropriate. These issues are addressed in this section.
13 The accounting profession refers to this consideration as the “materiality criterion”.
8.1. Reconcile disparate sources of data
Generally, a number of sources of data are needed to develop the model. These might include insurers and agencies, household demographics, econometric data, and client internal transaction data. Consequently, reconciling disparate sources of data becomes critical.
8.2. Intrinsic data bias
Another challenge when dealing with data is to reduce some of its internal biases. In the area of consumer behavior, for example, where adverse selection is the issue, the insured data base may provide limited guidance in some cases because it contains only insureds, and people that need to be identified on the adverse side already have been selected away. So, strategies need to be developed to compensate for these biases.
8.3. Aggregate variables
One productive approach is to develop a set of aggregate variables to help take the raw state of these sources of variables and bring them together into a concise set of aggregates. A common example of this is the use of residential areas as a proxy for socioeconomic characteristics.14 Where this is done, that level will typically be used to begin the modeling process. As discussed by Bishop (1995, Section 8.6.2), one might approach this issue using a kind of neural network architecture that is autoassociative,15 which tries to predict at the output the same patterns that are at the input, through a narrow number of units. This results in compression and at the same time takes advantage of nonlinearities or interaction terms between the observables.
14 The use of aggregate information as a proxy for the individual characteristics of interest has to be used with care because it can result in biases. The reason for this is that there is a tendency for aggregate proxies to exaggerate the effects of micro-level variables and to do more poorly than micro-level variables at controlling for confounding. This has been found, for example, when socioeconomic characteristics of residential areas (such as median income associated with a zip code) are used to proxy for individual characteristics. See Geronimus et al. (1996).
15 An autoassociative network is a network where the target data are the same as the input data.
Another approach would be to use nonlinear compression, which is kind of the nonlinear correlate to factor analysis16 or principal components.17 This can be accomplished, for example, with a four-layer autoassociative network, where the first and third hidden layers have sigmoidal nonlinear activation functions.
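As a sketch of the architecture only (the layer sizes are arbitrary, and in practice the weights would be fitted by the numerical optimization techniques discussed in Section 13), such a network squeezes the observables through a narrow bottleneck and tries to reproduce them at the output; the bottleneck activations then serve as the compressed aggregate variables:

import numpy as np

rng = np.random.default_rng(0)

def tanh_layer(x, W, b):
    return np.tanh(x @ W + b)

def autoassociative_forward(x, params):
    # four weight layers; the second hidden layer is the narrow "bottleneck"
    # whose activations provide the nonlinear compression of the inputs
    h1 = tanh_layer(x, *params["enc1"])                 # sigmoidal nonlinear layer
    code = h1 @ params["enc2"][0] + params["enc2"][1]   # bottleneck (linear)
    h3 = tanh_layer(code, *params["dec1"])              # sigmoidal nonlinear layer
    out = h3 @ params["dec2"][0] + params["dec2"][1]    # reconstruction of x
    return code, out

def init(n_in=10, n_hidden=6, n_code=2):
    sizes = {"enc1": (n_in, n_hidden), "enc2": (n_hidden, n_code),
             "dec1": (n_code, n_hidden), "dec2": (n_hidden, n_in)}
    return {k: (rng.normal(0, 0.1, s), np.zeros(s[1])) for k, s in sizes.items()}

x = rng.normal(size=(5, 10))                  # five sample patterns, ten observables
code, out = autoassociative_forward(x, init())
print(code.shape, out.shape)                  # (5, 2) compressed, (5, 10) reconstructed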
9. Domain segmentation
Domain segmentation involves the identification of segments within the decision space where the implicit relationship between variables is constant. It is a very important step and has been demonstrated to provide enormous amounts of value-added performance (Kelly et al., 1995). Fig. 12 exemplifies the situation.18
Traditionally, two approaches have been used for segmentation. One is to try to find segments in the population that have relatively constant behavior within a group and then to assign either a score or an output to that entire group, assuming they have uniform behavior. Another approach is to ignore the segmentation altogether and attempt to fit a model to the entire decision space.
Fig. 12 demonstrates that neither approach really gets at the underlying structure in the data since, in both approaches, much of the resolution in the model is lost. Typically, what needs to be done is to isolate the unique domains and model within them. This has the advantage of improving the ability of these adaptive technologies to extract the structure. In essence, the technique is allowed to focus in on relatively stationary behavior, so that it has a better opportunity to extract the information.

Fig. 12. Domain segmentation.

16 In the current context, factor analysis may be thought of as a technique which uses measures of association (correlations) to extract patterns (latent structure in the data) of association (a dependence on common processes) in complex data sets. To be really useful and valid, factor analysis needs large data arrays, as correlations can be found for spurious reasons.

17 Principal component analysis is a methodology for finding the structure of a cluster located in multidimensional space. Conceptually, it is equivalent to choosing that rotation of the cluster which best depicts its underlying structure.

18 Adopted from Gorman (1996), Slide 9.
9.1. Isotropic subdomains
Domains that may be used in the development of insurance models include the insurance companies themselves. In the area of consumer behavior models, for example, it may turn out, as it has with credit bureaus (Gorman, 1996), that each insurer actually reflects relatively distinct characteristics of individual consumer behavior and so modeling within companies rather than across them may have some advantages. Depending on the inquiry, one would also expect that there are geographic regions that should be cordoned off. So, those groups can be isolated and modeled within those domains to get a better resolution with regard to that behavior.
It also generally makes sense to classify clients by adverse selection characteristics and to model within each of those classes. Moreover, as discussed below, certain aspects of temporal behavior are likely to be much more important than cross-sectional behavior. Since the goal is to refine the detection of these types of behavior the data could be segregated accordingly.
10. Variable selection and derivation
The next step in the process of developing a model is to select and aggregate raw variables to obtain the most concise representation of the information content within the data.
10.1. Concise representation
The aim is to represent the information concisely, retaining only the variables that appear relevant, by using aggregates developed from experience in other applications.
10.2. Primary methodologies
The primary methodologies include rule induction technologies,19 which encompass such things as CHAID20 and neuro-fuzzy inferencing (Wang et al., 1995, p. 89), and significance testing using regression and sensitivity analysis.21 Again, wherever the structure within the domain allows the use of dynamic variable selection based on pruning the parameters or pruning the weights, that approach is adopted.22 Of course, depending on the domain and the strength of the structure exhibited in the data, that technique may or may not work.

19 Rule induction comprises a wide variety of technologies, but the basic intent is to take a set of sample data and extract implicit rules in the data itself. For instance, in the case of some technologies that might be considered neuro-fuzzy technologies, which are really kind of kernel-based neural networks, rules (if-then statements, really) can be represented in terms of membership functions that are defined over ranges of variables. The model can be set up with both the position and the boundaries of these membership functions randomized, and then the parameters associated with the boundaries can be adapted by looking at the data itself. Hence, implicit rules can be extracted to help predict the output. An example would be whether an individual with a certain pattern was a high risk or whether a contract on that individual was likely to be profitable.

20 CHAID (chi-squared automatic interaction detection) (SPSS, 1993) has been a popular method for segmentation and profiling, which is used when the variable responses are categorical in nature and a relationship is sought between the predictor variables and a categorical outcome measure. It seeks to formulate interaction terms between variables and uses a kind of maximum likelihood technique to determine where the boundaries are along the ranges of variables. It then builds that up hierarchically, which allows rules to be extracted.

21 These may involve such things as CART (classification and regression trees) (Breiman et al., 1984), which is a procedure for analyzing categorical (classification) or continuous (regression) data, and C4.5 (Quinlan, 1993), which is an algorithm for inducing decision trees from data. These technologies are used primarily to discover domains within the data but they also provide some insight into which variables are predictive. Moreover, they have the advantage of being able to address joint relationships between variables, as opposed to something like regression, which looks at how significant predictors are independently.

22 This optimization technique is kind of a connectionist architecture which uses gradient descent (Hayes, 1996, p. 499) or a conjugate gradient technique, which is an improved steepest descent approach, or perhaps evolutionary GAs, to optimize the parameters and locate the appropriate boundaries, and thus develop the best set of predictions.
10.3. Behavioral changes
Empirical evidence suggests that a very important aspect of predicting behavior is not simply the current status of an individual, but how that status changes over time. So, a number of aggregates have been derived to help capture that characteristic, and some of the predictive variables are sampled over time to monitor the trend. With respect to credit cards, for example, key considerations are the balance-to-credit ratio, patterns of status updates, and the age difference between primary and secondary household members.
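A simple sketch of this kind of derived variable, with hypothetical column names and data, computes the balance-to-credit ratio and its month-over-month trend from account snapshots:

import pandas as pd

# monthly snapshots per account (hypothetical column names and values)
snapshots = pd.DataFrame({
    "account": [1, 1, 1, 2, 2, 2],
    "month":   [1, 2, 3, 1, 2, 3],
    "balance": [200.0, 450.0, 800.0, 300.0, 280.0, 250.0],
    "credit_limit": [1000.0] * 6,
})

snapshots = snapshots.sort_values(["account", "month"])
snapshots["bal_to_credit"] = snapshots["balance"] / snapshots["credit_limit"]
# trend: month-over-month change in the ratio, a simple "behavioral change" aggregate
snapshots["bal_to_credit_trend"] = snapshots.groupby("account")["bal_to_credit"].diff()
print(snapshots)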
11. A comparison of linear and nonlinear models
The stage is set to discuss the actual development of the model. Before doing that, however, it is appropriate to digress to compare the linear and nonlinear models and to describe the motivations for the nonlinear approach and some of its shortcomings. The assumption is made a priori that the goal is to extract interactions out of the sample data.
11.1. The linear modeling paradigm
Approaching the world from a linear perspective is a very powerful strategy. This, coupled with the superposition assumption that complex behavior can be modeled as a linear combination of simpler behaviors (Hayes, 1996, p. 10) and independence of dimensions, provides a powerful set of technologies for analyzing performance and significance of variables. In practice, this can lead to a tendency to ignore the nonlinear factors, the justification being that the higher order terms add only a slight perturbation to the overall behavior the model is attempting to capture.
Interactions are often extraordinarily important from the perspective of many of the less well understood financial problems.23 Consequently, if it is assumed that all the variables are independent, the model must involve an enormous number of degrees of freedom to capture that complexity. In due course, since computational complexity scales with model complexity, a threshold is reached with linear systems where the model becomes very brittle24 as its dimensionality is increased and an attempt is made to capture finer and finer behavior.

23 This is not a new phenomenon; it also is true from a target recognition perspective.
Of course, nonlinearities can be accommodated in the linear regime while still taking advantage of many of the powerful technologies available when modeling from a linear perspective. This can be accomplished by making assumptions about the form of the nonlinearity, by explicitly representing higher order statistics (Nikias and Petropulu, 1993), and by using nonlinear basis functions that are orthogonal25 (Chen et al., 1991). However, trying to represent these interaction terms becomes a combinatorial problem26 as the dimensionality of the model increases.
So, the linear approach has been very powerful but it has its limitations when the complexity increases.
11.2. The nonlinear modeling paradigm
The nonlinear approach that guides many of these technologies was derived from studies of complex systems, like neural or evolutionary systems, where complexity grew out of relatively simple components whose interactions were the key to the emergent behavior. Interactions, of course, imply nonlinearity.
The nonlinear modeling approach is depicted in Fig. 13.
As indicated, the process models complexity without computational complexity. It starts with very simple transforms: in the case of NNs, a weighted sum is developed and passed through a nonlinearity. That is done interactively through layers and, though the transformation is simple, these simple nonlinearities are combined to approximate very complex nonlinear behavior. As a result, one is forced to approach the problem adaptively, simply because there are no closed form solutions when dealing with nonlinear basis functions that are nonorthogonal.

Fig. 13. Nonlinear modeling approach.

24 “Brittle” is a common and descriptive term in engineering, which implies that the effects of the model are very sensitive to minor changes in the parameters.

25 Two random variables are said to be orthogonal if their correlation is zero.

26 Combinatorial optimization problems present difficulties because they cannot be computed in polynomial time. Instead, they require times that are exponential functions of the problem size.
The only way to estimate the parameters associated with models of this type, where there are hundreds of degrees of freedom, is to adopt some kind of numerical optimization technique involving incremental optimization. From a cost-benefit perspective, the primary reason this would be attempted is that, theoretically at least, if no stringent assumptions are made, one can model an arbitrary nonlinear function to any arbitrary degree of accuracy by overlaying these basis functions. This is an ideal; it obviously is not always the case.
11.3. Linear vs. nonlinear models
One example which clearly distinguishes between the two approaches when trying to capture complex behavior involves the determination of the underlying structure of a time series. Resorting to spectrum estimation (Hayes, 1996, Chapter 8), one might try to capture the structure in the time series by building it up from simple sine and cosine functions, which serve as orthogonal basis functions.27 Given a Fourier transform, the parameters associated with that transform can be determined analytically.
It can turn out, however, that although the time series looks periodic, the power spectrum28 has a very broad band,29 which is problematic for the Fourier approach, since it indicates the existence of a continuum of frequencies. Typically, in order to capture the complex behavior, it is necessary to sample over a large number of time samples and to do the transform with sufficient spectral resolution. That can result in a large number of degrees of freedom, perhaps on the order of a thousand, or more, depending on the situation.

Fig. 14. Model performance (high signal-to-noise case).

27 The orthogonal functions are $\cos(2\pi t/L)$ and $\sin(2\pi t/L)$, where L is the period.

28 The power spectrum is the Fourier transform of the autocorrelation function of the series.

29 A broad band power spectrum suggests either a purely random or noisy process or chaotic motion. In this instance, since the series looks periodic, we apparently are confronted with a chaotic time series.

30 State (phase) space is an abstract space used to represent all possible states of a system. In state space, the value of a variable is plotted against all possible values of the other variables at the same time. Conceptually, if one thinks in terms of a bouncing ball, the height of the ball at any time could be represented by a time series, and the state space of the motion of the bouncing ball could be represented by the two dimensions height and velocity.

If the nonlinear approach is taken in this case, and this is indicative of many problems (Gorman, 1996), an NN can be used based on only a few of the temporal samples, time delayed. Then the underlying dynamics of a time series that is generated by a nonlinear dynamic system can be rebuilt by taking the time delays and embedding them in a state space.30 The details of this process are described by Packard et al. (1980) and expanded upon by Tufillaro et al. (1992, Chapter 3). Briefly, assuming that the time series, x(t), is produced by a deterministic dynamical system that can be modeled by some nth-order ordinary differential equation, then the trajectory of the system is uniquely specified at time 0 by its value and its first n−1 derivatives. If the sampling time is evenly spaced, almost all the information about the derivative is contained in the differences of the original series, and almost all the information about the orbit can be recovered from embedded variables of the form $y_{ji} = x_{i-r(j)}$, where j denotes the jth embedded variable, i denotes the ith term of the series, and the time delay, r, is unique to each variable.
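A minimal sketch of this delay embedding (the series and the choice of delays below are arbitrary illustrations, not taken from the paper) is:

import numpy as np

def delay_embed(x, delays):
    # y[j][i] = x[i - r_j]: each embedded variable is the series shifted by its own delay
    r_max = max(delays)
    return np.column_stack([x[r_max - r : len(x) - r] for r in delays])

t = np.arange(500)
x = np.sin(0.07 * t) + 0.1 * np.random.default_rng(0).normal(size=t.size)
states = delay_embed(x, delays=[0, 5, 10])    # points in a reconstructed state space
print(states.shape)                           # (490, 3)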
Of course, since the forms of those nonlinearities are not known a priori, they have to be built up from sigmoids.31 So, it does take tens of parameters to capture the underlying structure, but not tens of tens. In many cases, once the interactive terms are captured successfully, a much more concise description of the underlying process is obtained.
11.4. The bias–variance tradeoff
Another dimension to the issue of using ANMs, rather than linear models, has to do with the bias–variance tradeoff (Geman et al., 1992). The field began when Rumelhart et al. (1986) decided, against the consensus of their peers, to try gradient descent in a multilayered NN and found that it converged.
Initially, when these networks were applied, the primary focus was on very high signal-to-noise problems. This situation is depicted in Fig. 14, which shows model performance of the linear and nonlinear processes as a function of the variance/bias tradeoff.
As illustrated by the solid line in the figure, a linear (high bias) model does not model a nonlinear process very well. This is a consequence of the many assumptions made in a linear model about the underlying structure. In contrast, as the solution tends to the standard (canonical) nonlinear architectures, where fewer and fewer assumptions are made, the ability to capture the nonlinearities in the problem is improved. This was an important result, since relatively simple components can be pieced together to capture very nonlinear behavior.

31 In the context of NNs, the sigmoid (S-shaped) function is a nonlinear activation function of a neuron (Bishop, 1995, p. 82, pp. 232–234).

Fig. 15. Model performance (low signal-to-noise case).
11.5. Financial models
Fig. 15 portrays the complication that occurs when the foregoing technologies are applied to low signal-to-noise situations, such as those that often accompany financial modeling. Now the nonlinear model does not capture the nonlinear process (the solid line) very well. The reason is that, while NNs and architectures of that type have low bias, very few assumptions are made and there is a tendency to overfit. This, coupled with finite sample data, leads to significant problems with the variance. So, depending on the initial conditions and the sample used from the overall population, widely varying solutions can be obtained. In many cases, linear solutions are better in the sense that they capture the underlying structure, at least the first-order structure, much better than the high dimensional low bias models. This follows because of the bias imposed on the solution in the linear technique.
The foregoing anomaly arose because of the enormous change in the underlying characteristics of the problem. Initially, the problem involved improving the classification performance or decision performances from the 80% range up to the 95–98% range. When it came to financial issues, however, the problem became one of achieving one or two percentage points over chance, and it was clear that if this was to be accomplished, the high variance issue had to be addressed. Part of the solution involved domain segmentation, variable selection, the use of aggregates, and so forth. In addition, embedded expert knowledge was used to impose constraints on the solution of these low bias models in order to avoid the problem of overfitting. This is a very heuristic and ad hoc approach but, to date, there is no satisfactory alternative.
12. Model constraints
Turning now to model constraints, Fig. 16 lists types of networks that have been adopted, and their associated technologies, in the order of the extent to which they can be constrained.

Fig. 16. Networks and their associated technologies.
The first set of technologies listed has the lowest bias in the sense that it involves making the smallest number of assumptions, although it is the most sensitive when confronted with finite samples or low signal-to-noise. Further down the list are approaches that constrain the problem more and more, to the point where, if there are very few samples and very noisy data, some heuristic information has to be embedded into the problem to get it to converge properly.
12.1. Neural networks
The NNs in this group include the multilayer perceptron (Wang et al., 1995, p. 39), the finite impulse response (FIR) networks (Hayes, 1996, p. 12) used for capturing nonstationary temporal behavior, and the gated experts (Atiya et al., 1998), which attempt to dynamically determine the boundaries within the population while the models within those boundaries are being optimized.33

Unsupervised and autoassociative networks are also used for doing clustering and compression, and these are typically networks that do not use any kind of output in the determination of an optimal solution. They simply look at correlations within the data to determine groupings.

33 Gated experts have a lot of promise but have been found to be very sensitive to sample size and noise. Essentially, there is a gate that learns to determine which one of these experts, all of whom are training on the same data, is doing the best job. This information is used to begin to cordon off which part of the population each one of these experts focuses on. Thus, the result is the best of both worlds, in the sense that both domain segmentation and modeling are carried out at one and the same time. For certain problems this has been a very powerful approach; for many problems, however, it does not work.
12.2. Nonlinear kernel networks
The next level of technologies typically used are kernel techniques (Duflo, 1997, Chapter 7), which can be employed to impose a number of constraints on the convergence process and vastly reduce the number of parameters that have to be optimized. These include radial-basis functions34 (Wang et al., 1995, p. 42), which are hypersphere-type functions, generalized regression neural networks (GRNNs) (Master, 1995), which involve a three-layer network with one hidden neuron for each training pattern,35 and Gabor networks (Feichtinger and Strohmer, 1997), which involve the simultaneous analysis of signals in time and frequency. Most commonly, there will be some form of the Gaussian kernel (Bishop, 1995, Section 2.5.3) that is centered (positioned to centroid) within the sample space, and the variance associated with that kernel can be either fixed a priori or adapted, depending on the problem. This lends a kind of bias to the problem and allows small sample sizes and noisy situations to be accommodated.

34 These are second-order (nonlinear) basis functions.
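A Gaussian-kernel estimate of this general type can be sketched as a kernel-weighted average of the training targets; this is a Nadaraya-Watson-style sketch of the idea rather than the particular GRNN implementation of the reference above, and the data and bandwidth are arbitrary:

import numpy as np

def gaussian_kernel_predict(x_new, X_train, y_train, sigma=1.0):
    # one Gaussian kernel centered on each training pattern; the prediction
    # is the kernel-weighted average of the corresponding targets
    d2 = ((X_train - x_new) ** 2).sum(axis=1)
    k = np.exp(-d2 / (2.0 * sigma ** 2))
    return (k @ y_train) / k.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                   # training patterns
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=50)   # noisy nonlinear target
print(gaussian_kernel_predict(np.array([0.5, 0.0, 0.0]), X, y, sigma=0.5))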
12.3. Neuro-fuzzy networks
The final set of technologies is neuro-fuzzy networks, which combine the architecture and learning properties of an NN with the representational advantages of a fuzzy system (Wang et al., 1995, p. 92). They include rule-induction technologies, bordering on rule-based technologies, where membership functions can be defined with more or less precision. The sample data is allowed to determine the boundaries and the extent of those membership functions. Again, these technologies are used in the case where there is an enormous amount of noise and small samples.
35 GRNN works by measuring how far a given sample pattern is from the patterns in the training set.
13. Model parameter optimization
Once the particular type of technology that will be used to model the problem is determined, the next step is to determine the parameters of that model. As mentioned previously, since there are no analytical closed-form solutions to many of these problems, an incremental numerical optimization technique must be used.
13.1. Gradient descent — continuous optimization
The workhorse for networks is back-propagation, which simply measures the gradient of the error with respect to each one of the parameters and back-propagates that error through the nonlinearities, so that the weights move toward the minimum for which the value of the error function is smallest. In this context, the error minimization process can be conceptualized (Bishop, 1995, p. 254) by envisioning the error function as an error surface sitting above weight space. Second-order gradient technologies can be used, just as in any other optimization problem.36
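A generic sketch of gradient descent with a momentum term illustrates the update; it is not the back-propagation algorithm itself (which additionally propagates the error gradient backward through the layers), and the error surface and step sizes below are arbitrary:

def gradient_descent(grad, w0, learning_rate=0.1, momentum=0.9, steps=200):
    # move each parameter against the gradient of the error; the momentum term
    # reuses the previous update, a cheap surrogate for second-order information
    w = list(w0)
    velocity = [0.0] * len(w)
    for _ in range(steps):
        g = grad(w)
        velocity = [momentum * v - learning_rate * gi for v, gi in zip(velocity, g)]
        w = [wi + vi for wi, vi in zip(w, velocity)]
    return w

# example: error E(w) = (w1 - 1)^2 + 10*(w2 + 2)^2, with the gradient supplied analytically
grad_E = lambda w: [2 * (w[0] - 1), 20 * (w[1] + 2)]
print(gradient_descent(grad_E, [0.0, 0.0]))    # approaches (1, -2)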
The gated expert, mentioned above, employs an expectation-maximization (EM) technique (Couvreur, 1997), which uses a maximum likelihood approach to optimize two aspects of the problem at once: the assignment of experts to domains is determined at the same time the internal parameters associated with each expert's network are being optimized, and the EM technique works well in that context.
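The following is a stripped-down sketch of the EM idea for a mixture of experts; for simplicity it uses linear experts and an input-independent gate, whereas the gated-experts architecture described in the text uses an input-dependent gate and nonlinear experts, so this should be read only as an illustration of the E-step/M-step alternation.

```python
import numpy as np

def em_mixture_of_experts(X, y, n_experts=2, n_iter=50, var=1.0):
    """Stripped-down EM for a mixture of linear experts with an
    input-independent gate: the E-step assigns responsibilities
    (which expert explains which observation), the M-step refits each
    expert by responsibility-weighted least squares."""
    n, d = X.shape
    Xb = np.hstack([X, np.ones((n, 1))])                  # add intercept
    rng = np.random.default_rng(0)
    betas = rng.normal(size=(n_experts, d + 1))
    mix = np.full(n_experts, 1.0 / n_experts)
    for _ in range(n_iter):
        # E-step: responsibility of each expert for each observation
        resid = y[:, None] - Xb @ betas.T                 # shape (n, n_experts)
        lik = mix * np.exp(-0.5 * resid ** 2 / var)
        resp = lik / (lik.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: responsibility-weighted least squares for each expert
        for k in range(n_experts):
            W = np.diag(resp[:, k])
            betas[k] = np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y)
        mix = resp.mean(axis=0)
    return betas, mix

# illustrative use: two regimes with different slopes
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.where(X[:, 0] < 0, 2 * X[:, 0], -X[:, 0]) + 0.05 * rng.normal(size=200)
betas, mix = em_mixture_of_experts(X, y)
print(betas.round(2), mix.round(2))
```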
13.2. Genetic evolution — discontinuous optimization
Another class of optimization technologies, based on GAs, is well suited to error surfaces that are very convoluted. As described above, what is typically done in a GA is to set up components of a model that compete on the basis of fitness and cooperate in accordance with genetic operations. A recombination technique is used to generate new generations of the components that perform well, while the ones that do not perform well fall by the wayside. The power here is that the optimization process starts from multiple points on the error surface and moves down the gradients; many of the components that get stuck in suboptimal local minima are eliminated, while the ones that achieve more global performance persist. The approach therefore has considerable power when dealing with very noisy error surfaces.
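The sketch below (the fitness function, population size, mutation rate, and all names are illustrative choices) shows the basic GA loop just described: a population of candidate parameter vectors starts from multiple points on the error surface, the fitter members are recombined, and mutation keeps the search from collapsing into a single local minimum.

```python
import numpy as np

def ga_minimise(error_fn, dim, pop_size=40, n_gen=100, mut_sd=0.1, seed=0):
    """Basic genetic algorithm: fitness-based selection, recombination
    (uniform crossover) and mutation over a population of candidate
    parameter vectors, starting from many points on the error surface."""
    rng = np.random.default_rng(seed)
    pop = rng.normal(size=(pop_size, dim))
    for _ in range(n_gen):
        fitness = -np.array([error_fn(p) for p in pop])   # lower error = fitter
        order = np.argsort(fitness)[::-1]
        parents = pop[order[: pop_size // 2]]             # survivors
        children = []
        for _ in range(pop_size - len(parents)):
            pa, pb = parents[rng.integers(len(parents), size=2)]
            mask = rng.random(dim) < 0.5                  # uniform crossover
            child = np.where(mask, pa, pb) + rng.normal(scale=mut_sd, size=dim)
            children.append(child)
        pop = np.vstack([parents, np.array(children)])
    return pop[np.argmin([error_fn(p) for p in pop])]

# illustrative use: a convoluted error surface with many local minima
rastrigin = lambda p: 10 * len(p) + np.sum(p ** 2 - 10 * np.cos(2 * np.pi * p))
print(ga_minimise(rastrigin, dim=3).round(2))
```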
13.3. Unsupervised clustering
As discussed previously, some of the technologies used for clustering and compressing multidimensional data into a lower-dimensional space use unsupervised learning, which attempts to cluster the underlying data adaptively. Fruitful methodologies include Hebbian-covariance networks (Domany et al., 1996, p. 61), which are based on the proposition that synaptic change depends on the covariance of presynaptic and postsynaptic activity, and Kohonen's self-organizing map (Kohonen, 1988), whereby each input pattern leads to a single, localized cluster of activity.
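As a concrete, simplified illustration of the self-organizing map idea (a small one-dimensional map, a Gaussian neighbourhood, and a fixed decay schedule are assumptions made here), the sketch below shows how each input pattern activates a best-matching unit and pulls that unit and its neighbours toward the pattern, so that similar inputs end up localized on the map.

```python
import numpy as np

def train_som(data, n_units=10, n_epochs=20, lr0=0.5, radius0=3.0, seed=0):
    """One-dimensional Kohonen self-organizing map: for each input pattern,
    find the best-matching unit and move it (and its neighbours, weighted by
    a shrinking Gaussian neighbourhood) toward the pattern."""
    rng = np.random.default_rng(seed)
    weights = rng.normal(size=(n_units, data.shape[1]))
    positions = np.arange(n_units)
    for epoch in range(n_epochs):
        lr = lr0 * (1 - epoch / n_epochs)
        radius = max(radius0 * (1 - epoch / n_epochs), 0.5)
        for x in rng.permutation(data):
            bmu = np.argmin(np.sum((weights - x) ** 2, axis=1))  # winning unit
            h = np.exp(-((positions - bmu) ** 2) / (2 * radius ** 2))
            weights += lr * h[:, None] * (x - weights)           # pull neighbours
    return weights

# illustrative use: two well-separated clusters map onto distinct regions
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(-2, 0.3, (50, 2)), rng.normal(2, 0.3, (50, 2))])
print(train_som(data).round(1))
```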
14. Benchmarking and model validation
The final step in the development of the model is benchmarking and model validation. The objective here is to gauge the generality of the model and to demonstrate the effectiveness of the model design. Typically, the initial benchmarks are simple models, often linear models, and more sophisticated approaches are then used to determine whether there is value added. The criterion is the complexity-performance tradeoff, and the process stops when the point is reached where there appears to be no further improvement, the point of diminishing returns.37

37 The testing and validation portions of an NN exemplify this.
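A minimal version of the benchmarking step might look like the following sketch, in which a linear baseline is compared with a GRNN-style nonlinear model on held-out data; the single train/validation split, the models, and the synthetic data are all simplifying assumptions, and the point is only that added complexity is accepted when, and only when, it buys out-of-sample performance.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 2))
y = np.sin(3 * X[:, 0]) * X[:, 1] + 0.1 * rng.normal(size=300)  # interaction term

# single train/validation split (k-fold validation would be more robust)
X_tr, X_va, y_tr, y_va = X[:200], X[200:], y[:200], y[200:]

# benchmark 1: linear model (the usual starting point)
Xb = np.hstack([X_tr, np.ones((len(X_tr), 1))])
beta, *_ = np.linalg.lstsq(Xb, y_tr, rcond=None)
lin_pred = np.hstack([X_va, np.ones((len(X_va), 1))]) @ beta

# benchmark 2: nonlinear kernel (GRNN-style) model
def kernel_pred(X_train, y_train, X_new, sigma=0.3):
    out = []
    for x in X_new:
        w = np.exp(-np.sum((X_train - x) ** 2, axis=1) / (2 * sigma ** 2))
        out.append(np.dot(w, y_train) / (w.sum() + 1e-12))
    return np.array(out)

nl_pred = kernel_pred(X_tr, y_tr, X_va)

mse = lambda p: float(np.mean((p - y_va) ** 2))
print("linear validation MSE   :", round(mse(lin_pred), 4))
print("nonlinear validation MSE:", round(mse(nl_pred), 4))
# accept the extra complexity only if the validation error drops materially
```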
15. Closing comment
The traditional approach of capturing first-order behavior has represented a very important step toward capturing complex behavior. Now, the use of ANMs that can go into the data without any kind of assumption and begin to quantify interactions represents the next phase of picking up the higher-order terms.
Of course, since analysts approach the problem without any kind of theoretical framework, the sample data are regarded as representative of the population, and that brings with it a lot of dangers and problems. The strategy for dealing with this is to regard each problem as unique and to approach it very methodically until what appears to be the right kind of technology, or set of technologies, for the particular problem has been found.
Acknowledgements
The impetus for this paper was the presentation given by Gorman (1996) at the Actuarial and Financial Modeling Conference. Many of the ideas that appear here were discussed during the conference. This work was supported in part by the Robert G. Schwartz Faculty Fellowship at Penn State University and a grant of the Committee on Knowledge Extension and Research (CKER) of the Society of Actuaries.
References
Aarts, E.H.L., Van Laarhoven, P.J.M., 1987. Simulated Annealing: Theory and Applications. Reidel, Dordrecht.
Atiya, A., Shaheen, R., Shaheen, S., 1998. A practical gated expert network. Working paper presented at IJCNN’98 in May 1998, Anchorage, Alaska.
Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Clarendon Press, Oxford.
Breiman, L., Friedman, J., Olshen, R., Stone, C.J., 1984. Classification and Regression Trees. Chapman & Hall, New York.
Brockett, P.L., Cooper, W.W., Golden, L.L., Pitaktong, U., 1994. A neural network method for obtaining an early warning of insurer insolvency. Journal of Risk and Insurance 61 (3), 402.
Brockett, P.L., Xia, X., Derrig, R.A., 1998. Using Kohonen's self-organizing feature map to uncover automobile bodily injury claims fraud. Journal of Risk and Insurance 65 (2), 245.
Chen, S., Cowan, C.F.N., Grant, P.M., 1991. Orthogonal least squares learning for radial basis function networks. IEEE Transactions on Neural Networks 2 (2), 302–309.
Couvreur, C., 1997. The EM algorithm: a guided tour. In: Warwick, K., Karny, M. (Eds.), Computer-Intensive Methods in Control and Signal Processing: The Curse of Dimensionality. Birkhauser, Boston, MA, pp. 209–222 (Chapter 12).
Cummins, J.D., Derrig, R.A., 1997. Fuzzy financial pricing of property-liability insurance. North American Actuarial Journal 1 (4), 21–44.
DeWit, G.W., 1982. Underwriting and uncertainty. Insurance: Mathematics and Economics 1, 277–285.
Domany, E., van Hemmen, J.L., Schulten, K. (Eds.), 1996. Models of Neural Networks III. Springer, New York.
Dubois, D., Prade, H., 1980. Fuzzy Sets and Systems: Theory and Applications. Academic Press, San Diego.
Duflo, M., 1997. Random Iterative Models. Springer, New York.
Feichtinger, H.G., Strohmer, T. (Eds.), 1997. Gabor Analysis and Algorithms: Theory and Applications. Birkhauser, Boston, MA.
Geman, S., Bienenstock, E., Doursat, R., 1992. Neural networks and the bias/variance dilemma. Neural Computation 4 (1), 1–58.
Geronimus, A.T., Bound, J., Neidert, L.J., 1996. On the validity of using census geocode characteristics to proxy individual socioeconomic characteristics. Journal of the American Statistical Association 91 (434), 529.
Gorman, R.P., 1991. Neural networks and the classification of complex sonar signals. In: Proceedings of the IEEE Conference on Neural Networks for Ocean Engineering, pp. 283–290.
Gorman, R.P., 1996. Current modeling approaches: a case study. In: Actuarial and Financial Modeling Conference, 16–17 December. Georgia State University.
Gorman, R.P., Sejnowski, T.J., 1988a. Learned classification of sonar targets using a massively parallel network. IEEE Transactions on Acoustics, Speech and Signal Processing 36, 1135–1140.
Gorman, R.P., Sejnowski, T.J., 1988b. Analysis of hidden units in a layered network trained to classify sonar targets. Neural Networks 1, 75–89.
Hayes, M.H., 1996. Statistical Digital Signal Processing and Modeling. Wiley, New York.
Holland, J.H., 1975. Adaptation in Natural and Artificial Systems. University Michigan Press, Ann Arbor, MI.
Jang, J., 1997. Comparative analysis of statistical methods and neural network for predicting life insurers insolvency (bankruptcy). Ph.D. Dissertation, University of Texas at Austin.
Jewell, W.S., 1980. Models in insurance: paradigms, puzzles, communications, and revolutions. In: Transactions of the 21st International Congress of Actuaries, Suppl. Vol., pp. S87–S141.
Kelly, F., Barnes, J., Aiken, M., 1995. Artificial neural networks: a new methodology for industrial market segmentation. Industrial Marketing Management 24, 5.
Klir, G.J., Yuan, B., 1996. Fuzzy Sets, Fuzzy Logic, and Fuzzy Systems: Selected Papers by Lotfi A. Zadeh. World Scientific, New Jersey.
Kohonen, T., 1988. Self-Organization and Associative Memory, 2nd Edition. Springer, New York.
Lemaire, J., 1990. Fuzzy insurance. ASTIN Bulletin 20 (1), 33–55.
Master, T., 1995. The general regression neural network. NeuroVe$t Journal 3 (5), 13–17.
Nikias, C.L., Petropulu, A.P., 1993. Higher-order Spectra Analysis. PTR Prentice-Hall, Englewood Cliffs, NJ.
Ostaszewski, K., 1993. Fuzzy set methods in actuarial science. Society of Actuaries, Schaumburg, IL.
Packard, N.H., Crutchfield, J.P., Farmer, J.D., Shaw, R.S., 1980. Geometry from a time series. Physical Review Letters 45, 712.
Quinlan, J.R., 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, Los Altos, CA.
Shapiro, A.F., 2000. A Hitchhiker’s Guide to the techniques of adaptive nonlinear models. Insurance: Mathematics and Economics 26, 119–132.
Shepherd, A.J., 1997. Second-order Method for Neural Networks. Springer, Berlin.
SPSS, 1993. SPSS PC Chaid Version 5.0. Prentice-Hall, Englewood Cliffs, NJ.
Tan, R., 1997. Seeking the profitability-risk-competitiveness frontier using a genetic algorithm. Journal of Actuarial Practice 5 (1), 49.
Tu, J.V., 1993. A comparison of neural network and logistic regression models for predicting length of stay in the intensive care unit following cardiac surgery. Master Thesis. University of Toronto.
Tufillaro, N.B., Reilly, J., Abbott, T., 1992. An Experimental Approach to Nonlinear Dynamics and Chaos. Addison-Wesley, Reading, MA.
Von Altrock, C., 1997. Fuzzy Logic and NeuroFuzzy Applications in Business and Finance. Prentice-Hall, Englewood Cliffs, NJ.
Wang, H., Liu, G.P., Harris, C.J., Brown, M., 1995. Advanced Adaptive Control. Elsevier, Tarrytown, NY.
Wendt, R.Q., 1995. Build your own GA efficient frontier. Risks and Rewards 24, 1.
Yager, R.R., Ovchinnikov, S., Tong, R.M., Ngugen, H.T., 1987. Fuzzy Sets and Applications: Collected Papers of Lotfi A. Zadeh. Wiley, New York.
Young, V.R., 1993. The application of fuzzy sets to group health underwriting. Transactions of the Society of Actuaries 45, 551–590.
Young, V.R., 1996. Insurance rate changing: a fuzzy logic approach. Journal of Risk and Insurance 63, 461–483.
Zadeh, L.A., 1975a. The concept of a linguistic variable and its application to approximate reasoning, Part I. Information Sciences 8, 199–249.
Zadeh, L.A., 1975b. The concept of a linguistic variable and its application to approximate reasoning, Part II. Information Sciences 8, 301–357.
Zadeh, L.A., 1981. Fuzzy systems theory: a framework for the analysis of humanistic systems. In: Cavallo, R.E. (Ed.), Recent Developments in Systems Methodology in Social Science Research. Kluwer Academic Press, Boston, MA, pp. 25–41.
Zadeh, L.A., 1992. Foreword of the Proceedings of the Second International Conference on Fuzzy Logic and Neural Networks, Iizuka, Japan, pp. xiii–xiv.
Zadeh, L.A., 1994. The role of fuzzy logic in modeling, identification and control. Modeling Identification and Control 15 (3), 191.