

AURA: Alfred University Research & Archives, https://aura.alfred.edu/
College of Liberal Arts and Sciences Faculty Scholarship
2022-09

Psychology in Neural Networks – In Honor of Professor Tracy Mott
Bedi, Harpreet Singh

This is the Accepted Manuscript of the following article: Harpreet Singh Bedi (2022), "Psychology in Neural Networks – In Honor of Professor Tracy Mott", Review of Behavioral Economics: Vol. 9, No. 3, pp. 251-262, now publishers, which has been published in final form at http://dx.doi.org/10.1561/105.00000158. This article is protected by copyright. Terms of use: https://libraries.alfred.edu/AURA/termsofuse

Downloaded from AURA: Alfred University Research & Archives


Psychology in Neural Networks

− in honor of Professor Tracy Mott

Harpreet Singh Bedi [email protected]

Abstract

This paper introduces psychology into neural networks by building a correspondence between the theory of behavioral economics and the theory of artificial neural networks. The connection between these two disparate branches of knowledge is concretely constructed by designing a dictionary between prospect theory and artificial neural networks. More specifically, the activation functions in neural networks can be converted to probability weighting functions in prospect theory and vice versa. This approach leads to infinitely many activation functions and allows for their psychological interpretation in terms of risk seeking and risk averse behavior.

1 Introduction

The original contribution of this paper is to re-interpret activation functions from the point of view of prospect theory. This new approach gives a psychological understanding of the learning process behind neural networks. Furthermore, it allows interpretation of the probability weighting function of prospect theory as an activation function and vice versa. The paper also gives a recipe to construct infinitely many probability weighting functions which can then be used as activation functions in neural networks. Finally, some applications to behavioral finance are discussed.

1.1 Structure of the Paper

A summary of prospect theory and probability weighting functions is given in section 2. The probability weighting function is further studied in section 3. The linear transformation in section 3.2 converts a probability weighting function to an activation function, and the opposite direction is given in section 3.3. Finally, the theory described in the paper is used to construct infinitely many activation functions in section 3.4.


2 Prospect Theory

The story begins with the expected value of a random variable. Let $X$ be a random variable with a finite number of outcomes $\{x_1, x_2, \ldots, x_k\}$ and associated probabilities $\{p_1, p_2, \ldots, p_k\}$; then the expected value is given as

(2.1) Expected Value: $\sum_{i=1}^{k} x_i\, p_i$.

The expected value was reformulated by von Neumann and Morgenstern as expected utility theory. In this theory the finite outcomes are transformed via a utility function, for example $u(w) = \log w$:

(2.2) Expected Utility: $\sum_{i=1}^{k} u(x_i)\, p_i$.

The expected utility theory was further transformed by Daniel Kahneman and Amos Tversky in the seminal paper [TK79]. In this theory the probability itself is modified using a probability weighting function $\pi(\cdot)$, giving the value as

(2.3) Prospect Theory: $\sum_{i=1}^{k} u(x_i)\, \pi(p_i)$.

The first probability weighting function in the literature was given in [TK92]:

(2.4) $\pi(p) = \dfrac{p^{\gamma}}{\left(p^{\gamma} + (1-p)^{\gamma}\right)^{1/\gamma}}$ where $p \in [0,1]$,

and $\gamma$ was empirically determined to be $0.61$. The idea to modify probabilities comes from the observation that people prefer certainty over merely probable outcomes. Moreover, people like to play lotteries by overestimating their chances of winning. This is demonstrated in the field experiments given in tables 1 and 2.

Further expositions of expected utility and prospect theory can be seen in [Pet17] and [Wak10].
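As a small numerical sketch of how (2.1)-(2.3) differ, the following computes the three valuations for a hypothetical two-outcome gamble (the outcomes, probabilities, and the log utility are illustrative; only the weighting function (2.4) with $\gamma = 0.61$ comes from [TK92]):

```python
# Illustrative comparison of (2.1)-(2.3) for a hypothetical gamble.
import math

outcomes = [100.0, 10.0]           # x_i (hypothetical prizes)
probs = [0.1, 0.9]                 # p_i

def u(w: float) -> float:
    """Example utility u(w) = log w."""
    return math.log(w)

def pi(p: float, gamma: float = 0.61) -> float:
    """Probability weighting function, equation (2.4)."""
    return p**gamma / (p**gamma + (1 - p)**gamma) ** (1 / gamma)

expected_value = sum(x * p for x, p in zip(outcomes, probs))            # (2.1)
expected_utility = sum(u(x) * p for x, p in zip(outcomes, probs))       # (2.2)
prospect_value = sum(u(x) * pi(p) for x, p in zip(outcomes, probs))     # (2.3)
print(expected_value, expected_utility, prospect_value)
```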

Table 1: Problem 1 of [TK79, p. 265], choice for certainty

              Choice A                    Choice B
  Lottery Value   Probability    Lottery Value   Probability
  2500            0.33           2400            1
  2400            0.66
  0               0.01

In table 1, 82% of the respondents chose B even though the expected value of choice B (2400) is lower than the expected value of choice A ($2500 \times 0.33 + 2400 \times 0.66 + 0 \times 0.01 = 2409$).

Table 2: Problem 8 of [TK79, p. 267], choice for lottery

                  Choice A   Choice B
  Lottery Value   6000       3000
  Probability     0.001      0.002

In table 2, 73% of respondents chose A, even though the expected values of the two options are the same ($6000 \times 0.001 = 3000 \times 0.002 = 6$). According to [TK79] this showed that people overweight lower probabilities when the monetary reward is higher. This behavior is similar to that of gamblers, who are ready to purchase a lottery ticket when the monetary reward is higher.

(2.0.1) The underweighting and overweighting of probabilities can be seen visually in figure 1. Risk neutral behavior is demonstrated by the dashed red line. For example, the function $\pi(x) = x^{\alpha}$ for $\alpha \in (1, \infty)$ and $x \in [0,1]$ captures risk averse behavior, as the person will underweight the probabilities (right panel of figure 1). Dually, the function $\pi(x) = x^{\alpha}$ for $\alpha \in (0,1)$ and $x \in [0,1]$ will overweight the probabilities, making the person risk seeking (left panel of figure 1). The risk neutral person will have no weighting of probability, that is, $\pi(x) = x$.

Figure 1: Optimist and a Pessimist (probability weighting π(x) plotted against x).

Therefore, the probability weighting function has an inverse S shape (left panel of figure 2). It should be concave down in the right neighborhood of zero to capture lottery-playing behavior, and it should be concave up in the left neighborhood of one to capture the choice of certainty over probability.

Figure 2: Inverse S shape versus S shape (π(x) plotted against x).

3 Probability Weighting Functions

The probability weighting function $\pi(x)$ has the following important properties [Bed06, pp. 7-8].

1. Rationality: It is strictly monotonically increasing. This is because rational subjects will always choose a higher probability over a lower probability for the same prize.

2. Domain and Range: For every $x \in (0,1)$ the output of the function also lies in $(0,1)$, that is, $\pi(x) \in (0,1)$.

3. Continuity: The function is continuous.

The condition of continuity above is not strictly required; jump discontinuities can be introduced into the function. In this paper we will stick to continuous functions.

3.1 Examples of Probability Weighting Functions

The following functions have appeared in the literature.

1. The first function in the literature was given in [TK92]:

(3.1) $\pi(x) = \dfrac{x^{\gamma}}{\left(x^{\gamma} + (1-x)^{\gamma}\right)^{1/\gamma}}$, where $\gamma \in [0,1]$.

2. In [Pre98] there is a two-parameter function:

(3.2) $\pi(x) = \exp\left(-b(-\log x)^{g}\right)$, $b > 0$, $g \in (0,1)$.

3. In [RW06] there is a one-parameter function:

(3.3) $\pi(x) = (3-2g)\,x - (6-6g)\,x^{2} + (4-4g)\,x^{3}$, $g \in (0,1)$.

The parameters $\gamma$, $b$, $g$ are optimally determined to fit the data generated by the experiments.
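A minimal sketch of the three weighting functions above, with illustrative parameter values ($\gamma = 0.61$ is the value reported in [TK92]; the values of $b$ and $g$ below are placeholders, not fitted values from the cited papers):

```python
import math

def tk92(x: float, gamma: float = 0.61) -> float:
    """Tversky-Kahneman weighting function, equation (3.1)."""
    return x**gamma / (x**gamma + (1 - x)**gamma) ** (1 / gamma)

def prelec(x: float, b: float = 1.0, g: float = 0.65) -> float:
    """Prelec two-parameter weighting function, equation (3.2)."""
    return math.exp(-b * (-math.log(x)) ** g)

def rieger_wang(x: float, g: float = 0.5) -> float:
    """Rieger-Wang one-parameter weighting function, equation (3.3)."""
    return (3 - 2 * g) * x - (6 - 6 * g) * x**2 + (4 - 4 * g) * x**3

for p in (0.01, 0.1, 0.5, 0.9, 0.99):
    print(p, round(tk92(p), 4), round(prelec(p), 4), round(rieger_wang(p), 4))
```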

(3.1.1) Probability weighting as an S shape. The probability weighting function is based upon comparison with boundary points, that is, comparing uncertainty with certainty and justifying the lottery-playing behavior. This does not capture the example in table 3, in which more people chose B over A although the expected values of the two options are the same. This example suggests that higher probabilities are overweighted relative to lower probabilities, and thus the curve should be S shaped.

Table 3: Problem 7 of [TK79, p. 267]

                  Choice A   Choice B
  Lottery Value   6000       3000
  Probability     0.45       0.90
  % Vote for      14%        86%

(3.1.2) It has been shown in [Bed06] and [HHV05] (published as [HHV10]) that the S shaped curve fits better than the inverse S shaped curve, especially when the data set consists of examples like the one given in table 3. The S shaped curve becomes even more pronounced when the option of certainty is not given. Additionally, researchers do not have millions of dollars to replicate the lottery behavior. While working on [Bed06] I interviewed people who bought lottery tickets. In my sample most of the people came from financially poor backgrounds, and for them buying a lottery ticket was akin to buying a ticket for an action movie. In other words, the lottery ticket gave them the means to imagine a richer life if they won, and they could dream until the results were declared. The subjects were well aware of the fact that the probabilities were low.

(3.1.3) A much more flexible probability weighting function is suggested in [Bed06], which incorporates both S and inverse S shaped functions. This function has four parameters $a, b, c, d$ which are used to rearrange the shape of the function. These parameters are placed on the risk neutral line $\pi(x) = x$, so the function passes through the points $(a,a)$, $(b,b)$, $(c,c)$, $(d,d)$. A fifth parameter $k$ controls the curvature of the function.

(3.4) $\pi(x) = \begin{cases} a^{1-1/k}\, x^{1/k} & \text{when } x \in [0, a],\\ a + (b-a)^{1-k}\,(x-a)^{k} & \text{when } x \in [a, b],\\ b + (c-b)^{1-1/k}\,(x-b)^{1/k} & \text{when } x \in [b, c],\\ c + (d-c)^{1-k}\,(x-c)^{k} & \text{when } x \in [c, d], \end{cases}$

such that $k \ge 1$ and $0 \le a \le b \le c \le d \le 1$.


The above function is represented in figure 3 for $k = 3$, $a = 1/10$, $b = 1/2$, $c = 9/10$ and $d = 1$. Setting $a = 0$ and $c = d = 1$ will give an S shaped function, and setting $b = c = 1$ gives an inverse S shaped function. The reader can play with the parameters of the function at https://www.desmos.com/calculator/wulqnvvjsl. Rotation of polynomials also gives multiple such functions; all of these functions are detailed in [Bed06].

Figure 3: A more flexible function that incorporates both S and inverse S shapes (π(x) against x, with breakpoints a, b, c, d on the line π(x) = x).
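The piecewise function (3.4) can be sketched in code as follows, assuming (as the boundary conditions suggest) that the exponents $1 - 1/k$ and $1 - k$ apply to the interval-length coefficients, so that the curve passes through $(a,a)$, $(b,b)$, $(c,c)$ and $(d,d)$:

```python
def flexible_weight(x: float, a: float, b: float, c: float, d: float, k: float) -> float:
    """Five-parameter weighting function (3.4); requires k >= 1 and 0 <= a <= b <= c <= d <= 1."""
    if x <= a:
        return a ** (1 - 1 / k) * x ** (1 / k) if a > 0 else x
    if x <= b:
        return a + (b - a) ** (1 - k) * (x - a) ** k
    if x <= c:
        return b + (c - b) ** (1 - 1 / k) * (x - b) ** (1 / k)
    return c + (d - c) ** (1 - k) * (x - c) ** k

# Parameter values used for figure 3 in the paper:
for x in (0.05, 0.1, 0.3, 0.5, 0.7, 0.9, 0.95, 1.0):
    print(x, round(flexible_weight(x, a=0.1, b=0.5, c=0.9, d=1.0, k=3), 4))
```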

(3.1.4) Activation Functions. The heart of a neural network is the activation function, and it plays a primary role in the training of a deep neural network. The training can be philosophically thought of as a process that maps real-world data to the set $\{0, 1\}$ (the finite field of order 2, that is, $\mathbb{Z}/2\mathbb{Z}$). Here 1 represents that a neuron will fire and 0 represents that a neuron will not fire.

A straightforward description and a brief history of neurons is given in [And95, Chapter 2] or [PP92]. The firing of these neurons can also be described via AND, NOT and OR gates [Arb03, page 8, Figure 3]. A more realistic class of models, called integrators, can also be used to describe the firing of neurons: the neuron fires when the stimulus is above a certain threshold [And95, p. 52].

More precisely, the learning process involves repeatedly updating the network by backpropagation through multiple layers of neurons. As in regression, the purpose is to minimize the total error, but in neural networks the minimization is achieved by the method of gradient descent. This method can basically be described as working with the derivative of the activation function along with the error terms in order to decrease the error as much as possible [And95, Chapter 9].
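A minimal sketch of one such gradient-descent step for a single sigmoid neuron with squared error (the weights, learning rate, and training example are illustrative), showing where the derivative of the activation function enters the update:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z: float) -> float:
    s = sigmoid(z)
    return s * (1.0 - s)

w, b, lr = 0.5, 0.0, 0.1          # weight, bias, learning rate
x, target = 2.0, 1.0              # one training example

z = w * x + b
y = sigmoid(z)                    # neuron output
error = y - target                # derivative of 0.5 * (y - target)**2 with respect to y
w -= lr * error * sigmoid_prime(z) * x   # chain rule through the activation function
b -= lr * error * sigmoid_prime(z)
print(round(w, 4), round(b, 4), round(y, 4))
```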

A fundamental activation function is the S shaped sigmoid function, also called the sigmoid curve. Deep learning using the sigmoid function, and minimization of error through the associated gradient descent, is given explicitly via an example in [Arb03, pages 21-22]. The paper [HM95] studies a particular class of sigmoid functions and assesses the speed of backpropagation learning.

We now want to concretely relate the S shaped sigmoid function to an S shaped probability weighting function. In fact we will show that S shaped sigmoid functions and S shaped probability weighting functions are avatars of each other.

The big picture is very clear: take a sigmoid function, contract its domain and range to $[0,1]$, and it becomes an S shaped probability weighting function. On the other hand, the S shaped probability weighting function is already a sigmoid function, but we need to expand its domain so that it can accept inputs from any interval. This can be done via a simple linear transformation. In the next section we choose the logistic function as an example of a sigmoid function to show the computations explicitly.

(3.1.5) Logistic Function to Probability Weighting Function. The logistic function [Gal07, p. 39] used in the neural network also has an S shape, as shown in figure 4.

Figure 4: Logistic function as an S shape and its transformation via $\ell(x)$ in (3.10).

This function can now be interpreted through the lens of prospect theory. It takes the data and then 'learns' by underweighting less probable outcomes and overweighting the more probable outcomes, thus classifying the data. The logistic function is given as

(3.5) $f(x) = \dfrac{1}{1 + \exp(-x)}$, $f : \mathbb{R} \to [0,1]$.

More precisely, let the input set of values lie in the interval $[a_1, a_2] \subset \mathbb{R}$, that is, $f(x) : [a_1, a_2] \to [0,1]$. This can be transformed to a probability weighting function $\pi(x) : [0,1] \to [0,1]$ by shrinking the function so as to preserve concavity and convexity in certain regions. The techniques for shrinking are given in section 3.3, and example 3.1 is used to shrink the logistic function. In the other direction, the probability weighting function can be reorganized as an activation function in the neural network; see section 3.2.

3.2 Linear Transformation

In order to use the probability weighting function as an activation function in the neural network, we first need to map the input data bijectively to the domain $[0,1]$ of the probability weighting function.

Let the input values of the activation function lie in the interval $[a_1, a_2]$; these values can be bijectively transformed to the interval $[0,1]$ by the mapping $g : [a_1, a_2] \to [0,1]$ given as

(3.6) $g(x) = \dfrac{x - a_1}{a_2 - a_1}$.

The probability weighting function can now be applied to the transformation above to get the activation function $\pi \circ g$. Notice that $g$ is a linear bijective function and thus preserves the impact of the input values faithfully.

In order to avoid the lowest and the highest values getting mapped to 0 and 1 respectively, consider the more general function $L : [a_1, a_2] \to [b_1, b_2]$ given as

(3.7) $L(x) = \dfrac{b_2 - b_1}{a_2 - a_1}\,(x - a_1) + b_1$,

with $[b_1, b_2] \subseteq [0,1]$. The value of $b_1$ can be set close to zero (for example, $0.0005$) and the value of $b_2$ can be set close to 1 (for example, $0.99995$). Therefore, the function $\pi \circ L(x) : [a_1, a_2] \to [0,1]$ can be used precisely as the logistic activation function is used. The values 0 and 1 should be philosophically avoided since nothing is absolutely certain.
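A minimal sketch of the construction $\pi \circ L$ (the weighting function (3.1) and the input interval $[-6, 6]$ are illustrative choices):

```python
def affine_to_unit(x: float, a1: float, a2: float,
                   b1: float = 0.0005, b2: float = 0.99995) -> float:
    """L : [a1, a2] -> [b1, b2] from equation (3.7)."""
    return (b2 - b1) / (a2 - a1) * (x - a1) + b1

def tk92(p: float, gamma: float = 0.61) -> float:
    """Weighting function (3.1), used here as the pi in pi o L."""
    return p**gamma / (p**gamma + (1 - p)**gamma) ** (1 / gamma)

def activation(x: float, a1: float = -6.0, a2: float = 6.0) -> float:
    """pi o L: an activation function built from a probability weighting function."""
    return tk92(affine_to_unit(x, a1, a2))

for x in (-6.0, -3.0, 0.0, 3.0, 6.0):
    print(x, round(activation(x), 4))
```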

3.3 Transforming Functions

In this section a recipe is given for cutting and shrinking a function while preserving its concavity and convexity. These pieces can then be glued together to obtain infinitely many activation or probability weighting functions. In example 3.2 the reader can choose the concave down part of $\log x$ and glue it with the concave up part of $x^{1/3}$, $x \le 0$, to get a probability weighting function.

Consider a function $f(x)$ with two coordinates $(a, f(a))$ and $(b, f(b))$; then the piece of this function can be fit (compressed or expanded) into the unit interval $[0,1]$. First, transfer the point $(a, f(a))$ to $(0,0)$ by the following transformation:

(3.8) $g(x) = f(x + a) - f(a)$, implying $g(0) = 0$.

The above transformation sends the point $(b, f(b))$ to $(b-a,\, g(b-a))$, which can be forced to pass through $(1,1)$ by considering the function

(3.9) $h(x) = \dfrac{g((b-a)\,x)}{g(b-a)}$, implying $h(0) = 0$, $h(1) = 1$.

The above function can now be transformed into any interval $[u_1, u_2]$ via the following function:

(3.10) $\ell(x) = u_1 + (u_2 - u_1)\cdot h\!\left(\dfrac{x - u_1}{u_2 - u_1}\right)$, implying $\ell(u_1) = u_1$, $\ell(u_2) = u_2$.

3.1 Example. The logistic function $f(x)$ in (3.5) is sketched from $(-6, f(-6))$ to $(6, f(6))$ in the left panel of figure 4. It is then compressed to the domain $[0,1]$ in the right panel.

3.2 Example. The functions $\log x$ and $x^{1/3}$ can be glued together to get an S shaped probability weighting function using the techniques of this section. This is demonstrated and implemented at https://www.desmos.com/calculator/x6fk1o0ng1.
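A minimal sketch of the recipe (3.8)-(3.10), applied to example 3.1 (the interval $[-6, 6]$ follows that example; the rest is a direct transcription of the formulas above):

```python
import math
from typing import Callable

def unit_piece(f: Callable[[float], float], a: float, b: float) -> Callable[[float], float]:
    """h(x) of (3.9): the piece of f on [a, b] rescaled so that h(0) = 0 and h(1) = 1."""
    g = lambda x: f(x + a) - f(a)                 # (3.8): move (a, f(a)) to (0, 0)
    return lambda x: g((b - a) * x) / g(b - a)    # (3.9): force the curve through (1, 1)

def on_interval(h: Callable[[float], float], u1: float, u2: float) -> Callable[[float], float]:
    """l(x) of (3.10): h carried onto [u1, u2] with l(u1) = u1 and l(u2) = u2."""
    return lambda x: u1 + (u2 - u1) * h((x - u1) / (u2 - u1))

# Example 3.1: compress the logistic function between x = -6 and x = 6 into the unit square.
logistic = lambda x: 1.0 / (1.0 + math.exp(-x))
h = unit_piece(logistic, -6.0, 6.0)
print(h(0.0), round(h(0.5), 4), h(1.0))   # 0.0, ~0.5, 1.0: an S shaped weighting function
```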

3.4 Infinitely Many Activation Functions

Take a concave up part of any continuous function and transform it to a function $\ell_1(x)$ on the interval $[u_1, u_2]$, which will give $\ell_1(u_1) = u_1$ and $\ell_1(u_2) = u_2$. Now take a concave down part of any continuous function and transform it to a function $\ell_2(x)$ on the interval $[u_2, v]$, which will give $\ell_2(u_2) = u_2$ and $\ell_2(v) = v$. The functions $\ell_1$ and $\ell_2$ are glued together at the point $u_2$, giving a new activation function which is S shaped just like the logistic function.
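A minimal sketch of this gluing, with illustrative pieces (a concave up piece of $x^2$ and a concave down piece of $\sqrt{x}$, glued at $u_2 = 0.5$; the paper's example 3.2 uses $\log x$ and $x^{1/3}$ instead):

```python
import math

def rescale(f, a, b, u1, u2):
    """Carry the piece of f on [a, b] onto [u1, u2] so that the endpoints land on the diagonal."""
    g = lambda x: f(x + a) - f(a)                                # (3.8)
    h = lambda x: g((b - a) * x) / g(b - a)                      # (3.9)
    return lambda x: u1 + (u2 - u1) * h((x - u1) / (u2 - u1))    # (3.10)

ell1 = rescale(lambda x: x ** 2, 0.0, 1.0, 0.0, 0.5)  # concave up piece on [0, 0.5]
ell2 = rescale(math.sqrt, 0.0, 1.0, 0.5, 1.0)         # concave down piece on [0.5, 1]

def glued_activation(x: float) -> float:
    """S shaped function obtained by gluing ell1 and ell2 at u2 = 0.5."""
    return ell1(x) if x <= 0.5 else ell2(x)

for x in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(x, round(glued_activation(x), 4))
```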

(3.4.1) Non-differentiability and gradient descent. The derivative of the activation function is needed for backpropagation to minimize the total error via gradient descent.

If the underlying functions are differentiable, then the function obtained by gluing their finite pieces is also differentiable, except where the two functions are glued together. This problem of non-differentiability can be solved in two ways; a small numerical sketch follows the list.

1. The derivative at the non-differentiable point can be defined as the average of the derivatives of the two functions on the left and on the right.

2. From probability theory we know that the probability of drawing a single point from an interval is zero. If the non-differentiable point is still drawn, it can be perturbed by a small amount so that it again lies on one of the differentiable pieces.
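A minimal numerical sketch of option 1 (the glued pieces below are illustrative quadratics, not taken from the paper):

```python
def glued(x: float) -> float:
    """Concave up 2*x**2 on [0, 0.5] glued to a concave down quadratic on [0.5, 1]."""
    if x <= 0.5:
        return 2.0 * x * x
    return 0.5 + 1.4 * (x - 0.5) - 0.8 * (x - 0.5) ** 2

def derivative(x: float, eps: float = 1e-6) -> float:
    """One-sided numerical slopes; at the glue point this returns their average."""
    left = (glued(x) - glued(x - eps)) / eps
    right = (glued(x + eps) - glued(x)) / eps
    return 0.5 * (left + right)

print(round(derivative(0.25), 3))   # ~1.0, the ordinary derivative of 2*x**2
print(round(derivative(0.5), 3))    # ~1.7, average of the left slope 2.0 and right slope 1.4
```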

3.5 Interpretation of Activation Functions

Prospect theory gives the 'correct interpretation' of activation functions. The theory in this paper allows us to think about activation functions as probability weighting functions. The neuron can take continuous input but its output is discrete: fire or not fire. The activation function encodes this firing as well as the learning process in the neuron. It is this learning process which gains clarity when interpreted through the lens of prospect theory. Mathematically, it is just minimization of total error as in regression, but with the aid of the derivative of the activation function. However, from the point of view of prospect theory the learning process alters the probability. Less likely events are interpreted to occur with even lower chance and more likely events are interpreted to occur with higher chance, as portrayed by an S shaped probability weighting function.

The underweighting and overweighting of probabilities in order to make a binary decision is the psychology of the learning process. In the learning process the neuron classifies data into the data which will lead to firing and the data which will not. This data is classified with the help of the activation function. In this paper we interpret this classification of data as a change of probability.

(3.5.1) Illustrative Example. I recently bought a puzzle toy for my dog Apollo. There are multiple ways to get to the reward of a treat, but each of these methods has a different probability of success when the dog puts in a constant effort using a particular technique. After going through all the iterations the dog has settled on a single method to get the treat. I presume that in the training process he overweighted the probability of this method and his neurons now fire to follow this technique only, although the probabilities of the other ways of getting the treat remain the same.

4 Applications

The activation functions constructed in the paper can be used to build neural networks for marketing financial instruments to investors based upon their risk appetite.

According to prospect theory, investors value gains and losses asymmetrically and make decisions based upon their utility towards perceived gains and perceived losses. It also explains the tendency of an investor to hold stocks that are losing value for a substantial amount of time but sell rising stocks very soon, whereas the logical action would be to sell losing stocks to prevent losses and hold onto winning stocks.

Financial organizations could determine the risk appetite of a consumer by asking about their preferences for higher gains with potentially higher losses versus lower gains with potentially lower losses. Once the degree of risk seeking or risk averse behavior is determined according to prospect theory, it can be transferred to an activation function using the techniques introduced in this paper. This activation function can then be used in a neural network for determining which financial instruments best suit the individual needs of the client.

In this paper we have constructed infinitely many such functions; therefore it is possible to have a unique activation function for each investor. Thus, it is possible to tailor a neural network uniquely to each individual's risk appetite, leading to exceptional service.


5 Conclusion

It can be concluded that the probability weighting function of prospect theory and the activation function of neural networks are avatars of each other. Thus, there is an unexplored bridge between the theory of behavioral economics and neural networks. This paper is the first attempt to reconcile these two distinct fields.

6 Acknowledgement

I am grateful to Tracy Mott and Nicolas Ormes for helpful discussions and suggestions for this paper.


References

[And95] J.A. Anderson. An Introduction to Neural Networks. A Bradford Book, MIT Press, 1995.

[Arb03] M.A. Arbib. The Handbook of Brain Theory and Neural Networks. A Bradford Book, MIT Press, 2003.

[Bed06] Harpreet Bedi. Psychological weighting of probabilities: An assessment and suggestion of a new function. Master's thesis, University of Denver, 2006.

[Gal07] A.I. Galushkin. Neural Networks Theory. Springer Berlin Heidelberg, 2007.

[HHV05] G. Harrison, S. Humphrey, and A. Verschoor. Choice under uncertainty in developing countries. 2005.

[HHV10] G. Harrison, S. Humphrey, and A. Verschoor. Choice under uncertainty: Evidence from Ethiopia, India and Uganda. The Economic Journal, 120(543):263–291, 2010.

[HM95] Jun Han and Claudio Moraga. The influence of the sigmoid function parameters on the speed of backpropagation learning. Volume 930 of Lecture Notes in Computer Science, pages 195–201. Springer Berlin Heidelberg, 1995.

[Pet17] M. Peterson. An Introduction to Decision Theory. Cambridge University Press, 2017.

[PP92] P. Peretto. An Introduction to the Modeling of Neural Networks. Aléa-Saclay. Cambridge University Press, 1992.

[Pre98] Drazen Prelec. The Probability Weighting Function. Econometrica, 66(3):497–527, 1998.

[RW06] Marc Oliver Rieger and Mei Wang. Cumulative prospect theory and the St. Petersburg paradox. Economic Theory, 28(3):665–679, 2006.

[TK79] A. Tversky and D. Kahneman. Prospect theory: An analysis of decision under risk. Econometrica, 47(2):263–291, 1979.

[TK92] A. Tversky and D. Kahneman. Advances in prospect theory: Cumulative representation of uncertainty. Journal of Risk and Uncertainty, 5:297–323, 1992.

[Wak10] P.P. Wakker. Prospect Theory: For Risk and Ambiguity. Cambridge University Press, 2010.
