3.5. Empirical methods of data analysis
3.5.2. Principal Components Analysis (PCA)
An empirical framework based on the entrepreneurial skills of farmers allows the researcher to investigate how farmers identify and develop their own skills and roles in relation to their immediate physical, social, economic and institutional environments (Morgan et al., 2010). PCA was first used to combine socioeconomic indicators into a single index (Boelhouwer and Stoop, 1999). However, due to the inappropriateness of simple aggregation procedures, Lai (2003) modified the United Nations Development Programme (UNDP) Human Development Index (HDI) by using PCA to create a linear combination of indicators of development.
Since the late 1990s, researchers have increasingly used PCA to compute various composite socioeconomic indices (Antony and Rao, 2007; Fukuda et al., 2007; Fotso and Kuate-Defo, 2005; Havard et al., 2008). It has been used to construct asset-based poverty indices that determine the socio-economic status of households (Filmer and Pritchett, 2001; Vyas and Kumaranayake, 2006; Achia et al., 2010; Howe et al., 2012). Following the same logic, PCA is used in this study to create a multi-criteria on-farm entrepreneurship index.
PCA was used to generate the entrepreneurship index, and this index was, in turn, used as the dependent variable in the Tobit regression model to determine the effect that rural endowment has on the entrepreneurship level of farmers in the irrigation schemes. These analytical techniques are explained in detail in the following sub-sections. An initial set of 45 entrepreneurship skills, motivations, self-efficacy measures, competencies and attributes (see Appendix A) was cut down to 28 correlated items. The PCA produced five uncorrelated components, each of which was a linear weighted combination of the initial skills or attributes. Only the factor scores (eigenvectors) of the first principal component (PC1) were used to construct the entrepreneurship index.
The aim was to create a single measure of on-farm entrepreneurship for farmers in the irrigation schemes. PCA is a powerful and relatively simple technique for extracting hidden structures from possibly high-dimensional datasets (Achia et al., 2010). Suppose we have a dataset with a large number of variables (i.e. indicators) for various observations. These indicators can be thought of as measuring the same object or episode from different perspectives, so that all of them contain common information about the object. PCA is an orthogonal transformation of the coordinate system in which the data are described. The new coordinate values by which the data are represented are called principal components. It is often the case that a small number of such principal components is enough to account for most of the structure in the data. These are sometimes called factors or latent variables of the data (Tabachnick and Fidell, 1983).
There are alternatives to PCA, such as correspondence analysis, multivariate regression and factor analysis. Cortinovis et al. (1993) used correspondence analysis to derive an asset-based poverty index. However, correspondence analysis can only be used for categorical data (nominal and ordinal); continuous data would need to be reorganized into ranges. With multivariate regression, dimensionality reduction is accomplished by simply choosing which variables to leave out, at the expense of ignoring some dimensions of the data (Achia et al., 2010).
Factor analysis has a similar aim to PCA, namely expressing a set of variables in terms of a smaller number of indices or factors. The difference between the two is that, while PCA imposes no assumptions about an underlying model, the factors derived from factor analysis are assumed to represent the underlying processes that result in the correlations between the variables (Achia et al., 2010). The choice between PCA and factor analysis to address multicollinearity also depends on the researcher’s own assessment of the fit between the common factor model, the data set and the goals of the research (Tabachnick and Fidell, 1983).
For this study, which aimed at formulating an on-farm entrepreneurship index, PCA was deemed the better choice. The alternative, factor analysis, is more suitable when the aim of the study is to obtain a hypothetical solution uncontaminated by unique and error variability, as opposed to an empirical summary of the results. Compared with other statistical alternatives, PCA is computationally easier, can use the type of data that is more easily collected in household surveys, and uses all of the variables in reducing the dimensionality of the data (Jobson, 1992).
PCA is concerned with explaining variability. If the variables are measured in different units, operations involving the trace of the covariance matrix have no meaning, and the correlation matrix is used instead. If the variables are in the same units (including after taking logs of the variables), the covariance matrix is used (Jackson, 1991).
Suppose we have a set of $N$ variables, $a^*_{1j}$ to $a^*_{Nj}$, each representing farmer $j$'s Likert-scale rating on one of $N$ entrepreneurial traits. Principal components analysis starts by normalizing each variable by its mean and standard deviation: for example, $a_{1j} = (a^*_{1j} - a^*_1)/s^*_1$, where $a^*_1$ is the mean of $a^*_{1j}$ across all farmers and $s^*_1$ is its standard deviation.
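The normalization step can be sketched in a few lines of Python with NumPy. This is an illustration only, not the software used in the study: the function name `standardize` and the small ratings matrix are hypothetical. The sketch also checks that the covariance matrix of the normalized variables equals the correlation matrix of the raw variables, which is why the correlation matrix is the natural input when variables are on different scales.

```python
import numpy as np

def standardize(raw):
    """Return column-wise z-scores (a*_nj - mean_n) / s_n for each trait n."""
    raw = np.asarray(raw, dtype=float)
    mean = raw.mean(axis=0)          # mean of each trait across farmers
    std = raw.std(axis=0, ddof=1)    # sample standard deviation of each trait
    return (raw - mean) / std

# Hypothetical ratings: 6 farmers (rows) by 3 traits (columns).
ratings = np.array([
    [1, 4, 2],
    [3, 5, 3],
    [2, 2, 5],
    [5, 3, 4],
    [4, 1, 1],
    [3, 3, 3],
])
z = standardize(ratings)

# Covariance of the normalized data equals the correlation of the raw data.
cov_of_z = np.cov(z, rowvar=False)
corr_of_raw = np.corrcoef(ratings, rowvar=False)
```

After normalization every trait has mean zero and unit variance, so no single trait dominates the components merely because of its scale.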
These selected variables are expressed as linear combinations of a set of underlying components for each farm household j:
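$$
\begin{aligned}
a_{1j} &= v_{11} A_{1j} + v_{12} A_{2j} + \cdots + v_{1N} A_{Nj} \\
&\;\;\vdots \\
a_{Nj} &= v_{N1} A_{1j} + v_{N2} A_{2j} + \cdots + v_{NN} A_{Nj}
\end{aligned}
\qquad j = 1, \ldots, J \tag{1}
$$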
where the A terms are the components and the v terms are the coefficients on each component for each variable (these do not vary across farmers). Because only the left-hand side of each line is observed, the solution to the problem is indeterminate.
Principal components analysis overcomes this indeterminacy by finding the linear combination of the variables with maximum variance (the first principal component, $A_{1j}$), then finding a second linear combination of the variables, orthogonal to the first, with maximal remaining variance, and so on. Technically, the procedure solves the equations $(R - \lambda_n I) v_n = 0$ for $\lambda_n$ and $v_n$, where $R$ is the matrix of correlations between the scaled variables and $v_n$ is the vector of coefficients on the $n$th component for each variable. Solving the equations yields the characteristic roots of $R$, $\lambda_n$ (also known as eigenvalues), and their associated eigenvectors, $v_n$.
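As a minimal sketch of this eigenvalue problem, the snippet below computes the correlation matrix of a simulated dataset and extracts its eigenvalues and eigenvectors with NumPy. The data, the random seed and the function name `principal_axes` are assumptions made for illustration; they are not taken from the study.

```python
import numpy as np

def principal_axes(data):
    """Eigenvalues and eigenvectors of the correlation matrix R, largest first."""
    R = np.corrcoef(np.asarray(data, dtype=float), rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(R)   # eigh suits symmetric R
    order = np.argsort(eigenvalues)[::-1]           # descending variance
    return R, eigenvalues[order], eigenvectors[:, order]

# Simulated data: 50 observations of 4 indicators (hypothetical).
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 4))
data[:, 1] += data[:, 0]          # make two indicators correlated

R, lam, V = principal_axes(data)

# Share of total variance accounted for by each component.
share = lam / lam.sum()
```

Each column of `V` satisfies `R @ v = lam * v`, the columns are mutually orthogonal, and the eigenvalues sum to the number of variables because the trace of a correlation matrix equals its dimension.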
The final set of estimates is produced by scaling each $v_n$ so that the sum of its squared elements equals the total variance, a further restriction imposed to make the problem determinate. The “scoring factors” from the model are recovered by inverting the system implied by Eq. (1), and yield a set of estimates for each of the $N$ principal components (Armeanu and Lache, 2008):
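$$
\begin{aligned}
A_{1j} &= f_{11} a_{1j} + f_{12} a_{2j} + \cdots + f_{1N} a_{Nj} \\
&\;\;\vdots \\
A_{Nj} &= f_{N1} a_{1j} + f_{N2} a_{2j} + \cdots + f_{NN} a_{Nj}
\end{aligned}
\qquad j = 1, \ldots, J \tag{2}
$$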
The first principal component, expressed in terms of the original (un-normalized) variables, is therefore an index for each entrepreneur based on the expression:
$$
A_{1j} = f_{11} \frac{a^*_{1j} - a^*_1}{s^*_1} + \cdots + f_{1N} \frac{a^*_{Nj} - a^*_N}{s^*_N} \tag{3}
$$
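The construction of the index in Eq. (3) can be illustrated with a short sketch. It assumes that the weights on the first component are taken as the unit-length leading eigenvector of the correlation matrix; other scalings change the index only by a constant factor. The function name `pc1_index`, the simulated ratings and the sign convention are hypothetical choices, not the study's own procedure.

```python
import numpy as np

def pc1_index(raw):
    """Score each row on the first principal component of the correlation matrix."""
    raw = np.asarray(raw, dtype=float)
    z = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)   # normalized variables
    R = np.corrcoef(raw, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(R)            # ascending order
    f1 = eigenvectors[:, -1]                                 # weights for PC1
    if f1.sum() < 0:
        f1 = -f1      # eigenvector sign is arbitrary; this orientation is an assumption
    return z @ f1, f1, eigenvalues[-1]

# Simulated ratings: 40 farmers, 5 traits sharing a common factor (hypothetical).
rng = np.random.default_rng(1)
common = rng.normal(size=(40, 1))
ratings = common + 0.5 * rng.normal(size=(40, 5))

index, weights, lam1 = pc1_index(ratings)
```

The resulting index has mean zero, and its sample variance equals the largest eigenvalue, so `lam1` divided by the number of traits is the share of total variance that the single index captures.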
Given that the PCA-generated entrepreneurship index is censored at its minimum and maximum values (Manyong et al., 2006; Muchara et al., 2014), the two-limit Tobit model (Greene, 2003; Long and Freese, 1997; Wooldridge, 2002) was estimated to investigate the determinants of on-farm entrepreneurship in taking advantage of smallholder irrigation schemes. Since entrepreneurship can also be influenced by the individual’s farming experience and education level, these variables and the others listed in Table 3.2 were included in the model. Using the index generated by PCA as the dependent variable, the Tobit regression model was specified as follows:

$$
Y^*_i = \beta_0 + \beta x_i + \varepsilon_i \qquad [1]
$$

where $Y^*_i$ is the unobservable latent on-farm entrepreneurship index of household $i$; $x_i$ is a vector of household characteristics; $\beta_0$ is the intercept; $\beta$ is a vector of coefficients to be estimated; and $\varepsilon_i$ is the residual term.
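To make the two-limit censoring concrete, the sketch below writes out the standard log-likelihood of a two-limit Tobit model under the usual assumption of normally distributed errors with standard deviation `sigma`. The function names, the limits and the toy data are hypothetical. The sketch covers only the likelihood evaluation; in practice the parameters would be obtained by maximizing this function numerically, typically with a statistical package.

```python
import math
import numpy as np

def _norm_cdf(z):
    """Standard normal CDF, built from math.erf to avoid extra dependencies."""
    erf = np.vectorize(math.erf, otypes=[float])
    return 0.5 * (1.0 + erf(np.asarray(z, dtype=float) / math.sqrt(2.0)))

def two_limit_tobit_loglik(beta, sigma, X, y, lower, upper):
    """Log-likelihood of a Tobit model censored at `lower` and `upper`."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    xb = X @ np.asarray(beta, dtype=float)   # X includes a constant column for beta_0
    at_lower = y <= lower
    at_upper = y >= upper
    inside = ~(at_lower | at_upper)

    ll = np.zeros_like(y)
    # Censored from below: P(Y* <= lower)
    ll[at_lower] = np.log(_norm_cdf((lower - xb[at_lower]) / sigma))
    # Censored from above: P(Y* >= upper)
    ll[at_upper] = np.log(1.0 - _norm_cdf((upper - xb[at_upper]) / sigma))
    # Uncensored: normal density of the residual
    z = (y[inside] - xb[inside]) / sigma
    ll[inside] = -0.5 * math.log(2.0 * math.pi) - math.log(sigma) - 0.5 * z ** 2
    return ll.sum()

# Toy example: intercept-only model, index rescaled to [0, 1] (hypothetical).
X = np.ones((3, 1))
y = np.array([0.0, 0.5, 1.0])
loglik = two_limit_tobit_loglik(beta=[0.5], sigma=1.0, X=X, y=y, lower=0.0, upper=1.0)
```

Observations at either limit contribute a probability mass rather than a density, which is what distinguishes the Tobit likelihood from that of ordinary least squares.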