3.4 Pilot data capturing, methods of analysis, and statistics
3.4.1 Pilot data capturing, method of analysis, and statistics for Chapter 4
3.4.1.2 Independent variables
1. Age
Although the toddler CDI targets a specific age cohort, age is nonetheless controlled for in order to 1) confirm that younger children within the toddler cohort produce fewer vocabulary items than older children within the same cohort, and 2) obtain cleaner estimates of the other effects by partialling out any effect that age might have (see Wooldridge, 2003: 78-79). Age is grouped into two cohorts only for the purposes of Table 3-2 below, since this gives useful insight into the distribution of age in the sample. It is used ungrouped in the regression.
49 The sample is too small to run a multinomial logistic regression on percentiles since the model is perfectly fitted, thus rendering any meaningful analysis using this method impossible. Due to the continuous nature of the dependent variable, MLR is preferred. Vogt, Mastin and Aussems (2015) correspondingly note how a linear trend is common with CDI scores. In a linear regression model, OLS estimates the unknown parameters by minimising the differences between the collected observations in an arbitrary data set and the responses predicted by the linear approximation of the data (Ohri 2018: 140).
2. Gender
The binary categories, female and male, as per the family history questionnaire, are used.
3. Mother’s level of education
Maternal education is grouped into categories based on the level of schooling completed, closely following that used by Vogt, Mastin and Aussems (2015), namely: not completed primary school, completed primary school, not completed secondary school, completed secondary school, and completed higher education (including FET training colleges).
4. Sibling as a secondary caregiver
In order to consider the role of sibling caregivers (as reported in Section 2.3.2), the data were coded to indicate whether or not a child is reported to have a sibling caregiver. The presence of a sibling caregiver is logged if a secondary caregiver under the age of 18 is reported in the family history questionnaire. This is not broken up into further age cohorts.
5. Whether the child is a twin
This variable is perfectly correlated with whether the child was born early (i.e. not full term). Hence the latter is not included as an explanatory variable.
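The reasoning behind dropping one of two perfectly correlated indicators can be illustrated with a minimal sketch. The 0/1 values below are invented for illustration and are not the study's data; the function name `pearson_r` is likewise hypothetical. When two indicators take identical values for every child, their correlation is exactly 1 and a regression cannot separate their coefficients.

```python
def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# Invented indicator values (1 = yes, 0 = no), NOT the thesis data.
twin = [1, 1, 0, 0, 0, 1, 1, 0]
born_early = list(twin)  # assumed identical pattern: every twin born early, no one else

r = pearson_r(twin, born_early)
print(r)
```

With a correlation of this kind, including both indicators would add no information beyond including one of them.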
6. Birth order
Given the findings of existing scholarship outlined in the literature review, I consider only whether or not the child was the firstborn (i.e. firstborn vs. laterborn). This is the same method that Reese and Read (2000) use for capturing birth order effects on the New Zealand CDI.
7. Ear problems
In the family history questionnaire, participants are asked about whether their child has experienced any health-related problems, because poor health can negatively affect language development (Vogt, Mastin and Aussems, 2015). The only health problem controlled for is ear infection/ear problem since answers to all other questions are “no”.
8. Crèche attendance
9. Number of adults in the home
10. Number of secondary caregivers
11. Number of siblings
12. Number of children in the home
13. Household income
The household income variable is coded as a binary variable (high or low) since all participants except two fall into the R0 – R36 000 per year category.
14. Age-gender interaction
Lastly, I created an interaction term for age and gender. The age-gender interaction allows the effect of gender on vocabulary production to vary with age. Its purpose is to partial out the effect of the majority of the older children being female.
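What the interaction term does can be written out as a short worked equation. This is a sketch under two assumptions that the text does not state explicitly: gender is coded 0/1, and the interaction is the product of age and that indicator. The symbols \beta_{A}, \beta_{G} and \beta_{AG} are generic labels introduced here for illustration only.

\[
y = \dots + \beta_{A}\,\text{Age} + \beta_{G}\,\text{Gender} + \beta_{AG}\,(\text{Age} \times \text{Gender}) + \dots
\]

Holding all other regressors fixed, the expected difference between the two gender groups at a given age a is then

\[
E[y \mid \text{Gender}=1, \text{Age}=a] - E[y \mid \text{Gender}=0, \text{Age}=a] = \beta_{G} + \beta_{AG}\,a ,
\]

so the gender gap is no longer a single constant but is allowed to change linearly with age.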
Table 3-2 below provides distributional information regarding the number of participants in each category of the independent variables described above:
Table 3-2 Independent variables
1. Age
    16-23.5 months [1;4 – 1;11.15] (n=11)
    23.5-30 months [1;11.15 – 2;6] (n=9)
2. Gender
    Female (n=13)
    Male (n=7)
3. Mother’s education
    not completed primary school (n=0)
    completed primary school (n=3)
    not completed secondary school (n=12)
    completed secondary school (n=4)
    completed higher education (n=1)
4. Sibling as a secondary caregiver
    Yes (n=6)
    No (n=14)
5. Twin
    Yes (n=4)
    No (n=16)
6. First born
    Yes (n=12)
    No (n=8)
7. Ear problems
    Yes (n=4)
    No (n=16)
8. Crèche attendance
    Yes (n=8)
    No (n=12)
9. Number of adults in the home
    One (n=2)
    Two (n=5)
    Three (n=3)
    Four (n=5)
    Five (n=3)
    Six (n=2)
10. Number of secondary caregivers
    None (n=2)
    One (n=4)
    Two (n=8)
    Three (n=3)
    Four (n=3) (see footnote 50)
11. Number of siblings
    None (n=6)
    One (n=4)
    Two (n=1)
    Three (n=4)
    Four or more (n=5)
12. Number of children in the home
    One (n=1)
    Two (n=4)
    Three (n=6)
    Four (n=2)
    Five (n=4)
    Six (n=2)
    Seven (n=1)
13. Household income
    R0 – R36 000 (n=18)
    R36 001 – R72 000 (n=2)
14. Age-gender
    Age (months)    Male    Female
    17 [1;5]        2       1
    18 [1;6]        1       1
    21 [1;9]        1       1
    22 [1;10]       0       2
    23 [1;11]       0       2
    24 [2;0]        1       0
    25 [2;1]        0       1
    26 [2;2]        1       0
    27 [2;3]        1       0
    28 [2;4]        0       2
    29 [2;5]        0       1
    30 [2;6]        0       1
50 Due to the nature of the questionnaire, participants were not asked about more than four caregivers.
A multiple linear regression was run on all these variables (see Appendix C) using STATA. However, the results showed that some of the variables were far from statistically significant, which indicated that the model needed to be adjusted. I thus ran an F-test to determine whether a group of these variables was jointly insignificant in explaining vocabulary production (Wooldridge, 2003: 142-143). The variables tested were ‘Ear problems’, ‘Number of adults in the home’, ‘Number of children in the home’, and ‘Household income’. The null hypothesis is that the coefficients on all four variables are 0. The F-statistic (1.60) is not statistically significant at the 5% or 10% level, so the null hypothesis cannot be rejected at either level. I thus concluded that the coefficients can be treated as jointly 0 and that these variables add no explanatory power to the model. In other words, ‘Ear problems’, ‘Number of adults in the home’, ‘Number of children in the home’, and ‘Household income’ have no detectable effect on productive vocabulary once the other variables have been controlled for, and they are therefore removed from the model.
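The arithmetic behind a joint exclusion test of this kind can be sketched as follows. This assumes the standard textbook form of the F-statistic, which compares the sum of squared residuals (SSR) of the restricted model (tested variables dropped) with that of the unrestricted model. The function name and all numeric inputs are invented for illustration; they are not the values from the STATA output in Appendix C. The choice q = 4 further assumes that each of the four tested variables enters the model as a single regressor.

```python
def f_statistic(ssr_restricted, ssr_unrestricted, q, df_resid):
    """Joint exclusion F-statistic.

    F = ((SSR_r - SSR_ur) / q) / (SSR_ur / df_resid)
    q        : number of coefficients set to zero under the null
    df_resid : residual degrees of freedom of the unrestricted model
    """
    if q <= 0 or df_resid <= 0:
        raise ValueError("q and df_resid must be positive")
    if ssr_unrestricted <= 0:
        raise ValueError("ssr_unrestricted must be positive")
    numerator = (ssr_restricted - ssr_unrestricted) / q
    denominator = ssr_unrestricted / df_resid
    return numerator / denominator


# Invented inputs, chosen only to make the arithmetic easy to follow.
example_f = f_statistic(ssr_restricted=120.0, ssr_unrestricted=100.0,
                        q=4, df_resid=10)
print(example_f)  # (20 / 4) / (100 / 10) = 0.5
```

The resulting value is compared with the critical value of the F distribution with (q, df_resid) degrees of freedom; a statistic below the critical value means the null hypothesis of jointly zero coefficients cannot be rejected.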
This outcome could be attributed to the following. The number of children in the home is not accounted for above the option ‘three or more’ in the family history questionnaire, meaning this variable has an upper limit of three in the analysis. The effects of ‘Number of adults in the home’ and ‘Number of children in the home’ may also already be captured by variables such as ‘Number of secondary caregivers’ and ‘Number of siblings’, as they are likely to be correlated. The ‘Household income’ variable is problematic for reasons to be discussed in Section 3.5, and its exclusion is not surprising since household income is not a common measure of SES in previous studies. The finding that ear problems have no explanatory power in the model is consistent with the findings of Vogt, Mastin and Aussems (2015), who report that children for whom hearing problems were reported did not differ significantly from the rest of their sample with regard to reported scores on expressive and receptive vocabularies.
The final multiple linear regression used for analysis thus excludes these four variables and is specified as follows:
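\[
\begin{aligned}
\text{Percentage\_words\_produced} ={} & \beta_{1}\,\text{Child\_Age\_Mths} + \beta_{2}\,\text{Gender} + \beta_{3}\,\text{not\_completed\_sec} \\
& + \beta_{4}\,\text{completed\_sec} + \beta_{5}\,\text{tertiarty} + \beta_{6}\,\text{Sibling\_caregiver} + \beta_{7}\,\text{Twin} + \beta_{8}\,\text{First\_born} \\
& + \beta_{9}\,\text{Creche\_attendance} + \beta_{10}\,\text{Number\_of\_caregivers} + \beta_{11}\,\text{Number\_of\_siblings} + \beta_{12}\,\text{Agegender}
\end{aligned}
\]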
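To make the specification concrete, the sketch below maps one questionnaire record onto the twelve regressors named above. It is an illustration only, not the STATA code used for the analysis. Several coding choices are assumptions: the input field names are hypothetical, female is coded 1, yes/no answers are coded 1/0, ‘completed primary school’ is treated as the omitted baseline for maternal education, `Number_of_caregivers` is taken to be the number of secondary caregivers, and `Agegender` is taken to be the product of age in months and the gender indicator.

```python
# Regressor name -> questionnaire category that switches the dummy on.
# The omitted (baseline) category is an assumption, see the lead-in.
EDUCATION_DUMMIES = {
    "not_completed_sec": "not completed secondary school",
    "completed_sec": "completed secondary school",
    "tertiarty": "completed higher education",  # spelling as in the specification
}


def design_row(child):
    """Return the regressors of the final model for one child record.

    `child` is a dict with hypothetical keys (age_months, gender, ...).
    """
    female = 1 if child["gender"] == "female" else 0  # assumed coding
    row = {
        "Child_Age_Mths": child["age_months"],
        "Gender": female,
    }
    for name, level in EDUCATION_DUMMIES.items():
        row[name] = 1 if child["mother_education"] == level else 0
    row["Sibling_caregiver"] = int(child["sibling_caregiver"])
    row["Twin"] = int(child["twin"])
    row["First_born"] = int(child["first_born"])
    row["Creche_attendance"] = int(child["creche"])
    row["Number_of_caregivers"] = child["n_secondary_caregivers"]
    row["Number_of_siblings"] = child["n_siblings"]
    row["Agegender"] = child["age_months"] * female  # assumed product form
    return row


# Invented record, not a participant from the study.
example_row = design_row({
    "age_months": 28,
    "gender": "female",
    "mother_education": "completed secondary school",
    "sibling_caregiver": True,
    "twin": False,
    "first_born": False,
    "creche": True,
    "n_secondary_caregivers": 2,
    "n_siblings": 3,
})
print(example_row)
```

Under these assumptions a male child always has `Agegender` equal to 0, so the interaction coefficient only shifts the predicted score for female children, in proportion to their age.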