Chapter 5: Discussion, Limitations, and Conclusion
5.2. Limitations, Recommendations and Conclusions
5.2.1. Limitations and Recommendations
judge reliability that covers a wider range of judgments than means or medians (Siegel, 1956). This measure provides a simple and time-efficient way of calculating overall agreement amongst k sets of rankings (Siegel, 1956). According to Sheskin (2007), W provides a measure for “data that are rank-ordered by more than two judges” (p. 1388). Alternatively, one could use Spearman’s rho, as “W for [k] sets of ranks is linearly related to the average value of Spearman’s rho which can be computed for all possible pairs of ranks” (p. 1387). Computing rho for all possible pairs of rankings to find the average agreement amongst a large sample of k judges, however, would be time-consuming and would likely increase the family-wise error rate (Siegel, 1956). Therefore, W was a more suitable measure of agreement amongst the ranks given by the judging participants.
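The linear relationship between Kendall's coefficient of concordance (W) and the average pairwise Spearman's rho can be illustrated with a minimal sketch. The rankings below are hypothetical and are not data from this study, and the formulas assume no tied ranks. For k judges, the identity is: mean rho = (kW - 1) / (k - 1).

```python
from itertools import combinations


def kendalls_w(rankings):
    """Kendall's W for k judges each ranking the same n items (no tied ranks)."""
    k = len(rankings)
    n = len(rankings[0])
    # Sum of the ranks each item received across all judges.
    rank_sums = [sum(judge[i] for judge in rankings) for i in range(n)]
    mean_rank_sum = k * (n + 1) / 2
    # S: sum of squared deviations of the rank sums from their mean.
    s = sum((r - mean_rank_sum) ** 2 for r in rank_sums)
    return 12 * s / (k ** 2 * (n ** 3 - n))


def spearman_rho(a, b):
    """Spearman's rho for two untied rankings of the same n items."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


def mean_pairwise_rho(rankings):
    """Average Spearman's rho over every possible pair of judges."""
    pairs = list(combinations(rankings, 2))
    return sum(spearman_rho(a, b) for a, b in pairs) / len(pairs)


# Hypothetical data: 4 judges each rank 5 stimuli (1 = highest).
rankings = [
    [1, 2, 3, 4, 5],
    [2, 1, 3, 5, 4],
    [1, 3, 2, 4, 5],
    [2, 1, 4, 3, 5],
]
k = len(rankings)
w = kendalls_w(rankings)
rho_bar = mean_pairwise_rho(rankings)
# Identity for untied ranks: mean rho = (k*W - 1) / (k - 1).
rho_from_w = (k * w - 1) / (k - 1)
print(w, rho_bar, rho_from_w)
```

For these made-up rankings W is 0.7875, and both routes give a mean rho of roughly 0.717. The single calculation of W therefore carries the same information as the six pairwise correlations, which is the practical argument made above for preferring it with many judges.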
Furthermore, as the data collection procedure required participants to rank order the stimuli, a parametric factorial method would not have been appropriate. Bootstrapping provides another empirical method for analysing the data, which enables researchers to estimate population parameters without making parametric assumptions (Chernick, 2008; Winston, 2004; Sprent, 1989). With bootstrapping, one can also estimate confidence intervals for population parameters using the percentile method. Although this method could be refined with a bias-corrected model with an acceleration constant, it provides good approximations of the 95% confidence interval.
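The percentile method can be sketched as follows. This is an illustrative outline only, not the procedure used in this study: the function name, the number of resamples, the fixed seed and the example ranks are all assumptions, and the bias-corrected and accelerated refinement is not implemented.

```python
import random
import statistics


def percentile_bootstrap_ci(data, statistic, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `statistic` of `data`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    boot_stats = []
    for _ in range(n_boot):
        # Resample n observations with replacement from the observed data.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        boot_stats.append(statistic(resample))
    boot_stats.sort()
    # Read off (approximately) the alpha/2 and 1 - alpha/2 quantiles
    # of the sorted bootstrap distribution.
    lo_idx = max(0, round(n_boot * alpha / 2))
    hi_idx = min(n_boot - 1, round(n_boot * (1 - alpha / 2)) - 1)
    return boot_stats[lo_idx], boot_stats[hi_idx]


# Hypothetical ranks given to one stimulus by 12 judges.
ranks = [2, 1, 3, 2, 4, 1, 2, 3, 2, 5, 1, 2]
lower, upper = percentile_bootstrap_ci(ranks, statistics.mean)
print(lower, upper)
```

No distributional form is assumed at any point: the interval comes entirely from the spread of the statistic across the resamples, which is why the approach suits rank data.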
sample sizes be obtained for both the stimuli sample and the judging sample, and methods for the reduction of scent loss be implemented.
Instrument decay in the form of scent loss and contamination was a major limitation of this study, as it reduced the sample size, which subsequently reduced the power of the study.
The instrument decay affected sample size because not enough judging participants could be recruited before the t-shirts lost their scent and became contaminated by other scents. In an attempt to recruit more judging participants, stimuli participants were asked to rewear the t-shirts after a detergent-free wash, to reduce scent contamination. The scent, however, did not last long enough to recruit a sufficient number of participants. In future studies, it is recommended that the method used by Singh and Bronstad (2001) to reduce scent loss and contamination be used. That is, the SS are placed in a box with a triangular hole cut into it, and judging participants smell the SS through the hole. This method reduces the amount of contact that the judging participants have with the t-shirts, thus reducing scent contamination and scent loss.
Another limitation that may have affected this study was the use of the term “masculinity”, which appears to have produced the least concordant results. The lack of concordance in the masculinity rankings may be due to ambiguity over whether the word is defined in social or biological terms. It was assumed that participants would judge and rank the stimuli according to how they perceived masculinity biologically. However, this may not have been the case: social status is also associated with masculinity, and it would have been indicated in the photographs through hairstyle and visible clothing, which may have swayed participants from ranking the male faces on the presence of biological markers of masculinity. In future studies, it is recommended that stimulus participants wear identical clothing whilst posing for the photographs to reduce potential bias from social status. Digitally manipulated photographs showing the same face with both masculinised and feminised features could also be used to assess masculinity preference (Cornwell et al., 2004). In addition, the measures on which participants made judgements in previous studies could have been used, i.e. pleasantness, sexiness and intensity (Gangestad et al., 2005; Singh & Bronstad, 2001; Thornhill & Gangestad, 1999).
The additional analyses regarding age and race showed that these factors could potentially be confounding variables. The age categories showed significant concordance across all ranking categories, which suggests that the different age groups considered the stimuli similarly. Furthermore, judges who were the same race as the stimulus participants did not show significant concordance in their rankings, whereas judges who were not of the same race did, suggesting that race may possibly be a confounding variable. In future studies, it may be advantageous to restrict the age range of judging participants and to match their race to that of the stimuli participants.
Alternatively, to add more validity to the study, adequate and matched sample sizes should be used.
A further limitation of this study is that it did not consider the potential for hormonal contraceptive use amongst women, which may have confounded the rankings given by women. In future studies, it is recommended that this information be obtained from female participants. Furthermore, the measure for predicted ovulation may not have been entirely accurate, and there is therefore scope for improvement in future studies, either by gaining ethical clearance and finances to buy and administer urine ovulation tests, or by tracking the female participants’ menstrual cycles over a monthly period.
In this study, it was also necessary to run multiple separate tests, such as Kendall’s coefficient of concordance, for many separate subgroups. Performing multiple tests is known to increase the likelihood of family-wise error, which may lead to incorrectly rejecting the null hypothesis (Tredoux & Durrheim, 2002). However, according to Nichols and Hayasaka (2003), the bootstrap test offers a flexible model that reduces the prevalence of family-wise error in its estimation of population parameters.
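How quickly family-wise error accumulates can be shown with a short worked example. The sketch below assumes that the tests are independent, which is a simplification for subgroup analyses drawn from the same sample, and the Bonferroni threshold is included only as a familiar point of comparison. It is not a correction that was applied in this study.

```python
def familywise_error_rate(alpha, m):
    """Chance of at least one false rejection across m independent tests."""
    return 1 - (1 - alpha) ** m


def bonferroni_alpha(alpha, m):
    """Per-test threshold that keeps the family-wise rate at or below alpha."""
    return alpha / m


# Family-wise error rate at a per-test alpha of 0.05 for growing numbers of tests.
for m in (1, 5, 10, 20):
    print(m, round(familywise_error_rate(0.05, m), 3), bonferroni_alpha(0.05, m))
```

Under these assumptions, ten tests at a per-test alpha of 0.05 already carry a family-wise error rate of about 0.40, which illustrates why conclusions drawn from many subgroup tests need to be treated with caution.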
A last recommendation is that a meta-analysis be conducted comparing effect sizes across all of the t-shirt and pheromone studies. According to Shanks and Vadillo (2015), publication bias and p-hacking are often a concern, particularly with replicated studies such as this one. Publication bias refers to the tendency to publish only results that are significant, whereas p-hacking refers to the tendency to alter data in order to achieve significance, for example by sampling until significance is reached or by removing outliers after testing. Shanks and Vadillo (2015) suggest that, owing to publication bias and p-hacking, the published literature may not always be an accurate reflection of the real world. A meta-analysis of the previous literature together with this study may explain why some of the results here were not significant, as would have been expected from previous studies.