In order to provide reliable research reports and allow fellow researchers to draw valid interpretations and evaluations, many scholars have argued that it is vital to provide full disclosure on the so-called basic 4: the sample size determination rule, specific exclusion criteria, all measured variables, and all tested conditions (LeBel & John, 2017; Simmons et al., 2012). Simmons and colleagues have suggested a 21-word solution that can easily be included in experimental studies: “We report how we determined our sample size, all data exclusions (if any), all manipulations, and all measures in the study.” Disclosing such basic information about one’s study increases transparency and provides a valid description of the intentions and methods (Miguel et al., 2014). Some journals have started to adopt the policy of requiring authors to include a full disclosure statement (e.g., Eich, 2014).
Beyond disclosing information about the study design, several discussions have targeted the way researchers report statistical results.
In a survey of articles published in the Journal of Experimental Psychology: General during 2009 and 2010, Fritz, Morris, and Richler (2012) found that less than half of the articles reported effect size measures and none of them reported a confidence interval. Scholars have underscored the importance of reporting effect size estimates and corresponding confidence intervals (Cumming, 2012; Fritz et al., 2012; Lakens, 2013; Lakens & Evers, 2014). On the one hand, effect sizes provide information about the practical significance of empirical studies and have been argued to facilitate cumulative science, as they are helpful in power calculations and meta-analyses. On the other hand, confidence intervals provide information on the precision of the estimate. However, popular statistical packages such as SPSS still fail to provide certain effect size estimates (e.g., for t-tests), which might be one reason for the underreporting of effect size estimates. Open software applications such as jamovi (https://www.jamovi.org), JASP (https://jasp-stats.org), or R (https://www.r-project.org) provide possible solutions for such shortcomings. Similarly, Lakens (2013) has provided several easy-to-use spreadsheets that calculate effect sizes for the most common statistical tests (https://osf.io/ixgcd/). In addition, to aid the interpretation of effect sizes, Fritz et al. (2012) have suggested including statistics such as the probability of superiority (PS), the percentage of occasions on which a randomly sampled member of one group has a higher score than a randomly sampled member of the other group (Grissom, 1994), or U1, the percentage of nonoverlap of the two distributions (Cohen, 1988). Finally, it has been highlighted that researchers should always report additional descriptive statistics, including means, standard deviations, and correlation matrices (Fritz et al., 2012).
Reporting such information is critical for inclusion in meta-analyses if open data are not available.
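To illustrate, the following base-R sketch computes Cohen’s d with an approximate 95% confidence interval and a nonparametric probability of superiority for two independent groups. The simulated data, the variable names, and the use of the standard large-sample standard error approximation for d are our own illustrative assumptions rather than a prescribed procedure.

```r
# Minimal base-R sketch: Cohen's d, an approximate 95% CI, and the
# probability of superiority (PS) for two independent groups.
# The data are simulated for illustration only.
set.seed(1)
group_a <- rnorm(50, mean = 0.4, sd = 1)   # hypothetical treatment scores
group_b <- rnorm(50, mean = 0.0, sd = 1)   # hypothetical control scores

n1 <- length(group_a); n2 <- length(group_b)

# Pooled standard deviation and Cohen's d
s_pooled <- sqrt(((n1 - 1) * var(group_a) + (n2 - 1) * var(group_b)) / (n1 + n2 - 2))
d <- (mean(group_a) - mean(group_b)) / s_pooled

# Approximate standard error and 95% CI for d (large-sample approximation)
se_d <- sqrt((n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2)))
ci_d <- d + c(-1, 1) * qnorm(0.975) * se_d

# Nonparametric probability of superiority: share of all pairs in which a
# member of group A scores higher than a member of group B
ps <- mean(outer(group_a, group_b, ">"))

round(c(d = d, ci_lower = ci_d[1], ci_upper = ci_d[2], PS = ps), 3)
```

Reporting all four of these numbers alongside the test statistic costs a single line in a results section but makes the estimate, its precision, and its practical meaning immediately available to readers and meta-analysts.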
Open Data and Materials
Many scholars have advocated the sharing of study data and materials and have argued that the long-term benefits outweigh the associated short-term costs (LeBel, Campbell, & Loving, 2017). In fact, openly available information on a particular study can ease replications and extensions and might simultaneously reduce errors, as other researchers are able to reproduce calculations and analyses (Miguel et al., 2014). In addition, we believe that it increases trust in the reliability of data and analyses.
Still, recent studies have found rather low rates (38%) of researchers sharing their data (Vanpaemel, Vermorgen, Deriemaecker, & Storms, 2015). In a survey among 1329 scientists, lack of time and funding were described as the major problems related to sharing data and materials (Tenopir et al., 2011). To promote an open research culture, the Transparency and Openness Promotion (TOP) Guidelines (https://cos.io/our-services/top-guidelines/) have been drafted by the Center for Open Science (COS) to guide journals’ decisions on the level of transparency required for a number of aspects such as data, design and analysis, or research materials (Nosek et al., 2015). Open sharing of data, materials, and the research process has become straightforward, as hundreds of online data repositories exist that differ in their focus and features (see http://www.opendoar.org for a directory of repositories). Finally, to increase the transparency of data sharing, an increasing number of journals have introduced so-called Open Science Badges to indicate the availability of open data and materials, as well as preregistrations (Kidwell et al., 2016; Lindsay, 2017). An excellent overview of steps to ensure openness of the research flow and data has been provided by O. Klein et al. (2018), who also provide a primer on how to implement transparent research management strategies and procedures.
Preregistration and Registered Reports
Preregistrations, the specification and recording of study protocols prior to conducting the study, have become standard for medical trials (Lenzer, Hoffman, Furberg, Ioannidis, & Grp, 2013), but are not the default in the social and behavioral sciences. However, in order to reduce the occurrence of publication bias, the systematically skewed publication of positive results (Dickersin, 1990; Rosenthal, 1979), and p-hacking, the flexibility in data analyses to obtain statistically significant findings (Simmons et al., 2011), scholars have called for the employment of preregistrations in the social sciences (Wagenmakers, Wetzels, Borsboom, van der Maas, & Kievit, 2012). Recently, a number of journals have adopted preregistration options and, similarly, journals publishing only preregistered studies have been launched (Chambers, 2013; Jonas & Cesario, 2016). In general, researchers have distinguished between two forms of preregistration (van’t Veer & Giner-Sorolla, 2016):
reviewed preregistrations, also called registered reports (Nosek & Lakens, 2014), and unreviewed preregistrations. For the first type, researchers specify a study protocol including method, materials, and proposed analyses, which is then reviewed through the traditional peer-review system prior to conducting the study. If this preregistered protocol has been vetted, the final results will be published independently of the outcome, provided that the researchers have adhered to the protocol. In contrast, for the unreviewed preregistration type, the study protocol is not peer reviewed but simply registered by the researcher. Adopting preregistration increases the transparency of the distinction between a priori planned confirmatory analyses and post hoc exploratory analyses (Nosek, Ebersole, DeHaven, & Mellor, 2018). Contrary to some preconceived notions, preregistration does not restrict exploratory research, as long as it is denoted as such. Many scholars have argued that preregistration is not only able to increase transparency but also to provide a more valid description of actual effects, as it distinguishes between prediction and postdiction (Mellor & Nosek, 2018; Nosek et al., 2018; Wagenmakers et al., 2012). As registered reports are published independently of the outcome, researchers are not pressured to present positive results only.
Nevertheless, preregistration is not always applicable to every study design, and most of the literature has focused on tailoring preregistrations to quantitative experimental research. In addition, envisioning every possible step and outcome of a study might pose issues and uncertainties. Therefore, van’t Veer and Giner-Sorolla (2016) have drafted a template including a number of questions with regard to hypotheses, methods, and the analysis plan. Moreover, a number of online services such as AsPredicted (aspredicted.org) or the Open Science Framework (osf.io) have made it easy to efficiently preregister one’s research using predefined templates.
Power and Accuracy
Accordingly, up-to-date psychological investigators are normally expected to include some preliminary calculations regarding power in designing their experiments. (Meehl, 1967, p. 107)
Already several decades ago, Cohen (1962) pointed out the importance of statistical power in the social and behavioral sciences (see also Greenwald, 1975; Ioannidis, 2005). Statistical power refers to the probability of rejecting the null hypothesis when it is false and is therefore only relevant in the context of hypothesis testing. While power is important for planning to reject the null hypothesis and to explore the direction of an effect, precision or accuracy¹ is relevant for estimating the actual effect, and its main goal lies in achieving a sufficiently narrow confidence interval (see Maxwell, Kelley, & Rausch, 2008 for a discussion). Power depends on the population effect size, the sample size, and the alpha level, while accuracy depends first and foremost on the sample size and the population variance. Thus, the two approaches, power analysis and accuracy in parameter estimation, have different goals, and depending on the context, planning for accuracy might sometimes result in larger sample size recommendations (Kelley & Rausch, 2006). Typically, a researcher wants to perform a power analysis or plan for accuracy before conducting a study, in order to get an idea of how many participants need to be recruited to achieve either a certain amount of power or a certain degree of accuracy. Power levels of 80% are often discussed as appropriate (Cohen, 1988), though some journals have adopted policies requesting higher values (e.g., Jonas & Cesario, 2016). In general, it should be noted that recent discussions have primarily focused on power, and “[...] researchers have not yet made precision a central part of their research planning” (Cumming, 2012, p. 355).

¹ Accuracy and precision often occur simultaneously. However, while precision refers to a narrow confidence interval, accuracy also provides information that this interval contains the true population value (see Kelley & Maxwell, 2003, 2008 for a discussion).
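As a brief illustration of how these quantities interact, the following base-R sketch uses the built-in power.t.test function for a two-sample t-test; with a standard deviation of 1, the mean difference delta corresponds to Cohen’s d. The specific effect sizes and group sizes are illustrative assumptions only.

```r
# Base-R illustration of a priori power planning for a two-sample t-test.
# With sd = 1, the mean difference `delta` corresponds to Cohen's d.

# Required sample size per group for 80% power at alpha = .05 and d = 0.5
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80,
             type = "two.sample", alternative = "two.sided")

# Power as a function of the assumed effect size for a fixed n of 50 per group
sapply(c(0.2, 0.5, 0.8), function(d) {
  power.t.test(n = 50, delta = d, sd = 1, sig.level = 0.05)$power
})
```

For the assumed medium effect of d = 0.5, roughly 64 participants per group are required to reach 80% power at α = .05, whereas 50 participants per group yield high power only for large effects.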
A number of attempts have been made to estimate the average statistical power in the social and behavioral sciences. Estimates ranged from about 50% mean power in social-personality psychology (Fraley & Vazire, 2014) to about 35% for psychological research (Bakker et al., 2012) for a medium effect, and to a median of 21% (Button et al., 2013) or 12% (Szucs & Ioannidis, 2017) to detect small effects. Similarly, evidence has been presented that researchers have problems grasping statistical power and underestimate the sample size needed, especially for small effect sizes (Bakker, Hartgerink, Wicherts, & van der Maas, 2016), with similar findings showing misconceptions and poor understanding of precision and confidence intervals (Belia, Fidler, Williams, & Cumming, 2005). However, at the same time, research in the social sciences often targets primarily small to medium effects (Gignac & Szodorai, 2016). It is therefore not surprising that scholars have called for adequate a priori power analyses or simulations in order to be able to detect possible effects accurately (Anderson, Kelley, & Maxwell, 2017; Bakker et al., 2012; Maxwell, 2004).
In order to conduct a successful a priori power analysis, a researcher would need to know the population effect size, but of course this information can only be estimated. Sometimes researchers select a small, medium, or large effect size based on interpretation standards (Cohen, 1988) and their own estimate of the population effect. On other occasions researchers base the effect size on pilot studies or previous literature. However, Anderson et al. (2017) have warned against such practices, as such effects might often be overestimated due to publication bias and as pilot studies rely on particularly small samples (see also Albers & Lakens, 2018). They advocate adjusting effect sizes for bias and uncertainty (https://designingexperiments.com/shiny-r-web-apps/). In addition, a number of online resources and software applications have been released to compute a priori power analyses. One straightforward application is G∗Power 3, which includes power analyses for the most common tests such as ANOVAs or t-tests (Faul et al., 2007; http://www.gpower.hhu.de/en.html). For more complex models such as multilevel regression models (https://jakewestfall.shinyapps.io/two_factor_power/; Judd et al., 2017) or mediation models (https://schoemanna.shinyapps.io/mc_power_med/; Schoemann et al., 2017), dedicated online applications exist. While these applications cover the most common models, scholars have advocated using power simulations for more complex or uncommon situations (Maxwell et al., 2008). Power simulations are a great tool for understanding the nature of statistical power and also help researchers grasp that statistical power is better understood as a function of various parameters rather than a single fixed value.
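A minimal simulation sketch for a simple two-group design is shown below; it assumes normally distributed scores and a hypothesized effect of d = 0.5, but the same logic (simulate data from the assumed model many times and record how often the focal test is significant) extends to arbitrarily complex designs.

```r
# Minimal power simulation for a two-group comparison (d = 0.5, n = 64 per group).
# For more complex designs, simulate data from the assumed model and record
# how often the focal test reaches significance.
set.seed(42)
simulate_once <- function(n_per_group, d) {
  x <- rnorm(n_per_group, mean = d, sd = 1)   # assumed "treatment" population
  y <- rnorm(n_per_group, mean = 0, sd = 1)   # assumed "control" population
  t.test(x, y)$p.value < 0.05                 # was the effect detected?
}

n_sims <- 5000
estimated_power <- mean(replicate(n_sims, simulate_once(n_per_group = 64, d = 0.5)))
estimated_power  # should be close to the analytic value of about .80
```

Varying n_per_group, d, or the data-generating model in such a sketch makes it tangible that power is a function of several design choices rather than a single fixed value.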
In contrast to planning for power, there exist only a few options for planning for accuracy. One possibility is the MBESS package (Kelley & Lai, 2016) in R, which includes accuracy routines for the most common statistical tests. Although the common conception that increasing the sample size increases statistical power and accuracy is valid, there are many other aspects of a study that can improve power and accuracy, such as the type of design (e.g., within-subject designs) or the reliability of a measure (Maxwell et al., 2008). Finally, researchers have warned against calculating post hoc or observed power (Cumming, 2012; Hoenig & Heisey, 2001), as it is highly dependent on the obtained estimate and p-value and might be misleading.
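As a sketch of planning for accuracy with the MBESS package mentioned above, the following call asks how many participants per group are needed for the 95% confidence interval around an assumed standardized mean difference of d = 0.5 to have a full width of 0.30. The chosen values are illustrative assumptions, and the exact arguments should be checked against the documentation of the installed MBESS version.

```r
# Planning for accuracy (AIPE) with MBESS: sample size per group needed so
# that the 95% CI for a standardized mean difference of d = 0.5 has a full
# width of 0.30.
# install.packages("MBESS")  # if not already installed
library(MBESS)

ss.aipe.smd(delta = 0.5, conf.level = 0.95, width = 0.30)
```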
Given that estimating the effect size before running a study is difficult in many cases, it is useful to think about statistical power in terms of sensitivity. While a power analysis indicates the likelihood of obtaining a significant effect for a given effect size and sample size, a sensitivity analysis outputs the smallest detectable effect size for a given likelihood and sample size. The Journal of Experimental Social Psychology (JESP) has made such sensitivity analyses mandatory. Sensitivity analyses could offer a way around the misleading nature of reporting post hoc power and around the problem that a priori power analyses are difficult when no information on the likely effect size exists.
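A sensitivity analysis of this kind can also be run in base R (and is available in G∗Power as well); the sketch below assumes a two-sample t-test with 50 participants per group and solves for the smallest effect detectable with 80% power at α = .05.

```r
# Sensitivity analysis in base R: given 50 participants per group and a target
# power of 80% at alpha = .05, what is the smallest effect the design can
# reliably detect? Leaving `delta` unspecified makes power.t.test solve for it.
power.t.test(n = 50, power = 0.80, sig.level = 0.05, sd = 1,
             type = "two.sample", alternative = "two.sided")
# With sd = 1, the returned delta is the minimal detectable Cohen's d (about 0.57).
```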