layout: true

<div class="my-header"><span>Replication Studies 101</span> </div>

<div class="my-footer"><span>Francis Huang / huangf@missouri.edu </span></div>

---
class: center, middle

<!-- background-color: white -->

### Thinking About Reproducible/Replication Research

#### Francis L. Huang, PhD

#### 2021-01-05 (updated: 2024-05-13)

[@flhuang](http://twitter.com/flhuang) <BR>
https://francish.net <BR>
[huangf@missouri.edu](mailto:huangf@missouri.edu)

.footnote[
.f90[.left[Based on: Huang, F. L., & Huang, A. B. (2024). Replication studies using secondary or nonexperimental datasets. *School Psychology Review*. Advance online publication. https://doi.org/10.1080/2372966X.2024.2346781]
]
]

---
## Much has been said about the replication crisis...

An issue in several fields such as chemistry, biology, and medicine -- however, much of the attention has been on psychology (Baker, 2015)

<img src="img/definition.png" width="90%" />

.footnote[
Baker, M. (2015). Over half of psychology studies fail reproducibility test. Nature, 1-3. https://doi.org/10.1038/nature.2015.18248
]

---
### Many studies in psychology (among other fields) fail to replicate<sup>1</sup>...

.pull-left[
.center[
<img src="img/powerpose.png" width="60%" />
]
]

.pull-right[
1. Power posing will make you feel bolder<sup>2</sup>
2. Smiling will make you feel happier
3. Exposure to words related to aging will make you walk slower
4. Feeling watched will make you behave more [honestly](https://royalsocietypublishing.org/doi/full/10.1098/rsbl.2006.0509)
5. Self control is a limited resource

...
]

.footnote[
<sup>1</sup> Not just for statistical reasons. From Jarrett (2016) / https://digest.bps.org.uk/2016/09/16/ten-famous-psychology-findings-that-its-been-difficult-to-replicate/.<BR>
<sup>2</sup>https://www.ted.com/talks/amy_cuddy_your_body_language_may_shape_who_you_are?language=en#t-68637. <BR>
Image source: Amy Cuddy (co-author) power posing at PopTech 2011, via Erik Hersman/Flickr.<BR>
Also see Carney [note](https://faculty.haas.berkeley.edu/dana_carney/pdf_My%20position%20on%20power%20poses.pdf) on PP. I don't pass any judgment on these studies-- I find them fascinating!]

---
.center[
<img src="img/repro1.jpeg" width="70%" />
]

.footnote[Baker (2016). Is there a reproducibility crisis? Nature.
https://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970]

---
.left-column[
Different reasons!
]

<img src="img/reproreason.jpg" width="55%" />

---
### UPDATES: There is a lot riding on this-- the current big scandal is the fraud/data manipulation case of Ariely (Duke) and Gino (HBS)

- Honesty experts accused of dishonesty
- Three studies have already been [retracted](https://fortune.com/2023/08/02/harvard-professor-gino-dishonesty-investigation-falsified-research/)...
- Superstars in their fields (social psychology, behavioral economics)

--

- "Fabricated data in research about honesty. You can't make this stuff up. Or, can you?" (NPR, 2023)
- Data Colada looked at the [Excel spreadsheets](https://datacolada.org/109), shapes of the [data distributions](https://datacolada.org/98), and the fonts that were used
- The Hartford, which provided the data (n = 6,033 vehicles), also chimed in that the published data are NOT the same (n = 20,741 vehicles)
- Blame was passed on to the postdocs and other researchers who were not cited as co-authors...

--

So much has come [out](https://www.theatlantic.com/science/archive/2023/08/gino-ariely-data-fraud-allegations/674891/):

- If you want to learn more, listen to the [Planet Money](https://www.npr.org/2023/07/27/1190568472/dan-ariely-francesca-gino-harvard-dishonesty-fabricated-data) podcast
- Gino has been put on admin leave without pay
- Gino has since [sued](https://www.vox.com/future-perfect/23841742/francesca-gino-data-colada-lawsuit-gofundme-science-culture-transparency-academic-fraud-dishonesty) Harvard and Data Colada for $25m

---
### Consider your own research for a moment...

You will have to reproduce your own results...

- After getting comments on your submitted manuscript, reviewers may suggest that you analyze your model a particular way, add/remove a variable, etc.
  + You will be asked to replicate your own results-- though maybe a few months later!

--

- You will need to revisit your models

--

- Others may request some additional data from you in the future (e.g., for meta-analyses)

--

- Have you ever preregistered a study? See https://www.cos.io/initiatives/prereg

---
### Although the issue of reproducibility has gotten much traction as of late, it was discussed some time ago (Tomek, 1993)

- The issue of reproducibility has focused on experiments
- **However, replication can be done using secondary and nonexperimental data as well!**
- Over 50 years ago, Lykken (1968) suggested that "most theories should be tested by multiple corroboration and most empirical generalizations by constructive replication" (p. 151)
- Mittelstaedt and Zorn (1984) suggested replication using secondary data as a way to improve scholarship
- In 1984, George and Landerman wrote: "replication is a cornerstone of the scientific method" (p. 134)

.footnote[
.f80[George, L. K., & Landerman, R. (1984). Health and subjective well-being: A replicated secondary data analysis. International Journal of Aging & Human Development, 19(2), 133-156. https://doi.org/10.2190/FHHT-25R8-F8KT-MAJD

Lykken, D. T. (1968). Statistical significance in psychological research. Psychological Bulletin, 70(3), 151-159. https://doi.org/10.1037/h0026141

Mittelstaedt, R. A., & Zorn, T. S. (1984). Econometric replication: Lessons from the experimental sciences. Quarterly Journal of Business and Economics, 23, 9-15.
]
]

---
### We can think of types of secondary and nonexperimental datasets...

.center[
<img src="img/datatypes.jpg" width="55%" />
]

We do not consider primary / experimental data (unless they are used to validate another experiment)

- Tennessee Project STAR data are publicly available experimental data (class size experiment)
- Other data can be collected by agencies such as the
  - National Center for Education Statistics (NCES)
  - IEA or OECD (e.g., TIMSS, PISA, PIRLS)
- Other nonexperimental, primary data may include school climate survey collections

---
### Huang and Huang (2024) present a typology of replication studies using secondary and nonexperimental data

.center[
<img src="img/2by2.jpg" width="80%" />
]

- Based on .red[**MEASURES/APPROACH**] and .red[**DATASET**] used

The four types are not mutually exclusive...often:

- Types I (Audit) and II (Robustness check) go together
- Types III (Generalizability) and IV (Extension) go together

.footnote[*Based on Mittelstaedt, R. A., & Zorn, T. S. (1984). Econometric replication: Lessons from the experimental sciences. Quarterly Journal of Business and Economics, 23, 9-15.]

---
**Example**: reproducing `\(\rightarrow\)` Wright, J. P., Morgan, M. A., Coyne, M. A., Beaver, K. M., & Barnes, J. C. (2014). Prior problem behavior accounts for the racial gap in school suspensions. Journal of Criminal Justice, 42(3), 257-266. https://doi.org/10.1016/j.jcrimjus.2014.01.001

- I: Same data, variables, methods `\(\rightarrow\)` A methods, 'econometric' audit (Kane, 1984) or a 'literal replication' (Lykken, 1968)
- II: Same data source but different variables and/or models `\(\rightarrow\)` Checking model sensitivity, specification
  - Huang, F. (2020). Prior problem behaviors do not account for the racial suspension gap. *Educational Researcher*, 49, 493-502. https://doi.org/10.3102/0013189X20932474.
  - https://francish.netlify.app/post/prior-problem-behavior-and-suspensions-a-replication/
- The goal is to check the *fragility* or *robustness* of findings
  - A fragile finding has results that change with the inclusion of certain variables (overly sensitive) or the use of a different model specification
  - For results to inspire confidence, they should be robust to different specifications

.footnote[Kane, E. J. (1984). Why journal editors should encourage the replication of applied econometric research. Quarterly Journal of Business and Economics, 23, 3-8.]

---
## Prior problem behaviors (PPB)...

... accounted for differences in the likelihood of suspension and that .red["the racial gap in suspensions was completely accounted for by a measure of the prior problem behavior of the student--a finding never before reported in the literature"] (Wright et al., 2014, p. 257)

- Paper used the Early Childhood Longitudinal Study (ECLS-K)
- Would not be much of an issue had the disciplinary guidance aimed at reducing the use of suspensions in schools not been revoked by the U.S. Department of Education (Camera, 2019)
- In the federal School Safety report, Wright et al.'s (2014) analysis was cited as a key piece of research, and the original authors suggested that .red["the use of suspensions may not be as racially biased as many have argued"] (p. 264)
- Note: several decades of research have suggested otherwise-- and I proposed reasons why Wright et al.'s finding could be incorrect (Huang, 2018)
- Given that the ECLS-K is publicly available and that Wright et al. (2014) indicated that .red["our results await replication"] (p. 263), I reanalyzed Wright et al.'s original findings and tested alternative model specifications

???

+ shifting samples: model 1 showed disparities, model 2 had no more disparities once ppb was entered
+ the ppb measure was not actually a ppb measure

---
#### Original findings: what's the story... just by looking at this -- can you think of a simple reason for what might drive the results?

.center[
<img src="img/ppb_wright.png" width="80%" />
]

.footnote[Source: Wright, J. P., Morgan, M. A., Coyne, M. A., Beaver, K. M., & Barnes, J. C. (2014). Prior problem behavior accounts for the racial gap in school suspensions. Journal of Criminal Justice, 42(3), 257-266. https://doi.org/10.1016/j.jcrimjus.2014.01.001
]

---
#### Replication (of original)... [discuss differences]

.center[
<img src="img/ppb_huang1.png" width="55%" />
]

.footnote[
Huang, F. (2020). Prior problem behaviors do not account for the racial suspension gap. *Educational Researcher*, 49, 493-502. https://doi.org/10.3102/0013189X20932474.
]

???

- Results are similar
- Took a while to figure out:
  + were weights used?
  + which version of the dataset?
  + etc.

---
#### Replication + [discuss other model specifications]

.center[
<img src="img/ppb_huang2.png" width="70%" />
]

.footnote[
Huang, F. (2020). Prior problem behaviors do not account for the racial suspension gap. *Educational Researcher*, 49, 493-502. https://doi.org/10.3102/0013189X20932474.
]

---
## Several approaches...

- Replicate original analysis as closely as possible (a lot of checking, filling in the blanks, etc.)
- Show how results could differ/be the same...
  + Using alternative hypotheses
  + Using different variables, different model specifications
  + Using multiple imputation, etc.

--

- Originally intended as a short brief but got longer and longer
- Findings showed that original results were driven by survivorship bias
- An example of how freely available, public data can be used

---
## Challenges of Type I replications

- Are the data available?
- Many journals promote the use of open data-- however...

.bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph4.mt5[
Only about a quarter of authors who provided a "data upon reasonable request" statement actually provided data when contacted with a data request (Hussey, 2023), though this is likely to be field dependent as well (Tedersoo et al., 2021)
]

.footnote[
Hussey, I. (2023).
Data is not available upon request. https://psyarxiv.com/jbu9r/download?format=pdf

Tedersoo, L. et al. (2021). Data sharing practices and data availability upon request differ across scientific disciplines. Scientific Data, 8(1), 192. https://doi.org/10.1038/s41597-021-00981-0
]

---
## Why might results not replicate (from a statistical point-of-view)?

1. Differences in models (e.g., different controls)
2. Differences in data (e.g., public vs. restricted)
3. Use of alternative estimators (and different software syntax)
4. Variation in the way results are analyzed

.footnote[Tomek, W. (1993). Confirmation and replication in empirical econometrics: A step toward improved scholarship. *American Journal of Agricultural Economics*, 75, 6-14. https://doi.org/10.1093/ajae/75_Special_Issue.6 ]

---
## Even when using the same data, results can vary!

- Teams of researchers (29 teams of 61 analysts) were given the same dataset to address the same RQ:
  + "...whether soccer referees are more likely to give red cards<sup>1</sup> to dark-skin-toned players than to light-skin-toned players"
  + "How likely do you think it is that soccer referees tend to give more red cards to dark-skinned players?"

--

- Analytic approaches varied widely and the effect sizes ranged from 0.89 to 2.93 (Mdn = 1.31) in odds-ratio (OR) units
- Overall, the 29 different analyses used 21 unique combinations of covariates

.footnote[<sup>1</sup> Results in the player's ejection from the game. Can be as a result of violent behavior (e.g., tackling violently, fouling with intent, hitting or spitting, using abusive language). Inherently a judgment call on the part of the referee.

Silberzahn, R. et al. (2018). Many analysts, one data set: Making transparent how variations in analytic choices affect results. *Advances in Methods and Practices in Psychological Science, 1*(3), 337-356.
https://doi.org/10.1177/2515245917747646 ]

---
#### Twenty teams (69%) found a statistically significant positive effect, and 9 teams (31%) did not observe a significant relationship

.center[
<img src="img/or.jpg" width="110%" />
]

.f90[.footnote[Silberzahn, R. et al. (2018). Many analysts, one data set: Making transparent how variations in analytic choices affect results. *Advances in Methods and Practices in Psychological Science, 1*(3), 337-356. https://doi.org/10.1177/2515245917747646 ]
]

---
### Do not think this 'many analysts, one dataset' result is field dependent

- For other examples in different disciplines, see
  - Breznau et al. (2022),
  - Huntington-Klein et al. (2021), and
  - Gould et al. (2023)
- "Researcher degrees of freedom" can alter results

.footnote[
Breznau, N. et al. (2022). Observing many researchers using the same data and hypothesis reveals a hidden universe of uncertainty. Proceedings of the National Academy of Sciences of the United States of America, 119(44), e2203150119. https://doi.org/10.1073/pnas.2203150119

Gould, E. et al. (2023). Same data, different analysts: Variation in effect sizes due to analytical decisions in ecology and evolutionary biology. https://ecoevorxiv.org/repository/view/6000

Huntington-Klein, N. et al. (2021). The influence of hidden researcher decisions in applied microeconomics. Economic Inquiry, 59(3), 944-960. https://doi.org/10.1111/ecin.12992
]

---
### A Type II example: Class size revisited

- One of the most influential and ambitious educational studies: Project STAR (Mosteller, 1995): Explain...
- Data are publicly available (e.g., in the `mlmRev` package)
- There were some issues with the original study (e.g., attrition, students transferring)
- Analysis was redone by Nye et al. (2000) using different model specifications

.bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph4.mt5[
Results show findings were robust-- even though different approaches were used
]

.footnote[
Mosteller, F. (1995).
The Tennessee study of class size in the early school grades. The Future of Children, 5(2), 113-127.

Nye, B., Hedges, L. V., & Konstantopoulos, S. (2000). The effects of small classes on academic achievement: The results of the Tennessee class size experiment. American Educational Research Journal, 37(1), 123-151.
]

---
### Although Type I and II studies can test the robustness of findings, they do not address generalizability: Type III studies do

.center[
<img src="img/2by2.jpg" width="80%" />
]

- III: Different data, same models/variables `\(\rightarrow\)` closest to the idea of experimental replication, "does the idea generalize?"
  - e.g., Huang, F., Eklund, K., & Cornell, D. (2017). Authoritative school climate, number of parents at home, and academic achievement. *School Psychology Quarterly, 32*, 480-496. doi: http://dx.doi.org/10.1037/spq0000182
- Type III and IV often go together

---
#### Focused on extending an earlier study of O'Malley et al. (2015) on school climate, family structure, and achievement

- Original study used a large statewide dataset from the California Healthy Kids Survey (CHKS; n = 490,000)
  + Authors indicated that limitations included: no validity screening and no measure of SES
  + *TIP*: Always read the limitations section of papers. May get ideas!

--

- We collected statewide data in VA from 2013 to 2020 (c/o several DOJ/NIJ grants)
  + our survey happened to have validity screening and measures of SES (for each year, 90k to 110k students and > 10k teachers/staff)

--

- Found very similar results [though weaker with the inclusion of SES as expected]. This is also a **Type IV** study.

.f90[.footnote[
O'Malley, M., Voight, A., Renshaw, T. L., & Eklund, K. (2015). School climate, family structure, and academic achievement: A study of moderation effects. School Psychology Quarterly, 30, 142-157. http://dx.doi.org/10.1037/spq0000076
]]

---
### Another Type III example...

.center[
<img src="img/2by2.jpg" width="80%" />
]

- Type III studies are actually more common in psychology than one might think

--

- Often, replications may focus on the structural components of a model BUT...

--

- What if we focus on the **.green[measurement component]**? Those have often not been included in counts of replication studies

---
#### Another Type III example: Revisiting the factor structure of instrument XYZ-- Very common approach (e.g., using CFA)

- Literally replicates a study with a different sample (note: this is different from a measurement invariance study)

--

- Instrument was tested with one sample (e.g., middle school students) `\(\rightarrow\)` Does the instrument "hold" when used with another sample (e.g., high school students)? (Reinke et al., 2022)

--

- Many studies take scales developed in one geographic area and test them in another

--

- Study done using an older dataset-- does the factor structure still hold when used with a more recent dataset? (Qian et al., 2022)

.footnote[
Qian, X., Shogren, K., Odejimi, O. A., & Little, T. (2022). Differences in Self-Determination Across Disability Categories: Findings From National Longitudinal Transition Study 2012. Journal of Disability Policy Studies, 32(4), 245-256.

Reinke, W. M., Herman, K. C., Huang, F., McCall, C., Holmes, S., Thompson, A., & Owens, S. (2022). Examining the validity of the Early Identification System - Student Version for screening in an elementary school sample. Journal of School Psychology, 90, 114-134.
]

---
## More Type III examples: "Many datasets, one analytic approach"

.footnote[
Duncan, G. J., Engel, M., Claessens, A., & Dowsett, C. J. (2014). Replication and robustness in developmental research. Developmental Psychology, 50(11), 2417-2425.

Li, W., & Konstantopoulos, S. (2016). Class size effects on fourth-grade mathematics achievement: Evidence from TIMSS 2011. Journal of Research on Educational Effectiveness, 9(4), 503-530.
]

- Li and Konstantopoulos (2016) evaluated class size effects in 14 European countries
  - Used different country datasets BUT the same analytic approach (using regression discontinuity and instrumental variables)
  - Multiple studies with pooled results

--

- Another example of this comes from Duncan et al. (2014), who used six longitudinal datasets to explore school readiness and future academic achievement

---
#### Type IV: Extending an original study...

.center[
<img src="img/2by2.jpg" width="80%" />
]

- Most Type III studies also extend into Type IV
- Type IV: Different data, models, variables `\(\rightarrow\)` If the data and variables are of similar nature ('proxy' variables), can be a replication
  - e.g., Huang, F., Olsen, A., Cohen, D., & Coombs, N. (2020). Authoritative school climate and out-of-school suspensions: Results from a nationally-representative survey of 10th grade students. *Preventing School Failure*. doi: 10.1080/1045988X.2020.1843129
- Great chance to use findings from older studies and see if they hold up in a more modern setting

---
#### Summary of examples (from Huang & Huang, 2024)

.center[
<img src="img/samples.png" width="100%" />
]

---
### Several factors have contributed to the replication crisis (Bishop, 2019)

.footnote[.f80[
Bishop, D. (2019). Rein in the four horsemen of irreproducibility. Nature, 568(7753), 435-435.

Maxwell, S. E. (2004). The persistence of underpowered studies in psychological research: Causes, consequences, and remedies. Psychological Methods, 9(2), 147-163.
]]

1. Publication bias
   - Bias for 'statistically significant' results

--

2. Low statistical power
   - Underpowered studies in psychology were pointed out long ago (Maxwell, 2004)

--

3. p-hacking
   - Again, related to obtaining statistical significance

--

4. Hypothesizing after results are known (HARKing)
   - Research in reverse

???

What can be done for each? Which is most problematic for secondary datasets? Solutions?

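---
### Sketch: why p-hacking inflates false positives

A minimal simulation sketch (not taken from Bishop, 2019 or Huang & Huang, 2024) of why trying many outcomes or covariate sets is a concern. Assumptions, all mine: there is no true effect, the k tests are independent (real covariates in secondary datasets are usually correlated), the SD is known to be 1 so a simple z-test applies, and the "researcher" reports only the smallest p value. Function names, sample sizes, and the seed are hypothetical.

```python
import random
from statistics import NormalDist, fmean


def p_value_null_comparison(rng, n_per_group):
    """Two-sided z-test p value for two groups drawn from the same N(0, 1).

    Assumes a known SD of 1, so the SE of the mean difference is sqrt(2 / n).
    """
    a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    z = (fmean(a) - fmean(b)) / (2.0 / n_per_group) ** 0.5
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def false_positive_rate(n_outcomes, n_sims=1000, n_per_group=30,
                        alpha=0.05, seed=1):
    """Share of simulated null 'studies' with at least one p < alpha.

    Each study runs n_outcomes independent tests and keeps the smallest p.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        p_values = [p_value_null_comparison(rng, n_per_group)
                    for _ in range(n_outcomes)]
        if min(p_values) < alpha:
            hits += 1
    return hits / n_sims


if __name__ == "__main__":
    for k in (1, 5, 20):
        print(k, "tests tried -> false positive rate:",
              round(false_positive_rate(k), 3))
```

Under these assumptions the chance of at least one "significant" result is about 1 - .95^k: near .05 for one test, about .23 for k = 5, and about .64 for k = 20. The simulated rates should land close to those values, but will vary with the seed.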
---
## Of these factors, the ones of greatest concern are:

- p-hacking
- HARKing

.bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph4.mt5[
- Highlights many choices that a researcher can make
- Easily done as large, secondary datasets usually have a rich set of covariates
- Some have suggested the pre-registration of secondary data analysis-- but some require some form of attestation that the researcher has not yet looked at the dataset!
]

---
## Benefits and challenges of replication (Tomek, 1993)

.pull-left[
.green[**Benefits:**]

- Alternative explanations ("rival models") may be tested
- Researchers can learn from the replication ("potential for confirmation research to contribute to scholarly innovations has been undervalued")
  + Taking apart and putting something back together is informative!
- Helps make sure important studies are error free and robust
]

--

.pull-right[
.red[**Difficulties:**]

- Data availability
  + Some data change, some data are restricted
- Model specification
  + e.g., Were squared terms included? Were weights used? etc.
- Computer code ('syntax')
  + Software may compute certain results slightly differently
- Effect on colleagues
  + Can be interpreted as a lack of trust
]

---
### When writing these studies, it is important to explain why your replication is BOTH important and different

- You might know the reason but the reviewers need to know too!
- Reviewers may easily ask, "This has already been done before so why do it again?"
- Do we really need to replicate this?
  + If important decisions are based on the research, yes!
- How different is your study? Need to justify!
  + Are you adding new variables? (e.g., adding controls)
  + Different setting? (e.g., rural vs urban)
  + Different sample? (e.g., high school vs middle school vs elementary)
  + Modified variables? (e.g., testing alternatives)
  + Testing different model specifications? (e.g., using FE vs MLM vs CRSE)
  + Is the time frame different?
    (e.g., original done in the 80s, technology has changed learning a lot!)

---
### When writing these studies, it is important to explain why your replication is BOTH important and different (cont.)

- Should the journal readers care about this? (this is related to the journal you submit to)
  + For example, the relationship between SES and achievement is very strong in the US. If we replicate this study using data from (for example) country xyz, why would this be important to the readers of the journal?
  + Very important to know the audience of the journal. I find targeting journals upfront is very helpful for several reasons (not just for replications)

---
## Summary: Thinking about reproducible/replication research

1. There are issues with replication

--

2. Can be done with not just experimental studies but with secondary/nonexperimental datasets as well
   - I present a typology of studies that can be done and provide examples

--

3. There are several factors that can contribute to replication problems-- researchers need to be aware of these in order to address them

--

4. Replications are a nice way to learn more about a particular study!
   - You can learn a lot by teaching / redoing / extending other studies

--

.bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph4.mt5[
Replication studies should be done more often: they result in the accumulation of scientific knowledge
]

???

#### Sidebar: A lot of data out there-- for several other (older) datasets...

.center[
<img src="img/nces.png" width="80%" />
]

.footnote[
Wang, X., Henning, A., Cui, W., Huang, F., Armstrong, S., Kang, K., Boyer, J., & Robers, S. (2011). NCES handbook of survey methods (NCES 2011-609). S. Burns, X. Wang, & A. Henning (Eds.). Washington, D.C.: U.S. Department of Education, National Center for Education Statistics. Retrieved from http://nces.ed.gov/pubs2011/2011609.pdf
]

???

#### Sidebar: NCES has conducted several longitudinal surveys

.center[
<img src="img/longnces.png" width="110%" />
]

.footnote[
Source: https://nces.ed.gov/training/datauser/COMO_07/assets/COMO_07_Slides.pdf
]