Measurement and analytic issues in longitudinal and cross-national data: Illustrations from the retracted work of Clark et al. (2020)

We (Manuel Eisner and Pasco Fearon) wrote a letter in June 2020 to Psychological Science responding to the publication of the above paper. We drew attention to the fact that the data the authors used for estimating homicide rates were quite profoundly problematic. We also noted a number of serious statistical problems. We pointed out the woeful state of the IQ data, but we did not examine this in detail, as others had already done that very comprehensively. The authors were responsive to the concerns raised by us and others, and retracted their paper, accompanied by a strongly worded statement about the huge problems in the measurements they had used, particularly with the measurement of IQ. In their words,

The persistence of these highly questionable data sources in the psychological literature has convinced us that research with certain kinds of flaws should be pulled from the record as its existence steers us further away from the truth rather than closer to it.

We thoroughly agree, and hope that the IQ databases that were used in that paper will not be used in scientific work any more. We admire the authors for making such a bold statement.

We are planning to publish our analysis of the issues associated with cross-national homicide estimates in a separate article. In this blog, we collated the various statistical points we made in the letter, and extended some of them after we had more time to do further analysis. Some of the key issues that came up are ones that are likely widespread, and we have probably made similar mistakes ourselves in the past. They raise interesting broader points about the importance of assumption-checking, data visualization, and the need to stress-test one’s hypotheses.

Analytical issues

So, what did we find when examining the analyses used in this paper? We should note that we did these analysis taking the data at face value, despite the fact that the most serious concerns about the paper relate to basic problems with measurement validity and sampling.  Even a perfect analysis could not correct those fundamental problems. Nevertheless, we thought it would be helpful to look at how robust the analyses were, in and of themselves. We did not seek to do a comprehensive re-analysis, but picked what we considered to be the most obvious issues one would think about to ensure the statistical analyses were robust.

One really nice thing about the paper was that the authors published their data and analytic scripts on the Open Science Framework website. This made all their steps crystal clear and open to checking. If these had been made available prior to publication (e.g., linked to a pre-print) the analytic issues we noted below could have been addressed or rebutted at an early stage which no doubt would have been beneficial. This is the way research is rapidly heading. All our scripts for the analyses below are also available to download here at the foot this blog. 

The main result

Before we get into the analyses, let’s remind ourselves of the main finding in this paper. The authors found a statistical interaction between religiosity (really, the proportion of the population who say they are religious – e.g., from census data) and a measure of average IQ for a country (please don’t take the latter seriously by the way – they are deeply problematic, we’re really only doing this as an exercise in data analysis). The graph below plots the basic finding.

As you can see, there is a negative association between test score (‘IQ’) and homicide rates, and the multiple lines show the predicted slope as religiosity varies – when religiosity is low the slope is steep, when it is high the slope is more shallow. The dots are the raw data, with yellow dots representing the high religious countries (above the median) and the blue ones the low religious countries (below median).

Distributional issues

The first issue we addressed was the distribution of the dependent variable. The 5-year homicide rate variable in the dataset was extremely skewed and non-normal, as the histogram and normal quantile plot in Figure 1 clearly show.  

Figure 1 Histogram and normal quantile plot of homicide rates

As far as we can tell, the authors never looked at this, and certainly did not take any remedial steps to address it. It is remarkable how little attention is paid to these distributional questions in the publication process. We drum this into our students in undergraduate statistics classes, but in our experience it is rarely raised during peer review, and rarely presented in any detail, if at all, in published papers. Getting these issues right is important not only because they help us make proper inferences, but also because this is a part of the process replete with risk for inflations in the type I error rate because of researcher degrees of freedom. Making the data available online is a step in the right direction, ideally prior to publication, but we think greater consistency in the analysis and reporting standards of journals is needed, so that these matters are addressed thoroughly at the front end. Checklists of basic analytic steps could be included at the point of submission and advertised in author guidelines.  In this particular case, it is well known in the criminology field that homicide rates have extreme skew, and it is routine to account for this (e.g., by log transformation), which again highlights the potential problems when you work with data outside your core area of expertise. It also illustrates the risks of having reviewers who do not have expertise in that substantive area. We don’t know if that was the case with this paper, but it would not be surprising if it was.

Did distributional issues make a difference to the basic result?

A fundamental point that we believe is important to emphasise is the need to conduct sensitivity analyses – stress-testing your hypotheses by seeing whether different analytic steps or assumptions, or tests motivated by different explanations, change your conclusions. This doesn’t mean running outlandish analysis to try to kill your effect – that is probably always possible. You can fish in your data to find an effect, and you can fish in your data to demolish one. So sensitivity analyses should address strong logical issues, or issues raised clearly by past research. In the case of the distribution issue it’s not clear that a sensitivity analyses is really required – there is arguably a clear need to deal with the extreme skew and there are not that many ways to do it. Nevertheless, in this context it is of interest to know whether the authors’ decision not to do that made a substantive difference. 

Running a linear mixed model with main effects of year, the main and interaction effects of country test score and religiosity (and no interactions of these latter terms with time), closely reproduced the result reported in the paper. The interaction parameter was estimated as B = 2.55 (95% CI [.76 4.35]).  The p-value associated with this term was p = .005. Re-running the model using a generalized multilevel model with gamma distribution and log link function, which fits the data markedly better, also estimates a statistically significant interaction (B = .21, 95% CI [.05, .36, p = .008; note that the parameter estimates of these two different models cannot be directly compared).  

So, even though the distribution is highly skewed, the analyses did not lead to markedly different substantive results. That’s reassuring. To illustrate how the assumptions of the model are violated by these data, we have plotted the predicted values against residuals and normal quantile plots of the level-2 residuals for the mixed model in Figure 2.

Figure 2 Plot of predicted values against residuals and normal quantile plot of residuals from mixed linear model of homicide data

Also, one other under-appreciated issue is that even if the standard model based on normal distributions works reasonably well, it can still make meaningless predictions – in this case the model makes predictions for homicide rates of less than zero, as you can see in the left hand side of Figure 3.  Although there are various models that could treat this form of extremely skewed data, multilevel generalised linear models using the gamma distribution generally work well in this case (see for example Figure 3) and we used the model for most of the rest of the analyses. 

Figure 3 Fitted values (including random terms) from mixed linear and gamma models


A crucial question that any reader would surely ask about the Clark et al results is whether they could be explained by other country-level or time-varying covariates closely correlated either with the test scores (which were not time-varying) or religiosity (which were). A fundamental problem with their Study 1 was a lack of good covariates. Given all the myriad reasons why some countries might have higher homicide rates than others, this is crucial. It was a puzzle that the reviewers did not see that as a fatal problem.

Having said that, the authors do make an important point that may not be obvious to all readers. The analysis they used in Study 1 focuses on the correlated variation within a country between religiosity and homicide rates. Focusing on the extent to which one tracks the other over time does give considerable protection, in theory, from static confounds. But, it doesn’t remove the effects of time-varying confounds (i.e., other factors that track with both homicide and religiosity over time). And, most importantly, because the interaction involves IQ which doesn’t vary within country, the analysis cannot remove the influence of static confounds linked to IQ test scores. So, controlling for covariates, and how you do it, is critical.

The authors only looked at two covariates in Study 1: GDP and the Gini (inequality) Index. They reported that the interaction term remained when GDP and Gini coefficients were controlled for. However, these terms are not equivalent to each other and at no point did they control for both of them simultaneously, which was unusual. It seems obvious that there may be more than one confound at work in a complex topic like this.

We examined the impact of controlling these two indicators on the estimation of the interaction term. Controlling for these terms’ effects purely on the intercept did not substantially change the statistical significance of the interaction term (though note the parameter estimate itself cannot be directly compared with the previous one as they involved different samples, due to missing data in the GDP and Gini data). The interaction term was estimated as B =.26, se = .09, 95% CI [.10, .43], p = .002). However, the primary concern is whether some confound might be behaving like the IQ test score, and so might also interact with religiosity. The authors themselves included such an interaction term in one of their analyses in response to reviewer requests, but again only included a single poverty-related indicator at a time. When we included these two interaction-terms (religiosity x GDP and religiosity x Gini) we found a much more marginal effect (B = .20, se = .11, 95% CI [.003, .40], p = .046). If this had been the leading analysis for Study 1 (arguably it should have been), we suspect it would not have been treated as convincing by most scientists.

As concerns might be raised about collinearity relating to these two additional interactions terms, we also conducted a simpler analysis in which z-standardized GDP (reversed) and Gini coefficients were averaged and a single additional interaction term included to reflect religiosity x poverty/inequality. Including this term led to a non-significant religiosity x test performance interaction (B =.16, se = .12, 95% CI [-.07,.39], p = .16.  

Note that the Gini data had extensive missingness, so all analyses involving this variable involved a smaller set of countries (N = 96), which could also partly explain why the effect seems weaker in this analysis. Given time constraints, we did not attempt multiple imputation of this variable, but ideally one would.  This simple set of analyses illustrates clearly how some basic stress-testing can help to get a sense of how robust a statistical finding is. It is also worth remembering too that the lack of many other very pertinent covariates is a major limitation in general here.

Time trends

As we noted already, the data on homicides, GDP and religiosity vary across time. The authors didn’t conduct any analyses to check whether the interaction they found was consistent across the different time spans of the study. The data span 70 years, so this an obvious question to ask. If the authors believe that the effect represents something intrinsic to IQ, it ought not to change much over time. Knowing whether that’s the case is crucial for weighing up whether the result holds water. Let’s try stress-testing that a bit too: a simple way is to look at each 5-year period on its own. When you do that, there is very little evidence of any reliable interactions between religiosity and test score performance.  Indeed, the 4 of the 5 interaction terms that were nominally significant at p<.05 were in the opposite direction to that reported in the paper.  See Table 1.

Year.B.Std CI
Table 1 Model coefficients for the interaction term by year

Further, as we noted already, none of the models presented in Study 1 took account of any temporal trends in the effect of religiosity or of changes in the interaction term over time. This was surprising given the study’s aims (aiming to focus on declines in religiosity, which were never tested).

What they did test was whether, within countries, homicide rates go up when religiosity goes down, and vice versa, and whether this varies according to the IQ test score variable. But that does not address the directionality of the time trends – i.e., increasing or decreasing trends over time. Again, this is pertinent, because we would want to know whether interaction is consistent across all the waves of data, or only emerges at certain time points.  That obviously makes an important difference to the interpretation.

So we re-ran the same model as before, this time including interactions with the linear slope of time (including all two-way interactions with time, and the three-way interaction between religiosity x test score performance x time).  The results revealed quite strong changes in the relationship between religiosity and homicide with time. And, most important to the interpretation of the paper, we found a three-way interaction between religiosity, mean test score performance and linear time (B = 6.36, se = 2.47, 95% CI [1.51, 11.22, p = .010). The original (lower-order) interaction between religiosity and mean test score performance was close to zero (B = .005, se = .058, 95% CI [-.11, .12], p = .93).  The estimated homicide rates are shown below in Figure 5 split by time period (early [1960], versus late [2005]) and test score (high versus low). 

Figure 5 Model estimated homicide rates by time period, religiosity and test score performance

As the plot shows, the interaction results from a markedly higher homicide rate in recent decades in relatively low religiosity and low test-score countries. The effect does not seem to be consistent over time.

Another way of stress-testing the temporal consistency of the interaction is to gradually remove one-time point at a time, and see how long the interaction remains intact. If we just remove 2010 (and include no covariates), the interaction effect is immediately marginal (B = .16, se = .08, 95% CI [.002, .31], p = .047. Removing 2005 as well renders the effect still marginal (B = .16, se = .08, 95% CI [.005, .31], p = .043. Removing 2000 too (so including data from 1945 to 1995) leads to a clearly non-significant result (B = .10, se = .10, 95% CI [-.10, .31], p = .33.

From these analyses it seems quite clear that there is no consistent interaction in these data, and that the interaction reported by the authors is critically dependent one or a small number of recent waves of data.

Another way to look at this is to see if other covariates behave similarly, as they may account for the pattern we are seeing. If an analysis looking at temporal trends is run using GDP instead of mean test score, a similar pattern emerges to that seen above, although it is statistically stronger (three-way interaction B = 1.36, se = .36, 95% CI [.66, 2.07], p <.001). The pattern of estimated homicide rates appears fairly similar to what was seen above for test score performance, as seen in Figure 6. 

Figure 6 Model estimated homicide rates by time period, religiosity and GDP

Including both pairs of three-way interactions in the model (for GDP and for test score performance, and all their lower-order two-way interactions and main effects) showed that the three-way interaction for test score performance was no longer significant (B = .11, se = .09, 95% CI [-.05, .28], p = .18), whereas the comparable interaction for GDP remained significant (B = 1.37 se = .39, 95% CI [.61, 2.12], p <.001). Similar results are obtained if GDP is replaced by Gini coefficient or if the standardized average of GDP (reversed) and Gini is used.

So, it is helpful to see how vulnerable these sorts of analyses are to some credibility checks.  We were forced to conclude that the interaction effect is not consistent across time, and may well result from confounders that show a similar pattern.

Other distributional issues

We all know that correlational studies are vulnerable to outliers. Another critical issue we’ve seen already too is that visualising your data is extremely important for checking that everything appears as you expect it to. As a basic check on the bivariate distributions of the variables, we plotted the country-level means of homicide rates against test scores, separately for countries scoring above and below the median of religiosity (which was 99.2%). These plots are shown below.

Figure 7a level-2 scatterplots of mean test score performance and 5 year homicide rates for below median of religiosity
Figure 7b level-2 scatterplots of mean test score performance and 5 year homicide rates for above median of religiosity

The obvious feature of these figures is the somewhat outlying status of 5 countries in the low religiosity group.  Four of these five are in the South American/Caribbean region. Removing these 5 countries does not entirely remove the interaction effect in the basic model, but it does reduce it quite a lot (B = .18, se = .08, 95% CI [.01, .34], p = .037). Controlling for GDP and Gini index as well (including their interactions with time) removes the interaction effect (= .24).  

Not only does this suggest that the findings may be vulnerable to outliers, it also illustrates another critical point – seeing the data, and knowing which countries we are talking about, is important – it helps us think about alternative explanations for what is happening. What else do we know about these countries with very high homicide rates…? All of them are marked by extremely turbulent political histories (including civil war) and current problems with drugs and organised crime for example – might this be important to think about? We think that goes without saying….

Repeating the analysis of Clark et al.

It would be reasonable to argue that Clark and colleagues planned their analyses differently and so it is not fair to stress-test their modelling with an entirely different model. This is reasonable, although the within-country fixed effects regression is largely equivalent to a mixed model (re-parameterised). Nevertheless, it’s fair to check whether similar issues arise when using the same model they used. When we ran their fixed effects regression we replicated the same finding as them. But, it’s important to check model assumptions first. In a fixed effects model like this, a key question is whether a random effects or fixed effects model is appropriate to adequately capture the data. Hausman’s test provides such a check. If it is significant, this suggests that a fixed effects model is appropriate. Hausman’s test for these data was not remotely significant (chi2(15) = 9.79, p = .83). Random effects are generally more approriate when differences between countries might be correlated with the outcome, so this is not very surprising.

Another key question for a within-country fixed effects model is whether fixed time effects need to be considered. In the analyses by Clark et al, these were omitted, as we noted above, but they never checked whether it was OK to omit them. We already expected that such fixed time effects would be needed because we saw time effects in the multilevel models presented earlier. When we tested for overall differences as a function of time, we saw very clear evidence of their presence (chi2 (13) = 34.97, p <.001). Running the authors’ model, but with random effects and fixed time effects that interacted with religiosity and IQ, we saw large differences in the association between IQ tests scores and homicide rates as a function of time (B = -14.2, p <.0001).  The interaction between religiosity and IQ was not significant (B = .43, p = .52).

This model included a three-way interaction between religiosity, IQ and time, and it was not significant. We were concerned that this non-significant three-way interaction might be obscuring the religiosity x IQ interaction. However, when we removed it, the religiosity x IQ interaction was still not significant (B = 1.10, se = .67, 95% CI[-.22, 2.42], p = .10). The lack of a three-way interaction with time was somewhat surprising given that we had seen one earlier. However, the analyses we just did were based on the untransformed homicide data (as per the Clark paper), which might be why the results were different to what we saw earlier. As these models are not able to accommodate gamma distributions, we re-ran the analysis with the log transformed homicide rate to see if this was why we didn’t see the effect we had seen earlier. The 3-way interaction was then significant, although not as strongly (B = 5.28, se = 2.42, 95% CI [.54, 10.02], p = .029). 

So, however you run these analyses, it is quite clear that temporal effects are really important to take account of and doing so seems to remove the interaction that was the basis of the Clark paper. There was evidence that the interaction varied by time, and as we saw earlier, it seemed to emerge primarily in the last waves of data. Covariates also appear critical, and again, we found evidence that they had an impact, even when we just considered two. Through this sort of stress-testing against reasonable counter-explanations and assumptions one is left with very considerable doubt that the results reported were robust (even if there hadn’t been very serious problems with the measures themselves).  The lesson seems to be: if you want robust findings, be your own critic and challenge yourself to ask the hard questions about your findings. 

Study 2

The data from study 2 was much simpler than study 1, as it was all cross-sectional.  The authors employed a so-called multiverse analysis – meaning they ran all permutations among several choices of variable to see how consistent the results were across these different combinations. It’s an interesting approach designed to directly address the impact researcher degrees of freedom on the outcome. However, it is rarely practical to consider all the potential researcher degrees of freedom and the key question is whether the permutations considered are the most important ones. They varied three types of IQ measure and three types of religiosity measure. The religiosity measures correlated with each other .86, .72 and .72. The IQ measures correlated .86, .87 and .98. These are very high correlations, especially for IQ, so you have to wonder whether multiverse analyses of these is likely to be all that informative. For example, with data like this, a much simpler analysis would have been to standardize and average these, or take the first principal component from them and use these as variables. In addition to being simpler, it also reduces error. We looked at how robust the result is when considered in this way. The other really important issue in study 2 is that the authors were able to include a wider range of covariates, including factors like education, population density, rule of law etc.  They also included controls for region, which again is very important and helpful (it’s not clear why this wasn’t done in Study 1). 

As before, although the authors did not correct the distribtution of the homicide data, we again used gamma models for these additional analyses. Our first question was whether, if you create one single composite of the three IQ measures and the 3 religion measures do we see evidence of the same effect the authors reported? This has the added advantage of allowing us to retain all the data on these measures by using full information maximum likelhood to estimate the factors underlying them (they have variable amounts of missing data).  Running gamma regression with latent IQ and latent religion shows the usual interaction (B = .38, p <.001). What happens if we then control for region?  The interaction is no longer significant (B = .13, se = .07, 95% CI [-.01, .26), p = .071).  

So, an elementary set of controls completely removes the interaction. How did this not come up in the authors’ paper? It’s not clear, but here is one part of the explanation: the authors only ever tested the effects of region while also controlling for several other covariates (Gini Index, GDP, education, population density). When we do that, we find, surprisingly, that the interaction re-appears (B = .26, se = .08, 95% CI [.09, .43], p = .002). That’s a little strange. There are two obvious explanations.

The first is that these other predictors capture a lot of error variance in homicide rates, and increase the precision of the estimate. It’s not a likely explanation because the point estimates themselves are very different (.13 versus .26). A more likely explanation is that missing data means that adding the covariates reduces the sample down to a subsample where, for some reason, the interaction comes out significant. To check this, we repeated the analysis we did before including just the interaction and region (and no other covariates), but restricted it to the subsample that had data on all the covariates. Sure enough, the interaction is significant (B = .28, se = .09, 95% CI [.10, .46], p = .002, based on N = 122, versus N = 208 in the above analysis).  

So, it does seem to be the case that it’s not the effects of the covariates that are causing an interaction to appear from nowhere, but the missing data. It would be very hard to argue that the smaller sample is the one we should trust.  So this illustrates another very important principle – that complex patterns of missing data cannot be ignored, because they can drastically change the results. Multiple imputation ought to have been considered, although given that the effect disappears when one simply controls for region, this may not have been worth the trouble. With this fairly sobering result, we decided it was not worth examining the results any further. 


We identified numerous problems in the Clark paper and brought those to the attention of the journal.  The most important concerned the provenance and quality of the measurements. Others have highlighted the exceptionally problematic measurement of IQ (see critique by Sear). We also drew attention to the fact that the data on homicide rates were seriously problematic, and these data sources are not recommended for predictor/causal analyses.  

The authors, on receiving this feedback and feedback from others, did the right thing and retracted the paper. This was a good outcome, and hopefully acts as a reminder to all of us to check our data carefully. It highlights the value of transparency, open research practices and the value of pre-prints and pre-publication peer review.

In this blog, we explored a number of analytic issues too, and described some of the problems we found. We found the exercise informative, and the results really underlined just how important it is to carefully check assumptions, visualise your data to spot problems, consider the impact of missing data and stress-test your analysis, checking out plausible alternative explanations for what is happening in your data.  Doing that, we found that the result seemed to ‘fall over’ with only a fairly gentle push.

Leave a Reply

%d bloggers like this: