[This is a guest post by Alex Kogan. Last week, Ed Yong at Not Exactly Rocket Science covered a paper positing an association between a genetic variant and an aspect of social behavior called prosociality. On Twitter, Daniel and Joe dismissed this study out of hand due to its small sample size (n = 23), leading Ed to update his post. Daniel and Joe were then contacted by Alex Kogan, the first author of the study in question. He kindly shared his data with us, and agreed to an exchange here on Genomes Unzipped. Our comments on the study are here; this is Alex’s reply.]
It’s a truism that resonates across science: size matters when interpreting the statistical (and practical) meaning of a study. But the size of what? Quite a few things, all of which are very important in understanding what a study is ultimately telling us. One of the first numbers researchers focus on is the p-value. The p-value relies on a bit of counterintuitive logic: it represents the probability of getting an effect as big as the one you found (or bigger) if there is really no effect in the general population. So we first assume that there is really no difference in some outcome between two groups across the general population (we call this the null hypothesis), and then ask what the chances are of finding the difference we found (or a bigger one) given this assumption. If this probability is low (many fields adopt a p = .05 standard, a 5% chance that we’d get the effect we got or bigger if there is really no effect in the general population), then we can reject the initial idea that there is no difference in the general population. So what have we learned if the p-value is .05 or lower? That there is likely a difference in the general population. How big that difference is remains a mystery, however; the p-value never answers that question.
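This logic is easy to see in a quick simulation. The sketch below (stdlib-only Python; the observed difference of 0.8 and the group size of 23 are illustrative numbers I chose, not figures from the study) estimates a p-value by building the null world directly: draw both groups from the same distribution many times, and count how often a difference at least as big as the observed one appears by chance alone.

```python
import random
import statistics

random.seed(42)

def simulated_p_value(observed_diff, n_per_group, n_sims=10_000):
    """Estimate a two-sided p-value by simulating the null hypothesis:
    both groups come from the SAME distribution, so any mean
    difference between them arises purely by chance."""
    hits = 0
    for _ in range(n_sims):
        a = [random.gauss(0, 1) for _ in range(n_per_group)]
        b = [random.gauss(0, 1) for _ in range(n_per_group)]
        if abs(statistics.mean(a) - statistics.mean(b)) >= observed_diff:
            hits += 1
    return hits / n_sims

# Suppose we observed a mean difference of 0.8 between two groups of 23.
p = simulated_p_value(observed_diff=0.8, n_per_group=23)
print(f"simulated p-value: {p:.3f}")
```

The returned fraction is exactly the quantity described above: the share of "no real effect" worlds that still produce a difference as large as the one observed.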
So what can affect a p-value? One big factor is the number of participants in your study: the more people you have, the smaller the p-value becomes, and the more confident you can be that your effect reflects a real difference in the population (though again, the p-value doesn’t tell us how big this difference is). Another big factor is the strength of the effect: the bigger the effect (i.e. the bigger the difference between the two groups), the smaller the p-value is going to be. Finally, we can think about the p-value of finding any difference in the study: what are the odds of finding a difference between two groups on any of the different outcomes we are looking at? Here, the more outcomes we look at, the larger the true p-value for the study becomes, because such an approach increases the odds of finding a false positive (i.e. getting an effect by chance).
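The first two factors can be made concrete with a standard two-sample z-test approximation (the particular effect sizes and sample sizes below are arbitrary, chosen only for illustration): holding effect size fixed, a bigger sample shrinks the p-value, and holding sample size fixed, a bigger effect shrinks it too.

```python
import math

def two_sample_p(effect_size, n_per_group):
    """Approximate two-sided p-value for a standardized difference of
    means between two equal-sized groups (z-test with unit variance)."""
    z = effect_size * math.sqrt(n_per_group / 2)
    # Normal CDF via the error function: Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

# Same effect size, growing sample: the p-value drops.
for n in (10, 50, 200):
    print(f"effect 0.5, n = {n:>3}: p = {two_sample_p(0.5, n):.4f}")

# Same sample, growing effect: the p-value drops too.
for d in (0.2, 0.5, 0.8):
    print(f"effect {d}, n =  50: p = {two_sample_p(d, 50):.4f}")
```

Neither lever tells you which one produced a small p-value: a tiny effect in a huge sample and a huge effect in a tiny sample can yield the same p.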
Let’s focus on this last point a bit more deeply. If we have one outcome we are interested in, and then do a study to see if two groups differ on this outcome, the p-value we get is not biased. But imagine that you have 100 different outcomes in a study. If you now check for group differences on any of these 100, odds are you will find some purely by chance; remember that a p-value of .05 still means that 5% of the time you will get a significant difference in the study even if the underlying population has no difference between the two groups. In genome-wide association studies this point is especially important, since we are looking at many, many potential outcomes. So we correct for the inflated chance of a false positive by adjusting the threshold for the number of comparisons, for example 50,000. This drops the accepted p-value to .000001, but to get anything significant at this level a very large sample is necessary (and this is one of the big reasons why genome-wide studies require thousands of participants). This larger sample is necessitated by the fact that when doing so many comparisons, finding significant differences is not surprising; some will certainly occur by chance. So for any effect to be trusted, it must cross this much higher threshold. But when looking at just one outcome (as occurs in candidate gene studies), this inflated chance of false positives isn’t an issue, since only one comparison is being done, and thus a much smaller number of participants is needed to make a reasonable claim.
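A few lines of arithmetic show how quickly false positives pile up with multiple comparisons, and how the Bonferroni-style division of the threshold by the number of tests (the correction described above) keeps the overall error rate in check:

```python
# Probability of at least one false positive when running many
# independent tests, each at the alpha = .05 level.
alpha = 0.05

for m in (1, 10, 100, 50_000):
    p_any = 1 - (1 - alpha) ** m
    print(f"{m:>6} tests -> P(at least one false positive) = {p_any:.4f}")

# Bonferroni correction: divide alpha by the number of comparisons.
m = 50_000
print(f"corrected threshold for {m} tests: {alpha / m:.0e}")  # 1e-06
```

With 100 outcomes the chance of at least one spurious "significant" result is already above 99%, and with 50,000 it is essentially certain, which is exactly why the per-test threshold has to drop to .000001.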
When we evaluate whether a study is reporting a real difference, we must consider all these different factors. For starters, what is the p-value? If the p-value is .001, that means there is only a 1 in 1,000 chance that the study could have gotten the effect it did (or a bigger one) if there was really no effect in the general population. So unless the authors were incredibly unlucky (or lucky!) or are biasing their results through other less than ethical practices, we can say that there is likely a real difference between the two groups in the general population. But here we must be extremely careful! What population are we talking about? It’s not quite everyone on the planet; the study is really only valid for the population from which the participants were drawn. So if the study was done amongst undergraduates at Harvard, then the p-value is really telling us something only about undergraduates from Harvard; we need to do studies with other populations to see if this effect generalizes. When it fails to do so, what is likely going on is not that there is no real effect amongst the Harvard undergrad population, but rather that other factors differentiate the Harvard undergrads from the other groups being studied, and these other factors are serving as moderators. All of this is extremely important when we look at replication studies. If a study is attempting a true replication, it must conduct the replication in the same population. When the population changes, the study introduces a plethora of new variables that can moderate the particular effect under investigation, and this is a huge issue in evaluating the medical genetics replication studies that were mentioned in the original post.
The lesson here is that dismissing studies that fail to replicate in different populations is inappropriate; a replication is only a true replication when the same population is being evaluated. When a different population is being evaluated, the study is introducing numerous confounds—and simply having a bigger number of participants in the replication than in the original study does not in any way make up for this problem.
Additionally, what is dismissed today can be revised tomorrow. For instance, the most recent meta-analysis of the serotonin transporter gene (the sadly mislabelled “depression gene”) concluded that there is indeed an effect of the gene on depression, an effect which prior meta-analyses (which drew on far fewer studies) had concluded did not exist. The world of research is dynamic and ever-changing, so it’s generally good practice to avoid making overly strong statements about the existence (or non-existence) of any given relationship. We should be very wary of dismissing any body of work, especially in a field as young and changing as the study of genetic contributions to human behavior.
All that being said, we should not take a low p-value to mean that the effect is actually a real one that would replicate even in the same population. Researchers can do many things to make their p-values appear better than they actually are. They can screen for apparent outliers; they can collect data in waves and check whether the effect has a low enough p-value at each wave, stopping once they get their effect; or they can go on a “fishing expedition”, looking at many different outcomes and reporting only the ones that were significant (i.e. low p-value) without correcting for the many outcomes examined. This last issue is an especially big one because it is sadly not an uncommon practice across academic fields, and there’s no way to know whether the authors did this unless they report it. So replication is a necessary component to feeling confident about the results.
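The fishing-expedition problem is easy to demonstrate directly: simulate a study with many outcomes and no real effects anywhere, and several outcomes will still clear p < .05. (A stdlib-only sketch; the 100 outcomes and group size of 30 are arbitrary choices for illustration.)

```python
import math
import random
import statistics

random.seed(0)

def p_from_z(z):
    """Two-sided p-value for a z statistic (normal approximation)."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# A "fishing expedition" under the null: 100 outcomes, no real effect
# on any of them, yet some still come out "significant" by chance.
n = 30
significant = 0
for _ in range(100):
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    z = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(2 / n)
    if p_from_z(z) < 0.05:
        significant += 1
print(f"{significant} of 100 purely-null outcomes look 'significant'")
```

Reporting only those hits, without mentioning the other outcomes tested, produces a paper that looks convincing and describes nothing real, which is why undisclosed outcome-shopping is so corrosive and why replication matters.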
In many ways, my colleagues and I strongly agree with the spirit of the criticism that Joe and Daniel made. We must be extremely careful not to put too much stock in any one study, because there are so many human factors that can inflate a p-value. So a result should be taken in the context of the broader literature. Our study benefits from over a dozen studies that have reported findings very consistent with ours using much larger samples from the same general population—Caucasians in the United States. But our study is also the first to attempt to evaluate whether other people’s perceptions are predicted by genotypes. It was in fact our hope to have a much bigger sample of targets, but we sadly only had the ability to conduct our study on the sample in hand. We are now attempting much larger replication studies. Our effect must be replicated before the study is anything more than a preliminary finding—it is a start, rather than an end. And we hope it motivates future researchers to also study this particular gene.
I have focused on discussing the broad statistical issues in this post rather than the specifics of our study because Joe and Daniel’s criticism applies all too well to the majority of genetic studies looking at complex behavioral traits—most of these studies have participant numbers in the hundreds, and most look at candidate genes rather than genome-wide associations. I certainly agree that genome-wide association studies have the potential to provide far more information than candidate gene studies, but genome-wide studies are extremely restrictive because of the inflated false-positive issue and the resulting correction to the p-value threshold, which necessitates a very large sample (in the thousands) to detect almost anything. Sadly, data collection from such a large number remains financially difficult for many labs, and pragmatically unrealistic for any truly complex designs. It is my great hope that as our fields develop, new solutions will emerge that allow truly genome-wide association studies to take place on a large enough scale to make them viable in the study of complex human traits. In the meantime, I believe there is utility to smaller-scale candidate gene studies, and I would advocate care in evaluating these studies and their replications because of the statistical issues I have discussed.