For my first several years as a statistician, I worked at a university where a large part of my job was helping medical students, interns, and residents complete their research projects. The people I worked with were hardworking and well-meaning, but they did not always have a good background in statistics. It was often seen as a distraction from what really mattered to them: medicine. So yes, I have had people claim that 'no statistically significant difference' means there was 'no difference'. I have also heard many people claim that a p-value less than 0.05 means there is a causal relationship between the variables tested.

That is wrong.

Valentin Amrhein, Sander Greenland, and Blake McShane write in Nature that because people often have a poor grasp of the concept of a p-value or the meaning of statistical significance, we should stop categorizing results as statistically significant or non-significant and replace statistical significance with confidence intervals. While I agree with their goal, I think we should be more careful about our solution to the problem.

## The Problem With Confidence Intervals

The solution posed in the article is to use confidence intervals. So, taking their example of the risk ratio of new-onset atrial fibrillation after exposure to anti-inflammatory drugs, instead of a p-value of 0.091, which we declare not statistically significant [1], we would report a (presumably 95%) confidence interval of 0.97-1.48. Great, but we still run into the problem that people will say the result is not significant because the interval includes 1. Yes, that is an error of statistical reasoning, but one I have seen repeated many times. Really, our judgment of the second study would be unchanged as well, because we have merely transformed its p-value of 0.0003 into a confidence interval of 1.09-1.33.
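That "merely transformed" is not a figure of speech: under the usual normal approximation on the log scale, a risk-ratio confidence interval and its p-value are two presentations of the same numbers. A minimal sketch (the function name `p_from_ci` and the log-normal approximation are my own illustration, not anything from the Nature piece) recovers p-values close to the two reported above from the two reported intervals:

```python
import math

def p_from_ci(lo, hi, null=1.0, level=1.96):
    """Recover the two-sided p-value implied by a 95% CI for a ratio.

    Assumes the estimate is approximately normal on the log scale,
    the standard approximation for risk ratios.
    """
    se = (math.log(hi) - math.log(lo)) / (2 * level)  # SE of log(RR)
    est = (math.log(hi) + math.log(lo)) / 2           # point estimate of log(RR)
    z = abs(est - math.log(null)) / se
    phi = 0.5 * (1 + math.erf(z / math.sqrt(2)))      # standard normal CDF
    return 2 * (1 - phi)

# The two studies from Amrhein et al.'s example:
print(p_from_ci(0.97, 1.48))  # ~0.09: the 'non-significant' study
print(p_from_ci(1.09, 1.33))  # ~0.0003: the 'significant' study
```

If a reader can misread the p-value, they can misread the interval in exactly the same way, because it carries the same information.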

No matter what we do, both the p-value for a parametric test and the confidence interval for the associated statistic suffer from the same issue: they tend toward 'good' values as the sample size increases.
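Here is a quick sketch of that point, using a one-sample z-test of a fixed, practically negligible effect (the effect size, the helper names, and the normal approximation are all my own illustration):

```python
import math

def z_stat(effect, sd, n):
    # z statistic for testing a sample mean of `effect` against zero
    return effect / (sd / math.sqrt(n))

def ci_95(effect, sd, n):
    # 95% confidence interval for the mean (normal approximation)
    half = 1.96 * sd / math.sqrt(n)
    return (effect - half, effect + half)

# A tiny effect that never changes; only the sample size grows.
effect, sd = 0.02, 1.0
for n in (100, 10_000, 1_000_000):
    lo, hi = ci_95(effect, sd, n)
    print(f"n={n:>9,}  z={z_stat(effect, sd, n):6.2f}  "
          f"95% CI=({lo:+.4f}, {hi:+.4f})")
```

At n = 100 the interval straddles zero and the z statistic is tiny; at n = 1,000,000 the very same effect yields z = 20 and an interval comfortably clear of zero. Both metrics look 'good' purely because we collected more data, not because the effect became any more meaningful.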

Now, I do agree that confidence intervals give a better idea of the data than a p-value does, because they are one step closer to the distribution of whatever we are measuring. The problem is that if we say our metric is confidence intervals, then people will game (knowingly or not) the confidence intervals. Gaming the endpoint of the analysis is exactly the problem that Amrhein *et al.* are describing.

## The Solution

My disagreement with Amrhein *et al.* is not with their argument that statistical significance is easily abused. My disagreement is that confidence intervals are open to exactly the same abuse, so swapping one reporting convention for the other does not fix the underlying problem.

My solution would be for people who work in our field to make educating their collaborators their primary focus. Do not let an investigator pressure you into doing something unethical; do a detailed review of every publication you are associated with. Teach them the right way to perform a test, to design a study, and to interpret a result. Teach them as they teach us about their fields.

What would be best is to back away from the mindset that we must publish good results all the time or we will lose our jobs. The only real solution is for all of us to do science and statistics right.

[1] We should state an alpha level at which we declare results significant, but I often see that skipped. That is another issue with statistical significance: the belief that 0.05 is a magic number below which all things are great and significant.