There's been a lot of writing recently about p-values and the reproducibility of scientific results. P-values are hard to explain, even more difficult to understand, and easy to exploit in the effort to get a publishable result. There have been calls to abandon reliance on p-values, along with spirited defenses.

I'm not sympathetic to p-values, and I'm on board with anyone who wants to put the analysis of experimental results on a firmer basis. However, it's hard for me to think that p-values are a Really Big Problem. At best, they're a small part of a larger problem.

Although it's hard to describe what a p-value is in concrete terms, it's essentially the probability of observing your experimental result, or an even stronger result, if the "null hypothesis" is correct. The null hypothesis is something like "the result you're trying to prove is false." So p is something like the probability of seeing your result given random data. (Yes, statisticians are probably screaming now.) There's a lot of subtlety wrapped into this. A p-value doesn't tell you the probability that your result is invalid. It doesn't tell you the probability that your result is due to random chance. It doesn't tell you whether you're looking at a strong effect or a weak one. Statistics aside, p certainly doesn't tell you anything about whether you are interpreting your results correctly. And p doesn't tell you whether your results can be attributed to experimental errors that you didn't take into account.
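That informal definition ("the probability of seeing your result, or a stronger one, given random data") can be made concrete with a quick simulation. This is a hypothetical coin-flip example of my own, not anything from the statistics literature: we observe 60 heads in 100 flips and ask how often a fair coin (the null hypothesis) would do at least that well.

```python
import random

random.seed(42)

def simulated_p_value(observed_heads, n_flips, n_sims=50_000):
    """Estimate a one-sided p-value by simulation: the fraction of
    fair-coin experiments that produce `observed_heads` or more heads."""
    extreme = 0
    for _ in range(n_sims):
        heads = sum(random.random() < 0.5 for _ in range(n_flips))
        if heads >= observed_heads:
            extreme += 1
    return extreme / n_sims

# 60 heads out of 100 flips: how surprising is that if the coin is fair?
p = simulated_p_value(60, 100)  # roughly 0.03
```

Note what the simulation does and doesn't say: p ≈ 0.03 means a fair coin beats this result about 3% of the time. It is not the probability that the coin is fair, and it says nothing about whether the flipping procedure itself was sound.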

Using p as the primary indicator that a result is "valid" is problematic, to say the least. That isn't to say that p-values are useless, but as with any metric, you can't start out by pretending the metric says things that it doesn't. However, there are bigger problems that need addressing, problems that have little to do with p-values:

• First, we don't really understand how to describe experiments precisely. In any experiment, there's a lot of local knowledge, lab folklore, stuff that "everyone knows" (but a lot of people don't really know).
• Second, there's a lot of possibly relevant data that we don't collect: calibration data, environmental data, and so on. For example, does it matter whether test tubes are made of plastic or glass? (Yes, it can make a big difference.) Could the weather make a difference in an experimental result? Probably not, but possibly: I can imagine ambient temperature and humidity affecting the behavior of some piece of equipment.
• Third, a lot of biology is about learning good lab technique. But even among people with good technique, one person's technique isn't the same as someone else's. Although it's possible to analyze person-to-person variation statistically, it is very difficult, if not impossible, to describe in a protocol. You can't say "stir it just the way I stir it."
• Fourth, we're not very good at publishing data. We're particularly poor at publishing the data from the intermediate steps of an experiment. Such data often fails the "who would possibly want that" test, though it might be precisely what's necessary to figure out why an experimental result can't be duplicated: at what step is something going wrong? Even when we publish the data, we don't have very good tools for discovering other data, comparing data sets from different experiments, or even discovering other relevant experiments.
• Fifth, we're not good at designing experiments that are statistically effective, or that have outcomes that can easily be re-used by others.

None of these problems means that experimental results are invalid, even when they can't be reproduced. But they do mean that experimental results are harder to reproduce and interpret than we'd like to think. What's the hidden knowledge, the quirk of a researcher's technique, the poorly described intermediate result that would help us to understand? We don't know. We have trouble talking about those problems; we don't have much language for talking about the ways in which attempts to perform the same experiment differ. But I'm willing to bet that those differences are a lot more important than whether p is 0.10, 0.05, or 0.01. P has become a scapegoat for bigger problems in experimental science. Arguing about p won't help if the real issue is that we don't know how to describe experiments, how to collect data, or how to publish results in ways that can be used effectively by others.
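As a footnote to the claim at the top that p-values are easy to exploit: the mechanics of that exploitation are worth seeing once. The sketch below (again a hypothetical coin-flip setup, not drawn from any particular study) tests 20 independent null hypotheses and reports how often pure noise hands the experimenter at least one "significant" p below 0.05.

```python
import math
import random

random.seed(0)

N = 100  # flips per experiment

# Precompute exact one-sided tail probabilities for a fair coin:
# tail[k] = P(at least k heads in N flips)
tail = [sum(math.comb(N, j) for j in range(k, N + 1)) / 2**N
        for k in range(N + 1)]

def null_experiment_p():
    """Run one experiment where nothing real is going on and return its p-value."""
    heads = sum(random.random() < 0.5 for _ in range(N))
    return tail[heads]

# "Test 20 hypotheses, publish the best one": how often does noise alone
# produce at least one p < 0.05?
trials = 2_000
lucky = sum(any(null_experiment_p() < 0.05 for _ in range(20))
            for _ in range(trials))
rate = lucky / trials  # roughly 0.6
```

With 20 tries, a spurious "significant" result shows up more often than not. Fiddling with the threshold changes that rate at the margins, but it doesn't touch the underlying problem: nobody recorded which 19 experiments were quietly discarded.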