What is a p-value? The problem of hypothesis testing in observational studies

At a time when cost-effective, evidence-based care is increasingly demanded by patients, payers and our own professional organizations, research on outcomes of treatment has become more common. Over the past 10 years there has been a steady increase in the levels of evidence reported in most of the journals read by hand surgeons. While this trend appears encouraging, numerous studies have shown that “the medical evidence” (as defined by the diagnostic, prognostic or therapeutic value of clinical study findings) has not improved in step with the levels of evidence ascribed to the respective studies.

How can this be true?

The answer is complex, but one important and often overlooked factor is the misinterpretation of the inferential statistics commonly used in clinical research. Inferential statistics inform the process by which observations made in a sample are extrapolated to the general population from which that sample was derived. That process involves four steps:

  1. Postulation of a null hypothesis;
  2. Postulation of an alternative hypothesis;
  3. Hypothesis testing (Do the data in the sample provide evidence that refutes the null hypothesis and leads to acceptance of the alternative hypothesis?); and
  4. Calculation of a p-value, that is, the probability of obtaining results at least as extreme as those observed if the null hypothesis were actually true.

Given this view of inferential statistics, it stands to reason that an approach which seems to provide evidence for or against the veracity of a null hypothesis would gain popularity at a time when objective demonstrations of evidence are demanded.

Unfortunately, the “p-value” may not be as valuable as commonly assumed.

Describing the meaning of a p-value is more difficult than one might assume. A p-value is the probability, calculated under the assumption that the null hypothesis is true, of obtaining results at least as extreme as those observed in an individual study. It is commonly read as a measure of the strength of evidence against the null hypothesis. Since the null hypothesis assumes the lack of effect (no difference between two groups, no association between risk factor and outcome, etc.), the lower the p-value, the greater the statistical incompatibility between the reported data and the null hypothesis. We typically accept a 5% risk (a significance threshold of p=0.05) of rejecting the null hypothesis when the null hypothesis is in fact true.
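
To make the "5% risk" concrete, here is a minimal simulation sketch, not part of the original analysis. It assumes two groups drawn from the same normal distribution (so the null hypothesis is true by construction) and uses a large-sample z-test approximation rather than any particular study's method. The function names, group sizes and seed are illustrative assumptions.

```python
import math
import random


def mean(xs):
    return sum(xs) / len(xs)


def sample_variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def two_sided_p_value(a, b):
    """Approximate two-sided p-value for a difference in means (z-test)."""
    se = math.sqrt(sample_variance(a) / len(a) + sample_variance(b) / len(b))
    z = (mean(a) - mean(b)) / se
    # Probability that a standard normal value is at least as extreme as z
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rate(n_studies=2000, n_per_group=100, alpha=0.05, seed=1):
    """Fraction of simulated studies with p < alpha when no true effect exists."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_studies):
        # Both groups come from the same population: the null is true.
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if two_sided_p_value(a, b) < alpha:
            rejections += 1
    return rejections / n_studies


rate = false_positive_rate()
print(f"Share of null studies reaching p < 0.05: {rate:.3f}")
```

Under these assumptions, roughly one simulated study in twenty should reach "significance" even though no difference exists, which is what the 5% threshold implies.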

How does this relate to falsely held beliefs about p-values?

Below are two particularly notable examples of how the p-value can be misconstrued:

  • A p-value above 0.05 does not necessarily mean that two populations (or samples) are the same or are “statistically similar.” It may just mean that the null hypothesis can’t be rejected. Conversely, a p-value below 0.05 does not necessarily mean that there is a difference between two populations (or samples). It may just mean that the results allow us to reject the null hypothesis based on the assumptions implicit in the study design. There are several reasons why this may happen:
    • If one does not adequately account for sources of bias (which are the rule rather than the exception in observational studies), the true association (or lack of association) between exposure to a possible risk factor and an outcome of interest will be obscured. If you conduct a study to determine differences in SAT scores after reading Barron’s or Kaplan’s review book but don’t account for the fact that you conducted your test in a town where every child has an “SAT prep consultant,” the effect of the SAT prep consultant will obscure the effect of the SAT prep books and confound the results of your study.
    • All too often, observational studies in which outcomes are evaluated using “samples of convenience” are based on a flawed study design in which case and control subjects are not truly from the same population. If you conduct a study to determine differences in omega-3 levels between wild and farmed Atlantic bluefin tuna but your “wild tuna” group includes some wild albacore, misclassification bias will distort your study results (and the resulting p-values).
  • A p-value is more a measure of sample size (power) than of exposure effect. As has been noted by many prominent Bayesian statisticians and epidemiologists, with sufficient sample size, almost every p-value will reach the threshold of “significance.” This observation is the reason behind another important tenet of p-values: because they are influenced by the size of the study sample, they cannot on their own tell us anything about the strength of association between exposure and outcome. While we often fault studies with small sample sizes as being “underpowered,” having too much power can be just as dangerous, if not more so. Although a p-value of 0.002 sounds “very significant,” to link the size of the p-value with the strength of association is to conflate two distinct concepts. A great example of how sample size can drive expected p-values is provided by Kevin Chung’s recent JHS article on utilization of preoperative electrodiagnostic studies for carpal tunnel syndrome. In this study of over 64,000 patients, both age between 35 and 44 and preoperative electrodiagnostic studies were associated with p<0.001. Because the study provides an assessment of relative risk, we are able to distinguish between a factor that is statistically significant but clinically irrelevant (age 35-44 is associated with a 1% longer wait for surgery) and a factor that is statistically significant and clinically relevant (preoperative electrodiagnostic studies lead to a 36% longer wait). If p-values were reported without this assessment of relative risk, the p-values might distract us from recognizing the true association between each predictor variable and the outcome.
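
The dependence of the p-value on sample size can be sketched with a short calculation. This is an illustration under stated assumptions, not a reanalysis of any study: it supposes two equal-sized groups with the same standard deviation, a fixed observed difference of 0.02 standard deviations (a hypothetical, clinically trivial effect), and a large-sample z-test. The function name and the chosen sample sizes are made up for the example.

```python
import math


def p_value_for_difference(d, n_per_group):
    """Two-sided z-test p-value when the observed mean difference is
    d standard deviations and each group has n_per_group subjects."""
    # Standard error of the difference (in SD units) is sqrt(2 / n).
    z = abs(d) * math.sqrt(n_per_group / 2)
    return math.erfc(z / math.sqrt(2))


observed_difference = 0.02  # the same tiny effect in every row

for n in (100, 10_000, 100_000, 1_000_000):
    p = p_value_for_difference(observed_difference, n)
    print(f"n per group = {n:>9,}   p = {p:.3g}")
```

The effect is identical in every row; only the sample size changes. Under these assumptions the p-value moves from far above 0.05 to far below 0.001 as the groups grow, which is why an effect estimate such as relative risk is needed alongside it.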

So what can be done?

Acknowledging that I am not a statistician, here is my general approach:

  • Mentally separate the terms “statistically significant” and “p-value.” This subtle change of approach allows one to resist the subconscious tendency to overvalue or misinterpret the p-values reported in a manuscript.
  • Consider ignoring the p-values during your initial reading of a manuscript. Instead, focus on the hypothesis (Is it plausible?), the study design (Is the approach by which the authors attempt to answer their study question appropriate?), the reported results (Are they logical/Do they make sense?), and the conclusions (Are they a valid reflection of the results?).
  • Answer the question, “Do the study comparisons and the study results pass the ‘sniff test’?” If not, there may be sources of bias that are influencing the outcome.
  • If the study results go against the preponderance of evidence on this topic, consider whether or not the authors have made a compelling case for their findings.

As Dr. Graham wrote in the inaugural JHS Focus post, “The focus on evidence – understanding, interpreting and refining it – can only benefit our patients and lead to a greater understanding of their problems, and that is a good thing for everyone.” Similarly, being (or becoming) an informed consumer of the literature on which our evidence is based is a critical aspect of understanding the problems that our patients face. It is also a critical aspect of understanding the value of the treatments that we offer as solutions to those problems. A better understanding of the utility and the limitations of hypothesis testing and p-values is a small but important aspect of this focus on evidence.

Article written by:

Dr. Osei is an Assistant Professor of Orthopedic Surgery at Washington University in St. Louis. He likes to spend his weekdays doing microvascular reconstruction and thinking about clinical epidemiology. He likes to spend his time away from work hanging out with his family and cultivating his lifelong obsession with Champions League soccer.

Join the discussion

  1. Richard L. Hutchison, MD

    Encouraging or requiring researchers to report the effect size of a finding is a way to help interpret study results that avoids some of the pitfalls of p-values. Statistical significance, as reported by P-values, unfortunately does not give a complete picture of the usefulness of the research findings. Most readers want to know how the manuscript results will impact their practice or advance their understanding of the research or clinical problem. Reporting the effect size along with P-values can provide a fuller understanding of findings. Standardized effect size indices are measures that quantitate the clinical or practical significance of study results. The effect size indicates to both the reader and the researcher the importance and magnitude of the findings. Understanding the characteristics of the effect size can promote the goal of increased understanding and application of statistical methods in a surgeon’s practice.

  2. Brent Graham

    I think this is a really pertinent comment because it is clear that statistical significance doesn’t always equate with clinical significance. However, there are many instances where an effect size can’t be meaningfully established, for example where the statistical test is not being used in the context of treatment. For those instances statistical significance and P values will still matter. Our job as authors and editors is to ensure that where this matters, the clinical significance matches the statistical significance.
