The American Statistician Presents the Most Important Improvement to the Scientific Method in Over 80 Years

Science, we have a problem.  Several problems, actually.  Now we have solutions.  A central tenet of modern science is that experiments must be reproducible.  It turns out that a surprisingly large proportion of published results in certain fields of research – especially in the social and health sciences – do not satisfy that requirement.  This is known as the reproducibility crisis.

I was honored to have the opportunity to be involved in addressing this, both as an associate editor of a ground-breaking special issue of The American Statistician and as a co-author of one of its 43 articles.  The special issue is called “Moving to a World Beyond p<.05.”  Some of my readers will recognize that “p<.05” refers to the “significance test,” ubiquitous in scientific research.  This issue is the first serious attempt to fundamentally rethink statistical inference in science since the significance tests currently used were developed almost a century ago.

The article I authored with Alicia Carriquiry (Distinguished Professor of Statistics at Iowa State University) is titled Quality Control for Scientific Research: Addressing Reproducibility, Responsiveness, and Relevance.  We argue that addressing responsiveness and relevance will help address reproducibility.  Responsiveness refers to the fact that it takes a long time before problems like this are detected and announced.  The current reproducibility problems were discovered only because, periodically, some diligent researchers decided to investigate them.  Years of unreproducible studies get published before these issues are even detected, and still more years pass before anything is done about them.

Relevance refers to how published research actually supports decisions.  If the research is meant to inform corporate or public decisions (and it is certainly often used that way), then it should be able to tell us the probability that the findings are true.  Assigning probabilities to potential outcomes of decisions is a vital step in decision theory.  Many who have been using scientific research to make major decisions would be surprised to learn that the “Null Hypothesis Significance Test” (NHST) does not actually tell us that.

However, Alicia and I show how this probability can be computed.  We were able to show that the relevant probability (i.e., that the claim is true) can be estimated, and that even after a “statistically significant” result it is, in some fields of research, still less than 50%.  In other words, a “significant” result doesn’t mean “proven” or even necessarily “likely.”  A better interpretation would be “plausible” or perhaps “worth further research.”  Only after the results are reproduced does the probability the hypothesis is true grow to about 90%.  This would be disappointing for policy makers or news media that tend to get excited about the first report of statistical significance.  In fact, measuring such probabilities can be the basis of a sort of quality control for science that is much more responsive as well as relevant.  Any statistically significant result should be treated as a tentative finding awaiting further confirmation.

Now, a little background to explain what is behind such an apparently surprising mathematical proof.  Technically, a “statistically significant” result only means that if there were no real phenomenon being observed (the “null” hypothesis), then the statistical test result – or something more extreme – would be unlikely.  Unlikely in this context means less than the stated significance level, which in many of the research fields in question is 0.05.  Suppose you are testing a new drug that, in reality, is no better than a placebo.  If you ran that experiment 100 times, you would, by chance alone, get a statistically significant result in about 5 of them at a significance level of .05.  This might sound like a tested hypothesis with a statistically significant result has a 95% chance of being true.  It doesn’t work that way.  First off, if you only publish 1 out of 10 of your tests, and you only publish significant results, then half of your published results are explainable as chance.  This is called “publication bias.”  And if a researcher has a tendency to form outrageous hypotheses that are almost certainly false, then virtually all of their significant results would be the few random flukes we would expect by chance.  If we could actually compute the probability that the claim is true, then we would have a more meaningful and relevant test of a hypothesis.
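To make that base-rate problem concrete, here is a minimal simulation sketch (my own illustration, not from the paper; the sample sizes and the zero effect size are assumptions) showing that when every tested hypothesis is actually false, roughly 5% of experiments still cross the p < .05 threshold by chance:

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05
n_experiments = 10_000   # trials of a "drug" that is truly no better than placebo
n_per_group = 50         # subjects per arm (an arbitrary choice for illustration)

false_positives = 0
for _ in range(n_experiments):
    drug = rng.normal(0.0, 1.0, n_per_group)     # no real effect: both groups are
    placebo = rng.normal(0.0, 1.0, n_per_group)  # drawn from the same distribution
    _, p = stats.ttest_ind(drug, placebo)
    false_positives += p < alpha

print(f"Significant by chance alone: {false_positives / n_experiments:.1%}")  # about 5%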

One obstacle to relevance is that it requires a prior probability.  In other words, we have to be able to determine a probability that the hypothesis is true before the research (experiment, survey, etc.) and then update it based on the data.  This reintroduces the age-old debate in statistics about where such priors come from.  It is sometimes assumed that such priors can only be subjective statements of an individual, which would undermine the alleged objectivity of science (I say alleged because there are already several arbitrary and subjective components of statistical significance tests).  We were able to show that an objective prior can be estimated from the rate at which studies in a given field can be successfully reproduced.

In 2015, a group called The Open Science Collaboration tried to reproduce 100 published results in psychology.  The group was able to reproduce only 36 of the 100 published results.  Let’s show how this information is used to compute a prior probability, as well as the probability the claim is true after the first significant result and after it is reproduced.

The hypothesis the researcher actually wants to test is called the “alternative” hypothesis, to distinguish it from the null hypothesis, under which the results are just a random fluke.  The probability that the alternative hypothesis is true, written Pr(Ha), can be estimated from nothing more than the reproducibility history of a field.
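The formula below is a reconstruction, assuming (consistent with the numbers worked out later in this post) that R is the chance that a published significant result comes out significant again when the experiment is rerun:

Pr(Ha) = α(R – α) / [(π – α)(π + α – R)]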

Here R is the reproducibility rate, and α and π are what is known as the “significance level” and the statistical “power” of the test.  I won’t explain those in much detail, but they would be known quantities for any experiment published in the field.

Using the findings of The Open Science Collaboration, R would be .36 (strictly, we would use those results as a sample to estimate the reproduction rate, which itself has some error, but we will gloss over that here).  The typical value for α is .05, and π is, on average, perhaps as high as .8.  This means that a hypothesis being considered for publication in that field has only about a 4.1% chance of being true.  Of course, researchers proposing a hypothesis presumably have some reason to believe it might be true, but before the actual measurement the probability isn’t very high.  After the test is performed, the following formula gives the probability that a claim is true given that it passed the significance test (written as the condition P < α).
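This is the standard Bayes update on a single significant result, again reconstructed from the definitions above:

Pr(Ha | P < α) = Pr(Ha)·π / [Pr(Ha)·π + (1 – Pr(Ha))·α]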

What might surprise many decision makers who want to act on such findings is that, in this field of research, a “significant” result from an experiment with the same α and π means we can only say the hypothesis has a 41.3% chance of being true.  When (and if) the result is successfully reproduced, the probability the hypothesis is true is updated again, to 91.9%, with the following formula.
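Reconstructed in the same way, now updating on two independent significant results:

Pr(Ha | P1 < α, P2 < α) = Pr(Ha)·π² / [Pr(Ha)·π² + (1 – Pr(Ha))·α²]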

If we plot all three of these probabilities as a function of reproduction rate for a given α and π, we get a chart like the following.

[Chart: probability the hypothesis is true upon initial proposal, after the first significant result, and after reproduction]
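For readers who want to recreate those curves, here is a rough sketch using the reconstructed formulas above; the function names are my own, and α = .05 and π = .8 are the values assumed in the worked example.

import numpy as np

alpha, power = 0.05, 0.8   # the α and π assumed in the text

def prior(R, a=alpha, p=power):
    # Pr(Ha): prior probability implied by a field's reproduction rate R.
    # Under this model R can only range from a (no true hypotheses) up to p (all true).
    return a * (R - a) / ((p - a) * (p + a - R))

def after_significant(R, a=alpha, p=power):
    # Pr(Ha | P < α): update the prior on one significant result.
    pr = prior(R, a, p)
    return pr * p / (pr * p + (1 - pr) * a)

def after_reproduction(R, a=alpha, p=power):
    # Pr(Ha | two independent significant results).
    pr = prior(R, a, p)
    return pr * p**2 / (pr * p**2 + (1 - pr) * a**2)

# Worked example from the text: R = .36 from The Open Science Collaboration
print(prior(0.36), after_significant(0.36), after_reproduction(0.36))
# roughly 0.04, 0.41, and 0.92

# Sweep R to generate the three curves in the chart
for R in np.linspace(0.10, 0.75, 14):
    print(f"R={R:.2f}  prior={prior(R):.2f}  "
          f"first significant={after_significant(R):.2f}  "
          f"reproduced={after_reproduction(R):.2f}")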

Other reproduction attempts similar to The Open Science Collaboration’s show replication rates well short of 60%, and more often below 50%.  As the chart shows, if we wanted at least a 90% probability that a hypothesis is true before making a policy decision, reproduction rates would have to be on the order of 75% or higher, holding other conditions constant.  Beyond psychology, these findings bear on education policy, public health, product safety, economics, sociology, and many other fields where significance tests are standard practice.

Alicia and I proposed that Pr(Ha) itself become an important quality control for fields of research like psychology.  Instead of discovering problems perhaps once every decade or two, Pr(Ha) can be updated with each new reproduction attempt, which in turn updates Pr(Ha|P<α) across the field at something closer to real time.  Of course, our paper is only one of 43 articles in this massive special issue.  Many of the others will be just as important in rethinking how statistical inference has been applied for decades.

The message across all the articles is the same: rethinking quality control in science is long overdue, and we know how to fix it.

– Doug Hubbard

Tim Peter for Marketing

I talked to someone who has both a vision and an impressive track record in marketing.  Tim Peter (www.timpeter.com) describes his services on his website as “proven solutions to improve your e-commerce results and your Internet marketing”.

I came across his site because I got a Google Alert on a flattering review he had written for How to Measure Anything and another review he wrote for Pulse on his blog (so, yes, I might be a bit biased in my review of Tim).  But my first impression actually came when I checked his site’s rank on Alexa.com.  Checking the Alexa.com rank is now a habit of mine whenever anyone asks me for a radio interview or to contribute content in some way, or when I see a review of one of my books on a blog.  I found that for someone who has only been in business for 6 months, his traffic outranks that of many small SEO and marketing firms.  His traffic is also higher than mine, and I’ve been promoting my site and the services described on it for many years.

After confirming his credentials on Alexa, I gave him a call and we talked for a while about his area of expertise.  He seemed to have a lot of insight into web marketing, and he tries to measure it – which, as you know, makes me a fan of what he does.  Check out his site and talk to him if you are looking for that kind of service.

Doug Hubbard

Amazon’s email blast for Pulse

Amazon just sent out its email blast for my next book, Pulse: The New Science of Harnessing Internet Buzz to Track Threats and Opportunities.  If you get those, you know what I mean (e.g., “If you liked book X, you will like this new book”).  With my previous books, this is usually when sales really start to pick up after a new release.  We’ll see.

Pulse is hitting the shelves

My new book, Pulse: The New Science of Harnessing Internet Buzz to Track Threats and Opportunities, just came in the mail.  Amazon says it is available now and it should be on bookstore shelves in a day or two.

This book seems to be generating the most excitement of all of my books so far.  When I explain the idea of using Internet data mining to forecast major economic and social trends, people at the seminars where I speak are universally intrigued.  Sam Savage, author of The Flaw of Averages, told me, “This may be your biggest book yet.”  That seemed to be bolstered by the fact that Publishers Weekly put Pulse in its “PW’s Top 10: Business & Economics” list for spring 2011.  So I’m cautiously optimistic.

The book is coming out earlier than originally planned, so we are all rushing to make sure the accompanying site is up soon.  It will have a lot of interesting functionality, so be sure to check it out sometime after April 21st.

Doug Hubbard

The Zurich Risk Management Paper

Back in September 2010, I had the honor of speaking at Zurich’s annual Supply Chain Risk Management Conference in Lucerne, Switzerland.  As much fun as the event was, I was equally honored when they asked me to co-author their white paper on risk management.

This year’s risk management white paper is titled “Pushing the Boundaries”.  I wrote the section titled “Risk and Uncertainty”.  If you are familiar with my work, you will recognize the concepts I talk about.  But this has to be one of the more aesthetically attractive documents I’ve ever participated in creating.

You can find the pdf on the Zurich Insurance website here: http://zdownload.zurich.com/main/Insight/Pushing_the_Boundary.pdf

Enjoy,

Doug Hubbard