Pass/Fail Questions

Originally posted on Wednesday, July 08, 2009 2:46:05 PM by

“Hi Doug,

I want to share an observation a V.P. made after doing the 10 pass/fail questions. If one were to input 50% confidence for all the questions and randomly select T/F, they would be correct 1/2 the time, and the difference would be 2.5.

The scoring would indicate that that person was probably overconfident. Can you help here?

I am considering making the difference between the overall series of answers (as a decimal) and the Correct answers(as a decimal) as needing to be greater than 2.5 for someone to be probably overconfident.

please advise

Thanks in advance – Hugh”

Yes, that is a way to “game the system” and the simple scoring method I show would indicate the person was well calibrated (but not very informed about the topic of the questions). It is also possible to game the 90% CI questions by simply creating absurdly large ranges for 90% of the questions and ranges we know to be wrong for 10% of them. That way, they would always get 90% of the answers within their ranges.

If the test-takers were, say, students, who simply wanted to appear calibrated for the purpose of a grade, then I would not be surprised if they tried to game the system this way. But we assume that most people who want to get calibrated realize they are developing a skill they will need to apply in the real world. In such cases they know they really aren’t helping themselves by doing anything other than putting their best calibrated estimates on each individual question.

However, there are also ways to counter system-gaming even in situations where the test taker has no motivation whatsoever to actually learn how to apply probabilities realistically. In the next edition of How to Measure Anything I will discuss methods like the “Brier Score” which would penalize anyone who simply flipped a coin on each true/false question and answered them all as 50% confident. In a Brier Score, the test taker would have gotten a higher score if they put higher probabilities on questions they thought they had a good chance of getting right. Simply flipping a coin to answer all the questions on a T/F test and calling them each 50% confident produces a Brier score of zero.
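As a rough illustration of why coin-flipping stops paying off under this kind of scoring, here is a minimal Python sketch. It computes the standard Brier score (the mean squared difference between stated confidence and outcome, where lower is better) and then a rescaled version, 1 - 4 × Brier, in which higher is better and a test answered entirely at 50% scores zero. The rescaling, the function names, and the sample answers are assumptions made for illustration; the exact form used in the book may differ.

```python
def brier_score(confidences, outcomes):
    """Standard Brier score: mean squared gap between the stated
    confidence in an answer and the outcome (1 = right, 0 = wrong).
    Lower is better; 0 is a perfect score."""
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(confidences)


def rescaled_score(confidences, outcomes):
    """Assumed rescaling so that higher is better and an all-50% test
    scores exactly zero. This is an illustrative choice, not a quoted formula."""
    return 1 - 4 * brier_score(confidences, outcomes)


# Hypothetical coin flipper: 50% confident on every question, 5 of 10 right.
coin_conf = [0.5] * 10
coin_outcomes = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Hypothetical informed test taker: 90% confident on every question, 9 of 10 right.
informed_conf = [0.9] * 10
informed_outcomes = [1] * 9 + [0]

print("coin flipper:", rescaled_score(coin_conf, coin_outcomes))
print("informed:    ", rescaled_score(informed_conf, informed_outcomes))
```

Under these assumptions a 50% answer contributes nothing whether it turns out right or wrong, so the coin flipper lands on zero, while the informed and calibrated test taker scores above zero.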

Thanks for your interest,

Doug Hubbard

The Statistics Behind the Calibration Scores

Originally posted on Thursday, April 30, 2009 6:20:57 AM.

“Hi Douglas,

I want to thank you for your work in this area. Using the information in your book I used Minitab 15 and created an attribute agreement analysis plot. The master has 10 correct and I then plotted 9,8,7,6,5,4,3,2,1,0. From that I can see the overconfidence limits you refer to in the book. Based on the graph there does not appear to be an ability to state if someone is under-confident. Do you agree?

Can you assist me in the origin of the second portion of the test where you use the figure of -2.5 as part of the calculation in under-confidence?
I want to use the questionnaire as part of Black Belt training for development. I anticipate that someone will ask how the limits are generated and would like to be prepared.

Thanks in advance – Hugh”

The figure of 2.5 is based on an average of how confidently people answer the questions. We use a binomial distribution to work out the probability of just being unlucky when you answer. For example, if you are well-calibrated and you answer with an average of 85% confidence (expecting to get 8.5 out of 10 correct), then there is about a 5% chance of getting 6 or fewer correct (cumulative). In other words, at that level it is more likely that you were not just unlucky, but actually overconfident.
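The 5% figure in this example can be checked directly from the binomial distribution. The following is a small Python sketch using only the standard library; the function name binom_cdf is just an illustrative choice, and it treats all 10 answers as if each had the same 85% chance of being right.

```python
from math import comb


def binom_cdf(k, n, p):
    """Probability of k or fewer successes in n independent trials,
    each with success probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# A calibrated person averaging 85% confidence on 10 questions:
# chance of getting 6 or fewer correct purely through bad luck.
p_unlucky = binom_cdf(6, 10, 0.85)
print(f"P(6 or fewer correct) = {p_unlucky:.4f}")
```

The result comes out at very nearly 0.05, which matches the "about a 5% chance" quoted above.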

I took a full distribution of how people answer these questions. Some say they are an average of 70% confident, some say 90%, and so on. Each one has a different level for which there is a 5% chance that the person was just unlucky as opposed to overconfident. But given the average of how most people answer these questions, having a difference larger than 2.5 out of 10 between the expected and actual number correct means that there is generally less than a 5% chance a calibrated person would just be unlucky.
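To see how the 2.5 rule of thumb behaves at different confidence levels, here is a hedged Python sketch. For a few average confidence levels it computes the chance that a perfectly calibrated person would fall short of their expected score by more than 2.5 on a 10-question test. It assumes every question is answered at the same confidence, which is a simplification, and the function name and the chosen confidence levels are illustrative only. Exact fractions are used for the comparison so that rounding does not shift the cutoff.

```python
from fractions import Fraction
from math import comb


def prob_gap_exceeds(avg_conf_pct, n=10, gap=Fraction(5, 2)):
    """Chance that a calibrated person scores more than `gap` below
    their expected number correct, assuming every one of the n answers
    is given at the same confidence (avg_conf_pct, in whole percent)."""
    p_exact = Fraction(avg_conf_pct, 100)
    expected = p_exact * n  # expected number correct, kept exact
    p = float(p_exact)
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n + 1)
        if expected - k > gap  # shortfall larger than the rule-of-thumb gap
    )


for pct in (70, 80, 90):
    print(f"{pct}% average confidence: {prob_gap_exceeds(pct):.3f}")
```

For these three levels the probability stays below 5%, which is consistent with treating a gap larger than 2.5 as a sign of probable overconfidence rather than bad luck.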

It’s a rule of thumb. A larger number of questions and a specific set of answered probabilities would allow us to compute this more accurately for an individual.
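When each question has its own stated confidence, the number of correct answers no longer follows a simple binomial distribution. One way to handle that case (a sketch, not necessarily the calculation behind the rule of thumb) is to build the exact distribution one question at a time, treating each stated confidence as the true chance of being right. The function names and the list of confidences below are hypothetical.

```python
def correct_count_distribution(confidences):
    """Return dist where dist[k] is the probability of exactly k correct
    answers, if each stated confidence were the true chance of being right."""
    dist = [1.0]  # before any question: certainly zero correct
    for p in confidences:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1 - p)  # this answer turns out wrong
            nxt[k + 1] += mass * p    # this answer turns out right
        dist = nxt
    return dist


def prob_at_most(confidences, actual_correct):
    """Chance that a calibrated person gets actual_correct or fewer right."""
    dist = correct_count_distribution(confidences)
    return sum(dist[: actual_correct + 1])


# Hypothetical set of stated confidences for one person on a 10-question test.
confidences = [0.5, 0.6, 0.7, 0.8, 0.9, 0.9, 1.0, 1.0, 0.8, 0.7]
print("expected correct:", sum(confidences))
print("P(5 or fewer correct):", prob_at_most(confidences, 5))
```

If the resulting probability is small (below 5%, say), bad luck becomes a less plausible explanation than overconfidence for that individual.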



Length of Calibration

Originally posted on Monday, March 09, 2009 9:14:11 AM.

“I just read your book and found it fascinating. Thanks.

On calibrated estimates, once experts are calibrated, do they stay calibrated?
Or do you repeat it every time you begin a project or make an estimate?

I’m just thinking in a corporate setting – do you just do it once for a group of people that you may want estimates for or would you do it before each project. Do it annually?

What has been your experience on how long people stay calibrated?”



Books Related to Calibration

Originally posted on Monday, February 16, 2009 11:32:17 AM.

“I am looking for some material (articles or books) on the subject of Calibration. I want to be an expert in Calibration.”


I certainly support your goal of becoming an expert in this topic. It is a well-studied topic but is still far too obscure in practical applications. Beyond my book, the most important sources are in the purely academic literature, which I would definitely recommend for anyone who wants to be an expert. My next book, The Failure of Risk Management, will cover this topic with a slightly different emphasis and, in some cases, in more detail. In both books, I draw on several academic studies, including the following.

A key source is Judgment Under Uncertainty: Heuristics and Biases, Cambridge University Press, 1982. It is a compilation of several research papers on the topic. You can buy this book on Amazon.

Here are several more articles:

A.H. Murphy and R. L. Winkler, “Can Weather Forecasters Formulate Reliable Probability Forecasts of Precipitation and Temperature?,” National Weather Digest 2, 1977, 2–9.

D. Kahneman and A. Tversky, “Subjective Probability: A Judgment of Representativeness,” Cognitive Psychology 3, 1972, 430–454.

G.S. Tune, “Response Preferences: A Review of Some Relevant Literature,” Psychological Bulletin 61, 1964, 286–302.

E. Johnson, “Framing, Probability Distortions and Insurance Decisions,” Journal of Risk and Uncertainty 7, 1993, 35.

D. Kahneman and A. Tversky, “On the Psychology of Prediction,” Psychological Review 80, 1973, 237–251.

A. Tversky and D. Kahneman, “The Belief in the ‘Law of Small Numbers,’” Psychological Bulletin, 1971.

A. Koriat, S. Lichtenstein, and B. Fischhoff, “Reasons for Confidence,” Journal of Experimental Psychology: Human Learning and Memory 6, 1980, 107–118.

Facilitating Calibrated Estimates

The book shows that calibrated probability assessments really do work and it gives the reader some idea about how to employ them. But facilitating a workshop – with calibrated estimates or any other formal method – has its own challenges. Participants ask questions or make challenges about calibration or probabilities in general that sometimes confound reason. It’s amazing the sorts of ideas adults have learned about these topics.

Still, I’ve found that these kinds of challenges and my responses to them have become almost scripted over the years. The conceptions and misconceptions people have about these concepts fall into certain general categories and, therefore, so have my responses.

I thought about starting out with an attempt at an exhaustive list but, instead, I’ll wait for readers to come to me. Do you have any challenges employing this method in a workshop or, for that matter, do you have questions of your own about how such a method can work? Let us know and we’ll discuss it.