Originally posted on http://www.howtomeasureanything.com/forums/ on Wednesday, July 08, 2009 2:46:05 PM by

“Hi Doug,

I want to share an observation a V.P. made after doing the 10 pass/fail questions. If one were to enter 50% confidence for all the questions and randomly select T/F, they would be correct half the time, and the difference would be 2.5.

The scoring would indicate that that person was probably overconfident. Can you help here?

I am considering requiring the difference between the overall series of answers (as a decimal) and the correct answers (as a decimal) to be greater than 2.5 before someone is judged probably overconfident.

Please advise.

Thanks in advance – Hugh”

Yes, that is a way to “game the system” and the simple scoring method I show would indicate the person was well calibrated (but not very informed about the topic of the questions). It is also possible to game the 90% CI questions by simply creating absurdly large ranges for 90% of the questions and ranges we know to be wrong for 10% of them. That way, they would always get 90% of the answers within their ranges.

If the test-takers were, say, students who simply wanted to appear calibrated for the purpose of a grade, then I would not be surprised if they tried to game the system this way. But we assume that most people who want to get calibrated realize they are developing a skill they will need to apply in the real world. In such cases they know they really aren’t helping themselves by doing anything other than putting their best calibrated estimates on each individual question.

However, there are also ways to counter system-gaming even in situations where the test taker has no motivation whatsoever to actually learn how to apply probabilities realistically. In the next edition of How to Measure Anything I will discuss methods like the “Brier Score” which would penalize anyone who simply flipped a coin on each true/false question and answered them all as 50% confident. In a Brier Score, the test taker would have gotten a higher score if they put higher probabilities on questions they thought they had a good chance of getting right. Simply flipping a coin to answer all the questions on a T/F test and calling them each 50% confident produces a Brier score of zero.
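As a rough illustration of the Brier score idea, here is a minimal Python sketch. It assumes each answer is recorded as the confidence placed on the chosen answer plus whether that answer turned out right. Under the conventional definition (mean squared error, lower is better), answering everything at 50% gives 0.25 no matter how the coin lands. The `rescaled_score` function is a hypothetical rescaling, not taken from the book, that maps the all-50% case to zero and makes higher scores better, which is one way to read the "score of zero" mentioned above.

```python
def brier_score(confidences, outcomes):
    """Conventional Brier score: mean of (p - o)^2, where p is the stated
    probability that the chosen answer is right and o is 1 if it was
    right, 0 if it was wrong. Lower is better."""
    pairs = list(zip(confidences, outcomes))
    return sum((p - o) ** 2 for p, o in pairs) / len(pairs)


def rescaled_score(confidences, outcomes):
    """Hypothetical rescaling (an assumption for illustration):
    1 - 4 * Brier. All-50% answers score 0, perfect answers score 1,
    and confidently wrong answers go negative. Higher is better."""
    return 1 - 4 * brier_score(confidences, outcomes)


# A coin flipper who states 50% on all ten questions.
coin_conf = [0.5] * 10
coin_right = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]

# Someone who states 90% on all ten questions and gets nine right.
informed_conf = [0.9] * 10
informed_right = [1] * 9 + [0]

print(brier_score(coin_conf, coin_right), rescaled_score(coin_conf, coin_right))
print(brier_score(informed_conf, informed_right),
      rescaled_score(informed_conf, informed_right))
```

The coin flipper looks calibrated under the simple scoring method but earns nothing here, while the informed and calibrated test taker does better on both versions of the score.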

Thanks for your interest,

Doug Hubbard

Mr. Hubbard,

Is the issue here one of accuracy versus precision? The focus on becoming calibrated seems to me essentially aimed at making sure you are accurate in estimating your level of uncertainty. If you say you are 50% or 90% certain, then when we test you with some questions, you are “calibrated” if you correctly answer 50% or 90% of the questions. If your results match up with your uncertainty estimates your calibration is the same as saying you are “accurate”. However, the person estimating at 50% is clearly much less precise than the person who is estimating at 90%. I understand the focus on improving calibration or our accuracy in estimating for uncertainty, but do we also need to have some measures that focus on the differences in our levels of precision? To take this one step further, should our initial focus be on improving our precision or our accuracy? Your book seems to place the emphasis first on the accuracy side with the exercises on calibration. This feels right to me and certainly matches the extensive literature showing how common the problem of overconfidence is. However, I think of how many quality improvement processes such as Six Sigma emphasize minimizing variation (being more precise) first and then working on accuracy (calibration). Would we be better off first trying to get a person more precise first and worry about calibration second? Probably not truly an either/or question but a matter of order or emphasis. I wanted to put this question out there to see what the reaction might be.

Dale

These are really very different concepts when applied to the measurement of uncertainty. A “precise probability” or “accurate probability” is like “exactly uncertain” or “extraordinarily average”. The probability is itself a measure of uncertainty about some *other* state or quantity. Think of uncertainty as an error. When we state the error of a measurement the error is, itself, the description of the uncertainty about the measurement. We don’t then apply the same concepts of precision and accuracy to the error. It is redundant to ask “What’s the error of the error?”

The purpose of calibration is so that we are realistic in assessing our *initial* state of uncertainty (i.e. error). After we have accomplished that, if you want to reduce uncertainty, you then have to make further measurements. This is the point of the book. The reason we have to realistically assess our current uncertainty as it is now is because we need that to determine the value of additional information about the quantity by measuring it further.

To answer your question, work on calibration first so that they have realistic estimates for their uncertainty. Then use that to determine what should be measured further to reduce that uncertainty. I’m not sure what a precise but uncalibrated person would be doing other than being precisely overconfident. That is like preferring to be precisely wrong over being approximately right.

Hi Doug,

From what I understand, Brier Scores only work for T/F questions; for estimating continuous variables, it seems that the continuous ranked probability score (CRPS) and the information-based ignorance score (IGN) could be relevant alternatives. See for example:

https://journals.ametsoc.org/doi/pdf/10.1175/MWR-D-11-00266.1

https://ams.confex.com/ams/90annual/techprogram/paper_163351.htm

https://www.lokad.com/continuous-ranked-probability-score

Have you looked into this yourself and could give some guidance? I would love to “gamify” our calibration effort, but currently lack a metric that cannot be tricked in a trivial way.

Thanks, Jonas
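
For the continuous case raised here, one way to see how CRPS treats absurdly wide or overly narrow ranges is the closed-form CRPS for a normal forecast. The sketch below is only illustrative: it assumes the estimator's 90% CI can be treated as a normal distribution centered on the midpoint of the range, with the range spanning 3.29 standard deviations. The function names `crps_gaussian` and `crps_from_90ci` are made up for this example. Lower CRPS is better.

```python
import math


def crps_gaussian(mu, sigma, actual):
    """Closed-form CRPS of a normal forecast N(mu, sigma^2) against the
    observed value. Lower is better."""
    z = (actual - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return sigma * (z * (2 * cdf - 1) + 2 * pdf - 1 / math.sqrt(math.pi))


def crps_from_90ci(lower, upper, actual):
    """Score a 90% CI by treating it as a normal distribution.
    Assumption: the interval is symmetric and 3.29 standard deviations
    wide (the 5th to 95th percentile of a normal distribution)."""
    mu = (lower + upper) / 2
    sigma = (upper - lower) / 3.29
    return crps_gaussian(mu, sigma, actual)


actual = 50
# A reasonable range, an absurdly wide range, and a narrow range that misses.
print(crps_from_90ci(30, 70, actual))
print(crps_from_90ci(-10000, 10000, actual))
print(crps_from_90ci(9, 11, actual))
```

Under these assumptions, both the absurdly wide range and the narrow range that misses score worse than the reasonable one, so the trick of stretching 90% of the ranges does not pay off on a per-question basis.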