Originally posted on http://www.howtomeasureanything.com/forums/ on Thursday, April 30, 2009 10:42:04 AM, by djr.

“Let me start off by saying, I really appreciate this book and have found it very useful. I enjoyed the calibration exercises and decided to include them in a semester class on decision analysis I’ve just finished with graduate students. Unfortunately it didn’t work out as I hoped.

While I saw progress for the group as a whole initially, only a few of the 27 students even neared the 90% target, and when we took a break for 5 weeks, the skills slipped. Furthermore, the students who got close primarily did so by using such extreme ranges that they (and I) felt the conclusion was they didn’t really know much about the questions. I sought to see if students who felt they were comfortable working with numbers did better, but they did not. Students who described themselves as analytical did somewhat better, but it was not a strong relationship. Nevertheless, the students indicated for the most part they liked the exercises. It helped them realize they were overconfident and it made them think about estimation and uncertainty. However, making progress on getting more calibrated for the most part eluded them. I recognize that unlike the scenarios you described in the book, these students are not in the business of estimation and indeed many of them are quite possibly averse to such estimating. But I argued they all nevertheless would estimate in their professional careers (public and nonprofit management).

I’m planning on doing this again but I wanted to pose two questions.

1. One strategy for “getting calibrated” at the 90% level is to choose increasingly wider ranges, even to the point where they seem ridiculous. For example on a question about the height of Mt. Everest in miles above sea level, one student put 0.1 miles to 100,000 miles. While strictly speaking this was a range that captured the true value, its usefulness as an uncertainty range is probably approaching zero. However from the students’ perspectives, answering in this way was getting them closer to the 90% confidence range that I was pushing on them. (Even with such ranges, many students were still at 50-70%.) What would your response be to this strategy if you saw it being used and what might I as an instructor suggest to improve this? Is the conclusion to be left with, you don’t know anything if you have to choose wide ranges? Are there other measures we should combine with this such as the width of the confidence intervals? Are there other mental exercises besides those in the book that might help?

2. While students did not do well on the 90% confidence interval questions, they did do fairly well on true/false questions where they then estimated their degree of confidence. More than three-fourths of the class did get within ten percent of their estimated level of confidence by the second true/false trial (though these came after several 90% confidence interval exercises as well). At the same time, students’ average confidence level for individual questions did not correlate at all with the percent of the students who correctly guessed true/false. In the book there was no discussion of improvements or accuracy with the true/false type estimation questions and I wondered if you had any observations to offer on why this seemed easier and students were better on this type of estimation. In your experience, are these types of calibration more/less effective or representative? Should they be very different from the 90% confidence intervals in terms of calibration?

Again, great book that I think could almost be a course for my students as it is.”

Thank you for this report on your experiences. Please feel free to continue to post your progress on calibration so we can all share in the feedback. I am building a “library” of findings from different people and I would very much like to add your results to the list. I am especially interested in how you asked students to describe themselves as analytical vs. those who did not. Please feel free to send details on those results or to call or email me directly. Also, since I now have two books out discussing calibration, please let me know which book you are referring to.

I item-tested these questions for the general business, government, analyst and management crowd. Perhaps that is one reason for the difference in perceived difficulty, but I doubt that alone would make up for the results you see. My experience is that about 70% of people achieve calibration after 5 tests. We might be emphasizing different calibration strategies. Here are my responses to your two questions:

1. We need to be sure to explain that “under-confidence” is just as undesirable for assessing uncertainty as overconfidence. I doubt that student really believed Mt. Everest was several times larger than the diameter of the Earth, but if he/she literally had no sense of scale, I suppose that is possible. It is more likely that they didn’t really believe Mt. Everest could be 100,000 miles high or even 10,000 miles high. Remember to apply the equivalent bet. I suspect that person believed they had nearly a 100% chance of getting the answer within the range, not 90%. They should answer the questions such that they allow themselves a 5% chance that the true value is above the upper bound and a 5% chance it is below the lower bound. But if this truly is their range that best represents their honest uncertainty, then you are correct – they are telling you they have a lot of uncertainty and the ends of that range are not really that absurd to them.
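As a rough illustration of how a range test could be scored against the 90% target, here is a minimal Python sketch. The function names and the sample answers are made up for illustration, and the probability calculation assumes each question is an independent 90% chance of a hit, which is a simplification.

```python
from math import comb

def count_hits(ranges, truths):
    """Number of (lower, upper) ranges that contain the true value."""
    return sum(1 for (lo, hi), t in zip(ranges, truths) if lo <= t <= hi)

def prob_at_most(hits, n, p=0.9):
    """P(X <= hits) if each of n ranges independently has chance p of a hit."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(hits + 1))

# Hypothetical answers to a 10-question range test (made-up numbers).
ranges = [(1, 10), (5, 50), (100, 200), (0, 3), (20, 40),
          (2, 8), (10, 90), (1, 2), (30, 60), (0, 100)]
truths = [4, 60, 150, 2, 45, 5, 50, 3, 35, 10]

n = len(truths)
hits = count_hits(ranges, truths)
print(f"hits: {hits} of {n} ({hits / n:.0%})")
# A small value here suggests overconfidence rather than bad luck.
print(f"chance of {hits} or fewer hits if truly calibrated: {prob_at_most(hits, n):.3f}")
# Under-confidence shows up the other way: always hitting on a long test
# is itself unlikely for honest 90% ranges (0.9 ** n for n questions).
print(f"chance of hitting all {n} if truly calibrated: {0.9 ** n:.3f}")
```

A hit rate well below 90% with a small tail probability points to ranges that are too narrow, while a perfect score on a long test points to ranges that are wider than the person's real uncertainty.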

2. Yes, they always appear to get calibrated on the binary questions first. But I do discuss how to improve the true/false questions. Remember that the “equivalent bet” can apply to true/false questions as well. Furthermore, repetition and feedback are strategies for improving on either ranges or true/false questions. Finally, the corrective strategy against “anchoring” involves treating each range question as two binary questions (the anchoring phenomenon may be a key reason why ranges are harder to calibrate than true/false questions). When answering range questions, many people first think of one number, then add or subtract another “error” value to get a range. This tends to result in narrower – and overconfident – ranges. As an alternative strategy, ask the students to make the lower bound such that they could say they would answer “True with 95% confidence” to the question “Is the true value above the lower bound?” This seems to significantly reduce overconfidence in ranges.
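To make the "two binary questions" idea concrete, here is a small Python sketch of how both kinds of test might be scored. It is only an illustration: the function names and all the sample answers are hypothetical. The first function compares stated confidence on true/false questions with the share actually correct; the second scores each bound of a range separately, so each share can be compared with the 95% target for that bound.

```python
def binary_calibration(answers):
    """answers: list of (stated_confidence, was_correct) pairs.
    Returns (expected fraction correct, actual fraction correct)."""
    n = len(answers)
    expected = sum(conf for conf, _ in answers) / n
    actual = sum(1 for _, ok in answers if ok) / n
    return expected, actual

def range_as_two_binaries(ranges, truths):
    """Score each bound as its own true/false question.
    Returns (share of true values at or above the lower bound,
             share of true values at or below the upper bound).
    Each share should be near 0.95 for calibrated 90% ranges."""
    n = len(truths)
    above_lower = sum(1 for (lo, _), t in zip(ranges, truths) if t >= lo) / n
    below_upper = sum(1 for (_, hi), t in zip(ranges, truths) if t <= hi) / n
    return above_lower, below_upper

# Hypothetical true/false answers: (stated confidence, answered correctly?).
tf_answers = [(0.9, True), (0.6, False), (0.8, True), (0.7, True), (1.0, True),
              (0.5, False), (0.6, True), (0.9, True), (0.7, False), (0.8, True)]
expected, actual = binary_calibration(tf_answers)
print(f"expected {expected:.0%} correct, actually {actual:.0%} correct")

# Hypothetical range answers and true values.
ranges = [(1, 10), (5, 50), (100, 200), (0, 3)]
truths = [4, 60, 150, 2]
above_lower, below_upper = range_as_two_binaries(ranges, truths)
print(f"above lower bound: {above_lower:.0%}, below upper bound: {below_upper:.0%}")
```

Scoring the bounds separately shows which side of the range is too tight, which a single hit rate hides.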

Thanks for this information and feel free to send detailed records of your observations. I may even be able to incorporate your observations in the second edition of the How to Measure Anything book (which I’m currently writing).

Thanks,

Doug Hubbard

Mr. Hubbard,

I’d be happy to share more about my experiences. I collected the responses to all of the questions for each of the students for each of the tests. I could post a series of posts here on various findings if you think this is worth sharing or deal with you directly if this is getting into more detail than most would be interested in.

Given your particular interest, let me share here first what I found on the question of calibration against self-assessments. Given the difficulties we were having moving people to a calibrated state, I thought that one of the problems might be that I had students who fundamentally didn’t see themselves as analytical or numbers oriented and this might explain who did well on this versus who did not. You alluded to something along these lines in your first book with some people who scored low being in jobs which didn’t require forecasting. The two statements I asked students to self-assess on were “I consider being analytical one of my personal strengths” and “I am comfortable working with numbers”. I then asked students to rate themselves on a 5-point scale ranging from “strongly disagree” to “strongly agree”. On the analytical question students rated themselves highly, averaging 3.9 or just below “agree”. On the working with numbers question, the assessments were lower, averaging 3.3 or between neutral and agree. The interesting part was seeing whether either of these scales correlated with how students did on their last calibration range assessment. For the analytical statement, the correlation with the percent correct on the last calibration exercise was 0.23, suggesting that those who saw themselves as more analytical did somewhat better, but in a simple regression the R² only works out to 0.05, so only 5% of the variation in how they did on the calibration could be explained by how analytical they assessed themselves to be. The results were less helpful for the statement “I am comfortable working with numbers”. Here the correlation was only 0.13 and the R² in the regression equation was less than 0.02, so this seemed to be even less useful. So neither of these statements appeared to be good predictors of how students did on the calibration tests.
I also tried to take a more qualitative assessment based on my limited experience with these students and I was not able to see any clear patterns that might make this clearer to me.

I can post some of the summary charts here but I wasn’t sure how to get an image in.

Dale Roenigk

These findings are very interesting and I don’t recall anyone in the literature asking a question like that before calibration testing. I don’t know how large your sample was, but I would agree that the small r squared is probably well within what would be possible with random error and not evidence of any real relationship.
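As a quick check on that point, the sketch below squares the two reported correlations and puts an approximate 95% interval around each using the Fisher z-transformation. It assumes a sample of 27, the class size mentioned in the original post; the number of students who actually completed both the self-assessment and the last test is not stated, so treat that figure as an assumption. The function names are illustrative only.

```python
from math import atanh, tanh, sqrt

def r_squared(r):
    """In a simple (one-predictor) regression, R-squared is r squared."""
    return r * r

def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for a correlation r from n pairs,
    using the Fisher z-transformation (standard error 1 / sqrt(n - 3))."""
    z = atanh(r)
    se = 1 / sqrt(n - 3)
    return tanh(z - z_crit * se), tanh(z + z_crit * se)

n = 27  # ASSUMPTION: class size from the original post, not a confirmed sample size
for label, r in (("analytical", 0.23), ("comfortable with numbers", 0.13)):
    lo, hi = fisher_ci(r, n)
    print(f"{label}: r = {r}, R^2 = {r_squared(r):.4f}, "
          f"approx. 95% interval for r = ({lo:.2f}, {hi:.2f})")
```

Under that assumed sample size, both intervals include zero, which is consistent with correlations this small arising from random error alone.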

Can you confirm that you use each of the steps I refer to in the previous response? That is, the equivalent bet, the method to avoid anchoring, etc.? If you haven’t been using all of those methods you might try them in the next class.

Feel free to send me whatever you care to share by email through the “contact the author” link on http://www.howtomeasureanything.com or “more info” at http://www.hubbardresearch.com.

A more objective measure of analytical skill might be a simple test on some other analytical problem, like a few simple story problems. Perhaps we can even design a specific experiment with your next class. A captive audience of students has been a valuable resource for many decision psychology studies!

Thanks again,

Doug Hubbard