I’m reintroducing the Measurement Challenge for the blog. I ran it for a couple of years on the old site and had some very interesting posts.
Use this thread to post comments about the most difficult – or even apparently “impossible” – measurements you can imagine. I am looking for truly difficult problems that might take more than a couple of rounds of query/response to resolve. Give it your best shot!
Doug Hubbard
How can we measure improvement in problem-solving after attending training in critical thinking?
Testing critical thinking skills is actually a well-developed area of psychology. One way is to simply give different tests before and after the training. Different people can take the tests in different orders so that any change in results isn’t just due to the second test being easier. There are several standardized tests for this. (A Google search on “critical thinking skills test” produces quite a few good hits.)
Another way is to look into what the critical thinking skills are supposed to do for your firm. Why do you care that their critical thinking skills improve? Is it because they regularly make decisions that affect the performance of their part of the firm? Then perhaps that is what you need to measure. But then the question is how do you know that the training was the reason for the performance improvement? In this case you might try training different people at different times and see if there are correlations between measured performance and when they received the training.
In either case, if you have at least a couple of dozen people being trained and tested, you may have the basis for finding the correlation you are looking for.
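To make that concrete, here is a minimal sketch of the before/after comparison in Python. The scores are invented for illustration, and a real study should use a standardized instrument and counterbalance test order as described above:

```python
# A sketch of a pre/post comparison for a couple dozen trainees.
# Scores are invented; a real study would use a standardized instrument.
from scipy import stats

pre  = [62, 71, 58, 65, 70, 66, 59, 73, 61, 68, 64, 69]  # before training
post = [68, 74, 63, 66, 75, 71, 60, 78, 67, 70, 66, 74]  # after training

improvement = sum(post) / len(post) - sum(pre) / len(pre)
t_stat, p_value = stats.ttest_rel(post, pre)  # paired test: same people twice
print(f"mean improvement: {improvement:.1f} points")
print(f"paired t = {t_stat:.2f}, p = {p_value:.4f}")
```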
Thanks for your participation in the blog,
Doug Hubbard
How can we measure ‘fun’ as a benefit from a creative arts project? (in evaluations it is identified by participants as their most valued benefit from engagement with the process).
I’m back from my long European trip. I thought I answered Janet67’s question at one point but apparently not. Anyway, I remember what I was going to say.
As with all measurement problems, we have to start with the definition. I will apply the following questions to each of the several measurement challenges I’ve been given, including those from jdconsultant and from Natalie7. These questions are:
1) What do you mean by “X” (fun, innovation, chaos, etc.)? In other words, what are the observable consequences of it? If you knew it existed, or if you knew you had a reason to measure it, you must at least have conceived of some possible observations of it. Defining the problem is the beginning of all scientific epiphanies.
2) Why do you care about it? Defining what you will do with the information tells us something about what you are really trying to observe. Are you trying to predict something? Are you going to make a decision based on this?
3) What do you know about it now? It is unlikely that anything you really want to measure has bounds of negative infinity to positive infinity. Once you have defined it, you can usually put an initial range around it based on what you already know.
If we can’t answer these questions about a measurement challenge, then that’s no different than asking “How many Whazinkles can finakvil a bunch of huxoopers?” In other words, things only seem immeasurable because you haven’t really figured out what you mean by them.
So how do you observe “fun”? You must have observed it before; otherwise it is unlikely you would even have proposed that it is something that might be an “amount”. In an example in my book, I show how the quality of performances at the Cleveland Orchestra was evaluated by measuring the number and duration of standing ovations. Of course, the board could have simply used a survey, but the standing ovation measure might be even more revealing (and much less of a bother to patrons).
But let’s consider both options. One is a “revealed” opinion, the other “stated”. That is, one measure is based on what people do, the other on what people say. You could use a survey of parents and/or children. But I think you would also find revealed feedback is possible. What other observations do you make that tell you kids are having fun? Are they noisier? Are they more active? Do they laugh? Do they smile? A video analyzed by impartial judges armed with timers and notepads is a great place to start.
Now ask why you care. Isn’t it obvious they are having fun? Or are you saying that some decision will rely on this measure, and it may come down to which activities are more fun than others?
I’ll await your answers before I finish.
Thanks,
Doug Hubbard
Okay, here are three things for you to try:
1) My partner was at Cambridge, and they discussed one day how to measure the amount of chaos in a system – how do you measure this? I don’t know what answer they came up with.
2) Progress of society, in itself over time or to compare with other societies?
3) ‘Rightness’ of a judicial decision?
Challenging enough? Also, very interested in the answer to Janet67’s question.
I’m always curious when people ask how to measure something that is actually the subject of well-developed areas of mathematics. Chaos is one such area. Feigenbaum and Mandelbrot measured phenomena that later came to be called chaos (although it was not always a well-defined concept). Chaos theory is all about expressing the concept with some mathematical rigor. Feigenbaum’s Constant is one measure of chaos.
If you mean “randomness” (which is not strictly the same as chaos, but as far as I know that might be what you mean), then there is another well-developed method for that. Claude Shannon showed that the randomness of a signal could be measured by how much the signal could be compressed. If the signal was a repeating pattern of 01010101, then it would not take much to state it even if the signal went on for a million digits. A truly random signal, however, would be difficult to compress much at all.
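A quick sketch of Shannon’s idea, using a general-purpose compressor as a stand-in (the signals below are invented for illustration):

```python
# Shannon's insight as a rough test: a patterned signal compresses well,
# a random one barely compresses at all.
import os
import zlib

patterned = b"01" * 500_000          # a million characters of 0101...
random_sig = os.urandom(1_000_000)   # a million random bytes

for name, sig in [("patterned", patterned), ("random", random_sig)]:
    ratio = len(zlib.compress(sig, 9)) / len(sig)
    print(f"{name}: compresses to {ratio:.1%} of original size")
```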
But let’s go to another one of the basic questions I like to ask: why do you care? What decision would be based on measuring the chaos of a system? It is OK if you were just curious. In that case it might suffice to just state it unambiguously for the purpose of academic publication. But was some business decision based on this? Why? That would tell me a bit about what you intend to measure.
I’ll await your response on this item but I’ll proceed with just one more of your questions before I turn in tonight (the flight back from London was long).
Regarding measuring progress, again, the first question, as always, is “what do you mean?” followed by “why do you care?” The answer to the second actually helps with the answer to the first if it is not apparent. Do you mean technological progress? Do you mean progress in terms of laws, democracy, scientific exploration? Regarding the “why do you care?” question, is it because you are thinking that people in societies with more progress are happier?
So that ball is in your court to respond to those questions. I’ll respond more this week.
Thanks,
Doug Hubbard
Sorry, one more (more practical this time, rather than trying to come up with a ‘clever’ challenge):
I work as a school nurse and I’m interested in whether there is any way a person can measure the impact of sex education on reducing teenage pregnancies.
How can you measure something like that, with all the variables involved: the secrecy/taboo factors, the uncertainty around motives (people who want to get pregnant anyway), how life happens by chance and you just end up in that situation whether you understood the education or not, and so on.
And presumably even if there is some statistic that you could use for a population (which I cannot even imagine), you could never make a prediction for any individual’s ‘improved’ position in respect of having the education.
Look forward to your thoughts!
Hope you had an interesting trip!
Okay, to try and answer your questions.
Maybe ‘chaos’ was the wrong word. Maybe my partner was interested in that academically; I must ask. But from my perspective I was thinking of something Goldratt said about the degrees of freedom of a system. This is probably a different thing altogether, so perhaps I am using language sloppily.
If you look at this video, you might get a better idea of what I am asking:
http://www.youtube.com/watch?v=tWvMODJ9cVc
Goldratt talks about how complex a system is, and then he goes on to say something about how you can describe a system in four sentences versus a thousand pages, the latter being the more complex.
Maybe this is the measurement after all; it sounds interesting, but I’m not sure how it helps a decision. Perhaps I am being short-sighted, but I can’t see how it helps in itself.
Then he goes on to talk about the degrees of freedom of a system, which seems to be the thing that matters in practical terms. So I think what I want to know is how to count the degrees of freedom.
The progress question is a little different, but I think it is inspired by the whole Barack excitement and the economic recession stuff.
If Barack does a good job, presumably he moves society on. Or is that just an assumption? But there are so many things you could measure to determine that, as you point out. How do you arrive at some kind of ‘net’ or aggregate value that says ‘yes, things moved on’ or got better? Surely that is what we all want to know when we vote in an election (we have one coming up in the next few weeks here).
But presumably you can’t because all of that is so value-laden.
Lord Ashcroft over in the UK, where I am, said that he felt that once the recession was over there would be a new world order, and that he felt the UK would be lower in that order than pre-recession. I know he is probably just talking about economic order, but I want to get at what that ‘order’ is fundamentally across all things.
The decision it might support is where I live or where I want my children to grow up! Maybe that is about happiness.
Maybe it begs the question, what is the measure of the value or quality of life? Surely you can’t answer that one!
Sorry, in case you need to look him up, I meant, Lord Ashdown – (Paddy), former leader of the Liberal Democrats. Too many Lords over here!
To Natalie7,
Before I head off to work for the day, let me answer your most recent question first. Of course you can measure the value and quality of a life. And, just like in all of your other questions, it is already being done.
In the US, UK and any country that has government agencies concerned with the allocation of resources to public health and safety, these decisions must be made. And in many cases, they are made with the help of a measurement. Both in the US and many other countries there is a method called the Value of a Statistical Life (VSL). This is based on the idea that people are only willing to pay up to a certain amount to reduce their OWN chance of mortality by a given increment. Would you pay an extra $10,000 for a car that was proven to reduce your chance of death on the road from 1% per year to 0.5% per year? Some people would, some would not. However you change the numbers, there is a limit to what you would pay. If you found this payment just barely acceptable (you would pay just $10k for a 0.5% reduction in chance of death), then you presumably value your life at about $2 million.
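The arithmetic behind that example, spelled out as a quick sketch with the numbers above:

```python
# Implied value of a statistical life = willingness to pay / risk reduction.
willingness_to_pay = 10_000        # extra paid for the safer car
risk_reduction = 0.010 - 0.005     # chance of death drops from 1% to 0.5% per year

implied_vsl = willingness_to_pay / risk_reduction
print(f"implied value of a statistical life: ${implied_vsl:,.0f}")  # $2,000,000
```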
And you make several other decisions like that throughout the course of your life. Some governments (like the US) collect this as a “revealed preference” indication of how much you value your own life. That is, instead of asking people how much they are willing to pay to save their own life, they simply record what they *actually* spend. In the US, people appear to act as if they value their own lives at somewhere between $2 million and $20 million.
Quality adjusted value of life is, of course, just standard operating procedure for any country that has to allocate limited health care resources.
But your issue here is hidden in your statement that something cannot be measured because “it is so value-laden.” Why is that an obstacle? When someone agrees to pay $200 for a painting but would have refused it at $300, they told you something about how they value it. There is in fact a huge industry dedicated to measuring opinions and values using surveys and market research. This is no different.
I have a bit more time, so I will also address one other comment you made in an earlier post before I leave for the day. You indicated that something would be difficult to measure because of “all the variables involved.” Are you under the impression that all or even most variables must be known before one particular variable can be correlated to an outcome? If that were the case, then all clinical drug trials would be impossible. But this is a common fallacy. In order to determine whether one drug reduces, say, ulcers, you don’t need to first measure or even know all the other variables. We are not concerned with whether the drug fixed one particular subject’s ulcers. What we want to know is whether the test group did significantly better than the placebo group. Since the drug vs. placebo was the only systematic difference between the groups (all individuals were randomly assigned to the placebo or test groups), at some point it is unlikely that an observed difference between the groups could have been due to some other factors.
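Here is a minimal sketch of that group comparison. The counts are invented; a real trial would use its own data and design:

```python
# Compare outcomes between randomly assigned groups; no need to measure
# every other variable, since randomization balances them on average.
from scipy.stats import fisher_exact

#            drug  placebo
table = [[34, 18],   # improved
         [16, 32]]   # did not improve
odds_ratio, p_value = fisher_exact(table, alternative="greater")
print(f"p = {p_value:.4f}")  # small p: the difference is unlikely to be chance alone
```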
I’ll try to get to the other questions this week. But one item I noticed right away is that each of these questions is something that someone already measures on a regular basis. Always assume it’s been measured before. My response will be little more than telling you how other people already do it (a little Google searching would reveal the answer to each of these questions).
Finally, I will note that almost everything everyone has mentioned so far is not only already a measurement problem solved by others before me, but many similar examples were already mentioned in my book, including the VSL, measuring subjective value, and isolating the effect of variables (in addition to my advice of assuming it’s all been measured before and defining the measurement problem in terms of observables). Have you seen the book?
Thanks,
Doug Hubbard
I have seen but not read the book yet. It is on my recommended reading list along with this website.
I am in general quite convinced by most of your comments, but I’m not entirely sure that a £ or $ value is the right measure of the value of life, just for instance. Is this really a useful measure unless you are a politician, policy maker or planning a healthcare budget?
I have seen QALY (quality adjusted life years) used before in healthcare over here, and it does offer something and maybe is the only way to do things. But I am not sure that that is meaningful. Do you know what I mean?
How does that help a frontline clinician – who is bound first and foremost by a professional duty to the patient in front of them – in their decision-making?
I really do need to read the book, don’t I? 🙂
Okay Doug, I have one for you! The obvious one I think…
How can I measure the value your book will bring me over the course of my life, and how can I measure, in advance, how certain I can be that you are right?
Picorna,
Do you make big decisions? How often? How often are they right? You can measure your uncertainty about each of these (and your uncertainty about my claims) using the method called “calibrated probability assessment” I describe in the book. Assessing your own uncertainty subjectively but quantitatively is a skill you can learn (it turns out that bookies are quite good at putting odds on events – they are right 80% of the time they say they are 80% confident and so on).
Describe your uncertainty about the size, frequency and % correct of decisions you make in a given year by applying a calibrated confidence interval (a range that represents your uncertainty). Also apply a range to how many years you have left to live and a range to the reduction in decision errors from reading my book (the book itself offers estimates of reductions in decision errors based on previous research). Decision errors due to your inconsistency, overconfidence, and tendency to be concerned with the wrong variables are all significant factors. Even just a slight reduction in any of them would be a big payoff. The payoff is itself measured as a range (the book describes how to compute an output as a range instead of an unrealistic point value). Unless, of course, you have no risky decisions of any kind in your life, have no chance of being wrong, or expect to die very soon.
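To illustrate, here is a minimal Monte Carlo sketch of that calculation. Every range below is an invented example of a calibrated 90% CI, not a claim about your actual decisions:

```python
import random

def sample_90ci(lower, upper):
    """Draw from a normal distribution whose 90% CI matches the given range."""
    mean = (lower + upper) / 2
    stdev = (upper - lower) / 3.29   # a 90% CI spans +/-1.645 standard deviations
    return random.gauss(mean, stdev)

payoffs = []
for _ in range(10_000):
    decisions_per_year = sample_90ci(4, 20)       # big decisions made per year
    dollars_at_stake = sample_90ci(5_000, 100_000)
    error_reduction = sample_90ci(0.01, 0.10)     # fraction of errors avoided
    years_remaining = sample_90ci(20, 50)
    payoffs.append(decisions_per_year * dollars_at_stake
                   * error_reduction * years_remaining)

payoffs.sort()
print(f"90% CI on lifetime payoff: ${payoffs[500]:,.0f} to ${payoffs[9500]:,.0f}")
```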
Thanks,
Doug Hubbard
There are as many different measurement methods for a life as there are decision objectives. Your challenge was originally that the value of a life was not measurable at all. I gave one example where it was for a given objective (such as a policy maker who allocates limited resources).
If you had some other specific objective for a particular person in mind, that is part of the clarification process of any measurement. Are you asking about a triage situation where a health-care professional has to decide who to save after a big disaster? Are you asking about how a hospital decides to keep a dying person on expensive life-prolonging measures? If you are not confronted with allocating limited resources to save a life, then what is the actual dilemma?
QALY may not be a meaningful measure depending on the decision it must support. And duty-bound is only one constraint – the other constraint is limited resources. You are duty bound to help as many as you can after a major disaster. That itself offers a measurement problem in terms of triage. Of course, this would be something that requires immediate decisions, but that, again, is where a simple equation is often better than a human judge.
More to come…
Mr. Hubbard, I am just finishing your book and have greatly enjoyed it. I do have a question about measurement. You cite work that analytic measures (even simple regression models) work as well as or better than experts (even if they slightly underperform, you could weigh the value of the better prediction against the experts’ added cost). To play devil’s advocate, when no measurement model is present, a model based on ‘uninfluenced’ behavior may be a good predictor. However, once the model criteria are known you have now created ‘measurement influenced’ behavior (a type of Hawthorne Effect). Examples are “teaching to the test” and the influence of university admissions criteria on high school students and parents. Are the (now measurement influenced) students that have checked all the right boxes for college admissions still the best predictor of success? Is the answer to continually measure and revise the criteria, or keep the criteria/measurement device hidden to avoid influencing behavior? It seems that those evaluated would continually try to game the measurement system.
A recent related article on NCLB:
http://www.slate.com/blogs/blogs/thewrongstuff/archive/2010/05/17/diane-ravitch-on-being-wrong.aspx
It doesn’t appear we necessarily disagree at all about your main point (and you even mention the Hawthorne effect, which I discuss in the book). I actually talk about the specific problem of unproductive incentives from measurements in the second edition of my book in regards to “The Houston Miracle”.
First, we have to separate the issue of uncertainty reduction for management vs. incentives for everyone else. There are many ways to reduce uncertainty, and for each of those ways there are multiple possible incentive structures. You can have a great set of measures combined with poor incentive structures in such a way that you create more problems than you solve. But there are incentive structures that outperform other incentive structures. The fact that someone stumbles across one incentive structure based on a measurement which turns out to produce undesirable outcomes does not mean that all measurements should be abandoned. It just means they did it wrong.
When we evaluate the performance of any method for anything we have to ask “compared to what?” If the comparison is to decision making based on unaided intuition, then all we need to do is outperform intuition by enough to justify the cost of the analysis method.
Also, there are actually scoring systems called “proper” scoring rules that are mathematically impossible to game. The “Brier score” is one I mention in the second edition. It is an incentive system for forecasts where the only way to optimize the score is to make better forecasts at the appropriate level of confidence. The Oakland A’s developed a set of baseball statistics that better correlate to the outcomes of games. The only way to “game” the system is to win more games, which is what management wants.
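For the curious, a minimal sketch of the Brier score calculation (the forecasts below are invented):

```python
# Brier score: mean squared error between stated probabilities and outcomes.
# 0 is perfect; always saying 50% scores 0.25. Lower is better, and the only
# way to improve it is to forecast honestly and accurately.
forecasts = [0.9, 0.8, 0.6, 0.95, 0.3]  # stated probabilities (invented)
outcomes  = [1,   1,   0,   1,    0]    # 1 = the event happened

brier = sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)
print(f"Brier score: {brier:.3f}")
```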
The flaw in most incentive systems is that if X correlates to Y, then we must incentivize X, which changes the relationship to Y. But all we need to do is compute the “contribution to Y” regardless of X.
Finally, we shouldn’t lose sight of the basic benefits of measurement just because it is possible to identify anecdotes where incentives based on measurements turned out to produce unproductive results. How often is this really the case? Should we be paralyzed into inaction because this happens 10% of the time? How about 1% of the time? The fact is, prior to implementing the measurement and incentive program, we don’t really know. But that just means the incentive program itself needs to be measured.
Thanks for your input,
Doug Hubbard
No, I didn’t think we disagreed, but I wanted to hear your opinion on the influence of measurement on behavior. Your distinction of measurement and incentives based on measurement clarifies the issue. It sounds like I just need to finish the book and these issues will be addressed. Thanks for your response. Between your book, Sam Savage’s book, and Stephen Powell’s writings, I have greatly changed my view of how business analytics should be presented to our students.
Hi dwhubbard and readers,
I’m just reading the book now and playing about with the Excel sheets, and wondering how I could use the Chapter 7 “Continuous Information Value Calculation” to determine the value of information to reduce the uncertainty in my lifelong income after undertaking a PhD… or not. The sheet is really set up for product sales, but I’ve changed the following:
90% CI Upper Bound: 4,800,000 – income over remaining life after undertaking a PhD
90% CI Lower Bound: 2,376,000 – income over remaining life without undertaking a PhD
Interval Range: 2,424,000
Threshold: 2,400,000 – not really sure what I should put here
Loss rate (on the undesirable side of the threshold): 1 – losing a $ for each $ below the threshold
Mean: 3,588,000
Standard deviation: 736,778
Does the loss occur under or over the threshold? Under
The absolute minimum the quantity can reach: 480,000 – if I was on the pension the rest of my life!
The absolute maximum the quantity can reach: 8,000,000 – I will retire if I make this much 😉
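For what it’s worth, here is my rough attempt to reproduce the expected opportunity loss outside the spreadsheet (the normal shape and the clipping to the min/max are my own assumptions):

```python
import random

lower, upper = 2_376_000, 4_800_000     # my calibrated 90% CI on lifetime income
mean = (lower + upper) / 2              # 3,588,000, as in the sheet
stdev = (upper - lower) / 3.29          # ~736,778, as in the sheet
threshold, loss_rate = 2_400_000, 1.0
abs_min, abs_max = 480_000, 8_000_000

losses = []
for _ in range(100_000):
    income = min(max(random.gauss(mean, stdev), abs_min), abs_max)
    losses.append(loss_rate * max(0.0, threshold - income))  # loss only below threshold

expected_opportunity_loss = sum(losses) / len(losses)
print(f"expected opportunity loss: ${expected_opportunity_loss:,.0f}")
```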
I realise this is probably all a bit upside down, but I find doing the exercises helps the information sink in so much deeper.
Look forward to your comments,
Kind regards,
Nathan
Mr. Hubbard,
I am on a project team that is developing a methodology to prioritize our product portfolio to enable resource tradeoff decision-making by an executive committee (time and money).
We have four excellent quantitative measures, which I used to prioritize the portfolio with the Dawes Z-score methodology you discuss. I think it worked elegantly! However, experienced team members are demanding that we incorporate qualitative variables in any prioritization process, e.g. “Product is innovative”. Being a Lean Six Sigma Black Belt, I gravitated towards my training and considered using AHP. Today, I read your assessment of the technique and now I feel that is not the right option. We have discussed using a binary or ordinal score on the qualitative variables, e.g. the product team assigns a “1” to their product on innovation versus a default “0”. We could then plug that data into the Z-score matrix and adjust the model from there. In your opinion, is that acceptable or are we falling into bias traps? Is there a simple way to solve the problem, or do you recommend we do the Full Monty, i.e. develop probabilities and VOI etc.?
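For clarity, here is roughly the unit-weight z-score ranking I used, sketched with placeholder numbers:

```python
import statistics

# Four quantitative measures per product; values are placeholders.
# This assumes higher is better on every measure (flip signs where it isn't).
products = {
    "A": [120, 0.32, 7.5, 14],
    "B": [ 95, 0.41, 6.0, 22],
    "C": [150, 0.28, 8.2,  9],
}

columns = list(zip(*products.values()))
means = [statistics.mean(c) for c in columns]
stdevs = [statistics.stdev(c) for c in columns]

# Dawes-style model: standardize each measure, then sum with equal weights.
scores = {
    name: sum((v - m) / s for v, m, s in zip(values, means, stdevs))
    for name, values in products.items()
}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"product {name}: {score:+.2f}")
```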
I look forward to your insight/recommendations.
Please excuse my delayed response. I have a question, first. How large are these investments? If they are in the range of several million dollars and higher, then the full risk/return analysis I describe would be appropriate. If these are product decisions that are each a few hundred thousand dollars or less, then a simple approximation with the Z-score or perhaps the Lens method (which I also discuss in the book) is probably an improvement on unaided intuition.
I don’t believe some subjective score is always avoidable, but you are correct to be cautious. The first assumption I try to apply in any such situation is that the subjective score must be missing some more fundamental point. In this case, why is innovation important? I like the Madison Avenue quote, “If it doesn’t sell, it wasn’t creative.” Evaluating whether something is innovative or not is beside the point. Don’t you really just care whether it might be commercially successful or have some other major benefits? How is being innovative a benefit by itself if the innovation doesn’t produce other observable benefits? If it doesn’t produce other observable benefits, how innovative was it, anyway?
For those reasons, I would avoid innovation altogether. It is simply not a benefit on its own. You are ultimately concerned about the effects of the alleged innovation, not the innovation itself.
If you do have other factors that may be modeled by ordinal scales, I would first consider the possibility that they, too, are really hiding some underlying observable result which is the real focus of your objectives. If you still find it necessary, your approach makes sense, but show the result as the result of a survey of several people. The factor then really becomes “the percentage of experts who judge this project to be X.”
I would consider trying a Lens model as well. It does show a measurable improvement over both intuition and the Dawes Z-score and it avoids the arbitrary choice of weights.
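A minimal sketch of the Lens idea, assuming you can get experts to rate a batch of real or hypothetical products on your cues (all numbers below are invented):

```python
import numpy as np

# Lens model: regress expert judgments on the observable cues, then let
# the fitted model do the scoring instead of polling the experts each time.
# Rows: products experts already rated; columns: the four quantitative cues.
cues = np.array([
    [120, 0.32, 7.5, 14],
    [ 95, 0.41, 6.0, 22],
    [150, 0.28, 8.2,  9],
    [110, 0.35, 7.0, 18],
    [130, 0.30, 7.9, 12],
    [100, 0.45, 5.5, 25],
], dtype=float)
expert_scores = np.array([7.2, 5.1, 8.4, 6.0, 7.8, 4.6])  # experts' 0-10 ratings

X = np.column_stack([np.ones(len(cues)), cues])           # add an intercept term
weights, *_ = np.linalg.lstsq(X, expert_scores, rcond=None)

# Score a new product with the fitted model.
new_product = np.array([1.0, 105, 0.38, 6.5, 20])
print(f"model score for new product: {new_product @ weights:.2f}")
```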
Thanks for your input,
Doug Hubbard
Hi, I have a question please. I tried to re-create your Monte Carlo risk assessment sample regarding savings from leasing a new machine for the production process. In your book (page 77) you say that there is an about 14% risk of staying below breakeven. As I said, I tried to re-create it in Excel and ended up with 18.03%. Then I tried again and got 23%. Then I downloaded your spreadsheet sample, copied the last row to get 10,000 instances, and the result was 17%. Then I tried it with my spreadsheet another time and got 18.26%. Hm, is this normal? With 10,000 instances I would have expected a deviation of maybe 0.1, but not a range between 14 and 23. Have I made something wrong, or what is the background of these significant differences? At the moment I don’t have the feeling I can rely on this tool. Thank you for advice.
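For reference, here is the rough sampling-error reasoning behind my expectation (a quick sketch):

```python
# Standard error of a proportion estimated from n Monte Carlo trials.
p, n = 0.17, 10_000
standard_error = (p * (1 - p) / n) ** 0.5
print(f"standard error: {standard_error:.4f}")  # about 0.0038, i.e. +/-0.4 points
```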
You have a considerable enthusiasm for Monte Carlo simulations, but have you ever compared these to Fuzzy Logic solution development, as used by the UK RiskAid product?
regards
Richard Watson
Yes, I’m also a big fan of Fuzzy Logic. But note that prior to the 1980s, fuzzy logic was called “Monte Carlo”. Just kidding. There are some differences, but it is true that the fuzzy logic movement seems to be a repackaging of existing stochastic methods. Fuzzy logic was all about analyzing situations without certainty. That’s exactly what most of the decision sciences were always about. I find when I talk to fuzzy logic experts they tend to be simply unfamiliar with work in the area of decisions under uncertainty and, therefore, come to the conclusion that what they are doing is “new”.
But there is one difference between Fuzzy Logic and previous methods for the analysis of uncertain systems. Proponents of fuzzy logic also attempt to apply it to situations that are not just uncertain (which is already addressed with other stochastic methods) but ambiguous. For example, I hear fuzzy logic proponents use the examples of “baldness” or “warm” as fuzzy concepts they can model. They state that there is no exact point where, if you remove one more hair, the person becomes bald, or where, if you increase the temperature one tenth of a degree, it becomes warm.
But this is the area where I find the methods fuzzy logic uses to be unnecessary for a different reason. Uncertainty is not avoidable, but ambiguity is. We can simply define our terms better. When we look at specific applications of fuzzy logic, I tend to find that some other unambiguous language was possible, which makes the fuzzy logic application a moot point. For example, if we are assessing “baldness” to determine how big a man’s toupee needs to be, what we really end up asking is simply the area that needs to be covered. The arbitrary binary point where the person is officially “bald” is not relevant to this problem. Likewise, if we are trying to assess the temperature at which people find themselves most comfortable, we find that functions that relate objective inputs to scales of subjective perception (such as Weber functions) already serve that purpose and are very useful. I think these are probably the reasons why there has been a steady decline in the use of the term “fuzzy” in the literature (an analysis of hits on the JSTOR database by year will show this). The part that works isn’t new, and the part that’s new isn’t necessary.
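For instance, a minimal sketch of a Weber-Fechner style mapping from an objective stimulus to a subjective perception scale (the constants here are purely illustrative):

```python
import math

def perceived_intensity(stimulus, threshold=1.0, k=1.0):
    """Weber-Fechner: subjective magnitude grows with the log of the stimulus."""
    return k * math.log(stimulus / threshold)

for s in [1, 2, 4, 8, 16]:  # each doubling produces an equal perceived step
    print(f"stimulus {s:>2} -> perceived {perceived_intensity(s):.2f}")
```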
Thanks for your input,
Doug Hubbard
Thank you for your reply
Richard Watson
Mr. Hubbard,
I’m a software developer from Italy with a passion for the ‘uncertainty sciences’ the last century gave us so plentifully. I’ve read both your books and I am about to order the second edition of HTMA. In fact, I am so intrigued by the AIE methodology that I convinced some quantitatively skilled colleagues to set up a workgroup to apply AIE to some problems relevant to us.
I want to say it’s an honor to confront you with a measurement challenge. Let’s start:
The Swiss bank UBS published an article in its ‘UBS investor’s guide’, special edition April 2010, predicting the outcome of the FIFA 2010 Soccer World Cup. http://www.ubs.com/1/e/bank_for_banks/news/topical_stories/edition_10.html
You will agree this is a relevant problem, as the ‘uncertainty reduction’ on the game’s outcome will give an advantage in sports-betting.
With hindsight, they failed the prediction miserably, having claimed:
(1) Brazil is most probable winner – didn’t reach the semis
(2) Germany and Italy likely to go far – true for Germany (3rd place), but Italy didn’t survive the first round.
(3) “Spain – favored by many – will likely not do well, and could exit before the semi-final stage” – Spain won the World Cup.
UBS now has an inglorious record of 1 success in 3 attempts – World Cup 2006 went well, but European Championship 2008 and World Cup 2010 failed.
I am inclined to argue that you can’t predict the outcome of the game a priori.
1. UBS likely built a state-of-the-art econometric model, but the conclusive verdict about the rightness of the model can only be “it works”. This shows you can certainly make a sound argument about how you measure it and still fail miserably.
2. But you cannot know if your model is right or you were lucky. This is so because the experiment cannot be repeated well. The basic dilemma of social sciences: social systems are complex and adaptive. Using a model: the stochastic process is itself complex, if not random. When we cope with induction we can only believe in the stable nature of the stochastic generator. What UBS’ case tells me: there is anecdotal evidence that the underlying principles of “who wins” are not stable. You cannot say if it will work for the next FIFA World Cup or not, making it useless.
3. And probably, even if you knew the exogenous factors that influence the game, I suspect the endogenous factors in the system are much more important, making any reasonable forecast before the games start futile.
Mr. Hubbard: can you measure it?
Sincere Regards,
Roland Kofler
Roland,
Thanks for your interest and contribution. I can’t tell from the link you provided but did Swiss Bank UBS produce an actual probability? Or did they just say this would “probably” happen? That would be the first big question in evaluating their prediction. Only a prediction that was certain can be wrong based on one observation. If the prediction was a stated probability, and if the probability was – say – 60%, then a single failure is not a “failure”. I’ll explain this in more detail by responding to each of your three points individually.
1) We have to state what we mean by “it works”. As I argued in the books, a model works if it reliably predicts outcomes. In physics, a sophisticated and elegant theory has failed if it didn’t predict outcomes. But in probabilistic statements, we have to look at a larger number of examples. Now, you say their track record is 1 in 3. Surely, there are hundreds of individual matches to draw data from, not just three. Remember Assumption #2 from Chapter 2 – you have more data than you think. The link you provided says “Back in 2006, UBS Wealth Management Research (WMR) made waves when it not only correctly picked Italy to win that year’s World Cup, but also correctly picked 50% of the semi-finalists, 75% of the final eight and 81% of the final 16.” This indicates a larger number of individual predictions. But even this information is not the most enlightening about the bank’s real success. I would rather see how often they were right on individual matches and what their confidence was in each match. That brings me to the next point.
2) True, on a single event you cannot know if a probability statement is absolutely wrong or right (which is why I’m not sure you can conclude the UBS model failed until we see more details). This is why I explain in my books that you have to look at a number of trials to determine if probabilities are realistic. If the bank’s model produces specific probabilities, and it predicted 100 events with 90% confidence, another 100 events with 80% confidence, and so on, it should get about 90/100 events of the first group right, about 80/100 of the second, and so on (a sketch of this grouping check follows the numbered questions below). On a related note, “stability” of the individual games is not a requirement. (This is a key fallacy promoted by W. E. Deming followers.) Well-calibrated probabilistic forecasts are applied to many “unstable” systems – like the weather. All you have to look at is a large number of your predictions and see if the percentage you got right was about the same as your stated confidence. Look at the calibration questions I ask in the books. They are all from completely different topics. Is that “stable”? No, but you can still determine if the expected number of correct predictions is close to the actual number of correct predictions.
3) I don’t think which factors are exogenous or endogenous is the key issue. Whether the factors are internal or external, the fact is that we are uncertain about their influence, and our uncertainty can be stated.
Finally, to your question “can you measure it?”, I will, of course, say yes. I’ll invoke the first “measurement assumption” I mention in Chapter 2 – it’s been done before. Yes, this can be measured, and I know it can because well-calibrated methods already exist for other sporting events. I cited research showing that when “prediction markets” are applied to American football, and we look at all of the games where the market predicted an 80% chance of a specific team winning, that team won about 80% of the time. (If one were to argue that the reason for this is that American football is “more stable”, I would like to see the math behind that claim.)
So, in order to evaluate the bank’s model, we have to ask the following –
1) Did the bank quantify its uncertainty (e.g., “Brazil is 55% likely to win”)?
2) How many total predictions are part of the bank’s model? (Even if it were used for just a few years, it surely must cover a large number of individual soccer games.)
3) When predictions of similar confidence are grouped together, did the predicted outcomes happen about as often as expected (e.g., the prediction was right about 75% of the time that the confidence was somewhere around 70% to 80%)?
4) What are the odds that an un-calibrated method could have produced the same results? For example, the odds of this unfortunate result being bad luck would be very low if they regularly made their predictions with 98% confidence. But if they were merely 70% confident in outcomes, perhaps this result is not so unlikely.
5) What would the success rate – for a large number of matches – have been for the average unaided sports fan compared to the bank’s model? Remember, the definition of measurement is to reduce uncertainty about a quantity based on observation. It doesn’t have to be right very often if unaided intuition is even worse. In some situations, this improvement over the unaided intuition of experts can be worth a lot of money. Would a survey of sports fans, sports “experts” or astrologers have done just as well as the bank’s model? If the model was even a slight improvement on unaided intuition – by more than what can be explained by chance alone – then it worked.
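As promised above, here is a minimal sketch of the calibration grouping check from point 2 and question 3 (all predictions below are invented):

```python
from collections import defaultdict

# (stated confidence, outcome: 1 = prediction came true); data invented.
predictions = [
    (0.9, 1), (0.9, 1), (0.9, 0), (0.9, 1), (0.9, 1),
    (0.7, 1), (0.7, 0), (0.7, 1), (0.7, 1), (0.7, 0),
    (0.6, 0), (0.6, 1), (0.6, 1), (0.6, 0), (0.6, 1),
]

buckets = defaultdict(list)
for confidence, outcome in predictions:
    buckets[confidence].append(outcome)

# Calibration check: hit rate in each bucket should match stated confidence.
for confidence in sorted(buckets, reverse=True):
    outcomes = buckets[confidence]
    hit_rate = sum(outcomes) / len(outcomes)
    print(f"stated {confidence:.0%}: actually right {hit_rate:.0%} of {len(outcomes)}")
```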
Thanks again for your input,
Doug Hubbard
Touché, and a detailed answer will follow next weekend. There are epistemological problems, no doubt, and I have mixed feelings – spending most of my life in the city of Brunswick (you taught me about this Viennese mastermind – he is totally forgotten), Wittgenstein (you operate on language), and Critical Rationalism (thinking about science was ‘invented’ in Vienna), I want to argue more in detail. I dare to try you on the ‘epistemic’.
Thank you for your work!
RK
Just out of curiosity – as I want to know if such a prediction is credible at all:
UBS disclosed likelihoods of a team’s chance to reach the next round. You can see them on page 16 of the UBS Investors Guide. http://www.ubs.com/2/e/medlib/wmr/IGWM_spez_2010_en.pdf
In fact the article at page 14 explains some of the model, citation: “As in our previous studies, we rely exclusively on three factors to estimate the different winning probabilities:
1) past performance;
2) whether or not a team is a host nation;
and
3) an objective quantitative measure that assesses the strength of each team three months before the start of the World Cup. Socioeconomic factors like population size or GDP growth have been proven to have no explanatory power when it comes to forecasting the performance of a specific team.”
Roland,
Why would it necessarily have no credibility? Remember, the real test is whether it outperforms your intuition. If you are both equally well calibrated, but they turn out to be wrong 30% of the time in a large number of trials while you turn out to be wrong 40% of the time, then it’s an improvement. Heed Voltaire when he says the perfect is the enemy of the good. Measurement is about uncertainty reduction, and if a model – with all its flaws – is right more often than your previous model (intuition), then it was a measurement.
Is your question about whether such a model could conceivably outperform the intuition of the average sports expert? Why do you think it couldn’t? It’s all about results, and if the results are extremely unlikely to be due to chance alone, then the results are informative.
Doug