I’m reintroducing the Measurement Challenge for the blog. I ran it for a couple of years on the old site and had some very interesting posts.
Use this thread to post comments about the most difficult – or even apparently “impossible” – measurements you can imagine. I am looking for truly difficult problems that might take more than a couple of rounds of query/response to resolve. Give it your best shot!
Doug Hubbard
Doug, today I tried drafting several arguments for why immeasurables exist, but none of them seems to hold up.
The only thing I could say is that a false reliance on induction could harm you, because the future might change. But that is already in the TFoRM book.
(By the way, I think it would be an interesting experiment if you tried to convince us that immeasurables exist.)
More than keeping up with the challenge, I would eventually like to send you some models that our little measurement study group here in Italy comes up with. Next week is our second meeting; I will calibrate the group and then we will identify some problems to work on.
Many thanks, Roland
Roland,
So then we must compare the harm of false reliance on induction to the harm of false rejection of induction. I can never say this enough: What was the rate and magnitude of error before using a particular method and the rate and magnitude of errors after using it? All of the empirical evidence says that unaided human intuition is easily outperformed by even simple modeling.
Remember, all models are wrong. They all have error. The question is whether your previous model (intuition) really had less error. The evidence says no.
Yes, feel free to tell me more about your measurement study.
Doug Hubbard
In fact, I already did a simple experiment, translating the Chicago Piano Tuner problem into the Viennese Hair Salon Monte Carlo Simulation:
http://objektorient.blogspot.com/2010/08/freelibre-and-open-source-aie-models.html
I hope it’s okay if I copyleft this even though the inspiration came from you, as does the bin-slicing Excel formula.
Doug:
Here’s one for you, since you mention capture-recapture in HTMA. What I’d like to do is estimate the true number of events of type t in a system. I have two databases, A and B, neither of which has complete reporting of these events. I review records from database A and count reports of type t (I suspect they’re under-reported). Let’s say there are 193. I then review database B and count reports of type t. (This database is also subject to under-reporting.) There are 69. There are fields in the records that allow the analyst to identify events of type t from database A that are the same events as those found in database B. There are 20 that are common to the two counts. Can I use capture-recapture? Any caveats to its use? I come up with an estimate of 665, with a 95% CI of [476, 929].
You certainly can use capture-recapture. That is a perfect example. But, yes, you are also correct in asking about caveats. The formula I refer to works if there is no relationship at all between the events A captures and the events B captures. But if they both tend to miss the same kinds of events, then this method will underestimate the total number of events. Likewise, if A and B are more likely to be sensitive to different kinds of events, then this method will overestimate the population of events. The good news is that this kind of error is often the error people know about. You may already have an idea about whether A and B have a strong negative or positive correlation and, if so, you can at least put a “direction” on the error. If you know you are underestimating, for example, and the computed CI is 476 to 929, then you know that a value of 300 is even less likely than before, and if that is your “decision threshold”, then you have made a slam-dunk argument for the case you are making.
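For readers who want to check the arithmetic, here is a minimal sketch in Python of the standard Lincoln-Petersen estimate for these counts, plus the Chapman bias-corrected variant with a normal-approximation 95% interval. The interval method behind the [476, 929] quoted above isn’t stated, so these bounds will differ somewhat.

```python
# Minimal sketch: Lincoln-Petersen / Chapman capture-recapture estimates
# for the two-database example above (193 in A, 69 in B, 20 in both).
import math

n1, n2, m = 193, 69, 20           # counts in A, in B, and in both

# Classic Lincoln-Petersen point estimate of the total number of events
lp = n1 * n2 / m                  # ~665.9, close to the 665 quoted above

# Chapman's bias-corrected estimate with a normal-approximation 95% CI
# (one common choice; other CI methods will give somewhat different bounds)
chapman = (n1 + 1) * (n2 + 1) / (m + 1) - 1
var = ((n1 + 1) * (n2 + 1) * (n1 - m) * (n2 - m)) / ((m + 1) ** 2 * (m + 2))
half = 1.96 * math.sqrt(var)

print(f"Lincoln-Petersen estimate: {lp:.0f}")
print(f"Chapman estimate: {chapman:.0f}, approx 95% CI [{chapman - half:.0f}, {chapman + half:.0f}]")
```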
Doug
Let’s say your MC model/simulation produces an output metric with some shape and skewness and a 90% interval estimate of, say, Prob{event A} = [0.005, 0.015]. Now a colleague discovers an independent study that estimated Prob{A} = [0.001, 0.020], with no information about the shape of the estimate. Would you attempt to combine this information in some way? Perhaps use the first analysis as a prior with the second study as new information? Or convolve the two distributions somehow? Or weight them in some way?
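As an illustration of just one of those options (weighting the two estimates), here is a minimal sketch that treats each 90% interval as a lognormal distribution and pools them with inverse-variance weights on the log scale; other combinations, such as a full Bayesian update, are equally defensible:

```python
# A minimal sketch: pool two 90% interval estimates of Prob(A) by treating
# each as lognormal and weighting by precision (inverse variance) on the
# log scale. This is only an illustration of one way to combine them.
import math

Z90 = 1.645  # z-score for the 5th/95th percentiles of a normal

def lognormal_params(lo, hi):
    """Mean and sd of ln(x) implied by a 90% interval [lo, hi]."""
    mu = (math.log(lo) + math.log(hi)) / 2
    sigma = (math.log(hi) - math.log(lo)) / (2 * Z90)
    return mu, sigma

mu1, s1 = lognormal_params(0.005, 0.015)   # the Monte Carlo model's interval
mu2, s2 = lognormal_params(0.001, 0.020)   # the independent study's interval

# Precision-weighted pooling (treats the two sources as independent)
w1, w2 = 1 / s1**2, 1 / s2**2
mu_pooled = (w1 * mu1 + w2 * mu2) / (w1 + w2)
s_pooled = math.sqrt(1 / (w1 + w2))

lo = math.exp(mu_pooled - Z90 * s_pooled)
hi = math.exp(mu_pooled + Z90 * s_pooled)
print(f"Pooled 90% interval for Prob(A): [{lo:.4f}, {hi:.4f}]")
# The pooled interval is dominated by the tighter of the two estimates.
```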
I recently read and deeply appreciated your book. As I set out to apply the lessons one of the luminaries of viral marketing, Seth Godin, posted a blog entitled “On buying unmeasurable media”. I would be thrilled to see you intellectually spar with him since so much of his work is on softer topics without much data to support it.
Here’s the blog link: http://www.feedblitz.com/t2.asp?/198516/5620782/3910073/http://feedproxy.google.com/~r/typepad/sethsmainblog/~3/WKYoGnEzan4/on-buying-unmeasurable-media.html
I would be interested in connecting with Seth. I’ve been swamped for a month trying to finish my next manuscript but now that I’m done, I’ll be getting back into the blogs.
Thanks for your interest.
Doug
Hmm, here’s a measurement challenge I haven’t cracked yet at work!
A lot of businesses pay external agencies to do link-building to help their search engine rankings, and from that generate more visits and money. The problem is that search engine rankings change organically anyway, simply from things like pages staying the same. Plus there can be a lag between something changing and the search engine reflecting that in its rankings.
So how can you measure how much of the natural search uplift is due to your SEO agency’s work?
I think this one may literally be an immeasurable, at least without access to Google’s ranking algorithms 🙂
If it has any observable effect at all, then it is measurable. In fact, even if it has no observable effect, you simply measured it to be zero!
Remember the general rules: 1) it has been measured before 2) you have more data than you think and 3) you need less data than you think. Regarding the first rule, I’m sure you would find many SEO measurement methods with a little more Googling. But let’s see if we can devise something on our own using the second two rules. Specific knowledge of Google’s ranking algorithm is probably not required – especially if you are highly uncertain about this. But Google Analytics does give you a lot of tools for this.
First, a little setup. Why do you want to measure this, and how much do you know about it now? Let’s say only 20% of your traffic can be attributed to your SEO efforts. Does this mean you give up on it or keep going with it? What if it were only 5%? At some point, there is a “threshold” where some action would be taken. Define what that threshold is. Now state how much you know about it now. Is your current estimate a wide range of 10% to 80%? 0% to 90%? If it really were on the upper end of this range, you would probably have seen a dramatic effect as soon as you started the SEO effort. Let’s say it’s 0% to 50% for now.
Given a range that wide, you probably don’t need much data to make the range narrower. Remember, any reduction in uncertainty counts as a measurement! Do these visitors have to register? Have you considered sampling those who register and asking them how they found your site? If you use Google Analytics and have been tracking which search terms lead to your site, are some of these searches based on terms that you only recently added to your site due to SEO efforts? Are some of the backlinks that have been added since the SEO effort among the most productive referring sites according to Google Analytics? If so, then there is at least some traffic that you know to be new since the SEO effort started. Your range is narrower.
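As a rough sketch of how quickly a range that wide can narrow, suppose (purely hypothetically) that 12 of 60 sampled registrants say they found the site through the new search terms or backlinks. A simple Bayesian update of the 0% to 50% range might look like this:

```python
# A minimal sketch: narrowing a wide prior range with a small survey sample.
# Prior: the share of traffic attributable to SEO is somewhere from 0% to 50%.
# Hypothetical survey: 12 of 60 sampled registrants arrived via SEO-driven
# terms or backlinks. Neither number comes from a real case.
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)

prior_samples = rng.uniform(0.0, 0.5, size=200_000)   # flat prior over 0%-50%
n_surveyed, n_seo = 60, 12                             # hypothetical survey result

# Weight each prior draw by the binomial likelihood of the survey outcome
weights = binom.pmf(n_seo, n_surveyed, prior_samples)
weights /= weights.sum()

# Resample to approximate the posterior, then report a 90% credible interval
posterior = rng.choice(prior_samples, size=50_000, replace=True, p=weights)
lo, hi = np.percentile(posterior, [5, 95])
print(f"Posterior 90% CI for SEO-attributable share: {lo:.1%} to {hi:.1%}")
# The range shrinks from 0%-50% to roughly the low teens through the high 20s.
```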
Once the range is narrowed, take a look at your threshold. Did the uncertainty change enough that you are confidently on one side or the other of the threshold you identified? Then you have measured it enough to make your decision about whether SEO is worthwhile or whether you need to change the SEO strategy.
Keep me posted on how that is working for you. I’m sure others would like to know.
Doug Hubbard
One of the great problems of our time is how to measure education. In Texas, we use the Texas Assessment of Knowledge and Skills (TAKS) test, which is highly controversial. First of all, I suspect that reliance on a single measurement method is problematic. Secondly, the method requires everyone to take the test multiple times instead of using a simpler and less costly sampling methodology. Maybe the complete coverage is required to reduce errors sufficiently? I see a problem in measuring something (education) in which the result (“better” education) manifests itself at some future time (years or even decades later). Do you have any thoughts on how to measure “better” education? I assume you will include such measurements as: 1) the number of students who graduate from high school, 2) the number who graduate from college, 3) the number of scholarships awarded to graduates, and 4) the number of graduates drawing unemployment, or some similar set of measurements.
I have your book HTMA2 (as well as TFORM) and am now devoted to applying AIE.
Thank you.
Yes, this is extremely important. But we have to ask the same kinds of fundamental questions for all measurement problems. That is, what decisions could you make differently if you knew the answer, at what point would the quantity make a difference (i.e. what is the “threshold”) and how much do you know now? Are you trying to measure each child or simply the overall effectiveness of the curriculum in a school district? Are you trying to measure this in order to decide teacher bonuses? You have to figure out why you are measuring this and define the specific decisions, first. You may want to answer “all of the above” but, for now, let’s pick one to focus on.
How you answer that will determine what you need to measure and how to measure it. And whatever that turns out to be, you will need to determine how much you know about it now. If you wish to measure the effectiveness of a new curriculum, I seriously doubt you need to measure each student several times in order to significantly reduce your uncertainty. Remember, contrary to popular belief, if you have a lot of uncertainty, you don’t need much data to significantly reduce it. And if that uncertainty reduction shows that you are very likely over some decision threshold (e.g. the point at which you need to change the curriculum because it is proving to be ineffective), then you have measured it enough to make a decision. I suspect that if your decision objective for the measurement is at that level, then the TAKS is probably overkill.
Have you thought about these kinds of questions? It is the first place to start.
But I have some thoughts about what methods you may use when you do answer these questions. As I have said before in this blog and in the books, you should assume this has been measured before, that you have more data than you think, and that you need less data than you think. It is correct that education may manifest itself much later in life, but that has been studied before and you can use that research to tell you about conditions in the present that correlate with the future. I would be surprised if simple test results of children now didn’t actually correlate with the four future measures you present, and I would also be surprised if someone, somewhere, hasn’t already written a dissertation on that. If you are using this to measure not individual students but the overall effectiveness of a program, and if your current uncertainty is high, then giving a small fraction of the students very short tests of randomly selected questions would probably reduce your uncertainty.
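As a small illustration of that last point, here is a minimal sketch with made-up scores showing how narrow a 90% interval for a district-wide mean can get from testing only a small random sample of students:

```python
# A minimal sketch of the "small random sample" idea: give a short test to a
# modest random sample of students and compute a 90% interval for the
# district-wide mean score. All numbers here are hypothetical.
import math, random, statistics

random.seed(1)

# Hypothetical district: true mean score of 72 with a spread of 12, unknown to us
population = [random.gauss(72, 12) for _ in range(20_000)]

sample = random.sample(population, 30)        # test only 30 students
mean = statistics.mean(sample)
sem = statistics.stdev(sample) / math.sqrt(len(sample))
t90 = 1.699                                   # t-value for a 90% CI with 29 degrees of freedom

print(f"90% CI for the district mean score: {mean - t90 * sem:.1f} to {mean + t90 * sem:.1f}")
# Even 30 students gives an interval only a few points wide -- far narrower
# than a prior of, say, "anywhere from 50 to 90."
```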
Let me know your thoughts about the objective of the measurement and we can talk about how to design the best measurement approach for this critical problem.
Thanks for your interest in my books and thanks for contributing to the blog.
Doug Hubbard
Proger, that’s roughly where I am going. How do you:
1) measure educational effectiveness, and
2) choose between competing educational alternatives?
I am on a research junket on number 2 above. I have barely begun reading HTMA on the Amazon Kindle.
I hope you keep me posted about your discoveries with the help of Doug of course.
Thanks a million.
mf
Thanks for your question. See my response to Proger below. Is there a high degree of uncertainty at this point between education alternatives? Is there literally no research already on the topic? If so, then surprisingly little data can provide a significant reduction in uncertainty. As with all measurement problems, we should assume it has been measured before (and we should do the research to find what was done before), we should assume we have more data than we think, and we should assume that we need less data than we think. These kinds of assumptions turn out to be much more productive when searching for measurement methods. The opposite assumptions (i.e. that it was never measured before and that we have insufficient data) invariably lead to feeling stumped by the measurement problem and are also invariably wrong.
You can see my responses to Proger, but I’ll ask you another question. What do you mean by “educational effectiveness”? Proger offered some ideas about what that means, but I wanted to get your input. If you can figure out what you mean, why you care (i.e. what decisions could be different), at what point that quantity affects the decision (i.e. the threshold), how much you know about it now (your current calibrated estimate), and what observations would correlate in any way with what is being measured, then you are well on the way to measuring it.
I look forward to your comments.
Doug Hubbard
Doug:
I am reading HTMA via Kindle on Amazon. I can’t “see” the exhibits I have come across so far (5.1 and 5.2).
Are they disabled on the Kindle, or is something else going on? I am holding off on reading further until I am done with the tests.
Thanks
mf
Thanks for your observation about the Kindle. I have heard this from someone else so I’m sure you are not the only person having this problem. I am having an assistant gather and post all of the exhibits to make them available for anyone having trouble reading the exhibits with their Kindle.
I’m sure other authors have heard this, too. Kindle should provide “exhibit testing” for authors so that we can see what our charts look like on the Kindle.
Doug Hubbard
It’s been quite a while since the “challenge” went out, but I do hope you get time to reply:
How would you peg a value to ‘decision analysis’?
I mean in the context of taking a decision “NOW” vs. educating the stakeholders that it is more valuable to take an informed decision with supporting evidence, etc.
Now, how would you present a ‘business case’ for performing decision analysis? (Note this is ‘internal’ to the organization, i.e., a consultant is not hired to do this – let’s assume there is someone competent in the company to ‘perform’ this activity.) In this case time would be an important variable – both the analyst’s and that of all the people he/she intends to interview – but there is no direct cost of hiring a consultant.
The thing I’m fumbling over is this: Let’s assume you can see the future and can take decisions accordingly, i.e., you have perfect information. Now how can you calculate the EVPI of this? How would you distinguish a good decision from a better decision? (There is a ‘delta’, so it is somehow measurable.) What is the probability that the decision supported by decision analysis would indeed be better than the one taken “now – in the heat of the moment”, and how would you quantify/justify that it’s worth waiting and conducting the decision analysis to better understand what the decision entails (capturing this uncertainty)?
I’m a bit inclined towards Multi-attribute utility theory (MAUT), so I am unable to think of how to present a business case for conducting Decision Analysis using MAUT.
The comment often made is “It’ll just take more time and MAUT won’t give us a perfect decision since the future is unknown. We’d rather take a decision today and move on with it – we could correct our course later”
The case for decision analysis seems measurable, but the “how” has me scratching my head quite a bit!
Any ideas? 🙂
Thanks for your question. I routinely compute the value of decision analysis itself, since I compute the value of all of my analyses as part of the standard deliverable. We work out the value of the uncertainty reduction as I describe in chapter 7 of HTMA. If we reduce uncertainty about a decision, we change the odds of choosing an economically inferior strategy. Roughly speaking, the cost of being wrong times the chance of being wrong (i.e. the Expected Opportunity Loss, or EOL) is reduced by the analysis. Since the chance of unfavorable outcomes is computed with methods I discuss in chapters 5 to 7 of HTMA, we can compute the change in EOL. There is also quite a lot of sound scientific research about how methods like this cause estimates and decisions to improve.
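To make that arithmetic concrete, here is a minimal sketch with purely hypothetical numbers; the cost of being wrong and the before/after probabilities are illustrative only:

```python
# A minimal sketch of the Expected Opportunity Loss (EOL) idea, with
# hypothetical numbers: choosing the wrong strategy costs $2M, the chance of
# being wrong is 25% before the analysis and 10% after it.
cost_of_being_wrong = 2_000_000
p_wrong_before = 0.25
p_wrong_after = 0.10

eol_before = cost_of_being_wrong * p_wrong_before   # $500,000
eol_after = cost_of_being_wrong * p_wrong_after     # $200,000

value_of_analysis = eol_before - eol_after          # reduction in EOL
print(f"Rough value of the uncertainty reduction: ${value_of_analysis:,.0f}")
```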
But the value of a method like MAUT is trickier for two reasons. First, it doesn’t actually forecast anything objective and, therefore, it is not clear how to verify – even in retrospect – whether the outcome of a MAUT analysis is right or wrong. The theory is simply that if your own preferences are logically consistent, you will be more satisfied. Unfortunately, research in decision psychology has generated findings that complicate this. Our preferences change for random, unrelated reasons that we are unaware of. For example, you are more risk tolerant if you are around smiling faces, if you are angry, or if your testosterone levels are higher than usual. You even tend to reengineer your memories of preferences based on immediate conditions. Our subjective utilities appear to represent only very temporary conditions of our emotional states; they change frequently, and for reasons that should have nothing to do with the decision at hand. But we do know that some forms of decision analysis appear to increase confidence in decisions even when the particular method being used made forecasts and decisions measurably worse. Much of the perceived benefit of tools like MAUT may be a kind of placebo effect. I cite sources for all of this research in the second edition of HTMA.
The second reason that making the case for MAUT is difficult is because – unlike some other methods – there is actually no empirical research showing that decisions are even any better in a measurable way. Some research has shown that decision makers feel slightly more satisfied with some methods than others but that doesn’t mean the decisions are actually better. Ideally, we would like to have a large controlled experiment like a kind of tournament for forecasting and decision making. But research has rarely gone beyond showing that users of many decision analysis methods simply feel more confident (which we know they would feel even if it didn’t work, because of the placebo effect).
Fortunately, there are methods that show a measurable improvement in estimates, forecasts and decisions. MAUT just isn’t one of them, yet (but it may be if anyone manages to actually conduct a real-world experiment with sufficient data points). The methods that have been shown to work with overwhelming empirical evidence in controlled studies are the following:
1) calibration training to teach people how to provide better probabilistic estimates
2) quantitative historical models to forecast objectively observable outcomes
3) the use of quantitative modeling methods with simulations
4) the use of the Lens method for removing expert inconsistency
5) models based on empirical measurements (my method helps to identify and prioritize which uncertain variables justify empirical measurement efforts)
I cite sources for these as well in HTMA. The difference between this and MAUT is that almost all of the inputs and outputs are objective values. They may be initially subjective estimates of objectively observable values but the original subjective estimate can at least be eventually evaluated as correct, incorrect, or close enough. There is no way to determine if your stated utility curves were the “correct” ones since they state nothing except the preferences you had at the moment you defined them (which we now know changes for arbitrary reasons the decision maker is not aware of).
There is one unavoidable tradeoff that is a subjective utility curve, and all of my models include at least this one: we have to subjectively trade off acceptable risk vs. return. Given the problems with preference statements discovered just in the last few years by experimental psychology studies, it’s a good idea to minimize the use of purely subjective tradeoffs in a model – although some will be unavoidable. But most really important decisions involve at least some forecasts of objectively observable outcomes. It’s not all just a matter of utility curves. If a DA method works, we should be able to show that estimates, forecasts, and decisions were – after some significant number of trials – a measurable improvement on alternatives like pure intuition. The Lens method, for example, can show measurable reductions in error in a variety of forecasts, like business failures, cancer prognoses, and the success of graduate students.
Good luck with the business case. If you find any research that shows that MAUT has any positive, measurable effect (other than simply the confidence of the users) over a large number of trials, then let me know. My hypothesis is that the real answer may be small or zero. Given the importance of many decisions made with this method, I hope someone tests it soon.
Thanks for your input,
Doug Hubbard
Hi Doug,
I feel genuinely obligated to start off by saying “what a great book!” I have purchased innumerable statistics books and always have been stuck at the front door. Your book is the comfortable foyer that I needed to get a proper introduction to and understanding of the guests inside those other books. Thanks!
OK, now for a crass and shameless challenge:
I am an investor and I develop trading systems for investing my retirement funds in the financial markets. The systems are by their nature prone to Data Mining and Data Snooping. I want to rank my inventory of systems and select the best system(s) to trade, and I don’t want it (or them) to underperform after selection. How do I pick the system (measure = historical performance rank) that is most likely to perform going forward as it has in the past (measure = reliability) and that will likely perform better than the other systems in my inventory (measure ==> best rank = best forecasted return = best return)?
My many issues with this question:
a) Have I asked the right question so that it lends itself to analysis (I think I have)?
b) Do I need “high level” statistics to deal with the Data Mining/Snooping issues (they sound scary and somewhat intractable – and were discussed briefly in a stats book by David Aronson)?
c) Is there a better decomposition that I am missing?
d) How do I iterate using easy measures first?
And yes, as you can imagine, the value of perfect information on this question is, well, huge!
Thanks in advance, – Carl
Carl,
These are good questions. In fact, these are exactly the questions that matter. You need to know if your system is working. But you might want to be prepared that the measurement might show marginal or no benefit for the systems. Zero is a possible answer. I only say that because I’m generally skeptical of methods that claim to beat the market consistently. But I am certainly keeping an open mind.
I don’t think the methods need to be all that advanced, but I would recommend the use of a concept called the “p-value,” which gets at the statistical significance of a measurement. You want to ask what the chance is that two unrelated, random variables could show a given correlation or higher as a random fluke. If you look at large numbers of data points – say, hundreds or thousands – a correlation of .9 or better would be extremely unlikely with two random variables. So the key question you need to answer is not just whether one system is performing better than the others, but what the chance is that this could be a fluke. In my second book, I talk about what I called the “Red Baron” effect. Two electrical engineering professors wrote an article asking the question “Was the Red Baron good or lucky?” They considered the fact that he had 80 kills in WWI in the context that there were over a couple of thousand German fighter pilots in WWI and that the average chance of a victory in any encounter was 0.85 or higher for the average fighter pilot. Taking these values into account, it appears that even if all German fighter pilots were equally good, there is a 0.3 chance that at least one of them would have had 80 kills.
So the best question for you to ask is not just whether one system outperforms the others, but what the chance is that this performance is a random fluke. If you hold a coin-flipping contest with 1,000 entrants to see who can flip the most heads in a row, someone is going to be the winner. But that tells us nothing about whether that individual really has a special skill at flipping coins. If the winner had flipped, say, about 10 heads in a row, we shouldn’t be impressed; that’s about what we would expect from the best performer out of 1,000. But if the winner flipped 40 heads in a row, now we should start considering the possibility that the winner actually has some kind of skill or is cheating. Even out of 1,000 contestants, there is only about a one in a billion chance of seeing one person flip 40 heads in a row.
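The arithmetic behind that coin-flipping claim is short enough to sketch directly:

```python
# A minimal sketch of the coin-flipping contest arithmetic above: among 1000
# contestants each flipping until their first tail, what is the chance that at
# least one of them gets a streak of k heads purely by luck?
def p_at_least_one(k, contestants=1000):
    p_single = 0.5 ** k                          # one contestant's chance of k heads in a row
    return 1 - (1 - p_single) ** contestants     # chance that at least one contestant does it

for k in (10, 20, 40):
    print(f"k = {k:2d} heads in a row: {p_at_least_one(k):.2e}")
# k = 10 is quite likely for the best of 1000 (so it is unimpressive),
# while k = 40 works out to roughly one chance in a billion.
```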
I think this might be a trap for many financial analysts, but it is not limited to them. Scientific publications have shown a “publication bias”, meaning that scientists are more likely to publish an interesting result. That means that if they throw away 5 studies for every one they publish, the calculation of a p-value will be off. The chance that some observed correlation is due to chance is actually higher than they indicate in the study, because they are now taking the best of several.
Let me know if that is sufficient to get you started. You will have to do a little research on this point but it is fairly basic. It may show that observed ranks are just random flukes. But if you find that one is so much better than the others that the chance that it is a fluke is extremely unlikely, then you have a very interesting and powerful finding.
Doug Hubbard
Mr Hubbard:
As a clinically oriented anesthesiologist I am impressed by the writing (clear, concise, not too jargony) and encouraged by the message.
We’ve given a lot of thought to how (and why) we measure patients’ perception of acute pain (chronic pain is a completely different game). We use the Visual Analog Scale, developed many years ago. I wonder what your thoughts are about that scale and how we might improve our measurements. This gets at the issue of efficacy, which is a big “quality” indicator in the post-hospitalization surveys that get sent out randomly to our patients. Your ideas have stimulated my ever-inquisitive mind and for that – and that alone – this book was well worth the time and money. I have recommended it to several people in the business already. Any thoughts you have will be much appreciated.
Thanks for your interest.
It is interesting that you mention pain scales. We have a family friend who is a doctor at a pain clinic. I once mentioned to him that I would be interested in interviewing him about the pain scale (I think he said they use some 1 to 10 scale) for measuring pain. I thought it was an interesting measurement problem that could merit some space in the second edition of the book. But I didn’t get around to discussing it with him in detail. Is the Visual Analog Scale the one that uses the “smiley face” to “frowny face” spectrum of responses?
I think it is an important issue because it gets at the heart of something that seems like it should ultimately be subjective. So one might wonder how such a scale would be validated. I suppose pain has a bearing on activities, and a pain scale should have some predictive power about the activities of individuals. In my next book – Pulse: The New Science of Harnessing Internet Buzz to Track Threats and Opportunities – I discuss, among many other things, research which uses Twitter, Facebook and Google to track economic, social and even health trends. I also discuss research which uses accelerometers in mobile phones to track movement and predict illness. I suspect that a similar method could be used to infer pain from changes in movement.
I think that people who are experiencing certain sorts of pain probably move differently in both subtle and obvious ways. I think it would be hard to deliberately fake this behavior over a long period of time. Detailed activity tracking might be a much more objective measure of pain. Details about sleep patterns and eating patterns are also likely indicators of pain – but perhaps not so much in self-reported surveys. I refer to using small and cheap tracking instruments, many of which may already be approximated by our mobile phones. This takes the Skinnerian approach to pain: we only study what we can actually detect objectively, and we avoid the problem of comparing subjective experiences.
I will read a bit more about the research behind the Visual Analog scale. Feel free to stay in touch.
Doug Hubbard
Hi Mr. Hubbard — I’m enjoying the book a lot. As a former director of quality assurance for social services agencies I’ve often taken the position that anything can be measured. One of the hardest things to measure is progress of preschool children. They are not yet literate or able to do complex math but are developing important pre-literacy and pre-numeracy (sorry) skills. Generally I’ve settled on accomplishment of developmental milestones within fairly wide ranges as an outcome measure. I have also looked at fidelity to a preschool educational model. These are both somewhat unsatisfying and I wondered if you had any other ideas. Thank you. AB
Sorry, on re-reading I realize I left out an important point: we’re trying to judge the quality of pre-school education. Thanks for your thoughts.
Hi Doug,
I thought the book was excellent and I am already making practical use of it in the area of project portfolio management, so many thanks.
One comment I would make is in regard to the passage in the book describing the thoughts of Stephen J. Gould on the subject of IQ tests. As you probably know, Gould’s argument in “The Mismeasure of Man” was that tests for intelligence were primarily a tool to prove the superiority of one culture or race over another. I think his scepticism with regard to IQ tests was not that they fail as a basis for measurement per se, but more that they fail as a measurement of intelligence. Intelligence as a concept is ambiguous at best and, as you suggested at the start of the book, needs to be unpacked (including the motives for requiring measurement in the first place) before it can be meaningful. In the case of an IQ test, what you would be measuring would be the ability to solve a particular set of problems. In the case of measuring potential brain damage this would be useful. It would not, however, tell you if a particular subject was less or more intelligent after the incident.
Anyhow, fantastic book and more power to your elbow.
Regards
Brian Calcott
Brian,
Thanks for your input. But I was refuting Gould specifically regarding his claim that IQ tests are not a measure of intelligence. He stated that the IQ score is nothing more than an “artifact” of the mathematical method used to compute it. This alone is an easily testable and falsifiable claim. If it really were nothing other than an arbitrary artifact of a calculation, we should not see any correlations between this score and other measurable phenomena like income, incarceration, welfare, and so on. And yet we do see correlations between IQ scores and behaviors we would generally associate with high or low IQs.
IQ tests do, however, have lots of error for all of the reasons Gould lists. But the complete lack of error was never the criterion for a measurement. Even a test that has 90% confidence errors of +/-20 points two thirds of the time and then gives a completely spurious result a third of the time is still a reduction in uncertainty. I think this is where IQ test skeptics miss the point. They seem to be saying that if an IQ test even *can* give a wrong answer then it is measuring nothing. This is a misunderstanding of how the term measurement is used in the sciences.
What they are not considering is the previous state of uncertainty before a test. Suppose you have persons A and B. You know nothing about them prior to a test. You could only say that there is an equal likelihood of either having the higher intelligence. Then they take an IQ test. Person A scores 145 and person B scores 95. Now would Gould say that this result would not even justify a probability of 70% that A has the higher intelligence? How about just 60%? What if the test results were 165 and 80, respectively? Would the chance that A has higher intelligence than B still be only 50%? I think if we repeated this test on many different pairs of people, and you had to place real-money bets on who would perform better at other tasks associated with intelligence (however it is defined), like college grades, income, or ability to write, you would prefer to keep betting on the person with the higher tested score, especially if the differences were 30 points or more.
Again, it is important to separate the claim that a measure has a lot of noise and little signal from the claim that it is only noise and no signal. Gould’s statement that IQ is merely an artifact of the method used to calculate it would imply he believes the latter. This would mean that no matter how far apart two people were on their IQ test scores, and no matter how many trial pairs of people we use, we would never have any reason to believe one person is even slightly more likely to be more intelligent than the other. When you think it through, this is an extraordinary claim.
Of course, we have no basis for being certain that a person who scores 125 is actually more intelligent than one who scores 105. Maybe we couldn’t even be certain if the difference were twice as wide. But measurement isn’t about achieving certainty; it’s about reducing uncertainty. For example, would you really bet your own money that pairs of people whose test scores are 40 points apart are indistinguishable from pairs with the same score? How about 60 points apart? How about 100? Is there seriously no point at which the probability that one person in the pair is the more intelligent budges from 50%? I know where I would bet my money on repeated tests.
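A small simulation makes the betting argument concrete. The ability distribution and the amount of test noise below are assumptions chosen only for illustration, not claims about actual IQ test error:

```python
# A minimal sketch of the betting argument. Assume (hypothetically) that true
# ability is normal(100, 15) and a test score is true ability plus a large
# amount of noise, normal(0, 15). Even with that much noise, how often does
# the higher scorer in a pair actually have the higher true ability?
import random

random.seed(2)
wins = trials = 0
for _ in range(200_000):
    true_a, true_b = random.gauss(100, 15), random.gauss(100, 15)
    score_a = true_a + random.gauss(0, 15)
    score_b = true_b + random.gauss(0, 15)
    if abs(score_a - score_b) < 30:
        continue                       # only bet when scores differ by 30+ points
    trials += 1
    wins += (score_a > score_b) == (true_a > true_b)

print(f"Bets won by backing the higher scorer: {wins / trials:.1%} of {trials} bets")
# Well above 50%, so even a noisy score reduces uncertainty about the pair.
```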
Regarding your example, why would it necessarily be the case that IQ tests before and after a brain injury tell us absolutely nothing? Are you saying that even if a person consistently scored above 150 before an injury and cannot score more than 100 after it, we can still conclude nothing at all about a change in intelligence? Are you saying that even if the difference were twice as large (a 100-point difference), we could still learn nothing? That also seems like an extraordinary claim to make.
If IQ tests reduce uncertainty even slightly even just some small percentage of the time, it meets the criteria of a measurement in its strictest mathematical sense. And when it comes to important issues of public policy (such as the example I used where methyl mercury is correlated to reduced IQ points of children), a slight uncertainty reduction is preferable (perhaps the equivalent of millions of dollars preferable) to no uncertainty reduction at all.
Thanks,
Doug Hubbard
Potential for organizational change.
I am thinking of measuring the level of resistance to accepting change — what’s it going to take to blast an unproductive culture out of behaviors and methods that they freely admit are not working? (This is not academic or frivolous.)
A followup on the analysis of IQ as a measurement of intelligence (and then onto a more general point):
It seems to me that M reducing uncertainty in evaluating V is a rather low hurdle for calling M a measure of V. By this criterion, wouldn’t Income be a measure of Intelligence?
Clearly, to say Income is a measure of anything other than Income makes the concept of measure unnecessarily confusing and much more prone to be misapplied. In this light, I would say IQ is only a measure of what has been defined as IQ. The question then becomes how well does IQ correlate with what we believe Intelligence to be, which seems to me to be a much more clear debate than whether IQ is a measure of Intelligence.
The main policy problem with substituting IQ for Intelligence is not so much that the correlation is not 100%, but rather that the correlation seems to vary quite a bit among different populations. In so many cases, the method for deriving a concrete measure that correlates well with an intangible concept is very context sensitive. For example, a concrete measure that correlates well with intelligence in western populations may not correlate well in African populations. A concrete measure that correlates well with what we think of as Productivity on a factory floor in company A may not correlate well on factory floors in company B, or in software development offices in either company. To call the measure Productivity instead of what it directly measures makes it too easy to overlook this contextual factor and start applying a Productivity measurement where it does not apply.
This issue is easier to understand if our language for measurement makes it clear that:
1. We can only directly measure concrete measurements (like IQ, throughput, income, MPG, age, …),
2. We cannot directly measure intangible concepts (like Intelligence and Productivity),
3. The book is about how to derive concrete measurements that correlate highly to intangible concepts in given contexts, not how to directly measure intangible concepts (and especially not in a context-independent way).
Steven Gordon, PhD
Steven,
Thanks for your input. I would argue that whether a measurement is “direct” or not is not an unambiguous distinction mathematically, semantically, or at the level of fundamental epistemology. The distinction adds no clarity whatsoever because most – or perhaps all – measures of properties in the physical sciences are not “direct”. They rely on readings of an instrument (e.g. a digital scale, an ohmmeter, or a photometer) and on inferences from related observations (e.g. observing the deflection of a particle in a magnetic field or the movement of a planet to measure mass). I would even argue that we only perceive reality indirectly in the first place. All we can do is make observations and draw inferences that ultimately have some utility for practical decisions.
In the definition I propose for measurement (which is, in fact, consistent with information theory, decision theory and measurement theory), income really is a measure of IQ, which is, in turn, a measure of intelligence. Of course, this would be a measure with a huge amount of error, but, on average, estimates based on income would be slightly more accurate than estimates based on no information at all. If you doubt this, let me propose a game that would test whether you really believe your position. Let’s identify 100 people who have had IQ tests representing a wide range of IQs. We will know nothing about them and we will not be told their IQs, except that I will be told their income from last year and you will not. We will sort them into random pairs and we will bet on which person is smarter. You should be indifferent because you have no information about whether one person has a higher IQ. But since I know their incomes, I will not be indifferent. I will, for each pair, bet that the person with the higher income also has the higher IQ. Each time I’m wrong I pay you $100, and each time I win you pay me $100. I would love to play this game for a million people if we could. If I’m right only slightly more often than I’m wrong, then I would make a lot of money off of you. I would even be willing to pay you $500 to play this game with me for 100 persons. Now would you prefer to be the one of us who had the income information? Would you even be willing to pay me for it? You might say you are not a betting person, but I think betting is simply the ultimate test of whether a person really believes some position they hold. You might say this is impractical, but I think we could find a way to make it happen.
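Here is a minimal sketch of that game under an assumed income-IQ correlation of about 0.3; the correlation is illustrative, not a claim about the actual value:

```python
# A minimal sketch of the income/IQ betting game with an assumed, deliberately
# modest correlation of 0.3 between (standardized) income and IQ. How often
# does "the higher earner has the higher IQ" win, and what does that do to a
# $100-per-pair bet over many pairs?
import random

random.seed(3)
R = 0.3                      # assumed income-IQ correlation (illustrative only)

def person():
    """One person with a standardized 'IQ' and an income correlated with it."""
    iq = random.gauss(0, 1)
    income = R * iq + (1 - R**2) ** 0.5 * random.gauss(0, 1)
    return iq, income

wins = losses = 0
for _ in range(100_000):
    (iq_a, inc_a), (iq_b, inc_b) = person(), person()
    if (inc_a > inc_b) == (iq_a > iq_b):
        wins += 1
    else:
        losses += 1

print(f"Win rate betting on the higher earner: {wins / (wins + losses):.1%}")
print(f"Net result at $100 per pair: ${100 * (wins - losses):,}")
# Even a weak correlation makes the income-informed bettor a steady winner.
```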
The fact is that even when there is a small correlation between X and Y, knowledge of X slightly reduces your uncertainty about Y. And I would say that any definition of measurement that ignores the reduction in uncertainty would be far more confusing. You would have to resort to a definition that merely describes a procedure, regardless of whether the result is informative. And if you said that the procedure only results in a measurement if it is informative, while rejecting the idea that the threshold is any uncertainty reduction at all, then you would have to define how much uncertainty reduction is required. Presumably, you would have to define this arbitrary threshold differently for every possible kind of measurement, since error in measurements varies widely among fields.
Taking your claim that “to say Income is a measure of anything other than Income makes the concept of measure unnecessarily confusing and much more prone to be misapplied” to its logical conclusion, you would have to reject the validity of most measures in the physical sciences since, as I said, most are inferences based on indirect observations (nobody has ever “directly” measured the mass of an electron or a star, the age of a rock, or even time itself). Your statement that “IQ is only a measure of IQ” appears to say that if X is correlated to Y, X is still only a measure of X and reveals nothing about the possible values of Y. If that is the case, then you must also hold that the glow of a hot body is a measure only of its glow, not its temperature. You must hold that the depth of a rock formation is only a measure of its depth, not its age. You must hold that a credit score is only a measure of a credit score and says absolutely nothing about whether someone is an acceptable lending risk – meaning that 1,000 people with credit scores under 600 are just as good a credit risk as 1,000 people with scores over 750. There will be exceptions, of course, but given a portfolio of 1,000 people, do you really want to charge the same interest to the under-600 group as to the over-750 group?
You stated: “The main policy problem with substituting IQ for Intelligence is not so much that the correlation is not 100%, but rather that the correlation seems to vary quite a bit among different populations. In so many cases, the method for deriving a concrete measure that correlates well with an intangible concept is very context sensitive. For example, deriving a concrete measure that correlates well with intelligence in western populations may not correlate well in African populations.” How is this more uniquely “context specific” than any of the other measures I just mentioned? As before, if this context-specificity you speak of were any real obstacle to measurement, then again, most of what we know from science would not be possible. There is no complexity you can think of that does not have a measurement solution. Many of the procedures in the scientific method exist specifically to control for such issues.
Or perhaps you believe that it is not the “directness” but the mere existence of error that undermines any measurement. There is a common fallacy that the existence of noise means a lack of signal. This would force you to defend yet another position that would be impossible to reconcile with all scientific knowledge. Or perhaps you believe that the existence of noise (i.e. error) in a measurement means that any signal that does exist must not have any utility. That would be impossible to reconcile with both decision theory and common sense. If you knew a coin favored heads on a flip just 52% of the time, that knowledge would be worth a lot of money over a large number of flips.
On the other hand, I don’t deny that people misuse measures. Regarding your comment about productivity measures, if someone actually equates a measure of productivity on a factory floor with productivity in creating software, I could only respond that this is a straw man argument. I make no claim that two obviously uncorrelated factors say anything about each other. I also point out in How to Measure Anything that most organizations measure the wrong things (because they are not computing the value of uncertainty reduction). If they applied this method consistently, they would measure what matters. And, as a side note, don’t confuse measures used for decisions like project approval with incentive programs. They are two different issues. Good incentives are based on good measures, but that doesn’t mean any measurement should be part of some incentive. I clearly argue against this in the book.
And don’t forget that you always are comparing a decision analysis method based on measurements to some *other* method – presumably your intuition. That also has a measurable performance and research shows it is often not hard to beat with even simple quantitative methods.
Again, if you really believe what you believe, then let’s start recruiting some people for our bet.
Thanks for your contribution to the conversation and I look forward to your response.
Doug Hubbard
Douglas,
We are developing a new financial planning procedure whose target is to reduce the gap between forecast and actual net financial position from the current accuracy of ±20 million with 30% likelihood (the greatest difference between actual and forecast) and ±5 million with 70% likelihood (the narrowest difference between actual and forecast) to a new accuracy of ±15 million with 10% likelihood and ±5 million with 80% likelihood. The cost of the investment is EUR 500K.
Is it possible and meaningful to measure the value of the better information available from the new financial planning procedure using EVPI, as described in the chapter “Measuring the Value of Information” in your book How to Measure Anything (the section “The Value of Information for Ranges”)?
Otherwise, how would you suggest better framing the problem of the value of information?
Thank you in advance for your answer.
stefano palestini
internal auditing & risk management
Stefano,
This sounds like a perfectly viable problem for the value of information. However, I would need clarification on something you said. It seems you are saying that the current accuracy of forecasts to actuals has a 30% chance of being within 20 million and a 70% chance of being within 5 million, right? That seems backward to me. The wider range should have the higher probability unless I misunderstand how you are stating this. The same would seem to hold for the target accuracy.
But, that aside, improving the accuracy of forecasts is certainly something the information value pertains to. You have a cost of overestimating and/or a cost of underestimating, right? How much would you lose for every million you over- or underestimate? The product of this “loss function” and your probability distribution for the forecast error is the expected opportunity loss (EOL). You have an EOL for the current state and for the desired target accuracy. The difference between the two EOLs is the value of the information. Technically, you would only be computing EVPI if you were comparing the value of your current accuracy to perfect forecasts, which I’m sure is not what you mean.
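As a sketch of that calculation, here is a Monte Carlo version with illustrative assumptions only: roughly normal forecast errors calibrated to the stated hit rates, and a hypothetical loss of EUR 20,000 per million of forecast error:

```python
# A minimal sketch of the EOL comparison, using Monte Carlo. The error
# distributions and the loss per million of error are illustrative
# assumptions, not figures from the actual case.
import numpy as np

rng = np.random.default_rng(4)
n = 1_000_000

# Assume forecast errors are roughly normal; the spreads are picked so that
# about 70% of current errors and 80% of improved errors fall within +/- 5M.
current_errors = rng.normal(0, 4.8, n)    # millions; ~70% within +/- 5M
improved_errors = rng.normal(0, 3.9, n)   # millions; ~80% within +/- 5M

loss_per_million = 20_000   # hypothetical cost (EUR) per million of forecast error

eol_current = np.mean(np.abs(current_errors)) * loss_per_million
eol_improved = np.mean(np.abs(improved_errors)) * loss_per_million

print(f"EOL per forecast, current procedure:  EUR {eol_current:,.0f}")
print(f"EOL per forecast, improved procedure: EUR {eol_improved:,.0f}")
print(f"Value of the improvement per forecast: EUR {eol_current - eol_improved:,.0f}")
# Multiply by the number of forecasts affected and compare against the
# EUR 500K cost of the new procedure.
```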
Thanks for your comment,
Doug Hubbard
The likelihoods come from observing the differences between the forecast quarterly financial position and the actual over 3 years (12 observations). In 9 cases (70%) the difference was within ±5 million, and in the 3 other cases it was within ±20 million.
Thank you in advance for your answer.
stefano palestini
Hi Doug,
I have just finished reading How to Measure Anything and found it immensely valuable. The organisation I work for (a public sector transport agency in Australia) reaped the benefit of it (even before I finished reading chapter 9!) as I ran a regression analysis and quantified the link between infrastructure maintenance and reliability. Your book has aided me in linking hitherto unrelated areas of data and information with a view to bolstering and enhancing performance. So – thank you.
Another area I am starting to turn my attention to, which I thought you may have an interest in from an HTMA perspective, is that of quantifying the benefits of ‘reform’. Reform programs are often undertaken with rubbery objectives and vaguely defined benefits. Process can often dominate outcomes. I think there are some powerful applications of your work in being able to more clearly and quantifiably link the process inputs which form the basis of a reform program with tangible outcomes. More sharply defining reform (“what exactly do you mean by ‘reform’?”) and correlating associated changes with performance measures such as reliability and customer satisfaction are promising areas here.
One question if I may: do you have any recommended further reading in the areas of Monte Carlo simulation and regression? I need to dig deeper, especially regarding regression, to find out how many pairs of data points are required for a regression to be valid, and how to deal with coefficients and p-values changing depending on whether a variable is treated in a simple regression or a multiple regression. My preferred reading would have plenty of worked examples (which is a key strength of your book).
Thanks again.
Michael Carman
Michael,
Thanks for your comments. Regarding further reading on Monte Carlo, you might have noticed I’m a fan of Sam Savage’s work. But I also provide webinars and seminars on the topic. You can check out the details at https://hubbardresearch.com/store/merchant.mvc?Screen=CTGY&Store_Code=HDR&Category_Code=Training.
Regarding regression, “validity” has to consider your prior state of uncertainty. If you use a tool like Excel for regression, the error of the estimate is based on a z-stat, which is recommended for sample sizes greater than 30. But there are other regression methods that don’t have that kind of constraint.
Suppose your current subjective 90% CI for Y is 2 to 15. Now suppose you have 8 values for X and Y: (1.1, 1.2), (2.1, 2.5), (4.5, 3.9), (5.1, 5.5), (6.6, 7.1), (8.5, 8.6), (9.0, 9.1), (9.5, 9.5). Plug those into a scatter chart in Excel if that helps. I also tell you that the X value for the Y you wish to estimate is 4.1. Given what you have seen with these 8 samples of (X, Y), you wouldn’t really think it is very likely for Y to be something over 10, would you? This is especially useful if you had a reason to believe there should be a relationship – like foot traffic in front of a store and sales, or years of experience at some task and the score on a related test.
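For anyone who wants to check this example, here is a quick fit of those 8 points (in Python rather than Excel):

```python
# A quick check of the 8-point example above: fit a simple linear regression
# and estimate Y at X = 4.1.
from scipy import stats

x = [1.1, 2.1, 4.5, 5.1, 6.6, 8.5, 9.0, 9.5]
y = [1.2, 2.5, 3.9, 5.5, 7.1, 8.6, 9.1, 9.5]

fit = stats.linregress(x, y)
y_hat = fit.intercept + fit.slope * 4.1
print(f"slope = {fit.slope:.2f}, intercept = {fit.intercept:.2f}, r = {fit.rvalue:.3f}")
print(f"Predicted Y at X = 4.1: {y_hat:.1f}")
# The fit is tight (r close to 1), so a Y anywhere near 10 at X = 4.1 is now
# very unlikely compared with the original subjective 90% CI of 2 to 15.
```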
I hope that helps.
Doug Hubbard