Originally posted at http://www.howtomeasureanything.com, on Wednesday, February 11, 2009 2:16:38 PM, by andrey.
“Hello Douglas,
First of all, let me say I have thoroughly enjoyed reading your book. I have a technical background (software engineering) and have always been surprised at how “irresponsible” some business-level decision making can be – based on gut instinct and who-knows-what. This ‘intuitive’ approach is plagued with biases and heuristics, and the effects of such an approach have been widely publicized (for example here). This is one of many reasons I found your book very stimulating and the AIE approach as a whole very promising.
However, I have reservations about a few points you make. Please forgive my ignorance if my questions are silly; my math has become rusty with the years passing by.
One of my concerns is the validity of an assumption that you make when explaining ‘The Rule of Five’ and 90% CIs, and especially when using Monte Carlo simulation. I can believe (although it would have been great to see the sources) that ‘there is a 93% chance that the median of a population is between the smallest and largest values in any random sample of five from that population’. But when you apply this to Monte Carlo simulations, you assume that the mean (which is also the median for symmetric probability distributions) is exactly in the middle of the confidence interval. This, I think, makes a big difference to the outcome because of the shape of the normal distribution function. If you assume the median is, for example, very close to the lower or upper bound of the confidence interval, and put a different value into the =norminv(rand(),A, B) formula, the results would be different.
I am still working through your book (second reading), trying to ‘digest’ and internalise it properly. I would be very grateful if you could explain this to me.
Thank you very much,
Andrey”
Thanks for your comment.
I don’t show a source (I’m the one who coined the term “Rule of Five” in this context) but I show the simple calculation, which is easily verifiable. The chance of randomly picking one sample with a parameter value above the true population median for that parameter is, by definition, 50%. We ask “what is the probability that I could happen to randomly choose five samples in a row that are all above the true population median?” It is the same chance as flipping five coins and getting all heads. The answer is 1/2^5, or 3.125%. Likewise, the probability that we could have just picked 5 in a row that were all below the true population median is 3.125%. That means that there is a 93.75% chance that some were above and some were below – in other words, that the median is really between the min and max values in the sample. It’s not an “assumption” at all – it is a logically necessary conclusion from the meaning of the word “median”.
You can verify this experimentally as well. Try generating any large set of continuous values you like using any distribution you like (or just define a distribution function for such a set). Determine the median for the set. Then randomly select 5 from the set and see if the known population median is between the min and max values of those 5 samples. Repeat this a large number of times. You will find that 93.75% of the time the known median will be between the min and max values of the sample of 5.
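You can also run that experiment in code instead of a spreadsheet. Here is a minimal sketch in Python (the lognormal population below is an arbitrary choice, just to show the result does not depend on the shape):

import random
import statistics

# An arbitrary, deliberately skewed population (lognormal is just one choice)
population = [random.lognormvariate(0, 1) for _ in range(100_000)]
true_median = statistics.median(population)

trials = 100_000
hits = 0
for _ in range(trials):
    sample = random.sample(population, 5)
    # Is the known population median between the sample's min and max?
    if min(sample) < true_median < max(sample):
        hits += 1

print(f"Median inside sample range: {hits / trials:.2%}")  # prints roughly 93.75%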
I believe I also made it clear that this only applies to the median and not the mean. I further stated that if, on the other hand, you were able to determine that the distribution is symmetrical, then, of course, it applies to the mean as well. Often you may have reason to do this, and it is no different from the assumption in any application of a t-stat or z-stat based calculation of a CI (both of which assume a symmetrical distribution).
Furthermore, you certainly should not use a function that generates a normally distributed random number if you know the quantity is not normally distributed, and I don’t believe I recommended otherwise. If you know that the median and the mean of the distribution you intend to generate are not the same, then you can’t count on Excel’s norminv function to be a good approximation. For the sake of simplicity, I gave a very limited set of distribution functions for random values in Excel, but we can certainly add a lot more (triangular, beta, lognormal, etc.).
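For example, Python’s standard library already includes generators for several of those shapes. A sketch of the same one-draw-per-trial idea (all parameter values below are placeholders, not calibrated estimates):

import random

triangular_draw = random.triangular(5, 15, 8)     # low, high, mode
beta_draw = random.betavariate(2, 5)              # alpha, beta (returns values in 0..1)
lognormal_draw = random.lognormvariate(2.0, 0.4)  # mean and sigma of ln(X)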
My approach is not to assume anything you don’t have to. If you don’t know whether the distribution is lopsided, you can simulate that uncertainty, too. Is it possible that the real distribution could actually be lognormal? Then put a probability on that and generate accordingly. Why “assume” something is true if we can explicitly model that we are uncertain about it?
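As a sketch of what that can look like (the 30% probability and the parameters below are made up purely for illustration), each trial first draws which shape applies and then draws the value:

import random

P_LOGNORMAL = 0.30  # made-up probability that the true distribution is lognormal

def uncertain_shape_draw():
    # First draw which distribution applies, then draw the value itself
    if random.random() < P_LOGNORMAL:
        return random.lognormvariate(2.2, 0.35)        # placeholder parameters
    return random.normalvariate(15, (20 - 10) / 3.29)  # placeholder normal with a 90% CI of 10..20

samples = [uncertain_shape_draw() for _ in range(10_000)]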
Thanks,
Doug
Thank you very much for your detailed response and explanations. I am sorry, I may have chosen the wrong word (‘sources’). I am too accustomed to the ‘computer slang’ meaning of it (as in ‘source code’); English is not my first language, and in my circles people routinely use this term to mean the ‘workings’, in other words the logic behind a rule or statement. Anyway, your explanation has fully answered this question now.
I think I did not formulate my main question well either. Let me re-phrase it with a practical example. Let’s say I am testing the performance of a software component. Every test run gives me a different result; say it took 5, 7, 11, 6, and 15 seconds to perform my test operation in five attempts, respectively. I think it’s reasonable to assume that the distribution shape is normal in this case; I know one can argue against that, but just for illustration purposes let’s say it’s normal. The Rule of Five says there is about a 93% probability that the median (and mean) of the entire population is between the highest and the lowest values in the sample. So far so good. Let’s say now that the effectiveness of the whole program somehow depends on the performance of this component, and if I knew the exact time it takes for the component to complete the operation, I could say how effective the program was and ultimately whether I really need to invest more time into its development.
To me it seems that you would advocate using Monte Carlo simulation in this case, with the input value generated using, for example, Excel and something like this:
= norminv(rand(),10,(15 - 5)/3.29)
In the book, under similar circumstances, you advocate this approach (on page 76), where the formula is:
= norminv(rand(),15,(20 - 10)/3.29)
My point concerns the value for the mean in this formula, i.e. ‘10’ in this case, and ‘15’ in your formula on page 76. That’s what I mean when I say that you assume the mean (and the median) is exactly in the middle of the interval. To my mind, at least with the Rule of Five, that does not have to be the case. The rule says the median is in between, but it does not say anything about where in between it is; certainly nothing concrete about it being exactly in the middle of the interval. Yet in the Excel formula for the Monte Carlo simulation, the mean value is the number in the middle of the interval.
Thank you for your help,
Andrey
Yes, that is correct on both points: 1) the Excel formula I provide produces a normally distributed random quantity, and 2) we do not necessarily assume that the distribution behind a Rule of Five finding is normal.
However, I don’t believe I ever said that the two statements above are incompatible. In fact, I explain on page 141 that the mathless table (which includes the Rule of Five as its first row) requires no assumption at all about the shape of the distribution in order to work.
I’m sorry if I’m misinterpreting your question, but you appear to believe I’m saying that if you measure a CI with the Rule of Five, then the way to simulate the population parameter is with the Excel formula I describe for a Gaussian random value – and I don’t see where I say anything like that. First, the Rule of Five isn’t even a 90% CI; it’s a 93.75% CI, so that part of the Excel formula would have to change. Second, I describe a number of possible distributions the mathless table could work for, including some highly irregular and skewed distributions. I explain that the Rule of Five actually avoids some logical problems that you could run into if you apply the Student’s t-distribution to small samples (and that distribution is symmetrical, by definition).
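To make the first point concrete (this is just my arithmetic here, not a formula from the book): the 3.29 divisor in the normal formula is the width of a 90% CI measured in standard deviations (2 × 1.645). If you did want to treat a Rule of Five range as if it came from a normal distribution, a 93.75% interval would need a divisor of about 2 × 1.86 ≈ 3.73 instead. A quick check in Python:

from scipy.stats import norm

# Width of a symmetric two-sided CI, measured in standard deviations
width_90 = 2 * norm.ppf(1 - 0.10 / 2)      # ~3.29 (the divisor in the 90% CI formula)
width_9375 = 2 * norm.ppf(1 - 0.0625 / 2)  # ~3.73 (for a 93.75% interval)
print(width_90, width_9375)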
You seem to be describing a situation where you might have other knowledge about the distribution besides just the results of the sample. That’s good, and you can use that knowledge. I describe how to produce those results in chapter 10 on Bayesian analysis.
And once you determine a distribution of any shape, you can always “cheat” with the brute-force method. If you don’t know how to define a formula in Excel to generate that particular random number, just create another worksheet that has several (or several dozen) rows and two columns. The column on the right is a fixed set of increments of possible outcomes of that random number. The column on the left is the cumulative probability of all the values up to that increment. Then you use =vlookup(rand(),[table range],2) to generate the value. The number it generates is a little “lumpy” (since it can only generate values from a finite list), but if you divide the range up finely it should be sufficient for most business case simulations. With a little experimentation, you can generate any shape you like.
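If you ever want the same trick outside Excel, here is a sketch in Python, with a made-up outcome table standing in for the two worksheet columns:

import bisect
import random

# Two worksheet columns as two lists: each outcome covers the probability
# range from its own cumulative value up to the next one (made-up numbers)
cum_probs = [0.00, 0.10, 0.35, 0.70, 0.90]  # left column: cumulative probability
outcomes = [5, 6, 7, 9, 12]                 # right column: outcome increments

def lumpy_draw():
    # Emulates =vlookup(rand(),[table range],2): find the last row whose
    # cumulative probability is <= rand() and return that row's outcome
    r = random.random()
    return outcomes[bisect.bisect_right(cum_probs, r) - 1]

samples = [lumpy_draw() for _ in range(10_000)]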
Let me know if that works.
Doug Hubbard
That’s right, now I see that you never said this. I don’t know how I came to assume it; I guess it’s because in both cases the words ‘confidence’ and ‘interval’ are used in places. Thank you very much for your help.
Regards,
Andrey