Originally posted at http://www.howtomeasureanything.com, on Wednesday, February 11, 2009 2:16:38 PM, by andrey.

“Hello Douglas,

First of all, let me say I have thoroughly enjoyed reading your book. I have a technical background (software engineering) and have always been surprised at how “irresponsible” some business-level decision making can be – based on gut instincts and who-knows-what. This ‘intuitive’ approach is plagued with biases and heuristics, and the effects of such an approach have been widely publicized (for example here). This is one of many reasons I found your book very stimulating and the AIE approach as a whole very promising.

However, I have reservations about a few points you make. Please forgive my ignorance if my questions are silly; my math has become rusty over the years.

One of my concerns is the validity of the assumption that you make when explaining ‘The Rule of Five’, the 90% CI and especially when using Monte Carlo simulation. I can believe (although it would’ve been great to see the sources) that ‘there is a 93% chance that the median of a population is between the smallest and largest values in any random sample of five from that population’. But when you apply this to the Monte Carlo simulations, you assume that the mean (which is also the median for symmetric probability distributions) is exactly in the middle of the confidence interval. This, I think, makes a big difference to the outcome because of the shape of the normal distribution function. If you assume the median is, for example, very close to the lower or upper bound of the confidence interval, and put a different value into the =norminv(rand(), A, B) formula accordingly, the results would be different.

I am still working through your book (second reading), trying to ‘digest’ and internalise it properly. I would be very grateful if you could explain this to me.

Thank you very much,”


Thanks for your comment.

I don’t show a source (I’m the one who coined the term “Rule of Five” in this context) but I show the simple calculation, which is easily verifiable. The chance of randomly picking one sample with a parameter value above the true population median for that parameter is, by definition, 50%. We ask “what is the probability that I could happen to randomly choose five samples in a row that are all above the true population median?” It is the same chance as flipping five coins and getting all heads. The answer is (1/2)^5, or 3.125%. Likewise, the probability that we could have just picked 5 in a row that were all below the true population median is 3.125%. That means that there is a 93.75% chance (100% minus 2 × 3.125%) that some were above and some were below – in other words, that the median is really between the min and max values in the sample. It’s not an “assumption” at all – it is a logically necessary conclusion from the meaning of the word “median”.
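The same arithmetic can be written as a few lines of Python. This is only a sketch of the calculation above; the function name is made up for illustration, and it assumes a continuous distribution so that no sample lands exactly on the median.

```python
def prob_median_within_sample_range(n: int) -> float:
    """Chance that the population median lies between the min and max of n random samples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    p_all_one_side = 0.5 ** n          # all n samples above the median (same for all below)
    return 1.0 - 2.0 * p_all_one_side  # rule out "all above" and "all below"


for n in (2, 5, 8):
    print(n, prob_median_within_sample_range(n))
```

For n = 5 this gives 0.9375, the figure behind the Rule of Five; larger samples only push the probability closer to 1.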

You can verify this experimentally as well. Try generating any large set of continuous values you like using any distribution you like (or just define a distribution function for such a set). Determine the median for the set. Then randomly select 5 from the set and see if the known population median is between the min and max values of those 5 samples. Repeat this a large number of times. You will find that very close to 93.75% of the time the known median will be between the min and max values of the sample of 5.
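One way to run that experiment is sketched below in Python. The lognormal population, the population size, the number of trials and the function name are all arbitrary choices for illustration, not anything prescribed in the book; a deliberately lopsided distribution is used to show that the result does not depend on symmetry.

```python
import random
import statistics


def rule_of_five_hit_rate(population, trials=50_000, sample_size=5, seed=1):
    """Fraction of random samples whose min..max range contains the population median."""
    rng = random.Random(seed)
    median = statistics.median(population)
    hits = 0
    for _ in range(trials):
        sample = rng.choices(population, k=sample_size)  # sampling with replacement
        if min(sample) <= median <= max(sample):
            hits += 1
    return hits / trials


# A deliberately skewed population: the shape should not matter for the median.
population_rng = random.Random(0)
population = [population_rng.lognormvariate(0.0, 1.0) for _ in range(100_001)]

hit_rate = rule_of_five_hit_rate(population)
print(f"median inside sample range in {hit_rate:.2%} of trials")
```

The printed share should land near 93.75%, with a small amount of run-to-run noise that shrinks as the number of trials grows.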

I believe I also made it clear that this only applies to the median and not the mean. I further stated that if, on the other hand, you were able to make the determination that the distribution is symmetrical then, of course, it applies to the mean as well. Often, you may have reason to do this and this is no different than the assumption in any application of a t-stat or z-stat based calculation of a CI (which are always assumed to be symmetrical).

Furthermore, you certainly should not use a function that generates a normally distributed random number if you know the quantity is not normally distributed, and I don’t believe I recommended otherwise. If you know the median and the mean of the distribution you intend to generate are not the same, then you can’t count on Excel’s norminv function to be a good approximation. For the sake of simplicity, I gave a very limited set of distribution functions for random values in Excel. But we can certainly add a lot more (triangular, beta, lognormal, etc.).
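To see how much the choice of distribution matters, here is a rough Python sketch that turns the same 90% CI into either a normal or a lognormal generator. It assumes the usual convention that a 90% CI spans about 3.29 standard deviations (2 × 1.645), applied to the raw values for the normal case and to their logarithms for the lognormal case; the bounds of 10 and 100 and all function names are invented for the example.

```python
import math
import random
from statistics import NormalDist, mean, median

# z-score for the 95th percentile, about 1.645; a 90% CI spans 2 * Z90 standard deviations.
Z90 = NormalDist().inv_cdf(0.95)


def normal_from_ci(rng, lower, upper):
    """Normal draw whose 5th and 95th percentiles sit at lower and upper."""
    mu = (lower + upper) / 2
    sigma = (upper - lower) / (2 * Z90)
    return rng.gauss(mu, sigma)


def lognormal_from_ci(rng, lower, upper):
    """Lognormal draw with the same 90% CI, fitted on the log scale (needs lower > 0)."""
    mu = (math.log(lower) + math.log(upper)) / 2
    sigma = (math.log(upper) - math.log(lower)) / (2 * Z90)
    return rng.lognormvariate(mu, sigma)


def share_inside(draws, lower, upper):
    """Fraction of draws that fall within the stated bounds."""
    return sum(lower <= x <= upper for x in draws) / len(draws)


rng = random.Random(42)
lower, upper = 10.0, 100.0  # hypothetical 90% CI
normal_draws = [normal_from_ci(rng, lower, upper) for _ in range(100_000)]
lognormal_draws = [lognormal_from_ci(rng, lower, upper) for _ in range(100_000)]

for name, draws in (("normal", normal_draws), ("lognormal", lognormal_draws)):
    print(name, round(median(draws), 1), round(mean(draws), 1),
          round(share_inside(draws, lower, upper), 3))
```

Both generators put roughly 90% of their draws inside the bounds, but the normal version centers its median on the midpoint of the interval while the lognormal version pulls the median toward the lower bound and the mean above the median.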

My approach is not to assume anything you don’t have to. If you don’t know that the distribution isn’t lopsided, you can simulate that uncertainty, too. Is it possible that the real distribution could actually be lognormal? Then put a probability on that and generate accordingly. Why “assume” something is true if we can explicitly model that we are uncertain about something?
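A minimal Python sketch of that idea follows: each simulated value first decides, at random, which shape to use and then draws from it. The 30% chance of a lognormal shape, the 90% CI of 10 to 100 and the function name are purely illustrative assumptions; in practice the probability would come from a calibrated estimate.

```python
import math
import random
from statistics import NormalDist

Z90 = NormalDist().inv_cdf(0.95)  # about 1.645


def draw_with_shape_uncertainty(rng, lower, upper, p_lognormal):
    """One draw where the shape is itself uncertain: lognormal with probability p_lognormal, else normal."""
    if rng.random() < p_lognormal:
        mu = (math.log(lower) + math.log(upper)) / 2
        sigma = (math.log(upper) - math.log(lower)) / (2 * Z90)
        return rng.lognormvariate(mu, sigma)
    mu = (lower + upper) / 2
    sigma = (upper - lower) / (2 * Z90)
    return rng.gauss(mu, sigma)


rng = random.Random(7)
draws = [draw_with_shape_uncertainty(rng, 10.0, 100.0, p_lognormal=0.3)
         for _ in range(100_000)]
share_inside = sum(10.0 <= x <= 100.0 for x in draws) / len(draws)
print(f"{share_inside:.3f} of draws fall inside the stated 90% CI")
```

Because both candidate shapes are fitted to the same 90% CI, about 90% of the mixed draws still fall inside the bounds; what changes is how the values are spread within and beyond them.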