[From Stu. Images added by Ira from this source.]
OK, OK Ira, you have guilted me into action, and so I will share something that has been bothering me lately. As many of us may know, the Central Limit Theorem (CLT) is, next to the Law of Large Numbers, the most important principle in Statistics and is used to justify many a research study. So, I am thinking it behooves us all to try to understand this CLT so we can become more discerning citizens, n'est-ce pas?
Here is the way I understand it. We are given a random variable X with mean mu and standard deviation sigma (X may or may not be normally distributed). Now we do the following m times: we draw n samples from X and compute the mean X-Bar, giving us m X-Bars, which will have their own particular probability distribution, PD. Finally, the CLT promises that as m and n approach infinity, PD will approach a Normal distribution with mean mu (the mean of our original random variable X) and a standard deviation of sigma divided by the square root of n (the sigma of X and the n of the n samples). Pretty amazing, actually. Please correct me if my understanding of this is incorrect, as I'm going from memory here.
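A small simulation can make the procedure just described concrete. The sketch below is an illustration only: the exponential distribution (mu = 1, sigma = 1), the sizes n = 50 and m = 5000, and the helper name `sample_means` are assumptions chosen for the example, not anything taken from the discussion. It uses only the Python standard library.

```python
import random
import statistics

def sample_means(draw, n, m, seed=0):
    """Repeat m times: draw n values with draw(rng) and record their mean (one X-Bar)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [statistics.fmean(draw(rng) for _ in range(n)) for _ in range(m)]

# Assumed example population: exponential with rate 1, a skewed,
# clearly non-normal distribution with mu = 1 and sigma = 1.
MU, SIGMA = 1.0, 1.0
N, M = 50, 5000

means = sample_means(lambda rng: rng.expovariate(1.0), n=N, m=M)

mu_hat = statistics.fmean(means)   # should land close to mu
sd_hat = statistics.stdev(means)   # should land close to sigma / sqrt(n)
predicted_sd = SIGMA / N ** 0.5

print(f"mean of the X-Bars: {mu_hat:.4f} (mu = {MU})")
print(f"sd of the X-Bars:   {sd_hat:.4f} (sigma/sqrt(n) = {predicted_sd:.4f})")
```

In this sketch the bell shape comes from making n large; a large m only lets the collection of X-Bars trace out PD clearly enough to see it.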
Now here's the problem that is bothering me. Say a research study is done where the researcher does not know the actual probability distribution. He or she can use the CLT to draw inferences about the population, but precisely how? From what I understand, they want to use as large an n as possible but surely do not use a large m (repeated sampling). And while I understand that once you have a normal/Gaussian probability distribution it's easy to compute deviations from the mean and confidence intervals, just exactly what is the procedure used? Can anyone give me a useful, easy-to-understand example?
Bewitched, bothered and bewildered,
Stu
Thanks Stu for posting a new Topic that is quite a departure from the normal fare here.
For those who are unfamiliar with the Central Limit Theorem (CLT), I added a couple of illustrative example animations to Stu's posting. The first starts with random samples taken from a PARABOLIC distribution. The second starts with a UNIFORM distribution.
In both cases, repeated sampling ends up with an identical NORMAL distribution! That is the whole point of the CLT. (The source for these animations also shows an initial TRIANGULAR and INVERSE distribution that, when sampled, also results in a similar NORMAL distribution, but with a different Mean.)
I'll think about this and try to come up with a practical example that makes sense from a philosophical view.
HOWARD !!! - ANY IDEAS ???
Ira Glickstein
I'm not good at abstract statistical thinking. I find examples more helpful, such as those in Learning by Simulations.
Howard
Howard's link shows that even a BI-MODAL distribution has a NORMAL one at its heart.
Why do YOU care?
Here is a simple explanation for readers not familiar with statistical distributions, and why it matters to you!
When you measure things in Nature, or do "social science" studies of people, you (almost) always get what is called a NORMAL (bell-shaped) distribution. The illustrations at the head of this Topic show how the familiar bell-shape emerges.
So, what is a BI-MODAL result, and how can you get it? OK, say you want to know the average height and variability in height of young men in the US. You could go to a college campus and measure the heights of the first 100 or 500 or more men who happen to pass by. As shown in Lies, Damned Lies, and Statistics on this Blog in 2007, you will find that virtually all the young men are between 62" and 77", with a peak around 69" to 70". If you graph the distribution, you will get a nice, normal bell-shape.
But, what if, by bad luck, you happen to do the experiment outside the athletic building just as a championship basketball tournament is letting out and half your sample happens to be basketball players? Competitive players tend to be over 72" tall. Instead of getting a nice symmetrical bell-shape, you will get one with a bump on one side, like the one in Howard's link.
Misuse of Anecdotal Math
It seems to be a Law of Nature that all things that can be measured tend to follow a normal curve. So, when reviewing results of experiments and surveys and so on, expect to find that bell-shape. Also, when interpreting results, the statistics of the normal curve can bring out truths that are not necessarily apparent at first glance.
For example, as I showed in Lies, Damned Lies, and Statistics, anecdotal math can be used to falsify the truth and truthify falsehood.
Much has been made of the apparent bias against women and minorities in sports and the professions. Some of it is, unfortunately, all too true, and must be corrected of course. However, some of the disparity is valid and not due to bias. The best-selling 1994 book The Bell Curve showed why in a "politically incorrect" way.
Here is a specific example. Young women, on average, are only about 5" shorter than young men, which is less than 10%. Therefore, you would expect to find only a 10% difference in the number of women and men in athletics. Right? WRONG!
It turns out that in sports where height (and weight, etc.) are critical, that 10% average height difference should result in a 100 to 1 ratio of men and women in that sport. That is why it is legal and customary for the highest levels in many sports to have woman-only leagues that exclude men.
The bell-curve also explains representation in professions where academic intelligence is critical. The fact that some identifiable groups are under- (or over-) represented does not necessarily mean there is any illegal bias involved.
Ira Glickstein
Thanks Ira and Howard for the timely responses. As fate would have it, after much more research I found this website:
http://people.hofstra.edu/Stefan_Waner/realWorld/finitetopic1/confint.html
which explains pretty well how just one sample from a population with an unknown probability distribution can be used: you compute the sample mean and standard deviation, and from them a confidence interval (a range around the sample mean, usually expressed in scaled units of the standard deviation, that is expected to contain the population mean with a specified probability). If you read the text at the link, be careful, as the author is somewhat cavalier in his use of sigma and s.
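As a rough illustration of that one-sample procedure, here is a minimal Python sketch. It is not taken from the linked page: the function name `mean_confidence_interval`, the made-up exponential data, the sample size of 400, and the 95% level are all assumptions for the example. It relies on the CLT to treat the sample mean as approximately normal and substitutes the sample standard deviation s for the unknown sigma, which is reasonable only when n is fairly large.

```python
import math
import random
import statistics

def mean_confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for the population mean from ONE sample.

    Assumes n is large enough for the CLT to apply, and uses the sample
    standard deviation s in place of the unknown population sigma.
    """
    n = len(sample)
    x_bar = statistics.fmean(sample)
    s = statistics.stdev(sample)
    # z-value that leaves (1 - confidence) / 2 in each tail of the standard normal
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * s / math.sqrt(n)  # z times the estimated standard error
    return x_bar - half_width, x_bar + half_width

# Made-up data standing in for a study's single sample (skewed, not normal).
rng = random.Random(1)
sample = [rng.expovariate(1.0) for _ in range(400)]

low, high = mean_confidence_interval(sample)
print(f"sample mean: {statistics.fmean(sample):.3f}")
print(f"approximate 95% confidence interval for the mean: ({low:.3f}, {high:.3f})")
```

Note that no repeated sampling (no large m) is needed: the single sample supplies both the estimate X-Bar and, through s divided by the square root of n, an estimate of how far X-Bar is likely to be from mu.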
The reason I got interested in this was from my reading of "The Black Swan" by Nassim Taleb, who claims that the Normal or Gaussian or Bell Curve is not the best representation of a large chunk of random variables, such as the stock market, book sales, and any population where the mean and standard deviation are not adequately descriptive statistics. Instead he proposes a fractal probability distribution for these cases, so that is my next project. While I understand what a fractal is, a fractal probability distribution blows my mind, so it's back to Mandelbrot... BUT, before I go, could you (Ira) please explain this quote from your response? Why the square law?
{Ira said:}
Here is a specific example. Young women, on average, are only about 5" shorter than young men, which is less than 10%. Therefore, you would expect to find only a 10% difference in the number of women and men in athletics. Right? WRONG!
It turns out that in sports where height (and weight, etc.) are critical, that 10% average height difference should result in a 100 to 1 ratio of men and women in that sport. That is why it is legal and customary for the highest levels in many sports to have woman-only leagues that exclude men.
Here is Stu's Link in clickable form.
In my previous Comment I link to my 2007 Blog Topic where I say:
The height of young American women ranges from about 4' 9" to 6'. For young men it is 5' 2" to 6' 5". That's a difference of about five inches -- less than ten percent.
Therefore, in basketball and other sports where height is critical, you'd expect about ten percent fewer women than men. Right?
Anything less would be proof of discrimination against women. Right?
WRONG !!!
Actually, if you had a cut-off of six feet, over 100 men would qualify for every woman who qualified!
Stu asks me to "explain this quote from your response? Why the square law?"
I understand why it seemed like a "square law" because I said a 10% difference in average male/female heights would lead to 100:1 under-representation of women in sports where height is critical. However, it is not a square law, but rather a result of overlapping male and female normal curves.
Here is why: Standard deviation in height is about 2.5", so women are about two standard deviations shorter than men. Say championship play in a sport like basketball generally requires players who are over 72" tall. That is in the 2- to 3-sigma range in male height ("very tall", and "extremely tall" on my graphic). Women over 72" would be in their 4- to 5-sigma range (even taller than "extremely tall" in their range).
Using height data for young Americans, given 1000 men and 1000 women, about 136 men will be "tall", 21 will be "extremely tall", and about 1 above that, for a total of about 158 who qualify. For women, only about 1 will qualify. So, the actual ratio will be about 158:1.
As I point out in the Blog Topic, even if we relax the height requirement to 67" (just above average for the combined male/female population), there will be more than a 5:1 disparity between men and women. (Actual calculation: about 842 men and 158 women, a ratio of 5.3:1.)
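The arithmetic behind these ratios can be checked with a few lines of Python. The sketch below assumes heights are normally distributed with a standard deviation of 2.5" for both sexes, and it assumes means of 69.5" for men and 64.5" for women; those two means are illustrative values chosen to match the 5" gap and the 69" to 70" peak mentioned earlier, not figures quoted from the height data itself. The helper name `qualifiers_per_thousand` is likewise made up for the example.

```python
from statistics import NormalDist

SD = 2.5                                 # standard deviation in inches (from the text)
men = NormalDist(mu=69.5, sigma=SD)      # assumed mean for young men
women = NormalDist(mu=64.5, sigma=SD)    # assumed mean: 5 inches shorter

def qualifiers_per_thousand(dist, cutoff):
    """Expected number out of 1000 people who are taller than the cutoff."""
    return 1000 * (1 - dist.cdf(cutoff))

def compare(cutoff):
    """Return (men, women, ratio) per 1000 of each sex above the cutoff."""
    m = qualifiers_per_thousand(men, cutoff)
    w = qualifiers_per_thousand(women, cutoff)
    return m, w, m / w

for cutoff in (72, 67):
    m, w, ratio = compare(cutoff)
    print(f'cutoff {cutoff}": {m:.0f} men and {w:.1f} women per 1000, ratio {ratio:.1f}:1')
```

Under these assumptions a 72" cutoff sits one standard deviation above the male mean but three above the female mean, which is why the ratio comes out above 100:1, while a 67" cutoff gives a ratio a little over 5:1. The exact 72" ratio depends on whether the small number of qualifying women is rounded to 1.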
Ira Glickstein