[Many of the figures in this note are screen shots from a simulation at the Rice Virtual Lab in Statistics. You might enjoy trying the simulation yourself after (or even while) reading this note. Java must be enabled in your browser for this simulation to run.]
There is arguably no more important lesson to be learned in statistics than how sample means behave. It explains why statistical methods work. Most of what people do with statistics is compare populations, and most of the time populations are compared by comparing their means.
The way individual observations behave depends on the population from which they are drawn. If we draw a sample of individuals from a normally distributed population, the sample will follow a normal distribution. If we draw a sample of individuals from a population with a skewed distribution, the sample values will display the same skewness. Whatever the population looks like--normal, skewed, bimodal, whatever--a sample of individual values will display the same characteristics. This should be no surprise. Something would be very wrong if the sample of individual observations didn't share the characteristics of the parent population.
We are now going to see a truly wondrous result. Statisticians refer to it as The Central Limit Theorem. It says that if you draw a large enough sample, the way the sample mean varies around the population mean can be described by a normal distribution, NO MATTER WHAT THE POPULATION HISTOGRAM LOOKS LIKE!
I'll repeat and summarize because this result is so important. If you draw a large sample, the histogram of the individual observations will look like the population histogram from which the observations were drawn. However, the way the sample mean varies around the population mean can be described by the normal distribution. This makes it very easy to describe the way sample means behave. The way they vary about the population mean, for large samples, is unrelated to the shape of the population histogram.
Let's look at an example. In the picture to the left,
In each case, the individual observations are spread out in a manner reminiscent of the population histogram. The sample means, however, are tightly grouped. This is not unexpected. In each sample, we get observations from throughout the distribution. The larger values keep the mean from being very small while the smaller values keep the mean from being very large. There are so many observations, some large, some small, that the mean ends up being "average". If the sample contained only a few observations, the sample mean might jump around considerably from sample to sample, but with lots of observations the sample mean doesn't get a chance to change very much.
Since the computer is doing all the work, let's go hog wild and do it 10,000 times!
Here's how the means from those 10,000 samples of 25 observations each behave. They behave like things drawn from a normal distribution centered about the mean of the original population!
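Readers without the simulation handy can sketch the same experiment in a few lines of Python. This is only an illustration: the population used here (an exponential distribution with mean 1, which is strongly right-skewed) is an assumption, not the population from the figures, and the function name and seed are made up.

```python
import random
import statistics

def simulate_sample_means(n_samples=10_000, sample_size=25, seed=1):
    """Draw n_samples samples of sample_size observations each from a
    skewed population (exponential with mean 1, an assumption made for
    illustration) and return the list of sample means."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_samples):
        sample = [rng.expovariate(1.0) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return means

means = simulate_sample_means()

# The means cluster around the population mean (1.0 here) ...
print("mean of sample means:", round(statistics.fmean(means), 3))
# ... with a spread much smaller than the population SD (also 1.0 here).
print("SD of sample means:  ", round(statistics.stdev(means), 3))

# Crude text histogram of the 10,000 means, in bins of width 0.1.
counts = {}
for m in means:
    b = int(m * 10)  # bin index: means in [0.1*b, 0.1*(b+1))
    counts[b] = counts.get(b, 0) + 1
for b in sorted(counts):
    print(f"{b / 10:4.1f} {'#' * (counts[b] // 50)}")
```

The text histogram should look roughly bell-shaped and centered near 1, even though a histogram of the individual exponential observations would be piled up near 0 with a long right tail.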
At this point, the most common question is, "What's with the 10,000 means?" and it's a good question. Once this is sorted out, everything will fall into place.
This result is so important that statisticians have given it a special name. It is called The Central Limit Theorem. It is a limit theorem because it describes the behavior of the sample mean in the limit as the sample size grows large. It is called the Central Limit Theorem not because there's any central limit, but because it's a limit theorem that is central to the practice of statistics!
The key to the Central Limit Theorem is large sample size. The closer the histogram of the individual data values is to normal, the smaller "large" can be.
The vast majority of the measurements we deal with are made on biological units on a continuous scale (cholesterol, birth weight, crop yield, vitamin intakes or levels, income). Most of the rest are indicators of some characteristic (0/1 for absence/presence of premature birth, disease). Very few individual measurements have population histograms that look less normal than one with three bars of equal height at 1, 2, and 9, that is, a population that is one-third 1s, one-third 2s, and one-third 9s. It's not symmetric. One-third of the population is markedly different from the other two-thirds. If the claim is true for this population, then perhaps it's true for population histograms closer to the normal distribution.
The distribution of the sample mean for various sample sizes is shown at the left. When the sample size is 1, the sample mean is just the individual observation. As the number of samples of a single observation increases, the histogram of sample means gets closer and closer to three bars of equal height at 1, 2, and 9--the population histogram for individual values. The histogram of sample individual values always looks like the population histogram of individual values as you take more samples of individual values. It does NOT look more and more normal unless the population from which the data are drawn is normal.
When samples of size two are taken, the first observation is equally likely to be 1, 2 or 9, as is the second observation.
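|Obs 1||Obs 2||Mean|
|1||1||1.0|
|1||2||1.5|
|1||9||5.0|
|2||1||1.5|
|2||2||2.0|
|2||9||5.5|
|9||1||5.0|
|9||2||5.5|
|9||9||9.0|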
And so it goes for all sample sizes. Leave that to the mathematicians. The pictures are correct. Trust me. However, you are welcome to try to construct them for yourself, if you wish.
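For anyone who does want to construct the distributions, here is a minimal Python sketch that enumerates every equally likely ordered sample and tallies the resulting means exactly. The function name is made up for illustration, and brute-force enumeration is only practical for small n, since there are 3**n ordered samples from a three-value population.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def exact_mean_distribution(values, n):
    """Exact distribution of the mean of n independent draws, each equally
    likely to be any element of `values`, by brute-force enumeration of
    all len(values)**n ordered samples (practical only for small n)."""
    weight = Fraction(1, len(values) ** n)
    dist = Counter()
    for sample in product(values, repeat=n):
        dist[Fraction(sum(sample), n)] += weight
    return dict(sorted(dist.items()))

for mean, prob in exact_mean_distribution([1, 2, 9], 2).items():
    print(f"mean = {float(mean):4.1f}   probability = {prob}")
```

For n=2 this gives six distinct means, 1, 1.5, 2, 5, 5.5, and 9, with probabilities 1/9, 2/9, 1/9, 2/9, 2/9, and 1/9.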
When n=10, the histogram of the sample means is very bumpy, but is becoming symmetric. When n=25, the histogram looks like a stegosaurus, but the bumpiness is starting to smooth out. When n=50, the bumpiness is reduced and the normal distribution is a good description of the behavior of the sample mean. The behavior (distribution) of the mean of samples of 100 individual values is nearly indistinguishable from the normal distribution to the resolution of the display. If the mean of 100 observations from this population of 1s, 2s, and 9s can be described by a normal distribution, then perhaps the mean of our data can be described by a normal distribution, too.
When the distribution of the individual observations is symmetric, the convergence to normal is even faster. In the diagrams to the left, one-third of the individual observations are 1s, one-third are 2s, and one-third are 3s. The normal approximation is quite good, even for samples as small as 10. In fact, even n=2 isn't too bad!
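One rough way to put a number on how lopsided the distribution of the sample mean still is at a given n is its skewness, computed from the exact distribution. The sketch below is an illustration under stated assumptions: the population values are integers and equally likely, the function names are made up, and skewness measures only asymmetry, not every departure from the normal shape.

```python
from collections import defaultdict
from fractions import Fraction

def sum_counts(values, n):
    """Number of ordered samples of size n (drawn with replacement from
    `values`) that produce each possible total, built one draw at a time."""
    counts = {0: 1}
    for _ in range(n):
        nxt = defaultdict(int)
        for total, c in counts.items():
            for v in values:
                nxt[total + v] += c
        counts = dict(nxt)
    return counts

def skewness_of_mean(values, n):
    """Skewness (third standardized moment) of the sample mean.
    The sum and the mean differ only by the factor n, so they share
    the same skewness; 0 means the distribution is symmetric."""
    counts = sum_counts(values, n)
    total = len(values) ** n
    mu = Fraction(sum(t * c for t, c in counts.items()), total)
    var = sum((t - mu) ** 2 * c for t, c in counts.items()) / total
    third = sum((t - mu) ** 3 * c for t, c in counts.items()) / total
    return float(third) / float(var) ** 1.5

for n in (1, 2, 10, 25, 50, 100):
    print(n,
          round(skewness_of_mean([1, 2, 9], n), 3),
          round(skewness_of_mean([1, 2, 3], n), 3))
```

For the 1s, 2s, and 9s the skewness shrinks like one over the square root of n, while for the symmetric 1s, 2s, and 3s it is zero at every n.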
To summarize once again, the behavior of sample means of large samples can be described by a normal distribution even when individual observations are not normally distributed.
This is about as far as we can go without introducing some notation to maintain rigor. Otherwise, we'll sink into a sea of confusion over samples and populations or between the standard deviation and the (about-to-be-defined) standard error of the mean.
The sample has mean $\bar{x}$ and standard deviation $s$. The sample comes from a population of individual values with mean $\mu$ and standard deviation $\sigma$.
The behavior of sample means of large samples can be described by a normal distribution, but which normal distribution? If you took a course in distribution theory, you could prove the following results: The mean of the normal distribution that describes the behavior of a sample mean is equal to $\mu$, the mean of the distribution of the individual observations. For example, if individual daily caloric intakes have a population mean $\mu$ = 1800 kcal, then the mean of 50 of them, say, is described by a normal distribution with a mean also equal to 1800 kcal.
The standard deviation of the normal distribution that describes the behavior of the sample mean is equal to the standard deviation of the individual observations divided by the square root of the sample size, that is, $\sigma/\sqrt{n}$. Our estimate of this quantity, $s/\sqrt{n}$, is called the Standard Error of the Mean (SEM), that is, $\mathrm{SEM} = s/\sqrt{n}$.
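As a minimal sketch, the SEM calculation (the sample standard deviation divided by the square root of the sample size) looks like this in Python. The function name is illustrative and the caloric intakes are hypothetical numbers made up for the example.

```python
import math
import statistics

def standard_error_of_mean(sample):
    """SEM = s / sqrt(n), where s is the sample SD (n - 1 denominator)."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations to estimate s")
    return statistics.stdev(sample) / math.sqrt(n)

# Hypothetical daily caloric intakes (kcal), made up for illustration.
intakes = [1650, 1920, 1710, 2050, 1800, 1580, 1990, 1760, 1840]
print("sample mean:", round(statistics.mean(intakes), 1))
print("SEM:        ", round(standard_error_of_mean(intakes), 1))
```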
I don't have a nonmathematical answer for the presence of the square root. Intuition says the mean should vary less from sample-to-sample as the sample sizes grow larger. This is reflected in the SEM, which decreases as the sample size increases, but it drops like the square root of the sample size, rather than the sample size itself.
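For readers who do want the mathematical answer, here is a short sketch. It assumes the n observations x_1, ..., x_n are independent draws from a population with variance sigma squared. Variances of independent quantities add, and dividing a quantity by n divides its variance by n squared:

```latex
\operatorname{Var}(\bar{x})
  = \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)
  = \frac{1}{n^{2}}\sum_{i=1}^{n}\operatorname{Var}(x_i)
  = \frac{n\sigma^{2}}{n^{2}}
  = \frac{\sigma^{2}}{n},
\qquad\text{so}\qquad
\operatorname{SD}(\bar{x}) = \frac{\sigma}{\sqrt{n}}.
```

The variance of the mean drops like the sample size itself. The square root appears only when the variance is converted back to a standard deviation.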
These results say that for large sample sizes the behavior of sample means can be described by a normal distribution whose mean is equal to the population mean of the individual values, $\mu$, and whose standard deviation is equal to $\sigma/\sqrt{n}$, which is estimated by the SEM. In a course in probability theory, we use this result to make statements about a yet-to-be-obtained sample mean when the population mean is known. In statistics, we use this result to make statements about an unknown population mean when the sample mean is known.
Preview: Let's suppose we are talking about 100 dietary intakes and the SEM is 40 kcal. The results of this note say the behavior of the sample mean can be described by a normal distribution whose SD is 40 kcal. We know that when things follow a normal distribution, they will be within 2 SDs of the population mean about 95% of the time. In this case, 2 SDs is 80 kcal. Thus, the sample mean and population mean will be within 80 kcal of each other about 95% of the time.
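The 95% figure can be checked by simulation. The sketch below is an illustration under assumptions that are not in this note: individual intakes follow a right-skewed gamma distribution with mean 1800 kcal and SD 400 kcal, so that the SD of the mean of 100 intakes is 400/10 = 40 kcal. It also uses the known population SD rather than an SEM estimated from each sample. The function name and seed are made up.

```python
import math
import random
import statistics

def coverage_within_two_se(n_samples=2000, sample_size=100, seed=7):
    """Fraction of simulated sample means that land within 2 standard
    errors of the population mean. The population is an assumption made
    for illustration: a right-skewed gamma distribution with mean
    1800 kcal and SD 400 kcal, so sigma / sqrt(100) = 40 kcal."""
    pop_mean, pop_sd = 1800.0, 400.0
    shape = (pop_mean / pop_sd) ** 2      # gamma shape parameter
    scale = pop_sd ** 2 / pop_mean        # gamma scale parameter
    se = pop_sd / math.sqrt(sample_size)  # 40 kcal
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        sample = [rng.gammavariate(shape, scale) for _ in range(sample_size)]
        if abs(statistics.fmean(sample) - pop_mean) <= 2 * se:
            hits += 1
    return hits / n_samples

print("fraction within 80 kcal:", coverage_within_two_se())
```

The printed fraction should come out close to 0.95, even though the individual intakes in this sketch are not normally distributed.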
The decrease of SEM with sample size reflects the common sense idea that the more data you have, the better you can estimate something. Since the SEM goes down like the square root of the sample size, the bad news is that to cut the uncertainty in half, the sample size would have to be quadrupled. The good news is that if you can gather only half of the planned data, the uncertainty is only about 40% larger than what it would have been with all of the data, not twice as large.
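The arithmetic behind both statements is a one-liner. In this sketch the function name is illustrative, and the SD of the individual observations is assumed to stay the same as the sample size changes.

```python
import math

def relative_sem(fraction_of_planned_n):
    """SEM relative to the SEM at the planned sample size, for the same SD.
    Since SEM = s / sqrt(n), scaling n by f scales the SEM by 1 / sqrt(f)."""
    return 1 / math.sqrt(fraction_of_planned_n)

print(relative_sem(4))    # four times the data: 0.5, half the uncertainty
print(relative_sem(0.5))  # half the data: about 1.414, roughly 40% larger
```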
Potential source of confusion: How can the SEM be an SD? Probability distributions have means and standard deviations. This is true of the probability distribution that describes individual observations and the probability distribution that describes the behavior of sample means drawn from that population. Both of these distributions have the same mean, denoted $\mu$ here. If the standard deviation of the distribution that describes the individual observations is $\sigma$, then the standard deviation of the distribution that describes the sample mean is $\sigma/\sqrt{n}$, which is estimated by the SEM.
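A small simulation can make the two standard deviations concrete. This sketch reuses the population of 1s, 2s, and 9s from earlier in the note. The sample size of 25, the 10,000 repetitions, and the seed are arbitrary choices made for illustration.

```python
import math
import random
import statistics

population = [1, 2, 9]                 # one-third 1s, one-third 2s, one-third 9s
sigma = statistics.pstdev(population)  # SD of individual observations
n = 25                                 # observations per sample

rng = random.Random(3)
means = [
    statistics.fmean(rng.choices(population, k=n))
    for _ in range(10_000)
]

print("SD of individual values:    ", round(sigma, 3))
print("SD divided by sqrt(n):      ", round(sigma / math.sqrt(n), 3))
print("SD of the simulated means:  ", round(statistics.stdev(means), 3))
print("mean of the simulated means:", round(statistics.fmean(means), 3))
```

The last value should sit close to the population mean of 4, and the SD of the simulated means should be close to the SD of the individual values divided by 5, not to the SD of the individual values itself.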
When you write your manuscripts, you'll talk about the SD of individual observations and the SEM as a measure of uncertainty of the sample mean as an estimate of the population mean. You'll never see anyone describing the SEM as estimating the SD of the sample mean. However, we have to be aware of this role for the SEM if we are to be able to understand and discuss statistical methods clearly.