● Differentiate between a sample and a population, and understand how samples are taken
● Compare and contrast the variance and standard deviation of a sample versus a population
● Explain how sampling relates to variability, and explain why samples drawn from the same population have different means
● Describe the central limit theorem and explain why it is important to statistics
In practice, we often don’t, or even can’t, have full information about the entire population we are interested in.
Instead, we can collect information about a population by sampling randomly from that population. When sampling randomly, every member of the population has an equal chance of being included in the sample. By taking only a subset of the larger population, our goal is to characterize the attributes of the population that we're interested in fairly accurately, without having to measure every single member of the entire population.
We call the numbers that we want to know about in the population parameters and we represent parameters with Greek letters.
For example, we’ve been using mu to represent the population mean. The entire point of sampling is to use numbers calculated from our sample (which we call sample statistics) to try to learn about population parameters. These sample statistics are usually represented with Roman letters.
For example, if we wanted to know the price of sports cars in 2004, we could look at every single sports car sold in 2004. Alternatively, we could randomly select a subset of the cars and use their prices to infer something about the entire population.
Now, let’s say that instead of sampling all sports cars randomly, we only select sports cars from Honda. Selecting only Hondas might give us an estimate of average price that is different from the population average, because we sampled from a non-random subset of the population. In that case, our sample would be representative of the subset we selected, but not representative of the whole population.
This is why it is important to sample randomly when possible. A random sample will be representative of the entire population. A non-random sample might be biased in some way, and might not accurately reflect the parameters of the population. That being said, sometimes when conducting research we have to accept a sample of convenience that isn’t fully random. Often this is because of a limitation of cost or time. A common example is when university researchers conduct studies that test undergraduate students instead of recruiting a random sample of people with a broader range of ages and backgrounds. In other cases, it doesn’t make sense to sample a population completely randomly, even if it would be possible to do so. For example, if we’re interested in differences in education level based on household income, then we’d probably want to make sure we had a balanced assortment of different income levels. There are more advanced sampling techniques for situations like these, but we won’t talk about them here.
One important thing to understand about samples and populations is that the same set of data can be both a sample...and a population, depending on what we’re interested in. The way in which we think about the set of data depends entirely on which questions we’re interested in addressing.
Recall that a population is a complete set of observations, and a sample is a subset of that population. However, any given sample of a population may also be considered a population in its own right, if we are considering a different context.
In the previous example, the non-random sample of Honda sports car prices constitutes a sample of the population of all sports car prices. However, it also represents the entire population of sports car prices for Honda cars. When considering samples, it is important to have a good understanding of what the overall population is.
The module on central tendency and variability introduced measures of variability in general. Here we will look at measures of variability specific to samples.
When we take the variance of a sample, it is often represented by s squared.
In contrast, when we take the variance of a population, it is represented by sigma squared. Recall from our module about central tendency that population variance is calculated by squaring each deviation from the mean, summing the squared deviations, and dividing the sum by the number of observations in the population.
To calculate the variance of a sample, the sum of squared deviations from the mean is divided by the number of observations minus 1.
Notice that the formula for sample variance is the same as the formula we used for population variance, except we have substituted N-1 for N in the denominator. We don't need to go into the math of why we use N-1, but generally speaking, using N-1 as the denominator for our sample variance gives us an unbiased estimate of the true population variance. It will be important to always keep in mind whether you should be dividing by N (for populations) or N-1 (for samples). In either case, variance is reported in the squared units of your measurement.
So for example, if height is recorded in centimeters, the variance of a set of height observations is in centimeters squared, while the standard deviation is back in centimeters.
Standard deviation of a sample is calculated using the sample variance equation, rather than the population variance equation, so the sum of squared deviations should be divided by the number of observations minus 1.
Recall that the standard deviation of a population is the square root of the variance of a population. Similarly, the standard deviation of a sample is the square root of the variance of a sample.
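If you like to see this concretely, here is a minimal sketch in Python (using NumPy, with made-up height measurements) of the difference between dividing by N for a population and by N-1 for a sample:

```python
import numpy as np

# Hypothetical heights in centimeters (made-up numbers for illustration).
heights = np.array([162.0, 170.5, 155.3, 181.2, 168.9, 174.4])

# Treating the data as a complete population: divide by N (ddof=0).
pop_var = np.var(heights, ddof=0)   # sigma squared
pop_sd = np.sqrt(pop_var)           # sigma

# Treating the data as a sample: divide by N - 1 (ddof=1).
samp_var = np.var(heights, ddof=1)  # s squared
samp_sd = np.sqrt(samp_var)         # s

print(f"Population variance (divide by N):   {pop_var:.2f} cm^2, SD: {pop_sd:.2f} cm")
print(f"Sample variance (divide by N-1):     {samp_var:.2f} cm^2, SD: {samp_sd:.2f} cm")
```

Notice that the sample variance comes out slightly larger than the population variance for the same numbers, because the denominator is smaller.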
Why is variability important for sampling?
A sample is one set of observations that comes from a population. If we were to repeatedly sample from the population, each sample would contain different sets of observations that all come from the same parent population.
So, imagine you plunge your hand into your piggy bank and scoop out a handful of change.
You count the total value of the change you have in the handful and put it back in. You do this a second time and count the change again. You wouldn’t expect to take out the same amount both times, but based on the two handfuls of change you pulled out, you can start to estimate how much money fits in your hand on average.
In a similar way, when we generate a sample, we expect it to represent the parent population reasonably well, but not perfectly. Again, if we were to draw another sample from the parent population, we would also expect that sample to be representative of the parent population, even if it were somewhat different from the first sample.
We call the differences between samples the sampling error. Sampling error is the variability in a sample statistic due to chance, and it will vary from sample to sample.
Note that this use of the word “error” doesn’t mean something is incorrect.
Instead, it refers to random variability. We call it error in the context of prediction.
If we are trying to predict how much change we would pick out in a handful, for example, the differences from our predicted amount from sample to sample would be our sampling error.
Let’s actually do some sampling now. We’re going to draw some samples and plot the mean of each sample in a histogram. We call this distribution the sampling distribution of the sample mean and it’s going to be very important for understanding statistics we’ll talk about in other modules.
Here we have two histograms, with car weight on the x axis and count of cars on the y axis. On top we’ve plotted a histogram showing the weight of the first 50 cars in our database. We’ve coded the histogram so that each coloured rectangle you see represents a different car - this is the population we’ll be sampling from. We have the mean, standard deviation and size for the population shown on the left.
Notice that some of the rectangles are black with a red outline. Those represent a random sample drawn from the population. The sample has an n of 5. You can see the sample statistics on the left.
So our first sample of 5 cars has a mean of 3732 pounds and a standard deviation of 527 pounds.
We’ve plotted the sample mean on the histogram on the bottom, which shows the sampling distribution of the sample mean. We will be calculating this distribution's mean and SD as we go. Each black box here will represent one mean value of another sample of 5 from the population above. Right now, we can’t have a standard deviation because we only have 1 mean!
Let’s take another sample.
Our new sample of 5 has a mean of 3420 pounds and a standard deviation of 484 pounds. We’ve averaged that with our first sample to get a sampling distribution mean of 3576 and standard deviation of 220.
We’re going to repeat this process 50 times in total, this time going a lot faster.
Let’s do the same thing again, but this time instead of 50 samples of 5 cars, we’re going to take 200 samples of 5 cars.
Now let’s compare the sampling distribution we just got with the one we obtained previously. Here they both are with a normal curve fit over the data. The means and standard deviations of the two distributions are very similar, but the plot with 200 samples matches the normal curve more closely. This fit will become better and better as the number of samples increases.
Now let’s do one more type of sampling distribution, this time taking 200 samples that each have ten data points instead of five.
Let’s compare this distribution with 200 samples of ten (on the top) to our previous sampling distribution of 200 samples of 5 (on the bottom). Again, I’ve plotted normal curves over the data in red. You can see that the curves fit both our data sets fairly well. However, notice that the plot on top, using a sample size of 10, has less variability than the plot on the bottom that used a sample size of 5. The top plot clusters around the mean more. Its standard deviation is 122 compared to 201 for the bottom plot.
We call the SD of a sampling distribution of the mean the ‘standard error’. We have observed that the distribution becomes more normal as the number of samples increases, and that the standard error gets smaller as the size of each sample increases. These are not coincidences.
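To make the procedure we just walked through concrete, here is a minimal sketch in Python. The car weights below are made-up stand-ins for the database used in the demonstration, so the exact numbers will differ, but the recipe is the same: draw repeated samples, take each sample’s mean, and look at the mean and standard deviation (the standard error) of those means.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population of 50 car weights in pounds (made-up numbers
# standing in for the database used in the demonstration).
population = rng.normal(loc=3500, scale=500, size=50)

def sample_means(pop, sample_size, n_samples, rng):
    """Draw repeated random samples (without replacement) and return each sample's mean."""
    return np.array([rng.choice(pop, size=sample_size, replace=False).mean()
                     for _ in range(n_samples)])

# Mirror the demonstration: 50 samples of 5, 200 samples of 5, 200 samples of 10.
for n_samples, sample_size in [(50, 5), (200, 5), (200, 10)]:
    means = sample_means(population, sample_size, n_samples, rng)
    # The SD of these sample means is the empirical standard error.
    print(f"{n_samples:3d} samples of {sample_size:2d}: "
          f"mean of sample means = {means.mean():6.0f}, "
          f"standard error = {means.std(ddof=1):4.0f}")
```

If you run this, you should see the standard error shrink when the sample size goes from 5 to 10, just as it did in the plots.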
Both of these observations are predicted by the Central Limit Theorem.
In practice, statisticians almost never construct the sampling distribution of the mean the way we just did. Instead, they use known relationships between samples and populations to approximate it. These relationships are summarized by the Central Limit Theorem.
The Central Limit Theorem has two main properties:
First, given a population with a mean of mu (μ) and a variance of sigma squared (σ²), the mean of the sampling distribution of the mean, notated mu sub x-bar (μ_x̄), will also be equal to mu, and the variance of the sampling distribution will be equal to sigma squared over N (σ²/N).
That means, just by knowing the population mean and variance, we know what the sampling distribution will look like. We know that the mean of those sample means will be the same as the population mean, and that the variance of those sample means will be the population variance divided by N. But more importantly, we can also work in the other direction to learn about population means and variances using only samples from that population.
The second property of the Central Limit Theorem is that the sampling distribution will approach the normal distribution as the sample size, N, increases.
Take a look at this sigma squared divided by N. What this means is that the more observations we make in each sample (the larger the N), the smaller the variance of the sampling distribution will be. So if you take a larger sample, the sample mean is more likely to be close to the population mean. The distribution also becomes less variable and more tightly clustered around the mean.
This means that the sampling distribution will approach normal as the sample size increases, even if the parent population is not normally distributed.
This property will become important when you are introduced to the kinds of assumptions many statistical tests make.
A wise person once said that you don’t get anything for free in statistics, except maybe the Central Limit Theorem.
So let’s focus on the principle that the sampling distribution will approach normality NO MATTER how the population is distributed.
That’s extremely useful for us, if true, because it means we can sample from populations with totally unknown distributions and still get good estimates of population parameters.
Since seeing is believing, let’s do some sampling and demonstrate this principle for ourselves.
Here are two graphs. The one on top is a normal density curve with a mean of 76 and a standard deviation of 15.5.
The one on the bottom is a histogram of sample means with a normal density curve plotted over it, just like the ones we’ve been looking at. Notice on the left we’ve taken 10,000 samples of 5 from the normal curve above. And just as the Central Limit Theorem predicted, the means of these distributions are the same, and the standard deviation of the sampling distribution is equal to 15.5 divided by the square root of 5.
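Here is a minimal sketch of that check in Python, assuming the same settings as the demonstration (a normal population with mean 76 and standard deviation 15.5, and 10,000 samples of 5):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n, n_samples = 76, 15.5, 5, 10_000

# Draw 10,000 samples of size 5 from a normal population and take each sample's mean.
means = rng.normal(mu, sigma, size=(n_samples, n)).mean(axis=1)

print(f"Mean of sample means: {means.mean():.1f}   (population mean: {mu})")
print(f"SD of sample means:   {means.std(ddof=1):.2f}  (sigma / sqrt(5): {sigma / np.sqrt(n):.2f})")
```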
Okay, maybe that’s not so impressive, a normal sample from a normal curve. Let’s change that now though, and look at another example.
On top now is a histogram of some fictitious class grades for 256 students. The grades have the same mean and standard deviation as the normal distribution in the previous example. However, notice that this distribution is not normal. Instead it has a strong positive skew. Now the bottom graph represents samples from this population of class grades. We can draw new samples using the slider marked random seed. Notice that the sampling distribution looks essentially the same from seed to seed, and it is still normal, even though the population of class scores is not.
Now let’s try sampling from a uniform distribution with the same mean and standard deviation - our sampling distribution is still normal!
One last test we will show is a Laplace distribution. The name is not important here, just notice how different the shape is from normal. But still, our sampling distribution remains normal. The Central Limit Theorem works!
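If you want to try this yourself, here is a minimal sketch in Python. The three parent populations below (uniform, Laplace, and a positively skewed gamma) are stand-ins chosen to have roughly the same mean and standard deviation as the class-grades example; they are not the exact distributions used in the demonstration.

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma = 76, 15.5        # same mean/SD as the class-grades example
n, n_samples = 5, 10_000    # sample size and number of samples

# Three non-normal parent populations with (roughly) the same mean and SD.
parents = {
    "uniform": rng.uniform(mu - sigma * np.sqrt(3), mu + sigma * np.sqrt(3), 100_000),
    "laplace": rng.laplace(mu, sigma / np.sqrt(2), 100_000),
    "skewed (gamma)": mu - 2 * sigma + rng.gamma(4, sigma / 2, 100_000),
}

for name, pop in parents.items():
    # Draw many samples of size n and take each sample's mean.
    means = rng.choice(pop, size=(n_samples, n)).mean(axis=1)
    print(f"{name:>14}: mean of sample means = {means.mean():5.1f} (mu = {mu}), "
          f"standard error = {means.std(ddof=1):4.1f} "
          f"(sigma/sqrt(n) = {sigma / np.sqrt(n):4.1f})")
```

In each case the mean of the sample means lands near mu and the standard error lands near sigma divided by the square root of n, regardless of the parent distribution’s shape; plotting a histogram of `means` would show the familiar bell shape.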
Now, let’s use our knowledge of the central limit theorem to take another look at sampling distributions of the sample mean. Instead of sampling though, we’re going to assume we take an infinite number of samples each time. We already saw that as the number of samples gets larger, the distribution matches the normal curve more and more closely. With infinite samples you actually get a normal curve. Now, the concept of taking infinite samples is really a theoretical idea, and not something we can actually do. What is important is that we know mathematically how those different samples should relate to the population.
We’ll draw a sample from a normal population with a mean of 8 and a standard deviation of 0.1.
We’ll call this population X, and we’re going to plot its Probability Density Function - or PDF. We’re calling it a function because we generated it using the normal equation. Our y-axis, which is read f of X, is just saying that the calculated value of Y depends on the value of X.
We will plot out the sampling distribution as a density curve also, this time in red. We’ll call it X-bar, to indicate we’re plotting the mean and standard deviation of the many, many samples we took from the population.
We’ll start by drawing infinite samples of size 1 from the population. That’s what the n = 1 means on the graph.
Notice that the means of both the sample and the population are the same. Also notice when we draw many, many samples of size 1, the standard deviation of the sample (sigma xbar) and population (sigma) are the same as well, so their density distributions overlap.
What we’re going to vary next is the number of scores we use for each of the infinite number of samples. We mentioned before that the standard deviation of the sampling distribution is called standard error.
Let’s see how standard error changes as we increase the number of scores in each of our many samples from one to two… The standard error is getting smaller, but our mean is staying the same. The standard error will keep getting smaller as we increase our N, and this change is predictable. It can be calculated when generating the sampling distribution, or it can be approximated by dividing the sample standard deviation by the square root of N, where N is the number of data points in each sample.
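As a quick numerical sketch of that relationship (assuming the same population as above, with a mean of 8 and a standard deviation of 0.1):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 8.0, 0.1   # the normal population from the example above

# The standard error shrinks as the number of scores per sample grows.
for n in (1, 2, 4, 16, 64):
    print(f"n = {n:2d}: standard error = sigma / sqrt(n) = {sigma / np.sqrt(n):.4f}")

# With only one sample in hand, we can approximate the standard error
# by dividing the sample standard deviation by sqrt(n).
n = 16
sample = rng.normal(mu, sigma, size=n)
approx_se = sample.std(ddof=1) / np.sqrt(n)
print(f"Approximate standard error from one sample of {n}: {approx_se:.4f}")
```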
Okay, so maybe at this point you’re asking why this all matters. As we’re about to see, understanding how the variability in populations and samples is related allows us to make some good guesses about what is going on in a population just from numbers we calculate from a sample!
In this module you have been introduced to sampling and the central limit theorem.
Fundamentally, the central limit theorem describes how the mean and standard deviation of a sample will be related to the mean and standard deviation of a population. Knowing this allows us to estimate numbers in a population, which we can’t measure directly, using numbers from a sample, which we can.
For example if we wanted to know how tall Canadians are on average, we couldn’t measure every Canadian! But we could take a sample and use it to estimate the height of the average Canadian using the central limit theorem and our sample numbers.
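A minimal sketch of that kind of estimate in Python, using made-up height measurements rather than real data, might look like this:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical random sample of 100 measured heights in cm (made-up numbers;
# a real study would use actual measurements from a random sample of Canadians).
sample = rng.normal(170, 9, size=100)

mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(len(sample))   # approximate standard error

# By the central limit theorem the sample mean is approximately normally
# distributed, so roughly 95% of samples land within about 2 standard errors
# of the true population mean.
print(f"Estimated average height: {mean:.1f} cm, give or take about {2 * se:.1f} cm")
```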
The goal of this module is to give you the background information you’ll need to understand the specific tests that are included in the following modules.
Now that you’ve reached the end of this module:
You should know what a sample is and how it’s different from a population
This includes recognizing how the variance of a population relates to the variance of a sample taken from that population.
You should also be able to talk about the variability between samples that we call sampling error
You should be able to describe the two basic principles of the central limit theorem, which state that:
The average of our sample means will itself be the population mean, and the standard deviation of the sample means, which we call the standard error, equals the population standard deviation divided by the square root of N
Also, the sampling distribution will approach the normal distribution as the sample size, N, increases.
Finally, you should understand how we use these principles to estimate population parameters from sample statistics