Define, interpret, and calculate measures of central tendency including the mean, median, and mode. Compare and contrast the utility of these 3 measures with different types and sets of data.
And last, define, interpret, and calculate measures of variability or spread in the data including variance, standard deviation, range, and quartiles. You will also understand how these relate to measures of central tendency.
When we say ‘central tendency’ we are talking about how data clusters around some middle value. Put another way, we are looking to identify a typical, representative, or average value in a dataset. We do this sort of thing intuitively all the time.
For example, if you ask a friend how much dinner at a certain restaurant costs, you wouldn’t expect them to list off all the menu items they could think of. Instead, they’d likely say something like, ‘about 20 dollars’. Hearing that, we wouldn’t be that surprised even if there were NO dishes for 20 dollars exactly - we’d expect some to be a bit more, some a bit less but all ‘in that neighborhood’.
Or if you are looking to buy a house, you might look around at a number of different locations to get an idea of how much they sell for, ‘in general’.
Variability is another useful way to describe data and tells us about how our data spreads out. Is it tightly clustered around a central value or is it diffuse? Does it include many very high or very low values? Different brands of cola have very similar costs - in other words, cola prices have low variability.
That house you’re looking for though could be very different in price from location to location, and we’d say there is high variability in house prices.
Statistical measures of central tendency are simply mathematical ways of describing these typical, or central, values. Importantly, they allow us to get a useful and simplified summary of what’s going on in our data - something that is critical in big data sets. Likewise, statistical measures of variability use numbers to characterize and simplify our understanding of the spread in our data.
We’ll begin by describing three measures of central tendency: the mean, the median, and the mode. We’ll use histograms (introduced last module) and other graphs to help us visualize these concepts as we go.
In everyday language, “the mean” is what most people are talking about when they say ‘the average’.
To calculate the mean we take all the scores, add them up, then divide that sum by the number of scores.
In statistics it is often more concise to describe a sequence of mathematical instructions like the one we just gave using symbols. Let’s look at how we would represent the mean using statistical notation.
On the left of the equals sign, take a look at the X with the bar above it (We pronounce that “x bar”). Placing a bar above a variable, in this case X, means to take the average of or to average ‘over’ that variable. But, unlike in simple algebra, ‘X’ represents a set of values rather than one single value. That set is composed of whatever scores we are interested in at the moment. That might be the height of all members of the basketball team or the cost of entry level domestic cars from different manufacturers.
The instructions on how to average are on the right hand side of the equation.
The uppercase ‘N’ in the denominator stands for Number of scores. In the numerator, we see the uppercase Greek letter Sigma followed by X. This notation means sum or add all the values in the set X (that is, all of the scores on our variable X). Taken together, this means we should sum all the X values and divide the results by the number of scores in X, just as we described before.
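Rendered in standard notation, the instructions just described (sum the scores, divide by the number of scores) would look like this:

```latex
\bar{X} = \frac{\sum X}{N}
```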
To demonstrate this I pulled out all the different numbers of cylinders a car might have (the 1 in this list stands for a rotary engine!):
4 6 3 8 5 12 10 1
There are 8 scores, so N is 8. So, our mean will be:
X bar = (4 + 6 + 3 + 8 + 5 + 12 + 10 + 1) divided by 8, which is 49 divided by 8, equals 6.125. So for this set of cars, the mean number of cylinders is 6.125.
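The same calculation can be sketched in a couple of lines of Python, using the cylinder counts from this example:

```python
# The unique cylinder counts from the cars example
scores = [4, 6, 3, 8, 5, 12, 10, 1]

# Mean: sum all the scores, then divide by the number of scores (N)
mean = sum(scores) / len(scores)
print(mean)  # 6.125
```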
Let’s use those same numbers to look at median and mode.
The median is simply the middle value in a distribution of numbers if we order them from highest to lowest or vice versa.
Let’s reorder them now.
If there’s an odd number of values, the median will actually be a score from the distribution.
With an even number of values like our example, however, there are two ‘middle’ scores. In that case we take the mean of these two middle values as the median.
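Both cases (odd and even N) can be sketched in Python like so:

```python
def median(values):
    """Middle value of the sorted scores; mean of the two middle values if N is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd N: the median is an actual score from the distribution
        return ordered[mid]
    # Even N: average the two 'middle' scores
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([4, 6, 3, 8, 5, 12, 10, 1]))  # 5.5
```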
More formally, the median is also the 50th percentile. In other words, 50 percent of the scores in the distribution fall below the median. We can define any percentile between 0 and 100, and it would have the same meaning. So the 3rd percentile is the score that is above only 3 percent of the scores (this will come up again in a moment when we discuss measures of variability).
So our median number of cylinders is 5.5 - slightly different from our mean.
Our third measure of central tendency is the mode. To find the mode, we find the most commonly occurring score in the distribution. We just count how many times each score in the distribution occurs and whichever has the highest total is our mode.
If it turns out that we have a tie, and those values are adjacent, like 4 and 5 if we were using integers, we would take the mean of those values and report that as our mode. If on the other hand the scores were not adjacent, we would say that there were two modes, that is, the distribution is bimodal. It’s even possible for a distribution to be described as multimodal if there are more than two distinct clusterings of scores.
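A small sketch of mode-finding in Python, using hypothetical score lists (not from the cars dataset) to illustrate the tie rules just described:

```python
from collections import Counter

def modes(values):
    """Return all scores tied for the highest frequency, in sorted order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

# Adjacent tie: 5 and 5 occur twice, 6 occurs twice -> report their mean, 5.5
print(modes([4, 5, 5, 6, 6, 7]))  # [5, 6]

# Non-adjacent tie: the distribution is bimodal at 2 and 9
print(modes([2, 2, 9, 9, 5]))     # [2, 9]
```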
To demonstrate mean and median and at the same time keep the calculations simple, I used only the unique values for number of cylinders in our cars dataset. That doesn’t make much sense for the mode because it summarizes how values REPEAT. Let’s go back and get all the different values that occur and calculate our mode from that.
Because viewing all 428 values in the cars dataset at once would be difficult and cluttered, let’s look at a frequency table that counts how often each value occurs.
This format also makes it immediately apparent what our mode is. You can see that 6 cylinder engines are most common.
Before moving on to describe some ways to measure variability, let’s recalculate all 3 of our measures of central tendency using the same values we did for the mode and plot them on the same histogram.
Recall that the x-axis is divided into bins each representing a range of values in the number of possible cylinders. Because we have a small number of unique values, each bin is exactly one cylinder wide here. The y-axis tells us how many scores are in each bin.
I’ve plotted each of the measures of central tendency as a coloured vertical line. The line’s colour matches the text displaying that value. Note that our mean and median are different than before - that's because we’ve recalculated them using the same procedures but using all 428 numbers - that is, the entire dataset - rather than just a subset. Note also that all three are clustered close together. When we talk about the shape of a distribution later in the module, you’ll see that this tells us something significant about our distribution.
But for now, let’s move on to variability.
So far, we have talked about viewing distributions graphically, and calculating various measures to describe their central tendency. We discussed how we use summary statistics to identify the middle of a large data set. However, by describing the data using a single value we do not capture spread in the data. We’d like to be able to summarize this spread statistically as well.
Identifying the centre of the data AND how spread out the data are around that centre can be incredibly powerful. With as few as two numbers we can often summarize even enormous datasets effectively.
Often, we describe the variability (or spread of data) around, or relative to a measure of central tendency, usually the mean.
Recall that variability refers to how "spread out" a group of scores is.
In the next few slides, we will discuss four frequently used measures of variability: range, interquartile range, variance, and standard deviation.
Measures of central tendency give us a nice summary of the middle or average value. However, what if we want to characterize the extremes of a distribution? In other words, what if we want to know about the values furthest from the center - to understand the boundaries of the distribution?
A very simple way to assess that is to find the smallest number, or minimum value, and the largest number, or maximum value.
Continuing with our cylinders example our minimum is 1, and maximum is 12. In some sense these values are the endpoints or borders of our distribution. A question that naturally follows from this is how long is our data set - in terms of length along a number line?
To summarize that we can use the range: a measure of distance from the highest score to the lowest score in a data set. We calculate the range by subtracting the minimum from the maximum. So in this example, our range is 11.
Range as a measure of spread is more often associated with the median rather than the mean. Of course, we can use whatever summaries we think are best, in any combination. But if we consider why this might be the case, we notice that both the range and the median are related to position within an ordered dataset. The median is the middle position, and the range describes the difference between the maximum - the last position, and the minimum - the first position.
Two more measures of spread in the data are often reported with the minimum, median and maximum. Taken together these values form the 5 number summary - a powerful tool for summarizing distributions of data. Those two additional values are the first and third quartiles. Recall when we were talking about the median I told you we could also call it the 50th percentile - the point at which 50% of the scores were lower than it. Imagine dividing a distribution into 100 equal segments called percentiles. The median is the point below which 50 of these segments fall. A quartile chops a distribution into just 4 pieces, and so the first quartile has ¼ of the entire distribution below it and the third quartile 3/4. We could equivalently call them the 25th and 75th percentile, since 100/4 equals 25.
If that sounds complicated, it might be easier to think of it this way. If we split the dataset according to the median, the first quartile is the median or middle value of the lower half of the distribution, and the 3rd quartile is the median of the top half.
Just like the median proper, if we have an even number of scores, we will have to take the mean of the two middle values.
We can find the interquartile range, or IQR, by subtracting the first quartile from the third quartile.
Let’s do the full 5 number summary here.
We already know that the median is 5.5. Notice that we could also call this the second quartile. So, our 1st quartile must be the mean of 3 and 4, or 3.5, and our 3rd quartile is the mean of 8 and 10, or 9. The minimum is 1 and the maximum is 12. Our range is 11, and our interquartile range is 9 minus 3.5, which is 5.5.
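The whole 5 number summary can be sketched in Python. Note that this uses the median-of-halves convention described above; other conventions (and much statistical software) interpolate quartiles differently and can give slightly different values:

```python
def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    """Min, Q1, median, Q3, max using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]        # bottom half (middle score excluded when N is odd)
    upper = ordered[(n + 1) // 2 :]  # top half
    return min(ordered), median(lower), median(ordered), median(upper), max(ordered)

cyl = [4, 6, 3, 8, 5, 12, 10, 1]
lo, q1, med, q3, hi = five_number_summary(cyl)
print(lo, q1, med, q3, hi)  # 1 3.5 5.5 9.0 12
print(hi - lo, q3 - q1)     # range 11, IQR 5.5
```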
The 5 number summary can be visualized very effectively on a graph called the boxplot, which can be a powerful tool for detecting skew in your distribution. We’ll come back to the idea of boxplots and skew later.
For now, let’s talk about variance and standard deviation, two measures of spread in the data around the mean.
Variance provides us with a single number that measures how far, on average, our scores fall from the mean. One way to get at this difference could be to take the difference between the mean score and each data point and then calculate the mean of those differences. Such differences are called deviations because they measure how a score deviates from the mean.
Using our new notation it would look like this: get the deviations for each point
then average them by summing them together and dividing by the number of data points, N.
The greater this average deviation, the more variability, right? As it turns out, however, this is problematic.
Because we will have observations above the mean (which yield differences that are positive) and below the mean (which yield differences that are negative), the average of those deviations around the mean will always be zero!
Even though we know there is variability around the mean, the math has betrayed us!
To get around this, we must square these deviations from the mean before adding them together - we could also call this raising to the power of two and it means we multiply each deviation by itself. Recall that multiplying a negative number by another negative number results in a positive number.
Therefore, whether the deviations are negative or positive, we will end up with a positive value.
We usually denote variance using the lower case greek letter sigma raised to the power of 2.
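As a standard rendering of that notation, the (population) variance just described can be written:

```latex
\sigma^2 = \frac{\sum \left( X - \bar{X} \right)^2}{N}
```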
Let’s go back and look at the possible unique number of cylinders a car from 2004 might have:
We determined earlier that our mean is 6.125.
To calculate the variance, let’s follow the steps as they’re laid out in the formula.
First, we calculate the deviation for each score, then square it. Our first value is 4 minus 6.125, squared - notice that the brackets around 4 and 6.125 remind us to subtract before squaring. We then repeat this for each number, add them up, and divide by 8.
Note that to save space we have placed an ellipsis, which means do the same thing with all the intervening numbers.
Adding up all of the squared deviations gives us a numerator of 94.875, divided by 8 in the denominator. Finally, our variance is about 11.86.
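The variance calculation laid out above can be sketched in Python (the exact, unrounded numerator is 94.875):

```python
scores = [4, 6, 3, 8, 5, 12, 10, 1]
mean = sum(scores) / len(scores)  # 6.125

# Square each deviation from the mean (subtract first, then square)
squared_devs = [(x - mean) ** 2 for x in scores]

# Sum the squared deviations, then divide by N
variance = sum(squared_devs) / len(scores)
print(sum(squared_devs))  # 94.875
print(variance)           # 11.859375, about 11.86
```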
Now, I’m sure some of you are feeling a little alarmed about what we just did with the variance. I mean, by squaring the deviations, haven’t we inflated those differences? Yes, we have: we multiplied each deviation by itself. We’ll see in later modules that doing this can be very useful for conducting certain statistical tests. But it doesn’t make for a good summary of the average deviation around the mean, which, after all, is what we wanted in the first place.
Fortunately, there is a simple way to fix this problem: we can take the positive square root of our variance, which reverses the effect of squaring.
We call the resulting number the standard deviation and denote it with a lowercase sigma. It is like the variance, but it is expressed in the original units of measurement.
Its calculation would look like this.
For our cylinders example, sigma is the square root of 11.86, or 3.44.
Together with the mean, we have a nice summary of the distribution of unique cylinder values. On average, there are 6.125 cylinders, and the typical deviation from that mean is 3.44 cylinders.
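Putting the whole pipeline together, the standard deviation is just the positive square root of the variance we computed:

```python
import math

scores = [4, 6, 3, 8, 5, 12, 10, 1]
mean = sum(scores) / len(scores)
variance = sum((x - mean) ** 2 for x in scores) / len(scores)

# Standard deviation: the positive square root of the variance,
# expressed in the original units (cylinders)
sigma = math.sqrt(variance)
print(round(sigma, 2))  # 3.44
```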
So far we’ve been talking about how to summarize distributions using mathematical summary statistics. But we can, and absolutely should, look at graphs of distributions as well to help us understand them. A very valuable first step is the histogram, which was introduced last module and again when we were talking about the mean, median, and mode. A number of terms exist to describe the shape of a distribution, and we’re going to talk about some of those now.
Let’s revisit our histogram of number of cylinders.
Think back to our discussion of the mode. We said that a distribution could be unimodal, bimodal (2 modes) or multi modal (many modes). These are terms that describe the shape of a distribution, so you already have one term under your belt. You can describe modality! The distribution here is unimodal.
We can also describe the shape of a distribution in terms of its symmetry. An object exhibits symmetry when it has an axis of symmetry, a line you can draw through it where each half is a reflection of the other. Butterflies are symmetrical down the line of their body, and so are humans and countless other things in nature.
When talking about a distribution, this axis of symmetry is the mean. Thus, if the distribution is roughly the same on each side of the mean, it is symmetrical.
But, by convention, rather than describing distributions in terms of how symmetrical they are, we describe them by their skewness, that is, the degree to which a distribution is asymmetrical about its mean. So, skewness is a measure of asymmetry, a distribution’s lack of symmetry.
Let’s take a look through our cars database and see if we can find some examples of symmetrical and skewed distributions.
First let’s look at a symmetrical distribution.
We’re looking at a histogram of car wheelbase. The red, dashed vertical line is the mean of the distribution. Imagine that this graph was on a piece of paper. Now imagine folding along that red mean line. The two halves would overlap as they touched, reinforcing that this distribution is symmetric around its mean. This makes sense if you think about it. Limitations of parking and roadways dictate that wheelbase should vary only a little around some optimal size.
Now let’s take a look at a histogram of suggested retail prices, again with a red line representing the mean.
Notice how the graph is different on either side; there would be less overlap if we folded it along the red line. This graph is asymmetric, or skewed. This is exactly the same kind of pattern you might see with wages, where a small number of people, like corporate CEOs, earn very high amounts while most others earn far less. Here the presence of a few expensive luxury cars is ‘pulling’ the mean up, skewing the distribution.
We can describe two types of skew, positive and negative.
Positively skewed distributions have a few high scores, while the rest of the scores congregate around a lower value; the distribution trails off to the right, or has a long tail to the right. Negative skew is the opposite: a few low scores below a congregation of higher scores, and a distribution trailing off to the left.
Think for a second whether this is positively or negatively skewed... that’s right, it’s positively skewed.
Incidentally, both of these distributions are unimodal.
I’ve placed the two distributions side by side and plotted all three measures of central tendency that we discussed using coloured vertical lines.
Notice that in a symmetric, unimodal distribution like our wheelbase histogram, mean, median, and mode converge on a very similar value.
In a skewed distribution, like our price distribution, the mean and median diverge. This divergence is not a problem, however; it’s actually informative. Notice that the median line is closer to the highest count of car prices than the mean line. So, in a skewed distribution, the median is often a better estimate of central tendency than the mean. Because the median depends on the rank order of the observations rather than on their exact values, we say it is robust - or resistant - to outliers.
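The median’s robustness is easy to demonstrate with a small sketch in Python. The prices here are hypothetical, chosen only to show how one extreme value drags the mean but barely moves the median:

```python
def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

prices = [20, 22, 25, 27, 30]          # hypothetical prices, in thousands
with_outlier = prices + [500]          # add one luxury-car outlier

print(sum(prices) / len(prices), median(prices))                    # 24.8 25
print(sum(with_outlier) / len(with_outlier), median(with_outlier))  # 104.0 26.0
```

The mean jumps from 24.8 to 104.0, while the median only moves from 25 to 26.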
The concept of an outlier is an important one in statistics and we’ll explore it in more detail in the next slide. But before we leave, take a look at the modal car price in green. If we think about it, it doesn't tell us that much about car prices, because looking at the data we see that only 2 cars share that price. It just turns out that very few cars are exactly the same price. The mode can be useful when looking for the most common values among repeated values and we want to capture that pattern, but it doesn’t really mean much here. For example, modal shoe size might be of great interest to a shoe retailer since it would be important for them to keep stocked!
I mentioned earlier that I would return to summarizing data using boxplots, so let’s do that now. In this last section we’ll discuss outliers and show how boxplots are valuable for detecting them in addition to being a useful overall visual summary of your data.
Before talking about boxplots numerically, let’s just think about what it means to be an outlier. In one sense, outliers are the opposite of the typical values identified by our summary statistics: they are for some reason unique, or at least markedly unlike the other scores, usually much higher or much lower than our typical values. Because of this, we might question whether it is appropriate to include them in our summaries of typical values. On the other hand, we must be careful not to disregard real and important differences. How to appropriately deal with outliers will very much depend on what kind of data you are looking at. If, for example, a patient showed unusual immunity to a disease, we might not want to disregard that data but instead focus on understanding her increased immunity.
On the other hand, at $192,465, the most expensive car in our data set may be so far from a ‘normal car’ that maybe we would want to ignore it when summarizing car prices.
The famous statistician John Tukey defined potential outliers as any data points that fall more than 1.5 times the interquartile range below the first quartile or above the third quartile, and extreme outliers as anything more than 3 times the IQR beyond the quartiles. He developed boxplots to help observe these outliers and to examine skew in a distribution.
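Tukey’s 1.5 x IQR rule can be sketched in Python. The data list here is hypothetical (our cylinder values plus one made-up extreme score), and the quartiles again use the median-of-halves convention, so other software may place the fences slightly differently:

```python
def tukey_fences(values):
    """Outlier fences at 1.5 * IQR below Q1 and above Q3 (Tukey's rule)."""
    ordered = sorted(values)
    n = len(ordered)

    def med(xs):
        m = len(xs) // 2
        return xs[m] if len(xs) % 2 else (xs[m - 1] + xs[m]) / 2

    q1 = med(ordered[: n // 2])
    q3 = med(ordered[(n + 1) // 2 :])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [1, 3, 4, 5, 6, 8, 10, 12, 40]  # hypothetical: one extreme score added
low, high = tukey_fences(data)
print(low, high)                                  # -7.75 22.25
print([x for x in data if x < low or x > high])   # [40]
```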
Here I’ve plotted the distribution of suggested price and wheelbase again, but as box plots. The y-axis (vertical) represents the numeric range of the data; on the x-axis we have only one group: all cars from 2004.
Let’s interpret these graphs from the middle out. The black line bisecting the box represents the median value. The height of the box represents the IQR: the lower bound of the box is the 1st quartile, and the top is the 3rd quartile. The vertical lines extending from the box are called whiskers. The top whisker contains all scores in the 4th quartile that are not classified as outliers; the bottom whisker does the same for the 1st quartile. Outliers are plotted beyond the whiskers as points. Although this is not part of a classic boxplot, I have plotted the mean as a dashed red line for comparison.
We can note several things about our distributions from these plots. If a distribution is symmetric, the median line will be at or very close to the middle of the box. We see that wheelbase median is much closer to the middle than that of price. If the median line is higher than the middle the distribution is negatively skewed; if the median is lower than the middle, the distribution is positively skewed.
The number of dots can also tell us if a long ‘tail’ extends in either direction, another indication of skew.
Boxplots can also be used to compare different groups, and are sometimes plotted horizontally instead of vertically. Here, we’re looking at the distribution of prices for cars, with cars grouped by how many cylinders they have.
Another common, and efficient, way to examine a distribution is by plotting a histogram and placing a boxplot of the same data in the bottom or top margin of the figure. Here we’ve plotted a boxplot in the bottom margin of our price histogram - in this case the boxplot uses the same x-axis as the histogram.
Let’s look back at our Intended Learning Outcomes.
In this module we discussed how to describe and depict central tendency and variability.
We’ve defined and calculated measures of central tendency including the mean, median, and mode, and overviewed measures of variability or spread in the data including variance, standard deviation, range, and quartiles.