By the end of this module on correlation, you should be able to:
• Read and interpret scatterplots of data
• Understand various types of relationships between two variables
• Calculate Pearson’s r and interpret correlations
• Understand the significance of a correlation
In research, we often want to examine two variables that we believe to be related or dependent on one another in some way. We can directly examine the relationship between two variables by studying how one variable changes as a function of the other.
In other words, one variable might increase as the other variable increases, or it might increase as the other variable decreases.
For example, we might be interested to know whether the final grade a student gets in a course is related to how many hours they spent studying. We might expect the grade to be higher following more hours of study.
To best understand how to examine relationships between variables, we will review correlations in this module and regression in a separate module. Since these are related topics, we recommend that you also study the module on regression, and then return to this module afterward to review correlation again for a complete understanding of these topics and how they relate to one another.
The simplest and clearest way to visualize the relationship between two variables is to use a kind of graph we call a scatter plot. In a scatter plot, every observation pair is represented by a single point.
When we want to represent two values for a single observation, a very useful way to represent them is with a horizontal x axis and a vertical y axis.
When we have a dataset that comes with a number of paired values for each observation, we can assign one of those values in the pair to be the x value and the other to be the y value. Then we can plot each of the datapoints in our set in a two dimensional space. Let's take a look at a scatter plot, using some data.
In this set of imaginary data, we are interested in how ice cream sales rise or fall depending on the daily temperature. To investigate this, we’ve measured the total ice cream sales and the mean temperature on a number of separate days.
Before we look at these data, most of us would expect that more ice cream would be sold on hotter days. This is the sort of thing we can visualize on a scatterplot.
On the x axis we have temperature, in degrees Celsius. On the y axis we have ice cream sold, in dollars. So this first point, here [highlight point], is at about 12 degrees, and about $180 [dotted line appears]. So x = 12, and y = 180 for this point. Up here, x is about 25 [highlight point], and y is about 600 [dotted line appears]. Notice that the lowest x value also has the lowest y value, and the highest x value also has the highest y value. There seems to be a similar relationship in the other data points, too. As x increases, y almost always increases. This scatterplot depicts a positive relationship between the two variables.
We can do some interesting statistics to draw a straight line right through the middle of these so the points are as close as possible to the line. This is called the regression line, or line of best fit. To learn more about the regression line and how it is generated, you can refer to the Regression Module.
For now we can use this line to tell us something about the relationship. The more tightly the dots are clustered around this line, the stronger the relationship between the two variables.
This is the heart of correlation: how well do all the datapoints fit on a line we can draw, and in what direction is that line sloping? If we can draw a straight line that fits all the data points perfectly, that means we have a perfect linear relationship. Of course, this kind of perfect relationship rarely happens in real data, but what we are interested in is exactly HOW STRONG a relationship is.
[Data from Pierce, Rod. (2 Oct 2014). "Correlation". Math Is Fun. Retrieved 21 Jul 2015 from http://www.mathsisfun.com/data/correlation.html]
In some cases, we might expect a negative relationship between two variables. A negative relationship indicates that one variable increases as the other decreases, where smaller values of x are generally associated with larger values of y. For example, as age increases, the number of hairs on one's head decreases.
Notice that these datapoints don’t really “fit” on this line as well as they did in our ice cream data.
Finally let’s look at a data set with no linear relationship. Don’t worry about the units this time, we’ll just call the variables x and y. On first inspection, these data don’t seem to have a negative or positive relationship. We will draw the regression line as we did before.
It doesn't look as though the data are really grouping around this line, so the line may not accurately capture a pattern in the data. No matter where we draw the line, there are going to be a lot of data points far away from it. There's no clear trend up or down and no way to group all the points closely to this particular straight line. So in this case, we can say that there is little or no relationship between these two variables.
So far we’ve discussed three types of relationships depicted using scatterplots. Now we need a formal way to summarize the strength of these relationships so that we can compare datasets that may not be using the same units of measurement.
We can make an assessment of the relationships between variables more formal by implementing some statistics.
To do this, we can use a statistical test called a correlation to measure the relationship between two variables. By conducting a correlation, we can summarize this relationship using one value called the correlation coefficient.
Recall that we can transform all different kinds of normally distributed scores into standardized z scores. When we know the z-score for something, we know how extreme a score is compared to the mean, regardless of the original units of measurement. Correlation has a similar goal: to be able to take these different relationships and put them on a common scale.
Essentially correlation does this by transforming the original units into standardized scores and then finding the linear relationship between those new transformed scores. In this module, we will not focus on the equations for correlation, but rather on understanding the correlation statistic.
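The module does not cover the equations, but for readers who want to see the standardization idea in code, here is a minimal Python sketch. The numbers are hypothetical, illustrative data (not the module's dataset): each variable is converted to z-scores, and Pearson's r is the average product of the paired z-scores.

```python
import math

# Hypothetical paired data (illustrative only): daily temperature (x)
# and ice cream sales in dollars (y) for six days.
x = [12, 15, 18, 20, 23, 25]
y = [180, 250, 320, 410, 520, 600]

def z_scores(values):
    """Standardize values using the sample standard deviation (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

# Pearson's r: sum the products of paired z-scores, divide by n - 1
# (to match the sample standard deviations used above).
zx, zy = z_scores(x), z_scores(y)
r = sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)
print(round(r, 3))  # → 0.992
```

Because both variables are standardized first, the original units (degrees, dollars) drop out entirely, which is exactly what puts different relationships on a common scale.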
A correlation can describe the relationship between either population or sample variables. Correlation can be calculated for any two paired variables with equal numbers of observations. In this way, it is similar to a paired t-test (but instead of looking for differences between means, we want to know about the relationship between two variables).
We can calculate a correlation for any variables that are measured on an interval or ratio scale. There are other kinds of correlation you can do on ordinal data, but we won’t go into those here.
When we look at the linear relationship between standardized scores, we can now effectively summarize the relationship between variables we are comparing using a correlation coefficient.
Pearson’s r is the most common correlation coefficient. It is a value between negative 1 and positive 1 that represents the strength of the relationship between two variables.
The sign of the correlation indicates whether the slope of the regression line is positive or negative, and the Pearson’s r value indexes the strength of the relationship. A value of 0 means that the variables are unrelated. The further the value is from 0 (regardless of the sign), the stronger the relationship is and the more closely the data points cluster around the regression line.
So r = 0.9 and r = -0.9 represent equally strong relationships. When r is +0.9, as one variable increases the other increases; when r is -0.9, as one variable increases the other decreases.
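A quick Python sketch makes the sign-versus-strength point concrete. The data here are made up for illustration: two perfectly linear relationships, one sloping up and one sloping down, give r values of equal magnitude but opposite sign.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y_up = 2 * x + 3      # perfect positive linear relationship
y_down = -2 * x + 13  # perfect negative linear relationship

# np.corrcoef returns a 2x2 correlation matrix; [0, 1] is Pearson's r.
r_up = np.corrcoef(x, y_up)[0, 1]
r_down = np.corrcoef(x, y_down)[0, 1]

# Both magnitudes equal 1; only the sign differs.
print(round(r_up, 3), round(r_down, 3))
```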
As we mentioned, we won’t go into calculating Pearson’s r by hand in this module. Fortunately, Excel does a nice job of making this calculation simple for us, using the CORREL() function.
Let’s go back to our ice cream sales example. We observed a positive relationship, or positive correlation, between ice cream sales and temperature.
Our regression line shows a positive trend for the data points and we can see that the points are clustered tightly around the line, so we expect the relationship to be relatively strong.
Now we will calculate Pearson’s r in Excel. Here we have the paired data points for the temperature, which represents our x values, and our total ice cream sales in dollars, which represents our y values. We will use the formula bar at the top of the page to type in our function.
First, select the cell in which you want your Pearson’s r value to appear.
Next, select the formula bar and type “=CORREL(“. The prompt will display “array 1, array 2”. The function wants you to fill in your x variable as array 1 and your y variable as array 2. To do this, you can highlight the x values by clicking the first temperature and dragging over the rest of the temperatures while holding down your mouse button. The cells that include your temperature data are now highlighted and your function is updated.
Now, you need to enter array 2 as a y variable. Insert a comma to separate your arrays. Then select the first value for ice cream sales and drag downward while holding down your mouse button to highlight the entire column of numbers. The function is again updated to reflect the data you just selected. End your function by entering a closing parenthesis. Press enter.
Your Pearson’s r has now been computed and you can see that your value is 0.958. Since this value is positive and very close to 1, we can say that this is a strong positive correlation.
It is important to note that if we calculate the slope of the line on the graph as you see it here, using the algebra you learned in high school, the slope would not equal 0.958. This is because the graph still depicts the original units of measurement (temperature in degrees Celsius and ice cream sales in dollars).
By using a correlation statistic, which always must be between -1 and 1, we can compare the strength of this relationship to any other relationship between variables, regardless of the original units. Remember that our correlation statistic doesn’t just tell us about directionality of slope on the original measurements, but also tells us about how perfectly the line matches all of the datapoints. This also means that just looking at how STEEP the slope is on our original scatterplot doesn’t give us a good idea of how strong the correlation will be.
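This unit-independence is easy to demonstrate in code. The sketch below uses hypothetical temperature/sales numbers (not the module's dataset): rescaling the x variable from Celsius to Fahrenheit changes the slope of the fitted line but leaves Pearson's r untouched.

```python
import numpy as np

# Hypothetical data (illustrative only): daily temperature (°C)
# and ice cream sales ($).
celsius = np.array([12, 15, 18, 20, 23, 25])
sales = np.array([180, 250, 320, 410, 520, 600])

fahrenheit = celsius * 9 / 5 + 32  # same days, different x units

r_c = np.corrcoef(celsius, sales)[0, 1]
r_f = np.corrcoef(fahrenheit, sales)[0, 1]

# np.polyfit(x, y, 1) returns [slope, intercept] of the best-fit line.
slope_c = np.polyfit(celsius, sales, 1)[0]
slope_f = np.polyfit(fahrenheit, sales, 1)[0]

print(round(r_c, 3), round(r_f, 3))        # the two r values are identical
print(round(slope_c, 1), round(slope_f, 1))  # the slopes differ
```

This is why a steep-looking line on a scatterplot tells you nothing reliable about the strength of the correlation: the steepness depends on the units, while r does not.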
Now let’s try this with our hair example. Here we observed a negative relationship between age in years and the number of hairs on one’s head. Our regression line shows a negative trend for the data points and we can see that the points are less tightly clustered around this line than in the previous example.
We can also calculate Pearson’s r in Excel for this example. We have our paired data points for age and the number of hairs on one’s head. Again, we select a cell for our Pearson’s r value and type “=CORREL(“. We highlight the x values by clicking the first age value and dragging over the rest of the age data points. We insert a comma to separate the arrays and select the first value for number of hairs. Drag downward while holding down your mouse button to highlight the entire column of numbers. End your function by entering a closing parenthesis and press enter.
Your Pearson’s r has now been computed and you can see that the value is -0.448.
In this example, our r value is negative and not as strong as in our previous example. As we noticed before, datapoints are not clustered as tightly around this line as in the previous example.
Now we have measured Pearson’s r, but how do we interpret this value? In other statistical tests, we sometimes want to calculate effect size, to see how big a difference is. Here, when we are trying to measure how strong a relationship between two variables is, the r value is very useful in that it tells you quite directly about the strength of the relationship.
Occasionally you might encounter rules for interpreting the magnitude of correlation values. These ranges may vary between fields but generally we can think of our Pearson’s r values as falling into the following ranges:
|0.0| to |0.3| is considered very low
|0.3| to |0.5| is considered low
|0.5| to |0.7| is considered moderate
|0.7| to |1.0| is considered high
Note that when thinking about the magnitude of the relationship, we don’t consider whether the correlation is positive or negative; we just consider the size of the number itself, which is why we use the absolute value.
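The rough labels above can be expressed as a small Python helper. Note that this encodes one possible convention: the module's ranges share their boundary values, so we have (as an assumption) assigned each boundary to the higher category.

```python
def describe_strength(r):
    """Map a Pearson's r to the rough magnitude labels from the module.
    Boundary values are assigned to the higher category; conventions vary."""
    magnitude = abs(r)  # the sign is ignored when judging strength
    if magnitude >= 0.7:
        return "high"
    if magnitude >= 0.5:
        return "moderate"
    if magnitude >= 0.3:
        return "low"
    return "very low"

print(describe_strength(0.958))   # → high  (the ice cream example)
print(describe_strength(-0.448))  # → low   (the hair example)
```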
In our ice cream sales example, we observed a correlation of 0.958. Based on our categories we can interpret this as a high correlation, or a strong positive correlation.
In our hairs on a head example, we observed a correlation of -0.448. This is a low correlation, or we can say this is a weak negative correlation.
Keep in mind that it is best to exercise some caution when using these approximate ranges. Think carefully and use logic while interpreting your correlation values.
We’ve talked about the size of correlations, but we can also talk about the significance of the correlation.
In other words, what is the likelihood of obtaining these r values for a sample if the true relationship in the population is a correlation of 0? This becomes useful when we are trying to generalize a correlation of a sample to an entire population.
For example, we have a sample of ice cream sales for one vendor on a number of days with different temperatures. How much can we generalize this to all ice cream vendors throughout the entire year?
If we want to infer how this relationship might generalize to an entire population then we can look at the level of significance for a correlation.
We can test the significance of Pearson’s r much the same way that we might test the significance of a t test. All we need to know is the degrees of freedom and the significance level we wish to test at, which is usually 0.05. For a discussion on degrees of freedom and significance levels see the module t-tests part 1.
The degrees of freedom for a correlation is N-2, where N is the number of observation pairs. We can find our calculated degrees of freedom on this table to determine the critical Pearson’s r value for our chosen significance level.
In our hairs on head example where we observed a negative correlation, we had 27 pairs of observations. This means that our df is 25. We can look up our critical value in the table at an alpha level of 0.05 and find that our critical value is 0.381 [highlight in table]. The absolute value of our Pearson’s r was 0.448, which is more extreme than the critical value of 0.381. So, we can say that this correlation reached significance at an alpha level of 0.05.
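The critical values in such a table come from the t distribution, so we can reproduce this lookup in Python. The sketch below hardcodes 2.060, the standard two-tailed critical t for df = 25 at alpha = 0.05, and uses the conversion r = t / √(t² + df); equivalently, an observed r can be converted to a t statistic with t = r·√(df / (1 − r²)).

```python
import math

df = 27 - 2    # degrees of freedom: N - 2, with N = 27 pairs
r_obs = 0.448  # absolute value of the observed Pearson's r

# Two-tailed critical t for df = 25 at alpha = 0.05 (standard t table value).
t_crit = 2.060

# Convert the critical t into a critical r.
r_crit = t_crit / math.sqrt(t_crit**2 + df)
print(round(r_crit, 3))  # → 0.381, matching the table

# Equivalently, convert the observed r into a t statistic and compare.
t_obs = r_obs * math.sqrt(df / (1 - r_obs**2))
print(round(t_obs, 2), t_obs > t_crit)
```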
Therefore, we have sufficient evidence to conclude that there is a real correlation between age and the number of hairs on one’s head. Notice that as the degrees of freedom increase, the critical value decreases. Like many of our tests, a larger N lets us be more confident about whether our observations are likely to have occurred by chance, even when the relationship is small.
Some variables in the world may have only a very weak linear relationship, but it is nonetheless a real relationship. This is why we test statistical significance, as well as the magnitude of the correlation.
Now that we’ve talked about how to calculate, test, and interpret correlations, we should discuss some important theoretical considerations.
We will first consider the range of our data, where it is possible to limit or restrict the range of one or both of our variables. To examine this issue, let's take another look at a dataset we saw previously.
In this dataset, we observed a positive correlation between the temperature and the amount of money from ice cream sales. Note that the temperature on the x axis ranges from 11.9 degrees Celsius to 25.1 degrees Celsius.
If we decide to only look at the days where the temperature is warmer, say above 20 degrees [highlight last four data points], and conduct a correlation on this range, the correlation is weaker than when we looked across all of the days; it drops from 0.958 to 0.808. By looking over a larger range of the data, rather than restricting the analysis to a smaller range, stronger relationships can emerge. If we broaden our range, we may even find that a significant correlation emerges where there previously was none! This illustrates the importance of considering whether your value of r is affected by a restriction of range in your dataset.
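A small Python sketch shows the same effect. The data below are hypothetical (not the module's dataset, so the r values differ from 0.958 and 0.808): restricting the analysis to the warmer days markedly weakens the computed correlation.

```python
import numpy as np

# Hypothetical temperature/sales data (illustrative only).
temp = np.array([12, 14, 16, 18, 20, 21, 22, 24, 25])
sales = np.array([185, 220, 300, 340, 410, 500, 430, 560, 480])

r_full = np.corrcoef(temp, sales)[0, 1]

warm = temp > 20  # restrict the range to the warmer days only
r_restricted = np.corrcoef(temp[warm], sales[warm])[0, 1]

# The full-range correlation is strong; the restricted one is much weaker.
print(round(r_full, 3), round(r_restricted, 3))
```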
[Visual: “Pearson’s r → linear relationships”; the age vs. number of hairs scatterplot with a straight regression line]
Notice how the lines that we drew through our scatterplots were all straight. There are many cases where a relationship is not linear in nature, so Pearson's r will not adequately capture those relationships.
Pearson’s r only captures linear relationships. For example, in our hairs on a head example a straight line is not the best choice for trying to fit these data.
For this reason we might see a Pearson’s r value that is quite low for these relationships, and the line will not fit well.
While we can compute a Pearson’s r for these situations, by looking at the data alone we can see that there might be other nonlinear patterns that exist in the data.
Other more complex types of analyses are better equipped to handle these relationships. We won’t go into nonlinear relationships in this module, but it’s an important idea to be aware of.
At times we will also see extreme values in our data, which are datapoints that act differently than the rest of our dataset. These can have different effects on our correlation measure. We will discuss the effects of these extreme values, called outliers, in the Regression module.
We want to note that when interpreting the meaning of a correlation, we should avoid concluding that one variable causes another.
For example, if we look at CO2 emissions and obesity rates over the last 50 years, they both may increase in a similar way, but we know that this does not mean that CO2 emissions cause obesity.
All we can definitively say with a significant Pearson’s r value is that we observe a relationship between the two variables. We know that this relationship might also include other factors.
It is important to note that for correlation, it makes no difference which variable is x and which is y. The linear correlation between ice cream sales and temperature is always exactly the same as the correlation between temperature and ice cream sales. The correlation statistic is SYMMETRIC.
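This symmetry is easy to check in code. Using the same hypothetical numbers as before (illustrative only), correlating x with y and y with x gives the same value.

```python
import numpy as np

x = np.array([12, 15, 18, 20, 23, 25])       # hypothetical temperatures
y = np.array([180, 250, 320, 410, 520, 600])  # hypothetical sales

r_xy = np.corrcoef(x, y)[0, 1]  # correlate x with y...
r_yx = np.corrcoef(y, x)[0, 1]  # ...and y with x

# The two values are the same; swapping the variables changes nothing.
print(round(r_xy, 3), round(r_yx, 3))
```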
We will discuss making predictions about related variables in the Regression module.
In this module, we discussed how to read and interpret scatterplots with various examples.
We overviewed how to identify negative and positive relationships between two variables by visualizing the data and drawing a regression line on scatterplots.
We calculated Pearson’s r using Excel and learned how to interpret this statistic. We also discussed how to find the significance of the statistic.
We mentioned some caveats in using and interpreting Pearson’s r, including restrictions on range, requirements of linearity, and the idea that correlations do not necessarily imply a causal relationship between variables of interest.