This module introduces simple regression, a technique that uses the linear relationship between two variables to predict one variable from the other.
By the end of this module you will be able to:
- Differentiate between regression and correlation
- Read and use a simple regression equation to calculate a predicted value
- Calculate regression coefficients and r-squared using Excel
- Test whether the regression slope is statistically different from zero using Excel
- Identify the underlying assumptions of simple regression
Correlation and regression are closely linked and we recommend that you work through the two modules together, starting with the module on correlation.
Correlation and regression both deal with the linear relationship between two variables. Correlation focuses on the relationship itself, while regression focuses on using one variable to predict the other. Regression is a powerful tool with a wide range of applications in business, government policy making, medicine, sports, science, and many other endeavors.
For example, we would expect the selling price of canned soup and the number of units sold to be correlated: if we lower the price, we should sell more cans of soup. However, if you were a business owner, you would want even more information. You’d like to be able to predict how many more cans of soup you would sell for every 25 cents you lowered the price.
In a different situation, if you were a government official, you might want to know about the relationship between government spending on education and unemployment rates. Would increased education spending decrease unemployment? If so, by how much? Regression is a technique that can address these types of predictive questions.
Regression will make a lot more sense if we first do a quick review of the Cartesian coordinate system, drawing lines in Cartesian space, and the algebra of straight lines.
This graph represents the Cartesian coordinate system. Using it, we can uniquely specify a location with a pair of values, one horizontal (x) and one vertical (y). We write them like this:
So the red dot shown here is at (1, 1) and the green dot is at (-2, -3).
With two points, we can make a line. Imagine placing the edge of a ruler so that it just touches both points then drawing along the edge,
like so.
We can describe the resulting line between these two points using…
an equation: y = bx + a. In this equation, y and x are the Cartesian coordinates, b is the slope of our line, and a is our y intercept.
Remember that slope is defined as “rise over run”. The y intercept is the point where the line crosses the y axis. In other words, the intercept is the y value when x = 0.
The equation for our line is y = (4/3)x - 1/3.
Importantly, in this equation, y and x are acting as variables. So, by inputting x, we can solve to find y. If we know the x value, we can figure out the corresponding y value. This is true for ANY possible x value, once we know the equation for a given line. Let’s input an x value of 3 and solve for y. Multiplying 4/3 by 3 gives 4, and subtracting 1/3 leaves 3 ⅔. So, we should find that when x equals 3, y will equal 3 ⅔.
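Written out, the substitution looks like this:

```latex
y = \frac{4}{3}(3) - \frac{1}{3} = 4 - \frac{1}{3} = \frac{11}{3} = 3\tfrac{2}{3}
```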
Examining the graph, we see that this is exactly where our line passes. We can see x equals 3 and y equals 3 ⅔, shown with the blue point.
Well, in regression we are looking to characterize the linear relationship between an independent variable, X, and a dependent variable, Y, both of which are measured on a ratio scale.
Most often, we are trying to predict Y from X, so in this context, X is the predictor variable and Y is the response variable [label appears]. We’ll stick with those terms going forward. We can use a regression line to describe the relationship between variables and to make predictions by mapping these variables onto the X and Y axes of a Cartesian plane.
This line is defined by the regression equation that we just discussed, where the two important parts of the equation are the slope and the intercept. We often call this a regression model, because we are using a line to mathematically model the relationship between the two variables. It is called a linear model because we are using a straight line to describe the relationship between X and Y, rather than a curve or some other more complicated relationship. We often choose a linear model because it is the simplest, and because it is usually very powerful.
The typical goal of a linear model is to make predictions about the entire population, rather than just describe the relationship within a specific sample. The regression line we get from our sample gives us an estimate for our population slope and intercept. Now, given any NEW value of X, we can make a prediction for what the y value will be.
Let’s use a concrete example to demonstrate. Imagine that you are a large scale contractor who builds residential homes in Canada. If you put your houses on a larger lot, they sell for more money, but you also sell fewer houses because you only have so much land available to build on. You’re wondering how big you should make each lot to get the best return on your investment. In other words, you want to predict selling price of the house based on square footage of the lot.
To do that, we would assign square footage as our x variable and selling price as our y variable, and then go collect some sample data.
This plot shows actual houses from the city of Windsor, Ontario in 1987. They represent a sample from the population of all houses in Canada. For each house in this sample, we know both the selling price and the lot size. So, if we plot a regression line with this sample, we can then use that line to predict the selling price of the houses you plan to build.
This is because with a known X value, we can look at the line, and see what the corresponding y value is. That’s what we mean by predicting the value of y. The hat symbol on top of Y in our formula indicates that it is the predicted value of Y.
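In our notation, with slope b and intercept a, the prediction formula is:

```latex
\hat{y} = bx + a
```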
This gives you a good estimate of a value that you otherwise couldn’t have known ahead of time. Lot size seems like a sensible predictor of price. We do know that people pay more for large properties. If a given variable is a good predictor, the regression line will be either positively or negatively sloped. However, if there’s no relationship, the regression line would be flat, and the slope would be zero. A non-zero slope means that knowing about x does help us to predict y. In this case, if we plot the regression line for our sample of houses, we’ll see that lot size does help predict selling price: there’s a positive slope to the line.
So, if we are going to use a line to predict a y value based on an x value, how do we know where that line should be? What should the slope be, and what should the intercept be? First, we need to estimate this from our sample. We want a line that passes as close as possible to all the data points from our sample. Remember, we are trying to find the line that will predict y values as well as possible. Statisticians have shown that the best way to do this is to plot a line that minimizes the differences between the actual y values from our sample and the y values that our line predicts. We call this the least squares line. Technically, we calculate the least squares line by minimizing a quantity called the sum of the squared residuals of Y.
A residual is the difference between the value our line predicts for Y, called Y hat, and the actual or observed value of Y in our dataset, and we calculate it for each observation in our sample. That is, for a particular observation, we see what the y value ACTUALLY is. Then, we calculate y hat from our equation, using the x value. This lets us see what the difference is between the observed y and the calculated y value.
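In symbols, for each observation i, the residual is the observed y minus the predicted y, and the least squares line is the one that minimizes the sum of the squared residuals:

```latex
e_i = y_i - \hat{y}_i, \qquad \text{minimize } \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
```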
In this graph, the residual is the dashed line between the data points and the regression line. You can also think about this as the error of our prediction, which captures how far off our predicted value was from the actual value.
The gray boxes represent the squared value of that residual.
Just like when calculating a variance, we square these residuals to get rid of the sign, since residuals can be positive (above the line) or negative (below the line).
The line right now minimizes the sum of the squared residuals, shown as “Sums of Squares” at the top. Watch what happens as we change the slope of the line.
The total area of the grey boxes gets larger, as shown by the changing sums of squares value.
Remember, we are trying to fit a line to our data that MINIMIZES the sum of these squared residuals.
Now let’s reset the slope of our least squares line back to its original value. If we change the intercept instead, a similar thing happens: the further we move away from our initial value, the larger the sum of squares value becomes.
Now let’s change to a different data set. Notice our SS is larger with this dataset. Let's reset to the least squares line. This shows we need a different slope and intercept for each dataset to find the line of best fit.
Let’s use an example from our cars database to help us understand regression.
In this example, we’ll use engine size as our predictor variable (x) and horsepower as our response variable (y). If we know how big an engine is, can we predict what its horsepower will be? It seems that these two variables should be related. If precise testing for engine horsepower takes time and money, a quick way to estimate horsepower using engine size would be helpful. That’s what regression allows us to do. We can take a sample where we measure both engine size and horsepower, and figure out a general relationship between the two. Then we could use this on any further examples where we know engine size of a vehicle and want to make a reasonable guess as to what the horsepower will be.
Let’s put up a scatter plot of these two values from our sample, with engine size on the x axis and horsepower on the y. Each dot represents one of the 428 engines in our cars database.
The actual calculations for regression are fairly involved and taking the time to walk through them would probably not be time well spent. So, for now, let’s just quickly get Excel to do that for us.
We’ve laid out our x and our y variables in two columns. Notice that I’ve hidden some of the values so I could get it all on one screen. To run our regression, we go to the data tab at the top and select data analysis. Scroll down to ‘regression’ and click OK. Now we’ll enter the range where the y values are stored, and the range of the x values. We’ll click ‘labels’ to indicate the top row includes column labels instead of data, and hit OK. Excel creates the regression output in a new sheet. I’ll expand the column widths so we can see everything.
There’s a lot of information here but for now just focus on the slope and the intercept. In your output file, these are also labelled as the regression coefficients. So, the intercept coefficient is where the line crosses the y axis, and the engine size coefficient gives the slope of our line.
So our least squares line is y = 51x + 53.
Let’s plot the line using that formula. It tells us that for every additional litre of engine size, we add about 51 horsepower. For example, we could use this equation to predict the horsepower of an engine that had a 5.7 litre capacity:
51 times 5.7 is 290.7, and adding 53 gives 343.7 horsepower.
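As an aside, if you’d like to double-check these numbers outside of Excel, here is a minimal sketch in Python using scipy’s linregress. The data values below are hypothetical stand-ins for the two spreadsheet columns, so the coefficients will not match ours exactly:

```python
from scipy import stats

# Hypothetical stand-ins for the engine size (x) and horsepower (y)
# columns; the real dataset has 428 engines.
engine_size = [1.6, 2.0, 2.4, 3.0, 3.5, 4.6, 5.7]   # litres
horsepower = [110, 140, 160, 210, 260, 300, 345]

# linregress fits the least squares line and also reports the r value
# and the p-value for the test that the slope is zero
fit = stats.linregress(engine_size, horsepower)
print("slope:", fit.slope)
print("intercept:", fit.intercept)
print("R squared:", fit.rvalue ** 2)
print("p-value for the slope:", fit.pvalue)

# Predicted horsepower for a 5.7 litre engine
print("prediction at 5.7 L:", fit.slope * 5.7 + fit.intercept)
```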
We have already mentioned that the values we calculated for a (the intercept) and b (the slope) are sample values. They would be different, for example, if we took engines from 2005 or 2006 instead of the 2004 dataset we’re using here.
We know from our module on populations and sampling that these values are sample statistics for the population parameters that we want to estimate. So, we need some way to represent this uncertainty, and a test statistic to tell us whether our sample values are extreme enough that they are unlikely to have occurred by chance.
Although we have two parameters, slope and intercept, we are usually only interested in testing a hypothesis about our slope [cross out intercept] because it is the slope that tells us about the relationship between our variables. Further, the value of y when x is zero is often nonsensical, uninteresting, or both. In our data, the y-intercept would represent a 0 litre engine, something that clearly doesn’t make sense.
The real question we want answered is whether our slope differs from zero, because that tells us if our predictor actually gives us information about our response variable. Our question is about the population parameter for slope. Instead of b, the sample statistic, we will use beta to represent the population parameter for slope.
The null hypothesis for this test is that Beta = 0, and the alternative hypothesis is that Beta does not equal 0.
We can test this hypothesis using a t-test.
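For reference, the test statistic Excel computes takes the usual form, where SE_b is the standard error of the slope and the test has n - 2 degrees of freedom:

```latex
t = \frac{b - 0}{SE_b}, \qquad df = n - 2
```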
We won’t work through that calculation by hand; our regression output from Excel already includes the t statistic, a p-value, and a confidence interval. The p-value for our test for non-zero slope is expressed in scientific notation. This means that, written out as a decimal, there would be 90 zeroes and then a 1. It is extremely unlikely that we observed this slope simply by chance, and we can reject the null hypothesis. Our slope is not equal to zero.
Let’s look at one more number from this output: r-squared. This is closely related to the r value we encountered in the correlation module.
A Pearson’s r value tells us about the strength of the relationship between two variables; that is, how WELL the regression line fits all the data points.
In correlation, when we take an r value and square it, it tells us what proportion of the total variance is explained by the linear relationship. Similarly, the r squared value from the regression table, which we designate with a capital R squared, tells us what proportion of the variance in the response the regression equation accounts for. So, in this case, engine size explains about 62% of the variability in horsepower. We get a good amount of information about horsepower just from knowing engine size.
If we took the square root of r-squared, we’d have r, the correlation between engine size and horsepower. That would be .79, which is a strong correlation. Remember that the value for the slope from the regression equation was positive, so this is a strong positive correlation. Both our correlation and regression analyses indicate that as engine size increases, so does horsepower. Remember, correlation and regression are really two different approaches to characterizing the relationship between two variables. As we’ve already stated, regression is the approach that is better suited to prediction.
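In our example:

```latex
r = \sqrt{R^2} = \sqrt{0.62} \approx 0.79
```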
For a regression model to be valid, we must satisfy a few assumptions.
Firstly, there must be a linear relationship between the x and y variables.
Look at the scatter plots and summary statistics above; r indicates the correlation. Notice that these statistics stay largely the same even though the shape of the underlying data changes. Only dataset A shows a linear relationship between x and y.
The second assumption necessary for the regression model to be valid is that the distribution of the residuals must be normal.
The plot you are seeing here is called a normal quantile plot of the residuals. It plots the quantiles we would expect from a normal distribution along the x axis, and the size of the residuals on the y axis.
When we say that the residuals should be normally distributed, part of what we mean is that the model should not fit one part of the data set well (leading to small residuals) while fitting another part poorly (with big residuals). For example, it would violate the assumption if low values of x had y hat values that fit well, but high values of x had y hat values that fit much more poorly. Ideally, we’d expect most of the data points to be pretty close to the predicted value, and no particular pattern in which ones are close and which are further away.
We won’t go into all the details of the normal quantile plot here, but know that this is a tool you can seek out to test your assumptions.
The third assumption necessary for the regression model to be valid is homoscedasticity. This means that the variance should be the same across all values of the predictor variable.
We can test for this by creating a residual plot. In this plot, we simply line up the residual values from 1 to 428, corresponding to each observation in the dataset (that’s what ‘index’ means on the x axis), and show each residual value on the y axis.
Because the residuals are spread approximately evenly [lines appear], our sample shows homoscedasticity.
If there were a pattern where the residuals changed, getting either further apart or closer together, then the sample would not have equal variance across all values of the predictor.
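Here is one way you might construct such a residual plot in Python. This is a minimal sketch, and the data values are hypothetical stand-ins for our engine sample:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-ins for engine size (litres) and horsepower
x = np.array([1.6, 2.0, 2.4, 3.0, 3.5, 4.6, 5.7])
y = np.array([110, 140, 160, 210, 260, 300, 345])

# Fit the least squares line: polyfit returns (slope, intercept)
b, a = np.polyfit(x, y, 1)
residuals = y - (b * x + a)          # observed y minus predicted y

# Plot each residual against its observation number (the index)
plt.scatter(np.arange(1, len(residuals) + 1), residuals)
plt.axhline(0, color="gray", linestyle="--")
plt.xlabel("Index")
plt.ylabel("Residual")
plt.show()
```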
The fourth assumption necessary for the regression model to be valid regards outliers. An outlier is a datapoint that unduly influences the slope of the regression line.
In this video, the green line represents our least squares line. Watch as we pull a single data point far away from the others.
Notice that when this data point moved, our regression line was pulled towards the outlier. Because that single point has a large influence on the slope of the line, we should carefully consider whether it really belongs with the rest of our data.
In addition to visually inspecting the data, you should know that we can also use a statistic called Cook’s distance to detect potential outliers, although we won’t cover it in detail in this module.
Having now talked about some of the assumptions necessary for a linear regression model to be valid, there is one final caution to bear in mind when using regression to make predictions.
In statistics, making predictions about values outside the observed range is called extrapolation. However, these extrapolated predictions are not always meaningful or useful. For example, an engine cannot have a negative engine size, but we can still calculate a predicted horsepower value for it. Given this, we need to think carefully about the range of values over which a relationship would make sense. Predictions about the relationship between variables outside that range will potentially be very inaccurate.
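For instance, plugging an impossible engine size of -1 litres into our equation still produces a number:

```latex
\hat{y} = 51(-1) + 53 = 2 \text{ horsepower}
```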
Here this comic’s author uses days to predict number of husbands, then extrapolates wildly.
[Comic provided under Creative Commons Attribution-NonCommercial 2.5 License courtesy of xkcd.com]
So far this module has discussed simple regression, which is regression using only one predictor variable. However, it is possible to have more than one predictor variable and use multiple regression.
For example, we might want to predict a vehicle’s horsepower by considering not just engine size, but also considering whether or not the vehicle is a sports car. In this case, we would have two predictors. Engine size is a scale variable, and car type would be a nominal variable.
We won’t cover the details of multiple regression here, but a number of good resources and tutorials exist on the web should you choose to pursue this further.
In this module we showed that, like correlation, regression characterizes the relationship between two variables. We also showed that regression focuses on prediction.
Using Excel, we obtained regression coefficients and used them to form a regression equation. We then used the equation to predict a response value using a predictor value.
We used the Excel output to test whether the slope of the regression line was non-zero, and used r-squared to understand how much variance our regression explained.
Finally we explored a number of assumptions that need to be met for regression to be valid.