Descriptive & Inferential Statistics t-tests Part 1
Welcome
What will you learn?
Central Limit Theorem
Testing hypotheses about means
Null Hypothesis: one-tailed or two-tailed
When sigma is known: z-test
When sigma is unknown: t-test
Degrees of freedom
Step 1: Hypotheses
Step 2: Calculate the observed t-score
Step 3: Compare observed t to critical t
Step 3: Compare observed t to critical t (cont'd)
Confidence intervals and confidence limits
Confidence intervals (cont'd)
One-sample t test: Effect Size
Paired t-test
Paired t-test
Paired t-test: Example
Paired t-test: Example (cont'd)
Paired t-test: Confidence interval
Paired t-test: Effect size
Checkpoint
Summary
Hello, and welcome to the first of two modules on t-tests
In research, we often need to compare the average value for a group to a single specified value, or compare the average values for two groups. For example, when studying a group of patients undergoing a medical treatment, we may want to compare the treatment group’s average blood pressure to normal average blood pressure. Alternately, we might want to compare the average blood pressures of two groups when each group is receiving different treatments. Using t-tests allows us to make these comparisons between groups statistically.
In the following two modules “t-tests part 1 and t-tests part 2” you will learn how to conduct three types of t-test: one-sample t-tests, paired t-tests, and unpaired t-tests. You need to learn both parts to properly understand the t-tests, so please make sure to look at part 2 as well.
Part 1 focuses on the one-sample t-test, in which we’ll compare a group mean to a specified value, and also on the paired t-test, in which we’ll compare the means of two related groups. By the end of this module you will be able to do the following:
• Compare and contrast the z-test, one-sample t-test, and paired t-test, and know when to apply each of these tests
• Conduct a one-sample t-test and a paired t-test
• Calculate the confidence interval and the effect size for a one-sample or paired t-test
• Understand type I error
• Describe the pros and cons of using related samples
Before we dive into t-tests, let’s recall what we have learned about the Central Limit Theorem and about the z-test.
If we repeatedly draw samples of a certain size N from a population, and calculate the mean for each sample, we obtain a distribution of means.
This is called the sampling distribution of the mean. We’ll refer to it as the sampling distribution for short. It consists of all the possible values for the sample mean that we would expect to obtain by sampling from the population. It’s important to remember that the sampling distribution is a distribution of sample means, and not a distribution of individual data points.
If we take the mean of all the possible sample means, we should expect it to equal the mean of the whole population. In fact, this is what the Central Limit Theorem tells us. Given a population with a mean of μ, the sampling distribution will have a mean of μ sub-X-bar equal to μ.
The Central Limit Theorem also tells us the standard deviation of the sampling distribution of the mean. We call this standard deviation ‘the standard error of the mean’, or the standard error for short. We denote the standard error as sigma-sub-X-bar [with the narration, circle σ-sub-X-bar in the ‘sigma-sub-Xbar’ equation]. N is the sample size, which is the number of data points within each sample. Standard error of the mean is equal to the population standard deviation, sigma, divided by the square root of N.
The module on “Populations and Samples” already taught us that as sample size increases, the denominator of this equation gets bigger, so the standard error becomes smaller. That is, the larger the sample size, the more tightly the sample means cluster around the population mean, and the narrower the sampling distribution becomes.
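If you would like to see the Central Limit Theorem in action, here is a short optional sketch in Python (the population mean and standard deviation are made up for illustration). Drawing many samples and recording each sample's mean produces a distribution whose standard deviation is close to sigma divided by the square root of N:

```python
# Empirical check of the Central Limit Theorem: the standard deviation of
# the sampling distribution of the mean (the standard error) should be
# close to sigma / sqrt(N). Population parameters here are illustrative.
import numpy as np

rng = np.random.default_rng(seed=1)
sigma, N, n_samples = 10.0, 25, 100_000

# Draw many samples of size N and record each sample's mean
sample_means = rng.normal(loc=50.0, scale=sigma, size=(n_samples, N)).mean(axis=1)

empirical_se = sample_means.std(ddof=1)
theoretical_se = sigma / np.sqrt(N)   # 10 / 5 = 2.0
print(empirical_se, theoretical_se)
```

With 100,000 simulated samples, the empirical standard error lands very close to the theoretical value of 2.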
The Central Limit Theorem allows us to test hypotheses about means. For example, we can test whether the mean of a group differs from a specific value.
Suppose you own an auto shop that frequently services lighter Japanese cars, like Toyotas. Your shop has a hydraulic lift that can safely lift a maximum of 3384 lbs.
You recently turned away several customers who owned Cadillac vehicles, because those cars were too heavy. You're wondering whether it might be easiest to just turn away all Cadillacs, because you believe most of them are going to be too heavy. Now, it’s possible that some Cadillacs would be light enough to go on the lift, but you want to make a conclusion about the group as a whole.
In other words, our research question is: On average, do Cadillacs weigh significantly more than 3384 lbs?
The research question asks what we really want to know, but to make a statistical statement we need to translate the question into a formal hypothesis with a falsifiable prediction that we can test.
We call this prediction the null hypothesis.
There are generally two different ways to set up the null hypothesis. One is to state that the population of Cadillacs weighs 3384 lbs on average.
That is, if μ is the mean of the population of Cadillac weights, then μ = 3384.
The corresponding alternative hypothesis could be that μ≠3384. This would be a two-tailed test, meaning that we would reject the null hypothesis if the data suggest that the Cadillac weight is either significantly greater or significantly less than 3384 lbs. If we simply wonder whether the weight of Cadillacs is different from 3384 lbs but don’t care about whether they are heavier or lighter, then we should use a two-tailed test. It will allow us to detect the difference in either direction.
Now, the other way to state the null hypothesis could be to claim that the population of Cadillacs weighs no more than 3384 lbs on average. That is, μ is less than or equal to 3384.
The corresponding alternative hypothesis would be that μ is greater than 3384. In this case we are carrying out a one-tailed test. We would only reject the null hypothesis if the data suggest that the average Cadillac weighs more than 3384 lbs. We would not have to reject the null hypothesis if the average Cadillac weighs less than 3384 lbs. If we are only concerned about whether Cadillacs are heavier, and don’t care if they are lighter, then it makes more sense to use a one-tailed test.
We should set up the alternative hypothesis so that it’s consistent with our research question, and set up the opposite possibilities as the null hypothesis. This determines whether we choose to use a two-tailed or a one-tailed test.
For this example let’s do a one-tailed test, since we only need to know whether cars are too heavy for our hydraulic lift to lift safely. It is not an issue if cars are too light. In this case, the alternative hypothesis is that Cadillacs do weigh more than 3384 lbs on average. We want to see if we have enough evidence to reject the null in favour of the alternative.
Before we go on, a brief note of caution for you to consider: When conducting research, you should decide whether you will conduct a two-tailed or a one-tailed test *before* collecting or looking at your data. You should also decide the significance level α before analyzing your data. It’s scientific dishonesty and very bad practice to change your hypotheses or analyses either part way through data collection or after you’ve had a peek at your dataset. It might boost the possibility of the test results supporting your predicted outcome, but it’s not good scientific practice. We will discuss the issues associated with this later in the module.
Recall that in statistics, we denote population parameters with Greek letters, such as μ and σ, and we denote sample statistics with Roman letters, such as X_bar and s. It’s important to remember that hypothesis testing is always about the population, not the sample, even though we are relying on the sample to make inferences about the population. As such, we do not write the null hypothesis as X_bar = 3384, because what we are really interested in is not just a particular sample, but the average weight of the entire population of Cadillacs, which is μ.
How do we test our hypothesis? Recall what we learned about z-tests in the “Hypothesis Testing and Z-Tests” module.
If we know the population standard deviation σ, then we can calculate the z-score using the following equation:
In the denominator we have the standard error, which equals the population standard deviation, sigma, over the square root of the sample size, N.
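As an optional illustration, here is the z-score calculation in Python with made-up numbers: a sample mean of 103, a null value of 100, a known population σ of 15, and N = 25.

```python
# One-sample z-test sketch (the numbers are hypothetical; a z-test
# requires that the population standard deviation sigma be known).
import math

x_bar = 103.0   # sample mean
mu0   = 100.0   # value under the null hypothesis
sigma = 15.0    # known population standard deviation
N     = 25      # sample size

standard_error = sigma / math.sqrt(N)   # 15 / 5 = 3
z = (x_bar - mu0) / standard_error      # (103 - 100) / 3 = 1.0
print(z)
```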
But what if you don’t have information about the entire population? In real life, this is usually the case, which means we cannot use a z-test.
How do you test your hypothesis when you don’t know about the population standard deviation? You can use a t-test instead.
The t-score formula is very similar to the z-score formula, except that we replace the population standard deviation σ with the sample standard deviation s. Because we don’t know sigma, we use s as an ESTIMATE for σ.
Estimating sigma using s leads to some fundamental changes in the underlying statistical distribution.
Previously, we were able to test using the z-score because we knew the z distribution – it is the standard normal distribution. This means we could tell exactly how extreme an observed value was based on its z-score. Similarly, when we do a t-test we need to know how the t-scores are distributed.
However, t-scores do not have the standard normal distribution. In fact, the shape of the t-distribution differs depending on the sample size.
So, the shape of the t-distribution is a function of the degrees of freedom. “Degrees of freedom” is a term that is going to be important to a lot of our statistical tests from here on in. Technically, degrees of freedom has to do with how many independent values go into the estimation of a parameter, but you don't need to worry too much about that.
What’s most important to know now is that degrees of freedom has a lot to do with sample size, N. However, the degrees of freedom don't usually just equal N. There will be different rules coming up later that will tell us how to calculate degrees of freedom for different statistics, but in this case, the degrees of freedom are equal to the sample size, N, minus 1.
We already said that the shape of the t-distribution changes as the degrees of freedom change. So the shape of the distribution is really dependent on the sample size.
Let’s look at how this change works.
The yellow curve represents the standard normal distribution, which is the distribution of the z-scores.
The blue curve is a t-distribution associated with 2 degrees of freedom. Like the standard normal distribution, the t-distribution is symmetrical around zero and is bell-shaped.
As the degrees of freedom increase, the t-distribution becomes closer and closer to the standard normal distribution. This is because when the number of observations in a sample increases, the sample standard deviation, s, becomes a better estimate of the population standard deviation, σ.
Once the degrees of freedom are close to 30, the t-distribution is very similar to the standard normal distribution. As the sample size approaches infinity, the t-distribution becomes the standard normal distribution.
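As an optional illustration with scipy, you can watch the two-tailed critical value shrink toward the normal value of about 1.96 as the degrees of freedom grow:

```python
# The 97.5th-percentile critical value of the t-distribution approaches
# the corresponding normal critical value (about 1.96) as df increases.
from scipy import stats

z_crit = stats.norm.ppf(0.975)   # about 1.96
for df in (2, 7, 30, 1000):
    print(df, stats.t.ppf(0.975, df))
```

Notice that the critical value at df = 2 is much larger than at df = 30, and by df = 1000 it is nearly indistinguishable from the normal value.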
When we have a very large sample size and the t-distribution approaches the standard normal distribution, we could use the z-test for hypothesis testing and the z-table to approximate probabilities. However, in practice we usually use the t-test whenever σ is unknown, even with a very large sample size.
Let’s conduct a one-sample t-test step by step using the Cadillac example. Suppose you have a sample of 8 Cadillacs with the following weights (in lbs): 3694, 3984, 4044, 3992, 3647, 5367, 4302, 5879 (N = 8).
Your research question is: Do Cadillacs weigh significantly more than 3384 lbs?
Let’s look at the individual weights. Some are less than 3384 lbs, some are more. Based on our sample of 8 Cadillacs, we are trying to infer whether the population of Cadillacs typically weighs more than 3384 lbs. This is called a one-sample t-test, because we are comparing the mean of one group to a specified value. We are also doing a one-tailed test on this sample, since we only care about testing whether the cars weigh more than 3384 pounds.
Because we have chosen a one-tailed test, our null hypothesis is Ho: μ≤3384, and the alternative hypothesis is Ha: μ>3384.
Whether we are conducting a two-tailed or one-tailed test, we’ll use the same formula to calculate the t-score. We will refer to the t-score as t_observed, because it’s what we observe from the sample we have. In the formula shown here, the numerator is the difference between the sample mean and the value specified by the null hypothesis.
The denominator is the standard error, that is, the standard deviation of the sampling distribution. This formula calculates the size of the difference between the sample mean and the specified value, as measured in the units of standard error.
Let’s first calculate the sample mean and standard deviation for the 8 Cadillacs in our sample. You are familiar with how to calculate the mean.
Our sample mean is 4364 pounds. Although this is greater than 3384 lb, we still can’t be confident that the population mean of Cadillacs is significantly different from 3384 lbs. We need to calculate the standard deviation to give us a sense of how much spread there is among the data points.
For the sample standard deviation, first recall the formula for sample variance: S_squared equals the sum of the squared deviations of individual observations from their own mean, over the degrees of freedom N-1.
The sample standard deviation is the square root of the sample variance.
Now take another look at the Excel table. With these values, we can calculate t_observed, which equals 3.4. This t-value tells us that the difference between the sample mean and 3384 lbs is as large as 3.4 units of standard error, which is a very big difference! But how do we know if it’s a big enough difference for us to reject the null hypothesis and conclude that Cadillacs weigh significantly more than 3384 pounds? We need to compare our observed t-value to a critical cut-off value.
We could also have calculated these values quickly using Excel.
We’ll show you how in this short video. Take special note of the formula for standard deviation in cell B12. The second s there stands for sample; if you want population standard deviation you would replace that with P.
Here we have the weights of all our Cadillacs.
We’ll calculate the mean first, using the AVERAGE function in Excel. Type ‘=AVERAGE’ and a set of brackets. With the cursor inside the brackets, highlight all of the values to average them.
We can do the same thing to get our standard deviation. Notice when I start typing it allows me to select from a drop down list - pick STDEV.S - the S on the end means sample.
Using the same technique but changing the function to COUNT will give us our N value.
To calculate the square root, we use ‘=SQRT’. We select the cell with our N of 8 to calculate its square root.
Now we have everything we need to calculate t_observed. First we take the sample mean, and subtract mu to form the numerator. We put that in brackets so it is calculated before the division. Then, we divide by our standard error, which is the standard deviation divided by the square root of N.
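If you prefer a scripting tool to Excel, the same steps can be mirrored in a short optional Python sketch using the eight weights from our sample:

```python
# One-sample t-test on the eight Cadillac weights from the example,
# computed step by step with numpy (mirrors the Excel calculation).
import numpy as np

weights = np.array([3694, 3984, 4044, 3992, 3647, 5367, 4302, 5879], dtype=float)
mu0 = 3384.0   # the lift capacity specified by the null hypothesis

x_bar = weights.mean()                    # sample mean, about 4364 lbs
s = weights.std(ddof=1)                   # sample sd (like STDEV.S), about 815.32
N = len(weights)
t_obs = (x_bar - mu0) / (s / np.sqrt(N))  # observed t-score, about 3.40
print(x_bar, s, t_obs)
```

Note that `ddof=1` makes numpy divide by N − 1, matching the sample standard deviation formula and Excel's STDEV.S.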
We will look up the critical t-value in a t-table. To do this, we need 3 pieces of information: whether the test is one-tailed or two-tailed, our chosen significance level, α, and the degrees of freedom, df. For a one-tailed test with an α of 0.05, find ‘.05’ in the row ‘One Tailed Alpha’. Then, find the degrees of freedom for our test. In this case, the degrees of freedom equal N minus 1: 8 − 1 = 7. At the intersection of the one-tailed alpha of .05 and 7 degrees of freedom, we find the t_critical value of 1.895.
Had we chosen a two-tailed test instead, we would look for α = .05 under the row ‘two tailed alpha’, and the critical t-value would then be 2.365.
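As an optional alternative to a printed t-table, the same critical values can be looked up with scipy's t-distribution functions:

```python
# Looking up critical t-values with scipy instead of a printed t-table.
# For a one-tailed test all of alpha sits in one tail; for a two-tailed
# test each tail gets alpha / 2.
from scipy import stats

alpha, df = 0.05, 7
t_crit_one_tailed = stats.t.ppf(1 - alpha, df)       # about 1.895
t_crit_two_tailed = stats.t.ppf(1 - alpha / 2, df)   # about 2.365
print(t_crit_one_tailed, t_crit_two_tailed)
```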
Next, compare t_observed with t_critical. If the absolute value of t_observed is greater than t_critical, then we will reject the null hypothesis.
In this case, our |t_obs| is equal to 3.40, which is greater than the t_critical of 1.895. Based on this difference, we will reject the null hypothesis and conclude that, on average, Cadillacs weigh more than the 3384-pound capacity of the hydraulic lift. This means that as the owner of the shop, you might decide to turn away all Cadillacs, given that they will be too heavy to be serviced safely by your hydraulic lift.
An observed t-score that’s more extreme than our critical t-value falls into the rejection region, as demonstrated in this figure. The entire shaded region beyond t_critical is the rejection region, and its area equals α. A t_observed that’s more extreme than t_critical is further toward the tail of the distribution and is associated with an area that’s smaller than α, and it’s shaded in red. This red area represents the p-value. A small p-value means that, if the null hypothesis were true, the probability of obtaining a t-score as extreme as or more extreme than the one we observed would be very small, so we conclude that the null hypothesis is likely to be false.
To recap, if the magnitude of the observed t-score is greater than the critical t, or equivalently, if the p-value is smaller than α, then we will reject the null hypothesis.
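The same decision rule can be sketched with a p-value in Python; `t.sf` gives the tail area beyond an observed t-score (the 3.40 below is the observed value from our example):

```python
# Decision rule expressed with a p-value: the survival function t.sf
# gives the one-tailed area beyond the observed t-score.
from scipy import stats

t_obs, df, alpha = 3.40, 7, 0.05
p_one_tailed = stats.t.sf(t_obs, df)   # area in the upper tail
reject = p_one_tailed < alpha
print(p_one_tailed, reject)
```

For our example the one-tailed p-value is well below .05, so the null hypothesis is rejected, consistent with the critical-value comparison.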
You may have realized that it’s easier to reject the null hypothesis and reach statistical significance for a one-tailed test than for a two-tailed test, because the one-tailed test has a smaller critical value. In our previous example, the critical value for a one-tailed test was 1.895 but the two-tailed critical value was 2.365.
This is because for a one-tailed test, the rejection area is in only one tail of the curve, and the size of the area equals α.
For a two-tailed test, the rejection area is split between both tails in order to detect a difference in either direction. This means each tail has only half the value of α. As illustrated in the figures, the larger rejection area in the upper tail for the one-tailed test compared to the two-tailed test makes it more likely that we will detect a difference in that direction. In other words, when all other things are equal, a one-tailed test has greater power to detect a true difference than a two-tailed test does.
However, with this greater power also comes a greater risk of a statistical false alarm. A one-tailed test is more likely to falsely reject the null hypothesis when there’s actually no difference between the null and alternative hypotheses. With this trade-off in mind, we should carefully choose whether to do a one-tailed versus two-tailed test, and we should make this choice before looking at the data.
We just showed that our estimate of mean Cadillac weight is significantly higher than 3384 lbs. But do we really believe that our sample of eight cars exactly estimates the mean weight of all Cadillac cars in the population? That is, do we really think that the mean weight of ALL Cadillacs is 4364 lbs? It seems unlikely. In fact, if we took another sample of 8 Cadillacs, we'd probably find that their mean weight is slightly different from our current sample mean of 4364 pounds. But how can we find the true mean weight of the entire population of Cadillac cars? One way to estimate this population parameter is by calculating the confidence interval.
That is, given that we’ve rejected the null hypothesis, we are trying to find a range of weights that includes the true population mean Cadillac weight, with 95% certainty. We call this interval the confidence interval. The confidence level is (1 − α) × 100%, so when α = .05 we calculate a 95% confidence interval. With an α of 0.1, the confidence level would be 90%.
To calculate a confidence interval, we simply need to rearrange the t-formula to find the values of μ at the upper and lower limits of the interval.
Rearranging to solve for μ, we have:
The t-value in this equation is the two-tailed critical t-value at α = .05 for 7 degrees of freedom, which equals ±2.365. We use the two-tailed critical t-value to construct the confidence interval, because the confidence interval is symmetrical and we want to cut off an area of α/2 at each end.
The upper limit for μ is: [formula]
The lower limit for μ is: [formula]
These limits are called the confidence limits. The 95% confidence interval is: [formula]
The general formula for a 95% confidence interval is [formula]
Now we can say that we are 95% confident that the true Cadillac population mean is between approximately 3682 and 5045 lbs.
Note that our lift’s capacity, 3384 lbs, is not included in this interval. We rejected the null hypothesis that Cadillacs weigh 3384 lbs on average. If the confidence interval included the null value, then the null hypothesis would not have been rejected.
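As an optional check, the interval can be recomputed directly from the eight sample weights in Python; small rounding differences from hand calculation are expected:

```python
# 95% confidence interval for the mean Cadillac weight, computed
# directly from the sample using the two-tailed critical t-value.
import numpy as np
from scipy import stats

weights = np.array([3694, 3984, 4044, 3992, 3647, 5367, 4302, 5879], dtype=float)
x_bar, s, N = weights.mean(), weights.std(ddof=1), len(weights)

se = s / np.sqrt(N)
t_crit = stats.t.ppf(0.975, N - 1)   # two-tailed critical value, df = 7
lower = x_bar - t_crit * se
upper = x_bar + t_crit * se
print(lower, upper)
```

The key check is that the null value of 3384 lbs falls below the lower confidence limit, which is why the null hypothesis was rejected.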
So far we know that the average Cadillac weighs significantly more than 3384 lbs, but this doesn’t necessarily mean that the weight difference is meaningful in the real world. How can we measure how meaningful the difference actually is? Most people probably don’t know how much the average car weighs and therefore don’t know if the weight difference is a lot or a little, practically speaking. But we can measure the difference in standard deviations.
This measure is called the effect size, and it’s calculated using Cohen’s d: Doing this calculation allows anyone who understands a bit about statistics to know if this is a large or a small difference, even if they don’t know anything about cars.
We want to compare our sample mean with some kind of baseline, in units of standard deviation. In this example, we are looking at the size of the difference between the mean weight of Cadillacs, and the maximum weight our lift can support. In general, our baseline is usually the population mean of something, and we compare our sample mean to see how much it differs from the population mean.
The resulting formula is shown here.
We use X_bar to estimate the parameter μ1, and we use s to estimate the standard deviation - these are the mean and standard deviation of our Cadillac weight data. We would usually estimate our baseline or comparison population mean with μ2 - here, we are using the maximum lift weight as our comparison value instead.
We write d_hat because it is an estimate of d.
In this example, d_hat = (4364-3384)/815.32=1.2.
Cohen’s guidelines for interpreting effect size state that a d value of approximately 0.2 is a small effect, a d value of approximately 0.5 is a medium effect, and a d value of 0.8 or greater is a large effect. This means that our effect size value of 1.2 represents a large effect: the mean weight of Cadillacs is 1.2 standard deviations greater than the value of 3384 lbs.
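As an optional check in Python, Cohen's d for this example can be computed directly from the sample:

```python
# Cohen's d for the one-sample Cadillac test: the mean difference
# expressed in units of the sample standard deviation.
import numpy as np

weights = np.array([3694, 3984, 4044, 3992, 3647, 5367, 4302, 5879], dtype=float)
mu0 = 3384.0   # the comparison value (the lift capacity)

d_hat = (weights.mean() - mu0) / weights.std(ddof=1)   # about 1.2
print(d_hat)
```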
Remember, effect size measures the SIZE of the difference - NOT how significant the difference might be. This reminds us of a VERY important issue in all of statistics - remember that we always want to know TWO things when we do a statistical test. We want to know what the SIZE of some difference is (the effect size); and we want to know how SIGNIFICANT that difference is (that is, how sure we can be that the difference we see is not just due to chance). BOTH of these things - the significance AND the effect size - are important to know so we can properly understand our data.
So far we’ve talked about comparing a single group mean to a single specified value. But what if we need to compare the means of two groups?
As a first example, if we take two measures from the same patient in a clinical study, then those two measures are RELATED. What’s mathematically important here is that knowing one value gives you some information about the other value. For example, if a patient had very high anxiety levels when they started participating in a drug study, we can make a reasonable guess that they will probably be one of the more anxious patients after taking an experimental medication.
We would say that data from these before and after measures are correlated. We will talk about correlation in another module, but for now, just know that the values are related to each other.
A slightly different example of a type of related sample is matched samples. For instance, we can use identical twins for social psychology studies or married couples for a marriage satisfaction assessment.
On the other hand, we could have used two different groups of patients to test our anxiety medication: one group that gets the drug being tested, and another group that gets an inactive placebo pill. Those two samples would be unrelated or uncorrelated.
But there are advantages to using the same patients in a study instead of unrelated groups. For example, people’s anxiety may differ widely at the beginning of a study. If we compare one individual’s baseline anxiety to a different individual’s anxiety after taking a treatment, we can’t conclude whether any observed difference between patients is due to the treatment itself, or if the difference we see is actually due to individual differences in baseline anxiety levels. However, we can get around this by testing the same patient in both circumstances. If we take the difference score between the two observations for the same person, then the individual idiosyncrasies will be present in both observations and will be largely cancelled out. This means we can be more confident to attribute any observed differences in anxiety to the effects of the treatment itself.
In other words, using related samples instead of unrelated samples to test the difference between two means gives us better control over individual variability. We also have better control over extraneous variables that have nothing to do with the treatment of interest. These variables might otherwise obscure the effects of the treatment.
To analyze two related measures, we pair them up. We take each pair of values and calculate the difference between them. This provides us with a single value, called a difference score, for each pair of values. We can then conduct a t-test on the difference scores, instead of the pairs of original scores.
This kind of t-test is called a paired t-test. It has many other names, such as related-samples t-test, dependent-samples t-test, and so on, but these names all refer to the same test.
The math for a paired t-test is very similar to a one-sample t-test, but is different from the calculations for a t-test with unrelated independent samples. Because of this, we’ll learn about the paired t-test in this module, before we delve into the unpaired t-test in the next module.
Let’s conduct a paired t-test step by step. Suppose we want to know whether the fuel efficiency of Toyotas, measured in miles per gallon (MPG), differs for city driving compared to highway driving.
We have data from 28 Toyota models sold in 2004, and for each car, we have values for the city and highway MPG. In this case, we have two related measures, because the two values come from the same car. Since we don’t know the population standard deviation for Toyota MPG, we can conduct a related-samples t-test.
Since we want to know whether there is a difference between city and highway miles per gallon, the null hypothesis is that there is no difference between MPG in the City and MPG on the Highway. In other words, we hypothesize that the mean difference is zero.
In this formula, μ-sub-D is the mean of the population distribution of the DIFFERENCE scores, hence the subscript D.
The alternative hypothesis is Ha: μ-sub-D ≠ 0. If we conduct a one-tailed test to evaluate whether MPG is greater on highways than in cities, then we can set the null hypothesis as μ-sub-D ≤ 0, and the alternative hypothesis as μ-sub-D > 0. Notice how the paired t-test is similar to a one-sample t-test. In a one-sample t-test we compared the mean of a group to a specified value. Here we are comparing the mean of the difference scores to zero.
The t-formula for related samples is:
Notice that this formula has the same form as the formula for a one-sample t-test, except that here we are dealing with difference scores. The logic is very similar to a one-sample t-test, too. We are calculating how big the mean difference is in terms of standard error. In the numerator we compare the sample mean of the difference scores, D_bar, with the population mean, μ-sub-D. μ-sub-D is the expected mean difference between the two groups and is equal to zero under the null hypothesis, so the numerator simply reduces to D_bar. The denominator, s-sub-D_bar, is the standard error of the difference scores. It equals the standard deviation of the difference scores over the square root of N. It’s important to note that the N for a paired t-test is the number of pairs, that is, the number of difference scores. The N is 28 in this case.
The t-score can be easily computed in Excel. We’ll arrange the city and highway data points in two separate columns.
First, we need to calculate the difference scores. Simply take each pair of related scores and calculate the difference between them. Do this calculation for the first cell, click on the bottom-right corner of the cell, then drag and hold it down to highlight all 28 cells. Release the cursor and Excel will automatically fill out all 28 differences. For each car, we subtract MPG in the City from MPG on the Highway. Note that if we subtract Highway from City instead, we will still reach the same conclusion, since the rest of the calculations we’ll do only use the difference scores.
We calculate the mean of the differences, D_bar, with the AVERAGE function. The sample standard deviation is calculated with the STDEV.S function. Here, “S” is short for “sample.” N can be calculated with the COUNT function, or you can simply type 28, since we have 28 pairs of values here. The observed t-score equals D_bar divided by the standard deviation over the square root of N. The Excel function for the square root is SQRT. We get the result 9.71.
Next, we need to determine whether the observed t-score is big enough to be significant for this sample size. Let’s look up the critical t-value in the t-table. Bear in mind that N is the number of pairs, so the degrees of freedom is 28 minus 1, which is 27. Given a two-tailed test with α=.05, t_critical = 2.052. If we had done a one-tailed test, t_critical would be 1.703.
Our observed t-score of 9.71 is much greater than either critical t-value, so we will reject the null hypothesis and conclude that there is indeed a difference between the highway mileage and city mileage. Because 9.71 is a large positive value this indicates a positive difference, so we can conclude that Toyotas get better mileage on highways than in cities.
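If you want to experiment with a paired t-test in Python, here is an optional sketch. Note that the 28-car dataset is not reproduced in this transcript, so the five MPG pairs below are purely hypothetical; the sketch also shows that scipy's built-in paired test matches the manual difference-score calculation:

```python
# Paired t-test as a one-sample t-test on difference scores.
# The five MPG pairs below are hypothetical illustration data.
import numpy as np
from scipy import stats

city    = np.array([22.0, 25.0, 30.0, 21.0, 27.0])
highway = np.array([30.0, 31.0, 38.0, 26.0, 35.0])

# Manual calculation: t = D_bar / (s_D / sqrt(N))
d = highway - city
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))

# scipy's paired t-test produces the same statistic
result = stats.ttest_rel(highway, city)
print(t_manual, result.statistic, result.pvalue)
```

The design choice here mirrors the module: subtracting city from highway makes a positive t-score mean better highway mileage.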
We can construct a confidence interval to estimate the true population mean for the difference between Toyota MPG on highways and in cities. This will give us an estimate of how different city fuel efficiency and highway fuel efficiency are for Toyotas.
Remember to use the two-tailed critical t-value to calculate the confidence interval.
Clearly, this 95% confidence interval does not include 0. The value of the population mean difference would be 0 if the null hypothesis were true. The true population mean difference is likely to be between 5 and 7.66 MPG.
We can calculate the effect size of the observed difference as:
Note that in the denominator we do not use the standard deviation of the difference scores. Instead, we use s-sub-X1, the standard deviation of the first group.
How do we determine which group is the first group? It depends on what we want to compare the difference with. If we want to evaluate how much more fuel-efficient Toyotas are when driven on a highway than in a city, then we should compare the MPG difference with the standard deviation of the city MPG, and treat the city MPG as the first group.
In this case, d_hat = D_bar / StDev(MPG_City) = 6.32/9.32 = 0.68. Alternatively, we may want to evaluate how much fuel efficiency is lost when driving in the city compared to on a highway. Then we can use the standard deviation of the highway MPG instead. If we do this, we’ll obtain a slightly different effect size.
In this particular example, it’s not obvious which group should be used for calculating s-sub-X1. However, in other examples of paired t-tests, it’s more obvious and intuitive to choose one group over the other. For instance, if the two related samples are patients’ pre-treatment and post-treatment scores, then we should use the pre-treatment scores as the first group. Similarly, for a related-samples design with a control or baseline group and an experimental group, we should use the control group as the first group.
In general, it’s meaningful to use the standard deviation of the group that’s closer to the baseline, because this allows us to express the size of the effect in the same units as the original measure. By contrast, the standard deviation of the difference scores themselves doesn’t carry much meaning for evaluating the effect size.
A quick note before we move on: In the numerator of the Cohen’s d formula we have the mean difference, D_bar. If you don’t know D_bar, you can simply calculate the mean of each group and then take the difference between the means.
Mathematically, D_bar = X1_bar – X2_bar. In other words, the mean of the differences equals the difference of the means.
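A quick numerical check of this identity, with arbitrary made-up values:

```python
# The mean of the differences equals the difference of the means.
import numpy as np

x1 = np.array([5.0, 8.0, 7.0, 10.0])
x2 = np.array([3.0, 6.0, 2.0, 9.0])

d_bar = (x1 - x2).mean()
print(d_bar, x1.mean() - x2.mean())
```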
This module focused on the one-sample t-test and the two-sample t-test with related samples.
In the beginning of the module, we reviewed the Central Limit Theorem, and compared and contrasted the t-test with the z-test. We learned that when the population standard deviation is unknown, we should choose a t-test over a z-test. We then learned how to conduct a one-sample t-test, where you compare a single sample to another value that you specify. Then we talked about tests where you compare two samples to each other to see whether they are the same or different.
The first decision we had to make was whether the two samples of interest were related, in which case we would need to do a paired t-test. This would test whether the difference between the two samples is zero or not. In each test, we also learned how to choose a one-tailed or a two-tailed test, state the null hypothesis and alternative hypothesis, calculate the t-statistic, compare it to a critical value, and draw a conclusion.
We defined and calculated confidence intervals, and effect size.
In the next module, we will learn about two-sample t-tests with unrelated samples, as well as the conditions necessary to conduct any t-test.
Descriptive & Inferential Statistics t-tests part 1 of 2