Hello, and welcome to the second of two modules on z-tests.
By the end of this module you will be able to:
- Define a statistical hypothesis
- Understand, define, and generate null and research hypotheses
- Understand and carry out hypothesis tests using the one-sample z-test to compare a sample mean to a population with known sigma
The idea of formulating and testing hypotheses is at the core of scientific inquiry, engineering, business planning, and even everyday decisions like whether to go to the beach or do homework next weekend before your big exam.
The Oxford dictionary defines a hypothesis as: a supposition or proposed explanation made on the basis of limited evidence as a starting point for further investigation.
Hypothesis testing is all about seeing if those suppositions line up with what we can observe in the world.
An astrophysicist might think about how the stars should behave.
Then she uses a telescope to observe the stars to see if that’s what actually happens.
You might hypothesize that you are ready for that big exam, then you observe your test score and are hopefully proven right!
In statistics, however, we have a special, narrower meaning for hypothesis.
A hypothesis is a testable assumption about a number that describes or characterizes a population. We call that number a parameter. So, our astrophysicist might hypothesize in the more general sense that stars are really big, but to test a statistical hypothesis, she would have to assign numbers - for example, is the average diameter of all stars larger than 1000 km? In that hypothesis, the average diameter of all stars is the parameter.
For this and the remainder of the modules, unless we say otherwise, when we say hypothesis we mean a statistical hypothesis.
Although it may seem odd at first, in statistics we provide evidence for our hypotheses by disproving another one. There is a lot of philosophy and logic surrounding this, and we won't explore it in detail. The key point is that positive evidence is not as convincing as negative evidence. I think we can offer a simple example that should make this clear.
Imagine I set out to prove that all swans have white feathers. I could observe hundreds, even thousands, of swans in my lifetime, all with white feathers. But although this is consistent with my theory, it doesn't confirm it, because I can't say for certain that there aren't other swans out there with different coloured feathers.
On the other hand, finding a single black swan would disconfirm my claim. It’s this strong disconfirming evidence that we are looking for in statistical testing.
As we said, to find this disconfirming evidence we consider the evidence in light of two different hypotheses: the null hypothesis and our research hypothesis, which is also sometimes called the alternative hypothesis because it is an alternative to the null.
Let’s talk about the null hypothesis first. Null means nothing, zero, or empty. So when we propose the null hypothesis, we mean there is no difference, no effect, or nothing happened in whatever situation or comparison we consider.
The research hypothesis states just the opposite - there is some difference, there is an effect, or something happened.
Thus these two hypotheses form a contradictory pair; they can never BOTH be true.
Because of this, disproving one of these hypotheses tips the scales in favour of the other.
When hypothesis testing, we assume the null hypothesis to be true. We then weigh the evidence against the null to determine if we can reject it in favour of our research hypothesis. Finally, we make a decision to either reject the null hypothesis or fail to reject the null. Notice I said fail to reject the null, and not accept the null. Why is that?
To see why this makes sense, imagine the defendant in a court case. There are only two possible results for her: guilty or not guilty. We often hear that you are presumed innocent until proven guilty in Canada - but the judge always pronounces you guilty or not guilty, and her formal plea takes the same form. So really, not guilty until proven guilty would be more accurate.
A not-guilty plea therefore is much like the null hypothesis - it's assumed until proven otherwise. Whatever she stands accused of is like the alternative hypothesis. The court will weigh the evidence against the not-guilty verdict to determine if it can be rejected in favour of a guilty verdict. The judge or jury then decides guilty or not guilty.
Accepting the null hypothesis therefore would be like pronouncing the defendant innocent. But one reason we can’t do that is because we don’t know if she actually is innocent. We only know that we don’t have enough evidence to prove beyond a reasonable doubt that she’s guilty. If we had more evidence, more time, our decision might change.
The situation is exactly parallel with hypothesis testing. Failing to find a difference doesn't prove there is no difference - only that we didn't have enough evidence to be sure. So, we never accept the null hypothesis; we can only reject it or fail to reject it.
In a criminal court case, we don’t decide whether we think it is a little more likely that the defendant is guilty or innocent, and then make that conclusion. We think that we need to have a great deal of proof to be able to pronounce someone guilty. Similarly in statistics, we aren’t deciding which of the null and alternative hypotheses is a little more likely. We require a great deal of certainty to reject the null.
Just like in a court case, errors can be made during hypothesis testing. We can classify these errors into two types, Type I and Type II.
A Type I error is thinking you found something that's not really there - a false positive. In a court case, it would be like finding someone guilty who wasn't.
A Type II error is the opposite, failing to find something that is really there - a false negative. In a court case that would be like finding someone not guilty when they actually did the crime.
We use the Greek letter alpha to represent the chance of a Type I error and beta to represent the chance of a Type II error. But don't worry too much about beta; we'll only be focusing on alpha for these modules.
That should give us a big picture view of what hypothesis testing is about. But remember we said that statistical hypotheses are about numbers. So we need to measure and express these mathematically.
To do that, we use some standardized notation.
We represent the null hypothesis using H0 and the research hypothesis as H1 or Ha for alternative.
Exactly how we express them after that will depend on the exact test we're doing. But expressing these is the first step in the process. To make this more concrete, let's show a quick preview of the steps, then follow them through using some actual data and our first inferential statistic:
The z test.
1. State your null and research hypotheses
2. Set your alpha criterion
3. Collect your data
4. Calculate your test statistic
5. Decide whether you reject your null hypothesis
6. State your conclusion
7. Assess the size and practical importance of any significant differences
By now we've learned a fair amount about the z, or normal, density distribution. We learned in the z-scores module that many variables in nature follow this pattern, and it was hinted that it could be important for many of our statistics. We also learned how to calculate z-scores and calculate probabilities based on the percentiles of the standard normal distribution. In the populations, samples and distributions module, we learned that the sampling distribution of the mean gets more and more like the normal curve as the sample size increases, thanks to the Central Limit Theorem. With the one-sample z-test we'll start to see how some of these pieces fit together.
The z test is the first of a series of inferential statistics you will learn. Unlike descriptive statistics, which are used to summarize or simplify data, most inferential statistics allow you to use data to answer questions about population parameters by applying hypothesis testing to samples.
The z-test tells us whether a sample comes from a particular population by comparing means, but it requires us to know what the whole population is like. This is an unusual situation; we aren't usually in a position to actually know what the whole population is like.
Usually, we only know about the sample, and then want to make inferences about the population.
Inferential statistics all come with some assumptions. These assumptions tell you when the conclusions drawn using the test will be valid, and when they may not be.
The one sample z test assumes random sampling. This means that every member of the population has an equal chance of being in the sample.
Our data must be measured on either an interval or a ratio scale.
Finally, the sample must be drawn from a normal population with a known mean (mu) and standard deviation (sigma)
Let’s use the data set of all cars produced in 2004.
Imagine that you are an engineer at Toyota, getting ready to switch production over from the 2004 models to the 2005 models. Due to a shipping error, 9 random motors have arrived at your plant without labels, but you really need to install them today! You can tell what car they go in and their horsepower by looking at them, but you don't know if they are for 2004 or 2005 models.
Fortunately, you have the master blueprints, so you know that the average horsepower of Toyota engines in 2004 was 180.7, that the standard deviation was 54.2, and that horsepower was normally distributed.
The 9 motors have the following horsepower ratings: 230, 130, 225, 157, 138, 157, 142, 240, 130.
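As a quick check before we begin, here is a minimal Python sketch (the variable names are just illustrative) that computes the sample mean we will later compare to the population mean:

```python
# Horsepower ratings of the 9 unlabeled motors (from the example above)
hp = [230, 130, 225, 157, 138, 157, 142, 240, 130]

n = len(hp)               # sample size, n = 9
xbar = sum(hp) / n        # sample mean
print(n, round(xbar, 2))  # 9 172.11
```

We'll come back to this sample mean of roughly 172.1 horsepower when we calculate the test statistic.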
Let’s use our z-test to see if we can solve this imaginary mix up! So, our population of interest is all Toyota cars made in 2004.
First we check our assumptions: we'll assume that the engines are a random sample, we know the population mean and standard deviation for horsepower and that it is normally distributed, and horsepower is measured on a ratio scale. All three assumptions are met.
We begin the process of hypothesis testing by stating our hypotheses mathematically.
In our example, we are trying to determine if a sample of 9 engines comes from the population of 2004 Toyota cars using their horsepower. We are going to do that by comparing the mean of our sample, which we’ll denote as Xbar to the mean of our population, which we’ll show as mu.
We always begin by specifying our null hypothesis. And the null hypothesis for a given type of test is always the same. To specify our null hypothesis we need to think about what no effect or no difference means in this context. A z-test tells us if a sample comes from a population. And we know from the central limit theorem that our best guess for a sample mean is the population mean, even though there will be variability from sample to sample.
So no difference could be stated as: H0: Xbar is equal to mu
And that is exactly the null hypothesis for any one sample z test. The mean of the sample is the same as the mean of the population.
What about our research hypothesis? It should specify that the null is not true. There are two different ways the null could fail to be true when you are comparing two numbers.
The sample mean could be higher than mu (Xbar greater than mu in this example), or
the sample mean could be lower than mu (Xbar less than mu in this example).
Step 1: State your hypotheses: One or Two tailed?
When comparing 2 means, as our z-test does, we have 3 ways to look for those 2 possible differences. The first two are directional, the last one is non-directional. So one, we could check if
xBar is higher than mu. Xbar greater than mu
We could check if xBar is lower than mu or xbar less than mu
Or we could check both options at the same time. xbar not equal to mu
The symbol in the middle means ‘not equal to’ so we’re saying we want to detect sample means that are either higher or lower than the population mean
We call the first two options a one-tailed test and
the last option a two-tailed test. We’ll go over why in more detail in Step 5 but it has to do with which ‘tail’ or tails we look to find our critical values of z.
Unlike the null, the research hypothesis changes depending on what you want to know. Some scientists suggest you should always use the third option; that way you are covering all the bases. So, why would you not do that? Well, there are some practical considerations. For example, if you were testing a new medicine, would you really care if it made your patients slightly worse? No, you wouldn't use the drug unless it improved their health. We'll talk about another reason when we look at critical values.
Whether our hypothesis is one- or two-tailed is an important question that needs to be decided before moving on. In our example, we want to do a two-tailed test. We don't know whether the sample motors should have more or less horsepower; we really want to know if they differ at all from the 2004 motors.
This is the beginning of what we've been building up to for several modules now. Many inferential statistics, and all of the ones in this series of modules, are about the relationship between two important probabilities: alpha and the p-value. We will talk about p-values in detail in steps 4 and 5.
The alpha criterion is also sometimes called your alpha level or significance level. But what is it?
First, it’s a probability, so it’s a number between 0 and 1 that indicates the likelihood of something occurring.
Alpha is also the decision point or criterion value that you set to determine if your results are statistically significant. When we say something is statistically significant, we are saying that we believe there is sufficient evidence to reject the null hypothesis - it wasn’t just sampling variation - there was some effect or difference. For that reason we also say it defines a rejection region. We’ll see that very soon.
We already mentioned that it’s also the probability of finding something when it’s not actually there. As an experimenter, we can choose the alpha levels that we are comfortable with, simultaneously setting the threshold of false discovery and significance.
An alpha of .05 for example means a 5 in 100 chance of Type I error and a 5% significance level
How much risk of error you are willing to take really depends on what test you are doing and why.
Falsely diagnosing someone with a terminal disease is very serious; you would want to be very careful and set a very low alpha, for example. But there are tradeoffs that we'll discuss shortly.
In the social sciences, an alpha of .05 is considered acceptable and is the conventional value.
We’ll go ahead and use that value for our example.
We’ll touch on this again in steps 4 and 5.
As we discussed in the intro module, data can come from lots of places.
· We can do experiments,
· gather data in surveys,
· or even access existing databases, like the cars data in our example. All of those are worthy of an entire other course of study but let’s focus just on the stats for now.
How we formulate our test statistic will change for different types of tests. But all of them are a means to the same end:
knowing the p-value of our data. Test statistics and p-values are linked together.
A p-value is the probability of observing a score as extreme or more extreme than what we observed in our sample, if the null hypothesis were true. That is, if there really were no difference, how likely is it that we could observe as extreme a value as we did on the basis of chance alone? We say as extreme or more extreme, rather than higher or lower, to account for the fact that statistics can be positive or negative.
Don't forget that we always assume the null hypothesis to be true. That, combined with the other assumptions of our test, allows us to use test statistics to figure out probabilities.
Let’s use our example to see how we translate a Z statistic into a p-value.
This equation sets a general pattern for our test statistics. They will have some measure of the difference divided by, or scaled by, some measure of the expected variability.
This should look familiar from our module on z scores: The only difference is in the denominator. Notice that if our sample size was one, the square root of 1 is 1 and we would be dividing by our standard deviation. So these are not really different equations at all - the z-score you learned is really just a Z statistic with a sample size of 1.
Hopefully that denominator is also ringing bells from our module on the central limit theorem. Recall that it predicted that the standard deviation of the sampling distribution of the sample mean would be the population standard deviation divided by the square root of the sample size. We call that number the standard error, and we showed it was a really good estimate of the variability we could expect in sample means. That's exactly why we use it as our denominator.
Here is the z-statistic for our sample, which we call z-observed because we derived it from our observation of the world.
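The slide shows the formula; as a rough sketch of the arithmetic (using the blueprint values and the sample mean computed earlier), it works out like this:

```python
import math

mu, sigma, n = 180.7, 54.2, 9        # 2004 Toyota population values from the blueprints
xbar = 172.11                        # sample mean of the 9 motors (computed above)

se = sigma / math.sqrt(n)            # standard error: 54.2 / 3 ≈ 18.07
z_obs = (xbar - mu) / se             # (172.11 - 180.7) / 18.07 ≈ -0.48
print(round(se, 2), round(z_obs, 2))
```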
One more piece of the puzzle will get us to the p-values we want. Remember again from our module on z-scores that when we calculate z we are finding its position on the standard normal density distribution. And with that position we can use the quantiles of the normal distribution to calculate probabilities.
And we just saw that z-scores and the z statistic are really the same thing.
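For example, a minimal sketch (assuming the scipy library is available) of converting our observed z into p-values:

```python
from scipy.stats import norm

z_obs = -0.48

# One-tailed (lower) p-value: the area to the left of z_obs
p_lower = norm.cdf(z_obs)          # ≈ 0.32

# Two-tailed p-value: the area at least as extreme in either tail
p_two = 2 * norm.sf(abs(z_obs))    # ≈ 0.63

print(round(p_lower, 2), round(p_two, 2))
```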
Okay, that was a lot. Let’s walk through all that again, combining everything with some visuals, and completing our hypothesis test.
Here we have the standard normal density distribution, with z-scores on the x-axis, density values on the y-axis, and a total area under the curve of 1.
Notice along the top there are 3 tabs. Each corresponds to one of the possible research or alternative hypotheses.
Right now we're looking at the one-tailed upper hypothesis. I think you can see now why we refer to the hypotheses by their tails. A one-tailed test has the shaded region in one tail, the positive tail in this case.
Our hypotheses are about the population of Toyota cars from 2004, with a normal distribution with known mu of 180.7 and sigma of 54.2. But we just saw that our z statistic will convert to the standard normal so this is the only graph we need.
You can see those two important probabilities for hypothesis testing on the sliders on the left: the alpha value and the p-value. I've also shown their matching standard z-scores on the graph. The dashed line shows the boundary of the area for each region.
Let’s move the observed value aside for a moment and focus on alpha. I’ve set it to the conventional value of .05. So the red shaded area to the right of the critical z value of 1.645 is 5% of the total area of the graph. We already learned that this means if we drew randomly from this population we would have a 5% chance each time of drawing a score from that region.
Or to put it another way, a 5% chance of drawing a score as extreme or more extreme than the score that corresponds to that alpha - with extreme values being positive only in this case. And now we can see why we use this as our level of significance - if we observe a score in that region, we can be fairly sure that it wasn't by chance, since that only happens 5% of the time. We use this knowledge as evidence against the null hypothesis. But, if the null hypothesis were true - and we can never know for sure - then we would be committing a Type I error! Which is why alpha is both our level of significance and our Type I error rate. Watch as I increase our alpha. Two undesirable things are happening. First, we are reducing the strength of the evidence against the null; people won't be that impressed if you observe something that happens frequently by chance. And two, you are much more likely to say there is a difference when there isn't one.
Each time I move the z-observed, we can imagine that I'm drawing a new sample and doing a new test. We then compare it to the critical value (which you'll note can be expressed as the probability alpha, or as a z value, which we call critical z). We always look for p-values that are lower than alpha to reject the null. With an upper-tailed hypothesis, we look for values of z-observed that are higher than z-critical. They move together but in opposite directions - as p increases, z decreases. This z-obs would be significant, and this one wouldn't, and so on.
Let’s see what happens when we switch to a lower tailed hypothesis. Our critical z-region is now on the left hand side, and our z-critical becomes negative and the area to the left is 5% of the total. Actually, it’s perfectly symmetrical to the corresponding positive z-critical. Note we still want low p-values to reject the null, but now we also want low observed Z scores.
An important thing to realize is that 'extreme' scores are now negative only. So even if we observed a very unlikely positive z (like this), our z-test would not be significant. That's a risk of looking at only one tail - you might miss something unexpected and possibly important.
We now have the knowledge to complete our hypothesis testing. Let’s move over to the final tab because we agreed that we should test a two tailed hypothesis.
Notice that we now have 2 rejection regions. One in each tail of the distribution. Therefore, the critical z is expressed as an absolute value, that is, considering only magnitude and not the sign. So in this case, an extreme z-observed is either very high or very low. We would reject the null if the score entered either rejection region. To reject the null, we would take the absolute value of our z-observed and compare that to the absolute value of z-critical.
Take a look over at the alpha slider for this panel. Note that we still have .05 set for alpha. But notice that the z-critical value is more extreme (both higher and lower) than for a one-tailed hypothesis with the same alpha.
That will always be the case. The reason for that is that we have split the same amount of area (and hence probability) into two pieces.
Look how the regions are smaller on each side.
Notice how that means we need a more extreme z-statistic to reject the null hypothesis. That's another reason some statisticians think you should always use two-tailed hypothesis testing - less chance of a Type I error.
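To make the critical values concrete, here is a small sketch (again assuming scipy) of how the one- and two-tailed critical z values for an alpha of .05 could be computed:

```python
from scipy.stats import norm

alpha = 0.05

# One-tailed (upper) critical value: 5% of the area lies above it
z_crit_one = norm.ppf(1 - alpha)       # ≈ 1.645

# Two-tailed critical value: the 5% is split, 2.5% in each tail
z_crit_two = norm.ppf(1 - alpha / 2)   # ≈ 1.96

print(round(z_crit_one, 3), round(z_crit_two, 2))
```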
So, can we conclude that the sample of 9 engines was drawn from the population of 2004 Toyotas?
Let's put our observed z of -.48 into the graph. We can see visually that we aren't in the rejection regions, and the absolute value of -.48 is .48, which is not greater than 1.96. We must fail to reject the null hypothesis. Remember, we don't accept it. They may well be 2004 engines, or some may be. We don't have the evidence to know for sure. We probably shouldn't find this too surprising; we were only about half a standard error from the population mean.
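Putting the pieces together, a minimal two-tailed decision rule for our example might look like this (a sketch, assuming scipy; the numbers are the ones from the example):

```python
from scipy.stats import norm

alpha = 0.05
z_obs = -0.48                      # observed z from the 9 motors

z_crit = norm.ppf(1 - alpha / 2)   # ≈ 1.96 for a two-tailed test
p_value = 2 * norm.sf(abs(z_obs))  # two-tailed p-value ≈ 0.63

if abs(z_obs) > z_crit:            # equivalently: p_value < alpha
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
```

Here the absolute value of z-observed is 0.48 and the p-value is well above .05, so we fail to reject.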
Another way of thinking about the result of this test is to say that we have no evidence that our sample mean is different than what a sample mean would be if drawn from a population with a mu of 180.7 and sigma 54.2. So, although in this example, we are asking if the engines were literally a sample from the 2004 population, the test itself is not restricted to such a literal scenario.
An effect size is a way to describe how important or large an observed difference is. Even small, essentially meaningless differences can be statistically significant - especially when sample sizes are large. That doesn't necessarily mean they have practical importance. For example, if I could increase your grade by 2% with virtual certainty given 40 hours of extra study a week, you probably still wouldn't do it. The work is not worth the payoff. So even if my intervention was highly reliable, meaning it worked again and again, it's not important.
If the data you are using can easily be understood by your intended audience, reporting the raw difference is a great idea. Most students understand the value and relative importance of percentage grades for example. Sometimes though people wouldn’t have a clear idea if the difference you report is large enough to be important. In that case, you would report a statistical effect size as well. That is the case for our example. Many people don’t know how many horsepower are enough to make a difference.
So we can use a measure called Cohen’s d
As you can see, Cohen's d looks almost identical to the z statistic for the one-sample z-test; the only difference is that d divides by the population standard deviation rather than the standard error. That's because d expresses the size of the difference in standard deviation units. It won't generally look so similar for other test statistics, although they may share some features.
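A short sketch of the calculation for our example (values from above; d here is the sample-versus-population form of Cohen's d described in the narration):

```python
mu, sigma = 180.7, 54.2     # 2004 population mean and standard deviation
xbar = 172.11               # sample mean of the 9 motors

d = (xbar - mu) / sigma     # Cohen's d ≈ -0.16: a small effect by the rules of thumb below
print(round(d, 2))
```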
Usually, you wouldn’t calculate an effect size when you fail to reject the null. In that case your test is telling you that there was no detectable difference so it would be strange to then call it large or small. We did it here for instructional purposes.
What constitutes a big effect really depends on your data, but if you have no information to work from, the following rules of thumb are recommended
Cohen's d    Effect size
0.20         Small
0.50         Medium
0.80         Large
In this module we discussed the logic of hypothesis testing, how to set up a null and research hypothesis, and the steps for conducting a hypothesis test:
1. State your null and research hypotheses
2. Set your alpha criterion
3. Collect your data
4. Calculate your test statistic
5. Decide whether you reject your null hypothesis
6. State your conclusion
7. Assess the size and practical importance of any significant differences
We also discussed the logic of the z-test and went over an example of setting up and conducting a z-test comparing a sample mean to a population.
In the next few modules you will learn about concepts you may have already started wondering about.
What happens when you don’t have information on the entire population? How do you test for differences between samples? This will be addressed in the module on t-tests.
What do you do if you're not interested in differences between means, but instead in the relationship between variables? This will be addressed in the modules on correlation and regression.