Hello, and welcome to the module on non-parametric statistics.
In this module, we will introduce the idea of nonparametric statistics. Nonparametric analyses are a set of procedures you can employ when the parametric statistics you’ve learned are not appropriate. In this module, we will learn chi-square analyses, a technique that’s used to deal with nominal datasets. We’ll walk you through how to complete different types of chi-square analyses, and we will also very briefly introduce you to some other nonparametric tests.
By the end of this module, you should be able to
• Describe the difference between parametric and nonparametric tests
• Interpret frequency data
• Construct contingency tables and calculate expected frequencies
• Compute a chi-square statistic and interpret the results, and
• Identify when to perform specific nonparametric tests
Several different types of statistics allow you to test hypotheses about one sample, or between two or more samples. These tests, such as t-tests and ANOVA, require that your data meet certain assumptions, follow particular population distributions, and allow the estimation of specific population parameters. For this reason, those tests all fall under the category of parametric tests.
But what if you cannot make these assumptions about your data? What if your sample is not normally distributed? In that case, parametric tests are no longer applicable to the data and you will have to look for alternative techniques, which we call nonparametric tests. As the name suggests, these tests do not require the specific assumptions of parametric tests, and are often referred to as distribution-free tests because they do not rely on normally distributed data.
One of the primary instances when nonparametric tests come in handy is when the dependent measure is on an ordinal scale, instead of the ratio scale that t-tests and ANOVAs require. Recall that the ordinal scale represents data where measurements have an order, or a rank, and thus you know the order of measurements relative to each other. For example, imagine a list of cars ranked from most to least expensive, without any actual prices. You would know that a Ferrari is more expensive than an Acura which is more expensive than a Ford, but you don’t know how much the price differences are between them. You only know they follow a particular order.
Another example where nonparametric statistics are useful is when the data you have are nominal. In this case, the only information you have is categorical, or frequency data, where you know, for example, how many observations are seen in Group A vs. Group B. Imagine that, for a certain neighbourhood in Toronto, you want to know how common cars are compared to SUVs. The only information you have comes from counting the number of cars and SUVs in a downtown Toronto grocery store parking lot. These counts do not carry with them any distributions, and therefore there is no measure of variance. How does one approach a problem like this?
One of the things we often do with frequency data is to compare the frequency distribution that we observe in our sample to some other expected frequency distribution, and see if the distributions match or not. This is the fundamental idea behind an important non-parametric statistic called the Chi-Square test.
The null hypothesis of the Chi-Square test is always that our observed frequencies match our expected frequencies. In other words, we predict that there is no difference.
The alternative hypothesis is that the frequencies are different in any way.
What do we mean by “expected frequencies”? What is interesting about a chi-square test is that we can define the expected frequencies in any way we want to, depending on our null hypothesis.
For example, our expected values could come from population frequencies. We know how many vehicles are owned in Canada. We know exactly how many of those vehicles are cars, how many are SUVs, how many are vans, and how many are pickup trucks. We know these frequencies because the data have been collected over time by the Ministry of Transportation. We can use the information we already know about the world as our expected frequencies, and then we can test whether what we observe is the same as what we expected.
When we know the frequency distribution of different vehicle types across Canada, we can then use these values to see whether different regions within Canada have the same distribution as the whole of Canada. For example, let’s say we want to know if the frequencies of cars, SUVs, minivans and pickup trucks are the same in Toronto as they are in all of Canada. However, we have reason to believe the frequencies may be different. For example, there is substantially less agricultural space in Toronto compared to many other parts of Canada, so there may be fewer pickup trucks in Toronto.
Of course, we don’t always have reason to expect a certain distribution of frequencies.
For example, the record of vehicle types in Canada does not include the colour of the vehicle. As such, we have no prior knowledge on what frequency distribution to expect of differently coloured cars in Toronto. In this case our null hypothesis might be that all colours occur equally frequently, and using that we can calculate what the expected frequency distribution would be if all colour categories were equal. If we can reject the null hypothesis that all categories were equal, that would tell us that there are differences between groups, and that some colours are more common than others. This is sort of like the t-test and ANOVA we’ve learned about.
Testing whether all groups are equally frequent or not is a very common use of chi-square. However, remember that we can test our sample against any other distribution, whether the distribution came from population numbers, from a theoretical expectation, from another sample, or from an assumption of equality. When you have no other prior information at all about an expected distribution, the accepted standard is to assume that all categories would occur equally frequently.
So recall our example from earlier---we want to know how common cars are vs. SUVs in a certain neighbourhood in downtown Toronto.
For our sample, we went to a grocery store parking lot in that neighbourhood and counted up how many vehicles were in each category. The total was 7 SUVs and 23 cars.
This is our observed frequency distribution. In this case, because we just want to see whether there are any differences between the frequency of cars and SUVs, our null hypothesis for the Chi-Square test is that they are equally frequent.
There are 30 vehicles in the parking lot in total, all of which are either cars or SUVs. If there is no difference in the number of cars and SUVs, you would expect that half of the vehicles would belong in each category. 50% of 30 vehicles is 15, so you would expect to see 15 cars and 15 SUVs. As the name implies, these are our expected frequencies that we have just calculated. Clearly, the observed frequencies of cars and SUVs don’t match the expected frequencies. The Chi-square test allows us to test whether the observed frequencies are significantly different than the expected frequencies.
The test outlined above, where you compare the observed frequencies to the expected frequencies, is the basis of the first type of Chi-square, which is the goodness of fit Chi Square test. The goodness of fit test, as its name suggests, tests if the data (in this case, observed frequencies) are a “good fit” to your theory (in this case, expected frequencies). Goodness of fit tests whether the frequency distributions have the same shape, or not.
The test uses a Chi-square statistic, which is calculated using the following equation. For each observation, you subtract the expected frequency from the observed frequency, square that difference, and then divide it by the expected frequency. You then sum this across all the observations to get a single chi-square value.
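Written out as an equation, this is the standard goodness-of-fit chi-square formula, where O_i is the observed frequency and E_i is the expected frequency for category i:

\[ \chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} \]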
The structure of this formula should make intuitive sense. In the numerator, we obtain a difference between the observed data and the expected data. We square the difference so that both positive and negative difference values count in the same way, just like we do when we calculate variance. Then, in the denominator, we normalize the difference by comparing it to the expected frequency. By that we mean, if the observed and expected frequencies were 105 and 100, then the difference of 5 is not very substantial (5²/100 = 0.25).
However, if the observed and expected frequencies were 15 and 10, then the difference of 5 would be very substantial (5²/10 = 2.5). In other words, imagine you had 15 people turn up to a party when you were expecting 10; that’s a lot more people than you expected. Compare that to if you had 105 people turn up when you were expecting 100; you may not even notice those extra people in your house.
The next step is just like what we have done in the past with our t and F statistics. We take the observed chi-square value, and we compare it to the Chi-square distribution table in order to determine whether the observed frequencies are significantly different than the expected frequencies at our desired significance level.
The chi-square distribution also relies on degrees of freedom. In a goodness-of-fit test, the degrees of freedom are determined by k-1, where k is the number of categories. In the cars vs. SUVs example, there are 2 categories of vehicle, so k = 2. Thus our degrees of freedom will be 2-1, which equals 1. Although our current example has only 2 categories, you should know that the goodness-of-fit test does not restrict you to factors with only 2 categories. Just like a one-way ANOVA, you can have a single factor with as many categories as you wish.
Let’s work through the entire parking lot example. Our null hypothesis is that cars and SUVs are equally frequent in downtown Toronto; if that is true, the observed frequencies should follow the expected frequencies, which are exactly what the values in the table represent. Numerically, we can easily see that the observed frequencies are different than the expected frequencies, but we have to use a chi-square goodness-of-fit test to determine if the difference is significant.
We can use the chi-square equation to calculate our chi-square value: (7 − 15)²/15 + (23 − 15)²/15 = 8.53.
There are two categories of vehicle, so our degrees of freedom is 2-1 = 1. We want to use a significance level of 0.05, so we refer to the chi-square distribution table to identify the critical chi-square value, which in this case is 3.84.
Since our obtained chi-square of 8.53 is greater than the critical chi-square of 3.84 we can reject the null hypothesis that the observed frequencies were the same as the expected frequencies. We can conclude that the number of cars and SUVs in the parking lot are significantly different than our expectation that they would be equally frequent in downtown Toronto.
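To make the arithmetic concrete, here is a minimal Python sketch of the same goodness-of-fit test, assuming SciPy is available; the observed counts and the equal-frequency expectation come straight from the example above.

from scipy.stats import chisquare, chi2

observed = [7, 23]    # SUVs and cars counted in the downtown parking lot
expected = [15, 15]   # equal frequencies under the null hypothesis

# Goodness-of-fit chi-square: sum of (O - E)^2 / E across categories
result = chisquare(f_obs=observed, f_exp=expected)
print(result.statistic, result.pvalue)   # statistic is approximately 8.53

# Critical value at the 0.05 significance level with k - 1 = 1 degree of freedom
print(chi2.ppf(0.95, df=1))              # approximately 3.84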
In the previous example, we were just comparing along one grouping factor: vehicle type. But what if we wanted to compare along two factors?
For example, imagine you wanted to compare the frequency of SUVs and cars between downtown and suburban Toronto. Maybe the frequency of SUVs and cars depends on, or is contingent on, location. To test this, we could now take a sample from a grocery store parking lot in downtown Toronto, and count how many of each type of vehicle were there, and then do the same in a grocery store parking lot in the suburbs.
To represent these data, we use something called a contingency table. A contingency table organizes your frequencies across the two grouping factors, where rows represent one variable (such as vehicle type), and columns represent the other variable (such as location). This shows us not just how common SUVs and cars are in our sample, nor how many vehicles we saw in each location, but it shows us how the two grouping factors relate to each other.
In our previous example, we were testing the null hypothesis that the two types of vehicles were equally likely, by testing the observed frequencies against what would be expected if they were in fact equally likely. However, in this two factor example, we are not testing for equal likelihood, we are testing for the independence of the two grouping factors. That is, does the proportion of SUVs depend on the location or do we see the same proportions in both locations? To figure this out, we will use a chi-square test of independence.
The basic idea of the chi-square test of independence is exactly the same as the goodness-of-fit test, because we are testing observed values against expected values. The main difference now is that we calculate the expected values in a different way, based on the contingency table.
We will need to calculate the expected frequencies using the marginal totals, which are the totals for the levels of one variable summed across the levels of the other variable. Marginal totals are calculated by summing the observed frequencies in the rows (i) and the columns (j) of a contingency table. Marginal totals represent the frequency of a vehicle type (in this case, cars vs. SUVs, regardless of location), and the frequency of the vehicle being in each location (in this case, downtown vs. suburb, regardless of the type of vehicle).
Our calculation of the expected frequency if the factors were independent comes from an idea about probability. If two events are independent, you can calculate the probability of them BOTH happening at once (e.g., a vehicle being both downtown and being an SUV) by simply multiplying the two probabilities together. This is what we do when we multiply the marginal totals for one event, and divide that by the total number of observations.
For expected frequency Eij, you take the product of the marginal totals of row Ri and column Cj and divide it by the total number of observations N. So, the expected frequency for SUVs in the suburbs (Ess in the calculation shown) is obtained by multiplying the total of all SUVs (11) by the total of all vehicles in the suburbs (40) and dividing it by the total number of vehicles (70), giving an expected frequency of 11 × 40 / 70 = 6.29. You can do this for each of the four possible outcomes.
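As an equation, using the row, column, and total notation just described:

\[ E_{ij} = \frac{R_i \times C_j}{N} \]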
Once we have all of our expected frequencies calculated, we can proceed with calculating the chi-square statistic, which we do in the exact same way as we did for a goodness-of-fit test. To calculate the chi-square statistic, for each observation you square the difference between each observed and expected frequency pair and divide it by the expected frequency. You then sum all these values together to obtain your chi-square statistic.
The degrees of freedom is slightly different than before. You now have 2 variables that you have to account for in your degrees of freedom, so you take the number of categories in each variable minus 1, and take the product between the two variables. In simpler terms, it’s the number of rows minus 1 times the number of columns minus 1. In our example, the degrees of freedom are 2-1 times 2-1, which is still 1. We can now use our degrees of freedom and our desired significance level of 0.05 to identify the critical chi-square value, which is 3.84.
In our cars vs. SUVs example, the obtained chi-square of 2.31 is smaller than the critical chi-square value of 3.84, so we fail to reject the null hypothesis. This means we fail to find any evidence that the two variables are not independent of each other. In other words, we conclude that the relative frequency of cars and SUVs is the same regardless of whether you are downtown or in the suburbs.
One thing that is useful about this test is that it works even when the number of observations in the different groups is very different. For example, if we had observed 200 cars downtown, and only 40 in the suburbs, the test will still be able to tell us whether the relative proportions are the same across location.
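Here is a minimal Python sketch of the same test of independence, assuming SciPy is available; the counts come from the contingency table in the example (7 SUVs and 23 cars downtown, 4 SUVs and 36 cars in the suburbs), and Yates' continuity correction is turned off so that the statistic matches the hand calculation.

from scipy.stats import chi2_contingency

# Rows are vehicle type (SUVs, cars); columns are location (downtown, suburbs)
observed = [[7, 4],
            [23, 36]]

# correction=False disables Yates' continuity correction for this 2x2 table
chi2_stat, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2_stat, dof)   # approximately 2.31 with 1 degree of freedom
print(expected)         # expected frequencies computed from the marginal totals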
As we already discussed, the critical chi square depends on the degrees of freedom.
This is because chi-square is actually a distribution of values, and the shape of the distribution is dependent on the degrees of freedom. In principle this is very similar to the t- and F-distributions.
The critical chi-square value at a significance level of 0.05 is the value for which, given a true null hypothesis, there is less than a 5% likelihood that you would obtain a chi-square value bigger than that critical value. Five percent is generally an acceptable level of significance; we say that if there is less than a 5% chance to obtain a certain value, then it’s likely that the null hypothesis is not true. We can see from the different chi-square distributions with different degrees of freedom that as degrees of freedom become larger and larger (which means you have more categories in one or both of your variables), the critical chi-square value also becomes larger. In other words, as degrees of freedom increase, you must obtain a larger chi-square statistic to reach the same level of significance. In comparison, for the t and F distributions, as degrees of freedom increase, critical values become smaller.
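You can see this pattern directly with a quick Python sketch, assuming SciPy is available:

from scipy.stats import chi2

# Critical chi-square values at the 0.05 significance level for increasing df
for df in range(1, 6):
    print(df, round(chi2.ppf(0.95, df), 2))
# prints 3.84, 5.99, 7.81, 9.49, 11.07 -- larger df means a larger critical value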
One final note is that when evaluating the chi-square statistic, the critical value is based on a one-tailed test; that is, you are only ever checking whether the chi-square statistic is greater than the critical chi-square. This is because of how the chi-square statistic is calculated: no matter if the observed frequencies are smaller or larger than the expected frequencies, the chi-square statistic grows with the absolute magnitude of the difference, not with the direction of the difference. This is simply due to the fact that the difference score is always squared, and basic rules of mathematics dictate that squaring positive or negative numbers will always lead to a positive value.
You should also know the assumptions that are required for a valid chi-square analysis. A primary assumption is that the categories must be mutually exclusive of one another. In our previous example, mutual exclusivity is respected because [car and SUV appear] 1) a vehicle can be either a car or an SUV, but not both; [downtown and suburbs appear] and 2) a vehicle can be either in the suburbs or downtown, but not both. An example of categories that are not mutually exclusive would be if our categories were SUVs and four-door vehicles. These categories are not mutually exclusive because SUVs can also be four-door vehicles, so our frequency counts of SUVs and four-door vehicles would both include the four-door SUVs.
Additionally, it is important to know that when forming a contingency table and calculating a chi-square statistic, you must be dealing with actual frequency data, and not proportion data. For example, if you only knew that 10% of vehicles were SUVs and 90% were cars downtown, compared to 20% SUVs and 80% cars in the suburbs, you couldn’t do any actual statistics. You need to convert these proportions back into frequency data, by multiplying each proportion by the total number of vehicles sampled downtown and in the suburbs, respectively. Then you can proceed with calculating your expected frequencies and chi-square statistic.
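As a small illustration, here is a Python sketch of that conversion; the sample sizes (50 vehicles downtown, 40 in the suburbs) are purely hypothetical, chosen only to show the arithmetic.

# Hypothetical sample sizes -- proportions alone are not enough for a chi-square test
downtown_total = 50   # assumed number of vehicles counted downtown
suburb_total = 40     # assumed number of vehicles counted in the suburbs

# Convert proportions back into frequency counts
downtown_suvs = 0.10 * downtown_total   # 5 SUVs
downtown_cars = 0.90 * downtown_total   # 45 cars
suburb_suvs = 0.20 * suburb_total       # 8 SUVs
suburb_cars = 0.80 * suburb_total       # 32 cars
print(downtown_suvs, downtown_cars, suburb_suvs, suburb_cars)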
One final rule of thumb is that it is best practice to only do a chi-square test if none of the expected frequencies is less than 5, although we actually violated this in our previous example.
The chi-square analysis informs us when observed frequencies are significantly different than expected frequencies.
However, the truth is that a statistically significant effect does not necessarily translate to a practically meaningful effect. It is possible to get a very significant chi square even when the difference between observed and expected frequencies is not actually large enough to be meaningful. This means we need a measure that not only tells us if a difference is statistically significant, but that also tells us about the size of the difference between categories. That is, in addition to knowing if the difference is significant, we want to have a measure of how big the difference is.
One of the best measures of effect size when dealing with a chi-square analysis is the d-family effect size statistic. The d-family uses a contingency table to derive risks and odds, which are two very similar but still meaningfully different measures of a difference. Risk and odds provide a more descriptive way for us to look at frequencies between categories.
A risk is measured as the ratio of the frequency of one category divided by the frequency of all categories in a variable.
For example, the risk of SUVs in a downtown parking lot is measured by dividing the number of SUVs downtown (7) by the total number of vehicles in the downtown lot (30) to give a risk of 7/30 = 0.23. We can also measure the risk of an SUV in the suburbs by dividing 4/40 to get 0.10. The risk informs us of the proportion of occurrences in a category (in this case, SUVs) relative to the total number of occurrences across all categories of that variable (or, as a proportion of all vehicles either downtown or in the suburbs). Here you can see the risk calculations for all categories.
We can use these risk values to measure the risk difference, which is simply the difference between the risks of two groups.
For example, the risk difference between SUVs downtown vs. in the suburbs is 0.23 - 0.1, which equals 0.13. That means there is a 13% difference between the frequency of SUVs in downtown and suburban grocery store parking lots.
There is a small problem with using risk difference as a measure of an effect between two categories; the size of the risk difference depends on the size of the risks themselves. We observed almost twice as many SUVs in the downtown grocery store parking lot as in the suburban one, yet the risk difference is a mere 13%.
If we care about the actual numbers, not just the proportions, there is a better and more informative measure of the difference in risk. This measure is the risk ratio, also referred to as the relative risk. The risk ratio is calculated by dividing the two risks to better understand the risk relative to one another.
So, in our example, the risk ratio of SUVs in a downtown grocery store parking lot vs. a suburban grocery store lot is 0.23 / 0.1 = 2.3. That means the chance of a vehicle being an SUV in a downtown grocery store parking lot is 2.3 times greater than in a suburban grocery store lot, which sounds like a much larger difference between the locations than the risk difference measure implied.
Similar to risks and risk ratios are two measures we call odds and odds ratios. Although they are similar, and will often provide very similar values, the difference between them is an important one.
As we just saw, risks are calculated by dividing the occurrences in one category by the total number of occurrences across all categories.
By contrast, odds are measured by dividing the frequency of occurrences in one category by the frequency of occurrences in another category.
For example, the odds of SUVs vs cars in downtown would be 7/23=0.30. The odds of SUVs vs. cars in the suburbs would be 4/36=0.11. Here you can see the other odds calculations.
The only difference between risks and odds is what goes in the denominator [highlight denominator as narrated]: when calculating risks, you divide a category’s occurrences by the total number of occurrences, while in odds you divide the category’s occurrences by another category’s occurrences. Although the calculation is only slightly different, the way you should think about risks and odds is different.
A risk informs you how likely an event is to occur, and thus it will always be a number between 0 and 1, with 0 being no probability of it occurring, and 1 being a 100% probability of it occurring. Odds, however, tell you how likely an event is to occur compared to how unlikely it is to occur. An odds of 1 tells you that for every car you see, you will also see an SUV.
In the real world, risks are often the measure used by healthcare professionals. For example, the risk of having infection X is 2%, which means 2 out of every 100 people will have the infection. On the other hand, odds are often used when gambling, for example, the odds of getting a pair of aces compared to the odds of not getting a pair of aces.
To compare the relative odds of two occurrences, we use an odds ratio. An odds ratio is very similar to a risk ratio, but uses the calculated odds in the ratio instead.
Here we are calculating the odds ratio of SUVs downtown compared to in the suburbs, so we take the odds we calculated above. 0.30/0.11=2.73. Therefore, the odds of finding an SUV in a downtown parking lot are 2.73 times greater than the odds of finding an SUV in the suburbs.
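Pulling the risk and odds calculations together, here is a minimal Python sketch using the counts from the contingency table (7 SUVs and 23 cars downtown, 4 SUVs and 36 cars in the suburbs):

# Counts from the contingency table
suv_downtown, car_downtown = 7, 23
suv_suburb, car_suburb = 4, 36

# Risk: one category's frequency divided by all observations in that location
risk_downtown = suv_downtown / (suv_downtown + car_downtown)   # 7/30, about 0.23
risk_suburb = suv_suburb / (suv_suburb + car_suburb)           # 4/40, 0.10

risk_difference = risk_downtown - risk_suburb                  # about 0.13
risk_ratio = risk_downtown / risk_suburb                       # about 2.3

# Odds: one category's frequency divided by the other category's frequency
odds_downtown = suv_downtown / car_downtown                    # 7/23, about 0.30
odds_suburb = suv_suburb / car_suburb                          # 4/36, about 0.11
odds_ratio = odds_downtown / odds_suburb                       # about 2.7

print(risk_difference, risk_ratio, odds_ratio)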
Let’s remember why we are discussing risks, odds, and their ratios: to provide a measure of effect size for nominal data. A contingency table provides us with the raw frequency of events in different categories. A chi-square statistic tells us if the frequencies are significantly different than what’s expected. Risks and odds ratios provide us with measures to better conceptualize and identify if the difference between categories is of a meaningful magnitude.
So far we’ve talked about the chi-square statistic to deal with nominal data, and we created contingency tables to better understand how different categories of a variable have differently frequent occurrences.
Nominal data themselves carry little information other than telling us how frequently events occur. When presented with nominal data, a useful choice is a chi-square analysis; the data do not have any mean or variance that can be used to complete any of the standard parametric tests, such as t-tests and ANOVAs.
You can imagine if we sampled 100 downtown parking lots and 100 suburban parking lots for the frequency of SUVs and cars, we could compute an Analysis of Variance to see whether the average downtown or suburban parking lot contained more SUVs or cars. However, what if you had this rich information, where you could in fact sort the data and compute means and variances, but you realize that your dataset does not comply with some of the underlying assumptions necessary for a parametric test?
We’ve already touched briefly on the idea of nonparametric tests: statistical procedures used to analyze data sets that violate specific assumptions required for parametric tests. These violations could include the violation of the assumption of normality or the violation of the assumption of homogeneous variance. Valid conclusions from the results of t-tests and ANOVAs require that these assumptions are not violated, so when they are violated, we have to use tests that do not make these assumptions.
Several nonparametric tests have been developed by statisticians to enable us to analyze nonparametric data. We will not discuss in depth how to compute these nonparametric tests, but we will outline briefly how they work, and when you will need to use each of them.
The nonparametric tests we will discuss are all a type of rank-randomization test. You don’t need to know the details of what defines a rank-randomization test, other than the fact that all these tests require that you deal with ordinal, or ranked data. If all you have is ordinal data, then you can only do one of these tests. If you have ratio data, but the assumptions of parametric tests are invalid, then you can transform your ratio data to ordinal data and do a rank-randomization test. How you rank your numbers may differ in each test, so we will not get into the details of how the transformation occurs, although it is often very simple.
The four most frequently used nonparametric rank randomization tests are inspired by their parametric counterparts, t-tests and ANOVAs. When you have two independent samples and you want to compare their means using a t-test, but the data are not normally distributed, then you have to use the Mann-Whitney U test instead. Similarly, when the two samples are matched and call for a matched-sample t-test, then you do a Wilcoxon Matched-pairs signed-rank test instead. When the design gets more complicated and includes more than 2 groups whose means you wish to compare and the samples are not related, you would do a Kruskal-Wallis one-way ANOVA, also known as the Kruskal-Wallis H test. And lastly, when you have more than 2 groups where the samples are related, or are part of a repeated-measures design, but normality cannot be assumed, then you would use Friedman’s Rank test instead.
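For reference, each of these rank tests has a readily available implementation; here is a minimal Python sketch assuming SciPy is available, with three small groups of illustrative scores (group_a, group_b, group_c) that are not real data.

from scipy.stats import mannwhitneyu, wilcoxon, kruskal, friedmanchisquare

group_a = [3, 5, 8, 2, 7]   # illustrative scores only
group_b = [6, 9, 4, 8, 10]
group_c = [1, 4, 2, 5, 3]

print(mannwhitneyu(group_a, group_b))       # two independent samples (instead of an independent-samples t-test)
print(wilcoxon(group_a, group_b))           # two matched samples (instead of a matched-sample t-test)
print(kruskal(group_a, group_b, group_c))   # 3+ independent samples (instead of a one-way ANOVA)
print(friedmanchisquare(group_a, group_b, group_c))  # 3+ related samples (repeated measures)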
The hypotheses of the parametric and nonparametric versions of these tests are very similar, although they are generally a bit broader and less specific in the nonparametric case. Otherwise, there are pros and cons to using parametric vs. nonparametric tests, but we will leave that for you to research down the road. Until then, you should focus on the fact that nonparametric tests provide a means to conduct statistical tests on your data when the data violate assumptions of normality.
Before we conclude, it is good to know that there are also nonparametric ways to compute correlations. For example, the nonparametric Spearman’s rho correlation is used when you have ordinal, rather than ratio data.
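In Python, for example, a Spearman correlation could be sketched as follows, assuming SciPy is available and using purely illustrative ranked data:

from scipy.stats import spearmanr

price_rank = [1, 2, 3, 4, 5]      # illustrative ranks only
quality_rank = [2, 1, 4, 3, 5]
rho, p_value = spearmanr(price_rank, quality_rank)
print(rho, p_value)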
This module has discussed when parametric statistics are no longer appropriate, and has introduced the idea of alternative, non-parametric statistical tests.
We learned how to conduct a chi-square test on distributions of nominal data, including both the goodness-of-fit test, and the test of independence.
We also developed an understanding of risks and odds, to investigate how meaningful the differences in our chi-square analyses are.
In the future, you can refer to the decision tree provided to make the appropriate decision on which statistical test to proceed with.