Descriptive & Inferential Statistics Intro & Types of Data
Title
Welcome
What Will You Learn?
What are Data?
What is Statistics?
The Two Questions
Datasets
Datasets
Variables
Scales of Measurement
Nominal Scale
Ordinal Scale
Interval Scale
Ratio Scale
Scales of Measurement
Checkpoint
Checkpoint Review
Summary Statistics
Frequency Distributions
Frequency Distributions
Histograms
Checkpoint
Summary: Module 1
Hello, and welcome to Module 1.
What we hope you will take away from this series of modules is an appreciation for how we can use data to answer important questions.
Broadly speaking, we want you to be statistically literate: to clearly understand what conclusions a statistical test or analysis does, or does not, support.
To understand that, we first need to understand what data and, of course, statistics are.
[Photo credit: Lendingmemo.com]
Now let’s talk about some of the things you will be learning in this module.
By the end of this module you should be able to:
- Understand what data are and how we talk about them quantitatively.
- Define, interpret, and apply the 4 scales of measurement to data: Nominal, Ordinal, Interval, and Ratio.
- Read and interpret graphical displays of data.
The key player in statistics is data.
The Oxford dictionary defines data as: Facts and statistics collected together for reference or analysis.
As an additional grammar note: Data is plural. The singular form is datum.
Data, at a fundamental level, represent information. Data can be a collection of observations from an experiment, like time to push a button in response to an image presented on a screen, or observed in the world at large, like how often people buy potatoes at the grocery store.
But importantly, we have observed things and codified or quantified them (that is, assigned numbers to them), allowing us to summarize and compare them.
Now that we have defined data, let’s take a look at statistics.
Modern statistics is a combination of probability theory, mathematics, digital computation, and logic that seeks to reveal patterns, information, and evidence not visible to the naked eye. You don’t have to be an expert in these fields to understand and use statistics though.
Just as a race car driver can win the race without understanding how the drive train of her car is assembled, we hope to teach you the fundamental skills involved in interpreting and using statistics, leaving complex theoretical and mathematical development to the experts.
In Modules 1 and 2, we’ll talk primarily about statistics that describe or summarize data - taking many observations, or data points, and making them simpler to comprehend. Future modules will talk about comparing data, making inferences, and drawing valid conclusions from data.
Throughout these modules, we will be framing our examination of statistics using two broad questions.
One: Is there really a difference? Or equivalently, how confident are we that there is a real difference - how strong is the evidence of difference? For example, do people buy more potatoes at Store A or Store B?
And two: If there is a difference, how big is it and does it matter on a practical level given our question of interest? For example, a small increase in academic grades may matter less than a small increase in survival rates against a particular disease.
We know that statistical concepts can be quite abstract. To help ground these concepts in something concrete, we will be using a variety of real datasets throughout the modules as examples.
Here we are going to introduce a collection of data, referred to as a dataset, on cars sold in 2004.
We chose cars because we felt that most learners should be familiar with them, and perhaps even some of the statistics used to describe them.
The dataset is laid out in a spreadsheet. Each of the 429 rows represents 1 type of car sold during 2004.
Each of the columns is a variable describing something about that car. We call it a variable because it takes on different values based on the car.
Variables that simply flag whether something has a feature or property, such as whether a car is a sports car or not, are often called indicator variables.
The variables included in the data set, in order, are:
Manufacturer - the company that made the car
Model - the name of the car
Sports_Car - is it a sports car, yes or no
SUV - is it a sport utility vehicle, yes or no
Wagon - is it a wagon, yes or no
Minivan - is it a minivan, yes or no
Pickup - is it a pickup truck, yes or no
All_Wheel_Drive - does it feature all wheel drive, yes or no
Rear_Wheel_Drive - does it have rear wheel drive, yes or no
Hybrid - is it a gas/electric hybrid, yes or no
Suggested_Retail_Price - the manufacturer's suggested sale price
Dealer_Cost - the amount the dealer pays the manufacturer for the car
Engine_Size_Litres - size of the engine, in litres
Cylinders - how many cylinders does it have
Horsepower - how much horsepower can the engine produce
MPG_City - the miles per gallon of fuel in the city
MPG_Highway - the miles per gallon of fuel on highways
Weight_Pounds - the weight of the car
Wheel_Base_Inches - the distance between the centers of the front and rear wheels in inches
Length_Inches - overall length of the car in inches
Width_Inches - overall width of the car in inches
Number_of_Doors - 2 door or 4 door
Missing values are indicated with NA. As you can see, we are talking about miles, gallons, pounds, and inches. This dataset is from the United States, so we will, unfortunately, have to leave the metric system behind when using this example.
We said before that data were observations or information that had been categorized or quantified. We do this using scales of measurement.
A scale of measurement is simply a way to assign categories or numbers to things in the world.
We use scales of measurement every day. We measure distance in kilometers (assigning a set distance the value of 1 kilometer), or temperature in degrees Celsius [temp picture], or describe restaurants as ‘short order’, ‘fine dining’ [dessert picture], or ‘fast food’.
This gives us the ability to compare and contrast things in a standardized way without having to refer to examples.
What is important to understand, though, is that there are different KINDS of scales of measurement.
Image Credits:
Roasted Sucking Pig at Troquet: Dale Cruse
Burger and Fries: punctuated
Let’s take a look at the first type of scale: the nominal scale.
Things measured on a nominal scale are merely given unique names. The name represents a category of things and thus the resulting data are often called categorical data. Each name is arbitrary and tells us nothing about the other names in the scale.
Note that numbers can also be categorical if they are just arbitrary labels, such as Group 1 and Group 2.
Looking at our cars data, we see a number of variables that are categorical. For example, model of the car and manufacturer.
We can’t do any math with categories; they just distinguish things as different. You can, however, count how MANY cars are, for example, Hyundais vs Fords.
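As a rough illustration of counting categories, the sketch below tallies a handful of manufacturer labels in Python. The list of labels is invented for the example and is not drawn from the cars dataset.

```python
from collections import Counter

# Hypothetical manufacturer labels (invented for illustration, not the real dataset).
manufacturers = ["Hyundai", "Ford", "Ford", "Hyundai", "Ford", "Toyota"]

# Counting is the one numeric operation that makes sense for nominal data.
counts = Counter(manufacturers)

for name, n in sorted(counts.items()):
    print(f"{name}: {n}")
```

Notice that the labels themselves are never added, subtracted, or ranked; only the number of times each one appears is used.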
Ordinal data are like nominal data in that the levels are names, but there is an additional constraint: the levels are ordered. We can therefore make comparisons like greater than or less than. However, the interval between ranks may not be constant. A classic example is competitive ranking. If someone comes first in a race, we know they were faster than all the other competitors. But we can’t be sure that the distance between first and second will be the same as the distance between second and third.
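To make the point about unequal gaps concrete, here is a small Python sketch of a race. The names and finishing times are invented for illustration; only the idea (ranks preserve order but hide the size of the gaps) comes from the discussion above.

```python
# Hypothetical race results: (name, finishing place, time in seconds).
# All values are invented for illustration.
results = [("Ana", 1, 290.0), ("Ben", 2, 300.0), ("Cy", 3, 345.0)]

# Ranks support order comparisons: a lower place means a faster finish.
first, second, third = sorted(results, key=lambda r: r[1])
first_beat_second = first[1] < second[1]

# But equal steps in rank can hide unequal gaps in the underlying times.
gap_1_2 = second[2] - first[2]
gap_2_3 = third[2] - second[2]

print(first_beat_second, gap_1_2, gap_2_3)
```

Here the step from 1st to 2nd and the step from 2nd to 3rd are both "one place", yet the time gaps behind them differ, which is why we avoid subtracting ranks.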
Interval scales, as the name suggests, have equally spaced intervals between all the values on the scale. Thus, we can compare, saying things like greater or less than, but we can also assign exact values to that difference by subtracting or adding.
For example, temperature in degrees Celsius is an interval scale. We know that the difference between 5 degrees and 10 degrees is the same as the difference between 15 degrees and 20 degrees. But we can’t form ratios or divide these - it is not sensible to say that 10 degrees is half as hot as 20 degrees, because zero degrees Celsius does not mean that there is no temperature at all.
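One way to see why ratios fail on an interval scale is to re-express the same temperatures on a scale with a true zero. The Python sketch below does this using the Kelvin scale, which is not part of the discussion above and is brought in only as an illustration; the helper name celsius_to_kelvin is our own.

```python
def celsius_to_kelvin(c):
    """Convert degrees Celsius to kelvins, a scale with a true zero."""
    return c + 273.15

# Differences are meaningful on an interval scale.
diff_low = 10 - 5
diff_high = 20 - 15

# The apparent ratio 20 / 10 = 2 depends entirely on where zero sits.
ratio_celsius = 20 / 10
ratio_kelvin = celsius_to_kelvin(20) / celsius_to_kelvin(10)

print(diff_low == diff_high, ratio_celsius, round(ratio_kelvin, 3))
```

The differences agree, but the "twice as much" ratio disappears once zero is moved, which suggests the ratio was never a property of the temperatures themselves.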
A ratio scale has all the properties of an interval scale as well as a meaningful zero point. That zero point is where none of the measured quantity exists (even if this doesn't often or ever happen in actuality).
Let’s go back to our race example [insert racers photo], but instead of simply having rankings of 1st, 2nd, or 3rd place, let’s imagine we have the number of seconds it took each competitor to reach the finish line. Those numbers are on a ratio scale. These times do have a meaningful zero. This zero point is conceptually meaningful even though no competitor ever actually crosses the finish line at zero seconds. Thus it is appropriate to say that someone who finished in 300 seconds was twice as fast as someone who finished in 600 seconds.
There are countless other variables that are on a ratio scale.
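The race example can be checked with a few lines of Python. The finishing times of 300 and 600 seconds come from the example above; the race distance is a made-up assumption needed only to turn times into speeds.

```python
# Hypothetical race distance in metres (an assumption; the example gives only times).
distance_m = 1500.0

time_a = 300.0  # seconds, from the example
time_b = 600.0  # seconds, from the example

# Time has a true zero, so a ratio of two times is meaningful.
time_ratio = time_b / time_a

# Average speed = distance / time, so half the time means twice the speed.
speed_ratio = (distance_m / time_a) / (distance_m / time_b)

print(time_ratio, speed_ratio)
```

Whatever distance is chosen, it cancels out of the speed ratio, so the "twice as fast" statement depends only on the two times.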
We’ve laid out the name, properties, and permitted mathematical operations of each scale in this table. Please come back to this table any time you need to remind yourself about the different measurement scales.
Now that we know a little more about scales of measurement, let’s look more closely at data: collections of observations, ordered and measured using these scales.
Usually when we talk about data, we are talking about many numbers and labels. Too many to easily comprehend by just looking at them. Take a look at the full data set for cars for example. With so many cars, and so many variables, I think you’ll agree it’s almost impossible to understand anything meaningful just by inspecting the data directly.
This is where summary statistics come in. We are looking for a single number, or small set of numbers, that simplify and represent many numbers, hopefully without biasing or misrepresenting those data.
In the next module we’ll talk about 2 types of numeric summaries: measures of central tendency and measures of variability in the data. But before we do that, we have an even more intuitive and powerful tool at our disposal: pictures. The three rules of real estate form a mantra: location, location, location. Translated to statistics, this mantra becomes: pictures, pictures, pictures.
Humans process visual information very efficiently so a great first step is to look at a summary of your data in a graph.
Before we get started on graphing data, let’s consider an important tool for looking at our data directly - the frequency distribution.
Let’s use suggested retail price from our cars database to demonstrate how this works.
The numbers you are seeing are all the different car prices. It’s tough to see any patterns when presented in this way.
Let’s sort all the values from lowest to highest. That helps a little but is still too much information to process all at once.
A frequency distribution is a table that displays several numeric intervals, called bins, and counts the number of scores within each interval.
Here, we’re displaying car price in 10 bins. Now we begin to see meaningful patterns emerge. We note, for example, that most cars cost less than 60,000 dollars.
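For learners who want to see the mechanics, the following Python sketch builds a frequency distribution with equal-width bins. The function name frequency_distribution and the prices are invented for illustration (they are not values from the cars dataset), and we use 3 bins rather than 10 so the result is easy to check by hand.

```python
def frequency_distribution(values, n_bins):
    """Return a list of (low, high, count) tuples for equal-width bins.

    Assumes the values are not all identical, so the bin width is non-zero.
    """
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Place the maximum value in the last bin rather than past it.
        index = min(int((v - low) / width), n_bins - 1)
        counts[index] += 1
    return [(low + i * width, low + (i + 1) * width, c)
            for i, c in enumerate(counts)]

# Hypothetical prices in dollars (invented; not the real dataset values).
prices = [12000, 15500, 18000, 21000, 24500, 26000, 31000, 38000, 45000, 72000]

table = frequency_distribution(prices, 3)
for lo, hi, count in table:
    print(f"{lo:>8.0f} to {hi:>8.0f}: {count}")
```

Every price lands in exactly one bin, so the counts always add up to the number of observations.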
Let’s plot this information in a graph.
Graphs are nothing more than highly ordered pictures. Different graphs share some common properties. They have two axes: x and y. The x axis is horizontal and the y axis is vertical. Each axis represents the value of a variable using relative position and needs to be labeled to show what it represents.
We’ll be encountering many types of graphs throughout these modules. For now we’ll focus on histograms.
A histogram is a graph that places the values of a variable along the x-axis, grouped into an arbitrary number of bins, and a count of how often those values occur along the y-axis. The bins are the white rectangles just above the x-axis. The width of each rectangle represents the numeric interval of the bin and the height represents the number of values in that interval. Thus, when all bins have the same width, the area of each rectangle (that is, height times width) is proportional to the frequency of values in that interval.
A histogram is an ideal way to visualize frequency distributions.
This histogram plots the frequency distribution table we just made using 10 bins.
Now let’s increase the number of bins. A few things happen as we do this: 1) the width of each bin decreases, 2) the count of scores in each bin decreases, and 3) we see the ‘shape’ of the distribution more and more clearly.
The number of bins is arbitrary, and the appropriate number will depend on the underlying data and the purpose of the graph.
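The effect of changing the number of bins can be tried out with a simple text histogram. The Python sketch below is only an illustration: the function name bin_counts and the prices (in thousands of dollars) are made up, and bars are drawn sideways with one '#' per value.

```python
def bin_counts(values, n_bins):
    """Count values in n_bins equal-width bins spanning min to max.

    Assumes the values are not all identical, so the bin width is non-zero.
    """
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin.
        counts[min(int((v - low) / width), n_bins - 1)] += 1
    return width, counts

# Hypothetical prices in thousands of dollars (invented for illustration).
prices = [12, 14, 15, 17, 18, 19, 21, 22, 24, 25, 27, 30, 33, 38, 45, 60]

for n_bins in (4, 8):
    width, counts = bin_counts(prices, n_bins)
    print(f"{n_bins} bins, each {width:g} wide")
    for i, c in enumerate(counts):
        # One '#' per value: bar length plays the role of bar height.
        print(f"  bin {i + 1}: {'#' * c}")
```

Going from 4 to 8 bins halves the bin width and lowers the tallest bar, while the total number of values plotted stays the same.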
Understanding graphs generally, and histograms specifically, will help us understand data at a glance. We can more precisely, but simply, interpret our data using summary statistics, which will be discussed in Module 2.
Let’s look back at our Intended Learning Outcomes.
In this module we discussed how to define data and statistics.
We have described how to interpret data on the 4 scales of measurement: Nominal, Ordinal, Interval, and Ratio scales.
And we can now read and understand graphical data displays including histograms.
Module 2 will move on to numeric summaries of our data. We’ll be using histograms to help us understand these as we go.