| | animal | iron | infect | weight | day |
---|---|---|---|---|---|
129 | A06 | Iron | NonInfected | 305 | 508 |
509 | A23 | Iron | Infected | 150 | 166 |
471 | A21 | Iron | Infected | 230 | 380 |
299 | A13 | NoIron | Infected | 410 | 781 |
270 | A12 | NoIron | Infected | 295 | 599 |
187 | A09 | NoIron | Infected | 130 | 166 |
307 | A14 | NoIron | Infected | 200 | 296 |
277 | A13 | NoIron | Infected | 125 | 122 |
494 | A22 | Iron | Infected | 200 | 380 |
330 | A15 | NoIron | Infected | 205 | 296 |
Day 2
Review
Population: the entire collection of individuals about which information is sought.
Sample: a subset of the population, containing the individuals that are actually observed.
Why do we sample?
Simple Random Sample
- A sample chosen by a method where every possible sample of a given size is equally likely to be the one selected from the population
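A minimal sketch of drawing a simple random sample with Python's standard library. The population of 20 ID numbers, the sample size, and the seed are made-up values for illustration only.

```python
import random

# Hypothetical population: 20 student ID numbers (illustrative, not real data).
population = list(range(1, 21))

# A seeded generator makes the example reproducible; the seed is arbitrary.
rng = random.Random(2)

# random.sample draws without replacement, so every subset of size 5
# has the same chance of being the sample.
sample = rng.sample(population, k=5)
print(sample)
```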
Stratified Sampling
Divide the population into similar groups (e.g., group students by college)
Randomly sample from those groups (strata)
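One way to sketch stratified sampling in Python: take a simple random sample inside every stratum. The colleges, student names, and the choice of 2 students per stratum are hypothetical.

```python
import random

# Hypothetical strata: students grouped by college (made-up names).
strata = {
    "Agriculture": ["Ana", "Ben", "Cal", "Dee"],
    "Engineering": ["Eli", "Fay", "Gus", "Hal", "Ivy"],
    "Business": ["Jon", "Kim", "Lee"],
}

rng = random.Random(1)  # arbitrary seed for reproducibility

# Draw a simple random sample of 2 students from every stratum,
# so each college is guaranteed to appear in the final sample.
stratified_sample = {
    college: rng.sample(students, k=2) for college, students in strata.items()
}
print(stratified_sample)
```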
Cluster Sampling
Divide the population into clusters (e.g., split Manhattan, KS by street block)
Randomly select whole clusters, then sample the individuals in the chosen clusters
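A sketch of one-stage cluster sampling, assuming we keep every individual in each randomly chosen cluster. The block names and household labels are invented for illustration.

```python
import random

# Hypothetical clusters: street blocks and the households on each one.
blocks = {
    "Block A": ["A1", "A2", "A3"],
    "Block B": ["B1", "B2"],
    "Block C": ["C1", "C2", "C3", "C4"],
    "Block D": ["D1", "D2"],
}

rng = random.Random(3)  # arbitrary seed for reproducibility

# Randomly pick 2 whole blocks (the clusters) ...
chosen_blocks = rng.sample(sorted(blocks), k=2)

# ... then keep every household in the chosen blocks.
cluster_sample = [house for block in chosen_blocks for house in blocks[block]]
print(chosen_blocks, cluster_sample)
```

Unlike stratified sampling, most clusters contribute nothing to the sample here; only the selected blocks are observed.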
Systematic Sampling
Randomly choose a start point in a “lined-up” population
Sample every \(k^{th}\) item
e.g., starting from the \(4^{th}\) batch of ice cream produced on a given day, Call Hall will check the quality of every \(4^{th}\) batch that comes off the production line
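Systematic sampling maps naturally onto list slicing. This sketch assumes a hypothetical day with 40 numbered batches and \(k = 4\); the start point is chosen at random among the first \(k\) batches.

```python
import random

# Hypothetical "lined-up" population: batches numbered 1 to 40.
batches = list(range(1, 41))
k = 4  # sample every 4th batch

rng = random.Random(0)  # arbitrary seed for reproducibility

# Random start point among the first k positions (index 0 to k - 1).
start = rng.randrange(k)

# Take the starting batch and every k-th batch after it.
systematic_sample = batches[start::k]
print(systematic_sample)
```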
Voluntary Response Samples
- Customer support reviews
Sample of Convenience
- Class height
Questions?
Goals for Today:
Introduce the fundamental data and variable types
Present the most common simple graphical summaries
Differentiate between the use-cases of graphical summaries
Visualizing Data
Data
What is data?
Data: Information that has been collected
Individual: Something the information has been collected on (People/Places/Things/etc.)
Variables: Characteristics about the individuals we collected information (data) from
\[ \begin{array}{|c|c|c|c|c|} \hline \text{N} & \text{Age Class} & \text{Weight} & \text{Sex} & \text{Location} \\ \hline 1 & 0.5 & 30.8 & \text{M} & \text{B} \\ \hline 2 & 0.5 & 21.8 & \text{M} & \text{B} \\ \hline 3 & 2.5 & 47.6 & \text{M} & \text{A} \\ \hline 4 & 0.5 & 29.0 & \text{F} & \text{B} \\ \hline 5 & 2.5 & 65.8 & \text{M} & \text{A} \\ \hline \end{array} \]
We collected information on deer
The variables are age class, weight, sex, and location
The values of those variables are called data
\[ \begin{array}{|c|c|c|c|c|c|} \hline \text{N} & \text{Treatment} & \text{% Nitrogen} & \text{Replicate} & \text{Stage} & \text{Plot Location}\\ \hline 1 & \text{E} & 1.29 & 4 & \text{P3} & \text{East} \\ \hline 2 & \text{C} & 2.16 & 3 & \text{P2} & \text{West} \\ \hline 3 & \text{C} & 2.33 & 2 & \text{P2} & \text{West} \\ \hline 4 & \text{B} & 1.46 & 3 & \text{P3} & \text{East} \\ \hline 5 & \text{D} & 2.42 & 4 & \text{P1} & \text{West} \\ \hline 6 & \text{A} & 2.14 & 1 & \text{P2} & \text{East} \\ \hline \end{array} \]
- How many individuals?
- What are the variables?
- What are the data for individual 3?
Variables
What’s the difference between column 2 of the soil nitrogen table, and column 3?
Qualitative (Categorical) variable: The value of the variable represents a descriptive category
Identifying labels/names
We can’t really do math with a label or name
We code these into numbers to fix that
- e.g., Cat-owners = 0 | Dog-owners = 1 | Both = 2
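A small sketch of coding categories into numbers, using the pet-owner codes from the example. The survey responses themselves are made up.

```python
# Hypothetical survey responses (made-up data).
responses = ["Dog", "Cat", "Both", "Dog", "Cat"]

# Codes from the example: Cat-owners = 0, Dog-owners = 1, Both = 2.
# The numbers are labels chosen for convenience; their size carries no meaning.
codes = {"Cat": 0, "Dog": 1, "Both": 2}

coded = [codes[response] for response in responses]
print(coded)  # [1, 0, 2, 1, 0]
```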
Quantitative variable: The value of the variable represents a meaningful number
Height of a person, sales of a product
- We can do math with these
A lot of how we do statistics depends on what data we have.
\[ \begin{array}{|c|c|c|c|c|c|} \hline \text{N} & \text{Treatment} & \text{% Nitrogen} & \text{Replicate} & \text{Stage} & \text{Plot Location}\\ \hline 1 & \text{E} & 1.29 & 4 & \text{P3} & \text{East} \\ \hline 2 & \text{C} & 2.16 & 3 & \text{P2} & \text{West} \\ \hline 3 & \text{C} & 2.33 & 2 & \text{P2} & \text{West} \\ \hline 4 & \text{B} & 1.46 & 3 & \text{P3} & \text{East} \\ \hline 5 & \text{D} & 2.42 & 4 & \text{P1} & \text{West} \\ \hline 6 & \text{A} & 2.14 & 1 & \text{P2} & \text{East} \\ \hline \end{array} \]
How would you organize Column 2?
What about Column 6?
Qualitative variables can be ordinal or nominal
Ordinal variable: Categories/values of the variable have a natural ordering
Letter grade: A, B, C, D
Clothing size: S, M, L
Nominal variable: Categories/values of the variable cannot be ordered naturally
State of residence
Degree program
Quantitative variables can be discrete or continuous
Discrete variable: A countable number of values (0, 1, 2, 3, 4, …)
Number of students in a classroom
Population size of fish in a pond
How many times a coin flip was successfully called
Continuous variable: A continuous range of numbers (0, 0.1, 0.11, 0.111, …)
Temperature
Volume of liquid in a glass
Height/Weight
Quantitative variables can be categorized by level of measurement used for obtaining data values:
Interval level
Differences between values make sense
Ratios don’t make sense because zero does not represent absence of the quantity
Temperature in Celsius/Fahrenheit (Does 0 mean there’s no heat?)
Dates (Is there a meaningful ratio you can make out of 1997 and 2020?)
Ratio level
Numerical measurement
Differences between values make sense
Ratios also make sense
Zero has meaning, it represents absence of the quantity
Height (If you’re 0 inches tall, do you have height? Is there a meaningful percentage difference in height between 64 and 67 inches?)
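A quick numerical check of why ratios fail at the interval level but work at the ratio level. The temperatures 10 and 20 degrees are arbitrary illustrative values; the heights are the 64 and 67 inches from the example.

```python
def celsius_to_fahrenheit(celsius):
    """Convert a Celsius temperature to Fahrenheit."""
    return celsius * 9 / 5 + 32

# Interval level: the "ratio" of two temperatures changes with the unit,
# so "twice as hot" is not a meaningful statement.
ratio_celsius = 20 / 10
ratio_fahrenheit = celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10)
print(ratio_celsius, ratio_fahrenheit)  # 2.0 versus 1.36

# Ratio level: the ratio of two heights is the same in inches and centimeters,
# because zero means "no height" in both units.
ratio_inches = 67 / 64
ratio_cm = (67 * 2.54) / (64 * 2.54)
print(ratio_inches, ratio_cm)
```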
\[ \begin{array}{|c|c|c|c|c|} \hline \text{Sample ID} & \text{pH} & \text{Temperature} \ (^\circ\text{C}) & \text{Colony Count} & \text{Cholera (+/-)} \\ \hline 1 & 6.89 & 23.6 & 41 & - \\ \hline 2 & 7.19 & 21.2 & 79 & - \\ \hline 3 & 6.98 & 22.1 & 55 & + \\ \hline 4 & 7.31 & 20.4 & 49 & - \\ \hline 5 & 7.02 & 22.7 & 96 & + \\ \hline \end{array} \]
Categorize the variables:
- pH
- Temperature \((^\circ\text{C})\)
- Colony Count
- Cholera \((+/-)\)
Communicating with Data
Raw data isn’t entirely useful
Statistics is really good at summarizing and visualizing data
Choosing the “best” graph for displaying our data depends on our data
What kind of data do we have?
Categorical?
Numerical?
What are we trying to do?
Describe our sample?
Look at the distribution of our data?
See how two or more variables are related?
- Bar graph: One or more categorical variables
- Histogram: One numerical variable
- Scatterplot: Two (or more) numerical variables
Summarizing Data
Even when clean, data is messy
Interpreting information is how we make decisions
Every decision we make is data-driven
- Even when it’s “emotional data”
Statistics gives us tools to summarize and interpret data rapidly
Frequency Distribution
\[ \begin{array}{|c|c|c|c|c|} \hline \text{Sophomore} & \text{Sophomore} & \text{Junior} & \text{Junior} & \text{Sophomore} \\ \hline \text{Freshman} & \text{Sophomore} & \text{Senior} & \text{Freshman} & \text{Senior} \\ \hline \end{array} \]
Random Sample of 10 students in a class
- How many variables?
- What type of variable?
Frequency distribution:
Groups data into categories
Records the number of observations that fall into each category
“How frequently does each category occur in my sample?”
Relative frequency distribution
Divide the number in each category by the total number of observations
This gives us the proportion of units in each category
“What percentage of my sample falls into this category?”
\[ \begin{array}{|c|c|c|} \hline \text{Class Level} & \text{Frequency} & \text{Relative Frequency} \\ \hline \text{Freshman} & 2 & 2/10=0.20 \\ \hline \text{Sophomore} & 4 & 4/10=0.40 \\ \hline \text{Junior} & 2 & 2/10=0.20 \\ \hline \text{Senior} & 2 & 2/10=0.20 \\ \hline \text{Total} & 10 & 10/10=1.00 \\ \hline \end{array} \]
Count up how many times each category occurs in the sample
For each category, divide its count by the sample total
\(4\) students are sophomores
\(10\) students total in the sample
\({4 \over 10}=0.40\)
\(0.40 \times 100\% = 40\%\)
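The counting and dividing steps can be sketched in Python with `collections.Counter`, using the ten class levels from the random sample of students listed earlier. The variable names are illustrative.

```python
from collections import Counter

# The ten observations from the random sample of students.
class_levels = [
    "Sophomore", "Sophomore", "Junior", "Junior", "Sophomore",
    "Freshman", "Sophomore", "Senior", "Freshman", "Senior",
]

# Frequency: how many observations fall into each category.
frequency = Counter(class_levels)

# Relative frequency: each count divided by the total number of observations.
n = len(class_levels)
relative_frequency = {level: count / n for level, count in frequency.items()}

for level in ["Freshman", "Sophomore", "Junior", "Senior"]:
    print(f"{level:<10} {frequency[level]:>2}  {relative_frequency[level]:.2f}")
```

The relative frequencies always sum to 1, which is a handy check that no observation was missed.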
How is this useful?
What percentage of the class drinks coffee?
What percentage of the class drinks tea?
What’s our sample?
Population?
What are the variables and variable types?
\[ \begin{array}{|c|c|c|} \hline \text{Drink Preference} & \text{Frequency} & \text{Relative Frequency} \\ \hline \text{Coffee} & & \\ \hline \text{Tea} & & \\ \hline \text{Both} & & \\ \hline \text{Neither} & & \\ \hline \text{Total} & & \\ \hline \end{array} \]
Bar Graphs
Why would I want to graph my data?
It looks good.
Sophomore | Sophomore | Sophomore | Senior | Freshman | Freshman |
Sophomore | Senior | Sophomore | Junior | Sophomore | Senior |
Junior | Senior | Junior | Freshman | Freshman | Freshman |
Junior | Junior | Freshman | Sophomore | Junior | Junior |
Sophomore | Senior | Sophomore | Junior | Sophomore | Freshman |
Sophomore | Junior | Junior | Sophomore | Sophomore | Senior |
One or more categorical variables
- So we use a bar graph
We can also flip this into a horizontal bar graph
- This is useful for when you have longer category names
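As a rough stand-in for a plotting library, this sketch counts the 36 class levels listed above and prints a horizontal bar graph in plain text. The `#` bars and the layout are illustrative choices, not the figure from the slides.

```python
from collections import Counter

# The 36 observations from the table of class levels, row by row.
rows = """
Sophomore Sophomore Sophomore Senior Freshman Freshman
Sophomore Senior Sophomore Junior Sophomore Senior
Junior Senior Junior Freshman Freshman Freshman
Junior Junior Freshman Sophomore Junior Junior
Sophomore Senior Sophomore Junior Sophomore Freshman
Sophomore Junior Junior Sophomore Sophomore Senior
"""
levels = rows.split()
counts = Counter(levels)

# One bar per category; the bar length is the category's frequency.
order = ["Freshman", "Sophomore", "Junior", "Senior"]
bars = [f"{level:<9} | {'#' * counts[level]} ({counts[level]})" for level in order]
print("\n".join(bars))
```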
Pie Charts
Bar graphs can be converted into pie charts:
Generally a pie chart will show relative frequency
- “What’s my piece of the pie?”
They’re very pretty
Not extensively useful
Interpretability is everything
Visualizing Quantitative Data
We’ve looked at some qualitative (categorical) visualizations
What about quantitative (numerical) visualizations?
When we have one quantitative variable we have several options:
Histograms
Stem-and-leaf plots
Dotplots
With two quantitative variables we generally use a scatterplot
We’ll talk about this at length in Chapter 4 (so not important right now)
Side Note: we can use more than two quantitative variables in a scatterplot
It’s not very useful
Why? Can you think in 3 dimensions? What about 4? 5?
Frequency Distribution
- In a study of corn yield under various fertilizers, \(54\) yield values were recorded:
46.4 | 140.8 | 103.2 | 94.0 | 98.8 | 109.3 |
85.2 | 138.2 | 119.0 | 103.1 | 119.8 | 115.8 |
110.2 | 139.4 | 102.7 | 100.2 | 130.7 | 110.7 |
112.7 | 61.4 | 59.2 | 113.1 | 133.1 | 111.0 |
44.4 | 83.3 | 98.1 | 86.6 | 119.0 | 99.8 |
63.8 | 106.4 | 106.3 | 109.4 | 128.0 | 112.9 |
65.4 | 112.8 | 113.4 | 107.2 | 126.2 | 130.2 |
68.1 | 109.1 | 112.6 | 110.6 | 121.0 | 135.5 |
131.0 | 85.6 | 67.6 | 106.4 | 119.2 | 135.2 |
- Variables?
- Variable types?
Quantitative variables can also be summarized with a frequency distribution
Define intervals for the data (each interval is called a class)
Record the number of observations that fall into each class
Class | 40.0-59.9 | 60.0-79.9 | 80.0-99.9 | 100.0-119.9 | 120.0-139.9 | 140.0+ |
Frequency | 3 | 5 | 8 | 26 | 11 | 1 |
There’s no “one” right way to choose the number of classes or the width for a frequency distribution
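A sketch of how the class frequencies above can be computed, assuming classes of width 20 starting at 40.0 with an open-ended top class. The helper name `class_label` is hypothetical.

```python
from collections import Counter

# The 54 recorded corn yields, row by row.
yields = [
    46.4, 140.8, 103.2, 94.0, 98.8, 109.3,
    85.2, 138.2, 119.0, 103.1, 119.8, 115.8,
    110.2, 139.4, 102.7, 100.2, 130.7, 110.7,
    112.7, 61.4, 59.2, 113.1, 133.1, 111.0,
    44.4, 83.3, 98.1, 86.6, 119.0, 99.8,
    63.8, 106.4, 106.3, 109.4, 128.0, 112.9,
    65.4, 112.8, 113.4, 107.2, 126.2, 130.2,
    68.1, 109.1, 112.6, 110.6, 121.0, 135.5,
    131.0, 85.6, 67.6, 106.4, 119.2, 135.2,
]

def class_label(value, width=20):
    """Return the label of the class (interval) that contains value."""
    lower = int(value // width) * width
    if lower >= 140:
        return "140.0+"  # open-ended top class
    return f"{lower}.0-{lower + width - 1}.9"

# Record the number of observations that fall into each class.
class_counts = Counter(class_label(value) for value in yields)
for label in ["40.0-59.9", "60.0-79.9", "80.0-99.9",
              "100.0-119.9", "120.0-139.9", "140.0+"]:
    print(f"{label:<12} {class_counts[label]}")
```

Changing `width` gives a different, equally valid frequency distribution of the same data (the labels assume a whole-number width).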
Histograms
Histogram: visual representation of a frequency distribution
- Not a bar graph
Bar height (y-axis) represents class frequency
Bar width (x-axis) represents class width
This histogram has 6 classes
You can choose a different number of classes
You can choose different widths
Free will exists, there are no rules
It’s not wrong (it’s very unhinged behavior)
- Is this interpretable though?
We care about the shape of our data
This is the primary purpose of a histogram
So we want to not fail at that task
The shape of our data can help us observe the distribution of our data
Symmetric - mirror image on both sides of its center
Unimodal - One peak/hump
Positively-skewed - Long, narrow tail to the right
Negatively-skewed - Long, narrow tail to the left
Uniform - box
Bimodal - two peaks/humps
Histogram showing the GPAs of a sample of students at a certain college
- Which class has the highest frequency?
- How many students were in the sample?
- What percentage of the students had GPAs between 2.0 and 3.0?
- Describe the shape of the above histogram.
One of these histograms represents the age at death from natural causes (heart attack, cancer, etc.)
The other represents the age at death from accidents
Which represents the age at death from accidents?
- Justify your answer
Histograms can be used to summarize both small and large data sets
Sometimes we prefer more detailed visualizations for smaller data sets
Stem-and-leaf plots and dotplots are alternative summaries that display the actual values
Stem-and-leaf plots
Each observation should have at least two digits
The digit furthest to the right is the “leaf”
The digits to the left form the “stem”
The data:
\[ \begin{array}{|c|c|c|c|c|c|c|c|c|c|} \hline 87 & 7 & 95 & 76 & 32 & 28 & 84 & 98 & 93 & 88\\ \hline 78 & 100 & 68 & 76 & 55 & 65 & 42 & 57 & 77 & 96 \\ \hline \end{array} \]
- The stem-and-leaf plot:
\[ \begin{array}{r|l} 0 & 7 \\ 1 & \\ 2 & 8 \\ 3 & 2 \\ 4 & 2 \\ 5 & 5 \; 7 \\ 6 & 5 \; 8 \\ 7 & 6 \; 6 \; 7 \; 8 \\ 8 & 4 \; 7 \; 8 \\ 9 & 3 \; 5 \; 6 \; 8 \\ 10 & 0 \end{array} \]
- Can you describe the shape of the data?
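The stem/leaf split is just integer division by 10. This sketch rebuilds the plot above from the 20 data values; the text layout is an illustrative choice.

```python
from collections import defaultdict

data = [87, 7, 95, 76, 32, 28, 84, 98, 93, 88,
        78, 100, 68, 76, 55, 65, 42, 57, 77, 96]

# The last digit is the leaf; the digits to its left form the stem.
stems = defaultdict(list)
for value in sorted(data):
    stem, leaf = divmod(value, 10)
    stems[stem].append(leaf)

# Print every stem from smallest to largest, including stems with no leaves,
# so that gaps in the data stay visible.
lines = []
for stem in range(min(stems), max(stems) + 1):
    leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
    lines.append(f"{stem:>2} | {leaves}".rstrip())
print("\n".join(lines))
```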
Dotplots
We can represent each observation by a dot above its value on a number line
Data:
\[ \begin{array}{|c|c|c|c|c|} \hline 3 & 6 & 2 & 5 & 1 \\ \hline 2 & 3 & 4 & 3 & 4 \\ \hline \end{array} \]
- Dotplot:
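A plain-text sketch of a dotplot for the data above: one `o` is stacked above each value for every time it occurs. The text layout is an assumption made for illustration, not the original figure.

```python
from collections import Counter

data = [3, 6, 2, 5, 1, 2, 3, 4, 3, 4]
counts = Counter(data)
values = range(min(data), max(data) + 1)

# Build the plot from the top row down: a value gets a dot in a row
# if it occurs at least that many times.
rows = []
for height in range(max(counts.values()), 0, -1):
    row = " ".join("o" if counts[v] >= height else " " for v in values)
    rows.append(row.rstrip())

# The number line goes underneath the dots.
rows.append(" ".join(str(v) for v in values))
print("\n".join(rows))
```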
Time Plots
As a variable changes over time, we can record that change with a time plot
Time should always be on the horizontal scale
Your measured variable should be on the vertical scale
Generally, you want to include points
- There should typically be a line connecting points
How many students were bedridden on Jan 30?