Day 2

Review

Population: the entire collection of individuals about which information is sought.
Sample: a subset of the population, containing the individuals that are actually observed.

Why do we sample?

Simple Random Sample

A sample chosen by a method in which every possible sample of a given size is equally likely to be the one selected

Stratified Sampling

Divide the population into similar groups (e.g., group students by college)
Randomly sample from those groups (strata)

Cluster Sampling

Divide the population into clusters (e.g., split Manhattan, KS by street block)
Randomly sample from the clusters

Systematic Sampling

Randomly choose a start point in a “lined-up” population
Sample every \(k^{th}\) item
e.g., starting from the \(4^{th}\) batch of ice cream produced on a given day, Call Hall will check the quality of every \(4^{th}\) batch that comes off the production line
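
The four random sampling methods above can be sketched in Python. This is a minimal illustration, not a prescribed procedure: the roster of 40 numbered students, the two college strata, the five blocks, and all function names are hypothetical, and the seed is fixed only so the sketch is repeatable.

```python
import random

def simple_random_sample(population, n, rng):
    """Every possible sample of size n is equally likely."""
    return rng.sample(population, n)

def stratified_sample(strata, n_per_stratum, rng):
    """Draw a simple random sample from within every stratum."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    """Randomly pick whole clusters and keep everyone in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: clusters[name] for name in chosen}

def systematic_sample(population, k, rng):
    """Random start among the first k items, then every k-th item."""
    start = rng.randrange(k)
    return population[start::k]

rng = random.Random(2)                  # fixed seed (assumption, for repeatability)
students = list(range(1, 41))           # hypothetical roster of 40 students
strata = {"Agriculture": students[:20], "Engineering": students[20:]}
blocks = {f"block_{i}": students[i * 8:(i + 1) * 8] for i in range(5)}

srs = simple_random_sample(students, 5, rng)
strat = stratified_sample(strata, 3, rng)
clus = cluster_sample(blocks, 2, rng)
syst = systematic_sample(students, 4, rng)
print(srs, strat, clus, syst, sep="\n")
```

Note the difference between the two grouped designs: stratified sampling takes a few individuals from every group, while cluster sampling takes every individual from a few groups.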

Voluntary Response Samples

Customer support reviews

Sample of Convenience

Class height

Questions?

Goals for Today:

Introduce the fundamental data and variable types
Present the most common simple graphical summaries
Differentiate between the use-cases of graphical summaries

Visualizing Data

Data

What is data?

Data: Information that has been collected
Individual: Something the information has been collected on (People/Places/Things/etc.)
Variables: Characteristics about the individuals we collected information (data) from

\[ \begin{array}{|c|c|c|c|c|} \hline \text{N} & \text{Age Class} & \text{Weight} & \text{Sex} & \text{Location} \\ \hline 1 & 0.5 & 30.8 & \text{M} & \text{B} \\ \hline 2 & 0.5 & 21.8 & \text{M} & \text{B} \\ \hline 3 & 2.5 & 47.6 & \text{M} & \text{A} \\ \hline 4 & 0.5 & 29.0 & \text{F} & \text{B} \\ \hline 5 & 2.5 & 65.8 & \text{M} & \text{A} \\ \hline \end{array} \]

We collected information on deer
The variables are age class, weight, sex, and location
The values of those variables are called data

\[ \begin{array}{|c|c|c|c|c|c|} \hline \text{N} & \text{Treatment} & \text{% Nitrogen} & \text{Replicate} & \text{Stage} & \text{Plot Location}\\ \hline 1 & \text{E} & 1.29 & 4 & \text{P3} & \text{East} \\ \hline 2 & \text{C} & 2.16 & 3 & \text{P2} & \text{West} \\ \hline 3 & \text{C} & 2.33 & 2 & \text{P2} & \text{West} \\ \hline 4 & \text{B} & 1.46 & 3 & \text{P3} & \text{East} \\ \hline 5 & \text{D} & 2.42 & 4 & \text{P1} & \text{West} \\ \hline 6 & \text{A} & 2.14 & 1 & \text{P2} & \text{East} \\ \hline \end{array} \]

How many individuals?

What are the variables?

What are the data for individual 3?

Variables

What’s the difference between column 2 of the soil nitrogen table, and column 3?

Qualitative (Categorical) variable: The value of the variable represents a descriptive category

Identifying labels/names
We can’t really do math with a label or name
We code these into numbers to fix that
- e.g., Cat-owners = 0 | Dog-owners = 1 | Both = 2

Quantitative variable: The value of the variable represents a meaningful number

Height of a person, sales of a product
- We can do math with these

A lot of how we do statistics depends on what data we have.

How would you organize Column 2?

What about Column 6?

Qualitative variables can be ordinal or nominal

Ordinal variables: Categories/values of the variable have a natural ordering
- Letter grade: A, B, C, D
- Clothing size: S, M, L

Nominal variable: Categories/values of the variable cannot be ordered naturally
- State of residence
- Degree program

Quantitative variables can be discrete or continuous

Discrete variable: A countable number of values (0, 1, 2, 3, 4, …)
- Number of students in a classroom
- Population size of fish in a pond
- How many times a coin flip was successfully called
Continuous variable: A continuous range of numbers (0, 0.1, 0.11, 0.111, …)
- Temperature
- Volume of liquid in a glass
- Height/Weight

Quantitative variables can be categorized by level of measurement used for obtaining data values:

Interval level

Differences between values make sense
Ratios don’t make sense because zero is arbitrary (it does not represent absence of the quantity)
Temperature in Celsius/Fahrenheit (Does 0 mean there’s no heat?)
Dates (Is there a meaningful ratio you can make out of 1997 and 2020?)

Ratio level

Numerical measurement
Differences between values make sense
Ratios also make sense
Zero has meaning, it represents absence of the quantity
Height (If you’re 0 inches tall, do you have height? Is there a meaningful percentage difference in height between 64 and 67 inches?)
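
A quick worked check of the two levels. The unit conversions are standard facts, not values from these notes, and the temperatures are chosen only for illustration. The same pair of temperatures gives a different ratio in each unit, so the ratio carries no meaning:

\[ \frac{20^\circ\text{C}}{10^\circ\text{C}} = 2 \quad \text{but} \quad \frac{68^\circ\text{F}}{50^\circ\text{F}} = 1.36 \]

The same pair of heights gives the same ratio in any unit, so it is meaningful to say that 67 inches is about 4.7% taller than 64 inches:

\[ \frac{67 \text{ in}}{64 \text{ in}} = \frac{170.18 \text{ cm}}{162.56 \text{ cm}} \approx 1.047 \]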

\[ \begin{array}{|c|c|c|c|c|} \hline \text{Sample ID} & \text{pH} & \text{Temperature} \ (^\circ\text{C}) & \text{Colony Count} & \text{Cholera (+/-)} \\ \hline 1 & 6.89 & 23.6 & 41 & - \\ \hline 2 & 7.19 & 21.2 & 79 & - \\ \hline 3 & 6.98 & 22.1 & 55 & + \\ \hline 4 & 7.31 & 20.4 & 49 & - \\ \hline 5 & 7.02 & 22.7 & 96 & + \\ \hline \end{array} \]

Categorize the variables:

Temperature \((^\circ\text{C})\)

Colony Count

Cholera \((+/-)\)

Communicating with Data

Raw data isn’t entirely useful

	animal	iron	infect	weight	day
129	A06	Iron	NonInfected	305	508
509	A23	Iron	Infected	150	166
471	A21	Iron	Infected	230	380
299	A13	NoIron	Infected	410	781
270	A12	NoIron	Infected	295	599
187	A09	NoIron	Infected	130	166
307	A14	NoIron	Infected	200	296
277	A13	NoIron	Infected	125	122
494	A22	Iron	Infected	200	380
330	A15	NoIron	Infected	205	296

Statistics is really good at summarizing and visualizing data

Choosing the “best” graph for displaying our data depends on our data

What kind of data do we have?
- Categorical?
- Numerical?
What are we trying to do?
- Describe our sample?
- Look at the distribution of our data?
- See how two or more variables are related?

Bar graph: One or more categorical variables

Histogram: One numerical variable

Scatterplot: More than one numerical variable

Summarizing Data

Even when clean, data is messy

Interpreting information is how we make decisions
Every decision we make is data driven
- Even when it’s “emotional data”
Statistics gives us tools to summarize and interpret data rapidly

Frequency Distribution

\[ \begin{array}{|c|c|c|c|c|} \hline \text{Sophomore} & \text{Sophomore} & \text{Junior} & \text{Junior} & \text{Sophomore} \\ \hline \text{Freshman} & \text{Sophomore} & \text{Senior} & \text{Freshman} & \text{Senior} \\ \hline \end{array} \]

Random Sample of 10 students in a class

How many variables?

What type of variable?

Frequency distribution:

Groups data into categories
Records the number of observations that fall into each category
“How frequently does each category occur in my sample?”

Relative frequency distribution

Divide the number in each category by the total number of observations
This gives us the proportion of units in each category
“What percentage of my sample falls into this category?”

\[ \begin{array}{|c|c|c|} \hline \text{Class Level} & \text{Frequency} & \text{Relative Frequency} \\ \hline \text{Freshman} & 2 & 2/10=0.20 \\ \hline \text{Sophomore} & 4 & 4/10=0.40 \\ \hline \text{Junior} & 2 & 2/10=0.20 \\ \hline \text{Senior} & 2 & 2/10=0.20 \\ \hline \text{Total} & 10 & 10/10=1.00 \\ \hline \end{array} \]

Count up how many times each category occurs in the sample

For each category, divide its number of occurrences by the sample total
- \(4\) students are sophomores
- \(10\) students total in the sample
- \({4 \over 10}=0.40\)
- \(0.40 \times 100\% = 40\%\)
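
The same tally can be sketched in Python. This is an illustrative sketch using the ten class levels from the random sample above. The variable names are assumptions made here, and `collections.Counter` is one of several reasonable ways to count categories.

```python
from collections import Counter

# The random sample of 10 students, read row by row from the table above.
sample = ["Sophomore", "Sophomore", "Junior", "Junior", "Sophomore",
          "Freshman", "Sophomore", "Senior", "Freshman", "Senior"]

frequency = Counter(sample)            # count per category
total = len(sample)                    # sample size
relative_frequency = {level: count / total
                      for level, count in frequency.items()}

for level in ["Freshman", "Sophomore", "Junior", "Senior"]:
    print(f"{level:<10} {frequency[level]:>2} {relative_frequency[level]:.2f}")
```

The relative frequencies should always add up to 1, which is a handy check on the counting.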

How is this useful?

What percentage of the class drinks coffee?

What percentage of the class drinks tea?

What’s our sample?
Population?
What are the variables and variable types?

\[ \begin{array}{|c|c|c|} \hline \text{Drink Preference} & \text{Frequency} & \text{Relative Frequency} \\ \hline \text{Coffee} & & \\ \hline \text{Tea} & & \\ \hline \text{Both} & & \\ \hline \text{Neither} & & \\ \hline \text{Total} & & \\ \hline \end{array} \]

Bar Graphs

Why would I want to graph my data?

It looks good.

Sophomore	Sophomore	Sophomore	Senior	Freshman	Freshman
Sophomore	Senior	Sophomore	Junior	Sophomore	Senior
Junior	Senior	Junior	Freshman	Freshman	Freshman
Junior	Junior	Freshman	Sophomore	Junior	Junior
Sophomore	Senior	Sophomore	Junior	Sophomore	Freshman
Sophomore	Junior	Junior	Sophomore	Sophomore	Senior

One or more categorical variables

So we use a bar graph

We can also just flip this to be horizontal

This is useful for when you have longer category names
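
As a rough sketch, a horizontal bar graph of the 36 class levels listed above can be drawn as plain text in Python. The `#` bars and the variable names are choices made for this illustration. A plotting library would normally be used instead.

```python
from collections import Counter

# The 36 class levels from the table above, read row by row.
responses = """
Sophomore Sophomore Sophomore Senior Freshman Freshman
Sophomore Senior Sophomore Junior Sophomore Senior
Junior Senior Junior Freshman Freshman Freshman
Junior Junior Freshman Sophomore Junior Junior
Sophomore Senior Sophomore Junior Sophomore Freshman
Sophomore Junior Junior Sophomore Sophomore Senior
""".split()

counts = Counter(responses)
order = ["Freshman", "Sophomore", "Junior", "Senior"]

# One bar per category; the bar length is the category's frequency.
lines = [f"{level:<9} | {'#' * counts[level]} {counts[level]}"
         for level in order]
print("\n".join(lines))
```

Because the category names sit to the left of the bars, long names stay readable, which is the advantage of the horizontal layout.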

Pie Charts

Bar graphs can be converted into pie charts:

Generally a pie chart will show relative frequency

“What’s my piece of the pie?”

They’re very pretty

Not extensively useful
Interpretability is everything

Visualizing Quantitative Data

We’ve looked at some qualitative (categorical) visualizations

What about quantitative (numerical) visualizations?
When we have one quantitative variable we have several options:
- Histograms
- Stem-and-leaf plots
- Dotplots

With two quantitative variables we generally use a scatterplot

We’ll talk about this at length in Chapter 4 (so not important right now)
Side Note: we can use more than two quantitative variables in a scatterplot
- It’s not very useful
- Why? Can you think in 3 dimensions? What about 4? 5?

Frequency Distribution

In a study of the yield of corn under various fertilizers, \(54\) values were recorded for yields:

46.4	140.8	103.2	94.0	98.8	109.3
85.2	138.2	119.0	103.1	119.8	115.8
110.2	139.4	102.7	100.2	130.7	110.7
112.7	61.4	59.2	113.1	133.1	111.0
44.4	83.3	98.1	86.6	119.0	99.8
63.8	106.4	106.3	109.4	128.0	112.9
65.4	112.8	113.4	107.2	126.2	130.2
68.1	109.1	112.6	110.6	121.0	135.5
131.0	85.6	67.6	106.4	119.2	135.2

Variables?

Variable types?

Quantitative variables can also be summarized with a frequency distribution

Define interval(s) for the data (referred to as classes/class)
Record the number of observations that fall into each class

Class	40.0-59.9	60.0-79.9	80.0-99.9	100.0-119.9	120.0-139.9	140.0+
Frequency	3	5	8	26	11	1

There’s no “one” right way to choose the number of classes or the width for a frequency distribution
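
The class counts above can be reproduced with a short Python sketch. It assumes classes of width 20 starting at 40.0, matching the table. The variable names are illustrative, and the open-ended "140.0+" class is printed here as an ordinary class of the same width.

```python
# The 54 corn yields from the table above, read row by row.
yields = [
    46.4, 140.8, 103.2, 94.0, 98.8, 109.3,
    85.2, 138.2, 119.0, 103.1, 119.8, 115.8,
    110.2, 139.4, 102.7, 100.2, 130.7, 110.7,
    112.7, 61.4, 59.2, 113.1, 133.1, 111.0,
    44.4, 83.3, 98.1, 86.6, 119.0, 99.8,
    63.8, 106.4, 106.3, 109.4, 128.0, 112.9,
    65.4, 112.8, 113.4, 107.2, 126.2, 130.2,
    68.1, 109.1, 112.6, 110.6, 121.0, 135.5,
    131.0, 85.6, 67.6, 106.4, 119.2, 135.2,
]

lower = 40.0   # lower limit of the first class
width = 20.0   # class width (a choice, not a rule)

counts = {}
for y in yields:
    index = int((y - lower) // width)      # which class y falls into
    counts[index] = counts.get(index, 0) + 1

for index in sorted(counts):
    start = lower + index * width
    print(f"{start:.1f}-{start + width - 0.1:.1f}: {counts[index]}")
```

Changing `width` or `lower` gives a different, equally valid frequency distribution of the same data.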

Histograms

Histogram: visual representation of a frequency distribution

Not a bar graph

Bar height (y-axis) represents class frequency

Bar width (x-axis) represents class width

This histogram has 6 classes
- You can choose a different number of classes
- You can choose different widths
- Free will exists, there are no rules

It’s not wrong (it’s very unhinged behavior)
- Is this interpretable though?

We care about the shape of our data
- This is the primary purpose of a histogram
- So we want to not fail at that task
The shape of our data can help us observe the distribution of our data

Symmetric - mirror image on both sides of its center

Unimodal - One peak/hump

Positively-skewed - Long, narrow tail to the right

Negatively-skewed - Long, narrow tail to the left

Uniform - box

Bimodal - two peaks/humps

Histogram showing the GPAs of a sample of students at a certain college

Which class has the highest frequency?

How many students were in the sample?

What percentage of the students had GPAs between 2.0 and 3.0?

Describe the shape of the above histogram.

One of these histograms represents the age at death from natural causes (heart attack, cancer, etc.)

The other represents the age at death from accidents

Which represents the age at death from accidents?
- Justify your answer

Histograms can be used to summarize both small and large data sets
Sometimes we prefer more detailed visualizations for smaller data sets
Stem-and-leaf plots and dotplots are alternative summaries that display the actual values

Stem-and-leaf plots

Each observation should have at least two digits
- The digit furthest to the right is the “leaf”
- The digits to the left form the “stem”
The data:

\[ \begin{array}{|c|c|c|c|c|c|c|c|c|c|} \hline 87 & 7 & 95 & 76 & 32 & 28 & 84 & 98 & 93 & 88\\ \hline 78 & 100 & 68 & 76 & 55 & 65 & 42 & 57 & 77 & 96 \\ \hline \end{array} \]

The stem-and-leaf plot:

\[ \begin{array}{r|l} 0 & 7 \\ 1 & \\ 2 & 8 \\ 3 & 2 \\ 4 & 2 \\ 5 & 5 \; 7 \\ 6 & 5 \; 8 \\ 7 & 6 \; 6 \; 7 \; 8 \\ 8 & 4 \; 7 \; 8 \\ 9 & 3 \; 5 \; 6 \; 8 \\ 10 & 0 \end{array} \]
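
The construction can be sketched in Python: split each value into a stem and a leaf with `divmod(value, 10)`, then list the leaves beside each stem in increasing order. This is a minimal sketch for whole-number data like the values above, and the variable names are illustrative.

```python
from collections import defaultdict

data = [87, 7, 95, 76, 32, 28, 84, 98, 93, 88,
        78, 100, 68, 76, 55, 65, 42, 57, 77, 96]

leaves = defaultdict(list)
for value in sorted(data):             # sorting keeps leaves in order
    stem, leaf = divmod(value, 10)     # 87 -> stem 8, leaf 7
    leaves[stem].append(leaf)

rows = []
# Include every stem between the smallest and largest, even empty ones,
# so gaps in the data stay visible.
for stem in range(min(leaves), max(leaves) + 1):
    row = f"{stem:>2} | " + " ".join(str(leaf) for leaf in leaves[stem])
    rows.append(row.rstrip())
print("\n".join(rows))
```

A single-digit value such as 7 is treated as 07, so it gets stem 0 and leaf 7.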

Can you describe the shape of the data?

Dotplots

We can represent each observation by a dot above its value on a number line
Data:

\[ \begin{array}{|c|c|c|c|c|} \hline 3 & 6 & 2 & 5 & 1 \\ \hline 2 & 3 & 4 & 3 & 4 \\ \hline \end{array} \]

Dotplot:
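
A text version of a dotplot for the data above can be sketched in Python. The `*` marker and the variable names are choices made for this illustration. Each column is a value on the number line, and each `*` is one observation.

```python
from collections import Counter

data = [3, 6, 2, 5, 1, 2, 3, 4, 3, 4]
counts = Counter(data)
values = range(min(data), max(data) + 1)   # the number line

rows = []
# Build the plot from the top row down: a value gets a dot in this row
# if it occurs at least `height` times.
for height in range(max(counts.values()), 0, -1):
    row = " ".join("*" if counts[v] >= height else " " for v in values)
    rows.append(row.rstrip())
rows.append(" ".join(str(v) for v in values))   # axis labels
print("\n".join(rows))
```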

Time Plots

As a variable changes over time, we can record that change with a time plot

Time should always be on the horizontal scale
Your measured variable should be on the vertical scale
Generally, you want to include points
- There should typically be a line connecting points

How many students were bedridden on Jan 30?

Attendance QOTD

Go away