Day 2

Review

  • Population: the entire collection of individuals about which information is sought.

  • Sample: a subset of the population, containing the individuals that are actually observed.


Why do we sample?


Simple Random Sample

  • A sample chosen by a method where every possible sample of a given size is equally likely to be the one selected from the population
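A minimal sketch of drawing a simple random sample in Python, assuming a made-up population of 500 student ID numbers (the population, the sample size of 25, and the seed are illustrative, not from the lecture):

```python
import random

# Hypothetical population: ID numbers for 500 students (not from the lecture).
population = list(range(1, 501))

rng = random.Random(2)  # fixed seed so the sketch is repeatable

# random.sample draws without replacement, so every possible group of
# 25 students has the same chance of being the sample.
srs = rng.sample(population, k=25)

print(sorted(srs))
```

Changing the seed gives a different, equally valid sample.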


Stratified Sampling

  • Divide the population into similar groups (e.g., group students by college)

  • Randomly sample from those groups (strata)
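The two steps above can be sketched in Python. The colleges, their sizes, and the 10% sampling fraction are assumptions made for illustration:

```python
import random

# Hypothetical strata: student IDs grouped by college (made-up sizes).
strata = {
    "Agriculture": [f"AG{i}" for i in range(120)],
    "Engineering": [f"EN{i}" for i in range(200)],
    "Arts and Sciences": [f"AS{i}" for i in range(180)],
}

rng = random.Random(0)
sampling_fraction = 0.10  # assumed: sample 10% of every stratum

# Draw a separate simple random sample inside each stratum.
stratified_sample = {}
for college, students in strata.items():
    n = round(len(students) * sampling_fraction)
    stratified_sample[college] = rng.sample(students, k=n)

for college, chosen in stratified_sample.items():
    print(college, len(chosen))
```

Every college is guaranteed to appear in the sample, which a simple random sample of the whole campus would not guarantee.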


Cluster Sampling

  • Divide the population into clusters (e.g., split Manhattan, KS by street block)

  • Randomly select whole clusters, then observe the individuals in the selected clusters
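A sketch of one common version (one-stage cluster sampling, where everyone in a chosen cluster is observed). The 40 blocks, the household counts, and the choice of 4 blocks are made up:

```python
import random

rng = random.Random(1)

# Hypothetical clusters: 40 street blocks, each with 5 to 15 households.
blocks = {
    f"block_{b}": [f"block_{b}_house_{h}" for h in range(rng.randint(5, 15))]
    for b in range(40)
}

# Step 1: randomly choose whole clusters (here 4 of the 40 blocks).
chosen_blocks = rng.sample(sorted(blocks), k=4)

# Step 2: observe every household inside the chosen blocks.
cluster_sample = [house for b in chosen_blocks for house in blocks[b]]

print(chosen_blocks, len(cluster_sample))
```

Note the contrast with stratified sampling: here the randomness is in which groups are picked, not in who is picked within each group.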


Systematic Sampling

  • Randomly choose a start point in a “lined-up” population

  • Sample every \(k^{th}\) item

  • e.g., starting from the \(4^{th}\) batch of ice cream produced on a given day, Call Hall will check the quality of every \(4^{th}\) batch that comes off the production line
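A sketch of the same idea in Python, assuming a day with 40 batches (the batch count and seed are made up; only the "every 4th batch" rule comes from the example above):

```python
import random

# Hypothetical production line: 40 batches in one day, numbered 1 to 40.
batches = list(range(1, 41))
k = 4  # check every 4th batch

rng = random.Random(3)
start = rng.randint(1, k)  # random start point among the first k batches

# Take the start batch, then every k-th batch after it.
systematic_sample = batches[start - 1::k]

print(start, systematic_sample)
```

Only the start point is random; once it is chosen, the rest of the sample is determined.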


Voluntary Response Samples

  • Customer support reviews


Sample of Convenience

  • Class height



Questions?



Goals for Today:

  1. Introduce the fundamental data and variable types

  2. Present the most common simple graphical summaries

  3. Differentiate between the use-cases of graphical summaries




Visualizing Data


Data

What is data?



  • Data: Information that has been collected

  • Individual: Something the information has been collected on (People/Places/Things/etc.)

  • Variables: Characteristics about the individuals we collected information (data) from

\[ \begin{array}{|c|c|c|c|c|} \hline \text{N} & \text{Age Class} & \text{Weight} & \text{Sex} & \text{Location} \\ \hline 1 & 0.5 & 30.8 & \text{M} & \text{B} \\ \hline 2 & 0.5 & 21.8 & \text{M} & \text{B} \\ \hline 3 & 2.5 & 47.6 & \text{M} & \text{A} \\ \hline 4 & 0.5 & 29.0 & \text{F} & \text{B} \\ \hline 5 & 2.5 & 65.8 & \text{M} & \text{A} \\ \hline \end{array} \]

  • We collected information on deer

  • The variables are age class, weight, sex, and location

  • The values of those variables are called data



\[ \begin{array}{|c|c|c|c|c|c|} \hline \text{N} & \text{Treatment} & \text{% Nitrogen} & \text{Replicate} & \text{Stage} & \text{Plot Location}\\ \hline 1 & \text{E} & 1.29 & 4 & \text{P3} & \text{East} \\ \hline 2 & \text{C} & 2.16 & 3 & \text{P2} & \text{West} \\ \hline 3 & \text{C} & 2.33 & 2 & \text{P2} & \text{West} \\ \hline 4 & \text{B} & 1.46 & 3 & \text{P3} & \text{East} \\ \hline 5 & \text{D} & 2.42 & 4 & \text{P1} & \text{West} \\ \hline 6 & \text{A} & 2.14 & 1 & \text{P2} & \text{East} \\ \hline \end{array} \]


  1. How many individuals?


  2. What are the variables?


  3. What are the data for individual 3?




Variables

What’s the difference between column 2 of the soil nitrogen table, and column 3?



Qualitative (Categorical) variable: The value of the variable represents a descriptive category

  • Identifying labels/names

  • We can’t really do math with a label or name

  • We can code these as numbers for storage and analysis, but the codes are still just labels

    • e.g., Cat-owners = 0 | Dog-owners = 1 | Both = 2
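A small sketch of coding a categorical variable in Python. The three codes come from the example above; the five survey responses are made up:

```python
# Codes from the example above; the responses are hypothetical.
codes = {"Cat-owners": 0, "Dog-owners": 1, "Both": 2}
responses = ["Dog-owners", "Cat-owners", "Both", "Dog-owners", "Cat-owners"]

coded = [codes[r] for r in responses]
print(coded)  # [1, 0, 2, 1, 0]

# Arithmetic on the codes runs, but the result has no interpretation:
# an "average pet category" of 0.8 does not describe any kind of owner.
mean_code = sum(coded) / len(coded)
print(mean_code)
```

Counting how often each code appears is still meaningful, which is why categorical data is summarized with frequencies rather than averages.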


Quantitative variable: The value of the variable represents a meaningful number

  • Height of a person, sales of a product

    • We can do math with these


A lot of how we do statistics depends on what data we have.



\[ \begin{array}{|c|c|c|c|c|c|} \hline \text{N} & \text{Treatment} & \text{% Nitrogen} & \text{Replicate} & \text{Stage} & \text{Plot Location}\\ \hline 1 & \text{E} & 1.29 & 4 & \text{P3} & \text{East} \\ \hline 2 & \text{C} & 2.16 & 3 & \text{P2} & \text{West} \\ \hline 3 & \text{C} & 2.33 & 2 & \text{P2} & \text{West} \\ \hline 4 & \text{B} & 1.46 & 3 & \text{P3} & \text{East} \\ \hline 5 & \text{D} & 2.42 & 4 & \text{P1} & \text{West} \\ \hline 6 & \text{A} & 2.14 & 1 & \text{P2} & \text{East} \\ \hline \end{array} \]


How would you organize Column 2?


What about Column 6?


Qualitative variables can be ordinal or nominal

  • Ordinal variables: Categories/values of the variable have a natural ordering

    • Letter grade: A, B, C, D

    • Clothing size: S, M, L


  • Nominal variable: Categories/values of the variable cannot be ordered naturally

    • State of residence

    • Degree program


Quantitative variables can be discrete or continuous


  • Discrete variable: A countable number of values (0, 1, 2, 3, 4, …)

    • Number of students in a classroom

    • Population size of fish in a pond

    • How many times a coin flip was successfully called

  • Continuous variable: A continuous range of numbers (0, 0.1, 0.11, 0.111, …)

    • Temperature

    • Volume of liquid in a glass

    • Height/Weight


Quantitative variables can be categorized by level of measurement used for obtaining data values:


Interval level

  • Differences between values make sense

  • Ratios don’t make sense because zero is arbitrary (it does not mean “none” of the quantity)

  • Temperature in Celsius/Fahrenheit (Does 0 mean there’s no heat?)

  • Dates (Is there a meaningful ratio you can make out of 1997 and 2020?)


Ratio level

  • Numerical measurement

  • Differences between values make sense

  • Ratios also make sense

  • Zero has meaning: it represents the absence of the quantity

  • Height (If you’re 0 inches tall, do you have height? Is there a meaningful percentage difference in height between 64 and 67 inches?)
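A quick arithmetic check of the interval/ratio distinction, sketched in Python. The 10 and 20 degree readings are made-up values; 64 and 67 inches come from the height bullet above:

```python
def c_to_f(celsius):
    """Convert a Celsius temperature to Fahrenheit."""
    return celsius * 9 / 5 + 32


def inches_to_cm(inches):
    """Convert a length in inches to centimeters."""
    return inches * 2.54


# Interval level: the "ratio" of two temperatures depends on the unit.
ratio_c = 20 / 10                  # 2.0 in Celsius
ratio_f = c_to_f(20) / c_to_f(10)  # 68 / 50 = 1.36 in Fahrenheit

# Ratio level: the ratio of two heights is the same in any unit.
ratio_in = 67 / 64
ratio_cm = inches_to_cm(67) / inches_to_cm(64)

# Percentage difference in height between 64 and 67 inches.
pct_taller = (67 - 64) / 64 * 100  # about 4.7%

print(ratio_c, ratio_f, ratio_in, ratio_cm, pct_taller)
```

"Twice as hot" changes when the unit changes, so it is not a meaningful statement; "about 4.7% taller" holds in inches and in centimeters.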



\[ \begin{array}{|c|c|c|c|c|} \hline \text{Sample ID} & \text{pH} & \text{Temperature} \ (^\circ\text{C}) & \text{Colony Count} & \text{Cholera (+/-)} \\ \hline 1 & 6.89 & 23.6 & 41 & - \\ \hline 2 & 7.19 & 21.2 & 79 & - \\ \hline 3 & 6.98 & 22.1 & 55 & + \\ \hline 4 & 7.31 & 20.4 & 49 & - \\ \hline 5 & 7.02 & 22.7 & 96 & + \\ \hline \end{array} \]


Categorize the variables:

  • pH


  • Temperature \((^\circ\text{C})\)


  • Colony Count


  • Cholera \((+/-)\)





Communicating with Data

Raw data isn’t entirely useful


     animal  iron    infect       weight  day
129  A06     Iron    NonInfected  305     508
509  A23     Iron    Infected     150     166
471  A21     Iron    Infected     230     380
299  A13     NoIron  Infected     410     781
270  A12     NoIron  Infected     295     599
187  A09     NoIron  Infected     130     166
307  A14     NoIron  Infected     200     296
277  A13     NoIron  Infected     125     122
494  A22     Iron    Infected     200     380
330  A15     NoIron  Infected     205     296


Statistics is really good at summarizing and visualizing data


Choosing the “best” graph for displaying our data depends on our data

  • What kind of data do we have?

    • Categorical?

    • Numerical?

  • What are we trying to do?

    • Describe our sample?

    • Look at the distribution of our data?

    • See how two or more variables are related?


  • Bar graph: One or more categorical variables


  • Histogram: One numerical variable


  • Scatterplot: More than one numerical variable



Summarizing Data

Even when clean, data is messy


  • Interpreting information is how we make decisions

  • Every decision we make is data driven

    • Even when it’s “emotional data”
  • Statistics gives us tools to summarize and interpret data rapidly



Frequency Distribution

\[ \begin{array}{|c|c|c|c|c|} \hline \text{Sophomore} & \text{Sophomore} & \text{Junior} & \text{Junior} & \text{Sophomore} \\ \hline \text{Freshman} & \text{Sophomore} & \text{Senior} & \text{Freshman} & \text{Senior} \\ \hline \end{array} \]


Random Sample of 10 students in a class


  • How many variables?


  • What type of variable?


Frequency distribution:

  • Groups data into categories

  • Records the number of observations that fall into each category

  • “How frequently does each category occur in my sample?”


Relative frequency distribution

  • Divide the number in each category by the total number of observations

  • This gives us the proportion of units in each category

  • “What percentage of my sample falls in this category?”


\[ \begin{array}{|c|c|c|} \hline \text{Class Level} & \text{Frequency} & \text{Relative Frequency} \\ \hline \text{Freshman} & 2 & 2/10=0.20 \\ \hline \text{Sophomore} & 4 & 4/10=0.40 \\ \hline \text{Junior} & 2 & 2/10=0.20 \\ \hline \text{Senior} & 2 & 2/10=0.20 \\ \hline \text{Total} & 10 & 10/10=1.00 \\ \hline \end{array} \]


Count up how many times each category occurs in the sample

  • For each category, divide the number of occurrences by the sample total

    • \(4\) students are sophomores

    • \(10\) students total in the sample

    • \(\frac{4}{10}=0.40\)

    • \(0.40 \times 100\%=40\%\)
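The counting and dividing can be sketched in Python using the 10 class levels from the sample above. `Counter` is just one convenient way to tally; nothing here is specific to the course:

```python
from collections import Counter

# The random sample of 10 students shown above.
sample = [
    "Sophomore", "Sophomore", "Junior", "Junior", "Sophomore",
    "Freshman", "Sophomore", "Senior", "Freshman", "Senior",
]

frequency = Counter(sample)  # how many times each category occurs
n = len(sample)              # sample total

# Relative frequency = category count / sample total.
relative_frequency = {level: count / n for level, count in frequency.items()}

for level in ["Freshman", "Sophomore", "Junior", "Senior"]:
    print(f"{level:<10} {frequency[level]:>2}  {relative_frequency[level]:.2f}")
```

A useful self-check: the frequencies should add up to the sample size and the relative frequencies should add up to 1.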


How is this useful?


What percentage of the class drinks coffee?


What percentage of the class drinks tea?


  • What’s our sample?

  • Population?

  • What are the variables and variable types?


\[ \begin{array}{|c|c|c|} \hline \text{Drink Preference} & \text{Frequency} & \text{Relative Frequency} \\ \hline \text{Coffee} & & \\ \hline \text{Tea} & & \\ \hline \text{Both} & & \\ \hline \text{Neither} & & \\ \hline \text{Total} & & \\ \hline \end{array} \]


Bar Graphs

Why would I want to graph my data?




It looks good.


Sophomore Sophomore Sophomore Senior Freshman Freshman
Sophomore Senior Sophomore Junior Sophomore Senior
Junior Senior Junior Freshman Freshman Freshman
Junior Junior Freshman Sophomore Junior Junior
Sophomore Senior Sophomore Junior Sophomore Freshman
Sophomore Junior Junior Sophomore Sophomore Senior


One or more categorical variables

  • So we use a bar graph
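A text-only sketch of a bar graph for the 36 class levels listed above, in plain Python. A plotting library would normally draw this; the `#` bars are only a stand-in:

```python
from collections import Counter

# The 36 class levels listed above, copied row by row.
rows = """
Sophomore Sophomore Sophomore Senior Freshman Freshman
Sophomore Senior Sophomore Junior Sophomore Senior
Junior Senior Junior Freshman Freshman Freshman
Junior Junior Freshman Sophomore Junior Junior
Sophomore Senior Sophomore Junior Sophomore Freshman
Sophomore Junior Junior Sophomore Sophomore Senior
"""

counts = Counter(rows.split())

# One bar per category; bar length is the category frequency.
order = ["Freshman", "Sophomore", "Junior", "Senior"]
for level in order:
    print(f"{level:<10} {'#' * counts[level]} {counts[level]}")
```

The tallest bar is easy to spot at a glance, which is much harder to do from the raw list.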



We can also flip this so the bars run horizontally


  • This is useful for when you have longer category names




Pie Charts

Bar graphs can be converted into pie charts:


Generally a pie chart will show relative frequency

  • “What’s my piece of the pie?”


They’re very pretty

  • Not extensively useful

  • Interpretability is everything


Visualizing Quantitative Data

We’ve looked at some qualitative (categorical) visualizations

  • What about quantitative (numerical) visualizations?

  • When we have one quantitative variable we have several options:

    • Histograms

    • Stem-and-leaf plots

    • Dotplots


With two quantitative variables we generally use a scatterplot

  • We’ll talk about this at length in Chapter 4 (so not important right now)

  • Side Note: we can use more than two quantitative variables in a scatterplot

    • It’s not very useful

    • Why? Can you think in 3 dimensions? What about 4? 5?


Frequency Distribution

  • In a study of corn yield under various fertilizers, \(54\) yield values were recorded:


46.4 140.8 103.2 94.0 98.8 109.3
85.2 138.2 119.0 103.1 119.8 115.8
110.2 139.4 102.7 100.2 130.7 110.7
112.7 61.4 59.2 113.1 133.1 111.0
44.4 83.3 98.1 86.6 119.0 99.8
63.8 106.4 106.3 109.4 128.0 112.9
65.4 112.8 113.4 107.2 126.2 130.2
68.1 109.1 112.6 110.6 121.0 135.5
131.0 85.6 67.6 106.4 119.2 135.2


  • Variables?


  • Variable types?


Quantitative variables can also be summarized with a frequency distribution


  • Define interval(s) for the data (referred to as classes/class)

  • Record the number of observations that fall into each class



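\[ \begin{array}{|c|c|c|c|c|c|c|} \hline \text{Class} & \text{40.0-59.9} & \text{60.0-79.9} & \text{80.0-99.9} & \text{100.0-119.9} & \text{120.0-139.9} & \text{140.0+} \\ \hline \text{Frequency} & 3 & 5 & 8 & 26 & 11 & 1 \\ \hline \end{array} \]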


There’s no “one” right way to choose the number of classes or the width for a frequency distribution
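A sketch of building the class frequencies in Python from the 54 corn yields above. The helper name `class_frequencies` and its parameters are made up for this example; the last class is treated as open-ended, like 140.0+:

```python
corn_yields = [
    46.4, 140.8, 103.2, 94.0, 98.8, 109.3,
    85.2, 138.2, 119.0, 103.1, 119.8, 115.8,
    110.2, 139.4, 102.7, 100.2, 130.7, 110.7,
    112.7, 61.4, 59.2, 113.1, 133.1, 111.0,
    44.4, 83.3, 98.1, 86.6, 119.0, 99.8,
    63.8, 106.4, 106.3, 109.4, 128.0, 112.9,
    65.4, 112.8, 113.4, 107.2, 126.2, 130.2,
    68.1, 109.1, 112.6, 110.6, 121.0, 135.5,
    131.0, 85.6, 67.6, 106.4, 119.2, 135.2,
]


def class_frequencies(data, start, width, n_classes):
    """Count observations per class; the last class is open-ended."""
    counts = [0] * n_classes
    for x in data:
        index = int((x - start) // width)
        index = min(index, n_classes - 1)  # overflow goes in the last class
        counts[index] += 1
    return counts


# Classes of width 20 starting at 40: 40.0-59.9, 60.0-79.9, ..., 140.0+
wide = class_frequencies(corn_yields, start=40, width=20, n_classes=6)
print(wide)  # [3, 5, 8, 26, 11, 1]

# Same data, narrower classes of width 10: also a valid summary.
narrow = class_frequencies(corn_yields, start=40, width=10, n_classes=11)
print(narrow)
```

Both summaries account for all 54 observations; they just show the data at different levels of detail.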



Histograms

Histogram: visual representation of a frequency distribution

  • Not a bar graph


Bar height (y-axis) represents class frequency

Bar width (x-axis) represents class width



  • This histogram has 6 classes

    • You can choose a different number of classes

    • You can choose different widths

    • Free will exists, there are no rules



  • It’s not wrong (it’s very unhinged behavior)

    • Is this interpretable though?


  • We care about the shape of our data

    • This is the primary purpose of a histogram

    • So we want to not fail at that task

  • The shape of our data can help us observe the distribution of our data


Symmetric - mirror image on both sides of its center

Unimodal - One peak/hump



Positively-skewed - Long, narrow tail to the right



Negatively-skewed - Long, narrow tail to the left



Uniform - flat, box-like shape; all classes have roughly the same frequency



Bimodal - two peaks/humps



Histogram showing the GPAs of a sample of students at a certain college



  • Which class has the highest frequency?


  • How many students were in the sample?


  • What percentage of the students had GPAs between 2.0 and 3.0?


  • Describe the shape of the above histogram.



One of these histograms represents the age at death from natural causes (heart attack, cancer, etc.)

The other represents the age at death from accidents



  • Which represents the age at death from accidents?

    • Justify your answer



  • Histograms can be used to summarize both small and large data sets

  • Sometimes we prefer more detailed visualizations for smaller data sets

  • Stem-and-leaf plots and dotplots are alternative summaries that display the actual values



Stem-and-leaf plots

  • Each observation should have at least two digits (a one-digit value such as 7 is treated as 07, so its stem is 0)

    • The digit furthest to the right is the “leaf”

    • The digits to the left form the “stem”

  • The data:

\[ \begin{array}{|c|c|c|c|c|c|c|c|c|c|} \hline 87 & 7 & 95 & 76 & 32 & 28 & 84 & 98 & 93 & 88\\ \hline 78 & 100 & 68 & 76 & 55 & 65 & 42 & 57 & 77 & 96 \\ \hline \end{array} \]

  • The stem-and-leaf plot:

\[ \begin{array}{r|l} 0 & 7 \\ 1 & \\ 2 & 8 \\ 3 & 2 \\ 4 & 2 \\ 5 & 5 \; 7 \\ 6 & 5 \; 8 \\ 7 & 6 \; 6 \; 7 \; 8 \\ 8 & 4 \; 7 \; 8 \\ 9 & 3 \; 5 \; 6 \; 8 \\ 10 & 0 \end{array} \]
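The construction can be sketched in Python using the 20 values above. `divmod(value, 10)` splits each value into stem (tens and above) and leaf (ones digit); this assumes whole-number data:

```python
from collections import defaultdict

data = [87, 7, 95, 76, 32, 28, 84, 98, 93, 88,
        78, 100, 68, 76, 55, 65, 42, 57, 77, 96]

# Group the leaves (last digit) under their stems (remaining digits).
leaves = defaultdict(list)
for value in sorted(data):
    stem, leaf = divmod(value, 10)
    leaves[stem].append(leaf)

# Print every stem from smallest to largest, keeping empty stems
# so that gaps in the data stay visible.
lines = []
for stem in range(min(leaves), max(leaves) + 1):
    row = " ".join(str(leaf) for leaf in leaves.get(stem, []))
    lines.append(f"{stem:>2} | {row}".rstrip())

print("\n".join(lines))
```

Sorting first keeps the leaves in increasing order within each stem, and keeping the empty stem 1 shows the gap between 7 and 28.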


  • Can you describe the shape of the data?



Dotplots

  • We can represent each observation by a dot above its value on a number line

  • Data:

\[ \begin{array}{|c|c|c|c|c|} \hline 3 & 6 & 2 & 5 & 1 \\ \hline 2 & 3 & 4 & 3 & 4 \\ \hline \end{array} \]

  • Dotplot:
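A rough text rendering of the dotplot for the 10 values above, sketched in plain Python (a plotting library would normally draw this; `*` stands in for a dot):

```python
from collections import Counter

data = [3, 6, 2, 5, 1, 2, 3, 4, 3, 4]

counts = Counter(data)
values = list(range(min(data), max(data) + 1))  # the number line: 1 to 6

# Build the plot from the top row down: a value gets a dot in a row
# if it occurs at least that many times.
rows = []
for level in range(max(counts.values()), 0, -1):
    row = " ".join("*" if counts[v] >= level else " " for v in values)
    rows.append(row.rstrip())
rows.append(" ".join(str(v) for v in values))

print("\n".join(rows))
```

Each column of dots sits above its value, so the height of a column is that value's frequency.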




Time Plots


As a variable changes over time, we can record that change with a time plot



  • Time should always be on the horizontal scale

  • Your measured variable should be on the vertical scale

  • Generally, you want to include points

    • There should typically be a line connecting points


How many students were bedridden on Jan 30?




Attendance QOTD


Go away