Statistics as a Language

STAT 240 - Fall 2025

Robert Sholl

Statistics as a Science

The Scientific Method

Where does statistics come into the picture?

\[ \begin{array}{|c|} \hline \text{Observe}\\ \hline \text{Question}\\ \hline \text{Hypothesize}\\ \hline \text{Experiment}\\ \hline \text{Analyze}\\ \hline \text{Conclude}\\ \hline \end{array} \]

Observe and Question

Tsunamis can occur after earthquakes
They tend to happen after high magnitude earthquakes
How high does an earthquake’s magnitude have to be to cause a tsunami?
How high are the tsunami waves from high magnitude earthquakes?

Hypothesize

“On 30 July 2025, at 11:24:52 PETT (23:24:52 UTC, 29 July), a \(M_w\) 8.8 megathrust earthquake struck off the eastern coast of the Kamchatka Peninsula in the Russian Far East, 119 km (74 mi) east-southeast of the coastal city of Petropavlovsk-Kamchatsky.” – Wikipedia

This earthquake was well above the severity needed to cause a tsunami
Higher magnitude earthquakes result in tsunamis with high waves

Experiment

We can’t experiment well with this kind of science
But we can run a sort of “pseudo”-experiment
We select \(100\) tsunamis at random and see if they were the product of an earthquake
We find out that \(68\) of them had some connection to an earthquake

Vocabulary (Get used to it)

Population: the entire collection of individuals about which information is sought.
Sample: a subset of the population, containing the individuals that are actually observed.
Statistics: the study of procedures for collecting, describing, and drawing conclusions from information.
- the science of describing or making inferences about a population, from a sample.

Analyze

Of the 748 tsunamis that have occurred since 1970, 464 of them can be directly associated with an earthquake.

From this data we can say two things
\(62\%\) of all tsunamis since 1970 were caused by earthquakes (\(464\) out of \(748\))
\(68\%\) of the tsunamis in our sample were caused by earthquakes (\(68\) out of \(100\))
- Parameter: a value that describes an entire population.
- Statistic: a value that describes a sample.
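
The two percentages can be checked with a few lines of Python. This is only a sketch using the counts quoted on these slides; the variable names are made up for illustration.

```python
# Counts taken from the slides above.
population_total = 748    # tsunamis since 1970 (the population)
population_quakes = 464   # of those, associated with an earthquake
sample_total = 100        # tsunamis selected in our pseudo-experiment (the sample)
sample_quakes = 68        # of those, connected to an earthquake

parameter = population_quakes / population_total  # describes the whole population
statistic = sample_quakes / sample_total          # describes only the sample

print(f"parameter: {parameter:.2f}")  # parameter: 0.62
print(f"statistic: {statistic:.2f}")  # statistic: 0.68
```

The statistic is close to the parameter, but not equal to it. A different sample of 100 would most likely give a slightly different statistic, while the parameter stays put.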

Sampling Techniques

I hope you’re hungry

Simple Random Sampling

How clean are the campus dining halls? Let’s find out!

We decide to line up all of the dining hall workers at Kramer
Everyone is assigned a number, \(1\) to \(50\)
We randomly select \(10\) numbers
Then swab the hands of every person whose number came up

Simple Random Sampling

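Simple Random Sample (SRS): a sample chosen by a method in which every collection of \(n\) population items is equally likely to make up the sample.
Every worker had the same chance of being selected
- \(1/50\) on any single number drawn, which works out to \(10/50\) of ending up in the sample of \(10\)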
- Note: this only works if we select all \(10\) numbers simultaneously
- We’ll play with that concept more later on
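
A minimal sketch of the Kramer example, assuming Python's standard `random` module. The helper name `simple_random_sample` and the seed value are made up for illustration; the seed is only there so the result can be reproduced.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct items so that every group of n is equally likely."""
    rng = random.Random(seed)
    # random.Random.sample draws without replacement
    return rng.sample(population, n)

workers = list(range(1, 51))  # workers numbered 1 to 50
chosen = simple_random_sample(workers, 10, seed=240)

print(sorted(chosen))  # the 10 workers whose hands get swabbed
```

Because the draw is without replacement, nobody can be picked twice, which matches selecting all \(10\) numbers at once.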

Stratified Sampling

Maybe the \(10\) workers we selected all work the same station together. How can we make sure we get a sample from every station?

Divide up workers by station, \(5\) in total
Assign each group of workers numbers \(1\) through \(10\)
Randomly select \(2\) numbers
Swab the hands of each of those numbers from each group

Stratified Sampling

Stratified Sample: The population is divided into groups, called strata, where the members of each stratum are similar in some way. Then an SRS is drawn from each stratum.
We stratified by station
We defined an SRS for the strata
We sampled from the strata based off of that SRS
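
Here is one way the station example might look in Python. Treat it as a sketch: the station names are hypothetical, and it draws a separate SRS inside each stratum (as in the definition) rather than reusing the same two numbers for every station.

```python
import random

def stratified_sample(strata, n_per_stratum, seed=None):
    """Draw a simple random sample of n_per_stratum from every stratum."""
    rng = random.Random(seed)
    return {
        name: rng.sample(members, n_per_stratum)
        for name, members in strata.items()
    }

# Hypothetical station names; 5 stations with 10 workers each.
stations = ["grill", "salad", "pizza", "deli", "dessert"]
strata = {s: [f"{s}-{i}" for i in range(1, 11)] for s in stations}

sample = stratified_sample(strata, 2, seed=240)
for station, picked in sample.items():
    print(station, picked)
```

Every station is guaranteed to show up in the sample, which a plain SRS of \(10\) workers cannot promise.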

Cluster Sampling

Foodborne outbreaks can also happen as a result of contaminated regions, not just bad worker hygiene.

We split the dining center up into a grid of 5 ft by 5 ft squares
We select \(10\) squares at random and swab the entire square
Each of these squares is called a cluster
Cluster Sampling: Items are drawn from the population in groups, or clusters.
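
A rough sketch of the grid idea in Python. The grid dimensions (8 rows by 12 columns) are an assumption for illustration only; the slides do not say how big the dining center is.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Pick whole clusters at random; everything inside a chosen cluster is sampled."""
    rng = random.Random(seed)
    return rng.sample(clusters, n_clusters)

# Hypothetical floor plan: an 8 x 12 grid of 5 ft by 5 ft squares.
rows, cols = 8, 12
squares = [(r, c) for r in range(rows) for c in range(cols)]

chosen_squares = cluster_sample(squares, 10, seed=240)
print(chosen_squares)  # swab the entire area of each of these squares
```

The randomness is applied to the squares, not to individual surfaces. Once a square is chosen, everything in it gets swabbed.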

Systematic Sampling

At the end of the day, the delivery mechanism of foodborne illness is the food itself.

We decide to sample the food coming off the line directly
It’d be a massive interruption of service to sample every food item coming out
- Plus, we can’t serve food once we’ve sampled it
So we decide to sample every \(10^{th}\) item that comes from any station for possible contaminants

Systematic Sampling

Define a number, denoted \(k\)
Sample the \(k^{th}\) item that’s observed
Repeat the sampling process every \(k\) items that are observed
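
The three steps above can be sketched in Python with a list slice. The list of \(35\) numbered food items is hypothetical, standing in for food coming off the line in order.

```python
def systematic_sample(items, k):
    """Return the k-th item, then every k-th item after it."""
    # Index k - 1 is the k-th item; the step of k repeats the process.
    return items[k - 1::k]

# Hypothetical stream of 35 food items, numbered in the order they come out.
items = list(range(1, 36))

sampled = systematic_sample(items, 10)
print(sampled)  # [10, 20, 30]
```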

Voluntary Response

We design a simple survey that asks the students:
- On a scale of 1 to 10 how clean does Kramer dining center appear?
- Have you ever felt sick after eating at the dining center?
- Do you think the dining center prioritizes food safety adequately?
We send the survey to every student that’s eaten at Kramer in the past year

Samples of Convenience

How does the class feel about Kramer on a scale of 1 to 10?
It’s an easy sample to acquire, we just used who we had
It’s a terrible sampling technique in general
Why?

Data Types

Data

What is data?

Data: Information that has been collected
Individual: Something the information has been collected on (People/Places/Things/etc.)
Variables: Characteristics about the individuals we collected information (data) from

Variables

\[ \scriptsize \begin{array}{|c|c|c|c|c|} \hline \textbf{Location of harvest} & \textbf{Date of harvest} & \textbf{Sex} & \textbf{Age class} & \textbf{Body mass in kg} \\ \hline \text{Boyer} & 2005\text{-}10\text{-}15 & \text{Female} & 3.5 & 34.0 \\ \text{Desoto} & 2004\text{-}12\text{-}12 & \text{Male} & 3.5 & 71.2 \\ \text{Desoto} & 2009\text{-}10\text{-}17 & \text{Male} & 0.5 & 21.8 \\ \text{Desoto} & 2010\text{-}01\text{-}02 & \text{Male} & 0.5 & 19.5 \\ \text{Desoto} & 2005\text{-}12\text{-}11 & \text{Female} & 3.5 & 45.4 \\ \hline \end{array} \]

Not all variables are created equal
What’s the difference between column 1 and column 2?
- What about 3 and 4?

Categorical Variables

Qualitative (Categorical) variable: The value of the variable represents a descriptive category

Identifying labels or names
We can’t really do math with a label or name
We can code them as numbers (“Male” \(= 0\), “Female” \(= 1\)), but the numbers are still just labels

Categorical Variables

Ordinal variables: Categories/values of the variable have a natural ordering
- Letter grade: A, B, C, D, F
- Age classifications: Young, Middle-aged, Old
Nominal variable: Categories/values of the variable cannot be ordered naturally
- Degree program
- Sex: Male / Female

Numeric Variables

Quantitative variable: The value of the variable represents a meaningful number

Body mass, age, height, time
We can inherently do math with these
Can still be problematic due to units

Numeric Variables

Discrete variable: A countable number of values (0, 1, 2, 3, 4, …)
- Population size of fish in a pond
- How many times a coin flip was successfully called
Continuous variable: A continuous range of numbers (0, 0.1, 0.11, 0.111, …)
- Temperature
- Volume of liquid in a glass
- Body Mass

Numeric Variables

Interval level
- Differences between values make sense
- Ratios don’t make sense because zero is arbitrary; it doesn’t represent absence of the quantity
Ratio level
- Differences between values make sense
- Ratios also make sense
- Zero has meaning, it represents absence of the quantity
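
Temperature makes a handy illustration, offered here as an example rather than something from the slides: Celsius is usually treated as interval level (its zero is just the freezing point of water), while Kelvin is ratio level (zero means no thermal energy). The sketch below compares differences and ratios on both scales.

```python
def celsius_to_kelvin(c):
    """Convert a temperature from degrees Celsius to kelvins."""
    return c + 273.15

a_c, b_c = 20.0, 10.0  # two temperatures in Celsius
a_k, b_k = celsius_to_kelvin(a_c), celsius_to_kelvin(b_c)

# Differences agree on both scales (up to floating-point rounding).
print(round(a_c - b_c, 2), round(a_k - b_k, 2))

# Ratios do not agree: 20 C is not "twice as hot" as 10 C.
ratio_c = a_c / b_c
ratio_k = a_k / b_k
print(round(ratio_c, 3), round(ratio_k, 3))
```

The difference of \(10\) degrees survives the change of scale, but the ratio of \(2\) does not. Only the Kelvin ratio reflects how much more thermal energy is actually there.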

Summarizing Data

Frequency Distribution

Statistics is really good at summarizing grotesque amounts of information

Of the \(748\) tsunamis that have occurred since 1970, when do they usually occur?
We can look at the months, since exact date is a little impractical
Instead of a table of raw data, we can just record each time a specific month appears

Frequency Distribution

Frequency distribution:

Groups data into categories
Records the number of observations that fall into each category
“How frequently does each category occur in my sample?”

\[ \begin{array}{|c|c|c|c|c|c|c|} \hline \text{Month} & \text{Jan} & \text{Feb} & \text{Mar} & \text{Apr} & \text{May} & \text{Jun} \\ \hline \text{Frequency} & 64 & 53 & 68 & 62 & 62 & 63\\ \hline \text{Month} & \text{Jul} & \text{Aug} & \text{Sep} & \text{Oct} & \text{Nov} & \text{Dec}\\ \hline \text{Frequency} & 66 & 67 & 58 & 63 & 65 & 57 \\ \hline \end{array} \]

Relative Frequency

Relative frequency distribution

Divide the frequency by the total observations
This gives us the proportion of units in each category
“What percentage of my sample is this category?”

\[ \begin{array}{|c|c|c|c|c|c|c|} \hline \text{Month} & \text{Jan} & \text{Feb} & \text{Mar} & \text{Apr} & \text{May} & \text{Jun} \\ \hline \text{Frequency} & 64 & 53 & 68 & 62 & 62 & 63\\ \hline \text{Relative Frequency} & 0.09 & 0.07 & 0.09 & 0.08 & 0.08 & 0.08\\ \hline \text{Month} & \text{Jul} & \text{Aug} & \text{Sep} & \text{Oct} & \text{Nov} & \text{Dec}\\ \hline \text{Frequency} & 66 & 67 & 58 & 63 & 65 & 57 \\ \hline \text{Relative Frequency} & 0.09 & 0.09 & 0.08 & 0.08 & 0.09 & 0.08 \\ \hline \end{array} \]
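
The relative frequencies in the table can be reproduced with a short Python sketch. The monthly counts are copied from the frequency table; the variable names are made up for illustration.

```python
# Monthly tsunami counts copied from the frequency table.
frequency = {
    "Jan": 64, "Feb": 53, "Mar": 68, "Apr": 62, "May": 62, "Jun": 63,
    "Jul": 66, "Aug": 67, "Sep": 58, "Oct": 63, "Nov": 65, "Dec": 57,
}

total = sum(frequency.values())  # total observations

# Relative frequency = frequency / total, rounded to 2 places like the table.
relative = {month: round(count / total, 2) for month, count in frequency.items()}

print(total)
for month, rel in relative.items():
    print(month, rel)
```

The counts add up to \(748\), the number of tsunamis since 1970. The unrounded relative frequencies sum to \(1\); the rounded ones in the table may be off by a hair.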

Numeric Frequency

We can do the same thing with numeric values
If we wanted to group the tsunamis caused by earthquakes by magnitude of earthquake
We’d need to define some kind of classification for earthquake magnitude
Then we could sort them by that
Let’s say: \(0.0-0.9\), \(1.0-1.9\), …

Numeric Frequency

There were no tsunamis caused by earthquakes less than a magnitude of 4, so we can remove those classes

\[ \begin{array}{|c|c|c|c|c|c|c|} \hline \text{Class} & 4.0-4.9 & 5.0-5.9 & 6.0-6.9 & 7.0-7.9 & 8.0-8.9 & 9.0-9.9 \\ \hline \text{Frequency} & 2 & 22 & 188 & 225 & 25 & 2\\ \hline \text{Relative Frequency} & 0.0043 & 0.0474 & 0.4052 & 0.4849 & 0.0539 & 0.0043\\ \hline \end{array} \]

Same table, same general process, same interpretation
Class definitions are arbitrary but try to make them make sense
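
Sorting magnitudes into classes like \(8.0-8.9\) can be sketched in Python as below. The helper `magnitude_class` is a made-up name, and apart from the \(8.8\) Kamchatka earthquake the magnitudes are invented purely to show the mechanics; they are not real data.

```python
import math
from collections import Counter

def magnitude_class(m):
    """Map a magnitude to its one-unit-wide class, e.g. 8.8 -> '8.0-8.9'."""
    low = math.floor(m)
    return f"{low}.0-{low}.9"

# 8.8 is the Kamchatka earthquake from earlier; the rest are made-up examples.
magnitudes = [8.8, 7.1, 6.4, 7.9, 6.0]

counts = Counter(magnitude_class(m) for m in magnitudes)
for cls in sorted(counts):
    print(cls, counts[cls], counts[cls] / len(magnitudes))
```

Changing the class width only means changing `magnitude_class`; counting and dividing by the total stay the same.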
