STAT 240 - Fall 2025
\[ \begin{array}{|c|} \hline \text{Observe}\\ \hline \text{Question}\\ \hline \text{Hypothesize}\\ \hline \text{Experiment}\\ \hline \text{Analyze}\\ \hline \text{Conclude}\\ \hline \end{array} \]
Tsunamis can occur after earthquakes
They tend to happen after high magnitude earthquakes
How high does an earthquake’s magnitude have to be to cause a tsunami?
How high are the tsunami waves from high magnitude earthquakes?
“On 30 July 2025, at 11:24:52 PETT (23:24:52 UTC, 29 July), a \(M_w\) 8.8 megathrust earthquake struck off the eastern coast of the Kamchatka Peninsula in the Russian Far East, 119 km (74 mi) east-southeast of the coastal city of Petropavlovsk-Kamchatsky.” – Wikipedia
This earthquake was well above the magnitude needed to cause a tsunami
Higher magnitude earthquakes result in tsunamis with high waves
We can’t experiment well with this kind of science
But we can run a sort of “pseudo”-experiment
We select \(100\) tsunamis at random and see if they were the product of an earthquake
We find out that \(68\) of them had some connection to an earthquake
Population: the entire collection of individuals about which information is sought.
Sample: a subset of population, containing the individuals that are actually observed.
Statistics: the study of procedures for collecting, describing, and drawing conclusions from information.
Of the 748 tsunamis that have occurred since 1970, 464 of them can be directly associated with an earthquake.
From this data we can say two things
\(62\%\) of all tsunamis since 1970 were caused by earthquakes
\(68\%\) of the tsunamis in our sample were caused by earthquakes
Parameter: a value that describes an entire population.
Statistic: a value that describes a sample.
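The two percentages above can be reproduced with a little arithmetic. The sketch below is illustrative only: the counts come from the notes, while the function name `sample_proportion` and the seed are assumptions made for the example. It also redraws a few samples of 100 to show that a statistic changes from sample to sample while the parameter stays fixed.

```python
import random

# Population figures quoted in the notes: tsunamis since 1970.
POPULATION_SIZE = 748
EARTHQUAKE_LINKED = 464

# Parameter: describes the whole population.
parameter = EARTHQUAKE_LINKED / POPULATION_SIZE

# Statistic: describes the one sample of 100 from the notes.
statistic = 68 / 100

print(f"parameter = {parameter:.2f}")  # 0.62
print(f"statistic = {statistic:.2f}")  # 0.68


def sample_proportion(n, rng):
    """Draw n tsunamis without replacement; return the share linked to an earthquake."""
    # 1 marks an earthquake-linked tsunami, 0 marks any other cause.
    population = [1] * EARTHQUAKE_LINKED + [0] * (POPULATION_SIZE - EARTHQUAKE_LINKED)
    return sum(rng.sample(population, n)) / n


rng = random.Random(240)  # arbitrary seed, only so reruns match
print([sample_proportion(100, rng) for _ in range(5)])
```

Each value in the final list is a different statistic estimating the same parameter.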
I hope you’re hungry
How clean are the campus dining halls? Let’s find out!
We decide to line up all of the dining hall workers at Kramer
Everyone is assigned a number, \(1\) to \(50\)
We randomly select \(10\) numbers
Then swab the hands of every person whose number came up
Simple Random Sample (SRS): a sample chosen by a method in which each collection of \(n\) population items is equally likely to make up the sample.
Every worker had the same chance of being selected
\(1/50\) on any single draw, so \(10/50\) of ending up in the sample
Note: this only works if we select all \(10\) numbers simultaneously
We’ll play with that concept more later on
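One way to draw the Kramer sample is sketched below, assuming the workers are simply the numbers 1 through 50. Python's `random.sample` picks without replacement, so all 10 numbers are chosen as one set and no worker can come up twice. The helper name `simple_random_sample` is made up for this example.

```python
import random


def simple_random_sample(population, n, rng=random):
    """Return n distinct items; every set of n items is equally likely."""
    return rng.sample(population, n)


workers = list(range(1, 51))  # worker numbers 1..50
chosen = simple_random_sample(workers, 10)
print(sorted(chosen))
```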
Maybe the \(10\) workers we selected all work the same station together. How can we make sure we get a sample from every station?
Divide up workers by station, \(5\) in total
Assign each group of workers numbers \(1\) through \(10\)
Randomly select \(2\) numbers
Swab the hands of each of those numbers from each group
Stratified Sample: The population is divided into groups, called strata, where the members of each stratum are similar in some way. Then an SRS is drawn from each stratum.
We stratified by station
We defined an SRS for the strata
We sampled from the strata based off of that SRS
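The station example can be sketched as follows. The station names are hypothetical placeholders, and `stratified_sample` is a helper invented for this illustration. Note one difference from the steps above: this version draws a separate SRS inside each station, as the definition describes, rather than reusing the same two numbers for every station.

```python
import random


def stratified_sample(strata, n_per_stratum, rng=random):
    """Draw a simple random sample of n_per_stratum items from every stratum."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}


# Hypothetical station names; each station has workers numbered 1..10.
stations = ["grill", "salad", "pizza", "deli", "dessert"]
strata = {s: [f"{s}-{i}" for i in range(1, 11)] for s in stations}

picked = stratified_sample(strata, 2)
for station, workers in picked.items():
    print(station, workers)
```

Either way, every station is guaranteed to appear in the sample, which a plain SRS of 10 workers cannot promise.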
Foodborne outbreaks can also happen as a result of contaminated regions, not just bad worker hygiene.
We split the dining center up into a grid of 5 ft by 5 ft squares
We select \(10\) squares at random and swab the entire square
Each of these squares is called a cluster
Cluster Sampling: Items are drawn from the population in groups, or clusters.
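A rough sketch of the grid example follows. The notes do not give the size of the dining center, so the 8 by 12 grid here is purely an assumption, as is the helper `is_swabbed`. The point is that the random draw picks whole squares, and everything inside a chosen square is observed.

```python
import random

# Assumed floor plan: 8 rows by 12 columns of 5 ft squares (not from the notes).
ROWS, COLS = 8, 12

clusters = [(r, c) for r in range(ROWS) for c in range(COLS)]
selected = random.sample(clusters, 10)  # pick 10 whole squares at random


def is_swabbed(row, col):
    """A spot is swabbed only if its square was one of the selected clusters."""
    return (row, col) in selected


print(sorted(selected))
```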
At the end of the day, the delivery mechanism of foodborne illness is the food itself.
We decide to sample the food coming off the line directly
It’d be a massive interruption of service to sample every food item coming out
So we decide to sample every \(10^{th}\) item that comes from any station for possible contaminants
Define a number, denoted \(k\)
Sample the \(k^{th}\) item that’s observed
Repeat the sampling process every \(k\) items that are observed
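These three steps describe what is commonly called systematic sampling. A minimal sketch, assuming the food items arrive as an ordered stream and using a made-up helper name `systematic_sample`:

```python
from itertools import islice


def systematic_sample(items, k):
    """Return an iterator over the k-th item and every k-th item after it."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    return islice(items, k - 1, None, k)


# Hypothetical stream of plates leaving the line, labeled 1..35.
plates = range(1, 36)
print(list(systematic_sample(plates, 10)))  # [10, 20, 30]
```

Because `islice` works lazily, the same function would also work on a stream whose length is not known in advance.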
We design a simple survey that asks the students:
On a scale of 1 to 10 how clean does Kramer dining center appear?
Have you ever felt sick after eating at the dining center?
Do you think the dining center prioritizes food safety adequately?
We send the survey to every student that’s eaten at Kramer in the past year
How does the class feel about Kramer on a scale of 1 to 10?
It’s an easy sample to acquire; we just used who we had
It’s a terrible sampling technique in general
Why?
What is data?
Data: Information that has been collected
Individual: Something the information has been collected on (People/Places/Things/etc.)
Variables: Characteristics about the individuals we collected information (data) from
\[ \scriptsize \begin{array}{|c|c|c|c|c|} \hline \textbf{Location of harvest} & \textbf{Date of harvest} & \textbf{Sex} & \textbf{Age class} & \textbf{Body mass in kg} \\ \hline \text{Boyer} & 2005\text{-}10\text{-}15 & \text{Female} & 3.5 & 34.0 \\ \text{Desoto} & 2004\text{-}12\text{-}12 & \text{Male} & 3.5 & 71.2 \\ \text{Desoto} & 2009\text{-}10\text{-}17 & \text{Male} & 0.5 & 21.8 \\ \text{Desoto} & 2010\text{-}01\text{-}02 & \text{Male} & 0.5 & 19.5 \\ \text{Desoto} & 2005\text{-}12\text{-}11 & \text{Female} & 3.5 & 45.4 \\ \hline \end{array} \]
Not all variables are held equal
What’s the difference between column 1 and column 2?
Qualitative (Categorical) variable: The value of the variable represents a descriptive category
Identifying labels or names
We can’t really do math with a label or name
“Male” \(= 0\), “Female” \(= 1\)
Ordinal variables: Categories/values of the variable have a natural ordering
Letter grade: A, B, C, D, F
Age classifications: Young, Middle-aged, Old
Nominal variable: Categories/values of the variable cannot be ordered naturally
Degree program
Sex: Male / Female
Quantitative variable: The value of the variable represents a meaningful number
Body mass, age, height, time
We can inherently do math with these
Can still be problematic due to units
Discrete variable: A countable number of values (0, 1, 2, 3, 4, …)
Population size of fish in a pond
How many times a coin flip was successfully called
Continuous variable: A continuous range of numbers (0, 0.1, 0.11, 0.111, …)
Temperature
Volume of liquid in a glass
Body Mass
Interval level
Differences between values make sense
Ratios don’t make sense because zero is an arbitrary reference point, not the absence of the quantity
Ratio level
Differences between values make sense
Ratios also make sense
Zero has meaning, it represents absence of the quantity
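A quick numeric check can make the distinction concrete. The example below is an illustration added here, not taken from the notes: it treats temperature in Celsius as interval level and body mass in kg (two values from the harvest table) as ratio level, and the helper names `c_to_f` and `kg_to_lb` are made up.

```python
def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


def kg_to_lb(kg):
    """Convert kilograms to pounds (approximate conversion factor)."""
    return kg * 2.20462


# Interval level: zero degrees Celsius is an arbitrary reference point.
print(20 / 10)                  # "twice as hot" in Celsius...
print(c_to_f(20) / c_to_f(10))  # ...but not in Fahrenheit, so the ratio is meaningless
print(c_to_f(20) - c_to_f(10))  # the difference is still the same gap, just rescaled

# Ratio level: 0 kg means no mass, so ratios survive a change of units.
print(71.2 / 34.0)
print(kg_to_lb(71.2) / kg_to_lb(34.0))
```

The temperature ratio depends on the unit chosen, while the mass ratio does not.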
Statistics is really good at summarizing grotesque amounts of information
Of the \(748\) tsunamis that have occurred since 1970, when do they usually occur?
We can look at the months, since exact date is a little impractical
Instead of a table of raw data, we can just record each time a specific month appears
Frequency distribution:
Groups data into categories
Records the number of observations that fall into each category
“How frequently does each category occur in my sample?”
\[ \begin{array}{|c|c|c|c|c|c|c|} \hline \text{Month} & \text{Jan} & \text{Feb} & \text{Mar} & \text{Apr} & \text{May} & \text{Jun} \\ \hline \text{Frequency} & 64 & 53 & 68 & 62 & 62 & 63\\ \hline \text{Month} & \text{Jul} & \text{Aug} & \text{Sep} & \text{Oct} & \text{Nov} & \text{Dec}\\ \hline \text{Frequency} & 66 & 67 & 58 & 63 & 65 & 57 \\ \hline \end{array} \]
Relative frequency distribution
Divide the frequency by the total observations
This gives us the proportion of units in each category
“What percentage of my sample is this category?”
\[ \begin{array}{|c|c|c|c|c|c|c|} \hline \text{Month} & \text{Jan} & \text{Feb} & \text{Mar} & \text{Apr} & \text{May} & \text{Jun} \\ \hline \text{Frequency} & 64 & 53 & 68 & 62 & 62 & 63\\ \hline \text{Relative Frequency} & 0.09 & 0.07 & 0.09 & 0.08 & 0.08 & 0.08\\ \hline \text{Month} & \text{Jul} & \text{Aug} & \text{Sep} & \text{Oct} & \text{Nov} & \text{Dec}\\ \hline \text{Frequency} & 66 & 67 & 58 & 63 & 65 & 57 \\ \hline \text{Relative Frequency} & 0.09 & 0.09 & 0.08 & 0.08 & 0.09 & 0.08 \\ \hline \end{array} \]
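The relative frequencies in the table can be recomputed directly from the counts. A short sketch, using the monthly frequencies as given (the variable names are arbitrary):

```python
# Monthly tsunami counts copied from the frequency table.
frequency = {
    "Jan": 64, "Feb": 53, "Mar": 68, "Apr": 62, "May": 62, "Jun": 63,
    "Jul": 66, "Aug": 67, "Sep": 58, "Oct": 63, "Nov": 65, "Dec": 57,
}

total = sum(frequency.values())  # 748 observations in all
relative = {month: count / total for month, count in frequency.items()}

for month, share in relative.items():
    print(f"{month}: {share:.2f}")
```

Before rounding, the relative frequencies always add up to 1, which is a handy check on the arithmetic.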
We can do the same thing with numeric values
If we wanted to group the tsunamis caused by earthquakes by magnitude of earthquake
We’d need to define some kind of classification for earthquake magnitude
Then we could sort them by that
Let’s say: \(0.0-0.9\), \(1.0-1.9\), …
There were no tsunamis caused by earthquakes less than a magnitude of 4, so we can remove those classes
\[ \begin{array}{|c|c|c|c|c|c|c|} \hline \text{Class} & 4.0-4.9 & 5.0-5.9 & 6.0-6.9 & 7.0-7.9 & 8.0-8.9 & 9.0-9.9 \\ \hline \text{Frequency} & 2 & 22 & 188 & 225 & 25 & 2\\ \hline \text{Relative Frequency} & 0.0043 & 0.0474 & 0.4052 & 0.4849 & 0.0539 & 0.0043\\ \hline \end{array} \]
Same table, same general process, same interpretation
Class definitions are arbitrary but try to make them make sense
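Sorting numeric values into classes can be sketched as below. The magnitudes in the list are made up for illustration and are not the real tsunami data; the helper `magnitude_class` is also an assumption, and it presumes magnitudes are reported to one decimal place so that every value lands in exactly one class.

```python
import math
from collections import Counter


def magnitude_class(magnitude):
    """Label a magnitude with its one-unit class, e.g. 7.3 -> '7.0-7.9'."""
    low = math.floor(magnitude)
    return f"{low}.0-{low}.9"


# Made-up magnitudes for illustration only; not the real tsunami data.
magnitudes = [6.4, 7.1, 7.8, 8.8, 6.9, 5.2, 7.0]

counts = Counter(magnitude_class(m) for m in magnitudes)
total = sum(counts.values())
for label in sorted(counts):
    # class, frequency, relative frequency
    print(label, counts[label], round(counts[label] / total, 4))
```

Changing the class width only means changing how `magnitude_class` builds its label; the counting step stays the same.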