Day 3
Review
What is data?
\[ \begin{array}{|c|c|c|c|c|} \hline \text{N} & \text{Age Class} & \text{Weight} & \text{Sex} & \text{Location} \\ \hline 1 & 0.5 & 30.8 & \text{M} & \text{B} \\ \hline 2 & 0.5 & 21.8 & \text{M} & \text{B} \\ \hline 3 & 2.5 & 47.6 & \text{M} & \text{A} \\ \hline 4 & 0.5 & 29.0 & \text{F} & \text{B} \\ \hline 5 & 2.5 & 65.8 & \text{M} & \text{A} \\ \hline \end{array} \]
How many individuals?
How many variables?
Qualitative (Categorical) variable: The value of the variable represents a descriptive category
Quantitative variable: The value of the variable represents a meaningful number
Qualitative variables can be ordinal or nominal
- Ordinal variables: Categories/values of the variable have a natural ordering
- Nominal variable: Categories/values of the variable cannot be ordered naturally
Quantitative variables can be discrete or continuous
- Discrete variable: A countable number of values (0, 1, 2, 3, 4, …)
- Continuous variable: A continuous range of numbers (0, 0.1, 0.11, 0.111, …)
Quantitative variables can be categorized by level of measurement used for obtaining data values:
Interval level
Differences between values make sense
Ratios don’t make sense because zero is arbitrary, it does not represent absence of the quantity
Ratio level
Numerical measurement
Differences between values make sense
Ratios also make sense
Zero has meaning, it represents absence of the quantity
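One way to see the difference between the two levels is to change units and check whether ratios survive. The sketch below assumes temperature in degrees Celsius as the interval-level example and two weights from the review table as the ratio-level example. The kilogram unit, the conversion factor, and the helper names are assumptions made for illustration; the slides do not state them.

```python
def celsius_to_fahrenheit(c):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32

def kg_to_lb(kg):
    """Convert kilograms to pounds (approximate factor, assumed here)."""
    return kg * 2.20462

# Interval level: zero degrees is an arbitrary reference point,
# so the ratio of two temperatures changes with the unit.
ratio_c = 20 / 10
ratio_f = celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10)

# Ratio level: zero weight means "no weight" in every unit,
# so the ratio of two weights is the same in kg and in lb.
ratio_kg = 47.6 / 21.8
ratio_lb = kg_to_lb(47.6) / kg_to_lb(21.8)

print(f"Temperature ratio: {ratio_c:.2f} in Celsius, {ratio_f:.2f} in Fahrenheit")
print(f"Weight ratio: {ratio_kg:.3f} in kg, {ratio_lb:.3f} in lb")
```

The temperature ratio depends on the unit, so "twice as hot" is not a meaningful statement. The weight ratio does not, so "twice as heavy" is.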
Graphics
Why would I graph anything?
When should I use a bar graph?
What about a pie chart?
Histogram: visual representation of a frequency distribution
- Not a bar graph
Bar height (y-axis) represents class frequency
Bar width (x-axis) represents class width
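The bar heights of a histogram are just class frequencies, which can be counted by hand or with a few lines of code. The sketch below uses the five weights from the review table. The class start of 20 and class width of 10 are assumptions picked for illustration, and `class_frequencies` is a hypothetical helper.

```python
import math

# Weights from the review table
weights = [30.8, 21.8, 47.6, 29.0, 65.8]

# Assumed for illustration: classes start at 20 and are 10 units wide
class_start = 20
class_width = 10

def class_frequencies(values, start, width):
    """Count how many values fall in each class [lower, lower + width)."""
    n_classes = math.floor((max(values) - start) / width) + 1
    counts = [0] * n_classes
    for v in values:
        counts[math.floor((v - start) / width)] += 1
    return counts

frequencies = class_frequencies(weights, class_start, class_width)

# Print a sideways text histogram: one row per class, one '#' per observation
for k, count in enumerate(frequencies):
    lower = class_start + k * class_width
    print(f"[{lower}, {lower + class_width}): {'#' * count} ({count})")
```

A different class width gives different bar heights for the same data, which is why the choice of classes affects the shape we see.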
We care about the shape of our data
This is the primary purpose of a histogram
So we want to not fail at that task
The shape of our data can help us observe the distribution of our data
Symmetric - mirror image on both sides of its center
Unimodal - One peak/hump
Positively-skewed - Long, narrow tail to the right
Negatively-skewed - Long, narrow tail to the left
Uniform - box
Bimodal - two peaks/humps
Histograms can be used to summarize both small and large data sets
Sometimes we prefer more detailed visualizations for smaller data sets
Stem-and-leaf plots and dotplots are alternative summaries that display the actual values
Stem-and-leaf plots
Each observation should have at least two digits
The digit furthest to the right is the “leaf”
The digits to the left form the “stem”
The data:
\[ \begin{array}{|c|c|c|c|c|c|c|c|c|c|} \hline 87 & 7 & 95 & 76 & 32 & 28 & 84 & 98 & 93 & 88\\ \hline 78 & 100 & 68 & 76 & 55 & 65 & 42 & 57 & 77 & 96 \\ \hline \end{array} \]
- The stem-and-leaf plot:
\[ \begin{array}{r|l} 0 & 7 \\ 1 & \\ 2 & 8 \\ 3 & 2 \\ 4 & 2 \\ 5 & 5 \; 7 \\ 6 & 5 \; 8 \\ 7 & 6 \; 6 \; 7 \; 8 \\ 8 & 4 \; 7 \; 8 \\ 9 & 3 \; 5 \; 6 \; 8 \\ 10 & 0 \end{array} \]
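The stem-and-leaf rules above can be sketched in code as a check on the plot. This is a minimal sketch for non-negative whole numbers, where the leaf is the last digit and the stem is everything to its left. `stem_and_leaf` is a hypothetical helper name.

```python
from collections import defaultdict

# The 20 observations from the table above
data = [87, 7, 95, 76, 32, 28, 84, 98, 93, 88,
        78, 100, 68, 76, 55, 65, 42, 57, 77, 96]

def stem_and_leaf(values):
    """Map each stem (all digits but the last) to its sorted leaves (last digit)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        plot[stem].append(leaf)
    # Include empty stems so gaps in the data stay visible
    return {s: plot.get(s, []) for s in range(min(plot), max(plot) + 1)}

plot = stem_and_leaf(data)
for stem, leaves in plot.items():
    print(f"{stem:>2} | {' '.join(str(leaf) for leaf in leaves)}")
```

Note that the single-digit value 7 is treated as 07, so its stem is 0 and its leaf is 7.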
Dotplots
We can represent each observation by a dot above its value on a number line
Data:
\[ \begin{array}{|c|c|c|c|c|} \hline 3 & 6 & 2 & 5 & 1 \\ \hline 2 & 3 & 4 & 3 & 4 \\ \hline \end{array} \]
- Dotplot:
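A text version of a dotplot for the data above can be sketched as follows. Each `*` stands for one observation stacked above its value on the number line. `dotplot_rows` is a hypothetical helper, and the sketch assumes whole-number data.

```python
from collections import Counter

# The 10 observations from the table above
data = [3, 6, 2, 5, 1, 2, 3, 4, 3, 4]

def dotplot_rows(values):
    """Return the lines of a text dotplot, top row first, axis labels last."""
    counts = Counter(values)
    axis = range(min(values), max(values) + 1)
    rows = []
    # Build from the tallest stack down to the bottom row of dots
    for level in range(max(counts.values()), 0, -1):
        rows.append(" ".join("*" if counts[x] >= level else " " for x in axis))
    rows.append(" ".join(str(x) for x in axis))
    return rows

rows = dotplot_rows(data)
print("\n".join(rows))
```

The tallest stack sits above 3, the value that occurs most often in this data set.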
Questions?
Goals for Today:
Describe and calculate Mean, Median, and Mode
Recognize the use and importance of Statistical/Mathematical notation
Numerical Summaries of Data
Measures of Center
Graphics are good for taking data and making it easier to view
Numerical summaries are how we take data and make it easier to understand
Statistical Inference in a nutshell
“Using sample statistics to describe population parameters”
Which way is this histogram skewed?
How would you describe the “center” of this data?
Mean or average: Balance point (fulcrum) of the dataset
Median: Half of the data points are above the median, half are below
Mode: Where the peak is
Mean
Sum all of the data then divide by the number of observations
- Most commonly used metric for summarizing data
\[ \begin{array}{|c|c|c|c|c|} \hline 7 & 3 & 12 & 3 & 5 \\ \hline \end{array} \]
\[\text{Mean} \ = \ {7+3+12+3+5 \over 5} = {30 \over 5} = 6\]
If the data we calculated a mean for comes from a sample:
- Sample mean
If the data we calculated a mean for comes from a population:
- Population mean
Mathematical Notation Soapbox
“Letter math”
I resisted it forever
But trust me, it does end up being helpful
Data values can be denoted as \(x_1,x_2,x_3,...\)
\(x_1\) refers to the observed value of the variable \(x\) from individual 1
\(x\) can be anything
It doesn’t even have to be \(x\)
It’s convention, not law
Sample size (the number of individuals in the sample)
- Denoted with \(n\) (Note: lower-case)
Population size
- Denoted with \(N\) (Note: capital)
Summation
This is referring to the sum (addition) of everything contained in the expression
We denote this with the Greek letter \(\Sigma\)
With this notation we can describe:
\[\sum\limits_{i=1}^nx_i=x_1+x_2+...+x_n\]
- “The sum of \(x_i\) from \(i=1\) to \(n\)”
With sigma notation we can express the sample and population mean formulas:
- Sample mean (denoted \(\bar{x}\)):
\[\bar{x}={1 \over n}\sum\limits_{i=1}^nx_i\]
- Population mean (denoted \(\mu\)):
\[\mu={1 \over N}\sum\limits_{i=1}^Nx_i\]
Greek letters usually mean population parameters
Latin (Roman) letters usually mean sample statistics
In practice:
\[ \begin{array}{|c|c|c|c|c|c|c|c|c|c|c|} \hline \text{Student} & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 \\ \hline \text{Absences} & 2 & 6 & 1 & 2 & 4 & 0 & 1 & 3 & 0 & 2 \\ \hline \end{array} \]
\[\bar{x}={1 \over n}\sum\limits_{i=1}^nx_i\]
\[={1 \over 10}(x_1+x_2+x_3+x_4+x_5+x_6+x_7+x_8+x_9+x_{10})\]
\[={1 \over 10}(2+6+1+2+4+0+1+3+0+2)\]
\[={1 \over 10} \times 21 = {21 \over 10} = 2.1\]
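The same calculation can be sketched in code, with each line matching one piece of the sigma notation. The variable names are chosen here to mirror the symbols and are not part of the slides.

```python
# Absences for students 1 through 10, from the table above
absences = [2, 6, 1, 2, 4, 0, 1, 3, 0, 2]

n = len(absences)      # sample size, n
total = sum(absences)  # the summation of x_i for i = 1, ..., n
x_bar = total / n      # sample mean: (1/n) times the sum

print(f"n = {n}, sum = {total}, sample mean = {x_bar}")
```

Here `sum(absences)` plays the role of \(\Sigma\), and dividing by `n` is the \({1 \over n}\) in front of it.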
Properties of the mean:
Common
Easy to interpret
Susceptible to outliers
The average number of Super Bowl rings between me and Tom Brady is \(3.5\)
(As of 2021) the top \(1\%\) of households in the United States hold \(32.3\%\) of the country’s wealth, while the bottom \(50\%\) hold \(2.6\%\)
A statistic is resistant if its value is not affected heavily by outliers
Is the mean resistant?
Median
Middle value, half the data are below and half are above
\[ \begin{array}{|c|c|c|c|c|} \hline 7 & 3 & 12 & 3 & 5 \\ \hline \end{array} \]
Sort your data in increasing order (low to high)
\[ \begin{array}{|c|c|c|c|c|} \hline 3 & 3 & \boldsymbol{5} & 7 & 12 \\ \hline \end{array} \]
The median is 5
If \(n\) is odd: choose position \({(n+1)\over2}\) in the ordered dataset
- So \(n=5\)
\[{(n+1)\over2}={(5+1)\over2}=3\]
We pick the \(3^{rd}\) data point after sorting
\[ \begin{array}{|c|c|c|c|c|c|} \hline 7 & 3 & 12 & 3 & 5 & 8\\ \hline \end{array} \]
If \(n\) is even, after ordering:
Pick positions \({n \over 2}\) and \({n \over 2}+1\)
Average the two data points
\[ \begin{array}{|c|c|c|c|c|c|} \hline 3 & 3 & \boldsymbol{5} & \boldsymbol{7} & 8 & 12 \\ \hline \end{array} \]
\[n=6\]
\[{6\over 2}=3 \ ,\ {6 \over 2}+1=4\]
3rd data point: \(5\)
4th data point: \(7\)
\[{(5+7)\over 2}=6\]
The median is \(6\)
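The odd and even rules can be sketched as one small function. This is a minimal sketch that assumes a non-empty list of numbers. The function is written out by hand to mirror the position formulas, and list positions are shifted by one because Python counts from 0.

```python
def median(values):
    """Median by the position rules: sort, then pick or average the middle."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: position (n + 1) / 2 in the ordered data
        return ordered[(n + 1) // 2 - 1]
    # Even n: average positions n / 2 and n / 2 + 1
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2

odd_example = median([7, 3, 12, 3, 5])       # n = 5
even_example = median([7, 3, 12, 3, 5, 8])   # n = 6

print(f"Median (n = 5): {odd_example}")
print(f"Median (n = 6): {even_example}")
```

Sorting comes first in both cases; picking the middle position of unsorted data is a common mistake.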
Properties of median:
It doesn’t use all of the data directly
This makes it resistant
- Outliers have little/no effect
Sometimes a more realistic measurement:
Median Household Income (Kansas): \(\$57,422\)
Average Household Income (Kansas): \(\$77,509\)
- Why does median make more sense than average here?
The difference between the median and the mean depends on the skew of the histogram
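Resistance can be checked directly by changing one value into an outlier and recomputing both statistics. The sketch below uses the five-value data set from the median example. The replacement value of 120 is a hypothetical outlier chosen for illustration.

```python
from statistics import mean, median

data = [3, 3, 5, 7, 12]
with_outlier = [3, 3, 5, 7, 120]  # hypothetical: 12 replaced by 120

mean_before, mean_after = mean(data), mean(with_outlier)
median_before, median_after = median(data), median(with_outlier)

print(f"Mean:   {mean_before} -> {mean_after}")
print(f"Median: {median_before} -> {median_after}")
```

The mean is pulled toward the outlier because every value enters the sum, while the median only depends on which value sits in the middle.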
Mode
The most frequent observation
Useful for qualitative data
- “Which species of Salmonella is most commonly growing in my flour?”
Not as useful for quantitative data
“What’s the most common weight of cattle on our research farms?”
- Why isn’t this a helpful metric?
A data set can have any number of modes \((0,1,2,...)\)
\[ \begin{array}{|c|c|c|c|c|} \hline 3 & 3 & 5 & 7 & 12 \\ \hline \end{array} \]
The most frequently observed value in this data set is \(3\)
- The mode is \(3\)
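Finding the mode amounts to counting how often each value occurs. The sketch below assumes the convention that a data set with no repeated value has no mode, which matches the "0 modes" case above; some software instead reports every value as a mode in that situation. `modes` is a hypothetical helper, and the two-mode data set is made up for illustration.

```python
from collections import Counter

def modes(values):
    """Return all most-frequent values, or [] if no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []  # assumed convention: no repeated value means no mode
    return sorted(v for v, c in counts.items() if c == top)

one_mode = modes([3, 3, 5, 7, 12])             # the data set above
no_mode = modes([1, 2, 3])                     # hypothetical: nothing repeats
two_modes = modes([1, 1, 2, 2, 3])             # hypothetical: a tie
sex_mode = modes(["M", "M", "M", "F", "M"])    # Sex column of the review table

print(one_mode, no_mode, two_modes, sex_mode)
```

The last call shows why the mode suits qualitative data: it needs counts only, not arithmetic on the values.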
The below data set displays a sample of \(n=7\) observations:
\[ \begin{array}{|c|c|c|c|c|c|c|} \hline 2 & 1 & 3 & 4 & 3 & 5 & 4 \\ \hline \end{array} \]
- Find the mean:
- Find the median:
- Find the mode(s):
Attendance QOTD
\[ \begin{array}{|c|c|c|c|c|c|} \hline \text{Age} & \text{Sex} & \text{Body Temperature} & \text{Serum Cholesterol} & \text{Chest Pain Type} & \text{Max Heart Rate} \\ \hline 34 & \text{M} & 98.7 & 182 & 3 & 174 \\ \hline 63 & \text{M} & 97.8 & 233 & 3 & 150 \\ \hline 37 & \text{M} & 97.6 & 250 & 2 & 187 \\ \hline 41 & \text{F} & 98.2 & 204 & 1 & 172 \\ \hline 56 & \text{M} & 98.0 & 236 & 1 & 178 \\ \hline \end{array} \]