Day 3

Review

What is data?

\[ \begin{array}{|c|c|c|c|c|} \hline \text{N} & \text{Age Class} & \text{Weight} & \text{Sex} & \text{Location} \\ \hline 1 & 0.5 & 30.8 & \text{M} & \text{B} \\ \hline 2 & 0.5 & 21.8 & \text{M} & \text{B} \\ \hline 3 & 2.5 & 47.6 & \text{M} & \text{A} \\ \hline 4 & 0.5 & 29.0 & \text{F} & \text{B} \\ \hline 5 & 2.5 & 65.8 & \text{M} & \text{A} \\ \hline \end{array} \]

How many individuals?

How many variables?

Qualitative (Categorical) variable: The value of the variable represents a descriptive category

Quantitative variable: The value of the variable represents a meaningful number

Qualitative variables can be ordinal or nominal

Ordinal variables: Categories/values of the variable have a natural ordering

Nominal variable: Categories/values of the variable cannot be ordered naturally

Quantitative variables can be discrete or continuous

Discrete variable: A countable number of values (0, 1, 2, 3, 4, …)

Continuous variable: A continuous range of numbers (0, 0.1, 0.11, 0.111, …)

Quantitative variables can be categorized by level of measurement used for obtaining data values:

Interval level

Differences between values make sense
Ratios don’t make sense because zero is an arbitrary reference point, not the absence of the quantity

Ratio level

Numerical measurement
Differences between values make sense
Ratios also make sense
Zero has meaning: it represents absence of the quantity

Graphics

Why would I graph anything?

When should I use a bar graph?

What about a pie chart?

Histogram: visual representation of a frequency distribution

Not a bar graph

Bar height (y-axis) represents class frequency

Bar width (x-axis) represents class width
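
To make class frequency and class width concrete, here is a minimal Python sketch that counts how many values land in each class. The function name `class_frequencies`, the starting point of 0, and the class width of 20 are illustrative assumptions, not choices the course prescribes. The values are the 20 observations used in the stem-and-leaf example later in these notes.

```python
def class_frequencies(values, start, width, n_classes):
    """Count how many values fall in each class [lower, lower + width)."""
    counts = [0] * n_classes
    for v in values:
        idx = int((v - start) // width)  # which class this value belongs to
        if 0 <= idx < n_classes:
            counts[idx] += 1
    return counts

# Data from the stem-and-leaf example in these notes
data = [87, 7, 95, 76, 32, 28, 84, 98, 93, 88,
        78, 100, 68, 76, 55, 65, 42, 57, 77, 96]

# Assumed classes: 0-19, 20-39, 40-59, 60-79, 80-99, 100-119
freqs = class_frequencies(data, start=0, width=20, n_classes=6)
print(freqs)  # [1, 2, 3, 6, 7, 1]
```

Each count would be the height of one histogram bar, and the width of 20 would be the width of every bar.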

We care about the shape of our data
- This is the primary purpose of a histogram
- So we want to not fail at that task
The shape of our data can help us observe the distribution of our data

Symmetric - mirror image on both sides of its center

Unimodal - One peak/hump

Positively-skewed - Long, narrow tail to the right

Negatively-skewed - Long, narrow tail to the left

Uniform - Box shape, all bars roughly the same height

Bimodal - two peaks/humps

Histograms can be used to summarize both small and large data sets
Sometimes we prefer more detailed visualizations for smaller data sets
Stem-and-leaf plots and dotplots are alternative summaries that display the actual values

Stem-and-leaf plots

Each observation should have at least two digits
- The digit furthest to the right is the “leaf”
- The digits to the left form the “stem”
The data:

\[ \begin{array}{|c|c|c|c|c|c|c|c|c|c|} \hline 87 & 7 & 95 & 76 & 32 & 28 & 84 & 98 & 93 & 88\\ \hline 78 & 100 & 68 & 76 & 55 & 65 & 42 & 57 & 77 & 96 \\ \hline \end{array} \]

The stem-and-leaf plot:

\[ \begin{array}{r|l} 0 & 7 \\ 1 & \\ 2 & 8 \\ 3 & 2 \\ 4 & 2 \\ 5 & 5 \; 7 \\ 6 & 5 \; 8 \\ 7 & 6 \; 6 \; 7 \; 8 \\ 8 & 4 \; 7 \; 8 \\ 9 & 3 \; 5 \; 6 \; 8 \\ 10 & 0 \end{array} \]
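
The stem/leaf split above can be sketched in Python as follows. This is a minimal illustration that assumes non-negative whole numbers; the function name `stem_and_leaf` is hypothetical. A one-digit value such as 7 is treated as stem 0, leaf 7.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group each value's last digit (leaf) under its leading digits (stem)."""
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # e.g. 87 -> stem 8, leaf 7
        groups[stem].append(leaf)
    # Keep empty stems so gaps in the data stay visible
    lo, hi = min(groups), max(groups)
    return {stem: groups.get(stem, []) for stem in range(lo, hi + 1)}

data = [87, 7, 95, 76, 32, 28, 84, 98, 93, 88,
        78, 100, 68, 76, 55, 65, 42, 57, 77, 96]

plot = stem_and_leaf(data)
for stem, leaves in plot.items():
    print(f"{stem:>2} | {' '.join(str(leaf) for leaf in leaves)}")
```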

Dotplots

We can represent each observation by a dot above its value on a number line
Data:

\[ \begin{array}{|c|c|c|c|c|} \hline 3 & 6 & 2 & 5 & 1 \\ \hline 2 & 3 & 4 & 3 & 4 \\ \hline \end{array} \]

Dotplot:
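
A text version of a dotplot for the data above can be sketched in Python like this. It is only an illustration: the helper name `dotplot_rows` is hypothetical, and `*` stands in for a dot.

```python
from collections import Counter

data = [3, 6, 2, 5, 1, 2, 3, 4, 3, 4]
counts = Counter(data)                      # how many times each value occurs
values = range(min(data), max(data) + 1)    # the number line: 1 through 6

def dotplot_rows(counts, values):
    """Build the plot top-down: one text row per stacking level, then the axis."""
    rows = []
    for level in range(max(counts.values()), 0, -1):
        rows.append(" ".join("*" if counts[v] >= level else " " for v in values))
    rows.append(" ".join(str(v) for v in values))
    return rows

rows = dotplot_rows(counts, values)
print("\n".join(rows))
```

The tallest stack sits above 3, which occurs three times in the data.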

Questions?

Goals for Today:

Describe and calculate Mean, Median, and Mode
Recognize the use and importance of Statistical/Mathematical notation

Numerical Summaries of Data

Measures of Center

Graphics are good for taking data and making it easier to view

Numerical summaries are how we take data and make it easier to understand

Statistical Inference in a nutshell

“Using sample statistics to describe population parameters”

Which way is this histogram skewed?

How would you describe the “center” of this data?

Mean or average: Balance point (fulcrum) of the dataset
Median: Half of the data points are above the median, half are below
Mode: Where the peak is

Mean

Sum all of the data then divide by the number of observations

Most commonly used metric for summarizing data

\[ \begin{array}{|c|c|c|c|c|} \hline 7 & 3 & 12 & 3 & 5 \\ \hline \end{array} \]

\[\text{Mean} \ = \ {7+3+12+3+5 \over 5} = {30 \over 5} = 6\]

If the data we calculated a mean for comes from a sample:

Sample mean

If the data we calculated a mean for comes from a population:

Population mean

Mathematical Notation Soapbox

“Letter math”
I resisted it forever
But trust me, it does end up being helpful

Data values can be denoted as $x_1,x_2,x_3,...$

$x_1$ refers to the observed value of the variable $x$ from individual 1
$x$ can be anything
- It doesn’t even have to be $x$
- It’s convention, not law

Sample size (the number of individuals in the sample)

Denoted with $n$ (Note: lower-case)

Population size

Denoted with $N$ (Note: capital)

Summation

This is referring to the sum (addition) of everything contained in the expression
We denote this with the Greek letter $\Sigma$

With this notation we can describe:

\[\sum\limits_{i=1}^nx_i=x_1+x_2+...+x_n\]

“The sum of the $x_i$, starting from $i=1$ and ending at the $n^{th}$ term”
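
If you have done any programming, sigma notation is just a loop. The sketch below is one way to read it in Python, using the five data values from the mean example earlier. The function name `summation` is hypothetical, and the `i - 1` is only there because Python lists start counting at 0 while the notation starts at 1.

```python
def summation(x):
    """Add up x_1, x_2, ..., x_n one term at a time."""
    n = len(x)
    total = 0
    for i in range(1, n + 1):   # i = 1, 2, ..., n
        total += x[i - 1]       # x_i in the notation is x[i - 1] in Python
    return total

x = [7, 3, 12, 3, 5]
print(summation(x))  # 30
```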

With sigma notation we can express the sample and population mean formulas:

Sample mean (denoted $\bar{x}$):

\[{1 \over n}\sum\limits_{i=1}^nx_i\]

Population mean (denoted $\mu$):

\[{1 \over N}\sum\limits_{i=1}^Nx_i\]

Greek letters usually mean population parameters

Lower-case letters usually mean sample statistics
In practice:

\[ \begin{array}{|c|c|c|c|c|c|c|c|c|c|c|} \hline \text{Student} & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 \\ \hline \text{Absences} & 2 & 6 & 1 & 2 & 4 & 0 & 1 & 3 & 0 & 2 \\ \hline \end{array} \]

\[{1 \over n}\sum\limits_{i=1}^nx_i\]

\[{1 \over 10}(x_1+x_2+x_3+x_4+x_5+x_6+x_7+x_8+x_9+x_{10})\]

\[{1 \over 10}(2+6+1+2+4+0+1+3+0+2)\]

\[{1 \over 10} \times 21 = {21 \over 10} = 2.1\]
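
The same calculation as a short Python sketch, using the absences from the table above. The variable names are illustrative; `statistics.mean` from the Python standard library is included only as a cross-check.

```python
from statistics import mean

absences = [2, 6, 1, 2, 4, 0, 1, 3, 0, 2]

n = len(absences)         # sample size, n = 10
total = sum(absences)     # the sigma part: x_1 + x_2 + ... + x_10
x_bar = total / n         # sample mean: (1/n) * sum

print(n, total, x_bar)    # 10 21 2.1
print(mean(absences))     # cross-check with the standard library
```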

Properties of the mean:

Common
Easy to interpret
Susceptible to outliers
- The average number of Super Bowl rings between me and Tom Brady is $3.5$
- (As of 2021) the top $1\%$ of households in the United States hold $32.3\%$ of the country’s wealth, while the bottom $50\%$ hold $2.6\%$
- A statistic is resistant if its value is not affected heavily by outliers

Is the mean resistant?

Median

Middle value: half the data are below and half are above

\[ \begin{array}{|c|c|c|c|c|} \hline 7 & 3 & 12 & 3 & 5 \\ \hline \end{array} \]

Sort your data in increasing order (low to high)

\[ \begin{array}{|c|c|c|c|c|} \hline 3 & 3 & \boldsymbol{5} & 7 & 12 \\ \hline \end{array} \]

The median is 5

If $n$ is odd: choose position ${(n+1)\over2}$ in the ordered dataset

So $n=5$

\[{(n+1)\over2}={(5+1)\over2}=3\]

We pick the $3^{rd}$ data point after sorting

\[ \begin{array}{|c|c|c|c|c|c|} \hline 7 & 3 & 12 & 3 & 5 & 8\\ \hline \end{array} \]

If $n$ is even:

Sort the data, then pick positions ${n \over 2}$ and ${n \over 2}+1$
Average the two data points

\[ \begin{array}{|c|c|c|c|c|c|} \hline 3 & 3 & \boldsymbol{5} & \boldsymbol{7} & 8 & 12 \\ \hline \end{array} \]

\[n=6\]

\[{6\over 2}=3 \ ,\ {6 \over 2}+1=4\]

3rd data point: $5$
4th data point: $7$

\[{(5+7)\over 2}=6\]

The median is $6$
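
Both cases (odd and even $n$) can be put into one short Python sketch. The function name `median` is just a label for this illustration. The `- 1` adjustments are there because the positions in the formulas start at 1, while Python lists start at 0.

```python
def median(values):
    """Median by the position rules: sort, then pick the middle (odd n) or average the middle two (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: position (n + 1) / 2 in the ordered data
        return ordered[(n + 1) // 2 - 1]
    # Even n: average positions n / 2 and n / 2 + 1
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2

print(median([7, 3, 12, 3, 5]))     # 5
print(median([7, 3, 12, 3, 5, 8]))  # 6.0
```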

Properties of median:

It doesn’t use all of the data directly
This makes it resistant
- Outliers have little/no effect
Sometimes a more realistic measurement:
- Median Household Income (Kansas): $\$57,422$
- Average Household Income (Kansas): $\$77,509$
  - Why does median make more sense than average here?
The difference between the median and the mean depends on the skew of the histogram
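
To see resistance in action, here is a small Python sketch with made-up numbers. The incomes are hypothetical (in thousands of dollars) and are not the Kansas figures; `mean` and `median` come from Python's standard `statistics` module.

```python
from statistics import mean, median

# Hypothetical household incomes, in thousands of dollars
incomes = [30, 35, 40, 45, 50]
print(mean(incomes), median(incomes))   # 40 40

# Add one very large income (an outlier far to the right)
with_outlier = incomes + [500]
print(round(mean(with_outlier), 1))     # 116.7
print(median(with_outlier))             # 42.5
```

One extreme value drags the mean toward the long right tail, while the median barely moves.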

Mode

The most frequent observation

Useful for qualitative data

“Which species of Salmonella is most commonly growing in my flour?”

Not as useful for quantitative data

“What’s the most common weight of cattle on our research farms?”
- Why isn’t this a helpful metric?

A data set can have any number of modes $(0,1,2,...)$

\[ \begin{array}{|c|c|c|c|c|} \hline 3 & 3 & 5 & 7 & 12 \\ \hline \end{array} \]

The most frequently observed value in this data set is $3$

The mode is $3$
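
A minimal Python sketch for finding the mode(s) of a data set. The function name `modes` is hypothetical. One assumption to flag: when no value repeats, this sketch reports zero modes, which is one common convention but not the only one. The second and third data sets are made up for illustration.

```python
from collections import Counter

def modes(values):
    """Return every value tied for the highest frequency, as a sorted list."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # assumption: if nothing repeats, there is no mode
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([3, 3, 5, 7, 12]))  # [3]       one mode
print(modes([1, 1, 2, 2, 3]))   # [1, 2]    two modes
print(modes([1, 2, 3]))         # []        no mode
```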

The below data set displays a sample of $n=7$ observations:

\[ \begin{array}{|c|c|c|c|c|c|c|} \hline 2 & 1 & 3 & 4 & 3 & 5 & 4 \\ \hline \end{array} \]

Find the mean:

Find the median:

Find the mode(s):

Attendance QOTD

\[ \begin{array}{|c|c|c|c|c|c|} \hline \text{Age} & \text{Sex} & \text{Body Temperature} & \text{Serum Cholesterol} & \text{Chest Pain Type} & \text{Max Heart Rate} \\ \hline 34 & \text{M} & 98.7 & 182 & 3 & 174 \\ \hline 63 & \text{M} & 97.8 & 233 & 3 & 150 \\ \hline 37 & \text{M} & 97.6 & 250 & 2 & 187 \\ \hline 41 & \text{F} & 98.2 & 204 & 1 & 172 \\ \hline 56 & \text{M} & 98.0 & 236 & 1 & 178 \\ \hline \end{array} \]