Day 5

Review

Population Mean

\[\mu = {1 \over N}\sum\limits_{i=1}^Nx_i\]

\[\mu = {\sum\limits_{i=1}^Nx_i\over N}\]

Sample Mean

\[\bar{x} = {1 \over n}\sum\limits_{i=1}^nx_i\]

\[\bar{x} = {\sum\limits_{i=1}^nx_i\over n}\]

Spread of data

Range

\[\text{Range} = \text{Maximum} - \text{Minimum}\]

\[ \begin{array}{|c|c|c|c|c|c|c|c|c|c|c|} \hline \text{Species A} & 3.11 & 2.95 & 2.00 & 2.62 & 3.34 & 3.41 & 2.13 & 2.81 & 1.80 & 2.89 \\ \hline \text{Species B} & 3.43 & 3.33 & 3.03 & 3.40 & 3.32 & 3.13 & 2.81 & 3.04 & 3.38 & 3.24 \\ \hline \end{array} \]

Smaller spread: the values are usually closer to the mean

Larger spread: the values are usually further from the mean
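As a quick check of the range formula, here is a minimal Python sketch using the Species A and Species B values from the table above. The names `species_a`, `species_b`, and `data_range` are illustrative, not from the slides.

```python
# Measurements copied from the Species A / Species B table.
species_a = [3.11, 2.95, 2.00, 2.62, 3.34, 3.41, 2.13, 2.81, 1.80, 2.89]
species_b = [3.43, 3.33, 3.03, 3.40, 3.32, 3.13, 2.81, 3.04, 3.38, 3.24]


def data_range(values):
    """Range = Maximum - Minimum."""
    return max(values) - min(values)


range_a = data_range(species_a)
range_b = data_range(species_b)

print(f"Species A range: {range_a:.2f}")
print(f"Species B range: {range_b:.2f}")
```

Species A has the larger range, so by this measure its values are more spread out than those of Species B.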

Variance

Population variance (denoted $\sigma^2$):

\[\sigma^2 = {{{\sum\limits_{i=1}^N}(x_i-\mu)^2}\over N}\]

Sample variance (denoted $s^2$):

\[s^2 = {{{\sum\limits_{i=1}^n}(x_i-\bar{x})^2}\over (n-1)}\]

Standard Deviation

\[\sqrt{\sigma^2}=\sigma \rightarrow \text{Population Standard Deviation}\]

\[\sqrt{s^2}=s \rightarrow \text{Sample Standard Deviation}\]

What are the variance and standard deviation of the data below?

\[ \begin{array}{|c|c|c|c|c|c|c|c|} \hline \text{Year} & 2004 & 2005 & 2006 & 2007 & 2008 & 2009 & 2010 \\ \hline \text{Price} & 1.347 & 1.737 & 2.026 & 2.298 & 2.651 & 1.802 & 2.222 \\ \hline \end{array} \]

\[\bar{x}=2.012\]
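One way to work this out is sketched below with Python's standard `statistics` module. It assumes the seven prices are treated as a sample (the slide reports $\bar{x}$, not $\mu$), so the variance divides by $n-1$. The variable names are illustrative.

```python
import statistics

# Prices for 2004 through 2010, copied from the table.
prices = [1.347, 1.737, 2.026, 2.298, 2.651, 1.802, 2.222]

x_bar = statistics.mean(prices)      # sample mean
s2 = statistics.variance(prices)     # sample variance, divides by n - 1
s = statistics.stdev(prices)         # sample standard deviation, sqrt(s2)

print(f"mean     = {x_bar:.3f}")
print(f"variance = {s2:.3f}")
print(f"std dev  = {s:.3f}")
```

If the data were instead the whole population, `statistics.pvariance` and `statistics.pstdev` divide by $N$ and would give slightly smaller values.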

The Empirical Rule

For a population that has an approximately bell-shaped distribution:

$\approx 68\%$ of the data is within ONE standard deviation of the mean

\[\approx 68\% = \begin{cases} \mu - \sigma \\ \mu + \sigma \end{cases}\]

$\approx 95\%$ of the data is within TWO standard deviations of the mean

\[\approx 95\% = \begin{cases} \mu - 2\sigma \\ \mu + 2\sigma \end{cases}\]

$\approx$ All or almost all of the data is within THREE standard deviations of the mean

\[\approx 100\% = \begin{cases} \mu - 3\sigma \\ \mu + 3\sigma \end{cases}\]

Questions?

Goals for Today:

Introduce the common measures of position
Develop the five number summary
Discuss the concept of outliers

Numerical Summaries of Data

Measures of Position

Suppose we have a male white-tailed deer who’s $141$ lbs and a female who’s $112$ lbs

We have the tools to say how different they are from each other
But how can we say who’s more different from their specific group?

We need to define a way to look at differences relative to groups

This is referred to as position
- We’ll develop three general measures of it:
  - z-scores
  - percentiles
  - quartiles

z-scores

Let $x$ be a value from a population with mean $\mu$ and standard deviation $\sigma$

The z-score is:

\[z={x-\mu \over \sigma}\]

For a sample:

\[z={x-\bar{x} \over s}\]

Suppose you score $x=75$ on Exam 1. The class average on Exam 1 is $80$ with a standard deviation of 5. The $z$-score of your exam is:

\[z={75-80 \over 5}=-1\]

Population:

\[z={x-\mu \over \sigma}\]

Sample:

\[z={x-\bar{x} \over s}\]

The $z$-score of a data value $x$ is the number of standard deviations $x$ is from the mean of the data set

$z < 0 \Rightarrow$ the value of $x$ is less than the mean

$z = 0 \Rightarrow$ the value of $x$ is equal to the mean

$z > 0 \Rightarrow$ the value of $x$ is greater than the mean

In our example:

\[x=75\]

\[z=-1\]

Note: this measure is unitless

The mean weight for adult male white-tailed deer is $\mu=150$ lbs, with a standard deviation of $\sigma = 12$ lbs

The mean weight for adult female white-tailed deer is $\mu=105$ lbs, with a standard deviation of $\sigma = 9$ lbs

Who is heavier relative to their sex, a male deer at $141$ lbs or a female deer at $112$ lbs?

\[z={x-\mu \over \sigma}\]
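The comparison can be sketched in Python by plugging each deer into the population z-score formula. The function name `z_score` is illustrative. The means and standard deviations are the ones given for each sex.

```python
def z_score(x, mean, sd):
    """Number of standard deviations x lies from the mean."""
    return (x - mean) / sd


z_exam = z_score(75, 80, 5)        # the Exam 1 example
z_male = z_score(141, 150, 12)     # male deer: mu = 150, sigma = 12
z_female = z_score(112, 105, 9)    # female deer: mu = 105, sigma = 9

print(f"exam   z = {z_exam:.2f}")
print(f"male   z = {z_male:.2f}")
print(f"female z = {z_female:.2f}")
```

The male's z-score is negative (below the male mean) and the female's is positive (above the female mean), so the female is heavier relative to her sex even though she weighs less in pounds.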

z-scores and the Empirical Rule

Recall the Empirical Rule:

$\approx 68\%$ of the data will be between $\mu - \sigma$ and $\mu + \sigma$

$\approx 95\%$ of the data will be between $\mu - 2\sigma$ and $\mu + 2\sigma$

$\approx 99.7\%$ of the data will be between $\mu - 3\sigma$ and $\mu + 3\sigma$

With z-scores:

$\approx 68\%$ of the data will be between $z=-1$ and $z=1$

$\approx 95\%$ of the data will be between $z=-2$ and $z=2$

$\approx 100\%$ of the data will be between $z=-3$ and $z=3$

Both of these assume a bell-shaped distribution

A data set has a mean of $20$ and a standard deviation of $3$. A histogram for the data is shown below.

Is it appropriate to use the Empirical Rule to approximate the proportion of the data between $14$ and $26$? If so, find the approximation. If not, explain why not.
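The numeric half of this question can be sketched as follows. The histogram is not reproduced in these notes, so whether the data are bell-shaped has to be judged from the figure. The code only covers the case where they are. The name `empirical_proportion` and the lookup table are illustrative assumptions.

```python
# Approximate proportions within 1, 2, and 3 standard deviations of the mean
# (commonly quoted Empirical Rule values).
EMPIRICAL_RULE = {1: 0.68, 2: 0.95, 3: 0.997}


def empirical_proportion(lower, upper, mean, sd):
    """Approximate proportion between lower and upper, assuming a bell shape.

    Returns None when the interval is not mean +/- k*sd for k = 1, 2, or 3,
    because the Empirical Rule does not cover that case directly.
    """
    z_low = (lower - mean) / sd
    z_high = (upper - mean) / sd
    if z_low == -z_high and z_high in EMPIRICAL_RULE:
        return EMPIRICAL_RULE[z_high]
    return None


proportion = empirical_proportion(14, 26, mean=20, sd=3)
print(proportion)
```

Here $14$ and $26$ are exactly two standard deviations below and above the mean ($z=-2$ and $z=2$), so if the histogram is approximately bell-shaped the answer is about $95\%$. If it is clearly skewed, the rule should not be used.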

Quartiles

Every data set has three quartiles:

$1^{st}$ quartile, denoted $\textbf{Q}_1$ separates the lowest $25\%$ of the data from the highest $75\%$

$2^{nd}$ quartile, denoted $\textbf{Q}_2$ separates the lowest $50\%$ of the data from the highest $50\%$ ($\textbf{Q}_2 = \text{Median}$)

$3^{rd}$ quartile, denoted $\textbf{Q}_3$ separates the lowest $75\%$ of the data from the highest $25\%$

The quartiles of the below data set of size $n=8$:

\[ \begin{array}{|c|c|c|c|c|c|c|c|} \hline 1 & 2 & 2 & 3 & 4 & 5 & 7 & 12 \\ \hline \end{array} \]

\[\text{Q}_1 = 2\]

\[\text{Q}_2 = 3.5\ \text{(Median)}\]

\[\text{Q}_3 = 6\]

Percentiles

For a number $p$ between $1$ and $99$, the $p^{th}$ percentile separates the lowest $p\%$ of the data from the highest $(100-p)\%$

Quartiles separate data into $4$ parts

Each part is $\approx 25\%$ of the data

Percentiles divide the data set into $100$ parts

Procedure for Computing Percentiles or Quartiles

Arrange the data in increasing order

Let $n$ be the number of values in the data set. For the $p^{th}$ percentile, compute the value:

\[L={p\over 100}*n\]

If $L$ is a whole number, the $p^{th}$ percentile is the average of the number in position $L$ and the number in position $L+1$

If $L$ is not a whole number, round it up to the next higher whole number. The $p^{th}$ percentile is the number in the position corresponding to the rounded-up value
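The procedure above can be sketched in Python as follows. The function name `percentile` is illustrative, and the sketch assumes $p$ is an integer between $1$ and $99$. It uses integer arithmetic to decide whether $L$ is a whole number, which avoids floating-point rounding surprises. Note that Python's built-in `statistics.quantiles` uses a different interpolation rule and may not match this textbook procedure.

```python
def percentile(data, p):
    """p-th percentile using the L = (p/100)*n procedure (p an integer, 1..99)."""
    values = sorted(data)              # step 1: arrange in increasing order
    n = len(values)
    # L = p*n/100; 'whole' is the integer part, 'remainder' tells us
    # whether L is a whole number.
    whole, remainder = divmod(p * n, 100)
    if remainder == 0:
        # L is whole: average the values in positions L and L+1.
        # Positions are 1-based, list indices are 0-based.
        return (values[whole - 1] + values[whole]) / 2
    # L is not whole: round up to position whole+1, i.e. index 'whole'.
    return values[whole]


data = [1, 2, 2, 3, 4, 5, 7, 12]       # the n = 8 example from the Quartiles slide
q1 = percentile(data, 25)
q2 = percentile(data, 50)
q3 = percentile(data, 75)
print(q1, q2, q3)
```

For this data set $L$ is a whole number for all three quartiles ($2$, $4$, and $6$), so each quartile is an average of two neighboring values.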

What’s the position for the $85^{th}$ percentile for male deer in a data set of $n=96$?

\[L={p\over 100}*n\]

\[p = 85\]

\[n = 96\]

\[L={85\over 100}*96=81.6 \rightarrow 82\]

Five-Number Summary

The five-number summary is a set of five measures of position computed from a data set. The summary consists of:

\[ \begin{array}{|c|c|c|c|c|} \hline \text{Min} & \text{Q}_1 & \text{Median} & \text{Q}_3 & \text{Max} \\ \hline \end{array} \]

Below is a table of total infected counts from a series of FMD outbreaks in cattle

\[ \begin{array}{|c|c|c|c|c|c|c|c|c|c|} \hline 192 & 152 & 90 & 124 & 178 & 180 & 127 & 182 & 196 & 118 \\ \hline \end{array} \]

The five number summary is:

\[ \begin{array}{|c|c|c|c|c|} \hline \text{Min} & \text{Q}_1 & \text{Median} & \text{Q}_3 & \text{Max} \\ \hline 90 & 124 & 165 & 182 & 196\\ \hline \end{array} \]
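To see where these five numbers come from, here is a minimal Python sketch that applies the percentile procedure to the FMD counts. The helper names `percentile` and `five_number_summary` are illustrative, and `percentile` assumes an integer $p$ between $1$ and $99$.

```python
def percentile(data, p):
    """p-th percentile via the L = (p/100)*n procedure (p an integer, 1..99)."""
    values = sorted(data)
    whole, remainder = divmod(p * len(values), 100)
    if remainder == 0:
        # L is whole: average positions L and L+1 (1-based).
        return (values[whole - 1] + values[whole]) / 2
    # L is not whole: round up to the next position.
    return values[whole]


def five_number_summary(data):
    """Return (Min, Q1, Median, Q3, Max)."""
    return (min(data), percentile(data, 25), percentile(data, 50),
            percentile(data, 75), max(data))


fmd_counts = [192, 152, 90, 124, 178, 180, 127, 182, 196, 118]
summary = five_number_summary(fmd_counts)
print(summary)
```

With $n=10$, $L$ is $2.5$ for $Q_1$ and $7.5$ for $Q_3$ (rounded up to positions $3$ and $8$), while $L=5$ for the median, which is the average of the values in positions $5$ and $6$.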

A new outbreak occurs with 190 getting sick. How does that compare to our original data?

$Q_3 < 190 < Max \Rightarrow 190$ is greater than $75\%$ of the outbreaks but is not the largest

What about an outbreak of 128?

$Q_1 < 128 < Median \Rightarrow 128$ is greater than $25\%$ of the outbreaks, but less than the upper $50\%$ of them

Outliers

An outlier is a value that is considerably larger or smaller than most of the values in a data set

What do you do with outliers?

If it’s a mistake/typo/bad measurement, ideally fix it
If you can’t fix it, chuck it
If it’s a valid observation, you keep it in the data
- Use resistant statistics, which are less affected by extreme values

How do you find outliers?

Interquartile Range (IQR)

The IQR is a measure of spread that is often used to detect outliers

Take the difference between $Q_1$ and $Q_3$:

\[\text{IQR} = \text{Q}_3 - \text{Q}_1\]

Notice the IQR contains the middle $50\%$ of the data

To find outliers with the IQR we use the IQR Method

Define outlier boundaries:

\[\text{Lower Outlier Boundary} = \text{Q}_1 - 1.5*\text{IQR}\]

\[\text{Upper Outlier Boundary} = \text{Q}_3 + 1.5*\text{IQR}\]

Check to see if any data is outside of these boundaries:

\[x < \text{Lower Outlier Boundary} \quad \text{or} \quad x > \text{Upper Outlier Boundary}\]

Recall our five-number summary of the FMD data:

\[ \begin{array}{|c|c|c|c|c|} \hline \text{Min} & \text{Q}_1 & \text{Median} & \text{Q}_3 & \text{Max} \\ \hline 90 & 124 & 165 & 182 & 196\\ \hline \end{array} \]

The IQR for the dataset is:

\[\text{IQR} = \text{Q}_3 - \text{Q}_1 = 182 - 124 = 58\]

Defining the outlier boundaries:

\[\text{Q}_3 + 1.5*\text{IQR} = 182 + (1.5*58)=269\]

\[\text{Q}_1 - 1.5*\text{IQR} = 124-(1.5*58)=37\]

Any value in the dataset $>269$ or $<37$ is an outlier
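The IQR Method for the FMD data can be sketched in Python as follows. The quartiles are taken directly from the five-number summary rather than recomputed, and the variable names are illustrative.

```python
fmd_counts = [192, 152, 90, 124, 178, 180, 127, 182, 196, 118]

# Quartiles taken from the five-number summary of the FMD data.
q1, q3 = 124, 182

iqr = q3 - q1
lower_boundary = q1 - 1.5 * iqr
upper_boundary = q3 + 1.5 * iqr

# A value is an outlier if it falls below the lower boundary
# or above the upper boundary.
outliers = [x for x in fmd_counts if x < lower_boundary or x > upper_boundary]

print(f"IQR = {iqr}")
print(f"boundaries = ({lower_boundary}, {upper_boundary})")
print(f"outliers = {outliers}")
```

Every count lies between $37$ and $269$, so by the IQR Method this data set has no outliers.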

Attendance QOTD

Go away