Day 5

Review


Population Mean


\[\mu = {1 \over N}\sum\limits_{i=1}^Nx_i\]


\[\mu = {\sum\limits_{i=1}^Nx_i\over N}\]


Sample Mean


\[\bar{x} = {1 \over n}\sum\limits_{i=1}^nx_i\]


\[\bar{x} = {\sum\limits_{i=1}^nx_i\over n}\]



Spread of data



Range


\[\text{Range} = \text{Maximum} - \text{Minimum}\]


\[ \begin{array}{|c|c|c|c|c|c|c|c|c|c|c|} \hline \text{Species A} & 3.11 & 2.95 & 2.00 & 2.62 & 3.34 & 3.41 & 2.13 & 2.81 & 1.80 & 2.89 \\ \hline \text{Species B} & 3.43 & 3.33 & 3.03 & 3.40 & 3.32 & 3.13 & 2.81 & 3.04 & 3.38 & 3.24 \\ \hline \end{array} \]
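
As a quick check of the range formula, here is a minimal Python sketch using the two rows of the table above. The names `species_a`, `species_b`, and `data_range` are made up for illustration.

```python
# Values copied from the Species A / Species B table above.
species_a = [3.11, 2.95, 2.00, 2.62, 3.34, 3.41, 2.13, 2.81, 1.80, 2.89]
species_b = [3.43, 3.33, 3.03, 3.40, 3.32, 3.13, 2.81, 3.04, 3.38, 3.24]

def data_range(values):
    """Range = Maximum - Minimum."""
    return max(values) - min(values)

range_a = data_range(species_a)
range_b = data_range(species_b)

# Rounded to two decimals to hide floating-point noise.
print(f"Species A range: {range_a:.2f}")
print(f"Species B range: {range_b:.2f}")
```

Species A has the larger range, so by this measure its values are more spread out.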



Smaller spread: the values are usually closer to the mean



Larger spread: the values are usually further from the mean




Variance


Population variance (denoted \(\sigma^2\)):


\[\sigma^2 = {{{\sum\limits_{i=1}^N}(x_i-\mu)^2}\over N}\]


Sample variance (denoted \(s^2\)):


\[s^2 = {{{\sum\limits_{i=1}^n}(x_i-\bar{x})^2}\over (n-1)}\]



Standard Deviation


\[\sqrt{\sigma^2}=\sigma \rightarrow \text{Population Standard Deviation}\]


\[\sqrt{s^2}=s \rightarrow \text{Sample Standard Deviation}\]

  • What are the variance and standard deviation of the data below?


\[ \begin{array}{|c|c|c|c|c|c|c|c|} \hline \text{Year} & 2004 & 2005 & 2006 & 2007 & 2008 & 2009 & 2010 \\ \hline \text{Price} & 1.347 & 1.737 & 2.026 & 2.298 & 2.651 & 1.802 & 2.222 \\ \hline \end{array} \]


\[\bar{x}=2.012\]
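
One way to check the hand calculation is a short Python sketch. It assumes the prices are treated as a sample (the slide reports \(\bar{x}\), not \(\mu\)), so the variance divides by \(n-1\). The variable names are illustrative.

```python
import statistics

# Prices copied from the table above (2004-2010).
prices = [1.347, 1.737, 2.026, 2.298, 2.651, 1.802, 2.222]

x_bar = statistics.mean(prices)          # sample mean
s_squared = statistics.variance(prices)  # sample variance, divides by n - 1
s = statistics.stdev(prices)             # sample standard deviation, sqrt(s^2)

print(f"mean     = {x_bar:.3f}")
print(f"variance = {s_squared:.3f}")
print(f"std dev  = {s:.3f}")
```

If the data were instead treated as a whole population, `statistics.pvariance` and `statistics.pstdev` divide by \(N\).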






The Empirical Rule


For a population that has an approximately bell-shaped distribution:

  • \(\approx 68\%\) of the data is within ONE standard deviation of the mean


\[\approx 68\% = \begin{cases} \mu - \sigma \\ \mu + \sigma \end{cases}\]



  • \(\approx 95\%\) of the data is within TWO standard deviations of the mean


\[\approx 95\% = \begin{cases} \mu - 2\sigma \\ \mu + 2\sigma \end{cases}\]



  • \(\approx\) All or almost all of the data is within THREE standard deviations of the mean


\[\approx 100\% = \begin{cases} \mu - 3\sigma \\ \mu + 3\sigma \end{cases}\]






Questions?



Goals for Today:

  1. Introduce the common measures of position

  2. Develop the five number summary

  3. Discuss the concept of outliers




Numerical Summaries of Data


Measures of Position

Suppose we have a male white-tailed deer who’s \(141\) lbs and a female who’s \(112\) lbs


  • We have the tools to say how different they are from each other

  • But how can we say who’s more different from their specific group?



We need to define a way to look at differences relative to groups

  • This is referred to as position

    • We’ll develop three general measures of it:

      • z-scores

      • percentiles

      • quartiles



z-scores

Let \(x\) be a value from a population with mean \(\mu\)


  • The z-score is:


\[z={x-\mu \over \sigma}\]


  • For a sample:


\[z={x-\bar{x} \over s}\]




Suppose you score \(x=75\) on Exam 1. The class average on Exam 1 is \(80\) with a standard deviation of 5. The \(z\)-score of your exam is:


\[z={75-80 \over 5}=-1\]


Population:


\[z={x-\mu \over \sigma}\]


Sample:


\[z={x-\bar{x} \over s}\]


The \(z\)-score of a data value \(x\) is the number of standard deviations \(x\) is from the mean of the data set


  • \(z < 0 \Rightarrow\) the value of \(x\) is less than the mean


  • \(z = 0 \Rightarrow\) the value of \(x\) is equal to the mean


  • \(z > 0 \Rightarrow\) the value of \(x\) is greater than the mean



In our example:


\[x=75\]


\[z=-1\]

  • Note: this measure is unitless





The mean weight for adult male white-tailed deer is \(\mu=150\) lbs, with a standard deviation of \(\sigma = 12\) lbs


The mean weight for adult female white-tailed deer is \(\mu=105\) lbs, with a standard deviation of \(\sigma = 9\) lbs


Who is heavier relative to their sex, a male deer at \(141\) lbs or a female deer at \(112\) lbs?


\[z={x-\mu \over \sigma}\]
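
The comparison can be sketched in a few lines of Python by applying the population formula to each deer. The helper name `z_score` is an illustrative choice.

```python
def z_score(x, mean, sd):
    """Number of standard deviations x lies from the mean."""
    return (x - mean) / sd

# Male deer: x = 141, mu = 150, sigma = 12
z_male = z_score(141, 150, 12)

# Female deer: x = 112, mu = 105, sigma = 9
z_female = z_score(112, 105, 9)

print(f"male   z = {z_male:.2f}")
print(f"female z = {z_female:.2f}")
```

The male's z-score is negative (below the male mean) and the female's is positive (above the female mean), so the female is heavier relative to her sex.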






z-scores and the Empirical Rule

Recall the Empirical Rule:


  • \(\approx 68\%\) of the data will be between \(\mu - \sigma\) and \(\mu + \sigma\)


  • \(\approx 95\%\) of the data will be between \(\mu - 2\sigma\) and \(\mu + 2\sigma\)


  • \(\approx 99.7\%\) of the data will be between \(\mu - 3\sigma\) and \(\mu + 3\sigma\)


With z-scores:


  • \(\approx 68\%\) of the data will be between \(z=-1\) and \(z=1\)


  • \(\approx 95\%\) of the data will be between \(z=-2\) and \(z=2\)


  • \(\approx 100\%\) of the data will be between \(z=-3\) and \(z=3\)


Both of these assume a bell-shaped distribution




A data set has a mean of \(20\) and a standard deviation of \(3\). A histogram for the data is shown below.

Is it appropriate to use the Empirical Rule to approximate the proportion of the data between \(14\) and \(26\)? If so, find the approximation. If not, explain why not.
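
The histogram is not available in this text, so the answer depends on its shape. As a sketch of the arithmetic only, assuming the histogram is approximately bell-shaped, the endpoints convert to z-scores as:

\[z={14-20 \over 3}=-2 \qquad z={26-20 \over 3}=2\]

Under that assumption, the Empirical Rule gives \(\approx 95\%\) of the data between \(14\) and \(26\). If the histogram is clearly skewed or otherwise not bell-shaped, the rule should not be used.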





Quartiles

Every data set has three quartiles:


\(1^{st}\) quartile, denoted \(\textbf{Q}_1\), separates the lowest \(25\%\) of the data from the highest \(75\%\)


\(2^{nd}\) quartile, denoted \(\textbf{Q}_2\), separates the lowest \(50\%\) of the data from the highest \(50\%\) (\(\textbf{Q}_2 = \text{Median}\))


\(3^{rd}\) quartile, denoted \(\textbf{Q}_3\), separates the lowest \(75\%\) of the data from the highest \(25\%\)



The quartiles of the below data set of size \(n=8\):


\[ \begin{array}{|c|c|c|c|c|c|c|c|} \hline 1 & 2 & 2 & 3 & 4 & 5 & 7 & 12 \\ \hline \end{array} \]


\[\text{Q}_1 = 2\]


\[\text{Q}_2 = 3.5\ \text{(Median)}\]


\[\text{Q}_3 = 6\]




Percentiles

For a number \(p\) between \(1\) and \(99\), the \(p^{th}\) percentile separates the lowest \(p\%\) of the data from the highest \((100-p)\%\)


  • Quartiles separate data into \(4\) parts


  • Each part is \(\approx 25\%\) of the data


  • Percentiles divide the data set into \(100\) parts



Procedure for Computing Percentiles or Quartiles


  1. Arrange the data in increasing order


  2. Let \(n\) be the number of values in the data set. For the \(p^{th}\) percentile, compute the value:


\[L={p\over 100}*n\]


  3. If \(L\) is a whole number, the \(p^{th}\) percentile is the average of the number in position \(L\) and the number in position \(L+1\)


  • If \(L\) is not a whole number, round it up to the next higher whole number. The \(p^{th}\) percentile is the number in the position corresponding to the rounded-up value
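
The procedure above can be sketched in Python as follows. This is one possible implementation, not a library routine: the function name `percentile` is illustrative, and it assumes \(p\) is a whole number from \(1\) to \(99\). Statistical software often uses other interpolation rules, so its answers can differ slightly.

```python
def percentile(data, p):
    """p-th percentile (integer p, 1-99) using the rounding procedure above."""
    ordered = sorted(data)                   # step 1: increasing order
    n = len(ordered)
    # Step 2: L = (p / 100) * n, kept as whole part and remainder
    # so the whole-number check avoids floating-point error.
    whole, remainder = divmod(p * n, 100)
    if remainder == 0:
        # L is a whole number: average positions L and L + 1 (1-based).
        return (ordered[whole - 1] + ordered[whole]) / 2
    # L is not whole: round up to position whole + 1 (1-based).
    return ordered[whole]

# The n = 8 data set from the Quartiles example.
data = [1, 2, 2, 3, 4, 5, 7, 12]
q1 = percentile(data, 25)
q2 = percentile(data, 50)
q3 = percentile(data, 75)
print(q1, q2, q3)
```

For this data set the sketch reproduces the quartiles listed earlier: \(2\), \(3.5\), and \(6\).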




What’s the position for the \(85^{th}\) percentile for male deer in a data set of \(n=96\)?


\[L={p\over 100}*n\]


\[p = 85\]

\[n = 96\]


\[L={85\over 100}*96=81.6 \rightarrow 82\]




Five-Number Summary

The five-number summary is a set of five measures of position computed from a data set. The summary consists of:


\[ \begin{array}{|c|c|c|c|c|} \hline \text{Min} & \text{Q}_1 & \text{Median} & \text{Q}_3 & \text{Max} \\ \hline \end{array} \]


Below is a table of total infected counts from a series of FMD outbreaks in cattle

\[ \begin{array}{|c|c|c|c|c|c|c|c|c|c|} \hline 192 & 152 & 90 & 124 & 178 & 180 & 127 & 182 & 196 & 118 \\ \hline \end{array} \]


The five number summary is:


\[ \begin{array}{|c|c|c|c|c|} \hline \text{Min} & \text{Q}_1 & \text{Median} & \text{Q}_3 & \text{Max} \\ \hline 90 & 124 & 165 & 182 & 196\\ \hline \end{array} \]
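
The summary above can be reproduced with a short Python sketch. It assumes the quartiles are found with the percentile procedure from these notes. The names `percentile` and `five_number_summary` are illustrative, not a library API.

```python
def percentile(data, p):
    """p-th percentile (integer p, 1-99) via the rounding procedure in these notes."""
    ordered = sorted(data)
    whole, remainder = divmod(p * len(ordered), 100)  # L = whole + remainder/100
    if remainder == 0:
        # L is whole: average the values in positions L and L + 1 (1-based).
        return (ordered[whole - 1] + ordered[whole]) / 2
    # Otherwise round L up and take the value in that position.
    return ordered[whole]

def five_number_summary(data):
    """Return (Min, Q1, Median, Q3, Max)."""
    return (min(data), percentile(data, 25), percentile(data, 50),
            percentile(data, 75), max(data))

# Infected counts from the FMD outbreak table.
outbreaks = [192, 152, 90, 124, 178, 180, 127, 182, 196, 118]
summary = five_number_summary(outbreaks)
print(summary)
```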


A new outbreak occurs with 190 getting sick. How does that compare to our original data?


\(Q_3 < 190 < Max \Rightarrow 190\) is greater than \(75\%\) of the outbreaks but is not the largest


What about an outbreak of 128?

\(Q_1 < 128 < Median \Rightarrow 128\) is greater than \(25\%\) of the outbreaks, but less than the largest \(50\%\) of them




Outliers

An outlier is a value that is considerably larger or smaller than most of the values in a data set


What do you do with outliers?

  • If it’s a mistake/typo/bad measurement, ideally fix it

  • If you can’t fix it, chuck it

  • If it’s a valid observation, you keep it in the data

    • You keep outliers if they’re valid observations

      • Keep outliers if they’re valid

      • Resistant statistics



How do you find outliers?




Interquartile Range (IQR)

The IQR is a measure of spread that is often used to detect outliers


  • Take the difference between \(Q_1\) and \(Q_3\):


\[\text{IQR} = \text{Q}_3 - \text{Q}_1\]


Notice the IQR contains the middle \(50\%\) of the data



  • To find outliers with the IQR we use the IQR Method


  1. Define outlier boundaries:


\[\text{Lower Outlier Boundary} = \text{Q}_1 - 1.5*\text{IQR}\]


\[\text{Upper Outlier Boundary} = \text{Q}_3 + 1.5*\text{IQR}\]


  2. Check to see if any data values fall outside of these boundaries:


\[x < \text{Lower Outlier Boundary} \quad \text{or} \quad x > \text{Upper Outlier Boundary}\]



Recall our five-number summary of the FMD data:


\[ \begin{array}{|c|c|c|c|c|} \hline \text{Min} & \text{Q}_1 & \text{Median} & \text{Q}_3 & \text{Max} \\ \hline 90 & 124 & 165 & 182 & 196\\ \hline \end{array} \]


The IQR for the dataset is:

\[\text{IQR} = \text{Q}_3 - \text{Q}_1 = 182 - 124 = 58\]


  • Defining the outlier boundaries:


\[\text{Q}_3 + 1.5*\text{IQR} = 182 + (1.5*58)=269\]


\[\text{Q}_1 - 1.5*\text{IQR} = 124-(1.5*58)=37\]


Any value in the dataset \(>269\) or \(<37\) is an outlier
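
The IQR Method can be sketched in Python as below. The quartiles are taken as given from the five-number summary rather than recomputed, and the variable names are illustrative.

```python
# Infected counts from the FMD outbreak table.
outbreaks = [192, 152, 90, 124, 178, 180, 127, 182, 196, 118]

# Quartiles taken from the five-number summary above.
q1, q3 = 124, 182

iqr = q3 - q1                  # spread of the middle 50% of the data
lower = q1 - 1.5 * iqr         # lower outlier boundary
upper = q3 + 1.5 * iqr         # upper outlier boundary

# A value is flagged if it falls below the lower or above the upper boundary.
outliers = [x for x in outbreaks if x < lower or x > upper]

print(f"IQR = {iqr}, boundaries = ({lower}, {upper})")
print(f"outliers: {outliers}")
```

Every count in the table lies between \(37\) and \(269\), so the list of outliers comes back empty for this data set.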




Attendance QOTD


Go away