Day 4

Review


Which way is this histogram skewed?


How would you describe the “center” of this data?



Numerical summaries help us understand data sets faster


Notation is how we communicate statistical/mathematical equations:


Data values can be denoted as \(x_1,x_2,x_3,...\)

  • \(x_1\) refers to the observed value of the variable \(x\) from individual 1


Summation

  • This refers to the sum (addition) of everything contained in the expression

  • We denote this with the Greek letter \(\Sigma\)


\[\sum\limits_{i=1}^nx_i=x_1+x_2+...+x_n\]
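
As a minimal sketch of the summation above, assuming a short made-up list of values (the numbers and the name `x` are illustrative, not course data), the sum can be built term by term and checked against Python's built-in `sum`:

```python
# Hypothetical data values x_1, ..., x_n (illustrative only)
x = [4, 7, 1, 8]
n = len(x)

# Sigma notation: add x_i for i = 1, ..., n.
# Python lists are 0-indexed, so x_i is stored at x[i - 1].
total = 0
for i in range(1, n + 1):
    total += x[i - 1]

print(total)            # 20
print(total == sum(x))  # True
```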




Mean

  • Balance point (fulcrum) of the dataset

  • Population mean (denoted \(\mu\)):


\[\mu = {1 \over N}\sum\limits_{i=1}^Nx_i\]


\[\mu = {\sum\limits_{i=1}^Nx_i\over N}\]


\[N \approx 10000\]


\[\mu = {1 \over 10000}(x_1 + x_2 +x_3 + ...+x_{10000})\]


\[\mu = 1258.771\]


  • Is \(\mu\) a statistic?


Properties of the mean:

  • Common

  • Easy to interpret

  • Susceptible to outliers

    • A statistic is resistant if its value is not affected heavily by outliers




Median

Middle value: half the data are below and half are above


\[ \begin{array}{|c|c|c|c|c|c|c|c|c|c|} \hline 1149 & 1577 & 1138 & 1319 & 1399 & 1074 & 1091 & 1324 & 1048 & 1462 \\ \hline \end{array} \]


Sort your data in increasing order (low to high)


\[ \begin{array}{|c|c|c|c|c|c|c|c|c|c|} \hline 1048 & 1074 & 1091 & 1138 & 1149 & 1319 & 1324 & 1399 & 1462 & 1577 \\ \hline \end{array} \]


If \(n\) is odd: choose position \({(n+1)\over2}\) in the ordered dataset


If \(n\) is even, after ordering:

  • Pick positions \({n \over 2}\) and \({n \over 2}+1\)

  • Average the two data points


\[{10\over2}=5 \ , \ {10 \over2}+1=6\]


\[{\ \ \ + \ \ \ \over2}= \ \ \]


Properties of median:

  • It doesn’t use all of the data directly

  • This makes it resistant

    • Outliers have little/no effect
  • The difference between the median and the mean depends on the skew of the histogram
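
One way to see what "resistant" means is to compute both statistics before and after adding an extreme value. The sketch below uses the ten values from the median example; the outlier `9999` is a made-up value chosen only for illustration.

```python
import statistics

# The ten values from the median example above
data = [1149, 1577, 1138, 1319, 1399, 1074, 1091, 1324, 1048, 1462]

# n = 10 is even, so the median averages positions 5 and 6 of the sorted data
ordered = sorted(data)
n = len(ordered)
median_by_hand = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

mean_before = statistics.mean(data)
median_before = statistics.median(data)

# Hypothetical outlier: swap the largest value for 9999 (made up for illustration)
with_outlier = ordered[:-1] + [9999]
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print("median by hand:", median_by_hand)
print("mean before/after:", mean_before, mean_after)
print("median before/after:", median_before, median_after)
```

In this sketch the mean moves a long way toward the outlier while the median does not move at all, since the two middle positions are unchanged.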




Mode

The most frequent observation


Where the peak is


  • What could be the mode here?



Useful for qualitative data


Not as useful for quantitative data


A data set can have any number of modes \((0,1,2,...)\)


\[ \begin{array}{|c|c|c|c|c|c|c|c|c|} \hline 8 & 10 & 7 & 11 & 14 & 12 & 11 & 9 & 14 \\ \hline 11 & 8 & 2 & 8 & 10 & 7 & 12 & 11 & 8 \\ \hline \end{array} \]


Whatever method works best for you is the one you use

  • Personally, I prefer to make a frequency table:


\[ \begin{array}{|c|c|c|c|c|c|c|c|c|} \hline \text{Value} & 2 & 7 & 8 & 9 & 10 & 11 & 12 & 14 \\ \hline \text{Frequency} & 1 & 2 & 4 & 1 & 2 & 4 & 2 & 2 \\ \hline \end{array} \]
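
The frequency-table approach can also be sketched in code. This is only one possible way to do it, using `collections.Counter` from the Python standard library on the 18 values shown above; the variable names are illustrative.

```python
from collections import Counter

# The 18 observations from the table above
values = [8, 10, 7, 11, 14, 12, 11, 9, 14,
          11, 8, 2, 8, 10, 7, 12, 11, 8]

# Build the frequency table: value -> number of times it appears
freq = Counter(values)
for value in sorted(freq):
    print(value, freq[value])

# The mode(s) are the value(s) with the highest frequency
highest = max(freq.values())
modes = sorted(v for v, count in freq.items() if count == highest)
print(modes)  # [8, 11]
```

Two values tie for the highest frequency here, so this data set has two modes.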




Questions?




Goals for Today:

  1. Introduce the measures of spread: Range, Variance, and Standard Deviation

  2. Describe the Empirical rule




Numerical Summaries of Data


Measures of Spread

Statistics is roughly about making inference from data


  • Is it fair to make inference from mean/median/mode alone?

    • Is the mean resistant?

    • Is the median representative of all the data?

    • Does the mode tell us about outliers?


  • When we look at data there’s a certain spread to it:




  • Spread is an important metric for understanding the differences or variation in data



Range

Difference between the largest and smallest data value

  • Simplest measure of spread


\[\text{Range} = \text{Maximum} - \text{Minimum}\]



Calculate the mean and range of the aboveground biomass for Clover Species A and Clover Species B in this data set:


\[ \begin{array}{|c|c|c|c|c|c|c|c|c|c|c|} \hline \text{Species A} & 3.11 & 2.95 & 2.00 & 2.62 & 3.34 & 3.41 & 2.13 & 2.81 & 1.80 & 2.89 \\ \hline \text{Species B} & 3.43 & 3.33 & 3.03 & 3.40 & 3.32 & 3.13 & 2.81 & 3.04 & 3.38 & 3.24 \\ \hline \end{array} \]
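
If you want to check your hand calculation, the sketch below computes the mean and range for each species from the table. The helper names `mean` and `value_range` are made up for this example.

```python
# Aboveground biomass values from the table above
species_a = [3.11, 2.95, 2.00, 2.62, 3.34, 3.41, 2.13, 2.81, 1.80, 2.89]
species_b = [3.43, 3.33, 3.03, 3.40, 3.32, 3.13, 2.81, 3.04, 3.38, 3.24]

def mean(values):
    """Sum of the values divided by how many there are."""
    return sum(values) / len(values)

def value_range(values):
    """Range = Maximum - Minimum."""
    return max(values) - min(values)

for name, values in [("Species A", species_a), ("Species B", species_b)]:
    print(name, "mean:", round(mean(values), 3),
          "range:", round(value_range(values), 2))
```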







What do our results mean?


With every measure/metric there’s good and bad

  • Range does let us look at spread

    • It’s imperfect



These two data sets could have the same range and mean

  • Are they the same data set if that’s true?



  • With a smaller spread, values are usually closer to the mean



  • With a larger spread, values are usually further from the mean
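
To make the idea concrete, here is a small sketch with two made-up data sets (the values are hypothetical, chosen so the mean and range match). The average distance from the mean is used only as an informal way to see the difference in spread.

```python
# Two made-up data sets (hypothetical values, for illustration only)
clustered = [1, 5, 5, 5, 9]
spread_out = [1, 1, 5, 9, 9]

def summarize(values):
    """Return (mean, range, average distance from the mean)."""
    mean = sum(values) / len(values)
    value_range = max(values) - min(values)
    # Informal look at spread: how far is a typical value from the mean?
    avg_distance = sum(abs(v - mean) for v in values) / len(values)
    return mean, value_range, avg_distance

print(summarize(clustered))   # (5.0, 8, 1.6)
print(summarize(spread_out))  # (5.0, 8, 3.2)
```

Both sets share the same mean and range, yet the values in the second set sit further from the mean, which the range alone cannot show.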




Variance

Measure of how far, on average, values in a data set are from the mean


  • Population mean (denoted \(\mu\)):


\[\mu = {1 \over N}\sum\limits_{i=1}^Nx_i\]

\[\mu = {\sum\limits_{i=1}^Nx_i\over N}\]


  • Sample mean (denoted \(\bar{x}\))

\[\bar{x} = {1 \over n}\sum\limits_{i=1}^nx_i\]

\[\bar{x} = {\sum\limits_{i=1}^nx_i\over n}\]


Let \(x_1,x_2,...,x_N\) be the values in a population of size \(N\)

  • The difference between the \(i^{th}\) population value and the mean is:


\[x_i - \mu\]


  • We want to add up these differences for \(i = 1\) to \(N\) and divide the total by \(N\)


There’s a problem though:



  • Positive and negative differences can cancel out

  • We can fix that by squaring the value of each difference:


\[(x_i - \mu)^2\]


Given this:

  • Variance should never be negative

  • Zero or positive


Larger variance means more variability
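
A short derivation, using only the definition of \(\mu\) given earlier, suggests why the unsquared differences cannot measure spread: they always add up to zero, no matter how spread out the data are.

\[\sum\limits_{i=1}^N(x_i-\mu)=\sum\limits_{i=1}^Nx_i-N\mu=N\mu-N\mu=0\]

The middle step uses \(\sum\limits_{i=1}^Nx_i=N\mu\), which is the population mean formula with both sides multiplied by \(N\).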





  • Population variance (denoted \(\sigma^2\)):


\[\sigma^2 = {{{\sum\limits_{i=1}^N}(x_i-\mu)^2}\over N}\]


  • Sample variance (denoted \(s^2\)):


\[s^2 = {{{\sum\limits_{i=1}^n}(x_i-\bar{x})^2}\over (n-1)}\]


Remember that statistics is roughly:

Making inference about a population parameter using sample statistics


In practice we almost never have a population variance

  • We use a sample variance to estimate population variance
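
The only difference between the two formulas is the divisor. The sketch below applies both to a small made-up sample (the five values are hypothetical) and compares the results with `pvariance` and `variance` from Python's `statistics` module.

```python
import statistics

# Hypothetical sample of n = 5 values (made up for illustration)
sample = [2, 4, 4, 6, 9]
n = len(sample)
x_bar = sum(sample) / n

# Squared differences from the mean: (x_i - x_bar)^2
squared_diffs = [(x - x_bar) ** 2 for x in sample]

divide_by_n = sum(squared_diffs) / n                # population formula (sigma^2)
divide_by_n_minus_1 = sum(squared_diffs) / (n - 1)  # sample formula (s^2)

print(divide_by_n, statistics.pvariance(sample))
print(divide_by_n_minus_1, statistics.variance(sample))
```

Dividing by \(n-1\) always gives a slightly larger value than dividing by \(n\), and the gap shrinks as the sample gets bigger.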




Given the data on gas prices in Kansas below

\[ \begin{array}{|c|c|c|c|c|c|c|c|} \hline \text{Year} & 2004 & 2005 & 2006 & 2007 & 2008 & 2009 & 2010 \\ \hline \text{Price} & 1.347 & 1.737 & 2.026 & 2.298 & 2.651 & 1.802 & 2.222 \\ \hline \end{array} \]


\[\bar{x}=2.012\]


  1. Calculate the sample variance

\[s^2 = {{{\sum\limits_{i=1}^n}(x_i-\bar{x})^2}\over (n-1)}\]





  2. Attempt to calculate the sample variance without the squaring step

\[s^2 = {{{\sum\limits_{i=1}^n}(x_i-\bar{x})}\over (n-1)}\]





  3. Attempt to calculate the sample variance by adding up all of the points:

\[\sum\limits_{i=1}^n x_i\]


Then subtracting that sum from the sample mean (\(\bar{x}\)), squaring the result, and dividing by \(n-1\)








Standard Deviation

Variance is measured in squared units of the data

  • It’s annoying to think in squares



The units for the variance we calculated in the previous example are \(\text{Dollars}^2\)

  • This is an easy fix:


\[\sqrt{\text{Dollars}^2}=\text{Dollars}\]


This is called standard deviation

  • It lets us work in the original units of the data


The notation is simple enough as well:


\[\sqrt{\sigma^2}=\sigma \rightarrow \text{Population Standard Deviation}\]


\[\sqrt{s^2}=s \rightarrow \text{Sample Standard Deviation}\]


In the previous example we calculated variance

  • What would be the standard deviation?


\[ \begin{array}{|c|c|c|c|c|c|c|c|} \hline \text{Year} & 2004 & 2005 & 2006 & 2007 & 2008 & 2009 & 2010 \\ \hline \text{Price} & 1.347 & 1.737 & 2.026 & 2.298 & 2.651 & 1.802 & 2.222 \\ \hline \end{array} \]


\[\bar{x}=2.012\]
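
For checking hand calculations, here is a minimal sketch that applies the sample variance formula to the gas prices and then takes the square root. It also evaluates the "no squaring" version for comparison; the variable names are illustrative.

```python
import math

# Kansas gas prices, 2004 to 2010, from the table above
prices = [1.347, 1.737, 2.026, 2.298, 2.651, 1.802, 2.222]
n = len(prices)
x_bar = sum(prices) / n

# Sample variance: squared differences from x_bar, divided by n - 1
s_squared = sum((x - x_bar) ** 2 for x in prices) / (n - 1)

# Sample standard deviation: square root of the sample variance
s = math.sqrt(s_squared)

# Same formula without the squaring step (differences cancel out)
no_square = sum(x - x_bar for x in prices) / (n - 1)

print(round(x_bar, 3))      # 2.012
print(round(s_squared, 4))  # 0.1827 (dollars squared)
print(round(s, 4))          # 0.4275 (dollars)
print(abs(no_square) < 1e-9)  # True: zero up to rounding error
```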



How does that compare to your answer in question (2) of the example problem?

Attempt to calculate the sample variance without the squaring step




Empirical Rule

Many data sets have a single peak in the center and an approximately symmetric shape

  • We call this a bell-shape



When we see this distribution:

  • We can use standard deviation to describe how much of the data is within a certain range of the mean



The Empirical Rule


For a population that has an approximately bell-shaped distribution:

  • \(\approx 68\%\) of the data is within ONE standard deviation of the mean


\[\approx 68\% = \begin{cases} \mu - \sigma \\ \mu + \sigma \end{cases}\]



  • \(\approx 95\%\) of the data is within TWO standard deviations of the mean


\[\approx 95\% = \begin{cases} \mu - 2\sigma \\ \mu + 2\sigma \end{cases}\]



  • All or almost all (\(\approx 100\%\)) of the data is within THREE standard deviations of the mean


\[\approx 100\% = \begin{cases} \mu - 3\sigma \\ \mu + 3\sigma \end{cases}\]
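
The rule can be checked roughly by simulation. The sketch below is an illustration under assumptions: it generates a bell-shaped data set with `random.gauss`, using made-up values of 100 for the center and 15 for the spread, then counts the share of values within one, two, and three standard deviations of the mean.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Simulated bell-shaped population; 100 and 15 are made-up values
population = [random.gauss(100, 15) for _ in range(100_000)]

mu = statistics.fmean(population)
sigma = statistics.pstdev(population)  # population standard deviation

def fraction_within(k):
    """Fraction of values inside (mu - k*sigma, mu + k*sigma)."""
    inside = sum(1 for x in population if mu - k * sigma < x < mu + k * sigma)
    return inside / len(population)

for k in (1, 2, 3):
    print(k, round(fraction_within(k), 3))
```

The printed fractions should land close to the 68%, 95%, and nearly 100% figures above, though they will not match exactly because the data are random.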






A researcher for KDWP records the weight of 250 raccoons that have been captured as part of a culling effort. The raccoons had sample mean weight of \(\bar{x} = 15\) pounds and sample standard deviation \(s = 2\) pounds. The histogram is approximately bell-shaped.


  1. Find an interval that is likely to contain approximately \(68\%\) of the weights


  2. Approximately what percentage of the raccoons were between \(11\) and \(19\) pounds?


  3. Approximately how many raccoons were between \(11\) and \(19\) pounds?
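
One way to organize the work for this problem is sketched below. It assumes the Empirical Rule applies because the histogram is approximately bell-shaped, and the helper name `interval` is made up for the example.

```python
x_bar = 15  # sample mean weight in pounds (from the problem)
s = 2       # sample standard deviation in pounds (from the problem)
n = 250     # number of raccoons (from the problem)

def interval(k):
    """Interval within k standard deviations of the mean."""
    return (x_bar - k * s, x_bar + k * s)

# About 68% of weights fall within ONE standard deviation
print(interval(1))  # (13, 17)

# 11 to 19 pounds is within TWO standard deviations, so about 95%
print(interval(2))  # (11, 19)
percent = 95

# Approximate number of raccoons in that interval
approx_count = percent * n / 100
print(approx_count)  # 237.5, so roughly 237 or 238 raccoons
```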




Attendance QOTD

\[ \begin{array}{|c|c|c|c|c|c|c|} \hline 12 & 8 & 7 & 5 & 6 & 4 & 15 \\ \hline \end{array} \]


Go away