Day 4
Review
Which way is this histogram skewed?
How would you describe the “center” of this data?
Numerical summaries help us understand data sets faster
Notation is how we communicate statistical/mathematical equations:
Data values can be denoted as \(x_1,x_2,x_3,...\)
- \(x_1\) refers to the observed value of the variable \(x\) from individual 1
Summation
This is referring to the sum (addition) of everything contained in the expression
We denote this with the Greek letter \(\Sigma\)
\[\sum\limits_{i=1}^nx_i=x_1+x_2+...+x_n\]
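As a rough illustration of the notation, the sketch below adds up a short list one index at a time, the way \(\sum\limits_{i=1}^nx_i\) reads. The three values are hypothetical and were chosen only to keep the arithmetic small.

```python
# Hypothetical data values x_1, x_2, x_3 (not from the lecture).
values = [3, 5, 8]
n = len(values)

# Sigma notation: start at i = 1, stop at i = n, add x_i each time.
total = 0
for i in range(1, n + 1):
    total += values[i - 1]  # Python lists start at 0, so x_i is values[i - 1]

print(total)  # same result as the built-in sum(values)
```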
Mean
Balance point (fulcrum) of the dataset
Population mean (denoted \(\mu\)):
\[\mu = {1 \over N}\sum\limits_{i=1}^Nx_i\]
\[\mu = {\sum\limits_{i=1}^Nx_i\over N}\]
\[N \approx 10000\]
\[\mu = {1 \over 10000}(x_1 + x_2 +x_3 + ...+x_{10000})\]
\[\mu = 1258.771\]
- Is \(\mu\) a statistic?
Properties of the mean:
Common
Easy to interpret
Susceptible to outliers
- A statistic is resistant if its value is not affected heavily by outliers
Median
Middle value: half the data are below it and half are above it
\[ \begin{array}{|c|c|c|c|c|c|c|c|c|c|} \hline 1149 & 1577 & 1138 & 1319 & 1399 & 1074 & 1091 & 1324 & 1048 & 1462 \\ \hline \end{array} \]
Sort your data in increasing order (low to high)
\[ \begin{array}{|c|c|c|c|c|c|c|c|c|c|} \hline 1048 & 1074 & 1091 & 1138 & 1149 & 1319 & 1324 & 1399 & 1462 & 1577 \\ \hline \end{array} \]
If \(n\) is odd: choose position \({(n+1)\over2}\) in the ordered dataset
If \(n\) is even after ordering:
Pick \(n\over 2\) and \({n \over 2}+1\)
Average the two data points
\[{10\over2}=5 \ , \ {10 \over2}+1=6\]
\[{\ \ \ + \ \ \ \over2}= \ \ \]
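One way to check the steps above is a short Python sketch that sorts the ten values and applies the odd/even rule. The function name `median_by_position` is made up for this illustration; the standard library's `statistics.median` is used only as a cross-check.

```python
import statistics

# The ten values from the table above.
values = [1149, 1577, 1138, 1319, 1399, 1074, 1091, 1324, 1048, 1462]

def median_by_position(data):
    """Median using the position rule from the slides."""
    ordered = sorted(data)               # low to high
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: position (n + 1) / 2, shifted by 1 for 0-based indexing.
        return ordered[(n + 1) // 2 - 1]
    # Even n: average positions n / 2 and n / 2 + 1.
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2

print(median_by_position(values))
print(statistics.median(values))  # should agree with the line above
```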
Properties of median:
It doesn’t use all of the data directly
This makes it resistant
- Outliers have little/no effect
The difference between the median and the mean depends on the skew of the histogram
Mode
The most frequent observation
Where the peak is
- What could be the mode here?
Useful for qualitative data
Not as useful for quantitative data
A data set can have any number of modes \((0,1,2,...)\)
\[ \begin{array}{|c|c|c|c|c|c|c|c|c|} \hline 8 & 10 & 7 & 11 & 14 & 12 & 11 & 9 & 14 \\ \hline 11 & 8 & 2 & 8 & 10 & 7 & 12 & 11 & 8 \\ \hline \end{array} \]
Whatever method works best for you is the one you use
- Personally, I prefer to make a frequency table:
\[ \begin{array}{|c|c|c|c|c|c|c|c|c|} \hline \text{Value} & 2 & 7 & 8 & 9 & 10 & 11 & 12 & 14 \\ \hline \text{Frequency} & 1 & 2 & 4 & 1 & 2 & 4 & 2 & 2 \\ \hline \end{array} \]
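A frequency table can also be tallied by machine. The sketch below counts the eighteen raw values from the data table above with `collections.Counter` and reports every value tied for the highest count. It is one possible approach, not the only way to find a mode.

```python
from collections import Counter

# The eighteen raw observations from the data table above.
data = [8, 10, 7, 11, 14, 12, 11, 9, 14,
        11, 8, 2, 8, 10, 7, 12, 11, 8]

# Counter maps each distinct value to how many times it occurs.
frequency = Counter(data)
for value in sorted(frequency):
    print(value, frequency[value])

# A data set can have more than one mode, so keep every value
# whose count ties for the maximum.
highest = max(frequency.values())
modes = sorted(v for v, count in frequency.items() if count == highest)
print("modes:", modes)
```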
Questions?
Goals for Today:
Introduce the measures of spread: Range, Variance, and Standard Deviation
Describe the Empirical rule
Numerical Summaries of Data
Measures of Spread
Statistics is roughly about making inference from data
Is it fair to make inference from mean/median/mode alone?
Is the mean resistant?
Is the median representative of all the data?
Does the mode tell us about outliers?
- When we look at data there’s a certain spread to it:
- Spread is an important metric for understanding the differences or variation in data
Range
Difference between the largest and smallest data value
- Simplest measure of spread
\[\text{Range} = \text{Maximum} - \text{Minimum}\]
Calculate the mean and range of the aboveground biomass for Clover Species A and Clover Species B in this data set:
\[ \begin{array}{|c|c|c|c|c|c|c|c|c|c|c|} \hline \text{Species A} & 3.11 & 2.95 & 2.00 & 2.62 & 3.34 & 3.41 & 2.13 & 2.81 & 1.80 & 2.89 \\ \hline \text{Species B} & 3.43 & 3.33 & 3.03 & 3.40 & 3.32 & 3.13 & 2.81 & 3.04 & 3.38 & 3.24 \\ \hline \end{array} \]
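To check hand calculations for this exercise, a minimal Python sketch is given below. The helper names `mean` and `value_range` are made up for this illustration.

```python
# Aboveground biomass values from the table above.
species_a = [3.11, 2.95, 2.00, 2.62, 3.34, 3.41, 2.13, 2.81, 1.80, 2.89]
species_b = [3.43, 3.33, 3.03, 3.40, 3.32, 3.13, 2.81, 3.04, 3.38, 3.24]

def mean(data):
    """Sample mean: sum of the values divided by how many there are."""
    return sum(data) / len(data)

def value_range(data):
    """Range = Maximum - Minimum."""
    return max(data) - min(data)

for name, data in [("Species A", species_a), ("Species B", species_b)]:
    print(f"{name}: mean = {mean(data):.3f}, range = {value_range(data):.2f}")
```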
What do our results mean?
With every measure/metric there’s good and bad
Range does let us look at spread
- It’s imperfect
These two data sets could have the same range and mean
- Are they the same data set if that’s true?
- With a smaller spread, values are usually closer to the mean
- With a larger spread, values are usually further from the mean
Variance
Measure of how far, on average, values in a data set are from the mean
- Population mean (denoted \(\mu\)):
\[\mu = {1 \over N}\sum\limits_{i=1}^Nx_i\]
\[\mu = {\sum\limits_{i=1}^Nx_i\over N}\]
- Sample mean (denoted \(\bar{x}\))
\[\bar{x} = {1 \over n}\sum\limits_{i=1}^nx_i\]
\[\bar{x} = {\sum\limits_{i=1}^nx_i\over n}\]
Let \(x_1,x_2,...,x_N\) be the values in a population of size \(N\)
- The difference between \(i^{th}\) population value and the mean is:
\[x_i - \mu\]
- We want to add up these differences for \(i = 1\) to \(N\) and divide the total by \(N\)
There’s a problem though:
Positive and negative differences can cancel out
We can fix that by squaring the value of each difference:
\[(x_i - \mu)^2\]
Given this:
Variance should never be negative
Zero or positive
Larger variance means more variability
- Population variance (denoted \(\sigma^2\)):
\[\sigma^2 = {{{\sum\limits_{i=1}^N}(x_i-\mu)^2}\over N}\]
- Sample variance (denoted \(s^2\)):
\[s^2 = {{{\sum\limits_{i=1}^n}(x_i-\bar{x})^2}\over (n-1)}\]
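The two formulas differ only in which mean is subtracted and in the divisor. The sketch below writes both out on a small hypothetical data set (eight values chosen so the mean is a whole number) and cross-checks them against `statistics.pvariance` and `statistics.variance` from the Python standard library.

```python
import statistics

# Hypothetical values, not from the lecture; their mean is exactly 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

def population_variance(values):
    """sigma^2: average squared distance from the mean, dividing by N."""
    N = len(values)
    mu = sum(values) / N
    return sum((x - mu) ** 2 for x in values) / N

def sample_variance(values):
    """s^2: same sum of squared distances, but dividing by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

print(population_variance(data), statistics.pvariance(data))
print(sample_variance(data), statistics.variance(data))
```

Treating the same numbers as a sample gives a slightly larger value than treating them as a population, because the divisor is smaller.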
Remember that statistics is roughly:
Making inference about a population parameter using sample statistics
In practice we almost never have a population variance
- We use a sample variance to estimate population variance
Given the data on gas prices in Kansas below
\[ \begin{array}{|c|c|c|c|c|c|c|c|} \hline \text{Year} & 2004 & 2005 & 2006 & 2007 & 2008 & 2009 & 2010 \\ \hline \text{Price} & 1.347 & 1.737 & 2.026 & 2.298 & 2.651 & 1.802 & 2.222 \\ \hline \end{array} \]
\[\bar{x}=2.012\]
- Calculate the sample variance
\[s^2 = {{{\sum\limits_{i=1}^n}(x_i-\bar{x})^2}\over (n-1)}\]
- Attempt to calculate the sample variance without the squaring step
\[s^2 = {{{\sum\limits_{i=1}^n}(x_i-\bar{x})}\over (n-1)}\]
- Attempt to calculate the sample variance by adding up all of the points:
\[\sum\limits_{i=1}^n x_i\]
Then subtract that sum from the sample mean (\(\bar{x}\)), square the result, and divide by \(n-1\)
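The three calculations in this example can be compared side by side with a short Python sketch. The variable names are made up for this illustration, and the mean is recomputed from the data rather than using the rounded \(\bar{x}=2.012\), so hand answers based on the rounded mean may differ slightly in the last digits.

```python
# Kansas gas prices, 2004 through 2010, from the table above.
prices = [1.347, 1.737, 2.026, 2.298, 2.651, 1.802, 2.222]
n = len(prices)
x_bar = sum(prices) / n

# (1) Sample variance: square each difference, add, divide by n - 1.
sample_variance = sum((x - x_bar) ** 2 for x in prices) / (n - 1)

# (2) Without the squaring step: positive and negative differences
#     cancel, so this is zero apart from floating-point round-off.
no_square = sum(x - x_bar for x in prices) / (n - 1)

# (3) Summing the points first, then subtracting from the mean and
#     squaring: the order of operations matters, and this is far too big.
sum_first = (x_bar - sum(prices)) ** 2 / (n - 1)

print(f"(1) sample variance: {sample_variance:.4f}")
print(f"(2) no squaring:     {no_square:.4f}")
print(f"(3) sum first:       {sum_first:.4f}")
```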
Standard Deviation
Variance is a squared unit of the data
- It’s annoying to think in squares
The units for the variance we calculated in the previous example are “\(\text{Dollars}^2\)”
- This is an easy fix:
\[\sqrt{\text{Dollars}^2}=\text{Dollars}\]
This is called standard deviation
- It lets us work in the original units of the data
The notation is simple enough as well:
\[\sqrt{\sigma^2}=\sigma \rightarrow \text{Population Standard Deviation}\]
\[\sqrt{s^2}=s \rightarrow \text{Sample Standard Deviation}\]
In the previous example we calculated variance
- What would be the standard deviation?
\[ \begin{array}{|c|c|c|c|c|c|c|c|} \hline \text{Year} & 2004 & 2005 & 2006 & 2007 & 2008 & 2009 & 2010 \\ \hline \text{Price} & 1.347 & 1.737 & 2.026 & 2.298 & 2.651 & 1.802 & 2.222 \\ \hline \end{array} \]
\[\bar{x}=2.012\]
How does that compare to your answer in question (2) of the example problem?
Attempt to calculate the sample variance without the squaring step
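As a check on the hand calculation, the sketch below takes the square root of the sample variance for the gas price data and compares it with `statistics.stdev` from the Python standard library. The variable names are made up for this illustration.

```python
import math
import statistics

# Kansas gas prices, 2004 through 2010, from the table above.
prices = [1.347, 1.737, 2.026, 2.298, 2.651, 1.802, 2.222]
n = len(prices)
x_bar = sum(prices) / n

# Sample variance is in squared dollars.
s_squared = sum((x - x_bar) ** 2 for x in prices) / (n - 1)

# Taking the square root brings the units back to dollars.
s = math.sqrt(s_squared)

print(f"s^2 = {s_squared:.4f}")
print(f"s   = {s:.4f}")
print(f"statistics.stdev gives {statistics.stdev(prices):.4f}")
```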
Empirical Rule
Many data sets have a single peak in the center and an approximately symmetric shape
- We call this a bell-shape
When we see this distribution:
- We can use standard deviation to describe how much of the data is within a certain range of the mean
The Empirical Rule
For a population that has an approximately bell-shaped distribution:
- \(\approx 68\%\) of the data is within ONE standard deviation of the mean
\[\approx 68\% = \begin{cases} \mu - \sigma \\ \mu + \sigma \end{cases}\]
- \(\approx 95\%\) of the data is within TWO standard deviations of the mean
\[\approx 95\% = \begin{cases} \mu - 2\sigma \\ \mu + 2\sigma \end{cases}\]
- \(\approx\) All or almost all of the data is within THREE standard deviations of the mean
\[\approx 100\% = \begin{cases} \mu - 3\sigma \\ \mu + 3\sigma \end{cases}\]
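The rule can be checked roughly by simulation. The sketch below draws values from a normal distribution, which is one example of a bell-shaped distribution, and counts the share that lands within one, two, and three standard deviations of the mean. The mean of 100, the standard deviation of 15, the seed, and the number of draws are all hypothetical choices made for this illustration.

```python
import random

random.seed(0)               # fixed seed so the run is repeatable
mu, sigma = 100.0, 15.0      # hypothetical population mean and SD
draws = [random.gauss(mu, sigma) for _ in range(100_000)]

def share_within(data, mean, sd, k):
    """Fraction of data between mean - k*sd and mean + k*sd."""
    inside = sum(1 for x in data if abs(x - mean) <= k * sd)
    return inside / len(data)

shares = {k: share_within(draws, mu, sigma, k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} SD: {share:.3f}")
```

The simulated shares should land close to 68%, 95%, and nearly 100%, though they will not match exactly because the draws are random.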
A researcher for KDWP records the weight of 250 raccoons that have been captured as part of a culling effort. The raccoons had sample mean weight of \(\bar{x} = 15\) pounds and sample standard deviation \(s = 2\) pounds. The histogram is approximately bell-shaped.
- Find an interval that is likely to contain approximately \(68\%\) of the weights
- Approximately what percentage of the raccoons were between \(11\) and \(19\) pounds?
- Approximately how many raccoons were between \(11\) and \(19\) pounds?
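A minimal sketch of how the three questions could be worked through in Python follows. The lookup table `EMPIRICAL` and the function `interval` are made up for this illustration, and the 0.997 entry is the conventional figure for "all or almost all".

```python
n_raccoons = 250
x_bar = 15.0   # sample mean weight in pounds
s = 2.0        # sample standard deviation in pounds

# Approximate share of bell-shaped data within k standard deviations.
EMPIRICAL = {1: 0.68, 2: 0.95, 3: 0.997}

def interval(mean, sd, k):
    """Endpoints k standard deviations below and above the mean."""
    return (mean - k * sd, mean + k * sd)

# About 68% of the weights: one standard deviation either side.
low, high = interval(x_bar, s, 1)
print(f"about 68% of weights fall between {low} and {high} pounds")

# 11 and 19 pounds: how many standard deviations from the mean?
k = (19 - x_bar) / s             # the same distance as (x_bar - 11) / s
share = EMPIRICAL[int(k)]
print(f"11 to 19 pounds is {k:.0f} SDs either side, about {share:.0%}")

# Convert the percentage into a count of raccoons.
count = share * n_raccoons
print(f"roughly {count:.1f} of the {n_raccoons} raccoons")
```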
Attendance QOTD
\[ \begin{array}{|c|c|c|c|c|c|c|} \hline 12 & 8 & 7 & 5 & 6 & 4 & 15 \\ \hline \end{array} \]