Day 4

Review

Which way is this histogram skewed?

How would you describe the “center” of this data?

Numerical summaries help us understand data sets faster

Notation is how we communicate statistical/mathematical equations:

Data values can be denoted as $x_1,x_2,x_3,...$

$x_1$ refers to the observed value of the variable $x$ from individual 1

Summation

This refers to the sum (addition) of everything contained in the expression
We denote it with the Greek letter $\Sigma$

\[\sum\limits_{i=1}^nx_i=x_1+x_2+...+x_n\]
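A minimal Python sketch of what the summation notation says, assuming a small made-up list `x` (these values are illustrative and not part of the course data):

```python
# Hypothetical data values x_1, ..., x_n (not from the course data set).
x = [3, 7, 1, 9]

# Sum over i = 1..n of x_i, written as an explicit loop.
total = 0
for x_i in x:
    total += x_i

# The built-in sum() performs the same addition.
assert total == sum(x)
print(total)
```

The loop plays the role of the index $i$ running from $1$ to $n$.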

Mean

Balance point (fulcrum) of the dataset
Population mean (denoted $\mu$):

\[\mu = {1 \over N}\sum\limits_{i=1}^Nx_i\]

\[\mu = {\sum\limits_{i=1}^Nx_i\over N}\]

\[N \approx 10000\]

\[\mu = {1 \over 10000}(x_1 + x_2 +x_3 + ...+x_{10000})\]

\[\mu = 1258.771\]
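The population behind $\mu = 1258.771$ is not listed here, so the sketch below uses a made-up population of $N = 5$ values (the name `population` and its contents are assumptions) to show the same formula at a smaller scale:

```python
# Hypothetical population of N = 5 values (illustrative only).
population = [4, 8, 15, 16, 23]

N = len(population)
mu = sum(population) / N  # (1 / N) times the sum of the x_i

print(mu)
```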

Is $\mu$ a statistic?

Properties of the mean:

Common
Easy to interpret
Susceptible to outliers
- A statistic is resistant if its value is not affected heavily by outliers

Median

Middle value, half the data are below and half are above

\[ \begin{array}{|c|c|c|c|c|c|c|c|c|c|} \hline 1149 & 1577 & 1138 & 1319 & 1399 & 1074 & 1091 & 1324 & 1048 & 1462 \\ \hline \end{array} \]

Sort your data in increasing order (low to high)

\[ \begin{array}{|c|c|c|c|c|c|c|c|c|c|} \hline 1048 & 1074 & 1091 & 1138 & 1149 & 1319 & 1324 & 1399 & 1462 & 1577 \\ \hline \end{array} \]

If $n$ is odd: choose position ${(n+1)\over2}$ in the ordered dataset

If $n$ is even after ordering:

Pick positions ${n\over 2}$ and ${n \over 2}+1$ in the ordered dataset
Average the two data points at those positions

\[{10\over2}=5 \ , \ {10 \over2}+1=6\]

\[{\ \ \ + \ \ \ \over2}= \ \ \]
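One way to check your answer is a short Python sketch of the odd/even rule applied to the ten values above. The variable names are illustrative; `statistics.median` from the standard library is used only as a cross-check.

```python
import statistics

values = [1149, 1577, 1138, 1319, 1399, 1074, 1091, 1324, 1048, 1462]

ordered = sorted(values)  # low to high
n = len(ordered)

if n % 2 == 1:
    # Odd n: position (n + 1) / 2, which is index (n + 1) // 2 - 1 in 0-based Python.
    median = ordered[(n + 1) // 2 - 1]
else:
    # Even n: average positions n/2 and n/2 + 1 (0-based indices n//2 - 1 and n//2).
    median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

# Cross-check against the standard library.
assert median == statistics.median(values)
print(median)
```

Note the shift by one: the slides count positions from 1, while Python lists count from 0.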

Properties of median:

It doesn’t use all of the data directly
This makes it resistant
- Outliers have little/no effect
The difference between the median and the mean depends on the skew of the histogram
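To see resistance in action, here is a hedged sketch that reuses the ten values from the median example and pretends the largest one was mistyped with an extra zero. The typo is an assumption made purely for illustration.

```python
import statistics

values = [1149, 1577, 1138, 1319, 1399, 1074, 1091, 1324, 1048, 1462]

# Assumption for illustration: 1577 is mistyped as 15770, creating an outlier.
with_outlier = [15770 if v == 1577 else v for v in values]

mean_before = statistics.mean(values)
median_before = statistics.median(values)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

The mean moves a long way toward the outlier, while the median does not move at all because the two middle values are unchanged.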

Mode

The most frequent observation

Where the peak is

What could be the mode here?

Useful for qualitative data

Not as useful for quantitative data

A data set can have any number of modes $(0,1,2,...)$

\[ \begin{array}{|c|c|c|c|c|c|c|c|c|} \hline 8 & 10 & 7 & 11 & 14 & 12 & 11 & 9 & 14 \\ \hline 11 & 8 & 2 & 8 & 10 & 7 & 12 & 11 & 8 \\ \hline \end{array} \]

Whatever method works best for you is the one you use

Personally, I prefer to make a frequency table:

\[ \begin{array}{|c|c|c|c|c|c|c|c|c|} \hline \text{Value} & 2 & 7 & 8 & 9 & 10 & 11 & 12 & 14 \\ \hline \text{Frequency} & 1 & 2 & 4 & 1 & 2 & 4 & 2 & 2 \\ \hline \end{array} \]
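If you would rather let a computer do the counting, a frequency table can be sketched in Python directly from the 18 raw values. `Counter` and `statistics.multimode` (Python 3.8+) are standard-library tools; the variable names are illustrative.

```python
from collections import Counter
import statistics

data = [8, 10, 7, 11, 14, 12, 11, 9, 14,
        11, 8, 2, 8, 10, 7, 12, 11, 8]

# Frequency table: each distinct value with how often it occurs.
counts = Counter(data)
for value in sorted(counts):
    print(value, counts[value])

# multimode() returns every value tied for the highest frequency.
modes = statistics.multimode(data)
print(modes)
```

Counting from the raw data is a useful check on a hand-built table, since tallies are easy to misplace by one column.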

Questions?

Goals for Today:

Introduce the measures of spread: Range, Variance, and Standard Deviation
Describe the Empirical Rule

Numerical Summaries of Data

Measures of Spread

Statistics is roughly about making inference from data

Is it fair to make inference from mean/median/mode alone?
- Is the mean resistant?
- Is the median representative of all the data?
- Does the mode tell us about outliers?

When we look at data there’s a certain spread to it:

Spread is an important metric for understanding the differences or variation in data

Range

Difference between the largest and smallest data value

Simplest measure of spread

\[\text{Range} = \text{Maximum} - \text{Minimum}\]

Calculate the mean and range of the aboveground biomass for Clover Species A and Clover Species B in this data set:

\[ \begin{array}{|c|c|c|c|c|c|c|c|c|c|c|} \hline \text{Species A} & 3.11 & 2.95 & 2.00 & 2.62 & 3.34 & 3.41 & 2.13 & 2.81 & 1.80 & 2.89 \\ \hline \text{Species B} & 3.43 & 3.33 & 3.03 & 3.40 & 3.32 & 3.13 & 2.81 & 3.04 & 3.38 & 3.24 \\ \hline \end{array} \]
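After working this by hand, a short Python sketch can confirm the arithmetic. The helper `mean_and_range` is a hypothetical name introduced here, not something from the course materials.

```python
species_a = [3.11, 2.95, 2.00, 2.62, 3.34, 3.41, 2.13, 2.81, 1.80, 2.89]
species_b = [3.43, 3.33, 3.03, 3.40, 3.32, 3.13, 2.81, 3.04, 3.38, 3.24]

def mean_and_range(values):
    """Return (mean, range), where range = maximum - minimum."""
    mean = sum(values) / len(values)
    spread = max(values) - min(values)
    return mean, spread

for name, values in [("Species A", species_a), ("Species B", species_b)]:
    mean, spread = mean_and_range(values)
    print(f"{name}: mean = {mean:.3f}, range = {spread:.2f}")
```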

What do our results mean?

With every measure/metric there’s good and bad

Range does let us look at spread
- It’s imperfect

These two data sets could have the same range and mean

Are they the same data set if that’s true?
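A quick sketch with two made-up data sets (both lists are hypothetical, chosen only so that the numbers work out) shows that matching mean and range does not make the data sets identical:

```python
# Two made-up data sets (hypothetical, chosen only for illustration).
clustered = [0, 5, 5, 5, 10]
spread_out = [0, 0, 5, 10, 10]

for data in (clustered, spread_out):
    mean = sum(data) / len(data)
    data_range = max(data) - min(data)
    print(data, "mean =", mean, "range =", data_range)
```

Both lists report the same mean and the same range, yet the first has most of its values sitting at the mean and the second has most of its values at the extremes.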

Smaller spread: values are usually closer to the mean

Larger spread: values are usually further from the mean

Variance

Measure of how far, on average, values in a data set are from the mean

Population mean (denoted $\mu$):

\[\mu = {1 \over N}\sum\limits_{i=1}^Nx_i\]

\[\mu = {\sum\limits_{i=1}^Nx_i\over N}\]

Sample mean (denoted $\bar{x}$)

\[\bar{x} = {1 \over n}\sum\limits_{i=1}^nx_i\]

\[\bar{x} = {\sum\limits_{i=1}^nx_i\over n}\]

Let $x_1,x_2,...,x_N$ be the values in a population of size $N$

The difference between $i^{th}$ population value and the mean is:

\[x_i - \mu\]

We want to add up these differences for $i = 1$ to $N$ and divide the total by $N$

There’s a problem though:

Positive and negative differences cancel out
We can fix that by squaring the value of each difference:

\[(x_i - \mu)^2\]
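The cancellation is not bad luck with a particular data set. Using the definition of $\mu$, so that $\sum\limits_{i=1}^Nx_i = N\mu$, the unsquared differences always add to zero:

\[\sum\limits_{i=1}^N(x_i-\mu)=\sum\limits_{i=1}^Nx_i-N\mu=N\mu-N\mu=0\]

Each squared difference $(x_i-\mu)^2$ is zero or positive, so a sum of squared differences cannot cancel this way.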

Given this:

Variance can never be negative
It is always zero or positive

Larger variance means more variability

Population variance (denoted $\sigma^2$):

\[\sigma^2 = {{{\sum\limits_{i=1}^N}(x_i-\mu)^2}\over N}\]

Sample variance (denoted $s^2$):

\[s^2 = {{{\sum\limits_{i=1}^n}(x_i-\bar{x})^2}\over (n-1)}\]

Remember that statistics is roughly:

Making inference about a population parameter using sample statistics

In practice we almost never have a population variance

We use a sample variance to estimate population variance

Given the data on gas prices in Kansas below

\[ \begin{array}{|c|c|c|c|c|c|c|c|} \hline \text{Year} & 2004 & 2005 & 2006 & 2007 & 2008 & 2009 & 2010 \\ \hline \text{Price} & 1.347 & 1.737 & 2.026 & 2.298 & 2.651 & 1.802 & 2.222 \\ \hline \end{array} \]

\[\bar{x}=2.012\]

Calculate the sample variance

\[s^2 = {{{\sum\limits_{i=1}^n}(x_i-\bar{x})^2}\over (n-1)}\]
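To check a hand calculation, the formula can be sketched in Python with the seven gas prices. The variable names are illustrative; `statistics.variance` is the standard-library function that also divides by $n-1$.

```python
import statistics

prices = [1.347, 1.737, 2.026, 2.298, 2.651, 1.802, 2.222]  # Kansas, 2004-2010

n = len(prices)
x_bar = sum(prices) / n

# Sum of squared deviations from the sample mean, divided by n - 1.
s2 = sum((x - x_bar) ** 2 for x in prices) / (n - 1)

# statistics.variance() also uses the n - 1 denominator.
assert abs(s2 - statistics.variance(prices)) < 1e-12
print(round(x_bar, 3), round(s2, 4))
```

The code keeps the unrounded mean, so a hand calculation that uses $\bar{x}=2.012$ may differ slightly in the last decimal places.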

Attempt to calculate the sample variance without the squaring step

\[s^2 = {{{\sum\limits_{i=1}^n}(x_i-\bar{x})}\over (n-1)}\]

Attempt to calculate the sample variance by adding up all of the points:

\[\sum\limits_{i=1}^n x_i\]

Then subtracting them from the sample mean ($\bar{x}$), squaring the result, and dividing by $n-1$
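The sketch below is one reading of the two "attempts" above, written in Python so you can compare with your own work. The names `no_square` and `sum_first` are made up for this illustration.

```python
prices = [1.347, 1.737, 2.026, 2.298, 2.651, 1.802, 2.222]

n = len(prices)
x_bar = sum(prices) / n

# Attempt without squaring: the deviations cancel, so the result is
# zero up to floating-point rounding, whatever the data look like.
no_square = sum(x - x_bar for x in prices) / (n - 1)

# Attempt that adds the points first: subtract the total from the mean,
# square that single number, then divide by n - 1.
sum_first = (x_bar - sum(prices)) ** 2 / (n - 1)

print(no_square, sum_first)
```

Neither number measures spread. The first is essentially zero for any data set, and the second depends on the size of the total rather than on how far individual prices sit from the mean.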

Standard Deviation

Variance is a squared unit of the data

It’s annoying to think in squares

The units for the variance we calculated in the previous example are “$\text{Dollars}^2$”

This is an easy fix:

\[\sqrt{\text{Dollars}^2}=\text{Dollars}\]

This is called standard deviation

It lets us work in the original units of the data

The notation is simple enough as well:

\[\sqrt{\sigma^2}=\sigma \rightarrow \text{Population Standard Deviation}\]

\[\sqrt{s^2}=s \rightarrow \text{Sample Standard Deviation}\]

In the previous example we calculated variance

What would be the standard deviation?

\[\bar{x}=2.012\]
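A short Python sketch for checking your answer, assuming the same seven gas prices as before. `statistics.stdev` is the standard-library sample standard deviation and is used here only as a cross-check.

```python
import math
import statistics

prices = [1.347, 1.737, 2.026, 2.298, 2.651, 1.802, 2.222]

s2 = statistics.variance(prices)  # sample variance, in dollars squared
s = math.sqrt(s2)                 # sample standard deviation, in dollars

assert abs(s - statistics.stdev(prices)) < 1e-12
print(round(s, 3))
```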

How does that compare to your answer in question (2) of the example problem?

Question (2): Attempt to calculate the sample variance without the squaring step

Empirical Rule

Many data sets have a single peak in the center and an approximately symmetric shape

We call this a bell-shape

When we see this distribution:

We can use standard deviation to describe how much of the data is within a certain range of the mean

The Empirical Rule

For a population that has an approximately bell-shaped distribution:

$\approx 68\%$ of the data is within ONE standard deviation of the mean

\[\approx 68\% = \begin{cases} \mu - \sigma \\ \mu + \sigma \end{cases}\]

$\approx 95\%$ of the data is within TWO standard deviations of the mean

\[\approx 95\% = \begin{cases} \mu - 2\sigma \\ \mu + 2\sigma \end{cases}\]

$\approx$ All or almost all of the data is within THREE standard deviations of the mean

\[\approx 100\% = \begin{cases} \mu - 3\sigma \\ \mu + 3\sigma \end{cases}\]
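The rule can be checked by simulation. The sketch below assumes a bell-shaped (normal) population with made-up values $\mu = 100$ and $\sigma = 15$; the seed, the sample size, and the helper `fraction_within` are all arbitrary choices for illustration.

```python
import random

random.seed(0)  # arbitrary seed so the run is repeatable

# Assumed bell-shaped population: mu and sigma are made-up values.
mu, sigma = 100.0, 15.0
data = [random.gauss(mu, sigma) for _ in range(100_000)]

def fraction_within(k):
    """Fraction of the simulated values within k standard deviations of mu."""
    inside = sum(1 for x in data if mu - k * sigma <= x <= mu + k * sigma)
    return inside / len(data)

for k in (1, 2, 3):
    print(k, round(fraction_within(k), 3))
```

The printed fractions should land close to the three percentages in the rule. They will not match exactly, which is why the rule is stated with $\approx$.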

A researcher for KDWP records the weights of 250 raccoons that have been captured as part of a culling effort. The raccoons had a sample mean weight of $\bar{x} = 15$ pounds and a sample standard deviation of $s = 2$ pounds. The histogram is approximately bell-shaped.

Find an interval that is likely to contain approximately $68\%$ of the weights

Approximately what percentage of the raccoons were between $11$ and $19$ pounds?

Approximately how many raccoons were between $11$ and $19$ pounds?
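A hedged Python sketch of the raccoon questions, using only the numbers given in the problem and the Empirical Rule percentages. The variable names are illustrative.

```python
x_bar, s, n = 15, 2, 250  # mean (pounds), standard deviation (pounds), raccoons

# About 68% of weights fall within one standard deviation of the mean.
interval_68 = (x_bar - s, x_bar + s)

# 11 and 19 pounds are each k standard deviations from the mean.
k = (19 - x_bar) / s
assert (x_bar - k * s, x_bar + k * s) == (11, 19)

percent = {1: 68, 2: 95}[int(k)]  # Empirical Rule percentages
approx_count = percent * n / 100

print(interval_68, percent, approx_count)
```

The count comes out as a fraction of a raccoon, so in practice you would report it as a rounded whole number and keep the word "approximately".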

Attendance QOTD

\[ \begin{array}{|c|c|c|c|c|c|c|} \hline 12 & 8 & 7 & 5 & 6 & 4 & 15 \\ \hline \end{array} \]