Day 3

Review

What is data?


\[ \begin{array}{|c|c|c|c|c|} \hline \text{N} & \text{Age Class} & \text{Weight} & \text{Sex} & \text{Location} \\ \hline 1 & 0.5 & 30.8 & \text{M} & \text{B} \\ \hline 2 & 0.5 & 21.8 & \text{M} & \text{B} \\ \hline 3 & 2.5 & 47.6 & \text{M} & \text{A} \\ \hline 4 & 0.5 & 29.0 & \text{F} & \text{B} \\ \hline 5 & 2.5 & 65.8 & \text{M} & \text{A} \\ \hline \end{array} \]


How many individuals?


How many variables?


Qualitative (Categorical) variable: The value of the variable represents a descriptive category


Quantitative variable: The value of the variable represents a meaningful number


Qualitative variables can be ordinal or nominal

  • Ordinal variable: Categories/values of the variable have a natural ordering


  • Nominal variable: Categories/values of the variable cannot be ordered naturally


Quantitative variables can be discrete or continuous


  • Discrete variable: A countable number of values (0, 1, 2, 3, 4, …)


  • Continuous variable: Can take any value within a range of numbers (0, 0.1, 0.11, 0.111, …)


Quantitative variables can be categorized by level of measurement used for obtaining data values:


Interval level

  • Differences between values make sense

  • Ratios don’t make sense because zero is arbitrary; it does not represent absence of the quantity


Ratio level

  • Numerical measurement

  • Differences between values make sense

  • Ratios also make sense

  • Zero has meaning: it represents absence of the quantity




Graphics

Why would I graph anything?


When should I use a bar graph?




What about a pie chart?




Histogram: visual representation of a frequency distribution

  • Not a bar graph


Bar height (y-axis) represents class frequency

Bar width (x-axis) represents class width
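To make "class frequency" and "class width" concrete, here is a minimal Python sketch that counts how many of the Weight values from the review table fall into each class. The starting point of 20 and the class width of 10 are assumptions chosen for illustration, and `class_frequencies` is a hypothetical helper name.

```python
import math

def class_frequencies(values, start, width):
    """Count how many values fall in each class [lower, lower + width).

    Assumes every value is at least `start`.
    """
    n_classes = math.floor((max(values) - start) / width) + 1
    counts = [0] * n_classes
    for v in values:
        counts[math.floor((v - start) / width)] += 1
    return [(start + k * width, start + (k + 1) * width, counts[k])
            for k in range(n_classes)]

# Weight column from the review table; start and width are assumed.
weights = [30.8, 21.8, 47.6, 29.0, 65.8]
table = class_frequencies(weights, start=20, width=10)

for lower, upper, count in table:
    # Each '#' stands for one observation, so the row length is the bar height.
    print(f"[{lower}, {upper}): {'#' * count}")
```

Every class has the same width, and the count in each class is what the bar height would show.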



  • We care about the shape of our data

    • This is the primary purpose of a histogram

    • So we want to not fail at that task

  • The shape of our data can help us observe the distribution of our data


Symmetric - mirror image on both sides of its center

Unimodal - One peak/hump



Positively-skewed - Long, narrow tail to the right



Negatively-skewed - Long, narrow tail to the left



Uniform - box



Bimodal - two peaks/humps




  • Histograms can be used to summarize both small and large data sets

  • Sometimes we prefer more detailed visualizations for smaller data sets

  • Stem-and-leaf plots and dotplots are alternative summaries that display the actual values



Stem-and-leaf plots

  • Each observation should have at least two digits

    • The digit furthest to the right is the “leaf”

    • The digits to the left form the “stem”

  • The data:

\[ \begin{array}{|c|c|c|c|c|c|c|c|c|c|} \hline 87 & 7 & 95 & 76 & 32 & 28 & 84 & 98 & 93 & 88\\ \hline 78 & 100 & 68 & 76 & 55 & 65 & 42 & 57 & 77 & 96 \\ \hline \end{array} \]

  • The stem-and-leaf plot:

\[ \begin{array}{r|l} 0 & 7 \\ 1 & \\ 2 & 8 \\ 3 & 2 \\ 4 & 2 \\ 5 & 5 \; 7 \\ 6 & 5 \; 8 \\ 7 & 6 \; 6 \; 7 \; 8 \\ 8 & 4 \; 7 \; 8 \\ 9 & 3 \; 5 \; 6 \; 8 \\ 10 & 0 \end{array} \]
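The same plot can be built programmatically. The sketch below is one possible approach, assuming non-negative whole numbers; `stem_and_leaf` is a hypothetical helper name, not a library function.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return (stem, leaves) pairs for every stem from the smallest to the largest.

    Assumes non-negative integers.
    """
    leaves_by_stem = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # last digit is the leaf, the rest is the stem
        leaves_by_stem[stem].append(leaf)
    lo, hi = min(leaves_by_stem), max(leaves_by_stem)
    # Keep empty stems (such as 1 here) so gaps in the data stay visible.
    return [(stem, leaves_by_stem.get(stem, [])) for stem in range(lo, hi + 1)]

scores = [87, 7, 95, 76, 32, 28, 84, 98, 93, 88,
          78, 100, 68, 76, 55, 65, 42, 57, 77, 96]

plot = stem_and_leaf(scores)
for stem, leaves in plot:
    print(f"{stem:>2} | {' '.join(str(leaf) for leaf in leaves)}")
```

Sorting first means the leaves on each stem come out in increasing order, matching the plot above.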



Dotplots

  • We can represent each observation by a dot above its value on a number line

  • Data:

\[ \begin{array}{|c|c|c|c|c|} \hline 3 & 6 & 2 & 5 & 1 \\ \hline 2 & 3 & 4 & 3 & 4 \\ \hline \end{array} \]

  • Dotplot:
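A rough text version of the dotplot for this data can be sketched in Python. This is only an illustration: `text_dotplot` is a hypothetical helper, and the column alignment assumes single-digit whole numbers.

```python
from collections import Counter

def text_dotplot(values):
    """Return the lines of a text dotplot, with one '*' per observation."""
    counts = Counter(values)
    axis = range(min(values), max(values) + 1)
    lines = []
    # Build rows from the tallest stack down to the number line.
    for level in range(max(counts.values()), 0, -1):
        lines.append(" ".join("*" if counts[x] >= level else " " for x in axis))
    lines.append(" ".join(str(x) for x in axis))  # the number line itself
    return lines

data = [3, 6, 2, 5, 1, 2, 3, 4, 3, 4]
dotplot_lines = text_dotplot(data)
print("\n".join(dotplot_lines))
```

The tallest stack sits above 3, which appears three times in the data.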





Questions?



Goals for Today:

  1. Describe and calculate Mean, Median, and Mode

  2. Recognize the use and importance of Statistical/Mathematical notation



Numerical Summaries of Data


Measures of Center

Graphics are good for taking data and making it easier to view

Numerical summaries are how we take data and make it easier to understand


Statistical Inference in a nutshell

“Using sample statistics to describe population parameters”




Which way is this histogram skewed?


How would you describe the “center” of this data?


  • Mean or average: Balance point (fulcrum) of the dataset

  • Median: Half of the data points are above the median, half are below

  • Mode: Where the peak is




Mean

Sum all of the data then divide by the number of observations

  • Most commonly used metric for summarizing data


\[ \begin{array}{|c|c|c|c|c|} \hline 7 & 3 & 12 & 3 & 5 \\ \hline \end{array} \]


\[\text{Mean} \ = \ {7+3+12+3+5 \over 5} = {30 \over 5} = 6\]
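The same calculation as a short Python sketch, using only built-in functions:

```python
data = [7, 3, 12, 3, 5]

total = sum(data)   # sum all of the data
count = len(data)   # number of observations
mean = total / count

print(total, count, mean)
```

The total is 30 over 5 observations, so the mean is 6, as in the calculation above.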


If the data we calculated a mean for comes from a sample:

  • Sample mean


If the data we calculated a mean for comes from a population:

  • Population mean


Mathematical Notation Soapbox

  • “Letter math”

  • I resisted it forever

  • But trust me, it does end up being helpful


Data values can be denoted as \(x_1,x_2,x_3,...\)

  • \(x_1\) refers to the observed value of the variable \(x\) from individual 1

  • \(x\) can be anything

    • It doesn’t even have to be \(x\)

    • It’s convention, not law


Sample size (the number of individuals in the sample)

  • Denoted with \(n\) (Note: lower-case)


Population size

  • Denoted with \(N\) (Note: capital)


Summation

  • This is referring to the sum (addition) of everything contained in the expression

  • We denote this with the Greek letter \(\Sigma\)


With this notation we can describe:

\[\sum\limits_{i=1}^nx_i=x_1+x_2+...+x_n\]

  • “The sum of \(x_i\) from \(i=1\) to \(n\)”


With sigma notation we can express the sample and population mean formulas:

  • Sample mean (denoted \(\bar{x}\)):

\[{1 \over n}\sum\limits_{i=1}^nx_i\]


  • Population mean (denoted \(\mu\)):

\[{1 \over N}\sum\limits_{i=1}^Nx_i\]


Greek letters usually denote population parameters

  • Lower-case Latin letters usually denote sample statistics

  • In practice:

\[ \begin{array}{|c|c|c|c|c|c|c|c|c|c|c|} \hline \text{Student} & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 \\ \hline \text{Absences} & 2 & 6 & 1 & 2 & 4 & 0 & 1 & 3 & 0 & 2 \\ \hline \end{array} \]


\[\bar{x} = {1 \over n}\sum\limits_{i=1}^nx_i\]

\[= {1 \over 10}(x_1+x_2+x_3+x_4+x_5+x_6+x_7+x_8+x_9+x_{10})\]

\[= {1 \over 10}(2+6+1+2+4+0+1+3+0+2)\]

\[= {1 \over 10} \times 21 = {21 \over 10} = 2.1\]
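Sigma notation maps directly onto a loop. The sketch below mirrors the formula term by term for the absences data; the variable names are illustrative only.

```python
absences = [2, 6, 1, 2, 4, 0, 1, 3, 0, 2]  # x_1, x_2, ..., x_10
n = len(absences)                          # sample size

total = 0
for i in range(1, n + 1):      # i = 1, 2, ..., n
    # Python lists are 0-indexed, so x_i is stored at absences[i - 1].
    total += absences[i - 1]

x_bar = total / n              # (1 / n) times the sum
print(total, x_bar)
```

The loop plays the role of \(\Sigma\): it starts at \(i=1\), stops at \(i=n\), and adds each \(x_i\) to a running total.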


Properties of the mean:

  • Common

  • Easy to interpret

  • Susceptible to outliers

    • The average number of Super Bowl rings between me and Tom Brady is \(3.5\)

    • (As of 2021) the top \(1\%\) of households in the United States hold \(32.3\%\) of the country’s wealth, while the bottom \(50\%\) hold \(2.6\%\)

    • A statistic is resistant if its value is not affected heavily by outliers


Is the mean resistant?




Median

Middle value: half the data are below and half are above


\[ \begin{array}{|c|c|c|c|c|} \hline 7 & 3 & 12 & 3 & 5 \\ \hline \end{array} \]


Sort your data in increasing order (low to high)


\[ \begin{array}{|c|c|c|c|c|} \hline 3 & 3 & \boldsymbol{5} & 7 & 12 \\ \hline \end{array} \]


The median is 5


If \(n\) is odd: choose position \({(n+1)\over2}\) in the ordered dataset

  • So \(n=5\)

\[{(n+1)\over2}={(5+1)\over2}=3\]

We pick the \(3^{rd}\) data point after sorting


\[ \begin{array}{|c|c|c|c|c|c|} \hline 7 & 3 & 12 & 3 & 5 & 8\\ \hline \end{array} \]

If \(n\) is even, then after ordering:

  • Pick positions \({n\over 2}\) and \({n \over 2}+1\)

  • Average the two data points


\[ \begin{array}{|c|c|c|c|c|c|} \hline 3 & 3 & \boldsymbol{5} & \boldsymbol{7} & 8 & 12 \\ \hline \end{array} \]


\[n=6\]


\[{6\over 2}=3 \ ,\ {6 \over 2}+1=4\]


  • 3rd data point: \(5\)

  • 4th data point: \(7\)

\[{(5+7)\over 2}=6\]


The median is \(6\)
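Both cases can be combined into one small function. This is a sketch of the position rules above, not a replacement for a library routine; the only adjustment is subtracting 1 from each position because Python lists start at index 0.

```python
def median(values):
    """Median using the (n + 1) / 2 rule for odd n and the n / 2, n / 2 + 1 rule for even n."""
    ordered = sorted(values)   # sort low to high first
    n = len(ordered)
    if n % 2 == 1:
        position = (n + 1) // 2          # 1-based position of the middle value
        return ordered[position - 1]
    lower = ordered[n // 2 - 1]          # 1-based position n / 2
    upper = ordered[n // 2]              # 1-based position n / 2 + 1
    return (lower + upper) / 2

odd_result = median([7, 3, 12, 3, 5])
even_result = median([7, 3, 12, 3, 5, 8])
print(odd_result, even_result)
```

For the two data sets above this gives 5 and 6, matching the hand calculations.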


Properties of median:

  • It doesn’t use all of the data directly

  • This makes it resistant

    • Outliers have little/no effect
  • Sometimes a more realistic measurement:

    • Median Household Income (Kansas): \(\$57,422\)

    • Average Household Income (Kansas): \(\$77,509\)

      • Why does median make more sense than average here?
  • The difference between the median and mean depends on the skew of the histogram
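Resistance is easy to check numerically. The sketch below uses Python's standard `statistics` module on the five-value data set from earlier, then swaps the largest value for a made-up extreme one (the 1200 is hypothetical, chosen only to act as an outlier).

```python
from statistics import mean, median

original = [7, 3, 12, 3, 5]
with_outlier = [7, 3, 1200, 3, 5]  # hypothetical: 12 replaced by an extreme value

mean_before, median_before = mean(original), median(original)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

The mean moves from 6 to 243.6, while the median stays at 5.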




Mode

The most frequent observation


Useful for qualitative data

  • “Which species of Salmonella is most commonly growing in my flour?”


Not as useful for quantitative data

  • “What’s the most common weight of cattle on our research farms?”

    • Why isn’t this a helpful metric?


A data set can have any number of modes \((0,1,2,...)\)


\[ \begin{array}{|c|c|c|c|c|} \hline 3 & 3 & 5 & 7 & 12 \\ \hline \end{array} \]


The most frequently observed value in this data set is \(3\)

  • The mode is \(3\)
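Finding the mode amounts to counting how often each value occurs. The sketch below is one way to do that; `modes` is a hypothetical helper, and it assumes the convention that a data set in which no value repeats has no mode. The letter labels in the second example are made up to show a qualitative variable.

```python
from collections import Counter

def modes(values):
    """Return every most-frequent value, or an empty list if nothing repeats."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []  # assumed convention: no repeated value means no mode
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([3, 3, 5, 7, 12]))            # one mode
print(modes(["A", "B", "B", "A", "C"]))   # two modes (hypothetical labels)
print(modes([1, 2, 3]))                   # no mode
```

Returning a list covers all three cases in the notes: zero, one, or several modes.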




The below data set displays a sample of \(n=7\) observations:


\[ \begin{array}{|c|c|c|c|c|c|c|} \hline 2 & 1 & 3 & 4 & 3 & 5 & 4 \\ \hline \end{array} \]


  1. Find the mean:



  2. Find the median:



  3. Find the mode(s):





Attendance QOTD


\[ \begin{array}{|c|c|c|c|c|c|} \hline \text{Age} & \text{Sex} & \text{Body Temperature} & \text{Serum Cholesterol} & \text{Chest Pain Type} & \text{Max Heart Rate} \\ \hline 34 & \text{M} & 98.7 & 182 & 3 & 174 \\ \hline 63 & \text{M} & 97.8 & 233 & 3 & 150 \\ \hline 37 & \text{M} & 97.6 & 250 & 2 & 187 \\ \hline 41 & \text{F} & 98.2 & 204 & 1 & 172 \\ \hline 56 & \text{M} & 98.0 & 236 & 1 & 178 \\ \hline \end{array} \]



Go away