Day 6

Review


The Empirical Rule

For a population that has an approximately bell-shaped distribution:

  • \(\approx 68\%\) of the data is within ONE standard deviation of the mean

  • \(\approx 95\%\) of the data is within TWO standard deviations of the mean

  • \(\approx\) All or almost all of the data is within THREE standard deviations of the mean
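
A quick check of these percentages: the sketch below assumes an idealized, perfectly normal (bell-shaped) population and uses Python's standard library. The helper name `proportion_within` is ours, not from the course materials. Real data that is only approximately bell-shaped will match less closely.

```python
from statistics import NormalDist

def proportion_within(k, mu=0.0, sigma=1.0):
    """Proportion of a normal population within k standard deviations of the mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

# One, two, and three standard deviations from the mean
for k in (1, 2, 3):
    print(f"within {k} SD: {proportion_within(k):.1%}")
```

For the ideal curve this gives roughly 68.3%, 95.4%, and 99.7%, which is where the rounded figures above come from.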



z-scores

Let \(x\) be a value from a population with mean \(\mu\) and standard deviation \(\sigma\)


  • The z-score is:


\[z={x-\mu \over \sigma}\]


  • For a sample:


\[z={x-\bar{x} \over s}\]


The \(z\)-score of a data value \(x\) is the number of standard deviations \(x\) is from the mean of the data set


  • \(z < 0 \Rightarrow\) the value of \(x\) is less than the mean


  • \(z = 0 \Rightarrow\) the value of \(x\) is equal to the mean


  • \(z > 0 \Rightarrow\) the value of \(x\) is greater than the mean
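
A minimal sketch of the z-score formula. The function name `z_score` and the numbers (mean 100, standard deviation 15) are made up for illustration. Pass \(\mu\) and \(\sigma\) for a population, or \(\bar{x}\) and \(s\) for a sample.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / sd

# Hypothetical values: mean 100, standard deviation 15
print(z_score(130, 100, 15))  # 2.0  -> above the mean
print(z_score(100, 100, 15))  # 0.0  -> equal to the mean
print(z_score(85, 100, 15))   # -1.0 -> below the mean
```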


Z-Scores and the Empirical Rule:


  • \(\approx 68\%\) of the data will be between \(z=-1\) and \(z=1\)


  • \(\approx 95\%\) of the data will be between \(z=-2\) and \(z=2\)


  • \(\approx 100\%\) of the data will be between \(z=-3\) and \(z=3\)



Quartiles

Every data set has three quartiles:


\(1^{st}\) quartile, denoted \(\textbf{Q}_1\), separates the lowest \(25\%\) of the data from the highest \(75\%\)


\(2^{nd}\) quartile, denoted \(\textbf{Q}_2\), separates the lowest \(50\%\) of the data from the highest \(50\%\) (\(\text{Q}_2 = \text{Median}\))


\(3^{rd}\) quartile, denoted \(\textbf{Q}_3\), separates the lowest \(75\%\) of the data from the highest \(25\%\)



Percentiles

For a number \(p\) between \(1\) and \(99\), the \(p^{th}\) percentile separates the lowest \(p\%\) of the data from the highest \((100-p)\%\)

  • Quartiles separate data into \(4\) parts

    • Each part is \(\approx 25\%\) of the data
  • Percentiles divide the data set into \(100\) parts
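
A sketch of computing quartiles and percentiles with Python's `statistics.quantiles`. The data set is hypothetical and the helper `percentile` is our own name. Textbooks and software use slightly different conventions for locating cut points, so values computed by hand with the course's method may differ a little from the library's default ("exclusive") method.

```python
from statistics import quantiles

# Hypothetical data set, made up for illustration
data = [2, 4, 4, 5, 7, 8, 9, 10, 12, 15, 18]

# n=4 returns the three cut points Q1, Q2, Q3
q1, q2, q3 = quantiles(data, n=4)
print(q1, q2, q3)  # 4.0 8.0 12.0

def percentile(values, p):
    """p-th percentile for an integer p from 1 to 99 (library default method)."""
    if not 1 <= p <= 99:
        raise ValueError("p must be between 1 and 99")
    # n=100 returns 99 cut points; the p-th one sits at index p - 1
    return quantiles(values, n=100)[p - 1]

print(percentile(data, 50))  # 8.0, the same as Q2 (the median)
```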




Five-Number Summary

The five-number summary is a set of five measures of position computed from a data set. The summary consists of:


\[ \begin{array}{|c|c|c|c|c|} \hline \text{Min} & \text{Q}_1 & \text{Median} & \text{Q}_3 & \text{Max} \\ \hline \end{array} \]


Outliers:

An outlier is a value that is considerably larger or smaller than most of the values in a data set


Interquartile Range (IQR)

  • The IQR is a measure of spread that is often used to detect outliers

  • Compute it by subtracting \(\text{Q}_1\) from \(\text{Q}_3\):

\[\text{IQR} = \text{Q}_3 - \text{Q}_1\]

  • To find outliers with the IQR we use the IQR Method


  1. Define outlier boundaries:


\[\text{Lower Outlier Boundary} = \text{Q}_1 - 1.5 \times \text{IQR}\]


\[\text{Upper Outlier Boundary} = \text{Q}_3 + 1.5 \times \text{IQR}\]


  2. Check to see if any data value \(x\) is outside of these boundaries:


\[x < \text{Lower Outlier Boundary} \quad \text{or} \quad x > \text{Upper Outlier Boundary}\]
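
The IQR Method can be sketched in a few lines. The function names and the numbers are hypothetical, and the quartiles are assumed to be already known. A value is flagged when it falls below the lower boundary or above the upper boundary.

```python
def outlier_boundaries(q1, q3):
    """Return (lower, upper) outlier boundaries from the first and third quartiles."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(values, q1, q3):
    """Values lying below the lower boundary or above the upper boundary."""
    lower, upper = outlier_boundaries(q1, q3)
    return [x for x in values if x < lower or x > upper]

# Hypothetical example: Q1 = 10 and Q3 = 20, so IQR = 10
values = [-7, 3, 9, 12, 15, 18, 22, 40]
print(outlier_boundaries(10, 20))     # (-5.0, 35.0)
print(find_outliers(values, 10, 20))  # [-7, 40]
```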





Questions?



Goals for Today:

  1. Construct/interpret boxplots

  2. Introduce scatterplots for two quantitative variables

  3. Discuss variable correlation




Correlation, Observation, and Regression


Boxplots

A boxplot, or “box-and-whiskers” plot, is a graphical display of a five-number summary


Recall our five-number summary of the FMD data:


\[ \begin{array}{|c|c|c|c|c|} \hline \text{Min} & \text{Q}_1 & \text{Median} & \text{Q}_3 & \text{Max} \\ \hline 90 & 124 & 165 & 182 & 196\\ \hline \end{array} \]


The boxplot for this:






How to Construct a Boxplot

  1. Find the 5 values in the five-number summary

  2. Compute the IQR

  3. Find the upper & lower bounds for outliers

  4. Draw a number line to represent the scale

  5. Above the number line, draw a box with one end at \(\text{Q}_1\) and the other at \(\text{Q}_3\)

  6. Draw a vertical line across the box at the median

  7. Draw horizontal lines (“whiskers”) from the box to the smallest and largest values within the upper & lower outlier bounds

  8. Plot observations outside the bounds with a “star” (*) to identify them as outliers
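
The numbers needed to draw a boxplot can be gathered in one pass. This is a sketch under stated assumptions: `boxplot_stats` is our own helper name, the data set is made up, and the quartiles come from the library's default method, which may differ slightly from quartiles found by hand.

```python
from statistics import quantiles

def boxplot_stats(values):
    """Collect the quantities needed to draw a boxplot by hand."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr  # lower outlier boundary
    upper = q3 + 1.5 * iqr  # upper outlier boundary
    inside = [x for x in data if lower <= x <= upper]
    return {
        "min": data[0], "q1": q1, "median": median, "q3": q3, "max": data[-1],
        # Whiskers stop at the most extreme values still inside the boundaries
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        # Anything outside the boundaries is plotted individually
        "outliers": [x for x in data if x < lower or x > upper],
    }

# Hypothetical data set with one unusually large value
summary = boxplot_stats([2, 4, 4, 5, 7, 8, 9, 10, 12, 15, 40])
print(summary["whisker_low"], summary["whisker_high"], summary["outliers"])  # 2 15 [40]
```

Here the upper whisker ends at 15 rather than at the maximum of 40, because 40 lies beyond the upper outlier boundary and is drawn as a separate point.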




Skewness and Boxplots

Showing the skew of data with a boxplot is relatively intuitive and mimics histograms:





This would be negatively-skewed


Which is the higher value in these plots? Median or Mean?



Positively-skewed is the opposite:





  • Median or mean, which is higher?


  • Approximately symmetric (otherwise known as?)



  • What are the circles?




Comparative Boxplots

Boxplots are extremely useful for comparing data sets on the same scale


Below is the February rainfall (in inches) in LA for each year from \(1930\) to \(1974\):


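\[ \begin{array}{|c|c|c|c|c|c|c|c|c|c|} \hline
\text{Year} & \text{Rainfall} & \text{Year} & \text{Rainfall} & \text{Year} & \text{Rainfall} & \text{Year} & \text{Rainfall} & \text{Year} & \text{Rainfall} \\ \hline
1930 & 0.45 & 1939 & 1.13 & 1948 & 1.29 & 1957 & 1.47 & 1966 & 1.51 \\ \hline
1931 & 3.25 & 1940 & 5.43 & 1949 & 1.41 & 1958 & 6.46 & 1967 & 0.11 \\ \hline
1932 & 5.33 & 1941 & 12.42 & 1950 & 1.67 & 1959 & 3.32 & 1968 & 0.49 \\ \hline
1933 & 0.00 & 1942 & 1.05 & 1951 & 1.48 & 1960 & 2.26 & 1969 & 8.03 \\ \hline
1934 & 2.04 & 1943 & 3.07 & 1952 & 0.63 & 1961 & 0.15 & 1970 & 2.58 \\ \hline
1935 & 2.23 & 1944 & 8.65 & 1953 & 0.33 & 1962 & 11.57 & 1971 & 0.67 \\ \hline
1936 & 7.25 & 1945 & 3.34 & 1954 & 2.98 & 1963 & 2.88 & 1972 & 0.13 \\ \hline
1937 & 7.87 & 1946 & 1.52 & 1955 & 0.68 & 1964 & 0.00 & 1973 & 7.89 \\ \hline
1938 & 9.81 & 1947 & 0.86 & 1956 & 0.59 & 1965 & 0.23 & 1974 & 0.14 \\ \hline
\end{array} \]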


We can compare the data from \(1930-1974\) with the data from \(1975-2019\) using boxplots:



What can you say about the shape of each dataset?


In which time period was the amount of rainfall generally greater?


On the whole, in which time period was the rainfall more variable?
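
To back up a visual reading of the first boxplot, the sketch below summarizes the \(1930-1974\) values listed above. The \(1975-2019\) values are not included in these notes, so only the first period is computed here. The variable names are ours, and the library's default quartile method may differ slightly from quartiles found by hand.

```python
from statistics import mean, median, quantiles

# February rainfall (inches), 1930-1974, read down the columns of the table
rainfall_1930_1974 = [
    0.45, 3.25, 5.33, 0.00, 2.04, 2.23, 7.25, 7.87, 9.81,   # 1930-1938
    1.13, 5.43, 12.42, 1.05, 3.07, 8.65, 3.34, 1.52, 0.86,  # 1939-1947
    1.29, 1.41, 1.67, 1.48, 0.63, 0.33, 2.98, 0.68, 0.59,   # 1948-1956
    1.47, 6.46, 3.32, 2.26, 0.15, 11.57, 2.88, 0.00, 0.23,  # 1957-1965
    1.51, 0.11, 0.49, 8.03, 2.58, 0.67, 0.13, 7.89, 0.14,   # 1966-1974
]

q1, q2, q3 = quantiles(rainfall_1930_1974, n=4)
five_number = (min(rainfall_1930_1974), q1, q2, q3, max(rainfall_1930_1974))
print("five-number summary:", five_number)
print("IQR:", round(q3 - q1, 3))

# A mean well above the median points toward positive (right) skew
print("mean > median:", mean(rainfall_1930_1974) > median(rainfall_1930_1974))
```

Running the same summary on the second period would let the centers (medians) and spreads (IQRs) be compared number to number.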




Correlation and Scatterplots

A basic goal of statistics is to describe the relationship between two variables measured on a sample of individuals from a given population


We know what to do with one quantitative variable

  • How about two?



We have a sample of \(20\) patients, of varying demographics, all of whom had their cholesterol levels tested during their last visit with their doctor


Two variables for each individual in the sample


  • \(x=\) age of the patient

  • \(y=\) serum cholesterol level


For the \(i^{th}\) patient, we’ll denote the observed values as:


  • \(x_i=\) the age of the \(i^{th}\) patient in years

  • \(y_i=\) the serum cholesterol level of the \(i^{th}\) patient in mg/dL


Age 67 52 57 56 60 42 58 53 52 39 54 50 62 38 54 48 64 69 53 62
Cholesterol 229 196 241 283 253 265 218 282 205 219 304 196 263 175 214 245 325 239 264 209


Our data consist of ordered pairs:


\[(x_1,y_1)=(67,229),\ldots,(x_{20},y_{20})=(62,209)\]


  • Data that consist of ordered pairs are called bivariate data
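
As a small illustration, the two rows of the table can be zipped into ordered pairs. The variable names are our own choice.

```python
# Values copied from the table above
ages = [67, 52, 57, 56, 60, 42, 58, 53, 52, 39,
        54, 50, 62, 38, 54, 48, 64, 69, 53, 62]
cholesterol = [229, 196, 241, 283, 253, 265, 218, 282, 205, 219,
               304, 196, 263, 175, 214, 245, 325, 239, 264, 209]

# Pair the i-th age with the i-th cholesterol level
pairs = list(zip(ages, cholesterol))
print(len(pairs))           # 20
print(pairs[0], pairs[-1])  # (67, 229) (62, 209)
```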


How are \(x\) and \(y\) related in this data?


  • What happens to our cholesterol as we get older?


  • In our head we may have an idea, but the way we can see this visually is called a scatterplot



What do we think of the relationship between \(x\) and \(y\)?

  • We could even describe this relationship with a line

    • We call that a linear association




Scatterplot Definitions

For any two variables we can define their relationship as a:


  • Positive association if large values of one variable are associated with large values of the other


  • Negative association if large values of one variable are associated with small values of the other


Two variables have a linear relationship if the data tend to cluster around a straight line when plotted on a scatterplot
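
One common way to put a number on the direction and strength of a linear association is the Pearson correlation coefficient \(r\). It has not been defined in these notes, so treat the sketch below as a preview under that assumption. The function name `pearson_r` is ours. The sign of \(r\) matches the definitions above: positive for a positive association, negative for a negative one.

```python
from math import sqrt

# Age and cholesterol values for the 20 patients
ages = [67, 52, 57, 56, 60, 42, 58, 53, 52, 39,
        54, 50, 62, 38, 54, 48, 64, 69, 53, 62]
cholesterol = [229, 196, 241, 283, 253, 265, 218, 282, 205, 219,
               304, 196, 263, 175, 214, 245, 325, 239, 264, 209]

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

r = pearson_r(ages, cholesterol)
direction = "positive" if r > 0 else "negative" if r < 0 else "no linear"
print(f"r = {r:.2f} ({direction} association)")
```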





Attendance QOTD


Go away