Correlation

STAT 240 - Fall 2025

Robert Sholl

Motivating Example

Divorce

  • Colorado's divorce rate has plummeted over the past two decades

  • What are they doing to produce such results?!

Milk

  • Colorado has also drastically reduced its consumption of cow’s milk!

The Solution

  • Cow’s milk must cause divorce!

Correlation

Correlation: two or more variables share a trend, either because of a shared cause or because of some unknown causal link

Correlation Analysis

Scatterplots

\[ \begin{aligned} \text{Let } & X \equiv \{\text{Milk Consumption} \} \\ & Y \equiv \{ \text{Divorce Rate} \} \\ \\ & x_i = \text{pounds of milk consumed in year } i\\ & y_i = \text{divorce rate in year } i\\ \\ & (x_i,y_i) \text{ are an "ordered pair" from bivariate data} \\ & (x_1,y_1) , ... , (x_n,y_n) \text{ are coordinates on a graph} \end{aligned} \]
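As a rough sketch of how the ordered pairs become points on a graph, the R snippet below plots a small set of made-up values. The vectors `milk_demo` and `divorce_demo` are hypothetical and are not the Colorado data used in these slides.

```r
# Hypothetical values for illustration only (NOT the Colorado data)
milk_demo    <- c(24.1, 23.5, 22.8, 22.0, 21.4, 20.6, 19.9, 19.1)  # x_i: pounds of milk
divorce_demo <- c(4.7, 4.6, 4.4, 4.4, 4.2, 4.0, 3.9, 3.7)          # y_i: divorce rate

# Each ordered pair (x_i, y_i) is drawn as one point
plot(milk_demo, divorce_demo,
     xlab = "Milk consumption (pounds)",
     ylab = "Divorce rate",
     main = "Scatterplot of ordered pairs")
```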

Scatterplots

Linear association: the relationship between two variables can be reasonably described with a line

Linear Association

Positive association: larger values of one variable relate to larger values of the other

Strength of association: how closely the points cluster around a line

Linear Association

Negative association: larger values of one variable relate to smaller values of the other

Linear Association

When the points are flat (no trend) or cannot be described with a line, they’re referred to as “lacking association” or “nonlinear”

Correlation Coefficient

  • Quantification is more reliable than ‘gut’ labeling

Correlation Coefficient

\[ z_x = \frac{x - \bar x}{s_x}, \qquad z_y = \frac{y - \bar y}{s_y} \]

Correlation Coefficient

\[ r = \frac{\sum(z_x \times z_y)}{n-1} \]

\[ r = \frac{1}{n-1}\sum_i \left(\frac{x_i - \bar x}{s_x} \right) \left(\frac{y_i - \bar y}{s_y} \right) \]

# sample size
n = length(milk)

# z-score for milk consumption
z_x = (milk - mean(milk)) / sd(milk)

# z-score for divorce rate
z_y = (divorce - mean(divorce)) / sd(divorce)

# correlation coefficient for (x, y)
sum(z_x * z_y) / (n - 1)
[1] 0.9653682
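As a sanity check on the z-score formula, the sketch below compares the hand computation with R's built-in `cor()`, which computes the Pearson correlation by default. The vectors `x` and `y` are made-up values, not the milk and divorce data.

```r
# Hypothetical data for illustration only
x <- c(2, 4, 5, 7, 9)
y <- c(1, 3, 4, 8, 9)
n <- length(x)

# Hand computation: average product of z-scores (dividing by n - 1)
z_x <- (x - mean(x)) / sd(x)
z_y <- (y - mean(y)) / sd(y)
r_manual <- sum(z_x * z_y) / (n - 1)

# Built-in Pearson correlation
r_builtin <- cor(x, y)

# The two agree up to floating-point rounding
all.equal(r_manual, r_builtin)
```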

Correlation Coefficient

Properties of \(r\)

  1. The value always satisfies \(-1 \le r \le 1\)
  • If \(r = 1\), all of the data falls on a line with a positive slope

  • If \(r = -1\), all of the data falls on a line with a negative slope

  • As \(r \rightarrow 0\) the relationship between \(x\) and \(y\) weakens

  • If \(r=0\) no linear relationship exists

  • As a rule of thumb, \(-0.6<r<0.6\) is considered a weak relationship
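The extreme cases above can be checked with a short sketch. The data are made up: `y_up` and `y_down` lie exactly on lines with positive and negative slope.

```r
x <- 1:10

# Points exactly on a line with positive slope
y_up <- 3 + 2 * x
r_up <- cor(x, y_up)

# Points exactly on a line with negative slope
y_down <- 3 - 2 * x
r_down <- cor(x, y_down)

c(r_up, r_down)  # 1 and -1, up to floating-point rounding
```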

Properties of \(r\)

  2. The correlation does not depend on the unit of measure

  3. Correlation is sensitive to outliers

  4. Correlation cannot capture nonlinear relationships

  • The formulation is strictly linear
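Each of these properties can be illustrated with a small sketch. All values below are made up, and the pounds-to-kilograms factor is only one example of a unit change.

```r
# Hypothetical, nearly linear data
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1)
r_original <- cor(x, y)

# Units: rescaling x (e.g. pounds to kilograms) leaves r unchanged
r_rescaled <- cor(x * 0.4536, y)

# Outliers: one extreme point can change r dramatically
r_outlier <- cor(c(x, 20), c(y, 0))

# Nonlinearity: a perfect parabola still gives r = 0
u <- -3:3
r_parabola <- cor(u, u^2)

c(original = r_original, rescaled = r_rescaled,
  outlier = r_outlier, parabola = r_parabola)
```

In the last case `u^2` is completely determined by `u`, yet the correlation is zero because the relationship is not linear.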

Boxplots

\[ \begin{array}{|c|c|c|c|c|} \hline \text{Min} & \text{Q}_1 & \text{Median} & \text{Q}_3 & \text{Max} \\ \hline 134 & 154 & 177 & 185 & 197\\ \hline \end{array} \]
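A five-number summary like the one above can be computed in R with `fivenum()`. The vector `obs_demo` is hypothetical: its values were picked so that the summary matches the table, and it is not the original data set. Different quartile conventions can give slightly different \(\text{Q}_1\) and \(\text{Q}_3\) on other data.

```r
# Hypothetical observations chosen to reproduce the table (NOT the original data)
obs_demo <- c(134, 150, 154, 160, 177, 180, 185, 190, 197)

# Min, Q1 (lower hinge), median, Q3 (upper hinge), max
five <- fivenum(obs_demo)
names(five) <- c("Min", "Q1", "Median", "Q3", "Max")
five
```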

Boxplots

Constructing a Boxplot

  1. Find the 5 values in the five number summary

    1. Compute the IQR

    2. Find the upper & lower bounds for outliers

  2. Draw a number line to represent the scale

  3. Above the number line, draw a box with one end at \(\text{Q}_1\) and the other at \(\text{Q}_3\)

    1. Draw a vertical line across the box at the median

Constructing a Boxplot

  4. Draw horizontal lines (“whiskers”) from the box to the smallest and largest values within the upper & lower outlier bounds

  5. Plot observations outside the bounds with a “star” (*) to identify them as outliers
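The construction steps can be sketched in R as follows. Two assumptions are made here: the outlier bounds use the common \(1.5 \times \text{IQR}\) rule, which the slides do not state explicitly, and `obs_demo` is a made-up data set with one deliberately extreme value.

```r
# Hypothetical data with one extreme value (260)
obs_demo <- c(134, 150, 154, 160, 177, 180, 185, 190, 197, 260)

# Step 1: quartiles, IQR, and outlier bounds (assumed 1.5 * IQR rule)
q1  <- quantile(obs_demo, 0.25, names = FALSE)
q3  <- quantile(obs_demo, 0.75, names = FALSE)
iqr <- q3 - q1
lower_bound <- q1 - 1.5 * iqr
upper_bound <- q3 + 1.5 * iqr

# Whiskers end at the most extreme observations inside the bounds
inside   <- obs_demo[obs_demo >= lower_bound & obs_demo <= upper_bound]
whiskers <- range(inside)

# Observations outside the bounds are flagged as outliers
outliers <- obs_demo[obs_demo < lower_bound | obs_demo > upper_bound]

# Draw the plot; outpch = 8 marks outliers with a star-like symbol
boxplot(obs_demo, horizontal = TRUE, outpch = 8)
```

Note that `boxplot()` computes its quartiles as hinges, so its box edges can differ slightly from the `quantile()` values above on some data sets.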

Skewness

Spurious Correlation

  • Correlation doesn’t depend on unit of measure

    • It also can’t take the units, or what they measure, into account

    • What other variable do \(x\) and \(y\) change with?
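One way to explore that question is with a simulation. In the sketch below, two made-up series both decline over time but never influence each other. The names, slopes, and noise levels are all hypothetical. Comparing the raw correlation with the correlation after removing each series' trend over `year` suggests how much of the association comes from the shared variable.

```r
set.seed(240)
year <- 2000:2020

# Two simulated series that each depend on time, not on each other
milk_sim    <- 25  - 0.30 * (year - 2000) + rnorm(length(year), sd = 0.30)
divorce_sim <- 4.8 - 0.06 * (year - 2000) + rnorm(length(year), sd = 0.08)

# Raw correlation: strong, because both follow the same time trend
r_raw <- cor(milk_sim, divorce_sim)

# Remove the shared trend by correlating what is left after fitting each on year
r_detrended <- cor(resid(lm(milk_sim ~ year)),
                   resid(lm(divorce_sim ~ year)))

c(raw = r_raw, detrended = r_detrended)
```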

Comparative Boxplots

Correlation v. Causation

Go away