Least Squares

STAT 240 - Fall 2025

Robert Sholl

Motivation

  • Phenomena: If I give three groups of patients three different blood pressure medications in an experimental trial, what effect do those drugs have? Are they different?

  • Mechanisms: We know that the estimated recession speeds of galaxies increase linearly with their observed distance from Earth; how can we use this relationship to estimate the age of the Universe?

  • Machine learning: How can we train a machine to recognize whether a recipe is for a dinner or a dessert?

Line of Best Fit

Which is “best”?

Least Squares

\[ \begin{array}{|c|c|c|} \hline x & y & \hat y\\ \hline 1 & 0.16 & 0.70\\ \hline 2 & 2.82 & 1.74\\ \hline 3 & 2.24 & 2.78\\ \hline \end{array} \]

\[ \hat y = mx + b \]

Least Squares

\[ m = \frac{1.74-0.70}{2-1} = 1.04 \]

\[ 0.70 = 1.04(1) + b \quad\Rightarrow\quad b = 0.70 - 1.04 = -0.34 \]

\[ \hat y = 1.04x - 0.34 \]

Least Squares

What if this is all we have?

\[ \begin{array}{|c|c|} \hline x & y \\ \hline 1 & 0.16 \\ \hline 2 & 2.82 \\ \hline 3 & 2.24 \\ \hline \end{array} \]

Least Squares

\[ m = \frac{\sum_{i=1}^n(x_i-\bar x)(y_i-\bar y)}{\sum_{i=1}^n(x_i-\bar x)^2} \]

\[ b = \bar y - m \bar x \]


\[ \hat y = mx+b \]
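
These formulas can be computed directly. A minimal sketch in R, using the three-point example data from earlier:

x <- c(1, 2, 3)
y <- c(0.16, 2.82, 2.24)

# slope: covariance-style sum over variance-style sum
m <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
# intercept: forces the line through the point of means
b <- mean(y) - m * mean(x)

c(m, b)   # 1.04 and -0.34, matching the line found above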

Least Squares

\[ y = b + mx \]

\[ \begin{aligned} y & = y \\ m & = \beta_1 \\ x & = x \\ b & = \beta_0 \end{aligned} \]

\[ y = \beta_0 + \beta_1 x \]

\[ \hat \epsilon_i = y_i - \hat y_i \]

Least Squares

\[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i \]

\[ \hat\beta_1 = \frac{\sum_{i=1}^n(x_i-\bar x)(y_i-\bar y)}{\sum_{i=1}^n(x_i-\bar x)^2} \]

\[ \hat\beta_0=\bar y - \hat\beta_1 \bar x \]

\[ \hat \epsilon_i = y_i - \hat y_i \]

Interpretations

\[ \begin{array}{|c|c|c|} \hline \text{Variable} & \text{Name} & \text{Interpretation} \\ \hline y & \text{Response} & \text{Variable being predicted} \\ \hline x & \text{Predictor} & \text{Variable predicting the response} \\ \hline \beta_0 & \text{Intercept} & \text{Baseline level of response} \\ \hline \beta_1 & \text{Slope} & \text{Effect of predictor on response} \\ \hline \epsilon & \text{Residuals} & \text{Difference between data and prediction} \\ \hline \end{array} \]

The Method of Least Squares

\[ \begin{array}{|c|c|} \hline x & y \\ \hline 1 & 0.16 \\ \hline 2 & 2.82 \\ \hline 3 & 2.24 \\ \hline \end{array} \]
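
In practice we rarely compute these sums by hand. A minimal sketch using R's built-in least-squares fitter, which reproduces the hand calculation:

x <- c(1, 2, 3)
y <- c(0.16, 2.82, 2.24)

# lm() minimizes the sum of squared residuals
lm(y ~ x)   # intercept -0.34, slope 1.04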

Assumptions

Besides linearity, which is assumed by the very form of the model, least squares relies on further assumptions.

Independence

\[ \begin{array}{|c|c|c|c|} \hline x & y & \hat y & y-\hat y\\ \hline 1 & 0.16 & 0.70 & -0.54\\ \hline 2 & 2.82 & 1.74 & 1.08\\ \hline 3 & 2.24 & 2.78 & -0.54\\ \hline \end{array} \]

\[\bar \epsilon = \frac{\hat \epsilon_1 + \hat \epsilon_2 + \hat \epsilon_3}{3}=\frac{-0.54+1.08-0.54}{3}=\frac{0}{3}=0\]
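
This is not a coincidence: whenever the model includes an intercept, the least-squares residuals average to zero by construction. A quick check in R:

x <- c(1, 2, 3)
y <- c(0.16, 2.82, 2.24)
fit <- lm(y ~ x)
mean(residuals(fit))   # 0, up to floating-point rounding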

Homoscedasticity

Loss Functions

Mathematical “rules” for optimization

Squared Loss

Absolute Loss

Least Absolute Deviations

\[ \begin{array}{|c|c|} \hline x & y \\ \hline 1 & 0.16 \\ \hline 2 & 2.82 \\ \hline 3 & 2.24 \\ \hline \end{array} \qquad \begin{array}{|c|c|c|} \hline \text{Loss} & \hat \beta_0 & \hat \beta_1\\ \hline \text{Square} & -0.34 & 1.04 \\ \hline \text{Absolute} & -0.88 & 1.04 \\ \hline \end{array} \]
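
The absolute-loss fit can be reproduced with quantile regression at the median, which minimizes the sum of absolute deviations. A sketch, assuming the quantreg package is installed:

library(quantreg)

x <- c(1, 2, 3)
y <- c(0.16, 2.82, 2.24)

# tau = 0.5 (the median) makes rq() the least-absolute-deviations fit;
# with so few points it may warn that the solution is nonunique
rq(y ~ x, tau = 0.5)   # intercept -0.88, slope 1.04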

In Practice

Which treatment?

Least Squares for Experiments

\[ y_i = \beta_1A_i + \beta_2B_i + \beta_3C_i + \epsilon_i \]

\[ \begin{aligned} A_i &= \begin{cases} 1 & \text{if Trt A} \\ 0 & \text{otherwise} \end{cases} \\ B_i &= \begin{cases} 1 & \text{if Trt B} \\ 0 & \text{otherwise} \end{cases} \\ C_i &= \begin{cases} 1 & \text{if Control} \\ 0 & \text{otherwise} \end{cases} \end{aligned} \]
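
A minimal sketch of fitting this indicator (cell-means) model in R; the treatment labels and blood-pressure values here are simulated for illustration:

set.seed(240)
# made-up data: 10 patients per group
trt <- factor(rep(c("A", "B", "Control"), each = 10))
bp  <- rnorm(30, mean = 200, sd = 5)

# "0 +" drops the intercept, so each coefficient is that group's mean
lm(bp ~ 0 + trt)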

Results

\[ \hat{y_i} = 199.2A_i + 199.6B_i + 202.0C_i \]

How old is the universe?

In 1929 Edwin Hubble investigated the relationship between distance and radial velocity of extragalactic nebulae (celestial objects). It was hoped that some knowledge of this relationship might give clues as to the way the universe was formed and what may happen later. His findings revolutionised astronomy and are the source of much research today on the ‘Big Bang’.

Derived Quantities

  • We can use the results of regression equations to form “extra” results

  • In this case the model is:

\[ y = \beta x + \epsilon \]

  • Our derived quantity is \(\beta^{-1}\)
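
A sketch of fitting this no-intercept model, assuming the hubble data from the gamair package (y is velocity in km/s, x is distance in megaparsecs):

library(gamair)
data(hubble)

# "- 1" removes the intercept, forcing the line through the origin
hub.mod <- lm(y ~ x - 1, data = hubble)
coef(hub.mod)   # the slope beta, reported in the Results below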

The Data

Assumption Checks

# calculate the mean and round to 10 decimal places
# (with no intercept in the model, residuals need not average to zero)
resids <- residuals(hub.mod)   # residuals from hub.mod in the sketch above
round(mean(resids), 10)
[1] 1.22088

Results

\[ \hat y = 76.58x \]

\[ \text{Hubble Time} = \beta^{-1} \times 979.708 \]
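
Plugging in the fitted slope gives the estimate directly; the constant 979.708 converts \(\beta^{-1}\) (in megaparsec-seconds per kilometre) into billions of years:

# Hubble Time in billions of years, from the fitted slope
beta <- 76.58
979.708 / beta   # about 12.8 billion years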

Is it cake?

\[ P_i = \beta_0 + \beta_1C_i + \epsilon_i \]

  • \(P_i\) will be the “probability of being a dessert” for the \(i\)-th entry.

  • \(C_i\) will be the calories for the \(i\)-th entry.
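
A minimal sketch of this linear probability model in R; the recipe data below are hypothetical:

# hypothetical recipes: 1 = dessert, 0 = dinner
dessert  <- c(1, 1, 0, 0, 1, 0)
calories <- c(450, 620, 550, 700, 380, 820)

# linear probability model: fitted values estimate P(dessert)
lm(dessert ~ calories)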

Is it cake?

\[ \hat P_i = 0.18 + 0.000013\,C_i \]

More predictors

\[ P_i = \beta_0 + \beta_1C_i + \beta_2F_i+ \epsilon_i \]
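
Extending the sketch above with a second, hypothetical predictor \(F_i\) (say, grams of fat per serving):

# hypothetical second predictor added to the same made-up recipes
dessert  <- c(1, 1, 0, 0, 1, 0)
calories <- c(450, 620, 550, 700, 380, 820)
fat      <- c(20, 35, 15, 30, 18, 40)

lm(dessert ~ calories + fat)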

Go away