Day 9

Review

Sample Mean


\[\bar{x} = {1 \over n}\sum\limits_{i=1}^nx_i\]


\[\bar{y} = {1 \over n}\sum\limits_{i=1}^ny_i\]



Sample Variance


\[s_x^2 = {{{\sum\limits_{i=1}^n}(x_i-\bar{x})^2}\over (n-1)}\]


\[s_y^2 = {{{\sum\limits_{i=1}^n}(y_i-\bar{y})^2}\over (n-1)}\]



Sample Standard Deviation


\[\sqrt{s_x^2}=s_x\]


\[\sqrt{s_y^2}=s_y\]
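
The three review formulas can be checked with a few lines of code. This is a minimal sketch in Python (the language is an assumption, not something the course specifies); the function names and the toy data are hypothetical and chosen so the arithmetic is easy to verify by hand.

```python
import math

def sample_mean(values):
    """Sample mean: (1/n) times the sum of the values."""
    return sum(values) / len(values)

def sample_variance(values):
    """Sample variance: squared deviations from the mean, divided by n - 1."""
    mean = sample_mean(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

def sample_sd(values):
    """Sample standard deviation: square root of the sample variance."""
    return math.sqrt(sample_variance(values))

# Hypothetical toy data: deviations from the mean are -2, 0, 0, 2.
x = [2, 4, 4, 6]

print(sample_mean(x))      # 4.0
print(sample_variance(x))  # 8 / 3, about 2.667
print(sample_sd(x))        # square root of 8 / 3
```

Note the \(n-1\) denominator in the variance: with four values, the sum of squared deviations (8) is divided by 3, not 4.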



Correlation Coefficient


  • Given \(n\) ordered pairs: (\(x_i,y_i\))

    • With sample means: \(\bar{x}\) and \(\bar{y}\)

    • Sample standard deviations: \(s_x\) and \(s_y\)

    • The correlation coefficient \(r\) is given by:


\[r = \frac{1}{n-1} \sum_i \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)\]
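
The formula for \(r\) translates almost term by term into code. The sketch below is one possible Python version; the function name `correlation` and the example points are assumptions made for illustration.

```python
import math

def correlation(xs, ys):
    """Correlation coefficient r: average product of standardized x and y,
    using the n - 1 denominator throughout."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

# Hypothetical points that sit exactly on a line.
xs = [1, 2, 3, 4]
ys_up = [3, 5, 7, 9]    # y = 1 + 2x, positive slope
ys_down = [9, 7, 5, 3]  # y = 11 - 2x, negative slope

print(correlation(xs, ys_up))    # approximately 1
print(correlation(xs, ys_down))  # approximately -1
```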


Correlation Coefficient Properties:


  1. The value always satisfies \(-1 \le r \le 1\)

    • If \(r=1\), all of the data falls on a line with a positive slope

    • If \(r=-1\), all of the data falls on a line with a negative slope

    • The closer \(r\) is to 0, the weaker the linear relationship between \(x\) and \(y\)

    • If \(r=0\), no linear relationship exists


  2. The correlation does not depend on the unit of measurement for the two variables


  3. Correlation is very sensitive to outliers


  4. Correlation measures only the linear relationship and may not (by itself) detect a nonlinear relationship
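
The last three properties are easy to see numerically. The following Python sketch uses entirely hypothetical data (heights, weights, and two small toy sets) and a hypothetical helper named `correlation`; it is meant as an illustration, not a proof.

```python
import math

def correlation(xs, ys):
    """Correlation coefficient r, using the n - 1 formula."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Units do not matter: inches versus centimeters gives the same r.
height_in = [60, 62, 65, 68, 70, 72]          # hypothetical heights
weight_lb = [115, 120, 140, 155, 160, 180]    # hypothetical weights
height_cm = [h * 2.54 for h in height_in]
r_inches = correlation(height_in, weight_lb)
r_cm = correlation(height_cm, weight_lb)

# Outliers matter a lot: one extra point flips a perfect r = 1 to a negative r.
line_x = [1, 2, 3, 4, 5]
line_y = [2, 4, 6, 8, 10]
r_clean = correlation(line_x, line_y)
r_outlier = correlation(line_x + [6], line_y + [-20])

# Only linear relationships are measured: y = x^2 is a perfect
# (nonlinear) relationship, yet r is essentially 0.
curve_x = [-2, -1, 0, 1, 2]
curve_y = [x ** 2 for x in curve_x]
r_curve = correlation(curve_x, curve_y)

print(r_inches, r_cm)      # the same value, up to rounding
print(r_clean, r_outlier)  # about 1, then a negative number
print(r_curve)             # approximately 0
```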



Cholesterol Example


Age 67 52 57 56 60 42 58 53 52 39 54 50 62 38 54 48 64 69 53 62
Cholesterol 229 196 241 283 253 265 218 282 205 219 304 196 263 175 214 245 325 239 264 209



We can draw a line through this data:



We come up with this line using Least Squares Regression:


  • Given \(n\) ordered pairs: (\(x_i,y_i\))

    • With sample means: \(\bar{x}\) and \(\bar{y}\)

    • Sample standard deviations: \(s_x\) and \(s_y\)

    • Correlation coefficient: \(r\)

    • The equation of the least-squares regression line for predicting \(y\) from \(x\) is:


\[\hat{y}=\beta_0+\beta_1x\]


Where the slope (\(\beta_1\)) is:


\[\beta_1 = r \cdot {s_y\over s_x}\]


And the intercept (\(\beta_0\)) is:


\[\beta_0=\bar{y}-\beta_1 \bar{x}\]
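
As a sketch of how the slope and intercept formulas could be applied to the Cholesterol Example, the Python code below uses the age and cholesterol values from the table. The helper names (`mean`, `sd`, `corr`, `least_squares_line`, `predict`) and the age of 55 used for the prediction are assumptions introduced for illustration.

```python
import math

age = [67, 52, 57, 56, 60, 42, 58, 53, 52, 39,
       54, 50, 62, 38, 54, 48, 64, 69, 53, 62]
chol = [229, 196, 241, 283, 253, 265, 218, 282, 205, 219,
        304, 196, 263, 175, 214, 245, 325, 239, 264, 209]

def mean(v):
    return sum(v) / len(v)

def sd(v):
    """Sample standard deviation (n - 1 denominator)."""
    m = mean(v)
    return math.sqrt(sum((t - m) ** 2 for t in v) / (len(v) - 1))

def corr(xs, ys):
    """Correlation coefficient r."""
    x_bar, y_bar, s_x, s_y = mean(xs), mean(ys), sd(xs), sd(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (len(xs) - 1)

def least_squares_line(xs, ys):
    """Return (intercept, slope) using slope = r * s_y / s_x."""
    b1 = corr(xs, ys) * sd(ys) / sd(xs)
    b0 = mean(ys) - b1 * mean(xs)
    return b0, b1

b0, b1 = least_squares_line(age, chol)

def predict(x):
    """Predicted cholesterol (y-hat) for a given age."""
    return b0 + b1 * x

print(f"y-hat = {b0:.2f} + {b1:.2f} * age")
print(f"Predicted cholesterol at age 55: {predict(55):.1f}")
```

One consequence of the intercept formula is that the fitted line always passes through the point \((\bar{x}, \bar{y})\), so `predict(mean(age))` returns `mean(chol)` up to rounding.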


  • What do we call \(y\)?


  • \(x\)?


  • \(\beta_0\)?


  • \(\beta_1\)?


  • \(\hat{y}\)?


  • How do we interpret \(\beta_0\) and \(\beta_1\)?




Questions?




Goals for Today:

  1. Define and discuss mathematical modeling

  2. Define statistical models and the linear model

  3. Make predictions about some birds



Linear Regression


Mathematical Modeling


What is a mathematical model?







A simplified mathematical representation of an existing system


Think back to LSR


\[Y=\beta_0+\beta_1X\]



Least Squares is a mathematical model

  • We take the observed coordinates and fit a function to them, which gives a simplified summary of those coordinates


\[y=mx+b\]


\[y=b+ax\]


Mathematical modeling comes in many forms:


  • Quadratic/Polynomial regression

  • Quantile regression

  • Categorical/Ordinal modeling

  • Differential/Difference equations

  • Network/Graph models


None of these is used as commonly as the ordinary least squares (OLS), or least squares regression (LSR), model


Why?






Statistical Modeling

“All models are wrong, but some are useful” - George Box


Where there is OLS/LSR, there is the Linear Model


\[y_i=\beta_0+\beta_1x_i+\epsilon_i\]

\[\epsilon_i\sim N(0,\sigma^2)\]


\(\epsilon_i\) (error) is what separates the mathematical model from the statistical model


The linear model is roughly what we get when we combine the main goal of OLS/LSR with the main assumption of the linear model:

  • Goal: minimize the sum of squared errors (residuals)

  • Assumption: the residuals are normally distributed
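
To make the two ideas concrete, here is a small simulation sketch in Python. The parameter values (`beta_0`, `beta_1`, `sigma`), the seed, and the x values are all hypothetical. Data is generated from the linear model with normal errors, a least-squares line is fitted, and its sum of squared errors is compared with that of the true line. The slope is computed as \(S_{xy}/S_{xx}\), which is algebraically the same as \(r \cdot s_y / s_x\).

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical "true" parameters of the linear model.
beta_0, beta_1, sigma = 10.0, 2.0, 3.0
xs = list(range(1, 51))

# Mathematical model: no error term, every point sits exactly on the line.
exact = [beta_0 + beta_1 * x for x in xs]

# Statistical model: each observation adds a normal error with mean 0.
errors = [random.gauss(0, sigma) for _ in xs]
ys = [line + e for line, e in zip(exact, errors)]

def fit_least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope

def sse(b0, b1, xs, ys):
    """Sum of squared errors (residuals) for the line b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

b0_hat, b1_hat = fit_least_squares(xs, ys)
sse_fit = sse(b0_hat, b1_hat, xs, ys)
sse_true = sse(beta_0, beta_1, xs, ys)

print(f"fitted line: {b0_hat:.2f} + {b1_hat:.2f} x")
print(f"SSE of fitted line: {sse_fit:.1f}, SSE of true line: {sse_true:.1f}")
```

The fitted line never has a larger sum of squared errors than the true line on the sample it was fitted to, because least squares picks the line that minimizes that sum for the observed data.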




Do we see any issues with this process?






\[ \begin{array}{|c|c|}
\hline \text{Year} & \text{Count}\\
\hline 1980 & 78 \\
\hline 1981 & 73 \\
\hline 1982 & 73 \\
\hline 1983 & 75 \\
\hline 1984 & 86 \\
\hline 1985 & 97 \\
\hline 1986 & 110 \\
\hline 1987 & 134 \\
\hline 1988 & 138 \\
\hline 1989 & 146 \\
\hline 1990 & 146 \\
\hline \end{array} \]


\[ \begin{array}{|c|c|c|c|c|c|} \hline n & \bar{x} & \bar{y} & s_x & s_y & r\\ \hline \quad \quad & \quad \quad \quad & \quad \quad \quad & \quad \quad \quad & \quad \quad \quad & \quad \quad \quad\\ \hline \end{array} \]


\[r = \frac{1}{n-1} \sum_i \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)\]


\[\beta_1 = r \cdot {s_y\over s_x}\]


\[\beta_0=\bar{y}-\beta_1 \bar{x}\]
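
Once the table has been filled in by hand, the work can be checked with a short script. The Python sketch below uses the Year and Count values from the table above; the helper names and the choice of 1991 as the year to predict are assumptions made for illustration.

```python
import math

year = [1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990]
count = [78, 73, 73, 75, 86, 97, 110, 134, 138, 146, 146]

def mean(v):
    return sum(v) / len(v)

def sd(v):
    """Sample standard deviation (n - 1 denominator)."""
    m = mean(v)
    return math.sqrt(sum((t - m) ** 2 for t in v) / (len(v) - 1))

n = len(year)
x_bar, y_bar = mean(year), mean(count)
s_x, s_y = sd(year), sd(count)

# Correlation coefficient: average product of the standardized values.
r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
        for x, y in zip(year, count)) / (n - 1)

# Slope and intercept of the least-squares line.
beta_1 = r * s_y / s_x
beta_0 = y_bar - beta_1 * x_bar

def predict(x):
    """Predicted count (y-hat) for a given year."""
    return beta_0 + beta_1 * x

print(f"n={n}  x_bar={x_bar:.1f}  y_bar={y_bar:.2f}  "
      f"s_x={s_x:.3f}  s_y={s_y:.3f}  r={r:.3f}")
print(f"y-hat = {beta_0:.2f} + {beta_1:.3f} * year")
print(f"Predicted count for 1991 (hypothetical query): {predict(1991):.1f}")
```

Predicting for 1991 extrapolates one year past the observed data, so the result is only as trustworthy as the assumption that the linear trend continues.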







Attendance QOTD


Go away