Day 10

Review


What is a mathematical model?



What is a statistical model?



\[Y=\beta_0+\beta_1X\]






\[Y=162.12747 + 1.45179X\]


How do we interpret this regression equation for the above scatterplot?



Given a patient who is 25 years old, should we feel confident using this equation to predict their cholesterol?
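
As a quick check on the arithmetic, the fitted equation above can be evaluated at age 25. This is only a minimal sketch: the function name `predict_cholesterol` is made up for illustration, and the code just plugs a value into the equation. It cannot tell us whether 25 falls inside the range of ages in the scatterplot, which is what the question is really asking.

```python
# Coefficients copied from the fitted regression equation above.
INTERCEPT = 162.12747
SLOPE = 1.45179

def predict_cholesterol(age):
    """Plug an age into Y = 162.12747 + 1.45179 * X (hypothetical helper)."""
    return INTERCEPT + SLOPE * age

prediction = predict_cholesterol(25)
print(round(prediction, 2))  # 198.42
```

Getting a number out is always possible. Whether we should trust it depends on whether the data used to fit the line included patients near that age.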




Questions?




Goals for Today:

  1. Discuss the major assumptions of Least Squares

  2. Briefly discuss Linear Model assumptions

  3. Talk about some general pitfalls of science and LSR



Linear Regression


Least Squares Assumptions


Nothing we do in science is free from assumptions, even recording the weight of cattle or counting bacteria on a plate of media. We’re assuming our scale works and that the bacteria on our plate are the ones of interest.



Linearity


“Linear” regression implies that a very important feature is part of the data: linearity



What happens when our data isn’t so simple?



Our data can’t be described with a simple straight line

  • But it can still be described by a line



When we say we have to model something “linear in nature”:

  • The relationship between the variables can be described with some form of the mathematical equation:

\[y=mx+b\]


If I add polynomial terms to this, it’s still the same kind of equation:


\[y=mx + cx^2 +b\]
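
One way to see why, sketched with relabeled symbols that are not part of the original notation: treat \(x^2\) as a second predictor, so \(x_1 = x\) and \(x_2 = x^2\). The polynomial equation then becomes

\[y = mx_1 + cx_2 + b\]

The curve bends in \(x\), but each coefficient (\(m\), \(c\), and \(b\)) still enters the equation as a plain multiplier or an added constant, which is the sense of "linear" that matters here.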


When we’re talking about non-linear models, we mean that the model parameters enter the equation in some non-linear fashion:


\[Y=\beta_0+\beta_1X \Rightarrow \text{ Linear Model}\]

\[Y=\beta_0+\beta_1^2X \Rightarrow \text{ Non-Linear Model}\]


A decent example of a “non-linear” model is exponential growth


\[X_t=X_0(1+r)^t\]


In a more familiar notation


\[Y=\beta_0(1+\beta_1)^X\]
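
A rough sketch of one common workaround, which is an addition here and not something covered above: taking logs of both sides gives log(Y) = log(β0) + X·log(1 + β1), which is a straight line in X. The example below uses made-up, noise-free values (β0 = 100, β1 = 0.05) and a hypothetical helper `fit_line` to show that fitting a line to log(Y) recovers the growth parameters.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (hypothetical helper)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Made-up, noise-free exponential growth: Y = 100 * (1 + 0.05)^X
beta0_true, beta1_true = 100.0, 0.05
xs = list(range(11))
ys = [beta0_true * (1 + beta1_true) ** x for x in xs]

# Fit a straight line to log(Y) instead of Y.
log_intercept, log_slope = fit_line(xs, [math.log(y) for y in ys])

# Undo the log to get back to the original parameters.
beta0_hat = math.exp(log_intercept)
beta1_hat = math.exp(log_slope) - 1
print(round(beta0_hat, 4), round(beta1_hat, 4))
```

With real, noisy data the recovered values would only be approximate, and fitting on the log scale changes which errors the line treats as large.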




Constant Variance


We’re treating all of our values as if they’re equally important, so we need to assume that they’re also contributing equally to how wrong we are



Don’t fuss about trying to understand this concept in depth

  • Uncertainty is a big part of our later units


Recognize this simplistic concept instead:


If I have higher uncertainty about my measurements in some areas than others, I should consider that in my analysis (and thus check it against my assumptions)
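
One informal way to check this idea is to fit the line, then compare how spread out the residuals are at low X versus high X. The sketch below uses made-up data where the error deliberately grows with X, and a hypothetical helper `fit_line`; splitting the data in half is just a simple illustration, not a formal test.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (hypothetical helper)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Made-up data: the underlying line is Y = 2 + 3X, but the error grows with X.
xs = list(range(1, 21))
ys = [2 + 3 * x + (0.5 * x if x % 2 == 0 else -0.5 * x) for x in xs]

intercept, slope = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Average size of the residuals in the lower and upper halves of X.
half = len(xs) // 2
spread_low = sum(abs(r) for r in residuals[:half]) / half
spread_high = sum(abs(r) for r in residuals[half:]) / (len(xs) - half)
print(round(spread_low, 2), round(spread_high, 2))
```

If the second number is much larger than the first, the residuals are fanning out as X increases, which is the pattern the constant variance assumption rules out.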




Independence


In addition to assuming that all of my data is equally contributing to my error, I need to assume that my errors have nothing to do with one another (or other variables)


If I plant an acre of corn in Nebraska, an acre in Kansas, and an acre in Iowa, will they have the same yield?



Independence is a simple concept that I’ll ruin for you in a few weeks:

  • When I plant corn in Nebraska, Kansas, and Iowa

    • The variability in yield between each acre will be correlated with where I planted them


In this case, the solutions are pretty simple:

  • Plant in the same location for my yield study

  • Separate my analysis based on location


In some cases, not so much:



“Extra” Assumptions


When I refer to the linear model, I’m really referring to Maximum Likelihood Estimation assuming Normality


Normality refers to the distribution of the residuals


It’s important to note that (to an extent) this really doesn’t matter:

  • Scientific data is rarely what we want it to be

  • It’s barely even clean most of the time

  • “All models are wrong”

  • Statistics doesn’t care if you’re wrong, as long as you can say how wrong you are


This is all to just put some of what we’ve done into context

  • A lot of what we’ve covered today will be more relevant information in STAT 241



OLS vs. MLE


This is more of a vocabulary adjustment


OLS refers to Ordinary Least Squares, a purely mathematical method that fits a line to data by minimizing the sum of squared errors


MLE refers to Maximum Likelihood Estimation, a statistical method that (assuming Normality) provides the same estimates as OLS, with the added bonus of uncertainty estimates


OLS has no idea how wrong/right it is. MLE gives you a best guess as to how incorrect it could be.
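
To make the distinction concrete, here is a minimal sketch with made-up data. The slope and intercept come from the usual least squares formulas. The standard error of the slope is the usual textbook formula, and it only means something once we accept the statistical assumptions (Normal, constant-variance, independent errors). The variable names are illustrative, and the residual variance uses the common n − 2 divisor instead of the strict maximum likelihood divisor of n.

```python
import math

# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

# The OLS part: just a line that minimizes squared errors.
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# The statistical part: how uncertain is that slope?
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
residual_variance = sum(r ** 2 for r in residuals) / (n - 2)
slope_std_error = math.sqrt(residual_variance / sxx)

print(round(slope, 3), round(intercept, 3), round(slope_std_error, 3))
```

The first two numbers are the same under either framing. The third is the "how wrong could I be" piece that only exists once we make distributional assumptions.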



Pitfalls


For the rest of your career as scientists and industry professionals, you will witness (or commit) these pitfalls



“Violating” Assumptions


  • Harsh language

  • Doesn’t really capture the real severity of the act


All models are wrong


If your assumptions have been “violated”:

  • Try to see what you can fix

    • Change the equation

    • Change the assumptions


  • Worst Case: explain that they’re violated in your report/writing




Correlation vs. Causation


If you take nothing else from this class, take this section with you for the rest of your life


Every year in America, as ice cream sales rise, so does the murder rate across the country. Why is that?




As you progress throughout the rest of your scientific careers, remember these key vocabulary terms:


Correlation: a statistical measure (expressed as a number) that describes the size and direction of a relationship between two or more variables.


Causation: one event is the result of the occurrence of the other event; i.e. there is a causal relationship between the two events.


When we’re trying to tell these two apart, we can ask two important questions:


Does it make logical/scientific sense that X would cause Y?


Is there a distinct lag between when X happens and Y happens?
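
One common explanation for a pattern like the ice cream example is a third variable that drives both. The toy sketch below uses invented numbers (not real sales or crime data): a made-up monthly temperature feeds into both series, and neither series depends on the other. The helper `pearson_r` is written out by hand for illustration.

```python
import math

def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    a_bar = sum(a) / n
    b_bar = sum(b) / n
    cov = sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b))
    var_a = sum((x - a_bar) ** 2 for x in a)
    var_b = sum((y - b_bar) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Invented monthly temperatures (the lurking variable).
temps = [30, 35, 45, 55, 65, 75, 80, 78, 70, 58, 45, 33]

# Small invented wiggles so the two series are not perfectly smooth.
wiggle_a = [3, -2, 1, -3, 2, -1, 3, -2, 1, -3, 2, -1]
wiggle_b = [-0.5, 0.4, -0.3, 0.5, -0.4, 0.3, -0.5, 0.4, -0.3, 0.5, -0.4, 0.3]

# Both series depend on temperature only, never on each other.
ice_cream = [50 + 4 * t + w for t, w in zip(temps, wiggle_a)]
incidents = [10 + 0.3 * t + w for t, w in zip(temps, wiggle_b)]

r = pearson_r(ice_cream, incidents)
print(round(r, 3))
```

The correlation comes out strong even though, by construction, ice cream has no effect on incidents. A high correlation alone cannot tell us which of these stories is true.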




Attendance QOTD


Go away