Day 10
Review
What is a mathematical model?
What is a statistical model?
\[Y=\beta_0+\beta_1X\]
\[Y=162.12747 + 1.45179X\]
How do we interpret this regression equation for the above scatterplot?
Given a patient who is 25 years old, should we feel confident using this equation to predict their cholesterol?
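As a quick arithmetic check, here is a minimal Python sketch that plugs an age into the equation above. The function name `predict_cholesterol` is made up for illustration, and it assumes X is age in years and Y is cholesterol, as the question implies.

```python
# Coefficients copied from the fitted equation above.
INTERCEPT = 162.12747
SLOPE = 1.45179

def predict_cholesterol(age):
    """Plug an age (assumed to be in years) into the fitted line."""
    return INTERCEPT + SLOPE * age

prediction_at_25 = predict_cholesterol(25)
print(round(prediction_at_25, 2))  # 198.42
```

The equation will happily return a number for any age. Whether we should trust it depends on whether 25 falls inside the range of ages used to fit the line, which the equation alone cannot tell us.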
Questions?
Goals for Today:
Discuss the major assumptions of Least Squares
Briefly discuss Linear Model assumptions
Talk about some general pitfalls of science and LSR
Linear Regression
Least Squares Assumptions
Nothing we do in science is free from assumptions, even recording the weight of cattle or counting bacteria on a plate of media: we’re assuming our scale works and that the bacteria on our plate are the ones of interest.
Linearity
“Linear” regression requires a very important feature in the data: linearity
What happens when our data isn’t so simple?
Our data can’t be described with a simple straight line
- But it can still be described by a line (just not a straight one)
When we say we have to model something “linear in nature”:
- The relationship between the variables can be described with some form of the mathematical equation:
\[y=mx+b\]
If I add polynomial terms to this, it’s still the same kind of equation, because each coefficient still just multiplies a term:
\[y=mx + cx^2 +b\]
When we’re talking about non-linear models, we mean that the model parameters enter the equation in some non-linear fashion:
\[Y=\beta_0+\beta_1X \Rightarrow \text{ Linear Model}\]
\[Y=\beta_0+\beta_1^2X \Rightarrow \text{ Non-Linear Model}\]
A decent example of a “non-linear” model is exponential growth
\[X_t=X_0(1+r)^t\]
In a more familiar notation
\[Y=\beta_0(1+\beta_1)^X\]
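One way to see the difference is a spot check in Python: if a model is linear in its parameters, adding two sets of parameters together adds their predictions together. The sketch below tries this on the polynomial and exponential equations above. The function names and parameter values are made up for illustration, and a single check like this is a sanity test, not a proof.

```python
def polynomial_model(params, x):
    """y = b + m*x + c*x^2, the polynomial form shown above."""
    b, m, c = params
    return b + m * x + c * x ** 2

def exponential_model(params, x):
    """Y = b0 * (1 + b1)^X, the exponential growth form shown above."""
    b0, b1 = params
    return b0 * (1 + b1) ** x

def is_additive_in_parameters(model, p1, p2, x, tol=1e-9):
    """Check whether predicting with p1 + p2 equals the sum of the two predictions."""
    summed = [a + b for a, b in zip(p1, p2)]
    return abs(model(summed, x) - (model(p1, x) + model(p2, x))) < tol

# Arbitrary illustrative parameter values.
poly_is_linear = is_additive_in_parameters(
    polynomial_model, (1.0, 2.0, 0.5), (3.0, -1.0, 0.25), x=4.0
)
exp_is_linear = is_additive_in_parameters(
    exponential_model, (1.0, 0.1), (2.0, 0.3), x=4.0
)

print(poly_is_linear)  # True
print(exp_is_linear)   # False
```

The polynomial curves in X, but it passes the check because each coefficient only multiplies a term. The exponential model fails because one of its parameters sits inside the power.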
Constant Variance
We’re treating all of our values as if they’re equally important, so we need to assume that they’re also contributing equally to how wrong we are
Don’t fuss about trying to understand this concept in depth
- Uncertainty is a big part of our later units
Recognize this simplistic concept instead:
If I have higher uncertainty about my measurements in some areas than others, I should consider that in my analysis (and thus check it against my assumptions)
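A rough way to check this is to fit the line and compare how large the errors are in different regions of X. The Python sketch below does that on made-up data where the noise is deliberately larger at high X. The data, the helper `fit_line`, and the choice to split the data in half are all assumptions for illustration.

```python
def fit_line(x, y):
    """Least squares intercept and slope for a single predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Made-up data: a straight line plus noise that grows with x.
x = list(range(1, 11))
noise = [0.1, -0.1, 0.2, -0.2, 0.5, -0.5, 1.5, -1.5, 3.0, -3.0]
y = [2 + 3 * xi + ei for xi, ei in zip(x, noise)]

intercept, slope = fit_line(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Average squared residual in the lower and upper halves of x.
half = len(x) // 2
spread_low = sum(r ** 2 for r in residuals[:half]) / half
spread_high = sum(r ** 2 for r in residuals[half:]) / (len(x) - half)

print(spread_low < spread_high)  # True
```

If the two spreads are very different, as they are here, the constant variance assumption deserves a closer look.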
Independence
In addition to assuming that all of my data contribute equally to my error, I need to assume that my errors have nothing to do with one another (or with other variables)
If I plant an acre of corn in Nebraska, an acre in Kansas, and an acre in Iowa, will they have the same yield?
Independence is a simple concept that I’ll ruin for you in a few weeks:
When I plant corn in Nebraska, Kansas, and Iowa
- The variability in yield between each acre will be correlated with where I planted them
In this case, the solutions are pretty simple:
Plant in the same location for my yield study
Separate my analysis based on location
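The corn example can be sketched in Python with made-up yields. If I ignore location, the errors cluster by state; if I separate the analysis by location, that pattern goes away. The yield numbers and variable names below are invented purely for illustration.

```python
# Made-up yields (bushels per acre) for three plots in each state.
yields = {
    "Nebraska": [180, 185, 190],
    "Kansas": [140, 145, 150],
    "Iowa": [200, 205, 210],
}

all_values = [v for values in yields.values() for v in values]
overall_mean = sum(all_values) / len(all_values)

# Errors when location is ignored: each plot compared to the overall mean.
pooled_errors = {
    state: [v - overall_mean for v in values] for state, values in yields.items()
}

# Errors when each state is analyzed separately: compared to its own mean.
separate_errors = {
    state: [v - sum(values) / len(values) for v in values]
    for state, values in yields.items()
}

for state in yields:
    print(state, [round(e, 1) for e in pooled_errors[state]], separate_errors[state])
```

In the pooled version every Kansas error is negative and every Iowa error is positive, so knowing one error tells me something about its neighbors. In the separated version the errors are centered on zero within each state.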
In some cases, not so much:
“Extra” Assumptions
When I’m referring to the linear model, I’m really referring to Maximum Likelihood Estimation assuming Normality
Normality refers to the distribution of the residuals, not of the raw data
It’s important to note that (to an extent) this really doesn’t matter:
Scientific data is rarely what we want it to be
It’s barely even clean most of the time
“All models are wrong”
Statistics doesn’t care if you’re wrong, as long as you can say how wrong you are
This is all to just put some of what we’ve done into context
- A lot of what we’ve covered today will be more relevant information in STAT 241
OLS vs. MLE
This is more of a vocabulary adjustment
OLS refers to Ordinary Least Squares, a purely mathematical method that fits a line to data by minimizing the sum of squared errors
MLE refers to Maximum Likelihood Estimation, a statistical method that (assuming Normality) gives the same line as OLS with the added bonus of uncertainty estimates
OLS has no idea how wrong/right it is. MLE gives you a best guess as to how incorrect it could be.
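To make the vocabulary concrete, here is a minimal Python sketch on made-up numbers. The first part is plain least squares: it only produces a line. The second part leans on the Normality assumption to attach a standard error to the slope. The data, the helper `fit_line`, and the common n - 2 divisor for the residual variance are assumptions for illustration (the strict maximum likelihood variance estimate divides by n instead).

```python
import math

def fit_line(x, y):
    """Least squares intercept and slope, plus the spread of x (Sxx)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope, sxx

# Made-up data for illustration.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# Part 1: the least squares line, with no statement about uncertainty.
intercept, slope, sxx = fit_line(x, y)

# Part 2: under the Normal linear model, the residuals give a standard
# error for the slope.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
residual_variance = sum(r ** 2 for r in residuals) / (len(x) - 2)
slope_std_error = math.sqrt(residual_variance / sxx)

print(f"slope = {slope:.3f} +/- {slope_std_error:.3f}")
```

The slope is identical in both parts; the only thing the statistical model adds is the "plus or minus".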
Pitfalls
For the rest of your career as scientists and industry professionals, you will witness (or commit) these pitfalls
“Violating” Assumptions
Harsh language
Doesn’t really capture the real severity of the act
All models are wrong
If your assumptions have been “violated”:
Try to see what you can fix
Change the equation
Change the assumptions
- Worst Case: explain that they’re violated in your report/writing
Correlation vs. Causation
If you take nothing else from this class, take this section with you for the rest of your life
Every year in America, as ice cream sales rise, so does the murder rate across the country. Why is that?
As you progress throughout the rest of your scientific careers, remember these key vocabulary terms:
Correlation: a statistical measure (expressed as a number) that describes the size and direction of a relationship between two or more variables.
Causation: one event is the result of the occurrence of the other event; i.e. there is a causal relationship between the two events.
To tell these two apart, we can ask two important questions:
Does it make logical/scientific sense that X would cause Y?
Is there a distinct lag between when X happens and when Y happens?
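The ice cream example can be sketched in Python with made-up numbers. Below, ice cream sales and the crime count are each built from temperature plus a little noise, and neither one is built from the other. All values and names are invented for illustration; treating temperature as the shared driver is an assumption of this sketch, not a finding.

```python
import math

def pearson_correlation(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    a_bar = sum(a) / n
    b_bar = sum(b) / n
    cov = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    var_a = sum((ai - a_bar) ** 2 for ai in a)
    var_b = sum((bi - b_bar) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)

# Made-up data: both series depend on temperature, not on each other.
temperature = [30, 40, 50, 60, 70, 80, 90]
ice_cream_noise = [1, -2, 2, -1, 0, 1, -1]
crime_noise = [-1, 1, 0, 1, -1, 0, 0]

ice_cream_sales = [10 + 2 * t + e for t, e in zip(temperature, ice_cream_noise)]
crime_count = [5 + 0.5 * t + e for t, e in zip(temperature, crime_noise)]

r = pearson_correlation(ice_cream_sales, crime_count)
print(r > 0.9)  # True
```

The correlation is very strong even though, by construction, ice cream has no effect on crime. A large correlation on its own cannot answer either of the two questions above.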