Day 10

Review

What is a mathematical model?

What is a statistical model?

\[Y=\beta_0+\beta_1X\]

\[Y=162.12747 + 1.45179*X\]

How do we interpret this regression equation for the above scatterplot?

Given a patient who is 25 years old, should we feel confident using this equation to predict their cholesterol?
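As a quick arithmetic check, here is a minimal Python sketch (not part of the original slides) that plugs an age into the fitted equation. It assumes X is age in years and Y is cholesterol, which is how the question above reads; the function name `predict_cholesterol` is made up for illustration.

```python
# Coefficients copied from the fitted equation above.
INTERCEPT = 162.12747
SLOPE = 1.45179


def predict_cholesterol(age):
    """Return the value the fitted line predicts for a given age (hypothetical helper)."""
    return INTERCEPT + SLOPE * age


prediction_25 = predict_cholesterol(25)
print(round(prediction_25, 2))  # prints 198.42
```

The equation will always hand back a number. Whether we should trust it depends on whether 25 falls inside the range of ages in the scatterplot, which the equation alone can't tell us.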

Questions?

Goals for Today:

Discuss the major assumptions of Least Squares
Briefly discuss Linear Model assumptions
Talk about some general pitfalls of science and LSR

Linear Regression

Least Squares Assumptions

Nothing we do in science is free from assumption, even recording the weight of cattle or counting bacteria on a plate of media: we’re assuming our scale works and that the bacteria on our plate are the ones of interest.

Linearity

“Linear” regression implies a very important feature be part of the data: linearity

What happens when our data isn’t so simple?

Our data can’t be described with a simple straight line

But it can still be described by a line

When we say we have to model something “linear in nature”:

The relationship between the variables can be described with some form of the mathematical equation:

\[y=mx+b\]

If I add polynomial terms to this, it’s still generally the same equation:

\[y=mx + cx^2 +b\]
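To see why a curve like this still counts as "linear", here is a small sketch using NumPy. The data is synthetic and noise-free, generated from made-up values (m = 2, c = 0.5, b = 1), so treat it as an illustration under those assumptions rather than a real analysis. The point is that y is just a weighted sum of the columns 1, x, and x^2, so ordinary least squares can solve for the weights directly.

```python
import numpy as np

# Synthetic, noise-free data from y = 2x + 0.5x^2 + 1 (values chosen for illustration).
x = np.linspace(-3, 3, 25)
y = 2.0 * x + 0.5 * x**2 + 1.0

# Design matrix with columns [1, x, x^2]: the curve is a weighted sum of these columns.
design = np.column_stack([np.ones_like(x), x, x**2])

# Least squares solves for the weights (b, m, c) in one step.
coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
b, m, c = coefs
print(f"b={b:.3f}, m={m:.3f}, c={c:.3f}")
```

The fit is curved in x, but each coefficient only multiplies a column, and that is what keeps it a linear model.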

When we’re talking about non-linear models, we mean that the model parameters enter our equation in some non-linear fashion:

\[Y=\beta_0+\beta_1X \Rightarrow \text{ Linear Model}\]

\[Y=\beta_0+\beta_1^2X \Rightarrow \text{ Non-Linear Model}\]

A decent example of a “non-linear” model is exponential growth

\[X_t=X_0(1+r)^t\]

In a more familiar notation

\[Y=\beta_0(1+\beta_1)^X\]

Constant Variance

We’re treating all of our values as if they’re equally important, so we need to assume that they’re also contributing equally to how wrong we are

Don’t fuss about trying to understand this concept in depth

Uncertainty is a big part of our later units

Recognize this simplistic concept instead:

If I have higher uncertainty about my measurements in some areas than others, I should consider that in my analysis (and thus check it against my assumptions)
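One rough way to check this idea is to fit a line and compare how spread out the residuals are in different regions of X. The sketch below uses simulated data where I deliberately make the noise grow with x; the numbers and the "split in half" check are illustrative assumptions, not a formal test.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: the noise standard deviation grows with x, so variance is NOT constant.
x = np.linspace(1, 10, 400)
y = 3.0 + 2.0 * x + rng.normal(0.0, 0.5 * x)

# Fit a straight line by least squares and compute the residuals.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# Compare the residual spread in the lower and upper halves of x.
half = len(x) // 2
spread_low = residuals[:half].std()
spread_high = residuals[half:].std()
print(f"residual SD, low x:  {spread_low:.2f}")
print(f"residual SD, high x: {spread_high:.2f}")
```

If the two spreads are very different, my uncertainty is not the same everywhere, and the constant variance assumption deserves a second look.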

Independence

In addition to assuming that all of my data is equally contributing to my error, I need to assume that my errors have nothing to do with one another (or other variables)

If I plant an acre of corn in Nebraska, an acre in Kansas, and an acre in Iowa, will they have the same yield?

Independence is a simple concept that I’ll ruin for you in a few weeks:

When I plant corn in Nebraska, Kansas, and Iowa
- The variability in yield between each acre will be correlated with where I planted them

In this case, the solutions are pretty simple:

Plant in the same location for my yield study
Separate my analysis based on location

In some cases, not so much:

“Extra” Assumptions

When I refer to the linear model, I’m really referring to Maximum Likelihood Estimation assuming Normality

Normality refers to the distribution of the residuals

It’s important to note that (to an extent) this really doesn’t matter:

Scientific data is rarely what we want it to be
It’s barely even clean most of the time
“All models are wrong”
Statistics doesn’t care if you’re wrong, as long as you can say how wrong you are

This is all to just put some of what we’ve done into context

A lot of what we’ve covered today will be more relevant information in STAT 241

OLS vs. MLE

This is more of a vocabulary adjustment

OLS refers to Ordinary Least Squares, a purely mathematical method that fits a line to data by minimizing the sum of squared errors

MLE refers to Maximum Likelihood Estimation, a statistical method that (assuming Normality) provides the same fitted line as OLS with the added bonus of uncertainty estimates

OLS has no idea how wrong/right it is. MLE gives you a best guess as to how incorrect it could be.
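Here is a small sketch of that vocabulary difference, using simulated data and NumPy only. It assumes Normal errors, and the helper `normal_log_likelihood` is a name I made up for illustration. It fits the line with least squares, then shows that the same coefficients also score the highest Normal likelihood compared with a nudged alternative, and that the likelihood view hands us approximate standard errors.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data from a known line plus Normal noise (illustrative values only).
x = np.linspace(0, 10, 50)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, size=x.size)
X = np.column_stack([np.ones_like(x), x])

# OLS: pick the coefficients that minimize the sum of squared errors.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)


def normal_log_likelihood(beta, sigma):
    """Log-likelihood of the data if the errors are Normal(0, sigma^2)."""
    resid = y - X @ beta
    n = y.size
    return -0.5 * n * np.log(2 * np.pi * sigma**2) - np.sum(resid**2) / (2 * sigma**2)


# Under Normal errors, the MLE of sigma is sqrt(RSS / n) at the OLS coefficients.
resid = y - X @ beta_ols
sigma_mle = np.sqrt(np.sum(resid**2) / y.size)

# Nudging the coefficients away from the OLS solution lowers the likelihood.
ll_at_ols = normal_log_likelihood(beta_ols, sigma_mle)
ll_nudged = normal_log_likelihood(beta_ols + np.array([0.1, -0.05]), sigma_mle)

# The likelihood framework also gives approximate standard errors for the coefficients.
std_errors = sigma_mle * np.sqrt(np.diag(np.linalg.inv(X.T @ X)))
print("coefficients:", beta_ols)
print("approx. standard errors:", std_errors)
```

The coefficients are the same either way; what the likelihood adds is a statement about how far off they could plausibly be.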

Pitfalls

For the rest of your career as scientists and industry professionals, you will witness (or commit) these pitfalls

“Violating” Assumptions

Harsh language
Doesn’t really capture the real severity of the act

All models are wrong

If your assumptions have been “violated”:

Try to see what you can fix
- Change the equation
- Change the assumptions

Worst Case: explain that they’re violated in your report/writing

Correlation vs. Causation

If you take nothing else from this class, take this section with you for the rest of your life

Every year in America, as ice cream sales rise, so does the murder rate across the country. Why is that?

As you progress throughout the rest of your scientific careers, remember these key vocabulary terms:

Correlation: a statistical measure (expressed as a number) that describes the size and direction of a relationship between two or more variables.

Causation: one event is the result of the occurrence of the other event; i.e. there is a causal relationship between the two events.

When we’re trying to tell these two apart, we can ask two important questions:

Does it make logical/scientific sense that X would cause Y?

Is there a distinct lag between when X happens and Y happens?
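To make the ice cream example concrete, here is a simulation sketch. Everything in it is made up: I assume a hypothetical third variable (called `temperature` here) that drives both ice cream sales and the murder rate, and neither of those two variables appears in the other’s formula. The helper `residualize` is also a name chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Hypothetical third variable that drives both outcomes (all values are simulated).
temperature = rng.normal(20.0, 8.0, size=n)
ice_cream_sales = 50.0 + 3.0 * temperature + rng.normal(0.0, 10.0, size=n)
murder_rate = 5.0 + 0.1 * temperature + rng.normal(0.0, 0.5, size=n)

# Raw correlation between the two outcomes, ignoring temperature.
raw_corr = np.corrcoef(ice_cream_sales, murder_rate)[0, 1]


def residualize(values, driver):
    """Remove the straight-line effect of `driver` from `values`."""
    slope, intercept = np.polyfit(driver, values, 1)
    return values - (intercept + slope * driver)


# Correlation after removing temperature's effect from both outcomes.
adjusted_corr = np.corrcoef(
    residualize(ice_cream_sales, temperature),
    residualize(murder_rate, temperature),
)[0, 1]

print(f"raw correlation:      {raw_corr:.2f}")
print(f"adjusted correlation: {adjusted_corr:.2f}")
```

In this simulated setup the raw correlation is strong and the adjusted one sits near zero, which is what we’d expect when a shared driver, not causation, is producing the relationship.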