Day 8
Review
Observational Studies
An observational study is one in which we study something as it exists
Observe individuals and measure variables of interest
- Gather medical records to study relationship between smoking and heart disease
Does NOT attempt to influence the response
Designed Experiments
An experiment occurs when we control the conditions under which observations are taken
Deliberately imposes treatments on individuals in order to observe response
- Lab rats are given either a low carbohydrate diet or a high carbohydrate diet to see the effect on weight
Does influence response
Experimental Unit: Subject, animal, or object used in the experiment
Treatment: Experimental condition applied to experimental unit
Response: The thing we measure to determine the effect of the treatments
Confounding
Two variables in a study/experiment
Effects are indistinguishable
Can’t tell which variable caused an effect
Observational studies don’t show causality
Experiments can show causality
- Typically have to have that intent designed into them
Collinearity
Two variables are linearly dependent
“A form of extreme confounding”
- The variables contain the same information to an extent
Bias
The design of a statistical study is biased when one outcome is systematically favored over others
Impossible to correct for after the fact (a larger sample does not help)
It is a systematic error caused by bad sampling design
Problems in Sampling
Making up data is always bad
Samples of convenience are easy, cheap, and easy to intentionally bias
Voluntary response surveys can work well if designed well, but they’re very easy to design poorly
Undercoverage
Nonresponse
Response Bias
Question Wording
Order of Questions
Some Observational Study Types
Case-control studies:
A form of observational study designed for rare conditions, where a simple random sample would yield too few cases to study
Select the case-subjects (those with the trait/condition) of interest
- Take a random sample of those individuals
Select a control group without the condition (ideally with similarities to the case subjects)
- Take a random sample of those individuals
Cohort Studies:
Subjects sharing a common demographic characteristic are enrolled and observed at regular intervals over an extended period of time
Starts with a group of similar individuals
Observations made over regular intervals
Questions?
Goals for Today:
Quantify the strength of linear relationships
Introduce Least Squares Regression
Determine the age of the universe
Correlation, Observation, and Regression
Strength of Linear Relationship
When two variables have a linear relationship
- It’s useful to quantify how strong the relationship is
Visual impressions aren’t really reliable
- Axis scaling can change everything:
Same exact data
Correlation Coefficient
Numerical measurement of the strength (and direction) of the linear relationship between two quantitative variables
Given \(n\) ordered pairs (\(x_i,y_i\))
With sample means \(\bar{x}\) and \(\bar{y}\)
Sample standard deviations \(s_x\) and \(s_y\)
The correlation coefficient \(r\) is given by:
\[r = \frac{1}{n-1} \sum_i \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)\]
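As a quick check (not part of the slides), the formula above can be computed directly in Python. The data here is made up purely for illustration; NumPy’s built-in `corrcoef` serves as a sanity check:

```python
import numpy as np

# Hypothetical example data (not the class dataset)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

n = len(x)
# Sample means and sample standard deviations (ddof=1 gives the n-1 divisor)
x_bar, y_bar = x.mean(), y.mean()
s_x, s_y = x.std(ddof=1), y.std(ddof=1)

# Correlation coefficient: average product of standardized deviations
r = np.sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)) / (n - 1)

# NumPy's built-in correlation agrees
assert np.isclose(r, np.corrcoef(x, y)[0, 1])
```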
Properties of the Correlation Coefficient
- The value of \(r\) always satisfies \(-1 \le r \le 1\)
If \(r=1\), all of the data falls on a line with a positive slope
If \(r=-1\) all of the data falls on a line with a negative slope
The closer \(r\) is to 0, the weaker the linear relationship between \(x\) and \(y\)
If \(r=0\) no linear relationship exists
- The correlation does not depend on the unit of measurement for the two variables
- e.g., \(x\) is house price in dollars and \(y\) is size in \(ft^2\); \(r\) can still be calculated, and its value is unitless
- Correlation is very sensitive to outliers.
One point that does not belong in the dataset can result in a misleading correlation
Always plot your data!
Would we say this measurement is resistant or not?
- Correlation measures only the linear relationship and may not (by itself) detect a nonlinear relationship
Least-Squares Regression
Recall our cholesterol example:
Two variables for each individual in the sample
\(x=\) age of the patient
\(y=\) serum cholesterol level
For the \(i^{th}\) patient, we’ll denote its observed values as:
\(x_i=\) the age of the \(i^{th}\) patient in years
\(y_i=\) the serum cholesterol level of the \(i^{th}\) patient in mmol/L
Age | 67 | 52 | 57 | 56 | 60 | 42 | 58 | 53 | 52 | 39 | 54 | 50 | 62 | 38 | 54 | 48 | 64 | 69 | 53 | 62 |
Cholesterol | 229 | 196 | 241 | 283 | 253 | 265 | 218 | 282 | 205 | 219 | 304 | 196 | 263 | 175 | 214 | 245 | 325 | 239 | 264 | 209 |
The associated scatterplot:
I’ve alluded in the past to the idea that we can draw a line through these points:
Why couldn’t these lines be viable for describing this data?
The vertical distances from the points to the line are smaller for the first line
We determine exactly how well the line fits by:
Squaring the vertical distances
And adding them up
The “Best Fit” line is the line for which the sum of squared vertical distances is as small as possible
This line is known as the Least-Squares Regression Line
Given ordered pairs: (\(x_i,y_i\))
With sample means: \(\bar{x}\) and \(\bar{y}\)
Sample standard deviations: \(s_x\) and \(s_y\)
Correlation coefficient: \(r\)
The equation of the least-squares regression line for predicting \(y\) from \(x\) is:
\[\hat{y}=\beta_0+\beta_1x\]
Where the slope (\(\beta_1\)) is:
\[\beta_1 = r \cdot \frac{s_y}{s_x}\]
And the intercept (\(\beta_0\)) is:
\[\beta_0=\bar{y}-\beta_1 \bar{x}\]
- The variable we want to predict is the outcome or response variable
- And the variable we are given is the explanatory or predictor variable
Let’s calculate the regression equation from our cholesterol example:
\[\bar{x}= 54.5,\ \bar{y}=241.25 ,\ s_x=8.49 ,\ s_y= 39.96,\ r=0.3163405\]
\[\beta_1 = r \cdot \frac{s_y}{s_x}\]
\[\beta_0=\bar{y}-\beta_1 \bar{x}\]
\[\hat{y}=\beta_0+\beta_1x\]
Given our regression equation, predict the cholesterol level for your age
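A minimal sketch of the calculation, using only the summary statistics given on the slide (variable names are my own):

```python
# Summary statistics from the cholesterol example
x_bar, y_bar = 54.5, 241.25
s_x, s_y = 8.49, 39.96
r = 0.3163405

b1 = r * s_y / s_x        # slope: roughly 1.489
b0 = y_bar - b1 * x_bar   # intercept: roughly 160.1

# Predicted cholesterol (mmol/L) for a hypothetical 40-year-old patient
y_hat_40 = b0 + b1 * 40
```

So the fitted line is approximately \(\hat{y} = 160.1 + 1.489x\).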
Interpretation of Least Squares Regression
Interpreting the predicted \(\hat{y}\)
The predicted value of \(\hat{y}\) can be used to estimate the average outcome for a given value of the explanatory variable \(x\)
For any given value of \(x\), the value \(\hat{y}\) is an estimate of the average \(y\)-value for all points with that \(x\)-value
Interpreting \(y\)-intercept \(b_0\)
The \(y\)-intercept \(b_0\) is the point where the line crosses the \(y\)-axis. This has two interpretations:
If the data contains both positive and negative \(x\)-values (so \(x=0\) falls within the range of the data), the \(y\)-intercept is the estimated outcome when the explanatory variable is \(0\)
If the \(x\)-values are all positive or all negative, then \(b_0\) does not carry useful information.
Interpreting the slope \(b_1\)
- If the x-values of two points differ by 1, their \(y\)-values will differ by an amount equal to the slope of the line
At the final exam the professor asked each student to indicate how many hours they have studied for the exam
- The professor computes the least-square regression line for predicting the final exam scores from the number of hours studied
\[\hat{y} = 50 + 5x\]
- Antoine has studied for \(6\) hours. What do you predict his score would be?
- Emma studied \(3\) hours more than Jeremy did. How much higher do you predict Emma’s score to be?
Is there an interpretation of the \(y\)-intercept?
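The two exam predictions above reduce to plugging into the line and using the slope interpretation; a short sketch (function name is my own):

```python
# Regression line from the exam example: predicted score = 50 + 5 * hours
def predict(hours):
    return 50 + 5 * hours

antoine = predict(6)             # Antoine studied 6 hours
# A 3-hour difference in x shifts y-hat by 3 * slope, regardless of baseline
emma_vs_jeremy = predict(3) - predict(0)
```

Antoine's predicted score is 80, and Emma's is predicted to be 15 points higher than Jeremy's.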
The least-squares regression line is \(\hat{y} = 1.908 + 0.06x\), where \(x\) is the freezer temperature in Fahrenheit and \(y\) is the time it takes to freeze.
Predicting the Age of the Universe with LSR
Below is a plot of velocity against distance for 24 galaxies, according to measurements made using the Hubble Space Telescope - Credit: Wood 2006
Hubble’s Law states that galaxies are moving away from Earth at speeds proportional to their distance
We can leverage this law to estimate the age of the Universe
This is, in fact, the data we (the scientific community, humanity, etc.) used for this estimate until horrifyingly recently
Hubble Law:
\[v=H_0D\]
\[v \rightarrow \text{Velocity}\]
\[D \rightarrow \text{Distance}\]
\[H_0 \rightarrow \text{Hubble's Constant}\]
Hubble’s constant is a relative rate of expansion (a.k.a. a slope)
By finding \(H_0^{-1}\) and doing some unit conversions, we get a rough age of the universe
- Don’t worry, I’m making that part easier (it’s a lot of messy arithmetic on its own)
From this data we calculated the following values:
\[ \begin{array}{|c|c|c|c|c|c|c|} \hline \text{n} & \bar{x} & \bar{y} & s_x & s_y & r\\ \hline 24 & 12.055 & 924.375 & 5.815 & 512.814 & 0.8632\\ \hline \end{array} \]
- Calculate \(\beta_1\)
\[\beta_1 = r \cdot \frac{s_y}{s_x}\]
- Calculate \(\beta_0\)
\[\beta_0=\bar{y}-\beta_1 \bar{x}\]
- Write out the Regression Equation
\[\hat{y}=\beta_0+\beta_1x\]
- Compute the below formula:
\[979.708\over \beta_1\]
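The steps above can be sketched in a few lines of Python, using only the summary statistics from the table (variable names are my own; the factor 979.708 is the slide's unit conversion from \(H_0^{-1}\) to billions of years):

```python
# Summary statistics from the slide (Wood 2006 galaxy data)
x_bar, y_bar = 12.055, 924.375   # mean distance, mean velocity
s_x, s_y = 5.815, 512.814
r = 0.8632

H0 = r * s_y / s_x               # slope = estimate of Hubble's constant
b0 = y_bar - H0 * x_bar          # intercept

# Convert 1/H0 to billions of years using the slide's factor
age = 979.708 / H0               # roughly 12.9 billion years
```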