Day 8
Review
Observational Studies
An observational study is one in which we study something as it exists
Observe individuals and measure variables of interest
- Gather medical records to study relationship between smoking and heart disease
Does NOT attempt to influence the response
Designed Experiments
An experiment occurs when we control the conditions under which observations are taken
Deliberately imposes treatments on individuals in order to observe response
- Lab rats are given either a low carbohydrate diet or a high carbohydrate diet to see the effect on weight
Does influence response
Experimental Unit: Subject, animal, or object used in the experiment
Treatment: Experimental condition applied to experimental unit
Response: The thing we measure to determine the effect of the treatments
Confounding
Two variables in a study/experiment
Effects are indistinguishable
Can’t tell which variable caused an effect
Observational studies don’t show causality
Experiments can show causality
- Typically have to have that intent designed into them
Collinearity
Two variables are linearly dependent
“A form of extreme confounding”
- The variables contain the same information to an extent
Bias
The design of a statistical study is biased when one outcome is systematically favored over others
Impossible to correct for after the fact (a larger sample does not help)
It is a systematic error caused by bad sampling design
Problems in Sampling
Making up data is always bad
Samples of convenience are easy, cheap, and easy to intentionally bias
Voluntary response surveys can work well if designed well, but they’re very easy to design poorly
Undercoverage
Nonresponse
Response Bias
Question Wording
Order of Questions
Some Observational Study Types
Case-control studies:
A form of observational study designed for rare conditions, where a simple random sample would yield too few cases to study
Select the case-subjects (those with the trait/condition) of interest
- Take a random sample of those individuals
Select a control group without the condition (ideally with similarities to the case subjects)
- Take a random sample of those individuals
Cohort Studies:
Subjects sharing a common demographic characteristic are enrolled and observed at regular intervals over an extended period of time
Starts with a group of similar individuals
Observations made over regular intervals
Questions?
Goals for Today:
Quantify the strength of linear relationships
Introduce Least Squares Regression
Determine the age of the universe
Correlation, Observation, and Regression
Strength of Linear Relationship
When two variables have a linear relationship
- It’s useful to quantify how strong the relationship is
Visual impressions aren’t really reliable
- Axis scaling can change everything:
Same exact data
Correlation Coefficient
Numerical measurement of the strength (and direction) of the linear relationship between two quantitative variables
Given \(n\) ordered pairs (\(x_i,y_i\))
With sample means \(\bar{x}\) and \(\bar{y}\)
Sample standard deviations \(s_x\) and \(s_y\)
The correlation coefficient \(r\) is given by:
\[r = \frac{1}{n-1} \sum_i \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)\]
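As a quick check (not part of the slides), the formula above can be computed directly in Python. The data here is made up purely for illustration; NumPy’s built-in `corrcoef` serves as a sanity check:

```python
import numpy as np

# Hypothetical example data (not the class dataset)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

n = len(x)
# Sample means and sample standard deviations (ddof=1 gives the n-1 divisor)
x_bar, y_bar = x.mean(), y.mean()
s_x, s_y = x.std(ddof=1), y.std(ddof=1)

# Correlation coefficient: average product of standardized deviations
r = np.sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)) / (n - 1)

# NumPy's built-in correlation agrees
assert np.isclose(r, np.corrcoef(x, y)[0, 1])
```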
Properties of the Correlation Coefficient
- The value of \(r\) always satisfies \(-1 \le r \le 1\)
If \(r=1\), all of the data falls on a line with a positive slope
If \(r=-1\) all of the data falls on a line with a negative slope
The closer \(r\) is to 0, the weaker the linear relationship between \(x\) and \(y\)
If \(r=0\) no linear relationship exists
- The correlation does not depend on the unit of measurement for the two variables
- e.g., \(x\) is house price in dollars and \(y\) is size in \(ft^2\); \(r\) can still be calculated, and its value is unitless
- Correlation is very sensitive to outliers.
One point that does not belong in the dataset can result in a misleading correlation
Always plot your data!
Would we say this measurement is resistant or not?
- Correlation measures only the linear relationship and may not (by itself) detect a nonlinear relationship
Least-Squares Regression
Recall our cholesterol example:
Two variables for each individual in the sample
\(x=\) age of the patient
\(y=\) serum cholesterol level
For the \(i^{th}\) patient, we’ll denote its observed values as:
\(x_i=\) the age of the \(i^{th}\) patient in years
\(y_i=\) the serum cholesterol level of the \(i^{th}\) patient in mmol/L
Age | 67 | 52 | 57 | 56 | 60 | 42 | 58 | 53 | 52 | 39 | 54 | 50 | 62 | 38 | 54 | 48 | 64 | 69 | 53 | 62 |
Cholesterol | 229 | 196 | 241 | 283 | 253 | 265 | 218 | 282 | 205 | 219 | 304 | 196 | 263 | 175 | 214 | 245 | 325 | 239 | 264 | 209 |
The associated scatterplot:
I’ve alluded in the past to the idea that we can draw a line through these points:
Why couldn’t these lines be viable for describing this data?
The vertical distances from the points to the line are smaller for the first line
We determine exactly how well the line fits by:
Squaring the vertical distances
And adding them up
The “Best Fit” line is the line for which the sum of squared vertical distances is as small as possible
This line is known as the Least-Squares Regression Line
Given ordered pairs: (\(x_i,y_i\))
With sample means: \(\bar{x}\) and \(\bar{y}\)
Sample standard deviations: \(s_x\) and \(s_y\)
Correlation coefficient: \(r\)
The equation of the least-squares regression line for predicting \(y\) from \(x\) is:
\[\hat{y}=\beta_0+\beta_1x\]
Where the slope (\(\beta_1\)) is:
\[\beta_1 = r \cdot \frac{s_y}{s_x}\]
And the intercept (\(\beta_0\)) is:
\[\beta_0=\bar{y}-\beta_1 \bar{x}\]
- The variable we want to predict is the outcome or response variable
- And the variable we are given is the explanatory or predictor variable
Let’s calculate the regression equation from our cholesterol example:
\[\bar{x}= 54.5,\ \bar{y}=241.25 ,\ s_x=8.49 ,\ s_y= 39.96,\ r=0.3163405\]
\[\beta_1 = r \cdot \frac{s_y}{s_x}\]
\[\beta_0=\bar{y}-\beta_1 \bar{x}\]
\[\hat{y}=\beta_0+\beta_1x\]
Given our regression equation, predict the cholesterol level for your age
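A minimal sketch of the calculation, using only the summary statistics given on the slide (variable names are my own):

```python
# Summary statistics from the cholesterol example
x_bar, y_bar = 54.5, 241.25
s_x, s_y = 8.49, 39.96
r = 0.3163405

b1 = r * s_y / s_x        # slope: roughly 1.489
b0 = y_bar - b1 * x_bar   # intercept: roughly 160.1

# Predicted cholesterol (mmol/L) for a hypothetical 40-year-old patient
y_hat_40 = b0 + b1 * 40
```

So the fitted line is approximately \(\hat{y} = 160.1 + 1.489x\).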
Interpretation of Least Squares Regression
Interpreting the predicted \(\hat{y}\)
The predicted value of \(\hat{y}\) can be used to estimate the average outcome for a given value of the explanatory variable \(x\)
For any given value of \(x\), the value \(\hat{y}\) is an estimate of the average \(y\)-value for all points with that \(x\)-value
Interpreting \(y\)-intercept \(b_0\)
The \(y\)-intercept \(b_0\) is the point where the line crosses the \(y\)-axis. This has two interpretations:
If the data contains both positive and negative \(x\)-values (so \(x=0\) falls within the range of the data), the \(y\)-intercept is the estimated outcome when the explanatory variable is \(0\)
If the \(x\)-values are all positive or all negative, then \(b_0\) does not carry useful information.
Interpreting the slope \(b_1\)
- If the x-values of two points differ by 1, their \(y\)-values will differ by an amount equal to the slope of the line
At the final exam the professor asked each student to indicate how many hours they have studied for the exam
- The professor computes the least-square regression line for predicting the final exam scores from the number of hours studied
\[\hat{y} = 50 + 5x\]
- Antoine has studied for \(6\) hours. What do you predict his score would be?
- Emma studied \(3\) hours more than Jeremy did. How much higher do you predict Emma’s score to be?
Is there an interpretation of the \(y\)-intercept?
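The two exam predictions above reduce to plugging into the line and using the slope interpretation; a short sketch (function name is my own):

```python
# Regression line from the exam example: predicted score = 50 + 5 * hours
def predict(hours):
    return 50 + 5 * hours

antoine = predict(6)             # Antoine studied 6 hours
# A 3-hour difference in x shifts y-hat by 3 * slope, regardless of baseline
emma_vs_jeremy = predict(3) - predict(0)
```

Antoine's predicted score is 80, and Emma's is predicted to be 15 points higher than Jeremy's.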
The least-squares regression line is \(\hat{y} = 1.908 + 0.06x\), where \(x\) is the freezer temperature in Fahrenheit and \(y\) is the time it takes to freeze.
Predicting the Age of the Universe with LSR
Below is a plot of velocity against distance for 24 galaxies, according to measurements made using the Hubble Space Telescope - Credit: Wood 2006
Hubble’s Law states that galaxies are moving away from Earth at speeds proportional to their distance
We can leverage this law to estimate the age of the Universe
This is, in fact, the data we (the scientific community, humanity, etc.) used for this estimate until horrifyingly recently
Hubble Law:
\[v=H_0D\]
\[v \rightarrow \text{Velocity}\]
\[D \rightarrow \text{Distance}\]
\[H_0 \rightarrow \text{Hubble's Constant}\]
Hubble’s constant is a relative rate of expansion (a.k.a. a slope)
By finding \(H_0^{-1}\) and doing some unit conversions, we get a rough age of the universe
- Don’t worry, I’m making that part easier (it’s a lot of messy arithmetic on its own)
From this data we calculated the following values:
\[ \begin{array}{|c|c|c|c|c|c|c|} \hline \text{n} & \bar{x} & \bar{y} & s_x & s_y & r\\ \hline 24 & 12.055 & 924.375 & 5.815 & 512.814 & 0.8632\\ \hline \end{array} \]
- Calculate \(\beta_1\)
\[\beta_1 = r \cdot \frac{s_y}{s_x}\]
- Calculate \(\beta_0\)
\[\beta_0=\bar{y}-\beta_1 \bar{x}\]
- Write out the Regression Equation
\[\hat{y}=\beta_0+\beta_1x\]
- Compute the below formula:
\[979.708\over \beta_1\]
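The steps above can be sketched in a few lines of Python, using only the summary statistics from the table (variable names are my own; the factor 979.708 is the slide's unit conversion from \(H_0^{-1}\) to billions of years):

```python
# Summary statistics from the slide (Wood 2006 galaxy data)
x_bar, y_bar = 12.055, 924.375   # mean distance, mean velocity
s_x, s_y = 5.815, 512.814
r = 0.8632

H0 = r * s_y / s_x               # slope = estimate of Hubble's constant
b0 = y_bar - H0 * x_bar          # intercept

# Convert 1/H0 to billions of years using the slide's factor
age = 979.708 / H0               # roughly 12.9 billion years
```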