Day 8

Review


Observational Studies

An observational study is one in which we study something as it exists

  • Observe individuals and measure variables of interest

    • Gather medical records to study relationship between smoking and heart disease


Does NOT attempt to influence the response



Designed Experiments

An experiment occurs when we control the conditions under which observations are taken

  • Deliberately imposes treatments on individuals in order to observe response

    • Lab rats are given either a low carbohydrate diet or a high carbohydrate diet to see the effect on weight


Does influence the response


Experimental Unit: Subject, animal, or object used in the experiment


Treatment: Experimental condition applied to experimental unit


Response: The thing we measure to determine the effect of the treatments



Confounding

Two variables in a study/experiment

  • Effects are indistinguishable

  • Can’t tell which variable caused an effect

  • Observational studies don’t show causality

  • Experiments can show causality

    • Typically that intent has to be designed into them



Collinearity

Two variables are linearly dependent

“A form of extreme confounding”

  • The variables contain the same information to an extent



Bias

The design of a statistical study is biased when one outcome is systematically preferred to others


Impossible to correct for after the data are collected


It is a systematic error caused by bad sampling design


Problems in Sampling

Making up data is always bad


Samples of convenience are cheap and easy to collect, but also easy to bias (intentionally or otherwise)


Voluntary response surveys can work well if designed well, but they’re very easy to design poorly


Undercoverage


Nonresponse


Response Bias


Question Wording


Order of Questions


Some Observational Study Types

Case-control studies:

A form of observational study suited to rare traits or conditions, where a simple random sample would capture too few cases


  • Select the case-subjects (those with the trait/condition) of interest

    • Take a random sample of those individuals


  • Select a control group without the condition (ideally with similarities to the case subjects)

    • Take a random sample of those individuals


Cohort Studies:

Subjects sharing a common demographic characteristic are enrolled and observed at regular intervals over an extended period of time


  • Starts with a group of similar individuals

  • Observations made over regular intervals




Questions?




Goals for Today:

  1. Quantify the strength of linear relationships

  2. Introduce Least Squares Regression

  3. Determine the age of the universe




Correlation, Observation, and Regression


Strength of Linear Relationship

  • When two variables have a linear relationship

    • It’s useful to quantify how strong the relationship is


  • Visual impressions aren’t really reliable

    • Axis scaling can change everything:



Same exact data




Correlation Coefficient

Numerical measurement of the strength (and direction) of the linear relationship between two quantitative variables


  • Given \(n\) ordered pairs \((x_i,y_i)\)

    • With sample means \(\bar{x}\) and \(\bar{y}\)

    • Sample standard deviations \(s_x\) and \(s_y\)

    • The correlation coefficient \(r\) is given by:


\[r = \frac{1}{n-1} \sum_i \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)\]
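The formula can be checked numerically. A minimal sketch (with made-up data, not from the notes): standardize each coordinate using the sample means and standard deviations, then average the products of the z-scores. The result matches NumPy's built-in `corrcoef`.

```python
import numpy as np

# Made-up illustrative data (not from the notes)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

n = len(x)
# Standardize each coordinate using sample means and sample (n-1) standard deviations
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)

# r is the sum of the products of z-scores, divided by n-1
r = (zx * zy).sum() / (n - 1)

print(r)  # agrees with np.corrcoef(x, y)[0, 1]
```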




Properties of the Correlation Coefficient

  1. The value always satisfies \(-1 \le r \le 1\)
  • If \(r=1\), all of the data falls on a line with positive slope

  • If \(r=-1\), all of the data falls on a line with negative slope

  • The closer \(r\) is to 0, the weaker the linear relationship between \(x\) and \(y\)

  • If \(r=0\), no linear relationship exists


  2. The correlation does not depend on the units of measurement for the two variables
  • \(x\) can be house price and \(y\) can be area in \(ft^2\), and \(r\) can still be calculated


  3. Correlation is very sensitive to outliers
  • One point that does not belong in the dataset can result in a misleading correlation

  • Always plot your data!

  • Would we say this measurement is resistant or not?


  4. Correlation measures only the linear relationship and may not (by itself) detect a nonlinear relationship
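The sensitivity to outliers is easy to demonstrate. In this sketch (made-up data, not from the notes), ten points fall exactly on a line, so \(r=1\); adding one stray point drags \(r\) down dramatically.

```python
import numpy as np

# Ten made-up points lying exactly on a line: r = 1
x = np.arange(10, dtype=float)
y = 2 * x + 1
r_clean = np.corrcoef(x, y)[0, 1]

# Add one wild outlier and recompute
x_out = np.append(x, 5.0)
y_out = np.append(y, 100.0)
r_out = np.corrcoef(x_out, y_out)[0, 1]

print(r_clean, r_out)  # r drops from 1 to roughly 0.26
```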





Least-Squares Regression

Recall our cholesterol example:


Two variables for each individual in the sample


  • \(x=\) age of the patient

  • \(y=\) serum cholesterol level


For the \(i^{th}\) patient, we’ll denote its observed values as:


  • \(x_i=\) the age of the \(i^{th}\) patient in years

  • \(y_i=\) the serum cholesterol level of the \(i^{th}\) patient in mmol/L


Age 67 52 57 56 60 42 58 53 52 39 54 50 62 38 54 48 64 69 53 62
Cholesterol 229 196 241 283 253 265 218 282 205 219 304 196 263 175 214 245 325 239 264 209
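The summary statistics used later in the notes can be reproduced from this table with a short NumPy script:

```python
import numpy as np

age = np.array([67, 52, 57, 56, 60, 42, 58, 53, 52, 39,
                54, 50, 62, 38, 54, 48, 64, 69, 53, 62], dtype=float)
chol = np.array([229, 196, 241, 283, 253, 265, 218, 282, 205, 219,
                 304, 196, 263, 175, 214, 245, 325, 239, 264, 209], dtype=float)

xbar, ybar = age.mean(), chol.mean()   # sample means: 54.5 and 241.25
sx = age.std(ddof=1)                   # sample standard deviations
sy = chol.std(ddof=1)
r = np.corrcoef(age, chol)[0, 1]       # correlation: about 0.316

print(xbar, ybar, sx, sy, r)
```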


The associated scatterplot:



I’ve alluded in the past to the idea that we can draw a line through these points:






Why couldn’t these lines be viable ones for describing this data?





The vertical distances from the points to the line are smaller for the first line


We determine exactly how well the line fits by:

  • Squaring the vertical distances

  • And adding them up


The “Best Fit” line is the line for which the sum of squared distances is as small as possible


This line is known as the Least-Squares Regression Line





Given ordered pairs \((x_i,y_i)\)

  • With sample means: \(\bar{x}\) and \(\bar{y}\)

    • Sample standard deviations: \(s_x\) and \(s_y\)

    • Correlation coefficient: \(r\)

    • The equation of the least-squares regression line for predicting \(y\) from \(x\) is:


\[\hat{y}=\beta_0+\beta_1x\]


Where the slope (\(\beta_1\)) is:


\[\beta_1 = r \cdot \frac{s_y}{s_x}\]


And the intercept (\(\beta_0\)) is:


\[\beta_0=\bar{y}-\beta_1 \bar{x}\]


  • The variable we want to predict is the outcome or response variable


  • And the variable we are given is the explanatory or predictor variable



Let’s calculate the regression equation from our cholesterol example:



\[\bar{x}= 54.5,\ \bar{y}=241.25,\ s_x=8.49,\ s_y=38.96,\ r=0.3163405\]



\[\beta_1 = r \cdot \frac{s_y}{s_x}\]



\[\beta_0=\bar{y}-\beta_1 \bar{x}\]



\[\hat{y}=\beta_0+\beta_1x\]
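As a check on the steps above, the slope and intercept can be computed directly from the raw data (the age of 40 in the prediction is just an example value):

```python
import numpy as np

age = np.array([67, 52, 57, 56, 60, 42, 58, 53, 52, 39,
                54, 50, 62, 38, 54, 48, 64, 69, 53, 62], dtype=float)
chol = np.array([229, 196, 241, 283, 253, 265, 218, 282, 205, 219,
                 304, 196, 263, 175, 214, 245, 325, 239, 264, 209], dtype=float)

r = np.corrcoef(age, chol)[0, 1]
b1 = r * chol.std(ddof=1) / age.std(ddof=1)   # slope
b0 = chol.mean() - b1 * age.mean()            # intercept

print(b0, b1)  # roughly 162.1 and 1.45

# Predicted average cholesterol level for a 40-year-old (example age)
print(b0 + b1 * 40)
```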



Given our regression equation, predict the cholesterol level for your age



Interpretation of Least Squares Regression

  • Interpreting the predicted \(\hat{y}\)

    • The predicted value of \(\hat{y}\) can be used to estimate the average outcome for a given value of the explanatory variable \(x\)

    • For any given value of \(x\), the value \(\hat{y}\) is an estimate of the average \(y\)-value for all points with that \(x\)-value


  • Interpreting the \(y\)-intercept \(\beta_0\)

    • The \(y\)-intercept \(\beta_0\) is the point where the line crosses the \(y\)-axis. There are two cases:

    • If the data has both positive and negative \(x\)-values, the \(y\)-intercept is the estimated outcome when the value of the explanatory variable is \(0\)


If the \(x\)-values are all positive or all negative, then \(\beta_0\) does not carry useful information.

  • Interpreting the slope \(\beta_1\)

    • If the \(x\)-values of two points differ by 1, their predicted \(y\)-values will differ by an amount equal to the slope of the line




At the final exam, the professor asked each student to indicate how many hours they had studied for the exam

  • The professor computes the least-squares regression line for predicting final exam scores from the number of hours studied

\[\hat{y} = 50 + 5x\]


  • Antoine has studied for \(6\) hours. What do you predict his score would be?



  • Emma studied \(3\) hours more than Jeremy did. How much higher do you predict Emma’s score to be?
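Both questions can be checked with a tiny script (the `predict` helper is just for illustration):

```python
# The professor's fitted line: y-hat = 50 + 5x
def predict(hours):
    return 50 + 5 * hours

print(predict(6))  # Antoine's predicted score: 80

# Emma studied 3 hours more than Jeremy, so her predicted score is
# higher by slope * 3, no matter how long Jeremy studied
print(predict(9) - predict(6))  # 15
```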





Is there an interpretation of the \(y\)-intercept?

The least-squares regression line is \(\hat{y} = 1.908 + 0.06x\), where \(x\) is the freezer temperature in Fahrenheit and \(y\) is the time it takes to freeze.





Predicting the Age of the Universe with LSR

Below is a plot of velocity against distance for 24 galaxies, according to measurements made using the Hubble Space Telescope - Credit: Wood 2006



Hubble’s Law states that galaxies are moving away from Earth at speeds proportional to their distance


We can leverage this law to estimate the age of the Universe


This is, in fact, the data we (the scientific community, humanity, etc.) used for this estimate until horrifyingly recently


Hubble’s Law:

\[v=H_0D\]

\[v \rightarrow \text{Velocity}\]

\[D \rightarrow \text{Distance}\]

\[H_0 \rightarrow \text{Hubble's Constant}\]


Hubble’s constant is a relative rate of expansion (i.e., a slope)


By finding \(H_0^{-1}\) and doing some unit conversions, we get a rough age of the universe

  • Don’t worry, I’m making that part easier (it’s a lot of messy arithmetic on its own)


From this data we calculated the following values:


\[ \begin{array}{|c|c|c|c|c|c|c|} \hline \text{n} & \bar{x} & \bar{y} & s_x & s_y & r\\ \hline 24 & 12.055 & 924.375 & 5.815 & 512.814 & 0.8632\\ \hline \end{array} \]


  1. Calculate \(\beta_1\)


\[\beta_1 = r \cdot \frac{s_y}{s_x}\]



  2. Calculate \(\beta_0\)


\[\beta_0=\bar{y}-\beta_1 \bar{x}\]



  3. Write out the Regression Equation


\[\hat{y}=\beta_0+\beta_1x\]



  4. Compute the formula below:

\[\frac{979.708}{\beta_1}\]
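The four steps can be put together in a few lines, treating 979.708 as the unit-conversion constant supplied in the notes (which appears to turn \(H_0^{-1}\) into billions of years):

```python
# Summary statistics from the galaxy data table in the notes
xbar, ybar = 12.055, 924.375
sx, sy, r = 5.815, 512.814, 0.8632

b1 = r * sy / sx        # slope: the estimate of Hubble's constant H0
b0 = ybar - b1 * xbar   # intercept

# 979.708 is the conversion constant given in the notes; dividing it by
# the slope gives an age in billions of years
age = 979.708 / b1

print(b1, b0, age)  # roughly 76.1, 6.7, and 12.9 billion years
```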







Attendance QOTD


Go away