Day 7

We’re approaching “the meat” of our first course section. Next week we’ll be removing a few training wheels and analyzing real data with the techniques we’ve learned up until now.


Review

Recall the 5 Number Summary:


\[ \begin{array}{|c|c|c|c|c|} \hline \text{Min} & \text{Q}_1 & \text{Median} & \text{Q}_3 & \text{Max} \\ \hline 90 & 124 & 165 & 182 & 196\\ \hline \end{array} \]
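As a rough sketch of how such a summary could be computed, the snippet below uses Python's standard `statistics` module. The data set and the function name `five_number_summary` are made up for illustration, and the quartile convention (`method="inclusive"`) is an assumption: textbooks and software use several slightly different quartile rules, so results can differ from a hand calculation.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    # quantiles(..., n=4) returns the three cut points Q1, median, Q3.
    # The "inclusive" method is one of several quartile conventions.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

# Hypothetical data, not the data behind the table above.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(five_number_summary(data))
```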



And how it’s represented in a boxplot:



What are the components of the 5 number summary on this boxplot?


What are the circles?



Boxplots are best used for comparing two data sets:


Scatterplots

When we have two quantitative variables we use scatterplots


Define one variable as \(x\) and one as \(y\)


  • Given \(x_i\) is the \(i^{th}\) data point in the \(x\) set


  • \(y_i\) is the \(i^{th}\) data point in the \(y\) set


  • The data set should be:


\[(x_1,y_1),(x_2,y_2),\ldots,(x_n,y_n)\]


What is this called?


For any two variables we can define their relationship as a:


  • Positive association if large values of one variable are associated with large values of the other


  • Negative association if large values of one variable are associated with small values of the other


  • Two variables can have a linear relationship if the data tend to cluster around a straight line when plotted on a scatterplot




For each of the following scatterplots, state the type of association that is exhibited:





Simple Random Sample

  • A sample chosen by a method where every possible sample of the same size is equally likely to be selected from the population
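A minimal sketch of drawing a simple random sample, assuming the population can be listed in full. The population of student IDs and the function name `simple_random_sample` are hypothetical; `random.sample` draws without replacement, so every subset of the requested size is equally likely.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n items without replacement from the population."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(population, n)

# Hypothetical population of 100 students.
students = [f"student_{i}" for i in range(1, 101)]
srs = simple_random_sample(students, 10, seed=1)
print(srs)
```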


Stratified Sampling

  • Divide the population into groups of similar individuals (e.g., group students by college)

  • Take a random sample from each of those groups (strata)
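The two steps can be sketched as follows. The college names, group sizes, and the function `stratified_sample` are hypothetical, and taking the same number from every stratum is an assumption made for simplicity; sampling in proportion to stratum size is another common choice.

```python
import random

def stratified_sample(strata, n_per_stratum, seed=None):
    """Take a simple random sample of n_per_stratum members from every stratum."""
    rng = random.Random(seed)
    return {
        name: rng.sample(members, n_per_stratum)
        for name, members in strata.items()
    }

# Hypothetical strata: students grouped by college.
colleges = {
    "Agriculture": [f"ag_{i}" for i in range(40)],
    "Engineering": [f"eng_{i}" for i in range(60)],
    "Business": [f"bus_{i}" for i in range(50)],
}

stratified = stratified_sample(colleges, n_per_stratum=5, seed=1)
for college, chosen in stratified.items():
    print(college, chosen)
```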


Cluster Sampling

  • Divide the population into clusters (e.g., split Manhattan, KS by street block)

  • Randomly select some of the clusters and sample every individual in the selected clusters
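A sketch of cluster sampling, assuming the usual one-stage version in which whole clusters are chosen at random and everyone in a chosen cluster is included. The street blocks, households, and the function name `cluster_sample` are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly pick n_clusters whole clusters and keep all of their members."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # sample cluster names
    return {name: clusters[name] for name in chosen}

# Hypothetical clusters: 10 street blocks with 5 households each.
blocks = {
    f"block_{b}": [f"house_{b}_{h}" for h in range(1, 6)]
    for b in range(1, 11)
}

clustered = cluster_sample(blocks, n_clusters=3, seed=1)
for block, households in clustered.items():
    print(block, households)
```

Compare this with stratified sampling: there, every group contributes a few individuals; here, only a few groups are used, but each contributes all of its individuals.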


Systematic Sampling

  • Randomly choose a start point in a “lined-up” population

  • Sample every \(k^{th}\) item

  • e.g., starting from the \(4^{th}\) batch of ice cream produced on a given day, Call Hall will check the quality of every \(4^{th}\) batch that comes off the production line
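A sketch of systematic sampling using list slicing. The function name `systematic_sample` and the 20 numbered batches are hypothetical, and choosing the random start among the first \(k\) items is an assumption (a common convention, not something stated above).

```python
import random

def systematic_sample(items, k, seed=None):
    """Pick a random start among the first k items, then take every k-th item."""
    rng = random.Random(seed)
    start = rng.randrange(k)  # index 0 .. k-1
    return items[start::k]

# Hypothetical production line: batches numbered 1 through 20.
batches = list(range(1, 21))

# Random start, then every 4th batch.
print(systematic_sample(batches, k=4, seed=1))

# The fixed version of the ice cream example: start at batch 4, check every 4th.
every_fourth = batches[3::4]
print(every_fourth)
```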


Voluntary Response Samples

  • Customer support reviews


Sample of Convenience

  • Class height



Questions?




Goals for Today:

  1. Define Observational Studies vs. Designed Experiments

  2. Learn some considerations and errors in study design

  3. Look at some examples and analyze them




Correlation, Observation, and Regression


Observational Studies

An observational study is one in which we study something as it exists

  • Observe individuals and measure variables of interest

    • Gather medical records to study relationship between smoking and heart disease


Does NOT attempt to influence the response

  • Purpose: Describe and compare EXISTING groups or situations (Cannot control anything!)



A hunting ground in Wisconsin has every \(3^{rd}\) hunter on site turn over any deer they kill to have their brain matter sampled. \(1452\) samples are taken and tested for Chronic Wasting Disease.


Why is this an observational study?


What kind of sampling technique was used?




Designed Experiments

An experiment occurs when we control the conditions under which observations are taken

  • Deliberately imposes treatments on individuals in order to observe response

    • Lab rats are given either a low carbohydrate diet or a high carbohydrate diet to see the effect on weight


Does influence response

  • Purpose: Look to see whether the treatment causes a change in the response

    • Understand cause and effect



A field of sorghum is split into \(4\) sections. \(3\) of those sections are given different fertilizers with varying levels of nitrogen. The fourth section is not given any fertilizer. The fields are harvested at the same time and their yield is recorded.


Experimental Unit: Subject, animal, or object used in the experiment

  • What’s our experimental unit in the experiment above?


Treatment: Experimental condition applied to experimental unit

  • What was the treatment in the sorghum experiment?


Response: The thing we measure to determine the effect of the treatments

  • What was the response above?




Study Considerations

Let’s go back to our cholesterol study


Are there any problems with the data we’ve collected and the inference we pull from it?


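\[ \begin{array}{|l|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|} \hline \text{Age} & 67 & 52 & 57 & 56 & 60 & 42 & 58 & 53 & 52 & 39 & 54 & 50 & 62 & 38 & 54 & 48 & 64 & 69 & 53 & 62 \\ \hline \text{Cholesterol} & 229 & 196 & 241 & 283 & 253 & 265 & 218 & 282 & 205 & 219 & 304 & 196 & 263 & 175 & 214 & 245 & 325 & 239 & 264 & 209 \\ \hline \end{array} \]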
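One way to put a number on the relationship in this data is the Pearson correlation coefficient \(r\). It is not defined in these notes, so treat the sketch below as an assumption about how the association might be quantified; the function name `pearson_r` is made up for illustration. A value of \(r\) near \(+1\) or \(-1\) indicates a strong linear relationship, and a value near \(0\) a weak one.

```python
import math

# The 20 (age, cholesterol) pairs from the table above.
ages = [67, 52, 57, 56, 60, 42, 58, 53, 52, 39,
        54, 50, 62, 38, 54, 48, 64, 69, 53, 62]
cholesterol = [229, 196, 241, 283, 253, 265, 218, 282, 205, 219,
               304, 196, 263, 175, 214, 245, 325, 239, 264, 209]

def pearson_r(xs, ys):
    """Pearson correlation: co-variation of x and y scaled by their spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(ages, cholesterol)
print(f"n = {len(ages)}, r = {r:.2f}")
```

Even when \(r\) is positive, a sample of only 20 people from an observational study cannot show that aging causes cholesterol to rise.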









Confounding

Two variables in a study/experiment

  • Effects are indistinguishable

  • Can’t tell which variable caused an effect

  • Observational studies don’t show causality

  • Experiments can show causality

    • Typically have to have that intent designed into them


Drinking wine is linked to better health

  • But what you drink is confounded with other variables

    • Diet, wealth, lifestyle, etc



Collinearity

Two variables are linearly dependent

“A form of extreme confounding”

  • The variables contain the same information to an extent


On observing students in an elementary school, what is the single best variable to collect in order to predict their height?




Bias

The design of a statistical study is biased when one outcome is systematically favored over others


Impossible to correct for


It is a systematic error caused by bad sampling design

  • The outcomes of the shopping mall surveys will repeatedly miss the truth about the population in the same ways

    • Overrepresents: Middle-class and Elderly

    • Underrepresents: Poor




Problems in Sampling

Making up data is always bad


Samples of convenience are easy, cheap, and easy to intentionally bias


Voluntary response surveys can work well if designed well, but they’re very easy to design poorly


Undercoverage

  • Did we reach all of the intended groups in our population?


Nonresponse

  • Did we complete our survey with all respondents?


Response Bias

  • Were the respondents likely to avoid answering a question truthfully?


Question Wording

  • Do we invoke bias through the way we construct our questions?


Order of Questions

  • Do previous questions have an effect on the response to later questions?




Case-control Studies


1 in every 1000 male newborns is affected by Jacob’s Syndrome, a rare genetic condition where a male receives an extra Y chromosome.

If you were to sample 10000 men in a perfectly designed SRS, how many useful data points would you expect to have for a study on Jacob’s Syndrome?
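As a rough worked calculation, assuming the 1-in-1000 rate applies to the men being sampled, the expected number of affected men is the sample size times the rate:

\[ 10000 \times \frac{1}{1000} = 10 \]

About ten cases out of \(10000\) sampled men is very little information about the condition itself, no matter how well the sample was drawn.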


Case-control studies are a form of observational study that adjusts for this unique problem


  • Select the case-subjects (those with the trait/condition) of interest

    • Take a random sample of those individuals


  • Select a control group without the condition (ideally with similarities to the case subjects)

    • Take a random sample of those individuals


Researchers look for exposure factors in the subjects’ past that differ


Retrospective

  • Looking back into the past

  • The outcome of interest is identified first, then earlier exposure to the risk factor of interest is investigated


May be confounded

  • Recall bias: memories are not always accurate




Cohort Studies


Millions of adults in the US have Type 2 Diabetes. As such, research into treatment is a wildly lucrative industry. But many of the treatments associated with the illness are very long term and require regular “check-ins”.


Subjects sharing a common demographic characteristic are enrolled and observed at regular intervals over an extended period of time

  • Starts with a group of similar individuals

  • Observations made over regular intervals

    • Generally prospective

    • Expensive

    • Better suited for common outcomes

    • Extremely Informative




Examples



Gallup considers a proper poll to be a representative sample of 1000 people. A local news station resolves to follow this rule of thumb and call 250 individuals from each of 4 different age groups as part of the design for a political poll.

\[ \begin{array}{|c|c|c|c|c|} \hline \text{Age Group} & 18-26 & 27-40 & 41-52 & 52+ \\ \hline \end{array} \]

They resolve to only count calls that are answered, rather than recording unanswered calls as nonresponses. They collect their results, develop their graphics, and run their “Political Statistics” segment as planned.


What are the components of this study?





Are there any problems with the study?






Below is a set of questions in a poll of visitors to a National Park:


  1. How much do you appreciate the incredible natural beauty and unique wildlife found in this national park?

  2. How satisfied are you with the park’s facilities, such as trails, restrooms, and visitor centers?

  3. Would you recommend visiting this national park to your friends and family?

  4. Do you think this national park does an excellent job preserving its natural beauty and wildlife?

  5. How would you rate your overall experience at this national park today?


Each question used a scale of 1-5, where 1 was the most negative response and 5 was the most positive. The survey was randomly given to 500 park visitors every month for 10 months.


What are the components of this study?





Are there any problems with the study?





A study investigates whether a new type of fertilizer is linked to an increased risk of skin irritation among farmers. The researchers recruit 200 farmers with skin irritation (cases) and 200 farmers without skin irritation (controls). They ask both groups about their fertilizer use over the past five years.


What are the components of this study?





Are there any problems with the study?





A school district examines whether longer lunch breaks improve student test scores. They collect test score data from schools that already have longer lunch breaks and compare them to schools with shorter breaks, adjusting for factors like socioeconomic status, class size, and teacher experience.


What are the components of this study?





Are there any problems with the study?





A 10-year study follows 1,000 people who regularly eat spicy food and 1,000 who do not, examining the incidence of stomach ulcers in each group.


What are the components of this study?





Are there any problems with the study?






Attendance QOTD


Go away