Day 7
We’re approaching “the meat” of our first course section. Next week we’ll be removing a few training wheels and analyzing real data with the techniques we’ve learned so far.
Review
Recall the 5 Number Summary:
\[ \begin{array}{|c|c|c|c|c|} \hline \text{Min} & \text{Q}_1 & \text{Median} & \text{Q}_3 & \text{Max} \\ \hline 90 & 124 & 165 & 182 & 196\\ \hline \end{array} \]
And how it’s represented in a boxplot:
What are the components of the 5 number summary on this boxplot?
What are the circles?
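As a rough sketch of how the summary can be computed by hand, the Python below uses the median-of-halves rule for the quartiles and the common 1.5 × IQR convention for flagging points that many boxplots draw as separate markers. Both conventions are assumptions here (textbooks and software differ on quartile rules), and the data values are hypothetical, made up so that the result matches the table above.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]        # values below the median position
    upper = data[n - half:]    # values above the median position
    return (data[0], median(lower), median(data), median(upper), data[-1])

def outliers_by_iqr(values):
    """Values more than 1.5 * IQR beyond the quartiles (a common convention)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]

# Hypothetical data, chosen only so the summary matches the table above.
sample = [90, 100, 124, 140, 150, 165, 170, 175, 182, 190, 196]
summary = five_number_summary(sample)
flagged = outliers_by_iqr(sample)
print(summary)
print(flagged)
```

With these made-up values no point falls outside the fences; adding one very large value to the list would cause it to be flagged.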
Boxplots are best used for comparing two data sets:
Scatterplots
When we have two quantitative variables we use scatterplots
Define one variable as \(x\) and one as \(y\)
- Given \(x_i\) is the \(i^{th}\) data point in the \(x\) set
- \(y_i\) is the \(i^{th}\) data point in the \(y\) set
- The data set should be:
\[(x_1,y_1),(x_2,y_2),...,(x_n,y_n)\]
What is this called?
For any two variables we can define their relationship as a:
- Positive association if large values of one variable are associated with large values of the other
- Negative association if large values of one variable are associated with small values of the other
- Two variables can have a linear relationship if the data tend to cluster around a straight line when plotted on a scatterplot
For each of the following scatterplots, state the type of association that is exhibited:
Simple Random Sample
- A sample chosen by a method in which every possible sample of the same size from the population is equally likely to be selected
Stratified Sampling
Divide the population into similar groups (e.g., group students by college)
Randomly sample from those groups (strata)
Cluster Sampling
Divide the population into clusters (e.g., split Manhattan, KS by street block)
Randomly sample from the clusters
Systematic Sampling
Randomly choose a start point in a “lined-up” population
Sample every \(k^{th}\) item
e.g., starting from the \(4^{th}\) batch of ice cream produced on a given day, Call Hall will check the quality of every \(4^{th}\) batch that comes off the production line
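The four random sampling designs above can be sketched in a few lines of Python using the standard `random` module. This is only an illustration: the function names, the population of ID numbers, and the strata and cluster labels are all hypothetical, and cluster sampling is read here as choosing whole clusters at random and keeping every unit in them, which is an assumption about the intended meaning.

```python
import random

def simple_random_sample(population, n, rng):
    """Every subset of size n is equally likely to be drawn."""
    return rng.sample(population, n)

def stratified_sample(strata, n_per_stratum, rng):
    """Draw a simple random sample from every stratum."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    """Pick whole clusters at random and keep every unit inside them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for name in chosen for unit in clusters[name]]

def systematic_sample(population, k, rng):
    """Random start among the first k items, then every k-th item after it."""
    start = rng.randrange(k)
    return population[start::k]

rng = random.Random(7)                 # fixed seed so reruns give the same draw
population = list(range(1, 101))       # hypothetical ID numbers 1..100
strata = {"A": population[:50], "B": population[50:]}
clusters = {f"block{i}": population[i * 10:(i + 1) * 10] for i in range(10)}

srs = simple_random_sample(population, 10, rng)
strat = stratified_sample(strata, 5, rng)
clust = cluster_sample(clusters, 2, rng)
syst = systematic_sample(population, 4, rng)
```

The ice cream example corresponds to a fixed start at the fourth item, `population[3::4]`, rather than a randomly chosen one.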
Voluntary Response Samples
- Customer support reviews
Sample of Convenience
- Class height
Questions?
Goals for Today:
Define Observational Studies vs. Designed Experiments
Learn some considerations and errors in study design
Look at some examples and analyze them
Correlation, Observation, and Regression
Observational Studies
An observational study is one in which we study something as it exists
Observe individuals and measure variables of interest
- Gather medical records to study relationship between smoking and heart disease
Does NOT attempt to influence the response
- Purpose: Describe and compare EXISTING groups or situations (Cannot control anything!)
A hunting ground in Wisconsin has every \(3^{rd}\) hunter on site turn over any deer they kill to have their brain matter sampled. \(1452\) samples are taken and tested for Chronic Wasting Disease.
Why is this an observational study?
What kind of sampling technique was used?
Designed Experiments
An experiment occurs when we control the conditions under which observations are taken
Deliberately imposes treatments on individuals in order to observe response
- Lab rats are given either a low carbohydrate diet or a high carbohydrate diet to see the effect on weight
Does influence response
Purpose: Look to see whether the treatment causes a change in the response
- Understand cause and effect
A field of sorghum is split into \(4\) sections. \(3\) of those sections are given different fertilizers with varying levels of nitrogen. The fourth section is not given any fertilizer. The sections are harvested at the same time and their yields are recorded.
Experimental Unit: Subject, animal, or object used in the experiment
- What’s our experimental unit in the experiment above?
Treatment: Experimental condition applied to experimental unit
- What was the treatment in the sorghum experiment?
Response: The thing we measure to determine the effect of the treatments
- What was the response above?
Study Considerations
Let’s go back to our cholesterol study
Are there any problems with the data we’ve collected and the inference we pull from it?
Age | 67 | 52 | 57 | 56 | 60 | 42 | 58 | 53 | 52 | 39 | 54 | 50 | 62 | 38 | 54 | 48 | 64 | 69 | 53 | 62 |
Cholesterol | 229 | 196 | 241 | 283 | 253 | 265 | 218 | 282 | 205 | 219 | 304 | 196 | 263 | 175 | 214 | 245 | 325 | 239 | 264 | 209 |
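One way to quantify the association in this table is the Pearson correlation coefficient. The sketch below computes it directly from its definition using the 20 (age, cholesterol) pairs above; the function name `pearson_r` is just an illustrative choice. Whatever value comes out, it describes a small observational sample, so on its own it cannot show that age causes a change in cholesterol.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Values copied from the table above, in the same order.
age = [67, 52, 57, 56, 60, 42, 58, 53, 52, 39,
       54, 50, 62, 38, 54, 48, 64, 69, 53, 62]
cholesterol = [229, 196, 241, 283, 253, 265, 218, 282, 205, 219,
               304, 196, 263, 175, 214, 245, 325, 239, 264, 209]

r = pearson_r(age, cholesterol)
print(f"n = {len(age)}, r = {r:.3f}")
```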
Confounding
Two variables in a study/experiment
Effects are indistinguishable
Can’t tell which variable caused an effect
Observational studies don’t show causality
Experiments can show causality
- Typically have to have that intent designed into them
Drinking wine is linked to better health
But what you drink is confounded with other variables
- Diet, wealth, lifestyle, etc
Collinearity
Two variables are linearly dependent
“A form of extreme confounding”
- The variables contain the same information to an extent
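A small sketch of what "the same information" can look like: if one variable is an exact linear function of another, their correlation is 1 (up to floating-point rounding). The heights below are hypothetical values, and recording the same measurement in inches and in centimeters is only an illustrative extreme case.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical student heights recorded in two units.
height_in = [42.0, 45.5, 48.0, 50.5, 53.0, 55.5]
height_cm = [2.54 * h for h in height_in]   # exact linear function of height_in

r_collinear = pearson_r(height_in, height_cm)
print(r_collinear)
```

Collecting both columns adds nothing that the first column did not already provide.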
On observing students in an elementary school, what is the single best variable to collect in order to predict their height?
Bias
The design of a statistical study is biased when one outcome is systematically preferred to others
Impossible to correct for
It is a systematic error caused by bad sampling design
The outcomes of the shopping mall surveys will repeatedly miss the truth about the population in the same ways
Overrepresents: Middle-class and Elderly
Underrepresents: Poor
Problems in Sampling
Making up data is always bad
Samples of convenience are easy, cheap, and easy to intentionally bias
Voluntary response surveys can work well if designed well, but they’re very easy to design poorly
Undercoverage
- Did we reach all of the intended groups in our population?
Nonresponse
- Did we complete our survey with all respondents?
Response Bias
- Were the respondents likely to avoid answering a question truthfully?
Question Wording
- Do we invoke bias through the way we construct our questions?
Order of Questions
- Do previous questions have an effect on the response to later questions?
Case-control Studies
1 in every 1000 male newborns is affected by Jacob’s Syndrome, a rare genetic condition in which a male receives an extra Y chromosome.
If you were to sample 10000 men in a perfectly designed SRS, how many useful data points would you expect to have for a study on Jacob’s Syndrome?
Case-control studies are a form of observational study that adjusts for this unique problem
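The scale of the problem can be checked with a little arithmetic. The sketch below assumes that the sample consists only of men, that each one independently has the condition with probability 1/1000, and that the rate among adult men equals the newborn rate; none of these assumptions come from the source beyond the stated rate and sample size.

```python
from fractions import Fraction
from math import sqrt

prevalence = Fraction(1, 1000)   # 1 in every 1000, as stated above
sample_size = 10_000             # size of the hypothetical SRS

# Expected number of sampled men with the condition: n * p
expected_cases = sample_size * prevalence

# Binomial standard deviation, sqrt(n * p * (1 - p)), as a sense of spread
spread = sqrt(sample_size * prevalence * (1 - prevalence))

print(f"expected cases: {expected_cases}, out of {sample_size} sampled")
print(f"standard deviation: {spread:.2f}")
```

Only a tiny fraction of the sample is expected to carry the condition, which is why sampling cases directly is attractive for rare conditions.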
Select the case-subjects (those with the trait/condition) of interest
- Take a random sample of those individuals
Select a control group without the condition (ideally with similarities to the case subjects)
- Take a random sample of those individuals
Researchers look for exposure factors in the subjects’ past that differ
Retrospective
Looking back into the past
Identifying outcome of interest prior to the risk factor of interest
May be confounded
- Recall bias: memories are not always accurate
Cohort Studies
Millions of adults in the US have Type 2 Diabetes. As such, research into treatment is a wildly lucrative industry. But much of the treatment associated with the illness is very long term, requiring regular “check-ins”.
Subjects sharing a common demographic characteristic are enrolled and observed at regular intervals over an extended period of time
Starts with a group of similar individuals
Observations made over regular intervals
Generally prospective
Expensive
Better suited for common outcomes
Extremely Informative
Examples
Gallup considers a proper poll to be a representative sample of 1000 people. A local news station resolves to follow this rule of thumb and call 250 individuals from each of 4 different age groups as part of the design for a political poll.
\[ \begin{array}{|c|c|c|c|c|} \hline \text{Age Group} & 18-26 & 27-40 & 41-52 & 52+ \\ \hline \end{array} \]
They resolve to count only calls that are answered, rather than marking unanswered calls as nonresponses. They collect their results, develop their graphics, and run their “Political Statistics” segment as planned.
What are the components of this study?
Are there any problems with the study?
Below is a set of questions in a poll of visitors to a National Park:
How much do you appreciate the incredible natural beauty and unique wildlife found in this national park?
How satisfied are you with the park’s facilities, such as trails, restrooms, and visitor centers?
Would you recommend visiting this national park to your friends and family?
Do you think this national park does an excellent job preserving its natural beauty and wildlife?
How would you rate your overall experience at this national park today?
Each question used a scale of 1-5, where 1 was the most negative response and 5 was the most positive. The survey was randomly given to 500 park visitors every month for 10 months.
What are the components of this study?
Are there any problems with the study?
A study investigates whether a new type of fertilizer is linked to an increased risk of skin irritation among farmers. The researchers recruit 200 farmers with skin irritation (cases) and 200 farmers without skin irritation (controls). They ask both groups about their fertilizer use over the past five years.
What are the components of this study?
Are there any problems with the study?
A school district examines whether longer lunch breaks improve student test scores. They collect test score data from schools that already have longer lunch breaks and compare them to schools with shorter breaks, adjusting for factors like socioeconomic status, class size, and teacher experience.
What are the components of this study?
Are there any problems with the study?
A 10-year study follows 1,000 people who regularly eat spicy food and 1,000 who do not, examining the incidence of stomach ulcers in each group.
What are the components of this study?
Are there any problems with the study?