Lec 17
Review
All random variables have a probability distribution
As all statistics are random variables:
All statistics have a probability distribution
The probability distribution of a sample statistic is the sampling distribution
Let \(\bar{x}\) be the mean of a random sample of size \(n\), drawn from a population with mean \(\mu\) and standard deviation \(\sigma\)
Since \(\bar{x}\) is a random variable, it has a mean and a standard deviation
The mean of \(\bar{x}\) is \(\mu\)
\[\mu_{\bar{x}} = \mu = \text{population mean}\]
The standard deviation of \(\bar{x}\) is \(\sigma / \sqrt{n}\)
\[\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\]
Questions?
Goals for today:
Reinforce learning for sampling distributions and CLT
Introduce uncertainty
Sampling Distributions and Uncertainty I
Sampling Distributions
Given any normal population \(Y \sim N(\mu,\sigma^2)\)
Sample \(X \sim N(\mu_X,\sigma^2_X)\)
Sample mean \(\bar{x} \sim N(\mu_{\bar{x}},\sigma^2_{\bar{x}})\)
Where: \(\mu_{\bar{x}} = \mu\)
And: \(\sigma_{\bar{x}} = \sqrt{\sigma^2_{\bar{x}}} = \frac{\sigma}{\sqrt{n}}\)
The intuition behind this may not be self-evident, but it’s easy to visualize:
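One way to see this without a plot is a small simulation. The sketch below is an illustration, not part of the lecture: it assumes a normal population with mean 20 and standard deviation 4, samples of size 25, and an arbitrary choice of 10,000 repeated samples with a fixed seed. It draws the samples, records each sample mean, and compares the mean and standard deviation of those sample means with \(\mu\) and \(\sigma / \sqrt{n}\).

```python
import random
import statistics

# Assumed population parameters and sample size (illustrative values)
MU, SIGMA, N = 20, 4, 25
NUM_SAMPLES = 10_000  # number of repeated samples (arbitrary choice)

rng = random.Random(17)  # fixed seed so the run is reproducible

# Draw many samples of size N and record the mean of each one
sample_means = [
    statistics.fmean([rng.gauss(MU, SIGMA) for _ in range(N)])
    for _ in range(NUM_SAMPLES)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)

print(f"mean of sample means: {mean_of_means:.3f} (theory: {MU})")
print(f"sd of sample means:   {sd_of_means:.3f} (theory: {SIGMA / N ** 0.5})")
```

The simulated values should land close to 20 and 0.8, and they get closer as `NUM_SAMPLES` grows.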
Suppose we take a simple random sample of size 25 from a normal population with a mean of 20 and a standard deviation of 4
a. What is the distribution of \(\bar{x}\)?
\[\bar{x} \sim N(\mu_{\bar{x}},\sigma^2_{\bar{x}})\]
\[\mu_{\bar{x}} = 20 \quad \text{and} \quad \sigma_{\bar{x}} = \frac{4}{\sqrt{25}} = 0.8\]
\[\bar{x} \sim N(20, 0.8^2)\]
b. Find the probability that we will observe a sample mean over 22
\[\text{"Over"} \quad \Rightarrow \quad P(\bar{x} > 22)\]
\[= P \left(Z > \frac{22 - 20}{0.8}\right) = P(Z > 2.50)\]
\[= 1 - P(Z < 2.50) = 1 - 0.9938 = 0.0062 \quad \text{(z-table)}\]
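The z-table lookup can be checked in software. This is a minimal sketch using `NormalDist` from Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import NormalDist

# Sampling distribution of the sample mean: N(20, 0.8^2)
mu, sigma, n = 20, 4, 25
xbar_dist = NormalDist(mu=mu, sigma=sigma / n ** 0.5)

# P(x-bar > 22) = 1 - P(x-bar <= 22)
p_over_22 = 1 - xbar_dist.cdf(22)
print(round(p_over_22, 4))  # prints 0.0062
```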
c. Find the 95th percentile of \(\bar{x}\)
Look up 0.95 in the body of the z-table:
\[z = 1.64 \quad \text{or} \quad z = 1.65\]
\[\text{take the midpoint} \quad z \approx 1.645\]
Convert \(z\) to \(\bar{x}\) as follows:
\[\bar{x} = \mu_{\bar{x}} + z \sigma_{\bar{x}} = 20 + 1.645(0.8) = 21.316\]
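The percentile can also be found with an inverse-CDF call instead of a reverse table lookup. This sketch assumes the same \(N(20, 0.8^2)\) sampling distribution and uses `NormalDist.inv_cdf` from the standard library.

```python
from statistics import NormalDist

xbar_dist = NormalDist(mu=20, sigma=0.8)

# z-value with 95% of the standard normal distribution below it
z_95 = NormalDist().inv_cdf(0.95)

# 95th percentile of x-bar, equivalent to mu + z * sigma_xbar
p95 = xbar_dist.inv_cdf(0.95)

print(round(z_95, 4), round(p95, 3))  # prints 1.6449 21.316
```

The software z-value (about 1.6449) is slightly more precise than the table midpoint 1.645, but the percentile agrees to three decimals.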
What if the population we are sampling from isn’t normal?
- It would be convenient if we could still treat \(\bar{x}\) as a normal random variable
Given the Central Limit Theorem, we can do that under certain assumptions
Central Limit Theorem
Let \(\bar{x}\) be the mean of a large random sample (\(n > 30\)) from any population
- With mean \(\mu\) and standard deviation \(\sigma\)
The distribution of \(\bar{x}\) is approximately normal
- Mean \(\mu_{\bar{x}} = \mu\)
- Standard deviation \(\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\)
If \(n\) is large enough, we have:
\[\bar{x} \approx N\left(\mu, \frac{\sigma^2}{n}\right)\]
- Regardless of the original population’s distribution
How large does \(n\) need to be?
This is an ongoing debate in statistics
- The more skewed the population distribution, the larger \(n\) needs to be
- As a general rule of thumb, \(n > 30\) should be sufficient
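The effect of sample size on a skewed population can be explored by simulation. The sketch below is illustrative only: it assumes an exponential population with rate 1 as a stand-in for "a skewed population", and the sample sizes, the 5,000 repetitions, and the `skewness` helper are arbitrary choices made for this example. It reports the skewness of the simulated sample means, which should shrink toward 0 (the value for a symmetric distribution such as the normal) as \(n\) grows.

```python
import random
import statistics


def skewness(values):
    """Mean cubed z-score; 0 for a perfectly symmetric distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])


rng = random.Random(17)  # fixed seed so the run is reproducible
NUM_SAMPLES = 5_000      # repeated samples per sample size (arbitrary)

skew_by_n = {}
for n in (2, 10, 30, 100):
    # Sample means from a right-skewed exponential population (assumed)
    means = [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(NUM_SAMPLES)
    ]
    skew_by_n[n] = skewness(means)
    print(f"n = {n:>3}: skewness of sample means = {skew_by_n[n]:.2f}")
```

Swapping in a more heavily skewed population is a quick way to see why the \(n > 30\) rule is only a rule of thumb.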
Recent data from the U.S. Census indicates that the mean age of college students is \(\mu = 25\) years, with a standard deviation of \(\sigma = 9.5\) years. A simple random sample of 125 students is drawn. If \(\bar{x} =\) the sample mean age of the students, what is the distribution of \(\bar{x}\)? (Justify your answer)
Since \(n > 30\):
- \(\bar{x} \approx N(\mu_{\bar{x}},\sigma^2_{\bar{x}})\)
\[\mu_{\bar{x}} = 25 \quad \text{and} \quad \sigma_{\bar{x}} = \frac{9.5}{\sqrt{125}} \approx 0.85.\]
So:
\[\bar{x} \approx N(25, 0.85^2)\]
The Internal Revenue Service reports that the mean federal income tax paid in a recent year was \(\$8000\). Assume that the standard deviation is \(\$5000\). The IRS plans to draw a sample of \(625\) tax returns to study the effect of a new tax law
Let \(\bar{x} =\) the mean tax for the \(625\) sampled tax returns
- Then \(\bar{x} \approx N(8000, 200^2)\) by the CLT
a. What is the probability that the sample mean tax is between \(7600\) and \(7900\)?
\[P(7600 < \bar{x} < 7900) \approx P\left(\frac{7600 - 8000}{200} < Z < \frac{7900 - 8000}{200}\right)\]
\[= P(-2 < Z < -0.5) = P(Z < -0.5) - P(Z < -2) = 0.3085 - 0.0228 = 0.2857 \quad \text{(z-table)}\]
b. Would it be unusual if the sample mean were less than \(7500\)?
\[P(\bar{x} < 7500) \approx P\left(Z < \frac{7500 - 8000}{200}\right) = P(Z < -2.5) = 0.0062 \quad \text{(z-table)}\]
Yes, because \(P(\bar{x} < 7500) < 1\%\)
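Both parts of the tax example can be reproduced with the standard library. This is a sketch under the example's stated assumptions (mean 8000, standard deviation 5000, \(n = 625\)); the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 8000, 5000, 625

# By the CLT, x-bar is approximately N(8000, 200^2)
xbar_dist = NormalDist(mu=mu, sigma=sigma / sqrt(n))

# a. P(7600 < x-bar < 7900) as a difference of two CDF values
p_between = xbar_dist.cdf(7900) - xbar_dist.cdf(7600)

# b. P(x-bar < 7500)
p_below_7500 = xbar_dist.cdf(7500)

print(f"P(7600 < x-bar < 7900) = {p_between:.4f}")
print(f"P(x-bar < 7500)        = {p_below_7500:.4f}")
```

The first value can differ from the z-table answer in the fourth decimal place, because the table entries are themselves rounded.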
Population Proportion
Proportions are a useful way to summarize information about a population or a sample without losing much nuance:
- A proportion is just a percentage written as a number between 0 and 1
- We’ve dealt with this a lot
Say the percentage of the population who participate in early voting is \(40\%\)
- \(\frac{40}{100} = 0.40\)
- The proportion of the population who early vote, \(p=0.40\)
If we poll a sample of 100 Manhattan residents and find that \(31\%\) early vote:
- The proportion of our sample who early vote, \(\hat{p}=0.31\)
Just like every other statistic, sample proportions are random variables
- So their distribution is the sampling distribution of the proportion
All of our previous rules and ideas apply
- Different samples from the same population give different values of \(\hat{p}\)
- The larger the sample, the closer \(\hat{p}\) tends to be to the true proportion \(p\)
The mean of the sample proportion \(\hat{p}\) is:
\[\mu_{\hat{p}} = p \quad \text{(population proportion)}\]
The standard deviation of the sample proportion \(\hat{p}\) is:
\[\sigma_{\hat{p}} = \sqrt{\frac{p(1 - p)}{n}}\]
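These two formulas can be checked by simulation. The sketch below is illustrative: it assumes \(p = 0.40\) and \(n = 100\) (the early-voting numbers above), and the 10,000 repeated polls and fixed seed are arbitrary choices. Each simulated poll yields one \(\hat{p}\), and the mean and standard deviation of those values are compared with \(p\) and \(\sqrt{p(1-p)/n}\).

```python
import random
import statistics
from math import sqrt

P, N = 0.40, 100       # assumed population proportion and sample size
NUM_SAMPLES = 10_000   # number of simulated polls (arbitrary choice)

rng = random.Random(17)  # fixed seed so the run is reproducible

# Each poll: count how many of N respondents are "yes", then divide by N
p_hats = [
    sum(rng.random() < P for _ in range(N)) / N
    for _ in range(NUM_SAMPLES)
]

mean_p_hat = statistics.fmean(p_hats)
sd_p_hat = statistics.stdev(p_hats)
theory_sd = sqrt(P * (1 - P) / N)

print(f"mean of p-hat: {mean_p_hat:.4f} (theory: {P})")
print(f"sd of p-hat:   {sd_p_hat:.4f} (theory: {theory_sd:.4f})")
```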
The Central Limit Theorem will tell us the “shape” of the distribution of \(\hat{p}\)
Proportion Central Limit Theorem
If \(np \geq 10\) and \(n(1 - p) \geq 10\):
The distribution of \(\hat{p}\) is approximately normal
- Mean \(\mu_{\hat{p}} = p\)
- Standard deviation \(\sigma_{\hat{p}} = \sqrt{\frac{p(1 - p)}{n}}\)
Thus:
\[\hat{p} \approx N \left( p, \frac{p(1 - p)}{n} \right)\]
According to a Harris Poll, chocolate is the favorite ice cream flavor for 27% of Americans. If a sample of 100 Americans is taken, what is the probability that the sample proportion of those who prefer chocolate is greater than 0.30?
Since \(np = (100)(0.27) = 27 \geq 10\) and \(n(1 - p) = (100)(0.73) = 73 \geq 10\), we can apply the CLT. By the CLT, the distribution of \(\hat{p}\) is approximately normal with:
\[\mu_{\hat{p}} = 0.27 \quad \text{and} \quad \sigma_{\hat{p}} = \sqrt{\frac{0.27(1 - 0.27)}{100}} \approx 0.0444\]
\[P(\hat{p} > 0.30) \approx P\left( Z > \frac{0.30 - 0.27}{0.0444} \right)\]
\[= P(Z > 0.68)\]
\[= 1 - P(Z < 0.68)\]
\[= 1 - 0.7517 = 0.2483\]
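The same calculation can be sketched in code, including the check of the two conditions. The helper name `normal_approx_ok` is hypothetical and introduced only for this illustration; `NormalDist` comes from the standard `statistics` module.

```python
from math import sqrt
from statistics import NormalDist


def normal_approx_ok(n, p):
    """Rule-of-thumb check: np >= 10 and n(1 - p) >= 10."""
    return n * p >= 10 and n * (1 - p) >= 10


p, n = 0.27, 100
conditions_met = normal_approx_ok(n, p)

# Approximate sampling distribution of p-hat under the CLT
sigma_p_hat = sqrt(p * (1 - p) / n)
p_hat_dist = NormalDist(mu=p, sigma=sigma_p_hat)

# P(p-hat > 0.30) = 1 - P(p-hat <= 0.30)
p_greater = 1 - p_hat_dist.cdf(0.30)

print(f"conditions met: {conditions_met}")
print(f"sigma_p_hat = {sigma_p_hat:.4f}")
print(f"P(p-hat > 0.30) = {p_greater:.4f}")
```

Because the code does not round \(z\) to two decimals, the result is close to, but not identical to, the z-table answer of 0.2483.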
Uncertainty
We’ve studied point estimates: single numbers, such as the sample mean or sample proportion, used to estimate population parameters
A point estimate reports a single value, as if the result were deterministic
- Statistics deals with probabilistic results
It would be more informative to provide a range of values
- We generally call these confidence intervals, and we’ll discuss them in more depth next week