Lec 17

Review

All random variables have a probability distribution

As all statistics are random variables:

All statistics have a probability distribution

The probability distribution of a sample statistic is the sampling distribution

Let $\bar{x}$ be the mean of a random sample of size $n$, drawn from a population with mean $\mu$ and standard deviation $\sigma$

Since $\bar{x}$ is a random variable, it has a mean and a standard deviation

The mean of $\bar{x}$ is $\mu$

\[\mu_{\bar{x}} = \mu = \text{population mean}\]

The standard deviation of $\bar{x}$ is $\sigma / \sqrt{n}$

\[\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\]

Questions?

Goals for today:

Reinforce learning for sampling distributions and CLT
Introduce uncertainty

Sampling Distributions and Uncertainty I

Sampling Distributions

Given a normal population $Y \sim N(\mu,\sigma^2)$

Sample $X \sim N(\mu_X,\sigma^2_X)$

Sample mean $\bar{x} \sim N(\mu_{\bar{x}},\sigma^2_{\bar{x}})$

Where: $\mu_{\bar{x}} = \mu$
And: ${\sqrt{\sigma^2_{\bar{x}}}}=\sigma_{\bar{x}} = {\sigma \over \sqrt{n}}$

The intuition behind this may not be self-evident, but it’s easy to visualize:

Suppose we take a simple random sample of size 25 from a normal population with a mean of 20 and a standard deviation of 4

a. What is the distribution of $\bar{x}$?

\[\bar{x} \sim N(\mu_{\bar{x}},\sigma^2_{\bar{x}})\]

\[\mu_{\bar{x}} = 20 \quad \text{and} \quad \sigma_{\bar{x}} = \frac{4}{\sqrt{25}} = 0.8\]

\[\bar{x} \sim N(20, 0.8^2)\]
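The result in part (a) can be checked by simulation. The following is a minimal sketch using only Python's standard library: it draws many samples of size 25 from $N(20, 4^2)$ and looks at the mean and standard deviation of the resulting sample means. The number of repetitions (`REPS`) and the random seed are arbitrary choices made for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable (arbitrary choice)

MU, SIGMA, N = 20, 4, 25   # population mean, population sd, sample size
REPS = 10_000              # number of simulated samples (arbitrary choice)

# Draw REPS samples of size N and record the mean of each one.
sample_means = [
    statistics.fmean(random.gauss(MU, SIGMA) for _ in range(N))
    for _ in range(REPS)
]

sim_mean = statistics.fmean(sample_means)   # should be close to mu = 20
sim_sd = statistics.stdev(sample_means)     # should be close to sigma / sqrt(n) = 0.8

print(f"mean of sample means: {sim_mean:.3f}")
print(f"sd of sample means:   {sim_sd:.3f}")
```

The simulated values should land near 20 and 0.8, though they will not match exactly because the simulation is itself random.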

b. Find the probability that we will observe a sample mean over 22

\[\text{"Over"} \quad \Rightarrow \quad P(\bar{x} > 22)\]

\[= P \left(Z > \frac{22 - 20}{0.8}\right) = P(Z > 2.50)\]

\[= 1 - P(Z < 2.50) = 1 - 0.9938 = 0.0062 \quad \text{(z-table)}\]
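The z-table lookup in part (b) can also be done in software. This sketch assumes Python's standard-library `statistics.NormalDist` stands in for the table; the variable names are illustrative.

```python
from statistics import NormalDist

# Sampling distribution of the sample mean from part (a): N(20, 0.8^2).
x_bar = NormalDist(mu=20, sigma=0.8)

# P(x_bar > 22) = 1 - P(x_bar <= 22)
p_over_22 = 1 - x_bar.cdf(22)
print(round(p_over_22, 4))  # 0.0062
```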

c. Find the 95th percentile of $\bar{x}$

Look up 0.95 in the body of z-table:

\[z = 1.64 \quad \text{or} \quad z = 1.65\]

\[\text{take the midpoint} \quad z \approx 1.645\]

Convert $z$ to $\bar{x}$ as follows:

\[\bar{x} = \mu_{\bar{x}} + z \sigma_{\bar{x}} = 20 + 1.645(0.8) = 21.316\]
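As a check on part (c), the percentile can be computed without the table. This is a sketch using the inverse CDF from Python's standard-library `statistics.NormalDist`; it avoids the "take the midpoint" step, so the last digits may differ slightly from the hand calculation.

```python
from statistics import NormalDist

# z-value with 95% of the standard normal distribution below it.
z_95 = NormalDist().inv_cdf(0.95)

# Same percentile on the scale of the sample mean, N(20, 0.8^2).
x_bar_95 = NormalDist(mu=20, sigma=0.8).inv_cdf(0.95)

print(f"z = {z_95:.3f}, 95th percentile of the sample mean = {x_bar_95:.3f}")
```

Both routes should agree to about three decimal places with $z \approx 1.645$ and $\bar{x} \approx 21.316$.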

What if the population we are sampling from isn’t normal?

It would be convenient if we could still treat $\bar{x}$ as a normal random variable

Given the Central Limit Theorem, we can do that under certain assumptions

Central Limit Theorem

Let $\bar{x}$ be the mean of a large random sample ($n > 30$) from any population

With mean $\mu$ and standard deviation $\sigma$

The distribution of $\bar{x}$ is approximately normal

Mean $\mu_{\bar{x}} = \mu$

Standard deviation $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$

If $n$ is large enough, we have:

\[\bar{x} \sim N(\mu, \frac{\sigma^2}{n})\]

Regardless of the original population’s distribution

How large does $n$ need to be?

This is an ongoing debate in statistics

As the skew of the population distribution increases, so does the sample size $n$ we need

As a general rule of thumb, $n > 30$ should be sufficient
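One way to see the effect of $n$ is to simulate it. The sketch below assumes a strongly right-skewed population, Exponential(1), chosen only for illustration, and measures how skewed the sample means are for a few sample sizes. The helper names (`skewness`, `simulate_sample_means`), the sample sizes, and the seed are all assumptions of this example.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable (arbitrary choice)

def skewness(values):
    """Sample skewness: the average cubed z-score (0 for a symmetric shape)."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean(((v - m) / s) ** 3 for v in values)

def simulate_sample_means(n, reps=5_000):
    """Means of `reps` samples of size n from an Exponential(1) population."""
    return [
        statistics.fmean(random.expovariate(1.0) for _ in range(n))
        for _ in range(reps)
    ]

# Skewness of the sampling distribution of the mean for each sample size.
skew_by_n = {n: skewness(simulate_sample_means(n)) for n in (2, 30, 200)}

for n, skew in skew_by_n.items():
    print(f"n = {n:>3}: skewness of sample means = {skew:.2f}")
```

The skewness should shrink toward 0 as $n$ grows, which is the sense in which the distribution of $\bar{x}$ becomes approximately normal.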

Recent data from the U.S. Census indicates that the mean age of college students is $\mu = 25$ years, with a standard deviation of $\sigma = 9.5$ years. A simple random sample of 125 students is drawn. If $\bar{x} =$ the sample mean age of the students, what is the distribution of $\bar{x}$? (Justify your answer)

Since $n > 30$:

$\bar{x} \approx N(\mu_{\bar{x}},\sigma^2_{\bar{x}})$

\[\mu_{\bar{x}} = 25 \quad \text{and} \quad \sigma_{\bar{x}} = \frac{9.5}{\sqrt{125}} \approx 0.85\]

So:

\[\bar{x} \approx N(25, 0.85^2)\]
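The same steps (check the sample size, then compute $\mu_{\bar{x}}$ and $\sigma_{\bar{x}}$) can be wrapped in a small helper. This is a sketch only: `sampling_dist_of_mean` is a hypothetical name, and the hard cutoff at $n > 30$ simply encodes the rule of thumb above.

```python
import math
from statistics import NormalDist

def sampling_dist_of_mean(mu, sigma, n):
    """Approximate N(mu, sigma^2 / n) for the sample mean, using the n > 30 rule of thumb."""
    if n <= 30:
        raise ValueError("n > 30 rule of thumb not met; the CLT approximation may be poor")
    return NormalDist(mu, sigma / math.sqrt(n))

# College-age example: mu = 25, sigma = 9.5, n = 125.
ages = sampling_dist_of_mean(mu=25, sigma=9.5, n=125)
print(ages.mean, round(ages.stdev, 4))  # 25.0 0.8497
```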

The Internal Revenue Service reports that the mean federal income tax paid in a recent year was $\$8000$. Assume that the standard deviation is $\$5000$. The IRS plans to draw a sample of $625$ tax returns to study the effect of a new tax law

Let $\bar{x} =$ the mean tax for the $625$ sampled tax returns

Then $\bar{x} \approx N(8000, 200^2)$ by the CLT

a. What is the probability that the sample mean tax is between $7600$ and $7900$?

\[P(7600 < \bar{x} < 7900) \approx P\left(\frac{7600 - 8000}{200} < Z < \frac{7900 - 8000}{200}\right)\]

\[= P(-2 < Z < -0.5) = 0.3085 - 0.0228 = 0.2857 \quad \text{(z-table)}\]

b. Would it be unusual if the sample mean were less than $7500$?

\[P(\bar{x} < 7500) \approx P\left(Z < \frac{7500 - 8000}{200}\right) = P(Z < -2.50) = 0.0062 \quad \text{(z-table)}\]

Yes, because $P(\bar{x} < 7500) \approx 0.62\%$, which is less than $1\%$
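Both parts of the tax example can be reproduced in a few lines. This sketch again uses Python's standard-library `NormalDist` in place of the z-table; the variable names are illustrative.

```python
import math
from statistics import NormalDist

# CLT approximation for the mean of 625 returns: N(8000, 200^2).
tax = NormalDist(mu=8000, sigma=5000 / math.sqrt(625))

# (a) P(7600 < x_bar < 7900) as a difference of two CDF values.
p_between = tax.cdf(7900) - tax.cdf(7600)

# (b) P(x_bar < 7500)
p_below = tax.cdf(7500)

print(f"(a) {p_between:.4f}")
print(f"(b) {p_below:.4f}")
```

The result for (a) may differ from the table answer in the fourth decimal place, because the table entries 0.3085 and 0.0228 are themselves rounded.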

Population Proportion

Proportions are a useful way to summarize information about a population and a sample without losing much nuance:

Proportions are just percentages of the population

We’ve dealt with this a lot

Say the percentage of the population who participate in early voting is $40\%$

${40\over 100}=0.40$

The proportion of the population who early vote, $p=0.40$

If we poll a sample of 100 Manhattan residents and find that $31\%$ early vote:

The proportion of our sample who early vote, $\hat{p}=0.31$

Just like every other statistic, sample proportions are random variables

So their distribution is the sampling distribution of the proportion

All of our previous rules and ideas apply

As we take repeated samples from our population, we will see that $\hat{p}$ varies from sample to sample

The larger the sample, the closer $\hat{p}$ tends to be to the true value $p$

Mean of the sample proportion $\hat{p}$ is:

\[\mu_{\hat{p}} = p \quad \text{(population proportion)}\]

Standard deviation of sample proportion $\hat{p}$ is:

\[\sigma_{\hat{p}} = \sqrt{\frac{p(1 - p)}{n}}\]
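These two formulas can be checked by simulation. The sketch below reuses the early-voting setting ($p = 0.40$, $n = 100$) and simulates many polls; the number of repetitions and the seed are arbitrary choices for illustration.

```python
import math
import random
import statistics

random.seed(2)  # fixed seed so the run is repeatable (arbitrary choice)

P, N, REPS = 0.40, 100, 10_000  # population proportion, sample size, simulated polls

# Each poll: count how many of N simulated respondents are early voters, then divide by N.
p_hats = [
    sum(random.random() < P for _ in range(N)) / N
    for _ in range(REPS)
]

sim_mean = statistics.fmean(p_hats)        # should be close to p
sim_sd = statistics.stdev(p_hats)          # should be close to sqrt(p(1 - p) / n)
theory_sd = math.sqrt(P * (1 - P) / N)

print(f"mean of p-hat: {sim_mean:.4f} (theory {P})")
print(f"sd of p-hat:   {sim_sd:.4f} (theory {theory_sd:.4f})")
```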

The Central Limit Theorem will tell us the “shape” of the distribution of $\hat{p}$

Proportion Central Limit Theorem

If $np \geq 10$ and $n(1 - p) \geq 10$

Distribution of $\hat{p}$ is approximately normal

Mean $\mu_{\hat{p}} = p$

Standard deviation $\sigma_{\hat{p}} = \sqrt{\frac{p(1 - p)}{n}}$

Thus:

\[\hat{p} \sim N \left( p, \frac{p(1 - p)}{n} \right)\]

According to a Harris Poll, chocolate is the favorite ice cream flavor for 27% of Americans. If a sample of 100 Americans is taken, what is the probability that the sample proportion of those who prefer chocolate is greater than 0.30?

Since $np = (100)(0.27) = 27 \geq 10$ and $n(1 - p) = (100)(0.73) = 73 \geq 10$, we can apply the CLT. By the CLT, the distribution of $\hat{p}$ is approximately normal with:

\[\mu_{\hat{p}} = 0.27 \quad \text{and} \quad \sigma_{\hat{p}} = \sqrt{\frac{0.27(1 - 0.27)}{100}} \approx 0.0444\]

\[P(\hat{p} > 0.30) \approx P\left( Z > \frac{0.30 - 0.27}{0.0444} \right)\]

\[= P(Z > 0.68)\]

\[= 1 - P(Z < 0.68)\]

\[= 1 - 0.7517 = 0.2483\]
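The chocolate example can be verified the same way. This sketch checks the $np \geq 10$ and $n(1 - p) \geq 10$ conditions and then uses the normal approximation; variable names are illustrative.

```python
import math
from statistics import NormalDist

p, n = 0.27, 100

# Conditions for the normal approximation to the sample proportion.
if not (n * p >= 10 and n * (1 - p) >= 10):
    raise ValueError("np >= 10 and n(1 - p) >= 10 are not both satisfied")

# Approximate sampling distribution of p-hat: N(p, p(1 - p) / n).
p_hat = NormalDist(mu=p, sigma=math.sqrt(p * (1 - p) / n))

p_over_30 = 1 - p_hat.cdf(0.30)
print(f"P(p-hat > 0.30) is about {p_over_30:.3f}")
```

The software answer comes out slightly higher than 0.2483 because it uses the unrounded $z \approx 0.676$, whereas the table calculation rounds to $z = 0.68$.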

Uncertainty

We’ve studied point estimates (single-number estimates of population parameters, e.g., the sample mean and the sample proportion)

Point estimates are a deterministic result

Statistics deals with probabilistic results

It would be more informative to provide a range of values

We generally call these confidence intervals, and we’ll discuss them in more depth next week