Lec 17

Review


All random variables have a probability distribution


As all statistics are random variables:


All statistics have a probability distribution



The probability distribution of a sample statistic is called its sampling distribution


Let \(\bar{x}\) be the mean of a random sample of size \(n\), drawn from a population with mean \(\mu\) and standard deviation \(\sigma\)


Since \(\bar{x}\) is a random variable, it has a mean and a standard deviation


The mean of \(\bar{x}\) is \(\mu\)


\[\mu_{\bar{x}} = \mu = \text{population mean}\]


The standard deviation of \(\bar{x}\) is \(\sigma / \sqrt{n}\)


\[\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\]



Questions?




Goals for today:

  1. Reinforce learning for sampling distributions and CLT

  2. Introduce uncertainty




Sampling Distributions and Uncertainty I


Sampling Distributions

Given any normally distributed population \(Y \sim N(\mu,\sigma^2)\)


Sample \(X \sim N(\mu_X,\sigma^2_X)\)


Sample mean \(\bar{x} \sim N(\mu_{\bar{x}},\sigma^2_{\bar{x}})\)


  • Where: \(\mu_{\bar{x}} = \mu\)

  • And: \(\sqrt{\sigma^2_{\bar{x}}} = \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\)


The intuition behind this may not be self-evident, but it is easy to visualize.
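
One way to see this is a small simulation. The sketch below is illustrative only: the function name `simulate_sample_means`, the seed, the number of repetitions, and the choice of \(\mu = 20\), \(\sigma = 4\), \(n = 25\) are assumptions made for the demonstration. It draws many samples from a normal population and checks that the sample means center on \(\mu\) with spread close to \(\sigma / \sqrt{n}\).

```python
import random
import statistics


def simulate_sample_means(mu, sigma, n, reps, seed=0):
    """Draw `reps` samples of size `n` from N(mu, sigma^2); return their means."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(reps)
    ]


means = simulate_sample_means(mu=20, sigma=4, n=25, reps=10_000)
center = statistics.fmean(means)  # should be close to mu = 20
spread = statistics.stdev(means)  # should be close to sigma / sqrt(n) = 0.8

print(f"mean of sample means: {center:.3f}")
print(f"sd of sample means:   {spread:.3f}")
```

A histogram of `means` would look like a narrow bell curve centered at 20, much narrower than the population itself.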



Suppose we take a simple random sample of size 25 from a normal population with a mean of 20 and a standard deviation of 4


a. What is the distribution of \(\bar{x}\)?


\[\bar{x} \sim N(\mu_{\bar{x}},\sigma^2_{\bar{x}})\]


\[\mu_{\bar{x}} = 20 \quad \text{and} \quad \sigma_{\bar{x}} = \frac{4}{\sqrt{25}} = 0.8\]


\[\bar{x} \sim N(20, 0.8^2)\]



b. Find the probability that we will observe a sample mean over 22


\[\text{"Over"} \quad \Rightarrow \quad P(\bar{x} > 22)\]


\[= P \left(Z > \frac{22 - 20}{0.8}\right) = P(Z > 2.50)\]


\[= 1 - P(Z < 2.50) = 1 - 0.9938 = 0.0062 \quad \text{(z-table)}\]
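
The z-table lookup can be checked in software. This is a minimal sketch using `statistics.NormalDist` from the Python standard library; the variable names are illustrative, not part of the lecture.

```python
from statistics import NormalDist

# Sampling distribution of x-bar: mean 20, standard deviation 4 / sqrt(25) = 0.8
xbar_dist = NormalDist(mu=20, sigma=4 / 25 ** 0.5)

# P(x-bar > 22) is the area to the right of 22
p_over_22 = 1 - xbar_dist.cdf(22)
print(round(p_over_22, 4))  # 0.0062
```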



c. Find the 95th percentile of \(\bar{x}\)


Look up 0.95 in the body of z-table:


\[z = 1.64 \quad \text{or} \quad z = 1.65\]


\[\text{take the midpoint} \quad z \approx 1.645\]


Convert \(z\) to \(\bar{x}\) as follows:


\[\bar{x} = \mu_{\bar{x}} + z \sigma_{\bar{x}} = 20 + 1.645(0.8) = 21.316\]
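
As a cross-check on the table lookup, the percentile can also be computed with the inverse CDF. The sketch below assumes Python's `statistics.NormalDist`; the names `z_95` and `pct_95` are illustrative.

```python
from statistics import NormalDist

xbar_dist = NormalDist(mu=20, sigma=0.8)

# z-value with 95% of the standard normal area to its left
z_95 = NormalDist().inv_cdf(0.95)

# 95th percentile of x-bar, equivalent to 20 + z_95 * 0.8
pct_95 = xbar_dist.inv_cdf(0.95)

print(f"z = {z_95:.4f}, 95th percentile = {pct_95:.3f}")
```

The software value of \(z\) is slightly below 1.645, so the result matches the table-based answer to about three decimal places.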



What if the population we are sampling from isn’t normal?

  • It would be convenient if we could still treat \(\bar{x}\) as a normal random variable


Given the Central Limit Theorem, we can do that under certain assumptions




Central Limit Theorem

Let \(\bar{x}\) be the mean of a large random sample (\(n > 30\)) from any population


  • With mean \(\mu\) and standard deviation \(\sigma\)


The distribution of \(\bar{x}\) is approximately normal


  • Mean \(\mu_{\bar{x}} = \mu\)


  • Standard deviation \(\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\)


If \(n\) is large enough, we have:


\[\bar{x} \approx N\left(\mu, \frac{\sigma^2}{n}\right)\]


  • Regardless of the original population’s distribution
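
The theorem can be illustrated by simulation. The sketch below is a demonstration under assumed settings: the population is exponential with mean 1 (so \(\mu = 1\), \(\sigma = 1\), strongly right-skewed), \(n = 40\), and the seed and number of repetitions are arbitrary choices. It checks that the sample means have mean near \(\mu\), standard deviation near \(\sigma / \sqrt{n}\), and that roughly 95% of them fall within 1.96 standard deviations of \(\mu\), as a normal distribution would predict.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed for reproducibility
n, reps = 40, 5_000
mu, sigma = 1.0, 1.0    # an exponential population with rate 1 has mean 1 and sd 1

# Each entry is the mean of one random sample of size n
means = [
    statistics.fmean(rng.expovariate(1.0) for _ in range(n))
    for _ in range(reps)
]

center = statistics.fmean(means)
spread = statistics.stdev(means)
se = sigma / n ** 0.5   # theoretical standard deviation of x-bar

# Share of sample means within 1.96 standard deviations of mu
share = sum(abs(m - mu) <= 1.96 * se for m in means) / reps

print(f"mean {center:.3f}, sd {spread:.3f} (theory {se:.3f}), share {share:.3f}")
```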



How large does \(n\) need to be?


This is an ongoing debate in statistics


  • As the skew of the population distribution increases, the sample size \(n\) we need increases


  • As a general rule of thumb, \(n > 30\) should be sufficient
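
To see how skew and sample size interact, the sketch below estimates the skewness of the sampling distribution of \(\bar{x}\) for several sample sizes. The exponential population, the sample sizes 5, 30, and 200, and the helper names `skewness` and `sample_mean_skew` are assumptions chosen for illustration. A skewness near 0 indicates a roughly symmetric, more normal-looking distribution.

```python
import random
import statistics


def skewness(values):
    """Sample skewness: the average cubed z-score of the values."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean(((v - m) / s) ** 3 for v in values)


def sample_mean_skew(n, reps=4_000, seed=1):
    """Skewness of `reps` sample means, each from n exponential draws."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(reps)
    ]
    return skewness(means)


skew_by_n = {n: sample_mean_skew(n) for n in (5, 30, 200)}
for n, skew in skew_by_n.items():
    print(f"n = {n:3d}: skewness of sample means = {skew:.2f}")
```

For this population the skewness of \(\bar{x}\) shrinks in proportion to \(1 / \sqrt{n}\), so a more heavily skewed population needs a larger \(n\) before the normal approximation is comfortable.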








Recent data from the U.S. Census indicates that the mean age of college students is \(\mu = 25\) years, with a standard deviation of \(\sigma = 9.5\) years. A simple random sample of 125 students is drawn. If \(\bar{x} =\) the sample mean age of the students, what is the distribution of \(\bar{x}\)? (Justify your answer)


Since \(n > 30\):


  • \(\bar{x} \approx N(\mu_{\bar{x}},\sigma^2_{\bar{x}})\)


\[\mu_{\bar{x}} = 25 \quad \text{and} \quad \sigma_{\bar{x}} = \frac{9.5}{\sqrt{125}} \approx 0.85.\]



So:


\[\bar{x} \approx N(25, 0.85^2)\]
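
The arithmetic for this example can be reproduced in a few lines. This is a sketch with illustrative variable names; the \(n > 30\) check encodes the rule of thumb stated earlier.

```python
from math import sqrt

mu, sigma, n = 25, 9.5, 125

clt_applies = n > 30          # rule-of-thumb condition for the CLT
mean_xbar = mu                # mean of x-bar equals the population mean
sd_xbar = sigma / sqrt(n)     # standard deviation of x-bar

print(clt_applies, mean_xbar, round(sd_xbar, 2))
```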




The Internal Revenue Service reports that the mean federal income tax paid in a recent year was \(\$8000\). Assume that the standard deviation is \(\$5000\). The IRS plans to draw a sample of \(625\) tax returns to study the effect of a new tax law


Let \(\bar{x} =\) the mean tax for the \(625\) sampled tax returns


  • Then \(\bar{x} \approx N(8000, 200^2)\) by the CLT


a. What is the probability that the sample mean tax is between \(7600\) and \(7900\)?


\[P(7600 < \bar{x} < 7900) \approx P\left(\frac{7600 - 8000}{200} < Z < \frac{7900 - 8000}{200}\right)\]


\[= P(-2 < Z < -0.5) = P(Z < -0.5) - P(Z < -2) = 0.3085 - 0.0228 = 0.2857 \quad \text{(z-table)}\]



b. Would it be unusual if the sample mean were less than \(7500\)?


\[P(\bar{x} < 7500) \approx P\left(Z < \frac{7500 - 8000}{200}\right) = P(Z < -2.50) = 0.0062 \quad \text{(z-table)}\]


Yes, because \(P(\bar{x} < 7500) < 1\%\)
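
Both parts of this example can be verified with the standard library. The sketch below assumes `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

se = 5000 / 625 ** 0.5                 # 5000 / 25 = 200
xbar_dist = NormalDist(mu=8000, sigma=se)

# a. P(7600 < x-bar < 7900): difference of two left-tail areas
p_between = xbar_dist.cdf(7900) - xbar_dist.cdf(7600)

# b. P(x-bar < 7500): a single left-tail area
p_below = xbar_dist.cdf(7500)

print(f"a. {p_between:.4f}   b. {p_below:.4f}")
```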




Population Proportion

Proportions are a useful way to summarize information about a population and a sample without losing much nuance:


  • Proportions are just percentages of the population written as decimals


  • We’ve dealt with this a lot


Say the percentage of the population who participate in early voting is \(40\%\)


  • \(\frac{40}{100}=0.40\)


  • The proportion of the population who early vote, \(p=0.40\)



If we poll a sample of 100 Manhattan residents and find that \(31\%\) early vote:


  • The proportion of our sample who early vote, \(\hat{p}=0.31\)


Just like every other statistic, sample proportions are random variables

  • So their distribution is the sampling distribution of the proportion


All of our previous rules and ideas apply


  • Different samples from the same population give different values of \(\hat{p}\)


  • The larger the sample, the closer \(\hat{p}\) tends to be to the true value \(p\)


Mean of the sample proportion \(\hat{p}\) is:


\[\mu_{\hat{p}} = p \quad \text{(population proportion)}\]


Standard deviation of sample proportion \(\hat{p}\) is:


\[\sigma_{\hat{p}} = \sqrt{\frac{p(1 - p)}{n}}\]
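
These two formulas can be checked by simulation. The sketch below is illustrative: it reuses \(p = 0.40\) and \(n = 100\) from the early-voting example, while the function name `simulate_sample_proportions`, the seed, and the number of repetitions are assumptions.

```python
import random
import statistics


def simulate_sample_proportions(p, n, reps, seed=0):
    """Return `reps` sample proportions, each from n yes/no draws with P(yes) = p."""
    rng = random.Random(seed)
    return [
        sum(rng.random() < p for _ in range(n)) / n
        for _ in range(reps)
    ]


p, n = 0.40, 100
p_hats = simulate_sample_proportions(p, n, reps=10_000)

center = statistics.fmean(p_hats)     # should be close to p
spread = statistics.stdev(p_hats)     # should be close to sqrt(p(1 - p) / n)
theory = (p * (1 - p) / n) ** 0.5

print(f"mean {center:.4f}, sd {spread:.4f} (theory {theory:.4f})")
```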


The Central Limit Theorem will tell us the “shape” of the distribution of \(\hat{p}\)


Proportion Central Limit Theorem


If \(np \geq 10\) and \(n(1 - p) \geq 10\)


Distribution of \(\hat{p}\) is approximately normal


  • Mean \(\mu_{\hat{p}} = p\)


  • Standard deviation \(\sigma_{\hat{p}} = \sqrt{\frac{p(1 - p)}{n}}\)



Thus:


\[\hat{p} \approx N \left( p, \frac{p(1 - p)}{n} \right)\]
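
The two conditions are easy to encode as a check before using the normal approximation. The helper below is a hypothetical sketch; the name `normal_approx_ok` and the `threshold` parameter are not from the lecture.

```python
def normal_approx_ok(n, p, threshold=10):
    """True when both np and n(1 - p) meet the threshold."""
    return n * p >= threshold and n * (1 - p) >= threshold


print(normal_approx_ok(100, 0.40))  # np = 40, n(1 - p) = 60
print(normal_approx_ok(20, 0.10))   # np = 2, which is too small
```

When the check fails, the sampling distribution of \(\hat{p}\) can be noticeably skewed, and the normal model should not be relied on.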




According to a Harris Poll, chocolate is the favorite ice cream flavor for 27% of Americans. If a sample of 100 Americans is taken, what is the probability that the sample proportion of those who prefer chocolate is greater than 0.30?


Since \(np = (100)(0.27) = 27 \geq 10\) and \(n(1 - p) = (100)(0.73) = 73 \geq 10\), we can apply the CLT. By the CLT, the distribution of \(\hat{p}\) is approximately normal with:


\[\mu_{\hat{p}} = 0.27 \quad \text{and} \quad \sigma_{\hat{p}} = \sqrt{\frac{0.27(1 - 0.27)}{100}} \approx 0.0444\]



\[P(\hat{p} > 0.30) \approx P\left( Z > \frac{0.30 - 0.27}{0.0444} \right)\]


\[= P(Z > 0.68)\]


\[= 1 - P(Z < 0.68)\]


\[= 1 - 0.7517 = 0.2483\]
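
The same calculation can be done in software. The sketch below assumes `statistics.NormalDist`; it computes the probability twice, once with \(z\) rounded to two decimals as in the z-table and once with the unrounded \(z\), to show how much the rounding matters. Variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.27, 100
se = sqrt(p * (1 - p) / n)            # standard deviation of p-hat
z = (0.30 - p) / se                   # unrounded z-score

std_normal = NormalDist()
p_table_z = 1 - std_normal.cdf(round(z, 2))  # z rounded to 0.68, as in the table
p_exact_z = 1 - std_normal.cdf(z)            # no rounding of z

print(f"se = {se:.4f}, z = {z:.4f}")
print(f"rounded z: {p_table_z:.4f}, unrounded z: {p_exact_z:.4f}")
```

The two answers differ slightly in the third decimal place; that gap comes only from rounding \(z\) before the table lookup.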



Uncertainty


We’ve studied point estimates: single-number estimates of population parameters (e.g., sample mean, sample proportion)


Point estimates are a deterministic result


  • Statistics deals with probabilistic results


It would be more informative to provide a range of values

  • We generally call these confidence intervals, and we’ll talk about them in more depth next week




Attendance QOTD




Go away