Lec 20

Review

Sampling Distribution

Given any population \(Y \sim N(\mu,\sigma^2)\)

Sample \(X \sim N(\mu_X,\sigma^2_X)\)

Sample mean \(\bar{x} \sim N(\mu_{\bar{x}},\sigma^2_{\bar{x}})\)

Where: \(\mu_{\bar{x}} = \mu\)
And: \(\sigma_{\bar{x}} = \sqrt{\sigma^2_{\bar{x}}} = \frac{\sigma}{\sqrt{n}}\)

Central Limit Theorem

Let \(\bar{x}\) be the mean of a large random sample (\(n > 30\)) from any population

With mean \(\mu\) and standard deviation \(\sigma\)

The distribution of \(\bar{x}\) is approximately normal

Mean \(\mu_{\bar{x}} = \mu\)

Standard deviation \(\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\)

If \(n\) is large enough, we have:

\[\bar{x} \sim N \left( \mu, \frac{\sigma^2}{n} \right)\]

Regardless of the original population’s distribution
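A quick simulation can make the theorem concrete. The sketch below assumes an Exponential(1) population (chosen here only because it is clearly non-normal, with \(\mu = 1\) and \(\sigma = 1\)); the function name, sample size, and seed are illustrative choices, not part of the lecture.

```python
import math
import random
import statistics

def simulate_sample_means(n, reps, rate=1.0, seed=0):
    """Draw `reps` samples of size `n` from an Exponential(rate) population
    and return the list of sample means."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(rate) for _ in range(n))
        for _ in range(reps)
    ]

# Assumed population: Exponential(1), right-skewed, with mu = 1 and sigma = 1.
mu, sigma = 1.0, 1.0
n = 50
means = simulate_sample_means(n=n, reps=5000)

print("mean of sample means:", round(statistics.fmean(means), 3))  # close to mu
print("sd of sample means:  ", round(statistics.stdev(means), 3))  # close to sigma / sqrt(n)
print("sigma / sqrt(n):     ", round(sigma / math.sqrt(n), 3))
```

A histogram of `means` should look roughly bell-shaped even though the population itself is skewed.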

How large does \(n\) need to be?

Population Proportion

Proportions are a useful way to summarize information about a population or a sample without losing much nuance:

Proportions are just percentages of the population

Say the percentage of the population who have experienced flu-like symptoms this month is \(40\%\)

\({40\over 100}=0.40\)

The proportion of the population who have flu-like symptoms, \(p=0.40\)

If we poll a sample of 100 Manhattan residents and find that \(31\%\) have experienced flu-like symptoms in the past month:

The proportion of our sample who experienced flu-like symptoms in our pre-defined time window, \(\hat{p}=0.31\)

Just like every other statistic, sample proportions are random variables

So their distribution is the sampling distribution of the proportion

All of our previous rules and ideas apply

As we take samples from our population we will see they aren’t consistent

The larger the sample, the closer the sample proportion tends to be to the true value
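The variation between samples can be seen in a small simulation. This is a sketch that assumes each polled person independently has symptoms with probability \(p = 0.40\) (the flu example); the helper name, the two sample sizes, and the number of repeated polls are illustrative choices.

```python
import random
import statistics

def sample_proportion(p, n, rng):
    """Poll n people, each with symptoms with probability p; return p-hat."""
    return sum(rng.random() < p for _ in range(n)) / n

rng = random.Random(1)
p = 0.40  # population proportion from the flu example

results = {}
for n in (100, 10_000):
    # 200 repeated polls of size n (the count 200 is an arbitrary choice)
    p_hats = [sample_proportion(p, n, rng) for _ in range(200)]
    results[n] = p_hats
    print(n, "min:", min(p_hats), "max:", max(p_hats),
          "sd:", round(statistics.stdev(p_hats), 4))
```

The sample proportions differ from poll to poll, and the spread shrinks as \(n\) grows.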

Mean of the sample proportion \(\hat{p}\) is:

\[\mu_{\hat{p}} = p \quad \text{(population proportion)}\]

Standard deviation of sample proportion \(\hat{p}\) is:

\[\sigma_{\hat{p}} = \sqrt{\frac{p(1 - p)}{n}}\]

Proportion Central Limit Theorem

If \(np \geq 10\) and \(n(1 - p) \geq 10\)

Distribution of \(\hat{p}\) is approximately normal

Mean \(\mu_{\hat{p}} = p\)

Standard deviation \(\sigma_{\hat{p}} = \sqrt{\frac{p(1 - p)}{n}}\)

Thus:

\[\hat{p} \sim N \left( p, \frac{p(1 - p)}{n} \right)\]
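As a worked check, the sketch below plugs the flu example (\(p = 0.40\), \(n = 100\)) into these formulas. The function name is hypothetical, and the final probability calculation is an added illustration of how the normal approximation might be used, not something computed in the lecture.

```python
import math
from statistics import NormalDist

def p_hat_distribution(p, n):
    """Return (mean, sd, normal_ok) for the sampling distribution of p-hat."""
    mean = p
    sd = math.sqrt(p * (1 - p) / n)
    normal_ok = n * p >= 10 and n * (1 - p) >= 10
    return mean, sd, normal_ok

mean, sd, normal_ok = p_hat_distribution(p=0.40, n=100)
print(mean, round(sd, 4), normal_ok)  # sd is about 0.049

# If the conditions hold, probabilities about p-hat follow from the normal CDF,
# e.g. the chance of seeing a sample proportion of 0.31 or lower.
if normal_ok:
    prob = NormalDist(mu=mean, sigma=sd).cdf(0.31)
    print(round(prob, 3))
```

Here \(np = 40\) and \(n(1 - p) = 60\), so both conditions are met.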

Point estimates are a deterministic result

Statistics deals with probabilistic results

It would be more informative to provide a range of values

Questions?

Goals for today:

Compute confidence intervals for point estimates arising from Normal data
Discuss interpretations of confidence intervals

Uncertainty I

Confidence Intervals

Since the value of \(\bar{x}\) varies with each sample, we need to quantify the uncertainty associated with \(\bar{x}\)

A random sample of \(120\) students admitted to top veterinary schools yielded an average GPA of \(3.85\)

\[\bar{x} = 3.85\]

This is a point estimate of \(\mu\)

One number, no additional information provided

A confidence interval (CI) provides a range of values that is likely to contain the population parameter

The range comes with a stated level of confidence, called the confidence level

Formula for the CI:

\[\text{Point estimate} \pm \text{Margin of Error}\]

The confidence interval for \(\mu\):

\[ \bar{x} \pm \text{Margin of Error} \]

\[ (\bar{x} - \text{Margin of Error}, \bar{x} + \text{Margin of Error}) \]

Margin of error

The largest distance we expect between our estimate \(\bar{x}\) and \(\mu\), at the chosen confidence level

The size of the margin of error is determined by the sampling distribution of \(\bar{x}\) and the confidence level

Confidence level is denoted by \(100(1 - \alpha)\%\)

Typically \(90\%\), \(95\%\), or \(99\%\)

For a population with unknown \(\mu\) but known \(\sigma\), a \(100(1 - \alpha)\%\) confidence interval for \(\mu\) is computed as:

\[\bar{x} \pm z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}\]

Where \(z_{\alpha/2}\) is the z-score with an area of \(\alpha/2\) to its right
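The critical value can be read from a z-table or computed. The sketch below uses Python's standard-library `statistics.NormalDist`; the function names are hypothetical. In the usage lines, \(\bar{x} = 3.85\) and \(n = 120\) come from the GPA example, while \(\sigma = 0.2\) is a made-up value, since the lecture does not give one.

```python
import math
from statistics import NormalDist

def z_critical(confidence):
    """z-score with area alpha/2 to its right, where alpha = 1 - confidence."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

def z_interval(x_bar, sigma, n, confidence=0.95):
    """Confidence interval for mu when sigma is known."""
    margin = z_critical(confidence) * sigma / math.sqrt(n)
    return x_bar - margin, x_bar + margin

for level in (0.90, 0.95, 0.99):
    print(level, round(z_critical(level), 3))  # 1.645, 1.96, 2.576

# sigma = 0.2 is an ASSUMED value used only so the formula can be evaluated.
low, high = z_interval(x_bar=3.85, sigma=0.2, n=120)
print(round(low, 3), round(high, 3))
```

Raising the confidence level increases \(z_{\alpha/2}\), which widens the interval.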

When constructing a confidence interval for \(\mu\)

We have to consider our assumptions

At least one of the following must hold:

The sample size is large (\(n > 30\))

The original population is normally distributed

In most practical cases, \(\sigma\) is unknown, and we must use the sample standard deviation \(s\)

The formula for the confidence interval is:

\[\bar{x} \pm t_{\alpha/2} \cdot \frac{s}{\sqrt{n}}\]

Where \(t_{\alpha/2}\) is the critical value from the Student’s t-distribution, and \(s\) is the sample standard deviation

Student’s t-Distribution

The (Student’s) t-distribution is similar to the standard normal distribution

Unimodal

Symmetric around \(0\)

But it has wider (or heavier) tails than the standard normal

Meaning it’s more spread out

The t-distribution is distinguished by degrees of freedom (\(df = n - 1\))

As \(df\) increases, the t-distribution converges to the standard normal distribution
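The heavier tails and the convergence can be checked numerically. This sketch writes out the t density from its standard formula using only the `math` module; the evaluation point \(x = 3\) and the chosen \(df\) values are arbitrary illustrations.

```python
import math

def normal_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    # Work on the log scale so large df does not overflow the gamma function.
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))

x = 3.0  # a point far out in the right tail
for df in (2, 7, 30, 1000):
    print(df, t_pdf(x, df))
print("normal", normal_pdf(x))
```

The tail density shrinks toward the normal value as \(df\) grows, which is why the critical values \(t_{\alpha/2}\) approach \(z_{\alpha/2}\) for large samples.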

The critical value \(t_{\alpha/2}\) is a \(t\) value separating an area of \(\alpha/2\) in the right tail of the \(t\) distribution

When using the \(t\) distribution to construct a confidence interval for \(\mu\):

Degrees of freedom (\(df\)) is \(1\) less than the sample size

Find the critical value \(t_{\alpha/2}\) for a 95% confidence interval with \(n = 8\)

Set \(1 - \alpha = 0.95\), then \(\alpha = 0.05\), and \(\alpha/2 = 0.025\)

For \(n = 8 \Rightarrow df = n-1 = 7\)

The critical value is \(t_{\alpha/2} = 2.365\)

What if the \(df\) I’m looking for isn’t in the table?

Round down to the nearest \(df\) listed in the table (this gives a slightly larger critical value, so the interval errs on the wide side)

If \(df=59\), round down to \(df=50\)
At \(95\%\) confidence, \(t_{\alpha/2}=2.009\)
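The round-down rule can be sketched as a table lookup. The dictionary below is a small, hypothetical excerpt of a 95% t-table: the rows for \(df = 4, 7, 50\) match values used in this lecture, and the rows for \(df = 30, 40, 60\) are standard table values added for illustration. The function name is an assumption.

```python
from bisect import bisect_right

# Critical values t_{alpha/2} for 95% confidence (alpha/2 = 0.025).
# Only a few rows are shown; a printed table has many more.
T_TABLE_95 = {4: 2.776, 7: 2.365, 30: 2.042, 40: 2.021, 50: 2.009, 60: 2.000}

def t_critical_from_table(df, table=T_TABLE_95):
    """Return (df_used, t) using the largest tabulated df not exceeding df."""
    rows = sorted(table)
    i = bisect_right(rows, df)
    if i == 0:
        raise ValueError(f"df={df} is below the smallest tabulated df")
    df_used = rows[i - 1]
    return df_used, table[df_used]

print(t_critical_from_table(7))   # (7, 2.365)
print(t_critical_from_table(59))  # (50, 2.009)
```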

Check your assumptions for constructing a CI of \(\mu\):

Sample size is large (\(n>30\)) or the population is normal

\(100(1-\alpha)\%\) confidence interval is computed as:

Case 1: \(\sigma\) is known, use the z-method:

\[\bar{x} \pm z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}\]

Case 2: \(\sigma\) is unknown, use the t-method:

\[\bar{x} \pm t_{\alpha/2} \cdot \frac{s}{\sqrt{n}}\]

Given a sample of size \(n = 5\) from a normal population, \(\bar{x} = 4.31\), and \(s = 2.7\), construct a 95% confidence interval for \(\mu\)

Should we use \(z\) method or \(t\) method?

\(\sigma\) is unknown, so we use the \(t\) method

Compute the margin of error for this \(95\%\) confidence interval:

With \(df = 4\) and \(t_{\alpha/2} = 2.776\), calculate:

\[\text{Margin of Error} = 2.776 \times \frac{2.7}{\sqrt{5}} \approx 3.352\]

Construct a \(95\%\) confidence interval for \(\mu\) and interpret your result:

\[4.31 \pm 3.352 \quad \text{or} \quad (0.958, 7.662)\]

We are 95% confident that the true population mean lies between 0.958 and 7.662
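The arithmetic in this example can be verified with a few lines of code. The sketch takes the critical value \(t_{\alpha/2} = 2.776\) from the table as an input rather than computing it; the function name is hypothetical.

```python
import math

def t_interval(x_bar, s, n, t_crit):
    """Return (margin, low, high) for x_bar +/- t_crit * s / sqrt(n)."""
    margin = t_crit * s / math.sqrt(n)
    return margin, x_bar - margin, x_bar + margin

margin, low, high = t_interval(x_bar=4.31, s=2.7, n=5, t_crit=2.776)
print(round(margin, 3), round(low, 3), round(high, 3))  # 3.352 0.958 7.662
```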

If the population were not normal, would the confidence interval above be valid?

Interpreting a CI

Suppose we take many random samples and construct a \(95\%\) confidence interval from each sample

About \(95\%\) of those intervals would contain the true population mean, \(\mu\)

In practice we say that we’re \(95\%\) confident that the true value of \(\mu\) is within our confidence interval
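The repeated-sampling interpretation can be checked by simulation. The sketch below assumes a normal population with made-up values \(\mu = 10\) and \(\sigma = 2\), samples of size \(n = 25\), and the z-method with \(\sigma\) known; all of these are illustrative choices, not from the lecture.

```python
import math
import random
from statistics import NormalDist, fmean

def coverage(mu, sigma, n, confidence=0.95, reps=4000, seed=0):
    """Fraction of z-intervals (sigma known) that contain the true mean mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / math.sqrt(n)
    hits = 0
    for _ in range(reps):
        x_bar = fmean(rng.gauss(mu, sigma) for _ in range(n))
        if x_bar - margin <= mu <= x_bar + margin:
            hits += 1
    return hits / reps

# mu, sigma, and n are ASSUMED values for the simulated population.
rate = coverage(mu=10.0, sigma=2.0, n=25)
print(rate)  # should land near 0.95
```

Each individual interval either contains \(\mu\) or it does not; the \(95\%\) describes how often the procedure succeeds over many samples.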