Lec 22
Review
Confidence Intervals for Proportions
Point estimate for population proportions
\[\hat{p}\]
MOE (Margin of Error)
\[ z_{\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \]
CLT for Proportions
\[ n\hat{p} \geq 10 \quad \text{and} \quad n(1 - \hat{p}) \geq 10 \]
CI for population proportions
\[ \hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \quad \text{(Margin of Error)} \]
Steps for computing any confidence interval
- Find your point estimate
- Determine your confidence level
- Compute the margin of error
- Construct the confidence interval via \(\text{PE} \pm \text{MOE}\)
- Handle any interpretation steps by asking what the original research question was
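The steps above can be sketched in Python for a proportion. This is a minimal illustration using only the standard library; the function name `proportion_ci` is hypothetical, and the counts (471 successes out of 500) are borrowed from the teacher survey later in these notes purely as sample input.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(successes, n, confidence=0.95):
    """Return (p_hat, margin_of_error, lower, upper) for a population proportion."""
    p_hat = successes / n  # step 1: point estimate
    # CLT condition for proportions: at least 10 successes and 10 failures
    if n * p_hat < 10 or n * (1 - p_hat) < 10:
        raise ValueError("CLT condition for proportions is not met")
    alpha = 1 - confidence  # step 2: confidence level
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    moe = z * sqrt(p_hat * (1 - p_hat) / n)  # step 3: margin of error
    return p_hat, moe, p_hat - moe, p_hat + moe  # step 4: PE +/- MOE


p_hat, moe, lower, upper = proportion_ci(471, 500)
print(f"p_hat = {p_hat:.3f}, MOE = {moe:.4f}, CI = ({lower:.4f}, {upper:.4f})")
```

The interpretation step is left to the reader, since it depends on the original research question.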
The general method for all word problems
- What do I have?
What relevant variables/information does the problem provide me outright?
- What am I looking for?
What is the problem asking me to provide as a final answer?
- What tools do I have to find what I’m looking for?
What equations do I know that have the final answer as a component of them?
- What do I need to use the tools I have?
Of the relevant variables/information that I have, which ones fit into the equations I’m looking at? Am I missing any information? Do I have excess information?
- Does it all make sense?
Does my final answer match the context of the original problem? If not, what would need to change for it to make sense?
Put in practice:
The daily sales of a local coffee shop are normally distributed with a population mean of \(\mu = 300\) dollars and a standard deviation of \(\sigma = 75\) dollars. If a random sample of \(n = 49\) days is taken:
- What is the probability that we will observe a sample mean over \(312\)?
What do I have?
\(\mu=300\) dollars
\(\sigma=75\) dollars
\(n=49\) days
Population is normally distributed
What am I looking for?
\[P(\bar{x}>312)\]
What tools do I have to find what I’m looking for?
- \(z\)-score formula for a sample mean:
\[P(\bar{X}>\bar{x})=P\left(Z>{\bar{x}-\mu \over \sigma/\sqrt{n}}\right)=P(Z>z)\]
\(z\)-table
Empirical rule
What do I need to use the tools I have?
\(\bar{x}\) or some single value for the sample mean
\(\mu\) or some mean
\(\sigma\) or some standard deviation
\(n\) or some sample size
Normally distributed data
These all fit very well into our z-score formula. Because the question asks about a sample mean, we divide by the standard error \(\sigma/\sqrt{n} = 75/\sqrt{49} \approx 10.71\) rather than by \(\sigma\) itself
\[P(\bar{x}>312)=P\left(Z>{312-300 \over 75/\sqrt{49}}\right)=P(Z>1.12)\]
\[P(Z>1.12)=1-P(Z<1.12)=1-0.8686=0.1314\]
Does it all make sense?
We know that the sample mean is normally distributed so:
- \(68\%\) of sample means should be within \(1\) standard error of \(\mu\)
Let’s think about this
\[\approx 68\% \text{ of sample means}\Rightarrow \text{within} \pm 1 \text{ standard error}\]
\[\approx 16\% \text{ of sample means} \Rightarrow \text{above} +1 \text{ standard error}\]
\[312 = \mu + 1.12 \text{ standard errors}\]
Since \(312\) sits slightly beyond \(+1\) standard error, we should expect our value to be something a little less than \(16\%\), which matches \(0.1314\)
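A quick numerical check of the coffee shop probability, as a sketch using only Python's standard library. Because the question concerns the mean of \(n = 49\) days, the spread used is the standard error \(\sigma/\sqrt{n}\); the probability for a single day's sales is computed alongside for contrast. The variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n, cutoff = 300, 75, 49, 312

# Sampling distribution of the mean: normal with spread sigma / sqrt(n)
standard_error = sigma / sqrt(n)
z = (cutoff - mu) / standard_error
p_mean_above = 1 - NormalDist().cdf(z)

# For contrast: one day's sales follow N(mu, sigma), a much wider distribution
p_single_day_above = 1 - NormalDist(mu, sigma).cdf(cutoff)

print(f"z = {z:.2f}, P(sample mean > {cutoff}) = {p_mean_above:.4f}")
print(f"P(single day > {cutoff}) = {p_single_day_above:.4f}")
```

The sample mean varies far less than a single day's sales, so a sample mean above 312 is much less likely than a single day above 312.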
Hypothesis Testing
In hypothesis testing, there are two competing statements about population parameters:
\[H_0\equiv \text{null hypothesis} \quad \text{vs} \quad H_1 \equiv \text{alternate hypothesis}\]
The null hypothesis, \(H_0\), states that the parameter is equal to a specific value
\[H_0 : \mu = 35\]
The alternate hypothesis, \(H_1\), states that the value of the parameter differs from the value specified by the null hypothesis
\[H_1 : \mu < 35\]
\[H_1 : \mu > 35\]
\[H_1 : \mu \neq 35\]
There are three types of alternate hypotheses
- Consider \(H_0 : \mu = 35\)
- \(H_1 : \mu < 35\) \(\Rightarrow\) called left-tailed alternate hypothesis
- \(H_1 : \mu > 35\) \(\Rightarrow\) called right-tailed alternate hypothesis
- \(H_1 : \mu \neq 35\) \(\Rightarrow\) called two-tailed alternate hypothesis
Left-tailed and right-tailed hypotheses are called one-tailed hypotheses
A null hypothesis is generally thought of as a default state of nature (e.g. existing knowledge)
An alternate hypothesis, on the other hand, contradicts the default state (e.g. new knowledge)
In most cases, whatever we wish to establish is placed in the alternate hypothesis
After developing \(H_0\) and \(H_1\), we collect a set of data
Based on the data, we construct a test statistic to reach one of the following decisions:
Reject \(H_0\)
Fail to reject \(H_0\)
If we reject \(H_0\)
- We conclude that the data provide sufficient evidence in favor of \(H_1\)
If we fail to reject \(H_0\)
- We conclude that the data do not provide enough evidence to reject \(H_0\)
Type I error: \(H_0\) is true in reality, but we reject \(H_0\)
Type II error: \(H_1\) is true in reality, but we do not reject \(H_0\)
\[ \begin{array}{|c|c|c|} \hline \text{Decision} & H_0 \ \text{True} & H_0 \ \text{False} \\ \hline \text{Reject} \ H_0 & \text{Type I error} & \text{Correct decision} \\ \hline \text{Don’t reject} \ H_0 & \text{Correct decision} & \text{Type II error} \\ \hline \end{array} \]
The probability of making a Type I error is denoted by \(\alpha\)
The probability of making a Type II error is denoted by \(\beta\)
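One way to see what \(\alpha\) means is to simulate many samples from a population where \(H_0\) really is true and count how often a test rejects it. The sketch below is an illustration under stated assumptions, not part of the lecture: it uses a two-tailed \(z\)-test with a known \(\sigma\), and the values \(\sigma = 5\), \(n = 40\), the seed, and the number of trials are made up for the demonstration (only \(\mu_0 = 35\) comes from the example hypotheses above).

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def type_i_error_rate(mu0=35.0, sigma=5.0, n=40, alpha=0.05, trials=2000, seed=0):
    """Fraction of simulated samples that reject H0 even though H0 is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        # Draw the sample from a population whose mean really is mu0
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        z = (fmean(sample) - mu0) / (sigma / sqrt(n))
        p_value = 2 * NormalDist().cdf(-abs(z))  # two-tailed P-value
        if p_value <= alpha:
            rejections += 1  # every rejection here is a Type I error
    return rejections / trials


rate = type_i_error_rate()
print(f"Observed Type I error rate: {rate:.3f}")
```

The returned fraction should land near \(\alpha = 0.05\), since each rejection is a Type I error by construction.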
Questions?
Goals for Today:
Introduce hypothesis tests for population means and proportions
Practice these skills A LOT
Hypothesis Testing I
Hypothesis Testing for Population Means
We’ve covered how to set up hypotheses; now we’ll learn how to perform the actual test
Starting with a hypothesis test for population mean \(\mu\)
- When population standard deviation \(\sigma\) is unknown
We do need the Central Limit Theorem to hold in order for us to proceed
\[\text{if} \ \ n>30 \ \ \text{then} \ \ \bar{X} \sim N\left(\mu,\frac{\sigma^2}{n}\right) \ \ \text{(approximately)}\]
There are two hypothesis test methods
- Critical value method
- P-value method
We’re going to learn the P-value method:
- State the null and alternate hypotheses
- Choose a significance level \(\alpha\) (the allowed probability of a Type I error)
- Compute the test statistic:
\[t = \frac{\bar{x} - \mu_0}{{s}/{\sqrt{n}}}\]
Since \(\sigma\) is unknown, we replace it with the sample standard deviation \(s\)
We use the \(t\) statistic, which comes from the \(t\) distribution with \(\text{df} = n - 1\)
- Compute the P-value of the test statistic \(t\)
Left-tailed test: \(P\)-value = area under the \(t\) distribution to the left of \(t\), i.e., \(P(T < t)\)
Right-tailed test: \(P\)-value = area under the \(t\) distribution to the right of \(t\), i.e., \(P(T > t)\)
Two-tailed test: \(P\)-value = sum of the areas under the \(t\) distribution to the left of \(-|t|\) and right of \(|t|\), i.e., \(2 \cdot P(T < -|t|)\)
- The degrees of freedom for the \(t\) distribution are \(\text{df} = n - 1\)
- Determine whether to reject \(H_0\)
- Reject \(H_0\) if \(P\)-value \(\leq \alpha\)
- Do not reject \(H_0\) if \(P\)-value \(> \alpha\)
- State a conclusion
In a recent medical study, 76 subjects were placed on a low-fat diet. After 12 months, their sample mean weight loss was \(\bar{x} = 2.2\) kilograms, with a sample standard deviation of \(s = 6.1\) kilograms. Use the \(\alpha = 0.05\) level of significance to test the claim that the mean weight loss is greater than 0.
Step 1: State the null and alternate hypotheses
\[ H_0 : \mu = 0 \\ H_1 : \mu > 0 \\ (\text{right-tailed test}) \]
Step 2: Choose a significance level \(\alpha\)
\[ \alpha = 0.05 \]
Step 3: Compute the value of the test statistic
\[ t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{2.2 - 0}{6.1 / \sqrt{76}} \approx 3.144 \]
Step 4: Use the t-table to compute the P-value
Since \(H_1 : \mu > 0\) is right-tailed, the P-value is \(P(T > 3.144)\)
The degrees of freedom are \(df = 75\), which does not appear in the t-table, so we round down to the nearest value that does appear, \(df = 60\)
In the t-table with \(df = 60\), we find that \(P(T > 3.144)\) is between \(P(T > 3.232)\) and \(P(T > 2.915)\), so the P-value is between \(0.001\) and \(0.0025\)
Step 5: Determine whether to reject \(H_0\)
Our P-value is between \(0.001\) and \(0.0025\)
Since the P-value is less than \(\alpha = 0.05\), we reject \(H_0\)
Step 6: State your conclusion
- We conclude that the mean weight loss of people who were placed on a low-fat diet for \(12\) months is greater than \(0\)
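The six steps can be cross-checked numerically. The sketch below assumes only Python's standard library, so it approximates the area under the \(t\) distribution with Simpson's rule instead of a table; the helper names `t_pdf`, `t_cdf`, and `one_sample_t_test` are hypothetical. Unlike the table lookup, it uses \(df = 75\) directly.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=2000):
    """P(T < x): Simpson's rule on [0, |x|], then symmetry about 0."""
    a = abs(x)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # P(0 < T < |x|)
    return 0.5 + area if x >= 0 else 0.5 - area


def one_sample_t_test(xbar, mu0, s, n, tail):
    """Return (t, df, P-value) for tail in {'left', 'right', 'two'}."""
    t = (xbar - mu0) / (s / math.sqrt(n))
    df = n - 1
    if tail == "left":
        p = t_cdf(t, df)
    elif tail == "right":
        p = 1 - t_cdf(t, df)
    elif tail == "two":
        p = 2 * t_cdf(-abs(t), df)
    else:
        raise ValueError("tail must be 'left', 'right', or 'two'")
    return t, df, p


# Low-fat diet example: n = 76, x-bar = 2.2, s = 6.1, H1: mu > 0
t_stat, df, p_value = one_sample_t_test(2.2, 0, 6.1, 76, tail="right")
print(f"t = {t_stat:.3f}, df = {df}, P-value = {p_value:.4f}")
print("Reject H0" if p_value <= 0.05 else "Fail to reject H0")
```

With the exact degrees of freedom the P-value should still fall between \(0.001\) and \(0.0025\), so the decision to reject \(H_0\) is unchanged.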
A type of steel used by a manufacturing company is supposed to have an average hardness of 62 on the Rockwell hardness index. If the steel is too hard or too soft, defects can appear in the final product. A random sample of 10 specimens for a new steel supplier had a mean hardness of 64 with a standard deviation of 4. Test at the 5% significance level whether the mean hardness of the new supplier’s steel is different from the desired hardness of 62. (Assume that the population is normally distributed).
Step 1: State the null and alternate hypotheses
\[ H_0 : \mu = 62 \\ H_1 : \mu \neq 62 \\ (\text{two-tailed test}) \]
Step 2: Choose a significance level
\[ \alpha = 0.05 \]
Step 3: Compute the test statistic
Given:
Sample mean \(\bar{x} = 64\)
Hypothesized population mean \(\mu_0 = 62\)
Sample standard deviation \(s = 4\)
Sample size \(n = 10\)
The test statistic is:
\[ t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{64 - 62}{4 / \sqrt{10}} \approx 1.581 \]
Step 4 & 5: Determine the P-value (two-tailed test)
Using the t-table with \(df = 9\), we find:
- \(P(T > 1.581)\) is between \(P(T > 1.833)\) and \(P(T > 1.383)\), so \(P(T > 1.581)\) is between \(0.05\) and \(0.10\)
Therefore, the two-tailed P-value is:
\[ 0.10 < \text{P-value} = 2 \cdot P(T > 1.581) < 0.20 \]
Since the P-value is greater than \(\alpha = 0.05\), we fail to reject \(H_0\)
Step 6: State the conclusion
There is not enough evidence to conclude that the mean hardness of the new supplier’s steel is different from 62
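The table lookup used in this example can also be written out as a small procedure. This is a sketch only: the function name `bracket_right_tail` is hypothetical, and the dictionary holds one row of a standard \(t\)-table (\(df = 9\)), mapping each right-tail area to its critical value.

```python
from math import sqrt

# One row of a t-table for df = 9: right-tail area -> critical value
T_TABLE_DF9 = {0.10: 1.383, 0.05: 1.833, 0.025: 2.262, 0.01: 2.821, 0.005: 3.250}


def bracket_right_tail(t_abs, row):
    """Return (low, high) bounds on P(T > t_abs) for t_abs >= 0."""
    low, high = 0.0, 0.5  # a non-negative t always has a right tail of at most 0.5
    for area, critical in row.items():
        if t_abs >= critical:
            high = min(high, area)  # t is at least this far out, so tail <= area
        else:
            low = max(low, area)  # t is not this far out, so tail > area
    return low, high


# Steel hardness example: x-bar = 64, mu0 = 62, s = 4, n = 10, two-tailed
xbar, mu0, s, n, alpha = 64, 62, 4, 10, 0.05
t = (xbar - mu0) / (s / sqrt(n))
low, high = bracket_right_tail(abs(t), T_TABLE_DF9)
p_low, p_high = 2 * low, 2 * high  # double the one-tail bounds for a two-tailed test

if p_high <= alpha:
    decision = "reject H0"
elif p_low >= alpha:
    decision = "fail to reject H0"
else:
    decision = "table bracket straddles alpha; need a finer table or software"

print(f"t = {t:.3f}, P-value between {p_low:.2f} and {p_high:.2f}: {decision}")
```

The third branch covers the case where the table alone cannot settle the decision, which does not arise in this example.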
Hypothesis Testing for Proportions
Now, we want to test a hypothesis for population proportion \(p\)
We still need the Central Limit Theorem for proportions to hold:
\[np_0 \geq 10 \quad \text{and} \quad n(1 - p_0) \geq 10\]
Where \(p_0\) is the population proportion specified by \(H_0\)
Step 1: State the null and alternate hypotheses
The null hypothesis is of the form:
\[ H_0 : p = p_0 \]
The alternate hypothesis is in one of the three forms:
Left-tailed: \(H_1 : p < p_0\)
Right-tailed: \(H_1 : p > p_0\)
Two-tailed: \(H_1 : p \neq p_0\)
Step 2: Choose a significance level \(\alpha\)
Step 3: Compute the test statistic:
\[ z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}} \]
Step 4: Compute the P-value of the test statistic \(z\)
Left-tailed: P-value = area under the standard normal distribution to the left of \(z\)
- i.e., \(P(Z < z)\)
Right-tailed: P-value = area under the standard normal distribution to the right of \(z\)
- i.e., \(P(Z > z)\)
Two-tailed: P-value = sum of the areas under the standard normal distribution to the left of \(-|z|\) and right of \(|z|\)
- i.e., \(2 \cdot P(Z < -|z|)\)
Step 5: Determine whether to reject \(H_0\):
Reject \(H_0\) if P-value \(\leq \alpha\)
Do not reject \(H_0\) if P-value \(> \alpha\)
Step 6: State a conclusion
Suppose that 67% of all auto damage insurance claims in the US are made by singles under 25 years old. Also suppose that in a random sample of 53 auto damage claims in Manhattan, KS, there were 42 made by singles under 25.
Test at the 5% significance level whether the proportion of auto damage claims made by singles under 25 in Manhattan is different from the proportion for the entire US.
a. State the null and alternate hypotheses
\[ H_0 : p = 0.67 \\ H_1 : p \neq 0.67 \\ \quad (\text{two-tailed test}) \]
b. Compute the value of the test statistic
Given:
Sample proportion \(\hat{p} = \frac{42}{53} \approx 0.7925\)
Hypothesized population proportion \(p_0 = 0.67\)
The test statistic is:
\[ z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}} = \frac{0.7925 - 0.67}{\sqrt{\frac{0.67(1 - 0.67)}{53}}} \approx 1.90 \]
c. Determine whether to reject \(H_0\)
Using the z-table for a two-tailed test:
\[ \text{P-value} = 2 \cdot P(Z < -1.90) = 2(0.0287) = 0.0574 \]
Since the P-value \(> \alpha (= 0.05)\), we fail to reject \(H_0\)
d. State your conclusion
There is not enough evidence to conclude that the proportion of auto damage claims made in Manhattan by singles under 25 is different from the national proportion
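The proportion test steps, including the Central Limit Theorem check, can be sketched in Python with the standard library's `statistics.NormalDist`. The function name `proportion_z_test` and its `tail` argument are hypothetical choices for this illustration; the inputs are the insurance claim numbers from the example above.

```python
from math import sqrt
from statistics import NormalDist


def proportion_z_test(successes, n, p0, tail, alpha=0.05):
    """Return (z, P-value, reject_H0) for tail in {'left', 'right', 'two'}."""
    # CLT condition for proportions, checked with the hypothesized p0
    if n * p0 < 10 or n * (1 - p0) < 10:
        raise ValueError("CLT condition for proportions is not met")
    p_hat = successes / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    std_normal = NormalDist()
    if tail == "left":
        p_value = std_normal.cdf(z)
    elif tail == "right":
        p_value = 1 - std_normal.cdf(z)
    elif tail == "two":
        p_value = 2 * std_normal.cdf(-abs(z))
    else:
        raise ValueError("tail must be 'left', 'right', or 'two'")
    return z, p_value, p_value <= alpha


# Insurance claims example: 42 of 53 claims, H0: p = 0.67, two-tailed
z, p_value, reject = proportion_z_test(42, 53, 0.67, tail="two")
print(f"z = {z:.2f}, P-value = {p_value:.4f}, reject H0: {reject}")
```

Because the code keeps the unrounded \(z\), its P-value is close to, though not identical to, the table-based \(0.0574\); the decision is the same.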
An educational technology specialist is studying attitudes of teachers about the use of virtual reality in the classroom. She samples 500 teachers and finds that 471 of them believe that virtual reality would have a positive effect. Can she conclude that the proportion of teachers who believe that virtual reality would have a positive effect is greater than 0.90? Use the \(\alpha = 0.05\) level of significance.
Step 1: State the null and alternate hypotheses
\[ H_0 : p = 0.90 \\ H_1 : p > 0.90 \\ \quad (\text{right-tailed test}) \]
Step 2: Choose a significance level
\[ \alpha = 0.05 \]
Step 3: Compute the test statistic
Given:
Sample proportion \(\hat{p} = \frac{471}{500} = 0.942\)
Hypothesized population proportion \(p_0 = 0.90\)
The test statistic is:
\[ z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}} = \frac{0.942 - 0.90}{\sqrt{\frac{0.90(1 - 0.90)}{500}}} \approx 3.13 \]
Step 4: Determine the P-value
\[ \text{P-value} = P(Z > 3.13) = 0.0009 \]
Step 5: Determine whether to reject \(H_0\)
Since the P-value \(< \alpha\), we reject \(H_0\)
Step 6: State the conclusion
We conclude that more than 90% of teachers believe that virtual reality would have a positive effect on education
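The worked solution above does not show the Central Limit Theorem check explicitly. As a sketch of that step for this example, using \(n = 500\) and \(p_0 = 0.90\) from the problem:
\[ np_0 = 500(0.90) = 450 \geq 10 \quad \text{and} \quad n(1 - p_0) = 500(0.10) = 50 \geq 10 \]
Both counts are at least 10, so the normal approximation behind the \(z\) statistic is reasonable here.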
Attendance QOTD