Hypothesis Testing III

STAT 240 - Fall 2025

Robert Sholl

Fisher’s Exact Test

The Worst Party Ever

In Ronald Fisher’s seminal work, The Design of Experiments (1935), he described a party he attended where a woman claimed to be able to tell whether the tea or the milk was added first in a cup of tea. To test this claim, Fisher randomly arranged 8 cups of tea, 4 of which had milk poured first and 4 tea first. The woman was left to perform a blind tasting and report her guesses on the order.

Counting

Easy to understand, harder to perform than it looks (see: Combinatorics)
Counting is a very useful tool for probability assessment
- 4 out of 8 cups, 5 out of 8, …
Small or large counts are easily summarized with a relative frequency
- We smash down all the information into a proportion and retain most of the nuance

Hypothesis

Fisher’s original practice only had a null hypothesis

\[ H_0: \text{The woman cannot distinguish between milk vs. tea first} \]

The null should arise naturally from the chosen test as an exact statistical hypothesis
We refer to it as the null because our goal is to nullify the statement with data

Refining the Null

We refine the null to that “exact” sense by counting. Since the woman knows there are 4 of each, she only has to pick the 4 milk-first cups (X = correct pick, O = incorrect pick):

\[ \begin{array}{|c|c|} \hline \text{Correct Guesses} & \text{Configuration} \\ \hline 0 & OOOO \\ \hline 1 & OOOX, OOXO, OXOO, XOOO\\ \hline 2 & OOXX, OXOX, OXXO, XOOX, XOXO, XXOO \\ \hline 3 & OXXX, XOXX, XXOX, XXXO \\ \hline 4 & XXXX \\ \hline \end{array} \]

Refining the Null

Since order doesn’t matter (think about why) we can use the choose function:

\[ \begin{array}{|c|c|} \hline \text{Correct Guesses} & \text{# of Combinations} \\ \hline 0 & {4 \choose 0} \times {4 \choose 4} = 1\\ \hline 1 & {4 \choose 1} \times {4 \choose 3} = 16\\ \hline 2 & {4 \choose 2} \times {4 \choose 2} = 36\\ \hline 3 & {4 \choose 3} \times {4 \choose 1} = 16\\ \hline 4 & {4 \choose 4} \times {4 \choose 0} = 1\\ \hline \end{array} \]

Refining the Null

We finalize the null by taking the relative frequency of each outcome (70 combinations in total):

\[ \begin{array}{|c|c|} \hline \text{Correct Guesses} & \text{Probability} \\ \hline 0 & 1/70 = 0.0143\\ \hline 1 & 16/70 = 0.2286\\ \hline 2 & 36/70 = 0.5143\\ \hline 3 & 16/70 = 0.2286\\ \hline 4 & 1/70 = 0.0143\\ \hline \end{array} \]
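
The counting can be checked numerically. The sketch below is a minimal illustration, not part of the original lecture: the function name `tea_null_distribution` is made up for this example, and it only relies on `math.comb` from the Python standard library.

```python
from math import comb

def tea_null_distribution(cups_per_type=4):
    """Null distribution of the number of correctly picked milk-first cups."""
    # Total ways to pick 4 cups out of 8 (70 when cups_per_type is 4).
    total = comb(2 * cups_per_type, cups_per_type)
    dist = {}
    for correct in range(cups_per_type + 1):
        # Pick `correct` of the true milk-first cups, and fill the
        # remaining picks from the tea-first cups.
        ways = comb(cups_per_type, correct) * comb(cups_per_type, cups_per_type - correct)
        dist[correct] = ways / total
    return dist

null_dist = tea_null_distribution()
for correct, prob in null_dist.items():
    print(correct, round(prob, 4))
```

The probabilities sum to 1 because every way of picking 4 cups falls into exactly one row of the table.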

Lady Tasting Tea

We now have everything necessary to perform a very simple and direct hypothesis test.

If she gets 4 correct, then our assumption of reality (the null) doesn’t match the data
- It’s the same case with getting none correct
Anything besides that is seemingly luck/chance
The \(p\)-value is that probability of occurrence, and it’s exact to the scenario
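
As a rough sketch of that last point, the probability of a result at least as good as the one observed can be added up directly. The function name `tea_p_value` and the one-sided "at least this many correct" convention are assumptions made for this illustration.

```python
from fractions import Fraction
from math import comb

def tea_p_value(observed_correct, cups_per_type=4):
    """P(at least `observed_correct` correct picks) under the null."""
    total = comb(2 * cups_per_type, cups_per_type)
    # Add up the number of ways to get the observed count or better.
    ways = sum(
        comb(cups_per_type, k) * comb(cups_per_type, cups_per_type - k)
        for k in range(observed_correct, cups_per_type + 1)
    )
    return Fraction(ways, total)

p_all_correct = tea_p_value(4)     # only one way out of 70
p_three_or_more = tea_p_value(3)   # 16 + 1 ways out of 70
print(p_all_correct, float(p_all_correct))
print(p_three_or_more, float(p_three_or_more))
```

Using `Fraction` keeps the result exact, which mirrors the sense in which the test is "exact".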

Fisher’s Exact test

Hypergeometric Distribution

I promised I’d teach card counting

Pull a card from a standard \(52\) card deck
- What’s the probability of an ace?
Pull another card, how many are left in the deck?
- \(P(\text{Ace}) = ?\)
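
A small sketch of the question above, assuming a standard deck with 4 aces. The helper name `p_ace` is made up for this example.

```python
from fractions import Fraction

def p_ace(aces_left, cards_left):
    """Probability that the next card is an ace."""
    return Fraction(aces_left, cards_left)

first_draw = p_ace(4, 52)      # full deck
after_non_ace = p_ace(4, 51)   # first card was not an ace
after_ace = p_ace(3, 51)       # first card was an ace

print(first_draw, after_non_ace, after_ace)
```

The answer to the second draw depends on what the first card was, which is the whole point of sampling without replacement.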

Hypergeometric Distribution

The r.v. \(X\) is distributed hypergeometric if:

Drawing from \(X\) (realizing \(x\)) has two mutually exclusive results
The probability of each outcome changes as each draw occurs
- Known as “sampling without replacement”
We can use this to form a counting rule

Counting Cards

DISCLAIMER: Don’t do this.

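\(X\) is the number of target cards among those drawn from the deck, \(X \sim \text{Hyper}(N,K,n)\): \(N\) cards in the deck, \(K\) of them targets, \(n\) cards drawn.

The probability of pulling exactly \(k\) target cards is: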

\[ P(X = k) = \frac{{K \choose k}{N-K \choose n-k}}{{N \choose n}} \]

As we pull non-target cards, the probability that the next card is a target increases (and vice versa)
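
The formula translates almost directly into code. This is a minimal sketch using only the standard library; the function name `hypergeom_pmf` and the five-card example are illustrative choices, not something from the lecture.

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """P(X = k): exactly k targets in n draws from N cards, K of which are targets."""
    # Impossible outcomes get probability zero.
    if k < 0 or k > K or n - k < 0 or n - k > N - K:
        return 0.0
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

# Example: exactly one ace in five cards drawn from a 52-card deck.
p_one_ace = hypergeom_pmf(1, N=52, K=4, n=5)
print(round(p_one_ace, 4))
```

Summing `hypergeom_pmf(k, 52, 4, 5)` over `k = 0, ..., 4` gives 1, a quick sanity check on the counting rule.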

Counting Cards

Simplifying the calculation:
- Classify three buckets of cards: “Bad”, “Neutral”, “Good”
- 2 to 6 are bad, 7 to 9 are neutral, and 10 to Ace are good
- Bad = 1, Neutral = 0, Good = -1
- There’s 20 bad cards, 12 neutral, and 20 good
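
The bucket sizes can be confirmed by building a deck and scoring every card. The names `RANKS` and `card_score` are assumptions for this sketch, and suits are ignored because they do not affect the score.

```python
from collections import Counter

RANKS = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A"]

def card_score(rank):
    """Score a card: +1 for 2 to 6, 0 for 7 to 9, -1 for 10 to Ace."""
    if rank in {"2", "3", "4", "5", "6"}:
        return 1
    if rank in {"7", "8", "9"}:
        return 0
    return -1

# Four suits of each rank make up the 52-card deck.
deck = [rank for rank in RANKS for _ in range(4)]
bucket_sizes = Counter(card_score(rank) for rank in deck)
print(bucket_sizes)
```

Because the +1 and -1 buckets are the same size, the scores over a full deck sum to zero.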

Counting Cards

By adding up the scores (1, 0, or -1) we keep a running total of weighted values
By dividing the summed scores by the total remaining cards, we get a weighted expectation
\(\mathbb{E}\) bad = \(\mathbb{E}\) good \(<\) \(\mathbb{E}\) bad + neutral
- If we keep track of the cards played then we can abuse this to “feast” whenever odds are good

Counting Cards

As you play, sum the scores and divide by the remaining cards
This ratio works like a test statistic, playing the role of a \(p\)-value
Bet higher as the ratio goes past your risk acceptance, e.g.:

\[ \frac{\text{Score}}{\text{Cards Remaining}} = \frac{7}{25} = 0.28 \]

General practice is to just use a score-based rule (i.e., “bet high at +5”)
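
A minimal sketch of the running calculation, assuming a single 52-card deck. The helper names and the list of seen cards are made up for illustration.

```python
def card_score(rank):
    """+1 for 2 to 6, 0 for 7 to 9, -1 for 10 to Ace."""
    if rank in {"2", "3", "4", "5", "6"}:
        return 1
    if rank in {"7", "8", "9"}:
        return 0
    return -1

def count_ratio(seen_ranks, deck_size=52):
    """Return (score, cards remaining, score / cards remaining)."""
    score = sum(card_score(rank) for rank in seen_ranks)
    remaining = deck_size - len(seen_ranks)
    if remaining <= 0:
        raise ValueError("no cards left in the deck")
    return score, remaining, score / remaining

# Hypothetical cards seen so far in a hand.
seen = ["2", "5", "K", "3", "6", "9"]
score, remaining, ratio = count_ratio(seen)
print(score, remaining, round(ratio, 4))
```

A score-based rule only needs the first returned value, which is why it is easier to use at the table.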

Why are we counting cards?

Fisher’s Exact Test uses the hypergeometric distribution
- We’re sampling without replacement and counting results
- Know how the function you’re using works so you know its strengths/weaknesses
To set up the exact test we start by building a contingency table

\[ \begin{array}{|c|c|c|c|} \hline & \text{A} & \text{B} & \text{Row Total} \\ \hline \text{P} & a & b & a+b \\ \hline \text{P}^c & c & d & c+d \\ \hline \text{Column Total} & a+c & b+d & a+b+c+d \\ \hline \end{array} \]

Contingency Table

A company is testing a new anti-helminth treatment for fish farms. 30 fish are selected for the experiment, with 12 being administered the treatment and 18 left as controls. Of the treatment group, 6 became infected. Of the control 11 became infected.

\[ \begin{array}{|c|c|c|c|} \hline & \text{Infected} & \text{Not Infected} & \text{Row Total} \\ \hline \text{Treatment} & 6 & 6 & 12 \\ \hline \text{Control} & 11 & 7 & 18 \\ \hline \text{Column Total} & 17 & 13 & 30 \\ \hline \end{array} \]
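
The margins follow mechanically from the four cell counts. The function name `with_margins` and the dictionary layout are assumptions for this sketch.

```python
def with_margins(a, b, c, d):
    """Row totals, column totals, and grand total of a 2x2 table [[a, b], [c, d]]."""
    return {
        "row_totals": (a + b, c + d),
        "col_totals": (a + c, b + d),
        "n": a + b + c + d,
    }

# Fish example: treatment row (6 infected, 6 not), control row (11 infected, 7 not).
fish = with_margins(6, 6, 11, 7)
print(fish)
```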

Setting up the test

Remember, we only have a null.
- The natural null in this case is that the treatment doesn’t work

\[ H_0: \text{No effect of treatment} \]

There are notational ways to express this, but we can just use natural language
Our \(p\)-value will be the probability of data at least as extreme as ours, assuming the null is true

p-value for Fisher’s Exact Test

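\[ \frac{{a+b \choose a}{c+d \choose c}}{{n \choose a+c}} = \frac{{6+6 \choose 6}{11+7 \choose 11}}{{30 \choose 6+11}} = \frac{{12 \choose 6}{18 \choose 11}}{{30 \choose 17}} = 0.24554 \]

This is the probability of the observed table alone. The \(p\)-value adds the probabilities of every table with the same margins that is at least as extreme (no more probable than the observed one); here that two-sided sum is about \(0.71\).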

We would say “the results are statistically insignificant”
- There is no “reject” or “fail to reject” decision here.
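
One common way to turn the table probability into a two-sided \(p\)-value is to hold the margins fixed, enumerate every possible table, and add up the probabilities of those no more likely than the observed one. The sketch below assumes that convention; the function names are made up for this example.

```python
from math import comb

def table_probability(a, b, c, d):
    """Hypergeometric probability of one 2x2 table with its margins fixed."""
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)

def fisher_two_sided(a, b, c, d):
    """Sum the probabilities of all tables no more probable than the observed one."""
    row1, row2, col1 = a + b, c + d, a + c
    observed = table_probability(a, b, c, d)
    p_value = 0.0
    # x is the top-left cell; the margins determine the other three cells.
    for x in range(max(0, col1 - row2), min(row1, col1) + 1):
        prob = table_probability(x, row1 - x, col1 - x, row2 - col1 + x)
        # Small relative tolerance so ties with the observed table are included.
        if prob <= observed * (1 + 1e-7):
            p_value += prob
    return p_value

point = table_probability(6, 6, 11, 7)
p_value = fisher_two_sided(6, 6, 11, 7)
print(round(point, 5), round(p_value, 4))
```

For the fish data, only the table with 7 infected fish in the treatment group is more probable than the observed one, so the sum covers every other table.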

Odds Ratios

“Be wise, generalize!”

We can generalize the exact test better with an odds ratio (OR)

\[ OR = \frac{a / b}{c / d} = \frac{6 / 6}{11 / 7} = \frac{1}{1.5714} = 0.6364 \]

Odds ratios describe the strength of association between two events
In this case “the odds of infection for fish exposed to the treatment are 0.64 times the odds for non-treated fish”
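
The odds ratio is a one-line calculation. The function name `odds_ratio` is an assumption for this sketch, and it makes no attempt to handle zero cells.

```python
def odds_ratio(a, b, c, d):
    """Sample odds ratio (a/b) / (c/d) for the 2x2 table [[a, b], [c, d]]."""
    return (a / b) / (c / d)

fish_or = odds_ratio(6, 6, 11, 7)
print(round(fish_or, 4))
```

A value below 1 means the odds of infection are lower in the treatment row than in the control row.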

OR Confidence Intervals

Anything can have a confidence interval

\[ CI(OR) = \text{exp} \{ \ln{OR} \pm z_{\alpha/2} \sqrt{1/a + 1/b + 1/c + 1/d} \} \]

\(\text{exp} \{ x \}\) is another way of writing \(e^x\)
The rule of thumb is that an OR interval containing \(1\) (equivalently, a \(\ln{OR}\) interval containing \(0\)) is insignificant
The OR interval from our example was \((0.115,3.541)\)
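
A large-sample (Wald) version of the interval can be sketched as follows, assuming the usual normal multiplier \(z\) (about 1.96 for 95%). The function name is made up for this example. Because this is an approximation on the log scale, its output need not match an interval produced by an exact method.

```python
from math import exp, log, sqrt
from statistics import NormalDist

def odds_ratio_wald_ci(a, b, c, d, level=0.95):
    """Approximate confidence interval for the odds ratio of a 2x2 table."""
    # Normal quantile for the requested level (about 1.96 at 95%).
    z = NormalDist().inv_cdf(0.5 + level / 2)
    log_or = log((a / b) / (c / d))
    # Standard error of the log odds ratio.
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return exp(log_or - z * se), exp(log_or + z * se)

lower, upper = odds_ratio_wald_ci(6, 6, 11, 7)
print(round(lower, 3), round(upper, 3))
```

For the fish data the approximate interval contains 1, so the conclusion of no significant association is unchanged.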

Go Away