All random variables have a probability distribution
As all statistics are random variables:
All statistics arise from a random probability distribution
The probability distribution of a sample statistic is the sampling distribution
Let \(\bar{x}\) be the mean of a random sample of size \(n\), drawn from a population with mean \(\mu\) and standard deviation \(\sigma\)
Since \(\bar{x}\) is a random variable, it has a mean and a standard deviation
The mean of \(\bar{x}\) is \(\mu\)
\[\mu_{\bar{x}} = \mu = \text{population mean}\]
The standard deviation of \(\bar{x}\) is \(\sigma / \sqrt{n}\)
\[\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\]
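We can check \(\sigma_{\bar{x}} = \sigma/\sqrt{n}\) by simulation. A minimal sketch, assuming an illustrative normal population with \(\mu=50\), \(\sigma=10\), and samples of size \(n=25\) (these values are chosen only for the demo):

```r
# Simulate the sampling distribution of the mean:
# draw many samples of size n, take the mean of each
set.seed(1)
mu <- 50; sigma <- 10; n <- 25                 # illustrative values
xbars <- replicate(5000, mean(rnorm(n, mu, sigma)))
mean(xbars)   # close to mu = 50
sd(xbars)     # close to sigma / sqrt(n) = 2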
Questions?
Flip a coin twice
How many outcomes are possible?
\[ \begin{array}{|c|c|c|c|c|} \hline \text{Flip 1} & \text{H} & \text{H} & \text{T} & \text{T}\\ \hline \text{Flip 2} & \text{H} & \text{T} & \text{H} & \text{T}\\ \hline \end{array} \]
We can reframe this table so that it’s possible to do math with it:
\[\text{Let } \ \text{H}=1 \text{ and } \text{T} =0 \]
\[ \begin{array}{|c|c|c|c|c|} \hline \text{Flip 1} & 1 & 1 & 0 & 0\\ \hline \text{Flip 2} & 1 & 0 & 1 & 0\\ \hline \end{array} \]
Does Flip 1 have any effect on Flip 2?
Any individual, independent, binary event is referred to as a Bernoulli Trial
Take our coin flip table and make a slight adjustment:
\[\text{Let } \ P\equiv\{\text{Patient is sick} \}=1 \text{ and } N\equiv\{\text{Patient is healthy}\} =0 \]
\[ \begin{array}{|c|c|c|c|c|} \hline \text{Patient 1} & 1 & 1 & 0 & 0\\ \hline \text{Patient 2} & 1 & 0 & 1 & 0\\ \hline \end{array} \]
Bernoulli trials are modeled by the Bernoulli distribution, a discrete distribution for individual, independent binary outcomes
\[P(X=1)=p\] \[P(X=0)=1-p\]
\[X\sim \text{Bern}(p)\]
\[P(X=x)=\begin{cases} p & \text{for } x=1 \\ 1-p & \text{for } x=0\end{cases}\]
Given a fair coin, the probability of getting a heads, \((1)\), is \(p=0.5\)
The probability of getting a tails, \((0)\), is \(1-p=0.5\)
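A Bernoulli trial is just a Binomial with \(n=1\), so we can simulate fair coin flips directly in R (the seed and number of flips are arbitrary):

```r
# Simulate 10 fair coin flips: 1 = heads, 0 = tails
set.seed(42)
rbinom(10, size = 1, prob = 0.5)
dbinom(1, size = 1, prob = 0.5)   # P(X = 1) = p = 0.5
```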
This concept continues to apply regardless of the shift in probability:
\[\text{Probability of getting rabies from a bat} \approx 10^{-6}\]
\[P(X=x)=\begin{cases} 10^{-6} & \text{for } x=1 \\ 0.999999 & \text{for } x=0\end{cases}\]
When we stack Bernoulli trials together \(n\) many times and sum them, we model this with the Binomial Distribution
\[P(X=k)={n\choose k}p^k(1-p)^{n-k}\]
\[n\choose k\]
“n choose k” is referred to as the binomial coefficient and is formally defined as:
\[{n \choose k}={n! \over k!(n-k)!}\]
\[\text{Where } n! = 1\times2\times3\times...\times n\]
The number of ways to choose k things from a group of size n
Flip a coin ten times, how many possible configurations can result in 4 heads being flipped?
\[{10 \choose 4}={10! \over 4!(10-4)!}=210\]
What’s the probability of flipping exactly 4 heads from 10 tosses?
\[P(X=4)={10\choose 4}0.5^4(1-0.5)^{10-4}\]
\[=210\times0.0625\times0.015625=0.2050781\]
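R has both pieces built in: `choose()` for the binomial coefficient and `dbinom()` for the full pmf, so we can confirm the numbers above:

```r
# Binomial coefficient and pmf for 4 heads in 10 fair flips
choose(10, 4)                       # 210 ways to arrange 4 heads
dbinom(4, size = 10, prob = 0.5)    # P(X = 4) = 0.2050781
```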
Why would we care?
Assumptions are the core of science and statistics
If we assume that the outcomes of our study are binary (i.e., \(0 \text{ or }1\))
It’s convenient to work with the assumption directly:
\[\text{Let } X \equiv \{\text{A captured raccoon is already tagged}\}\]
\[ P(X=x)= \begin{cases} 0.2 & \text{if } \ x=1 \\ 0.8 & \text{if } \ x=0 \\ \end{cases} \]
We have the tools to tell us what the probability of two individually captured raccoons being tagged is:
\[0.2\times0.2=0.04\]
We also have the tools to find the probability that none of \(n\) individually captured raccoons are tagged:
\[(1-0.2)^n=0.8^n\]
This tool now allows us to work with actual data
Given that there are \(1000\) raccoons captured, what is the probability that we capture 200 that have already been tagged?
\[P(X=200)={1000\choose 200}0.2^{200}(1-0.2)^{1000-200}=0.03152536\]
We can easily see that this is \(\neq 0.2^{200}\)
We aren’t asking subsequent capture probabilities
We’re asking: within our sample of \(1000\), how likely are we to see exactly one-fifth of the raccoons previously tagged?
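The raccoon calculation above is a single call to `dbinom()`:

```r
# P(exactly 200 tagged raccoons in a sample of 1000, with p = 0.2)
dbinom(200, size = 1000, prob = 0.2)   # 0.03152536, matching the formula
```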
Just like with the Normal Distribution, we have an Expectation and Variance associated with the binomial:
\[X \sim \text{Binom}(n,p)\]
\[\mathbb EX=np\]
\[\mathbb VX=np(1-p)\]
How many raccoons should we expect to be previously tagged in our sample of 1000?
\[1000\times0.2=200\]
By how much should subsequent samples vary?
\[1000\times0.2\times(1-0.2)=160\]
\[\sqrt{160}=12.65\]
set.seed(78)
samples=replicate(10,sum(rbinom(1000,1,0.2)))
mean(samples)
[1] 200.4
var(samples)
[1] 126.9333
sqrt(var(samples))
[1] 11.26647
What happens with our LLN and CLT with the binomial?
If we standardize all of our binomial samples:
\[Z={X_i-\mu \over \sigma}={X_i-200 \over 12.65}\]
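As a quick sketch, we can standardize the simulated binomial samples from earlier with the theoretical mean and standard deviation (the simulation is re-run here so the chunk stands alone):

```r
# Standardize binomial samples: Z = (X_i - 200) / 12.65
set.seed(78)
samples <- replicate(10, sum(rbinom(1000, 1, 0.2)))
z <- (samples - 200) / 12.65
mean(z)   # near 0
sd(z)     # near 1 (only roughly, with just 10 replicates)
```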
Imagine you’re counting the number of deer found within any given acre of a wildlife preserve. You could assume a binary outcome (0 if the deer isn’t seen, 1 if it is), but this doesn’t work well if you don’t know the total number that can be in that acre.
As we stack more and more Bernoulli trials (\(n\to\infty\)) while pushing the probability toward zero (\(p\to0\)), with \(np=\lambda\) held fixed, the Binomial distribution converges to the Poisson Distribution
\[P(X=x)={\lambda^xe^{-\lambda}\over x!}\]
The Poisson distribution models the probability associated with strictly positive count data:
The number of deer in an acre of land
The number of fish in a pond
The number of individuals that are on a cruise ship
A unique feature of the Poisson distribution is that its expectation and variance are the same parameter: \(\lambda\)
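We can see the mean-equals-variance property by simulation. A minimal sketch with an illustrative rate of \(\lambda = 4\) deer per acre (the value is an assumption for the demo):

```r
# Poisson counts with lambda = 4: expectation and variance are both lambda
set.seed(7)
deer <- rpois(10000, lambda = 4)
mean(deer)              # close to 4
var(deer)               # also close to 4
dpois(0, lambda = 4)    # probability of seeing no deer in an acre
```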
What if we want to model the probability of our binary outcome?
What is the prevalence of blight in any randomly selected potato patch in the country?
Let \(X \equiv \{\text{Blight prevalence for a selected patch}\}\)
\[S_X=[0,1]\]
Our data is the prevalence, which is a proportion:
Strictly positive
Bounded between \(0\) and \(1\)
We can’t model this with the normal distribution
We could set the variance of our normal distribution so small that it only produces values between \(-1\) and \(1\), then take the absolute values
That’s going to be messy to get right
We can transform our probability data and work with it as if it were normal
Microbiologists and plant pathologists are guilty of this
It works, but not consistently, and it causes problems down the road
For proportions and probabilities, we use the Beta Distribution:
\[X \sim \text{Beta}(\alpha,\beta)\]
The pdf of the Beta distribution is a fairly involved equation, so distribution notation is more useful here
\(\alpha\) and \(\beta\) both control the shape of the distribution:
The expectation of the Beta distribution:
\[\mathbb EX={\alpha\over \alpha+\beta}\]
The variance:
\[\mathbb VX={\alpha\beta\over (\alpha + \beta)^2(\alpha + \beta +1)}\]
If we wanted to perform our Bernoulli trial with a random probability, where \(p\sim \text{Beta}(1,3)\):
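A minimal sketch in R: draw a success probability from \(\text{Beta}(1,3)\), then run the Bernoulli trials with it (the seed and number of trials are arbitrary):

```r
# Random success probability from Beta(1, 3), then Bernoulli trials with it
set.seed(3)
p <- rbeta(1, shape1 = 1, shape2 = 3)
p                                 # E[p] = 1 / (1 + 3) = 0.25
rbinom(10, size = 1, prob = p)    # 10 Bernoulli trials using the random p
```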
What if we have continuous data that’s strictly positive, i.e., bounded on \((0,\infty)\)?
The normal distribution produces negative values, but that’s not too much of a problem if we set the mean high enough and the variance low enough:
It doesn’t work as well if our variance gets high enough that negatives are probable (remember the idea of violating assumptions):
In this case we would use the Gamma Distribution:
\[X\sim \text{Gamma}(\alpha,\theta)\]
\(\alpha\) controls the shape and \(\theta\) controls the scale
Generally speaking, this distribution models the probability of certain continuous time steps:
Waiting time at a bus stop
Time for an insect population in a field to die out
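A quick simulation sketch, with illustrative parameters \(\alpha = 2\) and \(\theta = 3\) (chosen only for the demo; the expectation of a Gamma is \(\alpha\theta\)):

```r
# Gamma(shape = 2, scale = 3): continuous, strictly positive, right-skewed
set.seed(9)
waits <- rgamma(10000, shape = 2, scale = 3)
mean(waits)       # close to shape * scale = 6
min(waits) > 0    # TRUE: every draw is strictly positive
```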
\[ \begin{array}{|c|c|} \hline \text{Distribution} & \text{Use-Case}\\ \hline \text{Binomial} & \text{Counts of binary (0,1) outcomes}\\ \hline \text{Poisson} & \text{Positive, discrete counts}\\ \hline \text{Beta} & \text{Proportions and probabilities}\\ \hline \text{Gamma} & \text{Continuous, strictly positive data}\\ \hline \end{array} \]
Try for yourself. Assign distributions to each of the following data sources:
The presence/absence of a specific gene variant across 50 plant samples
The fraction of roots colonized by Mycorrhizal fungi
Milk production volumes across a herd of dairy cows
The success rate of different pollinator species visiting specific plant varieties
The amount of rainfall needed before a dormant seed will germinate
The number of new bacterial colonies appearing on a petri dish in a 24-hour period
Whether individual plants survive or die after exposure to a specific pathogen
The lifespan of insects in a laboratory colony