Lec 18

Review


All random variables have a probability distribution


As all statistics are random variables:


All statistics arise from a probability distribution



The probability distribution of a sample statistic is called the sampling distribution


Let \(\bar{x}\) be the mean of a random sample of size \(n\), drawn from a population with mean \(\mu\) and standard deviation \(\sigma\)


Since \(\bar{x}\) is a random variable, it has a mean and a standard deviation


The mean of \(\bar{x}\) is \(\mu\)


\[\mu_{\bar{x}} = \mu = \text{population mean}\]


The standard deviation of \(\bar{x}\) is \(\sigma / \sqrt{n}\)


\[\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\]
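A minimal simulation sketch of this in R (the population values \(\mu = 50\), \(\sigma = 10\), and sample size \(n = 25\) here are arbitrary assumptions for illustration):

set.seed(1)
# 5000 sample means, each from a random sample of size n = 25
xbars <- replicate(5000, mean(rnorm(25, mean = 50, sd = 10)))
mean(xbars) # should be close to mu = 50
sd(xbars)   # should be close to sigma / sqrt(n) = 10 / 5 = 2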



Questions?




Distribution Theory


Discrete Probability Distributions


Flip a coin twice


How many outcomes are possible?


\[ \begin{array}{|c|c|c|c|c|} \hline \text{Flip 1} & \text{H} & \text{H} & \text{T} & \text{T}\\ \hline \text{Flip 2} & \text{H} & \text{T} & \text{H} & \text{T}\\ \hline \end{array} \]


We can reframe this table so that it’s possible to do math with it:


\[\text{Let } \ \text{H}=1 \text{ and } \text{T} =0 \]

\[ \begin{array}{|c|c|c|c|c|} \hline \text{Flip 1} & 1 & 1 & 0 & 0\\ \hline \text{Flip 2} & 1 & 0 & 1 & 0\\ \hline \end{array} \]


Does Flip 1 have any effect on Flip 2?



Any individual, independent, binary event is referred to as a Bernoulli Trial


Take our coin flip table and make a slight adjustment:


\[\text{Let } \ P\equiv\{\text{Patient is sick} \}=1 \text{ and } N\equiv\{\text{Patient is healthy}\} =0 \]

\[ \begin{array}{|c|c|c|c|c|} \hline \text{Patient 1} & 1 & 1 & 0 & 0\\ \hline \text{Patient 2} & 1 & 0 & 1 & 0\\ \hline \end{array} \]


Bernoulli trials are modeled by the Bernoulli distribution, a discrete distribution for individual, independent binary outcomes


\[P(X=1)=p\] \[P(X=0)=1-p\]


\[X\sim \text{Bern}(p)\]

\[P(X=x)=\begin{cases} p & \text{for } x=1 \\ 1-p & \text{for } x=0\end{cases}\]
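In R, a Bernoulli trial can be simulated as a binomial draw with size 1; a minimal sketch using the fair-coin case \(p = 0.5\):

set.seed(1)
# Ten independent Bernoulli trials (coin flips coded as 1 = H, 0 = T)
rbinom(10, size = 1, prob = 0.5)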


Given a fair coin, the probability of getting a heads, \((1)\), is \(p=0.5\)

The probability of getting a tails, \((0)\), is \(1-p=0.5\)


This concept applies no matter how extreme the probability is:


\[\text{Probability of getting rabies from a bat} \approx 10^{-6}\]

\[P(X=x)=\begin{cases} 10^{-6} & \text{for } x=1 \\ 0.999999 & \text{for } x=0\end{cases}\]


When we stack \(n\) Bernoulli trials together and sum them, we model the total with the Binomial Distribution

\[P(X=k)={n\choose k}p^k(1-p)^{n-k}\]


\[n\choose k\]

“n choose k” is referred to as the binomial coefficient and is formally defined as:


\[{n \choose k}={n! \over k!(n-k)!}\]

\[\text{Where } n! = 1\times2\times3\times...\times n\]


The number of ways to choose k things from a group of size n


Flip a coin ten times: how many possible configurations result in exactly 4 heads?


\[{10 \choose 4}={10! \over 4!(10-4)!}=210\]
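R computes the binomial coefficient directly:

choose(10, 4) # number of ways to arrange 4 heads among 10 flips
[1] 210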


What’s the probability of flipping exactly 4 heads from 10 tosses?


\[P(X=4)={10\choose 4}0.5^4(1-0.5)^{10-4}\]

\[=210\times0.0625\times0.015625=0.2050781\]
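The same computation with R's built-in binomial pmf:

dbinom(4, size = 10, prob = 0.5) # P(X = 4) for 10 fair flips
[1] 0.2050781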


Why would we care?


Assumptions are the core of science and statistics

If we assume that the outcomes of our study are binary (i.e., \(0 \text{ or }1\))

It’s convenient to work with the assumption directly:


\[\text{Let } X \equiv \{\text{A captured raccoon is already tagged}\}\]

\[ P(X=x)= \begin{cases} 0.2 & \text{if } \ x=1 \\ 0.8 & \text{if } \ x=0 \\ \end{cases} \]


We have the tools to find the probability that two independently captured raccoons are both tagged:


\[0.2\times0.2=0.04\]


We also have the tools to find the probability that \(n\) independently captured raccoons are all untagged:


\[(1-0.2)^n=0.8^n\]


This tool now allows us to work with actual data

Given that \(1000\) raccoons are captured, what is the probability that exactly \(200\) of them have already been tagged?


\[P(X=200)={1000\choose 200}0.2^{200}(1-0.2)^{1000-200}=0.03152536\]
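Checking this in R:

dbinom(200, size = 1000, prob = 0.2) # P(X = 200) with n = 1000, p = 0.2
[1] 0.03152536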


We can easily see that this is \(\neq 0.2^{200}\)

We aren't asking for the probability of one particular capture sequence

We're asking: within our sample of \(1000\), how likely are we to see exactly one fifth of the raccoons previously tagged?



Just like with the Normal Distribution, we have an Expectation and Variance associated with the binomial:


\[X \sim \text{Binom}(n,p)\]

\[\mathbb EX=np\]

\[\mathbb VX=np(1-p)\]


How many raccoons should we expect to be previously tagged in our sample of 1000?

\[1000\times0.2=200\]


By how much should subsequent samples vary?


\[1000\times0.2\times(1-0.2)=160\]

\[\sqrt{160}=12.65\]


set.seed(78)
# Draw 10 samples, each the number of tagged raccoons out of 1000 captures
samples <- replicate(10, sum(rbinom(1000, 1, 0.2)))
mean(samples)      # close to np = 200
[1] 200.4
var(samples)       # close to np(1-p) = 160
[1] 126.9333
sqrt(var(samples)) # sample standard deviation
[1] 11.26647


What happens with our LLN and CLT with the binomial?



If we standardize all of our binomial samples:

\[Z_i={X_i-\mu \over \sigma}={X_i-200 \over 12.65}\]
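A minimal sketch of the idea in R (the 10,000 replicates are an arbitrary choice): standardized binomial samples pile up like a standard normal.

set.seed(78)
x <- rbinom(10000, size = 1000, prob = 0.2)
z <- (x - 200) / 12.65 # standardize each sample
mean(z) # should be close to 0
sd(z)   # should be close to 1
hist(z, breaks = 50, freq = FALSE) # roughly the standard normal bell curve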




Imagine you’re counting the number of deer found within any given acre of a wildlife preserve. You could assume a binary outcome (0 if the deer isn’t seen, 1 if it is), but this doesn’t work well if you don’t know the total number that can be in that acre.


As we stack Bernoulli trials infinitely and push the probability in our Binomial distribution closer and closer to zero (holding \(np=\lambda\) fixed), we converge on the Poisson Distribution


\[P(X=x)={\lambda^xe^{-\lambda}\over x!}\]


The Poisson distribution models the probability associated with non-negative count data:

  • The number of deer in an acre of land

  • The number of fish in a pond

  • The number of individuals that are on a cruise ship


A unique feature of the Poisson distribution is that its expectation and variance are the same parameter: \(\lambda\)
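A quick simulation check of this feature (\(\lambda = 4\) is an arbitrary choice):

set.seed(1)
counts <- rpois(10000, lambda = 4)
mean(counts) # should be close to lambda = 4
var(counts)  # should also be close to lambda = 4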







Continuous Probability Distributions

What if we want to model the probability \(p\) of our binary outcome itself?


What is the prevalence of blight in any randomly selected potato patch in the country?


Let \(X \equiv \{\text{Blight prevalence for a selected patch}\}\)


\[S_X=[0,1]\]


Our data is the prevalence, which is a proportion:

  • Strictly positive

  • Bounded between \(0\) and \(1\)


We can’t model this with the normal distribution

We could shrink the variance of our normal distribution so that essentially all values fall between \(-1\) and \(1\), then take absolute values

That’s going to be messy to get right


We can make transformations to our probability data and work with it as if it were normal

  • Microbiologists and plant pathologists are guilty of this

  • It works, but not consistently, and it causes problems down the road


For proportions and probabilities, we use the Beta Distribution:


\[X \sim \text{Beta}(\alpha,\beta)\]


The pdf of the Beta distribution is a fairly involved equation, so distribution notation is more useful here


\(\alpha\) and \(\beta\) both control the shape of the distribution:
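A minimal sketch of this in R (these particular \((\alpha,\beta)\) pairs are arbitrary choices for illustration):

# A few Beta densities showing how alpha and beta control the shape
curve(dbeta(x, shape1 = 0.5, shape2 = 0.5), from = 0, to = 1, ylab = "density")
curve(dbeta(x, shape1 = 2, shape2 = 5), add = TRUE, col = "blue")
curve(dbeta(x, shape1 = 5, shape2 = 2), add = TRUE, col = "red")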



The expectation of the Beta distribution:

\[\mathbb EX={\alpha\over \alpha+\beta}\]


The variance:

\[\mathbb VX={\alpha\beta\over (\alpha + \beta)^2(\alpha + \beta +1)}\]


If we wanted to perform our Bernoulli trials with a random probability, where \(p\sim \text{Beta}(1,3)\):
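A minimal sketch in R:

set.seed(1)
# Draw a random success probability from Beta(1, 3), then flip with it
p <- rbeta(1, shape1 = 1, shape2 = 3)
p                              # random probability; E[p] = 1 / (1 + 3) = 0.25
rbinom(10, size = 1, prob = p) # ten Bernoulli trials at that probability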



What if we have continuous data that's strictly positive, taking any value in \((0,\infty)\)?


The normal distribution produces negative values, but that’s not too much of a problem if we set the mean high enough and the variance low enough:


It doesn’t work as well if our variance gets high enough that negatives are probable (remember the idea of violating assumptions):


In this case we would use the Gamma Distribution:

\[X\sim \text{Gamma}(\alpha,\theta)\]


\(\alpha\) controls the shape and \(\theta\) controls the scale


Generally speaking, this distribution models continuous waiting times (see the sketch after the examples below):

  • Waiting time at a bus stop

  • Time for an insect population in a field to die out
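A minimal simulation sketch (the shape \(\alpha = 2\) and scale \(\theta = 3\) here are arbitrary assumptions):

set.seed(1)
waits <- rgamma(10000, shape = 2, scale = 3)
mean(waits) # should be close to alpha * theta = 6
min(waits)  # strictly positive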



\[ \begin{array}{|c|c|} \hline \text{Distribution} & \text{Use-Case}\\ \hline \text{Binomial} & \text{Binary (0,1) outcomes}\\ \hline \text{Poisson} & \text{Non-negative, discrete counts}\\ \hline \text{Beta} & \text{Proportions and probabilities}\\ \hline \text{Gamma} & \text{Continuous, strictly positive data}\\ \hline \end{array} \]



Try for yourself. Assign distributions to each of the following data sources:


  1. The presence/absence of a specific gene variant across 50 plant samples

  2. The fraction of roots colonized by mycorrhizal fungi

  3. Milk production volumes across a herd of dairy cows

  4. The success rate of different pollinator species visiting specific plant varieties

  5. The amount of rainfall needed before a dormant seed will germinate

  6. The number of new bacterial colonies appearing on a petri dish in a 24-hour period

  7. Whether individual plants survive or die after exposure to a specific pathogen

  8. The lifespan of insects in a laboratory colony


Attendance QOTD


Go away