Day 15

Review


For a continuous r.v., probability is now “area under the curve”


  • Only intervals will have non-zero probability


  • Any single value will have a probability of zero:


\[P(X=a)=0, \ \text{for any single number} \ a\]

\[P(X=b)=0, \ \text{for any single number} \ b\]


  • There’s also no difference between \(\leq\) and \(<\)

\[P(X \leq 1)=P(X<1)\]


  • This area, if known, can be used to make calculations on the probability of any given interval of outcomes in a continuous PDF



If every possible value of \(X\) is equally likely then it takes on a uniform distribution

  • The curve for this distribution is a horizontal bar:


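When calculating a uniform probability, consider that every value has the same density (the height of the bar, here \({1\over 30}\)), so the probability of an interval is its width times that height: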

\[\text{Given} \ P(0 < x < 5)= 5 \times{1\over 30}={5\over 30}={1\over 6}\]


\[\text{then} \ P(0 < x < 30) = 30 \times {1\over 30}={30\over 30}=1\]


\[\text{and} \ P(10 \leq x < 20) = 10 \times {1\over 30} = {10\over 30} = {1\over 3}\]
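The three calculations above can be checked in R. This is a minimal sketch that assumes \(X\) is uniform on the interval from 0 to 30 (implied by the height of \({1\over 30}\), but not stated outright); the variable names are made up for illustration.

```r
# Assumption: X ~ Uniform(0, 30), inferred from the density height of 1/30
lo <- 0
hi <- 30

# punif() gives the area to the left of a value, so an interval
# probability is the difference of two left-areas
p_0_5   <- punif(5,  min = lo, max = hi) - punif(0,  min = lo, max = hi)
p_0_30  <- punif(30, min = lo, max = hi) - punif(0,  min = lo, max = hi)
p_10_20 <- punif(20, min = lo, max = hi) - punif(10, min = lo, max = hi)

# A single value is an interval of width zero, so its area is zero
p_single <- punif(5, min = lo, max = hi) - punif(5, min = lo, max = hi)

c(p_0_5, p_0_30, p_10_20, p_single)
```

Because a single point contributes no area, it makes no difference whether the endpoints use \(\leq\) or \(<\).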



The curve used to describe the probability distribution of a continuous r.v. is called a probability density curve

  • This curve is dictated by a function, \(f(x)\), called the probability density function


The area under the curve between two values \(a\) and \(b\) has two general interpretations:

  1. The proportion of the population within the interval of \(a\) and \(b\) (values between \(a\) and \(b\))

  2. The probability that a randomly selected individual will have a value between \(a\) and \(b\) (\(P(a<x<b)\))

The area under the entire curve must equal \(1\)

  • Approximated areas will rarely equal exactly \(1\); they just shouldn’t be far from \(1\)
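As a rough illustration of the total area being 1, the sketch below uses the standard normal density `dnorm()` as a stand-in for \(f(x)\) (an assumption; any valid density would do). It compares R's numerical integration with a cruder rectangle approximation whose width is chosen arbitrarily.

```r
# Assumption: dnorm() (standard normal density) stands in for f(x)

# Numerical integration over the whole real line
total <- integrate(dnorm, lower = -Inf, upper = Inf)
total$value

# A cruder approximation: rectangles of width 0.5 between -6 and 6,
# each with height equal to the density at the rectangle's midpoint
width <- 0.5
mids  <- seq(-6 + width / 2, 6 - width / 2, by = width)
approx_area <- sum(dnorm(mids) * width)
approx_area
```

Both results should land very close to 1 without necessarily being exactly 1.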




Questions?




Goals for Today:

  1. Solidify our understanding of discrete r.v.

  2. Reframe continuous probability distributions

  3. Introduce the normal distribution



The Normal Distribution


Empirical Proof of Discrete Expectations

The expectation and variance of a r.v. are simple formulas, but they can be conceptually difficult to grasp


In my experience “Empirical Proof”, or the results of a large scale simulation, can sometimes make these concepts easier to interpret


The expectation of a random variable, given as:


\[\mu_x=\sum_x x \, P(X=x)\]


is a theoretical simplification of an empirical study where we are somehow able to sample from our population an infinite number of times


Given a r.v. with the following probability distribution:


\[ \begin{array}{|c|c|c|c|c|c|c|} \hline x & 0 & 1 & 2 & 3 & 4 & 5 \\ \hline P(X = x) & 0.4 & 0.2 & 0.15 & 0.1 & 0.1 & 0.05 \\ \hline \end{array} \]


We would compute \(EX\) as:


\[0(0.4)+1(0.2)+2(0.15)+3(0.1)+4(0.1)+5(0.05)=1.45\]


And \(VarX\) as:


\[=(0-1.45)^2(0.4)+(1-1.45)^2(0.2)+(2-1.45)^2(0.15)+ \newline (3-1.45)^2(0.1)+(4-1.45)^2(0.1)+(5-1.45)^2(0.05)\]

\[=2.4475\]
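The same two calculations can be written directly from the formulas in R. This is a small sketch; the names `mu` and `sigma2` are chosen here for illustration.

```r
# Values of x and their theoretical probabilities from the table
x <- 0:5
p <- c(0.4, 0.2, 0.15, 0.1, 0.1, 0.05)

# A valid probability distribution must sum to 1
stopifnot(isTRUE(all.equal(sum(p), 1)))

# Expectation: sum of each value times its probability
mu <- sum(x * p)

# Variance: sum of squared deviations from the mean, weighted by probability
sigma2 <- sum((x - mu)^2 * p)

c(mean = mu, variance = sigma2)
```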

These probabilities can be thought of as arising from the empirical distribution of an approximately infinite number of samples:

set.seed(1) # for reproducibility

# our values of x
x <- c(0, 1, 2, 3, 4, 5)

# the theoretical probabilities of x
p <- c(0.4, 0.2, 0.15, 0.1, 0.1, 0.05)

# 1 million samples
n <- 10^6

# sample our x values 1 million times
# with replacement, and their given probabilities
samp <- sample(x, size = n, replace = TRUE, prob = p)

# get the mean of our samples
mean(samp)
[1] 1.449516
# get the variance of our samples
var(samp)
[1] 2.44431

As with ALL numerical/empirical approximations, these don’t perfectly match our theoretical calculations.

  • Hypothetically, how could we make them exact to the theoretical values?
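One way to explore this question is to watch the error as the sample size grows. The sketch below reuses the same distribution; the sample sizes are hypothetical choices made only for illustration, and the exact errors depend on the seed.

```r
set.seed(1)  # for reproducibility

x <- 0:5
p <- c(0.4, 0.2, 0.15, 0.1, 0.1, 0.05)
mu <- sum(x * p)  # theoretical mean

# Hypothetical sample sizes, chosen only for illustration
sizes <- c(10^2, 10^3, 10^4, 10^5)

# Absolute error of the sample mean at each sample size
abs_err <- sapply(sizes, function(n) {
  samp <- sample(x, size = n, replace = TRUE, prob = p)
  abs(mean(samp) - mu)
})

data.frame(n = sizes, abs_err = abs_err)
```

The errors will not necessarily shrink at every step, but they tend toward zero as the sample size increases.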




The Normal Distribution

The normal distribution is (un)arguably the most important probability distribution used in statistics


We’ve encountered it already in this course


The shape of a normal distribution curve is symmetric, bell-shaped, centered around its peak



A population that is represented by a normal curve is said to be normally distributed


  • or to have a normal distribution



Any normal r.v. is completely characterized by specifying values for its mean and standard deviation


The pdf for the normal distribution is given by:


\[f(x)={1\over \sigma \sqrt{2\pi}}e^{-{(x-\mu)^2\over 2 \sigma^2}}\]


\(\mu\) is both the mean and median (due to symmetry) and is called a location parameter

  • if the value of \(\mu\) is changed, the whole distribution is shifted


\(\sigma\) is the standard deviation and referred to as the scale parameter

  • the larger \(\sigma\) is, the flatter and more spread out (“squished” down) the distribution looks
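To see the roles of \(\mu\) and \(\sigma\), here is a hedged sketch that writes the normal density by hand, with the negative exponent of the standard normal density formula, and compares it with R's built-in `dnorm()`. The function name `normal_pdf` and the parameter values are hypothetical, picked only for illustration.

```r
# Hand-written normal density (hypothetical helper, for illustration only)
normal_pdf <- function(x, mu, sigma) {
  1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
}

grid <- seq(-10, 10, by = 0.5)

# The hand-written version should agree with R's built-in dnorm()
matches_builtin <- isTRUE(all.equal(normal_pdf(grid, 0, 1), dnorm(grid, 0, 1)))

# Location: changing mu moves the peak to the new mean
peak_location <- grid[which.max(normal_pdf(grid, mu = 2, sigma = 1))]

# Scale: a larger sigma lowers the peak and spreads the curve out
peak_narrow <- normal_pdf(0, mu = 0, sigma = 1)
peak_wide   <- normal_pdf(0, mu = 0, sigma = 3)

c(peak_location = peak_location, peak_narrow = peak_narrow, peak_wide = peak_wide)
```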





Recall a special feature of approximately bell-shaped distributions:




Standard Normal Distribution

A standard normal distribution is a normal distribution with:

  • \(\mu=0\)

  • \(\sigma = 1\)

    • What is the total area under the standard normal curve?



We use the letter \(Z\) to represent a standard normal random variable (referring to \(z\)-score)


The probability that a standard normal random variable \(Z\) is between \(a\) and \(b\) (\(P(a<Z<b)\)) is equal to the area under the standard normal curve over the interval \([a,b]\)


  • As we’ve discussed in gruesome detail, we need calculus to manually compute this area

    • Regardless of whether you know calculus, for this course you do not need to
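Software can do the calculus for us. As a minimal sketch, with hypothetical endpoints \(a=-1\) and \(b=2\) chosen only for illustration, the area between two values is the difference of two left-areas from `pnorm()`, and it should agree with numerically integrating the density.

```r
# Hypothetical endpoints, for illustration only
a <- -1
b <- 2

# pnorm() returns the area to the left of a z-score,
# so the area between a and b is a difference of two left-areas
area_cdf <- pnorm(b) - pnorm(a)

# The same area by numerically integrating the standard normal density
area_int <- integrate(dnorm, lower = a, upper = b)$value

c(area_cdf = area_cdf, area_int = area_int)
```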




Suppose that the heights of American men (20 years or older) are approximately normal with a mean of 70 inches and a standard deviation of 4 inches. What proportion of American men are less than 6 feet tall?




A way that we could solve this, given our current understanding of the class:

  1. Find the proportion for each “Class” (58-60, 60-62, …)

  2. Add them all up

  3. Multiply the sum by the class width (in this context, 2)


Let’s try that:


\[ \begin{array}{|c|c|c|c|c|c|c|c|} \hline 58 & 60 & 62 & 64 & 66 & 68 & 70 & \text{Total} \\ \hline 0.005 & 0.01 & 0.02 & 0.045 & 0.075 & 0.095 & 0.095 & 0.345 \\ \hline \end{array} \]

\[0.345 \times 2=0.69\]


With \(z\)-scores:


\[P(X\leq 72) = P\left(Z \leq {72-70 \over 4}\right) = P(Z \leq 0.5)=0.6915\]
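Both approaches can be reproduced in R. This sketch uses the class values from the table and the stated mean and standard deviation; the variable names are illustrative.

```r
# Class-based approximation: values from the table, class width of 2
dens <- c(0.005, 0.01, 0.02, 0.045, 0.075, 0.095, 0.095)
approx_prop <- sum(dens) * 2

# z-score approach: 6 feet = 72 inches
mu     <- 70
sigma  <- 4
cutoff <- 72

z   <- (cutoff - mu) / sigma                  # standardize
p_z <- pnorm(z)                               # area to the left of z
p_x <- pnorm(cutoff, mean = mu, sd = sigma)   # same area, without standardizing

c(approx = approx_prop, z = z, p_z = round(p_z, 4), p_x = round(p_x, 4))
```

Standardizing first and using the original scale give the same area, since \(z\)-scores only relabel the horizontal axis.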




Use the \(z\)-table to compute:

  1. \(P (Z < 1.26)\) (i.e., area to the left of \(z = 1.26\));

  2. \(P (Z > −0.58)\) (i.e., area to the right of \(z = −0.58\));

  3. \(P (−0.58 < Z < 1.26)\) (i.e., area between \(z = −0.58\) and \(z = 1.26\)).



Part a

  1. \(P (Z < 1.26)\) (i.e., area to the left of \(z = 1.26\));
  • Start by sketching an image (as a terrible artist, I’m partially going to cheat):


  • Locate your value using the \(z\)-table

  • The value \(0.8962\) is the area to the left of \(z = 1.26\)




Part b

  2. \(P (Z > −0.58)\) (i.e., area to the right of \(z = −0.58\));
  • Sketch your picture:

  • Look up your value using the \(z\)-table

  • Remember that our \(z\)-table is telling us the area to the left

    • Use the complement rule to find the area to the right of \(z=-0.58\)




Part c

  • Be an artist:

  • Parse the \(z\)-table if needed (we don’t need to here)

  • Use your results from part (a), \(P(Z < 1.26)\), and part (b), \(P(Z > −0.58)\), to compute the area between them, \(P(-0.58<Z<1.26)\)
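Table lookups for parts (a) through (c) can be checked with `pnorm()`, which returns the area to the left of a \(z\)-score. This is a sketch for verification only; the variable names are illustrative, and table values are rounded to four decimals, so small differences are expected.

```r
# Part a: area to the left of z = 1.26
part_a <- pnorm(1.26)

# Part b: area to the right of z = -0.58, via the complement rule
part_b <- 1 - pnorm(-0.58)

# Part c: area between the two z-scores, as a difference of left-areas
part_c <- pnorm(1.26) - pnorm(-0.58)

# Equivalent route using the results of parts a and b
part_c_alt <- part_a + part_b - 1

round(c(a = part_a, b = part_b, c = part_c, c_alt = part_c_alt), 4)
```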




Use the \(z\)-table to find the area under the standard normal curve in the shaded region:





Using the symmetry of a normal density curve and the fact that \(P (Z < −1) = 0.1587\), compute the following probabilities. (You don’t need \(z\)-table for this question.)

  1. \(P(Z \geq 1)\)



  2. \(P(0\leq Z\leq 1)\)



  3. \(P(−1<Z<1)\)
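One way to check your symmetry reasoning is sketched below. It uses only the given value \(0.1587\) plus two facts: the curve is symmetric about 0 and the total area is 1. The variable names are illustrative, and the comparison with `pnorm()` is just a sanity check.

```r
p_left <- 0.1587  # given: P(Z < -1)

# 1. By symmetry, the right tail beyond 1 matches the left tail beyond -1
p_ge_1 <- p_left

# 2. Half the area lies right of 0; remove the tail beyond 1
p_0_to_1 <- 0.5 - p_left

# 3. Total area minus both tails
p_m1_to_1 <- 1 - 2 * p_left

c(p_ge_1, p_0_to_1, p_m1_to_1)

# Sanity check against software (not needed for the exercise)
c(1 - pnorm(1), pnorm(1) - pnorm(0), pnorm(1) - pnorm(-1))
```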




Finding a z Score for a given area

Often we need to find the \(z\)-score that corresponds to a given area under the standard normal curve (e.g. percentile values).

  • Recall that the numbers in the body of the \(z\)-table represent the area to the left of the corresponding \(z\)-score.

  • The next example shows how to use \(z\)-table to find the \(z\)-score given an area under the standard normal curve.


Use the \(z\)-table to find the \(z\)-score that has an area of \(0.68\) to its right.


  • Sketch your curve:


Since we want area to the left in order to use a \(z\)-table, we need to take the complement:

\[1-0.68=0.32\]


Then, recognizing that approximations will never be perfect, we’ll search our \(z\)-table for a value that’s \(\approx 0.32\):

  • Approximation is a slightly touchy-feely art

    • We can see that \(|0.32-0.32276|=0.00276\) and \(|0.32-0.31918|=0.00082\)

    • It should be clear that \(0.00276>0.00082\)

    • So \(z=-0.47\) is a better approximation of our given area than \(z=-0.46\)

    • In practice you can do a lot of this determination with gut feeling and be effective

  • Our final conclusion is that \(z=-0.47\)
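The reverse lookup can also be done with `qnorm()`, which takes an area and returns the matching \(z\)-score. This sketch compares the two table candidates against the target area; the variable names are illustrative.

```r
# z-score with area 0.68 to its right (equivalently, 0.32 to its left)
z_exact <- qnorm(0.68, lower.tail = FALSE)

# The two candidate z-scores from the table, and their left-areas
candidates <- c(-0.46, -0.47)
areas      <- pnorm(candidates)

# Pick the candidate whose area is closest to 0.32
best <- candidates[which.min(abs(areas - 0.32))]

list(z_exact = z_exact, areas = round(areas, 5), best = best)
```

The software answer has more decimal places than the table allows, which is why the table result is only the nearest two-decimal \(z\)-score.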



Use the \(z\)-table to find the value of \(z_0\) such that \(P(-z_0 \leq Z \leq z_0) = 0.95\)
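After working this by hand, one way to check the answer is sketched below. The reasoning is that a central area of \(0.95\) leaves \(0.05\) split evenly between the two tails, so the area to the left of \(z_0\) is \(0.975\). The variable names are illustrative.

```r
central <- 0.95

# Each tail holds half of the leftover area
tail_area <- (1 - central) / 2

# Area to the left of z0 is everything except the upper tail
z0 <- qnorm(1 - tail_area)

# Confirm that the area between -z0 and z0 is the central area
check <- pnorm(z0) - pnorm(-z0)

c(z0 = round(z0, 2), check = check)
```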



Attendance QOTD




Go away