Normal Distribution

Scott Yanco

March 26, 2018

Learning Objectives

The Normal Distribution


Normal Distribution Characteristics

An example (with code!)

Let’s work an example with owl masses (from my research).

Owl Masses

library(mosaic)
owls <- read.file("owl_morph.csv")
head(owls)
##   owl_mass owl_sex
## 1     56.5       M
## 2     58.0       M
## 3     51.5       M
## 4     50.5       M
## 5     51.0       M
## 6     48.0       M
hist(owls$owl_mass)

Owl Masses

mean.mass <- mean(owls$owl_mass)
print(mean.mass)
## [1] 54.22727
sd.mass <- sd(owls$owl_mass)
print(sd.mass)
## [1] 6.555181

Owl Masses

hist(owls$owl_mass, freq = F)
x.mass <- seq(35, 70)
y.mass <- dnorm(x.mass, mean = mean.mass, sd = sd.mass)
lines(x.mass, y.mass)

One small problem…

The following code gives us the cumulative probability, under this normal distribution, of observing a mass less than 0:

pnorm(0, mean = mean.mass, sd = sd.mass)
## [1] 6.562843e-17

Discuss with your neighbors: what’s the problem with this?

One small problem…

A non-zero probability of having owls with negative mass!

The Standard Normal (it’s like a medium regular)

Sometimes it is useful to convert a data set to the “standard normal” distribution. This distribution is a specific parameterization of the normal where \(\mu = 0\) and \(\sigma = 1\).

There are some very convenient properties of this distribution. What are they?
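A couple of these properties are easy to verify in R (a quick sketch; pnorm() defaults to the standard normal, \(\mu = 0\) and \(\sigma = 1\)):

pnorm(0)               # half the probability lies below the mean: 0.5
pnorm(1) - pnorm(-1)   # ~0.68 of the probability falls within 1 SD
pnorm(2) - pnorm(-2)   # ~0.95 within 2 SDs
pnorm(3) - pnorm(-3)   # ~0.997 within 3 SDs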

Z-Score

We can convert data sets to the standard normal with a simple function:

\(\vec{Z} = \frac{\vec{Y} - \mu}{\sigma}\)

Let’s do that with the owl data…

Z Score Example

# add a new z-score column to the data frame; note that subtracting a scalar
# from a vector and dividing by a scalar both work element-wise in R
owls$z <- (owls$owl_mass - mean.mass)/sd.mass
head(owls)
##   owl_mass owl_sex          z
## 1     56.5       M  0.3467070
## 2     58.0       M  0.5755336
## 3     51.5       M -0.4160484
## 4     50.5       M -0.5685995
## 5     51.0       M -0.4923239
## 6     48.0       M -0.9499772

Plot it!
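A minimal sketch of that plot, using the z column we just created:

# the standardized masses should be centered at 0 with SD 1
hist(owls$z, xlab = "z-score", main = "Standardized Owl Masses")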

Standard Normal Table

Using the table on page 279 of Whitlock and Schluter, what’s the probability of a z-score more extreme than \(|1.50|\)?

What if the z-score had instead been derived from a wing-length data set?

But just use R though…

2*pnorm(1.5, lower.tail = F)
## [1] 0.1336144

It even works without having to standardize the distribution…

pnorm(64, mean = mean.mass, sd = sd.mass, lower.tail = F) #this is the one-tailed version
## [1] 0.06800174

Easier, right?
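As a sanity check (a sketch reusing mean.mass and sd.mass from above), standardizing by hand and using the standard normal gives the same tail probability:

z.64 <- (64 - mean.mass)/sd.mass  # z-score for a mass of 64
pnorm(z.64, lower.tail = F)       # matches the unstandardized call above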

Sampling From the Sample

Getting back to the owl data set, let’s imagine we take a bunch of random samples from this population and look at the resulting sampling distribution (someone remind us what a sampling distribution is…)

In groups, consider:

Sampling From the Sample

sampling.mass <- numeric(1000)  # preallocate storage for the sample means
for (i in 1:1000) {
  # bootstrap: resample with replacement, same size as the original sample
  samp <- sample(owls$owl_mass, length(owls$owl_mass), replace = T)
  sampling.mass[i] <- mean(samp)
}

Sampling From the Sample

hist(sampling.mass, xlab = "mean owl mass", main = "Sampling Distribution of Mean Owl Masses")

Is it normal?

x.mass <- seq(20, 80, by = .01)
y.mass <- dnorm(x.mass, mean = mean(sampling.mass), sd = sd(sampling.mass))
hist(sampling.mass, xlab = "mean owl mass", main = "Sampling Distribution of Mean Owl Masses", freq = F)
lines(x.mass, y.mass)
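A quick numeric check (a sketch): the SD of this sampling distribution should be close to \(s/\sqrt{n}\), the familiar standard error - a connection we return to at the end:

sd(sampling.mass)                      # SD of the bootstrap sampling distribution
sd.mass/sqrt(length(owls$owl_mass))    # s/sqrt(n); should be similar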

The Central Limit Theorem

“…the sum or mean of a large number of measurements randomly sampled from a non-normal population is approximately normal.” (Whitlock and Schluter pg 286)

What does this mean?

The Central Limit Theorem

Let’s simulate a distinctly non-normal data set and see what happens when we invoke the CLT.

Let’s consider the heights of humans in a daycare - what might we expect that histogram to look like?

The Central Limit Theorem - The Daycare

# simulate a bimodal height distribution: 30 adults and 100 kids
ad.ht <- rnorm(30, mean = 67, sd = 3)    # adult heights (in.)
kid.ht <- rnorm(100, mean = 30, sd = 5)  # kid heights (in.)
daycare.heights <- c(ad.ht, kid.ht)

The Central Limit Theorem - The Daycare

#let's look at it
hist(daycare.heights, main = "Freq. distribution of heights at a daycare", xlab = "Heights (in.)")

The Central Limit Theorem - The Daycare

Let’s take a look at the mean of that population:

pop.mean.ht <- mean(daycare.heights)
print(pop.mean.ht)
## [1] 39.12092

The Central Limit Theorem - The Daycare

Now let’s mark that mean on the histogram:

hist(daycare.heights, main = "Freq. distribution of heights at a daycare", xlab = "Heights (in.)")
abline(v = pop.mean.ht, col = "red")

Daycare Sample

OK, we’ve now looked at our whole population, which in practice we typically cannot do - so let’s work with a sample instead, which is more realistic.

We’ll take a random sample of 20, plot it and look at the mean:

dc.samp <- sample(daycare.heights, 20)
samp.mean <- mean(dc.samp)
print(samp.mean)
## [1] 37.4384

Daycare Sample

hist(daycare.heights, main = "Freq. distribution of heights at a daycare", xlab = "Heights (in.)")
abline(v = samp.mean, col = "red")

Daycare Sample Again

Let’s take another random sample of 20 and look at its mean:
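The code behind these slides isn’t shown; presumably each mean comes from re-running the earlier sampling step, something like:

dc.samp <- sample(daycare.heights, 20)  # a fresh random sample each run
mean(dc.samp)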

## [1] 40.59754

and again

## [1] 35.6178

and again

## [1] 36.15207

and again

## [1] 38.11664

But Why Though?

Why are the results different each time? Discuss with your neighbors.

SamplING Distribution

Let’s do that sample trick a bunch of times and look at that distribution (sampling distribution).

reps <- 1000
sampling.mean.dc <- numeric(reps)  # preallocate storage for the sample means

for (i in 1:reps) {
  # draw a fresh sample of 20 (without replacement) and record its mean
  dc.samp <- sample(daycare.heights, 20, replace = F)
  samp.mean <- mean(dc.samp)
  sampling.mean.dc[i] <- samp.mean
}

SamplING Distribution
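The plots on these slides weren’t captured in the text; a sketch of the histogram-plus-normal-overlay they presumably showed, mirroring the earlier owl example:

# histogram of the sampling distribution of mean heights
hist(sampling.mean.dc, freq = F, xlab = "mean daycare height (in.)",
     main = "Sampling Distribution of Mean Heights")
# overlay a normal curve with the matching mean and SD
x.ht <- seq(min(sampling.mean.dc), max(sampling.mean.dc), by = .01)
lines(x.ht, dnorm(x.ht, mean = mean(sampling.mean.dc), sd = sd(sampling.mean.dc)))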

Binomial to Normal

This concept is really just invoking the CLT again.

Let’s use the owl sex data - we’ll first simulate taking a bunch of random samples to get the \(\mu\) and \(\sigma\).

Then we’ll use the shortcut the book showed us.

Binomial to Normal

p.sex <- props(owls$owl_sex)
print(p.sex)
## prop_F prop_M 
##   0.25   0.75

Binomial to Normal

successes <- numeric(reps)  # preallocate storage for the counts
for (i in 1:reps) {
  # one simulated sample of 20 owls: count of females (successes)
  samp <- rbinom(1, size = 20, prob = p.sex[1])
  successes[i] <- samp
}

Binomial to Normal

mean(successes)
## [1] 5.01
sd(successes)
## [1] 1.975805

or use the shortcut \(\mu = np\) and \(\sigma = \sqrt{np(1-p)}\)

n = 20
p = .25
#mean
n*p
## [1] 5
#SD
sqrt(n*p * (1-p))
## [1] 1.936492
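To see the approximation in action (a sketch; the cutoff of 8 females is just an illustration), compare an exact binomial tail probability to its normal approximation built from these values:

# exact binomial: P(8 or more females in a sample of 20)
pbinom(7, size = 20, prob = p, lower.tail = F)
# normal approximation with mean n*p and SD sqrt(n*p*(1-p));
# 7.5 rather than 8 is the standard continuity correction
pnorm(7.5, mean = n*p, sd = sqrt(n*p*(1-p)), lower.tail = F)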

In fact…

This normal approximation works precisely BECAUSE of the CLT! Cool, right?

Remember that \(SE = SD\) of the sampling distribution? It’s because of the CLT.

Remember \(95\% \ CI \approx 2 \times SE\)? It’s because of the CLT.

A large proportion of inferential statistics rely upon the CLT to work… as you will see in the remainder of the semester.