Scott Yanco
March 26, 2018
Let’s work an example with owl masses (from my research).
library(mosaic)
owls <- read.file("owl_morph.csv")
head(owls)
## owl_mass owl_sex
## 1 56.5 M
## 2 58.0 M
## 3 51.5 M
## 4 50.5 M
## 5 51.0 M
## 6 48.0 M
hist(owls$owl_mass)
mean.mass <- mean(owls$owl_mass)
print(mean.mass)
## [1] 54.22727
sd.mass <- sd(owls$owl_mass)
print(sd.mass)
## [1] 6.555181
hist(owls$owl_mass, freq = F)
x.mass <- seq(35, 70)
y.mass <- dnorm(x.mass, mean = mean.mass, sd = sd.mass)
lines(x.mass, y.mass)
The following code gives us the probability, under this fitted normal distribution, of an owl mass less than 0:
pnorm(0, mean = mean.mass, sd = sd.mass)
## [1] 6.562843e-17
Discuss with your neighbors: what’s the problem with this?
Sometimes it is useful to convert a data set to the “standard normal” distribution. This distribution is a specific parameterization of the normal where \(\mu = 0\) and \(\sigma = 1\).
There are some very convenient properties of this distribution. What are they?
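One handy property we can check right away (just a quick sketch using pnorm(), which defaults to mean = 0 and sd = 1) is the familiar 68-95-99.7 rule:
#proportion of a standard normal within 1, 2, and 3 SDs of the mean
pnorm(1) - pnorm(-1) #roughly 0.68
pnorm(2) - pnorm(-2) #roughly 0.95
pnorm(3) - pnorm(-3) #roughly 0.997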
We can convert data sets to the standard normal with a simple function:
\(\vec{Z} = \frac{\vec{Y} - \mu}{\sigma}\)
Let’s do that with the owl data…
owls$z <- (owls$owl_mass - mean.mass)/sd.mass #note the cool stuff in this one line: we created a new data frame column and applied scalar operations to a whole vector
head(owls)
## owl_mass owl_sex z
## 1 56.5 M 0.3467070
## 2 58.0 M 0.5755336
## 3 51.5 M -0.4160484
## 4 50.5 M -0.5685995
## 5 51.0 M -0.4923239
## 6 48.0 M -0.9499772
Using the table on page 279 of Whitlock and Schluter, what’s the probability of a z-score more extreme than \(|1.50|\)?
What if the z-score had been derived from a wing length data set instead?
2*pnorm(1.5, lower.tail = F)
## [1] 0.1336144
It even works without having to standardize the distribution…
pnorm(64, mean = mean.mass, sd = sd.mass, lower.tail = F) #this is the one-tailed version
## [1] 0.06800174
Easier, right?
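Just to convince ourselves the two approaches agree (a quick sketch, standardizing a mass of 64 by hand and taking the same upper tail):
z.64 <- (64 - mean.mass)/sd.mass #z-score for an owl mass of 64
pnorm(z.64, lower.tail = F) #same answer as pnorm(64, mean = mean.mass, sd = sd.mass, lower.tail = F)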
Getting back to the owl data set, let’s imagine we take a bunch of random samples from this population and look at our sampling distribution (someone remind us what a sampling distribution is…).
#take 1000 resamples (with replacement) of the owl masses and store each resample's mean
sampling.mass <- numeric(1000)
for (i in 1:1000) {
  samp <- sample(owls$owl_mass, length(owls$owl_mass), replace = T)
  sampling.mass[i] <- mean(samp)
}
hist(sampling.mass, xlab = "mean owl mass", main = "Sampling Distribution of Mean Owl Masses")
x.mass <- seq(20, 80, by = .01)
y.mass <- dnorm(x.mass, mean = mean(sampling.mass), sd = sd(sampling.mass))
hist(sampling.mass, xlab = "mean owl mass", main = "Sampling Distribution of Mean Owl Masses", freq = F)
lines(x.mass, y.mass)
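Here’s something worth noticing (a quick sketch - your numbers will differ a bit since the resampling is random): the SD of this simulated sampling distribution should land close to the usual standard error shortcut, \(SD/\sqrt{n}\).
n.owls <- length(owls$owl_mass) #sample size
sd(sampling.mass) #SD of the simulated sampling distribution
sd.mass/sqrt(n.owls) #standard error shortcut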
“…the sum or mean of a large number of measurements randomly sampled from a non-normal population is approximately normal.” (Whitlock and Schluter pg 286)
What does this mean?
Let’s simulate a distinctly non-normal data set and see what happens when we invoke the CLT.
Let’s consider the heights of the humans at a daycare - what might we expect that histogram to look like?
#simulate a bimodal distribution of heights: 30 adults and 100 kids
ad.ht <- rnorm(30, 67, 3) #adult heights (in.)
kid.ht <- rnorm(100, 30, 5) #kid heights (in.)
daycare.heights <- c(ad.ht, kid.ht)
#let's look at it
hist(daycare.heights, main = "Freq. distribution of heights at a daycare", xlab = "Heights (in.)")
Let’s take a look at the mean of that population:
pop.mean.ht <- mean(daycare.heights)
print(pop.mean.ht)
## [1] 39.12092
Now let’s mark that population mean on the histogram:
hist(daycare.heights, main = "Freq. distribution of heights at a daycare", xlab = "Heights (in.)")
abline(v = pop.mean.ht, col = "red")
OK, so far we’ve looked at the whole population, which we typically can’t do - let’s work with a sample, which is more realistic.
We’ll take a random sample of 20 and mark its mean on the population histogram:
dc.samp <- sample(daycare.heights, 20)
samp.mean <- mean(dc.samp)
print(samp.mean)
## [1] 37.4384
hist(daycare.heights, main = "Freq. distribution of heights at a daycare", xlab = "Heights (in.)")
abline(v = samp.mean, col = "red")
If we re-run that sampling code a few more times, we get a different sample mean each run:
## [1] 40.59754
## [1] 35.6178
## [1] 36.15207
## [1] 38.11664
Why are the results different each time? Discuss with your neighbors.
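(Those repeated sample means just came from re-running the code above. If you’d rather not copy and paste, here’s a compact sketch with replicate() - your values will differ since the samples are random.)
replicate(4, mean(sample(daycare.heights, 20))) #four independent sample means of size 20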
Let’s do that sample trick a bunch of times and look at that distribution (sampling distribution).
reps <- 1000
sampling.mean.dc <- numeric(reps)
#repeatedly take a sample of 20 heights (without replacement) and store each sample mean
for (i in 1:reps) {
  dc.samp <- sample(daycare.heights, 20, replace = F)
  samp.mean <- mean(dc.samp)
  sampling.mean.dc[i] <- samp.mean
}
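We should look at that sampling distribution too - here’s a sketch that mirrors what we did for the owl masses (the plotting range is just a guess that comfortably covers the simulated means):
hist(sampling.mean.dc, xlab = "mean height (in.)", main = "Sampling Distribution of Mean Daycare Heights", freq = F)
x.ht <- seq(20, 60, by = .01)
y.ht <- dnorm(x.ht, mean = mean(sampling.mean.dc), sd = sd(sampling.mean.dc))
lines(x.ht, y.ht)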
This concept is really just invoking the CLT again.
Let’s use the owl sex data - we’ll first simulate taking a bunch of random samples to get the \(\mu\) and \(\sigma\).
Then we’ll use the shortcut the book showed us.
p.sex <- props(owls$owl_sex)
print(p.sex)
## prop_F prop_M
## 0.25 0.75
#simulate 1000 samples of 20 owls, counting the number of females in each
successes <- numeric(reps)
for (i in 1:reps) {
  samp <- rbinom(1, size = 20, prob = p.sex[1])
  successes[i] <- samp
}
mean(successes)
## [1] 5.01
sd(successes)
## [1] 1.975805
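To see the CLT doing its thing here too (a quick sketch - the overlaid curve just uses the simulated mean and SD), we can compare the simulated distribution of counts to a normal curve:
hist(successes, freq = F, xlab = "number of females in a sample of 20", main = "Sampling Distribution of Counts")
x.succ <- seq(0, 20, by = .1)
lines(x.succ, dnorm(x.succ, mean = mean(successes), sd = sd(successes)))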
Or we can use the shortcut: \(\mu = np\) and \(\sigma = \sqrt{np(1-p)}\).
n <- 20
p <- .25
#mean
n*p
## [1] 5
#SD
sqrt(n*p * (1-p))
## [1] 1.936492
And the fact that we can treat this sampling distribution as approximately normal (and plug these two numbers into all of our normal-distribution tools) is precisely BECAUSE of the CLT! Cool, right?
Remember that \(SE = SD\) of the sampling distribution? It’s because of the CLT.
Remember that \(95\% \, CI \approx 2 \times SE\)? It’s because of the CLT.
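For instance (a quick sketch with the owl masses, treating \(2 \times SE\) as the approximate margin of error):
se.mass <- sd.mass/sqrt(length(owls$owl_mass)) #standard error of the mean
mean.mass + c(-2, 2)*se.mass #approximate 95% CI for mean owl mass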
A large proportion of inferential statistics rely upon the CLT to work… as you will see in the remainder of the semester.