CLT practice

Michael Taylor

2018/04/04

Central Limit Theorem in Practice

Let’s use our data to see how well the central limit theorem approximates sample averages from our data. We will leverage our entire population dataset to compare the results we obtain by actually sampling from the distribution to what the CLT predicts.

dat <- read.csv("mice_pheno.csv") #file was previously downloaded
head(dat)
##   Sex Diet Bodyweight
## 1   F   hf      31.94
## 2   F   hf      32.48
## 3   F   hf      22.82
## 4   F   hf      19.92
## 5   F   hf      32.22
## 6   F   hf      27.50

Start by selecting only female mice since males and females have different weights. We will select three mice from each population.

controlPopulation <- filter(dat,
                            Sex == "F" & Diet == "chow") %>%  
  .$Bodyweight
hfPopulation <- filter(dat,
                       Sex == "F" & Diet == "hf") %>%  
  .$Bodyweight

We can compute the population parameters of interest using the mean function.

mu_hf <- mean(hfPopulation)
mu_control <- mean(controlPopulation)
print(mu_hf - mu_control)
## [1] 2.375517

Compute the population standard deviations as well. We do not use the R function sd because this would compute the estimates that divide by the sample size - 1 and we want the population estimates.

We can see that with R code:

x <- controlPopulation
N <- length(x)
populationvar <- mean((x - mean(x))^2)
identical(var(x), populationvar)
## [1] FALSE
identical(var(x) * (N-1)/N, populationvar)
## [1] TRUE

So to be mathematically correct, we do not use sd or var. Instead, we use the popvar and popsd function in rafalib:

library(rafalib)
sd_hf <- popsd(hfPopulation)
sd_control <- popsd(controlPopulation)

Remember that in practice we do not get to compute these population parameters. These are values we never see. In general, we want to estimate them from samples.

N <- 12
hf <- sample(hfPopulation, 12)
control <- sample(controlPopulation, 12)

As we described, the CLT tells us that for large \(N\), each of these is approximately normal with average population mean and standard error population variance divided by \(N\). We mentioned that a rule of thumb is that \(N\) should be 30 or more. However, that is just a rule of thumb since the preciseness of the approximation depends on the population distribution. Here we can actually check the approximation and we do that for various values of \(N\).

Now we use sapply and replicate instead of for loops, which makes for cleaner code (we do not have to pre-allocate a vector, R takes care of this for us):

Ns <- c(3, 12, 25, 50)
B <- 10000 #number of simulations
res <-  sapply(Ns, function(n) { 
  replicate(B, mean(sample(hfPopulation, n)) - mean(sample(controlPopulation, n)) )
    }
  )

Now we can use qq-plots to see how well CLT approximations works for these. If in fact the normal distribution is a good approximation, the points should fall on a straight line when compared to normal quantiles. The more it deviates, the worse the approximation. In the title, we also show the average and SD of the observed distribution, which demonstrates how the SD decreases with \(\sqrt N\) as predicted.

mypar(2,2)
# seq(along=Ns) generates the sequence 1, 2, ..., length(along)
for (i in seq(along=Ns)) { 
  titleavg <- signif(mean(res[,i]),3)
  titlesd <- signif(popsd(res[,i]),3)
  title <- paste0("N=",Ns[i]," Avg=",titleavg," SD=",titlesd)
  qqnorm(res[,i],main=title)
  qqline(res[,i],col=2)
}

Here we see a pretty good fit even for 3. Why is this? Because the population itself is relatively close to normally distributed, the averages are close to normal as well (the sum of normals is also a normal). In practice, we actually calculate a ratio: we divide by the estimated standard deviation. Here is where the sample size starts to matter more

Ns <- c(3,12,25,50)
B <- 10000 #number of simulations

##function to compute a t-stat
computetstat <- function(n) {
  y <- sample(hfPopulation,n)
  x <- sample(controlPopulation,n)
  (mean(y)-mean(x))/sqrt(var(y)/n+var(x)/n)
  }

res <-  sapply(Ns,function(n) {
  replicate(B, computetstat(n))
  })

mypar(2,2)

for (i in seq(along=Ns)) {
  qqnorm(res[,i], main=Ns[i])
  qqline(res[,i], col=2)
  }

So we see that for N=3, the CLT does not provide a usable approximation. For N=12, there is a slight deviation at the higher values, although the approximation appears useful. For 25 and 50, the approximation is spot on.

This simulation only proves that N=12 is large enough in this case, not in general. As mentioned above, we will not be able to perform this simulation in most situations. We only use the simulation to illustrate the concepts behind the CLT and its limitations. In future sections, we will describe the approaches we actually use in practice.
*** ## Exercises

dat <- read.csv("femaleMiceWeights.csv")
  1. The CLT is a result from probability theory. Much of probability theory was originally inspired by gambling. This theory is still used in practice by casinos. For example, they can estimate how many people need to play slots for there to be a 99.9999% probability of earning enough money to cover expenses. Let’s try a simple example related to gambling.

Suppose we are interested in the proportion of times we see a 6 when rolling n=100 die. This is a random variable which we can simulate with x=sample(1:6, n, replace=TRUE) and the proportion we are interested in can be expressed as an average: mean(x==6). Because the die rolls are independent, the CLT applies.

We want to roll n dice 10,000 times and keep these proportions. This random variable (proportion of 6s) has mean p=1/6 and variance p*(1-p)/n. So according to CLT z = (mean(x==6) - p) / sqrt(p*(1-p)/n) should be normal with mean 0 and SD 1. Set the seed to 1, then use replicate to perform the simulation, and report what proportion of times z was larger than 2 in absolute value (CLT says it should be about 0.05).

set.seed(1)
n <- 100
sides <- 6
p <- 1/sides

z <- replicate(10000, {
  x = sample(1:sides, n, replace = TRUE)
  ( mean(x == 6) - p) / sqrt( p * (1 - p) / n )
})

qqnorm(z)
abline(0,1) #confirm it's well approximated with normal distribution

print(paste("The proportion of times `z` was larger than 2 in absolute value was",
      mean(abs(z) > 2)) )
## [1] "The proportion of times `z` was larger than 2 in absolute value was 0.0424"
  1. For the last simulation you can make a qqplot to confirm the normal approximation. Now, the CLT is an asympototic result, meaning it is closer and closer to being a perfect approximation as the sample size increases. In practice, however, we need to decide if it is appropriate for actual sample sizes. Is 10 enough? 15? 30?

In the example used in exercise 1, the original data is binary (either 6 or not). In this case, the success probability also affects the appropriateness of the CLT. With very low probabilities, we need larger sample sizes for the CLT to “kick in”.

Run the simulation from exercise 1, but for different values of p and n. For which of the following is the normal approximation best?

set.seed(1)
Ns <- c(5, 30, 30, 100)
B <- 10000
sides <- 6
p <- c(0.5, 0.5, 0.01, 0.01)

zs <- sapply(Ns, function(n){
  replicate(B, {
  x = sample(1:sides, n, replace = TRUE)
  ( mean(x == 6) - p[which(n %in% Ns)]) / sqrt( p[which(n %in% Ns)] * (1 - p[which(n %in% Ns)]) / n )
  })
})
mypar(2,2)
# seq(along=Ns) generates the sequence 1, 2, ..., length(along)
for (i in seq(along=Ns)) { 
  titleavg <- signif(mean(zs[,i]),3)
  titlesd <- signif(popsd(zs[,i]),3)
  title <- paste0("p = ", p[i], " and n = ", Ns[i])
  qqnorm(zs[,i],main=title)
  qqline(zs[,i],col=2)
}

The normal approximation is the best for p=0.01 and n=100. ***

  1. As we have already seen, the CLT also applies to averages of quantitative data. A major difference with binary data, for which we know the variance is \(p(1−p)\), is that with quantitative data we need to estimate the population standard deviation.

In several previous exercises we have illustrated statistical concepts with the unrealistic situation of having access to the entire population. In practice, we do not have access to entire populations. Instead, we obtain one random sample and need to reach conclusions analyzing that data. dat is an example of a typical simple dataset representing just one sample. We have 12 measurements for each of two populations:

X <- dat %>%
  filter(Diet == "chow") %>% 
  .$Bodyweight
Y <- dat %>%
  filter(Diet == "hf") %>% 
  .$Bodyweight

We think of \(X\) as a random sample from the population of all mice in the control diet and \(Y\) as a random sample from the population of all mice in the high fat diet.

Define the parameter \(\mu_x\) as the average of the control population. We estimate this parameter with the sample average \(\bar{X}\). What is the sample average?

print(paste("The sample average is", round(mean(X), digits = 2) ))
## [1] "The sample average is 23.81"
  1. We don’t know \(\mu_X\), but want to use \(\bar{X}\) to understand \(\mu_X\). Which of the following uses CLT to understand how well \(\bar{X}\) approximates \(\mu_X\)?

\(\bar{X}\) follows a normal distribution with mean \(\mu_X\) and standard deviation \(\frac{\sigma_x}{\sqrt{12}}\) where \(\sigma x\) is the population standard deviation.

  1. The result above tells us the distribution of the following random variable: \(Z=\sqrt{12} \frac{\bar{X}-\mu_X}{\sigma_X}\). What does the CLT tell us is the mean of \(Z\)(you don’t need code)?

The CLT tells us the mean is 0.

  1. The result of 4 and 5 tell us that we know the distribution of the difference between our estimate and what we want to estimate, but don’t know. However, the equation involves the population standard deviation \(\sigma_X\), which we don’t know. Given what we discussed, what is your estimate of \(\sigma_X\)?
set.seed(1)
print(paste("The estimate is", 
            round(sd(X), digits = 2 )))
## [1] "The estimate is 3.02"
  1. Use the CLT to approximate the probability that our estimate \(\bar{X}\) is off by more than 2 grams from \(\mu_X\).
2 * ( 1-pnorm(2/sd(X) * sqrt(12) ) )
## [1] 0.02189533
  1. Now we introduce the concept of a null hypothesis. We don’t know \(μx\) nor \(μy\). We want to quantify what the data say about the possibility that the diet has no effect: \(\mu_x=\mu_y\). If we use CLT, then we approximate the distribution of \(\bar{X}\) as normal with mean \(μX\) and standard deviation \(σX\) and the distribution of \(\bar{Y}\) as normal with mean \(μy\) and standard deviation \(σy\). This implies that the difference \(\bar{Y}-\bar{X}\) has mean 0. We described that the standard deviation of this statistic (the standard error) is \(\mbox{SE}( \bar{X}-\bar{Y}) = \sqrt{ \sigma_y^2 / 12 + \sigma_x^2 /12 }\) and that we estimate the population standard deviations \(\sigma x\) and \(\sigma y\) with the sample estimates. What is the estimate of \(\mbox{SE}( \bar{X}-\bar{Y}) = \sqrt{ \sigma_y^2 / 12 + \sigma_x^2 /12 }\).
print(paste("The estimate of standard error is", 
            round(sqrt( var(X)/12 + var(Y)/12 ), 
                  digits = 2) ))
## [1] "The estimate of standard error is 1.47"
  1. So now we can compute \(\bar{Y}-\bar{X}\) as well as an estimate of this standard error and construct a t-statistic. What is this t-statistic?

Since: \[\frac{\bar{Y}-\bar{X}}{\sqrt{\frac{\sigma_X^2}{M} + \frac{\sigma_Y^2}{N}}}\]

(Z <- ( mean(Y) - mean(X) ) / sqrt( var(X)/12 + var(Y)/12))
## [1] 2.055174

or:

t.test(X, Y)$stat
##         t 
## -2.055174

10.If we apply the CLT, what is the distribution of this t-statistic? * Normal with mean 0 and standard deviation 1.

  1. Now we are ready to compute a p-value using the CLT. What is the probability of observing a quantity as large as what we computed in 10, when the null distribution is true?
2 * (1 - pnorm(Z))
## [1] 0.0398622
  1. CLT provides an approximation for cases in which the sample size is large. In practice, we can’t check the assumption because we only get to see 1 outcome (which you computed above). As a result, if this approximation is off, so is our p-value. As described earlier, there is another approach that does not require a large sample size, but rather that the distribution of the population is approximately normal. We don’t get to see this distribution so it is again an assumption, although we can look at the distribution of the sample with qqnorm(X) and qqnorm(Y). If we are willing to assume this, then it follows that the t-statistic follows t-distribution. What is the p-value under the t-distribution approximation? Hint: use the t.test function.
t.test(X, Y)$p.value
## [1] 0.05299888
  1. With the CLT distribution, we obtained a p-value smaller than 0.05 and with the t-distribution, one that is larger. They can’t both be right. What best describes the difference?

These are two different assumptions. The t-distribution accounts for the variability introduced by the estimation of the standard error and thus, under the null, large values are more probable under the null distribution.

sessionInfo()
## R version 3.4.4 (2018-03-15)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_Canada.1252  LC_CTYPE=English_Canada.1252   
## [3] LC_MONETARY=English_Canada.1252 LC_NUMERIC=C                   
## [5] LC_TIME=English_Canada.1252    
## 
## attached base packages:
## [1] methods   stats     graphics  grDevices utils     datasets  base     
## 
## other attached packages:
## [1] rafalib_1.0.0    bindrcpp_0.2.2   kableExtra_0.9.0 knitr_1.20      
## [5] dplyr_0.7.5     
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.17       RColorBrewer_1.1-2 pillar_1.2.3      
##  [4] compiler_3.4.4     plyr_1.8.4         bindr_0.1.1       
##  [7] tools_3.4.4        digest_0.6.15      evaluate_0.10.1   
## [10] tibble_1.4.2       viridisLite_0.3.0  pkgconfig_2.0.1   
## [13] rlang_0.2.1        rstudioapi_0.7     yaml_2.1.19       
## [16] blogdown_0.6       xfun_0.1           stringr_1.3.1     
## [19] httr_1.3.1         xml2_1.2.0         hms_0.4.2         
## [22] rprojroot_1.3-2    tidyselect_0.2.4   glue_1.2.0        
## [25] R6_2.2.2           rmarkdown_1.9      bookdown_0.7      
## [28] purrr_0.2.5        readr_1.1.1        magrittr_1.5      
## [31] backports_1.1.2    scales_0.5.0       codetools_0.2-15  
## [34] htmltools_0.3.6    assertthat_0.2.0   rvest_0.3.2       
## [37] colorspace_1.3-2   stringi_1.1.7      munsell_0.4.3
knitr::write_bib(.packages(), "packages.bib") 

References

Irizarry, Rafael A., and Michael I. Love. 2015. Rafalib: Convenience Functions for Routine Data Exploration. https://CRAN.R-project.org/package=rafalib.

R Core Team. 2018. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2018. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.