Central Limit Theorem and t-distribution

Michael Taylor

2018/06/09

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(rafalib)
knitr::opts_chunk$set(cache=TRUE)

Below we will discuss the Central Limit Theorem (CLT) and the t-distribution, both of which help us make important calculations related to probabilities. Both are frequently used in science to test statistical hypotheses. To use them, we have to make assumptions, and the assumptions differ between the CLT and the t-distribution. However, if the assumptions hold, we are able to calculate exact probabilities of events through the use of mathematical formulas.

Central Limit Theorem

The CLT is one of the most frequently used mathematical results in science. It tells us that when the sample size is large, the average \(\bar{Y}\) of a random sample follows a normal distribution centered at the population average \(\mu_Y\) and with standard deviation equal to the population standard deviation \(\sigma_Y\), divided by the square root of the sample size \(N\). We refer to the standard deviation of the distribution of a random variable as the random variable’s standard error.
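
To make this concrete, here is a minimal simulation sketch. The population pop and the sample size N below are made up for illustration (this is not the mouse data); popsd comes from the rafalib package loaded above.

set.seed(1)
pop <- rnorm(10000, mean = 25, sd = 4)          # made-up population
N <- 50
avgs <- replicate(1000, mean(sample(pop, N)))   # many sample averages
sd(avgs)                                        # observed spread of the averages
popsd(pop) / sqrt(N)                            # the standard error predicted above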

Please note that if we subtract a constant from a random variable, the mean of the new random variable shifts by that constant. Mathematically, if \(X\) is a random variable with mean \(\mu\) and \(a\) is a constant, the mean of \(X-a\) is \(\mu-a\). A similarly intuitive result holds for multiplication and the standard deviation (SD). If \(X\) is a random variable with mean \(\mu\) and SD \(\sigma\), and \(a\) is a constant, then the mean and SD of \(aX\) are \(a\mu\) and \(\lvert a\rvert \sigma\) respectively. To see how intuitive this is, imagine that we subtract 10 grams from each of the mouse weights. The average weight should also drop by that much. Similarly, if we change the units from grams to milligrams by multiplying by 1000, then the spread of the numbers becomes larger.
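
These two facts are easy to check numerically. Here is a minimal sketch with made-up weights x (in grams) and the constants from the example above:

x <- rnorm(200, mean = 30, sd = 5)   # made-up weights in grams
mean(x - 10); mean(x) - 10           # subtracting 10 shifts the mean down by 10
sd(1000 * x); 1000 * sd(x)           # converting to milligrams scales the SD by 1000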

This implies that if we take many samples of size \(N\), then the quantity: \[\frac{\bar{Y} - \mu_Y}{\sigma_Y/\sqrt{N}}\] is approximated with a normal distribution centered at 0 and with standard deviation 1.
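
Continuing the simulation sketch from above (again with a made-up population, not the mouse data), we can standardize the sample averages and check that they behave like draws from a standard normal:

set.seed(1)
pop <- rnorm(10000, mean = 25, sd = 4)            # made-up population
N <- 50
avgs <- replicate(1000, mean(sample(pop, N)))
z <- (avgs - mean(pop)) / (popsd(pop) / sqrt(N))  # standardized averages
mean(z); sd(z)                                    # close to 0 and 1
qqnorm(z); qqline(z)                              # points fall near the identity line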

Now we are interested in the difference between two sample averages. Here again a mathematical result helps. If we have two random variables \(X\) and \(Y\) with means \(\mu_X\) and \(\mu_Y\) and variances \(\sigma_X^2\) and \(\sigma_Y^2\) respectively, then we have the following result: the mean of the sum \(Y+X\) is the sum of the means \(\mu_Y+\mu_X\). Using one of the facts we mentioned earlier, we can write \(Y-X = Y+aX\) with \(a=-1\), which implies that the mean of \(Y-X\) is \(\mu_Y-\mu_X\). This is intuitive. However, the next result is perhaps not as intuitive. If \(X\) and \(Y\) are independent of each other, as they are in our mouse example, then the variance (SD squared) of \(Y+X\) is the sum of the variances \(\sigma^2_Y+\sigma^2_X\). This implies that the variance of the difference \(Y-X\) is the variance of \(Y+aX\) with \(a=-1\), which is \(\sigma^2_Y+a^2\sigma^2_X=\sigma^2_Y+\sigma^2_X\). So the variance of the difference is also the sum of the variances. If this seems like a counterintuitive result, remember that if \(X\) and \(Y\) are independent of each other, the sign does not really matter. It can be considered random: if \(X\) is normal with a certain variance, for example, so is \(-X\). Finally, another useful result is that the sum of normal variables is again normal.
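
A quick simulation illustrates the variance result (the variables here are arbitrary independent normals, not the mouse data):

set.seed(1)
X <- rnorm(100000, mean = 0, sd = 2)   # independent of Y
Y <- rnorm(100000, mean = 0, sd = 3)
var(Y + X)                             # close to 9 + 4 = 13
var(Y - X)                             # also close to 13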

All this math is very helpful for the purposes of our study because we have two sample averages and are interested in the difference. Because both are normal, the difference is normal as well, and the variance (the standard deviation squared) is the sum of the two variances. Under the null hypothesis that there is no difference between the population averages, the difference between the sample averages \(\bar{Y}-\bar{X}\), with \(\bar{X}\) and \(\bar{Y}\) the sample average for the two diets respectively, is approximated by a normal distribution centered at 0 (there is no difference) and with standard deviation \(\sqrt{\sigma_X^2 +\sigma_Y^2}/\sqrt{N}\).

This suggests that this ratio: \[\frac{\bar{Y}-\bar{X}}{\sqrt{\frac{\sigma_X^2}{M} + \frac{\sigma_Y^2}{N}}}\]

is approximated by a normal distribution centered at 0 and with standard deviation 1. Using this approximation makes computing p-values simple because we know the proportion of the distribution under any value. For example, only about 5% of these values are larger than 2 (in absolute value):

pnorm(-2) + (1 - pnorm(2))
## [1] 0.04550026

We don’t need to buy more mice: 12 and 12 suffice.

However, we can’t claim victory just yet because we don’t know the population standard deviations: \(\sigma_X\) and \(\sigma_Y\). These are unknown population parameters, but we can get around this by using the sample standard deviations, call them \(s_X\) and \(s_Y\). These are defined as: \[s_X^2 = \frac{1}{M-1} \sum_{i=1}^M (X_i - \bar{X})^2 \mbox{ and } s_Y^2 = \frac{1}{N-1} \sum_{i=1}^N (Y_i - \bar{Y})^2\]

Note that we are dividing by \(M-1\) and \(N-1\), instead of by \(M\) and \(N\). There is a theoretical reason for doing this which we do not explain here. But to get an intuition, think of the case when you just have 2 numbers. The average distance to the mean is basically 1/2 the difference between the two numbers, so you really just have information from one number. This is somewhat of a minor point. The main point is that \(s_X\) and \(s_Y\) serve as estimates of \(\sigma_X\) and \(\sigma_Y\).
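
Note that R's var and sd already use the \(M-1\) (or \(N-1\)) divisor, so the definition above can be checked directly; here is a small sketch with arbitrary numbers:

x <- c(21, 28, 25, 32, 24)              # arbitrary example values
sum((x - mean(x))^2) / (length(x) - 1)  # s_X^2 computed from the formula
var(x)                                  # same value: var() divides by n - 1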

So we can redefine our ratio as \[\sqrt{N} \frac{\bar{Y}-\bar{X}}{\sqrt{s_X^2 +s_Y^2}}\]

if \(M=N\) or in general, \[\frac{\bar{Y}-\bar{X}}{\sqrt{\frac{s_X^2}{M} + \frac{s_Y^2}{N}}}\]

The CLT tells us that when \(M\) and \(N\) are large, this random variable is normally distributed with mean 0 and SD 1. Thus we can compute p-values using the function pnorm.
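
Putting the pieces together, here is a minimal sketch of how such a p-value would be computed with pnorm. The two samples are simulated stand-ins (12 mice per group with made-up means and SDs), not the real diet data:

set.seed(1)
X <- rnorm(12, mean = 24, sd = 3)   # stand-in control sample, M = 12
Y <- rnorm(12, mean = 27, sd = 3)   # stand-in high-fat sample, N = 12
ratio <- (mean(Y) - mean(X)) / sqrt(var(X)/12 + var(Y)/12)
2 * pnorm(-abs(ratio))              # two-sided p-value under the CLT approximation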

The t-distribution

The CLT relies on large samples, what we refer to as asymptotic results. When the CLT does not apply, there is another option that does not rely on asymptotic results. When the original population from which a random variable, say \(Y\), is sampled is normally distributed with mean 0, then we can calculate the distribution of: \[\sqrt{N} \frac{\bar{Y}}{s_Y}\]

This is the ratio of two random variables, so it is not necessarily normal. The fact that the denominator can be small by chance increases the probability of observing large values. [William Sealy Gosset](https://en.wikipedia.org/wiki/William_Sealy_Gosset), an employee of the Guinness brewing company, deciphered the distribution of this random variable and published a paper under the pseudonym “Student”. The distribution is therefore called Student’s t-distribution. Later we will learn more about how this result is used.
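
To see those heavier tails, here is a quick simulation sketch assuming, as above, a normal population with mean 0 and a small sample size N:

set.seed(1)
N <- 5
ratios <- replicate(10000, {
  y <- rnorm(N)                 # sample from a normal population with mean 0
  sqrt(N) * mean(y) / sd(y)     # the ratio defined above
})
mean(abs(ratios) > 2)           # noticeably more than the ~4.6% a normal would give
2 * pt(-2, df = N - 1)          # the t-distribution with N - 1 degrees of freedom agrees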

Here we will use the mice phenotype data as an example. We start by creating two vectors, one for the control population and one for the high-fat diet population:

dat <- read.csv("mice_pheno.csv") #We downloaded this file in a previous section
controlPopulation <- filter(dat,Sex == "F" & Diet == "chow") %>%  
  select(Bodyweight) %>% unlist
hfPopulation <- filter(dat,Sex == "F" & Diet == "hf") %>%  
  select(Bodyweight) %>% unlist

It is important to keep in mind that what we are assuming to be normal here is the distribution of \(y_1, y_2, \dots, y_n\), not the random variable \(\bar{Y}\). Although we can’t do this in practice, in this illustrative example, we get to see this distribution for both controls and high fat diet mice:

library(rafalib)
mypar(1,2)
hist(hfPopulation)
hist(controlPopulation)

We can use qq-plots to confirm that the distributions are relatively close to being normally distributed. We will explore these plots in more depth in a later section, but the important thing to know is that a qq-plot compares data (on the y-axis) against a theoretical distribution (on the x-axis). If the points fall on the identity line, the data are close to the theoretical distribution.

mypar(1,2)
qqnorm(hfPopulation)
qqline(hfPopulation)
qqnorm(controlPopulation)
qqline(controlPopulation)

The larger the sample, the more forgiving the result is to the weakness of this approximation. In the next section, we will see that for this particular dataset the t-distribution works well even for sample sizes as small as 3.


Exercises

dat <- read.csv("mice_pheno.csv") %>% 
  na.omit()
  1. If a list of numbers has a distribution that is well approximated by the normal distribution, what proportion of these numbers are within one standard deviation away from the list’s average?
pnorm(1) - (1 - pnorm(1))
## [1] 0.6826895
  2. What proportion of these numbers are within two standard deviations away from the list’s average?
pnorm(2) - (1 - pnorm(2))
## [1] 0.9544997
  3. What proportion of these numbers are within three standard deviations away from the list’s average?
pnorm(3) - (1 - pnorm(3))
## [1] 0.9973002
  4. Define y to be the weights of males on the control diet. What proportion of the mice are within one standard deviation away from the average weight (remember to use popsd for the population sd)?
y <- dat %>%
  filter(Sex == "M" & Diet == "chow") %>% 
  .$Bodyweight

z <- (y - mean(y)) / popsd(y)
mean(abs(z) <= 1)
## [1] 0.6950673
  5. What proportion of these numbers are within two standard deviations away from the list’s average?
mean(abs(z) <= 2)
## [1] 0.9461883
  6. What proportion of these numbers are within three standard deviations away from the list’s average?
mean(abs(z) <= 3)
## [1] 0.9910314
  7. Note that the numbers for the normal distribution and our weights are relatively close. Also, notice that we are indirectly comparing quantiles of the normal distribution to quantiles of the mouse weight distribution. We can actually compare all quantiles using a qq-plot. Which of the following best describes the qq-plot comparing mouse weights to the normal distribution?
qqnorm(y)
qqline(y)

  8. Create the above qq-plot for the four populations: males/females on each of the two diets. What is the best explanation for all of these being well approximated by the normal distribution?
yhf <- dat %>%
  filter(Sex == "M" & Diet == "hf") %>% 
  .$Bodyweight

mypar(2,2)

qqnorm(hfPopulation)       # females, high-fat diet
qqline(hfPopulation)

qqnorm(controlPopulation)  # females, chow diet
qqline(controlPopulation)

qqnorm(y)                  # males, chow diet
qqline(y)

qqnorm(yhf)                # males, high-fat diet
qqline(yhf)

  9. Here we are going to use the function replicate to learn about the distribution of random variables. All the above exercises relate to the normal distribution as an approximation of the distribution of a fixed list of numbers or a population. We have not yet discussed probability in these exercises. If the distribution of a list of numbers is approximately normal, then if we pick a number at random from this distribution, it will follow a normal distribution. However, it is important to remember that stating that some quantity has a distribution does not necessarily imply this quantity is random. Also, keep in mind that this is not related to the central limit theorem. The central limit theorem applies to averages of random variables. Let’s explore this concept.

We will now take a sample of size 25 from the population of males on the chow diet. The average of this sample is our random variable. We will use replicate to observe 10,000 realizations of this random variable. Set the seed at 1 and generate these 10,000 averages. Make a histogram and qq-plot of these 10,000 numbers against the normal distribution.

We can see that, as predicted by the CLT, the distribution of the random variable is very well approximated by the normal distribution.

set.seed(1)
avgs <- replicate(10000, mean(sample(y, 25)))
mypar(1,2)
hist(avgs)
qqnorm(avgs); qqline(avgs)

What is the average of the distribution of the sample average?

mean(avgs)
## [1] 30.96159
  10. What is the standard deviation of the distribution of sample averages?
popsd(avgs)
## [1] 0.8419788
  11. According to the CLT, the answer to exercise 9 should be the same as mean(y). You should be able to confirm that these two numbers are very close. Which of the following does the CLT tell us should be close to your answer to exercise 10?
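A quick check of this (a minimal sketch, assuming y and avgs from the previous exercises are still in the workspace):
popsd(y) / sqrt(25)   # the CLT's prediction for the SD of the sample average
popsd(avgs)           # observed SD of the 10,000 sample averages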
  12. In practice we do not know \(\sigma\) (popsd(y)), which is why we can’t use the CLT directly. This is because we see a sample, not the entire distribution. We also can’t use popsd(avgs) because to construct averages, we have to take 10,000 samples, and this is never practical; we usually just get one sample. Instead we have to estimate popsd(y). As described, what we use is the sample standard deviation. Set the seed at 1, then use the replicate function to create 10,000 samples of size 25 and, instead of the sample average, keep the standard deviation. Look at the distribution of the sample standard deviations. It is a random variable. The real population SD is about 4.5. What proportion of the sample SDs are below 3.5?
set.seed(1)
s_sd <- replicate(10000, popsd(sample(y, 25)))
sum(s_sd < 3.5) / 10000
## [1] 0.116
  13. What the answer to question 12 reveals is that the denominator of the t-test is itself a random variable, and the smaller the sample size, the more variability it adds; the normal distribution then stops providing a useful approximation. When the distribution of the population values is approximately normal, as it is for the weights, the t-distribution provides a better approximation. We will see this later on. Here we will look at the difference between the t-distribution and the normal. Use the functions qt and qnorm to get the quantiles of x = seq(0.0001, 0.9999, len=300). Do this for degrees of freedom 3, 10, 30, and 100. Which of the following is true?
x <- seq(0.0001, 0.9999, len = 300)

var(qt(x, df = 3))
## [1] 5.512038
var(qt(x, df = 10))
## [1] 1.393565
var(qt(x, df = 30))
## [1] 1.145476
var(qt(x, df = 100))
## [1] 1.080532
var(qnorm(x))
## [1] 1.055202
sessionInfo()
## R version 3.4.4 (2018-03-15)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_Canada.1252  LC_CTYPE=English_Canada.1252   
## [3] LC_MONETARY=English_Canada.1252 LC_NUMERIC=C                   
## [5] LC_TIME=English_Canada.1252    
## 
## attached base packages:
## [1] methods   stats     graphics  grDevices utils     datasets  base     
## 
## other attached packages:
## [1] bindrcpp_0.2.2 rafalib_1.0.0  dplyr_0.7.5   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.17       knitr_1.20         bindr_0.1.1       
##  [4] magrittr_1.5       tidyselect_0.2.4   R6_2.2.2          
##  [7] rlang_0.2.1        stringr_1.3.1      tools_3.4.4       
## [10] xfun_0.1           htmltools_0.3.6    yaml_2.1.19       
## [13] rprojroot_1.3-2    digest_0.6.15      assertthat_0.2.0  
## [16] tibble_1.4.2       bookdown_0.7       RColorBrewer_1.1-2
## [19] purrr_0.2.5        codetools_0.2-15   glue_1.2.0        
## [22] evaluate_0.10.1    rmarkdown_1.9      blogdown_0.6      
## [25] stringi_1.1.7      compiler_3.4.4     pillar_1.2.3      
## [28] backports_1.1.2    pkgconfig_2.0.1
knitr::write_bib(.packages(), "packages.bib") 

References

Irizarry, Rafael A., and Michael I. Love. 2015. Rafalib: Convenience Functions for Routine Data Exploration. https://CRAN.R-project.org/package=rafalib.

R Core Team. 2018. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2018. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.