t-tests in Practice

Michael Taylor

2018/04/10

Introduction

We will now demonstrate how to obtain a p-value in practice. We begin by loading experimental data and walking you through the steps used to form a t-statistic and compute a p-value. We can perform this task with just a few lines of code (go to the end of section to see them). However, to understand the concepts, we will construct a t-statistic from “scratch”.

suppressPackageStartupMessages(library(dplyr))
knitr::opts_chunk$set(cache=TRUE)

Read in and prepare data

We start by reading in the data. A first important step is to identify which rows are associated with treatment and control, and to compute the difference in mean.

dat <- read.csv("femaleMiceWeights.csv")

control <- dat %>% 
  filter(Diet == "chow") %>% .$Bodyweight
treatment <- dat %>%
  filter(Diet == "hf") %>% .$Bodyweight

(diff <- mean(treatment) - mean(control))
## [1] 3.020833

We are asked to report a p-value. What do we do? We learned that diff, referred to as the observed effect size, is a random variable. We also learned that under the null hypothesis, the mean of the distribution of diff is 0. What about the standard error? We also learned that the standard error of this random variable is the population standard deviation divided by the square root of the sample size: \[SE(\bar{X}) = \sigma / \sqrt{N}\]

We use the sample standard deviation as an estimate of the population standard deviation. In R, we simply use the sd function and the SE is:

sd(control) / sqrt(length(control))
## [1] 0.8725323

This is the SE of the sample average, but we actually want the SE of diff. We saw how statistical theory tells us that the variance of the difference of two random variables is the sum of its variances, so we compute the variance and take the square root:

se <- sqrt(
  var(treatment) / length(treatment) +
    var(control) / length(control)
)
se
## [1] 1.469867

Statistical theory tells us that if we divide a random variable by its SE, we get a new random variable with an SE of 1.

(tstat <- diff /se)
## [1] 2.055174

This ratio is what we call the t-statistic. It’s the ratio of two random variables and thus a random variable. Once we know the distribution of this random variable, we can then easily compute a p-value.

As explained in the previous section, the CLT tells us that for large sample sizes, both sample averages mean(treatment) and mean(control) are normal. Statistical theory tells us that the difference of two normally distributed random variables is again normal, so CLT tells us that tstat is approximately normal with mean 0 (the null hypothesis) and SD 1 (we divided by its SE).

So now to calculate a p-value all we need to do is ask: how often does a normally distributed random variable exceed diff? R has a built-in function, pnorm, to answer this specific question. pnorm(a) returns the probability that a random variable following the standard normal distribution falls below a. To obtain the probability that it is larger than a, we simply use 1-pnorm(a). We want to know the probability of seeing something as extreme as diff: either smaller (more negative) than -abs(diff) or larger than abs(diff). We call these two regions “tails” and calculate their size:

righttail <- 1 - pnorm(abs(tstat)) 
lefttail <- pnorm(-abs(tstat))
pval <- lefttail + righttail
print(pval)
## [1] 0.0398622

In this case, the p-value is smaller than 0.05 and using the conventional cutoff of 0.05, we would call the difference statistically significant.

Now there is a problem. CLT works for large samples, but is 12 large enough? A rule of thumb for CLT is that 30 is a large enough sample size (but this is just a rule of thumb). The p-value we computed is only a valid approximation if the assumptions hold, which do not seem to be the case here. However, there is another option other than using CLT.

The t-distribution in Practice

As described earlier, statistical theory offers another useful result. If the distribution of the population is normal, then we can work out the exact distribution of the t-statistic without the need for the CLT. This is a big “if” given that, with small samples, it is hard to check if the population is normal. But for something like weight, we suspect that the population distribution is likely well approximated by normal and that we can use this approximation. Furthermore, we can look at a qq-plot for the samples. This shows that the approximation is at least close:

library(rafalib)
mypar(1,2)

qqnorm(treatment)
qqline(treatment,col=2)

qqnorm(control)
qqline(control,col=2)

If we use this approximation, then statistical theory tells us that the distribution of the random variable tstat follows a t-distribution. This is a much more complicated distribution than the normal. The t-distribution has a location parameter like the normal and another parameter called degrees of freedom. R has a nice function that actually computes everything for us.

t.test(treatment, control)
## 
##  Welch Two Sample t-test
## 
## data:  treatment and control
## t = 2.0552, df = 20.236, p-value = 0.053
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.04296563  6.08463229
## sample estimates:
## mean of x mean of y 
##  26.83417  23.81333

To see just the p-value, we can use the $ extractor:

result <- t.test(treatment,control)
result$p.value
## [1] 0.05299888

The p-value is slightly bigger now. This is to be expected because our CLT approximation considered the denominator of tstat practically fixed (with large samples it practically is), while the t-distribution approximation takes into account that the denominator (the standard error of the difference) is a random variable. The smaller the sample size, the more the denominator varies.

It may be confusing that one approximation gave us one p-value and another gave us another, because we expect there to be just one answer. However, this is not uncommon in data analysis. We used different assumptions, different approximations, and therefore we obtained different results.

Later, in the power calculation section, we will describe type I and type II errors. As a preview, we will point out that the test based on the CLT approximation is more likely to incorrectly reject the null hypothesis (a false positive), while the t-distribution is more likely to incorrectly accept the null hypothesis (false negative).

Running the t-test in practice

dat <- read.csv("mice_pheno.csv")
control <- filter(dat,Diet=="chow") %>% select(Bodyweight) 
treatment <- filter(dat,Diet=="hf") %>% select(Bodyweight) 
t.test(treatment,control)
## 
##  Welch Two Sample t-test
## 
## data:  treatment and control
## t = 7.1932, df = 735.02, p-value = 1.563e-12
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.231533 3.906857
## sample estimates:
## mean of x mean of y 
##  30.48201  27.41281

The arguments to t.test can be of type data.frame and thus we do not need to unlist them into numeric objects.


Demonstration of how the central limit theorem is used in conjunction with t-statistics to obtain p-values.

dat <- read.csv("femaleMiceWeights.csv")

control <- filter(dat, Diet=="chow") %>% 
  select(Bodyweight) %>% .$Bodyweight

treatment <- filter(dat, Diet=="hf") %>% 
  select(Bodyweight) %>% .$Bodyweight

N <- length(treatment)
# difference of means
obs <- mean(treatment) - mean(control)

se <- sqrt(
  var(treatment) / N + 
    var(control) / N  )

tstat <- obs / se
# p value for the two tails given by:
2 * (1 - pnorm(tstat))
## [1] 0.0398622
population <- read.csv("femaleControlsPopulation.csv") %>% .$Bodyweight
n <- 10000
nulls <- vector("numeric", n)
for(i in 1:n) {
  control <- sample(population, N)
  treatment <- sample(population, N)
  se <- sqrt(
  var(treatment) / N + 
    var(control) / N  )

  nulls[i] <- (mean(treatment) - mean(control)) / se
}
library(ggplot2)
# ggplot needs a data.frame
# `scale()` converts standard deviation to standard units.
p <- as.data.frame(nulls) %>% ggplot(aes(sample=scale(nulls)),intercept = 0, slope = 1)

p + geom_qq() +
  geom_abline() +
  theme_minimal()

Using a sample of 3.

nulls <- vector("numeric", n)
for(i in 1:n) {
  control <- sample(population, 3)
  treatment <- sample(population, 3)
  se <- sqrt(
  var(treatment) / 3 + 
    var(control) / 3  )

  nulls[i] <- (mean(treatment) - mean(control)) / se
}
p <- as.data.frame(nulls) %>% ggplot(aes(sample=scale(nulls)),intercept = 0, slope = 1)

p + geom_qq() +
  geom_abline() +
  theme_minimal()

With a sample of 3 the normal approximation is not very good. In this case would be best to use the t-statistic.

Practice 2

library(dplyr)
dat <- read.csv("femaleMiceWeights.csv")

control <- filter(dat,Diet=="chow") %>% select(Bodyweight) %>% .$Bodyweight

treatment <- filter(dat,Diet=="hf") %>% select(Bodyweight) %>% .$Bodyweight

# compute mean difference and standard deviation
ttest<- t.test(treatment, control)
ttest
## 
##  Welch Two Sample t-test
## 
## data:  treatment and control
## t = 2.0552, df = 20.236, p-value = 0.053
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.04296563  6.08463229
## sample estimates:
## mean of x mean of y 
##  26.83417  23.81333

Look at qqnorm

# Create a multi-paneled plotting window
par(mfrow=c(1,2) ) # mfrow=c(nrows, ncols)
qqnorm(control)
qqline(control)

qqnorm(treatment)
qqline(treatment)

sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_Canada.1252  LC_CTYPE=English_Canada.1252   
## [3] LC_MONETARY=English_Canada.1252 LC_NUMERIC=C                   
## [5] LC_TIME=English_Canada.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_3.0.0  rafalib_1.0.0  bindrcpp_0.2.2 dplyr_0.7.6   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.17       pillar_1.2.3       compiler_3.5.1    
##  [4] RColorBrewer_1.1-2 plyr_1.8.4         bindr_0.1.1       
##  [7] tools_3.5.1        digest_0.6.15      evaluate_0.10.1   
## [10] tibble_1.4.2       gtable_0.2.0       pkgconfig_2.0.1   
## [13] rlang_0.2.1        yaml_2.1.19        blogdown_0.7      
## [16] xfun_0.3           withr_2.1.2        stringr_1.3.1     
## [19] knitr_1.20         rprojroot_1.3-2    grid_3.5.1        
## [22] tidyselect_0.2.4   glue_1.2.0         R6_2.2.2          
## [25] rmarkdown_1.10     bookdown_0.7       purrr_0.2.5       
## [28] magrittr_1.5       backports_1.1.2    scales_0.5.0      
## [31] codetools_0.2-15   htmltools_0.3.6    assertthat_0.2.0  
## [34] colorspace_1.3-2   labeling_0.3       stringi_1.1.7     
## [37] lazyeval_0.2.1     munsell_0.5.0
knitr::write_bib(.packages(), "packages.bib") 

Irizarry, Rafael A., and Michael I. Love. 2015. Rafalib: Convenience Functions for Routine Data Exploration. https://CRAN.R-project.org/package=rafalib.

R Core Team. 2018. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2018. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.