Population, Samples, and Estimates

Michael Taylor

2018/06/19

library(dplyr)
knitr::opts_chunk$set(cache=TRUE)

Now that we have introduced the idea of a random variable, a null distribution, and a p-value, we are ready to describe the mathematical theory that permits us to compute p-values in practice. We will also learn about confidence intervals and power calculations.

Population parameters

A first step in statistical inference is to understand what population you are interested in. In the mouse weight example, we have two populations: female mice on control diets and female mice on high fat diets, with weight being the outcome of interest. We consider this population to be fixed, and the randomness comes from the sampling. One reason we have been using this dataset as an example is because we happen to have the weights of all the mice of this type. Here we download and read in this dataset:

filename <- "mice_pheno.csv"
dat <- read.csv(filename)

We can then access the population values and determine, for example, how many we have. Here we compute the size of the control population:

controlPopulation <- filter(dat,Sex == "F" & Diet == "chow") %>% 
  .$Bodyweight
length(controlPopulation)
## [1] 225

We usually denote these values as \(x_1,\dots,x_m\). In this case, \(m\) is the number computed above. We can do the same for the high fat diet population:

hfPopulation <- dat %>% 
  filter(Sex == "F" & Diet == "hf") %>%  
  .$Bodyweight
length(hfPopulation)
## [1] 200

and denote with \(y_1,\dots,y_n\).

We can then define summaries of interest for these populations, such as the mean and variance.

The mean: \[\mu_X = \frac{1}{m}\sum_{i=1}^m x_i \mbox{ and } \mu_Y = \frac{1}{n} \sum_{i=1}^n y_i\]

The variance: \[\sigma_X^2 = \frac{1}{m}\sum_{i=1}^m (x_i-\mu_X)^2 \mbox{ and } \sigma_Y^2 = \frac{1}{n} \sum_{i=1}^n (y_i-\mu_Y)^2\]

with the standard deviation being the square root of the variance. We refer to such quantities that can be obtained from the population as population parameters. The question we started out asking can now be written mathematically: is \(μY−μX=0\)?

Although in our illustration we have all the values and can check if this is true, in practice we do not. For example, in practice it would be prohibitively expensive to buy all the mice in a population. Here we learn how taking a sample permits us to answer our questions. This is the essence of statistical inference.

Sample estimates

In the previous chapter, we obtained samples of 12 mice from each population. We represent data from samples with capital letters to indicate that they are random. This is common practice in statistics, although it is not always followed. So the samples are \(X_1,\dots,X_M\) and \(Y_1,\dots,Y_N\) and, in this case, \(N=M=12\). In contrast and as we saw above, when we list out the values of the population, which are set and not random, we use lower-case letters.

Since we want to know if \(μY−μX\) is 0, we consider the sample version: \(\bar{Y}-\bar{X}\)with: \[\bar{X}=\frac{1}{M} \sum_{i=1}^M X_i \mbox{ and }\bar{Y}=\frac{1}{N} \sum_{i=1}^N Y_i.\]

Note that this difference of averages is also a random variable. Previously, we learned about the behavior of random variables with an exercise that involved repeatedly sampling from the original distribution. Of course, this is not an exercise that we can execute in practice. In this particular case it would involve buying 24 mice over and over again. Here we described the mathematical theory that mathematically relates \(\bar{X}\) to \(μX\) and \(\bar{Y}\) to \(μY\), that will in turn help us understand the relationship between \(\bar{Y}-\bar{X}\) and \(μY−μX\). Specifically, we will describe how the Central Limit Theorem permits us to use an approximation to answer this question, as well as motivate the widely used t-distribution. ***

Exercises

We will read and remove the lines that contain missing values:

filename <- "mice_pheno.csv"
dat <- read.csv(filename) %>% na.omit()
  1. Use dplyr to create a vector x with the body weight of all males on the control (chow) diet. What is this population’s average?
x <- dat %>% 
  filter(Sex=="M" & Diet=="chow") %>% 
  .$Bodyweight
(mu_x <- mean(x))
## [1] 30.96381
  1. Now use the rafalib package and use the popsd function to compute the population standard deviation.
library(rafalib)
popsd(x)
## [1] 4.420501
  1. Set the seed at 1. Take a random sample \(X\) of size 25 from x. What is the sample average?
set.seed(1)
(xbar <- mean(sample(x, 25)))
## [1] 32.0956
  1. Use dplyr to create a vector y with the body weight of all males on the high fat (hf) diet. What is this population’s average?
y <- dat %>% 
  filter(Sex=="M" & Diet=="hf") %>% 
  .$Bodyweight
(mu_y <- mean(y))
## [1] 34.84793
  1. Now use the rafalib package and use the popsd function to compute the population standard deviation.
popsd(y)
## [1] 5.574609
  1. Set the seed at 1. Take a random sample \(Y\) of size 25 from y. What is the sample average?
set.seed(1)
(ybar <- mean(sample(y, 25)))
## [1] 34.768
  1. What is the difference in absolute value between \(\bar{y}-\bar{x}\) and \(\bar{X}-\bar{Y}\)?
abs( (mu_y - mu_x) - (ybar - xbar) )
## [1] 1.211716
  1. Repeat the above for females. Make sure to set the seed to 1 before each sample call. What is the difference in absolute value between \(\bar{y}-\bar{x}\) and \(\bar{X}-\bar{Y}\)?
x <- dat %>% 
  filter(Sex=="F" & Diet=="chow") %>% 
  .$Bodyweight
mu_x <- mean(x)

y <- dat %>% 
  filter(Sex=="F" & Diet=="hf") %>% 
  .$Bodyweight
mu_y <- mean(y)

set.seed(1)
xbar <- mean(sample(x, 25))
set.seed(1)
ybar <- mean(sample(y, 25))

abs( (mu_y - mu_x) - (ybar - xbar) )
## [1] 0.7364828
  1. For the females, our sample estimates were closer to the population difference than with males. What is a possible explanation for this?

The population variance of the females is smaller than that of the males; thus, the sample variable has less variability.

sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_Canada.1252  LC_CTYPE=English_Canada.1252   
## [3] LC_MONETARY=English_Canada.1252 LC_NUMERIC=C                   
## [5] LC_TIME=English_Canada.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] rafalib_1.0.0  bindrcpp_0.2.2 dplyr_0.7.6   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.17       knitr_1.20         bindr_0.1.1       
##  [4] magrittr_1.5       tidyselect_0.2.4   R6_2.2.2          
##  [7] rlang_0.2.1        stringr_1.3.1      tools_3.5.1       
## [10] xfun_0.3           htmltools_0.3.6    yaml_2.1.19       
## [13] rprojroot_1.3-2    digest_0.6.15      assertthat_0.2.0  
## [16] tibble_1.4.2       bookdown_0.7       RColorBrewer_1.1-2
## [19] purrr_0.2.5        codetools_0.2-15   glue_1.2.0        
## [22] evaluate_0.10.1    rmarkdown_1.10     blogdown_0.7      
## [25] stringi_1.1.7      compiler_3.5.1     pillar_1.2.3      
## [28] backports_1.1.2    pkgconfig_2.0.1
knitr::write_bib(.packages(), "packages.bib") 

References

Irizarry, Rafael A., and Michael I. Love. 2015. Rafalib: Convenience Functions for Routine Data Exploration. https://CRAN.R-project.org/package=rafalib.

R Core Team. 2018. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2018. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.