Random variables

Michael Taylor

2018/03/09

Introduction to Random Variables

Introduction

library(rafalib)
library(dplyr)
knitr::opts_chunk$set(cache=TRUE)

Our first look at the data

dat <- read.csv("femaleMiceWeights.csv")
head(dat)
##   Diet Bodyweight
## 1 chow      21.51
## 2 chow      28.14
## 3 chow      24.04
## 4 chow      23.45
## 5 chow      23.68
## 6 chow      19.79
control <- filter(dat, Diet=="chow") %>% pull(Bodyweight)
treatment <- filter(dat, Diet=="hf") %>% pull(Bodyweight)

The “chow” group is the control group and the “treatment” group is the one fed the high-fat (‘hf’) diet.

print(paste("The mean treatment is ", round(mean(treatment), digits = 2) ))
## [1] "The mean treatment is  26.83"
print(paste("The mean control is ", round(mean(control), digits = 2) ))
## [1] "The mean control is  23.81"
(obsdiff <- mean(treatment) - mean(control))
## [1] 3.020833

Random Variables

Being skeptics, we cannot say that the difference between the two groups is due to the difference in diet. For all we know, this difference could be due to chance. This introduces us to the concept of a random variable.

This particular data set is special because we have access to the weights of the entire population. We can sample from it repeatedly to see how the difference between the groups changes with each new sample.

# weights for the entire control population (same file used in the Exercises below)
population <- read.csv("femaleControlsPopulation.csv") %>% pull(Bodyweight)
head(population)
## [1] 27.03 24.80 27.02 28.07 23.55 22.72

Now let’s sample 12 mice three times and see how the average changes.

(control <- mean(sample(population,12)))
## [1] 25.285
(control <- mean(sample(population,12)))
## [1] 24.645
(control <- mean(sample(population,12)))
## [1] 24.12167

With each new sample, the computed average is a different number. In practice, this is not the situation: whatever set of mice you are given is the one you have. It therefore makes sense to think of the sample averages of your treatment and control groups as random variables, just like the averages computed above from samples of 12.

The Null Hypothesis

Now let’s go back to our average difference of obsdiff, 3.0208333. As scientists we need to be skeptics. How do we know that this obsdiff is due to the diet? What happens if we give all 24 mice the same diet? Will we see a difference this big? Statisticians refer to this scenario as the null hypothesis. The name “null” is used to remind us that we are acting as skeptics: we give credence to the possibility that there is no difference.

Because we have access to the population, we can actually observe as many values as we want of the difference of the averages when the diet has no effect. We can do this by randomly sampling 24 control mice, giving them the same diet, and then recording the difference in mean between two randomly split groups of 12 and 12. Here is this process written in R code:

control <- sample(population,12)
treatment <- sample(population,12)
print(mean(treatment) - mean(control))
## [1] 0.7458333

Now let’s do it 10,000 times. We will use a “for-loop”, an operation that lets us automate this (a simpler approach, which we will learn about later, is to use replicate).

n <- 10000
null <- vector("numeric",n)
for (i in 1:n) {
  control <- sample(population,12)
  treatment <- sample(population,12)
  null[i] <- mean(treatment) - mean(control)
}
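
As a preview of that simpler approach, the same kind of null distribution can be built with replicate. This is only a sketch: it reuses the population vector and n from above, and null_alt is a hypothetical name not used elsewhere in this post.

# Sketch: build a second null distribution with replicate instead of a for-loop
null_alt <- replicate(n, {
  control <- sample(population, 12)
  treatment <- sample(population, 12)
  mean(treatment) - mean(control)
})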

The values in null form what we call the null distribution. We will define this more formally below.

So what percent of the 10,000 simulated differences are greater than or equal to obsdiff?

mean(null >= obsdiff)
## [1] 0.0136

Only a small percent of the 10,000 simulations. As skeptics what do we conclude? When there is no diet effect, we see a difference as big as the one we observed only 1.36% of the time. This is what is known as a p-value, which we will define more formally later in the book.

Distributions

We have explained what we mean by null in the context of null hypothesis, but what exactly is a distribution? The simplest way to think of a distribution is as a compact description of many numbers. For example, suppose you have measured the heights of all men in a population. Imagine you need to describe these numbers to someone that has no idea what these heights are, such as an alien that has never visited Earth. Suppose all these heights are contained in the following dataset:

library(UsingR)
x <- father.son$fheight

One approach to summarizing these numbers is to simply list them all out for the alien to see. Here are 10 randomly selected heights out of the 1,078:

round(sample(x, 10), 1)
##  [1] 67.9 67.1 65.2 69.0 71.3 63.9 67.6 69.2 67.1 65.0

Cumulative Distribution Function

Scanning through these numbers, we start to get a rough idea of what the entire list looks like, but it is certainly inefficient. We can quickly improve on this approach by defining and visualizing a distribution. To define a distribution we compute, for all possible values of \(a\), the proportion of numbers in our list that are below \(a\). We use the following notation: \[F(a) \equiv \mbox{Pr}(x \leq a)\] This is called the cumulative distribution function (CDF). When the CDF is derived from data, as opposed to theoretically, we also call it the empirical CDF (ECDF). We can plot \(F(a)\) versus \(a\) like this:

smallest <- floor( min(x) )
largest <- ceiling( max(x) )
values <- seq(smallest, largest,len=300)
heightecdf <- ecdf(x)
plot(values, heightecdf(values), type="l",
     xlab="a (Height in inches)",ylab="Pr(x <= a)")

The ecdf function is a function that returns a function, which is not typical behavior of R functions. For that reason, we won’t discuss it further here. Furthermore, the ecdf is actually not as popular as histograms, which give us the same information, but show us the proportion of values in intervals: \[\mbox{Pr}(a \leq x \leq b) = F(b) - F(a)\]
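
To make the definition concrete, here is a quick check (the cutoffs of 70 and 71 inches are arbitrary choices): the ECDF evaluated at a point is simply the proportion of values at or below that point, and differences of the ECDF give the proportion of values in an interval.

a <- 70; b <- 71
mean(x <= a)             # proportion of heights at or below a
ecdf(x)(a)               # the same value from the ECDF
mean(x > a & x <= b)     # proportion of heights in (a, b]
ecdf(x)(b) - ecdf(x)(a)  # F(b) - F(a)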

Plotting these heights as bars is what we call a histogram. It is a more useful plot because we are usually more interested in intervals, such as what percent of heights fall between 70 inches and 71 inches, rather than the percent less than a particular height. It is also easier to distinguish different types (families) of distributions by looking at histograms. Here is a histogram of heights:

hist(x)

We can specify the bins and add better labels in the following way:

bins <- seq(smallest, largest)
hist(x,breaks=bins,
     xlab="Height (in inches)",
     main="Adult men heights")

Showing this plot to the alien is much more informative than showing numbers. With this simple plot, we can approximate the number of individuals in any given interval. For example, there are about 70 individuals over six feet (72 inches) tall.
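
We can check that reading directly against the data; the 72-inch cutoff is the one used in the sentence above.

sum(x > 72)  # number of men taller than six feet (72 inches)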

Probability Distribution

Summarizing lists of numbers is one powerful use of distributions. An even more important use is describing the possible outcomes of a random variable. Unlike a fixed list of numbers, we don’t actually observe all possible outcomes of random variables, so instead of describing proportions, we describe probabilities. For instance, if we pick a random height from our list, then the probability of it falling between \(a\) and \(b\) is denoted with: \[\mbox{Pr}(a \leq X \leq b) = F(b) - F(a)\]

Note that the \(X\) is now capitalized to distinguish it as a random variable and that the equation above defines the probability distribution of the random variable. Knowing this distribution is incredibly useful in science. For example, in the case above, if we know the distribution of the difference in mean of mouse weights when the null hypothesis is true, referred to as the null distribution, we can compute the probability of observing a value as large as we did, referred to as a p-value. In a previous section we ran what is called a Monte Carlo simulation (we will provide more details on Monte Carlo simulation in a later section) and we obtained 10,000 outcomes of the random variable under the null hypothesis.
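
To make the distinction between proportions and probabilities concrete with the height example, here is a small Monte Carlo sketch (the interval from 69 to 72 inches and the number of draws are arbitrary choices): drawing a random height \(X\) many times and recording how often it lands in the interval approximates \(F(b) - F(a)\).

set.seed(1)                                # seed chosen only to make the sketch reproducible
a <- 69; b <- 72
draws <- sample(x, 10000, replace = TRUE)  # 10,000 realizations of the random variable X
mean(draws > a & draws <= b)               # Monte Carlo estimate of the probability of landing in (a, b]
ecdf(x)(b) - ecdf(x)(a)                    # F(b) - F(a) computed from the full list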

From a histogram of the null vector we calculated earlier, we can see that values as large as obsdiff are relatively rare:

hist(null, freq=TRUE)
abline(v=obsdiff, col="red", lwd=2)

An important point to keep in mind here is that while we defined \(Pr(a)\) by counting cases, we will learn that, in some circumstances, mathematics gives us formulas for \(Pr(a)\) that save us the trouble of computing them as we did here. One example of this powerful approach uses the normal distribution approximation.

Normal Distribution

The probability distribution we see above approximates one that is very common in nature: the bell curve, also known as the normal distribution or Gaussian distribution. When the histogram of a list of numbers approximates the normal distribution, we can use a convenient mathematical formula to approximate the proportion of values or outcomes in any given interval: \[\mbox{Pr}(a < x < b) = \int_a^b \frac{1}{\sqrt{2\pi\sigma^2}} \exp{\left( \frac{-(x-\mu)^2}{2 \sigma^2} \right)} \, dx\]

This formula is stored in a more convenient form as pnorm in R, which sets \(a\) to \(-\infty\) and takes \(b\) as an argument.
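
As a quick sanity check of this correspondence, we can integrate the density numerically and compare the result with pnorm; the interval and parameters below are arbitrary choices for illustration.

mu <- 0; sigma <- 1; a <- -1; b <- 1
integrate(function(y) dnorm(y, mu, sigma), a, b)$value  # area under the curve from a to b
pnorm(b, mu, sigma) - pnorm(a, mu, sigma)               # the same quantity via the CDF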

Here \(\mu\) and \(\sigma\) are the mean and standard deviation of the population. If the normal approximation holds for our list, then the population mean and variance of our list can be used in the formula above. An example is our earlier observation that only 1.36% of values in the null distribution were above obsdiff. We can compute the proportion of values below a value x with pnorm(x, mu, sigma) without knowing all the values. The normal approximation works very well here:

1 - pnorm(obsdiff,mean(null),sd(null)) 
## [1] 0.0123337

Later, we will learn that there is a mathematical explanation for this. A very useful characteristic of this approximation is that one only needs to know \(\mu\) and \(\sigma\) to describe the entire distribution. From this, we can compute the proportion of values in any interval.
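
As a sketch of this idea applied to our simulation (reusing the null vector, with an arbitrary interval of -1 to 1 grams): knowing only the mean and standard deviation, pnorm closely reproduces the empirical proportion of values in the interval.

mu <- mean(null); sigma <- sd(null)
mean(null > -1 & null <= 1)                 # empirical proportion of null values in (-1, 1]
pnorm(1, mu, sigma) - pnorm(-1, mu, sigma)  # normal approximation using only mu and sigma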

Exercises

x <- read.csv("femaleControlsPopulation.csv")[[1]]

Here x represents the weights for the entire population.

  1. What is the average of these weights?
mean(x)
## [1] 23.89338

  2. After setting the seed at 1 with set.seed(1), take a random sample of size 5. What is the absolute value (use abs) of the difference between the average of the sample and the average of all the values?

set.seed(1)
abs(mean(sample(x, 5)) - mean(x))
  3. After setting the seed at 5 with set.seed(5), take a random sample of size 5. What is the absolute value of the difference between the average of the sample and the average of all the values?
set.seed(5)
abs(mean(sample(x, 5)) - mean(x))
## [1] 1.433378
  4. Why are the answers from 2 and 3 different?
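
A quick way to see why: the two exercises use different seeds, so they draw different random samples, and the sample average is a random variable that changes with the sample. The code below simply reproduces the two samples.

set.seed(1); sample(x, 5)  # the sample drawn in exercise 2
set.seed(5); sample(x, 5)  # the sample drawn in exercise 3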
  5. Set the seed at 1, then using a for-loop take a random sample of 5 mice 1,000 times. Save these averages. What percent of these 1,000 averages are more than 1 ounce away from the average of x?
n <- 1000
averages5 <- vector("numeric",n)
set.seed(1)
for (i in 1:n) {
  averages5[i] <- mean(sample(x, 5))
}
mean(abs(averages5 - mean(x)) > 1)
  6. We are now going to increase the number of times we redo the sample from 1,000 to 10,000. Set the seed at 1, then using a for-loop take a random sample of 5 mice 10,000 times. Save these averages. What percent of these 10,000 averages are more than 1 ounce away from the average of x?
n <- 10000
averages5_10000 <- vector("numeric", n)
set.seed(1)
for (i in 1:n) {
  averages5_10000[i] <- mean(sample(x, 5))
}
mean(abs(averages5_10000 - mean(x)) > 1)
  7. Note that the answers to 5 and 6 barely changed. This is expected. The way we think about the distribution of a random variable is as the distribution of the list of values obtained if we repeated the experiment an infinite number of times. On a computer, we can’t perform an infinite number of iterations, so instead, for our examples, we consider 1,000 to be large enough, and thus 10,000 is as well. Now if instead we change the sample size, then we change the random variable and thus its distribution.

Set the seed at 1, then using a for-loop take a random sample of 50 mice 1,000 times. Save these averages. What percent of these 1,000 averages are more than 1 ounce away from the average of x?

n <- 1000
averages50 <- vector("numeric",n)
set.seed(1)
for (i in 1:n) {
  averages50[i] <- mean(sample(x, 50))
}
mean(abs(averages50 - mean(x)) > 1)
  8. Use a histogram to “look” at the distribution of averages we get with a sample size of 5 and a sample size of 50. How would you say they differ?
hist(averages5)

hist(averages50)

They both look roughly normal, but with a sample size of 50 the spread is smaller.
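
A quick numeric check of that difference in spread, using the averages5 and averages50 vectors computed above:

sd(averages5)   # spread of averages from samples of size 5
sd(averages50)  # spread of averages from samples of size 50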

  9. For the last set of averages, the ones obtained from a sample size of 50, what percent are between 23 and 25?
mean(averages50 >= 23 & averages50 <= 25)
## [1] 0.976
  10. Now ask the same question of a normal distribution with average 23.9 and standard deviation 0.43.

Proportions below a certain number can be obtained with the R convenience function pnorm(x, mu, sigma).

pnorm(25, 23.9, 0.43) - pnorm(23, 23.9, 0.43)
## [1] 0.9765648

\[\mbox{Pr}(a < x < b) = \int_a^b \frac{1}{\sqrt{2\pi\sigma^2}} \exp{\left( \frac{-(x-\mu)^2}{2 \sigma^2} \right)} \, dx\]

In the pnorm function, \(a\) is set to \(-\infty\) and \(b\) is taken as an argument. \(\mu\) and \(\sigma\) are the mean and standard deviation, respectively.

sessionInfo()
## R version 3.4.4 (2018-03-15)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_Canada.1252  LC_CTYPE=English_Canada.1252   
## [3] LC_MONETARY=English_Canada.1252 LC_NUMERIC=C                   
## [5] LC_TIME=English_Canada.1252    
## 
## attached base packages:
## [1] methods   stats     graphics  grDevices utils     datasets  base     
## 
## other attached packages:
##  [1] UsingR_2.0-6    Hmisc_4.1-1     ggplot2_2.2.1   Formula_1.2-3  
##  [5] survival_2.42-3 lattice_0.20-35 HistData_0.8-4  MASS_7.3-50    
##  [9] downloader_0.4  bindrcpp_0.2.2  dplyr_0.7.5     rafalib_1.0.0  
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_0.2.4    xfun_0.1            purrr_0.2.5        
##  [4] splines_3.4.4       colorspace_1.3-2    htmltools_0.3.6    
##  [7] yaml_2.1.19         base64enc_0.1-3     rlang_0.2.1        
## [10] pillar_1.2.3        foreign_0.8-70      glue_1.2.0         
## [13] RColorBrewer_1.1-2  bindr_0.1.1         plyr_1.8.4         
## [16] stringr_1.3.1       munsell_0.4.3       blogdown_0.6       
## [19] gtable_0.2.0        htmlwidgets_1.2     codetools_0.2-15   
## [22] evaluate_0.10.1     latticeExtra_0.6-28 knitr_1.20         
## [25] htmlTable_1.12      Rcpp_0.12.17        acepack_1.4.1      
## [28] backports_1.1.2     scales_0.5.0        checkmate_1.8.5    
## [31] gridExtra_2.3       digest_0.6.15       stringi_1.1.7      
## [34] bookdown_0.7        grid_3.4.4          rprojroot_1.3-2    
## [37] tools_3.4.4         magrittr_1.5        lazyeval_0.2.1     
## [40] tibble_1.4.2        cluster_2.0.7-1     pkgconfig_2.0.1    
## [43] Matrix_1.2-14       data.table_1.11.4   assertthat_0.2.0   
## [46] rmarkdown_1.9       rstudioapi_0.7      R6_2.2.2           
## [49] rpart_4.1-13        nnet_7.3-12         compiler_3.4.4
knitr::write_bib(.packages(), "packages.bib") 
## tweaking Hmisc
