Introduction to null distribution

Michael Taylor

2018/03/09

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
dat <- read.csv("femaleMiceWeights.csv")
control <- filter(dat, Diet=="chow") %>% pull(Bodyweight)
treatment <- filter(dat, Diet=="hf") %>% pull(Bodyweight)

obs <- mean(treatment) - mean(control)

We will test if the 3.0208333 grams observed difference is due to chance.

The population below has the weights of all the mice that were sampled from for the control.

population <- read.csv("femaleControlsPopulation.csv") %>% pull(Bodyweight)

The Null hypothesis is that the high fat (hf) diet has no effect. In practice this means that if take multiple samples we will not see a difference between the high fat diet and the chow diet.

n <- 10000
nulls <- vector("numeric", n)
for (i in 1:n){
  control <- sample(population, 12)
  treatment <- sample(population, 12)
  nulls[i] = mean(treatment) - mean(control)
}
print(paste("The max value is",max(nulls),"and the minimum value is",min(nulls)))
## [1] "The max value is 4.8125 and the minimum value is -5.02416666666667"
max(nulls)
## [1] 4.8125

With each computation we can begin to get a sense of the Null distribution. Basically all possible realizations under the null hypothesis.

hist(nulls)

Knowing this null distribution allows us to describe the proportion of values you see for any interval of values. Thus we can see how likely it is to get a value a big a as 3 that we observed under the null hypothesis. This will provide scientific support for our findings based on statistical inference.

We are going to find the proportion of times that the null value was larger than the observed. This amounts to finding the p-value.

p <- sum(nulls > obs) / n
p
## [1] 0.0141

This tells us the proportion of times the null is bigger than the observation. It happens in this case 1.41% of the time.

How much bigger it is in absolute value is:

p_abs <- sum(abs(nulls) > obs) /n

0.0275%

Null Distribution exercises

  1. Set the seed at 1, then using a for-loop take a random sample of 5 mice 1,000 times. Save these averages. What proportion of these 1,000 averages are more than 1 gram away from the average of x ?
avg <- mean(population)
set.seed(1)
n <- 1000
avg_sam <- vector("numeric", n)
for (i in 1:n){
  avg_sam[i] = mean(sample(population, 5)) - avg
}
sum(abs(avg_sam) > 1) / n
## [1] 0.498
  1. We are now going to increase the number of times we redo the sample from 1,000 to 10,000. Set the seed at 1, then using a for-loop take a random sample of 5 mice 10,000 times. Save these averages. What proportion of these 10,000 averages are more than 1 gram away from the average of x ?
avg <- mean(population)
set.seed(1)
n <- 10000
avg_sam <- vector("numeric", n)
for (i in 1:n){
  avg_sam[i] = mean(sample(population, 5)) - avg
}
sum(abs(avg_sam) > 1) / n
## [1] 0.4976
  1. Set the seed at 1, then using a for-loop take a random sample of 50 mice 1,000 times. Save these averages. What proportion of these 1,000 averages are more than 1 gram away from the average of x ?
avg <- mean(population)
set.seed(1)
n <- 1000
avg_sam <- vector("numeric", n)
for (i in 1:n){
  avg_sam[i] = mean(sample(population, 50)) - avg
}
sum(abs(avg_sam) > 1) / n
## [1] 0.019

Probability distributions exercises

We will use the data set called “Gapminder” which is available as an R-package on Github. This data set contains the life expectancy, GDP per capita, and population by country, every five years, from 1952 to 2007. It is an excerpt of a larger and more comprehensive set of data available on Gapminder.org, and the R package of this data set was created by the statistics professor Jennifer Bryan.

library(gapminder)
data(gapminder)
head(gapminder)
## # A tibble: 6 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.

Create a vector ‘x’ of the life expectancies of each country for the year 1952. Plot a histogram of these life expectancies to see the spread of the different countries.

# Vector of life expectancy.
x <- filter(gapminder, year==1952) %>% pull(lifeExp)
class(x)
## [1] "numeric"
  1. What is the proportion of countries in 1952 that have a life expectancy less than or equal to 40?
mean(x <= 40)
## [1] 0.2887324
  1. What is the proportion of countries in 1952 that have a life expectancy between 40 and 60 years? Hint: this is the proportion that have a life expectancy less than or equal to 60 years, minus the proportion that have a life expectancy less than or equal to 40 years.
mean(x <= 60) - mean(x <= 40)
## [1] 0.4647887

sapply() on a custom function

Suppose we want to plot the proportions of countries with life expectancy 'q' for a range of different years. R has a built in function for this, plot(ecdf(x)), but suppose we didn’t know this. The function is quite easy to build, by turning the code from question 1.1 into a custom function, and then using sapply(). Our custom function will take an input variable 'q', and return the proportion of countries in 'x' less than or equal to q.

prop = function(q){
  mean(x <= q)
}

Testing the function against result we have obtained earlier.

prop(40)
## [1] 0.2887324

Now let’s build a range of q’s that we can apply the function to:

qs <- seq(min(x), max(x), length.out = 20)
props <- sapply(qs, prop)
head(props)
## [1] 0.007042254 0.028169014 0.063380282 0.105633803 0.204225352 0.288732394
plot(qs, props)

Alternatively, an inline function could have been written in one line: props = sapply(qs, function(q) mean(x <= q)).

plot(ecdf(x))