`library(dplyr)`

```
##
## Attaching package: 'dplyr'
```

```
## The following objects are masked from 'package:stats':
##
## filter, lag
```

```
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
```

```
dat <- read.csv("femaleMiceWeights.csv")
control <- filter(dat, Diet=="chow") %>% pull(Bodyweight)
treatment <- filter(dat, Diet=="hf") %>% pull(Bodyweight)
obs <- mean(treatment) - mean(control)
```

We will test whether the observed difference of 3.0208333 grams is due to chance.

The dataset below contains the weights of the entire population of control mice from which the control sample was drawn.

`population <- read.csv("femaleControlsPopulation.csv") %>% pull(Bodyweight)`

The null hypothesis is that the high fat (hf) diet has no effect. In practice, this means that if we take multiple samples we will not see a difference between the high fat diet and the chow diet.

```
n <- 10000
nulls <- vector("numeric", n)
for (i in 1:n){
  control <- sample(population, 12)
  treatment <- sample(population, 12)
  nulls[i] <- mean(treatment) - mean(control)
}
```
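Each pass through the loop draws fresh random samples, so the results vary from run to run; `set.seed()` makes them reproducible. A minimal sketch, using a simulated stand-in population (not the CSV) since it only illustrates how `sample()` behaves:

```
# Reproducible sampling: the same seed gives the same draws.
# population_demo is simulated stand-in data, not the real control population.
population_demo <- rnorm(225, mean = 24, sd = 3.5)

set.seed(1)
a <- sample(population_demo, 12)
set.seed(1)
b <- sample(population_demo, 12)
identical(a, b)  # TRUE: same seed, same sample
```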

`print(paste("The max value is",max(nulls),"and the minimum value is",min(nulls)))`

`## [1] "The max value is 4.8125 and the minimum value is -5.02416666666667"`

`max(nulls)`

`## [1] 4.8125`

With each computation we begin to get a sense of the null distribution: the set of all possible realizations of the statistic under the null hypothesis.

`hist(nulls)`

Knowing this null distribution allows us to describe the proportion of values we would see in any interval. Thus we can see how likely it is, under the null hypothesis, to get a value as big as the 3 grams we observed. This provides scientific support for our findings based on statistical inference.

We are going to find the proportion of times that the null value was larger than the observed difference. This amounts to computing the p-value.

```
p <- sum(nulls > obs) / n
p
```

`## [1] 0.0141`

This tells us the proportion of times the null statistic is bigger than the observed difference. In this case it happens 1.41% of the time.

The proportion of times the null statistic is bigger than the observed difference in absolute value is:

`p_abs <- sum(abs(nulls) > obs) / n`

This two-sided proportion is 0.0275, i.e. 2.75% of the time.
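Note that dividing a sum of logicals by `n` is the same as taking `mean()` of the logical vector, since `TRUE`/`FALSE` are coerced to 1/0. A sketch with made-up numbers:

```
# Made-up null values and observed difference, just to show the equivalence.
nulls_demo <- c(-2, 0.5, 3.5, -4, 1)
obs_demo <- 3

sum(abs(nulls_demo) > obs_demo) / length(nulls_demo)  # 0.4
mean(abs(nulls_demo) > obs_demo)                      # 0.4, the same result
```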

### Null Distribution exercises

- Set the seed at 1, then using a for-loop take a random sample of 5 mice 1,000 times. Save these averages. What proportion of these 1,000 averages are more than 1 gram away from the average of `x`?

```
avg <- mean(population)
set.seed(1)
n <- 1000
avg_sam <- vector("numeric", n)
for (i in 1:n){
  avg_sam[i] <- mean(sample(population, 5)) - avg
}
sum(abs(avg_sam) > 1) / n
```

`## [1] 0.498`

- We are now going to increase the number of times we redo the sample from 1,000 to 10,000. Set the seed at 1, then using a for-loop take a random sample of 5 mice 10,000 times. Save these averages. What proportion of these 10,000 averages are more than 1 gram away from the average of `x`?

```
avg <- mean(population)
set.seed(1)
n <- 10000
avg_sam <- vector("numeric", n)
for (i in 1:n){
  avg_sam[i] <- mean(sample(population, 5)) - avg
}
sum(abs(avg_sam) > 1) / n
```

`## [1] 0.4976`

- Set the seed at 1, then using a for-loop take a random sample of 50 mice 1,000 times. Save these averages. What proportion of these 1,000 averages are more than 1 gram away from the average of `x`?

```
avg <- mean(population)
set.seed(1)
n <- 1000
avg_sam <- vector("numeric", n)
for (i in 1:n){
  avg_sam[i] <- mean(sample(population, 50)) - avg
}
sum(abs(avg_sam) > 1) / n
```

`## [1] 0.019`

### Probability distributions exercises

We will use the data set called “Gapminder” which is available as an R-package on Github. This data set contains the life expectancy, GDP *per capita*, and population by country, every five years, from 1952 to 2007. It is an excerpt of a larger and more comprehensive set of data available on Gapminder.org, and the R package of this data set was created by the statistics professor Jennifer Bryan.

```
library(gapminder)
data(gapminder)
head(gapminder)
```

```
## # A tibble: 6 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.
```

Create a vector ‘x’ of the life expectancies of each country for the year 1952. Plot a histogram of these life expectancies to see the spread of the different countries.

```
# Vector of life expectancy.
x <- filter(gapminder, year==1952) %>% pull(lifeExp)
class(x)
```

`## [1] "numeric"`

- What is the proportion of countries in 1952 that have a life expectancy less than or equal to 40?

`mean(x <= 40)`

`## [1] 0.2887324`

- What is the proportion of countries in 1952 that have a life expectancy between 40 and 60 years? Hint: this is the proportion that have a life expectancy less than or equal to 60 years, minus the proportion that have a life expectancy less than or equal to 40 years.

`mean(x <= 60) - mean(x <= 40)`

`## [1] 0.4647887`

- Using `sapply()` on a custom function

Suppose we want to plot the proportion of countries with life expectancy less than or equal to `'q'`, for a range of different values of `q`. R has a built-in function for this, `plot(ecdf(x))`, but suppose we didn't know this. The function is quite easy to build by turning the code from question 1.1 into a custom function and then using `sapply()`. Our custom function will take an input variable `'q'` and return the proportion of countries in `'x'` less than or equal to `q`.

```
prop <- function(q){
  mean(x <= q)
}
```

Testing the function against a result we obtained earlier:

`prop(40)`

`## [1] 0.2887324`

Now let’s build a range of q’s that we can apply the function to:

```
qs <- seq(min(x), max(x), length.out = 20)
props <- sapply(qs, prop)
head(props)
```

`## [1] 0.007042254 0.028169014 0.063380282 0.105633803 0.204225352 0.288732394`

`plot(qs, props)`

Alternatively, an inline function could have been written in one line: `props <- sapply(qs, function(q) mean(x <= q))`.

Finally, the built-in version for comparison:

`plot(ecdf(x))`
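As a check, `ecdf()` returns a function, so evaluating it at our grid of `q` values should reproduce our manual proportions exactly. A self-contained sketch with made-up life expectancies (the real `x` comes from gapminder):

```
# Made-up life expectancies, just to compare the manual ECDF with ecdf().
x_demo  <- c(30, 40, 45, 55, 60, 70)
qs_demo <- seq(min(x_demo), max(x_demo), length.out = 5)

props_demo <- sapply(qs_demo, function(q) mean(x_demo <= q))
all.equal(props_demo, ecdf(x_demo)(qs_demo))  # TRUE: identical proportions
```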