## Visualizing two variables

```
library(dplyr)
library(ggplot2)
library(gridExtra)
library(openintro)
knitr::opts_chunk$set(cache = TRUE)
```

- Using the `ncbirths` dataset, make a scatterplot using `ggplot()` to illustrate how the birth weight of these babies varies according to the number of weeks of gestation.

`str(ncbirths)`

```
## 'data.frame': 1000 obs. of 13 variables:
## $ fage : int NA NA 19 21 NA NA 18 17 NA 20 ...
## $ mage : int 13 14 15 15 15 15 15 15 16 16 ...
## $ mature : Factor w/ 2 levels "mature mom","younger mom": 2 2 2 2 2 2 2 2 2 2 ...
## $ weeks : int 39 42 37 41 39 38 37 35 38 37 ...
## $ premie : Factor w/ 2 levels "full term","premie": 1 1 1 1 1 1 1 2 1 1 ...
## $ visits : int 10 15 11 6 9 19 12 5 9 13 ...
## $ marital : Factor w/ 2 levels "married","not married": 1 1 1 1 1 1 1 1 1 1 ...
## $ gained : int 38 20 38 34 27 22 76 15 NA 52 ...
## $ weight : num 7.63 7.88 6.63 8 6.38 5.38 8.44 4.69 8.81 6.94 ...
## $ lowbirthweight: Factor w/ 2 levels "low","not low": 2 2 2 2 2 1 2 1 2 2 ...
## $ gender : Factor w/ 2 levels "female","male": 2 2 1 2 1 2 2 2 2 1 ...
## $ habit : Factor w/ 2 levels "nonsmoker","smoker": 1 1 1 1 1 1 1 1 1 1 ...
## $ whitemom : Factor w/ 2 levels "not white","white": 1 1 2 2 1 1 1 1 2 2 ...
```

```
ncbirths %>%
  ggplot(aes(x = weeks, y = weight)) +
  geom_point() +
  ggtitle("Weight of babies against gestation period")
```

`## Warning: Removed 2 rows containing missing values (geom_point).`

### Boxplots as discretized/conditioned scatterplots

- The `cut()` function takes two arguments: the continuous variable you want to discretize and the number of breaks you want to make in that variable in order to discretize it.
- Here the `cut()` function is used to discretize the x-variable (`weeks`) into five equal-width intervals (`breaks = 5`).

```
ncbirths %>%
  ggplot(aes(x = cut(weeks, breaks = 5), y = weight)) +
  geom_boxplot()
```
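In isolation, `cut()` with a single number for `breaks` divides the range of the variable into that many equal-width intervals. A quick base-R check (not part of the original exercise):

```r
# Discretize 1..10 into five equal-width intervals
x <- 1:10
bins <- cut(x, breaks = 5)

nlevels(bins)  # 5 factor levels, one per interval
table(bins)    # two observations fall into each interval
```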

### Creating scatterplots

- The `mammals` dataset contains information about 62 different species of mammals, including their body weight, brain weight, gestation time, and a few other variables.

`glimpse(mammals)`

```
## Observations: 62
## Variables: 11
## $ Species <fct> Africanelephant, Africangiantpouchedrat, ArcticFox...
## $ BodyWt <dbl> 6654.000, 1.000, 3.385, 0.920, 2547.000, 10.550, 0...
## $ BrainWt <dbl> 5712.0, 6.6, 44.5, 5.7, 4603.0, 179.5, 0.3, 169.0,...
## $ NonDreaming <dbl> NA, 6.3, NA, NA, 2.1, 9.1, 15.8, 5.2, 10.9, 8.3, 1...
## $ Dreaming <dbl> NA, 2.0, NA, NA, 1.8, 0.7, 3.9, 1.0, 3.6, 1.4, 1.5...
## $ TotalSleep <dbl> 3.3, 8.3, 12.5, 16.5, 3.9, 9.8, 19.7, 6.2, 14.5, 9...
## $ LifeSpan <dbl> 38.6, 4.5, 14.0, NA, 69.0, 27.0, 19.0, 30.4, 28.0,...
## $ Gestation <dbl> 645, 42, 60, 25, 624, 180, 35, 392, 63, 230, 112, ...
## $ Predation <int> 3, 3, 1, 5, 3, 4, 1, 4, 1, 1, 5, 5, 2, 5, 1, 2, 2,...
## $ Exposure <int> 5, 1, 1, 2, 5, 4, 1, 5, 2, 1, 4, 5, 1, 5, 1, 2, 2,...
## $ Danger <int> 3, 3, 1, 3, 4, 4, 1, 4, 1, 1, 4, 5, 2, 5, 1, 2, 2,...
```

- Using the mammals dataset, create a scatterplot illustrating how the brain weight of a mammal varies as a function of its body weight.

```
mammals %>%
  ggplot(aes(x = BodyWt, y = BrainWt)) +
  geom_point()
```

- The `mlbBat10` dataset contains batting statistics for 1,199 Major League Baseball players during the 2010 season.

`glimpse(mlbBat10)`

```
## Observations: 1,199
## Variables: 19
## $ name <fct> I Suzuki, D Jeter, M Young, J Pierre, R Weeks, M Scut...
## $ team <fct> SEA, NYY, TEX, CWS, MIL, BOS, BAL, MIN, NYY, CIN, MIL...
## $ position <fct> OF, SS, 3B, OF, 2B, SS, OF, OF, 2B, 2B, OF, OF, 2B, O...
## $ G <dbl> 162, 157, 157, 160, 160, 150, 160, 153, 160, 155, 157...
## $ AB <dbl> 680, 663, 656, 651, 651, 632, 629, 629, 626, 626, 619...
## $ R <dbl> 74, 111, 99, 96, 112, 92, 79, 85, 103, 100, 101, 103,...
## $ H <dbl> 214, 179, 186, 179, 175, 174, 187, 166, 200, 172, 188...
## $ `2B` <dbl> 30, 30, 36, 18, 32, 38, 45, 24, 41, 33, 45, 34, 41, 2...
## $ `3B` <dbl> 3, 3, 3, 3, 4, 0, 3, 10, 3, 5, 1, 10, 4, 3, 3, 1, 5, ...
## $ HR <dbl> 6, 10, 21, 1, 29, 11, 12, 3, 29, 18, 25, 4, 10, 25, 1...
## $ RBI <dbl> 43, 67, 91, 47, 83, 56, 60, 58, 109, 59, 103, 41, 75,...
## $ TB <dbl> 268, 245, 291, 206, 302, 245, 274, 219, 334, 269, 310...
## $ BB <dbl> 45, 63, 50, 45, 76, 53, 73, 60, 57, 46, 56, 47, 28, 4...
## $ SO <dbl> 86, 106, 115, 47, 184, 71, 93, 74, 77, 83, 105, 170, ...
## $ SB <dbl> 42, 18, 4, 68, 11, 5, 7, 26, 3, 16, 14, 27, 14, 18, 1...
## $ CS <dbl> 9, 5, 2, 18, 4, 4, 2, 4, 2, 12, 3, 6, 4, 9, 5, 1, 3, ...
## $ OBP <dbl> 0.359, 0.340, 0.330, 0.341, 0.366, 0.333, 0.370, 0.33...
## $ SLG <dbl> 0.394, 0.370, 0.444, 0.316, 0.464, 0.388, 0.436, 0.34...
## $ AVG <dbl> 0.315, 0.270, 0.284, 0.275, 0.269, 0.275, 0.297, 0.26...
```

- Using the `mlbBat10` dataset, create a scatterplot illustrating how the slugging percentage (`SLG`) of a player varies as a function of his on-base percentage (`OBP`).

```
mlbBat10 %>%
  ggplot(aes(OBP, SLG)) +
  geom_point()
```

- The `bdims` dataset contains body girth and skeletal diameter measurements for 507 physically active individuals.

`glimpse(bdims)`

```
## Observations: 507
## Variables: 25
## $ bia.di <dbl> 42.9, 43.7, 40.1, 44.3, 42.5, 43.3, 43.5, 44.4, 43.5, 4...
## $ bii.di <dbl> 26.0, 28.5, 28.2, 29.9, 29.9, 27.0, 30.0, 29.8, 26.5, 2...
## $ bit.di <dbl> 31.5, 33.5, 33.3, 34.0, 34.0, 31.5, 34.0, 33.2, 32.1, 3...
## $ che.de <dbl> 17.7, 16.9, 20.9, 18.4, 21.5, 19.6, 21.9, 21.8, 15.5, 2...
## $ che.di <dbl> 28.0, 30.8, 31.7, 28.2, 29.4, 31.3, 31.7, 28.8, 27.5, 2...
## $ elb.di <dbl> 13.1, 14.0, 13.9, 13.9, 15.2, 14.0, 16.1, 15.1, 14.1, 1...
## $ wri.di <dbl> 10.4, 11.8, 10.9, 11.2, 11.6, 11.5, 12.5, 11.9, 11.2, 1...
## $ kne.di <dbl> 18.8, 20.6, 19.7, 20.9, 20.7, 18.8, 20.8, 21.0, 18.9, 2...
## $ ank.di <dbl> 14.1, 15.1, 14.1, 15.0, 14.9, 13.9, 15.6, 14.6, 13.2, 1...
## $ sho.gi <dbl> 106.2, 110.5, 115.1, 104.5, 107.5, 119.8, 123.5, 120.4,...
## $ che.gi <dbl> 89.5, 97.0, 97.5, 97.0, 97.5, 99.9, 106.9, 102.5, 91.0,...
## $ wai.gi <dbl> 71.5, 79.0, 83.2, 77.8, 80.0, 82.5, 82.0, 76.8, 68.5, 7...
## $ nav.gi <dbl> 74.5, 86.5, 82.9, 78.8, 82.5, 80.1, 84.0, 80.5, 69.0, 8...
## $ hip.gi <dbl> 93.5, 94.8, 95.0, 94.0, 98.5, 95.3, 101.0, 98.0, 89.5, ...
## $ thi.gi <dbl> 51.5, 51.5, 57.3, 53.0, 55.4, 57.5, 60.9, 56.0, 50.0, 5...
## $ bic.gi <dbl> 32.5, 34.4, 33.4, 31.0, 32.0, 33.0, 42.4, 34.1, 33.0, 3...
## $ for.gi <dbl> 26.0, 28.0, 28.8, 26.2, 28.4, 28.0, 32.3, 28.0, 26.0, 2...
## $ kne.gi <dbl> 34.5, 36.5, 37.0, 37.0, 37.7, 36.6, 40.1, 39.2, 35.5, 3...
## $ cal.gi <dbl> 36.5, 37.5, 37.3, 34.8, 38.6, 36.1, 40.3, 36.7, 35.0, 3...
## $ ank.gi <dbl> 23.5, 24.5, 21.9, 23.0, 24.4, 23.5, 23.6, 22.5, 22.0, 2...
## $ wri.gi <dbl> 16.5, 17.0, 16.9, 16.6, 18.0, 16.9, 18.8, 18.0, 16.5, 1...
## $ age <int> 21, 23, 28, 23, 22, 21, 26, 27, 23, 21, 23, 22, 20, 26,...
## $ wgt <dbl> 65.6, 71.8, 80.7, 72.6, 78.8, 74.8, 86.4, 78.4, 62.0, 8...
## $ hgt <dbl> 174.0, 175.3, 193.5, 186.5, 187.2, 181.5, 184.0, 184.5,...
## $ sex <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
```

Using the `bdims` dataset, create a scatterplot illustrating how a person’s weight varies as a function of their height. Use color to separate by sex, which you’ll need to coerce to a factor with `factor()`.

`bdims$sex <- factor(bdims$sex, labels = c('female', 'male'))`

```
bdims %>%
  ggplot(aes(hgt, wgt, color = sex)) +
  geom_point() +
  facet_wrap(~ sex)
```

- The `smoking` dataset contains information on the smoking habits of 1,691 citizens of the United Kingdom.

`glimpse(smoking)`

```
## Observations: 1,691
## Variables: 12
## $ gender <fct> Male, Female, Male, Female, Female, Femal...
## $ age <int> 38, 42, 40, 40, 39, 37, 53, 44, 40, 41, 7...
## $ maritalStatus <fct> Divorced, Single, Married, Married, Marri...
## $ highestQualification <fct> No Qualification, No Qualification, Degre...
## $ nationality <fct> British, British, English, English, Briti...
## $ ethnicity <fct> White, White, White, White, White, White,...
## $ grossIncome <fct> 2,600 to 5,200, Under 2,600, 28,600 to 36...
## $ region <fct> The North, The North, The North, The Nort...
## $ smoke <fct> No, Yes, No, No, No, No, Yes, No, Yes, Ye...
## $ amtWeekends <int> NA, 12, NA, NA, NA, NA, 6, NA, 8, 15, NA,...
## $ amtWeekdays <int> NA, 12, NA, NA, NA, NA, 6, NA, 8, 12, NA,...
## $ type <fct> , Packets, , , , , Packets, , Hand-Rolled...
```

- Using the `smoking` dataset, create a scatterplot illustrating how the amount that a person smokes on weekdays varies as a function of their age.

```
smoking %>%
  ggplot(aes(age, amtWeekdays)) +
  geom_point()
```

`## Warning: Removed 1270 rows containing missing values (geom_point).`

### Transformations

`ggplot2` provides several different mechanisms for viewing transformed relationships. The `coord_trans()` function transforms the coordinates of the plot. Alternatively, the `scale_x_log10()` and `scale_y_log10()` functions perform a base-10 log transformation of each axis. Note the differences in the appearance of the axes.

```
# Scatterplot with coord_trans()
p1 <- ggplot(data = mammals, aes(x = BodyWt, y = BrainWt)) +
  geom_point() +
  coord_trans(x = "log10", y = "log10")

# Scatterplot with scale_x_log10() and scale_y_log10()
p2 <- ggplot(data = mammals, aes(x = BodyWt, y = BrainWt)) +
  geom_point() +
  scale_x_log10() + scale_y_log10()

grid.arrange(p1, p2, ncol = 2)
```

### Identifying outliers

- Use `filter()` to create a scatterplot for `SLG` as a function of `OBP` among players who had at least 200 at-bats.

```
mlbBat10 %>%
  filter(AB >= 200) %>%
  ggplot(aes(OBP, SLG)) +
  geom_point()
```

- Find the row of `mlbBat10` corresponding to the one player with at least 200 at-bats whose `OBP` was below 0.200.

```
mlbBat10 %>%
  filter(AB >= 200, OBP < 0.200)
```

```
## name team position G AB R H 2B 3B HR RBI TB BB SO SB CS OBP
## 1 B Wood LAA 3B 81 226 20 33 2 0 4 14 47 6 71 1 0 0.174
## SLG AVG
## 1 0.208 0.146
```

## Correlation

### Computing correlation

- Use `cor()` to compute the correlation between the birth weight of babies in the `ncbirths` dataset and their mother’s age. There is no missing data in either variable.

```
# Compute correlation
ncbirths %>%
  summarize(N = n(), r = cor(weight, mage))
```

```
## N r
## 1 1000 0.05506589
```

- Compute the correlation between the birth weight and the number of weeks of gestation for all non-missing pairs.

The `use` argument allows you to override the default behavior of returning `NA` whenever any of the values encountered is `NA`. Setting `use = "pairwise.complete.obs"` allows `cor()` to compute the correlation coefficient for those observations where the values of `x` and `y` are both non-missing.
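To see what `use = "pairwise.complete.obs"` does, here is a minimal base-R sketch on toy vectors with missing values:

```r
x <- c(1, 2, 3, 4, NA)
y <- c(2, 4, 6, NA, 10)

cor(x, y)                                 # NA: any missing value poisons the default
cor(x, y, use = "pairwise.complete.obs")  # 1: computed on the three complete pairs
```

Only the first three positions have both `x` and `y` observed, and those pairs lie exactly on a line, hence the correlation of 1.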

```
# Compute correlation for all non-missing pairs
ncbirths %>%
  summarize(N = n(), r = cor(weight, weeks, use = "pairwise.complete.obs"))
```

```
## N r
## 1 1000 0.6701013
```

### The Anscombe dataset

```
library(Tmisc)
data(quartet)
```

```
ggplot(data = quartet, aes(x = x, y = y)) +
geom_point() +
facet_wrap(~ set)
```

In 1973, Francis Anscombe famously created four datasets with remarkably similar numerical properties, but obviously different graphical relationships. The `quartet` dataset contains the x and y coordinates for these four datasets, along with a grouping variable, `set`, that distinguishes the quartet.

```
quartet %>%
  group_by(set) %>%
  summarize(N = n(), mean(x), sd(x), mean(y), sd(y), cor(x, y))
```

```
## # A tibble: 4 x 7
## set N `mean(x)` `sd(x)` `mean(y)` `sd(y)` `cor(x, y)`
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 I 11 9 3.32 7.50 2.03 0.816
## 2 II 11 9 3.32 7.50 2.03 0.816
## 3 III 11 9 3.32 7.5 2.03 0.816
## 4 IV 11 9 3.32 7.50 2.03 0.817
```

### Perception of correlation

```
# Correlation for all baseball players
mlbBat10 %>%
  summarize(N = n(), r = cor(OBP, SLG))
```

```
## N r
## 1 1199 0.8145628
```

```
# Correlation for all players with at least 200 ABs
mlbBat10 %>%
  filter(AB >= 200) %>%
  summarize(N = n(), r = cor(OBP, SLG))
```

```
## N r
## 1 329 0.6855364
```

```
# Correlation of body dimensions
bdims %>%
  group_by(sex) %>%
  summarize(N = n(), r = cor(hgt, wgt))
```

```
## # A tibble: 2 x 3
## sex N r
## <fct> <int> <dbl>
## 1 female 260 0.431
## 2 male 247 0.535
```

```
# Correlation among mammals, with and without log
mammals %>%
  summarize(N = n(),
            r = cor(BodyWt, BrainWt),
            r_log = cor(log(BodyWt), log(BrainWt)))
```

```
## N r r_log
## 1 62 0.9341638 0.9595748
```

### Spurious correlation in random data

```
set.seed(926)
noise <- data.frame(x = rnorm(1000),
                    y = rnorm(1000),
                    z = 1:20)
```

Statisticians must always be skeptical of potentially spurious correlations. Human beings are very good at seeing patterns in data, sometimes when the patterns themselves are actually just random noise. To illustrate how easy it can be to fall into this trap, we will look for patterns in truly random data.

The `noise` dataset contains 20 sets of `x` and `y` variables drawn at random from a standard normal distribution. Each set, denoted by `z`, has 50 observations of `x`, `y` pairs. Do you see any pairs of variables that might be meaningfully correlated? Are all of the correlation coefficients close to zero?

- Create a faceted scatterplot that shows the relationship between each of the 20 sets of pairs of random variables `x` and `y`. You will need the `facet_wrap()` function for this.
- Compute the actual correlation between each of the 20 sets of pairs of `x` and `y`.
- Identify the datasets that show non-trivial correlation of greater than 0.2 in absolute value.

```
# Create faceted scatterplot
ggplot(noise, aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~ z)
```

```
# Compute correlations for each dataset
noise_summary <- noise %>%
  group_by(z) %>%
  summarize(N = n(), spurious_cor = cor(x, y))
```

```
# Isolate sets with correlations above 0.2 in absolute strength
noise_summary %>%
  filter(abs(spurious_cor) > 0.2)
```

```
## # A tibble: 1 x 3
## z N spurious_cor
## <int> <int> <dbl>
## 1 8 50 -0.228
```

## Simple linear regression

### The “best fit” line

- Create a scatterplot of body weight as a function of height for all individuals in the `bdims` dataset, with a simple linear model plotted over the data.

```
# Scatterplot with regression line
ggplot(data = bdims, aes(x = hgt, y = wgt)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```

### Uniqueness of least squares regression line

```
add_line <- function(my_slope) {
  bdims_summary <- bdims %>%
    summarize(N = n(), r = cor(hgt, wgt),
              mean_hgt = mean(hgt), mean_wgt = mean(wgt),
              sd_hgt = sd(hgt), sd_wgt = sd(wgt)) %>%
    mutate(true_slope = r * sd_wgt / sd_hgt,
           true_intercept = mean_wgt - true_slope * mean_hgt)
  p <- ggplot(data = bdims, aes(x = hgt, y = wgt)) +
    geom_point() +
    geom_point(data = bdims_summary,
               aes(x = mean_hgt, y = mean_wgt),
               color = "red", size = 3)
  my_data <- bdims_summary %>%
    mutate(my_slope = my_slope,
           my_intercept = mean_wgt - my_slope * mean_hgt)
  p + geom_abline(data = my_data,
                  aes(intercept = my_intercept, slope = my_slope),
                  color = "dodgerblue")
}
```

The least squares criterion implies that the slope of the regression line is unique. In practice, the slope is computed by R. In this exercise, you will experiment with trying to find the optimal value for the regression slope for weight as a function of height in the bdims dataset via trial-and-error.

A custom function called `add_line()` is provided; it takes a single argument: the proposed slope coefficient.

`add_line(my_slope = 1.15)`

### Fitting a linear model “by hand”

Simple linear regression model: \[Y = \beta_0 + \beta_1 \cdot X + \epsilon \,, \text{ where } \epsilon \sim N(0, \sigma_{\epsilon}) \,.\]

Two facts enable you to compute the slope \(\beta_1\) and intercept \(\beta_0\) of a simple linear regression model from some basic summary statistics.

First, the slope can be defined as:
\[b_1 = r_{X,Y} \cdot \frac{s_Y}{s_X}\]

where \(r_{X,Y}\) represents the correlation (`cor()`) of \(X\) and \(Y\), and \(s_X\) and \(s_Y\) represent the standard deviations (`sd()`) of \(X\) and \(Y\), respectively.

Second, the point \((\bar{x}, \bar{y})\) is always on the least squares regression line, where \(\bar{x}\) and \(\bar{y}\) denote the means of \(X\) and \(Y\).

The `bdims_summary` data frame contains all of the information you need to compute the slope and intercept of the least squares regression line for body weight \((Y)\) as a function of height \((X)\). You might need to do some algebra to solve for \(b_0\)!
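These identities can be checked on any small dataset; here is a base-R sketch with made-up numbers:

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

b1 <- cor(x, y) * sd(y) / sd(x)  # slope from the correlation and SDs
b0 <- mean(y) - b1 * mean(x)     # intercept: the line passes through (mean(x), mean(y))

c(b0, b1)
coef(lm(y ~ x))  # lm() recovers the same coefficients
```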

```
bdims_summary <- bdims %>%
  summarise(N = n(),
            r = cor(hgt, wgt),
            mean_hgt = mean(hgt),
            sd_hgt = sd(hgt),
            mean_wgt = mean(wgt),
            sd_wgt = sd(wgt))
bdims_summary
```

```
## N r mean_hgt sd_hgt mean_wgt sd_wgt
## 1 507 0.7173011 171.1438 9.407205 69.14753 13.34576
```

```
# Add slope and intercept
bdims_summary %>%
  mutate(slope = r * (sd_wgt / sd_hgt),
         intercept = mean_wgt - (slope * mean_hgt))
```

```
## N r mean_hgt sd_hgt mean_wgt sd_wgt slope intercept
## 1 507 0.7173011 171.1438 9.407205 69.14753 13.34576 1.017617 -105.0113
```

### Regression to the mean

**Regression to the mean** is a concept attributed to Sir Francis Galton. The basic idea is that extreme random observations will tend to be less extreme upon a second trial. This is simply due to chance alone. While “regression to the mean” and “linear regression” are not the same thing, we will examine them together in this exercise.

The `GaltonFamilies` dataset from the `HistData` package contains data originally collected by Galton himself in the 1880s on the heights of adult children, along with their parents’ heights.

```
# This dataset is based on the work done by Francis Galton in the 19th century
library(HistData)
data(GaltonFamilies)
head(GaltonFamilies)
```

```
## family father mother midparentHeight children childNum gender
## 1 001 78.5 67.0 75.43 4 1 male
## 2 001 78.5 67.0 75.43 4 2 female
## 3 001 78.5 67.0 75.43 4 3 female
## 4 001 78.5 67.0 75.43 4 4 female
## 5 002 75.5 66.5 73.66 4 1 male
## 6 002 75.5 66.5 73.66 4 2 male
## childHeight
## 1 73.2
## 2 69.2
## 3 69.0
## 4 69.0
## 5 73.5
## 6 72.5
```

- Create a scatterplot of the height of sons as a function of their fathers’ height, and of daughters as a function of their mothers’ height. Add the simple linear regression line and a diagonal line (with slope equal to 1 and intercept equal to 0) to each plot.

```
# Height of sons vs. height of fathers
GaltonFamilies %>%
  filter(gender == "male") %>%
  ggplot(aes(father, childHeight)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +
  geom_smooth(method = "lm", se = FALSE)
```

```
# Height of daughters vs. height of mothers
GaltonFamilies %>%
  filter(gender == "female") %>%
  ggplot(aes(mother, childHeight)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +
  geom_smooth(method = "lm", se = FALSE)
```

## Interpreting regression models

### Fitting simple linear models

While `geom_smooth(method = "lm")` is useful for drawing linear models on a scatterplot, it doesn’t actually return the characteristics of the model. As suggested by that syntax, however, the function that creates linear models is `lm()`. This function generally takes two arguments:

- A formula that specifies the model
- A data argument for the data frame that contains the data you want to use to fit the model

The `lm()` function returns a model object having class `"lm"`. This object contains lots of information about your regression model, including the data used to fit the model, the specification of the model, the fitted values and residuals, etc.

- Using the `bdims` dataset, create a linear model for the weight of people as a function of their height.

```
# Linear model for weight as a function of height
lm(wgt ~ hgt, data = bdims)
```

```
##
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
##
## Coefficients:
## (Intercept) hgt
## -105.011 1.018
```

- Using the `mlbBat10` dataset, create a linear model for `SLG` as a function of `OBP`.

```
# Linear model for SLG as a function of OBP
lm(SLG ~ OBP, data = mlbBat10)
```

```
##
## Call:
## lm(formula = SLG ~ OBP, data = mlbBat10)
##
## Coefficients:
## (Intercept) OBP
## 0.009407 1.110323
```

- Using the `mammals` dataset, create a linear model for the body weight of mammals as a function of their brain weight, after taking the **natural log** of both variables.

```
# Log-linear model for body weight as a function of brain weight
lm(log(BodyWt) ~ log(BrainWt), data = mammals)
```

```
##
## Call:
## lm(formula = log(BodyWt) ~ log(BrainWt), data = mammals)
##
## Coefficients:
## (Intercept) log(BrainWt)
## -2.509 1.225
```

### The lm summary output

An `"lm"` object contains a host of information about the regression model that you fit. There are various ways of extracting different pieces of information.

The `coef()` function displays only the values of the coefficients. Conversely, the `summary()` function displays not only that information, but a bunch of other information, including the associated standard error and p-value for each coefficient, the \(R^2\), adjusted \(R^2\), and the residual standard error. The summary of an `"lm"` object in R is very similar to the output you would see in other statistical computing environments (e.g. Stata, SPSS, etc.).

The `mod` object is a linear model for the weight of individuals as a function of their height, using the `bdims` dataset.

`mod <- lm(wgt ~ hgt, data = bdims)`

- Use `coef()` to display the coefficients of `mod`.

`coef(mod)`

```
## (Intercept) hgt
## -105.011254 1.017617
```

- Use `summary()` to display the full regression output of `mod`.

`summary(mod)`

```
##
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.743 -6.402 -1.231 5.059 41.103
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -105.01125 7.53941 -13.93 <2e-16 ***
## hgt 1.01762 0.04399 23.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.308 on 505 degrees of freedom
## Multiple R-squared: 0.5145, Adjusted R-squared: 0.5136
## F-statistic: 535.2 on 1 and 505 DF, p-value: < 2.2e-16
```

### Fitted values and residuals

Once you have fit a regression model, you are often interested in the fitted values (\(\hat{y}_i\)) and the residuals (\(e_i\)), where \(i\) indexes the observations. Recall that: \[e_i = y_i - \hat{y}_i\]

In this exercise, we will confirm these facts by accessing the fitted values and residuals with the `fitted.values()` and `residuals()` functions, respectively, for the following model:

`mod <- lm(wgt ~ hgt, data = bdims)`
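The identity \(e_i = y_i - \hat{y}_i\) can be verified directly on any fit; here is a sketch using the built-in `cars` dataset rather than `bdims`, so it runs without extra packages:

```r
fit <- lm(dist ~ speed, data = cars)  # cars ships with base R

# Residuals are exactly the observed values minus the fitted values
all.equal(residuals(fit), cars$dist - fitted.values(fit))  # TRUE
```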

- Confirm that the mean of the body weights equals the mean of the fitted values of mod.

`mean(bdims$wgt) == mean(fitted.values(mod))`

`## [1] TRUE`

- Compute the mean of the residuals of mod.

`mean(residuals(mod))`

`## [1] -1.266971e-15`

The least squares fitting procedure guarantees that the mean of the residuals is zero (n.b., numerical instability may result in the computed values not being exactly zero). At the same time, the mean of the fitted values must equal the mean of the response variable.

### Tidying your linear model

As you fit a regression model, there are some quantities (e.g. \(R^2\)) that apply to the model as a whole, while others apply to each observation (e.g. \(\hat{y}_i\)). If there are several of these per-observation quantities, it is sometimes convenient to attach them to the original data as new variables.

The `augment()` function from the `broom` package does exactly this. It takes a model object as an argument and returns a data frame that contains the data on which the model was fit, along with several quantities specific to the regression model, including the fitted values, residuals, leverage scores, and standardized residuals.

```
# Load broom
library(broom)
# Create bdims_tidy
bdims_tidy <- augment(mod)
# Glimpse the resulting data frame
glimpse(bdims_tidy)
```

```
## Observations: 507
## Variables: 9
## $ wgt <dbl> 65.6, 71.8, 80.7, 72.6, 78.8, 74.8, 86.4, 78.4, 62....
## $ hgt <dbl> 174.0, 175.3, 193.5, 186.5, 187.2, 181.5, 184.0, 18...
## $ .fitted <dbl> 72.05406, 73.37697, 91.89759, 84.77427, 85.48661, 7...
## $ .se.fit <dbl> 0.4320546, 0.4520060, 1.0667332, 0.7919264, 0.81834...
## $ .resid <dbl> -6.4540648, -1.5769666, -11.1975919, -12.1742745, -...
## $ .hat <dbl> 0.002154570, 0.002358152, 0.013133942, 0.007238576,...
## $ .sigma <dbl> 9.312824, 9.317005, 9.303732, 9.301360, 9.312471, 9...
## $ .cooksd <dbl> 5.201807e-04, 3.400330e-05, 9.758463e-03, 6.282074e...
## $ .std.resid <dbl> -0.69413418, -0.16961994, -1.21098084, -1.31269063,...
```

### Making predictions

The `fitted.values()` function or the `augment()`-ed data frame provides us with the fitted values for the observations that were in the original data. However, once we have fit the model, we may want to compute expected values for observations that were not present in the data on which the model was fit. These types of predictions are called *out-of-sample*.

The `ben` data frame contains a height and weight observation for one person. The `mod` object contains the fitted model for weight as a function of height for the observations in the `bdims` dataset. We can use the `predict()` function to generate expected values for the weight of new individuals. We must pass the data frame of new observations through the `newdata` argument.

`(ben <- data.frame(wgt = 74.8, hgt = 182.8))`

```
## wgt hgt
## 1 74.8 182.8
```

Note that the data frame `ben` must have variables with the exact same names as those in the fitted model.

`predict(mod, newdata = ben)`

```
## 1
## 81.00909
```
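A prediction from `predict()` is just the fitted equation evaluated at the new observation; here is a sketch with the built-in `cars` data instead of `bdims`:

```r
fit <- lm(dist ~ speed, data = cars)
new_obs <- data.frame(speed = 21)  # variable name must match the model formula

predict(fit, newdata = new_obs)
sum(coef(fit) * c(1, 21))  # the same value, computed by hand from the coefficients
```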

### Adding a regression line to a plot manually

The `geom_smooth()` function makes it easy to add a simple linear regression line to a scatterplot of the corresponding variables, and in fact there are more complicated regression models that can be visualized in the data space with `geom_smooth()`. However, there may still be times when we want to add regression lines to our scatterplot manually. To do this, we use the `geom_abline()` function, which takes `slope` and `intercept` arguments. Naturally, we have to compute those values ahead of time, but we already saw how to do this (e.g. using `coef()`).

The `coefs` vector contains the model estimates retrieved from `coef()`. Passing these values to `geom_abline()` enables you to draw a straight line on your scatterplot.
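Note that `coef()` returns a named numeric vector, so `coefs[1]` is the intercept and `coefs[2]` is the slope. A toy check on exactly linear data:

```r
x <- 0:10
y <- 2 + 3 * x          # exactly linear, so the fit recovers it exactly
coefs <- coef(lm(y ~ x))

coefs  # named vector: (Intercept) = 2, x = 3
```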

- Use `geom_abline()` to add a line defined by the `coefs` coefficients to a scatterplot of weight vs. height for individuals in the `bdims` dataset.

```
coefs <- coef(mod)

# Add the line to the scatterplot
ggplot(bdims, aes(x = hgt, y = wgt)) +
  geom_point() +
  geom_abline(aes(intercept = coefs[1], slope = coefs[2]),
              color = "dodgerblue")
```

## Model Fit

### Standard error of residuals

One way to assess strength of fit is to consider how far off the model is for a typical case. That is, for some observations, the fitted value will be very close to the actual value, while for others it will not. The magnitude of a typical residual can give us a sense of generally how close our estimates are.

However, recall that some of the residuals are positive, while others are negative. In fact, it is guaranteed by the least squares fitting procedure that the mean of the residuals is zero. Thus, it makes more sense to compute the square root of the mean squared residual, or *root mean squared error (RMSE)*. R calls this quantity the *residual standard error*.

To make this estimate unbiased, you have to divide the sum of the squared residuals by the degrees of freedom in the model. Thus, \[RMSE = \sqrt{ \frac{\sum_i{e_i^2}}{d.f.} } = \sqrt{ \frac{SSE}{d.f.} }\]

You can recover the residuals from `mod` with `residuals()`, and the degrees of freedom with `df.residual()`.
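The RMSE formula above is exactly what `summary()` reports as the residual standard error; a quick check on the built-in `cars` data:

```r
fit <- lm(dist ~ speed, data = cars)

rmse <- sqrt(sum(residuals(fit)^2) / df.residual(fit))
rmse
summary(fit)$sigma  # identical: R's "residual standard error"
```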

- View a `summary()` of `mod`.

```
# View summary of model
summary(mod)
```

```
##
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.743 -6.402 -1.231 5.059 41.103
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -105.01125 7.53941 -13.93 <2e-16 ***
## hgt 1.01762 0.04399 23.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.308 on 505 degrees of freedom
## Multiple R-squared: 0.5145, Adjusted R-squared: 0.5136
## F-statistic: 535.2 on 1 and 505 DF, p-value: < 2.2e-16
```

- Compute the mean of the `residuals()` and verify that it is approximately zero.

```
# Compute the mean of the residuals
mean(residuals(mod))
```

`## [1] -1.266971e-15`

- Use `residuals()` and `df.residual()` to compute the root mean squared error (RMSE), a.k.a. the residual standard error.

```
# Compute RMSE
sqrt(sum(residuals(mod)^2) / df.residual(mod))
```

`## [1] 9.30804`

### Assessing simple linear model fit

Recall that the coefficient of determination (\(R^2\)), can be computed as \[R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{Var(e)}{Var(y)} \,,\]

where \(e\) is the vector of residuals and \(y\) is the response variable. This gives us the interpretation of \(R^2\) as the percentage of the variability in the response that is explained by the model, since the residuals are the part of that variability that remains unexplained by the model.
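The formula can be verified against `summary()` on any fit; here is a sketch with the built-in `cars` data:

```r
fit <- lm(dist ~ speed, data = cars)

r2_manual <- 1 - var(residuals(fit)) / var(cars$dist)
r2_manual
summary(fit)$r.squared  # matches the manual computation
```

Using variances works because both \(SSE\) and \(SST\) are divided by the same \(n - 1\), so the ratio is unchanged.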

The `bdims_tidy` data frame is the result of `augment()`-ing the `bdims` data frame with the `mod` for `wgt` as a function of `hgt`.

- Use the `summary()` function to view the full results of `mod`.

`summary(mod)`

```
##
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.743 -6.402 -1.231 5.059 41.103
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -105.01125 7.53941 -13.93 <2e-16 ***
## hgt 1.01762 0.04399 23.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.308 on 505 degrees of freedom
## Multiple R-squared: 0.5145, Adjusted R-squared: 0.5136
## F-statistic: 535.2 on 1 and 505 DF, p-value: < 2.2e-16
```

- Use the `bdims_tidy` data frame to compute the \(R^2\) of `mod` manually using the formula above, by computing the ratio of the variance of the residuals to the variance of the response variable.

```
# Compute R-squared
bdims_tidy %>%
  summarize(var_y = var(wgt), var_e = var(.resid)) %>%
  mutate(R_squared = 1 - (var_e / var_y))
```

```
## var_y var_e R_squared
## 1 178.1094 86.46839 0.5145208
```

### Interpretation of \(R^2\)

The \(R^2\) reported for the regression model for poverty rate of U.S. counties in terms of high school graduation rate is 0.464.

The `countyComplete` dataset is from the `openintro` package.

```
lm(formula = poverty ~ hs_grad, data = countyComplete) %>%
summary()
```

```
##
## Call:
## lm(formula = poverty ~ hs_grad, data = countyComplete)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.035 -3.034 -0.434 2.405 36.874
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 64.59437 0.94619 68.27 <2e-16 ***
## hs_grad -0.59075 0.01134 -52.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.677 on 3141 degrees of freedom
## Multiple R-squared: 0.4635, Adjusted R-squared: 0.4633
## F-statistic: 2713 on 1 and 3141 DF, p-value: < 2.2e-16
```

\(46.4\%\) of the variability in poverty rate among U.S. counties can be explained by high school graduation rate.

### Linear vs. average

The \(R^2\) gives us a numerical measurement of the strength of fit relative to a null model based on the average of the response variable: \[\hat{y}_{null} = \bar{y}\]

This model has an \(R^2\) of zero because \(SSE=SST\). That is, since the fitted values \(\hat{y}_{null}\) are all equal to the average \(\bar{y}\), the residual for each observation is the distance between that observation and the mean of the response. Since we can always fit the null model, it serves as a baseline against which all other models will be compared.
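A quick check of these null-model facts on a toy response:

```r
y <- c(3, 5, 7, 9)
fit_null <- lm(y ~ 1)  # intercept-only (null) model

fitted(fit_null)             # every fitted value equals mean(y) = 6
summary(fit_null)$r.squared  # 0: the null model explains none of the variability
```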

In the graphic, we visualize the residuals for the null model (`mod_null`, at left) vs. the simple linear regression model (`mod_hgt`, at right) with height as a single explanatory variable. Try to convince yourself that, if you squared the lengths of the grey arrows on the left and summed them up, you would get a larger value than if you performed the same operation on the grey arrows on the right.

It may be useful to preview these `augment()`

-ed data frames with `glimpse()`

:

```
mod_null <- lm(wgt ~ 1, data = bdims) %>% augment()
mod_hgt <- lm(wgt ~ hgt, data = bdims) %>% augment()
```

`glimpse(mod_null)`

```
## Observations: 507
## Variables: 8
## $ wgt <dbl> 65.6, 71.8, 80.7, 72.6, 78.8, 74.8, 86.4, 78.4, 62....
## $ .fitted <dbl> 69.14753, 69.14753, 69.14753, 69.14753, 69.14753, 6...
## $ .se.fit <dbl> 0.5927061, 0.5927061, 0.5927061, 0.5927061, 0.59270...
## $ .resid <dbl> -3.5475345, 2.6524655, 11.5524655, 3.4524655, 9.652...
## $ .hat <dbl> 0.001972387, 0.001972387, 0.001972387, 0.001972387,...
## $ .sigma <dbl> 13.35803, 13.35845, 13.34906, 13.35808, 13.35205, 1...
## $ .cooksd <dbl> 1.399179e-04, 7.822033e-05, 1.483780e-03, 1.325192e...
## $ .std.resid <dbl> -0.26607983, 0.19894594, 0.86648293, 0.25894926, 0....
```

```
# Compute SSE for the null model
# (strictly, var(.resid) = SSE / (n - 1); the scaling cancels in R-squared)
mod_null %>%
  summarize(SSE = var(.resid))
```

```
## SSE
## 1 178.1094
```

`glimpse(mod_hgt)`

```
## Observations: 507
## Variables: 9
## $ wgt <dbl> 65.6, 71.8, 80.7, 72.6, 78.8, 74.8, 86.4, 78.4, 62....
## $ hgt <dbl> 174.0, 175.3, 193.5, 186.5, 187.2, 181.5, 184.0, 18...
## $ .fitted <dbl> 72.05406, 73.37697, 91.89759, 84.77427, 85.48661, 7...
## $ .se.fit <dbl> 0.4320546, 0.4520060, 1.0667332, 0.7919264, 0.81834...
## $ .resid <dbl> -6.4540648, -1.5769666, -11.1975919, -12.1742745, -...
## $ .hat <dbl> 0.002154570, 0.002358152, 0.013133942, 0.007238576,...
## $ .sigma <dbl> 9.312824, 9.317005, 9.303732, 9.301360, 9.312471, 9...
## $ .cooksd <dbl> 5.201807e-04, 3.400330e-05, 9.758463e-03, 6.282074e...
## $ .std.resid <dbl> -0.69413418, -0.16961994, -1.21098084, -1.31269063,...
```

```
ggplot(bdims, aes(x = hgt, y = wgt)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```

```
# Compute SSE (again as residual variance) for the regression model
mod_hgt %>%
  summarize(SSE = var(.resid))
```

```
## SSE
## 1 86.46839
```

### Leverage

The *leverage* of an observation in a regression model is defined entirely in terms of the distance of that observation from the mean of the explanatory variable. That is, observations close to the mean of the explanatory variable have low leverage, while observations far from the mean of the explanatory variable have high leverage. Points of high leverage may or may not be influential.
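For simple regression this definition has a closed form, \(h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_j (x_j - \bar{x})^2}\), which can be verified against base R's `hatvalues()`. A sketch on the built-in `cars` data (an illustration, not the baseball model used below):

```r
# Leverage in simple regression depends only on the distance from mean(x)
mod <- lm(dist ~ speed, data = cars)
x <- cars$speed
h_manual <- 1 / length(x) + (x - mean(x))^2 / sum((x - mean(x))^2)

# Matches the hat values (the .hat column that augment() reports)
all.equal(unname(hatvalues(mod)), h_manual)
```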

The `augment()` function from the broom package will add the leverage scores (`.hat`) to a model data frame.

- Use `augment()` to list the top 6 observations by their leverage scores, in descending order.

```
mod <- lm(SLG ~ OBP, data = filter(mlbBat10, AB >= 10))
# Rank points of high leverage
mod %>%
  augment() %>%
  arrange(desc(.hat)) %>%
  head()
```

```
## SLG OBP .fitted .se.fit .resid .hat .sigma
## 1 0.000 0.000 -0.03744579 0.009956861 0.03744579 0.01939493 0.07153050
## 2 0.000 0.000 -0.03744579 0.009956861 0.03744579 0.01939493 0.07153050
## 3 0.000 0.000 -0.03744579 0.009956861 0.03744579 0.01939493 0.07153050
## 4 0.308 0.550 0.69049108 0.009158810 -0.38249108 0.01641049 0.07011360
## 5 0.000 0.037 0.01152451 0.008770891 -0.01152451 0.01504981 0.07154283
## 6 0.038 0.038 0.01284803 0.008739031 0.02515197 0.01494067 0.07153800
## .cooksd .std.resid
## 1 0.0027664282 0.5289049
## 2 0.0027664282 0.5289049
## 3 0.0027664282 0.5289049
## 4 0.2427446800 -5.3943121
## 5 0.0002015398 -0.1624191
## 6 0.0009528017 0.3544561
```

### Influence

As noted previously, observations of high leverage may or may not be *influential*. The influence of an observation depends not only on its leverage, but also on the magnitude of its residual. Recall that while leverage only takes into account the explanatory variable (\(x\)), the residual depends on the response variable (\(y\)) and the fitted value (\(\hat{y}\)).

Influential points are likely to have high leverage and deviate from the general relationship between the two variables. We measure influence using Cook’s distance, which incorporates both the leverage and residual of each observation.
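Concretely, Cook's distance for observation \(i\) is \(D_i = \frac{r_i^2}{p} \cdot \frac{h_i}{1 - h_i}\), where \(r_i\) is the standardized residual, \(h_i\) the leverage, and \(p\) the number of model coefficients. A base-R check on the built-in `cars` data (an illustration, not the `mlbBat10` model):

```r
mod <- lm(dist ~ speed, data = cars)
h <- hatvalues(mod)   # leverage
r <- rstandard(mod)   # standardized residuals

# Cook's distance combines both; p = 2 coefficients (intercept + slope)
d_manual <- (r^2 / 2) * h / (1 - h)
all.equal(d_manual, cooks.distance(mod))
```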

```
# Rank influential points by Cook's distance
mod %>%
  augment() %>%
  arrange(desc(.cooksd)) %>%
  head()
```

```
## SLG OBP .fitted .se.fit .resid .hat .sigma
## 1 0.308 0.550 0.69049108 0.009158810 -0.3824911 0.016410487 0.07011360
## 2 0.833 0.385 0.47211002 0.004190644 0.3608900 0.003435619 0.07028875
## 3 0.800 0.455 0.56475653 0.006186785 0.2352435 0.007488132 0.07101125
## 4 0.379 0.133 0.13858258 0.005792344 0.2404174 0.006563752 0.07098798
## 5 0.786 0.438 0.54225666 0.005678026 0.2437433 0.006307223 0.07097257
## 6 0.231 0.077 0.06446537 0.007506974 0.1665346 0.011024863 0.07127661
## .cooksd .std.resid
## 1 0.24274468 -5.394312
## 2 0.04407145 5.056428
## 3 0.04114818 3.302718
## 4 0.03760256 3.373787
## 5 0.03712042 3.420018
## 6 0.03057912 2.342252
```

### Removing outliers

Observations can be outliers for a number of different reasons. Statisticians must always be careful—and more importantly, transparent—when dealing with outliers. Sometimes, a better model fit can be achieved by simply removing outliers and re-fitting the model. However, one must have strong justification for doing this. A desire to have a higher \(R^2\) is not a good enough reason!

In the `mlbBat10` data, the outlier with an `OBP` of 0.550 is Bobby Scales, an infielder who had four hits in 13 at-bats for the Chicago Cubs. Scales also walked seven times, resulting in his unusually high OBP. The justification for removing Scales here is weak. While his performance was unusual, there is nothing to suggest that it is not a valid data point, nor is there a good reason to think that somehow we will learn more about Major League Baseball players by excluding him.

Nevertheless, we can demonstrate how removing him will affect our model.

- Use `filter()` to create a subset of `mlbBat10` called `nontrivial_players`, consisting of only those players with at least 10 at-bats and an `OBP` below 0.500.

`nontrivial_players <- mlbBat10 %>% filter((AB >= 10) & (OBP < 0.500))`

- Fit the linear model for `SLG` as a function of `OBP` for the `nontrivial_players`. Save the result as `mod_cleaner`.

`mod_cleaner <- lm(SLG ~ OBP, data = nontrivial_players)`

- View the `summary()` of the new model and compare the slope and \(R^2\) to those of `mod`, the original model fit to the data on all players.

`summary(mod_cleaner)`

```
##
## Call:
## lm(formula = SLG ~ OBP, data = nontrivial_players)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.31383 -0.04165 -0.00261 0.03992 0.35819
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.043326 0.009823 -4.411 1.18e-05 ***
## OBP 1.345816 0.033012 40.768 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07011 on 734 degrees of freedom
## Multiple R-squared: 0.6937, Adjusted R-squared: 0.6932
## F-statistic: 1662 on 1 and 734 DF, p-value: < 2.2e-16
```
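To see why the slope and \(R^2\) shift when a high-leverage outlier is dropped, here is a self-contained sketch on synthetic data (an illustration only: the names `full` and `clean` are made up, and the cutoff `x < 20` plays the role of the `OBP < 0.500` filter):

```r
set.seed(1)
# 50 points following y = 2x plus noise, plus one high-leverage outlier
x <- c(rnorm(50, mean = 10), 25)
y <- c(2 * x[1:50] + rnorm(50), 10)

full  <- lm(y ~ x)                   # fit with the outlier
clean <- lm(y ~ x, subset = x < 20)  # fit without it

# The outlier drags the slope down and weakens the fit
c(coef(full)["x"], coef(clean)["x"])
c(summary(full)$r.squared, summary(clean)$r.squared)
```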

- Visualize the new model with `ggplot()` and the appropriate `geom_*()` functions.

```
ggplot(nontrivial_players, aes(x = OBP, y = SLG)) +
  geom_point() +
  geom_smooth(method = "lm")
```

### High leverage points

Not all points of high leverage are influential. While the high leverage observation corresponding to Bobby Scales in the previous exercise is influential, the three observations for players with `OBP` and `SLG` values of 0 are not influential.

This is because they happen to lie right near the regression line anyway. Thus, while their extremely low `OBP` gives them the power to exert influence over the slope of the regression line, their low `SLG` prevents them from using it.

- Use a combination of `augment()`, `arrange()` with two arguments, and `head()` to find the top 6 observations with the highest leverage but the lowest Cook's distance.

```
mod <- lm(SLG ~ OBP, data = filter(mlbBat10, AB >= 10))
# Highest leverage first; break ties by lowest Cook's distance
mod %>%
  augment() %>%
  arrange(desc(.hat), .cooksd) %>%
  head()
```

```
## SLG OBP .fitted .se.fit .resid .hat .sigma
## 1 0.000 0.000 -0.03744579 0.009956861 0.03744579 0.01939493 0.07153050
## 2 0.000 0.000 -0.03744579 0.009956861 0.03744579 0.01939493 0.07153050
## 3 0.000 0.000 -0.03744579 0.009956861 0.03744579 0.01939493 0.07153050
## 4 0.308 0.550 0.69049108 0.009158810 -0.38249108 0.01641049 0.07011360
## 5 0.000 0.037 0.01152451 0.008770891 -0.01152451 0.01504981 0.07154283
## 6 0.038 0.038 0.01284803 0.008739031 0.02515197 0.01494067 0.07153800
## .cooksd .std.resid
## 1 0.0027664282 0.5289049
## 2 0.0027664282 0.5289049
## 3 0.0027664282 0.5289049
## 4 0.2427446800 -5.3943121
## 5 0.0002015398 -0.1624191
## 6 0.0009528017 0.3544561
```

`sessionInfo()`

```
## R version 3.4.4 (2018-03-15)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_Canada.1252 LC_CTYPE=English_Canada.1252
## [3] LC_MONETARY=English_Canada.1252 LC_NUMERIC=C
## [5] LC_TIME=English_Canada.1252
##
## attached base packages:
## [1] methods stats graphics grDevices utils datasets base
##
## other attached packages:
## [1] broom_0.4.4 HistData_0.8-4 Tmisc_0.1.19 bindrcpp_0.2.2
## [5] openintro_1.7.1 gridExtra_2.3 ggplot2_2.2.1 dplyr_0.7.4
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.16 pillar_1.2.2 compiler_3.4.4 plyr_1.8.4
## [5] bindr_0.1.1 tools_3.4.4 digest_0.6.15 lattice_0.20-35
## [9] nlme_3.1-137 evaluate_0.10.1 tibble_1.4.2 gtable_0.2.0
## [13] pkgconfig_2.0.1 rlang_0.2.0 psych_1.8.4 cli_1.0.0
## [17] parallel_3.4.4 yaml_2.1.19 blogdown_0.6 xfun_0.1
## [21] stringr_1.3.0 knitr_1.20 rprojroot_1.3-2 grid_3.4.4
## [25] glue_1.2.0 R6_2.2.2 foreign_0.8-70 rmarkdown_1.9
## [29] bookdown_0.7 reshape2_1.4.3 purrr_0.2.4 tidyr_0.8.0
## [33] magrittr_1.5 backports_1.1.2 scales_0.5.0 codetools_0.2-15
## [37] htmltools_0.3.6 mnormt_1.5-5 assertthat_0.2.0 colorspace_1.3-2
## [41] labeling_0.3 utf8_1.1.3 stringi_1.1.7 lazyeval_0.2.1
## [45] munsell_0.4.3 crayon_1.3.4
```

```
## Adding cites for R packages using knitr
knitr::write_bib(.packages(), "packages.bib")
```

# References

Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2018. *Dplyr: A Grammar of Data Manipulation*. https://CRAN.R-project.org/package=dplyr.