Basics of ggplot2

Michael Taylor

2018/03/17

Graph Components

library(dplyr) # add "pipe" like operator `%>%`
library(ggplot2)
library(dslabs, quietly = TRUE)
data(murders)
head(murders)
##        state abb region population total
## 1    Alabama  AL  South    4779736   135
## 2     Alaska  AK   West     710231    19
## 3    Arizona  AZ   West    6392017   232
## 4   Arkansas  AR  South    2915918    93
## 5 California  CA   West   37253956  1257
## 6   Colorado  CO   West    5029196    65

ggplot terminology

  1. Data component
  • The US murders data table is being summarised.
  2. Geometry component
  • The plot is a scatter plot.
  3. Aesthetic mapping components
  • The x-axis values display the population size.
  • The y-axis values display the total number of murders.
  • Text identifies the states.
  • Colors denote the four regions.
  4. Scale component
  • The x-axis and y-axis ranges are determined by the range of the data.
  • In this case both axes use log scales.
  5. Labels, title, legend, etc.
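
As a rough sketch of where these five components appear in code, the example below builds a small plot and tags each line with the component it supplies. The `toy` data frame and its values are made up for illustration; they are not the murders data.

```r
library(ggplot2)

# Hypothetical toy data standing in for the murders table
toy <- data.frame(
  pop    = c(1, 5, 20),                    # population in millions (made up)
  total  = c(10, 80, 600),                 # counts (made up)
  abb    = c("AA", "BB", "CC"),
  region = c("North", "South", "South")
)

sketch <- ggplot(data = toy,                             # 1. data
                 aes(x = pop, y = total, label = abb)) + # 3. aesthetic mappings
  geom_point(aes(col = region)) +                        # 2. geometry (plus a colour mapping)
  geom_text(nudge_x = 0.1) +                             # 2. a second geometry for the text
  scale_x_log10() +                                      # 4. scales
  scale_y_log10() +
  xlab("Population in millions (log scale)") +           # 5. labels and title
  ylab("Total (log scale)") +
  ggtitle("Toy example")

# Each geom_* call contributes one layer to the plot object
n_layers <- length(sketch$layers)
n_layers
```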

Creating a new plot

# `ggplot(data = murders)` is equivalent to `murders %>% ggplot()`
# assign the graph object to `p`
p <- murders %>% ggplot() 
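
The equivalence stated in the comment can be checked on a small example. This is a minimal sketch using a made-up data frame `toy`; it relies on the fact that `toy %>% ggplot()` is evaluated as `ggplot(toy)`, so the pipe itself is not needed here.

```r
library(ggplot2)

toy <- data.frame(x = 1:3, y = c(2, 4, 8))  # hypothetical data

p_named <- ggplot(data = toy)  # data passed as a named argument
p_first <- ggplot(toy)         # data passed as the first argument, as the pipe would do

# Both objects carry the same data
same_data <- identical(p_named$data, p_first$data)

# No geometry has been added yet, so printing either object draws an empty canvas
n_layers <- length(p_named$layers)
```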

Layers

Layers can define geometries, compute summary statistics, define what scales to use, and even change styles. To add a layer, we use the + symbol.

The first added layer defines the geometry. We want to make a scatter plot, and the function for this is geom_point. It acts on the data and mapping provided. The mapping in this case consists of the required arguments x and y, supplied through the aes function. aes connects the data with what we see on the graph, and its result is passed as the mapping argument of geom_point.

Note: the argument names x and y can be dropped, because they are the first and second arguments of aes.

p + 
  geom_point(aes(x = population / 10^6,
                 y = total))
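
The note about dropping x and y can be verified without drawing anything. In this small sketch, `aes` only records the expressions, so the columns do not need to exist when the mapping is created.

```r
library(ggplot2)

# The same mapping, written with and without argument names
named      <- aes(x = population / 10^6, y = total)
positional <- aes(population / 10^6, total)

# In both cases the expressions are stored under the names "x" and "y"
names(named)
names(positional)
```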

geom_label and geom_text allow us to add text to the plot.

p + 
  geom_point(aes(x = population / 10^6,
                 y = total), size = 3) +
  geom_text(aes(x = population / 10^6,
                 y = total, label = abb), nudge_x = 1)
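
The example above uses geom_text only. For comparison, here is a sketch of the same kind of plot with geom_label, which draws the text inside a small box. The `toy` data frame is made up for illustration.

```r
library(ggplot2)

# Hypothetical toy data
toy <- data.frame(pop = c(1, 5, 20), total = c(10, 80, 600),
                  abb = c("AA", "BB", "CC"))

base <- ggplot(toy, aes(x = pop, y = total, label = abb)) +
  geom_point(size = 3)

p_text  <- base + geom_text(nudge_x = 1)   # plain text next to each point
p_label <- base + geom_label(nudge_x = 1)  # text drawn on a boxed background

# The second layer of each plot uses a different geom
class(p_text$layers[[2]]$geom)[1]
class(p_label$layers[[2]]$geom)[1]
```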

Tinkering

Global aesthetic mappings (aes) can be defined in ggplot to simplify our code.

p <- murders %>% ggplot(aes(population / 10^6,
                            total,
                            label = abb))
p + geom_point(size = 3) +
  geom_text(nudge_x = 1.5)

We can override global mappings. This is done by setting aes directly in geom_text.

p <- murders %>% ggplot(aes(population / 10^6,
                            total,
                            label = abb))
p + geom_point(size = 3) +
  geom_text(aes(x = 10, y = 800, label = "Hello there!"))

Only the label assigned by the new local aesthetic mapping is shown.
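
One detail worth knowing: a constant placed inside aes is recycled for every row of the data, so the label is drawn once per row at the same position. The sketch below, on a made-up `toy` data frame, compares this with annotate, which is one common way to add a single label; it is shown here as an alternative, not as the approach used in this post.

```r
library(ggplot2)

# Hypothetical toy data
toy <- data.frame(pop = c(1, 5, 20), total = c(10, 80, 600),
                  abb = c("AA", "BB", "CC"))

base <- ggplot(toy, aes(pop, total, label = abb)) + geom_point(size = 3)

with_aes      <- base + geom_text(aes(x = 10, y = 800, label = "Hello there!"))
with_annotate <- base + annotate("text", x = 10, y = 800, label = "Hello there!")

# Count the labels each version draws (layer 2 is the text layer).
# Newer ggplot2 versions may warn about the constant aesthetics and suggest annotate().
n_aes      <- suppressWarnings(nrow(layer_data(with_aes, 2)))
n_annotate <- nrow(layer_data(with_annotate, 2))
```

Here `n_aes` equals the number of rows in `toy`, while `n_annotate` is 1.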

Scales, Labels, and Colors

p <- murders %>% ggplot(aes(population / 10^6,
                            total,
                            label = abb))
p + geom_point(size = 3) +
  geom_text(nudge_x = 0.075) +
  scale_x_continuous(trans = "log10") +
  scale_y_continuous(trans = "log10")
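
To see what a log10 scale does to the plotted positions, the sketch below inspects the built layer data for a made-up `toy` data frame. The positions are stored as log10 values, while the axis labels still show the original units.

```r
library(ggplot2)

# Hypothetical data chosen so that the log10 values are easy to read
toy <- data.frame(pop = c(1, 10, 100), total = c(10, 100, 1000))

p_log <- ggplot(toy, aes(pop, total)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10()

built <- layer_data(p_log, 1)

# Positions after the scale transformation
built$x
built$y
```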

ggplot2 provides specialised functions for log transformations: scale_x_log10 and scale_y_log10. The x-axis and y-axis labels can be added with xlab() and ylab() respectively, and a title can be added with ggtitle().

p <- murders %>% ggplot(aes(population / 10^6,
                            total,
                            label = abb))
p + geom_point(size = 3) +
  geom_text(nudge_x = 0.075) +
  scale_x_log10() +
  scale_y_log10() +
  xlab("Population in millions (log scale)") +
  ylab("Total number of murders (log scale)") +
  ggtitle("US Gun Murders in 2010")
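
As an aside, the three labelling calls can also be written as a single labs() call. A minimal sketch on a made-up `toy` data frame:

```r
library(ggplot2)

# Hypothetical toy data
toy <- data.frame(pop = c(1, 10, 100), total = c(10, 100, 1000))

p_labs <- ggplot(toy, aes(pop, total)) +
  geom_point(size = 3) +
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Population in millions (log scale)",   # same role as xlab()
       y = "Total number of murders (log scale)",  # same role as ylab()
       title = "Toy example")                      # same role as ggtitle()

p_labs$labels$title
```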

The color of the points can be changed with the col argument of the geom_point function. We redefine the object p without the geom_point layer, which helps in understanding how the col argument works.

p <- murders %>% ggplot(aes(population / 10^6,
                            total,
                            label = abb)) +
  geom_text(nudge_x = 0.075) +
  scale_x_log10() +
  scale_y_log10() +
  xlab("Population in millions (log scale)") +
  ylab("Total number of murders (log scale)") +
  ggtitle("US Gun Murders in 2010")

Mapping color to region.

p + geom_point(aes(col=region), size = 3)
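
The difference between mapping a color (inside aes) and setting a color (outside aes) can be sketched as follows. The `toy` data frame and the choice of "blue" are assumptions for illustration.

```r
library(ggplot2)

# Hypothetical toy data with two regions
toy <- data.frame(pop = c(1, 5, 20), total = c(10, 80, 600),
                  region = c("North", "South", "South"))

base <- ggplot(toy, aes(pop, total))

mapped <- base + geom_point(aes(col = region), size = 3)  # color depends on the data
fixed  <- base + geom_point(col = "blue", size = 3)       # one color for all points

# Mapped: one color per region, and a legend is added
n_mapped_colors <- length(unique(layer_data(mapped, 1)$colour))

# Fixed: every point gets the same color, and no legend is needed
fixed_colors <- unique(layer_data(fixed, 1)$colour)
```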

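We want to add a line that represents the average murder rate per million, r, for the entire country. The line is defined by y = r*x. On log10 scales this becomes log10(y) = log10(r) + log10(x), a line with slope 1 and intercept log10(r), so geom_abline (whose slope defaults to 1) only needs the intercept.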

r <- murders %>% summarise(rate = sum(total) / sum(population) * 10^6) %>% pull(rate)
p + geom_point(aes(col=region), size = 3) +
  geom_abline(intercept = log10(r))
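
The reason only the intercept is passed to geom_abline can be checked numerically: points on y = r*x fall on a line with slope 1 and intercept log10(r) once both axes are on the log10 scale. The values of `r` and `x` below are made up for illustration and are not computed from the murders data.

```r
r <- 30                      # hypothetical rate per million
x <- c(0.5, 1, 5, 20, 40)    # hypothetical populations in millions
y <- r * x                   # points on the line y = r * x

# On the log10 scale the same points satisfy log10(y) = log10(r) + 1 * log10(x)
on_line <- all.equal(log10(y), log10(r) + log10(x))

# Fitting a straight line to the log10 values recovers intercept log10(r) and slope 1
fit <- lm(log10(y) ~ log10(x))
coefs <- unname(coef(fit))
coefs
```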

p <- murders %>% ggplot(aes(population / 10^6,
                            total,
                            label = abb)) +
  geom_abline(intercept = log10(r),
            lty = 2,
            color = "darkgrey") +
  geom_point(aes(col=region), size = 3) +
  scale_color_discrete(name = "Region") +
  geom_text(nudge_x = 0.075) +
  scale_x_log10() +
  scale_y_log10() +
  xlab("Population in millions (log scale)") +
  ylab("Total number of murders (log scale)") +
  ggtitle("US Gun Murders in 2010")
p

Add-on Packages

library(ggthemes)
p + theme_economist()

library(ggrepel)
p <- murders %>% ggplot(aes(population / 10^6,
                            total,
                            label = abb)) +
  geom_abline(intercept = log10(r),
            lty = 2,
            color = "darkgrey") +
  geom_point(aes(col=region), size = 3) +
  geom_text_repel() +
  scale_color_discrete(name = "Region") +
  scale_x_log10() +
  scale_y_log10() +
  xlab("Population in millions (log scale)") +
  ylab("Total number of murders (log scale)") +
  ggtitle("US Gun Murders in 2010") +
  theme_economist()
p

Other Examples

data("heights")
head(heights)
##      sex height
## 1   Male     75
## 2   Male     70
## 3   Male     68
## 4   Male     74
## 5   Male     61
## 6 Female     65

p <- heights %>% 
  filter(sex == "Male") %>% 
  ggplot(aes(x = height))
p + geom_histogram(binwidth = 1,
                   fill = "blue",
                   col = "black") +
  xlab("Male heights in inches") +
  ggtitle("Histogram")

Creating a smooth density plot.

p + geom_density(fill = "blue")

p <- heights %>% 
  filter(sex == "Male") %>% 
  ggplot(aes(sample = height))

p + geom_qq()

params <- heights %>% 
  filter(sex == "Male") %>% 
  summarise(mean = mean(height), sd = sd(height))

p + geom_qq(dparams = params) +
  geom_abline()
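
A rough base R sketch of what this Q-Q plot compares: the sorted sample against quantiles of a normal distribution with the sample's own mean and standard deviation. The heights here are simulated, and the mean of 69 and standard deviation of 3.6 are assumptions for illustration, not values taken from the heights data.

```r
set.seed(1)
h <- rnorm(500, mean = 69, sd = 3.6)  # simulated heights (assumed parameters)

n <- length(h)

# Theoretical quantiles from a normal distribution with the sample mean and sd
theoretical <- qnorm(ppoints(n), mean = mean(h), sd = sd(h))

# Sample quantiles are the sorted observations
sample_q <- sort(h)

# If the data are close to normal, the pairs lie near the identity line,
# which is what geom_abline() draws by default (intercept 0, slope 1)
qq_cor <- cor(theoretical, sample_q)
qq_fit <- coef(lm(sample_q ~ theoretical))
```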

Alternatively, we could scale the data so that it is in standard units and then plot it against the standard normal distribution.

heights %>% filter(sex == "Male") %>%
  ggplot(aes(sample=scale(height))) +
  geom_qq() +
  geom_abline()
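
For reference, here is a small sketch of what scale does to a numeric vector: it subtracts the mean and divides by the standard deviation, giving values in standard units. The heights are simulated with assumed parameters, purely for illustration.

```r
set.seed(1)
h <- rnorm(500, mean = 69, sd = 3.6)  # simulated heights (assumed parameters)

# scale() returns a one-column matrix; as.numeric() turns it back into a vector
z <- as.numeric(scale(h))

# The same computation written out by hand
z_manual <- (h - mean(h)) / sd(h)

mean(z)  # approximately 0
sd(z)    # 1
```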

Grid Plots

p <- heights %>% 
  filter(sex == "Male") %>% 
  ggplot(aes(x = height))

# Three versions of the histogram with different bin widths,
# so that the panels of the grid are not identical
p1 <- p + geom_histogram(binwidth = 1,
                   fill = "blue",
                   col = "black") +
  xlab("Male heights in inches") +
  ggtitle("Histogram, binwidth = 1")

p2 <- p + geom_histogram(binwidth = 2,
                   fill = "blue",
                   col = "black") +
  xlab("Male heights in inches") +
  ggtitle("Histogram, binwidth = 2")

p3 <- p + geom_histogram(binwidth = 3,
                   fill = "blue",
                   col = "black") +
  xlab("Male heights in inches") +
  ggtitle("Histogram, binwidth = 3")

library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
grid.arrange(p1, p2, p3, ncol = 3)