# Basics of ggplot2

## Graph Components

library(dplyr) # add "pipe" like operator %>%
library(ggplot2)
library(dslabs, quietly = TRUE)
data(murders)
head(murders)
##        state abb region population total
## 1    Alabama  AL  South    4779736   135
## 2     Alaska  AK   West     710231    19
## 3    Arizona  AZ   West    6392017   232
## 4   Arkansas  AR  South    2915918    93
## 5 California  CA   West   37253956  1257
## 6   Colorado  CO   West    5029196    65

### ggplot terminology

1. Data component.
• The US murders data table is being summarised.
1. Geometry component
• The plot is a scatter plot.
1. Aesthetic mapping components
• The x-axis values are used to display the population size.
• The y-axis values are used to display the total number of murders.
• Text is used to identify the states.
• Colors are used to denote the four different regions.
1. Scale component
• The x-axis and y-axis ranges are determined by the range of the data.
• In this case, both axes use log scales.
1. Labels, Title, Legend, etc.
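
As a rough sketch of how these components map onto ggplot2 calls, the snippet below rebuilds a small version of the plot from the six rows printed by head(murders) above. The object names murders_head and sketch are made up for this illustration, and the nudge value and title text are arbitrary choices.

```r
library(ggplot2)

# Data component: the six rows shown by head(murders), typed in by hand
murders_head <- data.frame(
  abb        = c("AL", "AK", "AZ", "AR", "CA", "CO"),
  region     = c("South", "West", "West", "South", "West", "West"),
  population = c(4779736, 710231, 6392017, 2915918, 37253956, 5029196),
  total      = c(135, 19, 232, 93, 1257, 65)
)

sketch <- ggplot(murders_head,
                 # Aesthetic mappings: x, y, text label, and color
                 aes(x = population / 10^6, y = total,
                     label = abb, color = region)) +
  geom_point() +                 # Geometry: scatter plot
  geom_text(nudge_x = 0.1) +     # Geometry: state abbreviations as text
  scale_x_log10() +              # Scale: log10 axes
  scale_y_log10() +
  labs(x = "Population in millions (log scale)",  # Labels and title
       y = "Total number of murders (log scale)",
       title = "Sketch of the plot components")
```

Printing sketch draws the plot; each line after the ggplot() call corresponds to one of the components listed above.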

### Creating a new plot

# ggplot(data = murders) is equivalent to murders %>% ggplot()
# assign the graph object to p
p <- murders %>% ggplot() 
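
To see what such an object holds before any geometry is added, here is a minimal check on a made-up data frame (toy and p_empty are hypothetical names, not part of the original notes).

```r
library(ggplot2)

toy <- data.frame(x = c(1, 2, 3), y = c(2, 4, 8))  # made-up data

p_empty <- ggplot(data = toy)

inherits(p_empty, "ggplot")  # is it a ggplot object?
length(p_empty$layers)       # number of layers; none have been added yet
```

Printing p_empty would render a blank canvas, because the object knows its data but has no layers to draw.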

### Layers

Layers can define geometries, compute summary statistics, define what scales to use, and even change styles. To add layers, we use the + symbol.

The first added layer defines the geometry. We want to make a scatter plot, and the function to do this is geom_point. It acts on the data and mapping provided. The mappings in this case are the required arguments x and y, provided through the aes function. The aes function connects the data with what we see on the graph, and its result is passed as the mapping argument of geom_point.

Note: because x and y are the first two arguments of aes, their names can be dropped.
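
A quick way to convince yourself of this is to compare the two spellings directly. The names named and positional are hypothetical; aes only records the expressions, so no data is needed here.

```r
library(ggplot2)

named      <- aes(x = population / 10^6, y = total)
positional <- aes(population / 10^6, total)

names(named)       # "x" "y"
names(positional)  # "x" "y"
```

Both calls produce a mapping with the same aesthetic names, so the shorter form can be used wherever the longer one appears.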

p +
geom_point(aes(x = population / 10^6,
y = total))

geom_label and geom_text allow us to add text to the plot.

p +
geom_point(aes(x = population / 10^6,
y = total), size = 3) +
geom_text(aes(x = population / 10^6,
y = total, label = abb), nudge_x = 1)

### Tinkering

Global aesthetic mappings (aes) can be passed to ggplot to simplify our code.

p <- murders %>% ggplot(aes(population / 10^6,
total,
label = abb))
p + geom_point(size = 3) +
geom_text(nudge_x = 1.5)

We can override global mappings. This is done by setting aes directly in geom_text.

p <- murders %>% ggplot(aes(population / 10^6,
total,
label = abb))
p + geom_point(size = 3) +
geom_text(aes(x = 10, y = 800, label = "Hello there!"))

Only the label assigned by the new local aesthetic mapping is shown.
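
One way to check which mapping a layer ends up using is to inspect the data ggplot2 computes for it with layer_data(). The sketch below uses a made-up data frame (toy, q and text_data are hypothetical names) rather than the murders table.

```r
library(ggplot2)

toy <- data.frame(pop = c(1, 5, 20), total = c(10, 60, 300),
                  abb = c("AA", "BB", "CC"))  # made-up data

q <- ggplot(toy, aes(pop, total, label = abb)) +
  geom_point(size = 3) +
  geom_text(aes(x = 10, y = 800, label = "Hello there!"))

# layer_data() returns the data a layer actually draws; layer 2 is geom_text
text_data <- layer_data(q, 2)
unique(text_data$label)  # the local label, not the abbreviations
unique(text_data$x)      # the local x position
```

The point layer still uses the global mapping, while the text layer uses only the local one.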

### Scales, Labels, and Colors

p <- murders %>% ggplot(aes(population / 10^6,
total,
label = abb))
p + geom_point(size = 3) +
geom_text(nudge_x = 0.075) +
scale_x_continuous(trans = "log10") +
scale_y_continuous(trans = "log10")

ggplot provides specialised functions for log transformations: scale_x_log10 and scale_y_log10. Axis labels can be added with xlab() and ylab(), and a title with ggtitle().
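
To make the effect of the log scales concrete, the following sketch checks that the positions ggplot2 computes for the points are the log10 of the original values. The data frame toy and the names q and built are made up for this illustration.

```r
library(ggplot2)

toy <- data.frame(pop = c(1, 10, 100), total = c(5, 50, 500))  # made-up data

q <- ggplot(toy, aes(pop, total)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10()

# After the scales are applied, the layer data holds log10 values
built <- layer_data(q, 1)
built$x  # log10 of pop
built$y  # log10 of total
```

The axis labels still show the original units; only the spacing of the points changes. This is also why arguments such as nudge_x become much smaller once a log scale is in use.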

p <- murders %>% ggplot(aes(population / 10^6,
total,
label = abb))
p + geom_point(size = 3) +
geom_text(nudge_x = 0.075) +
scale_x_log10() +
scale_y_log10() +
xlab("Population in millions (log scale)") +
ylab("Total number of murders (log scale)") +
ggtitle("US Gun Murders in US 2010")

The color of the points can be changed with the col argument of the geom_point function. To see how the col argument works, we first redefine the object p without the geom_point layer.

p <- murders %>% ggplot(aes(population / 10^6,
total,
label = abb)) +
geom_text(nudge_x = 0.075) +
scale_x_log10() +
scale_y_log10() +
xlab("Population in millions (log scale)") +
ylab("Total number of murders (log scale)") +
ggtitle("US Gun Murders in US 2010")

Mapping color to region.

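p + geom_point(aes(col=region), size = 3)

We want to add a line that represents the average murder rate per million, r, for the entire country. The line is defined by y = r*x. On the log-log scale this becomes log10(y) = log10(r) + log10(x), a line with slope 1 and intercept log10(r), so geom_abline only needs the intercept (its slope defaults to 1).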
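
The relationship between y = r*x and a straight line on log10 axes can be checked numerically. The rate and populations below are made-up values chosen only for illustration.

```r
r <- 30                # made-up rate per million, for illustration only
x <- c(0.5, 1, 5, 20)  # made-up populations in millions
y <- r * x             # points on the line y = r * x

# On log10 axes the vertical offset from log10(x) is constant,
# so the line has slope 1 and intercept log10(r)
log10(y) - log10(x)
log10(r)
```

Every entry of the difference equals log10(r), which is the intercept handed to geom_abline in the code that follows.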

r <- murders %>% summarise(rate = sum(total) / sum(population) * 10^6) %>% pull(rate)
p + geom_point(aes(col=region), size = 3) +
geom_abline(intercept = log10(r))

p <- murders %>% ggplot(aes(population / 10^6,
total,
label = abb)) +
geom_abline(intercept = log10(r),
lty = 2,
color = "darkgrey") +
geom_point(aes(col=region), size = 3) +
scale_color_discrete(name = "Region") +
geom_text(nudge_x = 0.075) +
scale_x_log10() +
scale_y_log10() +
xlab("Population in millions (log scale)") +
ylab("Total number of murders (log scale)") +
ggtitle("US Gun Murders in US 2010")
p

library(ggthemes)
p + theme_economist()

library(ggrepel)
p <- murders %>% ggplot(aes(population / 10^6,
total,
label = abb)) +
geom_abline(intercept = log10(r),
lty = 2,
color = "darkgrey") +
geom_point(aes(col=region), size = 3) +
geom_text_repel() +
scale_color_discrete(name = "Region") +
scale_x_log10() +
scale_y_log10() +
xlab("Population in millions (log scale)") +
ylab("Total number of murders (log scale)") +
ggtitle("US Gun Murders in US 2010") +
theme_economist()
p

### Other Examples

data("heights")
head(heights)
##      sex height
## 1   Male     75
## 2   Male     70
## 3   Male     68
## 4   Male     74
## 5   Male     61
## 6 Female     65
p <- heights %>%
filter(sex == "Male") %>%
ggplot(aes(x = height))
p + geom_histogram(binwidth = 1,
fill = "blue",
col = "black") +
xlab("Male heights in inches") +
ggtitle("Histogram")

Creating a smooth density plot.

p + geom_density(fill = "blue")

p <- heights %>%
filter(sex == "Male") %>%
ggplot(aes(sample = height))

p + geom_qq()

params <- heights %>%
filter(sex == "Male") %>%
summarise(mean = mean(height), sd = sd(height))

p + geom_qq(dparams = params) +
geom_abline()

Alternatively, we could scale the data so that it is in standard units and then plot it against the standard normal distribution.
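
To make "standard units" concrete, the sketch below applies scale() to the six heights printed by head(heights) and checks the result by hand. The names x and z are hypothetical.

```r
x <- c(75, 70, 68, 74, 61, 65)  # the six heights printed by head(heights)
z <- as.numeric(scale(x))       # subtract the mean, divide by the sd

mean(z)  # approximately 0
sd(z)    # 1, up to floating-point error
all.equal(z, (x - mean(x)) / sd(x))
```

Because the scaled values have mean 0 and standard deviation 1, the default geom_qq() and the default geom_abline() (intercept 0, slope 1) can be used without passing dparams.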

heights %>% filter(sex == "Male") %>%
ggplot(aes(sample=scale(height))) +
geom_qq() +
geom_abline()

### Grid Plots

p <- heights %>%
filter(sex == "Male") %>%
ggplot(aes(x = height))

p1 <- p + geom_histogram(binwidth = 1,
fill = "blue",
col = "black") +
xlab("Male heights in inches") +
ggtitle("Histogram")

p2 <- p + geom_histogram(binwidth = 1,
fill = "blue",
col = "black") +
xlab("Male heights in inches") +
ggtitle("Histogram")

p3 <- p + geom_histogram(binwidth = 1,
fill = "blue",
col = "black") +
xlab("Male heights in inches") +
ggtitle("Histogram")

library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
##     combine
grid.arrange(p1, p2, p3, ncol = 3) 
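
A grid is most useful when the panels differ. As a hedged variation on the code above, the sketch below builds one histogram per bin width and arranges them side by side. The bin widths 1, 2 and 3 are assumed values, and toy, plots and layout are hypothetical names; the heights are the six values printed by head(heights).

```r
library(ggplot2)
library(gridExtra)

toy <- data.frame(height = c(75, 70, 68, 74, 61, 65))  # from head(heights)

# One histogram per bin width (1, 2 and 3 inches are assumed values)
plots <- lapply(c(1, 2, 3), function(bw) {
  ggplot(toy, aes(x = height)) +
    geom_histogram(binwidth = bw, fill = "blue", col = "black") +
    ggtitle(paste("binwidth =", bw))
})

# arrangeGrob() builds the layout without drawing it;
# grid.arrange(grobs = plots, ncol = 3) would draw it as well
layout <- arrangeGrob(grobs = plots, ncol = 3)
```

Passing a list through the grobs argument avoids naming each plot separately, which helps when the number of panels changes.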