# Data Visualizaton Principles

## 2019/01/03

library(dslabs)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
##     filter, lag
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union
library(ggplot2)
data("heights")
head(heights)
##      sex height
## 1   Male     75
## 2   Male     70
## 3   Male     68
## 4   Male     74
## 5   Male     61
## 6 Female     65
heights %>%
ggplot(aes(x=height)) +
geom_histogram(aes(y=..density..),binwidth = 1, color="black") +
facet_grid(. ~ sex)

heights %>%
ggplot(aes(x=height)) +
geom_histogram(aes(y=..density..),binwidth = 1, color="black") +
facet_grid(sex ~.)

### Slope charts

data('gapminder')
west <- c("Western Europe", "Northern Europe", "Southern Europe", "Northern America", "Australia", "New Zealand")

dat <- gapminder %>%
filter(year %in% c(2010, 2015) &
region %in% west &
!is.na(life_expectancy) &
population > 10^7)

Slope charts are used when you are comparing variables of the same type but at different time points and for a relatively small number of comparison.

df <- dat %>%
mutate(location = ifelse(year == 2010, 1, 2),
location = ifelse(year == 2015 & country %in% c("United Kingdom", "Portugal"),
location + 0.22, location),
hjust = ifelse(year == 2010, 1, 0)) %>%
mutate(year = as.factor(year))

df %>% ggplot(aes(year, life_expectancy, group = country)) +
geom_line(aes(color = country), show.legend = FALSE) +
geom_text(aes(x = location, label = country, hjust = hjust),
show.legend = FALSE) +
xlab("") + ylab("Life Expectancy")

An advantage of the slope chart is that it permits us to quickly get an idea of changes based on the slope of the lines. Note, that we’re using angle as a visual cue, but we also have position to determine the exact values.

## Case Study: Vacines

data("us_contagious_diseases")
str(us_contagious_diseases)
## 'data.frame':    18870 obs. of  6 variables:
##  $disease : Factor w/ 7 levels "Hepatitis A",..: 1 1 1 1 1 1 1 1 1 1 ... ##$ state          : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $year : num 1966 1967 1968 1969 1970 ... ##$ weeks_reporting: int  50 49 52 49 51 51 45 45 45 46 ...
##  $count : num 321 291 314 380 413 378 342 467 244 286 ... ##$ population     : num  3345787 3364130 3386068 3412450 3444165 ...

dat stores all the measles data. It includes a per 100,000 rate, orders states by average value of disease, and removes Alaska and Hawaii, since they only became states in the late 50s.

We can now easily plot disease rates per year. Here are the measles data for California. We can use this simple code to show it. We add a vertical line at 1963, since this is when the vaccine was introduced.

the_disease <- "Measles"
dat <- us_contagious_diseases %>%
disease == the_disease) %>%
mutate(rate = count / population * 10000) %>%
mutate(state = reorder(state, rate))
dat %>% filter(state == "California") %>%
ggplot(aes(year, rate)) +
geom_line() +
ylab("Cases per 10,000") +
geom_vline(xintercept = 1963, col="blue")

When choosing colors to quantify a numeric variable, we choose between two options, sequential and diverging. Sequential palettes are suited for data that goes from high to low. High values are clearly distinguished from the low values.

library(RColorBrewer)
display.brewer.all(type = "seq")

An example of when we would use a divergent pattern would be if we were to show heights and standard deviations away from the average.

display.brewer.all(type = "div")

We use the geometry geom_tile to tile the region with colors representing disease rates. We use square root transformation to avoid having the really high counts dominate the plot. Here’s the code that generates a very nice and impactful plot.

dat %>% ggplot(aes(year, state, fill=rate)) +
geom_tile(color="grey50") +
scale_x_continuous(expand = c(0, 0) ) +
trans="sqrt") +
geom_vline(xintercept = 1963, color="blue") +
theme_minimal() +
ylab("") +
xlab("")

This plot makes a very striking argument for the contribution of vaccines.

However, one limitation of this plot is that it uses color to represent quantity, which we earlier explained makes it a bit harder to know exactly how high it is going. Position and lens are better cues. If we are willing to lose data information, we can make a version of the plot that shows the values with position. We can also show the average for the US, which we compute like this.

avg <- us_contagious_diseases %>%
filter(disease == the_disease) %>%
group_by(year) %>%
summarise(
us_rate = sum(count, na.rm = TRUE) / sum(population, na.rm = TRUE) * 10000
)
avg
## # A tibble: 76 x 2
##     year us_rate
##    <dbl>   <dbl>
##  1  1928    40.4
##  2  1929    27.9
##  3  1930    31.3
##  4  1931    35.4
##  5  1932    31.2
##  6  1933    30.2
##  7  1934    57.3
##  8  1935    56.5
##  9  1936    21.9
## 10  1937    22.8
## # ... with 66 more rows

Now to make the plot, we simply use the geom_line geometry. We are going to make every state the same color. This is because it’s harder to choose 50 distinct colors. However, the plot is very impactful. It shows very clearly how after the vaccine was introduced the rates went down across all states. It shows the same information as our previous plot, but now we can actually see what the values are. End of transcript. Skip to the start.

dat %>% filter(!is.na(rate)) %>%
ggplot() +
geom_line( aes(year, rate, group=state),
color="grey50",
alpha = 0.2,
size = 1) +
scale_y_continuous(expand = c(0, 0),
trans = "sqrt",
breaks = c(5, 25, 125, 300) ) +
geom_vline(xintercept = 1963, color="blue") +
geom_line(data = avg, aes(x=year, us_rate), size = 1) +
geom_text(data = data.frame(x=1955, y=50),
mapping = aes(x, y, label="US average"),
color="black") +
ggtitle("Cases per 10,000 by state") +
theme_light() +
ylab("") +
xlab("")

### Time series plot - all diseases in California

• Include only years with 10 or more weeks reporting.
us_contagious_diseases %>%
filter(state=="California" & weeks_reporting >= 10) %>%
group_by(year, disease) %>%
summarize(rate = sum(count)/sum(population)*10000) %>%
ggplot(aes(year, rate, color=disease)) +
geom_line()

### Time series plot - all diseases in the United States

• Compute the US rate by using summarize to sum over states.
• The US rate for each disease will be the total number of cases divided by the total population.
• cases per 10000
us_contagious_diseases %>%
filter(!is.na(population) ) %>%
group_by(state) %>%
summarise(rate = sum(count)/sum(population)*10000)
## # A tibble: 51 x 2
##    state                 rate
##    <fct>                <dbl>
##  1 Alabama              2.03
##  3 Arizona              2.69
##  4 Arkansas             2.38
##  5 California           2.47
##  7 Connecticut          4.73
##  8 Delaware             2.33
##  9 District Of Columbia 3.42
## 10 Florida              0.838
## # ... with 41 more rows
• You will need to filter for !is.na(population) to get all the data.
• Plot each disease in a different color.
us_contagious_diseases %>%
filter(!is.na(population) ) %>%
group_by(year, disease) %>%
summarise(rate = sum(count)/sum(population)*10000) %>%
ggplot(aes(year, rate, color=disease)) +
geom_line()

Irizarry, Rafael A. 2017. Dslabs: Data Science Labs. https://CRAN.R-project.org/package=dslabs.

R Core Team. 2018. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2018. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.