Introduction to the Tidyverse

Michael Taylor

2018/05/03

# if (!require("DT")) install.packages('DT')
library(DT)

Data wrangling

Loading the gapminder and dplyr packages

# Load the gapminder package
suppressPackageStartupMessages(library(gapminder))

# Load the dplyr package
library(dplyr)
# Look at the gapminder dataset
gapminder %>% datatable()

Filtering for one year

The filter verb extracts particular observations based on a condition.

  • Add a filter() line after the pipe (%>%) to extract only the observations from the year 1957. Remember that you use == to compare two values.
# Filter the gapminder dataset for the year 1957
gapminder %>% 
  filter(year == 1957) %>% datatable()
  • Filter the gapminder data to retrieve only the observation from China in the year 2002.
gapminder %>% 
  filter(country == "China") %>% 
  filter(year == 2002) %>% 
  datatable()
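
As a side note to the exercise above, the two chained filter() calls can also be written as a single call. The sketch below assumes only that dplyr and gapminder are installed; the names chained and combined are illustrative and not part of the original exercise. Conditions separated by commas inside filter() are combined with a logical AND.

```r
library(dplyr)
library(gapminder)

# Two chained filter() calls, as in the exercise above
chained <- gapminder %>%
  filter(country == "China") %>%
  filter(year == 2002)

# One filter() call: comma-separated conditions must all be TRUE
combined <- gapminder %>%
  filter(country == "China", year == 2002)

# Both approaches keep the same single observation
nrow(chained)
nrow(combined)
```

Either form works; the single call is shorter, while chaining makes it easy to comment out one condition at a time.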

Arranging observations by life expectancy

  • Sort the gapminder dataset in ascending order of life expectancy (lifeExp).
  • Sort the gapminder dataset in descending order of life expectancy.
gapminder %>%
  arrange(lifeExp) %>% 
  datatable()
gapminder %>% 
  arrange(desc(lifeExp)) %>% 
  datatable()
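
arrange() also accepts more than one sort key. The following is a minimal sketch, not part of the original exercise, that sorts by continent first and then by descending life expectancy within each continent; the name sorted is illustrative.

```r
library(dplyr)
library(gapminder)

# Sort by continent (factor level order), then by descending lifeExp
# within each continent
sorted <- gapminder %>%
  arrange(continent, desc(lifeExp))

# The first rows are the African observations with the highest lifeExp
head(sorted)
```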

Filtering and arranging

  • Use filter() to extract observations from just the year 1957, then use arrange() to sort in descending order of population (pop).
gapminder %>% 
  filter(year==1957) %>% 
  arrange(desc(pop)) %>% 
  datatable()

Using mutate to change or create a column

  • Use mutate() to change the existing lifeExp column, by multiplying it by 12: 12 * lifeExp.
  • Use mutate() to add a new column, called lifeExpMonths, calculated as 12 * lifeExp.
gapminder %>% 
  mutate(lifeExp=lifeExp*12) %>% 
  datatable()
gapminder %>% 
  mutate(lifeExpMonths=12*lifeExp) %>% 
  datatable()
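
The difference between the two mutate() calls above can be checked by comparing column names. This is a small sketch with illustrative object names (overwritten, added): reusing an existing name replaces that column, while a new name appends a column.

```r
library(dplyr)
library(gapminder)

# Reusing the name lifeExp overwrites the existing column
overwritten <- gapminder %>%
  mutate(lifeExp = 12 * lifeExp)

# A new name (lifeExpMonths) appends a column and keeps lifeExp as is
added <- gapminder %>%
  mutate(lifeExpMonths = 12 * lifeExp)

names(overwritten)  # same columns as gapminder
names(added)        # original columns plus lifeExpMonths
```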

Combining filter, mutate, and arrange

  • filter() for observations from the year 2007,
  • mutate() to create a column lifeExpMonths, calculated as 12 * lifeExp, and
  • arrange() in descending order of that new column
gapminder %>% 
  filter(year == 2007) %>% 
  mutate(lifeExpMonths = 12 * lifeExp) %>% 
  arrange(desc(lifeExpMonths)) %>%
  datatable()

Data visualization

Variable assignment

  • Load the ggplot2 package.
  • Filter gapminder for observations from the year 1952, and assign it to a new dataset gapminder_1952 using the assignment operator (<-).
# Load the ggplot2 package
library(ggplot2)
gapminder_1952 <- gapminder %>%
  filter(year == 1952)

Comparing population and GDP per capita

ggplot(gapminder_1952, aes(x = pop, y = gdpPercap)) +
  geom_point()

Comparing population and life expectancy

  • Create a scatter plot of gapminder_1952 with population (pop) on the x-axis and life expectancy (lifeExp) on the y-axis.
ggplot(gapminder_1952, aes(x=pop, y=lifeExp)) +
  geom_point()

Putting the x-axis on a log scale

ggplot(gapminder_1952, aes(x=pop, y=lifeExp)) +
  geom_point() +
  scale_x_log10()
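
One plausible motivation for the log scale, which the original does not state, is that country populations span several orders of magnitude, so most points bunch up near zero on a linear axis. The sketch below checks this; pop_range and orders_of_magnitude are illustrative names.

```r
library(dplyr)
library(gapminder)

gapminder_1952 <- gapminder %>%
  filter(year == 1952)

# Smallest and largest country population in 1952
pop_range <- range(gapminder_1952$pop)
pop_range

# Number of powers of ten between the smallest and largest population
orders_of_magnitude <- log10(pop_range[2] / pop_range[1])
orders_of_magnitude
```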

Putting the x- and y-axes on a log scale

ggplot(gapminder_1952, aes(x=pop, y=gdpPercap)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10() +
  ggtitle("Scatter plot comparing pop and gdpPercap")

ggplot(gapminder_1952, aes(x=pop, y=lifeExp, color=continent)) +
  geom_point() +
  scale_x_log10() +
  ggtitle("Scatter plot comparing pop and lifeExp")

Adding size and color to a plot

  • Modify the scatter plot so that the size of the points represents each country’s GDP per capita (gdpPercap).
ggplot(gapminder_1952, aes(x=pop, 
                           y=lifeExp, 
                           color=continent,
                           size=gdpPercap)) +
  geom_point() +
  scale_x_log10() +
  ggtitle("Scatter plot comparing pop and lifeExp")

Creating a subgraph for each continent

  • Create a scatter plot of gapminder_1952 with the x-axis representing population (pop), the y-axis representing life expectancy (lifeExp), and faceted to have one subplot per continent (continent). Put the x-axis on a log scale.
gapminder_1952 %>% 
  ggplot(aes(pop , lifeExp)) +
  geom_point() +
  facet_wrap(~ continent) +
  scale_x_log10()

Faceting by year

Create a scatter plot of the gapminder data:

  • Put GDP per capita (gdpPercap) on the x-axis and life expectancy (lifeExp) on the y-axis, with continent (continent) represented by color and population (pop) represented by size.
  • Put the x-axis on a log scale.
  • Facet by the year variable.

gapminder %>% 
  ggplot(aes(gdpPercap, lifeExp, color=continent, size=pop)) +
  geom_point() +
  scale_x_log10() +
  facet_wrap(~ year)

Grouping and summarizing

gapminder %>% 
  summarise(medianLifeExp = median(lifeExp))
## # A tibble: 1 x 1
##   medianLifeExp
##           <dbl>
## 1          60.7

Summarizing the median life expectancy in 1957

  • Filter for the year 1957, then use the median() function within a summarize() to calculate the median life expectancy into a column called medianLifeExp.
gapminder %>%
  filter(year == 1957) %>% 
  summarise(medianLifeExp = median(lifeExp))
## # A tibble: 1 x 1
##   medianLifeExp
##           <dbl>
## 1          48.4

Summarizing multiple variables in 1957

  • Find both the median life expectancy (lifeExp) and the maximum GDP per capita (gdpPercap) in the year 1957, calling them medianLifeExp and maxGdpPercap respectively. You can use the max() function to find the maximum.
gapminder %>% 
  filter(year == 1957) %>% 
  summarise(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))
## # A tibble: 1 x 2
##   medianLifeExp maxGdpPercap
##           <dbl>        <dbl>
## 1          48.4      113523.
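
summarise() collapses the data to one row, so the result no longer says which country had the maximum GDP per capita. A minimal sketch for recovering that row, using only filter() and an illustrative name top_1957, might look like this:

```r
library(dplyr)
library(gapminder)

# Keep the 1957 observation(s) whose gdpPercap equals the 1957 maximum
top_1957 <- gapminder %>%
  filter(year == 1957) %>%
  filter(gdpPercap == max(gdpPercap))

top_1957
```

The second filter() is evaluated on the already filtered 1957 rows, so max(gdpPercap) refers to the 1957 maximum only.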

Summarizing by year

  • Find the median life expectancy (lifeExp) and maximum GDP per capita (gdpPercap) within each year, saving them into medianLifeExp and maxGdpPercap, respectively.
gapminder %>% 
  group_by(year) %>% 
  summarise(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap)) %>% 
  datatable()

Summarizing by continent

  • Filter the gapminder data for the year 1957. Then find the median life expectancy (lifeExp) and maximum GDP per capita (gdpPercap) within each continent, saving them into medianLifeExp and maxGdpPercap, respectively.
gapminder %>%
  filter(year == 1957) %>% 
  group_by(continent) %>% 
  summarise(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap)) %>% 
  datatable()

Summarizing by continent and year

  • Find the median life expectancy (lifeExp) and maximum GDP per capita (gdpPercap) within each combination of continent and year, saving them into medianLifeExp and maxGdpPercap, respectively.
gapminder %>%
  group_by(continent, year) %>% 
  summarise(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap)) %>% 
  datatable()
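
When summarising after grouping by two variables, dplyr by default drops only the last grouping level, so the result above is still grouped by continent. The sketch below assumes dplyr 1.0.0 or later, where summarise() has a .groups argument; that version requirement is an assumption, since the original does not state which dplyr version was used.

```r
library(dplyr)
library(gapminder)

# Default behaviour: the result stays grouped by continent
still_grouped <- gapminder %>%
  group_by(continent, year) %>%
  summarise(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))

# .groups = "drop" (assumes dplyr >= 1.0.0) returns an ungrouped tibble
ungrouped <- gapminder %>%
  group_by(continent, year) %>%
  summarise(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap),
            .groups = "drop")

group_vars(still_grouped)
is_grouped_df(ungrouped)
```

Dropping the grouping explicitly avoids surprises when later verbs such as mutate() or arrange() are applied to the summary.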

Visualizing median life expectancy over time

Use the by_year dataset to create a scatter plot showing the change of median life expectancy over time, with year on the x-axis and medianLifeExp on the y-axis. Add expand_limits(y = 0) to make sure the plot’s y-axis includes zero.

by_year <- gapminder %>%
  group_by(year) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))

by_year %>% 
  ggplot(aes(year, medianLifeExp)) +
  geom_point() +
  expand_limits(y = 0)

Visualizing median GDP per capita per continent over time

  • Summarize the gapminder dataset by continent and year, finding the median GDP per capita (gdpPercap) within each and putting it into a column called medianGdpPercap. Use the assignment operator <- to save this summarized data as by_year_continent.
  • Create a scatter plot showing the change in medianGdpPercap by continent over time. Use color to distinguish between continents, and be sure to add expand_limits(y = 0) so that the y-axis starts at zero.
by_year_continent <- 
  gapminder %>% 
  group_by(continent, year) %>% 
  summarise(medianGdpPercap = median(gdpPercap))

by_year_continent %>% 
  ggplot(aes(year, medianGdpPercap, color = continent)) +
  geom_point() +
  ggtitle("change in `medianGdpPercap` by `continent` over time") +
  expand_limits(y = 0)

Comparing median life expectancy and median GDP per continent in 2007

  • Filter the gapminder dataset for the year 2007, then summarize the median GDP per capita and the median life expectancy within each continent, into columns called medianLifeExp and medianGdpPercap. Save this as by_continent_2007.
  • Use the by_continent_2007 data to create a scatter plot comparing these summary statistics for continents in 2007, putting the median GDP per capita on the x-axis and the median life expectancy on the y-axis. Color the scatter plot by continent. You don’t need to add expand_limits(y = 0) for this plot.
by_continent_2007 <- gapminder %>% 
  filter(year == 2007) %>% 
  group_by(continent) %>% 
  summarise(medianLifeExp = median(lifeExp),
            medianGdpPercap = median(gdpPercap))

by_continent_2007 %>% 
  ggplot(aes(medianGdpPercap, medianLifeExp, color = continent)) + 
  geom_point()

Types of visualizations

Visualizing median GDP per capita over time

  • Use group_by() and summarize() to find the median GDP per capita within each year, calling the output column medianGdpPercap. Use the assignment operator <- to save it to a dataset called by_year.
  • Use the by_year dataset to create a line plot showing the change in median GDP per capita over time. Be sure to use expand_limits(y = 0) to include 0 on the y-axis.
by_year <- gapminder %>%
  group_by(year) %>% 
  summarise(medianGdpPercap = median(gdpPercap))

ggplot(by_year, aes(year, medianGdpPercap)) +
  geom_line() +
  expand_limits(y = 0)

Visualizing median GDP per capita by continent over time

  • Use group_by() and summarize() to find the median GDP per capita within each year and continent, calling the output column medianGdpPercap. Use the assignment operator <- to save it to a dataset called by_year_continent.
  • Use the by_year_continent dataset to create a line plot showing the change in median GDP per capita over time, with color representing continent. Be sure to use expand_limits(y = 0) to include 0 on the y-axis.
by_year_continent <- gapminder %>% 
  group_by(year, continent) %>% 
  summarise(medianGdpPercap = median(gdpPercap))

ggplot(by_year_continent, aes(year, medianGdpPercap, color=continent)) +
  geom_line() +
  ggtitle("median GDP per capita within each `year` and `continent`") +
  expand_limits(y = 0)

Visualizing median GDP per capita by continent

  • Use group_by() and summarize() to find the median GDP per capita within each continent in the year 1952, calling the output column medianGdpPercap. Use the assignment operator <- to save it to a dataset called by_continent.
  • Use the by_continent dataset to create a bar plot showing the median GDP per capita in each continent.
by_continent <- gapminder %>% 
  filter(year == 1952) %>%
  group_by(continent) %>% 
  summarise(medianGdpPercap = median(gdpPercap))

ggplot(by_continent, aes(continent, medianGdpPercap)) +
  geom_col() +
  ggtitle("median GDP per capita within each `continent` in the `year` 1952")
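
geom_col() uses the y values in the data as bar heights. For comparison, the sketch below draws the same chart with geom_bar(stat = "identity"), which is the older spelling of the same idea; p_col and p_bar are illustrative names not used in the original.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

by_continent <- gapminder %>%
  filter(year == 1952) %>%
  group_by(continent) %>%
  summarise(medianGdpPercap = median(gdpPercap))

# geom_col(): bar height is taken directly from medianGdpPercap
p_col <- ggplot(by_continent, aes(continent, medianGdpPercap)) +
  geom_col()

# geom_bar() counts rows by default; stat = "identity" makes it use
# the supplied y values instead, matching geom_col()
p_bar <- ggplot(by_continent, aes(continent, medianGdpPercap)) +
  geom_bar(stat = "identity")

p_col
```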

Visualizing GDP per capita by country in Oceania

  • Filter for observations in the Oceania continent in the year 1952. Save this as oceania_1952.
  • Use the oceania_1952 dataset to create a bar plot, with country on the x-axis and gdpPercap on the y-axis.
oceania_1952 <- gapminder %>% 
  filter(year == 1952) %>% 
  filter(continent == "Oceania")

ggplot(oceania_1952, aes(x = country, y = gdpPercap)) +
  geom_col() +
  ggtitle("Oceanic countries GDP per capita")

Visualizing population

  • Use the gapminder_1952 dataset to create a histogram of country population (pop) in the year 1952.
ggplot(gapminder_1952, aes(x = pop)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  • Put the x-axis on a log scale with scale_x_log10().
ggplot(gapminder_1952, aes(x = pop)) +
  geom_histogram() +
  scale_x_log10()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
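
The stat_bin() message above suggests choosing the binning explicitly. The following sketch shows two ways to do so; the values 20 and 0.25 are arbitrary illustrations, not recommendations from the original. With scale_x_log10(), the data are transformed before binning, so binwidth is expressed in log10 units.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

gapminder_1952 <- gapminder %>%
  filter(year == 1952)

# Option 1: fix the number of bins (20 is an arbitrary choice)
p_bins <- ggplot(gapminder_1952, aes(x = pop)) +
  geom_histogram(bins = 20) +
  scale_x_log10()

# Option 2: fix the bin width; on a log10 axis, 0.25 means each bin
# covers a quarter of a power of ten
p_width <- ggplot(gapminder_1952, aes(x = pop)) +
  geom_histogram(binwidth = 0.25) +
  scale_x_log10()

p_bins
```

Setting either bins or binwidth also silences the message.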

Comparing GDP per capita across continents

  • Use the gapminder_1952 dataset to create a boxplot comparing GDP per capita (gdpPercap) among continents. Put the y-axis on a log scale with scale_y_log10().
ggplot(gapminder_1952, aes(continent, gdpPercap)) +
  geom_boxplot() +
  scale_y_log10() +
  ggtitle("Comparing GDP per capita across continents")