# Gapminder

## 2018/12/20

library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
##     filter, lag
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union
library(dslabs)
library(ggplot2)
data("gapminder")
str(gapminder)
## 'data.frame':    10545 obs. of  9 variables:
##  $country : Factor w/ 185 levels "Albania","Algeria",..: 1 2 3 4 5 6 7 8 9 10 ... ##$ year            : int  1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
##  $infant_mortality: num 115.4 148.2 208 NA 59.9 ... ##$ life_expectancy : num  62.9 47.5 36 63 65.4 ...
##  $fertility : num 6.19 7.65 7.32 4.43 3.11 4.55 4.82 3.45 2.7 5.57 ... ##$ population      : num  1636054 11124892 5270844 54681 20619075 ...
##  $gdp : num NA 1.38e+10 NA NA 1.08e+11 ... ##$ continent       : Factor w/ 5 levels "Africa","Americas",..: 4 1 1 2 2 3 2 5 4 3 ...
##  $region : Factor w/ 22 levels "Australia and New Zealand",..: 19 11 10 2 15 21 2 1 22 21 ... So here’s a quiz. For each of the pairs of countries here, which country do you think had the highest child mortality in 2015? 1. Sri Lanka or Turkey 2. Poland or South Korea 3. Malaysia or Russia 4. Pakistan or Vietnam 5. Thailand or South Korea gapminder %>% filter(year==2015 & country %in% c("Sri Lanka", "Turkey")) %>% select(country, infant_mortality) ## country infant_mortality ## 1 Sri Lanka 8.4 ## 2 Turkey 11.6 filter(gapminder, year==1962) %>% ggplot(aes(fertility, life_expectancy))+ geom_point(aes(color=continent)) Note life expectancy falling into two distinct categories. To stratify data by some variable we will be using facet_grid(). filter(gapminder, year %in% c(1962, 2015)) %>% ggplot(aes(fertility, life_expectancy))+ geom_point(aes(color=continent)) + facet_grid(. ~ year) ## Warning: Removed 1 rows containing missing values (geom_point). After we split the plot like this, it clearly shows that the majority of countries have moved from the developing world cluster to the Western world one. years <- c(1962, 1980, 1990, 2000, 2012) continents <- c("Europe", "Asia") gapminder %>% filter(year %in% years & continent %in% continents) %>% ggplot(aes(fertility, life_expectancy, col = continent)) + geom_point() + facet_wrap(~ year) gapminder %>% select(country, fertility, year) %>% filter(country=="United States") %>% ggplot(aes(year, fertility)) + geom_point() ## Warning: Removed 1 rows containing missing values (geom_point). countries <- c("South Korea", "Germany") gapminder %>% filter(country %in% countries) %>% ggplot(aes(year, fertility, color=country))+ geom_line() ## Warning: Removed 2 rows containing missing values (geom_path). We are going to show an example of how to add labels to a time series plot. We define a data table with the label locations. And then we use a second mapping just for the labels. labels <- data_frame(country = countries, x = c(1975, 1965), y = c(60, 72)) gapminder %>% filter(country %in% countries) %>% ggplot(aes(year, life_expectancy, color=country))+ geom_line() + geom_text(data = labels, aes(x, y, label=countries))+ theme(legend.position = "none") labels ## # A tibble: 2 x 3 ## country x y ## <chr> <dbl> <dbl> ## 1 South Korea 1975 60 ## 2 Germany 1965 72 ### Transformation We are going to add the dollars per day variable. gapminder <- gapminder %>% mutate(dollar_per_day=(gdp/population)/365 ) Here’s a histogram of per day incomes from 1970. Note that GDP values is in our data table are adjusted for inflation and represent current US dollars. gapminder %>% filter(year==1970 & !is.na(gdp)) %>% ggplot(aes(dollar_per_day)) + geom_histogram(binwidth = 1, color="black") It might be more informative to quickly be able to see how many countries make on average about$1 a day,extremely poor– $2 a day, very poor–$4 a day, poor– $8 a day, which is about middle–$16 a day which is a well-off country– $32 is rich, and$64 which is very rich. These changes are multiplicative. And here we introduce log transformations. Log transformations change multiplicative changes into additive ones.

Using base 2 for, example, means that every time a value doubles, the log transformation increases by one. So to get the distribution of the log base 2 transform values, we simply transform the data and use the same code.

past_year <- 1970
gapminder %>%
filter(year==past_year & !is.na(gdp)) %>%
ggplot(aes(log2(dollar_per_day) )) +
geom_histogram(binwidth = 1, color="black")

The histogram we just saw suggests that in 1970, country income distribution have two modes. One at about $2 per day, one in the log2 scale, and another at about$32 per day– 5 in the log2 scale. 5 in the $$\log_2$$ scale. This bimodality is consistent with the dichotomous world made up of countries with average incomes less than $8 per day, 3 on the log scale and countries above that we see two modes in the histogram. We already learned this. We learned the scale_x_continuous function. So we want to remake the histograms that we already made but now using scales that have been transformed. We simply add a layer using the scale under scale_x_continuous() function. gapminder %>% filter(year==past_year & !is.na(gdp)) %>% ggplot(aes(dollar_per_day )) + geom_histogram(binwidth = 1, color="black") + scale_x_continuous(trans = "log2") ### stratify and boxplot gapminder$region %>% unique() %>% length()
## [1] 22
p <- gapminder %>%
filter(year==past_year & !is.na(gdp)) %>%
ggplot(aes(region, dollar_per_day))

p + geom_boxplot()

Now, note that we can’t read the region names because the default ggplot behavior is to write the labels horizontally. We can fix this by rotating the labels.

The hjust = 1 argument justifies this text so that it is next to the axis.

p + geom_boxplot() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))

Order by something meaningful use reorder().

We will reorder regions by their median income levels. To achieve this, we write the same code as before but we add mutate that changes region to a new factor where the levels are reordered.

p <- gapminder %>%
filter(year==past_year & !is.na(gdp)) %>%
mutate(region = reorder(region, dollar_per_day, FUN = median)) %>%
ggplot(aes(region, dollar_per_day, fill=continent)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
xlab("")

p

We can see that there’s four box plots that stand out at the end. The four highest ones. These are Western Europe, Australia and New Zealand, northern Europe, and North America. This is what we define as the West.

The last change we can make to this plot to help us see the data little bit better, is to change the scale through the log scale. We want to change it to be log2 scale in this case, so we add the layer scale_y_contiuous(), and we use the log2 transformation. This helps us to see the differences between the countries with the lower income. For example, we see a difference now between the African continent, which is in red, and Asia, which is in green.

p + scale_y_continuous(trans = "log2")

We could add the actual points to also show the data.

p + scale_y_continuous(trans = "log2") +
geom_point(show.legend = FALSE)

The exploratory data analysis we have conducted has revealed two characteristics about average income distributions in 1970. Using a histogram, we found a bimodal distribution with the most relating to poor and rich countries. So we are going to define a vector that defines the regions in the West.

west <- c("Western Europe", "Northern Europe", "Southern Europe", "Northern America", "Australia", "New Zealand")

Now we want to focus on comparing the differences in distributionacross time. We start by confirming that the bi-modality observed in 1970 is explained by a west versus developing world economy. We do this by creating a histogram for the groups previously defined. Note that we create the two groups with an if else inside a mutate.

gapminder %>%
filter(year==past_year & !is.na(gdp)) %>%
mutate(group= ifelse(region %in% west, "West", "Developing")) %>%
ggplot(aes(dollar_per_day)) +
geom_histogram(binwidth = 1, color="black") +
scale_x_continuous(trans = "log2") +
facet_grid(. ~ group)

Now we’re ready to see if the separation is worse today than it was 40 years ago. We do this by now faceting by both region and year.

present_year <- 2010

gapminder %>%
filter(year==c(past_year, present_year) & !is.na(gdp)) %>%
mutate(group= ifelse(region %in% west, "West", "Developing")) %>%
ggplot(aes(dollar_per_day)) +
geom_histogram(binwidth = 1, color="black") +
scale_x_continuous(trans = "log2") +
facet_grid(year ~ group)
## Warning in year == c(past_year, present_year): longer object length is not
## a multiple of shorter object length

So we’re going to remake the plots, but using only countries with data available for both years.

country_list_1 <- gapminder %>%
filter(year==past_year & !is.na(dollar_per_day)) %>%
.$country country_list_2 <- gapminder %>% filter(year==present_year & !is.na(dollar_per_day)) %>% .$country
country_list <- intersect(country_list_1, country_list_2)
gapminder %>%
filter(year==c(past_year, present_year) & country %in% country_list) %>%
mutate(group= ifelse(region %in% west, "West", "Developing")) %>%
ggplot(aes(dollar_per_day)) +
geom_histogram(binwidth = 1, color="black") +
scale_x_continuous(trans = "log2") +
facet_grid(year ~ group)
## Warning in year == c(past_year, present_year): longer object length is not
## a multiple of shorter object length

p <- gapminder %>%
filter(year==c(past_year, present_year) & country %in% country_list ) %>%
mutate(region = reorder(region, dollar_per_day, FUN = median)) %>%
ggplot() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
## Warning in year == c(past_year, present_year): longer object length is not
## a multiple of shorter object length
p + geom_boxplot(aes(region, dollar_per_day, fill=continent)) +
scale_y_continuous(trans = "log2") +
xlab("") +
facet_grid(year ~ .)

p + geom_boxplot(aes(region, dollar_per_day, fill=factor(year))) +
scale_y_continuous(trans = "log2") +
xlab("")

### Density plot

gapminder %>%
filter(year==past_year & country %in% country_list) %>%
mutate(group= ifelse(region %in% west, "West", "Developing")) %>%
group_by(group) %>%
summarise(n=n()) %>% knitr::kable()
group n
Developing 88
West 20

The next message we need to convey is that the reason for this change in distribution is that poor countries became richer rather than some rich countries becoming poorer. However, before we can do this, we need to learn how to make these smooth densities in a way that preserves information of how many countries are in each group. If we divide the world into developing and West, we have 88 developing countries and 20 Western countries. If we overlay the two densities, the defaultis is to have the area represented by each distribution add up to 1 regardless of the size of each group.

This makes it seem like there’s the same number of countries in each group, which is incorrect. To change this, we’ll need to learn to access computed variables with the geom_density function. To have the areas of the densities be proportional to the size of the groups, we can simply multiply the y-axis values by the size of the group. From the geom_density help file, we see that the function computes a variable called count that does exactly this. We want this variable to be on the y-axis rather than the density value. In ggplot, we can access these variablesby surrounding their names with dot dot.

So we will use the following mapping. We type aes() x = dollars_per_day and y = ..count.. This will put count on the y-axis. We can now create the desired plot by simply changing the mapping in the previous code chunk.

p <- gapminder %>%
filter(year==c(past_year, present_year) & country %in% country_list) %>%
mutate(group= ifelse(region %in% west, "West", "Developing")) %>%
ggplot(aes(dollar_per_day, y = ..count.. , fill=group)) +
scale_x_continuous(trans = "log2") +
facet_grid(year ~ group)
## Warning in year == c(past_year, present_year): longer object length is not
## a multiple of shorter object length
p + geom_density(alpha=.2) +
facet_grid(year ~ .)

If you want the densities to be smoother, because we can see in the Western countries, there was a lot of unsmoothness, we can change the bw argument.

p + geom_density(alpha=.2 , bw = .75) +
facet_grid(year ~ .)

We can easily alter the plot to show key regions separately. To do this, we introduced a new function called case_when().

gapminder <- gapminder %>%
mutate(group = case_when(
region %in% west ~ "The West",
region %in% c("Eastern Asia", "South-Eastern Asia") ~ "East Asia",
region %in% c("Caribbean", "Central America", "South America") ~ "Latin America",
continent == "Africa" & region != "Northern Africa" ~ "Sub-Saharan Africa",
TRUE ~ "Others"
))
gapminder <- gapminder %>%
mutate(group = factor(group, levels = c("Others", "Latin America", "East Asia", "Sub-Saharan Africa", "The West")))
p <- gapminder %>%
filter(year %in% c(past_year, present_year) & country %in% country_list) %>%
ggplot(aes(dollar_per_day, y = ..count.. , fill=group)) +
scale_x_continuous(trans = "log2") +
facet_grid(year ~ group)

p + geom_density(alpha=.2 , bw = .75) +
facet_grid(year ~ .)

p + geom_density(alpha=.2 , bw = .75, position = "stack") +
facet_grid(year ~ .)

gapminder %>%
filter(year %in% c(past_year, present_year) & country %in% country_list) %>%
group_by(year) %>%
mutate(weight = population/ sum(population)) %>%
ungroup() %>%
ggplot(aes(dollar_per_day, fill=group, weight=weight)) +
scale_x_continuous(trans = "log2") +
geom_density(alpha=.2 , bw = .75, position = "stack") +
facet_grid(year ~ .)
## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

### Ecological Fallacy

Here, we focus on the importance of describing the variability within the groups. While we do this, we’ll also show you some other ggplot functions as well as a transformation called the logit transformation, which is useful for the data that we’ll be looking at.

As an example for this, we will focus on the relationship between country child survival rates and average income. We start by defining quantities across regions.

gapminder <- gapminder %>%
mutate(group = case_when(
region %in% west ~ "The West",
region %in% "Northern Africa" ~ "Northern Africa",
region %in% "Southern Asia" ~ "Southern Asia",
region %in% c("Eastern Asia", "South-Eastern Asia") ~ "East Asia",
region %in% c("Caribbean", "Central America", "South America") ~ "Latin America",
continent == "Africa" & region != "Northern Africa" ~ "Sub-Saharan Africa",
region %in% c("Melanesia", "Micronesia", "Polynesia") ~ "Pacific Islands"
))
surv_income <- gapminder %>%
filter(year %in% present_year,
!is.na(gdp),
!is.na(infant_mortality),
!is.na(group)) %>%
group_by(group) %>%
summarise(income =  sum(gdp)/sum(population)/365,
infant_survival_rate = 1 - sum(infant_mortality/1000 * population)/sum(population))

surv_income %>% arrange(income)
## # A tibble: 7 x 3
##   group              income infant_survival_rate
##   <chr>               <dbl>                <dbl>
## 1 Sub-Saharan Africa   1.76                0.936
## 2 Southern Asia        2.07                0.952
## 3 Pacific Islands      2.70                0.956
## 4 Northern Africa      4.94                0.970
## 5 Latin America       13.2                 0.983
## 6 East Asia           13.4                 0.985
## 7 The West            77.5                 0.995

While in the West less than 0.5% of children die, in Sub-Saharan Africa, the rate is higher than 6%. In fact, the relationship between these two variablesis almost perfectly linear.

surv_income %>% ggplot(aes(income, infant_survival_rate, label = group, color = group)) +
scale_x_continuous(trans = "log2", limits = c(0.25, 150) ) +
scale_y_continuous(trans = "logit", limits = c(0.875, 0.9981),
breaks = c(.85, .90, .95, .99, .995, .998) ) +
geom_label(size=3, show.legend = FALSE)

In this plot, we introduced the use of the limit argument, which lets us change the range of the axis. We would do it like this following this code. We are making the range larger than the data needs because we will later compare this plot we just saw to one with more variability. And we want the ranges to be the same. We also introduced the breaks argument, which lets us set the location of the axis labels. Finally, we introduce a new transformation, the logistic transformation. The logistic or logit transformation for a proportional rate $$p$$ is defined as follows. $$f(p)=\log\left(\frac{p}{1-p}\right)$$. When $$p$$ is a proportion or probability, the quantity that is being logged, $$\frac{p}{1-p}$$, is called the odds. And the case $$p$$ is the proportion of children that survive. The odds tells us how many more children are expected to survive than to die. The log transformation makes this quantity symmetric. If the rates are the same, then the log odds is 0. Fold increases or decreases turn into positive and negative increments respectively.

This scale is useful when we want to highlight differences that are near 0 or near 1. For survival rates, this is important because a survival rate of 90% is unacceptable while the survival rate of 99% is relatively good. We would much prefer a survival rate closer to 99.9%. We want our scale to highlight these differences and the logit does this. Note that 99.9 divided by 0.1 is about 10 times larger than 99 divided by 1, which is about 10 times larger than 90 divided by 10. By using the log, these fold changes turn into constant increases.

Based on the plot we showed earlier, do we conclude that a country with a low income is destined to have low survival rate? Do we conclude that all survival rates in Sub-Saharan Africa are all lower than in southern Asia, which in turn are lower than in the Pacific Islands and so on? Jumping to this conclusion based on the plot we showed, the plot that shows only the averages is referred to as the ecological fallacy. The almost perfect relationship between survival rates and income is only observed for the averages at the regional level. Once we show the data, we see a some what more complicated story. So here is the plot for the averages.

And look at what happens once we show you every individual country. Specifically, we see that there is a large amount of variability. We see that the countries from the same regions can be quite different. And that countries within the same income can have different survival rates. For example, while on average Sub-Saharan Africa had the worst health and economic outcomes, there is wide variability within that group. For example, note that Mauritius and Botswana are doing much better than Angola and Sierra Leone with Mauritius comparable to Western countries.

sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.1 LTS
##
## Matrix products: default
## BLAS: /home/michael/anaconda3/lib/R/lib/libRblas.so
## LAPACK: /home/michael/anaconda3/lib/R/lib/libRlapack.so
##
## locale:
## [1] en_CA.UTF-8
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
## [1] bindrcpp_0.2.2       ggplot2_3.0.0        dslabs_0.3.3
## [4] dplyr_0.7.6          RevoUtils_11.0.1     RevoUtilsMath_11.0.0
##
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.18     highr_0.7        pillar_1.3.0     compiler_3.5.1
##  [5] plyr_1.8.4       bindr_0.1.1      tools_3.5.1      digest_0.6.15
##  [9] evaluate_0.11    tibble_1.4.2     gtable_0.2.0     pkgconfig_2.0.1
## [13] rlang_0.2.1      cli_1.0.0        yaml_2.2.0       blogdown_0.9.8
## [17] xfun_0.4.11      withr_2.1.2      stringr_1.3.1    knitr_1.20
## [21] rprojroot_1.3-2  grid_3.5.1       tidyselect_0.2.4 glue_1.3.0
## [25] R6_2.2.2         fansi_0.2.3      rmarkdown_1.10   bookdown_0.7
## [29] reshape2_1.4.3   purrr_0.2.5      magrittr_1.5     backports_1.1.2
## [33] scales_0.5.0     codetools_0.2-15 htmltools_0.3.6  assertthat_0.2.0
## [37] colorspace_1.3-2 labeling_0.3     utf8_1.1.4       stringi_1.2.4
## [41] lazyeval_0.2.1   munsell_0.5.0    crayon_1.3.4

# References

Irizarry, Rafael A. 2017. Dslabs: Data Science Labs. https://CRAN.R-project.org/package=dslabs.

R Core Team. 2018. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2018. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.