tidygraph and ggraph

Michael Taylor

2018/06/06

library(tidyverse)
library(tidygraph)
library(ggraph)
library(showtext)
knitr::opts_chunk$set(cache=TRUE)

[Thomas Lin Pedersen](https://www.data-imaginist.com/) has recently released the tidygraph and ggraph packages, which leverage the power of igraph in a manner consistent with the tidyverse workflow.

Network Analysis: Nodes and Edges

The two primary aspects of networks are a multitude of separate entities and the connections between them. The vocabulary can be a bit technical and even inconsistent between different disciplines, packages, and software. The entities are referred to as nodes or vertices of a graph, while the connections are edges or links. In this post I will mainly use the nomenclature of nodes and edges except when discussing packages that use different vocabulary.

The network analysis packages need data to be in a particular form to create the special type of object used by each package. The object classes for igraph and tidygraph are both based on adjacency matrices, also known as sociomatrices. An adjacency matrix is a square matrix in which the column and row names are the nodes of the network. Within the matrix a 1 indicates that there is a connection between the nodes, and a 0 indicates no connection. Adjacency matrices implement a very different data structure than data frames and do not fit within the tidyverse workflow. Helpfully, the specialized network objects can also be created from an edge-list data frame, which does fit in the tidyverse workflow. In this post I will stick to the data analysis techniques of the tidyverse to create edge lists, which will then be converted to the specific object classes for igraph and tidygraph.

An edge list is a data frame that contains a minimum of two columns, one column of nodes that are the source of a connection and another column of nodes that are the target of the connection. The nodes in the data are identified by unique IDs. If the distinction between source and target is meaningful, the network is directed. If the distinction is not meaningful, the network is undirected. With the example of letters sent between cities, the distinction between source and target is clearly meaningful, and so the network is directed. For the examples below, I will name the source column “from” and the target column “to”. I will use integers beginning with one as node IDs. An edge list can also contain additional columns that describe attributes of the edges, such as a magnitude for each edge. If the edges have a magnitude attribute, the graph is considered weighted.

Edge lists contain all of the information necessary to create network objects, but sometimes it is preferable to also create a separate node list. At its simplest, a node list is a data frame with a single column — which I will label as “id” — that lists the node IDs found in the edge list. The advantage of creating a separate node list is the ability to add attribute columns to the data frame such as the names of the nodes or any kind of groupings. Below I give an example of minimal edge and node lists created with the tibble() function.

edge_list <- tibble(from = c(1, 2, 2, 3, 4), to = c(2, 3, 4, 2, 1))
node_list <- tibble(id = 1:4)
edge_list
## # A tibble: 5 x 2
##    from    to
##   <dbl> <dbl>
## 1     1     2
## 2     2     3
## 3     2     4
## 4     3     2
## 5     4     1
node_list
## # A tibble: 4 x 1
##      id
##   <int>
## 1     1
## 2     2
## 3     3
## 4     4
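To connect the edge list back to the adjacency matrices described earlier, here is a minimal base R sketch that builds the equivalent matrix from the same five edges. The object names `edge_list_demo` and `adj` are only illustrative, and this is not how igraph or tidygraph build their objects internally.

```r
# Same five directed edges as in the minimal edge list above
edge_list_demo <- data.frame(from = c(1, 2, 2, 3, 4),
                             to   = c(2, 3, 4, 2, 1))
n_nodes <- 4

# Square matrix of zeros with node IDs as row and column names
adj <- matrix(0L, nrow = n_nodes, ncol = n_nodes,
              dimnames = list(as.character(1:n_nodes),
                              as.character(1:n_nodes)))

# A two-column index matrix sets one cell per edge: row = from, column = to
adj[cbind(edge_list_demo$from, edge_list_demo$to)] <- 1L

adj
#   1 2 3 4
# 1 0 1 0 0
# 2 0 0 1 1
# 3 0 1 0 0
# 4 1 0 0 0
```

Because the network is directed, the matrix is not symmetric: there is an edge from node 1 to node 2, but none from node 2 back to node 1.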

Creating edge and node lists

To create network objects from the database of letters received by Daniel van der Meulen in 1585 I will make both an edge list and a node list. This will necessitate the use of the dplyr package to manipulate the data frame of letters sent to Daniel and split it into two data frames or tibbles with the structure of edge and node lists. In this case, the nodes will be the cities from which Daniel’s correspondents sent him letters and the cities in which he received them. The node list will contain a “label” column, containing the names of the cities. The edge list will also have an attribute column that will show the number of letters sent between each pair of cities.

The first step is to load the tidyverse library to import and manipulate the data. Printing out the letters data frame shows that it contains four columns: “writer”, “source”, “destination”, and “date”. In this example, we will only deal with the “source” and “destination” columns.

(letters <- read_csv("correspondence-data-1585.csv"))
## # A tibble: 114 x 4
##    writer                  source  destination date      
##    <chr>                   <chr>   <chr>       <date>    
##  1 Meulen, Andries van der Antwerp Delft       1585-01-03
##  2 Meulen, Andries van der Antwerp Haarlem     1585-01-09
##  3 Meulen, Andries van der Antwerp Haarlem     1585-01-11
##  4 Meulen, Andries van der Antwerp Delft       1585-01-12
##  5 Meulen, Andries van der Antwerp Haarlem     1585-01-12
##  6 Meulen, Andries van der Antwerp Delft       1585-01-17
##  7 Meulen, Andries van der Antwerp Delft       1585-01-22
##  8 Meulen, Andries van der Antwerp Delft       1585-01-23
##  9 Della Faille, Marten    Antwerp Haarlem     1585-01-24
## 10 Meulen, Andries van der Antwerp Delft       1585-01-28
## # ... with 104 more rows

Node list

We want to get the distinct cities from both the “source” and “destination” columns and then join the information from these columns together. In the example below, I choose to have the name for the columns with the city names be the same for both the sources and destinations data frames to simplify the full_join() function. I rename the column with the city names as “label” to adopt the vocabulary used by network analysis packages.

sources <- letters %>%
  distinct(source) %>% 
  rename(label = source)

sources
## # A tibble: 9 x 1
##   label    
##   <chr>    
## 1 Antwerp  
## 2 Haarlem  
## 3 Dordrecht
## 4 Venice   
## 5 Lisse    
## 6 Het Vlie 
## 7 Hamburg  
## 8 Emden    
## 9 Amsterdam
destinations <- letters %>%
  distinct(destination) %>%
  rename(label = destination)

destinations
## # A tibble: 5 x 1
##   label     
##   <chr>     
## 1 Delft     
## 2 Haarlem   
## 3 The Hague 
## 4 Middelburg
## 5 Bremen

To create a single data frame with a column of the unique locations we need to use a [full join](http://r4ds.had.co.nz/relational-data.html#outer-join), because we want to include all unique places from both the sources of the letters and the destinations.

nodes <- full_join(sources, destinations, by = "label")
nodes
## # A tibble: 13 x 1
##    label     
##    <chr>     
##  1 Antwerp   
##  2 Haarlem   
##  3 Dordrecht 
##  4 Venice    
##  5 Lisse     
##  6 Het Vlie  
##  7 Hamburg   
##  8 Emden     
##  9 Amsterdam 
## 10 Delft     
## 11 The Hague 
## 12 Middelburg
## 13 Bremen

This results in a data frame with one variable. However, the variable contained in the data frame is not really what we are looking for. The “label” column contains the names of the nodes, but we also want to have unique IDs for each city. We can do this by adding an “id” column to the nodes data frame that contains numbers from one to whatever the total number of rows in the data frame is. A helpful function for this workflow is rowid_to_column(), which adds a column with the values from the row ids and places the column at the start of the data frame. Note that rowid_to_column() is a pipeable command, and so it is possible to do the full_join() and add the “id” column in a single command. The result is a nodes list with an ID column and a label attribute.

nodes <- nodes %>% rowid_to_column("id")
nodes
## # A tibble: 13 x 2
##       id label     
##    <int> <chr>     
##  1     1 Antwerp   
##  2     2 Haarlem   
##  3     3 Dordrecht 
##  4     4 Venice    
##  5     5 Lisse     
##  6     6 Het Vlie  
##  7     7 Hamburg   
##  8     8 Emden     
##  9     9 Amsterdam 
## 10    10 Delft     
## 11    11 The Hague 
## 12    12 Middelburg
## 13    13 Bremen
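As noted above, the join and the ID column can be combined in a single pipeline. The following is a minimal sketch of that variant. The `sources_demo` and `destinations_demo` tibbles are small made-up stand-ins for `sources` and `destinations`, so the block runs without the CSV file.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data standing in for the sources and destinations tibbles
sources_demo      <- tibble(label = c("Antwerp", "Haarlem", "Venice"))
destinations_demo <- tibble(label = c("Delft", "Haarlem"))

# Join the two sets of city names and number the rows in one command
nodes_demo <- full_join(sources_demo, destinations_demo, by = "label") %>%
  rowid_to_column("id")

nodes_demo
```

Haarlem appears in both inputs but only once in the result, and the IDs run from one to the number of distinct cities.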

Edge list

Creating an edge list is similar to the above, but it is complicated by the need to deal with two ID columns instead of one. We also want to create a weight column that records the number of letters sent between each pair of nodes. To accomplish this I will use the same group_by() and summarise() workflow that I have discussed in previous posts. The difference here is that we want to group the data frame by two columns, “source” and “destination”, instead of just one. Adopting the nomenclature of network analysis, I have named the column that counts the number of observations per group “weight”. The final command in the pipeline removes the grouping instituted by the group_by() function. This makes it easier to manipulate the resulting per_route data frame unhindered.

per_route <- letters %>%  
  group_by(source, destination) %>%
  summarise(weight = n()) %>% 
  ungroup()

per_route
## # A tibble: 15 x 3
##    source    destination weight
##    <chr>     <chr>        <int>
##  1 Amsterdam Bremen           1
##  2 Antwerp   Delft           68
##  3 Antwerp   Haarlem          5
##  4 Antwerp   Middelburg       1
##  5 Antwerp   The Hague        2
##  6 Dordrecht Haarlem          1
##  7 Emden     Bremen           1
##  8 Haarlem   Bremen           2
##  9 Haarlem   Delft           26
## 10 Haarlem   Middelburg       1
## 11 Haarlem   The Hague        1
## 12 Hamburg   Bremen           1
## 13 Het Vlie  Bremen           1
## 14 Lisse     Delft            1
## 15 Venice    Haarlem          2
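The group_by(), summarise(), and ungroup() sequence can also be written more compactly with dplyr's count(). This is a sketch under two assumptions: the `name` argument requires a reasonably recent dplyr (I believe 0.8.0 or later, which is newer than the version used to build this post), and `letters_demo` is a small invented stand-in for the letters data.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data with the same two columns used above
letters_demo <- tibble(
  source      = c("Antwerp", "Antwerp", "Haarlem", "Antwerp"),
  destination = c("Delft",   "Delft",   "Delft",   "Haarlem")
)

# count() groups by the given columns, tallies the rows, and returns
# an ungrouped tibble; `name` sets the name of the tally column
per_route_demo <- letters_demo %>%
  count(source, destination, name = "weight")

per_route_demo
```

The result has one row per distinct source and destination pair, with the number of letters in “weight”.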

Like the node list, per_route now has the basic form that we want, but we again have the problem that the “source” and “destination” columns contain labels rather than IDs. What we need to do is link the IDs that have been assigned in nodes to each location in both the “source” and “destination” columns. This can be accomplished with another join function. In fact, it is necessary to perform two joins, one for the “source” column and one for “destination.” In this case, I will use a left_join() with per_route as the left data frame, because we want to maintain the number of rows in per_route. While doing the left_join, we also want to rename the two “id” columns that are brought over from nodes. For the join using the “source” column I will rename the column as “from”. The column brought over from the “destination” join is renamed “to”. It would be possible to do both joins in a single command with the use of the pipe. However, for clarity, I will perform the joins in two separate commands. Because the join is done across two commands, notice that the data frame at the beginning of the pipeline changes from per_route to edges, which is created by the first command.

edges <- per_route %>% 
  left_join(nodes, by = c("source" = "label")) %>% 
  rename(from = id)

edges
## # A tibble: 15 x 4
##    source    destination weight  from
##    <chr>     <chr>        <int> <int>
##  1 Amsterdam Bremen           1     9
##  2 Antwerp   Delft           68     1
##  3 Antwerp   Haarlem          5     1
##  4 Antwerp   Middelburg       1     1
##  5 Antwerp   The Hague        2     1
##  6 Dordrecht Haarlem          1     3
##  7 Emden     Bremen           1     8
##  8 Haarlem   Bremen           2     2
##  9 Haarlem   Delft           26     2
## 10 Haarlem   Middelburg       1     2
## 11 Haarlem   The Hague        1     2
## 12 Hamburg   Bremen           1     7
## 13 Het Vlie  Bremen           1     6
## 14 Lisse     Delft            1     5
## 15 Venice    Haarlem          2     4
edges <- edges %>% 
  left_join(nodes, by = c("destination" = "label")) %>% 
  rename(to = id)

edges
## # A tibble: 15 x 5
##    source    destination weight  from    to
##    <chr>     <chr>        <int> <int> <int>
##  1 Amsterdam Bremen           1     9    13
##  2 Antwerp   Delft           68     1    10
##  3 Antwerp   Haarlem          5     1     2
##  4 Antwerp   Middelburg       1     1    12
##  5 Antwerp   The Hague        2     1    11
##  6 Dordrecht Haarlem          1     3     2
##  7 Emden     Bremen           1     8    13
##  8 Haarlem   Bremen           2     2    13
##  9 Haarlem   Delft           26     2    10
## 10 Haarlem   Middelburg       1     2    12
## 11 Haarlem   The Hague        1     2    11
## 12 Hamburg   Bremen           1     7    13
## 13 Het Vlie  Bremen           1     6    13
## 14 Lisse     Delft            1     5    10
## 15 Venice    Haarlem          2     4     2

Now that edges has “from” and “to” columns with node IDs, we need to reorder the columns to bring “from” and “to” to the left of the data frame. Currently, the edges data frame still contains the “source” and “destination” columns with the names of the cities that correspond with the IDs. However, this data is superfluous, since it is already present in nodes. Therefore, I will only include the “from”, “to”, and “weight” columns in the select() function.

edges <- select(edges, from, to, weight)
edges
## # A tibble: 15 x 3
##     from    to weight
##    <int> <int>  <int>
##  1     9    13      1
##  2     1    10     68
##  3     1     2      5
##  4     1    12      1
##  5     1    11      2
##  6     3     2      1
##  7     8    13      1
##  8     2    13      2
##  9     2    10     26
## 10     2    12      1
## 11     2    11      1
## 12     7    13      1
## 13     6    13      1
## 14     5    10      1
## 15     4     2      2
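For reference, the two joins and the final select() can be chained in a single pipeline, as mentioned earlier. Here is a minimal sketch using small made-up `nodes_demo` and `per_route_demo` tibbles rather than the real data.

```r
library(dplyr)
library(tibble)

# Hypothetical toy node list and per-route counts
nodes_demo <- tibble(id = 1:3, label = c("Antwerp", "Haarlem", "Delft"))
per_route_demo <- tibble(
  source      = c("Antwerp", "Haarlem"),
  destination = c("Delft",   "Delft"),
  weight      = c(68L, 26L)
)

edges_demo <- per_route_demo %>%
  # First join: look up the ID of the source city
  left_join(nodes_demo, by = c("source" = "label")) %>%
  rename(from = id) %>%
  # Second join: look up the ID of the destination city
  left_join(nodes_demo, by = c("destination" = "label")) %>%
  rename(to = id) %>%
  # Keep only the ID columns and the weight
  select(from, to, weight)

edges_demo
```

Renaming “id” to “from” before the second join matters here: otherwise the second join would bring in another “id” column that clashes with the first.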

The edges data frame does not look very impressive; it is three columns of integers. However, edges combined with nodes provides us with all of the information necessary to create network objects with the igraph and tidygraph packages.

tidygraph and ggraph

The tidygraph and ggraph packages are newcomers to the network analysis landscape. tidygraph and ggraph represent an attempt to bring network analysis into the tidyverse workflow. tidygraph provides a way to create a network object that more closely resembles a tibble or data frame. This makes it possible to use many of the dplyr functions to manipulate network data. ggraph gives a way to plot network graphs using the conventions and power of ggplot2. In other words, tidygraph and ggraph allow you to deal with network objects in a manner that is more consistent with the commands used for working with tibbles and data frames. However, the true promise of tidygraph and ggraph is that they leverage the power of igraph. This means that you sacrifice few of the network analysis capabilities of igraph by using tidygraph and ggraph.

First, let’s create a network object using tidygraph, which is called a tbl_graph. A tbl_graph consists of two tibbles: an edges tibble and a nodes tibble.

routes_tidy <- tbl_graph(nodes = nodes, 
                         edges = edges, 
                         directed = TRUE)

Now that we have created a tbl_graph object, let’s inspect it with the class() function.

class(routes_tidy)
## [1] "tbl_graph" "igraph"

Conveniently, the tbl_graph object class is a wrapper around an igraph object, meaning that at its basis a tbl_graph object is essentially an igraph object.
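One practical consequence is that igraph functions can be called directly on a tbl_graph. The sketch below illustrates this with a small made-up graph (`g_demo` and its toy node and edge tibbles are hypothetical, not the post's data).

```r
library(tidygraph)
library(tibble)

# Hypothetical three-city network
nodes_demo <- tibble(id = 1:3, label = c("Antwerp", "Haarlem", "Delft"))
edges_demo <- tibble(from   = c(1L, 2L, 1L),
                     to     = c(3L, 3L, 2L),
                     weight = c(68L, 26L, 5L))

g_demo <- tbl_graph(nodes = nodes_demo, edges = edges_demo, directed = TRUE)

# Plain igraph functions accept the tbl_graph as-is
igraph::vcount(g_demo)       # number of nodes: 3
igraph::ecount(g_demo)       # number of edges: 3
igraph::is_directed(g_demo)  # TRUE

# Number of incoming edges per node, in node order
in_deg <- igraph::degree(g_demo, mode = "in")
in_deg
```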

Printing out a tbl_graph object to the console results in an output similar to that of a normal tibble.

routes_tidy
## # A tbl_graph: 13 nodes and 15 edges
## #
## # A directed acyclic simple graph with 1 component
## #
## # Node Data: 13 x 2 (active)
##      id label    
##   <int> <chr>    
## 1     1 Antwerp  
## 2     2 Haarlem  
## 3     3 Dordrecht
## 4     4 Venice   
## 5     5 Lisse    
## 6     6 Het Vlie 
## # ... with 7 more rows
## #
## # Edge Data: 15 x 3
##    from    to weight
##   <int> <int>  <int>
## 1     9    13      1
## 2     1    10     68
## 3     1     2      5
## # ... with 12 more rows

Printing routes_tidy shows that it is a tbl_graph object with 13 nodes and 15 edges. The command also prints the first six rows of “Node Data” and the first three of “Edge Data”. Notice too that it states that the Node Data is active. The notion of an active tibble within a tbl_graph object makes it possible to manipulate the data in one tibble at a time. The nodes tibble is activated by default, but you can change which tibble is active with the activate() function. Thus, if I wanted to rearrange the rows in the edges tibble to list those with the highest “weight” first, I could use activate() and then arrange(). Here I simply print out the result rather than saving it.

routes_tidy %>% 
  activate(edges) %>% 
  arrange(desc(weight))
## # A tbl_graph: 13 nodes and 15 edges
## #
## # A directed acyclic simple graph with 1 component
## #
## # Edge Data: 15 x 3 (active)
##    from    to weight
##   <int> <int>  <int>
## 1     1    10     68
## 2     2    10     26
## 3     1     2      5
## 4     1    11      2
## 5     2    13      2
## 6     4     2      2
## # ... with 9 more rows
## #
## # Node Data: 13 x 2
##      id label    
##   <int> <chr>    
## 1     1 Antwerp  
## 2     2 Haarlem  
## 3     3 Dordrecht
## # ... with 10 more rows
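The same activate() pattern works on the nodes tibble. As a hedged illustration, the sketch below adds an in-degree column to the nodes with mutate() and tidygraph's centrality_degree(), then pulls the node data out as an ordinary tibble. The toy graph and the column name `in_degree` are invented for the example.

```r
library(tidygraph)
library(dplyr)
library(tibble)

# Hypothetical three-city network
nodes_demo <- tibble(id = 1:3, label = c("Antwerp", "Haarlem", "Delft"))
edges_demo <- tibble(from   = c(1L, 2L, 1L),
                     to     = c(3L, 3L, 2L),
                     weight = c(68L, 26L, 5L))
g_demo <- tbl_graph(nodes = nodes_demo, edges = edges_demo, directed = TRUE)

node_summary <- g_demo %>%
  activate(nodes) %>%
  # Count the routes arriving at each city
  mutate(in_degree = centrality_degree(mode = "in")) %>%
  # as_tibble() returns the active (node) data as a regular tibble
  as_tibble()

node_summary
```

In this toy network Delft receives letters along two routes, Haarlem along one, and Antwerp along none.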

Since we do not need to further manipulate routes_tidy, we can plot the graph with ggraph. Like ggmap, ggraph is an extension of ggplot2, making it easier to carry over basic ggplot skills to the creation of network plots. As in all network graphs, there are three main aspects to a ggraph plot: nodes, edges, and layouts. The vignettes for the ggraph package helpfully cover the fundamental aspects of ggraph plots. ggraph adds special geoms to the basic set of ggplot geoms that are specifically designed for networks. Thus, there is a set of geom_node and geom_edge geoms. The basic plotting function is ggraph(), which takes the data to be used for the graph and the type of layout desired. Both of the arguments for ggraph() are built around igraph. Therefore, ggraph() can use either an igraph object or a tbl_graph object. In addition, the available layout algorithms primarily derive from igraph. Lastly, ggraph introduces a special ggplot theme that provides better defaults for network graphs than the normal ggplot defaults. The ggraph theme can be set for a series of plots with the set_graph_style() command run before the graphs are plotted or by using theme_graph() in the individual plots. Here, I will use the latter method.

Let’s see what a basic ggraph plot looks like. The plot begins with ggraph() and the data. I then add basic edge and node geoms. No arguments are necessary within the edge and node geoms, because they take the information from the data provided in ggraph().

ggraph(routes_tidy) + 
  geom_edge_link() + 
  geom_node_point() + 
  theme_graph()
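For comparison, the set_graph_style() approach mentioned earlier might look like the sketch below. The toy graph is made up, and I pass family = "sans" only as an assumption to sidestep missing-font warnings on machines without the default font.

```r
library(tidygraph)
library(ggraph)
library(tibble)

# Hypothetical three-city network
nodes_demo <- tibble(id = 1:3, label = c("Antwerp", "Haarlem", "Delft"))
edges_demo <- tibble(from = c(1L, 2L, 1L), to = c(3L, 3L, 2L))
g_demo <- tbl_graph(nodes = nodes_demo, edges = edges_demo, directed = TRUE)

# Set the ggraph theme once for all subsequent plots in the session
set_graph_style(family = "sans")

# No theme_graph() call is needed in the individual plot
p_styled <- ggraph(g_demo, layout = "graphopt") +
  geom_edge_link() +
  geom_node_point()
```

Note that this changes the default theme for every plot drawn afterwards, which is why using theme_graph() per plot can be the safer choice in a document that mixes network graphs with ordinary ggplot2 charts.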

As you can see, the structure of the command is similar to that of ggplot, with the separate layers added with the + sign. The basic ggraph plot looks similar to those of network and igraph, if not even plainer, but we can use commands similar to those of ggplot to create a more informative graph. We can show the “weight” of the edges (the number of letters sent along each route) by using width in the geom_edge_link() function. To get the width of the line to change according to the weight variable, we place the argument within an aes() function. In order to control the maximum and minimum width of the edges, I use scale_edge_width() and set a range. I choose a relatively small width for the minimum, because there is a significant difference between the maximum and minimum number of letters sent along the routes. We can also label the nodes with the names of the locations since there are relatively few nodes. Conveniently, geom_node_text() comes with a repel argument that ensures that the labels do not overlap with the nodes in a manner similar to the ggrepel package. I add a bit of transparency to the edges with the alpha argument. I also use labs() to relabel the legend “Letters”.

ggraph(routes_tidy, layout = "graphopt") + 
  geom_node_point() +
  geom_edge_link(aes(width = weight), alpha = 0.8) + 
  scale_edge_width(range = c(0.2, 2)) +
  geom_node_text(aes(label = label), repel = TRUE) +
  labs(edge_width = "Letters") +
  theme_graph()

In addition to the layout choices provided by igraph, ggraph also implements its own layouts. For example, you can use ggraph’s concept of circularity to create arc diagrams. Here, I lay out the nodes in a horizontal line and have the edges drawn as arcs. Unlike the previous plot, this graph indicates the directionality of the edges. The edges above the horizontal line move from left to right, while the edges below the line move from right to left. Instead of adding points for the nodes, I just include the label names. I use the same width aesthetic to denote the difference in the weight of each edge. Note that I again use the tbl_graph object as the data for the graph; an igraph object would work just as well.

ggraph(routes_tidy, layout = "linear") + 
  geom_edge_arc(aes(width = weight), alpha = 0.8) + 
  scale_edge_width(range = c(0.2, 2)) +
  geom_node_text(aes(label = label), repel = TRUE) +
  labs(edge_width = "Letters") +
  theme_graph(base_family = "Roboto Mono")
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database
## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x
## $y, : font family not found in Windows font database

## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x
## $y, : font family not found in Windows font database

   # Note default base_family is "Arial Narrow"
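The circularity mentioned above can be sketched by setting circular = TRUE on the linear layout, which places the nodes around a circle instead of along a line. This is a minimal example on a made-up four-city graph. I use base_family = "sans" as an assumption to avoid the missing-font warnings, and coord_fixed() to keep the circle from being stretched.

```r
library(tidygraph)
library(ggraph)
library(tibble)

# Hypothetical four-city network
nodes_demo <- tibble(id = 1:4,
                     label = c("Antwerp", "Haarlem", "Delft", "Bremen"))
edges_demo <- tibble(from   = c(1L, 2L, 1L, 2L),
                     to     = c(3L, 3L, 2L, 4L),
                     weight = c(68L, 26L, 5L, 2L))
routes_demo <- tbl_graph(nodes = nodes_demo, edges = edges_demo,
                         directed = TRUE)

# Same geoms as the arc diagram, but with the nodes laid out on a circle
p_circular <- ggraph(routes_demo, layout = "linear", circular = TRUE) +
  geom_edge_arc(aes(width = weight), alpha = 0.8) +
  scale_edge_width(range = c(0.2, 2)) +
  geom_node_text(aes(label = label)) +
  labs(edge_width = "Letters") +
  coord_fixed() +
  theme_graph(base_family = "sans")
```

Printing `p_circular` draws the plot.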
sessionInfo()
## R version 3.4.4 (2018-03-15)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_Canada.1252  LC_CTYPE=English_Canada.1252   
## [3] LC_MONETARY=English_Canada.1252 LC_NUMERIC=C                   
## [5] LC_TIME=English_Canada.1252    
## 
## attached base packages:
## [1] methods   stats     graphics  grDevices utils     datasets  base     
## 
## other attached packages:
##  [1] bindrcpp_0.2.2  showtext_0.5-1  showtextdb_2.0  sysfonts_0.7.2 
##  [5] ggraph_1.0.1    tidygraph_1.1.0 forcats_0.3.0   stringr_1.3.1  
##  [9] dplyr_0.7.5     purrr_0.2.5     readr_1.1.1     tidyr_0.8.1    
## [13] tibble_1.4.2    ggplot2_2.2.1   tidyverse_1.2.1
## 
## loaded via a namespace (and not attached):
##  [1] ggrepel_0.8.0     Rcpp_0.12.17      lubridate_1.7.4  
##  [4] lattice_0.20-35   utf8_1.1.4        assertthat_0.2.0 
##  [7] rprojroot_1.3-2   digest_0.6.15     psych_1.8.4      
## [10] ggforce_0.1.2     R6_2.2.2          cellranger_1.1.0 
## [13] plyr_1.8.4        backports_1.1.2   evaluate_0.10.1  
## [16] httr_1.3.1        blogdown_0.6      pillar_1.2.3     
## [19] rlang_0.2.1       lazyeval_0.2.1    readxl_1.1.0     
## [22] rstudioapi_0.7    rmarkdown_1.9     labeling_0.3     
## [25] udunits2_0.13     foreign_0.8-70    igraph_1.2.1     
## [28] munsell_0.4.3     broom_0.4.4       compiler_3.4.4   
## [31] modelr_0.1.2      xfun_0.1          pkgconfig_2.0.1  
## [34] mnormt_1.5-5      htmltools_0.3.6   tidyselect_0.2.4 
## [37] gridExtra_2.3     bookdown_0.7      codetools_0.2-15 
## [40] viridisLite_0.3.0 crayon_1.3.4      MASS_7.3-50      
## [43] grid_3.4.4        nlme_3.1-137      jsonlite_1.5     
## [46] gtable_0.2.0      magrittr_1.5      units_0.5-1      
## [49] scales_0.5.0      cli_1.0.0         stringi_1.1.7    
## [52] reshape2_1.4.3    viridis_0.5.1     xml2_1.2.0       
## [55] tools_3.4.4       glue_1.2.0        tweenr_0.1.5     
## [58] hms_0.4.2         parallel_3.4.4    yaml_2.1.19      
## [61] colorspace_1.3-2  rvest_0.3.2       knitr_1.20       
## [64] bindr_0.1.1       haven_1.1.1
## Adding cites for R packages using knitr
knitr::write_bib(.packages(), "packages.bib")

References

Pedersen, Thomas Lin. 2018a. Ggraph: An Implementation of Grammar of Graphics for Graphs and Networks. https://CRAN.R-project.org/package=ggraph.

———. 2018b. Tidygraph: A Tidy Api for Graph Manipulation. https://CRAN.R-project.org/package=tidygraph.

Qiu, Yixuan, and authors/contributors of the included software. See file AUTHORS for details. 2018. Showtext: Using Fonts More Easily in R Graphs. https://CRAN.R-project.org/package=showtext.

R Core Team. 2018. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Wickham, Hadley. 2017. Tidyverse: Easily Install and Load the ’Tidyverse’. https://CRAN.R-project.org/package=tidyverse.