Exploring the Kaggle Data Science Survey

Michael Taylor

2018/09/09

The data are respondents’ answers to multiple choice and ranking questions. The rows are not randomized, so each row holds all of a single respondent’s answers.

# Loading necessary packages
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.0.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.6
## ✔ tidyr   0.8.1     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
  1. Welcome to the world of data science

Throughout the world of data science, there are many languages and tools that can be used to complete a given task. While you are often able to use whichever tool you prefer, it is often important for analysts to work with similar platforms so that they can share their code with one another. Learning what professionals in the data science industry use while at work can help you gain a better understanding of things that you may be asked to do in the future.

In this project, we are going to find out what tools and languages professionals use in their day-to-day work. Our data comes from the Kaggle Data Science Survey, which includes responses from over 10,000 people who write code to analyze data in their daily work.

# Loading the data
responses <- read_csv("multipleChoiceResponses.csv",
                      col_names = TRUE,
                      col_types = cols(.default = col_character())) %>%
    # Keeping four columns, selected by position
    select(81, 14, 69, 80)
head(responses)
## # A tibble: 6 x 4
##   WorkToolsSelect    LanguageRecommen… EmployerIndustry WorkAlgorithmsSel…
##   <chr>              <chr>             <chr>            <chr>             
## 1 Amazon Web servic… F#                Internet-based   Neural Networks,R…
## 2 <NA>               Python            <NA>             <NA>              
## 3 <NA>               R                 <NA>             <NA>              
## 4 Amazon Machine Le… Python            Mix of fields    Bayesian Techniqu…
## 5 C/C++,Jupyter not… Python            Technology       Bayesian Techniqu…
## 6 Jupyter notebooks… Python            Academic         Bayesian Techniqu…
  2. Using multiple tools

Now that we’ve loaded in the survey results, we want to focus on the tools and languages that the survey respondents use at work.

# Printing the first respondent's recommended language
responses[1, 2]
## # A tibble: 1 x 1
##   LanguageRecommendationSelect
##   <chr>                       
## 1 F#
# Creating a new data frame called tools
tools <- responses

# Adding a new column, work_tools, by splitting WorkToolsSelect at the commas
# and unnesting the resulting list column into one row per tool
tools <- tools %>%
    mutate(work_tools = strsplit(WorkToolsSelect, ",")) %>%
    unnest(work_tools)

# Viewing the first 6 rows of tools
head(tools, 6)
## # A tibble: 6 x 5
##   WorkToolsSelect LanguageRecomme… EmployerIndustry WorkAlgorithmsS…
##   <chr>           <chr>            <chr>            <chr>           
## 1 Amazon Web ser… F#               Internet-based   Neural Networks…
## 2 Amazon Web ser… F#               Internet-based   Neural Networks…
## 3 Amazon Web ser… F#               Internet-based   Neural Networks…
## 4 <NA>            Python           <NA>             <NA>            
## 5 <NA>            R                <NA>             <NA>            
## 6 Amazon Machine… Python           Mix of fields    Bayesian Techni…
## # ... with 1 more variable: work_tools <chr>
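To make the reshaping step easier to follow, here is a minimal sketch on a tiny made-up table. The `toy` data frame and its `id` column are hypothetical and are not taken from the survey. It shows how one comma-separated string becomes one row per tool, and how a missing answer is carried through as a single `NA` row.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data, not survey rows
toy <- tibble(
  id = 1:3,
  WorkToolsSelect = c("Python,R,SQL", NA, "R")
)

# strsplit() turns each string into a character vector (a list column);
# unnest() then expands that list column to one row per element
toy_long <- toy %>%
  mutate(work_tools = strsplit(WorkToolsSelect, ",")) %>%
  unnest(work_tools)

toy_long
```

Respondent 1 now occupies three rows, which is why the first respondent is repeated in the output of `head(tools, 6)` above.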
  3. Counting users of each tool

Now that we’ve split apart all of the tools used by each respondent, we can figure out which tools are the most popular.

# Creating a new data frame
tool_count <- tools

# Grouping the data by work_tools, calculate the number of responses in each group
tool_count <- tool_count  %>% 
    group_by(work_tools)  %>% 
    summarise(n = n())

# Sorting tool_count so that the most popular tools are at the top
tool_count <- tool_count %>% arrange(desc(n))

# Printing the first 6 results; na.omit() drops the row that counts
# missing (NA) answers, which leaves 5 rows
head(tool_count, 6) %>% na.omit()
## # A tibble: 5 x 2
##   work_tools            n
##   <chr>             <int>
## 1 Python             6073
## 2 R                  4708
## 3 SQL                4261
## 4 Jupyter notebooks  3206
## 5 TensorFlow         2256
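The group, summarise and sort steps can also be written more compactly. The sketch below runs on a hypothetical `toy_tools` vector rather than the survey data, and uses `count()` with `sort = TRUE` as a shorthand for `group_by()`, `summarise(n = n())` and `arrange(desc(n))`. Filtering out `NA` before counting is an assumption about what is wanted here; it keeps the missing answers out of the ranking altogether.

```r
library(dplyr)

# Hypothetical toy data: one row per (respondent, tool) pair
toy_tools <- tibble(
  work_tools = c("Python", "R", "Python", NA, "SQL", "Python", "R")
)

# Dropping missing answers first, then counting and sorting in one call
toy_count <- toy_tools %>%
  filter(!is.na(work_tools)) %>%
  count(work_tools, sort = TRUE)

toy_count
```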
  4. Plotting the most popular tools

Let’s see how your favorite tools stack up against the rest.

# Creating a bar chart of the work_tools column,
# arranging the bars so that the tallest are on the far right
tool_count$work_tools <- factor(tool_count$work_tools,
                                levels = tool_count$work_tools[order(tool_count$n)])

# Dropping the row for missing answers and rotating the x-axis labels 90 degrees
ggplot(filter(tool_count, !is.na(work_tools)), aes(work_tools, n)) +
    geom_bar(stat = "identity") +
    theme(axis.text.x = element_text(angle = 90))
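The bar order in the chart comes entirely from the order of the factor levels. As a small illustration on made-up counts (the `toy_count` values below are hypothetical), base R's `reorder()` produces the same ascending level order without overwriting the column:

```r
library(ggplot2)

# Hypothetical counts, not survey results
toy_count <- data.frame(
  work_tools = c("SQL", "Python", "R"),
  n = c(1, 3, 2)
)

# reorder() sorts the levels by n, smallest first, so the tallest bar ends up on the right
ordered_levels <- levels(reorder(toy_count$work_tools, toy_count$n))
ordered_levels

p <- ggplot(toy_count, aes(x = reorder(work_tools, n), y = n)) +
  geom_col() +
  labs(x = "work_tools") +
  theme(axis.text.x = element_text(angle = 90))
```

Here `geom_col()` is equivalent to `geom_bar(stat = "identity")`.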

  5. The R vs Python debate

Within the field of data science, there is a lot of debate among professionals about whether R or Python should reign supreme. You can see from our last figure that R and Python are the two most commonly used languages, but it’s possible that many respondents use both R and Python. Let’s take a look at how many people use R, Python, and both tools.

# Creating a new data frame called debate_tools
debate_tools <- responses

# Creating a new column called language_preference.
# case_when() uses the first condition that matches, so "both" is tested first.
# The patterns match R or Python only as a whole comma-separated entry, so that
# tool names that merely contain the letter "r" are not counted as R.
debate_tools <- debate_tools %>%
    mutate(language_preference = case_when(
        grepl("(^|,)R(,|$)", WorkToolsSelect) &
            grepl("(^|,)Python(,|$)", WorkToolsSelect) ~ "both",
        grepl("(^|,)R(,|$)", WorkToolsSelect) ~ "R",
        grepl("(^|,)Python(,|$)", WorkToolsSelect) ~ "Python",
        TRUE ~ "neither")) %>%
    na.omit()
# Printing the first 6 rows
head(debate_tools, 6)
## # A tibble: 6 x 5
##   WorkToolsSelect LanguageRecomme… EmployerIndustry WorkAlgorithmsS…
##   <chr>           <chr>            <chr>            <chr>           
## 1 Amazon Web ser… F#               Internet-based   Neural Networks…
## 2 Amazon Machine… Python           Mix of fields    Bayesian Techni…
## 3 C/C++,Jupyter … Python           Technology       Bayesian Techni…
## 4 Jupyter notebo… Python           Academic         Bayesian Techni…
## 5 Jupyter notebo… Python           Internet-based   CNNs,Decision T…
## 6 Python,Spark /… Python           Mix of fields    Bayesian Techni…
## # ... with 1 more variable: language_preference <chr>
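The order of the conditions in `case_when()` matters, because each row takes the value of the first condition that is true. The sketch below illustrates this on four made-up tool lists. The `has_item()` helper is hypothetical, written only for this example, and it assumes that tools are separated by commas with no surrounding spaces.

```r
library(dplyr)

# Hypothetical helper: TRUE when `item` appears as a whole comma-separated entry
has_item <- function(x, item) {
  grepl(paste0("(^|,)", item, "(,|$)"), x)
}

# Hypothetical toy answers, not survey rows
toy <- tibble(
  WorkToolsSelect = c("Python,R,SQL", "Jupyter notebooks,Python", "R,SQL", "Tableau")
)

toy <- toy %>%
  mutate(
    uses_r = has_item(WorkToolsSelect, "R"),
    uses_python = has_item(WorkToolsSelect, "Python"),
    # The most specific condition goes first
    language_preference = case_when(
      uses_r & uses_python ~ "both",
      uses_r ~ "R",
      uses_python ~ "Python",
      TRUE ~ "neither"
    )
  )

toy$language_preference
```

The second row is the interesting one: "Jupyter notebooks" contains the letter r, so a bare, case-insensitive search for "R" would label that Python-only respondent as an R user.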
  6. Plotting R vs Python users

Now we just need to take a closer look at how many respondents use R, Python, and both!

# Creating a new data frame
debate_plot <- debate_tools

# Grouping by language preference and calculate number of responses
debate_plot <- debate_plot  %>% 
   group_by(language_preference)  %>% 
   summarise(n = n()) %>% 
# Removing the row for users of "neither"
   filter(language_preference != "neither")

# Creating a bar chart
ggplot(debate_plot, aes(language_preference, n)) +
    geom_bar(stat = "identity")

  7. Language recommendations

It looks like the largest group of professionals program in both Python and R. But what happens when they are asked which language they recommend to new learners? Do R lovers always recommend R?

# Creating a new data frame
recommendations <- debate_tools

# Grouping by language_preference and then LanguageRecommendationSelect
recommendations <- recommendations  %>% 
    group_by(language_preference, LanguageRecommendationSelect)  %>% 
    summarize(n = n() ) %>% 
# Removing empty responses and keeping the top 8 recommendations in each group
    filter(!is.na(LanguageRecommendationSelect) & (language_preference != 'neither') ) %>% 
    arrange(desc(n), .by_group = TRUE) %>% 
    mutate(id = row_number() ) %>% 
    filter( id <= 8 )

recommendations
## # A tibble: 16 x 4
## # Groups:   language_preference [2]
##    language_preference LanguageRecommendationSelect     n    id
##    <chr>               <chr>                        <int> <int>
##  1 Python              Python                         228     1
##  2 Python              C/C++/C#                        18     2
##  3 Python              Matlab                          13     3
##  4 Python              R                                8     4
##  5 Python              SQL                              4     5
##  6 Python              Java                             2     6
##  7 Python              Julia                            2     7
##  8 Python              Haskell                          1     8
##  9 R                   Python                        3463     1
## 10 R                   R                             1527     2
## 11 R                   SQL                            211     3
## 12 R                   C/C++/C#                        92     4
## 13 R                   Matlab                          77     5
## 14 R                   Scala                           63     6
## 15 R                   Java                            41     7
## 16 R                   Other                           39     8
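The `id` column is what limits the table to the top rows of each group: `row_number()` restarts at 1 inside every `language_preference` group once the rows are sorted by `n`. A minimal sketch on made-up counts shows the same pattern, keeping the top 2 instead of the top 8. The `toy` values are hypothetical.

```r
library(dplyr)

# Hypothetical counts, not survey results
toy <- tibble(
  language_preference = c("R", "R", "R", "Python", "Python", "Python"),
  LanguageRecommendationSelect = c("R", "Python", "SQL", "Python", "R", "SQL"),
  n = c(5L, 9L, 2L, 7L, 1L, 3L)
)

top2 <- toy %>%
  group_by(language_preference) %>%
  arrange(desc(n), .by_group = TRUE) %>%  # sorting within each group
  mutate(id = row_number()) %>%           # rank restarts at 1 for each group
  filter(id <= 2) %>%
  ungroup()

top2
```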
  8. The most recommended language by the language used

Just one thing left. Let’s graphically determine which languages are most recommended based on the language that a person uses.

# Creating a faceted bar plot, with one panel per recommended language
ggplot(recommendations, aes(x = language_preference, y = n)) +
    geom_bar(stat = "identity") +
    facet_wrap(~ factor(LanguageRecommendationSelect))

  9. The moral of the story

So we’ve made it to the end. We’ve found that Python is the most popular language used among Kaggle data scientists, but R users aren’t far behind. And while Python users may highly recommend that new learners learn Python, would R users find the following statement TRUE or FALSE?

# Would R users find this statement TRUE or FALSE?
R_is_number_one <- TRUE
sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.1 LTS
## 
## Matrix products: default
## BLAS: /home/michael/anaconda3/lib/R/lib/libRblas.so
## LAPACK: /home/michael/anaconda3/lib/R/lib/libRlapack.so
## 
## locale:
## [1] en_CA.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] bindrcpp_0.2.2       forcats_0.3.0        stringr_1.3.1       
##  [4] dplyr_0.7.6          purrr_0.2.5          readr_1.1.1         
##  [7] tidyr_0.8.1          tibble_1.4.2         ggplot2_3.0.0       
## [10] tidyverse_1.2.1      RevoUtils_11.0.1     RevoUtilsMath_11.0.0
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_0.2.4 xfun_0.4.11      haven_1.1.2      lattice_0.20-35 
##  [5] colorspace_1.3-2 htmltools_0.3.6  yaml_2.2.0       utf8_1.1.4      
##  [9] rlang_0.2.1      pillar_1.3.0     glue_1.3.0       withr_2.1.2     
## [13] modelr_0.1.2     readxl_1.1.0     bindr_0.1.1      plyr_1.8.4      
## [17] munsell_0.5.0    blogdown_0.9.8   gtable_0.2.0     cellranger_1.1.0
## [21] rvest_0.3.2      codetools_0.2-15 evaluate_0.11    labeling_0.3    
## [25] knitr_1.20       fansi_0.2.3      broom_0.5.0      Rcpp_0.12.18    
## [29] scales_0.5.0     backports_1.1.2  jsonlite_1.5     hms_0.4.2       
## [33] digest_0.6.15    stringi_1.2.4    bookdown_0.7     grid_3.5.1      
## [37] rprojroot_1.3-2  cli_1.0.0        tools_3.5.1      magrittr_1.5    
## [41] lazyeval_0.2.1   crayon_1.3.4     pkgconfig_2.0.1  xml2_1.2.0      
## [45] lubridate_1.7.4  assertthat_0.2.0 rmarkdown_1.10   httr_1.3.1      
## [49] rstudioapi_0.7   R6_2.2.2         nlme_3.1-137     compiler_3.5.1
