Exploratory Data Analysis

Michael Taylor

2018/04/12

Exploring Categorical Data

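# Packages used throughout: dplyr (glimpse, filter, %>%), ggplot2, gridExtra (grid.arrange)
library(dplyr)
library(ggplot2)
library(gridExtra)
url <- "https://assets.datacamp.com/production/course_1796/datasets/comics.csv"
filename <- basename(url)
if (!file.exists(filename)) download.file(url, destfile = filename)
comics <- read.csv(filename, stringsAsFactors = TRUE)  # read strings as factors, as in the output below
glimpse(comics)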
## Observations: 23,272
## Variables: 11
## $ name         <fct> Spider-Man (Peter Parker), Captain America (Steve...
## $ id           <fct> Secret, Public, Public, Public, No Dual, Public, ...
## $ align        <fct> Good, Good, Neutral, Good, Good, Good, Good, Good...
## $ eye          <fct> Hazel Eyes, Blue Eyes, Blue Eyes, Blue Eyes, Blue...
## $ hair         <fct> Brown Hair, White Hair, Black Hair, Black Hair, B...
## $ gender       <fct> Male, Male, Male, Male, Male, Male, Male, Male, M...
## $ gsm          <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ alive        <fct> Living Characters, Living Characters, Living Char...
## $ appearances  <int> 4043, 3360, 3061, 2961, 2258, 2255, 2072, 2017, 1...
## $ first_appear <fct> Aug-62, Mar-41, Oct-74, Mar-63, Nov-50, Nov-61, N...
## $ publisher    <fct> marvel, marvel, marvel, marvel, marvel, marvel, m...

Contingency table

Let’s start by creating a contingency table, which is a useful way to represent the total counts of observations that fall into each combination of the levels of categorical variables.

# Print the first rows of the data
head(comics)
##                                    name      id   align        eye
## 1             Spider-Man (Peter Parker)  Secret    Good Hazel Eyes
## 2       Captain America (Steven Rogers)  Public    Good  Blue Eyes
## 3 Wolverine (James \\"Logan\\" Howlett)  Public Neutral  Blue Eyes
## 4   Iron Man (Anthony \\"Tony\\" Stark)  Public    Good  Blue Eyes
## 5                   Thor (Thor Odinson) No Dual    Good  Blue Eyes
## 6            Benjamin Grimm (Earth-616)  Public    Good  Blue Eyes
##         hair gender  gsm             alive appearances first_appear
## 1 Brown Hair   Male <NA> Living Characters        4043       Aug-62
## 2 White Hair   Male <NA> Living Characters        3360       Mar-41
## 3 Black Hair   Male <NA> Living Characters        3061       Oct-74
## 4 Black Hair   Male <NA> Living Characters        2961       Mar-63
## 5 Blond Hair   Male <NA> Living Characters        2258       Nov-50
## 6    No Hair   Male <NA> Living Characters        2255       Nov-61
##   publisher
## 1    marvel
## 2    marvel
## 3    marvel
## 4    marvel
## 5    marvel
## 6    marvel
# Check levels of align
levels(comics$align)
## [1] "Bad"                "Good"               "Neutral"           
## [4] "Reformed Criminals"
# Check the levels of gender
levels(comics$gender)
## [1] "Female" "Male"   "Other"
# Create a 2-way contingency table
(tab <- table(comics$align, comics$gender) )
##                     
##                      Female Male Other
##   Bad                  1573 7561    32
##   Good                 2490 4809    17
##   Neutral               836 1799    17
##   Reformed Criminals      1    2     0
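
The counts in the table above add up to 19,137, fewer than the 23,272 rows in comics, because table() leaves out rows with a missing value in either variable by default. The following is a minimal sketch on a made-up data frame (toy and its values are hypothetical, not part of comics) showing how useNA = "ifany" makes the missing entries visible:

```r
# Toy data (hypothetical), with one missing value in each column
toy <- data.frame(
  align  = factor(c("Good", "Bad", "Good", NA, "Bad")),
  gender = factor(c("Male", "Female", NA, "Male", "Male"))
)

# Default: rows with an NA in either variable are left out
tab_default <- table(toy$align, toy$gender)

# useNA = "ifany" adds an NA row/column when missing values are present
tab_with_na <- table(toy$align, toy$gender, useNA = "ifany")

sum(tab_default)   # counts only the complete rows
sum(tab_with_na)   # counts every row
```

Comparing the two sums with nrow() is a quick way to see how many observations a contingency table is silently ignoring.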

Dropping levels

The contingency table from the last exercise revealed that there are some levels that have very low counts. To simplify the analysis, it often helps to drop such levels.

In R, this requires two steps: first filter out any rows with the levels that have very low counts, then remove those levels from the factor variable with droplevels(). Both steps are needed because droplevels() keeps any level that still has observations, even just 1 or 2; it only drops levels that no longer occur in the data set.

  • Use filter() to filter out all rows of comics with the Reformed Criminals level, then drop the unused level with droplevels(). Save the simplified data set over the old one as comics.
# Remove align level
comics <- comics %>%
  filter(align != row.names(tab)[4]) %>%
  droplevels()
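
To see why both steps are needed, here is a small sketch on a made-up factor (the values are hypothetical). Base R subsetting stands in for filter() so that the example runs without any packages:

```r
# Toy factor (hypothetical) with one rare level
align <- factor(c("Good", "Bad", "Good", "Rare", "Bad"))

# Step 1: remove the rows; the level itself is still registered
kept <- align[align != "Rare"]
levels(kept)
table(kept)        # "Rare" still appears, with a count of 0

# Step 2: drop levels that no longer occur
cleaned <- droplevels(kept)
levels(cleaned)
```

Without the second step, the empty level would still show up as a zero row in tables and as an empty category in plots.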

Side-by-side barcharts

While a contingency table represents the counts numerically, it’s often more useful to represent them graphically.

Here you’ll construct two side-by-side bar charts of the comics data. This shows that there can often be two or more options for presenting the same data. Passing the argument position = "dodge" to geom_bar() says that you want a side-by-side (i.e. not stacked) bar chart.

  • Create a side-by-side bar chart with align on the x-axis and gender as the fill aesthetic.
  • Create another side-by-side bar chart with gender on the x-axis and align as the fill aesthetic. Rotate the axis labels 90 degrees to help readability.

# Create side-by-side barchart of gender by alignment
p1 <- ggplot(comics, aes(x = align, fill = gender)) + 
  geom_bar(position = "dodge")

# Create side-by-side barchart of alignment by gender
p2 <- ggplot(comics, aes(x = gender, fill = align)) + 
  geom_bar(position="dodge") +
  theme(axis.text.x = element_text(angle = 90))

grid.arrange(p1, p2, ncol=2)

Conditional proportions

The following code generates tables of joint and conditional proportions, respectively:

tab <- table(comics$align, comics$gender)
options(scipen = 999, digits = 3) # Print fewer digits
prop.table(tab)     # Joint proportions
##          
##             Female     Male    Other
##   Bad     0.082210 0.395160 0.001672
##   Good    0.130135 0.251333 0.000888
##   Neutral 0.043692 0.094021 0.000888
prop.table(tab, 2)  # Conditional on columns
##          
##           Female  Male Other
##   Bad      0.321 0.534 0.485
##   Good     0.508 0.339 0.258
##   Neutral  0.171 0.127 0.258
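
The difference between the two calls is which total each cell is divided by. A minimal sketch with a made-up 2 x 2 table (the counts are hypothetical) makes the three options explicit:

```r
# Toy 2 x 2 table of counts (hypothetical numbers)
counts <- matrix(c(10, 30, 20, 40), nrow = 2,
                 dimnames = list(align = c("Bad", "Good"),
                                 gender = c("Female", "Male")))
counts <- as.table(counts)

joint  <- prop.table(counts)      # each cell / grand total
by_row <- prop.table(counts, 1)   # each cell / its row total
by_col <- prop.table(counts, 2)   # each cell / its column total

sum(joint)        # joint proportions sum to 1 overall
rowSums(by_row)   # each row sums to 1
colSums(by_col)   # each column sums to 1
```

In the comics output above, prop.table(tab, 2) conditions on gender, which is why each column (for example 0.321 + 0.508 + 0.171 for Female) sums to 1.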

Counts vs. proportions

Bar charts can tell dramatically different stories depending on whether they represent counts or proportions and, if proportions, what the proportions are conditioned on. To demonstrate this difference, you’ll construct two bar charts in this exercise: one of counts and one of proportions.

# Plot of gender by align
p3 <- ggplot(comics, aes(x = align, fill = gender)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45))
  
# Plot proportion of gender, conditional on align
p4 <- ggplot(comics, aes(x = align, fill = gender)) + 
  geom_bar(position = "fill") +
  ylab("proportion") +
  theme(axis.text.x = element_text(angle = 45))

grid.arrange(p3, p4, ncol=2)

By adding position = "fill" to geom_bar(), you are saying you want the bars to fill the entire height of the plotting window, thus displaying proportions and not raw counts.

Marginal barchart

If you are interested in the distribution of alignment of all superheroes, it makes sense to construct a bar chart for just that single variable.

# Change the order of the levels in align
comics$align <- factor(comics$align, 
                       levels = c("Bad", "Neutral", "Good"))

# Create plot of align
ggplot(comics, aes(x = align)) + 
  geom_bar()
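
Re-specifying levels changes only the order in which categories are displayed, not the data themselves. The sketch below uses a made-up factor (hypothetical values) and also shows a pitfall worth checking for: any value not listed in levels is turned into NA.

```r
# Toy factor (hypothetical); the default level order is alphabetical
align <- factor(c("Good", "Bad", "Neutral", "Good"))
levels(align)

# Re-specify the level order; the underlying values are unchanged
reordered <- factor(align, levels = c("Bad", "Neutral", "Good"))
levels(reordered)
as.character(reordered)

# Caution: a value missing from (or misspelled in) `levels` becomes NA
typo <- factor(align, levels = c("Bad", "Neutral", "good"))
sum(is.na(typo))
```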

Conditional barchart

Now, if you want to break down the distribution of alignment based on gender, you’re looking for conditional distributions.

  • Create a bar chart of align faceted by gender.

# Plot of alignment broken down by gender
ggplot(comics, aes(x = align)) + 
  geom_bar() +
  facet_wrap(~ gender)

Exploring Numerical Data

Faceted histogram

url <- "https://assets.datacamp.com/production/course_1796/datasets/cars04.csv"
filename <- basename(url)
if (!file.exists(filename)) download.file(url, destfile = filename)
cars <- read.csv(filename, stringsAsFactors = TRUE)
glimpse(cars)
## Observations: 428
## Variables: 19
## $ name        <fct> Chevrolet Aveo 4dr, Chevrolet Aveo LS 4dr hatch, C...
## $ sports_car  <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
## $ suv         <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
## $ wagon       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
## $ minivan     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
## $ pickup      <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
## $ all_wheel   <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
## $ rear_wheel  <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
## $ msrp        <int> 11690, 12585, 14610, 14810, 16385, 13670, 15040, 1...
## $ dealer_cost <int> 10965, 11802, 13697, 13884, 15357, 12849, 14086, 1...
## $ eng_size    <dbl> 1.6, 1.6, 2.2, 2.2, 2.2, 2.0, 2.0, 2.0, 2.0, 2.0, ...
## $ ncyl        <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,...
## $ horsepwr    <int> 103, 103, 140, 140, 140, 132, 132, 130, 110, 130, ...
## $ city_mpg    <int> 28, 28, 26, 26, 26, 29, 29, 26, 27, 26, 26, 32, 36...
## $ hwy_mpg     <int> 34, 34, 37, 37, 37, 36, 36, 33, 36, 33, 33, 38, 44...
## $ weight      <int> 2370, 2348, 2617, 2676, 2617, 2581, 2626, 2612, 26...
## $ wheel_base  <int> 98, 98, 104, 104, 104, 105, 105, 103, 103, 103, 10...
## $ length      <int> 167, 153, 183, 183, 183, 174, 174, 168, 168, 168, ...
## $ width       <int> 66, 66, 69, 68, 69, 67, 67, 67, 67, 67, 67, 67, 67...

The exercises in this section use the cars data set.

# Create faceted histogram
ggplot(cars, aes(x = city_mpg)) +
  geom_histogram(col="black") +
  facet_wrap(~ suv)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 14 rows containing non-finite values (stat_bin).

Boxplots and density plots

A quick look at the distinct values of ncyl:

unique(cars$ncyl)
## [1]  4  6  3  8  5 12 10 -1
  • Filter cars to include only cars with 4, 6, or 8 cylinders and save the result as common_cyl. The %in% operator may prove useful here.
  • Create side-by-side box plots of city_mpg separated out by ncyl.
  • Create overlaid density plots of city_mpg colored by ncyl.
# Filter cars with 4, 6, 8 cylinders
common_cyl <- filter(cars, ncyl %in% c(4,6,8))

# Create box plots of city mpg by ncyl
p5 <- ggplot(common_cyl, aes(x = as.factor(ncyl), y = city_mpg)) +
  geom_boxplot()

# Create overlaid density plots for same data
p6 <- ggplot(common_cyl, aes(x = city_mpg, fill = as.factor(ncyl))) +
  geom_density(alpha = .3)

grid.arrange(p5, p6, ncol=1)
## Warning: Removed 11 rows containing non-finite values (stat_boxplot).
## Warning: Removed 11 rows containing non-finite values (stat_density).
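
The filter above relies on %in% rather than ==. As a short illustration on made-up values (the vector below is hypothetical), the two operators behave quite differently when the right-hand side has more than one element:

```r
ncyl <- c(4, 6, 3, 8, 5, 12, 4, 8)   # toy values (hypothetical)

# %in% tests each element for membership in the set
keep_in <- ncyl %in% c(4, 6, 8)
ncyl[keep_in]

# == compares element by element, recycling c(4, 6, 8) along ncyl,
# so it misses matches and is not a membership test
keep_eq <- suppressWarnings(ncyl == c(4, 6, 8))
ncyl[keep_eq]
```

Here %in% keeps all five cars with 4, 6, or 8 cylinders, while == keeps only the three that happen to line up with the recycled vector.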

The following interpretations can be drawn from the plots:

  • The highest-mileage cars have 4 cylinders.
  • The typical 4-cylinder car gets better mileage than even the most efficient 8-cylinder car.
  • Most of the 4-cylinder cars get better mileage than even the most efficient 8-cylinder cars.

Marginal and conditional histograms

We will turn our attention to a new variable: horsepwr. The goal is to get a sense of the marginal distribution of this variable and then compare it to the distribution of horsepower conditional on the price of the car being less than $25,000.

# Create hist of horsepwr
p1 <- cars %>%
  ggplot(aes(horsepwr)) +
  geom_histogram() +
  ggtitle('Distribution of horsepower')

# Create hist of horsepwr for affordable cars
p2 <- cars %>% 
  filter(msrp < 25000) %>%
  ggplot(aes(horsepwr)) +
  geom_histogram() +
  xlim(c(90, 550)) +
  ggtitle('Distribution of horsepower for \ncars under $25K')

grid.arrange(p1, p2, ncol=2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).

Observation: The highest-horsepower cars in the less expensive range have just under 250 horsepower.

It’s a good idea to see how things change when you alter the binwidth. The binwidth determines how smooth your distribution will appear: the smaller the binwidth, the more jagged your distribution becomes. It’s good practice to consider several binwidths in order to detect different types of structure in your data.

# Create hist of horsepwr with binwidth of 3
p1 <- cars %>%
  ggplot(aes(horsepwr)) +
  geom_histogram(binwidth = 3) +
  ggtitle('Plot A, binwidth = 3')

# Create hist of horsepwr with binwidth of 30
p2 <- cars %>%
  ggplot(aes(horsepwr)) +
  geom_histogram(binwidth = 30) +
  ggtitle('Plot B, binwidth = 30')

# Create hist of horsepwr with binwidth of 60
p3 <- cars %>%
  ggplot(aes(horsepwr)) +
  geom_histogram(binwidth = 60) +
  ggtitle('Plot C, binwidth = 60')

grid.arrange(p1, p2, p3, nrow=3)

Observation: Cars tend to cluster right at 200 and 300 horsepower, a pattern that is most visible with the smallest binwidth.
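
The effect of the binwidth can also be checked numerically. The sketch below counts observations per bin on made-up horsepower values (hp and the helper bin_counts() are hypothetical). It uses simple left-closed bins built with cut(), which is an assumption for illustration and not ggplot2's exact bin placement:

```r
# Toy horsepower values (hypothetical), with a cluster at exactly 200
hp <- c(150, 165, 180, 198, 200, 200, 200, 200, 206, 220, 240, 260)

# Count observations per left-closed bin [a, b) for a given binwidth
bin_counts <- function(x, binwidth) {
  # The last break is pushed past max(x) so the maximum falls inside a bin
  breaks <- seq(floor(min(x) / binwidth) * binwidth,
                ceiling((max(x) + 1) / binwidth) * binwidth,
                by = binwidth)
  table(cut(x, breaks = breaks, right = FALSE))
}

narrow <- bin_counts(hp, 5)    # many bins: the spike at 200 stands out
wide   <- bin_counts(hp, 60)   # few bins: the spike is absorbed

length(narrow); max(narrow)
length(wide);   max(wide)
```

With the narrow bins, the tallest bin holds just the four cars at exactly 200; with the wide bins, the same cars are merged with their neighbors and the spike is no longer distinguishable.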

Box plots for outliers

In addition to indicating the center and spread of a distribution, a box plot provides a graphical means to detect outliers. You can apply this method to the msrp column (manufacturer’s suggested retail price) to detect if there are unusually expensive or cheap cars.

# Construct box plot of msrp
p1 <- cars %>%
  ggplot(aes(x = 1, y = msrp)) +
  geom_boxplot()

# Exclude outliers from data
cars_no_out <- cars %>%
  filter(msrp < 100000)

# Construct box plot of msrp using the reduced dataset
p2 <- cars_no_out %>%
  ggplot(aes(x=1, y=msrp)) +
  geom_boxplot()

grid.arrange(p1, p2, ncol=2)
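
The code above removes outliers with a fixed cutoff of $100,000. As an alternative sketch, base R's boxplot.stats() reports the points that a box plot would draw beyond its whiskers, using the usual rule of 1.5 times the box length. The prices below are made up, and the hinges computed by boxplot.stats() can differ slightly from the quartiles ggplot2 uses:

```r
# Toy prices in thousands of dollars (hypothetical)
price <- c(10, 12, 13, 14, 15, 16, 18, 100)

stats <- boxplot.stats(price)   # default coef = 1.5
stats$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out     # points beyond the whiskers

# Drop the flagged points
price_no_out <- price[!price %in% stats$out]
```

A data-driven rule like this avoids hard-coding a threshold, although a fixed cutoff can be easier to explain and justify for a specific data set.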

Plot selection

Consider two other columns in the cars data set: city_mpg and width. Which is the most appropriate plot for displaying the important features of their distributions? Both density plots and box plots display the central tendency and spread of the data, but the box plot is more robust to outliers.

# Create plot of city_mpg
p1 <- cars %>%
  ggplot(aes(x = 1, y = city_mpg)) +
  geom_boxplot() +
  ggtitle("Distribution of `city_mpg`")

# Create plot of width
p2 <- cars %>% 
  ggplot(aes(x = width)) +
  geom_density() +
  ggtitle("Distribution of `width`")

grid.arrange(p1, p2, ncol=2)
## Warning: Removed 14 rows containing non-finite values (stat_boxplot).
## Warning: Removed 28 rows containing non-finite values (stat_density).

Because its outliers give the city_mpg variable a much wider range, its distribution is best displayed as a box plot.

3 variable plot

Faceting is a valuable technique for looking at several conditional distributions at the same time. If the faceted distributions are laid out in a grid, you can consider the association between a variable and two others, one on the rows of the grid and the other on the columns.

# Facet hists using hwy mileage and ncyl
common_cyl %>%
  ggplot(aes(x = hwy_mpg)) +
  geom_histogram() +
  facet_grid(ncyl ~ suv, labeller=label_both) +
  ggtitle("hwy_mpg")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 11 rows containing non-finite values (stat_bin).

Across both SUVs and non-SUVs, mileage tends to decrease as the number of cylinders increases.

Numerical Summaries

Calculate center measures

  • Create a data set called gap2007 that contains only data from the year 2007.
  • Using gap2007, calculate the mean and median life expectancy for each continent. Don’t worry about naming the new columns produced by summarize().
  • Confirm the trends that you see in the medians by generating side-by-side box plots of life expectancy for each continent.
# Create dataset of 2007 data
library(gapminder)  # provides the gapminder data set
gap2007 <- gapminder %>% filter(year == 2007)

# Compute groupwise mean and median lifeExp
gap2007 %>%
  group_by(continent) %>%
  summarize(mean(lifeExp),
            median(lifeExp))
## # A tibble: 5 x 3
##   continent `mean(lifeExp)` `median(lifeExp)`
##   <fct>               <dbl>             <dbl>
## 1 Africa               54.8              52.9
## 2 Americas             73.6              72.9
## 3 Asia                 70.7              72.4
## 4 Europe               77.6              78.6
## 5 Oceania              80.7              80.7
# Generate box plots of lifeExp for each continent
gap2007 %>%
  ggplot(aes(x = continent, y = lifeExp)) +
  geom_boxplot() +
  ggtitle("Life expectancy for each continent")

Calculate spread measures

  • For each continent in gap2007, summarize life expectancies using the sd(), the IQR(), and the count of countries, n(). No need to name the new columns produced here. The n() function within your summarize() call does not take any arguments.
  • Graphically compare the spread of these distributions by constructing overlaid density plots of life expectancy broken down by continent.
# Compute groupwise measures of spread
gap2007 %>%
  group_by(continent) %>%
  summarize(sd(lifeExp),
            IQR(lifeExp),
            n())
## # A tibble: 5 x 4
##   continent `sd(lifeExp)` `IQR(lifeExp)` `n()`
##   <fct>             <dbl>          <dbl> <int>
## 1 Africa            9.63          11.6      52
## 2 Americas          4.44           4.63     25
## 3 Asia              7.96          10.2      33
## 4 Europe            2.98           4.78     30
## 5 Oceania           0.729          0.516     2
# Generate overlaid density plots
gap2007 %>%
  ggplot(aes(x = lifeExp, fill = continent)) +
  geom_density(alpha = 0.3)

Choose measures for center and spread

# Compute stats for lifeExp in Americas
gap2007 %>%
  filter(continent == 'Americas') %>%
  summarize(mean(lifeExp),
            sd(lifeExp))
## # A tibble: 1 x 2
##   `mean(lifeExp)` `sd(lifeExp)`
##             <dbl>         <dbl>
## 1            73.6          4.44
# Compute stats for population
gap2007 %>%
  summarize(median(pop),
            IQR(pop) )
## # A tibble: 1 x 2
##   `median(pop)` `IQR(pop)`
##           <dbl>      <dbl>
## 1      10517531  26702008.

Like mean and standard deviation, median and IQR measure the central tendency and spread, respectively, but are robust to outliers and non-normal data.
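
A minimal sketch with made-up numbers (both vectors are hypothetical) shows what robustness means in practice: replacing one value with an extreme one changes the mean and standard deviation drastically but leaves the median and IQR untouched.

```r
x     <- c(1, 2, 3, 4, 5)     # toy data (hypothetical)
x_out <- c(1, 2, 3, 4, 500)   # same data with one extreme value

c(mean = mean(x),     sd = sd(x),     median = median(x),     IQR = IQR(x))
c(mean = mean(x_out), sd = sd(x_out), median = median(x_out), IQR = IQR(x_out))
```

This is the reasoning behind using median() and IQR() for the heavily skewed pop variable while keeping mean() and sd() for the roughly symmetric life expectancies in the Americas.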

Transformations

Using the gap2007 data:

  • Create a density plot of the population variable.
  • Mutate a new column called log_pop that is the natural log of the population and save it back into gap2007.
  • Create a density plot of your transformed variable.
# Create density plot of old variable
gap2007 %>%
  ggplot(aes(x = pop)) +
  geom_density() +
  ggtitle("Density plot of population")

# Transform the skewed pop variable
gap2007 <- gap2007 %>%
  mutate(log_pop = log(pop))

# Create density plot of new variable
gap2007 %>%
  ggplot(aes(x = log_pop)) +
  geom_density() +
  ggtitle("Density plot of the natural log of population")
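
To illustrate why the log transformation helps, the sketch below uses made-up populations (hypothetical values) that each grow by a factor of ten. On the raw scale the mean sits far above the median, a sign of right skew; on the log scale the values are evenly spaced and the two measures agree.

```r
pop <- c(1e3, 1e4, 1e5, 1e6, 1e7)   # toy populations (hypothetical)

# On the raw scale the mean is pulled far above the median (right skew)
mean(pop) / median(pop)

# On the log scale these values are evenly spaced, so mean and median agree
log_pop <- log(pop)
mean(log_pop) - median(log_pop)
```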

Identify outliers

  • Filter gap2007 so that it contains only observations from Asia, then create a new variable called is_outlier that is TRUE for countries with life expectancy less than 50. Assign the result to gap_asia.
  • Filter gap_asia to remove all outliers, then create another box plot of the remaining life expectancies.
# Filter for Asia, add column indicating outliers
gap_asia <- gap2007 %>%
  filter(continent == 'Asia') %>%
  mutate(is_outlier = lifeExp < 50)

# Remove outliers, create box plot of lifeExp
gap_asia %>%
  filter(!is_outlier) %>%
  ggplot(aes(x = 1, y = lifeExp)) +
  geom_boxplot()

Case study

Spam and num_char

Is there an association between spam and the length of an email? You could imagine a story either way:

  • Spam is more likely to be a short message tempting me to click on a link, or
  • My normal email is likely shorter since I exchange brief emails with my friends all the time.
library(openintro)  # assumed source of the email data set (see ?email)
head(email)
##   spam to_multiple from cc sent_email                time image attach
## 1    0           0    1  0          0 2012-01-01 01:16:41     0      0
## 2    0           0    1  0          0 2012-01-01 02:03:59     0      0
## 3    0           0    1  0          0 2012-01-01 11:00:32     0      0
## 4    0           0    1  0          0 2012-01-01 04:09:49     0      0
## 5    0           0    1  0          0 2012-01-01 05:00:01     0      0
## 6    0           0    1  0          0 2012-01-01 05:04:46     0      0
##   dollar winner inherit viagra password num_char line_breaks format
## 1      0     no       0      0        0    11.37         202      1
## 2      0     no       0      0        0    10.50         202      1
## 3      4     no       1      0        0     7.77         192      1
## 4      0     no       0      0        0    13.26         255      1
## 5      0     no       0      0        2     1.23          29      0
## 6      0     no       0      0        2     1.09          25      0
##   re_subj exclaim_subj urgent_subj exclaim_mess number
## 1       0            0           0            0    big
## 2       0            0           0            1  small
## 3       0            0           0            6  small
## 4       0            0           0           48  small
## 5       0            0           0            1   none
## 6       0            0           0            1   none
email$spam <- factor(email$spam)
levels(email$spam) <- c("not-spam", "spam")
head(email)
##       spam to_multiple from cc sent_email                time image attach
## 1 not-spam           0    1  0          0 2012-01-01 01:16:41     0      0
## 2 not-spam           0    1  0          0 2012-01-01 02:03:59     0      0
## 3 not-spam           0    1  0          0 2012-01-01 11:00:32     0      0
## 4 not-spam           0    1  0          0 2012-01-01 04:09:49     0      0
## 5 not-spam           0    1  0          0 2012-01-01 05:00:01     0      0
## 6 not-spam           0    1  0          0 2012-01-01 05:04:46     0      0
##   dollar winner inherit viagra password num_char line_breaks format
## 1      0     no       0      0        0    11.37         202      1
## 2      0     no       0      0        0    10.50         202      1
## 3      4     no       1      0        0     7.77         192      1
## 4      0     no       0      0        0    13.26         255      1
## 5      0     no       0      0        2     1.23          29      0
## 6      0     no       0      0        2     1.09          25      0
##   re_subj exclaim_subj urgent_subj exclaim_mess number
## 1       0            0           0            0    big
## 2       0            0           0            1  small
## 3       0            0           0            6  small
## 4       0            0           0           48  small
## 5       0            0           0            1   none
## 6       0            0           0            1   none
  • Compute appropriate measures of the center and spread of num_char for both spam and not-spam using group_by() and summarize(). No need to name the new columns created by summarize().
  • Construct side-by-side box plots to visualize the association between the same two variables. It will be useful to mutate() a new column containing a log-transformed version of num_char.
# Compute summary statistics
email %>%
  group_by(spam) %>%
  summarise(median(num_char), IQR(num_char)  )
## # A tibble: 2 x 3
##   spam     `median(num_char)` `IQR(num_char)`
##   <fct>                 <dbl>           <dbl>
## 1 not-spam               6.83           13.6 
## 2 spam                   1.05            2.82
# Create plot
email %>%
  mutate(log_num_char = log(num_char)) %>%
  ggplot(aes(x = spam, y = log_num_char)) +
  geom_boxplot()

The median length of not-spam emails is greater than that of spam emails.

Spam and !!!

A more obvious indicator of spam might be exclamation marks.

  • Calculate appropriate measures of the center and spread of exclaim_mess for both spam and not-spam using group_by() and summarize().
email %>% 
  group_by(spam) %>% 
  summarise(mean(exclaim_mess), sd(exclaim_mess) )
## # A tibble: 2 x 3
##   spam     `mean(exclaim_mess)` `sd(exclaim_mess)`
##   <fct>                   <dbl>              <dbl>
## 1 not-spam                 6.51               47.6
## 2 spam                     7.32               79.9
  • Construct an appropriate plot to visualize the association between the same two variables, adding in a log-transformation step if necessary.
email %>%
  ggplot(aes(x=log(exclaim_mess+0.01) ) ) +
  geom_histogram(binwidth = 0.75) +
  facet_wrap(~spam)

Spam and !!! interpretation

  1. The most common value of exclaim_mess in both classes of email is zero (a log(exclaim_mess) of -4.6 after adding .01).
  2. Even after transformation, the distribution of exclaim_mess in both classes of email is right skewed.
  3. The typical number of exclamation marks in the not-spam group appears to be slightly higher than in the spam group.
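
The value of -4.6 in the first point comes directly from the 0.01 offset used in the plot. A short sketch on made-up counts (hypothetical values) shows why the offset is needed: the log of zero is -Inf, which a histogram cannot place on its axis.

```r
exclaim <- c(0, 1, 6, 48)   # toy counts (hypothetical)

log(exclaim)                # the zero maps to -Inf
log(exclaim + 0.01)         # the zero now maps to log(0.01)

round(log(0.01), 1)         # -4.6, the position of the spike for zero counts
```

The size of the offset is a judgment call: a smaller constant would push the spike for zero further to the left without changing the rest of the distribution much.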

Collapsing levels

# Tabulate the number of emails for each image count
table(email$image)
## 
##    0    1    2    3    4    5    9   20 
## 3811   76   17   11    2    2    1    1
  • Create a new variable called has_image that is TRUE where the number of images is greater than zero and FALSE otherwise.
  • Create an appropriate plot with email to visualize the relationship between has_image and spam.
# Create plot of proportion of spam by image
email %>%
  mutate(has_image = image > 0) %>%
  ggplot(aes(x = has_image, fill = spam)) +
  geom_bar(position = "fill")

Image and spam interpretation

An email without an image is more likely to be not-spam than spam.
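
The proportions that the filled bar chart displays can also be computed as a table. The sketch below does this on a made-up data frame (toy and its values are hypothetical), collapsing the image count into a two-level indicator and then conditioning on it:

```r
# Toy emails (hypothetical values)
toy <- data.frame(
  image = c(0, 0, 0, 0, 1, 2, 0, 3),
  spam  = factor(c("not-spam", "not-spam", "not-spam", "spam",
                   "not-spam", "not-spam", "spam", "spam"))
)

# Collapse the image count into a two-level indicator
toy$has_image <- toy$image > 0

# Proportion of spam within each has_image group (rows sum to 1)
props <- prop.table(table(has_image = toy$has_image, spam = toy$spam), 1)
props
```

Each row of props corresponds to one bar in a position = "fill" chart with has_image on the x-axis.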

Data Integrity

In the process of exploring a data set, you’ll sometimes come across something that will lead you to question how the data were compiled. For example, the variable num_char contains the number of characters in the email, in thousands, so it could take decimal values, but it certainly shouldn’t take negative values.

You can formulate a test to ensure this variable is behaving as we expect:

sum(email$num_char < 0)
## [1] 0

There are no negative values.

Consider the variables image and attach. You can read about them with ?email, but the help file is ambiguous: do attached images count as attached files in this data set?

Design a simple test to determine if images count as attached files. This involves creating a logical condition to compare the values of the two variables, then using sum() to count the cases in the data set where the condition holds. Recall that the comparison operators are < for less than, <= for less than or equal to, > for greater than, >= for greater than or equal to, and == for equal to.

sum(email$image > email$attach)
## [1] 0

Since image is never greater than attach, we can infer that images are counted as attachments.
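
Checks like these can be turned into assertions so that a script stops as soon as an expectation is violated. The following is a minimal sketch on a made-up data frame (toy and its values are hypothetical), not a complete validation routine:

```r
# Toy email data (hypothetical values)
toy <- data.frame(
  num_char = c(11.37, 10.50, 7.77, 1.23),
  image    = c(0, 1, 0, 2),
  attach   = c(0, 1, 2, 2)
)

# Count violations; na.rm = TRUE keeps a missing value from turning the sum into NA
n_negative    <- sum(toy$num_char < 0, na.rm = TRUE)
n_image_extra <- sum(toy$image > toy$attach, na.rm = TRUE)

# stopifnot() halts with an error if any condition is FALSE
stopifnot(n_negative == 0, n_image_extra == 0)
```

Note that a count of zero only shows the data are consistent with the assumption; it does not prove how the variables were recorded.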

Answering questions with chains

  • For emails containing the word “dollar”, does the typical spam email contain a greater number of occurrences of the word than the typical non-spam email? Create a summary statistic that answers this question.
# Question 1
email %>%
  filter(dollar > 0) %>%
  group_by(spam) %>%
  summarize(median(dollar))
## # A tibble: 2 x 2
##   spam     `median(dollar)`
##   <fct>               <dbl>
## 1 not-spam                4
## 2 spam                    2
  • If you encounter an email with greater than 10 occurrences of the word “dollar”, is it more likely to be spam or not-spam? Create a bar chart that answers this question.
# Question 2
email %>%
  filter(dollar > 10) %>%
  ggplot(aes(x = spam)) +
  geom_bar()

What’s in a number?

We will be looking at the variable number.

  • Reorder the levels of number so that they preserve the natural ordering of “none”, then “small”, then “big”.
  • Construct a faceted bar chart of the association between number and spam.
# Reorder levels
email$number <- factor(email$number, 
                       levels=c("none", "small", "big"))

# Construct plot of number
ggplot(email, aes(number)) +
  geom_bar() +
  facet_wrap(~spam)

What’s in a number interpretation

  • Given that an email contains a small number, it is more likely to be not-spam.
  • Given that an email contains a big number, it is more likely to be not-spam.
  • Within both spam and not-spam, the most common number is a small one.