One Quantitative Variable: Graphs

Michael Taylor

2018/03/28

Introduction

In the previous section, we explored the distribution of a categorical variable using graphs (pie chart, bar chart) supplemented by numerical measures (percent of observations in each category). In this section, we will explore the data collected from a quantitative variable, and learn how to describe and summarize the important features of its distribution. We will first learn how to display the distribution using graphs and then move on to discuss numerical measures.


To display data from one quantitative variable graphically, we can use either the histogram or the stemplot.

# Frequency table for the exam grades: one row per 10-point interval
Count <- c(1, 2, 4, 5, 2, 1)
Score <- c("[40-50)",
           "[50-60)",
           "[60-70)",
           "[70-80)",
           "[80-90)",
           "[90-100)")
x <- data.frame(Score, Count,
                stringsAsFactors = FALSE)

Here are the exam grades of 15 students:

88, 48, 60, 51, 57, 85, 69, 75, 97, 72, 71, 79, 65, 63, 73

We first need to break the range of values into intervals (also called “bins” or “classes”). In this case, since our dataset consists of exam scores, it will make sense to choose intervals that typically correspond to the range of a letter grade, 10 points wide: 40-50, 50-60, … 90-100. By counting how many of the 15 observations fall in each of the intervals, we get the following table:

Table 1: Exam Grades

Score       Count
[40-50)     1
[50-60)     2
[60-70)     4
[70-80)     5
[80-90)     2
[90-100)    1
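
As a check on these counts, the binning can be reproduced from the raw grades. The following is a minimal sketch in base R; the names `grades`, `breaks`, `binned`, and `counts` are introduced here for illustration and are not part of the original code. Note that `cut()` labels the intervals as `[40,50)` rather than `[40-50)`.

```r
# The 15 exam grades listed above
grades <- c(88, 48, 60, 51, 57, 85, 69, 75, 97, 72, 71, 79, 65, 63, 73)

# Interval endpoints: 40, 50, ..., 100
breaks <- seq(40, 100, by = 10)

# right = FALSE makes each interval include its left endpoint
# and exclude its right endpoint, e.g. [60,70)
binned <- cut(grades, breaks = breaks, right = FALSE)

# Count the observations that fall in each interval
counts <- table(binned)
print(counts)
```

Working from the raw data this way avoids counting errors at the interval boundaries, such as deciding by hand where a score of exactly 60 belongs.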

To construct the histogram from this table we plot the intervals on the X-axis, and show the number of observations in each interval (frequency of the interval) on the Y-axis, which is represented by the height of a rectangle located above the interval:

library(ggplot2)
# The counts are already tabulated, so draw one bar per interval with
# geom_col(); width = 1 makes adjacent bars touch, as in a histogram
ggplot(data = x, aes(Score, Count)) +
  geom_col(fill = "lightblue",
           color = "black",
           width = 1)
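
If the raw grades are available, the binning and the plotting can also be done in one step. The sketch below uses base R's `hist()` rather than ggplot2, a substitution made here only to keep the example free of package dependencies; `grades` and `h` are illustrative names.

```r
# The 15 exam grades from the example
grades <- c(88, 48, 60, 51, 57, 85, 69, 75, 97, 72, 71, 79, 65, 63, 73)

# Compute the histogram without drawing it;
# right = FALSE gives left-closed intervals [40,50), [50,60), ...
h <- hist(grades, breaks = seq(40, 100, by = 10),
          right = FALSE, plot = FALSE)

# Number of observations in each interval
print(h$counts)

# Draw the histogram from the computed object
plot(h, col = "lightblue", border = "black",
     main = "Exam Grades", xlab = "Score", ylab = "Count")
```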

The table above can also be turned into a relative frequency table using the following steps:

  1. Add a row on the bottom and include the total number of observations in the dataset that are represented in the table.

  2. Add a column at the end of the table and calculate the relative frequency for each interval by dividing the number of observations in each row by the total number of observations.

The relative frequency column is highlighted in red in the following frequency distribution table:

library(dplyr)
library(knitr)       # kable()
library(kableExtra)  # kable_styling(), column_spec()

# Add the relative frequency of each interval, rounded to two decimals
rel_freq <- x %>% mutate("Relative Frequency" =
                           round(
                             Count / sum(Count),
                             digits = 2))

# Render the table with the new third column highlighted in red
kable(rel_freq, caption = "Exam Grades") %>%
  kable_styling(
    bootstrap_options = "striped",
    position = "float_left") %>%
  column_spec(3, color = "red")

Table 2: Exam Grades

Score       Count    Relative Frequency
[40-50)     1        0.07
[50-60)     2        0.13
[60-70)     4        0.27
[70-80)     5        0.33
[80-90)     2        0.13
[90-100)    1        0.07
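
The table shows the new column from step 2 but not the Total row from step 1. A minimal sketch of both steps in base R follows; `freq`, `total`, `total_row`, and the column name `RelFreq` are illustrative names, and the relative frequencies are left unrounded so that they sum to 1.

```r
Score <- c("[40-50)", "[50-60)", "[60-70)",
           "[70-80)", "[80-90)", "[90-100)")
Count <- c(1, 2, 4, 5, 2, 1)

# Step 1: total number of observations in the table
total <- sum(Count)

# Step 2: relative frequency of each interval
freq <- data.frame(Score, Count, RelFreq = Count / total,
                   stringsAsFactors = FALSE)

# Append the Total row at the bottom of the table
total_row <- data.frame(Score = "Total", Count = total,
                        RelFreq = sum(freq$RelFreq),
                        stringsAsFactors = FALSE)
freq <- rbind(freq, total_row)

print(freq, digits = 2)
```

The Total row is a useful check: the counts should add up to the number of observations, and the relative frequencies should add up to 1.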

It is also possible to determine the number of scores in an interval if you know the total number of observations and the relative frequency for that interval.

Example: If the relative frequency is 0.13 and the total number of observations is 15, the number of scores in the interval is \(15\times0.13=1.95\), which rounds to 2. The product is not a whole number only because the relative frequency was rounded to two decimal places.
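
The same calculation can be applied to every interval at once. This is a short sketch, assuming the rounded relative frequencies from the table; `n_total`, `rel_freq_rounded`, and `recovered` are illustrative names.

```r
# Total number of observations
n_total <- 15

# Relative frequencies as reported in the table (rounded to two decimals)
rel_freq_rounded <- c(0.07, 0.13, 0.27, 0.33, 0.13, 0.07)

# Multiply by the total, then round to the nearest whole number
# to undo the rounding error in the relative frequencies
recovered <- round(n_total * rel_freq_rounded)
print(recovered)
```

Rounding recovers the original counts here because the rounding error in each relative frequency, multiplied by 15, is well below one half.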

Comment

  1. The square bracket means “including” and the parenthesis means “not including”. For example, [50,60) is the interval from 50 to 60, including 50 and not including 60; [60,70) is the interval from 60 to 70, including 60, and not including 70, etc. It really does not matter how you decide to set up your intervals, as long as you’re consistent.
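
To see how the bracket convention changes where a boundary value lands, the sketch below bins a score of exactly 60 under both conventions using base R's `cut()`. The names `edges`, `left_closed`, and `right_closed` are illustrative.

```r
# Interval endpoints around the boundary value 60
edges <- c(50, 60, 70)

# Left-closed intervals [50,60), [60,70): 60 belongs to the upper interval
left_closed <- cut(60, breaks = edges, right = FALSE)

# Right-closed intervals (50,60], (60,70]: 60 belongs to the lower interval
right_closed <- cut(60, breaks = edges, right = TRUE)

print(as.character(left_closed))   # "[60,70)"
print(as.character(right_closed))  # "(50,60]"
```

Either convention is acceptable, but mixing them within one table would count a boundary score twice or not at all.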

Interpreting the Histogram

Once the distribution has been displayed graphically, we can describe the overall pattern of the distribution and mention any striking deviations from that pattern. More specifically, we should consider the following features of the distribution:

Shape

When describing the shape of a distribution, we should consider:

  1. Symmetry/skewness of the distribution.

  2. Peakedness (modality): the number of peaks (modes) the distribution has.

We distinguish between:

Symmetric Distributions

# Dataset: 10,000 draws from a standard normal distribution
data <- data.frame(value = rnorm(10000))
# Basic histogram
ggplot(data, aes(x = value)) +
  geom_histogram(binwidth = 0.2,
                 color = "white",
                 fill = rgb(0.2, 0.7, 0.1, 0.4)) +
  labs(title = "Symmetric, Single-Peaked (Unimodal) Distribution", x = "", y = "Frequency")

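# Create data: an equal mix of two normal distributions centered at 0 and 9,
# which gives a symmetric distribution with two peaks (bimodal)
my_variable <- c(rnorm(1000, 0, 2), rnorm(1000, 9, 2))

# Draw the histogram without bar borders
hist(my_variable, breaks = 40, col = rgb(0.2, 0.8, 0.5, 0.5), border = FALSE, main = "")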

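# NOTE: n, s, k, and mm are not defined in the original text. The four
# definitions below are illustrative assumptions (100 values each).
n  <- rnorm(100)                                # assumed: normal
s  <- rexp(100) - 1                             # assumed: right-skewed
k  <- rt(100, df = 3)                           # assumed: heavy-tailed
mm <- c(rnorm(50, -2, 0.7), rnorm(50, 2, 0.7))  # assumed: two modes

# Stack the four samples into one data frame, labeled by distribution
four <- data.frame(
  dist = factor(
    rep(c("n", "s", "k", "mm"), each = 100),
    levels = c("n", "s", "k", "mm")),
  vals = c(n, s, k, mm))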

# A univariate histogram needs only an x aesthetic
# coord_cartesian() changes the visible x/y limits without dropping data
# dist is a single factor with 4 levels, so facet_wrap() draws one panel per level
ggplot(four, aes(x = vals)) +
  geom_histogram(bins = 30, fill = "white", colour = "black") +
  coord_cartesian(xlim = c(-5, 5)) +
  facet_wrap(~ dist)