In the previous section, we explored the distribution of a categorical variable using graphs (pie chart, bar chart) supplemented by numerical measures (percent of observations in each category). In this section, we will explore the data collected from a quantitative variable, and learn how to describe and summarize the important features of its distribution. We will first learn how to display the distribution using graphs and then move on to discuss numerical measures.
To display data from one quantitative variable graphically, we can use either the histogram or the stemplot.
# Frequency table for the exam-grade example: one count per 10-point interval
Count <- c(1, 2, 4, 5, 2, 1)
Score <- c("[40-50)", "[50-60)", "[60-70)", "[70-80)", "[80-90)", "[90-100)")
x <- data.frame(Score, Count, stringsAsFactors = FALSE)
Here are the exam grades of 15 students:
88, 48, 60, 51, 57, 85, 69, 75, 97, 72, 71, 79, 65, 63, 73
We first need to break the range of values into intervals (also called “bins” or “classes”). Since our dataset consists of exam scores, it makes sense to choose intervals 10 points wide, matching the typical range of a letter grade: 40-50, 50-60, … 90-100. Each interval includes its lower endpoint but not its upper one, so a score of 60 is counted in 60-70 rather than in 50-60. By counting how many of the 15 observations fall in each interval, we get the following table:
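The counting step can be reproduced in base R. The following is a minimal sketch, assuming the 15 scores listed above are stored in a vector named `grades` (a name chosen here for illustration) and that each interval is closed on the left:

```r
# The 15 exam grades from the example
grades <- c(88, 48, 60, 51, 57, 85, 69, 75, 97, 72, 71, 79, 65, 63, 73)

# Interval boundaries: 40, 50, ..., 100
breaks <- seq(40, 100, by = 10)

# Assign each grade to an interval. right = FALSE makes the intervals
# closed on the left, so 60 falls in [60,70); include.lowest = TRUE
# lets the last interval also include a score of exactly 100.
bins <- cut(grades, breaks = breaks, right = FALSE, include.lowest = TRUE)

# Count the observations in each interval
freq <- table(bins)
print(freq)
```

The resulting counts are 1, 2, 4, 5, 2 and 1, which add up to the 15 observations.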
To construct the histogram from this table we plot the intervals on the X-axis, and show the number of observations in each interval (frequency of the interval) on the Y-axis, which is represented by the height of a rectangle located above the interval:
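library(ggplot2)
# The counts are already tabulated, so the bars are drawn with geom_col()
# rather than geom_histogram(stat = "identity"), which triggers a warning.
# width = 1 makes adjacent bars touch, as they do in a histogram.
ggplot(data = x, aes(x = Score, y = Count)) +
  geom_col(width = 1, fill = "lightblue", color = "black")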
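The plot above starts from the pre-counted table. As an alternative sketch, base R's `hist()` can bin the raw scores itself. The vector name `grades` and the axis labels are choices made here for illustration:

```r
# The 15 exam grades from the example
grades <- c(88, 48, 60, 51, 57, 85, 69, 75, 97, 72, 71, 79, 65, 63, 73)

# hist() bins the raw values and draws the bars in one step.
# right = FALSE makes the intervals closed on the left, e.g. [60, 70).
h <- hist(grades,
          breaks = seq(40, 100, by = 10),
          right  = FALSE,
          col    = "lightblue",
          main   = "Exam Grades",
          xlab   = "Score",
          ylab   = "Frequency")

# The returned object stores the frequency of each interval
h$counts
```

The values in `h$counts` should match the `Count` column of the frequency table, which is a quick way to check that the hand count and the plot agree.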
The table above can also be turned into a relative frequency table using the following steps:
Add a row on the bottom and include the total number of observations in the dataset that are represented in the table.
Add a column at the end of the table and calculate the relative frequency for each interval by dividing the number of observations in each row by the total number of observations.
These two steps are illustrated in red in the following frequency distribution table:
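library(dplyr)       # %>% and mutate()
library(knitr)       # kable()
library(kableExtra)  # kable_styling() and column_spec()

# Relative frequency of each interval: its count divided by the total count
rel_freq <- x %>%
  mutate("Relative Frequency" = round(Count / sum(Count), digits = 2))

# Render the table with the new third column highlighted in red
kable(rel_freq, caption = "Exam Grades") %>%
  kable_styling(bootstrap_options = "striped", position = "float_left") %>%
  column_spec(3, color = "red")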
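The code above computes only the relative frequency column. As a minimal base R sketch of both steps together, the total row can be appended with `rbind()`. The names `freq`, `RelFreq`, `total_row` and `freq_total` are chosen here for illustration:

```r
# Frequency table from the exam-grade example
freq <- data.frame(
  Score = c("[40-50)", "[50-60)", "[60-70)", "[70-80)", "[80-90)", "[90-100)"),
  Count = c(1, 2, 4, 5, 2, 1),
  stringsAsFactors = FALSE
)

# Step 2: relative frequency = count in the interval / total count
freq$RelFreq <- round(freq$Count / sum(freq$Count), 2)

# Step 1: a bottom row holding the total number of observations.
# The relative frequencies of all intervals together account for
# the whole dataset, so the total relative frequency is 1.
total_row <- data.frame(Score = "Total",
                        Count = sum(freq$Count),
                        RelFreq = 1,
                        stringsAsFactors = FALSE)

freq_total <- rbind(freq, total_row)
print(freq_total)
```

Computing the relative frequencies before appending the total row keeps the total out of the `sum(freq$Count)` denominator.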
It is also possible to determine the number of scores for an interval, if you have the total number of observations and the relative frequency for that interval.
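Ex. If the relative frequency is 0.13 and the total number of observations is 15, the number of scores in the interval is \(15\times0.13=1.95\), which rounds to 2. The product is not a whole number only because the relative frequency was itself rounded to two decimal places (\(2/15\approx0.133\)).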
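The same calculation can be applied to every interval at once. The sketch below assumes the relative frequencies have been rounded to two decimal places, as in the exam-grade table. The variable names are illustrative:

```r
# Total number of observations and the rounded relative frequencies
total    <- 15
rel_freq <- c(0.07, 0.13, 0.27, 0.33, 0.13, 0.07)

# Multiply, then round to the nearest whole number to undo the
# rounding that was applied to the relative frequencies
raw    <- total * rel_freq
counts <- round(raw)

print(raw)     # products before rounding, e.g. 1.95 for the second interval
print(counts)  # recovered number of scores in each interval
```

As a sanity check, the recovered counts should add up to the total number of observations. If they do not, the relative frequencies were rounded too coarsely to recover the counts exactly.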
Interpreting the Histogram
Once the distribution has been displayed graphically, we can describe the overall pattern of the distribution and mention any striking deviations from that pattern. More specifically, we should consider the following features of the distribution:
When describing the shape of a distribution:
Symmetry/skewness of the distribution.
Peakedness (modality): the number of peaks (modes) the distribution has.
We distinguish between:
Symmetric Distributions
# dataset: 10,000 draws from a standard normal distribution
data <- data.frame(value = rnorm(10000))
# basic histogram
ggplot(data, aes(x = value)) +
  geom_histogram(binwidth = 0.2, color = "white", fill = rgb(0.2, 0.7, 0.1, 0.4)) +
  labs(title = "Symmetric, Single-Peaked (Unimodal) Distribution", x = "", y = "Frequency")
# Create data: two normal samples centered at 0 and 9, giving two peaks
my_variable <- c(rnorm(1000, 0, 2), rnorm(1000, 9, 2))
# Draw the histogram without bar borders
hist(my_variable, breaks = 40, col = rgb(0.2, 0.8, 0.5, 0.5), border = FALSE, main = "")
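# Assumption: the definitions of n, s, k and mm are not shown in the text.
# The four samples below (100 values each) are illustrative stand-ins only.
n  <- rnorm(100)                                # roughly symmetric
s  <- rexp(100) - 1                             # right-skewed
k  <- rt(100, df = 3)                           # heavy-tailed
mm <- c(rnorm(50, -2, 0.5), rnorm(50, 2, 0.5))  # two peaks

# dist is one factor variable with 4 levels, labelling each value's sample
four <- data.frame(
  dist = factor(rep(c("n", "s", "k", "mm"), each = 100), levels = c("n", "s", "k", "mm")),
  vals = c(n, s, k, mm)
)

# No y aesthetic is needed to visualize a univariate distribution.
# Use coord_cartesian() to change the x/y limits.
ggplot(four, aes(x = vals)) +
  geom_histogram(bins = 30, fill = "white", colour = "black") +
  coord_cartesian(xlim = c(-5, 5)) +
  facet_wrap(~ dist)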