Data Visualization and Distribution

Michael Taylor

2018/12/19

Pretend that we have to describe the heights of our classmates to ET, an extraterrestrial that has never seen humans.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(dslabs)
data("heights")
head(heights)
##      sex height
## 1   Male     75
## 2   Male     70
## 3   Male     68
## 4   Male     74
## 5   Male     61
## 6 Female     65

One way to convey the heights to ET is to simply send him the list of 1,050 heights. Understanding the distribution will give us a much more effective means of conveying this information. The simplest way to think of a distribution is as a compact description of a list with many elements.

table(heights$sex)
## 
## Female   Male 
##    238    812

Statistics textbooks teach us that a more useful way to define a distribution for numerical data is to define a function that reports the proportion of the data below a value \(a\), for all possible values of \(a\). This function is called the cumulative distribution function, or CDF. The following mathematical notation is used in statistics textbooks:

\[F(a)=Pr(x\leq a)\]

We define a function \(F(a)\) and make it equal to the proportion of values \(x\) less than or equal to \(a\), which is represented with this \(Pr\), meaning proportion or probability, and then in parentheses the event that we require, \(x\) less than or equal to \(a\).

We can report the proportion of values between any two heights, say \(a\) and \(b\), by computing \(F(b)\), and then subtracting \(F(a)\).
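As a quick illustration (this code is not in the original post, and the cutoffs 65 and 70 inches are just examples), we can define the empirical CDF for the heights directly and use it to report the proportion of values between two heights:

# Empirical CDF: proportion of heights at or below a
F <- function(a) mean(heights$height <= a)

# Proportion of heights between 65 and 70 inches, i.e. F(70) - F(65)
F(70) - F(65)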

For a list of numbers contained in a vector that we’ll call x, the average is simply defined as the sum of x divided by the length of x. Here is the code you would use in R.

average <- sum(x) / length(x)

Standard deviation

And the standard deviation is defined with the following formula. It’s the square root of the sum of the squared differences between the values and the mean, divided by the length.

SD <- sqrt(sum((x - average)^2) / length(x))

You can think of this as the average distance between the values and their average.
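As a quick sanity check (my addition, using a small made-up vector), these formulas agree with R’s built-in mean(), while sd() divides by length(x) - 1 rather than length(x), so it comes out slightly larger:

v <- c(2, 4, 4, 4, 5, 5, 7, 9)                # made-up example values
average <- sum(v) / length(v)                 # same as mean(v)
SD <- sqrt(sum((v - average)^2) / length(v))  # divides by n
c(average = average, SD = SD, sd = sd(v))     # sd(v) divides by n - 1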

Let’s compute the average and the standard deviation for the male heights, which we will store in an object called x.

x <- heights %>% filter(sex=="Male") %>% .$height
average <- mean(x)
SD <- sd(x)
c(average=average, SD=SD)
## average      SD 
##  69.315   3.611

Standard units

For data that is approximately normal, it is convenient to think in terms of standard units. The standard unit of a value tells us how many standard deviations away from the average that value is. Specifically, for a value x, we define the standard unit as

z = (x - average) / SD

To see how many men are within two standard deviations of the average, now that we have converted to standard units, all we have to do is count the number of z’s that are less than 2 and greater than negative 2, and then divide by the total.

# In R, we can quickly obtain standard units using the function scale.
z <- scale(x) 
mean(abs(z) < 2)
## [1] 0.9495

Taking the mean of this quantity, we see that the proportion is about 0.95. So roughly 95% of the values are within two standard deviations of the average, which is exactly what the normal distribution predicts. This makes the summary quite useful.
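For reference, the proportion predicted by the normal distribution for values within two standard deviations of the average can be computed with pnorm (a quick check, not part of the original code):

pnorm(2) - pnorm(-2)   # about 0.954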

If a distribution is well approximated by the normal distribution, we have a very useful and short summary. But to check whether, in fact, it is a good approximation, we can use quantile-quantile plots, or q-q plots. We start by defining a series of proportions, for example, p = 0.05, 0.10, 0.15, up to 0.95.

To give a quick example, for the male heights data we looked at above, about 50% of the data is below 69.5 inches. This means that if p = 0.5, then the q associated with that p is 69.5.

mean(x < 69.5)
## [1] 0.5148
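Going in the other direction, the quantile function returns the q associated with a given p (again, an addition for illustration):

quantile(x, 0.5)   # the height below which half of the male heights fall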

Now, we can make this computation for a series of p’s. If the quantiles for the data match the quantiles for the normal distribution, then it must be because the data is approximated by a normal distribution.

p <- seq(0.05, 0.95, 0.05)

Once the p’s are defined, for each p we determine the value q such that the proportion of values in the data below q is p. The q’s are referred to as the quantiles.

observed_quantiles <- quantile(x, p)

theoretical_quantiles <- qnorm( p, mean = mean(x), sd = sd(x))
plot(theoretical_quantiles, observed_quantiles)
abline(0,1)

If we use standard units instead of x, we don’t have to specify the mean and the standard deviation in the function qnorm. The code simplifies and looks like this.

observed_quantiles <- quantile(z, p)
theoretical_quantiles <- qnorm(p)
plot(theoretical_quantiles, observed_quantiles)
abline(0,1)
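As an aside (not part of the original code), base R’s qqnorm and qqline functions produce essentially the same plot in a single step:

qqnorm(x)   # sample quantiles against theoretical normal quantiles
qqline(x)   # reference line through the first and third quartiles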

Percentiles

The percentiles are the quantiles you obtain when you define p as 0.01, 0.02, up to 0.99, that is, 1%, 2%, 3%, et cetera. We call, for example, the case of p = 0.25 the 25th percentile. This gives us the number below which 25% of the data falls. The most famous percentile is the 50th, also known as the median. Note that for the normal distribution, the median and the average are the same, but this is not generally the case. Another special case that receives a name is the quartiles, which are obtained when we set p to 0.25, 0.50, and 0.75.
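To make this concrete (an added illustration), the quantile function gives the quartiles of the male heights directly, and summary reports them along with the minimum, maximum, and mean:

quantile(x, c(0.25, 0.50, 0.75))   # the quartiles
summary(x)                         # min, quartiles, mean, max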

Boxplots

data("murders")
str(murders)
## 'data.frame':    51 obs. of  5 variables:
##  $ state     : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ abb       : chr  "AL" "AK" "AZ" "AR" ...
##  $ region    : Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
##  $ population: num  4779736 710231 6392017 2915918 37253956 ...
##  $ total     : num  135 19 232 93 1257 ...

Now suppose you are trying to describe this data to someone who is used to receiving just two numbers, the average and the standard deviation. Because the murder data is far from normally distributed, those two numbers would be a poor summary; providing a five-number summary composed of the range along with the quartiles (the 25th, 50th, and 75th percentiles) would be more appropriate.

This five-number summary can be represented as a box with whiskers. The box is defined by the 25th and 75th percentiles, and the whiskers show the range. The distance between these two percentiles is called the interquartile range. Points that are outliers according to Tukey’s definition are shown separately, and the median is shown with a horizontal line. Today, we call these boxplots.
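As a sketch of the plot described above (my addition, assuming we summarize the murder rate per 100,000 people rather than the raw totals), a boxplot can be produced with base R:

# Murder rate per 100,000 people for each state
rate <- with(murders, total / population * 10^5)

# Box defined by the 25th and 75th percentiles, median as the horizontal line,
# outliers (by Tukey's rule) plotted as individual points
boxplot(rate, ylab = "Murders per 100,000")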

sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.1 LTS
## 
## Matrix products: default
## BLAS: /home/michael/anaconda3/lib/R/lib/libRblas.so
## LAPACK: /home/michael/anaconda3/lib/R/lib/libRlapack.so
## 
## locale:
## [1] en_CA.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] bindrcpp_0.2.2       dslabs_0.3.3         dplyr_0.7.6         
## [4] RevoUtils_11.0.1     RevoUtilsMath_11.0.0
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.18     rstudioapi_0.7   knitr_1.20       bindr_0.1.1     
##  [5] magrittr_1.5     tidyselect_0.2.4 R6_2.2.2         rlang_0.2.1     
##  [9] stringr_1.3.1    tools_3.5.1      xfun_0.4.11      htmltools_0.3.6 
## [13] yaml_2.2.0       rprojroot_1.3-2  digest_0.6.15    assertthat_0.2.0
## [17] tibble_1.4.2     crayon_1.3.4     bookdown_0.7     purrr_0.2.5     
## [21] codetools_0.2-15 glue_1.3.0       evaluate_0.11    rmarkdown_1.10  
## [25] blogdown_0.9.8   stringi_1.2.4    compiler_3.5.1   pillar_1.3.0    
## [29] backports_1.1.2  pkgconfig_2.0.1

References

Irizarry, Rafael A. 2017. Dslabs: Data Science Labs. https://CRAN.R-project.org/package=dslabs.

R Core Team. 2018. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2018. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.