Continuous Probability

Michael Taylor

2019/01/11

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(dslabs)
library(ggplot2)
data(heights)

Earlier we explained why, when summarizing a list of numeric values such as heights, it is not useful to construct a distribution that assigns a proportion to each possible outcome. For example, if we measured every person in a very large population with extremely high precision, no two people would be exactly the same height, so we would need to assign a proportion to each observed value and obtain no useful summary at all. It is much more practical to define probabilities for intervals rather than for single values. The standard way of doing this is with the cumulative distribution function.

We previously described the empirical cumulative distribution function (eCDF) as a basic summary of a list of numeric values. As an example, we define the eCDF for the heights of adult male students. We start by defining a vector x that contains the male heights. For every value a, the eCDF gives the proportion of values in the list x that are smaller than or equal to a.

x <- heights %>% filter(sex=="Male") %>% .$height

To define the eCDF we count the number of cases where x <= a and divide by n; in other words, we take the mean of the logical vector x <= a, which is the proportion of cases.

So let’s introduce probability. For example, if we pick one of the male students at random, what is the chance that he is taller than 70.5 inches?

# An empirical cumulative distribution function
F <- function(a) mean(x <= a)

What is the chance that a random male student is taller than 70.5 inches?

1- F(70.5)
## [1] 0.3633
# Normal density evaluated at the observed heights
df <- data_frame(x, y=dnorm(x, 
                             mean = mean(x), 
                             sd=sd(x)))
# Shade the region corresponding to heights above 70.5 inches
ggplot(df) +
  geom_line(aes(x,y))+
  geom_area(data = filter(df, x >= 70.5),
              aes(x = x, y = y), 
              fill="grey", position = "identity")

The answer is the proportion of students that are taller than 70.5 inches.

Once the cumulative distribution function has been defined, we can use it to compute the probability of any subset. For example, the probability of a student being between height a and height b is simply F(b) minus F(a). Because we can compute the probability for any possible event this way, the cumulative distribution function defines a probability distribution for picking a height at random from our vector of heights x.
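
For instance, using the function F defined above (the endpoints 68 and 72 inches are just illustrative choices):

# Probability that a randomly picked male student is between 68 and 72 inches tall
F(72) - F(68)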


Theoretical Distribution

In the data visualization module, we introduced the normal distribution as a useful approximation to many naturally occurring distributions, including that of height. The cumulative distribution for the normal distribution is defined by a mathematical formula, which in R can be obtained with the function pnorm(). We say that a random quantity is normally distributed with average avg and standard deviation s if its probability distribution is defined by:

F(a) = pnorm(a, avg, s)

This is useful because, if we are going to use the normal approximation for, say, height, we don’t need the entire dataset to answer questions such as: what is the probability that a randomly selected student is taller than 70.5 inches? We just need the average height and the standard deviation, and then we can use this piece of code:

1 - pnorm(70.5, mean(x), sd(x))
## [1] 0.3714

This gives us about 0.37, very close to the 0.3633 we obtained directly from the data.

The normal distribution is derived mathematically; apart from computing the average and the standard deviation, we don’t use data to define it. Also, the normal distribution is defined for continuous variables, not for discrete variables. However, for practicing data scientists, pretty much everything we do involves data, which, technically speaking, is discrete. For example, we could consider our adult height data categorical, with each specific height a unique category. The probability distribution would then be defined by the proportion of students reporting each of those unique heights. Here is what a plot of that would look like.

# Proportion of students reporting each unique height
plot(prop.table(table(x)), 
     xlab = "a = Height in inches",
     ylab = "Pr(X = a)")

This would be the distribution function for those categories: each reported height gets a probability defined by the proportion of students reporting it. Now, while most students rounded their height to the nearest inch, others reported values with much more precision. For example, one student reported his height to be 69.6850393700787 inches. What is that very, very precise number about? It turns out that is 177 centimeters: the student converted to inches and copied and pasted the result into the field where heights had to be reported. The probability assigned to this height is about 0.001, or 1 in 708.

However, the probability for 70 inches is 0.12, much higher than for this other value. Does it really make sense to think that the probability of being exactly 70 inches is so much higher than the probability of being 69.68? Clearly, it is much more useful for data analytic purposes to treat this outcome as a continuous numeric variable, keeping in mind that very few people, perhaps none, are exactly 70 inches tall; rather, people rounded to the nearest inch.

With continuous distributions, the probability of a single value is not even defined. For example, it does not make sense to ask what the probability is that a normally distributed value is exactly 70. Instead, we define probabilities for intervals, so we could ask instead: what is the probability that someone is between 69.99 and 70.01? In cases like height, in which the data is rounded, the normal approximation is particularly useful if we deal with intervals that include exactly one round number. For example, the normal distribution is useful for approximating the proportion of students reporting a height between 69.5 and 70.5 inches.

Here are three other examples.

mean(x<=68.5) - mean(x<=67.5)
## [1] 0.1145
mean(x<=69.5) - mean(x<=68.5)
## [1] 0.1195
mean(x<=70.5) - mean(x<=69.5)
## [1] 0.1219

These proportions are computed directly from the data. Now look at the numbers we get when we use the normal approximation instead:

pnorm(68.5, mean(x), sd(x) ) - pnorm(67.5, mean(x), sd(x) )
## [1] 0.1031
pnorm(69.5, mean(x), sd(x) ) - pnorm(68.5, mean(x), sd(x) )
## [1] 0.1097
pnorm(70.5, mean(x), sd(x) ) - pnorm(69.5, mean(x), sd(x) )
## [1] 0.1082

We get almost the same values: for these particular intervals, the normal approximation is quite useful. However, the approximation is not as useful for other intervals, for example, those that do not include an integer. Here are two examples.

mean(x<=70.9) - mean(x<=70.1)
## [1] 0.02217
pnorm(70.9, mean(x), sd(x) ) - pnorm(70.1, mean(x), sd(x) )
## [1] 0.0836

For these two intervals, the first number is computed directly from the data and the second is the normal approximation, and this time the approximation is not nearly as good. In general, we call this situation discretization. Although the true height distribution is continuous, the reported heights tend to be more common at discrete values, in this case due to rounding. As long as we are aware of how to deal with this reality, the normal approximation can still be a very useful tool.

Probability Density

For categorical data, we can define the probability of a category. For example, a roll of a die, let’s call it \(X\), can be \(1, 2, 3, 4, 5,\) or \(6\), and \(\Pr(X=4)=1/6\). The CDF can easily be defined by simply adding up probabilities: \(F(4)=\Pr(X\leq 4)=\Pr(X = 4)+\Pr(X=3)+\Pr(X=2)+\Pr(X=1)\), the probability of \(X\) being 4, or 3, or 2, or 1. We just add up those probabilities.
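
Here is a minimal sketch of that sum in R; the vector die_probs is just an illustration of a fair die:

# Each face of a fair die has probability 1/6
die_probs <- rep(1/6, 6)
# F(4) = Pr(X <= 4) is the sum of the probabilities for 1, 2, 3 and 4, i.e. 4/6
sum(die_probs[1:4])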

In contrast, for continuous distributions, the probability of a single value is not defined. However, there is a theoretical definition with a similar interpretation: the probability density. The probability density at \(x\) is defined as the function, call it little \(f(x)\), such that the probability distribution \(F(a)\), the probability of \(X\) being less than or equal to \(a\), is the integral of \(f(x)\) over all values up to \(a\):

\[F(a) = \Pr(X\leq a)=\int_{-\infty}^a f(x)\,\mathrm dx\]

For those who know calculus, remember that the integral is related to a sum: it is the sum of the areas of bars whose widths approach 0. If you don’t know calculus, you can think of \(f(x)\) as a curve for which the area under the curve up to the value \(a\) gives you the probability of \(X\) being less than or equal to \(a\). For example, to use the normal approximation to estimate the probability of someone being taller than 76 inches, we can proceed as follows.

avg <- mean(x)
s <- sd(x)
1 - pnorm(76, avg, s)
## [1] 0.03206

Mathematically, this corresponds to the gray area under the density curve to the right of 76: the curve is the probability density function for the normal distribution. In R, you get the probability density function for the normal distribution with the function dnorm(); the d stands for density. Although it may not be immediately obvious why knowing about probability densities is useful, understanding this concept will be essential to those wanting to fit models to data for which predefined functions are not available.
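
Though pnorm() handles this for us, we can verify the connection between the density and the cumulative distribution numerically with R’s integrate() function; this is a minimal sketch, assuming avg and s as computed above:

# Integrating the density from 76 to infinity approximates Pr(X > 76)
integrate(function(a) dnorm(a, avg, s), lower = 76, upper = Inf)
# Compare with the value obtained from the cumulative distribution function
1 - pnorm(76, avg, s)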

Monte Carlo Simulations

We can also run Monte Carlo simulations with normally distributed outcomes. R provides the function rnorm(), which takes three arguments: n, the number of values to generate; the mean, which defaults to 0; and the standard deviation, which defaults to 1. Below is an example of how we can generate data that looks like our reported heights.

n <- length(x)
avg <- mean(x)
s <- sd(x)
simulated_heights <- rnorm(n, avg, s)

So, with our reported heights in the vector x, we compute their length, their average, and their standard deviation, and then use the rnorm function to generate normally distributed outcomes. Not surprisingly, the distribution of these outcomes looks normal, because they were generated to look normal.

data_frame(simulated_heights=simulated_heights) %>% 
  ggplot(aes(simulated_heights)) +
  geom_histogram(color="black", binwidth = 2)

This is one of the most useful functions in R, as it will permit us to generate data that mimics naturally occurring events, and it’ll let us answer questions related to what could happen by chance by running Monte Carlo simulations.

For example, if we pick 800 males at random, what is the distribution of the tallest person? Specifically, we could ask: how rare is it that the tallest person is a seven footer? We can use the following Monte Carlo simulation to answer this question: we run 10,000 simulations, and for each one we generate 800 normally distributed values, pick the tallest one, and return it.

B <- 10000
tallest <- replicate(B, {
  simulated_data <- rnorm(800, avg, s)
  max(simulated_data)
})

The vector tallest will contain the maximum height from each simulation. So now we can ask: what proportion of these simulations return a seven footer as the tallest person? We can see that it’s a very small number.

mean(tallest >= 7*12)
## [1] 0.0169
# Histogram of the simulated heights with the normal density curve overlaid
ggplot(data_frame(x=simulated_heights , y=dnorm(simulated_heights, mean = avg, sd = s) ), 
       aes(x=x) ) +
  geom_histogram(aes(x=x, y=..density..), 
                 bins = 15, 
                 fill="grey",
                 color="black") +
  geom_line(aes(x=x,y=y), 
            color="red")

Other Continuous Distributions

The normal distribution is not the only useful theoretical distribution. Other continuous distributions that we may encounter are the student-t, the chi-squared, the exponential, the gamma, and the beta distributions. R provides functions to compute the density, the quantiles, the cumulative distribution function, and to generate Monte Carlo simulations for all of these distributions. R uses a convention that lets us remember the names of these functions: the letters d for density, q for quantile, p for the cumulative distribution function (probability), and r for random. Putting these letters in front of a shorthand for the distribution gives us the names of these useful functions. We have already seen the functions dnorm, pnorm, and rnorm; norm is the shorthand for the normal distribution, and the function qnorm gives us its quantiles. For example, we can use the dnorm function to generate this plot.

x <- seq(-4, 4, length.out = 100)
data_frame(x, f=dnorm(x)) %>% 
  ggplot(aes(x, f)) +
  geom_line()

This is the density function for the normal distribution, obtained with dnorm. For the student-t distribution, which has the shorthand t, we can use the functions dt, qt, pt, and rt to obtain the density, the quantiles, the cumulative distribution function, or a Monte Carlo simulation.
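
To illustrate the naming convention, here is a minimal sketch; the quantile 0.975 and the 5 degrees of freedom are arbitrary illustrative choices:

# qnorm inverts pnorm: the 97.5th percentile of the standard normal is about 1.96
qnorm(0.975)
pnorm(qnorm(0.975))
# The same convention applies to the student-t distribution with 5 degrees of freedom
dt(0, df = 5)      # density at 0
pt(2, df = 5)      # Pr(T <= 2)
qt(0.975, df = 5)  # 97.5th percentile
rt(3, df = 5)      # three random draws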

sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.1 LTS
## 
## Matrix products: default
## BLAS: /home/michael/anaconda3/lib/R/lib/libRblas.so
## LAPACK: /home/michael/anaconda3/lib/R/lib/libRlapack.so
## 
## locale:
## [1] en_CA.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] bindrcpp_0.2.2       ggplot2_3.0.0        dslabs_0.3.3        
## [4] dplyr_0.7.6          RevoUtils_11.0.1     RevoUtilsMath_11.0.0
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.18     pillar_1.3.0     compiler_3.5.1   plyr_1.8.4      
##  [5] bindr_0.1.1      tools_3.5.1      digest_0.6.15    evaluate_0.11   
##  [9] tibble_1.4.2     gtable_0.2.0     pkgconfig_2.0.1  rlang_0.2.1     
## [13] yaml_2.2.0       blogdown_0.9.8   xfun_0.4.11      withr_2.1.2     
## [17] stringr_1.3.1    knitr_1.20       rprojroot_1.3-2  grid_3.5.1      
## [21] tidyselect_0.2.4 glue_1.3.0       R6_2.2.2         rmarkdown_1.10  
## [25] bookdown_0.7     purrr_0.2.5      magrittr_1.5     backports_1.1.2 
## [29] scales_0.5.0     codetools_0.2-15 htmltools_0.3.6  assertthat_0.2.0
## [33] colorspace_1.3-2 labeling_0.3     stringi_1.2.4    lazyeval_0.2.1  
## [37] munsell_0.5.0    crayon_1.3.4
