Introduction to Probability

Michael Taylor

2018/12/18

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(dslabs)
library(ggplot2)

Random number generators permit us to mimic the process of picking at random. An example in R is the sample function. We demonstrate its use with some code. First, we use the rep function to generate the urn: 2 red beads and 3 blue beads. Typing beads shows its contents.

# probability red 2/5 or 40%
# and of blue 3/5 or 60%
(beads <- rep(c('red', 'blue'), times = c(2, 3)))
## [1] "red"  "red"  "blue" "blue" "blue"

Now, we can use the sample function to pick one at random.

sample(beads, 1)
## [1] "blue"

Monte Carlo

Now, we want to repeat this experiment over and over. However, it is, of course, impossible to repeat forever. Instead, we repeat the experiment a large enough number of times to make the results practically equivalent to doing it over and over forever. This is an example of a Monte Carlo simulation.

To perform our first Monte Carlo simulation, we use the replicate function. This permits us to repeat the same task any number of times we want. Here, we repeat the random event 10,000 times.

B <- 10000
events <- replicate(B, sample(beads, 1))

We can now check whether our definition of probability agrees with this Monte Carlo simulation approximation. We can use table, for example, to see the distribution.

(tab <- table(events))
## events
## blue  red 
## 6037 3963

The prop.table function gives us the proportions:

(ptab <- prop.table(tab))
## events
##   blue    red 
## 0.6037 0.3963

We see that, in fact, the Monte Carlo simulation gives a very good approximation with 0.6037 for blue and 0.3963 for red. We didn’t get exactly 0.6 and exactly 0.4, but statistical theory tells us that, if we make B large enough, we can get as close as we want to those numbers.

Probability Distribution

Defining a distribution for categorical outcomes is relatively straightforward. We simply assign a probability to each category. In cases that can be thought of as beads in an urn, the proportion of each bead type defines the distribution. Another example comes from polling. If you’re randomly calling likely voters from a population that is 44% Democrat, 44% Republican, 10% undecided, and 2% Green, these proportions define the probability of each group. For this example, the probability distribution is simply these four proportions.
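As a quick sketch, we can mimic one random call from this population with sample (the labels here are just for illustration):

# one random call from a population with these four proportions
parties <- c("Democrat", "Republican", "undecided", "Green")
sample(parties, 1, prob = c(0.44, 0.44, 0.10, 0.02))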

Independence

We say that two events are independent if the outcome of one does not affect the other. The classic example is coin tosses. Every time we toss a fair coin, the probability of seeing heads is 1/2, \(\Pr(\text{heads})=0.5\), regardless of what previous tosses have revealed. The same is true when we pick beads from an urn with replacement. In the example we saw earlier, \(\Pr(\text{red})=0.4\) regardless of previous draws.

Many examples of events that are not independent come from card games. When we deal the first card, the probability of getting, say, a king is 1 in 13, because there are 13 possibilities: ace, 2, 3, 4, and so on up to 10, jack, queen, and king. Now, if we deal a king for the first card and don’t replace it, the probability of getting a king as the second card is smaller because there are only three kings left. The probability is 3 out of 51, not 52, because we already dealt one card. These events are, therefore, not independent: the first outcome affects the second. To see an extreme case of non-independent events, consider drawing five beads at random without replacement from our urn with three blue and two red beads. We generate data like this using the sample function and assign it to x, without looking at the outcomes. Now, if asked to guess the color of the first bead, what would you guess? Since there are more blue beads, there is a 0.6 chance of seeing blue, so that is probably your guess. But now suppose we reveal the second, third, fourth, and fifth outcomes:

set.seed(2355)
x <- sample(beads, 5)
x[2:5]
## [1] "blue" "red"  "blue" "blue"

You can see that the three blue beads have already come out. This affects the probability of the first: they are not independent. So would you still guess blue? Of course not. Now you know that the probability of red is 1. These events are not independent; the probabilities change once you see the other outcomes. When events are not independent, conditional probabilities are useful and necessary to make correct calculations. We already saw an example of a conditional probability: we computed the probability that the second dealt card is a king given that the first was a king. In probability, we use the following notation.

\[\Pr(\text{Card 2 is a king | Card 1 is a king})=3/51\]

We use the vertical bar as shorthand for “given that” or “conditional on”; the two are synonyms. Note that when two events, say A and B, are independent, we have the following equation.

\[\Pr(A \mid B)=\Pr(A)\]

The probability of A given B is equal to the probability of A; it doesn’t matter what B is, the probability of A is unchanged. In fact, this can be considered the mathematical definition of independence. Now, if we want to know the probability of two events, say A and B, both occurring, we can use the multiplication rule: the probability of A and B is equal to the probability of A multiplied by the probability of B given that A already happened.

\[\Pr(A \ and\ B)=\Pr(A)\Pr(B\mid A)\]

Let’s use blackjack as an example. In blackjack, you are dealt two random cards without replacement, and you can then ask for more. The goal is to get closer to 21 than the dealer without going over. Face cards are worth 10 points, as is the 10 card, and aces are worth either 11 or 1. So if you get an ace and a face card or a 10, you win automatically. To calculate the chances of getting 21 this way (first an ace, then a card worth 10), we compute the probability of the first card being an ace and multiply by the probability of a face card or a 10 given that the first card was an ace. This is 1/13, the chance of getting an ace, times 16/51, the chance of getting a card worth 10 given that we have already taken one card out. This is approximately 2%.
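We can check this arithmetic directly in R:

# Pr(ace first) * Pr(face card or ten second | ace first)
(1 / 13) * (16 / 51)
## [1] 0.02413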

The multiplication rule also applies to more than two events; we can use induction to expand it. The probability of A and B and C is equal to the probability of A, times the probability of B given that A happened, times the probability of C given that A and B happened.

\[\Pr(A \ and\ B \ and \ C)=\Pr(A)\Pr(B\mid A)\Pr(C \mid A \ and \ B)\]

When we have independent events, the multiplication rule becomes simpler: we simply multiply the individual probabilities. But we have to be very careful when we use the multiplication rule in practice. It assumes independence, and assuming it when it does not hold can result in very different, and incorrect, probability calculations. This can have dire consequences, for example in a trial, if an expert doesn’t really understand the multiplication rule and how to use it. Let’s look at an example.

This is loosely based on something that actually happened. Imagine a court case in which the suspect was described as having a mustache and a beard. The prosecution brings in an expert who argues that, because 1 in 10 men have beards and 1 in 5 men have mustaches, by the multiplication rule only 2% of men have both: 1/10 times 1/5. But to multiply like this we need to assume independence, and in this case that is clearly not true: the conditional probability of a man having a mustache given that he has a beard is quite high, about 95%. The correct calculation gives a much higher probability, about 9%, so there is definitely reasonable doubt.
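Using the numbers quoted above, R makes the difference plain:

# incorrect calculation: assumes beards and mustaches are independent
(1 / 10) * (1 / 5)
## [1] 0.02
# correct calculation: uses Pr(mustache | beard), which is about 0.95
(1 / 10) * 0.95
## [1] 0.095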

Combinations and Permutations

Consider the following question: if I draw 5 cards without replacement, what is the probability that I get all cards of the same suit, what is called a flush in poker? Discrete probability teaches us how to make these computations using mathematics. Here we will focus on using R code. We will use card games as an example and construct a deck of cards using R.

For this, we will use the functions expand.grid and paste. We use paste to create strings by joining smaller strings. For example, if we have the number and the suit of a card in two different variables, we can create the card name using paste like this.

number <- "Three"
suit <- "Hearts"
paste(number, suit)
## [1] "Three Hearts"

It also works on pairs of vectors. It performs the operation element-wise.

paste(letters[1:5], as.character(1:5))
## [1] "a 1" "b 2" "c 3" "d 4" "e 5"

The function expand.grid gives us all combinations of two lists. So, for example, if you have blue and black pants and white, gray, and plaid shirts, all your combinations can be computed using expand.grid like this.

expand.grid(pants=c("blue", "black"), 
            shirt=c("white", "grid", "plaid"))
##   pants shirt
## 1  blue white
## 2 black white
## 3  blue  grid
## 4 black  grid
## 5  blue plaid
## 6 black plaid

So here’s how we generate a deck of cards. We define the four suits and the 13 numbers, and then we create the deck using expand.grid, pasting together the two columns it creates.

suits <- c("Diamonds", "Clubs", "Hearts", "Spades")

numbers <- c("Ace", "Deuce", "Three", "Four", "Five", "Six", "Seven",
             "Eight", "Nine", "Ten", "Jack", "Queen", "King")

deck <- expand.grid(number=numbers, suit=suits)

( deck <- paste(deck$number, deck$suit) )
##  [1] "Ace Diamonds"   "Deuce Diamonds" "Three Diamonds" "Four Diamonds" 
##  [5] "Five Diamonds"  "Six Diamonds"   "Seven Diamonds" "Eight Diamonds"
##  [9] "Nine Diamonds"  "Ten Diamonds"   "Jack Diamonds"  "Queen Diamonds"
## [13] "King Diamonds"  "Ace Clubs"      "Deuce Clubs"    "Three Clubs"   
## [17] "Four Clubs"     "Five Clubs"     "Six Clubs"      "Seven Clubs"   
## [21] "Eight Clubs"    "Nine Clubs"     "Ten Clubs"      "Jack Clubs"    
## [25] "Queen Clubs"    "King Clubs"     "Ace Hearts"     "Deuce Hearts"  
## [29] "Three Hearts"   "Four Hearts"    "Five Hearts"    "Six Hearts"    
## [33] "Seven Hearts"   "Eight Hearts"   "Nine Hearts"    "Ten Hearts"    
## [37] "Jack Hearts"    "Queen Hearts"   "King Hearts"    "Ace Spades"    
## [41] "Deuce Spades"   "Three Spades"   "Four Spades"    "Five Spades"   
## [45] "Six Spades"     "Seven Spades"   "Eight Spades"   "Nine Spades"   
## [49] "Ten Spades"     "Jack Spades"    "Queen Spades"   "King Spades"

Now that we have a deck constructed, we can start answering questions about probability. Let’s start with something simple: let’s double-check that the probability of a king as the first card is 1 in 13.

We simply compute the proportion of possible outcomes that satisfy our condition. First we create a vector, kings, that contains the four ways we can get a king. Then we check what proportion of the deck is one of these cards, and we get the answer we expect:

## create a vector that contains the four ways we can get a king.
kings <- paste("King", suits)
## proportion that are kings
mean(deck %in% kings)
## [1] 0.07692

This is approximately \(\frac{1}{13}\).

Now, how about the conditional probability of the second card being a king, given that the first was a king? Earlier we deduced that if one king is already out, there are 51 cards left, so the probability is 3 in 51 (approximately 0.05882). But let’s confirm by listing out all possible outcomes. To do this, we use the combinations and permutations functions available from the gtools package.

library(gtools)

The permutations function computes, for any list of size n, all the different ways we can select r items. Here, for example, are all the ways we can choose 2 numbers from the list 1, 2, 3, 4, 5.

permutations(5, 2)
##       [,1] [,2]
##  [1,]    1    2
##  [2,]    1    3
##  [3,]    1    4
##  [4,]    1    5
##  [5,]    2    1
##  [6,]    2    3
##  [7,]    2    4
##  [8,]    2    5
##  [9,]    3    1
## [10,]    3    2
## [11,]    3    4
## [12,]    3    5
## [13,]    4    1
## [14,]    4    2
## [15,]    4    3
## [16,]    4    5
## [17,]    5    1
## [18,]    5    2
## [19,]    5    3
## [20,]    5    4

Notice that the order matters: 3, 1 is different from 1, 3, so both appear among our permutations. Also notice that 1, 1; 2, 2; and 3, 3 don’t appear, because once we pick a number, it can’t appear again. Optionally, we can supply the permutations function with a vector v of items to permute.

So for example, if you want to see 5 random 7-digit phone numbers out of all possible phone numbers, you could type code like this.

all_phone_numbers <- permutations(10, 7, v = 0:9)
n <- nrow(all_phone_numbers)
index <- sample(n ,5)
all_phone_numbers[index,]
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,]    6    7    0    3    8    5    4
## [2,]    8    1    9    4    2    0    5
## [3,]    2    9    7    0    6    1    3
## [4,]    5    7    9    1    2    6    3
## [5,]    5    8    3    2    7    1    0

Here we define a vector of digits that goes from 0 to 9, rather than 1 through 10. These four lines of code generate all phone numbers, pick 5 at random, and then show them to you.

To compute all possible ways that we can choose 2 cards when the order matters, we simply type the following piece of code.

hands <- permutations(52, 2, v=deck)

Here we use permutations: there are 52 cards, we choose 2, and we select them from the vector of card names we called deck earlier. The result is a matrix with 2 columns and, in this case, 2,652 rows; those are all the permutations.
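We can confirm the dimensions directly:

dim(hands)
## [1] 2652    2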

Now, we’re going to define the first card and the second card by grabbing the first and second columns using this simple piece of code.

first_card <- hands[,1]
second_card <- hands[,2]
sum(first_card %in% kings)
## [1] 204

And now, to find the conditional probability, we ask what fraction of these 204 hands also have a king as the second card.

# What fraction of these 204 also have a king as the second card
sum(first_card %in% kings & second_card %in% kings) / 
  sum(first_card %in% kings)
## [1] 0.05882

We add up all the cases that have a king as the first card and a king as the second card, and divide by the number of cases with a king as the first card. We get 0.05882, which is exactly 3 out of 51.

This is the R version of the multiplication rule, which tells us that the probability of B given A equals the probability of A and B divided by the probability of A.

\[\Pr(B \mid A)=\frac{\Pr(A\ and\ B)}{\Pr(A)}\]

Now, what if the order does not matter? For example, in blackjack, if your first two cards are an ace and a face card or a 10, it’s called a natural 21, and you win automatically. If we want to compute the probability of this happening, we want to enumerate the combinations, not the permutations, since the order doesn’t matter: an ace and a king is the same hand as a king and an ace, and we don’t want to count it twice. So notice the difference between the permutations function, which lists all permutations,

permutations(3, 2)
##      [,1] [,2]
## [1,]    1    2
## [2,]    1    3
## [3,]    2    1
## [4,]    2    3
## [5,]    3    1
## [6,]    3    2

and the combinations function,

combinations(3, 2)
##      [,1] [,2]
## [1,]    1    2
## [2,]    1    3
## [3,]    2    3

where order does not matter.
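Since each unordered pair corresponds to exactly two ordered pairs, combinations returns half as many rows as permutations; a quick check:

nrow(permutations(52, 2))
## [1] 2652
nrow(combinations(52, 2))
## [1] 1326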

To compute the probability of a natural 21 in blackjack, we define a vector that includes all the aces and a vector that includes all the cards worth 10, generate all the combinations of picking 2 cards out of 52, and then simply count.

aces <- paste("Ace", suits)

facecard <- c("King", "Queen", "Jack", "Ten")
facecard <- expand.grid(number=facecard, suit=suits)
facecard <- paste( facecard$number, facecard$suit)

hands <- combinations(52, 2, v = deck)

How often do we get an ace and a face card or a 10? (Here we rely on the fact that combinations sorts the card names, so the ace appears in the first column of any pair that contains one.)

mean(hands[,1] %in% aces & hands[,2] %in% facecard)
## [1] 0.04827

We can also use a Monte Carlo simulation to estimate this probability. In this case, we draw two cards over and over and keep track of how many natural 21s we get. We can use the sample function to draw two cards without replacement, like this.

# Here is one hand
(hand <- sample(deck, 2))
## [1] "Three Hearts" "Ten Hearts"

Here’s one hand; we didn’t get a 21 there. We then check whether one card is an ace and the other a face card or a 10. We simply repeat this over and over, and we get a very good approximation:

B <- 10000
results <- replicate(B, {
  hand <- sample(deck, 2)
  (hand[1] %in% aces & hand[2] %in% facecard) |
    (hand[2] %in% aces & hand[1] %in% facecard)
})

mean(results)
## [1] 0.0468

in this case, 0.0468.

The Birthday Problem

Suppose you’re in a classroom with 50 people. If we assume this is a randomly selected group, what is the chance that at least two people have the same birthday? Although it is somewhat advanced, we can deduce this mathematically, and we will do so later; for now we use a Monte Carlo simulation. For simplicity, we assume that nobody was born on February 29th, which doesn’t change the answer much. First, note that birthdays can be represented as numbers between 1 and 365.

So a sample of 50 random birthdays can be obtained simply using the sample function, like this.

n <- 50
bdays <- sample(1:365, n, replace = TRUE)

To check if, in this particular set of 50 people, we have at least two with the same birthday, we can use the function duplicated, which returns TRUE whenever an element of a vector has already appeared in that vector.

Here’s an example:

duplicated(c(1, 2, 3, 1, 4, 3, 5))
## [1] FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE

So, to check whether at least two birthdays are the same, we combine the any and duplicated functions, like this:

any(duplicated(bdays))
## [1] TRUE

We get TRUE: in this particular sample, at least two people share a birthday. Now, to estimate the probability, we repeat this experiment in a Monte Carlo simulation, running it 10,000 times.

We use the replicate function like this.

B <- 10000
results <- replicate(B, {
  bdays <- sample(1:365, n, replace = TRUE)
  any(duplicated(bdays))
})

mean(results)
## [1] 0.9733

And when we’re done, we find that the probability of at least two people having the same birthday in a group of 50 is about 97%.

sapply

Say you want to use what you’ve just learned about the birthday problem to bet with friends about whether two people in a group share a birthday. When are the chances larger than 50%? Larger than 75%? Let’s create a lookup table. We can quickly write a function that computes this for any group size. We’ll call it compute_prob; it performs the calculation of the probability of at least two people in a group of n sharing a birthday.

compute_prob <- function(n, B=10000){
  same_day <- replicate(B, {
    bdays <- sample(1:365, n, replace=TRUE)
    any(duplicated(bdays))
  })
  mean(same_day)
}

It uses a small Monte Carlo simulation to do it. Now we want to apply this function to several values of n, say from 1 to 60, so let’s define n as a sequence starting at 1 and ending at 60. We could use a for loop to apply the function to each value in n, but it turns out that for loops are rarely the preferred approach in R.

n <- seq(1, 60)

In general, we try to perform operations on entire vectors. Arithmetic operations, for example, operate on vectors element-wise; we saw this when we learned about R. If we define x as the vector 1 through 10 and compute the square root of x, it computes the square root of each element.

x <- 1:10
sqrt(x)
##  [1] 1.000 1.414 1.732 2.000 2.236 2.449 2.646 2.828 3.000 3.162

Similarly, if we define y to be 1 through 10 and multiply x by y, it multiplies each pair of elements: 1 times 1, 2 times 2, et cetera.

y <- 1:10
x * y
##  [1]   1   4   9  16  25  36  49  64  81 100

So there’s really no need for for loops. But not all functions work this way; you can’t just send a vector to any function in R. For example, the function we just wrote does not work element-wise, since it expects a scalar: a single n.

This piece of code does not do what we want: if we call compute_prob with the whole vector n, we get just one number.

compute_prob(n)
## [1] 0

What we can do instead is use the function sapply, which permits us to perform element-wise operations with any function. Here’s how it works on a simple example with the vector 1 through 10: to apply the square root to each element of x, we type sapply(x, sqrt).

x <- 1:10
sapply(x, sqrt)
##  [1] 1.000 1.414 1.732 2.000 2.236 2.449 2.646 2.828 3.000 3.162

We don’t need sapply for square roots, because sqrt is already vectorized, but it makes a simple example. For our case, we can simply type:

prob <- sapply(n, compute_prob)

This assigns to each element of prob the probability of at least two people sharing a birthday for that n. And now we can very quickly plot the probability against the size of the group.

ggplot(data = data_frame(n, prob), aes(n, prob)) +
  geom_point(shape=21) +
  xlab("n = size of group ")

Now, let’s compute the exact probabilities rather than use Monte Carlo simulations. The function we just defined uses a Monte Carlo simulation, but we can use what we’ve learned about probability theory to compute the exact value. Not only do we get the exact answer using math, but the computations are much faster, since we don’t have to generate experiments; we simply get a number. To make the math simpler for this particular problem, instead of computing the probability of it happening, we’ll compute the probability of it not happening, and then use the multiplication rule. Let’s start with the first person. The probability that person 1 has a unique birthday is 1, of course.

\[\Pr(\text{person 1 has a unique birthday})=1\] Now let’s move on to the second one. The probability that the second person has a unique birthday given that person 1 already took one other day is 364 divided by 365.

\[\Pr(\text{person 2 has a unique birthday}\mid\text{person 1 has a unique birthday})=364/365\]

Then, for person 3, given that the first two people already have unique birthdays, 363 days remain, so the probability is 363 divided by 365.

\[\Pr(\text{person 3 has a unique birthday} \mid \text{persons 1 and 2 have unique birthdays})=363/365\]

If we continue this way and find the chances of all, say, 50 people having unique birthdays, we would multiply 1, times 364 divided by 365, times 363 divided by 365, and so on, down to the 50th factor.

Here’s the equation: \[1 \times\frac{364}{365}\times\frac{363}{365}\cdots\frac{365-n+1}{365}\]

Now, we can easily write a function that does this. This time we’ll call it exact_prob. It takes n as an argument and computes this probability with a few lines of code. Then we compute the probability for each n using sapply again, like this.

exact_prob <- function(n){
  prob_unique <- seq(365, 365 - n + 1) / 365
  1 - prod(prob_unique)
}
eprob <- sapply(n, exact_prob)

And if we plot it, we can see that the Monte Carlo simulations were almost exactly right.

ggplot() +
  geom_point(data = data_frame(n, prob), aes(n, prob), shape=21) +
  geom_line(data = data_frame(n, eprob), aes(n, eprob), color="red") +
  xlab("n = size of group ") + 
  ylab("prob and eprob")

They were almost exact approximations of the actual values. Note that, had it not been possible to compute the exact probabilities (something that sometimes happens), we would still have been able to accurately estimate them using Monte Carlo.

How many Monte Carlo experiments are enough?

In practice, we won’t know what the answer is, so we don’t know if our Monte Carlo estimate is accurate. We know that the larger the number of experiments (we’ve been using the letter B to represent it), the better the approximation. But how big does it need to be? This is actually a challenging question, and answering it often requires advanced theoretical statistics training. One practical approach we describe here is to check for the stability of the estimate.

Here’s an example with the birthday problem. We set n <- 22: there are 22 people. We estimate the probability of two people having the same birthday using Monte Carlo simulations of different sizes, with the value of B ranging from 10 up to 100,000, equally spaced on the log scale. We run the simulations and then look at the value we get for each one.

B <- 10^seq(1, 5, len=100)

compute_prob <- function(B, n=22){
  same_day <- replicate(B, {
    bdays <- sample(1:365, n, replace = TRUE)
    any(duplicated(bdays))
  })
  mean(same_day)
}

Remember, each simulation uses a different B, a different number of experiments. When we plot the estimates against B, we see the graph wiggling up and down.

prob <- sapply(B, compute_prob)
plot(log10(B), prob, type = "l")

That’s because the estimate is not yet stable. But as B gets bigger and bigger, the estimate eventually stabilizes, and that’s when we can start to trust that we have a large enough number of experiments.
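For reference, the exact_prob function from earlier tells us where the estimate should settle for n = 22:

exact_prob(22)
## [1] 0.4757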

The Addition Rule

The addition rule is useful here because there are two ways to get to 21: you can get a face card and then an ace, or an ace and then a face card. The addition rule tells us that the probability of \(A\) or \(B\) is the probability of \(A\), plus the probability of \(B\), minus the probability of \(A\) and \(B\).

\[\Pr(A\ \text{or}\ B)=\Pr(A)+\Pr(B)-\Pr(A \ \text{and}\ B)\]

To understand why this makes sense, think of a Venn diagram with one circle for A and one for B. If we compute the probability of A or B by adding the probability of the A circle to the probability of the B circle, we have counted the intersection twice, so we subtract it once.


So now let’s apply it to the natural 21. In this case, the intersection is empty: you can’t have an ace followed by a face card and, at the same time, a face card followed by an ace, so this is a very easy application of the addition rule. The probability of an ace followed by a card worth 10 is \(\frac{1}{13}\times\frac{16}{51}\), and the probability of a card worth 10 followed by an ace is \(\frac{16}{52}\times\frac{4}{51}\). These two values are actually the same, which makes sense by symmetry. Either way, we get the result we got before for the natural 21, about 0.05.
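In R, the addition-rule arithmetic reproduces the result we obtained with combinations:

# Pr(ace then card worth 10) + Pr(card worth 10 then ace); the intersection is empty
(1 / 13) * (16 / 51) + (16 / 52) * (4 / 51)
## [1] 0.04827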

The Monty Hall Problem

We’re going to look at one last example from the discrete probability. It’s the Monty Hall problem. In the 1970s, there was a game show called “Let’s Make a Deal”. Monty Hall was the host– this is where the name of the problem comes from. At some point in the game, contestants were asked to pick one of three doors. Behind one door, there was a prize. The other two had a goat behind them. And this basically meant that you lost.

If the contestant did not pick the prize door on his or her first try, Monty Hall would open one of the two remaining doors and show the contestant that there was no prize behind it. So you’re left with two doors: the one you picked, and one you know nothing about. Monty Hall would then ask: do you want to switch doors? What would you do? We can use probability to show that if you stick with the original door, your chance of winning the car (or whatever the big prize is) is 1 in 3, but if you switch, your chance doubles to 2 in 3. This seems counterintuitive. Many people incorrectly think both chances are 1 in 2, since only two doors are left and the prize is behind one of them. Detailed mathematical explanations of why this intuition is wrong are available elsewhere; here we use a Monte Carlo simulation to show that switching is better, which will also help you understand how it all works. Note that the code we show is longer than it needs to be; we use it as an illustration to help you understand the logic of what actually happens. Let’s start by creating a simulation that imitates the strategy of sticking to the same door.

Let’s go through the code. We repeat the experiment 10,000 times. First, we assign the prize to a door at random; prize_door contains the number of the door with the prize behind it. Then we imitate your pick by taking a random sample of the three doors. Next, we decide which door you are shown: not your door, and not the door with the prize, but the other one. With the stick strategy you stay with your original pick, so nothing that happened above matters. We do this over and over, and at the end we ask: how often is the door you stuck with the prize door? We know the answer should be 1 in 3, because nothing in the procedure changed your original one-in-three pick. Running the simulation 10,000 times, the estimated probability of winning is 0.3317, about 1 in 3.

# stick strategy code
B <- 10000
stick <- replicate(B, {
  doors <- as.character(1:3)
  prize <- sample(c('car', 'goat', 'goat'))
  prize_door <- doors[prize == 'car']
  my_pick <- sample(doors, 1)
  show <- sample(doors[!doors %in% c(my_pick, prize_door)], 1)
  stick <- my_pick
  stick == prize_door
})
mean(stick)
## [1] 0.3317

To see the pieces individually, here are the doors and one random prize assignment:

(doors <- as.character(1:3))
## [1] "1" "2" "3"
(prize <- sample(c('car', 'goat', 'goat')))
## [1] "goat" "car"  "goat"

Now consider the switch strategy: you always move to the remaining unopened door. The code is the same, except that at the end we compare the door you switched to with the prize door.

# switch strategy code
switch <- replicate(B, {
  doors <- as.character(1:3)
  prize <- sample(c("car","goat","goat"))
  prize_door <- doors[prize == "car"]
  my_pick  <- sample(doors, 1)
  show <- sample(doors[!doors %in% c(my_pick, prize_door)], 1)
  stick <- my_pick
  switch <- doors[!doors%in%c(my_pick, show)]
  switch == prize_door
})
mean(switch)
## [1] 0.6695

The switch strategy wins about two-thirds of the time, confirming that switching doubles your chances from 1 in 3 to 2 in 3.

Random Variables

Random variables are numeric outcomes resulting from random processes. We can generate one using the urn from earlier: define the outcome of a draw to be 1 if the bead is blue and 0 otherwise. Each run produces new realizations:

replicate(5, {
    ifelse(sample(beads, 1)=='blue', 1, 0)
})
## [1] 1 0 1 1 1

Next, let’s build a sampling model for a casino game: roulette. Let’s start by constructing the urn. A roulette wheel has 18 red pockets, 18 black pockets, and 2 green ones, so betting on a color in one game of roulette is equivalent to drawing from this urn.

Let’s write some code. There are 18 red, 18 black, and 2 green.

color <- rep(c('Red', 'Black', 'Green'), c(18, 18, 2))

The 1,000 outcomes from 1,000 people playing are independent draws from this urn. If red comes up, the gambler wins, and the casino loses 1 dollar, so we draw a negative 1. Otherwise, the casino wins 1 dollar, and we draw a 1. We can code 1,000 independent draws using the following code.

n <- 1000
X <- sample(ifelse(color == 'Red', -1, 1), n, replace = TRUE)

Here are the first 10 outcomes of these 1,000 draws.

X[1:10]
##  [1]  1 -1  1  1 -1  1 -1  1 -1 -1

Their sum gives the casino’s total winnings for these 1,000 games:

sum(X)
## [1] 48

Because we know the proportions of 1s and -1s inside the urn, we can generate the draws with one line of code, without defining color. Here’s that line of code. We call this approach a sampling model, since we are modeling the random behavior of roulette with the sampling of draws from an urn.

# sampling model
X <- sample(c(-1, 1), n, replace = TRUE, prob = c(9/19, 10/19))

The total winnings, capital \(S\), is simply the sum of these 1,000 independent draws. Here’s code that generates an example of \(S\).

(S <- sum(X))
## [1] 30

If you run that code over and over again, you see that S changes every time. This is, of course, because S is a random variable.

replicate(5, {
    X <- sample(c(-1, 1), n, replace = TRUE, prob = c(9/19, 10/19))
    sum(X)
})
## [1] 44 78 18 40 92

A very important and useful concept is the probability distribution of a random variable, which tells us the probability of the observed value falling in any given interval. So, for example, if we want to know the probability that we lose money, we are asking: what is the probability that \(S\) is in the interval \(S < 0\)?

Note that if we can define a cumulative distribution function \(F(a)=\Pr(S\leq a)\), then we’ll be able to answer any question related to the probability of events defined by the random variable \(S\), including the event \(S\leq 0\). We call \(F\) the random variable’s distribution function. We can estimate the distribution function for \(S\) by using a Monte Carlo simulation to generate many, many realizations of the random variable. With the code we’re about to write, we run the experiment of having 1,000 people play roulette over and over; specifically, we repeat it 10,000 times. Here’s the code.

n <- 1000
B <- 10000
S <- replicate(B, {
    X <- sample(c(-1, 1), n, replace = TRUE, prob = c(9/19, 10/19))
    sum(X)
})

So now we can ask: in our simulation, how often did we get sums smaller than or equal to \(a\)? That proportion gives a very good approximation of \(F(a)\), our distribution function.
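Computing this proportion takes one line of R; here is a minimal sketch (F_hat is a name we introduce for illustration):

# estimate F(a) = Pr(S <= a) by the proportion of simulated sums at or below a
F_hat <- function(a) mean(S <= a)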

In fact, we can visualize the distribution by creating a histogram showing the probability \(F(b)-F(a)\) for several intervals \((a, b]\). Here it is.

ggplot(as.data.frame(S), aes(S)) +
  geom_histogram(aes(y = ..density..), color='black', binwidth = 10)

Now we can easily answer the casino’s question: how likely is it that we lose money? We simply ask how often, out of the 10,000 simulations, \(S\) was at most 0. The answer: only about 4.9% of the time.

mean(S <= 0)
## [1] 0.0486

So it’s quite low. From the histogram, we also see that the distribution appears to be approximately normal. If you make a Q-Q plot, you’ll confirm that the normal approximation is close to perfect. If, in fact, the distribution is normal, then all we need to describe the distribution is its average and standard deviation.

ggplot(as.data.frame(S), aes(sample=scale(S) )) +
  geom_qq(geom = 'point', position='identity') +
  geom_abline()

mean(S)
## [1] 52.99
sd(S)
## [1] 31.37

Because we have the original values from which the distribution is created, we can easily compute these: the average is about 53, and the standard deviation is about 31.4. If we add a normal density with this average and standard deviation to the histogram we created earlier, we see that it matches very well.

This average and this standard deviation have special names. They are referred to as the expected value and the standard error of the random variable S.

We will say more about this in the next section. It actually turns out that statistical theory provides a way to derive the distribution of a random variable defined as independent draws from an urn. Specifically, in our example, we can show that (S + n) / 2 follows what is known as a binomial distribution. We therefore do not need to run Monte Carlo simulations, nor use the normal approximation, to know the probability distribution of S.

We ran the simulation purely for illustrative purposes. For the details of the binomial distribution, you can consult any probability book, or even Wikipedia. Later, however, we will discuss an incredibly useful approximation provided by mathematical theory that applies generally to sums and averages of draws from any urn: the central limit theorem.
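As a sketch of that exact computation, using the sampling model above with n = 1000: the number of games the casino wins, \((S+n)/2\), is binomial with success probability 10/19, so the chance the casino loses money is a single pbinom call.

# Pr(S <= 0) = Pr((S + n) / 2 <= n / 2), with (S + n) / 2 ~ Binomial(n, 10/19)
pbinom(n / 2, size = n, prob = 10 / 19)

This agrees closely with the Monte Carlo estimate above.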

ggplot(data.frame(S=S), aes(S)) +
  geom_histogram(aes(y=..density..), binwidth = 10, color = 'black') +
  # overlay the normal density with the estimated average and SD
  stat_function(fun = dnorm, args = list(mean = mean(S), sd = sd(S)),
                color = 'blue') +
  ylab("Probability")

Note that in statistical textbooks, capital letters are used to denote random variables, and we follow this convention here. Lowercase letters are used for observed values. You’ll see some notation that includes both. For example, you’ll see events defined as capital \(X\) less than or equal to lowercase \(x\).

\[X\leq x\] Here, \(X\) is a random variable, making \(X\leq x\) a random event, and \(x\) is an arbitrary value, not random. So, for example, capital \(X\) might represent the number on a die roll (a random value), and lowercase \(x\) an actual value we observe.

Exercise 1. American Roulette probabilities

An American roulette wheel has 18 red, 18 black, and 2 green pockets. Each red and black pocket is associated with a number from 1 to 36. The two remaining green slots feature “0” and “00”. Players place bets on which pocket they think a ball will land in after the wheel is spun. Players can bet on a specific number (0, 00, 1-36) or color (red, black, or green).

What are the chances that the ball lands in a green pocket?

# The variables `green`, `black`, and `red` contain the number of pockets for each color
green <- 2
black <- 18
red <- 18

# Assign a variable `p_green` as the probability of the ball landing in a green pocket
p_green <- green / (green + black + red)

# Print the variable `p_green` to the console
p_green
## [1] 0.05263

Exercise 2. American Roulette payout

In American roulette, the payout for winning on green is $17. This means that if you bet $1 and it lands on green, you get $17 as a prize.

Create a model to predict your winnings from betting on green.

# Use the `set.seed` function to make sure your answer matches the expected result after random sampling.
set.seed(1)

# The variables 'green', 'black', and 'red' contain the number of pockets for each color
green <- 2
black <- 18
red <- 18

# Assign a variable `p_green` as the probability of the ball landing in a green pocket
p_green <- green / (green+black+red)

# Assign a variable `p_not_green` as the probability of the ball not landing in a green pocket
p_not_green <- 1 - p_green

# Create a model to predict the random variable `X`, your winnings from
# betting on green: $17 if the ball lands on green, -$1 otherwise
X <- sample(c(17, -1), 1, prob = c(p_green, p_not_green))

# Print the value of `X` to the console
X
## [1] -1

Exercise 3. American Roulette expected value

In American roulette, the payout for winning on green is $17. This means that if you bet $1 and it lands on green, you get $17 as a prize. In the previous exercise, you created a model to predict your winnings from betting on green.

Now, compute the expected value of X, the random variable you generated previously.

# The variables 'green', 'black', and 'red' contain the number of pockets for each color
green <- 2
black <- 18
red <- 18

# Assign a variable `p_green` as the probability of the ball landing in a green pocket
p_green <- green / (green+black+red)

# Assign a variable `p_not_green` as the probability of the ball not landing in a green pocket
p_not_green <- 1-p_green

# Calculate the expected outcome if you win $17 if the ball lands on green and you lose $1 if the ball doesn't land on green
p_green * 17 + -1 * p_not_green
## [1] -0.05263

Exercise 4. American Roulette standard error

The standard error of a random variable X tells us the typical size of the difference between the random variable and its expected value. You calculated a random variable X in exercise 2 and the expected value of that random variable in exercise 3. Now compute the standard error of that random variable, which represents a single outcome after one spin of the roulette wheel.

# The variables 'green', 'black', and 'red' contain the number of pockets for each color
green <- 2
black <- 18
red <- 18

# Assign a variable `p_green` as the probability of the ball landing in a green pocket
p_green <- green / (green+black+red)

# Assign a variable `p_not_green` as the probability of the ball not landing in a green pocket
p_not_green <- 1-p_green

As an aside, the probability of exactly \(k\) successes in \(n\) independent attempts, each with success probability \(f\), is given by the binomial formula:

\[\Pr(\text{exactly } k \text{ successes in } n \text{ attempts})=\binom{n}{k} f^k(1 - f)^{n-k}\]

The expected value \(E(X)\) of a random variable is the population mean: the probability-weighted sum of its possible values. The variance of a random variable \(X\) is the expected value of the squared deviation from its mean \(\mu=E[X]\):

\[Var(X)=E[(X - \mu)^2]\]
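For a random variable that takes the value \(b\) with probability \(p\) and the value \(a\) with probability \(1-p\), these definitions reduce to a simple formula for the standard error, the square root of the variance:

\[\mbox{SE}(X)=|b-a|\sqrt{p(1-p)}\]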

# Compute the standard error of the random variable
(avg <- (p_green * 17) + p_not_green * -1)
## [1] -0.05263
sqrt(p_green * (17 - avg)^2 + p_not_green * (-1 - avg)^2)
## [1] 4.019
# equivalently, using |b - a| * sqrt(p * (1 - p))
abs(17 - -1) * sqrt(p_green * (1 - p_green))
## [1] 4.019

Exercise 5. American Roulette sum of winnings

You modeled the outcome of a single spin of the roulette wheel, X, in exercise 2.

Now create a random variable S that sums your winnings after betting on green 1,000 times.

# The variables 'green', 'black', and 'red' contain the number of pockets for each color
green <- 2
black <- 18
red <- 18

# Assign a variable `p_green` as the probability of the ball landing in a green pocket
p_green <- green / (green+black+red)

# Assign a variable `p_not_green` as the probability of the ball not landing in a green pocket
p_not_green <- 1-p_green

# Use the `set.seed` function to make sure your answer matches the expected result after random sampling
set.seed(1)

# Define the number of bets using the variable 'n'
n <- 1000

# Create a vector called 'X' that contains the outcomes of 1000 samples
X <- sample(c(17, -1), size = n, replace=TRUE, prob = c(p_green, p_not_green))

# Assign the sum of all 1000 outcomes to the variable 'S'
S <- sum(X)

# Print the value of 'S' to the console
S
## [1] -10

Exercise 6. American Roulette winnings expected value

In the previous exercise, you generated X, a vector of random outcomes, and S, the sum of your winnings after betting on green 1,000 times.

What is the expected value of S?

# The variables 'green', 'black', and 'red' contain the number of pockets for each color
green <- 2
black <- 18
red <- 18

# Assign a variable `p_green` as the probability of the ball landing in a green pocket
p_green <- green / (green+black+red)

# Assign a variable `p_not_green` as the probability of the ball not landing in a green pocket
p_not_green <- 1-p_green

# Define the number of bets using the variable 'n'
n <- 1000

# Calculate the expected outcome of 1,000 spins if you win $17 when the ball lands on green and you lose $1 when the ball doesn't land on green
n * (p_green*17 + p_not_green * -1)
## [1] -52.63

Exercise 7. American Roulette winnings standard error

You computed the expected value of S, the sum of 1,000 bets that the ball lands in the green pocket, in the previous exercise.

What is the standard error of S?

# The variables 'green', 'black', and 'red' contain the number of pockets for each color
green <- 2
black <- 18
red <- 18

# Assign a variable `p_green` as the probability of the ball landing in a green pocket
p_green <- green / (green+black+red)

# Assign a variable `p_not_green` as the probability of the ball not landing in a green pocket
p_not_green <- 1-p_green

# Define the number of bets using the variable 'n'
n <- 1000

# Compute the standard error of the sum of 1,000 outcomes
sqrt(n) * abs(17 - -1) * sqrt(p_green * (1 - p_green))
## [1] 127.1
sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.1 LTS
## 
## Matrix products: default
## BLAS: /home/michael/anaconda3/lib/R/lib/libRblas.so
## LAPACK: /home/michael/anaconda3/lib/R/lib/libRlapack.so
## 
## locale:
## [1] en_CA.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] gtools_3.8.1         ggplot2_3.0.0        dslabs_0.3.3        
## [4] dplyr_0.7.6          RevoUtils_11.0.1     RevoUtilsMath_11.0.0
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.18     pillar_1.3.0     compiler_3.5.1   plyr_1.8.4      
##  [5] bindr_0.1.1      tools_3.5.1      digest_0.6.15    evaluate_0.11   
##  [9] tibble_1.4.2     gtable_0.2.0     pkgconfig_2.0.1  rlang_0.2.1     
## [13] yaml_2.2.0       blogdown_0.9.8   xfun_0.4.11      bindrcpp_0.2.2  
## [17] withr_2.1.2      stringr_1.3.1    knitr_1.20       rprojroot_1.3-2 
## [21] grid_3.5.1       tidyselect_0.2.4 glue_1.3.0       R6_2.2.2        
## [25] rmarkdown_1.10   bookdown_0.7     purrr_0.2.5      magrittr_1.5    
## [29] backports_1.1.2  scales_0.5.0     codetools_0.2-15 htmltools_0.3.6 
## [33] assertthat_0.2.0 colorspace_1.3-2 labeling_0.3     stringi_1.2.4   
## [37] lazyeval_0.2.1   munsell_0.5.0    crayon_1.3.4

References

R Core Team. 2018. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2018. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.