Probability and distributions

Michael Taylor


It is an empirical fact that most experiments and investigations are not perfectly reproducible. The degree of reproducibility may vary: some experiments in physics may yield data that are accurate to many decimal places, whereas data on biological systems are typically much less reliable.

Random sampling

Much of the earlier work on probability was about games and gambling. In R, you can simulate these situations with the sample function. If you want to pick 5 numbers at random from the set 1:40, you can write:

sample(1:40, 5)
## [1] 14 19  3  8 18

The default behavior of sample is sampling without replacement, meaning the sample will not contain the same number twice. To sample with replacement, add the argument replace = TRUE.
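As a small illustration of the difference, here is a sketch using anyDuplicated, which returns 0 when a vector has no repeated values. The seed value and the variable names no_repl and with_repl are arbitrary choices, not part of the original text.

```r
set.seed(1)  # arbitrary seed, only so the draws can be repeated

# Without replacement: five distinct numbers, so duplicates are impossible
no_repl <- sample(1:40, 5)
anyDuplicated(no_repl) == 0

# With replacement: the sample size may even exceed the population size
with_repl <- sample(1:40, 50, replace = TRUE)
anyDuplicated(with_repl) > 0  # 50 draws from 40 values must repeat somewhere
```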

Sampling with replacement is suitable for modelling coin tosses or throws of a die. So, to simulate 10 coin tosses we write:

coin <- c("H", "T")
sample(coin, 10, replace = TRUE)
##  [1] "H" "H" "H" "H" "H" "H" "T" "H" "T" "T"

Data can also be simulated with unequal outcome probabilities by using the prob argument to sample, as in:

outcome <- c("succ", "fail")
sample(outcome, 10, replace = TRUE, prob = c(0.9, 0.1))
##  [1] "succ" "succ" "succ" "succ" "succ" "succ" "succ" "succ" "fail" "succ"
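One way to check that the prob argument behaves as described is to draw a much larger sample and look at the observed proportions. This is a sketch only: the seed, the sample size of 10000, and the names draws and prop are assumptions made for illustration.

```r
set.seed(42)  # arbitrary seed for repeatability
outcome <- c("succ", "fail")
draws <- sample(outcome, 10000, replace = TRUE, prob = c(0.9, 0.1))

# Observed proportion of each outcome; these should be close to 0.9 and 0.1
prop <- table(draws) / length(draws)
prop
```

With only 10 draws the proportions can be far from 0.9 and 0.1; they settle down as the number of draws grows.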

Probability calculations and combinatorics

Set theory terminology

Probability often uses notation and terminology from a branch of mathematics called set theory. A set is a collection of distinct objects. Sets are denoted by capital letters, for example A or B, and the objects that make up a set are listed in curly brackets. For example, A = {♀, ♂} is the set of gender outcomes from sexual reproduction.

In the idiom of set theory, the process of obtaining an observation is an experiment, an iteration of an experiment is a trial, and the observed result is an outcome.


The sample space, or universal set, is denoted \(S\) and is the collection of all possible outcomes of an experiment.

Element: We write \(x\in S\) to mean that \(x\) is in the set \(S\).
Subset: We say set \(A\) is a subset of \(S\) if all its elements are in \(S\). We write \(A\subset S\).
Complement: The complement of \(A\) in \(S\) is the set of elements of \(S\) that are not in \(A\). We write \(A^\complement\) or \(S-A\).
Union: The union of \(A\) and \(B\) is the set of all elements in \(A\) or \(B\) (or both). We write \(A\cup B\).
Intersection: The intersection of \(A\) and \(B\) is the set of all elements in both \(A\) and \(B\). We write \(A\cap B\).
Empty set: The empty set has no elements. We write \(\emptyset\).
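These operations have direct counterparts among base R's set functions, which treat vectors as sets. The sketch below uses small illustrative sets; the values chosen for S, A, and B are assumptions for the example and do not come from the text.

```r
S <- 1:10  # sample space (illustrative)
A <- 1:4
B <- 3:6

3 %in% S                 # element: is 3 in S?
all(A %in% S)            # subset: are all elements of A in S?
A_comp <- setdiff(S, A)  # complement of A in S
A_or_B <- union(A, B)    # union of A and B
A_and_B <- intersect(A, B)  # intersection of A and B
disjoint <- intersect(A, 7:10)  # no common elements, so the result is empty
length(disjoint) == 0
```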


Let's look at a case of sampling without replacement, specifically sample(1:40, 5).

The rule of product says:

If there are \(n\) ways to perform action 1 and \(m\) ways to perform action 2, then there are \(n\times m\) ways to perform action 1 followed by action 2.

For example, if you have 3 shirts and 4 pairs of pants, then you can make \(3\times4=12\) outfits.
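The outfit count can be verified by listing every pairing explicitly. This sketch uses expand.grid, which builds all combinations of its arguments; the shirt and pants labels are made up for illustration.

```r
shirts <- c("white", "blue", "grey")                 # 3 hypothetical shirts
pants  <- c("jeans", "chinos", "shorts", "slacks")   # 4 hypothetical pants

# One row per (shirt, pants) pairing
outfits <- expand.grid(shirt = shirts, pants = pants)
nrow(outfits) == length(shirts) * length(pants)
```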

The probability of obtaining a given number as the first one in the sample is \(1/40\), the probability of a given second number is \(1/39\), and so forth. By the rule of product, the probability of a given sample is therefore \(1/(40\times39\times38\times37\times36)\). The prod function, which calculates the product of the elements of a vector, can be used to compute this in R.
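A minimal sketch of that calculation follows; the variable name p_ordered is illustrative and not taken from the original.

```r
# Probability of drawing five given numbers from 1:40 in one specific order
p_ordered <- 1 / prod(40:36)
p_ordered
```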

## [1] 1.266449e-08

This is, however, the probability of getting the numbers in a given order. If this were a Lotto 6/49-like game, we would only be interested in guessing the set of five numbers correctly, regardless of order. Each ordering of these five numbers has the same probability of occurring, so we need to count the orderings: there are five possibilities for the first number, four for the second, and so forth; i.e., the number of orderings is \(5\times4\times3\times2\times1\). This can also be written as \(5!\). So, the probability of a given set of five numbers becomes:
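In R this can be sketched by multiplying the single-ordering probability by the number of orderings (the variable name p_set is illustrative):

```r
# 5! orderings of the five numbers, each with probability 1/(40*39*38*37*36)
p_set <- prod(5:1) / prod(40:36)
p_set
```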

## [1] 1.519738e-06

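This mirrors the mathematical formula for the number of ways to choose \(k\) objects from \(n\): \[_nC_k={n \choose k}=\frac{n!}{k!(n-k)!}\] The probability above is the reciprocal of this number with \(n=40\) and \(k=5\).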

The notation \(_nC_k\) is read as “n choose k”.
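The formula can be evaluated term by term with the factorial function. This is a sketch with \(n=40\) and \(k=5\) taken from the lottery example; the names n, k, and n_sets are illustrative. Because factorial(40) is a very large floating-point number, the result should be read as approximate rather than as an exact integer.

```r
n <- 40
k <- 5

# Number of distinct sets of k numbers chosen from n: n! / (k! (n - k)!)
n_sets <- factorial(n) / (factorial(k) * factorial(n - k))
n_sets
```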

In R, the choose function can be used to calculate this number, and the probability is thus:
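A minimal sketch using choose, with the illustrative variable name p_choose:

```r
# Each of the choose(40, 5) possible sets is equally likely
p_choose <- 1 / choose(40, 5)
p_choose
```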

## [1] 1.519738e-06

Discrete distributions