Sampling strategies and experimental design

Michael Taylor

2018/04/06

Sampling strategies

Simple random sampling
In simple random sampling we select cases from the population, such that each case is equally likely to be selected.

Stratified sampling
In stratified sampling, we first divide the population into homogeneous groups, called strata, and then we randomly sample from within each stratum.

Cluster sampling
In cluster sampling, we divide the population into clusters, randomly sample a few clusters, and then sample all observations within these clusters. The clusters, unlike strata in stratified sampling, are heterogeneous within themselves, and each cluster is similar to the others, so we can get away with sampling from just a few of the clusters.
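
The procedure described above can be sketched in R as follows. This is a minimal illustration on a made-up population, not part of the original exercises: the `students` data frame, its `school` and `student` columns, and the choice of 4 clusters are all assumptions.

```r
suppressPackageStartupMessages(library(dplyr))

set.seed(1)  # fix the random draw so the sketch is reproducible

# Hypothetical population: 20 schools (clusters) with 30 students each
students <- data.frame(
  school  = rep(paste0("school_", 1:20), each = 30),
  student = seq_len(600)
)

# Randomly pick 4 whole clusters
sampled_schools <- sample(unique(students$school), size = 4)

# Keep every observation inside the selected clusters
cluster_sample <- students %>%
  filter(school %in% sampled_schools)

# Each selected school contributes all 30 of its students
cluster_sample %>%
  count(school)
```

Only the clusters are chosen at random; nothing is sampled inside them.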

Multistage sampling
Multistage sampling adds another stage to cluster sampling: we divide the population into clusters, randomly sample a few clusters, and then randomly sample observations from within those clusters.
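
A minimal sketch of the two stages, again on a made-up population. The `students` data frame, the 4 sampled schools, and the 10 students per school are illustrative assumptions.

```r
suppressPackageStartupMessages(library(dplyr))

set.seed(2)  # fix the random draw so the sketch is reproducible

# Hypothetical population: 20 schools (clusters) with 30 students each
students <- data.frame(
  school  = rep(paste0("school_", 1:20), each = 30),
  student = seq_len(600)
)

# Stage 1: randomly sample 4 clusters
sampled_schools <- sample(unique(students$school), size = 4)

# Stage 2: randomly sample 10 observations within each selected cluster
multistage_sample <- students %>%
  filter(school %in% sampled_schools) %>%
  group_by(school) %>%
  sample_n(size = 10) %>%
  ungroup()

multistage_sample %>%
  count(school)
```

The only difference from the cluster sampling procedure is the second random draw inside each selected cluster.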

Setup

library(openintro)
suppressPackageStartupMessages(library(dplyr))

library(knitr)
library(kableExtra)
options(knitr.table.format = "html") 
## If you don't define the format here, you'll need to put `format = "html"` in every kable function.
# Load county data
data(county)
head(levels(county$state), 10)
##  [1] "Alabama"              "Alaska"               "Arizona"             
##  [4] "Arkansas"             "California"           "Colorado"            
##  [7] "Connecticut"          "Delaware"             "District of Columbia"
## [10] "Florida"
# Remove DC
county_noDC <- county %>% 
  filter(state != "District of Columbia") %>% 
  droplevels()

Simple random sample

# Simple random sample of 150 counties
county_srs <- county_noDC %>% 
  sample_n(size = 150)
glimpse(county_srs)
## Observations: 150
## Variables: 10
## $ name          <fct> Johnson County, Loudon County, Clay County, Hami...
## $ state         <fct> Iowa, Tennessee, Illinois, Kansas, Kansas, Color...
## $ pop2000       <dbl> 111006, 39086, 14560, 2670, 53597, 12442, 38972,...
## $ pop2010       <dbl> 130882, 48556, 13815, 2690, 55606, 14843, 42391,...
## $ fed_spend     <dbl> 8.588232, 11.087878, 8.926384, 15.956877, 7.1041...
## $ poverty       <dbl> 18.2, 13.8, 16.3, 7.7, 12.6, 6.6, 20.9, 14.7, 14...
## $ homeownership <dbl> 60.3, 77.9, 76.4, 74.2, 67.8, 76.9, 72.4, 75.4, ...
## $ multiunit     <dbl> 35.8, 7.3, 10.2, 9.1, 17.4, 30.3, 6.6, 8.7, 10.1...
## $ income        <dbl> 28008, 27046, 20802, 20190, 23669, 30055, 18049,...
## $ med_income    <dbl> 51380, 49343, 38016, 36297, 45162, 60433, 36357,...

SRS state distribution

# State distribution of SRS counties
county_srs %>% 
  group_by(state) %>% 
  count()
## # A tibble: 40 x 2
## # Groups:   state [40]
##    state        n
##    <fct>    <int>
##  1 Alabama      3
##  2 Alaska       4
##  3 Arizona      1
##  4 Arkansas     4
##  5 Colorado     8
##  6 Florida      2
##  7 Georgia      8
##  8 Idaho        3
##  9 Illinois     7
## 10 Indiana      5
## # ... with 30 more rows
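
Because the sample is drawn at random, the counts above change every time the code is run. The sketch below shows one way to make a draw reproducible with `set.seed()`. It assumes dplyr 1.0.0 or later, where `slice_sample()` supersedes `sample_n()`; the `toy_counties` data frame and the seed value are made up for illustration.

```r
suppressPackageStartupMessages(library(dplyr))

# Hypothetical stand-in for a table of counties
toy_counties <- data.frame(id = seq_len(1000))

set.seed(2018)  # fix the state of the random number generator
draw_a <- toy_counties %>%
  slice_sample(n = 150)

set.seed(2018)  # same seed, so the same rows are drawn again
draw_b <- toy_counties %>%
  slice_sample(n = 150)

# TRUE when both draws selected the same rows in the same order
identical(draw_a$id, draw_b$id)
```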

Stratified sample

# Stratified sample of 150 counties: 3 from each of the 50 states
county_str <- county_noDC %>% 
  group_by(state) %>% 
  sample_n(size = 3)
glimpse(county_str)
## Observations: 150
## Variables: 10
## $ name          <fct> Lamar County, Blount County, Escambia County, Al...
## $ state         <fct> Alabama, Alabama, Alabama, Alaska, Alaska, Alask...
## $ pop2000       <dbl> 15904, 51024, 38440, 2697, NA, 808, 19715, 51335...
## $ pop2010       <dbl> 14564, 57322, 38319, 3141, 968, 662, 20489, 5359...
## $ fed_spend     <dbl> 9.965394, 5.130910, 7.805162, 4.450493, 0.000000...
## $ poverty       <dbl> 18.5, 13.4, 24.4, 10.4, 10.8, 4.3, 20.3, 18.9, 1...
## $ homeownership <dbl> 75.1, 82.0, 73.5, 59.2, 59.1, 61.1, 75.4, 78.3, ...
## $ multiunit     <dbl> 9.0, 3.7, 7.8, 11.8, 27.2, 12.4, 3.6, 4.8, 22.9,...
## $ income        <dbl> 19789, 21070, 16259, 22279, 35536, 28576, 21165,...
## $ med_income    <dbl> 33887, 45549, 31927, 54375, 73500, 65750, 32147,...

Simple random sample in R

glimpse(county)
## Observations: 3,143
## Variables: 10
## $ name          <fct> Autauga County, Baldwin County, Barbour County, ...
## $ state         <fct> Alabama, Alabama, Alabama, Alabama, Alabama, Ala...
## $ pop2000       <dbl> 43671, 140415, 29038, 20826, 51024, 11714, 21399...
## $ pop2010       <dbl> 54571, 182265, 27457, 22915, 57322, 10914, 20947...
## $ fed_spend     <dbl> 6.068095, 6.139862, 8.752158, 7.122016, 5.130910...
## $ poverty       <dbl> 10.6, 12.2, 25.0, 12.6, 13.4, 25.3, 25.0, 19.5, ...
## $ homeownership <dbl> 77.5, 76.7, 68.0, 82.9, 82.0, 76.9, 69.0, 70.7, ...
## $ multiunit     <dbl> 7.2, 22.6, 11.1, 6.6, 3.7, 9.9, 13.7, 14.3, 8.7,...
## $ income        <dbl> 24568, 26469, 15875, 19918, 21070, 20289, 16916,...
## $ med_income    <dbl> 53255, 50147, 33219, 41770, 45549, 31602, 30659,...
# Making the US regions data, since it is not directly available
data(state, package = "datasets")
us_regions <- as.data.frame(cbind(state.name,as.character(state.region)))
names(us_regions) <- c("state","region")

state_srs <- us_regions %>% sample_n(size=8)

state_srs %>% group_by(region) %>% count()
## # A tibble: 3 x 2
## # Groups:   region [3]
##   region            n
##   <fct>         <int>
## 1 North Central     3
## 2 Northeast         1
## 3 South             4

Simple random sample in R
A list of all states and the region they belong to (Northeast, North Central, South, West) is given in the us_regions data frame.

# Simple random sample: states_srs
states_srs <- us_regions %>%
  sample_n(size=8)

# Count states by region
states_srs %>%
  group_by(region) %>%
  count()
## # A tibble: 4 x 2
## # Groups:   region [4]
##   region            n
##   <fct>         <int>
## 1 North Central     3
## 2 Northeast         1
## 3 South             2
## 4 West              2

Stratified sample in R
In the last exercise, you took a simple random sample of eight states. However, as you may have noticed when you counted the number of states selected from each region, this strategy is unlikely to select an equal number of states from each region. Stratified sampling, with region as the stratum, guarantees an equal number of states from each region.

# Stratified sample
states_str <- us_regions %>%
  group_by(region) %>%
  sample_n(size=2)

# Count states by region
states_str %>%
  group_by(region) %>%
  count()
## # A tibble: 4 x 2
## # Groups:   region [4]
##   region            n
##   <fct>         <int>
## 1 North Central     2
## 2 Northeast         2
## 3 South             2
## 4 West              2
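
The claim that a simple random sample of eight states rarely yields two states per region can be checked directly. The sketch below uses only base R and the built-in `state.region` factor. It computes the exact probability by counting and then confirms it by simulation; the seed and the number of replications are arbitrary choices.

```r
# Number of states in each region (Northeast, South, North Central, West)
region_sizes <- as.vector(table(state.region))

# Exact probability: ways to pick 2 states from every region,
# divided by the ways to pick any 8 of the 50 states
p_equal <- prod(choose(region_sizes, 2)) / choose(50, 8)
p_equal  # about 0.041

# Simulation check: draw 8 states at random many times
set.seed(123)
balanced <- replicate(10000, {
  picked <- sample(state.region, size = 8)
  all(table(picked) == 2)
})
mean(balanced)
```

Under these assumptions, only about 1 simple random sample in 24 is balanced across regions, whereas the stratified sample is balanced by construction.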

Principles of experimental design

Confounding variable
A confounding variable is an outside influence that distorts the apparent relationship between an independent and a dependent variable (“Confounding Variable Examples” n.d.). Put simply, it is an extra variable that was not accounted for in the design. Confounding variables can ruin an experiment and produce useless results, because they can suggest a relationship between the variables under study that does not really exist. In an experiment, the independent variable generally has an effect on the dependent variable. For example, if you are researching whether a lack of exercise has an effect on weight gain, the lack of exercise is the independent variable and weight gain is the dependent variable. A confounding variable would be any other influence that also affects weight gain: the amount of food consumed, a placebo effect, or even the weather. Each may change the effect observed in the experiment.
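
A small simulation can make the idea concrete. In the sketch below, which uses made-up data and base R only, `food` drives both `inactivity` and `weight_gain`, while `inactivity` has no direct effect on `weight_gain` at all. The variable names and coefficients are assumptions chosen to echo the example above.

```r
set.seed(1)
n <- 10000

food        <- rnorm(n)             # hypothetical confounder
inactivity  <- food + rnorm(n)      # explanatory variable, correlated with food
weight_gain <- 2 * food + rnorm(n)  # response depends on food only

# Ignoring the confounder: inactivity appears to affect weight gain
naive_slope <- unname(coef(lm(weight_gain ~ inactivity))["inactivity"])

# Accounting for the confounder: the apparent effect largely disappears
adjusted_slope <- unname(coef(lm(weight_gain ~ inactivity + food))["inactivity"])

round(c(naive = naive_slope, adjusted = adjusted_slope), 2)
```

With this setup the naive slope is close to 1 and the adjusted slope is close to 0, which is the pattern a confounder produces.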

Experimental design terminology
Explanatory variables are conditions you can impose on the experimental units, while blocking variables are characteristics that the experimental units come with that you would like to control for.

Connect blocking and stratifying
In random sampling, you use stratifying to control for a variable. In random assignment, you use blocking to achieve the same goal.
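
As a rough illustration of blocking in random assignment, the sketch below splits hypothetical experimental units into treatment and control separately within each block. The `units` data frame, the `gender` blocking variable, and the `arm` column are made up for this example.

```r
suppressPackageStartupMessages(library(dplyr))

set.seed(3)  # fix the random assignment so the sketch is reproducible

# Hypothetical experimental units with a blocking variable
units <- data.frame(
  id     = 1:40,
  gender = rep(c("female", "male"), times = c(24, 16))
)

# Within each block, shuffle an even split of treatment and control labels
assigned <- units %>%
  group_by(gender) %>%
  mutate(arm = sample(rep(c("treatment", "control"), length.out = n()))) %>%
  ungroup()

# Each block is split evenly between the two arms
assigned %>%
  count(gender, arm)
```

The structure mirrors the stratified sample: `group_by()` defines the blocks, and the random step happens inside each one.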

library(downloader)
url <- "http://www.openintro.org/stat/data/evals.RData"
filename <- basename(url)
if (!file.exists(filename)) download(url,destfile=filename)
load(filename)
# Inspect evals
glimpse(evals)
## Observations: 463
## Variables: 21
## $ score         <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5...
## $ rank          <fct> tenure track, tenure track, tenure track, tenure...
## $ ethnicity     <fct> minority, minority, minority, minority, not mino...
## $ gender        <fct> female, female, female, female, male, male, male...
## $ language      <fct> english, english, english, english, english, eng...
## $ age           <int> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, ...
## $ cls_perc_eval <dbl> 55.81395, 68.80000, 60.80000, 62.60163, 85.00000...
## $ cls_did_eval  <int> 24, 86, 76, 77, 17, 35, 39, 55, 111, 40, 24, 24,...
## $ cls_students  <int> 43, 125, 125, 123, 20, 40, 44, 55, 195, 46, 27, ...
## $ cls_level     <fct> upper, upper, upper, upper, upper, upper, upper,...
## $ cls_profs     <fct> single, single, single, single, multiple, multip...
## $ cls_credits   <fct> multi credit, multi credit, multi credit, multi ...
## $ bty_f1lower   <int> 5, 5, 5, 5, 4, 4, 4, 5, 5, 2, 2, 2, 2, 2, 2, 2, ...
## $ bty_f1upper   <int> 7, 7, 7, 7, 4, 4, 4, 2, 2, 5, 5, 5, 5, 5, 5, 5, ...
## $ bty_f2upper   <int> 6, 6, 6, 6, 2, 2, 2, 5, 5, 4, 4, 4, 4, 4, 4, 4, ...
## $ bty_m1lower   <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, ...
## $ bty_m1upper   <int> 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ bty_m2upper   <int> 6, 6, 6, 6, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, ...
## $ bty_avg       <dbl> 5.000, 5.000, 5.000, 5.000, 3.000, 3.000, 3.000,...
## $ pic_outfit    <fct> not formal, not formal, not formal, not formal, ...
## $ pic_color     <fct> color, color, color, color, color, color, color,...

Recode a variable
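The cls_students variable in evals tells you the number of students in the class. Suppose instead of the exact number of students, you’re interested in whether the class is small (fewer than 18 students), midsize (18 to 59 students), or large (60 or more students), the cutoffs used in the code below.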

# Recode cls_students as cls_type: evals
evals <- evals %>%
  # Create new variable
  mutate(cls_type = ifelse(cls_students < 18, "small", 
                      ifelse(cls_students <= 59, "midsize", "large")))
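
The same recoding can also be written with `case_when()`, which some find easier to read than nested `ifelse()` calls. The sketch below applies the cutoffs from the code above to a made-up `toy` data frame, with class sizes chosen to sit on the boundaries.

```r
suppressPackageStartupMessages(library(dplyr))

# Hypothetical class sizes, including the boundary values 17, 18, 59, and 60
toy <- data.frame(cls_students = c(10, 17, 18, 59, 60, 200))

toy <- toy %>%
  mutate(cls_type = case_when(
    cls_students < 18  ~ "small",    # fewer than 18 students
    cls_students <= 59 ~ "midsize",  # 18 to 59 students
    TRUE               ~ "large"     # 60 or more students
  ))

toy
```

Conditions are checked in order, so a row takes the label of the first condition it meets.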

Create a scatterplot
The bty_avg variable shows the average beauty rating of the professor by the six students who were asked to rate the attractiveness of these faculty. The score variable shows the average professor evaluation score, with 1 being very unsatisfactory and 5 being excellent.

library(ggplot2)
# Scatterplot of score vs. bty_avg
ggplot(evals, aes(x=bty_avg, y=score)) + geom_point()

Create a scatterplot, with an added layer
Suppose you are interested in evaluating how the relationship between a professor’s attractiveness and their evaluation score varies across different class types (small, midsize, and large).

# Scatterplot of score vs. bty_avg colored by cls_type
ggplot(evals, aes(bty_avg, score, color=cls_type)) + geom_point()
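
One way to make differences between class types easier to see is to add a fitted line per group. The sketch below does this on simulated data, so it runs without the evals file. The `toy_evals` data frame and its values are made up, and the lines it shows say nothing about the real data.

```r
library(ggplot2)

set.seed(4)

# Hypothetical data with the same column names as evals
toy_evals <- data.frame(
  bty_avg  = runif(90, min = 2, max = 8),
  cls_type = rep(c("small", "midsize", "large"), each = 30)
)
toy_evals$score <- pmin(5, 3.5 + 0.1 * toy_evals$bty_avg + rnorm(90, sd = 0.4))

# Scatterplot colored by class type, with one least-squares line per type
p <- ggplot(toy_evals, aes(x = bty_avg, y = score, color = cls_type)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
p
```

Because `color = cls_type` is set in the main `aes()`, `geom_smooth()` inherits the grouping and fits a separate line for each class type.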

References

“Confounding Variable Examples.” n.d. Accessed April 9, 2018. http://www.softschools.com/examples/science/confounding_variable_examples/479/.