Study types and cautionary tales

Michael Taylor


Types of studies

Identify study type

A study is designed to evaluate whether people read text faster in Arial or Helvetica font. A group of volunteers who agreed to be a part of the study are randomly assigned to two groups: one where they read some text in Arial, and another where they read the same text in Helvetica. At the end, average reading speeds from the two groups are compared.

The study above is an experiment because the researchers, not the subjects, decided which font each subject would read, and they made that decision by random assignment.
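To make the random assignment concrete, here is a minimal sketch in base R. The group size, the seed, and the simulated reading speeds are placeholders introduced for illustration; none of them come from the study.

```r
set.seed(42)  # fixed seed so the assignment is reproducible

n_volunteers <- 20  # hypothetical number of volunteers
volunteers <- data.frame(id = seq_len(n_volunteers))

# Random assignment: shuffle a balanced vector of font labels
volunteers$font <- sample(rep(c("Arial", "Helvetica"), each = n_volunteers / 2))

# Placeholder reading speeds (words per minute); simulated, not real data
volunteers$speed <- rnorm(n_volunteers, mean = 250, sd = 30)

# Group sizes and the average reading speed in each group
group_sizes <- table(volunteers$font)
group_means <- tapply(volunteers$speed, volunteers$font, mean)
print(group_sizes)
print(group_means)
```

Because chance alone decides who reads which font, the two groups should be similar in everything except the font, which is what lets a difference in average speed be attributed to the font.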

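library(dplyr)  # provides %>%
library(knitr)  # provides kable()
options(knitr.table.format = "html")
## If you don't define the format here, you'll need to put `format = "html"` in every kable() call.
# Helper that renders a data frame as a table; any steps after kable() are cut off in the source.
dt <- function(x) {
  x %>%
    kable()
}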
## Observations: 1,704
## Variables: 6
## $ country   <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, ...
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia...
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...

Random sampling and random assignment

One of the early studies linking smoking and lung cancer compared patients who were already hospitalized with lung cancer to similar patients without lung cancer (hospitalized for other reasons), and recorded whether each patient smoked. Then, the proportions of smokers among patients with and without lung cancer were compared.

Random assignment is not employed because the conditions are not imposed on the patients by the people conducting the study; random sampling is not employed because the study records the patients who are already hospitalized, so it wouldn’t be appropriate to apply the findings back to the population as a whole.

Identify the scope of inference of study

Volunteers were recruited to participate in a study where they were asked to type 40 bits of trivia—for example, “an ostrich’s eye is bigger than its brain”—into a computer. A randomly selected half of these subjects were told the information would be saved in the computer; the other half were told the items they typed would be erased.

Then, the subjects were asked to remember these bits of trivia, and the number of bits of trivia each subject could correctly recall was recorded. It was found that the subjects were significantly more likely to remember information if they thought they would not be able to find it later.

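The results of the study cannot be generalized to all people, because the subjects were volunteers rather than a random sample. A causal link between believing information is stored and memory can still be inferred from these results, because the two conditions were randomly assigned.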
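The reasoning used in the three examples above can be summarized as a small lookup: random sampling decides whether results generalize to the population, and random assignment decides whether causal conclusions are supported. The helper below, `scope_of_inference()`, is a hypothetical illustration written for these notes, not a function from any package.

```r
# Hypothetical helper: maps the two design choices to the scope of inference
scope_of_inference <- function(random_sampling, random_assignment) {
  generalize <- if (random_sampling) {
    "generalizable to the population"
  } else {
    "not generalizable to the population"
  }
  cause <- if (random_assignment) {
    "causal conclusions supported"
  } else {
    "association only"
  }
  paste(generalize, cause, sep = "; ")
}

# The three studies described in this section
font_study    <- scope_of_inference(random_sampling = FALSE, random_assignment = TRUE)
smoking_study <- scope_of_inference(random_sampling = FALSE, random_assignment = FALSE)
trivia_study  <- scope_of_inference(random_sampling = FALSE, random_assignment = TRUE)

print(c(font = font_study, smoking = smoking_study, trivia = trivia_study))
```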

Simpson’s paradox

Independent and Dependent Variables

(“What Is a Response Variable? | Reference.Com” n.d.)
> The concept of response variables and explanatory variables is very similar to another variable pair you’re likely to encounter in statistics: independent and dependent variables. The independent variable is one that does not change based on other factors in the study. This variable correlates to the explanatory variable. The dependent variable, on the other hand, does change. Because change in the dependent variable is driven by the independent variable(s), it correlates to the response variable.

Labelling variables as explanatory and response does not guarantee that the relationship between the two is actually causal, even if an association is identified.

(“Mine Cetinkaya-Rundel | DataCamp” n.d.)
> Not considering an important variable when studying a relationship can result in what we call Simpson’s paradox. It illustrates the effect the omission of an explanatory variable can have on the measure of association between another explanatory variable and the response variable. So the inclusion of a third variable can change the apparent relationship between the other two variables.

count() groups the data by the given variables (in this case, admission status and gender) and counts the number of observations in each group. These counts are stored in a new variable called n.

spread() simply reorganizes the output across columns based on a key-value pair, where a pair contains a key that explains what the information describes and a value that contains the actual information. spread() takes the name of the dataset as its first argument, the name of the key column as its second argument, and the name of the value column as its third argument, all specified without quotation marks.
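As a minimal, self-contained sketch of count() and spread(), the example below assumes the applicant-level data can be rebuilt from R's built-in UCBAdmissions table, whose counts match the ones printed in these notes. The names `freq_df` and `applicants` are hypothetical and chosen only for this illustration.

```r
library(dplyr)
library(tidyr)

# Assumption: start from the built-in UCBAdmissions table (datasets package)
freq_df <- as.data.frame(UCBAdmissions)  # columns: Admit, Gender, Dept, Freq

# One row per applicant: repeat each row Freq times, then drop Freq
applicants <- freq_df[rep(seq_len(nrow(freq_df)), freq_df$Freq),
                      c("Admit", "Gender", "Dept")]

# count(): one row per Gender/Admit combination, with the count in column n
counts <- applicants %>%
  count(Gender, Admit)

# spread(): Admit is the key column, n is the value column
wide <- counts %>%
  spread(Admit, n)

print(counts)
print(wide)
```

In current versions of tidyr, spread() is superseded; `pivot_wider(names_from = Admit, values_from = n)` is the recommended equivalent.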

ucb_admits_df <-
ucb_admit <- ucb_admits_df[rep(
  row.names(ucb_admits_df), ucb_admits_df$Freq),
## Observations: 4,526
## Variables: 3
## $ Admit  <fct> Admitted, Admitted, Admitted, Admitted, Admitted, Admit...
## $ Gender <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male, M...
## $ Dept   <fct> A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A...
# Count number of male and female applicants admitted
ucb_counts <- ucb_admit %>% 
   count(Gender, Admit)
ucb_counts %>% dt

| Gender | Admit | n |
|---|---|---|
| Male | Admitted | 1198 |
| Male | Rejected | 1493 |
| Female | Admitted | 557 |
| Female | Rejected | 1278 |

# Spread the output across columns
ucb_counts %>%
  spread(Admit, n) %>% dt

| Gender | Admitted | Rejected |
|---|---|---|
| Male | 1198 | 1493 |
| Female | 557 | 1278 |

Calculate the proportion of applicants of each gender who were admitted. To do so, create a new variable with mutate() from the dplyr package.

ucb_admit %>%
  count(Admit, Gender) %>%
  spread(Admit, n) %>%
  mutate(Perc_Admit = Admitted / (Admitted + Rejected)) %>%
  dt

| Gender | Admitted | Rejected | Perc_Admit |
|---|---|---|---|
| Male | 1198 | 1493 | 0.4451877 |
| Female | 557 | 1278 | 0.3035422 |

Proportion of males admitted for each department

Make a table similar to the one you constructed earlier, except you will first group the data by department. Then, you’ll use this table to calculate the proportion of applicants of each gender admitted in each department.

# Table of counts of admission status and gender for each department
admit_by_dept <- ucb_admit %>%
  count(Dept, Admit, Gender) %>%
  spread(Admit, n)

# View result
admit_by_dept %>% dt
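
| Dept | Gender | Admitted | Rejected |
|---|---|---|---|
| A | Male | 512 | 313 |
| A | Female | 89 | 19 |
| B | Male | 353 | 207 |
| B | Female | 17 | 8 |
| C | Male | 120 | 205 |
| C | Female | 202 | 391 |
| D | Male | 138 | 279 |
| D | Female | 131 | 244 |
| E | Male | 53 | 138 |
| E | Female | 94 | 299 |
| F | Male | 22 | 351 |
| F | Female | 24 | 317 |
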
# Proportion of applicants admitted, by department and gender
admit_by_dept %>%
  mutate(Perc_Admit = Admitted / (Admitted + Rejected)) %>% dt
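
| Dept | Gender | Admitted | Rejected | Perc_Admit |
|---|---|---|---|---|
| A | Male | 512 | 313 | 0.6206061 |
| A | Female | 89 | 19 | 0.8240741 |
| B | Male | 353 | 207 | 0.6303571 |
| B | Female | 17 | 8 | 0.6800000 |
| C | Male | 120 | 205 | 0.3692308 |
| C | Female | 202 | 391 | 0.3406408 |
| D | Male | 138 | 279 | 0.3309353 |
| D | Female | 131 | 244 | 0.3493333 |
| E | Male | 53 | 138 | 0.2774869 |
| E | Female | 94 | 299 | 0.2391858 |
| F | Male | 22 | 351 | 0.0589812 |
| F | Female | 24 | 317 | 0.0703812 |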
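The overall and per-department proportions can be put side by side using base R alone. The sketch below assumes the data are R's built-in UCBAdmissions table, whose counts match the ones shown in these notes; the variable names are hypothetical.

```r
# Assumption: UCBAdmissions (datasets package) is the source of the counts.
# It is a 3-way table with dimensions Admit x Gender x Dept.

# Overall admission proportion by gender (departments collapsed)
overall_counts <- margin.table(UCBAdmissions, c(1, 2))
overall_rate <- prop.table(overall_counts, margin = 2)["Admitted", ]

# Admission proportion by gender within each department
by_dept <- prop.table(UCBAdmissions, margin = c(2, 3))["Admitted", , ]

# Departments where the female admission proportion exceeds the male one
female_higher <- by_dept["Female", ] > by_dept["Male", ]

print(round(overall_rate, 3))
print(round(by_dept, 3))
print(names(which(female_higher)))
```

Collapsed over departments, men are admitted at a higher rate (about 0.445 versus 0.304). Within departments, the female rate is the higher one in four of the six (A, B, D, and F), so including department as a third variable changes the apparent relationship between gender and admission.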


“Mine Cetinkaya-Rundel | DataCamp.” n.d. Accessed April 5, 2018.

“What Is a Response Variable? | Reference.Com.” n.d. Accessed April 5, 2018.