# Study types and cautionary tales

## 2018/04/05

### Types of studies

• Observational study
  • Researchers collect data in a way that does not directly interfere with how the data arise. In other words, they merely observe.
  • Only correlation can be inferred: we can establish an association between the explanatory and response variables, but not causation.
• Experiment
  • Researchers randomly assign subjects to various treatments.
  • Causation can be inferred: causal connections can be established between the explanatory and response variables.

### Identify study type

A study is designed to evaluate whether people read text faster in Arial or Helvetica font. A group of volunteers who agreed to be a part of the study are randomly assigned to two groups: one where they read some text in Arial, and another where they read the same text in Helvetica. At the end, average reading speeds from the two groups are compared.

The study above is an experiment because the researchers, not the subjects, decided which font each subject would read, and they made that decision by random assignment.

    suppressPackageStartupMessages(library(dplyr))
    library(gapminder)
    library(tidyr)
    data("gapminder")
    library(knitr)
    library(kableExtra)
    options(knitr.table.format = "html")
    ## If you don't define format here, you'll need put format = "html" in every kable function.
    dt <- function(x) {
      x %>%
        kable() %>%
        kable_styling()
    }
    glimpse(gapminder)
    ## Observations: 1,704
    ## Variables: 6
    ## $ country   <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, ...
    ## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia...
    ## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
    ## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
    ## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
    ## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...

### Random sampling and random assignment

• Random sampling
  • Random sampling occurs when subjects are being selected for a study. If subjects are randomly selected from the population, then the resulting sample is likely to be representative of that population.
  • The study's results can be generalized to that population.
• Random assignment
  • Random assignment occurs only in experimental settings, where subjects are being assigned to various treatments.
  • Random assignment allows for causal conclusions.
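
The two ideas can be sketched in base R with `sample()`. This is only an illustration: the population size, the sample size, and the treatment labels "A" and "B" are made up and do not come from any study described here.

```r
set.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population of 1,000 people, identified by ID
population <- 1:1000

# Random sampling: decides WHO takes part in the study
subjects <- sample(population, size = 20)

# Random assignment: decides WHICH treatment each participant receives,
# here by shuffling a balanced set of labels
treatment <- sample(rep(c("A", "B"), each = 10))

study <- data.frame(subject = subjects, treatment = treatment)
table(study$treatment)
```

Random sampling affects how far the results generalize, while random assignment affects whether a difference between the two groups can be read as causal.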

One of the early studies linking smoking and lung cancer compared patients who were already hospitalized with lung cancer to similar patients without lung cancer (hospitalized for other reasons), and recorded whether each patient smoked. Then, the proportions of smokers among patients with and without lung cancer were compared.

Random assignment is not employed because the conditions are not imposed on the patients by the people conducting the study; random sampling is not employed because the study records the patients who are already hospitalized, so it wouldn’t be appropriate to apply the findings back to the population as a whole.

### Identify the scope of inference of study

Volunteers were recruited to participate in a study where they were asked to type 40 bits of trivia—for example, “an ostrich’s eye is bigger than its brain”—into a computer. A randomly selected half of these subjects were told the information would be saved in the computer; the other half were told the items they typed would be erased.

Then, the subjects were asked to remember these bits of trivia, and the number of bits of trivia each subject could correctly recall was recorded. It was found that the subjects were significantly more likely to remember information if they thought they would not be able to find it later.

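The results of the study cannot be generalized to all people, because the subjects were volunteers rather than a random sample. However, because the subjects were randomly assigned to the two conditions, a causal link between believing information is stored and memory can be inferred from these results.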
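
The reasoning used in these examples can be written down as a small lookup. The function name `scope_of_inference` and its arguments are hypothetical, introduced here only to summarize the rule of thumb from these notes.

```r
# Hypothetical helper: map the two design choices to the conclusions
# a study supports, following the rule of thumb in these notes
scope_of_inference <- function(random_sampling, random_assignment) {
  c(
    generalizable = random_sampling,   # results extend to the population
    causal        = random_assignment  # causal conclusions are justified
  )
}

# Trivia study: volunteers (no random sampling), randomly assigned conditions
trivia <- scope_of_inference(random_sampling = FALSE, random_assignment = TRUE)

# Lung cancer study: hospitalized patients, no conditions imposed
lung <- scope_of_inference(random_sampling = FALSE, random_assignment = FALSE)

rbind(trivia, lung)
```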

#### Independent and Dependent Variables

(“What Is a Response Variable? | Reference.Com” n.d.)
> The concept of response variables and explanatory variables is very similar to another variable pair you’re likely to encounter in statistics: independent and dependent variables. The independent variable is one that does not change based on other factors in the study. This variable correlates to the explanatory variable. The dependent variable, on the other hand, does change. Because change in the dependent variable is driven by the independent variable(s), it correlates to the response variable.

Labelling variables as explanatory and response does not guarantee that the relationship between the two is actually causal, even if an association is identified.

(“Mine Cetinkaya-Rundel | DataCamp” n.d.)
> Not considering an important variable when studying a relationship can result in what we call Simpson's paradox. It illustrates the effect the omission of an explanatory variable can have on the measure of association between another explanatory variable and the response variable. So the inclusion of a third variable can change the apparent relationship between the other two variables.

count() allows you to group the data by certain variables (in this case, admission status and gender) and then counts the number of observations in each category. These counts are available under a new variable called n.

spread() simply reorganizes the output across columns based on a key-value pair, where a pair contains a key that explains what the information describes and a value that contains the actual information. spread() takes the name of the dataset as its first argument, the name of the key column as its second argument, and the name of the value column as its third argument, all specified without quotation marks.
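
As a minimal sketch of what these two functions do, the toy data frame below (made-up values, not the admissions data) is counted and then spread. It assumes the dplyr and tidyr packages are installed.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one row per applicant
toy <- data.frame(
  Gender = c("Male", "Male", "Male", "Female", "Female", "Female"),
  Admit  = c("Admitted", "Rejected", "Rejected",
             "Admitted", "Admitted", "Rejected")
)

# count(): one row per Gender/Admit combination, with the count in column n
toy_counts <- toy %>% count(Gender, Admit)

# spread(): Admit is the key column, n is the value column,
# so each admission status becomes its own column
toy_wide <- toy_counts %>% spread(Admit, n)
toy_wide
```

Recent tidyr documentation describes `spread()` as superseded; `pivot_wider(names_from = Admit, values_from = n)` should give the same layout here.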

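    # Expand the UCBAdmissions frequency table to one row per applicant
    ucb_admits_df <- as.data.frame(UCBAdmissions)
    ucb_admit <- ucb_admits_df[rep(row.names(ucb_admits_df), ucb_admits_df$Freq), 1:3]
    glimpse(ucb_admit)
    ## Observations: 4,526
    ## Variables: 3
    ## $ Admit  <fct> Admitted, Admitted, Admitted, Admitted, Admitted, Admit...
    ## $ Gender <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male, M...
    ## $ Dept   <fct> A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A...

    # Count number of male and female applicants admitted
    ucb_counts <- ucb_admit %>%
      count(Gender, Admit)
    ucb_counts %>% dt

| Gender | Admit    |    n |
|--------|----------|-----:|
| Male   | Admitted | 1198 |
| Male   | Rejected | 1493 |
| Female | Admitted |  557 |
| Female | Rejected | 1278 |

    # Spread the output across columns
    ucb_counts %>%
      spread(Admit, n) %>% dt

| Gender | Admitted | Rejected |
|--------|---------:|---------:|
| Male   |     1198 |     1493 |
| Female |      557 |     1278 |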

Calculate the proportion of applicants of each gender who were admitted. To do so, create a new variable with mutate() from the dplyr package.

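    ucb_admit %>%
      count(Gender, Admit) %>%
      spread(Admit, n) %>%
      mutate(Perc_Admit = Admitted / (Admitted + Rejected)
      ) %>% dt

| Gender | Admitted | Rejected | Perc_Admit |
|--------|---------:|---------:|-----------:|
| Male   |     1198 |     1493 |  0.4451877 |
| Female |      557 |     1278 |  0.3035422 |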

### Proportion of males admitted for each department

Make a table similar to the one you constructed earlier, except you will first group the data by department. Then, you’ll use this table to calculate the proportion of males admitted in each department.

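    # Table of counts of admission status and gender for each department
    admit_by_dept <- ucb_admit %>%
      count(Dept, Admit, Gender) %>%
      spread(Admit, n)

    # View result
    admit_by_dept %>% dt

| Dept | Gender | Admitted | Rejected |
|------|--------|---------:|---------:|
| A    | Male   |      512 |      313 |
| A    | Female |       89 |       19 |
| B    | Male   |      353 |      207 |
| B    | Female |       17 |        8 |
| C    | Male   |      120 |      205 |
| C    | Female |      202 |      391 |
| D    | Male   |      138 |      279 |
| D    | Female |      131 |      244 |
| E    | Male   |       53 |      138 |
| E    | Female |       94 |      299 |
| F    | Male   |       22 |      351 |
| F    | Female |       24 |      317 |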
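    # Percentage of those admitted to each department
    admit_by_dept %>%
      mutate(Perc_Admit = Admitted / (Admitted + Rejected)) %>% dt

| Dept | Gender | Admitted | Rejected | Perc_Admit |
|------|--------|---------:|---------:|-----------:|
| A    | Male   |      512 |      313 |  0.6206061 |
| A    | Female |       89 |       19 |  0.8240741 |
| B    | Male   |      353 |      207 |  0.6303571 |
| B    | Female |       17 |        8 |  0.6800000 |
| C    | Male   |      120 |      205 |  0.3692308 |
| C    | Female |      202 |      391 |  0.3406408 |
| D    | Male   |      138 |      279 |  0.3309353 |
| D    | Female |      131 |      244 |  0.3493333 |
| E    | Male   |       53 |      138 |  0.2774869 |
| E    | Female |       94 |      299 |  0.2391858 |
| F    | Male   |       22 |      351 |  0.0589812 |
| F    | Female |       24 |      317 |  0.0703812 |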
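
The overall and per-department proportions can be cross-checked in base R, without dplyr or tidyr. This is a sketch rather than the method used in these notes. It relies only on the built-in `UCBAdmissions` table, and the variable names are illustrative.

```r
# UCBAdmissions ships with R as a 3-way table: Admit x Gender x Dept
overall_counts <- apply(UCBAdmissions, c(1, 2), sum)   # Admit x Gender
overall_rate   <- overall_counts["Admitted", ] / colSums(overall_counts)

dept_totals <- apply(UCBAdmissions, c(2, 3), sum)      # Gender x Dept
dept_rate   <- UCBAdmissions["Admitted", , ] / dept_totals

round(overall_rate, 3)
round(dept_rate, 3)

# Departments where the female admission rate exceeds the male rate
female_higher <- names(which(dept_rate["Female", ] > dept_rate["Male", ]))
female_higher
```

Across all departments combined, men have the higher admission rate. Within departments A, B, D, and F, women do. This reversal once department is included is the pattern the Simpson's paradox quote describes.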

# References

“Mine Cetinkaya-Rundel | DataCamp.” n.d. Accessed April 5, 2018. https://www.datacamp.com/instructors/mine.

“What Is a Response Variable? | Reference.Com.” n.d. Accessed April 5, 2018. https://www.reference.com/math/response-variable-14511ba1f409613c#.