Data and case studies

Michael Taylor

2018/04/17

library(knitr)       # kable()
library(kableExtra)  # kable_styling()
library(ggplot2)     # ggplot(), geom_histogram()
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(resampledata)
## 
## Attaching package: 'resampledata'
## The following object is masked from 'package:datasets':
## 
##     Titanic
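# Render a data frame as a styled table with an optional caption.
# Note: defining dt() in the workspace masks stats::dt().
dt <- function(x, y = "") {
  x %>%
    kable(caption = y) %>%
    kable_styling()
}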

CASE STUDY: FLIGHT DELAYS

Information on 4,029 United Airlines and American Airlines departures from LaGuardia Airport (LGA) during May and June 2009.

data("FlightDelays")
glimpse(FlightDelays)
## Observations: 4,029
## Variables: 10
## $ ID           <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
## $ Carrier      <fct> UA, UA, UA, UA, UA, UA, UA, UA, UA, UA, UA, UA, U...
## $ FlightNo     <int> 403, 405, 409, 511, 667, 669, 673, 677, 679, 681,...
## $ Destination  <fct> DEN, DEN, DEN, ORD, ORD, ORD, ORD, ORD, ORD, ORD,...
## $ DepartTime   <fct> 4-8am, 8-Noon, 4-8pm, 8-Noon, 4-8am, 4-8am, 8-Noo...
## $ Day          <fct> Fri, Fri, Fri, Fri, Fri, Fri, Fri, Fri, Fri, Fri,...
## $ Month        <fct> May, May, May, May, May, May, May, May, May, May,...
## $ FlightLength <int> 281, 277, 279, 158, 143, 150, 158, 160, 160, 163,...
## $ Delay        <int> -1, 102, 4, -2, -3, 0, -5, 0, 10, 60, 0, 32, 0, 4...
## $ Delayed30    <fct> No, Yes, No, No, No, No, No, No, No, Yes, No, Yes...
head(FlightDelays)[,1:6] %>% dt("Partial View of FlightDelays Data")
Table 1: Partial View of FlightDelays Data

ID  Carrier  FlightNo  Destination  DepartTime  Day
 1  UA       403       DEN          4-8am       Fri
 2  UA       405       DEN          8-Noon      Fri
 3  UA       409       DEN          4-8pm       Fri
 4  UA       511       ORD          8-Noon      Fri
 5  UA       667       ORD          4-8am       Fri
 6  UA       669       ORD          4-8am       Fri
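
The rows shown in the glimpse() output suggest that Delayed30 marks flights that left more than 30 minutes late. That rule is an assumption inferred from the first twelve displayed rows, not something stated in the text. A minimal base-R check on just those rows:

```r
# First 12 values of Delay and Delayed30, copied from the glimpse() output above.
delay <- c(-1, 102, 4, -2, -3, 0, -5, 0, 10, 60, 0, 32)
delayed30 <- c("No", "Yes", "No", "No", "No", "No",
               "No", "No", "No", "Yes", "No", "Yes")

# Assumed rule (not confirmed by the source): "Yes" when Delay exceeds 30 minutes.
derived <- ifelse(delay > 30, "Yes", "No")

# TRUE if the assumed rule reproduces the displayed column for these rows.
rule_matches <- identical(derived, delayed30)
print(rule_matches)
```

None of these twelve rows has a delay of exactly 30 minutes, so they cannot show whether the cutoff is strict. Running the same comparison on the full FlightDelays data would settle that.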

CASE STUDY: BIRTH WEIGHTS OF BABIES

Random sample of 1009 babies born in North Carolina during 2004.

data("NCBirths2004")
glimpse(NCBirths2004)
## Observations: 1,009
## Variables: 8
## $ ID         <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
## $ MothersAge <fct> 30-34, 30-34, 35-39, 20-24, 25-29, 35-39, 20-24, 20...
## $ Tobacco    <fct> No, No, No, No, No, No, No, No, No, No, No, Yes, No...
## $ Alcohol    <fct> No, No, No, No, No, No, No, No, No, No, No, No, No,...
## $ Gender     <fct> Male, Male, Female, Female, Male, Female, Female, M...
## $ Weight     <int> 3827, 3629, 3062, 3430, 3827, 3119, 3260, 3969, 317...
## $ Gestation  <int> 40, 38, 37, 39, 38, 39, 40, 40, 39, 39, 41, 39, 38,...
## $ Smoker     <fct> No, No, No, No, No, No, No, No, No, No, No, Yes, No...

Random sample of 40 baby girls born in Alaska and 40 baby girls born in Wyoming.

data("Girls2004")
glimpse(Girls2004)
## Observations: 80
## Variables: 6
## $ ID         <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
## $ State      <fct> WY, WY, WY, WY, WY, WY, WY, WY, WY, WY, WY, WY, WY,...
## $ MothersAge <fct> 15-19, 35-39, 25-29, 20-24, 25-29, 20-24, 20-24, 25...
## $ Smoker     <fct> No, No, No, No, No, No, No, No, No, No, No, Yes, Ye...
## $ Weight     <int> 3085, 3515, 3775, 3265, 2970, 2850, 2737, 3515, 374...
## $ Gestation  <int> 40, 39, 40, 39, 40, 38, 38, 37, 39, 40, 41, 39, 40,...

Random sample of 1587 babies born in Texas in 2004.

data("TXBirths2004")
glimpse(TXBirths2004)
## Observations: 1,587
## Variables: 8
## $ ID         <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
## $ MothersAge <fct> 20-24, 20-24, 25-29, 25-29, 15-19, 30-34, 30-34, 25...
## $ Smoker     <fct> No, No, No, No, No, No, No, No, No, No, No, No, No,...
## $ Gender     <fct> Male, Male, Female, Female, Female, Female, Male, F...
## $ Weight     <int> 3033, 3232, 3317, 2560, 2126, 2948, 3884, 2665, 371...
## $ Gestation  <int> 39, 40, 37, 36, 37, 38, 39, 38, 40, 37, 39, 37, 40,...
## $ Number     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Multiple   <fct> No, No, No, No, No, No, No, No, No, No, No, No, No,...

CASE STUDY: VERIZON REPAIR TIMES

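Random sample of repair times for 1664 ILEC (incumbent local exchange carrier, here Verizon's own) customers and 23 CLEC (competing local exchange carrier) customers.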

data("Verizon")
glimpse(Verizon)
## Observations: 1,687
## Variables: 2
## $ Time  <dbl> 17.50, 2.40, 0.00, 0.65, 22.23, 1.20, 2.08, 4.97, 0.00, ...
## $ Group <fct> ILEC, ILEC, ILEC, ILEC, ILEC, ILEC, ILEC, ILEC, ILEC, IL...

Sampling

CASE STUDY: GENERAL SOCIAL SURVEY

Since 1972, the General Social Survey (GSS) has provided politicians, policymakers, and scholars with a clear and unbiased perspective on what Americans think and feel about such issues as national spending priorities, crime and punishment, intergroup relations, and confidence in institutions.

EXPLORATORY DATA ANALYSIS

FlightDelays %>%
  filter(Carrier == "UA") %>%
  mutate(time_interval = cut(Delay, breaks = seq(-50, 450, by = 50))) %>%
  group_by(time_interval) %>%
  summarize(nflights = n())
## # A tibble: 9 x 2
##   time_interval nflights
##   <fct>            <int>
## 1 (-50,0]            722
## 2 (0,50]             249
## 3 (50,100]            86
## 4 (100,150]           39
## 5 (150,200]           14
## 6 (200,250]            7
## 7 (250,300]            3
## 8 (300,350]            2
## 9 (350,400]            1
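
cut() builds right-closed intervals by default, which is why a departure with a Delay of exactly 0 is counted in (-50,0] rather than (0,50]. The sketch below uses only the first 14 Delay values visible in the glimpse() output, so its counts are illustrative and are not the full-data counts.

```r
# First 14 Delay values, copied from the glimpse() output (not the full data).
delay <- c(-1, 102, 4, -2, -3, 0, -5, 0, 10, 60, 0, 32, 0, 4)
breaks <- seq(-50, 450, by = 50)

# Default: intervals are open on the left and closed on the right, e.g. (-50,0].
right_closed <- table(cut(delay, breaks = breaks))

# right = FALSE flips this to [-50,0), so the zeros move into [0,50).
left_closed <- table(cut(delay, breaks = breaks, right = FALSE))

print(right_closed[1:4])  # counts 8, 4, 1, 1
print(left_closed[1:4])   # counts 4, 8, 1, 1
```

The choice matters here because many flights have a delay of exactly 0, and the two conventions put them on opposite sides of the "on time or early" boundary.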
FlightDelays %>%
  filter(Carrier == "UA") %>%
  ggplot(aes(x = Delay)) +
  geom_histogram(binwidth = 50)
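
With only binwidth set, ggplot2 (as far as its documentation describes) centers bins on multiples of the width, so the bars cover ranges such as (-25, 25] and do not line up with the (-50,0], (0,50], ... intervals in the table. Passing boundary = 0 is one way to align them. The sketch below is a hedged illustration on the 14 Delay values visible in the glimpse() output, not on the full data.

```r
library(ggplot2)

# First 14 Delay values, copied from the glimpse() output (not the full data).
delay <- data.frame(Delay = c(-1, 102, 4, -2, -3, 0, -5, 0, 10, 60, 0, 32, 0, 4))

# boundary = 0 places bin edges at multiples of 50 (..., -50, 0, 50, ...).
p_aligned <- ggplot(delay, aes(x = Delay)) +
  geom_histogram(binwidth = 50, boundary = 0)

# layer_data() returns the computed bins without drawing the plot.
aligned <- layer_data(p_aligned)
print(aligned[, c("xmin", "xmax", "count")])
```

Applied to the UA subset, the same boundary = 0 argument should make the bar heights directly comparable with the nflights column.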