One Categorical Variable

Michael Taylor

2018/03/26

Frequency Distributions

What is your perception of your own body? Do you feel that you are overweight, underweight, or about right?

A random sample of 1,200 U.S. college students was asked this question as part of a larger survey. The following table shows some of the responses:

Table 1: Body Image
Student      Body_Image
student 25   overweight
student 26   about right
student 27   underweight
student 28   about right
student 29   about right

The questions we would be interested in answering are:

There is no way that we can answer these questions by looking at the raw data, which are in the form of a long list of 1,200 responses, and thus not very useful. However, both these questions will be easily answered once we summarize and look at the distribution of the variable Body Image (i.e., once we summarize how often each of the categories occurs).

In order to summarize the distribution of a categorical variable, we first create a table of the different values (categories) the variable takes, how many times each value occurs (count) and, more importantly, how often each value occurs (by converting the counts to percentages); this table is called a frequency distribution. Here is the frequency distribution for our example:

library(knitr)       # kable()
library(kableExtra)  # kable_styling()

Category <- c("About right", "Overweight", "Underweight", "$\\textbf{Total}$")

# Count becomes a character vector because the total row holds a LaTeX string
Count <- c(855, 235, 110, "$n=1200$")

Percent <- c("$\\left(\\frac{855}{1200}\\right)*100=71.3\\%$",
             "$\\left(\\frac{235}{1200}\\right)*100=19.6\\%$",
             "$\\left(\\frac{110}{1200}\\right)*100=9.2\\%$",
             "$100\\%$")
dt <- data.frame(Category, Count, Percent,
                 stringsAsFactors = FALSE)

kable(dt, "html", caption = "Body Image Distribution") %>%
  kable_styling(bootstrap_options = c("striped"))
Table 2: Body Image Distribution
Category           Count        Percent
About right        855          \(\left(\frac{855}{1200}\right)*100=71.3\%\)
Overweight         235          \(\left(\frac{235}{1200}\right)*100=19.6\%\)
Underweight        110          \(\left(\frac{110}{1200}\right)*100=9.2\%\)
\(\textbf{Total}\)   \(n=1200\)   \(100\%\)
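
The table above was typed in by hand. As a minimal sketch, the same frequency distribution can be computed from raw responses with `table()` and `prop.table()`. The survey's raw data are not part of this post, so the vector `body_image` below is a hypothetical reconstruction from the published counts, and the name `freq_dist` is likewise chosen only for illustration. Percentages are kept to two decimals here; rounding them to one decimal gives the values shown in the table.

```r
# Hypothetical reconstruction of the 1,200 raw responses from the counts
# in the frequency table (the real survey data are not included here).
body_image <- rep(c("About right", "Overweight", "Underweight"),
                  times = c(855, 235, 110))

counts  <- table(body_image)                   # how many times each category occurs
percent <- round(100 * prop.table(counts), 2)  # how often, as a percentage

freq_dist <- data.frame(Category = names(counts),
                        Count    = as.vector(counts),
                        Percent  = as.vector(percent))
freq_dist
```

Working from the raw vector keeps the counts and percentages in sync, which avoids arithmetic slips when percentages are entered manually.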

In order to visualize the numerical summaries we’ve obtained, we need a graphical display. There are two simple graphical displays for visualizing the distribution of categorical data:

  1. The pie chart
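library(plotly)

# Drop the "Total" row and keep only the category and count columns
pie_dt <- dt[1:3, 1:2]
pie_dt$Count <- as.integer(pie_dt$Count)  # Count was stored as character

# The ~ formulas tell plot_ly() to look up the columns inside pie_dt
p <- plot_ly(pie_dt, labels = ~Category,
             values = ~Count, type = "pie") %>%
  layout(title = 'Pie Chart of Body Image')
# api_create(p, filename = "r-pie-body-image")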
  2. The bar chart
library(ggplot2)
g <- ggplot(pie_dt, aes(x = Category,
                        y = Count,
                        fill = Category)) +
  geom_col() +
  theme_bw()
p <- ggplotly(g)  # convert the ggplot object into an interactive plotly chart
# api_create(p, filename = "r-barchart-body-image")

Pictograms

  1. While both the pie chart and the bar chart help us visualize the distribution of a categorical variable, the pie chart emphasizes how the different categories relate to the whole, and the bar chart emphasizes how the different categories compare with each other.

  2. A variation on the pie chart and bar chart that is very commonly used in the media is the pictogram. Here are two examples:

Source: USA Today Snapshots and the Impulse Research for Northern Confidential Bathroom survey

Source: Market Facts for the Association of Dressings and Sauces

This graph is aimed at advertisers deciding where to spend their budgets, and clearly suggests that Time magazine attracts by far the largest amount of advertising spending. Are the differences really as dramatic as the graph suggests? If we look carefully at the numbers above the pens, we find that advertisers spend only \(\$4,433,879 / \$2,698,386 = 1.64\) times as much in Time as in Newsweek, and only \(\$4,433,879 / \$1,537,617 = 2.88\) times as much as in U.S. News. By looking at the pictogram, however, we get the impression that Time is much further ahead. Why? In order to magnify a picture without distorting it, we must increase both its height and its width by the same factor. As a result, the area of Time’s pen is 1.64 * 1.64 = 2.7 times the area of the Newsweek pen, and 2.88 * 2.88 = 8.3 times the area of the U.S. News pen. Our eyes register the area of the pens rather than only their height, and so we are misled into thinking that Time is a bigger winner than it really is.
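
The arithmetic behind this distortion can be checked with a short sketch. The spending figures are the ones quoted above; the variable names (`spend`, `height_ratio`, `area_ratio`) are chosen here for illustration, and the code assumes each pen is scaled uniformly so that its height is proportional to spending.

```r
# Advertising spending as quoted above the pens in the pictogram
spend <- c("Time" = 4433879, "Newsweek" = 2698386, "U.S. News" = 1537617)

# How many times taller Time's pen is than each magazine's pen
height_ratio <- spend[["Time"]] / spend

# Scaling height and width by the same factor squares the ratio of areas
area_ratio <- height_ratio^2

round(height_ratio, 2)
round(area_ratio, 1)
```

The squared ratios are what the eye perceives, which is why a bar chart with equal-width bars is the safer display for this comparison.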

The same survey that asked 1,200 U.S. college students about their body perception also asked the following question:

“With whom do you find it easiest to make friends?” (opposite sex, same sex or no difference).

Below is a snapshot of how the first 25 men and women answered the question: “With whom do you find it easiest to make friends?”

library(downloader)
url <- "https://lagunita.stanford.edu/assets/courseware/v1/6c69fdf1ef819cd13c1d53a0b3481435/asset-v1:OLI+ProbStat+Open_Jan2017+type@asset+block/friends.RData"
filename <- basename(url)
if (!file.exists(filename)) download(url, destfile=filename)
load(filename)
str(friends)
## 'data.frame':    1200 obs. of  1 variable:
##  $ Friends: Factor w/ 3 levels "No difference",..: 1 1 1 1 1 1 1 1 1 1 ...
head(friends, 25)
##          Friends
## 1  No difference
## 2  No difference
## 3  No difference
## 4  No difference
## 5  No difference
## 6  No difference
## 7  No difference
## 8  No difference
## 9  No difference
## 10 No difference
## 11 No difference
## 12 No difference
## 13 No difference
## 14 No difference
## 15 No difference
## 16 No difference
## 17 No difference
## 18 No difference
## 19 No difference
## 20 No difference
## 21 No difference
## 22 No difference
## 23 No difference
## 24 No difference
## 25 No difference
library(dplyr)
t <- summary(friends$Friends)
str(t)
##  Named int [1:3] 602 434 164
##  - attr(*, "names")= chr [1:3] "No difference" "Opposite sex" "Same sex"

Summary table of the data

prop <- prop.table(t); prop
## No difference  Opposite sex      Same sex 
##     0.5016667     0.3616667     0.1366667
writeLines("\n")
percent=prop.table(t)*100; percent
## No difference  Opposite sex      Same sex 
##      50.16667      36.16667      13.66667
writeLines("\n")
#modify our percent table so that each value is rounded to one decimal place.
pf <- round(percent,1); pf
## No difference  Opposite sex      Same sex 
##          50.2          36.2          13.7
df <- data.frame(rbind(t,pf))
rownames(df) <- 1:nrow(df)
df[2,] <- paste(df[2,], "%", sep = ""); df
##   No.difference Opposite.sex Same.sex
## 1           602          434      164
## 2         50.2%        36.2%    13.7%
df %>%
  kable("html") %>%
  kable_styling( c("striped", "hover"))
No.difference Opposite.sex Same.sex
602 434 164
50.2% 36.2% 13.7%

Next we build the labels for the sections of the pie chart, each combining the category name with its percentage. R orders factor levels alphabetically by default, and tables and graphics follow that order, so if you create your own labels, list the names in the same order.

# Take the names from pf itself so labels and values cannot get out of order
lbl <- paste(names(pf), pf, "%", sep = " "); lbl
## [1] "No difference 50.2 %" "Opposite sex 36.2 %"  "Same sex 13.7 %"
pie(t, labels = lbl)
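
The alphabetical ordering mentioned above comes from how `factor()` sorts its levels. The following sketch uses a small made-up vector (`answers`, not the survey data) to show the default order and one way to override it by passing `levels` explicitly.

```r
# A tiny illustrative vector, deliberately not in alphabetical order
answers <- factor(c("Same sex", "No difference", "Opposite sex", "No difference"))

levels(answers)  # levels are sorted alphabetically, not by first appearance
table(answers)   # table() reports counts in level order

# To control the order used by tables and plots, set the levels explicitly
answers_ordered <- factor(answers,
                          levels = c("Same sex", "Opposite sex", "No difference"))
names(table(answers_ordered))
```

Hand-written labels must follow whichever level order is in effect, otherwise a slice can end up with the wrong name.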

dftut <- data.frame(values = as.numeric(df[1,]),
                    taxonomies = colnames(df))
dftut
##   values    taxonomies
## 1    602 No.difference
## 2    434  Opposite.sex
## 3    164      Same.sex
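# One stacked bar mapped to the angle (theta = "y") gives a pie chart;
# putting the categories on x would draw a rose chart with equal angles instead.
g <- ggplot(dftut, aes(x = "",
                       y = values,
                       fill = taxonomies)) +
  geom_col(width = 1) +
  coord_polar(theta = "y")
g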

p <- plot_ly(dftut,
             labels = dftut$taxonomies,
             values = dftut$values, type="pie",
             text=dftut$taxonomies) %>%
  layout(title = "Friends pie chart")
# api_create(p, filename = "r-friends-pie-chart")

If you were to pick one of the 1,200 surveyed students at random, with whom would he or she most likely find it easiest to make friends: the same sex, the opposite sex, or no difference?
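
One way to answer this from the frequency distribution is to find the category with the largest share, since a randomly chosen student falls into each category with probability equal to its relative frequency. This is a minimal sketch that re-enters the counts from the summary table by hand; the names `friend_counts` and `friend_prob` are illustrative.

```r
# Counts copied from the summary table of the friends data
friend_counts <- c("No difference" = 602, "Opposite sex" = 434, "Same sex" = 164)

# Relative frequency = probability that a randomly chosen student gave that answer
friend_prob <- friend_counts / sum(friend_counts)
round(friend_prob, 3)

# The most likely answer is the category with the largest probability
names(which.max(friend_prob))
```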