The Big Picture

Michael Taylor


Statistics is about converting data into useful information or insight.

The process of statistics begins when we identify a group we want to study or learn something about. We call this group the population. The population, then, is the entire group that is the target of our interest.

In most cases it is not feasible to collect data from the whole population. It is more practical and affordable to collect data from a subgroup of the population, called a sample. This first step, which involves choosing a sample and collecting data from it, is called producing data. Note that the data should be collected in such a way that the sample represents the population well.

Once data is collected we have to make sense of it. This is done by summarizing the data in a way that is meaningful. This is called exploratory data analysis.

Having obtained the sample results and summarized them, the next step is to draw some conclusions about the population from them. Before we can do so, we have to use probability to assess how the sample may differ from the population.

Probability is what allows us to draw conclusions about the population based on the data collected from the sample.

The last step is to draw conclusions about our population. This is called inference.

Big Picture Summary

  1. Producing Data—Choosing a sample from the population of interest and collecting data.

  2. Exploratory Data Analysis (EDA)—Summarizing the data we’ve collected.

  3. and 4. Probability and Inference—Drawing conclusions about the entire population based on the data collected from the sample.
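The four steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population is simulated (the heights, the seed, and the sample size of 100 are invented for the example), and the interval uses the common normal approximation of about two standard errors on either side of the sample mean.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 simulated heights in cm (made-up values).
population = [rng.gauss(170, 10) for _ in range(10_000)]

# 1. Producing data: draw a simple random sample from the population.
sample = rng.sample(population, 100)

# 2. Exploratory data analysis: summarize the sample.
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)

# 3. Probability: estimate how far a sample mean typically falls
#    from the population mean (the standard error).
standard_error = sample_sd / len(sample) ** 0.5

# 4. Inference: an approximate 95% interval for the population mean.
low = sample_mean - 1.96 * standard_error
high = sample_mean + 1.96 * standard_error

print(f"sample mean = {sample_mean:.1f}")
print(f"approximate 95% interval = ({low:.1f}, {high:.1f})")
```

In real work the population is not available to compute from, which is exactly why the last two steps are needed.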

Data and Variables

Data are pieces of information about individuals organized into variables. By an individual, we mean a particular person or object. By a variable, we mean a particular characteristic of the individual.

A dataset is a set of data identified with particular circumstances. Datasets are typically displayed in tables, in which rows represent individuals and columns represent variables.

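Variables fall into one of two categories: categorical or quantitative. A categorical variable places an individual into one of several groups, while a quantitative variable takes numerical values that represent a measurement or a count.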

  1. The individuals described by the data are people living in the United States of America in the year 2000.
  2. Zip code is a categorical variable.
  3. Family size is a quantitative variable.
  4. Annual income is a quantitative variable.
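The rows-and-columns layout and the two variable types can be sketched in Python as follows. The three rows are invented purely for illustration, and the names `dataset`, `variable_types`, and `mean_of` are hypothetical. Zip codes are stored as strings on purpose: they look like numbers, but arithmetic on them is not meaningful.

```python
# Each row is one individual; each key is one variable.
# The values below are invented purely for illustration.
dataset = [
    {"zip_code": "15213", "family_size": 4, "annual_income": 52000},
    {"zip_code": "90210", "family_size": 2, "annual_income": 87000},
    {"zip_code": "10001", "family_size": 3, "annual_income": 61000},
]

# Variable types as stated in the text.
variable_types = {
    "zip_code": "categorical",
    "family_size": "quantitative",
    "annual_income": "quantitative",
}


def mean_of(rows, variable):
    """Average a variable, refusing to do arithmetic on categorical ones."""
    if variable_types[variable] != "quantitative":
        raise TypeError(f"{variable} is categorical; averaging it is not meaningful")
    return sum(row[variable] for row in rows) / len(rows)


print(mean_of(dataset, "family_size"))  # 3.0
```

Calling `mean_of(dataset, "zip_code")` raises a `TypeError`, which mirrors the point that a categorical variable can be counted and grouped but not averaged.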

Clinical Depression and Drug Treatment


Clinical depression is the most common mental illness in the United States, affecting 19 million adults each year (Source: NIMH, 1999). Nearly 50% of individuals who experience a major episode will have a recurrence within 2 to 3 years. Researchers are interested in comparing therapeutic solutions that could delay or reduce the incidence of recurrence.

In a study conducted by the National Institutes of Health, 109 clinically depressed patients were separated into three groups, and each group was given one of two active drugs (imipramine or lithium) or no drug at all. For each patient, the dataset contains the treatment used, the outcome of the treatment, and several other interesting characteristics.

Here is a summary of the variables in our dataset:

  • Hospt: The patient’s hospital, represented by a code for each of the 5 hospitals (1, 2, 3, 5, or 6)
  • Treat: The treatment received by the patient (Lithium, Imipramine, or Placebo)
  • Outcome: Whether or not a recurrence occurred during the patient’s treatment (Recurrence or No Recurrence)
  • Time: Either the time in days till the first recurrence, or if a recurrence did not occur, the length in days of the patient’s participation in the study.
  • AcuteT: The time in days that the patient was depressed prior to the study.
  • Age: The age of the patient in years, when the patient entered the study.
  • Gender: The patient’s gender (1 = Female, 2 = Male)

The individuals described by the data are 109 clinically depressed people.

The variables Treat and Outcome are categorical variables.

Time and Age are quantitative variables.
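One way to record a single patient from this dataset is sketched below. The allowed codes come from the variable summary above; the `Patient` class, its field names, and the example values are hypothetical. Note that Hospt and Gender are stored as numbers but are still categorical, since the numbers are only labels.

```python
from dataclasses import dataclass

# Allowed values, taken from the variable summary above.
HOSPITAL_CODES = {1, 2, 3, 5, 6}
TREATMENTS = {"Lithium", "Imipramine", "Placebo"}
OUTCOMES = {"Recurrence", "No Recurrence"}
GENDER_LABELS = {1: "Female", 2: "Male"}


@dataclass
class Patient:
    hospt: int      # categorical: hospital code
    treat: str      # categorical: treatment received
    outcome: str    # categorical: recurrence or not
    time: float     # quantitative: days to recurrence or days in study
    acute_t: float  # quantitative: days depressed before the study
    age: float      # quantitative: age in years at entry
    gender: int     # categorical: 1 = Female, 2 = Male

    def __post_init__(self):
        # Reject codes that are not listed in the variable summary.
        if self.hospt not in HOSPITAL_CODES:
            raise ValueError(f"unknown hospital code: {self.hospt}")
        if self.treat not in TREATMENTS:
            raise ValueError(f"unknown treatment: {self.treat}")
        if self.outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {self.outcome}")
        if self.gender not in GENDER_LABELS:
            raise ValueError(f"unknown gender code: {self.gender}")

    @property
    def gender_label(self):
        """Decode the numeric gender code into its label."""
        return GENDER_LABELS[self.gender]


# A made-up patient, for illustration only.
example = Patient(hospt=3, treat="Placebo", outcome="No Recurrence",
                  time=120, acute_t=200, age=34, gender=1)
print(example.gender_label)  # Female
```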

Scales of Measurement

Variables can be further classified by the precision with which they are measured. Four different scales of measurement can be used. They are listed below from least to most precise.

Nominal Scale of Measurement

The nominal scale of measurement is a qualitative measure that uses discrete categories to describe research participants. An example is identifying participants as either runners or non-runners.

Ordinal Scale of Measurement

An ordinal scale of measurement rank-orders participants on some scale or attribute, but the difference between numbers does not represent a fixed or equal difference. A one-unit increase on an ordinal scale represents "more", but we do not know how much more. For example, a group of participants can be rank-ordered from least to most politically active. We would know that someone ranked 5 is more active than a person ranked 4, but this does not tell us how much more active the person ranked 5 is than the person ranked 4. The value of the variable lies in its ability to order participants according to the strength or presence of an attribute. The variable in this case is not used to calculate the difference between participants.

Example: a car's condition rated as Excellent, Good, Fair, or Poor.
The ratings are ranked, but the distance between the ranks is unknown.

Interval Scale of Measurement

The interval scale of measurement takes numerical form, and the distance between pairs of consecutive numbers is assumed to be equal. However, interval variables do not have a meaningful zero point; thus, a zero does not mean the absence of the attribute, but rather is a particular (but arbitrary) point on the scale. A good example of an interval measure is temperature on the Celsius scale: a temperature of zero degrees Celsius is still a temperature, and does not indicate the absence of temperature.

Intelligence (IQ) scores are measured on an interval scale, where the distance between consecutive pairs of values is assumed to be equal. There is no zero score; after all, there cannot be a complete absence of intelligence.

Ratio Scale of Measurement

The ratio scale of measurement is similar to the interval scale. As with the interval scale, a number is assigned to a subject to represent the strength or amount of an attribute that the subject has, and the difference between consecutive pairs of numbers is assumed to be equal. The main difference between interval and ratio scale measurement is that on a ratio scale the zero is meaningful and represents the absence of the attribute in a participant.
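The practical consequence of a meaningful zero can be checked with temperature. The sketch below compares Celsius (interval) with Kelvin, which is a ratio scale because zero kelvin is a true zero. The two temperatures are arbitrary example values and the function name is hypothetical.

```python
def celsius_to_kelvin(celsius):
    """Shift the zero point from the arbitrary Celsius zero to absolute zero."""
    return celsius + 273.15


cool_c, warm_c = 10.0, 20.0  # arbitrary example temperatures in Celsius
cool_k = celsius_to_kelvin(cool_c)
warm_k = celsius_to_kelvin(warm_c)

# Differences are meaningful on an interval scale:
# they stay the same when the zero point moves.
print(f"difference: {warm_c - cool_c:.1f} C, {warm_k - cool_k:.1f} K")

# Ratios are not: the Celsius ratio depends on where zero happens to be.
print(f"Celsius ratio: {warm_c / cool_c:.3f}")  # 2.000
print(f"Kelvin ratio:  {warm_k / cool_k:.3f}")  # 1.035
```

So 20 degrees Celsius is 10 degrees warmer than 10 degrees Celsius, but it is not "twice as hot"; only on the ratio scale does the quotient describe the attribute itself.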

A few examples.

Classifying subjects' level of anxiety as high, medium, or low is an ordinal measurement.

A researcher measures political affiliation, and records a value of 1 for a Republican, 2 for a Democrat, 3 for an Independent, and 4 for other affiliations. This is a nominal scale of measurement.

A researcher observes Teacher A's classroom of 30 students for a 45-minute class. The researcher records the percentage of time students spend working in groups during the class. What scale of measurement is this measure? This is a ratio scale of measurement.

Consider scores on the SAT Math Test, which range from 200 to 800. What scale of measurement is this measure? This is an interval scale of measurement.
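The four scales and the examples above can be summarized in a small lookup. This is a sketch, assuming the usual rule of thumb that each scale permits everything the less precise scales permit plus one more kind of comparison; the names `Scale`, `meaningful_operations`, and `examples` are hypothetical.

```python
from enum import Enum


class Scale(Enum):
    # Each value is (ordered, equal_intervals, meaningful_zero).
    NOMINAL = (False, False, False)
    ORDINAL = (True, False, False)
    INTERVAL = (True, True, False)
    RATIO = (True, True, True)


def meaningful_operations(scale):
    """List the comparisons that make sense on a given scale."""
    ordered, equal_intervals, meaningful_zero = scale.value
    operations = ["count categories"]
    if ordered:
        operations.append("rank or compare")
    if equal_intervals:
        operations.append("add and subtract")
    if meaningful_zero:
        operations.append("form ratios")
    return operations


# The examples from the text, with the scale assigned to each.
examples = {
    "anxiety level (high, medium, low)": Scale.ORDINAL,
    "political affiliation code": Scale.NOMINAL,
    "percentage of class time in group work": Scale.RATIO,
    "SAT Math score": Scale.INTERVAL,
}

for description, scale in examples.items():
    operations = ", ".join(meaningful_operations(scale))
    print(f"{description}: {scale.name.lower()} -> {operations}")
```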