Model Building in R

In which we explore the basics of modeling as an exploratory tool through recording and graphing predictions and residuals, variable interactions, and transformations.

February 28, 2020 - 11 minute read -
academic r

This post provides an introduction to modeling in R without going into statistical details.1 We’ll go over some examples of fitting models to data, and then examine the list-column data structure. The following libraries are used:

library("tidyverse")
library("modelr")              # tools for modeling
library("splines")             # tools for modeling with polynomials
library("broom")               # tools for turning models into tidy data frames
library("gapminder")           # dataset with country descriptive data over time
options(na.action = na.warn) # warns you if models drop data

Model Basics

Let’s go through an example exploratory workflow of fitting a model to a continuous variable using simulated dataset sim1:

1. Graph the initial data.
ggplot(sim1, aes(x, y)) +
  geom_point()

Graph Continuous Variable

2. Fit a simple linear regression to the data.
sim1_mod <- lm(y ~ x, data = sim1)

Note: To model an interaction between x-variables, use *

3. Create a new data frame with model prediction data.
grid <- sim1 %>% 
  data_grid(x) %>%
  add_predictions(sim1_mod) 

Note: add_predictions() adds a single new column with model predictions, spread_predictions() adds one column for each model, and gather_predictions() two columns, model and prediction, and repeats the input rows for each model.

4. Graph the initial data with the fitted model.
ggplot(sim1, aes(x)) +
  geom_point(aes(y=y)) +
  geom_line(data = grid, aes(y = pred), color = "red", size = 1)

Graph Continuous Variable with Predictions

5. Add the residuals to the initial data.
sim1 <- sim1 %>%
  add_residuals(sim1_mod)
6. Graph the initial x-values against their residuals.
ggplot(sim1, aes(x, resid)) + 
  geom_ref_line(h = 0) + 
  geom_point()

Graph Continuous Variable Resids

Transformations

Transformations can be perfromed inside of the model formula, but if it contains an operation of +, -, *, ^, wrap it in I() so it doesn’t become part of the model specs. Let’s go through an example workflow again, this time with a polynomial transformation.

1. Create and graph the initial data.
sim2 <- tibble(
  x = seq(0, 3.5 * pi, length = 50),
  y = 4 * sin(x) + rnorm(length(x))
)

ggplot(sim2, aes(x, y)) +
  geom_point()

Graph Data

2. Fit linear regressions with polynomials up to degree five to the data.
mod1 <- lm(y ~ ns(x, 1), data = sim2)
mod2 <- lm(y ~ ns(x, 2), data = sim2)
mod3 <- lm(y ~ ns(x, 3), data = sim2)
mod4 <- lm(y ~ ns(x, 4), data = sim2)
mod5 <- lm(y ~ ns(x, 5), data = sim2)
3. Put predictions into a data frame.
grid <- sim5 %>% 
  data_grid(x = seq_range(x, n = 50, expand = 0.1)) %>% 
  gather_predictions(mod1, mod2, mod3, mod4, mod5, .pred = "y")

Note: seq_range() provides a specified number of values between the minimum and maximum of a variable, which can be useful for graphing.

4. Graph the initial data with the fitted models.
ggplot(sim2, aes(x, y)) + 
  geom_point() +
  geom_line(data = grid, colour = "red") +
  facet_wrap(~ model, nrow = 1)

Graph Data with Predictions

5. Add the residuals to the initial data.
sim5 <- sim5 %>%
  gather_residuals(mod1, mod2, mod3, mod4, mod5)
6. Graph the initial x-values against their residuals.
ggplot(sim5, aes(x, resid)) +
  geom_hline(yintercept = 0, size = 2, color = "white") +
  geom_point() +
  facet_wrap(~model, nrow = 1)

Graph Data Residuals

Many Models

If you have a complex dataset, it may be possible to unpack the data using many simple models. For example, the gapminder dataset contains data on the life expectancy, among other variables, in many countries over the course of 50 years. With this data, let’s explore how life expectancy changes over time for each country, and dig into which countries deviate significantly from the rest of the world.

1. Graph the initial data.
gapminder %>%
  ggplot(aes(year, lifeExp, group = country)) + 
  geom_line(alpha = 1/3)

Graph Data Many Models

2. Nest the data frame to create one row for each country and continent, with a new column of data frames containing the rest of the country information.
by_country <- gapminder %>%
  group_by(country, continent) %>%
  nest()
3. Create a linear model and run it over each country, storing the results in a new column.
# create a linear model function 
country_model <- function(df) {
  lm(lifeExp ~ year, data = df)
}

# run models and store them in a new list-column; helpful for filtering and arranging
by_country <- by_country %>%
  mutate(model = map(data, country_model))
4. Add residuals to the data.
by_country <- by_country %>%
  mutate(resids = map2(data, model, add_residuals))
5. Unnest and plot the residuals by continent. Notice that the model performs the worst in Africa.
resids <- unnest(by_country, resids)
resids %>% 
  ggplot(aes(year, resid, group = country)) +
  geom_line(alpha = 1 / 3) + 
  facet_wrap(~continent, nrow = 1)

Graph Data Many Residuals

7. Unpack the data by using glance and unnest to skim and add model data to the data frame.
glance <- by_country %>%
  mutate(glance = map(model, glance)) %>%
  unnest(glance, .drop = TRUE)
8. Graph the R-Squared coefficient to investigate where the model breaks down.
glance %>%
  arrange(r.squared) %>%
  ggplot(aes(continent, r.squared)) +
  geom_jitter(width = 0.5)

Graph Data Many R-Squared

9. Pull out the problem countries into a separate table. Then, plot the life expectancies of those countries over time by joining the table with the original data.
bad_fit <- glance %>%
  filter(r.squared < .25)
gapminder %>%
  semi_join(bad_fit, by = "country") %>%
  ggplot(aes(year, lifeExp, color = country)) +
    geom_line()

Note: History provides an explanation where the data breaks down: this graph reveals the devastating effects of the Rwandan genocide and HIV/AIDS in African countries in the 1990s.

Graph Data Many Drilldown

Data Structures: List-Columns

The life expectancy example above made use of list-column data structures. In general, an effective list-column pipeline will take the following form:

  1. Create the list-column.
  2. Create other intermediate list-columns by transforming existing list columns.
  3. Simplify the list-column back down to a data frame or atomic vector.

Creating List-Columns

  • nest() converts a grouped data frame into a nested data frame with a list-column of data frames.
  • mutate() applied with vectorized functions that return a list will create list-columns.
  • summarize() applied with summary functions that return multiple results will create list-columns.
# summarize the quantiles of miles per gallon by the number of cylinders in the car
probs <- c(0.01, 0.25, 0.5, 0.75, 0.99)
mtcars %>%
  group_by(cyl) %>%
  summarize(p = list(probs), q = list(quantile(mpg, probs))) %>%
  unnest()
#>  A tibble: 15 x 3
#>     cyl     p     q
#>   <dbl> <dbl> <dbl>
#> 1     4  0.01  21.4
#> 2     4  0.25  22.8
#> 3     4  0.5   26  
#> 4     4  0.75  30.4
#> 5     4  0.99  33.8
#> 6     6  0.01  17.8
#> 7     6  0.25  18.6
#> ...

Simplifying List-Columns

In order to manipulate and visualize the data, you will need to simplify list-columns.

  • If you want a single value from the list-column, use mutate() with map_lgl(), map_int(), map_dbl(), map_chr() to create an atomic vector.
  • If you want many values from the list-column, use unnest() to convert list columns back to regular columns, repeating the rows as many times as necessary.
# extract a single value by referring to its element name
df <- tribble(
  ~x,
  list(a = 1, b = 2),
  list(a = 2, c = 4))

df %>% mutate(
  a = map_dbl(x, "a"),
  b = map_dbl(x, "b", .null = NA_real_))
#> A tibble: 2 x 3
#>   x                    a     b
#>   <list>           <dbl> <dbl>
#> 1 <named list [2]>     1     2
#> 2 <named list [2]>     2    NA
#> 

Turning Models into Tidy Data

The following three functions help turn models into tidy data, and often make use of list-columns.

  • glance() returns a row for each model, where each column gives a model summary.
  • tidy() returns a row for each coefficient in the model, where each column has info about estimate/variability.
  • augment() returns a row for each row in data, adding extra values like residuals and influence stats.
  1. This post is meant for a person who is looking for a refresher on basic modeling in R. The content in this post is based on chapter twenty-two through twenty-five of R for Data Science by Hadley Wickham & Garrett Grolemund. 

←Index