# Model Building in R

In which we explore the basics of modeling as an exploratory tool through recording and graphing predictions and residuals, variable interactions, and transformations.

February 28, 2020 - 11 minute read

This post provides an introduction to modeling in R without going into statistical details.<sup>1</sup> We’ll go over some examples of fitting models to data, and then examine the list-column data structure. The following libraries are used:
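The original library list is not shown in this copy of the post. Judging from the functions and datasets named below, a reasonable setup would be the following; treat the exact set as an assumption rather than the author's list.

```r
# Assumed package set, inferred from the functions used in this post.
library(tidyverse)  # ggplot2, dplyr, tidyr, purrr
library(modelr)     # sim1, add_predictions(), add_residuals(), seq_range()
library(gapminder)  # the gapminder dataset
library(broom)      # glance(), tidy(), augment()
```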

## Model Basics

Let’s go through an example exploratory workflow of fitting a model to a continuous variable, using the simulated dataset sim1:

##### 1. Graph the initial data.

##### 2. Fit a simple linear regression to the data.
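Steps 1 and 2 might look like the following minimal sketch. It assumes sim1 comes from the modelr package; the object names `p1` and `sim1_mod` are chosen here for illustration.

```r
library(ggplot2)
library(modelr)  # provides the simulated dataset sim1 (columns x and y)

# Step 1: scatterplot of the raw data.
p1 <- ggplot(sim1, aes(x, y)) +
  geom_point()
print(p1)

# Step 2: simple linear regression of y on x.
sim1_mod <- lm(y ~ x, data = sim1)
coef(sim1_mod)  # intercept and slope
```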

Note: To model an interaction between x-variables, use * in the formula instead of +.
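As a small illustration of the difference, here is a sketch using modelr's sim3 dataset (an assumed choice; it has a numeric predictor x1 and a categorical predictor x2):

```r
library(modelr)  # provides sim3, with predictors x1 (numeric) and x2 (factor)

# `+` fits additive effects only; `*` also adds the interaction terms.
mod_additive    <- lm(y ~ x1 + x2, data = sim3)
mod_interaction <- lm(y ~ x1 * x2, data = sim3)  # same as y ~ x1 + x2 + x1:x2

# The interaction model has extra x1:x2 coefficients.
names(coef(mod_interaction))
```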

##### 3. Create a new data frame with model prediction data.

Note: add_predictions() adds a single new column with model predictions, spread_predictions() adds one column for each model, and gather_predictions() adds two columns, model and pred, and repeats the input rows for each model.
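A sketch of step 3 and of the three prediction helpers side by side. The second model, `sim1_quad`, is a hypothetical addition included only so that the multi-model helpers have something to compare.

```r
library(modelr)

sim1_mod  <- lm(y ~ x, data = sim1)
sim1_quad <- lm(y ~ poly(x, 2), data = sim1)  # hypothetical second model

# One row per unique x value in the data.
grid <- data_grid(sim1, x)

one_col <- add_predictions(grid, sim1_mod)                # adds column `pred`
wide    <- spread_predictions(grid, sim1_mod, sim1_quad)  # one column per model
long    <- gather_predictions(grid, sim1_mod, sim1_quad)  # adds `model` and `pred`
```

The long form is usually the most convenient for ggplot2, since `model` can be mapped to colour or used for faceting.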

##### 4. Graph the initial data with the fitted model.

##### 6. Graph the initial x-values against their residuals.

## Transformations

Transformations can be performed inside the model formula. If a transformation uses +, -, *, or ^, wrap it in I() so that R evaluates it as arithmetic instead of treating it as part of the model specification. Let’s go through an example workflow again, this time with a polynomial transformation.
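To see why I() matters, here is a minimal sketch on simulated data (the data frame `df` is made up for illustration and is not from the original post):

```r
# Simulated data, purely for illustration.
set.seed(1)
df <- data.frame(x = seq(0, 3, length.out = 50))
df$y <- 1 + 2 * df$x^2 + rnorm(50, sd = 0.5)

# Without I(), ^ is formula syntax: x^2 means "x crossed with itself", i.e. just x.
mod_plain <- lm(y ~ x^2, data = df)

# With I(), x^2 is evaluated arithmetically, giving a quadratic term.
mod_quad <- lm(y ~ x + I(x^2), data = df)

names(coef(mod_plain))  # "(Intercept)" "x"
names(coef(mod_quad))   # "(Intercept)" "x" "I(x^2)"
```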

##### 1. Create and graph the initial data.

##### 3. Put predictions into a data frame.

Note: seq_range() provides a specified number of values between the minimum and maximum of a variable, which can be useful for graphing.
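A sketch of how seq_range() fits into the prediction step. The simulated data and the polynomial degrees are assumptions made for this example, since the post's own dataset is not shown here.

```r
library(tibble)
library(modelr)

# Simulated non-linear data (an assumed stand-in for the post's dataset).
set.seed(42)
sim <- tibble(
  x = seq(0, 3.5 * pi, length.out = 50),
  y = 4 * sin(x) + rnorm(length(x))
)

# Fit polynomials of increasing degree.
mod1 <- lm(y ~ poly(x, 1), data = sim)
mod3 <- lm(y ~ poly(x, 3), data = sim)
mod5 <- lm(y ~ poly(x, 5), data = sim)

# seq_range(x, n) returns n evenly spaced values from min(x) to max(x),
# which gives smooth fitted curves when plotted.
grid  <- data_grid(sim, x = seq_range(x, n = 100))
preds <- gather_predictions(grid, mod1, mod3, mod5, .pred = "y")
```

Naming the prediction column `y` lets the same aesthetics serve both the raw points and the fitted lines in a plot.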

##### 4. Graph the initial data with the fitted models.

##### 6. Graph the initial x-values against their residuals.

## Many Models

If you have a complex dataset, it may be possible to unpack the data using many simple models. For example, the gapminder dataset contains data on the life expectancy, among other variables, in many countries over the course of 50 years. With this data, let’s explore how life expectancy changes over time for each country, and dig into which countries deviate significantly from the rest of the world.

##### 1. Graph the initial data.

##### 5. Unnest and plot the residuals by continent.

Notice that the model performs the worst in Africa.

##### 8. Graph the R-Squared coefficient to investigate where the model breaks down.

##### 9. Pull out the problem countries into a separate table. Then, plot the life expectancies of those countries over time by joining the table with the original data.
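The code for this workflow is not shown in this copy of the post. The sketch below follows the approach in R for Data Science, the post's stated source: nest by country, fit one linear model per country, then inspect residuals and R-squared. The helper `country_model`, the object names, and the 0.25 cutoff are assumptions made for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)
library(modelr)
library(broom)
library(gapminder)

# Nest the data so each row holds one country's full time series.
by_country <- gapminder %>%
  group_by(country, continent) %>%
  nest() %>%
  ungroup()

# Hypothetical helper: a linear trend of life expectancy over time.
country_model <- function(df) lm(lifeExp ~ year, data = df)

# Fit one model per country and attach residuals to each nested data frame.
by_country <- by_country %>%
  mutate(
    model  = map(data, country_model),
    resids = map2(data, model, add_residuals)
  )

# Unnest the residuals and plot them by continent.
resids <- by_country %>%
  select(country, continent, resids) %>%
  unnest(resids)

ggplot(resids, aes(year, resid, group = country)) +
  geom_line(alpha = 1 / 3) +
  facet_wrap(~continent)

# One row of model-quality metrics per country.
metrics_tbl <- by_country %>%
  mutate(metrics = map(model, broom::glance)) %>%
  select(country, continent, metrics) %>%
  unnest(metrics)

# R-squared by continent shows where the linear model breaks down.
ggplot(metrics_tbl, aes(continent, r.squared)) +
  geom_jitter(width = 0.3)

# Countries with a poor fit; the 0.25 cutoff is an assumed threshold.
bad_fit <- filter(metrics_tbl, r.squared < 0.25)

# Join back to the original data and plot the problem countries.
gapminder %>%
  semi_join(bad_fit, by = "country") %>%
  ggplot(aes(year, lifeExp, colour = country)) +
  geom_line()
```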

Note: History explains where the model breaks down: the graph reveals the devastating effects of the Rwandan genocide and of HIV/AIDS in African countries in the 1990s.

## Data Structures: List-Columns

The life expectancy example above made use of list-column data structures. In general, an effective list-column pipeline will take the following form:

1. Create the list-column.
2. Create other intermediate list-columns by transforming existing list-columns.
3. Simplify the list-column back down to a data frame or atomic vector.
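The three steps above can be sketched in a single pipeline. This example uses the built-in mtcars dataset as an assumed stand-in; the column names `model` and `slope` are illustrative.

```r
library(dplyr)
library(tidyr)
library(purrr)

result <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%          # 1. create the list-column `data`
  ungroup() %>%
  mutate(
    # 2. intermediate list-column: one fitted model per group
    model = map(data, ~ lm(mpg ~ wt, data = .x)),
    # 3. simplify: pull one number out of each model
    slope = map_dbl(model, ~ coef(.x)[["wt"]])
  )
```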

### Creating List-Columns

• nest() converts a grouped data frame into a nested data frame with a list-column of data frames.
• mutate() applied with vectorized functions that return a list will create list-columns.
• summarize() applied with summary functions that return multiple results will create list-columns, provided the result is wrapped in list().
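Each of the three routes can be sketched briefly. The datasets and column names here are illustrative choices, not taken from the original post.

```r
library(dplyr)
library(tidyr)
library(stringr)

# nest(): one row per group, with a list-column of data frames.
nested <- mtcars %>%
  group_by(cyl) %>%
  nest()

# mutate() with a vectorized function that returns a list:
# str_split() gives one character vector per input string.
split_tbl <- tibble(x = c("a,b,c", "d,e")) %>%
  mutate(parts = str_split(x, ","))

# summarize() with a multi-value summary wrapped in list().
quantiles <- mtcars %>%
  group_by(cyl) %>%
  summarize(q = list(quantile(mpg)))
```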

### Simplifying List-Columns

In order to manipulate and visualize the data, you will need to simplify list-columns.

• If you want a single value from the list-column, use mutate() with map_lgl(), map_int(), map_dbl(), map_chr() to create an atomic vector.
• If you want many values from the list-column, use unnest() to convert list-columns back to regular columns, repeating the rows as many times as necessary.
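Both simplification routes, sketched on a small hand-made tibble (the data and column names are made up for illustration):

```r
library(dplyr)
library(tidyr)
library(purrr)

df <- tibble(
  id     = c("a", "b"),
  values = list(c(1, 2, 3), c(10, 20))
)

# A single value per element: map_*() returns an atomic vector.
summary_tbl <- df %>%
  mutate(
    n   = map_int(values, length),
    avg = map_dbl(values, mean)
  )

# Many values per element: unnest() repeats `id` once per value.
long <- df %>% unnest(values)
```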

### Turning Models into Tidy Data

The following three functions help turn models into tidy data, and often make use of list-columns.

• glance() returns a row for each model, where each column gives a model summary.
• tidy() returns a row for each coefficient in the model, where each column has info about estimate/variability.
• augment() returns a row for each row in data, adding extra values like residuals and influence stats.
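A quick sketch of the three functions on one model. It assumes the broom package and uses the built-in mtcars dataset for illustration.

```r
library(broom)

mod <- lm(mpg ~ wt, data = mtcars)

model_summary <- glance(mod)   # one row: r.squared, AIC, and other fit metrics
coefficients  <- tidy(mod)     # one row per coefficient: estimate, std.error, ...
observations  <- augment(mod)  # one row per observation: .fitted, .resid, .hat, ...
```

Combined with map() over a list-column of models, each of these can be unnested into a regular data frame.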

<sup>1</sup> This post is meant for readers looking for a refresher on basic modeling in R. The content is based on chapters twenty-two through twenty-five of R for Data Science by Hadley Wickham & Garrett Grolemund.