This post provides an introduction to modeling in R without going into statistical details.^{1} We’ll go over some examples of fitting models to data, and then examine the list-column data structure. The following libraries are used:
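A setup along these lines should cover everything below. The exact package list is an assumption, inferred from the functions and datasets mentioned in this post.

```r
# Assumed package set, inferred from the functions used later in the post.
# library(tidyverse) would attach the first four in one call.
library(ggplot2)    # plotting
library(dplyr)      # mutate(), summarize(), group_by()
library(tidyr)      # nest(), unnest()
library(purrr)      # map() and its typed variants
library(modelr)     # sim1, add_predictions(), add_residuals(), seq_range()
library(broom)      # glance(), tidy(), augment()
library(gapminder)  # gapminder dataset
```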
Model Basics
Let’s go through an example exploratory workflow of fitting a model to a continuous variable using the simulated dataset sim1:
1. Graph the initial data.
2. Fit a simple linear regression to the data.
Note: To model an interaction between x-variables, use * instead of + in the model formula, as in y ~ x1 * x2.
3. Create a new data frame with model prediction data.
Note: add_predictions() adds a single new column with model predictions, spread_predictions() adds one column for each model, and gather_predictions() adds two columns, model and prediction, and repeats the input rows for each model.
4. Graph the initial data with the fitted model.
5. Add the residuals to the initial data.
6. Graph the initial x-values against their residuals.
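The six steps above can be sketched roughly as follows. This is a minimal version, assuming the sim1 dataset that ships with modelr; the object names (sim1_mod, grid, sim1_resid) are illustrative choices, not taken from the original post.

```r
library(ggplot2)
library(modelr)

# 1. Graph the initial data
ggplot(sim1, aes(x, y)) +
  geom_point()

# 2. Fit a simple linear regression
sim1_mod <- lm(y ~ x, data = sim1)

# 3. Build a data frame of the unique x values and add predictions to it
grid <- data_grid(sim1, x)
grid <- add_predictions(grid, sim1_mod)  # adds a `pred` column

# 4. Graph the initial data with the fitted model on top
ggplot(sim1, aes(x, y)) +
  geom_point() +
  geom_line(aes(y = pred), data = grid, colour = "red")

# 5. Add the residuals to the initial data
sim1_resid <- add_residuals(sim1, sim1_mod)  # adds a `resid` column

# 6. Graph the x-values against their residuals
ggplot(sim1_resid, aes(x, resid)) +
  geom_hline(yintercept = 0, colour = "grey60") +
  geom_point()
```

If the residual plot looks like random noise around zero, the linear model has captured the main pattern in the data.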
Transformations
Transformations can be performed inside of the model formula, but if the transformation contains +, -, *, or ^, wrap it in I() so R doesn’t treat it as part of the model specification. Let’s go through an example workflow again, this time with a polynomial transformation.
1. Create and graph the initial data.
2. Fit linear regressions with polynomials up to degree five to the data.
3. Put predictions into a data frame.
Note: seq_range() provides a specified number of values between the minimum and maximum of a variable, which can be useful for graphing.
4. Graph the initial data with the fitted models.
5. Add the residuals to the initial data.
6. Graph the initial x-values against their residuals.
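A sketch of this workflow might look like the following. The data here is hypothetical (a noisy sine wave generated on the spot), and the names sim_poly, mod1 through mod5, and mod2_raw are illustrative assumptions rather than the post's originals.

```r
library(ggplot2)
library(modelr)

# 1. Create and graph the initial data (hypothetical noisy sine wave)
set.seed(42)
sim_poly <- data.frame(x = seq(0, 3.5 * pi, length.out = 50))
sim_poly$y <- 4 * sin(sim_poly$x) + rnorm(nrow(sim_poly))
ggplot(sim_poly, aes(x, y)) +
  geom_point()

# 2. Fit polynomials of degree one through five
mod1 <- lm(y ~ poly(x, 1), data = sim_poly)
mod2 <- lm(y ~ poly(x, 2), data = sim_poly)
mod3 <- lm(y ~ poly(x, 3), data = sim_poly)
mod4 <- lm(y ~ poly(x, 4), data = sim_poly)
mod5 <- lm(y ~ poly(x, 5), data = sim_poly)

# I() keeps ^ as arithmetic; without it, y ~ x + x^2 is the same as y ~ x
mod2_raw <- lm(y ~ x + I(x^2), data = sim_poly)

# 3. Put predictions on an evenly spaced grid of x values
grid <- data_grid(sim_poly, x = seq_range(x, n = 50))
grid <- gather_predictions(grid, mod1, mod2, mod3, mod4, mod5)

# 4. Graph the initial data with each fitted model in its own panel
ggplot(sim_poly, aes(x, y)) +
  geom_point() +
  geom_line(aes(y = pred), data = grid, colour = "red") +
  facet_wrap(~ model)

# 5. Add the residuals from every model to the initial data
sim_resid <- gather_residuals(sim_poly, mod1, mod2, mod3, mod4, mod5)

# 6. Graph the x-values against their residuals, one panel per model
ggplot(sim_resid, aes(x, resid)) +
  geom_hline(yintercept = 0, colour = "grey60") +
  geom_point() +
  facet_wrap(~ model)
```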
Many Models
If you have a complex dataset, it may be possible to unpack the data using many simple models. For example, the gapminder dataset contains data on the life expectancy, among other variables, in many countries over the course of 50 years. With this data, let’s explore how life expectancy changes over time for each country, and dig into which countries deviate significantly from the rest of the world.
1. Graph the initial data.
2. Nest the data frame to create one row for each country and continent, with a new column of data frames containing the rest of the country information.
3. Create a linear model and run it over each country, storing the results in a new column.
4. Add residuals to the data.
5. Unnest and plot the residuals by continent. Notice that the model performs the worst in Africa.
6. Use glance() and unnest() to add model-level summary statistics to the data frame.
7. Graph the R-squared values to investigate where the model breaks down.
8. Pull out the problem countries into a separate table. Then, plot the life expectancies of those countries over time by joining the table with the original data.
Note: History explains where the model breaks down: this graph reveals the devastating effects of the Rwandan genocide and HIV/AIDS in African countries in the 1990s.
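The whole workflow can be sketched as below. The helper country_model(), the intermediate object names, and the 0.25 R-squared cutoff for a "problem" country are assumptions made for illustration; the cutoff in particular is arbitrary.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)
library(modelr)
library(broom)
library(gapminder)

# Graph the initial data: one line per country
ggplot(gapminder, aes(year, lifeExp, group = country)) +
  geom_line(alpha = 1 / 3)

# Nest: one row per country/continent, the rest goes into `data`
by_country <- gapminder %>%
  group_by(country, continent) %>%
  nest() %>%
  ungroup()

# Fit the same linear model to every country (hypothetical helper)
country_model <- function(df) lm(lifeExp ~ year, data = df)
by_country <- by_country %>%
  mutate(model = map(data, country_model))

# Add residuals: one data frame of residuals per country
by_country <- by_country %>%
  mutate(resids = map2(data, model, add_residuals))

# Unnest and plot the residuals by continent
resids <- by_country %>%
  select(country, continent, resids) %>%
  unnest(resids)
ggplot(resids, aes(year, resid, group = country)) +
  geom_line(alpha = 1 / 3) +
  facet_wrap(~ continent)

# One row of model summary statistics per country
fit_stats <- by_country %>%
  mutate(stats = map(model, broom::glance)) %>%
  select(country, continent, stats) %>%
  unnest(stats)

# Graph R-squared to see where the model breaks down
ggplot(fit_stats, aes(continent, r.squared)) +
  geom_jitter(width = 0.5)

# Pull out the poorly fitted countries (assumed cutoff) and plot them
bad_fit <- filter(fit_stats, r.squared < 0.25)
gapminder %>%
  semi_join(bad_fit, by = "country") %>%
  ggplot(aes(year, lifeExp, colour = country)) +
  geom_line()
```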
Data Structures: List-Columns
The life expectancy example above made use of list-column data structures. In general, an effective list-column pipeline will take the following form:
1. Create the list-column.
2. Create other intermediate list-columns by transforming existing list-columns.
3. Simplify the list-column back down to a data frame or atomic vector.
Creating List-Columns
nest() converts a grouped data frame into a nested data frame with a list-column of data frames.
mutate() applied with vectorized functions that return a list will create list-columns.
summarize() applied with summary functions that return multiple results will create list-columns, provided the results are wrapped in list().
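As a rough illustration, each of the three approaches might look like this. The toy string data and the object names are assumptions made for the example.

```r
library(dplyr)
library(tidyr)
library(gapminder)

# nest(): one row per country, everything else in a list-column `data`
nested <- gapminder %>%
  group_by(country) %>%
  nest()

# mutate() with a vectorized function that returns a list:
# strsplit() returns one character vector per input string
split_df <- tibble(x = c("a,b,c", "d,e")) %>%
  mutate(parts = strsplit(x, ","))

# summarize() with a multi-value summary, wrapped in list()
quantiles <- gapminder %>%
  group_by(continent) %>%
  summarize(q = list(quantile(lifeExp)))
```

In each case the new column is an ordinary list, so every row can hold a data frame or a vector of any length.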
Simplifying List-Columns
In order to manipulate and visualize the data, you will need to simplify list-columns.
If you want a single value from each element of the list-column, use mutate() with map_lgl(), map_int(), map_dbl(), or map_chr() to create an atomic vector.
If you want many values from the list-column, use unnest() to convert list-columns back to regular columns, repeating the rows as many times as necessary.
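Both approaches can be sketched on a small hand-made tibble. The data and column names here are hypothetical.

```r
library(dplyr)
library(tidyr)
library(purrr)

# A toy tibble with a list-column of integer vectors
df <- tibble(id = 1:3, vals = list(1:2, 3:5, 6L))

# Single value per element: the typed map functions return atomic vectors
summary_df <- df %>%
  mutate(
    n      = map_int(vals, length),
    avg    = map_dbl(vals, mean),
    type   = map_chr(vals, typeof),
    is_int = map_lgl(vals, is.integer)
  )

# Many values per element: unnest() repeats `id` once per value
long <- df %>%
  unnest(vals)
```

The typed map functions raise an error if an element does not yield exactly one value of the expected type, which makes them a useful check as well as a simplifier.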
Turning Models into Tidy Data
The following three functions help turn models into tidy data, and often make use of list-columns.
glance() returns a row for each model, where each column gives a model summary.
tidy() returns a row for each coefficient in the model, where each column has information about the estimate or its variability.
augment() returns a row for each row in the data, adding extra values like residuals and influence statistics.
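For a single model, the three functions can be compared side by side. This sketch assumes the sim1 dataset from modelr; the object names are illustrative.

```r
library(broom)
library(modelr)

mod <- lm(y ~ x, data = sim1)

# One row for the whole model: r.squared, sigma, AIC, and so on
model_summary <- glance(mod)

# One row per coefficient: term, estimate, std.error, statistic, p.value
coefs <- tidy(mod)

# One row per observation: original columns plus .fitted, .resid, and others
obs <- augment(mod, data = sim1)
```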

This post is meant for a person who is looking for a refresher on basic modeling in R. The content in this post is based on chapters twenty-two through twenty-five of R for Data Science by Hadley Wickham & Garrett Grolemund.