# Data Wrangling in R: Factors

In which we review some basic information about factors, a data type used to work with categorical variables.

February 21, 2020 - 9 minute read

### What are factors, and why should we use them?

Factors are a data type designed to work with categorical variables, which have a fixed and known set of possible values.[1] They are more convenient to work with than strings for two main reasons: they help with sorting data, and they help with identifying valid categories (for example, by catching typos).

For a one-page reference on manipulating factors, check out the RStudio Factors Cheat Sheet.

##### 1. Factors help with sorting data

Say we are investigating an issue that affects people differently at different stages of life, and we have a variable for the stage of life that people are in – infants, toddlers, children, adolescents, adults, and seniors. If we sort them as strings, they are sorted alphabetically, but if we turn them into a factor whose levels are listed in age order, they sort in the proper age order.
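A minimal sketch of the difference, using made-up data; the vector names `stages` and `stage_levels` are illustrative assumptions, not taken from the original post:

```r
# Illustrative data: stages of life in arbitrary order
stages <- c("adults", "infants", "seniors", "toddlers", "children", "adolescents")

# As strings, sort() is alphabetical
sort(stages)
# "adolescents" "adults" "children" "infants" "seniors" "toddlers"

# List the levels in age order when creating the factor
stage_levels <- c("infants", "toddlers", "children", "adolescents", "adults", "seniors")
stages_fct <- factor(stages, levels = stage_levels)

# As a factor, sort() follows the level order instead
sort(stages_fct)
```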

##### 2. Factors help with identifying valid categories

Considering the same example, let’s take a slightly different list of stages of life that contains one category that doesn’t belong and one typo. As strings, nothing flags these invalid values. However, if we convert the strings into a factor with the valid levels specified, the invalid values are replaced with missing values (NA).
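This behaviour can be sketched as follows. The data are invented for illustration: "teenagers" stands in for the category that doesn't belong and "senoirs" for the typo.

```r
# Valid categories, in age order
stage_levels <- c("infants", "toddlers", "children", "adolescents", "adults", "seniors")

# Illustrative raw data with one invalid category and one typo
stages_raw <- c("infants", "teenagers", "adults", "senoirs")

# Values that are not in `levels` become NA
stages_checked <- factor(stages_raw, levels = stage_levels)
stages_checked

# Find the offending entries in the raw data
stages_raw[is.na(stages_checked)]
```

Checking for NA right after the conversion is one simple way to spot values that need cleaning.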

• If you do not specify factor levels, the levels are taken from the unique values in the data, sorted alphabetically.
• If you would like the order of levels to match the order of appearance in the data, set the levels as follows: factor(data, levels = unique(data))
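A short sketch of both cases, using an invented vector `x`:

```r
x <- c("toddlers", "infants", "seniors", "infants")

# Default: levels are the unique values, sorted alphabetically
levels(factor(x))
# "infants" "seniors" "toddlers"

# Levels in order of first appearance in the data
levels(factor(x, levels = unique(x)))
# "toddlers" "infants" "seniors"
```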

## Working with Factors

We can reorder factor levels with these functions (often useful for visualizations):

• fct_inorder() orders levels by their first appearance in the data
• fct_reorder() orders levels by the values of another variable
• fct_relevel() moves the specified levels to the front of the level order
• fct_infreq() orders levels by their frequency, most common first
• fct_rev() reverses the order of the levels
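To make the differences concrete, here is a small sketch on a toy factor `f` (an assumption for illustration). fct_reorder() is left out because it needs a second variable.

```r
library(forcats)

# Toy factor: default levels are alphabetical ("a", "b", "c")
f <- factor(c("b", "b", "a", "c", "c", "c"))
levels(f)

levels(fct_inorder(f))       # "b" "a" "c": order of first appearance
levels(fct_infreq(f))        # "c" "b" "a": most frequent first
levels(fct_rev(f))           # "c" "b" "a": reversed level order
levels(fct_relevel(f, "c"))  # "c" "a" "b": "c" moved to the front
```

None of these functions change the values in the data; they only change the order of the levels.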

And we can modify factors with these functions:

• fct_recode() changes the labels of the specified factor levels
• fct_collapse() collapses many factor levels into fewer levels
• fct_lump() lumps together the least or most common factor levels into an “other” category

We’ll explore examples of ordering and modifying factors using the gss_cat dataset, a sample of categorical variables from the General Social Survey that ships with the forcats package in the tidyverse. Here is a preview of the data:
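One way to produce such a preview, assuming the forcats package is installed (the exact calls used in the original post are not known):

```r
library(forcats)

# First few rows of the dataset
head(gss_cat)

# Which columns are stored as factors?
sapply(gss_cat, is.factor)

# Levels of one of the factor columns
levels(gss_cat$marital)
```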

### Examples: Ordering Factors

Order factor levels by another variable with fct_reorder(), for when the factor is mapped to a position aesthetic:
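A sketch in the spirit of the R for Data Science chapter this post is based on, assuming dplyr and ggplot2 are installed; the summary table `relig_summary` is an illustrative name:

```r
library(forcats)
library(dplyr)
library(ggplot2)

# Average hours of TV watched per day, by religion
relig_summary <- gss_cat %>%
  group_by(relig) %>%
  summarise(tvhours = mean(tvhours, na.rm = TRUE), n = n())

# relig is mapped to the y position, so reorder its levels by tvhours
p <- ggplot(relig_summary, aes(x = tvhours, y = fct_reorder(relig, tvhours))) +
  geom_point() +
  labs(y = "relig")

# Run print(p) to draw the plot
```

Without fct_reorder(), the points follow the original level order of relig, which makes the overall pattern harder to read.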

Order factor levels by another variable with fct_reorder2(), for when the factor is mapped to a non-position aesthetic such as colour:
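A hedged sketch, again modelled on the R for Data Science chapter and assuming dplyr and ggplot2 are installed. fct_reorder2() orders the levels by the y values associated with the largest x values, so the legend order lines up with the right-hand ends of the lines. The table `by_age` is an illustrative name.

```r
library(forcats)
library(dplyr)
library(ggplot2)

# Proportion of each marital status within each age
by_age <- gss_cat %>%
  filter(!is.na(age)) %>%
  count(age, marital) %>%
  group_by(age) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

# marital is mapped to colour, so reorder it by prop at the largest ages
p <- ggplot(by_age, aes(x = age, y = prop,
                        colour = fct_reorder2(marital, age, prop))) +
  geom_line() +
  labs(colour = "marital")

# Run print(p) to draw the plot
```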

Bring specified levels to the front of the level order with fct_relevel():
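For instance, the rincome variable in gss_cat has a "Not applicable" level that sits at the end of the level order. A minimal sketch of moving it to the front (the choice of variable and level is an assumption for illustration):

```r
library(forcats)

# Original level order: "Not applicable" comes last
levels(gss_cat$rincome)

# Move "Not applicable" to the front; other levels keep their order
rincome_releveled <- fct_relevel(gss_cat$rincome, "Not applicable")
levels(rincome_releveled)
```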

Order factor levels by their frequency with fct_infreq():
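A small sketch using the marital variable from gss_cat (the choice of variable is an assumption):

```r
library(forcats)

# Levels ordered from most to least common
marital_by_freq <- fct_infreq(gss_cat$marital)
levels(marital_by_freq)

# Counts now appear in decreasing order
table(marital_by_freq)

# Combine with fct_rev() for increasing order, e.g. for bar charts
levels(fct_rev(marital_by_freq))
```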

### Examples: Modifying Factor Levels

Change the values of each level with fct_recode():
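A sketch on the partyid variable, following the R for Data Science chapter; the new labels are illustrative. In fct_recode(), each argument is written as "new label" = "old label", and levels that are not mentioned are left as they are.

```r
library(forcats)

partyid_recoded <- fct_recode(gss_cat$partyid,
  "Republican, strong"    = "Strong republican",
  "Republican, weak"      = "Not str republican",
  "Independent, near rep" = "Ind,near rep",
  "Independent, near dem" = "Ind,near dem",
  "Democrat, weak"        = "Not str democrat",
  "Democrat, strong"      = "Strong democrat"
)

levels(partyid_recoded)
```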

Collapse many specific levels into fewer levels with fct_collapse():
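A sketch on the same partyid variable, again modelled on the R for Data Science chapter; the group names `other`, `rep`, `ind`, and `dem` are illustrative. Each argument names a new level and lists the old levels that fold into it.

```r
library(forcats)

partyid_collapsed <- fct_collapse(gss_cat$partyid,
  other = c("No answer", "Don't know", "Other party"),
  rep   = c("Strong republican", "Not str republican"),
  ind   = c("Ind,near rep", "Independent", "Ind,near dem"),
  dem   = c("Not str democrat", "Strong democrat")
)

# Ten original levels reduced to four
table(partyid_collapsed)
```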

Lump together least or most common factor levels with fct_lump():
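A sketch on the relig variable; keeping the 10 most common levels is an arbitrary choice for illustration. Note that relig already has a level called "Other", so the lumped levels are merged into it here.

```r
library(forcats)

# Keep the 10 most common religions, lump the rest into "Other"
relig_lumped <- fct_lump(gss_cat$relig, n = 10)

# Counts per level, largest first
sort(table(relig_lumped), decreasing = TRUE)
```

The `other_level` argument can be used to give the lumped category a different name.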

1. This post is meant for a person who is looking for a refresher on factors in R, and the content in this post is based on chapter fifteen of R for Data Science by Hadley Wickham & Garrett Grolemund.