In which we review some basic information about factors, a data type used to work with categorical variables.
What are factors, and why should we use them?
Factors are a data type designed to work with categorical variables, which have a fixed and known set of possible values.^{1} They are more convenient to work with than strings for two main reasons: they help with sorting data, and help with identifying valid categories (i.e. prevent typos).
1. Factors help with sorting data
Say we are investigating an issue that affects people differently at different stages of life, and we have a variable for the stage of life that each person is in: infants, toddlers, children, adolescents, adults, and seniors. If we sort these values as strings, they are sorted alphabetically, but if we turn them into a factor with the levels listed in age order, they are sorted in the proper age order.
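As a minimal sketch of the sorting behaviour described above, using base R only (the vector names `stages`, `stage_levels`, and `stages_fct` are our own choices for illustration):

```r
# Stages of life stored as plain strings (the order here is arbitrary)
stages <- c("adults", "infants", "seniors", "children", "toddlers", "adolescents")

# Strings sort alphabetically, which scrambles the age order
sort(stages)

# Listing the levels in age order tells R what the proper order is
stage_levels <- c("infants", "toddlers", "children", "adolescents", "adults", "seniors")
stages_fct <- factor(stages, levels = stage_levels)

# Factors sort by level order rather than alphabetically
sort(stages_fct)
```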
2. Factors help with identifying valid categories
Considering the same example, let’s take a slightly different initial list of stages of life, with one category that doesn’t belong and one typo in the data. As strings, nothing marks these entries as invalid. However, if we convert the strings into a factor with the valid levels specified, the invalid entries are replaced with missing values (NA).
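A small sketch of this check in base R might look like the following. The out-of-place category ("teenagers") and the typo ("adlts") are made-up values for illustration:

```r
# The valid stages of life, in age order
stage_levels <- c("infants", "toddlers", "children", "adolescents", "adults", "seniors")

# Hypothetical raw data: "teenagers" is not a valid level, "adlts" is a typo
stages_raw <- c("infants", "teenagers", "adlts", "seniors", "children")

# As strings, the bad entries pass through unnoticed
sort(stages_raw)

# As a factor with explicit levels, anything outside the levels becomes NA
stages_fct <- factor(stages_raw, levels = stage_levels)
stages_fct

# Locate the entries that did not match a valid level
which(is.na(stages_fct))
```

Note that base `factor()` makes this conversion silently, so it is worth checking for NA values after converting.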
Additional notes:
If you do not specify factor levels, the levels are taken from the unique values in the data and, for character data, sorted alphabetically.
If you would like to match the order of levels with the order of appearance in the data, then set levels as follows: factor(data, levels = unique(data))
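To make the two notes above concrete, here is a short base R sketch. The vector `data` and its values are invented for illustration:

```r
# Hypothetical character data with a repeated value
data <- c("medium", "low", "high", "low")

# Default: levels are the unique values, sorted alphabetically
levels(factor(data))

# Levels in order of first appearance in the data
levels(factor(data, levels = unique(data)))
```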
Working with Factors
We can reorder factors with these functions (often useful for visualizations):
fct_inorder() orders levels by their first appearance in the data
fct_reorder() orders levels by the values of another variable
fct_relevel() moves specified levels to the front of the level order
fct_infreq() orders levels by their frequency in the data (most frequent first)
fct_rev() reverses the order of factor levels
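Two of these functions, fct_inorder() and fct_rev(), are simple enough to try on a toy factor. The sketch below assumes the forcats package is installed; the factor `sizes` is invented for illustration:

```r
library(forcats)

# A toy factor; without explicit levels they default to alphabetical order
sizes <- factor(c("low", "high", "medium", "high"))
levels(sizes)

# Levels in order of first appearance in the data
levels(fct_inorder(sizes))

# Levels in reverse of their current (alphabetical) order
levels(fct_rev(sizes))
```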
And we can modify factors with these functions:
fct_recode() changes the values of each factor level
fct_collapse() collapses many factor levels into fewer levels
fct_lump() lumps together the least or most common factor levels into an “other” category
We’ll explore examples of ordering and modifying factors using the gss_cat dataset, a sample of categorical variables from the General Social Survey that ships with the forcats package in the tidyverse. Here is a preview of the data:
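One way to produce such a preview, assuming forcats is installed, is sketched below. We use `head()` here; printing `gss_cat` directly would work as well:

```r
library(forcats)  # provides the gss_cat dataset

# First few rows of the data
head(gss_cat)

# Levels of one of the factor columns
levels(gss_cat$marital)
```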
Examples: Ordering Factors
Order factor levels based on another variable when the factor is mapped to a position aesthetic (x or y) with fct_reorder():
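A sketch of this, assuming forcats, dplyr, and ggplot2 are installed. The summary table `relig_summary` and the choice of average TV hours as the ordering variable are our own for illustration:

```r
library(forcats)
library(dplyr)
library(ggplot2)

# Average daily TV hours for each religion
relig_summary <- gss_cat %>%
  group_by(relig) %>%
  summarise(tvhours = mean(tvhours, na.rm = TRUE))

# Reorder the levels of relig by tvhours so the points rise steadily along the y axis
p <- ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) +
  geom_point() +
  labs(y = "relig")
p
```

Without fct_reorder(), the religions would appear in their stored level order, which makes the pattern in the plot harder to read.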
Order factor levels based on other variables when the factor is mapped to a non-position aesthetic (such as colour) with fct_reorder2():
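A sketch of this, assuming forcats, dplyr, and ggplot2 are installed. The table `by_age` and the column `prop` are our own names for illustration. By default, fct_reorder2() orders the levels by the y values associated with the largest x values, so the legend order lines up with the right-hand end of the lines:

```r
library(forcats)
library(dplyr)
library(ggplot2)

# Proportion of each marital status within each age
by_age <- gss_cat %>%
  filter(!is.na(age)) %>%
  count(age, marital) %>%
  group_by(age) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

# marital is mapped to colour, a non-position aesthetic
p <- ggplot(by_age, aes(age, prop, colour = fct_reorder2(marital, age, prop))) +
  geom_line() +
  labs(colour = "marital")
p
```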
Move specified levels to the front of the level order with fct_relevel():
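A sketch of this, assuming forcats, dplyr, and ggplot2 are installed. The summary table `rincome_summary` and the choice to move the "Not applicable" level are ours for illustration:

```r
library(forcats)
library(dplyr)
library(ggplot2)

# Average age for each reported income bracket
rincome_summary <- gss_cat %>%
  group_by(rincome) %>%
  summarise(age = mean(age, na.rm = TRUE))

# Move "Not applicable" to the front so it sits with the other non-answers
releveled <- fct_relevel(rincome_summary$rincome, "Not applicable")
levels(releveled)

# The same call can be used directly inside aes()
p <- ggplot(rincome_summary, aes(age, fct_relevel(rincome, "Not applicable"))) +
  geom_point() +
  labs(y = "rincome")
p
```

The income brackets already have a meaningful order, so moving a single level is a better fit here than reordering everything by another variable.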
Order factor levels by their frequency in the data with fct_infreq():
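A sketch of this, assuming forcats and ggplot2 are installed. The variable name `marital_by_freq` is our own for illustration:

```r
library(forcats)
library(ggplot2)

# Levels ordered from most to least frequent
marital_by_freq <- fct_infreq(gss_cat$marital)
levels(marital_by_freq)

# Bars appear in decreasing order of count
p <- ggplot(gss_cat, aes(fct_infreq(marital))) +
  geom_bar() +
  labs(x = "marital")
p
```

Wrapping the result in fct_rev() would flip the bars into increasing order instead.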
Examples: Modifying Factor Levels
Change the values of each level with fct_recode():
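A sketch of this, assuming forcats and dplyr are installed. The new labels on the left-hand side are illustrative choices; the old labels on the right are existing levels of `partyid`. Levels that are not mentioned are left as they are:

```r
library(forcats)
library(dplyr)

# New label on the left, existing level on the right
recoded <- gss_cat %>%
  mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat"
  ))

# Counts under the new labels
count(recoded, partyid)
```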
Collapse many specific levels into fewer levels with fct_collapse():
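A sketch of this, assuming forcats and dplyr are installed. The four group names (`other`, `rep`, `ind`, `dem`) are our own for illustration; each one is given a vector of existing `partyid` levels to merge:

```r
library(forcats)
library(dplyr)

# Each new level is defined by the vector of old levels it absorbs
collapsed <- gss_cat %>%
  mutate(partyid = fct_collapse(partyid,
    other = c("No answer", "Don't know", "Other party"),
    rep   = c("Strong republican", "Not str republican"),
    ind   = c("Ind,near rep", "Independent", "Ind,near dem"),
    dem   = c("Not str democrat", "Strong democrat")
  ))

# Counts for the four collapsed levels
count(collapsed, partyid)
```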
Lump together least or most common factor levels with fct_lump():
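A sketch of this, assuming forcats and dplyr are installed. The choice of `n = 10` (keep the ten most common religions) is arbitrary and only for illustration. Note that `relig` already has a level called "Other", so the lumped levels are merged into it:

```r
library(forcats)
library(dplyr)

# Keep the 10 most common religions and lump the rest into "Other"
lumped <- gss_cat %>%
  mutate(relig = fct_lump(relig, n = 10))

# Counts after lumping, most common first
count(lumped, relig, sort = TRUE)
```

In recent versions of forcats, fct_lump() is also available as more specific variants such as fct_lump_n() and fct_lump_prop(), which may be preferable in new code.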
This post is meant for a person who is looking for a refresher on factors in R, and the content in this post is based on chapter fifteen of R for Data Science by Hadley Wickham & Garrett Grolemund.