# Basic Data Transformation in R

In which we review the fundamentals of transforming data in R, using six key functions in the dplyr package: filter, arrange, select, mutate, summarize, and group by.

February 15, 2020 - 13 minute read -

In this post we’ll be taking a look at basic data transformation in R – namely, six important R functions included in the dplyr package:1

• filter() - filter rows
• arrange() - order rows
• select() - select columns
• mutate() - transform and create variables
• group_by() - group data
• summarize() - summarize data

First things first: Here is the RStudio Data Transformation Cheat Sheet{target=”blank”}, if that’s what you’re looking for.

### Prologue: The Pipe

Before we get into the functions themselves, let’s review an important part of R syntax: the pipe, %>%. The pipe symbol can be read as “then.”

See the R code at the bottom for examples of the pipe in action, and my post on programming in R for a deeper dive into pipes.

## Filter

filter() allows for subsetting observations based on their values. In other words, you can filter the rows in your dataset.

• Filter with comparison operators: == , != , > , >= , < , <=
• Filter with logical operators – and, or not, and exclusive or: &, |, !, xor()
• Filter on rows where x is one of the values in vector y: x %in y
• Determine if a value is missing: is.na()
• Determine if numbers fall within an inclusive range: between()

Note: Don’t forget De Morgan’s law, !(x & y) == !x | !y , !(x | y) = !x & !y

## Arrange

arrange() allows you to re-order the rows of your dataset.

• By default, arrange sorts your data in ascending order
• Sort data in descending order with desc()
• Missing values are placed at the end unless explicitly told otherwise

## Select

select() lets you zoom in on a subset of the data based on the names of columns. This is useful for narrowing data with many variables down to only the ones you care about.

• Select based on column name: starts_with(), ends_with(), contains(), matches()
• Select all columns between two columns, inclusive: (x:y)
• Exclude columns from being selected: -
• Rename a variable without dropping the others: rename()
• Reorder ceertain columns to the very beginning: select() with everything()

## Mutate

mutate() lets you add new columns that are functions of existing columns to the end of the table.

• Create columns with arithmatic operators, +, -, *, /, ^, with aggregate functions, sum, mean
• Create columns with integer division, %/%, and remainder ,%%
• Create columns with logs: log(x), log2(x), log10(x)
• Create columns with offsets, such as leading and lagging values, i.e. x - lead(x), x != lag(x)
• Create columns with cumulative and rolling aggregates: cumsum, cumprod, cummin, cummax, cummean
• Create columns with logical comparisons: <, >, ==, !=
• Create columns with ranking functions: minrank, row_number, percent_rank, ntile
• Select the top or bottom entries in a group top_n()
• Only keep the new variables that you define with transmute()

## Group By

group_by() changes the scope of analysis from the complete dataset to individually defined groups.

It is most useful when used with summarize(), but it can also be used with mutate() and filter() – for example, if you wanted to fine the worst members of each group, find all groups bigger than a threshold, or create group summary metrics.

Use ungroup() to remove groupings.

## Summarize

summarize() collapses a data frame into a single row. If the data is grouped, then it summarizes by group.

• Summarize by location: mean, median
• Summarize by spread: sd, IQR, mad, (std. deviation, interquartile range, median absolute deviation)
• Summarize by rank: min, max, quantile
• Summarize by position: first, nth, last
• Summarize by count: n(), sum(is.na(x)), n_distinct(x)
• Use the na.rm parameter for functions like mean to specify whether to remove missing values before computation.
• Use a count, n(), when doing aggregation to prevent drawing conclusions from small amounts of data
• Use the count() function (with an optional weight variable) if you simply want a count
• Use the count() function to also find counts and proportions of logical values

#### …The End!

Keep reading for examples of these functions in action.

## Examples and R Code

We’ll be using two libraries: tidyverse, which contains the dplyr package, and nycflights13, a data frame that contains all flights that departed from New York City in 2013.

1. This post is meant for a person who is looking for a refresher on the very basics of the dplyr package (i.e., me when I inevitably forget how to summarize data). The content in this post is based on chapter five of R for Data Science by Hadley Wickham & Garrett Grolemund, which I would recommend reading for a more thorough explanation.