arrange() allows you to re-order the rows of your dataset.
By default, arrange sorts your data in ascending order
Sort data in descending order with desc()
Missing values are placed at the end unless explicitly told otherwise
select() lets you zoom in on a subset of the data based on the names of columns. This is useful for narrowing data with many variables down to only the ones you care about.
Select based on column name: starts_with(), ends_with(), contains(), matches()
Select all columns between two columns, inclusive: (x:y)
Exclude columns from being selected: -
Rename a variable without dropping the others: rename()
Reorder ceertain columns to the very beginning: select() with everything()
mutate() lets you add new columns that are functions of existing columns to the end of the table.
Create columns with arithmatic operators, +, -, *, /, ^, with aggregate functions, sum, mean
Create columns with integer division, %/%, and remainder ,%%
Create columns with logs: log(x), log2(x), log10(x)
Create columns with offsets, such as leading and lagging values, i.e. x - lead(x), x != lag(x)
Create columns with cumulative and rolling aggregates: cumsum, cumprod, cummin, cummax, cummean
Create columns with logical comparisons: <, >, ==, !=
Create columns with ranking functions: minrank, row_number, percent_rank, ntile
Select the top or bottom entries in a group top_n()
Only keep the new variables that you define with transmute()
group_by() changes the scope of analysis from the complete dataset to individually defined groups.
It is most useful when used with summarize(), but it can also be used with mutate() and filter() – for example, if you wanted to fine the worst members of each group, find all groups bigger than a threshold, or create group summary metrics.
Use ungroup() to remove groupings.
summarize() collapses a data frame into a single row. If the data is grouped, then it summarizes by group.
Summarize by location: mean, median
Summarize by spread: sd, IQR, mad, (std. deviation, interquartile range, median absolute deviation)
Summarize by rank: min, max, quantile
Summarize by position: first, nth, last
Summarize by count: n(), sum(is.na(x)), n_distinct(x)
Use the na.rm parameter for functions like mean to specify whether to remove missing values before computation.
Use a count, n(), when doing aggregation to prevent drawing conclusions from small amounts of data
Use the count() function (with an optional weight variable) if you simply want a count
Use the count() function to also find counts and proportions of logical values
Keep reading for examples of these functions in action.
Examples and R Code
We’ll be using two libraries: tidyverse, which contains the dplyr package, and nycflights13, a data frame that contains all flights that departed from New York City in 2013.
This post is meant for a person who is looking for a refresher on the very basics of the dplyr package (i.e., me when I inevitably forget how to summarize data). The content in this post is based on chapter five of R for Data Science by Hadley Wickham & Garrett Grolemund, which I would recommend reading for a more thorough explanation. ⤴