# High-Level Data Wrangling in R: Imports, Pivots, and Joins

In which we go over importing data into R, working with 'tidy data,' manipulating single tables, and joining related tables.

February 17, 2020 - 8 minute read -

The scenario: you have a dataset that you’d like to analyze, but before you can do so, you have to import it into R and get it into a workable format. This post dives into the necessary precursors to any kind of data analysis: importing and tidying data.1 For reference, here is the RStudio Data Import Cheat Sheet.

First, let’s explore the concept of tidy data.

In a tidy dataset, each variable must have its own column, each observation must have its own row, and each value must have its own cell – in practice, you can make a dataset tidy by formatting it as a tibble and putting each variable into a column. This allows you to make use of R’s vectorized nature and store the data in a consistent way.

##### A quick detour to introduce tibbles, an improved type of data frame:
• Tibbles only print the first 10 rows of data by default, and have stricter subsetting rules than data frames
• Create a new tibble from individual vectors: tibble()
• Create a new tibble by doing data entry in code: tribble()
• Turn a data frame into a tibble: as_tibble()
• Turn a tibble into a data frame: as.data.frame()
• Use backticks to refer to non-syntactic column names for tibbles
• For more details, see vignette("tibble")
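A quick sketch of these constructors (the column names and values are made up for illustration):

```r
library(tibble)

# Create a tibble from individual vectors
t1 <- tibble(x = 1:3, y = c("a", "b", "c"))

# Create a tibble by doing data entry in code, row by row
t2 <- tribble(
  ~x, ~y,
  1,  "a",
  2,  "b"
)

# Convert between data frames and tibbles
df <- as.data.frame(t1)
t3 <- as_tibble(df)

# Backticks let you refer to non-syntactic column names
t4 <- tibble(`2020 sales` = c(100, 200))
t4$`2020 sales`
```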

## Importing Data

There are a number of different functions to import different types of flat files into data frames, and they all work similarly – we’ll be using read_csv() from the readr package, which is part of the tidyverse.

read_csv() scans the first 1000 rows of the file and guesses each column's type (logical, integer, double, number, time, date, or date-time). Most csv files can be read with variations of the following code:
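A minimal, self-contained sketch (writing a throwaway csv to a temporary file so the call is runnable; the columns are made up):

```r
library(readr)

# Write a small csv to a temporary file for demonstration
path <- tempfile(fileext = ".csv")
writeLines(c("id,name,score",
             "1,alice,90.5",
             "2,bob,82.0"), path)

# read_csv() guesses the column types from the data itself
df <- read_csv(path)
df
```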

However, if the first 1000 rows of a column are not representative of the whole dataset or if there are other parsing problems, identify the problem rows with problems() and try the following:

1. Specify the column types manually with the parameter col_types
2. Increase the number of lines to parse from 1000 with parameter guess_max
3. Import all columns as characters using col_types = cols(.default = col_character()), then resolve the errors, possibly with type_convert()
4. Read the file into a character vector of lines with read_lines(), or into a character vector of length one with read_file(), and then use more advanced string-parsing skills.
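Sketches of options 1 and 3, again using a throwaway file (the column names are made up):

```r
library(readr)

path <- tempfile(fileext = ".csv")
writeLines(c("id,joined",
             "1,2020-01-15",
             "2,2020-02-01"), path)

# Option 1: specify the column types manually
df1 <- read_csv(path, col_types = cols(id = col_integer(),
                                       joined = col_date()))

# Option 3: read everything as character, then resolve the types
df2 <- read_csv(path, col_types = cols(.default = col_character()))
df2 <- type_convert(df2)
```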
##### But wait – how does R parse data, and what can go wrong?

Parsing functions take a character vector and return a more specialized vector such as a logical, integer, or date. They are the foundation of data import, and understanding them helps when diagnosing import problems.

• parse_factor()
• parse_logical(), parse_integer()
• parse_double(), parse_number() – potential problems with different grouping marks
• parse_character() – potential problems with non UTF-8 encoding; try guess_encoding()
• parse_date(), parse_time(), parse_datetime() – potential problems with formatting; see ?parse_datetime for the format specification
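A few of these parsers in action (the locale settings and format strings are illustrative):

```r
library(readr)

parse_logical(c("TRUE", "FALSE"))
parse_integer(c("1", "2"))

# parse_number() ignores grouping marks and surrounding text
parse_number("$1,234.56")  # 1234.56

# European-style grouping marks need an explicit locale
parse_number("1.234,56",
             locale = locale(grouping_mark = ".", decimal_mark = ","))

# parse_date() accepts an explicit format string
parse_date("17/02/2020", format = "%d/%m/%Y")
```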

## Managing Data in a Single Table

When some of the column names are values and not variable names, we can use pivot_longer() to pivot the columns into a new pair of variables, for example:
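A sketch with made-up data, where the column names 2019 and 2020 are really values of a year variable:

```r
library(tidyr)

sales <- tibble::tibble(
  product = c("a", "b"),
  `2019`  = c(10, 20),
  `2020`  = c(15, 25)
)

# Gather the year columns into a (year, amount) pair of variables
long <- pivot_longer(sales, cols = c(`2019`, `2020`),
                     names_to = "year", values_to = "amount")
```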

When some of the observations are scattered across multiple rows, we can use pivot_wider() to move values into column names, for example:
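A sketch in the opposite direction, where each observation (one product) is spread across two rows:

```r
library(tidyr)

long <- tibble::tibble(
  product = c("a", "a", "b", "b"),
  type    = c("sales", "cost", "sales", "cost"),
  value   = c(10, 4, 20, 9)
)

# Spread the type column into separate sales and cost columns
wide <- pivot_wider(long, names_from = type, values_from = value)
```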

If needed, we can use separate() and unite() to split and combine columns:
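A round-trip sketch with a made-up rate column:

```r
library(tidyr)

df <- tibble::tibble(rate = c("12/100", "9/80"))

# Split one column into two at the "/" separator
split_df <- separate(df, rate, into = c("cases", "population"), sep = "/")

# Combine the two columns back into one
united <- unite(split_df, "rate", cases, population, sep = "/")
```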

Missing values can be missing explicitly with an NA flag, or missing implicitly if they’re just not listed. In the following example, the 4th quarter of 2015 is explicitly missing and the 1st quarter of 2016 is implicitly missing.

We can make implicit missing values explicit with pivot_wider() or complete(), or make explicit missing values implicit with pivot_longer(..., values_drop_na = TRUE).
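A sketch using the quarters scenario from above (the return values are made up):

```r
library(tidyr)

stocks <- tibble::tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(1, 2, 3, 4, 2, 3, 4),
  return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
)

# complete() makes the implicit missing value (2016 Q1) an explicit NA
full <- complete(stocks, year, qtr)

# Going wide and back long with values_drop_na = TRUE drops both NAs
wide <- pivot_wider(stocks, names_from = year, values_from = return)
long <- pivot_longer(wide, -qtr, names_to = "year",
                     values_to = "return", values_drop_na = TRUE)
```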

Note: fill() can be used to fill in missing values with the most recent non-missing value.
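A tiny sketch of fill(), carrying a value down over the gaps (the data is made up):

```r
library(tidyr)

df <- tibble::tibble(
  person = c("Ann", NA, NA, "Bob"),
  score  = 1:4
)

# By default fill() carries the last non-missing value downward
filled <- fill(df, person)
```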

## Managing Relational Data in Multiple Tables

Combine related tables by matching the unique identifier in your main table (the primary key) with the corresponding identifier in another table (the foreign key). To ensure that your joins work smoothly:

1. Identify the variables that are the primary key in each table
2. Check that none of the variables in the primary key are missing
3. Check that the primary keys match foreign keys in another table with anti_join()

#### Mutating Joins: Combine variables from two tables

• inner_join(x, y) - only matching values are kept
• left_join(x, y) - all x values are kept - MOST COMMON
• right_join(x, y) - all y values are kept
• full_join(x, y) - all x and y values are kept
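The four mutating joins on a pair of made-up tables keyed by name:

```r
library(dplyr)

band <- tibble::tibble(name = c("Mick", "John", "Paul"),
                       band = c("Stones", "Beatles", "Beatles"))
instrument <- tibble::tibble(name = c("John", "Paul", "Keith"),
                             plays = c("guitar", "bass", "guitar"))

inner_join(band, instrument, by = "name")  # 2 rows: John, Paul
left_join(band, instrument, by = "name")   # 3 rows: Mick gets NA for plays
right_join(band, instrument, by = "name")  # 3 rows: Keith gets NA for band
full_join(band, instrument, by = "name")   # 4 rows: everyone kept
```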

#### Filtering Joins: Affects observations, not variables

• semi_join(x, y) - keeps all observations in x that have a match in y
• anti_join(x, y) - drops all observations in x that have a match in y
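The same made-up tables show the contrast: filtering joins change which rows of x survive, but never add columns from y:

```r
library(dplyr)

band <- tibble::tibble(name = c("Mick", "John", "Paul"),
                       band = c("Stones", "Beatles", "Beatles"))
instrument <- tibble::tibble(name = c("John", "Paul", "Keith"),
                             plays = c("guitar", "bass", "guitar"))

semi_join(band, instrument, by = "name")  # John and Paul have a match
anti_join(band, instrument, by = "name")  # only Mick has no match
```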

#### Set Operations: Assumes both tables have the same variables

• intersect(x, y) - return only observations in both x and y
• union(x, y) - return unique observations in x and y
• setdiff(x, y) - return observations in x, but not in y
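A minimal sketch with two one-column tables (dplyr provides data-frame versions of these base functions):

```r
library(dplyr)

x <- tibble::tibble(id = c(1, 2, 3))
y <- tibble::tibble(id = c(2, 3, 4))

intersect(x, y)  # rows in both: 2, 3
union(x, y)      # unique rows across both: 1, 2, 3, 4
setdiff(x, y)    # rows in x but not y: 1
```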
---

1. The content in this post is based on chapters nine through thirteen of *R for Data Science* by Hadley Wickham & Garrett Grolemund, which I would recommend reading for a more thorough explanation.