The scenario: you have a dataset that you’d like to analyze, but before you can do so, you have to import it into R and get it into a workable format. This post dives into the necessary precursors to any kind of data analysis: importing and tidying data.1 For reference, here is the RStudio Data Import Cheat Sheet.
First, let’s explore the concept of tidy data.
In a tidy dataset, each variable must have its own column, each observation must have its own row, and each value must have its own cell – in practice, you can make a dataset tidy by formatting it as a tibble and putting each variable into a column. This allows you to make use of R’s vectorized nature and store the data in a consistent way.
A quick detour to introduce tibbles, an improved type of data frame:
- Tibbles only print the first 10 rows of data by default, and has stricter subsetting rules than data frames
- Create a new tibble from individual vectors:
- Create a new tibble by doing data entry in code:
- Turn a data frame into a tibble:
- Turn a tibble into a data frame:
- Use backticks to refer to non-syntactic column names for tibbles
- For more details, see
There are a number of different functions to import different types of flat files into data frames, and they all work similarly – we’ll be using
read_csv() from the readr package, which is part of the tidyverse.
read_csv() scans the first 1000 rows in the file and guesses what type of data each column is (logical, integer, double, number, time, date, and date-time). Most csv files can be read with variations of the following code:
However, if the first 1000 rows of a column is not representative of the whole dataset or if there are other parsing problems, identify the problem rows with
problem() and try the following:
- Specify the column types manually with the parameter
- Increase the number of lines to parse from 1000 with parameter
- Import all columns as characters using
col_types = cols(.default = col_character()), then resolve the errors, possibly with
- Read the file into a character vector of lines,
read_lines()or a character vector of length one,
read_file(), and then use more advanced string parsing skills.
But wait – how does R parse data, and what can go wrong?
Parsing functions take a character vector and return a more specialized vector like a logical, integer, or date. They make up the foundation of importing data, and are useful to understand for imports.
parse_double(), parse_number()– potential problems with different grouping marks
parse_character()– potential problems with non UTF-8 encoding; try
parse_date(), parse_time(), parse_datetime()– potential problems with formatting; reference.
Managing Data in a Single Table
When some of the column names are values and not variable names, we can use
pivot_longer() to pivot the columns into a new pair of variables, for example:
When some of the observations are scattered across mutiple rows, we can use
pivot_wider() to move values into column names, for example:
If needed, we can use
separate(), unite() to split columns and combine columns:
Missing values can be missing explicitly with an NA flag, or missing implicitly if they’re just not listed. In the following example, the 4th quarter of 2015 is explicity missing and the 1st quarter of 2016 is implicitly missing.
We can make both missing values explicit with
complete(), or make both missing values implicit with
pivot_longer(..., values_drop_na = TRUE
fill() can be used to fill in missing values with the most recent non-missing value.
Managing Relational Data in Multiple Tables
Combine data between related data by matching the unique identifier in your main table (primary key) with the unique identifier in another table (foreign key). To ensure that your joins work smoothly:
- Identify the variables that are the primary key in each table
- Check that none of the variables in the primary key are missing
- Check that the primary keys match foreign keys in another table with
Mutating Joins: Combine variables from two tables
inner join(x, y)- only matching values are kept
left_join(x, y)- all x values are kept - MOST COMMON
right_join(x, y)- all y values are kept
full_join(x, y)- all x and y values are kept
Filtering Joins: Affects observations, not variables
semi_join(x, y)- keeps all observations in x that have a match in y
anti_join(x, y)- drops all observations in x that have a match in y
Set Operations: Assumes both tables have the same variables
intersect(x, y)- return only observations in both x and y
union(x, y)- return unique observations in x and y
setdiff(x, y)- return observations in x, but not in y