I’ll start with some general tips & tricks that I find useful in RStudio before jumping into the main topics:
- cmd + shift + r creates a commented header in the code for readability
- cmd + shift + p resends the previously sent code chunk from the editor to the console
- cmd + shift + f10 restarts the R session
- cmd + shift + s reruns the current script
Pipes, which are written as `%>%` and pronounced “then” when reading code aloud, help clearly express a sequence of multiple operations by eliminating the need to create intermediate objects and by keeping the focus on the primary object and its transformations.
However, because the pipe rewrites your code into a form that repeatedly overwrites an intermediate object, it does not work well with two classes of functions:
- Functions that use the current environment must be given the environment explicitly when used in a pipeline; for example, `assign()`, `get()`, `load()`
- Functions that rely on lazy evaluation; for example, `try()`, `tryCatch()`, `suppressMessages()`, `suppressWarnings()` (this applies to the eager pipe in magrittr before version 2.0; later versions evaluate piped expressions lazily)
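To make the environment point concrete, here is a minimal sketch, assuming the magrittr package is installed. The variable names `env` and `x` are just for illustration. Capturing the current environment first and passing it to `assign()` makes the assignment land where you expect.

```r
library(magrittr)

# Capture the environment we actually want to assign into.
env <- environment()

# Without `envir = env`, assign() would write into a temporary
# environment used by the pipe rather than the one we are working in.
"x" %>% assign(100, envir = env)

x
```
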
Furthermore, do not use pipes in these scenarios:
- If there are multiple inputs or outputs (pipes should only transform one primary object)
- If there is a directed graph with a complex dependency structure (pipes are linear)
- If there are many steps in the pipeline, consider creating meaningfully named intermediate objects to help debugging
The pipe is loaded automatically in the tidyverse, but originally comes from the magrittr package. The magrittr package also has some additional piping tools that may be useful:
- `%T>%` returns the left-hand side instead of the right-hand side. This is useful for printing, plotting, or saving objects in a way that doesn’t terminate the pipeline.
- `%$%` explodes out the variables in a data frame so you can refer to them explicitly. This is useful if you want to pass individual vectors, not data frames, into functions.
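A short sketch of both operators, assuming magrittr is installed. The data frame `df` and the variable names are made up for illustration.

```r
library(magrittr)

# %T>% ("tee"): str() is called for its side effect (printing the structure),
# and the original vector, not str()'s return value, flows on to sum().
total <- c(1, 2, 3) %T>% str() %>% sum()
total

# %$%: expose the columns of a data frame by name, so cor() receives
# two plain vectors instead of a data frame.
df <- data.frame(a = 1:3, b = c(2, 4, 6))
r <- df %$% cor(a, b)
r
```

With a plain `%>%` in place of `%T>%`, the first pipeline would pass the `NULL` returned by `str()` on to `sum()`.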
You should consider creating a function when you’ve copied and pasted a block of code more than twice. The aspects of functions that we’ll dive into below are conditional execution, function arguments, and return values.
- conditions can be linked with `||` (or) and `&&` (and)
- be careful when testing for equality because `==` is vectorized; either ensure vectors are length 1, collapse vectors with `all()` or `any()`, or use the non-vectorized function `identical()`
- `near()` when comparing floating point numbers
- `ifelse()` for conditional element selection
- `switch()` to eliminate long chains of if statements
- `cut()` to eliminate long chains of if statements; this is useful because it works with vectors
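A minimal base R sketch of these comparison tools. `near()` lives in dplyr, so `all.equal()` is used here as a base R stand-in; that substitution is my assumption, not something the text prescribes.

```r
x <- c(1, 2, 3)
y <- c(1, 2, 4)

x == y            # element-wise comparison: one result per element
all(x == y)       # collapse to a single value: are all elements equal?
any(x == y)       # collapse to a single value: is any element equal?
identical(x, y)   # non-vectorized: always a single TRUE or FALSE

# Floating point arithmetic is approximate, so exact equality can fail.
sqrt(2)^2 == 2
isTRUE(all.equal(sqrt(2)^2, 2))   # tolerance-based comparison

# Vectorized alternatives to chains of if statements.
size <- ifelse(x > 1, "big", "small")
band <- cut(c(5, 15, 25),
            breaks = c(0, 10, 20, 30),
            labels = c("low", "mid", "high"))
```
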
Below are two simple examples of `switch()` in action.
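As an illustrative sketch (the function names `calc` and `animal_type` are hypothetical), the first example picks an operation by name, and the second relies on fall-through, where an empty argument falls through to the next non-empty value.

```r
# Choose an arithmetic operation by name; the final unnamed
# argument is the default when nothing matches.
calc <- function(x, y, op) {
  switch(op,
    plus   = x + y,
    minus  = x - y,
    times  = x * y,
    divide = x / y,
    stop("Unknown op: ", op)
  )
}
calc(6, 3, "divide")

# Fall-through: "cat" has no value of its own, so it uses the value for "dog".
animal_type <- function(x) {
  switch(x,
    cat = ,
    dog = "mammal",
    sparrow = "bird",
    "unknown"
  )
}
animal_type("cat")
```
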
Generally, the standard naming conventions for function arguments are as follows:
- `x`, `y`, `z`: vectors.
- `i`, `j`: numeric indices (typically rows and columns).
- `df`: a data frame.
- `n`: length, or number of rows.
- `p`: number of columns.
- `w`: a vector of weights.
It is common to check preconditions in your functions, such as making sure the inputs are in the correct format. If these conditions are not met, you may want to exit the function immediately. You can check preconditions using `stop()` to test a condition and throw an error with a custom message, or `stopifnot()` to check that each argument is true and return a generic error message if any are not.
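A small sketch of both styles. The helper names `wt_mean` and `scale01` are illustrative choices, not functions from any package.

```r
# stop(): an explicit check with a custom error message.
wt_mean <- function(x, w) {
  if (length(x) != length(w)) {
    stop("`x` and `w` must be the same length", call. = FALSE)
  }
  sum(w * x) / sum(w)
}

# stopifnot(): several terse checks with generic error messages.
scale01 <- function(x) {
  stopifnot(is.numeric(x), length(x) > 0)
  rng <- range(x)
  (x - rng[1]) / (rng[2] - rng[1])
}

wt_mean(c(1, 2, 3), c(1, 1, 2))
scale01(c(0, 5, 10))
```

`stop()` takes more typing but lets you write an informative message; `stopifnot()` is a reasonable compromise when you want many checks with little code.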
Finally, functions can include a `...` argument to pass an arbitrary number of inputs on to another function.
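Two short sketches of `...` in base R; the function names `commas` and `my_mean` are hypothetical.

```r
# Collect any number of inputs and join them into one string.
commas <- function(...) {
  paste(c(...), collapse = ", ")
}
commas("a", "b", "c")

# Forward extra arguments (here na.rm) untouched to another function.
my_mean <- function(x, ...) {
  mean(x, ...)
}
my_mean(c(1, NA, 3), na.rm = TRUE)
```

One caveat: a misspelled argument name is silently swallowed by `...`, so typos can go unnoticed.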
The return value is usually the last statement that a function evaluates, but you can choose to return early by using `return()`, for example if the inputs are empty or if there are simple edge cases.
Functions can either transform an object or create side effects like drawing a graph or saving a file. For functions that are called for their side effects, consider returning the first argument invisibly so that they can still be used in a pipeline.
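A minimal sketch of both ideas; `safe_mean` and `show_missings` are illustrative names.

```r
# Early return: handle the empty-input edge case up front.
safe_mean <- function(x) {
  if (length(x) == 0) {
    return(NA_real_)
  }
  sum(x) / length(x)
}

# Side-effect function: prints a message, then returns its input invisibly
# so it can sit in the middle of a pipeline.
show_missings <- function(df) {
  n <- sum(is.na(df))
  cat("Missing values: ", n, "\n", sep = "")
  invisible(df)
}

safe_mean(numeric(0))
safe_mean(c(2, 4))
out <- show_missings(data.frame(a = c(1, NA)))
```

Calling `show_missings()` at the console prints only the message, but the data frame is still there to be assigned or piped onward.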
Most functions that you write will work with vectors, a data structure with a sequence of cells that contain data.
- Atomic vectors are one-dimensional, contain only one type of data, and come in six types: logical, integer, double, character, complex, and raw. `NULL` represents the absence of a vector.
- Lists, or recursive vectors, can contain multiple types of data, including other lists.
Every vector has two required properties: type, determined with `typeof()`, and length, determined with `length()`. Vectors can also have arbitrary additional attributes, which create augmented vectors such as factors (built on integer vectors), date/times (built on numeric vectors), and data frames/tibbles (built on lists).
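A quick base R sketch of these properties; the example values are arbitrary.

```r
typeof(letters)                 # type of a character vector
typeof(1:10)                    # type of an integer vector
length(list("a", "b", 1:10))    # a list's length counts top-level elements

# Augmented vectors are ordinary vectors plus attributes.
f <- factor(c("a", "b", "a"))
typeof(f)        # built on an integer vector
attributes(f)    # levels and class

d <- as.Date("1971-01-01")
typeof(d)        # built on a numeric (double) vector
unclass(d)       # days since 1970-01-01

df <- data.frame(x = 1:2)
typeof(df)       # built on a list
```
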
1. Atomic Vectors
- Logical vectors can take three possible values: TRUE, FALSE, NA
- Integer vectors contain integers; numbers are doubles by default, so place an `L` after the number to denote an integer.
- Double vectors contain doubles; besides ordinary numbers, they have four special values: `NA`, `NaN`, `Inf`, and `-Inf`
- Character vectors are made up of string elements
- Raw and complex vectors are rarely used in data analysis
Note: each type of atomic vector has its own missing value: `NA` (logical), `NA_integer_`, `NA_real_` (double), and `NA_character_`. In practice you can use a plain `NA`, which will always be converted to the correct type by the implicit coercion rules.
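The points above can be checked directly in base R:

```r
typeof(1)     # numbers are doubles by default
typeof(1L)    # the L suffix makes an integer

# Special values of doubles.
specials <- c(-1, 0, 1) / 0   # -Inf, NaN, Inf
is.finite(specials)
is.nan(specials)

# Typed missing values.
typeof(NA)
typeof(NA_integer_)
typeof(NA_real_)
typeof(NA_character_)

# A plain NA is coerced to match the rest of the vector.
typeof(c(1.5, NA))
typeof(c("a", NA))
```
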
Coerce vectors from one type to another either implicitly, by using a vector in a context that expects a certain type, or explicitly, through a function such as `as.logical()`, `as.double()`, or `as.character()`. In addition, R will implicitly coerce the length of vectors by recycling the shorter vector to the same length as the longer vector.
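A brief base R sketch of implicit coercion, explicit coercion, and recycling; the values are arbitrary.

```r
x <- c(5, 12, 20)

# Implicit coercion: a logical vector used in a numeric context
# becomes 1 (TRUE) and 0 (FALSE).
n_big <- sum(x > 10)     # how many values exceed 10
p_big <- mean(x > 10)    # what proportion exceed 10

# Combining types with c() coerces to the most flexible type.
typeof(c(TRUE, 1L))
typeof(c(1L, 1.5))
typeof(c(1.5, "a"))

# Explicit coercion.
as.integer("3") + 1L

# Recycling: the shorter vector is repeated to match the longer one.
recycled <- 1:6 + 1:2
recycled
```

Base R recycles silently when the longer length is a multiple of the shorter one, and warns when it is not.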
Check the vector type with the `is_*` family of functions, one for each type; for example, `is_logical()`, `is_character()`, or `is_list()`.
Name vectors during creation with `c()` or after creation with `set_names()` to provide for easier subsetting.
Subset vectors using brackets, `[`. Vectors can be subset with numeric vectors to denote position, logical vectors to keep all values corresponding to a TRUE value, and character vectors to denote names. With matrices, leaving an index empty selects all rows or all columns.
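A base R sketch of naming and subsetting. `stats::setNames()` is used here as a base R stand-in for `set_names()`, which comes from purrr; that substitution is my assumption.

```r
# Name at creation, or after the fact.
x <- c(a = 1, b = 2, c = 3)
y <- stats::setNames(1:3, c("a", "b", "c"))

x[c(1, 3)]        # positive integers keep those positions
x[-2]             # negative integers drop those positions
x[x > 1]          # logical vector keeps elements where TRUE
x[c("a", "c")]    # character vector selects by name

# With matrices, an empty index means "everything" along that dimension.
m <- matrix(1:6, nrow = 2)
m[1, ]            # first row, all columns
m[, 2]            # all rows, second column
```
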
Lists can be created with `list()`, and you can check the structure of a list using `str()`. To subset lists:
- `[` extracts a new, smaller list.
- `[[` extracts a single component from a list, drilling down and removing a level of hierarchy.
- `$` extracts named elements of a list.
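A short sketch of the three subsetting operators on a made-up list `a`:

```r
a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))
str(a)

sub <- a[1:2]     # `[` always returns a list
str(sub)

a[["a"]]          # `[[` returns the component itself (an integer vector)
a$b               # `$` is shorthand for extracting a named component
a[[4]][[1]]       # drill down two levels into the nested list
```
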
Vectors, including lists, can also carry arbitrary additional attributes, which can be viewed and set individually with `attr()`, or collectively with `attributes()`. The three fundamental attributes are:
- Names, used to name the elements of a vector
- Dimensions, which make a vector behave like a matrix or array
- Class, used to implement the S3 object-oriented system; it controls how generic functions such as `as.Date()` behave for different classes of input.
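A base R sketch of setting and inspecting attributes; the attribute names `greeting` and `farewell` are arbitrary.

```r
# Arbitrary attributes: set one at a time, view all at once.
x <- 1:10
attr(x, "greeting") <- "Hi!"
attr(x, "farewell") <- "Bye!"
attributes(x)

# Dimensions: giving a vector a dim attribute makes it behave like a matrix.
m <- 1:6
dim(m) <- c(2, 3)
is.matrix(m)

# Class: generic functions dispatch on it. as.Date() is a generic with
# separate methods for different classes of input.
today <- as.Date("2020-01-01")
class(today)
methods("as.Date")
```
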
This post is meant for a person who is looking for a refresher on basic programming in R, and the content in this post is based on chapters seventeen through twenty of R for Data Science by Hadley Wickham & Garrett Grolemund.