# Data Wrangling in R: Strings

In which we dive into string manipulation, with a focus on regular expressions.

February 18, 2020 - 8 minute read -

In this post we’ll be taking a look at basic string functions, regular expression syntax, and several applications of regular expressions in string manipulation.[1]

The RStudio String Manipulation Cheat Sheet has great reference material on this topic in a condensed format.

## String Basics

• str_length() returns the number of characters in a string
• str_replace_na() turns missing values (NA) into “NA”
• str_c(..., sep = ",") combines two or more strings with a specified separator
• str_c(..., collapse = ", ") collapses a vector of strings into a single string with a specified separator
• str_sub() extracts or modifies subsets of a string by character position
• str_sort() sorts strings; specify locale if necessary
• str_to_lower(), str_to_upper(), and str_to_title() change case; specify locale if necessary
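As a quick illustration of the functions listed above, here is a minimal sketch that assumes the stringr package is installed. The input values are made up for the example.

```r
library(stringr)

x <- c("apple", NA, "kiwi")   # hypothetical input with a missing value

lens     <- str_length(x)                            # 5 NA 4
no_na    <- str_replace_na(x)                        # "apple" "NA" "kiwi"
joined   <- str_c("a", "b", sep = ",")               # "a,b"
one_str  <- str_c(c("a", "b", "c"), collapse = ", ") # "a, b, c"
first3   <- str_sub("banana", 1, 3)                  # "ban"
last3    <- str_sub("banana", -3, -1)                # negative positions count from the end: "ana"
sorted   <- str_sort(c("cherry", "apple", "banana"), locale = "en")
shouting <- str_to_upper("hello")                    # "HELLO"
titled   <- str_to_title("hello world")              # "Hello World"
```

Note that str_length() keeps NA as NA, while str_replace_na() turns it into the literal string "NA".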

## Regular Expressions - Basic Syntax

#### Anchors

• ^ matches the start of a string
• $ matches the end of a string
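A small sketch of anchoring in practice, assuming stringr is loaded. The vector is hypothetical, and $ (the end-of-string anchor) is included as the standard counterpart to ^.

```r
library(stringr)

fruits <- c("apple", "banana", "pear")   # hypothetical input

starts_with_a <- str_detect(fruits, "^a")      # TRUE FALSE FALSE
ends_with_a   <- str_detect(fruits, "a$")      # FALSE TRUE FALSE
exactly_pear  <- str_detect(fruits, "^pear$")  # both anchors: the whole string must be "pear"
```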

#### Repetition

• ? is 0 or 1
• + is 1 or more
• * is 0 or more
• {n} is exactly n
• {n,} is n or more
• {,m} is at most m; writing it as {0,m} is the safer form, since some regex engines reject an open lower bound
• {n,m} is between n and m, inclusive
• \1, \2, … are backreferences: each one matches the same text that was captured by the corresponding parenthesized group earlier in the pattern

Note: By default, the repetition operators above are greedy and match the longest string possible; make them lazy by putting a ? after them.
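The repetition operators, lazy matching, and backreferences can be sketched as follows, assuming stringr is loaded. The sample string is chosen only because it contains a run of repeated letters. Inside an R string, a backreference such as \1 has to be typed as "\\1".

```r
library(stringr)

x <- "1888 in Roman numerals: MDCCCLXXXVIII"   # hypothetical input

opt      <- str_extract(x, "CC?")       # C plus an optional C: "CC"
plus     <- str_extract(x, "CC+")       # C plus one or more C: "CCC"
exact    <- str_extract(x, "C{2}")      # exactly two: "CC"
at_least <- str_extract(x, "C{2,}")     # two or more, greedy: "CCC"
greedy   <- str_extract(x, "C{2,3}")    # between two and three, greedy: "CCC"
lazy     <- str_extract(x, "C{2,3}?")   # same range, lazy: "CC"

# (..) captures two characters; \1 then requires those same two characters again.
repeated <- str_extract("banana", "(..)\\1")   # "anan"
```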

## Applying Regular Expressions in R

It’s important to note that in order for your regular expressions to work in R, you must double every backslash in the expression: the regex \d, for example, is written as "\\d" in R code. This is because the backslash is itself the escape character in R strings, so a single backslash never reaches the regex engine.

You can also match one pattern OR another using the pipe: |.
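A short sketch of both points, assuming stringr is loaded. The inputs are invented for illustration.

```r
library(stringr)

# The regex \d+ (one or more digits) is typed as "\\d+" in R.
digits <- str_extract("room 42", "\\d+")            # "42"

# To match a literal dot, the regex is \. and the R string is "\\."
dot_match <- str_detect(c("a.c", "abc"), "a\\.c")   # TRUE FALSE

# writeLines() prints the string as the regex engine receives it: \d+
writeLines("\\d+")

# Alternation: parentheses limit how far the | reaches.
alt <- str_detect(c("grey", "gray", "grow"), "gr(e|a)y")   # TRUE TRUE FALSE
```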

Let’s go through some simple applications of regular expressions using the two libraries below. The two datasets that I will use as examples are words, a character vector of 980 common words, and sentences, a character vector of 720 sentences.
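A plausible setup is sketched here under an assumption: that the two libraries are stringr (which provides the string functions and ships the words and sentences vectors) and tidyr (which provides extract()). This pairing is a guess rather than something the post confirms.

```r
library(stringr)  # string functions, plus the words and sentences datasets
library(tidyr)    # assumed here because extract() is used later on

n_words     <- length(words)       # 980 common words
n_sentences <- length(sentences)   # 720 sentences

head(words, 3)       # peek at the first few words
head(sentences, 2)   # peek at the first few sentences
```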

### 1. Detecting Matches: str_detect(), str_subset(), and str_count()

str_detect() searches for a pattern in a string and returns TRUE or FALSE.

str_subset() keeps strings matching a pattern.

str_count() counts the number of matches there are in a string.
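A minimal sketch of the three detection functions (named str_detect(), str_subset(), and str_count() in stringr), assuming stringr is loaded. The small vector is hypothetical; words is the dataset bundled with stringr.

```r
library(stringr)

x <- c("apple", "banana", "pear")   # hypothetical input

has_e   <- str_detect(x, "e")    # TRUE FALSE TRUE
with_an <- str_subset(x, "an")   # "banana"
n_a     <- str_count(x, "a")     # 1 3 1

# TRUE counts as 1, so sum() gives the number of matching strings.
n_start_t <- sum(str_detect(words, "^t"))
```

Since str_detect() returns a logical vector, it also works directly inside filtering steps such as x[str_detect(x, "e")].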

### 2. Extracting Matches: str_extract() and str_extract_all()

str_extract() extracts the actual text of the first match in each string; str_extract_all() extracts the text of every match.
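The difference between the two can be sketched like this, assuming stringr is loaded. The strings and the colour pattern are made up for the example.

```r
library(stringr)

x <- c("red and blue", "green", "no colour here")   # hypothetical input
colours <- "red|green|blue"

# First match per string; NA where nothing matches.
first_match <- str_extract(x, colours)        # "red" "green" NA

# Every match per string, returned as a list of character vectors.
all_matches <- str_extract_all(x, colours)

# simplify = TRUE returns a character matrix padded with "" instead of a list.
match_matrix <- str_extract_all(x, colours, simplify = TRUE)
```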

### 3. Grouped Matches: str_match() and str_match_all()

str_match() and str_match_all() are very similar to the previous string extracting functions: they extract a matching pattern from a vector, but they also return each individual component, as a matrix with one column for the complete match followed by one column for each group.

extract(), from the tidyr package, does the same thing, but is especially useful for tibbles (as opposed to vectors). It will add additional columns to the tibble for each grouped match.
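A sketch of both approaches, assuming stringr and tidyr are installed. The sentences, the pattern, and the column names article and noun are hypothetical choices for illustration.

```r
library(stringr)
library(tidyr)   # provides extract() and re-exports tibble()

x <- c("the cat sat", "a dog ran", "no match")   # hypothetical input
pattern <- "(a|the) ([^ ]+)"                     # two groups: article, following word

# Character matrix: column 1 is the full match, columns 2 and 3 are the groups.
m <- str_match(x, pattern)

# The same idea for a tibble: one new column per group.
df  <- tibble(sentence = x)
out <- extract(df, sentence, into = c("article", "noun"),
               regex = pattern, remove = FALSE)
```

Rows without a match come back as NA in both results. remove = FALSE keeps the original sentence column next to the new ones.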

### 4. Replacing Matches: str_replace() and str_replace_all()

str_replace() will replace the first occurrence of a match, while str_replace_all() will replace all occurrences.
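A minimal sketch of replacement, assuming stringr is loaded and using invented inputs. It also shows two common variations: a named vector for several replacements at once, and backreferences in the replacement text.

```r
library(stringr)

x <- c("apple", "banana")   # hypothetical input

first_only <- str_replace(x, "a", "-")       # "-pple" "b-nana"
every_one  <- str_replace_all(x, "a", "-")   # "-pple" "b-n-n-"

# A named vector maps each pattern to its own replacement.
spelled <- str_replace_all("1 house, 2 cars", c("1" = "one", "2" = "two"))

# \\1 and \\2 in the replacement insert the captured groups; here they swap two words.
swapped <- str_replace("hello big world", "([^ ]+) ([^ ]+)", "\\2 \\1")
```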

### 5. Splitting Matches: str_split()

str_split() will split strings based on a pattern.
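For example, assuming stringr is loaded (the inputs are made up):

```r
library(stringr)

x <- c("a,b,c", "d,e")   # hypothetical input

pieces <- str_split(x, ",")                    # list: one character vector per input string
mat    <- str_split(x, ",", simplify = TRUE)   # matrix, padded with "" where pieces run out

# The separator is a regular expression: a comma followed by an optional space.
flexible <- str_split("a, b,c", ", ?")[[1]]    # "a" "b" "c"

# n caps the number of pieces; the remainder stays in the last piece.
limited <- str_split("one two three", " ", n = 2)[[1]]   # "one" "two three"
```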

### 6. Finding the Positions of Matches: str_locate(), str_locate_all()

str_locate(), str_locate_all() return the starting and ending positions of each match. When none of the other functions do what you want, you may want to locate the positions of the matching patterns, then use str_sub() to extract/modify them.
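The locate-then-modify approach described above might look like this, assuming stringr is loaded and using a made-up string:

```r
library(stringr)

x <- "banana"   # hypothetical input

loc     <- str_locate(x, "an")            # one-row matrix with columns "start" and "end"
all_loc <- str_locate_all(x, "an")[[1]]   # one row per match

# Use the positions to pull out the first match...
found <- str_sub(x, loc[1, "start"], loc[1, "end"])   # "an"

# ...or to overwrite it in place with the assignment form of str_sub().
str_sub(x, loc[1, "start"], loc[1, "end"]) <- "AN"
x   # "bANana"
```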

### …A final note about regular expressions:

In the examples above, the pattern matching string is automatically wrapped into a call to regex():
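A minimal sketch of that equivalence, assuming stringr is loaded and using a hypothetical vector:

```r
library(stringr)

x <- c("apple", "banana", "pear")   # hypothetical input

implicit <- str_detect(x, "an")          # plain string pattern
explicit <- str_detect(x, regex("an"))   # the same pattern, wrapped by hand

identical(implicit, explicit)   # TRUE
```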

We can explicitly call the regex() function to change case matching, search over multiple lines, or add comments for readability.
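The three options can be sketched as follows, assuming stringr is loaded. The inputs and the phone-style pattern are invented for the example.

```r
library(stringr)

# ignore_case = TRUE: "^b" matches an upper-case or lower-case b.
any_case <- str_detect(c("Banana", "banana"), regex("^b", ignore_case = TRUE))

# multiline = TRUE: ^ and $ match at the start and end of each line, not only the whole string.
txt <- "first line\nsecond line"
line_starts  <- str_extract_all(txt, regex("^[a-z]+", multiline = TRUE))[[1]]
string_start <- str_extract_all(txt, "^[a-z]+")[[1]]

# comments = TRUE: whitespace and # comments inside the pattern are ignored.
phone <- regex("
  (\\d{3})  # first three digits
  -?        # optional dash
  (\\d{4})  # last four digits
", comments = TRUE)
parts <- str_match("call 555-1234", phone)
```

With comments = TRUE, a literal space in the pattern no longer matches a space, so it has to be escaped if it is needed.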

## Applying Pattern Matching Without Regular Expressions

All of the functions that we’ve looked at apply pattern matching via regular expressions by default. However, it is possible to override the pattern matching type by explicitly specifying one of three functions in place of regex():

1. fixed() matches the exact specified sequence of bytes, ignoring the special meaning of all regular expression characters. It is much faster than regular expressions, but be careful with non-English data.
2. coll() compares strings using standard collation rules. This is useful for doing case insensitive matching, but is slower than the other functions.
3. boundary() can match boundaries, such as characters, words, or sentences.
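A short sketch of the three alternatives, assuming stringr is loaded. The inputs are made up for illustration.

```r
library(stringr)

x <- c("a.c", "abc")   # hypothetical input

# As a regex, "." matches any character; with fixed() it is a literal dot.
as_regex <- str_detect(x, ".")          # TRUE TRUE
as_fixed <- str_detect(x, fixed("."))   # TRUE FALSE

# coll() compares using collation rules; here it ignores case.
same_word <- str_detect("BANANA", coll("banana", ignore_case = TRUE))   # TRUE

# boundary("word") splits on word boundaries and drops the punctuation.
word_list <- str_split("The quick brown fox.", boundary("word"))[[1]]
```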
[1] This post is meant for a person who is looking for a refresher on string manipulation and regular expressions in R. The content in this post is based on chapter fourteen of R for Data Science by Hadley Wickham & Garrett Grolemund, which I would recommend reading for in-depth examples.