# Basic Data Exploration in R

In which we examine some common practical examples of data exploration: observing variance and covariance with histograms, boxplots, scatterplots, and heat maps.

February 16, 2020 - 7 minute read

In this post we’ll take a look at several examples of data exploration using the diamonds dataset, which ships with ggplot2 and is loaded as part of the tidyverse. [1]

## When investigating variance within a single variable, ask:

• Which values are most common, most rare, and why?
• Does this match your expectations?
• Are there any unusual patterns, and what are the possible explanations?
• If there are clusters of observations, how are those observations similar or different?

Example 1: Use a histogram to visualize the distribution of a variable. Here we observe the distribution of the carat (weight) of diamonds.
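A minimal sketch of such a histogram might look like the following. It loads ggplot2 directly rather than the whole tidyverse, and the `binwidth` of 0.5 and the name `p_hist` are illustrative assumptions, not necessarily what produced the original figure.

```r
library(ggplot2)

# Histogram of diamond weight (carat).
# binwidth = 0.5 is an assumed, illustrative value.
p_hist <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.5)

p_hist
```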

Example 2: Play with the binwidth of histograms to reveal different patterns in the data. In this example, we get a more accurate view of the distribution of carat.
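As a rough sketch, the effect of binwidth can be seen by drawing the same histogram twice. The cutoff of 3 carats and the two binwidths below are assumptions chosen for illustration.

```r
library(ggplot2)

# Keep only diamonds under 3 carats (an assumed cutoff) so the
# bulk of the distribution fills the plot.
smaller <- subset(diamonds, carat < 3)

# Coarser bins: the overall shape of the distribution.
p_coarse <- ggplot(smaller, aes(x = carat)) +
  geom_histogram(binwidth = 0.1)

# Finer bins: more local structure, at the cost of noisier bars.
p_fine <- ggplot(smaller, aes(x = carat)) +
  geom_histogram(binwidth = 0.01)

p_coarse
p_fine
```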

Example 3: Play with the x-axis and y-axis limits to identify outliers. In this example, there are some very high values of the “x” variable, a measure of a diamond’s dimensions, which could indicate data entry errors.

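Note that in ggplot2, coord_cartesian() only zooms the view and keeps every observation, while xlim() and ylim() set scale limits, so values outside the range are converted to missing values and dropped from the plot.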
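The difference between the two approaches can be sketched as follows. The binwidth and the y-axis range of 0 to 50 are assumed values picked so that rare bins become visible.

```r
library(ggplot2)

p_base <- ggplot(diamonds, aes(x = x)) +
  geom_histogram(binwidth = 0.5)

# Zoom in on small counts. All bars are still computed and drawn;
# tall bars are simply clipped by the viewing window.
p_zoom <- p_base + coord_cartesian(ylim = c(0, 50))

# Set scale limits instead. Bars whose count falls outside 0-50
# become missing and are removed (ggplot2 warns when rendering).
p_limit <- p_base + ylim(0, 50)

p_zoom
```

Rendering `p_limit` as well shows the contrast: the tall bars disappear entirely instead of being clipped.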

## When investigating covariance between two variables, ask:

• Could this pattern be due to random chance?
• What is the relationship implied by the pattern, and how strong is it?
• What other variables might affect the relationship?
• Does the relationship change if you look at subgroups of data?

### Case 1: A Categorical and a Continuous Variable

Example 1: Use geom_freqpoly() to overlay one frequency polygon per category, plotting density rather than count so that groups of different sizes can be compared. Here we can see the distribution of carat, split by cut.
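One way to draw this, assuming ggplot2 3.3.0 or later for `after_stat()` and an illustrative binwidth of 0.1:

```r
library(ggplot2)

# One frequency polygon per cut. Mapping y to density means each
# line integrates to 1, so small and large groups are comparable.
p_freq <- ggplot(diamonds, aes(x = carat, y = after_stat(density))) +
  geom_freqpoly(aes(colour = cut), binwidth = 0.1)

p_freq
```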

Example 2: Another way to visualize the relationship is to create a boxplot, reorder the categories (for example, by their median), and flip the axes so that long labels stay readable. This is illustrated with the relationship between car class and highway miles per gallon via the mpg dataset.
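A sketch of that boxplot follows. Ordering by the median is an assumption here (any summary function would work), and `class_by_hwy` is a hypothetical helper name.

```r
library(ggplot2)

# Reorder the class factor by median highway mileage.
class_by_hwy <- reorder(mpg$class, mpg$hwy, FUN = median)

# Boxplot with reordered categories; coord_flip() puts the class
# labels on the vertical axis.
p_box <- ggplot(mpg, aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
  geom_boxplot() +
  coord_flip() +
  labs(x = "class")

p_box
```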

Example 3: In cases where there are a lot of outliers, boxplots may not be as useful. Instead, we can use a letter value plot. Here we have a good view of the density of observations when relating cut and price of diamonds.
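Letter value plots are not part of ggplot2 itself. The sketch below assumes the separate lvplot package from CRAN and its `geom_lv()` layer, and it only builds the plot object when that package is installed.

```r
library(ggplot2)

# lvplot is an assumed extra dependency: install.packages("lvplot")
if (requireNamespace("lvplot", quietly = TRUE)) {
  p_lv <- ggplot(diamonds, aes(x = cut, y = price)) +
    lvplot::geom_lv()
} else {
  p_lv <- NULL  # package not available; nothing to draw
}
```

Calling `print(p_lv)` renders the plot when the object is not `NULL`.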

Example 4: The violin plot is another great way to compare density distributions among different categories. This is a different way to represent the data from the previous graph.
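A violin plot of the same two variables can be sketched with `geom_violin()`, leaving all options at their defaults:

```r
library(ggplot2)

# Each violin is a mirrored density estimate of price within a cut.
p_violin <- ggplot(diamonds, aes(x = cut, y = price)) +
  geom_violin()

p_violin
```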

### Case 2: Two Categorical Variables

Example 1: We can use geom_count() to place both categorical variables on a grid and show the frequency of each combination as the size of a point. In this example, we can see the distribution of observations between diamond cut and color.
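In its simplest form, this plot might be written as:

```r
library(ggplot2)

# geom_count() counts the rows at each (cut, color) combination
# and maps that count to point size.
p_count <- ggplot(diamonds, aes(x = cut, y = color)) +
  geom_count()

p_count
```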

Example 2: A heat map can also come in handy when comparing density of observations between two categorical variables. Again, we can see the distribution of observations between diamond cut and color.
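One possible way to build such a heat map is to count the combinations first and then draw one tile per pair. The sketch below uses base R's `table()` for the counting step, which is an assumption made to keep the example free of extra packages. A dplyr `count()` would do the same job.

```r
library(ggplot2)

# Count observations for every cut/color pair.
# as.data.frame() on a table yields columns cut, color and Freq.
counts <- as.data.frame(table(cut = diamonds$cut, color = diamonds$color))

# One tile per pair, with fill colour encoding the count.
p_heat <- ggplot(counts, aes(x = color, y = cut)) +
  geom_tile(aes(fill = Freq))

p_heat
```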

### Case 3: Two Continuous Variables

Example 1: The scatterplot is a classic way to compare continuous variables. For example, here we examine the relationship between diamond carat and price.
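A basic version of this scatterplot is sketched below. The transparency setting is an added assumption that helps with overplotting in a dataset of this size.

```r
library(ggplot2)

# alpha = 0.1 (assumed value) makes dense regions appear darker.
p_scatter <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.1)

p_scatter
```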

Example 2: We can also bin variables in two dimensions, using the fill color to represent frequency of observations.
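As a sketch, two-dimensional binning with rectangular bins can be done with `geom_bin2d()`. Whether the original figure used rectangular or hexagonal bins is not stated, so this choice is an assumption.

```r
library(ggplot2)

# Divide the carat/price plane into rectangular bins and colour
# each bin by the number of diamonds that fall inside it.
p_bin2d <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_bin2d()

p_bin2d
```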

Example 3: Another option is to bin only one of the continuous variables so that it behaves like a categorical variable. Here, we bin carat.
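A sketch of this approach uses ggplot2's `cut_width()` helper to slice carat into equal-width bins and draws one boxplot per bin. The 3-carat cutoff and the bin width of 0.1 are assumed values.

```r
library(ggplot2)

# Assumed cutoff to drop the sparse upper tail.
smaller <- subset(diamonds, carat < 3)

# cut_width() turns carat into bins 0.1 wide; the group aesthetic
# tells geom_boxplot() to draw one box per bin.
p_binned <- ggplot(smaller, aes(x = carat, y = price)) +
  geom_boxplot(aes(group = cut_width(carat, 0.1)))

p_binned
```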

Example 4: This is the same graph, except the width of the boxplot is proportional to the number of points.
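In ggplot2 this is typically done with the `varwidth` argument of `geom_boxplot()`, assumed here to be what the figure used. Per the ggplot2 documentation, box widths are then scaled by the square root of the group size. The cutoff and bin width are again illustrative.

```r
library(ggplot2)

smaller <- subset(diamonds, carat < 3)  # assumed cutoff

# varwidth = TRUE makes boxes for well-populated bins wider
# than boxes for sparse bins.
p_varwidth <- ggplot(smaller, aes(x = carat, y = price)) +
  geom_boxplot(aes(group = cut_width(carat, 0.1)), varwidth = TRUE)

p_varwidth
```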

Example 5: This is again the same graph, but each box now contains approximately the same number of points.
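Equal-count bins can be sketched with ggplot2's `cut_number()` helper. The choice of 20 bins and the 3-carat cutoff are assumptions for illustration.

```r
library(ggplot2)

smaller <- subset(diamonds, carat < 3)  # assumed cutoff

# cut_number() picks bin edges so that each of the 20 bins holds
# roughly the same number of diamonds; bin widths therefore vary.
carat_bins <- cut_number(smaller$carat, 20)

p_equal_n <- ggplot(smaller, aes(x = carat, y = price)) +
  geom_boxplot(aes(group = cut_number(carat, 20)))

p_equal_n
```

Running `table(carat_bins)` shows how many observations landed in each bin.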

1. This post is meant for readers wondering how to apply basic knowledge of R to explore datasets. Its contents are based on chapter seven of *R for Data Science* by Hadley Wickham and Garrett Grolemund.