In this post we’ll look at several examples of data exploration using the diamonds
dataset, which ships with ggplot2 and is loaded as part of the tidyverse.^{1}
When investigating variation within a single variable, ask:
 Which values are most common, most rare, and why?
 Does this match your expectations?
 Are there any unusual patterns, and what are the possible explanations?
 If there are clusters of observations, how are those observations similar or different?
Example 1: Use a histogram to visualize the distribution of a variable. Here we observe the distribution of the carat (weight) of diamonds.
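A minimal sketch of such a histogram, assuming ggplot2 is installed. The binwidth of 0.5 carat is an illustrative choice, not necessarily the one used for the original figure.

```r
library(ggplot2)  # provides the diamonds dataset and the plotting functions

# Histogram of diamond weight; each bar covers a 0.5-carat interval (assumed width)
p_hist <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.5)

print(p_hist)
```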
Example 2: Play with the binwidth of histograms to reveal different patterns in the data. In this example, changing the binwidth gives a more detailed view of the distribution of carat.
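One way to compare binwidths is sketched below. The 3-carat cutoff and both binwidth values are assumptions made for illustration; the names smaller, p_wide, and p_narrow are hypothetical.

```r
library(ggplot2)

# Keep diamonds under 3 carats so the bulk of the data fills the plot
# (the cutoff is an illustrative assumption)
smaller <- subset(diamonds, carat < 3)

# Same data, two binwidths: the narrower one exposes finer structure
p_wide   <- ggplot(smaller, aes(x = carat)) + geom_histogram(binwidth = 0.1)
p_narrow <- ggplot(smaller, aes(x = carat)) + geom_histogram(binwidth = 0.01)

print(p_wide)
print(p_narrow)
```

Viewing the two plots side by side makes it easier to judge whether a pattern is a real feature of the data or an artifact of the bin size.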
Example 3: Play with the x-axis and y-axis limits to identify outliers. In this example, there are some very high values of the “x” variable, a measure of a diamond’s dimensions, which could indicate data entry errors.
Note that in ggplot2, coord_cartesian() only zooms the view and keeps the values outside the limits, while xlim() and ylim() discard them before the plot is drawn.
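The difference between zooming and limiting can be sketched as follows. The binwidth and the 0 to 50 range are illustrative assumptions, and the names base, p_zoom, and p_limit are hypothetical.

```r
library(ggplot2)

# Histogram of the x dimension; binwidth is an assumed value
base <- ggplot(diamonds, aes(x = x)) +
  geom_histogram(binwidth = 0.5)

# Zoom: bars are computed from all rows, then the view is clipped to counts 0-50,
# so rare bins become visible while tall bars simply run off the top
p_zoom <- base + coord_cartesian(ylim = c(0, 50))

# Limit: bars whose height falls outside 0-50 are removed, with a warning
p_limit <- base + ylim(0, 50)

print(p_zoom)
print(p_limit)
```

Zooming is usually the safer choice when hunting for outliers, since no data is silently dropped.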
When investigating covariation between two variables, ask:
 Could this pattern be due to random chance?
 What is the relationship implied by the pattern, and how strong is it?
 What other variables might affect the relationship?
 Does the relationship change if you look at subgroups of data?
Case 1: A Categorical and a Continuous Variable
Example 1: Use geom_freqpoly() to overlay the distributions of several groups, plotting density instead of count so that groups of different sizes are comparable. Here we can see the distribution of carat, split by cut.
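A minimal sketch of the overlaid frequency polygons, assuming ggplot2 3.3.0 or later (for after_stat()). The binwidth is an illustrative assumption.

```r
library(ggplot2)

# One line per cut; density on the y-axis standardizes each group to area 1
p_freq <- ggplot(diamonds, aes(x = carat, y = after_stat(density), colour = cut)) +
  geom_freqpoly(binwidth = 0.1)

print(p_freq)
```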
Example 2: Another way to visualize the relationship is to create a boxplot, reorder the categories, and flip the axes. This is illustrated with the relationship between car class and highway miles per gallon in the mpg dataset.
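A sketch of the reordered, flipped boxplot. Ordering the classes by median highway mileage is an assumption; the original figure may have used a different ordering.

```r
library(ggplot2)

# reorder() sorts the class levels by median hwy (assumed ordering);
# coord_flip() turns the boxes sideways so long class names stay readable
p_box <- ggplot(mpg, aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
  geom_boxplot() +
  coord_flip() +
  labs(x = "class")

print(p_box)
```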
Example 3: In cases where there are a lot of outliers, boxplots may not be as useful. Instead, we can use a letter value plot. Here we have a good view of the density of observations when relating cut and price of diamonds.
Example 4: The violin plot is another great way to compare density distributions among different categories. This is a different way to represent the data from the previous graph.
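A minimal sketch of a violin plot relating cut and price, using default settings:

```r
library(ggplot2)

# One violin per cut; the width at each price reflects the estimated density
p_violin <- ggplot(diamonds, aes(x = cut, y = price)) +
  geom_violin()

print(p_violin)
```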
Case 2: Two Categorical Variables
Example 1: We can use geom_count() to map both categorical variables and display their frequency with the size of a point in a grid. In this example, we can see the distribution of observations between diamond cut and color.
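A minimal sketch of the count plot. Which variable goes on which axis is an arbitrary choice here.

```r
library(ggplot2)

# Each point sits at a cut/color combination; its size encodes the number of rows
p_count <- ggplot(diamonds, aes(x = cut, y = color)) +
  geom_count()

print(p_count)
```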
Example 2: A heat map can also come in handy when comparing density of observations between two categorical variables. Again, we can see the distribution of observations between diamond cut and color.
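One way to build such a heat map is to count the combinations first and then draw tiles. This sketch assumes dplyr is installed alongside ggplot2; the name cut_color is hypothetical.

```r
library(ggplot2)
library(dplyr)

# Tally the number of diamonds for every color/cut combination
cut_color <- count(diamonds, color, cut)

# Draw one tile per combination and shade it by its count
p_tile <- ggplot(cut_color, aes(x = color, y = cut)) +
  geom_tile(aes(fill = n))

print(p_tile)
```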
Case 3: Two Continuous Variables
Example 1: The scatterplot is a classic way to compare continuous variables. For example, here we examine the relationship between diamond carat and price.
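A minimal sketch of the scatterplot. The transparency setting is an added assumption that helps with overplotting in a dataset of this size.

```r
library(ggplot2)

# One point per diamond; alpha = 0.1 (assumed) makes dense regions appear darker
p_scatter <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.1)

print(p_scatter)
```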
Example 2: We can also bin variables in two dimensions, using the fill color to represent frequency of observations.
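A sketch of two-dimensional binning with geom_bin2d(), which uses rectangular bins. Using carat and price as the two variables is an assumption carried over from the scatterplot example.

```r
library(ggplot2)

# Divide the carat/price plane into rectangles; fill colour shows how many
# diamonds fall in each one (default number of bins)
p_bin2d <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_bin2d()

print(p_bin2d)
```

geom_hex() is a hexagonal alternative, though it needs the separate hexbin package.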
Example 3: Another option is to bin only one continuous variable so that it behaves like a categorical variable. Here, we bin carat.
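A sketch of binning carat with cut_width() and drawing one boxplot per bin. The bin width of 0.1 carat and price as the second variable are illustrative assumptions.

```r
library(ggplot2)

# cut_width() slices carat into 0.1-wide intervals (assumed width);
# the group aesthetic gives each interval its own boxplot of price
p_cutw <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_boxplot(aes(group = cut_width(carat, 0.1)))

print(p_cutw)
```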
Example 4: This is the same graph, except the width of the boxplot is proportional to the number of points.
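Proportional box widths can be requested with the varwidth argument of geom_boxplot(). As before, the 0.1-carat bin width and the use of price are assumptions.

```r
library(ggplot2)

# varwidth = TRUE scales each box's width with the square root of its group size,
# so sparsely populated bins look visibly thinner
p_varw <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_boxplot(aes(group = cut_width(carat, 0.1)), varwidth = TRUE)

print(p_varw)
```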
Example 5: This is again the same graph, but each box now contains approximately the same number of points.
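Equal-sized groups can be produced with cut_number() instead of cut_width(). The choice of 20 groups is an illustrative assumption.

```r
library(ggplot2)

# cut_number() picks interval boundaries so that each of the 20 bins (assumed count)
# holds roughly the same number of diamonds; bins are therefore unequal in width
p_cutn <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_boxplot(aes(group = cut_number(carat, 20)))

print(p_cutn)
```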

This post is meant for a person who is wondering how to apply basic knowledge of R to explore datasets. Its contents are based on chapter seven of R for Data Science by Hadley Wickham and Garrett Grolemund.