In this post we’ll be taking a look at several examples of data exploration using the
diamonds dataset in the tidyverse library.
When investigating variation within a single variable, ask:
- Which values are most common, most rare, and why?
- Does this match your expectations?
- Are there any unusual patterns, and what are the possible explanations?
- If there are clusters of observations, how are those observations similar or different?
Example 1: Use a histogram to visualize the distribution of a variable. Here we observe the distribution of the carat (weight) of diamonds.
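The plot itself isn't reproduced here, so here is a minimal sketch of what such a histogram could look like. The binwidth of 0.5 is an assumption chosen for illustration, not necessarily the value used for the original figure.

```r
library(ggplot2)

# Histogram of diamond weight (carat) from the diamonds dataset in ggplot2.
# binwidth = 0.5 is an illustrative choice.
p_hist <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.5)

p_hist
```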
Example 2: Vary the binwidth of a histogram to reveal different patterns in the data. In this example, a narrower binwidth gives a more detailed view of the distribution of carat.
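As a rough sketch of the idea, the code below draws the same histogram with a much narrower binwidth. Both the binwidth of 0.01 and the restriction to diamonds under 3 carats are assumptions made to keep the plot readable.

```r
library(ggplot2)

# Focus on the bulk of the data (the cutoff of 3 carats is an assumption).
smaller <- subset(diamonds, carat < 3)

# A narrow binwidth exposes fine structure that wide bins smooth over.
p_narrow <- ggplot(smaller, aes(x = carat)) +
  geom_histogram(binwidth = 0.01)

p_narrow
```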
Example 3: Play with the x-axis and y-axis limits to identify outliers. In this example, there are some very high values of the “x” variable, a measure of a diamond’s dimensions, which could indicate data entry errors.
Note that in ggplot2, coord_cartesian() zooms in while keeping the values outside the limits, whereas xlim() and ylim() discard them.
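To make the difference concrete, here is a small sketch using the "x" variable. The binwidth and the y-axis limits of 0 to 50 are assumptions picked so that rare bins become visible.

```r
library(ggplot2)

# Full histogram of x; bins with only a few observations are too short to see.
p_full <- ggplot(diamonds, aes(x = x)) +
  geom_histogram(binwidth = 0.5)

# coord_cartesian() only zooms the view: every bar is still computed from
# all of the data, and tall bars are simply clipped at the top of the panel.
p_zoom <- p_full + coord_cartesian(ylim = c(0, 50))

# ylim() sets scale limits instead: bar heights outside 0..50 become NA,
# so bars taller than 50 are dropped (ggplot2 warns when the plot is drawn).
p_cut <- p_full + ylim(0, 50)

p_zoom
```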
When investigating covariation between two variables, ask:
- Could this pattern be due to random chance?
- What is the relationship implied by the pattern, and how strong is it?
- What other variables might affect the relationship?
- Does the relationship change if you look at subgroups of data?
Case 1: A Categorical and a Continuous Variable
Example 1: Use geom_freqpoly() to overlay multiple histograms by density. Here we can see multiple histograms of carat, split by cut.
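One way this could be written is sketched below. Mapping y to after_stat(density) standardizes each line so that its area is one, which makes groups of very different sizes comparable. The binwidth of 0.1 is an assumption.

```r
library(ggplot2)

# One frequency polygon per cut, scaled to density rather than raw count.
# binwidth = 0.1 is an illustrative choice.
p_freq <- ggplot(diamonds, aes(x = carat, y = after_stat(density), colour = cut)) +
  geom_freqpoly(binwidth = 0.1)

p_freq
```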
Example 2: Another way to visualize the relationship is to create a boxplot, reorder the categories, and flip the axes. This is illustrated with the relationship between car class and highway miles per gallon via the mpg dataset.
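A minimal sketch of those three steps follows. Ordering the classes by median highway mileage is an assumption about how the categories were sorted.

```r
library(ggplot2)

# Reorder class by median hwy so the boxes appear in sorted order,
# then flip the axes so the class labels are easy to read.
p_box <- ggplot(mpg, aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
  geom_boxplot() +
  coord_flip() +
  labs(x = "class")

p_box
```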
Example 3: In cases where there are a lot of outliers, boxplots may not be as useful. Instead, we can use a letter value plot. Here we have a good view of the density of observations when relating cut and price of diamonds.
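Letter value plots are not part of ggplot2 itself. The sketch below assumes the separate lvplot package, which provides geom_lv(), and skips the plot if that package is not installed.

```r
library(ggplot2)

# geom_lv() comes from the lvplot package (an extra dependency, assumed here).
p_lv <- if (requireNamespace("lvplot", quietly = TRUE)) {
  ggplot(diamonds, aes(x = cut, y = price)) +
    lvplot::geom_lv()
} else {
  NULL  # lvplot is not available, so no plot is built
}

p_lv
```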
Example 4: The violin plot is another great way to compare density distributions among different categories. This is a different way to represent the data from the previous graph.
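The same cut and price comparison could be sketched as a violin plot like this, using default settings throughout.

```r
library(ggplot2)

# Each violin shows a mirrored density estimate of price within one cut.
p_violin <- ggplot(diamonds, aes(x = cut, y = price)) +
  geom_violin()

p_violin
```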
Case 2: Two Categorical Variables
Example 1: We can use geom_count() to map both categorical variables and display their frequency with the size of a point in a grid. In this example, we can see the distribution of observations between diamond cut and color.
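A minimal sketch of that plot, with default settings:

```r
library(ggplot2)

# geom_count() counts the observations at each cut/color combination
# and maps that count to point size.
p_count <- ggplot(diamonds, aes(x = cut, y = color)) +
  geom_count()

p_count
```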
Example 2: A heat map can also come in handy when comparing density of observations between two categorical variables. Again, we can see the distribution of observations between diamond cut and color.
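One possible way to build such a heat map is to count the combinations first and then fill tiles by that count. The sketch below uses base R's table() for the counting step, which is a choice made here to avoid extra dependencies; dplyr's count() would do the same job.

```r
library(ggplot2)

# Count observations for every cut/color pair.
counts <- as.data.frame(
  table(cut = diamonds$cut, color = diamonds$color),
  responseName = "n"
)

# One tile per combination, shaded by the number of observations.
p_heat <- ggplot(counts, aes(x = color, y = cut, fill = n)) +
  geom_tile()

p_heat
```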
Case 3: Two Continuous Variables
Example 1: The scatterplot is a classic way to compare continuous variables. For example, here we examine the relationship between diamond carat and price.
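A minimal sketch of the scatterplot is below. The alpha value is an assumption added to reduce overplotting, since the dataset has tens of thousands of points.

```r
library(ggplot2)

# Price against carat; transparency (an illustrative choice) helps
# dense regions stand out from sparse ones.
p_scatter <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.1)

p_scatter
```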
Example 2: We can also bin variables in two dimensions, using the fill color to represent frequency of observations.
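Here is a sketch of two-dimensional binning with rectangular bins, assuming carat and price are the two variables as in the previous example. geom_hex() is a hexagonal alternative, but it needs the extra hexbin package.

```r
library(ggplot2)

# Divide the carat/price plane into rectangular bins and
# fill each bin according to how many diamonds fall in it.
p_bin2d <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_bin_2d()

p_bin2d
```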
Example 3: Another option is to bin only one continuous variable so that it behaves like a categorical variable. Here, we bin carat.
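As a sketch, cut_width() from ggplot2 can split carat into bins of equal width, with one boxplot of price drawn per bin. The bin width of 0.1 and the restriction to diamonds under 3 carats are assumptions.

```r
library(ggplot2)

# Keep the bulk of the data (the 3-carat cutoff is an assumption).
smaller <- subset(diamonds, carat < 3)

# cut_width() turns carat into bins of width 0.1; grouping by those bins
# produces one boxplot of price per bin.
p_binned <- ggplot(smaller, aes(x = carat, y = price)) +
  geom_boxplot(aes(group = cut_width(carat, 0.1)))

p_binned
```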
Example 4: This is the same graph, except the width of the boxplot is proportional to the number of points.
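In ggplot2 this is controlled by the varwidth argument of geom_boxplot(). The sketch below reuses the assumed bin width of 0.1 and the 3-carat cutoff.

```r
library(ggplot2)

smaller <- subset(diamonds, carat < 3)  # cutoff is an assumption

# varwidth = TRUE scales each box's width by the number of observations
# in its bin, so sparsely populated bins look visibly thinner.
p_varwidth <- ggplot(smaller, aes(x = carat, y = price)) +
  geom_boxplot(aes(group = cut_width(carat, 0.1)), varwidth = TRUE)

p_varwidth
```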
Example 5: This is again the same graph, but each box now contains approximately the same number of points.
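One way to get roughly equal-sized groups is cut_number() from ggplot2, which picks bin edges from quantiles. The choice of 20 bins and the 3-carat cutoff are assumptions.

```r
library(ggplot2)

smaller <- subset(diamonds, carat < 3)  # cutoff is an assumption

# cut_number() makes 20 bins holding approximately equal numbers of points,
# so bins are narrow where data is dense and wide where it is sparse.
p_equal_n <- ggplot(smaller, aes(x = carat, y = price)) +
  geom_boxplot(aes(group = cut_number(carat, 20)))

p_equal_n
```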