Data Visualization

Introduction to ggplot

Why do data viz?

  • part of exploratory data analysis (EDA)
  • to convey information to others

Types of data

Numeric data

  • Discrete
  • Continuous

Categorical data

  • Ordinal
  • Nominal
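
In R, these four types usually correspond to different kinds of vectors. The sketch below uses made-up values, and the variable names are hypothetical rather than part of the course data.

```r
# Numeric data
n_children <- c(0L, 2L, 1L, 3L)        # discrete: integer counts
height_cm  <- c(162.5, 180.1, 175.0)   # continuous: real-valued measurements

# Categorical data
blood_type <- factor(c("A", "O", "AB", "O"))   # nominal: no inherent order
size <- factor(c("small", "large", "medium"),
               levels = c("small", "medium", "large"),
               ordered = TRUE)                  # ordinal: levels have an order

class(n_children)        # integer
class(height_cm)         # numeric
is.ordered(blood_type)   # FALSE
is.ordered(size)         # TRUE
size[1] < size[2]        # TRUE, because "small" comes before "large"
```

The distinction matters for plotting because ggplot2 picks different default scales for numeric and categorical variables.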

ggplot2

  • Tidyverse’s data visualization package
  • ggplot() is the main function in the package
    • gg stands for Grammar of Graphics
    • inspired by the book The Grammar of Graphics by Leland Wilkinson
  • Plots are built up in layers
  • Concisely describe components of a graphic
  • Easy to do incremental development and check your work!
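
One way to see the layered, incremental style is to store a plot in an object and add one layer at a time. This is a minimal sketch on a hypothetical toy data frame (toy and the p_* names are invented for illustration).

```r
library(ggplot2)

# Hypothetical toy data, invented for illustration
toy <- data.frame(x = 1:5, y = c(2, 4, 3, 6, 5))

p_base   <- ggplot(toy, aes(x = x, y = y))  # step 1: data and mapping only, nothing drawn yet
p_points <- p_base + geom_point()           # step 2: add a layer of points
p_both   <- p_points + geom_line()          # step 3: add a layer of lines on top

# layer_data() returns the data computed for one layer of a plot
points_layer <- layer_data(p_both, i = 1)
lines_layer  <- layer_data(p_both, i = 2)

# print(p_both) would draw the finished plot
```

Because each step is a complete plot object, you can print it and check the result before adding the next layer.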

ggplot2

Lets us control:

  • functions used for plotting
  • dataset being plotted
  • mapping of variables to plot features (aesthetics)

Documentation: ggplot2.tidyverse.org

Components of a data graphic

Every data graphic that you make with ggplot2 will contain at least the following elements:

  • a call to the ggplot() function that contains:

    • a data argument that specifies the name of the object containing the data to be plotted

    • a mapping argument that specifies one (or more) connections between variables in the data frame and elements on the plot

  • a + operator that adds a new layer to a plot

  • a call to a geom_*() function that specifies what you actually want to draw

ggplot2

  • ggplot() – main function
  • Plot code structure:
    ggplot(data = [dataset],                          # where to find the data
           mapping = aes(x = [x-var], y = [y-var])) + # which variables go on which axis
      geom_*() +                                      # what to draw
      [other options]

Our data

# Install (once) and load the readxl package to read Excel files
# install.packages("readxl")
library(readxl)
library(ggplot2) # needed for the plots that follow
df_ab <- read_excel("ab_data.xlsx")
head(df_ab) # show the first rows
# A tibble: 6 × 3
      a     b c      
  <dbl> <dbl> <chr>  
1  -2    -0.2 inside 
2  -1.9  -1.8 outside
3  -1.8  -3.1 inside 
4  -1.5  -3.2 inside 
5  -1.4   1.2 outside
6  -1.3  -1.2 inside 
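
If the Excel file is not at hand, the rows printed above can be typed in directly. This is only a stand-in: the values are copied from the printed output (which may be rounded), and the real file may contain more rows.

```r
# Stand-in for df_ab built from the six rows printed above,
# so the plotting examples can run without ab_data.xlsx
df_ab <- data.frame(
  a = c(-2, -1.9, -1.8, -1.5, -1.4, -1.3),
  b = c(-0.2, -1.8, -3.1, -3.2, 1.2, -1.2),
  c = c("inside", "outside", "inside", "inside", "outside", "inside"),
  stringsAsFactors = FALSE
)

str(df_ab)       # column names and types
nrow(df_ab)      # number of rows
table(df_ab$c)   # how many rows fall in each category of c
```

Here a and b are numeric and c is categorical, which is why c is a natural choice for coloring points.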

ggplot2

Define dataset

# only the dataset is defined, so the result is an empty canvas
ggplot(data = df_ab)

ggplot2

Set up variable mapping of one variable on the x-axis of the plot space

# a is mapped to the x-axis; nothing is drawn until a geom is added
ggplot(data = df_ab,
       mapping = aes(x = a))

ggplot2

Set up variable mapping for two variables, one on the x-axis and one on the y-axis of the plot space

# a and b are mapped to the axes; still no geom, so no points yet
ggplot(data = df_ab,
       mapping = aes(x = a, y = b))

ggplot2

Specify the geom type

# scatterplot of a and b
ggplot(data = df_ab,
       mapping = aes(x = a, y = b)) + 
  geom_point()

ggplot2

Color points based on variable in dataset

# scatterplot of a and b
ggplot(data = df_ab,
       mapping = aes(x = a, y = b, color = c)) + 
  geom_point()

ggplot2

Color all points the same color

# scatterplot of a and b
ggplot(data = df_ab,
       mapping = aes(x = a, y = b)) + 
  geom_point(color = "purple")
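
The last two slides differ in where color appears: inside aes() it is mapped to a variable, outside aes() it is set to one fixed value. The sketch below compares the two, using a stand-in for df_ab typed from the rows printed earlier (an assumption, since the Excel file is not included here).

```r
library(ggplot2)

# Stand-in for df_ab, values copied from the printed rows
df_ab <- data.frame(
  a = c(-2, -1.9, -1.8, -1.5, -1.4, -1.3),
  b = c(-0.2, -1.8, -3.1, -3.2, 1.2, -1.2),
  c = c("inside", "outside", "inside", "inside", "outside", "inside"),
  stringsAsFactors = FALSE
)

# Mapped: color inside aes() ties the point color to the variable c
p_mapped <- ggplot(df_ab, aes(x = a, y = b, color = c)) +
  geom_point()

# Set: color outside aes() applies one fixed value to every point
p_fixed <- ggplot(df_ab, aes(x = a, y = b)) +
  geom_point(color = "purple")

# layer_data() shows the colors ggplot2 will actually draw
mapped_colours <- unique(layer_data(p_mapped)$colour)  # one color per category of c
fixed_colours  <- unique(layer_data(p_fixed)$colour)   # a single color
```

A mapped aesthetic also produces a legend, while a fixed value does not.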

ggplot2

Add a plot title

# scatterplot of a and b
ggplot(data = df_ab,
       mapping = aes(x = a, y = b, color = c)) + 
  geom_point() + 
  labs(title = "Plot of B versus A colored by C")

ggplot2

# same plot as before; data and mapping are matched by position,
# so the argument names can be omitted
ggplot(df_ab,
       aes(x = a, y = b, color = c)) + 
  geom_point() + 
  labs(title = "Plot of B versus A colored by C")

Basic plot types

  • Distribution
    • Histogram (distribution across buckets or bins)
    • Smooth density plot (smoothed estimate of the distribution, like a smoothed histogram)
    • Box plot (distribution with key summary statistics)
  • Relationships
    • Scatter plot (relationship between x and y)
  • Summarizing categories
    • Bar chart (distribution or frequency across categories)

Distribution: Histogram

library(readr) # provides read_csv()
hotels <- read_csv("data/hotels.csv")

ggplot(hotels, aes(x = adr)) +
  geom_histogram(binwidth = 20) # Specify size of buckets

Distribution: Histogram

ggplot(hotels, aes(x = adr)) +
  geom_histogram(binwidth = 10) +
  xlim(0,300) # Specify range shown on x-axis
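
To see what binwidth and xlim() do to the underlying bins, it can help to inspect the computed data. This is a sketch on a hypothetical stand-in for the adr column (the values are invented); coord_cartesian() is shown as an alternative to xlim() and is not part of the original slides.

```r
library(ggplot2)

# Hypothetical stand-in for hotels$adr; values are invented for illustration
toy <- data.frame(adr = c(seq(5, 295, by = 10), 350, 420))

p20 <- ggplot(toy, aes(x = adr)) + geom_histogram(binwidth = 20)
p10 <- ggplot(toy, aes(x = adr)) + geom_histogram(binwidth = 10)

# layer_data() returns the bins computed for a plot, one row per bin
bins20 <- layer_data(p20)
bins10 <- layer_data(p10)
nrow(bins20)        # number of bins at binwidth 20
nrow(bins10)        # more, narrower bins at binwidth 10
sum(bins10$count)   # every observation lands in exactly one bin

# xlim(0, 300) would drop the observations outside that range (with a warning)
n_outside <- sum(toy$adr < 0 | toy$adr > 300)

# coord_cartesian() zooms the view instead and keeps all observations in the bins
p_zoom <- p10 + coord_cartesian(xlim = c(0, 300))
zoom_bins <- layer_data(p_zoom)
```

The choice matters mostly for statistics computed from the data: xlim() changes what goes into the bins, whereas coord_cartesian() only changes what is shown.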

Distribution: Density plot

ggplot(hotels, aes(x = adr)) +
  geom_density() +
  xlim(0,300)

Distribution: Density plot

ggplot(hotels, aes(x = adr)) +
  geom_density(color = "red", fill = "red") + # add fill color, change line color
  xlim(0,300)

Distribution: Box plot

ggplot(hotels, aes(x = adr)) +
  geom_boxplot() +
  xlim(0,600)

Distribution: Box plot

ggplot(hotels, aes(x = adr)) +
  geom_boxplot()
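
The key summary statistics in a box plot can be reproduced by hand. The sketch below follows the default convention of geom_boxplot() (quartiles for the box, whiskers up to 1.5 times the interquartile range), using hypothetical nightly rates invented for illustration.

```r
library(ggplot2)

# Hypothetical nightly rates, invented for illustration
adr <- c(40, 60, 80, 100, 120, 140, 160, 180, 500)

q   <- quantile(adr, c(0.25, 0.5, 0.75))  # box edges (quartiles) and median
iqr <- q[[3]] - q[[1]]                    # interquartile range: the width of the box

# Whiskers reach the most extreme points within 1.5 * IQR of the box
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[3]] + 1.5 * iqr
whiskers <- range(adr[adr >= lower_fence & adr <= upper_fence])

# Anything beyond the fences is drawn as an individual outlier point
outliers <- adr[adr < lower_fence | adr > upper_fence]

# The corresponding plot, in the same form as the slides
p <- ggplot(data.frame(adr = adr), aes(x = adr)) +
  geom_boxplot()
```

With these values the box spans 80 to 160 with the median at 120, the whiskers end at 40 and 180, and 500 appears as a separate point.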

Categorical: Bar plot

ggplot(hotels, aes(x = hotel)) +
  geom_bar() + 
  labs(title = "Bookings by hotel type",
       x = "Hotel type",
       y = "Number of bookings")
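
geom_bar() counts the rows in each category, so no y variable is needed. The sketch below checks this against table() on a hypothetical stand-in for hotels (the rows and hotel labels are assumed, not read from the file).

```r
library(ggplot2)

# Hypothetical stand-in for hotels; labels and rows are assumptions
toy <- data.frame(
  hotel = c("City Hotel", "Resort Hotel", "City Hotel",
            "City Hotel", "Resort Hotel"),
  stringsAsFactors = FALSE
)

p <- ggplot(toy, aes(x = hotel)) +
  geom_bar() +
  labs(x = "Hotel type", y = "Number of bookings")

bar_heights <- layer_data(p)$count          # heights computed by geom_bar()
row_counts  <- as.vector(table(toy$hotel))  # the same counts from base R
```

If the counts are already a column in the data, geom_col() with an explicit y mapping is the usual choice instead.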

Some Practice

Some more practice

  • ggplot_examples_1.R (handout)
  • Download ab_data.xlsx data set