Bar plots and effective data visualization

Intro to Data Analytics

Visualizing categorical data

Data: Lending Club

Thousands of loans made through Lending Club, a platform that allows individuals to lend to other individuals
Not all loans are created equal: the ease of getting a loan depends on the borrower's (apparent) ability to pay it back.
The data include only loans that were actually made, not loan applications

A peek at data

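library(tidyverse)  # provides glimpse(), %>%, and ggplot()
library(openintro)  # provides the loans_full_schema data set
glimpse(loans_full_schema)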

Rows: 10,000
Columns: 55
$ emp_title                        <chr> "global config engineer ", "warehouse…
$ emp_length                       <dbl> 3, 10, 3, 1, 10, NA, 10, 10, 10, 3, 1…
$ state                            <fct> NJ, HI, WI, PA, CA, KY, MI, AZ, NV, I…
$ homeownership                    <fct> MORTGAGE, RENT, RENT, RENT, RENT, OWN…
$ annual_income                    <dbl> 90000, 40000, 40000, 30000, 35000, 34…
$ verified_income                  <fct> Verified, Not Verified, Source Verifi…
$ debt_to_income                   <dbl> 18.01, 5.04, 21.15, 10.16, 57.96, 6.4…
$ annual_income_joint              <dbl> NA, NA, NA, NA, 57000, NA, 155000, NA…
$ verification_income_joint        <fct> , , , , Verified, , Not Verified, , ,…
$ debt_to_income_joint             <dbl> NA, NA, NA, NA, 37.66, NA, 13.12, NA,…
$ delinq_2y                        <int> 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0…
$ months_since_last_delinq         <int> 38, NA, 28, NA, NA, 3, NA, 19, 18, NA…
$ earliest_credit_line             <dbl> 2001, 1996, 2006, 2007, 2008, 1990, 2…
$ inquiries_last_12m               <int> 6, 1, 4, 0, 7, 6, 1, 1, 3, 0, 4, 4, 8…
$ total_credit_lines               <int> 28, 30, 31, 4, 22, 32, 12, 30, 35, 9,…
$ open_credit_lines                <int> 10, 14, 10, 4, 16, 12, 10, 15, 21, 6,…
$ total_credit_limit               <int> 70795, 28800, 24193, 25400, 69839, 42…
$ total_credit_utilized            <int> 38767, 4321, 16000, 4997, 52722, 3898…
$ num_collections_last_12m         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ num_historical_failed_to_pay     <int> 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0…
$ months_since_90d_late            <int> 38, NA, 28, NA, NA, 60, NA, 71, 18, N…
$ current_accounts_delinq          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ total_collection_amount_ever     <int> 1250, 0, 432, 0, 0, 0, 0, 0, 0, 0, 0,…
$ current_installment_accounts     <int> 2, 0, 1, 1, 1, 0, 2, 2, 6, 1, 2, 1, 2…
$ accounts_opened_24m              <int> 5, 11, 13, 1, 6, 2, 1, 4, 10, 5, 6, 7…
$ months_since_last_credit_inquiry <int> 5, 8, 7, 15, 4, 5, 9, 7, 4, 17, 3, 4,…
$ num_satisfactory_accounts        <int> 10, 14, 10, 4, 16, 12, 10, 15, 21, 6,…
$ num_accounts_120d_past_due       <int> 0, 0, 0, 0, 0, 0, 0, NA, 0, 0, 0, 0, …
$ num_accounts_30d_past_due        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ num_active_debit_accounts        <int> 2, 3, 3, 2, 10, 1, 3, 5, 11, 3, 2, 2,…
$ total_debit_limit                <int> 11100, 16500, 4300, 19400, 32700, 272…
$ num_total_cc_accounts            <int> 14, 24, 14, 3, 20, 27, 8, 16, 19, 7, …
$ num_open_cc_accounts             <int> 8, 14, 8, 3, 15, 12, 7, 12, 14, 5, 8,…
$ num_cc_carrying_balance          <int> 6, 4, 6, 2, 13, 5, 6, 10, 14, 3, 5, 3…
$ num_mort_accounts                <int> 1, 0, 0, 0, 0, 3, 2, 7, 2, 0, 2, 3, 3…
$ account_never_delinq_percent     <dbl> 92.9, 100.0, 93.5, 100.0, 100.0, 78.1…
$ tax_liens                        <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ public_record_bankrupt           <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
$ loan_purpose                     <fct> moving, debt_consolidation, other, de…
$ application_type                 <fct> individual, individual, individual, i…
$ loan_amount                      <int> 28000, 5000, 2000, 21600, 23000, 5000…
$ term                             <dbl> 60, 36, 36, 36, 36, 36, 60, 60, 36, 3…
$ interest_rate                    <dbl> 14.07, 12.61, 17.09, 6.72, 14.07, 6.7…
$ installment                      <dbl> 652.53, 167.54, 71.40, 664.19, 786.87…
$ grade                            <ord> C, C, D, A, C, A, C, B, C, A, C, B, C…
$ sub_grade                        <fct> C3, C1, D1, A3, C3, A3, C2, B5, C2, A…
$ issue_month                      <fct> Mar-2018, Feb-2018, Feb-2018, Jan-201…
$ loan_status                      <fct> Current, Current, Current, Current, C…
$ initial_listing_status           <fct> whole, whole, fractional, whole, whol…
$ disbursement_method              <fct> Cash, Cash, Cash, Cash, Cash, Cash, C…
$ balance                          <dbl> 27015.86, 4651.37, 1824.63, 18853.26,…
$ paid_total                       <dbl> 1999.330, 499.120, 281.800, 3312.890,…
$ paid_principal                   <dbl> 984.14, 348.63, 175.37, 2746.74, 1569…
$ paid_interest                    <dbl> 1015.19, 150.49, 106.43, 566.15, 754.…
$ paid_late_fees                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

Selected variables

loans <- loans_full_schema %>%
  select(loan_amount, interest_rate, term, grade, 
         state, annual_income, homeownership, debt_to_income)
glimpse(loans)

Rows: 10,000
Columns: 8
$ loan_amount    <int> 28000, 5000, 2000, 21600, 23000, 5000, 24000, 20000, 20…
$ interest_rate  <dbl> 14.07, 12.61, 17.09, 6.72, 14.07, 6.72, 13.59, 11.99, 1…
$ term           <dbl> 60, 36, 36, 36, 36, 36, 60, 60, 36, 36, 60, 60, 36, 60,…
$ grade          <ord> C, C, D, A, C, A, C, B, C, A, C, B, C, B, D, D, D, F, E…
$ state          <fct> NJ, HI, WI, PA, CA, KY, MI, AZ, NV, IL, IL, FL, SC, CO,…
$ annual_income  <dbl> 90000, 40000, 40000, 30000, 35000, 34000, 35000, 110000…
$ homeownership  <fct> MORTGAGE, RENT, RENT, RENT, RENT, OWN, MORTGAGE, MORTGA…
$ debt_to_income <dbl> 18.01, 5.04, 21.15, 10.16, 57.96, 6.46, 23.66, 16.19, 3…

Selected variables

variable	description
`loan_amount`	Amount of loan, in US dollars
`interest_rate`	Loan interest rate, in annual percentage
`term`	Loan length, in whole number of months
`grade`	Loan grade (A through G), reflecting the likelihood that the loan will be repaid
`state`	US state where borrower resides
`annual_income`	Borrower’s annual income, including any second income (USD)
`homeownership`	Whether the borrower owns their home outright, owns it with a mortgage, or rents
`debt_to_income`	Debt-to-income ratio

Variable types

variable	type
`loan_amount`	numerical, continuous
`interest_rate`	numerical, continuous
`term`	numerical, discrete
`grade`	categorical, ordinal
`state`	categorical, not ordinal
`annual_income`	numerical, continuous
`homeownership`	categorical, not ordinal
`debt_to_income`	numerical, continuous
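
The type tags that glimpse() prints (<int>, <dbl>, <fct>, <ord>) line up with this table. As a rough sketch, the toy data frame below (the name toy_loans is hypothetical; its values are the first three rows shown in the glimpse output) checks which columns R treats as numerical and which factor is ordered:

```r
# Toy data frame mirroring four of the selected variables
toy_loans <- data.frame(
  loan_amount   = c(28000L, 5000L, 2000L),
  interest_rate = c(14.07, 12.61, 17.09),
  grade         = factor(c("C", "C", "D"), levels = LETTERS[1:7], ordered = TRUE),
  homeownership = factor(c("MORTGAGE", "RENT", "RENT"))
)

# First class of each column: "integer", "numeric", "ordered", "factor"
col_classes <- sapply(toy_loans, function(col) class(col)[1])

# Numerical vs. categorical, and ordinal vs. not ordinal
is_numeric <- sapply(toy_loans, is.numeric)
is_ordinal <- sapply(toy_loans, is.ordered)

print(col_classes)
print(is_numeric)
print(is_ordinal)
```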

Bar plot

ggplot(loans, aes(x = homeownership)) +
  geom_bar()

Segmented bar plot

ggplot(loans, aes(x = homeownership, 
                  fill = grade)) + 
  geom_bar()

Segmented bar plot

Show proportions instead of counts.

ggplot(loans, aes(x = homeownership, fill = grade)) +
  geom_bar(position = "fill")
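
To see the numbers behind position = "fill", the sketch below computes within-group proportions by hand in base R. The data frame toy is made up for illustration and is not the Lending Club data:

```r
# Made-up data: one row per loan
toy <- data.frame(
  homeownership = c("RENT", "RENT", "RENT", "OWN", "OWN",
                    "MORTGAGE", "MORTGAGE", "MORTGAGE", "MORTGAGE"),
  grade = c("A", "B", "B", "A", "C", "A", "A", "B", "C")
)

# Counts per homeownership/grade cell: the heights of the stacked segments
counts <- table(toy$homeownership, toy$grade)

# Divide each row by its total so every homeownership group sums to 1
props <- prop.table(counts, margin = 1)

print(counts)
print(round(props, 2))
```

Because every bar is rescaled to the same height, group sizes are no longer visible, but the grade mix can be compared across groups.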

Bar Plot

Which of the two segmented bar plots above is more useful for visualizing the relationship between homeownership and loan grade?

Customizing bar plots

ggplot(loans, aes(y = homeownership, 
                  fill = grade)) +
  geom_bar(position = "fill") +
  labs( 
    x = "Proportion", 
    y = "Homeownership", 
    fill = "Grade", 
    title = "Grades of Lending Club loans", 
    subtitle = "and homeownership of lendee" 
  )

Relationships between numerical and categorical variables

Already talked about…

Coloring and faceting histograms and density plots
Side-by-side box plots

Violin plots

Violin plots combine side-by-side box plots with density plots

ggplot(loans, aes(x = homeownership, y = loan_amount)) +
  geom_violin()

Violin plots

Violin plots show peaks in the data
This makes them useful for multimodal distributions (those with multiple peaks)

ggplot(loans, aes(x = homeownership, y = loan_amount)) +
  geom_violin()

Describing shapes of numerical distributions

- shape:
  - skewness:
    - right-skewed (positively skewed)
    - left-skewed (negatively skewed)
    - symmetric
  - modality (how many peaks, or modes, a distribution has): unimodal, bimodal, multimodal, uniform
- center: mean (`mean`), median (`median`), mode (not always useful)
- spread: range (`range`), standard deviation (`sd`), interquartile range (`IQR`)
- unusual observations
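
As a small illustration of the summary functions named above, the sketch below applies them to a made-up, right-skewed vector (the values are hypothetical, not taken from the loans data). The single large value pulls the mean above the median, which is typical of right skew:

```r
# Made-up, right-skewed values with one unusually large observation
x <- c(1, 2, 2, 3, 3, 3, 4, 20)

# Center
center <- c(mean = mean(x), median = median(x))

# Spread
spread <- c(sd = sd(x), IQR = IQR(x))
value_range <- range(x)  # smallest and largest value

print(center)
print(spread)
print(value_range)
```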

Violin plots

We can add more information to our violin plots to communicate more about the numerical distribution. In the plot below we add a boxplot within the violin.

ggplot(loans, aes(x = homeownership, y = loan_amount)) +
  geom_violin() +
  geom_boxplot(width = 0.1)

Why did we choose a box plot in this case? What about the distributions shown in the violin plots points us toward displaying the median and IQR?

Ridge plots

library(ggridges)
ggplot(loans, aes(x = loan_amount, y = grade, fill = grade, color = grade)) + 
  geom_density_ridges(alpha = 0.5)

Designing effective visualizations

Keep it simple

Use color to draw attention

d %>%
  mutate(category = str_replace(category, " ", "\n")) %>%
  ggplot(aes(x = category, y = value, fill = category)) +
  geom_col() +
  theme_minimal() +
  labs(x = "", y = "") +
  theme(legend.position = "none")

Use color to draw attention

ggplot(d, aes(x = fct_reorder(category, value), y = value, fill = category)) +
  geom_col() +
  theme_minimal() +
  coord_flip() +
  labs(x = "", y = "") +
  scale_fill_manual(values = c("red", rep("gray", 4))) +
  theme(legend.position = "none")
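
One caveat with an unnamed values vector is that colors are assigned in the order of the factor levels, so which bar turns red depends on that order. A minimal sketch of a more explicit alternative builds a named palette. The category names and the highlight choice below are hypothetical, since the data frame d is not shown in these slides:

```r
# Hypothetical category names and the one we want to stand out
categories <- c("A", "B", "C", "D", "E")
highlight <- "C"

# Named palette: red for the highlighted category, gray for the rest
pal <- setNames(
  ifelse(categories == highlight, "red", "gray"),
  categories
)

print(pal)
```

A named vector like pal can be passed as scale_fill_manual(values = pal), in which case ggplot2 matches colors to levels by name rather than by position.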

Tell a story

Principles for effective visualizations

Order matters
Put long categories on the y-axis
Keep scales consistent
Select meaningful colors
Use meaningful and nonredundant labels

Data

In September 2019, a YouGov survey asked 1,639 adults in Great Britain the following question:

In hindsight, do you think Britain was right/wrong to vote to leave EU?

Right to leave

Wrong to leave

Don’t know

Source: YouGov Survey Results, retrieved Oct 7, 2019

Order matters

Alphabetical order is rarely ideal

ggplot(brexit, aes(x = opinion)) +
  geom_bar()

Order by frequency

fct_infreq: Reorder factors’ levels by frequency

ggplot(brexit, aes(x = fct_infreq(opinion))) + 
  geom_bar()

Wrap the opinion variable inside the fct_infreq function and reorder the categories by frequency
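
To see what fct_infreq() does to the levels themselves, here is a small sketch on a made-up factor (the vector opinion below is hypothetical, not the survey data). By default, factor levels are alphabetical; after fct_infreq() the most common level comes first:

```r
library(forcats)

# Made-up responses: "Wrong" x3, "Right" x2, "Don't know" x1
opinion <- factor(c("Wrong", "Right", "Wrong", "Don't know", "Wrong", "Right"))

# Default order is alphabetical
levels_before <- levels(opinion)

# Reordered from most to least frequent
levels_after <- levels(fct_infreq(opinion))

print(levels_before)
print(levels_after)
```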

Clean up labels

ggplot(brexit, aes(x = fct_infreq(opinion))) +
  geom_bar() +
  labs( 
    x = "Opinion", 
    y = "Count" 
  )

Alphabetical order is rarely ideal

ggplot(brexit, aes(x = region)) +
  geom_bar()

Use inherent level order

fct_relevel: Reorder factor levels using a custom order

brexit <- brexit %>%
  mutate(
    region = fct_relevel( 
      region,
      "london", "rest_of_south", "midlands_wales", "north", "scot"
    )
  )
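
A quick way to check a relevel is to compare levels() before and after. The sketch below does this on a made-up factor that uses the same region codes (the vector region is hypothetical, not the brexit data frame):

```r
library(forcats)

# Made-up region codes; default level order is alphabetical
region <- factor(c("scot", "london", "north", "midlands_wales",
                   "rest_of_south", "london"))
levels_before <- levels(region)

# Custom order, listed level by level
region_releveled <- fct_relevel(
  region,
  "london", "rest_of_south", "midlands_wales", "north", "scot"
)
levels_after <- levels(region_releveled)

print(levels_before)
print(levels_after)
```

Only the level order changes; the underlying values stay the same.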

Clean up labels

fct_recode: Change factor levels by hand

brexit <- brexit %>%
  mutate(
    region = fct_recode( 
      region,
      London = "london",
      `Rest of South` = "rest_of_south",
      `Midlands / Wales` = "midlands_wales",
      North = "north",
      Scotland = "scot"
    )
  )
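
In fct_recode() the new label goes on the left and the existing level on the right. A small sketch on a made-up factor (hypothetical, not the survey data) suggests that recoding changes the labels while keeping the level order that was set earlier:

```r
library(forcats)

# Made-up factor whose levels are already in the custom order
region <- factor(
  c("scot", "london", "north"),
  levels = c("london", "rest_of_south", "midlands_wales", "north", "scot")
)

# new label = "old level"
region_labeled <- fct_recode(
  region,
  London = "london",
  `Rest of South` = "rest_of_south",
  `Midlands / Wales` = "midlands_wales",
  North = "north",
  Scotland = "scot"
)

print(levels(region_labeled))
print(as.character(region_labeled))
```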

Put long categories on the y-axis

Long categories can be hard to read

Move them to the y-axis

ggplot(brexit, aes(y = region)) + 
  geom_bar()

And reverse the order of levels

fct_rev: Reverse order of factor levels

ggplot(brexit, aes(y = fct_rev(region))) + 
  geom_bar()
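
The reversal is needed because ggplot2 places the first level of a discrete y variable at the bottom of the axis, so the levels read bottom to top. A short sketch on a made-up factor (hypothetical, not the brexit data frame) shows that fct_rev() simply flips the level order:

```r
library(forcats)

# Made-up factor with levels in the intended top-to-bottom reading order
region <- factor(
  c("London", "North", "Scotland"),
  levels = c("London", "North", "Scotland")
)

# Flip the order so the first level ends up at the top of a y-axis
region_reversed <- fct_rev(region)

print(levels(region))
print(levels(region_reversed))
```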

Clean up labels

ggplot(brexit, aes(y = fct_rev(region))) +
  geom_bar() +
  labs( 
    x = "Count", 
    y = "Region" 
  )

Pick a purpose

Segmented bar plots can be hard to read

ggplot(brexit, aes(y = region, fill = opinion)) + 
  geom_bar()

Avoid redundancy?

ggplot(brexit, aes(y = opinion)) +
  geom_bar() +
  facet_wrap(~region, nrow = 1)

Redundancy can help tell a story

ggplot(brexit, aes(y = opinion, fill = opinion)) +
  geom_bar() +
  facet_wrap(~region, nrow = 1)

Be selective with redundancy

ggplot(brexit, aes(y = opinion, fill = opinion)) +
  geom_bar() +
  facet_wrap(~region, nrow = 1) +
  guides(fill = "none")

Use informative labels

ggplot(brexit, aes(y = opinion, fill = opinion)) +
  geom_bar() +
  facet_wrap(~region, nrow = 1) +
  guides(fill = "none") +
  labs(
    title = "Was Britain right/wrong to vote to leave EU?", 
    x = NULL, y = NULL
  )

A bit more info

ggplot(brexit, aes(y = opinion, fill = opinion)) +
  geom_bar() +
  facet_wrap(~region, nrow = 1) +
  guides(fill = "none") +
  labs(
    title = "Was Britain right/wrong to vote to leave EU?",
    subtitle = "YouGov Survey Results, 2-3 September 2019", 
    caption = "Source: https://d25d2506sfb94s.cloudfront.net/cumulus_uploads/document/x0msmggx08/YouGov%20-%20Brexit%20and%202019%20election.pdf", 
    x = NULL, y = NULL
  )

Add a source for our visualization

Let’s do better

ggplot(brexit, aes(y = opinion, fill = opinion)) +
  geom_bar() +
  facet_wrap(~region, nrow = 1) +
  guides(fill = "none") +
  labs(
    title = "Was Britain right/wrong to vote to leave EU?",
    subtitle = "YouGov Survey Results, 2-3 September 2019",
    caption = "Source: bit.ly/2lCJZVg", 
    x = NULL, y = NULL
  )

Shorten the URL using bit.ly

Fix up facet labels

ggplot(brexit, aes(y = opinion, fill = opinion)) +
  geom_bar() +
  facet_wrap(~region,
    nrow = 1,
    labeller = label_wrap_gen(width = 12) 
  ) + 
  guides(fill = "none") +
  labs(
    title = "Was Britain right/wrong to vote to leave EU?",
    subtitle = "YouGov Survey Results, 2-3 September 2019",
    caption = "Source: bit.ly/2lCJZVg",
    x = NULL, y = NULL
  )

Add the labeller argument to the facet_wrap function
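
To get a feel for what a wrap width of 12 does to the facet labels, the sketch below previews the line breaks with base R's strwrap(). This is a stand-in for illustration only: it assumes the wrapping inside label_wrap_gen() behaves the same way, and the label vector is typed in by hand from the recoded region names:

```r
# Region labels as recoded earlier, typed in by hand
region_labels <- c("London", "Rest of South", "Midlands / Wales",
                   "North", "Scotland")

# Wrap each label at width 12 and join the pieces with line breaks
wrapped <- vapply(
  region_labels,
  function(lbl) paste(strwrap(lbl, width = 12), collapse = "\n"),
  character(1)
)

# Print each wrapped label, separated by a rule
cat(wrapped, sep = "\n---\n")
```

Short labels such as London stay on one line, while the longer ones are split so that no line reaches the chosen width.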

Select meaningful colors

Rainbow colors not always the right choice

Manually choose colors when needed

ggplot(brexit, aes(y = opinion, fill = opinion)) +
  geom_bar() +
  facet_wrap(~region, nrow = 1, labeller = label_wrap_gen(width = 12)) +
  guides(fill = "none") +
  labs(title = "Was Britain right/wrong to vote to leave EU?",
       subtitle = "YouGov Survey Results, 2-3 September 2019",
       caption = "Source: bit.ly/2lCJZVg",
       x = NULL, y = NULL) +
  scale_fill_manual(values = c( 
    "Wrong" = "red", 
    "Right" = "green", 
    "Don't know" = "gray" 
  ))

Choosing better colors

colorbrewer2.org

Use better colors

ggplot(brexit, aes(y = opinion, fill = opinion)) +
  geom_bar() +
  facet_wrap(~region, nrow = 1, labeller = label_wrap_gen(width = 12)) +
  guides(fill = "none") +
  labs(title = "Was Britain right/wrong to vote to leave EU?",
       subtitle = "YouGov Survey Results, 2-3 September 2019",
       caption = "Source: bit.ly/2lCJZVg",
       x = NULL, y = NULL) +
  scale_fill_manual(values = c(
    "Wrong" = "#ef8a62", 
    "Right" = "#67a9cf", 
    "Don't know" = "gray" 
  ))

Select theme

ggplot complete themes guide

ggplot(brexit, aes(y = opinion, fill = opinion)) +
  geom_bar() +
  facet_wrap(~region, nrow = 1, labeller = label_wrap_gen(width = 12)) +
  guides(fill = "none") +
  labs(title = "Was Britain right/wrong to vote to leave EU?",
       subtitle = "YouGov Survey Results, 2-3 September 2019",
       caption = "Source: bit.ly/2lCJZVg",
       x = NULL, y = NULL) +
  scale_fill_manual(values = c("Wrong" = "#ef8a62",
                               "Right" = "#67a9cf",
                               "Don't know" = "gray")) +
  theme_minimal()
