Data Visualization - ggplot part 3

Intro to Data Analytics

Notes on where we are heading

More dplyr and ggplot basics. Learning outcomes:
- Know how to summarize data in tables and plots.
- Know how to use categorical and numerical data in plots.
Moving on to data wrangling: reshaping data (Ch 11) and table joins (Ch 12)
Then more visualization and statistical analysis

Why are we here?

Exploratory Data Analysis!

Exploratory Data Analysis

“A state of mind”

Generate questions about your data.
Search for answers by visualizing, transforming, and modelling your data.
Use what you learn to refine your questions and/or generate new questions.

Exploratory Data Analysis refers to the critical process of performing initial investigations on data to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.

Learn more: What is EDA?

Exploratory Data Analysis

What type of variation occurs within my variables?
What type of covariation occurs between my variables?

Exploratory Data Analysis

Summarizing data
- Summary statistics
- Univariate plots (geom_line, distribution plots)
- Bar plots
Understanding relationships between variables
- Scatter plots
Understanding and comparing distributions
- Side-by-side box plots or violin plots

Some review: data visualization types

Bar chart - distribution across categories (for categorical data, x-axis is categories)
Visualizing numerical distributions
- Histogram - distribution across bins or buckets (for numeric data, x-axis is numbers, y-axis is the count of observations per bin)
- Density plot - think of it as a smoothed histogram (still for numeric data, x-axis is numbers, but now the y-axis is density, scaled so the area under the curve is 1). Makes it easier to compare two distributions on a single plot, for example when a categorical variable splits your data into two distinct groups.
- Box plot
- Violin plot
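
Box plots and violin plots are listed above without an example. A minimal sketch, assuming the `mpg` data set that ships with ggplot2 (it is not part of these slides), draws both for one numeric variable split by a categorical one:

```r
library(ggplot2)

# Assumption: mpg is used only for illustration; it is not the course data.

# Box plot: median, quartiles, whiskers, and individually plotted outlying points
p_box <- ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_boxplot()

# Violin plot: a mirrored density estimate for each group
p_violin <- ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_violin()

p_box
p_violin
```

The box plot makes medians and quartiles easy to compare, while the violin plot shows the shape of each group's distribution.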

What are the geoms we have used so far?

Some review: ggplot

Basic components:

Data – what are data?
Geometry – can have more than one
Aesthetic mapping – connects data properties and graph features – what must go in there and what can go in there depends on the geometry being used.
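
The three components can be seen in the smallest possible plot. This is a sketch only, assuming the `mpg` data set that ships with ggplot2 rather than the course data:

```r
library(ggplot2)

# Assumption: mpg is used only for illustration.
p <- ggplot(
  data = mpg,                        # data: a data frame, one row per observation
  mapping = aes(x = displ, y = hwy)  # aesthetic mapping: columns -> graph features
) +
  geom_point()                       # geometry: how the observations are drawn

p
```

Here `geom_point()` needs both `x` and `y`; mappings such as `colour` or `size` are optional for this geometry.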

Some review: ggplot

You can have more than one geom in a single visualization
Add geom_vline() or geom_hline() to draw a reference line at a measure of central tendency (mean, median)
Add geom_smooth() to draw a smoothed trend line with a confidence band
Global aesthetic is done in the call to ggplot, and is then in effect for all layers.
Local aesthetics are done in specific layers.
Arguments that are specific to a function and are not part of aesthetics are also in the function call used for each layer.
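
The points above can be combined in one plot. The following is a minimal sketch, again assuming the `mpg` data set from ggplot2 rather than the course data; the reference line here is drawn with `geom_hline()`:

```r
library(ggplot2)

# Assumption: mpg is used only for illustration.
p <- ggplot(mpg, aes(x = displ, y = hwy)) +   # global aesthetics: apply to every layer
  geom_point(aes(colour = drv),               # local aesthetic: only this layer
             alpha = 0.6) +                   # fixed argument, not an aesthetic mapping
  geom_smooth() +                             # smoothed trend line with confidence band
  geom_hline(yintercept = median(mpg$hwy),    # horizontal reference line at the median
             linetype = "dashed")

p
```

Because `colour = drv` is mapped only inside `geom_point()`, the smoother still fits a single line through all the points.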

Some review: ggplot annotations

Scale functions – change default scale of axes
We’ve seen labs() – used for title, subtitle, caption, x-axis, y-axis, etc.
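
A short sketch of both kinds of annotation, assuming the `mpg` data set from ggplot2 (the breaks, limits, and label text are illustrative choices, not values from the course):

```r
library(ggplot2)

# Assumption: mpg and all label text below are for illustration only.
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  scale_x_continuous(breaks = seq(2, 7, by = 1)) +  # choose the tick positions on the x-axis
  scale_y_continuous(limits = c(0, 50)) +           # set the range of the y-axis
  labs(
    x = "Engine displacement (litres)",
    y = "Highway miles per gallon",
    title = "Fuel economy and engine size",
    caption = "Data: ggplot2::mpg"
  )

p
```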

Recommended practice

I highly recommend looking carefully at all examples in Chapter 8 and then working through all the exercises at the end of the chapter.

Data: Lending Club

Thousands of loans made through the Lending Club, a platform that allows individuals to lend to other individuals
Not all loans are created equal – ease of getting a loan depends on (apparent) ability to pay back the loan
Data include only loans that were made; these are not loan applications

A peek at data

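library(tidyverse) # provides glimpse(), %>%, select(), and ggplot()
library(openintro) # provides the loans_full_schema data
glimpse(loans_full_schema)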

Rows: 10,000
Columns: 55
$ emp_title                        <chr> "global config engineer ", "warehouse…
$ emp_length                       <dbl> 3, 10, 3, 1, 10, NA, 10, 10, 10, 3, 1…
$ state                            <fct> NJ, HI, WI, PA, CA, KY, MI, AZ, NV, I…
$ homeownership                    <fct> MORTGAGE, RENT, RENT, RENT, RENT, OWN…
$ annual_income                    <dbl> 90000, 40000, 40000, 30000, 35000, 34…
$ verified_income                  <fct> Verified, Not Verified, Source Verifi…
$ debt_to_income                   <dbl> 18.01, 5.04, 21.15, 10.16, 57.96, 6.4…
$ annual_income_joint              <dbl> NA, NA, NA, NA, 57000, NA, 155000, NA…
$ verification_income_joint        <fct> , , , , Verified, , Not Verified, , ,…
$ debt_to_income_joint             <dbl> NA, NA, NA, NA, 37.66, NA, 13.12, NA,…
$ delinq_2y                        <int> 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0…
$ months_since_last_delinq         <int> 38, NA, 28, NA, NA, 3, NA, 19, 18, NA…
$ earliest_credit_line             <dbl> 2001, 1996, 2006, 2007, 2008, 1990, 2…
$ inquiries_last_12m               <int> 6, 1, 4, 0, 7, 6, 1, 1, 3, 0, 4, 4, 8…
$ total_credit_lines               <int> 28, 30, 31, 4, 22, 32, 12, 30, 35, 9,…
$ open_credit_lines                <int> 10, 14, 10, 4, 16, 12, 10, 15, 21, 6,…
$ total_credit_limit               <int> 70795, 28800, 24193, 25400, 69839, 42…
$ total_credit_utilized            <int> 38767, 4321, 16000, 4997, 52722, 3898…
$ num_collections_last_12m         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ num_historical_failed_to_pay     <int> 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0…
$ months_since_90d_late            <int> 38, NA, 28, NA, NA, 60, NA, 71, 18, N…
$ current_accounts_delinq          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ total_collection_amount_ever     <int> 1250, 0, 432, 0, 0, 0, 0, 0, 0, 0, 0,…
$ current_installment_accounts     <int> 2, 0, 1, 1, 1, 0, 2, 2, 6, 1, 2, 1, 2…
$ accounts_opened_24m              <int> 5, 11, 13, 1, 6, 2, 1, 4, 10, 5, 6, 7…
$ months_since_last_credit_inquiry <int> 5, 8, 7, 15, 4, 5, 9, 7, 4, 17, 3, 4,…
$ num_satisfactory_accounts        <int> 10, 14, 10, 4, 16, 12, 10, 15, 21, 6,…
$ num_accounts_120d_past_due       <int> 0, 0, 0, 0, 0, 0, 0, NA, 0, 0, 0, 0, …
$ num_accounts_30d_past_due        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ num_active_debit_accounts        <int> 2, 3, 3, 2, 10, 1, 3, 5, 11, 3, 2, 2,…
$ total_debit_limit                <int> 11100, 16500, 4300, 19400, 32700, 272…
$ num_total_cc_accounts            <int> 14, 24, 14, 3, 20, 27, 8, 16, 19, 7, …
$ num_open_cc_accounts             <int> 8, 14, 8, 3, 15, 12, 7, 12, 14, 5, 8,…
$ num_cc_carrying_balance          <int> 6, 4, 6, 2, 13, 5, 6, 10, 14, 3, 5, 3…
$ num_mort_accounts                <int> 1, 0, 0, 0, 0, 3, 2, 7, 2, 0, 2, 3, 3…
$ account_never_delinq_percent     <dbl> 92.9, 100.0, 93.5, 100.0, 100.0, 78.1…
$ tax_liens                        <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ public_record_bankrupt           <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
$ loan_purpose                     <fct> moving, debt_consolidation, other, de…
$ application_type                 <fct> individual, individual, individual, i…
$ loan_amount                      <int> 28000, 5000, 2000, 21600, 23000, 5000…
$ term                             <dbl> 60, 36, 36, 36, 36, 36, 60, 60, 36, 3…
$ interest_rate                    <dbl> 14.07, 12.61, 17.09, 6.72, 14.07, 6.7…
$ installment                      <dbl> 652.53, 167.54, 71.40, 664.19, 786.87…
$ grade                            <ord> C, C, D, A, C, A, C, B, C, A, C, B, C…
$ sub_grade                        <fct> C3, C1, D1, A3, C3, A3, C2, B5, C2, A…
$ issue_month                      <fct> Mar-2018, Feb-2018, Feb-2018, Jan-201…
$ loan_status                      <fct> Current, Current, Current, Current, C…
$ initial_listing_status           <fct> whole, whole, fractional, whole, whol…
$ disbursement_method              <fct> Cash, Cash, Cash, Cash, Cash, Cash, C…
$ balance                          <dbl> 27015.86, 4651.37, 1824.63, 18853.26,…
$ paid_total                       <dbl> 1999.330, 499.120, 281.800, 3312.890,…
$ paid_principal                   <dbl> 984.14, 348.63, 175.37, 2746.74, 1569…
$ paid_interest                    <dbl> 1015.19, 150.49, 106.43, 566.15, 754.…
$ paid_late_fees                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

Selected variables

loans <- loans_full_schema %>%
  select(loan_amount, interest_rate, term, grade, 
         state, annual_income, homeownership, debt_to_income)
glimpse(loans)

Rows: 10,000
Columns: 8
$ loan_amount    <int> 28000, 5000, 2000, 21600, 23000, 5000, 24000, 20000, 20…
$ interest_rate  <dbl> 14.07, 12.61, 17.09, 6.72, 14.07, 6.72, 13.59, 11.99, 1…
$ term           <dbl> 60, 36, 36, 36, 36, 36, 60, 60, 36, 36, 60, 60, 36, 60,…
$ grade          <ord> C, C, D, A, C, A, C, B, C, A, C, B, C, B, D, D, D, F, E…
$ state          <fct> NJ, HI, WI, PA, CA, KY, MI, AZ, NV, IL, IL, FL, SC, CO,…
$ annual_income  <dbl> 90000, 40000, 40000, 30000, 35000, 34000, 35000, 110000…
$ homeownership  <fct> MORTGAGE, RENT, RENT, RENT, RENT, OWN, MORTGAGE, MORTGA…
$ debt_to_income <dbl> 18.01, 5.04, 21.15, 10.16, 57.96, 6.46, 23.66, 16.19, 3…

Selected variables

variable	description
`loan_amount`	Amount of loan, in US dollars
`interest_rate`	Loan interest rate, in annual percentage
`term`	Loan length, in whole number of months
`grade`	Loan grade, with values A through G, reflecting the likelihood of being repaid
`state`	US state where borrower resides
`annual_income`	Borrower’s annual income, including any second income (USD)
`homeownership`	Indicates whether the person owns, owns but has a mortgage, or rents their home
`debt_to_income`	Debt-to-income ratio

Variable types

variable	type
`loan_amount`	numerical, continuous
`interest_rate`	numerical, continuous
`term`	numerical, discrete
`grade`	categorical, ordinal
`state`	categorical, not ordinal
`annual_income`	numerical, continuous
`homeownership`	categorical, not ordinal
`debt_to_income`	numerical, continuous

Visualizing numerical data

Distribution plots

Sometimes the median and mean aren’t enough to understand a data set. Are most of the values clustered around the median? Or are they clustered around the minimum and the maximum with nothing in the middle? When you have questions like these, distribution plots are your friends.

The most basic statistical summary of a list of objects or numbers is its distribution.

Learn more: https://flowingdata.com/2012/05/15/how-to-visualize-and-compare-distributions/

Describing shapes of numerical distributions

shape:
- skewness:
  - right-skewed,
  - left-skewed,
  - symmetric
- modality (how many peaks or modes a data distribution has): unimodal, bimodal, multimodal, uniform
center: mean (`mean()`), median (`median()`), mode (not always useful)
spread: range (`range()`), standard deviation (`sd()`), inter-quartile range (`IQR()`)
unusual observations
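
The measures of center and spread listed above can be computed in one `summarise()` call. This is a sketch under the assumption that the openintro and dplyr packages are installed; the summary column names are made up for illustration:

```r
library(dplyr)
library(openintro)

# na.rm = TRUE is a precaution in case a column contains missing values
loan_summary <- loans_full_schema %>%
  summarise(
    mean_amount   = mean(loan_amount, na.rm = TRUE),
    median_amount = median(loan_amount, na.rm = TRUE),
    sd_amount     = sd(loan_amount, na.rm = TRUE),
    iqr_amount    = IQR(loan_amount, na.rm = TRUE),
    min_amount    = min(loan_amount, na.rm = TRUE),
    max_amount    = max(loan_amount, na.rm = TRUE)
  )

loan_summary
```

A mean noticeably above the median would hint at right skew, which is worth confirming with a distribution plot.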

Histogram

Calling `geom_histogram()` without `bins` or `binwidth` prints the message: `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
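
The call behind that message is not shown on the slide. A minimal sketch that likely corresponds to it, assuming the loan data from openintro and an illustrative bin width of 1000 dollars:

```r
library(ggplot2)
library(openintro)

# Stand-in for the `loans` subset selected earlier, so this block runs on its own
loans <- loans_full_schema

# No bins or binwidth given: ggplot2 falls back to 30 bins and prints the message
p_default <- ggplot(loans, aes(x = loan_amount)) +
  geom_histogram()

# binwidth is given in the units of the variable (here dollars); no message is printed
p_binwidth <- ggplot(loans, aes(x = loan_amount)) +
  geom_histogram(binwidth = 1000)

p_default
p_binwidth
```

Trying a few bin widths is worthwhile, since very wide bins can hide features of the distribution and very narrow bins can add noise.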

Customizing histograms

ggplot(loans, aes(x = loan_amount)) +
  geom_histogram(binwidth = 5000) +
  labs( 
    x = "Loan amount ($)", 
    y = "Frequency", 
    title = "Amounts of Lending Club loans" 
  )

Fill with a categorical variable

ggplot(loans, aes(x = loan_amount, 
                  fill = homeownership)) + 
  geom_histogram(binwidth = 5000,
                 alpha = 0.5) + 
  labs(
    x = "Loan amount ($)",
    y = "Frequency",
    title = "Amounts of Lending Club loans"
  )

Facet with a categorical variable

ggplot(loans, aes(x = loan_amount, fill = homeownership)) + 
  geom_histogram(binwidth = 5000) +
  labs(
    x = "Loan amount ($)",
    y = "Frequency",
    title = "Amounts of Lending Club loans"
  ) +
  facet_wrap(~ homeownership, nrow = 3)

Density plot

ggplot(loans, aes(x = loan_amount)) +
  geom_density()

Works better for continuous data

Density plots and adjusting bandwidth

ggplot(loans, aes(x = loan_amount)) +
  geom_density(adjust = 0.5)

ggplot(loans, aes(x = loan_amount)) +
  geom_density(adjust = 1) # default bandwidth

ggplot(loans, aes(x = loan_amount)) +
  geom_density(adjust = 2)

Customizing density plots

ggplot(loans, aes(x = loan_amount)) +
  geom_density(adjust = 2) +
  labs( 
    x = "Loan amount ($)", 
    y = "Density", 
    title = "Amounts of Lending Club loans" 
  )

Adding a categorical variable

ggplot(loans, aes(x = loan_amount, 
                  fill = homeownership)) + 
  geom_density(adjust = 2, 
               alpha = 0.5) + 
  labs(
    x = "Loan amount ($)",
    y = "Density",
    title = "Amounts of Lending Club loans", 
    fill = "Homeownership" 
  )

Box plot

ggplot(loans, aes(x = interest_rate)) +
  geom_boxplot()

Box plot - flip orientation

ggplot(loans, aes(y = interest_rate)) +
  geom_boxplot()

Box plot and outliers

ggplot(loans, aes(x = annual_income)) +
  geom_boxplot()
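
`geom_boxplot()` plots individually any observation that lies more than 1.5 times the IQR beyond the first or third quartile. As a rough sketch (the summary column names are made up for illustration), the same rule can be applied by hand to count those points for `annual_income`:

```r
library(dplyr)
library(openintro)

income_outliers <- loans_full_schema %>%
  summarise(
    q1          = quantile(annual_income, 0.25, na.rm = TRUE, names = FALSE),
    q3          = quantile(annual_income, 0.75, na.rm = TRUE, names = FALSE),
    lower_fence = q1 - 1.5 * (q3 - q1),  # whiskers do not extend below this value
    upper_fence = q3 + 1.5 * (q3 - q1),  # whiskers do not extend above this value
    n_beyond    = sum(annual_income < lower_fence | annual_income > upper_fence,
                      na.rm = TRUE)      # observations drawn as separate points
  )

income_outliers
```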

Customizing box plots

ggplot(loans, aes(x = interest_rate)) +
  geom_boxplot() +
  labs(
    x = "Interest rate (%)",
    y = NULL,
    title = "Interest rates of Lending Club loans"
  ) +
  theme( 
    axis.ticks.y = element_blank(), 
    axis.text.y = element_blank() 
  )

Side-by-side box plots

Interest rate by housing tenure (own, own with mortgage, rent)

ggplot(loans, aes(x = interest_rate, 
                  y = homeownership)) +
  geom_boxplot() +
  labs(
    x = "Interest rate (%)",
    y = "Housing tenure",
    title = "Interest rates of Lending Club loans"
  )

Adding a categorical variable

ggplot(loans, aes(x = interest_rate,
                  y = grade)) + 
  geom_boxplot() +
  labs(
    x = "Interest rate (%)",
    y = "Grade",
    title = "Interest rates of Lending Club loans",
    subtitle = "by grade of loan" 
  )

Relationships between numerical variables

Scatterplot

ggplot(loans, aes(x = debt_to_income, y = interest_rate)) +
  geom_point()