Introduction to Tidyverse

Intro to Data Analytics

Jordan Ayala

A reminder

  • The goal of homework and other exercises you complete outside of class is to encourage you to use the documentation and try things out on your own; the goal is not for you to end up frustrated beyond belief.

  • If you make a serious attempt and feel really stuck, ask for help (in office hours or via email).

  • Much of R programming involves exploring techniques and figuring out how to use them, so it’s a good idea to try things you don’t yet fully understand, but there’s no benefit in getting completely frustrated.

Getting Started

  1. Create a new Quarto doc in your course RStudio project.
  2. Load the tidyverse package.

Where we are heading

We will be using the ggplot2 package to visualize data as part of the exploratory data analysis workflow.

  • ggplot expects your data to be organized in a “tidy” layout (Wickham 2014).

A quick example

Load data (download the hotels dataset for practice later)

library(tidyverse)
hotels <- read_csv("data/hotels.csv")

  • What are some ways that we can learn more about the hotels data?

Data: Hotel bookings

  • Data from two hotels: one resort and one city hotel
  • Observations: Each row represents a hotel booking
  • Goal of the original data collection: development of prediction models to classify the likelihood that a hotel booking would be cancelled (Antonio et al., 2019)

First look: Variables

names(hotels)
 [1] "hotel"                          "is_canceled"                   
 [3] "lead_time"                      "arrival_date_year"             
 [5] "arrival_date_month"             "arrival_date_week_number"      
 [7] "arrival_date_day_of_month"      "stays_in_weekend_nights"       
 [9] "stays_in_week_nights"           "adults"                        
[11] "children"                       "babies"                        
[13] "meal"                           "country"                       
[15] "market_segment"                 "distribution_channel"          
[17] "is_repeated_guest"              "previous_cancellations"        
[19] "previous_bookings_not_canceled" "reserved_room_type"            
[21] "assigned_room_type"             "booking_changes"               
[23] "deposit_type"                   "agent"                         
[25] "company"                        "days_in_waiting_list"          
[27] "customer_type"                  "adr"                           
[29] "required_car_parking_spaces"    "total_of_special_requests"     
[31] "reservation_status"             "reservation_status_date"       

Second look: Overview

glimpse(hotels)
Rows: 119,390
Columns: 32
$ hotel                          <chr> "Resort Hotel", "Resort Hotel", "Resort…
$ is_canceled                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, …
$ lead_time                      <dbl> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75, …
$ arrival_date_year              <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 201…
$ arrival_date_month             <chr> "July", "July", "July", "July", "July",…
$ arrival_date_week_number       <dbl> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,…
$ arrival_date_day_of_month      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ stays_in_weekend_nights        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ stays_in_week_nights           <dbl> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, …
$ adults                         <dbl> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ children                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ babies                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ meal                           <chr> "BB", "BB", "BB", "BB", "BB", "BB", "BB…
$ country                        <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "GBR…
$ market_segment                 <chr> "Direct", "Direct", "Direct", "Corporat…
$ distribution_channel           <chr> "Direct", "Direct", "Direct", "Corporat…
$ is_repeated_guest              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ previous_cancellations         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ reserved_room_type             <chr> "C", "C", "A", "A", "A", "A", "C", "C",…
$ assigned_room_type             <chr> "C", "C", "C", "A", "A", "A", "C", "C",…
$ booking_changes                <dbl> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ deposit_type                   <chr> "No Deposit", "No Deposit", "No Deposit…
$ agent                          <chr> "NULL", "NULL", "NULL", "304", "240", "…
$ company                        <chr> "NULL", "NULL", "NULL", "NULL", "NULL",…
$ days_in_waiting_list           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ customer_type                  <chr> "Transient", "Transient", "Transient", …
$ adr                            <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00,…
$ required_car_parking_spaces    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ total_of_special_requests      <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3, …
$ reservation_status             <chr> "Check-Out", "Check-Out", "Check-Out", …
$ reservation_status_date        <date> 2015-07-01, 2015-07-01, 2015-07-02, 20…

Now let’s create some basic plots (charts).

Basic ggplot charts

Hotel type, city or resort

ggplot(data = hotels, aes(x = hotel, fill = hotel)) +
  geom_bar() + 
  labs(title = "Distribution by hotel type")

Basic ggplot charts

Average daily rate for a room by hotel type

ggplot(hotels, aes(x = adr, color = hotel, fill = hotel)) +
  geom_density() +
  xlim(0,200)

Tidy data

Characteristics of tidy data:

  • One observation per row – each observation forms a row.
  • One variable per column – each variable forms a column.
  • One type of observation per data set – dataset has same underlying observational unit.
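To make these characteristics concrete, here is a minimal sketch that reshapes a small wide table into a tidy one. The `untidy` table and its booking counts are made up for illustration and are not taken from the hotels data; `pivot_longer()` comes from the tidyr package, which is part of the tidyverse.

```r
library(tibble)
library(tidyr)

# Hypothetical wide table: the year columns are values of a variable,
# not variables themselves, so each row holds two observations.
untidy <- tribble(
  ~hotel,         ~`2015`, ~`2016`,
  "City Hotel",   10,      12,
  "Resort Hotel", 7,       9
)

# Move the year columns into a single `year` variable so that
# each row is one hotel-year observation.
tidy <- untidy |>
  pivot_longer(
    cols = c(`2015`, `2016`),
    names_to = "year",
    values_to = "bookings"
  )

tidy
```

After reshaping, every column is a variable (hotel, year, bookings) and every row is one observation.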

https://r4ds.hadley.nz/data-tidy#sec-tidy-data

We will use tidyverse for data cleaning, wrangling, and visualization.

Tidy data examples

The hotels dataset is tidy.

  • What is the underlying observational unit? In other words, what does each row represent?

Is the us_metro_areas data frame tidy? Why/why not?

Tidyverse packages

https://www.tidyverse.org/packages/

What makes this data not tidy?

Activity

Load the data

us_metro_areas.csv
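Loading the file would probably mirror the hotels example. The sketch below writes a tiny stand-in CSV to a temporary file so that it runs anywhere; the rows are made up, the `metro_area` column is hypothetical, and the real file's location in your project (for example `data/us_metro_areas.csv`) is an assumption.

```r
library(readr)

# Hypothetical stand-in for us_metro_areas.csv with made-up rows.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("metro_area,total_pop,area sq miles",
    "Metro A,500000,250",
    "Metro B,1200000,400"),
  csv_path
)

# In the course project this would likely be
# read_csv("data/us_metro_areas.csv") (path is an assumption).
demo_metro_areas <- read_csv(csv_path, show_col_types = FALSE)

# read_csv() keeps column names as written, including spaces.
names(demo_metro_areas)
```

Because `read_csv()` does not rewrite column names, a name containing spaces such as `area sq miles` has to be wrapped in backticks when it is used in code.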

Summarize a column (vector)

What is the average value of pct_children across metro areas?

us_metro_areas |> 
  summarize(average_pct_children = mean(pct_children))

Create a new variable

Create a column in the data frame to store the population density of each metro area.

us_metro_areas <- us_metro_areas |> 
  mutate(pop_density = total_pop/`area sq miles`)

Arrange (sort) a data frame

us_metro_areas <- us_metro_areas |> 
  arrange(pop_density)
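By default, `arrange()` sorts in ascending order. A minimal sketch of sorting from most to least dense with `desc()`, using a made-up `demo` data frame rather than the real metro data:

```r
library(dplyr)
library(tibble)

# Hypothetical values for illustration only.
demo <- tibble(
  metro_area  = c("Metro A", "Metro B", "Metro C"),
  pop_density = c(2000, 3000, 500)
)

# desc() reverses the sort order for that column.
demo_sorted <- demo |> 
  arrange(desc(pop_density))

demo_sorted
```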

Filter a data frame

Filter rows (observations) of a data frame based on column values

# Subset to only those observations where the population density is greater than or equal to 1,000
us_metro_areas |> 
  filter(pop_density >= 1000)

How would we accomplish this task in base R?
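One possible answer, sketched on a made-up `demo` data frame (the values are hypothetical): base R can subset rows with logical indexing inside square brackets, or with `subset()`.

```r
# Hypothetical values for illustration only.
demo <- data.frame(
  metro_area  = c("Metro A", "Metro B", "Metro C"),
  pop_density = c(2000, 999, 1000)
)

# Logical indexing: keep rows where the condition is TRUE.
# The trailing comma means "keep all columns".
dense <- demo[demo$pop_density >= 1000, ]

# subset() evaluates the condition inside the data frame,
# so the column can be named without the `demo$` prefix.
dense_alt <- subset(demo, pop_density >= 1000)

dense
```

One difference to keep in mind: if `pop_density` contains missing values, bracket indexing returns rows filled with `NA` for them, while `filter()` and `subset()` drop those rows.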