Tidyverse

Intro to Data Analytics

Jordan Ayala

Tidy data

Tidy data

Characteristics of tidy data:

  • Each variable forms a column.
  • Each observation forms a row.
  • Each type of observational unit forms a table.

Tidying data

  • We have a data set from real life.
    • It’s a mess, or at least not well organized.
  • We need to reorganize it.

Tidying data

  • Numerous actions we can take to tidy data
  • At the same time we can also make the data more useful for analysis
  • Some actions might be
    • Pivoting
      • pivot( ) function is very versatile

      • adding more columns / adding more rows

        …we’ll get to this later

    • Wrangling
      • use dplyr library from tidyverse

Grammar of Data Wrangling

A grammar of data wrangling…

… based on the concepts of functions as verbs that manipulate data frames

Functions that manipulate existing data:

  • select: pick columns by name
  • arrange: reorder rows
  • slice: pick rows using index(es)
  • filter: pick rows matching criteria
  • distinct: filter for unique rows

A grammar of data wrangling…

… based on the concepts of functions as verbs that manipulate data frames

Functions that generate new data:

  • mutate: add new variables
  • summarize: reduce variables to values
  • group_by: for grouped operations
  • count: to create frequency tables

Rules of dplyr functions

  • First argument: always a data frame
  • Second argument: what to do with the data
  • Always return a data frame
  • Don’t modify in place, without creating a separate copy of the data structure

dplyr functions expect a dataframe as input, produce a dataframe as output

Data: Hotel bookings1

  • Data from two hotels: one resort and one city hotel
  • Observations: Each row represents a hotel booking
  • Goal for original data collection: Development of prediction models to classify likelihood that a hotel’s booking’s would be cancelled (Antonia et al., 2019)

First look: Variables

names(hotels)
 [1] "hotel"                          "is_canceled"                   
 [3] "lead_time"                      "arrival_date_year"             
 [5] "arrival_date_month"             "arrival_date_week_number"      
 [7] "arrival_date_day_of_month"      "stays_in_weekend_nights"       
 [9] "stays_in_week_nights"           "adults"                        
[11] "children"                       "babies"                        
[13] "meal"                           "country"                       
[15] "market_segment"                 "distribution_channel"          
[17] "is_repeated_guest"              "previous_cancellations"        
[19] "previous_bookings_not_canceled" "reserved_room_type"            
[21] "assigned_room_type"             "booking_changes"               
[23] "deposit_type"                   "agent"                         
[25] "company"                        "days_in_waiting_list"          
[27] "customer_type"                  "adr"                           
[29] "required_car_parking_spaces"    "total_of_special_requests"     
[31] "reservation_status"             "reservation_status_date"       

Second look: Overview

glimpse(hotels)
Rows: 119,390
Columns: 32
$ hotel                          <chr> "Resort Hotel", "Resort Hotel", "Resort…
$ is_canceled                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, …
$ lead_time                      <dbl> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75, …
$ arrival_date_year              <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 201…
$ arrival_date_month             <chr> "July", "July", "July", "July", "July",…
$ arrival_date_week_number       <dbl> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,…
$ arrival_date_day_of_month      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ stays_in_weekend_nights        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ stays_in_week_nights           <dbl> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, …
$ adults                         <dbl> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ children                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ babies                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ meal                           <chr> "BB", "BB", "BB", "BB", "BB", "BB", "BB…
$ country                        <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "GBR…
$ market_segment                 <chr> "Direct", "Direct", "Direct", "Corporat…
$ distribution_channel           <chr> "Direct", "Direct", "Direct", "Corporat…
$ is_repeated_guest              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ previous_cancellations         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ reserved_room_type             <chr> "C", "C", "A", "A", "A", "A", "C", "C",…
$ assigned_room_type             <chr> "C", "C", "C", "A", "A", "A", "C", "C",…
$ booking_changes                <dbl> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ deposit_type                   <chr> "No Deposit", "No Deposit", "No Deposit…
$ agent                          <chr> "NULL", "NULL", "NULL", "304", "240", "…
$ company                        <chr> "NULL", "NULL", "NULL", "NULL", "NULL",…
$ days_in_waiting_list           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ customer_type                  <chr> "Transient", "Transient", "Transient", …
$ adr                            <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00,…
$ required_car_parking_spaces    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ total_of_special_requests      <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3, …
$ reservation_status             <chr> "Check-Out", "Check-Out", "Check-Out", …
$ reservation_status_date        <date> 2015-07-01, 2015-07-01, 2015-07-02, 20…

Changing data frames

Adding a new variable (column)

mutate(hotels, little_ones = children + babies)
names(hotels)
 [1] "hotel"                          "is_canceled"                   
 [3] "lead_time"                      "arrival_date_year"             
 [5] "arrival_date_month"             "arrival_date_week_number"      
 [7] "arrival_date_day_of_month"      "stays_in_weekend_nights"       
 [9] "stays_in_week_nights"           "adults"                        
[11] "children"                       "babies"                        
[13] "meal"                           "country"                       
[15] "market_segment"                 "distribution_channel"          
[17] "is_repeated_guest"              "previous_cancellations"        
[19] "previous_bookings_not_canceled" "reserved_room_type"            
[21] "assigned_room_type"             "booking_changes"               
[23] "deposit_type"                   "agent"                         
[25] "company"                        "days_in_waiting_list"          
[27] "customer_type"                  "adr"                           
[29] "required_car_parking_spaces"    "total_of_special_requests"     
[31] "reservation_status"             "reservation_status_date"       

Adding a new variable (column) - 2nd attempt

hotels <- mutate(hotels, little_ones = children + babies)
names(hotels)
 [1] "hotel"                          "is_canceled"                   
 [3] "lead_time"                      "arrival_date_year"             
 [5] "arrival_date_month"             "arrival_date_week_number"      
 [7] "arrival_date_day_of_month"      "stays_in_weekend_nights"       
 [9] "stays_in_week_nights"           "adults"                        
[11] "children"                       "babies"                        
[13] "meal"                           "country"                       
[15] "market_segment"                 "distribution_channel"          
[17] "is_repeated_guest"              "previous_cancellations"        
[19] "previous_bookings_not_canceled" "reserved_room_type"            
[21] "assigned_room_type"             "booking_changes"               
[23] "deposit_type"                   "agent"                         
[25] "company"                        "days_in_waiting_list"          
[27] "customer_type"                  "adr"                           
[29] "required_car_parking_spaces"    "total_of_special_requests"     
[31] "reservation_status"             "reservation_status_date"       
[33] "little_ones"                   

Pipes

What is a pipe?

In programming, a pipe is a technique for passing information from one process to another.

In R, written %>% or |>

Adding a new variable (column) - 3rd attempt

hotels <- hotels %>% mutate(little_ones = children + babies)

This becomes useful when we want to chain together several operations.

filter

filter to select a subset of rows

# bookings in City Hotels
hotels %>%
  filter(hotel == "City Hotel") 
# A tibble: 79,330 × 33
   hotel      is_canceled lead_time arrival_date_year arrival_date_month
   <chr>            <dbl>     <dbl>             <dbl> <chr>             
 1 City Hotel           0         6              2015 July              
 2 City Hotel           1        88              2015 July              
 3 City Hotel           1        65              2015 July              
 4 City Hotel           1        92              2015 July              
 5 City Hotel           1       100              2015 July              
 6 City Hotel           1        79              2015 July              
 7 City Hotel           0         3              2015 July              
 8 City Hotel           1        63              2015 July              
 9 City Hotel           1        62              2015 July              
10 City Hotel           1        62              2015 July              
# ℹ 79,320 more rows
# ℹ 28 more variables: arrival_date_week_number <dbl>,
#   arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
#   stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
#   meal <chr>, country <chr>, market_segment <chr>,
#   distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>, …

filter for many conditions at once

hotels %>%
  filter( 
    adults == 0,     
    children >= 1    
    ) %>% 
  select(adults, babies, children)
# A tibble: 223 × 3
   adults babies children
    <dbl>  <dbl>    <dbl>
 1      0      0        3
 2      0      0        2
 3      0      0        2
 4      0      0        2
 5      0      0        2
 6      0      0        3
 7      0      1        2
 8      0      0        2
 9      0      0        2
10      0      0        2
# ℹ 213 more rows

select to select variables (columns)

Select a single column

View only lead_time (number of days between booking and arrival date):

select(hotels, lead_time)
# A tibble: 119,390 × 1
   lead_time
       <dbl>
 1       342
 2       737
 3         7
 4        13
 5        14
 6        14
 7         0
 8         9
 9        85
10        75
# ℹ 119,380 more rows

Select a single column

select(
  hotels, 
  lead_time
  )
  • Start with the function (a verb): select()

Select a single column

select(
  hotels,    
  lead_time   
  )
  • Start with the function (a verb): select()
  • First argument: data frame we’re working with , hotels

Select a single column

select( 
  hotels, 
  lead_time 
  )
  • Start with the function (a verb): select()
  • First argument: data frame we’re working with , hotels
  • Second argument: variable we want to select, lead_time

Select a single column

# A tibble: 119,390 × 1
   lead_time
       <dbl>
 1       342
 2       737
 3         7
 4        13
 5        14
 6        14
 7         0
 8         9
 9        85
10        75
# ℹ 119,380 more rows

  • Start with the function (a verb): select()
  • First argument: data frame we’re working with , hotels
  • Second argument: variable we want to select, lead_time
  • Result: data frame with 119390 rows and 1 column

Tip

dplyr functions always expect a data frame and always yield a data frame.

select(hotels, lead_time)
# A tibble: 119,390 × 1
   lead_time
       <dbl>
 1       342
 2       737
 3         7
 4        13
 5        14
 6        14
 7         0
 8         9
 9        85
10        75
# ℹ 119,380 more rows

select to keep variables (columns)

hotels %>%
  select(hotel, lead_time) 
# A tibble: 119,390 × 2
   hotel        lead_time
   <chr>            <dbl>
 1 Resort Hotel       342
 2 Resort Hotel       737
 3 Resort Hotel         7
 4 Resort Hotel        13
 5 Resort Hotel        14
 6 Resort Hotel        14
 7 Resort Hotel         0
 8 Resort Hotel         9
 9 Resort Hotel        85
10 Resort Hotel        75
# ℹ 119,380 more rows

select to exclude variables

hotels %>%
  select(!agent) 
# A tibble: 119,390 × 32
   hotel        is_canceled lead_time arrival_date_year arrival_date_month
   <chr>              <dbl>     <dbl>             <dbl> <chr>             
 1 Resort Hotel           0       342              2015 July              
 2 Resort Hotel           0       737              2015 July              
 3 Resort Hotel           0         7              2015 July              
 4 Resort Hotel           0        13              2015 July              
 5 Resort Hotel           0        14              2015 July              
 6 Resort Hotel           0        14              2015 July              
 7 Resort Hotel           0         0              2015 July              
 8 Resort Hotel           0         9              2015 July              
 9 Resort Hotel           1        85              2015 July              
10 Resort Hotel           1        75              2015 July              
# ℹ 119,380 more rows
# ℹ 27 more variables: arrival_date_week_number <dbl>,
#   arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
#   stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
#   meal <chr>, country <chr>, market_segment <chr>,
#   distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>, …

select a range of variables

hotels %>%
  select(hotel:arrival_date_month) 
# A tibble: 119,390 × 5
   hotel        is_canceled lead_time arrival_date_year arrival_date_month
   <chr>              <dbl>     <dbl>             <dbl> <chr>             
 1 Resort Hotel           0       342              2015 July              
 2 Resort Hotel           0       737              2015 July              
 3 Resort Hotel           0         7              2015 July              
 4 Resort Hotel           0        13              2015 July              
 5 Resort Hotel           0        14              2015 July              
 6 Resort Hotel           0        14              2015 July              
 7 Resort Hotel           0         0              2015 July              
 8 Resort Hotel           0         9              2015 July              
 9 Resort Hotel           1        85              2015 July              
10 Resort Hotel           1        75              2015 July              
# ℹ 119,380 more rows

select variables with certain characteristics

hotels %>%
  select(starts_with("arrival")) 
# A tibble: 119,390 × 4
   arrival_date_year arrival_date_month arrival_date_week_number
               <dbl> <chr>                                 <dbl>
 1              2015 July                                     27
 2              2015 July                                     27
 3              2015 July                                     27
 4              2015 July                                     27
 5              2015 July                                     27
 6              2015 July                                     27
 7              2015 July                                     27
 8              2015 July                                     27
 9              2015 July                                     27
10              2015 July                                     27
# ℹ 119,380 more rows
# ℹ 1 more variable: arrival_date_day_of_month <dbl>

select variables with certain characteristics

hotels %>%
  select(ends_with("type")) 
# A tibble: 119,390 × 4
   reserved_room_type assigned_room_type deposit_type customer_type
   <chr>              <chr>              <chr>        <chr>        
 1 C                  C                  No Deposit   Transient    
 2 C                  C                  No Deposit   Transient    
 3 A                  C                  No Deposit   Transient    
 4 A                  A                  No Deposit   Transient    
 5 A                  A                  No Deposit   Transient    
 6 A                  A                  No Deposit   Transient    
 7 C                  C                  No Deposit   Transient    
 8 C                  C                  No Deposit   Transient    
 9 A                  A                  No Deposit   Transient    
10 D                  D                  No Deposit   Transient    
# ℹ 119,380 more rows

Select helpers

  • starts_with(): Starts with a prefix
  • ends_with(): Ends with a suffix
  • contains(): Contains a literal string
  • num_range(): Matches a numerical range like x01, x02, x03
  • one_of(): Matches variable names in a character vector
  • everything(): Matches all variables
  • last_col(): Select last variable, possibly with an offset
  • matches(): Matches a regular expression (a sequence of symbols/characters expressing a string/pattern to be searched for within text)

See help for any of these functions for more info, e.g. ?everything.

Select multiple columns

View only the hotel type and lead_time:

select(hotels, hotel, lead_time)
# A tibble: 119,390 × 2
   hotel        lead_time
   <chr>            <dbl>
 1 Resort Hotel       342
 2 Resort Hotel       737
 3 Resort Hotel         7
 4 Resort Hotel        13
 5 Resort Hotel        14
 6 Resort Hotel        14
 7 Resort Hotel         0
 8 Resort Hotel         9
 9 Resort Hotel        85
10 Resort Hotel        75
# ℹ 119,380 more rows

What if we wanted to select these columns, and then arrange the data in descending order of lead time?

Data wrangling, step-by-step

Select:

hotels %>%
  select(hotel, lead_time)
# A tibble: 119,390 × 2
   hotel        lead_time
   <chr>            <dbl>
 1 Resort Hotel       342
 2 Resort Hotel       737
 3 Resort Hotel         7
 4 Resort Hotel        13
 5 Resort Hotel        14
 6 Resort Hotel        14
 7 Resort Hotel         0
 8 Resort Hotel         9
 9 Resort Hotel        85
10 Resort Hotel        75
# ℹ 119,380 more rows

Select, then arrange:

hotels %>%
  select(hotel, lead_time) %>%
  arrange(desc(lead_time))
# A tibble: 119,390 × 2
   hotel        lead_time
   <chr>            <dbl>
 1 Resort Hotel       737
 2 Resort Hotel       709
 3 City Hotel         629
 4 City Hotel         629
 5 City Hotel         629
 6 City Hotel         629
 7 City Hotel         629
 8 City Hotel         629
 9 City Hotel         629
10 City Hotel         629
# ℹ 119,380 more rows

Pipes

What is a pipe?

In programming, a pipe is a technique for passing information from one process to another.

  • Start with the data frame hotels, and pass it to the select() function,
hotels %>% 
  select(hotel, lead_time) %>%
  arrange(desc(lead_time))
# A tibble: 119,390 × 2
   hotel        lead_time
   <chr>            <dbl>
 1 Resort Hotel       737
 2 Resort Hotel       709
 3 City Hotel         629
 4 City Hotel         629
 5 City Hotel         629
 6 City Hotel         629
 7 City Hotel         629
 8 City Hotel         629
 9 City Hotel         629
10 City Hotel         629
# ℹ 119,380 more rows

What is a pipe?

In programming, a pipe is a technique for passing information from one process to another.

  • Start with the data frame hotels, and pass it to the select() function,
  • then we select the variables hotel and lead_time,
hotels %>%
  select(hotel, lead_time) %>% 
  arrange(desc(lead_time))
# A tibble: 119,390 × 2
   hotel        lead_time
   <chr>            <dbl>
 1 Resort Hotel       737
 2 Resort Hotel       709
 3 City Hotel         629
 4 City Hotel         629
 5 City Hotel         629
 6 City Hotel         629
 7 City Hotel         629
 8 City Hotel         629
 9 City Hotel         629
10 City Hotel         629
# ℹ 119,380 more rows

What is a pipe?

In programming, a pipe is a technique for passing information from one process to another.

  • Start with the data frame hotels, and pass it to the select() function,
  • then we select the variables hotel and lead_time,
  • and then we arrange the data frame by lead_time in descending order.
hotels %>%
  select(hotel, lead_time) %>% 
  arrange(desc(lead_time)) 
# A tibble: 119,390 × 2
   hotel        lead_time
   <chr>            <dbl>
 1 Resort Hotel       737
 2 Resort Hotel       709
 3 City Hotel         629
 4 City Hotel         629
 5 City Hotel         629
 6 City Hotel         629
 7 City Hotel         629
 8 City Hotel         629
 9 City Hotel         629
10 City Hotel         629
# ℹ 119,380 more rows

arrange in ascending / descending order

hotels %>%
  select(adults, children, babies) %>%
  arrange(babies)
# A tibble: 119,390 × 3
   adults children babies
    <dbl>    <dbl>  <dbl>
 1      2        0      0
 2      2        0      0
 3      1        0      0
 4      1        0      0
 5      2        0      0
 6      2        0      0
 7      2        0      0
 8      2        0      0
 9      2        0      0
10      2        0      0
# ℹ 119,380 more rows
hotels %>%
  select(adults, children, babies) %>%
  arrange(desc(babies)) 
# A tibble: 119,390 × 3
   adults children babies
    <dbl>    <dbl>  <dbl>
 1      2        0     10
 2      1        0      9
 3      2        0      2
 4      2        0      2
 5      2        0      2
 6      2        0      2
 7      2        0      2
 8      2        0      2
 9      2        0      2
10      2        0      2
# ℹ 119,380 more rows

slice for certain row numbers

# first five
hotels %>%
  slice(1:5) 
# A tibble: 5 × 33
  hotel        is_canceled lead_time arrival_date_year arrival_date_month
  <chr>              <dbl>     <dbl>             <dbl> <chr>             
1 Resort Hotel           0       342              2015 July              
2 Resort Hotel           0       737              2015 July              
3 Resort Hotel           0         7              2015 July              
4 Resort Hotel           0        13              2015 July              
5 Resort Hotel           0        14              2015 July              
# ℹ 28 more variables: arrival_date_week_number <dbl>,
#   arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
#   stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
#   meal <chr>, country <chr>, market_segment <chr>,
#   distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
#   reserved_room_type <chr>, assigned_room_type <chr>, …

Tip

In R, you can use the # for adding comments to your code. Any text following # will be printed as is, and won’t be run as R code. This is useful for leaving comments in your code and for temporarily disabling certain lines of code while debugging.

hotels %>%
  # slice the first five rows  # this line is a comment
  #select(hotel) %>%           # this one doesn't run
  slice(1:5)                   # this line runs
# A tibble: 5 × 33
  hotel        is_canceled lead_time arrival_date_year arrival_date_month
  <chr>              <dbl>     <dbl>             <dbl> <chr>             
1 Resort Hotel           0       342              2015 July              
2 Resort Hotel           0       737              2015 July              
3 Resort Hotel           0         7              2015 July              
4 Resort Hotel           0        13              2015 July              
5 Resort Hotel           0        14              2015 July              
# ℹ 28 more variables: arrival_date_week_number <dbl>,
#   arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
#   stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
#   meal <chr>, country <chr>, market_segment <chr>,
#   distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
#   reserved_room_type <chr>, assigned_room_type <chr>, …

Logical operators in R


operator definition operator definition
< less than x | y x OR y
<= less than or equal to is.na(x) test if x is NA
> greater than !is.na(x) test if x is not NA
>= greater than or equal to x %in% y test if x is in y
== exactly equal to !(x %in% y) test if x is not in y
!= not equal to !x not x
x & y x AND y

distinct and count

distinct to filter for unique rows

And arrange to order alphabetically

hotels %>% 
  distinct(market_segment) %>% 
  arrange(market_segment)
# A tibble: 8 × 1
  market_segment
  <chr>         
1 Aviation      
2 Complementary 
3 Corporate     
4 Direct        
5 Groups        
6 Offline TA/TO 
7 Online TA     
8 Undefined     
hotels %>% 
  distinct(hotel, market_segment) %>% 
  arrange(hotel, market_segment)
# A tibble: 14 × 2
   hotel        market_segment
   <chr>        <chr>         
 1 City Hotel   Aviation      
 2 City Hotel   Complementary 
 3 City Hotel   Corporate     
 4 City Hotel   Direct        
 5 City Hotel   Groups        
 6 City Hotel   Offline TA/TO 
 7 City Hotel   Online TA     
 8 City Hotel   Undefined     
 9 Resort Hotel Complementary 
10 Resort Hotel Corporate     
11 Resort Hotel Direct        
12 Resort Hotel Groups        
13 Resort Hotel Offline TA/TO 
14 Resort Hotel Online TA     

count to create frequency tables

# alphabetical order by default
hotels %>%
  count(market_segment) 
# A tibble: 8 × 2
  market_segment     n
  <chr>          <int>
1 Aviation         237
2 Complementary    743
3 Corporate       5295
4 Direct         12606
5 Groups         19811
6 Offline TA/TO  24219
7 Online TA      56477
8 Undefined          2
# descending frequency order
hotels %>%
  count(market_segment, 
        sort = TRUE) 
# A tibble: 8 × 2
  market_segment     n
  <chr>          <int>
1 Online TA      56477
2 Offline TA/TO  24219
3 Groups         19811
4 Direct         12606
5 Corporate       5295
6 Complementary    743
7 Aviation         237
8 Undefined          2

count and arrange

# ascending frequency order
hotels %>%
  count(market_segment) %>%
  arrange(n) 
# A tibble: 8 × 2
  market_segment     n
  <chr>          <int>
1 Undefined          2
2 Aviation         237
3 Complementary    743
4 Corporate       5295
5 Direct         12606
6 Groups         19811
7 Offline TA/TO  24219
8 Online TA      56477
# descending frequency order
# just like adding sort = TRUE
hotels %>%
  count(market_segment) %>%
  arrange(desc(n)) 
# A tibble: 8 × 2
  market_segment     n
  <chr>          <int>
1 Online TA      56477
2 Offline TA/TO  24219
3 Groups         19811
4 Direct         12606
5 Corporate       5295
6 Complementary    743
7 Aviation         237
8 Undefined          2

count for multiple variables

hotels %>%
  count(hotel, market_segment) 
# A tibble: 14 × 3
   hotel        market_segment     n
   <chr>        <chr>          <int>
 1 City Hotel   Aviation         237
 2 City Hotel   Complementary    542
 3 City Hotel   Corporate       2986
 4 City Hotel   Direct          6093
 5 City Hotel   Groups         13975
 6 City Hotel   Offline TA/TO  16747
 7 City Hotel   Online TA      38748
 8 City Hotel   Undefined          2
 9 Resort Hotel Complementary    201
10 Resort Hotel Corporate       2309
11 Resort Hotel Direct          6513
12 Resort Hotel Groups          5836
13 Resort Hotel Offline TA/TO   7472
14 Resort Hotel Online TA      17729

order matters when you count

# hotel type first
hotels %>%
  count(hotel, market_segment) 
# A tibble: 14 × 3
   hotel        market_segment     n
   <chr>        <chr>          <int>
 1 City Hotel   Aviation         237
 2 City Hotel   Complementary    542
 3 City Hotel   Corporate       2986
 4 City Hotel   Direct          6093
 5 City Hotel   Groups         13975
 6 City Hotel   Offline TA/TO  16747
 7 City Hotel   Online TA      38748
 8 City Hotel   Undefined          2
 9 Resort Hotel Complementary    201
10 Resort Hotel Corporate       2309
11 Resort Hotel Direct          6513
12 Resort Hotel Groups          5836
13 Resort Hotel Offline TA/TO   7472
14 Resort Hotel Online TA      17729
# market segment first
hotels %>%
  count(market_segment, hotel) 
# A tibble: 14 × 3
   market_segment hotel            n
   <chr>          <chr>        <int>
 1 Aviation       City Hotel     237
 2 Complementary  City Hotel     542
 3 Complementary  Resort Hotel   201
 4 Corporate      City Hotel    2986
 5 Corporate      Resort Hotel  2309
 6 Direct         City Hotel    6093
 7 Direct         Resort Hotel  6513
 8 Groups         City Hotel   13975
 9 Groups         Resort Hotel  5836
10 Offline TA/TO  City Hotel   16747
11 Offline TA/TO  Resort Hotel  7472
12 Online TA      City Hotel   38748
13 Online TA      Resort Hotel 17729
14 Undefined      City Hotel       2

What is happening in the following chunk?

hotels %>%
  mutate(little_ones = children + babies) %>%
  count(hotel, little_ones) %>%
  mutate(prop = n / sum(n))
# A tibble: 12 × 4
   hotel        little_ones     n       prop
   <chr>              <dbl> <int>      <dbl>
 1 City Hotel             0 73923 0.619     
 2 City Hotel             1  3263 0.0273    
 3 City Hotel             2  2056 0.0172    
 4 City Hotel             3    82 0.000687  
 5 City Hotel             9     1 0.00000838
 6 City Hotel            10     1 0.00000838
 7 City Hotel            NA     4 0.0000335 
 8 Resort Hotel           0 36131 0.303     
 9 Resort Hotel           1  2183 0.0183    
10 Resort Hotel           2  1716 0.0144    
11 Resort Hotel           3    29 0.000243  
12 Resort Hotel          10     1 0.00000838

summarize and group_by

summarize for summary stats

# mean average daily rate for all bookings
hotels %>%
  summarize(mean_adr = mean(adr)) 
# A tibble: 1 × 1
  mean_adr
     <dbl>
1     102.

Tip

summarize() changes entirely the kind of info we get from the data frame. It collapses rows down to a single summary statistic, and removes all columns that are irrelevant to the calculation.

Tip

summarize() also lets you get away with being sloppy and not naming your new column, but that’s not recommended!

:red_square:

hotels %>%
  summarize(mean(adr))
# A tibble: 1 × 1
  `mean(adr)`
        <dbl>
1        102.

:green_square:

hotels %>%
  summarize(mean_adr = mean(adr))
# A tibble: 1 × 1
  mean_adr
     <dbl>
1     102.

group_by for grouped operations

# mean average daily rate for all booking at city and resort hotels
hotels %>%
  group_by(hotel) %>% 
  summarize(mean_adr = mean(adr))
# A tibble: 2 × 2
  hotel        mean_adr
  <chr>           <dbl>
1 City Hotel      105. 
2 Resort Hotel     95.0

Calculating frequencies

The following two give the same result, so count is simply short for group_by, then determine frequencies

hotels %>%
  group_by(hotel) %>%
  summarize(n = n())
# A tibble: 2 × 2
  hotel            n
  <chr>        <int>
1 City Hotel   79330
2 Resort Hotel 40060
hotels %>%
  count(hotel)
# A tibble: 2 × 2
  hotel            n
  <chr>        <int>
1 City Hotel   79330
2 Resort Hotel 40060

Multiple summary statistics

summarize can be used for multiple summary statistics

hotels %>%
  summarize(
    min_adr = min(adr),
    mean_adr = mean(adr),
    median_adr = median(adr),
    max_adr = max(adr)
    )
# A tibble: 1 × 4
  min_adr mean_adr median_adr max_adr
    <dbl>    <dbl>      <dbl>   <dbl>
1   -6.38     102.       94.6    5400

Quarto

quarto.org

  • Lets users write both code and content in reproducible computational documents
  • Generally refer to Quarto documents (with .qmd extension), e.g. “Do this in your Quarto document”

Quarto

  • Fully reproducible reports – each render runs the analysis from the beginning
  • Simple markdown syntax for text
  • Code goes in chunks, defined by three backticks, narrative goes outside of chunks

Tour: Quarto

Environments

The environment of your Quarto document is separate from the Console!

Remember this, and expect it to bite you a few times as you’re learning to work with Quarto!

Environments

First, run the following in the console

x <- 2
x * 3

All looks good, right?

Then, add the following in an R chunk in your Quarto document

x * 3

What happens? Why the error?

Quarto help

  • Many many online sources. Can start with

Markdown Basics

  • Markdown Quick Reference
    Help -> Markdown Quick Reference