```{r packages, echo=FALSE, message=FALSE, warning=FALSE}
library(tidyverse)
```

# Tidy data {.middle}

## Tidy data {.smaller}

**Characteristics of tidy data:**

-   Each variable forms a column.
-   Each observation forms a row.
-   Each type of observational unit forms a table.

------------------------------------------------------------------------

## Tidying data

-   We have a data set from real life.
    -   It's a mess, or at least not well organized.
-   We need to reorganize it.

------------------------------------------------------------------------

::: {question}
What makes this data not tidy?[^3]
:::

[^3]: Source: [US Census Fact Finder, General Economic Characteristics, ACS 2017](https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_17_5YR_DP03&src=pt)

```{r us-general-economic-characteristic-acs-2017, echo=FALSE, out.width="100%", fig.align = "center", caption = "US General Economic Characteristics, ACS 2017", out.width = "100%"}
knitr::include_graphics("img/us-general-economic-characteristic-acs-2017.png")
```


## 

::: question
What makes this data not tidy?[^1]
:::

[^1]: Source: [US Census Fact Finder, General Economic Characteristics, ACS 2017](https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_17_5YR_DP03&src=pt)

```{r hyperwar-airplanes-on-hand, echo=FALSE, out.width="100%", fig.align = "center", caption = "WW2 Army Air Force combat aircraft", out.width = "100%"}
knitr::include_graphics("img/hyperwar-airplanes-on-hand.png")
```

------------------------------------------------------------------------

::: question
What makes this data not tidy?[^2]
:::

[^2]: Source: [Gapminder, Estimated HIV prevalence among 15-49 year olds](https://www.gapminder.org/data)

```{r hiv-est-prevalence-15-49, echo=FALSE, out.width="100%", fig.align = "center", caption = "Estimated HIV prevalence among 15-49 year olds", out.width = "100%"}
knitr::include_graphics("img/hiv-est-prevalence-15-49.png")
```


## Tidying data {.smaller}

-   Numerous actions we can take to tidy data
-   At the same time we can also make the data more useful for analysis
-   Some actions might be
    -   Pivoting
        -   pivot( ) function is very versatile

        -   adding more columns / adding more rows

            ...we'll get to this later
    -   Wrangling
        -   use *dplyr* library from *tidyverse*

# Grammar of Data Wrangling {.middle}

------------------------------------------------------------------------

## A grammar of data wrangling... {.smaller}

... based on the concepts of functions as **verbs that manipulate data frames**

::::: columns
::: {.column width="50%"}
```{r dplyr-part-of-tidyverse, echo=FALSE, out.width="60%", caption = "dplyr is part of the tidyverse"}
knitr::include_graphics("img/dplyr-part-of-tidyverse.png")
```
:::

::: {.column width="50%"}
Functions that manipulate existing data:

-   `select`: pick columns by name
-   `arrange`: reorder rows
-   `slice`: pick rows using index(es)
-   `filter`: pick rows matching criteria
-   `distinct`: filter for unique rows
:::
:::::

## A grammar of data wrangling... {.smaller}

... based on the concepts of functions as **verbs that manipulate data frames**

::::: columns
::: {.column width="50%"}
```{r dplyr-part-of-tidyverse2, echo=FALSE, out.width="60%", caption = "dplyr is part of the tidyverse"}
knitr::include_graphics("img/dplyr-part-of-tidyverse.png")
```
:::

::: {.column width="50%"}
Functions that generate new data:

-   `mutate`: add new variables
-   `summarize`: reduce variables to values
-   `group_by`: for grouped operations
-   `count`: to create frequency tables
:::
:::::

## Rules of **dplyr** functions {.smaller}

-   First argument: *always* a data frame
-   Second argument: what to do with the data
-   Always return a data frame
-   Don't modify in place, without creating a separate copy of the data structure

dplyr functions expect a data frame as input, produce a data frame as output

## Data: Hotel bookings[^4]

[^4]: Source: [TidyTuesday](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md)

-   Data from two hotels: one resort and one city hotel
-   Observations: Each row represents a hotel booking
-   Goal for original data collection: Development of prediction models to classify likelihood that a hotel's booking's would be cancelled ([Antonia et al., 2019](https://www.sciencedirect.com/science/article/pii/S2352340918315191#bib5))

```{r message=FALSE}
hotels <- read_csv("data/hotels.csv")
```

## First look: Variables

```{r output.lines=18, echo=TRUE}
names(hotels)
```

------------------------------------------------------------------------

## Second look: Overview {.smaller}

```{r output.lines=18, echo=TRUE}
glimpse(hotels)
```

# Changing data frames {.middle}

## Adding a new variable (column)

```{r echo=TRUE, `code-line-numbers`="1", output = FALSE }
mutate(hotels, little_ones = children + babies)
```

```{r echo=TRUE}
names(hotels)
```

## Adding a new variable (column) - 2nd attempt

```{r echo=TRUE, `code-line-numbers`="1" }
hotels <- mutate(hotels, little_ones = children + babies)
names(hotels)
```

# Pipes {.middle}

------------------------------------------------------------------------

## What is a pipe?

In programming, a pipe is a technique for passing information from one process to another.

. . .

::::: {columns}
::: {.column width="50%"}
-   Start with the data frame `hotels`, and pass it to the `select()` function,
:::

::: {.column width="50%"}
```{r, echo=TRUE, `code-line-numbers`="1"}
hotels |> 
  select(hotel, lead_time) |>
  arrange(desc(lead_time))
```
:::
:::::

------------------------------------------------------------------------

## What is a pipe?

In programming, a pipe is a technique for passing information from one process to another.

::::: {columns}
::: {.column width="50%"}
<font size = 6>

-   Start with the data frame `hotels`, and pass it to the `select()` function,
-   then we select the variables `hotel` and `lead_time`, </font>
:::

::: {.column width="50%"}
```{r, echo=TRUE, `code-line-numbers`="2"}
hotels |>
  select(hotel, lead_time) |> 
  arrange(desc(lead_time))
```
:::
:::::

------------------------------------------------------------------------

## What is a pipe?

In programming, a pipe is a technique for passing information from one process to another.

::::: {columns}
::: {.column width="50%"}
<font size = 6>

-   Start with the data frame `hotels`, and pass it to the `select()` function,
-   then we select the variables `hotel` and `lead_time`,
-   and then we arrange the data frame by `lead_time` in descending order.</font>
:::

::: {.column width="50%"}
```{r, echo=TRUE, `code-line-numbers`="3"}
hotels |>
  select(hotel, lead_time) |> 
  arrange(desc(lead_time)) 
```
:::
:::::

## Adding a new variable (column) - 3rd attempt

```{r echo=TRUE, `code-line-numbers`="1" }
hotels <- hotels |> mutate(little_ones = children + babies)
```

This becomes useful when we want to chain together several operations.

# `filter` {.middle}

------------------------------------------------------------------------

## `filter` to select a subset of rows

```{r output.lines=17, echo=TRUE, `code-line-numbers`="3"}
# bookings in City Hotels
hotels |>
  filter(hotel == "City Hotel") 
```

## Logical operators in R {.smaller}

<br>

| operator | definition               | operator      | definition                |
|---------------|:--------------------|:--------------|:---------------------|
| `<`      | less than                | `x` \| `y`    | `x` OR `y`                |
| `<=`     | less than or equal to    | `is.na(x)`    | test if `x` is `NA`       |
| `>`      | greater than             | `!is.na(x)`   | test if `x` is not `NA`   |
| `>=`     | greater than or equal to | `x %in% y`    | test if `x` is in `y`     |
| `==`     | exactly equal to         | `!(x %in% y)` | test if `x` is not in `y` |
| `!=`     | not equal to             | `!x`          | not `x`                   |
| `x & y`  | `x` AND `y`              |               |                           |

## `filter` for many conditions at once

```{r echo=TRUE, `code-line-numbers`="3-4"}
hotels |>
  filter( 
    adults == 0,     
    children >= 1    
    ) |> 
  select(adults, babies, children)
```

# `select` to select variables (columns) {.middle}

## Select a single column

View only `lead_time` (number of days between booking and arrival date):

```{r, echo=TRUE}
select(hotels, lead_time)
```

------------------------------------------------------------------------

## Select a single column

::::: columns
::: {.column width="30%"}
```{r eval=FALSE, echo=TRUE, `code-line-numbers`="1"}
select(
  hotels, 
  lead_time
  )
```
:::

::: {.column width="70%"}
-   Start with the function (a verb): `select()`
:::
:::::

------------------------------------------------------------------------

## Select a single column

::::: {columns}
::: {.column width="30%"}
```{r eval=FALSE, echo=TRUE,  `code-line-numbers`="2"}
select(
  hotels,    
  lead_time   
  )
```
:::

::: {.column width="70%"}
-   Start with the function (a verb): `select()`
-   First argument: data frame we're working with , `hotels`
:::
:::::

------------------------------------------------------------------------

## Select a single column

::::: {columns}
::: {.column width="30%"}
```{r eval=FALSE, echo=TRUE, `code-line-numbers`="3"}
select( 
  hotels, 
  lead_time 
  )
```
:::

::: {.column width="70%"}
-   Start with the function (a verb): `select()`
-   First argument: data frame we're working with , `hotels`
-   Second argument: variable we want to select, `lead_time`
:::
:::::

------------------------------------------------------------------------

## Select a single column

::::: {columns}
::: {.column width="30%"}
```{r}
select( 
  hotels, 
  lead_time
  )
```
:::

::: {.column width="70%"}
<font size = 6>

-   Start with the function (a verb): `select()`
-   First argument: data frame we're working with , `hotels`
-   Second argument: variable we want to select, `lead_time`
-   Result: data frame with `r nrow(hotels)` rows and 1 column </font>
:::
:::::

------------------------------------------------------------------------

::: callout-tip
dplyr functions always expect a data frame and always yield a data frame.
:::

```{r, echo=TRUE}
select(hotels, lead_time)
```

------------------------------------------------------------------------

## `select` to keep variables (columns)

```{r, echo=TRUE,  `code-line-numbers`="2"}
hotels |>
  select(hotel, lead_time) 
```

------------------------------------------------------------------------

## `select` to exclude variables

```{r output.lines=18, echo=TRUE, `code-line-numbers`="2"}
hotels |>
  select(!agent) 
```

## `select` a range of variables

```{r echo=TRUE, `code-line-numbers`="2"}
hotels |>
  select(hotel:arrival_date_month) 
```

## `select` variables with certain characteristics

```{r echo=TRUE, `code-line-numbers`="2"}
hotels |>
  select(starts_with("arrival")) 
```

## `select` variables with certain characteristics

```{r echo=TRUE, `code-line-numbers`="2"}
hotels |>
  select(ends_with("type")) 
```

## Select helpers {.smaller}

-   `starts_with()`: Starts with a prefix
-   `ends_with()`: Ends with a suffix
-   `contains()`: Contains a literal string
-   `num_range()`: Matches a numerical range like x01, x02, x03
-   `one_of()`: Matches variable names in a character vector
-   `everything()`: Matches all variables
-   `last_col()`: Select last variable, possibly with an offset
-   `matches()`: Matches a regular expression (a sequence of symbols/characters expressing a string/pattern to be searched for within text)

See help for any of these functions for more info, e.g. `?everything`.

## Select multiple columns

View only the `hotel` type and `lead_time`:

. . .

:::::: {columns}
::: {.column width="50%"}
```{r, echo=TRUE}
select(hotels, hotel, lead_time)
```
:::

:::: {.column width="50%"}
::: question
<font size = 6>

What if we wanted to select these columns, and then arrange the data in descending order of lead time?</font>
:::
::::
::::::

------------------------------------------------------------------------

## Data wrangling, step-by-step

::::: {columns}
::: {.column width="40%"}
Select:

```{r, echo=TRUE}
hotels |>
  select(hotel, lead_time)
```
:::

::: {.column width="50%"}
Select, then arrange:

```{r, echo=TRUE}
hotels |>
  select(hotel, lead_time) |>
  arrange(desc(lead_time))
```
:::
:::::

## `arrange` in ascending / descending order

::::: columns
::: {.column width="50%"}
```{r echo=TRUE, `code-line-numbers`="2"}
hotels |>
  select(adults, children, babies) |>
  arrange(babies)
```
:::

::: {.column width="50%"}
```{r echo=TRUE, `code-line-numbers`="2"}
hotels |>
  select(adults, children, babies) |>
  arrange(desc(babies)) 
```
:::
:::::

## `slice` for certain row numbers

```{r output.lines=17, echo=TRUE, `code-line-numbers`="3"}
# first five
hotels |>
  slice(1:5) 
```

------------------------------------------------------------------------

::: callout-tip
In R, you can use the `#` for adding comments to your code. Any text following `#` will be printed as is, and won't be run as R code. This is useful for leaving comments in your code and for temporarily disabling certain lines of code while debugging.
:::

```{r output.lines=10, echo=TRUE}
hotels |>
  # slice the first five rows  # this line is a comment
  #select(hotel) |>           # this one doesn't run
  slice(1:5)                   # this line runs
```

# `distinct` and `count` {.middle}

------------------------------------------------------------------------

```{r include=FALSE}
options(dplyr.print_max = 20)
```

## `distinct` to filter for unique rows

And `arrange` to order alphabetically

::::: columns
::: {.column width="50%"}
```{r echo=TRUE, `code-line-numbers`="2"}
hotels |> 
  distinct(market_segment) |> 
  arrange(market_segment)
```
:::

::: {.column width="50%"}
```{r output.lines=13, echo=TRUE, `code-line-numbers`="2"}
hotels |> 
  distinct(hotel, market_segment) |> 
  arrange(hotel, market_segment)
```
:::
:::::

------------------------------------------------------------------------

## `count` to create frequency tables

::::: columns
::: {.column width="50%"}
```{r echo=TRUE, `code-line-numbers`="3"}
# alphabetical order by default
hotels |>
  count(market_segment) 
```
:::

::: {.column width="50%"}
```{r echo=TRUE, `code-line-numbers`="3"}
# descending frequency order
hotels |>
  count(market_segment, 
        sort = TRUE) 
```
:::
:::::

------------------------------------------------------------------------

## `count` and `arrange`

::::: columns
::: {.column width="50%"}
```{r echo=TRUE, `code-line-numbers`="4"}
# ascending frequency order
hotels |>
  count(market_segment) |>
  arrange(n) 
```
:::

::: {.column width="50%"}
```{r echo=TRUE, `code-line-numbers`="5"}
# descending frequency order
# just like adding sort = TRUE
hotels |>
  count(market_segment) |>
  arrange(desc(n)) 
```
:::
:::::

------------------------------------------------------------------------

## `count` for multiple variables

```{r echo=TRUE, `code-line-numbers`="2"}
hotels |>
  count(hotel, market_segment) 
```

------------------------------------------------------------------------

## order matters when you `count`

::::: columns
::: {.column width="50%"}
```{r echo=TRUE, `code-line-numbers`="3"}
# hotel type first
hotels |>
  count(hotel, market_segment) 
```
:::

::: {.column width="50%"}
```{r echo=TRUE, `code-line-numbers`="3"}
# market segment first
hotels |>
  count(market_segment, hotel) 
```
:::
:::::

------------------------------------------------------------------------

::: {question}
What is happening in the following chunk?
:::

```{r echo=TRUE}
hotels |>
  mutate(little_ones = children + babies) |>
  count(hotel, little_ones) |>
  mutate(prop = n / sum(n))
```

# `summarize` and `group_by` {.middle}

## `summarize` for summary stats

```{r echo=TRUE, `code-line-numbers`="3"}
# mean average daily rate for all bookings
hotels |>
  summarize(mean_adr = mean(adr)) 
```

. . .

::: callout-tip
`summarize()` changes entirely the kind of info we get from the data frame. It collapses rows down to a single summary statistic, and removes all columns that are irrelevant to the calculation.
:::

------------------------------------------------------------------------

::: callout-tip
`summarize()` also lets you get away with being sloppy and not naming your new column, but that's not recommended!
:::

::::: {columns}
::: {.column width="50%"}
:red_square:

```{r echo=TRUE}
hotels |>
  summarize(mean(adr))
```
:::

::: {.column width="50%"}
:green_square:

```{r echo=TRUE}
hotels |>
  summarize(mean_adr = mean(adr))
```
:::
:::::

------------------------------------------------------------------------

## `group_by` for grouped operations

```{r echo=TRUE, `code-line-numbers`="3"}
# mean average daily rate for all booking at city and resort hotels
hotels |>
  group_by(hotel) |> 
  summarize(mean_adr = mean(adr))
```

------------------------------------------------------------------------

## Calculating frequencies

The following two give the same result, so `count` is simply short for `group_by,` then determine frequencies

::::: columns
::: {.column width="50%"}
```{r echo=TRUE}
hotels |>
  group_by(hotel) |>
  summarize(n = n())
```
:::

::: {.column width="50%"}
```{r echo=TRUE}
hotels |>
  count(hotel)
```
:::
:::::

------------------------------------------------------------------------

## Multiple summary statistics

`summarize` can be used for multiple summary statistics

```{r echo=TRUE}
hotels |>
  summarize(
    min_adr = min(adr),
    mean_adr = mean(adr),
    median_adr = median(adr),
    max_adr = max(adr)
    )
```