Practicing Pivots Activity 1

Why are we here?

  • We have data that’s organized in way that doesn’t facilitate our analysis.

  • We want to reorganize, or reshape, the data so we can do the analysis.

Definition of a tidy data set

In R, it is easiest to work with data that follow five basic rules:

  1. Every variable is stored in its own column.

  2. Every observation is stored in its own row—that is, every row corresponds to a single underlying observation or case.

  3. Each value of a variable is stored in a cell of the table.

  4. Values should not contain units. Rather, units should be specified in the supporting documentation for the data set, often called a codebook.

  5. There should be no extraneous information (footnotes, table titles, etc.) in the table.

Practicing pivots with longitudinal data

Before you get started, load the tidyverse.

Part 1 Child mortality rates

The file under5mortality.csv contains the child mortality rate per 1,000 children born for each country from 1800 to 2015. (Source: https://www.gapminder.org/data/)

Step 1

Load the data stored on GitHub

df <- read_csv("https://raw.githubusercontent.com/ds4stats/r-tutorials/master/tidying-data/data/under5mortality.csv")

Briefly describe why it is not considered to be tidy data and what changes need to be made to tidy it.

Step 2

Use pivot_longer to create a tidy data set with columns country, year and mortality.

# Your code here

Step 3

After you pivot the data frame:

  • Coerce the year column to numeric

  • Remove rows where the mortality rate is NA

# Your code here

Part 2 Tidying crash data

In this exercise you will use pivot_longer, separate_wider_delim, and pivot_wider. See Chapter 11 of the textbook for more on these functions.

The airlines data set contains the raw data behind the article Should Travelers Avoid Flying Airlines That Have Had Crashes in the Past? that appeared on fivethirtyeight.com.

Step 1

Read in the airlines.csv file available here (link). Name the new object airlines.

# Your code here

Understanding the context for data wrangling

  • What is the underlying observation or “case?”

In this example, an underlying observation or “case” is best described as an airline in a specific time frame, so these data are not tidy because each case is not its own row. Additionally, the last six column names contain the time frame (e.g,, 1985 to 1999 as “85_99”), which is a value. In order to tidy this data set we must

  • Step 2: have rows corresponding to airlines in a specific time frame,

  • Step 3: create a years column to specify the time frame,

  • Step 4: and create columns for each type of accident: incidents, fatal_accidents, and fatalities.

Step 2

First, we pivot_longer the last six columns into a common accidents column. Note: This will allow us to easily create the years column in Step 3.

In our cols = statement, we can use negation (- or !) as a shortcut to tell R to pivot all except the mentioned columns. For this data set, pivoting columns 3:8 is the same as pivoting everything except columns 1 & 2.

Create a new object airlines_longer.

# Your code here

Step 3

Next, using the separate_wider_delim function we separate the values of the new accidents column in the into var (short for variable) and years. We can specify delim = "_" to denote that the underscore is the separator.

  • The separate function separates a character column into multiple columns with a regular expression or numeric locations.

    • Use ?separate_wider_delim to learn more about the function. Also, see Chapter 11 Section 11.3 of IDS Intro

    • Note that you need to add a too_many = "merge" argument in the function because we have two underscores in each cell of the accidents column.

Create a new object called accidents_separate

# Your code here

Step 4

Finally, we need to ensure that each row corresponds to an underlying observation (case). (Don’t worry, this will also make each column a variable!) Currently, there are six rows for each airline: one for each var in each time frame, year. To solve this problem, we need to pivot_wider out the var column so that each variable has its own column.

Create a new object named airlines_wider

# Your code here

The first argument given to pivot_wider should be the data frame via the piping operator, followed by the name-value pair. The names_from is the name of the column whose values will be used as column headings and the values_from is the name of the column whose values will populate the cells of the new columns. In this example, we use var for the names_from argument and populate the cells with the count of accidents.