Lab 7 Data wrangling with pivots and joins

Why are we here?

In this lab assignment you will wrangle data about college admissions and books.

In Part 1, you will practice pivoting tables.
In Part 2, you will practice working with multiple data tables using joins.

When you are finished, submit a single .qmd containing your responses to Part 1 and Part 2 on Brightspace. You do not need to render your Quarto for Lab 7.

To get started, load the tidyverse and dslabs packages.

Part 1 College admissions

Step 1

The admissions data set contains admission information for men and women across six majors. In the next chunk we’ll load it and keep everything but the data on number of applicants.

data("admissions")
dat <- admissions |> select(-applicants)

After executing the above, you might want to examine the data to see what the data frame has in it. (Rather than try to view it in QMD, repeat the two lines in the Console and then you can run View(dat)).

If we think of an observation as a major, and that each observation has two variables (men admitted percentage and women admitted percentage) then this is not tidy. Instead we want to have one row for each major, with separate columns for the men and the women (6 rows, 3 columns).

Use the pivot_wider function to wrangle this into tidy shape.

# your code here

What now?

What we really want to do is wrangle the admissions data so that for each major we have 4 observations: admitted_men, admitted_women, applicants_men and applicants_women. The trick we perform here is actually quite common:

first use pivot_longer to generate an intermediate data frame.
then use pivot_wider to obtain the tidy data we want.

We will go step by step in the next three exercises.

Step 2

Note that this exercise is based on the original admissions data frame, not on dat

Use the pivot_longer function to create a tmp data frame with a column containing the type of observation: admitted or applicants. Call the new columns name and value.

# your code here

Step 3

Now you have an object tmp with columns major, gender, name and value. Note that if you combine name and gender, then you get the column names we want: admitted_men, admitted_women, applicants_men, and applicants_women. Use the function unite to create a new column called column_name, and overwrite tmp with this new version. You can find helpful usage information by using ?unite at the Console prompt.

# your code here

Step 4

As the final step, create admissions_wider by applying the pivot_wider function to tmp in order to generate the tidy data with four variables for each major.

# your code here

Part 2 Books and authors

The files books.csv, authors.csv, and book-authors.csv give details about the planned summer reading of a statistics student. books.csv provides details for each book (isbn, title, year, and genre), authors.csv provides details about each author (authorid, first name, last name, and nationality), and book-authors.csv provides the author identification (authorid) for each isbn (books with multiple authors will have multiple rows).

For Part 2, run code to prevent the use of scientific notation:

options(scipen=999)

Step 1

Read the three files into R, naming them books, authors, and book_authors. To read in the datasets below from GitHub, place the entire url inside of the read_csv() function.

# Your code here

Step 2

Use the appropriate join statement to add the ISBNs to the authors data table. Create a new object with the result.

Why does the resulting data frame have 31 rows instead of 11? Include your response to the question here.

# Your code here

Step 3

To eliminate the duplicate rows of your data frame from Step 2 (which we’ll assume you named df2) run the following code (change the object names to align with your code as necessary):

df2 <- unique(df2)

# Your code here

Step 4

Use the appropriate join statement to add the author information table from Step 3 to the books data table. Create a new object with the result.

# Your code here

Step 5

Are there any authors in the authors data table that do not correspond to the books that are included in the books data table? Use an appropriate join statement to do this.

# Your code here

Step 6

After reading A Game of Thrones the student decides to read the rest of the series over the summer. books2.csv (available at https://raw.githubusercontent.com/ds4stats/r-tutorials/master/merging/data/books2.csv) contains the updated books on the student’s reading list. Read this file into R, naming it books2.

# Your code here

Step 7

Use the same join statement that you did in Step 4, but using books2 rather than books.

# Your code here

When you are done upload your .qmd file to Brightspace.