Join Exercises

How to use this page:

  • Create a new Quarto titled “join_review_yourname.qmd”

In this document the code is not automatically displayed, but you can see it by clicking the Code buttons. This gives you the opportunity to think first about how something should be done and then check to see what code was actually used.


Load the required packages: tidyverse, babynames, and nycflights13

Part 1 dplyr review

Baby Names

The babynames data frame contains full baby name information provided by the U.S. Social Security Administration. The data frame has 5 variables:

Code
names(babynames)
[1] "year" "sex"  "name" "n"    "prop"

where n is the number of occurrences of that name for that sex in that year, and prop is n divided by the total number of applicants in that year – so it is the proportion of people born in that year with that sex and name. In other words, consider the first row of the data frame:

Code
babynames %>% head(1)
# A tibble: 1 × 5
   year sex   name      n   prop
  <dbl> <chr> <chr> <int>  <dbl>
1  1880 F     Mary   7065 0.0724

which tells us that in 1880 there were 7065 girl babies named Mary and that was 7.24% of babies born that year!

Step 1

Update the babynames data frame by adding a new variable called popular that will be TRUE if the name was assigned to more than 1% of all babies in that year.

Step 2

Write code that will show the 10 baby names that were the most popular in any single year (we’re not trying to aggregate across years).

Step 3

We are interested in knowing which year had the greatest number of births. Write code that will show the data in descending order by number of births in each year.


Part 2: Joins

  • For Part 2, try the code in your Quarto while you work through the examples.

Preparing for joins

In the babynames package there is a table called births. It has just two columns:

Code
head(births)
# A tibble: 6 × 2
   year  births
  <int>   <int>
1  1909 2718000
2  1910 2777000
3  1911 2809000
4  1912 2840000
5  1913 2869000
6  1914 2966000

These data come from the U.S. Census Bureau (see help(births) for more information). For clarity, let’s put this table into a new object with a better name.

Code
census_births <- births

The information about individual names is stored in the babynames table. These data come from the Social Security Administration (see help(babynames) for more details). We expect that the total number of births recorded by the Census Bureau should match the total number of births as recorded by the Social Security Administration, but let’s investigate this.

First we should condense the SSA data into the same yearly form as the Census Bureau data. We can do this with group_by and summarize. Think about how to do this, and try it out in your Quarto, before you look at the code.

  • Reminder: the n() function counts the number of observations in a group. It can only be used in the summarize, mutate, and filter functions.
Code
ssa_births <- babynames %>%
  group_by(year) %>%
  summarize(N = n(), births = sum(n))

Actually, I’m going to make one small change. In order to make the subsequent exercises more interesting, I’m going to repeat the above but filer to exclude data from 2012 on so that the tables don’t end in the same year.

Code
ssa_births <- babynames %>%
  filter(year < 2012) %>%
  group_by(year) %>%
  summarize(N = n(), births = sum(n))

Now I have two separate tables with what should be the same information: census_births and ssa_births, but they don’t cover the same set of years. We can see this by doing some summarizing of both tables.

Code
census_births %>%
  summarize(
    N = n(),
    earliest = min(year),
    latest = max(year)
  )
# A tibble: 1 × 3
      N earliest latest
  <int>    <int>  <int>
1   109     1909   2017
Code
ssa_births %>%
  summarize(
    N = n(),
    earliest = min(year),
    latest = max(year)
  )
# A tibble: 1 × 3
      N earliest latest
  <int>    <dbl>  <dbl>
1   132     1880   2011

Joins

As you know, we can combine two data frames with a join operation. Generally we want to in some way match rows in one data frame rows that correspond in a different data frame.

If we use an inner_join the result will have only rows that have a corresponding match in both of the data frames.

Code
census_births %>%
  inner_join(ssa_births, by = "year") %>%
  summarize(
    N = n(),
    earliest = min(year),
    latest = max(year)
  )
# A tibble: 1 × 3
      N earliest latest
  <int>    <dbl>  <dbl>
1   103     1909   2011

You can see that the set of years that results is the intersection of the two sets of years in the original data frames.

If we use a left_join then we will get all the rows from census_births, even if there is no corresponding entry in the ssa_births table. This means that rows that have no corresponding entry will get NAs in place of the missing data.

Code
total_births <- census_births %>%
  left_join(ssa_births, by = "year")

total_births %>%
  filter(is.na(births.y))
# A tibble: 6 × 4
   year births.x     N births.y
  <dbl>    <int> <int>    <int>
1  2012  3952841    NA       NA
2  2013  3932181    NA       NA
3  2014  3988076    NA       NA
4  2015  3978497    NA       NA
5  2016  3945875    NA       NA
6  2017  3855500    NA       NA

By using a filter for the NA data, I can see that the data from 2012 onward exists in the census_births data frame, but not in the ssa_births data frame. Obviously, this is no longer the intersection of the two data frames.

Code
total_births %>%
  summarize(
    N = n(),
    earliest = min(year),
    latest = max(year)
  )
# A tibble: 1 × 3
      N earliest latest
  <int>    <dbl>  <dbl>
1   109     1909   2017

We can easily see that a right_join will have the opposite effect. There are many years of SSA data that have no match in the Census data and we can see that by looking for the data that has NAs.

Code
total_births <- census_births %>%
  right_join(ssa_births, by = "year")

total_births %>%
  filter(is.na(births.x))
# A tibble: 29 × 4
    year births.x     N births.y
   <dbl>    <int> <int>    <int>
 1  1880       NA  2000   201484
 2  1881       NA  1935   192696
 3  1882       NA  2127   221533
 4  1883       NA  2084   216946
 5  1884       NA  2297   243462
 6  1885       NA  2294   240854
 7  1886       NA  2392   255317
 8  1887       NA  2373   247394
 9  1888       NA  2651   299473
10  1889       NA  2590   288946
# ℹ 19 more rows

The number of years represented is neither the intersection, nor the union of the original data frames.

Code
total_births %>%
  summarize(
    N = n(),
    earliest = min(year),
    latest = max(year)
  )
# A tibble: 1 × 3
      N earliest latest
  <int>    <dbl>  <dbl>
1   132     1880   2011

Keep in mind that if you switch the order of the data frames and switch back to left_join you get exactly the same results. In other words, left_join(a, b) is the same as right_join(b, a).

Code
ssa_births %>%
  left_join(census_births, by = "year") %>%
  summarize(
    N = n(),
    earliest = min(year),
    latest = max(year)
  )
# A tibble: 1 × 3
      N earliest latest
  <int>    <dbl>  <dbl>
1   132     1880   2011

This brings us to full_join which results in all rows, regardless of whether or not thaty are matched.

Code
total_births <- census_births %>%
  full_join(ssa_births, by = "year")

total_births %>%
  filter(is.na(births.x) | is.na(births.y))
# A tibble: 35 × 4
    year births.x     N births.y
   <dbl>    <int> <int>    <int>
 1  2012  3952841    NA       NA
 2  2013  3932181    NA       NA
 3  2014  3988076    NA       NA
 4  2015  3978497    NA       NA
 5  2016  3945875    NA       NA
 6  2017  3855500    NA       NA
 7  1880       NA  2000   201484
 8  1881       NA  1935   192696
 9  1882       NA  2127   221533
10  1883       NA  2084   216946
# ℹ 25 more rows

Now the set of years that results really is the union of the years in the two data frames.

Code
total_births %>%
  summarize(
    N = n(),
    earliest = min(year),
    latest = max(year)
  )
# A tibble: 1 × 3
      N earliest latest
  <int>    <dbl>  <dbl>
1   138     1880   2017

Data Reconciliation

Let’s take a look ahead at the stats package to help us understand the relationship between these two datasets.

Once we have the data frames joined, we can compare the birth numbers directly. I’ll use the cor function which calculates the correlation between two vectors (it’s in the stats package). While the birth numbers are strongly correlated…

Code
total_births %>%
  summarize(
    N = n(),
    correlation = cor(births.x, births.y,
      use = "complete.obs"
    )
  )
# A tibble: 1 × 2
      N correlation
  <int>       <dbl>
1   138       0.892

the numbers are not the same.

Code
library(ggplot2)
ggplot(data = total_births, aes(x = year, y = births.x)) +
  geom_line() +
  geom_line(aes(y = births.y))

A deeper dive into the documentation for babynames might help you figure out why.


Join Exercises

Some of these joins may be familiar, others won’t be. Try to write each one on your own and test your code before you check back here to reveal the solution. Don’t forget the library(nycflights13) command. You installed this package when completing the Exercise 4 assignment.

For each exercise, ask yourself these questions:

  • does answering it require data from more than one data frame?
  • what data from each data frame is necessary so that you can solve the problem?
  • what sort of combining command will give you the right set of data?
  • what operations should you do on the data before combining data frames?
  • what operations will you need to do after the combining?

We will use the following datasets from the nycflights13 package:

  • flights contains flights that departed from New York City in 2013.

Side note: Use the following code to learn more about the dataset:

Code
?nycflights13::flights

#To learn about a particular dataset within an R package use 
?packagename::datasetname
  • airlines contains airline full names with their carrier codes.

  • airports contains metadata about airports

  • planes contains airplane metadata for all plane tailnumbers found in the FAA aircraft registry

Explore these datasets in RStudio before moving on to the exercises below. Look for the join keys, shared variables that will allow you to join the data tables.

Exercise 1

Note: for Exercise 1-5, do not overwrite (redefine) the flights data frame.

  • Compute the average arrival delay time for each carrier

  • Then, conduct a join to include the full name of the carrier in your result.

Code
# sample solution
library(nycflights13)
flights %>%
  group_by(carrier) %>%
  summarize(avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
  left_join(airlines, by = c("carrier" = "carrier"))

Exercise 2

What was the full name of the airport that was the most common destination for planes flying from NYC in 2013?

  • dest destination airport in the flights table.

  • faa is the unique Federal Aviation Administration code for each airport in the airports table.

Code
# sample solution
flights %>%
  group_by(dest) %>%
  summarize(N = n()) %>%
  arrange(desc(N)) %>%
  left_join(airports, by = c("dest" = "faa"))

Exercise 3

What is the most commonly used plane? Find it’s manufacturer and model.

Code
# sample solution
flights %>%
  left_join(planes, by = c("tailnum" = "tailnum")) %>%
  group_by(manufacturer, model) %>%
  summarize(N = n()) %>%
  arrange(desc(N))

Exercise 4

Were there any flights that flew to “mystery” destination airports (i.e., airports that don’t appear in the airports table)?

Code
# sample solution
flights %>%
  left_join(airports, by = c("dest" = "faa")) %>%
  filter(is.na(name)) %>%
  group_by(dest) %>%
  summarize(N = n())

Exercise 5

Were there any “mystery” planes (i.e., planes that don’t appear in the planes table)?

Code
# sample solution
flights %>%
  left_join(planes, by = c("tailnum" = "tailnum")) %>%
  filter(is.na(model)) %>%
  group_by(tailnum) %>%
  summarize(N = n()) %>%
  arrange(desc(N))

Mutating joins

Now, try responding to Exercise 6-9 below without code to reference.

A mutating join allows you to combine variables from two data frames: it first matches observations by their keys, then copies across variables from one data frame to the other. Like mutate(), the join functions add variables to the right, so if your dataset has many variables, you won’t see the new ones unless you scroll to the right side or the data table.

We will keep working with the nycflights13 package datasets. In addition to the data sets introduced above, you will use:

  • weather data table that contains hourly meteorological data for NYC airports: LGA, JFK, and EWR (Newark).

Exercise 6

For the following examples, we’ll make it easier to see what’s going on by creating a narrower dataset with just six variables.

Write code to create a new object flights2 with the following variables from the flights data table: year, time_hour, origin, dest, tailnum, and carrier. Doing so will make it easier for you to observe the changes to the data tables after conducting a join.

# A tibble: 336,776 × 6
    year time_hour           origin dest  tailnum carrier
   <int> <dttm>              <chr>  <chr> <chr>   <chr>  
 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA     
 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA     
 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA     
 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6     
 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL     
 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA     
 7  2013 2013-01-01 06:00:00 EWR    FLL   N516JB  B6     
 8  2013 2013-01-01 06:00:00 LGA    IAD   N829AS  EV     
 9  2013 2013-01-01 06:00:00 JFK    MCO   N593JB  B6     
10  2013 2013-01-01 06:00:00 LGA    ORD   N3ALAA  AA     
# ℹ 336,766 more rows

Exercise 7

There are four types of mutating joins, but there’s one that you’ll use almost all of the time: left_join(). It’s special because the output will always have the same rows as x, the data frame you’re joining to. The primary use of left_join() is to add in additional metadata.

Use the the planes data table to add what size of plane was flying to the flights2 data table. You can determine the size based on the number of engines and seats on a plan. Create a new object with the result.

Note

When conducting a join, we can specify a join using multiple join keys.

In the example in this note, we use the the planes data table to find out the temperature and wind speed when each plane departed. We will join by both the origin airport and the time the plan took off.

Code
#Be specific about join keys. Note: we do not set time_hour equal to origin:
df2 <- flights2 |> 
  left_join(weather, by = c("time_hour", "origin"))
            
#Alternatively, we can achieve the same result using join_by():

df3 <- flights2 |> 
  left_join(weather, by = join_by(time_hour, origin))

Filtering joins

Exercise 8

The primary action of a filtering join is to filter the rows. There are two types: semi-joins and anti-joins. Semi-joins keep all rows in x that have a match in y.

For example, we could use a semi-join to filter the airports dataset to show just the destination airports in your flights2 data table.

Write code to create the result below.

# A tibble: 101 × 8
   faa   name                                lat    lon   alt    tz dst   tzone 
   <chr> <chr>                             <dbl>  <dbl> <dbl> <dbl> <chr> <chr> 
 1 ABQ   Albuquerque International Sunport  35.0 -107.   5355    -7 A     Ameri…
 2 ACK   Nantucket Mem                      41.3  -70.1    48    -5 A     Ameri…
 3 ALB   Albany Intl                        42.7  -73.8   285    -5 A     Ameri…
 4 ANC   Ted Stevens Anchorage Intl         61.2 -150.    152    -9 A     Ameri…
 5 ATL   Hartsfield Jackson Atlanta Intl    33.6  -84.4  1026    -5 A     Ameri…
 6 AUS   Austin Bergstrom Intl              30.2  -97.7   542    -6 A     Ameri…
 7 AVL   Asheville Regional Airport         35.4  -82.5  2165    -5 A     Ameri…
 8 BDL   Bradley Intl                       41.9  -72.7   173    -5 A     Ameri…
 9 BGR   Bangor Intl                        44.8  -68.8   192    -5 A     Ameri…
10 BHM   Birmingham Intl                    33.6  -86.8   644    -6 A     Ameri…
# ℹ 91 more rows

Exercise 9

Imagine you’ve found the top 10 most popular destinations using this code:

Code
top_dest <- flights2 |>
  count(dest, sort = TRUE) |>
  head(10)

How can you find all flights to those destinations? Write code to do so. Create a new object with the result.

What percentage of flights were to the top 10 destinations?