Code
names(babynames)
[1] "year" "sex" "name" "n" "prop"
In this document the code is not automatically displayed, but you can see it by clicking the Code
buttons. This gives you the opportunity to think first about how something should be done and then check to see what code was actually used.
Load the required packages: tidyverse
, babynames
, and nycflights13
Baby Names
The babynames
data frame contains full baby name information provided by the U.S. Social Security Administration. The data frame has 5 variables:
where n
is the number of occurrences of that name for that sex in that year, and prop
is n
divided by the total number of applicants in that year – so it is the proportion of people born in that year with that sex and name. In other words, consider the first row of the data frame:
# A tibble: 1 × 5
year sex name n prop
<dbl> <chr> <chr> <int> <dbl>
1 1880 F Mary 7065 0.0724
which tells us that in 1880 there were 7065 girl babies named Mary and that was 7.24% of babies born that year!
Update the babynames
data frame by adding a new variable called popular
that will be TRUE if the name was assigned to more than 1% of all babies in that year.
Write code that will show the 10 baby names that were the most popular in any single year (we’re not trying to aggregate across years).
We are interested in knowing which year had the greatest number of births. Write code that will show the data in descending order by number of births in each year.
In the babynames
package there is a table called births
. It has just two columns:
# A tibble: 6 × 2
year births
<int> <int>
1 1909 2718000
2 1910 2777000
3 1911 2809000
4 1912 2840000
5 1913 2869000
6 1914 2966000
These data come from the U.S. Census Bureau (see help(births)
for more information). For clarity, let’s put this table into a new object with a better name.
The information about individual names is stored in the babynames
table. These data come from the Social Security Administration (see help(babynames)
for more details). We expect that the total number of births recorded by the Census Bureau should match the total number of births as recorded by the Social Security Administration, but let’s investigate this.
First we should condense the SSA data into the same yearly form as the Census Bureau data. We can do this with group_by
and summarize
. Think about how to do this, and try it out in your Quarto, before you look at the code.
n()
function counts the number of observations in a group. It can only be used in the summarize
, mutate
, and filter
functions.Actually, I’m going to make one small change. In order to make the subsequent exercises more interesting, I’m going to repeat the above but filer to exclude data from 2012 on so that the tables don’t end in the same year.
Now I have two separate tables with what should be the same information: census_births
and ssa_births
, but they don’t cover the same set of years. We can see this by doing some summarizing of both tables.
# A tibble: 1 × 3
N earliest latest
<int> <int> <int>
1 109 1909 2017
# A tibble: 1 × 3
N earliest latest
<int> <dbl> <dbl>
1 132 1880 2011
As you know, we can combine two data frames with a join operation. Generally we want to in some way match rows in one data frame rows that correspond in a different data frame.
If we use an inner_join
the result will have only rows that have a corresponding match in both of the data frames.
# A tibble: 1 × 3
N earliest latest
<int> <dbl> <dbl>
1 103 1909 2011
You can see that the set of years that results is the intersection of the two sets of years in the original data frames.
If we use a left_join
then we will get all the rows from census_births
, even if there is no corresponding entry in the ssa_births
table. This means that rows that have no corresponding entry will get NA
s in place of the missing data.
# A tibble: 6 × 4
year births.x N births.y
<dbl> <int> <int> <int>
1 2012 3952841 NA NA
2 2013 3932181 NA NA
3 2014 3988076 NA NA
4 2015 3978497 NA NA
5 2016 3945875 NA NA
6 2017 3855500 NA NA
By using a filter for the NA
data, I can see that the data from 2012 onward exists in the census_births
data frame, but not in the ssa_births
data frame. Obviously, this is no longer the intersection of the two data frames.
# A tibble: 1 × 3
N earliest latest
<int> <dbl> <dbl>
1 109 1909 2017
We can easily see that a right_join
will have the opposite effect. There are many years of SSA data that have no match in the Census data and we can see that by looking for the data that has NA
s.
# A tibble: 29 × 4
year births.x N births.y
<dbl> <int> <int> <int>
1 1880 NA 2000 201484
2 1881 NA 1935 192696
3 1882 NA 2127 221533
4 1883 NA 2084 216946
5 1884 NA 2297 243462
6 1885 NA 2294 240854
7 1886 NA 2392 255317
8 1887 NA 2373 247394
9 1888 NA 2651 299473
10 1889 NA 2590 288946
# ℹ 19 more rows
The number of years represented is neither the intersection, nor the union of the original data frames.
# A tibble: 1 × 3
N earliest latest
<int> <dbl> <dbl>
1 132 1880 2011
Keep in mind that if you switch the order of the data frames and switch back to left_join
you get exactly the same results. In other words, left_join(a, b)
is the same as right_join(b, a)
.
# A tibble: 1 × 3
N earliest latest
<int> <dbl> <dbl>
1 132 1880 2011
This brings us to full_join
which results in all rows, regardless of whether or not thaty are matched.
# A tibble: 35 × 4
year births.x N births.y
<dbl> <int> <int> <int>
1 2012 3952841 NA NA
2 2013 3932181 NA NA
3 2014 3988076 NA NA
4 2015 3978497 NA NA
5 2016 3945875 NA NA
6 2017 3855500 NA NA
7 1880 NA 2000 201484
8 1881 NA 1935 192696
9 1882 NA 2127 221533
10 1883 NA 2084 216946
# ℹ 25 more rows
Now the set of years that results really is the union of the years in the two data frames.
# A tibble: 1 × 3
N earliest latest
<int> <dbl> <dbl>
1 138 1880 2017
Let’s take a look ahead at the stats
package to help us understand the relationship between these two datasets.
Once we have the data frames joined, we can compare the birth numbers directly. I’ll use the cor
function which calculates the correlation between two vectors (it’s in the stats
package). While the birth numbers are strongly correlated…
# A tibble: 1 × 2
N correlation
<int> <dbl>
1 138 0.892
the numbers are not the same.
A deeper dive into the documentation for babynames
might help you figure out why.
Some of these joins may be familiar, others won’t be. Try to write each one on your own and test your code before you check back here to reveal the solution. Don’t forget the library(nycflights13)
command. You installed this package when completing the Exercise 4 assignment.
For each exercise, ask yourself these questions:
We will use the following datasets from the nycflights13
package:
flights
contains flights that departed from New York City in 2013.Side note: Use the following code to learn more about the dataset:
airlines
contains airline full names with their carrier codes.
airports
contains metadata about airports
planes
contains airplane metadata for all plane tailnumbers found in the FAA aircraft registry
Explore these datasets in RStudio before moving on to the exercises below. Look for the join keys, shared variables that will allow you to join the data tables.
Note: for Exercise 1-5, do not overwrite (redefine) the flights
data frame.
Compute the average arrival delay time for each carrier
Then, conduct a join to include the full name of the carrier in your result.
What was the full name of the airport that was the most common destination for planes flying from NYC in 2013?
dest
destination airport in the flights
table.
faa
is the unique Federal Aviation Administration code for each airport in the airports
table.
What is the most commonly used plane? Find it’s manufacturer and model.
Were there any flights that flew to “mystery” destination airports (i.e., airports that don’t appear in the airports
table)?
Were there any “mystery” planes (i.e., planes that don’t appear in the planes
table)?
Now, try responding to Exercise 6-9 below without code to reference.
A mutating join allows you to combine variables from two data frames: it first matches observations by their keys, then copies across variables from one data frame to the other. Like mutate()
, the join functions add variables to the right, so if your dataset has many variables, you won’t see the new ones unless you scroll to the right side or the data table.
We will keep working with the nycflights13
package datasets. In addition to the data sets introduced above, you will use:
weather
data table that contains hourly meteorological data for NYC airports: LGA, JFK, and EWR (Newark).For the following examples, we’ll make it easier to see what’s going on by creating a narrower dataset with just six variables.
Write code to create a new object flights2
with the following variables from the flights
data table: year, time_hour, origin, dest, tailnum, and carrier. Doing so will make it easier for you to observe the changes to the data tables after conducting a join.
# A tibble: 336,776 × 6
year time_hour origin dest tailnum carrier
<int> <dttm> <chr> <chr> <chr> <chr>
1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA
2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA
3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA
4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6
5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL
6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA
7 2013 2013-01-01 06:00:00 EWR FLL N516JB B6
8 2013 2013-01-01 06:00:00 LGA IAD N829AS EV
9 2013 2013-01-01 06:00:00 JFK MCO N593JB B6
10 2013 2013-01-01 06:00:00 LGA ORD N3ALAA AA
# ℹ 336,766 more rows
There are four types of mutating joins, but there’s one that you’ll use almost all of the time: left_join()
. It’s special because the output will always have the same rows as x
, the data frame you’re joining to. The primary use of left_join()
is to add in additional metadata.
Use the the planes
data table to add what size of plane was flying to the flights2
data table. You can determine the size based on the number of engines and seats on a plan. Create a new object with the result.
When conducting a join, we can specify a join using multiple join keys.
In the example in this note, we use the the planes
data table to find out the temperature and wind speed when each plane departed. We will join by both the origin airport and the time the plan took off.
The primary action of a filtering join is to filter the rows. There are two types: semi-joins and anti-joins. Semi-joins keep all rows in x
that have a match in y
.
For example, we could use a semi-join to filter the airports
dataset to show just the destination airports in your flights2
data table.
Write code to create the result below.
# A tibble: 101 × 8
faa name lat lon alt tz dst tzone
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 ABQ Albuquerque International Sunport 35.0 -107. 5355 -7 A Ameri…
2 ACK Nantucket Mem 41.3 -70.1 48 -5 A Ameri…
3 ALB Albany Intl 42.7 -73.8 285 -5 A Ameri…
4 ANC Ted Stevens Anchorage Intl 61.2 -150. 152 -9 A Ameri…
5 ATL Hartsfield Jackson Atlanta Intl 33.6 -84.4 1026 -5 A Ameri…
6 AUS Austin Bergstrom Intl 30.2 -97.7 542 -6 A Ameri…
7 AVL Asheville Regional Airport 35.4 -82.5 2165 -5 A Ameri…
8 BDL Bradley Intl 41.9 -72.7 173 -5 A Ameri…
9 BGR Bangor Intl 44.8 -68.8 192 -5 A Ameri…
10 BHM Birmingham Intl 33.6 -86.8 644 -6 A Ameri…
# ℹ 91 more rows
Imagine you’ve found the top 10 most popular destinations using this code:
How can you find all flights to those destinations? Write code to do so. Create a new object with the result.
What percentage of flights were to the top 10 destinations?