Working with multiple data frames

Intro to Data Analytics

Basic Idea

We have multiple data frames…..

and we want to bring them together

Download the data here.

Data: Women in science

Information on 10 women in science who changed the world¹

name
Ada Lovelace
Marie Curie
Janaki Ammal
Chien-Shiung Wu
Katherine Johnson
Rosalind Franklin
Vera Rubin
Gladys West
Flossie Wong-Staal
Jennifer Doudna

professions

# A tibble: 10 × 2
   name               profession                        
   <chr>              <chr>                             
 1 Ada Lovelace       Mathematician                     
 2 Marie Curie        Physicist and Chemist             
 3 Janaki Ammal       Botanist                          
 4 Chien-Shiung Wu    Physicist                         
 5 Katherine Johnson  Mathematician                     
 6 Rosalind Franklin  Chemist                           
 7 Vera Rubin         Astronomer                        
 8 Gladys West        Mathematician                     
 9 Flossie Wong-Staal Virologist and Molecular Biologist
10 Jennifer Doudna    Biochemist

dates

# A tibble: 8 × 3
  name               birth_year death_year
  <chr>                   <dbl>      <dbl>
1 Janaki Ammal             1897       1984
2 Chien-Shiung Wu          1912       1997
3 Katherine Johnson        1918       2020
4 Rosalind Franklin        1920       1958
5 Vera Rubin               1928       2016
6 Gladys West              1930         NA
7 Flossie Wong-Staal       1947         NA
8 Jennifer Doudna          1964         NA

works

# A tibble: 9 × 2
  name               known_for                                                  
  <chr>              <chr>                                                      
1 Ada Lovelace       first computer algorithm                                   
2 Marie Curie        theory of radioactivity,  discovery of elements polonium a…
3 Janaki Ammal       hybrid species, biodiversity protection                    
4 Chien-Shiung Wu    confim and refine theory of radioactive beta decy, Wu expe…
5 Katherine Johnson  calculations of orbital mechanics critical to sending the …
6 Vera Rubin         existence of dark matter                                   
7 Gladys West        mathematical modeling of the shape of the Earth which serv…
8 Flossie Wong-Staal first scientist to clone HIV and create a map of its genes…
9 Jennifer Doudna    one of the primary developers of CRISPR, a ground-breaking…

Desired output

# A tibble: 10 × 5
   name               profession                 birth_year death_year known_for
   <chr>              <chr>                           <dbl>      <dbl> <chr>    
 1 Ada Lovelace       Mathematician                      NA         NA first co…
 2 Marie Curie        Physicist and Chemist              NA         NA theory o…
 3 Janaki Ammal       Botanist                         1897       1984 hybrid s…
 4 Chien-Shiung Wu    Physicist                        1912       1997 confim a…
 5 Katherine Johnson  Mathematician                    1918       2020 calculat…
 6 Rosalind Franklin  Chemist                          1920       1958 <NA>     
 7 Vera Rubin         Astronomer                       1928       2016 existenc…
 8 Gladys West        Mathematician                    1930         NA mathemat…
 9 Flossie Wong-Staal Virologist and Molecular …       1947         NA first sc…
10 Jennifer Doudna    Biochemist                       1964         NA one of t…

Inputs, reminder

names(professions)

[1] "name"       "profession"

names(dates)

[1] "name"       "birth_year" "death_year"

names(works)

[1] "name"      "known_for"

nrow(professions)

[1] 10

nrow(dates)

[1] 8

nrow(works)

[1] 9

Joining data frames

something_join(x, y)

Mutating Joins

left_join(): all rows from x
right_join(): all rows from y
full_join(): all rows from both x and y
inner_join(): all rows from x where there are matching values in y, return all combination of multiple matches in the case of multiple matches

Filtering Joins

semi_join(): all rows from x where there are matching values in y, keeping just columns from x
anti_join(): return all rows from x where there are not matching values in y, never duplicate rows of x
…and more

Setup

For the next few slides…

x <- tibble(
  id = c(1, 2, 3),
  value_x = c("x1", "x2", "x3")
  )

# A tibble: 3 × 2
     id value_x
  <dbl> <chr>  
1     1 x1     
2     2 x2     
3     3 x3

y <- tibble(
  id = c(1, 2, 4),
  value_y = c("y1", "y2", "y4")
  )

# A tibble: 3 × 2
     id value_y
  <dbl> <chr>  
1     1 y1     
2     2 y2     
3     4 y4

`left_join()`

left_join(x, y)

# A tibble: 3 × 3
     id value_x value_y
  <dbl> <chr>   <chr>  
1     1 x1      y1     
2     2 x2      y2     
3     3 x3      <NA>

Keep all rows from x: left_join(x, y)

`left_join()`

professions %>%
  left_join(dates, by = "name")

# A tibble: 10 × 4
   name               profession                         birth_year death_year
   <chr>              <chr>                                   <dbl>      <dbl>
 1 Ada Lovelace       Mathematician                              NA         NA
 2 Marie Curie        Physicist and Chemist                      NA         NA
 3 Janaki Ammal       Botanist                                 1897       1984
 4 Chien-Shiung Wu    Physicist                                1912       1997
 5 Katherine Johnson  Mathematician                            1918       2020
 6 Rosalind Franklin  Chemist                                  1920       1958
 7 Vera Rubin         Astronomer                               1928       2016
 8 Gladys West        Mathematician                            1930         NA
 9 Flossie Wong-Staal Virologist and Molecular Biologist       1947         NA
10 Jennifer Doudna    Biochemist                               1964         NA

Keep all rows from x: left_join(professions, dates)

Let’s try out some more reshaping and a left join

First, be specific about join variables
- Why? ID variables might not have the same spelling

professions %>%
  left_join(dates, by = c("name" = "name"))

# A tibble: 10 × 4
   name               profession                         birth_year death_year
   <chr>              <chr>                                   <dbl>      <dbl>
 1 Ada Lovelace       Mathematician                              NA         NA
 2 Marie Curie        Physicist and Chemist                      NA         NA
 3 Janaki Ammal       Botanist                                 1897       1984
 4 Chien-Shiung Wu    Physicist                                1912       1997
 5 Katherine Johnson  Mathematician                            1918       2020
 6 Rosalind Franklin  Chemist                                  1920       1958
 7 Vera Rubin         Astronomer                               1928       2016
 8 Gladys West        Mathematician                            1930         NA
 9 Flossie Wong-Staal Virologist and Molecular Biologist       1947         NA
10 Jennifer Doudna    Biochemist                               1964         NA

`right_join()`

right_join(x, y)

# A tibble: 3 × 3
     id value_x value_y
  <dbl> <chr>   <chr>  
1     1 x1      y1     
2     2 x2      y2     
3     4 <NA>    y4

Keep all rows from y: right_join(x, y)

`right_join()`

  right_join(professions, dates)

# A tibble: 8 × 4
  name               profession                         birth_year death_year
  <chr>              <chr>                                   <dbl>      <dbl>
1 Janaki Ammal       Botanist                                 1897       1984
2 Chien-Shiung Wu    Physicist                                1912       1997
3 Katherine Johnson  Mathematician                            1918       2020
4 Rosalind Franklin  Chemist                                  1920       1958
5 Vera Rubin         Astronomer                               1928       2016
6 Gladys West        Mathematician                            1930         NA
7 Flossie Wong-Staal Virologist and Molecular Biologist       1947         NA
8 Jennifer Doudna    Biochemist                               1964         NA

Keep all rows from y: right_join(professions, dates)

`full_join()`

full_join(x, y)

# A tibble: 4 × 3
     id value_x value_y
  <dbl> <chr>   <chr>  
1     1 x1      y1     
2     2 x2      y2     
3     3 x3      <NA>   
4     4 <NA>    y4

Keep all rows from both x and y

`full_join()`

full_join(dates, works)

# A tibble: 10 × 4
   name               birth_year death_year known_for                           
   <chr>                   <dbl>      <dbl> <chr>                               
 1 Janaki Ammal             1897       1984 hybrid species, biodiversity protec…
 2 Chien-Shiung Wu          1912       1997 confim and refine theory of radioac…
 3 Katherine Johnson        1918       2020 calculations of orbital mechanics c…
 4 Rosalind Franklin        1920       1958 <NA>                                
 5 Vera Rubin               1928       2016 existence of dark matter            
 6 Gladys West              1930         NA mathematical modeling of the shape …
 7 Flossie Wong-Staal       1947         NA first scientist to clone HIV and cr…
 8 Jennifer Doudna          1964         NA one of the primary developers of CR…
 9 Ada Lovelace               NA         NA first computer algorithm            
10 Marie Curie                NA         NA theory of radioactivity,  discovery…

Keep all rows from both x and y

`inner_join()`

inner_join(x, y)

# A tibble: 2 × 3
     id value_x value_y
  <dbl> <chr>   <chr>  
1     1 x1      y1     
2     2 x2      y2

Keeps all rows from x where there are matching values in y

Returns all combination of multiple matches in the case of multiple matches

`inner_join()`

inner_join(dates,works)

# A tibble: 7 × 4
  name               birth_year death_year known_for                            
  <chr>                   <dbl>      <dbl> <chr>                                
1 Janaki Ammal             1897       1984 hybrid species, biodiversity protect…
2 Chien-Shiung Wu          1912       1997 confim and refine theory of radioact…
3 Katherine Johnson        1918       2020 calculations of orbital mechanics cr…
4 Vera Rubin               1928       2016 existence of dark matter             
5 Gladys West              1930         NA mathematical modeling of the shape o…
6 Flossie Wong-Staal       1947         NA first scientist to clone HIV and cre…
7 Jennifer Doudna          1964         NA one of the primary developers of CRI…

Keeps all rows from x where there are matching values in y

Filtering Joins

`semi_join()`

semi_join(x, y)

# A tibble: 2 × 2
     id value_x
  <dbl> <chr>  
1     1 x1     
2     2 x2

Keep all rows from x where there are matching values in y, keeping just columns from x.

The rows from x returned by semi_join() and inner_join() are the same. The difference is that inner_join will add columns present in y but not present in x, but a semi_join will not add any columns from y.

`semi_join()`

semi_join(dates, works)

# A tibble: 7 × 3
  name               birth_year death_year
  <chr>                   <dbl>      <dbl>
1 Janaki Ammal             1897       1984
2 Chien-Shiung Wu          1912       1997
3 Katherine Johnson        1918       2020
4 Vera Rubin               1928       2016
5 Gladys West              1930         NA
6 Flossie Wong-Staal       1947         NA
7 Jennifer Doudna          1964         NA

Keep all rows from x where there are matching values in y, keeping just columns from x

`anti_join()`

anti_join(x, y)

# A tibble: 1 × 2
     id value_x
  <dbl> <chr>  
1     3 x3

Return all rows from x where there are not matching values in y, never duplicate rows of x

`anti_join()`

anti_join(dates, works)

# A tibble: 1 × 3
  name              birth_year death_year
  <chr>                  <dbl>      <dbl>
1 Rosalind Franklin       1920       1958

Return all rows from x where there are not matching values in y, never duplicate rows of x

Putting it altogether

professions %>%
  left_join(dates) %>%
  left_join(works)

# A tibble: 10 × 5
   name               profession                 birth_year death_year known_for
   <chr>              <chr>                           <dbl>      <dbl> <chr>    
 1 Ada Lovelace       Mathematician                      NA         NA first co…
 2 Marie Curie        Physicist and Chemist              NA         NA theory o…
 3 Janaki Ammal       Botanist                         1897       1984 hybrid s…
 4 Chien-Shiung Wu    Physicist                        1912       1997 confim a…
 5 Katherine Johnson  Mathematician                    1918       2020 calculat…
 6 Rosalind Franklin  Chemist                          1920       1958 <NA>     
 7 Vera Rubin         Astronomer                       1928       2016 existenc…
 8 Gladys West        Mathematician                    1930         NA mathemat…
 9 Flossie Wong-Staal Virologist and Molecular …       1947         NA first sc…
10 Jennifer Doudna    Biochemist                       1964         NA one of t…

Case study: Student records

Student records

Have:
- Enrollment: official university enrollment records
- Survey: Student provided info,
  - missing students who never filled it out
  - includes students who filled it out but dropped the class
Want: Survey info for all enrolled in class, and only those students.

enrollment

# A tibble: 3 × 2
     id name        
  <dbl> <chr>       
1     1 Grace Hopper
2     2 Pink!       
3     3 Fei-Fei Li

survey

# A tibble: 4 × 3
     id name     username        
  <dbl> <chr>    <chr>           
1     2 Pink!    pinkmusic       
2     3 Fei-Fei  feifei          
3     4 Rosalind queenofdna      
4     5 Jean     programminglangs

What type of join should we set up?

Student records

In class
Survey missing
Dropped

enrollment %>% 
  left_join(survey, by = "id")

# A tibble: 3 × 4
     id name.x       name.y  username 
  <dbl> <chr>        <chr>   <chr>    
1     1 Grace Hopper <NA>    <NA>     
2     2 Pink!        Pink!   pinkmusic
3     3 Fei-Fei Li   Fei-Fei feifei

enrollment %>% 
  anti_join(survey, by = "id")

# A tibble: 1 × 2
     id name        
  <dbl> <chr>       
1     1 Grace Hopper

survey %>% 
  anti_join(enrollment, by = "id")

# A tibble: 2 × 3
     id name     username        
  <dbl> <chr>    <chr>           
1     4 Rosalind queenofdna      
2     5 Jean     programminglangs

Case study: Grocery sales

Grocery sales

Have:
- Purchases: One row per customer per item, listing purchases they made
- Prices: One row per item in the store, listing their prices
Want: Total revenue

purchases

# A tibble: 5 × 2
  customer_id item        
        <dbl> <chr>       
1           1 bread       
2           1 milk        
3           1 banana      
4           2 milk        
5           2 toilet paper

prices

# A tibble: 5 × 2
  item         price
  <chr>        <dbl>
1 avocado       0.5 
2 banana        0.15
3 bread         1   
4 milk          0.8 
5 toilet paper  3

What type of join should we use?

Grocery sales

Total revenue
Revenue per customer

purchases %>% 
  left_join(prices)

# A tibble: 5 × 3
  customer_id item         price
        <dbl> <chr>        <dbl>
1           1 bread         1   
2           1 milk          0.8 
3           1 banana        0.15
4           2 milk          0.8 
5           2 toilet paper  3

purchases %>% 
  left_join(prices) %>%
  summarize(total_revenue = sum(price))

# A tibble: 1 × 1
  total_revenue
          <dbl>
1          5.75

purchases %>% 
  left_join(prices)

# A tibble: 5 × 3
  customer_id item         price
        <dbl> <chr>        <dbl>
1           1 bread         1   
2           1 milk          0.8 
3           1 banana        0.15
4           2 milk          0.8 
5           2 toilet paper  3

purchases %>% 
  left_join(prices) %>%
  group_by(customer_id) %>% 
  summarize(total_revenue = sum(price))

# A tibble: 2 × 2
  customer_id total_revenue
        <dbl>         <dbl>
1           1          1.95
2           2          3.8

Working with multiple data frames

Basic Idea

and we want to bring them together

Data: Women in science

Inputs

Desired output

Inputs, reminder

Joining data frames

Joining data frames

Mutating Joins

Filtering Joins

Setup

left_join()

left_join()

Let’s try out some more reshaping and a left join

right_join()

right_join()

full_join()

full_join()

inner_join()

inner_join()

Filtering Joins

semi_join()

semi_join()

anti_join()

anti_join()

Putting it altogether

Case study: Student records

Student records

Student records

Case study: Grocery sales

Grocery sales

Grocery sales

`left_join()`

`left_join()`

`right_join()`

`right_join()`

`full_join()`

`full_join()`

`inner_join()`

`inner_join()`

`semi_join()`

`semi_join()`

`anti_join()`

`anti_join()`