Working with multiple data frames

Intro to Data Analytics

Basic Idea

  • We have multiple data frames…..
  • and we want to bring them together

Download the data here.

Data: Women in science

Information on 10 women in science who changed the world1

name
Ada Lovelace
Marie Curie
Janaki Ammal
Chien-Shiung Wu
Katherine Johnson
Rosalind Franklin
Vera Rubin
Gladys West
Flossie Wong-Staal
Jennifer Doudna

Inputs

professions
# A tibble: 10 × 2
   name               profession                        
   <chr>              <chr>                             
 1 Ada Lovelace       Mathematician                     
 2 Marie Curie        Physicist and Chemist             
 3 Janaki Ammal       Botanist                          
 4 Chien-Shiung Wu    Physicist                         
 5 Katherine Johnson  Mathematician                     
 6 Rosalind Franklin  Chemist                           
 7 Vera Rubin         Astronomer                        
 8 Gladys West        Mathematician                     
 9 Flossie Wong-Staal Virologist and Molecular Biologist
10 Jennifer Doudna    Biochemist                        
dates
# A tibble: 8 × 3
  name               birth_year death_year
  <chr>                   <dbl>      <dbl>
1 Janaki Ammal             1897       1984
2 Chien-Shiung Wu          1912       1997
3 Katherine Johnson        1918       2020
4 Rosalind Franklin        1920       1958
5 Vera Rubin               1928       2016
6 Gladys West              1930         NA
7 Flossie Wong-Staal       1947         NA
8 Jennifer Doudna          1964         NA
works
# A tibble: 9 × 2
  name               known_for                                                  
  <chr>              <chr>                                                      
1 Ada Lovelace       first computer algorithm                                   
2 Marie Curie        theory of radioactivity,  discovery of elements polonium a…
3 Janaki Ammal       hybrid species, biodiversity protection                    
4 Chien-Shiung Wu    confim and refine theory of radioactive beta decy, Wu expe…
5 Katherine Johnson  calculations of orbital mechanics critical to sending the …
6 Vera Rubin         existence of dark matter                                   
7 Gladys West        mathematical modeling of the shape of the Earth which serv…
8 Flossie Wong-Staal first scientist to clone HIV and create a map of its genes…
9 Jennifer Doudna    one of the primary developers of CRISPR, a ground-breaking…

Desired output

# A tibble: 10 × 5
   name               profession                 birth_year death_year known_for
   <chr>              <chr>                           <dbl>      <dbl> <chr>    
 1 Ada Lovelace       Mathematician                      NA         NA first co…
 2 Marie Curie        Physicist and Chemist              NA         NA theory o…
 3 Janaki Ammal       Botanist                         1897       1984 hybrid s…
 4 Chien-Shiung Wu    Physicist                        1912       1997 confim a…
 5 Katherine Johnson  Mathematician                    1918       2020 calculat…
 6 Rosalind Franklin  Chemist                          1920       1958 <NA>     
 7 Vera Rubin         Astronomer                       1928       2016 existenc…
 8 Gladys West        Mathematician                    1930         NA mathemat…
 9 Flossie Wong-Staal Virologist and Molecular …       1947         NA first sc…
10 Jennifer Doudna    Biochemist                       1964         NA one of t…

Inputs, reminder

names(professions)
[1] "name"       "profession"
names(dates)
[1] "name"       "birth_year" "death_year"
names(works)
[1] "name"      "known_for"
nrow(professions)
[1] 10
nrow(dates)
[1] 8
nrow(works)
[1] 9

Joining data frames

Joining data frames

something_join(x, y)


Mutating Joins

  • left_join(): all rows from x
  • right_join(): all rows from y
  • full_join(): all rows from both x and y
  • inner_join(): all rows from x where there are matching values in y, return all combination of multiple matches in the case of multiple matches

Filtering Joins

  • semi_join(): all rows from x where there are matching values in y, keeping just columns from x
  • anti_join(): return all rows from x where there are not matching values in y, never duplicate rows of x
  • …and more

Setup

For the next few slides…

x <- tibble(
  id = c(1, 2, 3),
  value_x = c("x1", "x2", "x3")
  )
x
# A tibble: 3 × 2
     id value_x
  <dbl> <chr>  
1     1 x1     
2     2 x2     
3     3 x3     
y <- tibble(
  id = c(1, 2, 4),
  value_y = c("y1", "y2", "y4")
  )
y
# A tibble: 3 × 2
     id value_y
  <dbl> <chr>  
1     1 y1     
2     2 y2     
3     4 y4     

left_join()

left_join(x, y)
# A tibble: 3 × 3
     id value_x value_y
  <dbl> <chr>   <chr>  
1     1 x1      y1     
2     2 x2      y2     
3     3 x3      <NA>   

Keep all rows from x: left_join(x, y)

left_join()

professions %>%
  left_join(dates, by = "name") 
# A tibble: 10 × 4
   name               profession                         birth_year death_year
   <chr>              <chr>                                   <dbl>      <dbl>
 1 Ada Lovelace       Mathematician                              NA         NA
 2 Marie Curie        Physicist and Chemist                      NA         NA
 3 Janaki Ammal       Botanist                                 1897       1984
 4 Chien-Shiung Wu    Physicist                                1912       1997
 5 Katherine Johnson  Mathematician                            1918       2020
 6 Rosalind Franklin  Chemist                                  1920       1958
 7 Vera Rubin         Astronomer                               1928       2016
 8 Gladys West        Mathematician                            1930         NA
 9 Flossie Wong-Staal Virologist and Molecular Biologist       1947         NA
10 Jennifer Doudna    Biochemist                               1964         NA

Keep all rows from x: left_join(professions, dates)

Let’s try out some more reshaping and a left join

  • First, be specific about join variables

    • Why? ID variables might not have the same spelling
professions %>%
  left_join(dates, by = c("name" = "name")) 
# A tibble: 10 × 4
   name               profession                         birth_year death_year
   <chr>              <chr>                                   <dbl>      <dbl>
 1 Ada Lovelace       Mathematician                              NA         NA
 2 Marie Curie        Physicist and Chemist                      NA         NA
 3 Janaki Ammal       Botanist                                 1897       1984
 4 Chien-Shiung Wu    Physicist                                1912       1997
 5 Katherine Johnson  Mathematician                            1918       2020
 6 Rosalind Franklin  Chemist                                  1920       1958
 7 Vera Rubin         Astronomer                               1928       2016
 8 Gladys West        Mathematician                            1930         NA
 9 Flossie Wong-Staal Virologist and Molecular Biologist       1947         NA
10 Jennifer Doudna    Biochemist                               1964         NA

right_join()

right_join(x, y)
# A tibble: 3 × 3
     id value_x value_y
  <dbl> <chr>   <chr>  
1     1 x1      y1     
2     2 x2      y2     
3     4 <NA>    y4     

Keep all rows from y: right_join(x, y)

right_join()

  right_join(professions, dates) 
# A tibble: 8 × 4
  name               profession                         birth_year death_year
  <chr>              <chr>                                   <dbl>      <dbl>
1 Janaki Ammal       Botanist                                 1897       1984
2 Chien-Shiung Wu    Physicist                                1912       1997
3 Katherine Johnson  Mathematician                            1918       2020
4 Rosalind Franklin  Chemist                                  1920       1958
5 Vera Rubin         Astronomer                               1928       2016
6 Gladys West        Mathematician                            1930         NA
7 Flossie Wong-Staal Virologist and Molecular Biologist       1947         NA
8 Jennifer Doudna    Biochemist                               1964         NA

Keep all rows from y: right_join(professions, dates)

full_join()

full_join(x, y)
# A tibble: 4 × 3
     id value_x value_y
  <dbl> <chr>   <chr>  
1     1 x1      y1     
2     2 x2      y2     
3     3 x3      <NA>   
4     4 <NA>    y4     

Keep all rows from both x and y

full_join()

full_join(dates, works) 
# A tibble: 10 × 4
   name               birth_year death_year known_for                           
   <chr>                   <dbl>      <dbl> <chr>                               
 1 Janaki Ammal             1897       1984 hybrid species, biodiversity protec…
 2 Chien-Shiung Wu          1912       1997 confim and refine theory of radioac…
 3 Katherine Johnson        1918       2020 calculations of orbital mechanics c…
 4 Rosalind Franklin        1920       1958 <NA>                                
 5 Vera Rubin               1928       2016 existence of dark matter            
 6 Gladys West              1930         NA mathematical modeling of the shape …
 7 Flossie Wong-Staal       1947         NA first scientist to clone HIV and cr…
 8 Jennifer Doudna          1964         NA one of the primary developers of CR…
 9 Ada Lovelace               NA         NA first computer algorithm            
10 Marie Curie                NA         NA theory of radioactivity,  discovery…

Keep all rows from both x and y

inner_join()

inner_join(x, y)
# A tibble: 2 × 3
     id value_x value_y
  <dbl> <chr>   <chr>  
1     1 x1      y1     
2     2 x2      y2     

Keeps all rows from x where there are matching values in y

Returns all combination of multiple matches in the case of multiple matches

inner_join()

inner_join(dates,works) 
# A tibble: 7 × 4
  name               birth_year death_year known_for                            
  <chr>                   <dbl>      <dbl> <chr>                                
1 Janaki Ammal             1897       1984 hybrid species, biodiversity protect…
2 Chien-Shiung Wu          1912       1997 confim and refine theory of radioact…
3 Katherine Johnson        1918       2020 calculations of orbital mechanics cr…
4 Vera Rubin               1928       2016 existence of dark matter             
5 Gladys West              1930         NA mathematical modeling of the shape o…
6 Flossie Wong-Staal       1947         NA first scientist to clone HIV and cre…
7 Jennifer Doudna          1964         NA one of the primary developers of CRI…

Keeps all rows from x where there are matching values in y

Filtering Joins

semi_join()

semi_join(x, y)
# A tibble: 2 × 2
     id value_x
  <dbl> <chr>  
1     1 x1     
2     2 x2     

Keep all rows from x where there are matching values in y, keeping just columns from x.

The rows from x returned by semi_join() and inner_join() are the same. The difference is that inner_join will add columns present in y but not present in x, but a semi_join will not add any columns from y.

semi_join()

semi_join(dates, works) 
# A tibble: 7 × 3
  name               birth_year death_year
  <chr>                   <dbl>      <dbl>
1 Janaki Ammal             1897       1984
2 Chien-Shiung Wu          1912       1997
3 Katherine Johnson        1918       2020
4 Vera Rubin               1928       2016
5 Gladys West              1930         NA
6 Flossie Wong-Staal       1947         NA
7 Jennifer Doudna          1964         NA

Keep all rows from x where there are matching values in y, keeping just columns from x

anti_join()

anti_join(x, y)
# A tibble: 1 × 2
     id value_x
  <dbl> <chr>  
1     3 x3     

Return all rows from x where there are not matching values in y, never duplicate rows of x

anti_join()

anti_join(dates, works) 
# A tibble: 1 × 3
  name              birth_year death_year
  <chr>                  <dbl>      <dbl>
1 Rosalind Franklin       1920       1958

Return all rows from x where there are not matching values in y, never duplicate rows of x

Putting it altogether

professions %>%
  left_join(dates) %>%
  left_join(works)
# A tibble: 10 × 5
   name               profession                 birth_year death_year known_for
   <chr>              <chr>                           <dbl>      <dbl> <chr>    
 1 Ada Lovelace       Mathematician                      NA         NA first co…
 2 Marie Curie        Physicist and Chemist              NA         NA theory o…
 3 Janaki Ammal       Botanist                         1897       1984 hybrid s…
 4 Chien-Shiung Wu    Physicist                        1912       1997 confim a…
 5 Katherine Johnson  Mathematician                    1918       2020 calculat…
 6 Rosalind Franklin  Chemist                          1920       1958 <NA>     
 7 Vera Rubin         Astronomer                       1928       2016 existenc…
 8 Gladys West        Mathematician                    1930         NA mathemat…
 9 Flossie Wong-Staal Virologist and Molecular …       1947         NA first sc…
10 Jennifer Doudna    Biochemist                       1964         NA one of t…

Case study: Student records

Student records

  • Have:
    • Enrollment: official university enrollment records
    • Survey: Student provided info,
      • missing students who never filled it out
      • includes students who filled it out but dropped the class
  • Want: Survey info for all enrolled in class, and only those students.
enrollment
# A tibble: 3 × 2
     id name        
  <dbl> <chr>       
1     1 Grace Hopper
2     2 Pink!       
3     3 Fei-Fei Li  
survey
# A tibble: 4 × 3
     id name     username        
  <dbl> <chr>    <chr>           
1     2 Pink!    pinkmusic       
2     3 Fei-Fei  feifei          
3     4 Rosalind queenofdna      
4     5 Jean     programminglangs

What type of join should we set up?

Student records

enrollment %>% 
  left_join(survey, by = "id") 
# A tibble: 3 × 4
     id name.x       name.y  username 
  <dbl> <chr>        <chr>   <chr>    
1     1 Grace Hopper <NA>    <NA>     
2     2 Pink!        Pink!   pinkmusic
3     3 Fei-Fei Li   Fei-Fei feifei   
enrollment %>% 
  anti_join(survey, by = "id") 
# A tibble: 1 × 2
     id name        
  <dbl> <chr>       
1     1 Grace Hopper
survey %>% 
  anti_join(enrollment, by = "id") 
# A tibble: 2 × 3
     id name     username        
  <dbl> <chr>    <chr>           
1     4 Rosalind queenofdna      
2     5 Jean     programminglangs

Case study: Grocery sales

Grocery sales

  • Have:
    • Purchases: One row per customer per item, listing purchases they made
    • Prices: One row per item in the store, listing their prices
  • Want: Total revenue
purchases
# A tibble: 5 × 2
  customer_id item        
        <dbl> <chr>       
1           1 bread       
2           1 milk        
3           1 banana      
4           2 milk        
5           2 toilet paper
prices
# A tibble: 5 × 2
  item         price
  <chr>        <dbl>
1 avocado       0.5 
2 banana        0.15
3 bread         1   
4 milk          0.8 
5 toilet paper  3   


What type of join should we use?

Grocery sales

purchases %>% 
  left_join(prices) 
# A tibble: 5 × 3
  customer_id item         price
        <dbl> <chr>        <dbl>
1           1 bread         1   
2           1 milk          0.8 
3           1 banana        0.15
4           2 milk          0.8 
5           2 toilet paper  3   
purchases %>% 
  left_join(prices) %>%
  summarize(total_revenue = sum(price)) 
# A tibble: 1 × 1
  total_revenue
          <dbl>
1          5.75
purchases %>% 
  left_join(prices) 
# A tibble: 5 × 3
  customer_id item         price
        <dbl> <chr>        <dbl>
1           1 bread         1   
2           1 milk          0.8 
3           1 banana        0.15
4           2 milk          0.8 
5           2 toilet paper  3   
purchases %>% 
  left_join(prices) %>%
  group_by(customer_id) %>% 
  summarize(total_revenue = sum(price))
# A tibble: 2 × 2
  customer_id total_revenue
        <dbl>         <dbl>
1           1          1.95
2           2          3.8