Tidyverse activity with U.S. cities and… cats!

Silly? Yes. But let’s practice the tidyverse.

Tidyverse activity 2

  • US cities

  • Cats

  • Work with a numeric vector

  • Calculate course grades

Explore U.S. Cities with the tidyverse

Work with your small group to explore the data and answer questions using tidyverse functions.

library(tidyverse) # read_csv() comes from readr, which the tidyverse loads
us_metro_areas <- read_csv("data/us_metro_areas.csv")

Note: this dataset is the same as the .csv from our U.S. cities activity in week 2, but here is a link to download the dataset.

What questions can we ask of these data?

Start with…

  • Create a new variable for population density.

  • Calculate the mean and median of one continuous variable in the dataset.

  • Calculate the mean and median of one discrete variable in the dataset.

  • What are the top 10 metro areas by population? Write and run code to display the top 10 cities by total population.

  • Calculate the average and median ‘percent area covered by parks’ for cities with above-median asthma rates. Then, calculate the average and median ‘percent area covered by parks’ for cities with below-median asthma rates.
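
One way to approach the starter tasks above is sketched here on a tiny made-up tibble. Every column name and value below (metro_area, population, land_area, pct_parks, asthma_rate) is a hypothetical stand-in; check the real names in us_metro_areas with glimpse() or names() and substitute them.

```r
library(dplyr)

# Toy stand-in for us_metro_areas; all names and values are hypothetical
metro <- tibble(
  metro_area  = c("A", "B", "C", "D"),
  population  = c(1000, 4000, 2500, 800),
  land_area   = c(10, 20, 50, 4),
  pct_parks   = c(12, 8, 15, 5),
  asthma_rate = c(9.1, 10.4, 8.2, 11.0)
)

# New variable: population density (people per unit of land area)
metro <- metro |>
  mutate(pop_density = population / land_area)

# Mean and median of one variable
pop_summary <- metro |>
  summarize(mean_pop = mean(population), median_pop = median(population))

# Largest metro areas by population (use n = 10 on the full dataset)
top_metros <- metro |>
  slice_max(population, n = 2)

# Park coverage for metros above vs. at or below the median asthma rate
parks_by_asthma <- metro |>
  mutate(asthma_group = if_else(asthma_rate > median(asthma_rate),
                                "above median", "at or below median")) |>
  group_by(asthma_group) |>
  summarize(mean_parks = mean(pct_parks), median_parks = median(pct_parks))

parks_by_asthma
```

The grouping step creates the above/below label with mutate() first, so that a single group_by() and summarize() call produces both rows at once.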

What other questions could we ask? We’ll come up with some questions and work in groups to answer them.

Yes, there are data about… Cats!

Load packages

library(tidyverse)

Load data (link to download cats data)

cat_lovers <- read_csv("data/cat-lovers.csv")

What is in the data frame?

Summarize the data

dplyr summarize()

cat_lovers |>
  summarize(avg_cats = mean(number_of_cats)) #Annotate
# A tibble: 1 × 1
  avg_cats
     <dbl>
1       NA
  • Why doesn’t this work? (Hint: explore the data frame)
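
A quick way to follow the hint is to check the column's type. The sketch below uses a made-up tibble (toy_cats and its values are hypothetical) whose number_of_cats column is character, like the <chr> column in cat_lovers.

```r
library(dplyr)

# Toy stand-in for cat_lovers; names and values are made up
toy_cats <- tibble(
  name           = c("Person A", "Person B", "Person C"),
  number_of_cats = c("0", "2", "three")
)

glimpse(toy_cats)               # number_of_cats is listed as <chr>
class(toy_cats$number_of_cats)  # "character"

# mean() of a character vector returns NA (with a warning)
avg <- suppressWarnings(mean(toy_cats$number_of_cats))
avg
```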

Summarize

cat_lovers |> #Annotate
  summarize(avg_cats = mean(as.numeric(number_of_cats))) 
  • Why doesn’t this work? (Hint: explore the data frame)
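
To see why converting to numeric is not enough on its own, here is a minimal base R sketch with hypothetical values: as.numeric() turns any entry it cannot parse into NA, and mean() returns NA as soon as one value is missing.

```r
cats_chr <- c("0", "2", "three")  # hypothetical values; "three" is not a number

# Entries that cannot be parsed become NA (R also issues a warning)
cats_num <- suppressWarnings(as.numeric(cats_chr))
cats_num

# A single NA makes the mean NA
avg <- mean(cats_num)
avg
```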

Summarize

cat_lovers |> #Annotate
  summarize(avg_cats = mean(as.numeric(number_of_cats), na.rm = TRUE)) 
# A tibble: 1 × 1
  avg_cats
     <dbl>
1    0.776
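
Because na.rm = TRUE drops missing values silently, it can help to count how many were dropped. A small sketch with hypothetical values:

```r
cats_num <- c(0, 2, NA, 1)  # hypothetical values after as.numeric()

avg_cats  <- mean(cats_num, na.rm = TRUE)  # averages only 0, 2, and 1
n_missing <- sum(is.na(cats_num))          # how many values were ignored

avg_cats
n_missing
```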

Group by

dplyr group_by()

cat_lovers %>% 
  group_by(handedness)
# A tibble: 60 × 3
# Groups:   handedness [3]
   name           number_of_cats handedness
   <chr>          <chr>          <chr>     
 1 Bernice Warren 0              left      
 2 Woodrow Stone  0              left      
 3 Willie Bass    1              left      
 4 Tyrone Estrada 3              left      
 5 Alex Daniels   3              left      
 6 Jane Bates     2              left      
 7 Latoya Simpson 1              left      
 8 Darin Woods    1              left      
 9 Agnes Cobb     0              left      
10 Tabitha Grant  0              left      
# ℹ 50 more rows
  • What is the class of the result of running group_by()?
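
One way to check is to call class() on the grouped result. The sketch below uses a small made-up tibble (toy is hypothetical) in place of cat_lovers.

```r
library(dplyr)

# Hypothetical stand-in for cat_lovers
toy <- tibble(
  handedness     = c("left", "right", "left"),
  number_of_cats = c(0, 1, 2)
)

grouped <- toy |>
  group_by(handedness)

class(grouped)          # class vector of the grouped result
is_grouped_df(grouped)  # TRUE when grouping is present
group_vars(grouped)     # names of the grouping variables
```

The data itself is unchanged by group_by(); only the grouping metadata is added, which later verbs such as summarize() then use.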

Group by

  • Stratifying data before computing summary statistics
cat_lovers %>% 
  group_by(handedness)
# A tibble: 60 × 3
# Groups:   handedness [3]
   name           number_of_cats handedness
   <chr>          <chr>          <chr>     
 1 Bernice Warren 0              left      
 2 Woodrow Stone  0              left      
 3 Willie Bass    1              left      
 4 Tyrone Estrada 3              left      
 5 Alex Daniels   3              left      
 6 Jane Bates     2              left      
 7 Latoya Simpson 1              left      
 8 Darin Woods    1              left      
 9 Agnes Cobb     0              left      
10 Tabitha Grant  0              left      
# ℹ 50 more rows

What’s a tibble?

Tibbles (tbl)

  • Read about them in IDS Intro Chapter 4 section 4.6 Tibbles
  • The tbl, pronounced “tibble”, is a special kind of data frame.
  • group_by() returns this type of data frame, and summarize() does too when its input is a tibble or grouped data.
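
A minimal sketch of how a tibble differs from a base data frame, using a made-up two-column example (df and tb are hypothetical names):

```r
library(tibble)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- as_tibble(df)

class(df)      # "data.frame"
class(tb)      # "tbl_df" "tbl" "data.frame"
is_tibble(df)  # FALSE
is_tibble(tb)  # TRUE

# Selecting one column with [ : a data frame drops to a vector, a tibble stays a tibble
df_col <- df[, "x"]
tb_col <- tb[, "x"]
class(df_col)
class(tb_col)
```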

Group by

Do people with different “handedness” own more or fewer cats?

  • Group by the “handedness” of cat owners and calculate the average number of cats per owner
cat_lovers %>%  #Annotate
  mutate(number_of_cats = as.numeric(number_of_cats)) %>% 
  group_by(handedness) %>% 
  summarize(mean_cats = mean(number_of_cats, na.rm = TRUE))
# A tibble: 3 × 2
  handedness   mean_cats
  <chr>            <dbl>
1 ambidextrous     0.8  
2 left             0.923
3 right            0.725
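
Group means are easier to interpret next to the group sizes. The sketch below adds n() and a missing-value count to the same pattern, using a made-up tibble (toy and its values are hypothetical).

```r
library(dplyr)

# Hypothetical stand-in for the cleaned cat_lovers data
toy <- tibble(
  handedness     = c("left", "left", "right", "right", "right", "ambidextrous"),
  number_of_cats = c(1, NA, 0, 2, 1, 3)
)

by_hand <- toy |>
  group_by(handedness) |>
  summarize(
    n_people  = n(),                               # rows in each group
    n_missing = sum(is.na(number_of_cats)),        # values dropped by na.rm
    mean_cats = mean(number_of_cats, na.rm = TRUE)
  )

by_hand
```

A mean based on only a handful of people, as for a small group here, deserves more caution than one based on dozens.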

If/then in dplyr to calculate grades

case_when(condition ~ output_value)

  • condition is a logical test; for each element, the first condition that evaluates to TRUE determines the output (the “if”)

  • output_value is the value to output if the condition is TRUE (the “then”)
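
A small sketch of what happens when no condition matches, using a made-up scores vector: elements that match nothing come back as NA, which is why a catch-all such as TRUE ~ ... is usually added as the last line.

```r
library(dplyr)

scores <- c(95, 72, 40)  # hypothetical values

# Only one condition: anything that does not match becomes NA
no_default <- case_when(scores >= 90 ~ "A")
no_default

# TRUE matches everything left over, so it acts as the "else"
with_default <- case_when(
  scores >= 90 ~ "A",
  TRUE         ~ "not A"
)
with_default
```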

df <- data.frame(
  student = c("Natascha", "Alex", "Arun", "Arturo", "Ashley", "Oscar", "James", "Elliot"), 
  score = c(92, 78, 85, 86, 93, 67, 56, 73))

df %>% 
  mutate(grade = case_when(
    score >= 90 ~ 'A',
    score >= 80 ~ 'B',
    score >= 70 ~ 'C',
    score >= 60 ~ 'D',
    TRUE ~ 'F'))
   student score grade
1 Natascha    92     A
2     Alex    78     C
3     Arun    85     B
4   Arturo    86     B
5   Ashley    93     A
6    Oscar    67     D
7    James    56     F
8   Elliot    73     C
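
The grade cutoffs above work because case_when() checks conditions from top to bottom and stops at the first match. A minimal sketch with a made-up score vector shows what goes wrong if a broader condition is listed first:

```r
library(dplyr)

score <- c(92, 78, 56)  # hypothetical scores

# Most restrictive condition first: 92 is caught by the "A" rule
right_order <- case_when(
  score >= 90 ~ "A",
  score >= 70 ~ "C",
  TRUE        ~ "F"
)

# Broader condition first: 92 already matches score >= 70, so "A" is never reached
wrong_order <- case_when(
  score >= 70 ~ "C",
  score >= 90 ~ "A",
  TRUE        ~ "F"
)

right_order
wrong_order
```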

Work with a numeric vector

%% operator

https://www.datacamp.com/doc/r/operators

  • The %% operator returns the modulus (remainder) of a division operation
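
A few base R lines illustrate the remainder operator alongside its companion, integer division (%/%):

```r
r1 <- 7 %% 2    # remainder after dividing 7 by 2
q1 <- 7 %/% 2   # integer quotient of 7 divided by 2
r2 <- 10 %% 5   # 10 divides evenly by 5, so the remainder is 0

# Quotient times divisor plus remainder gives back the original number
rebuilt <- q1 * 2 + r1

c(r1, q1, r2, rebuilt)
```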
x <- 1:50 #Create a vector
y <- case_when(
  (x %% 2) == 0 ~ "even",
  TRUE ~ "odd"
)
y
 [1] "odd"  "even" "odd"  "even" "odd"  "even" "odd"  "even" "odd"  "even"
[11] "odd"  "even" "odd"  "even" "odd"  "even" "odd"  "even" "odd"  "even"
[21] "odd"  "even" "odd"  "even" "odd"  "even" "odd"  "even" "odd"  "even"
[31] "odd"  "even" "odd"  "even" "odd"  "even" "odd"  "even" "odd"  "even"
[41] "odd"  "even" "odd"  "even" "odd"  "even" "odd"  "even" "odd"  "even"
x %% 2
 [1] 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
[39] 1 0 1 0 1 0 1 0 1 0 1 0
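
With only two outcomes, the same labels can also be produced with dplyr's if_else(); case_when() becomes more useful once there are three or more categories. A short sketch comparing the two:

```r
library(dplyr)

x <- 1:50

y_case <- case_when(
  x %% 2 == 0 ~ "even",
  TRUE        ~ "odd"
)

# if_else(condition, value_if_true, value_if_false)
y_if <- if_else(x %% 2 == 0, "even", "odd")

identical(y_case, y_if)
```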

Categorize the vector x by size: small (less than 25), middle (equal to 25), and big (greater than 25):

z <- case_when(
  x < 25 ~ "small",
  x == 25 ~ "middle",
  TRUE ~ "big")
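
A quick sanity check on a recode like this is to tabulate the result. The sketch below rebuilds x and z so it runs on its own, then counts each category with table():

```r
library(dplyr)

x <- 1:50
z <- case_when(
  x < 25  ~ "small",
  x == 25 ~ "middle",
  TRUE    ~ "big"
)

# Counts per category (table() orders the names alphabetically)
z_counts <- table(z)
z_counts
```

The counts should add up to length(x); if they do not, some values fell through to NA.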

Use case_when to recode

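cat_lovers <- cat_lovers %>%
  mutate(
    # Overwrite the recorded value for two respondents; keep everyone else's value
    number_of_cats = case_when(
      name == "Ginger Clark" ~ "2",
      name == "Doug Bass"    ~ "3",
      TRUE                   ~ number_of_cats
    ),
    # Convert the column from character to numeric
    number_of_cats = as.numeric(number_of_cats)
  )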
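
The same recode pattern, sketched on a made-up tibble so it can be run without the data file (toy_cats, the names, and the values are all hypothetical). The replacement is written as the string "3" because the column is still character at that point, and all outputs of case_when() need to be the same type.

```r
library(dplyr)

# Hypothetical stand-in for cat_lovers
toy_cats <- tibble(
  name           = c("Person A", "Person B", "Person C"),
  number_of_cats = c("1", "three", "0")
)

toy_clean <- toy_cats |>
  mutate(
    # Fix the one text entry; TRUE ~ number_of_cats leaves the rest untouched
    number_of_cats = case_when(
      name == "Person B" ~ "3",
      TRUE               ~ number_of_cats
    ),
    # Every entry is now a numeric string, so the conversion produces no NAs
    number_of_cats = as.numeric(number_of_cats)
  )

toy_clean
```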

case_when() function

case_when(
    If Logical test 1 ~ new value,
    If Logical test 2 ~ new value,
    ....more tests and values...
    TRUE ~ default value
)
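
A runnable instance of the template, using a made-up vector. It also shows the .default argument as an alternative spelling of the final TRUE ~ line; this assumes dplyr 1.1.0 or later, where .default was introduced.

```r
library(dplyr)

x <- c(3, 25, 40)  # hypothetical values

# Final TRUE ~ line as the catch-all
with_true <- case_when(
  x < 25  ~ "small",
  x == 25 ~ "middle",
  TRUE    ~ "big"
)

# Same logic with .default (assumes dplyr >= 1.1.0)
with_default <- case_when(
  x < 25  ~ "small",
  x == 25 ~ "middle",
  .default = "big"
)

identical(with_true, with_default)
```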