Tidyverse activity with U.S. cities and…cats!

Silly? Yes. But let’s practice the tidyverse.

Tidyverse activity 2

US cities
Cats
Work with a numeric vector
Calculate course grades

Explore U.S. Cities with the tidyverse

Work with your small group to explore the data and answer questions using tidyverse functions.

Load the us_metro_areas.csv file into RStudio.
Use functions to explore the structure of the dataset.
Review the data dictionary (download here)

library(tidyverse)  # provides read_csv()
us_metro_areas <- read_csv("data/us_metro_areas.csv")

Note: this dataset is the same as the .csv from our U.S. cities activity in week 2, but here is a link to download the dataset.

What questions can we ask of these data?

Start with…

Create a new variable for population density.
Calculate the mean and median of one continuous variable in the dataset.
Calculate the mean and median of one discrete variable in the dataset.
What are the top 10 metro areas by population? Write and run code to display the top 10 cities by total population.
Calculate the average and median ‘percent area covered by parks’ for cities with above-median asthma rates. Then, calculate the average and median ‘percent area covered by parks’ for cities with below-median asthma rates.
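
As a starting point for the density, top-10, and mean/median tasks, here is a minimal sketch on a made-up tibble. The column names `metro_area`, `population`, and `land_area` are assumptions for illustration only, and the values are invented; check the data dictionary for the real names in us_metro_areas.

```r
library(dplyr)

# Toy data: hypothetical column names and made-up values
toy_metros <- tibble(
  metro_area = c("A", "B", "C", "D"),
  population = c(500000, 2000000, 750000, 1200000),
  land_area  = c(250, 400, 500, 300)  # assumed to be in square miles
)

# New variable for population density, then sort largest population first
toy_density <- toy_metros |>
  mutate(pop_density = population / land_area) |>
  arrange(desc(population))

# Top 2 rows here; n = 10 would give the top 10 on the full dataset
toy_density |>
  slice_head(n = 2)

# Mean and median of one variable
toy_summary <- toy_metros |>
  summarize(mean_pop = mean(population), median_pop = median(population))
toy_summary
```

The same pattern should carry over to the real data once the toy names are swapped for the actual column names.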

What other questions could we ask? We’ll come up with some questions and work in groups to answer them.

Yes, there are data about… Cats!

Load packages

library(tidyverse)

Load data (link to download cats data)

cat_lovers <- read_csv("data/cat-lovers.csv")

What is in the data frame?

Summarize the data

dplyr summarize()

cat_lovers |>
  summarize(avg_cats = mean(number_of_cats)) #Annotate

# A tibble: 1 × 1
  avg_cats
     <dbl>
1       NA

Why doesn’t this work? (Hint: explore the data frame)

Summarize

cat_lovers |> #Annotate
  summarize(avg_cats = mean(as.numeric(number_of_cats)))

Why doesn’t this work? (Hint: explore the data frame)

Summarize

cat_lovers |> #Annotate
  summarize(avg_cats = mean(as.numeric(number_of_cats), na.rm = TRUE))

# A tibble: 1 × 1
  avg_cats
     <dbl>
1    0.776
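
The three attempts above can be replayed on a small character vector to see where each NA comes from. The values below are made up for illustration and are not the actual entries in cat_lovers.

```r
# Hypothetical responses stored as text, like a <chr> column
cats_chr <- c("0", "1", "three", "2")

# Attempt 1: mean() needs numbers, so a character vector gives NA (with a warning)
step1 <- suppressWarnings(mean(cats_chr))

# Attempt 2: as.numeric() converts "0", "1", "2" but turns "three" into NA,
# and a single NA makes the whole mean NA
step2 <- suppressWarnings(mean(as.numeric(cats_chr)))

# Attempt 3: na.rm = TRUE drops the NA before averaging the remaining values
step3 <- suppressWarnings(mean(as.numeric(cats_chr), na.rm = TRUE))

c(step1, step2, step3)
```

Note that na.rm = TRUE silently discards the entries that could not be converted, so it is worth looking at those rows before deciding to drop them.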

Group by

dplyr group_by()

cat_lovers %>% 
  group_by(handedness)

# A tibble: 60 × 3
# Groups:   handedness [3]
   name           number_of_cats handedness
   <chr>          <chr>          <chr>     
 1 Bernice Warren 0              left      
 2 Woodrow Stone  0              left      
 3 Willie Bass    1              left      
 4 Tyrone Estrada 3              left      
 5 Alex Daniels   3              left      
 6 Jane Bates     2              left      
 7 Latoya Simpson 1              left      
 8 Darin Woods    1              left      
 9 Agnes Cobb     0              left      
10 Tabitha Grant  0              left      
# ℹ 50 more rows

What is the class of the result of running group_by ?

Group by

Stratifying data before computing summary statistics

cat_lovers %>% 
  group_by(handedness)

# A tibble: 60 × 3
# Groups:   handedness [3]
   name           number_of_cats handedness
   <chr>          <chr>          <chr>     
 1 Bernice Warren 0              left      
 2 Woodrow Stone  0              left      
 3 Willie Bass    1              left      
 4 Tyrone Estrada 3              left      
 5 Alex Daniels   3              left      
 6 Jane Bates     2              left      
 7 Latoya Simpson 1              left      
 8 Darin Woods    1              left      
 9 Agnes Cobb     0              left      
10 Tabitha Grant  0              left      
# ℹ 50 more rows

What’s a tibble?

Tibbles `tbl`

Read about them in IDS Intro Chapter 4 section 4.6 Tibbles
The tbl, pronounced “tibble”, is a special kind of data frame.
When given a tibble (for example, the result of read_csv), the functions group_by and summarize return this type of data frame.
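
One way to check this yourself is to call class() on each result. The sketch below uses a tiny made-up tibble (`toy` and its values are assumptions, not the cat_lovers data).

```r
library(dplyr)

# Made-up data for illustration
toy <- tibble(
  handedness     = c("left", "right", "left"),
  number_of_cats = c(0, 2, 1)
)

# group_by() adds grouping information on top of the tibble classes
grouped <- toy |>
  group_by(handedness)
class(grouped)

# summarize() collapses each group to one row and returns a tibble
summary_tbl <- grouped |>
  summarize(mean_cats = mean(number_of_cats))
class(summary_tbl)

# Both results are still tibbles, and still data frames
inherits(grouped, "tbl_df")
inherits(summary_tbl, "data.frame")
```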

Group by

Do people with different “handedness” own more or fewer cats?

Group by the “handedness” of cat owners and calculate the average number of cats per owner

cat_lovers %>%  #Annotate
  mutate(number_of_cats = as.numeric(number_of_cats)) %>% 
  group_by(handedness) %>% 
  summarize(mean_cats = mean(number_of_cats, na.rm = TRUE))

# A tibble: 3 × 2
  handedness   mean_cats
  <chr>            <dbl>
1 ambidextrous     0.8  
2 left             0.923
3 right            0.725

If/then in dplyr to calculate grades

case_when(condition ~ output_value)

condition is the condition that evaluates as TRUE (the “if”)
output_value is the value to output if the condition is TRUE (the “then”)

df <- data.frame(
  student = c("Natascha", "Alex", "Arun", "Arturo", "Ashley", "Oscar", "James", "Elliot"), 
  score = c(92, 78, 85, 86, 93, 67, 56, 73))

df %>% 
  mutate(grade = case_when(
    score >= 90 ~ 'A',
    score >= 80 ~ 'B',
    score >= 70 ~ 'C',
    score >= 60 ~ 'D',
    TRUE ~ 'F'))

   student score grade
1 Natascha    92     A
2     Alex    78     C
3     Arun    85     B
4   Arturo    86     B
5   Ashley    93     A
6    Oscar    67     D
7    James    56     F
8   Elliot    73     C

Work with a numeric vector

%% operator

https://www.datacamp.com/doc/r/operators

The %% operator returns the remainder of a division (the modulo operation). For example, 7 %% 2 is 1.

x <- 1:50 #Create a vector
y <- case_when(
  (x %% 2) == 0 ~ "even",
  TRUE ~ "odd"
)
y

 [1] "odd"  "even" "odd"  "even" "odd"  "even" "odd"  "even" "odd"  "even"
[11] "odd"  "even" "odd"  "even" "odd"  "even" "odd"  "even" "odd"  "even"
[21] "odd"  "even" "odd"  "even" "odd"  "even" "odd"  "even" "odd"  "even"
[31] "odd"  "even" "odd"  "even" "odd"  "even" "odd"  "even" "odd"  "even"
[41] "odd"  "even" "odd"  "even" "odd"  "even" "odd"  "even" "odd"  "even"

x %% 2

 [1] 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
[39] 1 0 1 0 1 0 1 0 1 0 1 0

Categorize the values of the vector x by size: small (less than 25), middle (equal to 25), and big (greater than 25):

z <- case_when(
  x < 25 ~ "small",
  x == 25 ~ "middle",
  TRUE ~ "big")
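
A quick way to confirm the categories came out as intended is to count how many values landed in each one. This sketch rebuilds x and z so it runs on its own:

```r
library(dplyr)

x <- 1:50

z <- case_when(
  x < 25  ~ "small",
  x == 25 ~ "middle",
  TRUE    ~ "big"
)

# Count the values in each category
z_counts <- table(z)
z_counts
```

With x running from 1 to 50, there should be 24 small values (1 to 24), 1 middle value (25), and 25 big values (26 to 50).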

Use `case_when` to recode

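cat_lovers <- cat_lovers %>%
  mutate(
    # Recode two specific people by name; keep every other value as-is
    number_of_cats = case_when(
      name == "Ginger Clark" ~ "2",
      name == "Doug Bass"    ~ "3",
      TRUE                   ~ number_of_cats
    ),
    # Convert the column from character to numeric
    number_of_cats = as.numeric(number_of_cats)
  )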

`case_when()` function

case_when(
    If Logical test 1 ~ new value,
    If Logical test 2 ~ new value,
    ....more tests and values...
    TRUE ~ default value
)
