Silly? Yes. But let’s practice tidyverse
US cities
Cats
Work with a numeric vector
Calculate course grades
Work with your small group to explore and answer questions using tidyverse functions.
Load the us_metro_areas.csv file into RStudio.
Use functions to explore the structure of the dataset.
Review the data dictionary (download here)
Note: this dataset is the same as the .csv from our U.S. cities activity in week 2, but here is a link to download the dataset.
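As a starting point for the loading and exploring steps above, here is a minimal sketch. Because the real us_metro_areas.csv is not reproduced here, the block writes a tiny made-up CSV to a temporary path; the column names (metro_area, population, land_area_sq_mi) are assumptions for illustration, so check the data dictionary for the real ones.

```r
library(readr)
library(dplyr)

# Stand-in for us_metro_areas.csv: a tiny made-up file written to a temp path
# so the sketch runs anywhere. Column names and values are hypothetical.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "metro_area,population,land_area_sq_mi",
  "Metro A,1200000,850",
  "Metro B,450000,300",
  "Metro C,2750000,1900"
), csv_path)

# In class, replace csv_path with the path to your downloaded file.
metro <- read_csv(csv_path, show_col_types = FALSE)

glimpse(metro)   # column names, types, and the first few values
dim(metro)       # number of rows and columns
summary(metro)   # quick per-column summaries
```

glimpse() is usually the fastest way to see which columns are numeric and which are text before writing any summaries.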
Start with…
Create a new variable for population density.
Calculate the mean and median of one continuous variable in the dataset.
Calculate the mean and median of one discrete variable in the dataset.
What are the top 10 metro areas by population? Write and run code to display the top 10 cities by total population.
Calculate the average and median ‘percent area covered by parks’ for cities with above median asthma rates. Then, calculate the average and median ‘percent area covered by parks’ for cities with below median asthma rates.
What other questions could we ask? We’ll come up with some questions and work in groups to answer them.
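The tasks above could be approached along these lines. This is only a sketch on made-up rows: every column name (population, land_area_sq_mi, pct_park_area, asthma_rate) is a hypothetical stand-in, and the real dataset will need its own names from the data dictionary.

```r
library(dplyr)

# Made-up rows; every column name here is a hypothetical stand-in.
metro <- tibble(
  metro_area      = c("Metro A", "Metro B", "Metro C", "Metro D"),
  population      = c(1200000, 450000, 2750000, 900000),
  land_area_sq_mi = c(800, 300, 2200, 450),
  pct_park_area   = c(12, 8, 15, 5),
  asthma_rate     = c(9.1, 10.4, 8.2, 11.0)
)

# New variable: population density (people per square mile)
metro <- metro %>%
  mutate(pop_density = population / land_area_sq_mi)

# Mean and median of one variable
pop_summary <- metro %>%
  summarize(mean_pop = mean(population), median_pop = median(population))

# Largest metro areas by population (use n = 10 on the full dataset)
top_metros <- metro %>%
  slice_max(population, n = 2)

# Park coverage for metros above vs. at or below the median asthma rate
parks_by_asthma <- metro %>%
  mutate(asthma_group = if_else(asthma_rate > median(asthma_rate),
                                "above median", "at or below median")) %>%
  group_by(asthma_group) %>%
  summarize(mean_parks   = mean(pct_park_area),
            median_parks = median(pct_park_area))

parks_by_asthma
```

Creating the asthma_group label first and then grouping on it avoids writing two separate filter() pipelines for the two halves of the question.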
Load packages
Load data (link to download cats data)
What is in the data frame?
Dplyr summarize()
# A tibble: 1 × 1
  avg_cats
     <dbl>
1       NA
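One plausible reason for the NA above: the printed data frame in this activity lists number_of_cats as <chr>, and mean() on text returns NA with a warning. The sketch below reproduces that on made-up rows and shows one way around it; the names and values are hypothetical, and the real data may also need cleaning of non-numeric entries.

```r
library(dplyr)

# Made-up rows mirroring the printed columns; counts are stored as text.
cats <- tibble(
  name           = c("Person A", "Person B", "Person C"),
  number_of_cats = c("0", "3", "1"),
  handedness     = c("left", "right", "left")
)

# mean() of a character column returns NA (R also emits a warning)
avg_chr <- cats %>%
  summarize(avg_cats = suppressWarnings(mean(number_of_cats)))

# Converting to numeric first gives a usable average
avg_num <- cats %>%
  mutate(number_of_cats = as.numeric(number_of_cats)) %>%
  summarize(avg_cats = mean(number_of_cats, na.rm = TRUE))

avg_chr
avg_num
```

na.rm = TRUE is included as a precaution in case the conversion produces missing values for entries that are not plain numbers.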
Dplyr group_by()
# A tibble: 60 × 3
# Groups:   handedness [3]
   name           number_of_cats handedness
   <chr>          <chr>          <chr>
 1 Bernice Warren 0              left
 2 Woodrow Stone  0              left
 3 Willie Bass    1              left
 4 Tyrone Estrada 3              left
 5 Alex Daniels   3              left
 6 Jane Bates     2              left
 7 Latoya Simpson 1              left
 8 Darin Woods    1              left
 9 Agnes Cobb     0              left
10 Tabitha Grant  0              left
# ℹ 50 more rows
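The grouped output looks almost identical to the ungrouped data; the visible difference is the "# Groups:" line. A small sketch, on made-up rows with hypothetical names and handedness values, of how to confirm that group_by() changes only the grouping metadata and not the rows:

```r
library(dplyr)

# Made-up rows; names and handedness values are hypothetical.
cats <- tibble(
  name           = c("Person A", "Person B", "Person C", "Person D"),
  number_of_cats = c("0", "3", "1", "2"),
  handedness     = c("left", "right", "left", "ambidextrous")
)

by_hand <- cats %>% group_by(handedness)

nrow(by_hand) == nrow(cats)  # same rows as before grouping
is_grouped_df(by_hand)       # the data now carries grouping information
group_vars(by_hand)          # which column(s) define the groups
n_groups(by_hand)            # how many distinct groups there are
```

The grouping only starts to matter once a verb such as summarize() or mutate() is applied, at which point calculations run once per group.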
What’s a tibble?
A tbl, pronounced “tibble”, is a special kind of data frame. group_by returns this type of data frame, and so does summarize when its input is a tibble or a grouped data frame.
Do people with different “handedness” own more or fewer cats?
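The handedness question can be approached by grouping and then summarizing. This is a sketch on made-up rows, not the course's cats data: names, counts, and handedness values are hypothetical, and the conversion with as.numeric() assumes the counts are stored as text, as the printed column types suggest.

```r
library(dplyr)

# Made-up rows; all values are hypothetical.
cats <- tibble(
  name           = c("Person A", "Person B", "Person C",
                     "Person D", "Person E", "Person F"),
  number_of_cats = c("0", "1", "3", "2", "4", "1"),
  handedness     = c("left", "left", "left",
                     "right", "right", "ambidextrous")
)

cats_by_hand <- cats %>%
  mutate(number_of_cats = as.numeric(number_of_cats)) %>%  # text to numbers
  group_by(handedness) %>%                                 # one group per value
  summarize(
    n_people = n(),                                        # rows in each group
    avg_cats = mean(number_of_cats, na.rm = TRUE)          # per-group mean
  )

cats_by_hand
inherits(cats_by_hand, "tbl_df")  # the result is itself a tibble
```

Including n() alongside the mean makes it easier to judge whether a group average rests on many people or only a few.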
case_when(condition ~ output_value)
condition is the condition that evaluates as TRUE (the “if”)
output_value is the value to output if the condition is TRUE (the “then”)
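library(dplyr)  # provides %>%, mutate(), and case_when()

# Example scores for eight students
df <- data.frame(
  student = c("Natascha", "Alex", "Arun", "Arturo", "Ashley", "Oscar", "James", "Elliot"),
  score = c(92, 78, 85, 86, 93, 67, 56, 73))

# Conditions are checked in order; the first one that is TRUE sets the grade
df %>%
  mutate(grade = case_when(
    score >= 90 ~ 'A',
    score >= 80 ~ 'B',
    score >= 70 ~ 'C',
    score >= 60 ~ 'D',
    TRUE ~ 'F'))  # fallback for scores below 60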
   student score grade
1 Natascha    92     A
2     Alex    78     C
3     Arun    85     B
4   Arturo    86     B
5   Ashley    93     A
6    Oscar    67     D
7    James    56     F
8   Elliot    73     C
https://www.datacamp.com/doc/r/operators
[1] "odd" "even" "odd" "even" "odd" "even" "odd" "even" "odd" "even"
[11] "odd" "even" "odd" "even" "odd" "even" "odd" "even" "odd" "even"
[21] "odd" "even" "odd" "even" "odd" "even" "odd" "even" "odd" "even"
[31] "odd" "even" "odd" "even" "odd" "even" "odd" "even" "odd" "even"
[41] "odd" "even" "odd" "even" "odd" "even" "odd" "even" "odd" "even"
[1] 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
[39] 1 0 1 0 1 0 1 0 1 0 1 0
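The code that produced the two outputs above is not shown. One way to get results of that shape, assuming x is the integers 1 through 50 (an assumption based on the 50 values printed), uses the modulo operator %% inside case_when():

```r
library(dplyr)

x <- 1:50  # assumption: the vector behind the printed output

# x %% 2 is the remainder after dividing by 2: 1 for odd numbers, 0 for even
remainder <- x %% 2

# Label each value by its remainder; TRUE acts as the catch-all branch
parity <- case_when(
  x %% 2 == 1 ~ "odd",
  TRUE        ~ "even"
)

head(parity)
head(remainder)
```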
Categorize the vector x by number size: small (less than 25), middle (equal to 25), and big (those integers larger than 25):
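One possible approach to this exercise, sketched with case_when() and assuming x is the integers 1 through 50 (the definition of x is not shown here):

```r
library(dplyr)

x <- 1:50  # assumption: x is not defined in this section

# Each condition maps a range of x to a size label
size <- case_when(
  x < 25  ~ "small",
  x == 25 ~ "middle",
  x > 25  ~ "big"
)

table(size)  # count how many values fall in each category
```

Spelling out all three conditions, rather than ending with a TRUE catch-all, means any value that matches none of them (such as NA) stays NA instead of being silently labeled "big".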
case_when to recode
case_when() function