Create a new Quarto document titled “review_count_group_by_yourname.qmd” to record your work on the activity.
When you are finished, submit your .qmd file on Brightspace for completion credit.
Load packages
Install the schrute package from CRAN. Then, load the following packages.
library(tidyverse)
library(schrute)
library(lubridate)
Solutions are hidden behind Code buttons. This gives you the opportunity to think first about how something should be done and then check to see what code was actually used. I strongly recommend you take the time to attempt an answer to the question before revealing the solution.
Use the theoffice data from the schrute package to predict IMDB scores for episodes of The Office.
Rows: 55,130
Columns: 12
$ index <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
$ season <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ episode <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ episode_name <chr> "Pilot", "Pilot", "Pilot", "Pilot", "Pilot", "Pilot",…
$ director <chr> "Ken Kwapis", "Ken Kwapis", "Ken Kwapis", "Ken Kwapis…
$ writer <chr> "Ricky Gervais;Stephen Merchant;Greg Daniels", "Ricky…
$ character <chr> "Michael", "Jim", "Michael", "Jim", "Michael", "Micha…
$ text <chr> "All right Jim. Your quarterlies look very good. How …
$ text_w_direction <chr> "All right Jim. Your quarterlies look very good. How …
$ imdb_rating <dbl> 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6…
$ total_votes <int> 3706, 3706, 3706, 3706, 3706, 3706, 3706, 3706, 3706,…
$ air_date <chr> "2005-03-24", "2005-03-24", "2005-03-24", "2005-03-24…
We will
Review the breakdown of seasons and episodes using the distinct function:
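A minimal sketch of this step, assuming the packages from the Load packages section are installed. The object name `episodes` is chosen here for illustration and is not prescribed by the activity.

```r
library(tidyverse)
library(schrute)

# Keep one row per unique season-episode pair; all other columns are dropped.
# `episodes` is an illustrative name, not one the activity requires.
episodes <- theoffice %>%
  distinct(season, episode)

episodes
```

Because `distinct()` returns only the columns you name, the result has one row per episode rather than one row per line of dialogue.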
How many lines are in each episode? Hint: You will need to create a frequency table by season and episode.
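One way to build that frequency table is with `count()`, sketched below. The name `lines_per_episode` is an assumption made for this example.

```r
library(tidyverse)
library(schrute)

# Each row of theoffice is one spoken line, so counting rows per
# season-episode pair gives the number of lines in each episode.
# `lines_per_episode` is an illustrative name.
lines_per_episode <- theoffice %>%
  count(season, episode)

lines_per_episode
```

The count column is named `n` by default. Its values should add up to the total number of rows in `theoffice`.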
Inside the same function you used in the previous step, add episode names for context.
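As a sketch, and assuming the function from the previous step was `count()`, the episode name can be passed as an additional counting variable. The object name `lines_per_episode_named` is illustrative.

```r
library(tidyverse)
library(schrute)

# Adding episode_name to count() carries the title along with each
# season-episode pair in the frequency table.
# `lines_per_episode_named` is an illustrative name.
lines_per_episode_named <- theoffice %>%
  count(season, episode, episode_name)

lines_per_episode_named
```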
Keep only key characters: Pam, Jim, Michael, Dwight. Store the result in a new data frame called office_subset.
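A possible solution sketch uses `filter()` with `%in%`. The helper vector `key_characters` is introduced here for readability and is not part of the activity.

```r
library(tidyverse)
library(schrute)

# Hypothetical helper vector holding the four characters to keep.
key_characters <- c("Pam", "Jim", "Michael", "Dwight")

# Keep only the rows whose speaker is one of the four key characters.
office_subset <- theoffice %>%
  filter(character %in% key_characters)

office_subset
```

Matching is exact and case-sensitive, so a speaker recorded with different spelling or capitalization would not be kept.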
Count lines by character within each episode. Use office_subset, the new data frame from Exercise 3.
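The sketch below rebuilds `office_subset` so that it runs on its own, then counts lines per character within each episode. The name `character_counts` is an assumption for illustration.

```r
library(tidyverse)
library(schrute)

# Rebuild office_subset (Exercise 3) so this block is self-contained.
office_subset <- theoffice %>%
  filter(character %in% c("Pam", "Jim", "Michael", "Dwight"))

# One row per episode-character combination; n is that character's
# number of lines in the episode. `character_counts` is an illustrative name.
character_counts <- office_subset %>%
  count(season, episode, episode_name, character)

character_counts
```

A character who has no lines in an episode simply has no row for that episode; `count()` does not add zero-count rows here.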
Using office_subset, compute proportions within each episode, and store the resulting table in an object called office_counts.
office_counts <- office_subset %>%
  count(season, episode, episode_name, character) %>%
  group_by(season, episode, episode_name) %>%
  mutate(
    n_lines = sum(n),
    proportion = n / n_lines
  ) %>%
  # ungroup() removes the grouping stored on the data frame. If you keep the
  # grouping, later operations (summarise(), mutate(), etc.) will still run
  # within each episode, even if you don't intend that. Ungrouping is best
  # practice, though not strictly necessary for this step.
  ungroup()
office_counts
Reshape the result of Exercise 5 into a wider format so that we can see which main character spoke the most in each episode.
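One way to reshape is `pivot_wider()` from tidyr, sketched below. The block rebuilds `office_counts` so it runs on its own. The name `office_wide` and the choice to fill missing characters with 0 are assumptions made for this example.

```r
library(tidyverse)
library(schrute)

# Rebuild office_subset and office_counts (Exercises 3 and 5).
office_subset <- theoffice %>%
  filter(character %in% c("Pam", "Jim", "Michael", "Dwight"))

office_counts <- office_subset %>%
  count(season, episode, episode_name, character) %>%
  group_by(season, episode, episode_name) %>%
  mutate(
    n_lines = sum(n),
    proportion = n / n_lines
  ) %>%
  ungroup()

# One row per episode and one column per character.
# id_cols names the columns that identify an episode, so n and n_lines
# are dropped. values_fill = 0 is an assumption: a character with no
# lines in an episode gets a proportion of 0 instead of NA.
office_wide <- office_counts %>%
  pivot_wider(
    id_cols = c(season, episode, episode_name),
    names_from = character,
    values_from = proportion,
    values_fill = 0
  )

office_wide
```

Reading across a row of `office_wide`, the largest value marks which of the four characters spoke the most in that episode.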
There is a more direct approach to get to the result we found in Exercise 6.
Starting from office_subset (the rows of theoffice for Jim, Pam, Michael, and Dwight), calculate the proportion of lines spoken by each of the four characters in each episode of The Office using group_by and mutate.
office_lines <- office_subset %>%
  group_by(season, episode) %>%
  mutate(
    n_lines = n(), # n() gives the number of observations in each group.
    lines_jim = sum(character == "Jim") / n_lines,
    lines_pam = sum(character == "Pam") / n_lines,
    lines_michael = sum(character == "Michael") / n_lines,
    lines_dwight = sum(character == "Dwight") / n_lines
  ) %>%
  ungroup() %>%
  # Select the columns we want to display.
  select(season, episode, episode_name, contains("lines_")) %>%
  # Remove the duplicates generated: because we used mutate(), every row in an
  # episode gets the same values, and we only need them once.
  distinct(season, episode, episode_name, .keep_all = TRUE)
office_lines # Print the resulting object
Now it’s your turn…
What is the observational unit of the original theoffice data set? In other words, what does each row in the table represent?
Include a brief answer in your Quarto document.
Choose two categorical variables from the theoffice data frame that you haven’t used in previous steps. Use either count() OR use group_by() followed by a summarize or mutate function to learn something about the data.