The Office

Create a new Quarto titled “review_count_group_by_yourname.qmd” to record your work on the activity.

When you are finished, submit your .QMD on Brightspace for completion credit.

Load packages

Install the schrute package from the CRAN Repository. Then, load the following packages.

Code

library(tidyverse)
library(schrute)
library(lubridate)

How to use this page:

The solution code is not automatically displayed, but you can see it by clicking the Code buttons. This gives you the opportunity to think first about how something should be done and then check to see what code was actually used. I strongly recommend you take the time to attempt an answer the question before revealing the solution.

Use theoffice data from the schrute package to predict IMDB scores for episodes of The Office.

Code

glimpse(theoffice)

Rows: 55,130
Columns: 12
$ index            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
$ season           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ episode          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ episode_name     <chr> "Pilot", "Pilot", "Pilot", "Pilot", "Pilot", "Pilot",…
$ director         <chr> "Ken Kwapis", "Ken Kwapis", "Ken Kwapis", "Ken Kwapis…
$ writer           <chr> "Ricky Gervais;Stephen Merchant;Greg Daniels", "Ricky…
$ character        <chr> "Michael", "Jim", "Michael", "Jim", "Michael", "Micha…
$ text             <chr> "All right Jim. Your quarterlies look very good. How …
$ text_w_direction <chr> "All right Jim. Your quarterlies look very good. How …
$ imdb_rating      <dbl> 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6…
$ total_votes      <int> 3706, 3706, 3706, 3706, 3706, 3706, 3706, 3706, 3706,…
$ air_date         <chr> "2005-03-24", "2005-03-24", "2005-03-24", "2005-03-24…

We will

engineer features based on episode scripts

Review the breakdown of seasons and episodes using the distinct function:

Code

theoffice %>%
  distinct(season, episode)

# A tibble: 186 × 2
   season episode
    <int>   <int>
 1      1       1
 2      1       2
 3      1       3
 4      1       4
 5      1       5
 6      1       6
 7      2       1
 8      2       2
 9      2       3
10      2       4
# ℹ 176 more rows

Exercise 1

How many lines are in each episode? Hint: you will need to create a frequency table by season and episode

Code

theoffice %>%
  count(season, episode)

Exercise 2

Inside the same function you used in the previous step, add episode names for context.

Code

theoffice %>%
  count(season, episode, episode_name)

Exercise 3

Keep only key characters: Pam, Jim, Michael, Dwight. Store the result in a new data frame called office_subset.

Code

office_subset <- theoffice %>%
  filter(character %in% c("Pam","Jim","Dwight","Michael"))

Exercise 4

Count lines by character within each episode. Use the new data frame from Exercise 3 office_subset.

Code

office_subset %>%
  count(season, episode, episode_name, character)

Exercise 5

Using office_subset, compute proportions within each episode, and store resulting table in an object called office_counts.

Code

office_counts <- office_subset %>%
  count(season, episode, episode_name, character) %>%
  group_by(season, episode, episode_name) %>%
  mutate(
    n_lines = sum(n),
    proportion = n / n_lines
  ) %>%
  ungroup() # Ungroup() clears the groups stored in memory to prevent errors when using this data table in future steps. For example: If you keep the grouping, any later operations (like summarise(), mutate(), etc.) will still run within each episode, even if you don’t intend that. This is best practice, and not strictly necessary for this step.
office_counts

Exercise 6

Reshape the result of Exercise 5 into a wider format so that we can see which main character spoke the most in each episode.

Code

office_lines <- office_counts %>%
  select(season, episode, episode_name, character, proportion) %>%
  pivot_wider(
    names_from = character,
    values_from = proportion,
    names_prefix = "lines_" 
  )
office_lines

Exercise 7

There is a more direct approach to get to the result we found in Exercise 6.

Using the original data frame theoffice, calculate the percentage of lines spoken by Jim, Pam, Michael, and Dwight for each episode of The Office using group_by and mutate.

Code

office_lines <- office_subset %>%
  group_by(season, episode) %>%
  mutate(
    n_lines = n(), # n() gives the number of observations in each group.
    lines_jim = sum(character == "Jim") / n_lines,
    lines_pam = sum(character == "Pam") / n_lines,
    lines_michael = sum(character == "Michael") / n_lines,
    lines_dwight = sum(character == "Dwight") / n_lines,
  ) %>%
  ungroup() %>%
  select(season, episode, episode_name, contains("lines_")) %>% # Select the columns we want to display
  distinct(season, episode, episode_name, .keep_all = TRUE) # Remove duplicates generated. Because we used mutate(), every row in an episode gets the same values, we only need them once.

office_lines # Print the resulting object

Now it’s your turn…

Exercise 8

What is the observational unit of the original theoffice data set? In other words, what does each row in the table represent?

Include a brief answer in your Quarto.

Exercise 9

Choose two categorical variables from the theoffice data frame that you haven’t used in previous steps. Use either count() OR use group_by() followed by a summarize or mutate function to learn something about the data.

Note: ratings are ordinal categorical data, not discrete or continuous numerical data. They are ordinal because they represent an ordered position or quality rather than a measurable quantity. Although rankings use numbers (1st, 2nd, 3rd) or scales (1-5 stars), these represent labels with an inherent order, where the interval between ranks are not equal or consistent. Learn more.

Code

# Your code here