Data Wrangling with Polling Data

Data Wrangling

Why?

  • Data is often organized to facilitate some goal other than analysis.

  • For example, it’s common for data to be structured to make data entry, not analysis, easy.

  • Most people aren’t familiar with the principles of tidy data, and it’s hard to derive them yourself unless you spend a lot of time working with data.

Where do we start?

  • Figure out what the underlying variables and observations are.

    • Read the metadata and data dictionary; for datasets bundled with R packages, the help page in RStudio often serves as one.

    • Reach out to the people who built the dataset

  • Clean your data (likely at many points during your analysis)

    • Select the variables you need

    • Filter to the observations you need

  • Pivot your data into a tidy form, with variables in the columns and observations in the rows.

  • Clean your data again

  • Then, conduct exploratory data analysis: summarize and visualize the cleaned dataset
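
The select and filter steps in the list above can be sketched with dplyr. This is a minimal illustration, not part of the original analysis: the polls tibble is a hypothetical three-row stand-in built from the Congress rows shown later in these slides, and the date cutoff is arbitrary.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for the polling data (three Congress rows).
polls <- tibble(
  `politician/institution` = "Congress",
  date = as.Date(c("2024-10-30", "2024-10-29", "2024-10-28")),
  Approve = c(21.3, 21.3, 20.9),
  Disapprove = c(62.4, 62.4, 63.0)
)

polls_recent <- polls %>%
  # Select the variables you need
  select(`politician/institution`, date, Approve) %>%
  # Filter to the observations you need (arbitrary cutoff for illustration)
  filter(date >= as.Date("2024-10-29"))

polls_recent
```

Non-syntactic column names such as `politician/institution` need backticks wherever they are referenced.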

Approval polling averages

Load the polling averages data

approval_averages.csv (link to download if you want to try this out on your own later)
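
The slides do not show the loading step itself, so here is a minimal sketch using readr. To keep it runnable without the download, it writes a tiny stand-in file from the first rows shown on this slide; csv_path and approval_sample are names made up for this example, and the commented-out last line assumes the real file sits in your working directory.

```r
library(readr)

# Tiny stand-in file so the sketch runs without the real download.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "politician/institution,date,Approve,Disapprove",
  "Congress,2024-10-30,21.3,62.4",
  "Congress,2024-10-29,21.3,62.4",
  "Congress,2024-10-28,20.9,63.0"
), csv_path)

# read_csv() guesses column types: text, date, and two doubles here.
approval_sample <- read_csv(csv_path, show_col_types = FALSE)
approval_sample

# With the real file (assumed location: the working directory):
# approval <- read_csv("approval_averages.csv")
```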

approval
# A tibble: 3,998 × 4
   `politician/institution` date       Approve Disapprove
   <chr>                    <date>       <dbl>      <dbl>
 1 Congress                 2024-10-30    21.3       62.4
 2 Congress                 2024-10-29    21.3       62.4
 3 Congress                 2024-10-28    20.9       63.0
 4 Congress                 2024-10-27    20.9       63.0
 5 Congress                 2024-10-26    20.8       63.0
 6 Congress                 2024-10-25    21.2       62.9
 7 Congress                 2024-10-24    21.2       62.9
 8 Congress                 2024-10-23    21.3       62.9
 9 Congress                 2024-10-22    21.4       62.4
10 Congress                 2024-10-21    21.9       62.9
# ℹ 3,988 more rows

Source: Approval polling averages downloaded from 538

Goal

Aesthetic mappings (🟩 already in the data, 🟥 needs to be created):
🟩 x = date
🟥 y = rating_value
🟥 color = rating_type

Facet:
🟩 politician/institution (Congress, Biden, Supreme Court)

- The x-axis shows the date, which is already in the dataset

- The y-axis shows the rating, but it is spread across two columns (Approve and Disapprove)

- The plot is faceted by institution or politician

- We’ll need to create one column to hold the rating value for both, and another to indicate the rating type: whether the rating is approval or disapproval

We need to pivot

  • We have to reorganize the data frame in order to create rating_value and rating_type.

  • The rating values are in the cells of the Approve and Disapprove columns, and the rating type is stored in those two column headers.

Pivot

approval_longer <- approval %>%
  pivot_longer(
    cols = c(Approve, Disapprove),
    names_to = "rating_type",
    values_to = "rating_value"
  )

This gives us a data frame with twice as many rows (7,996 instead of 3,998), because the Approve and Disapprove values from each original row now occupy a row of their own.
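
A quick way to check the row doubling is to run the same pivot on a small example and compare row counts. This is a sketch, not part of the original analysis: approval_sample is a hypothetical three-row stand-in built from the Congress rows shown earlier.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical three-row stand-in for the approval data.
approval_sample <- tibble(
  `politician/institution` = "Congress",
  date = as.Date(c("2024-10-30", "2024-10-29", "2024-10-28")),
  Approve = c(21.3, 21.3, 20.9),
  Disapprove = c(62.4, 62.4, 63.0)
)

approval_sample_longer <- approval_sample %>%
  pivot_longer(
    cols = c(Approve, Disapprove),
    names_to = "rating_type",
    values_to = "rating_value"
  )

# Each original row contributes one Approve row and one Disapprove row.
nrow(approval_sample_longer) == 2 * nrow(approval_sample)
count(approval_sample_longer, rating_type)
```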

These changes will make it possible for us to color by rating_type and facet by politician/institution.

approval_longer
# A tibble: 7,996 × 4
   `politician/institution` date       rating_type rating_value
   <chr>                    <date>     <chr>              <dbl>
 1 Congress                 2024-10-30 Approve             21.3
 2 Congress                 2024-10-30 Disapprove          62.4
 3 Congress                 2024-10-29 Approve             21.3
 4 Congress                 2024-10-29 Disapprove          62.4
 5 Congress                 2024-10-28 Approve             20.9
 6 Congress                 2024-10-28 Disapprove          63.0
 7 Congress                 2024-10-27 Approve             20.9
 8 Congress                 2024-10-27 Disapprove          63.0
 9 Congress                 2024-10-26 Approve             20.8
10 Congress                 2024-10-26 Disapprove          63.0
# ℹ 7,986 more rows

Build the basic plot

● The x-axis is date,

● the y-axis is rating_value,

● and we’ll color the lines by rating_type.

● Specify how to join the values together with lines: group = rating_type tells ggplot to connect the Approve values with one line and the Disapprove values with another. Then we add geom_line().

● Finally, we facet by politician/institution with facet_wrap(), which gives one panel per politician or institution.

ggplot(approval_longer, 
       aes(x = date, y = rating_value, color = rating_type, group = rating_type)) +
  geom_line() +
  facet_wrap(~ `politician/institution`)

Plot

Now we can refine the visualization:

  • Add a new layer with scale_color_manual to specify colors

  • Add another layer with all the labels, title, subtitle, and caption

  • Remove the legend title with color = NULL in labs(): this gets rid of the “rating_type” text, which we don’t really need

ggplot(approval_longer,
       aes(x = date, y = rating_value, 
             color = rating_type, group = rating_type)) +
  geom_line() +
  facet_wrap(~ `politician/institution`) +
  scale_color_manual(values = c("darkgreen", "orange")) + 
  labs( 
    x = "Date", y = "Rating", 
    color = NULL, 
    title = "How (un)popular are Congress, Biden, and the Supreme Court?", 
    subtitle = "Estimates based on polls of all adults and polls of likely/registered voters", 
    caption = "Source: FiveThirtyEight modeling estimates" 
  ) 

ggplot(approval_longer,
       aes(x = date, y = rating_value,
           color = rating_type, group = rating_type)) +
  geom_line() +
  facet_wrap(~ `politician/institution`) +
  scale_color_manual(values = c("darkgreen", "orange")) +
  labs(
    x = "Date", y = "Rating",
    color = NULL,  # drop the legend title
    title = "How (un)popular are Congress, Biden, and the Supreme Court?",
    subtitle = "Estimates based on polls of all adults and polls of likely/registered voters",
    caption = "Source: FiveThirtyEight modeling estimates"
  ) +
  # Lighter theme, with the legend moved below the panels
  theme_minimal() +
  theme(legend.position = "bottom")
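
One optional variation on the color scale: unnamed values in scale_color_manual() are matched to the rating types in order (Approve, then Disapprove), so a named vector makes the pairing explicit. This is a sketch rather than the slides' own code; rating_colors and the four-row approval_sample_longer tibble are made up for this example.

```r
library(ggplot2)
library(tibble)

# Hypothetical four-row stand-in for the pivoted data.
approval_sample_longer <- tibble(
  `politician/institution` = "Congress",
  date = as.Date(rep(c("2024-10-30", "2024-10-29"), each = 2)),
  rating_type = rep(c("Approve", "Disapprove"), times = 2),
  rating_value = c(21.3, 62.4, 21.3, 62.4)
)

# Named values are matched to rating_type by name, not by position.
rating_colors <- c(Approve = "darkgreen", Disapprove = "orange")

p <- ggplot(approval_sample_longer,
            aes(x = date, y = rating_value,
                color = rating_type, group = rating_type)) +
  geom_line() +
  facet_wrap(~ `politician/institution`) +
  scale_color_manual(values = rating_colors)

p
```

With names in place, reordering the vector or adding another rating type later does not silently swap the colors.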

Recap

  • If you’ve got data that contains the information you need, but not in a helpful shape, we can reorganize it to make it easier to do analysis and to visualize it.

  • Then you can modify the visualization to help it convey the desired information to your audience.

Data Wrangling with Presidential Polls