Data Wrangling with Polling Data

Data Wrangling

Why?

Data is often organized to facilitate some goal other than analysis.
For example, it’s common for data to be structured to make data entry, not analysis, easy.
Most people aren’t familiar with the principles of tidy data, and it’s hard to derive them yourself unless you spend a lot of time working with data.

Data Wrangling

Where do we start?

Figure out what the underlying variables and observations are.
- Metadata and data dictionaries – some will be available via the help page in RStudio!
- Reach out to those who built the dataset
Clean your data (likely at many points during your analysis)
- Select the variables you need
- Filter to the observations you need
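
As a minimal sketch of the two cleaning steps above, assuming the dplyr and tibble packages are installed. The `toy` tibble, its column names, and its values are made up for illustration and are not part of the polling data.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data; every value here is invented for illustration.
toy <- tibble(
  institution = c("A", "A", "B", "B"),
  year        = c(2023, 2024, 2023, 2024),
  approve     = c(40, 42, 55, 53),
  notes       = c("", "", "", "")
)

toy_clean <- toy %>%
  select(institution, year, approve) %>%  # keep only the variables we need
  filter(year == 2024)                    # keep only the observations we need

toy_clean
```

select() works on columns and filter() works on rows, so the order of the two steps only matters when the filter condition uses a column that select() would drop.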

Data Wrangling

Pivot your data into a tidy form, with variables in the columns and observations in the rows.
Clean your data again
Then, conduct exploratory data analysis: summarize and visualize the cleaned dataset
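
The "summarize" half of exploratory data analysis might look like the sketch below, assuming dplyr and tibble. The `toy_tidy` data and its values are hypothetical; the pattern is one summary row per group.

```r
library(dplyr)
library(tibble)

# Hypothetical tidy data: one row per institution and year (invented values).
toy_tidy <- tibble(
  institution = c("A", "A", "B", "B"),
  year        = c(2023, 2024, 2023, 2024),
  approve     = c(40, 42, 55, 53)
)

toy_summary <- toy_tidy %>%
  group_by(institution) %>%
  summarize(
    n_obs        = n(),           # number of observations per institution
    mean_approve = mean(approve)  # average approval per institution
  )

toy_summary
```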

Approval polling averages

Load the polling averages data

approval_averages.csv (link to download if you want to try this out on your own later)
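
One way to load the file is sketched below, assuming the readr package and that the downloaded CSV sits in the working directory. `load_approval()` is a hypothetical helper name, and the path is an assumption.

```r
library(readr)

# Hypothetical helper: read the approval averages from a CSV file.
# show_col_types = FALSE only silences the column-type message.
load_approval <- function(path) {
  read_csv(path, show_col_types = FALSE)
}

# Assumed location of the downloaded file; adjust the path as needed.
csv_path <- "approval_averages.csv"
if (file.exists(csv_path)) {
  approval <- load_approval(csv_path)
}
```

read_csv() keeps the column name `politician/institution` unchanged, which is why the later code wraps it in backticks.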

approval

# A tibble: 3,998 × 4
   `politician/institution` date       Approve Disapprove
   <chr>                    <date>       <dbl>      <dbl>
 1 Congress                 2024-10-30    21.3       62.4
 2 Congress                 2024-10-29    21.3       62.4
 3 Congress                 2024-10-28    20.9       63.0
 4 Congress                 2024-10-27    20.9       63.0
 5 Congress                 2024-10-26    20.8       63.0
 6 Congress                 2024-10-25    21.2       62.9
 7 Congress                 2024-10-24    21.2       62.9
 8 Congress                 2024-10-23    21.3       62.9
 9 Congress                 2024-10-22    21.4       62.4
10 Congress                 2024-10-21    21.9       62.9
# ℹ 3,988 more rows

Source: Approval polling averages downloaded from 538

Goal

Aesthetic mappings:
🟩 x = date
🟥 y = rating_value
🟥 color = rating_type

Facet:
🟩 politician/institution (Congress, Biden, Supreme Court)

Goal

- The x-axis is the date, which is already in the dataset

- The y-axis is the rating, but its values are spread across two columns (Approve and Disapprove)

- The plot is faceted by politician/institution

Goal

- We’ll need to create one column to hold the rating value for both types, and one to indicate the rating type: whether the rating is an approval or a disapproval

We need to pivot

We have to reorganize the data frame to create rating_value and rating_type.
The rating values are stored in the cells of the Approve and Disapprove columns, and the rating type is stored in those two column headers.

Pivot

approval_longer <- approval %>%
  pivot_longer(
    cols = c(Approve, Disapprove),
    names_to = "rating_type",
    values_to = "rating_value"
  )

This gives us a data frame with twice as many rows (7,996 instead of 3,998), because each original row's Approve and Disapprove values now sit in separate rows.
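
The row doubling can be checked on a small example. The sketch below assumes the tidyr and tibble packages; `approval_small` is a hypothetical name for two rows copied from the printed data above.

```r
library(tidyr)
library(tibble)

# Two rows copied from the printed approval data, enough to show the reshape.
approval_small <- tibble(
  `politician/institution` = c("Congress", "Congress"),
  date       = as.Date(c("2024-10-30", "2024-10-29")),
  Approve    = c(21.3, 21.3),
  Disapprove = c(62.4, 62.4)
)

approval_small_longer <- approval_small %>%
  pivot_longer(
    cols = c(Approve, Disapprove),
    names_to = "rating_type",     # old column headers go here
    values_to = "rating_value"    # old cell values go here
  )

nrow(approval_small)         # rows before pivoting
nrow(approval_small_longer)  # rows after: one per date and rating type
```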

Pivot

These changes make it possible to color by rating_type and facet by politician/institution.

approval_longer

# A tibble: 7,996 × 4
   `politician/institution` date       rating_type rating_value
   <chr>                    <date>     <chr>              <dbl>
 1 Congress                 2024-10-30 Approve             21.3
 2 Congress                 2024-10-30 Disapprove          62.4
 3 Congress                 2024-10-29 Approve             21.3
 4 Congress                 2024-10-29 Disapprove          62.4
 5 Congress                 2024-10-28 Approve             20.9
 6 Congress                 2024-10-28 Disapprove          63.0
 7 Congress                 2024-10-27 Approve             20.9
 8 Congress                 2024-10-27 Disapprove          63.0
 9 Congress                 2024-10-26 Approve             20.8
10 Congress                 2024-10-26 Disapprove          63.0
# ℹ 7,986 more rows

Build the basic plot

● x-axis is date,

● y-axis is rating_value,

● we’ll color the lines by rating_type.

● Specify how to join the values with lines: group = rating_type tells ggplot to connect the Approve values with one line and the Disapprove values with another. Then we add geom_line().

● Finally, we facet by politician/institution so that each one gets its own panel.

ggplot(approval_longer, 
       aes(x = date, y = rating_value, color = rating_type, group = rating_type)) +
  geom_line() +
  facet_wrap(~ `politician/institution`)

Plot

Now we can refine the visualization:

- Add a new layer with scale_color_manual() to specify the line colors
- Add another layer with labs() for the axis labels, title, subtitle, and caption
- Remove the legend title with color = NULL inside labs(): this gets rid of the “rating_type” text, which we don’t really need

ggplot(approval_longer,
       aes(x = date, y = rating_value, 
             color = rating_type, group = rating_type)) +
  geom_line() +
  facet_wrap(~ `politician/institution`) +
  scale_color_manual(values = c("darkgreen", "orange")) + 
  labs( 
    x = "Date", y = "Rating", 
    color = NULL, 
    title = "How (un)popular are Congress, Biden, and the Supreme Court?", 
    subtitle = "Estimates based on polls of all adults and polls of likely/registered voters", 
    caption = "Source: FiveThirtyEight modeling estimates" 
  )

ggplot(approval_longer,
       aes(x = date, y = rating_value, 
             color = rating_type, group = rating_type)) +
  geom_line() +
  facet_wrap(~ `politician/institution`) +
  scale_color_manual(values = c("darkgreen", "orange")) + 
  labs( 
    x = "Date", y = "Rating", 
    color = NULL, 
    title = "How (un)popular are Congress, Biden, and the Supreme Court?", 
    subtitle = "Estimates based on polls of all adults and polls of likely/registered voters", 
    caption = "Source: FiveThirtyEight modeling estimates" 
  ) + 
  theme_minimal() +
  theme(legend.position = "bottom")
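
An unnamed color vector is matched to the rating_type values in order, which for character data is alphabetical (Approve, then Disapprove). As an optional variation, not part of the original slides, the sketch below names the vector so each color is tied to its rating type explicitly. It assumes ggplot2; `approval_demo` and `rating_colors` are hypothetical names, and the four rows are copied from the printed approval_longer output.

```r
library(ggplot2)

# Four rows copied from the printed approval_longer output, for a runnable sketch.
approval_demo <- data.frame(
  date         = as.Date(c("2024-10-30", "2024-10-30", "2024-10-29", "2024-10-29")),
  rating_type  = c("Approve", "Disapprove", "Approve", "Disapprove"),
  rating_value = c(21.3, 62.4, 21.3, 62.4)
)

# Naming the vector ties each color to a rating_type value explicitly.
rating_colors <- c(Approve = "darkgreen", Disapprove = "orange")

p <- ggplot(approval_demo,
            aes(x = date, y = rating_value,
                color = rating_type, group = rating_type)) +
  geom_line() +
  scale_color_manual(values = rating_colors) +
  labs(x = "Date", y = "Rating", color = NULL) +
  theme_minimal() +
  theme(legend.position = "bottom")

p
```

With a named vector, the colors stay attached to the right lines even if a rating type is filtered out or the values are reordered.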

Recap

If your data contains the information you need but not in a helpful shape, you can reorganize it to make analysis and visualization easier.
Then you can refine the visualization so it conveys the desired information to your audience.
