Data cleaning and wrangling for mapping in R

Intro to Data Analytics

Basic Idea

  • The process of asking and answering questions with data, exploratory data analysis or EDA for short…

    • Is 80% data cleaning and wrangling for our analysis purposes, and 20% a mix of analysis, visualization, and writing.

    • Tidy datasets are all alike but every messy dataset is messy in its own way.

    • Start with a question, then identify appropriate data to explore.

Data summary table with pivot_longer()

summary <- dutchess_indicators %>% 
  st_drop_geometry() %>% #remove the geometry column
  select_if(is.numeric) %>%
  summarise(across(where(is.numeric),list(mean=mean,std=sd,med = median))) %>%
  pivot_longer(everything(),
               names_to = "summary_stat", 
               values_to = "value") %>% 
  mutate(value = round(value, 2)) # Round the summary statistics to two decimal places
summary
# A tibble: 27 × 2
   summary_stat             value
   <chr>                    <dbl>
 1 total_pop_mean         3609.  
 2 total_pop_std          1357.  
 3 total_pop_med          3502   
 4 median_hh_income_mean 94374.  
 5 median_hh_income_std  31898.  
 6 median_hh_income_med  93278   
 7 percent_children_mean    17.7 
 8 percent_children_std      5.51
 9 percent_children_med     18   
10 percent_white_mean       69.0 
# ℹ 17 more rows
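
A possible refinement, sketched on a small made-up table (the `toy` tibble and its values are hypothetical, not the Dutchess County data): if across() joins the column name and the statistic name with a distinctive separator, pivot_longer() can split them again, giving one row per variable and one column per statistic. The "__" separator is an arbitrary choice for this sketch.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the numeric indicator columns
toy <- tibble(
  total_pop        = c(3000, 3500, 4200),
  percent_children = c(15, 18, 20)
)

summary_wide <- toy %>%
  summarise(across(where(is.numeric),
                   list(mean = mean, std = sd, med = median),
                   .names = "{.col}__{.fn}")) %>%   # e.g. total_pop__mean
  pivot_longer(everything(),
               names_to = c("variable", ".value"),  # ".value" keeps each statistic as its own column
               names_sep = "__")

summary_wide
```

This layout is easier to read as a table, while the long layout shown above is often more convenient for plotting.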

Mapping with tmap

A simple static tmap map

map_children_layout <- 
  tm_shape(dutchess_indicators) +
    tm_polygons("percent_children", title = "% Children")
map_children_layout

Mapping with tmap

A simple static tmap map with required map elements: north arrow and scale bar

map_children_layout <- 
  tm_shape(dutchess_indicators) +
    tm_polygons("percent_children", title = "% Children") +
  tm_compass() +  #Add a north arrow to your map layout
  tm_scale_bar(position = c("left", "bottom"))  #Add a scale bar to your map layout
map_children_layout

Mapping with tmap

Static or interactive plots

tmap_mode("plot")   #Default mode for tmap, makes static maps
tmap_mode("view")   #Mode that creates interactive maps using leaflet

Save a plot as a .png

tmap_save(map_children_layout, filename = "dutchess_children_with_layout.png")

Mapping with tmap

#Map the percentage of children by tract in Dutchess County, NY
map_children_layout <- tm_shape(dutchess_indicators) +
  tm_polygons("percent_children", 
              title = "% Children", 
              palette = "viridis",
              style = "jenks") +
  tm_scale_bar(position = c("left", "bottom")) +  #Add a scale bar to your map layout
  tm_layout("Children in Dutchess County, NY") #Set the map title
map_children_layout

Mapping with tmap

  • Add point locations using tm_dots()

Mapping with tmap

layout <- tm_shape(dutchess_indicators) +
  tm_polygons("median_hh_income", title = "Median Household Income", palette = "BuGn", style = "jenks") +
  tm_scale_bar(position = c("left", "bottom")) +
  tm_layout("TRI Facilities and Income in Dutchess County, NY") +
  tm_shape(tri_dutchess) +
    tm_dots("red", size = .05)
layout

Working with spatial data

Key packages:

  • sf: the simple features package

  • tmap

  • ggplot2, with geom_sf()

  • leaflet: interactive maps (the engine behind tmap's "view" mode)

Use:

  • st_read() or read_sf() to read spatial data files

  • For geojson files (open source spatial data file):

    • Use the geojsonsf package and the function
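
A minimal sketch of the two sf readers listed above, using the North Carolina shapefile that ships with the sf package so it runs without any course data. The GeoJSON step is an assumption for illustration: it goes through sf's own GDAL drivers rather than the geojsonsf package, and the file is a temporary one written by the sketch itself.

```r
library(sf)

# Path to a shapefile that is installed with the sf package
nc_path <- system.file("shape/nc.shp", package = "sf")

nc_df  <- st_read(nc_path, quiet = TRUE)  # returns an sf data frame
nc_tbl <- read_sf(nc_path)                # returns an sf tibble, quiet by default

# GeoJSON round trip: write a few rows to a temporary file, then read them back
geojson_path <- tempfile(fileext = ".geojson")
st_write(nc_df[1:3, ], geojson_path, quiet = TRUE)
nc_geojson <- st_read(geojson_path, quiet = TRUE)
```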

Mapping with a latitude and longitude

Convert the latitude and longitude columns of a data table into spatial point coordinates.

k12_points <- st_as_sf(k12_df, coords = c("x", "y"), crs = 4326)
  • Note: the standard geographic coordinate system (GCS) is the World Geodetic System 1984 (WGS 1984).

  • The EPSG code (coordinate reference system number) for WGS 1984 is 4326
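
A short sketch of the same conversion on a made-up table (the `schools_df` rows and coordinates are hypothetical). Note that `coords` takes the x column (longitude) first and the y column (latitude) second. The reprojection step is an addition for illustration: EPSG 26918 (NAD83 / UTM zone 18N) is assumed here as a projected CRS suitable for the Hudson Valley.

```r
library(sf)

# Hypothetical point table with longitude in x and latitude in y
schools_df <- data.frame(
  name = c("School A", "School B"),
  x    = c(-73.93, -73.90),   # longitude
  y    = c(41.70, 41.75)      # latitude
)

# Declare the coordinates as WGS 1984 (EPSG 4326)
schools_pts <- st_as_sf(schools_df, coords = c("x", "y"), crs = 4326)

# Reproject to a projected CRS measured in meters (assumed: UTM zone 18N)
schools_utm <- st_transform(schools_pts, 26918)

st_crs(schools_pts)$input
st_is_longlat(schools_utm)
```

Setting `crs` in st_as_sf() only labels the coordinates; st_transform() is the step that actually changes them.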

Mapping with ggplot

Let’s recreate the map of children and TRI sites.

  • geom_sf() for spatial data
  • Let’s use the viridis package for the thematic layer
dutchess_with_indicator <- ggplot() + 
  geom_sf(data = dutchess_indicators, aes(fill = percent_children)) +
  scale_fill_viridis(name = "% Children") +
  geom_sf(data = tri_dutchess) + #add layer of TRI facilities
  labs(x = NULL,                      #no label on the x-axis
       y = NULL,                      #no label on the y-axis
       title = "Environmental Exposure in Dutchess County", #Map title
       subtitle = "Children and TRI Facilities",
       caption = "Source: U.S Census Bureau ACS 5-year Estimates | EPA TRI Inventory") + #Caption to add data source
  annotation_scale(location = "bl", width_hint = 0.4) + #scale bar added in bottom left corner
  annotation_north_arrow(location = "br", which_north = "true", #North arrow added in bottom right corner
        pad_x = unit(0.0, "in"), pad_y = unit(0.2, "in"), #format arrow placement
        style = north_arrow_minimal) +          #select north arrow style
    theme_minimal()    #set overall map layout theme, blank white background
dutchess_with_indicator
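
For reference, the code above draws on several packages: geom_sf() and labs() come from ggplot2, scale_fill_viridis() from viridis, and annotation_scale(), annotation_north_arrow(), and north_arrow_minimal from ggspatial. The sketch below is a pared-down version that needs only ggplot2 and sf. It assumes the North Carolina sample data bundled with sf as a stand-in for the tract layer, and swaps in ggplot2's own scale_fill_viridis_c() for the viridis package function.

```r
library(sf)
library(ggplot2)

# Sample polygons shipped with sf (stand-in for dutchess_indicators)
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

nc_map <- ggplot() +
  geom_sf(data = nc, aes(fill = BIR74)) +      # thematic layer
  scale_fill_viridis_c(name = "Births, 1974") + # ggplot2's built-in viridis scale
  labs(x = NULL, y = NULL,
       title = "Births by County in North Carolina") +
  theme_minimal()

nc_map
```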

Add a base map to a ggplot map

  • Load the prettymapr package to access base map options

  • Let’s use the popular OpenStreetMap basemap

dutchess_with_indicator_basemap <- ggplot() + 
  annotation_map_tile("osm") + #add a basemap using the prettymapr package
  geom_sf(data = tri_dutchess) + #add layer of TRI facilities
  labs(x = NULL,                      #no label on the x-axis
       y = NULL,                      #no label on the y-axis
       title = "Environmental Exposure in Dutchess County", #Map title
       subtitle = "TRI Facilities",
       caption = "Source: EPA TRI Inventory") + #Caption to add data source
  annotation_scale(location = "bl", width_hint = 0.4) + #scale bar added in bottom left corner
  annotation_north_arrow(location = "br", which_north = "true", #North arrow added in bottom right corner
        pad_x = unit(0.0, "in"), pad_y = unit(0.2, "in"), #format arrow placement
        style = north_arrow_minimal) +          #select north arrow style
    theme_minimal()    #set overall map layout theme, blank white background
dutchess_with_indicator_basemap

Use grepl to identify rows with matching strings

Subset the Dutchess County, NY schools data to only those that can be identified as high schools, middle schools, and elementary schools.

dutchess_schools_subset <- dutchess_schools %>% 
  filter(grepl('HIGH|MIDDLE|ELEMENTARY', `School Name`))

Use stringr to identify rows with matching strings

Achieve the same result using functions from the tidyverse stringr package.

dutchess_schools_subset2 <- dutchess_schools %>% 
  filter(str_detect(`School Name`, 'HIGH|MIDDLE|ELEMENTARY'))
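
A small sketch comparing the two approaches on made-up school names (the `schools` tibble is hypothetical). Both patterns are case-sensitive, so a name stored in mixed case is missed unless the match is told to ignore case; the `regex(..., ignore_case = TRUE)` variant is an addition for illustration.

```r
library(dplyr)
library(stringr)

# Hypothetical school names, one of them in mixed case
schools <- tibble(
  `School Name` = c("NORTH HIGH SCHOOL",
                    "RIVER MIDDLE SCHOOL",
                    "Oak Elementary School",
                    "COUNTY TECHNICAL CENTER")
)

pattern <- "HIGH|MIDDLE|ELEMENTARY"

base_match <- schools %>% filter(grepl(pattern, `School Name`))
tidy_match <- schools %>% filter(str_detect(`School Name`, pattern))

# Case-insensitive match also catches "Oak Elementary School"
ci_match <- schools %>%
  filter(str_detect(`School Name`, regex(pattern, ignore_case = TRUE)))

nrow(base_match)  # 2
nrow(ci_match)    # 3
```

Note the argument order: grepl() takes the pattern first, while str_detect() takes the string first.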

Use contains() to subset columns with matching strings

contains() is a dplyr selection helper.

heating_clean <- heating %>% 
  select(-contains(c("Margin of Error"))) %>% 
  mutate_all(str_trim) %>%  #Remove extra spaces at the beginning and end of all entries
  mutate_at(c(2:83), str_replace_all, ",", "") %>%  #Remove every comma from numeric column entries (str_replace would remove only the first)
  mutate_at(c(2:83), as.numeric)  # Coerce value columns to numeric
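
The mutate_all() and mutate_at() helpers are superseded in current dplyr by across(). A sketch of the same cleaning steps in that style follows, on a made-up table (`heating_toy` and its column names are hypothetical, not the real heating data). It uses str_remove_all() so that values with more than one comma still convert cleanly.

```r
library(dplyr)
library(stringr)

# Hypothetical raw table: every column arrives as text
heating_toy <- tibble(
  Label               = c(" Utility gas ", "Electricity"),
  `Estimate A`        = c(" 1,234,567 ", "890"),
  `Margin of Error A` = c("+/-1,200", "+/-45"),
  `Estimate B`        = c("2,345", " 12 ")
)

heating_toy_clean <- heating_toy %>%
  select(-contains("Margin of Error")) %>%        # drop margin-of-error columns
  mutate(across(everything(), str_trim)) %>%      # trim leading/trailing spaces
  mutate(across(-Label,                           # every column except the label
                ~ as.numeric(str_remove_all(.x, ","))))

heating_toy_clean
```

Selecting columns by name (here, everything except Label) avoids hard-coding positions such as 2:83, which break if the column count changes.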

Review of cleaning and wrangling functions

  • grepl() for pattern matching; it returns TRUE or FALSE for each element: grepl(pattern, variable)

  • contains() selects variables whose names contain a literal string

  • Mutate multiple columns:

    • mutate_all() applies the mutation to every variable

    • mutate_at() applies it to variables selected with a character vector, column positions, or vars()

Working with strings

A core tidyverse package, the stringr package contains several useful functions for working with strings in the tidy framework.

Strings represent a sequence of characters and can contain letters, numbers, symbols and even spaces.

  • str_extract() extracts the first match of a pattern

    str_extract(string, pattern)
  • str_replace() replaces matches with new text

    • str_replace() replaces the first match; str_replace_all() replaces all matches.
  • str_remove() removes the first matched pattern; str_remove_all() removes all matches

    str_remove(string, pattern)
  • str_trim() removes whitespace from the start and end of a string
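
The functions above can be tried on a few throwaway strings (the example strings are made up for illustration):

```r
library(stringr)

# Pull the first run of digits and dots out of a label
tract_id <- str_extract("Census Tract 1401.02", "[0-9.]+")         # "1401.02"

# First match only versus every match
first_only <- str_replace("1,234,567", ",", "")                    # "1234,567"
all_commas <- str_replace_all("1,234,567", ",", "")                # "1234567"

# Drop a matched pattern
county <- str_remove("Dutchess County, New York", ", New York")    # "Dutchess County"

# Strip leading and trailing whitespace
trimmed <- str_trim("  Dutchess County  ")                         # "Dutchess County"
```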

Learn more about these functions using the help resources within RStudio.