Lab 8 Mapping and data wrangling

Environmental justice in the Muhheakunnuk / Hudson Valley region

After you complete the Introduction to Exploratory Spatial Data Analysis activity, complete the steps below.

First, run all of the code from the Introduction to Exploratory Spatial Data Analysis activity.

Create a new Quarto document called Lab8_mapping_in_R.qmd and complete the steps below.

Step 1 (3 points)

Explore a dataset of all K-12 schools in New York State, Schools K-12_1.csv. This file is available in the folder you downloaded for the mapping activity. (Source: https://data.gis.ny.gov/maps/b6c624c740e4476689aa60fdc4aacb8f/about)

Read in data with the location and attributes of Kindergarten through 12th grade schools in NY State.

# Your code here

After you read in the .csv containing K-12 schools in New York State, use the st_as_sf function to create a spatial sf object with the latitude and longitude variables.

Convert the latitude and longitude columns to spatial coordinates using the st_as_sf function. We use EPSG code 4326, which identifies the most common geographic coordinate system, the World Geodetic System 1984 (WGS 84). WGS 84 is the coordinate system used in most GPS and online mapping software (e.g., Google Maps).

k12_points <- st_as_sf(k12_df, coords = c("x", "y"), crs = 4326)
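
As a minimal sketch of what this call does, the toy data frame below (hypothetical school names and approximate coordinates, not taken from the lab file) is converted to an sf object and its coordinate reference system is checked. It assumes, as in the line above, that longitude is stored in a column named x and latitude in a column named y.

```r
library(sf)

# Toy data: hypothetical schools with approximate lon/lat values (illustration only)
toy_df <- data.frame(
  name = c("School A", "School B"),
  x = c(-73.93, -73.90),  # longitude
  y = c(41.70, 41.75)     # latitude
)

# coords lists the longitude column first, then latitude; crs = 4326 declares WGS 84
toy_points <- st_as_sf(toy_df, coords = c("x", "y"), crs = 4326)

# Inspect the result: the EPSG code of the assigned CRS and the geometry type
toy_epsg <- st_crs(toy_points)$epsg
toy_geom_types <- st_geometry_type(toy_points)
toy_epsg
toy_geom_types
```

Checking st_crs() right after the conversion is a quick way to confirm that the coordinates were tagged with the intended reference system before mapping them with other layers.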

Take a look at the K-12 school locations

Create a basic map plot of the location of schools in the k12_points dataset using the sf package plot function.

plot(k12_points)

Step 2 (3 points)

Subset the schools to include only those observations in Dutchess County, NY.

#Your code here

Step 3 (6 points)

Map the K-12 school locations using tmap in “view” mode.

  • Change the color of the points to red.
# Your code here

Subset using grepl

In this step, we subset the schools to only include high schools, middle schools, and elementary schools.

  • We identify the column with the needed information and write code to subset the schools to only those that are identified as high schools, middle schools, and elementary schools.

    • Use the grepl function to identify rows that contain the strings that represent high schools, middle schools, and elementary schools.

    • Your result should include 50 observations.

dutchess_schools_subset <- dutchess_schools %>% 
  filter(grepl('HIGH|MIDDLE|ELEMENTARY', `School Name`))

Subset using stringr

We can achieve the same result using functions from the tidyverse stringr package.

dutchess_schools_subset2 <- dutchess_schools %>% 
  filter(str_detect(`School Name`, 'HIGH|MIDDLE|ELEMENTARY'))
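
To see that the two approaches agree, here is a small sketch on a toy vector of hypothetical school names (not from the lab dataset). Note that the argument order differs: grepl() takes the pattern first, while str_detect() takes the string first.

```r
library(stringr)

# Toy vector of hypothetical school names (illustration only)
toy_names <- c("NORTH HIGH SCHOOL", "EXAMPLE MIDDLE SCHOOL",
               "SAMPLE ELEMENTARY SCHOOL", "EXAMPLE ACADEMY")

# The | operator means "or": a name matches if it contains any of the three words
pattern <- "HIGH|MIDDLE|ELEMENTARY"

base_match <- grepl(pattern, toy_names)          # base R: pattern, then string
stringr_match <- str_detect(toy_names, pattern)  # stringr: string, then pattern

# Both calls return a logical vector with one entry per name
identical(base_match, stringr_match)
toy_names[base_match]
```

Both functions are case-sensitive by default, so a name written in lowercase would not match this pattern.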

Step 4 (14 points)

Create a ggplot map to display the location of the schools from the previous step (subset using grepl or stringr) alongside TRI sites. You will also add some of the base map layers from the mapping in R activity.

  • Each point location dataset, TRI and schools, should be displayed with different shapes and colors to differentiate them on the map.

    • Be sure that you have run the code from the Introduction to Exploratory Spatial Data Analysis activity. You will need to use the TRI facilities for only Dutchess County, NY.
  • Add the cities and city labels layers from the ESDA mapping in R activity. Symbolize appropriately so that they are visible on the map.

  • Include a title, subtitle, and a caption with the source of the TRI and schools data

  • Include a north arrow

  • Include a scale bar

  • Save a version of your map as a .png file

  • Provide a brief interpretation of your map: What patterns do you observe? Use cardinal directions to describe spatial patterns. Are there areas with clusters of both schools and TRI sites?

# Your code here

Step 5 (6 points)

Add a basemap annotation layer to your ggplot map from the previous step.

Add an annotation_map_tile() layer after you call the ggplot function. Set the zoom = argument to display a clear, readable base map underneath the geom_sf() point layers.

# Your code here

Step 6 (12 points)

Finally, let’s add an additional indicator directly from its source, the U.S. Census Bureau website data.census.gov. We will process this untidy downloaded data and then join the table with the census tract data frame dutchess_indicators.

We will download two variables in order to create an indicator of the percentage of housing units using wood-fired heating:

Download the table:

  • Select More Tools (button with three horizontal dots near the upper right corner of the webpage)

  • Select .CSV

Read in your dataset to a new object called heating.

# Your code here

Some data cleaning before you proceed to the next step

First, let’s remove the margin of error columns using the dplyr function contains(), and we’ll coerce the numeric columns to be the numeric data type.

heating_clean <- heating %>% 
  select(-contains(c("Margin of Error"))) %>% 
  mutate_all(str_trim) %>%  #Remove extra spaces at the beginning and end of all entries
  mutate_at(c(2:83), str_replace,",","") %>%  #Remove the comma from numeric column entries
  mutate_at(c(2:83), as.numeric)  # Coerce all value columns to numeric
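
The same cleaning steps can also be sketched with across(), which dplyr documents as the replacement for the superseded mutate_all() and mutate_at(). The toy table below uses hypothetical column names and values that only mimic the layout of the download. Selecting the value columns by name with ends_with() is an assumption that avoids hard-coding column positions, and str_remove_all() removes every comma rather than only the first one.

```r
library(dplyr)
library(stringr)

# Toy table mimicking the downloaded layout (hypothetical names and values)
toy_heating <- tibble(
  `Label (Grouping)` = c("Total:", "    Wood"),
  `Tract 1!!Estimate` = c(" 1,250", "40 "),
  `Tract 1!!Margin of Error` = c("+/-110", "+/-25"),
  `Tract 2!!Estimate` = c("980", " 12")
)

toy_clean <- toy_heating %>%
  select(-contains("Margin of Error")) %>%     # drop the margin-of-error columns
  mutate(across(everything(), str_trim)) %>%   # trim leading and trailing spaces
  mutate(across(ends_with("Estimate"),         # value columns only
                ~ as.numeric(str_remove_all(.x, ","))))  # strip commas, then coerce

toy_clean
```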

Step 7 (10 points)

Pivot heating_clean longer.

For this step:

  • Pivot the heating_clean data frame to create the heating_longer data frame.

  • Create a column with the tract identifier called tract_ID.

  • Rename the `Label (Grouping)` variable to Variable.

  • Finally, coerce the estimate variable to numeric.

# Your code here

Some additional data cleaning before you proceed to Step 8

Let’s remove the unnecessary and repetitive part of the string in the tract_ID variable: “; Dutchess County; New York!!Estimate”

#Extract substring before the semicolon using regex
heating_longer <- heating_longer %>% 
  mutate(tract_ID = str_extract(tract_ID, "[^;]+"))

# Remove the colon and remove the extra spaces in the Variable column
heating_longer <- heating_longer %>% 
  mutate(Variable = str_remove(Variable, ":")) %>%  #Remove the colon : from all Total variable entries
  mutate(Variable = str_trim(Variable)) #Remove the extra spaces at the beginning or end of strings in the Variable column
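
To see how these string operations behave, here is a short sketch on single toy values. The tract number and the "Census Tract" prefix are assumptions for illustration. Only the suffix after the first semicolon is taken from the text above.

```r
library(stringr)

# Example string in the format quoted above; the tract number is hypothetical
toy_id <- "Census Tract 1234.56; Dutchess County; New York!!Estimate"

# "[^;]+" matches one or more characters that are not a semicolon, so
# str_extract() returns everything up to, but not including, the first ";"
toy_id_clean <- str_extract(toy_id, "[^;]+")
toy_id_clean

# The same two cleaning steps applied to a toy Variable entry:
# remove the colon, then trim the surrounding spaces
toy_variable_clean <- str_trim(str_remove("    Total:", ":"))
toy_variable_clean
```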

Step 8 (7 points)

Subset heating_longer to include only those observations where Variable is Total (the total number of housing units) or Wood (the wood heating type). Use tidyverse functions.

#Your code here

Step 9 (6 points)

Pivot wider to create a column with the total number of housing units.

#Your code here

Step 10 (7 points)

Create a new variable with the percentage of total housing units that use wood for heating. It is important to normalize when mapping to minimize the distortion of spatial patterns when viewing data aggregated to geographic units of different sizes.
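
To illustrate why normalization matters, the toy table below (hypothetical tracts and counts) has two tracts with the same number of wood-heated units but very different totals. The column name pct_wood is an assumption chosen for this sketch.

```r
library(dplyr)

# Toy wide table: one row per tract (hypothetical values)
toy_wide <- tibble(
  tract_ID = c("Tract A", "Tract B"),
  Total = c(2000, 400),  # total housing units
  Wood = c(40, 40)       # housing units heated with wood
)

# Percentage of housing units heated with wood: count divided by total, times 100
toy_wide <- toy_wide %>%
  mutate(pct_wood = 100 * Wood / Total)

toy_wide
```

A map of the raw Wood counts would shade both tracts identically, while the percentage shows that wood heating is five times as prevalent in Tract B.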

#Your code here

Step 11 (6 points)

Join the resulting table with the original Census tract indicators in dutchess_indicators.
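
A join only matches rows whose key values are identical, so it can help to check for unmatched keys first. The sketch below uses two toy tables with hypothetical column names (NAME, pop, tract_ID, pct_wood). The key columns in dutchess_indicators and in your own table may be named differently.

```r
library(dplyr)

# Toy stand-in for the tract indicators table (hypothetical values)
toy_indicators <- tibble(
  NAME = c("Census Tract 1", "Census Tract 2", "Census Tract 3"),
  pop = c(3100, 2800, 4500)
)

# Toy stand-in for the processed heating table (hypothetical values)
toy_heat <- tibble(
  tract_ID = c("Census Tract 1", "Census Tract 2"),
  pct_wood = c(2, 10)
)

# Rows of the first table with no match in the second; useful for spotting key mismatches
unmatched <- anti_join(toy_indicators, toy_heat, by = c("NAME" = "tract_ID"))

# left_join() keeps every row of the first table and fills unmatched values with NA
toy_joined <- left_join(toy_indicators, toy_heat, by = c("NAME" = "tract_ID"))

unmatched
toy_joined
```

If the table on the left is an sf object, keeping it as the first argument of left_join() should preserve its geometry column for mapping.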

#Your code here

Step 12 (8 points)

Use ggplot to map the newly added variable containing the percentage of housing units using wood for heat.

  • Include a title, subtitle, and a caption with the source of the heating data

  • Use the scale_fill_viridis function to generate the color scheme.

  • Include a clear, brief legend title

  • Include a north arrow

  • Include a scale bar

  • Include a basemap annotation layer at the appropriate zoom level

  • Save a version of your map as a .png file

  • Include a brief description of your map inside the code chunk.

#Your code here

Step 13 (12 points)

Use a tmap or ggplot map (or maps if you like) to help you answer the following questions:

  1. What is the geographic relationship between active TRI facilities and flood zones?
  2. Which two cities appear to have the most TRI sites?
#Your code here

Extra Credit (+10 points on the assignment grade)

  • Download and process another indicator from the U.S. Census Bureau using the steps laid out above.

  • Create a ggplot map with all map elements included in Step 12.

  • The indicator you map needs to be normalized.

  • Provide a brief interpretation of your map.