# Your code here
Lab 8 Mapping and data wrangling
Environmental justice in the Muhheakunnuk / Hudson Valley region
After you complete the Introduction to Exploratory Spatial Data Analysis activity, complete the steps below.
First, run all of the code from the Introduction to Exploratory Spatial Data Analysis activity.
Create a new Quarto document called Lab8_mapping_in_R.qmd and complete the steps below.
Step 1 (3 points)
Explore a dataset of all K-12 schools in New York State, Schools K-12_1.csv. This file is available in the folder you downloaded for the mapping activity. (Source: https://data.gis.ny.gov/maps/b6c624c740e4476689aa60fdc4aacb8f/about)
Read in data with the location and attributes of Kindergarten through 12th grade schools in NY State.
After you read in the .csv containing K-12 schools in New York State, use the st_as_sf function to create a spatial sf object with the latitude and longitude variables.
Set the latitude and longitude as spatial coordinates using the st_as_sf function. We use the EPSG code 4326 so that the school locations are stored in the most common geographic coordinate system, World Geodetic System 1984 (WGS 84). WGS 84 is the coordinate system used in most GPS and online mapping software (e.g. Google Maps).
k12_points <- st_as_sf(k12_df, coords = c("x", "y"), crs = 4326)
Take a look at the K-12 school locations
Create a basic map plot of the location of schools in the k12_points object using the sf package plot function.
plot(k12_points)
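As a minimal sketch of the conversion above, the following uses a made-up data frame (`toy_df` and its two locations are hypothetical, not taken from the schools file) to show how `st_as_sf` turns longitude and latitude columns into point geometry and how the assigned coordinate system can be checked.

```r
library(sf)

# Hypothetical toy data: two made-up locations with longitude (x) and latitude (y)
toy_df <- data.frame(
  name = c("School A", "School B"),
  x = c(-73.93, -73.90),
  y = c(41.70, 41.75)
)

# coords takes the longitude column first, then the latitude column
toy_points <- st_as_sf(toy_df, coords = c("x", "y"), crs = 4326)

# Check which EPSG code was assigned to the new sf object
st_crs(toy_points)$epsg

# Plot only the geometry, which avoids drawing one panel per attribute column
plot(st_geometry(toy_points))
```

Calling `plot()` on the full sf object draws a separate panel for each attribute, so plotting `st_geometry()` can be easier to read when you only want to see locations.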
Step 2 (3 points)
Subset the schools to include only those observations in Dutchess County, NY.
#Your code here
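One possible approach, sketched on toy data: the column name `County` and the values below are assumptions for illustration, so check the actual column names in your schools object (for example with `names()`) before adapting it.

```r
library(dplyr)

# Hypothetical toy data; the real county column name and values may differ
toy_schools <- tibble(
  `School Name` = c("ALPHA HIGH SCHOOL", "BETA ELEMENTARY SCHOOL", "GAMMA MIDDLE SCHOOL"),
  County = c("DUTCHESS", "ULSTER", "Dutchess")
)

# Keep only rows whose county matches, ignoring letter case
toy_dutchess <- toy_schools %>%
  filter(toupper(County) == "DUTCHESS")

toy_dutchess
```

Comparing on `toupper()` guards against inconsistent capitalization in the county values.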
Step 3 (6 points)
Map the K-12 school locations using tmap in “view” mode.
- Change the color of the points to red.
# Your code here
Subset using grepl
In this step, we subset the schools to only include high schools, middle schools, and elementary schools.
We identify the column with the needed information and write code to subset the schools to only those that are identified as high schools, middle schools, and elementary schools.
Use the grepl function to identify rows that contain the strings that represent high schools, middle schools, and elementary schools. Your result should include 50 observations.
dutchess_schools_subset <- dutchess_schools %>%
  filter(grepl('HIGH|MIDDLE|ELEMENTARY', `School Name`))
Subset using stringr
We can achieve the same result using functions from the tidyverse stringr package.
dutchess_schools_subset2 <- dutchess_schools %>%
  filter(str_detect(`School Name`, 'HIGH|MIDDLE|ELEMENTARY'))
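To see that the two approaches agree, here is a small sketch on a made-up character vector (the school names are hypothetical). It also highlights the main difference between the functions: the order of the arguments.

```r
library(stringr)

# Hypothetical school names for illustration
school_names <- c("ALPHA HIGH SCHOOL", "BETA ELEMENTARY SCHOOL",
                  "GAMMA MIDDLE SCHOOL", "DELTA ACADEMY")

# The | symbol means "or" in a regular expression
pattern <- "HIGH|MIDDLE|ELEMENTARY"

# Base R: the pattern comes first, the text second
base_match <- grepl(pattern, school_names)

# stringr: the text comes first, the pattern second
stringr_match <- str_detect(school_names, pattern)

# Both return the same logical vector
identical(base_match, stringr_match)
```

Both functions are case sensitive by default. If the names were stored in mixed case, `grepl(pattern, school_names, ignore.case = TRUE)` or `str_detect(school_names, regex(pattern, ignore_case = TRUE))` would match regardless of capitalization.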
Step 4 (14 points)
Create a ggplot map to display the location of the schools from the previous step (subset using grepl or stringr) alongside TRI sites. You will also add some of the base map layers from the mapping in R activity.
Each point location dataset, TRI and schools, should be displayed with different shapes and colors to differentiate them on the map.
- Be sure that you have run the code from the Introduction to Exploratory Spatial Data Analysis activity. You will need to use the TRI facilities for only Dutchess County, NY.
Add the cities and city labels layers from the ESDA mapping in R activity. Symbolize appropriately so that they are visible on the map.
Include a title, subtitle, and a caption with the source of the TRI and schools data
Include a north arrow
Include a scale bar
Save a version of your map as a .png file
Provide a brief interpretation of your map: What patterns do you observe? Use cardinal directions to describe spatial patterns. Are there areas with clusters of both schools and TRI sites?
# Your code here
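The structure of such a map can be sketched with made-up points. Everything named `toy_*` below is hypothetical and stands in for your own schools and TRI layers; the colors, shapes, and text are placeholders rather than required choices.

```r
library(sf)
library(ggplot2)
library(ggspatial)

# Hypothetical stand-ins for the schools and TRI point layers
toy_schools <- st_as_sf(
  data.frame(x = c(-73.93, -73.90, -73.85), y = c(41.70, 41.75, 41.65)),
  coords = c("x", "y"), crs = 4326
)
toy_tri <- st_as_sf(
  data.frame(x = c(-73.92, -73.88), y = c(41.72, 41.68)),
  coords = c("x", "y"), crs = 4326
)

toy_map <- ggplot() +
  # Mapping a constant label to color and shape creates a legend entry per layer
  geom_sf(data = toy_schools, aes(color = "Schools", shape = "Schools"), size = 3) +
  geom_sf(data = toy_tri, aes(color = "TRI sites", shape = "TRI sites"), size = 3) +
  scale_color_manual(name = "Layer",
                     values = c("Schools" = "blue", "TRI sites" = "red")) +
  scale_shape_manual(name = "Layer",
                     values = c("Schools" = 16, "TRI sites" = 17)) +
  annotation_north_arrow(location = "tr") +  # north arrow, top right
  annotation_scale(location = "bl") +        # scale bar, bottom left
  labs(title = "Toy map title",
       subtitle = "Toy subtitle",
       caption = "Source: made-up points for illustration") +
  theme_minimal()
```

Giving both scales the same `name` lets ggplot merge color and shape into a single legend. A call such as `ggsave("my_map.png", toy_map, width = 6, height = 4)` would then write the map to a .png file.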
Step 5 (6 points)
Add a basemap annotation layer to your ggplot map from the previous step.
Add an annotation_map_tile() layer after you call the ggplot function. Set the zoom = argument to display a clear, readable base map underneath the geom_sf() point layers.
# Your code here
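A rough sketch of the layer order, using made-up points (`toy_points` is hypothetical) and an assumed zoom level of 10 that you would adjust for your own map. The tile layer comes first so that the points are drawn on top of it; tiles are downloaded when the map is printed or saved, so that step needs an internet connection.

```r
library(sf)
library(ggplot2)
library(ggspatial)

# Hypothetical points standing in for your schools or TRI layer
toy_points <- st_as_sf(
  data.frame(x = c(-73.93, -73.90), y = c(41.70, 41.75)),
  coords = c("x", "y"), crs = 4326
)

toy_basemap <- ggplot() +
  # Basemap tiles go first; zoom = 10 is an assumed starting value
  annotation_map_tile(type = "osm", zoom = 10) +
  # Point layers added afterwards are drawn over the tiles
  geom_sf(data = toy_points, color = "red", size = 3)
```

Higher zoom values request more detailed tiles and take longer to download, so it can help to increase the value one step at a time until labels are readable.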
Step 6 (12 points)
Finally, let’s add an additional indicator from its source, the U.S. Census Bureau data.census.gov. We will process this directly downloaded untidy data and then join the table with the census tract data frame dutchess_indicators.
We will download two variables in order to create an indicator of the percentage of housing units using wood fired heating:
Total housing units
The count of housing units using wood fired heating. Smoke from wood burning stoves and fireplaces is a major contributor to wintertime air pollution.
Download the table:
Select More Tools (button with three horizontal dots near the upper right corner of the webpage)
Select .CSV
Read in your dataset to a new object called heating.
# Your code here
Some data cleaning before you proceed to the next step
First, let’s remove the margin of error columns using the dplyr function contains(), and we’ll coerce the numeric columns to be the numeric data type.
heating_clean <- heating %>%
  select(-contains(c("Margin of Error"))) %>%
  mutate_all(str_trim) %>% # Remove extra spaces at the beginning and end of all entries
  mutate_at(c(2:83), str_replace, ",", "") %>% # Remove the comma from numeric column entries
  mutate_at(c(2:83), as.numeric) # Coerce all value columns to numeric
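The same cleaning steps can be tried on a tiny made-up table to see what each one does. This sketch is an alternative formulation, not the lab's required code: it swaps in `across()`, which the dplyr documentation lists as superseding `mutate_all()` and `mutate_at()`, and `str_replace_all()`, which removes every comma rather than only the first. The table `toy_heating` and its column names are hypothetical.

```r
library(dplyr)
library(stringr)

# Hypothetical untidy table that imitates the shape of a census download
toy_heating <- tibble(
  `Label (Grouping)` = c("Total:", "    Wood"),
  `Tract A!!Estimate` = c(" 1,250", "40 "),
  `Tract A!!Margin of Error` = c("+/-100", "+/-12"),
  `Tract B!!Estimate` = c("980", "15")
)

toy_clean <- toy_heating %>%
  # Drop every column whose name contains "Margin of Error"
  select(-contains("Margin of Error")) %>%
  # Trim leading and trailing spaces in all columns
  mutate(across(everything(), str_trim)) %>%
  # In every column except the first, remove commas and convert to numeric
  mutate(across(-1, ~ as.numeric(str_replace_all(.x, ",", ""))))

toy_clean
```

Selecting columns with `-1` instead of a fixed range such as `2:83` keeps the code working if the number of tract columns changes.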
Step 7 (10 points)
Pivot heating_clean longer.
For this step:
Pivot the heating_clean data frame to create the heating_longer data frame.
Create a column with the tract identifier called tract_ID.
Rename the `Label (Grouping)` variable to Variable.
Finally, coerce the estimate variable to numeric.
# Your code here
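The pivot described above might look like the following on a small made-up table. The table `toy_clean`, its tract names, and its values are hypothetical; only the column names `tract_ID`, `Variable`, and `estimate` come from the step instructions.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one row per heating variable, one column per tract
toy_clean <- tibble(
  `Label (Grouping)` = c("Total:", "Wood"),
  `Tract A; Dutchess County; New York!!Estimate` = c(1250, 40),
  `Tract B; Dutchess County; New York!!Estimate` = c(980, 15)
)

toy_longer <- toy_clean %>%
  pivot_longer(
    cols = -`Label (Grouping)`,  # pivot every column except the label column
    names_to = "tract_ID",       # old column names become tract identifiers
    values_to = "estimate"       # cell values go into a single estimate column
  ) %>%
  rename(Variable = `Label (Grouping)`) %>%
  mutate(estimate = as.numeric(estimate))

toy_longer
```

After pivoting, each row holds one variable for one tract, which is the layout that later filtering and widening steps expect.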
Some additional data cleaning before you proceed to Step 8
Let’s remove the unnecessary and repetitive part of the string in the tract_ID variable: “; Dutchess County; New York!!Estimate”
# Extract the substring before the first semicolon using regex
heating_longer <- heating_longer %>%
  mutate(tract_ID = str_extract(tract_ID, "[^;]+"))
# Remove the colon and remove the extra spaces in the Variable column
heating_longer <- heating_longer %>%
  mutate(Variable = str_remove(Variable, ":")) %>% # Remove the colon : from all Total variable entries
  mutate(Variable = str_trim(Variable)) # Remove the extra spaces at the beginning or end of strings in the Variable column
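To see what these string functions do, it may help to run them on single strings first. The example strings below are hypothetical; only the suffix "; Dutchess County; New York!!Estimate" comes from the text above.

```r
library(stringr)

# Hypothetical raw tract identifier with the repetitive suffix
raw_id <- "Census Tract 100.01; Dutchess County; New York!!Estimate"

# "[^;]+" matches one or more characters that are not a semicolon,
# so str_extract returns everything before the first semicolon
clean_id <- str_extract(raw_id, "[^;]+")
clean_id

# Hypothetical raw label with a colon and extra spaces
raw_label <- "  Total: "

# Remove the first colon, then trim spaces from both ends
clean_label <- str_trim(str_remove(raw_label, ":"))
clean_label
```

Here `clean_id` is "Census Tract 100.01" and `clean_label` is "Total".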
Step 8 (7 points)
Subset heating_longer to include only those observations with the total number of housing units (Total) and the Wood heating type. Use tidyverse functions.
#Your code here
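One way to keep several categories at once is `filter()` with `%in%`, sketched here on made-up data. The table `toy_longer` and the extra "Utility gas" category are hypothetical.

```r
library(dplyr)

# Hypothetical long table with three heating variables per tract
toy_longer <- tibble(
  tract_ID = rep(c("Tract A", "Tract B"), each = 3),
  Variable = rep(c("Total", "Utility gas", "Wood"), times = 2),
  estimate = c(1250, 700, 40, 980, 500, 15)
)

# Keep rows whose Variable is one of the listed values
toy_subset <- toy_longer %>%
  filter(Variable %in% c("Total", "Wood"))

toy_subset
```

`%in%` checks each value against the whole set, which is shorter and less error-prone than chaining several `==` comparisons with `|`.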
Step 9 (6 points)
Pivot wider to create a column with the total number of housing units.
#Your code here
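As a sketch of the reshaping, the following widens a made-up long table so that each heating variable becomes its own column. The table `toy_subset` and its values are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical long table: two rows (Total and Wood) per tract
toy_subset <- tibble(
  tract_ID = c("Tract A", "Tract A", "Tract B", "Tract B"),
  Variable = c("Total", "Wood", "Total", "Wood"),
  estimate = c(1250, 40, 980, 15)
)

# Each distinct value of Variable becomes a column filled from estimate
toy_wider <- toy_subset %>%
  pivot_wider(names_from = Variable, values_from = estimate)

toy_wider
```

The result has one row per tract with separate `Total` and `Wood` columns, which makes it straightforward to compute a ratio between them.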
Step 10 (7 points)
Create a new variable with the percentage of total housing units that use wood for heating. It is important to normalize when mapping to minimize the distortion of spatial patterns when viewing data aggregated to geographic units of different sizes.
#Your code here
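The normalization can be illustrated with a few made-up tracts. The table `toy_wider`, its values, and the column name `pct_wood` are hypothetical; the guard against a zero denominator is an added assumption about how you may want to treat tracts with no housing units.

```r
library(dplyr)

# Hypothetical wide table; Tract C has no housing units
toy_wider <- tibble(
  tract_ID = c("Tract A", "Tract B", "Tract C"),
  Total = c(1250, 980, 0),
  Wood = c(40, 15, 0)
)

toy_pct <- toy_wider %>%
  mutate(
    # Percentage of housing units heated with wood;
    # return NA instead of dividing by zero
    pct_wood = if_else(Total > 0, 100 * Wood / Total, NA_real_)
  )

toy_pct
```

Tract A has more wood-heated units than Tract B in absolute terms (40 versus 15), and the percentage shows how large that count is relative to each tract's housing stock (3.2% versus about 1.5%).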
Step 11 (6 points)
Join the resulting table with the original Census tract indicators in dutchess_indicators.
#Your code here
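A join of this kind can be sketched with two made-up tables. The shared key name `tract_ID`, the `population` column, and all values are assumptions for illustration; in your data the key columns in the two tables may have different names or formats and may need to be matched first.

```r
library(dplyr)

# Hypothetical indicators table with one row per tract
toy_indicators <- tibble(
  tract_ID = c("Tract A", "Tract B", "Tract C"),
  population = c(3000, 2500, 4100)
)

# Hypothetical heating table that is missing Tract C
toy_pct <- tibble(
  tract_ID = c("Tract A", "Tract B"),
  pct_wood = c(3.2, 1.5)
)

# left_join keeps every row of the first table and adds matching columns
toy_joined <- left_join(toy_indicators, toy_pct, by = "tract_ID")

toy_joined
```

Rows without a match receive `NA` in the new columns, so counting missing values after the join (for example with `sum(is.na(toy_joined$pct_wood))`) is a quick way to spot identifiers that did not line up. Putting the spatial table on the left side of the join is what keeps its geometry column in the result.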
Step 12 (8 points)
Use ggplot to map the newly added variable containing the percentage of housing units using wood for heat.
Include a title, subtitle, and a caption with the source of the heating data
Use the scale_fill_viridis function to generate the color scheme.
Include a clear, brief legend title
Include a north arrow
Include a scale bar
Include a basemap annotation layer at the appropriate zoom level
Save a version of your map as a .png file
Include a brief description of your map inside the code chunk.
#Your code here
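The fill mapping and legend title can be sketched with two made-up square polygons. All `toy_*` names and values are hypothetical, and this sketch swaps in ggplot2's built-in `scale_fill_viridis_c()` to avoid an extra package; `scale_fill_viridis()` from the viridis package plays the same role and also takes a `name` for the legend title.

```r
library(sf)
library(ggplot2)

# Hypothetical helper that builds a 0.1-degree square from its lower-left corner
toy_square <- function(x0, y0) {
  st_polygon(list(rbind(
    c(x0, y0), c(x0 + 0.1, y0), c(x0 + 0.1, y0 + 0.1), c(x0, y0 + 0.1), c(x0, y0)
  )))
}

# Two made-up "tracts" with a made-up percentage value
toy_tracts <- st_sf(
  pct_wood = c(3.2, 1.5),
  geometry = st_sfc(toy_square(-73.9, 41.7), toy_square(-73.8, 41.7), crs = 4326)
)

toy_choropleth <- ggplot(toy_tracts) +
  # Fill each polygon according to the normalized indicator
  geom_sf(aes(fill = pct_wood)) +
  # name sets the legend title
  scale_fill_viridis_c(name = "Wood heat (%)") +
  labs(title = "Toy choropleth title",
       subtitle = "Toy subtitle",
       caption = "Source: made-up values for illustration") +
  theme_minimal()
```

The north arrow, scale bar, and basemap tile layer would be added to this object in the same way as for a point map.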
Step 13 (12 points)
Use a tmap or ggplot map (or maps if you like) to help you answer the following questions:
- What is the geographic relationship between active TRI facilities and flood zones?
- Which two cities appear to have the most TRI sites?
#Your code here
Extra Credit (+10 points on the assignment grade)
Download and process another indicator from U.S. Census Bureau using the steps laid out above.
Create a ggplot map with all map elements included in Step 12.
The indicator you map needs to be normalized.
Provide a brief interpretation of your map.