Explore US Cities

Intro to Data Analytics

Jordan Ayala

Exploring Metropolitan Areas in the US

Practicing R Basics

Before we start, some review

  1. How do you install a package in R?
  2. How do you load a package into your R session?
  3. What are two ways to create the sequence of numbers 10, 11, 12, ..., 25?
  4. Create a vector containing all of the positive even numbers smaller than 60.
  5. Create a vector of numbers that starts at 6, does not pass 55, and adds numbers in increments of 4/7: 6, 6 + 4/7, 6 + 8/7, and so on.
    • How many numbers does the vector have? Hint: use seq() and length().
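
As a reminder of the tools these review questions rely on, here is a minimal sketch using different numbers from the questions above. The object names (a, b, evens, quarters) are made up for illustration.

```r
# Two ways to build the integers 1 through 5
a <- 1:5
b <- seq(1, 5)

# seq() with a step size: from 2 up to (at most) 10, in steps of 2
evens <- seq(2, 10, by = 2)

# Steps do not have to be whole numbers
quarters <- seq(0, 1, by = 0.25)

# length() counts the entries in a vector
length(quarters)
```

The `to` value in seq() is an upper limit: the sequence stops at the last value that does not pass it.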

Topics

Corresponds to IDS 2.3-2.9 and IMS Ch 1

  1. Data frames

  2. The accessor $

  3. Sorting

  4. Indexing

Getting started

  1. First, open up your CMSC 121 RStudio project folder and select the .Rproj file to open RStudio.

  2. Within RStudio, open the Files pane (bottom right window) and navigate to your /class_activities folder.

  3. Create a new Quarto document for today’s in-class activity called us_cities_activity_Rbasics.qmd

  4. Create a code chunk in the Quarto document and load the tidyverse package in your RStudio environment using library(tidyverse). The tidyverse is a collection of packages; one of them, readr, helps us cleanly read in data files.

Exploring Metropolitan Areas in the US

  • How do metropolitan areas in the US compare?

    • Which metro area is the largest by land area?

    • What are the top 10 metro areas by population?

    • Where does it cost the most to rent a home?

    • Which areas have the highest median income?

Bring data into R

  1. In the Files pane, create a new folder called data within your /class_activities folder.

  2. Download the us_metro_areas.csv data set here (link), and place the file in your new /data folder.

  3. Double check your working directory by running getwd(). The result shown in the Console pane should be similar to "/Users/username/Documents/CMSC 121".

  4. Create a new code chunk in your Quarto document and read in the .csv using the following code:

    us_metro_areas <- read_csv("class_activities/data/us_metro_areas.csv")
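
If read_csv() reports that the file does not exist, the path is probably not being resolved from the folder you expect. A minimal sketch of a check you could run first, using only base R; the object names csv_path and found are made up for illustration:

```r
# Path copied from the read_csv() call above; it is relative to the working directory
csv_path <- "class_activities/data/us_metro_areas.csv"

# TRUE if the file can be found from the current working directory
found <- file.exists(csv_path)

if (!found) {
  # Show where R is currently looking so the path can be adjusted
  message("File not found. Working directory is: ", getwd())
}
```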

Explore the data

  1. What is the class of the us_metro_areas object?

  2. Use the head() function to explore the object

  3. Open the object in a table using the Environment pane

  4. Use the names() function to see all of the columns in the object
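
To see what these functions return before trying them on the real data, here is a small sketch on a made-up data frame. The name toy_metros and all of its values are hypothetical and are not taken from us_metro_areas.csv.

```r
# Made-up data frame for illustration only
toy_metros <- data.frame(
  city = c("Metro A", "Metro B", "Metro C"),
  total_pop = c(120000, 450000, 80000),
  stringsAsFactors = FALSE
)

class(toy_metros)    # what kind of object this is
head(toy_metros, 2)  # the first two rows
names(toy_metros)    # the column names
```

An object created with read_csv() is a tibble, so class() will likely report a few extra classes alongside "data.frame".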

Practice R Basics

  1. Assign a vector of the total population for all US metro areas to the object pop
  2. What is the class of the object pop?
  3. How many entries are in the total population vector object?
  4. What is the total land area of all metro areas in the US?

Accessing using [[

We can also use double square brackets ([[) to access a vector in a data frame:

us_metro_areas[["pct_children"]] # Access the column with the percentage of the population under age 18

Next, access the area in square miles of US metro areas and assign the vector to the object area
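
A short sketch of how $ and [[ relate, using a made-up data frame (toy_metros and its values are hypothetical). Because [[ takes the column name as a string, it also handles names that contain spaces.

```r
# Made-up data frame; check.names = FALSE keeps the spaces in the column name
toy_metros <- data.frame(
  pct_children = c(21.5, 18.2, 24.0),
  `area sq miles` = c(310, 1250, 95),
  check.names = FALSE
)

# $ and [[ return the same vector
with_dollar <- toy_metros$pct_children
with_brackets <- toy_metros[["pct_children"]]
same_vector <- identical(with_dollar, with_brackets)

# A column name with spaces is passed to [[ as an ordinary string
toy_area <- toy_metros[["area sq miles"]]

# Single brackets return a one-column data frame rather than a vector
one_column <- toy_metros["pct_children"]
class(one_column)
```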

Coercion

  1. Create a vector median_hh_income containing the median household income for US metro areas
  2. What is the class of the median household income column?

Coercion

  1. Coerce the object from numeric to character
  2. Calculate the average median household income for US metro areas and assign it to the object metro_average_income
  3. Coerce the vector object back to numeric.
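
The steps above can be sketched on a small made-up vector (income and its values are hypothetical) to show why the class of a vector matters:

```r
# Made-up income values for illustration only
income <- c(52000, 61000, 48500)
class(income)

# Coerce to character: the values become text
income_chr <- as.character(income)
class(income_chr)

# mean() cannot average text; it returns NA with a warning
# (the warning is suppressed here to keep the output clean)
avg_chr <- suppressWarnings(mean(income_chr))

# Coerce back to numeric and the calculation works again
income_num <- as.numeric(income_chr)
avg_num <- mean(income_num)
```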

Sorting and subsetting

  • Which metro areas have the largest population?

  • Which metro areas have the highest median income?

What other questions could we ask?

Sorting

  1. Use the $ operator to access the population total data and store it as the object pop.
  2. Then use the sort function to redefine pop so that it is sorted.
  3. Finally, use the [ operator to report the smallest population size: pop[1]
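
A minimal sketch of the same pattern on a made-up vector (toy_pop and its values are hypothetical):

```r
# Made-up population values for illustration only
toy_pop <- c(450000, 80000, 120000)

# sort() returns the values from smallest to largest by default
toy_sorted <- sort(toy_pop)

# After sorting, the first entry is the smallest value
smallest <- toy_sorted[1]

# min() gives the same answer without sorting
min(toy_pop)
```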

Sorting

  1. Use the args() function to learn about the argument options for the sort() function.
  2. How do we sort from highest to lowest?

Sorting

  1. Use the second argument in the sort() function to sort the pop vector from highest to lowest. Assign the result to a new object pop_desc.
  2. Use the [ and : operators to access the top 10 metro areas by population. Assign the result to the object top10_pop

What are the population totals for the 10 largest metro areas?
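
The same idea on a made-up vector, taking the top 3 instead of the top 10 (toy_pop, toy_desc, and top3 are hypothetical names):

```r
# Made-up population values for illustration only
toy_pop <- c(450000, 80000, 120000, 990000, 300000)

# args() prints the arguments a function accepts
args(sort)

# decreasing = TRUE sorts from highest to lowest
toy_desc <- sort(toy_pop, decreasing = TRUE)

# 1:3 creates the positions 1, 2, 3, so this keeps the three largest values
top3 <- toy_desc[1:3]
```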

Subset

  1. Create a vector that contains the population density of the metro areas

pop_density <- us_metro_areas$total_pop / us_metro_areas$`area sq miles`

  2. Create an object ind where population density is greater than 500 people per square mile
  3. Create an object pop_density_over500 that contains all metro area names with population density over 500 people per square mile
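
A sketch of how a logical vector selects entries from another vector, using made-up data and a different cutoff (toy_metros, its values, and dense_cities are hypothetical):

```r
# Made-up data frame for illustration only
toy_metros <- data.frame(
  city = c("Metro A", "Metro B", "Metro C", "Metro D"),
  density = c(320, 1250, 95, 640),
  stringsAsFactors = FALSE
)

# A comparison returns one TRUE or FALSE per entry
ind <- toy_metros$density > 600

# Indexing with a logical vector keeps only the entries where ind is TRUE
dense_cities <- toy_metros$city[ind]
```

The logical vector and the vector being indexed need to have the same length and order, which is why both come from the same data frame.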

Subset

  1. What is the median population density in US metro areas?

median_density <- median(pop_density)

  2. Create a logical vector above_median that is TRUE where population density is greater than or equal to the median population density

above_median <- pop_density >= median_density

How would we write the code above if we hadn’t already created an object with the median density?

Working with a logical vector

How many metro areas have a population density greater than or equal to the median? Use the sum() function.

Try the same process with the average population density.
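
A small sketch of why sum() counts entries in a logical vector, using made-up density values (toy_density and at_or_above are hypothetical names):

```r
# Made-up density values for illustration only
toy_density <- c(320, 1250, 95, 640, 410)

# TRUE where the density is at least the median of the toy values
at_or_above <- toy_density >= median(toy_density)

# sum() treats TRUE as 1 and FALSE as 0, so it counts the TRUE entries
n_at_or_above <- sum(at_or_above)

# mean() of a logical vector gives the proportion of TRUE entries
share_at_or_above <- mean(at_or_above)
```

If the real column contains missing values, median() and sum() return NA unless na.rm = TRUE is supplied.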

Subset and index

  1. Which metro areas have above median density?

us_metro_areas$city[above_median]

  2. How many metro areas have density above 1,000 people per square mile?
  3. Which metro areas?

Sorting, logicals, and indexing

What are the most expensive cities to rent a home in the US?

  1. Create a vector with the median home rent for all US metro areas
  2. Sort the resulting vector from highest to lowest
  3. Use [ to select only the top 10 highest values for median rent in the US
  4. What is the 10th highest median rent?
  5. Create a logical vector ind where the median rent is greater than or equal to the 10th highest median rent.
  6. Use the ind logical vector to display the 10 most expensive metro areas to rent.
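
The steps above can be sketched on a made-up data frame, picking the top 3 instead of the top 10 (toy_metros, its values, and cutoff are hypothetical):

```r
# Made-up data frame for illustration only
toy_metros <- data.frame(
  city = c("Metro A", "Metro B", "Metro C", "Metro D", "Metro E", "Metro F"),
  rent = c(1450, 980, 2100, 1720, 1130, 1890),
  stringsAsFactors = FALSE
)

# Sort rents from highest to lowest and read off the third-highest value
rent_desc <- sort(toy_metros$rent, decreasing = TRUE)
cutoff <- rent_desc[3]

# TRUE for every metro area whose rent is at least the cutoff
ind <- toy_metros$rent >= cutoff

# Names of the metro areas at or above the cutoff
top3_cities <- toy_metros$city[ind]
```

Two things to notice with this approach: the names come back in their original row order rather than ranked by rent, and ties at the cutoff can return more entries than expected.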

order()

The order() function takes a vector as input and returns the vector of indexes that sorts the input vector.
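
A minimal sketch of the difference between the indexes and the sorted values, on a made-up vector x:

```r
# Made-up values for illustration only
x <- c(31, 4, 15, 92, 65)

# order() returns positions: the smallest value (4) sits at position 2,
# the next smallest (15) at position 3, and so on
idx <- order(x)
idx  # 2 3 1 5 4

# Indexing x by those positions gives the same result as sort(x)
x[idx]
sort(x)
```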

Which metro areas have the lowest percentage of the population under age 18?

  1. Create a vector with the percentage of children variable
  2. Sort the pct_children column lowest to highest. Do not assign the result to a new object.
  3. Use the order() function on the vector containing the values for pct_children. This returns the positions of the entries in the vector, arranged so that the first position points to the lowest percentage of children and the last (the 873rd) points to the highest.
  4. Assign the ordered vector from step 3 to an object index.
  5. Use the index vector to create a new object children_decreasing with the names of metro areas sorted by the percentage of children, lowest to highest.
  6. Use [ to show the 10 metro areas with the lowest percentage of children.
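
The reason to use order() rather than sort() here is that its index can rearrange a different vector, such as the names. A sketch on made-up data (toy_metros, its values, and cities_by_children are hypothetical):

```r
# Made-up data frame for illustration only
toy_metros <- data.frame(
  city = c("Metro A", "Metro B", "Metro C", "Metro D"),
  pct_children = c(24.0, 18.2, 21.5, 16.9),
  stringsAsFactors = FALSE
)

# Positions of the rows from lowest to highest percentage of children
index <- order(toy_metros$pct_children)

# Use the index on the city column to put the names in that same order
cities_by_children <- toy_metros$city[index]

# The two metro areas with the lowest percentage of children
lowest2 <- cities_by_children[1:2]
```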

Practice

  1. Use order() to display the top 10 US cities by median rent.

  2. Use order() to display which 5 metro areas have the highest percentage of land area covered by parks.

  3. Use order() to display the 10 cities with the highest rates of asthma among adults.
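
For "highest" questions like these, order() accepts the same decreasing argument as sort(). A sketch on made-up data, showing the top 2 (toy_metros, its values, and top2_cities are hypothetical):

```r
# Made-up data frame for illustration only
toy_metros <- data.frame(
  city = c("Metro A", "Metro B", "Metro C", "Metro D"),
  rent = c(1450, 980, 2100, 1720),
  stringsAsFactors = FALSE
)

# Positions of the rows from highest to lowest rent
index_desc <- order(toy_metros$rent, decreasing = TRUE)

# Keep the first two positions, then look up the matching city names
top2_cities <- toy_metros$city[index_desc[1:2]]
```

Unlike the logical-vector approach, this returns the names already ranked from highest to lowest.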