Intro to Data Analytics
Practicing R Basics
Corresponds to IDS 2.3-2.9 and IMS Ch 1
seq and length
Data frames
Sorting
Indexing
First, open up your RStudio project folder and select the .Rproj file to open RStudio.
Within RStudio, access the Files pane (bottom right window) and navigate to your /activities folder.
Create a new Quarto document for today’s in-class activity called us_cities_activity_Rbasics.qmd.
Create a code chunk in the Quarto document and load the tidyverse package into your RStudio environment using library(tidyverse). The tidyverse is a collection of packages that includes readr, which helps us cleanly read in data files.
How do metropolitan areas in the US compare?
Which metro area is the largest by land area?
What are the top 10 metro areas by population?
Where does it cost the most to rent a home?
Which areas have the highest median income?
In the Files pane, create a new folder called data within your /activities folder.
Download the us_metro_areas.csv data set, available here (link), and place the file in your new /data folder.
Double check your working directory by running getwd(). The result shown in the Console pane should be similar to "/Users/username/Documents/CMSC 121".
Create a new code chunk in your Quarto document and read in the .csv file using the following code:
us_metro_areas <- read_csv("activities/data/us_metro_areas.csv")
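The path given to read_csv() is relative, so it is resolved against the working directory reported by getwd(). A minimal sketch for checking the path before reading, assuming the folder layout described above (the object name csv_path is only illustrative):

```r
# Build the relative path used in the activity.
# csv_path is a hypothetical helper name, not part of the activity.
csv_path <- file.path("activities", "data", "us_metro_areas.csv")
csv_path

# Show where R will start looking for relative paths.
getwd()

# TRUE means the file is where read_csv() will look for it;
# FALSE usually means the working directory or folder names do not match.
file.exists(csv_path)
```

If file.exists() returns FALSE, compare the output of getwd() with the location of your /activities folder.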
What is the class of the us_metro_areas object?
Use the head() and glimpse() functions to explore the object.
Open the data frame in a table from the Environment pane.
Use the names() function to see all of the columns in the object.
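The exploration steps above can be tried on a small made-up data frame first. This is only a sketch: toy_metro and its values are invented for illustration, and it uses base R's str() as a rough stand-in for glimpse(), so results for the real us_metro_areas object will look different.

```r
# Toy stand-in for us_metro_areas; all values are made up.
toy_metro <- data.frame(
  city = c("Metro A", "Metro B", "Metro C"),
  total_pop = c(120000, 850000, 43000),
  pct_children = c(21.5, 24.1, 12.3)
)

class(toy_metro)    # what kind of object this is
head(toy_metro, 2)  # first two rows
str(toy_metro)      # compact overview of columns and types
names(toy_metro)    # column names
dim(toy_metro)      # number of rows, then number of columns
```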
[[ ]]
We can also use double square brackets ([[) to access a vector in a data frame:
us_metro_areas[["pct_children"]] # Access the column with the percentage of the population under age 18
Next, access the area in square miles of US metro areas and assign the vector to the object area.
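To see how $ and [[ relate, here is a small sketch on made-up data. The data frame toy_metro and its values are hypothetical. The column name with spaces mirrors the `area sq miles` column used later in this activity.

```r
# Toy data; values are invented for illustration.
toy_metro <- data.frame(
  city = c("Metro A", "Metro B", "Metro C"),
  pct_children = c(21.5, 24.1, 12.3)
)
# Add a column whose name contains spaces.
toy_metro[["area sq miles"]] <- c(310, 1200, 95)

# Both forms return the same plain vector.
by_dollar  <- toy_metro$pct_children
by_bracket <- toy_metro[["pct_children"]]
identical(by_dollar, by_bracket)

# A name with spaces needs backticks with $, but ordinary quotes with [[.
toy_metro$`area sq miles`
toy_metro[["area sq miles"]]

# [[ also accepts a column name stored in a variable.
col <- "pct_children"
toy_metro[[col]]
```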
Compute the average of median_hh_income for US metro areas and store the result as the object metro_average_income.
Which metro areas have the largest population?
Which metro areas have the highest median income?
What other questions could we ask?
Use the $ operator to access the population total data and store it as the object pop.
Use the sort function to redefine pop so that it is sorted.
Use the [ operator to report the smallest population size: pop[1]
Sorting
Use the args() function to learn about the argument options for the sort() function.
Sorting
Use the sort() function to sort the pop vector from highest to lowest. Assign the result to a new object pop_desc.
Use the [ and : operators to access the top 10 metro areas by population. Add the result to the object top10_pop.
What are the population totals for the top 10 largest metro areas?
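The sorting steps can be rehearsed on a short made-up vector before applying them to the real data. In this sketch the numbers are invented, and top3_pop is a hypothetical name standing in for top10_pop because the toy vector has only five entries.

```r
# Made-up population totals.
pop <- c(43000, 850000, 120000, 15000, 560000)

# Inspect the arguments of sort(); note the decreasing argument.
args(sort)

# Default sort is ascending, so the first element is the smallest.
pop_sorted <- sort(pop)
pop_sorted[1]

# Sort from highest to lowest.
pop_desc <- sort(pop, decreasing = TRUE)

# The : operator builds the index 1, 2, 3 to take the first three entries.
top3_pop <- pop_desc[1:3]
top3_pop
```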
pop_density <- us_metro_areas$total_pop / us_metro_areas$`area sq miles`
Create a logical vector ind that tells us where population density is greater than 500 people per square mile.
What is stored in ind? What does it tell us?
Create an object pop_density_over500 that contains all metro area names with population density over 500 people per square mile. Use the ind object to compute it.
Subset
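A comparison such as > returns one TRUE or FALSE per element, and that logical vector can be placed inside [ to keep only the matching entries. A minimal sketch on invented data (toy_metro, the column name area_sq_miles, and all values are hypothetical):

```r
# Toy data; values are made up.
toy_metro <- data.frame(
  city = c("Metro A", "Metro B", "Metro C", "Metro D"),
  total_pop = c(120000, 850000, 43000, 600000),
  area_sq_miles = c(310, 1200, 95, 400)
)

# People per square mile for each row.
pop_density <- toy_metro$total_pop / toy_metro$area_sq_miles

# One TRUE/FALSE per metro area.
ind <- pop_density > 500
class(ind)
ind

# Keep only the names where ind is TRUE.
toy_metro$city[ind]
```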
median_density <- median(pop_density)
Create a logical vector above_median that indicates where population density is greater than or equal to the median population density.
How many metro areas have a population density greater than or equal to the median? Use the sum() function.
Try the same process with the average population density.
us_metro_areas$city[above_median]
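sum() works for counting here because R treats TRUE as 1 and FALSE as 0 when adding up a logical vector. A small sketch with invented densities (above_mean is a hypothetical name for the version that uses the average):

```r
# Made-up population densities.
pop_density <- c(387, 708, 453, 1500, 90)

# Compare each value against the median.
median_density <- median(pop_density)
above_median <- pop_density >= median_density
sum(above_median)  # number of TRUE values

# Same process with the average instead of the median.
mean_density <- mean(pop_density)
above_mean <- pop_density >= mean_density
sum(above_mean)
```

With these toy values the two counts differ, because one very large density pulls the average above the median.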
What are the most expensive cities to rent a home in the US?
Use [ to select only the top 10 highest values for median rent in the US.
Create a logical vector ind where the median rent is greater than or equal to the 10th highest median rent.
Use the ind logical vector to display the 10 most expensive metro areas to rent.
order() takes a vector as input and returns the vector of indexes that sorts the input vector.
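The two ideas above, a cutoff-based logical vector and order(), can be compared on a short made-up vector. In this sketch rent, metro, and cutoff are hypothetical names with invented values, and the cutoff uses the 2nd highest value rather than the 10th because the toy vector has only five entries.

```r
# Made-up median rents and matching names.
rent  <- c(1200, 950, 2100, 1650, 800)
metro <- c("Metro A", "Metro B", "Metro C", "Metro D", "Metro E")

# Cutoff approach: find the 2nd highest rent, then flag values at or above it.
cutoff <- sort(rent, decreasing = TRUE)[2]
ind <- rent >= cutoff
metro[ind]  # names come back in their original order, not sorted by rent

# order() returns positions: idx[1] is where the smallest value sits.
idx <- order(rent)
idx

# Indexing with idx reproduces the sorted vector.
rent[idx]
```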
Which metro areas have the lowest percentage of the population under age 18?
Use the order() function on the vector containing the values for pct_children. This creates an index of the position of entries in the vector based on the value of the percentage children variable, lowest (1) to highest (873).
Assign the result to the object index.
Use the index vector to create a new object children with the names of metro areas sorted by the percentage of children. Hint: the first city entry should be Punta Gorda, FL Area.
Use [ to show the 10 metro areas with the lowest percentage of children.
Use order() to display the top 10 US cities by median rent.
Use order() to display which 5 metro areas have the highest percentage of land area covered by parks.
Use order() to display the 10 cities with the highest rates of asthma among adults.
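The pattern behind these exercises is the same each time: compute an ordering on one column, then use it to index another. A minimal sketch on invented data follows. toy_metro, its values, and the column name median_rent are assumptions for illustration, so check names(us_metro_areas) for the real column names.

```r
# Toy data; values and the median_rent column name are made up.
toy_metro <- data.frame(
  city = c("Metro A", "Metro B", "Metro C", "Metro D", "Metro E"),
  pct_children = c(21.5, 24.1, 12.3, 18.0, 26.7),
  median_rent = c(1200, 2100, 950, 800, 1650)
)

# Lowest to highest: positions of rows sorted by pct_children.
index <- order(toy_metro$pct_children)
children <- toy_metro$city[index]
children[1:2]  # the two metro areas with the lowest percentage of children

# Highest to lowest: add decreasing = TRUE.
rent_index <- order(toy_metro$median_rent, decreasing = TRUE)
top2_rent <- toy_metro$city[rent_index][1:2]
top2_rent

# The same index can reorder whole rows of the data frame.
head(toy_metro[rent_index, ], 2)
```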