Intro to Data Analytics
Practicing R Basics
seq and length
Corresponds to IDS 2.3-2.9 and IMS Ch 1
Data frames
The accessor $
Sorting
Indexing
First, open your CMSC 121 RStudio project folder and select the .Rproj file to open RStudio.
Within RStudio, access the Files pane (bottom right window) and navigate to your /class_activities folder.
Create a new Quarto document for today’s in-class activity called us_cities_activity_Rbasics.qmd.
Create a code chunk in the Quarto document and load the tidyverse package in your RStudio environment using library(tidyverse).
The tidyverse is a collection of packages. One of them, readr, helps us cleanly read in data files.
How do metropolitan areas in the US compare?
Which metro area is the largest by land area?
What are the top 10 metro areas by population?
Where does it cost the most to rent a home?
Which areas have the highest median income?
In the Files pane, create a new folder called data within your /class_activities folder.
Download the data set us_metro_areas.csv here (link), and place the file in your new /data folder.
Double-check your working directory by running getwd(). The result shown in the Console pane should be similar to "/Users/username/Documents/CMSC 121".
Create a new code chunk in your Quarto document and read in the .csv file using the following code:
us_metro_areas <- read_csv("class_activities/data/us_metro_areas.csv")
What is the class of the us_metro_areas object?
Use the head() function to explore the object.
Open the object in a table using the Environment pane.
Use the names() function to see all of the columns in the object.
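The three exploration steps above can be tried on a small stand-in before running them on the real file. This is a minimal sketch: metro_toy and every value in it are invented for illustration and are not taken from us_metro_areas.csv. An object read with read_csv() is a tibble, which is also a data frame, so class() will list a few extra class names on the real object.

```r
# Toy stand-in for us_metro_areas; all names and values below are invented.
metro_toy <- data.frame(
  city = c("Alpha", "Beta", "Gamma"),
  total_pop = c(120000, 950000, 43000),
  pct_children = c(22.5, 19.1, 25.0)
)

class(metro_toy)    # the kind of object (a plain data frame here)
head(metro_toy, 2)  # the first two rows
names(metro_toy)    # the column names
```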
We can also use double square brackets ([[) to access a vector in a data frame:
# Access the column with the percentage of the population under age 18
us_metro_areas[["pct_children"]]
Next, access the area in square miles of US metro areas and assign the vector to the object area.
Then access median_hh_income for US metro areas and assign the vector to the object metro_average_income.
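As a hedged illustration of the two accessors, the sketch below pulls the same column with $ and with [[ and checks that the results match. The data frame metro_toy and its values are invented. The column name with spaces mirrors the `area sq miles` name used later in this activity, and check.names = FALSE is only needed here because the toy data is built with data.frame() rather than read_csv().

```r
# Toy data; values are invented for illustration only.
metro_toy <- data.frame(
  city = c("Alpha", "Beta", "Gamma"),
  pct_children = c(22.5, 19.1, 25.0),
  `area sq miles` = c(310, 1200, 95),
  check.names = FALSE  # keep the column name with spaces exactly as written
)

# $ and [[ ]] both return the column as a vector
kids_dollar  <- metro_toy$pct_children
kids_bracket <- metro_toy[["pct_children"]]
identical(kids_dollar, kids_bracket)

# A column name containing spaces needs backticks after $,
# or a quoted string inside [[ ]]
area       <- metro_toy$`area sq miles`
area_again <- metro_toy[["area sq miles"]]
```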
Which metro areas have the largest population?
Which metro areas have the highest median income?
What other questions could we ask?
Use the $ operator to access the population total data and store it as the object pop.
Use the sort function to redefine pop so that it is sorted.
Use the [ operator to report the smallest population size: pop[1]
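The sort-then-index pattern can be sketched on a short vector. The numbers below are invented and stand in for us_metro_areas$total_pop. By default sort() arranges values from lowest to highest, so the first element is the smallest.

```r
# Invented population totals standing in for us_metro_areas$total_pop
pop <- c(120000, 950000, 43000, 610000)

pop <- sort(pop)   # redefine pop in increasing order
pop[1]             # the smallest value
pop[length(pop)]   # the largest value is in the last position
```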
Sorting
Use the args() function to learn about the argument options for the sort() function.
Use the sort() function to sort the pop vector from highest to lowest. Assign the result to a new object pop_desc.
Use the [ and : operators to access the top 10 metro areas by population. Add the result to the object top10_pop.
What are the population totals for the 10 largest metro areas?
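A minimal sketch of these steps on invented numbers follows. Because the toy vector has only five entries, it takes the top 3 instead of the top 10, and top3_pop is a hypothetical name standing in for top10_pop.

```r
args(sort)  # prints the arguments of sort(), including decreasing = FALSE

# Invented population totals
pop <- c(120000, 950000, 43000, 610000, 87000)

# decreasing = TRUE sorts from highest to lowest
pop_desc <- sort(pop, decreasing = TRUE)

# 1:3 builds the positions 1, 2, 3; [ keeps just those entries
top3_pop <- pop_desc[1:3]
top3_pop
```

With the real data, the same indexing with 1:10 gives the ten largest totals.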
pop_density <- us_metro_areas$total_pop / us_metro_areas$`area sq miles`
Create a logical vector ind where population density is greater than 500 people per square mile.
Create an object pop_density_over500 that contains all metro area names with population density over 500 people per square mile.
Subset
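Subsetting with a logical vector can be sketched as follows. The city names and densities are invented. A comparison such as pop_density > 500 returns one TRUE or FALSE per entry, and indexing another vector of the same length with it keeps only the TRUE positions.

```r
# Invented names and densities (people per square mile)
city        <- c("Alpha", "Beta", "Gamma", "Delta")
pop_density <- c(387, 792, 453, 1210)

# One TRUE/FALSE per metro area
ind <- pop_density > 500
ind

# Keep only the names where ind is TRUE
pop_density_over500 <- city[ind]
pop_density_over500
```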
median_density <- median(pop_density)
Create a logical vector above_median with population density greater than or equal to the median population density:
above_median <- pop_density >= median_density
How would we write the code above if we hadn’t already created an object with the median density?
How many metro areas have a population density greater than or equal to the median? Use the sum() function.
Try the same process with the average population density.
us_metro_areas$city[above_median]
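One possible way to approach the questions above is sketched below on invented densities. It calls median() directly inside the comparison instead of storing it first, and it relies on sum() counting each TRUE as 1. The name above_mean is hypothetical.

```r
# Invented densities (people per square mile)
pop_density <- c(387, 792, 453, 1210, 98)

# Compute the median inside the comparison, with no intermediate object
above_median <- pop_density >= median(pop_density)
sum(above_median)   # number of entries at or above the median

# Same process with the average
above_mean <- pop_density >= mean(pop_density)
sum(above_mean)
```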
What are the most expensive cities to rent a home in the US?
Use [ to select only the 10 highest values for median rent in the US.
Create a logical vector ind where the median rent is greater than or equal to the 10th highest median rent.
Use the ind logical vector to display the 10 most expensive metro areas to rent.
order() takes a vector as input and returns the vector of indexes that sorts the input vector.
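To make the behavior of order() concrete, here is a small sketch on an invented vector. The values in index are positions in x, not values from x, and indexing x with them gives the same result as sort(x). The same positions can also reorder a second vector of matching length, shown here with a hypothetical labels vector.

```r
x      <- c(31, 4, 15, 92, 65)           # invented values
labels <- c("a", "b", "c", "d", "e")     # hypothetical matching labels

index <- order(x)
index               # 2 3 1 5 4: position 2 holds the smallest value, and so on

x[index]                         # 4 15 31 65 92
identical(x[index], sort(x))     # TRUE

labels[index]       # labels rearranged to follow the sorted values
```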
Which metro areas have the lowest percentage of the population under age 18?
Use the order() function on the vector containing the values for pct_children. This returns the positions of the entries in the vector, arranged by the percentage of children from lowest (1st) to highest (873rd).
Assign the result to the object index.
Use the index vector to create a new object children_decreasing with the names of metro areas sorted by the percentage of children.
Use [ to show the 10 metro areas with the lowest percentage of children.
Use order() to display the top 10 US cities by median rent.
Use order() to display which 5 metro areas have the highest percentage of land area covered by parks.
Use order() to display the 10 cities with the highest rates of asthma among adults.
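For "highest" questions like these, one option is the decreasing argument of order(). The sketch below uses invented city names and an invented pct_parks vector. The actual column names for parks, rent, and asthma in us_metro_areas are not shown in this activity, so check names(us_metro_areas) before adapting it.

```r
# Invented names and values; pct_parks is a hypothetical column
city      <- c("Alpha", "Beta", "Gamma", "Delta", "Epsilon")
pct_parks <- c(8.2, 14.6, 3.1, 11.0, 6.7)

# Positions arranged from the highest value to the lowest
index_desc <- order(pct_parks, decreasing = TRUE)

# Names of the 3 areas with the highest values
top3_parks <- city[index_desc[1:3]]
top3_parks
```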