Summarizing data review

Question

Load the mosaicData R package in your Quarto document, and then visit the documentation page for the SaratogaHouses data set. Make sure you also load tidyverse!

Question

Import the Saratoga Houses data set into your environment using the command SaratogaHouses <- mosaicData::SaratogaHouses in your Quarto document. Use nrow() to compute the number of rows in SaratogaHouses.

Question

How old was a “typical” home in Saratoga County during 2006? To find out, use summarize() to compute the median age of all the homes in the SaratogaHouses data set.

Question

What was the average amount of living space of a home for sale in Saratoga County during 2006?

Be sure to think about:

  1. What variable in the data set measures home size?
  2. What function in R calculates the average?
  3. What are the units in the output?
  4. What would be a good name for your summary?

Question

Does the price of a home depend on the type of heating system it has? To find out, use the incomplete code below as a template and compute the median price of all homes, taking into account the type of heating system the home has. Be sure to look back at the variables in the data set to figure out which one can tell you what type of heating system a home uses.

SaratogaHouses |>
  group_by(______) |>
  summarize(median_price = median(price))

Question

How does the average size of a home depend on the number of bedrooms in the home? Be sure to think about:

  1. Which variable should be the “grouping” variable (i.e., which variable you should look at to figure out which group a home belongs to) and
  2. Which variable should be the summarized variable?

Question

Get some practice using the n() function to summarize the frequency of different category values by using the incomplete code below as a template to figure out how many homes use electric, gas, or oil as their heating fuel. Be sure to look back in the documentation for the SaratogaHouses data set to figure out which variable measures what kind of fuel a home uses, so you know which variable to group the data by!

SaratogaHouses |>
  group_by(____) |>
  summarize(num_homes = ____)

Question

Now write code to calculate how many homes in the data set do and do not have a waterfront on their property.

Question

Let’s revisit the Palmer Penguins dataset. One thing we didn’t previously discuss is that it contains missing data. We can see the issue when we compute the average body mass in grams for each species.

library(tidyverse)
library(palmerpenguins)
penguins |>
  group_by(species) |>
  summarize(avg_body_mass_g = mean(body_mass_g))

# A tibble: 3 × 2
  species   avg_body_mass_g
  <fct>               <dbl>
1 Adelie                NA 
2 Chinstrap           3733.
3 Gentoo                NA 

Use the is.na() function and the penguins dataset to see only the penguins with missing body mass.

You should see that only 2 penguins are missing body mass observations, one Adelie and one Gentoo, out of 344 total penguins. With so few missing values, it’s unlikely that the mean mass would change much even if the masses of these two penguins were serious outliers. In this type of case, it’s often okay to exclude the NA values and take the average of all known observations. If a large portion of the data (either overall or within a group) were missing, we might not want to do this.

We tell R to exclude NA values by using the na.rm argument of the mean() function. Setting na.rm = TRUE removes NA values before the calculation. It’s an optional argument (FALSE by default), meaning you don’t need to include it unless you want to change the default behavior of the function.
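To see the effect of na.rm outside of summarize(), here is a minimal sketch using a small made-up vector. The name masses and its values are hypothetical and are not taken from the penguins dataset.

```r
# Hypothetical body masses in grams; the NA stands in for a missing measurement
masses <- c(3750, 3800, NA, 3450)

# Default behavior (na.rm = FALSE): a single NA makes the whole result NA
mean_with_na <- mean(masses)
print(mean_with_na)

# With na.rm = TRUE, the NA is dropped and the mean uses the 3 known values
mean_without_na <- mean(masses, na.rm = TRUE)
print(mean_without_na)
```

The first call returns NA, while the second returns the average of the three known values.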

Question

Use the na.rm argument of the mean() function to calculate the average body mass in grams for each species of penguin with NA values removed. If you’re not sure how to add this argument, refer to the documentation for the mean() function.

You should now see numeric values for all three penguin species.

Many other functions that you use within summarize() will behave similarly to the mean() function here and will also have the na.rm optional argument.
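As a rough illustration, the sketch below uses median(), max(), and sum(), which all accept na.rm, inside summarize(). The toy tibble and its column names (group, value) are made up for this example and are not part of any data set used above.

```r
library(dplyr)

# Hypothetical toy data with one missing value in group "a"
toy <- tibble(
  group = c("a", "a", "a", "b", "b", "b"),
  value = c(1, 2, NA, 3, 5, 7)
)

# Each summary function drops NA values before computing its result
toy_summary <- toy |>
  group_by(group) |>
  summarize(
    median_value = median(value, na.rm = TRUE),
    max_value = max(value, na.rm = TRUE),
    total_value = sum(value, na.rm = TRUE)
  )

print(toy_summary)
```

Without na.rm = TRUE, all three summaries for group "a" would come back as NA.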