Lab 9 Linear Regression
Housing prices
This lab is based on Chapter 10 of Introduction to Modern Statistics.
Open a new Quarto document (File -> New File -> Quarto Document) in your working directory for the course. Name the document lab9_yourname.qmd and enter your name as the author. Label each step in your Quarto document.
This lab is based on housing data that were scraped from Zillow and are in the duke_forest dataset stored in the openintro package (make sure you install all necessary packages and have the appropriate library commands).
Once you have set up the packages, run View(duke_forest) to be sure that everything is set up properly. Run names(duke_forest) or glimpse(duke_forest) so that you can easily see the column names.
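The setup described above might look like the following minimal sketch. It assumes the tidyverse package supplies glimpse() and ggplot for this course; that choice is an assumption, so use whichever packages your instructor specifies. View() is intended for interactive use, so it is left commented out here.

```r
# Run once if the packages are not yet installed:
# install.packages(c("openintro", "tidyverse"))

library(openintro)  # provides the duke_forest dataset
library(tidyverse)  # assumed source of glimpse() and ggplot()

# View(duke_forest)   # opens the data viewer; run this in the console

names(duke_forest)    # column names only
glimpse(duke_forest)  # column names, types, and the first few values
```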
Step 1
Explore the distribution of two numerical variables and one categorical variable using ggplot.
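As a starting point for Step 1, here is a hedged sketch that uses a histogram for each numerical variable and a bar chart for the categorical one. The choice of price, area, and cooling is only an illustration, and the bin count of 20 is arbitrary; pick the variables and settings you find most informative.

```r
library(openintro)
library(ggplot2)

# Numerical variable: distribution of price
p_price <- ggplot(duke_forest, aes(x = price)) +
  geom_histogram(bins = 20) +
  labs(x = "price", y = "Number of homes")

# Numerical variable: distribution of area
p_area <- ggplot(duke_forest, aes(x = area)) +
  geom_histogram(bins = 20) +
  labs(x = "area", y = "Number of homes")

# Categorical variable: counts per cooling type
p_cooling <- ggplot(duke_forest, aes(x = cooling)) +
  geom_bar() +
  labs(x = "cooling", y = "Number of homes")

p_price
p_area
p_cooling
```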
Step 2
Start by generating separate scatterplots describing price as a function of each possible predictor variable (number of bedrooms, number of bathrooms, area of home, year built, cooling type, lot size).
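One possible way to produce the plots for Step 2 without repeating the same code six times is sketched below. The column names in predictors are assumptions about how the six predictors are named in duke_forest; confirm them with names(duke_forest) before running. Writing six separate ggplot calls is equally acceptable.

```r
library(openintro)
library(ggplot2)

# Assumed column names for the six predictors; check with names(duke_forest)
predictors <- c("bed", "bath", "area", "year_built", "cooling", "lot")

# Build one scatterplot of price against each predictor
plots <- lapply(predictors, function(v) {
  ggplot(duke_forest, aes(x = .data[[v]], y = price)) +
    geom_point(alpha = 0.6) +
    labs(x = v, y = "price")
})
names(plots) <- predictors

# Display a single plot by name, for example price versus area
plots$area
```

Rows with a missing predictor value are dropped from the corresponding plot with a warning, which is worth noting in your write-up if it occurs.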
Step 3
Based on the plots you created above, is there one variable that seems to be most informative for predicting house price? Discuss your reasoning.
Step 4
Use the lm function to determine the intercept and slope for a linear model fit to predict price based on area (the area of the home). Show the code and its result.
Run the summary() function to see the results of the model. Is the relationship between area and price found to be statistically significant? If so, at what level of significance or “strength”?
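A minimal sketch of the model fit for Step 4 follows. The object name price_area_fit is a hypothetical choice, not one required by the lab.

```r
library(openintro)

# Fit a simple linear regression of price on area
price_area_fit <- lm(price ~ area, data = duke_forest)

# Intercept and slope
coef(price_area_fit)

# Full results: estimates, standard errors, t statistics, p-values, R-squared
summary(price_area_fit)
```

In the summary() output, the p-value for the slope is in the Pr(>|t|) column of the area row, and the asterisks beside it correspond to the significance codes printed below the coefficient table.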
Step 5
Given the results of the above step, write out the equation that represents the model for predicting price from area. Store the result in an object estimated_price. Write the linear model, interpret the slope and the intercept in context of the data, and determine and interpret \(R^2\).
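The sketch below shows one reading of Step 5, under the assumption that estimated_price is meant to hold the predicted price for each home, computed as intercept plus slope times area. The names fit, b0, b1, and r_squared are hypothetical; if your instructor intends something else by estimated_price, adapt accordingly.

```r
library(openintro)

fit <- lm(price ~ area, data = duke_forest)

# Extract the estimated intercept and slope
b0 <- coef(fit)[["(Intercept)"]]
b1 <- coef(fit)[["area"]]

# Model equation applied to every home: predicted price = b0 + b1 * area
estimated_price <- b0 + b1 * duke_forest$area

# Proportion of variability in price explained by area
r_squared <- summary(fit)$r.squared

b0
b1
r_squared
```

The values of b0 and b1 are the numbers to substitute into the written equation, and r_squared is the quantity to interpret as \(R^2\).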
Step 6
Calculate the required variables and make a scatter plot which shows the residuals \(e_i = y_i - \widehat{y_i}\) on the y-axis and the predicted values \(\widehat{y_i}\) on the x-axis.
What aspects of the residual plot show that the linear model is appropriate? Are there any aspects of the residual plot that cause you to be concerned about the fitting of the linear model? Add your response to your Quarto document.
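The residual plot for Step 6 could be built along the following lines. This is a sketch rather than a required approach: the names fit, resid_df, and resid_plot are hypothetical, and the dashed reference line at zero is an optional addition that makes it easier to judge whether the residuals scatter evenly above and below it.

```r
library(openintro)
library(ggplot2)

fit <- lm(price ~ area, data = duke_forest)

# Predicted values and residuals (observed minus predicted) for each home
resid_df <- data.frame(
  predicted = fitted(fit),
  residual  = resid(fit)
)

# Residuals on the y-axis against predicted values on the x-axis
resid_plot <- ggplot(resid_df, aes(x = predicted, y = residual)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Predicted price", y = "Residual")

resid_plot
```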