Lab 9: Linear Regression

Housing prices

This lab is based on Chapter 10 of Introduction to Modern Statistics.

Open a new Quarto document (File -> New File -> Quarto Document) in your working directory for the course and name it lab9_yourname.qmd. Make sure that you list yourself as the author in the Quarto document. Label each step in your Quarto document.

The lab uses housing data that were scraped from Zillow and are stored in the duke_forest dataset in the openintro package (make sure you install all necessary packages and load them with the appropriate library() commands).

Once the packages are loaded, run View(duke_forest) to confirm that everything is set up properly. Run names(duke_forest) or glimpse(duke_forest) so that you can easily see the column names.
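A minimal setup sketch is shown here. It assumes you use the tidyverse for ggplot() and glimpse(); loading ggplot2 and dplyr individually would work as well.

```r
# Install once if needed (assumption: the packages are not yet installed)
# install.packages(c("openintro", "tidyverse"))

library(openintro)  # provides the duke_forest data
library(tidyverse)  # loads ggplot2 and dplyr (which exports glimpse)

# View(duke_forest) opens an interactive viewer, so run it in the console
# rather than inside a chunk that will be rendered.
names(duke_forest)
glimpse(duke_forest)
```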

Step 1

Explore the distribution of two numerical variables and one categorical variable using ggplot.
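One possible starting point, as a sketch only: the choice of price, area, and cooling is an assumption made for illustration, and any two numerical variables and one categorical variable would do. The bin count of 20 is also arbitrary.

```r
library(openintro)
library(ggplot2)

# Histogram of a numerical variable: sale price
p_price <- ggplot(duke_forest, aes(x = price)) +
  geom_histogram(bins = 20) +
  labs(x = "Price", y = "Count")

# Histogram of a second numerical variable: area of the home
p_area <- ggplot(duke_forest, aes(x = area)) +
  geom_histogram(bins = 20) +
  labs(x = "Area", y = "Count")

# Bar chart of a categorical variable: cooling type
p_cooling <- ggplot(duke_forest, aes(x = cooling)) +
  geom_bar() +
  labs(x = "Cooling type", y = "Count")

p_price
p_area
p_cooling
```

When describing each distribution, consider its shape, center, spread, and any unusual observations.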

Step 2

Start by generating separate scatterplots describing price as a function of each possible predictor variable (number of bedrooms, number of bathrooms, area of home, year built, cooling type, lot size).
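The sketch below shows one way to produce these plots without repeating code. It assumes the predictor columns are named bed, bath, area, year_built, lot, and cooling, as returned by names(duke_forest). The object names numeric_predictors and duke_long are made up for this example. Writing one ggplot() call per predictor is an equally valid approach.

```r
library(openintro)
library(tidyverse)

# Numerical predictors to plot against price
numeric_predictors <- c("bed", "bath", "area", "year_built", "lot")

# Reshape so that each row is one (price, predictor, value) combination
duke_long <- duke_forest %>%
  select(price, all_of(numeric_predictors)) %>%
  pivot_longer(-price, names_to = "predictor", values_to = "value")

# One scatterplot panel per numerical predictor, each with its own x-axis
ggplot(duke_long, aes(x = value, y = price)) +
  geom_point(alpha = 0.6, na.rm = TRUE) +
  facet_wrap(~ predictor, scales = "free_x") +
  labs(x = NULL, y = "Price")

# Cooling type is categorical, so jitter the points horizontally to reduce overlap
ggplot(duke_forest, aes(x = cooling, y = price)) +
  geom_jitter(width = 0.1, height = 0) +
  labs(x = "Cooling type", y = "Price")
```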

Step 3

Based on the plots you created above, is there one variable that seems to be most informative for predicting house price? Discuss your reasoning.

Step 4

Use the lm function to determine the intercept and slope for a linear model fit to predict price based on area (which is area of the home). Show the code and its result.

Run the summary() function to see the results of the model. Is the relationship between area and price found to be statistically significant? If so, at what level of significance (“strength”)?
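A minimal sketch of the fit follows. The object name price_area_fit is a placeholder chosen for this example.

```r
library(openintro)

# Fit a simple linear regression of price on area
price_area_fit <- lm(price ~ area, data = duke_forest)

# Intercept and slope
coef(price_area_fit)

# Full results: estimates, standard errors, t statistics, p-values, R-squared
summary(price_area_fit)
```

In the summary() output, the p-value for area appears in the Pr(>|t|) column of the coefficient table, and the asterisks next to it are explained by the "Signif. codes" legend printed underneath.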

Step 5

Given the results of the above step, write out the equation that represents the model for predicting price from area. Store the result in an object named estimated_price. Interpret the slope and the intercept in the context of the data, and determine and interpret \(R^2\).
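The fitted model has the general form \(\widehat{\text{price}} = b_0 + b_1 \times \text{area}\), where \(b_0\) and \(b_1\) are the estimated intercept and slope. The sketch below assumes that estimated_price is meant to hold the predicted price for each house in the data; check with your instructor if a different reading is intended. The names fit, b0, b1, and r_squared are placeholders.

```r
library(openintro)

fit <- lm(price ~ area, data = duke_forest)

# Pull the estimated coefficients out of the fitted model
b0 <- coef(fit)[["(Intercept)"]]
b1 <- coef(fit)[["area"]]

# Apply the model equation to every house (assumed meaning of estimated_price)
estimated_price <- b0 + b1 * duke_forest$area

# Proportion of the variability in price explained by area
r_squared <- summary(fit)$r.squared
r_squared
```

For a model with a single predictor, \(R^2\) equals the square of the correlation between price and area, which gives a quick way to check the value.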

Step 6

Calculate the required variables and make a scatterplot that shows the residuals \(e_i = y_i - \widehat{y}_i\) on the y-axis and the predicted values \(\widehat{y}_i\) on the x-axis.
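One way to build this plot is sketched below, using the base R helpers fitted() and resid(). The data frame name resid_df and its column names are placeholders for this example.

```r
library(openintro)
library(ggplot2)

fit <- lm(price ~ area, data = duke_forest)

# Predicted values and residuals (observed minus predicted) for each house
resid_df <- data.frame(
  predicted = fitted(fit),
  residual  = resid(fit)
)

# Residuals against predicted values, with a reference line at zero
ggplot(resid_df, aes(x = predicted, y = residual)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Predicted price", y = "Residual")
```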

What aspects of the residual plot show that the linear model is appropriate? Are there any aspects of the residual plot that cause you to be concerned about the fit of the linear model? Add your response to your Quarto document.