Linear Regression Model Specification (regression)
Computational engine: lm
Intro to Data Analytics
Install and load tidymodels (see next slide if you get thrown an error).
Download and place the R script file LM-code.R in your working directory.
tidymodels package.The newest version of tidymodels requires you to either:
Update the rlang package. Use install.packages("rlang") to update.
Update your version of RStudio. To do so, you will need to download the newest release of RStudio (that came out just after we started the semester):
Note: you do not need to uninstall anything in order to update.
y and x by using
slope for numerical explanatory variablesdifferences for categorical explanatory variablesx, get predicted y
\[\widehat{height}_{i} = \beta_0 + \beta_1 \times width_{i}\]
… using formula syntax
parsnip model object
Call:
stats::lm(formula = Height_in ~ Width_in, data = data)
Coefficients:
(Intercept) Width_in
3.6214 0.7808
\[\widehat{height}_{i} = 3.6214 + 0.7808 \times width_{i}\]
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 3.62 0.254 14.3 8.82e-45
2 Width_in 0.781 0.00950 82.1 0
\[\widehat{height}_{i} = 3.62 + 0.781 \times width_{i}\]
\[\widehat{height}_{i} = 3.62 + 0.781 \times width_{i}\]
Take out your notes on regression from last week.
If you haven’t already, download and place the LM-code.R R script file in your working directory.
Let’s work through some examples.
\[\widehat{y}_{i} = \beta_0 + \beta_1*x_{i}\]
\[\widehat{y}_{i} = b_0 + b_1*x_{i}\]
\[\bar{y} = b_0 + b_1 \bar{x} ~ \rightarrow ~ b_0 = \bar{y} - b_1 \bar{x}\]
Up to now dependent and independent variables are continuous
But life is not continuous
Categorical variables have special properties in models
# A tibble: 3,393 × 3
name Height_in landsALL
<chr> <dbl> <dbl>
1 L1764-2 37 0
2 L1764-3 18 0
3 L1764-4 13 1
4 L1764-5a 14 1
5 L1764-5b 14 1
6 L1764-6 7 0
7 L1764-7a 6 0
8 L1764-7b 6 0
9 L1764-8 15 0
10 L1764-9a 9 0
11 L1764-9b 9 0
12 L1764-10a 16 1
13 L1764-10b 16 1
14 L1764-10c 16 1
15 L1764-11 20 0
16 L1764-12a 14 1
17 L1764-12b 14 1
18 L1764-13a 15 1
19 L1764-13b 15 1
20 L1764-14 37 0
# ℹ 3,373 more rows
landsALL = 0: No landscape featureslandsALL = 1: Some landscape featureslinear_reg() %>%
set_engine("lm") %>%
fit(Height_in ~ factor(landsALL), data = paintings) %>%
tidy()# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 22.7 0.328 69.1 0
2 factor(landsALL)1 -5.65 0.532 -10.6 7.97e-26
factor() because we want R to read the numerical dummy variable as two categories (not as numerical values).\[\widehat{Height_{in}} = 22.7 - 5.645~landsALL\]
landsALL = 0) to the other level (landsALL = 1)Extend this out to multiple categories in one variable
# A tibble: 7 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 14.0 10.0 1.40 0.162
2 school_pntgD/FL 2.33 10.0 0.232 0.816
3 school_pntgF 10.2 10.0 1.02 0.309
4 school_pntgG 1.65 11.9 0.139 0.889
5 school_pntgI 10.3 10.0 1.02 0.306
6 school_pntgS 30.4 11.4 2.68 0.00744
7 school_pntgX 2.87 10.3 0.279 0.780
forcats package fct_relevel.| school_pntg | D_FL | F | G | I | S | X |
|---|---|---|---|---|---|---|
| A | 0 | 0 | 0 | 0 | 0 | 0 |
| D/FL | 1 | 0 | 0 | 0 | 0 | 0 |
| F | 0 | 1 | 0 | 0 | 0 | 0 |
| G | 0 | 0 | 1 | 0 | 0 | 0 |
| I | 0 | 0 | 0 | 1 | 0 | 0 |
| S | 0 | 0 | 0 | 0 | 1 | 0 |
| X | 0 | 0 | 0 | 0 | 0 | 1 |
# A tibble: 3,393 × 3
name Height_in school_pntg
<chr> <dbl> <chr>
1 L1764-2 37 F
2 L1764-3 18 I
3 L1764-4 13 D/FL
4 L1764-5a 14 F
5 L1764-5b 14 F
6 L1764-6 7 I
7 L1764-7a 6 F
8 L1764-7b 6 F
9 L1764-8 15 I
10 L1764-9a 9 D/FL
11 L1764-9b 9 D/FL
12 L1764-10a 16 X
13 L1764-10b 16 X
14 L1764-10c 16 X
15 L1764-11 20 D/FL
16 L1764-12a 14 D/FL
17 L1764-12b 14 D/FL
18 L1764-13a 15 D/FL
19 L1764-13b 15 D/FL
20 L1764-14 37 F
# ℹ 3,373 more rows
Call:
lm(formula = Height_in ~ school_pntg, data = paintings)
Coefficients:
(Intercept) school_pntgD/FL school_pntgF school_pntgG
14.000 2.329 10.197 1.650
school_pntgI school_pntgS school_pntgX
10.287 30.429 2.869
Note: we don’t need to use factor(school_pntg) because school_pntg is text (not numeric entries).
R reads the categories in school_pntg in alphabetical order ## Relationship between height and school
We can manually select the baseline category using fct_relevel.
linear_reg() %>%
set_engine("lm") %>%
fit(Height_in ~ school_relevel, data = paintings_relevel) %>%
tidy()# A tibble: 7 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 44.4 5.36 8.30 1.59e-16
2 school_relevelI -20.1 5.40 -3.73 1.97e- 4
3 school_relevelF -20.2 5.37 -3.77 1.68e- 4
4 school_relevelX -27.6 5.81 -4.75 2.16e- 6
5 school_relevelD/FL -28.1 5.37 -5.23 1.77e- 7
6 school_relevelG -28.8 8.30 -3.47 5.31e- 4
7 school_relevelA -30.4 11.4 -2.68 7.44e- 3
Residuals: look for structure, size, and spread
Standard error:
Read SEs as “On average, the model’s predictions are off by this many units of the outcome variable.”
Lower is better, but only compared to alternative models on the same data.
R-squared: Explanatory power of model
We will focus on the fastest way to detect problems with a model fit…
Residuals
The vertical arrows from the predicted to observed values represent the residuals. The up arrows correspond to positive residuals, and the down arrows correspond to negative residuals.
each point with a value greater than zero corresponds to a data point in the original data set where the observed value is greater than the predicted value. Similarly, negative values correspond to data points where the observed value is less than the predicted value.
Unexpected patterns