Linear Regression Model Specification (regression)
Computational engine: lm
Intro to Data Analytics
Install and load tidymodels
(see next slide if you get thrown an error).
Download and place the R script file LM-code.R
in your working directory.
tidymodels
package.The newest version of tidymodels
requires you to either:
Update the rlang
package. Use install.packages("rlang")
to update.
Update your version of RStudio. To do so, you will need to download the newest release of RStudio (that came out just after we started the semester):
Note: you do not need to uninstall anything in order to update.
y
and x
by using
slope
for numerical explanatory variablesdifferences
for categorical explanatory variablesx
, get predicted y
\[\widehat{height}_{i} = \beta_0 + \beta_1 \times width_{i}\]
… using formula syntax
parsnip model object
Call:
stats::lm(formula = Height_in ~ Width_in, data = data)
Coefficients:
(Intercept) Width_in
3.6214 0.7808
\[\widehat{height}_{i} = 3.6214 + 0.7808 \times width_{i}\]
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 3.62 0.254 14.3 8.82e-45
2 Width_in 0.781 0.00950 82.1 0
\[\widehat{height}_{i} = 3.62 + 0.781 \times width_{i}\]
\[\widehat{height}_{i} = 3.62 + 0.781 \times width_{i}\]
Take out your notes on regression from last week.
If you haven’t already, download and place the LM-code.R
R script file in your working directory.
Let’s work through some examples.
\[\widehat{y}_{i} = \beta_0 + \beta_1*x_{i}\]
\[\widehat{y}_{i} = b_0 + b_1*x_{i}\]
\[\bar{y} = b_0 + b_1 \bar{x} ~ \rightarrow ~ b_0 = \bar{y} - b_1 \bar{x}\]
Up to now dependent and independent variables are continuous
But life is not continuous
Categorical variables have special properties in models
# A tibble: 3,393 × 3
name Height_in landsALL
<chr> <dbl> <dbl>
1 L1764-2 37 0
2 L1764-3 18 0
3 L1764-4 13 1
4 L1764-5a 14 1
5 L1764-5b 14 1
6 L1764-6 7 0
7 L1764-7a 6 0
8 L1764-7b 6 0
9 L1764-8 15 0
10 L1764-9a 9 0
11 L1764-9b 9 0
12 L1764-10a 16 1
13 L1764-10b 16 1
14 L1764-10c 16 1
15 L1764-11 20 0
16 L1764-12a 14 1
17 L1764-12b 14 1
18 L1764-13a 15 1
19 L1764-13b 15 1
20 L1764-14 37 0
# ℹ 3,373 more rows
landsALL = 0
: No landscape featureslandsALL = 1
: Some landscape features# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 22.7 0.328 69.1 0
2 factor(landsALL)1 -5.65 0.532 -10.6 7.97e-26
factor()
because we want R to read the numerical dummy variable as two categories (not as numerical values).\[\widehat{Height_{in}} = 22.7 - 5.645~landsALL\]
landsALL = 0
) to the other level (landsALL = 1
)Extend this out to multiple categories in one variable
# A tibble: 7 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 14.0 10.0 1.40 0.162
2 school_pntgD/FL 2.33 10.0 0.232 0.816
3 school_pntgF 10.2 10.0 1.02 0.309
4 school_pntgG 1.65 11.9 0.139 0.889
5 school_pntgI 10.3 10.0 1.02 0.306
6 school_pntgS 30.4 11.4 2.68 0.00744
7 school_pntgX 2.87 10.3 0.279 0.780
forcats
package fct_relevel
.school_pntg | D_FL | F | G | I | S | X |
---|---|---|---|---|---|---|
A | 0 | 0 | 0 | 0 | 0 | 0 |
D/FL | 1 | 0 | 0 | 0 | 0 | 0 |
F | 0 | 1 | 0 | 0 | 0 | 0 |
G | 0 | 0 | 1 | 0 | 0 | 0 |
I | 0 | 0 | 0 | 1 | 0 | 0 |
S | 0 | 0 | 0 | 0 | 1 | 0 |
X | 0 | 0 | 0 | 0 | 0 | 1 |
# A tibble: 3,393 × 3
name Height_in school_pntg
<chr> <dbl> <chr>
1 L1764-2 37 F
2 L1764-3 18 I
3 L1764-4 13 D/FL
4 L1764-5a 14 F
5 L1764-5b 14 F
6 L1764-6 7 I
7 L1764-7a 6 F
8 L1764-7b 6 F
9 L1764-8 15 I
10 L1764-9a 9 D/FL
11 L1764-9b 9 D/FL
12 L1764-10a 16 X
13 L1764-10b 16 X
14 L1764-10c 16 X
15 L1764-11 20 D/FL
16 L1764-12a 14 D/FL
17 L1764-12b 14 D/FL
18 L1764-13a 15 D/FL
19 L1764-13b 15 D/FL
20 L1764-14 37 F
# ℹ 3,373 more rows
We can manually select the baseline category. This is the variable reflected in the intercept.
# A tibble: 7 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 44.4 5.36 8.30 1.59e-16
2 school_relevelI -20.1 5.40 -3.73 1.97e- 4
3 school_relevelF -20.2 5.37 -3.77 1.68e- 4
4 school_relevelX -27.6 5.81 -4.75 2.16e- 6
5 school_relevelD/FL -28.1 5.37 -5.23 1.77e- 7
6 school_relevelG -28.8 8.30 -3.47 5.31e- 4
7 school_relevelA -30.4 11.4 -2.68 7.44e- 3
The vertical arrows from the predicted to observed values represent the residuals. The up arrows correspond to positive residuals, and the down arrows correspond to negative residuals.
each point with a value greater than zero corresponds to a data point in the original data set where the observed value is greater than the predicted value. Similarly, negative values correspond to data points where the observed value is less than the predicted value.
Unexpected patterns