library(ggplot2)
<- read.csv("data/students_env221.csv")
dtf ggplot(dtf) + geom_boxplot(aes(SHOE))
<- dtf[dtf$SHOE < 100 & !is.na(dtf$SHOE), ]
dtf <- mean(dtf$SHOE, na.rm = TRUE)
mean_shoe round(mean_shoe, 1)
[1] 39.5
In this lecture, you will
Example: Predict the shoe size of an ENV221 student. What is the best prediction for a student I do not know?
library(ggplot2)
<- read.csv("data/students_env221.csv")
dtf ggplot(dtf) + geom_boxplot(aes(SHOE))
<- dtf[dtf$SHOE < 100 & !is.na(dtf$SHOE), ]
dtf <- mean(dtf$SHOE, na.rm = TRUE)
mean_shoe round(mean_shoe, 1)
[1] 39.5
The more information about the students, the more accurate prediction.
plot(x = dtf$HEIGHT, y = dtf$SHOE, xlab = "Height (cm)", ylab = "Shoe size", pch = 16)
<- mean(dtf$HEIGHT, na.rm = TRUE)
mean_height points(x = mean_height, y = mean_shoe, col = "red", cex = 3)
abline(h = mean_shoe, v = mean_height, col = "red")
Generally, a tall student has a large shoe size.
Then, if we know a student’s height, can we predict her/his shoe size better than using the mean?
\[y' = a + bx\]
The regression coefficients. \(a\) is the intercept, i.e. the point where the line crosses the vertical axis, and \(b\) is the slope.
The best fitting line, which passes through the maximum number of points and isn’t too far away from the points that it does not pass through.
\[b = \frac{\sum(x-\bar x)(y -\bar y)}{\sum(x-\bar x)^2}\]
\[a = \bar y - b\bar x\]
Example: Calculate \(a\) and \(b\) for the regression line of the shoe size against the height, using these two equations above. Give the prediction equation.
<- mean(dtf$HEIGHT)
mean_height <- sum((dtf$HEIGHT - mean_height) * (dtf$SHOE - mean_shoe)) / sum((dtf$HEIGHT - mean_height) ^ 2)
b <- mean_shoe - b * mean_height
a a
[1] -8.601667
b
[1] 0.2835591
\[y' = -8.60 + 0.284 x\]
\(y-y'\) (Deviations of the observed \(y\)-values around the predicted \(y\)-values)
roughly an average of the squared residuals:
\[s^2_{yx} = \frac{\sum(y-y')^2}{n-2}\]
Example: Fill in the following table. Calculate the residual standard error.
$predicted <- round(a + b * dtf$HEIGHT, 1)
dtf$residual <- round(dtf$SHOE - dtf$predicted, 1) dtf
Residual standard error:
<- sqrt(sum(dtf$residual ^ 2) / (nrow(dtf) - 2))
syx syx
[1] 1.470015
The smaller the residual standard error is, the better the fit is.
But how small is the best?
Assumptions:
SLR model:
\[y' = \alpha + \beta x +\epsilon\]
\[\epsilon = y' - (\alpha + \beta x) = y' - \mu_{y|x} \]
The error term. The amount by which an observation y deviates from the mean of the subpopulation of Y values from which it is drawn.
Is there relationshipt between \(X\) and \(Y\)?
Collect data.
Calculate a test statistic: \(F\)-test.
<- 177
xi <- lm_env$coefficients[1] + lm_env$coefficients[2] * xi
y_prime <- mean_shoe
y_mean <- dtf$SHOE[dtf$HEIGHT == 177][1]
yi
plot(x = dtf$HEIGHT, y = dtf$SHOE, xlab = "Height (cm)", ylab = "Shoe size", pch = 16, las = 1)
points(x = mean_height, y = mean_shoe, col = "red", cex = 3)
abline(lm_env, col = "blue", lty = 2)
points(x = rep(xi, 2), y = c(y_prime, yi),
col = c("blue", "black"), pch = c(1, 16), cex = 2)
abline(h = c(y_mean, y_prime, yi), col = c("red", "blue", "black"))
text(x = rep(155, 3), y = c(y_mean, y_prime, yi),
labels = c(expression(bar(y)), expression("y'"), expression(y[i])),
pos = 3,
col = c('red', 'blue', 'black'))
arrows(xi - 1.3, y_mean, xi - 1.3, yi, length = 0.2, code = 3, col = 'red')
text(xi - 1.3, mean(c(y_mean, yi)), expression(SS[total]), pos = 2, col = 'red')
arrows(xi, y_mean, xi, y_prime, length = 0.2, code = 3, col = 'purple')
text(xi, mean(c(y_mean, y_prime)), expression(SS[regression]), pos = 4, col = 'purple')
arrows(xi + 1.3, y_prime, xi + 1.3, yi, length = 0.2, code = 3, col = 'blue')
text(xi + 1.3, mean(c(y_prime, yi)), expression(SS[residual]), pos = 4, col = 'blue')
\[y_i - \bar y = (y_i-y') - (y'-\bar y)\]
\[SS_\mathrm{residual} + SS_\mathrm{regression} = SS_\mathrm{total}\]
\[df_\mathrm{residual} + df_\mathrm{regression} = df_\mathrm{total}\]
\[df_\mathrm{regression} = 1\]
\[df_\mathrm{total} = n - 1\]
\[df_\mathrm{residual} = n - 2\]
Source | \(df\) | \(SS\) (Sum of squares) | \(MS\) (Mean of squares) | \(F\) |
---|---|---|---|---|
between the obs. and the mean | \(df_\mathrm{total} = n - 1\) | \(SS_\mathrm{total} = \sum(y_i -\bar y)^2\) | \(MS_\mathrm{total} = \frac{SS_\mathrm{total}}{df_\mathrm{total}}\) | \(F = \frac{MS_\mathrm{regression}}{MS_\mathrm{residual}}\) |
between the obs. and the regression line | \(df_\mathrm{residual} = n - 2\) | \(SS_\mathrm{residual} = \sum(y_i-y')^2\) | \(MS_\mathrm{residual} = \frac {SS_\mathrm{residual}}{df_\mathrm{residual}}\) | |
between the regression line and the mean | \(df_\mathrm{regression} = 1\) | \(SS_\mathrm{regression} = \sum(y'-\bar y)^2\) | \(MS_\mathrm{regression} = \frac{SS_\mathrm{regression}}{df_\mathrm{regression}}\) |
Equivalent hypotheses:
Action: Fill in the following table.
Source | \(df\) | \(SS\) | \(MS\) | \(F\) |
---|---|---|---|---|
between the obs. and the mean | ||||
between the obs. and the regression line | ||||
between the regression line and the mean |
# df
<- nrow(dtf)
n <- n - 1)
(dftot <- n - 2)
(dfres <- 1)
(dfreg - dfres
dftot
# SS
<- sum((dtf$SHOE - mean_shoe) ^ 2))
(SStot <- sum((dtf$SHOE - dtf$predicted) ^2))
(SSres <- sum((dtf$predicted - mean_shoe) ^ 2))
(SSreg + SSres
SSreg
# MS
<- SSreg / dfreg)
(MSreg <- SSres / dfres)
(MSres
# F
<- MSreg / MSres)
(F_score <- qf(0.05, df1 = dfreg, df2 = dfres, lower.tail = FALSE))
(F_critical <- pf(F_score, df1 = dfreg, df2 = dfres, lower.tail = FALSE)) (p_value
Source | \(df\) | \(SS\) | \(MS\) | \(F\) |
---|---|---|---|---|
between the obs. and the mean | 54 | 503.4272727 | ||
between the obs. and the regression line | 53 | 114.53 | 2.1609434 | |
between the regression line and the mean | 1 | 388.4354545 | 388.4354545 | 179.7527206 |
With 1 and 53 degrees of freedom and \(\alpha = 0.05\), the critical value of \(F\) is 4.023017. The calculated \(F\) (179.7527206) is greater than the critical \(F\), so the decision is to reject \(H_0\).
Is the population slope (\(\beta\) in \(\mu_{y|x} = \alpha + \beta x\)) a given value (such as 0)?
Collect data
Calculate a test statistic. \(t\)-test.
\[t = \frac{b - \beta}{s_b}\] \[s_b = \frac{s_{yx}}{s_x \sqrt{n-1}}\]
Example: Calculate the \(t\) score for the slope of the Shoe-Height regression line.
syx<- sd(dtf$HEIGHT))
(sx <- syx/(sx * sqrt(n-1)))
(sb <- 0
beta
b<- (b-beta) /sb)
(t_score <- qt(0.975, df = n-2))
(t_critical pt(t_score, df = n - 2, lower.tail = FALSE) * 2
With 53 degrees of freedom and \(\alpha = 0.05\), the critical value of \(t\) is 2.005746. The calculated \(t\) (13.422428) is greater than the critical \(t\), so the decision is to reject \(H_0\).
Compare the p-value with the one in the previous F-test.
Is the population intercept (\(\alpha\) in \(\mu_{y|x} = \alpha + \beta x\)) a given value (such as 0)?
Collect data
Calculate a test statistic. \(t\)-test.
\[t = \frac{a - \alpha}{s_a}\] \[s_a = s_{yx}\sqrt{\frac{1}{n} + \frac{\bar x^2}{(n-1)s_x^2}}\]
Example: Calculate the \(t\) score for the intercept of the Shoe-Height regression line.
syx
n<- mean_height)
(xbar
sx<- syx * sqrt(1/n + xbar ^2 / ((n-1) * sx^2)))
(sa <- 0
alpha
a<- (a-alpha) /sa)
(t_score <- qt(0.975, df = n-1))
(t_critical pt(t_score, df = n - 2) * 2
With 54 degrees of freedom and \(\alpha = 0.05\), the critical value of \(t\) is 2.0048793. The absolute value of the calculated \(t\) (-2.394766) is greater than the critical \(t\), so the decision is to reject \(H_0\).
<- lm(SHOE ~ HEIGHT, data = dtf)
env_lm
env_lm$coefficients
env_lmsummary(env_lm)
plot(dtf$HEIGHT, dtf$SHOE, xlab = 'Height (cm)', ylab = 'Shoe size', las = 1, pch = 16)
abline(lm_env)
library(ggplot2)
ggplot(dtf, aes(HEIGHT, SHOE)) +
geom_point() +
geom_smooth(method = "lm") +
labs(x = 'Height (cm)', y = 'Shoe size')
Sample correlation coefficient \(r\):
\[r = \frac{\frac{1}{n-1} \sum(x-\bar x)(y - \bar y)}{s_x s_y}\]
or:
\[ r = \frac{\mathrm{cov}(x,y)}{s_x s_y}\]
Example: Calculate the correlation coefficient for the height and shoe size data using the equations above.
<- nrow(dtf)
n <- 1 / (n - 1) * sum((dtf$SHOE - mean(dtf$SHOE)) * (dtf$HEIGHT - mean(dtf$HEIGHT)))
co cov(dtf$SHOE, dtf$HEIGHT)
<- sd(dtf$HEIGHT)
sx <- sd(dtf$SHOE)
sy <- co / (sx * sy)
r
# one step
cor(dtf$HEIGHT, dtf$SHOE)
What does it mean?
Source | \(df\) | \(SS\) | \(MS\) | \(F\) |
---|---|---|---|---|
between the obs. and the mean | \(df_\mathrm{total} = n - 1\) | \(SS_\mathrm{total} = \sum(y -\bar y)^2\) | \(MS_\mathrm{total} = \frac{SS_\mathrm{total}}{df_\mathrm{total}}\) | \(F = \frac{MS_\mathrm{regression}}{MS_\mathrm{residual}}\) |
between the obs. and the regression line | \(df_\mathrm{residual} = n - 2\) | \(SS_\mathrm{residual} = \sum(y-y')^2\) | \(MS_\mathrm{residual} = \frac {SS_\mathrm{residual}}{df_\mathrm{residual}}\) | |
between the regression line and the mean | \(df_\mathrm{regression} = 1\) | \(SS_\mathrm{regression} = \sum(y'-\bar y)^2\) | \(MS_\mathrm{regression} = \frac{SS_\mathrm{regression}}{df_\mathrm{regression}}\) |
Alternatively, we compare \(SS_\mathrm{regression}\) with \(SS_\mathrm{total}\). If the ratio of them is large, it tells us the \(x\)-\(y\) relationship is strong.
Example: Calculate \(\frac{SS_\mathrm{regression}}{SS_\mathrm{total}}\). Calculate \(r^2\).
$predicted <- a + b * dtf$HEIGHT
dtf<- sum((dtf$SHOE - mean(dtf$SHOE)) ^ 2)
SStot <- sum((dtf$SHOE - dtf$predicted) ^2)
SSres <- sum((dtf$predicted - mean(dtf$SHOE)) ^ 2)
SSreg
/ SStot SSreg
[1] 0.7733371
^ 2 r
[1] 0.7733371
\(SS_\mathrm{regression}\) explains 77% of \(SS_\mathrm{total}\). Is it a large proportion?
Correlation Coefficient (r) | Description (Rough Guideline ) |
---|---|
+1.0 | Perfect + association |
+0.8 to 1.0 | Very strong + association |
+0.6 to 0.8 | Strong + association |
+0.4 to 0.6 | Moderate + association |
+0.2 to 0.4 | Weak + association |
0.0 to +0.2 | Very weak + or no association |
0.0 to -0.2 | Very weak - or no association |
-0.2 to – 0.4 | Weak - association |
-0.4 to -0.6 | Moderate - association |
-0.6 to -0.8 | Strong - association |
-0.8 to -1.0 | Very strong - association |
-1.0 | Perfect negative association |
Is the population correlation coefficient \(\rho\) equal to a given value (such as zero)?
Collect data
Calculate a test statistic.
\[t = \frac{r - \rho}{s_r}\]
\[s_r = \sqrt {\frac{1 - r ^2}{n-2}}\]
\[t=r \sqrt{\frac{n-2}{1-r^{2}}}\]
Action: Calculate the \(t\) score of the hypothesis test about correlation coefficient.
<- r * sqrt((n-2) / (1 - r ^ 2))
t_score <- qt(0.975, df = n - 2)
t_critical pt(t_score, df = n-2, lower.tail = FALSE) * 2
<- lm(SHOE ~ HEIGHT, data = dtf)
env_lm env_lm
Call:
lm(formula = SHOE ~ HEIGHT, data = dtf)
Coefficients:
(Intercept) HEIGHT
-8.6017 0.2836
summary(env_lm)
Call:
lm(formula = SHOE ~ HEIGHT, data = dtf)
Residuals:
Min 1Q Median 3Q Max
-2.7527 -0.8123 -0.0363 1.0535 3.4117
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.60167 3.58524 -2.399 0.02 *
HEIGHT 0.28356 0.02109 13.447 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.467 on 53 degrees of freedom
Multiple R-squared: 0.7733, Adjusted R-squared: 0.7691
F-statistic: 180.8 on 1 and 53 DF, p-value: < 2.2e-16
Example: Use the iris data. Calculate \(r\) between Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width.
<- iris[, -5]
i2 <- cor(i2)
icor library(beginr)
lmdf(i2)
library(psych)
<- corr.test(iris[, 1:4])
tb_corr $r tb_corr
The corrplot package:
library(corrplot)
par(mfrow = c(4, 2))
corrplot(icor)
corrplot(icor, order = "AOE")
corrplot(icor, order = "AOE", method = "number")
corrplot(icor, order = "AOE", method = "ellipse")
corrplot(icor, order = "AOE", method = "shade")
corrplot(icor, order = "AOE", method = "pie")
corrplot(icor, order = "AOE", addCoef.col = "grey")
corrplot(icor, order = "AOE", type = "upper")
corrplot(icor, order = "AOE", type = "lower", method = "ell",
diag = FALSE, tl.pos = "n", cl.pos = "n", add = TRUE)
The GGally package:
library(GGally)
ggpairs(dtf[, c("HEIGHT", "SHOE")])
ggpairs(iris, aes(fill = Species, color = Species, alpha = 0.5))