ENV221 L10

Author

Peng Zhao


10 Numerical vs. Numerical

10.1 Learning objectives

In this lecture, you will

  1. Explore a relationship between two numeric variables,
  2. Work with regression,
  3. Understand what correlation is all about,
  4. Discover how correlation connects to regression, and
  5. Visualize linear regression and correlations.

10.2 Regression

10.2.1 Prediction

Example: Predict the shoe size of an ENV221 student. What is the best prediction for a student we know nothing about?

  • The mean.
library(ggplot2)
dtf <- read.csv("data/students_env221.csv")
ggplot(dtf) + geom_boxplot(aes(SHOE))  # check the distribution for outliers

dtf <- dtf[dtf$SHOE < 100 & !is.na(dtf$SHOE), ]  # drop an impossible value and missing values
mean_shoe <- mean(dtf$SHOE, na.rm = TRUE)
round(mean_shoe, 1)
[1] 39.5

The more information we have about a student, the more accurate the prediction can be.

plot(x = dtf$HEIGHT, y = dtf$SHOE, xlab = "Height (cm)", ylab = "Shoe size", pch = 16)
mean_height <- mean(dtf$HEIGHT, na.rm = TRUE)
points(x = mean_height, y = mean_shoe, col = "red", cex = 3)
abline(h = mean_shoe, v = mean_height, col = "red")

Generally, a tall student has a large shoe size.

Then, if we know a student’s height, can we predict her/his shoe size better than using the mean?

10.2.2 Linear regression

  • Draw a line to summarize the relationship in the scatter plot.
  • We can draw many such lines. Which one best summarizes the relationship?

\[y' = a + bx\]

\(a, b\):

The regression coefficients. \(a\) is the intercept, i.e. the point where the line crosses the vertical axis, and \(b\) is the slope.

The regression line (The Least Squares Line):

The best-fitting line: it passes through or close to as many points as possible, without straying far from the points it misses.

  • It minimizes the sum of the squared vertical distances between the observed points and the line: \(\sum(y - y')^2\) is a minimum.

\[b = \frac{\sum(x-\bar x)(y -\bar y)}{\sum(x-\bar x)^2}\]

\[a = \bar y - b\bar x\]

Example: Calculate \(a\) and \(b\) for the regression line of the shoe size against the height, using these two equations above. Give the prediction equation.

mean_height <- mean(dtf$HEIGHT)
# slope: b = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)
b <- sum((dtf$HEIGHT - mean_height) * (dtf$SHOE - mean_shoe)) / sum((dtf$HEIGHT - mean_height) ^ 2)
# intercept: a = ybar - b * xbar
a <- mean_shoe - b * mean_height
a
[1] -8.601667
b
[1] 0.2835591

\[y' = -8.60 + 0.284 x\]
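To check that these coefficients really minimize the sum of squared vertical distances, we can perturb them slightly and watch the sum grow. This is a minimal sketch; the helper function ss() and the perturbed values are only for illustration.

# sum of squared vertical distances for a candidate line y' = a0 + b0 * x
ss <- function(a0, b0) sum((dtf$SHOE - (a0 + b0 * dtf$HEIGHT)) ^ 2)
ss(a, b)          # the least-squares line
ss(a, b + 0.05)   # a slightly steeper line: larger sum
ss(a + 1, b)      # a line shifted upwards: larger sum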

10.2.3 Variation around the regression line

Residual:

\(y-y'\) (Deviations of the observed \(y\)-values around the predicted \(y\)-values)

Residual variance:

roughly an average of the squared residuals:

\[s^2_{yx} = \frac{\sum(y-y')^2}{n-2}\]

Residual standard error:
\(s_{yx} = \sqrt\frac{\sum(y-y')^2}{n-2}\)

Example: Compute the predicted shoe size and the residual for each student. Calculate the residual standard error.

dtf$predicted <- round(a + b * dtf$HEIGHT, 1)
dtf$residual <- round(dtf$SHOE - dtf$predicted, 1)

Residual standard error:

syx <- sqrt(sum(dtf$residual ^ 2) / (nrow(dtf) - 2))
syx
[1] 1.470015

The smaller the residual standard error is, the better the fit is.

But how small is small enough?

10.2.4 Simple linear regression model (SLR)

  • X: the independent variable. Input or cause variables. Synonyms: predictor variable, regressor, controlled variable, manipulated variable, explanatory variable, exposure variable, risk factor, feature and input variable.
  • Y: the dependent variable. Output or effect variables. Synonyms: Response variable, regressand, measured variable, responding variable, explained variable, outcome variable, experimental variable, and output variable.
  • Regressing Y on X

Assumptions:

  1. Values of X are “fixed” and non-random.
  2. X is measured without error.
  3. For each value of X there is a subpopulation of values of Y that are normally distributed.
  4. Assumption of homoskedasticity: The variances of the sub-populations of Y are equal.
  5. Assumption of linearity: The means of the sub-populations of Y all lie on the same straight line \(\mu_{y|x} = \alpha + \beta x\)
  6. The Y values are statistically independent.

SLR model:

\[y = \alpha + \beta x +\epsilon\]

\[\epsilon = y - (\alpha + \beta x) = y - \mu_{y|x} \]

\(\epsilon\):

The error term. The amount by which an observation y deviates from the mean of the subpopulation of Y values from which it is drawn.
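These assumptions are easiest to see in simulated data that satisfies them by construction. A minimal sketch, where all parameter values are arbitrary choices for illustration:

# simulate data satisfying the SLR assumptions:
# fixed x values, independent normal errors with constant variance
set.seed(1)
x <- seq(150, 190, by = 1)                       # fixed, error-free x
epsilon <- rnorm(length(x), mean = 0, sd = 1.5)  # homoskedastic normal errors
y <- -8.6 + 0.28 * x + epsilon                   # y scatters around mu_{y|x}
plot(x, y, pch = 16, las = 1)
abline(a = -8.6, b = 0.28, col = "red")          # the population line mu_{y|x} = alpha + beta x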

10.2.5 Hypothesis test about the fit

Is there a relationship between \(X\) and \(Y\)?

  1. Hypotheses
  • \(H_0\): There is no relationship between the height and the shoe size.
  • \(H_1\): There is a relationship between the height and the shoe size.
  • Question: Do we reject \(H_0\) at a given \(\alpha\)?
  2. Collect data.

  3. Calculate a test statistic: the \(F\)-test.

Code
lm_env <- lm(SHOE ~ HEIGHT, data = dtf)  # fit the model used in the plot below
xi <- 177
y_prime <- lm_env$coefficients[1] + lm_env$coefficients[2] * xi
y_mean <- mean_shoe
yi <- dtf$SHOE[dtf$HEIGHT == 177][1]

plot(x = dtf$HEIGHT, y = dtf$SHOE, xlab = "Height (cm)", ylab = "Shoe size", pch = 16, las = 1)
points(x = mean_height, y = mean_shoe, col = "red", cex = 3)
abline(lm_env, col = "blue", lty = 2)
points(x = rep(xi, 2), y = c(y_prime, yi), 
       col = c("blue", "black"), pch = c(1, 16), cex = 2)
abline(h = c(y_mean, y_prime, yi), col = c("red", "blue", "black"))
text(x = rep(155, 3), y = c(y_mean, y_prime, yi), 
     labels = c(expression(bar(y)), expression("y'"), expression(y[i])), 
     pos = 3, 
     col = c('red', 'blue', 'black'))
arrows(xi - 1.3, y_mean, xi - 1.3, yi, length = 0.2, code = 3, col = 'red')
text(xi - 1.3, mean(c(y_mean, yi)), expression(SS[total]), pos = 2, col = 'red')
arrows(xi, y_mean, xi, y_prime, length = 0.2, code = 3, col = 'purple')
text(xi, mean(c(y_mean, y_prime)), expression(SS[regression]), pos = 4, col = 'purple')
arrows(xi + 1.3, y_prime, xi + 1.3, yi, length = 0.2, code = 3, col = 'blue')
text(xi + 1.3, mean(c(y_prime, yi)), expression(SS[residual]), pos = 4, col = 'blue')

\[y_i - \bar y = (y_i-y') + (y'-\bar y)\]

\(\sum(y_i-y')^2\):
The residual sum of squares (\(SS_\mathrm{residual}\)).
\(\sum(y'-\bar y)^2\):
The regression sum of squares (\(SS_\mathrm{regression}\)). The gain in prediction due to using the regression line rather than using the mean.
\(\sum(y_i -\bar y)^2\):
The total sum of squares (\(SS_\mathrm{total}\)). The numerator of the variance of \(y\) (the total variance).

\[SS_\mathrm{residual} + SS_\mathrm{regression} = SS_\mathrm{total}\]

\[df_\mathrm{residual} + df_\mathrm{regression} = df_\mathrm{total}\]

\[df_\mathrm{regression} = 1\]

\[df_\mathrm{total} = n - 1\]

\[df_\mathrm{residual} = n - 2\]

| Source | \(df\) | \(SS\) (sum of squares) | \(MS\) (mean of squares) | \(F\) |
|---|---|---|---|---|
| between the obs. and the mean | \(df_\mathrm{total} = n - 1\) | \(SS_\mathrm{total} = \sum(y_i -\bar y)^2\) | \(MS_\mathrm{total} = \frac{SS_\mathrm{total}}{df_\mathrm{total}}\) | \(F = \frac{MS_\mathrm{regression}}{MS_\mathrm{residual}}\) |
| between the obs. and the regression line | \(df_\mathrm{residual} = n - 2\) | \(SS_\mathrm{residual} = \sum(y_i-y')^2\) | \(MS_\mathrm{residual} = \frac{SS_\mathrm{residual}}{df_\mathrm{residual}}\) | |
| between the regression line and the mean | \(df_\mathrm{regression} = 1\) | \(SS_\mathrm{regression} = \sum(y'-\bar y)^2\) | \(MS_\mathrm{regression} = \frac{SS_\mathrm{regression}}{df_\mathrm{regression}}\) | |
  • If \(H_0\) is true (i.e. no relationship between \(x\) and \(y\)), \(MS_\mathrm{regression}\) should be no greater than \(MS_\mathrm{residual}\).
  • If \(H_0\) is false (i.e. relationship between \(x\) and \(y\)), \(MS_\mathrm{regression}\) should be greater than \(MS_\mathrm{residual}\).

Equivalent hypotheses:

  • \(H_0: \sigma ^2_\mathrm{regression} \le \sigma ^2_\mathrm{residual}\)
  • \(H_1: \sigma ^2_\mathrm{regression} > \sigma ^2_\mathrm{residual}\)
  • \(F\)-test

Action: Fill in the following table.

| Source | \(df\) | \(SS\) | \(MS\) | \(F\) |
|---|---|---|---|---|
| between the obs. and the mean | | | | |
| between the obs. and the regression line | | | | |
| between the regression line and the mean | | | | |
# df
n <- nrow(dtf)
(dftot <- n - 1)
(dfres <- n - 2)
(dfreg <- 1)
dftot - dfres

# SS
(SStot <- sum((dtf$SHOE - mean_shoe) ^ 2))
(SSres <- sum((dtf$SHOE - dtf$predicted) ^2))
(SSreg <- sum((dtf$predicted - mean_shoe) ^ 2))
SSreg + SSres

# MS
(MSreg <- SSreg / dfreg)
(MSres <- SSres / dfres)

# F
(F_score <- MSreg / MSres)
(F_critical <- qf(0.05, df1 = dfreg, df2 = dfres, lower.tail = FALSE))
(p_value <- pf(F_score, df1 = dfreg, df2 = dfres, lower.tail = FALSE))
| Source | \(df\) | \(SS\) | \(MS\) | \(F\) |
|---|---|---|---|---|
| between the obs. and the mean | 54 | 503.43 | | |
| between the obs. and the regression line | 53 | 114.53 | 2.16 | |
| between the regression line and the mean | 1 | 388.44 | 388.44 | 179.75 |
  4. Decision.

With 1 and 53 degrees of freedom and \(\alpha = 0.05\), the critical value of \(F\) is 4.02. The calculated \(F\) (179.75) is greater than the critical \(F\), so the decision is to reject \(H_0\).

  5. Conclusion. There is a relationship between the body height and the shoe size.
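R builds the same ANOVA table in one step with anova(); the small differences from the table above come from rounding the predicted values to one decimal place.

anova(lm(SHOE ~ HEIGHT, data = dtf))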

10.2.6 Hypothesis test about the slope

Is the population slope (\(\beta\) in \(\mu_{y|x} = \alpha + \beta x\)) a given value (such as 0)?

  1. Hypotheses:
  • \(H_0: \beta = 0\). The slope of the regression line is zero.
  • \(H_1: \beta \ne 0\). The slope of the regression line is not zero.
  2. Collect data.

  3. Calculate a test statistic: the \(t\)-test.

\[t = \frac{b - \beta}{s_b}\] \[s_b = \frac{s_{yx}}{s_x \sqrt{n-1}}\]

Example: Calculate the \(t\) score for the slope of the Shoe-Height regression line.

Code
syx                                # residual standard error
(sx <- sd(dtf$HEIGHT))             # standard deviation of x
(sb <- syx / (sx * sqrt(n - 1)))   # standard error of the slope
beta <- 0
b
(t_score <- (b - beta) / sb)
(t_critical <- qt(0.975, df = n - 2))
pt(t_score, df = n - 2, lower.tail = FALSE) * 2   # two-sided p-value
  4. Decision.

With 53 degrees of freedom and \(\alpha = 0.05\), the critical value of \(t\) is 2.005746. The calculated \(t\) (13.42) is greater than the critical \(t\), so the decision is to reject \(H_0\).

  5. Conclusion. The slope of the regression line is not equal to 0.

Compare the p-value with the one in the previous F-test.
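In simple linear regression the two tests are equivalent: the \(F\) statistic equals the square of the slope's \(t\) statistic, and the two p-values are identical. With the objects computed above (which carry some rounding from the rounded predictions):

t_score ^ 2   # approximately equal to F_score
F_score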

10.2.7 Hypothesis test about the intercept

Is the population intercept (\(\alpha\) in \(\mu_{y|x} = \alpha + \beta x\)) a given value (such as 0)?

  1. Hypotheses:
  • \(H_0: \alpha = 0\). The intercept of the regression line is zero.
  • \(H_1: \alpha \ne 0\). The intercept of the regression line is not zero.
  2. Collect data.

  3. Calculate a test statistic: the \(t\)-test.

\[t = \frac{a - \alpha}{s_a}\] \[s_a = s_{yx}\sqrt{\frac{1}{n} + \frac{\bar x^2}{(n-1)s_x^2}}\]

Example: Calculate the \(t\) score for the intercept of the Shoe-Height regression line.

Code
syx
n
(xbar <- mean_height)
sx
(sa <- syx * sqrt(1 / n + xbar ^ 2 / ((n - 1) * sx ^ 2)))   # standard error of the intercept
alpha <- 0
a
(t_score <- (a - alpha) / sa)
(t_critical <- qt(0.975, df = n - 2))
pt(t_score, df = n - 2) * 2   # two-sided p-value (t_score is negative)
  4. Decision.

With 53 degrees of freedom and \(\alpha = 0.05\), the critical value of \(t\) is 2.005746. The absolute value of the calculated \(t\) (-2.39) is greater than the critical \(t\), so the decision is to reject \(H_0\).

  5. Conclusion: The intercept of the regression line is not equal to 0.

10.2.8 One-step in R

env_lm <- lm(SHOE ~ HEIGHT, data = dtf)  # fit the model in one step
env_lm
env_lm$coefficients                      # a and b
summary(env_lm)                          # coefficients, tests, R-squared, and F
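The summary() output reproduces the manual tests above. Confidence intervals for the coefficients follow with confint():

confint(env_lm)   # 95% confidence intervals for the intercept and the slope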

10.2.9 Visualization

Code
plot(dtf$HEIGHT, dtf$SHOE, xlab = 'Height (cm)', ylab = 'Shoe size', las = 1, pch = 16)
abline(env_lm)

Code
library(ggplot2)
ggplot(dtf, aes(HEIGHT, SHOE)) +
  geom_point() +
  geom_smooth(method = "lm") + 
  labs(x = 'Height (cm)', y = 'Shoe size')

10.3 Correlation

10.3.1 Basic Concepts

Correlation:
The technique used to assess whether, and how strongly, two variables are related to one another.
Positive correlation:
High values on one are associated with high values on the other.
Negative correlation:
High values on one are associated with low values on the other.
Population Correlation Coefficient (\(\rho\)):
The parameter which measures the strength of the linear relationship.
  • \(-1 \le \rho \le +1\).
  • \(\rho = -1\): perfect negative correlation between the two variables.
  • \(\rho = +1\): perfect positive correlation.
  • \(\rho = 0\): the two variables are not linearly correlated.

Sample correlation coefficient \(r\):

\[r = \frac{\frac{1}{n-1} \sum(x-\bar x)(y - \bar y)}{s_x s_y}\]

or:

\[ r = \frac{\mathrm{cov}(x,y)}{s_x s_y}\]

\(\mathrm{cov}(x,y)\):
covariance, which represents \(x\) and \(y\) varying together.

Example: Calculate the correlation coefficient for the height and shoe size data using the equations above.

Code
n <- nrow(dtf)
# covariance by hand: cov(x, y) = sum((x - xbar)(y - ybar)) / (n - 1)
co <- 1 / (n - 1) * sum((dtf$SHOE - mean(dtf$SHOE)) * (dtf$HEIGHT - mean(dtf$HEIGHT)))
cov(dtf$SHOE, dtf$HEIGHT)   # check against the built-in function
sx <- sd(dtf$HEIGHT)
sy <- sd(dtf$SHOE)
(r <- co / (sx * sy))

# one step
cor(dtf$HEIGHT, dtf$SHOE)

What does it mean?

10.3.2 Revisit the regression

| Source | \(df\) | \(SS\) | \(MS\) | \(F\) |
|---|---|---|---|---|
| between the obs. and the mean | \(df_\mathrm{total} = n - 1\) | \(SS_\mathrm{total} = \sum(y -\bar y)^2\) | \(MS_\mathrm{total} = \frac{SS_\mathrm{total}}{df_\mathrm{total}}\) | \(F = \frac{MS_\mathrm{regression}}{MS_\mathrm{residual}}\) |
| between the obs. and the regression line | \(df_\mathrm{residual} = n - 2\) | \(SS_\mathrm{residual} = \sum(y-y')^2\) | \(MS_\mathrm{residual} = \frac{SS_\mathrm{residual}}{df_\mathrm{residual}}\) | |
| between the regression line and the mean | \(df_\mathrm{regression} = 1\) | \(SS_\mathrm{regression} = \sum(y'-\bar y)^2\) | \(MS_\mathrm{regression} = \frac{SS_\mathrm{regression}}{df_\mathrm{regression}}\) | |

Alternatively, we can compare \(SS_\mathrm{regression}\) with \(SS_\mathrm{total}\): if their ratio is large, the relationship between \(x\) and \(y\) is strong.

Example: Calculate \(\frac{SS_\mathrm{regression}}{SS_\mathrm{total}}\). Calculate \(r^2\).

Code
dtf$predicted <- a + b * dtf$HEIGHT
SStot <- sum((dtf$SHOE - mean(dtf$SHOE)) ^ 2)
SSres <- sum((dtf$SHOE - dtf$predicted) ^2)
SSreg <- sum((dtf$predicted - mean(dtf$SHOE)) ^ 2)

SSreg / SStot
[1] 0.7733371
Code
r ^ 2
[1] 0.7733371
\(r^2\):
Coefficient of determination.

\(SS_\mathrm{regression}\) explains 77% of \(SS_\mathrm{total}\). Is it a large proportion?

| Correlation coefficient (\(r\)) | Description (rough guideline) |
|---|---|
| +1.0 | Perfect positive association |
| +0.8 to +1.0 | Very strong positive association |
| +0.6 to +0.8 | Strong positive association |
| +0.4 to +0.6 | Moderate positive association |
| +0.2 to +0.4 | Weak positive association |
| 0.0 to +0.2 | Very weak positive or no association |
| 0.0 to -0.2 | Very weak negative or no association |
| -0.2 to -0.4 | Weak negative association |
| -0.4 to -0.6 | Moderate negative association |
| -0.6 to -0.8 | Strong negative association |
| -0.8 to -1.0 | Very strong negative association |
| -1.0 | Perfect negative association |
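For the shoe-size data, \(r = \sqrt{0.773} \approx 0.88\): a very strong positive association.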

10.3.3 Hypothesis test about correlation

Is the population correlation coefficient \(\rho\) equal to a given value (such as zero)?

  • Sample statistic: \(r\)
  • Population parameter: \(\rho\)
  1. Hypotheses
  • \(H_0: \rho = 0\)
  • \(H_1: \rho \ne 0\)
  • Question: Do we reject \(H_0\) at a given \(\alpha\)?
  2. Collect data.

  3. Calculate a test statistic.

\[t = \frac{r - \rho}{s_r}\]

\[s_r = \sqrt {\frac{1 - r ^2}{n-2}}\]

When \(\rho = 0\), substituting \(s_r\) into the \(t\) formula gives

\[t=r \sqrt{\frac{n-2}{1-r^{2}}}\]

Action: Calculate the \(t\) score of the hypothesis test about correlation coefficient.

Code
t_score <- r * sqrt((n - 2) / (1 - r ^ 2))
t_critical <- qt(0.975, df = n - 2)
pt(t_score, df = n - 2, lower.tail = FALSE) * 2   # two-sided p-value
  4. Decision: reject \(H_0\) at \(\alpha = 0.05\).
  5. Conclusion: The population correlation coefficient is not zero.
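R performs this test in one step with cor.test(), which reports the same \(t\) score, degrees of freedom, and p-value, plus a confidence interval for \(\rho\):

cor.test(dtf$HEIGHT, dtf$SHOE)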

10.3.4 Fast calculations in R

env_lm <- lm(SHOE ~ HEIGHT, data = dtf)
env_lm

Call:
lm(formula = SHOE ~ HEIGHT, data = dtf)

Coefficients:
(Intercept)       HEIGHT  
    -8.6017       0.2836  
summary(env_lm)

Call:
lm(formula = SHOE ~ HEIGHT, data = dtf)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.7527 -0.8123 -0.0363  1.0535  3.4117 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -8.60167    3.58524  -2.399     0.02 *  
HEIGHT       0.28356    0.02109  13.447   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.467 on 53 degrees of freedom
Multiple R-squared:  0.7733,    Adjusted R-squared:  0.7691 
F-statistic: 180.8 on 1 and 53 DF,  p-value: < 2.2e-16

Example: Use the iris data. Calculate \(r\) between Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width.

Code
i2 <- iris[, -5]   # keep the four numeric columns
icor <- cor(i2)    # correlation matrix
library(beginr)
lmdf(i2)           # pairwise linear regressions between all columns
library(psych)
tb_corr <- corr.test(iris[, 1:4])   # correlation matrix with p-values
tb_corr$r

10.3.5 Visualization

The corrplot package:

library(corrplot)
par(mfrow = c(4, 2))
corrplot(icor)
corrplot(icor, order = "AOE")
corrplot(icor, order = "AOE", method = "number")
corrplot(icor, order = "AOE", method = "ellipse")
corrplot(icor, order = "AOE", method = "shade")
corrplot(icor, order = "AOE", method = "pie")
corrplot(icor, order = "AOE", addCoef.col = "grey")
corrplot(icor, order = "AOE", type = "upper")
corrplot(icor, order = "AOE", type = "lower", method = "ell", 
         diag = FALSE, tl.pos = "n", cl.pos = "n", add = TRUE)

The GGally package:

library(GGally)
ggpairs(dtf[, c("HEIGHT", "SHOE")])

ggpairs(iris, aes(fill = Species, color = Species, alpha = 0.5))

10.4 Readings

  • Applied Environmental Statistics with R - Chapter 11
  • Manga Guide to Statistics - Chapter 6

10.5 Highlights

  • For a scientific question about two numerical variables,
    • Calculate the SLR slope and intercept, and explain them.
    • Calculate the correlation coefficient and the coefficient of determination, and explain them.
    • Carry out hypothesis tests for the overall regression, the slope, the intercept, and the correlation coefficient.
      • State \(H_0\) and \(H_1\).
      • Calculate the critical value for a given \(\alpha\).
      • Calculate the test statistic.
      • Draw a conclusion.