dtf <- data.frame(diet1 = c(90, 95, 100),
diet2 = c(120, 125, 130),
diet3 = c(125, 130, 135))
Analysis of Variance (ANOVA)
Numerical vs. categorical
Dr. Peng Zhao (✉ peng.zhao@xjtlu.edu.cn)
Department of Health and Environmental Sciences
Xi’an Jiaotong-Liverpool University
1 Learning objectives
In this lecture, you will
- Understand the concept of the analysis of variance (ANOVA), and
- Carry out one-way and two-way ANOVA for answering scientific questions.
2 Revisit the t-test
Example: Rats on diets
A biologist studies the weight gain of male lab rats on diets over a 4-week period. Three different diets are applied. The results are shown in the following table.
| diet1 | diet2 | diet3 |
|---|---|---|
| 90 | 120 | 125 |
| 95 | 125 | 130 |
| 100 | 130 | 135 |
- Are the weight gains of the three treatments all equal?
- In other words, do the diets influence the weight gain?
If we run a \(t\)-test between each pair of treatments, the overall type I error rate grows quickly with the number of samples:
dtt <- data.frame(Number_of_Samples = 2:10)
dtt$Number_of_Tests <- choose(dtt$Number_of_Samples, 2)
dtt$alpha_overall <- round(1 - (1 - 0.05)^dtt$Number_of_Tests, 2)
| Number_of_Samples | Number_of_Tests | alpha_overall |
|---|---|---|
| 2 | 1 | 0.05 |
| 3 | 3 | 0.14 |
| 4 | 6 | 0.26 |
| 5 | 10 | 0.40 |
| 6 | 15 | 0.54 |
| 7 | 21 | 0.66 |
| 8 | 28 | 0.76 |
| 9 | 36 | 0.84 |
| 10 | 45 | 0.90 |
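The last column can be reproduced by hand. The sketch below assumes, as the code above implicitly does, that the pairwise tests are independent and that each is run at level \(\alpha\); the symbol \(m\) for the number of tests is introduced here for illustration only.
\[m = \binom{k}{2} = \frac{k(k-1)}{2}, \qquad \alpha_\mathrm{overall} = 1 - (1 - \alpha)^m\]
For the three diets (\(k = 3\), \(\alpha = 0.05\)):
\[m = 3, \qquad \alpha_\mathrm{overall} = 1 - 0.95^3 = 1 - 0.857375 \approx 0.14\]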
Instead of running a \(t\)-test on every pair of means, we can transform the data as:
dtf2 <- stack(dtf)
names(dtf2) <- c("wg", "diet")
| wg | diet |
|---|---|
| 90 | diet1 |
| 95 | diet1 |
| 100 | diet1 |
| 120 | diet2 |
| 125 | diet2 |
| 130 | diet2 |
| 125 | diet3 |
| 130 | diet3 |
| 135 | diet3 |
and analyse the relationship between the weight gain (numerical variable) and the diet treatment (categorical variable).
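As a first look at that relationship, the following minimal sketch summarises the numerical variable within each level of the categorical variable. It rebuilds the rat data from the table above so that it runs on its own; the names `group_mean` and `group_var` are chosen here for illustration.

```r
# Rat weight gains from the example above, in wide format
dtf <- data.frame(diet1 = c(90, 95, 100),
                  diet2 = c(120, 125, 130),
                  diet3 = c(125, 130, 135))

# Long format: one numerical column (wg) and one categorical column (diet)
dtf2 <- stack(dtf)
names(dtf2) <- c("wg", "diet")

# Mean and variance of the weight gain within each diet
group_mean <- tapply(dtf2$wg, dtf2$diet, mean)
group_var  <- tapply(dtf2$wg, dtf2$diet, var)
print(group_mean)
print(group_var)

# Side-by-side boxplots show the same comparison graphically
if (interactive()) boxplot(wg ~ diet, data = dtf2, ylab = "Weight gain")
```

The group means differ noticeably while the spread within each diet is the same, which is the pattern that ANOVA quantifies.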
3 One-way ANOVA
- Analysis of variance (ANOVA):
- One of the most widely used statistical techniques. The test partitions the total variation present in a set of data into two or more components. Associated with each of these components is a specific source of variation, so that it is possible to ascertain the contributions of each of these sources to the total variation.
- One-way ANOVA:
- A test that concerns only one independent variable (\(x\)), which is called a factor and has multiple levels (settings, groups).
- Hypotheses:
- \(H_0: \mu _1 = \mu _2 = \mu _3 = ... = \mu_k\)
- \(H_1\): at least one mean is different from the others
- Reject \(H_0\)? Given \(\alpha\).
- Collect data. Suppose we have \(k\) samples. The \(i\)-th sample has \(n_i\) observations.
| | Level 1 | Level 2 | Level 3 | … | Level \(k\) |
|---|---|---|---|---|---|
| | \(x_{1,1}\) | \(x_{2,1}\) | \(x_{3,1}\) | … | \(x_{k,1}\) |
| | \(x_{1,2}\) | \(x_{2,2}\) | \(x_{3,2}\) | … | \(x_{k,2}\) |
| | \(x_{1,3}\) | \(x_{2,3}\) | \(x_{3,3}\) | … | \(x_{k,3}\) |
| | … | … | … | … | … |
| | \(x_{1,n_1}\) | \(x_{2,n_2}\) | \(x_{3,n_3}\) | … | \(x_{k,n_k}\) |
| Mean | \(\bar x_1\) | \(\bar x_2\) | \(\bar x_3\) | … | \(\bar x_k\) |
- Calculate a test statistic: \(F\)-test.
| Source | \(df\) | \(SS\) | \(MS\) | \(F\) |
|---|---|---|---|---|
| Within samples | \(df_W = n - k\) | \(SS_W = \sum_{i=1}^{k} \sum_{j=1}^{n_i}(x_{ij}-\bar x_i)^2\) | \(MS_W = \frac{SS_W}{df_W}\) | \(F = \frac{MS_B}{MS_W}\) |
| Between samples | \(df_B = k - 1\) | \(SS_B = \sum_{i = 1}^k n_i(\bar x_i - \bar x)^2\) | \(MS_B = \frac {SS_B}{df_B}\) | |
| Total | \(df_T = n - 1\) | \(SS_T = \sum_{i = 1}^k \sum _{j=1}^{n_i} (x_{ij}-\bar x) ^2\) | \(MS_T = \frac{SS_T}{df_T}\) |
- \(n\)
- The total number of the observations. \(n = \sum n_i\)
- \(x_{ij}\)
- The \(j\)-th observation in the \(i\)-th sample
- \(\bar x\)
- Overall/grand mean of all the observations. \(\bar x = \frac{\sum x_{ij}}{n}\)
- \(MS\)
- Mean of squared deviations from the mean
- \(SS\)
- Sum of squared deviations from the mean
- \(df\)
- Degrees of freedom
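If \(H_0\) is true, then any difference you see between the \(k\) samples is due to chance, i.e. \(\sigma^2_\mathrm {between} = \sigma^2_\mathrm {within} = \sigma^2_\mathrm {total}\), so \(MS_B\), \(MS_W\) and \(MS_T\) all estimate the same variance and the ratio \(F = MS_B / MS_W\) should not be much larger than 1. This is the basis of the \(F\)-test.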
From the previous table, we can get:
\[df_T = df_W+ df_B\]
\[SS_T = SS_W + SS_B\]
which can be used as a double check.
- Decision. Reject or not reject \(H_0\).
- Conclusion. Whether the categorical variable has an effect on the numerical variable.
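The calculation step above can be sketched as one small function. This is only an illustration under the formulas in the table: `oneway_table()` is a hypothetical helper (not a base R function), and the data `y` and `g` are made up to show that the samples may have different sizes \(n_i\).

```r
# Hypothetical helper: one-way ANOVA quantities from a response y and a grouping variable g
oneway_table <- function(y, g, alpha = 0.05) {
  g <- factor(g)
  n <- length(y)                      # total number of observations
  k <- nlevels(g)                     # number of samples (levels)
  n_i <- tapply(y, g, length)         # sample sizes
  xbar_i <- tapply(y, g, mean)        # sample means
  xbar <- mean(y)                     # grand mean
  SSW <- sum((y - xbar_i[as.character(g)])^2)
  SSB <- sum(n_i * (xbar_i - xbar)^2)
  SST <- sum((y - xbar)^2)
  stopifnot(isTRUE(all.equal(SST, SSW + SSB)))  # double check: SS_T = SS_W + SS_B
  dfW <- n - k
  dfB <- k - 1
  F_score <- (SSB / dfB) / (SSW / dfW)
  list(df = c(within = dfW, between = dfB, total = n - 1),
       SS = c(within = SSW, between = SSB, total = SST),
       F_score = F_score,
       F_critical = qf(1 - alpha, df1 = dfB, df2 = dfW),
       p_value = pf(F_score, df1 = dfB, df2 = dfW, lower.tail = FALSE))
}

# Made-up, unbalanced data (n_1 = 3, n_2 = 4, n_3 = 2)
y <- c(4, 5, 6, 7, 9, 10, 11, 2, 3)
g <- c("a", "a", "a", "b", "b", "b", "b", "c", "c")
res <- oneway_table(y, g)
print(res)

# Decision: reject H0 if the F score exceeds the critical value
res$F_score > res$F_critical
```

The decision can equivalently be based on `res$p_value < alpha`.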
Example: Rats on diets
Hypotheses and question:
- \(H_0: \mu _1 = \mu _2 = \mu _3\)
- \(H_1\): at least one mean is different from the others
- Reject \(H_0\)?
Collect data:
Calculate a test statistic: \(F\)-test.
Action: Fill in the ANOVA table.
| Source | \(df\) | \(SS\) | \(MS\) | \(F\) |
|---|---|---|---|---|
| Within samples | ||||
| Between samples | ||||
| Total |
Code
# df
(k <- ncol(dtf))
(n <- length(unlist(dtf)))
(dfW <- n - k)
(dfB <- k - 1)
(dfT <- n-1)
# mean
(xbar <- mean(unlist(dtf)))
(xibar <- colMeans(dtf))
# SS
(SSW1 <- sum((dtf$diet1 - xibar[1]) ^ 2))
(SSW2 <- sum((dtf$diet2 - xibar[2]) ^ 2))
(SSW3 <- sum((dtf$diet3 - xibar[3]) ^ 2))
(SSW <- SSW1 + SSW2 + SSW3)
(SSB1 <- length(dtf$diet1) * (xibar[1] - xbar) ^ 2)
(SSB2 <- length(dtf$diet2) * (xibar[2] - xbar) ^ 2)
(SSB3 <- length(dtf$diet3) * (xibar[3] - xbar) ^ 2)
(SSB <- SSB1 + SSB2 + SSB3)
# Double check
(SST <- sum((unlist(dtf) - xbar) ^ 2))
SSW + SSB
SSB / SST # Eta squared (square of the correlation ratio)
# F
(MSW <- SSW / dfW)
(MSB <- SSB / dfB)
(F_score <- MSB / MSW)
(F_critical <- qf(0.95, df1 = dfB, df2 = dfW))
pf(F_score, df1 = dfB, df2 = dfW, lower.tail = FALSE)
| | Diet 1 | Diet 2 | Diet 3 | Total |
|---|---|---|---|---|
| | 90 | 120 | 125 | |
| | 95 | 125 | 130 | |
| | 100 | 130 | 135 | |
| \(\bar x_i\) | 95 | 125 | 130 | |
| \(SS_W\) | 50 | 50 | 50 | 150 |
| \(SS_B\) | 1408.33 | 208.33 | 533.33 | 2150 |

| Source | \(df\) | \(SS\) | \(MS\) | \(F\) |
|---|---|---|---|---|
| Within samples | 6 | 150 | 25 | 43 |
| Between samples | 2 | 2150 | 1075 | |
| Total | 8 | 2300 | | |
- Decision.
Since the F value is 43, which exceeds the critical value of 5.1432528 at the significance level of \(\alpha = 0.05\), we can reject \(H_0\).
One step:
wg_aov <- aov(wg ~ diet, data = dtf2)
summary(wg_aov)
 Df Sum Sq Mean Sq F value Pr(>F)
diet 2 2150 1075 43 0.000277 ***
Residuals 6 150 25
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- Conclusion.
There is a significant effect of diet on the weight gain of male laboratory rats.
The connection between ANOVA and t-test:
Code
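# Keep only diet1 and diet2 and drop the unused factor level
dtf3 <- droplevels(dtf2[1:6, ])
# With equal variances assumed, the two-sample t-test and the one-way ANOVA
# give the same p-value, and F equals t squared
t.test(wg ~ diet, data = dtf3, var.equal = TRUE)
summary(aov(wg ~ diet, data = dtf3))
4 Understand ANOVA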
4.1 The model
\[x_{i j}=\mu+\tau_{j}+\epsilon_{i j} ; \quad i=1,2, \ldots, n_{j} ; \quad j=1,2, \ldots, k\]
- \(\mu\)
- The grand mean. The mean of all \(k\) population means.
- \(\tau_{j}\)
- The treatment effect. The difference between the mean of the \(j\)-th population and the grand mean (\(\mu_j - \mu\)).
- \(\epsilon_{i j}\)
- The error term. The amount by which an individual measurement differs from the mean of the population to which it belongs (\(x_{ij} - \mu_j\)).
\(H_{0}: \mu_{1}=\mu_{2}=\cdots=\mu_{k}\)
\(H_{A}:\) not all \(\mu_{j}\) are equal
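The terms of the model can be estimated from a sample by replacing the population means with sample means. The sketch below does this for the rat data from Section 2; the names `mu_hat`, `tau_hat` and `eps_hat` are introduced here for illustration. Using the overall sample mean as the estimate of \(\mu\) relies on the groups having equal sizes, which is the case in this example.

```r
# Rat data from Section 2, in long format
dtf <- data.frame(diet1 = c(90, 95, 100),
                  diet2 = c(120, 125, 130),
                  diet3 = c(125, 130, 135))
dtf2 <- stack(dtf)
names(dtf2) <- c("wg", "diet")

# Estimate of the grand mean (equal group sizes assumed)
mu_hat <- mean(dtf2$wg)

# Estimated treatment effects: group mean minus grand mean
tau_hat <- tapply(dtf2$wg, dtf2$diet, mean) - mu_hat

# Estimated errors: observation minus its own group mean
eps_hat <- dtf2$wg - (mu_hat + tau_hat[as.character(dtf2$diet)])

print(mu_hat)
print(tau_hat)
print(eps_hat)

# Each observation is recovered as grand mean + treatment effect + error
reconstructed <- as.numeric(mu_hat + tau_hat[as.character(dtf2$diet)] + eps_hat)
```

The estimated treatment effects sum to zero, in line with the constraint on the \(\tau_j\) stated in the assumptions.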
4.2 Assumptions
- The \(k\) sets of observed data constitute \(k\) independent random samples from the respective populations.
- Each of the populations from which the samples come is normally distributed with mean \(\mu_{j}\) and variance \(\sigma_{j}^{2}\).
- Each of the populations has the same variance. That is, \(\sigma_{1}^{2}=\sigma_{2}^{2}=\cdots=\sigma_{k}^{2}=\sigma^{2},\) the common variance.
- The \(\tau_{j}\) are unknown constants and \(\sum \tau_{j}=0\) since the sum of all deviations of the \(\mu_{j}\) from their mean, \(\mu,\) is zero.
- The \(\epsilon_{i j}\) have a mean of \(0,\) since the mean of \(x_{i j}\) is \(\mu_{j}\)
- The \(\epsilon_{i j}\) have a variance equal to the variance of the \(x_{i j},\) since the \(\epsilon_{i j}\) and \(x_{i j}\) differ only by a constant; that is, the error variance is equal to \(\sigma^{2},\) the common variance specified in the assumption above.
- The \(\epsilon_{i j}\) are normally (and independently) distributed.
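The normality and equal-variance assumptions can be examined on the data. The sketch below uses the Shapiro-Wilk test on the residuals and Bartlett's test on the groups; these two tests are one common choice and are not prescribed by this lecture. With only three rats per diet both tests have very little power, so the output is illustrative rather than conclusive.

```r
# Rat data from Section 2, in long format
dtf <- data.frame(diet1 = c(90, 95, 100),
                  diet2 = c(120, 125, 130),
                  diet3 = c(125, 130, 135))
dtf2 <- stack(dtf)
names(dtf2) <- c("wg", "diet")

fit <- aov(wg ~ diet, data = dtf2)

# Normality of the error terms: Shapiro-Wilk test on the residuals
sw <- shapiro.test(residuals(fit))
print(sw)

# Equal variances across the k populations: Bartlett's test
bt <- bartlett.test(wg ~ diet, data = dtf2)
print(bt)
```

A small p-value in either test would cast doubt on the corresponding assumption. Independence of the samples cannot be tested this way; it has to follow from the study design.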
4.3 Estimating the statistics
The population variance \(\sigma^{2}\) may be estimated in two ways:
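- Within samples: \(\hat\sigma^2 = MS_W = \frac{SS_W}{df_W}\). This estimate is valid whether or not \(H_0\) is true.
- Between samples: \(\hat\sigma^2 = MS_B = \frac{SS_B}{df_B}\). This estimate is valid only if \(H_0\) is true; otherwise it tends to overestimate \(\sigma^2\).
Compare the two estimates of the population variance: \(F_\mathrm {score} = MS_B / MS_W\).
- The numerator df: \(k - 1\).
- The denominator df: \(n - k\).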
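That both mean squares estimate \(\sigma^2\) when \(H_0\) is true can be illustrated with a small simulation. Everything in the sketch below (number of groups, group size, common mean, \(\sigma = 2\), number of replicates, seed) is an arbitrary choice made for illustration.

```r
set.seed(1)
k <- 3        # number of samples
n_i <- 5      # observations per sample
sigma <- 2    # common standard deviation, so sigma^2 = 4
g <- factor(rep(seq_len(k), each = n_i))

# One simulated data set under H0: all groups share the same mean
one_run <- function() {
  y <- rnorm(k * n_i, mean = 10, sd = sigma)
  tab <- summary(aov(y ~ g))[[1]]
  tab[["Mean Sq"]]              # c(MS_B, MS_W)
}

ms <- replicate(2000, one_run())
rownames(ms) <- c("MS_B", "MS_W")

# Averages over the replicates: both should be near sigma^2
avg_ms <- rowMeans(ms)
print(avg_ms)
```

Repeating the simulation with different group means would leave the average \(MS_W\) near \(\sigma^2\) but push the average \(MS_B\) above it, which is what a large \(F\) score detects.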
5 Repeated measures
Repeated measures (matched samples, randomized blocks, within subjects):
- \(k\) samples.
- Each sample has \(n\) observations.
| ID | Level 1 | Level 2 | Level 3 | … | Level \(k\) |
|---|---|---|---|---|---|
| 1 | \(x_{1,1}\) | \(x_{2, 1}\) | \(x_{3,1}\) | … | \(x_{k,1}\) |
| 2 | \(x_{1,2}\) | \(x_{2,2}\) | \(x_{3,2}\) | … | \(x_{k,2}\) |
| 3 | \(x_{1,3}\) | \(x_{2, 3}\) | \(x_{3,3}\) | … | \(x_{k, 3}\) |
| . | . | . | . | . | . |
| n | \(x_{1, n}\) | \(x_{2, n}\) | \(x_{3, n}\) | … | \(x_{k, n}\) |
| Source | \(df\) | \(SS\) | \(MS\) | \(F\) |
|---|---|---|---|---|
| Within samples | \(df_W = nk - k\) | \(SS_W = \sum_{i=1}^{k} \sum_{j=1}^{n}(x_{ij}-\bar x_i)^2\) | \(MS_W = \frac{SS_W}{df_W}\) | \(F = \frac{MS_B}{MS_{Wcorr}}\) |
| Between samples | \(df_B = k - 1\) | \(SS_B = \sum_{i = 1}^k n(\bar x_i - \bar x)^2\) | \(MS_B = \frac {SS_B}{df_B}\) | |
| Subjects (Row) | \(df_S = n-1\) | \(SS_S = k\sum _{j=1}^{n} (\bar x_j - \bar x)^2\) | | |
| Within Corrected | \(df_{Wcorr} = df_W- df_S\) | \(SS_{Wcorr} = SS_W-SS_S\) | \(MS_{Wcorr} = \frac{SS_{Wcorr}}{df_{Wcorr}}\) | |
| Total | \(df_T = nk - 1\) | \(SS_T = \sum_{i = 1}^k \sum _{j=1}^{n} (x_{ij}-\bar x) ^2\) | \(MS_T = \frac{SS_T}{df_T}\) |
Demo: Weight-loss program
dtf <- data.frame(
# id = LETTERS[1:10],
before = c(198, 201, 210, 185, 204, 156, 167, 197, 220, 186),
one = c(194, 203, 200, 183, 200, 153, 166, 197, 215, 184),
two = c(191, 200, 192, 180, 195, 150, 167, 195, 209, 179),
three = c(188, 196, 188, 178, 191, 145, 166, 192, 205, 175)
)
rownames(dtf) <- LETTERS[1:10]
| | before | one month | two months | three months |
|---|---|---|---|---|
| A | 198 | 194 | 191 | 188 |
| B | 201 | 203 | 200 | 196 |
| C | 210 | 200 | 192 | 188 |
| D | 185 | 183 | 180 | 178 |
| E | 204 | 200 | 195 | 191 |
| F | 156 | 153 | 150 | 145 |
| G | 167 | 166 | 167 | 166 |
| H | 197 | 197 | 195 | 192 |
| I | 220 | 215 | 209 | 205 |
| J | 186 | 184 | 179 | 175 |
Is the weight-loss program effective?
Action: Follow the steps and fill in the ANOVA table.
Hypotheses and question:
- \(H_0: \mu_0 = \mu_1 = \mu_2 = \mu_3\)
- \(H_1:\) Not \(H_0\)
- Question: Reject \(H_0\)? Given \(\alpha\).
Collect data
Calculate a test statistic.
| Source | \(df\) | \(SS\) | \(MS\) | \(F\) |
|---|---|---|---|---|
| Within samples | ||||
| Between samples | ||||
| Subjects (Row) | ||||
| Within Corrected | ||||
| Total |
dtf <- data.frame(
# id = LETTERS[1:10],
before = c(198, 201, 210, 185, 204, 156, 167, 197, 220, 186),
one = c(194, 203, 200, 183, 200, 153, 166, 197, 215, 184),
two = c(191, 200, 192, 180, 195, 150, 167, 195, 209, 179),
three = c(188, 196, 188, 178, 191, 145, 166, 192, 205, 175)
)
# df
(k <- ncol(dtf))
(n <- nrow(dtf))
(N <- length(unlist(dtf)))
(dfW <- N - k)
(dfB <- k - 1)
(dfT <- N - 1)
(dfS <- n - 1)
(dfWcorr <- dfW-dfS)
# mean
(xbar <- mean(unlist(dtf)))
(xibar <- colMeans(dtf))
(xjbar <- rowMeans(dtf))
# SS
(SSW1 <- sum((dtf$before - xibar[1]) ^ 2))
(SSW2 <- sum((dtf$one - xibar[2]) ^ 2))
(SSW3 <- sum((dtf$two - xibar[3]) ^ 2))
(SSW4 <- sum((dtf$three - xibar[4]) ^ 2))
(SSW <- SSW1 + SSW2 + SSW3 + SSW4)
sum(apply(dtf, 2, function(x) (x - mean(x)) ^2))
(SSB1 <- n * (xibar[1] - xbar) ^ 2)
(SSB2 <- n * (xibar[2] - xbar) ^ 2)
(SSB3 <- n * (xibar[3] - xbar) ^ 2)
(SSB4 <- n * (xibar[4] - xbar) ^ 2)
(SSB <- SSB1 + SSB2 + SSB3 + SSB4)
sum(n * (xibar - xbar) ^ 2)
(SSS <- k * sum((xjbar - xbar) ^ 2))
(SSWcorr <- SSW-SSS)
# Double check
(SST <- sum((unlist(dtf) - xbar) ^ 2))
SSW + SSB
# F
(MSB <- SSB / dfB)
(MSWcorr <- SSWcorr / dfWcorr)
(F_score <- MSB / MSWcorr)
(F_critical <- qf(0.95, df1 = dfB, df2 = dfWcorr))
pf(F_score, df1 = dfB, df2 = dfWcorr, lower.tail = FALSE)
| Source | \(df\) | \(SS\) | \(MS\) | \(F\) |
|---|---|---|---|---|
| Within samples | 36 | 11840.9 | 328.9139 | 24.4851201 |
| Between samples | 3 | 569.075 | 189.6916667 | |
| Subjects (Row) | 9 | 11631.725 | | |
| Within Corrected | 27 | 209.175 | 7.7472222 | |
| Total | 39 | 12409.975 | | |
Decision.
With 3 and 27 degrees of freedom, the critical \(F\) for \(\alpha = 0.05\) is 2.9603513, which is smaller than the calculated \(F\) value 24.4851201. Thus, the decision is to reject \(H_0\).
One step:
dtf2 <- stack(dtf)
names(dtf2) <- c("w", "level")
w_aov <- aov(w ~ level, data = dtf2)
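summary(w_aov)
 Df Sum Sq Mean Sq F value Pr(>F)
level 3 569 189.7 0.577 0.634
Residuals 36 11841 328.9
# The model above ignores that the same ten subjects are measured four times,
# so the between-subject variation inflates the residuals.
# Adding a subject error stratum removes it:
dtf2$subject <- rep(LETTERS[1:10], 4)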
w_aov2 <- aov(w ~ level + Error(subject/level), data = dtf2)
summary(w_aov2)
Error: subject
Df Sum Sq Mean Sq F value Pr(>F)
Residuals 9 11632 1292
Error: subject:level
Df Sum Sq Mean Sq F value Pr(>F)
level 3 569.1 189.69 24.48 7.3e-08 ***
Residuals 27 209.2 7.75
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- Conclusion.
The program has a significant effect on weight loss.
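The hand calculation for this demo can also be written compactly for any table with subjects in rows and repeated measurements in columns. The sketch below follows the formulas of the repeated-measures table above; `rm_anova()` and `weights` are names made up for illustration and are not part of any package.

```r
# Hypothetical helper: rows are subjects, columns are the k repeated measurements
rm_anova <- function(wide) {
  x <- as.matrix(wide)
  n <- nrow(x)                                   # number of subjects
  k <- ncol(x)                                   # number of samples (levels)
  xbar <- mean(x)                                # grand mean
  SSB <- n * sum((colMeans(x) - xbar)^2)         # between samples (columns)
  SSS <- k * sum((rowMeans(x) - xbar)^2)         # between subjects (rows)
  SSW <- sum(sweep(x, 2, colMeans(x))^2)         # within samples
  SSWcorr <- SSW - SSS                           # within, corrected for subjects
  dfB <- k - 1
  dfWcorr <- (n * k - k) - (n - 1)
  F_score <- (SSB / dfB) / (SSWcorr / dfWcorr)
  c(F_score = F_score, df1 = dfB, df2 = dfWcorr,
    p_value = pf(F_score, df1 = dfB, df2 = dfWcorr, lower.tail = FALSE))
}

# Weight-loss data from the demo above
weights <- data.frame(
  before = c(198, 201, 210, 185, 204, 156, 167, 197, 220, 186),
  one    = c(194, 203, 200, 183, 200, 153, 166, 197, 215, 184),
  two    = c(191, 200, 192, 180, 195, 150, 167, 195, 209, 179),
  three  = c(188, 196, 188, 178, 191, 145, 166, 192, 205, 175)
)
res <- rm_anova(weights)
print(res)
```

The \(F\) score and degrees of freedom agree with the `Error: subject:level` stratum of the `aov()` output above.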
6 Two-way ANOVA
Demo: Rats on diets
A biologist studies the weight gain of lab rats depending on diet and gender over a 4-week period. Three different diets are applied. Do diet and gender have an effect on weight gain?
| | Diet 1 | Diet 2 | Diet 3 |
|---|---|---|---|
| Male | 90 | 120 | 125 |
| | 95 | 125 | 130 |
| | 100 | 130 | 135 |
| Female | 75 | 100 | 118 |
| | 78 | 118 | 125 |
| | 90 | 112 | 132 |
Can we apply a one-way ANOVA for diet effect, then another one-way ANOVA for gender effect?
Transform the data frame as:
dtf <- data.frame(
w = c(90,95,100,75,78,90,120,125,130,100,118,112,125,130,135,118,125,132),
diet = rep(c("Diet1", "Diet2", "Diet3"), each = 6),
gender = rep(c("Male", "Female"), each = 3)
)
| w | diet | gender |
|---|---|---|
| 90 | Diet1 | Male |
| 95 | Diet1 | Male |
| 100 | Diet1 | Male |
| 75 | Diet1 | Female |
| 78 | Diet1 | Female |
| 90 | Diet1 | Female |
| 120 | Diet2 | Male |
| 125 | Diet2 | Male |
| 130 | Diet2 | Male |
| 100 | Diet2 | Female |
| 118 | Diet2 | Female |
| 112 | Diet2 | Female |
| 125 | Diet3 | Male |
| 130 | Diet3 | Male |
| 135 | Diet3 | Male |
| 118 | Diet3 | Female |
| 125 | Diet3 | Female |
| 132 | Diet3 | Female |
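Returning to the question of whether two separate one-way ANOVAs would do: the sketch below tests gender alone and compares it with the two-way model `w ~ diet * gender` that is used later in this section. The variable names `p_gender_oneway` and `p_gender_twoway` are introduced here for illustration.

```r
dtf <- data.frame(
  w = c(90, 95, 100, 75, 78, 90, 120, 125, 130, 100, 118, 112,
        125, 130, 135, 118, 125, 132),
  diet = rep(c("Diet1", "Diet2", "Diet3"), each = 6),
  gender = rep(c("Male", "Female"), each = 3)
)

# One-way ANOVA for gender only: the variation caused by diet ends up in the residuals
oneway <- summary(aov(w ~ gender, data = dtf))[[1]]
p_gender_oneway <- oneway[["Pr(>F)"]][1]

# Two-way ANOVA: rows are diet, gender, diet:gender, Residuals
twoway <- summary(aov(w ~ diet * gender, data = dtf))[[1]]
p_gender_twoway <- twoway[["Pr(>F)"]][2]

print(oneway)
print(c(oneway = p_gender_oneway, twoway = p_gender_twoway))
```

In the one-way analysis the gender effect is masked because the large diet effect inflates the residual mean square. Separate one-way analyses also cannot say anything about an interaction between the two factors.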
Interaction (cases A to F):
- A: no significant effect of diet, no significant effect of gender, no interaction.
- B: no significant effect of diet, effect of gender, no interaction.
- C: effect of diet, effect of gender, no interaction.
- D: effect of diet in males, no significant effect of diet in females, effect of gender, interaction (positive).
- E: effect of diet in males, no significant effect of diet in females, effect of gender, interaction (negative).
- F: effect of diet in males, effect of diet in females, effect of gender, interaction (negative).
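For the rat data, a quick way to judge which of these cases applies is to compare the male-female gap in mean weight gain across diets: without interaction the gap is the same at every diet, which appears as parallel lines in an interaction plot. The following is a minimal sketch; `cell_means` and `gap` are names chosen for illustration.

```r
dtf <- data.frame(
  w = c(90, 95, 100, 75, 78, 90, 120, 125, 130, 100, 118, 112,
        125, 130, 135, 118, 125, 132),
  diet = rep(c("Diet1", "Diet2", "Diet3"), each = 6),
  gender = rep(c("Male", "Female"), each = 3)
)

# Mean weight gain in each diet x gender cell
cell_means <- tapply(dtf$w, list(dtf$diet, dtf$gender), mean)
print(cell_means)

# Male minus female gap at each diet; equal gaps would mean no interaction
gap <- cell_means[, "Male"] - cell_means[, "Female"]
print(gap)

# The same information as a plot (only drawn in an interactive session)
if (interactive()) {
  interaction.plot(dtf$diet, dtf$gender, dtf$w,
                   xlab = "Diet", ylab = "Mean weight gain",
                   trace.label = "Gender")
}
```

The gaps are not identical, so the lines are not exactly parallel. Whether the departure is larger than chance would produce is what the interaction row of the ANOVA table tests.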
| | Level 1 | Level 2 | … | Level \(k\) | … | Level \(K\) |
|---|---|---|---|---|---|---|
| Category 1 | \(x_{1,1,1}\) | \(x_{1,2,1}\) | … | \(x_{1,k,1}\) | … | \(x_{1,K,1}\) |
| | \(x_{1,1,2}\) | \(x_{1,2,2}\) | … | \(x_{1,k,2}\) | … | \(x_{1,K,2}\) |
| | … | … | … | … | … | … |
| | \(x_{1,1,j}\) | \(x_{1,2,j}\) | … | \(x_{1,k,j}\) | … | \(x_{1,K,j}\) |
| | … | … | … | … | … | … |
| | \(x_{1,1,J}\) | \(x_{1,2,J}\) | … | \(x_{1,k,J}\) | … | \(x_{1,K,J}\) |
| Category 2 | \(x_{2,1,1}\) | \(x_{2,2,1}\) | … | \(x_{2,k,1}\) | … | \(x_{2,K,1}\) |
| | \(x_{2,1,2}\) | \(x_{2,2,2}\) | … | \(x_{2,k,2}\) | … | \(x_{2,K,2}\) |
| | … | … | … | … | … | … |
| | \(x_{2,1,j}\) | \(x_{2,2,j}\) | … | \(x_{2,k,j}\) | … | \(x_{2,K,j}\) |
| | … | … | … | … | … | … |
| | \(x_{2,1,J}\) | \(x_{2,2,J}\) | … | \(x_{2,k,J}\) | … | \(x_{2,K,J}\) |
| … | … | … | … | … | … | … |
| Category m | \(x_{m,1,1}\) | \(x_{m,2,1}\) | … | \(x_{m,k,1}\) | … | \(x_{m,K,1}\) |
| | \(x_{m,1,2}\) | \(x_{m,2,2}\) | … | \(x_{m,k,2}\) | … | \(x_{m,K,2}\) |
| | … | … | … | … | … | … |
| | \(x_{m,1,j}\) | \(x_{m,2,j}\) | … | \(x_{m,k,j}\) | … | \(x_{m,K,j}\) |
| | … | … | … | … | … | … |
| | \(x_{m,1,J}\) | \(x_{m,2,J}\) | … | \(x_{m,k,J}\) | … | \(x_{m,K,J}\) |
| … | … | … | … | … | … | … |
| Category M | \(x_{M,1,1}\) | \(x_{M,2,1}\) | … | \(x_{M,k,1}\) | … | \(x_{M,K,1}\) |
| | \(x_{M,1,2}\) | \(x_{M,2,2}\) | … | \(x_{M,k,2}\) | … | \(x_{M,K,2}\) |
| | … | … | … | … | … | … |
| | \(x_{M,1,j}\) | \(x_{M,2,j}\) | … | \(x_{M,k,j}\) | … | \(x_{M,K,j}\) |
| | … | … | … | … | … | … |
| | \(x_{M,1,J}\) | \(x_{M,2,J}\) | … | \(x_{M,k,J}\) | … | \(x_{M,K,J}\) |
| Source | \(df\) | \(SS\) | \(MS\) | \(F\) |
|---|---|---|---|---|
| V1 (row) | \(df_r = M-1\) | \(SS_r = KJ\sum_{m = 1}^{M} (\bar x_{m} - \bar x)^2\) | \(MS_r = \frac{SS_r}{df_r}\) | \(F_r = \frac{MS_r}{MS_W}\) |
| V2 (column) | \(df_c = K - 1\) | \(SS_c = MJ\sum_{k = 1}^K (\bar x_k - \bar x)^2\) | \(MS_c = \frac{SS_c}{df_c}\) | \(F_c = \frac{MS_c}{MS_W}\) |
| Interaction | \(df_I=df_r df_c\) | \(SS_I = SS_B-SS_c-SS_r\) | \(MS_I = \frac{SS_I}{df_I}\) | \(F_I = \frac{MS_I}{MS_W}\) |
| Within samples | \(df_W = MK(J-1)\) | \(SS_W = \sum_{m=1}^{M} \sum_{k=1}^{K} \sum_{j=1}^{J}(x_{m,k,j}-\bar x_{m,k})^2\) | \(MS_W = \frac{SS_W}{df_W}\) | |
| Between samples | \(df_B = MK - 1\) | \(SS_B = J\sum_{m = 1}^M \sum_{k=1}^{K}(\bar x_{m,k} - \bar x)^2\) | \(MS_B = \frac {SS_B}{df_B}\) | |
| Total | \(df_T = MKJ - 1\) | \(SS_T = \sum(x-\bar x) ^2\) | \(MS_T = \frac{SS_T}{df_T}\) |
Action: Fill in the ANOVA table on the basis of the following data and draw your conclusion (15 minutes).
| Source | \(df\) | \(SS\) | \(MS\) | \(F\) |
|---|---|---|---|---|
| V1 (row) | ||||
| V2 (column) | ||||
| Interaction | ||||
| Within samples | ||||
| Between samples | ||||
| Total |
dtf <- data.frame(
w = c(90,95,100,75,78,90,120,125,130,100,118,112,125,130,135,118,125,132),
diet = rep(c("Diet1", "Diet2", "Diet3"), each = 6),
gender = rep(c("Male", "Female"), each = 3)
)
# df
(M <- nlevels(as.factor(dtf$gender)))
[1] 2
(K <- nlevels(as.factor(dtf$diet)))
[1] 3
(n <- nrow(dtf))
[1] 18
(J <- n / M / K)
[1] 3
(dfr <- M - 1)
[1] 1
(dfc <- K - 1)
[1] 2
(dfI <- dfr * dfc)
[1] 2
(dfW <- M * K * (J - 1))
[1] 12
(dfB <- M * K - 1)
[1] 5
(dfT <- n - 1)
[1] 17
# mean
(xbar <- mean(dtf$w))
[1] 111
(xmk_bar <- tapply(dtf$w, list(dtf$diet, dtf$gender), mean))
      Female Male
Diet1     81   95
Diet2    110  125
Diet3    125  130
dtf$xmk_bar <- mapply(function(d, g) xmk_bar[d, g], dtf$diet, dtf$gender)
(xm_bar <- colMeans(xmk_bar))
  Female     Male
105.3333 116.6667
(xk_bar <- rowMeans(xmk_bar))
Diet1 Diet2 Diet3
 88.0 117.5 127.5
# SS
(SSB <- J * sum((xmk_bar - xbar) ^ 2))
[1] 5730
(SSW <- sum((dtf$w - dtf$xmk_bar) ^ 2))
[1] 542
(SST <- sum((dtf$w - xbar) ^ 2))
[1] 6272
(SSr <- K * J * sum((xm_bar - xbar) ^ 2))
[1] 578
(SSc <- M * J * sum((xk_bar - xbar) ^ 2))
[1] 5061
(SSI <- SSB - SSc - SSr)
[1] 91
# Double check
SSW + SSB
[1] 6272
SST
[1] 6272
# MS
(MSr <- SSr / dfr)
[1] 578
(MSc <- SSc / dfc)
[1] 2530.5
(MSI <- SSI / dfI)
[1] 45.5
(MSW <- SSW / dfW)
[1] 45.16667
# F
(F_r <- MSr / MSW)
[1] 12.79705
qf(0.95, df1 = dfr, df2 = dfW)
[1] 4.747225
pf(F_r, df1 = dfr, df2 = dfW, lower.tail = FALSE)
[1] 0.0038011
(F_c <- MSc / MSW)
[1] 56.02583
qf(0.95, df1 = dfc, df2 = dfW)
[1] 3.885294
pf(F_c, df1 = dfc, df2 = dfW, lower.tail = FALSE)
[1] 8.193548e-07
(F_I <- MSI / MSW)
[1] 1.00738
qf(0.95, df1 = dfI, df2 = dfW)
[1] 3.885294
pf(F_I, df1 = dfI, df2 = dfW, lower.tail = FALSE)
[1] 0.3940701
| Source | \(df\) | \(SS\) | \(MS\) | \(F\) |
|---|---|---|---|---|
| V1 (row) | 1 | 578 | 578 | 12.797048 |
| V2 (column) | 2 | 5061 | 2530.5 | 56.0258303 |
| Interaction | 2 | 91 | 45.5 | 1.0073801 |
| Within samples | 12 | 542 | 45.1666667 | |
| Between samples | 5 | 5730 | 1146 | |
| Total | 17 | 6272 |
Conclusion:
Both diet and gender have a significant effect on the weight gain of rats, and there is no significant interaction between gender and diet in weight gain.
One-step:
aov_wg <- aov(w ~ diet * gender, data = dtf)
summary(aov_wg)
 Df Sum Sq Mean Sq F value Pr(>F)
diet 2 5061 2530.5 56.026 8.19e-07 ***
gender 1 578 578.0 12.797 0.0038 **
diet:gender 2 91 45.5 1.007 0.3941
Residuals 12 542 45.2
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
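One practical point about the one-step call: `aov()` computes sequential sums of squares, so in an unbalanced design the order of the terms in the formula can change the table. The rat data are balanced (three rats in every diet and gender cell), and the sketch below checks that swapping the two factors leaves the sums of squares unchanged. The object names `fit1`, `fit2`, `ss1` and `ss2` are chosen for illustration.

```r
dtf <- data.frame(
  w = c(90, 95, 100, 75, 78, 90, 120, 125, 130, 100, 118, 112,
        125, 130, 135, 118, 125, 132),
  diet = rep(c("Diet1", "Diet2", "Diet3"), each = 6),
  gender = rep(c("Male", "Female"), each = 3)
)

# Every cell has the same number of observations: the design is balanced
print(table(dtf$diet, dtf$gender))

fit1 <- aov(w ~ diet * gender, data = dtf)
fit2 <- aov(w ~ gender * diet, data = dtf)

# Sum Sq columns; rows are (diet, gender, interaction, residuals) for fit1
# and (gender, diet, interaction, residuals) for fit2
ss1 <- summary(fit1)[[1]][["Sum Sq"]]
ss2 <- summary(fit2)[[1]][["Sum Sq"]]
print(ss1)
print(ss2)

# Same values once the first two rows of fit2 are swapped
all.equal(ss1, ss2[c(2, 1, 3, 4)])
```

The values also match the \(SS\) column of the hand-filled table above.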
7 Readings
- The R Book - Chapter 11
8 Highlights
- Carry out a step-by-step one-way ANOVA (with or without repeated measures) for a scientific question.
- State \(H_0\) and \(H_1\).
- Calculate the critical value for a given \(\alpha\).
- Calculate the test statistic (\(F\) score).
- Draw a conclusion for the scientific question.