ENV221 L12
12 Categorical vs. Categorical
12.1 Learning objectives
In this lecture, you will
- understand the \(\chi^2\) test,
- use the \(\chi^2\) test in scientific research.
12.2 Revisit the frequency table
Example: GAD
A scientist measures the frequency of occurrence of antibodies to the enzyme glutamic acid decarboxylase (GAD) in the plasma of normal control subjects and of subjects with the autoimmune disease stiff-man syndrome. A total of 550 subjects are tested. The results are shown below. Is the occurrence of GAD antibodies associated with the occurrence of stiff-man syndrome?
gad <- read.csv("data/gad.csv")
tb_gad <- table(gad$status, gad$antibody)
tb_gad
status | negative | positive |
---|---|---|
normal | 125 | 55 |
sm | 150 | 220 |
Transform it into a long table:
df_gad <- data.frame(tb_gad)
df_gad
Var1 | Var2 | Freq |
---|---|---|
normal | negative | 125 |
sm | negative | 150 |
normal | positive | 55 |
sm | positive | 220 |
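The conversion also works in the other direction. As a minimal sketch, assuming the long table is typed in by hand rather than read from data/gad.csv, xtabs() rebuilds the wide frequency table from the Freq column (the name tb_wide is illustrative only):

```r
# Long-format frequency table, typed in from the table above
df_gad <- data.frame(
  Var1 = c("normal", "sm", "normal", "sm"),
  Var2 = c("negative", "negative", "positive", "positive"),
  Freq = c(125, 150, 55, 220)
)

# xtabs() sums Freq within each Var1 x Var2 combination
tb_wide <- xtabs(Freq ~ Var1 + Var2, data = df_gad)
tb_wide

# The grand total should match the number of subjects tested
sum(tb_wide)
```

This is handy when only the counts are published and the raw subject-level data are not available.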
12.3 \(\chi^2\) test
12.3.1 \(\chi^2\) Distribution
PDF:
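\[f(x)=\frac{x^{(df / 2)-1} e^{-(x / 2)}}{\Gamma\left(\frac{df}{2}\right) 2^{df / 2}}, \quad x > 0\]
Here \(\Gamma\) is the gamma function; for even \(df\), \(\Gamma\left(\frac{df}{2}\right) = \left(\frac{df}{2}-1\right)!\).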
\[df = (r - 1)(c - 1)\]
- \(r\): number of rows in the contingency table
- \(c\): number of columns in the contingency table
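As a quick check of the density formula above, the sketch below codes it by hand and compares it with R's built-in dchisq(). The function name chisq_pdf and the choice of evaluation points are illustrative assumptions; gamma(df / 2) is used for the denominator because it generalises the factorial \((df/2 - 1)!\) to odd \(df\).

```r
# Hand-written chi-square density; gamma(df / 2) generalises (df / 2 - 1)!
chisq_pdf <- function(x, df) {
  x^(df / 2 - 1) * exp(-x / 2) / (gamma(df / 2) * 2^(df / 2))
}

x <- c(0.5, 1, 2, 5, 10)   # arbitrary positive evaluation points

# Side-by-side comparison with the built-in density for df = 3
cbind(x, manual = chisq_pdf(x, df = 3), builtin = dchisq(x, df = 3))

# TRUE if the two agree up to numerical tolerance
all.equal(chisq_pdf(x, df = 3), dchisq(x, df = 3))
```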
A graph:
df <- c(1:30, 4:10 * 10)   # degrees of freedom to draw
x_lim <- c(0, 10)
y_lim <- c(0, 0.5)
curve(dchisq(x, df = 1), ylab = "Density", xlab = expression(chi^2),
      xlim = x_lim, ylim = y_lim, lwd = 3,
      col = 1, las = 1)
# Thick lines for the first five curves (those in the legend), thin lines for the rest
for (i in 2:length(df)) {
  curve(dchisq(x, df = df[i]), col = i, add = TRUE,
        lwd = c(rep(3, 5), rep(1, length(df) - 5))[i])
}
legend("topright", lty = 1, col = 1:5, legend = 1:5, bty = "n", lwd = 3, title = 'df')
Properties:
- A family of curves.
- Mean = \(df\).
- Variance = \(2\,df\).
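The mean and variance properties can be checked empirically. The following is a rough simulation sketch, not a proof; the seed, the sample size and the choice of 4 degrees of freedom are arbitrary assumptions.

```r
set.seed(221)                   # arbitrary seed, for reproducibility only
k <- 4                          # degrees of freedom (illustrative choice)
draws <- rchisq(1e5, df = k)    # random sample from the chi-square distribution

# The sample mean should be close to k and the sample variance close to 2 * k
c(mean = mean(draws), variance = var(draws))
```

With a large sample the two numbers should land near 4 and 8; the exact digits depend on the seed.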
chi_sq_score <- qchisq(p = 0.05, df = 2, lower.tail = FALSE)
chi_sq_score
[1] 5.9915
pchisq(q = chi_sq_score, df = 2, lower.tail = FALSE)
[1] 0.05
12.3.2 One-category \(\chi^2\) test
\[\chi^{2}=\sum\left[\frac{\left(O_{i}-E_{i}\right)^{2}}{E_{i}}\right]\]
- \(O_i\): observed count in category \(i\).
- \(E_i\): expected count in category \(i\) under \(H_0\).
Demo: Mendelian inheritance
A biologist counted the number of red-, white- and pink-flowered plants resulting after cross-pollination of white and red sweet peas. Mendelian inheritance of this trait predicts that the ratio of red to white to pink should be 1:1:2. The biologist’s results are as follows:
- Red: 72 plants
- White: 63 plants
- Pink: 125 plants
Do the experimental results support this mode of inheritance?
- Hypotheses and question:
- \(H_0\): The ratio of red to white to pink is 1:1:2 (The experiment results follow the theoretical prediction).
- \(H_1\): The ratio of red to white to pink is not 1:1:2 (The experiment results do not follow the theoretical prediction).
- Question: Reject \(H_0\)? Given \(\alpha\).
- Collect data.
dtf <- data.frame(colour = c("red", "white", "pink"),
                  observed = c(72, 63, 125))
sum(dtf$observed)
[1] 260
dtf$expected <- sum(dtf$observed) * c(1, 1, 2) / 4
Action: Fill in the following table.
colour | observed | expected | \(O - E\) | \((O - E)^2\) |
---|---|---|---|---|
red | 72 | 65 | | |
white | 63 | 65 | | |
pink | 125 | 130 | | |
Results:
colour | observed | expected | \(O - E\) | \((O - E)^2\) |
---|---|---|---|---|
red | 72 | 65 | 7 | 49 |
white | 63 | 65 | -2 | 4 |
pink | 125 | 130 | -5 | 25 |
- Calculate a test statistic: \(\chi ^2\)
Results:
(chi_sq_score <- sum((dtf$observed - dtf$expected)^2 / dtf$expected))
[1] 1.0077
(df <- nrow(dtf) - 1)
[1] 2
[1] 2
qchisq(0.05, df, lower.tail = FALSE)
[1] 5.9915
pchisq(chi_sq_score, df, lower.tail = FALSE)
[1] 0.6042
- Decision.
As \(\chi^2_\mathrm{score} = 1.0077 < \chi^2_\mathrm{critical} = 5.9915\) (equivalently, \(p = 0.6042 > \alpha = 0.05\)), we cannot reject \(H_0\).
- Conclusion.
The observed counts do not differ significantly from the 1:1:2 ratio of red to white to pink, so the experimental results are consistent with this mode of inheritance.
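The manual steps of this demo can be cross-checked in one call. As a sketch, chisq.test() performs a goodness-of-fit test when it is given a vector of counts and the hypothesised probabilities through its p argument; the object names observed and gof are illustrative.

```r
# Observed counts from the sweet pea experiment
observed <- c(red = 72, white = 63, pink = 125)

# Goodness-of-fit test against the 1:1:2 ratio predicted under H0
gof <- chisq.test(observed, p = c(1, 1, 2) / 4)

gof$statistic   # chi-square score
gof$parameter   # degrees of freedom
gof$p.value     # p-value
gof$expected    # expected counts under H0
```

The statistic, degrees of freedom and p-value should agree with the hand calculation (1.0077, 2 and 0.6042).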
12.3.3 Multiple-category \(\chi ^2\) test
When the data contain two or more samples or multiple categories, the data are arranged in contingency tables, and a \(\chi ^2\) test can be used to test for association between the variables.
\[E = \frac{\sum _ \mathrm{row}}{\sum _ \mathrm{grand}} \times \frac{\sum _ \mathrm{column}}{\sum _ \mathrm{grand}} \times \sum_\mathrm{grand}=\frac{\sum _ \mathrm{row} \times \sum _ \mathrm{column}}{\sum _ \mathrm{grand}} \]
Example: GAD
A scientist measures the frequency of occurrence of antibodies to the enzyme glutamic acid decarboxylase (GAD) in the plasma of normal control subjects and of subjects with the autoimmune disease stiff-man syndrome. Is the occurrence of GAD antibodies associated with the occurrence of stiff-man syndrome?
- Hypotheses and question:
- \(H_0\): No association between the occurrence of GAD antibodies and the occurrence of stiff-man syndrome.
- \(H_1\): Association between the occurrence of GAD antibodies and the occurrence of stiff-man syndrome.
- Question: Reject \(H_0\)? Given \(\alpha\).
- Collect data.
- Calculate a test statistic.
Action: Fill in the following tables and calculate \(\chi ^2\).
status | observed negative | observed positive | sum |
---|---|---|---|
normal | | | |
sm syndrome | | | |
sum | | | |

status | expected negative | expected positive |
---|---|---|
normal | | |
sm syndrome | | |
Results:
(tb_gad <- table(gad$status, gad$antibody))
# Expected count for each cell = row total / grand total * column total
tb_gad2 <- matrix(nrow = 2, ncol = 2)
tb_gad2[1, 1] <- sum(tb_gad[1, ]) / sum(tb_gad) * sum(tb_gad[, 1])
tb_gad2[1, 2] <- sum(tb_gad[1, ]) / sum(tb_gad) * sum(tb_gad[, 2])
tb_gad2[2, 1] <- sum(tb_gad[2, ]) / sum(tb_gad) * sum(tb_gad[, 1])
tb_gad2[2, 2] <- sum(tb_gad[2, ]) / sum(tb_gad) * sum(tb_gad[, 2])
tb_gad2
(chi_sq <- sum((tb_gad2 - tb_gad) ^ 2 / tb_gad2))
sqrt(chi_sq / (sum(tb_gad) * (min(dim(tb_gad)) - 1))) # Cramer's coefficient
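The cell-by-cell assignments above can be condensed. The sketch below is one possible shortcut, assuming the counts are typed in as a matrix instead of being read from data/gad.csv; outer() multiplies every row total by every column total, which matches the expected-count formula. The names tb, expected and cramer_v are illustrative.

```r
# Observed counts, entered column by column (negative first, then positive)
tb <- matrix(c(125, 150, 55, 220), nrow = 2,
             dimnames = list(status = c("normal", "sm"),
                             antibody = c("negative", "positive")))

# Expected counts: (row total * column total) / grand total, for all cells at once
expected <- outer(rowSums(tb), colSums(tb)) / sum(tb)
expected

# Uncorrected chi-square statistic and Cramer's coefficient
chi_sq <- sum((tb - expected)^2 / expected)
cramer_v <- sqrt(chi_sq / (sum(tb) * (min(dim(tb)) - 1)))
c(chi_sq = chi_sq, cramer_v = cramer_v)
```

The same two lines work unchanged for tables with more than two rows or columns, which the explicit indexing does not.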
One-step:
chisq.test(tb_gad)
chit <- chisq.test(tb_gad)
str(chit)
chit$statistic
Why does this statistic differ from the hand-calculated chi_sq?
12.3.4 Yates’ correction
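For \(2\times2\) contingency tables, especially with small samples, Yates' continuity correction subtracts 0.5 from each \(|O_i - E_i|\) before squaring. chisq.test() applies it by default to \(2\times2\) tables (correct = TRUE), which is why its statistic is smaller than the hand-calculated one: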
\[\chi^{2}=\sum\frac{(|O_{i}-E_{i}| - 0.5)^{2}}{E_{i}}\]
(chi_sq2 <- sum((abs(tb_gad2 - tb_gad) - 0.5) ^ 2 / tb_gad2))
[1] 39.318
pchisq(chi_sq2, 1, lower.tail = FALSE)
[1] 3.6019e-10
- Decision.
For \(df = (2 - 1)\times(2 - 1) = 1\), the critical value from the \(\chi^2\) table is \(\chi^2 (\alpha=0.05, df=1) = 3.841\), which is much smaller than our calculated \(\chi^2 = 39.318\). Thus, we reject \(H_0\).
- Conclusion.
There is a significant association between the occurrence of GAD antibodies and the occurrence of stiff-man syndrome.
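The effect of the correction can be seen directly by running chisq.test() with and without it. This is a sketch that assumes the counts are typed in as a matrix; the names tb, yates and plain are illustrative. The correct argument of chisq.test() switches the continuity correction on (the default for \(2\times2\) tables) or off.

```r
# Observed counts, entered column by column (negative first, then positive)
tb <- matrix(c(125, 150, 55, 220), nrow = 2,
             dimnames = list(status = c("normal", "sm"),
                             antibody = c("negative", "positive")))

yates <- chisq.test(tb)                    # correct = TRUE is the default
plain <- chisq.test(tb, correct = FALSE)   # no continuity correction

# Corrected vs. uncorrected chi-square statistic
c(yates = unname(yates$statistic), plain = unname(plain$statistic))

# Critical value for alpha = 0.05 and df = 1
qchisq(0.05, df = 1, lower.tail = FALSE)
```

Both statistics are far above the critical value, so the decision to reject \(H_0\) does not depend on the correction in this example.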
12.4 Readings
- Elementary Statistics, Chapter 10.1 and 10.2
12.5 Highlights
- Carry out a step-by-step \(\chi^2\)-test for a scientific question.
- State \(H_0\) and \(H_1\).
- Calculate the critical value for a given \(\alpha\).
- Calculate the test statistic (\(\chi^2\) score).
- Draw a conclusion for the scientific question.