ENV221 L12

Author

Peng Zhao

12 Categorical vs. Categorical

12.1 Learning objectives

In this lecture, you will

  1. understand the \(\chi^2\) test,
  2. use the \(\chi^2\) test in scientific research.

12.2 Revisit the frequency table

Example: GAD

A scientist measures the frequency of occurrence of antibodies to the enzyme glutamic acid decarboxylase (GAD) in the plasma of normal control subjects and of subjects with the autoimmune disease stiff-man syndrome. A total of 550 subjects are tested. The results are shown below. Is the occurrence of GAD antibodies associated with the occurrence of stiff-man syndrome?

gad <- read.csv("data/gad.csv")
tb_gad <- table(gad$status, gad$antibody)
tb_gad
         negative positive
  normal      125       55
  sm          150      220

Transform it into a long table:

df_gad <- data.frame(tb_gad)
Var1    Var2      Freq
normal  negative   125
sm      negative   150
normal  positive    55
sm      positive   220
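
The file data/gad.csv is not reproduced here, so the following is a minimal sketch that rebuilds the same table directly from the four counts printed above. The object names `tb`, `long` and `wide` are illustrative and not part of the lecture code; `xtabs()` is shown as one possible way to get from the long table back to the contingency table.

```r
# Counts copied from the frequency table above (no CSV file needed).
# matrix() fills column by column: first the "negative" column, then "positive".
tb <- matrix(c(125, 150, 55, 220), nrow = 2,
             dimnames = list(status   = c("normal", "sm"),
                             antibody = c("negative", "positive")))
tb <- as.table(tb)

# Wide -> long: one row per cell, with the count in the Freq column
long <- as.data.frame(tb)
long

# Long -> wide: rebuild the contingency table from the long data frame
wide <- xtabs(Freq ~ status + antibody, data = long)
wide
```

Naming the dimensions (`status`, `antibody`) gives the long table meaningful column names instead of the default `Var1` and `Var2`.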

12.3 \(\chi^2\) test

12.3.1 \(\chi^2\) Distribution

PDF:

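\[f(x)=\frac{x^{(df / 2)-1} e^{-(x / 2)}}{\Gamma\left(\frac{df}{2}\right) 2^{df / 2}}, \quad x > 0\]

where \(\Gamma(df/2) = (df/2 - 1)!\) when \(df\) is even.

For an \(r \times c\) contingency table:

\[df = (r - 1)(c - 1)\]

\(r\):
Number of rows
\(c\):
Number of columns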
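
The density formula can be checked numerically against R's built-in `dchisq()`. The sketch below is only an illustration: the helper name `chisq_pdf` is made up, and it uses `gamma(df / 2)` for the normalising constant, which equals \((df/2 - 1)!\) for even \(df\) and extends it to odd \(df\).

```r
# Hand-written chi-squared density (hypothetical helper, for illustration only)
chisq_pdf <- function(x, df) {
  x^(df / 2 - 1) * exp(-x / 2) / (gamma(df / 2) * 2^(df / 2))
}

x <- c(0.5, 1, 2, 5, 10)

# Compare with the built-in density for an odd and an even df
all.equal(chisq_pdf(x, df = 3), dchisq(x, df = 3))
all.equal(chisq_pdf(x, df = 4), dchisq(x, df = 4))
```

Both comparisons should return `TRUE`, which suggests the formula and `dchisq()` describe the same family of curves.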

A graph:

df <- c(1:30, 4:10 * 10)
x_lim <- c(0, 10)
y_lim <- c(0, 0.5)
curve(dchisq(x, df = 1), ylab = "Density", xlab = expression(chi^2), 
      xlim = x_lim, ylim = y_lim, lwd = 3,
      col = 1, las = 1)
for (i in 2:length(df)) {
  # Thick lines for df = 1 to 5 (the curves in the legend), thin lines for the rest
  curve(dchisq(x, df = df[i]), col = i, add = TRUE, 
        lwd = c(rep(3, 5), rep(1, length(df) - 5))[i])
}
legend("topright", lty = 1, col = 1:5, legend = 1:5, bty = "n", lwd = 3, title = 'df')

Properties:

  • A family of curves.
  • Mean = \(df\).
  • Variance = \(2 df\).
chi_sq_score <- qchisq(p = 0.05, df = 2, lower.tail = FALSE)
chi_sq_score
[1] 5.9915
pchisq(q = chi_sq_score, df = 2, lower.tail = FALSE)
[1] 0.05
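
The mean and variance properties can be verified numerically. The following is a rough sketch, assuming an arbitrary example value of 4 degrees of freedom (the name `k` is illustrative); it integrates against the density with `integrate()` rather than deriving the results analytically.

```r
k <- 4  # example degrees of freedom (arbitrary choice)

# Mean: integral of x * f(x) over (0, Inf)
m <- integrate(function(x) x * dchisq(x, df = k),
               lower = 0, upper = Inf)$value

# Variance: integral of (x - mean)^2 * f(x) over (0, Inf)
v <- integrate(function(x) (x - k)^2 * dchisq(x, df = k),
               lower = 0, upper = Inf)$value

c(mean = m, variance = v)  # approximately k and 2 * k
```

Changing `k` to other values should give a mean close to `k` and a variance close to `2 * k`, up to numerical integration error.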

12.3.2 One-category \(\chi^2\) test

\[\chi^{2}=\sum\left[\frac{\left(O_{i}-E_{i}\right)^{2}}{E_{i}}\right]\]

\(O_i\):
Observed count in category \(i\).
\(E_i\):
Expected count in category \(i\) under \(H_0\).

Demo: Mendelian inheritance

A biologist counted the number of red-, white- and pink-flowered plants resulting after cross-pollination of white and red sweet peas. Mendelian inheritance of this trait predicts that the ratio of red to white to pink should be 1:1:2. The biologist’s results are as follows:

  • Red: 72 plants
  • White: 63 plants
  • Pink: 125 plants

Do the experimental results support this mode of inheritance?

  1. Hypotheses and question:
  • \(H_0\): The ratio of red to white to pink is 1:1:2 (The experiment results follow the theoretical prediction).
  • \(H_1\): The ratio of red to white to pink is not 1:1:2 (The experiment results do not follow the theoretical prediction).
  • Question: Reject \(H_0\)? Given \(\alpha\).
  2. Collect data.
dtf <- data.frame(colour = c("red", "white", "pink"),
                  observed = c(72, 63, 125))
sum(dtf$observed)
[1] 260
dtf$expected <- sum(dtf$observed) * c(1, 1, 2) / 4

Action: Fill in the following table.

colour  observed  expected  O - E  O - E square
red           72        65
white         63        65
pink         125       130

Results:

colour  observed  expected  O - E  O - E square
red           72        65      7            49
white         63        65     -2             4
pink         125       130     -5            25
  3. Calculate a test statistic: \(\chi ^2\)

Results:

dtf$`O - E` <- dtf$observed - dtf$expected
dtf$`O - E square` <- dtf$`O - E` ^ 2
(chi_sq_score <- sum(dtf$`O - E square` / dtf$expected))
[1] 1.0077
(df <- nrow(dtf) - 1)
[1] 2
qchisq(0.05, df, lower.tail = FALSE)  # critical value for alpha = 0.05
[1] 5.9915
pchisq(chi_sq_score, df, lower.tail = FALSE)  # p-value
[1] 0.6042
  4. Decision.

As \(\chi ^2 _\mathrm{score} = 1.008 \le \chi ^2_\mathrm{critical} = 5.991\) (equivalently, \(p = 0.604 > \alpha = 0.05\)), we cannot reject \(H_0\).

  5. Conclusion.

The data are consistent with a red : white : pink ratio of 1:1:2, so the experimental results do not contradict this mode of inheritance.
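
The hand calculation can be cross-checked with `chisq.test()`, which performs a goodness-of-fit test when it is given a vector of counts and the hypothesised proportions `p`. This is a minimal sketch; the object names `observed` and `gof` are illustrative.

```r
# Observed counts from the sweet pea experiment
observed <- c(red = 72, white = 63, pink = 125)

# Goodness-of-fit test against the 1:1:2 ratio predicted under H0
gof <- chisq.test(x = observed, p = c(1, 1, 2) / 4)

gof$expected   # expected counts: 65, 65, 130
gof$statistic  # chi-squared statistic, about 1.0077
gof$parameter  # degrees of freedom: 2
gof$p.value    # about 0.6042
```

The statistic, degrees of freedom and p-value should agree with the step-by-step results above.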

12.3.3 Multiple-category \(\chi ^2\) test

When the data contain two or more samples or multiple categories, the data are arranged in contingency tables, and a \(\chi ^2\) test can be used to test for association between the variables.

\[E = \frac{\sum _ \mathrm{row}}{\sum _ \mathrm{grand}} \times \frac{\sum _ \mathrm{column}}{\sum _ \mathrm{grand}} \times \sum_\mathrm{grand}=\frac{\sum _ \mathrm{row} \times \sum _ \mathrm{column}}{\sum _ \mathrm{grand}} \]

Example: GAD

A scientist measures the frequency of occurrence of antibodies to the enzyme glutamic acid decarboxylase (GAD) in the plasma of normal control subjects and of subjects with the autoimmune disease stiff-man syndrome. Is the occurrence of GAD antibodies associated with the occurrence of stiff-man syndrome?

  1. Hypotheses and question:
  • \(H_0\): No association between the occurrence of GAD antibodies and the occurrence of stiff-man syndrome.
  • \(H_1\): Association between the occurrence of GAD antibodies and the occurrence of stiff-man syndrome.
  • Question: Reject \(H_0\)? Given \(\alpha\).
  2. Collect data.
  3. Calculate a test statistic.

Action: Fill in the following tables and calculate \(\chi ^2\).

              observed negative  observed positive  sum
normal
sm syndrome
sum

              expected negative  expected positive
normal
sm syndrome

Results:
(tb_gad <- table(gad$status, gad$antibody))  # observed counts
# Expected count per cell: row total / grand total * column total
tb_gad2 <- matrix(nrow = 2, ncol = 2)
tb_gad2[1, 1] <- sum(tb_gad[1, ]) / sum(tb_gad) * sum(tb_gad[, 1])
tb_gad2[1, 2] <- sum(tb_gad[1, ]) / sum(tb_gad) * sum(tb_gad[, 2])
tb_gad2[2, 1] <- sum(tb_gad[2, ]) / sum(tb_gad) * sum(tb_gad[, 1])
tb_gad2[2, 2] <- sum(tb_gad[2, ]) / sum(tb_gad) * sum(tb_gad[, 2])
tb_gad2
# Test statistic: sum of (O - E)^2 / E over all cells
(chi_sq <- sum((tb_gad2 - tb_gad) ^ 2 / tb_gad2))

sqrt(chi_sq / (sum(tb_gad) * (min(dim(tb_gad)) - 1))) # Cramér's V (effect size)
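
The four cell-by-cell assignments above can be condensed into a single call. The sketch below is one possible alternative, not the lecture's own method: it rebuilds the observed table from the counts in Section 12.2 (so it runs without data/gad.csv) and uses `outer()` on the row and column totals. The names `observed`, `expected` and `chi_sq_outer` are illustrative.

```r
# Observed counts from Section 12.2, filled column by column
observed <- matrix(c(125, 150, 55, 220), nrow = 2,
                   dimnames = list(c("normal", "sm"),
                                   c("negative", "positive")))

# Expected counts: (row total * column total) / grand total, for every cell at once
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
expected

# Uncorrected chi-squared statistic
chi_sq_outer <- sum((observed - expected)^2 / expected)
chi_sq_outer  # about 40.4655
```

Because `outer()` works on whole vectors of totals, the same two lines would also cover tables with more than two rows or columns.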

One-step:

chisq.test(tb_gad)
chit <- chisq.test(tb_gad)
str(chit)
chit$statistic

Why does chit$statistic differ from the hand-calculated chi_sq?

12.3.4 Yates’ correction

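For \(2\times2\) contingency tables, particularly with small samples, Yates’ continuity correction reduces each \(|O_i - E_i|\) by 0.5 before squaring. chisq.test() applies this correction to \(2\times2\) tables by default (correct = TRUE):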

\[\chi^{2}=\sum\frac{(|O_{i}-E_{i}| - 0.5)^{2}}{E_{i}}\]

(chi_sq2 <- sum((abs(tb_gad2 - tb_gad) - 0.5) ^ 2 / tb_gad2))
[1] 39.318
pchisq(chi_sq2, 1, lower.tail = FALSE)
[1] 3.6019e-10
  4. Decision.

With \(df = (2 -1)\times(2-1) = 1\), the critical value is \(\chi^2 (\alpha=0.05, df=1) = 3.841\), which is much smaller than our calculated \(\chi^2 = 39.318\). Thus, we reject \(H_0\).

  5. Conclusion.

There is a significant association between the occurrence of GAD antibodies and the occurrence of stiff-man syndrome.
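
The effect of the correction can be seen by running `chisq.test()` with and without it. This is a minimal sketch that rebuilds the table from the counts in Section 12.2 instead of reading data/gad.csv; the names `tb`, `uncorrected`, `corrected` and `critical` are illustrative.

```r
# Observed counts from Section 12.2, filled column by column
tb <- matrix(c(125, 150, 55, 220), nrow = 2,
             dimnames = list(c("normal", "sm"),
                             c("negative", "positive")))

uncorrected <- chisq.test(tb, correct = FALSE)  # plain Pearson statistic
corrected   <- chisq.test(tb)                   # Yates' correction (default for 2 x 2)

uncorrected$statistic  # about 40.4655
corrected$statistic    # about 39.3176

# Critical value for alpha = 0.05 and df = 1
critical <- qchisq(0.05, df = 1, lower.tail = FALSE)
corrected$statistic > critical  # TRUE, so H0 is rejected
```

With samples this large the correction changes the statistic only slightly, and the decision is the same either way.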

12.4 Readings

  • Elementary Statistics, Chapter 10.1 and 10.2

12.5 Highlights

  • Carry out a step-by-step \(\chi^2\)-test for a scientific question.
    • State \(H_0\) and \(H_1\).
    • Calculate the critical value for a given \(\alpha\).
    • Calculate the test statistic (\(\chi^2\) score).
    • Draw a conclusion for the scientific question.