ENV221 L12

Author

Peng Zhao

12 Categorical vs. Categorical

12.1 Learning objectives

In this lecture, you will

understand the \(\chi^2\) test,
use the \(\chi^2\) test in scientific research.

12.2 Revisit the frequency table

Example: GAD

A scientist measures the frequency of occurrence of antibodies to the enzyme glutamic acid decarboxylase (GAD) in the plasma of normal control subjects and of subjects with the autoimmune disease stiff-man syndrome. A total of 550 subjects are tested. The results are shown below. Is the occurrence of GAD antibodies associated with the occurrence of stiff-man syndrome?

gad <- read.csv("data/gad.csv")
tb_gad <- table(gad$status, gad$antibody)

tb_gad

	negative	positive
normal	125	55
sm	150	220

Transform it into a long table:

df_gad <- data.frame(tb_gad)

Var1	Var2	Freq
normal	negative	125
sm	negative	150
normal	positive	55
sm	positive	220
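
If `data/gad.csv` is not at hand, the same table can be rebuilt from the counts printed above. The following is a minimal sketch: the object names (`tb_demo`, `long_demo`, `tb_back`) are illustrative and not part of the lecture script. It also shows the way back from the long table to the frequency table.

```r
# Rebuild the 2 x 2 frequency table from the printed counts
# (object names here are illustrative, not from the lecture script)
tb_demo <- as.table(matrix(c(125, 150, 55, 220), nrow = 2,
                           dimnames = list(status = c("normal", "sm"),
                                           antibody = c("negative", "positive"))))

# Add row and column totals
addmargins(tb_demo)

# Wide -> long: one row per cell, counts in the Freq column
long_demo <- as.data.frame(tb_demo)
long_demo

# Long -> wide: xtabs() sums Freq within each status/antibody combination
tb_back <- xtabs(Freq ~ status + antibody, data = long_demo)
tb_back
```

The grand total returned by `addmargins()` should match the 550 subjects mentioned in the example.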

12.3 \(\chi^2\) test

12.3.1 \(\chi^2\) Distribution

PDF:

\[f(x)=\frac{x^{(df / 2)-1} e^{-x / 2}}{\Gamma\left(\frac{df}{2}\right) 2^{df / 2}}, \quad x > 0\]

\[df = (r - 1)(c - 1)\]

\(r\): number of rows in the contingency table
\(c\): number of columns in the contingency table

A graph:


df <- c(1:30, 4:10 * 10)  # degrees of freedom to draw
x_lim <- c(0, 10)
y_lim <- c(0, 0.5)
# The first curve (df = 1) sets up the axes
curve(dchisq(x, df = 1), ylab = "Density", xlab = expression(chi^2),
      xlim = x_lim, ylim = y_lim, lwd = 3,
      col = 1, las = 1)
# Thick lines for df = 1 to 5 (shown in the legend), thin lines for the rest
line_widths <- c(rep(3, 5), rep(1, length(df) - 5))
for (i in 2:length(df)) {
  curve(dchisq(x, df = df[i]), col = i, add = TRUE, lwd = line_widths[i])
}
legend("topright", lty = 1, col = 1:5, legend = 1:5, bty = "n", lwd = 3, title = 'df')

Properties:

A family of curves.
Mean = \(df\).
Variance = \(2 df\).
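
These properties can be checked numerically. The sketch below is only an illustration: it picks an arbitrary value `k = 4` for the degrees of freedom, integrates the density with `integrate()`, and compares a hand-written PDF with R's built-in `dchisq()`. The names `k`, `m`, `v` and `pdf_manual` are made up for this example. The denominator uses `gamma(df / 2)`, which equals \((df/2 - 1)!\) whenever \(df/2\) is a positive integer.

```r
k <- 4  # arbitrary degrees of freedom chosen for the check

# Mean: the integral of x * f(x) over (0, Inf) should be close to k
m <- integrate(function(x) x * dchisq(x, df = k), 0, Inf)$value

# Variance: the integral of (x - mean)^2 * f(x) should be close to 2 * k
v <- integrate(function(x) (x - m)^2 * dchisq(x, df = k), 0, Inf)$value

c(mean = m, variance = v)

# The PDF written out by hand, with gamma() in the denominator
pdf_manual <- function(x, df) {
  x^(df / 2 - 1) * exp(-x / 2) / (gamma(df / 2) * 2^(df / 2))
}

# Should return TRUE if the hand-written PDF agrees with dchisq()
all.equal(pdf_manual(3, k), dchisq(3, df = k))
```

Changing `k` to an odd number is a quick way to see that the gamma form still agrees with `dchisq()` when \(df/2\) is not an integer.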

chi_sq_score <- qchisq(p = 0.05, df = 2, lower.tail = FALSE)
chi_sq_score

[1] 5.9915

pchisq(q = chi_sq_score, df = 2, lower.tail = FALSE)

[1] 0.05

12.3.2 One-category \(\chi^2\) test

\[\chi^{2}=\sum\left[\frac{\left(O_{i}-E_{i}\right)^{2}}{E_{i}}\right]\]

\(O_i\): observed count in category \(i\).
\(E_i\): expected count in category \(i\) under \(H_0\).

Demo: Mendelian inheritance

A biologist counted the number of red-, white- and pink-flowered plants resulting after cross-pollination of white and red sweet peas. Mendelian inheritance of this trait predicts that the ratio of red to white to pink should be 1:1:2. The biologist’s results are as follows:

Red: 72 plants
White: 63 plants
Pink: 125 plants

Do the experimental results support this mode of inheritance?

Hypotheses and question:

\(H_0\): The ratio of red to white to pink is 1:1:2 (the experimental results follow the theoretical prediction).
\(H_1\): The ratio of red to white to pink is not 1:1:2 (the experimental results do not follow the theoretical prediction).
Question: Can we reject \(H_0\) at a given significance level \(\alpha\)?

Collect data.

dtf <- data.frame(colour = c("red", "white", "pink"),
                  observed = c(72, 63, 125))
sum(dtf$observed)

[1] 260

dtf$expected <- sum(dtf$observed) * c(1, 1, 2) / 4

Action: Fill in the following table.

colour	observed	expected
red	72	65
white	63	65
pink	125	130


colour	observed	expected	O - E	O - E square
red	72	65	7	49
white	63	65	-2	4
pink	125	130	-5	25

Calculate a test statistic: \(\chi ^2\)


dtf$`O - E square` <- (dtf$observed - dtf$expected) ^ 2  # squared deviations
(chi_sq_score <- sum(dtf$`O - E square` / dtf$expected))

[1] 1.0077

(df <- nrow(dtf) - 1)

[1] 2

qchisq(0.05, df, lower.tail = FALSE)

[1] 5.9915

pchisq(chi_sq_score, df, lower.tail = FALSE)

[1] 0.6042

Decision.

As \(\chi^2_\mathrm{score} = 1.0077 \le \chi^2_\mathrm{critical} = 5.9915\) (equivalently, \(p = 0.6042 > \alpha = 0.05\)), we cannot reject \(H_0\).

Conclusion.

The observed counts do not differ significantly from the 1:1:2 ratio of red to white to pink. The experimental results are consistent with this mode of inheritance.
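
The manual steps above can be cross-checked in one call with `chisq.test()`, whose `p` argument takes the proportions expected under \(H_0\). This is a sketch for comparison only; the object names `observed` and `gof` are illustrative.

```r
# Observed counts from the sweet pea experiment
observed <- c(red = 72, white = 63, pink = 125)

# p gives the proportions expected under H0 (1:1:2), in the same order
gof <- chisq.test(x = observed, p = c(1, 1, 2) / 4)

gof$expected   # expected counts under H0
gof$statistic  # chi-squared score
gof$parameter  # degrees of freedom
gof$p.value    # p-value of the test
```

The statistic, degrees of freedom and p-value should agree with the values obtained step by step.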

12.3.3 Multiple-category \(\chi ^2\) test

When the data contain two or more samples or multiple categories, the data are arranged in contingency tables, and a \(\chi ^2\) test can be used to test for association between the variables.

\[E = \frac{\sum _ \mathrm{row}}{\sum _ \mathrm{grand}} \times \frac{\sum _ \mathrm{column}}{\sum _ \mathrm{grand}} \times \sum_\mathrm{grand}=\frac{\sum _ \mathrm{row} \times \sum _ \mathrm{column}}{\sum _ \mathrm{grand}} \]

Example: GAD

A scientist measures the frequency of occurrence of antibodies to the enzyme glutamic acid decarboxylase (GAD) in the plasma of normal control subjects and of subjects with the autoimmune disease stiff-man syndrome. Is the occurrence of GAD antibodies associated with the occurrence of stiff-man syndrome?

Hypotheses and question:

\(H_0\): No association between the occurrence of GAD antibodies and the occurrence of stiff-man syndrome.
\(H_1\): Association between the occurrence of GAD antibodies and the occurrence of stiff-man syndrome.
Question: Can we reject \(H_0\) at a given significance level \(\alpha\)?

Collect data.

Calculate a test statistic.

Action: Fill in the following tables and calculate \(\chi ^2\).

	observed negative	observed positive	sum
normal
sm syndrome
sum

	expected negative	expected positive
normal
sm syndrome


(tb_gad <- table(gad$status, gad$antibody))
# Expected counts under H0: row total / grand total * column total
tb_gad2 <- matrix(nrow = 2, ncol = 2)
tb_gad2[1, 1] <- sum(tb_gad[1, ]) / sum(tb_gad) * sum(tb_gad[, 1])
tb_gad2[1, 2] <- sum(tb_gad[1, ]) / sum(tb_gad) * sum(tb_gad[, 2])
tb_gad2[2, 1] <- sum(tb_gad[2, ]) / sum(tb_gad) * sum(tb_gad[, 1])
tb_gad2[2, 2] <- sum(tb_gad[2, ]) / sum(tb_gad) * sum(tb_gad[, 2])
tb_gad2
# Chi-squared score without continuity correction
(chi_sq <- sum((tb_gad2 - tb_gad) ^ 2 / tb_gad2))

# Cramér's V: strength of association, from 0 (none) to 1 (perfect)
sqrt(chi_sq / (sum(tb_gad) * (min(dim(tb_gad)) - 1)))
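
The cell-by-cell calculation above can also be written in vectorised form, which extends to tables with more than two rows or columns. The sketch below is one possible way to do it, not the lecture's own code: it rebuilds the counts from the frequency table in Section 12.2 so that it runs without the CSV file, and the names `counts`, `expected`, `chi_sq_demo`, `df_chi` and `cramers_v` are illustrative.

```r
# Observed counts, typed in from the frequency table in Section 12.2
counts <- matrix(c(125, 150, 55, 220), nrow = 2,
                 dimnames = list(c("normal", "sm"), c("negative", "positive")))

# Expected counts: (row total * column total) / grand total, all cells at once
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)
expected

# Pearson chi-squared statistic and its degrees of freedom
chi_sq_demo <- sum((counts - expected)^2 / expected)
df_chi <- prod(dim(counts) - 1)  # (r - 1) * (c - 1)

# Cramer's V, using the same formula as in the code above
cramers_v <- sqrt(chi_sq_demo / (sum(counts) * (min(dim(counts)) - 1)))

c(chi_sq = chi_sq_demo, df = df_chi, cramers_v = cramers_v)
```

`outer()` multiplies every row total by every column total, so no cell has to be filled in by hand.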

One-step:

chisq.test(tb_gad)
chit <- chisq.test(tb_gad)
str(chit)
chit$statistic

Why does `chit$statistic` differ from the manually calculated `chi_sq`?

12.3.4 Yates’ correction

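For \(2\times2\) contingency tables, especially with small samples, Yates' continuity correction subtracts 0.5 from each \(|O_i - E_i|\) before squaring. `chisq.test()` applies this correction to \(2\times2\) tables by default (`correct = TRUE`), which is why its statistic is smaller than the uncorrected one: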

\[\chi^{2}=\sum\frac{(|O_{i}-E_{i}| - 0.5)^{2}}{E_{i}}\]

(chi_sq2 <- sum((abs(tb_gad2 - tb_gad) - 0.5) ^ 2 / tb_gad2))

[1] 39.318

pchisq(chi_sq2, 1, lower.tail = FALSE)

[1] 3.6019e-10
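
The effect of the correction can be seen directly by calling `chisq.test()` with and without its `correct` argument. This is a sketch under the assumption that the counts are typed in from the frequency table in Section 12.2 rather than read from the CSV file; the object names `counts`, `with_yates` and `without_yates` are illustrative.

```r
# Observed counts, typed in from the frequency table in Section 12.2
counts <- matrix(c(125, 150, 55, 220), nrow = 2,
                 dimnames = list(c("normal", "sm"), c("negative", "positive")))

# Default for 2 x 2 tables: Yates' continuity correction is applied
with_yates <- chisq.test(counts)

# correct = FALSE gives the uncorrected Pearson statistic
without_yates <- chisq.test(counts, correct = FALSE)

c(with_yates = unname(with_yates$statistic),
  without_yates = unname(without_yates$statistic))
```

With counts this large the two statistics are close and lead to the same decision; the correction matters more when expected counts are small.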

Decision.

For \(df = (2 - 1)\times(2 - 1) = 1\), the critical value from the \(\chi^2\) table is \(\chi^2 (\alpha=0.05, df=1) = 3.841\), which is much smaller than our calculated \(\chi^2 = 39.318\). Thus, we reject \(H_0\).

Conclusion.

There is a significant association between the occurrence of GAD antibodies and the occurrence of stiff-man syndrome.

12.4 Readings

Elementary Statistics, Chapter 10.1 and 10.2

12.5 Highlights

Carry out a step-by-step \(\chi^2\)-test for a scientific question.
- State \(H_0\) and \(H_1\).
- Calculate the critical value for a given \(\alpha\).
- Calculate the test statistic (\(\chi^2\) score).
- Draw a conclusion for the scientific question.