ENV221 L07

Author

Peng Zhao

7 Probability Distribution

7.1 Learning objectives

In this lecture, you will

  1. Understand the meaning of probability density functions (PDF),
  2. Know the features of binomial distribution and normal distribution, especially the Empirical Rule,
  3. Understand the sampling distribution and the Central Limit Theorem (CLT), and
  4. Use standard normal distribution and CLT to solve environmental problems.

7.2 Probability Distribution

Random variable:
a numerical value associated with each outcome of a probability experiment.

Examples:

  1. How many heads you get when you toss 10 coins.
  2. How many pollution days in Suzhou in a month.
  3. The O3 concentration on our campus.
  4. The pH value of the Taihu Lake.
Probability Distribution:

A table (probability table), graph (e.g. bar charts for discrete data and histograms for continuous data), or function (Probability Density Function, PDF) that links each outcome of a statistical experiment with the probability of occurrence.

Example: How many heads you get when you toss 10 coins 100 times.

A table:

A graph:

A function?
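A sketch of how such a table and graph could be produced in R (a simulation, not the original table or figure): simulate tossing 10 coins 100 times, tabulate the relative frequency of each number of heads, and draw a bar chart.

Code
set.seed(1)                                  # for reproducibility
heads <- rbinom(100, size = 10, prob = 0.5)  # number of heads in each of 100 tosses
ptable <- table(heads) / 100                 # a probability table
ptable
barplot(ptable, xlab = "Number of heads", ylab = "Relative frequency", las = 1)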

Example: Air quality in Wujiang in January and February 2022 [1].

[1] https://www.suzhou.gov.cn/szsrmzf/hjbhjdjcqk/202204/17fc853fdee047d690fe0f7e0bb6a91f.shtml

A table:

    Air quality   Excellent   Good   Lightly Polluted   Moderately Polluted
    Days          18          29     11                 1
    \(P(x)\)      0.30        0.49   0.19               0.02

    A graph: bar chart

Example: LI7500 data [2] (download the data file li7500.csv). Use a frequency distribution table (Section 6.3.3) to describe this dataset.

[2] LI7500 is a fast-response gas analyzer, which can measure the CO2 concentration at very high frequency. The dataset li7500.csv gives 10000 records of CO2 concentrations measured in 200 seconds on June 6, 2016, at a field site in Italy.

co2 <- read.csv("data/li7500.csv")[, 1]

    A table:

    Code
    # range(co2)
    co2_group <- cut(co2, 385:406)
    df_co2 <- data.frame(table(co2_group))
    df_co2$Probability <- round(df_co2$Freq / sum(df_co2$Freq), 3)
    df_co2$Cumulative_Probability <- round(cumsum(df_co2$Freq) / sum(df_co2$Freq), 3)

    Question: How many observations appeared between 394.5 and 396.5 ppm?
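A quick cross-check directly from the raw data (a side note, not part of the table-based approach):

# Count the observations in the interval directly from the raw data
sum(co2 > 394.5 & co2 < 396.5)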

    Another table:

    Code
    co2_group <- cut(co2, seq(385, 406, 0.5))
    df_co2 <- data.frame(table(co2_group))
    df_co2$Probability <- round(df_co2$Freq / sum(df_co2$Freq), 3)
    df_co2$Cumulative_Probability <- round(cumsum(df_co2$Freq) / sum(df_co2$Freq), 3)

Questions: Then how many observations appeared between 394.25 and 396.25 ppm? Or between 394.0625 and 396.0625 ppm?

    A graph might be better.

    Code
    par(mfrow= c(3, 2), las = 1, mar = c(2, 4, 0.3, 0.1))
    for (bins in c(8, 16, 32, 64, 128)){
      myhist <- hist(co2, breaks = bins, main = "", 
           border = NA,
           col = "steelblue", 
           lwd = 0.1, xlim = c(385, 406), ylim = c(0, 0.17), freq = FALSE)
      lines(myhist$mids, myhist$density, col = 'red')
      legend("topright", legend = paste("breaks = ", bins), bty = "n")
      box()
    }
    
    u <- seq(385, 406, by = 0.01)
    plot(u, dnorm(u), type = "n", xlab = "", ylab = "", col = "steelblue",
         xlim = c(385, 406), ylim = c(0, 0.17))
    lines(density(co2), col = 'red')
    polygon(u,
            dnorm(x= u, mean = 396, sd = 2.6),
            col = "steelblue",
            border = 'black')
    legend("topright", legend = "breaks = Inf", bty = "n")
    box()

    Graphs that describe the distribution of CO2 concentration

    The shaded area: the proportion of the observations between any two given numbers.

Or a function that describes this curve: the Probability Density Function (PDF)?

    7.3 Binomial distribution

    Binomial experiment:
    1. A fixed number of independent trials (\(n\)).
    2. Only two possible outcomes of interest for each trial: success or failure.
    3. The probability of a success (\(p\)) is the same for each trial.
    4. The random variable \(x\) counts the number of successful trials. \(x = 0, 1, 2, 3, ..., n\).

    Examples:

1. Toss a coin 2 times, or toss 2 coins. \(x\) = the number of heads. \(n = 2\), \(p = 0.5\).
2. Roll a die 10 times. \(x\) = the number of times you get a 1. \(n = 10\), \(p = \frac{1}{6}\).
3. Randomly choose a week from the air quality data of Wujiang, Jan. - Feb. 2022. \(x\) = the number of days with excellent air quality. \(n = 7\), \(p = 0.30\).

    Examples: Let \(x = 2\). Calculate the probability for Examples 1 - 3.

    Code
    x <- 2
    
    # 1
    n <- 2
    p <- 0.5
    choose(n, x) * p^x * (1-p)^(n-x)
    dbinom(x, size = n, prob = p)
    
    # 2
    n <- 10
    p <- 1/6
    choose(n, x) * p^x * (1-p)^(n-x)
    dbinom(x, size = n, prob = p)
    
    # 3
    n <- 7
    p <- 0.3
    choose(n, x) * p^x * (1-p)^(n-x)
    dbinom(x, size = n, prob = p)

    A graph showing Example 1 (tossing two coins):

    Code
    library(ggplot2)
    size <- 2
    prob <- 0.5
    x <- 0:size
    dtf <- data.frame(x, d = dbinom(x, size = size, prob = prob))
    ggplot(dtf, aes(x, d)) +
      geom_bar(stat = 'identity', width = 1, color = 'white') +
      geom_line(color = 'red', size = 2) +
      labs(y = 'Density')

    PDF:

    \[P(x)={ }_n C_x p^x (1-p)^{n-x}=\frac{n !}{(n-x) ! x !} p^x (1-p)^{n-x}\]

    Features:

    • \(\mu = np\)
    • \(\sigma ^2 = np(1-p)\)
    • \(\sigma = \sqrt {np(1-p)}\)

    Examples: Calculate \(\mu\), \(\sigma^2\), \(\sigma\) for Examples 1 - 3.

    Code
    # 1
    n <- 2
    p <- 0.5
    x_mean <- n * p
    x_var <- n * p * (1 - p)
    x_sd <- sqrt(x_var)
    
    # 2
    n <- 10
    p <- 1/6
    x_mean <- n * p
    x_var <- n * p * (1 - p)
    x_sd <- sqrt(x_var)
    
    # 3
    n <- 7
    p <- 0.3
    x_mean <- n * p
    x_var <- n * p * (1 - p)
    x_sd <- sqrt(x_var)
    • If \(np \ge 5\) and \(n(1-p) \ge 5\), \(x\) is approximately normally distributed.
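As a quick check of this normal approximation (a sketch, not from the original notes): compare the binomial probabilities from dbinom() with the normal density using \(\mu = np\) and \(\sigma = \sqrt{np(1-p)}\).

Code
# A sketch: overlay the normal approximation on a binomial distribution
# with size = 30 and prob = 0.5, so np = 15 >= 5 and n(1-p) = 15 >= 5.
size <- 30
prob <- 0.5
x <- 0:size
plot(x, dbinom(x, size = size, prob = prob), type = "h", las = 1,
     xlab = "x", ylab = "Density")
curve(dnorm(x, mean = size * prob, sd = sqrt(size * prob * (1 - prob))),
      add = TRUE, col = "red")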

    7.4 Normal Distribution (Gaussian Distribution)

    Johann Carl Friedrich Gauß

    PDF:

    \[f(x)=\frac{1}{\sigma\sqrt{2 \pi}} e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^{2} }\] Suppose \(\mu = 60, \sigma = 3, x \in [40, 80]\),

    Code
    library(ggplot2)
    mu <- 60
    sigma <- 3
    x <- 40:80
    y <- 1 / (sigma * sqrt(2 * pi)) * exp(-1/2 * ((x - mu) / sigma) ^ 2)
    dtf <- data.frame(x, y)
    ggplot() + 
      geom_line(aes(x, y), dtf) + 
      geom_vline(aes(xintercept = mu + sigma * ((-3):3), 
                     color = c('3sd', '2sd', '1sd', 'mean', '1sd', '2sd', '3sd'))) +
      labs(color = '', y = 'f(x)')

    Features:

    • Bell shaped, and symmetrical about its mean \(\mu\).
    • Mean = median.
    • The total area under the curve = 1.
    • Completely determined by the parameters \(\mu\) and \(\sigma\).
    • Location parameter: Different values of \(\mu\) shift the graph along the horizontal axis.
• Scale parameter: Different values of \(\sigma\) alter the spread (the degree of flatness and peakedness) of the curve.
    Code
    dtf$y1 <- 1 / (sigma * sqrt(2 * pi)) * exp(-1/2 * ((x - (mu - 5)) / sigma) ^ 2)
    dtf$y2 <- 1 / (sigma * sqrt(2 * pi)) * exp(-1/2 * ((x - (mu + 5)) / sigma) ^ 2)
    dtf$y3 <- 1 / (sigma * sqrt(2 * pi)) * exp(-1/2 * ((x - mu) / (sigma + 1)) ^ 2)
    dtf$y4 <- 1 / (sigma * sqrt(2 * pi)) * exp(-1/2 * ((x - mu) / (sigma - 1)) ^ 2)
    ggplot(dtf) + 
      geom_line(aes(x, y)) + 
      geom_line(aes(x, y1), color = 'blue') + 
      geom_line(aes(x, y2), color = 'blue') + 
      geom_line(aes(x, y3), color = 'red', linetype = 2) +
      geom_line(aes(x, y4), color = 'red', linetype = 2) +
      labs(y = 'f(x)')

• \(P(a<X<b) = \int_a^bf(x)\text dx\), so \(P(X=c) = \int_c^cf(x)\text dx = 0\).
    • Empirical Rule: Approximately 68%, 95% and 99.7% of the area is within \(\pm 1\), \(\pm 2\) and \(\pm 3\) standard deviations from the mean.
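The Empirical Rule can be verified quickly with the pnorm() function, which is introduced below (a minimal sketch):

# Area within 1, 2, and 3 standard deviations of the mean
pnorm(1) - pnorm(-1)   # about 0.68
pnorm(2) - pnorm(-2)   # about 0.95
pnorm(3) - pnorm(-3)   # about 0.997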

    Question: In the case of the CO2 data, how do you know whether it follows (approximately) a normal distribution?

Method 1: Visualize it with a histogram + a density curve + a normal distribution curve. Check whether it is bell-shaped, with a single symmetrical peak and the mean close to the median.

    Code
    # plot the two similar figures one by one
    mydata <- co2
    hist(mydata, freq = FALSE, main = '')
    lines(density(mydata), col = "blue")
    curve(dnorm(x, mean = mean(mydata), 
                sd = sd(mydata)*sqrt((length(mydata)-1)/length(mydata))), 
          add = TRUE, col = "red")
    
    mydata <- iris$Sepal.Length
    hist(mydata, freq = FALSE, main = '')
    lines(density(mydata), col = "blue")
    curve(dnorm(x, mean = mean(mydata), 
                sd = sd(mydata)*sqrt((length(mydata)-1)/length(mydata))), 
          add = TRUE, col = "red")
    Code
    # plot the two figures with a self-defined function
    plot_hist_density <- function(mydata){
      hist(mydata, freq = FALSE, main = '', las = 1)
      lines(density(mydata), col = "blue")
      curve(dnorm(x, mean = mean(mydata), 
                  sd = sd(mydata)*sqrt((length(mydata)-1)/length(mydata))), 
            add = TRUE, col = "red")
      box()
    }
    par(mfrow = c(1, 2))
    plot_hist_density(co2)
    plot_hist_density(iris$Sepal.Length)

Method 2: QQ plot. Check whether the data points lie close to the line of a perfect normal distribution, with no obvious pattern departing from the line.

    Code
    par(mfrow = c(1, 2))
    qqnorm(co2)
    qqline(co2)
    qqnorm(iris$Sepal.Length)
    qqline(iris$Sepal.Length)

Question: As the CO2 data approximately follow a normal distribution, how do we express it with an equation?

    mu <-  mean(co2)
    mu
    [1] 395.9551
    n <- length(co2)
    sigma <- sd(co2) * sqrt((n-1)/n)
    sigma
    [1] 2.617185

\(f(x)\) is the density shown in the previous graphs. We need concepts from basic calculus to describe these distributions.

    \[f(x) = \frac{1}{2.6\sqrt{2 \pi}} e^{-\frac{1}{2}(\frac{x-396}{2.6})^{2} }\]

    \[P(x_1 < x < x_2) = \int_{x_1}^{x_2} f(x)\text d x = \int_{x_1}^{x_2}\frac{1}{\sigma\sqrt{2 \pi}} e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^{2} } \text d x\]

With R, simply use the pnorm() function, which returns the cumulative probability \(P(X < x)\):

The pnorm() function.

Question: How many observations appeared between 394.0625 and 396.0625 ppm?

    p394.0625 <- pnorm(394.0625, mean = mu, sd = sigma)
    p396.0625 <- pnorm(396.0625, mean = mu, sd = sigma)
    n * (p396.0625 - p394.0625)
    [1] 1407.577
    n * diff(pnorm(c(394.0625, 396.0625), mean = mu, sd = sigma))
    [1] 1407.577
    sum(co2 > 394.0625 & co2 < 396.0625)
    [1] 1440
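As a cross-check (a sketch using the fitted \(\mu\) and \(\sigma\) from above), the same result can be obtained by numerically integrating the normal PDF with integrate():

# Numerically integrate the fitted normal PDF between the two limits
f <- function(x) dnorm(x, mean = mu, sd = sigma)
n * integrate(f, lower = 394.0625, upper = 396.0625)$value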

    What if I cannot use R or I do not have a computer for these calculations?

    7.5 Standard Normal Distribution (\(z\) distribution)

    Standard normal distribution (unit normal distribution):

    a special form of the normal distribution when \(\mu=0\) and \(\sigma = 1\).

    \[f(x)=\frac{1}{\sigma\sqrt{2 \pi}} e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^{2} }\]

    \[f(z)=\frac{1}{\sqrt{2 \pi }} e^{-\frac{1}{2}z^{2}}\]

    \(z\) table for the probability:

    How to read the \(z\) table:

    • The probability for \(z < -1\) is 0.1587.
    • The probability for \(z < 1.53\) is 0.9370.
    • The probability for \(-1 < z < 1.53\) is \(0.9370 - 0.1587 = 0.7783\).
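These table values can be reproduced in R with pnorm() for the standard normal distribution (a quick check):

pnorm(-1)                 # 0.1587
pnorm(1.53)               # 0.9370
pnorm(1.53) - pnorm(-1)   # 0.7783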

    z distribution

    Features:

    • Bell shaped, and symmetrical about its mean \(\mu = 0\).
    • Mean = median = 0.
    • The total area under the curve = 1.
• Empirical Rule: Approximately 68%, 95% and 99.7% of the area is within \(\pm 1\), \(\pm 2\) and \(\pm 3\) of the mean, respectively.
    \(z\) score:

    If a quantitative variable \(x\) has a normal distribution with the mean of \(\mu\) and standard deviation of \(\sigma\), then we can create a new variable \(z_\mathrm {score}=(x-\mu)/\sigma\), and \(z_\mathrm {score}\) has a standard normal distribution.

    Example:

    z_score <- (co2 - mu) /sigma
    mean(z_score)
    [1] -9.402777e-15
    sd(z_score)
    [1] 1.0001
    Code
    par(mfrow = c(1,2))
    hist(co2, freq = FALSE, breaks = seq(385, 406, 1))
    curve(dnorm(x, mu, sigma), col = "red", add = TRUE)
    box()
    
    hist(z_score, freq = FALSE, breaks = (seq(385, 406, 1) - mu)/sigma)
    curve(dnorm(x, 0, 1), col = "blue", add = TRUE)
    box()

    Convert a normal distribution into a standard normal distribution

    Proportion in the range between two given values of the CO2 concentration (\(c_1\) and \(c_2\)): convert the CO2 concentrations into \(z\) (\(z_1\) and \(z_2\)), and look them up in the \(z\) table.

    Example: Estimate the proportion of CO2 concentration between 390 and 395 ppm with the \(z\) table.

    mu <- mean(co2)
    n <- length(co2)
    sigma <- sd(co2) * sqrt((n-1) /n)
    (390 - mu)/sigma
    [1] -2.275372
    (395 - mu)/sigma
    [1] -0.364922
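To complete the example (a sketch, using pnorm() in place of the printed \(z\) table): take the difference of the two cumulative probabilities, which is roughly 0.35 under the normal model, and compare it with the observed proportion in the data.

# Probability between the two z scores (z-table lookup, or pnorm())
z1 <- (390 - mu) / sigma
z2 <- (395 - mu) / sigma
pnorm(z2) - pnorm(z1)          # roughly 0.35 under the normal model
# Observed proportion in the sample, for comparison
mean(co2 > 390 & co2 < 395)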

    What is the (standard) normal distribution used for?

    7.6 Other distributions

    Distribution p- q- d- r-
    Beta pbeta qbeta dbeta rbeta
    Binomial pbinom qbinom dbinom rbinom
    Cauchy pcauchy qcauchy dcauchy rcauchy
    Chi-Square pchisq qchisq dchisq rchisq
Exponential pexp qexp dexp rexp
    F pf qf df rf
    Gamma pgamma qgamma dgamma rgamma
    Geometric pgeom qgeom dgeom rgeom
Hypergeometric phyper qhyper dhyper rhyper
    Logistic plogis qlogis dlogis rlogis
    Log Normal plnorm qlnorm dlnorm rlnorm
    Negative Binomial pnbinom qnbinom dnbinom rnbinom
    Normal pnorm qnorm dnorm rnorm
    Poisson ppois qpois dpois rpois
    Student t pt qt dt rt
    Studentized Range ptukey qtukey dtukey rtukey
    Uniform punif qunif dunif runif
    Weibull pweibull qweibull dweibull rweibull
    Wilcoxon Rank Sum Statistic pwilcox qwilcox dwilcox rwilcox
    Wilcoxon Signed Rank Statistic psignrank qsignrank dsignrank rsignrank

    Distribution functions in R: prefix + root_name

    • d for “density”, the density function (PDF).
    • p for “probability”, the cumulative distribution function (CDF).
    • q for “quantile”, the inverse of CDF.
• r for “random”, random number generation from the specified distribution.
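For example, the four functions for the normal distribution (a brief sketch):

dnorm(0)       # density of the standard normal at x = 0
pnorm(1.96)    # P(X < 1.96), about 0.975
qnorm(0.975)   # the quantile with 97.5% of the area below it, about 1.96
rnorm(5)       # five random numbers from the standard normal distribution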

7.7 Central Limit Theorem

    7.7.1 Sampling distribution

    Sampling Distribution:

    The distribution of all possible values of a statistic for a given sample size
    The most important definition in this module.
    The boundary line between people who understand statistics and people who don’t.

    Key words:

    1. A given sample size
    2. A statistic
    3. All possible values
    4. A distribution
    Sample:

    a subset of the population

    Statistic:

    measures of the sample, denoted by Roman letters (e.g., \(\bar x\) for the sample mean)

    Demo: 3 out of 10.

    Sample 3 observations from a population of the integers 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. Give the sampling distribution for the mean.

    1. A given sample size: 3.
    2. A statistic: the sample mean \(\bar x\).
    • One possible sample: 1, 2, 3. \(\bar x = (1+2+3)/3=2\).
    • Another possible sample: 1, 2, 4. \(\bar x = (1+2+4)/3=2.33\).
3. All possible values: all the possible samples and the mean for each sample.
    population <- 1:10
    sample_size <- 3
    choose(10, 3)
    [1] 120
    samples <- data.frame(t(combn(population, sample_size)))
    samples$sample_mean <- apply(samples, 1, mean)
    samples$Sample <- rownames(samples)
4. A distribution: a histogram for the sample mean \(\bar x\).
    Code
    par(mfrow = c(1, 2))
    hist(samples$sample_mean, 
         las = 1, freq = FALSE,
         xlab = 'Sample mean', 
         main = '')
    box()
    
    hist(population, 
         las = 1, freq = FALSE,
         xlab = 'Population', 
         main = '')
    box()

    Example: 30 out of 999.

    Sample 30 observations from a population of the integers 1, 2, …, 999. Give the sampling distribution for the mean.

    1. A given sample size: 30
    2. A statistic: the sample mean.
    3. All possible values.
    population <- 1:999
    hist(population)

    sample_size <- 30
    choose(999, 30)
    [1] 2.35672e+57

    It is a HUGE number. Impossible! Then what can we do?

7.7.2 Approximation of the sampling distribution

    Instead, we randomly sample many many many times.

    sample1 <- sample(1:999, 30)
    sample2 <- sample(1:999, 30)
    
    samplen <- replicate(10, sample(1:999, 30))
    sample_mean <- apply(samplen, 2, mean)
    hist(sample_mean, freq = FALSE)
    Code
    par(mfrow = c(3,2))
    for (rep_n in c(10, 50, 100, 500, 1000, 5000)){
      samplen <- replicate(rep_n, sample(1:999, 30))
      sample_mean <- apply(samplen, 2, mean)
      hist(sample_mean, 
           breaks = seq(0, 1000, 25), 
            
           main = rep_n, 
           xlim = c(300, 700), 
           freq = FALSE)
      lines(density(sample_mean), col = 'red')
      box()
    }

• The distributions do not differ much once the sampling is repeated more than 100 times.
• Thus, we use one of them (500 repetitions) as an approximation of the sampling distribution.

    7.7.3 The CLT

The approximate sampling distribution:

    samplen <- replicate(500, sample(1:999, 30))
    sample_mean <- apply(samplen, 2, mean)

\(\mu _{\bar x}\): The mean of the sample means (\(\bar x\)):

    round(mean(sample_mean))
    [1] 496

    The population mean \(\mu\):

    mean(1:999)
    [1] 500

Standard error (\(\sigma_{\bar x}\)): The standard deviation of the sample means (\(\bar x\)):

    round(sd(sample_mean) * sqrt((500 - 1) / 500))
    [1] 52

The population standard deviation \(\sigma\) divided by \(\sqrt n\):

st <- sd(1:999) * sqrt((999 - 1) / 999)
round(st / sqrt(30))
[1] 53

    Central Limit Theorem (CLT):

    1. If samples of size \(n\), where \(n\ge30\), are drawn from any population with a mean \(\mu\) and a standard deviation \(\sigma\), then the sampling distribution of sample means approximates a normal distribution. The greater the sample size, the better the approximation.
    2. If the population itself is normally distributed, then the sampling distribution of sample means is normally distributed for any sample size \(n\).

    If CLT holds,

1. The mean of the sampling distribution of the mean equals the population mean (\(\mu_{\bar x} = \mu\)).
2. The standard deviation of the sampling distribution of the mean, i.e. the standard error of the mean, equals the population standard deviation divided by the square root of the sample size (\(\sigma _{\bar x} = \frac{\sigma}{\sqrt n}\)).

    7.7.4 The power of CLT

The population that supplies the samples does not have to be normally distributed for the CLT to hold.

    Example: Toss 10 coins (a population of a binomial distribution)

    sample(c(0, 1), 10, replace = TRUE)
     [1] 0 1 0 0 0 0 0 1 0 0
    sample_size <- 3
    replicate(sample_size, sample(c(0, 1), 10, replace = TRUE))
          [,1] [,2] [,3]
     [1,]    0    1    0
     [2,]    0    0    0
     [3,]    0    0    0
     [4,]    1    0    1
     [5,]    0    1    1
     [6,]    0    0    1
     [7,]    1    0    1
     [8,]    0    0    0
     [9,]    1    0    0
    [10,]    1    0    1
    Code
    par(mfrow = c(3, 2))
    x <- 0:10
    y <- dbinom(0:10, 10, 0.5)
    rep_n <- 500
    mybar <- barplot(y,
                     ylim = c(0, max(y * 1.1)),
                     space = 0,
                     names.arg = x,
                     xlab = 'x', ylab = 'Density',
                     las = 1)
    axis(1, at = mybar, labels = x)
    legend("topright", bty = "n", 
           legend = paste("mean: 5", 
                          "\nsd = 1.58"))
    
    box()
    
    for (sample_size in c(3, 6, 12, 24, 30)){
      samplen <- replicate(rep_n, replicate(sample_size, sum(sample(c(0, 1), 10, replace = TRUE))))
      sample_mean <- colMeans(samplen)
      # p <- round(shapiro.test(sample_mean)$p.value, 2)
      mean_mean <- mean(sample_mean)
      se <- sd(sample_mean)
      hist(sample_mean, xlim = c(0, 10), freq = FALSE, 
           main = paste("n = ", sample_size)
           # main = paste("n = ", sample_size, ", p = ", p)
           )
      lines(density(sample_mean), col = "red")
      curve(dnorm(x, mean = mean_mean, sd = se), add = TRUE, col = "blue")
      legend("topright", bty = "n",
             legend = paste("mean: ", round(mean_mean, 2), 
                            "\nsd = ", round(se, 2)))
      box()
    }

    Example: CO\(_2\) concentration (a population of a normal distribution).

    Code
    li7500 <- read.csv("data/li7500.csv")
    population <- li7500[, 1]
    rep_n <- 500
    par(mfrow = c(3, 2))
    hist(population, xlim = c(385, 405))
    legend("topright", bty = "n", 
           legend = paste("mean: ", round(mean(population), 1)))
    for (sample_size in c(2, 5, 10, 50, 100)){
      samplen <- replicate(rep_n, sample(population, sample_size))
      sample_mean <- apply(samplen, 2, mean)
      hist(sample_mean, xlim = c(385, 405), 
           main = paste("Sample size = ", sample_size), freq = FALSE)
      legend("topright", bty = "n", 
             legend = paste("mean: ", round(mean(sample_mean), 1)))
      box()
    }

    Example: 30 out of 1:999 (a population of a uniform distribution).

    Code
    population <- 1:999
    rep_n <- 500
    par(mfrow = c(3, 2))
    hist(population, freq = FALSE)
    legend("topright", 
           legend = paste("mean: ", mean(population)), bty = "n")
    for (sample_size in c(3, 10, 30, 100, 300)){
      samplen <- replicate(rep_n, sample(population, sample_size))
      sample_mean <- apply(samplen, 2, mean)
      hist(sample_mean, breaks = seq(0, 1000, 25), 
           main = paste("Sample size = ", sample_size), 
           xlim = c(0, 1000), freq = FALSE)
      legend("topright", bty = "n", 
             legend = paste("mean: ", round(mean(sample_mean), 1)))
      box()
    }

Question: What is the probability that the mean of a random sample is between 400 and 500?
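A sketch of answering this with the CLT (assuming, as in the earlier example, samples of size 30 drawn from the population 1:999): under the CLT, the sample mean is approximately normal with mean \(\mu\) and standard error \(\sigma / \sqrt n\).

# CLT-based answer, assuming a sample size of 30 from the population 1:999
population <- 1:999
n <- 30
mu <- mean(population)
sigma <- sd(population) * sqrt((length(population) - 1) / length(population))
se <- sigma / sqrt(n)
pnorm(500, mean = mu, sd = se) - pnorm(400, mean = mu, sd = se)   # roughly 0.47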

    Example: Two peaks.

    Code
    tp <- read.csv("data/twopeaks.csv")
    population <- tp$data
    rep_n <- 5000
    par(mfrow = c(3, 2))
    hist(population)
    legend("topright", bty = "n", 
           legend = paste("mean: ", round(mean(population),2), 
                          "\nsd = ", round(sd(population), 2)))
    for (sample_size in c(3, 6, 12, 24, 30)){
      samplen <- replicate(rep_n, sample(population, sample_size))
      sample_mean <- apply(samplen, 2, mean)
      # p <- round(shapiro.test(sample_mean)$p.value, 2)
      mean_mean <- mean(sample_mean)
      se <- sd(sample_mean)
      hist(sample_mean, xlim = c(0, 15), freq = FALSE, 
           main = paste("n = ", sample_size)
           # main = paste("n = ", sample_size, ", p = ", p)
           )
      lines(density(sample_mean), col = "red")
      curve(dnorm(x, mean = mean_mean, sd = se), add = TRUE, col = "blue")
      legend("topright", bty = "n",
             legend = paste("mean: ", round(mean_mean, 2), 
                            "\nsd = ", round(se, 2)))
      box()
    }

    Animation for the CLT:

    animation::clt.ani()

    7.8 Estimating Population Parameters

• We do not know, in most cases, the population parameters such as \(\mu\) and \(\sigma\).
    • Usually we only have one sample with some observations.
    • How do we estimate the population parameters from the sample?

    7.8.1 Definitions

    Point estimate:
    A single value estimate for a population parameter.
    • Sample mean \(\bar x\) as the best point estimate of \(\mu\)
    • Sample standard deviation \(s\) as the best point estimate of \(\sigma\).
    Interval estimate:
    An interval, or range of values, used to estimate a population parameter. Expressed as the center of the interval (point estimate) and \(\pm\) margin of error.
    Level of confidence (\(c\)):
    The probability that the interval estimate contains the population parameter, assuming that the estimation process is repeated a large number of times.
    Critical values (\(z_c\)):
Values that separate probable sample statistics from improbable (unusual) sample statistics.
    Margin of error (\(E\), the maximum error of estimate, error tolerance):
    The greatest possible distance between the point estimate and the value of the parameter it is estimating.

    \[E = z_c \frac{\sigma}{\sqrt n}\],

    when CLT holds.

    7.8.2 For a population mean

    Confidence interval:

    \[\bar x - E < \mu < \bar x + E\]

    or \(CI = \bar x \pm E\).

    • Let \(z_c = 1\), then \(c = 0.68\). We have 68% confidence that the population mean \(\mu\) is between \(\bar x - \frac{\sigma}{\sqrt n}\) and \(\bar x + \frac{\sigma}{\sqrt n}\).

    Meaning: If you repeat the sampling and estimation procedure many many times, the confidence intervals you calculate would include \(\mu\) 68% of the time.

    • Let \(z_c = 2\), then \(c = 0.95\). We have 95% confidence that the population mean \(\mu\) is between \(\bar x - 2\frac{\sigma}{\sqrt n}\) and \(\bar x + 2\frac{\sigma}{\sqrt n}\).

    Meaning: If you repeat the sampling and estimation procedure many many times, the confidence intervals you calculate would include \(\mu\) 95% of the time.

    Example: You wish to estimate the mean age of all the current students in XJTLU. The mean age of a random sample of 20 students is 22.9 years. From past studies, the standard deviation is 1.5 years, and the population is normally distributed. Construct a 95% confidence interval of the population mean age.

    zc <- 1.96
    sigma <- 1.5
    n <- 20
    xbar <- 22.9
    E <- round(zc * sigma / sqrt(n), 1)
    c(xbar - E, xbar + E) 
    [1] 22.2 23.6

    Example: 30 out of 1:999.

    population <- 1:999
    mu <- mean(population)
    sample_size <- 30
    rep_n <- 100
    
    samplen <- replicate(rep_n, sample(population, sample_size))
    df <- data.frame(
      sample_mean = apply(samplen, 2, mean),
      standard_error = apply(samplen, 2, 
                             function(x) sd(x)/sqrt(sample_size))
    )
    df$lower <- df$sample_mean - 2 * df$standard_error
    df$upper <- df$sample_mean + 2 * df$standard_error
    df$include <- mu > df$lower & mu < df$upper
    sum(df$include) / rep_n
    [1] 0.97
    Repeats <- c(20, 50, 100, 200, 500, 1000, 5000, 10000, 50000, 100000)
    Includes <- NULL
    for (rep_n in Repeats){
      samplen <- replicate(rep_n, sample(population, sample_size))
      df <- data.frame(
        sample_mean = apply(samplen, 2, mean),
        standard_error = apply(samplen, 2, 
                               function(x) sd(x)/sqrt(sample_size))
      )
      df$lower <- df$sample_mean - 2 * df$standard_error
      df$upper <- df$sample_mean + 2 * df$standard_error
      df$include <- mu > df$lower & mu < df$upper
      Includes <- c(Includes, sum(df$include) / rep_n)
    }
    plot(Repeats, Includes, log = "x", type = "b")
    abline(h = 0.95, col = "blue")

    Example: CO\(_2\) concentration.

    population <- li7500[, 1]
    
    mu <- mean(population)
    sample_size <- 30
    rep_n <- 100
    
    samplen <- replicate(rep_n, sample(population, sample_size))
    df <- data.frame(
      sample_mean = apply(samplen, 2, mean),
      standard_error = apply(samplen, 2, 
                             function(x) sd(x)/sqrt(sample_size))
    )
    df$lower <- df$sample_mean - 2 * df$standard_error
    df$upper <- df$sample_mean + 2 * df$standard_error
    df$include <- mu > df$lower & mu < df$upper
    sum(df$include) / rep_n
    [1] 0.92
    Repeats <- c(20, 50, 100, 200, 500, 1000, 5000, 10000, 50000, 100000)
    Includes <- NULL
    for (rep_n in Repeats){
      samplen <- replicate(rep_n, sample(population, sample_size))
      df <- data.frame(
        sample_mean = apply(samplen, 2, mean),
        standard_error = apply(samplen, 2, 
                               function(x) sd(x)/sqrt(sample_size))
      )
      df$lower <- df$sample_mean - 2 * df$standard_error
      df$upper <- df$sample_mean + 2 * df$standard_error
      df$include <- mu > df$lower & mu < df$upper
      Includes <- c(Includes, sum(df$include) / rep_n)
    }
    plot(Repeats, Includes, log = "x", type = "b")
    abline(h = 0.95, col = "blue")

    Animation for confidence intervals:

    animation::conf.int()

    7.8.3 For a population proportion

    Point estimate:

    \[\hat{p}=\frac{x}{n}\]

    A \(c\)-confidence interval for a population proportion \(p\) is

    \[\hat{p}-E<p<\hat{p}+E\]

    When \(np\ge5\) and \(n(1-p)\ge5\), the margin of error for \(p\) is

    \[E=z_{c} \sqrt{\frac{\hat{p} (1-\hat{p})}{n}}.\]

    Example: In a survey of 1000 (\(n\)) U.S. teens, 372 (\(x\)) said that they own smartphones. The point estimate for the population proportion of U.S. teens who own smartphones is 37.2%. Give the confidence interval.

    p <- 0.372
    n <- 1000
    n * p >= 5
    [1] TRUE
    n * (1 - p) >= 5
    [1] TRUE
    zc <- 1.96
E <- zc * sqrt(p * (1 - p) / n)
    round(c(p - E, p + E), 3)
    [1] 0.342 0.402

    With 95% confidence, we can say that the population proportion of U.S. teens who own smartphones is between 34.2% and 40.2%.
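For comparison (a side note), base R also provides prop.test() and binom.test(), which construct confidence intervals for a proportion by different methods (the Wilson score interval and the exact Clopper-Pearson interval, respectively), so the limits are similar but not identical to the ones above:

prop.test(x = 372, n = 1000)$conf.int    # Wilson score interval (with continuity correction)
binom.test(x = 372, n = 1000)$conf.int   # exact (Clopper-Pearson) interval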

    7.9 Summary

    • How to estimate the population proportion (point + interval)?
      • Binomial distribution.
    • How to estimate the population mean (point + interval) if the population standard deviation \(\sigma\) is known?
      • \(z\) distribution.
    • How to estimate the population mean (interval) if the population standard deviation \(\sigma\) is unknown?
      • \(t\) distribution.
    • How to estimate the population variance?
      • \(\chi^2\) distribution.

    7.10 Readings

    • Elementary Statistics, Chapter 5