ENV221 L07

Author

Peng Zhao

7 Probability Distribution

7.1 Learning objectives

In this lecture, you will

  1. Understand the meaning of probability density functions (PDF),
  2. Know the features of binomial distribution and normal distribution, especially the Empirical Rule,
  3. Understand the sampling distribution and the Central Limit Theorem (CLT), and
  4. Use standard normal distribution and CLT to solve environmental problems.

7.2 Probability Distribution

Random variable:
a numerical value associated with each outcome of a probability experiment.

Examples:

  1. How many heads you get when you toss 10 coins.
  2. How many pollution days in Suzhou in a month.
  3. The O3 concentration on our campus.
  4. The pH value of the Taihu Lake.
Probability Distribution:

A table (probability table), graph (e.g. bar charts for discrete data and histograms for continuous data), or function (Probability Density Function, PDF) that links each outcome of a statistical experiment with the probability of occurrence.

Example: How many heads you get when you toss 10 coins 100 times.

A table:

A graph:

A function?
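A sketch of how such a table and graph could be produced in R (a simulation, not the original table or figure): simulate tossing 10 coins 100 times, tabulate the relative frequency of each number of heads, and draw a bar chart.

Code
set.seed(1)                                  # for reproducibility
heads <- rbinom(100, size = 10, prob = 0.5)  # number of heads in each of 100 tosses
ptable <- table(heads) / 100                 # a probability table
ptable
barplot(ptable, xlab = "Number of heads", ylab = "Relative frequency", las = 1)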

Example: Air quality in Wujiang in January and February 2022 [1].

[1] https://www.suzhou.gov.cn/szsrmzf/hjbhjdjcqk/202204/17fc853fdee047d690fe0f7e0bb6a91f.shtml

A table:

    Air quality   Excellent   Good   Lightly Polluted   Moderately Polluted
    Days          18          29     11                 1
    \(P(x)\)      0.30        0.49   0.19               0.02

    A graph: bar chart

Example: LI7500 data [2] (download the data file li7500.csv). Use a frequency distribution table (Section 6.3.3) to describe this dataset.

[2] LI7500 is a fast-response gas analyzer, which can measure the CO2 concentration at very high frequency. The dataset li7500.csv gives 10000 records of CO2 concentrations measured in 200 seconds on June 6, 2016, at a field site in Italy.

co2 <- read.csv("data/li7500.csv")[, 1]

    A table:

    Code
    # range(co2)
    co2_group <- cut(co2, 385:406)
    df_co2 <- data.frame(table(co2_group))
    df_co2$Probability <- round(df_co2$Freq / sum(df_co2$Freq), 3)
    df_co2$Cumulative_Probability <- round(cumsum(df_co2$Freq) / sum(df_co2$Freq), 3)

    Question: How many observations appeared between 394.5 and 396.5 ppm?
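A quick cross-check directly from the raw data (a side note, not part of the table-based approach):

# Count the observations in the interval directly from the raw data
sum(co2 > 394.5 & co2 < 396.5)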

    Another table:

    Code
    co2_group <- cut(co2, seq(385, 406, 0.5))
    df_co2 <- data.frame(table(co2_group))
    df_co2$Probability <- round(df_co2$Freq / sum(df_co2$Freq), 3)
    df_co2$Cumulative_Probability <- round(cumsum(df_co2$Freq) / sum(df_co2$Freq), 3)

Questions: Then how many observations appeared between 394.25 and 396.25 ppm? Or between 394.0625 and 396.0625 ppm?

    A graph might be better.

    Code
    par(mfrow= c(3, 2), las = 1, mar = c(2, 4, 0.3, 0.1))
    for (bins in c(8, 16, 32, 64, 128)){
      myhist <- hist(co2, breaks = bins, main = "", 
           border = NA,
           col = "steelblue", 
           lwd = 0.1, xlim = c(385, 406), ylim = c(0, 0.17), freq = FALSE)
      lines(myhist$mids, myhist$density, col = 'red')
      legend("topright", legend = paste("breaks = ", bins), bty = "n")
      box()
    }
    
    u <- seq(385, 406, by = 0.01)
    plot(u, dnorm(u), type = "n", xlab = "", ylab = "", col = "steelblue",
         xlim = c(385, 406), ylim = c(0, 0.17))
    lines(density(co2), col = 'red')
    polygon(u,
            dnorm(x= u, mean = 396, sd = 2.6),
            col = "steelblue",
            border = 'black')
    legend("topright", legend = "breaks = Inf", bty = "n")
    box()

    Graphs that describe the distribution of CO2 concentration

    The shaded area: the proportion of the observations between any two given numbers.

Or a function that describes this curve: the Probability Density Function (PDF)?

    7.3 Binomial distribution

    Binomial experiment:
    1. A fixed number of independent trials (\(n\)).
    2. Only two possible outcomes of interest for each trial: success or failure.
    3. The probability of a success (\(p\)) is the same for each trial.
    4. The random variable \(x\) counts the number of successful trials. \(x = 0, 1, 2, 3, ..., n\).

    Examples:

1. Toss a coin 2 times, or toss 2 coins. \(x\) = the number of heads. \(n = 2\), \(p = 0.5\).
2. Roll a die 10 times. \(x\) = the number of times you get a 1. \(n = 10\), \(p = \frac{1}{6}\).
3. Randomly choose a week from the air quality data of Wujiang, Jan. - Feb. 2022. \(x\) = the number of days with excellent air quality. \(n = 7\), \(p = 0.30\).

    Examples: Let \(x = 2\). Calculate the probability for Examples 1 - 3.

    Code
    x <- 2
    
    # 1
    n <- 2
    p <- 0.5
    choose(n, x) * p^x * (1-p)^(n-x)
    dbinom(x, size = n, prob = p)
    
    # 2
    n <- 10
    p <- 1/6
    choose(n, x) * p^x * (1-p)^(n-x)
    dbinom(x, size = n, prob = p)
    
    # 3
    n <- 7
    p <- 0.3
    choose(n, x) * p^x * (1-p)^(n-x)
    dbinom(x, size = n, prob = p)

    A graph showing Example 1 (tossing two coins):

    Code
    library(ggplot2)
    size <- 2
    prob <- 0.5
    x <- 0:size
    dtf <- data.frame(x, d = dbinom(x, size = size, prob = prob))
    ggplot(dtf, aes(x, d)) +
      geom_bar(stat = 'identity', width = 1, color = 'white') +
      geom_line(color = 'red', size = 2) +
      labs(y = 'Density')

    PDF:

    \[P(x)={ }_n C_x p^x (1-p)^{n-x}=\frac{n !}{(n-x) ! x !} p^x (1-p)^{n-x}\]

    Features:

    • \(\mu = np\)
    • \(\sigma ^2 = np(1-p)\)
    • \(\sigma = \sqrt {np(1-p)}\)

    Examples: Calculate \(\mu\), \(\sigma^2\), \(\sigma\) for Examples 1 - 3.

    Code
    # 1
    n <- 2
    p <- 0.5
    x_mean <- n * p
    x_var <- n * p * (1 - p)
    x_sd <- sqrt(x_var)
    
    # 2
    n <- 10
    p <- 1/6
    x_mean <- n * p
    x_var <- n * p * (1 - p)
    x_sd <- sqrt(x_var)
    
    # 3
    n <- 7
    p <- 0.3
    x_mean <- n * p
    x_var <- n * p * (1 - p)
    x_sd <- sqrt(x_var)
    • If \(np \ge 5\) and \(n(1-p) \ge 5\), \(x\) is approximately normally distributed.
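As a quick check of this normal approximation (a sketch, not from the original notes): compare the binomial probabilities from dbinom() with the normal density using \(\mu = np\) and \(\sigma = \sqrt{np(1-p)}\).

Code
# A sketch: overlay the normal approximation on a binomial distribution
# with size = 30 and prob = 0.5, so np = 15 >= 5 and n(1-p) = 15 >= 5.
size <- 30
prob <- 0.5
x <- 0:size
plot(x, dbinom(x, size = size, prob = prob), type = "h", las = 1,
     xlab = "x", ylab = "Density")
curve(dnorm(x, mean = size * prob, sd = sqrt(size * prob * (1 - prob))),
      add = TRUE, col = "red")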

    7.4 Normal Distribution (Gaussian Distribution)

    Johann Carl Friedrich Gauß

    PDF:

    \[f(x)=\frac{1}{\sigma\sqrt{2 \pi}} e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^{2} }\] Suppose \(\mu = 60, \sigma = 3, x \in [40, 80]\),

    Code
    library(ggplot2)
    mu <- 60
    sigma <- 3
    x <- 40:80
    y <- 1 / (sigma * sqrt(2 * pi)) * exp(-1/2 * ((x - mu) / sigma) ^ 2)
    dtf <- data.frame(x, y)
    ggplot() + 
      geom_line(aes(x, y), dtf) + 
      geom_vline(aes(xintercept = mu + sigma * ((-3):3), 
                     color = c('3sd', '2sd', '1sd', 'mean', '1sd', '2sd', '3sd'))) +
      labs(color = '', y = 'f(x)')

    Features:

    • Bell shaped, and symmetrical about its mean \(\mu\).
    • Mean = median.
    • The total area under the curve = 1.
    • Completely determined by the parameters \(\mu\) and \(\sigma\).
    • Location parameter: Different values of \(\mu\) shift the graph along the horizontal axis.
• Scale parameter: Different values of \(\sigma\) alter the spread (the degree of flatness and peakedness) of the curve.
    Code
    dtf$y1 <- 1 / (sigma * sqrt(2 * pi)) * exp(-1/2 * ((x - (mu - 5)) / sigma) ^ 2)
    dtf$y2 <- 1 / (sigma * sqrt(2 * pi)) * exp(-1/2 * ((x - (mu + 5)) / sigma) ^ 2)
    dtf$y3 <- 1 / (sigma * sqrt(2 * pi)) * exp(-1/2 * ((x - mu) / (sigma + 1)) ^ 2)
    dtf$y4 <- 1 / (sigma * sqrt(2 * pi)) * exp(-1/2 * ((x - mu) / (sigma - 1)) ^ 2)
    ggplot(dtf) + 
      geom_line(aes(x, y)) + 
      geom_line(aes(x, y1), color = 'blue') + 
      geom_line(aes(x, y2), color = 'blue') + 
      geom_line(aes(x, y3), color = 'red', linetype = 2) +
      geom_line(aes(x, y4), color = 'red', linetype = 2) +
      labs(y = 'f(x)')

• \(P(a<X<b) = \int_a^bf(x)\text dx\), so \(P(X=c) = \int_c^cf(x)\text dx = 0\).
    • Empirical Rule: Approximately 68%, 95% and 99.7% of the area is within \(\pm 1\), \(\pm 2\) and \(\pm 3\) standard deviations from the mean.
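The Empirical Rule can be verified quickly with the pnorm() function, which is introduced below (a minimal sketch):

# Area within 1, 2, and 3 standard deviations of the mean
pnorm(1) - pnorm(-1)   # about 0.68
pnorm(2) - pnorm(-2)   # about 0.95
pnorm(3) - pnorm(-3)   # about 0.997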

    Question: In the case of the CO2 data, how do you know whether it follows (approximately) a normal distribution?

Method 1: Visualize it with a histogram + a density curve + a normal distribution curve. Check whether it is bell-shaped, with a single symmetrical peak and the mean close to the median.

    Code
    # plot the two similar figures one by one
    mydata <- co2
    hist(mydata, freq = FALSE, main = '')
    lines(density(mydata), col = "blue")
    curve(dnorm(x, mean = mean(mydata), 
                sd = sd(mydata)*sqrt((length(mydata)-1)/length(mydata))), 
          add = TRUE, col = "red")
    
    mydata <- iris$Sepal.Length
    hist(mydata, freq = FALSE, main = '')
    lines(density(mydata), col = "blue")
    curve(dnorm(x, mean = mean(mydata), 
                sd = sd(mydata)*sqrt((length(mydata)-1)/length(mydata))), 
          add = TRUE, col = "red")
    Code
    # plot the two figures with a self-defined function
    plot_hist_density <- function(mydata){
      hist(mydata, freq = FALSE, main = '', las = 1)
      lines(density(mydata), col = "blue")
      curve(dnorm(x, mean = mean(mydata), 
                  sd = sd(mydata)*sqrt((length(mydata)-1)/length(mydata))), 
            add = TRUE, col = "red")
      box()
    }
    par(mfrow = c(1, 2))
    plot_hist_density(co2)
    plot_hist_density(iris$Sepal.Length)

Method 2: QQ plot. Check whether the data points lie close to the line of a perfect normal distribution, with no obvious pattern departing from the line.

    Code
    par(mfrow = c(1, 2))
    qqnorm(co2)
    qqline(co2)
    qqnorm(iris$Sepal.Length)
    qqline(iris$Sepal.Length)

Question: As the CO2 data approximately follow a normal distribution, how do we express it with an equation?

    mu <-  mean(co2)
    mu
    [1] 395.9551
    n <- length(co2)
    sigma <- sd(co2) * sqrt((n-1)/n)
    sigma
    [1] 2.617185

\(f(x)\) is the density shown in the previous graphs. We need concepts from basic calculus to describe these distributions.

    \[f(x) = \frac{1}{2.6\sqrt{2 \pi}} e^{-\frac{1}{2}(\frac{x-396}{2.6})^{2} }\]

    \[P(x_1 < x < x_2) = \int_{x_1}^{x_2} f(x)\text d x = \int_{x_1}^{x_2}\frac{1}{\sigma\sqrt{2 \pi}} e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^{2} } \text d x\]

With R, simply use the pnorm() function, which returns the cumulative probability \(P(X < x)\):

The pnorm() function.

Question: How many observations appeared between 394.0625 and 396.0625 ppm?

    p394.0625 <- pnorm(394.0625, mean = mu, sd = sigma)
    p396.0625 <- pnorm(396.0625, mean = mu, sd = sigma)
    n * (p396.0625 - p394.0625)
    [1] 1407.577
    n * diff(pnorm(c(394.0625, 396.0625), mean = mu, sd = sigma))
    [1] 1407.577
    sum(co2 > 394.0625 & co2 < 396.0625)
    [1] 1440
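As a cross-check (a sketch using the fitted \(\mu\) and \(\sigma\) from above), the same result can be obtained by numerically integrating the normal PDF with integrate():

# Numerically integrate the fitted normal PDF between the two limits
f <- function(x) dnorm(x, mean = mu, sd = sigma)
n * integrate(f, lower = 394.0625, upper = 396.0625)$value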

    What if I cannot use R or I do not have a computer for these calculations?

    7.5 Standard Normal Distribution (\(z\) distribution)

    Standard normal distribution (unit normal distribution):

    a special form of the normal distribution when \(\mu=0\) and \(\sigma = 1\).

    \[f(x)=\frac{1}{\sigma\sqrt{2 \pi}} e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^{2} }\]

    \[f(z)=\frac{1}{\sqrt{2 \pi }} e^{-\frac{1}{2}z^{2}}\]

    \(z\) table for the probability:

    How to read the \(z\) table:

    • The probability for \(z < -1\) is 0.1587.
    • The probability for \(z < 1.53\) is 0.9370.
    • The probability for \(-1 < z < 1.53\) is \(0.9370 - 0.1587 = 0.7783\).
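These table values can be reproduced in R with pnorm() for the standard normal distribution (a quick check):

pnorm(-1)                 # 0.1587
pnorm(1.53)               # 0.9370
pnorm(1.53) - pnorm(-1)   # 0.7783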

    z distribution

    Features:

    • Bell shaped, and symmetrical about its mean \(\mu = 0\).
    • Mean = median = 0.
    • The total area under the curve = 1.
• Empirical Rule: Approximately 68%, 95% and 99.7% of the area is within \(\pm 1\), \(\pm 2\) and \(\pm 3\) of the mean, respectively.
    \(z\) score:

    If a quantitative variable \(x\) has a normal distribution with the mean of \(\mu\) and standard deviation of \(\sigma\), then we can create a new variable \(z_\mathrm {score}=(x-\mu)/\sigma\), and \(z_\mathrm {score}\) has a standard normal distribution.

    Example:

    z_score <- (co2 - mu) /sigma
    mean(z_score)
    [1] -9.402777e-15
    sd(z_score)
    [1] 1.0001
    Code
    par(mfrow = c(1,2))
    hist(co2, freq = FALSE, breaks = seq(385, 406, 1))
    curve(dnorm(x, mu, sigma), col = "red", add = TRUE)
    box()
    
    hist(z_score, freq = FALSE, breaks = (seq(385, 406, 1) - mu)/sigma)
    curve(dnorm(x, 0, 1), col = "blue", add = TRUE)
    box()

    Convert a normal distribution into a standard normal distribution

    Proportion in the range between two given values of the CO2 concentration (\(c_1\) and \(c_2\)): convert the CO2 concentrations into \(z\) (\(z_1\) and \(z_2\)), and look them up in the \(z\) table.

    Example: Estimate the proportion of CO2 concentration between 390 and 395 ppm with the \(z\) table.

    mu <- mean(co2)
    n <- length(co2)
    sigma <- sd(co2) * sqrt((n-1) /n)
    (390 - mu)/sigma
    [1] -2.275372
    (395 - mu)/sigma
    [1] -0.364922
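To complete the example (a sketch, using pnorm() in place of the printed \(z\) table): take the difference of the two cumulative probabilities, which is roughly 0.35 under the normal model, and compare it with the observed proportion in the data.

# Probability between the two z scores (z-table lookup, or pnorm())
z1 <- (390 - mu) / sigma
z2 <- (395 - mu) / sigma
pnorm(z2) - pnorm(z1)          # roughly 0.35 under the normal model
# Observed proportion in the sample, for comparison
mean(co2 > 390 & co2 < 395)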

    What is the (standard) normal distribution used for?

    7.6 Other distributions

    Distribution p- q- d- r-
    Beta pbeta qbeta dbeta rbeta
    Binomial pbinom qbinom dbinom rbinom
    Cauchy pcauchy qcauchy dcauchy rcauchy
    Chi-Square pchisq qchisq dchisq rchisq
Exponential pexp qexp dexp rexp
    F pf qf df rf
    Gamma pgamma qgamma dgamma rgamma
    Geometric pgeom qgeom dgeom rgeom
Hypergeometric phyper qhyper dhyper rhyper
    Logistic plogis qlogis dlogis rlogis
    Log Normal plnorm qlnorm dlnorm rlnorm
    Negative Binomial pnbinom qnbinom dnbinom rnbinom
    Normal pnorm qnorm dnorm rnorm
    Poisson ppois qpois dpois rpois
    Student t pt qt dt rt
    Studentized Range ptukey qtukey dtukey rtukey
    Uniform punif qunif dunif runif
    Weibull pweibull qweibull dweibull rweibull
    Wilcoxon Rank Sum Statistic pwilcox qwilcox dwilcox rwilcox
    Wilcoxon Signed Rank Statistic psignrank qsignrank dsignrank rsignrank

    Distribution functions in R: prefix + root_name

    • d for “density”, the density function (PDF).
    • p for “probability”, the cumulative distribution function (CDF).
    • q for “quantile”, the inverse of CDF.
• r for “random”, random number generation from the specified distribution.
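For example, the four functions for the normal distribution (a brief sketch):

dnorm(0)       # density of the standard normal at x = 0
pnorm(1.96)    # P(X < 1.96), about 0.975
qnorm(0.975)   # the quantile with 97.5% of the area below it, about 1.96
rnorm(5)       # five random numbers from the standard normal distribution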

7.7 Central Limit Theorem

    7.7.1 Sampling distribution

    Sampling Distribution:

    The distribution of all possible values of a statistic for a given sample size
    The most important definition in this module.
    The boundary line between people who understand statistics and people who don’t.

    Key words:

    1. A given sample size
    2. A statistic
    3. All possible values
    4. A distribution
    Sample:

    a subset of the population

    Statistic:

    measures of the sample, denoted by Roman letters (e.g., \(\bar x\) for the sample mean)

    Demo: 3 out of 10.

    Sample 3 observations from a population of the integers 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. Give the sampling distribution for the mean.

    1. A given sample size: 3.
    2. A statistic: the sample mean \(\bar x\).
    • One possible sample: 1, 2, 3. \(\bar x = (1+2+3)/3=2\).
    • Another possible sample: 1, 2, 4. \(\bar x = (1+2+4)/3=2.33\).
3. All possible values: all the possible samples and the mean for each sample.
    population <- 1:10
    sample_size <- 3
    choose(10, 3)
    [1] 120
    samples <- data.frame(t(combn(population, sample_size)))
    samples$sample_mean <- apply(samples, 1, mean)
    samples$Sample <- rownames(samples)
4. A distribution: a histogram for the sample mean \(\bar x\).
    Code
    par(mfrow = c(1, 2))
    hist(samples$sample_mean, 
         las = 1, freq = FALSE,
         xlab = 'Sample mean', 
         main = '')
    box()
    
    hist(population, 
         las = 1, freq = FALSE,
         xlab = 'Population', 
         main = '')
    box()

    Example: 30 out of 999.

    Sample 30 observations from a population of the integers 1, 2, …, 999. Give the sampling distribution for the mean.

    1. A given sample size: 30
    2. A statistic: the sample mean.
    3. All possible values.
    population <- 1:999
    hist(population)

    sample_size <- 30
    choose(999, 30)
    [1] 2.35672e+57

    It is a HUGE number. Impossible! Then what can we do?

7.7.2 Approximation of the sampling distribution

    Instead, we randomly sample many many many times.

    sample1 <- sample(1:999, 30)
    sample2 <- sample(1:999, 30)
    
    samplen <- replicate(10, sample(1:999, 30))
    sample_mean <- apply(samplen, 2, mean)
    hist(sample_mean, freq = FALSE)
    Code
    par(mfrow = c(3,2))
    for (rep_n in c(10, 50, 100, 500, 1000, 5000)){
      samplen <- replicate(rep_n, sample(1:999, 30))
      sample_mean <- apply(samplen, 2, mean)
      hist(sample_mean, 
           breaks = seq(0, 1000, 25), 
            
           main = rep_n, 
           xlim = c(300, 700), 
           freq = FALSE)
      lines(density(sample_mean), col = 'red')
      box()
    }

• The distributions do not differ much once the sampling is repeated more than 100 times.
• Thus, we use one of them (500 repetitions) as an approximation of the sampling distribution.

    7.7.3 The CLT

The approximate sampling distribution:

    samplen <- replicate(500, sample(1:999, 30))
    sample_mean <- apply(samplen, 2, mean)

\(\mu _{\bar x}\): The mean of the sample means (\(\bar x\)):

    round(mean(sample_mean))
    [1] 496

    The population mean \(\mu\):

    mean(1:999)
    [1] 500

Standard error (\(\sigma_{\bar x}\)): The standard deviation of the sample means (\(\bar x\)):

    round(sd(sample_mean) * sqrt((500 - 1) / 500))
    [1] 52

The population standard deviation \(\sigma\) divided by \(\sqrt n\):

st <- sd(1:999) * sqrt((999 - 1) / 999)
round(st / sqrt(30))
[1] 53

    Central Limit Theorem (CLT):

    1. If samples of size \(n\), where \(n\ge30\), are drawn from any population with a mean \(\mu\) and a standard deviation \(\sigma\), then the sampling distribution of sample means approximates a normal distribution. The greater the sample size, the better the approximation.
    2. If the population itself is normally distributed, then the sampling distribution of sample means is normally distributed for any sample size \(n\).

    If CLT holds,

1. The mean of the sampling distribution of the mean equals the population mean (\(\mu_{\bar x} = \mu\)).
2. The standard deviation of the sampling distribution of the mean, i.e. the standard error of the mean, equals the population standard deviation divided by the square root of the sample size (\(\sigma _{\bar x} = \frac{\sigma}{\sqrt n}\)).

    7.7.4 The power of CLT

The population that supplies the samples does not have to be normally distributed for the CLT to hold.

    Example: Toss 10 coins (a population of a binomial distribution)

    sample(c(0, 1), 10, replace = TRUE)
     [1] 0 1 0 0 0 0 0 1 0 0
    sample_size <- 3
    replicate(sample_size, sample(c(0, 1), 10, replace = TRUE))
          [,1] [,2] [,3]
     [1,]    0    1    0
     [2,]    0    0    0
     [3,]    0    0    0
     [4,]    1    0    1
     [5,]    0    1    1
     [6,]    0    0    1
     [7,]    1    0    1
     [8,]    0    0    0
     [9,]    1    0    0
    [10,]    1    0    1
    Code
    par(mfrow = c(3, 2))
    x <- 0:10
    y <- dbinom(0:10, 10, 0.5)
    rep_n <- 500
    mybar <- barplot(y,
                     ylim = c(0, max(y * 1.1)),
                     space = 0,
                     names.arg = x,
                     xlab = 'x', ylab = 'Density',
                     las = 1)
    axis(1, at = mybar, labels = x)
    legend("topright", bty = "n", 
           legend = paste("mean: 5", 
                          "\nsd = 1.58"))
    
    box()
    
    for (sample_size in c(3, 6, 12, 24, 30)){
      samplen <- replicate(rep_n, replicate(sample_size, sum(sample(c(0, 1), 10, replace = TRUE))))
      sample_mean <- colMeans(samplen)
      # p <- round(shapiro.test(sample_mean)$p.value, 2)
      mean_mean <- mean(sample_mean)
      se <- sd(sample_mean)
      hist(sample_mean, xlim = c(0, 10), freq = FALSE, 
           main = paste("n = ", sample_size)
           # main = paste("n = ", sample_size, ", p = ", p)
           )
      lines(density(sample_mean), col = "red")
      curve(dnorm(x, mean = mean_mean, sd = se), add = TRUE, col = "blue")
      legend("topright", bty = "n",
             legend = paste("mean: ", round(mean_mean, 2), 
                            "\nsd = ", round(se, 2)))
      box()
    }

    Example: CO\(_2\) concentration (a population of a normal distribution).

    Code
    li7500 <- read.csv("data/li7500.csv")
    population <- li7500[, 1]
    rep_n <- 500
    par(mfrow = c(3, 2))
    hist(population, xlim = c(385, 405))
    legend("topright", bty = "n", 
           legend = paste("mean: ", round(mean(population), 1)))
    for (sample_size in c(2, 5, 10, 50, 100)){
      samplen <- replicate(rep_n, sample(population, sample_size))
      sample_mean <- apply(samplen, 2, mean)
      hist(sample_mean, xlim = c(385, 405), 
           main = paste("Sample size = ", sample_size), freq = FALSE)
      legend("topright", bty = "n", 
             legend = paste("mean: ", round(mean(sample_mean), 1)))
      box()
    }

    Example: 30 out of 1:999 (a population of a uniform distribution).

    Code
    population <- 1:999
    rep_n <- 500
    par(mfrow = c(3, 2))
    hist(population, freq = FALSE)
    legend("topright", 
           legend = paste("mean: ", mean(population)), bty = "n")
    for (sample_size in c(3, 10, 30, 100, 300)){
      samplen <- replicate(rep_n, sample(population, sample_size))
      sample_mean <- apply(samplen, 2, mean)
      hist(sample_mean, breaks = seq(0, 1000, 25), 
           main = paste("Sample size = ", sample_size), 
           xlim = c(0, 1000), freq = FALSE)
      legend("topright", bty = "n", 
             legend = paste("mean: ", round(mean(sample_mean), 1)))
      box()
    }

Question: What is the probability that the mean of a random sample is between 400 and 500?
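A sketch of answering this with the CLT (assuming, as in the earlier example, samples of size 30 drawn from the population 1:999): under the CLT, the sample mean is approximately normal with mean \(\mu\) and standard error \(\sigma / \sqrt n\).

# CLT-based answer, assuming a sample size of 30 from the population 1:999
population <- 1:999
n <- 30
mu <- mean(population)
sigma <- sd(population) * sqrt((length(population) - 1) / length(population))
se <- sigma / sqrt(n)
pnorm(500, mean = mu, sd = se) - pnorm(400, mean = mu, sd = se)   # roughly 0.47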

    Example: Two peaks.

    Code
    tp <- read.csv("data/twopeaks.csv")
    population <- tp$data
    rep_n <- 5000
    par(mfrow = c(3, 2))
    hist(population)
    legend("topright", bty = "n", 
           legend = paste("mean: ", round(mean(population),2), 
                          "\nsd = ", round(sd(population), 2)))
    for (sample_size in c(3, 6, 12, 24, 30)){
      samplen <- replicate(rep_n, sample(population, sample_size))
      sample_mean <- apply(samplen, 2, mean)
      # p <- round(shapiro.test(sample_mean)$p.value, 2)
      mean_mean <- mean(sample_mean)
      se <- sd(sample_mean)
      hist(sample_mean, xlim = c(0, 15), freq = FALSE, 
           main = paste("n = ", sample_size)
           # main = paste("n = ", sample_size, ", p = ", p)
           )
      lines(density(sample_mean), col = "red")
      curve(dnorm(x, mean = mean_mean, sd = se), add = TRUE, col = "blue")
      legend("topright", bty = "n",
             legend = paste("mean: ", round(mean_mean, 2), 
                            "\nsd = ", round(se, 2)))
      box()
    }

    Animation for the CLT:

    animation::clt.ani()

    7.8 Estimating Population Parameters

• We do not know, in most cases, the population parameters such as \(\mu\) and \(\sigma\).
    • Usually we only have one sample with some observations.
    • How do we estimate the population parameters from the sample?

    7.8.1 Definitions

    Point estimate:
    A single value estimate for a population parameter.
    • Sample mean \(\bar x\) as the best point estimate of \(\mu\)
    • Sample standard deviation \(s\) as the best point estimate of \(\sigma\).
    Interval estimate:
    An interval, or range of values, used to estimate a population parameter. Expressed as the center of the interval (point estimate) and \(\pm\) margin of error.
    Level of confidence (\(c\)):
    The probability that the interval estimate contains the population parameter, assuming that the estimation process is repeated a large number of times.
    Critical values (\(z_c\)):
Values that separate probable sample statistics from improbable (unusual) sample statistics.
    Margin of error (\(E\), the maximum error of estimate, error tolerance):
    The greatest possible distance between the point estimate and the value of the parameter it is estimating.

    \[E = z_c \frac{\sigma}{\sqrt n}\],

    when CLT holds.

    7.8.2 For a population mean

    Confidence interval:

    \[\bar x - E < \mu < \bar x + E\]

    or \(CI = \bar x \pm E\).

    • Let \(z_c = 1\), then \(c = 0.68\). We have 68% confidence that the population mean \(\mu\) is between \(\bar x - \frac{\sigma}{\sqrt n}\) and \(\bar x + \frac{\sigma}{\sqrt n}\).

    Meaning: If you repeat the sampling and estimation procedure many many times, the confidence intervals you calculate would include \(\mu\) 68% of the time.

    • Let \(z_c = 2\), then \(c = 0.95\). We have 95% confidence that the population mean \(\mu\) is between \(\bar x - 2\frac{\sigma}{\sqrt n}\) and \(\bar x + 2\frac{\sigma}{\sqrt n}\).

    Meaning: If you repeat the sampling and estimation procedure many many times, the confidence intervals you calculate would include \(\mu\) 95% of the time.

    Example: You wish to estimate the mean age of all the current students in XJTLU. The mean age of a random sample of 20 students is 22.9 years. From past studies, the standard deviation is 1.5 years, and the population is normally distributed. Construct a 95% confidence interval of the population mean age.

    zc <- 1.96
    sigma <- 1.5
    n <- 20
    xbar <- 22.9
    E <- round(zc * sigma / sqrt(n), 1)
    c(xbar - E, xbar + E) 
    [1] 22.2 23.6

    Example: 30 out of 1:999.

    population <- 1:999
    mu <- mean(population)
    sample_size <- 30
    rep_n <- 100
    
    samplen <- replicate(rep_n, sample(population, sample_size))
    df <- data.frame(
      sample_mean = apply(samplen, 2, mean),
      standard_error = apply(samplen, 2, 
                             function(x) sd(x)/sqrt(sample_size))
    )
    df$lower <- df$sample_mean - 2 * df$standard_error
    df$upper <- df$sample_mean + 2 * df$standard_error
    df$include <- mu > df$lower & mu < df$upper
    sum(df$include) / rep_n
    [1] 0.97
    Repeats <- c(20, 50, 100, 200, 500, 1000, 5000, 10000, 50000, 100000)
    Includes <- NULL
    for (rep_n in Repeats){
      samplen <- replicate(rep_n, sample(population, sample_size))
      df <- data.frame(
        sample_mean = apply(samplen, 2, mean),
        standard_error = apply(samplen, 2, 
                               function(x) sd(x)/sqrt(sample_size))
      )
      df$lower <- df$sample_mean - 2 * df$standard_error
      df$upper <- df$sample_mean + 2 * df$standard_error
      df$include <- mu > df$lower & mu < df$upper
      Includes <- c(Includes, sum(df$include) / rep_n)
    }
    plot(Repeats, Includes, log = "x", type = "b")
    abline(h = 0.95, col = "blue")

    Example: CO\(_2\) concentration.

    population <- li7500[, 1]
    
    mu <- mean(population)
    sample_size <- 30
    rep_n <- 100
    
    samplen <- replicate(rep_n, sample(population, sample_size))
    df <- data.frame(
      sample_mean = apply(samplen, 2, mean),
      standard_error = apply(samplen, 2, 
                             function(x) sd(x)/sqrt(sample_size))
    )
    df$lower <- df$sample_mean - 2 * df$standard_error
    df$upper <- df$sample_mean + 2 * df$standard_error
    df$include <- mu > df$lower & mu < df$upper
    sum(df$include) / rep_n
    [1] 0.92
    Repeats <- c(20, 50, 100, 200, 500, 1000, 5000, 10000, 50000, 100000)
    Includes <- NULL
    for (rep_n in Repeats){
      samplen <- replicate(rep_n, sample(population, sample_size))
      df <- data.frame(
        sample_mean = apply(samplen, 2, mean),
        standard_error = apply(samplen, 2, 
                               function(x) sd(x)/sqrt(sample_size))
      )
      df$lower <- df$sample_mean - 2 * df$standard_error
      df$upper <- df$sample_mean + 2 * df$standard_error
      df$include <- mu > df$lower & mu < df$upper
      Includes <- c(Includes, sum(df$include) / rep_n)
    }
    plot(Repeats, Includes, log = "x", type = "b")
    abline(h = 0.95, col = "blue")

    Animation for confidence intervals:

    animation::conf.int()

    7.8.3 For a population proportion

    Point estimate:

    \[\hat{p}=\frac{x}{n}\]

    A \(c\)-confidence interval for a population proportion \(p\) is

    \[\hat{p}-E<p<\hat{p}+E\]

    When \(np\ge5\) and \(n(1-p)\ge5\), the margin of error for \(p\) is

    \[E=z_{c} \sqrt{\frac{\hat{p} (1-\hat{p})}{n}}.\]

    Example: In a survey of 1000 (\(n\)) U.S. teens, 372 (\(x\)) said that they own smartphones. The point estimate for the population proportion of U.S. teens who own smartphones is 37.2%. Give the confidence interval.

    p <- 0.372
    n <- 1000
    n * p >= 5
    [1] TRUE
    n * (1 - p) >= 5
    [1] TRUE
    zc <- 1.96
E <- zc * sqrt(p * (1 - p) / n)
    round(c(p - E, p + E), 3)
    [1] 0.342 0.402

    With 95% confidence, we can say that the population proportion of U.S. teens who own smartphones is between 34.2% and 40.2%.
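For comparison (a side note), base R also provides prop.test() and binom.test(), which construct confidence intervals for a proportion by different methods (the Wilson score interval and the exact Clopper-Pearson interval, respectively), so the limits are similar but not identical to the ones above:

prop.test(x = 372, n = 1000)$conf.int    # Wilson score interval (with continuity correction)
binom.test(x = 372, n = 1000)$conf.int   # exact (Clopper-Pearson) interval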

    7.9 Summary

    • How to estimate the population proportion (point + interval)?
      • Binomial distribution.
    • How to estimate the population mean (point + interval) if the population standard deviation \(\sigma\) is known?
      • \(z\) distribution.
    • How to estimate the population mean (interval) if the population standard deviation \(\sigma\) is unknown?
      • \(t\) distribution.
    • How to estimate the population variance?
      • \(\chi^2\) distribution.

    7.10 Readings

    • Elementary Statistics, Chapter 5