ENV221 L07
1 Overview of the module and R
2 R Basic Operations
3 R Programming
4 Statistical Graphs
5 Basic Concepts
6 Descriptive statistics
7 Probability Distribution
7.1 Learning objectives
In this lecture, you will
- Understand the meaning of probability density functions (PDF),
- Know the features of binomial distribution and normal distribution, especially the Empirical Rule,
- Understand the sampling distribution and the Central Limit Theorem (CLT), and
- Use standard normal distribution and CLT to solve environmental problems.
7.2 Probability Distribution
- Random variable:
- a numerical value associated with each outcome of a probability experiment.
Examples:
- How many heads you get when you toss 10 coins.
- How many pollution days in Suzhou in a month.
- The O3 concentration on our campus.
- The pH value of the Taihu Lake.
- Probability Distribution:
-
A table (probability table), graph (e.g. bar charts for discrete data and histograms for continuous data), or function (Probability Density Function, PDF) that links each outcome of a statistical experiment with the probability of occurrence.
Example: How many heads you get when you toss 10 coins for 100 times.
A table:
A graph:
A function?
Example: Air quality in Wujiang in January and February, 20221.
1 https://www.suzhou.gov.cn/szsrmzf/hjbhjdjcqk/202204/17fc853fdee047d690fe0f7e0bb6a91f.shtml
A table:
Air quality | Excellent | Good | Lightly Polluted | Moderately Polluted |
---|---|---|---|---|
Days | 18 | 29 | 11 | 1 |
\(P(x)\) | 0.30 | 0.49 | 0.19 | 0.02 |
A graph: bar chart
Example: LI7500 data2 (Download the data file li7500.csv). Use a frequency distribution table (Section 6.3.3) to describe this dataset.
2 LI7500 is a fast response gas analyzer, which can measure the CO2 concentration at very high frequency. The dataset li7500.csv gives 10000 records of CO2 concentrations measured in 200 seconds on June 6, 2016, at a field site in Italy.
<- read.csv("data/li7500.csv")[, 1] co2
A table:
Code
# range(co2)
<- cut(co2, 385:406)
co2_group <- data.frame(table(co2_group))
df_co2 $Probability <- round(df_co2$Freq / sum(df_co2$Freq), 3)
df_co2$Cumulative_Probability <- round(cumsum(df_co2$Freq) / sum(df_co2$Freq), 3) df_co2
Question: How many observations appeared between 394.5 and 396.5 ppm?
Another table:
Code
<- cut(co2, seq(385, 406, 0.5))
co2_group <- data.frame(table(co2_group))
df_co2 $Probability <- round(df_co2$Freq / sum(df_co2$Freq), 3)
df_co2$Cumulative_Probability <- round(cumsum(df_co2$Freq) / sum(df_co2$Freq), 3) df_co2
Questions: Then how many observations appeared between 394.25 and 396.25 ppm? Or 394.0625 and 396.0625?
A graph might be better.
Code
par(mfrow= c(3, 2), las = 1, mar = c(2, 4, 0.3, 0.1))
for (bins in c(8, 16, 32, 64, 128)){
<- hist(co2, breaks = bins, main = "",
myhist border = NA,
col = "steelblue",
lwd = 0.1, xlim = c(385, 406), ylim = c(0, 0.17), freq = FALSE)
lines(myhist$mids, myhist$density, col = 'red')
legend("topright", legend = paste("breaks = ", bins), bty = "n")
box()
}
<- seq(385, 406, by = 0.01)
u plot(u, dnorm(u), type = "n", xlab = "", ylab = "", col = "steelblue",
xlim = c(385, 406), ylim = c(0, 0.17))
lines(density(co2), col = 'red')
polygon(u,
dnorm(x= u, mean = 396, sd = 2.6),
col = "steelblue",
border = 'black')
legend("topright", legend = "breaks = Inf", bty = "n")
box()
The shaded area: the proportion of the observations between any two given numbers.
Or a function to describe this curve: Probability Density Function (PDF)?
7.3 Binomial distribution
- Binomial experiment:
-
- A fixed number of independent trials (\(n\)).
- Only two possible outcomes of interest for each trial: success or failure.
- The probability of a success (\(p\)) is the same for each trial.
- The random variable \(x\) counts the number of successful trials. \(x = 0, 1, 2, 3, ..., n\).
Examples:
- Toss a coin for 2 times, or toss 2 coins. \(x\) = the frequency of head. \(n = 10\), \(p = 0.5\)
- Roll die for 10 times. \(x\) = the frequency of getting 1. \(n = 10\), \(p = \frac{1}{6}\).
- Randomly choose a week from the air quality data of Wujiang, Jan. - Feb. 2022. \(x\) = the frequency of days with excellent air quality. \(n = 7\), \(p = 0.30\).
Examples: Let \(x = 2\). Calculate the probability for Examples 1 - 3.
Code
<- 2
x
# 1
<- 2
n <- 0.5
p choose(n, x) * p^x * (1-p)^(n-x)
dbinom(x, size = n, prob = p)
# 2
<- 10
n <- 1/6
p choose(n, x) * p^x * (1-p)^(n-x)
dbinom(x, size = n, prob = p)
# 3
<- 7
n <- 0.3
p choose(n, x) * p^x * (1-p)^(n-x)
dbinom(x, size = n, prob = p)
A graph showing Example 1 (tossing two coins):
Code
library(ggplot2)
<- 2
size <- 0.5
prob <- 0:size
x <- data.frame(x, d = dbinom(x, size = size, prob = prob))
dtf ggplot(dtf, aes(x, d)) +
geom_bar(stat = 'identity', width = 1, color = 'white') +
geom_line(color = 'red', size = 2) +
labs(y = 'Density')
PDF:
\[P(x)={ }_n C_x p^x (1-p)^{n-x}=\frac{n !}{(n-x) ! x !} p^x (1-p)^{n-x}\]
Features:
- \(\mu = np\)
- \(\sigma ^2 = np(1-p)\)
- \(\sigma = \sqrt {np(1-p)}\)
Examples: Calculate \(\mu\), \(\sigma^2\), \(\sigma\) for Examples 1 - 3.
Code
# 1
<- 2
n <- 0.5
p <- n * p
x_mean <- n * p * (1 - p)
x_var <- sqrt(x_var)
x_sd
# 2
<- 10
n <- 1/6
p <- n * p
x_mean <- n * p * (1 - p)
x_var <- sqrt(x_var)
x_sd
# 3
<- 7
n <- 0.3
p <- n * p
x_mean <- n * p * (1 - p)
x_var <- sqrt(x_var) x_sd
- If \(np \ge 5\) and \(n(1-p) \ge 5\), \(x\) is approximately normally distributed.
7.4 Normal Distribution (Gaussian Distribution)
PDF:
\[f(x)=\frac{1}{\sigma\sqrt{2 \pi}} e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^{2} }\] Suppose \(\mu = 60, \sigma = 3, x \in [40, 80]\),
Code
library(ggplot2)
<- 60
mu <- 3
sigma <- 40:80
x <- 1 / (sigma * sqrt(2 * pi)) * exp(-1/2 * ((x - mu) / sigma) ^ 2)
y <- data.frame(x, y)
dtf ggplot() +
geom_line(aes(x, y), dtf) +
geom_vline(aes(xintercept = mu + sigma * ((-3):3),
color = c('3sd', '2sd', '1sd', 'mean', '1sd', '2sd', '3sd'))) +
labs(color = '', y = 'f(x)')
Features:
- Bell shaped, and symmetrical about its mean \(\mu\).
- Mean = median.
- The total area under the curve = 1.
- Completely determined by the parameters \(\mu\) and \(\sigma\).
- Location parameter: Different values of \(\mu\) shift the graph along the horizontal axis.
- Shape parameter: Different values of \(\sigma\) alter the degree of flatness and peakedness of the curve.
Code
$y1 <- 1 / (sigma * sqrt(2 * pi)) * exp(-1/2 * ((x - (mu - 5)) / sigma) ^ 2)
dtf$y2 <- 1 / (sigma * sqrt(2 * pi)) * exp(-1/2 * ((x - (mu + 5)) / sigma) ^ 2)
dtf$y3 <- 1 / (sigma * sqrt(2 * pi)) * exp(-1/2 * ((x - mu) / (sigma + 1)) ^ 2)
dtf$y4 <- 1 / (sigma * sqrt(2 * pi)) * exp(-1/2 * ((x - mu) / (sigma - 1)) ^ 2)
dtfggplot(dtf) +
geom_line(aes(x, y)) +
geom_line(aes(x, y1), color = 'blue') +
geom_line(aes(x, y2), color = 'blue') +
geom_line(aes(x, y3), color = 'red', linetype = 2) +
geom_line(aes(x, y4), color = 'red', linetype = 2) +
labs(y = 'f(x)')
- \(P(a<X<b) = \int_a^bf(x)\text dx\) –> \(P(X=c) = \int_c^cf(x)\text dx = 0\).
- Empirical Rule: Approximately 68%, 95% and 99.7% of the area is within \(\pm 1\), \(\pm 2\) and \(\pm 3\) standard deviations from the mean.
Question: In the case of the CO2 data, how do you know whether it follows (approximately) a normal distribution?
Method 1: Visualized it with a histogram + density curve + normal distribution curve. Find if it is bell shaped with one single symmetrical peak and the mean close to the median.
Code
# plot the two similar figures one by one
<- co2
mydata hist(mydata, freq = FALSE, main = '')
lines(density(mydata), col = "blue")
curve(dnorm(x, mean = mean(mydata),
sd = sd(mydata)*sqrt((length(mydata)-1)/length(mydata))),
add = TRUE, col = "red")
<- iris$Sepal.Length
mydata hist(mydata, freq = FALSE, main = '')
lines(density(mydata), col = "blue")
curve(dnorm(x, mean = mean(mydata),
sd = sd(mydata)*sqrt((length(mydata)-1)/length(mydata))),
add = TRUE, col = "red")
Code
# plot the two figures with a self-defined function
<- function(mydata){
plot_hist_density hist(mydata, freq = FALSE, main = '', las = 1)
lines(density(mydata), col = "blue")
curve(dnorm(x, mean = mean(mydata),
sd = sd(mydata)*sqrt((length(mydata)-1)/length(mydata))),
add = TRUE, col = "red")
box()
}par(mfrow = c(1, 2))
plot_hist_density(co2)
plot_hist_density(iris$Sepal.Length)
Method 2: QQ plot. Find if the data points are close to the line of the perfect normal distribution with no obvious pattern coming away from the line.
Code
par(mfrow = c(1, 2))
qqnorm(co2)
qqline(co2)
qqnorm(iris$Sepal.Length)
qqline(iris$Sepal.Length)
Question: As the CO2 data follow the normal distribution, how do we express it with an equation?
<- mean(co2)
mu mu
[1] 395.9551
<- length(co2)
n <- sd(co2) * sqrt((n-1)/n)
sigma sigma
[1] 2.617185
f(x) is the density shown in previous graphs. we need to depend on concepts introduced in basic calculus to describe these distributions.
\[f(x) = \frac{1}{2.6\sqrt{2 \pi}} e^{-\frac{1}{2}(\frac{x-396}{2.6})^{2} }\]
\[P(x_1 < x < x_2) = \int_{x_1}^{x_2} f(x)\text d x = \int_{x_1}^{x_2}\frac{1}{\sigma\sqrt{2 \pi}} e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^{2} } \text d x\]
With R, simply use the pnorm()
function:
pnorm()
Question: how many observations appeared between 394.0625 and 396.0625?
.0625 <- pnorm(394.0625, mean = mu, sd = sigma)
p394.0625 <- pnorm(396.0625, mean = mu, sd = sigma)
p396* (p396.0625 - p394.0625) n
[1] 1407.577
* diff(pnorm(c(394.0625, 396.0625), mean = mu, sd = sigma)) n
[1] 1407.577
sum(co2 > 394.0625 & co2 < 396.0625)
[1] 1440
What if I cannot use R or I do not have a computer for these calculations?
7.5 Standard Normal Distribution (\(z\) distribution)
- Standard normal distribution (unit normal distribution):
-
a special form of the normal distribution when \(\mu=0\) and \(\sigma = 1\).
\[f(x)=\frac{1}{\sigma\sqrt{2 \pi}} e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^{2} }\]
\[f(z)=\frac{1}{\sqrt{2 \pi }} e^{-\frac{1}{2}z^{2}}\]
\(z\) table for the probability:
How to read the \(z\) table:
- The probability for \(z < -1\) is 0.1587.
- The probability for \(z < 1.53\) is 0.9370.
- The probability for \(-1 < z < 1.53\) is \(0.9370 - 0.1587 = 0.7783\).
Features:
- Bell shaped, and symmetrical about its mean \(\mu = 0\).
- Mean = median = 0.
- The total area under the curve = 1.
- Empirical Rule: Approximately 68%, 95% and 99.7% of the area is within \(\pm 1\), \(\pm 2\) and \(\pm 3\), respectively.
- \(z\) score:
-
If a quantitative variable \(x\) has a normal distribution with the mean of \(\mu\) and standard deviation of \(\sigma\), then we can create a new variable \(z_\mathrm {score}=(x-\mu)/\sigma\), and \(z_\mathrm {score}\) has a standard normal distribution.
Example:
<- (co2 - mu) /sigma
z_score mean(z_score)
[1] -9.402777e-15
sd(z_score)
[1] 1.0001
Code
par(mfrow = c(1,2))
hist(co2, freq = FALSE, breaks = seq(385, 406, 1))
curve(dnorm(x, mu, sigma), col = "red", add = TRUE)
box()
hist(z_score, freq = FALSE, breaks = (seq(385, 406, 1) - mu)/sigma)
curve(dnorm(x, 0, 1), col = "blue", add = TRUE)
box()
Proportion in the range between two given values of the CO2 concentration (\(c_1\) and \(c_2\)): convert the CO2 concentrations into \(z\) (\(z_1\) and \(z_2\)), and look them up in the \(z\) table.
Example: Estimate the proportion of CO2 concentration between 390 and 395 ppm with the \(z\) table.
<- mean(co2)
mu <- length(co2)
n <- sd(co2) * sqrt((n-1) /n)
sigma 390 - mu)/sigma (
[1] -2.275372
395 - mu)/sigma (
[1] -0.364922
What is the (standard) normal distribution used for?
7.6 Other distributions
Distribution | p- |
q- |
d- |
r- |
---|---|---|---|---|
Beta | pbeta |
qbeta |
dbeta |
rbeta |
Binomial | pbinom |
qbinom |
dbinom |
rbinom |
Cauchy | pcauchy |
qcauchy |
dcauchy |
rcauchy |
Chi-Square | pchisq |
qchisq |
dchisq |
rchisq |
Exponential) | pexp |
qexp |
dexp |
rexp |
F | pf |
qf |
df |
rf |
Gamma | pgamma |
qgamma |
dgamma |
rgamma |
Geometric | pgeom |
qgeom |
dgeom |
rgeom |
Hypergeometrictm | phyper |
qhyper |
dhyper |
rhyper |
Logistic | plogis |
qlogis |
dlogis |
rlogis |
Log Normal | plnorm |
qlnorm |
dlnorm |
rlnorm |
Negative Binomial | pnbinom |
qnbinom |
dnbinom |
rnbinom |
Normal | pnorm |
qnorm |
dnorm |
rnorm |
Poisson | ppois |
qpois |
dpois |
rpois |
Student t | pt |
qt |
dt |
rt |
Studentized Range | ptukey |
qtukey |
dtukey |
rtukey |
Uniform | punif |
qunif |
dunif |
runif |
Weibull | pweibull |
qweibull |
dweibull |
rweibull |
Wilcoxon Rank Sum Statistic | pwilcox |
qwilcox |
dwilcox |
rwilcox |
Wilcoxon Signed Rank Statistic | psignrank |
qsignrank |
dsignrank |
rsignrank |
Distribution functions in R: prefix + root_name
d
for “density”, the density function (PDF).p
for “probability”, the cumulative distribution function (CDF).q
for “quantile”, the inverse of CDF.r
for “random”, a random variable having the specified distribution
7.7 Central limit Theorem
7.7.1 Sampling distribution
- Sampling Distribution:
-
The distribution of all possible values of a statistic for a given sample size
The most important definition in this module.
The boundary line between people who understand statistics and people who don’t.
Key words:
- A given sample size
- A statistic
- All possible values
- A distribution
- Sample:
-
a subset of the population
- Statistic:
-
measures of the sample, denoted by Roman letters (e.g., \(\bar x\) for the sample mean)
Demo: 3 out of 10.
Sample 3 observations from a population of the integers 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. Give the sampling distribution for the mean.
- A given sample size: 3.
- A statistic: the sample mean \(\bar x\).
- One possible sample: 1, 2, 3. \(\bar x = (1+2+3)/3=2\).
- Another possible sample: 1, 2, 4. \(\bar x = (1+2+4)/3=2.33\).
- All possible values: all the possible samples and the mean for each sample.
<- 1:10
population <- 3
sample_size choose(10, 3)
[1] 120
<- data.frame(t(combn(population, sample_size)))
samples $sample_mean <- apply(samples, 1, mean)
samples$Sample <- rownames(samples) samples
- A distribution: a histogram for the sample mean \(\bar x\).
Code
par(mfrow = c(1, 2))
hist(samples$sample_mean,
las = 1, freq = FALSE,
xlab = 'Sample mean',
main = '')
box()
hist(population,
las = 1, freq = FALSE,
xlab = 'Population',
main = '')
box()
Example: 30 out of 999.
Sample 30 observations from a population of the integers 1, 2, …, 999. Give the sampling distribution for the mean.
- A given sample size: 30
- A statistic: the sample mean.
- All possible values.
<- 1:999
population hist(population)
<- 30
sample_size choose(999, 30)
[1] 2.35672e+57
It is a HUGE number. Impossible! Then what can we do?
7.7.2 Approximation of the sample distribution
Instead, we randomly sample many many many times.
<- sample(1:999, 30)
sample1 <- sample(1:999, 30)
sample2
<- replicate(10, sample(1:999, 30))
samplen <- apply(samplen, 2, mean)
sample_mean hist(sample_mean, freq = FALSE)
Code
par(mfrow = c(3,2))
for (rep_n in c(10, 50, 100, 500, 1000, 5000)){
<- replicate(rep_n, sample(1:999, 30))
samplen <- apply(samplen, 2, mean)
sample_mean hist(sample_mean,
breaks = seq(0, 1000, 25),
main = rep_n,
xlim = c(300, 700),
freq = FALSE)
lines(density(sample_mean), col = 'red')
box()
}
- The distributions do not differ much when we repeat it for more than 100 times.
- Thus, we use one of them (500) as an approximation of the sample distribution.
7.7.3 The CLT
The approximate sample distribution:
<- replicate(500, sample(1:999, 30))
samplen <- apply(samplen, 2, mean) sample_mean
\(\mu _{\bar x}\): The mean of the sample mean (\(\bar x\)):
round(mean(sample_mean))
[1] 496
The population mean \(\mu\):
mean(1:999)
[1] 500
Standard error (\(\sigma_{\bar x}\)): The standard deviation of the sample mean (\(\bar x\)):
round(sd(sample_mean) * sqrt((500 - 1) / 500))
[1] 52
The population standard deviation \(\sigma\):
<- sd(1:999) * sqrt((99 - 1) / 99)
st round(st / sqrt(30))
[1] 52
Central Limit Theorem (CLT):
- If samples of size \(n\), where \(n\ge30\), are drawn from any population with a mean \(\mu\) and a standard deviation \(\sigma\), then the sampling distribution of sample means approximates a normal distribution. The greater the sample size, the better the approximation.
- If the population itself is normally distributed, then the sampling distribution of sample means is normally distributed for any sample size \(n\).
If CLT holds,
- The mean of the sampling distribution of the mean equals to the population mean (\(\mu_{\bar x} = \mu\)).
- The standard deviation of the sampling distribution of the mean, i.e. the standard error of the mean, equals to the population standard deviation divided by the square root of the sample size (\(\sigma _{\bar x} = \frac{\sigma}{\sqrt n}\))
7.7.4 The power of CLT
The population that supplies the samples doesn’t have to be a normal distribution for the CLT to hold.
Example: Toss 10 coins (a population of a binomial distribution)
sample(c(0, 1), 10, replace = TRUE)
[1] 0 1 0 0 0 0 0 1 0 0
<- 3
sample_size replicate(sample_size, sample(c(0, 1), 10, replace = TRUE))
[,1] [,2] [,3]
[1,] 0 1 0
[2,] 0 0 0
[3,] 0 0 0
[4,] 1 0 1
[5,] 0 1 1
[6,] 0 0 1
[7,] 1 0 1
[8,] 0 0 0
[9,] 1 0 0
[10,] 1 0 1
Code
par(mfrow = c(3, 2))
<- 0:10
x <- dbinom(0:10, 10, 0.5)
y <- 500
rep_n <- barplot(y,
mybar ylim = c(0, max(y * 1.1)),
space = 0,
names.arg = x,
xlab = 'x', ylab = 'Density',
las = 1)
axis(1, at = mybar, labels = x)
legend("topright", bty = "n",
legend = paste("mean: 5",
"\nsd = 1.58"))
box()
for (sample_size in c(3, 6, 12, 24, 30)){
<- replicate(rep_n, replicate(sample_size, sum(sample(c(0, 1), 10, replace = TRUE))))
samplen <- colMeans(samplen)
sample_mean # p <- round(shapiro.test(sample_mean)$p.value, 2)
<- mean(sample_mean)
mean_mean <- sd(sample_mean)
se hist(sample_mean, xlim = c(0, 10), freq = FALSE,
main = paste("n = ", sample_size)
# main = paste("n = ", sample_size, ", p = ", p)
)lines(density(sample_mean), col = "red")
curve(dnorm(x, mean = mean_mean, sd = se), add = TRUE, col = "blue")
legend("topright", bty = "n",
legend = paste("mean: ", round(mean_mean, 2),
"\nsd = ", round(se, 2)))
box()
}
Example: CO\(_2\) concentration (a population of a normal distribution).
Code
<- read.csv("data/li7500.csv")
li7500 <- li7500[, 1]
population <- 500
rep_n par(mfrow = c(3, 2))
hist(population, xlim = c(385, 405))
legend("topright", bty = "n",
legend = paste("mean: ", round(mean(population), 1)))
for (sample_size in c(2, 5, 10, 50, 100)){
<- replicate(rep_n, sample(population, sample_size))
samplen <- apply(samplen, 2, mean)
sample_mean hist(sample_mean, xlim = c(385, 405),
main = paste("Sample size = ", sample_size), freq = FALSE)
legend("topright", bty = "n",
legend = paste("mean: ", round(mean(sample_mean), 1)))
box()
}
Example: 30 out of 1:999 (a population of a uniform distribution).
Code
<- 1:999
population <- 500
rep_n par(mfrow = c(3, 2))
hist(population, freq = FALSE)
legend("topright",
legend = paste("mean: ", mean(population)), bty = "n")
for (sample_size in c(3, 10, 30, 100, 300)){
<- replicate(rep_n, sample(population, sample_size))
samplen <- apply(samplen, 2, mean)
sample_mean hist(sample_mean, breaks = seq(0, 1000, 25),
main = paste("Sample size = ", sample_size),
xlim = c(0, 1000), freq = FALSE)
legend("topright", bty = "n",
legend = paste("mean: ", round(mean(sample_mean), 1)))
box()
}
Question: what is the probability that the mean for a random sample was between 400 and 500?
Example: Two peaks.
Code
<- read.csv("data/twopeaks.csv")
tp <- tp$data
population <- 5000
rep_n par(mfrow = c(3, 2))
hist(population)
legend("topright", bty = "n",
legend = paste("mean: ", round(mean(population),2),
"\nsd = ", round(sd(population), 2)))
for (sample_size in c(3, 6, 12, 24, 30)){
<- replicate(rep_n, sample(population, sample_size))
samplen <- apply(samplen, 2, mean)
sample_mean # p <- round(shapiro.test(sample_mean)$p.value, 2)
<- mean(sample_mean)
mean_mean <- sd(sample_mean)
se hist(sample_mean, xlim = c(0, 15), freq = FALSE,
main = paste("n = ", sample_size)
# main = paste("n = ", sample_size, ", p = ", p)
)lines(density(sample_mean), col = "red")
curve(dnorm(x, mean = mean_mean, sd = se), add = TRUE, col = "blue")
legend("topright", bty = "n",
legend = paste("mean: ", round(mean_mean, 2),
"\nsd = ", round(se, 2)))
box()
}
Animation for the CLT:
::clt.ani() animation
7.8 Estimating Population Parameters
- We do not know, in most times, the population parameters such as \(\mu\) and \(\sigma\).
- Usually we only have one sample with some observations.
- How do we estimate the population parameters from the sample?
7.8.1 Definitions
- Point estimate:
- A single value estimate for a population parameter.
- Sample mean \(\bar x\) as the best point estimate of \(\mu\)
- Sample standard deviation \(s\) as the best point estimate of \(\sigma\).
- Interval estimate:
- An interval, or range of values, used to estimate a population parameter. Expressed as the center of the interval (point estimate) and \(\pm\) margin of error.
- Level of confidence (\(c\)):
- The probability that the interval estimate contains the population parameter, assuming that the estimation process is repeated a large number of times.
- Critical values (\(z_c\)):
- Values that separate probable sample statistics that are probable from improbable/unusual sample statistics.
- Margin of error (\(E\), the maximum error of estimate, error tolerance):
- The greatest possible distance between the point estimate and the value of the parameter it is estimating.
\[E = z_c \frac{\sigma}{\sqrt n}\],
when CLT holds.
7.8.2 For a population mean
Confidence interval:
\[\bar x - E < \mu < \bar x + E\]
or \(CI = \bar x \pm E\).
- Let \(z_c = 1\), then \(c = 0.68\). We have 68% confidence that the population mean \(\mu\) is between \(\bar x - \frac{\sigma}{\sqrt n}\) and \(\bar x + \frac{\sigma}{\sqrt n}\).
Meaning: If you repeat the sampling and estimation procedure many many times, the confidence intervals you calculate would include \(\mu\) 68% of the time.
- Let \(z_c = 2\), then \(c = 0.95\). We have 95% confidence that the population mean \(\mu\) is between \(\bar x - 2\frac{\sigma}{\sqrt n}\) and \(\bar x + 2\frac{\sigma}{\sqrt n}\).
Meaning: If you repeat the sampling and estimation procedure many many times, the confidence intervals you calculate would include \(\mu\) 95% of the time.
Example: You wish to estimate the mean age of all the current students in XJTLU. The mean age of a random sample of 20 students is 22.9 years. From past studies, the standard deviation is 1.5 years, and the population is normally distributed. Construct a 95% confidence interval of the population mean age.
<- 1.96
zc <- 1.5
sigma <- 20
n <- 22.9
xbar <- round(zc * sigma / sqrt(n), 1)
E c(xbar - E, xbar + E)
[1] 22.2 23.6
Example: 30 out of 1:999.
<- 1:999
population <- mean(population)
mu <- 30
sample_size <- 100
rep_n
<- replicate(rep_n, sample(population, sample_size))
samplen <- data.frame(
df sample_mean = apply(samplen, 2, mean),
standard_error = apply(samplen, 2,
function(x) sd(x)/sqrt(sample_size))
)$lower <- df$sample_mean - 2 * df$standard_error
df$upper <- df$sample_mean + 2 * df$standard_error
df$include <- mu > df$lower & mu < df$upper
dfsum(df$include) / rep_n
[1] 0.97
<- c(20, 50, 100, 200, 500, 1000, 5000, 10000, 50000, 100000)
Repeats <- NULL
Includes for (rep_n in Repeats){
<- replicate(rep_n, sample(population, sample_size))
samplen <- data.frame(
df sample_mean = apply(samplen, 2, mean),
standard_error = apply(samplen, 2,
function(x) sd(x)/sqrt(sample_size))
)$lower <- df$sample_mean - 2 * df$standard_error
df$upper <- df$sample_mean + 2 * df$standard_error
df$include <- mu > df$lower & mu < df$upper
df<- c(Includes, sum(df$include) / rep_n)
Includes
}plot(Repeats, Includes, log = "x", type = "b")
abline(h = 0.95, col = "blue")
Example: CO\(_2\) concentration.
<- li7500[, 1]
population
<- mean(population)
mu <- 30
sample_size <- 100
rep_n
<- replicate(rep_n, sample(population, sample_size))
samplen <- data.frame(
df sample_mean = apply(samplen, 2, mean),
standard_error = apply(samplen, 2,
function(x) sd(x)/sqrt(sample_size))
)$lower <- df$sample_mean - 2 * df$standard_error
df$upper <- df$sample_mean + 2 * df$standard_error
df$include <- mu > df$lower & mu < df$upper
dfsum(df$include) / rep_n
[1] 0.92
<- c(20, 50, 100, 200, 500, 1000, 5000, 10000, 50000, 100000)
Repeats <- NULL
Includes for (rep_n in Repeats){
<- replicate(rep_n, sample(population, sample_size))
samplen <- data.frame(
df sample_mean = apply(samplen, 2, mean),
standard_error = apply(samplen, 2,
function(x) sd(x)/sqrt(sample_size))
)$lower <- df$sample_mean - 2 * df$standard_error
df$upper <- df$sample_mean + 2 * df$standard_error
df$include <- mu > df$lower & mu < df$upper
df<- c(Includes, sum(df$include) / rep_n)
Includes
}plot(Repeats, Includes, log = "x", type = "b")
abline(h = 0.95, col = "blue")
Animation for confidence intervals:
::conf.int() animation
7.8.3 For a population proportion
Point estimate:
\[\hat{p}=\frac{x}{n}\]
A \(c\)-confidence interval for a population proportion \(p\) is
\[\hat{p}-E<p<\hat{p}+E\]
When \(np\ge5\) and \(n(1-p)\ge5\), the margin of error for \(p\) is
\[E=z_{c} \sqrt{\frac{\hat{p} (1-\hat{p})}{n}}.\]
Example: In a survey of 1000 (\(n\)) U.S. teens, 372 (\(x\)) said that they own smartphones. The point estimate for the population proportion of U.S. teens who own smartphones is 37.2%. Give the confidence interval.
<- 0.372
p <- 1000
n * p >= 5 n
[1] TRUE
* (1 - p) >= 5 n
[1] TRUE
<- 1.96
zc <- zc * sqrt(p * (1 - p) / 1000)
E round(c(p - E, p + E), 3)
[1] 0.342 0.402
With 95% confidence, we can say that the population proportion of U.S. teens who own smartphones is between 34.2% and 40.2%.
7.9 Summary
- How to estimate the population proportion (point + interval)?
- Binomial distribution.
- How to estimate the population mean (point + interval) if the population standard deviation \(\sigma\) is known?
- \(z\) distribution.
- How to estimate the population mean (interval) if the population standard deviation \(\sigma\) is unknown?
- \(t\) distribution.
- How to estimate the population variance?
- \(\chi^2\) distribution.
7.10 Readings
- Elementary Statistics, Chapter 5