ENV221 L05
5 Basic concepts
5.1 Learning objectives
In this lecture, you will
- Learn the basic concepts of statistics.
- Understand the features of data types.
- Know the functions for data types in R.
5.2 What is statistics
- Statistics: (1) the science of collecting, organizing, analyzing, and interpreting data in order to make decisions; (2) the plural form of the noun “statistic”.
Examples:
- 6.7% of the ENV221 students in A.Y. 2021/2022 knew nothing about statistics before the course, and 7.3% in A.Y. 2022/2023.
- XJTLU has been ranked in the 601-800 band in The Times Higher Education World University Rankings 2022. It also ranks in the top 50 universities on the Chinese mainland (source).
- The weather in Suzhou tomorrow will be sunny, with a maximum temperature of 22 °C and a minimum temperature of 11 °C (source).
- The general fertility rate (GFR) for the United States in 2019 was 58.3 births per 1,000 females aged 15–44, down 1% from 2018 (59.1) and a record low rate for the nation (source).
- The death rate in the US in 2019 was 869.7 deaths per 100,000 population, and the life expectancy was 78.8 years (source).
5.3 Population and sample
Why do we have to know them?
- Almost all the knowledge of ENV221 is based on them.
5.3.1 Definitions
- Population: the largest collection of entities in which we have an interest at a particular time.
- Sample: a subset of the population.
Identify the population and the sample:
- 1500 adults in the United States were asked if they thought there was solid evidence for global warming. 855 said yes.
- In order to know the average length of all the fish in Taihu Lake, a scientist randomly sampled 50 fish from the lake.
- The module leader wanted to know the background of the ENV221 students in the R language. He selected 10 students and surveyed them.
- ENV221 students were randomly selected in a survey at XJTLU on knowledge of the R language.
5.3.2 Parameter and statistic
- Parameter: a measure of the population, denoted by Greek letters (e.g., \(\mu\) and \(\sigma ^2\) for the population mean and variance).
- Statistic: a measure of the sample, denoted by Roman letters (e.g., \(\bar x\) and \(s^2\) for the sample mean and variance).
Question: What are the parameters and statistics in the examples of the previous section?
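To make the distinction concrete, here is a minimal sketch in R. The population is simulated, so its size (1000 fish), mean, and spread are assumptions chosen only for illustration.

```r
set.seed(221)                                  # make the example reproducible
# Hypothetical population: lengths (cm) of all 1000 fish in a small pond
population <- rnorm(1000, mean = 11, sd = 3)

# Parameters describe the whole population
mu     <- mean(population)
sigma2 <- mean((population - mu)^2)            # population variance divides by N

# Statistics describe a sample drawn from it
smp   <- sample(population, 50)
x_bar <- mean(smp)
s2    <- var(smp)                              # var() divides by n - 1

c(mu = mu, x_bar = x_bar, sigma2 = sigma2, s2 = s2)
```

The statistics change every time a new sample is drawn, while the parameters stay fixed for a given population.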
Upper case | Lower case | Roman | Pronunciation (English) | Pronunciation (Chinese) | In statistics |
---|---|---|---|---|---|
A | \(\alpha\) | a | Alpha | 阿尔法 | intercept |
B | \(\beta\) | b | Beta | 贝塔 | slope |
M | \(\mu\) | m | Mu | 谬/木 | mean |
N | \(\nu\) | n | Nu | 钮/努 | degrees of freedom |
P | \(\rho\) | r | Rho | 柔 | correlation coefficient |
\(\Sigma\) | \(\sigma\) | s | Sigma | 希格玛 | standard deviation |
X | \(\chi\) | ch | Chi | 凯 | \(\chi^2\) test |
5.4 Descriptive and inferential
- Descriptive statistics: the organization, summarization, and display of data. Common forms: summaries, graphs, tables.
- Inferential statistics: using a sample to draw conclusions about a population. Common forms: hypothesis tests, confidence intervals, regression.
Identify descriptive statistics and inferential statistics:
- 7.3% of the ENV221 students in A.Y. 2022/2023 knew nothing about statistics before the course.
- From our sample, we are 95% confident that 23% to 30% of XJTLU students often use R.
- The mean length of the fish in our sample is 11.0 cm.
- We are 99% confident that the mean length of all the fish in Taihu Lake is between 9.8 and 12.2 cm.
A basic tool in the study of inferential statistics is probability.
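The contrast between the two can be sketched in R. The fish lengths below are simulated (an assumption for illustration), and `t.test()` is used here as one common way to get a confidence interval for a mean; it is not necessarily the method behind the numbers quoted above.

```r
set.seed(1)
# Hypothetical sample: lengths (cm) of 50 fish
fish <- rnorm(50, mean = 11, sd = 3)

# Descriptive: summarise the sample itself
sample_mean <- mean(fish)
summary(fish)

# Inferential: a 99% confidence interval for the population mean
ci <- t.test(fish, conf.level = 0.99)$conf.int
ci
```

The sample mean is a statement about the 50 fish that were measured; the interval is a statement about all the fish in the lake.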
5.5 Probability
Why do we have to know it?
- Statistics is used to make decisions, and probability is the basic tool of inferential statistics.
5.5.1 Definitions
- Probability: a measure of how likely an event is to occur, ranging from 0 (impossible) to 1 (certain).
Probability | Meaning |
---|---|
0 | Impossible |
0.25 | Unlikely |
0.5 | Even chance |
0.75 | Likely |
1 | Certain |
An event that occurs with \(p \le 0.05\) is typically considered unusual. Unusual events are unlikely to occur.
- Probability experiment: an action, or trial, through which specific results (counts, measurements, or responses) are obtained.
- Outcome: the result of a single trial in a probability experiment.
- Sample space: the set of all possible outcomes of a probability experiment.
- Event: one or more outcomes; a subset of the sample space.
Examples:
- Probability experiment: Toss a coin
- Outcome: {Head}
- Sample space: {Head, Tail}
- Event: {Coin shows a head} = {Head}
- Probability experiment: Roll a die
- Outcome: {3}
- Sample space: {1, 2, 3, 4, 5, 6}
- Event: {Die is even}={2, 4, 6}
Question: In a probability experiment of tossing a coin and then rolling a die, what is the sample space?
expand.grid(coin = c("head", "tail"), die = 1:6)
5.5.2 Types of probability
- Classical (or theoretical) probability:
-
\[P(E) = \frac {\text {Number of outcomes in event } E}{\text{Total number of outcomes in sample space}} \]
Example:
- You toss a coin. What is the probability of the event “head”?
- Sample space: {0, 1}, where 0 stands for head and 1 for tail
- For event A = {0}, \(P(\text{0}) = 1/2 = 0.5\).
- You roll a six-sided die. What is the probability of each event?
- Event A: rolling a 3
- Event B: rolling a 7
- Event C: rolling a number less than 5
- Sample space: {1, 2, 3, 4, 5, 6}
- For event A = {3}, \(P(\text{rolling a 3}) = 1/6 \approx 0.167\).
- For event B = {7}, \(P(\text{rolling a 7}) = 0/6 = 0\).
- For event C = {1, 2, 3, 4}, \(P(\text{rolling a number less than 5}) = 4/6 \approx 0.667\).
Questions:
There are 37 BIO students and 20 ENV students enrolled in ENV221. You randomly pick a name out of the ENV221 student list. What is the probability of the event that the one you pick out is an ENV student?
You toss a coin and then roll a die. What is the probability of the event that you get a tail and roll a number less than 4?
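One way to check answers to questions like these is to enumerate the sample space and count. The sketch below assumes every outcome is equally likely, which is what the classical formula requires.

```r
# Enumerate the sample space of "toss a coin, then roll a die"
space <- expand.grid(coin = c("head", "tail"), die = 1:6)

# Event: a tail and a number less than 4
event <- space$coin == "tail" & space$die < 4

# Classical probability = outcomes in the event / outcomes in the sample space
p_event <- sum(event) / nrow(space)
p_event

# Picking an ENV student from a list of 37 BIO and 20 ENV students
p_env <- 20 / (37 + 20)
p_env
```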
- Empirical (or statistical) probability:
-
\[P(E) = \frac {\text {Frequency of event }E}{\text {Total frequency}}\]
Example:
You toss a coin 100 times and get a head 30 times. What is the empirical probability of getting a head?
- The empirical probability: \(P (0) = 30 / 100 = 0.3\).
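In R, an empirical probability is simply a relative frequency. The vector below is a hypothetical record of the 100 tosses, built to match the counts in the example.

```r
# Hypothetical record of 100 tosses: 30 heads and 70 tails
tosses <- c(rep("head", 30), rep("tail", 70))

# Relative frequency of each outcome
prop.table(table(tosses))

# Empirical probability of a head
p_head <- mean(tosses == "head")
p_head
```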
5.5.3 Law of Large Numbers
- Law of Large Numbers: as an experiment is repeated over and over, the empirical probability of an event approaches the theoretical (actual) probability of the event.
Examples:
- Toss a coin, and calculate the empirical probability of getting a head. Compare it with the theoretical probability.
x <- c(0, 1)                            # 0 = head, 1 = tail
sample(x, 1)                            # a single toss
sample(x, 1)
sample(x, 1)
y1 <- sample(x, 100, replace = TRUE)    # 100 tosses
y2 <- cumsum(y1 == 0)                   # running count of heads
e_coin <- data.frame(n = 1:100, y1, y2)
e_coin$p <- e_coin$y2 / e_coin$n        # empirical probability after n tosses
library(ggplot2)
ggplot(e_coin) +
  geom_point(aes(n, p)) +
  lims(y = c(0, 1)) +
  geom_hline(yintercept = 0.5, color = 'red')   # theoretical probability
- Ten balls, each colored green, red, or blue, are placed in a blind box. You don’t know how many balls there are of each color. You randomly pick one out of the box, write down its color, and put it back. Repeat this many times and guess how many balls are green, red, and blue.
sample(x, 1)
[1] "blue"
sample(x, 1)
[1] "green"
sample(x, 1)
[1] "blue"
sample(x, 1)
[1] "blue"
sample(x, 1)
[1] "green"
sample(x, 1)
[1] "blue"
x <- c("red", rep("green", 3), rep("blue", 6))        # the contents of the box
color <- sample(x, 100, replace = TRUE)               # 100 draws with replacement
e_box <- data.frame(n = 1:100, color)
e_box$green <- ifelse(e_box$color == "green", 1, 0)   # 1 if the draw is green
e_box$n_green <- cumsum(e_box$green)                  # running count of green draws
e_box$p_green <- e_box$n_green / e_box$n              # empirical probability of green
ggplot(e_box) +
  geom_point(aes(n, p_green)) +
  lims(y = c(0, 1)) +
  geom_hline(yintercept = 0.3, color = 'green')       # theoretical probability of green
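The guessing step can be sketched as follows: estimate each colour's probability from the draws, then scale it to ten balls. The number of draws (1000) and the seed are arbitrary choices, and the guess can still be off by a ball when the number of draws is small.

```r
set.seed(5)
# Composition from the example above, treated as unknown to the person drawing
box <- c("red", rep("green", 3), rep("blue", 6))

draws <- sample(box, 1000, replace = TRUE)

# Empirical probability of each colour
p_hat <- prop.table(table(draws))
p_hat

# Guess the number of balls of each colour out of 10
guess <- round(10 * p_hat)
guess
```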
5.5.4 Sampling techniques
- Simple Random Sample: a sampling technique in which every possible sample of the same size has an equal probability of being chosen.
Suppose you cook a pot of soup for your family. You add some salt to it. How do you know if the soup is salty enough for your family?
Question: How do you randomly choose 10 students from our class?
x <- read.csv('data/students_env221.csv')   # class list, one row per student
x$ID                                        # all student IDs
sample(x$ID, 10)                            # draw 10 IDs without replacement
- Stratified Sample: members of the population are divided into two or more subsets, called strata, that share a similar characteristic such as age, gender, ethnicity, or even political preference. A sample is then randomly selected from each of the strata.
Questions: How do you randomly choose 5 girls and 5 boys from our class? How do you randomly choose 30% girls and 30% boys?
table(x$GENDER)
sample(x$ID[x$GENDER == 'Female'], 5)
sample(x$ID[x$GENDER == 'Male'], 5)
idfemale <- x$ID[x$GENDER == 'Female']
sample(idfemale, 5)
idmale <- x$ID[x$GENDER == 'Male']
sample(idmale, 5)
# 30% of each stratum
sample(idfemale, round(length(idfemale) * 0.3))
sample(idmale, round(length(idmale) * 0.3))
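With more than two strata, repeating the same lines for each group becomes tedious. A minimal sketch using `split()` and `lapply()` is shown below; the `students` data frame is a hypothetical stand-in for the class list, with made-up group sizes.

```r
set.seed(42)
# Hypothetical class list standing in for data/students_env221.csv
students <- data.frame(
  ID     = 1:57,
  GENDER = rep(c("Female", "Male"), times = c(30, 27))
)

# Split the IDs by stratum, then sample 30% within each stratum
strata  <- split(students$ID, students$GENDER)
sampled <- lapply(strata, function(ids) sample(ids, round(length(ids) * 0.3)))
sampled
```

Because the same fraction is taken from every stratum, the sample keeps the population's gender proportions.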
- Cluster Sample: divide the population into groups, called clusters, and select all of the members in one or more (but not all) of the clusters.
Question: Put every three students into one group. How do you randomly choose 1 group from our class?
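n_clu <- 1                                            # number of clusters to select
perclu <- 3                                           # students per cluster
sample_clu <- sample(1:(nrow(x) %/% perclu), n_clu)   # index of the chosen cluster
i_clu <- (perclu * (sample_clu - 1) + 1):(perclu * sample_clu)   # its row numbers
x[i_clu, ]                                            # rows of the chosen students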
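The row-range arithmetic above works for a single cluster. To select several clusters, one option is to label every row with its cluster number first. The sketch below uses a hypothetical class of 57 students, and the column name `cluster` is an assumption.

```r
set.seed(7)
# Hypothetical class list of 57 students
students <- data.frame(ID = 1:57)

perclu <- 3                                   # students per cluster
n_clu  <- 2                                   # clusters to select

# Label consecutive rows 1, 1, 1, 2, 2, 2, ...
students$cluster <- (seq_len(nrow(students)) - 1) %/% perclu + 1

# Randomly choose clusters and keep every member of them
chosen     <- sample(unique(students$cluster), n_clu)
sample_clu <- students[students$cluster %in% chosen, ]
sample_clu
```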
- Systematic Sample: each member of the population is assigned a number. The members of the population are ordered in some way, a starting number is randomly selected, and then sample members are selected at regular intervals from the starting number.
Question: How do you choose three students with the systematic sampling method?
i <- nrow(x) %/% 3                   # sampling interval
r <- sample(1:i, 1)                  # random starting position
x$ID[seq(r, r + i * (3 - 1), i)]     # every i-th student from the start
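The same steps can be wrapped in a small function so that the population size and sample size are not hard-coded. `systematic_index()` is a hypothetical helper written for this sketch, not a function from base R or any package.

```r
# Return the positions of a systematic sample of size k from n members
systematic_index <- function(n, k) {
  step  <- n %/% k                     # sampling interval
  start <- sample(seq_len(step), 1)    # random start within the first interval
  seq(start, by = step, length.out = k)
}

set.seed(3)
idx <- systematic_index(n = 57, k = 3)
idx
```

The returned positions can then be used to index the class list, for example `x$ID[idx]`.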
- Convenience sample: consists only of members of the population that are easy to reach. Not recommended, because it often leads to biased results.
5.6 Data
Data consist of information coming from observations, counts, measurements, or responses.
5.6.1 Constants & variables
- Constant: something that has only one value (e.g. the speed of light in vacuum, \(\pi\), e).
- Variable: something that has more than one value (e.g. your age, air temperature, currency exchange rate).
- Dependent variable: something a researcher measures. Its value depends on the values of other variables by some law or rule.
- Independent variable: something a researcher manipulates in an experiment, or something occurring naturally and affecting dependent variables.
5.6.2 Levels of Measurement
The level of measurement determines which statistical calculations are meaningful.
- Nominal level (定类): data are categorized using names, labels, or qualities. No mathematical computations can be made at this level. Examples: nationality, gender, weather. Statistical calculations: (relative) frequency, count.
- Ordinal level (定序): data can be arranged in order, or ranked, but differences between data entries are not meaningful. Examples: top 10 universities, UK honours degree classes, weather warnings. Statistical calculations: nominal + rank.
- Interval level (定距): data are ordered, and meaningful differences between data entries can be calculated. A zero entry simply represents a position on a scale; it is not an inherent zero. Examples: calendar year, temperature in °C or °F. Statistical calculations: ordinal + centre, spread, shape.
- Ratio level (定比): a zero entry is an inherent zero. A ratio of two data entries can be formed so that one data entry can be meaningfully expressed as a multiple of another. Examples: age, temperature in K, wind speed. Statistical calculations: interval + ratio.
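The four levels map loosely onto R objects, as the sketch below illustrates. The values and the four-colour warning scale are made up for illustration.

```r
# Nominal: unordered factor; only counts are meaningful
weather <- factor(c("sunny", "rainy", "sunny", "cloudy"))
table(weather)

# Ordinal: ordered factor; ranking and comparison are meaningful
alert <- factor(c("blue", "red", "yellow", "blue"),
                levels = c("blue", "yellow", "orange", "red"),
                ordered = TRUE)
alert[2] > alert[3]              # TRUE: "red" ranks above "yellow"

# Interval vs ratio: ratios of °C are not meaningful, ratios of K are
temp_c <- c(10, 20)
temp_k <- temp_c + 273.15
temp_c[2] / temp_c[1]            # 2, but 20 °C is not "twice as hot" as 10 °C
temp_k[2] / temp_k[1]            # about 1.035
```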
5.6.3 Data types
- Qualitative data, or categorical data: place data in descriptive groups, e.g. home prefecture, Judo rank, weather. Nominal or ordinal levels.
- Quantitative data, or numerical data: take numerical values. Interval or ratio levels.
  - Discrete data: countable values, often integers, e.g. age in years, number of books.
  - Continuous data: any value within an interval, bounded or unbounded, e.g. wind speed, temperature, pH.
Why do we have to know data types?
Choose the appropriate tables and graphs to present your data,
Choose the appropriate distribution to describe your data, and
Choose the appropriate hypothesis tests to make decisions.
- Data types in R: 6 basic types (numeric, logical, character, integer, complex, raw). Useful functions:
is.numeric()
is.logical()
is.character()
class()
mode()
typeof()
str()
summary()
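A short sketch of how these functions report the six basic types is given below. The variable names are arbitrary.

```r
a  <- 3.14         # numeric (double)
b  <- 3L           # integer
c1 <- "ENV221"     # character
d  <- TRUE         # logical
e  <- 2 + 1i       # complex
f  <- as.raw(255)  # raw

values  <- list(a, b, c1, d, e, f)
types   <- sapply(values, typeof)   # internal storage type
classes <- sapply(values, class)    # class seen by most functions
types
classes

is.numeric(b)      # TRUE: integers count as numeric
is.character(a)    # FALSE
str(values)        # compact overview of every element
```

Note that `typeof()` reports `"double"` where `class()` reports `"numeric"`.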
- Object types in R: vector, data frame, factor, list, matrix, array, …
5.6.4 Converting data types
- Reducing the levels will result in a loss of information.
Example:
- height → grouped into “> 1.3 m” and “\(\le\) 1.3 m” (ratio → ordinal)
- wind rose (ratio → ordinal)
The classification of wind speeds
Beaufort number | Wind speed (km/hr) | International description | US Weather Bureau description | Effect of wind on the Sea |
---|---|---|---|---|
0 | <1 | Calm | Light Wind | Small wavelets |
1 | 1-5 | Light Air | Light Wind | Small wavelets |
2 | 6-11 | Light Breeze | Light Wind | Small wavelets |
3 | 12-19 | Gentle Breeze | Gentle-moderate | Large wavelets to small waves |
4 | 20-28 | Moderate Breeze | Gentle-moderate | Large wavelets to small waves |
5 | 29-38 | Fresh Breeze | Fresh wind | Moderate waves, many whitecaps |
6 | 39-49 | Strong Breeze | Strong wind | Large waves, many whitecaps |
7 | 50-61 | Moderate gale | Strong wind | Large waves, many whitecaps |
8 | 62-74 | Fresh gale | Gale | High waves, foam streaks |
9 | 75-88 | Strong gale | Gale | High waves, foam streaks |
10 | 89-102 | Whole gale | Whole gale | Very high waves, rolling sea |
11 | 103-117 | Storm | Whole gale | Very high waves, rolling sea |
12-17 | >117 | Hurricane | Hurricane | Sea white with spray and foam |
Body mass index (BMI)
BMI | Weight Status |
---|---|
Below 18.5 | Underweight |
18.5 – 24.9 | Healthy Weight |
25.0 – 29.9 | Overweight |
30.0 and Above | Obesity |
- Increasing the levels will be misleading.
Example:
- Table of the test grades
Table: Test grades
R functions:
as.numeric()
as.logical()
as.character()
cut()
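As an illustration of reducing the level of measurement, the sketch below uses `cut()` to turn BMI values (ratio) into the weight-status classes from the table above (ordinal). The BMI values are made up, and the break points are taken from that table.

```r
# Hypothetical BMI values
bmi <- c(17.2, 22.0, 24.9, 27.5, 31.0)

# Ratio -> ordinal: group BMI into weight-status classes
status <- cut(bmi,
              breaks = c(-Inf, 18.5, 25, 30, Inf),
              labels = c("Underweight", "Healthy Weight", "Overweight", "Obesity"),
              right = FALSE,           # intervals are [lower, upper)
              ordered_result = TRUE)
data.frame(bmi, status)

# Other conversions
as.numeric("3.5")      # character -> numeric
as.logical("TRUE")     # character -> logical
as.character(42)       # numeric -> character
```

Once the values are grouped, the original BMI cannot be recovered from `status`, which is the loss of information mentioned above.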
5.7 Readings
- Elementary Statistics, Chapters 1 and 3