ENV221 L05

Author

Peng Zhao

5 Basic concepts

5.1 Learning objectives

In this lecture, you will

  1. Learn the basic concepts of statistics.
  2. Understand the features of data types.
  3. Know the functions for data types in R.

5.2 What is statistics

Statistics:
  • the science of collecting, organizing, analyzing, and interpreting data in order to make decisions.

  • the plural form of the noun statistic.

Examples:

  1. 6.7% of the ENV221 students in A.Y. 2021/2022 knew nothing about statistics before the course, and 7.3% in A.Y. 2022/2023.
  2. XJTLU has been ranked in the 601-800 band in The Times Higher Education World University Rankings 2022. It also ranks in the top 50 universities on the Chinese mainland (source).
  3. The weather tomorrow in Suzhou is going to be sunny, with a maximum temperature of 22 °C and a minimum temperature of 11 °C (source).
  4. The general fertility rate (GFR) for the United States in 2019 was 58.3 births per 1,000 females aged 15–44, down 1% from 2018 (59.1) and a record low rate for the nation (source).
  5. The death rate in the US in 2019 was 869.7 deaths per 100,000 population, and the life expectancy was 78.8 years (source).

5.3 Population and sample

Why do we have to know them?

  • Almost all the knowledge of ENV221 is based on them.

5.3.1 Definitions

Population:

the largest collection of entities for which we have an interest at a particular time

Sample:

a subset of the population

Population and sample

Identify the population and the sample:

  1. 1500 adults in the United States were asked if they thought there was solid evidence for global warming. 855 said yes.

  2. To find out the average length of all the fish in Taihu Lake, a scientist randomly sampled 50 fish from the lake.

  3. The module leader wanted to know the background of the ENV221 students in the R language. He selected 10 students and surveyed them.

    x <- read.csv('data/students_env221.csv')
    sample(x$ID, 10)
  4. ENV221 students were randomly selected for a survey at XJTLU on knowledge of the R language.

5.3.2 Parameter and statistic

Parameter:

measures of the population, denoted by Greek letters (e.g., \(\mu\) and \(\sigma ^2\) for the population mean and variance)

Statistic:

measures of the sample, denoted by Roman letters (e.g., \(\bar x\) and \(s^2\) for the sample mean and variance)

Question: What are the parameters and statistics in the examples of the previous section?

Greek letters that are often used in statistics. Adapted from https://www.rapidtables.com/math/symbols/greek_alphabet.html

| Upper case | Lower case | Roman | Pron. | Pron. (Chinese) | In statistics |
|---|---|---|---|---|---|
| A | \(\alpha\) | a | Alpha | 阿尔法 | intercept |
| B | \(\beta\) | b | Beta | 贝塔 | slope |
| M | \(\mu\) | m | Mu | 谬/木 | mean |
| N | \(\nu\) | n | Nu | 钮/努 | degrees of freedom |
| P | \(\rho\) | r | Rho | | correlation coefficient |
| \(\Sigma\) | \(\sigma\) | s | Sigma | 希格玛 | standard deviation |
| X | \(\chi\) | ch | Chi | | \(\chi^2\) test |
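
The difference between a parameter and a statistic can be sketched in R. This is a minimal illustration with simulated data: the population of 1000 fish lengths, the seed, and all object names are assumptions made for the example, not course data.

```r
# Hypothetical population: lengths (cm) of 1000 fish, simulated for illustration
set.seed(221)
population <- rnorm(1000, mean = 11, sd = 2)

# Parameters describe the whole population
mu <- mean(population)
# The population variance divides by N, so rescale var(), which divides by N - 1
N <- length(population)
sigma2 <- var(population) * (N - 1) / N

# Statistics describe a sample drawn from that population
fish_sample <- sample(population, 50)
x_bar <- mean(fish_sample)
s2 <- var(fish_sample)

c(mu = mu, x_bar = x_bar, sigma2 = sigma2, s2 = s2)
```

Rerunning the last four lines without the seed gives a different \(\bar x\) and \(s^2\) each time, while \(\mu\) and \(\sigma^2\) stay fixed.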

5.4 Descriptive and inferential

Descriptive statistics:

the organization, summarization, and display of data. Common forms: summaries, graphs, tables.

Inferential statistics:

using a sample to draw conclusions about a population. Common forms: hypothesis tests, confidence intervals, regression.

Identify descriptive statistics and inferential statistics:

  1. 7.3% of the ENV221 students in A.Y. 2022/2023 knew nothing about statistics before the course.
  2. Based on our sample, we are 95% confident that between 23% and 30% of XJTLU students often use R.
  3. The mean length of the fish in our sample is 11.0 cm.
  4. We are 99% sure that the mean length of all the fish in Taihu Lake is between 9.8 and 12.2 cm.
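
The fish examples above can be sketched in R to show the two kinds of result side by side. The data are simulated (an assumption for illustration), so the numbers will not match the ones quoted in the list.

```r
# Hypothetical sample: lengths (cm) of 50 fish, simulated for illustration
set.seed(221)
fish_length <- rnorm(50, mean = 11, sd = 3)

# Descriptive statistics: summarize the sample itself
sample_mean <- mean(fish_length)
sample_sd <- sd(fish_length)

# Inferential statistics: a 99% confidence interval for the population mean
ci99 <- t.test(fish_length, conf.level = 0.99)$conf.int

sample_mean
ci99
```

The sample mean is a statement about the 50 fish only; the interval is a statement about the whole lake.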

A basic tool in the study of inferential statistics is probability.

5.5 Probability

Why do we have to know it?

  • Statistics is used to make decisions, and probability measures how certain those decisions are.

5.5.1 Definitions

Probability:

A measure of how likely an event is to occur.

Range: 0 (impossible) to 1 (certain).

| Probability | Meaning |
|---|---|
| 0 | Impossible |
| 0.25 | Unlikely |
| 0.5 | Even chance |
| 0.75 | Likely |
| 1 | Certain |

An event that occurs with \(p \le 0.05\) is typically considered unusual. Unusual events are unlikely to occur.

Probability experiment:

An action, or trial, through which specific results (counts, measurements, or responses) are obtained.

Outcome:

The result of a single trial in a probability experiment.

Sample Space:

The set of all possible outcomes of a probability experiment.

Event:

Consists of one or more outcomes and is a subset of the sample space.

Examples:

  1. Probability experiment: Toss a coin
     • Outcome: {Head}
     • Sample space: {Head, Tail}
     • Event: {Getting a head} = {Head}
  2. Probability experiment: Roll a die
     • Outcome: {3}
     • Sample space: {1, 2, 3, 4, 5, 6}
     • Event: {Die is even} = {2, 4, 6}

Question: In a probability experiment of tossing a coin and then rolling a die, what is the sample space?

expand.grid(coin = c("head", "tail"), die = 1:6)

5.5.2 Types of probability

Classical (or theoretical) probability:

\[P(E) = \frac {\text {Number of outcomes in event } E}{\text{Total number of outcomes in sample space}} \]

Example:

  1. You toss a coin. What is the probability of the event “head”?
     • Sample space: {Head, Tail}
     • For event A = {Head}, \(P(\text{head}) = 1/2 = 0.5\).
  2. You roll a six-sided die. What is the probability of each event?
     • Event A: rolling a 3
     • Event B: rolling a 7
     • Event C: rolling a number less than 5
     • Sample space: {1, 2, 3, 4, 5, 6}
     • For event A = {3}, \(P(\text{rolling a 3}) = 1/6 \approx 0.167\).
     • Event B contains no outcome of the sample space, so \(P(\text{rolling a 7}) = 0/6 = 0\).
     • For event C = {1, 2, 3, 4}, \(P(\text{rolling a number less than 5}) = 4/6 \approx 0.667\).
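
The three die events can be checked in R by counting outcomes. This is a minimal sketch; the helper `p_classical()` is a name made up for this example, not a built-in function.

```r
# Sample space of a six-sided die
die <- 1:6

# Classical probability: outcomes in the event / outcomes in the sample space.
# `event` is a logical vector marking which outcomes belong to the event.
p_classical <- function(event) sum(event) / length(event)

p_a <- p_classical(die == 3)  # Event A: rolling a 3
p_b <- p_classical(die == 7)  # Event B: rolling a 7
p_c <- p_classical(die < 5)   # Event C: rolling a number less than 5

round(c(A = p_a, B = p_b, C = p_c), 3)
```

The same helper works for any sample space in which all outcomes are equally likely, which is the assumption behind the classical formula.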

Questions:

  1. There are 37 BIO students and 20 ENV students enrolled in ENV221. You randomly pick a name out of the ENV221 student list. What is the probability of the event that the one you pick out is an ENV student?

  2. You toss a coin and then roll a die. What is the probability of the event that you get a tail and roll a number less than 4?

Empirical (or statistical) probability:

\[P(E) = \frac {\text {Frequency of event }E}{\text {Total frequency}}\]

Example:

You toss a coin 100 times and get a head 30 times. What is the empirical probability of getting a head?

  • The empirical probability: \(P(\text{head}) = 30 / 100 = 0.3\).
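
The same calculation can be sketched in R with simulated tosses. The fair coin and the seed are assumptions for illustration, so the result will generally differ from the 0.3 in the example.

```r
# Simulate 100 tosses of a coin (assumption: head and tail are equally likely)
set.seed(221)
tosses <- sample(c("head", "tail"), 100, replace = TRUE)

# Empirical probability = frequency of the event / total frequency
n_head <- sum(tosses == "head")
p_head <- n_head / length(tosses)
p_head
```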

5.5.3 Law of Large Numbers

Law of Large Numbers:

As an experiment is repeated over and over, the empirical probability of an event approaches the theoretical (actual) probability of the event.

Examples:

  1. Toss a coin, and calculate the empirical probability of getting a head. Compare it with the theoretical probability.
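# Code each toss as 0 (tail) or 1 (head)
x <- c(0, 1)

# Three single tosses
sample(x, 1)
sample(x, 1)
sample(x, 1)

# 100 tosses, the running count of heads, and the running empirical probability
y1 <- sample(x, 100, replace = TRUE)
y2 <- cumsum(y1)
e_coin <- data.frame(n = 1:100, y1, y2)
e_coin$p <- e_coin$y2 / e_coin$n

# Plot the empirical probability against the theoretical value (red line)
library(ggplot2)
ggplot(e_coin) + 
  geom_point(aes(n, p)) + 
  lims(y = c(0, 1)) +
  geom_hline(yintercept = 0.5, color = 'red')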
  2. Ten balls, colored green, red, or blue, are placed in a blind box. You don’t know how many balls there are of each color. You randomly pick one out of the box, write down its color, and put it back. Repeat this many times, then guess how many balls are green, red, and blue.
# Six draws from the box; its contents (x) are only revealed further below
sample(x, 1)
#> [1] "blue"
sample(x, 1)
#> [1] "green"
sample(x, 1)
#> [1] "blue"
sample(x, 1)
#> [1] "blue"
sample(x, 1)
#> [1] "green"
sample(x, 1)
#> [1] "blue"
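# Contents of the box: 1 red, 3 green, and 6 blue balls
x <- c("red", rep("green", 3), rep("blue", 6))

# 100 draws with replacement
color <- sample(x, 100, replace = TRUE)
e_box <- data.frame(n = 1:100, color)

# Running count and running empirical probability of drawing a green ball
e_box$green <- ifelse(e_box$color == "green", 1, 0)
e_box$n_green <- cumsum(e_box$green)
e_box$p_green <- e_box$n_green / e_box$n

# Compare with the theoretical probability 3/10 (green line); needs library(ggplot2)
ggplot(e_box) + 
  geom_point(aes(n, p_green)) + 
  lims(y = c(0, 1)) +
  geom_hline(yintercept = 0.3, color = 'green')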
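
The blind-box question asks for a guess about all three colors, not only green. One possible way to make that guess is sketched below; the number of draws (1000), the seed, and the object names are assumptions for illustration.

```r
# Contents of the box as in the example above (unknown to the player)
box <- c("red", rep("green", 3), rep("blue", 6))

# Many draws with replacement
set.seed(221)
draws <- sample(box, 1000, replace = TRUE)

# Empirical probability of each color
p_hat <- prop.table(table(draws))
p_hat

# Guess the number of balls: empirical probability times the 10 balls in the box
guess <- round(p_hat * length(box))
guess
```

With only a handful of draws the guess can easily be wrong; by the Law of Large Numbers it becomes more reliable as the number of draws grows.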

5.5.4 Sampling techniques

Simple Random Sample:

a sampling technique in which every possible sample of the same size has an equal probability of being chosen.

Suppose you cook a pot of soup for your family and add some salt to it. How do you know if the soup is salty enough for your family?

Sample a spoon of soup from a pot

Question: How do you randomly choose 10 students from our class?

x <- read.csv('data/students_env221.csv')
x$ID
sample(x$ID, 10)
Stratified Sample:

Members of the population are divided into two or more subsets, called strata, that share a similar characteristic such as age, gender, ethnicity, or even political preference. A sample is then randomly selected from each of the strata.

Questions: How do you randomly choose 5 girls and 5 boys from our class? How do you randomly choose 30% girls and 30% boys?

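# Size of each stratum
table(x$GENDER)

# 5 students from each stratum, in one step
sample(x$ID[x$GENDER == 'Female'], 5)
sample(x$ID[x$GENDER == 'Male'], 5)

# The same, storing the IDs of each stratum first
idfemale <- x$ID[x$GENDER == 'Female']
sample(idfemale, 5)
idmale <- x$ID[x$GENDER == 'Male']
sample(idmale, 5)

# 30% of each stratum (rounded to a whole number of students)
sample(idfemale, round(length(idfemale) * 0.3))
sample(idmale, round(length(idmale) * 0.3))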
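
When there are more than two strata, repeating one line per stratum becomes tedious. The following sketch does the same proportional sampling for any number of strata. The small class list is made up for illustration (the real one comes from the CSV file), and the seed is arbitrary.

```r
# Hypothetical class list: 12 female and 8 male students
students <- data.frame(
  ID = 1:20,
  GENDER = rep(c("Female", "Male"), times = c(12, 8))
)

# Split the IDs by stratum, then sample 30% (rounded) within each stratum.
# Indexing with sample.int() stays correct even if a stratum has a single ID.
set.seed(221)
strata <- split(students$ID, students$GENDER)
picked <- lapply(strata, function(ids) {
  ids[sample.int(length(ids), round(length(ids) * 0.3))]
})
picked
```
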
Cluster Sample:

Divide the population into groups, called clusters, and select all of the members in one or more (but not all) of the clusters.

Question: Put every three students into one group. How do you randomly choose 1 group from our class?

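n_clu <- 1    # number of clusters to select
perclu <- 3   # students per cluster
# Number the complete groups 1, 2, ... and pick n_clu of them at random
sample_clu <- sample(1:(nrow(x) %/% perclu), n_clu)
# Row indices of the students in the selected group
i_clu <- (perclu * (sample_clu - 1) + 1):(perclu * sample_clu)
# Select rows (x[i_clu] without the comma would select columns)
x[i_clu, ]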
Systematic Sample:

Each member of the population is assigned a number. The members of the population are ordered in some way, a starting number is randomly selected, and then sample members are selected at regular intervals from the starting number.

Question: How do you choose three students with the systematic sampling method?

i <- nrow(x) %/% 3                 # sampling interval
r <- sample(1:i, 1)                # random start within the first interval
x$ID[seq(r, r + i * (3 - 1), i)]   # every i-th student from the start
Convenience sample:

Consists only of members of the population that are easy to reach. Not recommended, because it often leads to biased results.

5.6 Data

Data consist of information coming from observations, counts, measurements, or responses.

5.6.1 Constants & variables

Constant:

something that has only one value (e.g. the speed of light in vacuum, \(\pi\), \(e\))

Variable:

something that has more than one value (e.g. your age, air temperature, currency exchange rate)

Dependent variable:

something a researcher measures. Its value depends on the values of other variables by some law or rule.

Independent variable:

something a researcher manipulates in an experiment, or something occurring naturally and affecting dependent variables.

5.6.2 Levels of Measurement

The level of measurement determines which statistical calculations are meaningful.

Nominal level:

Data are categorized using names, labels, or qualities. No mathematical computations can be made at this level. (定类)
  • Examples: nationality, gender, weather.
  • Statistical calculations: (relative) frequency/count.

Ordinal level:

Data can be arranged in order, or ranked, but differences between data entries are not meaningful. (定序)
  • Examples: top 10 universities, UK honours degree classes, weather warnings.
  • Statistical calculations: nominal + rank.

Interval level:

Data can be ordered, and meaningful differences between data entries can be calculated. At the interval level, a zero entry simply represents a position on a scale; the entry is not an inherent zero. (定距)
  • Examples: calendar year, temperature in °C or °F.
  • Statistical calculations: ordinal + centre, spread, shape.

Ratio level:

A zero entry is an inherent zero. A ratio of two data entries can be formed so that one data entry can be meaningfully expressed as a multiple of another. (定比)
  • Examples: age, temperature in K, wind speed.
  • Statistical calculations: interval + ratio.
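
The four levels can be sketched in R as follows. All values and object names are made up for illustration, and the order of the warning colors is an assumption of this example.

```r
# Nominal: unordered categories, so only counts are meaningful
weather <- factor(c("sunny", "rainy", "sunny", "cloudy"))
table(weather)

# Ordinal: ordered categories, so ranking and comparison also work
alert <- factor(c("blue", "red", "yellow", "blue"),
                levels = c("blue", "yellow", "orange", "red"),
                ordered = TRUE)
alert[2] > alert[1]

# Interval: differences are meaningful, ratios are not
temp_c <- c(10, 20)
diff(temp_c)

# Ratio: after converting to kelvin, ratios are meaningful
temp_k <- temp_c + 273.15
temp_k[2] / temp_k[1]
```

Note that 20 °C is 10 degrees warmer than 10 °C, but not "twice as warm": on the kelvin scale the ratio is only slightly above 1.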

5.6.3 Data types

Qualitative data, or categorical data:

place data in descriptive groups. e.g. home prefecture, Judo rank, weather… Nominal or ordinal levels.

Quantitative data, or numerical data:

take numerical values. Interval or ratio levels.

  • Discrete data: often integers, e.g. age, book quantity

  • Continuous data: any value within an interval, bounded or unbounded, e.g. wind speed, temperature, pH

Why do we have to know data types?

  • Choose the appropriate tables and graphs to present your data,

  • Choose the appropriate distribution to describe your data, and

  • Choose the appropriate hypothesis tests to make decisions.

Data types in R:

6 basic types: numeric, logical, character, integer, complex, raw

is.numeric()
is.logical()
is.character()
class()
mode()
typeof()
str()
summary()
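
A short sketch of how some of these functions differ. The four example values and the object names are arbitrary choices for illustration.

```r
values <- list(
  number = 3.14,
  whole = 2L,
  flag = TRUE,
  text = "ENV221"
)

# Compare what class(), mode(), and typeof() report for each value
type_info <- data.frame(
  class = sapply(values, class),
  mode = sapply(values, mode),
  typeof = sapply(values, typeof)
)
type_info

# is.numeric() is TRUE for both double and integer values
sapply(values, is.numeric)
```

`typeof()` reports the internal storage type, which is why `3.14` appears as "double" there but as "numeric" under `class()` and `mode()`.
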
Object types in R:

vector, data frame, factor, list, matrix, array, …

5.6.4 Converting data types

  1. Reducing the level of measurement results in a loss of information.
    Example:
  • height → grouped into “> 1.3 m” and “\(\le\) 1.3 m” (ratio → ordinal)
  • wind rose (ratio → ordinal)

Windrose

The classification of wind speeds

| Beaufort number | Wind speed (km/hr) | International description | US Weather Bureau description | Effect of wind on the sea |
|---|---|---|---|---|
| 0 | <1 | Calm | Light wind | Small wavelets |
| 1 | 1-5 | Light air | Light wind | Small wavelets |
| 2 | 6-11 | Light breeze | Light wind | Small wavelets |
| 3 | 12-19 | Gentle breeze | Gentle-moderate | Large wavelets to small waves |
| 4 | 20-28 | Moderate breeze | Gentle-moderate | Large wavelets to small waves |
| 5 | 29-38 | Fresh breeze | Fresh wind | Moderate waves, many whitecaps |
| 6 | 39-49 | Strong breeze | Strong wind | Large waves, many whitecaps |
| 7 | 50-61 | Moderate gale | Strong wind | Large waves, many whitecaps |
| 8 | 62-74 | Fresh gale | Gale | High waves, foam streaks |
| 9 | 75-88 | Strong gale | Gale | High waves, foam streaks |
| 10 | 89-102 | Whole gale | Whole gale | Very high waves, rolling sea |
| 11 | 103-117 | Storm | Whole gale | Very high waves, rolling sea |
| 12-17 | >117 | Hurricane | Hurricane | Sea white with spray and foam |

Body mass index (BMI)
Body mass index (BMI) is a person’s weight in kilograms divided by the square of their height in meters. BMI is an inexpensive and easy screening method for weight category: underweight, healthy weight, overweight, and obesity. For adults 20 years old and older, BMI is interpreted using standard weight status categories (see the following table). These categories are the same for men and women of all body types and ages.

| BMI | Weight status |
|---|---|
| Below 18.5 | Underweight |
| 18.5 – 24.9 | Healthy Weight |
| 25.0 – 29.9 | Overweight |
| 30.0 and Above | Obesity |

  2. Increasing the level of measurement is misleading.
    Example:
  • Table of the test grades

Table: Test grades

R functions:

as.numeric()
as.logical()
as.character()
cut()
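
As an illustration of reducing the level of measurement, the BMI categories from the table above can be produced with `cut()`. The BMI values below are made up for this sketch.

```r
# Hypothetical BMI values (ratio level)
bmi <- c(17.2, 22.5, 24.9, 27.3, 31.8)

# cut() turns the numbers into ordered categories (ordinal level).
# right = FALSE closes each interval on the left: [18.5, 25), [25, 30), ...
status <- cut(bmi,
              breaks = c(0, 18.5, 25, 30, Inf),
              labels = c("Underweight", "Healthy Weight", "Overweight", "Obesity"),
              right = FALSE,
              ordered_result = TRUE)
data.frame(bmi, status)

# The original values cannot be recovered: as.numeric() returns only category codes
as.numeric(status)
```

Both 22.5 and 24.9 end up in the same category, which is exactly the loss of information described in this section.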

5.7 Readings

  • Elementary Statistics, Chapters 1 and 3