# Bar plot is a common way
library(ggplot2)
ggplot(x) +
  geom_bar(aes(y = GENDER, fill = GENDER))
# A pie chart is good if you have only a few groups.
pie(tb_gender)
# but is a disaster when you have many groups
tb_disp <- table(mtcars$disp)
pie(tb_disp, col = rainbow(length(tb_disp)))
# especially a 3-D pie chart.
library(plotrix)
pie3D(tb_disp)
# instead, use a bar plot
6.2.2 Two-way tables
Contingency tables/cross-tabulation
table(x$GENDER, x$HOME)
          North China North China Other places South China
  Female            1          11            1          22
  Male              0           8            2          12
unique(x$HOME)
[1] "South China" "North China" "Other places" " North China"
x$HOME[x$HOME == ' North China'] <- 'North China'
table(x$GENDER, x$HOME)
         North China Other places South China
  Female          12            1          22
  Male             8            2          12
tb2 <- table(x$GENDER, x$HOME)
prop.table(tb2)
         North China Other places South China
  Female  0.21052632   0.01754386  0.38596491
  Male    0.14035088   0.03508772  0.21052632
round(prop.table(tb2), 3)
         North China Other places South China
  Female       0.211        0.018       0.386
  Male         0.140        0.035       0.211
Epi::stat.table(list(GENDER, HOME), data = x, margins =TRUE)
 -----------------------------------------
         --------------HOME---------------
 GENDER     North   Other   South   Total
            China  places   China
 -----------------------------------------
 Female        12       1      22      35
 Male           8       2      12      22
 Total         20       3      34      57
 -----------------------------------------
Question: How do you display it in a long version?
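One possible answer, sketched on a small made-up data frame: `as.data.frame()` turns a contingency table into one row per cell, with the count in a `Freq` column. The `toy` object and its values are hypothetical stand-ins for `x`, used only so the example runs on its own.

```r
# Hypothetical stand-in for the survey data frame `x` (values are made up)
toy <- data.frame(
  GENDER = c("Female", "Female", "Male", "Male", "Female"),
  HOME   = c("South China", "North China", "South China",
             "South China", "South China")
)

# Wide version: the usual two-way contingency table
tb <- table(GENDER = toy$GENDER, HOME = toy$HOME)

# Long version: one row per GENDER x HOME combination, count in Freq
tb_long <- as.data.frame(tb)
tb_long
```

The long version is also the shape that `ggplot()` expects, which is why the plots that follow can work from a data frame directly.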
library(ggplot2)
ggplot(x) +
  geom_bar(aes(y = GENDER, fill = HOME), width = 0.2)
ggplot(x) +
  geom_bar(aes(y = GENDER, fill = HOME), position = 'fill', width = 0.2)
ggplot(x) +
  geom_bar(aes(x = GENDER, fill = HOME), position = 'dodge')
6.2.3 Three-way tables
table(x$GENDER, x$HOME, x$YEAR)
, ,  = 1998

         North China Other places South China
  Female           0            0           0
  Male             1            1           0

, ,  = 1999

         North China Other places South China
  Female           1            0           1
  Male             1            0           0

, ,  = 2000

         North China Other places South China
  Female           0            0           1
  Male             4            0           0

, ,  = 2001

         North China Other places South China
  Female           4            0           4
  Male             0            0           5

, ,  = 2002

         North China Other places South China
  Female           6            1          16
  Male             1            1           6

, ,  = 2022

         North China Other places South China
  Female           0            0           0
  Male             0            0           1
HairEyeColor
, , Sex = Male

       Eye
Hair    Brown Blue Hazel Green
  Black    32   11    10     3
  Brown    53   50    25    15
  Red      10   10     7     7
  Blond     3   30     5     8

, , Sex = Female

       Eye
Hair    Brown Blue Hazel Green
  Black    36    9     5     2
  Brown    66   34    29    14
  Red      16    7     7     7
  Blond     4   64     5     8
Question: How do you display it in a report/publication?
Gender  Male                         Female
Eye     Brown  Blue  Hazel  Green    Brown  Blue  Hazel  Green
Hair
Black      32    11     10      3       36     9      5      2
Brown      53    50     25     15       66    34     29     14
Red        10    10      7      7       16     7      7      7
Blond       3    30      5      8        4    64      5      8
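A flat layout like the one above can be produced in base R with `ftable()`. This is only one option among several, sketched here on the built-in `HairEyeColor` data; the object name `ft` is an arbitrary choice.

```r
# Flatten the three-way table: Hair in the rows,
# Sex and Eye crossed in the columns
ft <- ftable(HairEyeColor, row.vars = "Hair", col.vars = c("Sex", "Eye"))
ft
```

The result has 4 rows and 8 columns, so every count from the three-way array appears exactly once in a single two-dimensional table.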
Question: How do you display it in a long version?
Tables with more than three dimensions:
Titanic
, , Age = Child, Survived = No

      Sex
Class  Male Female
  1st     0      0
  2nd     0      0
  3rd    35     17
  Crew    0      0

, , Age = Adult, Survived = No

      Sex
Class  Male Female
  1st   118      4
  2nd   154     13
  3rd   387     89
  Crew  670      3

, , Age = Child, Survived = Yes

      Sex
Class  Male Female
  1st     5      1
  2nd    11     13
  3rd    13     14
  Crew    0      0

, , Age = Adult, Survived = Yes

      Sex
Class  Male Female
  1st    57    140
  2nd    14     80
  3rd    75     76
  Crew  192     20
Question: How do you display it in a long version?
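A minimal sketch of one answer, using the built-in `Titanic` table: `as.data.frame()` gives one row per combination of the four factors, and `xtabs()` can rebuild the table from that long version. The names `titanic_long` and `titanic_back` are illustrative only. The same call should work for `HairEyeColor`.

```r
# Long version: one row per Class x Sex x Age x Survived combination
titanic_long <- as.data.frame(Titanic)
head(titanic_long)

# Round trip: rebuild the four-way table from the long version
titanic_back <- xtabs(Freq ~ Class + Sex + Age + Survived,
                      data = titanic_long)
```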
Visualization: bar plots and mosaic plot
ggplot(x) +
  geom_bar(aes(x = GENDER, fill = HOME)) +
  facet_wrap(~ YEAR)
ggplot(data.frame(HairEyeColor)) +
  geom_bar(aes(y = Freq, x = Sex, fill = Sex), stat = 'identity') +
  facet_grid(Hair ~ Eye)
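The mosaic plot is named above but not shown. A minimal base R sketch, assuming the built-in `HairEyeColor` and `Titanic` tables as example data; `class_survival` is an illustrative name for a two-way margin of `Titanic`.

```r
# Mosaic plot of the full three-way table:
# tile areas are proportional to the cell counts
mosaicplot(HairEyeColor, color = TRUE, main = "Hair, eye colour and sex")

# Collapse the four-way Titanic table to Class x Survived, then plot it
class_survival <- margin.table(Titanic, c(1, 4))
mosaicplot(class_survival, color = TRUE, main = "Survival by class")
```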
The daily temperatures for last week were \(t_1 = 27, t_2 = 26, t_3 = 30, t_4 = 27, t_5 = 29, t_6 = 30, t_7 = 25 ^\circ\)C. What was the mean temperature?
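The calculation can be sketched in R as follows; the variable names `temps` and `temp_mean` are illustrative.

```r
# Daily temperatures for last week, in degrees Celsius
temps <- c(27, 26, 30, 27, 29, 30, 25)

# Mean = sum of the values divided by how many there are (194 / 7)
temp_mean <- sum(temps) / length(temps)
temp_mean

# The built-in function gives the same result
mean(temps)
```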
You wanted to know the surface water temperature of Taihu Lake. You randomly selected four locations at 11:59 PM on 1/1/2021, and got the following temperatures: 3.0, 2.1, 3.0 and 24 \(^\circ\)C. Describe the center using the mean and the median.
tc <- c(3.0, 2.1, 3.0, 24)
mean(tc)
median(tc)
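Mean: the usual estimate of the center of a population, but sensitive to extreme values.
Median: the safest, because a single outlier (such as the 24 \(^\circ\)C reading above) barely moves it.
Neither is sufficient on its own, because neither describes the spread of the data.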
6.3.2 Describe the spread
Generic Measure: Range
\(x_\text{max} - x_\text{min}\)
Example: What is the range of the sepal length in the iris dataset?
max(iris$Sepal.Length) -min(iris$Sepal.Length)
[1] 3.6
range(iris$Sepal.Length)
[1] 4.3 7.9
Paired with the mean: Standard deviation
Variance: for a population, \(\sigma^2 = \frac{\Sigma (x_i - \mu)^2}{N}\); for a sample, \(s^2 = \frac{\Sigma (x_i - \bar x)^2}{n-1}\)
Standard deviation: \(\sigma\), \(s\)
Coefficient of variation: \(c_v = \sigma / \mu\), \(c_v = s / \bar x\)
Example: what are the standard deviation, the variance, and coefficient of variation for the sepal length in the iris dataset?
x <- iris$Sepal.Length
n <- length(x)
x_bar <- mean(x)
x_variance <- sum((x - x_bar) ^ 2) / (n - 1)
x_sd <- sqrt(x_variance)
x_cv <- x_sd / x_bar
# use the built-in functions:
var(x)
sd(x)
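The difference between the population and sample formulas can be made concrete with a short sketch. Treating the 150 iris flowers as a whole population is an assumption made purely for illustration, and the names `sepal`, `s2` and `sigma2` are arbitrary.

```r
sepal <- iris$Sepal.Length
n <- length(sepal)

# Sample variance: divide by n - 1 (this is what var() computes)
s2 <- sum((sepal - mean(sepal))^2) / (n - 1)

# Population variance: divide by n, as if these flowers were the whole population
sigma2 <- sum((sepal - mean(sepal))^2) / n

c(sample = s2, population = sigma2, builtin = var(sepal))
```

The two values differ by the factor \((n-1)/n\), which matters little for \(n = 150\) but a great deal for very small samples.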
Paired with the median: Quartiles/Percentiles
The p-th percentile: the value \(V_p\) such that p% of the sample points are less than or equal to \(V_p\).
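As a short illustration on the iris data, quartiles are the 25th, 50th and 75th percentiles. Note that `quantile()` interpolates between data points by default, so its results can differ slightly from the definition above on small samples. The names `q` and `iqr` are illustrative.

```r
sepal <- iris$Sepal.Length

# Quartiles: the 25th, 50th and 75th percentiles
q <- quantile(sepal, probs = c(0.25, 0.5, 0.75))
q

# Interquartile range: the spread of the middle 50% of the data
iqr <- IQR(sepal)
iqr
```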
Box plot: shows groups of numerical data through their quartiles, with lines (whiskers) extending from the box to indicate variability outside the upper and lower quartiles.
Example: box plot for the sepal length in the iris dataset.
boxplot(iris$Sepal.Length, horizontal = TRUE, xlab = 'Sepal length (cm)')
points(mean(iris$Sepal.Length), 1, col = "red", pch = 16)
ggplot(iris, aes(y = Sepal.Length)) +
  geom_boxplot() +
  geom_jitter(aes(x = 0), color = 'lightblue') +
  stat_summary(aes(x = 0), fun = mean, color = "red") +
  labs(x = '', y = 'Sepal Length (cm)') +
  theme(axis.ticks.y = element_blank(), axis.text.y = element_blank()) +
  coord_flip()
Question: How do you draw a box plot for the sepal length for each species?
boxplot(iris$Sepal.Length ~ iris$Species, horizontal = TRUE, las = 1,
        xlab = 'Sepal Length (cm)', ylab = '')
sepal_length_mean <- tapply(iris$Sepal.Length, iris$Species, mean)
points(sepal_length_mean, 1:3, col = "red", pch = 16)
tapply(iris$Sepal.Length, iris$Species, summary)
$setosa
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.300 4.800 5.000 5.006 5.200 5.800
$versicolor
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.900 5.600 5.900 5.936 6.300 7.000
$virginica
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.900 6.225 6.500 6.588 6.900 7.900
Violin plot:
a combination of a box plot and a density plot to show the distribution shape of the data
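A minimal sketch with ggplot2, which the earlier examples already load. Overlaying a narrow `geom_boxplot()` inside `geom_violin()` is one common way to show both views at once; the `width = 0.1` value and the object name `p` are illustrative choices only.

```r
library(ggplot2)

# Violin outline shows the density shape; the narrow box plot inside
# adds the median and quartiles for each species
p <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_violin() +
  geom_boxplot(width = 0.1) +
  labs(x = '', y = 'Sepal Length (cm)')
p
```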