Data Analytics and Visualization with R
Session 3
University of Mannheim
Spring 2023
The mean and mode of the variable always have to be present in the data.
When estimating the variance of a distribution, observations further from the mean have more weight than observations close to the mean.
A boxplot contains information about a distribution's measures of center, spread, and shape.
Median and IQR are more robust to outliers than mean and standard deviation.
Proportion is the mean of a binary variable.
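The last two statements can be checked numerically. A minimal sketch in base R, using made-up values (the vectors `voted` and `income` are hypothetical and not course data):

```r
# Hypothetical binary variable: 1 = yes, 0 = no
voted <- c(1, 0, 1, 1, 0, 1, 0, 1)

# The mean of a 0/1 variable equals the share of 1s
prop_yes <- sum(voted == 1) / length(voted)
mean_voted <- mean(voted)

# Hypothetical numeric variable, with and without one extreme value
income <- c(21, 24, 25, 27, 30, 32, 35)
income_outlier <- c(income, 500)

# How much each summary moves when the outlier is added
shift <- c(
  mean   = mean(income_outlier) - mean(income),
  median = median(income_outlier) - median(income),
  sd     = sd(income_outlier) - sd(income),
  iqr    = IQR(income_outlier) - IQR(income)
)
print(shift)
```

In this toy example the median and IQR move far less than the mean and standard deviation, which is what "robust to outliers" refers to.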
Use a sample to make inferences about a population
Greek letters like \(\beta_1\) are the truth (the estimands), also known as population parameters
Greek letters with extra markings like \(\hat{\beta}_1\) are our estimate of the truth based on our sample
Latin letters like \(X\) are actual data from our sample
Latin letters with extra markings like \(\bar{X}\) are calculations from our sample
(Sample) Data → Calculation → Estimate → Truth
Step | Notation |
---|---|
Data | \(X\) |
Calculation | \(\bar{X} = \frac{\sum{X}}{N}\) |
Estimate | \(\hat{\mu}\) |
Truth | \(\mu\) |
\[ \bar{X} = \hat{\mu} \]
\[ X \rightarrow \bar{X} \rightarrow \hat{\mu} \xrightarrow{\text{🤞 hopefully 🤞}} \mu \]
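The chain from data to truth can be sketched with a simulation, where the truth is known because we set it ourselves. This is only an illustration: the population, the value of `mu`, the sample size, and the seed are all assumptions chosen for the example.

```r
set.seed(123)  # arbitrary seed so the sketch is reproducible

# Hypothetical population generated with a known mean (the "truth", mu)
mu <- 10
population <- rnorm(100000, mean = mu, sd = 2)

# Data: a random sample X from the population
X <- sample(population, size = 200)

# Calculation: X-bar = sum(X) / N
x_bar <- sum(X) / length(X)

# Estimate: the sample mean serves as mu-hat
mu_hat <- x_bar

# How far the estimate landed from the truth in this one sample
mu_hat - mu
```

With real data `mu` is unobserved, so the last line cannot be computed; that is the "hopefully" step.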
A conditional distribution is the distribution of one variable given the value of another variable
treatment | att_start_mn | att_end_mn |
---|---|---|
Control | 8.859375 | 8.453125 |
Treated | 9.607843 | 10.000000 |
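A table like the one above holds means of each variable conditional on the treatment group. A minimal base R sketch of that calculation, using a toy data frame (`toy`, `att_start`, and `att_end` are hypothetical stand-ins, not the data behind the table):

```r
# Toy data, not the data behind the table above
toy <- data.frame(
  treatment = c("Control", "Control", "Control", "Treated", "Treated", "Treated"),
  att_start = c(8, 9, 10, 9, 10, 11),
  att_end   = c(8, 8, 9, 10, 10, 12)
)

# Mean of each variable, conditional on the value of treatment
cond_means <- aggregate(
  cbind(att_start, att_end) ~ treatment,
  data = toy,
  FUN = mean
)
print(cond_means)
```

Each row of `cond_means` summarizes the distribution of the two variables given one value of `treatment`.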
\[Cov(X,Y) = \frac{\overbrace{\sum^N_{i = 1}\overbrace{(X_i - \bar{X})}^{\substack{\text{Deviation of }X_i\\\text{from mean of }X}} \times\overbrace{(Y_i-\bar{Y})}^{\substack{\text{Deviation of }Y_i\\\text{from mean of }Y}}}^{\substack{\text{Sum of the product of the deviations}\\\text{across all observations}}}}{\underbrace{N}_{\text{Number of observations}}}\]
\[corr(X,Y)=\frac{Cov(X,Y)}{\underbrace{\sigma_X}_{\substack{\text{Standard}\\\text{deviation}\\\text{of }X}}\underbrace{\sigma_Y}_{\substack{\text{Standard}\\\text{deviation}\\\text{of }Y}}}\]
r | Rough meaning |
---|---|
±0.1–0.3 | Modest |
±0.3–0.5 | Moderate |
±0.5–0.8 | Strong |
±0.8–0.9 | Very strong |
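Covariance and correlation can be computed step by step from their definitions. A small sketch with hypothetical values for `X` and `Y`. One thing to keep in mind: R's built-in `cov()` and `sd()` divide by \(N - 1\) rather than \(N\), but that factor cancels in the correlation.

```r
# Hypothetical paired observations
X <- c(2, 4, 6, 8, 10)
Y <- c(1, 3, 2, 5, 4)
N <- length(X)

# Covariance: average product of the deviations (divides by N)
cov_pop <- sum((X - mean(X)) * (Y - mean(Y))) / N

# Standard deviations with the same N denominator
sd_X <- sqrt(sum((X - mean(X))^2) / N)
sd_Y <- sqrt(sum((Y - mean(Y))^2) / N)

# Correlation: covariance rescaled by both standard deviations
r <- cov_pop / (sd_X * sd_Y)

# Compare with the built-ins (cov() rescaled from N - 1 to N)
c(by_hand = r, built_in = cor(X, Y))
c(cov_by_hand = cov_pop, cov_built_in = cov(X, Y) * (N - 1) / N)
```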
\[\beta_X = \frac{Cov(X,Y)}{\underbrace{\sigma_X^2}_{\text{Variance of }X}}\]
How much \(Y\) changes, on average, as \(X\) increases by one unit
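In a regression of \(Y\) on a single \(X\), the slope equals the covariance of \(X\) and \(Y\) divided by the variance of \(X\). A quick check with hypothetical values, compared against `lm()`:

```r
# Hypothetical paired observations
X <- c(2, 4, 6, 8, 10)
Y <- c(1, 3, 2, 5, 4)

# Slope by hand: covariance of X and Y over the variance of X
slope_by_hand <- cov(X, Y) / var(X)

# Slope from a bivariate linear regression of Y on X
fit <- lm(Y ~ X)
slope_lm <- unname(coef(fit)["X"])

c(by_hand = slope_by_hand, lm = slope_lm)
```

Both `cov()` and `var()` divide by \(N - 1\), so the ratio is the same as with the \(N\) denominator.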
ggplot2
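library(ggplot2)

# Daily high temperature over time, with a second y-axis in Fahrenheit
ggplot(
  weather,
  aes(x = datetime, y = tempmax)
) +
  geom_line() +
  geom_smooth() +
  scale_y_continuous(
    # Secondary axis: a one-to-one transformation of Celsius to Fahrenheit
    # (ggplot2 3.5.0 and later name this argument `transform` instead of `trans`)
    sec.axis =
      sec_axis(
        trans = ~ (. * 9 / 5) + 32,
        name = "Fahrenheit"
      )
  ) +
  labs(
    x = NULL, y = "Celsius",
    title = "Daily high temperatures in Mannheim",
    subtitle = "January 1, 2021–December 31, 2021",
    caption = "Source: visualcrossing.com"
  )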
ggplot2
library(patchwork)

# Top panel: daily high temperature
temp_plot <- ggplot(
  weather,
  aes(x = datetime, y = tempmax)
) +
  geom_line() +
  geom_smooth() +
  labs(
    x = NULL,
    y = "Temperature (°C)"
  )

# Bottom panel: humidity
humid_plot <- ggplot(
  weather,
  aes(x = datetime, y = humidity)
) +
  geom_line() +
  geom_smooth() +
  labs(
    x = NULL,
    y = "Humidity (%)"
  )

# Stack the two plots in one column; the top panel gets 70% of the height
temp_plot + humid_plot +
  plot_layout(
    ncol = 1,
    heights = c(0.7, 0.3)
  )