Data Analytics and Visualization with R

Session 3

Viktoriia Semenova

University of Mannheim

Spring 2023

- Push your latest version of the project to GitHub
- Slack me/write on GitHub Discussion
- Include the lines of code which produce the error and the error text when asking for help
- Deadline for Problem Set 3: Monday 23:59 (not noon)
- Office hours Monday 16:00-17:00 (and Friday)

The mean and mode of the variable always have to be present in the data.

When estimating variance of the distribution, observations further from the mean have more weight than observations close to the mean.

A boxplot contains information about a distribution's measures of center, spread, and shape.

Median and IQR are more robust to outliers than mean and standard deviation.

Proportion is the mean of a binary variable.

- Last week: describing variables by themselves
- Today: how variables can be related to each other
- Measuring association
- Visualizing bivariate relationships
- Lab: data viz + data wrangling

Use a sample to make inferences about a population

- *Estimand*: the *true* value of the parameter in the population (unknown)
- *Estimate*: a value that is our best guess about the parameter, based on our sample
- *Estimator*: the function (procedure) we apply to the sample to get the estimate

- Letters like \(\beta_1\) are the *truth*, aka population parameters (*estimands*)
- Letters with extra markings like \(\hat{\beta_1}\) are our *estimates* of the truth based on our sample
- Letters like \(X\) are *actual data* from our sample
- Letters with extra markings like \(\bar{X}\) are *calculations* from our sample

(Sample) Data → Calculation → Estimate → Truth

Concept | Symbol |
---|---|
Data | \(X\) |
Calculation | \(\bar{X} = \frac{\sum{X}}{N}\) |
Estimate | \(\hat{\mu}\) |
Truth | \(\mu\) |

\[ \bar{X} = \hat{\mu} \]

\[ X \rightarrow \bar{X} \rightarrow \hat{\mu} \xrightarrow{\text{🤞 hopefully 🤞}} \mu \]
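This chain can be sketched in R with simulated data, where (unlike with real data) we know the truth and can check how close the estimate lands. All names here are illustrative:

```r
# Data -> Calculation -> Estimate -> (hopefully) Truth,
# with a simulated population where the truth is known
set.seed(123)
mu <- 10                      # truth: the population parameter (estimand)
X  <- rnorm(50, mean = mu)    # data: a sample of N = 50 observations
mu_hat <- sum(X) / length(X)  # calculation: apply the estimator (sample mean)
mu_hat                        # estimate: our best guess about mu
```

With a different seed or a larger sample, the estimate changes; the estimand does not.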

A *conditional distribution* is the distribution of one variable given the value of another variable

- Summarizes conditional distributions

- Conditional expectation \(E[Y|X]\):
  - Expected (typical/average) \(Y\), given the value of \(X\)
- Independence (no relationship): \(E[Y|X] = E[Y]\)
  - "Knowing \(X\) doesn't affect my expectation of \(Y\)"
  - Learning about \(X\) does not help us to predict \(Y\)
  - Our best guess remains the typical value of \(Y\)

treatment | att_start_mn | att_end_mn |
---|---|---|
Control | 8.859375 | 8.453125 |
Treated | 9.607843 | 10.000000 |
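A table like this is a set of conditional means, \(E[\text{attention} \mid \text{treatment}]\). A minimal sketch with `dplyr`, using made-up numbers (the data frame and column names are assumptions, not the course data):

```r
library(dplyr)

# Hypothetical data: attention scores before and after, for two groups
experiment <- tibble(
  treatment = rep(c("Control", "Treated"), each = 4),
  att_start = c(8, 9, 8, 10, 9, 10, 9, 11),
  att_end   = c(8, 8, 9, 9, 10, 10, 9, 11)
)

# Conditional means: average Y within each value of the grouping variable
experiment |>
  group_by(treatment) |>
  summarise(
    att_start_mn = mean(att_start),
    att_end_mn   = mean(att_end)
  )
```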

- We would consider two variables to be *related* if knowing something about one of them tells you something about the other
- Variables are *dependent* on each other if telling you the value of one gives you information about the distribution of the other
- Variables are *correlated* if knowing whether one of them is unusually high gives you information about whether the other is unusually high (positive correlation) or unusually low (negative correlation)
- Explaining one variable \(Y\) with another \(X\) means *predicting* your \(Y\) by looking at the distribution of \(Y\) for your value of \(X\)


- Voters who donate money to political candidates are usually wealthy.
- Cities with more crime tend to hire more police officers.
- Female legislators on average speak more emotionally about women-related issues as compared to male parliamentarians.
- Most candidates who win political office received a lot of campaign donations.

\[Cov(X,Y) = \frac{\overbrace{\sum^N_{i = 1}\overbrace{(X_i - \bar{x})}^{\text{Deviation of }X_i\\\text{from mean of X}} \times\overbrace{(Y_i-\bar{y})}^{\text{Deviation of }Y_i\\\text{from mean of Y}}}^{\text{Sum of the product of the deviations}\\\text{across all observations}}}{\underbrace{N}_{\text{Number of observations}}}\]

- conveys information about co-occurrence of the values in variables
- positive values indicate direct relationship (positive correlation)
- negative values indicate inverse relationship (negative correlation)
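The formula above can be checked by hand in R. One caveat worth flagging: the slide's formula divides by \(N\), while R's built-in `cov()` divides by \(N - 1\), so for small samples the two differ by the factor \(N/(N-1)\). The vectors below are illustrative:

```r
# Covariance by hand versus R's built-in cov()
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)
n <- length(x)

# The slide's formula: divide the summed products of deviations by N
cov_pop <- sum((x - mean(x)) * (y - mean(y))) / n

# R's cov() divides by N - 1 instead; rescaling reconciles the two
cov_pop                # 1.6
cov(x, y)              # 2
cov_pop * n / (n - 1)  # matches cov(x, y)
```

Both versions always agree in sign, which is what the bullet points above are about.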

\[corr(X,Y)=\frac{Cov(X,Y)}{\underbrace{\sigma_X}_{\text{Standard }\\\text{Deviation }\\\text{of X}}\underbrace{\sigma_Y}_{\text{Standard}\\\text{Deviation}\\\text{of Y}}}\]

- rescaled covariance to \(corr(X,Y) \in [-1,1]\): extreme values indicate stronger relationship
- sometimes denoted by letter \(r\)
- says nothing about *how much* \(Y\) changes when \(X\) changes
- has no units and will not be affected by a linear change in the units (e.g., going from centimeters to inches)
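The unit-invariance is easy to verify with `cor()`. The height/weight numbers below are made up for illustration:

```r
# Correlation is unit-free: a linear change of units leaves r unchanged
cm <- c(150, 160, 170, 180, 190)  # heights in centimeters
kg <- c(55, 60, 72, 80, 84)       # weights in kilograms

r_metric   <- cor(cm, kg)
r_imperial <- cor(cm / 2.54, kg * 2.20462)  # inches and pounds

all.equal(r_metric, r_imperial)  # TRUE: same r in either system of units
```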

r | Rough meaning |
---|---|
±0.1–0.3 | Modest |
±0.3–0.5 | Moderate |
±0.5–0.8 | Strong |
±0.8–0.9 | Very strong |

\[\beta_X = \frac{Cov(X,Y)}{\underbrace{\sigma^2_X}_{\text{Variance}\\\text{of X}}}\]

How much \(Y\) changes, on average, as \(X\) increases by one unit
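Since the bivariate regression slope is covariance over the variance of \(X\), we can compute it directly and confirm that `lm()` returns the same number (illustrative data again):

```r
# The bivariate OLS slope equals Cov(X, Y) / Var(X)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)

slope    <- cov(x, y) / var(x)       # covariance over variance of X
slope_lm <- coef(lm(y ~ x))[["x"]]   # slope from a fitted regression

all.equal(slope, slope_lm)  # TRUE
```

Unlike the correlation, this quantity has units (units of \(Y\) per unit of \(X\)) and does change when you rescale \(X\).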

- We have to choose where the y-axes start and stop
- We can force the two trends to line up however we want

`ggplot2`

```r
library(ggplot2)

ggplot(
  weather,
  aes(x = datetime, y = tempmax)
) +
  geom_line() +
  geom_smooth() +
  scale_y_continuous(
    sec.axis = sec_axis(
      trans = ~ (. * 9 / 5) + 32,
      name = "Fahrenheit"
    )
  ) +
  labs(
    x = NULL, y = "Celsius",
    title = "Daily high temperatures in Mannheim",
    subtitle = "January 1, 2021–December 31, 2021",
    caption = "Source: visualcrossing.com"
  )
```

`ggplot2`

```r
library(ggplot2)
library(patchwork)

temp_plot <- ggplot(
  weather,
  aes(x = datetime, y = tempmax)
) +
  geom_line() +
  geom_smooth() +
  labs(
    x = NULL,
    y = "Temperature (ºC)"
  )

humid_plot <- ggplot(
  weather,
  aes(x = datetime, y = humidity)
) +
  geom_line() +
  geom_smooth() +
  labs(
    x = NULL,
    y = "Humidity (%)"
  )

temp_plot + humid_plot +
  plot_layout(
    ncol = 1,
    heights = c(0.7, 0.3)
  )
```

- Problem Set 3 (Visualization and Data Wrangling)
- Readings/videos for week 4 (Intro to Causality)