# Describing Relationships

Data Analytics and Visualization with R
Session 3

Viktoriia Semenova

University of Mannheim
Spring 2023

# Warm up

## Housekeeping: Homework

• Questions: message me on Slack or post on GitHub Discussions
• Include the lines of code which produce the error and the error text when asking for help
• Deadline for Problem Set 3: Monday 23:59 (not noon)
• Office hours Monday 16:00-17:00 (and Friday)

## Quiz: Which of these statements are correct?

1. The mean and mode of the variable always have to be present in the data.

2. When estimating the variance of a distribution, observations further from the mean have more weight than observations close to the mean.

3. A boxplot contains information about a distribution’s measures of center, spread, and shape.

4. Median and IQR are more robust to outliers than mean and standard deviation.

5. Proportion is the mean of a binary variable.

## Will these two graphs look different?

### Plot A

ggplot(
data = governors,
mapping = aes(
x = year,
y = lived_after
)
) +
geom_point(alpha = 0.5)

### Plot B

ggplot() +
geom_point(
data = governors,
mapping = aes(
x = year,
y = lived_after
),
alpha = 0.5
)

## Today

• Last week: describing variables by themselves
• Today: how variables can be related to each other
• Measuring association
• Visualizing bivariate relationships
• Lab: data viz + data wrangling

## What is Statistics?

Use a sample to make inferences about a population

## Terminology

• Estimand: the true value of the parameter in the population (unknown)
• Estimate: a value which is our best guess about a parameter based on our sample
• Estimator: the function (procedure) we apply to get the estimate

## Notation

### Greek

• Letters like $\beta_1$ are the truth/estimands, aka population parameters

• Letters with extra markings like $\hat{\beta}_1$ are our estimates of the truth based on our sample

### Latin

• Letters like $X$ are actual data from our sample

• Letters with extra markings like $\bar{X}$ are calculations from our sample

## Estimating the Truth

(Sample) Data → Calculation → Estimate → Truth

| Data | Calculation | Estimate | Truth |
|---|---|---|---|
| $X$ | $\bar{X} = \frac{\sum{X}}{N}$ | $\hat{\mu}$ | $\mu$ |

$\bar{X} = \hat{\mu}$

$X \rightarrow \bar{X} \rightarrow \hat{\mu} \xrightarrow{\text{🤞 hopefully 🤞}} \mu$
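
The chain above can be simulated. A minimal sketch in base R, assuming a made-up population (normal with $\mu = 5$ and $\sigma = 2$; these numbers are illustrative and not taken from the course data):

```r
# Hypothetical population parameters (assumptions for illustration only)
mu <- 5       # estimand: the true population mean
sigma <- 2

set.seed(2023)                                # make the sample reproducible
X <- rnorm(n = 1000, mean = mu, sd = sigma)   # sample data

# Estimator: the sample mean, written out as sum(X) / N
x_bar <- sum(X) / length(X)

# Estimate: our best guess about mu from this one sample
mu_hat <- x_bar
mu_hat - mu   # estimation error: small, but not exactly zero
```

In real work `mu` is unknown, so the last line cannot be computed; the simulation only shows why the estimate is "hopefully" close to the truth.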

## Conditional (Marginal) Distributions

A conditional distribution is the distribution of one variable given the value of another variable

## Conditional Means

• Summarizes conditional distributions
• Conditional expectation $E[Y|X]$:
  • Expected (typical/average) $Y$, given the value of $X$
• Independence (no relationship) $E[Y|X] = E[Y]$:
  • “Knowing $X$ doesn’t affect my expectation of $Y$”
  • Learning about $X$ does not help us to predict $Y$
  • Our best guess remains the typical value of $Y$

| treatment | att_start_mn | att_end_mn |
|---|---|---|
| Control | 8.859375 | 8.453125 |
| Treated | 9.607843 | 10.000000 |

## Does Time of Day Help Predict Happiness?

cookies_data
# A tibble: 10 × 3
<dbl>   <int> <chr>
1       0.5       1 Morning
2       2         2 Morning
3       1         3 Morning
4       2.5       4 Morning
5       3         5 Morning
6       1.5       6 Afternoon
7       2         7 Afternoon
8       2.5       8 Afternoon
9       2         9 Afternoon
10       3        10 Afternoon
cookies_data %>%
group_by(time) %>%
summarise_all(mean)
# A tibble: 2 × 3
<chr>         <dbl>   <dbl>
1 Afternoon       2.2       8
2 Morning         1.8       3
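
To compare the conditional means $E[Y|X]$ with the unconditional mean $E[Y]$, the data can be rebuilt by hand. A sketch in base R; the column names `happiness` and `cookies` are assumptions (the printed output does not show them), and only `time` appears in the original code.

```r
# Values copied from the printed tibble; `happiness` and `cookies`
# are assumed column names, `time` comes from the group_by() call
cookies_data <- data.frame(
  happiness = c(0.5, 2, 1, 2.5, 3, 1.5, 2, 2.5, 2, 3),
  cookies   = 1:10,
  time      = rep(c("Morning", "Afternoon"), each = 5)
)

# Unconditional mean: E[Y]
overall_mean <- mean(cookies_data$happiness)

# Conditional means: E[Y | X = time of day]
cond_means <- tapply(cookies_data$happiness, cookies_data$time, mean)

overall_mean  # 2
cond_means    # Afternoon 2.2, Morning 1.8
```

The conditional means differ from the overall mean, so in this small sample knowing the time of day changes our best guess about the outcome.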

# Relationships

## Which of These Statements Describe Correlations?

1. Voters who donate money to political candidates are usually wealthy.
2. Cities with more crime tend to hire more police officers.
3. Female legislators on average speak more emotionally about women-related issues than male legislators.
4. Most candidates who win political office received a lot of campaign donations.

# Measuring Association

## Covariance

$Cov(X,Y) = \frac{\overbrace{\sum^N_{i = 1}\overbrace{(X_i - \bar{X})}^{\substack{\text{Deviation of }X_i\\ \text{from mean of }X}} \times\overbrace{(Y_i-\bar{Y})}^{\substack{\text{Deviation of }Y_i\\ \text{from mean of }Y}}}^{\substack{\text{Sum of the product of the deviations}\\ \text{across all observations}}}}{\underbrace{N}_{\text{Number of observations}}}$

• conveys information about co-occurrence of the values in variables
• positive values indicate direct relationship (positive correlation)
• negative values indicate inverse relationship (negative correlation)

## Covariance: Example

x <- c(4, 13, 19, 25, 29, 10, 30)
y <- c(10, 12, 28, 32, 38, 35, 11)
data <- data.frame(x, y)
knitr::kable(data)

| x | y |
|---|---|
| 4 | 10 |
| 13 | 12 |
| 19 | 28 |
| 25 | 32 |
| 29 | 38 |
| 10 | 35 |
| 30 | 11 |

data %>%
summarise_all(mean)
         x        y
1 18.57143 23.71429
cov(x, y)
[1] 37.85714
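
The formula can be checked by hand in base R. One caveat: R's `cov()` divides by $N - 1$ (the sample covariance) rather than by $N$, so the two versions differ noticeably in small samples.

```r
x <- c(4, 13, 19, 25, 29, 10, 30)
y <- c(10, 12, 28, 32, 38, 35, 11)
N <- length(x)

# Sum of the products of the deviations from the means
sum_products <- sum((x - mean(x)) * (y - mean(y)))

cov_N  <- sum_products / N        # divides by N, as in the formula
cov_N1 <- sum_products / (N - 1)  # divides by N - 1, as cov() does

cov_N    # about 32.45
cov_N1   # about 37.86, matches cov(x, y)
```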

## Correlation Coefficient

$corr(X,Y)=\frac{Cov(X,Y)}{\underbrace{\sigma_X}_{\substack{\text{Standard}\\ \text{Deviation}\\ \text{of }X}}\underbrace{\sigma_Y}_{\substack{\text{Standard}\\ \text{Deviation}\\ \text{of }Y}}}$

• covariance rescaled so that $corr(X,Y) \in [-1,1]$: values closer to $-1$ or $1$ indicate a stronger linear relationship
• sometimes denoted by letter $r$
• says nothing about how much $Y$ changes when $X$ changes
• has no units and will not be affected by a linear change in the units (e.g., going from centimeters to inches)
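
Both properties can be verified numerically. A sketch in base R using the `x` and `y` vectors from the covariance example, treating `x` as a length in centimeters (an assumption made only for this illustration):

```r
x <- c(4, 13, 19, 25, 29, 10, 30)    # pretend these are centimeters
y <- c(10, 12, 28, 32, 38, 35, 11)

# Correlation as rescaled covariance
r_manual <- cov(x, y) / (sd(x) * sd(y))
r_manual          # about 0.31, same as cor(x, y)

# Changing units rescales the covariance but not the correlation
x_inches <- x / 2.54
cov(x_inches, y)  # smaller than cov(x, y)
cor(x_inches, y)  # unchanged
```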

## Correlation Values

| $r$ | Rough meaning |
|---|---|
| ±0.1–0.3 | Modest |
| ±0.3–0.5 | Moderate |
| ±0.5–0.8 | Strong |
| ±0.8–0.9 | Very strong |

## Slope of the Regression Line

$\beta_X = \frac{Cov(X,Y)}{\underbrace{\sigma^2_X}_{\text{Variance of }X}}$

How much $Y$ changes, on average, as $X$ increases by one unit
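
As a numerical check, the least-squares slope reported by `lm()` equals the covariance divided by the variance of $X$. A sketch in base R, reusing the vectors from the covariance example:

```r
x <- c(4, 13, 19, 25, 29, 10, 30)
y <- c(10, 12, 28, 32, 38, 35, 11)

# Slope computed by hand: covariance over the variance of x
slope_manual <- cov(x, y) / var(x)

# Slope estimated by a simple linear regression of y on x
slope_lm <- unname(coef(lm(y ~ x))["x"])

slope_manual  # about 0.38
slope_lm      # same value
```

Unlike the correlation coefficient, the slope carries units: here, $Y$ rises by about 0.38 units on average for each one-unit increase in $X$.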

# Visualizing Relationships

## The Problem: Too Much Freedom

• We have to choose where the y-axes start and stop
• We can force the two trends to line up however we want

## Adding a second scale in ggplot2

ggplot(
weather,
aes(x = datetime, y = tempmax)
) +
geom_line() +
geom_smooth() +
scale_y_continuous(
sec.axis =
sec_axis(
trans = ~ (. * 9 / 5) + 32,
name = "Fahrenheit"
)
) +
labs(
x = NULL, y = "Celsius",
title = "Daily high temperatures in Mannheim",
subtitle = "January 1, 2021–December 31, 2021",
caption = "Source: visualcrossing.com"
)

## Adding a second scale in ggplot2

gov <- governors %>%
group_by(party) %>%
summarize(total = n())
gov
# A tibble: 3 × 2
party       total
<chr>       <int>
1 Democrat      557
2 Republican    527
3 Third party     8

ggplot(
gov,
aes(
x = party, y = total,
fill = party
)
) +
geom_col() +
scale_y_continuous(
sec.axis = sec_axis(
trans = ~ . / sum(gov$total),
labels = scales::percent
)
) +
guides(fill = "none") +
scale_fill_viridis_d()
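
The secondary axis here is only a relabeling of the primary one: each count is divided by the overall total. A sketch in base R with the counts from the printed `gov` table, typed in by hand:

```r
# Counts copied from the printed `gov` tibble
gov <- data.frame(
  party = c("Democrat", "Republican", "Third party"),
  total = c(557L, 527L, 8L)
)

# Same transformation as in sec_axis(): count / sum of all counts
gov$share <- gov$total / sum(gov$total)

round(gov$share, 3)  # 0.510 0.483 0.007
```

Because the transformation is one-to-one, both axes describe the same bars and there is no freedom to line them up differently.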

## Alternative: Use Multiple Plots

library(patchwork)
temp_plot <- ggplot(
weather,
aes(x = datetime, y = tempmax)
) +
geom_line() +
geom_smooth() +
labs(
x = NULL,
y = "Temperature (°C)"
)

humid_plot <- ggplot(
weather,
aes(x = datetime, y = humidity)
) +
geom_line() +
geom_smooth() +
labs(
x = NULL,
y = "Humidity (%)"
)

temp_plot + humid_plot +
plot_layout(
ncol = 1,
heights = c(0.7, 0.3)
)

## To-Do List

• Problem Set 3 (Visualization and Data Wrangling)
• Readings/videos for week 4 (Intro to Causality) 