Describing Relationships

R for Data Analysis
Session 3

Viktoriia Semenova

University of Mannheim
Fall 2023


Plan for Today

  • Organisation & Homework Issues
  • Refresher on quantifying relationships between variables
  • More visualization with ggplot2 and some data wrangling with dplyr


  • Start problem sets earlier (come on Monday if you have questions/problems!)
  • Find teammates by next Wednesday’s class (September 27, 2023)
    • 2-3 people per group
    • Problem sets will ask for explicit contributions from each member
    • You can still work on your own if you strongly prefer that

Workflow: Projects

  • Make sure you are in the correct project:
    • you cannot push to r4da-labs, only to your problem sets
    • if you are not in a project, you may not have the Git pane
  • The project knows your working directory, so you don’t need setwd()
  • Do not create multiple copies of problem_set.qmd (or other) files
    • the whole point of git is to prevent that from happening
    • commit and push regularly and if you need a previous version of your project, you can find it in history

Workflow: Naming

  • Object names must:
    • start with a letter
    • contain only letters, numbers, _, and .
    • be descriptive
  • janitor package helps with cleaning the variable names in datasets
    • df <- clean_names(df)
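
A minimal sketch of what clean_names() does, assuming the janitor package is installed. The data frame and its column names are made up for illustration, and the cleaning step is skipped if janitor is not available:

```r
# Hypothetical data frame with awkward column names (illustration only)
df <- data.frame(
  `Respondent ID` = c(1, 2),
  `Age (years)` = c(34, 51),
  check.names = FALSE
)

# Clean the names only if janitor is installed
if (requireNamespace("janitor", quietly = TRUE)) {
  df <- janitor::clean_names(df)
}

# Inspect the resulting column names
names(df)
```

With janitor available, the names become lower-case snake_case, so they can be used without backticks.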

Workflow: Git

  • git does not track some of the files in your folders/repos
    • ignored files are listed in .gitignore
    • you can edit this file if necessary
  • GitHub blocks files larger than 100 MB
    • Do not commit large files
    • Add large files’ names to .gitignore, commit the updated .gitignore file

Workflow: Code Style

Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread.

un_votes %>% 
  filter(un_votes > 5, issue == "Human rights")

un_votes %>% filter(un_votes > 5, issue == "Human rights")

un_votes %>% 
  filter(un_votes > 5, 
         issue == "Human rights")

un_votes %>% filter(un_votes>5,issue=="Human rights")

un_votes %>% 
  filter(un_votes > 5,       issue == "Human rights")

un_votes %>%
  filter(
    un_votes > 5, 
    issue == "Human rights"
  )

Workflow: Code Style Tools

  1. Select the code inside the chunk
  2. Use CTRL/CMD + SHIFT + A

There are also packages like styler that allow you to style the entire file or even all files in a directory:
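
A small sketch of how this might look, assuming the styler package is installed. It restyles a string of code with style_text(), which leaves files untouched; the file path in the comments is a hypothetical placeholder:

```r
# A badly formatted line of code, stored as text for the demonstration
messy <- 'un_votes %>% filter(un_votes>5,issue=="Human rights")'

if (requireNamespace("styler", quietly = TRUE)) {
  # style_text() returns the restyled code without touching any file
  styled <- styler::style_text(messy)
  print(styled)

  # To restyle files in place (not run here; the path is a placeholder):
  # styler::style_file("path/to/script.R")
  # styler::style_dir(".")
}
```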


Especially when rendering to PDF, you will want to make sure that your code is not getting cut off on the page. The lintr package shows you which lines are problematic so that you can adjust them (it also checks many other things):
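
As a rough illustration, assuming lintr is installed, the sketch below writes a throwaway script with one overly long line to a temporary file and checks it with the line-length linter only. The 80-character limit is an assumption; pick whatever fits your PDF layout:

```r
# Write a small script with one overly long line to a temporary file
script <- tempfile(fileext = ".R")
writeLines(c(
  "x <- 1",
  paste0("long_line <- c(", paste(rep("1", 60), collapse = ", "), ")")
), script)

if (requireNamespace("lintr", quietly = TRUE)) {
  # Flag lines longer than 80 characters
  lints <- lintr::lint(script, linters = lintr::line_length_linter(80))
  print(lints)
}
```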


More on styling on course website

Quiz: Using Packages

  • What is the difference between these lines of code?
  • Which lines print out the dataset but do not store the object in the environment?
  • Which lines require the readr package to be installed and which do not?
  • Which lines require the readr package to be loaded and which do not?

df <- read_csv("data/beauty.csv")

beauty <- read_csv("/Users/vktrsmnv/Desktop/r4da/website/data/beauty.csv")

df <- read.csv("data/beauty.csv")

Quiz: Plots

  • Will these two graphs look different?

Plot A

ggplot(
  data = governors,
  mapping = aes(
    x = year,
    y = lived_after
  )
) +
  geom_point(alpha = 0.5)

Plot B

ggplot() +
  geom_point(
    data = governors,
    mapping = aes(x = year, y = lived_after),
    alpha = 0.5
  )

Association and Relationships

Conditional (Marginal) Distributions

A conditional distribution is the distribution of one variable given the value of another variable; a marginal distribution describes one variable while ignoring the other.

Conditional Means

  • Summarizes conditional distributions
  • Conditional expectation \(E[Y|X]\):
    • Expected (typical/average) \(Y\), given the value of \(X\)
  • Independence (no relationship) \(E[Y|X] = E[Y]\):
    • “Knowing \(X\) doesn’t affect my expectation of \(Y\)”
    • Learning about \(X\) does not help us to predict \(Y\)
    • Our best guess remains the typical value of \(Y\)

oecd  avg_PEIIndexi
   0       50.34753
   1       73.49882
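
The idea of a conditional mean can be sketched in a few lines of base R. The toy data below are made up purely for illustration (they are not the course data): the code compares the overall mean \(E[Y]\) with the group-wise means \(E[Y|X]\).

```r
# Hypothetical toy data: x is a binary group indicator, y is an outcome
toy <- data.frame(
  x = c(0, 0, 0, 1, 1, 1),
  y = c(2, 4, 6, 7, 9, 11)
)

# Unconditional mean E[Y]
overall_mean <- mean(toy$y)

# Conditional means E[Y | X = x], one per value of x
conditional_means <- tapply(toy$y, toy$x, mean)

overall_mean
conditional_means
```

Here the conditional means differ from the overall mean, so knowing \(X\) changes our best guess of \(Y\). Under independence they would all equal \(E[Y]\). The same summary can be written with dplyr as group_by(x) followed by summarize().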


Which of These Statements Describe Relationships?

  1. Voters who donate money to political candidates are usually wealthy.
  2. Cities with more crime tend to hire more police officers.
  3. Female legislators on average speak more emotionally about women-related issues as compared to male parliamentarians.
  4. Most candidates who win political office received a lot of campaign donations.

Measuring Association


\[Cov(X,Y) = \frac{\overbrace{\sum^N_{i = 1}\overbrace{(X_i - \bar{X})}^{\substack{\text{Deviation of }X_i\\ \text{from mean of }X}} \times \overbrace{(Y_i - \bar{Y})}^{\substack{\text{Deviation of }Y_i\\ \text{from mean of }Y}}}^{\substack{\text{Sum of the products of the deviations}\\ \text{across all observations}}}}{\underbrace{N}_{\text{Number of observations}}}\]

  • covariance conveys information about how the values of two variables co-occur
  • positive values indicate direct relationship (positive correlation)
  • negative values indicate inverse relationship (negative correlation)

Covariance: Example

x <- c(4, 13, 19, 25, 29, 10, 30)
y <- c(10, 12, 28, 32, 38, 35, 11)
data <- data.frame(x, y)

 x  y
 4 10
13 12
19 28
25 32
29 38
10 35
30 11

# mean of each column (requires dplyr)
data %>%
  summarize(x = mean(x), y = mean(y))
         x        y
1 18.57143 23.71429

cov(x, y)
[1] 37.85714
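
To connect the formula to the output of cov(), the covariance can also be computed by hand. One detail worth knowing: the formula above divides by \(N\), whereas R's cov() divides by \(N - 1\) (the sample covariance). The sketch below computes both; the variable names are only illustrative.

```r
x <- c(4, 13, 19, 25, 29, 10, 30)
y <- c(10, 12, 28, 32, 38, 35, 11)
n <- length(x)

# Sum of the products of the deviations from the means
sum_products <- sum((x - mean(x)) * (y - mean(y)))

cov_population <- sum_products / n        # divides by N, as in the formula
cov_sample     <- sum_products / (n - 1)  # divides by N - 1, as cov() does

# The sample version reproduces cov(x, y)
all.equal(cov_sample, cov(x, y))
```

The two versions differ noticeably here because there are only seven observations; with large \(N\) the difference becomes negligible.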

Covariance: Illustration

Correlation Coefficient

\[corr(X,Y)=\frac{Cov(X,Y)}{\underbrace{\sigma_X}_{\substack{\text{Standard}\\ \text{deviation}\\ \text{of }X}}\underbrace{\sigma_Y}_{\substack{\text{Standard}\\ \text{deviation}\\ \text{of }Y}}}\]

  • covariance rescaled to \(corr(X,Y) \in [-1,1]\): values closer to \(-1\) or \(1\) indicate a stronger linear relationship
  • sometimes denoted by letter \(r\)
  • says nothing about how much \(Y\) changes when \(X\) changes
  • has no units and will not be affected by a linear change in the units (e.g., going from centimeters to inches)
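
These properties can be checked with the vectors from the covariance example. The sketch below rebuilds the correlation from cov() and sd(), then rescales x as if converting centimeters to inches (the unit interpretation is only an assumption for illustration):

```r
x <- c(4, 13, 19, 25, 29, 10, 30)
y <- c(10, 12, 28, 32, 38, 35, 11)

# Correlation as covariance divided by the product of the standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))

# Built-in equivalent
r_builtin <- cor(x, y)

# Changing units (e.g., centimeters to inches) leaves r unchanged,
# while the covariance is rescaled by the same factor
x_inches   <- x / 2.54
r_inches   <- cor(x_inches, y)
cov_inches <- cov(x_inches, y)

all.equal(r_manual, r_builtin)
all.equal(r_inches, r_builtin)
```

The covariance shrinks by a factor of 2.54 after the conversion, but the correlation stays the same, which is why it is easier to compare across variables.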

Correlation Values

r           Rough meaning
±0.1–0.3    Modest
±0.3–0.5    Moderate
±0.5–0.8    Strong
±0.8–0.9    Very strong




Example: Does Time of Day Help Predict Happiness?

# A tibble: 10 × 3
   happiness cookies time     
       <dbl>   <int> <chr>    
 1       0.5       1 Morning  
 2       2         2 Morning  
 3       1         3 Morning  
 4       2.5       4 Morning  
 5       3         5 Morning  
 6       1.5       6 Afternoon
 7       2         7 Afternoon
 8       2.5       8 Afternoon
 9       2         9 Afternoon
10       3        10 Afternoon
cookies_data %>%
  group_by(time) %>% 
  summarize(happiness = mean(happiness), cookies = mean(cookies))
# A tibble: 2 × 3
  time      happiness cookies
  <chr>         <dbl>   <dbl>
1 Afternoon       2.2       8
2 Morning         1.8       3

Slope of the Regression Line

\[\beta_X = \frac{Cov(X,Y)}{\underbrace{\sigma_X^2}_{\substack{\text{Variance}\\ \text{of }X}}}\]

How much \(Y\) changes, on average, as \(X\) increases by one unit
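
A quick numerical check, using the cookie data from the happiness example: the slope of a simple regression equals the covariance of \(X\) and \(Y\) divided by the variance of \(X\), and this matches the coefficient reported by lm(). The object names are only illustrative.

```r
# Cookie data from the happiness example
cookies   <- 1:10
happiness <- c(0.5, 2, 1, 2.5, 3, 1.5, 2, 2.5, 2, 3)

# Slope as covariance divided by the variance of X
slope_manual <- cov(cookies, happiness) / var(cookies)

# Slope estimated by lm()
fit <- lm(happiness ~ cookies)
slope_lm <- unname(coef(fit)["cookies"])

all.equal(slope_manual, slope_lm)
```

Unlike the correlation coefficient, the slope keeps the units of the variables: it is expressed in units of happiness per additional cookie.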

Regression Line as Conditional Mean

Cookies and Happiness (again)

Goal: Draw a line that approximates the relationship

Prediction Ignoring Number of Cookies Eaten

Fitting the Line Through Every Observation

LOESS (Locally Estimated Scatterplot Smoothing) Curve

OLS Regression Line

Regression Line Minimizes the Errors of Prediction
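
One way to see this with the cookie data is to compare the sum of squared prediction errors of the OLS line with that of other candidate lines. This is a sketch: the helper sse() and the alternative line are made up for illustration.

```r
cookies   <- 1:10
happiness <- c(0.5, 2, 1, 2.5, 3, 1.5, 2, 2.5, 2, 3)

# Sum of squared prediction errors for a line with given intercept and slope
sse <- function(intercept, slope) {
  predicted <- intercept + slope * cookies
  sum((happiness - predicted)^2)
}

# OLS line
fit <- lm(happiness ~ cookies)
sse_ols <- sse(coef(fit)[["(Intercept)"]], coef(fit)[["cookies"]])

# Ignoring cookies: always predict the mean of happiness (a flat line)
sse_mean <- sse(mean(happiness), 0)

# An arbitrary alternative line (values chosen only for illustration)
sse_other <- sse(0.5, 0.3)

c(ols = sse_ols, mean_only = sse_mean, other = sse_other)
```

The OLS line has the smallest sum of squared errors of the three. No other straight line through these points can do better on this criterion.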

Visualizing Relationships

The Dangers of Dual y-axes

Spurious correlation between divorce rate and margarine consumption

Source: Tyler Vigen’s spurious correlations

The Problem: Too Much Freedom

  • We have to choose where the y-axes start and stop
  • We can force the two trends to line up however we want

Example from The Economist

Fine When They Measure the Same Thing

Alternative: Use Multiple Plots

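# Requires ggplot2 and patchwork.
# `weather` is a placeholder: the data argument is missing from the source.
temp_plot <- ggplot(
  data = weather,
  aes(x = datetime, y = tempmax)
) +
  geom_line() +
  geom_smooth() +
  labs(
    x = NULL,
    y = "Temperature (°C)"
  )

humid_plot <- ggplot(
  data = weather,
  aes(x = datetime, y = humidity)
) +
  geom_line() +
  geom_smooth() +
  labs(
    x = NULL,
    y = "Humidity (%)"
  )

# Stack the two plots vertically with patchwork
temp_plot + humid_plot +
  plot_layout(
    ncol = 1,
    heights = c(0.7, 0.3)
  )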

Getting Lab Material

  • Open the r4da-labs.Rproj file to open the labs project.
  • Go into Git pane.
  • If there are any files you changed, you will see them in the window.
    • Select all these files.
    • Click on Commit, write a commit message, and save it.
    • If you click on Push, it will give you a git error message.
  • Click on Pull to download the newly uploaded files.