# Multiple Linear Regression

R for Data Analysis
Session 8

Viktoriia Semenova

University of Mannheim
Fall 2023

## Agenda for Today

Multiple Linear Regression

Selection of Explanatory Variables

Model Conditions & Assumptions

Intuition for Statistical Control

# Selection of Independent Variables

## Data Generating Process

• An unknown process in the real world that “generates” the data we are interested in
• In the social sciences, the DGP is often not precisely known
• Our understanding of the DGP comes from theory and subject knowledge
• The variables we choose to include in a regression should depict our idea of the DGP

## Directed Acyclic Graphs (DAGs)

Nodes: variables in the DGP
Arrows: causal relationships in the DGP (associations)
Direction: from the cause variable to the caused variable

Directed: Edges are arrows that point from one node to another

Acyclic: You can’t cycle back to a node (and arrows only have one direction)

Graph: Well…it is a graph.

## How Do DAGs Help Us?

DAGs represent the underlying data-generating process

• help clarify study question and relevant concepts
• provide a common language to talk about theories and causal relationships (a systematic way to talk about what is missing, like a node or a path)
• make our assumptions about DGP explicit
• help determine whether the effect of interest can be identified from available data
• allow us to determine which variables we need to account for to be able to estimate the causal effect (isolate specific pathways)
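This last point can be automated: given a DAG, software can compute the required adjustment set. A minimal sketch with the dagitty R package (assumed to be installed; the money/votes confounding example from later in this session):

```r
# Sketch: encode a simple confounding DAG and ask which variables
# must be adjusted for (assumes the dagitty package is installed)
library(dagitty)

dag <- dagitty("dag {
  money   -> votes
  quality -> money
  quality -> votes
}")

# Minimal adjustment set for identifying the effect of money on votes
adjustmentSets(dag, exposure = "money", outcome = "votes")
# prints: { quality }
```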

# Types of Association

## Major Types of Association

#### Confounding (Fork)

Common cause

#### Mediation (Chain)

Intermediate variable

#### Collision (Inverted Fork)

Selection / endogeneity

## Confounding: Effect of Money on Elections

Handling Confounders Means:

1. Find the part of campaign money that is explained by quality, remove it.

2. Find the part of win margin that is explained by quality, remove it.

3. Find the relationship between the residual part of money and residual part of win margin. This is the causal effect.
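These three steps can be sketched in R with simulated data (the data-generating coefficients below are invented for illustration; this is the Frisch–Waugh–Lovell logic):

```r
# Sketch: "remove the confounder, then regress the leftovers"
# Simulated data; the true effect of money is set to 0.5
set.seed(8)
n       <- 1000
quality <- rnorm(n)                                 # confounder
money   <- 0.7 * quality + rnorm(n)                 # quality drives fundraising
margin  <- 0.5 * money + 0.9 * quality + rnorm(n)   # quality also drives margin

res_money  <- resid(lm(money  ~ quality))  # step 1: purge quality from money
res_margin <- resid(lm(margin ~ quality))  # step 2: purge quality from margin
fwl <- unname(coef(lm(res_margin ~ res_money))[2])  # step 3: residual-on-residual

# Matches the money coefficient from the multiple regression exactly
multi <- unname(coef(lm(margin ~ money + quality))["money"])
round(c(fwl, multi), 3)
```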

## Collider is Masking Effects

Height is unrelated to basketball skill… among professional basketball players

## Which Statements Are Correct?

For the relationship between Beauty and Talent, being a Movie star is:

1. a confounder and thus should be controlled for
2. a mediator and thus should not be controlled for
3. a collider and here we controlled for it when fitting the regression line
4. a collider and accounting for it masks the true relationship between Beauty and Talent

## Collider is Creating Effects

Example from Elwert, Felix, and Christopher Winship. 2014. “Endogenous Selection Bias: The Problem of Conditioning on a Collider Variable.” Annu. Rev. Sociol. 40 (1): 31–53.
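The effect-creating behavior of a collider is easy to reproduce with a small simulation (beauty and talent are drawn independently; the stardom cutoff is arbitrary):

```r
# Sketch: conditioning on a collider induces a spurious association
set.seed(42)
n      <- 5000
beauty <- rnorm(n)
talent <- rnorm(n)               # independent of beauty by construction
star   <- (beauty + talent) > 2  # collider: stardom requires beauty and/or talent

cor(beauty, talent)              # near zero in the full population
cor(beauty[star], talent[star])  # clearly negative among "movie stars"
```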

## Steps to Causal Diagram

1. Identify your treatment $X$ and outcome $Y$ variables
2. List possible variables (nodes) related to the relationship you are trying to identify, including unobserved and unmeasurable ones
3. For simplicity, combine related variables or prune the ones least likely to be important
4. Consider which variables are likely to affect which other variables and draw arrows from one to the other
5. List all paths that connect $X$ to $Y$, regardless of the direction of arrows
6. Identify any pathways that have arrows pointing backwards towards $X$
7. Control for all nodes that point back to $X$ (aka Close Backdoors)

## Paths Glossary

Frontdoor Path

A path where all the arrows point away from Treatment $X$

Backdoor Path

A path where at least one of the arrows points towards Treatment $X$

Open Path

A path in which there is variation in all variables along the path (and no variation in any colliders on that path)

Closed Path

A path in which there is at least one variable with no variation (or a collider with variation)

Our goal: block all backdoor paths to identify the main pathway we care about

## Finding Paths

• $X$ causes $Y$
• $Z$ causes both $X$ and $Y$
• $Z$ confounds the $X⟶Y$ association
• Paths between $X$ and $Y$:
• $X⟶Y$
• $X⟵Z⟶Y$
• $X⟵Z⟶Y$ is a backdoor path
• Even if there were no $X⟶Y$ arrow, $Z$ would connect them

## Finding Paths: Campaign Money Example

Paths between Money and Votes:

1. Money ⟶ Total Votes
2. Money ⟵ Candidate Quality ⟶ Total Votes
• Accounting for Quality closes the backdoor
• In other words, we:
• compare candidates as if they had the same Quality
• remove differences that are predicted by Quality
• hold Quality constant

## Finding Paths: A More Complex DAG

List all paths that connect Money raised with Total Votes (regardless of the direction of arrows). Which of them are backdoor paths?

1. Money ⟶ Total Votes
2. Money ⟶ Hire campaign manager ⟶ Total Votes
3. Money ⟶ Won Election ⟵ Total Votes
4. Money ⟵ Candidate Quality ⟶ Total Votes
5. Money ⟵ District ⟶ Total Votes
6. Money ⟵ Party ⟶ Total Votes
7. Money ⟵ District ⟵ History ⟶ Party ⟶ Total Votes
8. Money ⟵ Party ⟵ History ⟶ District ⟶ Total Votes

## Closing Backdoor Paths Is the Goal

Frontdoor Paths:

1. Money ⟶ Total Votes
2. Money ⟶ Hire campaign manager ⟶ Total Votes

Closed Backdoor Path:

1. Money ⟶ Won Election ⟵ Total Votes

Open Backdoor Paths:

1. Money ⟵ Candidate Quality ⟶ Total Votes
2. Money ⟵ District ⟶ Total Votes
3. Money ⟵ Party ⟶ Total Votes
4. Money ⟵ District ⟵ History ⟶ Party ⟶ Total Votes
5. Money ⟵ Party ⟵ History ⟶ District ⟶ Total Votes
• Adjusting for Quality, District, and Party closes open backdoors. ⟶ Yes!
• Unobserved History then also does not confound Money and Votes.
• Adjusting for Won Election opens a backdoor ⟶ No!
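The full path listing and the adjustment conclusion can be double-checked with dagitty (assumed to be installed; node names are shortened):

```r
# Sketch: the campaign-money DAG, all its paths, and the adjustment set
library(dagitty)

dag <- dagitty("dag {
  money -> votes
  money -> manager -> votes
  money -> won <- votes
  quality -> money ; quality -> votes
  history -> district ; history -> party
  district -> money ; district -> votes
  party -> money ; party -> votes
}")

p <- paths(dag, from = "money", to = "votes")
data.frame(path = p$paths, open = p$open)  # 8 paths, 7 of them open

# Quality, District, and Party close every open backdoor:
adjustmentSets(dag, exposure = "money", outcome = "votes")
```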

# Mechanics of Multiple Linear Regression

## Multiple Linear Regression (MLR)

$y = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \cdots + \hat\beta_k x_k +\hat\varepsilon$

A one-unit increase in $x_k$ is associated with, on average, a $\hat\beta_k$ increase (decrease) in $y$, holding all else constant.

• We obtain our coefficient for $x_k$ independent of all other variables
• We are comparing observations as though they had same value of other variables
• You can also think of it as comparing within values of other variables
• Coefficient $\hat\beta_k$ tells us what predictor $x_k$ adds to our knowledge of $y$ once we know the other predictors in the model
• $\hat \beta_k$ captures the effect of $x_k$, which can be uniquely attributed to this variable $x_k$
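In R, this interpretation applies directly to `lm()` output; a quick sketch with the built-in mtcars data:

```r
# Sketch: a two-predictor MLR with built-in data
fit <- lm(mpg ~ wt + hp, data = mtcars)
coef(fit)
# wt ≈ -3.88: holding horsepower constant, each additional 1000 lbs
# of weight is associated with about 3.9 fewer miles per gallon
```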

## Confounders and Statistical Control

Closing paths means ensuring we are comparing within the same values of confounders. What does this mean statistically?

1. Remove the effect of the confounder W (Age) on X (Beauty)
2. Remove the effect of the confounder W (Age) on Y (Evaluations)
3. Regress the leftovers of Y (Evaluations), the residuals from step 2, on the leftovers of X (Beauty), the residuals from step 1.

## Venn Diagrams: Slope Coefficients

Slope coefficient for $X$ is the ratio between the covariance of $X$ and $Y$ and the variance of $X$:

$\beta_{x} = \frac{cov(x,y)}{var(x)}= \frac{B}{A+B}$

Slope coefficient for $Z$ is the ratio between the covariance of $Z$ and $Y$ and the variance of $Z$:

$\beta_{z} = \frac{cov(z,y)}{var(z)}= \frac{D}{D+E}$

## Venn Diagrams: Statistical Control

Here $X$ and $Z$ are correlated. If we did not include $Z$, our slope would be:

$\beta_{biased} = \frac{B + F}{A+B + F + G}$

If we include $Z$, the slope becomes: $\beta_{unbiased} = \frac{B}{A+B}$

Since we cannot attribute the effect of both $X$ and $Z$ together to one single variable, that part (section F) is tossed out of the slope calculations.
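The bias from leaving out $Z$ can be seen numerically in a simulation (all coefficients are invented for illustration; the true slope on $x$ is 1):

```r
# Sketch: omitted-variable bias when a correlated Z is left out
set.seed(1)
n <- 1000
z <- rnorm(n)
x <- 0.8 * z + rnorm(n)           # x and z are correlated
y <- 1.0 * x + 1.5 * z + rnorm(n) # true slope on x is 1

unname(coef(lm(y ~ x))["x"])      # biased upward: x absorbs part of z's effect
unname(coef(lm(y ~ x + z))["x"])  # close to the true value of 1
```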

# Model Conditions

## Conditions for Inference

Inference on the regression coefficients and predictions are reliable only when the regression assumptions are reasonably satisfied:

• Linearity: There is a linear relationship between the outcome and predictor variables
• Independence: The errors are independent from each other, i.e. knowing the error term for one observation doesn’t tell you anything about the error term for another observation
• Normality: The distribution of errors is approximately normal $\varepsilon|X \sim \mathcal{N}(0, \sigma^2)$
• Constant variance: The variability of the errors is equal for all values of the predictor variable, i.e. the errors are homoscedastic

We will use plots of the residuals to check these assumptions
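A minimal sketch of those residual plots in base R (the model itself is illustrative):

```r
# Sketch: residual diagnostics for a fitted lm model
fit <- lm(mpg ~ wt + hp, data = mtcars)

par(mfrow = c(1, 2))
plot(fitted(fit), resid(fit),           # linearity & constant variance
     xlab = "Predicted values", ylab = "Residuals")
abline(h = 0, lty = 2)
qqnorm(resid(fit)); qqline(resid(fit))  # normality of residuals
```

Calling `plot(fit)` directly also produces these (and further) diagnostic plots automatically.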

## Linearity Assumption

• We expect linearity in parameters (i.e. coefficients), not predictors

• e.g. $Y = e^{\beta_0}X^{\beta_1}\eta$ is nonlinear in its parameters, but $\log(Y) = \beta_0 + \beta_1 \log(X) + \log(\eta)$ is linear

• Diagnostic:

• Check the plot of residuals vs. predicted values for patterns
• If you observe any patterns, you can look at individual plots of residuals vs. each predictor to try to identify the issue
• For binary predictors the assumption is always met $\Rightarrow$ you only need to check the linearity assumption for continuous predictors

• Transformations of variables ($X$ and/or $Y$) could sometimes address the problems
• Consequences of Violation:

• will bias the coefficients and pose problems for uncertainty measures and hypothesis testing

## Independence Assumption

• Examples of violation: if the observations are clustered, e.g.
• there are repeated measures from the same individual, as in longitudinal data
• if classrooms were first sampled followed by sampling individuals within classes
• Diagnostic:
• We can often check the independence condition based on the context of the data and how the observations were collected
• If the data were collected in a particular order, examine a scatterplot of the residuals versus order in which the data were collected
• Consequences of Violation:
• may not bias the coefficient, but will pose problems for uncertainty measures (standard errors) and hypothesis testing (p-values etc)

## Normality Assumption

• At any given predictor value the distribution of outcome given predictor is assumed to be normal

• Diagnostic:

• Compare the distributions of residuals to a normal distribution
• Consequences of Violation:

• may pose problems for uncertainty measures and hypothesis testing in small samples

## Constant Variance Assumption

• The vertical spread of the residuals is not constant across the plot

• Non-constant error variance could mean we predict some observations better (i.e. with less error) than others

• Diagnostic:

• Check the plot of residuals vs. predicted values for non-constant spread of the residuals across the range of predicted values
• Consequences of Violation:

• inaccurate confidence intervals and p-values

## Main Takeaways

• The regression model represents our idea of the data-generating process (DAG) in the systematic component; everything else is captured in the error term
• You should control for confounders, but not mediators and colliders $\Rightarrow$ controls should be related to both dependent and main independent variables
• When justifying the choice of control variables, mention both how they are related to main explanatory variable and the dependent variable
• When interpreting the coefficients in multiple linear regression, remember that they are average effects holding all other variables constant
• You do not need to interpret the coefficients of control variables (at least as estimated effects of that variable on dependent variable)