# Binary Dependent Variable

Data Analytics and Visualization with R
Session 9

Viktoriia Semenova

University of Mannheim
Spring 2023

# Intro

## Housekeeping

• Final paper: June 19-July 3, 2023 (?)
• Blog post extra assignment:
  • Pick a topic in the data visualization/wrangling area and make a 1.5-2 page tutorial
  • Select a (polisci) dataset and make a visualization that answers a RQ with these data
  • Alone or in pairs, by Monday, May 29 at the latest

## Quiz: Which of these statements are correct?

Indridason and Bowler (2014) explore the determinants of cabinet size in parliamentary systems. Below you can find a plot based on one of their models.

1. Systematic component of the model likely includes variable Legislature Size interacted with another variable.
2. Marginal effect of the variable Legislature Size is constant across all values of Legislature Size variable.
3. The relationship between legislature size and cabinet size is strongest for smaller values of legislature size.
4. For legislatures with sizes above 500, there is, on average, no significant effect of legislature size on cabinet size.
5. Legislature size seems to be inversely related to cabinet size.

# Binary Dependent Variable

## Data

| Variable | Unique (#) | Missing (%) | Mean | SD | Min | Median | Max |
|---|---|---|---|---|---|---|---|
| russian_tv | 2 | 0 | 0.6 | 0.5 | 0.0 | 1.0 | 1.0 |
| pro_russian_vote | 2 | 0 | 0.2 | 0.4 | 0.0 | 0.0 | 1.0 |
| within_25km | 2 | 0 | 0.6 | 0.5 | 0.0 | 1.0 | 1.0 |

• russian_tv: indicator for whether voter’s precinct received Russian TV (1) or not (0)
• pro_russian_vote: indicator for whether respondent voted for pro-Russian party in 2014 Ukrainian elections (1) or not (0)
• within_25km: indicator for whether respondent’s precinct is within 25 kilometers from Russian border (1) or not (0)

## Linear Probability Model

$\text{Pro-Russian Vote} = \beta_0 + \beta_1 \text{Russian TV} + \beta_2 \text{Living within 25km} + \varepsilon$

## Logistic Regression Solution

• Apply a transformation to the linear predictor $\beta_0 + \beta_1 \text{Russian TV} + \beta_2 \text{Living within 25km}$, to ensure that outcome is bounded between 0 and 1

• The inverse logit (aka sigmoid) function takes a value between $-\infty$ and $+\infty$ and maps it to a value between 0 and 1:

$logit^{-1}(x) = \frac{\exp(x)}{1+\exp(x)}$

## Logistic Regression Model

• $Y = 1:$ yes, $Y = 0:$ no; two mutually exclusive outcomes
• $\pi = Pr(Y = 1)$: probability that $Y=1$
• $\frac{\pi}{1-\pi}$: odds that $Y = 1$
• $\log\Big(\frac{\pi}{1-\pi}\Big)$: log odds
• Go from $\pi$ to $\log\Big(\frac{\pi}{1-\pi}\Big)$ using the logit transformation

$\underbrace{\log\Big(\frac{\pi}{1-\pi}\Big)}_{\text{straight line}} = \underbrace{\beta_0 + \beta_1~X_1 + \cdots + \beta_k~X_k}_{\text{linear predictor}}$

## Odds and Probabilities Example

Suppose there is a 70% chance it will rain tomorrow

• Probability it will rain is $\pi = \mathbf{p = 0.7}$
• Probability it won’t rain is $1 - \pi = \mathbf{1 - p = 0.3}$
• Odds $\omega$ it will rain are 7 to 3, 7:3, $\omega = \frac{\pi}{1-\pi}= \mathbf{\frac{0.7}{0.3} \approx 2.33}$

library(tidyverse) # tibble(), %>%
library(knitr)     # kable()

pi <- seq(0, 1, length.out = 6)
tibble(pi, 1 - pi, "odds" = pi / (1 - pi), "log odds" = log(pi / (1 - pi))) %>%
  kable()

| pi | 1 - pi | odds | log odds |
|---|---|---|---|
| 0.0 | 1.0 | 0.0000000 | -Inf |
| 0.2 | 0.8 | 0.2500000 | -1.3862944 |
| 0.4 | 0.6 | 0.6666667 | -0.4054651 |
| 0.6 | 0.4 | 1.5000000 | 0.4054651 |
| 0.8 | 0.2 | 4.0000000 | 1.3862944 |
| 1.0 | 0.0 | Inf | Inf |

## Probabilities, Odds, and Log Odds

Odds

$\omega = \frac{\pi}{1-\pi} = \exp\Big\{\log\Big(\frac{\pi}{1-\pi}\Big)\Big\} = \exp\{\beta_0 + \beta_1~X_1 + \cdots + \beta_k~X_k\}$

Log odds

$\log(\omega) = \log\Big(\frac{\pi}{1-\pi}\Big) = \beta_0 + \beta_1~X_1 + \cdots + \beta_k~X_k$

Probability

$\pi = \frac{\omega}{1 + \omega} = \frac{\exp\{\beta_0 + \beta_1~X_1 + \cdots + \beta_k~X_k\}}{1 + \exp\{\beta_0 + \beta_1~X_1 + \cdots + \beta_k~X_k\}}$
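The three scales are just re-expressions of one another, which a few lines of base R can make concrete. This is a minimal sketch using the 70% rain example; the helper names `prob_to_odds` and `odds_to_prob` are made up for illustration.

```r
# Probability -> odds -> log odds, and back again (base R only)
prob_to_odds <- function(p) p / (1 - p)           # hypothetical helper
odds_to_prob <- function(odds) odds / (1 + odds)  # hypothetical helper

p <- 0.7                      # the rain example
odds <- prob_to_odds(p)       # 0.7 / 0.3
log_odds <- log(odds)         # same value as qlogis(p)

# Going back: exponentiate the log odds, then convert odds to a probability
p_back <- odds_to_prob(exp(log_odds))  # same value as plogis(log_odds)

round(c(odds = odds, log_odds = log_odds, p_back = p_back), 3)
```

Base R already ships these transformations as `qlogis()` (logit) and `plogis()` (inverse logit), so the helpers are only there to mirror the formulas.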

## Fitting the Model in R

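| | OLS | Logit |
|---|---|---|
| (Intercept) | 0.196 | −1.474 |
| | (0.035) | (0.214) |
| russian_tv | 0.288 | 1.790 |
| | (0.077) | (0.504) |
| within_25km | −0.208 | −1.319 |
| | (0.077) | (0.493) |
| Num.Obs. | 358 | 358 |
| R2 | 0.039 | |
| Log.Lik. | −194.925 | −188.738 |
| F | 7.238 | 6.534 |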
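The slide shows only the resulting table, so here is a hedged sketch of how two such models might be fit. The `UA` data are not reproduced in these slides; the cell counts below are an assumption, picked only to be consistent with the cross-tabulations reported here, and `cells`, `UA_sketch`, and `m1` are illustrative names.

```r
# Hypothetical cell counts (an assumption, NOT the original UA data), chosen to
# be consistent with the cross-tabulations reported on these slides
cells <- data.frame(
  russian_tv  = c(0, 0, 1, 1),
  within_25km = c(0, 1, 0, 1),
  n           = c(139, 19, 14, 186),  # respondents per cell
  pro         = c(26, 1, 8, 50)       # pro-Russian votes per cell
)

# Expand to one row per respondent
UA_sketch <- cells[rep(seq_len(nrow(cells)), cells$n),
                   c("russian_tv", "within_25km")]
UA_sketch$pro_russian_vote <- unlist(
  Map(function(k, n) rep(c(1, 0), c(k, n - k)), cells$pro, cells$n)
)

# Linear probability model (OLS) and logistic regression
m1 <- lm(pro_russian_vote ~ russian_tv + within_25km, data = UA_sketch)
m2 <- glm(pro_russian_vote ~ russian_tv + within_25km,
          family = binomial(link = "logit"), data = UA_sketch)

round(cbind(OLS = coef(m1), Logit = coef(m2)), 3)

# Fitted values for a precinct near the border without Russian TV
scenario  <- data.frame(russian_tv = 0, within_25km = 1)
lpm_hat   <- predict(m1, newdata = scenario)                     # can leave [0, 1]
logit_hat <- predict(m2, newdata = scenario, type = "response")  # stays in (0, 1)
c(lpm = unname(lpm_hat), logit = unname(logit_hat))
```

With these counts the linear probability model returns a slightly negative "probability" for this scenario, which is the kind of out-of-range prediction the logit transformation avoids.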

## Logit Coefficients Interpretation

$\log\left[ \frac { \widehat{Pr( \text{Pro-russian Vote} = \operatorname{1} )} }{ 1 - \widehat{Pr( \text{Pro-russian Vote} = \operatorname{1} )} } \right] = -1.474 + 1.79 \cdot\text{Russian TV} - 1.32\cdot\text{Living within 25 km}$

• Sign and significance are straightforward to interpret

The log-odds of voting for a pro-Russian party are expected to be 1.79 higher for those exposed to Russian TV than for those without exposure to Russian TV (the baseline group), holding all else constant.

The odds of voting for a pro-Russian party for those exposed to Russian TV are expected to be 5.99 ($e^{1.79}$) times the odds for those without exposure to Russian TV, holding all else constant.

$\text{Odds Ratio} = e^{\hat{\beta}_j} = \exp\{\hat{\beta}_j\}$
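A quick numerical check of why a single odds ratio can be reported "holding all else constant": in a model without interactions, the ratio of odds with and without Russian TV is the same at either value of the other predictor. The sketch below uses the rounded coefficients from the fitted equation; the function name `odds` is illustrative.

```r
# Reported logit coefficients (rounded as in the fitted equation)
b0 <- -1.474; b_tv <- 1.79; b_km <- -1.32

# Odds of a pro-Russian vote for a given covariate profile
odds <- function(tv, km) exp(b0 + b_tv * tv + b_km * km)

# Ratio of odds with vs. without Russian TV, at both values of within_25km
or_far  <- odds(tv = 1, km = 0) / odds(tv = 0, km = 0)
or_near <- odds(tv = 1, km = 1) / odds(tv = 0, km = 1)

round(c(or_far = or_far, or_near = or_near, exp_b_tv = exp(b_tv)), 3)
```

All three numbers coincide because the intercept and the `within_25km` term cancel in the ratio, leaving only $e^{\hat{\beta}_1}$.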

## Predicted Probabilities By Hand

$\log\left[ \frac { \widehat{Pr( \text{Pro-russian Vote} = \operatorname{1} )} }{ 1 - \widehat{Pr( \text{Pro-russian Vote} = \operatorname{1} )} } \right] = -1.474 + 1.79 \cdot\text{Russian TV} - 1.32\cdot\text{Living within 25 km}$

$\hat\pi = \widehat{Pr( \text{Pro-russian Vote} = \operatorname{1} )} =\frac{\exp(-1.474 + 1.79 \cdot \text{Russian TV} - 1.32\cdot\text{Living within 25 km})}{1 + \exp(-1.474 + 1.79 \cdot \text{Russian TV}- 1.32\cdot\text{Living within 25 km})}$

$\widehat{Pr( \text{Pro-russian Vote} = \operatorname{1} |~\text{Russian TV} = 1,~\text{Living within 25 km} = 1)} \\=\frac{\exp(-1.474 + 1.79 - 1.32)}{1 + \exp(-1.474 + 1.79 - 1.32)}\\ = \frac{\exp(-1.004)}{1 + \exp(-1.004)} \approx \frac{0.366}{1.366} \approx 0.27$

$\widehat{Pr( \text{Pro-russian Vote} = \operatorname{1} |~\text{Russian TV} = 0,~\text{Living within 25 km} = 1)} \\=\frac{\exp(-1.474 - 1.32)}{1 + \exp(-1.474 - 1.32)}\\ = \frac{\exp(-2.794)}{1 + \exp(-2.794)} \approx \frac{0.061}{1.061} \approx 0.06$
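The same hand calculation can be checked in R. This sketch plugs the rounded coefficients into base R's `plogis()` (the inverse logit); the object names are illustrative.

```r
# Rounded logit coefficients from the fitted equation
b0 <- -1.474; b_tv <- 1.79; b_km <- -1.32

# Linear predictor (log odds) for the two scenarios, both within 25 km
lo_tv1 <- b0 + b_tv * 1 + b_km * 1   # Russian TV = 1
lo_tv0 <- b0 + b_tv * 0 + b_km * 1   # Russian TV = 0

# Inverse logit: exp(x) / (1 + exp(x))
p_tv1 <- plogis(lo_tv1)
p_tv0 <- plogis(lo_tv0)

round(c(tv1 = p_tv1, tv0 = p_tv0, difference = p_tv1 - p_tv0), 3)
```

The results agree with the `.fitted` values of roughly 0.268 and 0.058 reported for these two profiles, up to rounding of the coefficients.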

## Predicted Values

• Transforming probabilities $\pi$ back to 0/1 scale of the response variable $Y$
• Requires threshold for which values translate to $Y = 1$ and $Y = 0$; $\pi > 0.5$ most common

| pro_russian_vote | russian_tv | within_25km | .fitted | prediction |
|---|---|---|---|---|
| 0 | 1 | 1 | 0.2682990 | 0 |
| 1 | 1 | 1 | 0.2682990 | 0 |
| 0 | 0 | 0 | 0.1863569 | 0 |
| 0 | 0 | 1 | 0.0577046 | 0 |
| 0 | 0 | 1 | 0.0577046 | 0 |
| 0 | 1 | 0 | 0.5783134 | 1 |
| 0 | 0 | 1 | 0.0577046 | 0 |
| 0 | 0 | 1 | 0.0577046 | 0 |
| 0 | 1 | 1 | 0.2682990 | 0 |
| 0 | 1 | 1 | 0.2682990 | 0 |
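The thresholding step is a one-liner. A small sketch with the four distinct fitted probabilities from the table; `classify()` is a made-up helper, and the alternative cut-off of 0.25 is purely illustrative.

```r
# The four distinct fitted probabilities that appear in the table
fitted_probs <- c(tv1_km1 = 0.2682990, tv0_km0 = 0.1863569,
                  tv0_km1 = 0.0577046, tv1_km0 = 0.5783134)

# Hypothetical helper: predict Y = 1 when the probability exceeds the threshold
classify <- function(p, threshold = 0.5) as.integer(p > threshold)

pred_default <- classify(fitted_probs)         # threshold 0.5
pred_lower   <- classify(fitted_probs, 0.25)   # an alternative, lower threshold

rbind(fitted_probs, pred_default, pred_lower)
```

Only the profile with Russian TV and more than 25 km from the border crosses 0.5; lowering the threshold changes which profiles are classified as pro-Russian voters.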

## Model Predictions vs. Actual Outcomes

| prediction | pro_russian_vote | count | |
|---|---|---|---|
| 0 | 0 | 267 | True Negatives |
| 0 | 1 | 77 | False Negatives |
| 1 | 0 | 6 | False Positives |
| 1 | 1 | 8 | True Positives |
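The four counts can be condensed into standard summary measures. The sketch below uses the usual textbook definitions of accuracy, sensitivity, and specificity; these measures are not discussed on the slides themselves, so treat them as an add-on.

```r
# Counts from the table above
tn <- 267; fn <- 77; fp <- 6; tp <- 8
total <- tn + fn + fp + tp

accuracy    <- (tp + tn) / total   # share of correct predictions
sensitivity <- tp / (tp + fn)      # share of actual 1s predicted as 1
specificity <- tn / (tn + fp)      # share of actual 0s predicted as 0
baseline    <- (tn + fp) / total   # accuracy of always predicting 0

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, baseline = baseline), 3)
```

With a 0.5 threshold the model is barely more accurate than always predicting "no pro-Russian vote", and it identifies only a small share of actual pro-Russian voters.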

# Uncertainty and Inference

## Model Plots and Confidence Intervals

tidy(m2, conf.int = TRUE, conf.level = 0.99) %>%
  dplyr::select(term, estimate, starts_with("conf")) %>%
  kable()

| term | estimate | conf.low | conf.high |
|---|---|---|---|
| (Intercept) | -1.473858 | -2.0605444 | -0.9506684 |
| russian_tv | 1.789712 | 0.5610434 | 3.1999490 |
| within_25km | -1.319124 | -2.7003574 | -0.1058402 |

modelplot(m2, conf_level = 0.99) +
  geom_vline(xintercept = 0, lty = 2)

• [0.561, 3.20] is a 99% confidence interval for the difference in the log-odds between voters in precincts with and without Russian TV coverage, holding proximity to the border constant.

• Living in a precinct with Russian TV coverage is, on average, positively related to voting pro-Russian, controlling for the proximity to the border. This effect is significantly different from zero at the 1% significance level.

## Better Interpretation

• With 95% confidence, the predicted probability of voting for a pro-Russian party when living within 25km of the border and being exposed to Russian propaganda ranges from 0.21 to 0.34.
• We are 95% confident that for a voter living within 25km of the Russian border, exposure to Russian TV propaganda is associated with an increase in the probability of voting for a pro-Russian party of 12 to 27 percentage points.
• For those living in a precinct further than 25km from the border, the average effect of exposure to Russian TV propaganda on the probability of pro-Russian voting ranges from 16 to 59 percentage points.

## Sampling Distributions

• The sampling distribution of a statistic is a probability distribution based on a large number of samples of size $n$ from a given population

• Sampling distributions represent the variability of our estimates: if we had taken different samples from the population, we would have obtained slightly different estimates

• Sampling distributions of most parameter estimates are approximately normal:

  • Determined by two parameters, mean (center) and standard deviation (spread)

## Draws from Simulated Sampling Distributions

We can use our coefficient estimates and uncertainty about them to simulate sampling distributions:

## Simulated Sampling Distributions

# get draws from multivariate normal distribution
sims <- clarify::sim(m2, n = 1000)

| (Intercept) | russian_tv | within_25km |
|---|---|---|
| -1.916805 | 2.113583 | -1.2053787 |
| -1.373905 | 1.143104 | -0.8299581 |
| -1.827432 | 2.242678 | -1.2927962 |
| -1.612314 | 1.854074 | -1.1262945 |
| -1.217560 | 1.569573 | -1.3917399 |
| -1.690160 | 1.410358 | -0.6638327 |
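To see what "draws from a multivariate normal distribution" involves, here is a base R sketch of the idea: take the coefficient estimates as the mean and their variance-covariance matrix as the spread. It is not the internals of `clarify::sim()`. The cell counts are an assumption (not the original `UA` data), chosen to be consistent with the cross-tabulations on these slides, and `cells`, `fit`, and `draws` are illustrative names.

```r
set.seed(123)

# Hypothetical cell counts (an assumption, NOT the original UA data)
cells <- data.frame(
  russian_tv  = c(0, 0, 1, 1),
  within_25km = c(0, 1, 0, 1),
  n           = c(139, 19, 14, 186),
  pro         = c(26, 1, 8, 50)
)

# Logit on aggregated counts: successes and failures per cell
fit <- glm(cbind(pro, n - pro) ~ russian_tv + within_25km,
           family = binomial, data = cells)

beta_hat <- coef(fit)   # center of the simulated sampling distribution
V        <- vcov(fit)   # variances and covariances of the estimates

# Multivariate normal draws: standard normal noise times the Cholesky factor
# of V, shifted by the estimates
n_sims <- 1000
z      <- matrix(rnorm(n_sims * length(beta_hat)), nrow = n_sims)
draws  <- sweep(z %*% chol(V), 2, beta_hat, "+")
colnames(draws) <- names(beta_hat)

head(draws)
round(rbind(mean = colMeans(draws),
            sd   = apply(draws, 2, sd),
            se   = sqrt(diag(V))), 3)
```

The column means of the draws should sit close to the coefficient estimates and the column standard deviations close to the standard errors.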

## Simulated Sampling Distributions

as_tibble(sims$sim.coefs) %>%
  summarise_all(.funs = list(mean = ~ mean(.))) %>%
  kable()

| (Intercept)_mean | russian_tv_mean | within_25km_mean |
|---|---|---|
| -1.473347 | 1.784857 | -1.322555 |

as_tibble(sims$sim.coefs) %>%
  summarise_all(.funs = list(sd = ~ sd(.))) %>%
  kable()

| (Intercept)_sd | russian_tv_sd | within_25km_sd |
|---|---|---|
| 0.2035962 | 0.5138858 | 0.5047867 |

tidy(m2) %>%
  kable()

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -1.473858 | 0.2139185 | -6.889812 | 0.0000000 |
| russian_tv | 1.789712 | 0.5036063 | 3.553791 | 0.0003797 |
| within_25km | -1.319124 | 0.4934381 | -2.673332 | 0.0075102 |

## Log Odds with Simulated Coefficients

• Now instead of one equation with estimated coefficients, we have many with similar, simulated coefficients
• Each equation will result in slightly different log odds value (and predicted probability, too)

${ \begin{array}{c} \tilde{\beta}_0^1 \times 1+\tilde{\beta}_1^1 \times \text{Russian TV}_i +\tilde{\beta}_2^1 \times \text{Living within 25km}_i &= \log\Big(\dfrac{\tilde{\pi}^1}{1-\tilde{\pi}^1}\Big)\\ \tilde{\beta}_0^2 \times 1+\tilde{\beta}_1^2 \times \text{Russian TV}_i +\tilde{\beta}_2^2 \times \text{Living within 25km}_i &= \log\Big(\dfrac{\tilde{\pi}^2}{1-\tilde{\pi}^2}\Big)\\ \tilde{\beta}_0^3 \times 1+\tilde{\beta}_1^3 \times \text{Russian TV}_i +\tilde{\beta}_2^3 \times \text{Living within 25km}_i &= \log\Big(\dfrac{\tilde{\pi}^3}{1-\tilde{\pi}^3}\Big)\\ \dots \\ \tilde{\beta}_0^{1000} \times 1+\tilde{\beta}_1^{1000} \times \text{Russian TV}_i +\tilde{\beta}_2^{1000} \times \text{Living within 25km}_i &= \log\Big(\dfrac{\tilde{\pi}^{1000}}{1-\tilde{\pi}^{1000}}\Big)\\ \end{array} }$

## Calculating Predicted Probabilities for Chosen Scenarios

${\text{Russian TV} = 0,~\text{Living within 25 km} = 1}$

# manually calculate the log odds (not predicted probabilities yet)
lo1 <- sims$sim.coefs[1, 1] + sims$sim.coefs[1, 2] * 0 +
  sims$sim.coefs[1, 3] * 1
lo2 <- sims$sim.coefs[2, 1] + sims$sim.coefs[2, 2] * 0 +
  sims$sim.coefs[2, 3] * 1
# and so on for every row in the matrix

# custom inverse logit function
inv_logit <- function(x) {
exp(x) / (1 + exp(x))
}

# transform log odds to probabilities
inv_logit(lo1) 
(Intercept)
0.04220143 
inv_logit(lo2)
(Intercept)
0.09940411 
# with clarify
evs <- sim_setx(sim = sims, # object with simulated coefs
x = list(russian_tv = 0, # scenario
within_25km = 1))

# compare to manual calculations
as.matrix(evs) %>% head(6) %>% kable()

| 1 |
|---|
| 0.0422014 |
| 0.0994041 |
| 0.0422805 |
| 0.0607332 |
| 0.0685423 |
| 0.0867489 |

## Summarize Predicted Probabilities

| Estimate | 2.5 % | 97.5 % |
|---|---|---|
| 0.0577046 | 0.0207833 | 0.1448121 |

We are 95% confident that the predicted probability of voting pro-Russian ranges from 0.02 to 0.14 (that is, from 2 to 14 percent) in case ${\text{Russian TV} = 0,~\text{Living within 25 km} = 1}$

## Expected Values for Two Scenarios

${\text{Russian TV} = 0,~\text{Living within 25 km} = 1}$ ${\text{Russian TV} = 1,~\text{Living within 25 km} = 1}$

| russian_tv = 0 | russian_tv = 1 |
|---|---|
| 0.0422014 | 0.2672538 |
| 0.0994041 | 0.2571643 |
| 0.0422805 | 0.2936858 |
| 0.0607332 | 0.2922390 |
| 0.0685423 | 0.2612027 |
| 0.0867489 | 0.2801668 |

| | Estimate | 2.5 % | 97.5 % |
|---|---|---|---|
| russian_tv = 0 | 0.0577046 | 0.0207833 | 0.1448121 |
| russian_tv = 1 | 0.2682990 | 0.2102465 | 0.3318915 |

## What Is the Effect of Russian TV Propaganda?

fds <- transform(evs,
  `First Difference` = `russian_tv = 1` - `russian_tv = 0`)
fds %>%
  summary() %>%
  kable()

| | Estimate | 2.5 % | 97.5 % |
|---|---|---|---|
| russian_tv = 0 | 0.0577046 | 0.0207833 | 0.1448121 |
| russian_tv = 1 | 0.2682990 | 0.2102465 | 0.3318915 |
| First Difference | 0.2105944 | 0.1133236 | 0.2741849 |

On average, for those living in precincts within 25km of the border, being exposed to Russian TV propaganda is associated with an increase in the probability of voting for a pro-Russian party of 21 percentage points. The 95% confidence interval for this quantity ranges from 11 to 27 percentage points, making the effect of Russian TV propaganda significantly different from zero.
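The logic behind the first difference can be traced by hand: compute the predicted probability for each scenario from every simulated coefficient vector, subtract, and summarise the differences with percentiles. This base R sketch illustrates the idea and is not the `clarify` implementation. The cell counts are an assumption (not the original `UA` data), chosen to be consistent with the cross-tabulations on these slides, and all object names are illustrative.

```r
set.seed(123)

# Hypothetical cell counts (an assumption, NOT the original UA data)
cells <- data.frame(
  russian_tv  = c(0, 0, 1, 1),
  within_25km = c(0, 1, 0, 1),
  n           = c(139, 19, 14, 186),
  pro         = c(26, 1, 8, 50)
)
fit <- glm(cbind(pro, n - pro) ~ russian_tv + within_25km,
           family = binomial, data = cells)

# 1000 coefficient draws from a multivariate normal (Cholesky approach)
z     <- matrix(rnorm(1000 * 3), ncol = 3)
draws <- sweep(z %*% chol(vcov(fit)), 2, coef(fit), "+")

# Predicted probabilities for both scenarios, one value per draw
p_tv0 <- plogis(draws[, 1] + draws[, 2] * 0 + draws[, 3] * 1)
p_tv1 <- plogis(draws[, 1] + draws[, 2] * 1 + draws[, 3] * 1)

# First difference: computed draw by draw, then summarised
fd <- p_tv1 - p_tv0
pct <- function(x) quantile(x, probs = c(0.025, 0.5, 0.975))
round(rbind(`russian_tv = 0`   = pct(p_tv0),
            `russian_tv = 1`   = pct(p_tv1),
            `First Difference` = pct(fd)), 3)
```

Taking the difference within each draw, rather than subtracting the two interval endpoints, is what lets the interval for the first difference reflect the correlation between the two scenarios.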

## Multiple Scenarios and First Differences

evs <- sim_setx(sim = sims, # object with simulated coefs
x = list(russian_tv = 0:1, # scenario with desired (plausible) values
within_25km = 0:1))

as.matrix(evs) %>% head(3) %>% kable()

| russian_tv = 0, within_25km = 0 | russian_tv = 1, within_25km = 0 | russian_tv = 0, within_25km = 1 | russian_tv = 1, within_25km = 1 |
|---|---|---|---|
| 0.1282183 | 0.5490364 | 0.0422014 | 0.2672538 |
| 0.2019897 | 0.4425544 | 0.0994041 | 0.2571643 |
| 0.1385445 | 0.6023452 | 0.0422805 | 0.2936858 |

## Multiple Scenarios

fds <- transform(
  evs,
  `FD_within_25km = 0` = `russian_tv = 1, within_25km = 0` - `russian_tv = 0, within_25km = 0`,
  `FD_within_25km = 1` = `russian_tv = 1, within_25km = 1` - `russian_tv = 0, within_25km = 1`
)

summary(fds) %>% kable()

| | Estimate | 2.5 % | 97.5 % |
|---|---|---|---|
| russian_tv = 0, within_25km = 0 | 0.1863569 | 0.1315394 | 0.2503005 |
| russian_tv = 1, within_25km = 0 | 0.5783134 | 0.3334135 | 0.7823231 |
| russian_tv = 0, within_25km = 1 | 0.0577046 | 0.0207833 | 0.1448121 |
| russian_tv = 1, within_25km = 1 | 0.2682990 | 0.2102465 | 0.3318915 |
| FD_within_25km = 0 | 0.3919565 | 0.1467803 | 0.5987365 |
| FD_within_25km = 1 | 0.2105944 | 0.1133236 | 0.2741849 |

# Appendix: Interpretation of Coefficients in Logit Models

## Compare the odds for two groups

| Russian TV | Pro-Russian Vote | No pro-Russian Vote |
|---|---|---|
| No | 27 | 131 |
| Yes | 58 | 142 |

• We want to compare the risk of Pro-Russian Vote for those with exposure to Russian TV and those without it.
• We’ll use the odds to compare the two groups

$\text{odds} = \frac{P(\text{success})}{P(\text{failure})} = \frac{\text{# of successes}}{\text{# of failures}}$

## Compare the odds for two groups

| Russian TV | Pro-Russian Vote | No pro-Russian Vote |
|---|---|---|
| No | 27 | 131 |
| Yes | 58 | 142 |

• Odds of voting pro-Russian with Russian TV exposure: $\frac{58}{142} = 0.408$

• Odds of voting pro-Russian without Russian TV exposure: $\frac{27}{131} = 0.206$

• Based on this, we see that those with Russian TV exposure had higher odds of voting pro-Russian than those without Russian TV exposure.

• We can summarize the relationship with odds ratio (OR): $OR = \frac{\text{odds}_1}{\text{odds}_2} = \frac{\omega_1}{\omega_2}$

## Odds Ratio: with vs. without exposure to Russian TV

| Russian TV | Pro-Russian Vote | No pro-Russian Vote |
|---|---|---|
| No | 27 | 131 |
| Yes | 58 | 142 |

• Odds of voting pro-Russian with Russian TV exposure: $\frac{58}{142} = 0.408$
• Odds of voting pro-Russian without Russian TV exposure: $\frac{27}{131} = 0.206$

$OR = \frac{\text{odds}_{with}}{\text{odds}_{without}} = \frac{0.408}{0.206} = 1.982$

The odds of voting pro-Russian for those with exposure to Russian TV are 1.982 times the odds for those without exposure to Russian TV.
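The odds ratio can be reproduced directly from the 2x2 table. A short base R sketch; the object names `tab`, `odds_by_group`, and `odds_ratio` are illustrative.

```r
# 2x2 table of counts: rows = Russian TV exposure, columns = vote
tab <- matrix(c(27, 131,
                58, 142),
              nrow = 2, byrow = TRUE,
              dimnames = list(russian_tv = c("No", "Yes"),
                              vote = c("pro_russian", "not_pro_russian")))

# Odds of a pro-Russian vote within each exposure group
odds_by_group <- tab[, "pro_russian"] / tab[, "not_pro_russian"]

# Odds ratio (with vs. without exposure) and its logarithm
odds_ratio     <- unname(odds_by_group["Yes"] / odds_by_group["No"])
log_odds_ratio <- log(odds_ratio)

round(c(odds_no = unname(odds_by_group["No"]),
        odds_yes = unname(odds_by_group["Yes"]),
        odds_ratio = odds_ratio,
        log_odds_ratio = log_odds_ratio), 3)
```

The log of this odds ratio is what a logit model with `russian_tv` as its only predictor estimates as the slope coefficient.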

## Coefficients in Logit Model

m3 <- glm(pro_russian_vote ~ russian_tv,
  family = binomial,
  data = UA)
m3$coefficients[2] # log odds
russian_tv
0.6839764

The log odds of voting pro-Russian are 0.684 higher for those with exposure to Russian TV compared to those without exposure to Russian TV.

exp(m3$coefficients[2]) # odds ratio
russian_tv
1.981742

The odds of voting pro-Russian for those with exposure to Russian TV are 1.982 times the odds for those without exposure to Russian TV.

## Continuous Predictors

For each one-unit increase in $X_k$, the log-odds of Y are expected to change by $\beta_k$ (holding all else constant).

For each one-unit increase in $X_k$, the odds of Y are expected to multiply by a factor of $e^{\beta_k}$ (holding all else constant).

Or, equivalently:

For each one-unit increase in $X_k$, the odds of Y are expected to change by $(e^{\beta_k} - 1) \times 100\%$ (holding all else constant).

## Confidence Intervals

#### Log Odds

We can calculate the C% confidence interval for $\beta_k$ as follows:

$\Large{\hat{\beta}_k \pm z^* SE_{\hat{\beta}_k}}$

where $z^*$ is the critical value from the $N(0,1)$ distribution

This is an interval for the change in the log-odds for every one unit increase in $x_k$.

#### Odds

The change in odds for every one unit increase in $x_k$.

$\Large{e^{\hat{\beta}_k \pm z^* SE_{\hat{\beta}_k}}}$

Interpretation: We are $C\%$ confident that for every one unit increase in $x_k$, the odds multiply by a factor of $e^{\hat{\beta}_k - z^* SE_{\hat{\beta}_k}}$ to $e^{\hat{\beta}_k + z^* SE_{\hat{\beta}_k}}$, holding all else constant.
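As a worked example of these formulas, the sketch below computes a 95% Wald interval for the `russian_tv` coefficient from the estimate and standard error in the `tidy(m2)` output. Note that the intervals which `tidy(..., conf.int = TRUE)` reports for a `glm` are typically profile-likelihood intervals, so they need not match a Wald interval exactly.

```r
# Estimate and standard error for russian_tv, as reported by tidy(m2)
est <- 1.789712
se  <- 0.5036063

conf_level <- 0.95
z_star <- qnorm(1 - (1 - conf_level) / 2)   # critical value from N(0, 1)

# Interval on the log-odds scale, then exponentiated to the odds scale
ci_log_odds <- est + c(-1, 1) * z_star * se
ci_odds     <- exp(ci_log_odds)

round(rbind(log_odds = ci_log_odds, odds = ci_odds), 3)
```

Because the whole interval on the odds scale lies above 1, the data are consistent with Russian TV exposure multiplying the odds of a pro-Russian vote by more than one, holding proximity to the border constant.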

# Appendix: Fitting Logit Models

## Maximum Likelihood Estimation (MLE)

• 20 data points that we assume come from a normal distribution. We know that the normal distribution has two parameters, mean and variance
• Which of the plotted distributions has most likely generated the data points?

## Maximum Likelihood Estimation (MLE)

• The points seem to be centered around zero and their range is $[-2,2]$
• Most likely, it is the violet distribution that has generated the points
• With MLE: (1) we observe the data, (2) assume a distribution it comes from, and (3) look for the values of parameters defining this distribution that result in the curve that best fits the data
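The three steps can be sketched numerically. The 20 values below are hypothetical (made up for illustration, roughly centered on zero), and `neg_loglik` is an illustrative name; the sketch searches for the mean and standard deviation of a normal distribution that maximise the likelihood of those points.

```r
# (1) Observe the data: 20 hypothetical points, NOT the ones plotted on the slide
x <- c(-1.6, -1.2, -0.9, -0.8, -0.6, -0.5, -0.3, -0.2, -0.1, 0.0,
        0.1,  0.2,  0.3,  0.4,  0.6,  0.7,  0.9,  1.1,  1.4, 1.8)

# (2) Assume a normal distribution; write down its negative log-likelihood.
#     The standard deviation is parameterised on the log scale to keep it positive.
neg_loglik <- function(par, x) {
  mu    <- par[1]
  sigma <- exp(par[2])
  -sum(dnorm(x, mean = mu, sd = sigma, log = TRUE))
}

# (3) Search for the parameter values that make the data most likely
fit <- optim(par = c(0, 0), fn = neg_loglik, x = x, method = "BFGS")
mle <- c(mean = fit$par[1], sd = exp(fit$par[2]))

# For the normal distribution the MLE is also available in closed form
closed_form <- c(mean = mean(x), sd = sqrt(mean((x - mean(x))^2)))

round(rbind(mle, closed_form), 3)
```

Logit models are fit on the same principle, except that the assumed distribution of the outcome is Bernoulli and the search is over the regression coefficients.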