Data Analytics and Visualization with R

Session 9

Viktoriia Semenova

University of Mannheim

Spring 2023

- Final paper: June 19-July 3, 2023 (?)
- Blog post extra assignment:
  - Pick a topic in the data visualization/wrangling area and make a 1.5-2 page tutorial
  - Select a (polisci) dataset and make a visualization that answers a RQ with these data
  - Alone or in pairs, by Monday, May 29 at the latest

Indridason and Bowler (2014) explore the determinants of cabinet size in parliamentary systems. Below you can find a plot based on one of their models.

- Systematic component of the model likely includes variable *Legislature Size* interacted with another variable.
- Marginal effect of the variable *Legislature Size* is constant across all values of *Legislature Size* variable.
- The relationship between *legislature size* and *cabinet size* is strongest for smaller values of *legislature size*.
- For legislatures with sizes above 500, there is, on average, no significant effect of *legislature size* on *cabinet size*.
- *Legislature size* seems to be inversely related to *cabinet size*.

Variable | Unique (#) | Missing (%) | Mean | SD | Min | Median | Max |
---|---|---|---|---|---|---|---|
russian_tv | 2 | 0 | 0.6 | 0.5 | 0.0 | 1.0 | 1.0 |
pro_russian_vote | 2 | 0 | 0.2 | 0.4 | 0.0 | 0.0 | 1.0 |
within_25km | 2 | 0 | 0.6 | 0.5 | 0.0 | 1.0 | 1.0 |

`russian_tv`

: indicator for whether voter’s precinct received Russian TV (1) or not (0)

`pro_russian_vote`

: indicator for whether respondent voted for pro-Russian party in 2014 Ukrainian elections (1) or not (0)

`within_25km`

: indicator for whether respondent’s precinct is within 25 kilometers from Russian border (1) or not (0)

\[\text{Pro-Russian Vote} \sim \beta_0 + \beta_1 \text{Russian TV} + \beta_2 \text{Living within 25km} + \varepsilon\]

Apply a transformation to the *linear predictor* \(\beta_0 + \beta_1 \text{Russian TV} + \beta_2 \text{Living within 25km}\) to ensure that the outcome is bounded between 0 and 1.

The inverse logit (aka sigmoid) function takes a value between \(-\infty\) and \(+\infty\) and maps it to a value between 0 and 1:

\[ logit^{-1}(x) = \frac{\exp(x)}{1+\exp(x)} \]
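A minimal sketch of this mapping in R. The helper name `inv_logit` and the input values are chosen for illustration; base R's `plogis()` computes the same function.

```r
# Inverse logit (sigmoid): maps any real number into the interval (0, 1)
inv_logit <- function(x) {
  exp(x) / (1 + exp(x))
}

x <- c(-10, -1, 0, 1, 10)   # illustrative values of the linear predictor
p <- inv_logit(x)
round(p, 4)

# base R's plogis() is the same function
all.equal(p, plogis(x))
```

Large negative inputs end up close to 0, large positive inputs close to 1, and an input of 0 maps to exactly 0.5.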

- \(Y = 1:\) yes, \(Y = 0:\) no; two mutually exclusive outcomes
- \(\pi = Pr(Y = 1)\): **probability** that \(Y=1\)
- \(\frac{\pi}{1-\pi}\): **odds** that \(Y = 1\)
- \(\log\Big(\frac{\pi}{1-\pi}\Big)\): **log odds**
- Go from \(\pi\) to \(\log\Big(\frac{\pi}{1-\pi}\Big)\) using the **logit transformation**

\[\underbrace{\log\Big(\frac{\pi}{1-\pi}\Big)}_{\text{straight line}} = \underbrace{\beta_0 + \beta_1~X_1 + \cdots + \beta_k~X_k}_{\text{linear predictor}}\]

Suppose there is a **70% chance** it will rain tomorrow

- Probability it will rain is \(\pi = \mathbf{p = 0.7}\)
- Probability it won’t rain is \(1 - \pi = \mathbf{1 - p = 0.3}\)
- Odds \(\omega\) it will rain are **7 to 3**, **7:3**, \(\omega = \frac{\pi}{1-\pi}= \mathbf{\frac{0.7}{0.3} \approx 2.33}\)
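The rain example can be traced in R as a quick check. This is only a sketch; the variable names are illustrative.

```r
p <- 0.7                    # probability of rain
odds <- p / (1 - p)         # 7:3, about 2.33
log_odds <- log(odds)       # logit(p); qlogis(p) gives the same value

# and back again: log odds -> odds -> probability
odds_back <- exp(log_odds)
p_back <- odds_back / (1 + odds_back)

c(odds = odds, log_odds = log_odds, p_back = p_back)
```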

**Odds**

\[\omega = \frac{\pi}{1-\pi} = \exp\Big\{\log\Big(\frac{\pi}{1-\pi}\Big)\Big\} = \exp\{\beta_0 + \beta_1~X_1 + \cdots + \beta_k~X_k\}\]

**Log odds**

\[\log(\omega) = \log\Big(\frac{\pi}{1-\pi}\Big) = \beta_0 + \beta_1~X_1 + \cdots + \beta_k~X_k\]

**Probability**

\[\pi = \frac{\omega}{1 + \omega} = \frac{\exp\{\beta_0 + \beta_1~X_1 + \cdots + \beta_k~X_k\}}{1 + \exp\{\beta_0 + \beta_1~X_1 + \cdots + \beta_k~X_k\}}\]

`R`

| | OLS | Logit |
|---|---|---|
| (Intercept) | 0.196 | −1.474 |
| | (0.035) | (0.214) |
| russian_tv | 0.288 | 1.790 |
| | (0.077) | (0.504) |
| within_25km | −0.208 | −1.319 |
| | (0.077) | (0.493) |
| Num.Obs. | 358 | 358 |
| R2 | 0.039 | |
| R2 Adj. | 0.034 | |
| Log.Lik. | −194.925 | −188.738 |
| F | 7.238 | 6.534 |

\[ \log\left[ \frac { \widehat{Pr( \text{Pro-russian Vote} = \operatorname{1} )} }{ 1 - \widehat{Pr( \text{Pro-russian Vote} = \operatorname{1} )} } \right] = -1.474 + 1.79 \cdot\text{Russian TV} - 1.32\cdot\text{Living within 25 km} \]

- Sign and significance are straightforward to interpret

The **log-odds** of voting for a pro-Russian party are expected to be 1.79 higher for those exposed to Russian TV compared to those without exposure to Russian TV (the baseline group), holding all else constant.

The **odds** of voting for a pro-Russian party for those exposed to Russian TV are expected to be 5.99 (\(e^{1.79}\)) *times* the odds for those without exposure to Russian TV, holding all else constant.

\[\text{Odds Ratio} = e^{\hat{\beta}_j} = \exp\{\hat{\beta}_j\}\]
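As a quick check, the odds ratios can be computed by exponentiating the logit coefficients from the regression table above. The vector `b` is typed in by hand here; with a fitted model object one would typically use `exp(coef(model))` instead.

```r
# logit coefficients as reported in the regression table
b <- c(intercept = -1.474, russian_tv = 1.790, within_25km = -1.319)

# odds ratios: exponentiated coefficients
odds_ratios <- exp(b[c("russian_tv", "within_25km")])
round(odds_ratios, 2)
```

An odds ratio above 1 goes with a positive coefficient (higher odds), and an odds ratio below 1 with a negative coefficient (lower odds).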

\[ \log\left[ \frac { \widehat{Pr( \text{Pro-russian Vote} = \operatorname{1} )} }{ 1 - \widehat{Pr( \text{Pro-russian Vote} = \operatorname{1} )} } \right] = -1.474 + 1.79 \cdot\text{Russian TV} - 1.32\cdot\text{Living within 25 km} \]

\[ \hat\pi = \widehat{Pr( \text{Pro-russian Vote} = \operatorname{1} )} =\frac{exp(-1.474 + 1.79 \cdot \text{Russian TV} - 1.32\cdot\text{Living within 25 km})}{1 + exp(-1.474 + 1.79 \cdot \text{Russian TV}- 1.32\cdot\text{Living within 25 km})} \]

\[ \widehat{Pr( \text{Pro-russian Vote} = \operatorname{1} |~\text{Russian TV} = 1,~\text{Living within 25 km} = 1)} \\=\frac{exp(-1.474 + 1.79 - 1.32)}{1 + exp(-1.474 + 1.79 - 1.32)}\\ = \frac{exp(-1.004)}{1 + exp(-1.004)} \approx \frac{0.366}{1.366} \approx 0.27 \]

\[ \widehat{Pr( \text{Pro-russian Vote} = \operatorname{1} |~\text{Russian TV} = 0,~\text{Living within 25 km} = 1)} \\=\frac{exp(-1.474 - 1.32)}{1 + exp(-1.474 - 1.32)}\\ = \frac{exp(-2.794)}{1 + exp(-2.794)} \approx \frac{0.061}{1.061} \approx 0.06 \]
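The two scenario probabilities can be checked numerically. This sketch uses the rounded coefficients from the equation and base R's `plogis()` as the inverse logit; the variable names are illustrative. The results should land close to the `.fitted` values reported in the table of fitted values below.

```r
b0 <- -1.474; b_tv <- 1.79; b_25 <- -1.32   # rounded estimates

# linear predictor (log odds), both scenarios within 25 km of the border
lo_tv1 <- b0 + b_tv * 1 + b_25 * 1   # Russian TV = 1
lo_tv0 <- b0 + b_tv * 0 + b_25 * 1   # Russian TV = 0

# plogis(x) = exp(x) / (1 + exp(x))
p_tv1 <- plogis(lo_tv1)
p_tv0 <- plogis(lo_tv0)
round(c(tv1 = p_tv1, tv0 = p_tv0), 3)
```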

- Transforming probabilities \(\pi\) back to 0/1 scale of the *response* variable \(Y\)
- Requires threshold for which values translate to \(Y = 1\) and \(Y = 0\); \(\pi > 0.5\) most common

pro_russian_vote | russian_tv | within_25km | .fitted | prediction |
---|---|---|---|---|
0 | 1 | 1 | 0.2682990 | 0 |
1 | 1 | 1 | 0.2682990 | 0 |
0 | 0 | 0 | 0.1863569 | 0 |
0 | 0 | 1 | 0.0577046 | 0 |
0 | 0 | 1 | 0.0577046 | 0 |
0 | 1 | 0 | 0.5783134 | 1 |
0 | 0 | 1 | 0.0577046 | 0 |
0 | 0 | 1 | 0.0577046 | 0 |
0 | 1 | 1 | 0.2682990 | 0 |
0 | 1 | 1 | 0.2682990 | 0 |

prediction | pro_russian_vote | count | |
---|---|---|---|
0 | 0 | 267 | True Negatives |
0 | 1 | 77 | False Negatives |
1 | 0 | 6 | False Positives |
1 | 1 | 8 | True Positives |
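A short sketch of how these four counts can be summarized. Accuracy, sensitivity, and specificity are standard classification summaries that are not named on the slide, and the variable names are illustrative.

```r
# counts from the table above (threshold 0.5)
tn <- 267; fn <- 77; fp <- 6; tp <- 8
n <- tn + fn + fp + tp

accuracy    <- (tp + tn) / n      # share of correct predictions
sensitivity <- tp / (tp + fn)     # share of actual 1s predicted as 1
specificity <- tn / (tn + fp)     # share of actual 0s predicted as 0

round(c(accuracy = accuracy,
        sensitivity = sensitivity,
        specificity = specificity), 3)
```

With a 0.5 threshold the model predicts very few 1s, so most of its correct predictions come from the larger group of non-pro-Russian voters.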

[0.561, 3.20] is an interval for the difference in the log-odds between voters in precincts with and without Russian TV coverage, holding proximity to the border constant.

Living in a precinct with Russian TV coverage is, on average, positively related to voting pro-Russian, controlling for proximity to the border. This effect is significantly different from zero at the 1% significance level.

- With 95% confidence, the predicted probability of voting for a pro-Russian party when living within 25km of the border and being exposed to Russian propaganda ranges from 0.21 to 0.34.
- We are 95% confident that for a voter living within 25km of the Russian border, exposure to Russian TV propaganda is associated with an increase in the probability of voting for a pro-Russian party of 12 to 27 percentage points.
- For those living in a precinct further than 25km from the border, the average effect of exposure to Russian TV propaganda on the probability of pro-Russian voting ranges from 16 to 59 percentage points.

The sampling distribution of a statistic is a probability distribution based on a large number of samples of size \(n\) from a given population

Sampling distributions represent the variability of our estimates: if we had taken different samples from the population, we would have obtained slightly different estimates

Sampling distributions of most parameter estimates are (approximately) normal:

- Determined by two parameters, mean (center) and standard deviation (spread)

We can use our coefficient estimates and uncertainty about them to simulate sampling distributions:

(Intercept) | russian_tv | within_25km |
---|---|---|
-1.916805 | 2.113583 | -1.2053787 |
-1.373905 | 1.143104 | -0.8299581 |
-1.827432 | 2.242678 | -1.2927962 |
-1.612314 | 1.854074 | -1.1262945 |
-1.217560 | 1.569573 | -1.3917399 |
-1.690160 | 1.410358 | -0.6638327 |

(Intercept)_mean | russian_tv_mean | within_25km_mean |
---|---|---|
-1.473347 | 1.784857 | -1.322555 |

(Intercept)_sd | russian_tv_sd | within_25km_sd |
---|---|---|
0.2035962 | 0.5138858 | 0.5047867 |
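A minimal sketch of how such simulated coefficients could be drawn, using only the estimates and standard errors from the regression table. It assumes independent normal draws for each coefficient, which is a simplification: a full simulation would draw from a multivariate normal using the model's variance-covariance matrix. The seed, the number of draws, and the object name `sim_coefs` are illustrative.

```r
set.seed(123)  # arbitrary seed, for reproducibility only

est <- c(`(Intercept)` = -1.474, russian_tv = 1.790, within_25km = -1.319)
se  <- c(0.214, 0.504, 0.493)
n_sims <- 1000

# ASSUMPTION: independent draws per coefficient (covariances ignored)
sim_coefs <- sapply(seq_along(est), function(j) rnorm(n_sims, est[j], se[j]))
colnames(sim_coefs) <- names(est)

head(sim_coefs, 3)
round(colMeans(sim_coefs), 2)       # close to the estimates
round(apply(sim_coefs, 2, sd), 2)   # close to the standard errors
```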

- Now instead of one equation with *estimated* coefficients, we have many with similar, *simulated* coefficients
- Each equation will result in slightly different log odds value (and predicted probability, too)

\[ \begin{aligned} \tilde{\beta}_0^1 \times 1+\tilde{\beta}_1^1 \times \text{Russian TV}_i +\tilde{\beta}_2^1 \times \text{Living within 25km}_i &= \log\Big(\dfrac{\tilde{\pi}^1}{1-\tilde{\pi}^1}\Big)\\ \tilde{\beta}_0^2 \times 1+\tilde{\beta}_1^2 \times \text{Russian TV}_i +\tilde{\beta}_2^2 \times \text{Living within 25km}_i &= \log\Big(\dfrac{\tilde{\pi}^2}{1-\tilde{\pi}^2}\Big)\\ \tilde{\beta}_0^3 \times 1+\tilde{\beta}_1^3 \times \text{Russian TV}_i +\tilde{\beta}_2^3 \times \text{Living within 25km}_i &= \log\Big(\dfrac{\tilde{\pi}^3}{1-\tilde{\pi}^3}\Big)\\ &\;\;\vdots \\ \tilde{\beta}_0^{1000} \times 1+\tilde{\beta}_1^{1000} \times \text{Russian TV}_i +\tilde{\beta}_2^{1000} \times \text{Living within 25km}_i &= \log\Big(\dfrac{\tilde{\pi}^{1000}}{1-\tilde{\pi}^{1000}}\Big) \end{aligned} \]

\[ {\text{Russian TV} = 0,~\text{Living within 25 km} = 1} \]

```
# sims$sim.coefs: matrix of simulated coefficients (one row per draw)
# scenario: russian_tv = 0, within_25km = 1
# manually calculate the log odds (not predicted probabilities yet)
lo1 <- sims$sim.coefs[1, 1] + sims$sim.coefs[1, 2] * 0 +
  sims$sim.coefs[1, 3] * 1
lo2 <- sims$sim.coefs[2, 1] + sims$sim.coefs[2, 2] * 0 +
  sims$sim.coefs[2, 3] * 1
# and so on for every row in the matrix

# custom inverse logit function
inv_logit <- function(x) {
  exp(x) / (1 + exp(x))
}

# transform log odds to probabilities
inv_logit(lo1)
inv_logit(lo2)
```

```
(Intercept)
0.04220143
```

```
(Intercept)
0.09940411
```

Estimate | 2.5 % | 97.5 % |
---|---|---|
0.0577046 | 0.0207833 | 0.1448121 |

We are 95% confident that the predicted probability of voting pro-Russian lies between 0.02 and 0.14 (that is, between 2% and 14%) in case \[{\text{Russian TV} = 0,~\text{Living within 25 km} = 1}\]
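The row-by-row calculation can also be done for all draws at once. The sketch below is self-contained, so it uses a stand-in for `sims$sim.coefs` drawn from independent normals with the reported estimates and standard errors. That is an assumption made for illustration; the resulting interval will be similar to, but not the same as, the one reported above.

```r
set.seed(123)  # arbitrary seed
n_sims <- 1000

# ASSUMPTION: stand-in for sims$sim.coefs (independent normal draws)
sim_coefs <- cbind(
  intercept   = rnorm(n_sims, -1.474, 0.214),
  russian_tv  = rnorm(n_sims,  1.790, 0.504),
  within_25km = rnorm(n_sims, -1.319, 0.493)
)

# scenario: intercept = 1, Russian TV = 0, within 25 km = 1
x <- c(1, 0, 1)

log_odds <- as.vector(sim_coefs %*% x)  # one log odds value per draw
probs <- plogis(log_odds)               # inverse logit

# percentile-based 95% interval across the draws
quantile(probs, c(0.025, 0.975))
```

The 2.5% and 97.5% quantiles of the simulated probabilities serve as the lower and upper bounds of the interval.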

\[{\text{Russian TV} = 0,~\text{Living within 25 km} = 1}\] \[{\text{Russian TV} = 1,~\text{Living within 25 km} = 1}\]

russian_tv = 0 | russian_tv = 1 |
---|---|
0.0422014 | 0.2672538 |
0.0994041 | 0.2571643 |
0.0422805 | 0.2936858 |
0.0607332 | 0.2922390 |
0.0685423 | 0.2612027 |
0.0867489 | 0.2801668 |

Scenario | Estimate | 2.5 % | 97.5 % |
---|---|---|---|
russian_tv = 0 | 0.0577046 | 0.0207833 | 0.1448121 |
russian_tv = 1 | 0.2682990 | 0.2102465 | 0.3318915 |

```
# first difference: expected value with Russian TV minus without
fds <- transform(evs,
                 `First Difference` = `russian_tv = 1` - `russian_tv = 0`)
fds %>%
  summary() %>%
  kable()
```

Scenario | Estimate | 2.5 % | 97.5 % |
---|---|---|---|
russian_tv = 0 | 0.0577046 | 0.0207833 | 0.1448121 |
russian_tv = 1 | 0.2682990 | 0.2102465 | 0.3318915 |
First Difference | 0.2105944 | 0.1133236 | 0.2741849 |

On average, for those living in precincts within 25km of the border, being exposed to Russian TV propaganda is associated with an increase in the probability of voting for a pro-Russian party of 21 percentage points. The 95% confidence interval for this quantity ranges from 11 to 27 percentage points, making the effect of Russian TV propaganda significantly different from zero.

```
evs <- sim_setx(sim = sims,                  # object with simulated coefs
                x = list(russian_tv = 0:1,   # scenario with desired (plausible) values
                         within_25km = 0:1))
as.matrix(evs) %>% head(3) %>% kable()
```

russian_tv = 0, within_25km = 0 | russian_tv = 1, within_25km = 0 | russian_tv = 0, within_25km = 1 | russian_tv = 1, within_25km = 1 |
---|---|---|---|
0.1282183 | 0.5490364 | 0.0422014 | 0.2672538 |
0.2019897 | 0.4425544 | 0.0994041 | 0.2571643 |
0.1385445 | 0.6023452 | 0.0422805 | 0.2936858 |

```
# first differences (Russian TV = 1 minus Russian TV = 0), by proximity to the border
fds <- transform(
  evs,
  `FD_within_25km = 0` = `russian_tv = 1, within_25km = 0` - `russian_tv = 0, within_25km = 0`,
  `FD_within_25km = 1` = `russian_tv = 1, within_25km = 1` - `russian_tv = 0, within_25km = 1`
)
summary(fds) %>% kable()
```

Scenario | Estimate | 2.5 % | 97.5 % |
---|---|---|---|
russian_tv = 0, within_25km = 0 | 0.1863569 | 0.1315394 | 0.2503005 |
russian_tv = 1, within_25km = 0 | 0.5783134 | 0.3334135 | 0.7823231 |
russian_tv = 0, within_25km = 1 | 0.0577046 | 0.0207833 | 0.1448121 |
russian_tv = 1, within_25km = 1 | 0.2682990 | 0.2102465 | 0.3318915 |
FD_within_25km = 0 | 0.3919565 | 0.1467803 | 0.5987365 |
FD_within_25km = 1 | 0.2105944 | 0.1133236 | 0.2741849 |
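The point estimates of the two first differences can be reproduced directly from the reported logit coefficients. This is a sketch with an illustrative helper `first_diff()`; the intervals in the table still require the simulated coefficients.

```r
b0 <- -1.474; b_tv <- 1.790; b_25 <- -1.319  # estimates from the regression table

# first difference in predicted probability: Russian TV = 1 vs. Russian TV = 0
first_diff <- function(within_25km) {
  plogis(b0 + b_tv + b_25 * within_25km) - plogis(b0 + b_25 * within_25km)
}

round(c(`within_25km = 0` = first_diff(0),
        `within_25km = 1` = first_diff(1)), 3)
```

The coefficient on `russian_tv` is the same in both scenarios, yet the change in probability differs: on the probability scale, the effect of one variable depends on the values of the others.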

Russian TV | Pro-Russian Vote | No pro-Russian Vote |
---|---|---|
No | 27 | 131 |
Yes | 58 | 142 |

- We want to compare the risk of Pro-Russian Vote for those with exposure to Russian TV and those without it.
- We’ll use the odds to compare the two groups

\[\text{odds} = \frac{P(\text{success})}{P(\text{failure})} = \frac{\text{# of successes}}{\text{# of failures}}\]

Odds of voting pro-Russian with Russian TV exposure: \(\frac{58}{142} = 0.408\)

Odds of voting pro-Russian without Russian TV exposure: \(\frac{27}{131} = 0.206\)

Based on this, we see that those with Russian TV exposure had higher odds of voting pro-Russian than those without Russian propaganda exposure.

We can summarize the relationship with odds ratio (OR): \(OR = \frac{\text{odds}_1}{\text{odds}_2} = \frac{\omega_1}{\omega_2}\)

\(OR = \frac{\text{odds}_{with}}{\text{odds}_{without}} = \frac{0.408}{0.206} = 1.982\)

The odds of voting pro-Russian for those with exposure to Russian TV are 1.982 times the odds for those without exposure to Russian TV.

```
m3 <- glm(pro_russian_vote ~ russian_tv,
          family = binomial,
          data = UA)
m3$coefficients[2] # log odds
```

```
russian_tv
0.6839764
```

The log odds of voting pro-Russian are 0.684 higher for those with exposure to Russian TV compared to those without exposure to Russian TV.

The odds of voting pro-Russian for those with exposure to Russian TV are 1.982 *times* the odds for those without exposure to Russian TV.
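The link between the 2x2 table and the logit coefficient can be checked without the original data. The sketch below refits the model to the four cell counts using frequency weights. The data frame `UA_counts` is an illustrative stand-in built from the table, not the `UA` data set itself.

```r
# ASSUMPTION: cell counts from the 2x2 table, used as frequency weights
UA_counts <- data.frame(
  russian_tv       = c(0, 0, 1, 1),
  pro_russian_vote = c(1, 0, 1, 0),
  n                = c(27, 131, 58, 142)
)

# odds ratio by hand
odds_with    <- 58 / 142
odds_without <- 27 / 131
or_by_hand   <- odds_with / odds_without

# same quantity from a logit model fitted to the counts
m <- glm(pro_russian_vote ~ russian_tv, family = binomial,
         weights = n, data = UA_counts)

c(by_hand = or_by_hand, from_glm = unname(exp(coef(m)["russian_tv"])))
```

With a single binary predictor, the exponentiated coefficient equals the odds ratio computed from the table.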

For each one-unit increase in \(X_k\), the log-odds of \(Y = 1\) are expected to change by \(\beta_k\) (holding all else constant).

For each one-unit increase in \(X_k\), the odds of \(Y = 1\) are expected to multiply by a factor of \(e^{\beta_k}\) (holding all else constant).

OR

For each one-unit increase in \(X_k\), the odds of \(Y = 1\) are expected to change by \((e^{\beta_k} - 1) \times 100\%\) (holding all else constant).
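A short numeric illustration of the difference between the multiplicative factor and the percent change in the odds, using the `russian_tv` coefficient from `m3` rounded to 0.684. The percent change is \((e^{\beta_k} - 1) \times 100\); the variable names are illustrative.

```r
beta_k <- 0.684   # log odds coefficient for russian_tv (rounded)

factor_change  <- exp(beta_k)               # odds multiply by this factor
percent_change <- (exp(beta_k) - 1) * 100   # percent change in the odds

round(c(factor = factor_change, percent = percent_change), 1)
```

A factor of about 2 corresponds to an increase in the odds of roughly 98%, not 198%.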

We can calculate the **C% confidence interval** for \(\beta_k\) as the following:

\[ \Large{\hat{\beta}_k \pm z^* SE_{\hat{\beta}_k}} \]

where \(z^*\) is calculated from the \(N(0,1)\) distribution

This is an interval for the change in the log-odds for every one unit increase in \(x_k\).

Exponentiating the endpoints gives an interval for the multiplicative change in the **odds** for every one unit increase in \(x_k\):

\[ \Large{e^{\hat{\beta}_k \pm z^* SE_{\hat{\beta}_k}}} \]

**Interpretation:** We are \(C\%\) confident that for every one unit increase in \(x_k\), the odds multiply by a factor of \(e^{\hat{\beta}_k - z^* SE_{\hat{\beta}_k}}\) to \(e^{\hat{\beta}_k + z^* SE_{\hat{\beta}_k}}\), holding all else constant.
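A sketch of both intervals for the `russian_tv` coefficient, using the estimate and standard error from the regression table and a 95% level. This is the normal-approximation interval from the formula above; intervals computed by other methods or at other confidence levels will differ. The variable names are illustrative.

```r
beta_hat <- 1.790   # russian_tv estimate from the logit model
se_hat   <- 0.504   # its standard error
conf     <- 0.95    # confidence level C

z_star <- qnorm(1 - (1 - conf) / 2)   # about 1.96 for C = 95%

ci_log_odds <- beta_hat + c(-1, 1) * z_star * se_hat   # log odds scale
ci_odds     <- exp(ci_log_odds)                        # odds (multiplicative) scale

round(ci_log_odds, 3)
round(ci_odds, 2)
```

The interval on the log odds scale is symmetric around the estimate; after exponentiating, it is no longer symmetric around \(e^{\hat{\beta}_k}\).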

- 20 data points that we assume come from a normal distribution. We know that the normal distribution has two parameters, mean and variance
- Which of the plotted distributions has most likely generated the data points?

- The points seem to be centered around zero and their range is roughly \([-2,2]\)
- Most likely, it is the violet distribution that has generated the points
- With MLE: (1) we observe the data, (2) assume a distribution it comes from, and (3) look for the values of parameters defining this distribution that result in the curve that best fits the data
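The three steps can be sketched in R. The 20 data points and the four candidate distributions below are illustrative assumptions, not the ones plotted on the slide. For each candidate, the log-likelihood is the sum of the log densities of the observed points.

```r
# ASSUMPTION: 20 illustrative points (standard normal quantiles)
x <- qnorm(ppoints(20))

# candidate distributions: mean and standard deviation
candidates <- list(
  "N(0, 1)"    = c(mean = 0,  sd = 1),
  "N(2, 1)"    = c(mean = 2,  sd = 1),
  "N(0, 3)"    = c(mean = 0,  sd = 3),
  "N(-1, 0.5)" = c(mean = -1, sd = 0.5)
)

# log-likelihood of the data under each candidate
log_lik <- sapply(candidates, function(par) {
  sum(dnorm(x, mean = par["mean"], sd = par["sd"], log = TRUE))
})
round(log_lik, 1)
names(which.max(log_lik))   # candidate that fits the data best

# closed-form ML estimates for a normal distribution
c(mean = mean(x), sd = sqrt(mean((x - mean(x))^2)))
```

Rather than comparing a handful of candidates, MLE searches over all parameter values; for the normal distribution the maximum is available in closed form, as in the last line.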