R for Data Analysis

Session 7

Viktoriia Semenova

University of Mannheim

Fall 2023

**Fitting the line with OLS**

**Interpretation of Regression Coefficients**

**True Model**: \(y = \underbrace{\beta_0}_{\text{intercept}} + \underbrace{\beta_1}_{\text{slope}} x + \underbrace{\varepsilon}_{\text{error}}\)

- Population parameters \(\beta\): truth (estimand), unknown to us

**Estimated Model**: \(y = \underbrace{\hat\beta_0}_{\text{intercept}} + \underbrace{\hat\beta_1}_{\text{slope}} x + \underbrace{\hat\varepsilon}_{\text{residual}}\)

- Estimates \(\hat{\beta}\): our best guess about the estimand given the data

- A model has two parts:
  - **Systematic Component** of a linear model: \(\underbrace{\hat y}_{\text{fitted value}} = \underbrace{\hat\beta_0}_{\text{intercept}} + \underbrace{\hat\beta_1}_{\text{slope}} x\)
  - **Stochastic Component** of a linear model: \(\hat\varepsilon\)

\[ y = \hat \beta_0 + \hat \beta_1 x_1 + \hat \varepsilon \]

- \(y\): dependent variable, outcome
- \(x\): independent variable, explanatory variable, treatment, predictor, feature
- \(\hat y\): predicted values of y, y-hat, fitted values, regression line

- \(\hat \beta_0\): intercept, prediction when all \(x=0\), constant
- \(\hat \beta_k\): slope, the effect of the \(k\)-th variable
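The pieces named above map onto a fitted `lm` object in R. The following is a minimal sketch, assuming simulated data: the data frame `dat` and its columns `cookies` and `happiness` are hypothetical and are not the data used in the slides.

```r
# Hypothetical data, simulated only for illustration (not the course data)
set.seed(7)
cookies   <- sample(0:10, size = 50, replace = TRUE)
happiness <- 1 + 0.15 * cookies + rnorm(50, sd = 0.5)
dat <- data.frame(cookies, happiness)

# Estimate the linear model by OLS
fit <- lm(happiness ~ cookies, data = dat)

coef(fit)             # estimated intercept and slope
head(fitted(fit))     # systematic component: fitted values (y-hat)
head(residuals(fit))  # stochastic component: residuals

# Each observed outcome is its fitted value plus its residual
all.equal(dat$happiness, unname(fitted(fit) + residuals(fit)))
```

Because the data are simulated, the estimates will be close to, but not exactly, the values 1 and 0.15 used to generate them.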

\[ \operatorname{\widehat{Happiness}} = 1.16 + 0.155(\operatorname{Number\ of\ Cookies\ Eaten}) \]

The slope of the model for predicting happiness score from number of consumed cookies is 0.155. Which of the following is the best interpretation of this value?

- For every additional cookie eaten, the happiness score goes up by 0.155 points, on average.
- For every additional cookie eaten, we expect the happiness score to be higher by 0.155 points, on average.
- For every additional cookie eaten, the happiness score goes up by 0.155 points.
- For every one point increase in happiness score, the number of cookies eaten goes up by 0.155 points, on average.


\[ \operatorname{\widehat{Happiness}} = 1.16 + 0.155(\operatorname{Number\ of\ Cookies\ Eaten}) \]

Slope: For every additional cookie eaten, we expect the happiness score to be higher by 0.155 points, on average.

- Each additional cookie has the same effect on happiness, i.e. the *marginal effect* is constant
- The associated increase in happiness is 0.155 for the first and, say, the tenth cookie

Intercept: If the number of eaten cookies is 0, we expect the happiness score to be 1.16 points.

- Intercept is meaningful in the context of data because the predictor can feasibly take values equal to or near zero
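To make the interpretation concrete, the fitted equation can be evaluated by hand. This sketch only plugs numbers into the equation above; the chosen cookie counts are arbitrary examples.

```r
# Coefficients taken from the fitted equation above
b0 <- 1.16   # intercept
b1 <- 0.155  # slope

cookies <- c(0, 1, 9, 10)
predicted_happiness <- b0 + b1 * cookies
predicted_happiness
# 1.160 1.315 2.555 2.710

# Constant marginal effect: the first and the tenth cookie
# are associated with the same expected difference
first_cookie <- predicted_happiness[2] - predicted_happiness[1]
tenth_cookie <- predicted_happiness[4] - predicted_happiness[3]
all.equal(first_cookie, tenth_cookie)
```

The prediction at zero cookies equals the intercept, 1.16.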

\[ \begin{aligned} &\widehat{\text{Happiness}} = 2 - 0.1 \cdot \text{Cookies} \end{aligned} \]

\[ \begin{aligned} &\widehat{\text{Happiness}} = 2 + 0 \cdot \text{Cookies} = \overline{\text{Happiness}} \end{aligned} \]

\[ \begin{aligned} &\widehat{\text{Happiness}} = 1.1 + 0.1 \cdot \text{Cookies} \end{aligned} \]

\[ \begin{aligned} &\widehat{\text{Happiness}} = 1.16 + 0.155 \cdot \text{Cookies} \end{aligned} \]

**Explained Variance (Sum of Squares):** \[ESS = \sum^{n}_{i=1}(\hat y_i - \bar y)^2\]

**Sum of Squared Residuals:** \[RSS = \sum^n_{i=1}\hat{\varepsilon_i}^2 = \sum^n_{i=1}(y_i - \hat{y_i})^2\]

**Total Sum of Squares:** \[TSS = \sum^{n}_{i=1}(y_i - \bar y)^2 = ESS + RSS\]
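The decomposition can be checked numerically. This is a minimal sketch on simulated data; the vectors `x` and `y` are hypothetical stand-ins for cookies and happiness.

```r
# Hypothetical data, simulated only for illustration
set.seed(7)
x <- sample(0:10, size = 50, replace = TRUE)
y <- 1 + 0.15 * x + rnorm(50, sd = 0.5)

fit   <- lm(y ~ x)
y_hat <- fitted(fit)
y_bar <- mean(y)

ESS <- sum((y_hat - y_bar)^2)  # explained sum of squares
RSS <- sum((y - y_hat)^2)      # sum of squared residuals
TSS <- sum((y - y_bar)^2)      # total sum of squares

c(ESS = ESS, RSS = RSS, TSS = TSS)

# The decomposition holds for an OLS fit that includes an intercept
all.equal(TSS, ESS + RSS)
```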

\[ \begin{aligned} \text{Sum of Squared Residuals (SSR)} &= \sum^n_{i=1}\hat{\varepsilon}_i^2 \\ &= \sum^n_{i=1}(y_i - \hat{y}_i)^2 \\ &= \sum^n_{i=1}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 \end{aligned} \]
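The sum of squared residuals can be computed for any candidate line, not only the fitted one. The sketch below does this on simulated data; the helper `ssr()` is a hypothetical name, and the candidate coefficients are borrowed from the lines shown earlier, so the resulting numbers will not match the slides.

```r
# Hypothetical data, simulated only for illustration
set.seed(7)
cookies   <- sample(0:10, size = 50, replace = TRUE)
happiness <- 1 + 0.15 * cookies + rnorm(50, sd = 0.5)

# Sum of squared residuals for a line with intercept b0 and slope b1
ssr <- function(b0, b1) sum((happiness - b0 - b1 * cookies)^2)

ssr_down <- ssr(2, -0.1)               # downward-sloping candidate
ssr_flat <- ssr(mean(happiness), 0)    # flat line at the mean of the outcome
ssr_up   <- ssr(1.1, 0.1)              # upward-sloping candidate

# OLS estimates from lm() for comparison
ols     <- coef(lm(happiness ~ cookies))
ssr_ols <- ssr(ols[["(Intercept)"]], ols[["cookies"]])

c(down = ssr_down, flat = ssr_flat, up = ssr_up, ols = ssr_ols)
```

For the flat line at the mean, the value equals the total sum of squares, since every prediction is \(\bar y\). The OLS line has the smallest value of the four.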

- The OLS estimator finds the values of \(\hat\beta\) that minimize \(SSR\) (the same quantity as \(RSS\) above), i.e. the unexplained variation

- We use differential calculus to find these values of \(\hat{\beta}\) (full derivation)

- The regression line goes through the center of mass point, the coordinates corresponding to average \(X\) and average \(Y\), \((\bar{X}, \bar{Y})\):

\[\bar{Y} = \hat \beta_0 + \hat \beta_1 \bar{X} ~ \rightarrow ~ \hat \beta_0 = \bar{Y} - \hat \beta_1 \bar{X}\]
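A sketch of the calculus step, assuming the simple model with one predictor (the linked derivation covers the details): set both partial derivatives of \(SSR\) to zero.

\[ \begin{aligned} \frac{\partial SSR}{\partial \hat\beta_0} &= -2\sum^n_{i=1}(y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0 \\ \frac{\partial SSR}{\partial \hat\beta_1} &= -2\sum^n_{i=1}x_i(y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0 \end{aligned} \]

Dividing the first condition by \(-2n\) gives \(\bar{Y} - \hat\beta_0 - \hat\beta_1 \bar{X} = 0\), which is the intercept formula above. Substituting that intercept into the second condition and solving for the slope yields:

\[ \hat\beta_1 = \frac{\sum^n_{i=1}(x_i - \bar{X})(y_i - \bar{Y})}{\sum^n_{i=1}(x_i - \bar{X})^2} \]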

The slope has the same sign as the correlation coefficient, because both standard deviations are positive: \(\hat\beta_1 = \operatorname{Corr}(X,Y) \dfrac{\sigma_Y}{\sigma_X}\)

The sum of the residuals is zero (by design, whenever the model includes an intercept): \(\sum_{i = 1}^n \hat\varepsilon_i = 0\)
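These properties of the OLS line can be verified numerically. The following is a minimal sketch on simulated data; `x` and `y` are hypothetical, and `cor()` and `sd()` are used as sample versions of the correlation and standard deviations.

```r
# Hypothetical data, simulated only for illustration
set.seed(7)
x <- sample(0:10, size = 50, replace = TRUE)
y <- 1 + 0.15 * x + rnorm(50, sd = 0.5)

fit <- lm(y ~ x)
b0  <- coef(fit)[["(Intercept)"]]
b1  <- coef(fit)[["x"]]

# 1. The regression line passes through (mean of x, mean of y)
through_means <- all.equal(mean(y), b0 + b1 * mean(x))

# 2. The slope equals the correlation times the ratio of standard deviations
slope_formula <- all.equal(b1, cor(x, y) * sd(y) / sd(x))

# 3. The residuals sum to zero (up to floating-point error)
zero_sum <- all.equal(sum(residuals(fit)), 0)

c(through_means = through_means,
  slope_formula = slope_formula,
  zero_sum      = zero_sum)
```

All three checks should return `TRUE`; the third relies on the intercept being part of the model.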