Random noise, the stochastic component of the model: the sum of “everything else” not in the systematic component of the model

Errors in Regression

Error Variance and Regression Standard Error

Once we fit the model, we can use the residuals to estimate error variance (i.e. residual variance):

\[
\hat\sigma^2 = \frac{\overbrace{\sum_{i=1}^{n}\hat{e}_i^2}^{\text{sum of squared residuals}}}{
\underbrace{
\underbrace{n}_{\text{number of}\\\text{observations}}-\underbrace{k}_{\text{number of}\\\text{covariates}} - 1
}_{\text{degrees of freedom}}
}
\]

Regression standard error \(\sqrt{\hat\sigma^2}\):

A measure of the average error (average difference between observed and predicted values of the outcome) in same units as the outcome variable

Model Conditions

Conditions for Inference

Linearity: There is a linear relationship between the outcome and predictor variables

Independence: The errors are independent from each other, i.e. knowing the error term for one observation doesn’t tell you anything about the error term for another observation

Normality: The distribution of errors is approximately normal \(\varepsilon|X \sim \mathcal{N}(0, \sigma^2)\)

Constant variance: The variability of the errors is equal for all values of the predictor variable, i.e. the errors are homeoscedastic

Linearity Assumption

Check the plot of residuals vs. predicted values for patterns

If you observe any patterns, you can look at individual plots of residuals vs. each predictor to try to identify the issue

Look for patterns in predictors you treat as continuous, for binary predictors the assumption is always met

Transformations of variables could sometimes address the problems

Violation will bias the coefficients and pose problems for uncertainty measures and hypothesis testing

Linearity Assumption Violation

Independence Assumption

Examples of violation: if the observations are clustered, e.g.

there are repeated measures from the same individual, as in longitudinal data

if classrooms were first sampled followed by sampling individuals within classes

We can often check the independence condition based on the context of the data and how the observations were collected

If the data were collected in a particular order, examine a scatterplot of the residuals versus order in which the data were collected

Violations may not bias the coefficient, but will pose problems for uncertainty measures and hypothesis testing

Normality Assumption

At any given predictor value the distribution of outcome given predictor is assumed to be normal

Compare the distributions of residuals to a normal distribution

Violations may pose problems for uncertainty measures and hypothesis testing in small samples

Constant Variance Assumption

The vertical spread of the residuals is not constant across the plot

Non-constant error variance could mean we predict some observations better (i.e. with less error) than others

Violation results in inaccurate confidence intervals and p-values

Uncertainty

Statistical Inference

… is the process of using sample data to make conclusions about the underlying population the sample came from

Sampling in Real Life

When you taste a spoonful of soup and decide the spoonful you tasted isn’t salty enough, that’s exploratory analysis

If you generalize and conclude that your entire soup needs salt, that’s an inference

For your inference to be valid, the spoonful you tasted (the sample) needs to be representative of the entire pot (the population)

Why Communicate Uncertainty

If you want to estimate a population parameter, do you prefer to report a range of values the parameter might be in, or a single value?

If we report a point estimate, we probably won’t hit the exact population parameter

If we report a range of plausible values, we have a good shot at capturing the parameter

How to Communicate Uncertainty

Standard error: since point estimates vary from sample to sample, we quantify this variability with what is called the standard error (SE). Standard error is a standard deviation of the sampling distribution of a statistic.

Confidence interval: a range of our best guesses about the point estimate. One way to calculate it is using the standard error.

We call a confidence interval a \((1 - \alpha)\)% confidence interval if it is constructed such that it contains the true parameter at least \((1 - \alpha)\)% of the time if we repeat the experiment a large number of times.

Thought Experiment

Suppose we had data for the whole population (100’000 students) and we could estimate the true parameter values:

Compare these values to estimates from a random sample of 500 students:

term

parameter

estimate

(Intercept)

3.691

3.712

beauty

0.070

0.074

Sampling Variability

Now let’s take 5000 samples (N = 1000) from this population of 100’000 students, run the same bivariate model on each of these samples, and plot the results:

Sampling Distributions

Sampling distributions are hypothetical constructs, underlying the logic of frequentist approach to statistics. We never observe them in real life

Sampling distributions (of many statistics) are approximately normally distributed (if the sample size is sufficiently large)

Sampling distributions are centered at the true value of the population coefficient (the value we would get from linear modeling if we did indeed use the full population)

The spread of the sampling distribution gives us a measure of the precision of our estimate:

If the sampling distribution is very wide, then our estimate is imprecise; our estimate would vary widely from sample to sample

If the sampling distribution is very narrow, then our estimate is precise; our estimate would not vary much from sample to sample

Standard Error

Standard Error are Standard Deviations of Sampling Distributions:

# SD of sampling distributionsestimates %>%group_by(term) %>%summarize(sd =sd(estimate))

# A tibble: 2 × 2
term sd
<chr> <dbl>
1 (Intercept) 0.0545
2 beauty 0.0114

If we took samples of N = 100 and not N = 1000, what should happen to the spread of sampling distributions?

Sample Size and Precision

Larger samples allow for more precise estimates (i.e. smaller standard errors)

We cannot observe these distributions, but the SE provides us with the information about their variability
We can use this info to obtain a plausible range of estimates given our data, the confidence interval

Confidence Intervals

Range of our best guesses about the point estimate with X% confidence (accounting for sampling variability)

If we have a large sample, we can estimate CIs from standard errors and quantiles for the standard normal distribution \(\mathcal{N}(\mu = 0, \sigma = 1)\): \(100 \cdot (1 - \alpha)\)% confidence interval is \(CI_{(1 - \alpha)} = \hat \beta \pm \underbrace{z_{\alpha / 2} \cdot SE}_{\text{Margin of Error}}\)

The procedure of how 95% confidence intervals are constructed ensures that:

when we draw repeated samples

95% of CIs calculated with this formula on the respective samples

cover the true parameter

one single calculated CI tells us nothing about:

parameters in other samples

individual observations in our sample and/or population

Confidence Intervals Illustration

Confidence Intervals as Ring Toss

Each sample gives a different CI or toss of the ring

Some samples the ring will contain the target (the CI will contain the truth) other times it won’t

We don’t know if the CI for our sample contains the truth!

Confidence level: percent of the time our CI will contain the population parameter

Number of ring tosses that will hit the target.

We get to choose, but typical values are 90%, 95%, and 99%

The confidence level of a CI determine how often the CI will be wrong

The confidence level does not mean that there is 95% probability of each CI including true value. Confidence level is a statement about the procedure in general, not each individual interval.

95% confidence interval for the effect of beauty ranges from 0.04 to 0.1.
We are 95% confident that the effect of beauty on course eevaluations ranges from, on average, 0.04 to 0.1.
We are 95% confident that for each additional point in beauty score, we would expect course evaluations to increase by 0.04 to 0.1 points, on average. With 95% probability the expected increase in course evaluataions ranges between 0.04 to 0.1 points for each additional unit of beauty score.

Communicating Uncertainty

Accuracy vs. Precision Trade-off

By design, confidence intervals of different levels vary in their width:

This relates to the precision vs. accuracy of our interval estimates: the higher the level we choose, the more certain we will be to cover the true value

But we also loose the precision of the estimate: wider ranges can become rather uninformative

Hypothesis Testing

Decision Based on Confidence Intervals

Do the data provide sufficient evidence that β1 (the true slope for the population) is different from 0?

Suppose we want to know if there is a linear effect of beauty on evaluations (\(H_A: \hat\beta_1 \neq 0\)) or if there is no linear effect (\(H_0: \hat\beta_1 = 0\))

Workflow:

Calculate the standard error and confidence interval (e.g., 95% CI)

Check if the confidence interval covers zero or not.

If CI does not cover zero, we conclude that with 95% confidence, the slope coefficient is different from zero

If CI covers zero, we conclude that with 95% confidence, the slope coefficient is not different from zero

Statistical hypothesis testing is a thought experiment: What would the world look like if we knew the truth?

Do the data provide sufficient evidence that β1 (the true slope for the population) is different from 0?

Workflow:

Start with a null hypothesis, \(H_0\) that represents the “null-world”

Set an alternative hypothesis, \(H_A\) that represents the research question, i.e. what we’re testing for

Conduct a hypothesis test under the assumption that the null hypothesis is true and calculate a p-value (probability of observed or more extreme outcome given that the null hypothesis is true)

if the test results suggest that the data do not provide convincing evidence for the alternative hypothesis, stick with the null hypothesis

if they do, then reject the null hypothesis in favor of the alternative

Course Evals of English Natives vs. Non-natives

Are courses taught by English natives evaluated higher than courses of non-natives?

How would the effect \(\hat\beta_1\) look like if there was no difference?

Shuffle the eval and nonenglish columns and calculate the difference in evaluations between natives and non-natives

Repeat re-shuffling and estimation of the differences ↑ 1000 times

Check \(\hat\beta_1\) in the null world: what is the probability of observing data as or more extreme as our data under the null (i.e. the p-value)?

Null Words with Permutation: Two-sided p-value

Null Words with Permutation: Slope Example

Repeated permutations allow for quantifying the variability in the slope under the condition that there is no linear relationship (i.e., that the null hypothesis is true)

Main Take Aways

Sampling variability implies uncertainty related to the estimation process: more data leads to more precise estimates (i.e. smaller SEs, narrower CIs)

Confidence intervals quantify the uncertainty associated with the average outcome, not each individual prediction. Confidence level indicates how often, in the long run, the CI would be wrong (i.e. not contain the true value). We cannot know if the CI we obtained is a “good” or a “bad” one, though.

With hypothesis testing, we are comparing the null-world to our estimates. \(p\)-values indicate the probability of us observing the data at least as extreme as we have in the null-world

Appendix

Probabilistic Interpretation of CIs

\[CI_{95}: [\bar x - 1.96 SE, \bar x + 1.96 SE]\]

Randomness comes from the stage of drawing a sample (we only have one, but hypothetically, we draw them repeatedly)

After we draw the random sample, calculating the CI is a matter of procedure:

there is no more randomness, it’s just applying the formula

hence, we can think of this as a realized experiment

if CI bounds are just numbers, the true fixed value is either inside (1) or not (0)

we say: a single calculated CI contains the true parameter or not (and in real life, we don’t know if it does, so we hope it is one of the ones that cover the true value)

Before the random sample is drawn, we can apply probabilistic interpretation to CI:

for a 95% confidence interval (before the sample for calculating that CI is drawn!), there is a 95% chance that a CI will contain \(\mu\)

the random interval \([\bar x - 1.96 SE, \bar x + 1.96 SE]\) contains \(\mu\) with probability 0.95. It is a random interval, since the endpoints change with different samples (i.e., we don’t have not drawn the sample yet)