Random noise, the stochastic component of the model: the sum of “everything else” not captured by the systematic component
Errors in Regression
Error Variance and Regression Standard Error
Once we fit the model, we can use the residuals to estimate error variance (i.e. residual variance):
\[
\hat\sigma^2 = \frac{\overbrace{\sum_{i=1}^{n}\hat{e}_i^2}^{\text{sum of squared residuals}}}{\underbrace{n - k - 1}_{\text{degrees of freedom}}},
\]
where \(n\) is the number of observations and \(k\) is the number of covariates.
Regression standard error \(\sqrt{\hat\sigma^2}\):
A measure of the average error (the average difference between observed and predicted values of the outcome), in the same units as the outcome variable (see the sketch below)
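As an illustration, both quantities can be computed directly from a fitted model. A minimal sketch, assuming m is a hypothetical fitted lm model:

ssr <- sum(resid(m)^2)            # sum of squared residuals
df  <- nobs(m) - length(coef(m))  # n - k - 1 (coef() includes the intercept)
sigma2_hat <- ssr / df            # estimated error variance
sqrt(sigma2_hat)                  # regression standard error, same as sigma(m)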
Model Conditions
Conditions for Inference
Linearity: There is a linear relationship between the outcome and predictor variables
Independence: The errors are independent from each other, i.e. knowing the error term for one observation doesn’t tell you anything about the error term for another observation
Normality: The distribution of errors is approximately normal \(\varepsilon|X \sim \mathcal{N}(0, \sigma^2)\)
Constant variance: The variability of the errors is equal for all values of the predictor variable, i.e. the errors are homoscedastic
Linearity Assumption
Check the plot of residuals vs. predicted values for patterns (see the sketch after this list)
If you observe any patterns, you can look at individual plots of residuals vs. each predictor to try to identify the issue
Look for patterns in predictors you treat as continuous, for binary predictors the assumption is always met
Transformations of variables could sometimes address the problems
Violation will bias the coefficients and pose problems for uncertainty measures and hypothesis testing
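A minimal sketch of the residuals vs. predicted values check, assuming m is a hypothetical fitted lm model:

# residuals vs. predicted values; look for systematic patterns
plot(fitted(m), resid(m),
     xlab = "Predicted values", ylab = "Residuals")
abline(h = 0, lty = 2)  # residuals should scatter randomly around zero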
Linearity Assumption Violation
Independence Assumption
Examples of violation: if the observations are clustered, e.g.
there are repeated measures from the same individual, as in longitudinal data
if classrooms were sampled first, and then individuals were sampled within classrooms
We can often check the independence condition based on the context of the data and how the observations were collected
If the data were collected in a particular order, examine a scatterplot of the residuals versus the order in which the data were collected (see the sketch after this list)
Violations may not bias the coefficients, but they will pose problems for uncertainty measures and hypothesis testing
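A sketch of the order check, assuming the rows of the data happen to be in collection order (an assumption to verify for your data):

# residuals vs. order of data collection; look for trends or waves
plot(seq_along(resid(m)), resid(m),
     xlab = "Order of data collection", ylab = "Residuals")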
Normality Assumption
At any given predictor value the distribution of outcome given predictor is assumed to be normal
Compare the distribution of the residuals to a normal distribution (see the sketch after this list)
Violations may pose problems for uncertainty measures and hypothesis testing in small samples
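A normal quantile-quantile plot is a common way to make this comparison; a minimal sketch, again with the hypothetical fitted model m:

# Q-Q plot: points should fall close to the line if residuals are ~normal
qqnorm(resid(m))
qqline(resid(m))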
Constant Variance Assumption
Check the vertical spread of the residuals across the residuals vs. predicted values plot: if it is not constant, the assumption is violated (see the sketch after this list)
Non-constant error variance could mean we predict some observations better (i.e. with less error) than others
Violation results in inaccurate confidence intervals and p-values
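A sketch of the spread check using base R's built-in scale-location diagnostic, assuming m is a fitted lm:

# scale-location plot: a flat trend suggests roughly constant variance
plot(m, which = 3)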
Uncertainty
Statistical Inference
… is the process of using sample data to draw conclusions about the underlying population the sample came from
Sampling in Real Life
When you taste a spoonful of soup and decide the spoonful you tasted isn’t salty enough, that’s exploratory analysis
If you generalize and conclude that your entire soup needs salt, that’s an inference
For your inference to be valid, the spoonful you tasted (the sample) needs to be representative of the entire pot (the population)
Why Communicate Uncertainty
If you want to estimate a population parameter, do you prefer to report a range of values the parameter might be in, or a single value?
If we report a point estimate, we probably won’t hit the exact population parameter
If we report a range of plausible values, we have a good shot at capturing the parameter
How to Communicate Uncertainty
Standard error: since point estimates vary from sample to sample, we quantify this variability with what is called the standard error (SE). The standard error is the standard deviation of the sampling distribution of a statistic.
Confidence interval: a range of plausible values for the population parameter. One way to calculate it is using the standard error.
We call a confidence interval a \(100 \cdot (1 - \alpha)\)% confidence interval if it is constructed such that it contains the true parameter at least \(100 \cdot (1 - \alpha)\)% of the time if we repeat the experiment a large number of times.
Thought Experiment
Suppose we had data for the whole population (100’000 students) and we could estimate the true parameter values:
Compare these values to estimates from a random sample of 500 students:
term         parameter  estimate
(Intercept)      3.691     3.712
beauty           0.070     0.074
Sampling Variability
Now let’s take 5000 samples (N = 1000) from this population of 100’000 students, run the same bivariate model on each of these samples, and plot the results:
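A sketch of this simulation in R, assuming pop is a hypothetical data frame holding the population (columns eval and beauty); the resulting estimates object is what the following slides summarize:

library(dplyr)
library(purrr)
library(broom)

set.seed(2025)  # hypothetical seed for reproducibility
estimates <- map_dfr(1:5000, function(i) {
  pop %>%
    slice_sample(n = 1000) %>%       # draw one random sample of N = 1000
    lm(eval ~ beauty, data = .) %>%  # fit the same bivariate model
    tidy()                           # one row per term with its estimate
})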
Sampling Distributions
Sampling distributions are hypothetical constructs underlying the logic of the frequentist approach to statistics. We never observe them in real life
Sampling distributions (of many statistics) are approximately normally distributed (if the sample size is sufficiently large)
Sampling distributions are centered at the true value of the population coefficient (the value we would get from linear modeling if we did indeed use the full population)
The spread of the sampling distribution gives us a measure of the precision of our estimate:
If the sampling distribution is very wide, then our estimate is imprecise; our estimate would vary widely from sample to sample
If the sampling distribution is very narrow, then our estimate is precise; our estimate would not vary much from sample to sample
Standard Error
Standard errors are standard deviations of sampling distributions:
# SD of sampling distributions
estimates %>%
  group_by(term) %>%
  summarize(sd = sd(estimate))
# A tibble: 2 × 2
  term            sd
  <chr>        <dbl>
1 (Intercept) 0.0545
2 beauty      0.0114
If we took samples of N = 100 and not N = 1000, what should happen to the spread of sampling distributions?
Sample Size and Precision
Larger samples allow for more precise estimates (i.e. smaller standard errors)
We cannot observe these distributions, but the SE provides information about their variability
We can use this information to obtain a plausible range of estimates given our data: the confidence interval
Confidence Intervals
Range of plausible values for the population parameter with X% confidence (accounting for sampling variability)
If we have a large sample, we can estimate CIs from standard errors and quantiles of the standard normal distribution \(\mathcal{N}(\mu = 0, \sigma = 1)\): the \(100 \cdot (1 - \alpha)\)% confidence interval is \(CI_{(1 - \alpha)} = \hat \beta \pm \underbrace{z_{\alpha / 2} \cdot SE}_{\text{Margin of Error}}\) (sketched below)
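A minimal sketch of this calculation, assuming m is a hypothetical fitted lm with a beauty coefficient:

est <- coef(summary(m))["beauty", "Estimate"]
se  <- coef(summary(m))["beauty", "Std. Error"]
z   <- qnorm(1 - 0.05 / 2)  # ~1.96 for alpha = 0.05
c(lower = est - z * se, upper = est + z * se)
# for comparison, confint(m, "beauty") gives the t-based interval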
The procedure by which 95% confidence intervals are constructed ensures that:
when we draw repeated samples
95% of CIs calculated with this formula on the respective samples
cover the true parameter
one single calculated CI tells us nothing about:
parameters in other samples
individual observations in our sample and/or population
Confidence Intervals Illustration
Confidence Intervals as Ring Toss
Each sample gives a different CI or toss of the ring
In some samples the ring will contain the target (the CI will contain the truth); other times it won’t
We don’t know if the CI for our sample contains the truth!
Confidence level: the percent of the time our CI will contain the population parameter
In the analogy: the share of ring tosses that hit the target
We get to choose, but typical values are 90%, 95%, and 99%
The confidence level of a CI determines how often the CI will be wrong
The confidence level does not mean that there is 95% probability of each CI including true value. Confidence level is a statement about the procedure in general, not each individual interval.
The 95% confidence interval for the effect of beauty ranges from 0.04 to 0.1.
We are 95% confident that the effect of beauty on course evaluations ranges from, on average, 0.04 to 0.1.
We are 95% confident that for each additional point in beauty score, we would expect course evaluations to increase by 0.04 to 0.1 points, on average. With 95% probability, the expected increase in course evaluations ranges between 0.04 and 0.1 points for each additional unit of beauty score.
Communicating Uncertainty
Accuracy vs. Precision Trade-off
By design, confidence intervals of different levels vary in their width:
This relates to the precision vs. accuracy of our interval estimates: the higher the level we choose, the more certain we will be to cover the true value
But we also lose precision: wider ranges can become rather uninformative (see the sketch below)
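To see the trade-off directly, a sketch with the hypothetical fitted model m:

# higher confidence level -> wider interval
confint(m, "beauty", level = 0.90)
confint(m, "beauty", level = 0.95)
confint(m, "beauty", level = 0.99)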
Hypothesis Testing
Decision Based on Confidence Intervals
Do the data provide sufficient evidence that \(\beta_1\) (the true slope for the population) is different from 0?
Suppose we want to know if there is a linear effect of beauty on evaluations (\(H_A: \beta_1 \neq 0\)) or if there is no linear effect (\(H_0: \beta_1 = 0\)); note the hypotheses concern the population slope \(\beta_1\), not the estimate \(\hat\beta_1\)
Workflow:
Calculate the standard error and confidence interval (e.g., 95% CI)
Check if the confidence interval covers zero or not (see the sketch after this list)
If the CI does not cover zero, we conclude with 95% confidence that the slope coefficient is different from zero
If the CI covers zero, the data do not provide convincing evidence that the slope coefficient is different from zero
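A minimal sketch of this decision rule, again with the hypothetical model m:

ci <- confint(m, "beauty", level = 0.95)  # 1 x 2 matrix: lower, upper
ci
ci[1] > 0 | ci[2] < 0  # TRUE -> the CI does not cover zero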
Statistical hypothesis testing is a thought experiment: What would the world look like if we knew the truth?
Do the data provide sufficient evidence that \(\beta_1\) (the true slope for the population) is different from 0?
Workflow:
Start with a null hypothesis, \(H_0\) that represents the “null-world”
Set an alternative hypothesis, \(H_A\) that represents the research question, i.e. what we’re testing for
Conduct a hypothesis test under the assumption that the null hypothesis is true and calculate a p-value (the probability of the observed or a more extreme outcome, given that the null hypothesis is true)
if the test results suggest that the data do not provide convincing evidence for the alternative hypothesis, stick with the null hypothesis
if they do, then reject the null hypothesis in favor of the alternative (see the sketch after this list)
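For a regression slope, this is the t-test that R's summary() reports for every coefficient; a sketch with the hypothetical fitted model m:

# t-test of H0: beta = 0 for each coefficient
summary(m)$coefficients
# columns: Estimate, Std. Error, t value, Pr(>|t|) (the p-value)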
Course Evals of English Natives vs. Non-natives
Are courses taught by native English speakers evaluated higher than courses taught by non-native speakers?
What would the effect \(\hat\beta_1\) look like if there were no difference?
Shuffle the eval and nonenglish columns and calculate the difference in evaluations between natives and non-natives
Repeat the re-shuffling and estimation of the differences 1000 times
Check \(\hat\beta_1\) in the null world: what is the probability of observing data as or more extreme than our data under the null (i.e. the p-value)? A code sketch follows
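A sketch of the permutation procedure, assuming evals is a hypothetical data frame with columns eval and nonenglish (a 0/1 indicator):

set.seed(2025)  # hypothetical seed
null_slopes <- replicate(1000, {
  shuffled <- sample(evals$nonenglish)         # shuffle to break any real association
  coef(lm(evals$eval ~ shuffled))["shuffled"]  # difference/slope in the null world
})
obs <- coef(lm(eval ~ nonenglish, data = evals))["nonenglish"]
mean(abs(null_slopes) >= abs(obs))  # two-sided p-value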
Null Worlds with Permutation: Two-sided p-value
Null Worlds with Permutation: Slope Example
Repeated permutations allow for quantifying the variability in the slope under the condition that there is no linear relationship (i.e., that the null hypothesis is true)
Main Takeaways
Sampling variability implies uncertainty related to the estimation process: more data leads to more precise estimates (i.e. smaller SEs, narrower CIs)
Confidence intervals quantify the uncertainty associated with the average outcome, not each individual prediction. Confidence level indicates how often, in the long run, the CI would be wrong (i.e. not contain the true value). We cannot know if the CI we obtained is a “good” or a “bad” one, though.
With hypothesis testing, we are comparing the null world to our estimates. \(p\)-values indicate the probability of observing data at least as extreme as ours in the null world
Appendix
Probabilistic Interpretation of CIs
\[CI_{95}: [\bar x - 1.96 \cdot SE,\ \bar x + 1.96 \cdot SE]\]
Randomness comes from the stage of drawing a sample (we only have one, but hypothetically, we draw them repeatedly)
After we draw the random sample, calculating the CI is a matter of procedure:
there is no more randomness, it’s just applying the formula
hence, we can think of this as a realized experiment
if CI bounds are just numbers, the true fixed value is either inside (1) or not (0)
we say: a single calculated CI contains the true parameter or not (and in real life, we don’t know if it does, so we hope it is one of the ones that cover the true value)
Before the random sample is drawn, we can apply probabilistic interpretation to CI:
for a 95% confidence interval (before the sample for calculating that CI is drawn!), there is a 95% chance that a CI will contain \(\mu\)
the random interval \([\bar x - 1.96 \cdot SE, \bar x + 1.96 \cdot SE]\) contains \(\mu\) with probability 0.95. It is a random interval, since the endpoints change with different samples (i.e., we have not drawn the sample yet); a simulation sketch follows
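A simulation sketch of this interpretation: before each sample is drawn, the interval is random, and in the long run about 95% of realized intervals cover \(\mu\). Names and settings below are hypothetical:

set.seed(2025)
mu <- 0  # true population mean (known here because we simulate)
covers <- replicate(10000, {
  x  <- rnorm(50, mean = mu)           # draw one random sample
  se <- sd(x) / sqrt(length(x))
  ci <- mean(x) + c(-1.96, 1.96) * se  # the realized CI for this sample
  ci[1] <= mu && mu <= ci[2]           # does it cover the true mean?
})
mean(covers)  # close to 0.95 in the long run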