Uncertainty and Statistical Inference

Data Analytics and Visualization with R
Session 7

Viktoriia Semenova

University of Mannheim
Spring 2023

Agenda for Today


Model Assumptions


Sampling Variability and Sampling Distributions


Standard Errors, Confidence Intervals, p-values


Lab: Communicating Uncertainty

Error Term in Regression Model

Errors in Regression

Random noise, the stochastic component of the model: the sum of “everything else” not in the systematic component of the model


Error Variance and Regression Standard Error

Once we fit the model, we can use the residuals to estimate error variance (i.e. residual variance):

\[ \hat\sigma^2 = \frac{\overbrace{\sum_{i=1}^{n}\hat{e}_i^2}^{\text{sum of squared residuals}}}{\underbrace{\underbrace{n}_{\substack{\text{number of}\\\text{observations}}} - \underbrace{k}_{\substack{\text{number of}\\\text{covariates}}} - 1}_{\text{degrees of freedom}}} \]

Regression standard error \(\sqrt{\hat\sigma^2}\):

  • A measure of the average error (average difference between observed and predicted values of the outcome) in same units as the outcome variable
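A minimal sketch of this computation in R, assuming a fitted model `m` (e.g. `m <- lm(eval ~ beauty, data = evals)`):

# estimate the residual variance and regression standard error by hand
n <- nobs(m)                          # number of observations
k <- length(coef(m)) - 1              # number of covariates (excluding intercept)
sigma2_hat <- sum(residuals(m)^2) / (n - k - 1)  # residual variance
sqrt(sigma2_hat)                      # regression standard error; same as sigma(m)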

Model Conditions

Conditions for Inference

  • Linearity: There is a linear relationship between the outcome and predictor variables
  • Independence: The errors are independent from each other, i.e. knowing the error term for one observation doesn’t tell you anything about the error term for another observation
  • Normality: The distribution of errors is approximately normal \(\varepsilon|X \sim \mathcal{N}(0, \sigma^2)\)
  • Constant variance: The variability of the errors is equal for all values of the predictor variable, i.e. the errors are homoscedastic

Linearity Assumption

  • Check the plot of residuals vs. predicted values for patterns (see the sketch after this list)

  • If you observe any patterns, you can look at individual plots of residuals vs. each predictor to try to identify the issue

  • Look for patterns in predictors you treat as continuous; for binary predictors the assumption is automatically met

  • Transformations of variables could sometimes address the problems

  • Violation will bias the coefficients and pose problems for uncertainty measures and hypothesis testing
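A sketch of this check, assuming a fitted model `m`:

library(dplyr)
library(ggplot2)
library(broom)

augment(m) %>%                        # adds .fitted and .resid columns
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Predicted values", y = "Residuals")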

Linearity Assumption Violation

Independence Assumption

  • Examples of violation: if the observations are clustered, e.g.

    • there are repeated measures from the same individual, as in longitudinal data
    • if classrooms were first sampled followed by sampling individuals within classes
  • We can often check the independence condition based on the context of the data and how the observations were collected

  • If the data were collected in a particular order, examine a scatterplot of the residuals versus order in which the data were collected

  • Violations may not bias the coefficient, but will pose problems for uncertainty measures and hypothesis testing

Normality Assumption

  • At any given predictor value the distribution of outcome given predictor is assumed to be normal

  • Compare the distribution of the residuals to a normal distribution (see the sketch after this list)

  • Violations may pose problems for uncertainty measures and hypothesis testing in small samples
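One way to compare, assuming a fitted model `m`, is a normal quantile-quantile plot of the residuals:

qqnorm(residuals(m))    # residual quantiles vs. theoretical normal quantiles
qqline(residuals(m))    # points close to the line suggest approximate normality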

Constant Variance Assumption

  • Check that the vertical spread of the residuals is constant across the plot of residuals vs. predicted values (see the sketch after this list)

  • Non-constant error variance could mean we predict some observations better (i.e. with less error) than others

  • Violation results in inaccurate confidence intervals and p-values
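A quick base-R sketch of this check, again assuming a fitted model `m`:

# the spread of residuals should look roughly constant across predicted values
plot(fitted(m), abs(residuals(m)),
     xlab = "Predicted values", ylab = "|Residuals|")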

Uncertainty

Statistical Inference

… is the process of using sample data to make conclusions about the underlying population the sample came from

[Diagram: Population → Sample via probability (sampling); Sample → Population via inference]

Sampling in Real Life

  • When you taste a spoonful of soup and decide the spoonful you tasted isn’t salty enough, that’s exploratory analysis
  • If you generalize and conclude that your entire soup needs salt, that’s an inference
  • For your inference to be valid, the spoonful you tasted (the sample) needs to be representative of the entire pot (the population)

Why Communicate Uncertainty

If you want to estimate a population parameter, do you prefer to report a range of values the parameter might be in, or a single value?

  • If we report a point estimate, we probably won’t hit the exact population parameter

  • If we report a range of plausible values, we have a good shot at capturing the parameter

How to Communicate Uncertainty

  • Standard error: since point estimates vary from sample to sample, we quantify this variability with what is called the standard error (SE). The standard error is the standard deviation of the sampling distribution of a statistic.

\[ SE(\hat\beta_1)= \sqrt{Var(\hat{\beta}_1)}, \qquad Var(\hat{\beta}_1) = \frac{\overbrace{\hat\sigma^2}^{\substack{\text{estimated regression error}^2\\\text{a.k.a. residual variance}}}}{\underbrace{\sum_{i=1}^{n}(x_i - \bar x)^2}_{\text{sum of squared deviations of } x}} \]

  • Confidence interval: a range of plausible values for the population parameter. One way to calculate it is using the standard error (see the sketch below).

We call a confidence interval a \(100(1 - \alpha)\)% confidence interval if it is constructed such that it contains the true parameter at least \(100(1 - \alpha)\)% of the time if we repeat the experiment a large number of times.
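A sketch verifying the SE formula against `lm()` output and building a 95% CI from it, assuming a bivariate model `m <- lm(eval ~ beauty, data = sample_data)` (the data name is illustrative):

x <- model.frame(m)$beauty
sigma2_hat <- sum(residuals(m)^2) / df.residual(m)   # residual variance
se_beta1 <- sqrt(sigma2_hat / sum((x - mean(x))^2))  # SE of the slope
se_beta1   # matches summary(m)$coefficients["beauty", "Std. Error"]

# large-sample 95% CI from the standard error
coef(m)["beauty"] + c(-1.96, 1.96) * se_beta1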

Thought Experiment

Suppose we had data for the whole population (100’000 students) and we could estimate the true parameter values:

\[ \text{Evaluations} = \beta_0 + \beta_1 \cdot \text{Beauty Score} + \varepsilon \]

Compare these values to estimates from a random sample of 500 students:

term          parameter   estimate
(Intercept)       3.691      3.712
beauty            0.070      0.074

Sampling Variability

Now let’s take 5000 samples (N = 1000) from this population of 100’000 students, run the same bivariate model on each of these samples, and plot the results:
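A sketch of how such an `estimates` table can be built, assuming the `students_population` data frame from the thought experiment:

library(dplyr)
library(purrr)
library(broom)

# draw 5000 samples of N = 1000, fit the model to each, collect tidy estimates
estimates <- map_df(1:5000, function(i) {
  students_population %>%
    slice_sample(n = 1000) %>%
    lm(eval ~ beauty, data = .) %>%
    tidy(conf.int = TRUE) %>%
    mutate(sample = i)
})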

Sampling Distributions

  • Sampling distributions are hypothetical constructs underlying the logic of the frequentist approach to statistics. We never observe them in real life

  • Sampling distributions (of many statistics) are approximately normally distributed (if the sample size is sufficiently large)

  • Sampling distributions are centered at the true value of the population coefficient (the value we would get from linear modeling if we did indeed use the full population)

  • The spread of the sampling distribution gives us a measure of the precision of our estimate:

    • If the sampling distribution is very wide, then our estimate is imprecise; our estimate would vary widely from sample to sample
    • If the sampling distribution is very narrow, then our estimate is precise; our estimate would not vary much from sample to sample

Standard Error

Standard errors are standard deviations of sampling distributions:

# SD of sampling distributions
estimates %>% 
  group_by(term) %>% 
  summarize(sd = sd(estimate))
# A tibble: 2 × 2
  term            sd
  <chr>        <dbl>
1 (Intercept) 0.0545
2 beauty      0.0114
# estimates SEs  
estimates %>% 
  filter(sample == 10) 
# A tibble: 2 × 9
  term        estimate std.error statistic       p.value conf.low conf.high n     sample
  <chr>          <dbl>     <dbl>     <dbl>         <dbl>    <dbl>     <dbl> <chr>  <int>
1 (Intercept)   3.71      0.0533     69.6  0               3.61      3.82   1000      10
2 beauty        0.0676    0.0112      6.01 0.00000000266   0.0455    0.0896 1000      10
# estimates SEs  
estimates %>% 
  filter(sample == 1000) 
# A tibble: 2 × 9
  term        estimate std.error statistic     p.value conf.low conf.high n     sample
  <chr>          <dbl>     <dbl>     <dbl>       <dbl>    <dbl>     <dbl> <chr>  <int>
1 (Intercept)   3.78      0.0526     71.8  0             3.67      3.88   1000    1000
2 beauty        0.0570    0.0114      4.99 0.000000724   0.0346    0.0795 1000    1000

If we took samples of N = 100 instead of N = 1000, what would happen to the spread of the sampling distributions?

Sample Size and Precision

Larger samples allow for more precise estimates (i.e. smaller standard errors)

We cannot observe these distributions, but the SE provides information about their variability.
We can use this information to obtain a plausible range of estimates given our data: the confidence interval.
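A sketch illustrating this, again assuming `students_population` (the helper `se_at()` is hypothetical; dplyr and broom loaded):

se_at <- function(n) {
  students_population %>%
    slice_sample(n = n) %>%
    lm(eval ~ beauty, data = .) %>%
    tidy() %>%
    filter(term == "beauty") %>%
    pull(std.error)
}
c(n_100 = se_at(100), n_1000 = se_at(1000))  # larger samples -> smaller SEs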

Confidence Intervals

  • A range of plausible values for the population parameter with X% confidence (accounting for sampling variability)
  • If we have a large sample, we can estimate CIs from standard errors and quantiles for the standard normal distribution \(\mathcal{N}(\mu = 0, \sigma = 1)\): \(100 \cdot (1 - \alpha)\)% confidence interval is \(CI_{(1 - \alpha)} = \hat \beta \pm \underbrace{z_{\alpha / 2} \cdot SE}_{\text{Margin of Error}}\)

\[CI_{95} = \hat \beta \pm 1.96 SE = [ \hat \beta - 1.96 SE; \hat\beta + 1.96 SE]\]
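The z quantiles come straight from the standard normal; for instance:

qnorm(1 - c(0.10, 0.05, 0.01) / 2)   # 90%, 95%, 99% levels -> 1.64, 1.96, 2.58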

  • The procedure of how 95% confidence intervals are constructed (simulated in the sketch after this list) ensures that:
    • when we draw repeated samples
    • 95% of CIs calculated with this formula on the respective samples
    • cover the true parameter
  • One single calculated CI tells us nothing about:
    • parameters in other samples
    • individual observations in our sample and/or population
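A sketch of this long-run behavior, assuming `students_population` and the true slope 0.070 from the thought experiment (dplyr, purrr, and broom loaded):

covers <- map_lgl(1:1000, function(i) {
  ci <- students_population %>%
    slice_sample(n = 500) %>%
    lm(eval ~ beauty, data = .) %>%
    tidy(conf.int = TRUE) %>%
    filter(term == "beauty")
  ci$conf.low <= 0.070 & 0.070 <= ci$conf.high  # does this CI cover the truth?
})
mean(covers)   # close to 0.95 in the long run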

Confidence Intervals Illustration

Confidence Intervals as Ring Toss

  • Each sample gives a different CI, or toss of the ring
  • In some samples the ring will contain the target (the CI will contain the truth); in others it won’t
    • We don’t know if the CI for our sample contains the truth!
  • Confidence level: the percent of the time our CI will contain the population parameter
    • The share of ring tosses that hit the target
    • We get to choose, but typical values are 90%, 95%, and 99%
  • The confidence level of a CI determines how often the CI will be wrong

The confidence level does not mean that there is 95% probability of each CI including true value. Confidence level is a statement about the procedure in general, not each individual interval.

Communicating Uncertainty: Examples

sample_est <- students_population %>%
  slice_sample(n = 500) %>%
  lm(eval ~ beauty, data = .) %>%
  tidy(conf.int = T, conf.level = 0.95) %T>% 
  print()
# A tibble: 2 × 7
  term        estimate std.error statistic   p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
1 (Intercept)   3.65      0.0738     49.4  4.20e-194   3.50      3.79  
2 beauty        0.0668    0.0159      4.19 3.33e-  5   0.0354    0.0981

The 95% confidence interval for the effect of beauty ranges from 0.04 to 0.1.
We are 95% confident that the effect of beauty on course evaluations ranges from, on average, 0.04 to 0.1.
We are 95% confident that for each additional point in beauty score, we would expect course evaluations to increase by 0.04 to 0.1 points, on average.
With 95% probability the expected increase in course evaluations ranges between 0.04 and 0.1 points for each additional unit of beauty score (note: such probabilistic phrasing is only valid before the sample is drawn; see the Appendix).

Communicating Uncertainty

Accuracy vs. Precision Trade-off

  • By design, confidence intervals of different levels vary in their width:
sample_m %>% 
  tidy(conf.int = T, conf.level = 0.95) 
# A tibble: 2 × 7
  term        estimate std.error statistic   p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
1 (Intercept)   3.63      0.0765     47.4  8.11e-187   3.48       3.78 
2 beauty        0.0866    0.0163      5.30 1.71e-  7   0.0545     0.119
sample_m %>% 
  tidy(conf.int = T, conf.level = 0.99)
# A tibble: 2 × 7
  term        estimate std.error statistic   p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
1 (Intercept)   3.63      0.0765     47.4  8.11e-187   3.43       3.82 
2 beauty        0.0866    0.0163      5.30 1.71e-  7   0.0444     0.129
  • This relates to the precision vs. accuracy of our interval estimates: the higher the level we choose, the more certain we will be to cover the true value
  • But we also lose precision: wider ranges can become rather uninformative

Hypothesis Testing

Decision Based on Confidence Intervals

Do the data provide sufficient evidence that β1 (the true slope for the population) is different from 0?

  • Suppose we want to know if there is a linear effect of beauty on evaluations (\(H_A: \beta_1 \neq 0\)) or if there is no linear effect (\(H_0: \beta_1 = 0\))

Workflow:

  1. Calculate the standard error and confidence interval (e.g., 95% CI)
  2. Check if the confidence interval covers zero or not.
  • If the CI does not cover zero, we conclude with 95% confidence that the slope coefficient is different from zero
  • If the CI covers zero, the data do not provide sufficient evidence that the slope coefficient differs from zero (we fail to reject the null)
sample_est
# A tibble: 2 × 7
  term        estimate std.error statistic   p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
1 (Intercept)   3.65      0.0738     49.4  4.20e-194   3.50      3.79  
2 beauty        0.0668    0.0159      4.19 3.33e-  5   0.0354    0.0981

Hypothesis Testing with p-values

  • Statistical hypothesis testing is a thought experiment: What would the world look like if we knew the truth?

Do the data provide sufficient evidence that β1 (the true slope for the population) is different from 0?

Workflow:

  1. Start with a null hypothesis, \(H_0\) that represents the “null-world”
  2. Set an alternative hypothesis, \(H_A\) that represents the research question, i.e. what we’re testing for
  3. Conduct a hypothesis test under the assumption that the null hypothesis is true and calculate a p-value (probability of observed or more extreme outcome given that the null hypothesis is true)
    • if the test results suggest that the data do not provide convincing evidence for the alternative hypothesis, stick with the null hypothesis
    • if they do, then reject the null hypothesis in favor of the alternative

Course Evals of English Natives vs. Non-natives

Are courses taught by English natives evaluated higher than courses of non-natives?

evals %>% 
  group_by(nonenglish) %>% 
  summarize(avg_eval = mean(eval)) 
# A tibble: 2 × 2
  nonenglish avg_eval
       <dbl>    <dbl>
1          0     4.02
2          1     3.69
lm(eval ~ nonenglish, evals) %>% tidy()
# A tibble: 2 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)    4.02     0.0264    152.   0      
2 nonenglish    -0.329    0.107      -3.07 0.00229

\[\hat\beta_1 = 3.69 - 4.02 = - 0.33\]

Null Worlds with Permutation

What would the effect \(\hat\beta_1\) look like if there were no difference?

  • Shuffle the eval and nonenglish columns and calculate the difference in evaluations between natives and non-natives
  • Repeat the re-shuffling and estimation of the differences above 1000 times (see the sketch after this list)
  • Check \(\hat\beta_1\) in the null world: what is the probability of observing data as or more extreme than our data under the null (i.e. the p-value)?
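A sketch of this permutation procedure, assuming the `evals` data (dplyr, purrr, and broom loaded):

obs_beta <- lm(eval ~ nonenglish, data = evals) %>%
  tidy() %>%
  filter(term == "nonenglish") %>%
  pull(estimate)

# the null world: shuffling nonenglish breaks any association with eval
null_betas <- map_dbl(1:1000, function(i) {
  evals %>%
    mutate(nonenglish = sample(nonenglish)) %>%
    lm(eval ~ nonenglish, data = .) %>%
    tidy() %>%
    filter(term == "nonenglish") %>%
    pull(estimate)
})

# two-sided p-value: share of null estimates at least as extreme as observed
mean(abs(null_betas) >= abs(obs_beta))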

Null Worlds with Permutation: Two-sided p-value

Null Worlds with Permutation: Slope Example

Repeated permutations allow for quantifying the variability in the slope under the condition that there is no linear relationship (i.e., that the null hypothesis is true)

Main Takeaways

  • Sampling variability implies uncertainty related to the estimation process: more data leads to more precise estimates (i.e. smaller SEs, narrower CIs)
  • Confidence intervals quantify the uncertainty associated with the average outcome, not each individual prediction. Confidence level indicates how often, in the long run, the CI would be wrong (i.e. not contain the true value). We cannot know if the CI we obtained is a “good” or a “bad” one, though.
  • With hypothesis testing, we are comparing the null-world to our estimates. \(p\)-values indicate the probability of observing data at least as extreme as ours in the null-world

Appendix

Probabilistic Interpretation of CIs

\[CI_{95}: [\bar x - 1.96 SE, \bar x + 1.96 SE]\]

  • Randomness comes from the stage of drawing a sample (we only have one, but hypothetically, we draw them repeatedly)

  • After we draw the random sample, calculating the CI is a matter of procedure:

    • there is no more randomness, it’s just applying the formula
    • hence, we can think of this as a realized experiment
    • if CI bounds are just numbers, the true fixed value is either inside (1) or not (0)
    • we say: a single calculated CI contains the true parameter or not (and in real life, we don’t know if it does, so we hope it is one of the ones that cover the true value)
  • Before the random sample is drawn, we can apply probabilistic interpretation to CI:

    • for a 95% confidence interval (before the sample for calculating that CI is drawn!), there is a 95% chance that a CI will contain \(\mu\)
    • the random interval \([\bar x - 1.96 SE, \bar x + 1.96 SE]\) contains \(\mu\) with probability 0.95. It is a random interval, since the endpoints change with different samples (i.e., we have not drawn the sample yet)