Uncertainty and Statistical Inference

R for Data Analysis
Session 10

Viktoriia Semenova

University of Mannheim
Fall 2023


Statistical Inference

… is the process of using sample data to make conclusions about the underlying population the sample came from

G population Population sample Sample population->sample      Probability sample->population   Inference

Why Communicate Uncertainty

If you want to estimate a population parameter, do you prefer to report a range of values the parameter might be in, or a single value?

If we report a point estimate, we probably won’t hit the exact population parameter

If we report a range of plausible values, we have a good shot at capturing the parameter

Why Communicate Uncertainty

How to Communicate Uncertainty

  • Standard error: since point estimates vary from sample to sample, we quantify this variability with what is called the standard error (SE) Standard error is a standard deviation of the sampling distribution of a statistic

\[ SE(\hat\beta_1)= \sqrt{Var(\hat{\beta_1})}, \qquad Var(\hat{\beta_1}) = \frac{\overbrace{\hat\sigma^2}^{\text{estimated regression error}^2\\\text{a.k.a. residual variance}}}{\underbrace{\sum_{i=1}^{n}(x_i - \bar x)^2}_{\text{sum of squared deviations of } x}} \]

  • Confidence interval: a range of our best guesses about the point estimate. One way to calculate it is using the standard error

We call a confidence interval a \((1 - \alpha)\)% confidence interval if it is constructed such that it contains the true parameter at least \((1 - \alpha)\)% of the time if we repeat the experiment a large number of times.

Thought Experiment

Suppose we had data for the whole population (say 100’000 students) and we could estimate the true parameter values:

\[ \text{Evaluations} = \beta_0 + \beta_1 \cdot \text{Beauty Score} + \varepsilon \]

Compare these values to estimates from a random sample of 500 students:

term parameter estimate
(Intercept) 3.691 3.756
beauty 0.070 0.061

Sampling Variability

Now let’s take 5000 samples (N = 1000) from this population of 100’000 students, run the same bivariate model on each of these samples, and plot the results:

Sampling Distributions

  • Sampling distributions are hypothetical constructs, we never observe them in real life

  • Sampling distributions (of most statistics) are approximately normally distributed (if the sample size is sufficiently large)

  • Sampling distributions are centered at the true value of the population coefficient (the value we would get from linear modeling if we did indeed use the full population)

  • The spread of the sampling distribution gives us a measure of the precision of our estimate:

    • If the sampling distribution is very wide, then our estimate is imprecise; our estimate would vary widely from sample to sample
    • If the sampling distribution is very narrow, then our estimate is precise; our estimate would not vary much from sample to sample

Standard Error

Standard Error is a Standard Deviation of Sampling Distribution of that Statistic:

# SD of sampling distributions
estimates %>% 
  group_by(term) %>% 
  summarize(sd = sd(estimate))
# A tibble: 2 × 2
  term            sd
  <chr>        <dbl>
1 (Intercept) 0.0544
2 beauty      0.0114
# estimates SEs  
estimates %>% 
  filter(sample == 10) 
# A tibble: 2 × 9
  term        estimate std.error statistic  p.value conf.low conf.high n     sample
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl> <chr>  <int>
1 (Intercept)   3.69      0.0522     70.7  0          3.59      3.79   1000      10
2 beauty        0.0735    0.0114      6.43 1.94e-10   0.0511    0.0959 1000      10
# estimates SEs  
estimates %>% 
  filter(sample == 1000) 
# A tibble: 2 × 9
  term        estimate std.error statistic  p.value conf.low conf.high n     sample
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl> <chr>  <int>
1 (Intercept)   3.66      0.0507     72.1  0          3.56      3.76   1000    1000
2 beauty        0.0771    0.0106      7.24 8.73e-13   0.0562    0.0980 1000    1000

If we took samples of N = 100 and not N = 1000, what should happen to the spread of sampling distributions?

Sample Size and Precision

Larger samples allow for more precise estimates (i.e. smaller standard errors)

We cannot observe these distributions, but the SE provides us with the information about their variability. We can use this info to obtain a plausible range of estimates given our data, the confidence interval

Confidence Intervals

  • Range of our best guesses about the point estimate with X% confidence (accounting for sampling variability)
  • If we have a large sample, we can estimate CIs from standard errors and quantiles for the standard normal distribution \(\mathcal{N}(\mu = 0, \sigma = 1)\): \(100 \cdot (1 - \alpha)\)% confidence interval is \(CI_{(1 - \alpha)} = \hat \beta \pm \underbrace{z_{\alpha / 2} \cdot SE}_{\text{Margin of Error}}\)

\[CI_{95} = \hat \beta \pm 1.96 SE = [ \hat \beta - 1.96 SE; \hat\beta + 1.96 SE]\]

  • The procedure of how 95% confidence intervals are constructed ensures that:
    • when we draw repeated samples
    • 95% of CIs calculated with this formula on the respective samples
    • cover the true parameter
  • one single calculated CI tells us nothing about:
    • parameters in other samples
    • individual observations in our sample and/or population

Confidence Intervals Illustration

Confidence Intervals as Ring Toss

  • Each sample gives a different CI or toss of the ring
  • Some samples the ring will contain the target (the CI will contain the truth) other times it won’t
    • We don’t know if the CI for our sample contains the truth!
  • Confidence level: percent of the time our CI will contain the population parameter
    • Number of ring tosses that will hit the target.
    • We get to choose, but typical values are 90%, 95%, and 99%
  • The confidence level of a CI determine how often the CI will be wrong

The confidence level does not mean that there is 95% probability of each CI including true value. Confidence level is a statement about the procedure in general, not each individual interval.

Communicating Uncertainty: Examples

sample_est <- students_population %>%
  slice_sample(n = 500) %>%
  lm(eval ~ beauty, data = .) %>%
  tidy(conf.int = T, conf.level = 0.95) %T>% 
# A tibble: 2 × 7
  term        estimate std.error statistic   p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
1 (Intercept)   3.71      0.0710     52.3  1.49e-204   3.58      3.85  
2 beauty        0.0681    0.0154      4.43 1.14e-  5   0.0379    0.0983

95% confidence interval for the effect of beauty ranges from 0.04 to 0.1.
We are 95% confident that the effect of beauty on course eevaluations ranges from, on average, 0.04 to 0.1.
We are 95% confident that for each additional point in beauty score, we would expect course evaluations to increase by 0.04 to 0.1 points, on average.
With 95% probability the expected increase in course evaluataions ranges between 0.04 to 0.1 points for each additional unit of beauty score.

Communicating Uncertainty

Accuracy vs. Precision Trade-off

  • By design, confidence intervals of different levels vary in their width:
tidy(sample_m, conf.int = T, conf.level = 0.95)
# A tibble: 2 × 7
  term        estimate std.error statistic   p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
1 (Intercept)   3.62      0.0707     51.2  1.87e-200   3.48       3.76 
2 beauty        0.0883    0.0150      5.88 7.53e-  9   0.0588     0.118
tidy(sample_m, conf.int = T, conf.level = 0.99)
# A tibble: 2 × 7
  term        estimate std.error statistic   p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
1 (Intercept)   3.62      0.0707     51.2  1.87e-200   3.43       3.80 
2 beauty        0.0883    0.0150      5.88 7.53e-  9   0.0495     0.127
  • This relates to the precision vs. accuracy of our interval estimates: the higher the level we choose, the more certain we will be to cover the true value
  • But we also loose the precision of the estimate: wider ranges can become rather uninformative

Hypothesis Testing

Decision Based on Confidence Intervals

Do the data provide sufficient evidence that β1 (the true slope for the population) is different from 0?

  • Suppose we want to know if there is a linear effect of beauty on evaluations (\(H_A: \hat\beta_1 \neq 0\)) or if there is no linear effect (\(H_0: \hat\beta_1 = 0\))


  1. Calculate the standard error and confidence interval (e.g., 95% CI)
  2. Check if the confidence interval covers zero or not.
  • If CI does not cover zero, we conclude that with 95% confidence, the slope coefficient is different from zero
  • If CI covers zero, we conclude that with 95% confidence, the slope coefficient is not different from zero
# A tibble: 2 × 7
  term        estimate std.error statistic   p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
1 (Intercept)   3.71      0.0710     52.3  1.49e-204   3.58      3.85  
2 beauty        0.0681    0.0154      4.43 1.14e-  5   0.0379    0.0983

Hypothesis Testing with p-values

  • Statistical hypothesis testing is a thought experiment: What would the world look like if we knew the truth?

Do the data provide sufficient evidence that β1 (the true slope for the population) is different from 0?


  1. Start with a null hypothesis, \(H_0\) that represents the “null-world”
  2. Set an alternative hypothesis, \(H_A\) that represents the research question, i.e. what we’re testing for
  3. Conduct a hypothesis test under the assumption that the null hypothesis is true and calculate a p-value (probability of observed or more extreme outcome given that the null hypothesis is true)
    • if the test results suggest that the data do not provide convincing evidence for the alternative hypothesis, stick with the null hypothesis
    • if they do, then reject the null hypothesis in favor of the alternative

Course Evals of English Natives vs. Non-natives

Are courses taught by English natives evaluated higher than courses of non-natives?

evals %>% 
  group_by(nonenglish) %>% 
  summarize(avg_eval = mean(eval)) 
# A tibble: 2 × 2
  nonenglish avg_eval
       <dbl>    <dbl>
1          0     4.02
2          1     3.69
lm(eval ~ nonenglish, evals) %>% tidy()
# A tibble: 2 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)    4.02     0.0264    152.   0      
2 nonenglish    -0.329    0.107      -3.07 0.00229

\[\hat\beta_1 = 3.69 - 4.02 = - 0.33\]

Null Words with Permutation

How would the effect \(\hat\beta_1\) look like if there was no difference?

  • Shuffle the eval and nonenglish columns and calculate the difference in evaluations between natives and non-natives
  • Repeat re-shuffling and estimation of the differences ↑ 1000 times
  • Check \(\hat\beta_1\) in the null world: what is the probability of observing data as or more extreme as our data under the null (i.e. the p-value)?

Null Words with Permutation: Two-sided p-value

Null Words with Permutation: Slope Example

Repeated permutations allow for quantifying the variability in the slope under the condition that there is no linear relationship (i.e., that the null hypothesis is true)

Communicating p-values

  • By default, p-values are about comparing to zero, i.e. no difference between groups with and without \(X\)

“the effect of \(X\) is not statistically significant/discernible (p = 0.3)”

“the effect of \(X\) is not distinguishable from zero at 0.05 level”

“the effect of \(X\) is not significantly distinguishable from zero (p = 0.3)”

“X is a significant/discernible predictor of Y (p ≈ 0.002)”

  • Every coefficient is going to be significant, some at 0.05 level, some at 0.85 level \(\Rightarrow\) saying something is significant without specifying the level is not enough
  • In general, report corresponding p-value in parenthesis to just one significant digit: 0.002 rather than 0.0017

Main Take Aways

  • Sampling variability implies uncertainty related to the estimation process: more data leads to more precise estimates (i.e. smaller SEs, narrower CIs)
  • Confidence intervals quantify the uncertainty associated with the average outcome, not each individual prediction. Confidence level indicates how often, in the long run, the CI would be wrong (i.e. not contain the true value). We cannot know if the CI we obtained is a “good” or a “bad” one, though.
  • With hypothesis testing, we are comparing the null-world to our estimates. \(p\)-values indicate the probability of us observing the data at least as extreme as we have in the null-world


Probabilistic Interpretation of CIs

\[CI_{95}: [\bar x - 1.96 SE, \bar x + 1.96 SE]\]

  • Randomness comes from the stage of drawing a sample (we only have one, but hypothetically, we draw them repeatedly)

  • After we draw the random sample, calculating the CI is a matter of procedure:

    • there is no more randomness, it’s just applying the formula
    • hence, we can think of this as a realized experiment
    • if CI bounds are just numbers, the true fixed value is either inside (1) or not (0)
    • we say: a single calculated CI contains the true parameter or not (and in real life, we don’t know if it does, so we hope it is one of the ones that cover the true value)
  • Before the random sample is drawn, we can apply probabilistic interpretation to CI:

    • for a 95% confidence interval (before the sample for calculating that CI is drawn!), there is a 95% chance that a CI will contain \(\mu\)
    • the random interval \([\bar x - 1.96 SE, \bar x + 1.96 SE]\) contains \(\mu\) with probability 0.95. It is a random interval, since the endpoints change with different samples (i.e., we don’t have not drawn the sample yet)