… is the process of using sample data to draw conclusions about the underlying population the sample came from
Why Communicate Uncertainty
If you want to estimate a population parameter, do you prefer to report a range of values the parameter might be in, or a single value?
If we report a point estimate, we probably won’t hit the exact population parameter
If we report a range of plausible values, we have a good shot at capturing the parameter
How to Communicate Uncertainty
Standard error: since point estimates vary from sample to sample, we quantify this variability with what is called the standard error (SE). The standard error is the standard deviation of the sampling distribution of a statistic
Confidence interval: a range of plausible values for the parameter, built around the point estimate. One way to calculate it is using the standard error
We call a confidence interval a \(100 \cdot (1 - \alpha)\)% confidence interval if it is constructed such that it contains the true parameter at least \(100 \cdot (1 - \alpha)\)% of the time when we repeat the experiment a large number of times.
Thought Experiment
Suppose we had data for the whole population (say 100’000 students) and we could estimate the true parameter values:
Compare these values to estimates from a random sample of 500 students:
term          parameter   estimate
(Intercept)       3.691      3.756
beauty            0.070      0.061
Sampling Variability
Now let’s take 5000 samples (N = 1000) from this population of 100’000 students, run the same bivariate model on each of these samples, and plot the results:
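A minimal sketch of such a simulation, assuming a data frame population with the outcome eval and the predictor beauty (these object and column names are assumptions); it builds the estimates object summarized below:
library(dplyr)
library(purrr)
library(broom)

set.seed(42)

# draw 5000 samples of N = 1000, fit the bivariate model on each, keep the coefficient estimates
estimates <- map_dfr(1:5000, function(i) {
  smpl <- slice_sample(population, n = 1000)
  tidy(lm(eval ~ beauty, data = smpl))
})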
Sampling Distributions
Sampling distributions are hypothetical constructs; we never observe them in real life
Sampling distributions (of most statistics) are approximately normally distributed (if the sample size is sufficiently large)
Sampling distributions are centered at the true value of the population coefficient (the value we would get from linear modeling if we did indeed use the full population)
The spread of the sampling distribution gives us a measure of the precision of our estimate:
If the sampling distribution is very wide, then our estimate is imprecise; our estimate would vary widely from sample to sample
If the sampling distribution is very narrow, then our estimate is precise; our estimate would not vary much from sample to sample
Standard Error
The standard error is the standard deviation of the sampling distribution of that statistic:
# SD of sampling distributions
estimates %>%
  group_by(term) %>%
  summarize(sd = sd(estimate))
# A tibble: 2 × 2
term sd
<chr> <dbl>
1 (Intercept) 0.0544
2 beauty 0.0114
If we took samples of N = 100 and not N = 1000, what should happen to the spread of sampling distributions?
Sample Size and Precision
Larger samples allow for more precise estimates (i.e. smaller standard errors)
We cannot observe these distributions, but the SE provides information about their variability. We can use this information to obtain a plausible range of estimates given our data: the confidence interval
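Reusing the sketch above (same assumed names), drawing samples of N = 100 instead makes the spread larger:
estimates_100 <- map_dfr(1:5000, function(i) {
  tidy(lm(eval ~ beauty, data = slice_sample(population, n = 100)))
})

# the SDs should come out roughly sqrt(10) times larger than with N = 1000
estimates_100 %>%
  group_by(term) %>%
  summarize(sd = sd(estimate))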
Confidence Intervals
A range of plausible values for the parameter, built around the point estimate, with X% confidence (accounting for sampling variability)
If we have a large sample, we can estimate CIs from standard errors and quantiles of the standard normal distribution \(\mathcal{N}(\mu = 0, \sigma = 1)\): the \(100 \cdot (1 - \alpha)\)% confidence interval is \(CI_{(1 - \alpha)} = \hat \beta \pm \underbrace{z_{\alpha / 2} \cdot SE}_{\text{Margin of Error}}\)
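A small numerical sketch (the standard error of 0.015 is an assumed value for illustration):
beta_hat <- 0.070                  # point estimate for beauty
se       <- 0.015                  # assumed standard error
z        <- qnorm(1 - 0.05 / 2)    # ~1.96 for a 95% CI
beta_hat + c(-1, 1) * z * se       # margin of error is z * se; gives roughly [0.041, 0.099]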
The procedure of how 95% confidence intervals are constructed ensures that:
when we draw repeated samples, 95% of CIs calculated with this formula on the respective samples cover the true parameter
one single calculated CI tells us nothing about:
parameters in other samples
individual observations in our sample and/or population
Confidence Intervals Illustration
Confidence Intervals as Ring Toss
Each sample gives a different CI or toss of the ring
In some samples the ring will contain the target (the CI will contain the truth); in others it won’t
We don’t know if the CI for our sample contains the truth!
Confidence level: percent of the time our CI will contain the population parameter
The share of ring tosses that will hit the target.
We get to choose, but typical values are 90%, 95%, and 99%
The confidence level of a CI determines how often the CI will be wrong (e.g., a 95% CI will fail to cover the true value 5% of the time in the long run)
The confidence level does not mean that there is 95% probability of each CI including true value. Confidence level is a statement about the procedure in general, not each individual interval.
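A sketch of this long-run interpretation, again with the assumed population, eval, and beauty from above: refit the model on many fresh samples and count how often the 95% CI covers the full-population slope.
beta_true <- coef(lm(eval ~ beauty, data = population))["beauty"]

covered <- map_lgl(1:2000, function(i) {
  smpl <- slice_sample(population, n = 1000)
  ci <- confint(lm(eval ~ beauty, data = smpl))["beauty", ]
  ci[1] <= beta_true && beta_true <= ci[2]
})

mean(covered)   # close to 0.95 in the long run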
95% confidence interval for the effect of beauty ranges from 0.04 to 0.1.
We are 95% confident that the effect of beauty on course evaluations ranges from, on average, 0.04 to 0.1.
We are 95% confident that for each additional point in beauty score, we would expect course evaluations to increase by 0.04 to 0.1 points, on average. With 95% probability the expected increase in course evaluations ranges between 0.04 and 0.1 points for each additional unit of beauty score.
Communicating Uncertainty
Accuracy vs. Precision Trade-off
By design, confidence intervals of different levels vary in their width:
This relates to the trade-off between accuracy and precision of our interval estimates: the higher the level we choose, the more certain we are to cover the true value
But we also lose precision: wider intervals can become rather uninformative
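For instance, with one assumed sample and the fitted bivariate model (names as above):
smpl <- slice_sample(population, n = 1000)   # one sample, as above
fit  <- lm(eval ~ beauty, data = smpl)

confint(fit, "beauty", level = 0.90)   # narrowest
confint(fit, "beauty", level = 0.95)
confint(fit, "beauty", level = 0.99)   # widest: most likely to cover the truth, least informative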
Hypothesis Testing
Decision Based on Confidence Intervals
Do the data provide sufficient evidence that β1 (the true slope for the population) is different from 0?
Suppose we want to know if there is a linear effect of beauty on evaluations (\(H_A: \beta_1 \neq 0\)) or if there is no linear effect (\(H_0: \beta_1 = 0\))
Workflow:
Calculate the standard error and confidence interval (e.g., 95% CI)
Check if the confidence interval covers zero or not.
If CI does not cover zero, we conclude that with 95% confidence, the slope coefficient is different from zero
If CI covers zero, we cannot conclude that the slope coefficient is different from zero (the data do not provide sufficient evidence of a non-zero slope)
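A sketch of this check, reusing the assumed fit from above:
ci <- confint(fit, "beauty", level = 0.95)
ci
# the slope is discernible from zero at the 0.05 level if the whole interval lies on one side of zero
ci[1] > 0 | ci[2] < 0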
Statistical hypothesis testing is a thought experiment: What would the world look like if we knew the truth?
Do the data provide sufficient evidence that β1 (the true slope for the population) is different from 0?
Workflow:
Start with a null hypothesis, \(H_0\) that represents the “null-world”
Set an alternative hypothesis, \(H_A\) that represents the research question, i.e. what we’re testing for
Conduct a hypothesis test under the assumption that the null hypothesis is true and calculate a p-value (probability of observed or more extreme outcome given that the null hypothesis is true)
if the test results suggest that the data do not provide convincing evidence for the alternative hypothesis, stick with the null hypothesis
if they do, then reject the null hypothesis in favor of the alternative
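In practice, the standard regression output already reports a p-value for exactly this null; a sketch with the assumed names from above:
tidy(lm(eval ~ beauty, data = smpl))
# the p.value column for beauty: probability of an estimate at least this extreme if H0 (beta_1 = 0) were true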
Course Evals of English Natives vs. Non-natives
Are courses taught by English natives evaluated higher than courses of non-natives?
What would the effect \(\hat\beta_1\) look like if there were no difference?
Shuffle the eval and nonenglish columns and calculate the difference in evaluations between natives and non-natives
Repeat re-shuffling and estimation of the differences ↑ 1000 times
Check \(\hat\beta_1\) in the null world: what is the probability of observing data as or more extreme as our data under the null (i.e. the p-value)?
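A sketch of this permutation procedure, assuming the sample smpl also contains a nonenglish column (and dplyr/purrr loaded as above):
obs <- coef(lm(eval ~ nonenglish, data = smpl))[2]   # observed native vs. non-native difference

null_diffs <- map_dbl(1:1000, function(i) {
  shuffled <- mutate(smpl, nonenglish = sample(nonenglish))   # shuffling breaks any real association
  coef(lm(eval ~ nonenglish, data = shuffled))[2]
})

# two-sided p-value: share of null-world estimates at least as extreme as the observed one
mean(abs(null_diffs) >= abs(obs))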
Null Worlds with Permutation: Two-sided p-value
Null Worlds with Permutation: Slope Example
Repeated permutations allow for quantifying the variability in the slope under the condition that there is no linear relationship (i.e., that the null hypothesis is true)
Communicating p-values
By default, p-values are about comparing to zero, i.e. no difference between groups with and without \(X\)
“the effect of \(X\) is not statistically significant/discernible (p = 0.3)”
“the effect of \(X\) is not distinguishable from zero at 0.05 level”
“the effect of \(X\) is not significantly distinguishable from zero (p = 0.3)”
“X is a significant/discernible predictor of Y (p ≈ 0.002)”
Every coefficient is going to be significant, some at 0.05 level, some at 0.85 level \(\Rightarrow\) saying something is significant without specifying the level is not enough
In general, report the corresponding p-value in parentheses to just one significant digit: 0.002 rather than 0.0017
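In R, for example:
signif(0.0017, 1)   # 0.002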
Main Takeaways
Sampling variability implies uncertainty related to the estimation process: more data leads to more precise estimates (i.e. smaller SEs, narrower CIs)
Confidence intervals quantify the uncertainty associated with the average outcome, not each individual prediction. Confidence level indicates how often, in the long run, the CI would be wrong (i.e. not contain the true value). We cannot know if the CI we obtained is a “good” or a “bad” one, though.
With hypothesis testing, we are comparing the null world to our estimates. \(p\)-values indicate the probability of observing data at least as extreme as ours in the null world
Appendix
Probabilistic Interpretation of CIs
\[CI_{95}: [\bar x - 1.96 SE, \bar x + 1.96 SE]\]
Randomness comes from the stage of drawing a sample (we only have one, but hypothetically, we draw them repeatedly)
After we draw the random sample, calculating the CI is a matter of procedure:
there is no more randomness, it’s just applying the formula
hence, we can think of this as a realized experiment
if CI bounds are just numbers, the true fixed value is either inside (1) or not (0)
we say: a single calculated CI contains the true parameter or not (and in real life, we don’t know if it does, so we hope it is one of the ones that cover the true value)
Before the random sample is drawn, we can apply probabilistic interpretation to CI:
for a 95% confidence interval (before the sample for calculating that CI is drawn!), there is a 95% chance that a CI will contain \(\mu\)
the random interval \([\bar x - 1.96 SE, \bar x + 1.96 SE]\) contains \(\mu\) with probability 0.95. It is a random interval, since the endpoints change with different samples (i.e., we have not drawn the sample yet)
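A small sketch of this pre-sampling view (the values of \(\mu\), \(\sigma\), and \(n\) are arbitrary assumptions): simulate drawing the sample many times and check how often the random interval covers \(\mu\).
mu <- 5; sigma <- 2; n <- 50

covered <- replicate(10000, {
  x  <- rnorm(n, mean = mu, sd = sigma)
  se <- sd(x) / sqrt(n)
  mean(x) - 1.96 * se <= mu && mu <= mean(x) + 1.96 * se
})

mean(covered)   # close to 0.95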