Describing Variables

R for Data Analysis
Session 2

Viktoriia Semenova

University of Mannheim
Fall 2023

Getting Lab Material

  • Open the r4da-labs.Rproj file to open the labs project.
  • Go to the Git pane.
  • If you have changed any files, they will be listed in the pane.
    • Select all of these files.
    • Click Commit, write a commit message, and save it.
    • If you click Push, you will get a Git error message.
  • Click Pull to get the newly uploaded files.

tidyverse package is a shortcut
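
As a minimal sketch of what the shortcut means, assuming the tidyverse package is already installed: attaching it loads the core packages (ggplot2, dplyr, tidyr, and others) in a single call, instead of one library() call per package.

```r
# One call attaches the core tidyverse packages
library(tidyverse)

# Long form: attach the packages you need one by one
# library(ggplot2)  # plotting
# library(dplyr)    # data manipulation
# library(tidyr)    # reshaping

# Check which of these packages are now attached
c("ggplot2", "dplyr", "tidyr") %in% .packages()
```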


Describing Variables

Types of Data in Political Science

  • Cross-section: a snapshot of a sample of units (e.g., people, countries, governments) at one point in time

  • Time series: observations on variables over time

  • Pooled time series cross-section: comparable time series data observed on a variety of units (e.g., people, countries, governments)

    • Usually few cases, but long time series
  • Panel data: large number of the same cross-sectional units (e.g., survey respondents) observed repeatedly

    • Usually many cases, but shorter time series

Types of Variables

  • Numerical (quantitative): take on values for which it is sensible to add, subtract, take averages, etc.
    • Continuous: take on any of an infinite number of values within a given range (e.g., vote share); R type: numeric
    • Discrete: take on one of a specific set of numeric values (e.g., number of human fatalities in conflict); R type: integer
  • Categorical (qualitative): take on a limited number of distinct categories; the categories can be identified with numbers, but arithmetic operations on them are not sensible; R type: character or factor
    • Ordinal: levels have an inherent ordering (e.g., Likert scales)
    • Nominal: levels have no inherent ordering (e.g., party choice: CDU, SPD, Greens, etc.)
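
The correspondence between variable types and R's storage types can be sketched as follows. All values are made up for illustration, and the object names (vote_share, fatalities, party, agreement) are hypothetical.

```r
# Continuous numerical variable: stored as numeric (double)
vote_share <- c(0.412, 0.287, 0.153)
class(vote_share)    # "numeric"

# Discrete numerical variable: stored as integer (note the L suffix)
fatalities <- c(0L, 12L, 340L)
class(fatalities)    # "integer"

# Nominal categorical variable: factor without ordering
party <- factor(c("CDU", "SPD", "Greens"))
is.ordered(party)    # FALSE

# Ordinal categorical variable: ordered factor with explicit level order
agreement <- factor(
  c("agree", "disagree", "neutral"),
  levels = c("disagree", "neutral", "agree"),
  ordered = TRUE
)
agreement[1] > agreement[2]   # TRUE: "agree" ranks above "disagree"

# Arithmetic on a factor is not meaningful: R warns and returns NA
# mean(party)
```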

Data types are conceptual

Number of variables involved

  • Univariate data analysis - distribution of single variable

  • Bivariate data analysis - relationship between two variables

  • Multivariate data analysis - relationship between many variables at once, usually focusing on the relationship between two while conditioning on the others

Describing shapes of numerical distributions

  • shape:
    • skewness: right-skewed, left-skewed, symmetric (skew is to the side of the longer tail)
    • modality: unimodal, bimodal, multimodal, uniform
  • center: mean (mean()), median (median()), mode (not always useful)
  • spread: range (range()), standard deviation (sd()), inter-quartile range (IQR())
  • unusual observations

Central Tendency: Mean

Mean: arithmetic average, the “typical” value, the best guess about the value drawn from the distribution

\[\bar{x} = \frac{x_1 + x_2 + x_3 + ... + x_N}{N} = \frac{\sum_{i=1}^{N}x_i}{N}\]

mean(x = un_votes$percent_yes)
[1] 0.2091304
sum(un_votes$percent_yes) / length(un_votes$percent_yes)
[1] 0.2091304

Central Tendency: Median

Median: value of \(x\) that falls in the middle position when the observations are sorted in ascending order

\[\widetilde{x} =\begin{cases} x_{\frac{N+1}{2}} & \text{if }N\text{ is odd}\\ \frac{1}{2}\left(x_{\frac{N}{2}} + x_{\frac{N}{2} + 1}\right) & \text{if }N \text{ is even} \end{cases}\]

median(x = un_votes$percent_yes)
[1] 0.1052632
summary(un_votes$percent_yes)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.0000  0.1053  0.2091  0.3333  1.0000 
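
To see how the two measures of center react to skew, and how the median formula works for odd and even N, here is a small sketch on made-up vectors (x and y are hypothetical, not taken from un_votes).

```r
# Toy data (made up): right-skewed, with one large value
x <- c(0, 0, 0.1, 0.2, 1)

mean(x)     # 0.26, pulled toward the long right tail
median(x)   # 0.1, the middle of the 5 sorted values

# Median by hand for odd N: the value at position (N + 1) / 2
sorted <- sort(x)
n <- length(sorted)
sorted[(n + 1) / 2]

# Median by hand for even N: average of positions N / 2 and N / 2 + 1
y <- c(0, 0.1, 0.2, 1)
m <- length(y)
(sort(y)[m / 2] + sort(y)[m / 2 + 1]) / 2   # same as median(y)
```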

Problems with Single Numbers

        Average Weight
Cats    40.14931
Dogs    40.07032
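
Two groups can share almost the same average and still look very different. A minimal sketch with made-up numbers (group_a and group_b are hypothetical and unrelated to the cats and dogs data):

```r
# Toy data (made up): two groups with the same mean but different spread
group_a <- c(38, 39, 40, 41, 42)
group_b <- c(5, 10, 40, 70, 75)

mean(group_a)   # 40
mean(group_b)   # 40

# Measures of spread tell the two groups apart
sd(group_a)
sd(group_b)
range(group_a)  # 38 42
range(group_b)  # 5 75
```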

Sample Dispersion: Variance

Variance: the average squared departure from the mean of a dataset

\[ s^2 = \frac{\sum_{i=1}^{N}(x_i - \bar{x})^2} {N - 1} \]

var(x = un_votes$percent_yes)
[1] 0.06551721
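
The formula can be traced step by step on a small made-up vector; this is a sketch for illustration, and var() remains the function to use in practice.

```r
x <- c(2, 4, 4, 6, 9)   # toy data (made up)

x_bar <- mean(x)                 # 5
deviations <- x - x_bar          # -3 -1 -1  1  4
sum_sq <- sum(deviations^2)      # 9 + 1 + 1 + 1 + 16 = 28
s2 <- sum_sq / (length(x) - 1)   # 28 / 4 = 7

s2
var(x)   # same value as the manual calculation
```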

Variance Visually Explained

Population vs. Sample (\(N\) vs. \(n - 1\))

Population vs. Sample: Inherent Bias

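  • The sum of squared deviations from the population mean \(\mu\) is never smaller than the sum of squared deviations from the sample mean \(\bar x\)
  • By the definition of the sample mean, \(\bar x\) is the center that minimizes the sum of squared distances to the observations in our sample
  • Any other center, including \(\mu\), does not minimize this sum for our sample; dividing the \(\bar x\) sum of squares by \(N\) would therefore tend to underestimate the population variance, which dividing by \(N - 1\) corrects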
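
The argument above can be checked with a small simulation. This is a sketch under an assumed population (standard normal, so the true variance is 1); the sample size of 10 and the 10,000 repetitions are arbitrary choices.

```r
set.seed(2023)

# Assumed population for illustration: normal with mean 0 and variance 1
mu <- 0
x <- rnorm(10, mean = mu, sd = 1)
x_bar <- mean(x)

# Sum of squares around the sample mean vs. around the population mean
ss_xbar <- sum((x - x_bar)^2)
ss_mu   <- sum((x - mu)^2)
ss_xbar <= ss_mu   # TRUE for any sample

# Repeat many times: dividing by N underestimates the true variance of 1,
# while dividing by N - 1 is close to 1 on average
sims <- replicate(10000, {
  s <- rnorm(10, mean = mu, sd = 1)
  c(divide_by_N = mean((s - mean(s))^2), divide_by_N_minus_1 = var(s))
})
rowMeans(sims)
```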

Sample Dispersion: Standard Deviation

Standard Deviation \(s\): measure of the typical departure from the mean of a dataset, expressed in the same units as the data (intuitive scale)

\[ s = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \bar{x})^2} {N - 1}} = \sqrt{s^2} \]

sd(x = un_votes$percent_yes)
[1] 0.2559633

Quantiles and Range

  • Range: difference between smallest and largest value
  • Interquartile range (IQR): range of the middle 50% of the data, i.e., the distance between the first quartile (25th percentile) and the third quartile (75th percentile)
range(un_votes$percent_yes)
[1] 0 1
range(un_votes$percent_yes) %>% 
  diff()
[1] 1
summary(un_votes$percent_yes)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.0000  0.1053  0.2091  0.3333  1.0000 
IQR(x = un_votes$percent_yes)
[1] 0.3333333
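
The link between quartiles and the IQR can be made explicit with quantile(). A minimal sketch on a made-up vector, using R's default quantile method:

```r
x <- c(1, 3, 5, 7, 9, 11, 13, 15, 17)   # toy data (made up)

# Range as a single number: largest minus smallest value
diff(range(x))   # 16

# First and third quartiles (25th and 75th percentiles)
q <- quantile(x, probs = c(0.25, 0.75))
q

# IQR by hand: third quartile minus first quartile
unname(q[2] - q[1])
IQR(x)   # same value
```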

Grammar of graphics: ggplot2

Data visualization

“The simple graph has brought more information to the data analyst’s mind than any other device.” --- John Tukey

  • Data visualization is the creation and study of the visual representation of data

  • Many tools for visualizing data -- R is one of them

  • Many approaches/systems within R for making data visualizations -- ggplot2 is one of them, and that’s what we’re going to use



Grammar of Graphics Logic

  • Map data to aesthetics
  • Aesthetic: visual property of the graph
    • position
    • shape
    • color
    • transparency

Mapping Data to Aesthetics

Data              Aesthetic            Graphic/Geometry
Longitude         Position (x-axis)    Point
Latitude          Position (y-axis)    Point
Army size         Size                 Path
Army direction    Color                Path
Date              Position (x-axis)    Line + text
Temperature       Position (y-axis)    Line + text

ggplot2 ∈ tidyverse

  • ggplot2 is tidyverse’s data visualization package

  • gg in ggplot2 stands for Grammar of Graphics

  • Installation:

  • For help with ggplot2, see
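
A common way to install and load the package is sketched below, assuming an internet connection and access to CRAN. The requireNamespace() check is an addition here so that the package is only installed when it is missing.

```r
# Install once per machine (needs an internet connection)
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Load in every new R session
library(ggplot2)

# Confirm that the package is attached
"ggplot2" %in% .packages()
```

Installing the whole tidyverse with install.packages("tidyverse") also installs ggplot2.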

Plotting with layers

ggplot(data = [dataset],
       mapping = aes(
         x = [x-variable],
         y = [y-variable]
       )) +
  geom_xxx() +
  other options
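
A concrete instance of the template, as a minimal sketch. It uses mtcars, a dataset that ships with R, rather than the course data; the choice of variables is only for illustration.

```r
library(ggplot2)

# mtcars ships with R: wt = weight, mpg = miles per gallon, cyl = cylinders
p <- ggplot(
  data = mtcars,
  mapping = aes(
    x = wt,                 # position on the x-axis
    y = mpg,                # position on the y-axis
    color = factor(cyl)     # color, treated as a discrete variable
  )
) +
  geom_point()              # geometry: draw one point per row

p
```

Wrapping cyl in factor() makes ggplot2 use a discrete color scale; mapping the raw number would produce a continuous color gradient instead.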

Possible aesthetics

color discrete

color continuous





Example geoms

geom_col()        Bar charts
geom_text()       Text
geom_point()      Points
geom_boxplot()    Boxplots
geom_sf()         Maps

Additional layers

  • scales change properties of variable mapping
  • facets show subplots for different subsets of data
  • coordinates change the coordinate system
  • labels add labels to the plot
  • theme changes the appearance of anything in the plot
  • theme options make adjustments to existing themes
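
Each of these layer types can be added to a plot with +. The following sketch again uses the built-in mtcars data; the particular scale, facet variable, axis limits, and theme are arbitrary choices for illustration.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg, color = hp)) +
  geom_point() +
  scale_color_viridis_c() +                  # scale: how hp maps to color
  facet_wrap(vars(cyl)) +                    # facets: one panel per cyl value
  coord_cartesian(ylim = c(10, 35)) +        # coordinates: zoom the y-axis
  labs(                                      # labels
    title = "Fuel efficiency by weight",
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    color = "Horsepower"
  ) +
  theme_minimal() +                          # theme: overall appearance
  theme(legend.position = "bottom")          # theme option: tweak the theme

p
```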

Tidy data

  • For ggplot() to work, your data needs to be in a tidy format
  • This doesn’t mean that it’s clean, it refers to the structure of the data
  • All the packages in the tidyverse work best with tidy data; that's why it's called that!

Tidy means:

  • Each variable has its own column
  • Each observation has its own row
  • Each value has its own cell

Same Data, Different Formats

Untidy Data

Tidy data

Tidy is Long Data
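
Reshaping from a wide, untidy layout to a long, tidy one can be sketched with tidyr's pivot_longer(). The data frame below is made up, and the column names (country, year, percent_yes) are hypothetical.

```r
library(tidyr)

# Toy data (made up): one column per year, so the variable "year" is
# spread across column names instead of having its own column
untidy <- data.frame(
  country = c("A", "B"),
  `2021` = c(0.40, 0.10),
  `2022` = c(0.50, 0.20),
  check.names = FALSE
)

# Reshape to long format: one row per country-year observation
tidy <- pivot_longer(
  untidy,
  cols = c(`2021`, `2022`),
  names_to = "year",
  values_to = "percent_yes"
)

tidy
```

The tidy version has four rows (two countries times two years) and one column per variable, which is the layout ggplot() expects.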

Styling the Code: Why

un_votes %>% 
  filter(un_votes > 5, issue == "Human rights")

un_votes %>% filter(un_votes > 5, issue == "Human rights")

un_votes %>% 
  filter(un_votes > 5, 
         issue == "Human rights")

un_votes %>% filter(un_votes>5,issue=="Human rights")

un_votes %>% 
  filter(un_votes > 5,       issue == "Human rights")

Styling the Code: How



  • More on styling on the course website

Saving Your Plots: Bitmaps vs Vector

  • JPEG: Photographs

  • PNG/GIF: Images with limited colors

  • PDF: Anything vector based

  • SVG: Vectors online

Save your plots as PNG or SVG (Web) or PDF (Print)
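
One way to write a plot to both kinds of format is ggsave(), sketched below. The file names and the temporary output folder are hypothetical, and the width, height, and dpi values are arbitrary.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Hypothetical output location; a temporary folder is used here
out_dir <- tempdir()

# Vector format for print: size in inches, no resolution needed
ggsave(file.path(out_dir, "scatter.pdf"), plot = p, width = 6, height = 4)

# Bitmap format for the web: dpi controls the resolution
ggsave(file.path(out_dir, "scatter.png"), plot = p,
       width = 6, height = 4, dpi = 300)
```

ggsave() picks the graphics device from the file extension. Saving as .svg works the same way but needs the svglite package to be installed.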