Describing Variables

Data Analytics and Visualization with R
Session 2

Viktoriia Semenova

University of Mannheim
Spring 2023

Warm up

Announcements

  • Our Slack workspace: r4da.slack.com
    • #github_lab_updates: notifies you on push to lab repos
    • #github_discussions_updates: notifies you on new posts in Discussions
    • you can create private channels for teamwork

Naming Conventions

  • Avoid spaces and special characters (e.g., umlauts) in folder/file names. Use:
    • snake_case
    • camelCase
    • PascalCase
  • Same applies to creating variables in R:
# good 
un_votes$percent_yes

# bad
un_votes$`percent yes`
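
If a dataset arrives with a column name that contains spaces, one option is to rename it once and use the clean name from then on. A minimal base-R sketch, assuming a made-up two-row data frame (votes_demo is hypothetical, not the course data):

```r
# Hypothetical data frame with a column name that contains a space
votes_demo <- data.frame(
  `percent yes` = c(0.8, 0.5),
  check.names = FALSE # keep the name exactly as typed
)

# Replace spaces with underscores to get a snake_case name
names(votes_demo) <- gsub(" ", "_", names(votes_demo), fixed = TRUE)

# The column can now be accessed without backticks
votes_demo$percent_yes
```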

tidyverse package is a shortcut

install.packages("tidyverse")
install.packages("ggplot2")
install.packages("dplyr")
install.packages("tidyr")
install.packages("readr")
install.packages("purrr")
install.packages("tibble")
install.packages("stringr")
install.packages("forcats")
install.packages("lubridate")
install.packages("hms")
install.packages("DBI")
install.packages("haven")
install.packages("httr")
install.packages("jsonlite")
install.packages("readxl")
install.packages("rvest")
install.packages("xml2")
install.packages("modelr")
install.packages("broom")
library("tidyverse")
library("ggplot2")
library("dplyr")
library("tidyr")
library("readr")
library("purrr")
library("tibble")
library("stringr")
library("forcats")

Installing and loading new packages

# put all packages we use in a vector
p_needed <- c("tidyverse", "scico") 

# check if they are already installed, install if not installed 
lapply(p_needed[!(p_needed %in% rownames(installed.packages()))], install.packages)

# load the packages
lapply(p_needed, library, character.only = TRUE)

Loading Dataset

  • Use relative paths in qmd files
  • Use Tab for auto-complete when writing paths
  • Always put all the code lines in the qmd file
  • If you use the Import Dataset tool in RStudio:
    • Load the dataset
    • Copy the absolute path from the Console
    • Shorten the absolute path to a relative one and paste it into the qmd file
  • Do not leave full datasets printed out in chunks
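
The difference between an absolute and a relative path can be sketched as follows. The folder my_project, the file un_votes_demo.csv, and its contents are all made up for this illustration; base R's read.csv() stands in for whichever import function you actually use.

```r
# Hypothetical project folder, created in a temporary location for the demo
project_dir <- file.path(tempdir(), "my_project")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)

# Write a tiny made-up dataset to data/un_votes_demo.csv inside the project
write.csv(
  data.frame(country = c("A", "B"), percent_yes = c(0.8, 0.5)),
  file = file.path(project_dir, "data", "un_votes_demo.csv"),
  row.names = FALSE
)

# Pretend the project folder is the working directory (as in an R project)
old_wd <- setwd(project_dir)

# Relative path: resolved from the working directory, so it does not
# depend on where the project folder sits on a particular computer
un_votes_demo <- read.csv("data/un_votes_demo.csv")

# Restore the previous working directory
setwd(old_wd)

# Look at the first rows only instead of printing the full dataset
head(un_votes_demo)
```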

Information about Dataset

How many and what columns does it contain? How many observations are there?

glimpse(un_votes)
Rows: 59,284
Columns: 5
$ country     <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan"…
$ year        <dbl> 1946, 1946, 1946, 1947, 1947, 1947, 1948, 1948, 1948, 1948…
$ issue       <chr> "Colonialism", "Economic development", "Human rights", "Co…
$ votes       <dbl> 5, 6, 1, 8, 2, 7, 12, 9, 8, 6, 11, 3, 14, 3, 5, 6, 14, 3, …
$ percent_yes <dbl> 0.80000000, 0.66666667, 1.00000000, 0.50000000, 0.50000000…

How many unique countries are there in the dataset?

un_votes$country %>% unique() %>% length()
[1] 200

GitHub

  • GitHub commit messages are primarily for you, not for me
  • File status:
  • What’s coming later on:
    • use git & GitHub to collaborate (aka deal with merge conflicts)
    • travel in time (go between versions of the project)
    • create and populate R project and repos on your own

Your GitHub Stats 🤓

Using pipes %>% or |>

leave_house(
  get_dressed(
    get_out_of_bed(wake_up(me, time = "8:00"),
                   side = "correct"),
    pants = TRUE,
    shirt = TRUE
  ),
  foot = TRUE,
  bike = FALSE
)

With pipes:

me %>%
  wake_up(time = "8:00") %>%
  get_out_of_bed(side = "correct") %>%
  get_dressed(pants = TRUE, shirt = TRUE) %>%
  leave_house(foot = TRUE, bike = FALSE)
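
The functions above are only an analogy and do not exist in R. A small runnable sketch of the same idea with real functions follows. It uses the native pipe |>, which assumes R 4.1 or later; %>% works the same way here once the tidyverse (or magrittr) is loaded.

```r
x <- c(4, 9, 16)

# Nested calls: read from the inside out
nested <- round(mean(sqrt(x)), digits = 1)

# Piped calls: read from top to bottom, one step per line
piped <- x |>
  sqrt() |>
  mean() |>
  round(digits = 1)

# Both versions compute the same value
identical(nested, piped)
```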

Describing Variables

Types of Data in Political Science

  • Cross-section: a snapshot of a sample of units (e.g., people, countries, governments) at one point in time

  • Time series: observations on variables over time

  • Pooled time series cross-section: comparable time series data observed on a variety of units (e.g., people, countries, governments)

    • Usually few cases, but long time series
  • Panel data: large number of the same cross-sectional units (e.g., survey respondents) observed repeatedly

    • Usually many cases, but shorter time series

Types of Variables

  • Numerical (quantitative): take on values for which it is sensible to add, subtract, take averages, etc.
    • Continuous: take on any of an infinite number of values within a given range (e.g., vote share); in R: numeric
    • Discrete: take on one of a specific set of numeric values (e.g., number of human fatalities in conflict); in R: integer
  • Categorical (qualitative): take on a limited number of distinct categories; categories can be identified with numbers, but arithmetic operations on them are not sensible; in R: character or factor
    • Ordinal: levels have an inherent ordering (e.g., Likert scales)
    • Nominal: levels have no inherent ordering (e.g., party choice: CDU, SPD, Greens, etc.)
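
How these variable types might be stored in R can be sketched with a few made-up vectors. All values below are invented for illustration.

```r
# Continuous: stored as numeric (double)
vote_share <- c(0.412, 0.287, 0.153)

# Discrete: stored as integer (note the L suffix)
fatalities <- c(12L, 0L, 340L)

# Nominal: factor without an ordering
party <- factor(c("CDU", "SPD", "Greens"))

# Ordinal: factor with explicitly ordered levels
agreement <- factor(
  c("agree", "disagree", "neutral"),
  levels = c("disagree", "neutral", "agree"),
  ordered = TRUE
)

class(vote_share)
class(fatalities)
is.ordered(party)
is.ordered(agreement)

# Comparisons are meaningful for ordered factors only
agreement[1] > agreement[2]
```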

Data types are conceptual

Number of variables involved

  • Univariate data analysis - distribution of single variable

  • Bivariate data analysis - relationship between two variables

  • Multivariate data analysis - relationship between many variables at once, usually focusing on the relationship between two while conditioning on others

Describing shapes of numerical distributions

  • shape:
    • skewness: right-skewed, left-skewed, symmetric (skew is to the side of the longer tail)
    • modality: unimodal, bimodal, multimodal, uniform
  • center: mean (mean()), median (median()), mode (not always useful)
  • spread: range (range()), standard deviation (sd()), inter-quartile range (IQR())
  • unusual observations
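
A quick way to get a feel for skewness is to compare the mean and the median: in a right-skewed distribution the long tail typically pulls the mean above the median. A minimal sketch with a made-up vector (one unusually large value creates the long right tail):

```r
# Made-up values: most are small, one is unusually large
x <- c(1, 2, 2, 3, 3, 3, 4, 20)

mean(x)   # pulled toward the long right tail
median(x) # stays in the middle of the bulk of the data

# TRUE here, which is consistent with right skew
mean(x) > median(x)

# Spread of the same values
range(x)
sd(x)
IQR(x)
```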

Central Tendency: Mean

Mean: arithmetic average, the “typical” value, the best guess about the value drawn from the distribution

\[\bar{x} = \frac{x_1 + x_2 + x_3 + ... + x_N}{N} = \frac{\sum_{i=1}^{N}x_i}{N}\]

mean(x = un_votes$percent_yes)
[1] 0.2091304
sum(un_votes$percent_yes) / length(un_votes$percent_yes)
[1] 0.2091304

Central Tendency: Median

Median: value of \(x\) that falls in the middle position when observations are sorted in ascending order

\[\widetilde{x} =\begin{cases} x_\frac{N+1}{2} & \text{if }N\text{ is odd}\\ \frac {1}{2}\left(x_{\frac{N}{2}} + x_{\frac{N}{2} + 1}\right) & \text{if }N \text{ is even} \end{cases}\]

median(x = un_votes$percent_yes)
[1] 0.1052632
summary(un_votes$percent_yes)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.0000  0.1053  0.2091  0.3333  1.0000 
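
The two cases of the median formula can be written out by hand and compared with median(). This is only a sketch: median_by_hand() is a hypothetical helper, the input values are made up, and missing values are not handled.

```r
# Hypothetical helper that follows the formula for odd and even N
median_by_hand <- function(x) {
  x_sorted <- sort(x)
  n <- length(x_sorted)
  if (n %% 2 == 1) {
    # N is odd: take the single middle value
    x_sorted[(n + 1) / 2]
  } else {
    # N is even: average the two middle values
    (x_sorted[n / 2] + x_sorted[n / 2 + 1]) / 2
  }
}

odd_x <- c(7, 1, 4)
even_x <- c(7, 1, 4, 10)

median_by_hand(odd_x)
median_by_hand(even_x)

# Both agree with the built-in function
median_by_hand(odd_x) == median(odd_x)
median_by_hand(even_x) == median(even_x)
```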

Problems with Single Numbers

Animal   Average Weight
Cats     40.14931
Dogs     40.07032
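
Two groups can share almost the same average and still look very different. A small sketch with invented weights (these are not the numbers behind the table above):

```r
# Made-up weights: same mean, very different spread
cats_demo <- c(38, 40, 42)
dogs_demo <- c(10, 40, 70)

# The means are identical
mean(cats_demo)
mean(dogs_demo)

# The standard deviations are not
sd(cats_demo)
sd(dogs_demo)
```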

Sample Dispersion: Variance

Variance: measure of the typical squared departure from the mean of a dataset

\[ s^2 = \frac{\sum_{i=1}^{N}(x_i - \bar{x})^2} {N - 1} \]

var(x = un_votes$percent_yes)
[1] 0.06551721

Sample Dispersion: Standard Deviation

Standard Deviation \(s\): measure of the typical departure from the mean of a dataset (intuitive scale)

\[ s = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \bar{x})^2} {N - 1}} = \sqrt{s^2} \]

sd(x = un_votes$percent_yes)
[1] 0.2559633
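
The formulas for \(s^2\) and \(s\) can be checked step by step against var() and sd(). A minimal sketch with a made-up vector:

```r
x <- c(2, 4, 4, 6)

n <- length(x)
x_bar <- sum(x) / n

# Sample variance: sum of squared deviations divided by N - 1
s2 <- sum((x - x_bar)^2) / (n - 1)

# Standard deviation: square root of the variance
s <- sqrt(s2)

# Both match the built-in functions
all.equal(s2, var(x))
all.equal(s, sd(x))
```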

Quantiles and Range

  • Range: difference between smallest and largest value
  • Interquartile Range: range of the middle 50% of the data, distance between the first quartile (25th percentile) and third quartile (75th percentile)
range(un_votes$percent_yes)
[1] 0 1
range(un_votes$percent_yes) %>% 
  diff() 
[1] 1
summary(un_votes$percent_yes)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.0000  0.1053  0.2091  0.3333  1.0000 
IQR(x = un_votes$percent_yes)
[1] 0.3333333
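
The link between quartiles and the IQR can be sketched with quantile(). The vector is made up, and quantile() is used with its default settings (R offers several quantile definitions via the type argument, which can give slightly different values).

```r
x <- 1:9

# First and third quartile
q <- quantile(x, probs = c(0.25, 0.75))
q

# IQR is the distance between them
iqr_by_hand <- unname(q[2] - q[1])
iqr_by_hand

# Matches the built-in function
iqr_by_hand == IQR(x)

# Range as a single number: largest minus smallest value
diff(range(x))
```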

Grammar of graphics: ggplot2

Data visualization

“The simple graph has brought more information to the data analyst’s mind than any other device.” --- John Tukey

  • Data visualization is the creation and study of the visual representation of data

  • Many tools for visualizing data -- R is one of them

  • Many approaches/systems within R for making data visualizations -- ggplot2 is one of them, and that’s what we’re going to use

Temperature

Casualties

Grammar of Graphics Logic

  • Map data to aesthetics
  • Aesthetic: visual property of the graph
    • position
    • shape
    • color
    • transparency

Mapping Data to Aesthetics

Data             Aesthetic            Graphic/Geometry
Longitude        Position (x-axis)    Point
Latitude         Position (y-axis)    Point
Army size        Size                 Path
Army direction   Color                Path
Date             Position (x-axis)    Line + text
Temperature      Position (y-axis)    Line + text

ggplot2 ∈ tidyverse

  • ggplot2 is tidyverse’s data visualization package

  • gg in ggplot2 stands for Grammar of Graphics

  • Installation:

    install.packages("tidyverse")
    library(ggplot2)
  • For help with ggplot2, see ggplot2.tidyverse.org

Plotting with layers

ggplot(data = [dataset],
       mapping = aes(
         x = [x - variable],
         y = [y - variable]
         )
       ) +
  geom_xxx() +
  other options
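
Filling in the template might look like the sketch below. It assumes ggplot2 is installed and uses R's built-in mtcars dataset as a stand-in, since the course data is not loaded here.

```r
library(ggplot2)

# data = mtcars, x = car weight, y = miles per gallon
p <- ggplot(
  data = mtcars,
  mapping = aes(
    x = wt,
    y = mpg
  )
) +
  geom_point() + # geometry layer: one point per car
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon") # other options

p
```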

Possible aesthetics

  • color (discrete)
  • color (continuous)
  • size
  • fill
  • shape
  • alpha

Example geoms


Geom             What it draws
geom_col()       Bar charts
geom_text()      Text
geom_point()     Points
geom_boxplot()   Boxplots
geom_sf()        Maps

Additional layers

  • scales change properties of variable mapping
  • facets show subplots for different subsets of data
  • coordinates change the coordinate system
  • labels add labels to the plot
  • theme changes the appearance of anything in the plot
  • theme options make adjustments to existing themes
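
A sketch of how these layers stack on top of a basic plot, again assuming ggplot2 is installed and using the built-in mtcars data as a stand-in. The particular scale, facet variable, axis limits, and theme are arbitrary choices for illustration.

```r
library(ggplot2)

p_layers <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 2, alpha = 0.7) +
  scale_color_viridis_d() +              # scale: how cyl maps to colors
  facet_wrap(vars(am)) +                 # facets: one subplot per value of am
  coord_cartesian(ylim = c(10, 35)) +    # coordinates: visible y range
  labs(                                  # labels
    title = "Fuel efficiency by weight",
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    color = "Cylinders"
  ) +
  theme_minimal() +                      # theme: overall appearance
  theme(legend.position = "bottom")      # theme option: adjust the theme

p_layers
```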

Tidy data

  • For ggplot() to work, your data needs to be in a tidy format
  • This doesn’t mean that it’s clean, it refers to the structure of the data
  • All the packages in the tidyverse work best with tidy data; that's why it's called that!

Tidy means:

  • Each variable has its own column
  • Each observation has its own row
  • Each value has its own cell

Same Data, Different Formats

Untidy Data

Tidy data

Tidy is Long Data
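
Reshaping from an untidy (wide) to a tidy (long) format might look like the sketch below. It assumes the tidyr package is installed; the data frame and its values are made up for illustration.

```r
library(tidyr)

# Untidy: the variable "year" is spread across column names
untidy <- data.frame(
  country = c("A", "B"),
  `1946` = c(0.8, 0.5),
  `1947` = c(0.6, 0.4),
  check.names = FALSE # keep the year columns named exactly "1946" and "1947"
)

# Tidy: one row per country-year, one column per variable
tidy <- pivot_longer(
  untidy,
  cols = -country,          # reshape every column except country
  names_to = "year",        # old column names go here
  values_to = "percent_yes" # old cell values go here
)

tidy
```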

Data: US Governors

Does political office cause worse or better longevity prospects? Two perspectives in the literature offer contradicting answers. First, increased income, social status, and political connections obtained through holding office can increase longevity. Second, increased stress and working hours associated with holding office can have detrimental effects on longevity. <…> The results show that politicians winning a close election live 5–10 years longer than candidates who lose.

Data: US Governors

glimpse(governors)
Rows: 1,092
Columns: 14
$ state        <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "A…
$ year         <dbl> 1946, 1946, 1950, 1954, 1954, 1958, 1962, 1966, 1966, 197…
$ first_name   <chr> "James", "Lyman", "Gordon", "Tom", "James", "William", "G…
$ last_name    <chr> "Folsom", "Ward", "Persons", "Abernethy", "Folsom", "Long…
$ party        <chr> "Democrat", "Republican", "Democrat", "Republican", "Demo…
$ sex          <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Male", "…
$ died         <date> 1987-11-21, 1948-12-17, 1965-05-29, 1968-03-07, 1987-11-…
$ status       <chr> "Challenger", "Challenger", "Challenger", "Challenger", "…
$ win_margin   <dbl> 77.334394, -77.334394, 82.206564, -46.748166, 46.748166, …
$ region       <chr> "South", "South", "South", "South", "South", "South", "So…
$ population   <dbl> 2906000, 2906000, 3058000, 3014000, 3014000, 3163000, 332…
$ election_age <dbl> 38.07255, 78.54894, 48.74743, 46.54620, 46.07255, 33.2703…
$ death_age    <dbl> 79.11567, 80.66530, 63.31006, 59.88227, 79.11567, 87.8193…
$ lived_after  <dbl> 41.043121, 2.116359, 14.562628, 13.336071, 33.043121, 54.…

Styling the Code: Why

governors %>% 
  filter(election_age > 50, sex == "Male")

governors %>% filter(election_age > 50, sex == "Male")

governors %>% 
  filter(election_age > 50,
         sex == "Male")

governors %>% filter(election_age>50, sex=="Male")

filter(governors,election_age>50, sex=="Male")

governors %>% 
filter(election_age > 50, 
                            sex=="Male")

filter ( governors,election_age>   50,     sex=="Male" )

Styling the Code: How

install.packages("styler")
library(styler)
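
One way to try styler out is to pass it a messy line of code as text. This sketch assumes the styler package is installed and uses its default style; style_text() only reformats the code and does not run it, so governors does not need to be loaded.

```r
library(styler)

# A badly spaced line of code, stored as text
messy <- 'filter ( governors,election_age>   50,     sex=="Male" )'

# Reformat the text according to the default style guide
styled <- style_text(messy)

styled
```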

  • More on styling on the course website

Saving Your Plots: Bitmaps vs Vector

  • JPEG: Photographs

  • PNG/GIF: Images with limited colors

  • PDF: Anything vector based

  • SVG: Vectors online

Save your plots as PNG or SVG (Web) or PDF (Print)
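
Saving a plot in both formats could look like the sketch below. It assumes ggplot2 is installed, uses the built-in mtcars data as a stand-in, and writes to a temporary folder; the file names and sizes are arbitrary. Saving as .svg works the same way but, as far as I know, additionally requires the svglite package.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Hypothetical output files in a temporary folder
png_file <- file.path(tempdir(), "scatter_demo.png")
pdf_file <- file.path(tempdir(), "scatter_demo.pdf")

# Bitmap for the web: resolution matters, so set dpi
ggsave(filename = png_file, plot = p, width = 6, height = 4, dpi = 300)

# Vector format for print: scales without loss of quality
ggsave(filename = pdf_file, plot = p, width = 6, height = 4)

file.exists(c(png_file, pdf_file))
```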

To-Do List

  • Problem Set 2
  • Readings/videos for week 3