Data Analytics and Visualization with R
Session 2
University of Mannheim
Spring 2023
#github_lab_updates: notifies you on push to lab repos
#github_discussions_updates: notifies you on new posts in Discussions
snake_case
camelCase
PascalCase
R: the tidyverse package is a shortcut for installing all of the following
install.packages("ggplot2")
install.packages("dplyr")
install.packages("tidyr")
install.packages("readr")
install.packages("purrr")
install.packages("tibble")
install.packages("stringr")
install.packages("forcats")
install.packages("lubridate")
install.packages("hms")
install.packages("DBI")
install.packages("haven")
install.packages("httr")
install.packages("jsonlite")
install.packages("readxl")
install.packages("rvest")
install.packages("xml2")
install.packages("modelr")
install.packages("broom")
qmd files
Tab for auto-complete when writing paths
qmd file
qmd file
How many and what columns does it contain? How many observations are there?
Rows: 59,284
Columns: 5
$ country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan"…
$ year <dbl> 1946, 1946, 1946, 1947, 1947, 1947, 1948, 1948, 1948, 1948…
$ issue <chr> "Colonialism", "Economic development", "Human rights", "Co…
$ votes <dbl> 5, 6, 1, 8, 2, 7, 12, 9, 8, 6, 11, 3, 14, 3, 5, 6, 14, 3, …
$ percent_yes <dbl> 0.80000000, 0.66666667, 1.00000000, 0.50000000, 0.50000000…
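The overview above has the layout printed by dplyr's glimpse(). A minimal sketch of how such questions can be answered, using R's built-in mtcars as a stand-in because the lecture dataset is not reproduced here (the object names df, n_obs, n_cols and col_names are illustrative only):

```r
# Built-in example data; a stand-in for the lecture dataset
df <- mtcars

n_obs     <- nrow(df)   # number of observations (rows)
n_cols    <- ncol(df)   # number of columns
col_names <- names(df)  # which columns there are

n_obs
n_cols
col_names

# Compact overview of column types and first values (base R)
str(df)

# With dplyr installed, glimpse() prints the same kind of overview
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr::glimpse(df)
}
```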
git & GitHub to collaborate (aka deal with merge conflicts)
%>% or |>
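A small sketch of what the pipe does, assuming R 4.1 or later for the native |> operator (the vector x is made up for illustration). The piped version reads left to right and gives the same result as the nested call; in this example, %>% from magrittr could be used in place of |>.

```r
x <- c(1, 4, 9)

# Nested calls: read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Native pipe: the left-hand side becomes the first argument of the next call
piped <- x |> sqrt() |> mean() |> round(1)

identical(nested, piped)
```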
Cross-section: a snapshot of a sample of units (e.g., people, countries, governments) at one point in time
Time series: observations on variables over time
Pooled time series cross-section: comparable time series data observed on a variety of units (e.g., people, countries, governments)
Panel data: a large number of the same cross-sectional units (e.g., survey respondents) observed repeatedly
numeric
integer
character or factor
Data types are conceptual
Univariate data analysis - distribution of single variable
Bivariate data analysis - relationship between two variables
Multivariate data analysis - relationship between many variables at once, usually focusing on the relationship between two while conditioning on others
mean (mean()), median (median()), mode (not always useful)
range (range()), standard deviation (sd()), inter-quartile range (IQR())
Mean: arithmetic average, the “typical” value, the best guess about the value drawn from the distribution
\[\bar{x} = \frac{x_1 + x_2 + x_3 + ... + x_N}{N} = \frac{\sum_{i=1}^{N}x_i}{N}\]
Median: value of \(x\) that falls in the middle position when observations are ordered ascending
\[\widetilde{x} =\begin{cases} x_{\frac{N+1}{2}} & \text{if }N\text{ is odd}\\ \frac {1}{2}\left(x_{\frac{N}{2}} + x_{\frac{N}{2} + 1}\right) & \text{if }N \text{ is even} \end{cases}\]
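The two definitions can be checked by hand in base R. This is a sketch on made-up numbers; median_by_hand is a hypothetical helper written for illustration, not a built-in function.

```r
x_odd  <- c(2, 9, 4, 7, 3)        # N = 5
x_even <- c(2, 9, 4, 7, 3, 100)   # N = 6, with one extreme value

# Mean: sum of all values divided by N
mean_by_hand <- sum(x_odd) / length(x_odd)

# Median: middle position of the sorted values (hypothetical helper)
median_by_hand <- function(x) {
  s <- sort(x)
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                  # N odd: the single middle value
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2   # N even: average of the two middle values
  }
}

mean_by_hand             # compare with mean(x_odd)
median_by_hand(x_odd)    # compare with median(x_odd)
median_by_hand(x_even)   # compare with median(x_even)
mean(x_even)             # the extreme value pulls the mean, not the median
```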
 | Average Weight |
---|---|
Cats | 40.14931 |
Dogs | 40.07032 |
Variance: measure of the typical squared departure from the mean of a dataset
\[ s^2 = \frac{\sum_{i=1}^{N}(x_i - \bar{x})^2} {N - 1} \]
Standard Deviation \(s\): measure of the typical departure from the mean of a dataset (intuitive scale)
\[ s = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \bar{x})^2} {N - 1}} = \sqrt{s^2} \]
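A sketch of both formulas in base R on made-up numbers, compared against the built-in var() and sd(), which also divide by N - 1:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)
x_bar <- mean(x)

# Variance: sum of squared departures from the mean, divided by N - 1
s2 <- sum((x - x_bar)^2) / (n - 1)

# Standard deviation: square root of the variance, back on the scale of x
s <- sqrt(s2)

c(by_hand = s2, built_in = var(x))
c(by_hand = s,  built_in = sd(x))
```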
ggplot2
“The simple graph has brought more information to the data analyst’s mind than any other device.” --- John Tukey
Data visualization is the creation and study of the visual representation of data
Many tools for visualizing data -- R is one of them
Many approaches/systems within R for making data visualizations -- ggplot2 is one of them, and that’s what we’re going to use
Data | Aesthetic | Graphic/Geometry |
---|---|---|
Longitude | Position (x-axis) | Point |
Latitude | Position (y-axis) | Point |
Army size | Size | Path |
Army direction | Color | Path |
Date | Position (x-axis) | Line + text |
Temperature | Position (y-axis) | Line + text |
ggplot2 is tidyverse’s data visualization package
The gg in ggplot2 stands for Grammar of Graphics
Installation:
For help with ggplot2, see ggplot2.tidyverse.org
color (discrete)
color (continuous)
size
fill
shape
alpha
geom_col(): Bar charts
geom_text(): Text
geom_point(): Points
geom_boxplot(): Boxplots
geom_sf(): Maps
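A minimal sketch of how data, aesthetics and a geom combine into one plot, assuming ggplot2 is installed and using R's built-in mtcars rather than any lecture data. The variable choices and axis labels are illustrative only.

```r
library(ggplot2)

# Data: mtcars; aesthetics: position (x, y) and discrete color; geometry: points
p <- ggplot(data = mtcars,
            mapping = aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 3, alpha = 0.7) +   # size and alpha set as constants here
  labs(x = "Weight (1000 lbs)",
       y = "Miles per gallon",
       color = "Cylinders")

p
```

Aesthetics inside aes() are mapped from columns of the data; values given directly to the geom, like size and alpha here, apply to every point.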
For ggplot() to work, your data needs to be in a tidy format
Untidy data
Tidy data
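The slide's own untidy/tidy illustration is not recoverable here, so the following is a sketch on hypothetical data, assuming tidyr is installed. In tidy data each variable is a column and each observation is a row; pivot_longer() is one way to get there from a table with one column per year.

```r
library(tidyr)

# Hypothetical untidy data: the variable "year" is spread across column names
untidy <- data.frame(
  country = c("A", "B"),
  `2000`  = c(10, 20),
  `2001`  = c(11, 21),
  check.names = FALSE
)

# Tidy version: one row per country-year, one column per variable
tidy <- pivot_longer(
  untidy,
  cols      = c(`2000`, `2001`),
  names_to  = "year",
  values_to = "value"
)

tidy
```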
Does political office cause worse or better longevity prospects? Two perspectives in the literature offer contradicting answers. First, increased income, social status, and political connections obtained through holding office can increase longevity. Second, increased stress and working hours associated with holding office can have detrimental effects on longevity. <…> The results show that politicians winning a close election live 5–10 years longer than candidates who lose.
Rows: 1,092
Columns: 14
$ state <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "A…
$ year <dbl> 1946, 1946, 1950, 1954, 1954, 1958, 1962, 1966, 1966, 197…
$ first_name <chr> "James", "Lyman", "Gordon", "Tom", "James", "William", "G…
$ last_name <chr> "Folsom", "Ward", "Persons", "Abernethy", "Folsom", "Long…
$ party <chr> "Democrat", "Republican", "Democrat", "Republican", "Demo…
$ sex <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Male", "…
$ died <date> 1987-11-21, 1948-12-17, 1965-05-29, 1968-03-07, 1987-11-…
$ status <chr> "Challenger", "Challenger", "Challenger", "Challenger", "…
$ win_margin <dbl> 77.334394, -77.334394, 82.206564, -46.748166, 46.748166, …
$ region <chr> "South", "South", "South", "South", "South", "South", "So…
$ population <dbl> 2906000, 2906000, 3058000, 3014000, 3014000, 3163000, 332…
$ election_age <dbl> 38.07255, 78.54894, 48.74743, 46.54620, 46.07255, 33.2703…
$ death_age <dbl> 79.11567, 80.66530, 63.31006, 59.88227, 79.11567, 87.8193…
$ lived_after <dbl> 41.043121, 2.116359, 14.562628, 13.336071, 33.043121, 54.…
governors %>%
filter(election_age > 50, sex == "Male")
governors %>% filter(election_age > 50, sex == "Male")
governors %>%
filter(election_age > 50,
sex == "Male")
governors %>% filter(election_age>50, sex=="Male")
filter(governors,election_age>50, sex=="Male")
governors %>%
filter(election_age > 50,
sex=="Male")
filter ( governors,election_age> 50, sex=="Male" )
- More on styling on course website
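All of the variants above return the same rows; only readability differs. A runnable sketch, assuming dplyr is installed and using a hypothetical toy table (governors_toy) in place of the governors data:

```r
library(dplyr)

# Hypothetical toy data standing in for the governors dataset
governors_toy <- tibble(
  last_name    = c("A", "B", "C", "D"),
  sex          = c("Male", "Female", "Male", "Male"),
  election_age = c(38, 55, 61, 52)
)

# Conditions separated by commas are combined with AND
older_men <- governors_toy %>%
  filter(election_age > 50, sex == "Male")

# Equivalent call with an explicit &
same_rows <- governors_toy %>%
  filter(election_age > 50 & sex == "Male")

older_men
identical(older_men, same_rows)
```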
JPEG: Photographs
PNG/GIF: Images with limited colors
PDF: Anything vector based
SVG: Vectors online
Save your plots as PNG or SVG (Web) or PDF (Print)
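One way to do this with ggplot2 is ggsave(), which picks the output format from the file extension. A sketch assuming ggplot2 is installed; the plot, file names and sizes are illustrative, and files go to a temporary directory. Saving as SVG works the same way with a .svg extension but additionally needs the svglite package.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

out_dir  <- tempdir()
pdf_path <- file.path(out_dir, "scatter.pdf")
png_path <- file.path(out_dir, "scatter.png")

# Width and height are in inches by default
ggsave(pdf_path, plot = p, width = 6, height = 4)             # vector, for print
ggsave(png_path, plot = p, width = 6, height = 4, dpi = 300)  # raster, for web

file.exists(c(pdf_path, png_path))
```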