R for Data Analysis
Session 2
University of Mannheim
Fall 2023
Use the r4da-labs.Rproj file to open the labs project
git
error message
The tidyverse package is a shortcut for installing all of the following:
install.packages("ggplot2")
install.packages("dplyr")
install.packages("tidyr")
install.packages("readr")
install.packages("purrr")
install.packages("tibble")
install.packages("stringr")
install.packages("forcats")
install.packages("lubridate")
install.packages("hms")
install.packages("DBI")
install.packages("haven")
install.packages("httr")
install.packages("jsonlite")
install.packages("readxl")
install.packages("rvest")
install.packages("xml2")
install.packages("modelr")
install.packages("broom")
Cross-section: a snapshot of a sample of units (e.g., people, countries, governments) at one point in time
Time series: observations on variables over time
Pooled time series cross-section: comparable time series data observed on a variety of units (e.g., people, countries, governments)
Panel data: a large number of the same cross-sectional units (e.g., survey respondents) observed repeatedly
numeric, integer, character, or factor
Data types are conceptual
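As a minimal sketch of how these storage types look in R, class() reports how a value is stored. The object names and values below are made up for illustration and are not course data.

```r
# Illustrative values only; none of these come from the course data.
x_num <- 3.14        # numeric (double precision)
x_int <- 42L         # integer (the L suffix forces integer storage)
x_chr <- "Mannheim"  # character
x_fct <- factor(c("low", "high", "low"), levels = c("low", "high"))  # factor

class(x_num)
class(x_int)
class(x_chr)
class(x_fct)

# A factor stores a fixed set of categories alongside the values.
levels(x_fct)
```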
Univariate data analysis - distribution of single variable
Bivariate data analysis - relationship between two variables
Multivariate data analysis - relationship between many variables at once, usually focusing on the relationship between two while conditioning on the others
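The three levels of analysis can be sketched with base R alone. The example assumes the built-in mtcars dataset; the choice of variables (mpg, wt, hp) is purely illustrative and not taken from the course materials.

```r
# mtcars ships with base R; the variables used here are illustrative.

# Univariate: distribution of a single variable
summary(mtcars$mpg)

# Bivariate: relationship between two variables
r <- cor(mtcars$mpg, mtcars$wt)
r

# Multivariate: relationship between mpg and weight,
# conditioning on horsepower
fit <- lm(mpg ~ wt + hp, data = mtcars)
coef(fit)
```

A linear model is only one of several ways to condition on other variables; it is used here because it needs no extra packages.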
mean (mean()), median (median()), mode (not always useful)
range (range()), standard deviation (sd()), inter-quartile range (IQR())
Mean: arithmetic average, the “typical” value, the best guess about the value drawn from the distribution
\[\bar{x} = \frac{x_1 + x_2 + x_3 + ... + x_N}{N} = \frac{\sum_{i=1}^{N}x_i}{N}\]
Median: value of \(x\) that falls in the middle position when observations are ordered ascending
\[\widetilde{x} =\begin{cases} x_\frac{N+1}{2} & \text{if }N\text{ is odd}\\ \frac {1}{2}\left(x_{\frac{N}{2}} + x_{\frac{N}{2} + 1}\right) & \text{if }N \text{ is even} \end{cases}\]
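A short sketch of both formulas in R, using two made-up vectors (one with odd N, one with even N) so that each branch of the median definition is exercised:

```r
x_odd  <- c(2, 4, 4, 5, 10)  # N = 5 (odd)
x_even <- c(1, 3, 5, 7)      # N = 4 (even)

# Mean: sum of the values divided by N
mean(x_odd)                  # (2 + 4 + 4 + 5 + 10) / 5 = 5
sum(x_odd) / length(x_odd)   # same result, written as in the formula

# Median, odd N: the middle ordered value, position (N + 1) / 2 = 3
median(x_odd)                # 4

# Median, even N: average of the ordered values at positions N/2 and N/2 + 1
median(x_even)               # (3 + 5) / 2 = 4
```

Note that the single large value (10) pulls the mean of x_odd above its median.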
Animal | Average Weight |
---|---|
Cats | 40.14931 |
Dogs | 40.07032 |
Variance \(s^2\): measure of the typical squared departure from the mean of a dataset (in squared units)
\[ s^2 = \frac{\sum_{i=1}^{N}(x_i - \bar{x})^2} {N - 1} \]
Standard Deviation \(s\): square root of the variance, a measure of the typical departure from the mean of a dataset on the original scale of the data (intuitive scale)
\[ s = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \bar{x})^2} {N - 1}} = \sqrt{s^2} \]
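The two formulas can be checked by hand against R's built-in var() and sd(), which both use the N - 1 denominator. The vector below is made up for illustration.

```r
x <- c(2, 4, 4, 5, 10)
n <- length(x)

# Variance: sum of squared deviations from the mean, divided by N - 1
# deviations: -3, -1, -1, 0, 5; squares sum to 36; 36 / 4 = 9
s2 <- sum((x - mean(x))^2) / (n - 1)

# Standard deviation: square root of the variance
s <- sqrt(s2)

s2  # 9
s   # 3

# The manual results agree with the built-in functions
all.equal(s2, var(x))
all.equal(s, sd(x))
```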
ggplot2
“The simple graph has brought more information to the data analyst’s mind than any other device.” --- John Tukey
Data visualization is the creation and study of the visual representation of data
There are many tools for visualizing data, and R is one of them
There are many approaches/systems within R for making data visualizations; ggplot2 is one of them, and that’s what we’re going to use
Data | Aesthetic | Graphic/Geometry |
---|---|---|
Longitude | Position (x-axis) | Point |
Latitude | Position (y-axis) | Point |
Army size | Size | Path |
Army direction | Color | Path |
Date | Position (x-axis) | Line + text |
Temperature | Position (y-axis) | Line + text |
ggplot2 is tidyverse’s data visualization package
The gg in ggplot2 stands for Grammar of Graphics
Installation:
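A minimal sketch of installing and loading the package, assuming a CRAN mirror is reachable. The install lines are commented out so that the script can be re-run without reinstalling.

```r
# Run once per machine (uncomment one of the two lines):
# install.packages("tidyverse")  # installs ggplot2 together with the other tidyverse packages
# install.packages("ggplot2")    # or install ggplot2 on its own

# Run in every new R session before plotting:
library(ggplot2)
```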
For help with ggplot2, see ggplot2.tidyverse.org
color (discrete)
color (continuous)
size
fill
shape
alpha
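As a sketch of how these aesthetics are used, the example below maps columns of the built-in mtcars dataset to position, color, and size inside aes(), and sets alpha to a constant outside it. The dataset and the variable choices are assumptions made for illustration.

```r
library(ggplot2)

p <- ggplot(
  mtcars,
  aes(
    x = wt,               # position (x-axis)
    y = mpg,              # position (y-axis)
    color = factor(cyl),  # discrete color: one color per cylinder count
    size = hp             # size scales with horsepower
  )
) +
  geom_point(alpha = 0.7)  # alpha set to a fixed value, not mapped to data

p  # printing the object draws the plot
```

Mapping color to cyl without factor() would give a continuous color scale instead of a discrete one.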
Geom | Used for |
---|---|
geom_col() | Bar charts |
geom_text() | Text |
geom_point() | Points |
geom_boxplot() | Boxplots |
geom_sf() | Maps |
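Switching geoms changes the graphic while the data and aesthetic mappings follow the same pattern. The sketch below assumes the built-in mtcars dataset and draws a boxplot and a bar chart of the same variables; the object names are hypothetical.

```r
library(ggplot2)

# Boxplot: distribution of mpg within each cylinder group
p_box <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot()

# geom_col() draws bar heights taken directly from the data,
# so the group means are computed first
avg_mpg <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

p_col <- ggplot(avg_mpg, aes(x = factor(cyl), y = mpg)) +
  geom_col()

p_box
p_col
```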
For ggplot() to work, your data needs to be in a tidy format
Untidy data
Tidy data
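One common way to get from untidy to tidy data is tidyr's pivot_longer(). The sketch below uses a small made-up data frame (hypothetical countries and values) with one column per year and reshapes it so that each row is one country-year observation.

```r
library(tidyr)

# Hypothetical untidy data: the years are spread across column names
untidy <- data.frame(
  country = c("A", "B"),
  `2020` = c(1.2, 3.4),
  `2021` = c(1.5, 3.1),
  check.names = FALSE  # keep "2020" and "2021" as column names
)

# Tidy version: one row per country and year, one column per variable
tidy <- pivot_longer(
  untidy,
  cols = c("2020", "2021"),
  names_to = "year",
  values_to = "value"
)

tidy
```

In the tidy version, year can be mapped directly to an aesthetic such as the x-axis, which is not possible while the years are column names.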
- More on styling on course website
JPEG: Photographs
PNG/GIF: Images with limited colors
PDF: Anything vector based
SVG: Vectors online
Save your plots as PNG or SVG (Web) or PDF (Print)
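A minimal sketch of saving a plot with ggsave(), which picks the file format from the file extension. The plot, file names, and dimensions are assumptions made for illustration, and the files are written to a temporary directory.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

out_dir <- tempdir()
png_path <- file.path(out_dir, "scatter.png")
pdf_path <- file.path(out_dir, "scatter.pdf")

# PNG for the web: width and height are in inches, dpi sets the resolution
ggsave(png_path, plot = p, width = 6, height = 4, dpi = 300)

# PDF for print: vector based, so no dpi is needed
ggsave(pdf_path, plot = p, width = 6, height = 4)
```

Saving to .svg works the same way but additionally requires the svglite package to be installed.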