R for Data Analysis

Session 2

Viktoriia Semenova

University of Mannheim

Fall 2023

- Open the `r4da-labs.Rproj` file to open the *labs* project.
- Go into the *Git* pane.
- If there are any files you changed, you will see them in the window.
- Select all these files.
- Click on *Commit*, write a commit message, and save it.
- If you click on *Push*, it will give you a `git` error message.
- Click on *Pull* to get the new files uploaded.

The `tidyverse` package is a shortcut for installing all of the following:

```
install.packages("ggplot2")
install.packages("dplyr")
install.packages("tidyr")
install.packages("readr")
install.packages("purrr")
install.packages("tibble")
install.packages("stringr")
install.packages("forcats")
install.packages("lubridate")
install.packages("hms")
install.packages("DBI")
install.packages("haven")
install.packages("httr")
install.packages("jsonlite")
install.packages("readxl")
install.packages("rvest")
install.packages("xml2")
install.packages("modelr")
install.packages("broom")
```
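The shortcut itself is not shown above. A minimal sketch, assuming an internet connection and using the CRAN cloud mirror as an assumed repository, could look like this:

```r
# Install the meta-package once per machine (skipped if it is already there).
# The repository URL is an assumption; any CRAN mirror works.
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse", repos = "https://cloud.r-project.org")
}

# Attach the core tidyverse packages (ggplot2, dplyr, tidyr, ...) for this session.
library(tidyverse)
```

Installation is needed only once, while `library(tidyverse)` has to be run in every new R session.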

**Cross-section:** a snapshot of a sample of units (e.g., people, countries, governments) at one point in time

**Time series:** observations on variables over time

**Pooled time series cross-section:** comparable time series data observed on a variety of units (e.g., people, countries, governments)

- Usually few cases, but long time series

**Panel data:** large number of the same cross-sectional units (e.g., survey respondents) observed repeatedly

- Usually many cases, but shorter time series

**Numerical (quantitative):** take on values for which it is sensible to add, subtract, take averages, etc.

- *Continuous:* take on any of an infinite number of values within a given range (e.g., vote share), stored as `numeric`
- *Discrete:* take on one of a specific set of numeric values (e.g., number of human fatalities in conflict), stored as `integer`

**Categorical (qualitative):** take on a limited number of distinct categories; the categories can be identified with numbers, but arithmetic operations on them are not sensible. Stored as `character` or `factor`

- *Ordinal:* levels have an inherent ordering (e.g., Likert scales)
- *Nominal:* levels have no inherent ordering (e.g., party choice: CDU, SPD, Greens, etc.)

Data types are conceptual
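As a rough illustration of how these conceptual types can be stored in R, here is a small sketch in base R. All values and object names are made up for the example.

```r
# Numerical, continuous: stored as numeric (double)
vote_share <- c(0.412, 0.287, 0.153)

# Numerical, discrete: stored as integer (note the L suffix)
fatalities <- c(12L, 0L, 340L)

# Categorical, nominal: character or unordered factor
party <- c("CDU", "SPD", "Greens")
party_f <- factor(party)

# Categorical, ordinal: ordered factor with explicitly ordered levels
agreement <- factor(
  c("agree", "neutral", "disagree"),
  levels = c("disagree", "neutral", "agree"),
  ordered = TRUE
)

class(vote_share)  # "numeric"
class(fatalities)  # "integer"
class(party_f)     # "factor"
class(agreement)   # "ordered" "factor"

# Ordered factors support comparisons, unordered ones do not
agreement[1] > agreement[2]  # TRUE, because "agree" ranks above "neutral"
```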

Univariate data analysis - distribution of single variable

Bivariate data analysis - relationship between two variables

Multivariate data analysis - relationship between many variables at once, usually focusing on the relationship between two while conditioning on the others

- shape:
  - skewness: right-skewed, left-skewed, symmetric (skew is to the side of the longer tail)
  - modality: unimodal, bimodal, multimodal, uniform
- center: mean (`mean`), median (`median`), mode (not always useful)
- spread: range (`range`), standard deviation (`sd`), inter-quartile range (IQR)
- unusual observations

**Mean:** arithmetic average, the “typical” value, the best guess about the value drawn from the distribution

\[\bar{x} = \frac{x_1 + x_2 + x_3 + ... + x_N}{N} = \frac{\sum_{i=1}^{N}x_i}{N}\]

**Median:** value of \(x\) that falls in the middle position when observations are ordered ascending

\[\widetilde{x} =\begin{cases} x_\frac{N+1}{2} & \text{if }N\text{ is odd}\\ \frac {1}{2}\left(x_{\frac{N}{2}} + x_{\frac{N}{2} + 1}\right) & \text{if }N \text{ is even} \end{cases}\]
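To connect the two formulas to R, the sketch below computes the mean and the median by hand for a small made-up vector and compares them with the built-in `mean()` and `median()`.

```r
x <- c(2, 4, 4, 7, 13)  # made-up example data
n <- length(x)

# Mean: sum of all values divided by N
mean_by_hand <- sum(x) / n

# Median: middle value of the sorted data (average of the two middle values if N is even)
sorted <- sort(x)
median_by_hand <- if (n %% 2 == 1) {
  sorted[(n + 1) / 2]
} else {
  (sorted[n / 2] + sorted[n / 2 + 1]) / 2
}

mean_by_hand    # 6
median_by_hand  # 4

# Both agree with the built-in functions
all.equal(mean_by_hand, mean(x))
all.equal(median_by_hand, median(x))
```

The single large value (13) pulls the mean above the median, which is typical for right-skewed data.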

Animal | Average Weight |
---|---|
Cats | 40.14931 |
Dogs | 40.07032 |

**Variance:** measure of the typical squared departure from the mean of a dataset

\[ s^2 = \frac{\sum_{i=1}^{N}(x_i - \bar{x})^2} {N - 1} \]

- The sum of squares from the population mean \(\mu\) will *always* be at least as large as the sum of squares from the sample mean \(\bar x\)
- \(\bar x\)’s location already minimizes the total squared distance of all the observations to the center, by the definition of the sample mean
- A line at any other location would not minimize this distance for the observations in our sample
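The minimizing property of the sample mean can be checked numerically. The following sketch uses made-up data and a hypothetical helper `ss()` to compare the sum of squares around \(\bar x\) with the sum of squares around a grid of other candidate centers.

```r
x <- c(2, 4, 4, 7, 13)  # made-up example data

# Hypothetical helper: sum of squared deviations from a given center
ss <- function(center) sum((x - center)^2)

x_bar <- mean(x)
ss(x_bar)  # 74

# Try many other possible centers, e.g. guesses for the population mean
candidates <- seq(0, 12, by = 0.5)
ss_values <- sapply(candidates, ss)

# No candidate gives a smaller sum of squares than the sample mean
all(ss_values >= ss(x_bar))
candidates[which.min(ss_values)]  # 6, the sample mean itself
```

Because the sum of squares around \(\bar x\) tends to understate the spread around \(\mu\), dividing by \(N - 1\) instead of \(N\) compensates for this.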

**Standard Deviation** \(s\): measure of the typical departure from the mean of a dataset (intuitive scale)

\[ s = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \bar{x})^2} {N - 1}} = \sqrt{s^2} \]
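As a quick check of the two formulas, this sketch computes the sample variance and standard deviation by hand on made-up data and compares them with the built-in `var()` and `sd()`, both of which divide by \(N - 1\).

```r
x <- c(2, 4, 4, 7, 13)  # made-up example data
n <- length(x)

# Sample variance: sum of squared deviations from the mean, divided by N - 1
variance_by_hand <- sum((x - mean(x))^2) / (n - 1)

# Standard deviation: square root of the variance, in the units of x
sd_by_hand <- sqrt(variance_by_hand)

variance_by_hand  # 18.5

all.equal(variance_by_hand, var(x))
all.equal(sd_by_hand, sd(x))
```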

**Range:** difference between smallest and largest value

**Interquartile Range:** range of the middle 50% of the data, distance between the first quartile (25th percentile) and third quartile (75th percentile)
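One detail worth knowing in R: `range()` returns the smallest and largest value rather than their difference. The sketch below, on made-up data, shows how to get both measures of spread. The quartiles rely on R's default quantile method.

```r
x <- c(2, 4, 4, 7, 13)  # made-up example data

# range() gives the minimum and maximum, not the difference
range(x)        # 2 13
range_width <- diff(range(x))
range_width     # 11

# Interquartile range: third quartile minus first quartile
quartiles <- quantile(x, probs = c(0.25, 0.75))
iqr_by_hand <- unname(quartiles[2] - quartiles[1])
iqr_by_hand     # 3

# Same result from the built-in function
all.equal(iqr_by_hand, IQR(x))
```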

`ggplot2`

“The simple graph has brought more information to the data analyst’s mind than any other device.” --- John Tukey

Data visualization is the creation and study of the visual representation of data

Many tools for visualizing data -- `R` is one of them

Many approaches/systems within `R` for making data visualizations -- **ggplot2** is one of them, and that’s what we’re going to use

- Map *data* to *aesthetics*
- *Aesthetic*: visual property of the graph
  - position
  - shape
  - color
  - transparency

Data | Aesthetic | Graphic/Geometry |
---|---|---|
Longitude | Position (x-axis) | Point |
Latitude | Position (y-axis) | Point |
Army size | Size | Path |
Army direction | Color | Path |
Date | Position (x-axis) | Line + text |
Temperature | Position (y-axis) | Line + text |

**ggplot2** is tidyverse’s data visualization package

`gg` in **ggplot2** stands for Grammar of Graphics

Installation:
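The installation code is not reproduced here. A minimal sketch, assuming the package comes from CRAN (the mirror URL is an assumption), might be:

```r
# Install ggplot2 once (skipped if it is already installed)
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2", repos = "https://cloud.r-project.org")
}

# Load it in every session in which you want to plot
library(ggplot2)
```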

For help with ggplot2, see ggplot2.tidyverse.org

- `color` (discrete)
- `color` (continuous)
- `size`
- `fill`
- `shape`
- `alpha`
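To illustrate how several of these aesthetics can be combined, here is a sketch using the built-in `mtcars` dataset. The dataset and the particular variable choices are assumptions made for the example. Aesthetics placed inside `aes()` are mapped to variables, while those set outside `aes()` are fixed for all points.

```r
library(ggplot2)

p <- ggplot(
  mtcars,
  aes(
    x = wt,               # position (x-axis)
    y = mpg,              # position (y-axis)
    color = factor(cyl),  # discrete color: one color per cylinder count
    size = hp             # size: larger points for more horsepower
  )
) +
  geom_point(alpha = 0.6, shape = 16)  # fixed transparency and shape

p
```

Mapping `color = hp` instead would give a continuous color gradient, because `hp` is numerical rather than categorical.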

Geom | What it draws |
---|---|
`geom_col()` | Bar charts |
`geom_text()` | Text |
`geom_point()` | Points |
`geom_boxplot()` | Boxplots |
`geom_sf()` | Maps |
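As a sketch of how some of these geoms are used, the example below draws a boxplot and a labelled bar chart from the built-in `mtcars` dataset. The dataset and object names are chosen only for illustration.

```r
library(ggplot2)

# geom_boxplot(): distribution of a numerical variable within categories
box <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot()

# geom_col(): bar heights are taken directly from a column,
# so we first compute the average mpg per cylinder count
avg_mpg <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

bars <- ggplot(avg_mpg, aes(x = factor(cyl), y = mpg)) +
  geom_col() +
  geom_text(aes(label = round(mpg, 1)), vjust = -0.5)  # value labels above bars

box
bars
```

Geoms are added as layers with `+`, so several of them can share the same data and aesthetic mappings.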

- *scales* change properties of variable mapping
- *facets* show subplots for different subsets of data
- *coordinates* change the coordinate system
- *labels* add labels to the plot
- *theme* changes the appearance of anything in the plot
- *theme options* make adjustments to existing *themes*
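The sketch below adds one layer of each kind to a basic scatterplot of the built-in `mtcars` data. The dataset, palette, and label text are assumptions chosen for illustration.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  scale_color_brewer(palette = "Dark2") +  # scale: how cyl maps to colors
  facet_wrap(~ am) +                       # facets: one subplot per transmission type
  coord_cartesian(xlim = c(1, 6)) +        # coordinates: limits of the x-axis view
  labs(                                    # labels
    title = "Fuel efficiency by car weight",
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    color = "Cylinders"
  ) +
  theme_minimal() +                        # theme: overall appearance
  theme(legend.position = "bottom")        # theme option: tweak one element

p
```

The call to `theme()` comes after `theme_minimal()`, because a complete theme added later would override earlier adjustments.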

- For `ggplot()` to work, your data needs to be in a *tidy* format
- This doesn’t mean that it’s clean; it refers to the structure of the data
- All the packages in the *tidyverse* work best with tidy data; that’s why it’s called that!

- Each variable has its own column
- Each observation has its own row
- Each value has its own cell

Untidy Data

Tidy data
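To make the difference concrete, here is a sketch that reshapes a small untidy table into tidy form with `tidyr::pivot_longer()`. The countries and turnout numbers are made up for the example.

```r
library(tidyr)

# Untidy: the year variable is spread across the column names
untidy <- data.frame(
  country = c("A", "B"),
  "2019" = c(70, 65),
  "2020" = c(72, 61),
  check.names = FALSE  # keep "2019" and "2020" as column names
)

# Tidy: one column per variable, one row per country-year observation
tidy <- pivot_longer(
  untidy,
  cols = c("2019", "2020"),
  names_to = "year",
  values_to = "turnout"
)

tidy
```

In the tidy version, `year` and `turnout` are proper variables that can be mapped to aesthetics, e.g. `aes(x = year, y = turnout, color = country)`.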

- More on styling on course website

JPEG: Photographs

PNG/GIF: Images with limited colors

PDF: Anything vector based

SVG: Vectors online

**Save your plots as PNG or SVG (Web) or PDF (Print)**
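A minimal sketch of saving a plot with `ggsave()`, which picks the file format from the file extension. The plot, file names, and dimensions are assumptions for illustration; files are written to a temporary directory here.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

out_dir <- tempdir()  # replace with a folder in your project
png_path <- file.path(out_dir, "scatter.png")
pdf_path <- file.path(out_dir, "scatter.pdf")

# PNG for the web: width and height are in inches by default, dpi sets the resolution
ggsave(png_path, plot = p, width = 6, height = 4, dpi = 300)

# PDF for print: vector-based, so no resolution is needed
ggsave(pdf_path, plot = p, width = 6, height = 4)
```

Saving as SVG works the same way with a `.svg` extension, but it requires the additional `svglite` package.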