Meet the Toolkit

R for Data Analysis
Session 1

Viktoriia Semenova

University of Mannheim
Fall 2023

Welcome!

Agenda


Introductions

Course Details

Meet the Toolkit

Teaching team

Instructor

Teaching assistant

What will you learn in this class

  • use the statistical programming language R for data wrangling and analysis
  • translate your social science theory into a statistical model
  • use directed acyclic graphs (DAGs) to build causal models and use them when building statistical models
  • create meaningful visualizations to communicate insights from statistical analysis
  • describe your analysis in research papers
  • use Quarto (and RMarkdown) to write reproducible reports
  • use git and GitHub for version control and collaboration

What will you learn in this class

Political Campaigning Graph

What will you learn in this class

The coefficient on education is positive and statistically significant at the 0.05 level.

Other things being equal, an additional year of education would increase your annual income by 1,500 USD on average, plus or minus about 500 USD.
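
A statement like the one above is typically read off a fitted regression model. The sketch below uses simulated data, not real survey data: the variable names `education` and `income`, the sample size, and the true slope of 1,500 are all assumptions chosen for illustration. It is a bivariate model, so "other things being equal" would in practice require adding control variables.

```r
# Simulated data for illustration only; these are not real estimates
set.seed(42)
n <- 500
education <- sample(8:20, n, replace = TRUE)               # years of education
income <- 10000 + 1500 * education + rnorm(n, sd = 8000)   # annual income in USD
sim <- data.frame(education, income)

# Fit a linear model with OLS
fit <- lm(income ~ education, data = sim)

# Point estimate for the education coefficient
coef(fit)["education"]

# 95% confidence interval: the "plus or minus" part of the sentence
confint(fit, "education", level = 0.95)
```

The coefficient gives the average change in income per additional year of education, and the interval width gives the uncertainty around it.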

Our Starting Point & Your Expectations

  • Everyone has experience with Stata, but many have no experience with R
  • Issues you find most confusing or unclear about empirical analysis:
    • Choosing the right method of analysis for the research question
    • Finding the right (control) variables
    • Being aware of the results and their limitations
    • Translating concepts into variables for OLS
    • Interpreting your own analysis
  • Topics of interest:
    • Impact and usage of LLMs like ChatGPT

Course Details

Schedule Overview

Monday

Consultation 15:30-17:00 (A5, 6 B301)

Tuesday

Problem sets due at 22:00 (GitHub)

Wednesday

Class sessions 13:45-15:15 (A5, 6 B317)

Problem sets distributed (GitHub) 18:00

Thursday

Friday

Problem sets evaluation due 22:00 (GitHub)

Assessment and Grading

  • You need to show up and participate

  • You need to pass all homework problem sets

    • Weekly problem sets with coding and technical writing tasks, due Tuesday 22:00
    • First 2-3 are individual, then in groups of 2-3 students
    • Solutions are posted; you evaluate your own work by Friday 22:00
    • Evaluation is point-based; passing means more than 50% of the points
    • Deadlines are hard
    • Bonus: the student(s) with the highest homework scores receive(s) a 0.3 improvement in the final course grade

Final Paper: Data Analysis Project

  • You will get a dataset and a research question you could answer using it
  • You will need to apply the skills you learned in this class to answer the research question
  • The write-up should read like the results section of a research paper
  • You will have at least 10 days to complete the task, no late work accepted
  • Preliminary time frame: December 6-December 20, 2023
  • If you want to work on a paper of your own instead, talk to me before November 15, 2023

The final paper grade is 100% of your course grade

Course Policies

Deadlines are hard
Attendance is expected (as long as you’re healthy!)
No recording during sessions or office hours
Check syllabus & website for updates regularly
Check email/Slack regularly
Feedback and suggestions welcome anytime

Generative AI policy

You should treat generative AI, such as ChatGPT, the same as other online resources. There are two guiding principles that govern how you can use AI in this course:

(1) Cognitive dimension: Working with AI should not reduce your ability to think clearly. We will practice using AI to facilitate—rather than hinder—learning.

(2) Ethical dimension: Students using AI should be transparent about their use and make sure it aligns with academic integrity.

  • ✅ AI tools for code: You may make use of the technology for coding examples on assignments; if you do so, you must explicitly cite where you obtained the code. Any recycled code that is discovered and is not explicitly cited will be treated as plagiarism.

  • ❌ AI tools for narrative: Unless instructed otherwise, you may not use generative AI to write narrative on assignments. In general, you may use generative AI as a resource as you complete assignments but not to answer the exercises for you. You are ultimately responsible for the work you turn in; it should reflect your understanding of the course content.

Communication & Help

  • Slack workspace is the primary forum for communication
    • #course-announcements: deadline updates, task clarifications, schedule/topic changes, etc. (please react so I see it reached you :D)
    • #github-updates: automated notifications when new material is uploaded on GitHub
    • #help: place for your questions and issues (ask & answer if you can)
    • #random: place for various data/stats things
    • you can create private channels for your teams or DM me
  • Email me (semenova@uni-mannheim.de) for private matters (e.g., illness, accommodations, etc.)
    • Expect a response within 48 hours, Monday to Friday
  • Office hours on Monday/Wednesday 15:30-17:00 in A5, 6, B301 (please schedule in advance)

Asking for Help

  1. For software/installation questions: it is best to come to office hours
  2. For git/R-related questions:
    • Read the error message. Often the message includes suggestions on how to resolve it.
    • Before posting, make sure the most recent version of your work is on GitHub (I will be able to download that exact version and see the issue)
    • Attach screenshots or the error message to your first message
    • Format code as code (easier to copy and paste, and nicer to read)

Course website: http://r4da.live/

Meet the Toolkit

Class technology

Programming

Visualization Examples

R and RStudio

R logo

  • R is an open-source statistical programming language
  • R is also an environment for statistical computing and graphics
  • It’s easily extensible with packages

RStudio logo

  • RStudio is a convenient interface for R called an IDE (integrated development environment), e.g. “I write R code in the RStudio IDE”
  • RStudio is not a requirement for programming with R, but it’s very commonly used by R programmers and data scientists

R and RStudio

On the left: a car engine. On the right: a car dashboard. The engine is labelled R. The dashboard is labelled RStudio.

Tour: R and RStudio

A short list (for now) of R essentials

  • Functions are (most often) verbs, followed by what they will be applied to in parentheses:
do_this(to_this)
do_that(to_this, to_that, with_those)
  • Packages are installed with the install.packages() function and loaded with the library() function, once per session:
install.packages("package_name") # once per device
library(package_name) # once per session
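
As a concrete version of the patterns above, here is a minimal sketch using only functions that ship with R. The vector `x` is a made-up example, and `dplyr` in the commented line simply stands in for any package name.

```r
# A function applied to one argument
x <- c(2, 4, 9)
sqrt(x)

# A function with several arguments: what to round, and to how many digits
round(mean(x), digits = 1)

# Install once per device (commented out so the script does not reinstall on every run)
# install.packages("dplyr")

# Load once per session; stats ships with R, so this line runs without installing anything
library(stats)
```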

R essentials (continued)

  • Columns (variables) in data frames are accessed with $:
dataframe$var_name
  • Object documentation can be accessed with ?
?mean
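
A small sketch of both ideas, using `mtcars`, a data frame that ships with R (chosen here only as a convenient example):

```r
# Access a single column of a data frame with $
head(mtcars$mpg)    # first few values of the mpg column
mean(mtcars$mpg)    # apply a function to that column

# Open a help page (best run in an interactive session)
# ?mean
# ?mtcars
```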

Packages in R

tidyverse

Hex logos for dplyr, ggplot2, forcats, tibble, readr, stringr, tidyr, and purrr

tidyverse.org

  • The tidyverse is an opinionated collection of R packages designed for data science
  • All packages share an underlying philosophy and a common grammar
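
To give a feel for that common grammar, here is a short sketch with dplyr, one of the tidyverse packages. It assumes dplyr is already installed and R is version 4.1 or newer (for the `|>` pipe); the dataset `mtcars` and the object name `by_gear` are chosen only for illustration.

```r
library(dplyr)  # assumes dplyr is installed

# Read the pipeline top to bottom: take the data, then filter, then group, then summarise
by_gear <- mtcars |>
  filter(cyl == 4) |>                  # keep four-cylinder cars
  group_by(gear) |>                    # one group per number of gears
  summarise(
    mean_mpg = mean(mpg),              # average fuel efficiency per group
    n_cars = n()                       # number of cars per group
  )

by_gear
```

Each verb takes a data frame as its first argument and returns a data frame, which is what lets the steps chain together.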

Version Control

Git and GitHub

Git logo

Git is a version control system, like the “Track Changes” feature in Microsoft Word, but more powerful.

GitHub logo

GitHub is the home for your Git-based projects on the internet, like Dropbox but much better.

Why git?

GitHub


  • GitHub organization for the course

  • All of your work and your membership (enrollment) in the organization is private

  • Each assignment is a private repo on GitHub; I distribute the assignments on GitHub and you submit them there

  • Feedback & solutions are also uploaded to your repos

Make your account today so I can distribute the first problem set to you. Use your ILIAS login as your username so I can find you easily.

GitHub Workflow

Quarto

Quarto

  • Fully reproducible reports: each time you render, the analysis is run from the beginning
  • Code goes in chunks; narrative goes outside of chunks
  • A visual editor for a familiar, Google Docs-like editing experience
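
To make the chunk/narrative split concrete, the sketch below writes a minimal Quarto file from R. The file name `first-report.qmd`, the title, and the contents are hypothetical. Normally you would create such a file directly in RStudio; it is generated from R here only so the example is self-contained.

```r
# Three backticks, built programmatically, mark the start and end of a code chunk
fence <- strrep("`", 3)

qmd_lines <- c(
  "---",                                   # YAML header: document settings
  "title: \"My first report\"",
  "format: html",
  "---",
  "",
  "## Fuel efficiency",                    # a header
  "",
  "Narrative goes outside of chunks, like this sentence.",
  "",
  paste0(fence, "{r}"),                    # start of an R code chunk
  "mean(mtcars$mpg)",
  fence                                    # end of the chunk
)

# Write the document to a temporary folder
qmd_path <- file.path(tempdir(), "first-report.qmd")
writeLines(qmd_lines, qmd_path)

# Rendering reruns every chunk from the beginning.
# Assumes the quarto R package and the Quarto CLI are installed:
# quarto::quarto_render(qmd_path)
```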

Tour: Quarto

RStudio IDE with a Quarto document, source code on the left and output on the right. Annotated to show the YAML, a link, a header, and a code chunk.