Research Workshop on Computational Tools for Digital Data Collection

If you have comments, questions, or suggestions, then create an issue.

Description

The focus of this workshop is on digital data collection using R (in most cases), Python, and UNIX command-line tools. Three lecture-style sessions will introduce graduate students to advanced techniques in web-scraping, pdf-scraping, and social media scraping. Three seminar-style sessions will give graduate students the opportunity to receive feedback on their strategies for collecting data.

The objective of this workshop is practical: graduate students will develop and execute data collection strategies in each of the three thematic modules, with the final deliverable being three complete and clean datasets. As such, we expect graduate students in the workshop to identify the resources (e.g., administrative databases, archival documents, social media accounts) that they wish to scrape.

The emphasis of this course is on data collection, rather than data analysis. However, as the goal of data collection is typically analytical, we will assume a familiarity with conventional approaches to statistical inference in the social sciences.

Logistics

Co-instructors

Jae Yeon Kim

jaeyeonkim@berkeley.edu

Nicholas Kuipers

nkuipers@berkeley.edu

Time and Location

Date: TBD

Location: Zoom

All course materials will be posted on Github at https://github.com/jaeyk/digital_data_collection_workshop, including class notes, code demonstrations, sample data, and assignments.

Accessibility

This class is committed to creating an environment in which everyone can participate, regardless of background, discipline, or disability. If you have a particular concern, please contact one of the instructors as soon as possible so that we can make appropriate arrangements.

Books and Other Resources

There are no official textbooks for this class. Please see the references (which will be updated throughout the semester) for additional reading, and the style guides for efficient programming and project management.

Computer Requirements

The software needed for the course is as follows:

  • Access to the UNIX command line (e.g., a Mac laptop, a Bash wrapper on Windows)
  • Git
  • R and RStudio (latest versions)
  • Anaconda and Python 3 (latest versions)

This requires a computer that can handle all this software. Almost any Mac will do the job. Most Windows machines are fine too if they have enough space and memory.

You must have all the software downloaded and installed PRIOR to the first day of class.

See this guideline for more information on installation.

Curriculum Outline / Schedule

The schedule is subject to change based on the class's rate of progress.

  • To view the course contents interactively, please use Binder.

  • To view the HTML rendered course contents, please click [Notebook].

Techniques in automating data collection workflow [Notebook]

  • September 16, 2020: Automating data collection workflow
    • Instructor: Kim
    • Style: Lecture
    • Description: introduction to the tidyverse; discussion of efficient and reproducible ways to collect and wrangle data
    • R Packages: dplyr, purrr
    • References:
      • Kim, How to Automate Repeated Things in R (GitHub)
      • Kim, Advanced Wrangling Workshop in R (GitHub)
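The idea behind this session, writing one small function and applying it to every input rather than repeating code by hand, can be sketched as follows. The session itself uses R (`purrr::map` plays the role of the list comprehension here); this is only a rough Python analogue, and the `"name, year"` input format and the function names are hypothetical, chosen for illustration.

```python
from typing import Callable, Dict, Iterable, List


def clean_record(raw: str) -> Dict[str, object]:
    """Turn one messy 'name, year' string into a tidy record.

    The 'name, year' layout is an assumed example format, not course data.
    """
    name, year = [part.strip() for part in raw.split(",")]
    return {"name": name.title(), "year": int(year)}


def map_records(func: Callable[[str], Dict[str, object]],
                items: Iterable[str]) -> List[Dict[str, object]]:
    """Apply the same function to every item, similar in spirit to purrr::map."""
    return [func(item) for item in items]


# Hypothetical raw rows, standing in for scraped text.
raw_rows = ["  alice smith , 2019", "BOB JONES,2020"]
records = map_records(clean_record, raw_rows)
```

Keeping the cleaning logic in a single function means a fix is made once and then applies to every input, which is what makes the workflow reproducible.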

Techniques in social media scraping [Notebook 1] [Notebook 2] [Notebook 3] [Online book chapter]

Techniques in pdf-scraping

  • October 14, 2020: No workshop (Indigenous Peoples' Day)

  • October 21, 2020: PDF-parsing

    • Lead instructor: Kuipers
    • Style: Lecture
    • Description: introduction to techniques of pdf-scraping; where to look for documents; how to know what to pre-process by hand; identifying recurring patterns in text to exploit for data wrangling; parallel processing
    • R Packages: tesseract, magick, zoo, parallel, pdftools
    • References:
  • October 28, 2020: PDF-parsing workshop

    • Instructor: Kuipers + Kim
    • Style: Seminar
    • Description: Graduate students provide/receive feedback on PDF-parsing data collection strategies
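Two of the ideas listed for the PDF-parsing lecture, exploiting recurring patterns in text and processing pages in parallel, can be sketched in a few lines. The lecture uses R packages; this is a Python illustration only. The page text, the `District: ... Votes: ...` layout, and the function names are all hypothetical, and the sketch assumes the text has already been extracted from the PDF by some earlier step.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical page text, standing in for the output of a PDF text-extraction step.
PAGES = [
    "District: Alameda  Votes: 1,204\nDistrict: Marin  Votes: 987",
    "District: Napa  Votes: 12,050",
]

# The recurring pattern: every row has the same "District: X  Votes: N" shape.
ROW_PATTERN = re.compile(
    r"District:\s*(?P<district>\w+)\s+Votes:\s*(?P<votes>[\d,]+)"
)


def parse_page(text):
    """Extract one record per row that matches the recurring pattern."""
    return [
        {
            "district": match.group("district"),
            # Drop thousands separators before converting to an integer.
            "votes": int(match.group("votes").replace(",", "")),
        }
        for match in ROW_PATTERN.finditer(text)
    ]


def parse_pages(pages, max_workers=2):
    """Parse pages concurrently; pool.map returns results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_page = list(pool.map(parse_page, pages))
    # Flatten the per-page lists into one list of rows.
    return [row for page in per_page for row in page]


rows = parse_pages(PAGES)
```

A thread pool keeps the sketch simple. For CPU-heavy steps such as OCR, a process pool would likely be the better fit.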

Techniques in web-scraping

  • November 4, 2020: Web-scraping

  • November 11, 2020: Web-scraping workshop

    • Instructor: Kuipers + Kim
    • Style: Seminar
    • Description: Graduate students provide/receive feedback on web-scraping data collection strategies