
datapolitan/MiningTheWeb


XSAVI 750 – Mining the Web: How to Scrape, Analyze & Map Open Data

Pratt Institute, Center for Continuing and Professional Studies Spatial Analysis and Visualization Initiative (SAVI)

Instructor: Richard Dunks

Location: ISC Building, Lower Level, Room 003

Continuing Education Units (C.E.U.s): 3.0





Administrivia

Course Overview

This course introduces the tools, techniques, and general approaches used to acquire, clean, analyze, and visualize open data, with particular emphasis on using web-based technologies and open-source tools at each step of the process.

We will be working with the community preservation group Save Harlem Now! to help collect, organize, and visualize data related to historic preservation in Harlem. There is no requirement to participate in this project and each student is free to pursue their own projects in class. The work with Save Harlem Now! is an opportunity to work on a real-world problem related to the collection, analysis, and visualization of data.

Learning Objectives

  • You will learn to formulate and articulate a meaningful research question with public open data, as well as meaningfully critique the work of others
  • You will learn how to acquire data through open data portals, application programmer interfaces (APIs), and scraping data from web sites
  • You will learn how to clean data using open source tools in preparation for analysis and visualization
  • You will learn how to conduct exploratory data analysis using descriptive statistics
  • You will learn to visualize your analytical findings in meaningful and visually-engaging graphics, as well as meaningfully critique the work of others
  • You will learn the basics of cartographic design as it relates to visualizing open data

Course Requirements

All students will need to bring their own laptop for exercises during class. Time will be set aside to help install, configure, and run the programs necessary for all assignments, projects, and exercises. Where possible, all programs will be free and open-source. All assigned work using services hosted online can be run using free accounts. Please update your system to the latest version of your preferred operating system prior to the first day of class to ensure you're able to successfully install and use the tools in class.

You will be required to have free accounts with the following services:

Time will be set aside to help you register and set up these accounts, but please try to come to the first session having already registered for these services.

In addition, please install the following applications prior to class:

Course Readings

The required readings for this course consist of book chapters, newspaper articles, and short blog posts. The intention is to help give you a foundation in the critical skills ahead of class lectures. All required readings are available online or will be made available through the class portal. Recommended readings are suggestions if you wish to study further the topics covered in class. The books listed in the Suggested Readings section below offer even more depth and an extended discussion of the material we cover in class. Readings are due for the class under which they're listed.

Class Format

Class runs from 6:30pm to 9:30pm, broken up into two 85-minute blocks with a single 10-minute break around the halfway point. Class will be a mix of lecture and practical exercise work, emphasizing the application of skills covered in the lecture portion of the class.

I will also be available for questions or further assistance before and after class. You will have ample time in class to work on practical exercises based on the information presented in lectures. When possible, the final half hour of class will be set aside for additional questions or tutorials on tools, skills, or techniques. Please plan on attending the full class time.

Submitting Assignments

All assignments will be submitted by adding your content to the class page and issuing a "pull request" in the class repository. All of this will be explained, set up, and otherwise clarified on the first day of class. Assignments aren't considered submitted until the pull request has been issued. We will have ample time in class to address any technical issues, and a reference guide for the process will be provided.

Assessment

Area                       Total Points
Attendance                 20
Class Participation        20
Visualization Critiques    20
Visualizations             20
Final Project              20
Total                      100

Class Policies

Attendance and Tardiness

I expect you to attend every class, arriving on time and staying for the entire duration of class. Daily attendance counts 2 points toward your final grade. Excused absences won't result in points being lost.

Participation

I expect you to be fully engaged while you’re in class. This means asking questions when necessary, engaging in class discussions, participating in class exercises, and completing all assigned work. Learning will occur in this class only when you actively use the tools, techniques, and skills described in the lectures. I will provide you ample time and resources to accomplish the goals of this course and expect you to take full advantage of what’s offered. Daily participation counts 2 points toward your final grade.

Late Assignments

All assignments are due before the start of class so they can be presented in class. Points will be taken off late assignments.

Office hours

I won’t be holding regular office hours, but I’m happy to set up a time to meet in person, over the phone, or via Skype/Google Hangout if you have any problems. Please use Slack to reach out to me. I will also be available before or after class to provide any assistance you may need.

Resources


Course Outline

Topics are covered in class on the day under which they're listed. Reading Assignments are to be read before class in preparation for the lecture and exercises. Assignments are due before the start of the next class and build on the information presented in class.


Week 1: Acquiring Data

Topics

  • What is open data?
  • Data on the web
  • Introduction to mapping
  • Introduction to open source tools and services for mapping and visualization

Assignment

  1. Complete the visualization started in class with data from an open data portal. Style the map in CartoDB and have it ready to present in class.
  2. Find an interesting or visually compelling visualization online and write 2-3 paragraphs on the visualization, discussing the data source(s), the visual style, and how well the data was represented. Feel free to use the visualization resources listed above. Submit your text to the class page following the example shown.

Resources


Topics

Readings

Assignments

  1. Complete the online CartoDB “Online Mapping for Beginners” course.
  2. Create a second visualization or improve on your first, using new data or exploring a data set from the Save Harlem Now! project. Write 2-3 paragraphs discussing any challenges you encountered working with the data and/or creating your visualization in CartoDB.

Resources


Week 2: More Acquiring Data/Data Cleaning

Topics

  • Web scraping
  • Introduction to APIs
  • Introduction to OpenRefine
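
As a rough illustration of the web scraping topic, the following is a minimal sketch that pulls the rows out of an HTML table using only the Python standard library. The sample page, the `TableScraper` class, and the `scrape_table` helper are made up for this example and are not part of the course materials; a real scraper would first download the page and would need to handle messier markup.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by row (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows
        self._row = None        # row currently being read, if any
        self._in_cell = False   # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._in_cell and self._row:
            self._row[-1] += data.strip()


# Made-up sample page standing in for a downloaded web page.
SAMPLE_HTML = """
<table>
  <tr><th>Address</th><th>Year Built</th></tr>
  <tr><td>1 Example Street</td><td>1901</td></tr>
  <tr><td>2 Example Avenue</td><td>1925</td></tr>
</table>
"""


def scrape_table(html):
    """Return the table rows found in an HTML string as lists of cell text."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = scrape_table(SAMPLE_HTML)
for row in rows:
    print(row)
```

The first row returned holds the column headers, so the result can be written out as a CSV file and cleaned further in a tool such as OpenRefine.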

Readings

Assignments

  1. Find an interesting or visually compelling visualization online and write 2-3 paragraphs on the visualization, discussing the data source(s), the visual style, and how well the data was represented. Feel free to use the visualization resources listed above. Submit your text to the class page following the example shown.
  2. Identify a question or topic you'd like to explore in this class, with the intention of creating a map related to the topic as part of your final project. Write 2-3 paragraphs on why the topic is interesting to you, what data you'd like to use to explore it, and what you hope to contribute with your work.

Resources


Topics

  • Overview of social media data
  • Collecting social media data from APIs
  • Introduction to Python for querying APIs

Readings

  • TBD

Assignments

  1. Using an API, either of an open data portal such as the NYC Open Data Portal or some other open data source, create a visualization of the data in CartoDB. Write a short (2-3 paragraph) description of the data, the API you used to access it, how you styled it, and the resulting visualization. Discuss other data you'd like to use or other techniques of cleaning the data to get your desired result. Submit your API code via the Slack channel in the format "lastname-assignment2.py" if you did your API query in Python or "lastname-assignment2.txt" if you did your query in OpenRefine.
  2. Update your project plan for your final project with additional questions, data sources, ideas for visualizing, or other issues/challenges you've discovered.
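
For the API assignment, a Python query might be sketched roughly as follows. This is only an illustration under stated assumptions: the portal domain `data.example.org` and the dataset ID `abcd-1234` are placeholders, the `$limit` and `$where` parameters assume a portal that follows the Socrata (SODA) query convention, and the helper names are hypothetical. The sketch builds the query URL and converts a JSON response into CSV text suitable for upload to a mapping tool; the actual download is left in a function that is not called here.

```python
import csv
import io
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url, where=None, limit=100):
    """Build a query URL, assuming SODA-style $limit and $where parameters."""
    params = {"$limit": limit}
    if where:
        params["$where"] = where
    return base_url + "?" + urlencode(params)


def fetch_json_text(url):
    """Download the response body as text (needs network access; not called below)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def records_to_csv(json_text, fields):
    """Convert a JSON array of records into CSV text with the given columns."""
    records = json.loads(json_text)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for record in records:
        # Missing fields become empty cells; extra fields are dropped.
        writer.writerow({field: record.get(field, "") for field in fields})
    return buffer.getvalue()


# Placeholder endpoint: substitute the real portal domain and dataset ID.
BASE_URL = "https://data.example.org/resource/abcd-1234.json"
url = build_query_url(BASE_URL, where="borough = 'MANHATTAN'", limit=5)

# Made-up response standing in for what fetch_json_text(url) might return.
SAMPLE_RESPONSE = """[
  {"address": "1 Example Street", "year_built": "1901", "extra": "x"},
  {"address": "2 Example Avenue"}
]"""
csv_text = records_to_csv(SAMPLE_RESPONSE, ["address", "year_built"])
print(url)
print(csv_text)
```

Keeping the URL building and the JSON-to-CSV step in separate functions makes it easier to test the query logic without hitting the API on every run.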

Resources


Week 3: Cleaning/Analyzing Data

Topics

  • Introduction to SQL for cleaning data
  • Cleaning Data with APIs
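
To give a sense of what cleaning data with SQL can look like, here is a minimal sketch run from Python against an in-memory SQLite database. SQLite is used only as a stand-in so the example is self-contained; the table, its columns, and the sample rows are invented for illustration. The `TRIM`, `UPPER`, and `NULLIF` functions used here are also available in PostgreSQL.

```python
import sqlite3

# In-memory database with a small, deliberately messy, made-up table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE buildings (address TEXT, borough TEXT, year_built TEXT)")
conn.executemany(
    "INSERT INTO buildings VALUES (?, ?, ?)",
    [
        ("  1 Example Street ", "manhattan", "1901"),
        ("2 Example Avenue", "Manhattan ", ""),
        ("1 Example Street", "MANHATTAN", "1901"),
    ],
)

# Trim stray whitespace, standardize capitalization, and turn empty
# strings into NULL so missing values are treated consistently.
conn.execute(
    """
    UPDATE buildings
    SET address = TRIM(address),
        borough = UPPER(TRIM(borough)),
        year_built = NULLIF(TRIM(year_built), '')
    """
)

# After cleaning, rows that differed only in formatting collapse together.
cleaned = conn.execute(
    "SELECT DISTINCT address, borough, year_built FROM buildings ORDER BY address"
).fetchall()
for row in cleaned:
    print(row)
```

The three input rows reduce to two distinct records once whitespace and capitalization are normalized, which is the kind of duplicate that is easy to miss before cleaning.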

Readings

  • Obe, Regina, and Leo Hsu. PostGIS in action. Manning Publications Co., 2011, Pg 3-8.

Assignments

  1. Find an interesting or visually compelling visualization online and write 2-3 paragraphs on the visualization, discussing the data source(s), the visual style, and how well the data was represented. Feel free to use the visualization resources listed above. Submit your text to the class page following the example shown.
  2. Complete the SQL and PostGIS in CartoDB course.

Resources


Topics

  • Python for querying Geoclient API
  • SQL for cleaning and analysis

Readings

  • TBD

Assignments

  1. Create a new visualization or improve on your previous visualization with additional data and provide analysis of the data you've found. Write 2-3 paragraphs on the visualization, discussing the data source(s), the visual style, and how well the data was represented.

Resources


Week 4: Visualizing Data

Topics

  • A (re-)introduction to statistics
  • Introduction to visualization design
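
As a small illustration of the descriptive statistics covered in this session, the sketch below summarizes a list of numbers with Python's standard `statistics` module. The `describe` helper and the sample values are made up for this example.

```python
import statistics


def describe(values):
    """Return basic descriptive statistics for a non-empty list of numbers."""
    ordered = sorted(values)
    return {
        "count": len(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        # Sample standard deviation needs at least two values.
        "stdev": statistics.stdev(ordered) if len(ordered) > 1 else 0.0,
    }


# Made-up sample: construction years for five buildings.
sample = [1901, 1925, 1910, 1899, 1925]
summary = describe(sample)
for name, value in summary.items():
    print(name, value)
```

Comparing the mean with the median is a quick first check for skew or outliers before choosing how to visualize a variable.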

Readings

Assignments

  1. Find an interesting or visually compelling visualization online and write 2-3 paragraphs on the visualization, discussing the data source(s), the visual style, and how well the data was represented. Feel free to use the visualization resources listed above. Submit your text to the class page following the example shown.

Resources


Topics

  • Advanced CartoDB (guest lecture)

Readings


Resources


Week 5: Advanced Topics/Final Presentations

Topics

  • Course review
  • Advanced topics, to possibly include:
    • Introduction to Interactive Visualization of Data with D3 and Leaflet
    • Introduction to Spatial Databases
    • Visualizing social media data

Readings

Assignments

  1. Find an interesting or visually compelling visualization online and write 2-3 paragraphs on the visualization, discussing the data source(s), the visual style, and how well the data was represented. Feel free to use the visualization resources listed above. Submit your text to the class page following the example shown.

Resources


Topics

  • Final presentations

Suggested Reading

  • Fry, Ben. Visualizing Data: Exploring and Explaining Data with the Processing Environment. O'Reilly Media, Inc., 2007.
  • Garrard, Chris. Geoprocessing with Python. Manning Publications Co., forthcoming.
  • Janert, Philipp K. Data analysis with open source tools. O'Reilly Media, Inc., 2010.
  • McCallum, Q. Ethan. Bad Data Handbook: Cleaning Up The Data So You Can Get Back To Work. O'Reilly Media, Inc., 2012.
  • McKinney, Wes. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. O'Reilly Media, Inc., 2012.
  • Munzner, Tamara. Visualization Analysis and Design. AK Peters, 2014.
  • Murray, Scott. Interactive data visualization for the Web. O'Reilly Media, Inc., 2013.
  • Tufte, Edward R., and P. R. Graves-Morris. The visual display of quantitative information. Vol. 2. Cheshire, CT: Graphics press, 1983.