The project aims to import four hospital datasets into a MySQL database and one hotel review dataset into Firebase. A web interface will let users select a dataset, search it according to their preferences, and explore related records through primary and foreign key relations.
The following datasets are provided by the Centers for Medicare & Medicaid Services (CMS) and are accessible via the Socrata Open Data API (SODA).
- Complications and Deaths - Hospital (23MB) - includes scoring and national comparison of complications and deaths of U.S. hospitals.
- Timely and Effective Care - Hospital (21MB) - includes scoring of care in various departments of U.S. hospitals.
- Healthcare Associated Infections - Hospital (43MB) - includes scoring, developed by the Centers for Disease Control and Prevention, of healthcare-associated infections in U.S. hospitals.
- Hospital General Information (1.9MB) - includes general information on all registered hospitals.
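As a rough illustration of how one of these datasets might be paged through SODA, the sketch below only builds a request URL and performs no network call. The host `data.example.gov`, the dataset identifier `xxxx-xxxx`, and the column name `state` are placeholders rather than the actual CMS values; `$limit`, `$offset`, and `$where` are standard SODA query parameters.

```python
from urllib.parse import urlencode

# Assumptions: placeholder host and dataset identifier, not the real CMS values.
SODA_DOMAIN = "data.example.gov"
DATASET_ID = "xxxx-xxxx"


def build_soda_url(domain, dataset_id, where=None, limit=1000, offset=0):
    """Build a SODA resource URL that returns one page of a dataset as JSON."""
    params = {"$limit": limit, "$offset": offset}
    if where:
        # SoQL filter expression; the column used below is hypothetical.
        params["$where"] = where
    # Keep "$" unescaped so the SODA parameter names stay readable.
    query = urlencode(params, safe="$")
    return f"https://{domain}/resource/{dataset_id}.json?{query}"


url = build_soda_url(SODA_DOMAIN, DATASET_ID, where="state = 'CA'", limit=500)
print(url)
```

Increasing `offset` by `limit` on each request would walk through the full dataset one page at a time.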
The following datasets are provided by Kaggle and Yelp respectively.
- Hotel Reviews (120MB) - includes a list of 1,000 U.S. hotels and their reviews (nearly 10,000).
Data cleaning, transformation, integration, aggregation, and exploration are expected to benefit from using Spark.
- Data cleaning - perform entity resolution of hospitals across datasets.
- Data transformation - load normalized data into MySQL and Firebase.
- Data integration - integrate data from different data sources.
- Data aggregation and exploration - pre-calculating aggregations will allow faster querying for users.
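The entity-resolution step in the list above could look roughly like the following. This is a minimal sketch in plain Python rather than Spark, and it is not the project's actual matching logic: the abbreviation map, the field names `name` and `zip`, and the sample records are all made up for illustration. The idea is to derive a normalized key per hospital and group records from different datasets that share it.

```python
import re
from collections import defaultdict

# Hypothetical abbreviation map; a real one would be derived from the data.
ABBREVIATIONS = {"st": "saint", "hosp": "hospital", "ctr": "center", "med": "medical"}


def normalize_name(name):
    """Lowercase, drop apostrophes and punctuation, expand abbreviations."""
    text = re.sub(r"['’]", "", name.lower())
    tokens = re.sub(r"[^a-z0-9 ]", " ", text).split()
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


def resolution_key(record):
    """Matching key: normalized hospital name plus the 5-digit ZIP code."""
    return (normalize_name(record["name"]), record["zip"][:5])


def resolve(records):
    """Group records that share a resolution key."""
    groups = defaultdict(list)
    for record in records:
        groups[resolution_key(record)].append(record)
    return dict(groups)


# Made-up sample records standing in for rows from different datasets.
records = [
    {"source": "general_info", "name": "St. Mary's Med Ctr", "zip": "90001"},
    {"source": "infections", "name": "SAINT MARYS MEDICAL CENTER", "zip": "90001-1234"},
    {"source": "complications", "name": "County General Hosp", "zip": "90210"},
]

groups = resolve(records)
for key, members in groups.items():
    print(key, [m["source"] for m in members])
```

In Spark, the same key function could be applied to each row before grouping, and counts or averages per resolved hospital could then be pre-computed on the grouped data.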
- MySQL - Relational database
  - Processed and cleaned hospital data will be stored in MySQL.
- Firebase - Non-relational database
  - Processed and cleaned hotel and Yelp data will be stored in Firebase.
  - Multiple datasets with different indexing will be available for faster retrieval.
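One way to read the "different indexing" idea for the Firebase side is to store each hotel once and keep small lookup nodes keyed by the attributes users search on. The sketch below builds such a JSON tree in plain Python without calling Firebase; the node names `hotels`, `by_city`, and `by_rating`, the record fields, and the sample hotels are hypothetical.

```python
import json
from collections import defaultdict


def build_indexes(hotels):
    """Return a JSON-ready tree: hotels stored once, plus two lookup indexes."""
    tree = {"hotels": {}, "by_city": defaultdict(dict), "by_rating": defaultdict(dict)}
    for hotel in hotels:
        hotel_id = hotel["id"]
        tree["hotels"][hotel_id] = {
            "name": hotel["name"],
            "city": hotel["city"],
            "rating": hotel["rating"],
        }
        # Index nodes hold only hotel ids, so each lookup stays small.
        tree["by_city"][hotel["city"]][hotel_id] = True
        tree["by_rating"][str(hotel["rating"])][hotel_id] = True
    # Round-trip through JSON to get plain dicts ready for upload.
    return json.loads(json.dumps(tree))


# Made-up sample data for illustration only.
sample_hotels = [
    {"id": "h1", "name": "Hotel A", "city": "Los Angeles", "rating": 4},
    {"id": "h2", "name": "Hotel B", "city": "San Diego", "rating": 4},
    {"id": "h3", "name": "Hotel C", "city": "Los Angeles", "rating": 3},
]

tree = build_indexes(sample_hotels)
print(json.dumps(tree["by_city"], indent=2))
```

A search by city would then read `by_city/<city>` to get the matching ids and fetch only those hotel records, at the cost of keeping the index nodes in sync whenever the data is reloaded.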
Asumi Suguro is a first-year graduate student in the Applied Data Science program of Viterbi. After receiving an undergraduate degree in chemistry and mathematics, Asumi worked at a biotechnology company for three years. So far, Asumi has taken DSCI 552 Machine Learning for Data Science, and is currently taking DSCI 554 Data Visualization in addition to DSCI 551. Skills include: Python, HTML, CSS, JavaScript.
Juntao Shen is a second-year graduate student in the Computer Science Data Science program. He obtained his bachelor’s degree in computer science at the University of Southern California. He took DSCI 553 last semester and worked on AWS AI during his summer internship at AWS. He has a deep understanding of Java, Python, C++, and C. He also has experience with web development, search engine optimization, and operating systems.
Deadlines (Week of) | Tasks |
---|---|
09/14 | Submit and present the project proposal. |
09/21 - 10/05 | Create a general web page with a search form. Complete data cleaning, integration, and aggregation. |
10/12 | Submit midterm report. |
10/19 - 11/16 | Integrate data into the front end with basic functionality. |
11/23 | Submit final project and present demo to class. |