I'm using this repo as a learning opportunity to work through a few data engineering projects taught by Darshil Parmar. Here is a link to the Medium article:
https://medium.com/@darshilp/7-end-to-end-data-engineering-projects-for-free-bf9e86d7bfe0
This repo will house 3 projects (with the possibility to expand).
The main aim of this project is to identify an open-source dataset and design a data model to ingest that data into Postgres. The requirements for the project are:
- Find a dataset from sites such as OpenData or Kaggle (links are in the description of the YouTube video).
- Build a data model (minimum 3 tables).
- Write Python code to create the required tables.
- Insert data from the files into the tables.
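The last two requirements can be sketched in Python with `psycopg2`. The three-table model below (vehicles, sellers, sales) and its column names are assumptions based on the Kaggle dataset's description, not the final schema; the CSV path and connection string are placeholders.

```python
# Sketch only: table names, columns, CSV path, and credentials are assumptions.
import csv

# Minimal three-table model normalizing the flat sales CSV.
DDL = [
    """CREATE TABLE IF NOT EXISTS vehicles (
        vin   TEXT PRIMARY KEY,
        year  INT,
        make  TEXT,
        model TEXT,
        body  TEXT
    )""",
    """CREATE TABLE IF NOT EXISTS sellers (
        seller_id SERIAL PRIMARY KEY,
        name      TEXT UNIQUE,
        state     TEXT
    )""",
    """CREATE TABLE IF NOT EXISTS sales (
        sale_id       SERIAL PRIMARY KEY,
        vin           TEXT REFERENCES vehicles (vin),
        seller_id     INT REFERENCES sellers (seller_id),
        sale_date     DATE,
        selling_price NUMERIC
    )""",
]


def load_vehicles(conn, csv_path):
    """Insert vehicle rows from the CSV (simplified: one table, no batching)."""
    with conn.cursor() as cur, open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            cur.execute(
                "INSERT INTO vehicles (vin, year, make, model, body) "
                "VALUES (%s, %s, %s, %s, %s) ON CONFLICT (vin) DO NOTHING",
                (row["vin"], row["year"], row["make"], row["model"], row["body"]),
            )
    conn.commit()


if __name__ == "__main__":
    import psycopg2  # third-party driver; installed in the dev container

    # Placeholder URI; the real one should come from the .env file.
    conn = psycopg2.connect("postgresql://postgres:postgres@localhost:5432/postgres")
    with conn.cursor() as cur:
        for stmt in DDL:
            cur.execute(stmt)
    conn.commit()
    load_vehicles(conn, "src/data/car_prices.csv")  # path is an assumption
```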
I found a vehicle sales dataset on Kaggle which I feel is a good fit for this project. The dataset can be found here (also saved under src/data/):
https://www.kaggle.com/datasets/syedanwarafridi/vehicle-sales-data?resource=download
- You should have Docker installed on your local machine.
- An IDE for SQL; I personally use DBeaver. Alternatively, you can add another service to the Docker Compose file, such as pgAdmin, which has a web UI.
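A minimal Compose file covering these prerequisites might look like the sketch below. Service names, image tags, ports, and credentials are all placeholders, and the pgAdmin service is the optional web UI mentioned above:

```yaml
# Sketch only: swap the hard-coded credentials for values from a .env file.
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: postgres   # placeholder
    ports:
      - "5432:5432"
  pgadmin:
    image: dpage/pgadmin4
    environment:
      PGADMIN_DEFAULT_EMAIL: admin@example.com   # placeholder
      PGADMIN_DEFAULT_PASSWORD: admin            # placeholder
    ports:
      - "8080:80"
```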
In this repo I'm trying to include a dev container to help a developer set up a local environment that is supported on all major platforms. This includes:
- Docker containers that currently run Python and Postgres.
- Set up the necessary packages in the dev container.
- Create a schema and tables
- Revisit the course to check that the project requirements have been met.
- Add setup instructions.
- Set up a .env file to handle the Postgres credentials and pass the Postgres URI to the Python image.
- Find a way to dynamically assign the Postgres URI (the container's IP address is static at the moment).
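The last two items could be handled together: read the credentials from environment variables (which Docker Compose can populate from the .env file) and use the Compose service name as the hostname, since containers on a Compose network resolve each other by service name, so no static IP is needed. The variable names and the "postgres" service name below are assumptions:

```python
import os


def postgres_uri():
    """Build the Postgres URI from environment variables.

    On a Docker Compose network, containers resolve each other by service
    name, so defaulting the host to the service name (assumed here to be
    "postgres") avoids hard-coding a container IP address.
    """
    user = os.environ.get("POSTGRES_USER", "postgres")
    password = os.environ.get("POSTGRES_PASSWORD", "postgres")
    host = os.environ.get("POSTGRES_HOST", "postgres")  # Compose service name
    port = os.environ.get("POSTGRES_PORT", "5432")
    db = os.environ.get("POSTGRES_DB", "postgres")
    return f"postgresql://{user}:{password}@{host}:{port}/{db}"
```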