- Start `superset` on port `8088`, with a `postgresql` database on port `5432` and `jupyter` on port `8888`:

```bash
docker-compose up
```
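The repo's actual `docker-compose.yml` is not shown here; as a rough sketch, a stack exposing those three ports might look like the following (image names and versions are assumptions — only the service name `superset_db` and the `superset` credentials come from the connection settings in this README):

```yaml
# Hypothetical sketch — the repo's real docker-compose.yml may differ.
services:
  superset:
    image: apache/superset
    ports:
      - "8088:8088"
  superset_db:
    image: postgres:15
    environment:
      POSTGRES_USER: superset
      POSTGRES_PASSWORD: superset
      POSTGRES_DB: superset
    ports:
      - "5432:5432"
  jupyter_notebook:
    image: jupyter/base-notebook
    ports:
      - "8888:8888"
```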
> **Important**
> You can also follow this official user guide to start Superset locally, as it reflects the latest changes in the Superset repo.
> **Note**
> If you encounter `unable to start container process: exec: "/app/docker/docker-init.sh": permission denied: unknown`, you can do the following:

```bash
chmod -R u=rwx,g=rwx,o=rwx docker
```
> **Note**
> If you encounter `error getting credentials - err: docker-credential-desktop resolves to executable in current directory (./docker-credential-desktop)`, you can remove the following line from the file `~/.docker/config.json`:

```json
"credsStore": "desktop",
```
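If you prefer to script that edit instead of editing the file by hand, a small sketch using Python's standard `json` module (the path and key name are the ones from the note above; everything else in the file is left untouched):

```python
import json
from pathlib import Path

def remove_creds_store(config_path: Path) -> None:
    """Drop the 'credsStore' entry from a Docker config.json, keeping the rest."""
    config = json.loads(config_path.read_text())
    config.pop("credsStore", None)  # no error if the key is already absent
    config_path.write_text(json.dumps(config, indent=2))

# Usage: remove_creds_store(Path.home() / ".docker" / "config.json")
```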
Now, let's go to http://localhost:8088/ in our browser, and log in with the safest duo ever (`admin` as user, and `admin` as password).
- Connect to the database `superset` using `PostgreSQL` with the following configuration:
    - HOST: `superset_db`
    - PORT: `5432`
    - DATABASE NAME: `superset`
    - USERNAME: `superset`
    - PASSWORD: `superset`
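Note that `superset_db` is the hostname as seen from inside the compose network; from your own machine the same database is reachable on `localhost:5432`. A quick sanity check of the resulting connection URI (a sketch — the repo's own scripts may build it differently):

```python
def postgres_uri(host: str = "localhost",
                 port: int = 5432,
                 database: str = "superset",
                 user: str = "superset",
                 password: str = "superset") -> str:
    """Build a PostgreSQL connection URI from the settings above.

    Use host="superset_db" from inside the compose network,
    "localhost" from the host machine.
    """
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"

print(postgres_uri())  # postgresql://superset:superset@localhost:5432/superset
```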
- Create the `python` virtual environment:

```bash
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
> **Note**
> For M1/M2 chips, issues will occur during the installation of `psycopg2`, and the reference is here. In order to install `psycopg2` on M1/M2 chips, execute the following commands before `pip install -r requirements.txt`:

```bash
brew install postgresql
brew install openssl
```
> **Important**
> If the directory `data/raw` is empty before preparing data, please carry out the following steps:
> - Create the folder `data/raw` under this repo.
> - Copy some Excel files to the above folder.
- Prepare the data from `xlsx` format to `csv` format; the CSV files will be generated under the folder `data/prepared`:

```bash
python prepare.py
```
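The actual `prepare.py` is in the repo; as a minimal sketch of the xlsx-to-csv step it performs, assuming `pandas` (with `openpyxl` for Excel support) and the `data/raw` → `data/prepared` layout described above:

```python
from pathlib import Path

RAW_DIR = Path("data/raw")
PREPARED_DIR = Path("data/prepared")

def prepared_path(xlsx_path: Path) -> Path:
    """Map data/raw/foo.xlsx -> data/prepared/foo.csv."""
    return PREPARED_DIR / xlsx_path.with_suffix(".csv").name

def prepare() -> None:
    """Convert every .xlsx file in data/raw to a .csv file in data/prepared."""
    import pandas as pd  # reading .xlsx also requires openpyxl

    PREPARED_DIR.mkdir(parents=True, exist_ok=True)
    for xlsx in sorted(RAW_DIR.glob("*.xlsx")):
        pd.read_excel(xlsx).to_csv(prepared_path(xlsx), index=False)
```

Calling `prepare()` walks `data/raw` and writes one CSV per workbook into `data/prepared`, matching the behavior this step describes.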
- Push the prepared CSV files to the database:

```bash
python push.py
```
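Again, the real `push.py` lives in the repo; one plausible sketch of the load step, assuming `pandas` and `SQLAlchemy` (with `psycopg2` as the driver) and the connection settings listed earlier — the table-naming helper is purely illustrative:

```python
from pathlib import Path

PREPARED_DIR = Path("data/prepared")

def table_name(csv_path: Path) -> str:
    """Derive a table name from a CSV filename, e.g. 'CO2 Data.csv' -> 'co2_data'."""
    return csv_path.stem.lower().replace(" ", "_").replace("-", "_")

def push(uri: str = "postgresql://superset:superset@localhost:5432/superset") -> None:
    """Load each prepared CSV into its own table in the superset database."""
    import pandas as pd
    from sqlalchemy import create_engine  # psycopg2 must be installed as the driver

    engine = create_engine(uri)
    for csv in sorted(PREPARED_DIR.glob("*.csv")):
        pd.read_csv(csv).to_sql(table_name(csv), engine,
                                if_exists="replace", index=False)
```

`if_exists="replace"` makes the push idempotent: re-running it overwrites each table instead of appending duplicates.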
Option 1: You can query the data in Superset via http://localhost:8088/sqllab/.

Option 2: You can query the data in Jupyter via a link that you can find by `docker logs jupyter_notebook`, which looks like http://127.0.0.1:8888/lab?token=xxx. The example code is in this link.
- Can it be automatic?
- Can you give examples of "raw" Excel files and PDFs, "cleaned" and "curated" data?
- What tool will you choose to clean the data if you have the freedom?
- https://github.com/tintinrevient/dagster-poor-mans-data-lake
- https://github.com/tintinrevient/dagster-with-dbt
- https://ourworldindata.org/co2-and-greenhouse-gas-emissions
- https://github.com/owid/co2-data
- https://python.langchain.com/docs/modules/data_connection/vectorstores/
- https://www.databricks.com/glossary/medallion-architecture
- https://developer.nvidia.com/blog/pandas-dataframe-tutorial-beginners-guide-to-gpu-accelerated-dataframes-in-python/
- https://github.com/gventuri/pandas-ai
- https://github.com/fugue-project/fugue
- 😍 Build a poor man’s data lake from scratch with DuckDB
- https://www.pola.rs/benchmarks.html
- https://h2oai.github.io/db-benchmark/
- ✨ https://motherduck.com/blog/duckdb-text2sql-llm/