tintinrevient/not-doing-excel

NOT Doing Excel

Hmm...

[image: overview]

Aha!

[image: overview]

Tech Stack

[image: overview]

Install the playground

  1. Start Superset on port 8088, with a PostgreSQL database on port 5432 and Jupyter on port 8888:
docker-compose up

Important

You can also follow this official user guide to start Superset locally, as it reflects the latest changes in the Superset repo.

Note

If you encounter unable to start container process: exec: "/app/docker/docker-init.sh": permission denied: unknown, you can do the following:

chmod -R u=rwx,g=rwx,o=rwx docker

Note

If you encounter error getting credentials - err: docker-credential-desktop resolves to executable in current directory (./docker-credential-desktop), you can remove the following line from the file ~/.docker/config.json:

"credsStore": "desktop",

Now, let's go to http://localhost:8088/ in our browser and log in with the safest duo ever (admin as user, admin as password).

  2. Connect to the database superset with the following configuration using PostgreSQL:
  • HOST: superset_db
  • PORT: 5432
  • DATABASE NAME: superset
  • USERNAME: superset
  • PASSWORD: superset
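
The settings above can be assembled into a standard PostgreSQL connection URL. A minimal sketch (the make_dsn helper is hypothetical, not part of the repo; note that superset_db resolves only inside the Docker network — from your own machine, use localhost):

```python
# Hypothetical helper: assemble a PostgreSQL DSN from the settings above.
def make_dsn(host: str, port: int, dbname: str, user: str, password: str) -> str:
    return f"postgresql://{user}:{password}@{host}:{port}/{dbname}"

# The hostname superset_db resolves only inside the Docker network;
# from your own machine, connect to localhost instead.
print(make_dsn("superset_db", 5432, "superset", "superset", "superset"))
# → postgresql://superset:superset@superset_db:5432/superset
```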

Insert the data

  1. Create the Python virtual environment:
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Note

For Apple M1/M2 chips, issues can occur during the installation of psycopg2; the reference is here.

To install psycopg2 on M1/M2 chips, execute the following commands before running pip install -r requirements.txt:

brew install postgresql
brew install openssl

Important

If the directory data/raw is empty before preparing data, please carry out the following steps:

  1. Create the folder data/raw under this repo.
  2. Copy some Excel files to the above folder.
  2. Convert the data from xlsx format to csv format; the CSV files will be generated under the folder data/prepared:
python prepare.py
  3. Push the prepared CSV files to the database:
python push.py
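
The two steps above can be sketched as follows. This is an assumption of how prepare.py and push.py might work — using pandas (with openpyxl) for the xlsx step and SQLAlchemy for the push; the repo's actual scripts may differ, and the table_name helper is hypothetical:

```python
from pathlib import Path

import pandas as pd
from sqlalchemy import create_engine

RAW, PREPARED = Path("data/raw"), Path("data/prepared")

def table_name(path: Path) -> str:
    # Derive a Postgres-friendly table name from a file name.
    return path.stem.lower().replace(" ", "_").replace("-", "_")

def prepare() -> None:
    # Convert each xlsx in data/raw to a CSV in data/prepared.
    PREPARED.mkdir(parents=True, exist_ok=True)
    for xlsx in RAW.glob("*.xlsx"):
        # read_excel takes the first sheet by default; one CSV per sheet
        # would also be a reasonable choice.
        pd.read_excel(xlsx).to_csv(PREPARED / f"{xlsx.stem}.csv", index=False)

def push() -> None:
    # Load each prepared CSV into its own table in the superset database.
    engine = create_engine(
        "postgresql+psycopg2://superset:superset@localhost:5432/superset"
    )
    for csv in PREPARED.glob("*.csv"):
        pd.read_csv(csv).to_sql(
            table_name(csv), engine, if_exists="replace", index=False
        )
```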

Query the data

Option 1: You can query the data in Superset via http://localhost:8088/sqllab/:

[image: superset]

Option 2: You can query the data in Jupyter via the link printed by docker logs jupyter_notebook, which looks like http://127.0.0.1:8888/lab?token=xxx. The example code is in this link.
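
A sketch of such a notebook query (assumes pandas and SQLAlchemy are available in the Jupyter image; the peek helper and the table name are hypothetical — substitute a table that push.py actually created):

```python
import pandas as pd
from sqlalchemy import create_engine

def limit_query(table: str, limit: int = 10) -> str:
    # Build a simple preview query for a table.
    return f"SELECT * FROM {table} LIMIT {limit}"

def peek(table: str, limit: int = 10) -> pd.DataFrame:
    # Inside the Jupyter container, the database host is superset_db
    # (the docker-compose service name), not localhost.
    engine = create_engine(
        "postgresql+psycopg2://superset:superset@superset_db:5432/superset"
    )
    return pd.read_sql(limit_query(table, limit), engine)

# In a notebook cell: peek("my_table")
```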

Follow-up questions

  • Can it be automatic?
  • Can you give examples of "raw" Excel files and PDFs, "cleaned" and "curated" data?
  • What tool will you choose to clean the data if you have the freedom?

Other options

References
