- Start `superset` on port `8088`, with a `postgresql` database on port `5432` and `jupyter` on port `8888`:

```bash
docker-compose up
```
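The repo's actual `docker-compose.yml` is not shown here; as a rough sketch, a stack exposing those three ports might look like the following (image names and versions are assumptions — only the service name `superset_db` and the `superset` credentials come from the connection settings in this README):

```yaml
# Hypothetical sketch — the repo's real docker-compose.yml may differ.
services:
  superset:
    image: apache/superset
    ports:
      - "8088:8088"
  superset_db:
    image: postgres:15
    environment:
      POSTGRES_USER: superset
      POSTGRES_PASSWORD: superset
      POSTGRES_DB: superset
    ports:
      - "5432:5432"
  jupyter_notebook:
    image: jupyter/base-notebook
    ports:
      - "8888:8888"
```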
> **Important**
> You can also follow this official user guide to start Superset locally, as it reflects the latest changes in the Superset repo.
> **Note**
> If you encounter `unable to start container process: exec: "/app/docker/docker-init.sh": permission denied: unknown`, you can do the following:

```bash
chmod -R u=rwx,g=rwx,o=rwx docker
```
> **Note**
> If you encounter `error getting credentials - err: docker-credential-desktop resolves to executable in current directory (./docker-credential-desktop)`, you can remove the following line from the file `~/.docker/config.json`:

```json
"credsStore": "desktop",
```
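If you prefer to script that edit instead of editing the file by hand, a small sketch using Python's standard `json` module (the path and key name are the ones from the note above; everything else in the file is left untouched):

```python
import json
from pathlib import Path

def remove_creds_store(config_path: Path) -> None:
    """Drop the 'credsStore' entry from a Docker config.json, keeping the rest."""
    config = json.loads(config_path.read_text())
    config.pop("credsStore", None)  # no error if the key is already absent
    config_path.write_text(json.dumps(config, indent=2))

# Usage: remove_creds_store(Path.home() / ".docker" / "config.json")
```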
Now, let's go to http://localhost:8088/ in our browser, and log in with the safest duo ever (`admin` as user, and `admin` as password).
- Connect to the database `superset` using `PostgreSQL` with the following configuration:
    - HOST: `superset_db`
    - PORT: `5432`
    - DATABASE NAME: `superset`
    - USERNAME: `superset`
    - PASSWORD: `superset`
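Note that `superset_db` is the hostname as seen from inside the compose network; from your own machine the same database is reachable on `localhost:5432`. A quick sanity check of the resulting connection URI (a sketch — the repo's own scripts may build it differently):

```python
def postgres_uri(host: str = "localhost",
                 port: int = 5432,
                 database: str = "superset",
                 user: str = "superset",
                 password: str = "superset") -> str:
    """Build a PostgreSQL connection URI from the settings above.

    Use host="superset_db" from inside the compose network,
    "localhost" from the host machine.
    """
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"

print(postgres_uri())  # postgresql://superset:superset@localhost:5432/superset
```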
- Create the `python` virtual environment:

```bash
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
> **Note**
> For M1/M2 chips, issues will occur during the installation of `psycopg2`, and the reference is here. In order to install `psycopg2` on M1/M2 chips, execute the following commands before `pip install -r requirements.txt`:

```bash
brew install postgresql
brew install openssl
```
> **Important**
> If the directory `data/raw` is empty before preparing data, please carry out the following steps:
> - Create the folder `data/raw` under this repo.
> - Copy some Excel files to the above folder.
- Prepare the data from `xlsx` format to `csv` format; the CSV files will be generated under the folder `data/prepared`:

```bash
python prepare.py
```
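The actual `prepare.py` is in the repo; as a minimal sketch of the xlsx-to-csv step it performs, assuming `pandas` (with `openpyxl` for Excel support) and the `data/raw` → `data/prepared` layout described above:

```python
from pathlib import Path

RAW_DIR = Path("data/raw")
PREPARED_DIR = Path("data/prepared")

def prepared_path(xlsx_path: Path) -> Path:
    """Map data/raw/foo.xlsx -> data/prepared/foo.csv."""
    return PREPARED_DIR / xlsx_path.with_suffix(".csv").name

def prepare() -> None:
    """Convert every .xlsx file in data/raw to a .csv file in data/prepared."""
    import pandas as pd  # reading .xlsx also requires openpyxl

    PREPARED_DIR.mkdir(parents=True, exist_ok=True)
    for xlsx in sorted(RAW_DIR.glob("*.xlsx")):
        pd.read_excel(xlsx).to_csv(prepared_path(xlsx), index=False)
```

Calling `prepare()` walks `data/raw` and writes one CSV per workbook into `data/prepared`, matching the behavior this step describes.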
- Push the prepared CSV files to the database:

```bash
python push.py
```
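Again, the real `push.py` lives in the repo; one plausible sketch of the load step, assuming `pandas` and `SQLAlchemy` (with `psycopg2` as the driver) and the connection settings listed earlier — the table-naming helper is purely illustrative:

```python
from pathlib import Path

PREPARED_DIR = Path("data/prepared")

def table_name(csv_path: Path) -> str:
    """Derive a table name from a CSV filename, e.g. 'CO2 Data.csv' -> 'co2_data'."""
    return csv_path.stem.lower().replace(" ", "_").replace("-", "_")

def push(uri: str = "postgresql://superset:superset@localhost:5432/superset") -> None:
    """Load each prepared CSV into its own table in the superset database."""
    import pandas as pd
    from sqlalchemy import create_engine  # psycopg2 must be installed as the driver

    engine = create_engine(uri)
    for csv in sorted(PREPARED_DIR.glob("*.csv")):
        pd.read_csv(csv).to_sql(table_name(csv), engine,
                                if_exists="replace", index=False)
```

`if_exists="replace"` makes the push idempotent: re-running it overwrites each table instead of appending duplicates.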
Option 1: You can query the data in Superset via http://localhost:8088/sqllab/.

Option 2: You can query the data in Jupyter via a link that you can find by `docker logs jupyter_notebook`, which looks like http://127.0.0.1:8888/lab?token=xxx. The example code is in this link.
- Can it be automatic?
- Can you give examples of "raw" Excel files and PDFs, "cleaned" and "curated" data?
- What tool will you choose to clean the data if you have the freedom?
- https://github.com/tintinrevient/dagster-poor-mans-data-lake
- https://github.com/tintinrevient/dagster-with-dbt
- https://ourworldindata.org/co2-and-greenhouse-gas-emissions
- https://github.com/owid/co2-data
- https://python.langchain.com/docs/modules/data_connection/vectorstores/
- https://www.databricks.com/glossary/medallion-architecture
- https://developer.nvidia.com/blog/pandas-dataframe-tutorial-beginners-guide-to-gpu-accelerated-dataframes-in-python/
- https://github.com/gventuri/pandas-ai
- https://github.com/fugue-project/fugue
- 😍 Build a poor man’s data lake from scratch with DuckDB
- https://www.pola.rs/benchmarks.html
- https://h2oai.github.io/db-benchmark/
- ✨ https://motherduck.com/blog/duckdb-text2sql-llm/