🚀 Dagster Databricks Integration Demo 🚀

This repository contains a demo project showcasing the integration of Dagster with Databricks using the Databricks Step Launcher. The demo demonstrates how to develop PySpark code locally, version it using Git, upload it to Databricks, and execute it there, providing an efficient workflow compared to just scheduling Databricks Jobs via Dagster.

📋 Prerequisites

Before you begin, ensure you have the following:

Databricks Cluster: You will need an active Databricks cluster. The Cluster ID will be required later to run the PySpark script.
Python: This demo is developed with Python 3.8.6. Ensure you have the same or compatible Python version installed.

⚙️ Configuration

To use this demo, you need to specify several configuration parameters (no env variables were used in this demo, as Databricks also has to access them):

Databricks Cluster ID: Find this in your Databricks cluster configurations and update it in the dagster_demo/__init__.py file (i.e., replace "<CLUSTER_ID>").
Databricks Host: This is the URL of your Databricks workspace. Update it in the dagster_demo/__init__.py file (i.e., replace "<HOST_HTTPS_PATH>", sample format: "https://abd-123abcde-0a1b.cloud.databricks.com").
Databricks Token: This is your personal access token for Databricks. Update it in the dagster_demo/__init__.py file (i.e., replace "<HOST_TOKEN>").
Databricks Schema: This is the schema/database where you want to store the sample table for this demo, assuming you store it in the Hive Metastore. Update the schema in the dagster_demo/pyspark_asset.py (i.e., replace "<DB_SCHEMA>").

For more detailed instructions on how to find these parameters, refer to the Databricks documentation.

🔨 Installation and Setup

Follow these steps to set up and run the demo:

Clone the Repository:

git clone <repository-url>
cd dagster_demo

Install Dependencies: Install the required dependencies for the project:

pip install -e ".[dev]"

Start the Dagster UI: Launch the Dagster development environment:

dagster dev

Once the UI is running, access it via http://localhost:3000 in your browser.

🏃 Running the Demo

After setting up, you can run the demo pipeline using the Dagster UI. The pipeline will execute the PySpark script on your Databricks cluster. If everything worked out, you should see the table raw_full_moon_dates in Databricks in the specified schema.

🤝 Contributing

Contributions to this demo are welcome. Please ensure that your code adheres to the best practices for PySpark development and is compatible with the Databricks environment.

📜 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

📞 Contact

Tobias Zeulner - t.zeulner@reply.de
David Weber - da.weber@reply.de
Döme Lörinczy - d.loerinczy@reply.de

Machine Learning Reply, Luise-Ullrich Str 14, 80636 Munich

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
dagster_demo		dagster_demo
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

dagster_demo

dagster_demo

.gitignore

.gitignore

LICENSE

LICENSE

README.md

README.md

Repository files navigation

🚀 Dagster Databricks Integration Demo 🚀

📋 Prerequisites

⚙️ Configuration

🔨 Installation and Setup

🏃 Running the Demo

🤝 Contributing

📜 License

📞 Contact

About

Releases

Packages

Languages

License

MachineLearningReply/dagster-databricks-step-launcher-demo

Folders and files

Latest commit

History

Repository files navigation

🚀 Dagster Databricks Integration Demo 🚀

📋 Prerequisites

⚙️ Configuration

🔨 Installation and Setup

🏃 Running the Demo

🤝 Contributing

📜 License

📞 Contact

About

Resources

License

Stars

Watchers

Forks

Languages