
Establish general pattern for uploading artifacts to storage #894

Open
dwreeves opened this issue Mar 16, 2024 · 3 comments
Labels
area:config, dbt:docs, execution:local, parsing:custom, triage-needed

@dwreeves (Collaborator) commented Mar 16, 2024

Overview

This issue should be seen as a higher-level, more general offshoot of #870. Basically, as the need to upload more files to cloud storage increases, the patterns around this should be more firmly established and generalized.

Right now there is a DbtDocsOperator that loads files created by dbt into cloud storage.

Additionally, there is a need to be able to run dbt parse or dbt compile inside the Cosmos runtime and upload artifacts from these runs; see: #870 for why.

Considerations

The high-level API for DbtDocsOperator is pretty reasonable, and solving this issue can reuse that same high-level API. That said, I have a few minor nitpicks about the current API:

  • At the moment, non-dbt docs commands do not have direct access to the API for uploading artifacts to cloud storage, even though commands such as dbt run create artifacts.
  • The name required_files may be a little awkward. "Required files" implies a contrast with something like optional files. Should there be optional files? How should Cosmos treat the process of identifying files to be uploaded more generally?
  • It is not clear that by default the target/ directory should be assumed. Notably, what about uploading from logs/? This is irrelevant for most things but it does matter for enabling commands like dbt run to upload debug logs to cloud storage.

^ I think trying to support all of these API nitpicks may not be useful for most people, and perhaps, for example, logs/ uploads should not be supported and dbt run should not be able to upload artifacts at all, but it would be worth explicitly establishing all of that.

Some additional requirements and considerations for uploading files:

  • Many dbt commands generate artifacts that are not core to the reason why you'd run the command, but could still be useful to have accessible.
    • dbt parse generates perf_info.json
    • Many commands generate a graph.gpickle, graph_summary.json, or partial_parse.msgpack.
  • Some global flags need to be checked; see the sketch after this list.
    • When --no-write-json is passed, graph.gpickle is not created.
    • When --no-partial-parse is passed, partial_parse.msgpack is not created.
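
As an illustration of that flag handling, here is a minimal sketch of how the set of expected artifacts could be derived from the global flags passed to dbt. The helper name and the base artifact list are hypothetical, not part of the current Cosmos API:

from typing import List

def expected_artifacts(dbt_flags: List[str]) -> List[str]:
    # Hypothetical helper (not existing Cosmos code): decide which target/
    # artifacts to look for based on the global flags passed to dbt.
    artifacts = ["target/perf_info.json", "target/graph_summary.json"]
    if "--no-write-json" not in dbt_flags:
        # Per the note above, graph.gpickle is not created with --no-write-json.
        artifacts.append("target/graph.gpickle")
    if "--no-partial-parse" not in dbt_flags:
        # partial_parse.msgpack is not created with --no-partial-parse.
        artifacts.append("target/partial_parse.msgpack")
    return artifacts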

Some additional requirements for the interface:

  • The path in cloud storage needs to be templated so that dagrun-specific artifacts can be stored in different locations, e.g. using dbt_artifacts/{{ run_id }}/ as a location for storing files.
  • There are a lot of possible options to consider when uploading files. (E.g. should a command upload its target/graph.gpickle or not?) You can imagine there being an "upload configuration" class that compresses all of this information down to a portable object. Or maybe that is too complex!
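
To make those last two points concrete, here is a minimal sketch of what such an "upload configuration" object could look like, with the path template rendered via Jinja the way Airflow renders templated fields. The class name, fields, and defaults are hypothetical, not an existing Cosmos API:

from dataclasses import dataclass, field
from typing import Dict, List

from jinja2 import Template


@dataclass
class ArtifactUploadConfig:
    # Hypothetical, portable bundle of upload options (not an existing Cosmos class).
    remote_path_template: str = "dbt_artifacts/{{ run_id }}/"
    include: List[str] = field(default_factory=lambda: ["target/*.json"])
    exclude: List[str] = field(default_factory=lambda: ["target/graph.gpickle"])

    def render_remote_path(self, context: Dict[str, str]) -> str:
        # If remote_path_template were listed in an operator's template_fields,
        # Airflow would render it automatically; this manual render is illustrative.
        return Template(self.remote_path_template).render(**context)

For example, ArtifactUploadConfig().render_remote_path({"run_id": "manual__2024-03-16"}) would yield dbt_artifacts/manual__2024-03-16/, so each DAG run's artifacts land under their own prefix.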

I don't have answers, just questions. I'm hoping this issue can be a place to discuss what to do.

dosubot bot commented Mar 16, 2024

Hey @dwreeves, great to see you diving into another intriguing challenge over here!

Thank you for your detailed request. I agree that establishing a general pattern for uploading artifacts to cloud storage is a crucial enhancement for the Cosmos repository. Here's a proposed approach considering the current state of the Cosmos codebase and your requirements:

  1. Extend the existing DbtDocsOperator: The current DbtDocsOperator is implemented as three specific operators for different cloud storage services: DbtDocsS3Operator, DbtDocsAzureStorageOperator, and DbtDocsGCSOperator. We can extend these operators to handle the uploading of artifacts generated by dbt parse and dbt compile commands. This would involve modifying the execute method of these operators to run the desired dbt command, identify the generated artifacts, and upload them to the respective cloud storage service.

  2. Create a generalized ArtifactUploader class: This class would abstract the process of identifying and uploading artifacts. It could be initialized with a list of artifact file paths (or patterns) to upload and a cloud storage path template. The upload method of this class would then handle the process of uploading the identified artifacts to the templated cloud storage path. This class could also handle the checking of global flags to determine which artifacts to upload.

  3. Integrate ArtifactUploader with DbtDocsOperator: The DbtDocsOperator (and its specific implementations) could be modified to utilize the ArtifactUploader class. After running the dbt command, the operator would create an instance of ArtifactUploader, passing the list of artifacts to upload and the templated cloud storage path. It would then call the upload method of the ArtifactUploader to upload the artifacts.

Here's a conceptual example of how this might look:

from typing import List

from airflow.models import BaseOperator


class ArtifactUploader:
    def __init__(self, artifact_paths: List[str], storage_path_template: str):
        self.artifact_paths = artifact_paths
        self.storage_path_template = storage_path_template

    def upload(self) -> None:
        # Identify artifacts by expanding self.artifact_paths (e.g. glob patterns),
        # then upload each one to the rendered storage path.
        ...


class DbtDocsS3Operator(BaseOperator):
    def execute(self, context):
        # Run the dbt command
        # ...

        # Upload the generated artifacts
        uploader = ArtifactUploader(
            artifact_paths=["target/*", "logs/*"],
            storage_path_template="dbt_artifacts/{{ run_id }}/",
        )
        uploader.upload()

This approach would provide a flexible and extensible way to handle the uploading of artifacts to cloud storage, while also addressing the API nitpicks you mentioned. It would allow for the uploading of artifacts from any dbt command, not just dbt docs generate, and it would provide a clear and consistent way to specify which artifacts to upload and where to upload them.

Please let me know if this approach aligns with your vision, or if there are any other considerations or requirements I should take into account.

dosubot added the area:config, dbt:docs, execution:local, and parsing:custom labels on Mar 16, 2024
@dwreeves (Collaborator, Author) commented

First time Dosu did a good job for one of my issues. Thank you Dosu!

dosubot bot commented Mar 16, 2024

Thank you, @dwreeves! I'm glad to hear that the proposed approach aligns with your vision. If you have any further questions or need additional assistance as you work through this enhancement, feel free to reach out. I'm here to help!

tatiana added this to the 1.5.0 milestone on Apr 30, 2024
tatiana added a commit that referenced this issue May 15, 2024
[Daniel Reeves](https://www.linkedin.com/in/daniel-reeves-27700545/)
(@dwreeves ) is an experienced Open-Source Developer currently working
as a Data Architect at Battery Ventures. He has significant experience
with Apache Airflow, SQL, and Python and has contributed to many [OSS
projects](https://github.com/dwreeve).

Not only has he been using Cosmos since its early stages, but since
January 2023, he has actively contributed to the project:
![Screenshot 2024-05-14 at 10 47 30](https://github.com/astronomer/astronomer-cosmos/assets/272048/57829cb6-7eee-4b02-998b-46cc7746f15a)

He has been a critical driver for the Cosmos 1.4 release, and some of
his contributions include new features, bug fixes, and documentation
improvements, including:
* Creation of an Airflow plugin to render dbt docs:
#737
* Support using dbt partial parsing file:
#800
* Add more template fields to `DbtBaseOperator`:
#786
* Add cancel on kill functionality:
#101
* Make region optional in Snowflake profile mapping:
#100
* Fix the dbt docs operator to not look for `graph.pickle`:
#883

He thinks about the project long-term and proposes thorough solutions to
problems faced by the community, as can be seen in GitHub tickets:
* Introducing composability in the middle layer of Cosmos's API:
#895
* Establish a general pattern for uploading artifacts to storage:
#894
* Support `operator_arguments` injection at a node level:
#881

One of Daniel's notable traits is his collaborative and supportive
approach. He has actively engaged with users in the #airflow-dbt Slack
channel, demonstrating his commitment to fostering a supportive
community.

For all of this, we want to promote him to Cosmos committer and maintainer,
recognising his constant efforts and achievements towards our
community. Thank you very much, @dwreeves!
tatiana added the triage-needed label on May 17, 2024