Establish general pattern for uploading artifacts to storage #894
Comments
Hey @dwreeves, great to see you diving into another intriguing challenge over here! Thank you for your detailed request. I agree that establishing a general pattern for uploading artifacts to cloud storage is a crucial enhancement for the Cosmos repository. Here's a proposed approach considering the current state of the Cosmos codebase and your requirements:
Here's a conceptual example of how this might look:

```python
from typing import List

from airflow.models import BaseOperator


class ArtifactUploader:
    def __init__(self, artifact_paths: List[str], storage_path_template: str):
        self.artifact_paths = artifact_paths
        self.storage_path_template = storage_path_template

    def upload(self):
        # Identify artifacts based on self.artifact_paths
        # Upload each artifact to the templated storage path
        ...


class DbtDocsS3Operator(BaseOperator):
    def execute(self, context):
        # Run dbt command
        # ...
        # Upload artifacts
        uploader = ArtifactUploader(
            artifact_paths=["target/*", "logs/*"],
            storage_path_template="dbt_artifacts/{{ run_id }}/",
        )
        uploader.upload()
```

This approach would provide a flexible and extensible way to handle the uploading of artifacts to cloud storage, while also addressing the API nitpicks you mentioned. It would allow for the uploading of artifacts from any dbt command, not just the docs command.

Please let me know if this approach aligns with your vision, or if there are any other considerations or requirements I should take into account.
First time Dosu did a good job for one of my issues. Thank you Dosu!
Thank you, @dwreeves! I'm glad to hear that the proposed approach aligns with your vision. If you have any further questions or need additional assistance as you work through this enhancement, feel free to reach out. I'm here to help!
[Daniel Reeves](https://www.linkedin.com/in/daniel-reeves-27700545/) (@dwreeves) is an experienced open-source developer currently working as a Data Architect at Battery Ventures. He has significant experience with Apache Airflow, SQL, and Python and has contributed to many [OSS projects](https://github.com/dwreeve). Not only has he been using Cosmos since its early stages, but since January 2023, he has actively contributed to the project:

![Screenshot 2024-05-14 at 10 47 30](https://github.com/astronomer/astronomer-cosmos/assets/272048/57829cb6-7eee-4b02-998b-46cc7746f15a)

He has been a critical driver for the Cosmos 1.4 release, and his contributions include new features, bug fixes, and documentation improvements, among them:

* Creation of an Airflow plugin to render dbt docs: #737
* Support using dbt partial parsing file: #800
* Add more template fields to `DbtBaseOperator`: #786
* Add cancel on kill functionality: #101
* Make region optional in Snowflake profile mapping: #100
* Fix the dbt docs operator to not look for `graph.pickle`: #883

He thinks about the project long-term and proposes thorough solutions to problems faced by the community, as can be seen in GitHub tickets:

* Introducing composability in the middle layer of Cosmos's API: #895
* Establish a general pattern for uploading artifacts to storage: #894
* Support `operator_arguments` injection at a node level: #881

One of Daniel's notable traits is his collaborative and supportive approach. He has actively engaged with users in the #airflow-dbt Slack channel, demonstrating his commitment to fostering a supportive community. For all of this, we want to promote him to Cosmos committer and maintainer, recognising his constant efforts and achievements towards our community. Thank you very much, @dwreeves!
Overview
This issue should be seen as a higher-level, more general offshoot of #870. Basically, as the need to upload more files to cloud storage increases, the patterns around this should be more firmly established and generalized.
Right now there is a `DbtDocsOperator` that loads files created by dbt into cloud storage. Additionally, there is a need to be able to run `dbt parse` or `dbt compile` inside the Cosmos runtime and upload artifacts from these runs; see #870 for why.

Considerations
The high-level API for `DbtDocsOperator` is pretty reasonable, and solving this issue can utilize the same high-level API as the `DbtDocsOperator`. That said, there are some minor nitpicks I'd have about the current API:

* The operator is docs-specific, but other commands like `dbt run` create artifacts.
* `required_files` may be a little awkward. "Required files" seems in contrast to something like an optional file. Should there be optional files? How should Cosmos treat the process of identifying files to be uploaded more generally?
* It's unclear whether only the `target/` directory should be assumed. Notably, what about uploading from `logs/`? This is irrelevant for most things, but it does matter for enabling commands like `dbt run` to upload debug logs to cloud storage.

^ I think trying to support all of these API nitpicks may not be useful for most people, and perhaps e.g. `logs/` upload should not be supported and `dbt run` should not be able to upload artifacts, but it'd be worth explicitly establishing all of that.

Some additional requirements and considerations for uploading files:
* `dbt parse` generates `perf_info.json`.
* Not every invocation generates `graph.gpickle`, `graph_summary.json`, or `partial_parse.msgpack`:
  * When `--no-write-json` is passed, `graph.gpickle` is not created.
  * When `--no-partial-parse` is passed, `partial_parse.msgpack` is not created.

Some additional requirements for the interface:
* It should be possible to template a path like `dbt_artifacts/{{ run_id }}/` as a location for storing files.
* It should be possible to configure which files get uploaded (e.g., should you upload `target/graph.gpickle` or not?).

You can imagine there being an "upload configuration" class that compresses all of this information down to a portable object. Or maybe that is too complex!

I don't have answers, just questions. I'm hoping this issue can be a place to discuss what to do.