How to overwrite batch transform output in S3 #68

Open
BaoshengHeTR opened this issue Sep 4, 2020 · 5 comments

Comments

@BaoshengHeTR

I could not find any documentation on overwriting batch transform output.
If I run the same batch transform job multiple times over time, how should I set up the transformer to overwrite the previous output (i.e., I do not change the output_path)?

@chuyang-deng
Contributor

Hi @BaoshengHeTR, are you using the Python SDK? If so, and you use the same output path (https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/transformer.py#L59) across multiple runs, the results should be stored in the same location in S3.

@BaoshengHeTR
Author

> Hi @BaoshengHeTR, are you using the Python SDK? If so, and you use the same output path (https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/transformer.py#L59) across multiple runs, the results should be stored in the same location in S3.

Yes. Doing it that way appends the new results to the old ones, right? Can we set up an overwrite mode instead? In Spark, for example, we have write.mode("overwrite").

@haoransh

Any update on this? I also need an overwrite mode, especially when the input S3 path is the output of a Spark job.
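
For readers wondering why reruns appear to "append": as far as I understand it, batch transform names each output object after its input file with `.out` added, and S3 replaces an object when the same key is written again. Under that assumption, a rerun only replaces results when the input file names are unchanged; Spark part files usually get new names on every write, so their outputs pile up. The sketch below just models that naming rule in plain Python. `transform_output_key` and `keys_after_runs` are hypothetical helper names, not part of the SageMaker SDK, and nested input prefixes are not modelled.

```python
import posixpath
from typing import Iterable, List


def transform_output_key(output_prefix: str, input_key: str) -> str:
    """Key a batch transform job is assumed to write for one input object.

    Assumption: the output name is the input file name plus ".out".
    """
    file_name = posixpath.basename(input_key)
    return posixpath.join(output_prefix.rstrip("/"), file_name + ".out")


def keys_after_runs(output_prefix: str, runs: Iterable[Iterable[str]]) -> List[str]:
    """Output keys left under the prefix after several runs.

    A set models S3 semantics: writing to an existing key replaces the object,
    while a new key adds one more object next to the old ones.
    """
    keys = set()
    for input_keys in runs:
        keys.update(transform_output_key(output_prefix, k) for k in input_keys)
    return sorted(keys)
```

With the same input name on both runs there is still one output object; with Spark-style part files whose names change per run, the old outputs stay behind unless something deletes them first.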

@matiassciencenow

Same issue here. It would be ideal to be able to overwrite previous results from batch inferences instead of appending them, and the same feature for processing jobs.

@melaniemoy

melaniemoy commented Jun 25, 2021

Throwing in another vote for this functionality. We had to modify our Airflow task to clean the directory before starting the prediction task, but it'd be nicer to be able to use .mode("overwrite") instead.
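
For anyone looking for the same workaround, here is a minimal sketch of clearing an output prefix before starting the transform job. It is not the actual Airflow task mentioned above. It assumes a boto3 S3 client is passed in and uses the `list_objects_v2` paginator together with `delete_objects`; `delete_prefix` is a hypothetical helper name, and the bucket and prefix in the usage note are placeholders.

```python
from typing import Any


def delete_prefix(s3_client: Any, bucket: str, prefix: str) -> int:
    """Delete every object under `prefix` in `bucket`; return how many were deleted.

    `s3_client` is expected to behave like boto3.client("s3").
    End the prefix with "/" so that sibling prefixes such as "output-old/"
    are not matched by accident.
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    deleted = 0
    # Each page holds at most 1000 keys, which is also the per-call
    # limit of delete_objects, so one delete call per page is enough.
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            s3_client.delete_objects(Bucket=bucket, Delete={"Objects": keys})
            deleted += len(keys)
    return deleted
```

Usage would be something like `delete_prefix(boto3.client("s3"), "my-bucket", "batch-output/")` right before calling `transformer.transform(...)`. On a versioned bucket this only adds delete markers, so older versions would still exist.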
