S3 SinglepartWriter writes on exception when garbage collected #819

Open
donsokolone opened this issue Apr 20, 2024 · 2 comments · May be fixed by #820

donsokolone commented Apr 20, 2024

Problem description

When an unhandled exception is raised in the context of a SinglepartWriter, a side effect occurs once the writer is garbage-collected: an unwanted partial file is written to S3.

2024-04-20T06:02:19.817140Z [debug    ] Parsed JSON record             aws_request_id=00000000-0000-0000-0000-000000000000 instance_id=cf450c4a-21b7-452b-ad79-291cb87b11ab records_count=24 target_uri=s3://vf-localstack-nora-pii-data-retention/anonymized/vf-localstack-nora-pii-data-retention/data/vf-da-prod-nora-cdp-blueconic-dck-consumer-profiles-1-2023-03-25-03-46-20-0482e5d3-5733-4ed6-b836-a6bdd2401a2d.gz trace_id=00000000-0000-0000-0000-000000000000 uri=s3://vf-localstack-nora-pii-data-retention/data/vf-da-prod-nora-cdp-blueconic-dck-consumer-profiles-1-2023-03-25-03-46-20-0482e5d3-5733-4ed6-b836-a6bdd2401a2d.gz
Traceback (most recent call last):
  File "/Users/tsokolowski/Dev/code-werkz/vfc/vfn/gdf_pii_data_retention/src/lambda/anonymizer-s3/local.py", line 43, in <module>
    output = handler(payload, SimpleNamespace(aws_request_id=trace_id))
  File "/Users/tsokolowski/Dev/code-werkz/vfc/vfn/gdf_pii_data_retention/src/lambda/anonymizer-s3/src/anonymizer_s3/app.py", line 31, in handler
    anonymize(settings, di_container)(inbound_payload)
  File "/Users/tsokolowski/Dev/code-werkz/vfc/vfn/gdf_pii_data_retention/src/lib/piilib/src/piilib/s3/anonymizer/anonymize.py", line 127, in _
    _, lookups_hits = dispatch(task)
  File "/Users/tsokolowski/.pyenv/versions/3.9.18/lib/python3.9/functools.py", line 888, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/Users/tsokolowski/Dev/code-werkz/vfc/vfn/gdf_pii_data_retention/src/lib/piilib/src/piilib/s3/anonymizer/files/json_file.py", line 247, in _
    for raw_record_in, record_delimiter in json_parse(fin):
  File "/Users/tsokolowski/Dev/code-werkz/vfc/vfn/gdf_pii_data_retention/src/lib/piilib/src/piilib/s3/files/json_file.py", line 71, in json_parse
    buff = io.StringIO(old_buff.read())
KeyboardInterrupt
2024-04-20T06:02:20.791680Z [debug    ] smart_open.s3.SinglepartWriter('vf-localstack-nora-pii-data-retention', 'anonymized/vf-localstack-nora-pii-data-retention/data/vf-da-prod-nora-cdp-blueconic-dck-consumer-profiles-1-2023-03-25-03-46-20-0482e5d3-5733-4ed6-b836-a6bdd2401a2d.gz'): direct upload finished [smart_open.s3] target_uri=s3://vf-localstack-nora-pii-data-retention/anonymized/vf-localstack-nora-pii-data-retention/data/vf-da-prod-nora-cdp-blueconic-dck-consumer-profiles-1-2023-03-25-03-46-20-0482e5d3-5733-4ed6-b836-a6bdd2401a2d.gz uri=s3://vf-localstack-nora-pii-data-retention/data/vf-da-prod-nora-cdp-blueconic-dck-consumer-profiles-1-2023-03-25-03-46-20-0482e5d3-5733-4ed6-b836-a6bdd2401a2d.gz

The reason for this behaviour is that SinglepartWriter inherits from io.BufferedIOBase, whose __del__() finalizer invokes close().
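
The mechanism can be illustrated without S3. The sketch below is a toy stand-in, not smart_open code: ToyWriter, uploads and interrupted_job are hypothetical names, and the "upload" is just an append to a list. It assumes only that the writer subclasses io.BufferedIOBase and flushes its buffer in close(), as described above.

```python
import gc
import io

uploads = []  # hypothetical stand-in for S3: records what would be uploaded


class ToyWriter(io.BufferedIOBase):
    """Toy single-part writer; NOT the smart_open implementation."""

    def __init__(self):
        self._buf = io.BytesIO()

    def writable(self):
        return True

    def write(self, b):
        return self._buf.write(b)

    def close(self):
        if self._buf is None:
            return  # already closed
        # A real writer would upload here; the toy only records the payload.
        uploads.append(self._buf.getvalue())
        self._buf = None


def interrupted_job():
    writer = ToyWriter()
    writer.write(b"partial")
    # The job fails before close() is ever called explicitly.
    raise RuntimeError("simulated failure")


try:
    interrupted_job()
except RuntimeError:
    pass

gc.collect()    # make sure the abandoned writer has been finalized
print(uploads)  # the partial payload was "uploaded" by the finalizer
```

Nothing in the toy calls close() explicitly; the only caller is the finalizer inherited from the io base class.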

Steps/code to reproduce the problem

  • set smart_open logger level to DEBUG
  • raise an unhandled exception in context of SinglepartWriter
  • wait for the writer to be garbage collected
  • observe smart_open.s3.SinglepartWriter log records indicating S3 write was performed

Versions

macOS-14.2.1-x86_64-i386-64bit
Python 3.9.18 (main, Nov 30 2023, 12:53:32)
[Clang 15.0.0 (clang-1500.0.40.1)]
smart_open 7.0.4

ddelange (Contributor) commented

hi @donsokolone 👋

How about setting self._buf = None in terminate()? Then close() is a no-op by the time the SinglepartWriter is garbage collected, analogous to MultipartWriter.
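
A minimal sketch of that suggestion, again using a toy class rather than smart_open itself. PatchedToyWriter, uploads and failing_job are hypothetical names, and the __exit__ logic (terminate on exception, close otherwise) is an assumption made for this illustration.

```python
import gc
import io

uploads = []  # hypothetical stand-in for S3


class PatchedToyWriter(io.BufferedIOBase):
    """Toy writer with the proposed change; NOT the smart_open implementation."""

    def __init__(self):
        self._buf = io.BytesIO()

    def writable(self):
        return True

    def write(self, b):
        return self._buf.write(b)

    def terminate(self):
        # Proposed change: drop the buffer so a later close() has nothing to send.
        self._buf = None

    def close(self):
        if self._buf is None:
            return  # closed or terminated: no-op, even when called by the finalizer
        uploads.append(self._buf.getvalue())
        self._buf = None

    def __exit__(self, exc_type, exc_val, exc_tb):
        # Assumed for this sketch: abort on error, upload on success.
        if exc_type is not None:
            self.terminate()
        else:
            self.close()


def failing_job():
    with PatchedToyWriter() as writer:
        writer.write(b"partial")
        raise RuntimeError("simulated failure")


try:
    failing_job()
except RuntimeError:
    pass
gc.collect()  # the finalizer still calls close(), which now does nothing

with PatchedToyWriter() as ok_writer:
    ok_writer.write(b"complete")

print(uploads)  # only the successfully completed write is recorded
```

The guard in close() is what makes the finalizer harmless: once terminate() has cleared the buffer, there is nothing left to upload.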

cc @mpenkov

donsokolone (Author) commented

@ddelange This is exactly what the fix should be, as I mentioned in #763. I will open a PR in a few moments.

donsokolone added a commit to donsokolone/smart_open that referenced this issue Apr 20, 2024
donsokolone linked a pull request Apr 20, 2024 that will close this issue