
Improve handling of large file transfers #966

Closed
petersilva opened this issue Mar 11, 2024 · 18 comments · Fixed by #1067
Assignees
Labels
bug (Something isn't working) · Priority 3 - Important (worrisome impact... should study carefully) · ReliabilityRecovery (improve behaviour in failure situations.)

Comments

@petersilva
Contributor

When downloading large files that are slow to transfer, sanity can kill the transfer in progress.

This could be addressed by having write_chunk in transfer/__init__.py emit occasional messages, to convince sanity that the log is not dead.

@petersilva petersilva added bug Something isn't working enhancement New feature or request labels Mar 11, 2024
@petersilva
Contributor Author

petersilva commented Mar 11, 2024

write_chunk_init -- start a timer...
write_chunk -- check the timer, emit a message if it has been too long...
write_chunk_end -- stop the timer.
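
A minimal sketch of that flow, assuming hypothetical method and attribute names rather than the exact sarracenia API:

```python
import logging
import time

logger = logging.getLogger(__name__)

class ChunkProgress:
    """Hypothetical sketch of the timer idea from the list above."""

    def __init__(self, interval=60):
        self.interval = interval      # seconds between "still alive" messages
        self.last_report = None

    def write_chunk_init(self):
        self.last_report = time.time()            # start a timer

    def write_chunk(self, chunk: bytes):
        now = time.time()
        if self.last_report and now - self.last_report >= self.interval:
            # check the timer, emit a message if it has been too long
            logger.info("transfer still in progress, wrote another %d bytes", len(chunk))
            self.last_report = now

    def write_chunk_end(self):
        self.last_report = None                   # stop the timer
```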

@petersilva
Contributor Author

perhaps one message per minute?

@andreleblanc11
Member

andreleblanc11 commented Mar 11, 2024

That seems like a reasonable interval to me

@reidsunderland
Member

It would be awesome if it could say downloaded x/y bytes (z%) :)

We don't always know the size, so sometimes it would just have to say downloaded x bytes.

If that's hard to implement then don't bother, but it would be really nice for debugging and keeping an eye on slow transfers

@petersilva
Contributor Author

write_chunk is called by get(), and get() accepts length as an argument. When the length is unknown, its value is zero, so if the value is > 0 we could use it to give a progress message. It would not be a big deal.

@petersilva
Contributor Author

get() also has message as an argument, which might have a 'size' value as well.

@petersilva
Contributor Author

pass some size estimate to chunk_init()
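
A hedged sketch of how a size estimate passed down to chunk_init() could drive the message Reid asked for; the function name is illustrative, with zero meaning "size unknown" as noted above:

```python
def progress_message(bytes_so_far: int, expected_size: int = 0) -> str:
    """Build a progress line; an expected_size of 0 means the size is unknown."""
    if expected_size > 0:
        pct = 100.0 * bytes_so_far / expected_size
        return f"downloaded {bytes_so_far}/{expected_size} bytes ({pct:.1f}%)"
    return f"downloaded {bytes_so_far} bytes"
```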

@reidsunderland
Member

One other thing I just remembered: when sr3 is in the middle of a big/slow download, it can't be stopped cleanly. Either the download has to finish or it will get killed by sr3 stop.

If the transfer code is being worked on, maybe we could also look into a way to interrupt and stop a transfer part-way through, then do some clean up, like putting the message into the retry queue?

Let me know if I should create a separate issue.

@andreleblanc11
Member

andreleblanc11 commented Mar 15, 2024

I feel like that should be its own issue, but when I start working on this, I could include a fix for that as well.

We could introduce a self.please_stop check at the chunk-writing intervals and then add the message to the retry queue when it is set.
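
Roughly what that could look like; self.please_stop, the retry queue append, and TransferInterrupted are placeholders, not the real sr3 API:

```python
class TransferInterrupted(Exception):
    """Raised to abandon a transfer part-way through."""

class InterruptibleWriter:
    """Placeholder sketch; not the real sr3 transfer class."""

    def __init__(self, dst, retry_queue, current_message):
        self.dst = dst
        self.please_stop = False            # flipped by the stop signal handler
        self.retry_queue = retry_queue
        self.current_message = current_message

    def write_chunk(self, chunk: bytes):
        if self.please_stop:
            # abandon the partial download and requeue the message for a later attempt
            self.retry_queue.append(self.current_message)
            raise TransferInterrupted("stop requested during transfer")
        self.dst.write(chunk)
```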

@andreleblanc11
Member

We could also just rename the issue so it encompasses both problems under a common title.

@andreleblanc11 andreleblanc11 changed the title sanity killing transfers in progress. Improve handling of large file transfers Mar 19, 2024
@andreleblanc11
Member

I've been looking at how FTP's get method works, and it looks like it does a full download of a file in one shot (at least when it's not binary). There isn't really any logic about writing out the file in chunks and looping, from what I've been seeing. That would mean that using a timer would not work for preventing sanity from killing instances on big downloads.

@andreleblanc11
Member

@petersilva proposed having some kind of check inside sanity that verifies whether the transfer is completed or not.

@petersilva
Contributor Author

So the stuff in the Transfer class would work for the built-in transfers... so it still needs to be done. For FTP, the thing I was hinting at was to use existing metrics, or generate a few additional ones, then have sr3 sanity base its decision to kill on those metrics.

So... the metrics are generated once per batch. Say the batch is 25; then a cumulation of statistics is written to the metrics file every 25 files.
It's a JSON tuple written out; a subset from a sender might look like this:

...  "transferTxBytes": 665952, "transferRxFiles": 0, "transferT
xFiles": 321, "msgRate": 4.79695708533305,  ...

The sr.py program already loads current metrics for each component (c), configuration (cfg) and instance (i) as part of startup,
so sanity should have self.states[c][cfg]['instance_metrics'][i], which should be a dict with the existing metrics as fields.

example of what might work...

So... the msgRate indicates how many files per second are being transferred, and you might be able to use that to say that sanity should wait, say, 10 times longer than it takes to transfer a file on average.
So 10*(1/4.8) ≈ 2.08 seconds would be long enough to complain, given that almost five files should be arriving per second.

If a sender has a msgRate of 0.004, then 10*(1/0.004) == 2500 seconds,
or about 40 minutes before you complain...

msgRate would work on both subscribers and senders, but it might be thrown off by rejections... this is just the general idea...

sr3 sanity could walk the instance metrics... and get a better idea of what to kill from that.
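
A sketch of how sanity might turn msgRate into a per-instance patience threshold; the metrics layout follows the comment above, but the fallback handling and function name are assumptions:

```python
def quiet_threshold(instance_metrics: dict, factor: float = 10.0,
                    default_seconds: float = 600.0) -> float:
    """Seconds of silence to tolerate before sanity considers an instance stuck."""
    rate = instance_metrics.get('msgRate', 0.0)
    if rate <= 0.0:
        return default_seconds          # no rate information: fall back to a fixed limit
    return factor * (1.0 / rate)        # msgRate 4.8 -> ~2 s, msgRate 0.004 -> 2500 s
```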

@reidsunderland
Member

Another similar idea:

sr3 could log something like starting to download /path/to/file/on_filesystem just before starting the download. We could also include the expected file size, if it's known from the message. The "path to current file in progress" could also be stored in the metrics.

And sr3 sanity could check that file and see if it's growing.

I'm not sure if this is more or less complicated than calculating an estimated download rate and estimating the expected download time based off that.
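
A rough sketch of that check, assuming the path of the file currently in progress were published somewhere sanity could read it (hypothetical; not something sr3 does today):

```python
import os

def file_is_growing(path: str, last_seen_size: int) -> tuple:
    """Compare the current size of the in-progress file with the size seen on the last pass."""
    try:
        size = os.stat(path).st_size
    except OSError:
        return False, last_seen_size     # file missing or unreadable: cannot confirm progress
    return size > last_seen_size, size
```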

@petersilva
Contributor Author

petersilva commented Mar 21, 2024

Writing a file per file gets painful if you're writing hundreds of files per second... The metrics reduce that to one file per batch. You could still get the same info... there are other metrics (transferBytes) one could use to get a more predictive metric, but I think maybe a test: if writing fewer than one file per second, then the one-file-per-file thing makes sense.

Again, you need a metric (but this time built into the flow) to decide whether to write the file or not. It could include how long it thinks the transfer will take (combining the internal current rate with the size), or even just plainly estimate a completion time... or a "kill me if I haven't done anything by this time"... removing all responsibility from sanity to calculate what is "too long".
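
One way to express the "kill me if I haven't done anything by this time" idea as a value written by the flow itself; the function name, safety factor, and floor are illustrative, not anything sr3 publishes today:

```python
import time

def kill_after(size_bytes: int, bytes_per_second: float,
               slack: float = 10.0, floor_seconds: float = 300.0) -> float:
    """Epoch time after which sanity may assume the transfer is dead."""
    if size_bytes > 0 and bytes_per_second > 0:
        estimate = size_bytes / bytes_per_second      # combine current rate with size
    else:
        estimate = floor_seconds                      # unknown size or rate: use the floor
    return time.time() + max(floor_seconds, slack * estimate)
```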

@petersilva petersilva added NextRelease Feature Targeted for Next Release and removed NextRelease Feature Targeted for Next Release labels Apr 14, 2024
@andreleblanc11 andreleblanc11 added the Priority 3 - Important worrisome impact... should study carefully label Apr 15, 2024
@petersilva petersilva added ReliabilityRecovery improve behaviour in failure situations. and removed enhancement New feature or request labels Apr 15, 2024
@petersilva
Contributor Author

Weird thing I started noticing... in the dynamic load test, running on my laptop (a new thing...), the fan turns on and the ordinarily trivial transfers via localhost ftp can crash:

2024-05-23 18:42:46,940 [INFO] sarracenia.flowcb.log after_post posted to a file with baseUrl: ftp://anonymous@localhost:2121/ relPath: home/peter/sarra_devdocroot/posted_by_srpost_test2/20240523/WXO-DD/bulletins/alphanumeric/20240523/SR/KWAL/22/SRWA20_KWAL_232242___19425
Traceback (most recent call last):
  File "/home/peter/sr3/sarracenia/instance.py", line 248, in <module>
    i.start()
  File "/home/peter/sr3/sarracenia/instance.py", line 239, in start
    self.running_instance.run()
  File "/home/peter/sr3/sarracenia/flow/__init__.py", line 572, in run
    time.sleep(increment)
  File "/home/peter/sr3/sarracenia/transfer/__init__.py", line 61, in alarm_raise
    raise TimeoutException("signal alarm timed out")
sarracenia.transfer.TimeoutException: signal alarm timed out
2024-05-23 18:50:49,499 [INFO] sarracenia.flow loadCallbacks flowCallback plugins to load: ['sarracenia.flowcb.gather.message.Message', 'sarracenia.flowcb.retry.Retry', 'sarracenia.flowcb.housekeeping.resources.Resources', 'sarracenia.flowcb.v2wrapper.V2Wrapper', 'log']
2024-05-23 18:50:49,502 [ERROR] sarracenia.config config_path script not found file_total_save

This is odd.

@petersilva
Contributor Author

the transfer is lasting 8 seconds, and the timer is going off.

@petersilva
Contributor Author

I read over the logic in FTP and in Transfer... and it's definitely not "in one shot" as described above. We don't use the get verb; we use the retr routines, which transfer in self.o.bufsize chunks via the write_chunk() routine, and there is even an on_data() routine available for subclasses to do odd transformations on the data in flight.

Anyways, I added a routine logProgress() that will print a message once in a while during long transfers (for FTP). If you're not using binary accelerators, then SFTP uses the same routines in the Transfer (parent) class, so logProgress() should work as-is for sftp and http also.

I think it does everything we need. There was also an instance-crashing bug that occurred because the more focused try/excepts no longer catch errors where the upstream file does not exist, causing ftp subscribers to crash when dealing with that case. The actual crash is because an alarm timer is set but never cleared when an error occurs. The fix is to cancel the timer regardless of whether the operation succeeds or fails.
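
The shape of that alarm fix, hedged: the real code uses sarracenia's own alarm helpers, so signal.alarm here only illustrates the try/finally pattern of clearing the timer on every exit path.

```python
import signal

def guarded_fetch(fetch, timeout_seconds: int):
    """Run a fetch under an alarm, clearing the alarm whether it succeeds or raises."""
    signal.alarm(timeout_seconds)
    try:
        return fetch()
    finally:
        signal.alarm(0)    # cancel the pending alarm on success *and* on error
```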
