
Improve handling of large file transfers #966

Closed
petersilva opened this issue Mar 11, 2024 · 18 comments · Fixed by #1067
Assignees
Labels
bug (Something isn't working) · Priority 3 - Important (worrisome impact... should study carefully) · ReliabilityRecovery (improve behaviour in failure situations.)

Comments

@petersilva
Contributor

When downloading large files that are slow to transfer, sanity can kill the transfer in progress.

This could be addressed by having write_chunk in transfer/__init__.py emit occasional messages, to convince sanity that the log is not dead.

@petersilva petersilva added bug Something isn't working enhancement New feature or request labels Mar 11, 2024
@petersilva
Contributor Author

petersilva commented Mar 11, 2024

write_chunk_init -- start a timer...
write_chunk -- check the timer, emit a message if it has been too long...
write_chunk_end -- stop the timer.
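
A minimal sketch of that flow, assuming hypothetical method and attribute names rather than the exact sarracenia API:

```python
import logging
import time

logger = logging.getLogger(__name__)

class ChunkProgress:
    """Hypothetical sketch of the timer idea from the list above."""

    def __init__(self, interval=60):
        self.interval = interval      # seconds between "still alive" messages
        self.last_report = None

    def write_chunk_init(self):
        self.last_report = time.time()            # start a timer

    def write_chunk(self, chunk: bytes):
        now = time.time()
        if self.last_report and now - self.last_report >= self.interval:
            # check the timer, emit a message if it has been too long
            logger.info("transfer still in progress, wrote another %d bytes", len(chunk))
            self.last_report = now

    def write_chunk_end(self):
        self.last_report = None                   # stop the timer
```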

@petersilva
Contributor Author

perhaps one message per minute?

@andreleblanc11
Member

andreleblanc11 commented Mar 11, 2024

That seems like a reasonable interval to me

@reidsunderland
Member

It would be awesome if it could say downloaded x/y bytes (z%) :)

We don't always know the size, so sometimes it would just have to say downloaded x bytes.

If that's hard to implement then don't bother, but it would be really nice for debugging and keeping an eye on slow transfers

@petersilva
Contributor Author

write_chunk is called by get(), and get() accepts length as an argument. When the length is unknown, its value is zero, so if the value is > 0 we could use it to give a progress message. It would not be a big deal.

@petersilva
Contributor Author

get() also has message as an argument, which might have a 'size' value as well.

@petersilva
Contributor Author

pass some size estimate to chunk_init()
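
A hedged sketch of how a size estimate passed down to chunk_init() could drive the message Reid asked for; the function name is illustrative, with zero meaning "size unknown" as noted above:

```python
def progress_message(bytes_so_far: int, expected_size: int = 0) -> str:
    """Build a progress line; an expected_size of 0 means the size is unknown."""
    if expected_size > 0:
        pct = 100.0 * bytes_so_far / expected_size
        return f"downloaded {bytes_so_far}/{expected_size} bytes ({pct:.1f}%)"
    return f"downloaded {bytes_so_far} bytes"
```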

@reidsunderland
Member

One other thing I just remembered: when sr3 is in the middle of a big/slow download, it can't be stopped cleanly. Either the download has to finish or it will get killed by sr3 stop.

If the transfer code is being worked on, maybe we could also look into a way to interrupt and stop a transfer part-way through, then do some clean up, like putting the message into the retry queue?

Let me know if I should create a separate issue.

@andreleblanc11
Member

andreleblanc11 commented Mar 15, 2024

I feel like that should be its own issue, but when I start working on this, I could include a fix for that as well.

We could introduce a self.please_stop check at the chunk-writing intervals and then add the message to the retry queue when it is set.
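
Roughly what that could look like; self.please_stop, the retry queue append, and TransferInterrupted are placeholders, not the real sr3 API:

```python
class TransferInterrupted(Exception):
    """Raised to abandon a transfer part-way through."""

class InterruptibleWriter:
    """Placeholder sketch; not the real sr3 transfer class."""

    def __init__(self, dst, retry_queue, current_message):
        self.dst = dst
        self.please_stop = False            # flipped by the stop signal handler
        self.retry_queue = retry_queue
        self.current_message = current_message

    def write_chunk(self, chunk: bytes):
        if self.please_stop:
            # abandon the partial download and requeue the message for a later attempt
            self.retry_queue.append(self.current_message)
            raise TransferInterrupted("stop requested during transfer")
        self.dst.write(chunk)
```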

@andreleblanc11
Member

We could also just rename the issue so it encompasses both problems under a common title.

@andreleblanc11 andreleblanc11 changed the title sanity killing transfers in progress. Improve handling of large file transfers Mar 19, 2024
@andreleblanc11
Member

I've been looking at how FTP's get method works, and it looks like it does a full download of a file in one shot (at least when it's not binary). There isn't really any logic about writing out the file in chunks and looping, from what I've been seeing. That would mean that using a timer would not work for preventing sanity from killing instances on big downloads.

@andreleblanc11
Member

@petersilva proposed having some kind of check inside sanity that verifies whether the transfer is completed or not.

@petersilva
Contributor Author

So the stuff in the Transfer class would work for the built-in transfers... so it still needs to be done. For FTP, the thing I was hinting at was to use existing metrics, or generate a few additional ones, then have sr3 sanity base its decision to kill on those metrics.

So... the metrics are generated once per batch. Say the batch is 25; then a cumulation of statistics is written to the metrics file every 25 files.
It's a JSON tuple written out; a subset from a sender might look like this:

...  "transferTxBytes": 665952, "transferRxFiles": 0, "transferT
xFiles": 321, "msgRate": 4.79695708533305,  ...

The sr.py program already loads current metrics for each component (c), configuration (cfg) and instance (i) as part of startup,
so sanity should have self.states[c][cfg]['instance_metrics'][i], which should be a dict with the existing metrics as fields.

example of what might work...

So... the msgRate indicates how many files per second are being transferred, and you might be able to use that to say that sanity should wait, say, 10 times longer than it takes to transfer a file on average.
So 10*(1/4.8) ≈ 2.08 seconds would be long enough to complain, given that almost five files should be arriving per second.

If a sender has a msgRate of 0.004, then 10*(1/0.004) == 2500 seconds,
or about 40 minutes before you complain...

msgRate would work on both subscribers and senders, but it might be thrown off by rejections... this is just the general idea...

sr3 sanity could walk the instance metrics... and get a better idea of what to kill from that.
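
A sketch of how sanity might turn msgRate into a per-instance patience threshold; the metrics layout follows the comment above, but the fallback handling and function name are assumptions:

```python
def quiet_threshold(instance_metrics: dict, factor: float = 10.0,
                    default_seconds: float = 600.0) -> float:
    """Seconds of silence to tolerate before sanity considers an instance stuck."""
    rate = instance_metrics.get('msgRate', 0.0)
    if rate <= 0.0:
        return default_seconds          # no rate information: fall back to a fixed limit
    return factor * (1.0 / rate)        # msgRate 4.8 -> ~2 s, msgRate 0.004 -> 2500 s
```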

@reidsunderland
Member

Another similar idea:

sr3 could log something like starting to download /path/to/file/on_filesystem just before starting the download. We could also include the expected file size, if it's known from the message. The "path to current file in progress" could also be stored in the metrics.

And sr3 sanity could check that file and see if it's growing.

I'm not sure if this is more or less complicated than calculating an estimated download rate and estimating the expected download time based off that.
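
A rough sketch of that check, assuming the path of the file currently in progress were published somewhere sanity could read it (hypothetical; not something sr3 does today):

```python
import os

def file_is_growing(path: str, last_seen_size: int) -> tuple:
    """Compare the current size of the in-progress file with the size seen on the last pass."""
    try:
        size = os.stat(path).st_size
    except OSError:
        return False, last_seen_size     # file missing or unreadable: cannot confirm progress
    return size > last_seen_size, size
```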

@petersilva
Contributor Author

petersilva commented Mar 21, 2024

Writing a file per file gets painful if you're writing hundreds of files per second... The metrics reduce that to one file per batch. You could still get the same info... there are other metrics (transferBytes) one could use to get a more predictive metric, but I think maybe a test: if writing fewer than one file per second, then the one-file-per-file thing makes sense.

Again, you need a metric (but this time built into the flow) to decide whether to write the file or not. It could include how long it thinks the transfer will take (combining the internal current rate with the size), or even just plainly estimate a completion time... or a "kill me if I haven't done anything by this time"... removing all responsibility from sanity to calculate what is "too long".
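
One way to express the "kill me if I haven't done anything by this time" idea as a value written by the flow itself; the function name, safety factor, and floor are illustrative, not anything sr3 publishes today:

```python
import time

def kill_after(size_bytes: int, bytes_per_second: float,
               slack: float = 10.0, floor_seconds: float = 300.0) -> float:
    """Epoch time after which sanity may assume the transfer is dead."""
    if size_bytes > 0 and bytes_per_second > 0:
        estimate = size_bytes / bytes_per_second      # combine current rate with size
    else:
        estimate = floor_seconds                      # unknown size or rate: use the floor
    return time.time() + max(floor_seconds, slack * estimate)
```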

@petersilva petersilva added NextRelease Feature Targeted for Next Release and removed NextRelease Feature Targeted for Next Release labels Apr 14, 2024
@andreleblanc11 andreleblanc11 added the Priority 3 - Important worrisome impact... should study carefully label Apr 15, 2024
@petersilva petersilva added ReliabilityRecovery improve behaviour in failure situations. and removed enhancement New feature or request labels Apr 15, 2024
@petersilva
Contributor Author

Weird thing I started noticing... in the dynamic load test, running on my laptop (a new thing...), the fan turns on and the ordinarily trivial transfers via localhost ftp can crash:

2024-05-23 18:42:46,940 [INFO] sarracenia.flowcb.log after_post posted to a file with baseUrl: ftp://anonymous@localhost:2121/ relPath: home/peter/sarra_devdocroot/posted_by_srpost_test2/20240523/WXO-DD/bulletins/alphanumeric/20240523/SR/KWAL/22/SRWA20_KWAL_232242___19425
Traceback (most recent call last):
  File "/home/peter/sr3/sarracenia/instance.py", line 248, in <module>
    i.start()
  File "/home/peter/sr3/sarracenia/instance.py", line 239, in start
    self.running_instance.run()
  File "/home/peter/sr3/sarracenia/flow/__init__.py", line 572, in run
    time.sleep(increment)
  File "/home/peter/sr3/sarracenia/transfer/__init__.py", line 61, in alarm_raise
    raise TimeoutException("signal alarm timed out")
sarracenia.transfer.TimeoutException: signal alarm timed out
2024-05-23 18:50:49,499 [INFO] sarracenia.flow loadCallbacks flowCallback plugins to load: ['sarracenia.flowcb.gather.message.Message', 'sarracenia.flowcb.retry.Retry', 'sarracenia.flowcb.housekeeping.resources.Resources', 'sarracenia.flowcb.v2wrapper.V2Wrapper', 'log']
2024-05-23 18:50:49,502 [ERROR] sarracenia.config config_path script not found file_total_save

This is odd.

@petersilva
Contributor Author

the transfer is lasting 8 seconds, and the timer is going off.

@petersilva
Contributor Author

I read over the logic in FTP and in Transfer... and it's definitely not "in one shot" as described above. We don't use the get verb; we use the retr routines, which transfer in self.o.bufsize chunks via the write_chunk() routine, and there is even an on_data() routine available for subclasses to do odd transformations on the data in flight.

Anyways, I added a routine logProgress() that will print a message once in a while during long transfers (for FTP). If you're not using binary accelerators, then SFTP uses the same routines in the Transfer (parent) class, so logProgress() should work as-is for sftp and http also.

I think it does everything we need. There was also an instance-crashing bug that occurred because the more focused try/excepts no longer catch errors where the upstream file does not exist, causing ftp subscribers to crash when dealing with that case. The actual crash is because an alarm timer is set but never cleared when an error occurs. The fix is to cancel the timer regardless of whether the operation succeeds or fails.
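
The shape of that alarm fix, hedged: the real code uses sarracenia's own alarm helpers, so signal.alarm here only illustrates the try/finally pattern of clearing the timer on every exit path.

```python
import signal

def guarded_fetch(fetch, timeout_seconds: int):
    """Run a fetch under an alarm, clearing the alarm whether it succeeds or raises."""
    signal.alarm(timeout_seconds)
    try:
        return fetch()
    finally:
        signal.alarm(0)    # cancel the pending alarm on success *and* on error
```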
