Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

setTimeLimit(): varying results in multicore with failing job #665

Open
r2evans opened this issue Feb 8, 2023 · 7 comments
Open

setTimeLimit(): varying results in multicore with failing job #665

r2evans opened this issue Feb 8, 2023 · 7 comments
Labels
Milestone

Comments

@r2evans
Copy link

r2evans commented Feb 8, 2023

In a multicore plan, if the expression fails (e.g., due to timeout here) then the return from value(.) might be either the error itself (expected) or one of two other failures ("Unexpected result" or "major/minor"). I do not see a pattern in how it varies. Further, when "Unexpected", the worker is not freed.

Other than the preamble, the expression is unchanged, and I tend to reset workers between (but I tried without resetting, does not appear to influence the results).

library(future)
plan(multicore, workers = 3)

nbrOfFreeWorkers()
# [1] 3
thisfut <- future({
  setTimeLimit(elapsed = 2, transient = TRUE)
  Sys.sleep(3)
  1L
})
value(thisfut)
# Error: Unexpected result (of class ‘NULL’ != ‘FutureResult’) retrieved for MulticoreFuture future (label = ‘<none>’, expression = ‘{; setTimeLimit(elapsed = 2, transient = TRUE); Sys.sleep(3); 1L; }’):
nbrOfFreeWorkers()
# [1] 2

nbrOfFreeWorkers()
# [1] 3
thisfut <- future({
  setTimeLimit(elapsed = 2, transient = TRUE)
  Sys.sleep(3)
  1L
})
value(thisfut)
# Error in future::FutureResult(value = ...future.value$value, visible = ...future.value$visible,  :
#   reached elapsed time limit
nbrOfFreeWorkers()
# [1] 3

nbrOfFreeWorkers()
# [1] 3
thisfut <- future({
  setTimeLimit(elapsed = 2, transient = TRUE)
  Sys.sleep(3)
  1L
})
value(thisfut)
# Error in c("major", "minor") %in% names(x) : reached elapsed time limit
nbrOfFreeWorkers()
# [1] 3

It takes several tries to get all three returns. This is in a fresh R instance, nothing else loaded or done between calls.

The first ("Unexpected error") does not automatically free the worker, I must explicitly resetWorkers(plan()) to get it back.

I know there are cautions against using R.utils::withTimeout when coordinating with other processes, which I will infer to be similar-enough to setTimeLimit but not applicable since the timeout is completely within the other process.

sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] future_1.31.0

loaded via a namespace (and not attached):
[1] compiler_4.2.2    parallelly_1.34.0 tools_4.2.2       parallel_4.2.2
[5] listenv_0.9.0     codetools_0.2-18  digest_0.6.31     globals_0.16.2
@r2evans r2evans added the bug label Feb 8, 2023
@r2evans
Copy link
Author

r2evans commented Feb 8, 2023

FYI, this was started initially in rstudio/promises#86 in a related effort; the double-reporting issue there I think has been resolved in dev, but I think the variable failing is future-specific.

@HenrikBengtsson
Copy link
Owner

HenrikBengtsson commented Feb 8, 2023

Thanks for this. Some quick initial comments. I think I understand what's going on. The time-out limit is not reset on the worker after the future expression has been finished. This means it'll also apply to the code wrapping up the results and sending it back to the parent R process.

Depending on exactly when this timeout kicks in, you'll get different types of errors. Your different examples show this and makes sense to me. For example, the "unexpected result" error message is from an internal sanity check that makes sure that the proper data structure has been received by the parent R process, but in your example it received nothing, indicating that the connection was interrupted. That error is actually of class UnexpectedFutureResultError, which is a special instance of class FutureError. To distinguish them from regular run-time evaluation errors, the FutureError class indicates an error in the "orchestration" of futures. When those happen, they can be quite disruptive, and depending on parallel backend, you might have to reset the backend, or even restart R.

I think that if you add setTimeLimit(cpu = Inf, elapsed = Inf, transient = FALSE) to the end of your future expression, e.g.

thisfut <- future({
  on.exit(setTimeLimit(cpu = Inf, elapsed = Inf, transient = FALSE))  ## TESTING ONLY!
  setTimeLimit(elapsed = 2, transient = TRUE)
  Sys.sleep(3)
  1L
})

then this problem should go away? Is that the case? If so, I'll look into possibly adding a setTimeLimit(cpu = Inf, elapsed = Inf, transient = FALSE) automatically.

The reason why I say TESTING ONLY!, is that I have yet not decided if it is a good idea to use on.exit() in future expressions. It works because of how it's evaluated internally, but I'm not yet ready to promise that behavior, but it's good enough for our testing purposes here.

BTW, Sys.sleep() is very special when it comes to timeouts, and its behavior varies with OS, cf. HenrikBengtsson/Wishlist-for-R#47.

@HenrikBengtsson HenrikBengtsson added this to the Next release milestone Feb 8, 2023
@r2evans
Copy link
Author

r2evans commented Feb 8, 2023

Thank you for the prompt and detailed response!

  1. Sys.sleep() is just a placeholder for longer-running expressions that I need to be bounded, ergo setTimeLimit. Note taken. (And I learned something, thanks!)
  2. Especially with the explicit transient=TRUE, I'm surprised that the on.exit(..) changes any behavior, much less fixes this problem. My interpretation of transient=TRUE is that as soon as the current computation is complete (i.e., everything within the future({...}), it should automatically return to the default Inf limits. Do think think this could be a bug in setTimeLimit, or am I misinterpreting it?
    Edit: on re-reading your comment, the transient= appears to be insufficient due to the extra code you're running for futures-management, I get it, so some of your code is also not being allowed to continue operation. Would it change any behavior to run setTimeLimit(.) and my real code within a local(.) or other further-constricting environment?

I see that this (on.exit(..)) trick is what you do in R.utils::withTimeout, which suggests to me that it's just a "feature" (perhaps not quite "known deficiency") of setTimeLimit.

I understand that you are concerned with imposing this as default behavior. For now I'm going to have to use it in production (so that failed futures don't leak workers, a problem I've been dealing with). Do you think the use of reg.finalizer is any more or less risky than on.exit?

@HenrikBengtsson HenrikBengtsson changed the title varying results in multicore with failing job setTimeLimit(): varying results in multicore with failing job Feb 8, 2023
@HenrikBengtsson
Copy link
Owner

I understand that you are concerned with imposing this as default behavior. For now I'm going to have to use it in production (so that failed futures don't leak workers, a problem I've been dealing with). Do you think the use of reg.finalizer is any more or less risky than on.exit?

I'm only hesitant because I haven't had a deep focus of thinking about what it would entail to allow on.exit() within a future expression. Quite simplified, you can think of the current implementation for evaluating futures to looks something like:

setup_future_code()

result <- tryCatch({
  local(<future expression here>)
}, error = function(ex) {
  ...
})

teardown_future_code()
send_results_back(result)

Since it's evaluated within a local() already, then on.exit() works already now. I doubt that will change, but I do not want to make a full commitment to it right now. An advantage of recommending/requiring an explicit local(), is that you need that in vanilla sequential code, e.g.

res <- local({
  on.exit(...)
  42L
})

You cannot just do:

res <- {
  on.exit(...)
  42L
}

So, for future() analogue to regular evaluation, requiring local() makes more sense, especially so when replacing <- with future's %<-% parallel assignment operator.

Would it change any behavior to run setTimeLimit(.) and my real code within a local(.) or other further-constricting environment?

Yes, using an explicit local() would guarantee that on.exit() will always work, with or without futures. So,

thisfut <- future(local{
  on.exit(setTimeLimit(cpu = Inf, elapsed = Inf, transient = FALSE)) 
  setTimeLimit(elapsed = 2, transient = TRUE)
  Sys.sleep(3)
  1L
}))

would be the most future-proof solution (pun intended), and it explicitly communicates the scope and intent.

That said, I will consider adding something like on.exit(setTimeLimit(cpu = Inf, elapsed = Inf, transient = FALSE)) internally, to protect against hidden timeouts, which might be a left-over from some deep-down dependency. Doing so would be inline with recent attempts to undo R options and environment set, so they do not spill over future (pun 2) futures. To clarify, with the current behavior, you can create a long-term timeout in one future, which will remain if run on a persistent worker (e.g. multisession), and might trigger a timeout, say, in one hour from now.

@r2evans
Copy link
Author

r2evans commented Feb 8, 2023

Actually, I'm having a difficult time meeting my expectation of setTimeLimit even outside of future.

 print({
   setTimeLimit(cpu = Inf, elapsed = Inf, transient = FALSE)
   e <- environment()
   reg.finalizer(e,
                 function(ign) {cat('fin1\n');setTimeLimit(cpu = Inf, elapsed = Inf, transient = FALSE);},
                 onexit = TRUE)
   on.exit({cat('fin2\n');setTimeLimit(cpu = Inf, elapsed = Inf, transient = FALSE);})
   setTimeLimit(cpu = 1, elapsed = 1, transient = TRUE)
   system("sleep 3")
   # Sys.sleep(3)
   message("hey")
   setTimeLimit(cpu = Inf, elapsed = Inf, transient = FALSE)
   1L
 })

Also done with print(local({...})). Also done using Sys.sleep instead of system. Clearly overkill.

Some of the time I get Error: reached elapsed time limit, but usually it happily messages "hey" and returns 1L. I can't seem to find a way to get any expression to reliably fail on a time limit, either in future or not ... off-topic for this issue but I think I'd like to shore-up this issue so that it behaves as expected both inside and outside of a future. Do you have any insight?

My expectation is that this should always: fail, show the error message, not display "hey", and not return 1L. Clearly I'm confused about something :-/

@HenrikBengtsson
Copy link
Owner

Do you have any insight?

I only skimmed your comment here, but my first instinct is that also system() might not be interruptable, or at least have a delayed interruption. Try with for (kk in 1:300) Sys.sleep(0.01) instead and see if that makes a difference.

Also, it doesn't look like that finalizer makes a difference in this example, since it'll only kick in if you delete e, and then it's unpredictable when the GC will clean it up.

@r2evans
Copy link
Author

r2evans commented Feb 9, 2023

Okay, I think it's finally sinking in: neither Sys.sleep nor system in this usage are good tests for setTimeLimit, locally or in a future. Replacing system("sleep 3") and Sys.sleep(3) with something like (function() replicate(300, Sys.sleep(0.1)))() (I was too lazy to define it as a named func) seems to remedy the symptoms. Incidentally, so does processx::run("sleep", "3") (I really don't like system/system2 :-/)

For reg.finalizer, I had assumed that once we extracted the value from a multicore worker, the forked process and therefore the environment would be harvested, so it should have been finalized. Perhaps, as with many things gc-related, it just isn't happening as quickly as I assumed.

I should have listened more closely to your first cautions: on.exit(..) does not seem necessary when using either the looped-Sys.sleep or the processx variant.

Thank you for the dialogue!

@HenrikBengtsson HenrikBengtsson modified the milestones: 1.32.0, Next release Mar 6, 2023
@HenrikBengtsson HenrikBengtsson modified the milestones: 1.33.0, Next release Jul 1, 2023
@HenrikBengtsson HenrikBengtsson modified the milestones: 1.33.1, Next release Dec 21, 2023
@HenrikBengtsson HenrikBengtsson modified the milestones: 1.33.2, Next release Mar 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants