Black Scholes test fails at large sizes #10

Open
calebwin opened this issue Aug 4, 2021 · 0 comments
Labels
banyan-arrays-jl Concerning BanyanArrays.jl banyan-jl Concerning Banyan.jl bug Something isn't working

calebwin commented Aug 4, 2021

There are several possible causes:

  • Running out of memory because GC.gc() calls are not placed strategically (not really an issue any more)
  • Running out of memory because there is not enough free memory to begin with (may not be an issue)
  • Running out of disk space because of EBS limitations or some unknown extra usage (the most common issue)
  • Job occasionally failing, possibly because of printing (almost definitely not an issue)
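
To help rule the memory-related causes in or out, free memory could be logged just before the large computation starts. This is only a sketch: `free_gib` and `free_fraction` are hypothetical helper names, not part of Banyan.jl, and it reports memory for the current node only.

```julia
# Hypothetical helpers (not part of Banyan.jl) for logging free memory
# on the current node before a large computation starts.

# Free memory in GiB, as reported by the operating system.
free_gib() = Sys.free_memory() / 2^30

# Fraction of total physical memory that is currently free.
free_fraction() = Sys.free_memory() / Sys.total_memory()

println("Free memory: ", round(free_gib(), digits=2), " GiB (",
        round(100 * free_fraction(), digits=1), "% of total)")
```

Running this on each worker before the Black Scholes test would show whether the nodes start with less free memory than expected.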

(The next two failures might be caused by open-mpi/ompi#6014, so we may need a newer version of Open MPI.)

  • Job occasionally failing because of:
slurmstepd: error: *** JOB 3737 ON compute-dy-t3large-2 CANCELLED AT 2021-08-03T16:28:28 ***
slurmstepd: error: *** STEP 3737.0 ON compute-dy-t3large-2 CANCELLED AT 2021-08-03T16:28:28 ***

signal (15): Terminated
in expression starting at /home/ec2-user/executor.jl:52
epoll_wait at /lib64/libc.so.6 (unknown line)

signal (15): Terminated
in expression starting at /home/ec2-user/executor.jl:52
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
mca_btl_vader_fbox_read_header at /codebuild/output/src084091651/src/ompi_build/BUILD/openmpi-4.1.0/opal/mca/btl/vader/btl_vader_fbox.h:72 [inlined]
mca_btl_vader_check_fboxes at /codebuild/output/src084091651/src/ompi_build/BUILD/openmpi-4.1.0/opal/mca/btl/vader/btl_vader_fbox.h:195 [inlined]
mca_btl_vader_component_progress at /codebuild/output/src084091651/src/ompi_build/BUILD/openmpi-4.1.0/opal/mca/btl/vader/btl_vader_component.c:765
  • Job failing because of:
srun: error: compute-dy-t32xlarge-1: task 0: Killed
slurmstepd: error: compute-dy-t32xlarge-1 [0] pmixp_client_v2.c:210 [_errhandler] mpi/pmix: ERROR: Error handler invoked: status = -25: Interrupted system call (4)
slurmstepd: error: *** STEP 112.0 ON compute-dy-t32xlarge-1 CANCELLED AT 2021-06-25T13:33:56 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: compute-dy-t32xlarge-1: task 1: Killed
srun: error: compute-dy-t32xlarge-1: tasks 2-7: Killed
  • The scheduler tries to materialize the result of the Black Scholes model to disk unnecessarily. This is likely because these values are not being garbage collected, so their finalizers do not run, on the gc() call in the final compute. We may need to set these variables to nothing first or call gc(true).
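
The workaround suggested in the last point can be sketched in plain Julia. This is an illustration under assumptions, not Banyan.jl's actual code: `Result`, `make_result`, and `finalized` are hypothetical names, and the finalizer here only sets a flag rather than notifying a scheduler. Julia does not guarantee when a finalizer runs, so treat this as a pattern to try rather than a confirmed fix.

```julia
# Hypothetical stand-in for a value whose finalizer tells the scheduler
# that the value is no longer needed. Not Banyan.jl's actual type.
mutable struct Result
    data::Vector{Float64}
end

# Flag set by the finalizer so we can observe whether it ran.
const finalized = Ref(false)

function make_result()
    r = Result(rand(1_000))
    # Register a finalizer; it runs when `r` is garbage collected.
    finalizer(_ -> (finalized[] = true), r)
    return r
end

result = make_result()

# Drop the only reference first, then request a full collection.
# While `result` still points at the object, GC.gc() cannot collect it,
# so its finalizer cannot run.
result = nothing
GC.gc(true)

println("finalizer ran: ", finalized[])
```

If the flag is still false after the collection, some other reference to the value is probably still alive, which would be consistent with the unnecessary materialization described above.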
@calebwin calebwin added bug Something isn't working banyan-jl Concerning Banyan.jl banyan-arrays-jl Concerning BanyanArrays.jl labels Aug 4, 2021