This repository has been archived by the owner on Mar 19, 2024. It is now read-only.

working prototype of wandb #271

Open
wants to merge 4 commits into main

Conversation

pclucas14

This is a working prototype of a wandb logger. I'm filing the PR now so the team can have a look, but there are probably still a few things to iron out first.

@facebook-github-bot added the CLA Signed label on Apr 6, 2021
@prigoyal
Contributor

prigoyal commented Apr 6, 2021

also cc @QuentinDuval and @min-xu-ai :)

@pclucas14
Author

Right now a critical issue is that for some errors, wandb stalls instead of crashing. When testing this, if a run stalls, rerun it with wandb disabled (by running wandb disabled in the command line)

@min-xu-ai
Contributor

Right now a critical issue is that for some errors, wandb stalls instead of crashing. When testing this, if a run stalls, rerun it with wandb disabled (by running wandb disabled in the command line)

This is exciting. Can you share a bit more info on how to test this out? I can check out your branch, and I suppose I need to install some wandb pip packages, etc.? Can you share the commands for folks to test it out and to see how it is used in action?

@pclucas14
Author

pclucas14 commented Apr 6, 2021

Hi,

To install wandb, pip install wandb should do it. Some basic things you should know: there are essentially three wandb modes.

  1. wandb on enables wandb and syncs your runs directly to the cloud, as well as logging files locally
  2. wandb off does not upload anything online; it only saves files locally, which you can upload later
  3. wandb disabled makes all wandb operations no-ops and is equivalent to not using wandb at all

If you don't have a wandb account you should start by creating one. Once that's done, wandb will give you instructions on what to set up on your remote server.
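For reference, here is a rough sketch of how these modes map onto wandb's Python API (this assumes a wandb version whose wandb.init accepts a mode argument; the project and run names below are just placeholders):

import wandb

# mode="online" corresponds to "wandb on", mode="offline" to "wandb off",
# and mode="disabled" turns every wandb call into a no-op ("wandb disabled")
run = wandb.init(project="vissl-sandbox", name="simclr_mini_2gpu", mode="online")

wandb.log({"train/loss": 0.5})  # synced, saved locally, or ignored depending on the mode
run.finish()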

In terms of reproducing the error: on my end, if you do distributed training (num_nodes > 1) and use a very large batch size that leads to an OOM, wandb currently stalls and I'm not sure why. It may also be an issue specific to my setup; I will investigate.

@pclucas14
Author

This is the exact command I am running:

PYTHONPATH=./ python tools/run_distributed_engines.py config=pretrain/simclr/simclr_wandb.yaml config.DISTRIBUTED.NUM_PROC_PER_NODE=2 config.HOOKS.WANDB_SETUP.USE_WANDB=True config.DISTRIBUTED.NUM_NODES=1 config.HOOKS.WANDB_SETUP.EXP_NAME=simclr_mini_2gpu config.DATA.TRAIN.BATCHSIZE_PER_REPLICA=1000

Running it again after wandb disabled crashes with an OOM error, as seen here:

INFO 2021-04-06 11:45:33,467 trainer_main.py: 301: Phase advanced. Rank: 1
INFO 2021-04-06 11:45:33,467 state_update_hooks.py:  98: Starting phase 16 [train]
INFO 2021-04-06 11:45:34,856 trainer_main.py: 301: Phase advanced. Rank: 0
INFO 2021-04-06 11:45:34,857 state_update_hooks.py:  98: Starting phase 16 [train]
Traceback (most recent call last):
  File "tools/run_distributed_engines.py", line 58, in <module>
    hydra_main(overrides=overrides)
  File "tools/run_distributed_engines.py", line 45, in hydra_main
    hook_generator=default_hook_generator,
  File "/private/home/lucaspc/repos/vissl/vissl/utils/distributed_launcher.py", line 145, in launch_distributed
    daemon=False,
  File "/private/home/lucaspc/.conda/envs/vissl2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/private/home/lucaspc/.conda/envs/vissl2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/private/home/lucaspc/.conda/envs/vissl2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

....

    self.padding, self.dilation, self.groups)
RuntimeError: CUDA out of memory. Tried to allocate 1.10 GiB (GPU 1; 15.78 GiB total capacity; 12.60 GiB already allocated; 641.75 MiB free; 13.93 GiB reserved in total by PyTorch)


With wandb on, the run does not crash but rather stalls. This is the last thing shown in the prompt:

INFO 2021-04-06 11:41:48,606 trainer_main.py: 301: Phase advanced. Rank: 1
INFO 2021-04-06 11:41:48,606 state_update_hooks.py:  98: Starting phase 16 [train]
INFO 2021-04-06 11:41:49,963 trainer_main.py: 301: Phase advanced. Rank: 0
INFO 2021-04-06 11:41:49,964 state_update_hooks.py:  98: Starting phase 16 [train]

Contributor

@min-xu-ai left a comment

I don't see how this hook can cause the stall when an OOM should have been raised. Maybe it wasn't really a stall? How about the case without an OOM: does the training go on as normal? Doesn't it get slowed down when wandb is enabled? I'd suggest putting some prints in the hook and the train loop to see where exactly the stall happens, if it indeed stalls.


    wandb_available = True
except ImportError:
    logging.info("Tensorboard is not available")
Contributor

stale msg

Author

yes

Comment on lines +81 to +82
import wandb
from vissl.hooks import SSLWandbHook
Contributor

move these to the top or is it due to circular imports?

Author

I want to only import wandb if the program is calling the hook. This way it will still work if a user has not installed wandb yet.
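As an illustration, the pattern is roughly the following (a sketch only, not the exact code in the PR; the log message is a placeholder):

import logging

try:
    import wandb

    wandb_available = True
except ImportError:
    # users without wandb installed can still run everything else
    wandb_available = False
    logging.info("wandb is not available")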

Contributor

Makes sense to me to do local imports.

vissl/utils/wandb.py (outdated, resolved)
from classy_vision.generic.distributed_util import is_primary
from classy_vision.hooks.classy_hook import ClassyHook

if is_primary():
Contributor

why do this check? Importing on all ranks should be fine too?

Author

The wandb import has some overhead, so I was trying to limit it to the main worker since it's the only one using it. This can be changed.
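Roughly, the intent is this pattern (a sketch only; as comes up later in this thread, is_primary() may not behave as expected before torch.distributed is initialised):

from classy_vision.generic.distributed_util import is_primary

wandb = None
if is_primary():
    # pay the wandb import cost only on the rank that will actually log
    import wandb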

Contributor

I think it makes sense to me to do it on primary_rank only. :)

vissl/hooks/wandb_hook.py (outdated, resolved)
@pclucas14
Author

I don't see how this hook can cause the stall when an OOM should have been raised. Maybe it wasn't really a stall? How about the case without an OOM: does the training go on as normal? Doesn't it get slowed down when wandb is enabled? I'd suggest putting some prints in the hook and the train loop to see where exactly the stall happens, if it indeed stalls.

Ok, I will try this tomorrow and let you know. Note that this is not specific to an OOM error; I saw the same thing with another error on the data side.

@pclucas14
Author

pclucas14 commented Apr 12, 2021

Hi, quick update:

I'm unable to use the current logger when launching jobs with SLURM. If I ask for an interactive session and launch a job, it works fine; however, when launching directly with SLURM I get the following error:

-- Process 3 terminated with the following error:
Traceback (most recent call last):
  File "/private/home/lucaspc/.conda/envs/vissl2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/private/home/lucaspc/repos/vissl/vissl/utils/distributed_launcher.py", line 200, in _distributed_worker
    process_main(cfg, dist_run_id, local_rank=local_rank, node_id=node_id)
  File "/private/home/lucaspc/repos/vissl/vissl/utils/distributed_launcher.py", line 193, in process_main
    hook_generator=hook_generator,
  File "/private/home/lucaspc/repos/vissl/vissl/engines/train.py", line 93, in train_main
    hooks = hook_generator(cfg)
  File "/private/home/lucaspc/repos/vissl/vissl/hooks/__init__.py", line 126, in default_hook_generator
    wandb_hook = get_wandb_hook(cfg)
  File "/private/home/lucaspc/repos/vissl/vissl/utils/wandb.py", line 120, in get_wandb_hook
    name=name
  File "/private/home/lucaspc/.conda/envs/vissl2/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 742, in init
    run = wi.init()
  File "/private/home/lucaspc/.conda/envs/vissl2/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 513, in init
    raise UsageError(error_message)
wandb.errors.UsageError: Error communicating with wandb process

srun: error: learnfair0928: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=39934341.0

If anyone with distributed programming knowledge knows why this is happening, please let me know

Contributor

@prigoyal left a comment

thank you @pclucas14 :)

from classy_vision.generic.distributed_util import is_primary
from classy_vision.hooks.classy_hook import ClassyHook

if is_primary():
Contributor

I think it makes sense to me to do it on primary_rank only. :)


def is_wandb_available():
"""
Check whether wandb is available or not.
Contributor

can we add a link here to wandb documentation that open source users can read/access? :)

Check whether wandb is available or not.

Returns:
wandb_available (bool): based on wandb imports, returns whether tensboarboard
Contributor

nit: tensboarboard -> wandb

Comment on lines +81 to +82
import wandb
from vissl.hooks import SSLWandbHook
Contributor

Makes sense to me to do local imports.

f.write(wandb_id)

name = cfg.HOOKS.WANDB_SETUP.EXP_NAME
if name == "??":
Contributor

nit: let's set name = "" instead of ??


BYTE_TO_MiB = 2 ** 20

class SSLWandbHook(ClassyHook):
Contributor

A question on this: is this hook similar to the Tensorboard hook, with the only difference being that it logs to wandb instead of tensorboard?

Is it possible that we can inherit the TensorboardHook? Or alternatively, does it make sense to extend the TensorboardHook directly to optionally log to wandb as well if the user is using WandB?

@prigoyal
Contributor

Hi, quick update:

I'm unable to use the current logger when launching jobs with SLURM. If I ask for an interactive session and launch a job, it works fine; however, when launching directly with SLURM I get the following error:

-- Process 3 terminated with the following error:
Traceback (most recent call last):
  File "/private/home/lucaspc/.conda/envs/vissl2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/private/home/lucaspc/repos/vissl/vissl/utils/distributed_launcher.py", line 200, in _distributed_worker
    process_main(cfg, dist_run_id, local_rank=local_rank, node_id=node_id)
  File "/private/home/lucaspc/repos/vissl/vissl/utils/distributed_launcher.py", line 193, in process_main
    hook_generator=hook_generator,
  File "/private/home/lucaspc/repos/vissl/vissl/engines/train.py", line 93, in train_main
    hooks = hook_generator(cfg)
  File "/private/home/lucaspc/repos/vissl/vissl/hooks/__init__.py", line 126, in default_hook_generator
    wandb_hook = get_wandb_hook(cfg)
  File "/private/home/lucaspc/repos/vissl/vissl/utils/wandb.py", line 120, in get_wandb_hook
    name=name
  File "/private/home/lucaspc/.conda/envs/vissl2/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 742, in init
    run = wi.init()
  File "/private/home/lucaspc/.conda/envs/vissl2/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 513, in init
    raise UsageError(error_message)
wandb.errors.UsageError: Error communicating with wandb process

srun: error: learnfair0928: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=39934341.0

If anyone with distributed programming knowledge knows why this is happening, please let me know

Regarding this, we need to understand what wandb requires in order to run via SLURM. Are there specific requirements or documentation on configuring wandb when using SLURM?

@pclucas14
Author

I emailed the support team at Weights and Biases; I'll keep you posted.

@pclucas14
Author

After speaking with people from WandB and having them look at the log, it seems that the following line gets called more than once, which is problematic; I need it to be called exactly once. Any tips on how to make that happen? I thought wrapping it in an if is_primary() check would be sufficient, but it seems that is not the case.

@QuentinDuval
Contributor

QuentinDuval commented Apr 20, 2021

the following line

Indeed, this is because torch.distributed is not initialised at this point (I got bitten by it as well). You should use dist_rank = get_machine_local_and_dist_rank()[1].

@prigoyal I think we need to rework this a bit, and either:

  • introduce an is_primary function in VISSL that does the right thing (use torch.distributed if it is initialised, and fall back to get_machine_local_and_dist_rank otherwise)
  • or move the creation of the hook after the initialisation of torch.distributed

What do you both think?

@prigoyal
Contributor

the following line

Indeed, this is because torch.distributed is not initialised at this point (I got bitten by it as well). You should use dist_rank = get_machine_local_and_dist_rank()[1].

@prigoyal I think we need to rework this a bit, and either:

  • introduce an is_primary function in VISSL that does the right thing (use torch.distributed if it is initialised, and fall back to get_machine_local_and_dist_rank otherwise)
  • or move the creation of the hook after the initialisation of torch.distributed

What do you both think?

Indeed. There is already an is_primary() function: https://github.com/facebookresearch/vissl/blob/master/vissl/hooks/tensorboard_hook.py#L7 . Does this help? :)

@pclucas14
Author

the following line

Indeed, this is because torch.distributed is not initialised at this point (I got bitten by it as well). You should use dist_rank = get_machine_local_and_dist_rank()[1].
@prigoyal I think we need to rework this a bit, and either:

  • introduce an is_primary function in VISSL that does the right thing (use torch.distributed if it is initialised, and fall back to get_machine_local_and_dist_rank otherwise)
  • or move the creation of the hook after the initialisation of torch.distributed

What do you both think?

Indeed. There is already an is_primary() function: https://github.com/facebookresearch/vissl/blob/master/vissl/hooks/tensorboard_hook.py#L7 . Does this help? :)

I should specify that I am indeed using is_primary from classy_vision.generic.distributed_util

@QuentinDuval
Contributor

QuentinDuval commented Apr 21, 2021

Indeed. There is already an is_primary() function: https://github.com/facebookresearch/vissl/blob/master/vissl/hooks/tensorboard_hook.py#L7 . Does this help? :)

It will not help: it's the very function to avoid in that case because torch.distributed is not initialised up to that point and so it will return 0 for all callers => multiple workers will call wandb.init.

@pclucas14 Could you check whether replacing is_primary with 0 == get_machine_local_and_dist_rank()[1] solves your init problem?

If it does not work, could you try this instead?

local_rank, distributed_rank = get_machine_local_and_dist_rank()
if local_rank == 0 and distributed_rank == 0:
    # init code

The function we should create would look like this:

def is_primary():
    if not dist.is_initialized():
        dist_rank = get_machine_local_and_dist_rank()[1]
        return dist_rank == 0
    else:
        return get_rank() == 0  # get rank of classy-vision

The other option is not to check the initialisation of torch.distributed:

def is_primary():
    dist_rank = get_machine_local_and_dist_rank()[1]
    return dist_rank == 0

@prigoyal
Contributor

The function we should create would look like this:

def is_primary():
    if not dist.is_initialized():
        dist_rank = get_machine_local_and_dist_rank()[1]
        return dist_rank == 0
    else:
        return get_rank() == 0  # get rank of classy-vision

Thank you so much @pclucas14 for the clarification, and I fully agree @QuentinDuval. I think we should go for the above function and implement it in vissl/utils/misc.py. We should then switch the is_primary() calls everywhere in VISSL from the classy vision one to our own. @pclucas14, if you wish, this broader change can be taken care of by us in a separate PR; in this PR, you can create an is_primary() as Quentin suggested above and use it where you need it. :)

@facebook-github-bot
Contributor

Hi @pclucas14!

Thank you for your pull request.

We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

@surajpaib mentioned this pull request on Jun 15, 2022