This repository has been archived by the owner on Mar 19, 2024. It is now read-only.

working prototype of wandb #271

Open
wants to merge 4 commits into main

Conversation

pclucas14

This is a working prototype of a wandb logger. I'm filing the PR now so the team can have a look, but there are probably still a few things to iron out first.

@facebook-github-bot added the CLA Signed label on Apr 6, 2021
@prigoyal
Contributor

prigoyal commented Apr 6, 2021

also cc @QuentinDuval and @min-xu-ai :)

@pclucas14
Author

Right now a critical issue is that for some errors, wandb stalls instead of crashing. When testing this, if a run stalls, rerun it with wandb disabled (by running wandb disabled in the command line)

@min-xu-ai
Contributor

Right now a critical issue is that for some errors, wandb stalls instead of crashing. When testing this, if a run stalls, rerun it with wandb disabled (by running wandb disabled in the command line)

This is exciting. Can you share a bit more info on how to test this out? I can check out your branch, and I suppose I need to install some wandb pip packages, etc.? Can you share the commands for folks to test it out and to see how it is used in action?

@pclucas14
Author

pclucas14 commented Apr 6, 2021

Hi,

To install wandb, pip install wandb should do it. Some basic things you should know: there are essentially three wandb modes.

  1. wandb on enables wandb and syncs your runs directly to the cloud, as well as logging files locally
  2. wandb off does not upload anything online; it only saves files locally, which you can upload later
  3. wandb disabled makes all wandb operations no-ops and is equivalent to not using wandb at all

If you don't have a wandb account you should start by creating one. Once that's done, wandb will give you instructions on what to set up on your remote server.
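For reference, here is a rough sketch of how these modes map onto wandb's Python API (this assumes a wandb version whose wandb.init accepts a mode argument; the project and run names below are just placeholders):

import wandb

# mode="online" corresponds to "wandb on", mode="offline" to "wandb off",
# and mode="disabled" turns every wandb call into a no-op ("wandb disabled")
run = wandb.init(project="vissl-sandbox", name="simclr_mini_2gpu", mode="online")

wandb.log({"train/loss": 0.5})  # synced, saved locally, or ignored depending on the mode
run.finish()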

In terms of reproducing the error: on my end, if you do distributed training (num_nodes > 1) and use a very large batch size that leads to an OOM, wandb currently stalls and I'm not sure why. It may also be an issue specific to my setup; I will investigate.

@pclucas14
Author

This is the exact command I am running:

PYTHONPATH=./ python tools/run_distributed_engines.py config=pretrain/simclr/simclr_wandb.yaml config.DISTRIBUTED.NUM_PROC_PER_NODE=2 config.HOOKS.WANDB_SETUP.USE_WANDB=True config.DISTRIBUTED.NUM_NODES=1 config.HOOKS.WANDB_SETUP.EXP_NAME=simclr_mini_2gpu config.DATA.TRAIN.BATCHSIZE_PER_REPLICA=1000

Running it again after wandb disabled crashes with an OOM error, as seen here:

INFO 2021-04-06 11:45:33,467 trainer_main.py: 301: Phase advanced. Rank: 1
INFO 2021-04-06 11:45:33,467 state_update_hooks.py:  98: Starting phase 16 [train]
INFO 2021-04-06 11:45:34,856 trainer_main.py: 301: Phase advanced. Rank: 0
INFO 2021-04-06 11:45:34,857 state_update_hooks.py:  98: Starting phase 16 [train]
Traceback (most recent call last):
  File "tools/run_distributed_engines.py", line 58, in <module>
    hydra_main(overrides=overrides)
  File "tools/run_distributed_engines.py", line 45, in hydra_main
    hook_generator=default_hook_generator,
  File "/private/home/lucaspc/repos/vissl/vissl/utils/distributed_launcher.py", line 145, in launch_distributed
    daemon=False,
  File "/private/home/lucaspc/.conda/envs/vissl2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/private/home/lucaspc/.conda/envs/vissl2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/private/home/lucaspc/.conda/envs/vissl2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

....

    self.padding, self.dilation, self.groups)
RuntimeError: CUDA out of memory. Tried to allocate 1.10 GiB (GPU 1; 15.78 GiB total capacity; 12.60 GiB already allocated; 641.75 MiB free; 13.93 GiB reserved in total by PyTorch)


With wandb on, the run does not crash but rather stalls. This is the last thing shown in the prompt:

INFO 2021-04-06 11:41:48,606 trainer_main.py: 301: Phase advanced. Rank: 1
INFO 2021-04-06 11:41:48,606 state_update_hooks.py:  98: Starting phase 16 [train]
INFO 2021-04-06 11:41:49,963 trainer_main.py: 301: Phase advanced. Rank: 0
INFO 2021-04-06 11:41:49,964 state_update_hooks.py:  98: Starting phase 16 [train]

Contributor

@min-xu-ai left a comment

I don't see how this hook can cause the stall when an OOM should have been raised. Maybe it wasn't really a stall? How about the case without an OOM: does the training go on as normal? Doesn't it get slowed down when wandb is enabled? I'd suggest putting some prints in the hook and the train loop to see where exactly the stall happens, if it indeed stalls.


    wandb_available = True
except ImportError:
    logging.info("Tensorboard is not available")
Contributor

stale msg

Author

yes

Comment on lines +81 to +82
import wandb
from vissl.hooks import SSLWandbHook
Contributor

move these to the top or is it due to circular imports?

Author

I want to only import wandb if the program is calling the hook. This way it will still work if a user has not installed wandb yet.
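As an illustration, the pattern is roughly the following (a sketch only, not the exact code in the PR; the log message is a placeholder):

import logging

try:
    import wandb

    wandb_available = True
except ImportError:
    # users without wandb installed can still run everything else
    wandb_available = False
    logging.info("wandb is not available")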

Contributor

Makes sense to me to do local imports.

vissl/utils/wandb.py (outdated, resolved)
from classy_vision.generic.distributed_util import is_primary
from classy_vision.hooks.classy_hook import ClassyHook

if is_primary():
Contributor

why do this check? Importing on all ranks should be fine too?

Author

The wandb import has some overhead, so I was trying to limit it to the main worker since it's the only one using it. This can be changed.
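Roughly, the intent is this pattern (a sketch only; as comes up later in this thread, is_primary() may not behave as expected before torch.distributed is initialised):

from classy_vision.generic.distributed_util import is_primary

wandb = None
if is_primary():
    # pay the wandb import cost only on the rank that will actually log
    import wandb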

Contributor

I think it makes sense to me to do it on primary_rank only. :)

vissl/hooks/wandb_hook.py (outdated, resolved)
@pclucas14
Author

I don't see how this hook can cause the stall when an OOM should have been raised. Maybe it wasn't really a stall? How about the case without an OOM: does the training go on as normal? Doesn't it get slowed down when wandb is enabled? I'd suggest putting some prints in the hook and the train loop to see where exactly the stall happens, if it indeed stalls.

Ok, I will try this tomorrow and let you know. Note that this is not specific to an OOM error; I saw the same thing with another error on the data side.

@pclucas14
Author

pclucas14 commented Apr 12, 2021

Hi, quick update:

I'm unable to use the current logger when launching jobs with SLURM. If I ask for an interactive session and launch a job, it works fine; however, when launching directly with SLURM I get the following error:

-- Process 3 terminated with the following error:
Traceback (most recent call last):
  File "/private/home/lucaspc/.conda/envs/vissl2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/private/home/lucaspc/repos/vissl/vissl/utils/distributed_launcher.py", line 200, in _distributed_worker
    process_main(cfg, dist_run_id, local_rank=local_rank, node_id=node_id)
  File "/private/home/lucaspc/repos/vissl/vissl/utils/distributed_launcher.py", line 193, in process_main
    hook_generator=hook_generator,
  File "/private/home/lucaspc/repos/vissl/vissl/engines/train.py", line 93, in train_main
    hooks = hook_generator(cfg)
  File "/private/home/lucaspc/repos/vissl/vissl/hooks/__init__.py", line 126, in default_hook_generator
    wandb_hook = get_wandb_hook(cfg)
  File "/private/home/lucaspc/repos/vissl/vissl/utils/wandb.py", line 120, in get_wandb_hook
    name=name
  File "/private/home/lucaspc/.conda/envs/vissl2/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 742, in init
    run = wi.init()
  File "/private/home/lucaspc/.conda/envs/vissl2/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 513, in init
    raise UsageError(error_message)
wandb.errors.UsageError: Error communicating with wandb process

srun: error: learnfair0928: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=39934341.0

If anyone with distributed programming knowledge knows why this is happening, please let me know

Contributor

@prigoyal left a comment

thank you @pclucas14 :)

from classy_vision.generic.distributed_util import is_primary
from classy_vision.hooks.classy_hook import ClassyHook

if is_primary():
Contributor

I think it makes sense to me to do it on primary_rank only. :)


def is_wandb_available():
"""
Check whether wandb is available or not.
Contributor

can we add a link here to wandb documentation that open source users can read/access? :)

Check whether wandb is available or not.

Returns:
wandb_available (bool): based on wandb imports, returns whether tensboarboard
Contributor

nit: tensboarboard -> wandb

Comment on lines +81 to +82
import wandb
from vissl.hooks import SSLWandbHook
Contributor

Makes sense to me to do local imports.

f.write(wandb_id)

name = cfg.HOOKS.WANDB_SETUP.EXP_NAME
if name == "??":
Contributor

nit: let's set name = "" instead of ??


BYTE_TO_MiB = 2 ** 20

class SSLWandbHook(ClassyHook):
Contributor

A question on this: is this hook similar to the Tensorboard hook, with the only difference being that it logs to wandb instead of tensorboard?

Is it possible that we can inherit the TensorboardHook? Or alternatively, does it make sense to extend the TensorboardHook directly to optionally log to wandb as well if the user is using WandB?

@prigoyal
Contributor

Hi, quick update:

I'm unable to use the current logger when launching jobs with SLURM. If I ask for an interactive session and launch a job, it works fine; however, when launching directly with SLURM I get the following error:

-- Process 3 terminated with the following error:
Traceback (most recent call last):
  File "/private/home/lucaspc/.conda/envs/vissl2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/private/home/lucaspc/repos/vissl/vissl/utils/distributed_launcher.py", line 200, in _distributed_worker
    process_main(cfg, dist_run_id, local_rank=local_rank, node_id=node_id)
  File "/private/home/lucaspc/repos/vissl/vissl/utils/distributed_launcher.py", line 193, in process_main
    hook_generator=hook_generator,
  File "/private/home/lucaspc/repos/vissl/vissl/engines/train.py", line 93, in train_main
    hooks = hook_generator(cfg)
  File "/private/home/lucaspc/repos/vissl/vissl/hooks/__init__.py", line 126, in default_hook_generator
    wandb_hook = get_wandb_hook(cfg)
  File "/private/home/lucaspc/repos/vissl/vissl/utils/wandb.py", line 120, in get_wandb_hook
    name=name
  File "/private/home/lucaspc/.conda/envs/vissl2/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 742, in init
    run = wi.init()
  File "/private/home/lucaspc/.conda/envs/vissl2/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 513, in init
    raise UsageError(error_message)
wandb.errors.UsageError: Error communicating with wandb process

srun: error: learnfair0928: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=39934341.0

If anyone with distributed programming knowledge knows why this is happening, please let me know

Regarding this, we need to understand what wandb requires in order to run via SLURM. Are there specific requirements or documentation on configuring wandb when using SLURM?

@pclucas14
Author

I emailed the support team at Weights and Biases; I'll keep you posted.

@pclucas14
Author

After speaking with people from WandB and having them look at the log, it seems that the following line gets called more than once, which is problematic; I need it to be called exactly once. Any tips on how to make that happen? I thought wrapping it in an if is_primary() check would be sufficient, but it seems that is not the case.

@QuentinDuval
Contributor

QuentinDuval commented Apr 20, 2021

the following line

Indeed, this is because torch.distributed is not initialised at this point (I got bitten by it as well). You should use dist_rank = get_machine_local_and_dist_rank()[1].

@prigoyal I think we need to rework this a bit, and either:

  • introduce an is_primary function in VISSL that does the right thing (use torch.distributed if it is initialised, and fall back to get_machine_local_and_dist_rank otherwise)
  • or move the creation of the hook after the initialisation of torch.distributed

What do you both think?

@prigoyal
Contributor

the following line

Indeed, this is because torch.distributed is not initialised at this point (I got bitten by it as well). You should use dist_rank = get_machine_local_and_dist_rank()[1].

@prigoyal I think we need to rework this a bit, and either:

  • introduce an is_primary function in VISSL that does the right thing (use torch.distributed if it is initialised, and fall back to get_machine_local_and_dist_rank otherwise)
  • or move the creation of the hook after the initialisation of torch.distributed

What do you both think?

Indeed. There is already an is_primary() function: https://github.com/facebookresearch/vissl/blob/master/vissl/hooks/tensorboard_hook.py#L7 . Does this help? :)

@pclucas14
Author

the following line

Indeed, this is because torch.distributed is not initialised at this point (I got bitten by it as well). You should use dist_rank = get_machine_local_and_dist_rank()[1].
@prigoyal I think we need to rework this a bit, and either:

  • introduce an is_primary function in VISSL that does the right thing (use torch.distributed if it is initialised, and fall back to get_machine_local_and_dist_rank otherwise)
  • or move the creation of the hook after the initialisation of torch.distributed

What do you both think?

Indeed. There is already an is_primary() function: https://github.com/facebookresearch/vissl/blob/master/vissl/hooks/tensorboard_hook.py#L7 . Does this help? :)

I should specify that I am indeed using is_primary from classy_vision.generic.distributed_util

@QuentinDuval
Contributor

QuentinDuval commented Apr 21, 2021

Indeed. There is already an is_primary() function: https://github.com/facebookresearch/vissl/blob/master/vissl/hooks/tensorboard_hook.py#L7 . Does this help? :)

It will not help: it's the very function to avoid in that case because torch.distributed is not initialised up to that point and so it will return 0 for all callers => multiple workers will call wandb.init.

@pclucas14 Could you check whether replacing is_primary with 0 == get_machine_local_and_dist_rank()[1] solves your init problem?

If it does not work, could you try this instead?

local_rank, distributed_rank = get_machine_local_and_dist_rank()
if local_rank == 0 and distributed_rank == 0:
    # init code

The function we should create would look like this:

def is_primary():
    if not dist.is_initialized():
        dist_rank = get_machine_local_and_dist_rank()[1]
        return dist_rank == 0
    else:
        return get_rank() == 0  # get rank of classy-vision

The other option is not to check the initialisation of torch.distributed:

def is_primary():
    dist_rank = get_machine_local_and_dist_rank()[1]
    return dist_rank == 0

@prigoyal
Contributor

The function we should create would look like this:

def is_primary():
    if not dist.is_initialized():
        dist_rank = get_machine_local_and_dist_rank()[1]
        return dist_rank == 0
    else:
        return get_rank() == 0  # get rank of classy-vision

Thank you so much @pclucas14 for the clarification, and I fully agree @QuentinDuval. I think we should go for the above function and implement it in vissl/utils/misc.py. We should then switch the is_primary() calls everywhere in VISSL from the classy vision one to our own. @pclucas14, if you wish, this broader change can be taken care of by us in a separate PR; in this PR, you can create an is_primary() as Quentin suggested above and use it where you need it. :)

@facebook-github-bot
Contributor

Hi @pclucas14!

Thank you for your pull request.

We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

@surajpaib mentioned this pull request on Jun 15, 2022