assert self.has_full_params #1134

Open
pokameng opened this issue Sep 11, 2023 · 4 comments

Comments

@pokameng

Hello @min-xu-ai,
I'm using FSDP to wrap my model, but I'm getting the following error:

`assert self.has_full_params`

This is my code:

```python
model = build_network(cfg).cuda()

model.cnet = auto_wrap_bn(model.cnet,single_rank_pg=False)

# model.fnet = auto_wrap_bn(model.fnet,single_rank_pg=False)
# model.att = auto_wrap_bn(model.att,single_rank_pg=False)
# model.update_block = auto_wrap_bn(model.update_block,single_rank_pg=False)


model.cnet = checkpoint_wrapper(model.cnet)
model.fnet = checkpoint_wrapper(model.fnet)
model.att = checkpoint_wrapper(model.att)
model.update_block = checkpoint_wrapper(model.update_block)


# model.cnet = FSDP(model.cnet)
# model.fnet = FSDP(model.fnet)
model.update_block = FSDP(model.update_block)
model.att = FSDP(model.att)

loguru_logger.info("Parameter Count: %d" % count_parameters(model)) # 12659389

if cfg.restore_ckpt is not None:
    print("[Loading ckpt from {}]".format(cfg.restore_ckpt))
    model.load_state_dict(torch.load(cfg.restore_ckpt), strict=True)

# model.cuda()
model.train()

train_loader = datasets.fetch_dataloader(cfg)
optimizer, scheduler = fetch_optimizer(model, cfg.trainer)

total_steps = 0
scaler = GradScaler(enabled=cfg.mixed_precision)
logger = Logger(model, scheduler, cfg)

reporter = MemReporter(model)
reporter.report()
print(f"After model loading - Memory Allocated: {torch.cuda.memory_allocated() / 1024 ** 2} MB")
should_keep_training = True
while should_keep_training:

    for i_batch, data_blob in enumerate(train_loader):
        
        # optimizer.zero_grad()
        images, flows, valids = [x.cuda() for x in data_blob]
        model.zero_grad(set_to_none=True)
        print(f"Before forward pass - Memory Allocated: {torch.cuda.memory_allocated() / 1024 ** 2} MB")
        
        if cfg.add_noise:
            stdv = np.random.uniform(0.0, 5.0)
            images = (images + stdv * torch.randn(*images.shape).cuda()).clamp(0.0, 255.0)

        output = {}
        flow_predictions = model(images, output)
        print(f"After forward pass (before backward) - Memory Allocated: {torch.cuda.memory_allocated() / 1024 ** 2} MB")
        loss, metrics, _ = loss_func(flow_predictions, flows, valids, cfg)
        loss.backward()
        # scaler.scale(loss).backward()
        print(f"After backward - Memory Allocated: {torch.cuda.memory_allocated() / 1024 ** 2} MB")
        scaler.unscale_(optimizer)
        print(f"After optimizer step and params update - Memory Allocated: {torch.cuda.memory_allocated() / 1024 ** 2} MB")
        torch.nn.utils.clip_grad_norm_(model.parameters(), cfg.trainer.clip)
        
        # scaler.step(optimizer)
        optimizer.step()
        scheduler.step()
        # scaler.update()

        metrics.update(output)
        logger.push(metrics)

        if total_steps % cfg.val_freq == cfg.val_freq - 1:
            PATH = '%s/%d_%s.pth' % (cfg.log_dir, total_steps+1, cfg.name)
            # torch.save(model.state_dict(), PATH)

            results = {}
            for val_dataset in cfg.validation:
                if val_dataset == 'sintel_train':
                    results.update(evaluate_tile.validate_sintel(model.module))

            logger.write_dict(results)
            
            model.train()
        
        total_steps += 1

        if total_steps > cfg.trainer.num_steps:
            should_keep_training = False
            break

logger.close()
save_path = cfg.log_dir + '/final.pth'
print(save_path)
torch.save(model.state_dict(), save_path)

return save_path
```

```
Traceback (most recent call last):
  File "/home/dxy/anaconda3/envs/videoflow/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/wsm/VideoFlow-main/FSDP_BOF.py", line 250, in main_worker
    train(cfg)
  File "/home/wsm/VideoFlow-main/FSDP_BOF.py", line 174, in train
    loss.backward()
  File "/home/dxy/anaconda3/envs/videoflow/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/dxy/anaconda3/envs/videoflow/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/dxy/anaconda3/envs/videoflow/lib/python3.10/site-packages/fairscale/nn/data_parallel/fully_sharded_data_parallel.py", line 1516, in _pre_backward_hook
    self._use_full_params()
  File "/home/dxy/anaconda3/envs/videoflow/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/dxy/anaconda3/envs/videoflow/lib/python3.10/site-packages/fairscale/nn/data_parallel/fully_sharded_data_parallel.py", line 2061, in _use_full_params
    assert self.has_full_params
AssertionError
```

Can you help me?

@min-xu-ai
Contributor

Can you first try the PyTorch version of FSDP? If you can't, can you please let me know the reason?

Also, we usually wrap the top-level model with FSDP. Quickly reading your code, it seems that you only wrapped two sub-modules?
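
For illustration, wrapping the whole model might look like the sketch below. It reuses `build_network`, `auto_wrap_bn`, and `checkpoint_wrapper` from the snippet above; the wrapping order and the single outer FSDP wrapper are assumptions, not a verified fix.

```python
# Sketch only: wrap the sub-modules first (BN handling + activation
# checkpointing), then put ONE FSDP wrapper around the top-level model
# so that all parameters are sharded, not just update_block and att.
# The FSDP / auto_wrap_bn / checkpoint_wrapper imports are assumed to be
# the same fairscale ones the original script already uses.
model = build_network(cfg).cuda()

for name in ("cnet", "fnet", "att", "update_block"):
    sub = getattr(model, name)
    sub = auto_wrap_bn(sub, single_rank_pg=False)  # wrap BatchNorm layers separately
    sub = checkpoint_wrapper(sub)                  # trade recompute for activation memory
    setattr(model, name, sub)

model = FSDP(model)  # one outer wrapper over the whole network
```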

@pokameng
Author

> Can you first try the PyTorch version of FSDP? If you can't, can you please let me know the reason?
>
> Also, we usually wrap the top-level model with FSDP. Quickly reading your code, it seems that you only wrapped two sub-modules?

Hello,
I can run the code with the PyTorch version of FSDP successfully. The code looks like this:
```python
model = FSDP(
    model,
    auto_wrap_policy=my_auto_wrap_policy,
    # use_orig_params=True,                               # whether to keep the original parameters
    # cpu_offload=CPUOffload(offload_params=True),
    # mixed_precision=fp16_policy,
    # sharding_strategy=ShardingStrategy.SHARD_GRAD_OP,   # ZeRO 2
    sharding_strategy=ShardingStrategy.FULL_SHARD,        # ZeRO 3
    backward_prefetch=BackwardPrefetch.BACKWARD_POST,     # optimize memory
    # backward_prefetch=BackwardPrefetch.BACKWARD_PRE,    # optimize speed
)
```
I compared the memory usage under different training strategies (DP, DDP, and FSDP):
[Image: GPU memory comparison for DP, DDP, and FSDP]
But when I set batch size = 2, FSDP runs out of memory (OOM), so I am looking for solutions such as auto_wrap_bn or checkpoint_wrapper.
However, when I use checkpoint_wrapper, it raises the `self.has_full_params` AssertionError.
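
Since the PyTorch-native FSDP run works, PyTorch's own activation-checkpointing helper can play the role of fairscale's `checkpoint_wrapper`. Below is a minimal sketch, assuming `model`, `my_auto_wrap_policy`, and the block names `cnet`/`fnet`/`att`/`update_block` from the snippets above (on older PyTorch versions the helper is named `apply_activation_checkpointing_wrapper`):

```python
import functools

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl,
    apply_activation_checkpointing,
    checkpoint_wrapper,
)

# Remember the classes of the memory-heavy blocks before wrapping
# (assumption: these are where most of the activation memory goes).
block_classes = tuple(
    type(getattr(model, name)) for name in ("cnet", "fnet", "att", "update_block")
)

model = FSDP(
    model,
    auto_wrap_policy=my_auto_wrap_policy,           # same policy as above
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # ZeRO 3, as in the run above
)

# Replace the matched sub-modules with checkpointed versions on top of FSDP,
# instead of using fairscale's checkpoint_wrapper.
apply_activation_checkpointing(
    model,
    checkpoint_wrapper_fn=functools.partial(
        checkpoint_wrapper, checkpoint_impl=CheckpointImpl.NO_REENTRANT
    ),
    check_fn=lambda m: isinstance(m, block_classes),
)
```

Wrapping with FSDP first and then applying checkpointing leaves the sharding layout untouched; only the forward activations of the matched blocks are recomputed during backward.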

@pokameng
Author

@min-xu-ai, can you help me?

@min-xu-ai
Contributor

Sorry, I can't. Can you check with the PyTorch folks, since they have an FSDP version that is more supported and official?
