assert self.has_full_params #1134

Open
pokameng opened this issue Sep 11, 2023 · 4 comments

Comments

@pokameng

Hello @min-xu-ai,
I'm using FSDP to wrap my model, but I'm getting the following error:

`assert self.has_full_params`

This is my code:

```python
model = build_network(cfg).cuda()

model.cnet = auto_wrap_bn(model.cnet,single_rank_pg=False)

# model.fnet = auto_wrap_bn(model.fnet,single_rank_pg=False)
# model.att = auto_wrap_bn(model.att,single_rank_pg=False)
# model.update_block = auto_wrap_bn(model.update_block,single_rank_pg=False)


model.cnet = checkpoint_wrapper(model.cnet)
model.fnet = checkpoint_wrapper(model.fnet)
model.att = checkpoint_wrapper(model.att)
model.update_block = checkpoint_wrapper(model.update_block)


# model.cnet = FSDP(model.cnet)
# model.fnet = FSDP(model.fnet)
model.update_block = FSDP(model.update_block)
model.att = FSDP(model.att)

loguru_logger.info("Parameter Count: %d" % count_parameters(model)) # 12659389

if cfg.restore_ckpt is not None:
    print("[Loading ckpt from {}]".format(cfg.restore_ckpt))
    model.load_state_dict(torch.load(cfg.restore_ckpt), strict=True)

# model.cuda()
model.train()

train_loader = datasets.fetch_dataloader(cfg)
optimizer, scheduler = fetch_optimizer(model, cfg.trainer)

total_steps = 0
scaler = GradScaler(enabled=cfg.mixed_precision)
logger = Logger(model, scheduler, cfg)

reporter = MemReporter(model)
reporter.report()
print(f"After model loading - Memory Allocated: {torch.cuda.memory_allocated() / 1024 ** 2} MB")
should_keep_training = True
while should_keep_training:

    for i_batch, data_blob in enumerate(train_loader):
        
        # optimizer.zero_grad()
        images, flows, valids = [x.cuda() for x in data_blob]
        model.zero_grad(set_to_none=True)
        print(f"Before forward pass - Memory Allocated: {torch.cuda.memory_allocated() / 1024 ** 2} MB")
        
        if cfg.add_noise:
            stdv = np.random.uniform(0.0, 5.0)
            images = (images + stdv * torch.randn(*images.shape).cuda()).clamp(0.0, 255.0)

        output = {}
        flow_predictions = model(images, output)
        print(f"After forward pass (before backward) - Memory Allocated: {torch.cuda.memory_allocated() / 1024 ** 2} MB")
        loss, metrics, _ = loss_func(flow_predictions, flows, valids, cfg)
        loss.backward()
        # scaler.scale(loss).backward()
        print(f"After backward - Memory Allocated: {torch.cuda.memory_allocated() / 1024 ** 2} MB")
        scaler.unscale_(optimizer)
        print(f"After optimizer step and params update - Memory Allocated: {torch.cuda.memory_allocated() / 1024 ** 2} MB")
        torch.nn.utils.clip_grad_norm_(model.parameters(), cfg.trainer.clip)
        
        # scaler.step(optimizer)
        optimizer.step()
        scheduler.step()
        # scaler.update()

        metrics.update(output)
        logger.push(metrics)

        if total_steps % cfg.val_freq == cfg.val_freq - 1:
            PATH = '%s/%d_%s.pth' % (cfg.log_dir, total_steps+1, cfg.name)
            # torch.save(model.state_dict(), PATH)

            results = {}
            for val_dataset in cfg.validation:
                if val_dataset == 'sintel_train':
                    results.update(evaluate_tile.validate_sintel(model.module))

            logger.write_dict(results)
            
            model.train()
        
        total_steps += 1

        if total_steps > cfg.trainer.num_steps:
            should_keep_training = False
            break

logger.close()
save_path = cfg.log_dir + '/final.pth'
print(save_path)
torch.save(model.state_dict(), save_path)

return save_path
```

```
Traceback (most recent call last):
  File "/home/dxy/anaconda3/envs/videoflow/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/wsm/VideoFlow-main/FSDP_BOF.py", line 250, in main_worker
    train(cfg)
  File "/home/wsm/VideoFlow-main/FSDP_BOF.py", line 174, in train
    loss.backward()
  File "/home/dxy/anaconda3/envs/videoflow/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/dxy/anaconda3/envs/videoflow/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/dxy/anaconda3/envs/videoflow/lib/python3.10/site-packages/fairscale/nn/data_parallel/fully_sharded_data_parallel.py", line 1516, in _pre_backward_hook
    self._use_full_params()
  File "/home/dxy/anaconda3/envs/videoflow/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/dxy/anaconda3/envs/videoflow/lib/python3.10/site-packages/fairscale/nn/data_parallel/fully_sharded_data_parallel.py", line 2061, in _use_full_params
    assert self.has_full_params
AssertionError
```

Can you help me?

@min-xu-ai
Contributor

Can you first try the PyTorch version of FSDP? If you can't, can you please let me know the reason?

Also, we usually wrap the top-level model with FSDP. Quickly reading your code, it seems that you only wrapped two sub-modules?
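
For illustration, wrapping the whole model might look like the sketch below. It reuses `build_network`, `auto_wrap_bn`, and `checkpoint_wrapper` from the snippet above; the wrapping order and the single outer FSDP wrapper are assumptions, not a verified fix.

```python
# Sketch only: wrap the sub-modules first (BN handling + activation
# checkpointing), then put ONE FSDP wrapper around the top-level model
# so that all parameters are sharded, not just update_block and att.
# The FSDP / auto_wrap_bn / checkpoint_wrapper imports are assumed to be
# the same fairscale ones the original script already uses.
model = build_network(cfg).cuda()

for name in ("cnet", "fnet", "att", "update_block"):
    sub = getattr(model, name)
    sub = auto_wrap_bn(sub, single_rank_pg=False)  # wrap BatchNorm layers separately
    sub = checkpoint_wrapper(sub)                  # trade recompute for activation memory
    setattr(model, name, sub)

model = FSDP(model)  # one outer wrapper over the whole network
```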

@pokameng
Author

> Can you first try the PyTorch version of FSDP? If you can't, can you please let me know the reason?
>
> Also, we usually wrap the top-level model with FSDP. Quickly reading your code, it seems that you only wrapped two sub-modules?

Hello,
I can run the code with the PyTorch version of FSDP successfully. The code looks like this:
```python
model = FSDP(
    model,
    auto_wrap_policy=my_auto_wrap_policy,
    # use_orig_params=True,                               # whether to keep the original parameters
    # cpu_offload=CPUOffload(offload_params=True),
    # mixed_precision=fp16_policy,
    # sharding_strategy=ShardingStrategy.SHARD_GRAD_OP,   # ZeRO 2
    sharding_strategy=ShardingStrategy.FULL_SHARD,        # ZeRO 3
    backward_prefetch=BackwardPrefetch.BACKWARD_POST,     # optimize memory
    # backward_prefetch=BackwardPrefetch.BACKWARD_PRE,    # optimize speed
)
```
I compared the memory usage under different training strategies (DP, DDP, and FSDP):
[Image: GPU memory comparison for DP, DDP, and FSDP]
But when I set batch size = 2, FSDP runs out of memory (OOM), so I am looking for solutions such as auto_wrap_bn or checkpoint_wrapper.
However, when I use checkpoint_wrapper, it raises the `self.has_full_params` AssertionError.
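
Since the PyTorch-native FSDP run works, PyTorch's own activation-checkpointing helper can play the role of fairscale's `checkpoint_wrapper`. Below is a minimal sketch, assuming `model`, `my_auto_wrap_policy`, and the block names `cnet`/`fnet`/`att`/`update_block` from the snippets above (on older PyTorch versions the helper is named `apply_activation_checkpointing_wrapper`):

```python
import functools

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl,
    apply_activation_checkpointing,
    checkpoint_wrapper,
)

# Remember the classes of the memory-heavy blocks before wrapping
# (assumption: these are where most of the activation memory goes).
block_classes = tuple(
    type(getattr(model, name)) for name in ("cnet", "fnet", "att", "update_block")
)

model = FSDP(
    model,
    auto_wrap_policy=my_auto_wrap_policy,           # same policy as above
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # ZeRO 3, as in the run above
)

# Replace the matched sub-modules with checkpointed versions on top of FSDP,
# instead of using fairscale's checkpoint_wrapper.
apply_activation_checkpointing(
    model,
    checkpoint_wrapper_fn=functools.partial(
        checkpoint_wrapper, checkpoint_impl=CheckpointImpl.NO_REENTRANT
    ),
    check_fn=lambda m: isinstance(m, block_classes),
)
```

Wrapping with FSDP first and then applying checkpointing leaves the sharding layout untouched; only the forward activations of the matched blocks are recomputed during backward.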

@pokameng
Author

@min-xu-ai, can you help me?

@min-xu-ai
Contributor

Sorry, I can't. Can you check with the PyTorch folks, since they have an FSDP version that is more supported and official?
