Can sam-hq resume training from a checkpoint: continue training an already trained sam-hq, or continue training from the official sam-hq weights? #126

Open
YUANMU227 opened this issue Mar 13, 2024 · 2 comments

Comments

@YUANMU227

Can sam-hq resume training from a checkpoint (breakpoint training): that is, either continue training an already trained sam-hq model, or continue training from the official sam-hq weights?

Modifying the --start_epoch parameter alone does not resume training from an existing epoch_*.pth.

How can training be resumed from a checkpoint?
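For reference, the usual PyTorch pattern for resuming is to save the model's `state_dict` at the end of an epoch and load it back into a freshly built model before training continues. The sketch below uses a tiny `nn.Linear` as a stand-in for the sam-hq network; the checkpoint name `epoch_5.pth` merely mirrors the `epoch_*.pth` files mentioned above and is not part of the repo's API.

```python
import os
import tempfile

import torch

# Stand-in model; in sam-hq this would be the HQ decoder / net module.
net = torch.nn.Linear(4, 2)
ckpt_path = os.path.join(tempfile.mkdtemp(), "epoch_5.pth")

# After some epochs, training scripts typically save the weights:
torch.save(net.state_dict(), ckpt_path)

# To resume later, rebuild the same architecture and load the weights
# back before entering the training loop again:
resumed = torch.nn.Linear(4, 2)
resumed.load_state_dict(torch.load(ckpt_path, map_location="cpu"))
```

The missing piece in the question is wiring a load like this into train.py before the training loop, so that `--start_epoch` actually picks up from restored weights.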

@bhack

bhack commented Mar 18, 2024

It seems that currently the restore_model logic is applied only for eval:

sam-hq/train/train.py

Lines 361 to 373 in 3224888

```python
else:
    sam = sam_model_registry[args.model_type](checkpoint=args.checkpoint)
    _ = sam.to(device=args.device)
    sam = torch.nn.parallel.DistributedDataParallel(sam, device_ids=[args.gpu], find_unused_parameters=args.find_unused_params)

    if args.restore_model:
        print("restore model from:", args.restore_model)
        if torch.cuda.is_available():
            net_without_ddp.load_state_dict(torch.load(args.restore_model))
        else:
            net_without_ddp.load_state_dict(torch.load(args.restore_model, map_location="cpu"))

    evaluate(args, net, sam, valid_dataloaders, args.visualize)
```
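Since the `--restore_model` branch above only feeds `evaluate`, resuming training would mean performing the same kind of restore before the training loop, and ideally also saving/restoring the optimizer state and last finished epoch. The sketch below is hypothetical and not part of the current train.py: it uses a tiny stand-in model and invented checkpoint keys (`"epoch"`, `"model"`, `"optimizer"`) to show one way a full-resume checkpoint could look.

```python
import os
import tempfile

import torch

# Stand-in for the trainable net; a dummy step gives the optimizer state.
net = torch.nn.Linear(4, 2)
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
net(torch.randn(8, 4)).sum().backward()
optimizer.step()

# Hypothetical full-resume checkpoint: weights plus optimizer state plus
# the last finished epoch, so training can pick up where it stopped.
ckpt_path = os.path.join(tempfile.mkdtemp(), "resume.pth")
torch.save({"epoch": 5,
            "model": net.state_dict(),
            "optimizer": optimizer.state_dict()}, ckpt_path)

# On resume: restore everything *before* the training loop starts.
ckpt = torch.load(ckpt_path, map_location="cpu")
net2 = torch.nn.Linear(4, 2)
opt2 = torch.optim.Adam(net2.parameters(), lr=1e-3)
net2.load_state_dict(ckpt["model"])
opt2.load_state_dict(ckpt["optimizer"])
start_epoch = ckpt["epoch"] + 1
```

Restoring only the model weights (as the eval branch does) would already allow continuing from a trained sam-hq, but without the optimizer state the Adam moments restart from zero.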

@bhack

bhack commented Mar 21, 2024

@lkeab Can you share the sam-hq-only network checkpoint so that we could restore it for training?
