
ExpertGate Nan loss using multi-head classifier #1639

Open
appledora opened this issue Apr 19, 2024 · 4 comments
appledora commented Apr 19, 2024

In my current experiments, I am trying to set up different datasets (each with a different number of classes) as separate tasks/experiences. While trying to train with ExpertGate, I am not sure if I am doing the data processing correctly, as I am getting NaN loss repeatedly.

Here's my data-processing code:

train_dataset_list, test_dataset_list = [], []
for idx, (task_name, num_classes) in enumerate(zip(task_list, classes_per_task)):
    print("task: ", task_name)
    # get_dataloaders returns loaders plus the raw train/test datasets;
    # only the datasets are needed for benchmark construction
    _, _, _, train_dataset, test_dataset = get_dataloaders(task_name, 0.8, 16)
    train_avalanche_dataset = _make_taskaware_classification_dataset(train_dataset)
    test_avalanche_dataset = _make_taskaware_classification_dataset(test_dataset)
    train_dataset_list.append(train_avalanche_dataset)
    test_dataset_list.append(test_avalanche_dataset)

ncbm = nc_benchmark(
    train_dataset_list,
    test_dataset_list,
    n_experiences=100,
    task_labels=True,
    shuffle=False,
    class_ids_from_zero_in_each_exp=True,
    one_dataset_per_exp=True,
    train_transform=None,
    eval_transform=None,
)
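As a quick sanity check (a sketch, assuming the benchmark above builds without errors), iterating the train stream and printing the per-experience task labels and class IDs makes it easy to confirm that class_ids_from_zero_in_each_exp behaved as expected:

# Sanity-check sketch: confirm each experience carries the expected task
# label and a class set starting at 0, since
# class_ids_from_zero_in_each_exp=True remaps labels per experience.
for experience in ncbm.train_stream:
    print(
        f"Experience {experience.current_experience}: "
        f"task={experience.task_label}, "
        f"classes={sorted(experience.classes_in_this_experience)}"
    )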

This is how I am initializing the model and strategy:

model = ExpertGate(shape=(3, 224, 224), device=device)
model.expert.classifier[6] = MultiHeadClassifier(4096)
cl_strategy = ExpertGateStrategy(
    model=model,
    optimizer=optimizer,
    train_mb_size=256,
    eval_mb_size=128,
    train_epochs=2,
    ae_lr=1e-3,
    device=device,
)
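One thing that may be worth double-checking here, as an assumption rather than a confirmed diagnosis: Avalanche's MultiHeadClassifier dispatches to a per-task head at forward time, so it expects to be called as classifier(x, task_labels), whereas a slot inside an nn.Sequential is called with x alone. A minimal sketch of the expected call:

import torch
from avalanche.models import MultiHeadClassifier

# Minimal sketch of how MultiHeadClassifier is normally invoked: it
# selects (and lazily grows) one linear head per task label, so the
# task labels must reach its forward() call.
clf = MultiHeadClassifier(in_features=4096)
x = torch.randn(8, 4096)
task_labels = torch.zeros(8, dtype=torch.long)  # pretend all samples are task 0
out = clf(x, task_labels)
print(out.shape)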

Here's a sample (ongoing) training log:

Device:  cuda:0
Starting experiment...
Start training on experience  0
-- >> Start of training phase << --

TRAINING NEW AUTOENCODER
-- >> Start of training phase << --
15106it [09:10, 27.44it/s]                          
Epoch 0 ended.
	Loss_Epoch/train_phase/train_stream/Task000 = 135.1136
100%|██████████| 5241/5241 [00:55<00:00, 95.15it/s] 
Epoch 1 ended.
	Loss_Epoch/train_phase/train_stream/Task000 = 124.1420
-- >> End of training phase << --
FINISHED TRAINING NEW AUTOENCODER


SELECTING EXPERT
FINISHED EXPERT SELECTION


TRAINING EXPERT
100%|██████████| 21/21 [00:34<00:00,  1.65s/it]
Epoch 0 ended.
	Loss_Epoch/train_phase/train_stream/Task000 = nan
	Top1_Acc_Epoch/train_phase/train_stream/Task000 = 0.0130
100%|██████████| 21/21 [00:34<00:00,  1.64s/it]
Epoch 1 ended.
	Loss_Epoch/train_phase/train_stream/Task000 = nan
	Top1_Acc_Epoch/train_phase/train_stream/Task000 = 0.0036
-- >> End of training phase << --
-- >> Start of eval phase << --
-- Starting eval on experience 0 (Task 0) from test stream --
100%|██████████| 11/11 [00:09<00:00,  1.21it/s]
> Eval on experience 0 (Task 0) from test stream ended.
	Loss_Exp/eval_phase/test_stream/Task000/Exp000 = nan
	Top1_Acc_Exp/eval_phase/test_stream/Task000/Exp000 = 0.0061
-- >> End of eval phase << --
	Loss_Stream/eval_phase/test_stream/Task000 = nan
	Top1_Acc_Stream/eval_phase/test_stream/Task000 = 0.0061
End training on experience  0
Training time: 191.16707921028137
Computing accuracy on the test set
Start training on experience  1
-- >> Start of training phase << --

TRAINING NEW AUTOENCODER
-- >> Start of training phase << --
 30%|███       | 3756/12490 [01:01<02:16, 64.00it/s]

Would love to hear any pointers on this.
In general, what is the best way to set up dataloaders for my particular setting?
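A generic first step for chasing NaN losses, independent of ExpertGate, is to turn on autograd anomaly detection and verify the remapped targets per experience. A rough sketch, assuming the `ncbm` benchmark above and an Avalanche version where datasets expose a targets attribute:

import torch

# Anomaly detection makes backward() raise on the first op that produces
# a NaN, pointing at the culprit (slow; enable only while debugging).
torch.autograd.set_detect_anomaly(True)

# With a per-task multi-head classifier, targets must lie in
# [0, n_classes_of_that_task); out-of-range labels push cross-entropy
# to nan/inf. Check the remapped label range per experience:
for experience in ncbm.train_stream:
    targets = torch.as_tensor(list(experience.dataset.targets))
    print(
        f"exp {experience.current_experience}: "
        f"labels in [{targets.min().item()}, {targets.max().item()}]"
    )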

appledora (Author) commented:

By the way, I initially wanted to ask on your Slack, but the invite link is no longer working.


niniack commented Apr 20, 2024

Heya, happy to try to help you out with this. I wrote this strategy for Avalanche a while ago, and I'd like to make sure it's usable for you. Hopefully we can squash any bugs if there is something wrong in the implementation. I might be a bit slow to respond, so please bear with me.

It seems like you're using a custom dataset; were you able to reproduce this issue with a non-custom dataset? If so, could you please share a minimal example? I want to figure out whether this is a root issue with the implementation or whether it is something specific to your use case. In your current setup, what optimizer and learning rate are you using?

appledora changed the title from "EpertGate Nan loss using multi-head classifier" to "ExpertGate Nan loss using multi-head classifier" on Apr 21, 2024

appledora commented Apr 21, 2024

Hey @niniack, thanks for your contributions.
I haven't tried the default datasets, as they don't fit my current objective of using each dataset as a separate task. The custom get_dataloader function I have used here simply returns PyTorch DataLoader objects and dataset instances.
Here's the base Dataset class I am using:

from PIL import Image
import torch
from torch.utils.data import Dataset

class ImageDataset(Dataset):
    def __init__(self):
        self.data_path = ""
        self.data_name = ""
        self.num_classes = 0
        self.train_transform = None
        self.train_csv_path = ""
        self.image_paths = []
        self.labels = []

    def get_num_classes(self):
        return self.num_classes

    def __getitem__(self, index):
        img_path = self.image_paths[index]
        label = self.labels[index]
        img = Image.open(img_path).convert("RGB")
        if self.train_transform:
            img = self.train_transform(img)
        return img, label

    def __len__(self):
        return len(self.image_paths)

    @property
    def label_dict(self):
        return {i: self.class_map[i] for i in range(self.num_classes)}

    def __repr__(self):
        return f"ImageDataset({self.data_name}) with {len(self)} instances"


def get_dataloader(dataset, batch_size=16):
    # 80/20 train/validation split
    split_size = int(0.8 * len(dataset))
    train_dataset, val_dataset = torch.utils.data.random_split(
        dataset, [split_size, len(dataset) - split_size]
    )

    train_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=batch_size, shuffle=True, drop_last=True, pin_memory=True
    )
    val_loader = torch.utils.data.DataLoader(
        val_dataset, batch_size=batch_size, shuffle=False, drop_last=True, pin_memory=True
    )

    return train_loader, val_loader, dataset.get_num_classes()
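One assumption worth verifying (hypothetical, not a confirmed diagnosis): Avalanche's classification wrappers look for an integer targets sequence on the dataset, while ImageDataset keeps its labels in self.labels. Exposing them explicitly, e.g. via a small subclass (ImageDatasetWithTargets is an illustrative name), would rule out label misdetection:

class ImageDatasetWithTargets(ImageDataset):
    # Hypothetical helper: expose labels under the `targets` name that
    # Avalanche's dataset wrappers look for, as plain ints so that
    # nc_benchmark's class remapping behaves predictably.
    @property
    def targets(self):
        return [int(lbl) for lbl in self.labels]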


appledora commented Apr 21, 2024

This is the optimizer I am using. I picked the hyperparameters from the continual-learning-baseline repo:

model = ExpertGate(shape=(3, 224, 224), device=device)
optimizer = SGD(
    model.expert.parameters(), lr=0.1, momentum=0.9, weight_decay=0.0005
)
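As a side note, and purely as an assumption to test: SGD with lr=0.1 on a large pretrained expert can diverge and produce NaN losses on its own. A conservative variant can help isolate that from the data pipeline; the lower lr and the clipping threshold below are illustrative choices, not recommendations from the ExpertGate paper:

from torch.optim import SGD

# Sketch: a more conservative setup to test whether the NaNs come from
# divergence rather than the data pipeline (values are illustrative).
optimizer = SGD(
    model.expert.parameters(), lr=0.01, momentum=0.9, weight_decay=0.0005
)

# Clamp gradients via tensor hooks so no change to the Avalanche
# training loop is needed; each hook rewrites the parameter's gradient.
for p in model.expert.parameters():
    if p.requires_grad:
        p.register_hook(lambda g: g.clamp(min=-1.0, max=1.0))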
