
Issues about the performance #1

Open
zl9501 opened this issue May 10, 2021 · 3 comments
zl9501 commented May 10, 2021

Hi, thanks for your nice work.

I ran the training script provided in this repo without changing any code, but there is a significant performance gap between this code and your paper (for example, 0.588 vs. 0.715 AUROC on TinyImageNet).

Should I tune some training hyperparameters to improve accuracy? I have tried adjusting the learning rate and the number of epochs, but it did not help. I am looking forward to your suggestions.

(cvaecaposr) xx@xxx:~/cvaecaposr$ sh ./scripts/train_tinyimagenet.sh
{
    "data_base_path": "./data",
    "val_ratio": 0.2,
    "seed": 1234,
    "known_classes": [
        2,
        3,
        13,
        30,
        44,
        45,
        64,
        66,
        76,
        101,
        111,
        121,
        128,
        130,
        136,
        158,
        167,
        170,
        187,
        193
    ],
    "unknown_classes": [
        0,
        1,
        4,
        5,
        6,
        7,
        8,
        9,
        10,
        11,
        12,
        14,
        15,
        16,
        17,
        18,
        19,
        20,
        21,
        22,
        23,
        24,
        25,
        26,
        27,
        28,
        29,
        31,
        32,
        33,
        34,
        35,
        36,
        37,
        38,
        39,
        40,
        41,
        42,
        43,
        46,
        47,
        48,
        49,
        50,
        51,
        52,
        53,
        54,
        55,
        56,
        57,
        58,
        59,
        60,
        61,
        62,
        63,
        65,
        67,
        68,
        69,
        70,
        71,
        72,
        73,
        74,
        75,
        77,
        78,
        79,
        80,
        81,
        82,
        83,
        84,
        85,
        86,
        87,
        88,
        89,
        90,
        91,
        92,
        93,
        94,
        95,
        96,
        97,
        98,
        99,
        100,
        102,
        103,
        104,
        105,
        106,
        107,
        108,
        109,
        110,
        112,
        113,
        114,
        115,
        116,
        117,
        118,
        119,
        120,
        122,
        123,
        124,
        125,
        126,
        127,
        129,
        131,
        132,
        133,
        134,
        135,
        137,
        138,
        139,
        140,
        141,
        142,
        143,
        144,
        145,
        146,
        147,
        148,
        149,
        150,
        151,
        152,
        153,
        154,
        155,
        156,
        157,
        159,
        160,
        161,
        162,
        163,
        164,
        165,
        166,
        168,
        169,
        171,
        172,
        173,
        174,
        175,
        176,
        177,
        178,
        179,
        180,
        181,
        182,
        183,
        184,
        185,
        186,
        188,
        189,
        190,
        191,
        192,
        194,
        195,
        196,
        197,
        198,
        199
    ],
    "split_num": 0,
    "batch_size": 32,
    "num_workers": 0,
    "dataset": "tiny_imagenet",
    "z_dim": 128,
    "lr": 5e-05,
    "t_mu_shift": 10.0,
    "t_var_scale": 0.01,
    "alpha": 1.0,
    "beta": 0.01,
    "margin": 10.0,
    "in_dim_caps": 16,
    "out_dim_caps": 32,
    "checkpoint": "",
    "mode": "train",
    "epochs": 100
}
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]

  | Name    | Type      | Params
--------------------------------------
0 | enc     | ResNet34  | 21.3 M
1 | vae_cap | VaeCap    | 23.5 M
2 | fc      | Linear    | 10.5 M
3 | dec     | Decoder   | 760 K
4 | t_mean  | Embedding | 51.2 K
5 | t_var   | Embedding | 51.2 K
--------------------------------------
56.1 M    Trainable params
0         Non-trainable params
56.1 M    Total params
224.552   Total estimated model params size (MB)
/xx/anaconda3/envs/cvaecaposr/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: The dataloader, train dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 80 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  warnings.warn(*args, **kwargs)
/xx/anaconda3/envs/cvaecaposr/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 80 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  warnings.warn(*args, **kwargs)
Epoch 20: loss=4.99e+03, train_acc=0.938, validation_acc=0.456 | Epoch 21: reducing learning rate of group 0 to 2.5000e-05
Epoch 28: loss=3.31e+03, train_acc=0.906, validation_acc=0.460 | Epoch 29: reducing learning rate of group 0 to 1.2500e-05
Epoch 42: loss=2.32e+03, train_acc=0.906, validation_acc=0.459 | Epoch 43: reducing learning rate of group 0 to 6.2500e-06
Epoch 48: loss=1.88e+03, train_acc=0.969, validation_acc=0.474 | Epoch 49: reducing learning rate of group 0 to 3.1250e-06
Epoch 54: loss=2.41e+03, train_acc=0.938, validation_acc=0.465 | Epoch 55: reducing learning rate of group 0 to 1.5625e-06
Epoch 60: loss=1.72e+03, train_acc=1.000, validation_acc=0.468 | Epoch 61: reducing learning rate of group 0 to 7.8125e-07
Epoch 66: loss=1.88e+03, train_acc=1.000, validation_acc=0.471 | Epoch 67: reducing learning rate of group 0 to 3.9063e-07
Epoch 72: loss=1.62e+03, train_acc=1.000, validation_acc=0.466 | Epoch 73: reducing learning rate of group 0 to 1.9531e-07
Epoch 78: loss=1.15e+03, train_acc=1.000, validation_acc=0.472 | Epoch 79: reducing learning rate of group 0 to 9.7656e-08
Epoch 84: loss=1.48e+03, train_acc=0.969, validation_acc=0.470 | Epoch 85: reducing learning rate of group 0 to 4.8828e-08
Epoch 90: loss=1.68e+03, train_acc=0.938, validation_acc=0.472 | Epoch 91: reducing learning rate of group 0 to 2.4414e-08
Epoch 96: loss=1.82e+03, train_acc=0.938, validation_acc=0.471 | Epoch 97: reducing learning rate of group 0 to 1.2207e-08
Epoch 99: loss=1.76e+03, train_acc=1.000, validation_acc=0.468
Saving latest checkpoint...
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
/xx/anaconda3/envs/cvaecaposr/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: The dataloader, test dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 80 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  warnings.warn(*args, **kwargs)
/xx/anaconda3/envs/cvaecaposr/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: Metric `AUROC` will save all targets and predictions in buffer. For large datasets this may lead to large memory footprint.
  warnings.warn(*args, **kwargs)
Testing: 313/313 [00:23<00:00, 13.10it/s]
--------------------------------------------------------------------------------
DATALOADER:0 TEST RESULTS
{'test_auroc': 0.5880855321884155}
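
As a side note, the `"num_workers": 0` setting in the config above is what triggers the dataloader warnings in the log. A minimal sketch of choosing a worker count for a PyTorch `DataLoader` (the helper name `pick_num_workers` and the cap of 8 are my own assumptions, not part of this repo; the warning's suggestion of 80 workers is usually overkill):

```python
import os

def pick_num_workers(cap: int = 8) -> int:
    """Heuristic worker count for a PyTorch DataLoader: a few worker
    processes keep the GPU fed without oversubscribing the machine."""
    return max(1, min(cap, (os.cpu_count() or 1) // 2))

# Hypothetical usage:
# DataLoader(dataset, batch_size=32, num_workers=pick_num_workers())
print(pick_num_workers())
```

This should remove the bottleneck warnings, though it is unlikely to explain the AUROC gap itself.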

@liMike1998

Hello, I have run into the same problem as well. Have you figured out the cause yet?

@mattolson93

I cannot reproduce the reported results either.


wjun0830 commented Oct 4, 2021

I ran the official RPL code on these splits and the results are also poor (using 32x32 resolution):

63.0569111111111 | 61.4288777777778 | 64.7651666666667 | 60.7026888888889 | 62.0719888888889

The AUROC for each split is listed above, and the average is 62.4051266666667.
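
The averaging can be verified directly from the five per-split values above; a quick sketch:

```python
# AUROC (%) per split, as reported above
aurocs = [63.0569111111111, 61.4288777777778, 64.7651666666667,
          60.7026888888889, 62.0719888888889]
average = sum(aurocs) / len(aurocs)
print(average)  # ~62.40512666666667, matching the reported mean
```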
