HPO_Pipeline fails on AutoSF models #1369

vinven7 · 2024-02-19T13:05:12Z

Describe the bug

I am trying to optimize AutoSF on a custom dataset. However, this triggers a device-side assert error in CUDA.

Here is the full trace:

[I 2024-02-19 07:58:28,133] A new study created in memory with name: no-name-54ecbfdb-b81a-4379-b02a-ef5ffdd29652
INFO:pykeen.hpo.hpo:Using model: <class 'pykeen.models.unimodal.auto_sf.AutoSF'>
INFO:pykeen.hpo.hpo:Using loss: <class 'pykeen.losses.MarginRankingLoss'>
INFO:pykeen.hpo.hpo:Using optimizer: <class 'torch.optim.adam.Adam'>
INFO:pykeen.hpo.hpo:Using training loop: <class 'pykeen.training.slcwa.SLCWATrainingLoop'>
INFO:pykeen.hpo.hpo:Using negative sampler: <class 'pykeen.sampling.basic_negative_sampler.BasicNegativeSampler'>
INFO:pykeen.hpo.hpo:Using evaluator: <class 'pykeen.evaluation.rank_based_evaluator.RankBasedEvaluator'>
INFO:pykeen.hpo.hpo:Attempting to maximize both.realistic.inverse_harmonic_mean_rank
INFO:pykeen.hpo.hpo:Filter validation triples when testing: True
WARNING:pykeen.pipeline.api:No random seed is specified. Setting to 4229552334.
[W 2024-02-19 07:58:28,139] Trial 0 failed with parameters: {'model.embedding_dim': 128, 'loss.margin': 1.633297580856592, 'optimizer.lr': 0.04577728396873623, 'negative_sampler.num_negs_per_pos': 11, 'training.num_epochs': 400, 'training.batch_size': 4096} because of the following error: RuntimeError('CUDA error: device-side assert triggered\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n').
Traceback (most recent call last):
  File "/home/synthesisproject/anaconda3/envs/vineeth_14/lib/python3.10/site-packages/optuna/study/_optimize.py", line 200, in _run_trial
    value_or_values = func(trial)
  File "/home/synthesisproject/anaconda3/envs/vineeth_14/lib/python3.10/site-packages/pykeen/hpo/hpo.py", line 309, in __call__
    raise e
  File "/home/synthesisproject/anaconda3/envs/vineeth_14/lib/python3.10/site-packages/pykeen/hpo/hpo.py", line 259, in __call__
    result = pipeline(
  File "/home/synthesisproject/anaconda3/envs/vineeth_14/lib/python3.10/site-packages/pykeen/pipeline/api.py", line 1487, in pipeline
    set_random_seed(_random_seed)
  File "/home/synthesisproject/anaconda3/envs/vineeth_14/lib/python3.10/site-packages/pykeen/utils.py", line 298, in set_random_seed
    generator = torch.manual_seed(seed=seed)
  File "/home/synthesisproject/anaconda3/envs/vineeth_14/lib/python3.10/site-packages/torch/random.py", line 40, in manual_seed
    torch.cuda.manual_seed_all(seed)
  File "/home/synthesisproject/anaconda3/envs/vineeth_14/lib/python3.10/site-packages/torch/cuda/random.py", line 113, in manual_seed_all
    _lazy_call(cb, seed_all=True)
  File "/home/synthesisproject/anaconda3/envs/vineeth_14/lib/python3.10/site-packages/torch/cuda/__init__.py", line 183, in _lazy_call
    callable()
  File "/home/synthesisproject/anaconda3/envs/vineeth_14/lib/python3.10/site-packages/torch/cuda/random.py", line 111, in cb
    default_generator.manual_seed(seed)
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

How to reproduce

from pykeen.hpo import hpo_pipeline

hpo_pipeline_result = hpo_pipeline(
    n_trials=30,
    dataset='Nations',
    model='AutoSF',
  #  model_kwargs_ranges=dict(
  #      embedding_dim=dict(type=int, low=4, high=754, q=50)  # Use 'q' for quantization step
  #  ),
#    loss= 'Self-Adversarial Negative Sampling Loss',
#    loss_kwargs_ranges = dict(
#      adversarial_temperature = dict(type = float, low =0.1, high =0.5, q=0.1)
#    ),
#    optimizer='Adam',
#    lr_scheduler='ExponentialLR',
#    training_loop='sLCWA',  
#    training_kwargs_ranges=dict(
#        num_epochs=dict(type=int, low=50, high=500, q=50), 
#    ),
#    negative_sampler='basic',
#    negative_sampler_kwargs_ranges=dict(
#        num_negs_per_pos=dict(type=int, low=3, high=39, q=3),
#    ),
#    stopper='early',
    save_model_directory=save_directory,  # save_directory must be defined beforehand; it is not shown in this report
)

I have tried various combinations of parameters to see whether that solves the problem, but it fails even in this simplest case.

Environment

Key	Value
OS	posix
Platform	Linux
Release	4.18.0-305.19.1.el8_4.x86_64
Time	Mon Feb 19 08:04:23 2024
Python	3.10.11
PyKEEN	1.10.1
PyKEEN Hash	UNHASHED
PyKEEN Branch
PyTorch	2.0.1
CUDA Available?	true
CUDA Version	11.8
cuDNN Version	8700

Additional information

No response

Issue Template Checks

This is not a feature request (use a different issue template if it is)
This is not a question (use the discussions forum instead)
I've read the text explaining why including environment information is important and understand if I omit this information that my issue will be dismissed

mberr · 2024-02-19T18:20:30Z

I could not reproduce the error with

from pykeen.hpo import hpo_pipeline

hpo_pipeline_result = hpo_pipeline(
    n_trials=3,
    dataset="Nations",
    model="AutoSF",
    training_kwargs=dict(num_epochs=1),
)

and this env

Key	Value
OS	nt
Platform	Windows
Release	10
Time	Mon Feb 19 19:20:02 2024
Python	3.11.2
PyKEEN	1.10.2-dev
PyKEEN Hash	`c94213c`
PyKEEN Branch	master
PyTorch	2.1.1+cu121
CUDA Available?	true
CUDA Version	12.1
cuDNN Version	8801
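
The trace in the report fails inside `set_random_seed`, before any model code runs. That may mean the CUDA context was already in an error state from an earlier call in the same process, since device-side asserts are reported asynchronously and later CUDA calls in that process tend to keep failing. A minimal sketch for narrowing this down, assuming it is run in a fresh Python process: it sets `CUDA_LAUNCH_BLOCKING=1` so that a GPU run reports the failing call synchronously, and it defines a CPU-only run, where an underlying error (for example an out-of-range index) usually surfaces as a regular Python exception. `run_autosf_hpo_on_cpu` is a hypothetical helper name, and `device="cpu"` assumes the `device` argument of `hpo_pipeline` is available in the installed PyKEEN version.

```python
import os

# Assumption: this runs in a fresh process, before anything has touched CUDA.
# With CUDA_LAUNCH_BLOCKING=1, kernel launches are synchronous, so a GPU run
# points at the call that actually failed instead of a later, unrelated one.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def run_autosf_hpo_on_cpu(n_trials: int = 1):
    """Hypothetical helper: run a tiny AutoSF HPO study on the CPU only."""
    # Imported here so the environment variable above is set before torch loads.
    from pykeen.hpo import hpo_pipeline

    return hpo_pipeline(
        n_trials=n_trials,
        dataset="Nations",
        model="AutoSF",
        training_kwargs=dict(num_epochs=1),
        device="cpu",  # assumption: `device` is accepted by this PyKEEN version
    )
```

Calling `run_autosf_hpo_on_cpu()` in a new interpreter or a restarted notebook kernel is the intended use. If the CPU run succeeds, the same call without `device="cpu"` in another fresh process would show whether the GPU failure still occurs without earlier CUDA errors in the session.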

vinven7 added the bug (Something isn't working) label on Feb 19, 2024
