HPO_Pipeline fails on AutoSF models #1369

Open
3 tasks done
vinven7 opened this issue Feb 19, 2024 · 1 comment
Labels
bug Something isn't working

Comments

vinven7 commented Feb 19, 2024

Describe the bug

I am trying to optimize AutoSF on a custom dataset. However, this triggers a device-side assert error in CUDA.

Here is the full trace:

[I 2024-02-19 07:58:28,133] A new study created in memory with name: no-name-54ecbfdb-b81a-4379-b02a-ef5ffdd29652
INFO:pykeen.hpo.hpo:Using model: <class 'pykeen.models.unimodal.auto_sf.AutoSF'>
INFO:pykeen.hpo.hpo:Using loss: <class 'pykeen.losses.MarginRankingLoss'>
INFO:pykeen.hpo.hpo:Using optimizer: <class 'torch.optim.adam.Adam'>
INFO:pykeen.hpo.hpo:Using training loop: <class 'pykeen.training.slcwa.SLCWATrainingLoop'>
INFO:pykeen.hpo.hpo:Using negative sampler: <class 'pykeen.sampling.basic_negative_sampler.BasicNegativeSampler'>
INFO:pykeen.hpo.hpo:Using evaluator: <class 'pykeen.evaluation.rank_based_evaluator.RankBasedEvaluator'>
INFO:pykeen.hpo.hpo:Attempting to maximize both.realistic.inverse_harmonic_mean_rank
INFO:pykeen.hpo.hpo:Filter validation triples when testing: True
WARNING:pykeen.pipeline.api:No random seed is specified. Setting to 4229552334.
[W 2024-02-19 07:58:28,139] Trial 0 failed with parameters: {'model.embedding_dim': 128, 'loss.margin': 1.633297580856592, 'optimizer.lr': 0.04577728396873623, 'negative_sampler.num_negs_per_pos': 11, 'training.num_epochs': 400, 'training.batch_size': 4096} because of the following error: RuntimeError('CUDA error: device-side assert triggered\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n').
Traceback (most recent call last):
  File "/home/synthesisproject/anaconda3/envs/vineeth_14/lib/python3.10/site-packages/optuna/study/_optimize.py", line 200, in _run_trial
    value_or_values = func(trial)
  File "/home/synthesisproject/anaconda3/envs/vineeth_14/lib/python3.10/site-packages/pykeen/hpo/hpo.py", line 309, in __call__
    raise e
  File "/home/synthesisproject/anaconda3/envs/vineeth_14/lib/python3.10/site-packages/pykeen/hpo/hpo.py", line 259, in __call__
    result = pipeline(
  File "/home/synthesisproject/anaconda3/envs/vineeth_14/lib/python3.10/site-packages/pykeen/pipeline/api.py", line 1487, in pipeline
    set_random_seed(_random_seed)
  File "/home/synthesisproject/anaconda3/envs/vineeth_14/lib/python3.10/site-packages/pykeen/utils.py", line 298, in set_random_seed
    generator = torch.manual_seed(seed=seed)
  File "/home/synthesisproject/anaconda3/envs/vineeth_14/lib/python3.10/site-packages/torch/random.py", line 40, in manual_seed
    torch.cuda.manual_seed_all(seed)
  File "/home/synthesisproject/anaconda3/envs/vineeth_14/lib/python3.10/site-packages/torch/cuda/random.py", line 113, in manual_seed_all
    _lazy_call(cb, seed_all=True)
  File "/home/synthesisproject/anaconda3/envs/vineeth_14/lib/python3.10/site-packages/torch/cuda/__init__.py", line 183, in _lazy_call
    callable()
  File "/home/synthesisproject/anaconda3/envs/vineeth_14/lib/python3.10/site-packages/torch/cuda/random.py", line 111, in cb
    default_generator.manual_seed(seed)
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

How to reproduce

from pykeen.hpo import hpo_pipeline

# Placeholder value: `save_directory` is not defined in the original snippet.
save_directory = "autosf_hpo"

hpo_pipeline_result = hpo_pipeline(
    n_trials=30,
    dataset="Nations",
    model="AutoSF",
    # The options below were tried in other runs; the failure occurs even with all of them disabled.
    # model_kwargs_ranges=dict(
    #     embedding_dim=dict(type=int, low=4, high=754, q=50),  # Use 'q' for quantization step
    # ),
    # loss="Self-Adversarial Negative Sampling Loss",
    # loss_kwargs_ranges=dict(
    #     adversarial_temperature=dict(type=float, low=0.1, high=0.5, q=0.1),
    # ),
    # optimizer="Adam",
    # lr_scheduler="ExponentialLR",
    # training_loop="sLCWA",
    # training_kwargs_ranges=dict(
    #     num_epochs=dict(type=int, low=50, high=500, q=50),
    # ),
    # negative_sampler="basic",
    # negative_sampler_kwargs_ranges=dict(
    #     num_negs_per_pos=dict(type=int, low=3, high=39, q=3),
    # ),
    # stopper="early",
    save_model_directory=save_directory,
)

I have tried various combinations of parameters to see whether that solves the problem, but it fails even in this simplest case.

Environment

Key Value
OS posix
Platform Linux
Release 4.18.0-305.19.1.el8_4.x86_64
Time Mon Feb 19 08:04:23 2024
Python 3.10.11
PyKEEN 1.10.1
PyKEEN Hash UNHASHED
PyKEEN Branch
PyTorch 2.0.1
CUDA Available? true
CUDA Version 11.8
cuDNN Version 8700

Additional information

No response

Issue Template Checks

  • This is not a feature request (use a different issue template if it is)
  • This is not a question (use the discussions forum instead)
  • I've read the text explaining why including environment information is important and understand if I omit this information that my issue will be dismissed
vinven7 added the bug label on Feb 19, 2024
Member

mberr commented Feb 19, 2024

I could not reproduce the error with

from pykeen.hpo import hpo_pipeline

hpo_pipeline_result = hpo_pipeline(
    n_trials=3,
    dataset="Nations",
    model="AutoSF",
    training_kwargs=dict(num_epochs=1),
)

and this env

Key Value
OS nt
Platform Windows
Release 10
Time Mon Feb 19 19:20:02 2024
Python 3.11.2
PyKEEN 1.10.2-dev
PyKEEN Hash c94213c
PyKEEN Branch master
PyTorch 2.1.1+cu121
CUDA Available? true
CUDA Version 12.1
cuDNN Version 8801
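
One way to narrow this down, sketched here under assumptions rather than as a confirmed fix: CUDA kernels are launched asynchronously, so the Python frame where a device-side assert surfaces is not necessarily the one that caused it, and once an assert has fired, every later CUDA call in the same process fails as well. The reported trace fails in `torch.manual_seed` during trial 0, before any training, which may mean the assert was triggered earlier in the same Python process; the trace does not confirm this. The sketch below is meant to be run in a fresh interpreter. It sets `CUDA_LAUNCH_BLOCKING=1` so that any later GPU run reports errors at the offending call, and it wraps the minimal HPO call in a helper, `run_autosf_hpo_on_cpu` (a hypothetical name), that forces `device="cpu"`, since CPU runs usually raise a readable Python exception instead of a device-side assert.

```python
import os

# Must be set before CUDA is initialised in this process, so run this in a
# fresh interpreter (or after restarting the notebook kernel).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def run_autosf_hpo_on_cpu(n_trials=1):
    """Run the minimal AutoSF HPO example from this thread on the CPU.

    Hypothetical helper for triage only; the arguments mirror the snippets
    above, with `device="cpu"` added to keep CUDA out of the picture.
    """
    # Imported lazily so the environment variable above is already in place.
    from pykeen.hpo import hpo_pipeline

    return hpo_pipeline(
        n_trials=n_trials,
        dataset="Nations",
        model="AutoSF",
        training_kwargs=dict(num_epochs=1),
        device="cpu",
    )
```

If `run_autosf_hpo_on_cpu()` succeeds in a fresh process, rerunning the original snippet on the GPU in that same fresh process, with `CUDA_LAUNCH_BLOCKING=1` still set, should show whether the assert comes from this pipeline or from earlier work in the old session.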
