Issue with dp --pt test and validation dataset size #3766

PhelanShao · 2024-05-10T02:14:03Z

Bug summary

I trained a model from scratch for 500k steps with "descriptor": "dpa2" and then tested it on a merged validation dataset. The merged dataset contains 7290 frames from two sources: C2O29H4_1124 and C2O3H4_6166.

When running dp test, an error occurred: The size of tensor a (17) must match the size of tensor b (25) at non-singleton dimension 1. It appears to be related to the size of the validation dataset, as passing a smaller value to the -n parameter lets the test run successfully.
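
For context, PyTorch raises this message when two tensors cannot be broadcast together because one dimension differs and neither size is 1. The sketch below mimics that compatibility check in plain Python. The shapes are hypothetical, chosen only to reproduce the 17-versus-25 mismatch at dimension 1, and are not taken from the DPA-2 model; `broadcast_mismatch` is an illustrative helper, not a PyTorch function.

```python
def broadcast_mismatch(shape_a, shape_b):
    """Return the first mismatch as (dim, size_a, size_b), or None if compatible.

    Follows the usual broadcasting rule: shapes are right-aligned, and two
    sizes are compatible when they are equal or one of them is 1.
    """
    ndim = max(len(shape_a), len(shape_b))
    # Left-pad the shorter shape with 1s so both have the same rank.
    a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)
    # Walk from the trailing dimension towards the leading one.
    for dim in range(ndim - 1, -1, -1):
        if a[dim] != b[dim] and a[dim] != 1 and b[dim] != 1:
            return dim, a[dim], b[dim]
    return None


# Hypothetical shapes that differ only in dimension 1.
mismatch = broadcast_mismatch((4, 17, 3), (4, 25, 3))
# A size of 1 broadcasts against any size, so this pair is compatible.
ok = broadcast_mismatch((4, 17, 3), (4, 1, 3))
```

A mismatch of this kind usually means two per-frame arrays were built with different lengths along the same axis, which would be consistent with the failure depending on how many frames are evaluated together.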

DeePMD-kit Version

v3.0.0a1.dev81+g23f67a13

Backend and its version

PyTorch 2.0.0, CUDA cu117

How did you download the software?

Others (write below)

Input Files, Running Commands, Error Log, etc.

Command
dp --pt test -m model.ckpt.pt -s /share/20240508/merged_validation_data/
Error Log
root@bohrium-25571-1132203:/share/20240508# dp --pt test -m model.ckpt.pt -s /share/20240508/C2O29H4/ -n 100 -d results
/opt/mamba/lib/python3.10/site-packages/torch/cuda/__init__.py:107: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
[2024-05-11 22:04:24,536] DEEPMD WARNING To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, DP_INTRA_OP_PARALLELISM_THREADS, and DP_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
[2024-05-11 22:04:26,714] DEEPMD WARNING You can use the environment variable DP_INFER_BATCH_SIZE tocontrol the inference batch size (nframes * natoms). The default value is 1024.
[2024-05-11 22:04:28,443] DEEPMD WARNING You can use the environment variable DP_INFER_BATCH_SIZE tocontrol the inference batch size (nframes * natoms). The default value is 1024.
[2024-05-11 22:04:28,444] DEEPMD INFO # ---------------output of dp test---------------
[2024-05-11 22:04:28,444] DEEPMD INFO # testing system : /share/20240508/C2O29H4
Traceback (most recent call last):
File "/opt/mamba/bin/dp", line 8, in <module>
sys.exit(main())
File "/opt/mamba/lib/python3.10/site-packages/deepmd/main.py", line 807, in main
deepmd_main(args)
File "/opt/mamba/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 64, in main
test(**dict_args)
File "/opt/mamba/lib/python3.10/site-packages/deepmd/entrypoints/test.py", line 147, in test
err = test_ener(
File "/opt/mamba/lib/python3.10/site-packages/deepmd/entrypoints/test.py", line 337, in test_ener
ret = dp.eval(
File "/opt/mamba/lib/python3.10/site-packages/deepmd/infer/deep_pot.py", line 158, in eval
results = self.deep_eval.eval(
File "/opt/mamba/lib/python3.10/site-packages/deepmd/pt/infer/deep_eval.py", line 267, in eval
out = self._eval_func(self._eval_model, numb_test, natoms)(
File "/opt/mamba/lib/python3.10/site-packages/deepmd/pt/infer/deep_eval.py", line 339, in eval_func
return self.auto_batch_size.execute_all(
File "/opt/mamba/lib/python3.10/site-packages/deepmd/pt/utils/auto_batch_size.py", line 83, in execute_all
n_batch, result = self.execute(execute_with_batch_size, index, natoms)
File "/opt/mamba/lib/python3.10/site-packages/deepmd/utils/batch_size.py", line 111, in execute
raise e
File "/opt/mamba/lib/python3.10/site-packages/deepmd/utils/batch_size.py", line 108, in execute
n_batch, result = callable(max(batch_nframes, 1), start_index)
File "/opt/mamba/lib/python3.10/site-packages/deepmd/pt/utils/auto_batch_size.py", line 59, in execute_with_batch_size
return (end_index - start_index), callable(
File "/opt/mamba/lib/python3.10/site-packages/deepmd/pt/infer/deep_eval.py", line 409, in _eval_model
batch_output = model(
File "/opt/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/mamba/lib/python3.10/site-packages/deepmd/pt/train/wrapper.py", line 173, in forward
model_pred = self.model[task_key](**input_dict)
File "/opt/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: The size of tensor a (17) must match the size of tensor b (25) at non-singleton dimension 1

root@bohrium-25571-1132203:/share/20240508# dp --pt test -m model.ckpt.pt -s /share/20240508/C2O29H4/ -n 10 -d results
/opt/mamba/lib/python3.10/site-packages/torch/cuda/__init__.py:107: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
[2024-05-11 22:05:40,841] DEEPMD WARNING To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, DP_INTRA_OP_PARALLELISM_THREADS, and DP_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
[2024-05-11 22:05:43,028] DEEPMD WARNING You can use the environment variable DP_INFER_BATCH_SIZE tocontrol the inference batch size (nframes * natoms). The default value is 1024.
[2024-05-11 22:05:44,759] DEEPMD WARNING You can use the environment variable DP_INFER_BATCH_SIZE tocontrol the inference batch size (nframes * natoms). The default value is 1024.
[2024-05-11 22:05:44,760] DEEPMD INFO # ---------------output of dp test---------------
[2024-05-11 22:05:44,760] DEEPMD INFO # testing system : /share/20240508/C2O29H4
[2024-05-11 22:05:51,862] DEEPMD INFO # number of test data : 10
[2024-05-11 22:05:51,862] DEEPMD INFO Energy MAE : 1.556307e+00 eV
[2024-05-11 22:05:51,862] DEEPMD INFO Energy RMSE : 2.110725e+00 eV
[2024-05-11 22:05:51,862] DEEPMD INFO Energy MAE/Natoms : 4.446592e-02 eV
[2024-05-11 22:05:51,862] DEEPMD INFO Energy RMSE/Natoms : 6.030642e-02 eV
[2024-05-11 22:05:51,862] DEEPMD INFO Force MAE : 5.084302e-01 eV/A
[2024-05-11 22:05:51,862] DEEPMD INFO Force RMSE : 7.582858e-01 eV/A
[2024-05-11 22:05:51,863] DEEPMD INFO Virial MAE : 6.835131e+00 eV
[2024-05-11 22:05:51,863] DEEPMD INFO Virial RMSE : 9.148349e+00 eV
[2024-05-11 22:05:51,863] DEEPMD INFO Virial MAE/Natoms : 1.952894e-01 eV
[2024-05-11 22:05:51,863] DEEPMD INFO Virial RMSE/Natoms : 2.613814e-01 eV
[2024-05-11 22:05:51,911] DEEPMD INFO # -----------------------------------------------

When I tested a single validation set from before the merge, I encountered the same error.
Modifying the command to:

dp --pt test -m model.ckpt.pt -s /share/20240508/validation_data/ -n 100

worked.

Returning to the merged dataset, I ran:
dp --pt test -m model.ckpt.pt -s /share/20240508/merged_validation_data/ -n 100

and encountered the same error: The size of tensor a (17) must match the size of tensor b (25) at non-singleton dimension 1.

Changing n to 10 worked:
dp --pt test -m model.ckpt.pt -s /share/20240508/merged_validation_data/ -n 10
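
Since `-n 10` passes and `-n 100` fails, it may help to find the exact threshold. The following is a minimal sketch that bisects over `-n` using only the flags that appear in this report (`-m`, `-s`, `-n`). The helper names are hypothetical, and the search assumes that once a value of `-n` fails every larger value also fails, which the report does not establish.

```python
import subprocess


def build_cmd(model, system, n):
    """Assemble the `dp --pt test` command from this report with `-n` set."""
    return ["dp", "--pt", "test", "-m", model, "-s", system, "-n", str(n)]


def passes_with_n(model, system, n):
    """Run the test once and treat a zero exit code as success."""
    return subprocess.run(build_cmd(model, system, n)).returncode == 0


def largest_passing_n(passes, lo, hi):
    """Bisect for the largest n in [lo, hi) for which passes(n) is True.

    Assumes passes(lo) is True, passes(hi) is False, and that failures are
    monotonic in n (an assumption, not something shown in the report).
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if passes(mid):
            lo = mid
        else:
            hi = mid
    return lo


# Example call (not executed here, since it needs the model and data):
# largest_passing_n(
#     lambda n: passes_with_n(
#         "model.ckpt.pt", "/share/20240508/merged_validation_data/", n),
#     10, 100)
```

Taking `passes` as a callable keeps the search independent of how the test is launched, so it can be checked with a stand-in function before spending time on real runs.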

Steps to Reproduce

1. Train the model with "descriptor": "dpa2" from scratch for 500k steps, then copy model.ckpt.pt.
2. Merge multiple validation datasets into one dataset named merged_validation_data (7290 frames, C2O29H4_1124, C2O3H4_6166).
3. Run the test command:
dp --pt test -m model.ckpt.pt -s /share/20240508/merged_validation_data/
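
The log warnings above mention DP_INFER_BATCH_SIZE (counted as nframes * natoms, default 1024), which may be worth varying when reproducing. The sketch below shows one way to set it for a child process and to estimate how many frames fit in one batch. `infer_env` and `frames_per_batch` are hypothetical helpers, and the integer-division rule is an assumption rather than something this thread confirms.

```python
import os


def infer_env(batch_size, base=None):
    """Copy an environment and set DP_INFER_BATCH_SIZE (named in the log warning)."""
    env = dict(os.environ if base is None else base)
    env["DP_INFER_BATCH_SIZE"] = str(batch_size)
    return env


def frames_per_batch(batch_size, natoms):
    """Estimate the number of frames evaluated per batch.

    ASSUMPTION: the batch size is divided by the atoms per frame using integer
    division. The floor of 1 mirrors the max(batch_nframes, 1) call visible in
    the traceback.
    """
    return max(batch_size // natoms, 1)


# C2O29H4 has 2 + 29 + 4 = 35 atoms per frame.
default_frames = frames_per_batch(1024, 35)
# An environment that could be passed as subprocess.run(..., env=small_env).
small_env = infer_env(256, base={})
```

Under that assumption, the default of 1024 gives 29 frames per batch for the 35-atom system, so `-n 10` would fit in a single batch while `-n 100` would not. Whether this is related to the failure is not established here.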

Further Information, Files, and Links

registry.dp.tech/dptech/prod-157/deepmd-kit:202Q1
model_validation_data.zip
model:
https://drive.google.com/file/d/1lVAJFZBnBr2rb-aevxR_nLdPBp_mZZFB/view?usp=drive_link
dpa2_input:
input.json

No response

Chengqian-Zhang · 2024-05-29T07:39:48Z

I used:

DeePMD-kit version v3.0.0a1.dev81+g23f67a13
PyTorch version 2.0.0+cu117
the model checkpoint and validation data you provided

I can't reproduce your error.

My output is:
(base) root@bohrium-11461-1141514:~/issue# dp --pt test -m model.ckpt-500000.pt -s merged_validation_data
[2024-05-29 14:43:43,099] DEEPMD WARNING To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, DP_INTRA_OP_PARALLELISM_THREADS, and DP_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
[2024-05-29 14:43:48,504] DEEPMD INFO # ---------------output of dp test---------------
[2024-05-29 14:43:48,504] DEEPMD INFO # testing system : merged_validation_data/C2O29H4
[2024-05-29 14:43:52,597] DEEPMD INFO Adjust batch size from 1024 to 512
/opt/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py:1501: UserWarning: operator() sees varying value in profiling, ignoring and this should be handled by GUARD logic (Triggered internally at ../third_party/nvfuser/csrc/parser.cpp:3777.)
return forward_call(*args, **kwargs)
[2024-05-29 14:43:54,826] DEEPMD INFO Adjust batch size from 512 to 256
[2024-05-29 14:46:47,689] DEEPMD INFO # number of test data : 1404
[2024-05-29 14:46:47,690] DEEPMD INFO Energy MAE : 9.018451e-01 eV
[2024-05-29 14:46:47,690] DEEPMD INFO Energy RMSE : 8.481161e+00 eV
[2024-05-29 14:46:47,690] DEEPMD INFO Energy MAE/Natoms : 2.576700e-02 eV
[2024-05-29 14:46:47,690] DEEPMD INFO Energy RMSE/Natoms : 2.423189e-01 eV
[2024-05-29 14:46:47,690] DEEPMD INFO Force MAE : 2.593189e-01 eV/A
[2024-05-29 14:46:47,690] DEEPMD INFO Force RMSE : 1.283905e+00 eV/A
[2024-05-29 14:46:47,690] DEEPMD INFO Virial MAE : 3.679828e+00 eV
[2024-05-29 14:46:47,690] DEEPMD INFO Virial RMSE : 5.281812e+00 eV
[2024-05-29 14:46:47,690] DEEPMD INFO Virial MAE/Natoms : 1.051379e-01 eV
[2024-05-29 14:46:47,690] DEEPMD INFO Virial RMSE/Natoms : 1.509089e-01 eV
[2024-05-29 14:46:47,690] DEEPMD INFO # -----------------------------------------------
[2024-05-29 14:46:47,690] DEEPMD INFO # ---------------output of dp test---------------
[2024-05-29 14:46:47,690] DEEPMD INFO # testing system : merged_validation_data/C2O3H4
[2024-05-29 14:47:44,314] DEEPMD INFO # number of test data : 2208
[2024-05-29 14:47:44,314] DEEPMD INFO Energy MAE : 6.144467e-01 eV
[2024-05-29 14:47:44,314] DEEPMD INFO Energy RMSE : 4.617050e+00 eV
[2024-05-29 14:47:44,314] DEEPMD INFO Energy MAE/Natoms : 6.827186e-02 eV
[2024-05-29 14:47:44,314] DEEPMD INFO Energy RMSE/Natoms : 5.130056e-01 eV
[2024-05-29 14:47:44,314] DEEPMD INFO Force MAE : 3.096173e-01 eV/A
[2024-05-29 14:47:44,314] DEEPMD INFO Force RMSE : 6.542296e-01 eV/A
[2024-05-29 14:47:44,314] DEEPMD INFO Virial MAE : 2.412724e+00 eV
[2024-05-29 14:47:44,314] DEEPMD INFO Virial RMSE : 3.296499e+00 eV
[2024-05-29 14:47:44,314] DEEPMD INFO Virial MAE/Natoms : 2.680805e-01 eV
[2024-05-29 14:47:44,314] DEEPMD INFO Virial RMSE/Natoms : 3.662777e-01 eV
[2024-05-29 14:47:44,314] DEEPMD INFO # -----------------------------------------------
[2024-05-29 14:47:44,314] DEEPMD INFO # ----------weighted average of errors-----------
[2024-05-29 14:47:44,314] DEEPMD INFO # number of systems : 2
[2024-05-29 14:47:44,315] DEEPMD INFO Energy MAE : 7.261597e-01 eV
[2024-05-29 14:47:44,315] DEEPMD INFO Energy RMSE : 6.402392e+00 eV
[2024-05-29 14:47:44,315] DEEPMD INFO Energy MAE/Natoms : 5.175004e-02 eV
[2024-05-29 14:47:44,315] DEEPMD INFO Energy RMSE/Natoms : 4.286043e-01 eV
[2024-05-29 14:47:44,315] DEEPMD INFO Force MAE : 2.738023e-01 eV/A
[2024-05-29 14:47:44,315] DEEPMD INFO Force RMSE : 1.138859e+00 eV/A
[2024-05-29 14:47:44,315] DEEPMD INFO Virial MAE : 2.905253e+00 eV
[2024-05-29 14:47:44,315] DEEPMD INFO Virial RMSE : 4.181720e+00 eV
[2024-05-29 14:47:44,315] DEEPMD INFO Virial MAE/Natoms : 2.047440e-01 eV
[2024-05-29 14:47:44,315] DEEPMD INFO Virial RMSE/Natoms : 3.014352e-01 eV
[2024-05-29 14:47:44,315] DEEPMD INFO # -----------------------------------------------
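
As a sanity check on the summary block, the weighted averages can be recomputed from the two per-system results in the log above. This is a sketch under stated assumptions: energy errors are weighted by frame count and force errors by frames times atoms, and the atom counts (35 and 9) are inferred from the system names C2O29H4 and C2O3H4. These weights reproduce the logged numbers, but that does not confirm how DeePMD-kit computes them internally.

```python
import math

# Per-system values copied from the log; "natoms" is inferred from the names.
systems = [
    {"frames": 1404, "natoms": 35,
     "e_mae": 9.018451e-01, "e_rmse": 8.481161e+00,
     "f_mae": 2.593189e-01, "f_rmse": 1.283905e+00},
    {"frames": 2208, "natoms": 9,
     "e_mae": 6.144467e-01, "e_rmse": 4.617050e+00,
     "f_mae": 3.096173e-01, "f_rmse": 6.542296e-01},
]


def weighted_mae(values, weights):
    """Weighted arithmetic mean of per-system MAE values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def weighted_rmse(values, weights):
    """Combine per-system RMSE values by averaging their squares."""
    return math.sqrt(sum(v * v * w for v, w in zip(values, weights)) / sum(weights))


frame_w = [s["frames"] for s in systems]                # assumed energy weights
atom_w = [s["frames"] * s["natoms"] for s in systems]   # assumed force weights

energy_mae = weighted_mae([s["e_mae"] for s in systems], frame_w)
energy_rmse = weighted_rmse([s["e_rmse"] for s in systems], frame_w)
force_mae = weighted_mae([s["f_mae"] for s in systems], atom_w)
force_rmse = weighted_rmse([s["f_rmse"] for s in systems], atom_w)
```

RMSE values are combined through their squares because a plain average of the two RMSE figures would understate the contribution of the system with the larger error.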

PhelanShao added the bug label May 10, 2024

iProzd self-assigned this May 11, 2024

AnguseZhang mentioned this issue May 20, 2024

[BUG] dp test encounters killed problem on a single A100 machine where batch_size is only 10. #3797

Closed
