Issue with dp --pt test and validation dataset size #3766
Comments
I use
I can't reproduce the same error as yours. My output is: |
Bug summary
I encountered an issue after training a model from scratch for 500k steps with "descriptor": "dpa2" and then testing it on a merged validation dataset. The merged validation dataset contains 7290 frames of data from two sources: C2O29H4_1124 and C2O3H4_6166.
Running dp test raised an error: The size of tensor a (17) must match the size of tensor b (25) at non-singleton dimension 1. It appears to be related to the size of the validation dataset, as adding the -n parameter with a smaller value lets the test run successfully.
DeePMD-kit Version
v3.0.0a1.dev81+g23f67a13
Backend and its version
PyTorch 2.0.0, CUDA cu117
How did you download the software?
Others (write below)
Input Files, Running Commands, Error Log, etc.
Command
dp --pt test -m model.ckpt.pt -s /share/20240508/merged_validation_data/
Error Log
root@bohrium-25571-1132203:/share/20240508# dp --pt test -m model.ckpt.pt -s /share/20240508/C2O29H4/ -n 100 -d results
/opt/mamba/lib/python3.10/site-packages/torch/cuda/__init__.py:107: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
[2024-05-11 22:04:24,536] DEEPMD WARNING To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, DP_INTRA_OP_PARALLELISM_THREADS, and DP_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
[2024-05-11 22:04:26,714] DEEPMD WARNING You can use the environment variable DP_INFER_BATCH_SIZE tocontrol the inference batch size (nframes * natoms). The default value is 1024.
[2024-05-11 22:04:28,443] DEEPMD WARNING You can use the environment variable DP_INFER_BATCH_SIZE tocontrol the inference batch size (nframes * natoms). The default value is 1024.
[2024-05-11 22:04:28,444] DEEPMD INFO # ---------------output of dp test---------------
[2024-05-11 22:04:28,444] DEEPMD INFO # testing system : /share/20240508/C2O29H4
Traceback (most recent call last):
File "/opt/mamba/bin/dp", line 8, in <module>
sys.exit(main())
File "/opt/mamba/lib/python3.10/site-packages/deepmd/main.py", line 807, in main
deepmd_main(args)
File "/opt/mamba/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 64, in main
test(**dict_args)
File "/opt/mamba/lib/python3.10/site-packages/deepmd/entrypoints/test.py", line 147, in test
err = test_ener(
File "/opt/mamba/lib/python3.10/site-packages/deepmd/entrypoints/test.py", line 337, in test_ener
ret = dp.eval(
File "/opt/mamba/lib/python3.10/site-packages/deepmd/infer/deep_pot.py", line 158, in eval
results = self.deep_eval.eval(
File "/opt/mamba/lib/python3.10/site-packages/deepmd/pt/infer/deep_eval.py", line 267, in eval
out = self._eval_func(self._eval_model, numb_test, natoms)(
File "/opt/mamba/lib/python3.10/site-packages/deepmd/pt/infer/deep_eval.py", line 339, in eval_func
return self.auto_batch_size.execute_all(
File "/opt/mamba/lib/python3.10/site-packages/deepmd/pt/utils/auto_batch_size.py", line 83, in execute_all
n_batch, result = self.execute(execute_with_batch_size, index, natoms)
File "/opt/mamba/lib/python3.10/site-packages/deepmd/utils/batch_size.py", line 111, in execute
raise e
File "/opt/mamba/lib/python3.10/site-packages/deepmd/utils/batch_size.py", line 108, in execute
n_batch, result = callable(max(batch_nframes, 1), start_index)
File "/opt/mamba/lib/python3.10/site-packages/deepmd/pt/utils/auto_batch_size.py", line 59, in execute_with_batch_size
return (end_index - start_index), callable(
File "/opt/mamba/lib/python3.10/site-packages/deepmd/pt/infer/deep_eval.py", line 409, in _eval_model
batch_output = model(
File "/opt/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/mamba/lib/python3.10/site-packages/deepmd/pt/train/wrapper.py", line 173, in forward
model_pred = self.model[task_key](**input_dict)
File "/opt/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: The size of tensor a (17) must match the size of tensor b (25) at non-singleton dimension 1
root@bohrium-25571-1132203:/share/20240508# dp --pt test -m model.ckpt.pt -s /share/20240508/C2O29H4/ -n 10 -d results
/opt/mamba/lib/python3.10/site-packages/torch/cuda/__init__.py:107: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
[2024-05-11 22:05:40,841] DEEPMD WARNING To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, DP_INTRA_OP_PARALLELISM_THREADS, and DP_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
[2024-05-11 22:05:43,028] DEEPMD WARNING You can use the environment variable DP_INFER_BATCH_SIZE tocontrol the inference batch size (nframes * natoms). The default value is 1024.
[2024-05-11 22:05:44,759] DEEPMD WARNING You can use the environment variable DP_INFER_BATCH_SIZE tocontrol the inference batch size (nframes * natoms). The default value is 1024.
[2024-05-11 22:05:44,760] DEEPMD INFO # ---------------output of dp test---------------
[2024-05-11 22:05:44,760] DEEPMD INFO # testing system : /share/20240508/C2O29H4
[2024-05-11 22:05:51,862] DEEPMD INFO # number of test data : 10
[2024-05-11 22:05:51,862] DEEPMD INFO Energy MAE : 1.556307e+00 eV
[2024-05-11 22:05:51,862] DEEPMD INFO Energy RMSE : 2.110725e+00 eV
[2024-05-11 22:05:51,862] DEEPMD INFO Energy MAE/Natoms : 4.446592e-02 eV
[2024-05-11 22:05:51,862] DEEPMD INFO Energy RMSE/Natoms : 6.030642e-02 eV
[2024-05-11 22:05:51,862] DEEPMD INFO Force MAE : 5.084302e-01 eV/A
[2024-05-11 22:05:51,862] DEEPMD INFO Force RMSE : 7.582858e-01 eV/A
[2024-05-11 22:05:51,863] DEEPMD INFO Virial MAE : 6.835131e+00 eV
[2024-05-11 22:05:51,863] DEEPMD INFO Virial RMSE : 9.148349e+00 eV
[2024-05-11 22:05:51,863] DEEPMD INFO Virial MAE/Natoms : 1.952894e-01 eV
[2024-05-11 22:05:51,863] DEEPMD INFO Virial RMSE/Natoms : 2.613814e-01 eV
[2024-05-11 22:05:51,911] DEEPMD INFO # -----------------------------------------------
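The DEEPMD warning in both logs describes the inference batch size as nframes * natoms with a default of 1024. As a rough sketch of what that budget means for this system (C2O29H4 has 2 + 29 + 4 = 35 atoms, consistent with the Energy MAE to MAE/Natoms ratio above), the snippet below estimates how many frames fit in the first batch. The integer division and the helper names are assumptions for illustration, and any later adjustment made by the auto batch-size logic is ignored; whether batch splitting is actually related to the error is not established here.

```python
def frames_in_first_batch(batch_size: int, natoms: int) -> int:
    """Estimate frames per batch when the budget is nframes * natoms.

    Assumption: integer division with a floor of one frame, mirroring the
    max(batch_nframes, 1) call visible in the traceback.
    """
    return max(batch_size // natoms, 1)


def fits_in_first_batch(n_test: int, batch_size: int, natoms: int) -> bool:
    """True if all n_test frames could be evaluated in a single first batch."""
    return n_test <= frames_in_first_batch(batch_size, natoms)


NATOMS = 2 + 29 + 4        # C2O29H4
DEFAULT_BATCH_SIZE = 1024  # default quoted in the DEEPMD warning

for n_test in (10, 100):   # the two -n values tried above
    single = fits_in_first_batch(n_test, DEFAULT_BATCH_SIZE, NATOMS)
    print(f"-n {n_test}: single first batch = {single}")
```

Under these assumptions, -n 10 fits in one batch while -n 100 does not, which matches the pattern of which runs succeeded but does not by itself identify the cause.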
When setting the dataset to a single validation set before merging, I encountered the same error.
Modifying the command to:
dp --pt test -m model.ckpt.pt -s /share/20240508/validation_data/ -n 100
worked.
Returning to the merged dataset, I ran:
dp --pt test -m model.ckpt.pt -s /share/20240508/merged_validation_data/ -n 100
This produced the same error: The size of tensor a (17) must match the size of tensor b (25) at non-singleton dimension 1.
Changing -n to 10 worked:
dp --pt test -m model.ckpt.pt -s /share/20240508/merged_validation_data/ -n 10
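To narrow down where the failure starts between -n 10 and -n 100, one option is to bisect on -n. The following is a minimal sketch, not part of DeePMD-kit: the helper names are hypothetical, only the flags already used in this report are passed to dp, and it assumes that failure is monotone in -n (every value above the threshold fails), which may not hold.

```python
import subprocess
from typing import Callable, List


def build_test_command(model: str, system: str, n: int) -> List[str]:
    """Build the same dp test command used in this report, with a given -n."""
    return ["dp", "--pt", "test", "-m", model, "-s", system, "-n", str(n)]


def dp_test_fails(model: str, system: str, n: int) -> bool:
    """Run dp test and report failure via a non-zero exit code (assumption)."""
    result = subprocess.run(
        build_test_command(model, system, n), capture_output=True, text=True
    )
    return result.returncode != 0


def smallest_failing_n(fails: Callable[[int], bool], lo: int, hi: int) -> int:
    """Bisect for the smallest failing n.

    Assumes fails(lo) is False, fails(hi) is True, and failure is monotone.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(mid):
            hi = mid
        else:
            lo = mid
    return hi


# Example call (not executed here):
# smallest_failing_n(
#     lambda n: dp_test_fails(
#         "model.ckpt.pt", "/share/20240508/merged_validation_data/", n
#     ),
#     lo=10,
#     hi=100,
# )
```

Reporting the smallest failing -n alongside the error may make the issue easier to reproduce.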
Steps to Reproduce
1. Train the model with "descriptor": "dpa2" from scratch for 500k steps, then copy model.ckpt.pt.
2. Merge multiple validation datasets into one dataset named merged_validation_data (7290 frames, C2O29H4_1124, C2O3H4_6166).
3. Run the test command:
dp --pt test -m model.ckpt.pt -s /share/20240508/merged_validation_data/
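Since step 2 merges data from two compositions, it may help to confirm how the merged directory is laid out before running the test. The sketch below lists the atom count of each system found under a dataset root. It assumes each system directory contains a type.raw file with one whitespace-separated integer per atom; the function name is hypothetical, and this is only a sanity check, not a diagnosis of the error.

```python
from pathlib import Path
from typing import Dict


def atoms_per_system(root: str) -> Dict[str, int]:
    """Map each system directory under root to its atom count.

    Assumption: each system directory holds a type.raw file listing one
    integer atom type per atom, separated by whitespace.
    """
    counts: Dict[str, int] = {}
    for type_file in sorted(Path(root).rglob("type.raw")):
        counts[str(type_file.parent)] = len(type_file.read_text().split())
    return counts


# Example call (path taken from the report, not executed here):
# for system, natoms in atoms_per_system(
#     "/share/20240508/merged_validation_data/"
# ).items():
#     print(system, natoms)
```

If the listing shows systems with different atom counts, attaching it to the report would clarify what "merged" means for this dataset.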
Further Information, Files, and Links
registry.dp.tech/dptech/prod-157/deepmd-kit:202Q1
model_validation_data.zip
model:
https://drive.google.com/file/d/1lVAJFZBnBr2rb-aevxR_nLdPBp_mZZFB/view?usp=drive_link
dpa2_input:
input.json
No response