Issue with dp --pt test and validation dataset size #3766

Open
PhelanShao opened this issue May 10, 2024 · 1 comment
PhelanShao commented May 10, 2024

Bug summary

I trained a model from scratch for 500k steps with "descriptor": "dpa2" and then tested it on a merged validation dataset. The merged dataset contains 7290 frames from two sources: C2O29H4_1124 and C2O3H4_6166.

Running dp test failed with: The size of tensor a (17) must match the size of tensor b (25) at non-singleton dimension 1. The failure appears to depend on the number of frames tested, since passing a smaller value to -n lets the test run successfully.
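
For context, this message is the generic one PyTorch raises when two tensors cannot be broadcast together: sizes are compared dimension by dimension starting from the right, and each pair must be equal or contain a 1. The sketch below mimics that rule in plain Python. The shapes are hypothetical, chosen only so that the mismatch lands on dimension 1 with sizes 17 and 25; they are not taken from the model.

```python
from typing import Optional, Sequence, Tuple


def first_broadcast_mismatch(
    shape_a: Sequence[int], shape_b: Sequence[int]
) -> Optional[Tuple[int, int, int]]:
    """Return (dim, size_a, size_b) for the first incompatible dimension,
    scanning from the trailing dimension, or None if the shapes broadcast."""
    ndim = max(len(shape_a), len(shape_b))
    # Left-pad the shorter shape with 1s so both have the same rank.
    a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)
    for dim in range(ndim - 1, -1, -1):
        # Sizes are compatible when they are equal or one of them is 1.
        if a[dim] != b[dim] and a[dim] != 1 and b[dim] != 1:
            return dim, a[dim], b[dim]
    return None


# Hypothetical shapes that disagree only on dimension 1.
print(first_broadcast_mismatch((4, 17, 3), (4, 25, 3)))  # (1, 17, 25)
# A size of 1 is a "singleton" dimension and broadcasts without error.
print(first_broadcast_mismatch((4, 1, 3), (4, 25, 3)))   # None
```

Read this way, the error says that two intermediate tensors disagree on their second axis (17 versus 25); it does not say which quantity that axis represents.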

DeePMD-kit Version

v3.0.0a1.dev81+g23f67a13

Backend and its version

PyTorch 2.0.0, CUDA cu117

How did you download the software?

Others (write below)

Input Files, Running Commands, Error Log, etc.

Command
dp --pt test -m model.ckpt.pt -s /share/20240508/merged_validation_data/
Error Log
root@bohrium-25571-1132203:/share/20240508# dp --pt test -m model.ckpt.pt -s /share/20240508/C2O29H4/ -n 100 -d results
/opt/mamba/lib/python3.10/site-packages/torch/cuda/__init__.py:107: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
[2024-05-11 22:04:24,536] DEEPMD WARNING To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, DP_INTRA_OP_PARALLELISM_THREADS, and DP_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
[2024-05-11 22:04:26,714] DEEPMD WARNING You can use the environment variable DP_INFER_BATCH_SIZE tocontrol the inference batch size (nframes * natoms). The default value is 1024.
[2024-05-11 22:04:28,443] DEEPMD WARNING You can use the environment variable DP_INFER_BATCH_SIZE tocontrol the inference batch size (nframes * natoms). The default value is 1024.
[2024-05-11 22:04:28,444] DEEPMD INFO # ---------------output of dp test---------------
[2024-05-11 22:04:28,444] DEEPMD INFO # testing system : /share/20240508/C2O29H4
Traceback (most recent call last):
  File "/opt/mamba/bin/dp", line 8, in <module>
    sys.exit(main())
  File "/opt/mamba/lib/python3.10/site-packages/deepmd/main.py", line 807, in main
    deepmd_main(args)
  File "/opt/mamba/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 64, in main
    test(**dict_args)
  File "/opt/mamba/lib/python3.10/site-packages/deepmd/entrypoints/test.py", line 147, in test
    err = test_ener(
  File "/opt/mamba/lib/python3.10/site-packages/deepmd/entrypoints/test.py", line 337, in test_ener
    ret = dp.eval(
  File "/opt/mamba/lib/python3.10/site-packages/deepmd/infer/deep_pot.py", line 158, in eval
    results = self.deep_eval.eval(
  File "/opt/mamba/lib/python3.10/site-packages/deepmd/pt/infer/deep_eval.py", line 267, in eval
    out = self._eval_func(self._eval_model, numb_test, natoms)(
  File "/opt/mamba/lib/python3.10/site-packages/deepmd/pt/infer/deep_eval.py", line 339, in eval_func
    return self.auto_batch_size.execute_all(
  File "/opt/mamba/lib/python3.10/site-packages/deepmd/pt/utils/auto_batch_size.py", line 83, in execute_all
    n_batch, result = self.execute(execute_with_batch_size, index, natoms)
  File "/opt/mamba/lib/python3.10/site-packages/deepmd/utils/batch_size.py", line 111, in execute
    raise e
  File "/opt/mamba/lib/python3.10/site-packages/deepmd/utils/batch_size.py", line 108, in execute
    n_batch, result = callable(max(batch_nframes, 1), start_index)
  File "/opt/mamba/lib/python3.10/site-packages/deepmd/pt/utils/auto_batch_size.py", line 59, in execute_with_batch_size
    return (end_index - start_index), callable(
  File "/opt/mamba/lib/python3.10/site-packages/deepmd/pt/infer/deep_eval.py", line 409, in _eval_model
    batch_output = model(
  File "/opt/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/mamba/lib/python3.10/site-packages/deepmd/pt/train/wrapper.py", line 173, in forward
    model_pred = self.model[task_key](**input_dict)
  File "/opt/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: The size of tensor a (17) must match the size of tensor b (25) at non-singleton dimension 1

root@bohrium-25571-1132203:/share/20240508# dp --pt test -m model.ckpt.pt -s /share/20240508/C2O29H4/ -n 10 -d results
/opt/mamba/lib/python3.10/site-packages/torch/cuda/__init__.py:107: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
[2024-05-11 22:05:40,841] DEEPMD WARNING To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, DP_INTRA_OP_PARALLELISM_THREADS, and DP_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
[2024-05-11 22:05:43,028] DEEPMD WARNING You can use the environment variable DP_INFER_BATCH_SIZE tocontrol the inference batch size (nframes * natoms). The default value is 1024.
[2024-05-11 22:05:44,759] DEEPMD WARNING You can use the environment variable DP_INFER_BATCH_SIZE tocontrol the inference batch size (nframes * natoms). The default value is 1024.
[2024-05-11 22:05:44,760] DEEPMD INFO # ---------------output of dp test---------------
[2024-05-11 22:05:44,760] DEEPMD INFO # testing system : /share/20240508/C2O29H4
[2024-05-11 22:05:51,862] DEEPMD INFO # number of test data : 10
[2024-05-11 22:05:51,862] DEEPMD INFO Energy MAE : 1.556307e+00 eV
[2024-05-11 22:05:51,862] DEEPMD INFO Energy RMSE : 2.110725e+00 eV
[2024-05-11 22:05:51,862] DEEPMD INFO Energy MAE/Natoms : 4.446592e-02 eV
[2024-05-11 22:05:51,862] DEEPMD INFO Energy RMSE/Natoms : 6.030642e-02 eV
[2024-05-11 22:05:51,862] DEEPMD INFO Force MAE : 5.084302e-01 eV/A
[2024-05-11 22:05:51,862] DEEPMD INFO Force RMSE : 7.582858e-01 eV/A
[2024-05-11 22:05:51,863] DEEPMD INFO Virial MAE : 6.835131e+00 eV
[2024-05-11 22:05:51,863] DEEPMD INFO Virial RMSE : 9.148349e+00 eV
[2024-05-11 22:05:51,863] DEEPMD INFO Virial MAE/Natoms : 1.952894e-01 eV
[2024-05-11 22:05:51,863] DEEPMD INFO Virial RMSE/Natoms : 2.613814e-01 eV
[2024-05-11 22:05:51,911] DEEPMD INFO # -----------------------------------------------

Pointing -s at a single validation set (before merging) gave the same error. Adding -n 100 to the command made it work:

dp --pt test -m model.ckpt.pt -s /share/20240508/validation_data/ -n 100

Returning to the merged dataset, the same option was not enough:
dp --pt test -m model.ckpt.pt -s /share/20240508/merged_validation_data/ -n 100

This failed with the same error: The size of tensor a (17) must match the size of tensor b (25) at non-singleton dimension 1.

Reducing -n to 10 worked:
dp --pt test -m model.ckpt.pt -s /share/20240508/merged_validation_data/ -n 10
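
One way to read these observations is against the inference batch size mentioned in the log warning (nframes * natoms, default 1024). The sketch below assumes that the number of frames per batch is the batch size integer-divided by the atom count, with a floor of 1 as in the `max(batch_nframes, 1)` line of the traceback. It takes 35 atoms per frame for C2O29H4 (2 + 29 + 4, which matches the ratio of Energy MAE to Energy MAE/Natoms in the successful run). The function names are illustrative and are not DeePMD-kit APIs.

```python
from typing import List, Tuple


def frames_per_batch(batch_size: int, natoms: int) -> int:
    """Assumed rule: batch_size counts nframes * natoms, at least one frame."""
    return max(batch_size // natoms, 1)


def batch_ranges(nframes: int, batch_size: int, natoms: int) -> List[Tuple[int, int]]:
    """Split nframes into consecutive [start, end) frame ranges."""
    step = frames_per_batch(batch_size, natoms)
    return [(start, min(start + step, nframes)) for start in range(0, nframes, step)]


NATOMS = 35          # C2O29H4: 2 + 29 + 4 atoms (assumption from the system name)
BATCH_SIZE = 1024    # default DP_INFER_BATCH_SIZE quoted in the log warning

print(frames_per_batch(BATCH_SIZE, NATOMS))   # 29
print(batch_ranges(10, BATCH_SIZE, NATOMS))   # [(0, 10)]
print(batch_ranges(100, BATCH_SIZE, NATOMS))  # [(0, 29), (29, 58), (58, 87), (87, 100)]
```

Under these assumptions, -n 10 is evaluated as a single batch of 10 frames, while -n 100 starts with a batch of 29 frames. This is only a correlation with the failing cases; it does not by itself establish the cause.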

Steps to Reproduce

1. Train the model with "descriptor": "dpa2" from scratch for 500k steps, then copy the resulting model.ckpt.pt.
2. Merge multiple validation datasets into one dataset named merged_validation_data (7290 frames, C2O29H4_1124, C2O3H4_6166).
3. Run the test command:
dp --pt test -m model.ckpt.pt -s /share/20240508/merged_validation_data/
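
To narrow down where the failure starts, the -n value could be bisected between a value known to work (10) and one known to fail (100). The following is a minimal sketch, assuming that dp exits with a non-zero status when the traceback above is raised and that failures are monotone in -n (every value above the threshold fails). The helper names are hypothetical; only the command-line flags are taken from the commands in this report.

```python
import subprocess
from typing import Callable, List


def build_cmd(model: str, system: str, n: int) -> List[str]:
    """Assemble the dp test command used in this report with a given -n."""
    return ["dp", "--pt", "test", "-m", model, "-s", system, "-n", str(n)]


def run_ok(model: str, system: str, n: int) -> bool:
    """Run dp test once; True if the process exits with status 0 (assumption)."""
    return subprocess.run(build_cmd(model, system, n), capture_output=True).returncode == 0


def smallest_failing_n(works: int, fails: int, ok: Callable[[int], bool]) -> int:
    """Bisect between a working and a failing -n; assumes monotone failure."""
    lo, hi = works, fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if ok(mid):
            lo = mid
        else:
            hi = mid
    return hi


# Example call (not executed here, since it needs the model and data on disk):
# smallest_failing_n(
#     10, 100,
#     lambda n: run_ok("model.ckpt.pt", "/share/20240508/merged_validation_data/", n),
# )
```

Each probe runs a full dp test, so the search costs about seven runs for the 10 to 100 range.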

Further Information, Files, and Links

registry.dp.tech/dptech/prod-157/deepmd-kit:202Q1
model_validation_data.zip
model:
https://drive.google.com/file/d/1lVAJFZBnBr2rb-aevxR_nLdPBp_mZZFB/view?usp=drive_link
dpa2_input:
input.json

No response

@Chengqian-Zhang
Collaborator

I used:

  • DeePMD-kit version v3.0.0a1.dev81+g23f67a13
  • PyTorch version 2.0.0+cu117
  • the model checkpoint and validation data you provided

I could not reproduce the error.

My output is:
(base) root@bohrium-11461-1141514:~/issue# dp --pt test -m model.ckpt-500000.pt -s merged_validation_data
[2024-05-29 14:43:43,099] DEEPMD WARNING To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, DP_INTRA_OP_PARALLELISM_THREADS, and DP_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
[2024-05-29 14:43:48,504] DEEPMD INFO # ---------------output of dp test---------------
[2024-05-29 14:43:48,504] DEEPMD INFO # testing system : merged_validation_data/C2O29H4
[2024-05-29 14:43:52,597] DEEPMD INFO Adjust batch size from 1024 to 512
/opt/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py:1501: UserWarning: operator() sees varying value in profiling, ignoring and this should be handled by GUARD logic (Triggered internally at ../third_party/nvfuser/csrc/parser.cpp:3777.)
return forward_call(*args, **kwargs)
[2024-05-29 14:43:54,826] DEEPMD INFO Adjust batch size from 512 to 256
[2024-05-29 14:46:47,689] DEEPMD INFO # number of test data : 1404
[2024-05-29 14:46:47,690] DEEPMD INFO Energy MAE : 9.018451e-01 eV
[2024-05-29 14:46:47,690] DEEPMD INFO Energy RMSE : 8.481161e+00 eV
[2024-05-29 14:46:47,690] DEEPMD INFO Energy MAE/Natoms : 2.576700e-02 eV
[2024-05-29 14:46:47,690] DEEPMD INFO Energy RMSE/Natoms : 2.423189e-01 eV
[2024-05-29 14:46:47,690] DEEPMD INFO Force MAE : 2.593189e-01 eV/A
[2024-05-29 14:46:47,690] DEEPMD INFO Force RMSE : 1.283905e+00 eV/A
[2024-05-29 14:46:47,690] DEEPMD INFO Virial MAE : 3.679828e+00 eV
[2024-05-29 14:46:47,690] DEEPMD INFO Virial RMSE : 5.281812e+00 eV
[2024-05-29 14:46:47,690] DEEPMD INFO Virial MAE/Natoms : 1.051379e-01 eV
[2024-05-29 14:46:47,690] DEEPMD INFO Virial RMSE/Natoms : 1.509089e-01 eV
[2024-05-29 14:46:47,690] DEEPMD INFO # -----------------------------------------------
[2024-05-29 14:46:47,690] DEEPMD INFO # ---------------output of dp test---------------
[2024-05-29 14:46:47,690] DEEPMD INFO # testing system : merged_validation_data/C2O3H4
[2024-05-29 14:47:44,314] DEEPMD INFO # number of test data : 2208
[2024-05-29 14:47:44,314] DEEPMD INFO Energy MAE : 6.144467e-01 eV
[2024-05-29 14:47:44,314] DEEPMD INFO Energy RMSE : 4.617050e+00 eV
[2024-05-29 14:47:44,314] DEEPMD INFO Energy MAE/Natoms : 6.827186e-02 eV
[2024-05-29 14:47:44,314] DEEPMD INFO Energy RMSE/Natoms : 5.130056e-01 eV
[2024-05-29 14:47:44,314] DEEPMD INFO Force MAE : 3.096173e-01 eV/A
[2024-05-29 14:47:44,314] DEEPMD INFO Force RMSE : 6.542296e-01 eV/A
[2024-05-29 14:47:44,314] DEEPMD INFO Virial MAE : 2.412724e+00 eV
[2024-05-29 14:47:44,314] DEEPMD INFO Virial RMSE : 3.296499e+00 eV
[2024-05-29 14:47:44,314] DEEPMD INFO Virial MAE/Natoms : 2.680805e-01 eV
[2024-05-29 14:47:44,314] DEEPMD INFO Virial RMSE/Natoms : 3.662777e-01 eV
[2024-05-29 14:47:44,314] DEEPMD INFO # -----------------------------------------------
[2024-05-29 14:47:44,314] DEEPMD INFO # ----------weighted average of errors-----------
[2024-05-29 14:47:44,314] DEEPMD INFO # number of systems : 2
[2024-05-29 14:47:44,315] DEEPMD INFO Energy MAE : 7.261597e-01 eV
[2024-05-29 14:47:44,315] DEEPMD INFO Energy RMSE : 6.402392e+00 eV
[2024-05-29 14:47:44,315] DEEPMD INFO Energy MAE/Natoms : 5.175004e-02 eV
[2024-05-29 14:47:44,315] DEEPMD INFO Energy RMSE/Natoms : 4.286043e-01 eV
[2024-05-29 14:47:44,315] DEEPMD INFO Force MAE : 2.738023e-01 eV/A
[2024-05-29 14:47:44,315] DEEPMD INFO Force RMSE : 1.138859e+00 eV/A
[2024-05-29 14:47:44,315] DEEPMD INFO Virial MAE : 2.905253e+00 eV
[2024-05-29 14:47:44,315] DEEPMD INFO Virial RMSE : 4.181720e+00 eV
[2024-05-29 14:47:44,315] DEEPMD INFO Virial MAE/Natoms : 2.047440e-01 eV
[2024-05-29 14:47:44,315] DEEPMD INFO Virial RMSE/Natoms : 3.014352e-01 eV
[2024-05-29 14:47:44,315] DEEPMD INFO # -----------------------------------------------
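
One visible difference between the two runs is that the log in the original report begins with a CUDA initialization warning (Error 804), which does not appear in the output above. A small script like the following could be run on both machines to compare environments. It is only a sketch: the `deepmd-kit` distribution name passed to `importlib.metadata` is an assumption, and the script reports `None` for anything that is not installed.

```python
import platform
from importlib import metadata


def collect_env() -> dict:
    """Gather version and device information relevant to this report."""
    info = {"python": platform.python_version()}
    try:
        # Distribution name is an assumption; adjust if installed differently.
        info["deepmd-kit"] = metadata.version("deepmd-kit")
    except metadata.PackageNotFoundError:
        info["deepmd-kit"] = None
    try:
        import torch
    except ImportError:
        info["torch"] = None
        return info
    info["torch"] = torch.__version__
    info["cuda_build"] = torch.version.cuda           # CUDA version torch was built with
    info["cuda_available"] = torch.cuda.is_available()  # False if CUDA init fails
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

If `cuda_available` differs between the two machines, the test was likely run on different devices, which would be worth ruling out before comparing results.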

Status: Backlog