Adds Ascend platform adaptation code #1881

Open

wants to merge 19 commits into base: devel
Conversation

ZhengdQin (Contributor):

  1. Add a transfer-to-ascend module: one can use the command `dp transfer-to-ascend mix_precision -i water.pb -o Ascend_transfer.pb` to transfer a model to a mixed-precision Ascend_transfer.pb that can execute on the Ascend platform (see the CLI sketch below).
  2. Modify the dp test module for the Ascend platform.
  3. Modify LAMMPS + DeepMD for the Ascend platform.
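
A minimal, stand-alone sketch of the CLI shape described in item 1, assuming deepmd-kit's argparse-based `dp` entry point; the subcommand and flag names follow the PR description, everything else (parser wiring, defaults) is illustrative:

```python
# Illustrative mock-up of the subcommand described above; this is not the
# PR's actual registration code in deepmd/entrypoints/main.py.
import argparse

parser = argparse.ArgumentParser(prog="dp")
subparsers = parser.add_subparsers(dest="command")

transfer = subparsers.add_parser(
    "transfer-to-ascend",
    help="transfer a frozen model to a mixed-precision Ascend model")
transfer.add_argument("mode", choices=["mix_precision"])
transfer.add_argument("-i", "--input", default="frozen_model.pb",
                      help="the input frozen model")
transfer.add_argument("-o", "--output", default="Ascend_transfer.pb",
                      help="the transferred mixed-precision model")

# mirrors the command quoted in the PR description
args = parser.parse_args(["transfer-to-ascend", "mix_precision",
                          "-i", "water.pb", "-o", "Ascend_transfer.pb"])
print(args.command, args.input, args.output)
```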

codecov-commenter commented Aug 30, 2022:

Codecov Report

Patch coverage: 48.64%; project coverage change: -0.54% ⚠️

Comparison is base (58bed2b) 78.00% compared to head (563f312) 77.47%.

❗ Current head 563f312 differs from pull request most recent head d0261d5. Consider uploading reports for the commit d0261d5 to get more accurate results

Additional details and impacted files
```
@@            Coverage Diff             @@
##            devel    #1881      +/-   ##
==========================================
- Coverage   78.00%   77.47%   -0.54%
==========================================
  Files         118      116       -2
  Lines        9853    10139     +286
==========================================
+ Hits         7686     7855     +169
- Misses       2167     2284     +117
```
| Impacted Files | Coverage Δ |
| --- | --- |
| deepmd/utils/convert.py | 14.65% <ø> (+0.49%) ⬆️ |
| deepmd/infer/deep_pot.py | 45.16% <9.44%> (-24.95%) ⬇️ |
| deepmd/entrypoints/convert.py | 15.38% <50.00%> (ø) |
| deepmd/entrypoints/transfer.py | 75.59% <58.06%> (+3.47%) ⬆️ |
| deepmd/utils/transfer_to_ascend.py | 73.68% <73.68%> (ø) |
| deepmd/entrypoints/transfer_to_ascend.py | 80.00% <80.00%> (ø) |
| deepmd/train/trainer.py | 80.68% <85.71%> (+0.02%) ⬆️ |
| deepmd/entrypoints/__init__.py | 100.00% <100.00%> (ø) |
| deepmd/entrypoints/main.py | 92.12% <100.00%> (+4.83%) ⬆️ |
| deepmd/entrypoints/train.py | 88.39% <100.00%> (+0.26%) ⬆️ |
| ... and 18 more | |


☔ View full report at Codecov.

njzjz (Member) commented Aug 30, 2022:

The title should be more precise.

github-actions bot removed the Gromacs label Aug 31, 2022
deepmd/utils/network.py:

```python
b_initializer,
trainable = trainable)
variable_summaries(b, 'bias')
if final_layer and GLOBAL_ASCEND_OUT_PRECISION:
```
Member:

@denghuilu It looks similar to mixed_prec. However, the weight for mixed_prec is cast, but the weight for GLOBAL_ASCEND_OUT_PRECISION is not. What do you think about it?

denghuilu (Member) commented Sep 5, 2022:

It seems like they have already set the precisions before running the networks:

```python
jdata["model"]["descriptor"]["precision"] = "float16"
jdata["model"]["fitting_net"]["precision"] = "float16"
```

ZhengdQin (Contributor Author):

Thanks, we modify the original precision directly. We tried different mixed-precision models on the Ascend platform and found that keeping GLOBAL_ASCEND_OUT_PRECISION at float32 (only the last bias-add is float32) is important to ensure the accuracy of the transferred model, so we cast every weight except the last bias-add.
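
A hedged sketch of the casting rule this reply describes, operating on a frozen TF1 GraphDef: every float32 constant is cast to float16 except those matching a keep-in-float32 name filter (the final bias). The `keep_fp32` filter and node names are hypothetical; the PR's real logic lives in deepmd/utils/transfer_to_ascend.py, and a full transfer would also have to rewrite the dtypes of the consuming ops:

```python
import numpy as np
import tensorflow as tf
from tensorflow.python.framework import tensor_util

def cast_consts_to_half(graph_def, keep_fp32=("final_layer/bias",)):
    """Cast float32 Const nodes to float16, sparing the final bias-add."""
    for node in graph_def.node:
        if node.op != "Const":
            continue
        if any(key in node.name for key in keep_fp32):
            continue  # keep the last bias-add in float32 for accuracy
        tensor = node.attr["value"].tensor
        if tensor.dtype != tf.float32.as_datatype_enum:
            continue
        half = tensor_util.MakeNdarray(tensor).astype(np.float16)
        node.attr["dtype"].type = tf.float16.as_datatype_enum
        tensor.CopyFrom(tf.make_tensor_proto(half, dtype=tf.float16))
    return graph_def
```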

Member:

It has the same aim as mixed_prec, so it would be better to merge these two variables.

ZhengdQin (Contributor Author):

Thanks for the comment! Our mixed-precision model is different from the model mixed_prec defines: only the last bias-add is float32, so reusing mixed_prec would require changing the code logic and would make it harder to understand.

wanghan-iapcm (Collaborator) left a comment:

Could you please add documentation on how to use deepmd-kit on Ascend?

Comment on lines 304 to 308:

```python
if not self.is_compress:
    if self.is_ascend_transfer:
        self._init_from_frz_model()
```

Collaborator:

Do we need a new case for ascend transfer? Is it the same as training with the --init-model option?

ZhengdQin (Contributor Author):

Thanks for the comment. The ascend transfer is very similar to training with --init-model; the reasons we cannot use it are the following:

  1. We may add new functions to the ascend transfer module in the future, so developing a new module gives better extensibility.
  2. We cannot use dp train with --init-model directly, since we only build a model without training. At the same time, we can automatically modify the input.json; in this way, build, freeze, and transfer finish in one command (see the sketch below).
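
A rough sketch of the one-command flow item 2 describes: rewrite the precisions in input.json automatically, build without training, freeze, then transfer. All three helpers are `NotImplementedError` placeholders here, not deepmd-kit functions:

```python
import json

def build_model(jdata, init_from):
    """Placeholder: build the graph from jdata, initialized from a model."""
    raise NotImplementedError

def freeze_graph(graph):
    """Placeholder: freeze the built graph into a GraphDef."""
    raise NotImplementedError

def transfer_weights(frozen, old_model, output_model):
    """Placeholder: move the trained weights into the rebuilt graph."""
    raise NotImplementedError

def transfer_to_ascend(input_model, output_model, input_json="input.json"):
    with open(input_json) as f:
        jdata = json.load(f)
    # rewrite the precisions automatically, as described above
    jdata["model"]["descriptor"]["precision"] = "float16"
    jdata["model"]["fitting_net"]["precision"] = "float16"
    graph = build_model(jdata, init_from=input_model)  # build only, no training
    transfer_weights(freeze_graph(graph), input_model, output_model)
```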


```diff
@@ -458,6 +458,7 @@ def model_args ():
     doc_sw_rmin = 'The lower boundary of the interpolation between short-range tabulated interaction and DP. It is only required when `use_srtab` is provided.'
     doc_sw_rmax = 'The upper boundary of the interpolation between short-range tabulated interaction and DP. It is only required when `use_srtab` is provided.'
     doc_compress_config = 'Model compression configurations'
+    doc_ascend_transfer = 'Model transfer to ascend mix-precision model'
```
Collaborator:

This argument would not be needed if ascend training is the same as --init-model training.

ZhengdQin (Contributor Author):

Thanks, the ascend transfer is different from --init-model training; please see the detailed explanation in the reply above.

```python
b_initializer,
trainable = trainable)
variable_summaries(b, 'bias')
if final_layer and GLOBAL_ASCEND_OUT_PRECISION is not None:
```
Collaborator:

I would suggest adding an option out_precision to the interface (default: GLOBAL_TF_FLOAT_PRECISION), so that not only Ascend is supported.
Note: changing the behavior of a function via a global variable is dangerous.

ZhengdQin (Contributor Author):

Good idea! We have removed the global variable and added the option out_precision to the interface.
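
A small sketch of the suggested interface change: the output precision becomes an explicit argument with the global TF precision as its default, rather than a hidden module-level global. Names here are illustrative, not deepmd-kit's actual signature:

```python
import tensorflow.compat.v1 as tf

GLOBAL_TF_FLOAT_PRECISION = tf.float64  # stand-in for deepmd's global default

def final_bias_add(hidden, bias, out_precision=GLOBAL_TF_FLOAT_PRECISION):
    """Apply the last bias-add in an explicitly requested precision."""
    if hidden.dtype != out_precision:
        hidden = tf.cast(hidden, out_precision)
    if bias.dtype != out_precision:
        bias = tf.cast(bias, out_precision)
    return tf.nn.bias_add(hidden, bias)
```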

```python
GLOBAL_ASCEND_OUT_PRECISION,
b_initializer,
trainable = trainable)
variable_summaries(b, 'bias')
```
Collaborator:

Move variable_summaries out of the if-else?

ZhengdQin (Contributor Author):

Thanks, we have fixed it.

Comment on lines 39 to 41:

```cpp
* @brief Initialize the DP.
* @param[in] model The name of the frozen model file.
* @param[in] gpu_rank The GPU rank. Default is 0.
```
Collaborator:

Please update the docstring.

ZhengdQin (Contributor Author):

Thanks, we have added the reference.

```cpp
* @param[in] model The name of the frozen model file.
* @param[in] gpu_rank The GPU rank. Default is 0.
**/
void init (const std::string & model, const int & nloc, const int & gpu_rank = 0);
```
Collaborator:

Do you require nlocal to be a constant? That is a very strict restriction, as the number of atoms in a local region may change during MD simulations.

ZhengdQin (Contributor Author):

Yes, nlocal for the Ascend platform is a constant, because we pad the number of atoms of each type. Since nlocal changes during inference, we increase the value to 1.1 times the original nlocal (see the sketch below).
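
A tiny sketch of the padding rule from this reply: fix the per-type atom counts at 1.1 times their initial values so that moderate fluctuations of nlocal during MD stay inside the constant-shape Ascend graph. The helper itself is illustrative; only the 1.1 factor comes from the reply:

```python
import math

def padded_type_counts(type_count, margin=0.1):
    # one padded slot count per atom type; the fixed-shape Ascend model is
    # built against these counts, not the instantaneous nlocal
    return [math.ceil(c * (1.0 + margin)) for c in type_count]

print(padded_type_counts([192, 384]))  # -> [212, 423]
```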

```cpp
    type_count[type[ii]-1] ++;
}
deep_pot.init_graph (arg[0], type_count, get_file_content(arg[0]));
deep_pot.init (arg[0], nlocal, get_node_rank());
```
Collaborator:

nlocal changes if the number of subregions > 1.

ZhengdQin (Contributor Author):

Thanks, we pad the nlocal value, so it is fine as long as the fluctuation stays within the padded range.

wanghan-iapcm (Collaborator):

Please provide a proper title for this PR.

amcadmus (Member) commented Sep 7, 2022:

Is it possible to provide unit tests for the contributed code?

denghuilu (Member):

It seems that on non-data-center GPU cards, the transferred model shows an impressive speedup. I have tested the new model in a local 1080 Ti environment and achieved a speedup by a factor of 7.5 (water benchmark system, 12,288 atoms):

double-precision original model: [benchmark screenshot]

ascend-method transferred model: [benchmark screenshot]

njzjz (Member) commented Sep 8, 2022:

@denghuilu Are they the same model? The output looks different.

ZhengdQin changed the title from "Devel" to "Adds Ascend platform adaptation code" on Sep 14, 2022
ZhengdQin (Contributor Author):

> Is it possible to provide unit tests for the contributed code?

Good idea! We will add the tests in the next commit.

ZhengdQin closed this Sep 14, 2022
ZhengdQin deleted the devel branch September 14, 2022 09:57
ZhengdQin restored the devel branch September 14, 2022 09:58
ZhengdQin reopened this Sep 14, 2022
njzjz (Member) left a comment:

Please resolve conflicts

github-actions bot added the Docs label Sep 15, 2022
```python
----------
new_graph : tf.Graph
    orginal new graph
Returns :
```
Member:

Suggested change:

```diff
-Returns :
+Returns
```

ZhengdQin (Contributor Author):

Thanks, we have fixed it.

```python
----------
feed_dict : dict of tensor
    Session original feed_dict includes coord, type, box, mesh, natoms.
t_out : list of tensor
```
Member:

The type and the order are different from those in line 408. Please check which is right.

ZhengdQin (Contributor Author):

Thanks, we have fixed it.

denghuilu (Member) commented Oct 24, 2022:

> @denghuilu Are they the same model? The output looks different.

No, they are not the same model; the ascend-method transfer model was cast from the original model, so there may be some truncation error.
