
Unambiguous compression setup to resume properly #682

Merged: 46 commits into openvinotoolkit:develop on Jul 6, 2021

Conversation

@ljaljushkin (Contributor) commented Apr 27, 2021

Introduced a new way of resuming compression for PyTorch and TensorFlow.
The idea is to restore the compression state instead of building it from scratch according to the config.

This is essential for AutoQ/HAWQ/NAS-like algorithms that are not deterministic and depend on the input data.
As a result, there is a chance that a checkpoint saved after AutoQ in one NNCF run will not be loadable/resumable
in another NNCF run. Complete information on how the quantizers are set up in the model should be saved along with
the checkpoint, so that a quantized checkpoint can be loaded for evaluation at all times.

This information is saved by the CompressionState class.

PyTorch NOW

# Save: capture the model weights and the compression state separately
model_state_dict = compression_model.state_dict()
compression_state = ctrl.get_compression_state()
...
# Resume: rebuild the compression structure from the state, then load the weights
create_compressed_model(model, config, compression_state=compression_state)
load_state(model, model_state_dict, is_strict=True)

PyTorch BEFORE

model_state_dict = compression_model.state_dict()
ctrl_state = ctrl.get_state()

create_compressed_model(model, config, resuming_state_dict=model_state_dict)
ctrl.load_state(ctrl_state)

TensorFlow NOW

# Restore the compression state from a previously saved checkpoint
checkpoint = tf.train.Checkpoint(compression_state=TFCompressionStateLoader())
load_checkpoint(checkpoint, ckpt_path)
compression_state = checkpoint.compression_state.state

# Build the compressed model from the restored state, then load the model weights
compression_ctrl, compress_model = create_compressed_model(model, nncf_config, compression_state)
checkpoint = tf.train.Checkpoint(model=compress_model, compression_state=TFCompressionState(compression_ctrl))
load_checkpoint(checkpoint=checkpoint, ckpt_path=config.ckpt_path)

TensorFlow BEFORE

compression_ctrl, compress_model = create_compressed_model(model, nncf_config, should_init=not resume_training)
checkpoint = tf.train.Checkpoint(model=compress_model, compression_ctrl=compression_ctrl)
load_checkpoint(checkpoint=checkpoint, ckpt_path=config.ckpt_path)
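
For illustration, a PyTorch training script could bundle both pieces into a single file. This is only a sketch of the flow above, not part of the NNCF API: the checkpoint keys and the torch.save/torch.load plumbing are the user's choice, and it assumes create_compressed_model returns the controller and the wrapped model (as shown in the TF snippet) and that load_state is the helper used in the PyTorch snippets.

import torch

# Saving: keep the model weights and the compression state side by side
checkpoint = {
    'model_state_dict': compression_model.state_dict(),
    'compression_state': ctrl.get_compression_state(),
}
torch.save(checkpoint, 'nncf_checkpoint.pth')

# Resuming: restore the compression structure first, then load the weights
checkpoint = torch.load('nncf_checkpoint.pth')
ctrl, compression_model = create_compressed_model(
    model, config, compression_state=checkpoint['compression_state'])
load_state(compression_model, checkpoint['model_state_dict'], is_strict=True)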

@alexsu52 (Contributor) left a comment:

Are you going to introduce a common NNCFNetwork class in this PR?

@ljaljushkin (Contributor, Author) commented Apr 27, 2021

  • backward compatibility tests
  • doc strings
  • more tests for corner cases (not matching configs on resume)
  • fix CI (tests and pylint)
  • fix patches of 3rd party integration

Introduce a new way of saving and loading NNCF Compression State:

model_state_dict = compression_model.state_dict()
compression_state = ctrl.get_compression_state()
...
create_compressed_model(model, config, compression_state=compression_state)
load_state(model, model_state_dict, is_strict=True)

Instead of:

model_state_dict = compression_model.state_dict()
ctrl_state = ctrl.get_state()

create_compressed_model(model, config, resuming_state_dict=model_state_dict)
ctrl.load_state(ctrl_state)

Previously, it was done in the standard PyTorch way via a state_dict() call, which is defined for torch.nn.Module and its wrappers - NNCFNetwork and DistributedDataParallel/DataParallel (DDP, DP).
It's a dictionary of string keys mapping to PyTorch tensors.

As we discussed, for unambiguously restoring a compressed model we need two more custom structures besides torch tensors - the builder and controller states. For instance, the QuantizerSetup, which describes where to insert FQs, their dependencies and parameters.

class QuantizerSetupBase:
    def __init__(self):
        self.quantization_points = {}  # type: Dict[QuantizationPointId, QuantizationPointBase]
        self.unified_scale_groups = {}  # type: Dict[int, Set[QuantizationPointId]]
        self.shared_input_operation_set_groups = {}  # type: Dict[int, Set[QuantizationPointId]]
        self._next_unified_scale_gid = 0
        self._next_shared_inputs_gid = 0

Ideally, we would like to override state_dict() to include these 2 structures,
but all the approaches I am aware of lead to freezes in DDP:

  1. We can encode the builder/ctrl states into a ByteTensor (object -> json-compatible dict -> json str -> bytes -> ByteTensor); a rough sketch of this encoding is shown after this list.
    But we can't register a buffer for this tensor on NNCFNetwork init, which is required by DDP: we don't know about the builders at that moment - they are applied later by design. And as soon as we register the buffer outside of init, DDP hangs on broadcasting.
    Moreover, we can't guarantee identical sizes of these tensors on each GPU (in case of sophisticated initialization), which is also required by DDP.
  2. We can override state_dict() to return a dict like this:

{
  "model_state": super().state_dict(),
  "ctrl_state": ctrl.get_state(),
  "builder_state": builder_state
}

However, DDP hangs again, because it heavily relies on the parameters of modules and expects only PyTorch tensors.
Hence, state_dict() can't be overridden to include the builder and controller states.
  3. A new method of NNCFNetwork (e.g. get_checkpoint) is also unacceptable, because DDP/DP don't have it, and the user would need to extract NNCFNetwork each time:

if isinstance(module, DataParallel):
    module = module.module
checkpoint = module.get_checkpoint()

  4. A new method of the controller is the only remaining approach:

nncf_checkpoint = ctrl.get_nncf_checkpoint()
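
To make approach (1) above concrete, here is a minimal standalone sketch of the object -> JSON -> bytes -> ByteTensor round trip; the state contents are made up for illustration, and the DDP problems described above concern where such a tensor could be registered, not the encoding itself:

import json
import torch

state = {'builder_state': {'quantizer_setup': {}}, 'ctrl_state': {'scheduler': {'current_step': 0}}}

# object -> json-compatible dict -> json str -> bytes -> ByteTensor
raw = json.dumps(state).encode('utf-8')
state_tensor = torch.tensor(list(raw), dtype=torch.uint8)

# ...and back
restored = json.loads(bytes(state_tensor.tolist()).decode('utf-8'))
assert restored == state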

@daniil-lyakhov (Collaborator):

During debugging I found out that json can't handle the scheduler state when it is too big. For example, if I have current_step == 251199 with type <class 'numpy.int64'>, I get TypeError: Object of type 'int64' is not JSON serializable.

We have to either cast all values in get_state methods to native Python types (from NumPy) or expand the json functionality to handle numpy.int64.

@ljaljushkin (Contributor, Author):

During debugging I found out that json can't handle the scheduler state when it is too big. For example, if I have current_step == 251199 with type <class 'numpy.int64'>, I get TypeError: Object of type 'int64' is not JSON serializable.

We have to either cast all values in get_state methods to native Python types (from NumPy) or expand the json functionality to handle numpy.int64.

This comment is obsolete: a standard Python int is enough to represent very large numbers; we just need to cast numpy.int64 to it.
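
For reference, a small sketch of both options mentioned above - an explicit cast to int, or extending the JSON encoder (the encoder class name here is made up for illustration):

import json
import numpy as np

class NumpyIntEncoder(json.JSONEncoder):
    # Fall back to a plain Python int for any numpy integer type
    def default(self, obj):
        if isinstance(obj, np.integer):
            return int(obj)
        return super().default(obj)

scheduler_state = {'current_step': np.int64(251199)}
json.dumps({'current_step': int(scheduler_state['current_step'])})  # explicit cast
json.dumps(scheduler_state, cls=NumpyIntEncoder)                     # extended encoder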

@vshampor (Contributor):

Jenkins please retry a build

@ljaljushkin marked this pull request as ready for review on May 26, 2021 08:42
@vshampor (Contributor) left a comment:

Could you please add a comment, or mark the spots in the PR by comments of your own, that illustrate:

  1. the changes to the user flow that are mandatory after this PR in order for nothing to break (it would be good if there were no such changes at all)
  2. the exact way in which the user is supposed to save an NNCF checkpoint in their flow
  3. the exact way in which the user is supposed to load the NNCF checkpoint
  4. the additional operations that the NNCF algo developer should do in general in order to mark some or the other part of their algorithm data to become save-able and load-able from such checkpoints

I think that illustrating these points would help with the review.

@ljaljushkin (Contributor, Author):

Could you please add a comment, or mark the spots in the PR by comments of your own, that illustrate:

  1. the changes to the user flow that are mandatory after this PR in order for nothing to break (it would be good if there were no such changes at all)
  2. the exact way in which the user is supposed to save an NNCF checkpoint in their flow
  3. the exact way in which the user is supposed to load the NNCF checkpoint
  4. the additional operations that the NNCF algo developer should do in general in order to mark some or the other part of their algorithm data to become save-able and load-able from such checkpoints

I think that illustrating these points would help with the review.

Definitely makes sense, will do it shortly.
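
As a preview of point 4: the general pattern is that each builder/controller/scheduler exposes its persistent data as a JSON-serializable dict via a get_state()/load_state() pair. A rough, hypothetical sketch (the class and key names are illustrative only, not the final API):

class MyAlgoScheduler:
    def __init__(self):
        self.current_step = 0

    def get_state(self) -> dict:
        # Only JSON-serializable, native Python types should go into the state
        return {'current_step': int(self.current_step)}

    def load_state(self, state: dict) -> None:
        self.current_step = state['current_step']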

@alexsu52 (Contributor) left a comment:

Are you going to add support for this feature in the TF backend? According to the offline discussion, the issue with saving the compression state in the checkpoint is solved, and I don't see any concerns about supporting it in TF.

@ljaljushkin (Contributor, Author):

Are you going to add support for this feature in the TF backend? According to the offline discussion, the issue with saving the compression state in the checkpoint is solved, and I don't see any concerns about supporting it in TF.

Let's do it iteratively if there's no concern about the API; otherwise we would need extra effort to keep this branch merged with upcoming changes in develop.
BTW, wasn't it planned to involve @daniil-lyakhov in the TF part?

@alexsu52 (Contributor):

Are you going to add support for this feature in the TF backend? According to the offline discussion, the issue with saving the compression state in the checkpoint is solved, and I don't see any concerns about supporting it in TF.

Let's do it iteratively if there's no concern about the API; otherwise we would need extra effort to keep this branch merged with upcoming changes in develop.
BTW, wasn't it planned to involve @daniil-lyakhov in the TF part?

I don't have this in my plans.

@ljaljushkin (Contributor, Author) commented Jul 5, 2021

SOTA eval validation has FAILED 🤕 because of breaking changes in the builder-related classes. Need to re-run the TF eval and correct the checkpoints for PT.

  • sota eval for TF [build 275]
  • sota eval for PT [build 411]

@daniil-lyakhov (Collaborator):

Jenkins please retry a build

@ljaljushkin (Contributor, Author):

Jenkins please retry a build

@ljaljushkin (Contributor, Author) commented Jul 6, 2021

SOTA eval validation is WIP

  • sota eval for PT [build 413]
  • sota eval for PT [locally]
  • sota eval for TF [build 278]
  • sota eval for TF [locally]

@ljaljushkin (Contributor, Author) left a comment:

the most recent changes, just FYI

@ljaljushkin (Contributor, Author):

SOTA eval validation is WIP

  • sota eval for PT [build 413]
  • sota eval for PT [locally]
  • sota eval for TF [build 278]
  • sota eval for TF [locally]

SOTA validation is green for PT and TF

@vshampor merged commit 9ca51e2 into openvinotoolkit:develop on Jul 6, 2021
@ljaljushkin (Contributor, Author):

🎉 🎉 🎉
@alexsu52 @andrey-churkin @daniil-lyakhov @vshampor Thank you for the thorough and responsible review! 👍

@ljaljushkin (Contributor, Author) commented Jul 7, 2021

Post-build SOTA eval validation is green (PT - 415, TF - 279),
except for errors with the pruning algorithm in TF, which had been happening before the merge:

ERROR:nncf:Invalid NNCF config supplied! jsonschema.exceptions.ValidationError: For algorithm: 'filter_pruning

@evgeniya-egupova @alexsu52

kshpv pushed a commit to kshpv/nncf that referenced this pull request Oct 11, 2022
Labels: NNCF Common, NNCF PT
Successfully merging this pull request may close these issues.

Save unambiguous quantizer setup data into NNCF checkpoints
7 participants