Releases: SonySemiconductorSolutions/mct-model-optimization
Release 2.4.2
What's Changed
API changes
- Introduced the output_names argument in pytorch_export_model. This optional parameter specifies a list of output node names for export compatibility and applies only when using PytorchExportSerializationFormat.ONNX.
- Usage example: to set the model output names, use the output_names argument:
import model_compression_toolkit as mct
mct.exporter.pytorch_export_model(
    model=quantized_exportable_model,
    save_model_path=onnx_file_path,
    repr_dataset=representative_data_gen,
    output_names=['model_output'])
- The output name (i.e., the name of the output layer in the exported ONNX model) will now be model_output instead of the default name output, which is assigned when output_names is not specified.
Black-Duck Reports 2.4.2
bd_2.4.2 Generating docs
Release 2.4.1
What's Changed
Improved activation memory estimation for Mixed Precision
We’ve refined how MCT estimates activation memory to ensure more accurate resource planning for quantized models, which is critical for edge deployment.
- Quantization-Preserving and Fused Operators: Quantization-preserving and fused operators are now included in activation memory estimation, thereby addressing potential memory underestimation.
- Note: quantization preservation is disabled for layers with multiple inputs.
- Shift Negative Activation Correction Fix: Fixed an issue that occurred when shift negative correction (SNC) was used during mixed-precision quantization of activations. SNC uses a 16-bit quantization layer, which was mistakenly counted in the activation memory estimation for the mixed-precision solution even though it is part of a fused operation.
- Finding activation cuts for the Max-Cut computation is now deterministic, yielding consistent results.
Improved layers' sensitivity evaluation for Mixed Precision
- Isolated Bitwidth Testing: When evaluating a specific bitwidth candidate, other layers now stay in float precision (previously set to maximal bitwidth). This isolates the impact of the tested layer, providing clearer insights into its effect on model accuracy.
- Sensitivity Normalization: The new metric_normalization parameter in MixedPrecisionQuantizationConfig lets you optionally normalize sensitivity metrics by either the maximal or minimal bitwidth candidate of the layer. The default behaviour remains as before (non-normalized).
- New Exponential Weighting Method: A new weighting method, MpDistanceWeighting.EXP, was added. It is based on the exponent of the negative distances between the quantized and float models, and the exp_distance_weighting_sigma parameter in MixedPrecisionQuantizationConfig controls the normalization of the distances before the exponent is applied.
- Custom Sensitivity Metrics: A new custom_metric_fn parameter in MixedPrecisionQuantizationConfig allows you to define your own sensitivity metric function. It takes a model (configured with a candidate bitwidth) and returns a float scalar.
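The two new sensitivity options can be illustrated in plain Python. This is a sketch under stated assumptions, not MCT's implementation: it assumes EXP weighting gives each per-interest-point distance d a weight proportional to exp(-d / sigma), normalized to sum to 1, with sigma standing in for exp_distance_weighting_sigma; and it assumes metric normalization divides each candidate's sensitivity by that of the layer's maximal or minimal bitwidth candidate. All function names are hypothetical.

```python
import math
from typing import Dict, List, Optional


def exp_distance_weights(distances: List[float], sigma: float) -> List[float]:
    """Hypothetical EXP weighting: w_i proportional to exp(-d_i / sigma), normalized to sum to 1."""
    raw = [math.exp(-d / sigma) for d in distances]
    total = sum(raw)
    return [r / total for r in raw]


def weighted_sensitivity(distances: List[float], sigma: float) -> float:
    """Aggregate per-interest-point distances into a single sensitivity score."""
    weights = exp_distance_weights(distances, sigma)
    return sum(w * d for w, d in zip(weights, distances))


def normalize_sensitivities(
    sensitivity_by_bits: Dict[int, float], mode: Optional[str] = None
) -> Dict[int, float]:
    """Hypothetical normalization by the layer's max- or min-bitwidth candidate.

    mode=None keeps the raw values, mirroring the non-normalized default.
    """
    if mode is None:
        return dict(sensitivity_by_bits)
    if mode not in ("max", "min"):
        raise ValueError("mode must be None, 'max' or 'min'")
    ref_bits = max(sensitivity_by_bits) if mode == "max" else min(sensitivity_by_bits)
    ref = sensitivity_by_bits[ref_bits]
    return {bits: s / ref for bits, s in sensitivity_by_bits.items()}
```

Under this reading, a smaller sigma concentrates the weight on the interest points with the smallest distances.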
New Version of TPC Schema - v2
We’ve upgraded the Target Platform Capabilities (TPC) schema to v2, expanding support for new operations and model configurations.
- New OperatorSetNames: EXP, COS, SIN: These additions enable quantization of models with exponential, cosine, and sine operations.
- Quantization-Preserving Layers: A new boolean flag, insert_preserving_quantizers in TargetPlatformCapabilities, lets you add quantization-preserving activation holder layers to the final quantized model. Note that this is supported only in PyTorch.
Introduced Activation Threshold Search Using Hessian-Weighted MSE (HMSE)
- A new method leverages a weighted activation histogram, with weights derived from Hessian values, to improve activation threshold search. This can be configured via QuantizationErrorMethod.
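As a rough illustration of a Hessian-weighted MSE threshold search, the sketch below scores each candidate threshold by the quantization MSE over activation histogram bins, where each bin contributes its count multiplied by an assumed per-bin Hessian weight. The per-bin weights, the candidate list, the signed uniform quantizer and all function names are assumptions for illustration; how MCT derives the Hessian weights and candidate thresholds may differ.

```python
from typing import Sequence


def quantize_signed(x: float, threshold: float, n_bits: int) -> float:
    """Uniform symmetric quantizer with step threshold / 2**(n_bits - 1) and clipping."""
    delta = threshold / 2 ** (n_bits - 1)
    q_min, q_max = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    q = max(q_min, min(q_max, round(x / delta)))  # Python's round() rounds half to even
    return q * delta


def hessian_weighted_mse(centers: Sequence[float], counts: Sequence[float],
                         hessian_weights: Sequence[float], threshold: float, n_bits: int) -> float:
    """MSE over histogram bins, each bin weighted by count * Hessian weight."""
    num = 0.0
    den = 0.0
    for c, n, h in zip(centers, counts, hessian_weights):
        w = n * h
        num += w * (c - quantize_signed(c, threshold, n_bits)) ** 2
        den += w
    return num / den


def search_threshold(centers: Sequence[float], counts: Sequence[float],
                     hessian_weights: Sequence[float], candidates: Sequence[float],
                     n_bits: int = 8) -> float:
    """Return the candidate threshold with the lowest Hessian-weighted MSE."""
    return min(candidates,
               key=lambda t: hessian_weighted_mse(centers, counts, hessian_weights, t, n_bits))
```

With uniform weights, a rare outlier bin tends to be clipped in favor of finer resolution for the common values; when the Hessian marks the outlier as influential, a larger threshold can win instead.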
Weights Quantization Configuration Updates
We’ve added flexibility to weights quantization to give you more control over compression.
- Positional Weights Quantization: Added support for a quantization configuration for positional weights, i.e., weights used as inputs to functional layers.
- Manual Bitwidth Option: You can now manually specify the bitwidth for weights, overriding the automatic selection for precise control.
Support for edge-mdt-cl Custom Layers
- Added integration with the edge-mdt-cl package, enabling custom layers optimized for edge deployment. Check the edge-mdt-cl repo for layer documentation.
Extending Supported Versions
- Added support for PyTorch 2.6 and for NumPy 2.
Breaking changes
- Support Discontinued for Old Framework Versions: Support for TensorFlow 2.12 and 2.13 and for PyTorch 2.2 has been discontinued.
- Layers from sony-custom-layers No Longer Supported:
The sony-custom-layers package is deprecated and replaced by edge-mdt-cl. Update your code to use edge-mdt-cl layers. Refer to the new package’s documentation for more information.
Additional Changes and Bug Fixes
- Improved Error Reporting:
We’ve enhanced error messages to help you diagnose and resolve issues faster, especially in complex PyTorch models.
- Disconnected Input Nodes: Better detection and clearer reporting for PyTorch models with disconnected input nodes, which often occur when the forward pass includes optional or unused inputs (#1360).
- PyTorch .to Misuse: Improved error messages for incorrect use of the .to method in PyTorch, providing more context to simplify debugging (#1382).
- Fix for Reused Nodes: Addressed a bug where reused nodes were incorrectly added before original nodes, which could disrupt model structure (#1418).
- ONNX Export Enhancements:
- Weights Sharing Support: Added support for weights sharing in exported models, reducing model size and eliminating redundancy for more efficient storage and inference (#1402).
- ONNX Opset 20: Now uses ONNX opset 20 for PyTorch versions > 2.4 for better compatibility with ONNX tools and runtimes.
- Custom Output Names: Enabled the ability to specify output names, giving you more control for integrating exported models with other pipelines.
- Positional Weights Quantization in ONNX Fake-Quant Mode: We’ve fixed an issue that prevented the export of ONNX models in fake-quantized mode when the model included positional weights in functional layers.
- Multi-Input Fix: Fixed an export issue for fake-quantized models with multiple inputs.
- Dynamic Output Size Support: Added support for dynamic output sizes in nn.ConvTranspose2d (#1381).
- Debug Option to Bypass MCT Facade: Introduced a debug mode that bypasses the MCT facade, allowing you to quickly determine whether an issue originates from your model or from MCT itself (#1410).
- Upgrade to Pydantic 2: Upgraded to Pydantic 2 for improved data validation (#1426).
Tutorials
- Added PyTorch Tutorial for Activation Z-Score Threshold: We’ve added a new PyTorch tutorial that guides you through using activation z-score thresholds for the quantization of PyTorch models. Try it on Google Colab!
Black-Duck Reports 2.4.1
bd_2.4.1 Update version to 2.4.1
Black-Duck Reports 2.4.0
bd_2.4.0 Fix bug in export using FQ ONNX in replacing activation holder with m…
Release 2.3.0
What's Changed
Major Changes
Target Platform Capabilities (TPC) Changes
TPC Schema
- Introduced a new schema mechanism (version v1) that establishes the language for building a target platform capabilities description.
- The schema defines the TargetPlatformCapabilities class, which is used to describe the platform capabilities.
- The OperatorSetNames enum provides a closed set of operator set names that allows setting quantization configuration options for commonly used operators.
- Custom operator set names can also be used.
- All schema classes use pydantic BaseModel for enhanced validation and schema flexibility.
- MCT has a new dependency: "pydantic < 2.0".
- In addition, a new versioning system was introduced, using minor and patch versions.
Naming Refactor
- Creating the schema mechanism was accompanied by renaming some classes:
TargetPlatformModel → TargetPlatformCapabilities
TargetPlatformCapabilities → FrameworkQuantizationCapabilities
OperatorSetConcat → OperatorSetGroup
Attach TPC to Framework
- A new module named AttachTpcToFramework handles the conversion from a framework-independent TargetPlatformCapabilities description to a framework-specific FrameworkQuantizationCapabilities that maps each framework's operators to their possible quantization configurations.
- Available for TensorFlow and PyTorch via AttachTpcToKeras and AttachTpcToPytorch, respectively.
API changes
- All MCT APIs now expect a target_platform_capabilities object (TargetPlatformCapabilities) that contains the framework-independent platform capabilities description.
- Previously, the APIs expected an initialized framework-specific object.
- Note: the default behavior of MCT's APIs has not changed. Calling an API function without passing a TPC object, or passing an object obtained using the following API:
get_target_platform_capabilities(<FW_NAME>, DEFAULT_TP_MODEL)
uses the same default TPC as in the previous release.
- However, users who accessed TPC-related classes outside the published API may encounter breaking changes due to class renaming and file hierarchy changes.
Tighter activation memory estimation via Max-Cut (Experimental)
- Replaced Max-Tensor with Max-Cut as the activation memory estimation method in the mixed precision algorithm.
- The Max-Cut metric considers the execution schedule of the model's operators for a more precise estimation of activation memory (#1295).
- Note: this is an estimate of memory usage at runtime; the actual runtime memory may differ.
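The difference between the two estimates can be sketched in plain Python. Below, Max-Tensor is the size of the largest single activation tensor, while a simplified schedule-aware estimate sums the sizes of all tensors alive at each step of a fixed execution order and takes the peak. This is a hypothetical liveness-based approximation for illustration, not MCT's Max-Cut algorithm; the graph encoding (one output tensor per operator, named after it) is an assumption.

```python
from typing import Dict, List


def max_tensor(sizes: Dict[str, int]) -> int:
    """Size of the largest single activation tensor."""
    return max(sizes.values())


def peak_live_memory(schedule: List[str], sizes: Dict[str, int],
                     consumers: Dict[str, List[str]]) -> int:
    """Peak total size of tensors alive while running ops in `schedule` order.

    Each op produces one tensor with the op's name. A tensor stays alive from the
    step that produces it until the step of its last consumer; tensors with no
    consumers are alive only at the step that produces them (a simplification).
    """
    position = {op: i for i, op in enumerate(schedule)}
    last_use = {
        t: max((position[c] for c in consumers.get(t, [])), default=position[t])
        for t in schedule
    }
    peak = 0
    for step in range(len(schedule)):
        live = sum(sizes[t] for t in schedule if position[t] <= step <= last_use[t])
        peak = max(peak, live)
    return peak


if __name__ == "__main__":
    # Residual block: "input" feeds both conv1 and the final add, so it stays alive.
    schedule = ["input", "conv1", "conv2", "add"]
    sizes = {name: 100 for name in schedule}
    consumers = {"input": ["conv1", "add"], "conv1": ["conv2"], "conv2": ["add"]}
    print(max_tensor(sizes), peak_live_memory(schedule, sizes, consumers))  # 100 300
```

In the residual example, the skip connection keeps the input tensor alive while conv1 and conv2 run, so the schedule-aware peak is three times the largest tensor.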
- 16-bit Activation Quantization (experimental)
- The new activation memory estimation allows flexible usage of the mixed precision algorithm to enable 16-bit activation quantization (dependent on a TPC that supports 16-bit quantization for different operators).
- 16-bit quantization can be enabled either via Manual Bit-width selection API or automatically, by executing mixed precision with a proper activation or total memory constraint.
- Note that when running mixed precision with activation memory constraint to enable 16-bit allocation, shift negative correction should be disabled.
Improved GPTQ algorithm via Sample Layer Attention (SLA):
- Enabled SLA by default in both Keras and PyTorch (#1287, #1260)
- Added gradual activation quantization support for enhanced results when quantizing activations (#1244, #1237)
- Implemented Rademacher distribution for Hessian estimation (#1250)
- For more details, please visit our paper.
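For context, a common use of Rademacher vectors in Hessian estimation is Hutchinson's trace estimator: for a random vector v with independent ±1 entries, E[vᵀHv] = tr(H), so averaging vᵀHv over a few probes estimates the trace using only Hessian-vector products. The sketch below shows that generic estimator with a user-supplied Hessian-vector product function; which quantity MCT estimates and how it computes Hessian-vector products are not described here, so treat it as an illustration rather than MCT's code.

```python
import random
from typing import Callable, List

Vector = List[float]


def hutchinson_trace(hvp: Callable[[Vector], Vector], dim: int,
                     n_samples: int, seed: int = 0) -> float:
    """Estimate tr(H) as the mean of v^T H v over Rademacher probe vectors v."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        v = [rng.choice((-1.0, 1.0)) for _ in range(dim)]  # Rademacher: +1 or -1 with equal probability
        hv = hvp(v)  # Hessian-vector product H @ v
        total += sum(vi * hvi for vi, hvi in zip(v, hv))
    return total / n_samples


def dense_hvp(matrix: List[List[float]]) -> Callable[[Vector], Vector]:
    """Wrap an explicit matrix as a Hessian-vector product function (for testing only)."""
    return lambda v: [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in matrix]
```

Because every Rademacher entry squares to 1, each sample is exact for a diagonal H; off-diagonal terms only add zero-mean noise.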
Resource Utilization (RU) calculation:
- Uses the Max-Cut activation method for activation and total resource utilization computation.
- The total target is now computed from the weights and activation utilization instead of being a separate metric.
- Weights memory computation now includes all quantized weights in the model, instead of only kernel attributes. This may change the results of existing mixed precision scenarios.
- Note that the ResourceUtilization API did not change.
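The effect of counting all quantized weights rather than only kernels can be shown with a toy calculation. In this sketch, each layer lists the element count of every weight attribute and the bit-width of the attributes that are quantized; unquantized attributes are skipped, and the total is simply weights plus activation bytes. The data layout, the attribute names and the function names are hypothetical and do not reflect MCT's ResourceUtilization internals.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LayerWeights:
    name: str
    num_elements: Dict[str, int]  # elements per weight attribute, e.g. {"kernel": 1000, "bias": 10}
    quant_bits: Dict[str, int] = field(default_factory=dict)  # bit-widths of quantized attributes only


def weights_memory_bytes(layers: List[LayerWeights], kernel_only: bool = False) -> float:
    """Memory of quantized weights in bytes; kernel_only mimics a kernel-only count."""
    total_bits = 0
    for layer in layers:
        for attr, n in layer.num_elements.items():
            if kernel_only and attr != "kernel":
                continue
            bits = layer.quant_bits.get(attr)
            if bits is None:  # attribute is not quantized, so it is not counted
                continue
            total_bits += n * bits
    return total_bits / 8


def total_memory_bytes(weights_bytes: float, activation_bytes: float) -> float:
    """Total utilization derived from its two components rather than tracked separately."""
    return weights_bytes + activation_bytes
```

For a convolution with an 8-bit kernel of 1,000 elements plus a functional Add with an 8-bit constant of 200 elements, a kernel-only count gives 1,000 bytes while counting all quantized weights gives 1,200 bytes.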
Minor Changes
- Added Activation Bias Correction feature to potentially enhance quantization results of vision transformers (#1256)
- Added substitution to decompose MatMul operation into baseline components in PyTorch (#1313)
- Added a substitution to decompose the scaled dot-product attention operator in PyTorch (#1229)
- Converted core configuration classes (CoreConfig, QuantizationConfig, etc.) to dataclasses for simpler usage and strict behavior verification (#1203).
- Trainable Infrastructure changes:
- Moved STE/LSQ activation quantizers from QAT to the trainable infrastructure.
- Renamed Trainable QAT quantizer to Weight Trainable quantizer (#1240).
- Added support for PyTorch 2.4, PyTorch 2.5, and Python 3.12
Bug Fixes
- Fixed activation gradient backpropagation in GPTQ for PyTorch models. GPTQ now uses STE Activation Trainable quantizers with frozen quantization parameters instead of Activation Inferable quantizers, which did not propagate gradients (#1197).
- Fix ONNX export when PyTorch models have multiple inputs/outputs (#1223)
- Fixed the issue of duplicating reused layers in PyTorch models (#1217)
- Fixed HMSE being overridden by MSE after resource utilization computation (#1253)
- Resolved duplicate QCOs error handling (#1282, #1149)
- Fixed tf.nn.{conv2d,convolution} substitution to handle attributes with default values that were not passed explicitly (#1275)
- Fixed handling errors in PyTorch graphs by managing nodes with missing outputs and ensuring robust extraction of output shapes (#1186)
New Contributors
Welcome @ambitious-octopus and @itai-berman for their first contributions! #1186 , #1266
Release 2.2.2
Removed TPCs - Breaking Change
- This patch release removes IMX500 TPCs versions v2 and v3, for both Keras and PyTorch.
- The following features are no longer available out-of-the-box via the provided IMX500 TPC by MCT:
- Quantized model metadata
- Activation 16-bit quantization
- Constant weights quantization
Full Changelog: v2.2.1...v2.2.2
Release 2.2.1
Bug Fixes and Other Changes:
- A necessary modification for YOLOv8 quantization, derived from #1186.
Release 2.1.1
Bug Fixes and Other Changes:
- A necessary modification for YOLOv8 quantization, derived from #1186.
Release 2.2.0
What's Changed
- This release includes breaking changes in the Target Platform Capabilities module (TPC). If you use a custom TPC, be sure to review the Breaking changes section.
General changes
- Quantization enhancements:
- Improved Hessian information computation runtime, which speeds up GPTQ, HMSE and Mixed Precision with a Hessian-based loss.
- The get_keras_gptq_config and get_pytorch_gptq_config functions now accept a hessian_batch_size argument that controls the batch size used in Hessian computation for GPTQ.
- Data Generation Upgrade: improved speed, performance and coverage.
- Added SmoothAugmentationImagePipeline, an image pipeline implementation that includes Gaussian smoothing, random cropping and clipping.
- Improved performance with float16 support in PyTorch.
- Introduced the ReduceLROnPlateauWithReset scheduler, a learning rate scheduler that reduces the learning rate when a metric has stopped improving and can reset the learning rate to its initial value after a specified number of bad epochs.
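One plausible reading of the scheduler's behaviour is sketched below in plain Python. The class name, the parameters (factor, patience, reset_after, min_lr) and the exact rules, namely decaying the learning rate after more than patience bad epochs since the last change and resetting to the initial value after reset_after consecutive bad epochs, are assumptions for illustration and are not MCT's ReduceLROnPlateauWithReset API.

```python
import math


class PlateauWithResetScheduler:
    """Hypothetical plateau scheduler that can reset to the initial learning rate."""

    def __init__(self, initial_lr: float, factor: float = 0.5, patience: int = 2,
                 reset_after: int = 6, min_lr: float = 1e-6) -> None:
        self.initial_lr = initial_lr
        self.lr = initial_lr
        self.factor = factor
        self.patience = patience
        self.reset_after = reset_after
        self.min_lr = min_lr
        self.best = math.inf
        self.bad_since_improvement = 0  # consecutive epochs without a new best metric
        self.bad_since_reduction = 0    # bad epochs since the last learning-rate change

    def step(self, metric: float) -> float:
        """Update with the latest metric (lower is better) and return the learning rate."""
        if metric < self.best:
            self.best = metric
            self.bad_since_improvement = 0
            self.bad_since_reduction = 0
            return self.lr
        self.bad_since_improvement += 1
        self.bad_since_reduction += 1
        if self.bad_since_improvement >= self.reset_after:
            # Long plateau: restart from the initial learning rate.
            self.lr = self.initial_lr
            self.bad_since_improvement = 0
            self.bad_since_reduction = 0
        elif self.bad_since_reduction > self.patience:
            # Short plateau: decay the learning rate, but not below min_lr.
            self.lr = max(self.lr * self.factor, self.min_lr)
            self.bad_since_reduction = 0
        return self.lr
```

With patience=1 and reset_after=4, the metric sequence 1.0, 2.0, 2.0, 2.0, 2.0, 0.5 yields learning rates 1.0, 1.0, 0.5, 0.5, 1.0, 1.0: one decay, then a reset once the plateau lasts four epochs.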
- Shift negative correction for activations:
- Updated shift negative correction for the GELU activation operator.
- Enabled shift negative correction by default in the QuantizationConfig within CoreConfig.
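The idea behind shift negative correction can be shown with a small linear example. Adding a constant shift s to an activation with a negative minimum (GELU's minimum is about -0.17) makes it non-negative, so it can use an unsigned quantization grid, and the following linear layer absorbs the shift through its bias, since W(x + s·1) + (b - W(s·1)) = Wx + b, where 1 is the all-ones vector. The sketch below demonstrates only this compensation identity in plain Python; how MCT chooses the shift and where it inserts it are not shown, and the helper names are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def linear(weights: Matrix, x: Vector, bias: Vector) -> Vector:
    """y = W x + b for a dense layer."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def shift_negative(x: Vector, weights: Matrix, bias: Vector,
                   shift: float) -> Tuple[Vector, Vector]:
    """Shift activations by `shift` and fold the compensation into the next layer's bias."""
    x_shifted = [xi + shift for xi in x]  # non-negative whenever shift >= -min(x)
    # W (x + shift * 1) + (b - W (shift * 1)) == W x + b
    bias_corrected = [b - shift * sum(row) for row, b in zip(weights, bias)]
    return x_shifted, bias_corrected
```

The layer output is unchanged up to floating-point rounding, while the shifted activations can now be quantized with an unsigned grid.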
- Introduced a new Explainable Quantization (Xquant) tool (experimental).
- Introduced TPC IMX500.v3 (experimental):
- Supports quantization of constants: constant inputs of the Add, Sub, Mul and Div operators are quantized to 8 bits using per-axis power-of-two quantization. The axis is chosen per constant to minimize the quantization error.
- IMX500 TPC now supports 16-bit activation quantization for the following operators: Add, Sub, Mul, Concat and Stack.
- Supports assigning allowed input precision options to each operator, that is, the precision representation of the operator's input activation tensor.
- Default TPC remains IMX500.v1.
- To select IMX500.v3 in Keras:
import model_compression_toolkit as mct
tpc_v3 = mct.get_target_platform_capabilities("tensorflow", 'imx500', target_platform_version="v3")
quantized_model, quantization_info = mct.ptq.keras_post_training_quantization(model, representative_data_gen, target_platform_capabilities=tpc_v3)
- To select IMX500.v3 in PyTorch:
import model_compression_toolkit as mct
tpc_v3 = mct.get_target_platform_capabilities("pytorch", 'imx500', target_platform_version="v3")
quantized_model, quantization_info = mct.ptq.pytorch_post_training_quantization(model, representative_data_gen, target_platform_capabilities=tpc_v3)
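The per-axis power-of-two constant quantization described above can be sketched for a 2-D constant in plain Python. The sketch assumes the simplest power-of-two threshold (the smallest power of two not below the maximum absolute value of each slice), a signed 8-bit grid, and MSE as the quantization error; MCT's actual threshold selection and error measure may differ, and all names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Matrix = List[List[float]]


def pot_threshold(values: List[float]) -> float:
    """Smallest power of two >= max |value| (1.0 for an all-zero slice)."""
    max_abs = max(abs(v) for v in values)
    return 1.0 if max_abs == 0 else 2.0 ** math.ceil(math.log2(max_abs))


def quantize_value(v: float, threshold: float, n_bits: int) -> float:
    """Signed uniform quantization with step threshold / 2**(n_bits - 1)."""
    delta = threshold / 2 ** (n_bits - 1)
    q = max(-(2 ** (n_bits - 1)), min(2 ** (n_bits - 1) - 1, round(v / delta)))
    return q * delta


def quantize_per_axis(c: Matrix, axis: int, n_bits: int = 8) -> Matrix:
    """Quantize with one threshold per row (axis=0) or per column (axis=1)."""
    if axis == 0:
        thresholds = [pot_threshold(row) for row in c]
        return [[quantize_value(v, thresholds[i], n_bits) for v in row] for i, row in enumerate(c)]
    thresholds = [pot_threshold(list(col)) for col in zip(*c)]
    return [[quantize_value(v, thresholds[j], n_bits) for j, v in enumerate(row)] for row in c]


def mse(a: Matrix, b: Matrix) -> float:
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


def choose_axis(c: Matrix, n_bits: int = 8) -> Tuple[int, Dict[int, float]]:
    """Pick the quantization axis with the lowest MSE for this constant."""
    errors = {axis: mse(c, quantize_per_axis(c, axis, n_bits)) for axis in (0, 1)}
    return min(errors, key=errors.get), errors
```

For a constant whose rows differ greatly in scale, such as [[100, 90], [0.3, 0.2]], per-row thresholds keep the small values representable, so axis 0 is chosen; transposing the constant flips the choice to axis 1.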
- Introduced the BitWidthConfig API:
- Allows manual adjustment of activation bit-widths for specific model layers through a new class under CoreConfig.
- A usage example of manual 16-bit activation selection is available in the PyTorch object detection YOLOv8n tutorial.
- Tutorials:
- MCT tutorial notebooks updates:
- Added new tutorials for IMX500:
- Instance segmentation YOLOv8n and pose estimation YOLOv8n quantization in PyTorch, including an optional Gradient-Based PTQ step for optimized performance.
- A torchvision model quantization for IMX500.
- Added new classification models to MCT’s IMX500-Notebooks.
- Added new MCT features tutorials: Xquant tutorials in PyTorch and Keras, and a new GPTQ tutorial in PyTorch.
- Updated the PyTorch object detection YOLOv8n tutorial with a manual 16-bit configuration.
Breaking changes
- To configure OpQuantizationConfig in the TPC, additional arguments have been added:
- Signedness specifies the signedness of the quantization method (signed or unsigned quantization).
- supported_input_activation_n_bits sets the number of bits that the operator accepts as input.
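To see why signedness matters, the sketch below computes the grid of a threshold-based uniform quantizer under a common convention, which is an assumption here rather than something the TPC specifies: a signed quantizer spans [-t, t - step] with step t / 2^(n-1), while an unsigned one spans [0, t - step] with step t / 2^n. For non-negative activations, an unsigned grid therefore halves the step size at the same bit-width. The function name is hypothetical.

```python
from typing import Tuple


def quantization_grid(threshold: float, n_bits: int, signed: bool) -> Tuple[float, float, float]:
    """Return (min_value, max_value, step) of a threshold-based uniform quantizer."""
    if signed:
        step = threshold / 2 ** (n_bits - 1)
        return -threshold, threshold - step, step
    step = threshold / 2 ** n_bits
    return 0.0, threshold - step, step
```

With threshold 1.0 and 8 bits, the signed grid has step 0.0078125 and the unsigned grid has step 0.00390625.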
Bug fixes:
- Fixed a bug in the PyTorch model reader for the reshape operator #1086.
- Fixed a bug in GPTQ with bias learning for cases where a convolutional layer has None as its bias #1109.
- Fixed an issue in mixed precision when running only weights or only activation compression: if layers with multiple candidates for the other kind (activation or weights) existed, the search would fail or return incorrect results. A new filtering procedure now runs before mixed precision to filter out unnecessary candidates #1162.
New Contributors
Welcome @DaniAffCH, @irenaby, @yardeny-sony for their first contribution! PR #1094, PR #1118, PR #1163
Full Changelog: v2.1.0...v2.2.0