Hardware-related: Hardware profiling and parsing

Hardware Profiling & Parsing Flow

General Flow

There are three main parts of the hardware profiling module.

Search Space

Define the search space and then traverse it to generate all primitives. We define a tuple Prim to describe primitives, which consists of:

  • prim_type: the primitive type. It must be defined in aw_nas/ops and can be fetched by get_op.
  • spatial_size: the input feature map size of the primitive.
  • C: the input channel number.
  • C_out: the output channel number.
  • stride: only set for convolution-type primitives; either 1 or 2.
  • kernel_size: only set for convolution-type primitives.
  • kwargs: a dict for extra keys.
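For illustration, here is a minimal, hypothetical sketch of how such primitives could be represented and enumerated. The field names follow the list above, while the namedtuple definition, the toy value ranges, and the expansion kwarg are assumptions, not the actual aw_nas code:

from collections import namedtuple
from itertools import product

# Fields follow the description above; the concrete definition in aw_nas may differ.
Prim = namedtuple(
    "Prim",
    ["prim_type", "spatial_size", "C", "C_out", "stride", "kernel_size", "kwargs"],
)

# Traverse a toy search space: every combination yields one primitive to profile.
prims = [
    Prim("mobilenet_v3_block", size, c, c_out, stride, k, {"expansion": 6})
    for size, (c, c_out), stride, k in product(
        [112, 56, 28], [(16, 24), (24, 32)], [1, 2], [3, 5]
    )
]
print(len(prims))  # 24 primitives for this toy space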

After traversing the search space, all primitives are sent to ProfilingNetAssember to be assembled into complete networks to be profiled. Each assembled network is represented as an awnas YAML file that can be passed to the general model defined in aw_nas/final/general_model.py.

Offline Measuring and Parsing

For DPU or other embedded devices, measurement has to be done offline. We provide an example DPUCompiler defined in aw_nas/hardware/dpu.py, which includes the pytorch2caffe conversion and fixed-point process; you can also plug in your own tools.

After measurement finishes, the result, which consists of the performance of all basic layers supported on the embedded device, should be parsed by DPUCompiler.parse_file.

For GPU/CPU, we provide a script to profile those networks online.

Latency Model

This is the final part of hardware profiling. It organizes the previous results to build a model that can predict the hardware-related measures of an arbitrary architecture in the search space. Currently, only the latency table is available; its predictions may deviate from real performance because it is based on a linear hypothesis. We will provide revised models later.
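To make the linear hypothesis concrete: the latency of a network is predicted as the sum of the profiled latencies of the primitives it consists of. A minimal sketch, assuming a plain dict as the latency table (the keys and values below are made up, not real measurements):

# Hypothetical latency table: primitive description -> measured latency (ms).
latency_table = {
    ("mobilenet_v3_block", 112, 16, 24, 2, 3): 0.42,
    ("mobilenet_v3_block", 56, 24, 32, 2, 5): 0.31,
}

def predict_latency(arch_prims):
    # Linear hypothesis: network latency ~= sum of per-primitive latencies.
    return sum(latency_table[prim] for prim in arch_prims)

print(predict_latency([
    ("mobilenet_v3_block", 112, 16, 24, 2, 3),
    ("mobilenet_v3_block", 56, 24, 32, 2, 5),
]))  # 0.73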

aw_nas provides a command-line interface awnas-hw to orchestrate the profiling and parsing flow for hardware-related objectives (e.g., latency or energy). A complete workflow example follows.


Usage

DPU Profiling

Step 1

Use genprof command to generate a series of config files to be profiled.

awnas-hw genprof examples/hardware/configs/ofa_final.yaml examples/hardware/configs/ofa_lat.yaml --result-dir ./results --compile-hardware dpu
  • ofa_final.yaml: the aw_nas config file. The search space should be defined here.
  • ofa_lat.yaml: the hardware-objective config file for generating the config files to be profiled. profiling_primitive_cfg, hwobjmodel_type, and mixin_search_space_type (with their corresponding configs) should be defined here; it should also provide a template, defined in profiling_net_cfg.base_cfg_template, for generating the profiling config files.
  • (optional) --result-dir: the directory for saving the results. Use it CAREFULLY, because it will erase all contents of the directory if it already exists.
  • (optional) --compile-hardware: specifies which hardware compiler to use. Compilers are defined in hardware/compiler. The compile interface must be implemented, and conversion / quantization / fixed-point or other steps can be instantiated here. In our example hardware/compiler/dpu.py, we provide an instance that converts the to-be-profiled config files into Caffe models, which requires importing the extra module pytorch2caffe.

The result is shown as follows.

results
├── config.yaml
├── hardwares
│   └── 0-dpu
│       └── ...
├── hwobj_config.yaml
├── prof_nets
│   └── ...
└── prof_prims.yaml
  • config.yaml: a copy of the aw_nas config file.
  • hardwares/{$exp-num}-{$compiler}: Caffe models converted from aw_nas config files.
  • prof_nets: aw_nas config files to be profiled.
  • prof_prims.yaml: all primitives to be profiled.
  • hwobj_config.yaml: a copy of the hardware-objective config file.
  • (optional) pytorch_to_caffe: meta information that contains a mapping from pytorch module names to Caffe prototype names (or other deployable formats), which is necessary for later profiling.
Step 2

Measure all Caffe models generated by the previous step. Notice: some problems may occur during compilation because of unsupported layers in your pytorch code. You can cope with them by either removing the unsupported layers or adding them to the pytorch2caffe module.

Step 3

Use parse command to parse the DPU measurement result files to YAML format files.

awnas-hw parse {$hw_cfg_file} {$prof_result_dir} {$prof_prim_file} {$prim_to_ops_file} --hwobj-type latency --result-dir profiled_nets
  • hw_cfg_file: the config file for the hardware; an example can be found in examples/hardware/configs/ofa_lat.yaml. hardware_compiler_type and hardware_compiler_cfg should be defined here.
  • prof_result_dir: the actual profiled results measured offline, which contain the performance (latency, energy, memory, etc.) of each layer (represented by its Caffe prototype name or other deployable format). Notice: layer fusion on some devices may make the measurements of individual layers uncertain; the parsing method that handles this can be implemented as Compiler.parse_file (a hypothetical sketch follows this list).
  • prof_prim_file: contains all primitives generated by Step 1.
  • prim_to_ops_file: meta information generated by Step 1. It contains a mapping from pytorch module names to Caffe prototype names or other deployable formats.
  • result_dir: has exactly the same structure as prof_result_dir, but consists of YAML files that contain the performance of each primitive in the network.
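To illustrate what such a parser does, here is a hypothetical sketch that turns a raw per-layer measurement file (assumed format: one "layer_name latency_ms" pair per line) into a dict. The real format depends on your device toolchain; DPUCompiler.parse_file in aw_nas/hardware/dpu.py is the actual reference:

def parse_file(prof_result_file):
    # Hypothetical parser: one "layer_name latency_ms" pair per line.
    # Fused layers may show up as a single combined entry, which is why the
    # prim_to_ops meta information is needed to map names back to primitives.
    perfs = {}
    with open(prof_result_file) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blanks and comments
            name, latency = line.split()
            perfs[name] = float(latency)
    return perfs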
Step 4

Use genmodel command to generate a latency model.

awnas-hw genmodel {$cfg_file} {$hwobj_cfg_file} {$prof_prim_dir} --result-file {$result_file}
  • cfg_file: the awnas config file that defines the search space.
  • hwobj_cfg_file: the hardware config file that contains mixin_search_space_cfg, profiling_primitive_cfg, and hwobjmodel_type.
  • prof_prim_dir: the profiled networks generated by the previous step.
  • --result-file: where to dump the hardware objective model (a latency model by default; other measures such as energy can also be incorporated).

In hwobj_cfg_file, the entry prof_prims_cfg must be defined with the following keys specified: sample, as_dict, spatial_size, base_channels, mult_ratio, strides, acts, use_ses, stem_stride, primitive_type. An example:

prof_prims_cfg:
  sample: null # or an int
  as_dict: true # if set to false, the return value is a namedtuple
  spatial_size: 300
  base_channels: [16, 16, 24, 32, 64, 96, 160, 960, 1280]
  mult_ratio: 1.
  strides: [1, 2, 2, 2, 1, 2]
  acts: ["relu6", "relu6", "relu6", "h_swish", "h_swish", "h_swish"]
  use_ses: [False, False, True, False, True, True]
  stem_stride: 2
  primitive_type: 'mobilenet_v3_block'

The profiling primitive configuration must be identical to that of the hardware objective configuration in the search space configuration file. For an example, refer to examples/hardware/det_ofa_xavier.yaml and examples/hardware/det_ofa_hardware.yaml.

CPU/GPU Profiling

Compared with DPU or other hardware devices, profiling for CPU/GPU is easier to implement because there is no need to measure performance offline.

Step 1

Do the same as in Step 1 of DPU profiling, but there is no need to specify a compiler.

awnas-hw genprof examples/hardware/configs/ofa_final.yaml examples/hardware/configs/ofa_lat.yaml --result-dir ./results
Step 2
python scripts/hardware/latency.py config_0.yaml [config_1.yaml ...] --device {$device_id} --perf_dir {$result_directory}
  • config_i.yaml: awnas config files generated by the previous step. You can pass an arbitrary number of files to the script by using shell expansion, e.g. python latency.py config_{0..20}.yaml or python latency.py config_dir/*.yaml.
  • --device: the device id. An id in 0 ~ MAX_CUDA_NUM-1 specifies which GPU device will be used to measure performance, and -1 specifies the CPU. Passing any other device id will raise a ValueError.
  • --perf_dir: the result directory. The results in this directory have exactly the same format as the results of Step 3 of DPU profiling.
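The core of such an online measurement is just a timed forward pass. A minimal sketch of the measurement loop, assuming pytorch (this is illustrative, not the actual scripts/hardware/latency.py):

import time
import torch

def measure_latency(model, spatial_size=224, device="cuda:0", warmup=10, repeat=50):
    # Average forward latency (ms) of one input over `repeat` runs.
    model = model.to(device).eval()
    inputs = torch.randn(1, 3, spatial_size, spatial_size, device=device)
    with torch.no_grad():
        for _ in range(warmup):  # warm up kernels and caches first
            model(inputs)
        if device.startswith("cuda"):
            torch.cuda.synchronize()
        start = time.time()
        for _ in range(repeat):
            model(inputs)
        if device.startswith("cuda"):
            torch.cuda.synchronize()  # wait for queued kernels before stopping the clock
        return (time.time() - start) / repeat * 1000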
Step 3

Since the profiled results of GPU/CPU are already parsed in Step 2, we can directly generate the latency model as in Step 4 of DPU profiling.

awnas-hw genmodel {$cfg_file} {$hwobj_cfg_file} {$prof_prim_dir} --result-file {$result_file}

Different latency models take different input features. For example, the linear regression model predicts the actual network latency from the latency sum of its blocks; the MLP takes a padded list of latency data as input for each prediction; the LSTM's input features contain the configuration of every block. These input formats are constructed with preprocessors, and each hardware cost model uses a list of preprocessors:

Legal Preprocessor Combinations

  • table: ["block_sum", "remove_anomaly", "flatten"]
  • regression: ["block_sum", "remove_anomaly", "flatten", "extract_sum_features"]
  • mlp: ["block_sum", "remove_anomaly", "flatten", "padding"]
  • lstm: ["block_sum", "remove_anomaly", "flatten", "extract_lstm_features"]
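Conceptually, the preprocessors form a pipeline: each one transforms the profiled data before it is handed to the next, and the final output is the model's input feature. A toy sketch of this chaining (the two preprocessors here are simplified stand-ins, not the registered aw_nas implementations):

def block_sum(data):
    # Sum the per-layer latencies within each block.
    data["block_lat"] = [sum(lats) for lats in data["layers"]]
    return data

def flatten(data):
    # Expose the per-block sums as a flat feature list.
    data["features"] = list(data["block_lat"])
    return data

def apply_preprocessors(preprocessors, data):
    for prep in preprocessors:
        data = prep(data)
    return data

data = {"layers": [[0.1, 0.2], [0.3]]}
print(apply_preprocessors([block_sum, flatten], data)["features"])  # ~[0.3, 0.3]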

Apply Latency Model During Searching

You can now apply the latency model during the search process by combining your objective with the HardwareObjective defined in aw_nas/objective/hardware.py. We provide a cascading objective called ContainerObjective to do this; it accepts a series of objectives and then assembles their performances, losses, and rewards. You can find more details in aw_nas/objective/container.py, and an example in examples/hardware/configs/hardware_obj.yaml.
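As a rough sketch of what such a cascading configuration might look like (the field names below are illustrative assumptions; consult examples/hardware/configs/hardware_obj.yaml for the actual schema):

objective_type: container
objective_cfg:
  # Illustrative: one accuracy objective plus the hardware (latency) objective.
  sub_objectives:
    - objective_type: classification
      objective_cfg: {}
    - objective_type: hardware
      objective_cfg:
        hwobj_type: latency
        hwobjmodel_file: ./latency_model.yaml  # the model dumped by genmodel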

For CPU/GPU profiling, we provide some latency tables and latency regression models here. You can reuse them directly during the search process.


Implement the interface for new search spaces

We provide a mixin class MixinProfilingSearchSpace. This interface has two methods that must be implemented (a skeletal sketch follows the list):

  • generate_profiling_primitives: profiling cfgs => return the profiling primitive list
  • parse_profiling_primitives: primitive hw-related objective list, profiling/hwobj model cfgs => hwobj model
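A skeletal sketch of implementing this mixin for a new search space (the import path and method signatures are illustrative; check the actual base class in aw_nas for the exact interface):

from aw_nas.hardware.base import MixinProfilingSearchSpace  # assumed import path

class MySearchSpace(MixinProfilingSearchSpace):
    def generate_profiling_primitives(self, **profiling_cfg):
        # Traverse the search space and return the list of Prim tuples
        # (see the Search Space section above).
        raise NotImplementedError

    def parse_profiling_primitives(self, prim_perfs, prof_prims_cfg, hwobjmodel_cfg):
        # Organize the per-primitive measurements into a hardware
        # objective (e.g. latency) model and return it.
        raise NotImplementedError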

You might need to implement the hardware-related objective model class for the new search space. You can reuse some code in aw_nas/hardware/ofa_obj.py.

Implement the interface for new hardware

To implement a hardware-specific compilation and parsing process, create a new class inheriting BaseHardwareCompiler and implement the compile and hwobj_net_to_primitive methods. As stated before, you can put your new hardware implementation python file into AWNAS_HOME/plugins to make it accessible to aw_nas.
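A skeletal sketch of such a plugin (the import path and method signatures are illustrative; follow aw_nas/hardware/dpu.py for a working reference):

from aw_nas.hardware.base import BaseHardwareCompiler  # assumed import path

class MyHardwareCompiler(BaseHardwareCompiler):
    NAME = "my_hardware"  # the name used with --compile-hardware

    def compile(self, compile_name, net_cfg, result_dir):
        # Convert the aw_nas net config into your deployable format;
        # conversion / quantization / fixed-point steps go here.
        raise NotImplementedError

    def hwobj_net_to_primitive(self, prof_result_file, prim_to_ops):
        # Map the per-layer measurement results back to the profiled primitives.
        raise NotImplementedError

Saving this file under AWNAS_HOME/plugins makes the new compiler discoverable by aw_nas, as described above.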