[onert-micro] Introduce Training #12873

Open
BalyshevArtem opened this issue Apr 16, 2024 · 2 comments
Labels
type/discussion We need discussion. Discussion itself can help. Even without conclusions!

Comments

@BalyshevArtem
Contributor

BalyshevArtem commented Apr 16, 2024

What

Let's discuss how to add training into onert-micro.

Why

We need to add a training feature to onert-micro for some target models.

cc @Torrero, @SlavikMIPT, @chunseoklee, @lemmaa

@BalyshevArtem BalyshevArtem added the type/discussion We need discussion. Discussion itself can help. Even without conclusions! label Apr 16, 2024
@BalyshevArtem
Contributor Author

BalyshevArtem commented Apr 16, 2024

First proposal

The idea is a two-stage process. In the first stage, the model is prepared from the initial model on the developer's host (using one-toolchain): the backpropagation graph is built, optimizations of that graph are applied, and so on. In the second stage, training is performed on the device using onert-micro, based on the initial model and the resulting backpropagation graph.

The main goal of this proposal is to keep onert-micro as simple as possible: without greatly complicating its logic and without greatly increasing the code base (and thus the binary size). This keeps it lightweight during normal (non-training) usage. The two-stage process helps achieve this goal, while leaving room to apply various complex optimizations and hypothesis checks on the host side.

Proposed overall structure of this two-stage system:

(image: diagram of the two-stage system, not reproduced here)

  1. First, we have a circle model. We feed it to the TrainingConfigureTool (the name is temporary; the tool can be shipped separately or as part of one-toolchain). The tool outputs three files: a circle model with the weights that will be trained cut out, a wof (weight only format) file that stores the trainable weights, and a backpropagation graph in the form of a circle model.
  2. All these files, together with the prepared train and test datasets, are fed to onert-micro training. Training runs according to the training parameters set in the application (it is possible to iterate over them to find the best combination).
  3. At the end of the training process, the test data is used to check whether there has been an improvement; if so, the new weights are saved into the wof file (a rough sketch of this loop follows the list).
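To make steps 2 and 3 more concrete, here is a minimal sketch of what the on-device loop could look like. All type and function names (`TrainingSession`, `trainEpoch`, ...) are hypothetical placeholders, not an existing or proposed onert-micro API; they only mirror the flow described above.

```cpp
// Illustrative sketch only: TrainingSession and its methods are hypothetical
// placeholders that mirror the flow above (three input files plus datasets in,
// improved weights out). None of these names are an existing onert-micro API.
#include <cstdint>
#include <vector>

struct TrainingConfig
{
  float learning_rate = 0.001f; // training parameters chosen by the application
  uint32_t batch_size = 32;
  uint32_t epochs = 10;
};

class TrainingSession
{
public:
  TrainingSession(const char *model_circle, const char *weights_wof,
                  const char *backprop_circle, const TrainingConfig &config) {}

  // One pass over the train set: run the forward graph, run the backprop
  // graph to get gradients, apply the optimizer to the wof weights.
  void trainEpoch(const std::vector<float> &data, const std::vector<float> &labels) {}

  // Run the forward graph on the test set and return a quality metric.
  float evaluate(const std::vector<float> &data, const std::vector<float> &labels) { return 0.0f; }

  // Overwrite the buffers in the .wof file with the current weights.
  void saveTrainedWeights(const char *wof_path) {}
};

int main()
{
  TrainingConfig config;
  TrainingSession session("model.circle", "model.wof", "model_backprop.circle", config);

  // Datasets are prepared on the host and shipped to the device with the model files.
  std::vector<float> train_data, train_labels, test_data, test_labels;

  float best_metric = session.evaluate(test_data, test_labels);
  for (uint32_t e = 0; e < config.epochs; ++e)
  {
    session.trainEpoch(train_data, train_labels);
    const float metric = session.evaluate(test_data, test_labels);

    // Step 3: keep the new weights only if the test metric improved.
    if (metric > best_metric)
    {
      best_metric = metric;
      session.saveTrainedWeights("model.wof");
    }
  }
  return 0;
}
```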

Details

The first stage is the TrainingConfigureTool. Its two main tasks are:

  • cutting the trainable weights out of the source file (the layers can be set manually, or in the future selected automatically to achieve better performance under the current resource constraints)
  • creating a backpropagation graph that enables learning; by executing it, onert-micro will be able to calculate the gradients.

As a result, the tool produces three files: a circle model without the trainable weights, a file storing the weights for training (wof - weight only format), and a circle (maybe circle+) model with the backpropagation graph.
In the future, the TrainingConfigureTool will be able to perform additional actions to improve the training process:

  • based on a given memory budget, select only a subset of the layers for training, or even a subset of the weights of a certain layer (so-called sparse backpropagation); a rough selection sketch follows this list
  • apply mixed-precision quantization and search for parts of the network where the rematerialization technique can be applied (intermediate results are not saved for that part of the network, but are recalculated during backpropagation)
  • optimizations on the graph itself
  • and so on
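As a rough illustration of the memory-budget-driven selection in the first bullet, here is a minimal greedy sketch. The `LayerInfo` fields and the benefit heuristic are purely hypothetical; the actual selection policy is an open question.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical per-layer description used only for this sketch.
struct LayerInfo
{
  uint32_t index;         // layer index in the original circle model
  size_t train_mem_bytes; // extra memory needed to train this layer (weight grads + saved activations)
  float benefit;          // heuristic "usefulness" score, e.g. expected accuracy gain
};

// Greedy selection: take the most "useful" layers first while the extra
// training memory fits into the given budget. Returns indices of layers
// whose weights stay trainable; everything else is frozen.
std::vector<uint32_t> select_trainable_layers(std::vector<LayerInfo> layers, size_t memory_budget_bytes)
{
  std::sort(layers.begin(), layers.end(),
            [](const LayerInfo &a, const LayerInfo &b) { return a.benefit > b.benefit; });

  std::vector<uint32_t> selected;
  size_t used = 0;
  for (const auto &layer : layers)
  {
    if (used + layer.train_mem_bytes <= memory_budget_bytes)
    {
      selected.push_back(layer.index);
      used += layer.train_mem_bytes;
    }
  }
  return selected;
}
```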

The output backpropagation graph will consist of both traditional circle operations and special operations that calculate gradients for the corresponding forward operation (for example, Conv2DWeightGrad calculates the gradient with respect to the weights, and Conv2DInputGrad calculates the gradient with respect to the input tensor). These operations can be added to circle as custom operations, or added as operations specific to circle+. I prefer the second option.
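To make these gradient operations more concrete, below is a minimal sketch of the computation a Conv2DWeightGrad-style kernel would perform. The layout, stride, padding and batch assumptions (NHWC, stride 1, no padding, batch 1) and the function signature are chosen purely for brevity of illustration, not as a proposed implementation.

```cpp
#include <cstdint>

// Naive sketch of the math a Conv2DWeightGrad-style kernel would perform:
//   dL/dW[kh][kw][ic][oc] = sum over output positions of
//                           input[oh + kh][ow + kw][ic] * grad_output[oh][ow][oc]
// Assumptions (for brevity only): NHWC layout, batch size 1, stride 1, no padding.
void conv2d_weight_grad(const float *input,       // [in_h][in_w][in_c]
                        const float *grad_output, // [out_h][out_w][out_c]
                        float *grad_weights,      // [k_h][k_w][in_c][out_c]
                        int in_h, int in_w, int in_c,
                        int out_c, int k_h, int k_w)
{
  const int out_h = in_h - k_h + 1;
  const int out_w = in_w - k_w + 1;

  for (int kh = 0; kh < k_h; ++kh)
    for (int kw = 0; kw < k_w; ++kw)
      for (int ic = 0; ic < in_c; ++ic)
        for (int oc = 0; oc < out_c; ++oc)
        {
          float acc = 0.0f;
          for (int oh = 0; oh < out_h; ++oh)
            for (int ow = 0; ow < out_w; ++ow)
            {
              const float in_val = input[((oh + kh) * in_w + (ow + kw)) * in_c + ic];
              const float go_val = grad_output[(oh * out_w + ow) * out_c + oc];
              acc += in_val * go_val;
            }
          grad_weights[((kh * k_w + kw) * in_c + ic) * out_c + oc] = acc;
        }
}
```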

The second stage is onert-micro training, which will expose various training parameters. To achieve the maximum effect during training, onert-micro training will support different optimizers (SGD, ADAM, RMSProp, maybe some custom ones), a configurable batch size, and optimizer-specific constants (learning rate, the constants for ADAM and RMSProp).
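For illustration, here is a minimal sketch of the per-tensor update step such optimizers would apply to the trainable weights once the backpropagation graph has produced the gradients. These are the standard SGD and ADAM formulas; the function names and signatures are placeholders, not a proposed API.

```cpp
#include <cmath>
#include <cstdint>

// Plain SGD: w <- w - lr * g
void sgd_step(float *weights, const float *grads, uint32_t size, float lr)
{
  for (uint32_t i = 0; i < size; ++i)
    weights[i] -= lr * grads[i];
}

// ADAM: keeps per-weight first/second moment estimates m and v.
// beta1, beta2, epsilon are the optimizer constants the proposal mentions
// as user-configurable training parameters.
void adam_step(float *weights, const float *grads, float *m, float *v, uint32_t size,
               uint32_t step, float lr, float beta1, float beta2, float epsilon)
{
  const float bias1 = 1.0f - std::pow(beta1, static_cast<float>(step));
  const float bias2 = 1.0f - std::pow(beta2, static_cast<float>(step));

  for (uint32_t i = 0; i < size; ++i)
  {
    m[i] = beta1 * m[i] + (1.0f - beta1) * grads[i];
    v[i] = beta2 * v[i] + (1.0f - beta2) * grads[i] * grads[i];

    const float m_hat = m[i] / bias1; // bias-corrected moments
    const float v_hat = v[i] / bias2;

    weights[i] -= lr * m_hat / (std::sqrt(v_hat) + epsilon);
  }
}
```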

@BalyshevArtem
Contributor Author

Proposal for a file structure containing only trainable weights

.wof (weight only format)

This proposal is taken from "A3: Define a separate format for storing a single file of diffs (≈ changed weights)" in the internal repo (proposed by @glistening).

| Offset (bytes) | Contents |
| --- | --- |
| 0 | MAGIC NUMBER |
| 2 | SCHEMA VERSION |
| 4 | RESERVED |
| 8 | n_buffers (=N) |
| 12 | Offset 1 ... Offset N (4 bytes each) |
| 8 + (4 + N * 4) | Buffer 1 Data, Buffer 2 Data, ... |

An offset is set to -1 if there is no data for that buffer.

The buffer indices in this .wof file correspond to the tensor indices in the original circle file; that is, the number of buffers equals the number of tensors in the original file. To find the constant data for tensor number k in the original file, read the data from buffer number k, as in the sketch below.
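For illustration, here is a minimal reader sketch for the lookup described above. The field widths and little-endian layout are assumptions taken from the offset table; the struct and function names are hypothetical.

```cpp
#include <cstdint>
#include <cstdio>

// Sketch of locating buffer k in a .wof file following the layout above.
// Assumptions: little-endian host, 2-byte magic/version, 4-byte reserved,
// 4-byte n_buffers, 4-byte signed offsets (so -1 can mean "no data").
struct WofHeader
{
  uint16_t magic;
  uint16_t schema_version;
  uint32_t reserved;
  uint32_t n_buffers;
};

// Returns the file offset of the data for tensor index k in the original
// circle model, or -1 if this tensor has no trainable data in the .wof file.
int32_t wof_buffer_offset(std::FILE *f, uint32_t k)
{
  WofHeader header{};
  std::fseek(f, 0, SEEK_SET);
  if (std::fread(&header, sizeof(header), 1, f) != 1)
    return -1;
  if (k >= header.n_buffers)
    return -1;

  // Offset table starts right after the header; one 4-byte entry per tensor.
  std::fseek(f, static_cast<long>(sizeof(WofHeader) + k * sizeof(int32_t)), SEEK_SET);
  int32_t offset = -1;
  if (std::fread(&offset, sizeof(offset), 1, f) != 1)
    return -1;
  return offset; // -1 means "no data for this tensor"
}
```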
