
[Feature] Support calculating loss during validation #1503

Merged
merged 16 commits into open-mmlab:main May 17, 2024

Conversation

fanqiNO1
Collaborator

@fanqiNO1 fanqiNO1 commented Feb 22, 2024

Background

Early stopping may use validation loss as its monitored metric, but mmengine currently does not support calculating and parsing validation loss as a metric.

However, because model implementations vary and calculating validation loss is not a common requirement, the calculation should not be initiated by mmengine. Instead, the model initiates it and returns the loss following a convention, and mmengine parses it and reports it as a metric.

Thus this PR aims to implement this return-and-parse convention without introducing breaking changes.

Design

In order not to introduce breaking changes, we chose to have the model compute the loss in val_step (i.e., model.forward with mode='predict'), wrap it in a BaseDataElement, and append it to the val step results.

Therefore, mmengine's ValLoop needs to inspect the last item of each val_step result and determine whether it is a validation loss. If it is, ValLoop accumulates it, returns it at the end of the loop, and computes the other metrics (e.g., accuracy) from the remaining items. If it is not, the result is processed as before.
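The ValLoop-side convention can be sketched as follows. This is a minimal, self-contained illustration, not the actual mmengine implementation; `BaseDataElement` is replaced by a stand-in class so the example runs without mmengine installed, and the helper names (`split_val_outputs`, `average_val_losses`) are invented for this sketch.

```python
class BaseDataElement:
    """Stand-in for mmengine.structures.BaseDataElement (illustration only)."""

    def __init__(self, **kwargs):
        for key, value in kwargs.items():
            setattr(self, key, value)


def is_val_loss(item) -> bool:
    # The convention: a trailing BaseDataElement carrying a ``loss`` attribute.
    return isinstance(item, BaseDataElement) and hasattr(item, 'loss')


def split_val_outputs(outputs):
    """Strip the trailing loss element (if any) from one val_step result.

    Returns (predictions, loss_dict_or_None); predictions are then fed to
    the metrics as before, so existing models are unaffected.
    """
    if outputs and is_val_loss(outputs[-1]):
        return list(outputs[:-1]), outputs[-1].loss
    return list(outputs), None


def average_val_losses(losses):
    """Average per-batch loss dicts collected over the whole val loop."""
    totals = {}
    for loss in losses:
        for name, value in loss.items():
            totals[name] = totals.get(name, 0.0) + value
    return {name: total / len(losses) for name, total in totals.items()}
```

A model that does not append a loss element goes through `split_val_outputs` unchanged, which is what keeps the convention backward compatible.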

Adaptation

Custom Model

Take https://github.com/open-mmlab/mmengine/blob/02f80e8bdd38f6713e04a872304861b02157905a/examples/distributed_training.py#L14-#L25 as an example.

class MMResNet50(BaseModel):

    def __init__(self):
        super().__init__()
        self.resnet = torchvision.models.resnet50()

    def forward(self, imgs, labels, mode):
        x = self.resnet(imgs)
        if mode == 'loss':
            return {'loss': F.cross_entropy(x, labels)}
        elif mode == 'predict':
-          return x, labels
+          val_loss = {'loss': F.cross_entropy(x, labels)}
+          return x, labels, BaseDataElement(loss=val_loss)
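Once the model returns its validation loss this way, the parsed value can be monitored, for example by mmengine's EarlyStoppingHook. A hedged config sketch follows; the monitored key name 'loss' is an assumption that depends on how your evaluator reports the parsed value, so check the actual metric key in your logs.

```python
# Sketch of a config consuming the parsed validation loss.
# The metric key ('loss') is an assumption; verify it against the key
# actually reported at the end of your ValLoop.
custom_hooks = [
    dict(
        type='EarlyStoppingHook',
        monitor='loss',   # validation loss parsed by ValLoop (assumed key)
        rule='less',      # lower loss is better
        patience=5,       # stop after 5 validations without improvement
    )
]
```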

MMPreTrain

Take https://github.com/open-mmlab/mmpretrain/blob/17a886cb5825cd8c26df4e65f7112d404b99fe12/mmpretrain/models/classifiers/image.py#L248-L249 as an example.

    def predict(self,
                inputs: torch.Tensor,
                data_samples: Optional[List[DataSample]] = None,
                **kwargs) -> List[DataSample]:
        """Predict results from a batch of inputs.

        Args:
            inputs (torch.Tensor): The input tensor with shape
                (N, C, ...) in general.
            data_samples (List[DataSample], optional): The annotation
                data of every samples. Defaults to None.
            **kwargs: Other keyword arguments accepted by the ``predict``
                method of :attr:`head`.
        """
        feats = self.extract_feat(inputs)
-       return self.head.predict(feats, data_samples, **kwargs)
+       preds = self.head.predict(feats, data_samples, **kwargs)
+       loss = self.head.loss(feats, data_samples)
+       loss_sample = DataSample(loss=loss)
+       preds.append(loss_sample)
+       return preds

MMPose

Note that calculating the loss in this way may not be correct for MMPose.

Take https://github.com/open-mmlab/mmpose/blob/5a3be9451bdfdad2053a90dc1199e3ff1ea1a409/mmpose/models/pose_estimators/topdown.py#L99-#L120 as an example.

    def predict(self, inputs: Tensor, data_samples: SampleList) -> SampleList:
        """Predict results from a batch of inputs and data samples with post-
        processing.

        Args:
            inputs (Tensor): Inputs with shape (N, C, H, W)
            data_samples (List[:obj:`PoseDataSample`]): The batch
                data samples

        Returns:
            list[:obj:`PoseDataSample`]: The pose estimation results of the
            input images. The return value is `PoseDataSample` instances with
            ``pred_instances`` and ``pred_fields``(optional) field , and
            ``pred_instances`` usually contains the following keys:

                - keypoints (Tensor): predicted keypoint coordinates in shape
                    (num_instances, K, D) where K is the keypoint number and D
                    is the keypoint dimension
                - keypoint_scores (Tensor): predicted keypoint scores in shape
                    (num_instances, K)
        """
        assert self.with_head, (
            'The model must have head to perform prediction.')

        if self.test_cfg.get('flip_test', False):
            _feats = self.extract_feat(inputs)
            _feats_flip = self.extract_feat(inputs.flip(-1))
            feats = [_feats, _feats_flip]
+           loss = self.head.loss(_feats, data_samples, train_cfg=self.train_cfg)
        else:
            feats = self.extract_feat(inputs)
+           loss = self.head.loss(feats, data_samples, train_cfg=self.train_cfg)

        preds = self.head.predict(feats, data_samples, test_cfg=self.test_cfg)

        if isinstance(preds, tuple):
            batch_pred_instances, batch_pred_fields = preds
        else:
            batch_pred_instances = preds
            batch_pred_fields = None

        results = self.add_pred_to_datasample(batch_pred_instances,
                                              batch_pred_fields, data_samples)
+       loss_sample = PoseDataSample(loss=loss)
+       results.append(loss_sample)
        return results

In addition, you should add dict(type='GenerateTarget', encoder=codec) to val_pipeline, similar to train_pipeline, so that the targets required by head.loss are available during validation.
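The val_pipeline change might look like the following sketch. The transform list is illustrative, taken from a typical MMPose top-down config; only the GenerateTarget line is the actual addition, and `codec` and the other transforms come from your existing config.

```python
# Illustrative MMPose val_pipeline with GenerateTarget added so that
# data_samples carry the targets needed by head.loss() at predict time.
# Transforms other than GenerateTarget are placeholders from a typical
# top-down config; keep whatever your config already uses.
val_pipeline = [
    dict(type='LoadImage'),
    dict(type='GetBBoxCenterScale'),
    dict(type='TopdownAffine', input_size=codec['input_size']),
    dict(type='GenerateTarget', encoder=codec),  # added for val loss
    dict(type='PackPoseInputs'),
]
```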

@MikasaLee

Hoping val loss support is merged into mmengine as soon as possible; it is a very useful feature.

@zhouzaida zhouzaida linked an issue Feb 26, 2024 that may be closed by this pull request
HAOCHENYE
HAOCHENYE previously approved these changes Mar 5, 2024
HAOCHENYE
HAOCHENYE previously approved these changes May 6, 2024
@zhouzaida zhouzaida merged commit d1f1aab into open-mmlab:main May 17, 2024
9 of 20 checks passed
Successfully merging this pull request may close these issues.

[Feature] Support calculating loss in the validation step
4 participants