
[CodeCamp2023-325] Find the proper learning rate #1318

Open
yhna940 wants to merge 42 commits into base: main

Conversation

@yhna940 (Contributor) commented Aug 23, 2023

Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and help it receive feedback more easily. If you do not understand some items, don't worry; just open the pull request and ask the maintainers for help.

Motivation

The primary aim of this pull request is to introduce a tuning methodology for automatically determining the optimal learning rate for model training. Hyperparameter tuning, especially finding the optimal learning rate, is crucial for effective model training. The optimal learning rate serves as a starting point that can significantly reduce the time and resources required for broader hyperparameter space exploration. Given the inherently expensive nature of these experiments, adopting a black-box optimization formulation, where the input is a set of hyperparameters and the output is model performance, is a strategic choice.

Modification

In this PR, we've integrated a tuning concept that relies on black-box optimization strategies, such as evolutionary algorithms and Bayesian optimization, to discover the best learning rates. Recognizing the intricate nature of these strategies, instead of implementing them from scratch, we've incorporated external libraries like Nevergrad (developed by Meta), ensuring robustness and efficiency in the search process.
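As a rough illustration of the kind of black-box search such a library provides, the sketch below uses Nevergrad's ask/tell interface directly; `train_and_evaluate` is a hypothetical objective that runs a short trial and returns a loss, and the bounds and budget are placeholders rather than values from this PR.

```python
import nevergrad as ng

# Search the learning rate on a log scale within assumed bounds.
param = ng.p.Log(lower=1e-5, upper=1e-1)
optimizer = ng.optimizers.NGOpt(parametrization=param, budget=16)

for _ in range(optimizer.budget):
    candidate = optimizer.ask()                    # next learning rate to try
    loss = train_and_evaluate(lr=candidate.value)  # hypothetical short trial
    optimizer.tell(candidate, loss)                # report the observed score

best_lr = optimizer.provide_recommendation().value
```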

Structure & Roles

Tuner

The Tuner serves as the main orchestrator for the hyperparameter tuning process.

  • Responsibilities:
    • Injects hyperparameters into the runner configuration.
    • Initiates the training/evaluation process with the given set of hyperparameters.
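A condensed sketch of the trial loop such a Tuner could drive is shown below; it is illustrative only, and `searcher`, `report_hook`, `inject`, and `report_score` are stand-ins for the components described in this PR, not confirmed method names.

```python
import copy

from mmengine.runner import Runner


def tune(runner_cfg: dict, searcher, report_hook, inject, num_trials: int):
    for _ in range(num_trials):
        # Ask the searcher for the next hyperparameters, e.g.
        # {'optim_wrapper.optimizer.lr': 3e-4}.
        hparam = searcher.suggest()
        trial_cfg = copy.deepcopy(runner_cfg)
        for key, value in hparam.items():
            inject(trial_cfg, key, value)    # write the dotted key into the config
        runner = Runner.from_cfg(trial_cfg)  # short training/evaluation run
        runner.train()
        score = report_hook.report_score()   # scalar gathered by the report hook
        searcher.record(hparam, score)       # feed the result back to the searcher
```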

Report Hook

This component acts as an intermediary, gathering results from the training and evaluation phases and formatting them for further analysis.

  • Responsibilities:
    • Monitors the training process up to a specified number of tuning iterations or epochs.
    • Extracts key performance metrics and scores from the Runner's outputs.
    • Reports these results in a standardized format, making them ready for analysis and further decision-making.
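A minimal sketch of what such a hook could track is shown below; the attribute names mirror those discussed in the review threads further down (monitor, report_op, max_scoreboard_len), but the class name and actual implementation in this PR may differ.

```python
from statistics import mean


class ReportingHook:
    """Collects the monitored metric during training and reports a scalar."""

    def __init__(self, monitor: str, report_op: str = 'latest',
                 max_scoreboard_len: int = 1024):
        assert report_op in ('latest', 'mean')
        self.monitor = monitor
        self.report_op = report_op
        self.max_scoreboard_len = max_scoreboard_len
        self.scoreboard: list = []  # rolling history of monitored values

    def append_score(self, score: float) -> None:
        self.scoreboard.append(score)
        if len(self.scoreboard) > self.max_scoreboard_len:
            self.scoreboard.pop(0)

    def report_score(self) -> float:
        # 'latest' returns the most recent value; 'mean' averages the board.
        if self.report_op == 'latest':
            return self.scoreboard[-1]
        return mean(self.scoreboard)
```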

Searcher

The Searcher operates in the realm of the hyperparameter space. Using historical data and sophisticated optimization techniques, it suggests the next set of hyperparameters to be evaluated.

  • Responsibilities:
    • Analyzes the history of hyperparameters and their corresponding performance metrics.
    • Suggests a suitable candidate point in the hyperparameter space for the next round of training/evaluation.
    • Can integrate with external optimization libraries/tools such as Hyperopt, Scikit-optimize, or Microsoft's CFO to make informed recommendations.
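Conceptually, a searcher only needs a record/suggest pair; the sketch below shows one possible interface (names are illustrative and may not match the PR's base class exactly).

```python
from typing import Dict


class Searcher:
    """Suggests hyperparameters and records observed scores."""

    def __init__(self, rule: str, hparam_spec: Dict[str, dict]):
        assert rule in ('greater', 'less')  # whether higher or lower scores are better
        self.rule = rule
        self.hparam_spec = hparam_spec      # e.g. {'optim_wrapper.optimizer.lr': {...}}

    def record(self, hparam: Dict[str, float], score: float) -> None:
        """Store an observed (hyperparameter, score) pair."""
        raise NotImplementedError

    def suggest(self) -> Dict[str, float]:
        """Propose the next hyperparameters to evaluate."""
        raise NotImplementedError
```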

BC-breaking (Optional)

Does the modification introduce changes that break the backward compatibility of downstream repos?
If so, please describe how it breaks compatibility and how downstream projects should modify their code to stay compatible with this PR.

Use cases (Optional)

torchrun --nproc_per_node 2 examples/tune/find_lr.py --launcher pytorch

Checklist

  1. Pre-commit or other linting tools are used to fix the potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit tests to ensure correctness.
  3. If the modification has potential influence on downstream projects, this PR should be tested with downstream projects, like MMDet or MMCls.
  4. The documentation has been modified accordingly, such as docstrings or example tutorials.

References

  1. mim grid search: https://github.com/open-mmlab/mim/blob/main/mim/commands/gridsearch.py
  2. pytorch lightning tuning: https://lightning.ai/docs/pytorch/stable/_modules/lightning/pytorch/tuner/tuning.html#Tuner
  3. ray tune searcher: https://docs.ray.io/en/latest/tune/api/suggestion.html

TODO

  • unit tests
  • docstring

@CLAassistant commented Aug 23, 2023

CLA assistant check: all committers have signed the CLA.

@zhouzaida linked an issue on Aug 24, 2023 that may be closed by this pull request.
@HAOCHENYE (Collaborator) commented:

Thank you for your contribution. Your PR message and docstring helped me understand your design. Before we delve into implementation details, let's consider the relationship between the Runner and the Tuner. Could we have the Runner manage the ReportHook and Searcher, or perhaps make the Tuner an attribute of the Runner? This would give users a friendlier experience, enabling automatic hyperparameter discovery in end-to-end training with the Runner.

@yhna940 (Contributor, Author) commented Aug 31, 2023

> Thank you for your contribution. Your PR message and docstring helped me understand your design. Before we delve into implementation details, let's consider the relationship between the Runner and the Tuner. Could we have the Runner manage the ReportHook and Searcher, or perhaps make the Tuner an attribute of the Runner? This would give users a friendlier experience, enabling automatic hyperparameter discovery in end-to-end training with the Runner.

@HAOCHENYE Thank you for your feedback. Your suggestion resonated with me, and I believe you've raised a valid point.

In line with your suggestion, I've added a class method to the Runner class named from_tuning. The overarching idea is to position the Tuner as an auxiliary tool when instantiating the Runner, thus confining the lifespan of the Tuner and enabling the Runner to orchestrate the tuning process.

For example, users can now employ the following streamlined approach:

runner = Runner.from_tuning(
    runner_cfg=runner_cfg,
    hparam_spec={
        'optim_wrapper.optimizer.lr': {
            'type': 'continuous',
            'lower': 1e-5,
            'upper': 1e-3
        }
    },
    monitor='loss',
    rule='less',
    num_trials=16,
    tuning_epoch=2,
    searcher_cfg=dict(type='NevergradSearcher'),
)
runner.train()

This design not only enhances code readability but also provides users with a seamless experience of automatic hyperparameter discovery integrated with end-to-end training using the Runner.

I appreciate your valuable insights and would love to hear any further thoughts or feedback you might have.

@yhna940 marked this pull request as ready for review on September 1, 2023 04:29
@HAOCHENYE (Collaborator) left a comment:

Thank you for your contribution! I believe the current solution is reasonable. However, I'm curious whether it's possible to integrate the call to from_tuning into Runner.train and have the Runner control whether or not to use the Tuner through a parameter. What do you think are the potential risks associated with this approach?

Comment on lines 21 to 22:

    report_op (str, optional): The operation to report the score.
        Options are 'latest', 'mean'. Defaults to 'latest'.

Collaborator: We need to describe the meaning of 'latest' and 'mean' here.

Author: In line with your suggestion, I have added comments explaining the meaning and role of report_op, including the latest and mean options.

Comment on lines 23 to 24:

    max_scoreboard_len (int, optional):
        The maximum length of the scoreboard.

Collaborator: The scoreboard is a new concept for users. We need to introduce it here.

Author: To clarify the newly introduced concept of the scoreboard, I have added comments in the relevant section explaining its purpose and usage.

mmengine/tune/_report_hook.py (outdated review thread, resolved)

tag, _ = runner.log_processor.get_log_after_iter(
runner, batch_idx, 'train')
score = tag.get(self.monitor, None)
Collaborator (suggested change):

    - score = tag.get(self.monitor, None)
    + score = tag.get(self.monitor)

Collaborator: I suggest adding a prefix to monitor, like train/loss and val/accuracy. We would then only check the monitor at the specific phase (train or validation) indicated by its prefix, and raise an error immediately if it is not defined in tag. Reporting an error to users immediately is better than raising one after the whole tuning round.

Collaborator: We also need to check that the monitored value is a number.

Author: Following your advice, I have enhanced the monitoring process by adding phase prefixes to the monitor. Moreover, I've added logic to verify that monitored values are numerical, to prevent potential errors.

broadcast_object_list(scores_to_broadcast, src=0)
score = scores_to_broadcast[0]
if is_main_process():
self._searcher.record(hparam, score)
Collaborator: Currently, the loss is not synchronized across ranks in mmengine (which is convenient for watching the loss of different ranks), so searchers on different ranks could receive different scores and suggest different hparams. Maybe we should apply a reduce-mean op to the score.

Author: I have modified the process to communicate the score through a reduce-mean operation, in line with your observation, to improve consistency across ranks.

Comment on lines +245 to +248
if is_main_process():
hparams_to_broadcast = [self._searcher.suggest()]
else:
hparams_to_broadcast = [None] # type: ignore
Collaborator: If I'm not mistaken, the reason for only calling suggest within the main process is that the searcher's results may be random for the same scores. Maybe we should highlight this point (randomness) in a comment?

Author: You were absolutely right about the randomness of the suggest method. To make this clearer, I have added comments noting that the method is executed only in the main process because of the inherent randomness of its outcomes.

for k in keys[:-1]:
if isinstance(cfg, list):
idx = int(k)
if idx >= len(cfg) or idx < 0:
Collaborator: A negative index is allowed for a list. Is it necessary to raise an error here?

Author: To support negative indices, I have removed the assertion that previously restricted them, allowing more flexibility in list operations.

self._logger.info(f'Best hyperparameters obtained: {best_hparam}')
self._logger.info(f'Best score obtained: {best_score}')
self._logger.info('Tuning completed.')
return dict(hparam=best_hparam, score=best_score)
Collaborator: In the Runner, several global variables are defined, including Visualizer, Logger, and DefaultScope (which inherits from ManagerMixin). We should clean these variables up after each trial. Furthermore, it might be better to run the tuning process in a temporary directory and delete it once tuning is complete.

Author: Taking your cues, I have added a mechanism to clear the global variables after each trial, referencing the testing function for the implementation. Additionally, the tuning process now runs in a temporary directory that is deleted on completion, keeping the workspace clean.

@yhna940 (Contributor, Author) commented Sep 13, 2023

> Thank you for your contribution! I believe the current solution is reasonable. However, I'm curious whether it's possible to integrate the call to from_tuning into Runner.train and have the Runner control whether or not to use the Tuner through a parameter. What do you think are the potential risks associated with this approach?

Hello, @HAOCHENYE

Thank you very much for your thoughtful suggestion. I appreciate the proactive approach to potentially integrating the from_tuning process directly within the Runner.train method and controlling the utilization of the tuner through a parameter. This could streamline the process, making the tuning more seamless.

However, I think there is a potential risk in combining the two methods. Firstly, by the time tuning is invoked within the train method, the runner instance has already been instantiated. This means we would have co-existing instances: the caller runner and the callee runner inside the tuner. Both instances maintain their own models, optimizers, and data loaders, which could increase memory usage considerably.

Furthermore, replacing the caller runner's attributes with the tuned values after tuning could introduce considerable complexity. We might have to define intricate logic to replace the attributes safely without adverse side effects. For instance, tuning the learning rate would require altering the optimizer's state dict; modifying the number of data samples would require rebuilding the data loader, among other potential changes. Pre-defining rules for attribute replacement is challenging given the many scenarios and combinations that would need to be accounted for.

If we do want to integrate tuning within runner.train, one approach might be lazy initialization of the runner: instantiation would be deferred until the train method is invoked, allowing tuning to complete and the hyperparameters to be decided before the runner instance is created. This could mitigate the concerns above. However, it would entail a significant change to how the runner operates, which seems too extensive and difficult a task to undertake in this PR.

I am eager to hear whether you consider these concerns valid, and any other feedback on this matter.

@HAOCHENYE (Collaborator) commented:

I'm very sorry for the delayed response. I think your considerations are very reasonable. MMEngine introduced FlexibleRunner in v0.8.0, which can fully lazily initialize various components, and it should be able to address the situation where the Tuner and the Runner hold two models simultaneously during the training phase. However, this is somewhat unrelated; let's continue to focus on Runner for this PR.

Currently, during initialization, Runner instantiates components like the model, visualizer, and logger. If you want to find the best learning rate during the train, you can still do it similarly to the current approach. You can build a new Runner during train, find the best parameters, and then inject them into the relevant components of the original Runner, rather than directly using the current Runner to search for the best learning rate. However, even with this approach, it doesn't resolve the issue of having two models simultaneously when searching for the learning rate.

For me, both searching for the learning rate during the training phase and searching for it through the from_tuning interface are acceptable. You can implement it according to your preference 😄 .

@yhna940 (Contributor, Author) commented Sep 24, 2023

> I'm very sorry for the delayed response. I think your considerations are very reasonable. MMEngine introduced FlexibleRunner in v0.8.0, which can fully lazily initialize various components, and it should be able to address the situation where the Tuner and the Runner hold two models simultaneously during the training phase. However, this is somewhat unrelated; let's continue to focus on Runner for this PR.
>
> Currently, during initialization, Runner instantiates components like the model, visualizer, and logger. If you want to find the best learning rate during the train, you can still do it similarly to the current approach. You can build a new Runner during train, find the best parameters, and then inject them into the relevant components of the original Runner, rather than directly using the current Runner to search for the best learning rate. However, even with this approach, it doesn't resolve the issue of having two models simultaneously when searching for the learning rate.
>
> For me, both searching for the learning rate during the training phase and searching for it through the from_tuning interface are acceptable. You can implement it according to your preference 😄 .

Hello @HAOCHENYE,

First and foremost, I'd like to express my gratitude for your comprehensive feedback on the proposal. Your insights and the clarity with which you've approached the problem have been immensely beneficial.

Thank you for laying out both options: integrating tuning within the runner.train method and using the from_tuning interface. After weighing the two alternatives, I find myself gravitating towards the latter, the from_tuning method, for several reasons:

  1. Separation of Concerns: Leveraging the from_tuning method inherently segregates the tuning phase from the training phase. This explicit demarcation ensures that each phase has its specific focus, leading to more structured and understandable code.
  2. Avoidance of Co-existing Attributes: As you rightly pointed out, having the caller runner and the callee runner simultaneously poses memory and complexity concerns. With the from_tuning approach, this co-existence is avoided, leading to a more memory-efficient and streamlined workflow.

I sincerely hope my approach aligns well with the vision of MMEngine. I'd appreciate further comments, reviews, or feedback on this direction or any other aspect of the PR to refine and improve it.

Thank you once again for your guidance.

Successfully merging this pull request may close the following issue: [Feature] find the proper learning rate

3 participants