Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add recover model from checkpoint first if it exists, store is local #376

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

kitianFresh
Copy link
Contributor

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #167

Special notes for your reviewer:

@kitianFresh kitianFresh requested a review from qmhu June 23, 2022 11:52
@kitianFresh kitianFresh force-pushed the feature/support-local-stored-checkpoint-for-percentile-algorithm branch from 1d0091f to 61159bf Compare June 23, 2022 11:58
@github-actions
Copy link
Contributor

github-actions bot commented Jun 23, 2022

🎉 Successfully Build Images.

Overview: https://finops.coding.net/public-artifacts/gocrane/crane/packages

Image Pull Command
crane-agent:pr-376-61159bf docker pull finops-docker.pkg.coding.net/gocrane/crane/crane-agent:pr-376-61159bf
dashboard:pr-376-61159bf docker pull finops-docker.pkg.coding.net/gocrane/crane/dashboard:pr-376-61159bf
metric-adapter:pr-376-61159bf docker pull finops-docker.pkg.coding.net/gocrane/crane/metric-adapter:pr-376-61159bf
craned:pr-376-61159bf docker pull finops-docker.pkg.coding.net/gocrane/crane/craned:pr-376-61159bf

// 1. predictor checkpoints all metric namers together periodically by a independent routine. but this will not guarantee the checkpoint data is consistent with the latest updated model in memory
// 2. predictor checkpoints the metric namer each time after model is updating, so the checkpoint is always latest. for example, the percentile to do checkpoint after add sample for each metric namer.
// 3. application caller such as evpa triggers the metric namer to do checkpoint. delegate the trigger to application caller
type Checkpointer interface {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move directory checkpoint into somewhere deeper, for example, in evpa's rootdirr

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move directory checkpoint into somewhere deeper, for example, in evpa's rootdirr

I think check pointer is a common lib, not only for evpa. other model can be checkpointed too if it supports. so i put it in top level

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice for a common checkpoint ability, So for a common lib, maybe we need to change interface function to a general way.

Copy link

@ghost ghost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, the motivation of checkpoint should be clarified much clear; the current supported DSP algorithm actions in an off-line manner, i.e., no explicit training process; for those models having explicitly distinguished training and reasoning processes, checkpoint is much useful;

switch cpStoreType {
case checkpoint.StoreTypeLocal:
checkpointer, err = checkpoint.InitCheckpointer(checkpoint.StoreType(opts.CheckpointerStore), opts.CheckpointerLocalConfig)
if err != nil {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should the InitCheckpointer be called exactly once? e.g., by sync.Once.Do

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should the InitCheckpointer be called exactly once? e.g., by sync.Once.Do

it is ok to call InitCheckpointer more than once, becasue the factory pattern, there is a map to store each storage checkpoint. it will return if it already exists.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

var _ Checkpointer = &Local{}

// Local use local filesystem as checkpoint storage, so if you use Local, the craned need some persistent volumes such as cbs as storage to keep the states
type Local struct {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we just take the storage type as a configurable parameter? by that way, we can reuse the code among different types of storage, including the Start method below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we just take the storage type as a configurable parameter? by that way, we can reuse the code among different types of storage, including the Start method below.

Yes, a high level abstraction is abstract a checkpoint manager framework for different storages, user can implement different storage if they want.

But now the checkpoint structure is not stable and is still evolving. and some algorithms do not support the model checkpoint. So we just iterate a little to support some urgent needs of users.

So now just implement different storage backed checkpoint. If you want, we can discuss more in crane meeting or wechat

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@zsnmwy zsnmwy added the area:algorithm crane algorithm label Aug 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:algorithm crane algorithm
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support checkpoint for percentile algorithm predictor
3 participants