
The problem with scalars #233

Open

kzelias opened this issue Mar 1, 2024 · 12 comments

kzelias commented Mar 1, 2024

Hello! I have two identical experiments.
For the first one, the scalars are displayed correctly, but for the second one I get an error. The rest of the parameters are logged correctly; the problem is only with the scalars.
What could be the reason?

Error 100 : General data error (RequestError(400, 'search_phase_execution_exception', 'Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [metric] in order to load field data by uninverting the inverted index. Note that this can use significant memory.'))

Working task:
[screenshot]

Failing task:
[screenshot]
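
For context: this Elasticsearch error is raised when a query tries to aggregate or sort on a field mapped as text rather than keyword. A minimal sketch of the kind of request that fails this way, assuming the elasticsearch 7.x Python client and a hypothetical index whose "metric" field got mapped as text:

# Minimal sketch (assumption: elasticsearch 7.x Python client, hypothetical index name).
# A terms aggregation on a field mapped as "text" raises exactly this
# search_phase_execution_exception; the same query works if the field is "keyword".
from elasticsearch import Elasticsearch
from elasticsearch.exceptions import RequestError

es = Elasticsearch(["http://localhost:9200"])

try:
    es.search(
        index="demo-index",  # hypothetical index where "metric" is mapped as text
        body={
            "size": 0,
            "aggs": {"metrics": {"terms": {"field": "metric"}}},
        },
    )
except RequestError as err:
    # err.info carries the "Text fields are not optimised ..." message seen above
    print(err.status_code, err.error)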


kzelias commented Mar 1, 2024

docker.io/allegroai/clearml:1.14.1-448

Similar problems:
#89
#178

jkhenning (Member) commented:

Hi @kzelias, what is your code doing, exactly?


kzelias commented Mar 4, 2024

It's just a training task launched via Hydra.

import pytorch_lightning as pl
from omegaconf import OmegaConf

from nemo.collections.asr.models import EncDecHybridRNNTCTCBPEModel
from nemo.core.config import hydra_runner
from nemo.utils import logging
from nemo.utils.exp_manager import exp_manager

from clearml import Task

CONFIG_NAME = "fastconformer_287_start_tune_b128_lr2e-5"

@hydra_runner(config_path="../../cfg_train/conformers/cvm", config_name=CONFIG_NAME)
def main(cfg):

    task = Task.init(project_name="ap-models", task_name=CONFIG_NAME)
    logger = task.get_logger()

    trainer = pl.Trainer(**cfg.trainer)
    exp_manager(trainer, cfg.get("exp_manager", None))
    asr_model = EncDecHybridRNNTCTCBPEModel(cfg=cfg.model, trainer=trainer)

    # Initialize the weights of the model from another model, if provided via config
    print("------INITING FROM PRETRAIN------")
    asr_model.maybe_init_from_pretrained_checkpoint(cfg)
    print("------INITED------")

    logging.info(f'MODEL train_ds config: {asr_model.cfg.train_ds}')
    logging.info(f'MODEL optim config: {asr_model.cfg.optim}')
    trainer.fit(asr_model)


if __name__ == '__main__':
    main()  # noqa pylint: disable=no-value-for-parameter
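
For reference, the script only calls Task.init and fetches the logger; the scalars come from PyTorch Lightning's logging, which ClearML captures automatically. A minimal sketch of reporting scalars explicitly through ClearML's Logger.report_scalar (the task name, loop length, and values here are made up) can help check whether the server keeps accepting scalars past the point where the UI starts failing:

# Minimal sketch: explicit scalar reporting with ClearML (names and values are illustrative).
from clearml import Task

task = Task.init(project_name="ap-models", task_name="scalar-repro")
logger = task.get_logger()

# The failure reportedly appears after 5-10 thousand steps, so run well past that.
for step in range(20_000):
    # report_scalar(title, series, value, iteration) is ClearML's explicit reporting API;
    # here we just send a dummy value per step.
    logger.report_scalar(title="debug", series="dummy", value=float(step % 7), iteration=step)

task.close()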


kzelias commented Mar 4, 2024

UPD: At the beginning of training the scalars work; after 5-10 thousand steps this error appears.

jkhenning (Member) commented:

This might be an issue with Elastic. Can you check the Elastic docker container logs?


kzelias commented Mar 5, 2024

The error existed for one week and disappeared today.
All that happened during this time was a restart of the apiserver a few hours ago.
Something strange. Is the apiserver related to Elastic?

jkhenning (Member) commented:

It's using Elastic

kzelias closed this as completed Mar 6, 2024

kzelias commented Mar 6, 2024

The situation repeated itself. This time, the apiserver rebooted quickly.
The Elastic log is not detailed:
clearml-elastic-master.log

apiserver:

[2024-03-06 07:44:28,228] [9] [WARNING] [elasticsearch] POST http://clearml-elastic-master:9200/events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b/_search [status:400 request:0.007s]
[2024-03-06 07:44:28,232] [9] [WARNING] [elasticsearch] POST http://clearml-elastic-master:9200/events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b/_search [status:400 request:0.003s]
[2024-03-06 07:44:28,235] [9] [WARNING] [elasticsearch] POST http://clearml-elastic-master:9200/events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b/_search [status:400 request:0.002s]
[2024-03-06 07:44:28,238] [9] [WARNING] [elasticsearch] POST http://clearml-elastic-master:9200/events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b/_search [status:400 request:0.002s]

clearml-apiserver.log
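
The 400 responses are all for searches against the events-training_stats_scalar index above, and the error message points at the metric field. One way to check whether that field ended up mapped as text instead of keyword is to read the index mapping directly; a minimal sketch, assuming the elasticsearch 7.x Python client can reach the clearml-elastic-master service from inside the cluster:

# Minimal sketch (assumption: elasticsearch 7.x Python client, run from inside the cluster).
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://clearml-elastic-master:9200"])
index = "events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"

mapping = es.indices.get_mapping(index=index)
metric_field = mapping[index]["mappings"]["properties"]["metric"]
# ClearML's scalar queries aggregate on this field, so it is expected to be
# {"type": "keyword"}; a {"type": "text"} mapping here would explain the error
# in the apiserver log above.
print(metric_field)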

jkhenning (Member) commented:

Can you share your code? Something seems to be causing an illegal query, but I can't figure out what it is.


kzelias commented Mar 6, 2024

My code is here:
#233 (comment)

The server is deployed with Helm:
https://github.com/allegroai/clearml-helm-charts/tree/main/charts/clearml
I only changed the chart repository for elasticsearch; I haven't changed the version:

- name: elasticsearch
  repository: https://charts.bitnami.com/bitnami
  version: 7.17.3


kzelias commented May 3, 2024

Some more logs from the apiserver:

[2024-05-03 07:48:39,500] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.get_all in 6ms
[2024-05-03 07:48:39,846] [9] [INFO] [clearml.service_repo] Returned 200 for events.get_task_single_value_metrics in 23ms
[2024-05-03 07:48:39,887] [9] [WARNING] [elasticsearch] POST http://clearml-elastic-master:9200/events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b/_search [status:400 request:0.037s]
[2024-05-03 07:48:39,889] [9] [ERROR] [clearml.service_repo] Returned 500 for events.scalar_metrics_iter_histogram in 60ms, msg=General data error (RequestError(400, 'search_phase_execution_exception', 'Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [metric] in order to load field data by uninverting the inverted index. Note that this can use significant memory.'))
[2024-05-03 07:48:39,921] [9] [INFO] [clearml.service_repo] Returned 200 for events.add_batch in 113ms
[2024-05-03 07:48:39,994] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.get_all in 6ms
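
The same error shows up here as a 500 on events.scalar_metrics_iter_histogram. The error message itself suggests a stop-gap: enabling fielddata on the metric field of the affected index. A minimal sketch of that workaround, assuming the elasticsearch 7.x Python client; note the memory caveat in the error message, and that fixing the mapping (or upgrading the server, as suggested below) is the proper fix:

# Minimal sketch of the stop-gap the error message itself suggests (fielddata=true on "metric").
# Assumption: elasticsearch 7.x Python client; this trades memory for working aggregations.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://clearml-elastic-master:9200"])
index = "events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"

es.indices.put_mapping(
    index=index,
    body={"properties": {"metric": {"type": "text", "fielddata": True}}},
)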

kzelias reopened this May 3, 2024
jkhenning (Member) commented:

@kzelias the last server version has some fixes that are related to this issue - can you try with v1.15.0?
