
The problem with scalars #233

Open

kzelias opened this issue Mar 1, 2024 · 12 comments

kzelias commented Mar 1, 2024

Hello! I have two identical experiments.
For the first one, the scalars are displayed correctly, but for the second one I get an error. The rest of the parameters are logged correctly; the problem is only with the scalars.
What could be the reason?

Error 100 : General data error (RequestError(400, 'search_phase_execution_exception', 'Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [metric] in order to load field data by uninverting the inverted index. Note that this can use significant memory.'))

Working task:
[screenshot]

Failing task:
[screenshot]
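
For context: this Elasticsearch error is raised when a query tries to aggregate or sort on a field mapped as text rather than keyword. A minimal sketch of the kind of request that fails this way, assuming the elasticsearch 7.x Python client and a hypothetical index whose "metric" field got mapped as text:

# Minimal sketch (assumption: elasticsearch 7.x Python client, hypothetical index name).
# A terms aggregation on a field mapped as "text" raises exactly this
# search_phase_execution_exception; the same query works if the field is "keyword".
from elasticsearch import Elasticsearch
from elasticsearch.exceptions import RequestError

es = Elasticsearch(["http://localhost:9200"])

try:
    es.search(
        index="demo-index",  # hypothetical index where "metric" is mapped as text
        body={
            "size": 0,
            "aggs": {"metrics": {"terms": {"field": "metric"}}},
        },
    )
except RequestError as err:
    # err.info carries the "Text fields are not optimised ..." message seen above
    print(err.status_code, err.error)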


kzelias commented Mar 1, 2024

docker.io/allegroai/clearml:1.14.1-448

Similar problems:
#89
#178

jkhenning (Member) commented:

Hi @kzelias, what is your code doing, exactly?


kzelias commented Mar 4, 2024

It's just a training task launched via Hydra.

import pytorch_lightning as pl
from omegaconf import OmegaConf

from nemo.collections.asr.models import EncDecHybridRNNTCTCBPEModel
from nemo.core.config import hydra_runner
from nemo.utils import logging
from nemo.utils.exp_manager import exp_manager

from clearml import Task

CONFIG_NAME = "fastconformer_287_start_tune_b128_lr2e-5"

@hydra_runner(config_path="../../cfg_train/conformers/cvm", config_name=CONFIG_NAME)
def main(cfg):

    task = Task.init(project_name="ap-models", task_name=CONFIG_NAME)
    logger = task.get_logger()

    trainer = pl.Trainer(**cfg.trainer)
    exp_manager(trainer, cfg.get("exp_manager", None))
    asr_model = EncDecHybridRNNTCTCBPEModel(cfg=cfg.model, trainer=trainer)

    # Initialize the weights of the model from another model, if provided via config
    print("------INITING FROM PRETRAIN------")
    asr_model.maybe_init_from_pretrained_checkpoint(cfg)
    print("------INITED------")

    logging.info(f'MODEL train_ds config: {asr_model.cfg.train_ds}')
    logging.info(f'MODEL optim config: {asr_model.cfg.optim}')
    trainer.fit(asr_model)


if __name__ == '__main__':
    main()  # noqa pylint: disable=no-value-for-parameter
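
For reference, the script only calls Task.init and fetches the logger; the scalars come from PyTorch Lightning's logging, which ClearML captures automatically. A minimal sketch of reporting scalars explicitly through ClearML's Logger.report_scalar (the task name, loop length, and values here are made up) can help check whether the server keeps accepting scalars past the point where the UI starts failing:

# Minimal sketch: explicit scalar reporting with ClearML (names and values are illustrative).
from clearml import Task

task = Task.init(project_name="ap-models", task_name="scalar-repro")
logger = task.get_logger()

# The failure reportedly appears after 5-10 thousand steps, so run well past that.
for step in range(20_000):
    # report_scalar(title, series, value, iteration) is ClearML's explicit reporting API;
    # here we just send a dummy value per step.
    logger.report_scalar(title="debug", series="dummy", value=float(step % 7), iteration=step)

task.close()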


kzelias commented Mar 4, 2024

UPD: At the beginning of training the scalars work; after 5-10 thousand steps this error appears.

jkhenning (Member) commented:

This might be an issue with Elastic. Can you check the Elastic docker container logs?


kzelias commented Mar 5, 2024

The error existed for one week and disappeared today.
All that happened during this time was a restart of the apiserver a few hours ago.
Something strange. Is the apiserver related to Elastic?

jkhenning (Member) commented:

It's using Elastic

kzelias closed this as completed Mar 6, 2024

kzelias commented Mar 6, 2024

The situation repeated itself. This time, the apiserver rebooted quickly.
The Elastic log is not detailed:
clearml-elastic-master.log

apiserver:

[2024-03-06 07:44:28,228] [9] [WARNING] [elasticsearch] POST http://clearml-elastic-master:9200/events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b/_search [status:400 request:0.007s]
[2024-03-06 07:44:28,232] [9] [WARNING] [elasticsearch] POST http://clearml-elastic-master:9200/events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b/_search [status:400 request:0.003s]
[2024-03-06 07:44:28,235] [9] [WARNING] [elasticsearch] POST http://clearml-elastic-master:9200/events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b/_search [status:400 request:0.002s]
[2024-03-06 07:44:28,238] [9] [WARNING] [elasticsearch] POST http://clearml-elastic-master:9200/events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b/_search [status:400 request:0.002s]

clearml-apiserver.log
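
The 400 responses are all for searches against the events-training_stats_scalar index above, and the error message points at the metric field. One way to check whether that field ended up mapped as text instead of keyword is to read the index mapping directly; a minimal sketch, assuming the elasticsearch 7.x Python client can reach the clearml-elastic-master service from inside the cluster:

# Minimal sketch (assumption: elasticsearch 7.x Python client, run from inside the cluster).
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://clearml-elastic-master:9200"])
index = "events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"

mapping = es.indices.get_mapping(index=index)
metric_field = mapping[index]["mappings"]["properties"]["metric"]
# ClearML's scalar queries aggregate on this field, so it is expected to be
# {"type": "keyword"}; a {"type": "text"} mapping here would explain the error
# in the apiserver log above.
print(metric_field)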

jkhenning (Member) commented:

Can you share your code? Something seems to be causing an illegal query, but I can't figure out what it is.


kzelias commented Mar 6, 2024

My code is here:
#233 (comment)

The server is deployed with Helm:
https://github.com/allegroai/clearml-helm-charts/tree/main/charts/clearml
I only changed the chart repository for elasticsearch; I haven't changed the version:

- name: elasticsearch
  repository: https://charts.bitnami.com/bitnami
  version: 7.17.3


kzelias commented May 3, 2024

Some more logs from the apiserver:

[2024-05-03 07:48:39,500] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.get_all in 6ms
[2024-05-03 07:48:39,846] [9] [INFO] [clearml.service_repo] Returned 200 for events.get_task_single_value_metrics in 23ms
[2024-05-03 07:48:39,887] [9] [WARNING] [elasticsearch] POST http://clearml-elastic-master:9200/events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b/_search [status:400 request:0.037s]
[2024-05-03 07:48:39,889] [9] [ERROR] [clearml.service_repo] Returned 500 for events.scalar_metrics_iter_histogram in 60ms, msg=General data error (RequestError(400, 'search_phase_execution_exception', 'Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [metric] in order to load field data by uninverting the inverted index. Note that this can use significant memory.'))
[2024-05-03 07:48:39,921] [9] [INFO] [clearml.service_repo] Returned 200 for events.add_batch in 113ms
[2024-05-03 07:48:39,994] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.get_all in 6ms
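
The same error shows up here as a 500 on events.scalar_metrics_iter_histogram. The error message itself suggests a stop-gap: enabling fielddata on the metric field of the affected index. A minimal sketch of that workaround, assuming the elasticsearch 7.x Python client; note the memory caveat in the error message, and that fixing the mapping (or upgrading the server, as suggested below) is the proper fix:

# Minimal sketch of the stop-gap the error message itself suggests (fielddata=true on "metric").
# Assumption: elasticsearch 7.x Python client; this trades memory for working aggregations.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://clearml-elastic-master:9200"])
index = "events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"

es.indices.put_mapping(
    index=index,
    body={"properties": {"metric": {"type": "text", "fielddata": True}}},
)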

kzelias reopened this May 3, 2024
jkhenning (Member) commented:

@kzelias the last server version has some fixes that are related to this issue - can you try with v1.15.0?
