generating clearml-reports #225

Open
sriram-dsl opened this issue Jan 10, 2024 · 13 comments

Comments

@sriram-dsl

I have set up a ClearML server. Currently, I am training a YOLO model using the YOLOv5 framework within a container. I am utilizing the --project and --name flags during the training process. However, the training results are not being sent to the server; instead, they are being stored inside the container with the project and name specified by the flags.

python3 train.py --data/data.yaml --batch 32 --epochs 3 --weights yolov5s.pt --project testing/clearml --name yolo 

It is saving the results under:
testing/clearml/yolo/(results)

@sriram-dsl
Author

sriram-dsl commented Jan 10, 2024

ClearML is configured with the correct credentials. It is successfully sending the data to the server but is not sending the generated reports.

@ainoam
Collaborator

ainoam commented Jan 10, 2024

@sriram-dsl which versions of clearml and clearml-server are you using?

The results should be logged by clearml AS WELL AS stored in your save_dir. Different results are stored in different places: Metrics are shown under 'Scalars', summary images and curves under 'Plots' and training samples under 'Debug samples'.

Which are you missing?

BTW - you might have a typo in the invocation you provided: --data/data.yaml does not seem right.

@sriram-dsl
Author

If I launch a training run, it should appear under the Projects tab, right? That is, how many runs are in progress, which are completed, plus the plots, graphs, and training results.
I got all of those in a normal environment, but inside the container I am getting this issue.

@sriram-dsl
Author

BTW - you might have a typo in the invocation you provided: --data/data.yaml does not seem right.

That was a typo on my part; the actual flag is --data data/data.yaml.

@sriram-dsl
Author

Is the logging issue related to a specific commit ID? I need clarification on this.

@ainoam
Collaborator

ainoam commented Feb 18, 2024

@sriram-dsl we're asking for ClearML versions in order to try to validate ClearML behaviour as closely as possible to your environment.
From your description so far, it sounds like there might be a connectivity issue between your container environment and your ClearML server? How are you setting up the container? How are you running your training inside the container? Can you provide some logs for the training?
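For reference, a minimal connectivity sanity check that could be run from inside the container, assuming the clearml package is installed and clearml.conf is in place (the project and task names below are just placeholders):

from clearml import Task

# Creates a throwaway experiment; if this reaches the server,
# a new task should appear in the web UI under "connectivity-test".
task = Task.init(project_name="connectivity-test", task_name="ping")
task.close()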

@sriram-dsl
Author

ClearML version: 1.13.2
I have set up an Ubuntu 20.04 container and cloned a YOLOv5 repository inside it to conduct training. Now, I aim to send the logs generated from this container to a server.

root@104ab3100c7d:~/yolov5# python3 train.py --img 320 --batch 2 --epochs 3 --data coco128.yaml --weights yolov5s.pt --cache --project container --name testing
train: weights=yolov5s.pt, cfg=, data=coco128.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=3, batch_size=2, imgsz=320, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=ram, image_weights=False, device=, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=container, name=testing, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: ⚠️ YOLOv5 is out of date by 743 commits. Use `git pull` or `git clone https://github.com/ultralytics/yolov5` to update.
YOLOv5 🚀 v6.1-211-gcee5959c Python-3.8.10 torch-1.11.0+cu102 CPU

hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
Weights & Biases: run 'pip install wandb' to automatically track and visualize YOLOv5 🚀 runs (RECOMMENDED)
TensorBoard: Start with 'tensorboard --logdir container', view at http://localhost:6006/

Dataset not found ⚠, missing paths ['/root/datasets/coco128/images/train2017']
Downloading https://ultralytics.com/assets/coco128.zip to coco128.zip...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6.66M/6.66M [00:00<00:00, 7.08MB/s]
Dataset download success ✅ (3.7s), saved to /root/datasets

.........

Model summary: 270 layers, 7235389 parameters, 7235389 gradients, 16.6 GFLOPs

Transferred 349/349 items from yolov5s.pt
Scaled weight_decay = 0.0005
optimizer: SGD with parameter groups 57 weight (no decay), 60 weight, 60 bias
train: Scanning '/root/datasets/coco128/labels/train2017' images and labels...126 found, 2 missing, 0 empty, 0 corrupt: 100%|██████████| 128/128 [00:0
train: New cache created: /root/datasets/coco128/labels/train2017.cache
train: Caching images (0.0GB ram): 100%|██████████| 128/128 [00:00<00:00, 2060.30it/s]                                                                
val: Scanning '/root/datasets/coco128/labels/train2017.cache' images and labels... 126 found, 2 missing, 0 empty, 0 corrupt: 100%|██████████| 128/128 
val: Caching images (0.0GB ram): 100%|██████████| 128/128 [00:00<00:00, 1655.63it/s]                                                                  
Plotting labels to container/testing3/labels.jpg... 

AutoAnchor: 3.96 anchors/target, 0.957 Best Possible Recall (BPR). Anchors are a poor fit to dataset ⚠️, attempting to improve...
AutoAnchor: WARNING: Extremely small objects found: 35 of 929 labels are < 3 pixels in size
AutoAnchor: Running kmeans for 9 anchors on 927 points...
AutoAnchor: Evolving anchors with Genetic Algorithm: fitness = 0.6699: 100%|██████████| 1000/1000 [00:00<00:00, 4459.00it/s]                          
AutoAnchor: thr=0.25: 0.9935 best possible recall, 3.75 anchors past thr
AutoAnchor: n=9, img_size=320, metric_all=0.263/0.670-mean/best, past_thr=0.477-mean: 5,7, 12,11, 16,27, 43,36, 54,72, 62,145, 146,107, 166,191, 298,218
AutoAnchor: Done ✅ (optional: update model *.yaml to use these anchors in the future)
Image sizes 320 train, 320 val
Using 2 dataloader workers
Logging results to container/testing3
Starting training for 3 epochs...

..........

3 epochs completed in 0.013 hours.
Optimizer stripped from container/testing3/weights/last.pt, 14.7MB
Optimizer stripped from container/testing3/weights/best.pt, 14.7MB

Validating container/testing3/weights/best.pt...
Fusing layers... 
Model summary: 213 layers, 7225885 parameters, 0 gradients, 16.4 GFLOPs
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95:   0%|          | 0/32 [00:00<?, ?it/s]  

..........
Results saved to container/testing3

@jkhenning
Member

Did you set up clearml.conf inside your container?

@sriram-dsl
Author

sriram-dsl commented Feb 19, 2024

Did you set up clearml.conf inside your container?

Yes, I've installed ClearML and configured the ClearML client using credentials generated from the self-hosted server.

@eugen-ajechiloae-clearml

Hi @sriram-dsl! This may solve your problem:
If you init the task manually, can you please try initializing your ClearML task using output_uri=True? You can set it to the location you want the model uploaded to, or set sdk.development.default_output_uri (or even the CLEARML_DEFAULT_OUTPUT_URI env var) to the file server you want the model to be uploaded to. It can be the same as the files server used under the api section in clearml.conf.
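For reference, a minimal sketch of the manual-init approach (the project and task names simply mirror the --project/--name flags used earlier in this thread; output_uri=True uploads model checkpoints to the files_server configured in clearml.conf):

from clearml import Task

# Names mirror the --project/--name flags; adjust to your run.
task = Task.init(
    project_name="testing/clearml",
    task_name="yolo",
    output_uri=True,  # upload checkpoints to the configured files_server
)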

@sriram-dsl
Author

This is the api and sdk section in my clearml.conf:

# ClearML SDK configuration file
api {
    # Notice: 'host' is the api server (default port 8008), not the web server.
    api_server: http://<public_ip>:8008
    web_server: http://<public_ip>:8082
    files_server: http://<public_ip>:8081
    # Credentials are generated using the webapp, http://<public_ip>:8082/settings
    # Override with os environment: CLEARML_API_ACCESS_KEY / CLEARML_API_SECRET_KEY
    credentials {"access_key": "OTVT8HO*********5RXU", "secret_key": "OZClLTtposYbKC5UVA*****************4LvM7yhiaFm06"}
}
sdk {
    # ClearML - default SDK configuration

    storage {
        cache {
            # Defaults to <system_temp_folder>/clearml_cache
            default_base_dir: "~/.clearml/cache"
            # default_cache_manager_size: 100
        }

        direct_access: [
            # Objects matching are considered to be available for direct access, i.e. they will not be downloaded
            # or cached, and any download request will return a direct reference.
            # Objects are specified in glob format, available for url and content_type.
            { url: "file://*" }  # file-urls are always directly referenced
        ]
    }

Do I need to add any key:value pairs to it in order to log results while launching YOLOv5 training?
I am initialising the training with the basic YOLOv5 training command, mentioned here: #225 (comment)

@sriram-dsl
Author

sriram-dsl commented Mar 13, 2024

@eugen-ajechiloae-clearml, @jkhenning, @ainoam

Something similar to this might solve my problem: allegroai/clearml#363
In other words, should I make changes inside yolov5/train.py?

My requirement is to log the experiment results generated during training to the ClearML server. The training is conducted inside a container. I have set up an Ubuntu 20.04 container, cloned the YOLOv5 repository inside the Ubuntu container, and started training using that YOLOv5 Git repository.

@eugen-ajechiloae-clearml

Hi @sriram-dsl! If you wish to modify train.py, you could use a dataset, or you could use the OutputModel class (https://clear.ml/docs/latest/docs/references/sdk/model_outputmodel) to upload your models if it fits your case.
Usually, datasets are used to store more than just models (like training datasets).
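For reference, a minimal sketch of the OutputModel approach (the weights path below is just the save_dir from the training log above; adjust to wherever YOLOv5 writes best.pt in your setup):

from clearml import Task, OutputModel

# Reuse the task created for this training run (or create one with Task.init).
task = Task.current_task()

# Register and upload the trained weights as an output model of the task.
output_model = OutputModel(task=task, framework="PyTorch")
output_model.update_weights(weights_filename="container/testing3/weights/best.pt")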
