generating clearml-reports #225

Open
sriram-dsl opened this issue Jan 10, 2024 · 13 comments

Comments

@sriram-dsl

I have set up a ClearML server. Currently, I am training a YOLO model using the YOLOv5 framework within a container. I am utilizing the --project and --name flags during the training process. However, the training results are not being sent to the server; instead, they are being stored inside the container with the project and name specified by the flags.

python3 train.py --data/data.yaml --batch 32 --epochs 3 --weights yolov5s.pt --project testing/clearml --name yolo 

It is saving the results under:
testing/clearml/yolo/(results)

@sriram-dsl
Author

sriram-dsl commented Jan 10, 2024

ClearML is configured with the correct credentials. It is successfully sending the data to the server but is not sending the generated reports.

@ainoam
Collaborator

ainoam commented Jan 10, 2024

@sriram-dsl which versions of clearml and clearml-server are you using?

The results should be logged by clearml AS WELL AS stored in your save_dir. Different results are stored in different places: Metrics are shown under 'Scalars', summary images and curves under 'Plots' and training samples under 'Debug samples'.

Which are you missing?

BTW - you might have a typo in the invocation you provided: --data/data.yaml does not seem right.

@sriram-dsl
Author

If I launch a training run, it should appear under the Projects tab, right? That is, how many runs are in progress, which are completed, plus the plots, graphs, and training results.
I got all of those in a normal environment, but inside the container I am getting this issue.

@sriram-dsl
Author

BTW - you might have a typo in the invocation you provided: --data/data.yaml does not seem right.

That was a typo on my part; the actual flag is --data data/data.yaml.

@sriram-dsl
Author

Is the logging issue related to a specific commit ID? I need clarification on this.

@ainoam
Collaborator

ainoam commented Feb 18, 2024

@sriram-dsl we're asking for ClearML versions in order to try to validate ClearML behaviour as closely as possible to your environment.
From your description so far, it sounds like there might be a connectivity issue between your container environment and your ClearML server? How are you setting up the container? How are you running your training inside the container? Can you provide some logs for the training?
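For reference, a minimal connectivity sanity check that could be run from inside the container, assuming the clearml package is installed and clearml.conf is in place (the project and task names below are just placeholders):

from clearml import Task

# Creates a throwaway experiment; if this reaches the server,
# a new task should appear in the web UI under "connectivity-test".
task = Task.init(project_name="connectivity-test", task_name="ping")
task.close()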

@sriram-dsl
Author

ClearML version: 1.13.2
I have set up an Ubuntu 20.04 container and cloned a YOLOv5 repository inside it to conduct training. Now, I aim to send the logs generated from this container to a server.

root@104ab3100c7d:~/yolov5# python3 train.py --img 320 --batch 2 --epochs 3 --data coco128.yaml --weights yolov5s.pt --cache --project container --name testing
train: weights=yolov5s.pt, cfg=, data=coco128.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=3, batch_size=2, imgsz=320, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=ram, image_weights=False, device=, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=container, name=testing, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: ⚠️ YOLOv5 is out of date by 743 commits. Use `git pull` or `git clone https://github.com/ultralytics/yolov5` to update.
YOLOv5 🚀 v6.1-211-gcee5959c Python-3.8.10 torch-1.11.0+cu102 CPU

hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
Weights & Biases: run 'pip install wandb' to automatically track and visualize YOLOv5 🚀 runs (RECOMMENDED)
TensorBoard: Start with 'tensorboard --logdir container', view at http://localhost:6006/

Dataset not found ⚠, missing paths ['/root/datasets/coco128/images/train2017']
Downloading https://ultralytics.com/assets/coco128.zip to coco128.zip...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6.66M/6.66M [00:00<00:00, 7.08MB/s]
Dataset download success ✅ (3.7s), saved to /root/datasets

.........

Model summary: 270 layers, 7235389 parameters, 7235389 gradients, 16.6 GFLOPs

Transferred 349/349 items from yolov5s.pt
Scaled weight_decay = 0.0005
optimizer: SGD with parameter groups 57 weight (no decay), 60 weight, 60 bias
train: Scanning '/root/datasets/coco128/labels/train2017' images and labels...126 found, 2 missing, 0 empty, 0 corrupt: 100%|██████████| 128/128 [00:0
train: New cache created: /root/datasets/coco128/labels/train2017.cache
train: Caching images (0.0GB ram): 100%|██████████| 128/128 [00:00<00:00, 2060.30it/s]                                                                
val: Scanning '/root/datasets/coco128/labels/train2017.cache' images and labels... 126 found, 2 missing, 0 empty, 0 corrupt: 100%|██████████| 128/128 
val: Caching images (0.0GB ram): 100%|██████████| 128/128 [00:00<00:00, 1655.63it/s]                                                                  
Plotting labels to container/testing3/labels.jpg... 

AutoAnchor: 3.96 anchors/target, 0.957 Best Possible Recall (BPR). Anchors are a poor fit to dataset ⚠️, attempting to improve...
AutoAnchor: WARNING: Extremely small objects found: 35 of 929 labels are < 3 pixels in size
AutoAnchor: Running kmeans for 9 anchors on 927 points...
AutoAnchor: Evolving anchors with Genetic Algorithm: fitness = 0.6699: 100%|██████████| 1000/1000 [00:00<00:00, 4459.00it/s]                          
AutoAnchor: thr=0.25: 0.9935 best possible recall, 3.75 anchors past thr
AutoAnchor: n=9, img_size=320, metric_all=0.263/0.670-mean/best, past_thr=0.477-mean: 5,7, 12,11, 16,27, 43,36, 54,72, 62,145, 146,107, 166,191, 298,218
AutoAnchor: Done ✅ (optional: update model *.yaml to use these anchors in the future)
Image sizes 320 train, 320 val
Using 2 dataloader workers
Logging results to container/testing3
Starting training for 3 epochs...

..........

3 epochs completed in 0.013 hours.
Optimizer stripped from container/testing3/weights/last.pt, 14.7MB
Optimizer stripped from container/testing3/weights/best.pt, 14.7MB

Validating container/testing3/weights/best.pt...
Fusing layers... 
Model summary: 213 layers, 7225885 parameters, 0 gradients, 16.4 GFLOPs
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95:   0%|          | 0/32 [00:00<?, ?it/s]  

..........
Results saved to container/testing3

@jkhenning
Member

Did you set up clearml.conf inside your container?

@sriram-dsl
Author

sriram-dsl commented Feb 19, 2024

Did you set up clearml.conf inside your container?

Yes, I've installed ClearML and configured the ClearML client using credentials generated from the self-hosted server.

@eugen-ajechiloae-clearml

Hi @sriram-dsl! This may solve your problem:
If you init the task manually, can you please try initializing your ClearML task using output_uri=True? You can set it to the location you want the model uploaded to, or set sdk.development.default_output_uri (or even the CLEARML_DEFAULT_OUTPUT_URI env var) to the file server you want the model to be uploaded to. It can be the same as the files server used under the api section in clearml.conf.
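For reference, a minimal sketch of the manual-init approach (the project and task names simply mirror the --project/--name flags used earlier in this thread; output_uri=True uploads model checkpoints to the files_server configured in clearml.conf):

from clearml import Task

# Names mirror the --project/--name flags; adjust to your run.
task = Task.init(
    project_name="testing/clearml",
    task_name="yolo",
    output_uri=True,  # upload checkpoints to the configured files_server
)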

@sriram-dsl
Author

This is the api and sdk section in my clearml.conf:

# ClearML SDK configuration file
api {
    # Notice: 'host' is the api server (default port 8008), not the web server.
    api_server: http://<public_ip>:8008
    web_server: http://<public_ip>:8082
    files_server: http://<public_ip>:8081
    # Credentials are generated using the webapp, http://<public_ip>:8082/settings
    # Override with os environment: CLEARML_API_ACCESS_KEY / CLEARML_API_SECRET_KEY
    credentials {"access_key": "OTVT8HO*********5RXU", "secret_key": "OZClLTtposYbKC5UVA*****************4LvM7yhiaFm06"}
}
sdk {
    # ClearML - default SDK configuration

    storage {
        cache {
            # Defaults to <system_temp_folder>/clearml_cache
            default_base_dir: "~/.clearml/cache"
            # default_cache_manager_size: 100
        }

        direct_access: [
            # Objects matching are considered to be available for direct access, i.e. they will not be downloaded
            # or cached, and any download request will return a direct reference.
            # Objects are specified in glob format, available for url and content_type.
            { url: "file://*" }  # file-urls are always directly referenced
        ]
    }

Do I need to add any key:value pairs to it in order to log results while launching YOLOv5 training?
I am initialising the training with the basic YOLOv5 training command, mentioned here: #225 (comment)

@sriram-dsl
Author

sriram-dsl commented Mar 13, 2024

@eugen-ajechiloae-clearml, @jkhenning, @ainoam

Something similar to this might solve my problem: allegroai/clearml#363
In other words, should I make changes inside yolov5/train.py?

My requirement is to log the experiment results generated during training to the ClearML server. The training is conducted inside a container. I have set up an Ubuntu 20.04 container, cloned the YOLOv5 repository inside the Ubuntu container, and started training using that YOLOv5 Git repository.

@eugen-ajechiloae-clearml

Hi @sriram-dsl! If you wish to modify train.py, you could use a dataset, or you could use the OutputModel class (https://clear.ml/docs/latest/docs/references/sdk/model_outputmodel) to upload your models if it fits your case.
Usually, datasets are used to store more than just models (like training datasets).
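For reference, a minimal sketch of the OutputModel approach (the weights path below is just the save_dir from the training log above; adjust to wherever YOLOv5 writes best.pt in your setup):

from clearml import Task, OutputModel

# Reuse the task created for this training run (or create one with Task.init).
task = Task.current_task()

# Register and upload the trained weights as an output model of the task.
output_model = OutputModel(task=task, framework="PyTorch")
output_model.update_weights(weights_filename="container/testing3/weights/best.pt")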
