api.files_server without a port causes pipeline to fail #1264

Open
lastsecondsave opened this issue May 8, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@lastsecondsave

Describe the bug

I'm trying to run a pipeline with this step:

pipeline.add_function_step(
    name="some_work",
    task_type=TaskTypes.data_processing,
    function=some_work,
    function_kwargs={"x": ["y", "z"]},
)

The function itself is not relevant; it is never called. When an agent picks up the step, it fails with:

2024-05-07 16:13:07,837 - clearml.storage - WARNING - Failed getting object size: ValueError('Failed getting object :443/Project/.pipelines/Test/Test%20%252321.82408167353a45778399f48c543be2a5/artifacts/some_work.name/some_work.name.pkl (404): NOT FOUND')
2024-05-07 16:13:08,016 - clearml.storage - ERROR - Could not download https://files.clearml.xxxxx.net:443/Project/.pipelines/Test/Test%20%252321.82408167353a45778399f48c543be2a5/artifacts/some_work.name/some_work.name.pkl , err: Failed getting object :443/Project/.pipelines/Test/Test%20%252321.82408167353a45778399f48c543be2a5/artifacts/some_work.name/some_work.name.pkl (404): NOT FOUND 
Traceback (most recent call last):
  File "/root/.clearml/venvs-builds/3.10/code/some_work.py", line 34, in <module>
    kwargs[k] = parent_task.artifacts[artifact_name].get(deserialization_function=None)
  File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/clearml/binding/artifacts.py", line 171, in get
    local_file = self.get_local_copy(raise_on_error=True, force_download=force_download)
  File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/clearml/binding/artifacts.py", line 244, in get_local_copy
    raise ValueError(
ValueError: Could not retrieve a local copy of artifact some_work.name, failed downloading https://files.clearml.xxxxx.net:443/Project/.pipelines/Test/Test%20%252321.82408167353a45778399f48c543be2a5/artifacts/some_work.name/some_work.name.pkl
2024-05-07 18:13:24
Process failed, exit code 1

The URL mentioned is correct, and the file can be downloaded with a browser.

To reproduce

This is how ClearML is deployed in our environment:

api {
  web_server: https://clearml.xxxxx.net
  api_server: https://api.clearml.xxxxx.net
  files_server: https://files.clearml.xxxxx.net
}

The problem above disappears if I explicitly set the port for the files_server:

files_server: https://files.clearml.xxxxx.net:443

My wild guess is that here you create an object's name from the URL. This makes the port a part of the name (you can see it in the log). And here you reconstruct the URL, which will probably look like https://files.clearml.xxxxx.net/:443/.... That URL is what causes the 404.
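To illustrate the guess above, here is a minimal sketch of how stripping a port-less base URL from a stored URL could produce an object name that starts with the port. The helpers `split_object_name` and `rebuild_url` are hypothetical and are not taken from the ClearML source, and the artifact path is shortened for readability.

```python
# Hypothetical sketch of the suspected behaviour; NOT ClearML's actual code.

def split_object_name(url: str, base_url: str) -> str:
    """Strip the configured base URL prefix and return the rest as the object name."""
    if not url.startswith(base_url):
        raise ValueError(f"{url} does not belong to {base_url}")
    return url[len(base_url):]


def rebuild_url(base_url: str, object_name: str) -> str:
    """Join the base URL and the object name with exactly one slash."""
    return base_url.rstrip("/") + "/" + object_name.lstrip("/")


# URL as registered by the agent whose config includes the port (path shortened).
registered = "https://files.clearml.xxxxx.net:443/Project/artifacts/some_work.name.pkl"
# files_server as configured on the agent without the port.
configured = "https://files.clearml.xxxxx.net"

name = split_object_name(registered, configured)
print(name)
# :443/Project/artifacts/some_work.name.pkl

print(rebuild_url(configured, name))
# https://files.clearml.xxxxx.net/:443/Project/artifacts/some_work.name.pkl
```

Under these assumptions the object name begins with `:443/`, which matches the "Failed getting object :443/..." message in the log.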

Expected behaviour

URLs without ports should not cause any issues.

Environment

  • Server type: self hosted
  • ClearML SDK Version: 1.15.1
  • ClearML Server Version: 1.14.0-431
  • Python Version: 3.10
  • OS: Linux
@lastsecondsave lastsecondsave added the bug Something isn't working label May 8, 2024
@jkhenning
Member

Hi @lastsecondsave, I'm not sure I understand. The error you show mentions port 443, which means the port appears in the URL registered for this artifact (it's explicitly written in the DB). So the problem seems to be that the URLs you're trying to access include port 443, while the files_server setting does not.
This raises the question of how these URLs were created. The only way I can think of is that at some point the clearml.conf configuration file did contain files_server with the port, and the files were uploaded at that point...

@lastsecondsave
Author

They were created by another agent. Agent one has port 443 explicitly set, and the pipeline starts on it:

pipeline.start(queue="one")

The task is then scheduled on agent two, whose config has no port in it:

pipeline.add_function_step(
    execution_queue="two",
    ...)

My assumption was that the first agent had nothing to do with the problem, since it already had a "workaround". It seems that my debugging session yesterday was accidentally caused by whoever initially configured it.

Still, the presence or absence of the default ports should not cause anomalies.

@jkhenning
Member

> Still, the presence or absence of the default ports should not cause anomalies

I'll have to disagree on the last one. In general it's possible to have several different services on different ports, which is why the SDK uses the exact service endpoint to look up configured credentials. I'm not sure why different clients (agent, SDK) should have different endpoints defined (with or without port). You should simply decide on one and use it consistently.

@lastsecondsave
Author

It's not possible to have different services on https://example.com and https://example.com:443, right? The scheme part of the URL plays its role: https already implies port 443. Given that's how the whole internet works, nobody will expect this to cause any issues. And it's not like your error messages help in this case. At least check whether the addresses match exactly and, if they don't, report that as the error. My case may be a bit dumb, but your product should be foolproof.
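As a rough sketch of the kind of check being asked for, two endpoints can be compared after filling in the scheme's default port. This uses only the standard library; `normalize_endpoint` and `same_endpoint` are hypothetical names, and nothing here reflects how the ClearML SDK actually matches endpoints.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit

# Default ports implied by the scheme when the URL does not state one.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_endpoint(url: str) -> Tuple[str, str, Optional[int]]:
    """Return (scheme, host, port), filling in the scheme's default port if absent."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(scheme)
    return scheme, host, port


def same_endpoint(a: str, b: str) -> bool:
    """True when both URLs point at the same scheme, host and effective port."""
    return normalize_endpoint(a) == normalize_endpoint(b)


# With and without the default https port: the same endpoint.
print(same_endpoint("https://files.clearml.xxxxx.net",
                    "https://files.clearml.xxxxx.net:443"))   # True
# A non-default port is a different endpoint.
print(same_endpoint("https://files.clearml.xxxxx.net",
                    "https://files.clearml.xxxxx.net:8081"))  # False
```

A comparison like this keeps services on genuinely different ports distinct while treating an explicit default port as equivalent to an omitted one.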


2 participants