Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Behaviour of joinpath when the second path has protocol #213

Open
KuribohG opened this issue Apr 15, 2024 · 6 comments
Open

Behaviour of joinpath when the second path has protocol #213

KuribohG opened this issue Apr 15, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@KuribohG
Copy link

UPath('a').joinpath(UPath('s3:///b'))

This call will return PosixUPath('/b'), but

UPath('a').joinuri(UPath('s3:///b'))

will return S3Path('s3:///b').

Maybe in joinpath, we should return just the second path when the second path has protocol (like in joinuri)?

@ap--
Copy link
Collaborator

ap-- commented Apr 15, 2024

Hello @KuribohG

Thank you for opening the issue.

Could you explain your initial use case? The example you provide seems a bit unusual regarding two points:

  1. You're trying to join a local relative path (as returned by UPath without protocol) to an s3 path. What was your intent?
  2. Your s3 uri has an empty string as a bucket.

The solution here should be that (1) we should raise an exception if joinpath is used with paths of different protocols. (2) we should raise an exception for s3 uris with empty buckets (empty netloc)

Andreas

Extra notes:

s3fs raises ValueError: Attempt to open non key-like path: /b when a 'empty-bucket-path' is trying to be accessed. We should catch this early in UPath.

>>> import fsspec
>>> fsspec.open("s3:///b")
<OpenFile '/b'>
>>> x=_
>>> x.open()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/andreaspoehlmann/Development/loot/venv/lib/python3.11/site-packages/fsspec/core.py", line 135, in open
    return self.__enter__()
           ^^^^^^^^^^^^^^^^
  File "/Users/andreaspoehlmann/Development/loot/venv/lib/python3.11/site-packages/fsspec/core.py", line 103, in __enter__
    f = self.fs.open(self.path, mode=mode)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/andreaspoehlmann/Development/loot/venv/lib/python3.11/site-packages/fsspec/spec.py", line 1293, in open
    f = self._open(
        ^^^^^^^^^^^
  File "/Users/andreaspoehlmann/Development/loot/venv/lib/python3.11/site-packages/s3fs/core.py", line 685, in _open
    return S3File(
           ^^^^^^^
  File "/Users/andreaspoehlmann/Development/loot/venv/lib/python3.11/site-packages/s3fs/core.py", line 2152, in __init__
    raise ValueError("Attempt to open non key-like path: %s" % path)
ValueError: Attempt to open non key-like path: /b

@ap-- ap-- added the bug Something isn't working label Apr 15, 2024
@KuribohG
Copy link
Author

Using empty buckets for s3 uri is my mistake, but in this issue I am focusing on joining paths from different protocol.

In my case, I want to implement a dataset, which can be constructed by a parameter for (1) relative path, (2) absolute path, (3) path with protocols. Each dataset has a working directory, so I need to join the working dir path with this path parameter.

Does all the paths with protocols means somewhat an "absolute" path? Do we allow a path like s3:///../a/b/c? If not, joining two paths path1, path2 from different protocol may be easier, because we can just return path2 since it is an "absolute" uri. But if we do allow this kind of path, throwing an exception seems to be a meaningful behaviour.

@ap--
Copy link
Collaborator

ap-- commented Apr 16, 2024

In my case, I want to implement a dataset, which can be constructed by a parameter for (1) relative path, (2) absolute path, (3) path with protocols. Each dataset has a working directory, so I need to join the working dir path with this path parameter.

If I understand correctly, you want paths within your dataset to be pointing to relative locations, and exchange the root of the dataset?

rel_a = UPath("path/to/fileA.txt")
rel_b = UPath("path/to/somewhere/else/fileB.txt")

root_x = UPath("s3://mybucket/somepath")
root_y = UPath("file:///opt/someotherpath")

# and now
x_a = root_x.joinpath(rel_a)
x_b = root_x.joinpath(rel_b)

y_a = root_y.joinpath(rel_a)
y_b = root_y.joinpaht(rel_b)

You might also want to checkout https://intake.readthedocs.io/en/latest/ if you want to describe your datasets declaratively and load from different locations.

Does all the paths with protocols means somewhat an "absolute" path?

Yes. All UPath's (with the exception of PosixUPath and WindowsUPath) are absolute.

Do we allow a path like s3:///../a/b/c?

I will interpret this as s3://mybucket/../a/b/c and the answer is that this is currently undefined behavior. UPath("s3://mybucket/../a/b/c").resolve() will evaluate to UPath("s3://mybucket/a/b/c"), because mybucket/ is considered the path anchor and pathlib resolves /../a/b/c to /a/b/c

If not, joining two paths path1, path2 from different protocol may be easier, because we can just return path2 since it is an "absolute" uri.

I still think .joinpath should raise an error if the two protocols differ. I currently don't see what the reason for joining absolute paths of different protocols would be. (I can understand that use case if you want urljoin behavior, but that's what .joinuri is for)

@KuribohG
Copy link
Author

Sorry I didn't explain my case clearly enough.

We have an argument for specifying the dataset location (like using cmd args). When we run the script, we need to join the workdir and this argument.

Say, the workdir is /home/username/datasets. User may specify the location as some/relative/path/data, or s3:///path/with/protocol/data. We want to get the final location of the dataset, so I used joinpath in my code. So the expectation is

UPath(`/home/username/datasets`).joinpath(UPath('some/relative/path/data')) -> UPath('/home/username/datasets/some/relative/path/data')
UPath(`/home/username/datasets`).joinpath(UPath('s3:///path/with/protocol/data')) -> UPath('s3:///path/with/protocol/data')

This may be implemented by UPath.resolve or UPath.absolute, but will be unavailable if workdir is specified by some environment variable. Also, joinuri is seem not for this case. I currently just check whether the user arg has a protocol, and if so use that arg as the final location.

@ap--
Copy link
Collaborator

ap-- commented Apr 17, 2024

Thank you for the clarification. If the workdir is the current working directory, you can just do:

def cli(arg):
    pth = UPath(arg).absolute()

This will correctly prepend the cwd if the path is a relative local path, and because all other UPaths are absolute, will just return for example the S3 path if the user provides that.

In the case that workdir is also configurable by the user you could do:

def cli(arg, workdir):
    pth = UPath(arg)
    if not pth.is_absolute():
         pth = UPath(workdir).joinpath(pth)

Let me know if that solves your issue,
Cheers,
Andreas

@KuribohG
Copy link
Author

Thanks, this will solve the problem in my use case.

I think joinpath should be clarified for joining paths from different protocol, and I agree that an exception could be thrown.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants