Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"source" encoding for datasets opened from fsspec objects #8923

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

keewis
Copy link
Collaborator

@keewis keewis commented Apr 9, 2024

When opening files from path-like objects (str, pathlib.Path), the backend machinery (_dataset_from_backend_dataset) sets the "source" encoding. This is useful if we need the original path for additional processing, like writing to a similarly named file, or to extract additional metadata. This would be useful as well when using fsspec to open remote files.

In this PR, I'm extracting the path attribute that most fsspec objects have to set that value. I've considered using isinstance checks instead of the getattr-with-default, but the list of potential classes is too big to be practical (at least 4 classes just within fsspec itself).

If this sounds like a good idea, I'll update the documentation of the "source" encoding to mention this feature.

  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst

@max-sixty
Copy link
Collaborator

Without knowing much (I generally ds.reset_encoding()) it does sound like a good idea!

@Illviljan
Copy link
Contributor

Shouldn't _normalize_path or _find_absolute_paths be able to handle this?

@keewis
Copy link
Collaborator Author

keewis commented Apr 9, 2024

the main use case is indeed to extract additional data, which you'd do immediately after open_dataset (after which you could drop the encoding).

Shouldn't _normalize_path or _find_absolute_paths be able to handle this?

As far as I can tell, they only convert path-likes to string (which these objects are not, they are file-like, not path-like). Are you suggesting we should change that?

@dcherian
Copy link
Contributor

I think this is fine, but our long-term goal is to delete encoding so you might consider a different solution to your problem.

@keewis
Copy link
Collaborator Author

keewis commented Apr 23, 2024

my impression of that discussion was that we wanted to either return the encoding in a separate object, or somehow remove the encoding after the first operation (i.e. not carry it around). Either way would be fine with me, since I would still have access to it immediately after opening.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants