Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

When read_package is supplied a path to a directory, try datapackage.{json|yaml|yml} #158

Draft
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

khusmann
Copy link
Contributor

A possible way to implement #157 . It sounds like you're plan is to let the idea simmer to see how other implementations approach it, but figured I'd submit what I have so you'd have it in your back pocket. Notes:

When the data package path, is an URL, it does not attempt to try the candidate datapackage resource filenames. I figured that with url redirects, etc, a URL without a ".{json|yaml|yml}$" could still exist (no 404) and be pointing to a valid resource file. And we can't really rely on content-type... so maybe better to not mess with. Another reason to not mess with urls is that this matches the behavior of frictionless-py. (Although it appears that data-cli from datahub will happily add /package.json to urls, as I mentioned earlier...)

In my (brief) testing, the frictionless-py cli will only try "datapackage.json" as a candidate file to load when pointed to a directory... it does not discover datapackage.yaml or datapackage.yml. If we wanted to match this behavior, it'd only require editing the autoload candidate list here.

@peterdesmet peterdesmet added function:read_package Function read_package() enhancement New feature or request labels Oct 27, 2023
@peterdesmet peterdesmet added this to the 1.1.0 milestone Oct 27, 2023
@peterdesmet
Copy link
Member

I have simplified this PR and added tests.

  1. When a directory rather than a file is provided, it will only look for datapackage.json (not yaml or yml)
  2. I actually want this functionality for URLs, not only for local directories, so that the example posted in Automatically load "datapackage.json" when supplied a path to a folder #157 (https://datahub.io/core/exchange-rates) works.
  3. This functionality at the level of read_package(), keeping read_descriptor() and check_path() simple. The former is also used for getting Table Schema.
  4. The implementation is now very simple: if file ends with / assume a directory and add datapackage.json. Annoyingly, https://datahub.io/core/exchange-rates won't work, but https://datahub.io/core/exchange-rates/ will. I looked for some alternative implementations to determine if file is a directory or url (withoutdatapackage.json) but:
  • dir.name() doesn't work well for this (e.g. dirname("my_directory") returns .)
  • file.info()$isdir only works for local directories that exist.
  • I didn't want to look for a range of file names or extensions at the end, otherwise file = "datapackage.csv" would return "Can't find file at 'datapackage.csv/datapackage.json'"

Alternatives to support mydir/ as well as mydir are appreciated.

@peterdesmet peterdesmet requested a review from PietrH March 25, 2024 15:54
@khusmann
Copy link
Contributor Author

The implementation is now very simple: if file ends with / assume a directory and add datapackage.json.

Hmm, I see how this is simple from an implementation perspective, but I think it's going to potentially confuse users. A trailing / is very subtle, especially for long paths / urls.

I think we should think about this as two separate behavior cases, local vs remote:

If local, I think we can have an easy implementation with file.info()$isdir: if the supplied path points to a dir, then append "datapackage.json".

If remote, all bets are off... even a url like "https://example.com/example_package/" could actually point to a datapackage.json file! The only way to know what you get is to make the request... That's why I didn't include completion on urls in my implementation, I thought it was probably better to have all web requests be explicit.

Alternatively, if we do want to support behavior like datahub for urls, I would suggest the following:

  1. First try downloading the url. If we get a valid datapackage json, then we use that.
  2. If we get something else, then we try url/datapackage.json.
  3. If both url and url/datapackage.json do not point to valid datapackage json files, then error with Can't find datapackage at "url". This way we don't get messages like "Can't find datapackage at 'http://example.com/datapackage.csv/datapackage.json'"; it would instead read "Can't find datapackage at 'http://example.com/datapackage.csv'"

The downside of this approach is that it always makes a second web request with the first fails. Not pretty, but I don't think there's any other way of knowing what's on the other side without making the request.

@peterdesmet peterdesmet removed this from the 1.1.0 milestone Mar 25, 2024
@peterdesmet
Copy link
Member

Hmm, I see how this is simple from an implementation perspective, but I think it's going to potentially confuse users.

Agree, which is why I'm ambivalent about this PR.

  1. If we get something else [at the url]

Yeah, that is not great, since the function is downloading something the user didn't specifically asked for, so I agree with your "I thought it was probably better to have all web requests be explicit."

I'll put a pin in this PR and not include it in 1.1.0. As you suggested on Slack, there is likely a more prevalent use case in supporting zipped packages. At least those are explicit files again, not directories with potentially a file.

@peterdesmet peterdesmet removed the request for review from PietrH March 25, 2024 20:57
@peterdesmet peterdesmet marked this pull request as draft March 25, 2024 20:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request function:read_package Function read_package()
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants