When read_package is supplied a path to a directory, try datapackage.{json|yaml|yml} #158

khusmann · 2023-10-12T20:41:06Z

A possible way to implement #157 . It sounds like you're plan is to let the idea simmer to see how other implementations approach it, but figured I'd submit what I have so you'd have it in your back pocket. Notes:

When the data package path, is an URL, it does not attempt to try the candidate datapackage resource filenames. I figured that with url redirects, etc, a URL without a ".{json|yaml|yml}$" could still exist (no 404) and be pointing to a valid resource file. And we can't really rely on content-type... so maybe better to not mess with. Another reason to not mess with urls is that this matches the behavior of frictionless-py. (Although it appears that data-cli from datahub will happily add /package.json to urls, as I mentioned earlier...)

In my (brief) testing, the frictionless-py cli will only try "datapackage.json" as a candidate file to load when pointed to a directory... it does not discover datapackage.yaml or datapackage.yml. If we wanted to match this behavior, it'd only require editing the autoload candidate list here.

…y loading datapackage.json (or .yaml or .yml) (frictionlessdata#157)

Refactor read_descriptor so that it reads easier

peterdesmet · 2024-03-25T15:54:34Z

I have simplified this PR and added tests.

When a directory rather than a file is provided, it will only look for datapackage.json (not yaml or yml)
I actually want this functionality for URLs, not only for local directories, so that the example posted in Automatically load "datapackage.json" when supplied a path to a folder #157 (https://datahub.io/core/exchange-rates) works.
This functionality at the level of read_package(), keeping read_descriptor() and check_path() simple. The former is also used for getting Table Schema.
The implementation is now very simple: if file ends with / assume a directory and add datapackage.json. Annoyingly, https://datahub.io/core/exchange-rates won't work, but https://datahub.io/core/exchange-rates/ will. I looked for some alternative implementations to determine if file is a directory or url (withoutdatapackage.json) but:

dir.name() doesn't work well for this (e.g. dirname("my_directory") returns .)
file.info()$isdir only works for local directories that exist.
I didn't want to look for a range of file names or extensions at the end, otherwise file = "datapackage.csv" would return "Can't find file at 'datapackage.csv/datapackage.json'"

Alternatives to support mydir/ as well as mydir are appreciated.

khusmann · 2024-03-25T19:23:17Z

The implementation is now very simple: if file ends with / assume a directory and add datapackage.json.

Hmm, I see how this is simple from an implementation perspective, but I think it's going to potentially confuse users. A trailing / is very subtle, especially for long paths / urls.

I think we should think about this as two separate behavior cases, local vs remote:

If local, I think we can have an easy implementation with file.info()$isdir: if the supplied path points to a dir, then append "datapackage.json".

If remote, all bets are off... even a url like "https://example.com/example_package/" could actually point to a datapackage.json file! The only way to know what you get is to make the request... That's why I didn't include completion on urls in my implementation, I thought it was probably better to have all web requests be explicit.

Alternatively, if we do want to support behavior like datahub for urls, I would suggest the following:

First try downloading the url. If we get a valid datapackage json, then we use that.
If we get something else, then we try url/datapackage.json.
If both url and url/datapackage.json do not point to valid datapackage json files, then error with Can't find datapackage at "url". This way we don't get messages like "Can't find datapackage at 'http://example.com/datapackage.csv/datapackage.json'"; it would instead read "Can't find datapackage at 'http://example.com/datapackage.csv'"

The downside of this approach is that it always makes a second web request with the first fails. Not pretty, but I don't think there's any other way of knowing what's on the other side without making the request.

peterdesmet · 2024-03-25T20:57:47Z

Hmm, I see how this is simple from an implementation perspective, but I think it's going to potentially confuse users.

Agree, which is why I'm ambivalent about this PR.

If we get something else [at the url]

Yeah, that is not great, since the function is downloading something the user didn't specifically asked for, so I agree with your "I thought it was probably better to have all web requests be explicit."

I'll put a pin in this PR and not include it in 1.1.0. As you suggested on Slack, there is likely a more prevalent use case in supporting zipped packages. At least those are explicit files again, not directories with potentially a file.

khusmann added 2 commits October 12, 2023 16:17

When read_package is supplied a path to a directory, automatically tr…

d757750

…y loading datapackage.json (or .yaml or .yml) (frictionlessdata#157)

fix backwards assertion

bfb3308

peterdesmet added function:read_package Function read_package() enhancement New feature or request labels Oct 27, 2023

peterdesmet added this to the 1.1.0 milestone Oct 27, 2023

peterdesmet added 6 commits March 25, 2024 13:57

Merge branch 'main' into pr/158

7a1574b

Solve merge and move new check_path() out of utils.R

ac8a6ab

Switch arguments in grepl

0f2d660

Drop use file_candidates in read_descriptor() and read_path()

f98c1f3

Refactor read_descriptor so that it reads easier

Use sub, not gsub

ea5c1d3

Assume datapackage.json if path ends with / + add tests

5bc1f1c

peterdesmet requested a review from PietrH March 25, 2024 15:54

Skip test if offline

b929a86

peterdesmet removed this from the 1.1.0 milestone Mar 25, 2024

peterdesmet removed the request for review from PietrH March 25, 2024 20:57

peterdesmet marked this pull request as draft March 25, 2024 20:58

khusmann mentioned this pull request Mar 25, 2024

Support loading datapackages from zip files #193

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

When read_package is supplied a path to a directory, try datapackage.{json|yaml|yml} #158

When read_package is supplied a path to a directory, try datapackage.{json|yaml|yml} #158

khusmann commented Oct 12, 2023

peterdesmet commented Mar 25, 2024

khusmann commented Mar 25, 2024

peterdesmet commented Mar 25, 2024

When read_package is supplied a path to a directory, try datapackage.{json|yaml|yml} #158

Are you sure you want to change the base?

When read_package is supplied a path to a directory, try datapackage.{json|yaml|yml} #158

Conversation

khusmann commented Oct 12, 2023

peterdesmet commented Mar 25, 2024

khusmann commented Mar 25, 2024

peterdesmet commented Mar 25, 2024