can we formalize the file/dataset relationship? #108

Open
mdsumner opened this issue Jan 28, 2021 · 1 comment

Comments

@mdsumner
Member

Some notes. There are several concepts in play:

A given bowerbird source data set has an id, associated with a name, description, source URL/s, and possibly filters that specify which actual files or paths to keep or ignore.

In a simple situation, that id might be used to isolate a particular data set. But a given set of files can contain more than one data set of interest, or include files that are not normally of interest but are occasionally important (for detailed usage, citation, or checking purposes).

An id may be used to find the right paths in the file tree to explore, but it won't identify exactly which files to find - they might, for example, be multiple files extracted from an archive.

A concrete example: the id "10.5067/U8C09DWVX9LM" relates directly to the source URL "ftp://sidads.colorado.edu/pub/DATASETS/nsidc0081_nrt_nasateam_seaice/", which contains sea ice concentration data. There are two separate data sets under it, one for the northern and one for the southern hemisphere, so an intention to find only the southern files does not match the single id. The actual location in the file system is

"./PUBLIC/raad/data/sidads.colorado.edu/" which is the address of the source with the "ftp://" part removed.
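To make the URL-to-path mapping concrete, here is a minimal Python sketch of the prefix removal described above. It is an illustration only and not bowerbird's implementation (bowerbird is an R package). The function name `source_url_to_local_root` and the `data_root` argument are hypothetical.

```python
import posixpath
from urllib.parse import urlsplit


def source_url_to_local_root(source_url: str, data_root: str) -> str:
    """Map a source URL to its assumed mirror directory under data_root.

    Assumption: the local tree is the URL with the scheme ("ftp://")
    removed, i.e. host name followed by the remote path.
    """
    parts = urlsplit(source_url)
    # Drop the scheme; keep host + path, without leading/trailing slashes.
    relative = (parts.netloc + parts.path).strip("/")
    return posixpath.join(data_root, relative)


local_root = source_url_to_local_root(
    "ftp://sidads.colorado.edu/pub/DATASETS/nsidc0081_nrt_nasateam_seaice/",
    "./PUBLIC/raad/data",
)
print(local_root)
```

The result only narrows the search to a directory; it says nothing yet about which files under that directory belong to a given data set.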

This tree includes paths like "DATASETS/nsidc0051_gsfc_nasateam_seaice/final-gsfc/north/daily/1978/north/" which contain the actual data in ".bin" files, with the hemisphere distinguished by the "north/" or "south/" path component. Another path, "DATASETS/seaice/polar-stereo/tools/", holds auxiliary grid and coordinate information in ".msk" or ".dat" files.
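As a rough sketch of what "differentiating by path" could look like, the Python snippet below picks out data files for one hemisphere from a list of relative paths. The file names in `example_paths` are made up for illustration (only the directory names come from the notes above), and `hemisphere_of` and `select_files` are hypothetical helpers, not part of bowerbird or raadfiles.

```python
from pathlib import PurePosixPath


def hemisphere_of(path: str):
    """Return "north" or "south" if that directory name occurs in the path."""
    # Only directory components count, so a file name containing the
    # word "north" does not trigger a match.
    dirs = PurePosixPath(path).parts[:-1]
    for hemi in ("north", "south"):
        if hemi in dirs:
            return hemi
    return None


def select_files(paths, hemisphere, suffix=".bin"):
    """Keep files with the given suffix that sit under the given hemisphere."""
    return [p for p in paths
            if p.endswith(suffix) and hemisphere_of(p) == hemisphere]


# Hypothetical file names; the directory layout follows the notes above.
example_paths = [
    "DATASETS/nsidc0051_gsfc_nasateam_seaice/final-gsfc/north/daily/1978/example_n.bin",
    "DATASETS/nsidc0051_gsfc_nasateam_seaice/final-gsfc/south/daily/1978/example_s.bin",
    "DATASETS/seaice/polar-stereo/tools/example_grid.msk",
]

south_files = select_files(example_paths, "south")
```

The auxiliary ".msk" file is excluded by both tests here, which matches the point that such files live in the same tree but are not part of the data set proper.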

So we can't have a clean one-to-one relationship between the files for a data set and the source id used by bowerbird. The bowerbird source is really a parent. We could use that parent as follows:

  1. the id is a parent, the "getter" for the files of interest
  2. the source URLs are processed into file-system paths for file identification (remove the prefix, same as is done during download)
  3. a data set adds file filters, applied to the information from 1 and 2
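The three points above can be sketched as a small data structure. This is one possible shape, written in Python purely for illustration; the `DataSet` class, its fields, and the example file names are hypothetical, and nothing here reflects the actual bowerbird or raadfiles code.

```python
import re
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class DataSet:
    name: str            # identity of the data set itself
    parent_id: str       # 1. id of the parent bowerbird source
    source_urls: tuple   # 2. source URLs of that parent
    include: tuple = ()  # 3. regex filters; every pattern must match

    def roots(self):
        """Source URLs with the scheme removed (host + path)."""
        return [(urlsplit(u).netloc + urlsplit(u).path).strip("/")
                for u in self.source_urls]

    def files(self, catalogue):
        """Filter a catalogue of paths (relative to the data root)."""
        roots = self.roots()
        return [p for p in catalogue
                if any(p.startswith(r + "/") for r in roots)
                and all(re.search(pat, p) for pat in self.include)]


url = "ftp://sidads.colorado.edu/pub/DATASETS/nsidc0081_nrt_nasateam_seaice/"
south = DataSet("nrt_seaice_south", "10.5067/U8C09DWVX9LM", (url,),
                include=(r"/south/", r"\.bin$"))
north = DataSet("nrt_seaice_north", "10.5067/U8C09DWVX9LM", (url,),
                include=(r"/north/", r"\.bin$"))

# Hypothetical catalogue entries, relative to the data root.
base = "sidads.colorado.edu/pub/DATASETS/nsidc0081_nrt_nasateam_seaice"
catalogue = [
    base + "/south/example_s.bin",
    base + "/north/example_n.bin",
    "sidads.colorado.edu/pub/DATASETS/other/south/example_other.bin",
]
```

Here two data sets share one parent id and differ only in their filters, so each has its own name and definition while still pointing back to the bowerbird source.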

Currently bowerbird is the source of 1 and 2, and raadfiles of 3. A data set has no identity at the moment beyond the name of the filename-getter (and its arguments). raadtools then provides a read function that uses that filename-getter.

@mdsumner
Member Author

related to #55
