Provide conversion utility that could create a SIMPLE_CSV format file from the more complete DWCA format download #121

Open
nickynicolson opened this issue Nov 15, 2023 · 8 comments

Comments

@nickynicolson

Assuming that a SIMPLE_CSV format download is a subset of the DWCA download - as per the GBIF download FAQ entry:

CSV: Tab delimited CSV. Only contains the data after GBIF interpretation. No multimedia included. More information about CSV
Darwin Core Archive: The Darwin Core Archive (DwC-A) contains both the original data as publisher provided it and the GBIF interpretation. Links (but not files) to multimedia included. More information about DwC-A

... then it would be good to have a utility that could create a SIMPLE_CSV format file from the larger DWCA.

Rationale: a user develops a script that needs only minimal data and is therefore designed to operate on SIMPLE_CSV format input. Another user of the script has a pre-existing DWCA format download and wants to use it as input to the script (without having to create another download), so they need a way to slim the DWCA down to the SIMPLE_CSV set of fields.

Is there a GBIF metadata service that returns the field names used in each of these download formats and could help here? If so, pygbif could provide access to it, along with a column rename mapping (if required).

@jmbarrios

I believe such a thing is outside the scope of this package. There is also a Python package for manipulating a DWC Archive, python-dwca-reader.
That package is developed to work with the GBIF DWCA downloads.

@nickynicolson
Author

I believe such a thing is outside the scope of this package. There is also a Python package for manipulating a DWC Archive, python-dwca-reader. That package is developed to work with the GBIF DWCA downloads.

The default download format for occurrences appears to be SIMPLE_CSV:

def download(queries, format = "SIMPLE_CSV", user=None, pwd=None, email=None, pred_type="and"):

@jmbarrios

Indeed, the current default parameter is SIMPLE_CSV, and the supported formats are listed here.

A major challenge when working with a DWCA is that the format is not always consistent. Often there is no common way to map the extra tables onto a single table. For instance, for a multimedia table you can expect the id field to be present in a multimedia.txt file, but it can have a one-to-many relation with an occurrence. In such cases, what would be a common mapping strategy from the multimedia table to a plain table?
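To make the one-to-many concern concrete, here is a minimal sketch of one possible flattening strategy: joining all multimedia values for an occurrence into a single delimited string. It is only an illustration, not something pygbif or GBIF provides. The function name collapse_multimedia, the key column gbifID, the value column identifier, and the "|" separator are all assumptions made for this example.

```python
import csv
import io
from collections import defaultdict


def collapse_multimedia(multimedia_file, key="gbifID", value="identifier", sep="|"):
    """Map each occurrence key to its multimedia values joined into one string.

    Hypothetical helper: the column names and separator are assumptions.
    """
    grouped = defaultdict(list)
    reader = csv.DictReader(multimedia_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row[value]:  # skip rows with an empty value
            grouped[row[key]].append(row[value])
    return {k: sep.join(v) for k, v in grouped.items()}


# Tiny in-memory stand-in for a multimedia.txt file (made-up rows).
sample = io.StringIO(
    "gbifID\tidentifier\n"
    "1\thttps://example.org/a.jpg\n"
    "1\thttps://example.org/b.jpg\n"
    "2\thttps://example.org/c.jpg\n"
)
media = collapse_multimedia(sample)
```

Other strategies (keeping only the first item, or a count per occurrence) are just as defensible, which is the point being made: there is no single obvious mapping.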

I believe that the SIMPLE_CSV format is focused only on sharing occurrence data, without additional information.

@nickynicolson
Author

My request is about occurrences only and relates only to DWCA downloads originating from GBIF (pygbif is about facilitating access to the GBIF API from Python).
As GBIF is able to represent occurrence data in both DWCA and SIMPLE_CSV formats I'd like to be able to convert from DWCA to SIMPLE_CSV.
Out of interest, are you speaking for GBIF, @jmbarrios?

@jmbarrios

Out of interest, are you speaking for GBIF, @jmbarrios?

No, I'm not, and I am not associated with GBIF.

@MattBlissett
Member

Hi Nicky,

There's an experimental (not necessarily stable, not documented) API for the columns returned in GBIF downloads:

The SIMPLE_CSV format should be a Simple Darwin Core-compatible file, see §6.1 where these files can be shared without a meta.xml description. It's also a subset of the occurrence file in a DWCA format download, and the column names should be identical — if a CSV reader is referencing columns by name it should work fine with either file.
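As a rough illustration of reading by column name, the sketch below keeps only a chosen set of columns from a tab-delimited occurrence file. It uses only the Python standard library. The function name slim_occurrences and the short WANTED_COLUMNS list are hypothetical (the real SIMPLE_CSV column list is longer), and it assumes the file is tab-delimited with no quoting.

```python
import csv
import io

# Hypothetical subset of column names; the real SIMPLE_CSV list is longer.
WANTED_COLUMNS = ["gbifID", "scientificName", "decimalLatitude", "decimalLongitude"]


def slim_occurrences(source, wanted=WANTED_COLUMNS):
    """Yield dicts holding only the wanted columns from a tab-delimited file object."""
    reader = csv.DictReader(source, delimiter="\t", quoting=csv.QUOTE_NONE)
    missing = [name for name in wanted if name not in (reader.fieldnames or [])]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    for row in reader:
        yield {name: row[name] for name in wanted}


# Tiny in-memory stand-in for a DWCA occurrence file with an extra column.
sample = io.StringIO(
    "gbifID\tbasisOfRecord\tscientificName\tdecimalLatitude\tdecimalLongitude\n"
    "1\tPRESERVED_SPECIMEN\tQuercus robur L.\t51.48\t-0.29\n"
)
rows = list(slim_occurrences(sample))
```

Because columns are selected by name, the same function should accept either file, provided every requested column is present. A missing column raises an error rather than silently producing empty values.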

@CecSve is maintaining pygbif, but is on leave until the end of January.

@nickynicolson
Author

Thanks @MattBlissett - if / when this becomes stable, it might be good to consider making it available from the pygbif library.
I did find two fields in SIMPLE_CSV that are not in DWCA: publishingOrgKey and verbatimScientificNameAuthorship.

@MattBlissett
Member

My mistake, verbatimScientificNameAuthorship should be scientificNameAuthorship from verbatim.txt. publishingOrgKey would need to be retrieved using the API.
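A minimal sketch of the first mapping, assuming the occurrence file and verbatim.txt are both tab-delimited and share a gbifID key column (an assumption, as is the function name add_verbatim_authorship). It copies scientificNameAuthorship from the verbatim rows into a new verbatimScientificNameAuthorship column. publishingOrgKey is left out here because it would need an API lookup.

```python
import csv
import io


def add_verbatim_authorship(occurrence_file, verbatim_file, key="gbifID"):
    """Return occurrence rows with verbatimScientificNameAuthorship added.

    Hypothetical helper: assumes both files share the `key` column.
    """
    opts = {"delimiter": "\t", "quoting": csv.QUOTE_NONE}
    # Build a lookup from record key to the verbatim authorship string.
    authorship = {
        row[key]: row.get("scientificNameAuthorship") or ""
        for row in csv.DictReader(verbatim_file, **opts)
    }
    merged = []
    for row in csv.DictReader(occurrence_file, **opts):
        # Records with no verbatim match get an empty string.
        row["verbatimScientificNameAuthorship"] = authorship.get(row[key], "")
        merged.append(row)
    return merged


# Made-up rows standing in for the two files in a DWCA download.
occurrence = io.StringIO(
    "gbifID\tscientificName\n1\tQuercus robur L.\n2\tFagus sylvatica L.\n"
)
verbatim = io.StringIO(
    "gbifID\tscientificName\tscientificNameAuthorship\n1\tQuercus robur\tL.\n"
)
merged = add_verbatim_authorship(occurrence, verbatim)
```

The lookup holds one string per verbatim record in memory, which may not suit very large downloads; a streaming join on sorted files would be an alternative.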


3 participants