Provide conversion utility that could create a SIMPLE_CSV format file from the more complete DWCA format download #121

nickynicolson · 2023-11-15T15:23:35Z

Assuming that a SIMPLE_CSV format download is a subset of the DWCA download - as per the GBIF download FAQ entry:

CSV: Tab delimited CSV. Only contains the data after GBIF interpretation. No multimedia included. More information about CSV
Darwin Core Archive: The Darwin Core Archive (DwC-A) contains both the original data as publisher provided it and the GBIF interpretation. Links (but not files) to multimedia included. More information about DwC-A

... then it would be good to have a utility that could create a SIMPLE_CSV format file from the larger DWCA.

Rationale: a user develops a script that needs only minimal data and is therefore designed to operate on SIMPLE_CSV format input. Another user of the script has a pre-existing DWCA format download and wants to use it as input to the script (without having to create another download), so they need a way to slim the DWCA down to the SIMPLE_CSV set of fields.

Is there a GBIF metadata service that returns the fieldnames used in each of these download formats which could help? If so, pygbif could provide access to this and a column rename mapping (if required).

jmbarrios · 2023-11-15T15:45:42Z

I believe such a thing is outside the scope of this package. Also, there is a Python package for manipulating a DwC Archive: python-dwca-reader.
That package is designed to work with the GBIF DWCA downloads.

nickynicolson · 2023-11-15T15:59:38Z

> I believe such a thing is outside the scope of this package. Also, there is a Python package for manipulating a DwC Archive: python-dwca-reader. That package is designed to work with the GBIF DWCA downloads.

The default download format for occurrences appears to be SIMPLE_CSV:

pygbif/pygbif/occurrences/download.py, line 69 in 9590fcf:

    def download(queries, format = "SIMPLE_CSV", user=None, pwd=None, email=None, pred_type="and"):

jmbarrios · 2023-11-15T16:34:22Z

Indeed, the current default parameter is SIMPLE_CSV, and the supported formats are listed here.

A major challenge when working with a DWCA is that this format is not always consistent. Many times, there isn't a common way to map extra tables to one table. For instance, in the case of a multimedia table, you can expect the id field to be present in a multimedia.txt file. However, it could have a one-to-many relation with an occurrence. In such cases, what would be a common mapping strategy from the multimedia table to a plain table?

I believe that the SIMPLE_CSV format is focused only on sharing occurrence data, without additional information.

nickynicolson · 2023-11-15T16:42:30Z

My request is about occurrences only and relates only to DWCA downloads originating from GBIF (pygbif is about facilitating access to the GBIF API from Python).
As GBIF is able to represent occurrence data in both DWCA and SIMPLE_CSV formats I'd like to be able to convert from DWCA to SIMPLE_CSV.
Out of interest, are you speaking for GBIF @jmbarrios ?

jmbarrios · 2023-11-15T17:05:06Z

> Out of interest, are you speaking for GBIF @jmbarrios ?

No, I'm not. Also I am not associated with GBIF.

MattBlissett · 2023-11-15T17:23:34Z

Hi Nicky,

There's an experimental (not necessarily stable, not documented) API for the columns returned in GBIF downloads:

The SIMPLE_CSV format should be a Simple Darwin Core-compatible file, see §6.1 where these files can be shared without a meta.xml description. It's also a subset of the occurrence file in a DWCA format download, and the column names should be identical — if a CSV reader is referencing columns by name it should work fine with either file.

@CecSve is maintaining pygbif, but is on leave until the end of January.
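The point above about referencing columns by name can be sketched as follows. This is a minimal illustration, not part of pygbif: it assumes the occurrence file inside the DWCA zip is a tab-delimited file named occurrence.txt, and `SIMPLE_COLUMNS`, `dwca_to_simple_rows` and `write_simple_csv` are hypothetical names. The column list is a small illustrative subset, not the authoritative SIMPLE_CSV field list.

```python
import csv
import io
import zipfile

# Hypothetical, illustrative subset of SIMPLE_CSV column names.
# The authoritative list is not given in this thread.
SIMPLE_COLUMNS = ["gbifID", "datasetKey", "scientificName", "countryCode"]


def dwca_to_simple_rows(dwca_path, columns=SIMPLE_COLUMNS, member="occurrence.txt"):
    """Yield one dict per occurrence, keeping only the named columns.

    Assumes `member` is a tab-delimited file with a header row.
    """
    with zipfile.ZipFile(dwca_path) as archive:
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            reader = csv.DictReader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
            for row in reader:
                # Columns missing from the archive are written as empty strings.
                yield {name: row.get(name) or "" for name in columns}


def write_simple_csv(rows, out_path, columns=SIMPLE_COLUMNS):
    """Write the rows as a tab-delimited file with a header line."""
    with open(out_path, "w", encoding="utf-8", newline="") as handle:
        handle.write("\t".join(columns) + "\n")
        for row in rows:
            handle.write("\t".join(row[name] for name in columns) + "\n")
```

Because the columns are selected by name, the same downstream script could read either the slimmed file or an original SIMPLE_CSV download, provided the names really are identical.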

nickynicolson · 2023-11-16T17:35:59Z

Thanks @MattBlissett - if / when this becomes stable it might be good to consider making it available from the pygbif library.
I did find two fields in SIMPLE_CSV that are not in DWCA: publishingOrgKey and verbatimScientificNameAuthorship

MattBlissett · 2023-11-17T12:22:22Z

My mistake, verbatimScientificNameAuthorship should be scientificNameAuthorship from verbatim.txt. publishingOrgKey would need to be retrieved using the API.
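A rough sketch of how those two extra fields might be filled in, under stated assumptions: rows from verbatim.txt are taken to carry a gbifID column that matches the occurrence rows, and `lookup_publishing_org` is a hypothetical callable supplied by the caller (for example, one that queries the GBIF API for a dataset's publishing organization). The function name `add_missing_fields` is also hypothetical.

```python
def add_missing_fields(simple_rows, verbatim_rows, lookup_publishing_org):
    """Yield copies of simple_rows with the two extra SIMPLE_CSV fields added.

    simple_rows: dicts with at least "gbifID" and "datasetKey".
    verbatim_rows: dicts from verbatim.txt (assumed to include "gbifID").
    lookup_publishing_org: hypothetical callable, dataset key -> org key.
    """
    # Index the verbatim authorship by gbifID for a simple one-to-one join.
    authorship = {
        v["gbifID"]: v.get("scientificNameAuthorship") or ""
        for v in verbatim_rows
    }
    org_cache = {}  # avoid repeating the lookup for every row of a dataset
    for row in simple_rows:
        enriched = dict(row)
        enriched["verbatimScientificNameAuthorship"] = authorship.get(row["gbifID"], "")
        key = row.get("datasetKey", "")
        if key not in org_cache:
            org_cache[key] = lookup_publishing_org(key) if key else ""
        enriched["publishingOrgKey"] = org_cache[key]
        yield enriched
```

Caching by dataset key keeps the number of lookups proportional to the number of datasets rather than the number of occurrence rows.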
