Standardize data.frame's across data sources #38

Open
sckott opened this issue Sep 11, 2015 · 10 comments

Comments

@sckott
Contributor

sckott commented Sep 11, 2015

Right now, we haven't thought about the outputs of each function. I believe all are data.frames. I'll look at what each one currently outputs and what is shared among them, and see what standard format we could use that would also make combining outputs easier.

@sckott
Contributor Author

sckott commented Sep 11, 2015

cc @hlapp @xu-hong

@hlapp

hlapp commented Sep 30, 2015

@sckott any progress or conclusions on this yet? @xu-hong is about at that step (leaving the issues with XML namespaces aside). Suggestions as to additional decorations that should be added to the data.frame object returned by RNeXML::get_characters()?

@sckott
Contributor Author

sckott commented Sep 30, 2015

@hlapp sorry, hadn't looked at this yet. Will do so today

@sckott
Contributor Author

sckott commented Sep 30, 2015

Rundown of the data objects each function currently outputs:

function              output
betydb_trait          data.frame
betydb_search         data.frame
betydb_citation       data.frame
betydb_site           data.frame
betydb_specie         data.frame
birdlife_habitat      data.frame
birdlife_threats      data.frame
coral_locations       data.frame
coral_methodologies   data.frame
coral_resources       data.frame
coral_species         data.frame
coral_taxa            data.frame
coral_traits          data.frame
eol_invasive_         data.frame
fe_native             list
g_invasive            data.frame
is_native             data.frame
leda                  data.frame
ncbi_byid             data.frame
ncbi_byname           data.frame/named list of data.frames
ncbi_searcher         data.frame/named list of data.frames
traitbank             list
  • The betydb_*() functions that returned lists could, I think, easily be made to return data.frames. I'll check. DONE
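
For the functions listed above as returning a named list of data.frames, one possible way to get a single data.frame is sketched below. This is only an illustration in base R: `bind_named_list()`, the `query` column, and the example columns `record_id` and `seq_length` are hypothetical and not part of the package, and the sketch assumes every list element has the same columns.

```r
# Hypothetical helper (not part of the package): collapse a named list of
# data.frames into one data.frame, keeping each list name in a new column.
bind_named_list <- function(x, id_col = "query") {
  # Already a data.frame: nothing to do
  if (is.data.frame(x)) return(x)
  pieces <- Map(function(df, nm) {
    # Repeat the list name once per row so zero-row elements also work
    label <- data.frame(rep(nm, nrow(df)), stringsAsFactors = FALSE)
    cbind(setNames(label, id_col), df)
  }, x, names(x))
  # Assumes all elements share the same column names
  out <- do.call(rbind, unname(pieces))
  rownames(out) <- NULL
  out
}

# Made-up example data, for illustration only
res <- list(
  "Quercus robur"  = data.frame(record_id = c(1, 2), seq_length = c(500, 620)),
  "Pinus contorta" = data.frame(record_id = 3, seq_length = 710)
)
flat <- bind_named_list(res)
```

Keeping the list names in a column means the information about which query produced which rows is not lost when the list is flattened.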

@sckott
Contributor Author

sckott commented Sep 30, 2015

Common fields among functions that could be standardized:

  • id - identifier for the record/taxon/etc. - the meaning of this id could vary between providers, however
  • date - date collected/updated, can be more than one of these
  • latitude - latitude, if the record is spatially explicit and the value is available
  • longitude - longitude, if the record is spatially explicit and the value is available
  • name - taxonomic name - combine any separate fields to make this one (not done yet), could be additional name columns

The above aren't real columns in the outputs yet, but they are the ones I think could be standard across most of the data sources in this package.

Part of what makes this hard is that we have a diverse set of data sources, from morphological trait data, to nativity status, to molecular data.

I think a way forward could be to provide a suite of functions that apply a set of transformations to the data to standardize column names, etc., so that outputs can be easily combined across taxa and data sources, at least across the standard fields. Other fields could be included as additional columns at the end.
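
A minimal sketch of what one such transformation function might look like, in base R. Everything here is an assumption for illustration: `standardize_columns()`, `combine_standard()`, the `mapping` argument, and the provider column names in the example data are hypothetical, not existing functions or real outputs of the package.

```r
# Standard fields proposed in this thread
standard_fields <- c("id", "date", "latitude", "longitude", "name")

# Hypothetical helper: rename provider-specific columns to the standard
# fields, add any missing standard fields as NA, and put the standard
# fields first with all other columns after them.
# mapping: named character vector, names = standard field,
#          values = the provider's column name.
standardize_columns <- function(df, mapping) {
  present <- mapping[mapping %in% names(df)]
  names(df)[match(present, names(df))] <- names(present)
  for (field in setdiff(standard_fields, names(df))) {
    df[[field]] <- rep(NA, nrow(df))
  }
  extras <- setdiff(names(df), standard_fields)
  df[, c(standard_fields, extras), drop = FALSE]
}

# Hypothetical helper: row-bind several standardized outputs,
# keeping only the standard fields they all share.
combine_standard <- function(...) {
  dfs <- lapply(list(...), function(d) d[, standard_fields, drop = FALSE])
  do.call(rbind, dfs)
}

# Made-up data standing in for two different providers
a <- data.frame(taxon_id = c(1, 2),
                sciname = c("Quercus robur", "Pinus contorta"),
                lat = c(51.5, 44.1),
                trait = c("height", "height"),
                stringsAsFactors = FALSE)
b <- data.frame(species = "Acropora tenuis", record = 10, depth_m = 5,
                stringsAsFactors = FALSE)

a_std <- standardize_columns(a, c(id = "taxon_id", name = "sciname", latitude = "lat"))
b_std <- standardize_columns(b, c(id = "record", name = "species"))
combined <- combine_standard(a_std, b_std)
```

In this sketch the provider-specific columns (`trait`, `depth_m`) stay at the end of each standardized data.frame, and only the combined output is restricted to the standard fields.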

@hlapp

hlapp commented Sep 30, 2015

> Common fields among functions that could be standardized:

You mean columns among data frames?

@sckott
Contributor Author

sckott commented Sep 30, 2015

yes

@sckott
Contributor Author

sckott commented Sep 30, 2015

> Suggestions as to additional decorations that should be added to the data.frame

@hlapp I don't know what's available to add

@hlapp

hlapp commented Oct 7, 2015

> name - taxonomic name - combine any separate fields to make this one (not done yet), could be additional name columns

@sckott do you mean the classification, or family and genus for taxa that are species?

@sckott
Contributor Author

sckott commented Oct 7, 2015

@hlapp I mean name would ideally be genus + epithet + any subspecific epithets, OR all of that + authority.

I favor leaving authority off the name and having it in a separate column, if provided.

If the data record is only identified down to family, for example, then I don't know what the best practice is. Perhaps we'd leave name blank.
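
The naming rule described here could be sketched roughly as follows, in base R. The function `build_name()` and the column names `genus`, `epithet`, `infraspecific`, and `authority` are assumptions for illustration only; providers may split or label these fields differently, and the example records are made up for the demonstration.

```r
# Hypothetical helper: build name as genus + epithet + optional
# infraspecific epithet. Authority is deliberately not included; it stays
# in its own column. Records without both genus and epithet (e.g. ones
# identified only to family) get NA, i.e. name is left blank.
build_name <- function(genus, epithet, infraspecific = NA_character_) {
  infraspecific <- rep_len(infraspecific, length(genus))
  name <- paste(genus, epithet)
  name <- ifelse(!is.na(infraspecific), paste(name, infraspecific), name)
  name[is.na(genus) | is.na(epithet)] <- NA_character_
  name
}

# Illustrative records; the third is identified only to family
records <- data.frame(
  family        = c("Fagaceae", "Canidae", "Fagaceae"),
  genus         = c("Quercus", "Canis", NA),
  epithet       = c("robur", "lupus", NA),
  infraspecific = c(NA, "familiaris", NA),
  authority     = c("L.", "Linnaeus, 1758", NA),
  stringsAsFactors = FALSE
)
records$name <- build_name(records$genus, records$epithet, records$infraspecific)
```

Whether NA is the right value for family-level records is exactly the open question above; the sketch only shows the "leave name blank" option.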
