Thematic analytics #4505

Open
gbif-portal opened this issue Jan 3, 2023 · 10 comments

Comments

@gbif-portal
Collaborator

gbif-portal commented Jan 3, 2023

Thematic analytics

As discussed with @tobiasgf and @thomasstjerne, it would be nice to have regular analytics, similar to country reports, for the key thematic segments in GBIF, especially DNA, soil, freshwater, etc. @kingenloff, we will need this for health, too.


Github user: @dschigel
User: See in registry - Send email
System: Chrome 108.0.0 / Windows 10.0.0
Referer: https://www.gbif.org/health
Window size: width 1847 - height 913
API log
Site log
System health at time of feedback: OPERATIONAL

@dschigel changed the title from "Themic analytics" to "Thematic analytics" on Jan 3, 2023
@dschigel

dschigel commented Jan 3, 2023

Implementation options range from regular static PDF summaries, to an on-portal thematic section, to hosted portals. Key elements: 1) numbers of all kinds, 2) a map, 3) taxonomy, e.g. a pie or box chart plus a tree (if dynamic, like metrics).

@ManonGros

@dschigel, for any type of statistics, someone would probably have to identify and tag all the relevant datasets.

@dschigel

dschigel commented Jan 3, 2023

Yes. Until BoR is fixed, this will have to be done at the level of i) dataset (tagging), ii) extension use, and sometimes iii) publisher. Example: everything from the MGnify publisher should contribute to the DNA analytics; there we don't need to bother with dataset tagging, and the extension was not used (confirmed by @thomasstjerne).

@ManonGros

ManonGros commented Jan 3, 2023

I would expect that for this kind of statistics, we want everything to be accessible in one query. We wouldn't want something like "this publisher and those three datasets and these three records"; we probably want something like "all the datasets with tag A". Maybe when we have categories for datasets (gbif/registry#247), some things will be easier. But in any case, someone will have to identify the datasets that you want to compute metrics on. That will probably be the most time-consuming part.
Do you already have a list of health and soil datasets?

@dschigel

dschigel commented Jan 3, 2023

Not yet, but this can and should be done once we hear from INF.

@MattBlissett
Member

Would soil, freshwater etc be dataset-based metrics, or taxon-based?

@dschigel

dschigel commented Jan 3, 2023

Needs thinking and testing, but the quick answer is dataset-based (datasets, plural), with some cases where tagging by theme can be sped up, verified, or automated by detecting the use of a known extension or the origin from certain publishers.

@thomasstjerne

> Would soil, freshwater etc be dataset-based metrics, or taxon-based?

For DNA, this is currently something like:
publisherKey in [key1, key2, ...] OR has DNA derived data extension OR has GGBN amplification extension
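The rule above could be sketched as a simple predicate over per-dataset summaries. This is only an illustration under stated assumptions: the `DatasetSummary` record, its field names, and the extension labels are hypothetical, and the publisher keys are the same placeholders as in the rule, not real registry keys.

```python
from dataclasses import dataclass, field

# Placeholder publisher keys; the real list (key1, key2, ...) is not given in this thread.
DNA_PUBLISHER_KEYS = {"key1", "key2"}

# Hypothetical labels for the two extensions named above; not actual GBIF identifiers.
DNA_EXTENSIONS = {"dna_derived_data", "ggbn_amplification"}


@dataclass
class DatasetSummary:
    """Hypothetical minimal description of a dataset for theme tagging."""
    key: str
    publisher_key: str
    extensions: set = field(default_factory=set)


def is_dna_dataset(ds: DatasetSummary) -> bool:
    """True if the publisher is on the list OR either DNA-related extension is used."""
    return ds.publisher_key in DNA_PUBLISHER_KEYS or bool(ds.extensions & DNA_EXTENSIONS)


# Illustrative records only
datasets = [
    DatasetSummary("d1", "key1"),                            # matches by publisher
    DatasetSummary("d2", "other", {"dna_derived_data"}),     # matches by extension
    DatasetSummary("d3", "other", {"measurement_or_fact"}),  # no match
]
dna_keys = [ds.key for ds in datasets if is_dna_dataset(ds)]
print(dna_keys)
```

Keeping the publisher list and extension set as data rather than code would let other themes (soil, freshwater, health) reuse the same predicate shape with their own criteria.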

@dschigel

dschigel commented Jan 3, 2023

Can we test this with DNA, to start with? The WP2023 says: "Projected outcomes for 2027: Knowledge gaps are reduced by consolidating data coverage across the thematic areas of relevance." How do we know we did? By doing regular analytics, before and after, theme by theme. Would the country analytics code combined with the criteria offered by @thomasstjerne do the trick?
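As a rough sketch of what "before and after, theme by theme" could mean in practice, two per-theme snapshots can be compared metric by metric. The snapshot layout, the metric names, and all the numbers here are made up for illustration; they are not actual GBIF figures or the country analytics code.

```python
def compare_snapshots(before, after):
    """Return the change per theme and per metric between two snapshots.

    Each snapshot maps theme -> {metric name -> count}. A theme or metric
    missing from one snapshot is treated as zero there.
    """
    changes = {}
    for theme in sorted(set(before) | set(after)):
        b = before.get(theme, {})
        a = after.get(theme, {})
        changes[theme] = {
            metric: a.get(metric, 0) - b.get(metric, 0)
            for metric in sorted(set(b) | set(a))
        }
    return changes


# Made-up numbers purely for illustration
january = {"DNA": {"datasets": 10, "occurrences": 1000}}
december = {
    "DNA": {"datasets": 12, "occurrences": 1500},
    "soil": {"datasets": 3, "occurrences": 200},
}
print(compare_snapshots(january, december))
```

A theme that only starts being tagged later (soil in this example) simply shows its full count as the change, which is one reason to capture a baseline snapshot early.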

@dschigel

dschigel commented Jan 4, 2023

Maybe we can think about this with a country report at hand as a model: https://analytics-files.gbif-uat.org/country/DK/GBIF_CountryReport_DK.pdf. We can even go over a printout and plan which elements are applicable to thematic analytics and which are not. Even if the analytics as visualisation will not be ready soon, a capture of the January state of data per theme is necessary. DNA is the most compact and straightforward to try; once we know it works, we can proceed to tag soil, health and freshwater datasets (2023 priorities), then eventually all the rest.

5 participants