Differentiate between physical occurrences vs. DNA occurrences #2

SSuominen1 · 2021-03-18T15:59:03Z

Question from meeting:
▪ How can users of the data differentiate between occurrences based on physical samples vs. dna?
▪ Dmitry Schigel of GBIF said that they have been thinking about this. Currently this would be recorded/sliced through the BasisOfRecord field, but this is not ideal.
▪ Possibility to use flags on data for the different sources
▪ Concerns about how non-genetic scientists can use the data (Katrina Exter). Let's say someone wants to compare the species from eDNA from project X to those from project Y: how will they know that they are comparing like with like? If the types of sequences are different (e.g. ITS vs 16S), then you will by definition get different species sets out of the data, because they don't look at the same creatures. If project X uses library XX and project Y uses library YY, and they are known to have different levels of accuracy or coverage, then you are not comparing like with like. But how can someone know that if they do not have a background in DNA? Do we expect them to figure it out themselves? That is an option, but then it would be good to add a flag warning people to do this.
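
As a rough illustration of the "like with like" concern, the check below compares two datasets on the marker gene and the reference library used. This is only a sketch under assumptions: the field names `target_gene` and `otu_db` are taken to be how the marker and the reference database are recorded, and `comparability_warnings` is a hypothetical helper, not an existing OBIS or GBIF function.

```python
from typing import Dict, List, Optional, Sequence


def comparability_warnings(
    meta_x: Dict[str, Optional[str]],
    meta_y: Dict[str, Optional[str]],
    fields: Sequence[str] = ("target_gene", "otu_db"),  # assumed field names
) -> List[str]:
    """Return human-readable warnings where two eDNA datasets differ."""
    warnings = []
    for name in fields:
        x, y = meta_x.get(name), meta_y.get(name)
        if not x or not y:
            # Missing metadata: we cannot tell whether the comparison is fair.
            warnings.append(f"{name}: missing for at least one dataset")
        elif x.strip().lower() != y.strip().lower():
            warnings.append(f"{name}: {x!r} vs {y!r}, results may not be comparable")
    return warnings


project_x = {"target_gene": "ITS", "otu_db": "library XX"}
project_y = {"target_gene": "16S", "otu_db": "library YY"}
for message in comparability_warnings(project_x, project_y):
    print("WARNING:", message)
```

A flag like this would not tell a non-specialist which dataset is better, only that the two should not be compared without a closer look.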

We need to make sure that when searching for occurrences, there is a clear separation between DNA-data vs. other occurrence data.

JoBeja · 2021-03-18T16:06:59Z

Not having been in the meeting, I hope my comment makes sense. Regarding Katrina's concerns.

I think this is not only applicable to genetic data; it can also be applicable to other types of data (e.g. physical oceanography or chemistry data). If a user does not have enough knowledge of what they are looking at, the data centres cannot be responsible for this. If we ensure that standardisation and documentation are done to an adequate level, should this not suffice? This is an argument I have heard several times in the past to delay/prevent data sharing. It's an argument that some data centres have used too, as if we could control what users do with the data (I know that is not the intention behind Katrina's comments though).

It's not clear whether this comment arose because there are various vocabularies/libraries that can be used for the genetics data, but wouldn't adequate documentation bypass this problem?

dianalg · 2021-03-25T21:55:01Z

One thing I'm concerned about in this context is that a researcher could search OBIS for a particular species, and the resulting data set could contain eDNA-derived occurrences in addition to occurrences obtained by human observation or machine observation. It seems like a legitimate scientific question to me to ask how these data types could be aggregated in a meaningful way. That said, I imagine there are experts who would be able to do so. But I think it's worth considering what OBIS might do with its search interface to ensure users are aware that this mix of data types might occur? And maybe it would be useful to have an option to include or exclude DNA-derived data from a search result?
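
A minimal sketch of what such an include/exclude option could look like on downloaded records. It assumes that DNA-derived occurrences can be recognised by a non-empty `DNA_sequence` field; that is a heuristic chosen for illustration, and `is_dna_derived` and `filter_occurrences` are hypothetical names, not part of any OBIS tool.

```python
from typing import Dict, Iterable, List

# Assumption: a non-empty value in this field marks a DNA-derived record.
DNA_FIELD = "DNA_sequence"


def is_dna_derived(record: Dict[str, object]) -> bool:
    """Heuristic check: does the record carry a sequence?"""
    value = record.get(DNA_FIELD)
    return isinstance(value, str) and value.strip() != ""


def filter_occurrences(
    records: Iterable[Dict[str, object]], dna: str = "include"
) -> List[Dict[str, object]]:
    """Keep all records ("include"), drop DNA-derived ones ("exclude"),
    or keep only DNA-derived ones ("only")."""
    if dna not in ("include", "exclude", "only"):
        raise ValueError("dna must be 'include', 'exclude' or 'only'")
    if dna == "include":
        return list(records)
    want_dna = dna == "only"
    return [r for r in records if is_dna_derived(r) == want_dna]


occurrences = [
    {"scientificName": "Ammodytes personatus", "DNA_sequence": "ACGT"},
    {"scientificName": "Ammodytes hexapterus", "basisOfRecord": "HumanObservation"},
]
print(len(filter_occurrences(occurrences, dna="exclude")))
```

Defaulting to "include" keeps the current behaviour, so the mix of data types would still need to be made visible to users in some other way.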

claudenozeres · 2021-05-26T13:48:14Z

The issue title is physical vs DNA. I would like to go further, by suggesting the utility of doing so relates to taxonomy and geography--and how we display differences (this refers to several matters under Adding Genetic Data - Part 2).

“…collect once, use many times…”
https://www.marinespecies.org/news.php?p=show&id=8741

In the early days of biodiversity data publishing, this was a good mantra to follow and to convince others to contribute. Now, I sometimes regret it. Today, we have lots of data online, and we have some tools to check whether spatial information, metadata, etc. are complete.
OBIS argues that it does better than GBIF in these areas of quality control. In addition, OBIS uses WoRMS as the taxonomic backbone, so scientificName is not open as it is in GBIF. However, this strength is also tied to weaknesses in OBIS species records (relative to GBIF) that are becoming more evident with more data and more re-use, in particular because of genomic data and maps. So perhaps it is time to consider other features to improve the display of species data for fitness for use, beyond simply flagging geospatial errors. Differentiating physical and DNA occurrences will make this better.

On OBIS, for species concepts, the only guidance given is that an older name will be automatically displayed with the currently accepted, ‘valid’ name according to WoRMS. A data download will still display the original (submitted) name in records. This is handy, useful and obvious (even though it can lead to some mix-ups when taxa shift ranks).

On OBIS, for genetic identifiers (ASV/OTU) in datasets, it has been suggested that these will become searchable. It has also been suggested that more details of geospatial filtering/searches will become possible. Together, this means that OBIS will display (on site or in R, for example), spatial patterns (like clustering) of OTUs across regions. This would be similar to what is already shown in BOLDsystems, and thus it would be valuable if BOLD records (and BINs) could be consultable within OBIS. Or perhaps we need to do all our species work in GBIF, because GBIF shows BOLD and OBIS records—but the reverse does not occur.

But there would be some important differences to what I describe above.

Changes made/displayed to original published records. OBIS would display taxon records, both traditional ones and those with genetic identifiers, over a spatial grid (and by time, depth, etc.). It should then become evident, in some instances, where traditional records with taxon names are in conflict with genomic information. Otherwise we need tools to reveal these conflicts.
Example 1: traditional records name a species. Genetic data suggest that none of that species is present in an area, but that another one is. See Ammodytes hexapterus vs Ammodytes personatus in the NE Pacific. Question: what do we do with the traditional records? The name is correct according to WoRMS, but the identity is wrong according to genetics, so should OBIS auto-correct? Or, if physical and DNA records differ, suggest using only the DNA records?
Example 2: traditional records go to species, but the genetic identifier has no species name yet, so it is shown at a more general taxonomic level. Again, the name of the record is correct but the identity is wrong, except this time a user will not notice it because the genetic identifier is too general. See Gersemia rubiformis vs. Alcyonacea in the NE Pacific. In this case, seeing that there are unnamed DNA records will not help to correct the physical records with names.
Changes made at a record level with new taxonomic information (as compared to over datasets). Records on OBIS can only be updated by re-publishing the whole dataset. It is likely that most datasets will not be updated for species identifications, because the responsible contacts are no longer available, there is no supporting information to confirm, etc. Unfortunately, OBIS is dominated by fisheries surveys, compared to museum records as seen on GBIF and, more recently, citizen science records from iNaturalist. As a consequence, incorrect species (for a region) may be numerically dominant on OBIS. Comparing these to museum records and iNaturalist (from GBIF) makes this obvious, but the sheer number of survey records will hide the errors, and records may be misused. Museums are more likely to be correct (and so have less need for correction). Citizen science on iNaturalist (mostly coastal marine observations) may be flawed originally, but the records are very easily corrected by a curation community of amateurs and experts. Result: iNaturalist could surpass OBIS for important marine species observations, because they are easily verifiable and corrected. At least in North America. Similar to how OBIS shows original and accepted species names, we need tools to indicate what are original and accepted species locations. And it is not enough to use the numerically dominant name in an area (this is how we operate with corrections on BOLD), because the errors from fisheries are dominant. If DNA records are in a different pipeline on OBIS, then changes (like names) could become easier/faster to apply/display. And then show 'valid species due to DNA' on physical occurrences--even as these records are not changed (Q: just why is it so easy to correct GBIF records through iNaturalist?).

I raised this matter in BiodiversityNext 2019. There may be a place to discuss this during the consultation of Digital Extended Specimens - Phase 2. https://discourse.gbif.org/t/digital-extended-specimens-phase-2/2651

And I hope to raise this matter in June at Digital Data 2021: Digital Datas Grand Challenge: Expanding Discovery Across Multiple Domains. https://www.idigbio.org/content/digital-data-2021-digital-datas-grand-challenge-expanding-discovery-across-multiple-domains
See under themes of: Genomic Data.

dianalg · 2021-05-27T17:39:45Z

I'm not sure I understand everything that @claudenozeres is saying, but two things occur to me here:

1. If species identification and distribution information from "traditional" versus "genetic" sources are in conflict, it doesn't seem right to me that OBIS should "auto-correct" records. I think OBIS's role is to highlight such conflicts if possible, but not to make decisions regarding which data types are more or less accurate or representative of reality. These sorts of decisions should be left to experts, because the right answer is likely to be highly specific to the systems/species/questions being investigated.
2. I'm not sure it's right to assume that OBIS IDs are less accurate than others just because many of them are derived from surveys in the field. In my experience, data providers submitting to OBIS are highly trained professionals who have deep, hands-on knowledge of the systems in which they work. It's true that these IDs sometimes don't have a lingering physical or digital record that can be re-examined and corrected by others. But I don't think that means we should distrust the expertise of scientists working in the field.

claudenozeres · 2021-05-27T18:07:32Z

I agree, and I am straying into areas that are not clear for myself on how to proceed, but they are issues, so I broached them here.

'...role is to highlight such conflicts'--this is what I am hoping for. The current maps are 'incomplete' and do not reveal the conflicts. The only auto-correct being done is to display records only with valid WoRMS names--but this is an action performed on datasets (not left to experts to judge). It is a useful precedent; I was wondering about the discussion on how the genetic identifiers will be displayed relative to WoRMS-valid names.
My intention was not to denigrate the hard work of those on fisheries surveys (and I am one of those). The issue is one of inadvertent errors--because surveyors are unaware of identification guides, of recent review work by others, and especially now of results from genetics. In my experience with Canadian surveys, we have been improving our identifications, but data from past decades vary widely in quality. I have learned of many changes because of the public data, and OBIS is a great boon for this--because we see differences between surveys and sources. However, it is presently not easy to point these out and push for corrections to providers on OBIS IPTs. Just today, a museum was notified of a taxon change (not an error, but a revision) to an observation and it was updated instantly on iNat. Now we have to go and contact all other sources on OBIS to do the same for this taxon--the name shown is valid, but the species may not be. OBIS will only 'auto-correct' if a name changes--but it cannot know about taxon revisions (such as a split). So I was wondering if there could be a way to alert/encourage sources to make updates for corrections, which will become especially urgent when genetic data are mapped on the OBIS platform.

claudenozeres · 2021-05-27T18:23:19Z

For reference, here is the taxon example I mentioned above. I am involved with photo+map-based marine species updates nearly every day on iNat. I wish I could do something similar to alert providers on OBIS IPTs (apart from emailing busy folks to tell them they have many errors :( ). For specific museum records, this is feasible. For biodiversity on fisheries surveys, this is a burden. So differentiating physical vs DNA occurrences can help make it easier to work with data in analyses--taxa with the latter are 'verifiable', while physical ones (if only observations, and not actually conserved specimens) are more tentative identifications, at least for lesser-known or problematic taxa. https://inaturalist.org/observations/78789769

pieterprovoost · 2021-06-04T13:15:37Z

To respond briefly to point (1), determining if a species is present at a specific location based on conflicting evidence from multiple sources really falls outside the scope of what OBIS is currently doing. If there's conflicting information within a record (DNA evidence conflicts with the provided scientific name) we intend to highlight that. What that will look like is to be decided, but it should probably go a bit further than a simple quality flag, for example in the shape of annotations containing alternative identifications. The same could be done when identifications are challenged by distribution records in WoRMS, in the case of a split, etc.
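
To make the idea of annotations with alternative identifications a bit more concrete, here is a rough sketch. Everything in it is an assumption for illustration: the `Annotation` structure, the `annotate_conflict` helper and the way a DNA-based name is supplied are hypothetical, and none of this reflects a decided OBIS design. The point it shows is that the submitted record stays untouched and the conflict is attached alongside it.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Annotation:
    """Hypothetical annotation holding an alternative identification."""
    alternative_name: str
    source: str
    note: str = ""


def annotate_conflict(
    record: Dict[str, str], dna_name: Optional[str]
) -> List[Annotation]:
    """Return annotations when the DNA-based name disagrees with the
    provided scientificName. The record itself is never modified."""
    provided = (record.get("scientificName") or "").strip()
    candidate = (dna_name or "").strip()
    if not provided or not candidate:
        return []  # nothing to compare
    if provided.lower() == candidate.lower():
        return []  # no conflict
    return [
        Annotation(
            alternative_name=candidate,
            source="DNA evidence",
            note=f"conflicts with provided name {provided!r}",
        )
    ]


record = {"scientificName": "Ammodytes hexapterus"}
for annotation in annotate_conflict(record, "Ammodytes personatus"):
    print(annotation)
```

The same structure could carry alternatives from other sources (a distribution record in WoRMS, a split) by changing the `source` value.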

I agree that updating an identification on iNaturalist is easier than having a survey dataset updated, but then again I don't think managing large scale survey data on iNaturalist is a good idea. I would also like to mention that the OBIS node managers are doing a terrific job responding to feedback. This process can probably be improved (let me think a bit about that), and maybe user annotations can provide a temporary patch until the data are corrected at source.

Pinging @bart-v @leenvandepitte regarding taxonomic revisions. Maybe we can think about a joint WoRMS/OBIS notification system for the node managers.

claudenozeres · 2021-06-04T16:20:01Z

Thank you, @pieterprovoost, and my deepest thanks, too, to all the managers--I agree on the quality of work done. Of course, being exposed (public data) in such large volumes means someone can always find room for more requests or changes, and it means trying to be all things to all users. I also agree that iNat does not seem to be the place for volume, but I am intrigued by how they function. Because of the work on individual observations, it could become an important source of updated biodiversity data that will hopefully also eventually show up in OBIS. In the meantime, and relevant to DNA-related revisions, yes, I am eager to see if more can be done with WoRMS/OBIS notifications--at least to managers.

bart-v · 2021-06-04T20:17:03Z

Taxonomic revisions/annotations/feedback on taxonomic and occurrence data have been an issue for over a decade now, especially in GBIF. Multiple solutions were proposed (e.g. Filtered Push), but none of them has proved functional...
So this is not something we will be able to resolve here quickly :)

FYI, some references

albenson-usgs · 2021-06-07T12:31:53Z

> But I think it's worth considering what OBIS might do with its search interface to ensure users are aware that this mix of data types might occur? And maybe it would be useful to have an option to include or exclude DNA-derived data from a search result?

I think GBIF and ALA are grappling with some of these things as well, and possibly this GitHub issue would be relevant? gbif/registry#247
