Differentiate between physical occurrences vs. DNA occurrences #2
Comments
Not having been in the meeting, I hope my comment makes sense. Regarding Katrina's concerns, I think this applies not only to genetic data but also to other types of data (e.g. physical oceanography or chemistry data). If a user does not have enough knowledge of what they are looking at, the data centres cannot be responsible for this. If we ensure that standardisation and documentation are done to an adequate level, should this not suffice? This is an argument I have heard several times in the past to delay or prevent data sharing. Some data centres have used it too, as if we could control what users do with the data (I know that is not the intention behind Katrina's comments though). It's not clear whether this comment arose because there are various vocabularies/libraries that can be used for the genetics data, but wouldn't adequate documentation bypass this problem? |
One thing I'm concerned about in this context is that a researcher could search OBIS for a particular species, and the resulting data set could contain eDNA-derived occurrences in addition to occurrences obtained by human observation or machine observation. It seems like a legitimate scientific question to me to ask how these data types could be aggregated in a meaningful way. That said, I imagine there are experts who would be able to do so. But I think it's worth considering what OBIS might do with its search interface to ensure users are aware that this mix of data types might occur? And maybe it would be useful to have an option to include or exclude DNA-derived data from a search result? |
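The include/exclude option suggested above could look roughly like the sketch below. It works on plain dictionaries and is not based on any real OBIS search API. It assumes, purely for illustration, that a DNA-derived record can be recognised by a non-empty `DNA_sequence` field; that field name, the sample records, and the function names are hypothetical.

```python
from typing import Dict, Iterable, List


def is_dna_derived(record: Dict) -> bool:
    """Assumption: a record counts as DNA-derived if it carries a sequence."""
    return bool(record.get("DNA_sequence"))


def filter_occurrences(records: Iterable[Dict], dna: str = "include") -> List[Dict]:
    """Return records according to the requested handling of DNA-derived data.

    dna="include" keeps everything, "exclude" drops DNA-derived records,
    and "only" keeps nothing but DNA-derived records.
    """
    if dna not in {"include", "exclude", "only"}:
        raise ValueError("dna must be 'include', 'exclude' or 'only'")
    records = list(records)
    if dna == "include":
        return records
    want_dna = dna == "only"
    return [r for r in records if is_dna_derived(r) == want_dna]


# Made-up example records for one species.
sample_records = [
    {"id": "a", "scientificName": "Gadus morhua", "basisOfRecord": "HumanObservation"},
    {"id": "b", "scientificName": "Gadus morhua", "basisOfRecord": "MaterialSample",
     "DNA_sequence": "ACGT"},
]

without_dna = filter_occurrences(sample_records, dna="exclude")
only_dna = filter_occurrences(sample_records, dna="only")
```

Keeping "include" as the default means existing searches behave as before, while the mix of data types becomes an explicit choice for users who care about it.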
The issue title is physical vs. DNA. I would like to go further and suggest that the utility of differentiating them relates to taxonomy and geography, and to how we display differences (this refers to several matters under Adding Genetic Data - Part 2). “…collect once, use many times…” In the early days of biodiversity data publishing, this was a good mantra to follow and to convince others to contribute. Now, I sometimes regret it. Today, we have lots of data online, and we have some tools to check whether spatial information, metadata, etc. are complete. On OBIS, for species concepts, the only guidance given is that an older name will be automatically displayed with the currently accepted, ‘valid’ name according to WoRMS. A data download will still display the original (submitted) name in records. This is handy, useful, and obvious (even though it can lead to some mix-ups when taxa shift ranks). On OBIS, for genetic identifiers (ASV/OTU) in datasets, it has been suggested that these will become searchable. It has also been suggested that more detailed geospatial filtering and searches will become possible. Together, this means that OBIS will display (on the site or in R, for example) spatial patterns (like clustering) of OTUs across regions. This would be similar to what is already shown in BOLDsystems, and thus it would be valuable if BOLD records (and BINs) could be consulted within OBIS. Or perhaps we need to do all our species work in GBIF, because GBIF shows BOLD and OBIS records, but the reverse does not occur. But there would be some important differences to what I describe above.
I raised this matter at BiodiversityNext 2019. There may be a place to discuss this during the consultation on Digital Extended Specimens - Phase 2. https://discourse.gbif.org/t/digital-extended-specimens-phase-2/2651 And I hope to raise this matter in June at Digital Data 2021: Digital Data's Grand Challenge: Expanding Discovery Across Multiple Domains. https://www.idigbio.org/content/digital-data-2021-digital-datas-grand-challenge-expanding-discovery-across-multiple-domains |
I'm not sure I understand everything that @claudenozeres is saying, but two things occur to me here:
|
I agree, and I am straying into areas where it is not clear to me how to proceed, but they are issues, so I broached them here.
|
For reference, here is the taxon example I mentioned above. I am involved with photo- and map-based marine species updates nearly every day on iNat. I wish I could do something similar to alert providers on OBIS IPTs (apart from emailing busy folks to tell them they have many errors :( ). For specific museum records, this is feasible. For biodiversity on fisheries surveys, it is a burden. So differentiating physical vs. DNA occurrences can make it easier to work with data in analyses: taxa with the latter are 'verifiable', while physical ones (if only observations, and not actually conserved specimens) are more tentative identifications, at least for lesser-known or problematic taxa. https://inaturalist.org/observations/78789769 |
To respond briefly to point (1), determining if a species is present at a specific location based on conflicting evidence from multiple sources really falls outside the scope of what OBIS is currently doing. If there's conflicting information within a record (DNA evidence conflicts with the provided scientific name) we intend to highlight that. What that will look like is to be decided, but it should probably go a bit further than a simple quality flag, for example in the shape of annotations containing alternative identifications. The same could be done when identifications are challenged by distribution records in WoRMS, in the case of a split, etc. I agree that updating an identification on iNaturalist is easier than having a survey dataset updated, but then again I don't think managing large scale survey data on iNaturalist is a good idea. I would also like to mention that the OBIS node managers are doing a terrific job responding to feedback. This process can probably be improved (let me think a bit about that), and maybe user annotations can provide a temporary patch until the data are corrected at source. Pinging @bart-v @leenvandepitte regarding taxonomic revisions. Maybe we can think about a joint WoRMS/OBIS notification system for the node managers. |
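As a rough illustration of annotations that go "a bit further than a simple quality flag", the sketch below attaches an alternative identification to a record when a DNA-based name disagrees with the submitted scientific name. It is not how OBIS implements this (that is still to be decided, as noted above); the class names, the `dna_based_name` field, and the example names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Annotation:
    kind: str                                # e.g. "identification_conflict"
    message: str
    alternative_name: Optional[str] = None   # suggested alternative identification


@dataclass
class Occurrence:
    scientific_name: str                     # name as submitted by the provider
    dna_based_name: Optional[str] = None     # hypothetical: name inferred from sequence data
    annotations: List[Annotation] = field(default_factory=list)


def annotate_conflict(occ: Occurrence) -> Occurrence:
    """Add an annotation if the DNA-based name disagrees with the submitted name."""
    if occ.dna_based_name is None:
        return occ
    same = occ.dna_based_name.strip().lower() == occ.scientific_name.strip().lower()
    if not same:
        occ.annotations.append(
            Annotation(
                kind="identification_conflict",
                message="DNA evidence suggests a different identification",
                alternative_name=occ.dna_based_name,
            )
        )
    return occ


conflicting = annotate_conflict(Occurrence("Mytilus edulis", dna_based_name="Mytilus trossulus"))
consistent = annotate_conflict(Occurrence("Mytilus edulis", dna_based_name="mytilus edulis"))
```

The submitted name is left untouched, so the annotation can act as a temporary patch until the data are corrected at source.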
Thank you, @pieterprovoost, and my deepest thanks, too, to all the managers. I agree on the quality of the work done. Of course, being exposed (public data) in such large volumes means someone can always find room for more requests or changes, and it means trying to be all things to all users. I also agree that iNat does not seem to be the place for volume, but I am intrigued by how they function. Because of the work on individual observations, it could become an important source of updated biodiversity data that will hopefully also eventually show up in OBIS. In the meantime, and relevant to DNA-related revisions, yes, I am eager to see if more can be done with WoRMS/OBIS notifications, at least to managers. |
Taxonomic revisions, annotations, and feedback on taxonomic and occurrence data have been an issue for over a decade now, especially in GBIF. Multiple solutions were proposed (e.g. Filtered Push), but none of them has proved functional... FYI, some references |
I think GBIF and ALA are grappling with some of these things as well, and possibly this GitHub issue would be relevant? gbif/registry#247 |
Question from meeting:
▪ How can users of the data differentiate between occurrences based on physical samples vs. DNA?
▪ Dmitry Schigel of GBIF said that they have been thinking about this. Currently this would be recorded/sliced through the BasisOfRecord field, but this is not ideal.
▪ Possibility to use flags on data for the different sources
▪ Concerns about how non-genetic scientists can use the data (Katrina Exter). Let's say someone wants to compare the species from eDNA from project X to those from project Y: how will they know that they are comparing like with like? If the types of sequences are different (e.g. ITS vs. 16S), then you will by definition get different species sets out of the data, because they don't look at the same creatures. If project X uses library XX and project Y uses library YY, and they are known to have different levels of accuracy or coverage, then you are not comparing like with like. But how can someone know that if they do not have a background in DNA? Do we expect them to figure it out themselves? (That is an option, but then it would be good to add a flag warning people to do this.)
We need to make sure that when searching for occurrences, there is a clear separation between DNA-data vs. other occurrence data.
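The "like with like" concern in the list above could be turned into an automatic warning along the lines of the sketch below. It compares two dataset-level metadata dictionaries on the marker gene and the reference library. The keys `target_gene` and `otu_db` are assumed field names chosen for illustration, and the project metadata is made up (reusing the placeholder library names XX and YY from the question).

```python
from typing import Dict, List, Sequence


def comparability_warnings(
    meta_a: Dict,
    meta_b: Dict,
    keys: Sequence[str] = ("target_gene", "otu_db"),  # assumed metadata fields
) -> List[str]:
    """Return human-readable warnings when two eDNA datasets may not be comparable."""
    warnings = []
    for key in keys:
        a, b = meta_a.get(key), meta_b.get(key)
        if a is None or b is None:
            # Missing metadata: we cannot tell, so say so rather than stay silent.
            warnings.append(f"{key}: missing for at least one dataset, comparability unknown")
        elif a != b:
            warnings.append(f"{key}: {a!r} vs {b!r}, species lists may not be like with like")
    return warnings


# Made-up metadata for two projects.
project_x = {"target_gene": "ITS", "otu_db": "XX"}
project_y = {"target_gene": "16S", "otu_db": "YY"}

for line in comparability_warnings(project_x, project_y):
    print(line)
```

A warning of this kind does not replace documentation, but it gives users without a DNA background a prompt to check before comparing results.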