Modify Variables table to enable greater interoperability #160

Open
aufdenkampe opened this issue Sep 28, 2018 · 14 comments

Comments

@aufdenkampe
Member

This issue follows up on my suggestion in ODM2/ODM2DataSharingPortal#71 (comment) that we pick up the "800-lb gorilla" issue to reimagine variables, as we described in our EnviroVariableNames-ODM2TeamIdeas Google Doc.

This directly addresses the semantic overloading challenge that we've all been aware of from the beginning, which we discussed at the April 2014 CUAHSI Ontology Project Capstone workshop and further developed during the May 2016 Environmental Chemistry Names/Ontology Workshop. Most recently, @PleiadesAustralia raised it in the "Semantic overloading of CVVariableName" issue: ODM2/ODM2ControlledVocabularies#35.

I've been working with @roelofversteeg and we have an idea on how to move forward that allows us to hang on to older Variable/Parameter terms while also breaking down each term into its semantic/conceptual components in a way that enables both more granular queries and cross-walk mapping.

We'll share soon.

@PleiadesAustralia

PleiadesAustralia commented Sep 29, 2018 via email

aufdenkampe added a commit that referenced this issue Oct 24, 2018
…OIDs

2018-09-30 updates from @roelofversteeg.
Many changes to Variables, see #160. This including change from "TaxonomicClassifier" to "Species" and FK to both Species and Speciation, eliminating separate speciation CV. #161.
Also addresses #157, #158, #159.
@aufdenkampe
Member Author

In our latest commits to the ODM2.1_dev branch, @roelofversteeg and I have found, we believe, the right balance between:

  1. Giving users the flexibility to reuse their old "Variable" or "Parameter" terms (or define new ones), including directly linking to URIs for any externally defined terms; and
  2. Breaking down the concept of a "Variable" or "Parameter" into its core conceptual components, in order to facilitate dataset integration via relatively simple queries that interoperate among various terms from various sources.

Our solution also provides the benefit of creating an interoperability system that maps any existing "Variable" or "Parameter" terms to these core concept components, AND therefore provides a means to utilize ODM2 as a "lingua franca" to integrate datasets from different sources for granular, data-value level queries, as we described in our EnviroVariableNames-ODM2TeamIdeas Google Doc.

@peckhams describes for his CSDMS Standard Names (CSN) that a "Variable" or "Parameter" term can be broken down into these core concepts (where optional components are in brackets):

  • object type + quantity type + [operation] + [modifiers]

Based on our evaluation and discussion in our EnviroVariableNames-ODM2TeamIdeas Google Doc, we broke down the first two concepts a bit further:

  • object type = Medium + TaxonomicClassifier + Speciation
  • quantity type = QuantityKind

Therefore, we believe most Variable terms (from any of the numerous existing, external vocabularies) can be mostly described by these four component concepts, which already exist in ODM2: Medium, TaxonomicClassifier, Speciation, and QuantityKind.

Where the Taxonomic Classifier term should identify the core object of the variable, such as a "species", typically from a biological, chemical, geological or other kind of taxonomy and preferably taken from an External Identification System, such as:

Where the Medium is considered another object that contains the object you’re measuring. For a QuantityType that is a ratio, Medium is usually the stuff that is assumed in the denominator of a unit.

Where the QuantityKind is presently taken from what is currently called the ODM UnitsTypeCV, which in turn is a modification of QUDT v1.1 QuantityKind terms that were expanded to better include environmental and chemical variables. NOTE that QUDT v2 is in process, and appears to be doing a more complete job with environmental and chemical variables.

This approach does not yet include CSN [operations] or [modifiers], but could be expanded to include those if necessary.
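
As a rough illustration of the breakdown above, here is a minimal Python sketch. The class name `VariableConcepts`, its field names, the source prefixes and the example terms are all hypothetical and are not taken from the ODM2.1_dev schema. The only point is that two differently named source terms become comparable once each is mapped to the same four components.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class VariableConcepts:
    """Four component concepts of a variable term (hypothetical field names)."""
    medium: str                       # what contains the measured object
    taxonomic_classifier: str         # core object, e.g. a chemical species
    quantity_kind: str                # kind of quantity being reported
    speciation: Optional[str] = None  # how the object is expressed, if relevant


# Made-up terms from two imaginary source vocabularies.
source_terms = {
    "sourceA:NO3_as_N_mgL": VariableConcepts(
        "aqueous", "nitrate", "concentrationMassPerVolume", "N"),
    "sourceB:Nitrate-N": VariableConcepts(
        "aqueous", "nitrate", "concentrationMassPerVolume", "N"),
    "sourceB:Phosphate-P": VariableConcepts(
        "aqueous", "phosphate", "concentrationMassPerVolume", "P"),
}


def equivalent_terms(term: str) -> List[str]:
    """Return the other source terms that map to the same four components."""
    target = source_terms[term]
    return sorted(t for t, c in source_terms.items() if c == target and t != term)


print(equivalent_terms("sourceA:NO3_as_N_mgL"))
```

Because the dataclass is frozen and compared field by field, matching terms across sources reduces to an equality check on the four components.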

@aufdenkampe
Member Author

Here is an image of the proposed modifications to the Variables and TaxonomicClassifiers tables:

[Image: odm2core_odm2 1_dev_2018-10-29]

My thinking is that for ODM2.1, we would pre-populate this new Variables table from our existing ODM2 VariableNameCV, after mapping each term to the four core concepts.

Likewise, I'm thinking that we would also pre-populate the TaxonomicClassifiers table with all chemical species currently in the ODM2 VariableName and ODM2 Speciation vocabularies. Note that we propose here to eliminate the use of these two vocabularies, some of which is described in #161.

@PleiadesAustralia

PleiadesAustralia commented Oct 30, 2018 via email

@horsburgh
Member

Maybe I'm not following this closely enough, but largely, I think this just reorganizes/moves stuff that already exists in ODM2. It may not all be in the Variables table now, but there are good reasons for why things are where they are.

Why is the suggestion to move the content of the Variable Name CV table into the main Variables table? I see the following comment in this commit (8ad71c4):

ODM2CV.CV_VariableName table: Deleted ODM2CV.CV_VariableName table (and also CV_VariableNamenew), as the new approach to Variables does not rely on an internal variable CV, but rather a system for mapping to external variable/parameter term lists.

First, the "CV_VariableNamenew" was never part of the ODM2 schema (not sure where it came from), and second, wasn't the existing VariableName CV already a "system for mapping to external variable/parameter term lists?" It never relied solely on an internal CV and always had the capability to link to terms from other vocabularies. Why break the convention used by all other CVs in ODM2 when it doesn't add new functionality? Users could already link to URIs for externally defined terms via the existing Variable Name CV. How would they do so now - you have a VariableSourceURI, which implies that there is a source out there that supplies all of the information in your proposed new Variables table, and I don't think there is one. So, I'm not sure what interoperability this would bring.

Also - I don't understand why you are suggesting moving Medium into Variables when I think the same arguments you made for keeping other stuff out of Variables (e.g., units) could be made for keeping Medium out of Variables. Wouldn't Medium be a property of the Result (as we had it already) and not a property of a Variable? Temperature is temperature whether it is measured in air or water. Moving the Medium to Variables means that we will be duplicating Variables (now I have to have a variable for temperature in water and another one for temperature in air).
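
The duplication concern can be made concrete with a small Python sketch. The observation pairs below are invented for illustration and are not real ODM2 records; the two sets only count how many distinct Variables each layout would need.

```python
# Hypothetical observations as (variable name, medium) pairs; not real ODM2 data.
observations = [
    ("Temperature", "air"),
    ("Temperature", "liquidAqueous"),
    ("Temperature", "air"),
    ("Dissolved oxygen", "liquidAqueous"),
]

# Layout 1: Medium stays on the Result, so a variable is identified by name alone.
variables_medium_on_result = {name for name, _medium in observations}

# Layout 2: Medium moves into Variables, so each (name, medium) pair is its own variable.
variables_medium_on_variable = set(observations)

print(len(variables_medium_on_result))    # 2
print(len(variables_medium_on_variable))  # 3
```

In this toy case the second layout needs one extra Variable record for temperature; how much that matters in practice depends on how many media each variable is actually observed in.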

Changing the name of VariableTypeCV to VariableDomainCV seems like an unneeded change. It doesn't really add anything new.

The absence of a VariableCode will not support many use cases where short codes for Variables are commonly used.

QuantityKindCV moves some Units information back into Variables? I know you are proposing that this is equivalent to the UnitsTypeCV, but isn't that a property of Units? It already exists in the Units table that is linked to a Result. I guess I don't really understand how moving it helps.

As @PleiadesAustralia notes - there are many variables (particularly biological ones) where the VariableName (e.g., "Count") is qualified by a TaxonomicClassifier (e.g., some species name).

I think this:

Variable = MediumCV term + TaxonomicClassifier term + SpeciationCV term + QuantityKindCV term

Goes beyond what a "Variable" is, is missing what we are now calling VariableName, and introduces a higher level concept that is not represented by a specific entity in ODM2, but can be constructed from what is already there.

@PleiadesAustralia

PleiadesAustralia commented Oct 30, 2018 via email

@horsburgh
Member

@PleiadesAustralia - your understanding of the original intent behind TaxonomicClassifiers is correct. They were there to further qualify the "thing" that is observed - e.g., if I am observing a "count" of a biological organism, the taxonomic classifier is there to tell me what species I am counting. And, yes, there are other contexts besides biological data where this is useful.

There's perhaps a couple of things going on there. First, I think you are right that there isn't enough specificity or guidance on how the existing constructs can be used with existing data use cases. This could certainly improve - and we hoped that it would through use. Nobody on our original team could cover all of the use cases to which these constructs might apply.

Second, I think we are constrained somewhat by what we are able to do in a relational database implementation. The information model that describes all of the elements of "what" was measured is sound in ODM2. As @aufdenkampe notes in his description above - the things he's looking at already exist in ODM2. The particular organization in tables as we have it now in the ODM2 blank schemas may not work for every use case. This is not necessarily a deficiency of the information model, but I think there will always be some deficiencies or challenges with physical implementations.

@peckhams

peckhams commented Oct 30, 2018 via email

@aufdenkampe
Member Author

@PleiadesAustralia, Yes! It sounds like from your second comment (#160 (comment)) that you are already using the TaxonomicClassifiers table exactly as intended, to be the primary term identifying the object of the "Variable". It in fact works very well when applied to chemical variables, especially given that chemical taxonomies & ontologies are very well defined (with ChEBI being the best for our purposes, as first identified by @dr-shorthair et al.).

I fully agree with your suggestion:

The structure allows for all of this but we need some explicit documentation and proposed controlled vocabulary sets.

Regarding proposed vocabulary sets, I strongly suggest ChEBI and ITIS for chemistry and biological taxonomy respectively. I also know there are 3-4 good soil taxonomies that differ more by region than completeness (so all could/should be used according to need). Do you have recommendations on:

  • Ecological Habitat?
  • Mineralogy?
  • Biological tissues? (e.g. sap, leaf, fin, muscle)
  • others

The strong preference is for the list to be published online with an API that allows one to point to a specific term and fetch specific metadata associated with that term.

So, given your second comment, I'm not sure how to interpret your first comment (#160 (comment)):

From a biological sampling perspective I cannot see how this would work.

Just as the concept of aqueous_Nitrate_asN_concentrationMassPerVolume maps beautifully into the Medium-TaxonomicClassifier-Speciation-QuantityKind concepts, so does a biological variable such as organism_brookTrout_length or habitat_Mayfly_countPerArea. We have used these constructs quite a bit in the existing ODM2.0, and the proposed modifications add additional flexibility.
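
To show how the example names line up with the four concepts, here is a small Python sketch. The function `compose_label` and the underscore/`as` naming pattern are only inferred from the three example names in this comment; they are not an ODM2 convention or API.

```python
from typing import Optional


def compose_label(medium: str, taxonomic_classifier: str,
                  quantity_kind: str, speciation: Optional[str] = None) -> str:
    """Join the four concepts into a readable label (hypothetical pattern)."""
    parts = [medium, taxonomic_classifier]
    if speciation:
        # Speciation is optional; biological examples here omit it.
        parts.append("as" + speciation)
    parts.append(quantity_kind)
    return "_".join(parts)


print(compose_label("aqueous", "Nitrate", "concentrationMassPerVolume", "N"))
print(compose_label("organism", "brookTrout", "length"))
print(compose_label("habitat", "Mayfly", "countPerArea"))
```

The chemical and biological cases go through the same function, with Speciation simply left out where it does not apply.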

It is indeed true that under this "model there would be a requirement to register multiple variable for each biological taxa... [which] number in the tens of thousands." The same is true for chemical "species" and "substances". That is why USGS has >8000 parameter codes for water quality, largely because of all the chemical species. This challenge is also why CUAHSI was completely overwhelmed by the number of requests for new terms in their VariableName CV as soon as geochemists started using their system.

All of this was the motivation for the numerous workshops mentioned in our EnviroVariableNames-ODM2TeamIdeas Google Doc. The consensus from all these discussions and workshops was that we should no longer try to maintain variable name controlled vocabularies, but rather find an approach to map existing and new variable names to their core concepts.

The redesign proposed in this thread is a carefully thought-out response to those needs, and it's based on a lot of conversations.

I appreciate your feedback and look forward to more thoughts and questions.

@aufdenkampe
Member Author

aufdenkampe commented Oct 30, 2018

@horsburgh, thanks for your thoughts!

Indeed, the proposed modifications mostly move existing fields from one table to another. That was the primary intent -- to group concepts that best belong together -- in order to simplify the input of data and increase the power of queries at extracting related data from multiple sources. Such interoperability and data-value-level integration from diverse data sources has always been the primary motivation of ODM2.

Back in 2013, when we were developing the ODM2 Variables & Results tables, we did a great job of massively improving those capabilities from ODM1 to ODM2, making great strides toward our goal. However, by early 2014 we already realized several weaknesses in our structures for Variables, but we decided that it was time to move forward with imperfection and table those issues. I remember those discussions, reasoning and decisions well; I recently reviewed our extensive notes on it all. The decisions we made at the time were good ones, and I supported them.

We now have new knowledge and greater experience to guide a revision to improve ODM2.

The current ODM2.0 Variables table is very constraining because it requires that every record is tied to a term from the single, official ODM2 VariableNameCV. Back in 2013 we added flexibility by allowing one-to-many relationships between a user-selected VariableCode and VariableNameCV terms, but that still doesn't provide a pathway to use a VariableCode that can't be considered a subset of an existing VariableNameCV term.

Also, as described in #160 (comment) above, our consensus from the April 2014 CUAHSI Ontology Capstone Meeting held at CUNY and the May 2016 Environmental Chemistry Names/Ontology Workshop held in Boulder was that we should no longer try to maintain variable name controlled vocabularies, but rather find an approach to map existing and new variable names to their core concepts.

These proposed ODM2.1 changes are designed to move that idea forward, by:

  • Moving MediumCV + TaxonomicClassifierID + SpeciationCV + QuantityKindCV from the Results table to the Variables table,
    • to allow explicit mapping of core "atomic" concepts to any Variable term that is used;
  • Creating an explicit (and optional) VariableSourceURI to replace ODM1/ODM2 VariableNameCV,
    • to create greater flexibility in the choice of variable naming conventions, with a clear link to established terms with web services.
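
A minimal sketch of what these two changes could enable, using Python's built-in `sqlite3` module. Only the column names MediumCV, TaxonomicClassifierID, SpeciationCV, QuantityKindCV and VariableSourceURI come from the list above; every other table, column, term and URI below (`VariableTerm`, `DataValue`, `TaxonomicClassifierName`, the example.org links) is a simplified assumption, not the actual ODM2.1_dev schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TaxonomicClassifiers (
    TaxonomicClassifierID   INTEGER PRIMARY KEY,
    TaxonomicClassifierName TEXT NOT NULL          -- assumed column name
);
CREATE TABLE Variables (
    VariableID            INTEGER PRIMARY KEY,
    VariableTerm          TEXT NOT NULL,           -- assumed: the user's own term
    VariableSourceURI     TEXT,                    -- optional external definition
    MediumCV              TEXT NOT NULL,
    TaxonomicClassifierID INTEGER NOT NULL REFERENCES TaxonomicClassifiers,
    SpeciationCV          TEXT,
    QuantityKindCV        TEXT NOT NULL
);
CREATE TABLE Results (
    ResultID   INTEGER PRIMARY KEY,
    VariableID INTEGER NOT NULL REFERENCES Variables,
    DataValue  REAL NOT NULL                       -- simplified: one value per result
);
""")

conn.executemany("INSERT INTO TaxonomicClassifiers VALUES (?, ?)",
                 [(1, "nitrate"), (2, "phosphate")])
conn.executemany("INSERT INTO Variables VALUES (?, ?, ?, ?, ?, ?, ?)", [
    (1, "NO3-N", "https://example.org/vocabA/NO3-N",
     "aqueous", 1, "N", "concentrationMassPerVolume"),
    (2, "Nitrate as N", None,
     "aqueous", 1, "N", "concentrationMassPerVolume"),
    (3, "PO4-P", "https://example.org/vocabA/PO4-P",
     "aqueous", 2, "P", "concentrationMassPerVolume"),
])
conn.executemany("INSERT INTO Results VALUES (?, ?, ?)",
                 [(1, 1, 0.42), (2, 2, 0.38), (3, 3, 0.05)])

# Select by core concepts, regardless of which term each data source used.
rows = conn.execute("""
    SELECT v.VariableTerm, r.DataValue
    FROM Results AS r
    JOIN Variables AS v ON v.VariableID = r.VariableID
    JOIN TaxonomicClassifiers AS t
      ON t.TaxonomicClassifierID = v.TaxonomicClassifierID
    WHERE v.MediumCV = ? AND t.TaxonomicClassifierName = ?
      AND v.SpeciationCV = ? AND v.QuantityKindCV = ?
    ORDER BY r.ResultID
""", ("aqueous", "nitrate", "N", "concentrationMassPerVolume")).fetchall()
print(rows)
```

The query never touches VariableTerm in its WHERE clause, so values recorded under "NO3-N" and "Nitrate as N" come back together while the phosphate result is excluded.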

Most other modifications (primarily field name changes) are relatively non-substantive, but were implemented to improve clarity of understanding, which has long been another high-level goal of ODM2 (which is why we didn't directly adopt most OGC O&M terminology).

@aufdenkampe
Member Author

@peckhams, thanks for chiming in with your thoughts and an update of how the CSDMS Standard Names effort has evolved into Geoscience Standard Names. From what I can see, the proposed ODM2.1 would interoperate very well with your system, conventions and endpoints, while also allowing existing CUAHSI HIS/ODM1 and ODM2.0 databases a path for forward migration to ODM2.1. That's awesome. I'm looking forward to digging in deeper to figure out how we can leverage this as much as possible, especially for USGS NWIS parameter codes.

BTW, although the main QUDT website looks unchanged since 2015 (when we harvested metadata from QUDT r1.1 for our Units and UnitsType/QuantityKind vocabularies; some background at ODM2/ODM2ControlledVocabularies#34), there was progress toward QUDT r2.0 in 2017 (see http://www.qudt.org/release2/qudt-catalog.html). Has the project been dead since then? It would be a shame, since they seemed to be more complete and more on the right track than any other effort.

@PleiadesAustralia

I am coming back to this conversation after considerable experience with Taxonomic Classifications but have a new quandary. This involves morphology and tissue types of a single biological organism. I am leaning towards expanding my use of the CV_Medium entity. Could a Medium be a Morphology or a particular tissue type? I am leaning towards this because it would allow the Results table to be used to compare specific aspects of different species. We could use the Category field to direct the application to particular Groups of species. What do you all think?

@aufdenkampe
Member Author

@PleiadesAustralia, thanks for reconnecting with this conversation! It's also great that you are returning with greater experience and interest in biological Taxonomic Classifiers.

I share your perspective that expanding the Medium CV is key for improving ODM2.1 utility for biological and biodiversity datasets. We've always had much more engagement in ODM2 from geoscientists than biologists and ecologists, and the Medium CV reflects that!

I also really like your idea that a massively expanded Medium CV would allow comparisons of specific tissues/morphologies across species.

Do you have a recommended or draft list of additional terms? As I mentioned in #160 (comment) above, I'm keen on learning what exists out there. Although in that comment, I guess I was suggesting that maybe those terms would apply to different TaxonomicClassifiers (i.e. species), to more easily connect to complex and external systems. I wonder if we might use the SpeciationID for this purpose? SpeciationID is a second instance of TaxonomicClassifiers, which was designed for the chemical use case of describing something like "nitrate expressed in units of nitrogen". Could we use that for describing the type of tissue, if the tissue taxonomy is complex and external (which I suspect it is)?

For clarity, a diagram of the new variables portion of ODM2.1, including TaxonomicClassifiers, is shown in #153 (comment)

@PleiadesAustralia

PleiadesAustralia commented May 12, 2020 via email
