schema: make the field 'collections' compatible between records and documents #3630

Merged
merged 4 commits into from May 22, 2024

Contributor

@psaiz psaiz commented Apr 23, 2024

No description provided.

@psaiz psaiz linked an issue Apr 23, 2024 that may be closed by this pull request
@tiborsimko
Member

The notion of "collection" was used many years ago for links such as:

For many years now, these have been redirected to the faceted search:

(Note that the redirections are not perfect, e.g. the 1st and 4th link
examples above are redirected nicely, the 2nd and 3rd are not!)

That is to say that the "collection" notion is mostly of historical interest,
and we still use it just for persistence purposes, in order to serve some good
content for the old links that may still be lying somewhere on the web (even
though they have not been advertised for many years now).

Therefore, if we have in the docs values such as:

    "collections": [
      {
        "experiment": "CMS"
      }
    ],

This pull request transforms this to:

    "collections": [
      "CMS"
    ],
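A minimal sketch of this kind of flattening, assuming each entry is a single-key object and that only the values are kept. `flatten_collections` is a hypothetical helper for illustration, not the code in this pull request:

```python
def flatten_collections(doc):
    """Return a copy of doc whose 'collections' entries are plain strings."""
    if "collections" not in doc:
        return dict(doc)
    flat = []
    for entry in doc["collections"]:
        if isinstance(entry, dict):
            # Assumption: keep only the values, {"experiment": "CMS"} -> "CMS".
            for value in entry.values():
                if isinstance(value, list):
                    flat.extend(value)
                else:
                    flat.append(value)
        else:
            # Already a plain value; keep it unchanged.
            flat.append(entry)
    result = dict(doc)
    result["collections"] = flat
    return result
```

Calling `flatten_collections({"collections": [{"experiment": "CMS"}]})` would yield a document whose `collections` field is a list of strings, the same shape the records use.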

This is OK, but it should not be fully necessary, because the document will be
matched anyway by the above faceted-search redirection, which is using the
experiment field, which the docs already have.

I would therefore propose to go as far as to remove this field altogether in
the docs, because we don't store a value such as "CMS" in the records'
collection field either. (Because "CMS" is an umbrella value expanded on by
search redirection).

IOW, we should not really need to preserve the collections.experiment field
in the docs.

And for more complex field values, such as:

    "collections": [
      {
        "experiment": "CMS"
      },
      {
        "primary": "education"
      },
      {
        "year": "2010-2012"
      }
    ],

The education/research dichotomy was dropped a couple of years after the
initial web design, and links such as:

These are now simply being redirected to show some convenient content. (Again
some redirections work well, some work less well.)

The "year" collection is also superseded by the faceted search, where the
documents do show up when someone selects 2010. (We did not have any prominent
year-based collection browsing unlike education/research which was part of the
original web design.)

Anyway, long story short, the most important usage of the collection
concept was in the records, not in docs. And in records we have the following
values only:

$ for file in cernopendata/modules/fixtures/data/records/*.json; do jq -rS '.[].collections[]' $file; done | sort -u
ALICE-Derived-Datasets
ALICE-Learning-Resources
ALICE-Reconstructed-Data
ALICE-Tools
ATLAS-Derived-Datasets
ATLAS-Higgs-Challenge-2014
ATLAS-Learning-Resources
ATLAS-Simulated-Datasets
ATLAS-Tools
Author-Lists
CMS-Condition-Data
CMS-Configuration-Files
CMS-Derived-Datasets
CMS-Learning-Resources
CMS-Luminosity-Information
CMS-Open-Data-Instructions
CMS-Primary-Datasets
CMS-Simulated-Datasets
CMS-Tools
CMS-Trigger-Information
CMS-Validated-Runs
CMS-Validation-Utilities
Data-Policies
JADE-Computing-Notes
JADE-Logbooks
JADE-Tools
LHCb-Collision-Datasets
LHCb-Derived-Datasets
LHCb-Learning-Resources
LHCb-Tools
OPERA-Detector-Events
OPERA-Electronic-Detector-Datasets
OPERA-Emulsion-Detector-Datasets
PHENIX-Derived-Datasets
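
For reference, roughly the same survey can be sketched in Python. This assumes each fixture file holds a JSON array of records and that `collections` is a list of strings, as in the output above; `unique_collections` is a hypothetical helper name:

```python
import glob
import json


def unique_collections(pattern):
    """Collect the distinct 'collections' values across record fixture files."""
    values = set()
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as handle:
            records = json.load(handle)
        for record in records:
            # Records without the field are simply skipped.
            values.update(record.get("collections", []))
    return sorted(values)


if __name__ == "__main__":
    pattern = "cernopendata/modules/fixtures/data/records/*.json"
    for value in unique_collections(pattern):
        print(value)
```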

(We should strive to return something good for links /collection/<value> such
as http://opendata.cern.ch/collection/CMS-Learning-Resources, which is mostly
not the case. But that would call for another redirection-to-facets fix in app
views, not for metadata massaging.)

In documents, the collection field values are more varied, such as:

  "experiment": "ALICE"
  "experiment": "ATLAS"
  "experiment": "CMS"
  "experiment": "JADE"
  "experiment": "LHCb"
  "experiment": "OPERA"
    "Guide"
  "primary": ""
  "primary": "documentation"
  "primary": "Documentation",
  "primary": "education"
  "primary": "News"
  "primary": "research"
  "primary": "VM"
  "secondary": [
  "year": "2010"
  "year": "2010-2012"
  "year": "2011"
  "year": "2013"
  "year": "2015"
  "year": "2015-2016"
  "year": "2016"

See that they mostly don't match the record values, and they are mostly
emulating the experiment field and/or years and/or outdated education/research
dichotomy.

Therefore, since the most important collection redirection
(/collection/<experiment>) is using the experiment field (that we already
have for docs), and since the other collection values such as
education/research are not critical and/or have not been used for years, I would
propose to consider dropping the collection field in all the docs fully for
the sake of simplicity.

(We already don't use the collection field in several docs pages such as
"cod-about", "cod-privacy-policy", "cod-terms-of-use",
"simulated-dataset-categories", so if it is not really mandatory,
why not think about dropping it?)
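
Dropping the field could look roughly like the sketch below. It assumes the docs are loaded as a list of dictionaries; `drop_collections` is a hypothetical helper, not an existing function in the code base:

```python
def drop_collections(docs):
    """Return copies of the docs without the 'collections' key."""
    cleaned = []
    for doc in docs:
        # Everything else, including the 'experiment' field that the
        # faceted-search redirection relies on, is kept as-is.
        cleaned.append({k: v for k, v in doc.items() if k != "collections"})
    return cleaned
```

Docs that never had the field, such as the pages mentioned above, pass through unchanged.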

P.S. Regardless of whether we go for my suggestion, we should still look into
redirections, because many redirection rules in views.py do not seem to be
working right now. Some redirections should be easy to fix, some are important
since they were used in published papers and materials (see past redirection
issues, etc.). I think that if we do fix redirections such as
/collection/CMS-Primary-Datasets, we could even think of dropping the
collection field from the record JSONs too; it should not be really necessary
metadata-wise, since we have the type.primary and type.secondary fields
that basically cover the same information-storage need as the good old
collection field.
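
To make the redirection idea concrete, here is a rough sketch of mapping an old collection name onto a faceted-search URL. The `/search` path and the `experiment` and `type` query parameters are hypothetical placeholders, not the actual rules in views.py; a real fix would probably need a lookup table onto the type.primary and type.secondary values:

```python
from urllib.parse import urlencode

# Experiment prefixes seen in the record collection values listed above.
EXPERIMENTS = ("ALICE", "ATLAS", "CMS", "JADE", "LHCb", "OPERA", "PHENIX")


def collection_to_search_url(name):
    """Map an old collection name to a faceted-search URL (hypothetical layout)."""
    prefix, _, rest = name.partition("-")
    params = {}
    if prefix in EXPERIMENTS:
        params["experiment"] = prefix
        if rest:
            # Hypothetical facet name; "Primary-Datasets" -> "Primary Datasets".
            params["type"] = rest.replace("-", " ")
    else:
        # Names such as "Author-Lists" carry no experiment prefix.
        params["type"] = name.replace("-", " ")
    return "/search?" + urlencode(params)
```

With this layout, an umbrella value such as "CMS" maps onto the experiment facet alone, which is the case the existing redirection already handles via the experiment field.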

@psaiz
Contributor Author

psaiz commented May 3, 2024

Thanks for the comments. I like the idea to simplify things, so I will use this PR to drop the collections from the docs, create a different issue/PR that will deal with the redirections, and, once those two things are working, we can create another issue/PR for the records.

@psaiz psaiz force-pushed the collections branch 2 times, most recently from a493a0f to bbc6f9d Compare May 13, 2024 09:53
@psaiz psaiz merged commit bfb2364 into cernopendata:master May 22, 2024
8 checks passed
Development

Successfully merging this pull request may close these issues.

Conflicts in the attributes of the opensearch schemas
3 participants