new ID Datastructure #5315

timosachsenberg · 2021-05-11T08:01:31Z

timosachsenberg
May 11, 2021
Maintainer

Hi @hendrikweisser @jpfeuffer @cbielow and others,
I stumbled over some design issues that might make adoption of the ID data structure
more difficult but could be fixed with some minor changes.

Things I found surprising:
Functions that register an entry, e.g.:

IdentificationData::registerInputFile(const InputFile& file)

either register a new input file (as the name suggests) or merge information from
the given InputFile into an existing entry.

This violates the principle of least surprise, since a user can't tell from the calling code
whether a new entry is added or an existing one is modified.
Even if the user could tell that an entry is modified (which is usually not the case when registering in a loop),
the semantics of the merge are poorly documented and surprising.
But there are also more subtle issues.

E.g.:

      /// Merge in data from another object
      InputFile& operator+=(const InputFile& other)
      {
        if (experimental_design_id.empty())
        {
          experimental_design_id = other.experimental_design_id;
        }
        primary_files.insert(other.primary_files.begin(),
                             other.primary_files.end());
        return *this;
      }

First, it uses operator+=, which should only be overloaded if it is crystal clear what happens. This is not the case here, and I agree with more experienced authorities on this.
Second, the semantics of setting the experimental_design_id are very implicit, hidden from
the user, and not documented. Note that now, e.g., the order in which the merges are performed
may influence what the final experimental_design_id will be. This will likely lead to hard-to-find errors in the future,
e.g., if order can't be guaranteed in multi-threaded programs.
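
To make the order dependence concrete, here is a minimal sketch. It uses a simplified stand-in for InputFile that models only the two members from the snippet above (an assumption, not the actual OpenMS class); mergeInOrder and demoOrderDependence are hypothetical helpers for illustration only.

```cpp
#include <cassert>
#include <set>
#include <string>

// Simplified stand-in for InputFile (assumption: only the two members
// used in the quoted operator+= are modeled here).
struct InputFile
{
  std::string experimental_design_id;
  std::set<std::string> primary_files;

  // Same merge logic as in the quoted operator+=
  InputFile& operator+=(const InputFile& other)
  {
    if (experimental_design_id.empty())
    {
      experimental_design_id = other.experimental_design_id;
    }
    primary_files.insert(other.primary_files.begin(),
                         other.primary_files.end());
    return *this;
  }
};

// Hypothetical helper: merge `first`, then `second`, into an entry that
// has no experimental design ID yet.
inline InputFile mergeInOrder(const InputFile& first, const InputFile& second)
{
  InputFile target; // experimental_design_id starts out empty
  target += first;
  target += second;
  return target;
}

// Usage: the final ID depends on which file happens to be merged first,
// while the set of primary files does not.
inline void demoOrderDependence()
{
  const InputFile a{"design_A", {"a.mzML"}};
  const InputFile b{"design_B", {"b.mzML"}};
  assert(mergeInOrder(a, b).experimental_design_id == "design_A");
  assert(mergeInOrder(b, a).experimental_design_id == "design_B");
  assert(mergeInOrder(a, b).primary_files == mergeInOrder(b, a).primary_files);
}
```

Whichever file is merged first wins, and the second ID is dropped without any signal to the caller.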

It would be much safer to have, e.g., a dedicated mergePrimaryFiles function (and to disallow overwriting of the experimental design ID).
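
A dedicated merge function of that kind might look roughly like the following sketch. The struct is again a simplified stand-in, and the exact signature of mergePrimaryFiles is an assumption; nothing here is taken from the OpenMS code base.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical, simplified InputFile; not the actual OpenMS class.
struct InputFile
{
  std::string experimental_design_id;
  std::set<std::string> primary_files;

  // Hypothetical named merge: only adds primary files and never touches
  // experimental_design_id, so the result does not depend on merge order.
  void mergePrimaryFiles(const InputFile& other)
  {
    primary_files.insert(other.primary_files.begin(),
                         other.primary_files.end());
  }
};

// Usage: the design ID of the target is left untouched by the merge.
inline void demoMergePrimaryFiles()
{
  InputFile target{"design_A", {"a.mzML"}};
  const InputFile other{"design_B", {"b.mzML"}};
  target.mergePrimaryFiles(other);
  assert(target.experimental_design_id == "design_A"); // unchanged
  assert(target.primary_files.size() == 2);
}
```

The function name states exactly what is merged, and a set union is commutative, so the outcome is the same for any merge order.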

These problems also occur in the few other methods that try to merge in information and I think now would be a good time to address those.

timosachsenberg · 2021-05-11T10:37:10Z

timosachsenberg
May 11, 2021
Maintainer Author

Also, e.g., here:

  IdentificationData::ObservationRef
  IdentificationData::registerObservation(const Observation& obs)
  {
    // reference to spectrum or feature is required:
    if (!no_checks_ && obs.data_id.empty())
    {
      String msg = "missing identifier in observation";
      throw Exception::IllegalArgument(__FILE__, __LINE__,
                                       OPENMS_PRETTY_FUNCTION, msg);
    }
    // ref. to input file may be missing, but must otherwise be valid:
    if (!no_checks_ && obs.input_file_opt &&
        !isValidReference_(*obs.input_file_opt, input_files_))
    {
      String msg = "invalid reference to an input file - register that first";
      throw Exception::IllegalArgument(__FILE__, __LINE__,
                                       OPENMS_PRETTY_FUNCTION, msg);
    }

    // can't use "insertIntoMultiIndex_" because Observation doesn't have the
    // "steps_and_scores" member (from ScoredProcessingResult)
    auto result = observations_.insert(obs);
    if (!result.second) // existing element - merge in new information
    {
      observations_.modify(result.first, [&obs](Observation& existing)
                           {
                             existing += obs;
                           });
    }

    observation_lookup_.insert(uintptr_t(&(*result.first)));

    // @TODO: add processing step (currently not supported by Observation)
    return result.first;
  }

1 reply

hendrikweisser May 11, 2021
Collaborator

@timosachsenberg: Before you go looking for more examples, the merging should happen in pretty much every register... function. It's deliberate.

hendrikweisser · 2021-05-11T10:43:07Z

hendrikweisser
May 11, 2021
Collaborator

It's a deliberate design decision that the register... functions attempt to merge information when an element with the same key already exists. This is (or should be?) done consistently in IdentificationData.
You make a very valid point that this is currently poorly documented. However, I don't agree with the "principle of least surprise" argument. If you register an item in IdentificationData, the main expectation is that afterwards the information from the item will be available in IdentificationData. Generally it doesn't matter if there was already an item with the same key - although if there was, the goal is to preserve as much information as possible, hence the merge. (In cases where preexistence matters, you can just check for it with find.) In many use cases for IdentificationData (e.g. merging results from multiple search engines) it's very much expected that the "same" item will be registered multiple times. However, some "metadata" (e.g. search engine scores) may be different, which makes merging necessary. (Experience also shows that other data annotations may be more or less complete depending on the source, so it's desirable to support merging data quite generally - not restricted to scores etc.)
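
As a rough illustration of the "check with find first" pattern mentioned above: the sketch below uses a std::map as a stand-in container, and InputFileRegistry, its contains helper, and the name key are all hypothetical. IdentificationData itself uses different container types, so this only shows the idea.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical stand-in for an InputFile entry, keyed by file name.
struct InputFile
{
  std::string name;
  std::set<std::string> primary_files;
};

// Minimal register-or-merge container (an assumption for illustration).
class InputFileRegistry
{
public:
  // Inserts `file`, or merges its primary files into the existing entry.
  const InputFile& registerInputFile(const InputFile& file)
  {
    auto result = files_.emplace(file.name, file);
    if (!result.second) // existing element - merge in new information
    {
      result.first->second.primary_files.insert(file.primary_files.begin(),
                                                file.primary_files.end());
    }
    return result.first->second;
  }

  // Lets callers check for preexistence before registering.
  bool contains(const std::string& name) const
  {
    return files_.find(name) != files_.end();
  }

private:
  std::map<std::string, InputFile> files_;
};

// Usage: a caller that cares about preexistence checks before registering.
inline void demoCheckBeforeRegister()
{
  InputFileRegistry registry;
  const InputFile first{"search.idXML", {"a.mzML"}};
  const InputFile second{"search.idXML", {"b.mzML"}};
  assert(!registry.contains(first.name)); // will be added as a new entry
  registry.registerInputFile(first);
  assert(registry.contains(second.name)); // registering now will merge
  const InputFile& merged = registry.registerInputFile(second);
  assert(merged.primary_files.size() == 2);
}
```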

(To be continued...)

hendrikweisser · 2021-05-11T11:16:40Z

hendrikweisser
May 11, 2021
Collaborator

Regarding some of the other points:

operator+=: No strong feelings here, we can certainly use a different function name. It just has to be consistent, so it can be used in template functions.
Merging semantics: It makes sense to identify conflicting information and abort with an exception in such cases. This is not done consistently yet. (However, in cases of missing information like the InputFile.experimental_design_id example that Timo cited above, I think it's fine to fill in information silently.)
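
The merge policy described here (fill in missing information silently, abort on conflicts) could be sketched as follows. mergeInputFile is a hypothetical free function on a simplified stand-in struct, and std::invalid_argument is used in place of an OpenMS exception class.

```cpp
#include <cassert>
#include <set>
#include <stdexcept>
#include <string>

// Hypothetical, simplified InputFile; not the actual OpenMS class.
struct InputFile
{
  std::string experimental_design_id;
  std::set<std::string> primary_files;
};

// Hypothetical merge policy: a missing ID is filled in silently, while
// two different non-empty IDs are treated as a conflict and rejected.
inline void mergeInputFile(InputFile& existing, const InputFile& other)
{
  if (existing.experimental_design_id.empty())
  {
    existing.experimental_design_id = other.experimental_design_id;
  }
  else if (!other.experimental_design_id.empty() &&
           other.experimental_design_id != existing.experimental_design_id)
  {
    // Thrown before any further change, so `existing` stays unmodified.
    throw std::invalid_argument("conflicting experimental design IDs");
  }
  existing.primary_files.insert(other.primary_files.begin(),
                                other.primary_files.end());
}

// Usage: filling in succeeds, a conflicting ID throws.
inline void demoConflictCheck()
{
  InputFile entry{"", {"a.mzML"}};
  mergeInputFile(entry, InputFile{"design_A", {"b.mzML"}});
  assert(entry.experimental_design_id == "design_A");

  bool threw = false;
  try
  {
    mergeInputFile(entry, InputFile{"design_B", {"c.mzML"}});
  }
  catch (const std::invalid_argument&)
  {
    threw = true;
  }
  assert(threw);
  assert(entry.primary_files.size() == 2); // conflicting merge left no trace
}
```

With the conflict check in place, merging two entries with different non-empty IDs fails in either order instead of silently keeping whichever came first.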

hendrikweisser · 2021-05-11T11:27:46Z

hendrikweisser
May 11, 2021
Collaborator

Some points worth highlighting:

Merging of data is quite fundamental to how IdentificationData works. It's a consequence of storing "primary data elements" (everything with a register... function) nonredundantly.
I believe the problems encountered by Timo and Julianus are due to imperfect implementation (not enough checking for conflicts), rather than the general principle of merging.
We can discuss whether register... functions should signal if there's a preexisting item with the same key. I have to say that in my use of IdentificationData, I've not found a need for this.
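
If signalling were wanted, one option would be to mirror the return type of std::map::insert and hand back a flag alongside the reference. The sketch below assumes this; SignallingRegistry and its return type are hypothetical and do not reflect the current IdentificationData interface.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>

// Hypothetical, simplified entry type keyed by name (not the OpenMS class).
struct InputFile
{
  std::string name;
  std::set<std::string> primary_files;
};

class SignallingRegistry
{
public:
  using Ref = std::map<std::string, InputFile>::const_iterator;

  // Returns the reference plus a flag: true if a new entry was added,
  // false if the data was merged into a preexisting entry.
  std::pair<Ref, bool> registerInputFile(const InputFile& file)
  {
    auto result = files_.emplace(file.name, file);
    if (!result.second) // existing element - merge in new information
    {
      result.first->second.primary_files.insert(file.primary_files.begin(),
                                                file.primary_files.end());
    }
    return {result.first, result.second};
  }

private:
  std::map<std::string, InputFile> files_;
};

// Usage: callers that don't care can ignore the flag; others can branch on it.
inline void demoSignallingRegister()
{
  SignallingRegistry registry;
  auto first = registry.registerInputFile({"search.idXML", {"a.mzML"}});
  auto second = registry.registerInputFile({"search.idXML", {"b.mzML"}});
  assert(first.second);   // newly added
  assert(!second.second); // merged into the existing entry
  assert(second.first->second.primary_files.size() == 2);
}
```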
