Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IntervalList cautious insert #943

Draft
wants to merge 9 commits into
base: master
Choose a base branch
from
Draft

Conversation

CBroz1
Copy link
Member

@CBroz1 CBroz1 commented Apr 23, 2024

Description

We have previously discussed redundancy in IntervalList in #778. This PR makes pervasive edits to the codebase to check existing entries before inserting new ones. This comes with the drawback of future mismatches between substrings of interval_list_name and other features of a given key. I would argue that substrings should not be used in this manner, and the confusion presented will be minor, as the other contents of the key takes precedence. The use of nwb_file_name in IntervalList prevents cross-session reuse of keys, so confusion should be minimized in that respect.

I've marked as draft pending the addition of documentation on this change. I welcome feedback on the structure of cautious_insert and its inclusion across the codebase

Checklist:

  • This PR should be accompanied by a release: (yes/no/unsure)
  • If release, I have updated the CITATION.cff
  • This PR makes edits to table definitions: (yes/no)
  • If table edits, I have included an alter snippet for release notes.
  • I have updated the CHANGELOG.md with PR number and description.
  • I have added/edited docs/notebooks to reflect the changes

@edeno
Copy link
Collaborator

edeno commented Apr 24, 2024

@lfrank @khl02007 @samuelbray32 I just wanted to collect input on this change as it could be a big one.

Copy link
Collaborator

@samuelbray32 samuelbray32 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple high-level comments:

  • I think another enhancement that fits with this PR would be a IntervalList.cleanup(restriction) function.
    • Purpose would be to check if interval(s) matching the restriction are currently referenced anywhere in the database and either:
      • delete them if not
      • return a list of referenced locations if they are
    • Could potentially overwrite the Mixin class's delete function to run this cleanup call if interval_list_name is in the table's keys. This would help solve issues of orphaned entries

src/spyglass/common/common_interval.py Show resolved Hide resolved
src/spyglass/common/common_interval.py Outdated Show resolved Hide resolved
src/spyglass/lfp/analysis/v1/lfp_band.py Show resolved Hide resolved
)

if not exists:
self.insert1(key, *args, **kwargs)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this do a check if the exact key is already present whether the interval lists match? Would resolve some of the comments I have in LFP and LFPBand below

lfp_band_valid_times, new_timestamps
)
# check that the valid times are the same
assert np.isclose(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cautious insert could still create errors without doing this explicit check.

  • Say that this interval list name has been previously inserted.
  • the band was deleted the interval list stays
  • parameters were changed in the band that results in a slightly different interval list
  • cautious_insert finds no matches on times, tries to insert with this name
    • We would want an error in the case that it's different

},
replace=True,
)
elif not np.allclose(tmp_valid_times[0], lfp_valid_times):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar concern about still needing to check for matching as in comment for LFPBand

replace=True,
)
approx_name=True,
) # removed replace=True 2024-04-22
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could make redoing this step tricky for users. They would have to know they need to manually delete the interval list when deleting here

@CBroz1
Copy link
Member Author

CBroz1 commented May 6, 2024

- I think another enhancement that fits with this PR would be a `IntervalList.cleanup(restriction)` function.

The data collection I did for #778 suggested there were relatively few orphans to worry about. I've added cleanup/nightly_cleanup functions to mirror those on the nwbfile tables

@CBroz1
Copy link
Member Author

CBroz1 commented May 6, 2024

Thanks for your review, @samuelbray32! It sounds like replace=True is currently serving the purpose of freeing the user from the need to delete an existing orphan during a re-insert. I currently only check valid_times and, if another key exists, run the fk-ref to that key. By doing a deeper check of the key against existing entries, we could infer the user is in this replace case and return the existing key. Is that right?

Under the current model of 1 IntervalList entry per downstream entry, we assume the key we're using is unique and not used elsewhere when we replace. I'm concerned about the possible edge case in which pipeline B replaces the times for pipeline A's data, violating data integrity, so I would want to avoid replace altogether. My new feature of 'reuse existing entries' makes the data integrity piece all the more important

@samuelbray32
Copy link
Collaborator

I'm concerned about the possible edge case in which pipeline B replaces the times for pipeline A's data, violating data integrity, so I would want to avoid replace altogether.

Agreed, replace seems exceedingly dangerous under this model. I'm thinking of the case where the user (or a table) tries to insert an entry that matches an existing primary key but differs in the actual interval values. replace would have handled this before, but maybe there should be some analagous flag for the insert call that doesn't replace the existing entry in this partially-matching case but adds some hash to the key you're trying to enter and inserts that instead

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants