
DM-39485: Use redis for work dispatch #62

Open · wants to merge 140 commits into base: main
Conversation

mfisherlevine (Contributor)

No description provided.

@mfisherlevine mfisherlevine force-pushed the tickets/DM-39485 branch 2 times, most recently from 4bd5e81 to f2c9680 Compare August 10, 2023 17:20
@mfisherlevine mfisherlevine force-pushed the tickets/DM-39485 branch 2 times, most recently from 6b571fc to 41ed889 Compare August 18, 2023 11:08
@mfisherlevine mfisherlevine force-pushed the tickets/DM-39485 branch 2 times, most recently from 98f7fd0 to bbea315 Compare August 29, 2023 11:45
@mfisherlevine mfisherlevine force-pushed the tickets/DM-39485 branch 6 times, most recently from 80f6757 to 913a85e Compare October 13, 2023 15:01
@mfisherlevine mfisherlevine force-pushed the tickets/DM-39485 branch 2 times, most recently from b9b4fec to b2ab28b Compare February 18, 2024 12:43
@mfisherlevine mfisherlevine force-pushed the tickets/DM-39485 branch 3 times, most recently from 62693e1 to 91900ef Compare March 20, 2024 18:26
@mfisherlevine mfisherlevine force-pushed the tickets/DM-39485 branch 11 times, most recently from 92c2f53 to 261fa4e Compare March 31, 2024 13:46
@mfisherlevine mfisherlevine marked this pull request as ready for review May 3, 2024 17:03
@TallJimbo left a comment

Lots of comments, some minor, some about the future, but really not that many given the PR size.

I did not look too closely at:

  • logic where I didn't have enough context/background to understand it (e.g. decoding redis objects, non-redis channels)
  • scripts (that's all machine-generated, right, for gitops, except scripts/meta?)
  • places where existing "TODO" comments suggested you already intended to rewrite that code substantially.

@@ -73,3 +73,11 @@ comCamMetadataShardPath: '/sdf/home/m/mfl/u/rubintv/LSSTComCam/sidecar_metadata/
# summit-like configs
tmaMetadataPath: '/sdf/home/m/mfl/u/rubintv/tma/sidecar_metadata'
tmaMetadataShardPath: '/sdf/home/m/mfl/u/rubintv/tma/sidecar_metadata/shards'

# Redis work distribution configs


Is this comment supposed to be here? Looks like a pipeline yaml file to me, not a Redis config (same in other similar files).

# Butler paths - use full paths not aliases here as path existence is checked
butlerPath: '/sdf/group/rubin/repo/embargo/butler.yaml'

# # data paths


Are these doubled #s significant?

channelName,
watcherType,
doRaise,
queueName=None, # only needed for redis watcher. Not the neatest but will do for now


Args here are now out of sync with docs.

"""
Convert a pipelineGraph to bytes.

Upstream this to pipe_base after OR3.


Given how things have gone since we first added this, we might want to evaluate whether we can drop the stuff that uses it here instead. I may be wrong, but it seemed like being able to change the pipeline without restarting a pod was looking less important.
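If the pipeline-graph-to-bytes helper does stay, the core of it could be as simple as a round-trip through a binary serializer. This is a generic sketch using pickle, not the actual pipe_base API (which may well provide its own versioned serialization that should be preferred; `pipeline_graph_to_bytes`/`pipeline_graph_from_bytes` are hypothetical names):

```python
import pickle


def pipeline_graph_to_bytes(graph) -> bytes:
    """Serialize a pipeline-graph-like object to bytes.

    Hypothetical sketch: a real implementation should prefer any
    dedicated, versioned serialization that pipe_base itself provides.
    """
    return pickle.dumps(graph)


def pipeline_graph_from_bytes(data: bytes):
    """Inverse of pipeline_graph_to_bytes."""
    return pickle.loads(data)
```

The appeal of a dedicated format over pickle is forward compatibility across software versions, which matters if graphs are stored in redis across pod restarts.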

"""
A dataclass representing a payload.

These go in minimal, but come out full, by using the butler.


I generally very much recommend using pydantic for structs that get converted to/from JSON, but this expansion-by-butler at least makes that less of an obvious win.
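To make the "minimal in, full out" pattern concrete, here is a stdlib-only sketch of the shape being discussed. All names (`Payload`, `dataId`, `run`, `expander`) are illustrative, not the actual class from this PR; the `expander` stands in for the butler lookup that fills in the full records:

```python
import json
from dataclasses import dataclass, field


@dataclass
class Payload:
    """Sketch of a payload stored minimally and expanded on read.

    Field names are hypothetical; the butler-backed expansion is
    modeled as a plain callable for illustration.
    """

    dataId: dict
    run: str
    expanded: dict = field(default_factory=dict)  # filled in on read

    def to_json(self) -> str:
        # Only the minimal fields are serialized into redis.
        return json.dumps({"dataId": self.dataId, "run": self.run})

    @classmethod
    def from_json(cls, data: str, expander=None) -> "Payload":
        raw = json.loads(data)
        payload = cls(dataId=raw["dataId"], run=raw["run"])
        if expander is not None:
            # e.g. a butler query that attaches full dimension records
            payload.expanded = expander(payload.dataId)
        return payload
```

A pydantic model would add validation on `from_json`, but as the comment notes, the butler-driven expansion step sits outside what pydantic validates either way.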

except redis.exceptions.ConnectionError as e:
raise RuntimeError("Could not connect to redis - is it running?") from e
except Exception as e:
raise RuntimeError(f"Unexpected error connecting to redis: {e}")


Suggested change:
- raise RuntimeError(f"Unexpected error connecting to redis: {e}")
+ raise RuntimeError(f"Unexpected error connecting to redis: {e}") from e


Or maybe just let this raise...I'm not sure your RuntimeError is adding anything useful here.
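For reference, the practical difference the `from e` makes: it records the original exception as `__cause__`, so the traceback shows "The above exception was the direct cause of" rather than the weaker "During handling of the above exception, another exception occurred". A minimal illustration (the `connect` callable is a stand-in, not the PR's code):

```python
def connect_or_raise(connect):
    """Wrap a connection attempt, chaining the original error.

    Using ``raise ... from e`` preserves the underlying exception as
    ``__cause__`` for debugging, instead of only the implicit
    ``__context__``.
    """
    try:
        return connect()
    except Exception as e:
        raise RuntimeError("Unexpected error connecting to redis") from e
```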

The name of the task that has finished processing.
processingId : `int`
Either the exposureId or visitId of the payload that has finished
being processed for the specified task.


I'd seen processingId in a few places before but didn't see what it meant until I got here. I think it's going to need some kind of abstraction to work with the AOS pipelines, from what I've seen of them.

This is just one example of what I think is the biggest weakness of this system in its current state, so I'm going to use this thread as a place to start that conversation: different kinds of data IDs (like visits and exposures) are all over the place, and in some cases they're instrument-specific. Instead, I think the redis, control-layer, and worker code should recognize just three kinds of processing:

  • per-detector (these layers do need to know about detectors because that's the worker-affinity dimension);
  • gather-of-detectors (per exposure, per visit, per intra/extra-focal pair-of-visits, maybe per-group someday);
  • cumulative nightly.

I think we can introspect a PipelineGraph to split its tasks up into a sequence of those (or die if that can't be done). The ambitious version of this goal gets rid of all mentions of expRecord from anything other than user-display messages (e.g. logs, plots) and things that actually need exposure metadata (I'm thinking it'd be replaced by some ABC), and it'd get rid of all mention of "steps". But that ambitious version may not be doable until older Rapid Analysis channel types are migrated, and it may never be worth doing at all. The argument for doing it is that it prepares us for the next time somebody comes up with a new problem for Rapid Analysis to solve (i.e. a new pipeline for it to run), by separating the orchestration logic further from the assumptions about the kinds of pipelines it runs.
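The classification proposed above could start from each task's dimensions. A hypothetical sketch (real PipelineGraph introspection would read each task node's dimensions; the function name and the exact dimension sets here are assumptions):

```python
def classify_task(dimensions: set[str]) -> str:
    """Hypothetically classify a task into one of the three kinds of
    processing proposed above, based on its dimension set.

    Per-detector wins first because detector is the worker-affinity
    dimension; anything keyed by an exposure/visit-like dimension but
    not detector is a gather; everything else is cumulative nightly.
    """
    if "detector" in dimensions:
        return "per-detector"
    if dimensions & {"exposure", "visit", "group"}:
        return "gather-of-detectors"
    return "cumulative-nightly"
```

The "or die" branch would then be: raise if the resulting sequence interleaves these kinds in a way the orchestration can't schedule.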


This is because, when a butler watcher restarts, it will always find
the most recent exposure record in the repo. We don't want to always
issue these for processing, so we keep a list of what's been seen.


Is the redis content persistent, then, so if the redis DB goes down it remembers this sort of thing when it comes back up?
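Whether the seen-exposure list survives a restart depends on how the redis server's persistence is configured (AOF, RDB snapshots, or neither). A small helper could answer that at startup; this sketch takes any client exposing redis-py's `config_get()` (the helper name is hypothetical):

```python
def persistence_mode(client) -> str:
    """Report how (or whether) a redis server persists data.

    ``client`` is any object with redis-py's ``config_get()``. Returns
    "aof" if append-only-file is on, "rdb" if snapshotting is
    configured, "none" if a restart would lose everything - including
    the seen-exposure list discussed above.
    """
    if client.config_get("appendonly").get("appendonly") == "yes":
        return "aof"
    save = client.config_get("save").get("save", "")
    return "rdb" if save else "none"
```

With a real server this would be called as `persistence_mode(redis.Redis(...))`; returning "none" at startup would be a good moment to warn loudly.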

writeDataIdFile,
)

__all__ = ("FileWatcher", "RedisWatcher", "ButlerWatcher")


We usually put __all__ before all of the imports (except __future__ imports).

return f"INCOMING-{instrument}-raw"


class RedisHelper:


Some documentation of the Redis schema would be really useful: e.g. keys, types, and what they mean. Absent that, I can't really claim I understood how various states for workers are represented and hence how different methods interact.
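One lightweight way to get that schema documentation: centralize key construction in documented helper functions, so the key namespace is enumerable in one place. The first pattern below is taken from the diff above; the second key name is an assumption, purely to show the shape:

```python
def incoming_key(instrument: str) -> str:
    """Redis key for raw exposures awaiting dispatch for ``instrument``.

    This pattern appears in the diff above; documenting its type
    (e.g. list vs set) and lifecycle here is the point.
    """
    return f"INCOMING-{instrument}-raw"


def worker_queue_key(instrument: str, detector: int) -> str:
    """Hypothetical per-detector worker queue (key name is an
    assumption, not from this PR)."""
    return f"WORKER-{instrument}-{detector}"
```

Each helper's docstring can then state the redis type, who writes it, who consumes it, and when it is deleted, which would answer the worker-state questions raised here.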
