Should relation data be visible to charm code from within a relation-broken hook? #888

Open
sed-i opened this issue Jan 12, 2023 · 20 comments
Labels
24.04 needs design needs more thought or spec

Comments

@sed-i
Contributor

sed-i commented Jan 12, 2023

Currently, harness first emits relation-broken and only then invalidates the data.

This means that relation data is still accessible by charm code while inside the relation-broken hook. Is that intentional?

Based on the event page it sounds like charm code shouldn't be able to see any data when inside the broken hook.

operator/ops/testing.py

Lines 697 to 702 in 4d38ef7

for unit_name in rel_list_map[relation_id].copy():
    self.remove_relation_unit(relation_id, unit_name)
self._emit_relation_broken(relation_name, relation_id, remote_app)
if self._model is not None:
    self._model.relations._invalidate(relation_name)  # pyright:ReportPrivateUsage=false
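
To make the ordering concern concrete, here is a deliberately simplified toy model; it is not the ops implementation. `ToyHarness`, `seen_in_broken`, and the `invalidate_first` flag are hypothetical names invented for illustration, and "invalidate" is modelled as simply clearing a dict.

```python
class ToyHarness:
    """Hypothetical stand-in for the Harness; not the real ops code."""

    def __init__(self, relation_data):
        self.relation_data = dict(relation_data)  # data visible to charm code
        self.seen_in_broken = None  # snapshot taken by the handler

    def _on_relation_broken(self):
        # Stand-in for the charm's handler: record what is visible right now.
        self.seen_in_broken = dict(self.relation_data)

    def _invalidate(self):
        # Assumption: invalidation is modelled as dropping all data.
        self.relation_data.clear()

    def remove_relation(self, invalidate_first=False):
        if invalidate_first:
            self._invalidate()
            self._on_relation_broken()
        else:
            # Order quoted above: emit relation-broken, then invalidate.
            self._on_relation_broken()
            self._invalidate()


current = ToyHarness({"key": "value"})
current.remove_relation()

swapped = ToyHarness({"key": "value"})
swapped.remove_relation(invalidate_first=True)
```

In this toy, `current.seen_in_broken` still holds the data while `swapped.seen_in_broken` is empty, which is the difference the question is about.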

@benhoyt
Collaborator

benhoyt commented Jan 13, 2023

Juju actually allows the charm to access (stale?) relation data in relation-broken. However...

I've just tested this using two charms related to one another, each having a relation-broken hook and accessing relation data (from "this" or the "remote" app). The Harness actually gets this fairly different from real Juju in a number of cases.

Below are the results. The "setter" charm is the charm in the relation that's setting relation data via event.relation.data[self.app]["key"] = value, and the "getter" charm is the charm that's reading relation data.

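| charm  | which app | Juju result  | Harness result          |
|--------|-----------|--------------|-------------------------|
| setter | this      | {...data...} | {...data...}            |
| setter | remote    | {}           | RuntimeError            |
| getter | this      | {}           | RelationDataAccessError |
| getter | remote    | {...data...} | RuntimeError            |
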
  • Both RuntimeErrors have the message "remote-side relation data cannot be accessed during a relation-broken event".
  • The RelationDataAccessError message is "unit/0 is not leader and cannot read its own application databag", and in __repr__ this gets caught and converted to the string <n/a>.
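
As a rough way to read the table, the four cases can be written down as data and compared. This is only a restatement of the results reported above; the `RESULTS` mapping, the `mismatches` helper, and the short labels ("data", "empty") are made up for illustration.

```python
# (charm, which app) -> (result under Juju, result under Harness),
# transcribed from the table above with shortened labels.
RESULTS = {
    ("setter", "this"): ("data", "data"),
    ("setter", "remote"): ("empty", "RuntimeError"),
    ("getter", "this"): ("empty", "RelationDataAccessError"),
    ("getter", "remote"): ("data", "RuntimeError"),
}


def mismatches(results):
    """Return the cases where the Harness result differs from Juju's."""
    return sorted(
        case for case, (juju, harness) in results.items() if juju != harness
    )


differing = mismatches(RESULTS)
```

Only the ("setter", "this") case agrees; the other three differ between Juju and the Harness.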
Show raw data from test

To test this (unit tests and real Juju), I hacked my database and webapp charms that I've been using for secrets to add logging into the relation-broken hooks. The database charm is the "setter" and it sets relation data, and the webapp charm is the "getter" as it reads relation data. self_data means data accessed via event.relation.data[self.app] and event_data means remote data accessed via event.relation.data[event.app].

# real Juju

unit-database-0: 15:25:26 INFO unit.database/0.juju-log db:0: database _on_db_relation_broken: <ops.model.Relation db:0> self_data={'db_password_id': 'secret:cf0c1lrs26oc7aah2260'}
unit-database-0: 15:25:26 INFO unit.database/0.juju-log db:0: database _on_db_relation_broken: <ops.model.Relation db:0> event_data={}

unit-webapp-0: 15:25:26 INFO unit.webapp/0.juju-log db:0: webapp _on_db_relation_broken: <ops.model.Relation db:0> self_data={}
unit-webapp-0: 15:25:26 INFO unit.webapp/0.juju-log db:0: webapp _on_db_relation_broken: <ops.model.Relation db:0> event_data={'db_password_id': 'secret:cf0c1lrs26oc7aah2260'}

# database unit tests

database _on_db_relation_broken: <ops.model.Relation db:0> self_data={'db_password_id': 'secret:20a43c1e-41b3-49b2-ba42-6e11ec2cfffb'}
database: exception when getting event data: remote-side relation data cannot be accessed during a relation-broken event

# webapp unit tests

webapp _on_db_relation_broken: <ops.model.Relation db:0> self_data=<n/a>  # masks: exception when getting self data: webapp/0 is not leader and cannot read its own application databag
webapp: exception when getting event data: remote-side relation data cannot be accessed during a relation-broken event

It seems odd that the Harness deviates so much from real Juju, which just allows reads in each case, even if the data is not useful/stale.

I presume we intentionally raise more errors than Juju in tests to try to catch problems early -- for example, the charm probably shouldn't be accessing remote relation data during relation-broken (but Juju lets you). And I'm not sure about the RelationDataAccessError for the case when the "getter" charm tries to read its own data -- that doesn't seem correct.

As to the original issue, it seems like the data can't be accessed in the Harness (in 3 out of 4 cases!). @sed-i, can you post the actual code you were working with when you ran into this? Were you fetching relation data? Was it in the "getter" or "setter" charm? And was it via self.app (this) or event.app (remote)?

In any case, we should decide whether we want to mimic real Juju more closely. Or we should raise an error consistently if you access relation data in relation-broken in all cases. Is there ever a valid use case for doing that? @jameinel, thoughts?

@sed-i
Contributor Author

sed-i commented Jan 13, 2023

> can you post the actual code you were working with when you ran into this?

Here's the utest that expects the charmlib/juju to take care of cleaning up relation data.
Specifically, the utest was expecting that whatever custom events fire as a result of self.harness.remove_relation(rel_id), would not see any relation data. I think in this test it's the remote data that is expected to go away.

> Is there ever a valid use case for doing that?

The pattern we follow in o11y charms, more often than not, is that rel data represents the most up-to-date state.
After a relation-broken there is no other event that "reruns the charm" with the updated reldata. Relation-broken is the last chance to act on a change. This way, deep charm code doesn't need to know which event it's in (no need for "if the event is relation-broken, then update everything, ignoring the data").

@benhoyt
Collaborator

benhoyt commented Jan 17, 2023

I'm probably missing something obvious, but I don't quite understand -- doesn't the above table show that relation-broken on real Juju still includes the previous data? So wouldn't expecting the Harness to do something different mean the charm will behave differently in unit tests than under real Juju?

@sed-i
Contributor Author

sed-i commented Jan 17, 2023

I'm not sure I understand the table correctly:
On relation broken, the remaining app can read the data that was set by the departed app?

@benhoyt
Collaborator

benhoyt commented Jan 18, 2023

Yeah, that's right -- that's the last row in the table. It shows the "getter" app (i.e., the other charm from the one that set the data) being able to read data that the remote app (the "setter") set. Under real Juju it can read this data, under the Harness you currently get a RuntimeError("remote-side relation data cannot be accessed during a relation-broken event"), which doesn't seem to match reality. (@jameinel any idea why the Harness tries to be different/stricter than reality here?)

You can see this from the following log line:

unit-webapp-0: 15:25:26 INFO unit.webapp/0.juju-log db:0: webapp _on_db_relation_broken: <ops.model.Relation db:0> event_data={'db_password_id': 'secret:cf0c1lrs26oc7aah2260'}

The webapp charm is the "getter" in this case, and it was able to read that data -- during relation-broken -- that the database charm had set.

@jameinel
Member

jameinel commented Jan 18, 2023 via email

@sed-i
Contributor Author

sed-i commented Jan 18, 2023

Iiuc, this means that the following pattern is wrong:

def _on_relation_departed(self, _):  # or broken
    self._update_config()  # regenerate everything from current rel data

and instead we should do something like:

def _on_relation_departed(self, event):  # or broken
    self._update_config(excluding=event.relation.data)

Is that correct?

In other words, from within relation-departed/broken:

  • is event.relation included in self.model.relations?
  • is event.relation.data included in self.model.relations[x].data?
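
A minimal sketch of the second pattern, under the assumption that the relation being torn down may still show up when iterating over all relations (which is exactly the open question above). `ToyRelation`, `build_config`, and the "target" key are hypothetical names, not part of ops.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToyRelation:
    """Hypothetical stand-in for a relation and its remote databag."""
    id: int
    remote_data: Dict[str, str] = field(default_factory=dict)


def build_config(
    relations: List[ToyRelation], excluding: Optional[ToyRelation] = None
) -> List[str]:
    """Regenerate the target list from all relations, skipping one if given."""
    skip_id = excluding.id if excluding is not None else None
    return [
        rel.remote_data["target"]
        for rel in relations
        if rel.id != skip_id and "target" in rel.remote_data
    ]


relations = [
    ToyRelation(0, {"target": "a:9093"}),
    ToyRelation(1, {"target": "b:9093"}),
]
full = build_config(relations)
without_broken = build_config(relations, excluding=relations[1])
```

If the framework already hid the broken relation, the `excluding` argument would be a harmless no-op, so this shape is safe under either answer.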

@lucabello

Hi everyone, chiming in on this; is what @sed-i proposed the pattern we're supposed to follow?

@sed-i
Contributor Author

sed-i commented Jun 2, 2023

@PietroPasotti just gave me an idea:
If we always defer a relation-broken event, then by the next hook (update-status at the latest) there will be no data left in relation data, so charm code could operate on the entire relation data, i.e. without needing to work with the delta that a relation-broken implies.

This is not a great pattern, but it conveys well our dissonance about relation-broken.

@carlcsaposs-canonical
Contributor

FYI in some cases (but not all) accessing the remote application data in a relation broken event causes an error

See:
https://bugs.launchpad.net/juju/+bug/1960934

operator/ops/model.py

Lines 1341 to 1349 in 734e12d

if key is None and self.relation.app is None:
    # NOTE: if juju gets fixed to set JUJU_REMOTE_APP for relation-broken events, then that
    # should fix the only case in which we expect key to be None - potentially removing the
    # need for this error in future ops versions (i.e. if relation.app is guaranteed to not
    # be None. See https://bugs.launchpad.net/juju/+bug/1960934.
    raise KeyError(
        'Cannot index relation data with "None".'
        ' Are you trying to access remote app data during a relation-broken event?'
        ' This is not allowed.')

canonical/mysql-router-k8s-operator#73

@ca-scribner

What the Kubeflow team has seen with istio-pilot is like what @carlcsaposs-canonical reports. In live Juju when handling a relation-broken event:

  1. sometimes event.app=DEPARTING_APP (and I think relation.app is the same thing? probably an alias?)
  2. sometimes event.app=None

Up until today, we had only seen case (1) and whenever we saw it, we also knew that the departing application's data is still in the relation data bag. We handled this by popping the departing data before using the data bag.

# (simplified version - differs slightly from the link)
if isinstance(event, RelationBrokenEvent):
    relation_data.pop((event.relation, event.app))

Now that sometimes we see event.app=None, I wonder if we should instead do something like:

if isinstance(event, RelationBrokenEvent):
    try:
        relation_data.pop((event.relation, event.app))
    except KeyError:
        pass  # log and carry on?

The one question I have is whether, when event.app is None, we are guaranteed that the departing application's data has already been removed from the databag. If not, that will cause trouble, as we can't pop it.
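
One way to write the tolerant pop without a try/except is `dict.pop` with a default. This is a sketch on plain dicts only; `drop_departing` and the `(relation_id, app_name)` key shape are assumptions for illustration, and it does not answer whether the data is really gone when event.app is None.

```python
def drop_departing(relation_data, relation_id, app_name):
    """Return a copy without the departing app's entry.

    A missing key (for example when app_name is None) is silently ignored.
    """
    remaining = dict(relation_data)
    remaining.pop((relation_id, app_name), None)
    return remaining


data = {(0, "db"): {"k": "v"}, (1, "cache"): {"k2": "v2"}}
known_app = drop_departing(data, 0, "db")
unknown_app = drop_departing(data, 0, None)
```

With a known app the entry is removed; with `None` nothing is removed, so any stale entry for the departing app would still be present in the result.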

@benhoyt
Collaborator

benhoyt commented Jun 15, 2023

@jameinel Per the above and per https://bugs.launchpad.net/juju/+bug/1960934, it seems Juju is sometimes setting JUJU_REMOTE_APP but sometimes not setting it. Do you think that could be fixed on the Juju side? Or I guess we could change Juju to always not set it, but that might be too breaking.

@sed-i
Contributor Author

sed-i commented Jun 21, 2023

@sed-i
Contributor Author

sed-i commented Jul 5, 2023

Also, I think we usually don't want to see the stale data even on relation-departed:

  1. apps foo and bar are related.
  2. bar scaled down from 2 to 1.
  3. foo received relation-departed, but not broken.
  4. foo finishes running with the old data view which included the departing unit, and will only be able to act on the new data view (without bar/2) on update-status.

Real world example:

  1. Alertmanager is scaled from 2 to 1
  2. Prometheus needs to regenerate its yaml config file with only one target under the alertmanagers section.

@sed-i
Contributor Author

sed-i commented Jul 6, 2023

This issue is related to a frequent point of friction in charming: reconciling holistic vs deltas approaches.

  • Relation events give us deltas: unit/x joined or unit/x departed, and now we need to append/pop a section to/from an existing config file.
  • Idempotency and robustness call for holistic: for example, on config-changed after upgrade we need to iterate over all relations to construct a full config from scratch.
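
The two approaches can be contrasted in a small sketch on plain lists. `apply_departed`, `rebuild`, and the "alerting:0" relation label are hypothetical names, and the unit names only loosely follow the Alertmanager example above.

```python
def apply_departed(targets, departed_unit):
    """Delta approach: remove one unit's entry from the existing target list."""
    return [t for t in targets if t != departed_unit]


def rebuild(units_by_relation):
    """Holistic approach: construct the full target list from scratch."""
    return sorted(
        unit for units in units_by_relation.values() for unit in units
    )


before = ["alertmanager/0", "alertmanager/1"]
after_delta = apply_departed(before, "alertmanager/1")

# Holistic rebuild from a data view that no longer includes the departed unit.
after_holistic = rebuild({"alerting:0": ["alertmanager/0"]})

# Holistic rebuild from a stale view that still includes the departed unit.
after_stale = rebuild({"alerting:0": ["alertmanager/0", "alertmanager/1"]})
```

The holistic result only matches the delta result when the data view has already dropped the departing unit; with a stale view the departed unit stays in the config, which is the friction described here.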

@benhoyt
Collaborator

benhoyt commented Sep 29, 2023

Need to consider further what to do here. Possibly related to #940 work.

@carlcsaposs-canonical
Contributor

Related: #940 (comment) (difference in usage between local and remote app data during relation-broken)

@PietroPasotti
Contributor

This is fixed in ops 2.10, isn't it?

@carlcsaposs-canonical
Contributor

> This is fixed in ops 2.10, isn't it?

I believe the relation data is still accessible—and I think it should be, for the local app/unit databags

@benhoyt
Collaborator

benhoyt commented Mar 14, 2024

@tonyandrewmeyer is going to investigate this further and then we'll make a decision here.
