
KeyError: 66 (in used += self.cluster.osd_transfer_remainings[osdid]) #35

Open

patrakov opened this issue Mar 11, 2024 · 5 comments

@patrakov
While trying to rebalance an especially broken cluster, my colleague found this exception:

# ./placementoptimizer.py --osdsize device balance --osdused delta --max-pg-moves 50 --osdfrom fullest
Traceback (most recent call last):
  File "./placementoptimizer.py", line 5475, in <module>
    exit(main())
  File "./placementoptimizer.py", line 5470, in main
    run()
  File "./placementoptimizer.py", line 5434, in <lambda>
    run = lambda: balance(args, state)
  File "./placementoptimizer.py", line 4600, in balance
    need_simulation=True)
  File "./placementoptimizer.py", line 3260, in __init__
    self.init_analyzer.analyze(self)
  File "./placementoptimizer.py", line 4264, in analyze
    self._update_stats()
  File "./placementoptimizer.py", line 4350, in _update_stats
    self.cluster_variance = self.pg_mappings.get_cluster_variance()
  File "./placementoptimizer.py", line 3771, in get_cluster_variance
    for crushclass, usages in self.get_class_osd_usages().items():
  File "./placementoptimizer.py", line 3509, in get_class_osd_usages
    ret[crushclass] = {osdid: self.get_osd_usage(osdid) for osdid in self.osd_candidates_class[crushclass]}
  File "./placementoptimizer.py", line 3509, in <dictcomp>
    ret[crushclass] = {osdid: self.get_osd_usage(osdid) for osdid in self.osd_candidates_class[crushclass]}
  File "./placementoptimizer.py", line 3757, in get_osd_usage
    used = self.get_osd_usage_size(osdid, add_size)
  File "./placementoptimizer.py", line 3714, in get_osd_usage_size
    used += self.cluster.osd_transfer_remainings[osdid]
KeyError: 66

Note that osd.66 is the only OSD that has the hdd_test class:

$ ceph osd tree | grep test
 66  hdd_test     14.55269          osd.66              up   1.00000  1.00000

As we are not permitted to publicly post anything containing UUIDs that can be used to identify the customer's cluster, I am going to submit the debug info via private email.
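For reference, the failing line indexes osd_transfer_remainings with an OSD id that has no entry there (osd.66 appears only in the hdd_test candidate list). A self-contained sketch of the failure mode and a tolerant variant; this is only a sketch, assuming osd_transfer_remainings is a plain dict mapping OSD id to remaining transfer size, as the traceback suggests, and the data below is made up:

    # Minimal reproduction of the failure mode and a tolerant lookup.
    osd_transfer_remainings = {12: 4096}   # hypothetical: no entry for osd 66
    osdid = 66
    used = 0
    # used += osd_transfer_remainings[osdid]        # raises KeyError: 66
    used += osd_transfer_remainings.get(osdid, 0)   # treats a missing entry as 0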

@patrakov (Author)

I have successfully worked around the crash by adding --only-crushclass hdd.
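For reference, the original invocation with the workaround appended:

# ./placementoptimizer.py --osdsize device balance --osdused delta --max-pg-moves 50 --osdfrom fullest --only-crushclass hdd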

@TheJJ (Owner) commented Mar 15, 2024

Thanks for the report and file - I can probably figure this out from the data, but you may know directly: how is the hdd_test-class OSD selected when the others are from the hdd class? A manual crush root?

@patrakov (Author)

No idea. Apparently they just set one OSD to this class and later created a pool that uses it. Today they set more OSDs to this class (see the dump from #36).
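Presumably something along these lines, i.e. just a device class plus a class-filtered rule, with no separate crush root (rule and pool names are hypothetical):

$ ceph osd crush rm-device-class osd.66
$ ceph osd crush set-device-class hdd_test osd.66
$ ceph osd crush rule create-replicated hdd_test_rule default host hdd_test
$ ceph osd pool create hdd_test_pool 32 32 replicated hdd_test_rule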

@TheJJ (Owner) commented Mar 16, 2024

Sounds wild :D let's hope they know what they're doing (but then again, they seem to have asked you for help :)
Looking at the crush rules, all the hdd rules pick default~hdd, and hdd_test devices are not part of this. Maybe it's possible to have upmaps to devices of different classes? These would not be movement candidates, but they would still need to be accounted for when moving data away from them. But this seems rather special, phew. I'm gonna think a bit about what this means for handling it properly 🤔
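For context, such a cross-class mapping would have to come from an explicit upmap entry, e.g. (the pgid 2.7 and source osd 12 are made up):

$ ceph osd pg-upmap-items 2.7 12 66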

@patrakov (Author) commented Mar 16, 2024

I think I should have expressed myself better. The state with only one OSD in the "hdd_test" CRUSH device class, which I happened to catch, was purely the result of miscoordination between my work and theirs. The end result (two hosts full of hdd_test OSDs, which makes more sense but still triggers the issue unless --only-crushclass hdd is added) is available in the dump that I sent you for issue #36.
