Storm Rebalance Broken #226

Open
JessicaLHartog opened this issue Oct 19, 2017 · 1 comment
JessicaLHartog commented Oct 19, 2017

After the merging of #200 and #213, rebalancing topologies no longer does anything. This is because, when a rebalance happens, there are no offers from which slots can be made unless other topologies also happen to need assignments.

This is a result of the way Nimbus handles the TopologiesMissingAssignments component. A quick rundown of what now happens:

  • storm-mesos does scheduling of topologies until no topologies need assignments
    • since no topologies need assignments, offers are suppressed
  • storm-mesos doesn't do anything in MesosNimbus because no topologies need assignments (and offers are already suppressed)
  • a rebalance command comes in and is registered by Nimbus; a :do-rebalance event is scheduled some number of seconds in the future
  • that number of seconds later, there is finally a topology that needs assignment (i.e., the one that was just rebalanced), but there are no offers buffered
  • since there are no offers buffered and there are topologies needing assignments, offers are revived
  • allSlotsAvailableForScheduling returns after reviving offers
  • Nimbus wants slots immediately for the rebalancing topology, and there's no time for offers to come in and be used in the next allSlotsAvailableForScheduling call
  • since there are no slots available for the workers to be rescheduled onto, they don't get rescheduled and rebalance therefore does nothing

Notably, if there are other topologies needing assignments at the same time as the :do-rebalance is executed, then the rebalance should work as expected.
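To make the failure sequence concrete, here is a toy simulation of the suppress/revive cycle described above. The class and method names are illustrative, not the actual MesosNimbus code, and the model assumes buffered offers are dropped when suppressing:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the offer suppress/revive behavior (illustrative names).
class OfferManager {
    private final Map<String, String> offers = new HashMap<>(); // offerId -> hostname
    private boolean suppressed = false;

    // Offers arrive asynchronously; while suppressed, none come in.
    void addOffer(String id, String host) {
        if (!suppressed) offers.put(id, host);
    }

    // Stand-in for one round of allSlotsAvailableForScheduling.
    Map<String, String> slotsFor(int topologiesNeedingAssignment) {
        if (topologiesNeedingAssignment == 0) {
            suppressed = true;   // nothing to schedule: suppress offers
            offers.clear();      // toy assumption: buffered offers are released
            return new HashMap<>();
        }
        if (suppressed) {
            suppressed = false;  // demand reappeared: revive offers...
            // ...but this call returns immediately, before any revived
            // offers have arrived, so this round sees an empty offer map.
        }
        return new HashMap<>(offers);
    }
}

public class Demo {
    public static void main(String[] args) {
        OfferManager m = new OfferManager();
        m.addOffer("o1", "host1");
        // Round 1: no topologies need assignment -> suppress, offers dropped.
        System.out.println(m.slotsFor(0).size()); // 0
        // Round 2: :do-rebalance fires -> revive, but nothing buffered yet,
        // so the rebalancing topology gets no slots and nothing happens.
        System.out.println(m.slotsFor(1).size()); // 0
        // Revived offers only trickle in afterward.
        m.addOffer("o2", "host2");
        System.out.println(m.slotsFor(1).size()); // 1
    }
}
```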

Note that this refers only to the Storm UI "Rebalance" action and its associated command. I have not tested this with the type of rebalance mentioned in the Storm documentation:

## Reconfigure the topology "mytopology" to use 5 worker processes,
## the spout "blue-spout" to use 3 executors and
## the bolt "yellow-bolt" to use 10 executors.

$ storm rebalance mytopology -n 5 -e blue-spout=3 -e yellow-bolt=10

However, I fully expect it hits the same logic in Nimbus, and that the same behavior (or something similar) happens that way too.

JessicaLHartog (Collaborator, Author) commented:
Possible solutions:

Write logic that scrapes ZK state to see if there are any topologies in REBALANCING state, and, if there are, stop suppressing Offers.

Positive(s):

  • This would be able to accumulate Offers in anticipation of needing them to execute a :do-rebalance.
  • This limits the amount of perpetual offer collecting by this framework.

Negative(s):

  • This is complicated and requires a lot of ZK state parsing for little reward.
  • This is prone to bugs in Storm (like one that exists right now: if Nimbus dies while a topology is in REBALANCING state, the only way out is to resubmit the topology).
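A minimal sketch of what this option might look like. The ClusterState accessor here is hypothetical, standing in for the actual ZK-state parsing (which is the complicated part this option is criticized for):

```java
import java.util.List;
import java.util.Map;

public class SuppressDecision {
    // Hypothetical accessor over cluster state; not Storm's actual API.
    interface ClusterState {
        List<String> activeStorms();
        String topologyStatus(String stormId); // e.g. "ACTIVE", "REBALANCING"
    }

    static boolean shouldSuppressOffers(ClusterState state,
                                        int topologiesNeedingAssignment) {
        if (topologiesNeedingAssignment > 0) return false;
        // Even with nothing to schedule right now, a REBALANCING topology
        // will need slots once its :do-rebalance fires, so keep offers
        // flowing and accumulate them in anticipation.
        for (String id : state.activeStorms()) {
            if ("REBALANCING".equals(state.topologyStatus(id))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> statuses =
            Map.of("topo-1", "ACTIVE", "topo-2", "REBALANCING");
        ClusterState state = new ClusterState() {
            public List<String> activeStorms() {
                return List.copyOf(statuses.keySet());
            }
            public String topologyStatus(String id) {
                return statuses.get(id);
            }
        };
        // topo-2 is mid-rebalance, so we must not suppress.
        System.out.println(shouldSuppressOffers(state, 0)); // false
    }
}
```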

Write logic to hold on to some number of unused Offers so that rebalance does something.

Positive(s):

  • This would mean that there are always some Offers that we can leverage to make worker slots on whenever a rebalance command is triggered.

Negative(s):

  • This means that there are guaranteed to be wasted resources.
  • This is insufficient for large topologies that need many slots.
  • This discourages spreading workers across many hosts when rebalancing, since only the few hosts whose Offers were held are available during execution of the rebalance.
  • This will likely reproduce the same behavior if the held Offers cannot yield enough slots for the topology in question.
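A sketch of what holding a bounded number of unused Offers might look like (illustrative names, not the actual framework code); the fixed cap is exactly what makes this insufficient for large topologies:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy cache that holds at most maxHeld offers and declines the rest.
public class HeldOffers {
    private final int maxHeld;
    private final Deque<String> held = new ArrayDeque<>();

    HeldOffers(int maxHeld) { this.maxHeld = maxHeld; }

    /** Returns true if the offer is held, false if it should be declined. */
    boolean onOffer(String offerId) {
        if (held.size() >= maxHeld) {
            return false; // decline: caps the wasted resources, but also
                          // caps the slots a future rebalance can build
        }
        held.add(offerId);
        return true;
    }

    int heldCount() { return held.size(); }

    public static void main(String[] args) {
        HeldOffers h = new HeldOffers(2);
        System.out.println(h.onOffer("o1")); // true: held
        System.out.println(h.onOffer("o2")); // true: held
        System.out.println(h.onOffer("o3")); // false: over the cap, declined
    }
}
```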

Identify a way to release the _offersLock during the first round of scheduling in which topologies need assignment, so we can revive and collect Offers, then use them.

Positive(s):

  • This is probably the right way to fix this problem.
  • This will not hold Offers when we don't need them.
  • This will enable worker spread across as many hosts as possible for topologies that are being rebalanced.

Negative(s):

  • This is decidedly challenging because of the locking situation.
    • We don't want Offers to become unavailable to us when we anticipate using them for scheduling (hence the lock).
    • It is not likely possible for us to hold the lock, let it go, wait a bit for revived Offers to come in, and regain the lock with any guarantee of that order of events.
      • This is because there are asynchronous updates happening to the Offers map when Offers are received.
  • This also would likely require some implementation of a Finite State Machine with transitions for holding/suppressing offers and the various ways in which you can get to/from any given state.
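The Finite State Machine mentioned above might look roughly like this skeleton (state and transition names are illustrative assumptions, not an actual design):

```java
// Minimal skeleton of an offer-handling state machine: suppressed until
// demand appears, reviving until offers arrive, then collecting.
public class OfferStateMachine {
    enum State { SUPPRESSED, REVIVING, COLLECTING }

    private State state = State.SUPPRESSED;

    State onTopologiesNeedAssignment() {
        // Demand appeared while suppressed: revive and wait for offers
        // instead of trying to schedule against an empty offer map.
        if (state == State.SUPPRESSED) state = State.REVIVING;
        return state;
    }

    State onOfferReceived() {
        // First offer after a revive: offers can now back slot creation.
        if (state == State.REVIVING) state = State.COLLECTING;
        return state;
    }

    State onNothingToSchedule() {
        state = State.SUPPRESSED; // no demand left: suppress again
        return state;
    }

    public static void main(String[] args) {
        OfferStateMachine fsm = new OfferStateMachine();
        System.out.println(fsm.onTopologiesNeedAssignment()); // REVIVING
        System.out.println(fsm.onOfferReceived());            // COLLECTING
        System.out.println(fsm.onNothingToSchedule());        // SUPPRESSED
    }
}
```

The real difficulty the negatives describe is not the states themselves but making the transitions safe against the asynchronous offer updates while the _offersLock is released.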
