
Using Que in a multi-node environment #10

Open · odarriba opened this issue Nov 22, 2018 · 7 comments

@odarriba

We are trying to use Que to schedule jobs in a multi-node environment, allowing nodes to connect to each other and share mnesia information.

What I'm doing:

  1. Start two iex consoles:
$ iex --sname node1 -S mix
$ iex --sname node2 -S mix
  2. Connect the nodes (on the node1 console):
iex> Node.connect(:node2@local_hostname)
  3. Check that they are connected with a Node.list() on both sides.

  4. Spawn 100 jobs of my TestWorker (which puts a log message and waits one second):

for number <- 1..100 do
  Que.add(TestQue.TestWorker, number)
end
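
(For reference, a minimal worker along those lines, using que's perform/1 callback; a sketch, not the exact TestWorker used here:)

defmodule TestQue.TestWorker do
  use Que.Worker

  # Que calls perform/1 with the argument passed to Que.add/2
  def perform(number) do
    IO.puts("Processing job #{number} on #{node()}")
    Process.sleep(1000)
  end
end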

It works, but all jobs run on only one of the nodes (let's say node1), while on the other (node2) no jobs are executed.
Also, if I run the same for loop on node2, all the jobs are still executed on node1.

Is there anything we can do to:

  1. Run jobs on any node?
  2. Ensure that if one node goes down, the tasks in :mnesia remain on the rest of the cluster?

Thanks in advance!

P.S.: I have also tried running the Mnesia setup persisted to disk, with this result:

iex(node1@patata)2> nodes = [node() | Node.list]
[:node1@patata, :node2@patata]
iex(node1@patata)3> Que.Persistence.Mnesia.setup!(nodes)
[info] Application mnesia exited: :stopped
** (Memento.MnesiaException) Mnesia operation failed
   :not_active
   Mnesia Error: {:not_active, Que.Persistence.Mnesia.DB.Jobs, :node2@patata}
    (memento) lib/memento/table/table.ex:274: Memento.Table.handle_for_bang!/1
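
(For context, the standard Mnesia recipe for creating a multi-node disc schema looks roughly like this. A general sketch, not que-specific; the schema has to be created while Mnesia is stopped on every node involved:)

nodes = [node() | Node.list()]

# Stop Mnesia everywhere, create the on-disk schema for all nodes, restart
:rpc.multicall(nodes, :mnesia, :stop, [])
:mnesia.create_schema(nodes)
:rpc.multicall(nodes, :mnesia, :start, [])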
@sheharyarn (Owner) commented Nov 26, 2018

Hey @odarriba! This is an excellent question and something that I have previously deliberated a lot. In the end, I decided to not include this in que for multiple reasons:

  1. Jobs are typically run on only a single machine, and in most cases it's the main application server. I wanted to make Que super simple to use.

  2. Not everyone wants to handle failures in the same way. While it's common to retry failed jobs, Que's default mode of operation is to do nothing.

  3. Distribution is hard, and no single solution applies to all scenarios. I didn't want to lock que into a specific node-distribution and job-processing model.

  4. Same as the 3rd point: not all consensus algorithms are created equal, nor do they apply to the same situations.

How/when/if an application should be replicated to another node is up to the developer. See the Distributed Applications guide on the Erlang website. All that said, for a simple node-failover model, this should be pretty straightforward: take a look at how singleton implements it (though this means you'll have to start que with runtime: false and set up the distribution strategy manually).
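
A minimal sketch of that idea (MyApp.QueSingleton is a hypothetical module; the singleton library automates the registration handling shown here for you):

# In mix.exs, keep que from starting automatically:
#   {:que, ..., runtime: false}

defmodule MyApp.QueSingleton do
  use GenServer

  # Only one node can win the :global registration, so :que
  # is started on exactly one node in the cluster at a time.
  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, opts, name: {:global, __MODULE__})
  end

  @impl true
  def init(_opts) do
    {:ok, _apps} = Application.ensure_all_started(:que)
    {:ok, %{}}
  end
end

A supervisor on each node would try to start this process; the losing nodes get {:error, {:already_started, pid}} and can retry when the winning node goes down.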

@sheharyarn (Owner)

Thinking about this more (and seeing how quickly your post got many 👍), I might have to reconsider my position on this.

I'm open to the idea of integrating this into Que if we're able to come up with sane defaults that are easy to change and do not negatively affect the developer experience. PRs/Issues are welcome if you have any ideas on how to approach this.

@noizu commented Dec 1, 2018

I might suggest adding a node parameter to the que table, so that queue entries are associated with the node they were created on. My cluster handles about half a billion device reports per day and spans multiple nodes. There are certain tasks I'd like to queue up, but they'd overwhelm any single node responsible for handling them.
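
Schema-wise, something like this (a hypothetical sketch; que's actual Mnesia table has more fields than shown):

defmodule Que.Persistence.Mnesia.DB.Jobs do
  use Memento.Table,
    attributes: [:id, :worker, :arguments, :status, :node]  # :node is the new field
end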

@noizu commented Dec 1, 2018

Or possibly use Mnesia's local table type, although I'm not entirely sure of all the details of that table type.
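
(Assuming that refers to the local_content option, which keeps each node's copy of a table independent, so writes on one node are not replicated to the others. A sketch:)

:mnesia.create_table(:que_jobs, [
  attributes: [:id, :worker, :arguments, :status],
  local_content: true
])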

@noizu commented Dec 2, 2018

@odarriba: because I needed some additional functionality, I made a fork this morning.

https://github.com/noizu/que

I have added the ability to set a priority level on jobs such that all :pri0 jobs will be processed before any :pri1, :pri2, or :pri3 ones (in that order).

Additionally, the schema has been updated to include the current node, so that on restart only jobs queued on that specific node will be requeued.

I have not yet added load-balancing logic, since I don't need it for my immediate use case, but you can simulate it well enough using something like:

cluster = [:"node1@domain", :"node2@domain", ...]
for number <- 1..100 do
  Que.remote_add(Enum.random(cluster), TestQue.TestWorker, number)
end

Or, if using the original version of the code:

timeout = 50_000
cluster = [:"node1@domain", :"node2@domain", ...]
for number <- 1..100 do
  :rpc.call(Enum.random(cluster), Que, :add, [TestQue.TestWorker, number], timeout)
end

Use :rpc.cast/4 if you don't care about confirmation.
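
For example (the same call as above, but without waiting for a result):

:rpc.cast(Enum.random(cluster), Que, :add, [TestQue.TestWorker, number])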

@sheharyarn, you have written some fantastically readable code here ^_^ Keep up the good work. If you'd like to incorporate the priority logic, chat with me sometime. I'll continue to refine it on my fork either way, since I need to process tens of thousands of jobs a minute, efficiently and with prioritization.

@sheharyarn (Owner)

Thank you for the suggestions @noizu (and sorry for the late reply!). I took a quick look at your fork and it looks very interesting. Over the next few weeks, I'll try to come up with a plan for adding distributed job execution support, and will post back here with an update.

@noizu commented Jan 18, 2019 via email
