
Support "competing" readers #2

Open
kjnilsson opened this issue Apr 23, 2020 · 10 comments
Open

Support "competing" readers #2

kjnilsson opened this issue Apr 23, 2020 · 10 comments

Comments

@kjnilsson (Contributor) commented Apr 23, 2020

Currently all readers read the entire log and there is no mechanism for "competing" reads, i.e. multiple readers reading entries/chunks in round-robin order so as to increase the speed at which entries for a given stream are processed.

A possible design for doing competing reads:

  • A competing read coordinator (CRC) process is co-hosted with the osiris writer (leader).
  • The CRC acts like a reader, although it only ever scans the chunk headers.
  • Readers that want to "compete" (collaborate may be a better term) over a log register with the CRC and wait for the CRC to inform them of the next chunk id (offset) to read from.
  • The CRC thus allocates chunk ids to competing readers and maintains state for the current readers, including which chunk ids they have been allocated.
  • Readers ack back when they have finished processing a chunk id so that they can be allocated another one (some degree of pipelining should be allowed).
  • The CRC persists the current read state in the log as a special entry type so that it can be replicated and recovered from anywhere.
  • Thus chunk ids are allocated to available readers in a round-robin-ish manner (see the sketch after this list).
  • Although allocation of chunk ids happens on the writer node, readers can perform the actual reads on replica nodes, which will further scale out reads across the cluster.
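
A minimal sketch of what the CRC allocation state could look like; the module, record, and message names here are hypothetical, not the actual osiris API:

```erlang
%% Sketch of CRC chunk id allocation. Module, record and message names
%% are hypothetical, not the actual osiris API.
-module(crc_sketch).
-export([register_reader/2, ack/3]).

-record(state, {next_chunk_id = 0 :: non_neg_integer(),
                %% chunk ids currently in flight, per reader pid
                allocated = #{} :: #{pid() => [non_neg_integer()]},
                %% max chunks a reader may have in flight (pipelining)
                max_in_flight = 2 :: pos_integer()}).

%% a reader joins the group and is immediately considered for work
register_reader(Pid, #state{allocated = Alloc} = State) ->
    maybe_allocate(Pid, State#state{allocated = Alloc#{Pid => []}}).

%% a reader acks a processed chunk id and may be allocated another
ack(Pid, ChunkId, #state{allocated = Alloc} = State) ->
    Ids = lists:delete(ChunkId, maps:get(Pid, Alloc)),
    maybe_allocate(Pid, State#state{allocated = Alloc#{Pid := Ids}}).

maybe_allocate(Pid, #state{next_chunk_id = Next,
                           allocated = Alloc,
                           max_in_flight = Max} = State) ->
    case maps:get(Pid, Alloc) of
        Ids when length(Ids) < Max ->
            %% tell the reader which chunk id to read next
            Pid ! {read_chunk, Next},
            State#state{next_chunk_id = Next + 1,
                        allocated = Alloc#{Pid := [Next | Ids]}};
        _ ->
            State
    end.
```

In this sketch allocation is driven by acks rather than a strict rotation: whichever reader has free in-flight capacity next receives the next chunk id, which gives the round-robin-ish behaviour described above.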

Downsides:

  • Ordering is gone: if a reader fails whilst holding an allocated chunk id, that chunk id needs to be given to an existing reader, which may already have processed a higher chunk id, resulting in that reader processing chunks out of order. That said, for competing consumers this is always the case.
@Vanlightly (Contributor)

The above looks like an elegant design.

This leaves open the question of consumer offset tracking.

Without competing consumers, manually managed consumer offsets are easy: just periodically write the last offset to some kind of persistent store. With competing consumers this becomes a trickier problem to get right, as there is no single offset that represents a high watermark of where consumption has reached.

When we start offering built-in offset tracking, again it becomes more complex.
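
To illustrate the point (a sketch, not osiris code): with competing consumers the set of acked chunk ids can have gaps, so the only safe restart point is the end of the contiguous acked prefix, not the highest acked chunk id.

```erlang
%% Sketch: the safe restart point for a competing group is the highest
%% chunk id below which every chunk has been acked.
-module(watermark_sketch).
-export([low_watermark/1]).

-spec low_watermark([non_neg_integer()]) -> integer().
low_watermark(AckedChunkIds) ->
    contiguous(lists:usort(AckedChunkIds), -1).

%% walk the sorted ids while they remain consecutive
contiguous([Next | Rest], Prev) when Next =:= Prev + 1 ->
    contiguous(Rest, Next);
contiguous(_, Prev) ->
    Prev.
```

For example, `low_watermark([0, 1, 2, 4, 5])` returns 2: chunk 3 is still in flight, so restarting the group from 5 would skip it.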

@kjnilsson (Contributor, Author)

I don't think a competing consumer can do offset tracking; the offset tracking is done by the read coordinator, so new consumers just join the round-robin queue and are advised of the next chunk id to read.

@gerhard commented Apr 23, 2020

It all sounds reasonable 👍 from me.

@lukebakken (Contributor)

👍 to "collaborate"

@acogoluegnes (Contributor)

  • Readers ack back when they have finished processing a chunk id so that they can be allocated another one (some degree of pipelining should be allowed).

How will this translate in terms of API for reading clients (e.g. the stream plugin)? Right now they send_file or register_offset_listener to get notified that there's something new. I would expect this to be transparent for them, as the reader/CRC would control which chunk they are supposed to send.

@acogoluegnes (Contributor)

  • The CRC persists the current read state in the log as a special entry type so that it can be replicated and recovered from anywhere.

OK, so competing readers get offset tracking for free? This is not supported yet for traditional readers, but when it is, they will have to issue a command (commit?) and will expect the broker to keep the offset where they left off. Are we on the same page for this?

@acogoluegnes (Contributor)

What about replay semantics? Can a group of competing consumers start over? This implies some kind of deletion concept for the group, or at least an offset reset for the group.

@kjnilsson (Contributor, Author)

  • Readers ack back when they have finished processing a chunk id so that they can be allocated another one (some degree of pipelining should be allowed).

How will this translate in terms of API for reading clients (e.g. the stream plugin)? Right now they send_file or register_offset_listener to get notified that there's something new. I would expect this to be transparent for them, as the reader/CRC would control which chunk they are supposed to send.

Yes, we should hide this complexity (choosing which process to send to) behind a common API.
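
Something like the following shape, perhaps; the module, function, and option names are hypothetical, only the shape of the API is the point:

```erlang
%% Hypothetical common entry point: callers ask for a reader and receive
%% chunk ids to read, regardless of whether a CRC is involved.
init_reader(Stream, #{mode := competing, group := Group}) ->
    %% register with the CRC; it sends {read_chunk, ChunkId} messages
    crc:register_reader(Stream, Group, self());
init_reader(Stream, #{mode := normal, offset := Offset}) ->
    %% plain reader attached at a given offset, as today
    reader:attach(Stream, Offset).
```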

@kjnilsson (Contributor, Author)

OK, so competing readers get offset tracking for free? This is not supported yet for traditional readers, but when it is, they will have to issue a command (commit?) and will expect the broker to keep the offset where they left off. Are we on the same page for this?

Yes, but I could imagine considering using the same process for keeping consumer offsets for this stream. Might need to ponder that a bit. :)
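
For illustration, the special entry could carry something like the following; this is a sketch with a hypothetical append call, the real write path may differ:

```erlang
%% Sketch: serialise the CRC state and append it to the stream's log as
%% a special entry type so replicas can recover it after failover.
%% osiris_writer:write/2 is assumed here, not a confirmed API.
persist_crc_state(Writer, NextChunkId, InFlight) ->
    Entry = term_to_binary({crc_state, NextChunkId, InFlight}),
    osiris_writer:write(Writer, Entry).
```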

@kjnilsson (Contributor, Author)

What about replay semantics? Can a group of competing consumers start over? This implies some kind of deletion concept for the group, or at least an offset reset for the group.

No, I don't think replay would be supported for a competing reader group; you can of course replay if you like as a "normal" reader, i.e. not competing.
