Skip to content

wukong dataflow notes from similar tools

Philip (flip) Kromer edited this page May 9, 2012 · 1 revision

wukong/dataflow -- Notes from Similar Tools

Notes from Esper

Esper docs

  • Data window views:

    • win:length
    • win:length_batch
    • win:time
    • win:time_batch
    • win:time_length_batch
    • win:time_accum
    • win:ext_timed
    • ext:sort_window
    • ext:time_order
    • std:unique
    • std:groupwin
    • std:lastevent
    • std:firstevent
    • std:firstunique
    • win:firstlength
    • win:firsttime
  • Views that derive statistics:

    • std:size
    • stat:uni
    • stat:linest
    • stat:correl
    • stat:weighted_avg

s

[annotations]
[expression_declarations]
[context context_name]
[insert into insert_into_def]
select select_list
from stream_def [as name] [, stream_def [as name]] [,...]
[where search_conditions]
[group by grouping_expression_list]
[having grouping_search_conditions]
[output output_specification]
[order by order_by_expression_list]
[limit num_rows]
  • Filter criteria. The following operators are highly optimized through indexing and are the preferred means of filtering in high-volume event streams:

    • equals =
    • not equals !=
    • comparison operators < , > , >=, <=
    • ranges-- Ranges come in the following 4 varieties. The use of round () or square [] bracket dictates whether an endpoint is included or excluded. The low point and the high-point of the range are separated by the colon : character.
      • Open ranges that contain neither endpoint (low:high)
      • Closed ranges that contain both endpoints [low:high]. The equivalent 'between' keyword also defines a closed range.
      • Half-open ranges that contain the low endpoint but not the high endpoint [low:high)
      • Half-closed ranges that contain the high endpoint but not the low endpoint (low:high]
    • use the between keyword for a closed range where both endpoints are included
    • use the in keyword and round () or square brackets [] to control how endpoints are included
    • for inverted ranges use the not keyword and the between or in keywords
    • list-of-values checks using the in keyword or the not in keywords followed by a comma-separated list of values
  • where clause ...where fraud.severity = 5 and amount > 500 ...where (orderItem.orderId is null) or (orderItem.class != 10) ...where (orderItem.orderId = null) or (orderItem.class <> 10) ...where itemCount / packageCount > 10

  • aggregate -- aggregate_function( [all | distinct] expression)

    • group by
    • having
    • last, all, first



Notes from Rack

Notes form the rack specification -- a pure expression of the middleware dataflow paradigm

...

This specification aims to formalize the Rack protocol. You can (and should) use Rack::Lint to enforce it. When you develop middleware, be sure to add a Lint before and after to catch all mistakes.

Rack applications

A Rack application is an Ruby object (not a class) that responds to call. It takes exactly one argument, the environment and returns an Array of exactly three values: The status, the headers, and the body.

The Environment

The environment must be an instance of Hash that includes CGI-like headers. The application is free to modify the environment. The environment is required to include these variables (adopted from PEP333), except when they’d be empty, but see below.

  • REQUEST_METHOD: The HTTP request method, such as "GET" or "POST". This cannot ever be an empty string, and so is always required.

  • SCRIPT_NAME: The initial portion of the request URL‘s "path" that corresponds to the application object, so that the application knows its virtual "location". This may be an empty string, if the application corresponds to the "root" of the server.

  • PATH_INFO: The remainder of the request URL‘s "path", designating the virtual "location" of the request‘s target within the application. This may be an empty string, if the request URL targets the application root and does not have a trailing slash. This value may be percent-encoded when I originating from a URL.

  • QUERY_STRING: The portion of the request URL that follows the ?, if any. May be empty, but is always required!

  • SERVER_NAME, SERVER_PORT: When combined with SCRIPT_NAME and PATH_INFO, these variables can be used to complete the URL. Note, however, that HTTP_HOST, if present, should be used in preference to SERVER_NAME for reconstructing the request URL. SERVER_NAME and SERVER_PORT can never be empty strings, and so are always required.

  • HTTP_ Variables: Variables corresponding to the client-supplied HTTP request headers (i.e., variables whose names begin with HTTP_). The presence or absence of these variables should correspond with the presence or absence of the appropriate HTTP header in the request.

  • rack.version: The Array [1,1], representing this version of Rack.

  • rack.url_scheme: http or https, depending on the request URL.

  • rack.input: See below, the input stream.

  • rack.errors: See below, the error stream.

  • rack.multithread: true if the application object may be simultaneously invoked by another thread in the same process, false otherwise.

  • rack.multiprocess: true if an equivalent application object may be simultaneously invoked by another process, false otherwise.

  • rack.run_once: true if the server expects (but does not guarantee!) that the application will only be invoked this one time during the life of its containing process. Normally, this will only be true for a server based on CGI (or something similar).

  • rack.logger: A common object interface for logging messages. The object must implement:

    • info(message, &block)
    • debug(message, &block)
    • warn(message, &block)
    • error(message, &block)
    • fatal(message, &block)

The server or the application can store their own data in the environment, too. The keys must contain at least one dot, and should be prefixed uniquely. The prefix rack. is reserved for use with the Rack core distribution and other accepted specifications, and must not be used otherwise. The environment must not contain the keys HTTP_CONTENT_TYPE or HTTP_CONTENT_LENGTH (use the versions without HTTP_). The CGI keys (named without a period) must have String values. There are the following restrictions:

  • rack.version must be an array of Integers.
  • rack.url_scheme must either be 'http' or 'https'.
  • There must be a valid input stream in rack.input.
  • There must be a valid error stream in rack.errors.
  • The REQUEST_METHOD must be a valid token.
  • The SCRIPT_NAME, if non-empty, must start with '/'
  • The PATH_INFO, if non-empty, must start with '/'
  • The CONTENT_LENGTH, if given, must consist of digits only.
  • One of SCRIPT_NAME or PATH_INFO must be set. PATH_INFO should be '/' if SCRIPT_NAME is empty. SCRIPT_NAME never should be '/', but instead be empty.

The Input Stream

The input stream is an IO-like object which contains the raw record. When applicable, its external encoding must be “ASCII-8BIT” and it must be opened in binary mode, for Ruby 1.9 compatibility. The input stream must respond to gets, each, read and rewind.

  • gets must be called without arguments and return a string, or nil on EOF.
  • read behaves like IO#read. Its signature is read([length, [buffer]]). If length is given, it must be an non-negative Integer (>= 0) or nil, and buffer must be a String and may not be nil. If length is given and not nil, then this method reads at most length bytes from the input stream. If length is not given or nil, then this method reads all data until EOF. When EOF is reached, this method returns nil if length is given and not nil, or "" if length is not given or is nil. If buffer is given, then the read data will be placed into buffer instead of a newly created String object.
  • each must be called without arguments and only yield Strings.
  • rewind must be called without arguments. It rewinds the input stream back to the beginning. It must not raise Errno::ESPIPE: that is, it may not be a pipe or a socket. Therefore, handler developers must buffer the input data into some rewindable object if the underlying input stream is not rewindable.
  • close must never be called on the input stream.

The Error Stream

The error stream must respond to puts, write and flush.

  • puts must be called with a single argument that responds to to_s.
  • write must be called with a single argument that is a String.
  • flush must be called without arguments and must be called in order to make the error appear for sure.
  • close must never be called on the error stream.

The Response

  • status: an HTTP status. When parsed as integer (to_i), it must be greater than or equal to 100.
  • headers: must respond to each, and yield values of key and value. The header keys must be Strings. The header must not contain a Status key, contain keys with : or newlines in their name, contain keys names that end in - or _, but only contain keys that consist of letters, digits, _ or - and start with a letter. The values of the header must be Strings, consisting of lines (for multiple header values, e.g. multiple Set-Cookie values) seperated by “n“. The lines must not contain characters below 037.
    • Content-Type: There must be a Content-Type, except when the Status is 1xx, 204 or 304, in which case there must be none given.
    • Content-Length: There must not be a Content-Length header when the Status is 1xx, 204 or 304.
  • body: must respond to each and must only yield String values. The Body itself should not be an instance of String, as this will break in Ruby 1.9. If the Body responds to close, it will be called after iteration. If the Body responds to to_path, it must return a String identifying the location of a file whose contents are identical to that produced by calling each; this may be used by the server as an alternative, possibly more efficient way to transport the response. The Body commonly is an Array of Strings, the application instance itself, or a File-like object.



Notes from Kafka

Kafka docs

Delivery Guarantees

From Kafka docs

Most messaging systems keep metadata about what messages have been consumed on the broker. That is, as a message is handed out to a consumer, the broker records that fact locally. This is a fairly intuitive choice, and indeed for a single machine server it is not clear where else it could go. Since the data structure used for storage in many messaging systems scale poorly, this is also a pragmatic choice -- since the broker knows what is consumed it can immediately delete it, keeping the data size small.

What is perhaps not obvious, is that getting the broker and consumer to come into agreement about what has been consumed is not a trivial problem. If the broker records a message as consumed immediately every time it is handed out over the network, then if the consumer fails to process the message (say because it crashes or the request times out or whatever) then that message will be lost. To solve this problem, many messaging systems add an acknowledgement feature which means that messages are only marked as sent not consumed when they are sent; the broker waits for a specific acknowledgement from the consumer to record the message as consumed. This strategy fixes the problem of losing messages, but creates new problems. First of all, if the consumer processes the message but fails before it can send an acknowledgement then the message will be consumed twice. The second problem is around performance, now the broker must keep multiple states about every single message (first to lock it so it is not given out a second time, and then to mark it as permanently consumed so that it can be removed). Tricky problems must be dealt with, like what to do with messages that are sent but never acknowledged.

So clearly there are multiple possible message delivery guarantees that could be provided:

  • At most once—this handles the first case described. Messages are immediately marked as consumed, so they can't be given out twice, but many failure scenarios may lead to losing messages.
  • At least once—this is the second case where we guarantee each message will be delivered at least once, but in failure cases may be delivered twice.
  • Exactly once—this is what people actually want, each message is delivered once and only once.