Skip to content

emc-mongoose/mongoose-storage-driver-pravega

Repository files navigation

Gitter chat Issue Tracker CI status Maven metadata URL Docker Pulls

Content

  1. Introduction
  2. Features
  3. Deployment
      3.1. Basic
      3.2. Docker
        3.2.1. Standalone
        3.2.2. Distributed
          3.2.2.1. Additional Node
          3.2.2.2. Entry Node
  4. Configuration
      4.1. Specific Options
      4.2. Tuning
        4.2.1. Concurrency
        4.2.2. Base Storage Driver Usage Warnings
  5. Usage
      5.1. Event Stream Operations
        5.1.1. Create
          5.1.1.1. Transactional
        5.1.2. Read
        5.1.3. Update
        5.1.4. Delete
        5.1.5. End-to-end Latency
      5.2. Byte Stream Operations
        5.2.1. Create
        5.2.2. Read
        5.2.3. Update
        5.2.4. Delete
      5.3. Misc
        5.3.1. Manual Scaling
        5.3.2. Multiple Destination Streams
  6. Open Issues
  7. Development
      7.1. Build
      7.2. Test
        7.2.1. Automated
          7.2.1.1. Unit
          7.2.1.2. Integration
          7.2.1.3. Functional
        7.2.2. Manual

1. Introduction

Mongoose and Pravega are using quite different concepts. So it's necessary to determine how Pravega-specific terms are mapped to the Mongoose abstractions.

Pravega Mongoose
Stream Item Path or Data Item
Scope Storage Namespace
Event Data Item
Stream Segment N/A

2. Features

  • Authentication: provided externally
  • SSL/TLS: not implemented yet
  • Item Types:
    • data: corresponds to an event either byte stream depending on the configuration
    • path: not supported
    • token: not supported
  • Supported load operations:
    • create (events, byte streams)
    • read (events, byte streams)
    • delete (streams)
  • Storage-specific:

3. Deployment

3.1. Basic

Java 11+ is required to build/run.

  1. Get the latest mongoose-base jar from the maven repo and put it to your working directory. Note the particular version, which is referred as BASE_VERSION below.

  2. Get the latest mongoose-storage-driver-preempt jar from the maven repo and put it to the ~/.mongoose/<BASE_VERSION>/ext directory.

  3. Get the latest mongoose-storage-driver-pravega jar from the maven repo and put it to the ~/.mongoose/<BASE_VERSION>/ext directory.

java -jar mongoose-base-<BASE_VERSION>.jar \
    --storage-driver-type=pravega \
    --storage-namespace=scope1 \
    --storage-net-node-addrs=<NODE_IP_ADDRS> \
    --storage-net-node-port=9090 \
    --load-batch-size=100 \
    --storage-driver-limit-queue-input=10000 \
    ...

3.2. Docker

3.2.1. Standalone

docker run \
    --network host \
    emcmongoose/mongoose-storage-driver-pravega \
    --storage-namespace=scope1 \
    --storage-net-node-addrs=<NODE_IP_ADDRS> \
    --load-batch-size=100 \
    --storage-driver-limit-queue-input=10000 \
    ...

Use emcmongoose/mongoose-storage-driver-pravega:4.2.29 for pravega 0.7 or earlier.

3.2.2. Distributed

3.2.2.1. Additional Node

docker run \
    --network host \
    --expose 1099 \
    emcmongoose/mongoose-storage-driver-pravega \
    --run-node

3.2.2.2. Entry Node

docker run \
    --network host \
    emcmongoose/mongoose-storage-driver-pravega \
    --load-step-node-addrs=<ADDR1,ADDR2,...> \
    --storage-net-node-addrs=<NODE_IP_ADDRS> \
    --storage-namespace=scope1 \
    --load-batch-size=100 \
    --storage-driver-limit-queue-input=10000 \
    ...

4. Configuration

4.1. Specific Options

Name Type Default Value Description
storage-driver-control-scope boolean true Allow to try to create scope
storage-driver-control-timeoutMillis integer 2000 The timeout for any Pravega Controller API call
storage-driver-create-timestamp boolean false Should write 8 bytes at the beginning of each event. Required for the e2e latency mode as there is no current option to pass metadata with the event.
storage-driver-event-key-enabled boolean false Specifies if Mongoose should generate its own routing key during the events creation
storage-driver-event-key-count integer 0 Specifies a max count of unique routing keys to use during the events creation (may be considered as a routing key period). 0 value means to use unique routing key for each new event
storage-driver-event-timeoutMillis integer 100 The event read timeout in milliseconds
storage-driver-read-e2eMode boolean false Enables e2e mode for read. This enables tail read as well as writing data to a csv file.
storage-driver-read-tail boolean false Enables tail read. Catch-up read by default.
storage-driver-scaling-type enum "fixed" The scaling policy type to use (fixed/event_rate/kbyte_rate). See the Pravega documentation for details
storage-driver-scaling-rate integer 0 The scaling policy target rate. May be measured in events per second either kilobytes per second depending on the scaling policy type
storage-driver-scaling-factor integer 0 The scaling policy factor. From the Pravega javadoc: the maximum number of splits of a segment for a scale-up event.
storage-driver-scaling-segments integer 1 From the Pravega javadoc: the minimum number of segments that a stream can have independent of the number of scale down events.
storage-driver-stream-data enum "events" Work on events or byte streams (if bytes is set)
storage-net-node-addrs list of strings 127.0.0.1 The list of the Pravega storage nodes to use for the load
storage-net-node-port integer 9090 The default port of the Pravega storage nodes, should be explicitly set to 9090 (the value used by Pravega by default)
storage-net-maxConnPerSegmentstore integer 5 The default amount of connections per each Pravega Segmentstore
storage-net-node-conn-pooling boolean true Use or not the connection pooling for the event writers See this Pravega issue for details

4.2. Tuning

  • storage-net-maxConnPerSegmentstore This parameter can largely affect the performance, but it also increases network workload

  • storage-driver-threads Amount of eventReaders per stream is equal to amount of storage-driver-threads. And as known from Pravega doc, the largest effective reader group consists of as many readers as there are segments in the stream we read.

4.2.1. Concurrency

There are two configuration options controlling the load operations concurrency level.

  • storage-driver-limit-concurrency Limits the count of the active load operations at any moment of the time. The best practice is to set it to 0 (unlimited concurrency for the asynchronous operations, aka the top gear of the "burst mode").

  • storage-driver-threads The count of the threads running/submitting the load operations execution. The meaningful values are usually only few times more than the count of the available CPU threads.

4.2.2. Base Storage Driver Usage Warnings

See the design notes

5. Usage

5.1. Event Stream Operations

Mongoose should perform the load operations on the events when the configuration option item-type is set to data.

5.1.1. Create

Write new events into the specified stream(s). The operation latency is measured as the time between the corresponding writeEvent call returns and the returned completion callback is triggered.

5.1.1.1. Transactional

Using the transactions to create the events allows to write the events in the batch mode. The maximum count of the events per transaction is defined by the load-batch-size configuration option value.

Example:

docker run \
    --network host \
    emcmongoose/mongoose-storage-driver-pravega \
    --storage-namespace=scope1 \
    --storage-driver-event-batch \
    --load-step-limit-count=100000 \
    --load-batch-size=1024 \
    --item-output-path=eventsStream1 \
    --item-data-size=10KB

Note that in this mode the operation (transaction) latency is equal to the duration.

5.1.2. Read

Notes:

  • The Pravega storage doesn't support reading the stream events in the random order.
  • The operation latency is effectively equal to the duration due to blocking read event method.

There is a configuration parameter called storage-driver-read-timeoutMillis. Pravega documentation says it only works when there is no available event in the stream. readNextEvent() will block for the specified time in ms. So, in theory 0 and 1 should work just fine. They do not so far. In practice, this value should be somewhere around 2000 ms (it is the Pravega default value).

docker run \
    --network host \
    emcmongoose/mongoose-storage-driver-pravega \
    --storage-namespace=scope1 \
    --item-input-path=stream3 \
    --load-batch-size=100 \
    --storage-driver-limit-concurrency=0 \
    --load-op-type=read \
    --load-op-recycle  

Right now only single-stream reading is supported. Each thread from --storage-driver-threads parameter will be taking a reader from a joint pool. All readers will be in the same readerGroup. So, it is user's responsibility to define an amount of threads the way that there aren't any readers with no EventSegmentReaders. Which happens when there are no segments assigned to a reader as each segment is assigned to only one reader within a readerGroup.

It's important to know that Mongoose's Item generator is driver agnostic. So it has no knowledge about the fact that reader can be reading basically nothing as it doesn't have an assigned EventSegmentReader, but it will still receive its portion of Items. This way if you set e.g. 5 mongoose threads while having 1 stream segment --load-batch-size=100, you are not going to get 100 successful operations as part of them wouldn't retrieve any data. Considering this case you might want to use timeouts for automation instead of op-count.

To achieve maximum efficiency the main point to be considered is auto-scaling. If it's disabled, then the best match of threads is amount of segments in the stream, which is constant during the reading. With auto-scaling enabled, it is yet to be found out what the best match is.

There are two ways reading can be done:

  • Set the count/time limit to stop reading after a certain moment
  • Don't specify any options to stop reading, in this case driver works endlessly unless explicitly stopped (e.g. ctrl+c)

5.1.3. Update

Not supported. A stream append may be performed using the create load operation type and a same stream previously used to write the events.

5.1.4. Delete

Not supported.

5.1.5. End-to-end Latency

The end-to-end latency is a time span between the CREATE and READ operations executed for the same item. The end-to-end latency may be measured using e2e latency mode:

  1. Start writing the messages to a stream with enabled timestamps recording. Example command:
       --network host \
       emcmongoose/mongoose-storage-driver-pravega \
       --storage-namespace=scope1 \
       --load-service-threads=1 \
       --storage-driver-limit-concurrency=0
       --item-data-size=10B \
       --item-output-path=stream1 \
       --load-batch-size=1000 \
       --storage-driver-limit-queue-input=1000 \
       --storage-driver-create-timestamp \
       --load-op-limit-rate=100000

The last parameter is needed to make writing slower than reading, as it is not always the case. While using e2e latency mode the maximum throughput is not the main point of interest, so this behaviour can be allowed.

  1. Start the e2eMode reading from the same stream:
      --network host \
      emcmongoose/mongoose-storage-driver-pravega \
      --storage-namespace=scope1 \
      --load-batch-size=1000 \
      --load-service-threads=1 \
      --storage-driver-limit-concurrency=0 \
      --load-op-type=read \
      --item-input-path=stream1 \
      --storage-driver-read-e2eMode \
      --load-op-recycle \
      --load-step-id=e2e_test
  1. Check the end-to-end time data in the log/e2e_test/op.trace.csv log file. The data is in the CSV format with 3 columns:
  • internal message id
  • event payload size
  • end-to-end time in milliseconds

Note: the end-to-end time data will not be aggregated in the distributed mode.

5.2. Byte Stream Operations

Mongoose should perform the load operations on the streams when the configuration option storage-driver-stream-data is set to bytes. This means that the whole streams are being accounted as items.

Currently, unsupported for Pravega 0.8+.

5.2.1. Create

Creates the byte streams. The created byte stream is filled with content up to the size determined by the item-data-size option. The create operation will fail with the status code #7 if the stream existed before.

Example:

docker run \
    --network host \
    emcmongoose/mongoose-storage-driver-pravega \
    --storage-driver-stream-data=bytes \
    --storage-namespace=scope1 \
    --storage-driver-limit-concurrency=100 \
    --storage-driver-threads=100

5.2.2. Read

Reads the byte streams.

Example:

docker run \
    --network host \
    emcmongoose/mongoose-storage-driver-pravega \
    --item-input-file=streams.csv \
    --read \
    --storage-driver-stream-data=bytes \
    --storage-driver-limit-concurrency=10 \
    --storage-driver-threads=10 \
    --storage-namespace=scope1

It's also possible to perform the byte streams read w/o the input stream items file:

docker run \
    --network host \
    emcmongoose/mongoose-storage-driver-pravega \
    --item-input-path=scope1 \
    --read \
    --storage-driver-stream-data=bytes \
    --storage-driver-limit-concurrency=10 \
    --storage-driver-threads=10 \
    --storage-namespace=scope1

All streams in the specified scope are listed and analyzed for the current size before the reading.

5.2.3. Update

Not implemented yet

5.2.4. Delete

Before the deletion, the stream must be sealed because of Pravega concepts. So the sealing of the stream is done during the deletion too.

5.3. Misc

5.3.1. Manual Scaling

It's required to make a manual destination stream scaling while the event writing load is in progress in order to see if the rate changes. The additional load step may be used to perform such scaling. In order to not perform any additional load it should be explicitly configured to do a minimal work:

  • load operations count limit: 1
  • concurrency limit: 1
  • payload size: 1 bytes

For more details see the corresponding scenario content.

5.3.2. Multiple Destination Streams

The configuration expression language feature may be used to specify multiple destination streams to write the events. The example of the command to write the events into 1000 destination streams (in the random order):

docker run \
    --network host \
    emcmongoose/mongoose-storage-driver-pravega \
    --storage-namespace=scope1 \
    --item-data-size=1000 \
    --item-output-path=stream-%p\{1000\;1\}

6. Open Issues

Issue Description

7. Development

7.1. Build

Note the Pravega commit # which should be used to build the corresponding Mongoose plugin. Specify the required Pravega commit # in the build.gradle file. Then run:

./gradlew clean jar

7.2. Test

7.2.1. Automated

7.2.1.1. Unit

./gradlew clean test

7.2.1.2. Integration

docker run -d --name=storage --network=host pravega/pravega:<PRAVEGA_VERSION> standalone
./gradlew integrationTest

7.2.1.3. Functional

./gradlew jar
export SUITE=api.storage
TEST=create_event_stream ./gradlew robotest
TEST=create_byte_streams ./gradlew robotest
TEST=read_byte_streams ./gradlew robotest
TEST=read_all_byte_streams ./gradlew robotest
TEST=create_event_transactional_stream ./gradlew robotest

7.2.1. Manual

  1. Build the storage driver
  2. Copy the storage driver's jar file into the mongoose's ext directory:
cp -f build/libs/mongoose-storage-driver-pravega-*.jar ~/.mongoose/<MONGOOSE_BASE_VERSION>/ext/

Note that the Pravega storage driver depends on the Preempt Storage Driver extension so it should be also put into the ext directory 3. Build and install the corresponding Pravega version:

./gradlew pravegaExtract
  1. Run the Pravega standalone node:
build/pravega_/bin/pravega-standalone
  1. Run Mongoose's default scenario with some specific command-line arguments:
java -jar mongoose-<MONGOOSE_BASE_VERSION>.jar \
    --storage-driver-type=pravega \
    --storage-net-node-port=9090 \
    --storage-driver-limit-concurrency=10 \
    --item-output-path=goose-events-stream-0

8. CI

CI is located here:

8.1. CI Runner

It is often the case that robotests are not completed successfully when shared ci runner is used. So, one can create and attach his own ci runner with higher resources.

  1. Start the runner:
docker run -d --name gitlab-runner --restart always \
    -v /srv/gitlab-runner/config:/etc/gitlab-runner \
    -v /var/run/docker.sock:/var/run/docker.sock \
    gitlab/gitlab-runner:latest
  1. Register the runner:
docker run --rm -it -v /srv/gitlab-runner/config:/etc/gitlab-runner gitlab/gitlab-runner register -n \
    --url https://gitlab.com/ \
    --registration-token <token> \
    --executor "docker" \
    --docker-image "docker:18.09.7" \
    --description "mong-runner" \
    --docker-privileged \
    -- tag pravega

One main thing why it's duplicated here is to remind that as we build aan image using dind, we need to set privileged mode for the internal docker process.