Karlo K edited this page Nov 24, 2020 · 2 revisions

Persistor is a Google Cloud Platform (GCP) component used for permanent data storage. Its primary purpose is to serve as a back-up component that works independently of the rest of the system. It simplifies the subsequent processing of data in the case of logical or data-mapping errors, and acts as a safety step.

The idea behind Aquarium GCP Persistor

Think of a GCP infrastructure consisting of three components:

  1. A publisher
  2. A Cloud Function subscriber that collects and stores messages - GCP Persistor
  3. Google Cloud Storage

Storing data in cheap storage (such as GCS) as a safety step can be highly beneficial. For example, this way we can store all data in GCS for subsequent processing and potential re-submission. GCP Persistor can be used to simplify subsequent processing of data that may be needed due to mapping or logical errors. If the data is already stored in GCS, there is no need to republish it from the source systems, which is usually a long and complicated process. Even in cases when the source system is able to republish the data in a simple manner, this can result in the loss of potential unrecorded changes that may have occurred between the initial and republished dataset.

Additionally, data stored in this manner can easily be made available for different processing needs (depending on the use case; for example, BigQuery is not always the best option for accessing the data).

Introduction to Pub/Sub

Publisher

A publisher application creates and sends messages to a topic. The general flow for a publisher application is to create a message containing your data, and send a request to the Pub/Sub Server to publish the message to the desired topic. A message consists of fields with the message data and metadata. After a message is published, the Pub/Sub service returns the message ID to the publisher.
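The publish flow described above can be sketched with the Python client library. This is a minimal, illustrative sketch, not the actual Persistor code: the project and topic names are assumed, and the `google-cloud-pubsub` package is required for the network call (its import is deferred into the function so the helper logic stays self-contained).

```python
# Minimal publisher sketch. Project/topic names and the payload shape are
# illustrative assumptions, not taken from this wiki page.
import json

def build_payload(record: dict) -> bytes:
    """Pub/Sub message data must be bytes; serialize the record as JSON."""
    return json.dumps(record, sort_keys=True).encode("utf-8")

def publish(project_id: str, topic_id: str, record: dict) -> str:
    """Publish one message and return the server-assigned message ID."""
    from google.cloud import pubsub_v1  # requires google-cloud-pubsub

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, topic_id)
    future = publisher.publish(topic_path, build_payload(record))
    return future.result()  # blocks until Pub/Sub returns the message ID
```

Note that `publish()` returns a future; calling `result()` waits for the server's response, which carries the message ID mentioned above.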

Pub/Sub subscriptions

To receive messages published to a topic, you must create a subscription to that topic. The subscription connects the topic to a subscriber application that receives and processes messages published to the topic. A topic can have multiple subscriptions, but a given subscription belongs to a single topic. Pub/Sub delivers each published message at least once for every subscription. In cases when subscribers do not keep up with the flow of messages, a message that cannot be delivered within the maximum retention time of 7 days is deleted and is no longer accessible. It is, however, possible to configure the message retention duration (the range is from 10 minutes to 7 days). Also note that a message published before a given subscription was created will usually not be delivered to that subscription.
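A subscription with an explicit retention duration can be created programmatically; the sketch below is illustrative (the names are assumptions and `google-cloud-pubsub` is required for the actual API call), with the 10-minute-to-7-day retention range from the paragraph above encoded as a pure check.

```python
# Sketch of creating a subscription with explicit settings; names are
# illustrative assumptions.
from datetime import timedelta

# Pub/Sub allows configuring message retention between 10 minutes and 7 days.
MIN_RETENTION = timedelta(minutes=10)
MAX_RETENTION = timedelta(days=7)

def valid_retention(duration: timedelta) -> bool:
    """Check a candidate retention duration against Pub/Sub's allowed range."""
    return MIN_RETENTION <= duration <= MAX_RETENTION

def create_subscription(project_id, topic_id, subscription_id,
                        ack_deadline_s=60, retention=MAX_RETENTION):
    """Create a subscription with an explicit ack deadline and retention."""
    from google.cloud import pubsub_v1  # requires google-cloud-pubsub

    if not valid_retention(retention):
        raise ValueError("retention must be between 10 minutes and 7 days")
    subscriber = pubsub_v1.SubscriberClient()
    with subscriber:
        return subscriber.create_subscription(
            request={
                "name": subscriber.subscription_path(project_id, subscription_id),
                "topic": subscriber.topic_path(project_id, topic_id),
                "ack_deadline_seconds": ack_deadline_s,
                "message_retention_duration": retention,
            }
        )
```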

Once a message is sent to a subscriber, the subscriber should acknowledge it. A message is considered outstanding once it has been sent out for delivery and before a subscriber acknowledges it. The subscriber has a configurable, limited amount of time -- known as the ackDeadline -- to acknowledge the outstanding message. Once the deadline passes, the message is no longer considered outstanding, and Pub/Sub will attempt to redeliver it. Pub/Sub repeats delivery attempts until the message is acknowledged.

A subscription can use either the pull or push mechanism for message delivery.

Push subscription

In push delivery, Pub/Sub initiates requests to your subscriber application to deliver messages.

  1. The Pub/Sub server sends each message as an HTTPS request to the subscriber application at a pre-configured endpoint.
  2. The endpoint acknowledges the message by returning an HTTP success status code. A non-success response indicates that the message should be resent.
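The two steps above can be sketched as handler logic for the push endpoint. Pub/Sub POSTs a JSON envelope whose `message.data` field is base64-encoded; the sketch below is framework-agnostic (wiring it into an actual HTTP server is up to the deployment), and the `process` function is a hypothetical placeholder.

```python
# Sketch of a push-subscription endpoint's handler logic. Pub/Sub delivers a
# JSON envelope with base64-encoded message data; the endpoint acknowledges
# by returning a success status code.
import base64
import json

def process(data: bytes) -> None:
    """Hypothetical placeholder for the application's message handling."""
    print("received:", data.decode("utf-8"))

def handle_push(body: bytes) -> int:
    """Return the HTTP status code the endpoint should respond with."""
    try:
        envelope = json.loads(body)
        data = base64.b64decode(envelope["message"]["data"])
        process(data)
    except Exception:
        return 500  # non-success response: Pub/Sub will resend the message
    return 204  # success status code acknowledges the message
```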

Pull subscription

In cases where a customer's applications produce higher throughput, the push mechanism may not be able to consume and store all messages, because it processes messages one by one. The processing of an individual message may exceed the preconfigured acknowledgement deadline, also known as the lease. In pull delivery, your subscriber application initiates requests to the Pub/Sub server to retrieve messages.

  1. The subscribing application explicitly calls the pull method, which requests messages for delivery.
  2. The Pub/Sub server responds with the message (or an error if the queue is empty), along with an acknowledgment ID.
  3. The subscriber explicitly calls the acknowledge method, using the returned ack ID to acknowledge receipt.
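The three steps above map onto the synchronous pull API roughly as follows. This is a minimal sketch, assuming the `google-cloud-pubsub` package and illustrative project/subscription names; the `unpack` helper is a hypothetical convenience, not a library call.

```python
# Sketch of synchronous pull: request messages, then acknowledge them with
# the returned ack IDs. Names are illustrative assumptions.
def unpack(received_messages):
    """Separate payloads and ack IDs from a batch of pulled messages."""
    payloads = [m.message.data for m in received_messages]
    ack_ids = [m.ack_id for m in received_messages]
    return payloads, ack_ids

def pull_and_ack(project_id, subscription_id, max_messages=100):
    from google.cloud import pubsub_v1  # requires google-cloud-pubsub

    subscriber = pubsub_v1.SubscriberClient()
    path = subscriber.subscription_path(project_id, subscription_id)
    with subscriber:
        # 1. explicitly call the pull method to request messages
        response = subscriber.pull(
            request={"subscription": path, "max_messages": max_messages}
        )
        # 2. each received message carries its data plus an ack ID
        payloads, ack_ids = unpack(response.received_messages)
        # 3. acknowledge receipt using the returned ack IDs
        if ack_ids:
            subscriber.acknowledge(
                request={"subscription": path, "ack_ids": ack_ids}
            )
    return payloads
```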

Importantly, there are two variants of pull subscription: synchronous and streaming pull. Sync pull would ideally be used for periodically active data publishers, and, depending on your programming language, has concurrency support. On the other hand, in case of continuous flow of messages/events, streaming pull should be used.

GCP Persistor

GCP Persistor is a Cloud Function component that collects and stores all your data in an easily accessible way, in its original format. It’s the ultimate fail-safe on the ingestion process that uses GCP managed services.

GCP Persistor should be deployed as an infrastructure component. Based on a simple recommendation table, a customer chooses which Persistor type they would like to activate. There are three different types of Persistor for GCP:

  • Push (topic and subscription need to be in the same project)
  • Sync pull (for periodically active data publishers)
  • Streaming pull (for continuous flow of messages/events)

Different mechanisms define how data is obtained from the subscription and persisted on the storage.

GCP Persistor - Push

With the push mechanism, a Pub/Sub trigger starts an instance of a Google Cloud Function when a message/event arrives at the topic. The function then creates a file, writes the message content into that file and stores the file in the GCS bucket that the customer provided. This mechanism is suitable for low message throughput.
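The flow just described can be sketched as a Pub/Sub-triggered background Cloud Function. This is an illustrative sketch, not the actual Persistor source: the bucket environment variable and object-naming scheme are assumptions, and `google-cloud-storage` is required for the upload.

```python
# Sketch of a push-type Persistor: a Pub/Sub-triggered Cloud Function that
# writes each message, in its original format, to a GCS bucket. Bucket name
# and object naming are illustrative assumptions.
import base64
import os

BUCKET = os.environ.get("PERSISTOR_BUCKET", "my-persistor-bucket")  # assumed

def object_name(event_id: str) -> str:
    """One object per message, named after the unique Pub/Sub event ID."""
    return f"persistor/{event_id}.json"

def persist_message(event: dict, context) -> None:
    """Entry point for a Pub/Sub-triggered (1st gen) Cloud Function."""
    from google.cloud import storage  # requires google-cloud-storage

    data = base64.b64decode(event["data"])  # payload in its original format
    client = storage.Client()
    blob = client.bucket(BUCKET).blob(object_name(context.event_id))
    blob.upload_from_string(data)
```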

GCP Persistor - Sync pull

Synchronous pull consists of an invoker that is triggered by the Google Cloud Scheduler, and a pull function. When triggered, the invoker runs N instances of the pull function, each of which pulls a bulk of messages, writes them into a single file and stores that file in a GCS bucket.
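The "writes them into a single file" step of each pull-function instance might look like the sketch below. The newline-delimited framing and naming are illustrative assumptions (the actual Persistor's file format is not specified on this page), and `google-cloud-storage` is required for the upload.

```python
# Sketch of the sync-pull Persistor's inner step: combine a bulk of pulled
# messages into one object body and store it in GCS. Framing and names are
# illustrative assumptions.
def bulk_to_object_body(payloads: list) -> bytes:
    """Join a bulk of message payloads, newline-delimited, into one object body."""
    return b"\n".join(payloads)

def store_bulk(bucket_name: str, object_name: str, payloads: list) -> None:
    from google.cloud import storage  # requires google-cloud-storage

    client = storage.Client()
    blob = client.bucket(bucket_name).blob(object_name)
    blob.upload_from_string(bulk_to_object_body(payloads))
```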

GCP Persistor - Streaming pull

The architecture of streaming pull is similar to that of synchronous pull. The difference is that, in streaming pull, the client sends a StreamingPullRequest, which opens a stream through which data will be received. The connection is bi-directional. While the stream is open, Cloud Pub/Sub will send a StreamingPullResponse with more messages whenever the messages are available for delivery. Streaming pull is the right choice when high throughput and low latency are needed.
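With the high-level Python client, the streaming pull described above is a sketch like the following: `subscriber.subscribe` opens the bidirectional stream and invokes a callback as messages arrive. The names and timeout are illustrative assumptions, and `google-cloud-pubsub` is required for the stream itself.

```python
# Sketch of streaming pull: the client opens a stream and the callback is
# invoked for each delivered message. Names are illustrative assumptions.
def make_callback(sink: list):
    """Build a callback that records each payload and acks the message."""
    def callback(message):
        sink.append(message.data)
        message.ack()
    return callback

def stream(project_id, subscription_id, timeout=30.0):
    from concurrent.futures import TimeoutError
    from google.cloud import pubsub_v1  # requires google-cloud-pubsub

    subscriber = pubsub_v1.SubscriberClient()
    path = subscriber.subscription_path(project_id, subscription_id)
    received = []
    # Opens the bidirectional stream (StreamingPullRequest under the hood).
    future = subscriber.subscribe(path, callback=make_callback(received))
    with subscriber:
        try:
            future.result(timeout=timeout)  # block while the stream is open
        except TimeoutError:
            future.cancel()  # stop the stream after the timeout
            future.result()
    return received
```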

When choosing the deployment type, the customer should ask themselves the following two questions:

  • Are my Pub/Sub topic and the subscription in the same project?
  • What is the frequency and volume of messages/events that will be published to the topic?