Persistor is a Google Cloud Platform (GCP) component used for permanent data storage.
Its primary purpose is to serve as a back-up component that works independently of the rest of the system.
It is used to simplify the subsequent processing of data in the case of logical errors or data mapping. Also, it is used as a safety step.
Based on our previous experience, it is very useful to have data stored in cheap storage such as GCP Storage used in Persistor. Data stored that way can be easily made available to different processing needs (depending on the use case, e.g. BigQuery is not always the best option to access the data).
Technology stack:
- Programing language: Go
- Big data: GCP Storage
- Compute: GCP Cloud Functions
More details about Persistor can be found on
Persistor is divided into five modules. A module is a collection of related Go packages that are released together. A file named go.mod in each folder declares the module path: the import path prefix for all packages within the module.
The following structure allows us not to build/deploy the entire project, but only the modules we are planning to use.
+---invoker
| go.mod
| invoker.go
|
+---lib
| getEnvVariable.go
| go.mod
| invokerInfo.go
| puller.go
| pullerInfo.go
| storage.go
| storageInfo.go
|
+---pull
| go.mod
| pull.go
|
+---push
| go.mod
| push.go
|
+---streamingPull
| go.mod
| streamingPull.go
| .gitignore
| CODE_OF_CONDUCT.md
| CONTRIBUTING.md
| LICENSE.md
| README.md
\ SUPPORT_INFORMATION.md
GCP
- Existing or new project
- Enabled billing for your Cloud project
- Enabled the Cloud Functions, Cloud Pub/Sub and Cloud Build APIs.
- Pub/Sub Topic
- Storage Bucket
Program language: Go 1.13
Persistor supports three different subscriber mechanisms:
- Push for minor activity of data publishers, topic and subscription need to be in the same project
- Synchronous Pull for periodically active data publishers
- Streaming Pull for continuous flow of messages/events
Subscriber component reads messages from Pub/Sub subscription, and store them on GCP Storage (GCS) in raw format.
Files on GCS are stored in folder structure based on ingestion date. Format of a folder:
[Bucket Name]/[YYYY]/[MM]/[DD]/[HH]
Each message is stored as separate object on GCS, with content identical to payload of Pub/Sub message.
Instructions on how to set up a local development environment and how to integrate it into an existing environment
To start developing the project further:
git clone https://github.com/syntio/aquarium-persistor-gcp.git
cd aquarium-persistor-gcp
In order to create a new folder/package in the existing solution and to be able to use the library, it is necessary to create and configure the go.mod by adding the following line:
replace github.com/syntio/aquarium-persistor-gcp/lib => ../lib
The library import path requires the following form:
github.com/syntio/aquarium-persistor-gcp/lib
Instructions on how to establish GCP persistor connection between Pub/Sub and GCS storage using Cloud shell can be found here.
Pull does not always store the exact number of messages that was determined by the Number of messages parameter. There could be 1 - 3 more messages pulled and stored. Timeout parameter of a Cloud Function that pulls and stores messages synchronously needs to be adjusted so that the function terminates shortly after the last message is stored. Otherwise the function will stay active until timeout.
Issue tracker: Issues
- In case of sensitive bugs like security vulnerabilities, please contact support@syntio.net directly instead of using issue tracker. We value your effort to improve the security and privacy of this project!
Please refer to CONTRIBUTING.md
The repository is developed and maintained by Syntio Labs.
Licensed under the Apache License, Version 2.0