Processing text data at scale with Apache Beam and Cloud Dataflow

This repository presents an optimized Apache Beam pipeline for generating sentence embeddings, runnable on Cloud Dataflow. It accompanies our blog post: Improving Dataflow Pipelines for Text Data Processing.
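
To give a sense of the pipeline's shape, here is a minimal sketch of a Beam pipeline that reads sentences from a text file and embeds them. It is not the repository's main.py: the embedding model (a TensorFlow Hub Universal Sentence Encoder) and all names here are illustrative assumptions, and the optimizations discussed in the blog post are omitted.

    # Minimal sketch only; model choice and names are assumptions.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    class EmbedSentences(beam.DoFn):
        """Embeds each input sentence with a model loaded once per worker."""

        def setup(self):
            # setup() runs once per worker, so the model is not
            # reloaded for every element.
            import tensorflow_hub as hub
            self._model = hub.load(
                "https://tfhub.dev/google/universal-sentence-encoder/4")

        def process(self, sentence):
            # The encoder maps a batch of strings to a batch of vectors;
            # we embed one sentence at a time here for simplicity.
            yield sentence, self._model([sentence]).numpy()[0]

    def run():
        with beam.Pipeline(options=PipelineOptions()) as pipeline:
            (
                pipeline
                | "ReadSentences" >> beam.io.ReadFromText("train_data.txt")
                | "Embed" >> beam.ParDo(EmbedSentences())
            )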

We assume you already have a billing-enabled Google Cloud Platform (GCP) project if you want to run the pipeline on Cloud Dataflow.

Running the code locally

To run the code locally, first install the dependencies: pip install -r requirements.txt. If you cannot create a Google Cloud Storage (GCS) Bucket, download the data from here; we only need the train_data.txt file. Note that without a GCS Bucket you cannot run the pipeline on Cloud Dataflow, which is the main objective of this repository.

After downloading the dataset, update the paths and command-line arguments in main.py that reference GCS so they point to your local copy.
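
For example, if main.py defines its input path with argparse (the argument names below are hypothetical; check main.py for the real ones), the change amounts to swapping the GCS default for the local file:

    # Hypothetical sketch; the real argument names live in main.py.
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--data-path",
        # Was a GCS path such as "gs://<BUCKET-NAME>/data/train_data.txt";
        # point it at the locally downloaded file instead.
        default="train_data.txt",
    )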

Then execute python main.py -r DirectRunner.

Running the code on Cloud Dataflow

  1. Create a GCS Bucket and note its name.

  2. Then create a folder called data inside the Bucket.

  3. Copy over the train_data.txt file to the data folder: gsutil cp train_data.txt gs://<BUCKET-NAME>/data.

  4. Then run the following from the terminal (the sketch after this list shows how these flags typically map to Beam pipeline options):

    python main.py \
        --project <GCP-Project> \
        --gcs-bucket <BUCKET-NAME> \
        --runner DataflowRunner
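
For orientation, flags like these typically end up in Beam's pipeline options before the pipeline is constructed. A minimal sketch of that wiring, with hypothetical values (the exact plumbing is in main.py):

    # Sketch only; main.py owns the real option plumbing.
    from apache_beam.options.pipeline_options import (
        GoogleCloudOptions,
        PipelineOptions,
        StandardOptions,
    )

    options = PipelineOptions()
    gcp = options.view_as(GoogleCloudOptions)
    gcp.project = "<GCP-Project>"                 # --project
    gcp.temp_location = "gs://<BUCKET-NAME>/tmp"  # scratch space on GCS
    gcp.region = "us-central1"                    # hypothetical region
    options.view_as(StandardOptions).runner = "DataflowRunner"  # --runner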

For more details, please refer to our blog post: Improving Dataflow Pipelines for Text Data Processing.
