
Google_cloud_to_neo4j v2024-03-27-00_rc00 causes unexpected increase of shuffle data processed and worker OOMs #1416

Closed
mudravrik opened this issue Apr 5, 2024 · 3 comments · Fixed by #1418
Assignees
Labels
bug Something isn't working needs triage p2

Comments

@mudravrik

Related Template(s)

google_cloud_to_neo4j

Template Version

2024-03-27-00_rc00

What happened?

So I use the latest version of the google_cloud_to_neo4j pipeline and have faced unexpected Dataflow job failures for the past 12 hours or so.

The error logs I saw initially were quite cryptic, e.g.:
Error message from worker: Data channel closed, unable to receive additional data from SDK sdk-0-0
After some debugging, I concluded that the actual cause of the failures was workers running out of memory.
However, I haven't changed anything in the job itself, and it is pretty simple: export a single BQ table and write it into Neo4j via custom_query. I had never experienced OOM issues with it before.

So I tried to understand what had changed, and noticed the template version bump and, more importantly, a jump in shuffle data processed from 22.66 KB in yesterday's run to 10.67 GB in the new, failed one.

After seeing the template version change, I tried switching back to gs://dataflow-templates-us-west1/2024-03-06-00_RC00/flex/Google_Cloud_to_Neo4j and the job went back to normal, so I assume the issue was caused by the new template version.

Relevant log output

No response

@mudravrik mudravrik added bug Something isn't working needs triage p2 labels Apr 5, 2024
@fbiville fbiville self-assigned this Apr 8, 2024
fbiville added a commit to fbiville/DataflowTemplates that referenced this issue Apr 8, 2024
@fbiville
Collaborator

fbiville commented Apr 9, 2024

Thanks for the report!

There are indeed two concerns.

The first one is about the workers running out of memory.
This is due to a naive implementation that partitions key groups into
smaller lists. The linked pull request fixes that.
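To illustrate the kind of fix involved, here is a sketch in plain Python (hypothetical helper names, not the template's actual Beam code): an eager partitioner materializes the whole key group before slicing it, while a streaming one keeps memory bounded by the chunk size.

```python
from itertools import islice
from typing import Iterable, Iterator, List

def partition_eager(rows: Iterable, size: int) -> List[List]:
    # Naive approach: materializes the entire key group in memory
    # before slicing it, which can OOM a worker on large groups.
    all_rows = list(rows)
    return [all_rows[i:i + size] for i in range(0, len(all_rows), size)]

def partition_lazy(rows: Iterable, size: int) -> Iterator[List]:
    # Streaming approach: yields one chunk at a time, so peak memory
    # is bounded by the chunk size, not the group size.
    it = iter(rows)
    while chunk := list(islice(it, size)):
        yield chunk
```

Both produce the same chunks; only the eager version holds the full group in memory at once.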

The second concern is about the amount of shuffled data.
Prior to PR #1346, the Neo4j template was not taking the parallelism
factor into account.

The parallelism factor now plays its intended role: source rows are
spread across N buckets, where N is the configured or default factor
value.

By default:

  • node targets have a default parallelism factor of 5
  • relationship and custom query targets have a default parallelism factor of 1

Since the source data processing may be spread across several workers,
grouping source data rows to a single bucket necessarily involves data
shuffling.
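As a rough illustration of the bucketing step (plain Python, not the template's Beam code; the round-robin assignment here is a simplifying assumption):

```python
from collections import defaultdict

def spread_into_buckets(rows, n):
    # Assign each source row to one of n buckets round-robin.
    # Rows landing in a bucket processed by a different worker than
    # the one that read them must be shuffled over the network.
    buckets = defaultdict(list)
    for i, row in enumerate(rows):
        buckets[i % n].append(row)
    return buckets

# 100 source rows spread across 5 buckets of 20 rows each.
buckets = spread_into_buckets(range(100), 5)
```

With n = 1 (the relationship/custom query default), every row is funnelled into a single bucket, which is why even that case involves shuffling when the source rows were read on several workers.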

You can adjust the factor in the top-level config object with:

  • node_write_parallelism for node targets
  • edge_write_parallelism for relationship targets
  • custom_query_parallelism for custom query targets
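In the job specification JSON, that might look roughly like this (a sketch only; the values shown are the defaults mentioned above, and any surrounding keys of the real spec are omitted):

```json
{
  "config": {
    "node_write_parallelism": 5,
    "edge_write_parallelism": 1,
    "custom_query_parallelism": 1
  }
}
```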

fbiville added a commit to fbiville/DataflowTemplates that referenced this issue Apr 11, 2024
fbiville added a commit to fbiville/DataflowTemplates that referenced this issue Apr 11, 2024
fbiville added a commit to fbiville/DataflowTemplates that referenced this issue Apr 12, 2024
@mudravrik
Author

@fbiville Thank you so much for your response!

Unfortunately, I have a rather poor understanding of what is going on with parallelism in this template, so maybe you can help me understand it better or point me to the proper documentation?

In my case, the BQ part of the job produces multiple Avro files (let's say 10x the number of Dataflow workers) into the bucket.
The Dataflow workers then pick up these files and proceed with writing.
Do I understand correctly that, with *_parallelism = 1, each worker picks up only one file and sends insert requests to Neo4j one by one from that single "active" file? In that case the number of parallel queries reaching Neo4j should equal the number of workers.
(At least, that is how I interpret the pipeline's behaviour right now: I have 20 workers running on this step, almost zero shuffle, and 20 queries in Neo4j.)

When I increase parallelism, every worker will again pick up one Avro file from BQ, but will then shuffle the data from this file between multiple workers to achieve the parallelism number. But does every worker do the same in this case? So I should end up with num_(active)_workers * custom_query_parallelism queries, right?

At least in my use case it feels a bit counterintuitive, from each worker's perspective, to send the data from the file you worked on to other workers while receiving data from their files at the same time.
I mean, each worker could just split its own file into multiple parts and send them without all this cross-sharing :)

This is, however, a very naive perspective on my part, probably applicable only to my very simple case.

fbiville added a commit to fbiville/DataflowTemplates that referenced this issue May 6, 2024
@fbiville
Collaborator

Hello, apologies for the late reply, I've been thinking about it for a while :)

Without seeing your pipeline, it will be hard to assess your statements.

Basically, the number of parallel write queries executing against Neo4j depends on a number of factors:

  • the pipeline "shape": are the node/relationship/custom query targets running in parallel or sequentially? by default (currently), nodes are processed first, then relationships, then custom queries
  • the parallelism factor for each target type: every target type has a configurable parallelism factor with a default value (see above explanation)
  • the number of available workers for the execution

The parallelism factor you can configure for the Neo4j template applies to each target, and spreads the incoming source data into N partitions, with 1 partition per worker.

The template needs to make sure that each write pipeline step reads from N partitions (where N = parallelism factor).
That requires shuffling from the source data, which is not avoidable.
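A toy simulation of why the shuffle is unavoidable (pure Python with a hypothetical, simplified ownership model, not the template's actual scheduling): rows read on one worker often land in a partition handled by a different worker, and those rows must move over the network.

```python
def shuffled_fraction(num_workers, rows_per_worker, n_partitions):
    # Tag each source row with the worker that read it.
    rows = [(w, i) for w in range(num_workers) for i in range(rows_per_worker)]
    # Round-robin partition assignment; assume partition p is handled
    # by worker p % num_workers (a simplifying assumption).
    moved = 0
    for idx, (origin, _) in enumerate(rows):
        partition = idx % n_partitions
        owner = partition % num_workers
        if owner != origin:
            moved += 1
    # Fraction of rows that crossed a worker boundary, i.e. were shuffled.
    return moved / len(rows)
```

With a single worker nothing moves, but as soon as several workers read the source and the rows are regrouped into partitions, a large fraction of rows has to change workers.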

fbiville added a commit to fbiville/DataflowTemplates that referenced this issue May 22, 2024
copybara-service bot pushed a commit that referenced this issue May 22, 2024