
Google_cloud_to_neo4j v2024-03-27-00_rc00 causes unexpected increase of shuffle data processed and worker OOMs #1416

Closed
mudravrik opened this issue Apr 5, 2024 · 3 comments · Fixed by #1418
Assignees
Labels
bug Something isn't working needs triage p2

Comments

@mudravrik

Related Template(s)

google_cloud_to_neo4j

Template Version

2024-03-27-00_rc00

What happened?

So I use the latest version of the google_cloud_to_neo4j pipeline and have faced unexpected Dataflow job failures for the past 12 hours or so.

The error logs I saw initially were quite cryptic, e.g.:
Error message from worker: Data channel closed, unable to receive additional data from SDK sdk-0-0
After some debugging, I concluded that the actual cause of the failures was workers running out of memory.
However, I haven't changed anything in the job itself, and it is pretty simple: export a single BQ table and write it into Neo4j via custom_query. I had never experienced OOM issues with it before.

So I tried to understand what had changed, and noticed the template version bump and, more importantly, a jump in shuffle data processed from 22.66 KB in yesterday's run to 10.67 GB in the new, failed one.

After seeing the template version change, I tried switching back to gs://dataflow-templates-us-west1/2024-03-06-00_RC00/flex/Google_Cloud_to_Neo4j and the job went back to normal, so I assume the issue was caused by the new template version.

Relevant log output

No response

@mudravrik mudravrik added bug Something isn't working needs triage p2 labels Apr 5, 2024
@fbiville fbiville self-assigned this Apr 8, 2024
fbiville added a commit to fbiville/DataflowTemplates that referenced this issue Apr 8, 2024
@fbiville
Collaborator

fbiville commented Apr 9, 2024

Thanks for the report!

There are indeed two concerns.

The first one is about the workers running out of memory.
This is due to a naive implementation that partitions key groups into
smaller lists. The linked pull request fixes that.
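To illustrate the kind of fix involved, here is a sketch in plain Python (hypothetical helper names, not the template's actual Beam code): an eager partitioner materializes the whole key group before slicing it, while a streaming one keeps memory bounded by the chunk size.

```python
from itertools import islice
from typing import Iterable, Iterator, List

def partition_eager(rows: Iterable, size: int) -> List[List]:
    # Naive approach: materializes the entire key group in memory
    # before slicing it, which can OOM a worker on large groups.
    all_rows = list(rows)
    return [all_rows[i:i + size] for i in range(0, len(all_rows), size)]

def partition_lazy(rows: Iterable, size: int) -> Iterator[List]:
    # Streaming approach: yields one chunk at a time, so peak memory
    # is bounded by the chunk size, not the group size.
    it = iter(rows)
    while chunk := list(islice(it, size)):
        yield chunk
```

Both produce the same chunks; only the eager version holds the full group in memory at once.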

The second concern is about the amount of shuffled data.
Prior to PR #1346, the Neo4j template was not taking the parallelism
factor into account.

The parallelism factor now plays its intended role: source rows are
spread across N buckets, where N is the configured or default factor
value.

By default:

  • node targets have a default parallelism factor of 5
  • relationship and custom query targets have a default parallelism factor of 1

Since the source data processing may be spread across several workers,
grouping source data rows to a single bucket necessarily involves data
shuffling.
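As a rough illustration of the bucketing step (plain Python, not the template's Beam code; the round-robin assignment here is a simplifying assumption):

```python
from collections import defaultdict

def spread_into_buckets(rows, n):
    # Assign each source row to one of n buckets round-robin.
    # Rows landing in a bucket processed by a different worker than
    # the one that read them must be shuffled over the network.
    buckets = defaultdict(list)
    for i, row in enumerate(rows):
        buckets[i % n].append(row)
    return buckets

# 100 source rows spread across 5 buckets of 20 rows each.
buckets = spread_into_buckets(range(100), 5)
```

With n = 1 (the relationship/custom query default), every row is funnelled into a single bucket, which is why even that case involves shuffling when the source rows were read on several workers.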

You can adjust the factor in the top-level config object with:

  • node_write_parallelism for node targets
  • edge_write_parallelism for relationship targets
  • custom_query_parallelism for custom query targets
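In the job specification JSON, that might look roughly like this (a sketch only; the values shown are the defaults mentioned above, and any surrounding keys of the real spec are omitted):

```json
{
  "config": {
    "node_write_parallelism": 5,
    "edge_write_parallelism": 1,
    "custom_query_parallelism": 1
  }
}
```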

fbiville added a commit to fbiville/DataflowTemplates that referenced this issue Apr 11, 2024
fbiville added a commit to fbiville/DataflowTemplates that referenced this issue Apr 11, 2024
fbiville added a commit to fbiville/DataflowTemplates that referenced this issue Apr 12, 2024
@mudravrik
Author

@fbiville Thank you so much for your response!

Unfortunately, I have a rather poor understanding of what is going on with parallelism in this template, so maybe you can help me understand it better or point me to the proper documentation?

In my case, the BQ part of the job produces multiple Avro files (let's say 10x the number of Dataflow workers) into the bucket.
The Dataflow workers then pick up these files and proceed with writing.
Do I understand correctly that, with *_parallelism = 1, each worker picks up only one file and sends insert requests to Neo4j one by one from that single "active" file? In that case the number of parallel queries reaching Neo4j should equal the number of workers.
(At least, that is how I interpret the pipeline's behaviour right now: I have 20 workers running on this step, almost zero shuffle, and 20 queries in Neo4j.)

When I increase parallelism, every worker will again pick up one Avro file from BQ, but will then shuffle the data from this file between multiple workers to achieve the parallelism number. But does every worker do the same in this case? So I should end up with num_(active)_workers * custom_query_parallelism queries, right?

At least in my use case it feels a bit counterintuitive, from each worker's perspective, to send the data from the file you worked on to other workers while receiving data from their files at the same time.
I mean, each worker could just split its own file into multiple parts and send them without all this cross-sharing :)

This is, however, a very naive perspective on my part, probably applicable only to my very simple case.

fbiville added a commit to fbiville/DataflowTemplates that referenced this issue May 6, 2024
@fbiville
Collaborator

Hello, apologies for the late reply, I've been thinking about it for a while :)

Without seeing your pipeline, it will be hard to assess your statements.

Basically, the number of parallel write queries executing against Neo4j depends on a number of factors:

  • the pipeline "shape": are the node/relationship/custom query targets running in parallel or sequentially? by default (currently), nodes are processed first, then relationships, then custom queries
  • the parallelism factor for each target type: every target type has a configurable parallelism factor with a default value (see above explanation)
  • the number of available workers for the execution

The parallelism factor you can configure for the Neo4j template applies to each target, and spreads the incoming source data into N partitions, with 1 partition per worker.

The template needs to make sure that each write pipeline step reads from N partitions (where N = parallelism factor).
That requires shuffling from the source data, which is not avoidable.
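A toy simulation of why the shuffle is unavoidable (pure Python with a hypothetical, simplified ownership model, not the template's actual scheduling): rows read on one worker often land in a partition handled by a different worker, and those rows must move over the network.

```python
def shuffled_fraction(num_workers, rows_per_worker, n_partitions):
    # Tag each source row with the worker that read it.
    rows = [(w, i) for w in range(num_workers) for i in range(rows_per_worker)]
    # Round-robin partition assignment; assume partition p is handled
    # by worker p % num_workers (a simplifying assumption).
    moved = 0
    for idx, (origin, _) in enumerate(rows):
        partition = idx % n_partitions
        owner = partition % num_workers
        if owner != origin:
            moved += 1
    # Fraction of rows that crossed a worker boundary, i.e. were shuffled.
    return moved / len(rows)
```

With a single worker nothing moves, but as soon as several workers read the source and the rows are regrouped into partitions, a large fraction of rows has to change workers.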

fbiville added a commit to fbiville/DataflowTemplates that referenced this issue May 22, 2024
copybara-service bot pushed a commit that referenced this issue May 22, 2024