Google_cloud_to_neo4j v2024-03-27-00_rc00 causes unexpected increase of shuffle data processed and worker OOMs #1416
Comments
Thanks for the report! There are indeed two concerns. The first one is about the workers running out of memory. The second concern is about the amount of shuffled data. The parallelism factor now plays its intended role: source rows are spread across partitions according to the configured factor. Since the source data processing may be spread across several workers, you can adjust the factor in the top-level config object.
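For illustration, here is a minimal sketch of such a config, assuming the legacy JSON job-spec format where parallelism is set via node_write_parallelism and edge_write_parallelism in the top-level config object (the key names and values are assumptions, not taken from this thread; check the template documentation for the exact names; sources and targets omitted):

```json
{
  "config": {
    "node_write_batch_size": 5000,
    "edge_write_batch_size": 1000,
    "node_write_parallelism": 5,
    "edge_write_parallelism": 1
  }
}
```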
@fbiville Thank you so much for your response! Unfortunately, I have a rather poor understanding of what is going on with parallelism in this template, so maybe you can help me understand it better or point me to proper documentation? In my case, the BQ part of the job produces multiple Avro files (let's say 10x the number of Dataflow workers) into the bucket. When I increase parallelism, every worker will again pick up one Avro file from BQ, but afterwards it will shuffle the data from this file between multiple workers to achieve the parallelism number. But does every worker do the same in this case? So would I end up with every worker exchanging data with every other worker? At least in my use case it feels a bit counterintuitive, from each worker's perspective, to send the data from the file you worked on to others and at the same time receive data from others' files. It is, however, a very naive perspective on my part, probably only applicable to my very simple case.
Hello, apologies for the late reply, I've been thinking about it for a while :) Without seeing your pipeline, it is hard to assess your statements. Basically, the number of parallel write queries executing against Neo4j depends on a number of factors.
The parallelism factor you can configure for the Neo4j template applies to each target and spreads the incoming source data into N partitions, with one partition per worker. The template needs to make sure that each write pipeline step reads from N partitions (where N = the parallelism factor).
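To make that concrete, here is a rough conceptual sketch in Beam's Python SDK (not the template's actual Java code) of how spreading rows into N partitions forces a full shuffle; the names and the random-key approach are illustrative assumptions:

```python
import random

import apache_beam as beam

PARALLELISM = 8  # stand-in for the template's configured parallelism factor

with beam.Pipeline() as pipeline:
    rows = pipeline | "ReadSource" >> beam.Create(range(1000))  # stand-in for BQ/Avro rows
    repartitioned = (
        rows
        | "AssignPartition" >> beam.Map(lambda row: (random.randrange(PARALLELISM), row))
        # GroupByKey is a shuffle boundary: every element crosses the network,
        # so shuffled bytes scale with the total source size, no matter how the
        # rows were initially distributed among the workers.
        | "Repartition" >> beam.GroupByKey()
        | "DropKeys" >> beam.FlatMap(lambda kv: kv[1])
    )
```

If a release starts inserting a repartition step like this where none existed before, shuffled bytes would go from near-zero to the order of the whole source, which would match the kilobytes-to-gigabytes jump described in the report.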
Related Template(s)
google_cloud_to_neo4j
Template Version
2024-03-27-00_rc00
What happened?
So I use the latest version of the google_cloud_to_neo4j pipeline and faced unexpected Dataflow job failures for the past 12 hours or so.

The error logs I initially saw were quite cryptic, like:
Error message from worker: Data channel closed, unable to receive additional data from SDK sdk-0-0
After some debugging, I assumed the actual reason for the failure was workers running out of memory.

However, I haven't changed anything in the job itself, and it is pretty simple: an export of a single BQ table written into Neo4j via a custom_query, so I had never experienced OOM issues with it before. I then tried to understand what had changed and noticed the template version change and, more importantly, a jump in shuffle data processed from 22.66 KB in yesterday's run to 10.67 GB in the new, failed one.
After seeing the template version change, I tried switching back to gs://dataflow-templates-us-west1/2024-03-06-00_RC00/flex/Google_Cloud_to_Neo4j and the job went back to normal, so I assume the issue was caused by the new template version (see the sketch after the log output section for pinning a release).

Relevant log output
No response
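For reference, pinning a specific template release as in the workaround above can be done by pointing the flex-template launch at the dated GCS path. A minimal sketch, assuming the jobSpecUri parameter name and an illustrative bucket path (verify both against the template documentation):

```sh
# Launch the previous, known-good release of the template explicitly.
# Parameter names and paths are illustrative, not taken from the report.
gcloud dataflow flex-template run "bq-to-neo4j-pinned" \
  --template-file-gcs-location "gs://dataflow-templates-us-west1/2024-03-06-00_RC00/flex/Google_Cloud_to_Neo4j" \
  --region "us-west1" \
  --parameters jobSpecUri="gs://YOUR_BUCKET/job-spec.json"
```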