perf(neo4j): improve neo4j query performance by using node labels #10415

pashashaik-mms · 2024-05-02T10:57:09Z

PR created and contributed by: MediamarktSaturn Technology GmbH, Analytics-Services Team. Special thanks to @raudzis for the finding and idea proposed.

PR Introduction:
This PR optimizes how DataHub queries Neo4j. Previously, our Neo4j queries did not specify node labels in the MATCH clause, so the database scanned all nodes. This was inefficient, especially for large datasets. By adding dynamic node labels to our MATCH queries, we significantly improve query performance, because Neo4j can use its indexes more effectively.

Node Label Integration: Modified the Neo4j queries wherever applicable so that each query explicitly targets nodes with the specified label, reducing the search space and improving performance.
Performance: By applying node labels directly in our match clauses, the database engine can optimize node lookups using existing indexes, thus speeding up the query execution by reducing the number of nodes scanned.
Scalability: These improvements make our database queries more scalable, handling larger datasets more efficiently.
Maintainability: This change also enhances the clarity of our queries, making them more understandable at a glance, which benefits new contributors and maintainers alike.
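
To illustrate the idea, here is a minimal sketch of deriving a node label from a URN and putting it into the MATCH clause. It assumes URNs of the form `urn:li:<entityType>:<id>` and that nodes carry their entity type as a label; the class and method names (`LabeledMatchSketch`, `entityTypeOf`, `outgoingMatch`) are hypothetical and are not the code changed in this PR.

```java
final class LabeledMatchSketch {
  private LabeledMatchSketch() {}

  /** Extracts the entity type from a URN shaped like urn:li:<entityType>:<id>. */
  static String entityTypeOf(String urn) {
    // Limit the split to 4 parts so colons inside the id are left untouched.
    String[] parts = urn.split(":", 4);
    if (parts.length < 4 || !"urn".equals(parts[0]) || !"li".equals(parts[1])) {
      throw new IllegalArgumentException("Not a urn:li URN: " + urn);
    }
    return parts[2];
  }

  /**
   * Builds an outgoing-edge lookup that names the source node label.
   * The URN itself stays a query parameter ($urn); only the label is inlined.
   */
  static String outgoingMatch(String urn, String relationshipType) {
    return String.format(
        "MATCH (src:%s {urn: $urn})-[r:%s]->(dest) RETURN type(r), dest, 1",
        entityTypeOf(urn), relationshipType);
  }
}
```

With a label in the pattern, the planner only has to consider nodes carrying that label, and it can use an index on that label's `urn` property if one exists.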

Checklist

The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
~~[ ] Links to related issues (if applicable)~~
Tests for the changes have been added/updated (if applicable)
~~[ ] Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.~~
~~[ ] For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub~~

pashashaik-mms · 2024-05-02T11:38:28Z

You can find the screenshot attached comparing the profile result:

Profile WITHOUT NODELABEL Query

Query: PROFILE MATCH (src {urn: "urn:li:schemaField:(%s),warp_file_name)"})-[r:DownstreamOf]->(dest) RETURN type(r), dest, 1

Profile WITH NODELABEL Query

Query: PROFILE MATCH (src:schemaField {urn: "urn:li:schemaField:(%s),warp_file_name)"})-[r:DownstreamOf]->(dest) RETURN type(r), dest, 1

deepgarg-visa · 2024-05-22T15:53:01Z

@david-leifker @RyanHolstien
Could you please look into this PR? We have also seen significant improvement in read calls to Neo4j with these changes.

david-leifker

There are broken tests with ./gradlew :metadata-io:test

pashashaik-mms · 2024-05-23T15:20:11Z

@RyanHolstien @david-leifker Could you please approve this? The build, and therefore the unit tests, were failing because of a formatting issue. I have fixed it now and need your approval.

deepgarg-visa · 2024-05-24T10:57:24Z

@pashashaik-mms, as mentioned here:

The error below occurs because the test cases here pass "sourceEntityFilter" as null. As a result, the variable "srcNodeLabel" in method "findRelatedEntities" is never set and keeps its default blank value, which produces the following invalid query:

MATCH (src: )-[r:DownstreamOf ]-(dest ) WHERE left(type(r), 2)<>'r_' RETURN dest, type(r) SKIP $offset LIMIT $count

Please also handle the case where the variable "sourceEntityFilter" in method "findRelatedEntities" is null or empty.

metadata-io > nonsearch > com.linkedin.metadata.graph.dgraph.DgraphGraphServiceTest > testFindRelatedEntitiesDestinationType[11](dataset, [HasOwner], {or=[], direction=UNDIRECTED}, [RelatedEntity(relationshipType=HasOwner, urn=urn:li:dataset:(urn:li:dataPlatform:type,SampleDataset1,PROD), via=null), RelatedEntity(relationshipType=HasOwner, urn=urn:li:dataset:(urn:li:dataPlatform:type,SampleDataset2,PROD), via=null), RelatedEntity(relationshipType=HasOwner, urn=urn:li:dataset:(urn:li:dataPlatform:type,SampleDataset3,PROD), via=null), RelatedEntity(relationshipType=HasOwner, urn=urn:li:dataset:(urn:li:dataPlatform:type,SampleDataset4,PROD), via=null)]) STANDARD_ERROR org.neo4j.bolt.protocol.common.fsm.error.TransactionStateTransitionException: Invalid input ')': expected "%", "(" or an identifier (line 1, column 13 (offset: 12)) "MATCH (src: )-[r:DownstreamOf ]-(dest ) WHERE left(type(r), 2)<>'r_' RETURN dest, type(r) SKIP $offset LIMIT $count"

pashashaik-mms · 2024-05-24T14:22:03Z

FIXED NOW

pashashaik-mms · 2024-05-24T14:22:57Z

@deepgarg-visa Fixed now. Could you please check and approve it? It's been waiting a while now.

deepgarg-visa · 2024-05-24T14:32:35Z

@pashashaik-mms Have all metadata-io test cases passed?
I don't think this fix will solve the problem, because the default value of the variable "srcNodeLabel" is blank. Because of that,
the template matchTemplate = "MATCH (src:%s %s)-[r%s %s]-(dest %s)%s" generates the query below:

MATCH (src: )-[r:DownstreamOf ]->(dest )
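
One possible way to avoid the dangling colon is to make the colon part of the label fragment instead of the template. The sketch below is only an illustration: `RelatedEntitiesMatchSketch`, `labelClause`, and `GUARDED_TEMPLATE` are hypothetical names, and because the contents of the criteria and WHERE placeholders are not shown in this thread, they are left empty here.

```java
final class RelatedEntitiesMatchSketch {
  private RelatedEntitiesMatchSketch() {}

  // Template quoted in the review: the colon after "src" is always emitted.
  static final String ORIGINAL_TEMPLATE = "MATCH (src:%s %s)-[r%s %s]-(dest %s)%s";

  // Variant where the colon travels with the label, so a missing label leaves nothing behind.
  static final String GUARDED_TEMPLATE = "MATCH (src%s %s)-[r%s %s]-(dest %s)%s";

  /** Returns ":<label>" for a usable label, or an empty string for null or blank input. */
  static String labelClause(String nodeLabel) {
    if (nodeLabel == null || nodeLabel.trim().isEmpty()) {
      return "";
    }
    return ":" + nodeLabel.trim();
  }

  /** Formats the original template; a blank label yields the invalid "(src: )". */
  static String originalMatch(String srcNodeLabel, String relationshipTypeFilter) {
    return String.format(ORIGINAL_TEMPLATE, srcNodeLabel, "", relationshipTypeFilter, "", "", "");
  }

  /** Formats the guarded template; a null or blank label yields "(src )". */
  static String guardedMatch(String srcNodeLabel, String relationshipTypeFilter) {
    return String.format(
        GUARDED_TEMPLATE, labelClause(srcNodeLabel), "", relationshipTypeFilter, "", "", "");
  }
}
```

With a blank label, `originalMatch` reproduces the `(src: )` pattern from the failing test, while `guardedMatch` falls back to an unlabeled `(src )` pattern that Neo4j accepts.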

pashashaik-mms · 2024-05-24T16:41:03Z

@deepgarg-visa I have handled your scenario as well, and the tests now pass. Could you please check? removeEdgesFromNode() might be in balance.

deepgarg-visa · 2024-05-24T17:01:16Z

metadata-io/src/main/java/com/linkedin/metadata/graph/neo4j/Neo4jGraphService.java

@@ -648,18 +666,34 @@ public void removeEdgesFromNode(

    // build node label from entity type
    final String srcNodeLabel = urn.getEntityType();
+    String matchTemplate = "";


This code can be refactored as below:

```java
final RelationshipDirection relationshipDirection = relationshipFilter.getDirection();
final String srcNodeLabel = urn.getEntityType();

String matchTemplate = "";
matchTemplate =
    String.format(
        "MATCH (src {urn: $urn})-[r%s]-(dest) RETURN type(r), dest, 2", srcNodeLabel);
if (relationshipDirection == RelationshipDirection.INCOMING) {
  matchTemplate =
      String.format(
          "MATCH (src {urn: $urn})<-[r%s]-(dest) RETURN type(r), dest, 0", srcNodeLabel);
} else if (relationshipDirection == RelationshipDirection.OUTGOING) {
  matchTemplate =
      String.format(
          "MATCH (src {urn: $urn})-[r%s]->(dest) RETURN type(r), dest, 1", srcNodeLabel);
}
if (srcNodeLabel != null && !srcNodeLabel.isEmpty()) {
  matchTemplate =
      String.format(
          "MATCH (src:%s {urn: $urn})-[r%s]-(dest) RETURN type(r), dest, 2", srcNodeLabel);
  if (relationshipDirection == RelationshipDirection.INCOMING) {
    matchTemplate =
        String.format(
            "MATCH (src:%s {urn: $urn})<-[r%s]-(dest) RETURN type(r), dest, 0", srcNodeLabel);
  } else if (relationshipDirection == RelationshipDirection.OUTGOING) {
    matchTemplate =
        String.format(
            "MATCH (src:%s {urn: $urn})-[r%s]->(dest) RETURN type(r), dest, 1", srcNodeLabel);
  }
}
```
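
Note that `String.format` needs one argument per `%s`: a template with two placeholders and a single argument throws `MissingFormatArgumentException`, and in the label-free templates the only `%s` sits in the relationship slot. A more compact sketch that supplies both values and picks the arrows from the direction could look like the following. The `Direction` enum, the `relationshipTypeFilter` parameter (for example `":DownstreamOf"` or an empty string), and the class name are assumptions made for illustration, not the actual `Neo4jGraphService` code.

```java
final class RemoveEdgesMatchSketch {
  private RemoveEdgesMatchSketch() {}

  /** Hypothetical stand-in for RelationshipDirection. */
  enum Direction { INCOMING, OUTGOING, UNDIRECTED }

  static String buildMatch(
      String srcNodeLabel, Direction direction, String relationshipTypeFilter) {
    // Only emit ":<label>" when a label is available; "(src: {...})" is not valid Cypher.
    String labelClause =
        (srcNodeLabel == null || srcNodeLabel.isEmpty()) ? "" : ":" + srcNodeLabel;

    // Arrow heads depend on the direction; undirected patterns use plain dashes.
    String left = direction == Direction.INCOMING ? "<-" : "-";
    String right = direction == Direction.OUTGOING ? "->" : "-";

    // Same trailing markers as in the snippet above: 0 incoming, 1 outgoing, 2 undirected.
    int marker;
    switch (direction) {
      case INCOMING:
        marker = 0;
        break;
      case OUTGOING:
        marker = 1;
        break;
      default:
        marker = 2;
        break;
    }

    return String.format(
        "MATCH (src%s {urn: $urn})%s[r%s]%s(dest) RETURN type(r), dest, %d",
        labelClause, left, relationshipTypeFilter, right, marker);
  }
}
```

Building the label clause and the arrows separately keeps a single template instead of six near-identical ones, so the null or empty label case is handled in one place.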

pashashaik-mms added 2 commits May 2, 2024 11:09

perf(neo4j): improve query performance by using node labels

eeca7cf

perf(neo4j): improve query performance by using node labels

eb8b05a

github-actions bot added the product and community-contribution labels May 2, 2024

Merge branch 'master' into perf/neo4j-performance

971ebe3

pashashaik-mms changed the title ~~Perf/neo4j performance~~ perf(neo4j): improve neo4j query performance by using node labels May 2, 2024

david-leifker requested review from RyanHolstien and david-leifker May 22, 2024 16:39

david-leifker approved these changes May 22, 2024

david-leifker added this to the v0.13.3 milestone May 22, 2024

david-leifker added the merge-pending-ci label May 22, 2024

RyanHolstien approved these changes May 22, 2024

Merge branch 'master' into perf/neo4j-performance

fb3ecbd

Merge branch 'master' into perf/neo4j-performance

59f8aba

david-leifker requested changes May 22, 2024

pankajmahato-visa mentioned this pull request May 23, 2024

perf(neo4j): improve neo4j query performance by using node labels #10577

Open

“Pasha and others added 2 commits May 23, 2024 16:49

perf(neo4j): fix the gradle build issues occurred while implementing …

7346e42

…for Neo4j Query performance

Merge branch 'master' into perf/neo4j-performance

72b7ee0

pashashaik-mms requested a review from david-leifker May 23, 2024 15:17

Merge branch 'master' into perf/neo4j-performance

83aae29

perf(neo4j): have a null check to fix the unit test for neo4j perform…

389082b

…ance fix

pashashaik-mms requested a review from RyanHolstien May 24, 2024 14:23

perf(neo4j): have a null check to fix the unit test for neo4j perform…

69d9fa8

…ance issues

deepgarg-visa reviewed May 24, 2024
