Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf(neo4j): improve neo4j query performance by using node labels #10415

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

pashashaik-mms
Copy link

@pashashaik-mms pashashaik-mms commented May 2, 2024

PR created and contributed by: MediamarktSaturn Technology GmbH, Analytics-Services Team. Special thanks to @raudzis for the finding and idea proposed.

PR Introduction:
This PR introduces an optimization to the Neo4j querying process within our Datahub project. Previously, our Neo4j queries did not specify node labels during the match phase, which resulted in scanning all nodes in the database. This approach was inefficient, especially for large datasets. By integrating dynamic node labels into our match queries, we significantly improve query performance by leveraging Neo4j's ability to use indexes more effectively.

  • Node Label Integration: Modified the Neo4j queries wherever applicable and now, the query explicitly targets nodes with the specified label, reducing the search space and improving performance.

  • Performance: By applying node labels directly in our match clauses, the database engine can optimize node lookups using existing indexes, thus speeding up the query execution by reducing the number of nodes scanned.

  • Scalability: These improvements make our database queries more scalable, handling larger datasets more efficiently.

  • Maintainability: This change also enhances the clarity of our queries, making them more understandable at a glance, which benefits new contributors and maintainers alike.


Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • [ ] Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • [ ] Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • [ ] For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added product PR or Issue related to the DataHub UI/UX community-contribution PR or Issue raised by member(s) of DataHub Community labels May 2, 2024
@pashashaik-mms pashashaik-mms changed the title Perf/neo4j performance perf(neo4j): improve neo4j query performance by using node labels May 2, 2024
@pashashaik-mms
Copy link
Author

pashashaik-mms commented May 2, 2024

You can find the screenshot attached comparing the profile result:

Profile WITHOUT NODELABEL Query

Query: PROFILE MATCH (src {urn: "urn:li:schemaField:(%s),warp_file_name)"})-[r:DownstreamOf]->(dest) RETURN type(r), dest, 1
image

Profile WITH NODELABEL Query

Query: PROFILE MATCH (src:schemaField {urn: "urn:li:schemaField:(%s),warp_file_name)"})-[r:DownstreamOf]->(dest) RETURN type(r), dest, 1
image

@deepgarg-visa
Copy link
Contributor

@david-leifker @RyanHolstien
Could you please look into this PR as we have also seen significant improvement with these changes in read calls to Neo4j

@david-leifker david-leifker added this to the v0.13.3 milestone May 22, 2024
@david-leifker david-leifker added the merge-pending-ci A PR that has passed review and should be merged once CI is green. label May 22, 2024
Copy link
Collaborator

@david-leifker david-leifker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are broken tests with ./gradlew :metadata-io:test

@pashashaik-mms
Copy link
Author

@RyanHolstien @david-leifker Could you please approve it as I had fixed the changes. It was a formatting issue and hence the build was failing and so are the unit-tests. I fixed it now and would need your approval.

@pashashaik-mms
Copy link
Author

Could you please approve it as I had fixed the changes. It was a formatting issue and hence the build was failing and so are the unit-tests. I fixed it now and would need your approval.

@deepgarg-visa
Copy link
Contributor

@pashashaik-mms , as mentions here

the below error is occurring because in testcases here, "sourceEntityFilter" is passed as null and because of that the variable "srcNodeLabel" in method "findRelatedEntities" is not setting having a default value of blank, which results in the below query which is not correct.

MATCH (src: )-[r:DownstreamOf ]-(dest ) WHERE left(type(r), 2)<>'r_' RETURN dest, type(r) SKIP $offset LIMIT $count"

Please also handle the case where variable "sourceEntityFilter" in method "findRelatedEntities" can be null or empty

metadata-io > nonsearch > com.linkedin.metadata.graph.dgraph.DgraphGraphServiceTest > testFindRelatedEntitiesDestinationType[11](dataset, [HasOwner], {or=[], direction=UNDIRECTED}, [RelatedEntity(relationshipType=HasOwner, urn=urn:li:dataset:(urn:li:dataPlatform:type,SampleDataset1,PROD), via=null), RelatedEntity(relationshipType=HasOwner, urn=urn:li:dataset:(urn:li:dataPlatform:type,SampleDataset2,PROD), via=null), RelatedEntity(relationshipType=HasOwner, urn=urn:li:dataset:(urn:li:dataPlatform:type,SampleDataset3,PROD), via=null), RelatedEntity(relationshipType=HasOwner, urn=urn:li:dataset:(urn:li:dataPlatform:type,SampleDataset4,PROD), via=null)]) STANDARD_ERROR org.neo4j.bolt.protocol.common.fsm.error.TransactionStateTransitionException: Invalid input ')': expected "%", "(" or an identifier (line 1, column 13 (offset: 12)) "MATCH (src: )-[r:DownstreamOf ]-(dest ) WHERE left(type(r), 2)<>'r_' RETURN dest, type(r) SKIP $offset LIMIT $count"

@pashashaik-mms
Copy link
Author

@pashashaik-mms , as mentions here

the below error is occurring because in testcases here, "sourceEntityFilter" is passed as null and because of that the variable "srcNodeLabel" in method "findRelatedEntities" is not setting having a default value of blank, which results in the below query which is not correct.

MATCH (src: )-[r:DownstreamOf ]-(dest ) WHERE left(type(r), 2)<>'r_' RETURN dest, type(r) SKIP $offset LIMIT $count"

Please also handle the case where variable "sourceEntityFilter" in method "findRelatedEntities" can be null or empty

metadata-io > nonsearch > com.linkedin.metadata.graph.dgraph.DgraphGraphServiceTest > testFindRelatedEntitiesDestinationType[11](dataset, [HasOwner], {or=[], direction=UNDIRECTED}, [RelatedEntity(relationshipType=HasOwner, urn=urn:li:dataset:(urn:li:dataPlatform:type,SampleDataset1,PROD), via=null), RelatedEntity(relationshipType=HasOwner, urn=urn:li:dataset:(urn:li:dataPlatform:type,SampleDataset2,PROD), via=null), RelatedEntity(relationshipType=HasOwner, urn=urn:li:dataset:(urn:li:dataPlatform:type,SampleDataset3,PROD), via=null), RelatedEntity(relationshipType=HasOwner, urn=urn:li:dataset:(urn:li:dataPlatform:type,SampleDataset4,PROD), via=null)]) STANDARD_ERROR org.neo4j.bolt.protocol.common.fsm.error.TransactionStateTransitionException: Invalid input ')': expected "%", "(" or an identifier (line 1, column 13 (offset: 12)) "MATCH (src: )-[r:DownstreamOf ]-(dest ) WHERE left(type(r), 2)<>'r_' RETURN dest, type(r) SKIP $offset LIMIT $count"

FIXED NOW

@pashashaik-mms
Copy link
Author

@deepgarg-visa fixed now. Could you please check and approve it. Its been waiting a while now.

@deepgarg-visa
Copy link
Contributor

deepgarg-visa commented May 24, 2024

@pashashaik-mms are all metadata-io testcase passed ?
I guess this fix will not the solve the problem, as defualt value of variable "srcNodeLabel" is blank. Because of that
the variable matchTemplate = "MATCH (src:%s %s)-[r%s %s]-(dest %s)%s" generates below query:

MATCH (src: )-[r:DownstreamOf ]->(dest )

@pashashaik-mms
Copy link
Author

@deepgarg-visa I handled your scenario as well. Now the tests are running fine. Could you please check the same. removeEdgesFromNode() might be in balance.

@@ -648,18 +666,34 @@ public void removeEdgesFromNode(

// build node label from entity type
final String srcNodeLabel = urn.getEntityType();
String matchTemplate = "";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code can be refactored as below:

`final RelationshipDirection relationshipDirection = relationshipFilter.getDirection();
final String srcNodeLabel = urn.getEntityType();

String matchTemplate = "";
matchTemplate =
String.format(
"MATCH (src {urn: $urn})-[r%s]-(dest) RETURN type(r), dest, 2", srcNodeLabel);
if (relationshipDirection == RelationshipDirection.INCOMING) {
matchTemplate =
String.format(
"MATCH (src {urn: $urn})<-[r%s]-(dest) RETURN type(r), dest, 0", srcNodeLabel);
} else if (relationshipDirection == RelationshipDirection.OUTGOING) {
matchTemplate =
String.format(
"MATCH (src {urn: $urn})-[r%s]->(dest) RETURN type(r), dest, 1", srcNodeLabel);
}
if (srcNodeLabel != null && !srcNodeLabel.isEmpty()) {
matchTemplate =
String.format(
"MATCH (src:%s {urn: $urn})-[r%s]-(dest) RETURN type(r), dest, 2", srcNodeLabel);
if (relationshipDirection == RelationshipDirection.INCOMING) {
matchTemplate =
String.format(
"MATCH (src:%s {urn: $urn})<-[r%s]-(dest) RETURN type(r), dest, 0", srcNodeLabel);
} else if (relationshipDirection == RelationshipDirection.OUTGOING) {
matchTemplate =
String.format(
"MATCH (src:%s {urn: $urn})-[r%s]->(dest) RETURN type(r), dest, 1", srcNodeLabel);
}
}`

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community-contribution PR or Issue raised by member(s) of DataHub Community merge-pending-ci A PR that has passed review and should be merged once CI is green. product PR or Issue related to the DataHub UI/UX
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants