de-duplicate entities in apoc.export.json.data/query #3930

Open
jexp opened this issue Jan 25, 2024 · 2 comments
jexp (Member) commented Jan 25, 2024

I'm not sure whether we de-duplicate entities in apoc.export.json.data/query.

For example, take a query like

MATCH (p:Person)-[r:ACTED_IN]->(m:Movie)
RETURN p,r,m

where people and movies can appear multiple times.

or

MATCH (p1:Person)-[r:KNOWS]-(p2:Person)
RETURN p1,r,p2

where even relationships can be duplicated.

Are we keeping track of the IDs we have already exported, e.g. in a set? Please check.

vga91 (Collaborator) commented Apr 30, 2024

@jexp

Yes, entities are duplicated during export.
For example, after executing:

CREATE (p:Person {id: 1})-[r:ACTED_IN]->(m:Movie {foo: 1}) with p 
CREATE (p)-[:ACTED_IN]->(:Movie {foo: 2})

and then:

MATCH (p:Person)-[r:ACTED_IN]->(m:Movie)
with collect(p) + collect(m) as nodes, collect(r) as rels
call apoc.export.json.data(nodes, rels, "testData.json", {})
yield file return file

the resulting file has a duplicate Person node:

{"type":"node","id":"3","labels":["Person"],"properties":{"id":1}}
{"type":"node","id":"3","labels":["Person"],"properties":{"id":1}}
{"type":"node","id":"4","labels":["Movie"],"properties":{"foo":1}}
{"type":"node","id":"5","labels":["Movie"],"properties":{"foo":2}}
{"type":"relationship","id":"2","label":"ACTED_IN","start":{"id":"3","labels":["Person"],"properties":{"id":1}},"end":{"id":"4","labels":["Movie"],"properties":{"foo":1}}}
{"type":"relationship","id":"3","label":"ACTED_IN","start":{"id":"3","labels":["Person"],"properties":{"id":1}},"end":{"id":"5","labels":["Movie"],"properties":{"foo":2}}}

The issue also occurs with the other export procedures, such as the CSV and Cypher ones.

It also happens with the apoc.export.<type>.graph procedures:

MATCH (p:Person)-[r:ACTED_IN]->(m:Movie)
with collect(p) + collect(m) as nodes, collect(r) as rels
call apoc.export.json.graph({nodes: nodes, relationships: rels}, "testGraph.json", {})
yield file return file

With apoc.export.json.query, as in the following call, the result also contains duplicates, but I think that is correct in this case,
since each Cypher result row corresponds to one entry in the json/csv/... file:

call apoc.export.json.query("MATCH path=(p:Person)-[r:ACTED_IN]->(m:Movie) RETURN path", "testQuery.json", {})
yield file return file

So we should indeed keep track of the IDs during the export.
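The ID-tracking idea can be sketched as follows. This is an illustration in Python under stated assumptions, not APOC's actual code (APOC itself is written in Java): `unique_entities` is a hypothetical name, and entities are modelled as plain dicts with `type` and `id` keys, mirroring the exported JSON. Each entity is emitted only the first time its (type, id) pair is seen.

```python
def unique_entities(entities):
    """Yield each entity once, preserving first-seen order."""
    seen = set()
    for entity in entities:
        # Include the type in the key: node id "3" and relationship id "3"
        # are different entities.
        key = (entity["type"], entity["id"])
        if key in seen:
            continue
        seen.add(key)
        yield entity


# Hypothetical input mirroring collect(p) + collect(m) and collect(r):
# the Person node appears twice because it matched two rows.
rows = [
    {"type": "node", "id": "3"},
    {"type": "node", "id": "3"},
    {"type": "node", "id": "4"},
    {"type": "node", "id": "5"},
    {"type": "relationship", "id": "2"},
    {"type": "relationship", "id": "3"},
]

exported = list(unique_entities(rows))
print(len(exported))  # 5
```

The set grows with the number of distinct entities exported, which is the memory cost of this approach for large exports.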

Since the procedures are all in APOC Core, I think you need to create a Trello card, or am I wrong?

vga91 (Collaborator) commented May 14, 2024

Created Trello card, with id VchWnQfd

Labels
None yet
Projects
Status: Core issues (with trello core card)
Development

No branches or pull requests

3 participants