Speeding up data import to Neo4j v5 and CSV format data #57

Open
nickzren opened this issue Oct 27, 2023 · 2 comments

Comments

@nickzren

I ran into problems loading Hetionet data into Neo4j 5.13 on my updated MacBook. The existing Neo4j dumps were no longer compatible, and importing the data directly from JSON was too slow, taking an estimated 10+ hours.

To address this, I've written a script that efficiently converts JSON data to CSV format without any loss in node, edge, or property value information. The JSON-to-CSV conversion takes approximately 30 seconds, while uploading the CSV to Neo4j takes around 40 seconds.

I've organized each node and edge type into its own respective CSV file and accompanying Cypher script. I believe this will make it easier for people to understand and work with the data.
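For readers who want a feel for the approach, here is a minimal Python sketch of splitting node records into one CSV per node type. It is an illustration only and is not the code from the csv branch. The record layout (`kind`, `identifier`, `name`, `data`), the function name `nodes_to_csv_by_kind`, and the example rows are all assumptions made for this sketch.

```python
import csv
import io
from collections import defaultdict


def nodes_to_csv_by_kind(nodes):
    """Group node records by kind and render one CSV text per kind.

    Assumes each node is a dict with "kind", "identifier", "name" and an
    optional nested "data" dict of properties (hypothetical layout).
    """
    rows_by_kind = defaultdict(list)
    for node in nodes:
        # Flatten the nested property dict next to the core fields.
        row = {"identifier": node["identifier"], "name": node["name"]}
        row.update(node.get("data", {}))
        rows_by_kind[node["kind"]].append(row)

    csv_by_kind = {}
    for kind, rows in rows_by_kind.items():
        # Union of columns in first-seen order, so sparse properties are kept.
        columns = list(dict.fromkeys(key for row in rows for key in row))
        buffer = io.StringIO()
        writer = csv.DictWriter(
            buffer, fieldnames=columns, restval="", lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(rows)
        csv_by_kind[kind] = buffer.getvalue()
    return csv_by_kind


# Hypothetical example records, not real Hetionet data.
example_nodes = [
    {"kind": "Gene", "identifier": 1, "name": "gene-a", "data": {"chromosome": "19"}},
    {"kind": "Gene", "identifier": 2, "name": "gene-b", "data": {}},
    {"kind": "Disease", "identifier": "D1", "name": "disease-a", "data": {}},
]
csv_texts = nodes_to_csv_by_kind(example_nodes)
```

Each value in `csv_texts` could then be written to a file named after its node type. Edges could be handled the same way, grouped by edge type, with source and target identifiers as columns.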

If this sounds useful, I'd be open to integrating these changes into the main branch. Let me know your thoughts.

You can find the revised code at:
https://github.com/nickzren/hetionet/tree/csv

@dhimmel
Member

dhimmel commented Nov 1, 2023

Awesome work @nickzren. Nice job finding an efficient import method that works with the latest Neo4j stack.

I took a quick look at the changes, and I'll need a little more time to think about where the code belongs. It could possibly live in dhimmel/integrate or hetio/hetnetpy rather than in this repo, whose focus is the data rather than the code that generates it.

Taking a step back, there are a few contributions that would be of major utility (in order of importance/interest):

  1. a Neo4j dump file that is compatible with Neo4j 5 (and possibly future Neo4j versions)
  2. code to generate the Neo4j database and dump file (i.e. your csv branch)
  3. the CSV files, though I'm a little cautious because they have some similarities with the TSV files, and we'd want to understand and document the differences.

Any thoughts?

@nickzren
Author

nickzren commented Nov 1, 2023

Thanks @dhimmel

I recognize the primary aim of this repo is data-centric rather than code-centric. However, I respectfully suggest that including data-specific scripts, such as Python and Cypher, could add value. No strong opinion here, and I'll defer to your ultimate decision on the matter.

The CSV data is derived from JSON files and serves as a comprehensive dataset, including all properties for both nodes and edges.

CSV files and Cypher scripts are generated simultaneously to ensure data type consistency, making the data import into Neo4j unaffected by version changes.
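To illustrate the idea of generating the Cypher alongside the CSV, here is a minimal Python sketch that builds a `LOAD CSV` statement from a per-column type mapping, so the casts in the Cypher match the types used when the CSV was written. This is an assumption about how such a generator could look, not the actual script in the csv branch. The helper `build_node_load_statement`, the `CYPHER_CASTS` table, and the file name are hypothetical.

```python
# Map Python types to the Cypher functions that cast CSV strings back.
# Columns with any other type (e.g. str) are loaded as-is.
CYPHER_CASTS = {int: "toInteger", float: "toFloat", bool: "toBoolean"}


def build_node_load_statement(label, csv_file, column_types):
    """Return a LOAD CSV statement that casts each column per column_types.

    column_types maps CSV column names to Python types (hypothetical schema).
    """
    assignments = []
    for column, py_type in column_types.items():
        value = f"row.`{column}`"
        cast = CYPHER_CASTS.get(py_type)
        if cast:
            value = f"{cast}({value})"
        assignments.append(f"`{column}`: {value}")
    props = ", ".join(assignments)
    return (
        f"LOAD CSV WITH HEADERS FROM 'file:///{csv_file}' AS row\n"
        f"CREATE (:`{label}` {{{props}}});"
    )


# Hypothetical schema for one node type.
statement = build_node_load_statement(
    "Gene", "Gene.csv", {"identifier": int, "name": str}
)
print(statement)
```

Because the same type mapping would drive both the CSV writer and the Cypher text, the two stay in step when a property is added or its type changes.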

Having separate CSV files for nodes and edges not only makes the framework easier to understand but also lets users easily select, modify, or extend the data.

I encountered difficulties with restoring from Neo4j dump files and couldn't resolve the issues, so I began exploring alternative solutions.

I'm still learning about this and graph databases, so any corrections are welcome.
