Batch import from CSV via client #161

Open
ysimonson opened this issue Jan 30, 2021 · 5 comments
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed

Comments

@ysimonson
Member

Once #157 is done, it'd be great to add support for batch importing from a CSV file. CSV would act as a decent lowest-common-denominator format. In the future, other formats (e.g. RDF XML) could be added as well, based on demand.

@ysimonson ysimonson added enhancement New feature or request help wanted Extra attention is needed good first issue Good for newcomers labels Jan 30, 2021
@davidkuhta
Contributor

Happy to take this on, unless @binoychitale feels it's an organic progression of their work.

@binoychitale
Contributor

No, I think you can pick this up :)

@ysimonson
Member Author

Thanks @davidkuhta! Here's what I'm thinking for CSV formats, but let me know if you disagree or otherwise think there's a better way to structure things.

We'd support two CSV formats. One for batch inserting vertices and vertex properties, and one for batch inserting edges and edge properties.

Vertices

Format: ID,Type,Property 1,...,Property N. The first two header columns would be ignored, but all subsequent values would be used to determine the names of properties to be inserted.

e.g.:

ID,Type,movie-name,year
e08c8968-5b38-4c55-8dd1-56d7880708f2,movie,Plan 9 from Outer Space,1957
dc8c93c4-dc92-4d00-923d-cc9191f9a946,movie,The Room,2003

Would create vertices with properties movie-name and year.
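
A minimal sketch of how a client could read the vertex format above, assuming Python's standard `csv` module. The helper name `parse_vertices` and the tuple it yields are hypothetical, chosen for illustration only; they are not part of any existing client API. Property values are kept as strings because the proposal does not define property types.

```python
import csv
import io
import uuid


def parse_vertices(csv_file):
    """Yield (id, type, properties) for each row of a vertex CSV.

    Hypothetical helper: the name and return shape are assumptions.
    """
    reader = csv.reader(csv_file)
    header = next(reader)
    # The first two header cells (ID, Type) are ignored;
    # the remaining cells name the properties to insert.
    property_names = header[2:]
    for row in reader:
        if not row:
            continue  # skip blank lines
        vertex_id = uuid.UUID(row[0])  # raises ValueError on a malformed UUID
        vertex_type = row[1]
        # Values stay as strings; the proposal does not specify property types.
        properties = dict(zip(property_names, row[2:]))
        yield vertex_id, vertex_type, properties


sample = io.StringIO(
    "ID,Type,movie-name,year\n"
    "e08c8968-5b38-4c55-8dd1-56d7880708f2,movie,Plan 9 from Outer Space,1957\n"
    "dc8c93c4-dc92-4d00-923d-cc9191f9a946,movie,The Room,2003\n"
)
vertices = list(parse_vertices(sample))
```

Using `csv.reader` rather than splitting on commas means quoted values that themselves contain commas would still parse as a single cell.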

Edges

Format: Outbound ID,Type,Inbound ID,Property 1,...,Property N. Similarly, the first three header columns would be ignored, but all subsequent values would be used to determine the names of properties to be inserted.

e.g.:

Outbound ID,Type,Inbound ID,year
6bdc1304-e62f-4c38-8a3d-78e923dcb176,acted-in,dc8c93c4-dc92-4d00-923d-cc9191f9a946,2003

Would create an edge with a year property.
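
The edge format could be read the same way. This is again only a sketch under the same assumptions: `parse_edges` is a hypothetical name, and the only difference from the vertex case is that three leading columns are fixed instead of two.

```python
import csv
import io
import uuid


def parse_edges(csv_file):
    """Yield (outbound_id, type, inbound_id, properties) per edge CSV row.

    Hypothetical helper: the name and return shape are assumptions.
    """
    reader = csv.reader(csv_file)
    header = next(reader)
    # The first three header cells (Outbound ID, Type, Inbound ID) are ignored;
    # the remaining cells name the properties to insert.
    property_names = header[3:]
    for row in reader:
        if not row:
            continue  # skip blank lines
        outbound_id = uuid.UUID(row[0])
        edge_type = row[1]
        inbound_id = uuid.UUID(row[2])
        # Values stay as strings; the proposal does not specify property types.
        properties = dict(zip(property_names, row[3:]))
        yield outbound_id, edge_type, inbound_id, properties


sample = io.StringIO(
    "Outbound ID,Type,Inbound ID,year\n"
    "6bdc1304-e62f-4c38-8a3d-78e923dcb176,acted-in,"
    "dc8c93c4-dc92-4d00-923d-cc9191f9a946,2003\n"
)
edges = list(parse_edges(sample))
```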

@davidkuhta
Contributor

@ysimonson I agree with the overall concept, but a couple of thoughts came to mind:

  1. Validation - What are your thoughts on failures for edges (say, if inbound or outbound IDs don't exist)? Any considerations for property types?
  2. Would we treat each row as a transaction, or the CSV as a whole?
  3. Is a UUID as key the anticipated use case, or is the key realistically an integer or string, which would need to be referenced between a given import of vertices and edges?

@ysimonson
Member Author

For (1) and (2), we'd use the bulk insert API. Datastores provide different guarantees on validation and transactionality.

Good point on (3), since most datasets don't use UUIDs. I think right now the keys in the CSV can realistically only be UUIDs. Users who don't have UUID keys will need to generate a new CSV with UUIDs, and maintain a mapping of key -> UUID. I've opened an issue to better support this, though. It's not clear to me what the best solution is yet, but it does seem like something that should be addressed.
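
Until that is addressed, the workaround could look something like the sketch below, which rewrites CSVs keyed by integers or strings into UUID-keyed CSVs and keeps the key -> UUID mapping. Everything here is an assumption for illustration: `rewrite_keys` is a hypothetical helper, the sample keys and rows are made up, and random `uuid.uuid4()` values are just one possible choice of UUID.

```python
import csv
import io
import uuid


def rewrite_keys(src, dst, key_columns, mapping):
    """Copy a CSV from src to dst, replacing the keys in key_columns with UUIDs.

    Hypothetical helper. `mapping` (key -> UUID) is updated in place and
    returned, so the same dict can be shared between vertex and edge files.
    """
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    writer.writerow(next(reader))  # the header row is copied unchanged
    for row in reader:
        if not row:
            continue  # skip blank lines
        for col in key_columns:
            key = row[col]
            # Note: a key never seen before silently gets a fresh UUID,
            # including an edge endpoint that has no matching vertex row.
            if key not in mapping:
                mapping[key] = uuid.uuid4()
            row[col] = str(mapping[key])
        writer.writerow(row)
    return mapping


# Made-up input using integer keys instead of UUIDs.
vertices_in = io.StringIO("ID,Type,name\n42,movie,The Room\n7,person,Example Actor\n")
edges_in = io.StringIO("Outbound ID,Type,Inbound ID,year\n7,acted-in,42,2003\n")
vertices_out, edges_out = io.StringIO(), io.StringIO()

mapping = {}
rewrite_keys(vertices_in, vertices_out, key_columns=(0,), mapping=mapping)
rewrite_keys(edges_in, edges_out, key_columns=(0, 2), mapping=mapping)
```

Processing the vertex file first and sharing one `mapping` keeps edge endpoints consistent with the vertex IDs; persisting `mapping` (for example as its own CSV) would allow later imports to reference the same vertices.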


3 participants