Streamline upload process on the backend #75

Open
flaneuse opened this issue Nov 17, 2020 · 3 comments

@flaneuse
Member

Uploading large batches of data is a pain: there's no good way to queue data for upload, and because the .json validation that runs before ES insertion is complex, 300 records take about 5 minutes to upload.

There are at least a few limits on queuing large amounts of data:

  1. The front-end has a limit on how much data it can hold in memory for uploading.
  2. The backend can only accept about 1 MB (I think) before it complains; as a result, the front-end currently splits the file into ~1 MB chunks to send to the backend.
  3. On the prod server, if there are too many simultaneous requests, the multiprocessing queue can get mixed up and the same record can be inserted into the index multiple times.
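To make the ~1 MB chunking in point 2 concrete, here is a rough sketch of how records could be batched by encoded size outside the front-end. The function name `chunk_records` and the `max_bytes` default are assumptions for illustration, not the project's actual code, and the size is measured on a compact JSON encoding rather than whatever the backend really receives.

```python
import json


def _encoded_size(record):
    # Size in bytes of one record in compact JSON (no spaces after separators).
    return len(json.dumps(record, separators=(",", ":")).encode("utf-8"))


def chunk_records(records, max_bytes=1_000_000):
    """Yield lists of records whose compact JSON array stays within max_bytes.

    A single record larger than max_bytes is still yielded, alone, so that
    nothing is silently dropped; the caller has to decide what to do with it.
    """
    batch = []
    batch_size = 2  # the enclosing "[" and "]"
    for record in records:
        size = _encoded_size(record)
        extra = size + (1 if batch else 0)  # +1 for the separating comma
        if batch and batch_size + extra > max_bytes:
            yield batch
            batch = []
            batch_size = 2
            extra = size
        batch.append(record)
        batch_size += extra
    if batch:
        yield batch
```

Because it is a generator, the whole file never has to be held as a list of batches; each batch could be sent before the next one is built.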

Ideally, we could queue a bunch of records and let the upload run overnight. This may involve moving away from the front-end interface, but we'd still have the problem of multiprocessing inserting duplicates.
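One possible way to make repeated inserts harmless, sketched here as an assumption rather than something the backend does today: derive each document's ID from its content, since Elasticsearch replaces a document indexed under an existing `_id` instead of adding a second copy. The helper names `record_id` and `dedupe` are hypothetical, and if the records already carry a natural unique identifier, that would be a better key than a content hash.

```python
import hashlib
import json


def record_id(record):
    """Return a stable ID for a record, independent of key order."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dedupe(records):
    """Yield (id, record) pairs, skipping records already seen in this run."""
    seen = set()
    for record in records:
        rid = record_id(record)
        if rid in seen:
            continue
        seen.add(rid)
        yield rid, record
```

With IDs like these, a record that gets submitted twice by a confused worker queue would overwrite itself rather than show up twice in the index.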

@flaneuse
Member Author

Also, sometimes I get random unexplained PUT errors:
[Screenshot: Screen Shot 2020-11-17 at 2 20 57 PM]

@flaneuse
Member Author

> Also, sometimes I get random unexplained PUT errors:
> [Screenshot: Screen Shot 2020-11-17 at 2 20 57 PM]

Never mind... I think this is actually Wi-Fi / VPN instability on my part.

@flaneuse
Member Author

@juliamullen: Thinking about this further, it'd be good to decouple and expose the JSON validation process so records can be checked before attempting the POST (from the command line and/or a GUI on the website). Ideally, validation would run first, and the POST request would be made only once it passes.

This might also make it easier to take advantage of the capabilities of the Elasticsearch Python library.
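As a rough illustration of the validate-first idea, the sketch below separates validation from upload and then shapes the valid records into bulk actions. Everything specific here is an assumption: the required fields `identifier` and `name`, the index name `my-index`, and the helper names are placeholders, and the real .json validation is certainly more involved than a required-field check.

```python
def validate_record(record, required=("identifier", "name")):
    """Return a list of error messages; an empty list means the record passed."""
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    errors = []
    for field in required:
        if field not in record or record[field] in (None, ""):
            errors.append(f"missing required field: {field}")
    return errors


def split_valid(records):
    """Validate everything up front, before any request is made."""
    valid, rejected = [], []
    for position, record in enumerate(records):
        errors = validate_record(record)
        if errors:
            rejected.append((position, errors))
        else:
            valid.append(record)
    return valid, rejected


def bulk_actions(records, index="my-index"):
    """Yield action dicts in the shape accepted by elasticsearch.helpers.bulk."""
    for record in records:
        yield {"_index": index, "_id": record["identifier"], "_source": record}
```

A command-line tool could print the `rejected` list and stop there, and only hand `bulk_actions(valid)` to `elasticsearch.helpers.bulk(client, actions)` once that list is empty.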

Labels
Development
