[CEP] Use native Elasticsearch reindexing for index changes #26516

Open
dannyroberts opened this issue Jan 28, 2020 · 6 comments
Labels
CEP: done CEP CommCare Enhancement Proposal

Comments

@dannyroberts
Member

Abstract
Incorporate https://www.elastic.co/guide/en/elasticsearch/reference/2.4/docs-reindex.html into our automatic elasticsearch reindexing setup.

Motivation
It's supposedly much faster than resyncing all the docs ourselves.

Specification
There should likely be a fallback method for when we need to reindex data in place because of an issue with the pillows, as opposed to reindexing because we changed the mapping, which is the more common case.

Impact on users
This should not affect users at all.

Impact on hosting
This change should be transparent to local hosting setups. If done before the EOL of our ES 1 backend option, it should fall back to current behavior if the setting ELASTICSEARCH_MAJOR_VERSION = 1 is used.

Backwards compatibility
Besides the backwards compatibility with ELASTICSEARCH_MAJOR_VERSION = 1 described above, this should be an in-place replacement of our current system with no major effects on users or devops, other than reindexes being faster.

Release Timeline
There is no hard date by which we must do this, but we'd probably want to do it before the next time we reindex forms or cases, as in #25666.

Open questions and issues
I'm not sure we fully understand the behavior of the native elasticsearch reindex functionality. There's always the tricky issue of how to make sure we don't skip any items that have come in between when we start the reindex and when we flip all new reads and writes to it; it's possible that our current code already handles this correctly and in a way that cleanly applies to the proposed reindex implementation.

@dannyroberts dannyroberts added the CEP CommCare Enhancement Proposal label Jan 28, 2020
@dannyroberts
Member Author

I'm not actually offering to do this or to have SaaS unilaterally prioritize it. @snopoke it sounds like you've been thinking about this and I wanted to create a public place for that discussion. If ICDS wants to take some initiative at the planning level, I can see SaaS being willing to pitch in effort as well, since we'd clearly also get some benefit from it.

@snopoke
Contributor

snopoke commented Jan 29, 2020

I'm not sure we fully understand the behavior of the native elasticsearch reindex functionality. There's always the tricky issue of how to make sure we don't skip any items that have come in between when we start the reindex and when we flip all new reads and writes to it; it's possible that our current code already handles this correctly and in a way that cleanly applies to the proposed reindex implementation.

I haven't thought about this much, but reading the docs I see there are options for updating, overwriting, or ignoring documents that already exist in the target index. One option would be to start the pillow writing to both the old and new indexes before the reindex starts and configure the ES reindex to ignore existing docs.
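
As a rough illustration of the "ignore existing docs" option, here is a minimal sketch of a request body for the `_reindex` endpoint. It assumes the `op_type` and `conflicts` options described in the linked reindex docs; the helper name `build_reindex_body` and the index names are hypothetical, and none of this has been tested against the pillows.

```python
import json


def build_reindex_body(source_index, dest_index, skip_existing=True):
    """Build the JSON body for a POST to the _reindex endpoint."""
    body = {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
    }
    if skip_existing:
        # op_type=create only creates docs that are missing from the target;
        # docs that already exist there are reported as version conflicts.
        body["dest"]["op_type"] = "create"
        # conflicts=proceed counts those conflicts instead of aborting.
        body["conflicts"] = "proceed"
    return body


# Hypothetical index names, for illustration only.
body = build_reindex_body("xforms_old", "xforms_new")
print(json.dumps(body, indent=2, sort_keys=True))
```

With this body, any doc the pillow has already written to the new index would be left alone, which is the behavior the dual-write idea relies on.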

Just looking at our current reindex workflow, I think the part that sets the pillow checkpoints is broken, because it either does not set the checkpoint at all (e.g. sql form reindexer) or uses the old pillows (e.g. user reindexer).

@sravfeyn
Member

This sounds like a good path to go on, but I agree we might need to think a bit more about the details. One thing that would be nice is if reindexing etc. could be decoupled from env to env.

@sravfeyn
Member

sravfeyn commented Jan 31, 2020

I looked into this as part of reindexing the large index on ICDS. There are a few challenges to using native ES reindexing.

  • The official documentation doesn't clearly describe what happens to live updates made to the source index while the new index is being built. One hint is that the docs say the reindex is performed from a snapshot of the source index, so conflicts are unlikely. This means we could record the pillow checkpoint before starting the reindex and, once it finishes, replay the pillow changes from that point onto the new index. This needs to be tested, though, since the documentation is weak here.
  • Some third-party guides suggest handling document updates on the source index either by maintaining an alias that reads from both the old and new indices and writes to the new one, or by writing to both indices and handling duplicate results on the application side. Both methods require changes to the pillows and to HQ's elasticsearch interface tools.
  • The biggest challenge in using the native Reindex API is that it is not robust: there isn't a good way to ignore errors and continue, or to resume once the errors have been addressed. There is a standing elasticsearch issue about this: "Reindex API : improve robustness in case of error" (elastic/elasticsearch#22471).
  • Another challenge is performance. The reindex API docs indicate that it uses scroll and don't suggest that it is capable of any concurrency. One would expect third-party tools, or our HQ reindexing tool (if we implement a reindexer that reads directly from the source ES index instead of primary data), to perform better if they can run concurrently. (Update: the elasticsearch scroll API doesn't support concurrent scrolls until version 5, so even HQ tooling can't have concurrency if we are using scroll.)
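
To make the checkpoint-and-replay idea from the first point concrete, here is a small in-memory sketch. It is only a model of the flow, not HQ or pillow code: the function `reindex_with_replay`, the dict standing in for an index, and the `(seq, doc_id, doc)` change tuples are all hypothetical.

```python
def reindex_with_replay(source_snapshot, change_feed, checkpoint):
    """Copy a snapshot of the source index, then replay later changes.

    source_snapshot: dict of doc_id -> doc, as the index looked at snapshot time
    change_feed: list of (seq, doc_id, doc) tuples in sequence order
    checkpoint: sequence number recorded just before the snapshot was taken
    """
    # Step 1: stands in for the native reindex copying a point-in-time view.
    new_index = dict(source_snapshot)
    # Step 2: replay every change recorded after the checkpoint.
    for seq, doc_id, doc in change_feed:
        if seq > checkpoint:
            new_index[doc_id] = doc
    return new_index


snapshot = {"a": {"v": 1}, "b": {"v": 1}}
feed = [
    (1, "a", {"v": 1}),
    (2, "b", {"v": 1}),
    (3, "a", {"v": 2}),  # update that arrived during the reindex
    (4, "c", {"v": 1}),  # new doc that arrived during the reindex
]
result = reindex_with_replay(snapshot, feed, checkpoint=2)
print(result)
```

Because a replayed change simply overwrites the doc, recording the checkpoint slightly too early only costs some redundant writes, whereas recording it too late could skip changes.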

Given all these challenges, native Reindex might not be better than our HQ reindex tooling.

The above is a concise summary of the doc where I took notes while researching this; the doc has more detail on each point.

@sravfeyn
Member

sravfeyn commented Feb 3, 2020

Adding some notes from the staging test:

  • After around 15k docs have been reindexed, the reindex stops with an error:
[2020-02-03 11:18:15,059][DEBUG][action.bulk              ] [es2-staging] [xforms_2020_02_03][3] failed to execute bulk item (index) index {[xforms_2020_02_03][xform][62c3459987254bf889de537e47af19f9], source[{"_id": "62c3459987254bf889de537e47af19f9", "doc_type": "XFormInstance", "form": {"@version": "1", "@uiVersion": "1", "@xmlns": "http://commcarehq.org/case", "meta": {"@xmlns": "http://openrosa.org/jr/xforms", "deviceID": "corehq.apps.callcenter.sync_user_case._UserCaseHelper.update_user_case", "timeStart": "2019-04-30T21:12:22.069290Z", "timeEnd": "2019-04-30T21:12:22.069290Z", "username": "system", "userID": "", "instanceID": "62c3459987254bf889de537e47af19f9", "appVersion": null, "commcare_version": null, "app_build_version": null, "geo_point": null}, "case": {"@case_id": "0c14374b42cd4b7390461981c9e10b16", "@date_modified": "2019-04-30T21:12:22.068047Z", "@xmlns": "http://commcarehq.org/case/transaction/v2", "update": {"phone_number": "15086313093"}}, "#type": "system"}, "auth_context": {"doc_type": "DefaultAuthContext"}, "openrosa_headers": {}, "domain": "ccqa", "app_id": null, "xmlns": "http://commcarehq.org/case", "user_id": "", "orig_id": null, "deprecated_form_id": null, "server_modified_on": "2019-04-30T21:12:22.327993Z", "received_on": "2019-04-30T21:12:22.102687Z", "edited_on": null, "partial_submission": false, "submit_ip": null, "last_sync_token": null, "problem": null, "date_header": null, "build_id": null, "state": 1, "initial_processing_complete": true, "history": [], "backend_id": "sql", "user_type": "unknown", "inserted_at": "2019-04-30T21:12:29.848612", "__retrieved_case_ids": ["0c14374b42cd4b7390461981c9e10b16"]}]}
MapperParsingException[Field [_id] is a metadata field and cannot be added inside a document. Use the index API request parameters.]
  • It's not clear why this happens, since docs with a similar _id field are reindexed without problems before this failure.
  • When I manually index this doc, a new _id is assigned to it (with source._id remaining as is).
  • The reindexing error is only available in the log file.
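
This does not explain why the native reindex fails here, but if the reindexer read the docs and wrote them through the bulk API itself, one way to avoid this class of error would be to move `_id` out of the document source and into the bulk action metadata. A minimal sketch, where the helper name `to_bulk_lines` is hypothetical and the index and type names are taken from the log line above:

```python
import json


def to_bulk_lines(doc, index, doc_type):
    """Return the two bulk-API lines for indexing doc without _id in its source."""
    source = dict(doc)  # shallow copy so the caller's dict is left untouched
    doc_id = source.pop("_id", None)
    action = {"index": {"_index": index, "_type": doc_type}}
    if doc_id is not None:
        # The id travels as request metadata rather than inside the source.
        action["index"]["_id"] = doc_id
    return [json.dumps(action), json.dumps(source)]


doc = {"_id": "62c3459987254bf889de537e47af19f9", "doc_type": "XFormInstance", "domain": "ccqa"}
lines = to_bulk_lines(doc, "xforms_2020_02_03", "xform")
print("\n".join(lines))
```

The resulting doc keeps its original id as the ES `_id`, and the source no longer carries a field that the mapper rejects.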

@snopoke
Contributor

snopoke commented Apr 29, 2020

@sravfeyn can you update this with the current state of the reindex tools you used?
