Implement periodic sync of Elasticsearch with scrapers #84

Open
tanaysoni opened this issue Mar 26, 2020 · 0 comments

tanaysoni commented Mar 26, 2020

Proposal

In the current implementation, the meta scraper runs all the scrapers sequentially, crawls the FAQs, and then writes to an Elasticsearch index. This is good for initializing an index from scratch.

We should implement a periodic job (cron or AWS Lambda) that runs the meta scraper and checks for updates, additions, and deletions since the last run.

A quick-and-dirty alternative to a periodic sync job would be to recreate the entire Elasticsearch index each time we crawl. This works, but collecting user feedback gets tricky because we lose the document_id when the list of scrapers gets updated.

Workflow

  • Execute the meta scraper.
  • Search ES to check whether the crawled question/answer pairs for a given scraper are already present. The ES query can be filtered by the link field.
  • Existing questions in ES that are no longer present (or have changed) in the newly crawled link are marked as outdated in ES.
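
The workflow above could be sketched roughly as follows. This is only an illustration under some assumptions: `link_query` and `diff_faqs` are hypothetical helper names, documents are assumed to be plain dicts with `question`, `answer`, and `document_id` keys, the `outdated` flag is an assumed field name, and the `term` query assumes the `link` field is mapped as a keyword. Sending the query and applying the updates through an ES client is left out.

```python
from typing import Dict, List, Tuple


def link_query(link: str) -> Dict:
    """Build an ES query body that selects all documents crawled from one link.

    Assumes `link` is an exact-match (keyword) field in the index mapping.
    """
    return {"query": {"term": {"link": link}}}


def diff_faqs(existing: List[Dict], crawled: List[Dict]) -> Tuple[List[Dict], List]:
    """Compare documents already in ES with freshly crawled ones for one link.

    Returns (to_add, to_mark_outdated):
      - to_add: crawled pairs that have no identical active document in ES
      - to_mark_outdated: document_ids of active ES documents whose
        question/answer pair no longer appears in the crawl
    A changed answer therefore shows up in both lists: the old document is
    marked as outdated and the new pair is added.
    """
    # Documents already flagged as outdated are ignored on the ES side.
    active = [doc for doc in existing if not doc.get("outdated")]

    crawled_pairs = {(doc["question"], doc["answer"]) for doc in crawled}
    active_pairs = {(doc["question"], doc["answer"]) for doc in active}

    to_add = [
        doc for doc in crawled
        if (doc["question"], doc["answer"]) not in active_pairs
    ]
    to_mark_outdated = [
        doc["document_id"] for doc in active
        if (doc["question"], doc["answer"]) not in crawled_pairs
    ]
    return to_add, to_mark_outdated
```

Comparing on the full question/answer pair keeps the logic simple; matching on the question alone and updating the answer in place would be another option, but it would not follow the "mark as outdated" step described above.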

Other details

  • Currently, the document_id field in ES is populated with incrementing numbers. It could be changed to a UUID to simplify the implementation.
  • The API queries should be changed to exclude outdated documents.
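
A minimal sketch of both points, assuming a random UUID is acceptable as the document_id and that outdated documents carry a boolean field named `outdated` (the field name and both helper names are hypothetical, not taken from the existing code):

```python
import uuid
from typing import Dict


def new_document_id() -> str:
    """Return a random UUID string to use as document_id instead of a counter."""
    return str(uuid.uuid4())


def exclude_outdated(query: Dict) -> Dict:
    """Wrap an existing ES query clause so documents flagged as outdated are skipped.

    Documents that do not have the `outdated` field at all still match,
    because `must_not` only rejects documents where the term is present.
    """
    return {
        "bool": {
            "must": [query],
            "must_not": [{"term": {"outdated": True}}],
        }
    }
```

For example, `exclude_outdated({"match": {"question": "symptoms"}})` produces a `bool` query that can be placed under the `query` key of a search request body.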
@andra-pumnea andra-pumnea self-assigned this Apr 1, 2020