Implement periodic sync of Elasticsearch with scrapers #84

tanaysoni · 2020-03-26T13:25:08Z

Proposal

In the current implementation, the meta scraper runs all the scrapers sequentially, crawls the FAQs, and then writes to an Elasticsearch index. This is good for initializing an index from scratch.

We should implement a periodic job (cron or AWS Lambda) that runs the meta scraper and checks for updates, additions, and deletions since the last run.

A quick-and-dirty alternative to a periodic sync job would be to recreate the entire Elasticsearch index each time we crawl. This works, but collecting user feedback gets tricky because we lose the document_id when the list of scrapers gets updated.

Workflow

Execute the meta crawler.
For each scraper, search ES to check whether the crawled question/answer pairs are already present. The ES query can be filtered by the link field.
Existing questions in ES that are no longer present (or have changed) in the newly crawled link are marked as outdated in ES.
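
The comparison step in the workflow above could be sketched roughly as follows. This is a minimal illustration in plain Python, not the project's code: the `QAPair` type and the `diff_link` function are hypothetical names, and the sketch assumes that a changed question/answer pair is handled by marking the old document as outdated and indexing the new version as a fresh document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QAPair:
    """Hypothetical container for one crawled question/answer pair."""
    question: str
    answer: str


def diff_link(existing, crawled):
    """Compare the documents stored for one link with a fresh crawl of it.

    existing: dict mapping document_id -> QAPair currently active in ES
    crawled:  list of QAPair from the latest crawl of the same link
    Returns (to_add, outdated_ids).
    """
    crawled_set = set(crawled)
    existing_pairs = set(existing.values())

    # Stored pairs that vanished or changed are no longer in the crawl.
    outdated_ids = sorted(
        doc_id for doc_id, pair in existing.items() if pair not in crawled_set
    )

    # Crawled pairs not yet stored are additions (this includes the new
    # version of a changed pair). dict.fromkeys drops duplicates, keeps order.
    to_add = [p for p in dict.fromkeys(crawled) if p not in existing_pairs]
    return to_add, outdated_ids
```

Because unchanged pairs appear in neither result, their documents (and their document_id values) stay untouched, which is what keeps previously collected user feedback attached to the right document.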

Other details

Currently, the document_id field in ES is populated with incrementing numbers. It could be changed to a UUID to make things simpler to implement.
The API queries should be changed to exclude outdated documents.
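
As a rough sketch of these two points, the snippet below generates a UUID-based document_id and builds a search body that skips outdated documents. The field names `outdated` and `question`, and both function names, are assumptions made for illustration; the actual mapping and API queries may differ.

```python
import uuid


def new_document_id():
    """Return a random UUID string to use as document_id (assumed scheme)."""
    return str(uuid.uuid4())


def build_search_query(user_text):
    """Build an ES query body that ignores documents flagged as outdated.

    Assumes a boolean `outdated` field and a text field named `question`.
    """
    return {
        "query": {
            "bool": {
                "must": [{"match": {"question": user_text}}],
                # Documents without the field are still returned, so
                # entries indexed before the flag existed remain searchable.
                "must_not": [{"term": {"outdated": True}}],
            }
        }
    }
```

Using `must_not` rather than filtering on `outdated: false` means existing documents do not need to be backfilled with the new field.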

andra-pumnea self-assigned this Apr 1, 2020
