Do you have plans to support real time indexing? #10

Open
yingfeng opened this issue Aug 22, 2017 · 4 comments

Comments

@yingfeng

It's not that difficult to support such a feature; providing two in-memory segments is enough.
When one in-memory segment is full, just flush it to disk while the other in-memory segment takes over data ingestion. This requires a lock-free design to support higher concurrency, which is not that complicated using std::atomic semantics.
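
A minimal sketch of the two-segment idea, under simplifying assumptions: a single ingest thread, and a flush that runs inline rather than on a background thread. `MemSegment`, `DoubleBuffer`, and the `flushed_` vector (which stands in for on-disk segments) are hypothetical names for illustration and are not part of Trinity. A design with several concurrent writers would need more care than this.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical in-memory segment: just a buffer of document IDs.
struct MemSegment {
    std::vector<std::uint32_t> docs;
};

// Sketch only: one ingest thread is assumed, and flush() runs inline
// instead of on a background thread.
class DoubleBuffer {
public:
    explicit DoubleBuffer(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    void add(std::uint32_t doc) {
        const unsigned idx = active_.load(std::memory_order_relaxed);
        segments_[idx].docs.push_back(doc);
        if (segments_[idx].docs.size() >= capacity_) {
            // Redirect ingestion to the other segment, then flush the full one.
            active_.store(idx ^ 1u, std::memory_order_release);
            flush(segments_[idx]);
        }
    }

    std::size_t flushed_count() const { return flushed_.size(); }

    std::size_t active_size() const {
        return segments_[active_.load(std::memory_order_acquire)].docs.size();
    }

private:
    void flush(MemSegment& seg) {
        flushed_.push_back(std::move(seg.docs)); // stand-in for writing to disk
        seg.docs.clear();                        // make the segment reusable
    }

    std::size_t capacity_;
    std::array<MemSegment, 2> segments_;
    std::atomic<unsigned> active_{0};
    std::vector<std::vector<std::uint32_t>> flushed_;
};
```

With a capacity of 2, adding five documents flushes two full segments and leaves one document in the active segment.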

@markpapadakis
Member

Please note that a major Trinity update is in the works. It should be pushed to GH sometime next week, along with benchmarks comparing Lucene and Trinity.

You can implement a real-time indexing scheme pretty easily by creating an IndexSourcesCollection. You then add to that collection one index source for each read-only serialized segment/source (e.g. SegmentIndexSource), and finally you add another IndexSource that’s built for real-time updates. All you need to do is make sure your resolve_term_ctx() and new_postings_decoder() account for that. That’s pretty much all there is to it, though you may need to make use of an IndexDocumentsFilter because you likely won’t want to rely on IndexSource::masked_documents() of your real-time index source, but those specifics are rather easy to figure out.
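
A rough sketch of the composition described above. The types here (`Source`, `MapSource`, `SourceCollection`, and its `masked` set) are hypothetical stand-ins and are not Trinity's actual `IndexSource`, `IndexSourcesCollection`, or `IndexDocumentsFilter` interfaces. The sketch only shows the shape: several read-only sources plus one real-time source queried together, with a filter that hides masked documents.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

using DocId = std::uint32_t;

// Hypothetical source interface: returns the postings for one term.
struct Source {
    virtual ~Source() = default;
    virtual std::vector<DocId> postings(const std::string& term) const = 0;
};

// Hypothetical map-backed source, used here for both the read-only
// segments and the real-time source.
struct MapSource : Source {
    std::unordered_map<std::string, std::vector<DocId>> index;

    std::vector<DocId> postings(const std::string& term) const override {
        auto it = index.find(term);
        return it == index.end() ? std::vector<DocId>{} : it->second;
    }
};

// Hypothetical collection: read-only sources first, real-time source last.
struct SourceCollection {
    std::vector<std::shared_ptr<const Source>> sources;
    std::unordered_set<DocId> masked; // stand-in for a documents filter

    std::vector<DocId> search(const std::string& term) const {
        std::vector<DocId> out;
        for (const auto& s : sources)
            for (DocId d : s->postings(term))
                if (masked.count(d) == 0) out.push_back(d);
        return out;
    }
};
```

Keeping the mask on the collection rather than on the real-time source mirrors the remark about not relying on the real-time source's own masked documents.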

When whatever you use to back your real-time index source (which is a proxy of sorts to that in-memory backing store) needs to be persisted, you can just flush it as e.g. a lucene or google segment, re-create the index source collection to include that new segment, reset the in-memory index source, and atomically replace the index collection (just a pointer swap).
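
The pointer swap could look roughly like this. `Collection` and `LiveIndex` are hypothetical names, a single publisher thread is assumed, and the sketch uses the `std::atomic_load` / `std::atomic_store` overloads for `std::shared_ptr` from `<memory>` (available since C++11 and deprecated in C++20 in favour of `std::atomic<std::shared_ptr<T>>`). Readers keep whatever snapshot they acquired until they release it, so a swap never invalidates a search in progress.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical immutable view of the index: names of read-only segments.
struct Collection {
    std::vector<std::string> segments;
};

class LiveIndex {
public:
    // Readers take a snapshot; it stays valid even if a swap happens later.
    std::shared_ptr<const Collection> acquire() const {
        return std::atomic_load(&current_);
    }

    // Single publisher assumed: copy the current collection, add the newly
    // flushed segment, then swap the pointer atomically.
    void publish_segment(const std::string& name) {
        auto next = std::make_shared<Collection>(*acquire());
        next->segments.push_back(name);
        std::atomic_store(&current_,
                          std::shared_ptr<const Collection>(std::move(next)));
    }

private:
    std::shared_ptr<const Collection> current_ = std::make_shared<Collection>();
};
```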

This is just one way to do it, and if it sounds complicated, it’s because I failed to describe it properly. It is pretty trivial in practice, really.

@yingfeng
Author

Real-time indexing requires concurrent access to SegmentIndexSource, since updates and retrieval happen at the same time. Additionally, a document should be findable immediately after it has been inserted, which means the so-called commit happens at per-document granularity. As a result, the corresponding posting list must be thread-safe. I've not seen a data structure or other mechanism that can support the above flow.
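
For reference, the kind of structure this would need can be sketched as a single-writer, multi-reader append-only posting list, where each document becomes visible to readers as soon as the size counter is published. `AppendOnlyPostings` is a hypothetical name and not something Trinity provides. The sketch assumes exactly one writer thread and a fixed capacity, so published slots are never moved or rewritten.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

using DocId = std::uint32_t;

// Hypothetical posting list: one writer appends, any number of readers scan.
class AppendOnlyPostings {
public:
    explicit AppendOnlyPostings(std::size_t capacity)
        : docs_(new DocId[capacity]), capacity_(capacity) {}

    // Writer side (single thread only). Returns false when the list is full.
    bool append(DocId doc) {
        const std::size_t n = size_.load(std::memory_order_relaxed);
        if (n == capacity_) return false;
        docs_[n] = doc;                                 // write the slot first
        size_.store(n + 1, std::memory_order_release);  // then publish it
        return true;
    }

    // Reader side: copies every document published so far.
    std::vector<DocId> snapshot() const {
        const std::size_t n = size_.load(std::memory_order_acquire);
        return std::vector<DocId>(docs_.get(), docs_.get() + n);
    }

private:
    std::unique_ptr<DocId[]> docs_;
    std::size_t capacity_;
    std::atomic<std::size_t> size_{0};
};
```

The release store paired with the acquire load is what makes the per-document "commit" visible: a reader that observes size `n` is guaranteed to see the first `n` slots fully written.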

@markpapadakis
Member

You shouldn't really use a SegmentIndexSource. That is for read-only segments. Instead, you should subclass IndexSource and create your own. I should probably bundle a simple implementation as an example of how this could work. If you can wait a while until I get this new major release into shape and push it to GH, I'll add a reference impl. for such an IndexSource.

@markpapadakis
Member

@yingfeng I am sorry, it has taken longer than I expected to find some free time for those examples. I am still working on adding more features (a major release was pushed to GH some days ago). I will get to those examples soon thereafter.
