
Planned removal of index sorting in 0.23.0 #2352

Open
PSeitz opened this issue Apr 9, 2024 · 4 comments

@PSeitz (Contributor) commented Apr 9, 2024

Index sorting currently provides little to no benefit, and there is some confusion among tantivy users about how to use it (e.g. the assumption that it is a way to get sorted search results, which is not really the case).
It also adds considerable complexity to the indexing and merging parts of the code.

On the potential pros:

  • Range queries could be accelerated by using binary search instead of a full scan, but we don't do that currently, and it would be limited to one field that you would need to pick in advance.
  • Compression may be improved, but we don't have much data on that.
  • order_by queries on the same field could be accelerated. On a single segment that would be trivial and would remove the need to fetch fast field values; on multiple segments it would be more difficult.
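The first point above can be sketched in a few lines. This is a hypothetical illustration, not tantivy code: `values` stands in for a fast field reader over a segment that was sorted by that field, so the column is non-decreasing in DocId order and a range query becomes two binary searches instead of a full scan.

```rust
/// Return the half-open DocId range whose field values fall in `lo..hi`,
/// assuming `values[doc_id]` is non-decreasing (i.e. the segment is sorted
/// by this field). Illustrative only; not part of tantivy's API.
fn doc_id_range(values: &[u64], lo: u64, hi: u64) -> std::ops::Range<usize> {
    // First DocId whose value is >= lo.
    let start = values.partition_point(|&v| v < lo);
    // First DocId whose value is >= hi.
    let end = values.partition_point(|&v| v < hi);
    start..end
}

fn main() {
    // Column is sorted because the segment is index-sorted on this field.
    let values = [1u64, 3, 3, 7, 9, 12, 15];
    let range = doc_id_range(&values, 3, 10);
    println!("matching DocIds: {:?}", range); // DocIds 1..5
}
```

Without the sort, answering the same query requires scanning every value in the column.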

A deprecation warning with a link to this issue will also be added in 0.22.0.

If you have a use case which requires index sorting, please explain your use case here.

Related #2335

@sujayakar

I was planning on using index sorting for our full text search at Convex to ensure that results are sorted in a deterministic order that's independent of how documents are split across segments.

We're an OLTP database that lets users define text search indexes on their existing tables, much like Postgres with its GIN-based indexes. Indexes are fully transactional, with older data stored across multiple Tantivy segments and recent data stored in a multiversioned in-memory index. Furthermore, we want read transactions in our system to be deterministic so we can cache them automatically and precisely detect when they're invalidated.

So, we'd like our text search queries (which are just TopK queries by BM25 of an OR of a set of search terms) to be deterministic, even if there's been an index rebuild since the query was executed. Elsewhere in our system, our rows have a natural sort order based on their system-assigned ID and insertion time. Then, during query execution, ideally each segment would return its top K results based on (BM25, natural order), independent of how DocIds are assigned within each segment, and then we'd merge the results to get a deterministic global result.
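The deterministic merge described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Convex's or tantivy's actual code: `Hit` and `natural_key` are hypothetical names, with `natural_key` standing in for the system-assigned ID and insertion time.

```rust
// A hit as returned by one segment: a BM25 score plus a tie-breaker that is
// independent of how DocIds were assigned within the segment.
#[derive(Clone, Debug, PartialEq)]
struct Hit {
    score: f32,       // BM25 score
    natural_key: u64, // segment-independent natural order of the row
}

/// Merge per-segment top-K lists into a deterministic global top-K,
/// ordering by (descending score, ascending natural key).
fn merge_top_k(segments: &[Vec<Hit>], k: usize) -> Vec<Hit> {
    let mut all: Vec<Hit> = segments.iter().flatten().cloned().collect();
    all.sort_by(|a, b| {
        b.score
            .partial_cmp(&a.score)
            .unwrap() // scores are assumed finite (no NaN)
            .then(a.natural_key.cmp(&b.natural_key))
    });
    all.truncate(k);
    all
}

fn main() {
    let seg_a = vec![
        Hit { score: 2.0, natural_key: 10 },
        Hit { score: 1.0, natural_key: 3 },
    ];
    let seg_b = vec![Hit { score: 2.0, natural_key: 7 }];
    // The tie at score 2.0 is broken by natural key, not by segment layout,
    // so the result is stable across segment rebuilds.
    println!("{:?}", merge_top_k(&[seg_a, seg_b], 2));
}
```

The point of embedding the natural order in DocId order is that each segment can produce its top K in exactly this order without loading a fast field per hit.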

We've already done some of the work here (e.g. adjusting BM25 statistics to be deterministic as of the transaction's timestamp: #1855, Convex Bm25StatisticsProvider implementation), but I would love to make this 100% accurate, and embedding the natural row order in the DocId order seems like a good path to do this efficiently at query time without having to load a fast field on every hit.

Future features

A few users have asked us for sort orders other than BM25 for text search results (e.g. insertion time), and having them pick a single field in advance would work well for our use case. Furthermore, letting users specify range queries on this sort field and paginate over it would be very helpful. Since we're an OLTP database, we store our indexes on SSDs and want queries to execute quickly (<100ms in total) with bounded IO. So, turning these range queries into a binary search rather than having to do a full aggregation and sort is highly desirable.

Even further in the future, I'd love to use Tantivy for implementing more types of OLTP inverted index algorithms in our system:

In these cases, I think having a well-defined sort order and efficient pagination would be important since these are structured application queries, not user-facing ones.

I'm also pretty rusty on the Block WAND paper, but if scoring is simpler in this setting (e.g. an array index scores by the simple count of matches), many documents may have the same score, and being able to use DocId as a tiebreaker without consulting a fast field may let us skip over many blocks entirely.

Conclusion

I'm curious whether any of this resonates with you all -- I'm definitely inexperienced in this domain, so let me know if I'm missing anything or not thinking about things in the right way.

We've also benefitted greatly from using Tantivy within Convex, so if there are ways I can help implement some of these optimizations (e.g. using index order for range queries), let me know!

@PSeitz (Contributor, Author) commented Apr 18, 2024

Index sorting is a bit of a misnomer: the index itself is not sorted, the segments are. This feature will not really provide deterministic search results.
If you index single-threaded, with a memory budget on the writer large enough to avoid intermediate flushes, I think the order is deterministic for the same data.

We already support JSON as a field type; I'm not sure how index sorting would specifically affect it. Geo-spatial would probably use a BKD tree as its index, independent of sorting.

Index sorting has specific prerequisites and some performance cost during indexing, and to utilize it we would need to add special code paths, so it is pretty niche. The time may be more effectively spent optimizing the more general case.

@sujayakar

> Index sorting is a bit of a misnomer: the index itself is not sorted, the segments are.

Right, I think that's my understanding -- for an index sorted by intval,

```rust
let settings = IndexSettings {
    sort_by_field: Some(IndexSortByField {
        field: "intval".to_string(),
        order: Order::Asc,
    }),
    ..Default::default()
};
```

each segment within the index will assign DocIds such that DocId order is the same as intval's order. Is that right?

In our system, we represent the state D of the database with n Tantivy segments s_1, s_2, ..., s_n and a memory index m. So, the union of s_1, ..., s_n, and m comprises all of the rows in D.

We query the top K results with respect to BM25 across each segment with the TopCollector, which breaks ties using the DocAddress. Then, we merge the results across the n segments and memory index and take the top K results globally.

The concern I have with determinism is when we do a compaction and merge s_1, ..., s_n into, say, a single segment s'. Then, the DocIds in s' are no longer in the same order as the DocIds in the original s_i. If we reexecute the query, the results may change if documents have the same BM25 score since we used the segments' DocId as a tie-breaker.

My thought was to use index sorting to ensure that DocId order is always compatible with the natural order of the documents, so each segment would always return the top K results with respect to (BM25, natural order), so then the global merged top K results would be unaffected even if we rebuild segments.

Workarounds

> If you index single-threaded, with a memory budget on the writer large enough to avoid intermediate flushes, I think the order is deterministic for the same data.

Yeah, I think this should work for our use case. We already want to control the threading and memory consumption in our system, so we'd probably use the lower level SegmentWriter structure rather than IndexWriter and feed in documents in precisely the order we want. Then, it looks like we're guaranteed that DocIds are assigned sequentially:

```rust
self.max_doc += 1;
```

Then, if we wanted to implement range queries on, say, insertion time ourselves, we'd store insertion time in a fast field, perform a binary search on the fast field reader to get a DocId range, and then pass that DocId range to a custom TermScorer that restricts the posting list traversal to the given range.
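That workaround can be sketched like this. It is a hypothetical illustration, not tantivy's API: `insertion_times` stands in for the fast field reader (assumed non-decreasing in DocId order because documents were fed in insertion order), and the posting list is modeled as a sorted slice of DocIds rather than a real `TermScorer`.

```rust
/// Binary-search the (sorted) insertion-time fast field for the half-open
/// DocId range whose timestamps fall in `lo..hi`.
fn doc_id_range(insertion_times: &[u64], lo: u64, hi: u64) -> std::ops::Range<u32> {
    let start = insertion_times.partition_point(|&t| t < lo) as u32;
    let end = insertion_times.partition_point(|&t| t < hi) as u32;
    start..end
}

/// Restrict a posting list (sorted DocIds) to the given DocId range --
/// a stand-in for a custom scorer that seeks to `range.start` and stops
/// at `range.end` instead of traversing the whole list.
fn restrict_postings(postings: &[u32], range: std::ops::Range<u32>) -> Vec<u32> {
    let start = postings.partition_point(|&d| d < range.start);
    let end = postings.partition_point(|&d| d < range.end);
    postings[start..end].to_vec()
}

fn main() {
    // One timestamp per DocId, non-decreasing by construction.
    let insertion_times = [100u64, 105, 110, 200, 250, 300];
    let postings = [0u32, 2, 3, 5]; // DocIds matching some term
    let range = doc_id_range(&insertion_times, 105, 260);
    println!("{:?}", restrict_postings(&postings, range)); // [2, 3]
}
```

The key property is that both lookups are logarithmic, so the query's IO stays bounded regardless of segment size.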

Let me know if that makes sense.

@PSeitz (Contributor, Author) commented Apr 20, 2024

> each segment within the index will assign DocIds such that DocId order is the same as intval's order. Is that right?

A DocId is the ordinal of the Document in the segment. Sorting changes the order of the documents, it does not assign values to DocIds explicitly.
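A toy illustration of that distinction, with hypothetical names: "sorting" a segment just means writing the documents out in sorted field order, after which the position of each document is its DocId; nothing assigns DocId values explicitly.

```rust
/// Reorder a segment's documents by the sort field. The index into the
/// returned slice IS the DocId; no explicit assignment happens.
/// Illustrative only; a real segment holds full documents, not just i64s.
fn sort_segment(intvals: &[i64]) -> Vec<i64> {
    let mut sorted = intvals.to_vec();
    sorted.sort();
    sorted
}

fn main() {
    // Ingestion order: without sorting, DocIds 0..5 follow this order.
    let ingested = [42i64, 7, 19, 7, 100];
    // With sorting, DocId 0 holds the smallest intval, DocId 1 the next, etc.
    println!("{:?}", sort_segment(&ingested)); // [7, 7, 19, 42, 100]
}
```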

> The concern I have with determinism is when we do a compaction and merge s_1, ..., s_n into, say, a single segment s'. Then, the DocIds in s' are no longer in the same order as the DocIds in the original s_i. If we reexecute the query, the results may change if documents have the same BM25 score since we used the segments' DocId as a tie-breaker.

You're assuming that search results are stable across merges with index sorting, purely by way of the implicit ordering. I'm not sure that is the case today, or that it would stay that way even if it is.

> Yeah, I think this should work for our use case. We already want to control the threading and memory consumption in our system, so we'd probably use the lower level SegmentWriter structure rather than IndexWriter and feed in documents in precisely the order we want

You can already control threading and memory consumption with IndexWriter.

> Then, if we wanted to implement range queries on, say, insertion time ourselves, we'd store insertion time in a fast field, perform a binary search on the fast field reader to get a DocId range, and then pass that DocId range to a custom TermScorer that restricts the posting list traversal to the given range.

I've been wondering whether we should flag fast fields as almost sorted during creation (e.g. sorted to within a window of 100 values) and then use that information to do a binary search plus a scan of up to 100 values.
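A rough sketch of that idea, under the assumption that no value sits more than `slack` positions away from where it would be in fully sorted order. All names are illustrative; nothing like this exists in tantivy today.

```rust
/// First index whose value is >= `target`, for a column that is sorted up to
/// a maximum displacement of `slack` positions. Binary-search as if fully
/// sorted, then back off by `slack` and scan forward to correct for the
/// local disorder -- O(log n + slack) instead of a full O(n) scan.
fn almost_sorted_lower_bound(values: &[u64], target: u64, slack: usize) -> usize {
    let approx = values.partition_point(|&v| v < target);
    let start = approx.saturating_sub(slack);
    values[start..]
        .iter()
        .position(|&v| v >= target)
        .map(|i| start + i)
        .unwrap_or(values.len())
}

fn main() {
    // Sorted except for small local swaps (displacement <= 2).
    let values = [1u64, 3, 2, 5, 4, 8, 7, 10];
    println!("{}", almost_sorted_lower_bound(&values, 4, 2)); // 3
}
```

The appeal is that many real columns (e.g. insertion timestamps with multi-threaded indexing) are naturally near-sorted without paying the full cost of index sorting.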
