
FST#Compiler allocates too much memory #12598

Closed
gf2121 opened this issue Sep 27, 2023 · 4 comments · Fixed by #12604

Comments

@gf2121
Contributor

gf2121 commented Sep 27, 2023

Description

https://blunders.io/jfr-demo/indexing-4kb-2023.09.25.18.03.36/allocations-drill-down

The allocation profile for the nightly indexing benchmark shows that FST#BytesStore accounts for a rather large share of allocations, because we construct a new FST#Compiler on every call to Lucene90BlockTreeTermsWriter$TermsWriter#writeBlocks, and each one allocates a fresh 32 KB byte[].

As the profile shows, most of the memory was allocated in FST#init, which means BytesStore rarely needs a second page to store its bytes. So I suspect the page we use here is too large. Maybe we should decrease the page size to 8 KB or even smaller?
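To illustrate why the first page dominates the profile, here is a minimal sketch of a paged, append-only byte store (a stand-in, not Lucene's actual BytesStore). The first page is allocated eagerly, so every instance costs at least one full page even if only a few bytes are ever written, which is why a 32 KB first page (blockBits = 15) shows up so prominently when most FSTs are tiny:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a paged byte store: pages are 1 << blockBits bytes,
// and the first page is allocated up front, mirroring what FST#init does.
class PagedByteStore {
    private final int blockSize;
    private final List<byte[]> blocks = new ArrayList<>();
    private int upto; // write offset within the current (last) block

    PagedByteStore(int blockBits) {
        this.blockSize = 1 << blockBits;
        blocks.add(new byte[blockSize]); // eager first page
        upto = 0;
    }

    void writeByte(byte b) {
        if (upto == blockSize) { // current page full: allocate another
            blocks.add(new byte[blockSize]);
            upto = 0;
        }
        blocks.get(blocks.size() - 1)[upto++] = b;
    }

    long allocatedBytes() {
        return (long) blocks.size() * blockSize;
    }
}
```

With blockBits = 15, writing only 40 bytes still allocates a full 32 KB; with blockBits = 10 the same write allocates just 1 KB.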

@gf2121
Contributor Author

gf2121 commented Sep 28, 2023

I ran an experiment: index 5M random BytesRefs and record the byte usage each time BytesStore#finish is called. Here are the statistical results:

total sample: 23130

avg: 78.12
min: 7
mid: 36
pct75: 40
pct90: 40
pct99: 44
max: 10790
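The percentile figures above can be derived by collecting the byte usage sampled at each finish call, sorting, and reading off ranks. A small sketch (the method name and rounding rule are my assumptions, not the code used for the experiment):

```java
import java.util.Arrays;

// Nearest-rank percentile over sampled BytesStore#finish sizes.
// percentile(sizes, 50) corresponds to the "mid" (median) row above.
class FinishSizeStats {
    static long percentile(long[] sizes, double pct) {
        long[] sorted = sizes.clone();
        Arrays.sort(sorted);
        int idx = (int) Math.ceil(pct / 100.0 * sorted.length) - 1;
        return sorted[Math.max(0, idx)];
    }
}
```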

Since the BytesRefs are random, they share little common prefix or suffix, so I also tried mocking some shared prefixes/suffixes for them like:

// R is a Random; b is the freshly generated random term bytes
if (R.nextBoolean()) {
  int prefixLen = R.nextInt(b.length / 2);
  System.arraycopy(commonPrefix, 0, b, 0, prefixLen);
}

if (R.nextBoolean()) {
  int suffixLen = R.nextInt(b.length / 2);
  System.arraycopy(commonSuffix, commonSuffix.length - suffixLen, b, b.length - suffixLen, suffixLen);
}
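For context, a self-contained version of the snippet above might look like the following (the class name, seed, and the 128-byte prefix/suffix pools are assumptions for illustration; only the two if-blocks come from the comment):

```java
import java.util.Random;

// Hypothetical harness around the snippet above: generate a random term,
// then with 50% probability splice in a shared prefix and/or suffix.
class MockTerms {
    static final Random R = new Random(42);
    static final byte[] commonPrefix = new byte[128]; // assumed pool sizes
    static final byte[] commonSuffix = new byte[128];

    static byte[] randomTerm(int len) {
        byte[] b = new byte[len];
        R.nextBytes(b);
        if (R.nextBoolean()) {
            int prefixLen = R.nextInt(b.length / 2);
            System.arraycopy(commonPrefix, 0, b, 0, prefixLen);
        }
        if (R.nextBoolean()) {
            int suffixLen = R.nextInt(b.length / 2);
            System.arraycopy(commonSuffix, commonSuffix.length - suffixLen,
                             b, b.length - suffixLen, suffixLen);
        }
        return b;
    }
}
```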

And here is the result:

total sample: 27235

avg: 820.540738020929
min: 8
mid: 24
pct75: 629
pct90: 3347
pct99: 5374
max: 29049

We allocate a 32 KB block while 99% of cases need only ~5 KB. These results roughly match the allocation profile, which shows we rarely need a second block in BytesStore.

@gf2121
Contributor Author

gf2121 commented Sep 28, 2023

I collected similar statistics for wikimediumall; here are the results over 1,000,000 calls to BytesStore#finish.

BytesStore#finish called: 1000000 times

min: 1
mid: 16
avg: 64.555987
pct75: 28
pct90: 57
pct99: 525
pct999: 4957
pct9999: 29124
max: 631700

It seems 1 KB per block is enough here: 99% of cases are covered by a single block, and at most we need 600+ blocks for a single FST.
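The arithmetic behind that conclusion can be checked directly: with 1 KB blocks, the pct99 size (525 bytes) fits in one block, and even the max (631,700 bytes) needs only ceil(631700 / 1024) = 617 blocks. A trivial sketch:

```java
// Blocks needed to hold a given byte count at a given block size
// (ceiling division).
class BlockMath {
    static int blocksNeeded(long bytes, int blockSize) {
        return (int) ((bytes + blockSize - 1) / blockSize);
    }
}
```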

@mikemccand
Member

Thanks @gf2121 -- this is a great discovery (and thank you https://blunders.io for the awesome integrated profiling in Lucene's nightly benchmarks!).

While the bytesrefs are random, it may share little prefix and suffix, I tried to mock some common prefix/suffix for them like

It's always dangerous to test on random data (even random data that attempts to simulate realistic patterns): you'll draw random conclusions yet act on them as if they were trustworthy! Better to test by indexing true terms that show up in a real-world corpus.

Still, I think this issue is indeed compelling -- most FSTs in practice are likely smaller than 32 KB. But one issue to be aware of: shrinking the block size will worsen the case where an FST that is built and then immediately used (in heap, not via save/load) needs more than one block, which already performs poorly.

Really we need to get away from FST attempting to re-implement an in-memory filesystem, badly! We've seen many problems from this over the years... like poor performance reading bytes in reverse, this issue (over-allocating blocks), the above linked issue (poor performance using a just-built FST). The FST compiler really ought to stream its bytes directly to DataOutput (inspired by Tantivy's awesome FST implementation) which will in general be backed by a "true" filesystem since FST compilation is fundamentally append-only writes.

But until we fix this "correctly" (stop trying to be a ramfs, badly), I agree we should tweak the block size for terms dict FST writing.
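The "stream directly to a DataOutput" idea can be sketched as follows. This is a stand-in abstraction of my own, not Lucene's actual DataOutput API: since FST compilation is fundamentally append-only, the compiler could write each finished node straight to an output stream instead of buffering everything in paged byte[] blocks:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Hypothetical append-only sink for FST compilation output.
// No random-access writes and no reverse reads: bytes only ever go forward,
// so the backing store can be a real file rather than an in-heap "ramfs".
interface AppendOnlyOutput {
    void writeByte(byte b);
    long position();
}

class StreamingOutput implements AppendOnlyOutput {
    private final OutputStream out;
    private long pos;

    StreamingOutput(OutputStream out) { this.out = out; }

    @Override public void writeByte(byte b) {
        try {
            out.write(b); // append-only write, delegated to the real sink
            pos++;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override public long position() { return pos; }
}
```

The design point is that the compiler never needs to own block-sized buffers at all; buffering policy moves into whatever OutputStream (file, memory, network) backs the sink.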

@risdenk
Contributor

risdenk commented Oct 4, 2023

FWIW, I was looking into this a bit when I saw this issue come in, specifically on Solr 8.11. As far as I can tell the changes in #12604 apply to 8.x as well.

In a 30-second async-profiler memory allocation profile, I can see that ~10% of the total (4% on the left and ~6% near the middle) is this same FST byte array.
[Screenshot: async-profiler allocation flame graph, 2023-10-04]

I put up a backport PR for lucene-solr branch_8_11 - apache/lucene-solr#2677
