Optimize scanning of string offsets to skip pages we don't need to read #2575
base: master
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

@@ Coverage Diff @@
##           master    #2575      +/-   ##
==========================================
+ Coverage   93.37%   93.38%   +0.01%
==========================================
  Files        1063     1063
  Lines       39360    39371      +11
==========================================
+ Hits        36752    36767      +15
+ Misses       2608     2604       -4
Force-pushed df711a2 to 8dedd93
https://github.com/kuzudb/kuzu/actions/runs/7186804200/job/19573028840?pr=2575 Clang-tidy on macOS doesn't seem to recognize that
Can you share more details on your benchmark? What dataset is used, how large the dictionary is, and what are the numbers like?
I've updated the description to include benchmark details, as well as more details about the change |
src/storage/store/string_column.cpp (Outdated)

@@ -154,54 +157,83 @@ void StringColumn::scanUnfiltered(transaction::Transaction* transaction,
     scanValuesToVector(transaction, nodeGroupIdx, offsetsToScan, resultVector, indexState);
 }

 std::ranges::subrange<std::vector<std::pair<StringColumn::string_index_t, uint64_t>>::iterator>
 getNextRange(std::vector<std::pair<StringColumn::string_index_t, uint64_t>>& data,
I'm a bit concerned with the use of `subrange`. It is quite new in C++20; not sure if it is already widely supported in most compilers.
Yes, me too. Maybe it could just be replaced by a pair, but it's a little more convenient and helps simplify the code.
https://en.cppreference.com/w/cpp/compiler_support/20 seems to indicate that it's supported by the four major stdlibs (under "The One Ranges Proposal"). `std::ranges::minmax_result` turned out not to be well supported by Apple Clang (which is why `std::minmax_result` was used instead in the old code here). I'm not sure how much of the rest of ranges is.
src/storage/store/string_column.cpp (Outdated)

 void StringColumn::scanValuesToVector(Transaction* transaction, node_group_idx_t nodeGroupIdx,
     std::vector<std::pair<string_index_t, uint64_t>>& offsetsToScan, ValueVector* resultVector,
     const ReadState& indexState) {
     auto offsetState = offsetColumn->getReadState(transaction->getType(), nodeGroupIdx);
     auto dataState = dataColumn->getReadState(transaction->getType(), nodeGroupIdx);

     string_index_t firstOffsetToScan, lastOffsetToScan;
-    auto comp = [](auto pair1, auto pair2) { return pair1.first < pair2.first; };
+    auto comp = [](const auto& pair1, const auto& pair2) { return pair1.first < pair2.first; };
     auto duplicationFactor = (double)offsetState.metadata.numValues / indexState.metadata.numValues;
     if (duplicationFactor <= 0.5) {
I would add a boolean flag `bool hasLotDuplication = duplicationFactor <= 0.5`, so you can also reuse the flag at L198. The variable name `hasLotDuplication` can be polished, though.
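As a sketch of the suggested refactor (the name and free-function signature are illustrative, not from the PR), hoisting the heuristic into one named predicate lets both call sites share a single definition:

```cpp
#include <cstdint>

// Hypothetical helper: a low ratio of offset values (one per distinct
// dictionary entry) to index values (one per row) means many rows share
// the same dictionary entry, i.e. the column has a lot of duplication.
bool hasHighDuplication(uint64_t numOffsetValues, uint64_t numIndexValues) {
    double duplicationFactor =
        static_cast<double>(numOffsetValues) / numIndexValues;
    return duplicationFactor <= 0.5;
}
```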
     }
     return std::ranges::subrange(data.begin() + startPos, data.begin() + endPos);
 }

 void StringColumn::scanValuesToVector(Transaction* transaction, node_group_idx_t nodeGroupIdx,
This code block is becoming complex now. Can you write a more detailed comment summarizing the flow of scanning values when there is (or isn't) a lot of duplication?
src/storage/store/string_column.cpp (Outdated)

@@ -28,6 +28,9 @@ StringColumn::StringColumn(std::string name, std::unique_ptr<LogicalType> dataTy

 void StringColumn::scanOffsets(Transaction* transaction, const ReadState& state,
     string_offset_t* offsets, uint64_t index, uint64_t numValues, uint64_t dataSize) {
     // TODO(bmwinger): this causes an issue with structs where a read is attempted on an empty chunk
I don't follow here, what's the connection with structs?
I think it was that, for some reason, structs read the first value even when numValues in the metadata is 0, so the assertion fails for strings stored in a struct. As long as everything is zero, that value just ends up being an empty string, but it shouldn't be read at all. I'm not sure why it happens only for structs.
Do you have a test case or example that can reproduce that? It looks like we should fix it now or later; if later, it'd be better to document it in an issue.
Un-commenting that line causes only EmptyCreateNodeTest.CreateStructWithWriteTransaction to fail. That said, after creating the node, the string chunk should contain a single string, so maybe it's the metadata that's wrong; probably something to do with the scan occurring before the commit.
Force-pushed 8a5a120 to 3a8a041
Force-pushed 3a8a041 to d0bb44a
Before, we were scanning all offsets at once to reduce the number of scans done (since with dictionary compression, offsets may be out of order even when doing unfiltered scans, so there's no guarantee that they will be adjacent). That had a bad worst case where for chunks with a large dictionary, you might end up scanning the entire chunk to read just a few pages.
So instead, this scans in batches, calculated by finding the next range of values where no two values are further apart than one page (using the number of values per page calculated from the chunk metadata). It will still scan unneeded values, but no longer scan unneeded pages. This had better performance than scanning only adjacent values.
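The batching idea described above can be sketched roughly as follows. This is an illustrative reimplementation under stated assumptions (offsets already sorted, a fixed values-per-page figure derived from the chunk metadata), not the PR's actual `getNextRange`:

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Starting at `begin`, extend the batch while the gap between consecutive
// sorted offsets stays within one page's worth of values. A single scan of
// the resulting range may read some unneeded values, but never a page that
// holds none of the requested offsets.
std::pair<std::size_t, std::size_t> nextBatch(
        const std::vector<uint64_t>& sortedOffsets, std::size_t begin,
        uint64_t valuesPerPage) {
    std::size_t end = begin + 1;
    while (end < sortedOffsets.size() &&
           sortedOffsets[end] - sortedOffsets[end - 1] <= valuesPerPage) {
        end++;
    }
    return {begin, end}; // half-open range [begin, end)
}
```

For example, with 100 values per page and offsets {1, 5, 90, 500, 510}, the first batch covers {1, 5, 90} and the next covers {500, 510}, so the pages between offset 90 and offset 500 are never touched.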
I'm concerned that this is a little complicated, but it gives a moderate improvement to string scanning performance (roughly 10% on columns with a moderate amount of duplication).
Here are two sets of benchmarks on the LDBC-10 comment table (I did two runs to make it clearer when differences were due to externalities like system noise. On one of them, the two runs are superimposed). Note that these are without clearing OS filesystem caches.
I'd also done some cold runs in the same circumstances on the `locationIP` and `content` tables, and got a speedup of about 7% (the only numbers I have to hand are 859ms average compared to 924ms average, for the `content` column with the string IDs; non-string IDs were slightly faster, but still a comparable difference).

Oddly, the q05 benchmark (scanning `browserUsed`) is faster with this change, which I think is suspicious, as the `browserUsed` column only has 5 distinct values and should get no benefit from skipping pages.