estimated row size in bytes performance issues #240

Open
Manan007224 opened this issue Jan 21, 2021 · 2 comments

@Manan007224
Contributor

We want to add a function to RowBatch which can estimate the batch size in bytes. There are a couple of options that could be taken here:

  1. Convert the row batch to bytes using json marshalling and calculate the size of the resulting byte slice. This is implemented in the following PR - Add estimated row batch size in bytes to state #230. @shuhaowu did some performance analysis and generated a flame graph which clearly showed that the json marshalling method is slow and needs optimization.

  2. reflect - just estimating the size of the different types in the rowdata, which is of type []interface{}, might suffice. The downside is that this can give us a very inaccurate estimate. For example, consider this rowdata: ["mkm0", "mkm0a", 1234]. If we use the reflect package to calculate the size of this rowdata in bytes, it gives us 40, counting each of the two strings as 16 bytes and the int as 8 bytes. But the strings mkm0 and mkm0a have different lengths, and the reflect package doesn't account for this.
     On the same rowdata, calculating the byte size using json marshalling gives 21, which is accurate.
     So for a batch of 200 of the rowData mentioned above, the reflect package would give the batch size as 8000 bytes while the json marshalling method would give ~4200 bytes. That is too large a deviation from the actual value, so we need to decide whether we want to prioritize performance over correctness. The code I ran to demonstrate the issues with the reflect package is here - https://play.golang.org/p/dWGh-M-qu_j (a comparison of both approaches is also sketched after this list).

  3. The mysqldriver, when it creates the mysql packet, calculates the size in bytes of the data before performing the syscall - https://github.com/go-sql-driver/mysql/blob/master/packets.go#L1011-L1132. The
     obvious option is to modify the internal function writeExecutePacket in the mysqldriver to return the data size in bytes, but that would mean we have to maintain our own feature branch, which might be cumbersome.
     We can also take the logic used in the mysqldriver code to calculate the data size and put it into our rowbatch.EstimatedByteSize function. This would be clearly more performant than the json marshalling
     and more accurate than the reflect package (a sketch of this approach follows after this list).

  4. Do some sampling - if there are X rows in a batch, calculate the size in bytes for only X/N of them and extrapolate (see the sampling sketch below).
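
A minimal comparison sketch of options 1 and 2, assuming `RowData` is just an alias for the `[]interface{}` row representation described above:

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// RowData mirrors the []interface{} row representation described above.
type RowData []interface{}

// jsonSize marshals the row and returns the length of the resulting byte
// slice (option 1). Accurate, but the marshalling itself is the slow part.
func jsonSize(row RowData) int {
	b, err := json.Marshal(row)
	if err != nil {
		return 0
	}
	return len(b)
}

// reflectSize sums the sizes reported by the reflect package (option 2).
// For a string this is only the 16-byte string header on a 64-bit platform,
// not the backing bytes, so "mkm0" and "mkm0a" count the same.
func reflectSize(row RowData) int {
	total := 0
	for _, v := range row {
		total += int(reflect.TypeOf(v).Size())
	}
	return total
}

func main() {
	row := RowData{"mkm0", "mkm0a", 1234}
	fmt.Println("json:", jsonSize(row))       // 21
	fmt.Println("reflect:", reflectSize(row)) // 40
}
```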
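A sketch of the idea in option 3 - borrowing the mysqldriver's per-type accounting rather than modifying the driver itself. The type switch below is an assumption about which kinds of values appear in RowData, not the driver's actual code, and it builds on the RowData type from the previous sketch:

```go
// estimatedRowSize approximates the wire size of a row in the same spirit
// as the mysqldriver's execute-packet sizing: fixed widths for numeric
// types, actual byte lengths for strings and blobs. The covered cases are
// assumptions about what shows up in RowData.
func estimatedRowSize(row RowData) int {
	total := 0
	for _, v := range row {
		switch val := v.(type) {
		case nil:
			// NULL values contribute no data bytes.
		case bool:
			total += 1
		case int32, uint32, float32:
			total += 4
		case int, int64, uint64, float64:
			total += 8
		case string:
			total += len(val)
		case []byte:
			total += len(val)
		default:
			// Fall back to the reflect-based header size for anything else.
			total += int(reflect.TypeOf(v).Size())
		}
	}
	return total
}
```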
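And a sketch of option 4, reusing jsonSize from the first sketch; sampleEvery is a hypothetical tuning parameter:

```go
// estimatedBatchSize estimates the byte size of a whole batch by fully
// sizing only every sampleEvery-th row (option 4) and extrapolating,
// trading some accuracy for fewer json.Marshal calls.
func estimatedBatchSize(rows []RowData, sampleEvery int) int {
	if len(rows) == 0 || sampleEvery <= 0 {
		return 0
	}
	sampledBytes, sampledRows := 0, 0
	for i := 0; i < len(rows); i += sampleEvery {
		sampledBytes += jsonSize(rows[i])
		sampledRows++
	}
	// Scale the sampled total up to the full batch.
	return sampledBytes * len(rows) / sampledRows
}
```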

@fjordan @tiwilliam @shuhaowu

@shuhaowu
Contributor

shuhaowu commented Feb 5, 2021

Perhaps we can just roll with the json based estimation for now, and have a config flag to turn this behaviour off...

@fjordan
Contributor

fjordan commented Feb 22, 2021

> Perhaps we can just roll with the json based estimation for now, and have a config flag to turn this behaviour off...

I think that gets us 80% of the way there with 20% of the effort. Probably not worth going after the last 20%.
