estimated row size in bytes performance issues #240

Open
Manan007224 opened this issue Jan 21, 2021 · 2 comments

@Manan007224
Contributor

We want to add a function to RowBatch which can estimate the batch size in bytes. There are a couple of options that could be taken here:

  1. Convert the row batch to bytes using json marshalling and calculate the size of the resulting byte slice. This is implemented in the following PR - Add estimated row batch size in bytes to state #230. @shuhaowu did some performance analysis and generated a flame graph which clearly showed that the json marshalling method is slow and needs optimization.

  2. reflect - just estimating the size of the different types in the rowdata, which is of type []interface{}, might suffice. The downside is that this can give us a very inaccurate estimate. For example, consider this rowdata: ["mkm0", "mkm0a", 1234]. If we use the reflect package to calculate the size of this rowdata in bytes, it gives us 40, counting each of the two strings as 16 bytes and the int as 8 bytes. But the strings mkm0 and mkm0a have different lengths, and the reflect package doesn't account for this.
     On the same rowdata, calculating the byte size using json marshalling gives 21, which is accurate.
     So for a batch of 200 of the rowData mentioned above, the reflect package would give the batch size as 8000 bytes while the json marshalling method would give ~4200 bytes. That is too large a deviation from the actual value, so we need to decide whether we want to prioritize performance over correctness. The code I ran to demonstrate the issues with the reflect package is here - https://play.golang.org/p/dWGh-M-qu_j (a comparison of both approaches is also sketched after this list).

  3. The mysqldriver, when it creates the mysql packet, calculates the size in bytes of the data before performing the syscall - https://github.com/go-sql-driver/mysql/blob/master/packets.go#L1011-L1132. The
     obvious option is to modify the internal function writeExecutePacket in the mysqldriver to return the data size in bytes, but that would mean we have to maintain our own feature branch, which might be cumbersome.
     We can also take the logic used in the mysqldriver code to calculate the data size and put it into our rowbatch.EstimatedByteSize function. This would be clearly more performant than the json marshalling
     and more accurate than the reflect package (a sketch of this approach follows after this list).

  4. Do some sampling - if there are X rows in a batch, calculate the size in bytes for only X/N of them and extrapolate (see the sampling sketch below).
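
A minimal comparison sketch of options 1 and 2, assuming `RowData` is just an alias for the `[]interface{}` row representation described above:

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// RowData mirrors the []interface{} row representation described above.
type RowData []interface{}

// jsonSize marshals the row and returns the length of the resulting byte
// slice (option 1). Accurate, but the marshalling itself is the slow part.
func jsonSize(row RowData) int {
	b, err := json.Marshal(row)
	if err != nil {
		return 0
	}
	return len(b)
}

// reflectSize sums the sizes reported by the reflect package (option 2).
// For a string this is only the 16-byte string header on a 64-bit platform,
// not the backing bytes, so "mkm0" and "mkm0a" count the same.
func reflectSize(row RowData) int {
	total := 0
	for _, v := range row {
		total += int(reflect.TypeOf(v).Size())
	}
	return total
}

func main() {
	row := RowData{"mkm0", "mkm0a", 1234}
	fmt.Println("json:", jsonSize(row))       // 21
	fmt.Println("reflect:", reflectSize(row)) // 40
}
```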
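A sketch of the idea in option 3 - borrowing the mysqldriver's per-type accounting rather than modifying the driver itself. The type switch below is an assumption about which kinds of values appear in RowData, not the driver's actual code, and it builds on the RowData type from the previous sketch:

```go
// estimatedRowSize approximates the wire size of a row in the same spirit
// as the mysqldriver's execute-packet sizing: fixed widths for numeric
// types, actual byte lengths for strings and blobs. The covered cases are
// assumptions about what shows up in RowData.
func estimatedRowSize(row RowData) int {
	total := 0
	for _, v := range row {
		switch val := v.(type) {
		case nil:
			// NULL values contribute no data bytes.
		case bool:
			total += 1
		case int32, uint32, float32:
			total += 4
		case int, int64, uint64, float64:
			total += 8
		case string:
			total += len(val)
		case []byte:
			total += len(val)
		default:
			// Fall back to the reflect-based header size for anything else.
			total += int(reflect.TypeOf(v).Size())
		}
	}
	return total
}
```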
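And a sketch of option 4, reusing jsonSize from the first sketch; sampleEvery is a hypothetical tuning parameter:

```go
// estimatedBatchSize estimates the byte size of a whole batch by fully
// sizing only every sampleEvery-th row (option 4) and extrapolating,
// trading some accuracy for fewer json.Marshal calls.
func estimatedBatchSize(rows []RowData, sampleEvery int) int {
	if len(rows) == 0 || sampleEvery <= 0 {
		return 0
	}
	sampledBytes, sampledRows := 0, 0
	for i := 0; i < len(rows); i += sampleEvery {
		sampledBytes += jsonSize(rows[i])
		sampledRows++
	}
	// Scale the sampled total up to the full batch.
	return sampledBytes * len(rows) / sampledRows
}
```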

@fjordan @tiwilliam @shuhaowu

@shuhaowu
Contributor

shuhaowu commented Feb 5, 2021

Perhaps we can just roll with the json based estimation for now, and have a config flag to turn this behaviour off...

@fjordan
Contributor

fjordan commented Feb 22, 2021

> Perhaps we can just roll with the json based estimation for now, and have a config flag to turn this behaviour off...

I think that gets us 80% of the way there with 20% of the effort. Probably not worth going after the last 20%.
