Add estimated row batch size in bytes to state #230

Merged
merged 3 commits into from
Mar 26, 2021

@floriecai floriecai commented Jan 13, 2021

Related: #226

Similar PR: #228

Add an estimate of the number of bytes in each row batch and send it with the progress callback.

@@ -142,7 +141,6 @@ func (c *Cursor) Each(f func(*RowBatch) error) error {
tx.Rollback()

c.lastSuccessfulPaginationKey = paginationKeypos
c.rowsExamined += uint64(batch.Size())

oh yeah this wasn't used... :whistling:..

row_batch.go (resolved conversation)

s := batch.EstimateByteSize()

fmt.Printf("%d", s)

Maybe we should actually assert something?


Removed, since I'm already testing this in the callback.

func (e *RowBatch) EstimateByteSize() uint64 {
var total int
for _, v := range e.values {
size, err := json.Marshal(v)

I'm worried about the overhead of the json marshaling here. Have you run any benchmarks to see how much additional CPU this will take?

Also, instead of the json marshaling, have we considered unsafe.Sizeof() or reflect.Type.Size()? I'm not familiar with the risks of using the unsafe package, however. Something to consider.

@floriecai floriecai Jan 13, 2021


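unsafe.Sizeof (and reflect.Type.Size, which reports the same number) only give you the size of the value's fixed-size header, not of the data it points to. So regardless of string length, the size is the same (16 bytes for a string on a 64-bit platform), and a uint64 is always going to be 8. So it's not quite the same, since we want to know the byte length of the data itself.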

And with some benchmarking, the json.Marshal seems harmless 👇

BenchmarkSize-12         	1000000000	         0.000454 ns/op	       	       0 allocs/op     #2.8MB file
BenchmarkJSON-12         	1000000000	         0.00634 ns/op	       	       0 allocs/op     #2.8MB file
BenchmarkSizeSmall-12    	1000000000	         0.000077 ns/op	       	       0 allocs/op     #200KB file
BenchmarkJSONSmall-12    	1000000000	         0.000429 ns/op	       	       0 allocs/op     #200KB file
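A note on reading these numbers: ns/op values far below one nanosecond together with an iteration count of 1000000000 often mean that the benchmark body ran the work once rather than `b.N` times, so the real cost may be higher than it looks. Below is a hedged sketch of a benchmark that loops over `b.N`; the sample row is invented for the example and is not the 2.8MB or 200KB input used above. It uses `testing.Benchmark` so it can run as a plain program.

```go
package main

import (
	"encoding/json"
	"fmt"
	"testing"
)

// sampleRow returns a small, made-up row of column values.
func sampleRow() []interface{} {
	return []interface{}{int64(42), "hello world", []byte("payload"), nil}
}

// benchMarshal marshals every value of the row on each of the b.N
// iterations, so ns/op reflects the cost of one pass over the row.
func benchMarshal(b *testing.B) {
	row := sampleRow()
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		for _, v := range row {
			if _, err := json.Marshal(v); err != nil {
				b.Fatal(err)
			}
		}
	}
}

func main() {
	// testing.Benchmark picks b.N automatically, as `go test -bench` does.
	res := testing.Benchmark(benchMarshal)
	fmt.Println(res.String(), res.MemString())
}
```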


Yeah, you'd have to traverse into the pointer, and things get ugly quickly (although json.Marshal is technically doing that).


Thinking out loud here, but we could technically, with a minimal performance hit, wrap mysql.writeExecutePacket and take the len of the returned data after filtering for INSERTs.


Or no, we can't: it doesn't return the data, just an error.


Manan007224 commented Feb 5, 2021

There are some performance issues with estimating the row batch size in bytes, noted in #240. I have added a flag, which can be passed in the ghostferry config, to turn the row batch size estimation on or off.

cc @shuhaowu @tiwilliam

bytesWrittenForThisBatch = batch.EstimateByteSize()
}
w.StateTracker.UpdateLastSuccessfulPaginationKey(batch.TableSchema().String(), endPaginationKeypos,
RowStats{NumBytes: bytesWrittenForThisBatch, NumRows: uint64(batch.Size())})

This behaviour seems a bit sketchy. If someone has turned EnableRowBatchSize off, they'll see NumBytes as 0, which could create confusion. I don't think there's much we can do here other than documenting this behaviour.

floriecai and others added 2 commits March 25, 2021 09:20
changes in config

debug tests

fix tests

added go tests

modifications
@Manan007224 Manan007224 merged commit cfe5638 into master Mar 26, 2021
@shuhaowu shuhaowu deleted the estimate-row-batch-bytes branch April 28, 2021 15:55
5 participants