msgp: too few bytes left to read object #875

Dieterbe · 2018-03-16T11:08:55Z

2018/03/16 10:29:00 [dataprocessor.go:221 func1()] [E] DP getTargetsRemote: error unmarshaling body from mt-read00-12574-medium-ops-b-2445396050-vdcg2/getdata: "msgp: too few bytes left to r>
2018/03/16 10:29:00 [graphite.go:766 executePlan()] [E] HTTP Render msgp: too few bytes left to read object
[Macaron] 2018-03-16 10:29:00: Completed /render 500 Internal Server Error in 129.391075ms

shanson7 · 2018-03-22T15:54:41Z

I'm seeing these fairly frequently. Any idea what the issue is?

shanson7 · 2018-05-09T22:00:16Z

I deployed a “silent node” (carbon in, partition 9999) and added some debug statements.

It turns out the buffers are coming back as nil:
2018/05/09 21:05:26 [dataprocessor.go:223 func1()] [E] DEBUG len(buf)=0, is nil:true

It seems we are getting nil buffers back from the peers when the request gets canceled. After adding more logging, I see:
2018/05/09 21:30:24 [dataprocessor.go:216 func1()] [E] DP getTargetsRemote: error with POST to metrictank-read-046-1/getdata: "500 Internal Server Error"

Looking at metrictank-read-046-1 at that same time, I see:
2018/05/09 21:30:24 [cluster.go:191 getData()] [E] HTTP getData() start must be before end.

That comes from the Cassandra store. It is likely related to this logic: https://github.com/grafana/metrictank/blob/master/api/dataprocessor.go#L537

shanson7 · 2018-05-09T22:04:12Z

I think this is ccache corruption. For this particular repro request, it was always the same instance that was breaking things. I sent a ccache/delete request, and now the error is gone for this repro.

tehlers320 · 2018-06-19T15:40:35Z

This occurred for me on version 0.9.0 during a schema update and was not related to the ccache at all. Once the schemas were the same on all servers, it went away.

shanson7 · 2018-06-27T16:17:39Z

Sorry, to clarify:

The main issue of "msgp: too few bytes left to read object" is coming from here. This happens when the request to a peer is canceled because another peer has returned an error, so the buffer is nil and cannot be unmarshaled. The fix for this is probably to just check whether the request was canceled before unmarshaling.
This means there is a separate, underlying problem that causes that error to be returned in the first place. In my specific case, it is some ccache corruption.

Dieterbe added the customer-impacting label May 2, 2018

Dieterbe added this to the 0.9.1 milestone May 2, 2018

shanson7 mentioned this issue Jul 26, 2018

Chunk Cache Corruption causes request failures #967

Closed
