This repository has been archived by the owner on Mar 27, 2021. It is now read-only.

Heroic appears to mishandle EsRejectedExecutionException errors from Elasticsearch in batched queries #683

Open
rf opened this issue Aug 3, 2020 · 2 comments

Comments

@rf

rf commented Aug 3, 2020

I'm seeing an issue in production where an EsRejectedExecutionException from Elasticsearch causes Heroic to return a correct-looking response, but with empty results and empty errors. The responses look like this:

    "redactedkey:1": {
      "queryId": "21ee7a67-1023-495b-b2dd-55609d914f7a",
      "range": {
        "start": 1593316800000,
        "end": 1595736600000
      },
      "trace": {
        "what": {
          "name": "com.spotify.heroic.CoreQueryManager#query"
        },
        "elapsed": 537117,
        "children": [
          {
            "what": {
              "name": "com.spotify.heroic.CoreQueryManager#query_shard[{site=us-east1}]"
            },
            "elapsed": 537157,
            "children": [
              {
                "what": {
                  "name": "grpc://heroic-2.heroic.default.svc.cluster.local.:9698#query"
                },
                "elapsed": 537078,
                "children": [
                  {
                    "what": {
                      "name": "com.spotify.heroic.metric.LocalMetricManager#query"
                    },
                    "elapsed": 327049,
                    "children": []
                  }
                ]
              }
            ]
          }
        ]
      },
      "limits": [],
      "cached": false,
      "cache": null,
      "commonTags": {},
      "commonResource": {},
      "result": [],
      "preAggregationSampleSize": 0,
      "errors": []
    },

And in our heroic logs:

21:47:18.350 [elasticsearch[_client_][listener][T#3]] ERROR com.spotify.heroic.rpc.grpc.GrpcRpcProtocolServer - 29c3cfe3-ca6c-4c16-bc16-b755c8867848: Request failed
Failed to execute phase [fetch],
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297)
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
        at java.lang.Thread.run(Thread.java:748)
Caused by: EsRejectedExecutionException[rejected execution of org.elasticsearch.action.search.FetchSearchPhase$1@4606797a on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@433876a3[Running, pool size = 4, active threads = 4, queued tasks = 1000, completed tasks = 3259109]]]
        at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:50)
        at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
        at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.doExecute(EsThreadPoolExecutor.java:94)
        ... 48 more

Heroic should propagate this failure to the client in the response's errors array rather than silently returning an empty result set.
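The failure mode itself is easy to reproduce with a plain `ThreadPoolExecutor` configured like Elasticsearch's search pool (bounded queue, abort-on-overflow). The sketch below is hypothetical and not Heroic's actual code -- `runShardQueries` and the per-shard message list are illustrative names -- but it shows the shape of the fix: a caught `RejectedExecutionException` gets recorded for the client instead of being swallowed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class RejectionToError {
    // Submit one task per shard; collect a message for every rejected
    // submission instead of letting the failure vanish.
    static List<String> runShardQueries(ExecutorService pool, int shards, CountDownLatch gate) {
        List<String> errors = new ArrayList<>();
        for (int i = 0; i < shards; i++) {
            String shard = "shard-" + i;
            try {
                pool.execute(() -> {
                    try {
                        gate.await(); // hold the worker so the queue stays full
                    } catch (InterruptedException ignored) {
                        Thread.currentThread().interrupt();
                    }
                });
            } catch (RejectedExecutionException e) {
                // The branch that appears to be dropped today: record it per shard.
                errors.add(shard + ": " + e);
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        // One worker, queue capacity 1, abort on overflow -- the same policy
        // family as Elasticsearch's EsAbortPolicy on its search pool.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(1),
                new ThreadPoolExecutor.AbortPolicy());
        CountDownLatch gate = new CountDownLatch(1);

        // 4 shards but room for only 2 tasks (1 running + 1 queued):
        // the last 2 submissions are rejected and land in the error list.
        List<String> errors = runShardQueries(pool, 4, gate);
        System.out.println("errors reported: " + errors.size()); // prints 2
        gate.countDown();
        pool.shutdown();
    }
}
```

With this shape, a saturated Elasticsearch pool would surface as entries in the response's `errors` array instead of an empty-but-successful-looking result.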

@project-bot project-bot bot added this to Inbox in Observability Kanban Aug 3, 2020
@lmuhlha
Member

lmuhlha commented Aug 3, 2020

Thanks for reporting this! Just to confirm, are you running version 2.2.0 or an earlier one?

@memory
Contributor

memory commented Aug 4, 2020

We are running 0.10.5 at the moment.
