Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

如果连接时抛出 SslHandshakeTimeoutException 异常,那么 ApnsChannelPool pendingCreateChannelFutures 会一直被占有 #1060

Closed
learnmore8 opened this issue Apr 13, 2024 · 7 comments

Comments

@learnmore8
Copy link

learnmore8 commented Apr 13, 2024

我的 Java 程序通过 Nginx 做代理访问 Apns,代理层偶现会出现 DNS 解析超时,然而这个超时会导致 ApnsChannelPool pendingCreateChannelFutures 没有 remove channel,最终导致 pendingCreateChannelFutures 的数量等于 capacity,从而没有任何一个可用的 connection 发送数据。

如果 Pushy 收到 SslHandshakeTimeoutException 是否应该把这个已经关闭的 channel 从 pendingCreateChannelFutures 移除呢?


(edited by @jchambers to add, via Google Translate):

My Java program accesses Apns through Nginx as a proxy. DNS resolution timeout occasionally occurs in the proxy layer. However, this timeout will cause ApnsChannelPool pendingCreateChannelFutures to have no remove channel. Eventually, the number of pendingCreateChannelFutures will be equal to capacity, so there will be no available connection to send data.

If Pushy receives a SslHandshakeTimeoutException, should the closed channel be removed from pendingCreateChannelFutures?

@jchambers
Copy link
Owner

This does sound like a bug, and I will investigate the issue.


(via Google Translate):

这听起来确实像是一个错误,我会调查这个问题。

@jchambers jchambers added the bug label Apr 13, 2024
@jchambers
Copy link
Owner

@learnmore8 do you have a full stack trace from a SslHandshakeTimeoutException that you can share?

您是否有可以分享的“SslHandshakeTimeoutException”的完整堆栈跟踪?

@learnmore8
Copy link
Author

非常感谢您的回复和关注,以下是我的初始化代码和报错信息。

  1. 初始化设置了连接超时
ApnsClientBuilder apnsClientBuilder = new ApnsClientBuilder();
... 
apnsClientBuilder.setConcurrentConnections(8)
	.setEventLoopGroup(eventLoopGroup)
	.setConnectionTimeout(Duration.ofSeconds(10)) // 设置 10 秒超时
	.setIdlePingInterval(Duration.ofSeconds(5))
	.setFrameLogger(new Http2FrameLogger(LogLevel.DEBUG, "PUSHY"))
	.setMetricsListener(new MyApnsClientMetricsListener());
ApnsClient apnsClient = apnsClientBuilder.build();
  1. 请求消息推送
      PushNotificationFuture<SimpleApnsPushNotification, PushNotificationResponse<SimpleApnsPushNotification>>
	    sendNotificationFuture = apnsClient.sendNotification(pushNotification);
      try {
    		final PushNotificationResponse<SimpleApnsPushNotification> pushNotificationResponse = sendNotificationFuture.get(3, TimeUnit.SECONDS);

    		if (pushNotificationResponse.isAccepted()) {
    			log.info("push success. Push notification accepted by APNs gateway. token="+pushNotification.getToken());
    		} else {
    			log.error("push failed. Notification rejected by the APNs gateway: " + pushNotificationResponse.getRejectionReason() + ",token="+pushNotification.getToken());
    			pushNotificationResponse.getTokenInvalidationTimestamp().ifPresent(timestamp -> {
    				log.error("\t…and the token is invalid as of " + timestamp + ",token="+pushNotification.getToken());
    			});
    		}
    	} catch (final ExecutionException e) {
    		log.error("Failed to send push notification.",e);
    	} catch (InterruptedException | TimeoutException e) {
    		log.error("sendNotificationFuture.get() 中断或超时异常",e);
    	}
  1. 域名解析服务使用 59.51.78.211 , 解析超时的概率非常大

  2. 程序 3 秒超时报错堆栈

sendNotificationFuture.get() 中断或超时异常
java.util.concurrent.TimeoutException: null
	at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771)
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
  1. nginx 超时日志信息
api.push.apple.com could not be resolved (110: Operation timed out)
  1. 程序 10 秒超时的堆栈日志
io.netty.handler.ssl.SslHandshakeTimeoutException: handshake timed out after 10000ms
	at io.netty.handler.ssl.SslHandler$7.run(SslHandler.java:2125)
	at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
	at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153)
	at io.netty.util.concurrent.AbstractEventExecutor.runTask$$$capture(AbstractEventExecutor.java:174)
	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute$$$capture(AbstractEventExecutor.java:167)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:745)
  1. 然后我是这样规避这个问题的,可能不完全正确,请指正
    com.eatthepath.pushy.apns.ApnsClientHandler#userEventTriggered
    public void userEventTriggered(final ChannelHandlerContext context, final Object event) throws Exception {
        // 这里是调整的代码
        if(event instanceof SslHandshakeCompletionEvent){
            Throwable cause = ((SslHandshakeCompletionEvent) event).cause();
            if(cause != null && cause instanceof SslHandshakeTimeoutException){
                log.error("SslHandshakeTimeoutException,channel:{}会被关闭:{}",context.channel(),cause.getMessage());
                context.voidPromise().tryFailure(cause);
            }
        }
       ....
    }

@jchambers
Copy link
Owner

@learnmore8 Thank you for the additional details. This is helpful.

I'm confused, though. We assemble the pipeline here:

final ChannelPipeline pipeline = channel.pipeline();
clientConfiguration.getProxyHandlerFactory().ifPresent(proxyHandlerFactory ->
pipeline.addFirst(proxyHandlerFactory.createProxyHandler()));
pipeline.addLast(sslHandler);
pipeline.addLast(new FlushConsolidationHandler(FlushConsolidationHandler.DEFAULT_EXPLICIT_FLUSH_AFTER_FLUSHES, true));
pipeline.addLast(new IdleStateHandler(clientConfiguration.getIdlePingInterval().toMillis(), 0, 0, TimeUnit.MILLISECONDS));
pipeline.addLast(apnsClientHandler);

ApnsChannelHandler is the last handler in the pipeline, and it's already listening for exceptions. It will try to fail the "channel ready" promise if it catches any exception—including an SslHandshakeTimeoutException:

public void exceptionCaught(final ChannelHandlerContext context, final Throwable cause) {
// Always try to fail the "channel ready" promise if we catch an exception; in some cases, these may happen
// after a connection has already become ready, in which case the failure attempt will have no effect.
getChannelReadyPromise(context.channel()).tryFailure(cause);
}

…and regardless of how that promise resolves, ApnsChannelPool will remove the Promise from pendingCreateChannelFutures as soon as it's done:

createChannelFuture.addListener((GenericFutureListener<Future<Channel>>) future -> {
ApnsChannelPool.this.pendingCreateChannelFutures.remove(createChannelFuture);

So, considering all of that, it's not clear to me how we could be using all of the capacity in pendingCreateChannelFutures. Returning to your original question:

如果 Pushy 收到 SslHandshakeTimeoutException 是否应该把这个已经关闭的 channel 从 pendingCreateChannelFutures 移除呢
(translation: If Pushy receives a SslHandshakeTimeoutException, should the closed channel be removed from pendingCreateChannelFutures?)

…I think that should already be happening. Can you please say more about how you know that pendingCreateChannelFutures is filling up?

Also, what version of Pushy are you using?


感谢您提供更多详细信息。 这很有帮助。

不过我很困惑。 我们在这里组装管道:

final ChannelPipeline pipeline = channel.pipeline();
clientConfiguration.getProxyHandlerFactory().ifPresent(proxyHandlerFactory ->
pipeline.addFirst(proxyHandlerFactory.createProxyHandler()));
pipeline.addLast(sslHandler);
pipeline.addLast(new FlushConsolidationHandler(FlushConsolidationHandler.DEFAULT_EXPLICIT_FLUSH_AFTER_FLUSHES, true));
pipeline.addLast(new IdleStateHandler(clientConfiguration.getIdlePingInterval().toMillis(), 0, 0, TimeUnit.MILLISECONDS));
pipeline.addLast(apnsClientHandler);

ApnsChannelHandler 是管道中的最后一个处理程序,它已经在侦听异常。 如果捕获到任何异常(包括“SslHandshakeTimeoutException”),它将尝试使“通道就绪”承诺失败:

public void exceptionCaught(final ChannelHandlerContext context, final Throwable cause) {
// Always try to fail the "channel ready" promise if we catch an exception; in some cases, these may happen
// after a connection has already become ready, in which case the failure attempt will have no effect.
getChannelReadyPromise(context.channel()).tryFailure(cause);
}

...并且无论该承诺如何解决,“ApnsChannelPool”一旦完成都会从“pendingCreateChannelFutures”中删除“Promise”:

createChannelFuture.addListener((GenericFutureListener<Future<Channel>>) future -> {
ApnsChannelPool.this.pendingCreateChannelFutures.remove(createChannelFuture);

因此,考虑到所有这些,我不清楚我们如何使用“pendingCreateChannelFutures”中的所有容量。 回到你原来的问题:

如果 Pushy 收到 SslHandshakeTimeoutException 是否应该把这个已经关闭的 channel 从 pendingCreateChannelFutures 移除呢?

…我认为这应该已经发生了。 您能否详细说明一下您如何知道“pendingCreateChannelFutures”已填满?

另外,您使用的是哪个版本的 Pushy?

@jchambers
Copy link
Owner

Re-reading your example:

public void userEventTriggered(final ChannelHandlerContext context, final Object event) throws Exception {
    // 这里是调整的代码
    if(event instanceof SslHandshakeCompletionEvent){
        Throwable cause = ((SslHandshakeCompletionEvent) event).cause();
        if(cause != null && cause instanceof SslHandshakeTimeoutException){
            log.error("SslHandshakeTimeoutException,channel:{}会被关闭:{}",context.channel(),cause.getMessage());
            context.voidPromise().tryFailure(cause);
        }
    }
   ....
}

…I think your point may be that this specific handshake failure is being reported as a user event rather than thrown as an exception. That strikes me as strange, but I will review Netty's SslHandler code to see if I can better understand what's happening in this case.


重新阅读你的例子:

public void userEventTriggered(final ChannelHandlerContext context, final Object event) throws Exception {
    // 这里是调整的代码
    if(event instanceof SslHandshakeCompletionEvent){
        Throwable cause = ((SslHandshakeCompletionEvent) event).cause();
        if(cause != null && cause instanceof SslHandshakeTimeoutException){
            log.error("SslHandshakeTimeoutException,channel:{}会被关闭:{}",context.channel(),cause.getMessage());
            context.voidPromise().tryFailure(cause);
        }
    }
   ....
}

…我认为您的观点可能是,这种特定的握手失败被报告为用户事件,而不是作为异常抛出。 这让我觉得很奇怪,但我将回顾 Netty 的“SslHandler”代码,看看我是否可以更好地理解这种情况下发生的情况。

@learnmore8
Copy link
Author

触发 SslHandshakeTimeoutException 的代码如下 io.netty.handler.ssl.SslHandler#applyHandshakeTimeout

private void applyHandshakeTimeout() {
        final Promise<Channel> localHandshakePromise = this.handshakePromise;

        // Set timeout if necessary.
        final long handshakeTimeoutMillis = this.handshakeTimeoutMillis;
        if (handshakeTimeoutMillis <= 0 || localHandshakePromise.isDone()) {
            return;
        }

        final Future<?> timeoutFuture = ctx.executor().schedule(new Runnable() {
            @Override
            public void run() {
                if (localHandshakePromise.isDone()) {
                    return;
                }
                SSLException exception =
                        new SslHandshakeTimeoutException("handshake timed out after " + handshakeTimeoutMillis + "ms");
                try {
                    if (localHandshakePromise.tryFailure(exception)) {
                        SslUtils.handleHandshakeFailure(ctx, exception, true);
                    }
                } finally {
                    releaseAndFailAll(ctx, exception);
                }
            }
        }, handshakeTimeoutMillis, TimeUnit.MILLISECONDS);

        // Cancel the handshake timeout when handshake is finished.
        localHandshakePromise.addListener(new FutureListener<Channel>() {
            @Override
            public void operationComplete(Future<Channel> f) throws Exception {
                timeoutFuture.cancel(false);
            }
        });
    }

但上面这个异常并不会触发 com.eatthepath.pushy.apns.ApnsClientHandler#exceptionCaught

 public void exceptionCaught(final ChannelHandlerContext context, final Throwable cause) {
        // Always try to fail the "channel ready" promise if we catch an exception; in some cases, these may happen
        // after a connection has already become ready, in which case the failure attempt will have no effect.
        getChannelReadyPromise(context.channel()).tryFailure(cause);
    }

我使用的版本是:pushy 0.14.1 和 netty 4.1.92.Final

谢谢作者的回复,经过您的指导,我调整了修复方式,可以直接使用如下代码

getChannelReadyPromise(context.channel()).tryFailure(cause);

@jchambers
Copy link
Owner

@learnmore8 I apologize for the delay in getting back to this.

I've written a test to try to demonstrate the behavior you've described, and I'm still not able to reproduce the specific result where connection attempts get "stuck" in pendingCreateChannelFutures. Please find a test case in https://github.com/jchambers/pushy/compare/tls_handshake_timeout.

What I'm observing is:

  • You are correct that handshake timeouts do not reach com.eatthepath.pushy.apns.ApnsClientHandler#exceptionCaught
  • You are also correct that an appropriate event does reach userEventTriggered
  • …but even with both of those things being true, the channel still closes and is removed from pendingCreateChannelFutures

That left me confused at first, but then you pointed out that you're using 0.14.1. I believe the issue you're describing is a specific symptom of #853, which was fixed in #854. I believe this issue is a duplicate, and you can solve the problem by updating to the latest version of Pushy.

With that in mind, I'm going to close this issue for now. If updating to the latest version does not solve your problem, please leave a comment here and I'll happily reopen the issue.


@learnmore8 对于延迟回到这个话题我深表歉意。

我已经编写了一个测试来尝试演示您所描述的行为,但我仍然无法重现连接尝试在“pendingCreateChannelFutures”中“卡住”的具体结果。 请在 https://github.com/jchambers/pushy/compare/tls_handshake_timeout 中找到测试用例。

我观察到的是:

  • 你是正确的,握手超时确实_not_达到 com.eatthepath.pushy.apns.ApnsClientHandler#exceptionCaught
  • 您也正确地认为适当的事件确实到达了“userEventTriggered”
  • ...但即使这两件事都是真的,通道仍然关闭并从“pendingCreateChannelFutures”中删除

一开始这让我很困惑,但后来你指出你正在使用 0.14.1。 我相信您描述的问题是 #853 的特定症状,已在 #854 中修复。 我相信这个问题是重复的,您可以通过更新到最新版本的 Pushy 来解决该问题。

考虑到这一点,我现在将关闭这个问题。 如果更新到最新版本_不能_解决您的问题,请在此处发表评论,我很乐意重新打开该问题。

@jchambers jchambers closed this as not planned Won't fix, can't repro, duplicate, stale May 18, 2024
@jchambers jchambers added duplicate and removed bug labels May 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants