This repository has been archived by the owner on Jun 30, 2022. It is now read-only.

BeginDialogAsync takes longer to execute on first request #3266

Closed
GustavoMoreiraPT opened this issue Mar 10, 2020 · 15 comments
Labels: Bot Services, customer-replied-to, customer-reported

Comments

@GustavoMoreiraPT

Version

4.7.0 (Microsoft.Bot.Builder.Dialogs)

Describe the bug

Hi everyone,
I am developing a webapp based on bot framework using the .Net Core SDK.
I have a Direct Line channel available, and I noticed that the first request is always slower than the following ones.
The majority of the delay happens when calling the BeginDialogAsync method on the DialogContext class from the Microsoft.Bot.Builder.Dialogs package.
I already tried some warm-up actions to mitigate this problem by calling methods on singletons that are registered through dependency injection and are used within the BeginDialogAsync call.

Some of the warm-up tasks include:

  • Dummy call to LUIS Recognizer
  • Dummy Bot state call to initialize CosmosDbStorage
  • IBotTelemetryClient logging initialization
  • IBotFrameworkAdapter initialization by doing a dummy call to the ProcessActivityAsync method.

Even with these dummy calls, the first request takes 4 to 6 seconds on average, while the rest take 1 second or less.
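For context, here is a minimal sketch of this kind of startup warm-up (simplified and with hypothetical names, not the code from my solution; it assumes the storage and telemetry singletons are registered through DI, and the LUIS and adapter warm-ups follow the same pattern but need a throwaway activity/turn context):

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Bot.Builder;
using Microsoft.Extensions.Hosting;

// Hypothetical warm-up service: exercises the registered singletons once at
// startup so the first real request does not pay their initialization cost.
public class WarmUpHostedService : IHostedService
{
    private readonly IStorage _storage;                 // e.g. CosmosDbStorage
    private readonly IBotTelemetryClient _telemetryClient;

    public WarmUpHostedService(IStorage storage, IBotTelemetryClient telemetryClient)
    {
        _storage = storage;
        _telemetryClient = telemetryClient;
    }

    public async Task StartAsync(CancellationToken cancellationToken)
    {
        // Dummy state read: forces the Cosmos DB client and container to be created.
        await _storage.ReadAsync(new[] { "warmup" }, cancellationToken);

        // Dummy telemetry event: initializes the telemetry pipeline.
        _telemetryClient.TrackEvent("WarmUp");
    }

    public Task StopAsync(CancellationToken cancellationToken) => Task.CompletedTask;
}
```

This would be registered in Startup.ConfigureServices with services.AddHostedService<WarmUpHostedService>().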

To Reproduce

Steps to reproduce the behavior:

  1. From any bot using direct line, call DialogContext.BeginDialogAsync
  2. Wait until the bot replies.
  3. Measure the duration of the first request.
  4. Repeat the process for a subsequent request and compare the times.

Expected behavior

The expected outcome of this issue is to reduce the request duration when the bot starts up.


@dmvtech

dmvtech commented Mar 10, 2020

Hi @GustavoMoreiraPT

Is this just testing/debugging locally with the emulator, or from a deployed bot in Azure with ngrok?

Do additional users see the same behavior? (If testing with the emulator, start a secondary emulator and begin a new conversation as a new user.)

@GustavoMoreiraPT
Author

Hi @dmvtech ,

Thanks a lot for your reply.

This is happening when deployed in Azure without ngrok, and with ngrok when running locally.

It only happens to the first user.
A second user does not wait 5 seconds for the bot's reply.

Not sure if this is some behavior that we can control from our side.
Happy to help with whatever is needed.

@dmvtech

dmvtech commented Mar 11, 2020

Hi @GustavoMoreiraPT

From the behavior you described and moderate testing on my end, I think you are seeing the .NET Core application starting/bootstrapping and the Kestrel server starting up. Once the bot is running, it doesn't have to do any of that again.

To help mitigate this in the Azure environment, you can ensure that the App Service is set to Always On, so that the service doesn't shut down when not actively being used (and therefore does not have to restart).

Additionally, if you are doing continuous deployment, you could configure a step that messages the bot after deployment, to make sure the initial startup happens before any actual users get to it.
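As a rough sketch of that post-deployment ping (not an official sample; the DIRECT_LINE_SECRET environment variable and the single warm-up message are assumptions), a small console tool could start a Direct Line conversation and send one activity so the bot completes a full cold-start turn before users arrive:

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

class WarmUpPing
{
    static async Task Main()
    {
        var secret = Environment.GetEnvironmentVariable("DIRECT_LINE_SECRET");

        using var client = new HttpClient();
        client.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Bearer", secret);

        // Start a Direct Line 3.0 conversation.
        var start = await client.PostAsync(
            "https://directline.botframework.com/v3/directline/conversations",
            new StringContent(string.Empty));
        start.EnsureSuccessStatusCode();
        var conversationId = JsonDocument.Parse(await start.Content.ReadAsStringAsync())
            .RootElement.GetProperty("conversationId").GetString();

        // Send one dummy message so the bot processes a complete turn.
        var activity = "{\"type\":\"message\",\"from\":{\"id\":\"warmup-user\"},\"text\":\"warm-up\"}";
        var send = await client.PostAsync(
            $"https://directline.botframework.com/v3/directline/conversations/{conversationId}/activities",
            new StringContent(activity, Encoding.UTF8, "application/json"));
        send.EnsureSuccessStatusCode();
    }
}
```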

@GustavoMoreiraPT
Author

Hi @dmvtech,

Thanks again for your reply.
Just to be clear, we have a custom solution that is based on the Bot Framework but extends it.
We already tried to mitigate the issue by warming up the pipeline before the first request.

These warm-up tasks helped us improve the first request's response time by at least 50%. Before these actions, the first request would take 10 to 15 seconds to execute; we were able to reduce that to an average of roughly 5 seconds.

We also have "Always On" enabled on the App Service. Because we have Application Insights enabled for the App Service, we can track the requests that arrive, and we noticed that after a short period without any requests (5-10 seconds), the next request takes longer to execute than the ones that follow it.

Although 5 seconds is not the worst problem in the world, we would like to avoid this bottleneck in our solution, and so far we have made no new discoveries.

Considering the warm-up actions that I mentioned in the first post of this issue, can you see any additional tasks that might be worth acting on?

Thanks

@GustavoMoreiraPT
Author

Hi again @dmvtech ,

I have been doing a debugging session to figure out the reported issue.
I found something that is definitely taking longer on the first request.

When calling BeginDialogAsync, one of the internal calls goes through SkillWebSocketTransport.ForwardToSkillAsync:

[Screenshot: SkillWebSocketTransport.ForwardToSkillAsync]

Although I still haven't measured the execution time of the highlighted (red) part of the code in the screenshot, since this comes from a PDB and I am limited in how I can measure it, it is clear that the first call takes between 1.5 and 2.5 seconds on average, while the second call returns almost instantly.

Of course this might be due to some kind of caching happening inside the method, but it's really affecting our user experience.

Can we have a look at this? Please let me know how I can help.

In the meantime, I'll update this post once I have timings to share.

Best regards

@GustavoMoreiraPT
Author

Hi again,

Based on the previous messages, we kept looking into the issue. At application startup we now go through each of the dialogs we need and make a dummy call that initializes the entire pipeline, with an entry point at the SkillDialog.BeginDialogAsync method.
This way the token is already cached when we do our first real dialog activation. By applying this startup initialization, we were able to recover 1 to 2 additional seconds on the first request.

So this is not an issue in the Bot Framework itself; it is located in a package used by the Bot Framework.

In the version we are using, Microsoft.Bot.Builder.Solutions 4.7.0, one of the package's dependencies is Microsoft.IdentityModel.Clients.ActiveDirectory, Version=5.2.4.0, which is the assembly where the token request is executed.
There is an HTTP call in there that takes longer on the first request, and that is why we were seeing the additional delay.

Has anybody else faced this?
I am really curious, because I haven't seen it reported anywhere else.

Please let me know,

Gustavo

@EricDahlvang
Member

Hi @GustavoMoreiraPT

The initial token retrieval has been a known startup performance bottleneck for some time. In November 2017 we provided guidance similar to what you've described: https://blog.botframework.com/2017/11/02/optimizing-latency-bot-framework/. This has not been updated for the v4 SDK, but you can see the same pattern of retrieving a token on startup and periodically refreshing it in the background.
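For the v4 SDK, a minimal sketch of that pattern could look like the following (an illustration only, not code from the linked article; the class name, placeholder credentials, and 30-minute interval are assumptions):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Bot.Connector.Authentication;
using Microsoft.Extensions.Hosting;

// Hypothetical background service: fetches the bot's token once at startup and
// keeps the credentials cache warm so no user turn pays the acquisition cost.
public class TokenRefreshHostedService : BackgroundService
{
    private readonly MicrosoftAppCredentials _credentials =
        new MicrosoftAppCredentials("<app-id>", "<app-password>");

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            // GetTokenAsync caches the token internally; calling it here means
            // incoming requests never trigger the slow first fetch.
            await _credentials.GetTokenAsync();
            await Task.Delay(TimeSpan.FromMinutes(30), stoppingToken);
        }
    }
}
```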

@GustavoMoreiraPT
Author

Hi @EricDahlvang ,

Thanks for your reply.
We took a similar approach: by making a dummy call to DialogContext.BeginDialogAsync, the token is cached before the first request.
It helped, but it is still not ideal.

In the meantime, did you find any additional improvements that are not mentioned in the article?

Thanks.

@EricDahlvang EricDahlvang transferred this issue from microsoft/botbuilder-dotnet Apr 4, 2020
@EricDahlvang
Member

@GustavoMoreiraPT I've transferred this to the botframework-solutions repo, where this package lives: Microsoft.Bot.Builder.Solutions

@solutions team, do you have any other recommendations for GustavoMoreiraPT that might help improve startup time?

@dmvtech dmvtech added the Bot Services, customer-replied-to, and customer-reported labels Apr 4, 2020
@darrenj
Contributor

darrenj commented Apr 6, 2020

A good first step would be to migrate to the newly GA version of Skills, which removes the dependency on WebSockets.

We have some straightforward steps for VA changes here and for any custom skills here.

GA Skills take a different approach to authentication, and I'm not aware of similar performance issues there. If you have any issues updating your VA, please let us know; we're interested to see how this helps you.

Depending on the changes you've made to your VA, another approach would be to create a new VA and migrate your changes, but if you've made a lot of changes, the steps above will guide you.

@GustavoMoreiraPT
Author

Hi @darrenj ,

Thanks for the reply.

We've seen the new V0.8 version and are already tackling the changes for the migration, but it's a work in progress.

I'll keep you updated regarding this.

Thanks a lot once again.

@peterinnesmsft
Contributor

Hi @GustavoMoreiraPT,

Just following up to see if there is anything further that you currently need from us on this particular issue. Based on your last comment, it sounds like you have a path forward through the migration to v0.8.

If there's no active need for assistance, I will close the issue for now. If at any point you encounter further issues regarding this, feel free to reopen it and we can jump straight back in to assist.

Let me know if you have any objections or need help with anything!

@GustavoMoreiraPT
Author

Hi @peterinnesmsft ,

Thanks for your reply.
I was waiting to have metrics to share with you after the V0.8 migration, but unfortunately, due to time constraints, the migration was put on hold.

So currently we have created a fake TurnContext provider that runs at initialization time and forces the token to be fetched, and that really improved the first request's response time.
It works, but it's not perfect (the first request is still slower than the following ones); after all this, we managed to reduce the initial response time from 10-15 seconds to 2-3 seconds.
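For anyone landing here later, a minimal sketch of what such a fake-TurnContext warm-up could look like (the class name, callback, and activity values below are assumptions, not the actual code from our solution):

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Bot.Builder;
using Microsoft.Bot.Schema;

public static class FakeTurnContextWarmUp
{
    // Builds a throwaway activity and TurnContext, then runs one warm-up turn
    // through the supplied callback (which should begin the SkillDialog, or
    // whichever dialog triggers the token request) so the token gets cached
    // before real traffic arrives.
    public static async Task RunAsync(BotAdapter adapter, BotCallbackHandler warmUpTurn,
        CancellationToken cancellationToken = default)
    {
        var activity = new Activity
        {
            Type = ActivityTypes.Message,
            Text = "warm-up",
            ChannelId = "warmup",
            Conversation = new ConversationAccount(id: "warmup-conversation"),
            From = new ChannelAccount(id: "warmup-user"),
            Recipient = new ChannelAccount(id: "warmup-bot"),
        };

        var turnContext = new TurnContext(adapter, activity);
        await warmUpTurn(turnContext, cancellationToken);
    }
}
```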

I will reply to this comment once I have the project migrated to V0.8.
For now I think it can be closed, although the issue seems real.

Thanks for your help and let's speak soon! :)

@peterinnesmsft
Contributor

Hi @GustavoMoreiraPT,

No worries about the delay in the migration. I'm glad you had some success in cutting down the initial response time, but I agree that we can probably help find ways to lower it further.

I'll close the issue for now, but let's pick this up again when you are able to revisit and we can happily help dig in to see what kind of improvements we can make.

Look forward to hearing back from you soon! :)



5 participants