
[Obs AI Assistant] Connector documentation #181282

Closed
klacabane opened this issue Apr 22, 2024 · 17 comments
@klacabane (Contributor)

Summary

While the connector is in tech preview and has limited capabilities we should create public documentation

@klacabane klacabane added documentation Team:Obs AI Assistant Team:obs-knowledge Observability Experience Knowledge team labels Apr 22, 2024
@elasticmachine (Contributor)

Pinging @elastic/obs-knowledge-team (Team:obs-knowledge)

@emma-raffenne emma-raffenne added this to the 8.14 milestone Apr 23, 2024
@dedemorton (Contributor)

Adding this to the observability docs project because it sounds like someone on our team should work on these docs.

@klacabane We will need more information, including links to related issues/PRs and a list of contacts, to help us get started. Thanks!

@klacabane (Contributor, Author)

Hi @dedemorton,

As an overview, the connector can be attached to an alert and configured with a message that is passed to the AI Assistant. When an alert fires, the assistant is called with an initial prompt providing contextual information about the alert (e.g. when it fired, the service impacted, the threshold breached), plus the message the user provided when configuring the connector.
The user message can be thought of as a task, or set of tasks, for the assistant to execute at that point, for example: "I'm an SRE, create a report of the alert including other active alerts relevant to the impacted service." The assistant executes the provided tasks and creates a conversation out of them. Users can open that conversation and continue chatting with the assistant, e.g. to help them troubleshoot the issue.
Regarding the tasks that can be asked, the assistant is able to call available connectors (limited to Slack/email/webhook/Jira/PagerDuty), so one can also ask "Create a report of the alert and send it to the slack connector" if a Slack connector is already configured.
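The flow described above can be sketched roughly as follows. This is purely illustrative; the function and field names are hypothetical, not the connector's actual implementation:

```python
# Illustrative sketch of how the connector's initial prompt might be
# assembled from alert context plus the user-configured message.
# All names here are hypothetical, not Kibana's actual code.

def build_initial_prompt(alert: dict, user_message: str) -> str:
    context = (
        f"Alert '{alert['rule_name']}' fired at {alert['fired_at']} "
        f"for service '{alert['service']}'; "
        f"threshold breached: {alert['threshold']}."
    )
    # The user message reads as a task list for the assistant to execute.
    return f"{context}\n\n{user_message}"

prompt = build_initial_prompt(
    {
        "rule_name": "High disk usage",
        "fired_at": "2024-04-22T10:00:00Z",
        "service": "payments",
        "threshold": "system.filesystem.used.pct > 0.8",
    },
    "I'm an SRE, create a report of the alert including other "
    "active alerts relevant to the impacted service.",
)
```

The assistant then runs with this prompt and stores the resulting exchange as a conversation.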

Some technical details:

  • the connector is called when an alert fires and again when it recovers
  • users need the api:observabilityAIAssistant and app:observabilityAIAssistant privileges to use the connector
  • the conversation created by the assistant is public and accessible to every user with permissions to the assistant
  • the connector is in tech preview

Links:

You can reach out to me or @dgieselaar for more information!

@dedemorton (Contributor)

dedemorton commented Apr 25, 2024

cc'ing @lcawl for awareness. She is working on other system action feature docs and may want to contribute to these docs.

@emma-raffenne (Contributor)

@dedemorton Do you have an update on this documentation? Is there anything needed from us?

@dedemorton (Contributor)

> Is there anything needed from us?

@emma-raffenne Not right now, but I'll let you know. This issue came in too late for our docs sprint 20, but it's toward the top of my list for sprint 21, which starts today.

@dedemorton (Contributor)

Here's my preliminary plan for the documentation after playing around with the Obs AI Connector today:

  • In the Kibana Guide, create a new topic about the Observability AI Assistant connector and add it to the list of connectors.
  • In the Obs Guide, add the Observability AI Assistant connector to the list of valid connectors for all the rules documented under the container topic.
  • In the Obs Guide under Interact with the AI Assistant, add a section about using the Observability AI Assistant connector and explain why/when you might want to do that.
  • We should also list any limitations or requirements.

I ran into some flaky behavior when I was playing around with the connector. I received the Slack messages and links to the conversation, but the visualizations didn't work. Eventually the messages stopped arriving, but I was also editing/deleting rules and might have broken something. Perhaps I generated too many alerts and ended up exceeding the token limit. I kept track of some questions that came up while I was testing:

  • Are there limitations on how many alerts can be analyzed by the AI Assistant?
  • Is it normal for there to be a significant delay between the time the action executes and the message appears in Slack?
  • What happens if I edit the rule after I've started running it?
  • Is it good enough for the message to say “send it to slack connector” or should you give the name of the Slack connector in case there is more than one?
  • Is there any way to diagnose whether (and why) actions are failing (for example, sending messages to Slack failed)? Could I have exceeded the token limit and caused the Observability AI Assistant connector to fail to send a message to Slack? Is there any way to see where things failed?
  • How do I avoid exceeding the token limit? After playing around with the Observability AI Assistant connector, I tried using the “Help me understand this alert” option and got the message: “The conversation has exceeded the token limit. The maximum token limit is 32768, but the current conversation has 118700 tokens. Please start a new conversation to continue.”
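Since OpenAI-style models count tokens rather than characters, a quick way to sanity-check a prompt against the 32768-token limit is a characters-per-token heuristic (roughly 4 characters per token for English text). This is only an approximation; the exact count comes from the model's tokenizer. The 1334-token reservation for function definitions below is taken from the error messages quoted later in this thread:

```python
# Rough token budgeting for prompts sent to an OpenAI-style model.
# ~4 characters per token is a common heuristic for English text;
# treat it as an approximation only.
MAX_CONTEXT_TOKENS = 32768  # limit reported in the error message

def estimate_tokens(text: str) -> int:
    """Crude estimate: about one token per 4 characters."""
    return max(1, len(text) // 4)

def fits_in_context(messages, reserved_for_functions=1334):
    """Return True if the messages plus function definitions likely fit."""
    total = sum(estimate_tokens(m) for m in messages) + reserved_for_functions
    return total <= MAX_CONTEXT_TOKENS
```

A conversation reporting 118700 tokens against a 32768-token limit fails this check by a wide margin, which is why the only remedy offered is to start a new conversation.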

@emma-raffenne (Contributor)

Thank you @dedemorton
cc @jasonrhodes for awareness about the Alerting documentation.

@jasonrhodes (Member)

> cc @jasonrhodes for awareness about the Alerting documentation.

Thanks, @emma-raffenne - I've had a brief scan of this comment thread and I'm not seeing the reference to Alerting documentation. Can you point me to it?

@klacabane (Contributor, Author)

Hi @dedemorton!

> Are there limitations on how many alerts can be analyzed by the AI Assistant?

No, but the prompt we generate grows with the number of alerts passed to the connector, and processing several alerts in the same connector execution may lead to many function calls analyzing the alerts and hit the function-call limit. If that happens, we are not able to call the connector. This behavior should be surfaced in the generated conversations; any chance you still have them stored?
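The mitigation hinted at later in this thread (limiting the number of alerts summarized in the prompt) could look something like the sketch below. This is a hypothetical fix, not current connector behavior:

```python
# Hypothetical mitigation: cap how many alerts get individually
# summarized in the prompt, noting how many were left out so the
# assistant knows the list is truncated.
def summarize_alerts(alerts, max_summarized=10):
    shown = alerts[:max_summarized]
    lines = [f"- {a}" for a in shown]
    omitted = len(alerts) - len(shown)
    if omitted > 0:
        lines.append(
            f"... and {omitted} more alerts omitted to stay within the token limit"
        )
    return "\n".join(lines)
```

With a cap like this, the prompt size stays bounded regardless of how many alerts a rule generates.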

> Is it normal for there to be a significant delay between the time the action executes and the message appears in Slack?

What counts as significant, 5 minutes? It should take around 60 seconds if everything goes as expected, but multiple function calls and errors may lead to additional processing time or a failure. In any case, a conversation is created, and looking at that conversation is the best way to troubleshoot any underlying issues.

> What happens if I edit the rule after I've started running it?

I don't have the specifics of the rule's inner workings, but I expect any new alert to pick up the new settings/prompt. Did you run into unexpected behavior when doing so?

> Is it good enough for the message to say “send it to slack connector” or should you give the name of the Slack connector in case there is more than one?

The more accurate, the better. The assistant is given the list of connectors with their configurations (configured name, ID, and any other configured properties). Given a long list of, say, Slack connectors, one should ideally provide an identifier unique enough for the assistant to make a good decision; in this case the connector name would be appropriate.
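The disambiguation described here can be pictured as a name match over the configured connectors. This is purely illustrative: the assistant makes the choice via the LLM, not deterministic code, and the connector entries below are made up:

```python
# Illustrative: picking one connector from many by its configured name.
# With two Slack connectors, "send it to slack" is ambiguous, while a
# name fragment like "payments" identifies exactly one.
connectors = [
    {"id": "a1", "type": "slack", "name": "sre-alerts"},
    {"id": "b2", "type": "slack", "name": "payments-oncall"},
]

def find_connector(name_hint, connectors):
    matches = [c for c in connectors if name_hint.lower() in c["name"].lower()]
    # Only a unique match is safe; ambiguous or missing hints return None.
    return matches[0] if len(matches) == 1 else None
```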

> Is there any way to diagnose whether (and why) actions are failing (for example, sending messages to Slack failed)? Could I have exceeded the token limit and caused the Observability AI Assistant connector to fail to send a message to Slack? Is there any way to see where things failed?

Looking at the generated conversation is the best way to track any errors that happened during the connector execution. Each function call (e.g. calling the connector) appears in the conversation timeline and has debugging information attached to it.

> How do I avoid exceeding the token limit? After playing around with the Observability AI Assistant connector, I tried using the “Help me understand this alert” option and got the message: “The conversation has exceeded the token limit. The maximum token limit is 32768, but the current conversation has 118700 tokens. Please start a new conversation to continue.”

Could you provide details on your setup, how you triggered the alert, and what prompt was configured in the connector?

@dedemorton (Contributor)

> Could you provide details on your setup, how you triggered the alert, and what prompt was configured in the connector?

@klacabane Unfortunately my data got blown away when the cluster was updated. I will go through the process again after I've finished the docs and want to test them.

I triggered the alert by creating a custom threshold rule that I knew would fire. The rule looked for max(system.filesystem.used.pct) over 22. (I know, pretty low...but there were only a couple of hosts at the time that were over that threshold.) It generated quite a few alerts (I think about 40) when I was playing around with things. I played around with a few different prompts, but one of them was something like:

```
High disk usage alert has triggered. Execute the following steps:
  - create a graph of the disk usage for the service impacted by the alert for the last 24h
  - to help troubleshoot, recall past occurrences of this alarm and any other active alerts. Generate a report with all the found information and send it to the slack connector as a single message. Also include the link to this conversation in the report
```

I don't think I expanded all the function calls so I might have missed something.
I'll pay more attention when I go through my final testing and take better notes.

I think we should definitely add more guidance to help users construct rules and prompts that avoid running into limits, and also tell them what to do when they do run into them.

@dedemorton (Contributor)

dedemorton commented May 23, 2024

@klacabane I played around a bit with this today, and I am definitely exceeding limits. Maybe the rules I'm creating are too contrived (meant to generate alerts quickly, but perhaps generating too many alerts)? Today I tried using the Custom Threshold rule to test for max(system.filesystem.used.pct) > 80, and I am seeing messages like this in the Azure OpenAI GPT-4 connector logs:

action execution failure: .gen-ai:azure-open-ai: Azure OpenAI GPT-4 - an error occurred while running the action: Status code: 400. Message: API Error: model_error - This model's maximum context length is 32768 tokens. However, your messages resulted in 133449 tokens (132115 in the messages, 1334 in the functions). Please reduce the length of the messages or functions.; retry: true

The weird thing is that it worked beautifully the very first time I tried it out. :-/ Now that I want to take screen captures, nothing is working.

So I have a couple of asks:

  1. Can you suggest a different rule (type, threshold, and AI connector configuration/message) that would trigger a reasonable number of alerts so I don't keep exceeding limits? I've been using the edge-lite-oblt test cluster, but let me know if I should use a different environment.

  2. I think newbie users will probably play around with this and may end up in the same situation. What can we tell users to help them avoid this situation?

Thanks in advance for your help.

@klacabane (Contributor, Author)

klacabane commented May 23, 2024

Hi @dedemorton,

I'm not able to reproduce this issue at the moment and am still working on it.

> meant to generate alerts quickly, but perhaps generating too many alerts

The latter could be the culprit. We generate a summary and gather context for every alert passed to the connector. I suspect a high number of alerts is being passed in your case, producing a large prompt that reaches the token limit early in the conversation. If so, we should limit the number of alerts we summarize in the prompt, but I'll need confirmation that this is the root cause.

Since you're able to generate this error consistently, could you either send me the steps you're taking and/or provide a copy of the generated conversation that leads to the token limit being reached?

[Screenshot: 2024-05-23 at 11:12:20]

I'm also working against edge-lite-oblt and have no issues triggering the connector successfully. Could you try an Error count threshold rule instead of Custom threshold?

@emma-raffenne (Contributor)

@jasonrhodes

> I've had a brief scan of this comment thread and I'm not seeing the reference to Alerting documentation. Can you point me to it?

Here is the quote from Dede's comment:

> In the Obs Guide, add the Observability AI Assistant connector to the list of valid connectors for all the rules documented under the container topic.

@dedemorton (Contributor)

dedemorton commented May 23, 2024

I've created a rule that does not generate a lot of alerts, and I am seeing the same problem. This rule has created a single alert in the past 30 min. There are currently only 3 active alerts total, but there are a bunch of untracked alerts.

Here's the API call for the rule:

```
PUT kbn:/api/alerting/rule/03820ed4-fd57-487d-894a-39e7301524fa
{
  "name": "Log threshold rule",
  "tags": [],
  "schedule": {
    "interval": "1m"
  },
  "params": {
    "criteria": [
      {
        "comparator": ">",
        "metrics": [
          {
            "name": "A",
            "filter": "log.level: (\"error\")",
            "aggType": "count"
          }
        ],
        "threshold": [
          30
        ],
        "timeSize": 1,
        "timeUnit": "m"
      }
    ],
    "alertOnNoData": true,
    "alertOnGroupDisappear": true,
    "searchConfiguration": {
      "query": {
        "query": "",
        "language": "kuery"
      },
      "index": "logs-*"
    },
    "groupBy": ""
  },
  "actions": [
    {
      "id": "system-connector-.observability-ai-assistant",
      "params": {
        "connector": "e88c7248-da89-481e-af3b-566ed06728a1",
        "message": "High error count alert has triggered. Execute the following steps:\n  - create a graph of the error count for the service impacted by the alert for the last 24h\n  - to help troubleshoot recall past occurrences of this alarm, also any other active alerts. Generate a report with all the found informations and send it to slack connector as a single message. Also include the link to this conversation in the report\n"
      },
      "uuid": "91c485d5-52ea-41c4-92ef-09f1c4d37b4d"
    }
  ]
}
```
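For reference, the same PUT can be issued from a script. This is a minimal sketch using only the Python standard library; the Kibana URL is a placeholder, the payload is abbreviated, and authentication is omitted. Writes to Kibana's REST API do require the kbn-xsrf header:

```python
# Sketch: issuing the rule update above from Python. The base URL is a
# placeholder and the payload is abbreviated; in practice you would also
# supply authentication (an API key or basic-auth header).
import json
import urllib.request

KIBANA = "http://localhost:5601"  # placeholder
RULE_ID = "03820ed4-fd57-487d-894a-39e7301524fa"

payload = {
    "name": "Log threshold rule",
    "schedule": {"interval": "1m"},
    # ... "params" and "actions" as in the console request above ...
}

req = urllib.request.Request(
    f"{KIBANA}/api/alerting/rule/{RULE_ID}",
    data=json.dumps(payload).encode(),
    method="PUT",
    headers={"kbn-xsrf": "true", "Content-Type": "application/json"},
)
# urllib.request.urlopen(req)  # uncomment to actually send the request
```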

Here’s the message I am seeing under Stack Management > Connectors > Logs:

action execution failure: .gen-ai:e88c7248-da89-481e-af3b-566ed06728a1: My AI Connector - an error occurred while running the action: Status code: 400. Message: API Error: model_error - This model's maximum context length is 32768 tokens. However, your messages resulted in 126567 tokens (125233 in the messages, 1334 in the functions). Please reduce the length of the messages or functions.; retry: true

Also note that no conversation was created.

@dedemorton (Contributor)

OK, so I've tried a second round of testing using the latest 8.14.0 snapshot at staging.found.no (I wanted to create a very simple environment with limited data ingested using the System integration and Elastic Agent).

It works fine! I think the takeaway here is that we need to provide users with some guidance on how to avoid exceeding the token limit when they create their rules + messages for the AI Assistant connector...and also some steps to diagnose problems.

dedemorton added a commit that referenced this issue May 31, 2024
## Summary

Adds reference documentation about the Obs AI Assistant connector
(requested in #181282)

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
kibanamachine pushed a commit to kibanamachine/kibana that referenced this issue May 31, 2024

(cherry picked from commit 310f4ff)
kibanamachine added a commit that referenced this issue May 31, 2024
# Backport

This will backport the following commits from `main` to `8.14`:
- [DOCS] Obs AI Assistant connector (#183792)


Co-authored-by: DeDe Morton <dede.morton@elastic.co>
@dedemorton (Contributor)

Closed by #183792 and elastic/observability-docs#3906
