
Add support for threaded tests #598

Open · wants to merge 4 commits into main
Conversation


@SiXoS SiXoS commented Mar 29, 2024

What

This is a draft for adding support for threaded tests that can be used with e.g. OpenAI chats or assistants.

I created it as a very early draft to ask if this is a feature you want to have, and as a base for discussions on implementation details. I hope that you are interested in the suggestion.

Why

In my case, my assistant will be used almost exclusively through a sequence of questions and answers. This cannot be tested sufficiently with a single request, and I cannot initialize the provider with a list of messages, since the OpenAI Assistants API only allows the client to add messages with role: user.

How

I have a suggestion for the syntax, but feedback is welcome! There is an example in the PR. Basically, a test case can have a property called thread, which is a list of test cases that will be run sequentially on the same thread. This is not recursive, and a test case that has thread cannot have assert, and vice versa.
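The constraints above could be sketched as a small validation step. This is a hypothetical illustration, not code from the PR; the interface names and fields are assumptions based on the description:

```typescript
// Hypothetical sketch of the constraints described above: a test case may have
// either `thread` or `assert`, never both, and threads cannot nest.
interface TestCase {
  vars?: Record<string, unknown>;
  assert?: unknown[];
  thread?: TestCase[];
}

function validateTestCase(tc: TestCase, insideThread = false): void {
  if (tc.thread && tc.assert) {
    throw new Error('A test case cannot have both `thread` and `assert`');
  }
  if (tc.thread && insideThread) {
    throw new Error('`thread` cannot be nested inside another `thread`');
  }
  tc.thread?.forEach((sub) => validateTestCase(sub, true));
}
```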

The ApiProvider uses the CallApiContextParams.thread to keep track of which threadId to use. See the OpenAiAssistantProvider for an example.
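A provider's use of the thread context could look roughly like the following. This is a hypothetical sketch: `createThread` and `addMessageAndRun` are stand-in stubs for the real Assistants API calls, and the exact shapes of `ThreadContext` and `CallApiContextParams` in the PR may differ:

```typescript
// Illustrative shapes; the real interfaces in the PR may differ.
interface ThreadContext {
  threadId?: any; // an OpenAI thread id here; other providers may store more
}

interface CallApiContextParams {
  thread?: ThreadContext;
}

// Stubs standing in for real OpenAI Assistants API calls.
let threadCounter = 0;
async function createThread(): Promise<string> {
  return `thread_${threadCounter++}`;
}
async function addMessageAndRun(threadId: string, prompt: string): Promise<string> {
  return `[${threadId}] reply to: ${prompt}`;
}

async function callApi(prompt: string, context?: CallApiContextParams): Promise<string> {
  let threadId = context?.thread?.threadId as string | undefined;
  if (!threadId) {
    threadId = await createThread(); // first message in this thread
    if (context?.thread) {
      context.thread.threadId = threadId; // reused by subsequent sub-tests
    }
  }
  return addMessageAndRun(threadId, prompt);
}
```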

One concern is that this use case doesn't fit the structure of having one main prompt and then variables per case. You probably want a prompt per case.

What makes this a draft?

  1. I have to add a lot more tests.
  2. Concurrency has to be handled correctly; tests within a thread can't run concurrently.
  3. I have not yet inspected what the output looks like.
  4. Implementations have to be added for all relevant providers.

Let me know what you think about the feature suggestion and if you have any concerns with my suggested implementation.

@typpo (Collaborator) commented Mar 31, 2024

Thank you for putting this together! It's an awesome start.

I wonder if there is a way to generalize the solution so it's not just for OpenAI assistants, but for anyone who wants to test a chat-style conversation with one message after another. Definitely see a lot of people jumping through hoops to build chat conversations.

@SiXoS (Author) commented Apr 2, 2024

I wonder if there is a way to generalize the solution so it's not just for OpenAI assistants, but for anyone who wants to test a chat-style conversation with one message after another.

I think the current solution could already support other providers. The two approaches to conversations that I've seen are either a conversation id, or sending the whole conversation every time. The threadId property in ThreadContext is of type any, so for providers where you send the entire conversation, the conversation could simply be stored in threadId. Of course, the property should probably be renamed to thread.
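For a chat-completions-style provider with no server-side threads, carrying the history in that same field could look like this. The helper and types below are illustrative assumptions, not code from the PR:

```typescript
// Sketch under the assumption that `threadId` is `any`: a provider without
// server-side threads carries the full message history in the same field.
type Message = { role: 'user' | 'assistant'; content: string };

function recordTurn(
  thread: { threadId?: any },
  userMessage: string,
  reply: string,
): Message[] {
  const history: Message[] = Array.isArray(thread.threadId) ? thread.threadId : [];
  history.push({ role: 'user', content: userMessage });
  history.push({ role: 'assistant', content: reply });
  thread.threadId = history; // subsequent sub-tests see the full conversation
  return history;
}
```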

@SiXoS (Author) commented Apr 2, 2024

Another big issue that I realized is variable combinations. Let's say you have the following test case:

prompts: "{{ prompt }}"

tests:
  - thread:
      - vars:
          prompt: [ I want to write a poem, I want to write a novel ]
        assert:
          - type: icontains
            value: "What about?"
      - vars:
          prompt: Edgar Allan Poe
        assert:
          - type: icontains
            value: |
              Edgar Allen Poe
              Had a pretty big toe

In this test, both questions would be posted to the same thread which doesn't make much sense. The conversation would look something like:

  1. I want to write a poem
  2. What about?
  3. I want to write a novel
  4. Oh, I thought you wanted a poem
  5. Edgar Allan Poe
  6. Edgar Allen Poe Had a pretty big toe

These are the two suggestions I have; do you have anything else in mind?

Do not allow variable expansions in sub-tests

This is easy implementation-wise but could be unintuitive to users: it wouldn't be clear that variable expansion is being ignored. Might be OK, though? It should be pretty clear from the output.

Just leave it as is

Allow variables to be expanded as above. It's a bit weird but easy to avoid.

src/evaluator.ts Outdated
continue;
const mainTestCase = tests[index];
const allTestCases = mainTestCase.thread || [mainTestCase];
const startRow = rowIndex
@SiXoS (Author) commented May 14, 2024

If we're in a threaded context, we will not add tests column-by-column; the tests have to be added row-by-row. This means that if we have the following config:

providers: [gpt-3, gpt-4]

tests:
  - thread:
      - assertions: ...
      - assertions: ...
  - assertions: ...

We will fill the table in the following order:

+-----------+--------+
| Outputs   |        |
+-----------+--------+
| gpt-3     | gpt-4  |
+-----------+--------+
| 1         | 3      |
+-----------+--------+
| 2         | 4      |
+-----------+--------+
| 5         | 6      |
+-----------+--------+

@@ -828,14 +877,14 @@ class Evaluator {
// Run the evals
if (this.options.interactiveProviders) {
runEvalOptions = runEvalOptions.sort((a, b) =>
-  a.provider.id().localeCompare(b.provider.id()),
+  a[0].provider.id().localeCompare(b[0].provider.id()),
@SiXoS (Author)

All evals in a thread must have the same provider. It should therefore be safe to look at [0] here.

@SiXoS (Author) commented May 14, 2024

@typpo I have implemented a first working solution. I have addressed the earlier concerns of mine:

  1. I have added a few more tests.
  2. Threaded tests are run sequentially.
  3. The output tables look good as far as I'm concerned. It's not very clear which tests are threaded, but maybe that's fine?
  4. I still have to add implementations for all relevant providers, but I figured I could ask for feedback before adding the implementation everywhere.

This is how I solved variable combinations (for an example, see examples/openai-assistant-thread/promptfooconfig.yaml):

Variables that are specified at the top level or in default tests are expanded before the thread test cases are constructed. Variables that are specified at the sub-level are expanded for that specific sub-level only.
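The expansion of array-valued variables into separate combinations could be sketched as a cartesian product. This is a minimal illustration of the rule described above, assuming vars are strings or string arrays (illustrative types, not the PR's actual ones):

```typescript
// Minimal sketch: each array-valued var multiplies the set of combinations,
// so top-level arrays produce one separate thread per combination.
type Vars = Record<string, string | string[]>;

function expandVars(vars: Vars): Record<string, string>[] {
  let combos: Record<string, string>[] = [{}];
  for (const [key, value] of Object.entries(vars)) {
    const options = Array.isArray(value) ? value : [value];
    combos = combos.flatMap((combo) =>
      options.map((option) => ({ ...combo, [key]: option })),
    );
  }
  return combos;
}
```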

Let me know if you have any further feedback.

@SiXoS SiXoS marked this pull request as ready for review May 15, 2024 13:45
@SiXoS (Author) commented May 15, 2024

I added implementations for all the relevant providers.

@typpo (Collaborator) commented May 16, 2024

Thank you, @SiXoS! I missed this PR because it's further down the list. I'll give it a proper review and test. Very excited to see this!

@SiXoS (Author) commented May 17, 2024

Great! I made an additional small change that I think makes the code easier to understand; I hope it doesn't mess with your review process too much. I wasn't very happy with how the test-case loop was constructed: the order in which cells were handled was a bit weird, so I changed it to fill column-by-column. Using the previous example, this is how tables are now filled:

+-----------+--------+
| Outputs   |        |
+-----------+--------+
| gpt-3     | gpt-4  |
+-----------+--------+
| 1         | 4      |
+-----------+--------+
| 2         | 5      |
+-----------+--------+
| 3         | 6      |
+-----------+--------+
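The column-by-column fill order in the table above can be reproduced with a small helper. This is an illustrative sketch, not code from the PR: each provider's column is completed top to bottom before moving to the next provider.

```typescript
// Sketch of the column-by-column fill order: table[row][col] holds the
// position at which that cell is evaluated (1-based).
function fillOrder(numRows: number, numProviders: number): number[][] {
  const table: number[][] = Array.from({ length: numRows }, () =>
    new Array<number>(numProviders).fill(0),
  );
  let next = 1;
  for (let col = 0; col < numProviders; col++) {
    for (let row = 0; row < numRows; row++) {
      table[row][col] = next++;
    }
  }
  return table;
}
```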
