
Add support for threaded tests #598

Open · wants to merge 4 commits into main
Conversation


@SiXoS SiXoS commented Mar 29, 2024

What

This is a draft for adding support for threaded tests that can be used with e.g. OpenAI chats or assistants.

I created it as a very early draft to ask if this is a feature you want to have, and as a base for discussions on implementation details. I hope that you are interested in the suggestion.

Why

In my case, my assistant will be used almost exclusively through a sequence of questions and answers. This cannot be tested sufficiently with a single request, and I cannot initialize the provider with a list of messages, since the OpenAI Assistants API only allows the client to add messages with role: user.

How

I have a suggestion for the syntax, but feedback is welcome! There is an example in the PR. Basically, a test case can have a property called thread, which is a list of test cases that will be run sequentially on the same thread. This is not recursive, and a test case that has thread cannot have assert, and vice versa.
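The constraints above could be sketched as a small validation step. This is a hypothetical illustration, not code from the PR; the interface names and fields are assumptions based on the description:

```typescript
// Hypothetical sketch of the constraints described above: a test case may have
// either `thread` or `assert`, never both, and threads cannot nest.
interface TestCase {
  vars?: Record<string, unknown>;
  assert?: unknown[];
  thread?: TestCase[];
}

function validateTestCase(tc: TestCase, insideThread = false): void {
  if (tc.thread && tc.assert) {
    throw new Error('A test case cannot have both `thread` and `assert`');
  }
  if (tc.thread && insideThread) {
    throw new Error('`thread` cannot be nested inside another `thread`');
  }
  tc.thread?.forEach((sub) => validateTestCase(sub, true));
}
```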

The ApiProvider uses the CallApiContextParams.thread to keep track of which threadId to use. See the OpenAiAssistantProvider for an example.
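A provider's use of the thread context could look roughly like the following. This is a hypothetical sketch: `createThread` and `addMessageAndRun` are stand-in stubs for the real Assistants API calls, and the exact shapes of `ThreadContext` and `CallApiContextParams` in the PR may differ:

```typescript
// Illustrative shapes; the real interfaces in the PR may differ.
interface ThreadContext {
  threadId?: any; // an OpenAI thread id here; other providers may store more
}

interface CallApiContextParams {
  thread?: ThreadContext;
}

// Stubs standing in for real OpenAI Assistants API calls.
let threadCounter = 0;
async function createThread(): Promise<string> {
  return `thread_${threadCounter++}`;
}
async function addMessageAndRun(threadId: string, prompt: string): Promise<string> {
  return `[${threadId}] reply to: ${prompt}`;
}

async function callApi(prompt: string, context?: CallApiContextParams): Promise<string> {
  let threadId = context?.thread?.threadId as string | undefined;
  if (!threadId) {
    threadId = await createThread(); // first message in this thread
    if (context?.thread) {
      context.thread.threadId = threadId; // reused by subsequent sub-tests
    }
  }
  return addMessageAndRun(threadId, prompt);
}
```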

One concern is that this use case doesn't fit the structure of having one main prompt and then variables per case. You probably want a prompt per case.

What makes this a draft?

  1. I have to add a lot more tests.
  2. Concurrency has to be handled correctly; tests within a thread can't run concurrently.
  3. I have not yet inspected what the output looks like.
  4. Implementations have to be added for all relevant providers.

Let me know what you think about the feature suggestion and if you have any concerns with my suggested implementation.

@typpo (Collaborator) commented Mar 31, 2024

Thank you for putting this together! It's an awesome start.

I wonder if there is a way to generalize the solution so it's not just for OpenAI assistants, but for anyone who wants to test a chat-style conversation with one message after another. Definitely see a lot of people jumping through hoops to build chat conversations.

@SiXoS (Author) commented Apr 2, 2024

I wonder if there is a way to generalize the solution so it's not just for OpenAI assistants, but for anyone who wants to test a chat-style conversation with one message after another.

I think the current solution could already support other providers. The two approaches to conversations that I've seen are either a conversation id, or sending the whole conversation every time. The threadId property in ThreadContext is of type any, so for providers where you send the entire conversation, the conversation could simply be stored in threadId. Of course, the property should probably be renamed to thread.
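For a chat-completions-style provider with no server-side threads, carrying the history in that same field could look like this. The helper and types below are illustrative assumptions, not code from the PR:

```typescript
// Sketch under the assumption that `threadId` is `any`: a provider without
// server-side threads carries the full message history in the same field.
type Message = { role: 'user' | 'assistant'; content: string };

function recordTurn(
  thread: { threadId?: any },
  userMessage: string,
  reply: string,
): Message[] {
  const history: Message[] = Array.isArray(thread.threadId) ? thread.threadId : [];
  history.push({ role: 'user', content: userMessage });
  history.push({ role: 'assistant', content: reply });
  thread.threadId = history; // subsequent sub-tests see the full conversation
  return history;
}
```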

@SiXoS (Author) commented Apr 2, 2024

Another big issue that I realized is variable combinations. Let's say you have the following test case:

prompts: "{{ prompt }}"

tests:
  - thread:
      - vars:
          prompt: [ I want to write a poem, I want to write a novel ]
        assert:
          - type: icontains
            value: "What about?"
      - vars:
          prompt: Edgar Allan Poe
        assert:
          - type: icontains
            value: |
              Edgar Allen Poe
              Had a pretty big toe

In this test, both questions would be posted to the same thread which doesn't make much sense. The conversation would look something like:

  1. I want to write a poem
  2. What about?
  3. I want to write a novel
  4. Oh, I thought you wanted a poem
  5. Edgar Allan Poe
  6. Edgar Allen Poe Had a pretty big toe

These are the two suggestions I have; do you have anything else in mind?

Do not allow variable expansions in sub-tests

This is easy implementation-wise but could be unintuitive to users: it wouldn't be clear that variable expansion is being ignored. Might be OK, though? It should be pretty clear from the output.

Just leave it as is

Allow variables to be expanded as above. It's a bit weird but easy to avoid.

src/evaluator.ts Outdated
continue;
const mainTestCase = tests[index];
const allTestCases = mainTestCase.thread || [mainTestCase];
const startRow = rowIndex
@SiXoS (Author) commented May 14, 2024

If we're in a threaded context, we will not add tests column-by-column; the tests have to be added row-by-row. This means that if we have the following config:

providers: [gpt-3, gpt-4]

tests:
  - thread:
      - assertions: ...
      - assertions: ...
  - assertions: ...

We will fill the table in the following order:

+-----------+--------+
| Outputs   |        |
+-----------+--------+
| gpt-3     | gpt-4  |
+-----------+--------+
| 1         | 3      |
+-----------+--------+
| 2         | 4      |
+-----------+--------+
| 5         | 6      |
+-----------+--------+

@@ -828,14 +877,14 @@ class Evaluator {
// Run the evals
if (this.options.interactiveProviders) {
runEvalOptions = runEvalOptions.sort((a, b) =>
-  a.provider.id().localeCompare(b.provider.id()),
+  a[0].provider.id().localeCompare(b[0].provider.id()),
@SiXoS (Author)

All evals in a thread must have the same provider. It should therefore be safe to look at [0] here.

@SiXoS (Author) commented May 14, 2024

@typpo I have implemented a first working solution. I have addressed the earlier concerns of mine:

  1. I have added a few more tests.
  2. Threaded tests are run sequentially.
  3. The output tables look good as far as I'm concerned. It's not very clear which tests are threaded, but maybe that's fine?
  4. I still have to add implementations for all relevant providers, but I figured I could ask for feedback before adding the implementation everywhere.

This is how I solved variable combinations (for an example, see examples/openai-assistant-thread/promptfooconfig.yaml):

Variables that are specified at the top level or in default tests are expanded before the thread test cases are constructed. Variables that are specified at the sub-level are expanded for that specific sub-level only.
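The expansion of array-valued variables into separate combinations could be sketched as a cartesian product. This is a minimal illustration of the rule described above, assuming vars are strings or string arrays (illustrative types, not the PR's actual ones):

```typescript
// Minimal sketch: each array-valued var multiplies the set of combinations,
// so top-level arrays produce one separate thread per combination.
type Vars = Record<string, string | string[]>;

function expandVars(vars: Vars): Record<string, string>[] {
  let combos: Record<string, string>[] = [{}];
  for (const [key, value] of Object.entries(vars)) {
    const options = Array.isArray(value) ? value : [value];
    combos = combos.flatMap((combo) =>
      options.map((option) => ({ ...combo, [key]: option })),
    );
  }
  return combos;
}
```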

Let me know if you have any further feedback.

@SiXoS SiXoS marked this pull request as ready for review May 15, 2024 13:45
@SiXoS (Author) commented May 15, 2024

I added implementations for all the relevant providers.

@typpo (Collaborator) commented May 16, 2024

Thank you, @SiXoS! I missed this PR because it's further down the list. I'll give it a proper review and test. Very excited to see this!

@SiXoS (Author) commented May 17, 2024

Great! I made an additional small change that I think makes the code easier to understand; I hope it doesn't mess with your review process too much. I wasn't very happy with how the test-case loop was constructed: the order in which cells were handled was a bit weird, so I changed it to fill column-by-column. Using the previous example, this is how tables are now filled:

+-----------+--------+
| Outputs   |        |
+-----------+--------+
| gpt-3     | gpt-4  |
+-----------+--------+
| 1         | 4      |
+-----------+--------+
| 2         | 5      |
+-----------+--------+
| 3         | 6      |
+-----------+--------+
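The column-by-column fill order in the table above can be reproduced with a small helper. This is an illustrative sketch, not code from the PR: each provider's column is completed top to bottom before moving to the next provider.

```typescript
// Sketch of the column-by-column fill order: table[row][col] holds the
// position at which that cell is evaluated (1-based).
function fillOrder(numRows: number, numProviders: number): number[][] {
  const table: number[][] = Array.from({ length: numRows }, () =>
    new Array<number>(numProviders).fill(0),
  );
  let next = 1;
  for (let col = 0; col < numProviders; col++) {
    for (let row = 0; row < numRows; row++) {
      table[row][col] = next++;
    }
  }
  return table;
}
```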
