Add support for threaded tests #598
base: main
Conversation
Thank you for putting this together! It's an awesome start. I wonder if there is a way to generalize the solution so it's not just for OpenAI assistants, but for anyone who wants to test a chat-style conversation with one message after another. I definitely see a lot of people jumping through hoops to build chat conversations.
I think the current solution could already support other providers. The two solutions for conversations that I've seen are either a conversation id, or you send the whole conversation every time.
Another big issue that I realized is variable combinations. Let's say you have the following test case:
In this test, both questions would be posted to the same thread which doesn't make much sense. The conversation would look something like:
So, these are the two suggestions I have; do you have anything else in mind?

1. Do not allow variable expansions in sub-tests. This is easy implementation-wise but could be unintuitive to users: it wouldn't be clear that variable expansion is ignored. Might be ok, though? Should be pretty clear from the output.
2. Just leave it as is. Allow variables to be expanded as above. It's a bit weird but easy to avoid.
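To make the variable-combination problem concrete, here is a hypothetical config in the spirit of the missing example above; the prompt, the `topic` variable, and the assertion are invented here, not taken from the original test case:

```yaml
# Hypothetical illustration of the variable-combination issue.
# `topic` expands into two test cases, so both expanded questions
# would be posted into the same thread.
prompts:
  - 'Tell me about {{topic}}'
tests:
  - vars:
      topic: ['dogs', 'cats']
    thread:
      - assert:
          - type: contains
            value: '{{topic}}'
```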
src/evaluator.ts
```typescript
        continue;
      const mainTestCase = tests[index];
      const allTestCases = mainTestCase.thread || [mainTestCase];
      const startRow = rowIndex
```
If we're in a threaded context, we will not add tests column-by-column. The tests have to be added row-by-row. This means that if we have the following config:
```yaml
providers: [gpt-3, gpt-4]
tests:
  - thread:
      - assertions: ...
      - assertions: ...
  - assertions: ...
```
We will fill the table in the following order:
```
+-----------+--------+
|       Outputs      |
+-----------+--------+
|   gpt-3   | gpt-4  |
+-----------+--------+
|     1     |   3    |
+-----------+--------+
|     2     |   4    |
+-----------+--------+
|     5     |   6    |
+-----------+--------+
```
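The row-by-row fill order described above can be sketched as follows. This is an illustrative reimplementation, not the PR's actual code: the `Group`/`fillOrder` names and the simplified types are invented here. The idea is that within a thread, cells must be filled top-to-bottom per provider column, because each step of a thread depends on the previous step's output.

```typescript
// Hypothetical sketch of the thread-aware fill order (not the PR's code).
// A group is either a thread of `size` tests, or a plain test with size 1.
type Group = { size: number };

function fillOrder(groups: Group[], numProviders: number): number[][] {
  // result[row][col] = the order in which that cell is evaluated (1-based)
  const result: number[][] = [];
  let counter = 1;
  let rowOffset = 0;
  for (const group of groups) {
    // Allocate the rows this group occupies.
    for (let r = 0; r < group.size; r++) {
      result.push(new Array<number>(numProviders).fill(0));
    }
    // Within the group, fill each provider column top-to-bottom before
    // moving to the next column, so a thread's steps stay sequential.
    for (let col = 0; col < numProviders; col++) {
      for (let r = 0; r < group.size; r++) {
        result[rowOffset + r][col] = counter++;
      }
    }
    rowOffset += group.size;
  }
  return result;
}

// The config above: a thread of two tests plus one plain test, two providers.
console.log(fillOrder([{ size: 2 }, { size: 1 }], 2));
// → [[1, 3], [2, 4], [5, 6]], matching the table
```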
```diff
@@ -828,14 +877,14 @@ class Evaluator {
       // Run the evals
       if (this.options.interactiveProviders) {
         runEvalOptions = runEvalOptions.sort((a, b) =>
-          a.provider.id().localeCompare(b.provider.id()),
+          a[0].provider.id().localeCompare(b[0].provider.id()),
```
All evals in a thread must have the same provider. It should therefore be safe to look at `[0]` here.
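To illustrate that invariant, here is a standalone sketch (not the PR's code; the `sortByProvider` helper and the simplified types are invented here). After the change, each entry in `runEvalOptions` is an array of evals belonging to one thread, so comparing the first element's provider is enough to order whole threads:

```typescript
// Hypothetical sketch: sort thread groups by their (shared) provider id.
type Eval = { provider: { id(): string } };

function sortByProvider(runEvalOptions: Eval[][]): Eval[][] {
  // Every eval in a thread uses the same provider, so element [0]
  // is representative of the whole group.
  return [...runEvalOptions].sort((a, b) =>
    a[0].provider.id().localeCompare(b[0].provider.id()),
  );
}

// Tiny usage example with made-up provider ids.
const mk = (id: string): Eval => ({ provider: { id: () => id } });
const sorted = sortByProvider([[mk('openai'), mk('openai')], [mk('anthropic')]]);
console.log(sorted.map((thread) => thread[0].provider.id()));
// → ['anthropic', 'openai']
```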
@typpo I have implemented a first working solution. I have addressed my earlier concerns:
This is how I solved variable combinations: variables that are specified at the top level or in default tests are expanded before the thread test cases are constructed. Variables that are specified at the sub-level are expanded for that specific sub-level only. Let me know if you have any further feedback.
I added implementations for all the relevant providers.
Thank you, @SiXoS! I missed this PR because it's further down the list. I'll give it a proper review and test. Very excited to see this |
Great! I made an additional small change that I think makes the code easier to understand; I hope it doesn't mess with your review process too much. I wasn't very happy with how the test-case loop was constructed: the order in which cells were handled was a bit weird. So I changed it to fill column-by-column. Using the previous example, this is how tables are now filled:

```
+-----------+--------+
|       Outputs      |
+-----------+--------+
|   gpt-3   | gpt-4  |
+-----------+--------+
|     1     |   4    |
+-----------+--------+
|     2     |   5    |
+-----------+--------+
|     3     |   6    |
+-----------+--------+
```
What
This is a draft for adding support for threaded tests that can be used with e.g. OpenAI chats or assistants.
I created it as a very early draft to ask if this is a feature you want to have, and as a base for discussions on implementation details. I hope that you are interested in the suggestion.
Why
In my case, my assistant will be used almost exclusively with a sequence of questions and answers. This cannot be tested sufficiently with a single request. I cannot initialize the provider with a list of messages, since the OpenAI assistants API only allows the client to add messages with `role: user`.

How
I have a suggestion for the syntax, but feedback is expected and welcome! I have an example in the PR. Basically, a test case can have a property called `thread`, which is a list of test cases that will be run sequentially on the same thread. This is not recursive: a test case that has `thread` cannot have `assert`, and vice versa.

The `ApiProvider` uses the `CallApiContextParams.thread` to keep track of which threadId to use. See the `OpenAiAssistantProvider` for an example.

One concern is that this use case doesn't fit the structure of having one main prompt and then variables per case. You probably want a prompt per case.
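As a sketch of what the proposed `thread` property could look like in a config, here is a hypothetical example; the prompt, variable names, and assertion values are invented for illustration and are not taken from the PR:

```yaml
# Hypothetical usage of the proposed `thread` property (details invented).
# Both steps run sequentially against the same thread/conversation.
prompts:
  - '{{question}}'
tests:
  - thread:
      - vars:
          question: 'What is your return policy?'
        assert:
          - type: contains
            value: 'return'
      - vars:
          question: 'And how long does a refund take?'
        assert:
          - type: contains
            value: 'refund'
```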
What makes this a draft?
Let me know what you think about the feature suggestion and if you have any concerns with my suggested implementation.