
Is it possible to get actual outputs of multiple models that were compared? #31

Open
prathameshk opened this issue Oct 19, 2023 · 1 comment

Comments

@prathameshk

We are trying to compare our model against the LegalBench test data. Because the test data is large, we are randomly sampling it and comparing the results. Is it possible to get the prompts and outputs of the models at each test data point that was evaluated?
The model we are trying to fine-tune gives verbose output, which makes it difficult to compare against the ground truth. How was this handled, especially for smaller models like Llama 2 7B? We are using ChatGPT to evaluate our verbose output against the ground truth.

@neelguha
Collaborator

Thanks for your question!

> We are trying to compare our model against the LegalBench test data. Because the test data is large, we are randomly sampling it and comparing the results. Is it possible to get the prompts and outputs of the models at each test data point that was evaluated?

Prompts for each task are available in each task's folder. For instance, here is a text file containing the prompt for the abercrombie task. The text files are written with placeholders for specific column values. This notebook provides a more in-depth illustration of how this works. When we ran our evaluations, we used the same prompt for each sample in a task.
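As a minimal sketch of that workflow (this is not the official LegalBench utility code; the `{{column}}`-style placeholders, file names, and paths below are illustrative assumptions), filling a task's prompt template for each row of its test TSV might look like this:

```python
import pandas as pd

def fill_template(template: str, row: pd.Series) -> str:
    """Replace each {{column}} placeholder with that row's column value."""
    prompt = template
    for column, value in row.items():
        prompt = prompt.replace("{{" + column + "}}", str(value))
    return prompt

# Hypothetical paths: one prompt template per task folder plus its test split.
template = open("tasks/abercrombie/base_prompt.txt").read()
test_df = pd.read_csv("tasks/abercrombie/test.tsv", sep="\t")

# The same template is reused for every sample in the task.
prompts = [fill_template(template, row) for _, row in test_df.iterrows()]
print(prompts[0])
```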

> The model we are trying to fine-tune gives verbose output, which makes it difficult to compare against the ground truth. How was this handled, especially for smaller models like Llama 2 7B? We are using ChatGPT to evaluate our verbose output against the ground truth.

Most LegalBench tasks correspond to classification/extraction tasks. For these, we either (1) terminated generations after a single token, or (2) terminated at a new-line/stop token. We found that non-chat models were usually pretty good at providing concise outputs.
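For illustration only (this is not our exact evaluation code; the model name and token limits are assumptions), constraining generation length with a Hugging Face causal LM could look like this:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # assumed example; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "..."  # a filled-in task prompt, e.g. from the sketch above
inputs = tokenizer(prompt, return_tensors="pt")

# Option (1): stop after a single generated token (single-word label tasks).
out = model.generate(**inputs, max_new_tokens=1)

# Option (2): allow a few tokens, then truncate at the first newline.
out = model.generate(**inputs, max_new_tokens=16)
completion = tokenizer.decode(
    out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
answer = completion.split("\n")[0].strip()
print(answer)
```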

The one exception is the small handful of tasks that correspond to more open-ended generation (e.g., rule_qa). We manually evaluated those generations.

Is that helpful? Apologies if I'm misunderstanding.
