Benchmark on SWE-Bench #74

distbit0 · 2024-04-10T10:27:41Z

It would be interesting to measure performance on the SWE-Bench benchmark, so that this project can be more clearly differentiated from the growing number of other coding agents.

https://www.swebench.com/
https://github.com/princeton-nlp/SWE-bench
[ICLR 2024] SWE-Bench: Can Language Models Resolve Real-world Github Issues?
https://arxiv.org/abs/2310.06770
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, Karthik Narasimhan
Language models have outpaced our ability to evaluate them effectively, but for their future development it is essential to study the frontier of their capabilities. We consider real-world software engineering to be a rich, sustainable, and challenging testbed for evaluating the next generation of language models. We therefore introduce SWE-bench, an evaluation framework including 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories. Given a codebase along with a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue. Resolving issues in SWE-bench frequently requires understanding and coordinating changes across multiple functions, classes, and even files simultaneously, calling for models to interact with execution environments, process extremely long contexts and perform complex reasoning that goes far beyond traditional code generation. Our evaluations show that both state-of-the-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues. Claude 2 and GPT-4 solve a mere 4.8% and 1.7% of instances respectively, even when provided with an oracle retriever. Advances on SWE-bench represent steps towards LMs that are more practical, intelligent, and autonomous.
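For anyone who wants to try this, here is a minimal sketch of how the headline "% resolved" number could be computed from per-instance test outcomes. It assumes the usual SWE-bench criterion as I understand it: an instance counts as resolved only when every test that was expected to go from failing to passing now passes, and no previously passing test regresses. The `InstanceResult` class, the helper names, and the instance IDs below are hypothetical and are not part of the SWE-bench harness.

```python
from dataclasses import dataclass, field


@dataclass
class InstanceResult:
    """Hypothetical record of test outcomes for one benchmark instance."""
    instance_id: str
    # Tests that failed before the reference fix and should pass afterwards
    # (test name -> did it pass with the model's patch applied?).
    fail_to_pass: dict = field(default_factory=dict)
    # Tests that already passed and must not regress.
    pass_to_pass: dict = field(default_factory=dict)


def is_resolved(result: InstanceResult) -> bool:
    """An instance is resolved only if all target tests pass and nothing regresses."""
    if not result.fail_to_pass:
        # Assumption: with no target tests there is nothing to demonstrate a fix.
        return False
    return all(result.fail_to_pass.values()) and all(result.pass_to_pass.values())


def resolve_rate(results: list) -> float:
    """Percentage of instances resolved, in the range 0 to 100."""
    if not results:
        return 0.0
    resolved = sum(1 for r in results if is_resolved(r))
    return 100.0 * resolved / len(results)


# Made-up example data, purely for illustration.
example_results = [
    InstanceResult("repo-a__1", {"test_fix": True}, {"test_old": True}),
    InstanceResult("repo-a__2", {"test_fix": False}, {"test_old": True}),
    InstanceResult("repo-b__7", {"test_fix": True}, {"test_old": False}),
]

if __name__ == "__main__":
    print(f"{resolve_rate(example_results):.1f}% resolved")
```

Only the first example instance counts as resolved here. The second fails its target test and the third breaks an existing test, so a patch that fixes the issue but introduces a regression gets no credit.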

danenania · 2024-04-10T20:05:07Z

Agreed, it would be interesting to see the results if anyone wants to try.

That said, I'm guessing it might not do particularly well at this point since my focus so far has been much more on enabling a tight feedback loop, productive collaboration, and quick iteration between the developer and LLM vs. doing tasks end-to-end with the LLM autonomously in a single shot. But now that the former is working well, it makes sense to start shifting more toward the latter, so stay tuned on that :)

distbit0 changed the title ~~Run Benchmark on SWE-Bench~~ Benchmark on SWE-Bench Apr 10, 2024

danenania added the help wanted label Apr 25, 2024
