SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models

📄 Research Paper Abstract

Below is the abstract from this arXiv research paper. Mathematical notation has been simplified for readability.

Evaluating large language models (LLMs) for software engineering has been limited by narrow task coverage, language bias, and insufficient alignment with real-world developer workflows. Existing benchmarks often focus on algorithmic problems or Python-centric bug fixing, leaving critical dimensions of software engineering underexplored. To address these gaps, we introduce SWE-Compass, a comprehensive benchmark that unifies heterogeneous code-related evaluations into a structured and production-aligned framework. SWE-Compass spans 8 task types, 8 programming scenarios, and 10 programming languages, with 2000 high-quality instances curated from authentic GitHub pull requests and refined through systematic filtering and validation. We benchmark ten state-of-the-art LLMs under two agentic frameworks, SWE-Agent and Claude Code, revealing a clear hierarchy of difficulty across task types, languages, and scenarios. Moreover, by aligning evaluation with real-world developer practices, SWE-Compass provides a rigorous and reproducible foundation for diagnosing and advancing agentic coding capabilities in large language models.

