SWE-bench-Live Leaderboard

About

SWE-bench-Live is a live benchmark for issue resolving, designed to evaluate an AI system's ability to complete real-world software engineering tasks. Thanks to our automated dataset curation pipeline, we plan to update SWE-bench-Live on a monthly basis to provide the community with up-to-date task instances and support rigorous and contamination-free evaluation.

Note: If you believe your repository should not be included in our benchmark, please contact us to have it removed.

News

Feb 2026

Windows-Specific Tasks Released with Benchmarking Results

We released SWE-bench-Live/Windows to test agents on taking actions in Windows PowerShell and implementing Windows-specific code. In our experiments, none of SWE-agent, OpenHands, and ClaudeCode could run in Windows containers, so we implemented a minimal Windows-compatible agent, named Win-agent, with the same tool calls as SWE-agent and OpenHands, to benchmark LLMs on Windows tasks.

Dec 2025

Multi-language and OS update

We upgraded RepoLaunch Agent to support building repos in all mainstream languages (C, C++, C#, Python, Java, Go, JS/TS, Rust) on both Linux and Windows platforms. The MultiLang benchmark has been released on HuggingFace. On the leaderboard below, the Lite, Full, and Verified splits still cover Python tasks only.

Aug 2025

Dataset update (through Aug 2025)

We've finalized the update process for SWE-bench-Live: Each month, we will add 50 newly verified, high-quality issues to the dataset. The lite and verified splits will remain frozen, ensuring fair leaderboard comparisons and keeping evaluation costs manageable. To access the latest issues, please refer to the full split!
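The split policy above can be illustrated with a minimal sketch. This is not the project's actual tooling: the `Splits` class, the `apply_monthly_update` function, and the instance IDs are hypothetical names chosen for illustration. It only models the stated rule that new issues go into the full split while the lite and verified splits stay unchanged.

```python
from dataclasses import dataclass, field


@dataclass
class Splits:
    """Hypothetical in-memory model of the three splits (instance IDs only)."""
    full: list[str] = field(default_factory=list)
    lite: tuple[str, ...] = ()       # frozen: never modified by updates
    verified: tuple[str, ...] = ()   # frozen: never modified by updates


def apply_monthly_update(splits: Splits, new_instance_ids: list[str]) -> Splits:
    """Return new splits where only `full` grows; frozen splits are carried over as-is."""
    return Splits(
        full=splits.full + list(new_instance_ids),
        lite=splits.lite,
        verified=splits.verified,
    )


# Example with made-up IDs: a monthly batch of 50 new issues.
before = Splits(full=["repo__a-1", "repo__b-2"], lite=("repo__a-1",), verified=("repo__b-2",))
batch = [f"repo__new-{i}" for i in range(50)]
after = apply_monthly_update(before, batch)
```

Under this model, scores reported on the lite or verified split remain comparable across months, while the full split always contains the newest issues.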

Jun 2025

Dataset update

We've updated the dataset! Now it includes 1,565 task instances, covering 164 repositories.

Leaderboard

Rank	Method	Resolved ↓	Date ↕

Submit your results

We coordinate result submissions via pull requests; see SWE-bench-Live/submissions for instructions.

Correspondence

Please direct correspondence to SWE-bench-Live@microsoft.com.

GitHub Copilot Team, Microsoft US is actively hiring FTE/Interns
DKI Group, Microsoft Shanghai is actively hiring Interns

We welcome external part-time open-source collaborators to help us update our dataset tasks each month.

Acknowledgement

SWE-bench-Live is built upon the foundation of SWE-bench. We extend our gratitude to the original SWE-bench team for their pioneering work in software engineering evaluation benchmarks.

Citation

If you use SWE-bench-Live in your research, please cite:

@article{zhang2025swebenchgoeslive,
  title={SWE-bench Goes Live!},
  author={Linghao Zhang and Shilin He and Chaoyun Zhang and Yu Kang and Bowen Li and Chengxing Xie and Junhao Wang and Maoquan Wang and Yufan Huang and Shengyu Fu and Elsie Nallipogu and Qingwei Lin and Yingnong Dang and Saravan Rajmohan and Dongmei Zhang},
  journal={arXiv preprint arXiv:2505.23419},
  year={2025}
}