Building a Reproducible CI Pipeline to Benchmark AI-Generated Code Fixes – Ernst Haagsman

AI coding agents are becoming part of the development workflow, but evaluating their performance reliably is challenging. In this webinar, we’ll show you how to use TeamCity and the SWE-bench benchmark to build a reproducible pipeline that runs AI agents on real-world tasks from open-source repositories and evaluates their outcomes.

You’ll learn how to:

  • Set up an automated evaluation pipeline that runs AI agents on real GitHub issues and validates their fixes with tests.
  • Use Docker and TeamCity jobs to get isolated, reproducible environments and faster builds.
  • Track meaningful metrics such as task success rate, cost, and agent performance across versions.

By the end of this webinar, you’ll be able to set up systematic benchmarking and regression testing of AI coding agents, enabling reproducible, scalable evaluation across hundreds of real-world tasks.

About the speaker:

Ernst Haagsman is a Product Leader at JetBrains, where he currently leads the strategy for TeamCity and the integration of AI into CI/CD workflows. Throughout his tenure at JetBrains, he has held key leadership roles, including Head of Product for IDE Services, where he focused on scaling developer tools for large organizations. With a professional background spanning software development, product marketing, and community management, Ernst brings a holistic perspective to building tools that improve the developer experience.

**Join our Slack: https://datatalks.club/slack.html**

This event is sponsored by JetBrains.