Benchmarking AI agents against 800+ open-source tests

Learn how to evaluate AI agents using open-source benchmarks such as WebArena, AGBenchmark, and BOLAA, and set up systematic testing and regression tracking.

Overview

AI agents suck. But why? You can’t know what you can’t measure, and this is where benchmarking comes in. We show how to test AI agents against research-backed, open-source benchmarks, including WebArena, AGBenchmark, and BOLAA. We cover the most common benchmarks to look at, how to pick evaluation criteria, how to set up agent environments, and how to track regressions from one run to the next.
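To make the regression-tracking idea concrete, here is a minimal sketch in Python. It is not taken from WebArena, AGBenchmark, or BOLAA: the `RunResult` structure, the task IDs, and both helper functions are hypothetical, and the sketch assumes each benchmark task reduces to a single pass/fail outcome per run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """Outcome of one benchmark task in one run (hypothetical structure)."""
    task_id: str
    passed: bool


def pass_rate(results: list[RunResult]) -> float:
    """Fraction of tasks that passed; 0.0 for an empty run."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


def find_regressions(baseline: list[RunResult], candidate: list[RunResult]) -> list[str]:
    """Task IDs that passed in the baseline run but failed in the candidate run.

    Tasks missing from the candidate run are not counted as regressions.
    """
    candidate_by_id = {r.task_id: r.passed for r in candidate}
    return sorted(
        r.task_id
        for r in baseline
        if r.passed and candidate_by_id.get(r.task_id) is False
    )


# Illustrative data only: task names and outcomes are made up.
baseline = [RunResult("login", True), RunResult("search", True), RunResult("checkout", False)]
candidate = [RunResult("login", True), RunResult("search", False), RunResult("checkout", True)]

print(pass_rate(baseline), pass_rate(candidate))
print(find_regressions(baseline, candidate))
```

In this made-up data both runs have the same overall pass rate, yet one task that used to pass now fails. That is the reason to compare runs per task rather than by aggregate score alone.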

Tech stack