AI Agents: Testing and Benchmarking
Learn how to evaluate AI agents using open-source benchmarks such as WebArena, AGBenchmark, and BOLAA, and set up systematic testing and regression tracking.
AI agents suck. But why? You can't know what you can't measure, and this is where benchmarking comes in. We show how to test AI agents against research-based and open-source benchmarks including WebArena, AGBenchmark, and BOLAA. We cover the most common benchmarks to look at, how to pick evaluation criteria, how to set up agent environments, and how to track regression tests.
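As a rough illustration of what tracking regression tests can look like, here is a minimal Python sketch. It is not taken from the talk and does not use the API of WebArena, AGBenchmark, or BOLAA. The `TaskResult` schema, the task IDs, and the helper names are all hypothetical. It assumes each benchmark run can be reduced to a pass/fail outcome per task.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Outcome of one benchmark task in one agent run (hypothetical schema)."""
    task_id: str
    passed: bool


def success_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks the agent completed; 0.0 for an empty run."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


def find_regressions(
    baseline: list[TaskResult], candidate: list[TaskResult]
) -> list[str]:
    """IDs of tasks that passed in the baseline run but failed in the candidate.

    Tasks missing from the candidate run are not counted as regressions.
    """
    candidate_by_id = {r.task_id: r.passed for r in candidate}
    return sorted(
        r.task_id
        for r in baseline
        if r.passed and candidate_by_id.get(r.task_id) is False
    )


# Made-up results for two versions of the same agent.
baseline_run = [
    TaskResult("login", True),
    TaskResult("search", True),
    TaskResult("checkout", False),
]
candidate_run = [
    TaskResult("login", True),
    TaskResult("search", False),
    TaskResult("checkout", True),
]

regressions = find_regressions(baseline_run, candidate_run)
print(f"baseline success rate:  {success_rate(baseline_run):.2f}")
print(f"candidate success rate: {success_rate(candidate_run):.2f}")
print(f"regressed tasks: {regressions}")
```

In this made-up data both runs have the same aggregate success rate, yet one task regressed. That is why a per-task comparison is worth keeping alongside the headline score.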