CactusRoute by Max Health
Team led by two technical founders (TUM/Johns Hopkins) specializing in FHIR health-tech, AI agent swarms, LangGraph, and OpenAPI/FastMCP infrastructure for deep-tech startups.
Project Description
CactusRoute is a 7-layer adaptive hybrid router that intelligently routes function-calling queries between FunctionGemma (270M, on-device via Cactus Compute) and Gemini 2.5 Flash (cloud fallback). Unlike many leaderboard entries that bypass FunctionGemma entirely with pure regex matching to maximize speed scores, CactusRoute genuinely uses FunctionGemma as the primary inference engine, keeping the model at the center of every decision.
Our architecture implements a research-calibrated pipeline:
- Pre-flight difficulty estimation: Zero-cost heuristic classifies queries as easy/medium/hard based on tool count, parameter complexity, and multi-intent markers, setting per-difficulty adaptive confidence thresholds.
- Cactus handoff signal integration: Respects cloud_handoff and spike_handoff entropy signals from the Cactus SDK.
- Schema-driven output repair: instead of falling back to cloud on any model error, we repair FunctionGemma’s output locally (AM/PM formatting, type coercion, semantic mismatches).
- Multi-gate validation: structural checks (tool names, required params, types) plus semantic validation (extracted values must exactly match user text) plus intent coverage verification.
- Research-calibrated adaptive thresholds: informed by STEER (bimodal logit distributions), FrugalGPT (learned sufficiency), and U-HLM (speculative local-first inference saving 46% of cloud calls).
- Extraction cross-check with tool synonym relevance: deterministic regex-based extraction verifies and can override FunctionGemma’s tool selection when the model picks the wrong tool, acting as a safety net rather than a replacement.
- Deterministic extraction fallback: full schema-driven text parsing with segment splitting for multi-intent queries, used only when FunctionGemma’s output cannot be repaired.
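To make the first and fourth layers concrete, here is a minimal sketch of a pre-flight difficulty estimate and the structural and semantic validation gates. It is illustrative only: the function names, the multi-intent marker list, the numeric cutoffs, the case-insensitive substring match, and the tool format (a dict with "name" and a JSON-Schema-style "parameters" entry) are all assumptions, not CactusRoute's actual code.

```python
import re

# Hypothetical multi-intent markers; the real marker list is not given in the write-up.
MULTI_INTENT = re.compile(r"\b(and|then|also)\b", re.IGNORECASE)

# Mapping from JSON Schema type names to Python types.
JSON_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def estimate_difficulty(query: str, tools: list) -> str:
    """Classify a query as easy/medium/hard from cheap surface features.

    The cutoffs (more than 4 tools, more than 2 parameters) are assumed values.
    """
    max_params = max(
        (len(t.get("parameters", {}).get("properties", {})) for t in tools),
        default=0,
    )
    if MULTI_INTENT.search(query) or len(tools) > 4:
        return "hard"
    if len(tools) > 1 or max_params > 2:
        return "medium"
    return "easy"


def structural_errors(call: dict, tools: list) -> list:
    """Check tool name, required parameters, and parameter types."""
    by_name = {t["name"]: t for t in tools}
    tool = by_name.get(call.get("name"))
    if tool is None:
        return [f"unknown tool: {call.get('name')!r}"]
    schema = tool.get("parameters", {})
    properties = schema.get("properties", {})
    args = call.get("arguments", {})
    errors = [f"missing required param: {p}"
              for p in schema.get("required", []) if p not in args]
    for name, value in args.items():
        spec = properties.get(name)
        if spec is None:
            errors.append(f"unexpected param: {name}")
            continue
        expected = JSON_TYPES.get(spec.get("type"))
        # bool is a subclass of int in Python, so reject it for non-boolean fields.
        is_stray_bool = isinstance(value, bool) and spec.get("type") != "boolean"
        if expected and (not isinstance(value, expected) or is_stray_bool):
            errors.append(f"wrong type for {name}: expected {spec['type']}")
    return errors


def semantic_errors(call: dict, query: str) -> list:
    """Flag string arguments that do not appear in the user text."""
    text = query.lower()
    return [f"value not in query: {v!r}"
            for v in call.get("arguments", {}).values()
            if isinstance(v, str) and v.lower() not in text]
```

A call that passes both gates with no errors would be accepted as-is; any error would hand the output to the repair layer rather than straight to the cloud.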
Our design is grounded in a survey of 30+ arXiv papers on edge/cloud routing and confidence calibration. Three findings shaped the architecture:
- STEER (arXiv 2511.06190) showed that logit confidence in small models is bimodal (queries either clearly succeed or clearly fail), so dynamic per-difficulty thresholds outperform a single fixed cutoff. We implement three adaptive thresholds (easy=0.25, medium=0.45, hard=0.60) calibrated to this distribution.
- U-HLM (arXiv 2412.12687) demonstrated that speculative local-first execution with uncertainty-aware fallback retains 97.5% of cloud accuracy while saving 46% of cloud calls. CactusRoute always runs FunctionGemma first and only escalates after repair fails.
- FrugalGPT established that cascading models with learned sufficiency thresholds (>0.70 bypass) reduces cost without accuracy loss. Our multi-gate validation acts as that sufficiency check.
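The per-difficulty thresholds can be pictured as a simple lookup. The sketch below uses the three values quoted above; the function name, the inclusive comparison, and the idea that the model's confidence arrives as a single float in [0, 1] are assumptions made for illustration.

```python
# Threshold values as stated in the description (easy=0.25, medium=0.45, hard=0.60).
THRESHOLDS = {"easy": 0.25, "medium": 0.45, "hard": 0.60}


def accept_on_device(confidence: float, difficulty: str) -> bool:
    """Return True if the local result clears the bar for its difficulty tier.

    Assumes an inclusive comparison; the write-up does not say whether the
    boundary value itself passes.
    """
    return confidence >= THRESHOLDS[difficulty]


# The same confidence can pass for a medium query and fail for a hard one.
print(accept_on_device(0.50, "medium"))
print(accept_on_device(0.50, "hard"))
```

With a bimodal confidence distribution, most queries should land far from these cutoffs, which is why a coarse three-tier scheme can be enough.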
The key insight is repair before routing: most approaches either trust the model blindly or skip it entirely. CactusRoute instead wraps FunctionGemma in a multi-layer validation and repair framework that maximizes the model’s on-device success rate. The result is a 100% on-device ratio with 0.982 F1, while genuinely using FunctionGemma for inference on every query.
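The repair-before-routing control flow can be sketched as follows. This is a simplified illustration, not the project's implementation: run_local, run_cloud, and is_valid are hypothetical callables standing in for FunctionGemma, Gemini, and the validation gates, only type coercion is shown as a repair step, and the deterministic extraction fallback is omitted.

```python
from typing import Callable, Optional, Tuple


def coerce_types(args: dict, properties: dict) -> dict:
    """Coerce string values to the integer/number types the schema declares."""
    fixed = dict(args)
    for name, spec in properties.items():
        value = fixed.get(name)
        if not isinstance(value, str):
            continue
        try:
            if spec.get("type") == "integer":
                fixed[name] = int(value)
            elif spec.get("type") == "number":
                fixed[name] = float(value)
        except ValueError:
            pass  # leave unparseable values for validation to reject
    return fixed


def route(
    query: str,
    tool: dict,
    run_local: Callable[[str], Optional[dict]],
    run_cloud: Callable[[str], dict],
    is_valid: Callable[[dict], bool],
) -> Tuple[dict, str]:
    """Run the local model first, repair its output, and escalate only if repair fails."""
    call = run_local(query)
    if call is not None:
        repaired = {
            "name": call.get("name"),
            "arguments": coerce_types(
                call.get("arguments", {}), tool["parameters"]["properties"]
            ),
        }
        if is_valid(repaired):
            return repaired, "on-device"
    return run_cloud(query), "cloud"
```

The point of the ordering is that a cheap local fix (here, turning "7" into 7) is attempted before any cloud call is considered, so a recoverable formatting slip does not count as a model failure.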
Prior Work
https://quotentiroler.github.io/plagiarismchecker/