Boys From Bangalore
Team of Microsoft and former Amazon engineers with MS degrees from ASU and UT Dallas, specializing in RAG, multi-agent systems, FastAPI, and AWS/Azure deployment.
Project Description
Boys From Bangalore — Edge-First Hybrid Routing for FunctionGemma
We built a 4-layer hybrid routing algorithm that maximizes on-device tool-call correctness while intelligently escalating to Gemini Flash only when the edge model demonstrably cannot handle the query.
Architecture: The Cascade
Our generate_hybrid method in main.py implements a multi-signal decision pipeline:
Layer 0 — Deterministic Parser (0ms, 100% on-device): A zero-latency rule-based engine with multi-intent splitting, regex-based argument extraction, and pronoun resolution. Handles weather, alarms, timers, reminders, messages, contacts, and music via pattern matching. When the parser resolves ALL estimated intents, we skip the neural model entirely, achieving sub-millisecond routing.
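A minimal sketch of how such a Layer 0 parser could look, assuming a simple conjunction-based splitter and three illustrative regex rules. The tool names (get_weather, set_timer, set_alarm), the patterns, and the helper names are hypothetical and are not taken from main.py. Pronoun resolution is omitted for brevity.

```python
import re

# Hypothetical rules: (tool name, pattern with named groups for the arguments).
PATTERNS = [
    ("get_weather", re.compile(r"\bweather\b.*?\bin\s+(?P<location>[A-Za-z ]+?)\s*$", re.I)),
    ("set_timer", re.compile(r"\btimer\b.*?\bfor\s+(?P<minutes>\d+)\s*min", re.I)),
    ("set_alarm", re.compile(
        r"\balarm\b.*?\b(?:for|at)\s+(?P<time>\d{1,2}(?::\d{2})?\s*(?:am|pm)?)", re.I)),
]

# Assumed clause separators: ";", "and", "and then", ", then".
# This is naive: a phrase like "bread and butter" would also be split.
SPLIT_RE = re.compile(r"\s*;\s*|,?\s+and\s+(?:then\s+)?|,\s*then\s+", re.I)


def split_intents(query):
    """Split a query into clauses, one per estimated intent."""
    cleaned = query.strip().rstrip(".?!")
    return [part.strip() for part in SPLIT_RE.split(cleaned) if part.strip()]


def parse(query):
    """Return (tool calls found, number of estimated intents)."""
    clauses = split_intents(query)
    calls = []
    for clause in clauses:
        for name, pattern in PATTERNS:
            match = pattern.search(clause)
            if match:
                args = {
                    key: int(value) if value.isdigit() else value.strip()
                    for key, value in match.groupdict().items()
                }
                calls.append({"name": name, "arguments": args})
                break  # first matching rule wins for this clause
    return calls, len(clauses)


def fully_resolved(query):
    """True when every estimated intent produced a call, so the model can be skipped."""
    calls, estimated = parse(query)
    return estimated > 0 and len(calls) == estimated
```

With these assumed rules, "What's the weather in Paris and set a timer for 10 minutes." resolves both clauses, while "Play some jazz and set a timer for 5 minutes" resolves only one of two and would fall through to the next layer.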
Layer 1 — FunctionGemma via Cactus Compute (on-device): For queries the deterministic parser misses, we run Google DeepMind’s FunctionGemma-270M through Cactus’s native cactus_complete with Tool RAG, force_tools mode, and a 4-format response text recovery parser that handles FunctionGemma’s varied output formats (native call: syntax, standard function notation, JSON objects, and malformed JSON extraction).
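The recovery idea can be illustrated with a chain of parsers tried in order, from strictest to most lenient. The sketch below is an assumption about what the four formats might look like in practice: the exact text FunctionGemma emits, the handling of `<escape>` markers, and every function name here are hypothetical and are not the code from main.py. It does not call Cactus at all; it only post-processes a response string.

```python
import json
import re

CALL_RE = re.compile(r"call:(?P<name>\w+)\{(?P<body>[^{}]*)\}")
FUNC_RE = re.compile(r"\b(?P<name>[A-Za-z_]\w*)\((?P<body>[^()]*)\)")
NAME_RE = re.compile(r'"name"\s*:\s*"(?P<name>\w+)"')
PAIR_RE = re.compile(r'"(?P<key>\w+)"\s*:\s*(?:"(?P<str>[^"]*)"|(?P<num>-?\d+(?:\.\d+)?))')


def _coerce(value):
    """Strip assumed escape markers and quotes; turn plain integers into int."""
    value = value.replace("<escape>", "").strip().strip("\"'")
    return int(value) if re.fullmatch(r"-?\d+", value) else value


def _pairs(body, sep):
    """Parse 'k<sep>v, k<sep>v'. Naive: values containing commas are not supported."""
    args = {}
    for item in body.split(","):
        if sep in item:
            key, value = item.split(sep, 1)
            args[key.strip().strip("\"'")] = _coerce(value)
    return args


def parse_json(text):
    """Format 1: a well-formed JSON object with 'name' and 'arguments'."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return None
    if isinstance(obj, dict) and isinstance(obj.get("name"), str):
        args = obj.get("arguments", {})
        return {"name": obj["name"], "arguments": args if isinstance(args, dict) else {}}
    return None


def parse_native(text):
    """Format 2: assumed native form, e.g. call:get_weather{location:Paris}."""
    m = CALL_RE.search(text)
    return {"name": m["name"], "arguments": _pairs(m["body"], ":")} if m else None


def parse_function(text):
    """Format 3: function notation, e.g. set_timer(minutes=10)."""
    m = FUNC_RE.search(text)
    return {"name": m["name"], "arguments": _pairs(m["body"], "=")} if m else None


def parse_malformed(text):
    """Format 4: broken JSON; pull out the name and any scalar key/value pairs."""
    m = NAME_RE.search(text)
    if not m:
        return None
    args = {}
    for p in PAIR_RE.finditer(text[m.end():]):
        if p["str"] is not None:
            args[p["key"]] = p["str"]
        else:
            args[p["key"]] = float(p["num"]) if "." in p["num"] else int(p["num"])
    return {"name": m["name"], "arguments": args}


def recover(text):
    """Try each parser in order and return the first call recovered, else None."""
    for parser in (parse_json, parse_native, parse_function, parse_malformed):
        call = parser(text)
        if call is not None:
            return call
    return None
```

Trying strict JSON first keeps the lenient regex paths from misreading output that was already well formed.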
Layer 2 — Multi-Signal Validation: Instead of trusting raw confidence scores, we compute a weighted composite from 4 independent signals: calibrated confidence (piecewise-linear remapping), intent completeness ratio, argument type correctness, and decode timing heuristics. Hard gates reject immediately on invalid tool names, missing required parameters, or incomplete multi-intent results. Our data showed that partial multi-tool results produced on-device average an F1 of roughly 0.13, so this gate forces cloud escalation where it matters most.
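A sketch of how hard gates and a weighted composite could fit together. The tool schema, calibration knots, weights, threshold, and the direction of the timing heuristic (slower decoding treated as a weak sign of uncertainty) are all assumptions chosen for illustration; the write-up does not give the actual values.

```python
# Hypothetical tool schema: required parameter names and their expected types.
TOOLS = {
    "get_weather": {"required": {"location": str}},
    "set_timer": {"required": {"minutes": int}},
}

# Assumed weights and acceptance threshold.
WEIGHTS = {"confidence": 0.4, "completeness": 0.3, "types": 0.2, "timing": 0.1}
THRESHOLD = 0.6

# Assumed calibration knots (raw confidence, calibrated confidence).
KNOTS = ((0.0, 0.0), (0.5, 0.2), (0.8, 0.6), (1.0, 1.0))


def calibrate(conf, knots=KNOTS):
    """Piecewise-linear remapping of a raw confidence in [0, 1]."""
    conf = min(max(conf, 0.0), 1.0)
    for (x0, y0), (x1, y1) in zip(knots, knots[1:]):
        if conf <= x1:
            return y0 + (y1 - y0) * (conf - x0) / (x1 - x0)
    return knots[-1][1]


def timing_score(ms_per_token, fast=20.0, slow=100.0):
    """Assumed heuristic: 1.0 at or below `fast`, falling linearly to 0.0 at `slow`."""
    if ms_per_token <= fast:
        return 1.0
    return max(0.0, 1.0 - (ms_per_token - fast) / (slow - fast))


def validate(calls, raw_confidence, expected_intents, ms_per_token):
    """Return (accepted, score, reason). Hard gates run before any scoring."""
    if not calls:
        return False, 0.0, "no tool calls"
    for call in calls:
        spec = TOOLS.get(call["name"])
        if spec is None:
            return False, 0.0, "invalid tool name"
        if any(param not in call["arguments"] for param in spec["required"]):
            return False, 0.0, "missing required parameter"
    if expected_intents > 1 and len(calls) < expected_intents:
        return False, 0.0, "incomplete multi-intent result"

    # Soft signals, each in [0, 1].
    checked = typed_ok = 0
    for call in calls:
        for param, expected_type in TOOLS[call["name"]]["required"].items():
            checked += 1
            typed_ok += isinstance(call["arguments"][param], expected_type)
    signals = {
        "confidence": calibrate(raw_confidence),
        "completeness": min(len(calls) / max(expected_intents, 1), 1.0),
        "types": typed_ok / checked if checked else 1.0,
        "timing": timing_score(ms_per_token),
    }
    score = sum(WEIGHTS[name] * value for name, value in signals.items())
    return score >= THRESHOLD, score, "composite score"
```

Running the gates first means a structurally invalid result can never be rescued by a high confidence score, which matches the intent described for incomplete multi-intent results.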
Layer 3 — Gemini 2.5 Flash Cloud Fallback: When validation rejects the local result, we cascade to Gemini Flash with a system prompt engineered for multi-action completeness. This only fires for genuinely hard queries — keeping our edge/cloud ratio high while catching the cases FunctionGemma struggles with.
Layer 2b — Merge Strategy: Before going to cloud, we attempt merging partial deterministic and FunctionGemma results. If the union passes validation, we stay on-device — avoiding cloud latency for cases where each independently found different intents.
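The merge step can be sketched as a de-duplicated union followed by a validity check. The function names and the is_valid predicate are hypothetical stand-ins; the real validation would be the Layer 2 logic, which is not reproduced here.

```python
import json


def merge_calls(parser_calls, model_calls):
    """Union of both result lists, dropping exact duplicates (same name and arguments).

    Parser results come first, so they win the ordering when both sources agree.
    """
    merged, seen = [], set()
    for call in [*parser_calls, *model_calls]:
        key = (call["name"], json.dumps(call["arguments"], sort_keys=True))
        if key not in seen:
            seen.add(key)
            merged.append(call)
    return merged


def route_after_merge(parser_calls, model_calls, expected_intents, is_valid):
    """Stay on-device only if the union covers every intent and each call is valid."""
    merged = merge_calls(parser_calls, model_calls)
    if len(merged) >= expected_intents and all(is_valid(call) for call in merged):
        return "on-device", merged
    return "cloud", None
```

For example, if the parser recovers only a weather call and the model recovers only a timer call for a two-intent query, the union covers both intents and the request never leaves the device.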
Voice-to-Action Product
We built Cactus Voice, a real-time voice assistant backed by a FastAPI WebSocket server:
- On-device Whisper transcription via cactus_transcribe
- Full hybrid routing pipeline
- 7 real function executors (live weather API, YouTube playback, macOS notifications, timers, reminders, messaging, contacts)
- Concurrent execution with latency breakdown per pipeline stage
- Browser-based UI with push-to-talk
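Concurrent execution with a per-stage latency breakdown could be sketched with asyncio as below. The executors here are stubs that only sleep; the stage naming scheme and all function names are assumptions, and the real executors (weather API, notifications, and so on) are not shown.

```python
import asyncio
import time


async def timed(stage, coro, timings):
    """Await `coro` and record its wall-clock duration in milliseconds under `stage`."""
    start = time.perf_counter()
    try:
        return await coro
    finally:
        timings[stage] = (time.perf_counter() - start) * 1000.0


async def run_executors(calls, executors):
    """Run every tool call concurrently; return (results in call order, timings)."""
    timings = {}
    tasks = [
        timed(f"execute[{i}]:{call['name']}",
              executors[call["name"]](**call["arguments"]),
              timings)
        for i, call in enumerate(calls)
    ]
    # return_exceptions=True keeps one failing executor from cancelling the others.
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return results, timings


# Stub executors standing in for real side effects.
async def fake_weather(location):
    await asyncio.sleep(0.05)
    return f"Sunny in {location}"


async def fake_timer(minutes):
    await asyncio.sleep(0.05)
    return f"Timer set for {minutes} min"


EXECUTORS = {"get_weather": fake_weather, "set_timer": fake_timer}

if __name__ == "__main__":
    demo_calls = [
        {"name": "get_weather", "arguments": {"location": "Paris"}},
        {"name": "set_timer", "arguments": {"minutes": 10}},
    ]
    demo_results, demo_timings = asyncio.run(run_executors(demo_calls, EXECUTORS))
    print(demo_results)
    print({stage: round(ms, 1) for stage, ms in demo_timings.items()})
```

Because the calls are gathered rather than awaited one by one, total wall-clock time is close to the slowest executor instead of the sum, while the timings dictionary still reports each stage separately.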
Technologies Used
- Cactus Compute — On-device inference runtime for FunctionGemma-270M and Whisper
- Google DeepMind FunctionGemma-270M-IT — Edge tool-calling model
- Google Gemini 2.5 Flash — Cloud fallback via google-genai SDK
- FastAPI + WebSockets — Real-time voice server
- Python — Hybrid routing engine, deterministic parser, composite scoring
- ffmpeg — Browser audio to WAV conversion
- macOS osascript — Native notification execution