Hackathon Showcase

Boys From Bangalore

Team of Microsoft and former Amazon engineers with MS degrees from ASU and UT Dallas, specializing in RAG, multi-agent systems, FastAPI, and AWS/Azure deployment.

4 members

Project Description

Boys From Bangalore — Edge-First Hybrid Routing for FunctionGemma

We built a 4-layer hybrid routing algorithm that maximizes on-device tool-call correctness and escalates to Gemini Flash only when validation shows the edge model cannot handle the query.

Architecture: The Cascade

Our generate_hybrid method in main.py implements a multi-signal decision pipeline:

Layer 0: Deterministic Parser (sub-millisecond, 100% on-device). A rule-based engine with multi-intent splitting, regex-based argument extraction, and pronoun resolution. It handles weather, alarms, timers, reminders, messages, contacts, and music through pattern matching. When the parser resolves all estimated intents, we skip the neural model entirely, so routing completes in under a millisecond.
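
A minimal sketch of the Layer 0 idea follows. It assumes a small, hypothetical rule table (the tool names, patterns, `parse`, and `PERSON_ARGS` are illustrative, not the team's published rules). The sketch splits on "and"/"then", extracts arguments with named regex groups, and resolves "him"/"her"/"them" to the most recently extracted person. It returns the number of clauses so that the caller can check whether every estimated intent was resolved.

```python
import re

# Hypothetical rules: (tool name, pattern). Keywords are case-insensitive via the
# scoped flag (?i:...), while person/place names must be capitalized.
RULES = [
    ("get_weather", re.compile(r"\b(?i:weather) (?:in|for) (?P<location>[A-Z][a-zA-Z ]*[a-zA-Z])")),
    ("set_timer", re.compile(r"\b(?i:timer) for (?P<minutes>\d+) minutes?\b")),
    ("send_message", re.compile(r"\b(?i:text|message) (?P<recipient>[A-Z][a-z]+) saying (?P<message>.+)")),
    ("search_contacts", re.compile(r"\b(?i:find|look up) (?P<query>[A-Z][a-z]+)")),
]
SPLIT_RE = re.compile(r",?\s+(?:and|then)\s+")    # multi-intent splitting
PRONOUN_RE = re.compile(r"\b(?:him|her|them)\b")  # pronouns to resolve
PERSON_ARGS = ("recipient", "query")              # arguments that name a person


def parse(utterance):
    """Return (calls, n_clauses); each call is {"name": ..., "arguments": {...}}."""
    clauses = [c.strip() for c in SPLIT_RE.split(utterance) if c.strip()]
    calls, last_person = [], None
    for clause in clauses:
        # Pronoun resolution: substitute the most recently extracted person.
        if last_person:
            clause = PRONOUN_RE.sub(last_person, clause)
        for tool, pattern in RULES:
            m = pattern.search(clause)
            if m is None:
                continue
            # Regex-based argument extraction; numeric strings become ints.
            args = {k: int(v) if v.isdigit() else v
                    for k, v in m.groupdict().items() if v is not None}
            calls.append({"name": tool, "arguments": args})
            last_person = next((args[k] for k in PERSON_ARGS if k in args), last_person)
            break
    return calls, len(clauses)
```

When `len(calls) == n_clauses`, every clause matched a rule and the neural model can be skipped. This naive splitter also breaks message bodies that contain "and", so a real parser would need guards for that case.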

Layer 1: FunctionGemma via Cactus Compute (on-device). For queries the deterministic parser misses, we run Google DeepMind’s FunctionGemma-270M through Cactus’s native cactus_complete with Tool RAG and force_tools mode enabled. A response-recovery parser then handles the four output formats FunctionGemma produces: its native call: syntax, standard function-call notation, JSON objects, and malformed JSON (recovered by extraction).
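
The four-format recovery step can be sketched as a chain of parsers, each tried in turn. The exact native syntax is an assumption here: the sketch treats it as `call:NAME{key:value,...}`, which should be checked against the FunctionGemma model card. The helper names (`recover_call`, `parse_native`, and so on) are hypothetical.

```python
import ast
import json
import re

# ASSUMPTION: native output looks like call:NAME{key:value,...}.
NATIVE_RE = re.compile(r"call:(?P<name>\w+)\{(?P<body>[^{}]*)\}")
FUNC_RE = re.compile(r"[A-Za-z_]\w*\(.*\)", re.S)


def _normalize(obj):
    """Accept {"name": str, "arguments": dict}; reject anything else."""
    if not isinstance(obj, dict) or not isinstance(obj.get("name"), str):
        return None
    args = obj.get("arguments", {})
    return {"name": obj["name"], "arguments": args} if isinstance(args, dict) else None


def parse_native(text):
    m = NATIVE_RE.search(text)
    if m is None:
        return None
    args = {}
    for pair in m["body"].split(","):
        key, sep, value = pair.partition(":")
        if sep:
            value = value.strip().strip("'\"")
            args[key.strip()] = int(value) if value.isdigit() else value
    return {"name": m["name"], "arguments": args}


def parse_json(text):
    try:
        return _normalize(json.loads(text))
    except json.JSONDecodeError:
        return None


def parse_function_notation(text):
    """Parse name(key=value, ...) safely with ast instead of eval."""
    m = FUNC_RE.search(text)
    if m is None:
        return None
    try:
        node = ast.parse(m.group(0), mode="eval").body
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)):
            return None
        args = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    except (SyntaxError, ValueError, TypeError):
        return None
    return {"name": node.func.id, "arguments": args}


def parse_malformed_json(text):
    """Extract the outermost {...}, fix single quotes and trailing commas, retry."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    candidate = text[start:end + 1].replace("'", '"')
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)
    return parse_json(candidate)


PARSERS = (parse_native, parse_json, parse_function_notation, parse_malformed_json)


def recover_call(text):
    """Return the first successfully parsed call, or None if every format fails."""
    for parser in PARSERS:
        call = parser(text.strip())
        if call is not None:
            return call
    return None
```

Strict JSON is tried before the lenient extractor so that well-formed output is never rewritten. The quote swap in `parse_malformed_json` will corrupt values containing apostrophes, which is an accepted trade-off for a last-resort parser.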

Layer 2: Multi-Signal Validation. Instead of trusting raw confidence scores, we compute a weighted composite of four independent signals: calibrated confidence (a piecewise-linear remapping of the raw score), intent completeness ratio, argument type correctness, and decode-timing heuristics. Hard gates reject a result immediately if it names an invalid tool, omits a required parameter, or covers only some of the intents in a multi-intent query. Our data showed that partial multi-tool results from the on-device model averaged an F1 of about 0.13, so this gate forces cloud escalation where it matters most.
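
A sketch of hard gates followed by a weighted composite appears below. Every number is a placeholder: the schemas, weights, calibration breakpoints, timing curve, and acceptance threshold are hypothetical, not the team's tuned values. The sketch also assumes two further details. First, the completeness ratio penalizes extra calls as well as missing ones. Second, slow decodes are read as a sign that the model struggled.

```python
# ILLUSTRATIVE values only; none of these numbers come from the project.
SCHEMAS = {"get_weather": {"location": str}, "set_timer": {"minutes": int}}
WEIGHTS = {"confidence": 0.4, "completeness": 0.3, "arg_types": 0.2, "timing": 0.1}
CONF_CURVE = [(0.0, 0.0), (0.5, 0.2), (0.8, 0.6), (1.0, 1.0)]  # raw -> calibrated
TIMING_CURVE = [(0, 1.0), (300, 1.0), (1500, 0.0)]             # decode ms -> score
ACCEPT_THRESHOLD = 0.6


def piecewise_linear(x, points):
    """Interpolate x through sorted (x, y) breakpoints, clamping outside the range."""
    if x <= points[0][0]:
        return points[0][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return points[-1][1]


def validate(calls, n_intents, confidence, decode_ms):
    """Return (accept, score); any hard-gate failure returns (False, 0.0)."""
    # Hard gates: no calls, partial multi-intent result, unknown tool, missing parameter.
    if not calls or (n_intents > 1 and len(calls) < n_intents):
        return False, 0.0
    for call in calls:
        schema = SCHEMAS.get(call["name"])
        if schema is None or not schema.keys() <= call["arguments"].keys():
            return False, 0.0
    # Argument type correctness: fraction of parameters with the expected type.
    checks = [isinstance(call["arguments"][p], t)
              for call in calls for p, t in SCHEMAS[call["name"]].items()]
    signals = {
        "confidence": piecewise_linear(confidence, CONF_CURVE),
        "completeness": min(len(calls), n_intents) / max(len(calls), n_intents),
        "arg_types": sum(checks) / len(checks) if checks else 1.0,
        "timing": piecewise_linear(decode_ms, TIMING_CURVE),
    }
    score = sum(WEIGHTS[name] * value for name, value in signals.items())
    return score >= ACCEPT_THRESHOLD, score
```

For example, a single well-typed `set_timer` call with raw confidence 0.9 and a 200 ms decode scores 0.4 × 0.8 + 0.3 + 0.2 + 0.1 = 0.92 and is accepted. The same call with a string-typed `minutes` and raw confidence 0.5 scores 0.48 and is escalated.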

Layer 2b: Merge Strategy. Before escalating to the cloud, we try merging the partial results from the deterministic parser and FunctionGemma. If the union passes validation, the query stays on-device. This avoids cloud latency in cases where each layer independently found a different subset of the intents.

Layer 3: Gemini 2.5 Flash Cloud Fallback. When validation rejects both the local result and the merged result, we cascade to Gemini 2.5 Flash with a system prompt engineered for multi-action completeness. The fallback fires only for genuinely hard queries. This keeps the edge/cloud ratio high while still catching the cases FunctionGemma struggles with.
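
The control flow of the full cascade, including the 2b merge, can be sketched with the four stages passed in as callables. The names `hybrid_route` and `merge_calls` are hypothetical, and the real `generate_hybrid` in main.py may be structured differently. Injecting the stages keeps the routing logic testable without a model or an API key.

```python
import json


def merge_calls(*results):
    """Union partial results, dropping exact duplicates (same name and arguments)."""
    merged, seen = [], set()
    for calls in results:
        for call in calls:
            key = (call["name"], json.dumps(call["arguments"], sort_keys=True))
            if key not in seen:
                seen.add(key)
                merged.append(call)
    return merged


def hybrid_route(query, parse, run_local, is_valid, run_cloud):
    """Return (calls, source). The callables are hypothetical stand-ins for the
    deterministic parser, FunctionGemma, the validator, and the Gemini fallback."""
    det_calls, n_intents = parse(query)                  # Layer 0
    if det_calls and len(det_calls) >= n_intents:
        return det_calls, "parser"
    local_calls = run_local(query)                       # Layer 1
    if is_valid(local_calls, n_intents):                 # Layer 2
        return local_calls, "functiongemma"
    merged = merge_calls(det_calls, local_calls)         # Layer 2b
    if is_valid(merged, n_intents):
        return merged, "merged"
    return run_cloud(query), "cloud"                     # Layer 3
```

Serializing arguments with `sort_keys=True` makes the duplicate check independent of key order and works for nested argument values.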

Voice-to-Action Product

We built Cactus Voice, a real-time voice assistant backed by a FastAPI WebSocket server:

On-device Whisper transcription via cactus_transcribe
Full hybrid routing pipeline
7 real function executors (live weather API, YouTube playback, macOS notifications, timers, reminders, messaging, contacts)
Concurrent execution with a per-stage latency breakdown
Browser-based UI with push-to-talk
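
Concurrent execution with per-stage timing can be sketched with `asyncio.gather` and a small timing context manager. The executors below are hypothetical stubs that sleep in place of real I/O. The `transcribe` and `route` callables stand in for cactus_transcribe and the hybrid router. Their real signatures are not shown in this write-up.

```python
import asyncio
import time
from contextlib import contextmanager


@contextmanager
def stage(timings, name):
    """Record how long the enclosed pipeline stage took, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = (time.perf_counter() - start) * 1000


# Hypothetical executors; sleeps stand in for real I/O (weather API, osascript, ...).
async def get_weather(location):
    await asyncio.sleep(0.1)
    return f"weather for {location}"


async def set_timer(minutes):
    await asyncio.sleep(0.1)
    return f"timer set for {minutes} min"


EXECUTORS = {"get_weather": get_weather, "set_timer": set_timer}


async def handle_utterance(audio, transcribe, route):
    """Transcribe, route, then run all tool calls concurrently."""
    timings = {}
    with stage(timings, "transcribe"):
        text = transcribe(audio)
    with stage(timings, "route"):
        calls = route(text)
    with stage(timings, "execute"):
        results = await asyncio.gather(
            *(EXECUTORS[c["name"]](**c["arguments"]) for c in calls))
    return results, timings
```

Because the calls run concurrently, the "execute" stage takes roughly as long as the slowest executor rather than the sum of all of them.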

Technologies Used

Cactus Compute — On-device inference runtime for FunctionGemma-270M and Whisper
Google DeepMind FunctionGemma-270M-IT — Edge tool-calling model
Google Gemini 2.5 Flash — Cloud fallback via google-genai SDK
FastAPI + WebSockets — Real-time voice server
Python — Hybrid routing engine, deterministic parser, composite scoring
ffmpeg — Browser audio to WAV conversion
macOS osascript — Native notification execution

Team

Suprad Parashar

Mallikarjun Nagaraja

Dhanush Gowda S

Shreevershith Kollabettu

Products & Tools

Cactus Compute, Google, Google DeepMind

Additional Links

GitHub: https://github.com/suprad-parashar/CactusWebApp