Doozy - Google DeepMind x Cactus Compute Global Hackathon
AI Tinkerers - San Francisco
Hackathon Showcase

Doozy

Team consisting of an ex-Amazon SDE and Harvard CS founder skilled in AWS serverless/AI agents, paired with a Pratt-educated systems thinker and UX researcher.

2 members

Hybrid On-Device Function Calling with Semantic Cache Learning

Our system implements a local-first function calling pipeline that runs a 270M-parameter FunctionGemma model via the Cactus Compute runtime for on-device inference, with escalation to Gemini Flash as a cloud fallback. The architecture is designed around a core principle: handle as much as possible on-device, escalate only when uncertain, and learn from every escalation to reduce future cloud dependency.

On-Device Execution Pipeline

FunctionGemma runs entirely on-device through Cactus’s C-level inference engine (cactus_init, cactus_complete, cactus_destroy). The local pipeline uses a forced multi-call approach: user messages are split into action clauses, each clause is matched to the best tool via schema-driven keyword/stem overlap scoring against tool names, descriptions, and parameter metadata, then arguments are extracted through a forced single-tool model call with full schema context. A grounding validation layer rejects hallucinated values by verifying each extracted argument traces back to the user’s original text. When the model fails, a deterministic repair pipeline uses schema-driven type inference (_infer_value_type) to select the right candidate from regex-extracted text segments — all without any tool-specific hardcoding.
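The clause splitting, schema-driven tool matching, and grounding check described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's code: the function names (split_clauses, match_tool, is_grounded), the stopword list, the crude suffix stemmer, and the example tool schemas are all hypothetical, and the forced single-tool model call is left out entirely.

```python
import re

# Hypothetical stopword list; the project's actual scoring details are not shown.
STOPWORDS = {"a", "an", "the", "for", "to", "of", "and", "then"}

def split_clauses(message):
    """Split a message into action clauses on punctuation and simple conjunctions."""
    parts = re.split(r"\s*(?:,|;|\band then\b|\bthen\b|\band\b)\s*",
                     message, flags=re.IGNORECASE)
    return [p.strip() for p in parts if p and p.strip()]

def stem(word):
    """Very crude suffix stripping, standing in for a real stemmer."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Return the set of stemmed alphabetic tokens, minus stopwords."""
    return {stem(w) for w in re.findall(r"[A-Za-z]+", text)} - STOPWORDS

def tool_vocabulary(tool):
    """Collect tokens from the tool name, description, and parameter metadata."""
    pieces = [tool["name"], tool.get("description", "")]
    for param_name, spec in tool.get("parameters", {}).items():
        pieces.append(param_name)
        pieces.append(spec.get("description", ""))
    return tokenize(" ".join(pieces))

def match_tool(clause, tools):
    """Pick the tool whose schema shares the most stems with the clause."""
    if not tools:
        return None
    clause_tokens = tokenize(clause)
    scored = [(len(clause_tokens & tool_vocabulary(t)), t["name"]) for t in tools]
    best_score, best_name = max(scored, key=lambda pair: pair[0])
    return best_name if best_score > 0 else None

def is_grounded(value, message):
    """Accept an extracted argument only if it appears in the user's text."""
    needle = re.sub(r"\s+", " ", str(value)).strip().lower()
    haystack = re.sub(r"\s+", " ", message).lower()
    return bool(needle) and needle in haystack

# Example schemas, invented for illustration only.
TOOLS = [
    {"name": "set_alarm", "description": "Set an alarm for a given time",
     "parameters": {"time": {"type": "string", "description": "Alarm time"}}},
    {"name": "send_message", "description": "Send a text message to a contact",
     "parameters": {"recipient": {"type": "string", "description": "Contact name"},
                    "body": {"type": "string", "description": "Message text"}}},
]

message = "Set an alarm for 7am and text Bob hello"
for clause in split_clauses(message):
    print(clause, "->", match_tool(clause, TOOLS))
```

Because the scoring reads only the schema fields, adding a new tool requires no tool-specific code. In this sketch an argument such as "Alice" would be rejected by is_grounded for the message above, since it never appears in the user's text.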

Intelligent Routing Logic

The hybrid router (generate_hybrid) routes each request on the model’s own token-probability confidence score. When FunctionGemma returns a confidence of at least 0.95, the result is served directly on-device with sub-second latency and zero network dependency, which preserves user privacy because no data leaves the device. When confidence falls below 0.95, the system escalates to Gemini Flash via the Google GenAI API, which provides higher-accuracy results at the cost of added latency and network exposure.
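A minimal sketch of this threshold rule follows. It assumes the local model is wrapped in a callable that returns a dict with "function_calls" and "confidence" keys, and that the cloud fallback is a callable returning a list of function calls. Those shapes, the route function, and the RoutedResult type are hypothetical; only the 0.95 threshold comes from the description above.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.95  # threshold stated in the text

@dataclass
class RoutedResult:
    function_calls: list
    source: str        # "on-device" or "cloud"
    confidence: float  # confidence reported by the local model

def route(messages, tools, run_local, run_cloud, threshold=CONFIDENCE_THRESHOLD):
    """Serve the local result when confident; otherwise escalate to the cloud."""
    local = run_local(messages, tools)
    if local["confidence"] >= threshold:
        # High confidence: nothing leaves the device.
        return RoutedResult(local["function_calls"], "on-device", local["confidence"])
    # Low confidence: pay the network round-trip for a more accurate answer.
    cloud_calls = run_cloud(messages, tools)
    return RoutedResult(cloud_calls, "cloud", local["confidence"])
```

Passing the two models in as callables keeps the routing rule testable without loading either model, and makes the threshold easy to tune per deployment.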

Semantic Cache Learning Loop

The most novel aspect is the persistent semantic cache (ToolCallStore), powered by Qwen3-Embedding-0.6B (also running on-device via Cactus) and Cactus’s built-in vector index (cactus_index_init, cactus_index_add, cactus_index_query). Every cloud escalation stores the query embedding together with the cloud-resolved tool calls. On subsequent requests, the cache performs a cosine similarity lookup. If a near-identical query was previously resolved by the cloud (score ≥ 0.99), the cached result is returned immediately without running the model. For moderate-similarity hits (score from 0.75 up to, but not including, 0.99), the cached example is injected as a few-shot prompt into FunctionGemma’s context, guiding the local model to produce better extractions. Below 0.75 the cache is ignored. This creates an agentic learning loop: the system starts cold, escalates to the cloud for difficult cases, caches the answers, and progressively handles more queries locally over time, shifting the edge-cloud frontier with each interaction.
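The tiered lookup can be illustrated with a small in-memory stand-in. This sketch does not use the Cactus vector index or a real embedding model: ToyToolCallCache, its methods, and the two-dimensional example vectors are hypothetical, and a linear scan replaces the index query. Only the 0.99 and 0.75 thresholds are taken from the description above.

```python
import math

EXACT_THRESHOLD = 0.99    # at or above: return the cached tool calls directly
FEWSHOT_THRESHOLD = 0.75  # at or above (but below exact): use as a few-shot example

def normalize(vector):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]

class ToyToolCallCache:
    """In-memory stand-in for a persistent semantic cache (illustration only)."""

    def __init__(self):
        self._entries = []  # list of (unit_vector, query_text, tool_calls)

    def add(self, embedding, query, tool_calls):
        """Store a cloud-resolved query with its embedding and tool calls."""
        self._entries.append((normalize(embedding), query, tool_calls))

    def lookup(self, embedding):
        """Return (tier, entry, score), where tier is 'hit', 'few_shot', or 'miss'."""
        if not self._entries:
            return ("miss", None, 0.0)
        q = normalize(embedding)
        scored = [(sum(a * b for a, b in zip(q, e[0])), e) for e in self._entries]
        score, entry = max(scored, key=lambda pair: pair[0])
        if score >= EXACT_THRESHOLD:
            return ("hit", entry, score)
        if score >= FEWSHOT_THRESHOLD:
            return ("few_shot", entry, score)
        return ("miss", None, score)

cache = ToyToolCallCache()
cache.add([1.0, 0.0], "set an alarm for 7am",
          [{"name": "set_alarm", "arguments": {"time": "7am"}}])

print(cache.lookup([1.0, 0.0])[0])  # near-identical query
print(cache.lookup([1.0, 0.5])[0])  # moderately similar query
print(cache.lookup([0.0, 1.0])[0])  # unrelated query
```

On a "hit" the stored tool calls would be returned as the answer; on "few_shot" the stored query and tool calls would be placed into the local model's prompt as a worked example; on "miss" the request proceeds through the normal local pipeline and router.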

Technologies and Frameworks

Technology | Role
FunctionGemma 270M-IT | On-device function calling model (Google DeepMind), optimized for tool selection and argument extraction
Cactus Compute Runtime | C-level inference engine with Python bindings for model loading, completion, embedding generation, and persistent vector indexing, all running locally without GPU requirements
Qwen3-Embedding-0.6B | On-device embedding model (1024-dim, normalized) for semantic similarity in the cache layer
Cactus Vector Index | Memory-mapped persistent vector store (fp16, cosine similarity via dot product); no external database needed
Gemini Flash (Google GenAI API) | Cloud fallback for low-confidence cases, providing high-accuracy function calling that feeds the learning loop
Python standard library | Schema-driven type inference, regex candidate extraction, grounding validation; all deterministic, with no ML dependencies beyond the models
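The "fp16, cosine similarity via dot product" entry relies on a standard identity: for unit-length vectors, the dot product equals the cosine similarity, so a store of normalized embeddings can skip the division by norms at query time. The sketch below checks this with the standard library only, using struct's half-precision format to mimic fp16 storage. The helper names and the three-dimensional vectors are illustrative assumptions and say nothing about how the Cactus index is implemented internally.

```python
import math
import struct

def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def to_fp16(values):
    """Round each float to the nearest IEEE 754 half-precision value."""
    fmt = f"<{len(values)}e"  # 'e' is the half-precision format code
    return list(struct.unpack(fmt, struct.pack(fmt, *values)))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Reference cosine similarity computed with explicit norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0, 0.0], [4.0, 3.0, 0.0]

full_precision = dot(normalize(a), normalize(b))
half_precision = dot(to_fp16(normalize(a)), to_fp16(normalize(b)))

print(cosine(a, b), full_precision, half_precision)
```

Storing each component in two bytes halves the footprint relative to fp32, at the cost of a small rounding error in the score. That error matters mainly near a decision boundary such as a high similarity threshold.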

Local-First Privacy and Speed

The architecture keeps high-confidence requests on the device. The semantic cache further reduces cloud dependency over time: after an initial learning period, the system is intended to handle most queries entirely on-device, with single-digit-millisecond cache lookups compared to ~700ms for full model inference or multi-second cloud round-trips. User query data stored in the cache remains local, and the learning loop operates without any centralized training or data collection.

AI Tinkerers Cactus Compute Cactus Compute Runtime. Gemini 2.0 Flash was blocked to new users. Google DeepMind