Doozy
Team consisting of an ex-Amazon SDE and Harvard CS founder skilled in AWS serverless/AI agents, paired with a Pratt-educated systems thinker and UX researcher.
Project Description
Hybrid On-Device Function Calling with Semantic Cache Learning
Our system implements a local-first function calling pipeline that runs the 270M-parameter FunctionGemma model on-device via the Cactus Compute runtime and escalates to Gemini Flash as a cloud fallback. The architecture is designed around a core principle: handle as much as possible on-device, escalate only when uncertain, and learn from every escalation to reduce future cloud dependency.
On-Device Execution Pipeline
FunctionGemma runs entirely on-device through Cactus’s C-level inference engine (cactus_init, cactus_complete, cactus_destroy). The local pipeline uses a forced multi-call approach in three steps. First, the user message is split into action clauses. Second, each clause is matched to the best tool by schema-driven keyword/stem overlap scoring against tool names, descriptions, and parameter metadata. Third, arguments are extracted through a forced single-tool model call with full schema context. A grounding validation layer then rejects hallucinated values by verifying that each extracted argument traces back to the user’s original text. When the model fails, a deterministic repair pipeline uses schema-driven type inference (_infer_value_type) to select the right candidate from regex-extracted text segments, all without any tool-specific hardcoding.
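The tool-matching and grounding steps described above can be sketched in plain Python. This is a minimal illustration, not the project's code: the helper names (`tokens`, `tool_text`, `best_tool`, `is_grounded`), the crude suffix-stripping stem, and the tool dictionary layout are all assumptions made for the example.

```python
import re

def tokens(text):
    """Lowercase word tokens with a crude suffix-stripping stem (illustrative only)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {re.sub(r"(ing|ed|s)$", "", w) if len(w) > 4 else w for w in words}

def tool_text(tool):
    """Flatten a tool's name, description, and parameter metadata into one string."""
    parts = [tool["name"].replace("_", " "), tool.get("description", "")]
    for pname, pmeta in tool.get("parameters", {}).items():
        parts.append(pname.replace("_", " "))
        parts.append(pmeta.get("description", ""))
    return " ".join(parts)

def best_tool(clause, tools):
    """Return the tool whose schema text shares the most stems with the clause."""
    clause_tokens = tokens(clause)
    return max(tools, key=lambda t: len(clause_tokens & tokens(tool_text(t))))

def is_grounded(value, user_text):
    """Accept an extracted value only if it literally appears in the user's text."""
    return str(value).lower() in user_text.lower()

# Hypothetical tool schemas, invented for this example.
TOOLS = [
    {
        "name": "set_alarm",
        "description": "Set an alarm for a given time",
        "parameters": {"time": {"type": "string", "description": "Time of the alarm"}},
    },
    {
        "name": "send_message",
        "description": "Send a text message to a contact",
        "parameters": {
            "recipient": {"type": "string", "description": "Name of the contact"},
            "message": {"type": "string", "description": "Body of the message"},
        },
    },
]
```

With these definitions, `best_tool("set an alarm for 7am", TOOLS)` selects the `set_alarm` schema, and `is_grounded("8pm", "set an alarm for 7am")` rejects a value the user never typed. A substring check is the simplest possible grounding rule; a real validator would likely also need to normalize numbers, times, and casing.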
Intelligent Routing Logic
The hybrid router (generate_hybrid) uses the model’s own token-probability confidence score to decide routing. When FunctionGemma returns high confidence (≥ 0.95), the result is served directly on-device with sub-second latency and zero network dependency, which preserves user privacy because no data leaves the device. When confidence falls below that threshold, the system escalates to Gemini Flash via the Google GenAI API, which provides higher-accuracy results at the cost of added latency and network exposure.
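The routing rule above reduces to a single threshold comparison. The sketch below assumes the local and cloud models are passed in as plain callables; `LocalResult`, `route`, and the returned dictionary shape are hypothetical names for illustration and do not reproduce the actual generate_hybrid signature.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.95  # threshold stated in the text

@dataclass
class LocalResult:
    """Hypothetical container for an on-device result."""
    function_calls: list
    confidence: float

def route(query, run_local, run_cloud, threshold=CONFIDENCE_THRESHOLD):
    """Serve the local result when confident, otherwise escalate to the cloud."""
    local = run_local(query)
    if local.confidence >= threshold:
        return {"source": "on-device", "function_calls": local.function_calls}
    # Low confidence: the query leaves the device only on this branch.
    return {"source": "cloud", "function_calls": run_cloud(query)}
```

Passing the models in as callables keeps the routing decision testable without loading a model or making a network call.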
Semantic Cache Learning Loop
The most novel aspect is the persistent semantic cache (ToolCallStore), powered by Qwen3-Embedding-0.6B (also running on-device via Cactus) and Cactus’s built-in vector index (cactus_index_init, cactus_index_add, cactus_index_query). Every cloud escalation stores the query embedding together with the tool calls returned by the cloud. On subsequent requests, the cache performs a cosine similarity lookup. If a near-identical query was previously resolved by the cloud (score ≥ 0.99), the cached result is returned immediately. For moderate-similarity hits (0.75 ≤ score < 0.99), the cached example is injected as a few-shot example into FunctionGemma’s context, guiding the local model toward better extractions. This creates an agentic learning loop: the system starts cold, escalates to the cloud for difficult cases, caches the answers, and progressively handles more queries locally, shifting the edge-cloud frontier with each interaction.
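The three-tier lookup described above (exact hit, few-shot hint, miss) can be illustrated with a small in-memory stand-in. This sketch does not use the Cactus vector index or the real ToolCallStore; `TinyCache`, `normalize`, and the tier labels are hypothetical, and embeddings are supplied as plain lists of floats. Because stored vectors are unit length, the dot product equals cosine similarity.

```python
import math

EXACT_HIT = 0.99     # at or above: return the cached tool calls directly
FEW_SHOT_HIT = 0.75  # at or above (but below EXACT_HIT): use as a few-shot example

def normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

class TinyCache:
    """In-memory stand-in for a persistent semantic cache (illustrative only)."""

    def __init__(self):
        self.entries = []  # list of (unit_vector, tool_calls)

    def add(self, embedding, tool_calls):
        """Store a cloud-resolved query embedding with its tool calls."""
        self.entries.append((normalize(embedding), tool_calls))

    def lookup(self, embedding):
        """Return (tier, tool_calls) for the most similar stored query."""
        if not self.entries:
            return ("miss", None)
        q = normalize(embedding)
        score, calls = max(
            ((sum(a * b for a, b in zip(q, vec)), calls) for vec, calls in self.entries),
            key=lambda pair: pair[0],
        )
        if score >= EXACT_HIT:
            return ("exact", calls)
        if score >= FEW_SHOT_HIT:
            return ("few_shot", calls)
        return ("miss", None)
```

A linear scan is adequate for a sketch; a persistent index serves the same role at scale. The `few_shot` tier returns the cached calls so the caller can format them into the local model's prompt rather than serving them directly.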
Technologies and Frameworks
| Technology | Role |
|---|---|
| FunctionGemma 270M-IT | On-device function calling model (Google DeepMind), optimized for tool selection and argument extraction |
| Cactus Compute Runtime | C-level inference engine with Python bindings for model loading, completion, embedding generation, and persistent vector indexing — all running locally without GPU requirements |
| Qwen3-Embedding-0.6B | On-device embedding model (1024-dim, normalized) for semantic similarity in the cache layer |
| Cactus Vector Index | Memory-mapped persistent vector store (fp16, cosine similarity via dot product) — no external database needed |
| Gemini Flash (Google GenAI API) | Cloud fallback for low-confidence cases, providing high-accuracy function calling that feeds the learning loop |
| Python standard library | Schema-driven type inference, regex candidate extraction, grounding validation — all deterministic, no ML dependencies beyond the models |
Local-First Privacy and Speed
The architecture ensures that high-confidence requests never leave the device. The semantic cache further reduces cloud dependency over time: after an initial learning period, the system can handle the vast majority of queries entirely on-device with single-digit millisecond cache lookups, compared to ~700 ms for full model inference or multi-second cloud round-trips. User query data stored in the cache remains local, and the learning loop operates without any centralized training or data collection.