Cookmate — AI Voice Cooking App
Creator · March 2026
Overview
Cookmate turns cooking into a conversation. Instead of squinting at a recipe card, you cook with Nonna — an AI Italian grandmother who guides you through every step by voice, in real time. She tells you what to do, fires your timers, and keeps you company between steps. No tapping, no reading mid-chop. Just talking.

I built the entire app from scratch: a full Swift/SwiftUI codebase with a multi-layer voice pipeline at its core. The pipeline chains Apple's SFSpeechRecognizer for always-on voice activity detection, the Claude API for Nonna's intelligence, and OpenAI's TTS API for her voice — all coordinated through a VoiceState machine that ensures the microphone is never active while Nonna is speaking. Claude drives the session through tool calls: advancing recipe steps, starting and cancelling timers, and saving user memories to Firebase so Nonna remembers your dietary restrictions across sessions. A ChunkingBuffer streams Claude's response tokens into TTS in sentence-sized chunks, keeping end-to-end latency under two seconds from the moment you finish speaking.

The MVP launched with three recipes — Chicken Alfredo, Chicken Parm, and Pizza from scratch — and covers the full user journey: a three-page onboarding flow, a recipe selection screen, a live cooking session view with step display and active timer chips, and session resumption for interrupted cooks. Prompt caching keeps the per-session API cost under $0.25.
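The VoiceState machine described above can be sketched as a small enum plus a transition table. This is a minimal illustration under my own assumptions (state names and legal transitions are hypothetical, not the app's actual code); the key idea is that the mic gate is derived from the state, and illegal transitions are rejected so the mic can never come back on mid-playback.

```swift
// Hypothetical sketch of a VoiceState machine; names and transitions
// are assumptions, not Cookmate's actual implementation.
enum VoiceState {
    case idle       // mic on, waiting for the user to speak
    case listening  // speech detected, transcribing
    case thinking   // transcript sent to the LLM, awaiting tokens
    case speaking   // TTS playing; mic must stay off
}

struct VoiceStateMachine {
    private(set) var state: VoiceState = .idle

    // The mic is only allowed while idle or actively listening.
    var micAllowed: Bool { state == .idle || state == .listening }

    // Transitions are explicit; anything else is ignored, so the mic
    // can never be re-enabled while speech is still playing.
    mutating func transition(to next: VoiceState) {
        switch (state, next) {
        case (.idle, .listening),
             (.listening, .thinking),
             (.thinking, .speaking),
             (.speaking, .idle):
            state = next
        default:
            break // reject illegal transitions
        }
    }
}
```

Deriving `micAllowed` from the state (rather than toggling a separate flag) means there is exactly one source of truth for whether audio capture may run.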
Key Features
- Always-on voice activity detection — no push-to-talk, just natural conversation
- VoiceState machine ensuring the mic is never active while Nonna is speaking
- Claude tool calls to advance steps, manage timers, and save dietary memories across sessions
- ChunkingBuffer streaming LLM tokens into TTS chunks for sub-2s response latency
- Session persistence via Firebase — resume interrupted cooks where you left off
- Three full recipes with step-by-step timer integration (Chicken Alfredo, Chicken Parm, Pizza)
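The ChunkingBuffer idea — flushing streamed LLM tokens to TTS at each sentence boundary so speech starts before the full response arrives — can be sketched as follows. This is an illustrative version under my own assumptions (the type name comes from the text above, but the API and boundary rules here are hypothetical).

```swift
import Foundation

// Hypothetical sketch of a ChunkingBuffer: accumulate streamed tokens
// and emit a chunk at each sentence boundary, so TTS playback can begin
// before the LLM finishes. API shape is an assumption.
struct ChunkingBuffer {
    private var buffer = ""
    private let boundaries: Set<Character> = [".", "!", "?"]

    // Append a token; return a completed sentence chunk if one is ready.
    mutating func append(_ token: String) -> String? {
        buffer += token
        guard let last = buffer.last, boundaries.contains(last) else { return nil }
        let chunk = buffer.trimmingCharacters(in: .whitespaces)
        buffer = ""
        return chunk
    }

    // Flush whatever remains when the token stream ends.
    mutating func flush() -> String? {
        let rest = buffer.trimmingCharacters(in: .whitespaces)
        buffer = ""
        return rest.isEmpty ? nil : rest
    }
}
```

Sentence-sized chunks are a practical sweet spot: large enough for natural TTS prosody, small enough that the first audio can play while the rest of the response is still streaming.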
Challenges
Coordinating three real-time async systems — speech recognition, LLM streaming, and TTS playback — without letting them step on each other required a strict VoiceState machine and careful actor isolation throughout. Prompt engineering Nonna's persona to feel warm and natural (not robotic) across the full arc of a cooking session was the other core challenge.
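One way to picture the actor isolation mentioned above: a single actor owns the session state, so tool calls arriving from the LLM stream and events from the UI are serialized and cannot race. The sketch below is my own illustration under stated assumptions — the actor name, tool surface, and timer mechanics are hypothetical, not the app's code.

```swift
import Foundation

// Hypothetical sketch: one actor owns session state so concurrent
// tool calls and UI events can't race. All names are assumptions.
actor CookSession {
    private(set) var stepIndex = 0
    private var timers: [String: Task<Void, Never>] = [:]

    func advanceStep() { stepIndex += 1 }

    // Start (or restart) a named timer; fires the callback when done.
    func startTimer(label: String, seconds: Int, onFire: @escaping @Sendable () -> Void) {
        timers[label]?.cancel()
        timers[label] = Task {
            try? await Task.sleep(nanoseconds: UInt64(seconds) * 1_000_000_000)
            guard !Task.isCancelled else { return }
            onFire()
        }
    }

    func cancelTimer(label: String) {
        timers[label]?.cancel()
        timers[label] = nil
    }
}
```

Because every mutation goes through the actor's serialized executor, a "cancel timer" tool call can never interleave halfway through a "start timer" call, which is exactly the class of race that ad-hoc locking tends to miss.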
Outcomes
MVP complete and TestFlight-ready across all 29 implementation tasks. End-to-end latency under 2 seconds. Per-session cost under $0.25 with prompt caching active.