AI Portfolio Voice Agent

Live

Talk to the portfolio in any language

Started: May 7, 2026 · ~16 hours over 2 days

AI:

Claude Code

Gemini API

Stack:

Next.js 16

React 19

TypeScript

Agora Conversational AI

Agora RTC

Gemini 3.1 Flash Live

Upstash Redis

Vercel

A real-time voice docent built into this portfolio. Click the mic, speak in any language, and an AI agent describes Alex's projects, opens them on demand, and offers to walk you through his career arc. Everything runs over a single multimodal channel, with no separate STT or TTS stage. Ships with three layers of cost guardrails, private session analytics for post-launch debugging, and a drift-resistant QA harness that runs before every push.

// highlights

Single-vendor multimodal voice (audio in / reasoning / audio out) via Gemini 3.1 Flash Live — no separate STT or TTS pipeline
Belt-and-suspenders tool calling: real function_calls plus a six-layer transcript-pattern fallback (commitment, order-aware suppression, commitment-text scoping, writeup-vs-domain, anaphora, booking-before-LinkedIn priority)
Three-layer Upstash guardrails — single-session lock, per-IP rate limit, daily budget kill switch with refund on early end
Private session analytics — every transcript turn and tool call logs to Upstash with a 30-day TTL, so production failures get diagnosed in seconds rather than guessed at
Hover-toast pulse trigger UI that stays out of the way until a visitor leans in; tool-failure toast surfaces real error messages above the in-session UI when a tool returns ok:false
Drift-resistant QA harness extracts regex patterns from source — 46 cases run on every preship, each one a verbatim production transcript that broke an earlier version
Reverse-engineered Agora's chunked datastream wire format from runtime logs to recover early-arriving messages
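
The three guardrail layers listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's code: `MemoryStore` is a hypothetical in-memory stand-in for Upstash Redis, and the key names, the limits, and the reserve-then-refund accounting are all invented for the example.

```typescript
// Hypothetical in-memory stand-in for a Redis-like store. A real deployment
// would rely on atomic Redis commands rather than a local Map.
class MemoryStore {
  private data = new Map<string, number>();

  // Set the key only if it is absent; returns true when the key was set.
  setIfAbsent(key: string, value: number): boolean {
    if (this.data.has(key)) return false;
    this.data.set(key, value);
    return true;
  }

  // Add delta to a counter (missing keys start at 0) and return the new value.
  incrBy(key: string, delta: number): number {
    const next = (this.data.get(key) ?? 0) + delta;
    this.data.set(key, next);
    return next;
  }

  del(key: string): void {
    this.data.delete(key);
  }
}

type Decision =
  | { ok: true }
  | { ok: false; reason: "busy" | "rate_limited" | "budget_exhausted" };

// Illustrative limits only (assumptions, not the project's real values).
const MAX_SESSIONS_PER_IP = 3;
const DAILY_BUDGET_SECONDS = 600;
const SESSION_RESERVE_SECONDS = 120;

function startSession(store: MemoryStore, ip: string, day: string): Decision {
  // Layer 1: single-session lock, so only one voice session runs at a time.
  if (!store.setIfAbsent("lock:session", 1)) {
    return { ok: false, reason: "busy" };
  }
  // Layer 2: per-IP rate limit, counted per day.
  if (store.incrBy(`rate:${ip}:${day}`, 1) > MAX_SESSIONS_PER_IP) {
    store.del("lock:session");
    return { ok: false, reason: "rate_limited" };
  }
  // Layer 3: daily budget kill switch. Reserve a worst-case session up front.
  if (store.incrBy(`budget:${day}`, SESSION_RESERVE_SECONDS) > DAILY_BUDGET_SECONDS) {
    store.incrBy(`budget:${day}`, -SESSION_RESERVE_SECONDS);
    store.del("lock:session");
    return { ok: false, reason: "budget_exhausted" };
  }
  return { ok: true };
}

// Refund the unused part of the reservation when a session ends early.
// Returns the total seconds charged against the day's budget so far.
function endSession(store: MemoryStore, day: string, usedSeconds: number): number {
  const unused = Math.max(0, SESSION_RESERVE_SECONDS - usedSeconds);
  const spent = store.incrBy(`budget:${day}`, -unused);
  store.del("lock:session");
  return spent;
}
```

Reserving the worst case at session start and refunding on early end keeps the budget check conservative: a burst of new sessions cannot overshoot the daily cap while earlier sessions are still running.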

// takeaways

Preview-tier multimodal models will narrate function calls without emitting them. Always plan a transcript-pattern fallback before depending on tool calls.
Distinguishing 'tell me about X' from 'show me X' is the core voice-UX problem, not a side detail. The agent should be a docent that offers, not a teleporter that yanks.
Build private analytics on day one. Reading one real session transcript beats hours of speculation. The diagnostic CLI we built turned every 'it didn't work' report into a 30-second answer.
Anaphora and past-tense narration are the dominant production failure modes for project navigation, and they only show up after real users use the agent. The QA harness needs verbatim production transcripts as cases, not invented ones.
Multimodal Live beats a pipeline (STT + LLM + TTS) on latency and vendor count, but the trade-off is tool-call reliability. Pick the architecture based on whether your UX needs sub-second turns or rock-solid function calling.
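
The narrate-without-emitting failure and the tell-versus-show distinction from the takeaways above can be illustrated with a small sketch. Everything named here is hypothetical: the project list, the verb patterns, and both function names are assumptions made for the example, and the actual six-layer fallback is not reproduced.

```typescript
type Intent =
  | { kind: "open"; project: string }
  | { kind: "describe"; project: string }
  | { kind: "none" };

// Hypothetical project names, used only for this illustration.
const PROJECTS = ["voice agent", "design system"];

// Assumed verb patterns. No `g` flag, so RegExp.test stays stateless.
const TELL_VERBS = /\b(?:tell me about|what is|what's|describe|explain)\b/i;
const SHOW_VERBS = /\b(?:show|open|pull up|take me to)\b/i;
const COMMITMENT =
  /\b(?:i'll|i will|let me|i've|i have)\s+(?:open|opened|pull up|pulled up)\b/i;

function findProject(text: string): string | undefined {
  const lower = text.toLowerCase();
  return PROJECTS.find((p) => lower.includes(p));
}

// User side: "tell me about X" describes, "show me X" navigates.
// Describe wins when both match, so the agent offers instead of yanking.
function classifyUserTurn(text: string): Intent {
  const project = findProject(text);
  if (!project) return { kind: "none" };
  if (TELL_VERBS.test(text)) return { kind: "describe", project };
  if (SHOW_VERBS.test(text)) return { kind: "open", project };
  return { kind: "none" };
}

// Agent side: if the model said it opened (or will open) a project but no
// real function call arrived, recover the target from the transcript.
function fallbackFromAgentTurn(agentText: string, toolCallSeen: boolean): string | null {
  if (toolCallSeen) return null; // the real function call already handled it
  if (!COMMITMENT.test(agentText)) return null;
  return findProject(agentText) ?? null;
}

// Invented regression cases for the sketch. Per the takeaways, a real harness
// should use verbatim production transcripts instead.
const cases: Array<{ text: string; expected: Intent["kind"] }> = [
  { text: "Tell me about the voice agent", expected: "describe" },
  { text: "Show me the voice agent", expected: "open" },
  { text: "Show me something cool", expected: "none" },
];

function failingCases(): string[] {
  return cases
    .filter((c) => classifyUserTurn(c.text).kind !== c.expected)
    .map((c) => c.text);
}
```

The sketch covers the present, future, and past-tense commitment forms, but it does not attempt anaphora ("open that one"), which would need the previously mentioned project to be tracked across turns.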