Babel-Fish-Assistant

08/2023 — 09/2023

Real-time translationVoice AI

The Problem

Real-time voice translation systems lacked seamless integration of speech recognition, translation, and natural voice synthesis for cross-language communication.

The Solution

Developed a production-ready real-time voice translator integrating Deepgram STT and ElevenLabs TTS with GPT-4o-mini for high-accuracy cross-language processing. Built full-stack application with Next.js 14 and Flask backend for low-latency audio streaming.

Impact

Production-ready voice AI system enabling real-time multilingual communication with natural speech synthesis.

Architecture

Audio is captured in the browser and streamed to a Flask backend via WebSocket. Deepgram STT transcribes the stream in real-time with word-level timestamps. The transcript is sent to GPT-4o-mini for translation with a system prompt that preserves tone and domain vocabulary. The translated text is passed to ElevenLabs TTS, which streams synthesized audio back to the Next.js frontend for immediate playback — keeping end-to-end latency under 2 seconds for most language pairs.

Key Challenges

WebSocket stream management between Next.js and Flask was the hardest integration point. Audio chunks arrived out of order under high load. Solved by implementing a sequence-numbered buffer on the Flask side that reassembles chunks before passing to Deepgram.
ElevenLabs TTS has per-character billing and noticeable latency on long strings. Chunked translated text at sentence boundaries and streamed audio synthesis sentence-by-sentence, so the first sentence plays while the rest are still being synthesized.
GPT-4o-mini occasionally over-translated idiomatic expressions, losing the speaker's original tone. Added a translation memory system that stores previously translated phrases per session and injects them as few-shot examples to improve consistency across a conversation.

Key Learnings

Real-time audio pipelines exposed how fragile streaming integrations are under variable network conditions — buffering and retry logic aren't optional. I also learned that translation quality is highly context-dependent: without conversation history, even GPT-4o-mini translates ambiguous phrases incorrectly, so maintaining a session context window is essential.

Technologies

Deepgram STTGPT-4o-miniElevenLabs TTSNext.js 14FlaskPythonTypeScript

Links

View Source Code