Babel-Fish-Assistant

08/2023 — 09/2023
Real-time translationVoice AI

The Problem

Real-time voice translation systems lacked seamless integration of speech recognition, translation, and natural voice synthesis for cross-language communication.

The Solution

Developed a production-ready real-time voice translator integrating Deepgram STT and ElevenLabs TTS with GPT-4o-mini for high-accuracy cross-language processing. Built full-stack application with Next.js 14 and Flask backend for low-latency audio streaming.

Impact

Production-ready voice AI system enabling real-time multilingual communication with natural speech synthesis.

Architecture

Audio is captured in the browser and streamed to a Flask backend via WebSocket. Deepgram STT transcribes the stream in real-time with word-level timestamps. The transcript is sent to GPT-4o-mini for translation with a system prompt that preserves tone and domain vocabulary. The translated text is passed to ElevenLabs TTS, which streams synthesized audio back to the Next.js frontend for immediate playback — keeping end-to-end latency under 2 seconds for most language pairs.

Key Challenges

  • WebSocket stream management between Next.js and Flask was the hardest integration point. Audio chunks arrived out of order under high load. Solved by implementing a sequence-numbered buffer on the Flask side that reassembles chunks before passing to Deepgram.
  • ElevenLabs TTS has per-character billing and noticeable latency on long strings. Chunked translated text at sentence boundaries and streamed audio synthesis sentence-by-sentence, so the first sentence plays while the rest are still being synthesized.
  • GPT-4o-mini occasionally over-translated idiomatic expressions, losing the speaker's original tone. Added a translation memory system that stores previously translated phrases per session and injects them as few-shot examples to improve consistency across a conversation.

Key Learnings

Real-time audio pipelines exposed how fragile streaming integrations are under variable network conditions — buffering and retry logic aren't optional. I also learned that translation quality is highly context-dependent: without conversation history, even GPT-4o-mini translates ambiguous phrases incorrectly, so maintaining a session context window is essential.

Technologies

Deepgram STTGPT-4o-miniElevenLabs TTSNext.js 14FlaskPythonTypeScript