A cloud-native platform that lets people who speak different languages converse naturally, in real time, by capturing speech, translating on the fly, and rebroadcasting both audio and text back into the same virtual “room.”
- Multi-language voice & text translation with sub-second latency
- WebSocket-driven orchestration (FastAPI) and front-end (Next.js)
- GPT-4o Realtime on Azure AI Foundry for transcription + interpreted audio streams
- Azure Communication Services for reliable media transport & room management
- Stateless, auto-scalable containers with in-memory session maps
- Node.js v14+
- Python 3.9+
- Azure Subscription with Communication Services & AI Foundry
```bash
git clone https://github.com/your-org/realtime-translation.git
cd realtime-translation
npm install                        # install front-end dependencies
pip install -r requirements.txt    # install back-end dependencies
```
- Create `.env.local` in the repository root and set:
  - `AZURE_COMMUNICATION_CONNECTION_STRING`
  - `AZURE_AI_ENDPOINT` and `AZURE_AI_KEY`
- Run the orchestrator:

```bash
cd orchestrator
uvicorn main:app --host 0.0.0.0 --port 8000
```
- Start the front-end:

```bash
cd ../web-ui
npm run dev
```

- Open http://localhost:3000, join or create a room, press “P” to speak.
A minimal demo is included to illustrate the full-stack flow:
- Ensure both orchestrator and front-end are running.
- In two browser tabs, join the same room ID.
- Press “P” in one tab and speak; the other tab plays the interpreted audio and shows the live transcript.
Sequential Chat Flow
The first diagram traces a single interpretation cycle. When a user joins a room, the Next.js UI opens audio & text WebSocket channels, tags the session, and renders the chat canvas. The FastAPI orchestrator registers the user in its in-memory map, adds them via Azure Communication Services Rooms API, then buffers incoming PCM until a pause is detected. That chunk is sent to GPT-4o Realtime; translated audio and transcription deltas stream back and are immediately broadcast over the same WebSockets. The UI plays the audio and appends transcript lines to complete one real-time interpreter loop.
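A minimal sketch of that registration and fan-out loop in the FastAPI orchestrator is shown below. The `ROOM_EMITTERS` name follows the layer table later in this document; the route path and session tagging are illustrative assumptions, not the exact repository code:

```python
# A sketch of the per-room registration and fan-out loop (illustrative, not
# the exact repository code). ROOM_EMITTERS: room_id -> {session_id: socket}.
import uuid
from collections import defaultdict

from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()
ROOM_EMITTERS: dict = defaultdict(dict)  # {room_id: {session_id: WebSocket}}

async def broadcast(room_id: str, payload: bytes) -> None:
    """Fan a translated audio chunk out to every connection in one room."""
    for peer in list(ROOM_EMITTERS[room_id].values()):
        await peer.send_bytes(payload)

@app.websocket("/ws/{room_id}")
async def room_socket(ws: WebSocket, room_id: str) -> None:
    await ws.accept()
    session_id = str(uuid.uuid4())           # tag the session
    ROOM_EMITTERS[room_id][session_id] = ws  # register in the in-memory map
    try:
        while True:
            pcm = await ws.receive_bytes()   # raw PCM chunk from the browser
            # In the real flow this is buffered until a pause is detected,
            # translated by GPT-4o Realtime, and only then broadcast:
            await broadcast(room_id, pcm)
    except WebSocketDisconnect:
        ROOM_EMITTERS[room_id].pop(session_id, None)
```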
Azure Rooms Service Integration
The second diagram highlights the orchestrator’s use of Azure Rooms Service for durable membership and token management. On join, the orchestrator calls the Rooms API to create or retrieve a room, issues time-bound tokens, and enforces permissions. Azure handles WebRTC signaling, NAT traversal, and scalable media endpoints.
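For orientation, here is a sketch of that join path using the `azure-communication-identity` and `azure-communication-rooms` Python SDKs. Error handling, room reuse, and token refresh are omitted, and the one-hour validity window is an arbitrary example:

```python
# Sketch: create a room and issue a time-bound join token for one participant.
import os
from datetime import datetime, timedelta, timezone

from azure.communication.identity import CommunicationIdentityClient
from azure.communication.rooms import ParticipantRole, RoomParticipant, RoomsClient

conn_str = os.environ["AZURE_COMMUNICATION_CONNECTION_STRING"]
identity_client = CommunicationIdentityClient.from_connection_string(conn_str)
rooms_client = RoomsClient.from_connection_string(conn_str)

user = identity_client.create_user()                      # one identity per tab
token = identity_client.get_token(user, scopes=["voip"])  # time-bound token

now = datetime.now(timezone.utc)
room = rooms_client.create_room(                          # durable membership
    valid_from=now,
    valid_until=now + timedelta(hours=1),
    participants=[RoomParticipant(communication_identifier=user,
                                  role=ParticipantRole.PRESENTER)],
)
print(room.id, token.token)  # handed back to the browser during the join handshake
```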
Communication and Orchestration Pipeline
The third illustration brings together the static UI, FastAPI orchestrator, GPT-4o Realtime, and Azure Communication Services. The browser captures speech via WebRTC; WebSockets deliver buffered audio to the orchestrator, which debounces the stream and forwards segments to GPT-4o Realtime on Azure AI Foundry. The model returns audio and text deltas; the orchestrator multicasts both back to each client. Azure Communication Services underpins signaling, ICE negotiation, and media relay.
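A simplified sketch of the forwarding step follows, using the public Realtime API event names (`input_audio_buffer.append`, `response.audio.delta`, …) as exposed through Azure AI Foundry. The URL shape, api-version, and deployment name are placeholders, and `broadcast`/`broadcast_text` stand in for the fan-out helpers sketched earlier:

```python
# Sketch: push one debounced audio segment to a GPT-4o Realtime deployment and
# relay its deltas back into the room. Endpoint details are placeholders.
import base64
import json
import os

import websockets  # pip install websockets

async def broadcast(room_id: str, payload: bytes) -> None: ...   # from earlier sketch
async def broadcast_text(room_id: str, text: str) -> None: ...   # assumed text twin

async def translate_segment(pcm: bytes, room_id: str) -> None:
    url = (os.environ["AZURE_AI_ENDPOINT"].replace("https://", "wss://")
           + "/openai/realtime?api-version=2024-10-01-preview"
           + "&deployment=gpt-4o-realtime-preview")
    headers = {"api-key": os.environ["AZURE_AI_KEY"]}
    # websockets >= 14 takes additional_headers=; older releases use extra_headers=.
    async with websockets.connect(url, additional_headers=headers) as ai:
        await ai.send(json.dumps({                      # strict interpreter prompt
            "type": "session.update",
            "session": {"instructions": "You are an interpreter. Translate the "
                                        "speech faithfully; add nothing else."}}))
        await ai.send(json.dumps({"type": "input_audio_buffer.append",
                                  "audio": base64.b64encode(pcm).decode()}))
        await ai.send(json.dumps({"type": "input_audio_buffer.commit"}))
        await ai.send(json.dumps({"type": "response.create"}))
        async for raw in ai:                            # stream deltas back out
            event = json.loads(raw)
            if event["type"] == "response.audio.delta":
                await broadcast(room_id, base64.b64decode(event["delta"]))
            elif event["type"] == "response.audio_transcript.delta":
                await broadcast_text(room_id, event["delta"])
            elif event["type"] == "response.done":
                break
```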
Meeting Controller / Azure Communication Services Orchestration
The final diagram positions Azure Communication Services as the call coordinator—issuing tokens, handling NAT traversal, and supporting in-call operations. Our Meeting Controller (the orchestrator) brokers authentication, room lifecycle, and AI translation, then injects translated streams into the media plane so every participant hears the interpreted audio.
| Layer | Technology | Purpose |
|---|---|---|
| 1. Chat Room UI | Next.js SPA served from Azure Static Web Apps | records microphone, renders translated audio/text, opens WebSocket to orchestrator, manages “/username” and “/join” commands |
| 2. Back-End Orchestrator | Containerised FastAPI / Chainlit service on Azure Container Apps | holds per-room maps (ROOM_EMITTERS, USERNAMES), buffers incoming PCM, detects end-of-utterance, fans out translated streams, maintains WebSocket API consumed by the browser |
| 3. Live Translator | GPT-4o-Realtime endpoint (Azure AI Foundry) | receives buffered audio, runs a strict “interpreter” prompt, streams back translated audio + incremental transcript |
| 4. Meeting Controller | Azure Communication Services (Rooms API + Call Automation) | issues access tokens, hosts WebRTC media plane, guarantees that every participant’s client can publish/subscribe audio & video at scale |
| 5. Persistent Storage / State | In-memory dictionaries inside each orchestrator replica | light-weight: no database in the critical path; rooms are effectively sharded by orchestrator replica |
- Room-centric session key – Every browser tab presents a `room_id`. The orchestrator stores a hash map of `{room_id: {session_id: emitter}}` (shown in the fan-out sketch above), so a single broadcast loop fans translated audio/text out to every connection in that room without iterating over foreign sessions.
- Buffered speech, not raw streaming – Continuous raw PCM would burn through GPT-4o tokens and spike cost. The orchestrator accumulates ~1 MB or ~3 s of audio (or flushes as soon as server-side VAD reports silence), then forwards the whole segment to GPT-4o. This shrinks WebSocket chatter and keeps each translation coherent (see the buffering sketch after this list).
- Horizontal scalability:
  - The front-end is static and can be hosted on any CDN.
  - The orchestrator runs in Container Apps with auto-scale rules on CPU, memory, and concurrent WebSocket count. Internal state is per-replica; sticky routing (a room-id-based hash) can be added via Revision Traffic if thousands of rooms are expected.
  - The Live Translator is fully managed by Azure; each new WebSocket is another GPT-4o session.
  - Communication Services distributes TURN/media relay globally—no custom scaling work needed.
- Loose coupling through WebSockets – The UI, orchestrator, and AI translator communicate exclusively over WebSocket protocols, allowing the translation engine to be swapped without touching the browser or the Communication Services integration.
- Minimal persistent state – All session data lives in memory and is derived from the client handshake. An orchestrator replica can be killed and replaced instantly; the browser reconnects and re-joins its room with zero manual recovery. Long-lived data (e.g., audit logs) can be streamed to Application Insights or Azure Monitor asynchronously.
- Security & isolation – The Rooms service issues time-bound join tokens; a browser cannot eavesdrop on a room without one. WebSocket endpoints are protected by the same token and can be fronted by Azure API Management for rate limiting.
- Cost control – GPT-4o Realtime minutes are used only for active speech. Silence detection plus container auto-scale keeps idle cost near zero. Communication Services is billed per active participant-minute, matching actual usage.
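To make the buffering rule concrete, here is a compact sketch under the thresholds quoted above. The 16 kHz mono 16-bit PCM assumption and the `vad_silence` flag are illustrative; the flushed segment would feed a forwarding function like the `translate_segment` sketch earlier:

```python
# Sketch of the flush rule: accumulate PCM until ~1 MB, ~3 s, or VAD silence,
# then hand the whole segment to the translator in one shot.
import time
from typing import List, Optional

MAX_BYTES = 1_000_000          # ~1 MB cap from the design note above
MAX_SECONDS = 3.0              # ~3 s cap from the design note above

class UtteranceBuffer:
    def __init__(self) -> None:
        self.chunks: List[bytes] = []
        self.size = 0
        self.started = time.monotonic()

    def add(self, pcm: bytes, vad_silence: bool) -> Optional[bytes]:
        """Append one chunk; return the full segment when a flush is due."""
        self.chunks.append(pcm)
        self.size += len(pcm)
        elapsed = time.monotonic() - self.started
        if vad_silence or self.size >= MAX_BYTES or elapsed >= MAX_SECONDS:
            segment = b"".join(self.chunks)
            self.chunks, self.size = [], 0
            self.started = time.monotonic()
            return segment     # forward to e.g. translate_segment(segment, room)
        return None
```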
With these components in concert, the platform delivers speaker-by-speaker live interpretation, scales horizontally under load, and remains simple to operate: swap container images for new logic, point to a new Azure model deployment for better translation quality, or scale out replicas—all without rewriting the UI or room-management layer.
- Azure Communication Services: https://aka.ms/acs
- Azure AI Foundry & GPT-4o Realtime: https://aka.ms/azure-ai-foundry
- Sample repository & issue tracker: https://github.com/Azure-Samples/realtime-translation
- Diagram assets & specs: .assets/

