Gemini Live Engine
The Gemini Live engine connects to Google’s Gemini Live API over WebSocket for real-time, cloud-powered voice conversations. Audio streams directly between the browser and Gemini’s servers — there is no intermediate speech-to-text or text-to-speech step.
How Gemini Live works
Unlike the local engine’s three-step pipeline, Gemini Live uses a single WebSocket connection for bidirectional audio streaming:
- The widget captures microphone audio as raw PCM at 16 kHz.
- Audio chunks are sent in real time to the Gemini Live API over the WebSocket.
- Gemini processes the audio server-side and streams back audio responses.
- The widget plays the response audio through the Web Audio API.
The API handles speech recognition, language understanding, and speech synthesis entirely on the server. It also provides real-time transcription of both user input and model output.
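The capture step above implies converting Web Audio samples, which are 32-bit floats in the range [-1, 1], into raw 16-bit PCM before sending. The following is a minimal sketch of that conversion, not the widget's actual code; the function name `floatTo16BitPcm` is hypothetical, and it assumes the input has already been resampled to 16 kHz.

```typescript
// Hypothetical helper: convert Float32 samples in [-1, 1] to signed 16-bit PCM.
// Assumes the audio has already been resampled to 16 kHz mono.
function floatTo16BitPcm(samples: Float32Array): Int16Array {
  const pcm = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range samples do not wrap around.
    const clamped = Math.max(-1, Math.min(1, samples[i] ?? 0));
    // Negative values scale to -32768, positive values to 32767.
    pcm[i] = clamped < 0 ? Math.round(clamped * 0x8000) : Math.round(clamped * 0x7fff);
  }
  return pcm;
}
```

Clamping before scaling matters because microphone processing can occasionally produce samples slightly outside [-1, 1], which would otherwise overflow the 16-bit range.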
Capabilities
| Capability | Supported |
|---|---|
| Languages | 12 languages |
| Voice selection | Yes (8 voices) |
| Interactive highlights | Yes (via function calling) |
| Image understanding | Yes (page images sent as context) |
| Rate / Pitch / Volume | No |
| Barge-in | Native (built into API) |
| Offline | No |
| Max session duration | 15 minutes |
| Token server required | Yes (managed by the ArionTalk cloud service when using site-key; self-host otherwise) |
Supported languages
The Gemini Live engine supports 12 languages:
| Code | Language |
|---|---|
| en | English |
| es | Spanish |
| ja | Japanese |
| fr | French |
| de | German |
| pt | Portuguese |
| it | Italian |
| zh | Chinese |
| ko | Korean |
| hi | Hindi |
| ar | Arabic |
| ru | Russian |
Available voices
Gemini Live provides 8 built-in voices. All voices support the full set of 12 languages listed above.
| Voice | ID |
|---|---|
| Kore | Kore |
| Puck | Puck |
| Charon | Charon |
| Aoede | Aoede |
| Fenrir | Fenrir |
| Leda | Leda |
| Orus | Orus |
| Zephyr | Zephyr |
The default voice is Kore. Set a different voice using the gemini-voice attribute.
Interactive Highlights
When interactive-highlights is enabled, Gemini calls a highlight_and_scroll function in real time as it speaks. The widget scrolls to and visually highlights the referenced page section or image — creating a guided tour experience.
The Page Indexer assigns stable IDs (sec-1, sec-2, img-1, img-2) to content elements and embeds them in the context sent to Gemini. The model then calls the tool whenever it discusses a specific section.
See the full Interactive Highlights guide for details on how the element ID system, visual behavior, and function calling architecture work.
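To illustrate the ID scheme described above (sec-1, sec-2, img-1, img-2), here is a small sketch of how per-kind counters could produce those IDs in document order. It is an assumption about the approach, not the Page Indexer's real implementation; `assignStableIds` and its types are hypothetical names.

```typescript
type ElementKind = "section" | "image";

interface IndexedElement {
  id: string;
  kind: ElementKind;
}

// Hypothetical sketch: number sections and images independently, in document order.
function assignStableIds(kinds: ElementKind[]): IndexedElement[] {
  const counters: Record<ElementKind, number> = { section: 0, image: 0 };
  const prefixes: Record<ElementKind, string> = { section: "sec", image: "img" };
  return kinds.map((kind) => {
    counters[kind] += 1;
    return { id: `${prefixes[kind]}-${counters[kind]}`, kind };
  });
}
```

Because each kind keeps its own counter, adding an image to the page does not renumber the sections, which keeps the IDs the model sees predictable.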
Widget configuration
There are two ways to authenticate against Gemini Live:
Option 1 — ArionTalk cloud service (recommended)
Register your site at ariontalk.com to get a site key — no server to run.

    <ariontalk-widget site-key="YOUR_SITE_KEY" interactive-highlights></ariontalk-widget>

With site-key set, engine resolves to "gemini" automatically and the widget points at the cloud service. Override service-url if you need to point at a different ArionTalk deployment.
Option 2 — Self-hosted token server
If you’d rather run your own token server, use the token-server attribute:

    <ariontalk-widget engine="gemini" token-server="https://your-server.example.com" interactive-highlights gemini-voice="Kore" settings></ariontalk-widget>

| Attribute | Description |
|---|---|
| site-key | Site key for the ArionTalk cloud service. When present, the cloud path is used automatically. |
| service-url | Base URL for the ArionTalk cloud service. Defaulted automatically when site-key is set. |
| engine | Set to "gemini" to use the Gemini Live engine. Resolves to "gemini" automatically when site-key is present. |
| token-server | URL of a self-hosted token server. Use this as an alternative to site-key. |
| interactive-highlights | Enable real-time content highlighting during conversations |
| gemini-voice | Voice name (e.g., "Kore", "Puck") |
| gemini-model | Model identifier (defaults to gemini-3.1-flash-live-preview) |
| settings | Show the settings gear icon for pre-session configuration |
Session behavior notes
- Scripted greeting. The engine sends the configured greeting via system instruction at session start.
- Synchronous tool responses. Function calls (e.g. highlight_and_scroll) pause the model’s output until the engine sends a FunctionResponse; the model then resumes speaking.
- Page context payload. Page text and image references are delivered as context, not as a user-turn instruction.
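The synchronous tool-response behavior can be sketched as a function that takes the model's function calls, runs the highlight, and builds the reply that unblocks the model. This is an illustrative sketch only: the message shapes are assumed from the Gemini Live API's tool-calling messages and should be checked against the API reference, and both the `highlight` callback and the `element_id` argument name are hypothetical.

```typescript
// Assumed message shapes; verify against the Gemini Live API reference.
interface FunctionCall {
  id: string;
  name: string;
  args: Record<string, unknown>;
}

interface FunctionResponse {
  id: string;
  name: string;
  response: Record<string, unknown>;
}

interface ToolResponseMessage {
  toolResponse: { functionResponses: FunctionResponse[] };
}

// `highlight` is a hypothetical callback that scrolls to and highlights an
// element, returning false when the element ID is unknown.
function buildToolResponse(
  calls: FunctionCall[],
  highlight: (elementId: string) => boolean,
): ToolResponseMessage {
  const functionResponses = calls.map((call): FunctionResponse => {
    if (call.name !== "highlight_and_scroll") {
      return { id: call.id, name: call.name, response: { error: "unknown function" } };
    }
    // `element_id` is an assumed argument name, e.g. "sec-2" or "img-1".
    const elementId = String(call.args["element_id"] ?? "");
    return { id: call.id, name: call.name, response: { success: highlight(elementId) } };
  });
  return { toolResponse: { functionResponses } };
}
```

Each response echoes the call's `id` so the server can match it to the pending call; answering every call, including unrecognized ones, avoids leaving the model paused.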
Setting up the token server (self-hosted)
If you’re going the self-hosted route, the Gemini Live engine requires a token server to securely generate ephemeral API tokens. This keeps your Gemini API key on the server and never exposes it to the browser.
The token server is included in the ArionTalk repository under packages/token-server/. It is not published on npm — clone the repo to use it.
Install

    git clone https://github.com/luixaviles/ariontalk.git
    cd ariontalk
    pnpm install

Configure
Create a .env file with your Gemini API key:

    GEMINI_API_KEY=your-api-key-here

Run

    pnpm token-server

Or, if running from a built distribution:

    node dist/index.js

The server starts on port 3001 by default. Set the PORT environment variable to change it.
API endpoint
POST /api/token
The widget sends a JSON request with session parameters:

    {
      "model": "gemini-3.1-flash-live-preview",
      "voice": "Kore",
      "lang": "en",
      "pageTitle": "My Page",
      "pageUrl": "https://example.com",
      "pageContent": "Extracted page text..."
    }

The server builds a system prompt from the page content, creates an ephemeral token scoped to a single use with a 30-minute expiry, and returns:

    { "token": "ephemeral-token-string" }

CORS is enabled on the /api/token endpoint, so the widget can call it from any origin.
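For testing a self-hosted token server outside the widget, a client call to this endpoint might look like the sketch below. The request and response fields come from the examples above; the helper names `buildTokenRequest` and `fetchEphemeralToken` are hypothetical, and sending a `Content-Type: application/json` header is an assumption about what the server expects.

```typescript
interface TokenRequest {
  model: string;
  voice: string;
  lang: string;
  pageTitle: string;
  pageUrl: string;
  pageContent: string;
}

interface PreparedRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Hypothetical helper: build the POST /api/token request without sending it.
function buildTokenRequest(tokenServer: string, params: TokenRequest): PreparedRequest {
  // Strip trailing slashes so the base URL and path join cleanly.
  const url = `${tokenServer.replace(/\/+$/, "")}/api/token`;
  return {
    url,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" }, // assumed header
      body: JSON.stringify(params),
    },
  };
}

// Hypothetical helper: send the request and return the ephemeral token string.
async function fetchEphemeralToken(tokenServer: string, params: TokenRequest): Promise<string> {
  const { url, init } = buildTokenRequest(tokenServer, params);
  const res = await fetch(url, init);
  if (!res.ok) {
    throw new Error(`Token request failed with status ${res.status}`);
  }
  const data = (await res.json()) as { token: string };
  return data.token;
}
```

Splitting request construction from the network call keeps the payload easy to inspect and test without a running server.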
Deployment options
The token server is a lightweight Node.js HTTP server built with Hono. You can deploy it as:
- A standalone Node.js server
- A Docker container
- A serverless function (adapt the Hono app to your platform’s handler format)
Session limits and behavior
| Behavior | Details |
|---|---|
| Max duration | 15 minutes per session |
| Warning | Displayed at 12 minutes (“Session ending in 3 minutes”) |
| Auto-end | Session ends automatically at 15 minutes |
| Reconnection | Automatic reconnection with exponential backoff (up to 3 retries) |
| Session resumption | Saves a session handle for seamless reconnection after network interruptions |
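The timing rules in the table reduce to two thresholds. The sketch below shows one way to express them; the names `sessionPhase` and `minutesRemaining` are hypothetical, and the widget's own timer logic may differ.

```typescript
type SessionPhase = "active" | "warning" | "ended";

const WARNING_AT_MS = 12 * 60 * 1000; // warning shown at 12 minutes
const MAX_SESSION_MS = 15 * 60 * 1000; // session auto-ends at 15 minutes

// Hypothetical helper: map elapsed session time to a phase.
function sessionPhase(elapsedMs: number): SessionPhase {
  if (elapsedMs >= MAX_SESSION_MS) return "ended";
  if (elapsedMs >= WARNING_AT_MS) return "warning";
  return "active";
}

// Hypothetical helper: whole minutes left, rounded up, never negative.
function minutesRemaining(elapsedMs: number): number {
  return Math.max(0, Math.ceil((MAX_SESSION_MS - elapsedMs) / 60000));
}
```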
When the server sends a goAway message (requesting disconnect), the engine automatically reconnects using session resumption to continue the conversation without losing context.
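As a rough illustration of reconnection with exponential backoff, the sketch below doubles the wait between attempts and gives up after three retries. The 1-second base delay, the function names, and the `connect` callback (standing in for a reconnect that reuses the saved session handle) are all assumptions, not the engine's documented values.

```typescript
// Hypothetical helper: delays in milliseconds for each retry attempt.
function backoffDelays(baseMs: number, maxRetries: number): number[] {
  return Array.from({ length: maxRetries }, (_, attempt) => baseMs * 2 ** attempt);
}

// Hypothetical reconnect loop. `connect` resolves to true once the socket is
// re-established; `sleep` is injectable so the loop can be tested without waiting.
async function reconnectWithBackoff(
  connect: () => Promise<boolean>,
  baseMs: number = 1000, // assumed base delay
  maxRetries: number = 3,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<boolean> {
  for (const delay of backoffDelays(baseMs, maxRetries)) {
    await sleep(delay);
    if (await connect()) {
      return true;
    }
  }
  return false; // all retries exhausted
}
```

With these assumed defaults the attempts are spaced 1, 2, and 4 seconds apart, so a brief network drop is retried quickly while a longer outage fails after about 7 seconds of waiting.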
Local vs. Gemini comparison
| Feature | Local Engine | Gemini Live Engine |
|---|---|---|
| Processing | On-device (Gemini Nano) | Cloud (Gemini Live API) |
| Languages | 2 (en, es) | 12 |
| Voice selection | System TTS voices | 8 Gemini voices |
| Rate / Pitch / Volume | Yes | No |
| Barge-in | Plugin-based (Energy, Silero VAD) | Native (built into API) |
| Offline | Yes | No |
| Session limit | Unlimited | 15 minutes |
| API key required | No | Yes (via token server) |
| Initial download | ~1.7 GB model | None |
| Response quality | Good (smaller model) | Higher (cloud model) |
| Privacy | Fully on-device | Audio sent to Google servers |
| Browser support | Chrome 139+ only | Any modern browser with WebSocket + mic |