Gemini Live Engine

The Gemini Live engine connects to Google’s Gemini Live API over WebSocket for real-time, cloud-powered voice conversations. Audio streams directly between the browser and Gemini’s servers — there is no intermediate speech-to-text or text-to-speech step.

How Gemini Live works

Unlike the local engine’s three-step pipeline, Gemini Live uses a single WebSocket connection for bidirectional audio streaming:

The widget captures microphone audio as raw PCM at 16 kHz.
Audio chunks are sent in real time to the Gemini Live API via WebSocket.
Gemini processes the audio server-side and streams back audio responses.
The widget plays the response audio through the Web Audio API.

The API handles speech recognition, language understanding, and speech synthesis entirely on the server. It also provides real-time transcription of both user input and model output.
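As a rough illustration of the capture step, the sketch below converts the floating-point samples produced by the Web Audio API into 16-bit signed PCM. The 16-bit sample width and the helper names `floatToPcm16` and `samplesPerChunk` are assumptions made for this example; the text above only specifies raw PCM at 16 kHz.

```typescript
// Hypothetical helpers, not part of the ArionTalk API.
// Assumes 16-bit signed samples; the documentation only states "raw PCM at 16 kHz".
const SAMPLE_RATE_HZ = 16000;

function floatToPcm16(samples: Float32Array): Int16Array {
  const pcm = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    // Clamp to [-1, 1] so out-of-range input cannot overflow the 16-bit range.
    const s = Math.max(-1, Math.min(1, samples[i]));
    // Negative values scale towards -32768, positive values towards 32767.
    pcm[i] = Math.round(s < 0 ? s * 0x8000 : s * 0x7fff);
  }
  return pcm;
}

// Number of samples in a chunk of the given duration at 16 kHz.
function samplesPerChunk(durationMs: number): number {
  return Math.round((SAMPLE_RATE_HZ * durationMs) / 1000);
}
```

With these assumptions, a 100 ms chunk holds `samplesPerChunk(100)` samples, which is 1600 at 16 kHz.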

Capabilities

Capability	Supported
Languages	12 languages
Voice selection	Yes (8 voices)
Interactive highlights	Yes (via function calling)
Image understanding	Yes (page images sent as context)
Rate / Pitch / Volume	No
Barge-in	Native (built into API)
Offline	No
Max session duration	15 minutes
Token server required	Yes (managed by the ArionTalk cloud service when using `site-key`; self-host otherwise)

Supported languages

The Gemini Live engine supports 12 languages:

Code	Language
`en`	English
`es`	Spanish
`ja`	Japanese
`fr`	French
`de`	German
`pt`	Portuguese
`it`	Italian
`zh`	Chinese
`ko`	Korean
`hi`	Hindi
`ar`	Arabic
`ru`	Russian

Available voices

Gemini Live provides 8 built-in voices. All voices support the full set of 12 languages listed above.

Voice	ID
Kore	`Kore`
Puck	`Puck`
Charon	`Charon`
Aoede	`Aoede`
Fenrir	`Fenrir`
Leda	`Leda`
Orus	`Orus`
Zephyr	`Zephyr`

The default voice is Kore. Set a different voice using the gemini-voice attribute.
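If you set `lang` or `gemini-voice` from your own code, it can help to validate the values first. The following is a minimal sketch built from the two tables above. The helper names are hypothetical, and falling back to Kore for an unknown voice is a choice made for this example, not documented widget behavior.

```typescript
// Values copied from the language and voice tables above.
const GEMINI_LANGUAGES = [
  "en", "es", "ja", "fr", "de", "pt", "it", "zh", "ko", "hi", "ar", "ru",
] as const;
const GEMINI_VOICES = [
  "Kore", "Puck", "Charon", "Aoede", "Fenrir", "Leda", "Orus", "Zephyr",
] as const;

type GeminiLanguage = (typeof GEMINI_LANGUAGES)[number];
type GeminiVoice = (typeof GEMINI_VOICES)[number];

const DEFAULT_VOICE: GeminiVoice = "Kore";

// Hypothetical helper: true when the code appears in the language table.
function isGeminiLanguage(code: string): code is GeminiLanguage {
  return GEMINI_LANGUAGES.some((lang) => lang === code);
}

// Hypothetical helper: returns the requested voice if it is a known ID
// (exact, case-sensitive match), otherwise the default voice.
function resolveVoice(requested?: string | null): GeminiVoice {
  const match = GEMINI_VOICES.find((voice) => voice === requested);
  return match ?? DEFAULT_VOICE;
}
```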

Interactive Highlights

When interactive-highlights is enabled, Gemini calls a highlight_and_scroll function in real time as it speaks. The widget scrolls to and visually highlights the referenced page section or image — creating a guided tour experience.

The Page Indexer assigns stable IDs (sec-1, sec-2, img-1, img-2) to content elements and embeds them in the context sent to Gemini. The model then calls the tool whenever it discusses a specific section.

See the full Interactive Highlights guide for details on how the element ID system, visual behavior, and function calling architecture work.
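To make the ID scheme concrete, here is a small sketch that parses an element ID into its kind and index before a handler acts on it. It assumes IDs always follow the `sec-N` / `img-N` pattern shown in the examples above; `parseElementId` and the `ElementRef` type are hypothetical names, not part of the widget.

```typescript
// Hypothetical types and helper for illustration only.
type ElementRef = { kind: "section" | "image"; index: number };

// Returns the parsed reference, or null when the ID does not match
// the assumed "sec-N" / "img-N" pattern.
function parseElementId(id: string): ElementRef | null {
  const match = /^(sec|img)-(\d+)$/.exec(id);
  if (!match) return null;
  return {
    kind: match[1] === "sec" ? "section" : "image",
    index: Number(match[2]),
  };
}
```

Rejecting unrecognized IDs up front keeps a malformed tool call from triggering a scroll to the wrong element.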

The widget can authenticate against Gemini Live in two ways: through the ArionTalk cloud service or through a self-hosted token server.

Option 1 — ArionTalk cloud service (recommended)

<ariontalk-widget
  site-key="YOUR_SITE_KEY"
  interactive-highlights
></ariontalk-widget>

With site-key set, engine resolves to "gemini" automatically and the widget connects to the ArionTalk cloud service. Set service-url only if you need to target a different ArionTalk deployment.

Option 2 — Self-hosted token server

If you’d rather run your own token server, use the token-server attribute:

<ariontalk-widget
  engine="gemini"
  token-server="https://your-server.example.com"
  interactive-highlights
  gemini-voice="Kore"
  settings
></ariontalk-widget>

Attribute	Description
`site-key`	Site key for the ArionTalk cloud service. When present, the cloud path is used automatically.
`service-url`	Base URL for the ArionTalk cloud service. Defaults automatically when `site-key` is set.
`engine`	Set to `"gemini"` to use the Gemini Live engine. Resolves to `"gemini"` automatically when `site-key` is present.
`token-server`	URL of a self-hosted token server. Use this as an alternative to `site-key`.
`interactive-highlights`	Enable real-time content highlighting during conversations.
`gemini-voice`	Voice name (e.g., `"Kore"`, `"Puck"`).
`gemini-model`	Model identifier (defaults to `gemini-3.1-flash-live-preview`).
`settings`	Show the settings gear icon for pre-session configuration.
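The precedence described in the table can be sketched as a small resolver. This is an illustration inferred from the attribute descriptions, not the widget's actual source: `resolveGeminiAuth` and its types are hypothetical, and the `"none"` result is only this sketch's marker for the case where neither path is configured, which the table does not describe.

```typescript
// Hypothetical attribute bag mirroring the table above.
interface WidgetAttrs {
  siteKey?: string;
  serviceUrl?: string;
  engine?: string;
  tokenServer?: string;
}

type ResolvedAuth =
  | { auth: "cloud"; engine: "gemini"; siteKey: string; serviceUrl: string | undefined }
  | { auth: "self-hosted"; engine: "gemini"; tokenServer: string }
  | { auth: "none"; engine: string | undefined };

function resolveGeminiAuth(attrs: WidgetAttrs): ResolvedAuth {
  // site-key takes the cloud path and implies engine="gemini".
  if (attrs.siteKey) {
    return {
      auth: "cloud",
      engine: "gemini",
      siteKey: attrs.siteKey,
      serviceUrl: attrs.serviceUrl, // left undefined when the default applies
    };
  }
  // Without site-key, the self-hosted path needs engine="gemini" plus token-server.
  if (attrs.engine === "gemini" && attrs.tokenServer) {
    return { auth: "self-hosted", engine: "gemini", tokenServer: attrs.tokenServer };
  }
  // Neither path configured; behavior here is not covered by the table.
  return { auth: "none", engine: attrs.engine };
}
```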

Session behavior notes

Scripted greeting. The engine sends the configured greeting through the system instruction at session start.
Synchronous tool responses. Function calls (e.g. highlight_and_scroll) pause the model’s output until the engine sends a FunctionResponse; the model then resumes speaking.
Page context payload. Page text and image references are delivered as context, not as a user-turn instruction.
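The synchronous tool flow can be pictured as "receive function calls, run a handler for each, send one response per call". The sketch below builds such a reply. The message shape (`toolResponse.functionResponses` with `id`, `name`, and `response`) is modelled on the Live API's tool messages as commonly documented and should be checked against the current API reference; the `element_id` argument name and all helper names are hypothetical.

```typescript
// Assumed shapes for illustration; verify against the Live API reference.
type ToolArgs = Record<string, unknown>;
type ToolHandler = (args: ToolArgs) => ToolArgs;

interface FunctionCall {
  id: string;
  name: string;
  args: ToolArgs;
}

interface ToolResponseMessage {
  toolResponse: {
    functionResponses: { id: string; name: string; response: ToolArgs }[];
  };
}

// Builds one response per incoming call. The model stays paused until
// this message is sent, so handlers should return quickly.
function buildToolResponse(
  calls: FunctionCall[],
  handlers: Map<string, ToolHandler>,
): ToolResponseMessage {
  const functionResponses = calls.map((call) => {
    const handler = handlers.get(call.name);
    const response: ToolArgs = handler
      ? handler(call.args)
      : { error: "unknown function: " + call.name };
    // Echo the call id so the server can match response to request.
    return { id: call.id, name: call.name, response };
  });
  return { toolResponse: { functionResponses } };
}
```

Answering unknown function names with an error payload, rather than staying silent, avoids leaving the model paused indefinitely.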

Setting up the token server (self-hosted)

On the self-hosted route, you run the token server that generates ephemeral API tokens for the Gemini Live engine. Your Gemini API key stays on the server and is never exposed to the browser.

The token server is included in the ArionTalk repository under packages/token-server/. It is not published on npm — clone the repo to use it.

Install

git clone https://github.com/luixaviles/ariontalk.git
cd ariontalk
pnpm install

Configure

Create a .env file with your Gemini API key:

GEMINI_API_KEY=your-api-key-here

Run

pnpm token-server

Or, if running from a built distribution:

node dist/index.js

The server starts on port 3001 by default. Set the PORT environment variable to change it.

API endpoint

POST /api/token

The widget sends a JSON request with session parameters:

{
  "model": "gemini-3.1-flash-live-preview",
  "voice": "Kore",
  "lang": "en",
  "pageTitle": "My Page",
  "pageUrl": "https://example.com",
  "pageContent": "Extracted page text..."
}

The server builds a system prompt from the page content, creates an ephemeral token scoped to a single use with a 30-minute expiry, and returns:

{
  "token": "ephemeral-token-string"
}

CORS is enabled on the /api/token endpoint, so the widget can call it from any origin.
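For testing a self-hosted token server outside the widget, a client for the request and response shown above might look like the following sketch. The field names come from the JSON examples; `tokenEndpoint`, `requestToken`, the `FetchLike` type, and the `Content-Type` header are assumptions of this example. The fetch function is passed in so the code can run with the browser's `fetch`, Node's built-in `fetch`, or a test double.

```typescript
// Request body fields as shown in the JSON example above.
interface TokenRequest {
  model: string;
  voice: string;
  lang: string;
  pageTitle: string;
  pageUrl: string;
  pageContent: string;
}

// Minimal structural type that the standard fetch function satisfies.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Joins the server base URL and the endpoint path without doubling slashes.
function tokenEndpoint(tokenServer: string): string {
  return tokenServer.replace(/\/+$/, "") + "/api/token";
}

// Hypothetical client: posts the session parameters and returns the token string.
async function requestToken(
  tokenServer: string,
  params: TokenRequest,
  fetchFn: FetchLike,
): Promise<string> {
  const res = await fetchFn(tokenEndpoint(tokenServer), {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(params),
  });
  if (!res.ok) throw new Error("token request failed: HTTP " + res.status);
  const data = (await res.json()) as { token?: unknown };
  if (typeof data.token !== "string") throw new Error("token missing from response");
  return data.token;
}
```

Because the token is scoped to a single use, request a fresh one for each new session rather than caching it.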

Deployment options

The token server is a lightweight Node.js HTTP server built with Hono. You can deploy it as:

A standalone Node.js server
A Docker container
A serverless function (adapt the Hono app to your platform’s handler format)

Session limits and behavior

Behavior	Details
Max duration	15 minutes per session
Warning	Displayed at 12 minutes (“Session ending in 3 minutes”)
Auto-end	Session ends automatically at 15 minutes
Reconnection	Automatic reconnection with exponential backoff (up to 3 retries)
Session resumption	Saves a session handle for seamless reconnection after network interruptions
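The timing rules in the table reduce to two thresholds. The sketch below expresses them as a pure function of elapsed time; the thresholds come from the table, while the function and type names are hypothetical and say nothing about how the engine implements its timer.

```typescript
// Thresholds taken from the table above.
const WARNING_AT_MS = 12 * 60 * 1000;
const MAX_SESSION_MS = 15 * 60 * 1000;

type SessionPhase = "active" | "warning" | "ended";

// Maps elapsed session time to the phase described in the table.
function sessionPhase(elapsedMs: number): SessionPhase {
  if (elapsedMs >= MAX_SESSION_MS) return "ended";
  if (elapsedMs >= WARNING_AT_MS) return "warning";
  return "active";
}

// Whole minutes left before auto-end, rounded up and never negative.
function minutesRemaining(elapsedMs: number): number {
  return Math.max(0, Math.ceil((MAX_SESSION_MS - elapsedMs) / 60000));
}
```

At the 12-minute mark `minutesRemaining` returns 3, which matches the "Session ending in 3 minutes" warning text.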

When the server sends a goAway message (requesting disconnect), the engine automatically reconnects using session resumption to continue the conversation without losing context.
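Exponential backoff with a retry cap can be sketched as follows. The 3-retry limit comes from the table above; the 1-second base delay in the usage note, the doubling factor, and the function names are assumptions for illustration, since the engine's actual delays are not documented here.

```typescript
// Delay before each retry: baseMs, 2 * baseMs, 4 * baseMs, ...
function backoffDelays(baseMs: number, maxRetries: number): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    delays.push(baseMs * Math.pow(2, attempt));
  }
  return delays;
}

// Waits, then tries to connect, until an attempt succeeds or the delays run out.
// `connect` and `sleep` are injected so the logic can be exercised without
// a real network or real timers.
async function reconnectWithBackoff(
  connect: () => Promise<boolean>,
  delaysMs: number[],
  sleep: (ms: number) => Promise<void>,
): Promise<boolean> {
  for (const delay of delaysMs) {
    await sleep(delay);
    if (await connect()) return true;
  }
  return false;
}
```

With a 1-second base, `backoffDelays(1000, 3)` yields waits of 1, 2, and 4 seconds. In a real client, `connect` would reopen the WebSocket and present the saved session handle.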

Local vs. Gemini comparison

Feature	Local Engine	Gemini Live Engine
Processing	On-device (Gemini Nano)	Cloud (Gemini Live API)
Languages	2 (en, es)	12
Voice selection	System TTS voices	8 Gemini voices
Rate / Pitch / Volume	Yes	No
Barge-in	Plugin-based (Energy, Silero VAD)	Native (built into API)
Offline	Yes	No
Session limit	Unlimited	15 minutes
API key required	No	Yes (via token server)
Initial download	~1.7 GB model	None
Response quality	Good (smaller model)	Higher (cloud model)
Privacy	Fully on-device	Audio sent to Google servers
Browser support	Chrome 139+ only	Any modern browser with WebSocket + mic
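The browser requirement in the last row (WebSocket plus microphone access) can be checked with simple feature detection before enabling the Gemini Live engine. This is a hedged sketch: `canUseGeminiLive` and `BrowserEnv` are hypothetical names, and the check covers only API presence, not whether the user grants microphone permission.

```typescript
// Minimal structural view of the globals the check needs.
interface BrowserEnv {
  WebSocket?: unknown;
  navigator?: { mediaDevices?: { getUserMedia?: unknown } };
}

// True when both the WebSocket constructor and getUserMedia are present.
function canUseGeminiLive(env: BrowserEnv): boolean {
  const hasWebSocket = typeof env.WebSocket === "function";
  const hasMic =
    typeof env.navigator?.mediaDevices?.getUserMedia === "function";
  return hasWebSocket && hasMic;
}
```

In a page you would pass the global object, for example `canUseGeminiLive(window)`. Note that browsers expose `navigator.mediaDevices` only in secure contexts, so the check fails on plain HTTP pages.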