Gemini Live Engine

The Gemini Live engine connects to Google’s Gemini Live API over WebSocket for real-time, cloud-powered voice conversations. Audio streams directly between the browser and Gemini’s servers — there is no intermediate speech-to-text or text-to-speech step.

How Gemini Live works

Unlike the local engine’s three-step pipeline, Gemini Live uses a single WebSocket connection for bidirectional audio streaming:

  1. The widget captures microphone audio as raw PCM at 16 kHz.
  2. Audio chunks are sent in real time to the Gemini Live API over the WebSocket.
  3. Gemini processes the audio server-side and streams back audio responses.
  4. The widget plays the response audio through the Web Audio API.

The API handles speech recognition, language understanding, and speech synthesis entirely on the server. It also provides real-time transcription of both user input and model output.
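Step 1 above can be sketched as a conversion from Web Audio float samples to raw PCM. This is a minimal sketch, not the widget's actual code: the function name floatTo16BitPcm is hypothetical, and the 16-bit signed little-endian sample format is an assumption (the text above only states "raw PCM at 16 kHz"). The input is assumed to be mono and already resampled to 16 kHz.

```typescript
// Hypothetical helper: converts float samples in [-1, 1] (as produced by the
// Web Audio API) into 16-bit signed little-endian PCM.
// Assumption: the samples are already mono and sampled at 16 kHz.
function floatTo16BitPcm(samples: Float32Array): ArrayBuffer {
  const buffer = new ArrayBuffer(samples.length * 2);
  const view = new DataView(buffer);
  for (let i = 0; i < samples.length; i++) {
    // Clamp out-of-range values so they do not wrap around when converted.
    const clamped = Math.max(-1, Math.min(1, samples[i]));
    // Negative values scale to -32768, positive values to 32767.
    const scaled = clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff;
    view.setInt16(i * 2, Math.round(scaled), true); // true = little-endian
  }
  return buffer;
}
```

Each resulting buffer would then be sent as one audio chunk over the WebSocket.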

Capabilities

| Capability | Supported |
| --- | --- |
| Languages | 12 languages |
| Voice selection | Yes (8 voices) |
| Interactive highlights | Yes (via function calling) |
| Image understanding | Yes (page images sent as context) |
| Rate / Pitch / Volume | No |
| Barge-in | Native (built into API) |
| Offline | No |
| Max session duration | 15 minutes |
| Token server required | Yes (managed by the ArionTalk cloud service when using site-key; self-host otherwise) |

Supported languages

The Gemini Live engine supports 12 languages:

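| Code | Language |
| --- | --- |
| en | English |
| es | Spanish |
| ja | Japanese |
| fr | French |
| de | German |
| pt | Portuguese |
| it | Italian |
| zh | Chinese |
| ko | Korean |
| hi | Hindi |
| ar | Arabic |
| ru | Russian |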

Available voices

Gemini Live provides 8 built-in voices. All voices support the full set of 12 languages listed above.

| Voice | ID |
| --- | --- |
| Kore | Kore |
| Puck | Puck |
| Charon | Charon |
| Aoede | Aoede |
| Fenrir | Fenrir |
| Leda | Leda |
| Orus | Orus |
| Zephyr | Zephyr |

The default voice is Kore. Set a different voice using the gemini-voice attribute.

Interactive Highlights

When interactive-highlights is enabled, Gemini calls a highlight_and_scroll function in real time as it speaks. The widget scrolls to and visually highlights the referenced page section or image — creating a guided tour experience.

The Page Indexer assigns stable IDs (sec-1, sec-2, img-1, img-2) to content elements and embeds them in the context sent to Gemini. The model then calls the tool whenever it discusses a specific section.
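The ID scheme described above can be illustrated with a short sketch. It is not the Page Indexer's actual implementation: the types and the assignElementIds function are hypothetical, and the only detail taken from the text is the sec-N / img-N naming, numbered per kind in document order.

```typescript
// Hypothetical types for illustration only.
type ContentKind = "section" | "image";

interface ContentItem {
  kind: ContentKind;
  label: string; // e.g. a heading text or an image alt text
}

interface IndexedElement extends ContentItem {
  id: string; // "sec-1", "img-1", ...
}

// Assigns sequential IDs per kind, in document order. The IDs stay stable
// as long as the order of the content elements does not change.
function assignElementIds(items: ContentItem[]): IndexedElement[] {
  const counters: Record<ContentKind, number> = { section: 0, image: 0 };
  const prefixes: Record<ContentKind, string> = { section: "sec", image: "img" };
  return items.map((item) => {
    counters[item.kind] += 1;
    return { ...item, id: `${prefixes[item.kind]}-${counters[item.kind]}` };
  });
}
```

The resulting IDs are what the model would pass back when it calls the highlight tool.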

See the full Interactive Highlights guide for details on how the element ID system, visual behavior, and function calling architecture work.

Widget configuration

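There are two ways to authenticate against Gemini Live: a site key for the ArionTalk cloud service, or a self-hosted token server.

Option 1: ArionTalk cloud service (site key)

Register your site at ariontalk.com to get a site key. There is no server to run.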

<ariontalk-widget
  site-key="YOUR_SITE_KEY"
  interactive-highlights
></ariontalk-widget>

With site-key set, engine resolves to "gemini" automatically and the widget points at the cloud service. Override service-url if you need to point at a different ArionTalk deployment.

Option 2: Self-hosted token server

If you’d rather run your own token server, use the token-server attribute:

<ariontalk-widget
  engine="gemini"
  token-server="https://your-server.example.com"
  interactive-highlights
  gemini-voice="Kore"
  settings
></ariontalk-widget>

| Attribute | Description |
| --- | --- |
| site-key | Site key for the ArionTalk cloud service. When present, the cloud path is used automatically. |
| service-url | Base URL for the ArionTalk cloud service. Defaulted automatically when site-key is set. |
| engine | Set to "gemini" to use the Gemini Live engine. Resolves to "gemini" automatically when site-key is present. |
| token-server | URL of a self-hosted token server. Use this as an alternative to site-key. |
| interactive-highlights | Enable real-time content highlighting during conversations. |
| gemini-voice | Voice name (e.g., "Kore", "Puck"). |
| gemini-model | Model identifier (defaults to gemini-3.1-flash-live-preview). |
| settings | Show the settings gear icon for pre-session configuration. |

Session behavior notes

  • Scripted greeting. The engine sends the configured greeting via system instruction at session start.
  • Synchronous tool responses. Function calls (e.g. highlight_and_scroll) pause the model’s output until the engine sends a FunctionResponse; the model then resumes speaking.
  • Page context payload. Page text and image references are delivered as context, not as a user-turn instruction.
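The synchronous tool response flow can be sketched as follows. This is an illustration under stated assumptions, not the engine's code: the handleToolCalls function and the elementId argument name are hypothetical, and the toolResponse / functionResponses envelope reflects our understanding of the Live API's JSON message format and should be verified against the official reference.

```typescript
// Shapes assumed for illustration; check the Live API reference for the
// authoritative message format.
interface FunctionCall {
  id: string;
  name: string;
  args: Record<string, unknown>;
}

interface FunctionResponse {
  id: string;
  name: string;
  response: Record<string, unknown>;
}

// Runs each function call and builds the reply the model waits for.
// `highlight` is a hypothetical callback that scrolls to and highlights an
// element, returning true when the element exists.
function handleToolCalls(
  calls: FunctionCall[],
  highlight: (elementId: string) => boolean,
): { toolResponse: { functionResponses: FunctionResponse[] } } {
  const functionResponses = calls.map((call): FunctionResponse => {
    if (call.name !== "highlight_and_scroll") {
      return { id: call.id, name: call.name, response: { error: "unknown function" } };
    }
    // "elementId" is an assumed argument name (e.g. "sec-2" or "img-1").
    const elementId = String(call.args["elementId"] ?? "");
    const success = elementId !== "" && highlight(elementId);
    return { id: call.id, name: call.name, response: { success } };
  });
  return { toolResponse: { functionResponses } };
}
```

Because the model's output is paused until the response arrives, the handler should answer every call, including ones it does not recognize.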

Setting up the token server (self-hosted)

If you’re going the self-hosted route, the Gemini Live engine requires a token server to securely generate ephemeral API tokens. This keeps your Gemini API key on the server and never exposes it to the browser.

The token server is included in the ArionTalk repository under packages/token-server/. It is not published on npm — clone the repo to use it.

Install

git clone https://github.com/luixaviles/ariontalk.git
cd ariontalk
pnpm install

Configure

Create a .env file with your Gemini API key:

GEMINI_API_KEY=your-api-key-here

Run

pnpm token-server

Or, if running from a built distribution:

node dist/index.js

The server starts on port 3001 by default. Set the PORT environment variable to change it.

API endpoint

POST /api/token

The widget sends a JSON request with session parameters:

{
  "model": "gemini-3.1-flash-live-preview",
  "voice": "Kore",
  "lang": "en",
  "pageTitle": "My Page",
  "pageUrl": "https://example.com",
  "pageContent": "Extracted page text..."
}

The server builds a system prompt from the page content, creates an ephemeral token scoped to a single use with a 30-minute expiry, and returns:

{
  "token": "ephemeral-token-string"
}

CORS is enabled on the /api/token endpoint, so the widget can call it from any origin.
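For reference, a client-side call to this endpoint might look like the sketch below. The widget already does this internally, so the code is only useful for testing a self-hosted server. All function and type names here are hypothetical; only the POST /api/token path and the request and response fields come from the description above. The fetch implementation is passed in as a parameter so the sketch does not depend on a particular runtime.

```typescript
// Request fields as documented for POST /api/token.
interface TokenRequest {
  model: string;
  voice: string;
  lang: string;
  pageTitle: string;
  pageUrl: string;
  pageContent: string;
}

interface RequestInitLike {
  method: string;
  headers: Record<string, string>;
  body: string;
}

type FetchLike = (
  url: string,
  init: RequestInitLike,
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Builds the URL and request options; trailing slashes on the base are dropped.
function buildTokenRequest(tokenServer: string, params: TokenRequest) {
  const url = `${tokenServer.replace(/\/+$/, "")}/api/token`;
  const init: RequestInitLike = {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(params),
  };
  return { url, init };
}

// Extracts the token and fails loudly if the response has no usable token.
function parseTokenResponse(data: unknown): string {
  const token = (data as { token?: unknown } | null)?.token;
  if (typeof token !== "string" || token === "") {
    throw new Error("Token response did not contain a token");
  }
  return token;
}

async function requestEphemeralToken(
  tokenServer: string,
  params: TokenRequest,
  fetchImpl: FetchLike,
): Promise<string> {
  const { url, init } = buildTokenRequest(tokenServer, params);
  const res = await fetchImpl(url, init);
  if (!res.ok) throw new Error(`Token request failed with status ${res.status}`);
  return parseTokenResponse(await res.json());
}
```

In a browser, the global fetch can be passed as fetchImpl. Since the token is scoped to a single use, request a fresh one for each new session rather than caching it.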

Deployment options

The token server is a lightweight Node.js HTTP server built with Hono. You can deploy it as:

  • A standalone Node.js server
  • A Docker container
  • A serverless function (adapt the Hono app to your platform’s handler format)

Session limits and behavior

| Behavior | Details |
| --- | --- |
| Max duration | 15 minutes per session |
| Warning | Displayed at 12 minutes (“Session ending in 3 minutes”) |
| Auto-end | Session ends automatically at 15 minutes |
| Reconnection | Automatic reconnection with exponential backoff (up to 3 retries) |
| Session resumption | Saves a session handle for seamless reconnection after network interruptions |
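The timing rules in this table can be expressed as a small sketch. The function and constant names are hypothetical and this is not the engine's code; only the 12-minute warning and the 15-minute limit come from the table.

```typescript
const WARNING_AT_MS = 12 * 60 * 1000; // warning shown at 12 minutes
const MAX_SESSION_MS = 15 * 60 * 1000; // session auto-ends at 15 minutes

type SessionPhase = "active" | "warning" | "ended";

// Maps elapsed session time to the phase the widget should be in.
function sessionPhase(elapsedMs: number): SessionPhase {
  if (elapsedMs >= MAX_SESSION_MS) return "ended";
  if (elapsedMs >= WARNING_AT_MS) return "warning";
  return "active";
}

// Whole minutes left before auto-end, rounded up and never negative.
function minutesRemaining(elapsedMs: number): number {
  return Math.max(0, Math.ceil((MAX_SESSION_MS - elapsedMs) / 60000));
}
```

At the 12-minute mark, minutesRemaining returns 3, which matches the warning text in the table.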

When the server sends a goAway message (requesting disconnect), the engine automatically reconnects using session resumption to continue the conversation without losing context.
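A reconnection loop with exponential backoff could look roughly like this. It is a sketch under assumptions: the 1-second base delay, the doubling factor, and all function names are hypothetical. Only the limit of 3 retries and the reuse of a saved session handle come from the description above.

```typescript
const MAX_RETRIES = 3; // from the table above

// Delay before the given attempt (0-based): 1 s, 2 s, 4 s with the assumed base.
function backoffDelayMs(attempt: number, baseMs = 1000): number {
  return baseMs * 2 ** attempt;
}

// Tries to reconnect up to MAX_RETRIES times, passing the saved session
// handle so the server can resume the conversation. `connect` and `sleep`
// are injected so the loop can be exercised without a real network.
async function reconnectWithBackoff<T>(
  connect: (sessionHandle: string | undefined) => Promise<T>,
  sessionHandle: string | undefined,
  sleep: (ms: number) => Promise<void>,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < MAX_RETRIES; attempt++) {
    await sleep(backoffDelayMs(attempt));
    try {
      return await connect(sessionHandle);
    } catch (err) {
      lastError = err; // remember the failure and try again
    }
  }
  throw lastError;
}
```

A typical sleep implementation is (ms) => new Promise((resolve) => setTimeout(resolve, ms)). Adding random jitter to the delay is a common refinement when many clients may reconnect at the same moment.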

Local vs. Gemini comparison

| Feature | Local Engine | Gemini Live Engine |
| --- | --- | --- |
| Processing | On-device (Gemini Nano) | Cloud (Gemini Live API) |
| Languages | 2 (en, es) | 12 |
| Voice selection | System TTS voices | 8 Gemini voices |
| Rate / Pitch / Volume | Yes | No |
| Barge-in | Plugin-based (Energy, Silero VAD) | Native (built into API) |
| Offline | Yes | No |
| Session limit | Unlimited | 15 minutes |
| API key required | No | Yes (via token server) |
| Initial download | ~1.7 GB model | None |
| Response quality | Good (smaller model) | Higher (cloud model) |
| Privacy | Fully on-device | Audio sent to Google servers |
| Browser support | Chrome 139+ only | Any modern browser with WebSocket + mic |