@ariontalk/engine-gemini

@ariontalk/engine-gemini is a cloud-based alternative to the default local engine. It connects to the Gemini Live API for real-time, bidirectional voice conversations and streams audio directly between the browser and Gemini servers, so no on-device model download is required.

Installation

pnpm add @ariontalk/engine-gemini

Peer dependency: @ariontalk/core

GeminiEngine Class

import { GeminiEngine } from '@ariontalk/engine-gemini';

const engine = new GeminiEngine({
  tokenServerUrl: 'https://my-server.example.com',
  model: 'gemini-3.1-flash-live-preview',
  voice: 'Kore',
});

GeminiEngine implements VoiceEngineInterface from @ariontalk/core.

Constructor Options

interface GeminiEngineOptions {
  /** URL of the token server that mints ephemeral Gemini tokens. Required. */
  tokenServerUrl: string;
  /** Gemini model identifier. Defaults to DEFAULT_MODEL ('gemini-3.1-flash-live-preview'). */
  model?: string;
  /** Gemini voice name for audio output. Defaults to 'Kore'. */
  voice?: string;
  /** Custom PageExtractorService instance. A new one is created if omitted. */
  pageExtractor?: PageExtractorService;
}

Capabilities

Capability	Value
`supportedLanguages`	`['en', 'es', 'ja', 'fr', 'de', 'pt', 'it', 'zh', 'ko', 'hi', 'ar', 'ru']`
`supportsVoiceSelection`	`true`
`supportsRatePitchVolume`	`false`
`supportsBargeInPlugins`	`false`
`supportsOffline`	`false`
`maxSessionDurationSec`	`900` (15 minutes)
`requiresTokenServer`	`true`

A session-expiry warning is shown at 12 minutes. The session is automatically ended at 15 minutes.
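
The two thresholds can be read as a function of elapsed session time. The following is a minimal sketch, not the engine's internal code: `sessionPhase`, `WARNING_AT_SEC`, and `MAX_SESSION_SEC` are hypothetical names, and only the 12- and 15-minute values come from the text above.

```typescript
// Hypothetical names; only the 12- and 15-minute thresholds are documented.
const WARNING_AT_SEC = 12 * 60; // 720 s: session-expiry warning
const MAX_SESSION_SEC = 15 * 60; // 900 s: matches maxSessionDurationSec

type SessionPhase = 'active' | 'warning' | 'expired';

// Maps elapsed session time (in seconds) to the phase the UI should reflect.
function sessionPhase(elapsedSec: number): SessionPhase {
  if (elapsedSec >= MAX_SESSION_SEC) return 'expired';
  if (elapsedSec >= WARNING_AT_SEC) return 'warning';
  return 'active';
}
```

Keeping the check pure makes it easy to drive from a single timer and to unit-test without waiting for real time to pass.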

Reconnection

The engine automatically reconnects on WebSocket disconnection using exponential backoff (2 s, 4 s, 8 s) for up to 3 retries. When the server sends a goAway message, the engine reconnects immediately, without a backoff delay. Session resumption handles are preserved across reconnects to maintain conversation context.
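
The retry schedule described above can be sketched as a small delay function. This is an illustration under stated assumptions, not the engine's implementation: `reconnectDelayMs`, `BASE_DELAY_MS`, and `MAX_RETRIES` are hypothetical names, and treating a goAway reconnect as a zero-delay attempt outside the retry budget is an assumption.

```typescript
// Hypothetical names; the 2 s / 4 s / 8 s schedule and the 3-retry cap are documented.
const BASE_DELAY_MS = 2000;
const MAX_RETRIES = 3;

// Returns the delay before the given retry attempt (1-based),
// or null once the retry budget is exhausted.
// Assumption: a server goAway triggers an immediate reconnect (0 ms).
function reconnectDelayMs(attempt: number, goAway = false): number | null {
  if (goAway) return 0;
  if (attempt < 1 || attempt > MAX_RETRIES) return null;
  return BASE_DELAY_MS * 2 ** (attempt - 1);
}
```

A caller would typically pass the result to `setTimeout` and surface an error to the user when it receives `null`.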

Session lifecycle behavior

Scripted greeting via system instruction. The engine delivers the configured greeting as part of the system instruction rather than as an opening turn.
Synchronous tool responses. Function calls (e.g. highlight_and_scroll) pause the model’s output until the engine sends a FunctionResponse; the model then resumes speaking.
Page context as context, not as a turn. The page title, URL, and extracted content are sent as context metadata rather than as a user-turn instruction.
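
The synchronous tool-response pattern can be illustrated with a small dispatcher that turns each pending function call into a matching response. This is a sketch only: the `FunctionCall` and `FunctionResponse` shapes below are simplified assumptions defined locally for the example, and `buildFunctionResponses` and the handler map are hypothetical, not part of the package's API.

```typescript
// Simplified, locally defined shapes (assumptions for illustration).
interface FunctionCall {
  id: string;
  name: string;
  args: Record<string, unknown>;
}

interface FunctionResponse {
  id: string;
  name: string;
  response: Record<string, unknown>;
}

type ToolHandler = (args: Record<string, unknown>) => Record<string, unknown>;

// Runs each call's handler and pairs the result with the call's id,
// so that every call the model is waiting on receives exactly one response.
function buildFunctionResponses(
  calls: FunctionCall[],
  handlers: Map<string, ToolHandler>,
): FunctionResponse[] {
  return calls.map((call) => {
    const handler = handlers.get(call.name);
    const response = handler
      ? handler(call.args)
      : { error: `Unknown tool: ${call.name}` };
    return { id: call.id, name: call.name, response };
  });
}
```

Answering unknown tools with an error payload, rather than dropping them, matters here: because the model's output is paused until a response arrives, an unanswered call would leave the conversation stalled.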

Constants

Default Model

const DEFAULT_MODEL = 'gemini-3.1-flash-live-preview';

Available Voices

const GEMINI_VOICES: VoiceInfo[] = [
  { id: 'Kore',   name: 'Kore',   lang: 'en', local: false },
  { id: 'Puck',   name: 'Puck',   lang: 'en', local: false },
  { id: 'Charon', name: 'Charon', lang: 'en', local: false },
  { id: 'Aoede',  name: 'Aoede',  lang: 'en', local: false },
  { id: 'Fenrir', name: 'Fenrir', lang: 'en', local: false },
  { id: 'Leda',   name: 'Leda',   lang: 'en', local: false },
  { id: 'Orus',   name: 'Orus',   lang: 'en', local: false },
  { id: 'Zephyr', name: 'Zephyr', lang: 'en', local: false },
];

Supported Languages

['en', 'es', 'ja', 'fr', 'de', 'pt', 'it', 'zh', 'ko', 'hi', 'ar', 'ru']

Internal Components

These are not exported but are relevant for understanding the engine architecture.

Component	Description
`AudioCapture`	Captures microphone audio via `getUserMedia` and an `AudioWorklet`. Outputs 16 kHz mono PCM in 20 ms chunks, base64-encoded, sent to the Gemini session as realtime input.
`AudioPlayback`	Receives base64-encoded PCM audio from the Gemini server, decodes it, and plays it through an `AudioContext`. Tracks buffered duration for transcript synchronization.
`TokenManager`	Requests ephemeral authentication tokens from the token server. Sends page context (title, URL, content) and session configuration (model, voice, language) so the server can bake a system instruction into the token.
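
To make the `AudioCapture` description more concrete, the sketch below shows how one 20 ms chunk at 16 kHz (320 samples) might be converted and encoded. It is not the package's internal code: the function names are hypothetical, and the 16-bit sample width is an assumption, since the table states only "PCM".

```typescript
// 16 kHz and 20 ms come from the table above; everything else is illustrative.
const SAMPLE_RATE_HZ = 16000;
const CHUNK_MS = 20;
const SAMPLES_PER_CHUNK = (SAMPLE_RATE_HZ * CHUNK_MS) / 1000; // 320 samples

// Converts Web Audio float samples in [-1, 1] to signed 16-bit PCM.
// Assumption: 16-bit samples; out-of-range input is clamped.
function floatToPcm16(samples: Float32Array): Int16Array {
  const pcm = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return pcm;
}

// Base64-encodes the raw PCM bytes in the platform's byte order
// (little-endian on all common browser platforms).
function pcm16ToBase64(pcm: Int16Array): string {
  const bytes = new Uint8Array(pcm.buffer, pcm.byteOffset, pcm.byteLength);
  let binary = '';
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}
```

Under the 16-bit assumption, each chunk is 640 bytes before encoding, which keeps individual realtime-input messages small and latency low.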