Voice Agents

Real-time bidirectional voice conversation. Audio flows in from the microphone and out to the speaker simultaneously, with tool calling in between.

Overview

Voice agents use the StrandsBidiStreaming module, which provides a BidiAgent that manages an audio session with persistent duplex streaming. Unlike text agents, voice agents maintain a single long-lived connection to the model where audio chunks flow continuously in both directions.

Microphone → audio chunks → BidiAgent → audio response → Speaker
                                ↕
                  Tools (called mid-conversation)

Cloud Backends

Cloud backends connect to a hosted speech model over WebSocket or a proprietary real-time protocol. The model handles speech-to-text, reasoning, and text-to-speech in a single round trip.

OpenAI Realtime

OpenAI Realtime API (Swift)
import StrandsBidiStreaming

/// Get the current time.
@Tool func currentTime() -> String { ISO8601DateFormatter().string(from: Date()) }

let agent = BidiAgent(
    model: OpenAIRealtimeModel(model: "gpt-4o-realtime-preview"),
    tools: [currentTime],
    config: BidiSessionConfig(
        voice: "alloy",
        systemPrompt: "You are a helpful voice assistant."
    )
)

try await agent.start()

// Send mic audio
Task {
    for await chunk in mic.audioStream {
        try await agent.send(.audio(chunk, format: .openAI))
    }
}

// Receive and play responses
for try await event in agent.receive() {
    switch event {
    case .audio(let data, _):
        speaker.play(data)
    case .transcript(let text):
        print("Agent said: \(text)")
    case .toolUse(let use):
        print("Calling tool: \(use.name)")
    default:
        break
    }
}
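The `mic` and `speaker` objects in the examples are assumed helpers, not part of the module. On Apple platforms they could be sketched with AVAudioEngine as below; the sample rate and float-PCM format used here are assumptions and must be converted to whatever the chosen backend actually expects (for example, 24 kHz mono PCM16 for OpenAI Realtime).

```swift
import AVFoundation

// Sketch of the `mic` helper: taps the default input node and yields raw
// float sample buffers as Data. Real code should resample/convert to the
// backend's required wire format before sending.
final class Microphone {
    private let engine = AVAudioEngine()

    var audioStream: AsyncStream<Data> {
        AsyncStream { continuation in
            let input = engine.inputNode
            let format = input.outputFormat(forBus: 0)
            input.installTap(onBus: 0, bufferSize: 2048, format: format) { buffer, _ in
                guard let channel = buffer.floatChannelData?[0] else { return }
                continuation.yield(Data(bytes: channel,
                                        count: Int(buffer.frameLength) * MemoryLayout<Float>.size))
            }
            try? engine.start()
            continuation.onTermination = { [engine] _ in
                engine.inputNode.removeTap(onBus: 0)
                engine.stop()
            }
        }
    }
}

// Sketch of the `speaker` helper: schedules received float PCM for playback.
// Assumes 24 kHz mono 32-bit float; adjust to match the agent's output format.
final class Speaker {
    private let engine = AVAudioEngine()
    private let player = AVAudioPlayerNode()
    private let format = AVAudioFormat(standardFormatWithSampleRate: 24_000, channels: 1)!

    init() throws {
        engine.attach(player)
        engine.connect(player, to: engine.mainMixerNode, format: format)
        try engine.start()
        player.play()
    }

    func play(_ data: Data) {
        let frames = AVAudioFrameCount(data.count / MemoryLayout<Float>.size)
        guard let buffer = AVAudioPCMBuffer(pcmFormat: format, frameCapacity: frames) else { return }
        buffer.frameLength = frames
        data.withUnsafeBytes { raw in
            guard let src = raw.bindMemory(to: Float.self).baseAddress else { return }
            buffer.floatChannelData![0].update(from: src, count: Int(frames))
        }
        player.scheduleBuffer(buffer)
    }
}
```

In a real app you would also handle interruptions (route changes, the user talking over the agent) and backpressure; this sketch only covers the happy path.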

AWS Nova Sonic

AWS Nova Sonic (Swift)
import StrandsBidiStreaming

let agent = BidiAgent(
    model: NovaSonicModel(config: NovaSonicConfig(
        region: "us-east-1",
        voice: "tiffany"
    )),
    tools: [currentTime],  // e.g. the tool defined in the OpenAI example above
    config: BidiSessionConfig(systemPrompt: "You are a voice assistant.")
)

try await agent.start()
// same send/receive pattern as above

Google Gemini Live

Gemini Live API (Swift)
import StrandsBidiStreaming

let agent = BidiAgent(
    model: GeminiLiveModel(model: "gemini-2.0-flash-live-001"),
    config: BidiSessionConfig(voice: "Puck")
)
try await agent.start()
// same send/receive pattern as above

On-Device Voice (MLX)

Apple Silicon only

Run the full voice pipeline on-device: STT, LLM, and TTS all on Apple Silicon. No network required after the initial model download.

Fully local voice agent (Swift)
import StrandsMLXBidiProvider

// Load models (downloaded from HuggingFace and cached locally)
let sttModel = try await MLXSTTProcessor.load(model: glmASRModel)
let ttsModel = try await MLXTTSProcessor.load(model: sopranoModel)

let agent = MLXBidiFactory.createAgent(
    llmProcessor: MLXLLMProcessor(modelId: "mlx-community/Qwen3-8B-4bit"),
    sttProcessor: sttModel,
    ttsProcessor: ttsModel,
    tools: [currentTime, calculator],
    systemPrompt: "You are a helpful on-device assistant."
)

try await agent.start()
// same send/receive pattern
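The `calculator` tool passed to the agent above is not defined anywhere in this page. Following the `@Tool` convention from the earlier `currentTime` example, a minimal sketch might look like the following; the use of NSExpression for evaluation is an illustrative assumption, and production code should validate input rather than evaluate arbitrary strings.

```swift
import Foundation

/// Evaluate a simple arithmetic expression, e.g. "2 + 2 * 10".
@Tool func calculator(expression: String) -> String {
    // NSExpression handles basic arithmetic; it raises on malformed input,
    // so a real implementation should sanitize `expression` first.
    let result = NSExpression(format: expression)
        .expressionValue(with: nil, context: nil)
    return (result as? NSNumber)?.stringValue ?? "Could not evaluate \(expression)"
}
```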

Local Pipeline Components

Component         Role                     Example model
MLXSTTProcessor   Speech to text           GLM ASR, Parakeet
MLXLLMProcessor   Language model + tools   Qwen3-8B-4bit
MLXTTSProcessor   Text to speech           Soprano, Marvis

Supported Backends

Backend             Module                  Platform
OpenAI Realtime     StrandsBidiStreaming    macOS, iOS
AWS Nova Sonic      StrandsBidiStreaming    macOS, iOS
Google Gemini Live  StrandsBidiStreaming    macOS, iOS
MLX (local)         StrandsMLXBidiProvider  macOS (Apple Silicon)