Local Inference

Run language models entirely on Apple Silicon. No network, no API keys, no data leaves the device.

ℹ️

MLX inference requires Apple Silicon (M1 or later). It is not supported on Intel Macs, iOS, or tvOS. Use xcodebuild or Xcode to run; Metal library loading fails with swift run.

MLXProvider

Pass an MLXProvider to your agent. Models are downloaded from HuggingFace and cached locally on first run:

local_agent.swift (Swift)
import StrandsAgents

let agent = Agent(
    model: MLXProvider(modelId: "mlx-community/Qwen3-8B-4bit"),
    tools: [calculator, wordCount]
)

let result = try await agent.run("What is 1234 * 5678?")
print(result.message.textContent)  // "7,006,652"

Model                                         Size    Use case
mlx-community/Qwen3-8B-4bit                   ~5 GB   General purpose, tool calling
mlx-community/Qwen3-14B-4bit                  ~9 GB   More capable, slower
mlx-community/Mistral-7B-Instruct-v0.3-4bit   ~4 GB   Instruction following
⚠️

The qwen3_5 architecture is not yet supported in the current MLX Swift LM version. Use qwen3 models (e.g. Qwen3-8B, Qwen3-14B).

Download progress

Models are downloaded from HuggingFace on first use and cached locally. Pass an onDownloadProgress callback to track progress. The closure receives a Double from 0.0 to 1.0 and is not called if the model is already cached.

Basic usage (Swift)
let provider = MLXProvider(modelId: "mlx-community/Qwen3-8B-4bit") { progress in
    print("Downloading: \(Int(progress * 100))%")
}

In SwiftUI, bind the callback to a @State variable to drive a ProgressView:

SwiftUI progress bar (Swift)
@State private var downloadProgress: Double = 1.0  // stays at 1.0 (bar hidden) if the model is already cached, since the callback never fires

var body: some View {
    if downloadProgress < 1.0 {
        ProgressView("Downloading model...", value: downloadProgress)
    }
}

// When creating the provider:
let provider = MLXProvider(modelId: "mlx-community/Qwen3-8B-4bit") { progress in
    Task { @MainActor in
        downloadProgress = progress
    }
}

Hybrid Routing

A HybridRouter sits between your agent and its models. For each request it evaluates a policy and picks either the local (MLX) or cloud provider. The agent code is identical either way.

This is useful when you want fast, private responses for simple tasks and cloud-quality reasoning for complex ones, without writing any branching logic yourself.

Request → HybridRouter → policy says local → MLXProvider
                       → policy says cloud → BedrockProvider

Basic setup (Swift)
import StrandsAgents

let agent = Agent(
    router: HybridRouter(
        local: MLXProvider(modelId: "mlx-community/Qwen3-8B-4bit"),
        cloud: try BedrockProvider(config: BedrockConfig(
            modelId: "us.anthropic.claude-sonnet-4-20250514-v1:0"
        )),
        policy: LatencySensitivePolicy()
    ),
    tools: [myTools]
)

Built-in policies

Policy                   Best for
AlwaysCloudPolicy        Cloud only. Default when no policy is specified.
AlwaysLocalPolicy        Local only. Useful for fully offline apps.
LatencySensitivePolicy   Most apps. Prefers cloud by default but routes locally when the user requests low latency or privacy, or when the device is plugged in and not throttled.
FallbackPolicy           Local-first apps. Prefers local by default and only falls back to cloud when the device is throttled, low on memory, or the previous local inference was too slow.
💡

Not sure which to pick? Start with LatencySensitivePolicy -- it's the recommended general-purpose option. Use FallbackPolicy if local inference is your default and cloud is the safety net.
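
For a fully offline build, or an in-app "offline only" switch, the setup is the same shape as above with the policy swapped. A minimal sketch; it assumes AlwaysLocalPolicy has a no-argument initializer, that HybridRouter's policy parameter accepts an any RoutingPolicy existential, and that the UserDefaults key is purely hypothetical:

Offline switch (Swift)
// Hypothetical user setting: force every request on-device
let offlineOnly = UserDefaults.standard.bool(forKey: "offlineOnly")
let policy: any RoutingPolicy = offlineOnly ? AlwaysLocalPolicy() : LatencySensitivePolicy()

let agent = Agent(
    router: HybridRouter(
        local: MLXProvider(modelId: "mlx-community/Qwen3-8B-4bit"),
        cloud: try BedrockProvider(config: BedrockConfig(
            modelId: "us.anthropic.claude-sonnet-4-20250514-v1:0"
        )),
        policy: policy
    ),
    tools: [myTools]
)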

Routing hints

agent.routingHints is read on every call. Set it before run() or stream() to influence the routing decision:

Per-call hints (Swift)
// Keep this prompt on-device — LatencySensitivePolicy will always route local
agent.routingHints = RoutingHints(privacySensitive: true)
let privateResult = try await agent.run("Summarize my health records")

// This task needs a strong model — FallbackPolicy and LatencySensitivePolicy both route cloud
agent.routingHints = RoutingHints(requiresDeepReasoning: true)
let deepResult = try await agent.run("Explain the proof of Fermat's Last Theorem step by step")

// Bypass the policy entirely for one call
agent.routingHints = RoutingHints(forceProvider: .local)
let localResult = try await agent.run("Quick offline calculation: 42 * 17")

// Reset to defaults — let the policy decide again
agent.routingHints = RoutingHints()

Available hint flags

Hint                    Type              Effect
preferLowLatency        Bool              Prefer the fastest provider even at lower capability
privacySensitive        Bool              Prompt must not leave the device
requiresDeepReasoning   Bool              Task needs a large, capable model
forceProvider           .local / .cloud   Bypass the policy entirely
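
The earlier per-call example shows every flag except preferLowLatency. The same pattern applies; this sketch assumes RoutingHints exposes it as an initializer parameter like the other Bool flags:

Low-latency hint (Swift)
// Interactive feature: a fast on-device draft beats a slower cloud answer
agent.routingHints = RoutingHints(preferLowLatency: true)
let draft = try await agent.run("Suggest a one-line commit message for this diff")

// Reset so later calls use the policy's defaults again
agent.routingHints = RoutingHints()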

Device capabilities

Policies receive live device state on every routing call. All values are read fresh each time:

Available in RoutingContext.deviceCapabilities (Swift)
context.deviceCapabilities.hasNeuralEngine    // true on Apple Silicon (arm64)
context.deviceCapabilities.availableMemoryGB  // free + speculative RAM right now (vm_statistics64)
context.deviceCapabilities.thermalState       // .nominal / .fair / .serious / .critical (ProcessInfo)
context.deviceCapabilities.isPluggedIn        // requires DeviceCapabilities.isPluggedInProvider (see below)
ℹ️

isPluggedIn requires a platform-specific hook because StrandsAgents only imports Foundation. Set DeviceCapabilities.isPluggedInProvider once at app startup:

// iOS (AppDelegate or @main); needs import UIKit:
UIDevice.current.isBatteryMonitoringEnabled = true
DeviceCapabilities.isPluggedInProvider = {
    UIDevice.current.batteryState != .unplugged
}

// macOS (using IOKit; add IOKit to your app target, not to StrandsAgents); needs import IOKit.ps:
DeviceCapabilities.isPluggedInProvider = {
    // takeRetainedValue() releases the dictionary the Copy function retains
    IOPSCopyExternalPowerAdapterDetails()?.takeRetainedValue() != nil
}

Until you set this, isPluggedIn defaults to true.

Writing a custom policy

Conform to RoutingPolicy and implement one method. Return true to use local inference, false to use cloud.

The full RoutingContext is available to your policy, including:

  • context.deviceCapabilities -- thermal state, free memory, power source, Neural Engine
  • context.estimatedPromptTokens -- rough token estimate (character count / 4), useful for routing long contexts to cloud
  • context.lastInferenceLatencyMs -- latency of the previous model call in ms (nil on first call), useful for falling back if local is running slowly
  • context.hints -- the RoutingHints set by the caller
Example: route by prompt length + thermal state (Swift)
struct SmartPolicy: RoutingPolicy {
    let shortPromptThreshold = 200  // characters

    func shouldUseLocal(context: RoutingContext) -> Bool {
        // Never run locally if the device is throttling
        guard context.deviceCapabilities.thermalState != .critical else { return false }

        // Always respect an explicit privacy request
        if context.hints.privacySensitive { return true }

        // Route deep reasoning tasks to cloud
        if context.hints.requiresDeepReasoning { return false }

        // Short prompts with no tools: fast local response
        let promptLength = context.messages.last?.textContent?.count ?? 0
        let hasTools = !(context.toolSpecs?.isEmpty ?? true)
        return promptLength < shortPromptThreshold && !hasTools
    }
}

let agent = Agent(
    router: HybridRouter(
        local: MLXProvider(modelId: "mlx-community/Qwen3-8B-4bit"),
        cloud: try BedrockProvider(config: BedrockConfig(
            modelId: "us.anthropic.claude-sonnet-4-20250514-v1:0"
        )),
        policy: SmartPolicy()
    )
)
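
SmartPolicy leaves two of the RoutingContext fields unused: estimatedPromptTokens and lastInferenceLatencyMs. A second sketch that routes on those signals; the thresholds are illustrative, not tuned values:

Example: adaptive fallback (Swift)
struct AdaptivePolicy: RoutingPolicy {
    func shouldUseLocal(context: RoutingContext) -> Bool {
        // Long contexts go to cloud: local prefill slows sharply with prompt size
        if context.estimatedPromptTokens > 2_000 { return false }

        // nil on the first call; a slow previous call suggests the device is struggling
        if let lastMs = context.lastInferenceLatencyMs, lastMs > 4_000 { return false }

        // On battery power, defer to cloud rather than sustain heavy GPU load
        return context.deviceCapabilities.isPluggedIn
    }
}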
💡

You can also implement the full ModelRouter protocol directly for cases where you need async logic, external signals (A/B flags, remote config), or routing to more than two providers.

Full custom router

If the two-provider HybridRouter structure is too limiting, implement ModelRouter directly. You receive the full RoutingContext and can return any ModelProvider.

Router with three providers (Swift)
struct TieredRouter: ModelRouter {
    let nano: any ModelProvider    // fast local, small tasks
    let standard: any ModelProvider // cloud, general tasks
    let powerful: any ModelProvider // large cloud model, hard tasks

    func route(context: RoutingContext) async throws -> any ModelProvider {
        if context.hints.forceProvider == .local { return nano }
        if context.hints.requiresDeepReasoning   { return powerful }

        let wordCount = context.messages.reduce(0) { $0 + ($1.textContent?.split(separator: " ").count ?? 0) }
        return wordCount < 50 ? nano : standard
    }
}

let agent = Agent(
    router: TieredRouter(
        nano:      MLXProvider(modelId: "mlx-community/Qwen3-4B-4bit"),
        standard:  try BedrockProvider(config: BedrockConfig(modelId: "us.anthropic.claude-haiku-3-5-20241022-v1:0")),
        powerful:  try BedrockProvider(config: BedrockConfig(modelId: "us.anthropic.claude-sonnet-4-20250514-v1:0"))
    )
)

Tool Calling with Local Models

Local inference supports tool calling via MLX Swift LM's tool-call generation. Define tools the same way as with cloud providers:

Swift
func getCurrentTime() -> String {
    ISO8601DateFormatter().string(from: Date())
}
let getCurrentTimeTool = Tool(getCurrentTime, "Get today's date and time.")

func evaluate(expression: String) -> String {
    let result = NSExpression(format: expression).expressionValue(with: nil, context: nil)
    return "\(result ?? "error")"
}
let evaluateTool = Tool(evaluate, "Evaluate a mathematical expression.")

let agent = Agent(
    model: MLXProvider(modelId: "mlx-community/Qwen3-8B-4bit"),
    tools: [getCurrentTimeTool, evaluateTool]
)

let result = try await agent.run("What day of the week is it, and what is 17 * 23?")
print(result.message.textContent)
💡

Tool calling quality varies significantly by model. Qwen3 models (8B+) produce the most reliable results for multi-step tool use in local inference.