Local Inference

Run language models entirely on Apple Silicon. No network, no API keys, no data leaves the device.

ℹ️

MLX inference requires Apple Silicon (M1 or later). It is not supported on Intel Macs, iOS, or tvOS. Use xcodebuild or Xcode to run; Metal library loading fails with swift run.

MLXProvider

Pass an MLXProvider to your agent. Models are downloaded from HuggingFace and cached locally on first run:

local_agent.swift (Swift)
import StrandsAgents

let agent = Agent(
    model: MLXProvider(modelId: "mlx-community/Qwen3-8B-4bit"),
    tools: [calculator, wordCount]
)

let result = try await agent.run("What is 1234 * 5678?")
print(result.message.textContent)  // "7,006,652"

Model                                         Size    Use case
mlx-community/Qwen3-8B-4bit                   ~5 GB   General purpose, tool calling
mlx-community/Qwen3-14B-4bit                  ~9 GB   More capable, slower
mlx-community/Mistral-7B-Instruct-v0.3-4bit   ~4 GB   Instruction following
⚠️

The qwen3_5 architecture is not yet supported in the current MLX Swift LM version. Use qwen3 models (e.g. Qwen3-8B, Qwen3-14B).

Download progress

Models are downloaded from HuggingFace on first use and cached locally. Pass an onDownloadProgress callback to track progress. The closure receives a Double from 0.0 to 1.0 and is not called if the model is already cached.

Basic usage (Swift)
let provider = MLXProvider(modelId: "mlx-community/Qwen3-8B-4bit") { progress in
    print("Downloading: \(Int(progress * 100))%")
}

In SwiftUI, bind the callback to a @State variable to drive a ProgressView:

SwiftUI progress bar (Swift)
@State private var downloadProgress: Double = 1.0  // stays at 1.0 (bar hidden) if the model is already cached, since the callback never fires

var body: some View {
    if downloadProgress < 1.0 {
        ProgressView("Downloading model...", value: downloadProgress)
    }
}

// When creating the provider:
let provider = MLXProvider(modelId: "mlx-community/Qwen3-8B-4bit") { progress in
    Task { @MainActor in
        downloadProgress = progress
    }
}

Hybrid Routing

A HybridRouter sits between your agent and its models. For each request it evaluates a policy and picks either the local (MLX) or cloud provider. The agent code is identical either way.

This is useful when you want fast, private responses for simple tasks and cloud-quality reasoning for complex ones, without writing any branching logic yourself.

Request → HybridRouter → policy says local → MLXProvider
                       → policy says cloud → BedrockProvider

Basic setup (Swift)
import StrandsAgents

let agent = Agent(
    router: HybridRouter(
        local: MLXProvider(modelId: "mlx-community/Qwen3-8B-4bit"),
        cloud: try BedrockProvider(config: BedrockConfig(
            modelId: "us.anthropic.claude-sonnet-4-20250514-v1:0"
        )),
        policy: LatencySensitivePolicy()
    ),
    tools: [myTools]
)

Built-in policies

Policy                   Best for
AlwaysCloudPolicy        Cloud only. Default when no policy is specified.
AlwaysLocalPolicy        Local only. Useful for fully offline apps.
LatencySensitivePolicy   Most apps. Prefers cloud by default but routes locally when the user requests low latency or privacy, or when the device is plugged in and not throttled.
FallbackPolicy           Local-first apps. Prefers local by default and only falls back to cloud when the device is throttled, low on memory, or the previous local inference was too slow.
💡

Not sure which to pick? Start with LatencySensitivePolicy -- it's the recommended general-purpose option. Use FallbackPolicy if local inference is your default and cloud is the safety net.
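
For a fully offline build, or an in-app "offline only" switch, the setup is the same shape as above with the policy swapped. A minimal sketch; it assumes AlwaysLocalPolicy has a no-argument initializer, that HybridRouter's policy parameter accepts an any RoutingPolicy existential, and that the UserDefaults key is purely hypothetical:

Offline switch (Swift)
// Hypothetical user setting: force every request on-device
let offlineOnly = UserDefaults.standard.bool(forKey: "offlineOnly")
let policy: any RoutingPolicy = offlineOnly ? AlwaysLocalPolicy() : LatencySensitivePolicy()

let agent = Agent(
    router: HybridRouter(
        local: MLXProvider(modelId: "mlx-community/Qwen3-8B-4bit"),
        cloud: try BedrockProvider(config: BedrockConfig(
            modelId: "us.anthropic.claude-sonnet-4-20250514-v1:0"
        )),
        policy: policy
    ),
    tools: [myTools]
)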

Routing hints

agent.routingHints is read on every call. Set it before run() or stream() to influence the routing decision:

Per-call hints (Swift)
// Keep this prompt on-device — LatencySensitivePolicy will always route local
agent.routingHints = RoutingHints(privacySensitive: true)
let privateResult = try await agent.run("Summarize my health records")

// This task needs a strong model — FallbackPolicy and LatencySensitivePolicy both route cloud
agent.routingHints = RoutingHints(requiresDeepReasoning: true)
let deepResult = try await agent.run("Explain the proof of Fermat's Last Theorem step by step")

// Bypass the policy entirely for one call
agent.routingHints = RoutingHints(forceProvider: .local)
let localResult = try await agent.run("Quick offline calculation: 42 * 17")

// Reset to defaults — let the policy decide again
agent.routingHints = RoutingHints()

Available hint flags

Hint                    Type              Effect
preferLowLatency        Bool              Prefer the fastest provider even at lower capability
privacySensitive        Bool              Prompt must not leave the device
requiresDeepReasoning   Bool              Task needs a large, capable model
forceProvider           .local / .cloud   Bypass the policy entirely
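
The earlier per-call example shows every flag except preferLowLatency. The same pattern applies; this sketch assumes RoutingHints exposes it as an initializer parameter like the other Bool flags:

Low-latency hint (Swift)
// Interactive feature: a fast on-device draft beats a slower cloud answer
agent.routingHints = RoutingHints(preferLowLatency: true)
let draft = try await agent.run("Suggest a one-line commit message for this diff")

// Reset so later calls use the policy's defaults again
agent.routingHints = RoutingHints()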

Device capabilities

Policies receive live device state on every routing call. All values are read fresh each time:

Available in RoutingContext.deviceCapabilities (Swift)
context.deviceCapabilities.hasNeuralEngine    // true on Apple Silicon (arm64)
context.deviceCapabilities.availableMemoryGB  // free + speculative RAM right now (vm_statistics64)
context.deviceCapabilities.thermalState       // .nominal / .fair / .serious / .critical (ProcessInfo)
context.deviceCapabilities.isPluggedIn        // requires DeviceCapabilities.isPluggedInProvider (see below)
ℹ️

isPluggedIn requires a platform-specific hook because StrandsAgents only imports Foundation. Set DeviceCapabilities.isPluggedInProvider once at app startup:

// iOS (AppDelegate or @main); needs import UIKit:
UIDevice.current.isBatteryMonitoringEnabled = true
DeviceCapabilities.isPluggedInProvider = {
    UIDevice.current.batteryState != .unplugged
}

// macOS (using IOKit; add IOKit to your app target, not to StrandsAgents); needs import IOKit.ps:
DeviceCapabilities.isPluggedInProvider = {
    // takeRetainedValue() releases the dictionary the Copy function retains
    IOPSCopyExternalPowerAdapterDetails()?.takeRetainedValue() != nil
}

Until you set this, isPluggedIn defaults to true.

Writing a custom policy

Conform to RoutingPolicy and implement one method. Return true to use local inference, false to use cloud.

The full RoutingContext is available to your policy, including:

  • context.deviceCapabilities -- thermal state, free memory, power source, Neural Engine
  • context.estimatedPromptTokens -- rough token estimate (character count / 4), useful for routing long contexts to cloud
  • context.lastInferenceLatencyMs -- latency of the previous model call in ms (nil on first call), useful for falling back if local is running slowly
  • context.hints -- the RoutingHints set by the caller
Example: route by prompt length + thermal state (Swift)
struct SmartPolicy: RoutingPolicy {
    let shortPromptThreshold = 200  // characters

    func shouldUseLocal(context: RoutingContext) -> Bool {
        // Never run locally if the device is throttling
        guard context.deviceCapabilities.thermalState != .critical else { return false }

        // Always respect an explicit privacy request
        if context.hints.privacySensitive { return true }

        // Route deep reasoning tasks to cloud
        if context.hints.requiresDeepReasoning { return false }

        // Short prompts with no tools: fast local response
        let promptLength = context.messages.last?.textContent?.count ?? 0
        let hasTools = !(context.toolSpecs?.isEmpty ?? true)
        return promptLength < shortPromptThreshold && !hasTools
    }
}

let agent = Agent(
    router: HybridRouter(
        local: MLXProvider(modelId: "mlx-community/Qwen3-8B-4bit"),
        cloud: try BedrockProvider(config: BedrockConfig(
            modelId: "us.anthropic.claude-sonnet-4-20250514-v1:0"
        )),
        policy: SmartPolicy()
    )
)
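
SmartPolicy leaves two of the RoutingContext fields unused: estimatedPromptTokens and lastInferenceLatencyMs. A second sketch that routes on those signals; the thresholds are illustrative, not tuned values:

Example: adaptive fallback (Swift)
struct AdaptivePolicy: RoutingPolicy {
    func shouldUseLocal(context: RoutingContext) -> Bool {
        // Long contexts go to cloud: local prefill slows sharply with prompt size
        if context.estimatedPromptTokens > 2_000 { return false }

        // nil on the first call; a slow previous call suggests the device is struggling
        if let lastMs = context.lastInferenceLatencyMs, lastMs > 4_000 { return false }

        // On battery power, defer to cloud rather than sustain heavy GPU load
        return context.deviceCapabilities.isPluggedIn
    }
}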
💡

You can also implement the full ModelRouter protocol directly for cases where you need async logic, external signals (A/B flags, remote config), or routing to more than two providers.

Full custom router

If the two-provider HybridRouter structure is too limiting, implement ModelRouter directly. You receive the full RoutingContext and can return any ModelProvider.

Router with three providers (Swift)
struct TieredRouter: ModelRouter {
    let nano: any ModelProvider    // fast local, small tasks
    let standard: any ModelProvider // cloud, general tasks
    let powerful: any ModelProvider // large cloud model, hard tasks

    func route(context: RoutingContext) async throws -> any ModelProvider {
        if context.hints.forceProvider == .local { return nano }
        if context.hints.requiresDeepReasoning   { return powerful }

        let wordCount = context.messages.reduce(0) { $0 + ($1.textContent?.split(separator: " ").count ?? 0) }
        return wordCount < 50 ? nano : standard
    }
}

let agent = Agent(
    router: TieredRouter(
        nano:      MLXProvider(modelId: "mlx-community/Qwen3-4B-4bit"),
        standard:  try BedrockProvider(config: BedrockConfig(modelId: "us.anthropic.claude-haiku-3-5-20241022-v1:0")),
        powerful:  try BedrockProvider(config: BedrockConfig(modelId: "us.anthropic.claude-sonnet-4-20250514-v1:0"))
    )
)

Tool Calling with Local Models

Local inference supports tool calling via MLX Swift LM's tool-call generation. Define tools the same way as with cloud providers:

Swift
func getCurrentTime() -> String {
    ISO8601DateFormatter().string(from: Date())
}
let getCurrentTimeTool = Tool(getCurrentTime, "Get today's date and time.")

func evaluate(expression: String) -> String {
    let result = NSExpression(format: expression).expressionValue(with: nil, context: nil)
    return "\(result ?? "error")"
}
let evaluateTool = Tool(evaluate, "Evaluate a mathematical expression.")

let agent = Agent(
    model: MLXProvider(modelId: "mlx-community/Qwen3-8B-4bit"),
    tools: [getCurrentTimeTool, evaluateTool]
)

let result = try await agent.run("What day of the week is it, and what is 17 * 23?")
print(result.message.textContent)
💡

Tool calling quality varies significantly by model. Qwen3 models (8B+) produce the most reliable results for multi-step tool use in local inference.