Local Inference

Run language models entirely on Apple Silicon. No network, no API keys, no data leaves the device.

ℹ️

MLX inference requires Apple Silicon (M1 or later); it is not supported on Intel Macs, iOS, or tvOS. Run with Xcode or xcodebuild: Metal library loading fails under swift run.

MLXProvider

Add StrandsMLXProvider as a dependency and pass an MLXProvider to your agent. Models are downloaded from Hugging Face and cached locally on first run:

local_agent.swift
import StrandsAgents
import StrandsMLXProvider

let agent = Agent(
    model: MLXProvider(modelId: "mlx-community/Qwen3-8B-4bit"),
    tools: [calculator, wordCount]
)

let result = try await agent.run("What is 1234 * 5678?")
print(result.output)  // "7,006,652"
| Model | Size | Use case |
| --- | --- | --- |
| mlx-community/Qwen3-8B-4bit | ~5 GB | General purpose, tool calling |
| mlx-community/Qwen3-14B-4bit | ~9 GB | More capable, slower |
| mlx-community/Mistral-7B-Instruct-v0.3-4bit | ~4 GB | Instruction following |
⚠️

The qwen3_5 architecture is not yet supported in the current MLX Swift LM version. Use qwen3 models (e.g. Qwen3-8B, Qwen3-14B).

Hybrid Routing

A HybridRouter sits between your agent and its models. For each request it evaluates a policy and picks either the local (MLX) or cloud provider. The agent code is identical either way.

This is useful when you want fast, private responses for simple tasks and cloud-quality reasoning for complex ones, without writing any branching logic yourself.

Request → HybridRouter → policy says local → MLXProvider
                       → policy says cloud → BedrockProvider
Basic setup
import StrandsAgents
import StrandsMLXProvider
import StrandsBedrockProvider

let agent = Agent(
    router: HybridRouter(
        local: MLXProvider(modelId: "mlx-community/Qwen3-8B-4bit"),
        cloud: try BedrockProvider(config: BedrockConfig(
            modelId: "us.anthropic.claude-sonnet-4-20250514-v1:0"
        )),
        policy: LatencySensitivePolicy()
    ),
    tools: [myTools]
)

Built-in policies

Choose the policy that matches your app's priorities:

| Policy | Routes to local when... |
| --- | --- |
| AlwaysCloudPolicy | Never. This is the default when no policy is specified. |
| AlwaysLocalPolicy | Always. |
| LatencySensitivePolicy | The device is not critically throttled, the prompt is not too long, and either a low-latency hint is set, privacy is requested, or the device is plugged in with a nominal thermal state. |
| FallbackPolicy | The thermal state is below serious, free memory exceeds the threshold, deep reasoning is not requested, and the last inference was not too slow. |
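As a sketch of how policies are swapped, the examples below reuse the provider setup from the basic example; the threshold values are illustrative, not recommendations:

```swift
import StrandsAgents
import StrandsMLXProvider
import StrandsBedrockProvider

// Offline-first app: every request stays on-device, the cloud provider is never used.
let offlineRouter = HybridRouter(
    local: MLXProvider(modelId: "mlx-community/Qwen3-8B-4bit"),
    cloud: try BedrockProvider(config: BedrockConfig(
        modelId: "us.anthropic.claude-sonnet-4-20250514-v1:0"
    )),
    policy: AlwaysLocalPolicy()
)

// Battery-conscious chat app: only go local with at least 3 GB free RAM on
// battery, and send prompts over ~1500 estimated tokens to the cloud.
let chatRouter = HybridRouter(
    local: MLXProvider(modelId: "mlx-community/Qwen3-8B-4bit"),
    cloud: try BedrockProvider(config: BedrockConfig(
        modelId: "us.anthropic.claude-sonnet-4-20250514-v1:0"
    )),
    policy: LatencySensitivePolicy(minimumMemoryGBOnBattery: 3.0, maxLocalPromptTokens: 1500)
)
```

The two routers differ only in the policy argument; agent code built on either is identical.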

Routing hints

agent.routingHints is read on every call. Set it before run() or stream() to influence the routing decision:

Per-call hints
// Keep this prompt on-device — LatencySensitivePolicy will always route local
agent.routingHints = RoutingHints(privacySensitive: true)
let privateResult = try await agent.run("Summarize my health records")

// This task needs a strong model — FallbackPolicy and LatencySensitivePolicy both route cloud
agent.routingHints = RoutingHints(requiresDeepReasoning: true)
let deepResult = try await agent.run("Explain the proof of Fermat's Last Theorem step by step")

// Bypass the policy entirely for one call
agent.routingHints = RoutingHints(forceProvider: .local)
let localResult = try await agent.run("Quick offline calculation: 42 * 17")

// Reset to defaults — let the policy decide again
agent.routingHints = RoutingHints()

Available hint flags

| Hint | Type | Effect |
| --- | --- | --- |
| preferLowLatency | Bool | Prefer the fastest provider even at lower capability |
| privacySensitive | Bool | Prompt must not leave the device |
| requiresDeepReasoning | Bool | Task needs a large, capable model |
| forceProvider | .local / .cloud | Bypass the policy entirely |
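preferLowLatency, the one flag not exercised in the per-call examples above, is set the same way; a short sketch assuming the agent from the basic setup:

```swift
// Autocomplete-style call: speed matters more than model capability,
// so LatencySensitivePolicy will favor the local model.
agent.routingHints = RoutingHints(preferLowLatency: true)
let suggestion = try await agent.run("Complete this sentence: The quick brown fox")

// Hints persist until changed; reset before the next normal call.
agent.routingHints = RoutingHints()
```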

Device capabilities

Policies receive live device state on every routing call. All values are read fresh each time:

Available in RoutingContext.deviceCapabilities
context.deviceCapabilities.hasNeuralEngine    // true on Apple Silicon (arm64)
context.deviceCapabilities.availableMemoryGB  // free + speculative RAM right now (vm_statistics64)
context.deviceCapabilities.thermalState       // .nominal / .fair / .serious / .critical (ProcessInfo)
context.deviceCapabilities.isPluggedIn        // requires DeviceCapabilities.isPluggedInProvider (see below)
ℹ️

isPluggedIn requires a platform-specific hook because StrandsAgents only imports Foundation. Set DeviceCapabilities.isPluggedInProvider once at app startup:

// iOS (AppDelegate or @main):
import UIKit

UIDevice.current.isBatteryMonitoringEnabled = true
DeviceCapabilities.isPluggedInProvider = {
    UIDevice.current.batteryState != .unplugged
}

// macOS (using IOKit — add IOKit to your app target, not to StrandsAgents):
import IOKit.ps

DeviceCapabilities.isPluggedInProvider = {
    IOPSCopyExternalPowerAdapterDetails()?.takeRetainedValue() != nil
}

Until you set this, isPluggedIn defaults to true.

Last inference latency

On the second and subsequent cycles of a conversation, policies also receive the latency of the previous model call. FallbackPolicy uses this to fall back to cloud if the local model is running slowly:

Swift
context.lastInferenceLatencyMs  // Int? — nil on first call, ms on subsequent calls

// FallbackPolicy defaults: fall back to cloud if last call exceeded 5 seconds
FallbackPolicy(slowInferenceThresholdMs: 5000, minimumMemoryGB: 1.5)
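A custom policy (see Writing a custom policy below) can consume the same signal; a minimal sketch using the documented context field:

```swift
// Hypothetical policy: abandon local inference once it proves slow.
struct AvoidSlowLocalPolicy: RoutingPolicy {
    let maxAcceptableMs = 3000

    func shouldUseLocal(context: RoutingContext) -> Bool {
        // No latency data yet on the first call of a conversation: try local.
        guard let lastMs = context.lastInferenceLatencyMs else { return true }
        // Stay local only while the previous call was acceptably fast.
        return lastMs < maxAcceptableMs
    }
}
```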

Estimated prompt tokens

A rough token estimate (character count / 4) is computed automatically and available to all policies. Use it to avoid sending very long contexts to local models:

Swift
context.estimatedPromptTokens  // Int — rough estimate, updated each cycle

// LatencySensitivePolicy defaults: route cloud if prompt exceeds 2000 estimated tokens
LatencySensitivePolicy(minimumMemoryGBOnBattery: 2.0, maxLocalPromptTokens: 2000)

Writing a custom policy

Conform to RoutingPolicy and implement one method. Return true to use local inference, false to use cloud.

Example: route by prompt length + thermal state
struct SmartPolicy: RoutingPolicy {
    let shortPromptThreshold = 200  // characters

    func shouldUseLocal(context: RoutingContext) -> Bool {
        // Never run locally if the device is throttling
        guard context.deviceCapabilities.thermalState != .critical else { return false }

        // Always respect an explicit privacy request
        if context.hints.privacySensitive { return true }

        // Route deep reasoning tasks to cloud
        if context.hints.requiresDeepReasoning { return false }

        // Short prompts with no tools: fast local response
        let promptLength = context.messages.last?.textContent?.count ?? 0
        let hasTools = !(context.toolSpecs?.isEmpty ?? true)
        return promptLength < shortPromptThreshold && !hasTools
    }
}

let agent = Agent(
    router: HybridRouter(
        local: MLXProvider(modelId: "mlx-community/Qwen3-8B-4bit"),
        cloud: try BedrockProvider(config: BedrockConfig(
            modelId: "us.anthropic.claude-sonnet-4-20250514-v1:0"
        )),
        policy: SmartPolicy()
    )
)
💡

You can also implement the full ModelRouter protocol directly for cases where you need async logic, external signals (A/B flags, remote config), or routing to more than two providers.

Full custom router

If the two-provider HybridRouter structure is too limiting, implement ModelRouter directly. You receive the full RoutingContext and can return any ModelProvider.

Router with three providers
struct TieredRouter: ModelRouter {
    let nano: any ModelProvider    // fast local, small tasks
    let standard: any ModelProvider // cloud, general tasks
    let powerful: any ModelProvider // large cloud model, hard tasks

    func route(context: RoutingContext) async throws -> any ModelProvider {
        if context.hints.forceProvider == .local { return nano }
        if context.hints.requiresDeepReasoning   { return powerful }

        let wordCount = context.messages.reduce(0) { $0 + ($1.textContent?.split(separator: " ").count ?? 0) }
        return wordCount < 50 ? nano : standard
    }
}

let agent = Agent(
    router: TieredRouter(
        nano:      MLXProvider(modelId: "mlx-community/Qwen3-4B-4bit"),
        standard:  try BedrockProvider(config: BedrockConfig(modelId: "us.anthropic.claude-3-5-haiku-20241022-v1:0")),
        powerful:  try BedrockProvider(config: BedrockConfig(modelId: "us.anthropic.claude-sonnet-4-20250514-v1:0"))
    )
)

Tool Calling with Local Models

Local inference supports tool calling via MLX Swift LM's tool-call generation. Define tools the same way as with cloud providers:

Swift
import Foundation
import StrandsAgents
import StrandsMLXProvider

/// Get today's date and time.
@Tool
func getCurrentTime() -> String {
    ISO8601DateFormatter().string(from: Date())
}

/// Evaluate a mathematical expression.
@Tool
func evaluate(expression: String) -> String {
    let result = NSExpression(format: expression).expressionValue(with: nil, context: nil)
    return result.map { "\($0)" } ?? "error"
}

let agent = Agent(
    model: MLXProvider(modelId: "mlx-community/Qwen3-8B-4bit"),
    tools: [getCurrentTime, evaluate]
)

let result = try await agent.run("What day of the week is it, and what is 17 * 23?")
print(result.output)
💡

Tool calling quality varies significantly by model. Qwen3 models (8B+) produce the most reliable results for multi-step tool use in local inference.