# Observability

Traces, metrics, and hooks give you visibility into every agent run: token counts, per-cycle latency, and tool execution.

## OpenTelemetry Tracing

The SDK emits structured traces using the OpenTelemetry GenAI semantic conventions. Every agent run produces a connected trace tree with parent-child span relationships, token counts, latency, and tool results.

### Span hierarchy

Each run produces a trace with this structure:

- `invoke_agent`: root span for the full run
  - `execute_event_loop_cycle`: one per reasoning cycle
    - `chat`: model call with token counts
    - `execute_tool <name>`: one per tool call (concurrent tools run in parallel)

### Span attributes

| Attribute | Where set |
| --- | --- |
| `gen_ai.operation.name` | All spans |
| `gen_ai.request.model` | `chat` span |
| `gen_ai.usage.input_tokens` | `gen_ai.choice` event on `chat` |
| `gen_ai.usage.output_tokens` | `gen_ai.choice` event on `chat` |
| `gen_ai.usage.total_tokens` | `gen_ai.choice` event on `chat` |
| `gen_ai.tool.name` | `execute_tool` span |
| `gen_ai.tool.call.id` | `execute_tool` span |
| `gen_ai.tool.status` | `gen_ai.choice` event on tool span |
| `event_loop.cycle_id` | `execute_event_loop_cycle` span |
| `finish_reason` | `gen_ai.choice` event on `chat` |

## Sending to Datadog LLM Observability

The SDK emits GenAI semantic conventions that Datadog's LLM Observability product reads natively. Before choosing a setup, read the section below on API key safety.

> ⚠️ Do not embed a Datadog API key in a shipped app. Datadog's OTLP intake accepts only API keys, not restricted client tokens, and an API key extracted from your binary grants full write access to your Datadog account. See API Key Safety below before choosing a setup.

**Package.swift**

```swift
.target(
    name: "MyApp",
    dependencies: [
        .product(name: "StrandsAgents", package: "strands-agents-swift"),
        .product(name: "StrandsBedrockProvider", package: "strands-agents-swift"),
        .product(name: "StrandsOTelObservability", package: "strands-agents-swift"),
    ]
)
```
**Setup**

```swift
import StrandsAgents
import StrandsOTelObservability

let agent = Agent(
    model: provider,
    tools: [myTool],
    observability: OTelObservabilityEngine.datadog(
        apiKey: "...",   // see API Key Safety section below
        service: "my-app"
    )
)

let result = try await agent.run("What is 42 * 17?")
// Traces appear in Datadog LLM Observability within seconds
```
**All options**

```swift
OTelObservabilityEngine.datadog(
    apiKey:   "your-dd-api-key",
    service:  "my-app",          // service name shown in Datadog
    version:  "2.1.0",           // optional, defaults to "1.0"
    site:     "datadoghq.eu",    // optional, defaults to "datadoghq.com"
    endpoint: URL(string: "https://your-collector.example.com/v1/traces")
    // endpoint overrides site; use this for the Collector proxy pattern
)
```

## API Key Safety

Datadog's OTLP endpoint requires an API key. Unlike Datadog's native mobile SDK (dd-sdk-ios), which uses restricted client tokens, the OTLP intake has no client-safe credential type. This creates a real problem for shipped apps.

| Context | Approach |
| --- | --- |
| Development / internal tools | Environment variable or gitignored `.xcconfig` |
| Server-side agent (not in a user's app) | Environment variable on the server |
| Shipped iOS / macOS app | Collector proxy (see below); never embed the key |

### Development: environment variable

For local development and internal tools, read the key from an environment variable set in your shell, or keep it in a gitignored `.xcconfig` file. Never commit it.

```swift
observability: OTelObservabilityEngine.datadog(
    apiKey: ProcessInfo.processInfo.environment["DD_API_KEY"] ?? "",
    service: "my-app"
)
```
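If you keep the key in a gitignored `.xcconfig` instead, note that build settings are not visible to `ProcessInfo` at runtime; the usual route is to surface the setting through `Info.plist` and read it from the bundle. A minimal sketch, assuming a hypothetical `.xcconfig` entry `DD_API_KEY = ...` and an `Info.plist` key `DDAPIKey` set to `$(DD_API_KEY)` (both names are illustrative, not part of the SDK):

```swift
import Foundation

// Reads the value the .xcconfig injected into Info.plist at build time.
// "DDAPIKey" is whatever key you declared; there is no SDK convention for it.
func datadogAPIKey() -> String {
    Bundle.main.object(forInfoDictionaryKey: "DDAPIKey") as? String ?? ""
}
```

This route is still development-only: the key ends up in the app bundle, so for anything you ship it is no safer than hardcoding it.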

### Production: proxy backend

For any app you ship to users, put a lightweight backend between your app and Datadog. The app sends OTLP to your endpoint with no credentials. The backend adds the API key and forwards to Datadog. The key never reaches the device.

```
iOS / macOS app → OTLP (no key) → Your proxy → + dd-api-key → Datadog
```

The proxy's only job is to inject the credential your app cannot safely hold. The OTLP payload your app built (spans, token counts, trace hierarchy) passes through unchanged.

Point the app at your proxy endpoint instead of Datadog directly:

**App (no credentials in binary)**

```swift
observability: OTelObservabilityEngine.datadog(
    apiKey: "",   // ignored; the proxy adds the key server-side
    service: "my-app",
    endpoint: URL(string: "https://your-proxy.example.com/v1/traces")
)
```

### Option A: Lambda proxy (serverless)

The simplest deployment: an API Gateway + Lambda function that adds the API key header and forwards. No servers to manage.

**Lambda handler (Node.js)**

```javascript
export const handler = async (event) => {
  const apiKey = process.env.DD_API_KEY; // set in Lambda, never in the app

  const response = await fetch("https://otlp.datadoghq.com/v1/traces", {
    method: "POST",
    headers: {
      "Content-Type": event.headers["content-type"] ?? "application/x-protobuf",
      "dd-api-key": apiKey,
      "dd-otlp-source": "llmobs",
    },
    body: Buffer.from(event.body, event.isBase64Encoded ? "base64" : "utf8"),
  });

  return { statusCode: response.status };
};
```
> ⚠️ **Security risks with a basic Lambda proxy:**
>
> - **Unauthenticated writes.** Anyone who finds your API Gateway URL can POST arbitrary spans to your Datadog account. They cannot read your data, but they can pollute your LLM Observability traces with junk.
> - **Cost amplification.** Flooding the endpoint runs up Lambda invocation costs and Datadog ingestion costs simultaneously.
> - **No payload validation.** The Lambda forwards whatever it receives without checking that it is valid OTLP.

#### Hardening the Lambda proxy

Apply these mitigations in order of effort:

| Mitigation | What it prevents | Effort |
| --- | --- | --- |
| API Gateway rate limiting + throttling | Cost amplification from floods | Low (configure in the AWS console) |
| API Gateway request size limit | Oversized payloads inflating Datadog ingestion | Low (one setting) |
| AWS WAF on the API Gateway | Known bad actors, automated scanners | Medium |
| Cognito JWT auth on the API Gateway | Unauthenticated writes entirely | Medium (requires your app to sign in) |
| API key stored in AWS Secrets Manager | Key exposure if the Lambda environment is read | Low (change one env var to a Secrets Manager lookup) |

Rate limiting is the minimum you should add before shipping. A usage plan in API Gateway takes two minutes to configure and caps the blast radius of any abuse:

**API Gateway usage plan (AWS CLI)**

```bash
aws apigateway create-usage-plan \
  --name "otlp-proxy-plan" \
  --throttle burstLimit=50,rateLimit=10 \
  --quota limit=50000,period=DAY
```

### Option B: Datadog DDOT Collector

Datadog recommends running the DDOT Collector on a long-lived server so it can handle batching and retries reliably.

**Install on a Linux server**

```bash
DD_API_KEY=your_api_key DD_SITE=datadoghq.com \
  bash -c "$(curl -L https://install.datadoghq.com/scripts/install_script_agent7.sh)"
```
**Or run with Docker**

```bash
docker run -d \
  --name ddot-collector \
  -e DD_API_KEY=your_api_key \
  -e DD_SITE=datadoghq.com \
  -p 4318:4318 \
  -v $(pwd)/otel-config.yaml:/etc/datadog-agent/otel-config.yaml \
  gcr.io/datadoghq/agent:latest
```
**Collector config (`otel-config.yaml`)**

```yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318

exporters:
  datadog:
    api:
      key: ${env:DD_API_KEY}
      site: datadoghq.com   # datadoghq.eu for EU

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [datadog]
```
#### Point the Swift SDK at your Collector

Once it's running, pass the server's IP or hostname to the SDK. Use the public IP for quick testing, or a domain with TLS for production:

```swift
// Quick test: server IP directly
observability: OTelObservabilityEngine.datadog(
    apiKey: "",
    service: "my-app",
    endpoint: URL(string: "http://1.2.3.4:4318/v1/traces")
)

// Production: domain with TLS
observability: OTelObservabilityEngine.datadog(
    apiKey: "",
    service: "my-app",
    endpoint: URL(string: "https://collector.yourserver.com/v1/traces")
)
```

## Manual OTel setup (any backend)

```swift
import OpenTelemetrySdk
import OpenTelemetryProtocolExporterHttp

let exporter = OtlpHttpTraceExporter(
    endpoint: URL(string: "https://collector.yourbackend.com/v1/traces")!
)
let provider = TracerProviderBuilder()
    .add(spanProcessor: BatchSpanProcessor(spanExporter: exporter))
    .build()
OpenTelemetry.registerTracerProvider(tracerProvider: provider)

let tracer = provider.get(instrumentationName: "my-app", instrumentationVersion: "1.0")
let observability = OTelObservabilityEngine(tracer: tracer)
```
> ℹ️ For EU accounts, use `https://otlp.datadoghq.eu/v1/traces` as the endpoint. The `dd-otlp-source: llmobs` header routes spans into the LLM Observability product specifically.
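If you point the manual setup directly at Datadog's intake instead of a Collector, the exporter itself has to attach the `dd-api-key` and `dd-otlp-source` headers. A sketch, assuming the opentelemetry-swift version you pin exposes an `OtlpConfiguration(headers:)` initializer that the HTTP exporter accepts (check the exporter's signature in your pinned release); because the key is in process, this belongs on server-side Swift, never in a shipped app:

```swift
import Foundation
import OpenTelemetrySdk
import OpenTelemetryProtocolExporterHttp

// Direct-to-Datadog export: headers carry the credential and the
// routing hint that sends spans to LLM Observability.
let exporter = OtlpHttpTraceExporter(
    endpoint: URL(string: "https://otlp.datadoghq.com/v1/traces")!,
    config: OtlpConfiguration(headers: [
        ("dd-api-key", ProcessInfo.processInfo.environment["DD_API_KEY"] ?? ""),
        ("dd-otlp-source", "llmobs"),
    ])
)
```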

### Other supported backends

| Backend | Exporter |
| --- | --- |
| Datadog LLM Observability | `OtlpHttpTraceExporter` to `otlp.datadoghq.com` |
| Jaeger | `OtlpGrpcTraceExporter` to your Jaeger instance |
| AWS X-Ray | `OtlpGrpcTraceExporter` to the X-Ray OTLP receiver |
| Any OTLP backend | Any `SpanExporter` from `opentelemetry-swift` |

## Run Metrics

Every AgentResult carries metrics for the entire run and per-cycle breakdowns, with no extra configuration needed:

```swift
let result = try await agent.run("Summarize the quarterly report")

// Run-level metrics
print(result.metrics.cycleCount)           // number of reasoning cycles
print(result.metrics.totalLatencyMs)       // wall-clock time
print(result.metrics.outputTokensPerSecond)

// Token usage
print(result.usage.inputTokens)
print(result.usage.outputTokens)
print(result.usage.totalTokens)

// Per-cycle breakdown
for cycle in result.metrics.cycles {
    print("Cycle \(cycle.cycleNumber):")
    print("  Latency:     \(cycle.modelLatencyMs)ms")
    print("  Stop reason: \(cycle.stopReason)")
    print("  Tools run:   \(cycle.toolsExecuted)")
    print("  Output tokens: \(cycle.usage.outputTokens)")
}
```
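The per-cycle array also makes simple aggregates easy. For example, average model latency across cycles, using only the fields shown above (written defensively so it compiles whether `modelLatencyMs` is an integer or floating-point type):

```swift
// Average model latency across reasoning cycles
let cycles = result.metrics.cycles
let avgLatencyMs = cycles.isEmpty
    ? 0.0
    : cycles.reduce(0.0) { $0 + Double($1.modelLatencyMs) } / Double(cycles.count)
print("Average model latency: \(avgLatencyMs)ms")
```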

## Lifecycle Hooks

Hooks let you observe or react to agent events without modifying the agent. Register callbacks on agent.hookRegistry:

**Registering hooks**

```swift
// Before each model call
agent.hookRegistry.addCallback(BeforeInvocationEvent.self) { event in
    print("Sending \(event.messages.count) messages to model")
}

// After each model call
agent.hookRegistry.addCallback(AfterInvocationEvent.self) { event in
    print("Model responded in \(event.latencyMs)ms")
}

// After each tool execution
agent.hookRegistry.addCallback(AfterToolEvent.self) { event in
    print("Tool \(event.toolName): \(event.result.status)")
}

// At the end of every run (good for logging to a metrics system)
agent.hookRegistry.addCallback(MetricsEvent.self) { event in
    myMetrics.record(
        cycles: event.metrics.cycleCount,
        tokens: event.metrics.totalUsage.totalTokens,
        latency: event.metrics.totalLatencyMs
    )
}
```
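`BeforeToolEvent` follows the same registration pattern. A sketch that counts tool invocations per run, assuming the event exposes `toolName` the way `AfterToolEvent` does (an assumption; check the event's actual properties):

```swift
// Count tool invocations as they start; toolName is assumed to
// mirror AfterToolEvent's property of the same name.
var toolCounts: [String: Int] = [:]
agent.hookRegistry.addCallback(BeforeToolEvent.self) { event in
    toolCounts[event.toolName, default: 0] += 1
    print("Running \(event.toolName) (call #\(toolCounts[event.toolName]!))")
}
```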

### Available hook events

| Event | When it fires |
| --- | --- |
| `BeforeInvocationEvent` | Before each model API call |
| `AfterInvocationEvent` | After each model API call |
| `BeforeToolEvent` | Before a tool is executed |
| `AfterToolEvent` | After a tool returns |
| `MetricsEvent` | At the end of a complete agent run |