# Observability
Traces, metrics, and hooks give you visibility into every agent run: token counts, per-cycle latency, and tool execution.
## OpenTelemetry Tracing
The SDK emits structured traces using the OpenTelemetry GenAI semantic conventions. Every agent run produces a connected trace tree with parent-child span relationships, token counts, latency, and tool results.
### Span hierarchy

Each run produces a trace with this structure:
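A sketch of that tree, inferred from the span names in the attribute table below (the exact nesting may vary by SDK version):

```text
agent run (root span)
└── execute_event_loop_cycle        # one per reasoning cycle
    ├── chat                        # model call: model, tokens, finish_reason
    └── execute_tool                # one per tool call: name, call id, status
```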
### Span attributes

| Attribute | Where set |
|---|---|
| `gen_ai.operation.name` | All spans |
| `gen_ai.request.model` | `chat` span |
| `gen_ai.usage.input_tokens` | `gen_ai.choice` event on `chat` |
| `gen_ai.usage.output_tokens` | `gen_ai.choice` event on `chat` |
| `gen_ai.usage.total_tokens` | `gen_ai.choice` event on `chat` |
| `gen_ai.tool.name` | `execute_tool` span |
| `gen_ai.tool.call.id` | `execute_tool` span |
| `gen_ai.tool.status` | `gen_ai.choice` event on tool span |
| `event_loop.cycle_id` | `execute_event_loop_cycle` span |
| `finish_reason` | `gen_ai.choice` event on `chat` |
## Sending to Datadog LLM Observability
The SDK emits GenAI semantic conventions that Datadog's LLM Observability product reads natively.

> **Warning:** Do not embed a Datadog API key in a shipped app. Datadog's OTLP intake only accepts API keys, not restricted client tokens, and a key extracted from your binary gives full write access to your Datadog account. Read the API Key Safety section below before choosing a setup.
Add the observability product to your target in `Package.swift`:

```swift
.target(
    name: "MyApp",
    dependencies: [
        .product(name: "StrandsAgents", package: "strands-agents-swift"),
        .product(name: "StrandsBedrockProvider", package: "strands-agents-swift"),
        .product(name: "StrandsOTelObservability", package: "strands-agents-swift"),
    ]
)
```
Then attach the engine when constructing your agent:

```swift
import StrandsAgents
import StrandsOTelObservability

let agent = Agent(
    model: provider,
    tools: [myTool],
    observability: OTelObservabilityEngine.datadog(
        apiKey: "...", // see API Key Safety section below
        service: "my-app"
    )
)

let result = try await agent.run("What is 42 * 17?")
// Traces appear in Datadog LLM Observability within seconds
```
All configuration options:

```swift
OTelObservabilityEngine.datadog(
    apiKey: "your-dd-api-key",
    service: "my-app",     // service name shown in Datadog
    version: "2.1.0",      // optional, defaults to "1.0"
    site: "datadoghq.eu",  // optional, defaults to "datadoghq.com"
    endpoint: URL(string: "https://your-collector.example.com/v1/traces")
    // endpoint overrides site -- use this for the Collector proxy pattern
)
```
## API Key Safety
Datadog's OTLP endpoint requires an API key. Unlike Datadog's native mobile SDK (dd-sdk-ios), which uses restricted client tokens, the OTLP intake has no client-safe credential type. This creates a real problem for shipped apps.
| Context | Approach |
|---|---|
| Development / internal tools | Environment variable or .xcconfig (gitignored) |
| Server-side agent (not in a user's app) | Embed directly via environment variable on the server |
| Shipped iOS / macOS app | Collector proxy (see below) -- never embed the key |
### Development: environment variable
For local development and internal tools, read the key from an environment variable set in your shell or a gitignored .xcconfig file. Never commit it.
```swift
observability: OTelObservabilityEngine.datadog(
    apiKey: ProcessInfo.processInfo.environment["DD_API_KEY"] ?? "",
    service: "my-app"
)
```
### Production: proxy backend
For any app you ship to users, put a lightweight backend between your app and Datadog. The app sends OTLP to your endpoint with no credentials. The backend adds the API key and forwards to Datadog. The key never reaches the device.
The proxy's only job is to inject the credential your app cannot safely hold. The OTLP payload your app built -- spans, token counts, trace hierarchy -- passes through unchanged.
Point the app at your proxy endpoint instead of Datadog directly:
```swift
observability: OTelObservabilityEngine.datadog(
    apiKey: "", // ignored -- proxy adds the key server-side
    service: "my-app",
    endpoint: URL(string: "https://your-proxy.example.com/v1/traces")
)
```
#### Option A: Lambda proxy (serverless)
The simplest deployment: an API Gateway + Lambda function that adds the API key header and forwards. No servers to manage.
```javascript
export const handler = async (event) => {
  const apiKey = process.env.DD_API_KEY; // set in Lambda, never in app
  const response = await fetch("https://otlp.datadoghq.com/v1/traces", {
    method: "POST",
    headers: {
      "Content-Type": event.headers["content-type"] ?? "application/x-protobuf",
      "dd-api-key": apiKey,
      "dd-otlp-source": "llmobs",
    },
    body: Buffer.from(event.body, event.isBase64Encoded ? "base64" : "utf8"),
  });
  return { statusCode: response.status };
};
```
**Security risks with a basic Lambda proxy:**
- Unauthenticated writes. Anyone who finds your API Gateway URL can POST arbitrary spans to your Datadog account. They cannot read your data, but they can pollute your LLM Observability traces with junk.
- Cost amplification. Flooding the endpoint runs up Lambda invocation costs and Datadog ingestion costs simultaneously.
- No payload validation. The Lambda forwards whatever it receives without checking it is valid OTLP.
#### Hardening the Lambda proxy
Apply these mitigations in order of effort:
| Mitigation | What it prevents | Effort |
|---|---|---|
| API Gateway rate limiting + throttling | Cost amplification from floods | Low -- configure in AWS console |
| API Gateway request size limit | Oversized payloads inflating Datadog ingestion | Low -- one setting |
| AWS WAF on the API Gateway | Known bad actors, automated scanners | Medium |
| Cognito JWT auth on the API Gateway | Unauthenticated writes entirely | Medium -- requires your app to sign in |
| API key stored in AWS Secrets Manager | Key exposure if Lambda env is read | Low -- change one env var to a Secrets Manager lookup |
Rate limiting is the minimum you should add before shipping. A usage plan in API Gateway takes two minutes to configure and caps the blast radius of any abuse:
```shell
aws apigateway create-usage-plan \
  --name "otlp-proxy-plan" \
  --throttle burstLimit=50,rateLimit=10 \
  --quota limit=50000,period=DAY
```
#### Option B: Datadog DDOT Collector
Datadog recommends running the DDOT Collector on a long-lived server so it can handle batching and retries reliably.
**Install on a Linux server:**

```shell
DD_API_KEY=your_api_key DD_SITE=datadoghq.com \
bash -c "$(curl -L https://install.datadoghq.com/scripts/install_script_agent7.sh)"
```
**Or run with Docker:**

```shell
docker run -d \
  --name ddot-collector \
  -e DD_API_KEY=your_api_key \
  -e DD_SITE=datadoghq.com \
  -p 4318:4318 \
  -v $(pwd)/otel-config.yaml:/etc/datadog-agent/otel-config.yaml \
  gcr.io/datadoghq/agent:latest
```
**Collector config (`otel-config.yaml`):**

```yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318

exporters:
  datadog:
    api:
      key: ${env:DD_API_KEY}
      site: datadoghq.com # datadoghq.eu for EU

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [datadog]
```
**Point the Swift SDK at your Collector**

Once it's running, pass the server's IP or hostname to the SDK. Use the public IP for quick testing, or a domain with TLS for production:

```swift
// Quick test -- server IP directly
observability: OTelObservabilityEngine.datadog(
    apiKey: "",
    service: "my-app",
    endpoint: URL(string: "http://1.2.3.4:4318/v1/traces")
)

// Production -- domain with TLS
observability: OTelObservabilityEngine.datadog(
    apiKey: "",
    service: "my-app",
    endpoint: URL(string: "https://collector.yourserver.com/v1/traces")
)
```
## Manual OTel setup (any backend)

If you already run your own OpenTelemetry pipeline, build the engine from any `Tracer`:

```swift
import OpenTelemetrySdk
import OpenTelemetryProtocolExporterHttp

let exporter = OtlpHttpTraceExporter(
    endpoint: URL(string: "https://collector.yourbackend.com/v1/traces")!
)
let provider = TracerProviderBuilder()
    .add(spanProcessor: BatchSpanProcessor(spanExporter: exporter))
    .build()
OpenTelemetry.registerTracerProvider(tracerProvider: provider)

let tracer = provider.get(instrumentationName: "my-app", instrumentationVersion: "1.0")
let observability = OTelObservabilityEngine(tracer: tracer)
```
For EU accounts, use `https://otlp.datadoghq.eu/v1/traces` as the endpoint. The `dd-otlp-source: llmobs` header routes spans into the LLM Observability product specifically.
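If you target multiple Datadog sites, the endpoint can be derived from the site name. A small helper sketch, assuming the `otlp.<site>/v1/traces` pattern shown above for `datadoghq.com` and `datadoghq.eu` (verify the pattern before relying on it for other sites):

```swift
import Foundation

/// Builds the Datadog OTLP traces endpoint for a given site,
/// e.g. "datadoghq.com" -> https://otlp.datadoghq.com/v1/traces
/// Hypothetical helper, not part of the SDK.
func datadogOTLPEndpoint(site: String) -> URL {
    URL(string: "https://otlp.\(site)/v1/traces")!
}

print(datadogOTLPEndpoint(site: "datadoghq.eu"))
// https://otlp.datadoghq.eu/v1/traces
```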
### Other supported backends

| Backend | Exporter |
|---|---|
| Datadog LLM Observability | `OtlpHttpTraceExporter` to `otlp.datadoghq.com` |
| Jaeger | `OtlpGrpcTraceExporter` to your Jaeger instance |
| AWS X-Ray | `OtlpGrpcTraceExporter` to the X-Ray OTLP receiver |
| Any OTLP backend | Any `SpanExporter` from opentelemetry-swift |
## Run Metrics
Every `AgentResult` carries metrics for the entire run and per-cycle breakdowns, with no extra configuration needed:
```swift
let result = try await agent.run("Summarize the quarterly report")

// Run-level metrics
print(result.metrics.cycleCount)     // number of reasoning cycles
print(result.metrics.totalLatencyMs) // wall-clock time
print(result.metrics.outputTokensPerSecond)

// Token usage
print(result.usage.inputTokens)
print(result.usage.outputTokens)
print(result.usage.totalTokens)

// Per-cycle breakdown
for cycle in result.metrics.cycles {
    print("Cycle \(cycle.cycleNumber):")
    print("  Latency: \(cycle.modelLatencyMs)ms")
    print("  Stop reason: \(cycle.stopReason)")
    print("  Tools run: \(cycle.toolsExecuted)")
    print("  Output tokens: \(cycle.usage.outputTokens)")
}
```
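For a sense of how a derived figure like `outputTokensPerSecond` falls out of the raw counters, here is a standalone sketch using plain values (hypothetical function, not the SDK's implementation):

```swift
/// Derives tokens/sec from raw counters, mirroring how a field like
/// outputTokensPerSecond could be computed. Hypothetical, for illustration.
func tokensPerSecond(outputTokens: Int, latencyMs: Int) -> Double {
    guard latencyMs > 0 else { return 0 }
    return Double(outputTokens) / (Double(latencyMs) / 1000.0)
}

print(tokensPerSecond(outputTokens: 450, latencyMs: 3000)) // 150.0
```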
## Lifecycle Hooks
Hooks let you observe or react to agent events without modifying the agent. Register callbacks on `agent.hookRegistry`:
```swift
// Before each model call
agent.hookRegistry.addCallback(BeforeInvocationEvent.self) { event in
    print("Sending \(event.messages.count) messages to model")
}

// After each model call
agent.hookRegistry.addCallback(AfterInvocationEvent.self) { event in
    print("Model responded in \(event.latencyMs)ms")
}

// After each tool execution
agent.hookRegistry.addCallback(AfterToolEvent.self) { event in
    print("Tool \(event.toolName): \(event.result.status)")
}

// At the end of every run (good for logging to a metrics system)
agent.hookRegistry.addCallback(MetricsEvent.self) { event in
    myMetrics.record(
        cycles: event.metrics.cycleCount,
        tokens: event.metrics.totalUsage.totalTokens,
        latency: event.metrics.totalLatencyMs
    )
}
```
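`myMetrics` above is a placeholder for whatever sink you use. A minimal in-memory recorder with the same `record(cycles:tokens:latency:)` shape might look like this (hypothetical helper, for illustration):

```swift
/// Accumulates per-run stats fed by a MetricsEvent callback.
/// Hypothetical helper, not part of the SDK.
final class RunMetricsRecorder {
    private(set) var runs = 0
    private(set) var totalTokens = 0
    private(set) var totalLatencyMs = 0

    func record(cycles: Int, tokens: Int, latency: Int) {
        runs += 1
        totalTokens += tokens
        totalLatencyMs += latency
    }

    var averageLatencyMs: Double {
        runs == 0 ? 0 : Double(totalLatencyMs) / Double(runs)
    }
}

let myMetrics = RunMetricsRecorder()
myMetrics.record(cycles: 2, tokens: 1200, latency: 800)
myMetrics.record(cycles: 1, tokens: 300, latency: 400)
print(myMetrics.averageLatencyMs) // 600.0
```

In a real app you would likely make this an actor or add locking, since hook callbacks may fire from concurrent runs.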
### Available hook events

| Event | When it fires |
|---|---|
| `BeforeInvocationEvent` | Before each model API call |
| `AfterInvocationEvent` | After each model API call |
| `BeforeToolEvent` | Before a tool is executed |
| `AfterToolEvent` | After a tool returns |
| `MetricsEvent` | At the end of a complete agent run |