Observability
Containment Chamber provides three observability pillars: Prometheus metrics, OpenTelemetry OTLP tracing, and structured JSON logging. The metrics endpoint runs on a separate port from the signing API, so you can expose metrics to your monitoring stack without exposing the signing surface.
What to Alert On
Start with these signals before tuning dashboard detail:
| Signal | Why it matters |
|---|---|
| containment_healthy == 0 | The signer is not healthy. |
| containment_signer_state not showing unsealed during duty windows | Signing will fail or is waiting on operators. |
| containment_slashing_rejections_total increasing | Slashing protection is blocking requests. Investigate before retrying duties. |
| containment_auth_rejections_total increasing unexpectedly | Tokens, policies, or client configuration may be wrong. |
| containment_canary_signing_total increasing | A canary key signed. Treat as a security incident. |
| KMS or DynamoDB error counters increasing | The signer may lose ability to unseal or refresh keys. |
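
To make the "increasing" conditions in the table concrete, here is a minimal Python sketch that compares two scrapes of the metrics endpoint. It is an illustration only, not part of Containment Chamber: the helper names `parse_samples` and `find_alerts` are hypothetical, and the `reason="bad_token"` label in the sample data is an assumption. In production you would normally express the same checks as alerting rules in your monitoring stack.

```python
import re

# One sample line of the Prometheus text exposition format:
#   name{label="value",...} 12.5
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(\{.*\})?\s+(\S+)')

# Counters from the table above whose growth should raise an alert.
WATCHED_COUNTERS = (
    "containment_slashing_rejections_total",
    "containment_auth_rejections_total",
    "containment_canary_signing_total",
)


def parse_samples(text):
    """Return {(name, labels_string): value} for every sample line."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match:
            name, labels, value = match.groups()
            samples[(name, labels or "")] = float(value)
    return samples


def find_alerts(previous, current):
    """Compare two scrapes and return human-readable alert strings."""
    alerts = []
    for (name, labels), value in current.items():
        if name == "containment_healthy" and value == 0:
            alerts.append("signer is not healthy")
        elif name in WATCHED_COUNTERS:
            # A counter that went down (process restart) is not flagged here.
            before = previous.get((name, labels), 0.0)
            if value > before:
                alerts.append(f"{name}{labels} increased by {value - before:g}")
    return alerts


# Sample data; the label name and value are assumptions for illustration.
previous = parse_samples("""
# TYPE containment_healthy gauge
containment_healthy 1
containment_canary_signing_total 0
containment_auth_rejections_total{reason="bad_token"} 4
""")
current = parse_samples("""
containment_healthy 0
containment_canary_signing_total 1
containment_auth_rejections_total{reason="bad_token"} 4
""")

for alert in find_alerts(previous, current):
    print(alert)
```

With the sample data, the sketch reports the unhealthy signer and the canary counter, and stays silent about the unchanged authentication counter.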
Prometheus Metrics
Metrics are served on a dedicated HTTP endpoint, separate from the signing API (port 9000).

    metrics:
      listen_address: "0.0.0.0"
      listen_port: 3000
      refresh_interval_seconds: 30

| Option | Default | Description |
|---|---|---|
| listen_address | 0.0.0.0 | Bind address for the metrics server |
| listen_port | 3000 | Port for the metrics endpoint |
| refresh_interval_seconds | 30 | How often metrics are refreshed |
Verify metrics are working:

    curl http://localhost:3000/metrics

Metrics Reference
All metrics exposed at /metrics:
Signing
| Name | Type | Description |
|---|---|---|
| containment_canary_signing_total | counter | Number of times a canary key has signed |
| containment_signing_concurrency_limit | gauge | Configured signing concurrency limit |
| containment_signing_duration_seconds | histogram | Duration of signing operations in seconds |
| containment_signing_requests_total | counter | Total signing requests by status and operation |
| containment_signing_semaphore_available | gauge | Available signing semaphore permits |
| containment_slashing_rejections_total | counter | Total signing requests rejected by slashing protection |

| Name | Type | Description |
|---|---|---|
| containment_chamber_ceremony_lock_held_seconds | gauge | Current ceremony transition_guard held-duration in seconds (0 when free) |
| containment_chamber_ceremony_lock_stuck_total | counter | Number of times the ceremony watchdog detected transition_guard held longer than the stuck threshold (suggests deadlock; operator must restart process) |
| containment_chamber_init_total | counter | Number of chamber init ceremonies performed |
| containment_chamber_rotation_total | counter | Number of rotation operations by type (kms, unseal, mode) |
| containment_chamber_seal_total | counter | Number of emergency seal operations |
| containment_chamber_tee_unseal_total | counter | Number of TEE auto-unseal attempts by status (success, measurement_mismatch, malformed_blob, rogue_arn, unsupported_version, kms_attestation_rejected, kms_unavailable) |
| containment_chamber_unseal_shares_total | counter | Number of unseal share submissions by operator |
| containment_chamber_unseal_total | counter | Number of completed unseal ceremonies |

| Name | Type | Description |
|---|---|---|
| containment_dynamodb_key_refresh_duration_seconds | histogram | Duration of DynamoDB key refresh operations in seconds |
| containment_key_load_failures_total | counter | Total validator keys that failed to load |
| containment_key_loading_duration_seconds | gauge | Duration of key loading operations in seconds |
| containment_key_refresh_total | counter | Total keys added via background refresh |
| containment_keys_active | gauge | Number of active validator keys by source |
Key Management API
| Name | Type | Description |
|---|---|---|
| containment_key_deletions_total | counter | Total validator keys deleted via Key Manager API |
| containment_key_import_duration_seconds | histogram | Duration of Key Manager API import operations in seconds |
| containment_key_imports_total | counter | Total validator keys imported via Key Manager API |
| containment_key_requests_total | counter | Total Key Manager API requests by method |
Keygen
| Name | Type | Description |
|---|---|---|
| containment_keygen_duration_seconds | histogram | Duration of keygen operations in seconds |
| containment_keygen_errors_total | counter | Total keygen errors (labels: error_type ∈ {validation, crypto, backup, storage}) |
| containment_keygen_total | counter | Total validator keys generated via keygen endpoint |
Anti-Slashing
| Name | Type | Description |
|---|---|---|
| containment_anti_slashing_check_duration_seconds | histogram | Duration of anti-slashing checks in seconds |
| containment_anti_slashing_errors_total | counter | Total anti-slashing backend errors |
| containment_anti_slashing_hmac_mismatch_total | counter | Anti-slashing per-row HMAC verification failures by row kind |
| containment_anti_slashing_malformed_row_total | counter | Anti-slashing rows failing structural validation (malformed pk, unsupported scheme) |
| containment_anti_slashing_master_key_sealed_total | counter | Anti-slashing operations aborted because chamber was sealed mid-op |
| containment_anti_slashing_pg_pool | gauge | PostgreSQL connection pool state by status |

| Name | Type | Description |
|---|---|---|
| containment_auth_rejections_total | counter | Total authentication rejections by reason |
| containment_ceremony_cidr_filter_enabled | gauge | Ceremony CIDR filter state (1 = enabled, 0 = disabled / empty list) |
| containment_cidr_rejections_total | counter | Total requests rejected by a CIDR guard layer (labels: layer, reason) |
HTTP Errors
| Name | Type | Description |
|---|---|---|
| containment_http_errors_total | counter | Total HTTP error responses by status code |
AWS/KMS
| Name | Type | Description |
|---|---|---|
| containment_dynamodb_keystore_errors_total | counter | Total AWS keystore errors by operation |
| containment_kms_operation_duration_seconds | histogram | Duration of KMS operations in seconds |
| containment_kms_operations_total | counter | Total KMS operations by action and status |

| Name | Type | Description |
|---|---|---|
| containment_tls_cert_expiry_seconds | gauge | Seconds until current TLS certificate expires |
| containment_tls_cert_generation_duration_seconds | histogram | Time to generate TLS certificate and attestation document in seconds |
| containment_tls_cert_rotations_total | counter | Total TLS certificate rotations |
| containment_tls_handshakes_total | counter | Total TLS handshake attempts by status |
System
| Name | Type | Description |
|---|---|---|
| containment_background_task_panics_total | counter | Total panics in long-running background tasks (labeled by task name) |
| containment_build_info | gauge | Build information (version, commit, timestamp) |
| containment_control_plane_component_up | gauge | Whether the last observed control-plane refresh outcome succeeded (1 = ok, 0 = error) |
| containment_control_plane_last_success_unix_seconds | gauge | Unix timestamp of the last successful control-plane refresh by component |
| containment_control_plane_refresh_total | counter | Control-plane refresh outcomes by component and status |
| containment_handler_panics_total | counter | Total handler panics caught and converted to 500 by CatchPanicLayer |
| containment_healthy | gauge | Health status of the signer (1 = healthy, 0 = unhealthy) |
| containment_network_info | gauge | Ethereum network configuration info gauge |
| containment_signer_state | gauge | Current signer state (1 = active, 0 = inactive) by state label |
| containment_startup_duration_seconds | gauge | Time from process start to signer ready in seconds |
| containment_uptime_seconds | gauge | Uptime in seconds since process start |
Backpressure
| Name | Type | Description |
|---|---|---|
| containment_queue_rejected_total | counter | Total requests rejected due to backpressure |
Passphrase Validation
| Name | Type | Description |
|---|---|---|
| containment_passphrase_validation_rejections_total | counter | Total passphrase-validation rejections (labels: reason, endpoint) |
| containment_zxcvbn_estimate_duration_seconds | histogram | Duration of zxcvbn passphrase-strength estimator invocations in seconds (recorded only after length-floor check passes) |
Enclave
| Name | Type | Description |
|---|---|---|
| containment_enclave_config_bootstrap_duration_seconds | histogram | Wall-clock time spent fetching the bootstrap YAML over vsock, from first connect attempt to successful read_to_end |
| containment_enclave_config_bootstrap_failures_total | counter | Terminal enclave bootstrap failures (labels: reason ∈ {timeout, permanent_connect, oversize, partial_read, invalid_utf8}); matches the event=bootstrap_failure reason=... tracing log emitted on the same failure |
| containment_enclave_config_bootstrap_retries_total | counter | Transient vsock connect failures that triggered a backoff retry during enclave bootstrap (ConnectionRefused / TimedOut / Interrupted / WouldBlock) |
| containment_enclave_log_events_dropped_total | counter | Enclave log events dropped by the in-enclave vsock log forwarder (labels: reason ∈ {backoff, connect_failed, write_failed}) |
The operation label uses the signing operation names: AGGREGATION_SLOT, AGGREGATE_AND_PROOF, ATTESTATION, BLOCK_V2, RANDAO_REVEAL, SYNC_COMMITTEE_CONTRIBUTION_AND_PROOF, SYNC_COMMITTEE_MESSAGE, SYNC_COMMITTEE_SELECTION_PROOF, VALIDATOR_REGISTRATION, VOLUNTARY_EXIT.
Process metrics (containment_process_resident_memory_bytes and containment_process_open_fds) are only available on Linux.
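
As a rough illustration of how the operation label can be used, the Python sketch below sums containment_signing_requests_total per operation from the text served at /metrics. The function name `requests_by_operation` is hypothetical, and the `status` label name and its `success` / `error` values in the sample are assumptions; check your own /metrics output for the actual label set.

```python
import re
from collections import defaultdict

# Only labeled samples of this one counter are of interest here.
SAMPLE_RE = re.compile(r'^containment_signing_requests_total\{(.*)\}\s+(\S+)')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def requests_by_operation(metrics_text):
    """Sum the counter across all other labels, grouped by operation."""
    totals = defaultdict(float)
    for line in metrics_text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if not match:
            continue  # comments, other metrics, blank lines
        labels = dict(LABEL_RE.findall(match.group(1)))
        totals[labels.get("operation", "unknown")] += float(match.group(2))
    return dict(totals)


# Sample exposition text; label names other than "operation" are assumed.
sample = '''
containment_signing_requests_total{operation="ATTESTATION",status="success"} 120
containment_signing_requests_total{operation="ATTESTATION",status="error"} 3
containment_signing_requests_total{operation="BLOCK_V2",status="success"} 2
'''

print(requests_by_operation(sample))
```

The same grouping is usually done in the monitoring stack itself; the sketch is mainly useful for quick checks against a single scrape.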
OpenTelemetry OTLP Tracing
Containment Chamber can export distributed traces via gRPC OTLP to any OpenTelemetry-compatible collector: Jaeger, Grafana Tempo, Honeycomb, Datadog, and others.

    opentelemetry:
      enabled: true
      endpoint: "http://otel-collector:4317"
      service_name: "containment-chamber"

| Option | Default | Description |
|---|---|---|
| enabled | false | Enable OTLP trace export |
| endpoint | http://localhost:4317 | gRPC OTLP collector endpoint |
| service_name | containment-chamber | Service name in traces |
Traces include the full request lifecycle — from HTTP ingestion through authorization, slashing protection checks, and BLS signing.
Grafana Dashboards
Two pre-built Grafana dashboards are included in the repository under k8s/dashboards/:
containment-chamber-classic.json — A standalone dashboard suitable for any deployment model (bare metal, Docker, Kubernetes).
Import via: Grafana → Dashboards → Import → Upload JSON file
containment-chamber-kubernetes.json — A Kubernetes-native dashboard with namespace and pod selector variables. Designed for multi-replica deployments where you need to filter by specific pods.
Import via: Grafana → Dashboards → Import → Upload JSON file
Kubernetes ServiceMonitor
If you use the Prometheus Operator, the Helm chart includes a ServiceMonitor resource for automatic scrape target discovery.
Enable it in your Helm values:

    serviceMonitor:
      enabled: true
      scrapeInterval: "15s"
      additionalLabels:
        release: prometheus

All available ServiceMonitor options:
| Option | Default | Description |
|---|---|---|
| enabled | false | Create a ServiceMonitor resource |
| scrapeInterval | 60s | Prometheus scrape interval |
| additionalLabels | {} | Labels added to the ServiceMonitor |
| namespace | "" | Namespace for the ServiceMonitor (defaults to release namespace) |
| namespaceSelector | {} | Namespace selector (use any: true to scrape all namespaces) |
| targetLabels | [] | Labels to transfer from the Kubernetes Service to scraped metrics |
| metricRelabelings | [] | Metric relabeling rules |
Logging
By default, Containment Chamber outputs human-readable text logs with ANSI colors (when connected to a terminal). Switch to JSON for production log aggregation.
Configuration

    logging:
      # Log level filter (supports tracing EnvFilter syntax)
      # Examples: "info", "debug", "containment_chamber=debug,hyper=info"
      level: "info"    # default: "info"

      # Output format: "text" (human-readable) or "json" (structured)
      format: text     # default: "text"

      # ANSI colors in text output (auto-detects TTY by default)
      log_color: null  # default: auto-detect (true if TTY, false otherwise)

| Option | Type | Default | Description |
|---|---|---|---|
| logging.level | string | "info" | Log level filter (EnvFilter syntax) |
| logging.format | enum | text | text for human-readable, json for structured JSON |
| logging.log_color | boolean | auto | ANSI colors; auto-detects TTY when unset |
Log levels

    # Via config
    logging:
      level: "containment_chamber=debug,hyper=info"

    # Or via environment variable (overrides config)
    RUST_LOG=containment_chamber=debug

JSON output
Enable JSON format for structured log aggregation (Datadog, Loki, CloudWatch, etc.):

    logging:
      format: json
      log_color: false  # disable ANSI escape codes in JSON

Each JSON log line includes timestamp, level, target, span context, and message fields.
Audit Logging
Security-relevant events are logged with target: "audit". This target is separate from the normal containment_chamber target, so you can route audit events to a dedicated sink without changing your general log level.
Events logged to the audit target:
| Event | When |
|---|---|
| signing request | Every signing attempt, including key and operation type |
| state transition | Seal machine state changes (e.g., Sealed → KmsUnsealed) |
| unseal share submitted | When an operator submits an unseal share, including share index |
| signer sealed | When the signer is sealed, and by whom |
Filtering audit events

    # Include audit events alongside normal application logs
    RUST_LOG=containment_chamber=info,audit=info

    # Audit events only (suppress everything else)
    RUST_LOG=off,audit=info

In JSON mode, filter on "target":"audit" in your log aggregator (Datadog, Loki, CloudWatch, etc.) to build a dedicated audit trail.
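
If you do not have a log aggregator in place yet, the same filter can be sketched in a few lines of Python. This is an illustration only: the function name `audit_events` is hypothetical, and apart from the "target" key described above, the exact JSON field layout in the sample lines (for example where the message text lives) is an assumption.

```python
import json


def audit_events(lines):
    """Yield parsed log records whose target is "audit"; skip non-JSON lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # text-format or truncated line
        if isinstance(record, dict) and record.get("target") == "audit":
            yield record


# Sample lines; the field layout beyond "target" is assumed for illustration.
sample = [
    '{"timestamp":"2025-01-01T00:00:00Z","level":"INFO","target":"audit","message":"signer sealed"}',
    '{"timestamp":"2025-01-01T00:00:01Z","level":"INFO","target":"containment_chamber","message":"refresh complete"}',
    'plain text line that is not JSON',
]

for event in audit_events(sample):
    print(event.get("timestamp"), event.get("message"))
```

To run it against live output, pass `sys.stdin` (after `import sys`) instead of the sample list and pipe the signer's JSON logs into the script.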