Observability

Containment Chamber provides three observability pillars: Prometheus metrics, OpenTelemetry OTLP tracing, and structured JSON logging. The metrics endpoint runs on a separate port from the signing API, so you can expose metrics to your monitoring stack without exposing the signing surface.

Start with these signals before tuning dashboard detail:

| Signal | Why it matters |
| --- | --- |
| containment_healthy == 0 | The signer is not healthy. |
| containment_signer_state not showing unsealed during duty windows | Signing will fail or is waiting on operators. |
| containment_slashing_rejections_total increasing | Slashing protection is blocking requests. Investigate before retrying duties. |
| containment_auth_rejections_total increasing unexpectedly | Tokens, policies, or client configuration may be wrong. |
| containment_canary_signing_total increasing | A canary key signed. Treat as a security incident. |
| KMS or DynamoDB error counters increasing | The signer may lose ability to unseal or refresh keys. |
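
The signals in the table above can be checked mechanically by comparing two scrapes of the metrics endpoint. The Python sketch below is illustrative only. It assumes the standard Prometheus text exposition format, and it assumes that containment_signer_state carries a label named state whose value is unsealed when the signer is ready. The helper names (parse_metrics, total, first_look) and the sample scrapes are hypothetical. In most deployments these checks would live in alerting rules; the sketch only shows the logic.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

# Counters from the table whose increase deserves attention.
WATCHED_COUNTERS = (
    "containment_slashing_rejections_total",
    "containment_auth_rejections_total",
    "containment_canary_signing_total",
)


def parse_metrics(text):
    """Parse exposition text into {(name, frozenset(label pairs)): value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match:
            name, raw_labels, value = match.groups()
            labels = frozenset(LABEL_RE.findall(raw_labels or ""))
            samples[(name, labels)] = float(value)
    return samples


def total(samples, name):
    """Sum a metric across all of its label combinations."""
    return sum(v for (n, _), v in samples.items() if n == name)


def first_look(prev, curr):
    """Return human-readable findings for the first-look signals."""
    findings = []
    if curr.get(("containment_healthy", frozenset())) == 0:
        findings.append("containment_healthy is 0")
    # Assumption: the state gauge has a `state` label with value "unsealed".
    unsealed = any(
        value == 1
        for (name, labels), value in curr.items()
        if name == "containment_signer_state" and ("state", "unsealed") in labels
    )
    if not unsealed:
        findings.append("containment_signer_state is not unsealed")
    # A counter that drops was reset by a restart; only growth is flagged.
    for name in WATCHED_COUNTERS:
        if total(curr, name) > total(prev, name):
            findings.append(name + " increased")
    return findings


# Hypothetical scrapes taken one interval apart.
PREV = """containment_healthy 1
containment_signer_state{state="unsealed"} 1
containment_slashing_rejections_total 2
"""
CURR = PREV.replace("rejections_total 2", "rejections_total 3")

print(first_look(parse_metrics(PREV), parse_metrics(CURR)))
```

Comparing two scrapes rather than reading a single value matters for the counter-based signals, because a counter's absolute value says little; only its growth between scrapes indicates new rejections or canary signatures.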

Metrics are served on a dedicated HTTP endpoint, separate from the signing API, which listens on port 9000.

metrics:
  listen_address: "0.0.0.0"
  listen_port: 3000
  refresh_interval_seconds: 30

| Option | Default | Description |
| --- | --- | --- |
| listen_address | 0.0.0.0 | Bind address for the metrics server |
| listen_port | 3000 | Port for the metrics endpoint |
| refresh_interval_seconds | 30 | How often metrics are refreshed |

Verify metrics are working:

curl http://localhost:3000/metrics
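
The same check can be scripted, for example as a post-deploy smoke test. The Python sketch below fetches the endpoint and reports which expected metric families are absent. It assumes the endpoint emits the usual `# TYPE <name> <type>` comment lines of the Prometheus text format. The function names and the sample payload are hypothetical, and the URL default simply mirrors the port used above.

```python
import urllib.request


def fetch_metrics(url="http://localhost:3000/metrics", timeout=5.0):
    """Fetch the raw Prometheus text exposition from the metrics endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def metric_families(text):
    """Map family name -> type, read from '# TYPE <name> <type>' lines."""
    families = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 4 and parts[0] == "#" and parts[1] == "TYPE":
            families[parts[2]] = parts[3]
    return families


def missing_families(text, expected):
    """Return the expected family names that the payload does not declare."""
    return sorted(set(expected) - set(metric_families(text)))


# Hypothetical payload, standing in for fetch_metrics() so the example
# runs without a live signer.
SAMPLE = """# HELP containment_healthy Health status of the signer
# TYPE containment_healthy gauge
containment_healthy 1
# TYPE containment_uptime_seconds gauge
containment_uptime_seconds 42
"""

print(missing_families(SAMPLE, ["containment_healthy", "containment_build_info"]))
```

Against a running instance, replace SAMPLE with the result of fetch_metrics(); an empty list means every expected family was declared.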

All metrics exposed at /metrics:

| Name | Type | Description |
| --- | --- | --- |
| containment_canary_signing_total | counter | Number of times a canary key has signed |
| containment_signing_concurrency_limit | gauge | Configured signing concurrency limit |
| containment_signing_duration_seconds | histogram | Duration of signing operations in seconds |
| containment_signing_requests_total | counter | Total signing requests by status and operation |
| containment_signing_semaphore_available | gauge | Available signing semaphore permits |
| containment_slashing_rejections_total | counter | Total signing requests rejected by slashing protection |

| Name | Type | Description |
| --- | --- | --- |
| containment_chamber_ceremony_lock_held_seconds | gauge | Current ceremony transition_guard held-duration in seconds (0 when free) |
| containment_chamber_ceremony_lock_stuck_total | counter | Number of times the ceremony watchdog detected transition_guard held longer than the stuck threshold (suggests deadlock; operator must restart process) |
| containment_chamber_init_total | counter | Number of chamber init ceremonies performed |
| containment_chamber_rotation_total | counter | Number of rotation operations by type (kms, unseal, mode) |
| containment_chamber_seal_total | counter | Number of emergency seal operations |
| containment_chamber_tee_unseal_total | counter | Number of TEE auto-unseal attempts by status (success, measurement_mismatch, malformed_blob, rogue_arn, unsupported_version, kms_attestation_rejected, kms_unavailable) |
| containment_chamber_unseal_shares_total | counter | Number of unseal share submissions by operator |
| containment_chamber_unseal_total | counter | Number of completed unseal ceremonies |

| Name | Type | Description |
| --- | --- | --- |
| containment_dynamodb_key_refresh_duration_seconds | histogram | Duration of DynamoDB key refresh operations in seconds |
| containment_key_load_failures_total | counter | Total validator keys that failed to load |
| containment_key_loading_duration_seconds | gauge | Duration of key loading operations in seconds |
| containment_key_refresh_total | counter | Total keys added via background refresh |
| containment_keys_active | gauge | Number of active validator keys by source |

| Name | Type | Description |
| --- | --- | --- |
| containment_key_deletions_total | counter | Total validator keys deleted via Key Manager API |
| containment_key_import_duration_seconds | histogram | Duration of Key Manager API import operations in seconds |
| containment_key_imports_total | counter | Total validator keys imported via Key Manager API |
| containment_key_requests_total | counter | Total Key Manager API requests by method |

| Name | Type | Description |
| --- | --- | --- |
| containment_keygen_duration_seconds | histogram | Duration of keygen operations in seconds |
| containment_keygen_errors_total | counter | Total keygen errors (labels: error_type ∈ {validation, crypto, backup, storage}) |
| containment_keygen_total | counter | Total validator keys generated via keygen endpoint |

| Name | Type | Description |
| --- | --- | --- |
| containment_anti_slashing_check_duration_seconds | histogram | Duration of anti-slashing checks in seconds |
| containment_anti_slashing_errors_total | counter | Total anti-slashing backend errors |
| containment_anti_slashing_hmac_mismatch_total | counter | Anti-slashing per-row HMAC verification failures by row kind |
| containment_anti_slashing_malformed_row_total | counter | Anti-slashing rows failing structural validation (malformed pk, unsupported scheme) |
| containment_anti_slashing_master_key_sealed_total | counter | Anti-slashing operations aborted because chamber was sealed mid-op |
| containment_anti_slashing_pg_pool | gauge | PostgreSQL connection pool state by status |

| Name | Type | Description |
| --- | --- | --- |
| containment_auth_rejections_total | counter | Total authentication rejections by reason |
| containment_ceremony_cidr_filter_enabled | gauge | Ceremony CIDR filter state (1 = enabled, 0 = disabled / empty list) |
| containment_cidr_rejections_total | counter | Total requests rejected by a CIDR guard layer (labels: layer, reason) |

| Name | Type | Description |
| --- | --- | --- |
| containment_http_errors_total | counter | Total HTTP error responses by status code |

| Name | Type | Description |
| --- | --- | --- |
| containment_dynamodb_keystore_errors_total | counter | Total AWS keystore errors by operation |
| containment_kms_operation_duration_seconds | histogram | Duration of KMS operations in seconds |
| containment_kms_operations_total | counter | Total KMS operations by action and status |

| Name | Type | Description |
| --- | --- | --- |
| containment_tls_cert_expiry_seconds | gauge | Seconds until current TLS certificate expires |
| containment_tls_cert_generation_duration_seconds | histogram | Time to generate TLS certificate and attestation document in seconds |
| containment_tls_cert_rotations_total | counter | Total TLS certificate rotations |
| containment_tls_handshakes_total | counter | Total TLS handshake attempts by status |

| Name | Type | Description |
| --- | --- | --- |
| containment_background_task_panics_total | counter | Total panics in long-running background tasks (labeled by task name) |
| containment_build_info | gauge | Build information (version, commit, timestamp) |
| containment_control_plane_component_up | gauge | Whether the last observed control-plane refresh outcome succeeded (1 = ok, 0 = error) |
| containment_control_plane_last_success_unix_seconds | gauge | Unix timestamp of the last successful control-plane refresh by component |
| containment_control_plane_refresh_total | counter | Control-plane refresh outcomes by component and status |
| containment_handler_panics_total | counter | Total handler panics caught and converted to 500 by CatchPanicLayer |
| containment_healthy | gauge | Health status of the signer (1 = healthy, 0 = unhealthy) |
| containment_network_info | gauge | Ethereum network configuration info gauge |
| containment_signer_state | gauge | Current signer state (1 = active, 0 = inactive) by state label |
| containment_startup_duration_seconds | gauge | Time from process start to signer ready in seconds |
| containment_uptime_seconds | gauge | Uptime in seconds since process start |

| Name | Type | Description |
| --- | --- | --- |
| containment_queue_rejected_total | counter | Total requests rejected due to backpressure |

| Name | Type | Description |
| --- | --- | --- |
| containment_passphrase_validation_rejections_total | counter | Total passphrase-validation rejections (labels: reason, endpoint) |
| containment_zxcvbn_estimate_duration_seconds | histogram | Duration of zxcvbn passphrase-strength estimator invocations in seconds (recorded only after length-floor check passes) |

| Name | Type | Description |
| --- | --- | --- |
| containment_enclave_config_bootstrap_duration_seconds | histogram | Wall-clock time spent fetching the bootstrap YAML over vsock, from first connect attempt to successful read_to_end |
| containment_enclave_config_bootstrap_failures_total | counter | Terminal enclave bootstrap failures (labels: reason ∈ {timeout, permanent_connect, oversize, partial_read, invalid_utf8}); matches the event=bootstrap_failure reason=... tracing log emitted on the same failure |
| containment_enclave_config_bootstrap_retries_total | counter | Transient vsock connect failures that triggered a backoff retry during enclave bootstrap (ConnectionRefused / TimedOut / Interrupted / WouldBlock) |
| containment_enclave_log_events_dropped_total | counter | Enclave log events dropped by the in-enclave vsock log forwarder (labels: reason ∈ {backoff, connect_failed, write_failed}) |

The operation label uses the signing operation names: AGGREGATION_SLOT, AGGREGATE_AND_PROOF, ATTESTATION, BLOCK_V2, RANDAO_REVEAL, SYNC_COMMITTEE_CONTRIBUTION_AND_PROOF, SYNC_COMMITTEE_MESSAGE, SYNC_COMMITTEE_SELECTION_PROOF, VALIDATOR_REGISTRATION, VOLUNTARY_EXIT.

Process metrics (containment_process_resident_memory_bytes and containment_process_open_fds) are only available on Linux.
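
Histogram metrics such as containment_signing_duration_seconds are exposed in the Prometheus text format as cumulative `_bucket{le="..."}` series, so latency percentiles have to be estimated from bucket counts. The Python sketch below shows one way to do that, using linear interpolation inside the matching bucket, which is the same general approach as PromQL's histogram_quantile. The bucket boundaries and counts are invented for illustration; the real bucket layout is not documented here.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs, i.e. the
    values of the `_bucket{le="..."}` series, including the +Inf bucket.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                return lower_bound  # beyond the last finite bucket
            if count == lower_count:
                return upper_bound
            # Interpolate linearly between the bucket's bounds.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative buckets (seconds, count) for signing duration.
BUCKETS = [(0.005, 50), (0.01, 90), (0.05, 99), (math.inf, 100)]

p50 = histogram_quantile(0.50, BUCKETS)
p95 = histogram_quantile(0.95, BUCKETS)
print(f"p50={p50:.4f}s p95={p95:.4f}s")
```

The estimate is only as precise as the bucket boundaries allow: a quantile that lands in the +Inf bucket can only be reported as the highest finite boundary.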

Containment Chamber can export distributed traces via gRPC OTLP to any OpenTelemetry-compatible collector — Jaeger, Grafana Tempo, Honeycomb, Datadog, and others.

opentelemetry:
  enabled: true
  endpoint: "http://otel-collector:4317"
  service_name: "containment-chamber"

| Option | Default | Description |
| --- | --- | --- |
| enabled | false | Enable OTLP trace export |
| endpoint | http://localhost:4317 | gRPC OTLP collector endpoint |
| service_name | containment-chamber | Service name in traces |

Traces include the full request lifecycle — from HTTP ingestion through authorization, slashing protection checks, and BLS signing.

Two pre-built Grafana dashboards are included in the repository under k8s/dashboards/:

containment-chamber-classic.json — A standalone dashboard suitable for any deployment model (bare metal, Docker, Kubernetes).

Import via: Grafana → Dashboards → Import → Upload JSON file

If you use the Prometheus Operator, the Helm chart includes a ServiceMonitor resource for automatic scrape target discovery.

Enable it in your Helm values:

serviceMonitor:
  enabled: true
  scrapeInterval: "15s"
  additionalLabels:
    release: prometheus

All available ServiceMonitor options:

| Option | Default | Description |
| --- | --- | --- |
| enabled | false | Create a ServiceMonitor resource |
| scrapeInterval | 60s | Prometheus scrape interval |
| additionalLabels | {} | Labels added to the ServiceMonitor |
| namespace | "" | Namespace for the ServiceMonitor (defaults to release namespace) |
| namespaceSelector | {} | Namespace selector (use any: true to scrape all namespaces) |
| targetLabels | [] | Labels to transfer from the Kubernetes Service to scraped metrics |
| metricRelabelings | [] | Metric relabeling rules |

By default, Containment Chamber outputs human-readable text logs with ANSI colors (when connected to a terminal). Switch to JSON for production log aggregation.

logging:
  # Log level filter; supports tracing EnvFilter syntax
  # Examples: "info", "debug", "containment_chamber=debug,hyper=info"
  level: "info"      # default: "info"
  # Output format: "text" (human-readable) or "json" (structured)
  format: text       # default: "text"
  # ANSI colors in text output; auto-detects TTY by default
  log_color: null    # default: auto-detect (true if TTY, false otherwise)

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| logging.level | string | "info" | Log level filter (EnvFilter syntax) |
| logging.format | enum | text | text for human-readable, json for structured JSON |
| logging.log_color | boolean | auto | ANSI colors; auto-detects TTY when unset |

# Via config
logging:
  level: "containment_chamber=debug,hyper=info"
# Or via environment variable (overrides config)
RUST_LOG=containment_chamber=debug

Enable JSON format for structured log aggregation (Datadog, Loki, CloudWatch, etc.):

logging:
  format: json
  log_color: false  # disable ANSI escape codes in JSON

Each JSON log line includes timestamp, level, target, span context, and message fields.

Security-relevant events are logged with target: "audit". This target is separate from the normal containment_chamber target, so you can route audit events to a dedicated sink without changing your general log level.

Events logged to the audit target:

| Event | When |
| --- | --- |
| signing request | Every signing attempt, including key and operation type |
| state transition | Seal machine state changes (e.g., Sealed → KmsUnsealed) |
| unseal share submitted | When an operator submits an unseal share, including share index |
| signer sealed | When the signer is sealed, and by whom |

# Include audit events alongside normal application logs
RUST_LOG=containment_chamber=info,audit=info
# Audit events only; suppress everything else
RUST_LOG=off,audit=info

In JSON mode, filter on "target":"audit" in your log aggregator (Datadog, Loki, CloudWatch, etc.) to build a dedicated audit trail.
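
Where no log aggregator is available, the same filter can be applied to the raw JSON stream. The Python sketch below keeps only events whose top-level target field equals audit. The sample log lines are hypothetical: their field names follow the list given earlier (timestamp, level, target, message), but the exact layout of real log lines is an assumption, so the code relies only on the target field.

```python
import json


def audit_events(lines):
    """Yield parsed JSON log events whose `target` field is "audit"."""
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip blank or non-JSON lines, e.g. text-format output
        if isinstance(event, dict) and event.get("target") == "audit":
            yield event


# Hypothetical log lines; real events may carry additional fields.
SAMPLE_LINES = [
    '{"timestamp":"2024-01-01T00:00:00Z","level":"INFO","target":"audit","message":"signer sealed"}',
    '{"timestamp":"2024-01-01T00:00:01Z","level":"INFO","target":"containment_chamber","message":"listening"}',
    "not json",
]

for event in audit_events(SAMPLE_LINES):
    print(event["timestamp"], event["message"])
```

To filter a live stream, pass sys.stdin (or an open log file) instead of SAMPLE_LINES and write the yielded events to the dedicated audit sink.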