Dynamic debug logging

Family: observability
Status: proposed
Complexity: low
Advances: PL4-cost-governance
Prerequisites: PL3-emission-quality ≥ 2, PL4-pii-masking ≥ 2

Purpose

Remotely escalate log verbosity per user, per device, or per session for targeted investigation, without redeploying, without PII leakage, and without unbounded cost. The recipe advances PL4-cost-governance by containing one of the canonical cost-runaway domains (verbose logging) via scoped, time-boxed, cost-aware escalation. The mechanism was encoded as rubric criterion PL4-dynamic-debug-logging in v0.17 through v0.23. It was extracted into a recipe in v0.24 because the criterion’s text (“per-device, time-boxed, cost-aware”) prescribed mechanism, which under canon’s rubric-vs-recipe discipline belongs in recipes rather than in the diagnostic instrument.

Architecture

  • Togglable flag: a feature-flag or config store holding per-scope debug state. This is a runtime-mutable flag, not a build-time DEBUG compile flag.
  • Scoping: every debug-on event is bounded to one of four scopes (per-user, per-device, per-session, per-request). Global debug-on is not permitted at this recipe’s level; it requires explicit elevation as a separate, audited operation.
  • Auto-off: every debug-on event carries a time bound (a 30 min default is a reasonable starting point) and expires automatically. Expired state is reaped by a scheduler; this composes with the GitHub Actions scheduler or an equivalent scheduling substrate.
  • Cost budget: every debug-on event is attributed to cost-governance tracking. The budget has two thresholds, a soft alert and a hard cap. When spend exceeds the per-event or per-scope hard cap, the flag auto-clears and an alert is raised.
  • Audit: every toggle event (on, off, expiry, cost-triggered-off) is logged with the actor who flipped it, the scope, the duration, and the cost accrued. The audit log is write-once and observable via the project’s standard queryability surface.
  • Agent tool surface: an agent can request debug-on via its tool surface (MCP or equivalent), bounded by its credentials. An agent flipping its own session scope is low-friction; flipping a different user’s scope requires elevation.
  • PII invariant: verbose-level log output continues to pass PL4-pii-masking at the substrate layer. Dynamic debug does not bypass masking; it escalates the volume of already-masked output. This is preconditional, not layered: the masking is done by PL4-pii-masking’s substrate, and dynamic debug respects it.
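
The flag, scoping, auto-off, cost-budget, and audit components above can be sketched together as a small in-memory store. This is a minimal illustration under stated assumptions, not a reference implementation: the names DebugToggleStore, Scope, Grant, charge, and reap are hypothetical, the soft-alert threshold of 80% of the hard cap is an arbitrary choice, and a Python list stands in for a write-once audit log. A real deployment would back this with the chosen flag substrate and call reap from the scheduler.

```python
import time
from dataclasses import dataclass
from enum import Enum


class ScopeKind(Enum):
    # Deliberately no GLOBAL member: a global flip is a separate, elevated operation.
    USER = "user"
    DEVICE = "device"
    SESSION = "session"
    REQUEST = "request"


@dataclass(frozen=True)
class Scope:
    kind: ScopeKind
    ident: str


@dataclass
class Grant:
    scope: Scope
    actor: str          # who enabled debug for this scope
    started_at: float
    expires_at: float
    hard_cap: float     # cost units are whatever cost-governance tracks
    soft_alert: float
    spent: float = 0.0
    alerted: bool = False


class DebugToggleStore:
    DEFAULT_TTL_S = 30 * 60  # the recipe's suggested 30 min starting point

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._grants = {}
        self.audit = []  # append-only list standing in for a write-once log

    def _record(self, event, grant):
        self.audit.append({
            "event": event,
            "actor": grant.actor,
            "scope": (grant.scope.kind.value, grant.scope.ident),
            "duration_s": self._clock() - grant.started_at,
            "cost": grant.spent,
        })

    def enable(self, scope, actor, hard_cap, ttl_s=DEFAULT_TTL_S):
        now = self._clock()
        # Assumption: soft alert fires at 80% of the hard cap.
        grant = Grant(scope, actor, now, now + ttl_s, hard_cap, 0.8 * hard_cap)
        self._grants[scope] = grant
        self._record("on", grant)

    def disable(self, scope, event="off"):
        grant = self._grants.pop(scope, None)
        if grant is not None:
            self._record(event, grant)

    def reap(self):
        """Intended to be called by the scheduler: clear grants past their time bound."""
        now = self._clock()
        for scope in [s for s, g in self._grants.items() if now >= g.expires_at]:
            self.disable(scope, "expiry")

    def is_enabled(self, scope):
        grant = self._grants.get(scope)
        if grant is None:
            return False
        if self._clock() >= grant.expires_at:
            # Lazy expiry so a late scheduler run never extends a grant.
            self.disable(scope, "expiry")
            return False
        return True

    def charge(self, scope, cost):
        """Attribute cost to an active grant; alert softly, then clear at the hard cap."""
        if not self.is_enabled(scope):
            return
        grant = self._grants[scope]
        grant.spent += cost
        if grant.spent >= grant.hard_cap:
            self.disable(scope, "cost-triggered-off")
        elif grant.spent >= grant.soft_alert and not grant.alerted:
            grant.alerted = True
            self._record("soft-alert", grant)
```

In this sketch expiry is checked both lazily on read and in reap, so the time bound holds even if the scheduler runs late. The audit entry always names the actor who enabled the grant; recording a separate actor for manual off events is left out for brevity.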

Criteria advanced

  • PL4-cost-governance: primary target. Dynamic debug logging is one domain-specific containment mechanism for the general “cost is observable, capped, attributed” discipline. On its own it does not move a project from level-0 to level-2 on the criterion; level-2 requires observable-capped-attributed across all cost domains (agent inference, CI minutes, log retention, canary spin-up). This recipe instantiates the discipline specifically for verbose logging, and pairs with analogous (future) recipes for other cost-runaway domains as they are written up.

Prerequisites

  • PL3-emission-quality ≥ 2 (Emission quality). Structured logs with correlation IDs are preconditional. Escalating verbosity against unstructured logs produces noise, not signal. That is worse than static DEBUG, because cost rises without any diagnostic benefit. Dynamic debug assumes a structured baseline to escalate against.
  • PL4-pii-masking ≥ 2 (PII masking). Substrate-level PII masking is preconditional. If masking is application-layer-only or DB-only, verbose output at escalated levels will leak PII via telemetry paths that bypass the partial-layer masking. This recipe does not replace or layer on PII masking; it relies on it being a universal invariant.
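
To make "a structured baseline to escalate against" concrete, the sketch below uses Python's standard logging and contextvars modules to emit JSON records carrying a correlation ID, and lets DEBUG records through only when the current request's scope has debug enabled. It is one possible shape, not the recipe's prescribed mechanism. The names ScopedLevelFilter, JsonFormatter, build_logger, and handle_request are hypothetical, and the debug_on argument stands in for a lookup against the toggle store.

```python
import contextvars
import json
import logging

# Per-request context; each request handler sets these before logging.
correlation_id = contextvars.ContextVar("correlation_id", default=None)
debug_enabled = contextvars.ContextVar("debug_enabled", default=False)


class ScopedLevelFilter(logging.Filter):
    """Pass INFO and above always; pass DEBUG only for debug-enabled scopes."""

    def filter(self, record):
        record.correlation_id = correlation_id.get()
        if record.levelno >= logging.INFO:
            return True
        return debug_enabled.get()


class JsonFormatter(logging.Formatter):
    """Structured output: one JSON object per record."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "correlation_id": record.correlation_id,
            "msg": record.getMessage(),
        })


def build_logger(stream):
    logger = logging.getLogger("dynamic_debug_sketch")
    logger.setLevel(logging.DEBUG)  # the filter, not the logger level, gates DEBUG
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.addFilter(ScopedLevelFilter())
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def handle_request(logger, request_id, debug_on):
    # debug_on is assumed to come from the toggle store for this request's scope.
    correlation_id.set(request_id)
    debug_enabled.set(debug_on)
    logger.debug("cache lookup detail")
    logger.info("request handled")
```

Calling handle_request once with debug_on=False and once with debug_on=True yields one INFO record for the first request and a DEBUG plus an INFO record for the second, each tagged with its own correlation ID.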

Failure modes

  • Static DEBUG flag set and forgotten. Legacy pattern: DEBUG=true in env or config, never unset. Cost accumulates silently for months; often discovered only via an invoice anomaly. Mitigation: auto-off is non-optional; every debug-on event carries a time-bound and expires automatically. Static/build-time DEBUG is not recognised by this mechanism — migration from static-DEBUG to dynamic-debug is part of adopting the recipe.
  • Global debug-on. Someone flips debug globally (all users, all devices) rather than per-scope. The blast radius is unbounded: cost explodes, storage saturates, and downstream consumers may hit rate limits. Mitigation: scoping is non-optional at this recipe’s level; a global flip is a separate, elevated operation with its own audit, composed with GitOps JIT privilege elevation.
  • Cost runaway despite budget. Volume spike exceeds budget faster than monitoring can react (e.g. a verbose log statement inside a hot loop or exception handler that fires repeatedly). Mitigation: hard cap in addition to soft alert; auto-off on hard-cap exceedance; pre-deployment budget testing for known high-traffic paths.
  • PII escape via debug-only code paths. Masking rules were designed and tested at INFO level; DEBUG level may include additional log statements that bypass masking because the mask rules didn’t anticipate them. Mitigation: PL4-pii-masking ≥ 2 specifically requires substrate-level enforcement (not per-path sanitiser); verify masking applies at all log levels including DEBUG before adopting this recipe.
  • Tamper on the toggle. Debug-on state mutable without audit. Abuse case: an adversary (or compromised credential) enables debug on a target user to capture session tokens, debug-only internal state, or verbose error messages containing secrets. Mitigation: every toggle event audited with who/when/scope/cost/duration; audit log write-once; cross-scope flips require elevation.
  • Side effects from log level. Escalated verbosity changes runtime behaviour: more memory allocation, slower execution, and possibly different timing that masks or reveals race conditions. Mitigation: loggers must be pure observers; production tests with debug-on confirm no behavioural drift; in unavoidable cases, document the drift so investigators know what to expect.
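
The "verify masking applies at all log levels" mitigation can be expressed as a small pre-adoption check. The sketch below is illustrative only: MaskingFilter is a hypothetical stand-in for whatever substrate PL4-pii-masking actually provides, and the email regex is a deliberately simple example pattern, not a complete PII rule set. The point is the shape of the check: emit the same sensitive value at every level, DEBUG included, and assert that none of it survives.

```python
import io
import logging
import re

# Simplified example pattern; real masking rules are assumed to live in the substrate.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


class MaskingFilter(logging.Filter):
    """Stand-in for substrate-level masking, attached to the handler, not to call sites."""

    def filter(self, record):
        # Render the final message first so values passed as arguments are masked too.
        record.msg = EMAIL.sub("<masked>", record.getMessage())
        record.args = None
        return True


def masked_output_at_every_level(sample):
    """Log the sample at each level and return everything the handler wrote."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.addFilter(MaskingFilter())
    logger = logging.getLogger("masking_check_sketch")
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(logging.DEBUG)
    logger.addHandler(handler)
    for level in (logging.DEBUG, logging.INFO, logging.WARNING, logging.ERROR):
        logger.log(level, "login attempt by %s", sample)
    return stream.getvalue()
```

A check such as asserting that "alice@example.com" does not appear in masked_output_at_every_level("alice@example.com") can run in CI before the recipe is adopted, and again whenever new debug-only log statements are added.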

Cost estimate

Low-medium. First deployment: 2–5 engineer-days for the flag substrate (feature-flag service or config store), per-scope propagation through request-handling layers, auto-off timer (compose with a scheduling substrate), cost-attribution hook into the cost-governance telemetry, and audit-log wiring. Subsequent projects using a portfolio-level substrate: hours to configure new scopes. Ongoing cost: negligible once the mechanism is stable; the budget-alert channel needs monitoring like any other alert stream.

Open design questions

  • Toggle substrate. Feature-flag service (LaunchDarkly / Flagsmith / OpenFeature)? Custom DB-backed flag table? Redis? The answer depends on the portfolio’s existing infrastructure. Each option has different tamper, audit, and latency characteristics; standardising portfolio-wide vs. letting projects pick is itself a design decision.
  • Default auto-off duration. 15 min is tight for deep investigations; 60 min risks the forgot-to-disable failure and exposes more cost before the timer reaps. 30 min with per-event user override within policy caps is probably right but unproven in practice.
  • Cost-attribution granularity. Per-event? Per-scope-per-hour? Per-request? Affects how budget caps are expressed and enforced, and how much telemetry the mechanism produces about itself.
  • Scope boundary for agent invocation. Agent flipping debug on its own session is low-friction. Flipping on another user’s session requires elevation. Where exactly does the line sit — session, user, device? Related to the PL5-multi-agent-delegation trifecta-separation principle (v0.23).
  • Overlap with observability platform features. Most vendor observability platforms (Datadog, Honeycomb, Grafana Cloud, etc.) offer per-query sampling or per-trace tagging with similar intent. Is this recipe a separate mechanism or a thin wrapper over vendor capabilities? Depends on which platform the project uses and how much of this mechanism the platform already provides.
  • Composes with: GitHub Actions scheduler (or equivalent scheduling substrate). The auto-off timer is a scheduled task, so this recipe cannot be implemented without a scheduling primitive.
  • Composes with: GitOps JIT privilege elevation — global debug-on and cross-scope debug-on require elevation, routed through the JIT gate.
  • Composes with: bot-token credential tenancy — the service that writes audit-log events for toggle changes is typically under a bot identity, not a user identity.
  • Sibling cost-containment recipes (future): inference token budget caps, canary duration caps, CI minute budget caps, log-retention caps. Each instantiates PL4-cost-governance for a different cost-runaway domain. Not yet written; candidates for portfolio-reusable mechanism extraction as each domain’s pattern matures.
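
For the default auto-off duration question above, "per-event user override within policy caps" could be as simple as clamping the requested duration. The sketch below assumes a 30 min default (from the text) and hypothetical policy bounds of 1 min and 60 min; the bounds and the name resolve_ttl are illustrative, and the right values remain an open question.

```python
from datetime import timedelta

DEFAULT_TTL = timedelta(minutes=30)  # default discussed in the recipe
MIN_TTL = timedelta(minutes=1)       # hypothetical lower policy bound
MAX_TTL = timedelta(minutes=60)      # hypothetical upper policy cap


def resolve_ttl(requested=None):
    """Return the auto-off duration: the default, or the request clamped to policy."""
    if requested is None:
        return DEFAULT_TTL
    return max(MIN_TTL, min(requested, MAX_TTL))
```

With these assumed bounds, a request for four hours resolves to 60 minutes, while a request for 15 minutes is honoured as-is.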