Dynamic debug logging

Purpose

Remotely escalate log verbosity per user / per device / per session for targeted investigation — without redeploying, without PII leakage, and without unbounded cost. Advances PL4-cost-governance by containing one of the canonical cost-runaway domains (verbose logging) via scoped, time-boxed, cost-aware escalation. Previously encoded as rubric criterion PL4-dynamic-debug-logging in v0.17 through v0.23; extracted into a recipe in v0.24 because the criterion’s text (“per-device, time-boxed, cost-aware”) was prescribing mechanism, which per canon’s rubric-vs-recipe discipline belongs in recipes rather than in the diagnostic instrument.

Architecture

Togglable flag — a feature-flag or config store holding per-scope debug state. Not a build-time DEBUG compile flag; runtime-mutable.
Scoping — every debug-on event is bounded by one of: per-user, per-device, per-session, per-request. Global debug-on is not permitted at this recipe’s level; it requires explicit elevation as a separate, audited operation.
Auto-off — every debug-on event carries a time-bound (30 min default is a reasonable starting point) and expires automatically. Expired state is reaped by a scheduler — composes with GitHub Actions scheduler or equivalent scheduling substrate.
Cost budget — every debug-on event is attributed to cost-governance tracking. When spend exceeds a per-event or per-scope budget, the flag auto-clears and alerts. Budget is both soft-alert and hard-cap.
Audit — every toggle event (on, off, expiry, cost-triggered-off) is logged with who flipped, scope, duration, cost accrued. Audit log is write-once and observable via the project’s standard queryability surface.
Agent tool surface — agent can request debug-on via its tool surface (MCP or equivalent), bounded by its credentials. Agent flipping its own session scope is low-friction; flipping a different user’s scope requires elevation.
PII invariant — verbose-level log output continues to pass PL4-pii-masking at the substrate layer. Dynamic-debug does not bypass masking; it escalates the volume of already-masked output. This is preconditional, not layered: the masking is done by PL4-pii-masking’s substrate; dynamic-debug respects it.

Criteria advanced

PL4-cost-governance — primary target. Dynamic debug logging is one domain-specific containment mechanism for the general “cost is observable, capped, attributed” discipline. It does not on its own move a project from level-0 to level-2 on the criterion — level-2 requires observable-capped-attributed across all cost domains (agent inference, CI minutes, log retention, canary spin-up). This recipe instantiates the discipline specifically for verbose logging. Pairs with analogous (future) recipes for other cost-runaway domains as they are written up.

Prerequisites

PL3-emission-quality ≥ 2 Emission quality. Structured logs with correlation IDs are preconditional. Escalating verbosity against unstructured logs produces noise, not signal — worse than static DEBUG because cost rises without any diagnostic benefit. Dynamic debug assumes a structured baseline to escalate against.
PL4-pii-masking ≥ 2 PII masking. Substrate-level PII masking is preconditional. If masking is application-layer-only or DB-only, verbose output at escalated levels will leak PII via telemetry paths that bypass the partial-layer masking. This recipe does not replace or layer on PII-masking; it relies on it being a universal invariant.

Failure modes

Static DEBUG flag set and forgotten. Legacy pattern: DEBUG=true in env or config, never unset. Cost accumulates silently for months; often discovered only via an invoice anomaly. Mitigation: auto-off is non-optional; every debug-on event carries a time-bound and expires automatically. Static/build-time DEBUG is not recognised by this mechanism — migration from static-DEBUG to dynamic-debug is part of adopting the recipe.
Global debug-on. Someone flips debug globally (all users, all devices) rather than per-scope. Blast radius unbounded: cost explodes, storage saturates, potentially rate-limits downstream consumers. Mitigation: scoping is non-optional at this recipe’s level; global flip is a separate, elevated operation with its own audit — compose with GitOps JIT privilege elevation.
Cost runaway despite budget. Volume spike exceeds budget faster than monitoring can react (e.g. a verbose log statement inside a hot loop or exception handler that fires repeatedly). Mitigation: hard cap in addition to soft alert; auto-off on hard-cap exceedance; pre-deployment budget testing for known high-traffic paths.
PII escape via debug-only code paths. Masking rules were designed and tested at INFO level; DEBUG level may include additional log statements that bypass masking because the mask rules didn’t anticipate them. Mitigation: PL4-pii-masking ≥ 2 specifically requires substrate-level enforcement (not per-path sanitiser); verify masking applies at all log levels including DEBUG before adopting this recipe.
Tamper on the toggle. Debug-on state mutable without audit. Abuse case: an adversary (or compromised credential) enables debug on a target user to capture session tokens, debug-only internal state, or verbose error messages containing secrets. Mitigation: every toggle event audited with who/when/scope/cost/duration; audit log write-once; cross-scope flips require elevation.
Side effects from log level. Escalated verbosity changes runtime behaviour — more memory allocation, slower execution, possibly different timing that masks or reveals race conditions. Mitigation: loggers must be pure-observers; production tests with debug-on confirm no behavioural drift; in unavoidable cases, document the drift so investigators know what to expect.

Cost estimate

Low-medium. First deployment: 2–5 engineer-days for the flag substrate (feature-flag service or config store), per-scope propagation through request-handling layers, auto-off timer (compose with a scheduling substrate), cost-attribution hook into the cost-governance telemetry, and audit-log wiring. Subsequent projects using a portfolio-level substrate: hours to configure new scopes. Ongoing cost: negligible once the mechanism is stable; the budget-alert channel needs monitoring like any other alert stream.

Open design questions

Toggle substrate. Feature-flag service (LaunchDarkly / Flagsmith / OpenFeature)? Custom DB-backed flag table? Redis? Depends on portfolio existing infra. Each has different tamper / audit / latency characteristics; standardising portfolio-wide vs. letting projects pick is itself a design decision.
Default auto-off duration. 15 min is tight for deep investigations; 60 min risks the forgot-to-disable failure and exposes more cost before the timer reaps. 30 min with per-event user override within policy caps is probably right but unproven in practice.
Cost-attribution granularity. Per-event? Per-scope-per-hour? Per-request? Affects how budget caps are expressed and enforced, and how much telemetry the mechanism produces about itself.
Scope boundary for agent invocation. Agent flipping debug on its own session is low-friction. Flipping on another user’s session requires elevation. Where exactly does the line sit — session, user, device? Related to the PL5-multi-agent-delegation trifecta-separation principle (v0.23).
Overlap with observability platform features. Most vendor observability platforms (Datadog, Honeycomb, Grafana Cloud, etc.) offer per-query sampling or per-trace tagging with similar intent. Is this recipe a separate mechanism or a thin wrapper over vendor capabilities? Depends on which platform the project uses and how much of this mechanism the platform already provides.

Composes with: GitHub Actions scheduler (or equivalent scheduling substrate) — auto-off timer is a scheduled task; can’t implement this recipe without a scheduling primitive.
Composes with: GitOps JIT privilege elevation — global debug-on and cross-scope debug-on require elevation, routed through the JIT gate.
Composes with: bot-token credential tenancy — the service that writes audit-log events for toggle changes is typically under a bot identity, not a user identity.
Sibling cost-containment recipes (future): inference token budget caps, canary duration caps, CI minute budget caps, log-retention caps. Each instantiates PL4-cost-governance for a different cost-runaway domain. Not yet written; candidates for portfolio-reusable mechanism extraction as each domain’s pattern matures.

Dynamic debug logging

Dynamic debug logging

Purpose

Architecture

Criteria advanced

Prerequisites

Failure modes

Cost estimate

Open design questions

Related recipes