Observe-before-enforce gate promotion
Purpose
When you stand up a structural gate — a GitHub ruleset, a Content Security Policy, a feature flag, an AWS SCP, a Dependabot required check, a Zero Trust policy — flipping it directly from off to enforce creates a particularly pernicious failure mode: the configuration needed to satisfy the gate is invisible while the gate is off, becomes blocking once the gate is on, and the fix cannot land because the gate blocks the fix itself. The recipe: every gate supporting an intermediate observe mode (shadow, evaluate, report-only) runs through it first, its outcome is inspected, pre-conditions are verified to resolve, and the flip to enforce only happens once the observe-mode signal is empirically clean.
This solves the “we configured the gate correctly, why is everything broken?” failure mode by making pre-conditions testable before they become blocking, and by closing the chicken-and-egg window in which a broken pre-condition can no longer be fixed.
Architecture
Every structural gate has three states:
- Off: the gate exists in config but does nothing.
- Observe: the gate evaluates every event and records would-block outcomes, but does not actually block. Platform-specific names: evaluate (GitHub rulesets), report-only (CSP), shadow (feature flags, Zero Trust), audit (Kubernetes OPA), monitor (AWS SCP with Deny + wide exceptions).
- Enforce: the gate blocks events that violate rules. Names: active, enforce, block, deny.
The recipe: always promote in two steps — off → observe → enforce — never off → enforce directly.
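The two-step promotion rule can be expressed as a small state machine. The sketch below is illustrative only: the names GateState, ALLOWED and promote are hypothetical, and allowing demotion (enforce → observe, anything → off) is an assumption, since the recipe itself only constrains the way up.

```python
from enum import Enum


class GateState(Enum):
    OFF = "off"
    OBSERVE = "observe"
    ENFORCE = "enforce"


# Allowed single-step transitions. Backing out is assumed to be always
# permitted; the one forbidden move is skipping observe on the way up.
ALLOWED = {
    (GateState.OFF, GateState.OBSERVE),
    (GateState.OBSERVE, GateState.ENFORCE),
    (GateState.ENFORCE, GateState.OBSERVE),
    (GateState.OBSERVE, GateState.OFF),
    (GateState.ENFORCE, GateState.OFF),
}


def promote(current: GateState, target: GateState,
            preflight_clean: bool = False) -> GateState:
    """Return the new state, or raise ValueError if the move is not allowed."""
    if current == target:
        return current
    if (current, target) not in ALLOWED:
        raise ValueError(
            f"transition {current.value} -> {target.value} is not allowed"
        )
    # Entering enforce additionally requires a clean pre-flight result.
    if target is GateState.ENFORCE and not preflight_clean:
        raise ValueError("pre-flight checklist must be clean before enforcing")
    return target
```

Calling promote(GateState.OFF, GateState.ENFORCE, preflight_clean=True) raises, because no checklist result can substitute for having run in observe mode first.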
Between observe and enforce, run a fixed pre-flight checklist:
- Zero would-block signals on expected-to-pass events. The observe-mode log should be empty (or explainable) for normal workflows. Non-empty log = misconfigured pre-condition that enforce would turn into a block.
- All referenced config resolves. CODEOWNERS entries, ACL group memberships, exception lists, policy allowlists: every external reference must resolve at evaluation time. A rule that depends on @unknown-user or a missing exception file silently passes in observe (nobody triggers it) and hard-fails in enforce.
- An escape hatch exists. Before flipping, know how you’ll recover if something breaks. For GitHub rulesets: declared bypass_actors or admin-UI override availability. For CSP: deployment rollback path. For feature flags: kill switch. For SCPs: org-admin with an attached manual-override policy. The escape hatch is pre-verified, not discovered during the incident.
- All impact paths are covered. Observe mode only produces signal for events that actually occur during the observation window. A rule that protects against tag pushes won’t surface any signal if nobody pushed a tag. Extend the observation window or synthesise events to cover each enforcement path.
The load-bearing property is testability before blocking: observe mode is an empirical probe whose result is visible, read, and interpreted before the gate’s blocking semantics come online.
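The four checklist items can be collapsed into a single function that returns the reasons a flip should not proceed. This is a minimal sketch, not a real tool: PreflightInput, its field names and preflight_failures are hypothetical, and gathering the inputs (reading the observe log, resolving references, testing the escape hatch) is assumed to happen elsewhere.

```python
from dataclasses import dataclass, field


@dataclass
class PreflightInput:
    # Would-block entries from the observe log that nobody could explain.
    unexplained_would_blocks: list[str] = field(default_factory=list)
    # External references (owners, groups, allowlist files) that did not resolve.
    unresolved_references: list[str] = field(default_factory=list)
    # True only if the recovery path was actually exercised, not merely declared.
    escape_hatch_tested: bool = False
    # Enforcement paths the gate covers vs. paths that produced observe signal.
    required_paths: set[str] = field(default_factory=set)
    observed_paths: set[str] = field(default_factory=set)


def preflight_failures(p: PreflightInput) -> list[str]:
    """Return human-readable blockers; an empty list means the flip may proceed."""
    failures = []
    if p.unexplained_would_blocks:
        failures.append(
            "unexplained would-block signals: "
            + ", ".join(p.unexplained_would_blocks)
        )
    if p.unresolved_references:
        failures.append(
            "unresolved references: " + ", ".join(p.unresolved_references)
        )
    if not p.escape_hatch_tested:
        failures.append("escape hatch has not been tested")
    uncovered = sorted(p.required_paths - p.observed_paths)
    if uncovered:
        failures.append("no observe-mode signal for: " + ", ".join(uncovered))
    return failures
```

Returning the list of reasons rather than a bare boolean means the result can be pasted directly into the PR or issue that records the flip decision.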
Criteria advanced
- PL4-least-privilege level 1 → level 2 (structurally-enforced): the recipe is how you actually get from “gate declared” to “gate blocking” without creating a chicken-and-egg deadlock that forces either bypass-with-no-controls or gate-disabled. It’s the verified-promotion pattern that makes level-2 structural enforcement durable; without it, teams either stay at level-1 (gate declared, not enforced) or hit a lockout and revert.
- PL4-branch-protection: when applied to branch ruleset promotion specifically, the recipe is the reliable path from “protection rule declared” to “protection rule platform-enforced” without the commonly-observed “everything is stuck pending, admin override used, we gave up” failure.
- PL5-pipeline-reliability (partial contributor): the recipe prevents a class of pipeline outages where gate activation itself disrupts the pipeline (CI stuck, PRs unmergeable, deploys frozen). The pre-flight checklist catches gate-related pipeline breakage before it’s live.
- PL1-decision-records (partial contributor): the observe-phase output is the decision record for “we flipped this gate on these grounds”; inspect-observe-output is the evidence of due diligence on the gate activation decision.
Prerequisites
- PL3-source-control ≥ 2. Agent + human can query gate configuration and activation state natively via source control. Without this, gate state is opaque and the pre-flight checks are hand-rolled scripts that rot.
- The gate substrate must support an observe mode. If the gate is binary (on/off only), this recipe doesn’t apply; a different pattern is needed (e.g. sandbox-first rollout, canary-on-fraction). Most modern structural gates do support observe modes; legacy ones sometimes don’t.
- Visibility into observe-mode output. API, UI, log stream, metric — whatever the substrate offers must be accessible to the operator and (ideally) to the agent. Observe mode without an inspection path produces no signal.
- Sufficient event volume during observation. The observation window must produce representative events for each rule in the gate. One PR through a branch ruleset covers the PR-merge path but not the tag-push path; both paths need signal before flip.
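Whether the observation window produced enough signal can be checked mechanically if each observed event is tagged with the enforcement path it exercised. The sketch below assumes a hypothetical event shape (a dict with a "path" key) and a hypothetical per-path minimum declared before the window starts; neither comes from any particular gate substrate.

```python
from collections import Counter


def coverage_gaps(events: list[dict], minimums: dict[str, int]) -> dict[str, int]:
    """Return {path: missing_count} for every path below its declared minimum."""
    seen = Counter(e["path"] for e in events)
    return {
        path: need - seen[path]
        for path, need in minimums.items()
        if seen[path] < need
    }


# Example: three PR merges were observed, but nobody pushed a tag.
events = [{"path": "pr-merge"}, {"path": "pr-merge"}, {"path": "pr-merge"}]
minimums = {"pr-merge": 3, "tag-push": 1}
print(coverage_gaps(events, minimums))
```

A non-empty result means either extending the window or synthesising the missing events before the flip.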
Failure modes
- Observe logs not inspected. Teams flip to observe, wait a bit, flip to enforce — without ever reading the observe output. The recipe’s value is the inspection step; skipping it reduces to off → enforce with extra ceremony. Mitigation: pre-flight checklist has a mandatory “share the observe-mode output” item that must produce an artefact (PR comment, triage doc, issue link).
- Observe-visible errors treated as “we’ll fix later”. Pre-conditions flagged as problems in observe mode get deferred, gate is flipped to enforce anyway, deferred fix is now blocked by the gate. Mitigation: pre-flight checklist has a hard gate — any unexplained would-block signal blocks the flip, regardless of deadline pressure.
- Chicken-and-egg: the fix IS blocked by the gate. CODEOWNERS references unresolvable users → require_code_owner_review can’t be satisfied → the CODEOWNERS-fix PR can’t merge → gate is stuck in locked state. Mitigation: the pre-flight “escape hatch” check must verify the recovery path is available, not just declared. A bypass actor list of [] is not an escape hatch; a documented UI-admin-override path is only an escape hatch if the admin has actually-tested bypass permission on the gate in question.
- Observe semantics differ from enforce semantics. Some gates log more than they enforce (useful); some log less (dangerous: you flip to enforce and hit blocks the observe signal didn’t show). CSP report-only vs enforce: enforce blocks resource loads, while report-only only reports violations; but some directives (for example sandbox and upgrade-insecure-requests) are ignored in a report-only policy and take effect only under enforce. Mitigation: read the gate substrate’s docs on observe-vs-enforce differences before assuming observe is a faithful simulation.
- Not all rules exercised during observation. required_signatures only surfaces signal when someone pushes an unsigned commit. If the observation window only had signed commits, that rule’s behaviour is untested. Mitigation: either (a) deliberately synthesise a violation in observe mode (push an unsigned commit on a test branch to verify the rule logs would-block), or (b) accept the risk and plan a quick escape path for rules that went untested.
- Observation window too short. One PR isn’t enough. Observation should span multiple commit patterns, multiple actors (human + bot), and multiple event types. Mitigation: declare an explicit minimum observation window before starting (e.g. “≥ 3 PRs covering code-change, CODEOWNERS-change, workflow-change paths”); don’t flip until the minimum is met.
- Declared-state-diverges-from-live-state post-flip. If the flip is executed out-of-band (via UI or API) rather than via the IaC path that manages the gate, declared state drifts from live state. Mitigation: always make the flip via the IaC path; if an emergency UI flip happens, immediately open a PR to bring declared state in sync.
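The last failure mode, declared state drifting from live state, lends itself to a simple comparison once both configurations are available as dictionaries. The sketch below is a hedged illustration: config_drift is a hypothetical helper, it compares top-level keys only, and the sample keys (enforcement, bypass_actors) are illustrative rather than a complete ruleset schema.

```python
def config_drift(declared: dict, live: dict) -> dict[str, tuple]:
    """Return {key: (declared_value, live_value)} for each top-level difference.

    A key absent on one side is reported with None for that side.
    """
    drift = {}
    for key in sorted(declared.keys() | live.keys()):
        d, l = declared.get(key), live.get(key)
        if d != l:
            drift[key] = (d, l)
    return drift


# Example: the flip to active was made in the UI, but the IaC file still
# says evaluate.
declared = {"enforcement": "evaluate", "bypass_actors": []}
live = {"enforcement": "active", "bypass_actors": []}
print(config_drift(declared, live))
```

Running a comparison like this after every flip, and after any emergency UI change, gives a concrete trigger for the follow-up PR that brings declared state back in sync.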
Cost estimate
Low to medium. The gate substrate almost always supports observe mode natively (it’s a config knob, not a new deployment). The cost is:
- Human time running the observation window: typically 1–3 days depending on event volume.
- Human time reviewing observe-mode output: typically 15–60 minutes.
- Time to fix any surfaced pre-condition gaps: varies widely; the recipe’s value is that this time is spent before the gate is blocking, so it’s uncapped by incident pressure.
- The flip itself: minutes.
Ongoing maintenance burden is zero — the recipe is a one-time discipline per gate activation, not an ongoing process. New gates repeat the same checklist; old gates stay on after verified activation.
Case studies
- Canon main-branch ruleset (internal/integrations/canon-repo-substrate.md). First apply landed the ruleset in evaluate mode. PR #14 merged cleanly during observation, confirming that the required_signatures + pull_request rules work on the happy path. But the flip to active was executed before running gh api /repos/.../codeowners/errors, which would have shown @jason-khong as Unknown owner on main’s CODEOWNERS. Active mode turned the invisible CODEOWNERS error into a hard block; the CODEOWNERS-fix PR (#16) couldn’t merge because require_code_owner_review had no resolvable owner. Unblocked via repo-admin UI-bypass addition, PR merged, bypass removed, declared state re-synced. Lesson embedded as the second item in this recipe’s pre-flight checklist (all referenced config resolves). Second lesson: the escape hatch (bypass_actors) declared in IaC did not persist through the apply due to an integrations/github v6.11.1 provider quirk; the UI path was the working escape. Validates failure-mode 3 (chicken-and-egg with unverified escape hatch).
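The check that was skipped in this case study can be turned into a pre-flight step that fails on any reported CODEOWNERS error. The sketch below works on an already-fetched JSON payload and makes no network call; the field names (errors, path, line, kind, source) are an assumption based on GitHub's documented response for the CODEOWNERS errors endpoint and should be verified against the current API reference. The function name and the sample payload are hypothetical.

```python
def codeowners_blockers(payload: dict) -> list[str]:
    """Format each reported CODEOWNERS error as 'path:line: kind: source'.

    Any entry at all is treated as a blocker for the flip to enforce.
    """
    return [
        f'{e.get("path", "?")}:{e.get("line", "?")}: '
        f'{e.get("kind", "error")}: {e.get("source", "").strip()}'
        for e in payload.get("errors", [])
    ]


# Hypothetical payload shaped like the endpoint's response.
sample = {
    "errors": [
        {
            "line": 1,
            "kind": "Unknown owner",
            "source": "* @unknown-user",
            "path": ".github/CODEOWNERS",
        }
    ]
}
for blocker in codeowners_blockers(sample):
    print(blocker)
```

Treating every error kind as blocking, rather than filtering for unknown owners only, is a deliberately conservative choice: an invalid pattern can leave a path without a resolvable owner just as an unknown user can.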
Related recipes
- Composes with: gitops-jit-privilege-elevation — the elevation gate itself should be stood up in observe-before-enforce. Use this recipe’s pre-flight checklist when promoting the elevation pipeline from logging to actually-blocking.
- Composes with: ingestion-as-pr — the ingestion gate’s evaluate-mode output surfaces which incoming content would be rejected by the policy, letting you tune policy before it starts rejecting real traffic.
- Alternatives to: big-bang gate activation (declare + enforce in one step) — works when pre-conditions are thoroughly understood and fit in one reviewer’s head. For any gate with external references (CODEOWNERS, allowlists, group memberships), big-bang is statistically likely to surprise you.
- Alternatives to: canary gate activation (enforce on a fraction of traffic / repos / tenants, then expand) — works when the gate’s substrate supports fractional rollout. Some gate substrates (e.g. repo-level GitHub rulesets) don’t. Canary and observe-before-enforce compose — observe first, then canary-roll the enforce.
- Depends on: none directly. Presumes IaC-managed gate configuration; a manually-configured gate can still use the recipe, it just loses the declared-state discipline.