
Agentic Engineering Rubric

The rubric's pillar bodies — the criteria themselves — each live on their own page so they can be browsed without one scroll to rule them all:

  • 1. Focus: Narrow the agent’s world to what matters; what remains is the right context.
  • 2. Validation: Hard, deterministic rules that catch non-deterministic output.
  • 3. Actions: The agent’s ability to act externally in the real world.
  • 4. Safe Space: Blast-radius containment, so "going wrong" has bounded cost.
  • 5. Workflow: The meta-layer that ties 1–4 together, including periodic and proactive loops.

Scope and boundary

The rubric is focused on engineering environments — it scores the readiness of a codebase to host agentic work. It is not a compliance framework; it does not replace organisational governance, formal attestation standards (SOC 2, ISO 27001, NIST), or third-party-risk programs. Where the rubric’s concerns coincide with those frameworks — most often in access control, change management, monitoring, availability — applying the rubric naturally builds toward compliance readiness on the overlapping dimensions. Where the rubric has not yet addressed a compliance-adjacent concern, open questions about which concerns to absorb, refine, or leave to complementary instruments are captured in the Open Questions section. Coexistence with compliance frameworks is the current stance; convergence is neither the goal nor foreclosed.

The rubric holds one additional boundary explicitly: engineering does not need PII. PII lives on the production-data side of the boundary between production and engineering systems. Logs, memory, caches, git history, CI artefacts, and agent tool surfaces are PII-free by design, not by layered masking. The criteria that implement this — PL4-pii-masking, PL4-memory-safety, PL4-prompt-injection-defence, and the ingestion discipline in PL1-real-world-feedback and PL3-emission-quality — realise a single bright line, not parallel defences.


Scoring

Each criterion is scored 0 / 1 / 2 / 3:

Score  Anchor       Meaning
0      Absent       Not in place, or in name only
1      Present      Exists but with meaningful gaps, inconsistent coverage, or high friction
2      Effective    Consistently in place, low friction, agent-usable
3      Compounding  Improves with use — outcomes are captured, fed back, and demonstrably make the criterion cheaper or better over time

How to read the scale

  • 2 is the realistic operational target for most criteria. A project that hits 2 on every line is a well-engineered codebase.
  • 3 is the bar for criteria where compounding is structurally possible and high-leverage. Reaching 3 requires building learning infrastructure: instrumentation, retrieval, hygiene, decay protocols.
  • Some criteria are tagged (max 2) — compounding isn’t structurally meaningful (e.g. lint either passes or it doesn’t). For these, 2 is the ceiling; the rubric doesn’t penalise the absence of a “3.”
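The per-criterion ceiling can be sketched directly. The scores and the first criterion name below are hypothetical; the "(max 2)" tag is modelled as a ceiling applied when summing:

```python
# Hypothetical criterion scores; the boolean marks a "(max 2)" tag.
criteria = [
    ("PL2-lint-gate", 2, True),          # hypothetical name; (max 2), so 2 is its ceiling
    ("PL1-real-world-feedback", 3, False),
    ("PL3-emission-quality", 1, False),
]

def ceiling(max_2_tagged: bool) -> int:
    """A (max 2) criterion tops out at 2; all others can reach 3."""
    return 2 if max_2_tagged else 3

total = sum(min(score, ceiling(tag)) for _, score, tag in criteria)
maximum = sum(ceiling(tag) for _, _, tag in criteria)
print(f"{total} / {maximum}")  # 6 / 8
```

The point of the cap is that a (max 2) criterion counts as complete at 2 — the denominator shrinks with it, so the absence of a "3" is not a penalty.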

Why this matters

This single-scale design embeds memory and learning into every criterion rather than treating them as a separate concern. A codebase that scores 2s everywhere has capability; a codebase that scores 3s has a compounding system — one that gets cheaper to operate the longer it runs. The gap between the two is the gap between “AI-assisted engineering” and “agentic engineering.”

What scoring requires

Scoring a project needs more than codebase access. The rubric assumes the project’s Actions pillar already provides agent-readable operational access — structured state (PL3-structured-state-read), observability (PL3-emission-quality / PL3-agent-queryability), source control metadata (PL3-source-control), CI/deploy results (PL3-deployment-cicd). That same access is what makes scoring feasible: a scorer queries the agent’s own read surfaces rather than chasing dashboards by hand. A project with weak Actions is simultaneously harder to use and harder to audit.

For criteria that can’t be fully scored from agent-readable sources alone (e.g. human taste validation under PL2-taste-validation, or secret-rotation confirmation under PL2-secret-hygiene), expect to supplement with brief process interviews.

Maximum total: 146 points. (Calculated below in the Scoring Summary.)
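The 146-point maximum is consistent with the counts in the Compounding Index definition (46 of the 50 criteria can reach 3, leaving 4 tagged (max 2)):

```python
total_criteria = 50
compounding_eligible = 46                        # criteria where a 3 is achievable
capped = total_criteria - compounding_eligible   # "(max 2)" criteria
max_total = compounding_eligible * 3 + capped * 2
print(max_total)  # 146
```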

A project should aim to reach the maximum on at least one flagship codebase before attempting to scale the methodology across the portfolio.


Meta-Metrics

Beyond the rubric score, track four operational signals. The rubric measures capability and compounding; these measure whether the loop actually runs and improves.

Glance Threshold — median time to approve a PR

  • > 15 min — something upstream failed (planning, actions, or validation)
  • 5–15 min — acceptable, but PR is doing too much
  • < 5 min — target state: PR is glanceable because trust has compounded

If you have to read a PR for an hour, you might as well have written it yourself.
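The bands above can be sketched as a small classifier; the function name and the sample approval times are illustrative, not part of the rubric:

```python
from statistics import median

def glance_band(median_minutes: float) -> str:
    """Map a median PR-approval time onto the rubric's three bands."""
    if median_minutes < 5:
        return "target: glanceable"
    if median_minutes <= 15:
        return "acceptable, but PR is doing too much"
    return "upstream failure (planning, actions, or validation)"

approval_minutes = [3, 4, 6, 2, 4]  # hypothetical per-PR approval times, in minutes
print(glance_band(median(approval_minutes)))  # target: glanceable
```

Note the metric is a median, not a mean — one pathological hour-long review shouldn’t mask a fleet of glanceable PRs.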

Cost per merged PR

  • All-in cost (agent inference + CI minutes + canary infrastructure + log retention) divided by merged PRs in the period
  • Tracks whether agentic engineering is actually cheaper than the alternative
  • A high rubric score with runaway cost-per-PR means the rubric is being gamed
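A worked sketch of the all-in division, with hypothetical period figures (the cost components are the ones listed above):

```python
# Hypothetical costs for one period, in dollars.
agent_inference = 1800.0
ci_minutes = 240.0
canary_infra = 120.0
log_retention = 40.0
merged_prs = 110

cost_per_pr = (agent_inference + ci_minutes + canary_infra + log_retention) / merged_prs
print(round(cost_per_pr, 2))  # 20.0
```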

Signal-to-deploy time — median hours from user signal received to fix deployed

  • User signal = review, support ticket, production alert, meeting note, canary metric breach
  • Captures whether the full loop (PL1-real-world-feedback → PL5-signal-driven-tasks → PL5-outcome-input-loop → release) actually closes
  • This is the metric that proves the “month-long holiday and the app has grown 30 features” vision is real, not aspirational
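Computing the metric is a median over timestamp deltas; the (signal received, fix deployed) pairs below are hypothetical:

```python
from datetime import datetime
from statistics import median

# Hypothetical (signal_received, fix_deployed) pairs for closed loops.
loops = [
    (datetime(2024, 5, 1, 9, 0),  datetime(2024, 5, 1, 17, 0)),   # 8 h
    (datetime(2024, 5, 2, 10, 0), datetime(2024, 5, 3, 10, 0)),   # 24 h
    (datetime(2024, 5, 4, 8, 0),  datetime(2024, 5, 4, 14, 0)),   # 6 h
]

hours = [(deployed - received).total_seconds() / 3600 for received, deployed in loops]
print(median(hours))  # 8.0
```

Only closed loops enter the median; signals that never reach a deploy are a coverage problem for PL5-signal-driven-tasks, not noise to average in.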

Compounding Index — fraction of compounding-eligible criteria scored at 3

  • Numerator: criteria scored at 3
  • Denominator: criteria where 3 is structurally achievable (i.e. excluding (max 2) criteria) — currently 46 of 50
  • Tracks whether the project is building learning infrastructure or just static capability
  • A high Compounding Index is the rubric’s strongest signal that agentic engineering is actually compounding, not just present
  • Target: > 0.3 within 12 months of starting; > 0.6 indicates a mature compounding system
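As a sketch, the index is a simple fraction over the eligible set; the criterion names and scores below are hypothetical:

```python
# Hypothetical scores; max_2 holds criteria where a 3 is not structurally achievable.
scores = {"crit-a": 3, "crit-b": 2, "crit-c": 3, "crit-d": 2}
max_2 = {"crit-d"}

eligible = [c for c in scores if c not in max_2]
index = sum(1 for c in eligible if scores[c] == 3) / len(eligible)
print(round(index, 2))  # 0.67, past the 0.6 maturity bar
```

Excluding the (max 2) criteria from the denominator matters: a project shouldn’t look less compounding merely because some criteria have no "3" to reach.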