Scope and boundary
The rubric is focused on engineering environments — it scores the readiness of a codebase to host agentic work. It is not a compliance framework; it does not replace organisational governance, formal attestation standards (SOC 2, ISO 27001, NIST), or third-party-risk programs. Where the rubric’s concerns coincide with those frameworks — most often in access control, change management, monitoring, availability — applying the rubric naturally builds toward compliance readiness on the overlapping dimensions. Where the rubric has not yet addressed a compliance-adjacent concern, open questions about which concerns to absorb, refine, or leave to complementary instruments are captured in the Open Questions section. Coexistence with compliance frameworks is the current stance; convergence is neither the goal nor foreclosed.
The rubric holds one additional boundary explicitly: engineering does not need PII. PII lives on the production-data side of the boundary between production and engineering systems. Logs, memory, caches, git history, CI artefacts, and agent tool surfaces are PII-free by design, not by layered masking. The criteria that implement this — PL4-pii-masking, PL4-memory-safety, PL4-prompt-injection-defence, and the ingestion discipline in PL1-real-world-feedback and PL3-emission-quality — realise a single bright line, not parallel defences.
Scoring
Each criterion is scored 0 / 1 / 2 / 3:
| Score | Anchor | Meaning |
|---|---|---|
| 0 | Absent | Not in place, or in name only |
| 1 | Present | Exists but with meaningful gaps, inconsistent coverage, or high friction |
| 2 | Effective | Consistently in place, low friction, agent-usable |
| 3 | Compounding | Improves with use — outcomes are captured, fed back, and demonstrably make the criterion cheaper or better over time |
How to read the scale
- 2 is the realistic operational target for most criteria. A project that hits 2 on every line is a well-engineered codebase.
- 3 is the bar for criteria where compounding is structurally possible and high-leverage. Reaching 3 requires building learning infrastructure: instrumentation, retrieval, hygiene, decay protocols.
- Some criteria are tagged (max 2) — compounding isn't structurally meaningful (e.g. lint either passes or it doesn't). For these, 2 is the ceiling; the rubric doesn't penalise the absence of a "3."
Why this matters
This single-scale design embeds memory and learning into every criterion rather than treating them as a separate concern. A codebase that scores 2s everywhere has capability; a codebase that scores 3s has a compounding system — one that gets cheaper to operate the longer it runs. The gap between the two is the gap between “AI-assisted engineering” and “agentic engineering.”
What scoring requires
Scoring a project needs more than codebase access. The rubric assumes the project’s Actions pillar already provides agent-readable operational access — structured state (PL3-structured-state-read), observability (PL3-emission-quality / PL3-agent-queryability), source control metadata (PL3-source-control), CI/deploy results (PL3-deployment-cicd). That same access is what makes scoring feasible: a scorer queries the agent’s own read surfaces rather than chasing dashboards by hand. A project with weak Actions is simultaneously harder to use and harder to audit.
For criteria that can't be fully scored from agent-readable sources alone (e.g. human taste validation under PL2-taste-validation, or secret-rotation confirmation under PL2-secret-hygiene), expect to supplement with brief process interviews.
Maximum total: 146 points. (Calculated below in the Scoring Summary.)
A project should aim to reach the maximum on at least one flagship codebase before attempting to scale the methodology across the portfolio.
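The arithmetic behind the 146-point maximum follows directly from the counts stated in this document (46 compounding-eligible criteria capped at 3, four (max 2) criteria capped at 2). A minimal sketch of the scoring sum, with ceilings enforced per criterion — the criterion IDs in the docstring example are illustrative, not real rubric entries:

```python
COMPOUNDING_ELIGIBLE = 46  # criteria where a score of 3 is achievable
MAX_2_CRITERIA = 4         # criteria tagged (max 2)

def maximum_total() -> int:
    """Maximum rubric score: 46*3 + 4*2 = 146."""
    return COMPOUNDING_ELIGIBLE * 3 + MAX_2_CRITERIA * 2

def total_score(scores: dict[str, int], max2: set[str]) -> int:
    """Sum criterion scores, enforcing each criterion's ceiling (2 or 3)."""
    total = 0
    for criterion, score in scores.items():
        ceiling = 2 if criterion in max2 else 3
        if not 0 <= score <= ceiling:
            raise ValueError(f"{criterion}: score {score} outside 0..{ceiling}")
        total += score
    return total
```

The ceiling check matters: a scorer who records a 3 against a (max 2) criterion is making a category error, and it should fail loudly rather than inflate the total.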
Meta-Metrics
Beyond the rubric score, track four operational signals. The rubric measures capability and compounding; these measure whether the loop actually runs and improves.
Glance Threshold — median time to approve a PR
- > 15 min — something upstream failed (planning, actions, or validation)
- 5–15 min — acceptable, but PR is doing too much
- < 5 min — target state: PR is glanceable because trust has compounded
If you have to read a PR for an hour, you might as well have written it yourself.
Cost per merged PR
- All-in cost (agent inference + CI minutes + canary infrastructure + log retention) divided by merged PRs in the period
- Tracks whether agentic engineering is actually cheaper than the alternative
- A high rubric score with runaway cost-per-PR means the rubric is being gamed
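The calculation is a straightforward division over a fixed period; a sketch using the cost categories listed above (the parameter names and figures are illustrative, not prescribed by the rubric):

```python
def cost_per_merged_pr(agent_inference: float, ci_minutes: float,
                       canary_infra: float, log_retention: float,
                       merged_prs: int) -> float:
    """All-in period cost divided by PRs merged in the same period."""
    if merged_prs == 0:
        raise ValueError("no merged PRs in period; metric is undefined")
    total_cost = agent_inference + ci_minutes + canary_infra + log_retention
    return total_cost / merged_prs

# e.g. $1200 inference + $300 CI + $400 canary + $100 logs over 50 merged PRs
# → $40 per merged PR
```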
Signal-to-deploy time — median hours from user signal received to fix deployed
- User signal = review, support ticket, production alert, meeting note, canary metric breach
- Captures whether the full loop (PL1-real-world-feedback → PL5-signal-driven-tasks → PL5-outcome-input-loop → release) actually closes
- This is the metric that proves the "month-long holiday and the app has grown 30 features" vision is real, not aspirational
Compounding Index — fraction of compounding-eligible criteria scored at 3
- Numerator: criteria scored at 3
- Denominator: criteria where 3 is structurally achievable (i.e. excluding (max 2) criteria) — currently 46 of 50
- Tracks whether the project is building learning infrastructure or just static capability
- A high Compounding Index is the rubric’s strongest signal that agentic engineering is actually compounding, not just present
- Target: > 0.3 within 12 months of starting; > 0.6 indicates a mature compounding system
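The index is a simple ratio; a sketch assuming scores live in a dict keyed by criterion ID and the (max 2) criteria are known (the example IDs in the comment are hypothetical):

```python
def compounding_index(scores: dict[str, int], max2: set[str]) -> float:
    """Fraction of compounding-eligible criteria currently scored at 3."""
    eligible = [c for c in scores if c not in max2]  # exclude (max 2) criteria
    if not eligible:
        return 0.0
    return sum(1 for c in eligible if scores[c] == 3) / len(eligible)

# With the full rubric's 46 eligible criteria, 15 scored at 3 gives
# 15/46 ≈ 0.33 — just past the 12-month target of 0.3.
```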