Indexed per-entry registry with YAML frontmatter

Family: filing
Status: proven
Complexity: low
Advances: PL1-corpus-taxonomy
Prerequisites: none

Purpose

A corpus filing pattern for collections of like entities (stakeholders, integrations, ADRs, recipes, vendors, projects) in which each entity has its own file and a thin index file aggregates them. It solves the “scattered markdown in arbitrary folders” problem that makes corpora invisible to agents and humans alike.

Architecture

Two layers:

  1. Index file at the corpus root (e.g. stakeholders.md, integrations.md, recipes.md) containing a short explanation of what the corpus is, a one-line-per-entry table, and a “how to use this registry” section covering add / query / sweep conventions.
  2. Per-entry files in a sibling directory (stakeholders/, integrations/, recipes/), each with YAML frontmatter carrying structured attributes (status, type, dates, tags, IDs) and markdown body organized into conventional sections.

A _template.md file lives in the per-entry directory showing expected frontmatter keys and body sections. New entries are created by copying the template, not by free-form authoring — this keeps frontmatter queryable across the set.
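The add flow described above (copy the template, then register the entry in the index) can be sketched as a small helper. This is a minimal sketch under stated assumptions, not part of the pattern itself: the function name add_entry, the index row format, and the rule "insert after the last table row" are all hypothetical choices made for illustration.

```python
from pathlib import Path
import shutil


def add_entry(root: Path, corpus: str, slug: str, summary: str) -> Path:
    """Create <corpus>/<slug>.md from _template.md and add a row to <corpus>.md.

    Assumptions (hypothetical, for illustration only):
    - the index file already contains a markdown table;
    - a row looks like: | [slug](corpus/slug.md) | summary |
    """
    entry_dir = root / corpus            # e.g. stakeholders/
    index_file = root / f"{corpus}.md"   # e.g. stakeholders.md
    target = entry_dir / f"{slug}.md"
    if target.exists():
        raise FileExistsError(f"{target} already exists")

    # Find the last table row first, so nothing is written if the index is malformed.
    lines = index_file.read_text(encoding="utf-8").splitlines()
    table_rows = [i for i, line in enumerate(lines) if line.lstrip().startswith("|")]
    if not table_rows:
        raise ValueError(f"{index_file} contains no table to append to")

    # Step 1: copy the template rather than authoring the file free-form.
    shutil.copyfile(entry_dir / "_template.md", target)

    # Step 2: insert the index row directly after the last existing table row,
    # leaving any later section (such as the how-to-use section) untouched.
    row = f"| [{slug}]({corpus}/{slug}.md) | {summary} |"
    lines.insert(table_rows[-1] + 1, row)
    index_file.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return target
```

A call such as add_entry(Path("memory"), "stakeholders", "bob", "Security reviewer") leaves both changes in the working tree, ready to go into a single commit.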

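Query surface: rg 'status: active' <corpus>/ for filtering; yq '.field' <corpus>/*.md for structured extraction; grep for cross-corpus references. Because the entries are markdown rather than pure YAML, yq may need to be told to read only the frontmatter (mikefarah's yq v4 has a --front-matter=extract option for this; check how your version handles multiple input files).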
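Where rg and yq are not available, the same filtering can be approximated with the standard library. The following is a minimal sketch, assuming flat frontmatter only (scalar values and inline [a, b] lists); nested keys and multi-line YAML values are not handled, and a real YAML parser would be the better choice for anything richer. The names read_frontmatter and query are hypothetical.

```python
from pathlib import Path


def read_frontmatter(path: Path) -> dict:
    """Parse a flat frontmatter block: `key: value` scalars and inline `[a, b]` lists."""
    lines = path.read_text(encoding="utf-8").splitlines()
    if not lines or lines[0].strip() != "---":
        return {}  # no frontmatter at all
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":  # closing delimiter ends the block
            break
        if ":" not in line or line.lstrip().startswith("#"):
            continue
        key, _, value = line.partition(":")  # split on the first colon only
        value = value.strip()
        if value.startswith("[") and value.endswith("]"):
            items = value[1:-1].split(",")
            meta[key.strip()] = [v.strip().strip("'\"") for v in items if v.strip()]
        else:
            meta[key.strip()] = value.strip("'\"")
    return meta


def query(corpus_dir: Path, **wanted: str) -> list[str]:
    """Return entry names matching every key=value pair.

    Scalar fields must equal the value; list fields must contain it.
    """
    hits = []
    for path in sorted(corpus_dir.glob("*.md")):
        if path.name.startswith("_"):  # skip _template.md
            continue
        meta = read_frontmatter(path)
        matched = True
        for key, value in wanted.items():
            field = meta.get(key)
            if isinstance(field, list):
                matched = value in field
            else:
                matched = field == value
            if not matched:
                break
        if matched:
            hits.append(path.stem)
    return hits
```

For example, query(Path("memory/stakeholders"), status="active", tags="gentari") would list the active entries tagged gentari, assuming those entries keep status and tags as top-level frontmatter keys.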

Criteria advanced

  • PL1-corpus-taxonomy (corpus taxonomy, filing, indexing): this pattern is the mechanism for reaching level 2. It provides an explicit type system (enforced via template frontmatter), a consistent filing structure (one directory per corpus), an index (the .md file), and agent-queryability by type, status, and recency. Level 3 additionally requires tracked staleness sweeps and filing-gap detection, both of which this pattern accommodates but doesn’t automate on its own.

Indirectly supports any criterion that depends on corpus retrieval (PL1-primary-source-access, PL1-decision-records, PL1-documentation-loop, PL1-stakeholder-context, PL2-agent-audit-trail) — if those criteria need a queryable knowledge store, this is the substrate.
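The staleness sweep mentioned above is not prescribed by the pattern, but one way it could be scripted is sketched here. It assumes each entry carries a last_reviewed: YYYY-MM-DD frontmatter field; that field name, the 90-day threshold, and the function name stale_entries are hypothetical and should be swapped for whatever the corpus template actually defines.

```python
import re
from datetime import date, timedelta
from pathlib import Path

# Hypothetical field name; adjust to match the corpus template.
FIELD = re.compile(r"^last_reviewed:\s*(\d{4}-\d{2}-\d{2})\s*$", re.MULTILINE)


def stale_entries(corpus_dir: Path, today: date, max_age_days: int = 90) -> list[str]:
    """Return entries whose last_reviewed date is missing or older than the cutoff."""
    cutoff = today - timedelta(days=max_age_days)
    stale = []
    for path in sorted(corpus_dir.glob("*.md")):
        if path.name.startswith("_"):  # skip _template.md
            continue
        text = path.read_text(encoding="utf-8")
        # Frontmatter is the text between the first two '---' delimiters.
        parts = text.split("---", 2)
        front = parts[1] if text.startswith("---") and len(parts) == 3 else ""
        match = FIELD.search(front)
        if match is None or date.fromisoformat(match.group(1)) < cutoff:
            stale.append(path.stem)
    return stale
```

Passing today explicitly keeps the sweep reproducible; a scheduled job would typically call it with date.today() and record the result, which is what makes the sweep "tracked".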

Prerequisites

None beyond markdown, git, and a willingness to apply the pattern consistently. The pattern’s value compounds with scale (5 entries → marginal; 50 entries → load-bearing).

Failure modes

  • Frontmatter key drift. Different entries use different key names for the same attribute (email vs. contact_email; status vs. state). Breaks yq queries silently. Mitigation: _template.md as the canonical schema, with discipline around consulting it when adding entries.
  • Index-body drift. The index table row claims an entry exists that the directory doesn’t contain, or vice versa. Mitigation: adding an entry is a two-part change (entry file plus index row) made in a single PR; a periodic staleness sweep catches the rest.
  • Over-structuring. The template demands too many required fields; contributors skip the pattern for new entries because filling it in is painful; you end up with a mix of structured and unstructured entries. Mitigation: keep required frontmatter to 3–5 fields; let the rest be optional.
  • Under-structuring. Frontmatter is too thin to actually answer real queries; the registry devolves back into free-text markdown. Mitigation: add a frontmatter field the first time you notice you want to query for it, not before and not after.
  • Narrative content leaking into frontmatter. Frontmatter should be queryable attributes, not prose. status: active is queryable; status: "active but with reservations because..." is not. Narrative belongs in the body.
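The first two failure modes above lend themselves to a mechanical check. The sketch below is one possible lint, not something the pattern mandates: it treats the top-level keys of _template.md as the allowed schema and assumes index rows link to entries as (corpus/name.md). The function names and the message wording are hypothetical.

```python
import re
from pathlib import Path

KEY_RE = re.compile(r"^([A-Za-z_][\w-]*):")  # top-level (unindented) keys only


def frontmatter_keys(path: Path) -> set[str]:
    """Collect the top-level frontmatter keys of one markdown file."""
    lines = path.read_text(encoding="utf-8").splitlines()
    if not lines or lines[0].strip() != "---":
        return set()
    keys = set()
    for line in lines[1:]:
        if line.strip() == "---":
            break
        match = KEY_RE.match(line)
        if match:
            keys.add(match.group(1))
    return keys


def lint_corpus(root: Path, corpus: str) -> list[str]:
    """Report frontmatter key drift and index-body drift for one corpus."""
    entry_dir = root / corpus
    allowed = frontmatter_keys(entry_dir / "_template.md")
    entries = {p.stem for p in entry_dir.glob("*.md") if not p.name.startswith("_")}
    problems = []

    # Key drift: any key that the template does not define.
    for name in sorted(entries):
        for key in sorted(frontmatter_keys(entry_dir / f"{name}.md") - allowed):
            problems.append(f"{name}: key '{key}' not in _template.md")

    # Index-body drift: compare linked names in the index with files on disk.
    index_text = (root / f"{corpus}.md").read_text(encoding="utf-8")
    pattern = rf"\]\({re.escape(corpus)}/([^)/]+)\.md\)"
    linked = {n for n in re.findall(pattern, index_text) if not n.startswith("_")}
    for name in sorted(entries - linked):
        problems.append(f"{name}: file exists but index has no row")
    for name in sorted(linked - entries):
        problems.append(f"{name}: index row points at a missing file")
    return problems
```

Run in CI or as a pre-commit hook, an empty result means the corpus agrees with its template and its index; it does not detect two allowed keys being used for the same attribute, which still relies on keeping the template small.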

Cost estimate

Low. Establishing the pattern for a new corpus takes 1–2 hours (template + index + 2 seed entries). Ongoing cost is proportional to contribution volume: with a well-designed template, a new entry takes 10–20 minutes, most of which is actual content thinking, not filing work.

Case studies

  • memory/stakeholders/ — people involved in the Agentic Engineering canon. Frontmatter carries contact details (email, Slack, preferred channel, timezone) making outbound reach a one-step lookup; tags capture project scope; status tracks active / archived / prospective. Enables queries like “who do I ping about the Gentari deployment?” via yq 'select(.tags | contains(["gentari"])) | .contact.slack' memory/stakeholders/*.md.
  • internal/integrations/ — external systems the agent can act on. Added 2026-04-18 in this shape (previously a single monolithic file — the refactor itself was the test that the pattern scales). Frontmatter carries system, status (active / proposed / archived), auth mode, pillars advanced, making scope drift auditable across the corpus.
  • recipes/ — this collection. Frontmatter carries criteria advanced, prerequisites, complexity, seen_in, making portfolio-reuse queries natural: “which recipes advance PL4-least-privilege?”; “which recipes are blocked on PL4-branch-protection being at level 2?”; “which recipes have been proven in at least one project?”.
  • Composes with: every domain-specific corpus pattern. Stakeholder registries, integration registries, ADR corpora, recipe collections, project inventories, vendor catalogues — same shape, different frontmatter schema per corpus.
  • Alternatives to: free-text markdown in arbitrary folders (the default failure mode); spreadsheet-of-truth (loses prose context, breaks at entry complexity); external CMS (introduces a system boundary agents can’t query natively).