# GitHub Actions scheduler
## Purpose
Use GitHub Actions as the agent-invokable scheduling substrate: schedules declared as `on: schedule` in workflow files on the default branch, executed on ephemeral self-hosted runners with just-in-time (JIT) configuration tokens, credentialed via OIDC federation to cloud targets where applicable. Closes the structural gap where seven rubric criteria — `PL5-signal-driven-tasks`, `PL2-test-quality`, `PL2-ui-test-coverage`, `PL2-load-stress-testing`, `PL4-release-strategy`, `PL5-pipeline-reliability`, `PL5-outcome-input-loop` — silently presupposed scheduling infrastructure that wasn't named (resolved in rubric v0.18 by tightening `PL5-signal-driven-tasks`'s level-2 anchor to make the scheduling prerequisite explicit).
The recipe is substrate-specific by design. Earlier deliberation considered a substrate-agnostic framing, but the structure that matters — delay tail, public-repo self-hosted runner prohibition, OIDC federation shape, workflow-file lifecycle mechanics — is specific enough to GitHub Actions that a generic framing would hide the parts that matter. Additional substrates (Temporal, Cloudflare Cron, pg_cron, etc.) warrant their own sibling recipes if and when a workload forces the question; see `research/scheduler-substrate-github-actions.md` for the boundary analysis that justifies this specialisation.
## Architecture
The scheduler is a GitHub repository running GitHub Actions, configured as follows:
- Schedules declared in workflow files on the default branch using `on: schedule` with POSIX five-field cron syntax. Minimum cadence 5 minutes. Non-round-minute offsets (`17 * * * *`) are preferred over round-hour expressions (`0 * * * *`) to avoid the platform's peak-queue delay tail.
- Ephemeral self-hosted runners with the `--ephemeral` flag, provisioned via just-in-time (JIT) configuration tokens from `POST /orgs/{org}/actions/runners/generate-jitconfig` (or the repo-scoped variant). Every runner processes exactly one job and de-registers; no state persists between runs. On Kubernetes, Actions Runner Controller (ARC) is the reference implementation and passes `--ephemeral` automatically.
- Agent tool surface via the `gh` CLI (shell access) or a thin MCP wrapper. Canonical operations map onto native GitHub Actions mechanisms:

  | Operation | Mechanism |
  |---|---|
  | Create | PR adding a workflow file with `on: schedule` on the default branch |
  | Edit | PR modifying the cron expression in the workflow file |
  | Cancel | `gh workflow disable <id>` (pause) or file deletion (permanent) |
  | List | `gh workflow list` filtered to workflows using `on: schedule` |
  | Status | `gh run list --workflow <id> --event schedule` |
  | Ad-hoc fire | `gh workflow run <id>` (requires `workflow_dispatch` in the workflow) |
  | Cancel run | `gh run cancel <run-id>` (or `rerun-failed-jobs` for recovery) |

- Credentials via OIDC federation for cloud targets (AWS STS, GCP Workload Identity Federation, Azure federated identity, HCP, Databricks). Workflows declare `permissions: id-token: write` and exchange GitHub's short-lived JWT for cloud credentials at fire time. No long-lived secrets stored in repo secrets, the scheduler, or the runner host.
- For write paths, composition with GitOps JIT privilege elevation is load-bearing: the PR that creates a scheduled writer routes through the same elevation gate as code changes, and the scheduled job itself elevates at fire time rather than carrying standing write credentials.
- Log forwarding from ephemeral runners to external storage (Loki, CloudWatch, S3, vendor) — non-optional because runner-local logs evaporate on de-registration and silent-failure investigation is otherwise impossible.
- Observability via the `GET /repos/{owner}/{repo}/actions/runs` endpoints and the workflow-run UI; structured logs queryable through the project's existing observability surface (rubric `PL3-agent-queryability`) once forwarding is in place.
- Job metadata conventions. Each scheduled workflow file carries metadata as YAML comments parsed by agent tooling and a dedicated sweeper workflow:
  - `# owner:` — agent or human slug (reinforces the native `actor` context).
  - `# expires:` — TTL date; past this date the sweeper disables or deletes the workflow.
  - `# risk-tier: low | medium | high` — feeds branch-protection tiers; high-risk workflow-file changes require additional reviewers on create and edit.
  - `# tags:` — list form, for aggregation and querying.
  - `# cost-budget:` — per-run or per-month ceiling; the sweeper alerts when exceeded.

  These are repo-level conventions (not platform features); the sweeper enforces them on the scheduler's own cadence. Without them, workflow-file proliferation is latent — see Failure modes.
Substrate limits to engineer around: 5-minute minimum cadence, ±30-minute delay tail under platform load, 60-day inactivity disable on public repos (N/A for private), plan-level concurrent-job ceilings (20 Free / 40 Pro / 60 Team / 500 Enterprise), and the `GITHUB_TOKEN` 1,000 req/hr/repo rate limit for API-chatty jobs. Workloads outside these boundaries need a different substrate — see the evaluation research for the full suitability envelope.
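Tying these conventions together, a minimal scheduled workflow might look like the following sketch. The runner labels, metadata values, role ARN, and script path are illustrative assumptions, and `aws-actions/configure-aws-credentials` is one of several OIDC exchange actions:

```yaml
# owner: example-agent             # illustrative metadata values, per the
# expires: 2026-06-30              # job-metadata conventions above
# risk-tier: low
# tags: [nightly, mutation-testing]
# cost-budget: 30min/run
name: nightly-mutation-tests
on:
  schedule:
    - cron: "17 3 * * *"           # off-round-minute offset to dodge the peak-queue delay tail
  workflow_dispatch: {}            # enables ad-hoc fire via `gh workflow run`
permissions:
  id-token: write                  # required for the OIDC credential exchange
  contents: read
jobs:
  run:
    runs-on: [self-hosted, ephemeral]   # hypothetical labels; ARC supplies its own
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/scheduler-example   # hypothetical role
          aws-region: us-east-1
      - run: ./scripts/run-mutation-tests.sh   # job logic lives in scripts, not in the YAML
```

Note the job logic is a one-line script invocation: the YAML stays a trigger wrapper, which is what makes the substrate-swap mitigation under Failure modes workable.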
## Criteria advanced
- `PL5-signal-driven-tasks` — direct unlock. Rubric v0.18's level-2 anchor explicitly caps this criterion at level 1 without an agent-invokable scheduler. Deploying this recipe removes the cap; level 2 becomes reachable on the criterion's own merits (reactive-source coverage still needs its own work).
- `PL2-test-quality` — enables "mutation testing run periodically, not per-PR" (the level-2 anchor's explicit language). Without scheduling, mutation testing is either per-PR (too expensive) or ad-hoc (signal quality collapses).
- `PL2-ui-test-coverage` — enables "coverage across critical flows, run daily." Without scheduling, the daily run is aspirational.
- `PL2-load-stress-testing` — enables "run on production-mirrored env, scheduled." Without scheduling, load tests run ad-hoc and drift out of representativeness.
- `PL4-release-strategy` — enables metric-gated stage promotion over time windows rather than immediate-or-never promotion. Caveat: the ±30-minute delay tail makes tight release-window promotion marginal on this substrate; sub-window-sensitive release strategies need a tighter scheduler.
- `PL5-pipeline-reliability` — enables self-healing pipelines that retry, backfill, or alert on schedule drift. Without scheduling, self-healing is reactive-only.
- `PL5-outcome-input-loop` — enables "metric thresholds trigger automated next-cycle tasks" on time windows, not just immediate-event triggers.
On all criteria except PL5-signal-driven-tasks, this recipe is an unlock / prerequisite rather than a full mechanism — each criterion still needs its own domain work (mutation tooling, UI test authoring, load-test rigs, canary promotion policy). But without the scheduler, each is structurally blocked from reaching level 2.
## Prerequisites
- `PL3-structured-state-read` ≥ 2 (structured state read access). Jobs that observe state need read access to the state they're observing. Scheduled jobs without queryable state can only do blind work.
- `PL3-agent-queryability` ≥ 2 (agent queryability). Jobs that react to telemetry (health checks, finding-rate trends) need the same queryability the agent has. Without this, jobs can act but can't inspect.
- `PL5-pipeline-reliability` ≥ 2 (pipeline reliability). Flaky pipelines compound pain catastrophically when scheduled work is layered on top — the schedule keeps firing, failures accumulate, alerting drowns. Don't deploy this recipe onto an unstable pipeline.
- Implicit: GitHub as the project's VCS, with Actions enabled. Projects not on GitHub need a different scheduling substrate; this recipe does not cover that case. Projects on GitHub where Actions is disabled by the organisation have no path here without that policy change.
## Failure modes
- Delay-tail-sensitive downstream consumers. A consumer that assumes the schedule fires within a minute of the stated time breaks when it fires 20–40 minutes later. Mitigation: treat scheduler timing as best-effort; downstream consumers tolerate the drift or use a different substrate; document the drift budget in the workflow file itself.
- 60-day inactivity disable on public repos. Scheduled workflows auto-disable after 60 days with no repository activity. A reported bug extends the disablement to `on: push` / `on: pull_request` paths on the same workflow file. Mitigation: prefer private repos for scheduled agent work; if public, adopt a keepalive workflow or synthetic commit cadence — but recognise these are brittle and themselves affected by the disablement rule.
- Silent failure with lost logs. Ephemeral runners lose their local logs on de-registration. Without external log forwarding, scheduled jobs fail, no one notices, and trust in the signal erodes. Mitigation: log forwarding to external storage is non-optional; alerting on job failure is non-optional; observability on the schedule itself (is it firing? is the runner registering?) is distinct from observability on its outputs.
- `GITHUB_TOKEN` rate-limit exhaustion. Scheduled jobs chatty against the GitHub API (listing runs, posting statuses, updating issues) exhaust the 1,000 req/hr/repo budget quickly. Mitigation: purpose-scoped PATs or GitHub App tokens for chatty jobs; batch API calls; cache listings within the job.
- Public-repo self-hosted runner compromise. Public repos with self-hosted runners are vulnerable to fork-PR attacks that execute arbitrary code on the runner host. GitHub's own guidance is effectively prohibitive: "Self-hosted runners should almost never be used for public repositories." Mitigation: do not deploy this recipe on public repos with self-hosted runners; private repos only, or GitHub-hosted runners for public-repo scheduling.
- Credentials at rest on the runner host. A persistent self-hosted runner with secrets on disk becomes a compounding attack surface. Mitigation: ephemeral + JIT is non-optional; OIDC federation for cloud credentials; environment secrets with required reviewers for non-federated targets.
- Privilege escalation via scheduled writes. If the agent can schedule arbitrary writes, the GitOps JIT elevation gate is bypassed by timing — schedule the write, come back later. Mitigation: scheduled writes elevate at fire time, not schedule time; the scheduled job opens a PR that the elevation gate reviews, rather than executing the write directly.
- Timezone / DST bugs. Cron expressions behave surprisingly around DST boundaries; GitHub advances schedules forward across skipped hours but fall-back creates potential duplicate fires. Mitigation: schedules expressed and reasoned about in UTC; DST-affected hours (roughly 01:00–03:00 in transitioning zones) avoided where possible; tests around DST boundaries explicit.
- Workflow-file proliferation. The agent creates scheduled workflows and never removes them; the repo accumulates abandoned workflow files over months. GitHub Actions has no native TTL on workflow files. Mitigation: a required `# expires:` metadata comment on every scheduled workflow (see Architecture → Job metadata conventions); a dedicated sweeper workflow processes the comment and disables or deletes expired entries; quarterly review of scheduled workflows for drift and relevance.
- Runaway jobs spawning more jobs. A scheduled job that adds further scheduled jobs (by committing new workflow files) can cascade. GitHub Actions does not cap chain depth or spawn rate natively. Mitigation: PR review on any workflow-file additions (standard code review handles this); per-agent compute budget caps on the runner host; rate limiting on the ingestion surface that writes workflow files.
- Scope regression on substrate swap. Schedules expressed as GitHub workflow files need a migration path if this recipe is ever replaced by a different substrate (Temporal, Cloudflare Cron, etc.). Mitigation: keep the agent tool surface thin (the `gh`-CLI wrapper or MCP) and job logic in scripts invoked by the workflow — the workflow YAML is the trigger wrapper, not the mechanism, so only the wrapper changes on substrate swap.
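Keeping the tool surface thin can be as small as a table that maps canonical operations onto `gh` argument vectors, so a substrate swap replaces only the table. A sketch under stated assumptions — the operation names and `invoke` helper are illustrative, not an existing API:

```python
import subprocess

# Canonical scheduler operations mapped onto `gh` CLI argument vectors.
# Only this table changes on a substrate swap; job logic stays in scripts.
OPERATIONS = {
    "list":       lambda: ["gh", "workflow", "list"],
    "status":     lambda wf: ["gh", "run", "list", "--workflow", wf, "--event", "schedule"],
    "cancel":     lambda wf: ["gh", "workflow", "disable", wf],
    "fire":       lambda wf: ["gh", "workflow", "run", wf],
    "cancel_run": lambda run_id: ["gh", "run", "cancel", run_id],
}

def build_command(op: str, *args: str) -> list[str]:
    """Return the argv for a canonical operation; raise on unknown ops."""
    try:
        return OPERATIONS[op](*args)
    except KeyError:
        raise ValueError(f"unknown operation: {op}") from None

def invoke(op: str, *args: str) -> str:
    """Execute the operation via the gh CLI (requires an authenticated gh)."""
    result = subprocess.run(build_command(op, *args),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

Create and Edit are deliberately absent: those are PR-shaped operations that go through the normal code-review path, not through a direct tool call.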
## Cost estimate
Medium. First deployment: 1–2 engineer-weeks including runner-host substrate (ARC on K8s, or scripted VMs with systemd), OIDC-federation wiring for at least one cloud target, log-forwarding pipeline, `gh` CLI wrapper or MCP surface, and a policy layer for write-path schedules. Per-project incremental cost drops sharply after the first deployment — the runner infrastructure and OIDC-federation template are reusable. Ongoing maintenance is moderate: runner-host upkeep, log-storage costs, quarterly review of scheduled workflows for drift and relevance.
Compute costs are substrate-dependent (K8s cluster, VM fleet, physical hardware). GitHub Actions minute billing is zero on self-hosted runners regardless of repo visibility.
## Open design questions
These gate promotion from proposed to proven. Each needs an answer from the first integration.
- Runner-host substrate. ARC on Kubernetes (if K8s is already in play) vs. scripted VMs with systemd (simpler, fewer moving parts, weaker at scale) vs. a managed runner service. The first integration picks one and documents the trade-off.
- Log-forwarding destination. Loki / CloudWatch / S3 / vendor product. Non-optional but not yet picked; blocks the silent-failure mitigation.
- Action-type policy. What's allowed in a scheduled workflow step? Options span a risk gradient:
  - Pre-registered playbooks only — narrow, safe, every job type manually added.
  - MCP tool invocations — broader, bounded by tool registration discipline.
  - Claude Code remote-agent runs — broadest leverage, highest risk.
  - Arbitrary shell steps — GitHub's default, weakest containment.

  Pick an answer or establish tiered allow-lists by risk class.
- Human-in-the-loop for high-risk schedules. Workflow-file changes go through standard PR review. Is that sufficient, or do production-touching / high-cadence schedules need additional approval to schedule (not just to execute)? Risk-tier labels on workflow files with matching branch-protection rules are a candidate answer.
- Observability loop into signal-driven tasks. Scheduled jobs produce signal (scan results, test failures, metric alerts); that signal must feed `PL5-signal-driven-tasks` for the compounding loop to close. Direct MCP integration from the workflow? Filesystem drop + the Ingestion as PR recipe? Still unresolved.
- Sweeper implementation. The Job-metadata-conventions architecture names the `# expires:` mechanism; the sweeper that processes it on each scheduler cadence is not yet written. A first-integration concern rather than a design question, but noted here until there's a reference implementation.
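As a reference point for that first integration, the expiry check itself is small. A sketch assuming the `# expires:` comment carries an ISO date — the comment grammar and function name are assumptions, not a settled convention:

```python
import re
from datetime import date

# Matches a metadata comment line like "# expires: 2026-06-30".
EXPIRES_RE = re.compile(r"^#\s*expires:\s*(\d{4}-\d{2}-\d{2})\s*$", re.MULTILINE)

def is_expired(workflow_text: str, today: date) -> bool:
    """True if the workflow's `# expires:` date is in the past.

    Workflows missing the comment are treated as expired, so the sweeper
    flags them for metadata backfill rather than letting them run untracked.
    """
    m = EXPIRES_RE.search(workflow_text)
    if m is None:
        return True  # missing metadata: flag, don't silently allow
    return date.fromisoformat(m.group(1)) < today

# The sweeper itself would walk .github/workflows/, call is_expired on each
# file, and run `gh workflow disable` on hits — that enforcement half is the
# unresolved part, not this check.
```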
## Related recipes
- Composes with: GitOps JIT privilege elevation — scheduled jobs requiring writes route through the elevation mechanism at fire time. The PR-shaped `create_job` operation composes naturally with branch-protection-based elevation gates on workflow-file additions.
- Composes with: Bot-token credential tenancy — non-OIDC targets (third-party APIs, legacy systems) that need long-lived-ish credentials should use bot tokens scoped to the scheduler's service identity, not user PATs.
- Composes with: Indexed per-entry registry — workflow-run history is a corpus worth indexing; querying "what scheduled jobs have ever run against production" is the analytic value of treating runs as structured data.
- Composes with: Ingestion as PR — signals produced by scheduled jobs (scan reports, test failures) ingested through the PR-shaped ingestion path close the loop back into `PL5-signal-driven-tasks`.
- Prerequisite for: future recipes that depend on time-based execution — automated rubric re-scoring, periodic stakeholder-context sweeps, drift detection against design specs, periodic backup / migration validation. None of these recipes have been written yet; several are latent in the repo backlog.
- Alternatives to: other scheduling substrates for workloads outside this recipe's envelope — Temporal, Cloudflare Workers Cron, Durable Object alarms, pg_cron, APScheduler / BullMQ, AWS EventBridge Scheduler. None evaluated yet. Workloads that fall outside the boundary (sub-5-min cadence, tight-SLA timing, public repos with privileged secrets, non-GitHub VCS) need a separate substrate evaluation; see `research/scheduler-substrate-github-actions.md` for the boundary analysis that would anchor such a comparison. Also specifically rejected: claude.ai-hosted remote triggers — they cannot reach the project's local MCP tool surface, forcing a compromise between agent-invocability and boundary discipline that this recipe is designed to preserve.
## References
- GitHub Actions as scheduling substrate — full substrate evaluation: scheduling primitive, reliability, ephemeral runners, REST API surface, credential model, suitability envelope, negative findings.
- GitHub Actions platform docs — primary-source extracts for schedule event, ephemeral runners, security hardening, REST API, usage limits.