GitHub Actions scheduler

Family: scheduling
Status: proposed
Complexity: medium
Advances: PL5-signal-driven-tasks, PL2-test-quality, PL2-ui-test-coverage, PL2-load-stress-testing, PL4-release-strategy, PL5-pipeline-reliability, PL5-outcome-input-loop
Prerequisites: PL3-structured-state-read ≥ 2, PL3-agent-queryability ≥ 2, PL5-pipeline-reliability ≥ 2

Purpose

Use GitHub Actions as the agent-invokable scheduling substrate: schedules declared as on: schedule in workflow files on the default branch, executed on ephemeral self-hosted runners with just-in-time (JIT) configuration tokens, credentialed via OIDC federation to cloud targets where applicable. Closes the structural gap where seven rubric criteria — PL5-signal-driven-tasks, PL2-test-quality, PL2-ui-test-coverage, PL2-load-stress-testing, PL4-release-strategy, PL5-pipeline-reliability, PL5-outcome-input-loop — silently presupposed scheduling infrastructure that wasn’t named (resolved in rubric v0.18 by tightening PL5-signal-driven-tasks’s level-2 anchor to make the scheduling prerequisite explicit).

The recipe is substrate-specific by design. Earlier deliberation considered a substrate-agnostic framing, but the structure that matters — delay tail, public-repo self-hosted runner prohibition, OIDC federation shape, workflow-file lifecycle mechanics — is specific enough to GitHub Actions that a generic framing would hide the parts that matter. Additional substrates (Temporal, Cloudflare Cron, pg_cron, etc.) warrant their own sibling recipes if and when a workload forces the question; see research/scheduler-substrate-github-actions.md for the boundary analysis that justifies this specialisation.

Architecture

The scheduler is a GitHub repository running GitHub Actions, configured as follows:

  • Schedules declared in workflow files on the default branch using on: schedule with POSIX five-field cron syntax. Minimum cadence 5 minutes. Non-round-minute offsets (17 * * * *) preferred over round-hour expressions (0 * * * *) to avoid the platform’s peak-queue delay tail.

  • Ephemeral self-hosted runners with the --ephemeral flag, provisioned via just-in-time (JIT) configuration tokens from POST /orgs/{org}/actions/runners/generate-jitconfig (or the repo-scoped variant). Every runner processes exactly one job and de-registers; no state persists between runs. On Kubernetes, Actions Runner Controller (ARC) is the reference implementation and passes --ephemeral automatically.

  • Agent tool surface via gh CLI (shell access) or a thin MCP wrapper. Canonical operations map onto native GitHub Actions mechanisms:

    Operation     Mechanism
    Create        PR adding a workflow file with on: schedule on the default branch
    Edit          PR modifying the cron expression in the workflow file
    Cancel        gh workflow disable <id> (pause) or file deletion (permanent)
    List          gh workflow list filtered to workflows using on: schedule
    Status        gh run list --workflow <id> --event schedule
    Ad-hoc fire   gh workflow run <id> (requires workflow_dispatch in the workflow)
    Cancel run    gh run cancel <run-id> (or rerun-failed-jobs for recovery)
  • Credentials via OIDC federation for cloud targets (AWS STS, GCP Workload Identity Federation, Azure federated identity, HCP, Databricks). Workflows declare permissions: id-token: write and exchange GitHub’s short-lived JWT for cloud credentials at fire time. No long-lived secrets stored in repo secrets, the scheduler, or the runner host.

  • For write paths, composition with GitOps JIT privilege elevation is load-bearing: the PR that creates a scheduled writer routes through the same elevation gate as code changes, and the scheduled job itself elevates at fire time rather than carrying standing write credentials.

  • Log forwarding from ephemeral runners to external storage (Loki, CloudWatch, S3, vendor) — non-optional because runner-local logs evaporate on de-registration and silent-failure investigation is otherwise impossible.

  • Observability via GET /repos/{owner}/{repo}/actions/runs endpoints and the workflow-run UI; structured logs queryable through the project’s existing observability surface (rubric PL3-agent-queryability) once forwarding is in place.

  • Job metadata conventions. Each scheduled workflow file carries metadata as YAML comments parsed by agent tooling and a dedicated sweeper workflow:

    • # owner: — agent or human slug (reinforces the native actor context).
    • # expires: — TTL date; past this date the sweeper disables or deletes the workflow.
    • # risk-tier: low | medium | high — feeds branch-protection tiers; high-risk workflow-file changes require additional reviewers on create and edit.
    • # tags: — list form, for aggregation and querying.
    • # cost-budget: — per-run or per-month ceiling; sweeper alerts when exceeded.

    These are repo-level conventions (not platform features); the sweeper enforces them on the scheduler’s own cadence. Without them, workflow-file proliferation is latent — see Failure modes.

Substrate limits to engineer around: 5-minute minimum cadence, ±30-minute delay tail under platform load, 60-day inactivity disable on public repos (N/A for private), plan-level concurrent-job ceilings (20 Free / 40 Pro / 60 Team / 500 Enterprise), and the GITHUB_TOKEN 1,000 req/hr/repo rate limit for API-chatty jobs. Workloads outside these boundaries need a different substrate — see the evaluation research for the full suitability envelope.
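
Put together, a scheduled workflow under these conventions might look like the following sketch. The cron offset, metadata values, AWS role ARN, and script path are illustrative assumptions, not fixtures from this repo; the OIDC step uses AWS purely as one example target.

```yaml
# owner: agent/scan-bot
# expires: 2025-12-31
# risk-tier: low
# tags: [security, scan]
# cost-budget: 30min/month
name: nightly-dependency-scan
on:
  schedule:
    - cron: "17 3 * * *"    # non-round-minute offset to dodge the peak-queue delay tail
  workflow_dispatch: {}     # enables ad-hoc fire via `gh workflow run`
permissions:
  id-token: write           # OIDC: exchange GitHub's short-lived JWT at fire time
  contents: read
jobs:
  scan:
    runs-on: self-hosted    # ephemeral, JIT-configured runner
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/scheduled-scan   # placeholder ARN
          aws-region: us-east-1
      - run: ./scripts/dependency-scan.sh   # job logic lives in a script, not the YAML
```

Note the split the substrate-swap failure mode depends on: the YAML is only the trigger wrapper plus metadata; everything the job actually does sits in the invoked script.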

Criteria advanced

  • PL5-signal-driven-tasks — direct unlock. Rubric v0.18’s level-2 anchor explicitly caps this criterion at level 1 without an agent-invokable scheduler. Deploying this recipe removes the cap; level 2 becomes reachable on the criterion’s own merits (reactive-source coverage still needs its own work).
  • PL2-test-quality — enables “mutation testing run periodically, not per-PR” (level-2 anchor’s explicit language). Without scheduling, mutation testing is either per-PR (too expensive) or ad-hoc (signal quality collapses).
  • PL2-ui-test-coverage — enables “coverage across critical flows, run daily.” Without scheduling, a daily run is aspirational.
  • PL2-load-stress-testing — enables “run on production-mirrored env, scheduled.” Without scheduling, load tests run ad-hoc and drift out of representativeness.
  • PL4-release-strategy — enables metric-gated stage promotion over time windows rather than immediate-or-never promotion. Caveat: the ±30-minute delay tail makes tight release-window promotion marginal on this substrate; sub-window-sensitive release strategies need a tighter scheduler.
  • PL5-pipeline-reliability — enables self-healing pipelines that retry, backfill, or alert on schedule drift. Without scheduling, self-healing is reactive-only.
  • PL5-outcome-input-loop — enables “metric thresholds trigger automated next-cycle tasks” on time windows, not just immediate-event triggers.

On all criteria except PL5-signal-driven-tasks, this recipe is an unlock / prerequisite rather than a full mechanism — each criterion still needs its own domain work (mutation tooling, UI test authoring, load-test rigs, canary promotion policy). But without the scheduler, each is structurally blocked from reaching level 2.

Prerequisites

  • PL3-structured-state-read ≥ 2 (structured state read). Jobs that observe state need read access to the state they’re observing; scheduled jobs without queryable state can only do blind work.
  • PL3-agent-queryability ≥ 2 (agent queryability). Jobs that react to telemetry (health checks, finding-rate trends) need the same queryability the agent has. Without this, jobs can act but can’t inspect.
  • PL5-pipeline-reliability ≥ 2 (pipeline reliability). Flaky pipelines compound catastrophically once scheduled work is layered on top: the schedule keeps firing, failures accumulate, and alerting drowns. Don’t deploy this recipe onto an unstable pipeline.
  • Implicit: GitHub as the project’s VCS, with Actions enabled. Projects not on GitHub need a different scheduling substrate; this recipe does not cover that case. Projects on GitHub where Actions is disabled by the organisation have no path here without that policy change.

Failure modes

  • Delay-tail-sensitive downstream consumers. A consumer that assumes the schedule fires within a minute of the stated time breaks when it fires 20–40 minutes later. Mitigation: treat scheduler timing as best-effort; downstream consumers tolerate the drift or use a different substrate; document the drift budget in the workflow file itself.
  • 60-day inactivity disable on public repos. Scheduled workflows auto-disable after 60 days with no repository activity. A reported bug extends the disablement to on: push / on: pull_request paths on the same workflow file. Mitigation: prefer private repos for scheduled agent work; if public, adopt keepalive-workflow or synthetic commit cadence — but recognise these are brittle and themselves affected by the disablement rule.
  • Silent failure with lost logs. Ephemeral runners lose their local logs on de-registration. Without external log forwarding, scheduled jobs fail, no one notices, trust in the signal erodes. Mitigation: log forwarding to external storage is non-optional; alerting on job failure is non-optional; observability on the schedule itself (is it firing? is the runner registering?) is distinct from observability on its outputs.
  • GITHUB_TOKEN rate-limit exhaustion. Scheduled jobs chatty against the GitHub API (listing runs, posting statuses, updating issues) exhaust the 1,000 req/hr/repo budget quickly. Mitigation: purpose-scoped PATs or GitHub App tokens for chatty jobs; batch API calls; cache listings within the job.
  • Public-repo self-hosted runner compromise. Public repos with self-hosted runners are vulnerable to fork-PR attacks that execute arbitrary code on the runner host. GitHub’s own guidance is effectively prohibitive: “Self-hosted runners should almost never be used for public repositories.” Mitigation: do not deploy this recipe on public repos with self-hosted runners; private repos only, or GitHub-hosted runners for public-repo scheduling.
  • Credentials at rest on the runner host. A persistent self-hosted runner with secrets on disk becomes a compounding attack surface. Mitigation: ephemeral + JIT is non-optional; OIDC federation for cloud credentials; environment secrets with required reviewers for non-federated targets.
  • Privilege escalation via scheduled writes. If the agent can schedule arbitrary writes, the GitOps JIT elevation gate is bypassed by timing — schedule the write, come back later. Mitigation: scheduled writes elevate at fire time, not schedule time; the scheduled job opens a PR that the elevation gate reviews, rather than executing the write directly.
  • Timezone / DST bugs. Cron expressions behave surprisingly around DST boundaries; GitHub advances schedules forward across skipped hours but fall-back creates potential duplicate fires. Mitigation: schedules expressed and reasoned about in UTC; DST-affected hours (roughly 01:00–03:00 in transitioning zones) avoided where possible; tests around DST boundaries explicit.
  • Workflow-file proliferation. Agent creates scheduled workflows and never removes them; the repo accumulates abandoned workflow files over months. GitHub Actions has no native TTL on workflow files. Mitigation: required # expires: metadata comment on every scheduled workflow (see Architecture → Job metadata conventions); dedicated sweeper workflow processes the comment and disables or deletes expired entries; quarterly review of scheduled workflows for drift and relevance.
  • Runaway jobs spawning more jobs. A scheduled job that adds further scheduled jobs (by committing new workflow files) can cascade. GitHub Actions does not cap chain depth or spawn rate natively. Mitigation: PR review on any workflow-file additions (standard code review handles this); per-agent compute budget caps on the runner host; rate limiting on the ingestion surface that writes workflow files.
  • Scope regression on substrate swap. Schedules expressed as GitHub workflow files need a migration path if this recipe is ever replaced by a different substrate (Temporal, Cloudflare Cron, etc.). Mitigation: keep the agent tool surface thin (the gh-CLI wrapper or MCP) and job logic in scripts invoked by the workflow — the workflow YAML is the trigger wrapper, not the mechanism, so only the wrapper changes on substrate swap.
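
Observability on the schedule itself (distinct from observability on its outputs) can start as a small check over run history. A minimal sketch, assuming the `workflow_runs` JSON shape returned by GET /repos/{owner}/{repo}/actions/runs filtered to event == "schedule"; the drift budget, thresholds, and alerting hook are placeholder assumptions.

```python
from datetime import datetime, timedelta, timezone

def schedule_health(runs: list[dict], cadence: timedelta,
                    drift_budget: timedelta = timedelta(minutes=40)) -> list[str]:
    """Flag silent failure and schedule drift from one workflow's run history.

    `runs` is the `workflow_runs` array of the REST response, newest first,
    already filtered to scheduled runs.
    """
    if not runs:
        return ["no scheduled runs recorded at all"]
    problems = []
    now = datetime.now(timezone.utc)
    latest = datetime.fromisoformat(runs[0]["created_at"].replace("Z", "+00:00"))
    # Is the schedule firing? Allow one missed fire plus the platform delay tail.
    if now - latest > cadence + drift_budget:
        problems.append(f"schedule stopped firing: last run {latest.isoformat()}")
    # Are fires succeeding? Ephemeral runners lose local logs on de-registration,
    # so this signal must come from run conclusions, not the runner host.
    failures = [r for r in runs[:5] if r.get("conclusion") == "failure"]
    if len(failures) >= 3:
        problems.append(f"{len(failures)} of last 5 scheduled runs failed")
    return problems
```

An alerting job (itself scheduled) would feed this from the runs API and page on a non-empty result; the 40-minute default mirrors the ±30-minute delay tail plus margin.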

Cost estimate

Medium. First deployment: 1–2 engineer-weeks including runner-host substrate (ARC on K8s, or scripted VMs with systemd), OIDC-federation wiring for at least one cloud target, log-forwarding pipeline, gh CLI wrapper or MCP surface, and policy layer for write-path schedules. Per-project incremental cost drops sharply after the first deployment — the runner infrastructure and OIDC-federation template are reusable. Ongoing maintenance is moderate: runner-host upkeep, log-storage costs, quarterly review of scheduled workflows for drift and relevance.

Compute costs are substrate-dependent (K8s cluster, VM fleet, physical hardware). GitHub Actions minute billing is zero on self-hosted runners regardless of repo visibility.

Open design questions

These gate promotion from proposed to proven. Each needs an answer from the first integration.

  • Runner-host substrate. ARC on Kubernetes (if K8s is already in play) vs. scripted VMs with systemd (simpler, fewer moving parts, weaker at scale) vs. a managed runner service. First integration picks one and documents the trade-off.

  • Log-forwarding destination. Loki / CloudWatch / S3 / vendor product. Non-optional but not yet picked; blocks the silent-failure mitigation.

  • Action-type policy. What’s allowed in a scheduled workflow step? Options span a risk gradient:

    • Pre-registered playbooks only — narrow, safe, every job type manually added.
    • MCP tool invocations — broader, bounded by tool registration discipline.
    • Claude Code remote-agent runs — broadest leverage, highest risk.
    • Arbitrary shell steps — GitHub’s default, weakest containment.

    Pick an answer or establish tiered allow-lists by risk class.

  • Human-in-the-loop for high-risk schedules. Workflow-file changes go through standard PR review. Is that sufficient, or do production-touching / high-cadence schedules need additional approval to schedule (not just to execute)? Risk-tier labels on workflow files with matching branch-protection rules is a candidate answer.

  • Observability loop into signal-driven tasks. Scheduled jobs produce signal (scan results, test failures, metric alerts); that signal must feed PL5-signal-driven-tasks for the compounding loop to close. Direct MCP integration from the workflow? Filesystem drop + the Ingestion as PR recipe? Still unresolved.

  • Sweeper implementation. The Job-metadata-conventions architecture names the # expires: mechanism; the sweeper that processes it on each scheduler cadence is not yet written. A first-integration concern rather than a design question, but noted here until there’s a reference implementation.

Composition and alternatives

  • Composes with: GitOps JIT privilege elevation — scheduled jobs requiring writes route through the elevation mechanism at fire time. The PR-shaped create_job operation composes naturally with branch-protection-based elevation gates on workflow-file additions.
  • Composes with: Bot-token credential tenancy — non-OIDC targets (third-party APIs, legacy systems) that need long-lived-ish credentials should use bot tokens scoped to the scheduler’s service identity, not user PATs.
  • Composes with: Indexed per-entry registry — workflow-run history is a corpus worth indexing; querying “what scheduled jobs have ever run against production” is the analytic value of treating runs as structured data.
  • Composes with: Ingestion as PR — signals produced by scheduled jobs (scan reports, test failures) ingested through the PR-shaped ingestion path close the loop back into PL5-signal-driven-tasks.
  • Prerequisite for: future recipes that depend on time-based execution — automated rubric re-scoring, periodic stakeholder-context sweeps, drift detection against design specs, periodic backup / migration validation. None of these recipes have been written yet; several are latent in the repo backlog.
  • Alternatives to: other scheduling substrates for workloads outside this recipe’s envelope — Temporal, Cloudflare Workers Cron, Durable Object alarms, pg_cron, APScheduler / BullMQ, AWS EventBridge Scheduler. None evaluated yet. Workloads that fall outside the boundary (sub-5-min cadence, tight-SLA timing, public repos with privileged secrets, non-GitHub VCS) need a separate substrate evaluation; see research/scheduler-substrate-github-actions.md for the boundary analysis that would anchor such a comparison. Also specifically rejected: claude.ai-hosted remote triggers — cannot reach the project’s local MCP tool surface, forcing a compromise between agent-invocability and boundary discipline that this recipe is designed to preserve.
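
The sweeper question left open above could start as a parse-and-decide pass over workflow files. A minimal sketch, assuming the metadata-comment format from Architecture; enforcement (gh workflow disable, deletion PR) is returned as a decision rather than executed here, and the key names are the repo-level convention, not platform features.

```python
import re
from datetime import date

# Matches the leading `# key: value` metadata comments on a scheduled workflow.
METADATA = re.compile(r"^#\s*(owner|expires|risk-tier|tags|cost-budget):\s*(.+)$")

def parse_metadata(workflow_text: str) -> dict:
    """Read the metadata-comment block from the top of a workflow file."""
    meta = {}
    for line in workflow_text.splitlines():
        m = METADATA.match(line.strip())
        if m:
            meta[m.group(1)] = m.group(2).strip()
        elif line.strip() and not line.lstrip().startswith("#"):
            break  # metadata block ends at the first non-comment content line
    return meta

def sweep_decision(workflow_text: str, today: date) -> str:
    """Return "disable", "flag-missing-expiry", or "keep" for one workflow file."""
    meta = parse_metadata(workflow_text)
    if "expires" not in meta:
        return "flag-missing-expiry"   # convention violation: every schedule needs a TTL
    if date.fromisoformat(meta["expires"]) < today:
        return "disable"               # sweeper would run `gh workflow disable <id>`
    return "keep"
```

A wrapper workflow (itself scheduled) would walk .github/workflows/, apply sweep_decision to each file, and act on the result; the same pass is a natural place to check the cost-budget and risk-tier keys.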

References

  • GitHub Actions as scheduling substrate — full substrate evaluation: scheduling primitive, reliability, ephemeral runners, REST API surface, credential model, suitability envelope, negative findings.
  • GitHub Actions platform docs — primary-source extracts for schedule event, ephemeral runners, security hardening, REST API, usage limits.