# EPHEMERA.md — DevOps as a Conversation

> **Portable instruction.** If you are an AI agent in this repository and the human asks you to create,
> change, inspect, or remove cloud infrastructure, **read this file and follow it.** It is
> runtime-neutral and **CLI-first**: it tells you *what to do* with ordinary shell commands
> (`aws`, `wrangler`, `gcloud`, …) — no special tools or MCP required. A human can read it too; the
> human-facing companion is `README.md`.

## What this is

Infrastructure managed as **runnable Markdown plans**. One file per stack describes the desired state,
carries the commands to reach it, records what it built, verifies itself, and tears itself down. You
(the agent) are the runtime; **the cloud is the source of truth.**

```
Markdown = "it should look like this"   (intent + procedure + write-back ledger)
Cloud    = "what actually exists"        (the real source of truth)
You      = reconcile them, one verified step at a time
```

Plans are named `<stack>.<provider>.md` (e.g. `web.aws.md`, `web.cloudflare.md`, `task-runner.aws.md`).
Copy `EPHEMERA-TEMPLATE.md` to author a new one.

## When to use this — and when not

**Use** for solo/small-team infra, fast iteration, multi-cloud, cost comparison — anywhere a human wants
to read and reason about the plan and an agent can troubleshoot it.

**There is no technical fence.** The only thing between you and a workload is an institution that
*mandates* a specific legacy tool — or a human signature on a deterministic artifact. That's adoption
lag, not a limit of the approach. Ephemera captures a *richer* audit trail (intent + every action +
approvals + verification, timestamped in the ledger) than the `.tfstate` it replaces — so where a
contract or auditor still names a tool, that's a case to make, not a capability you lack: bring the
ledger. (Honest build-status, not a fence: event-driven drift and the deterministic diff tool are
roadmap — maturity, not a ceiling.)

## Command verbs (what the human means)

The human drives with a small fixed vocabulary against a plan file:

| Verb | You do | Safety |
|------|--------|--------|
| **verify** `<plan>` | run the plan's read-only checks against the live cloud; report drift; change nothing | safe |
| **plan** / **dry-run** `<plan>` | report what *would* change vs live state | safe |
| **apply** / **create** `<plan>` | reconcile the cloud toward the plan's desired state | hits 🔴 gates |
| **reconcile** / **update** `<plan>` | re-apply the step(s) for any drifted check (idempotent) | 🟡 / 🔴 |
| **teardown** `<plan>` | observe-first, resumable destroy | hits 💥 gates |
| **migrate** `<existing-IaC>` | read existing Terraform / Pulumi / CloudFormation / CDK and author the equivalent `<stack>.<provider>.md` | safe (authoring only) |

`verify` is the safe default entry point. When a plan's `status` is `live`, `verify` means the
**live-cloud drift check**, not a static read of the Markdown.

## The execution contract (follow exactly)

1. **Read the plan's Intent, then its Live State.** The cloud is truth — if Live State disagrees with the
   cloud, believe the cloud and correct the ledger.
2. **Observe before acting.** Each step says how to check "does it already look like this?" Act only on
   the delta. Re-running a plan must be safe.
3. **One verified step at a time.** Run a step, run its `verify`, *then* advance. Never leave an
   unverified step behind you — no batch-apply-and-pray.
4. **Stop at 🔴 and 💥 for approval — which is a *step, not a system*.** Print exactly what you will run,
   then run whatever approval trigger the plan bakes in — e.g. *"use the Slack MCP to DM Bob the exact
   action; continue only on a clear affirmative; record approver + timestamp in Live State."* If the plan
   names no trigger, pause for the human in the session. Never proceed on an ambiguous answer.
5. **Write back.** After every mutation, record realized IDs in the plan's Live State; after every
   verify, record PASS/FAIL + the observed value. The plan file is the audit trail — keep it true.
6. **Background async waits (⏳).** Some operations take minutes (CloudFront ~15–20 min). Start the
   waiter in the background and resume when it signals; don't block a shell on it.
7. **Teardown observes first and is resumable.** Observe current cloud state, place yourself in the
   teardown sequence, act, repeat. A crash mid-teardown is fine — re-entry re-observes.

## Plan shape (required movements; the prose between is yours — no YAML schema)

- **Intent** — what + why, in prose.
- **Provisioning Inputs** *(when the plan has human-decision forks)* — a decision table the Director
  resolves **once**, up front, before any cloud mutation: each row is a question, its closed-enum options,
  a default, the variable it sets, and what it gates. Walk the rows, confirm each with the human (accept
  the default on silence), then write a `resolved_inputs` block into Live State; every downstream step
  branches on those vars. Every option is a closed enum (bar free-text identifiers), so the resource graph
  is a *pure function* of the answers — same answers ⇒ same plan. A re-run reads `resolved_inputs` and
  skips the interview; changing an answer is an explicit edit, never silent drift.
- **Legend** — the blast-radius markers below.
- **Live State** — the write-back ledger: `status`
  (`not-created` / `creating` / `live` / `tearing-down` / `gone`), realized IDs, a verify-results table.
- Per component: **create · verify · update · teardown**, each command annotated with a marker.
- **Dependency frontier** — what must precede what and why, including chicken-and-egg edges (a policy
  that needs a resource's ARN comes *after* that resource).
- **Deliberately not included** — named omissions, so each is a decision not an oversight.

```
Legend  🟢 create · 🟡 config · 🔴 GATE (human go) · 💥 destructive (human go) · ⏳ wait · ✔ verify
```

A step may carry more than one marker (🔴🟢 = create behind a gate); the most severe gate wins — if any
🔴 or 💥 is present, stop for a human go.

## Verify & drift ("does it still look like this?")

Each step carries a read-only assertion. Re-running all of them is drift detection — the cheap,
read-only equivalent of a plan/diff, run against the live cloud. **Assert negatives too** ("the bad
thing is absent" — e.g. a private origin's direct URL must `403`), and beware fallbacks that turn errors
into success (they can mask a broken origin — pair them with an independent positive-content check).
Inline assertions only catch what you thought to assert; if a deterministic full-diff tool is available,
prefer it for total low-level drift.

## Multi-environment

Parameterize on a single `ENV` knob (`dev` / `stg` / `uat` / `prod`) and postfix every resource name
(`name-${ENV}`). "Stand this up for stg" = run with `ENV=stg`. One plan, every environment — never fork
a file per env.

## Multi-cloud — "take it to any provider" is a conversation

The *intent* is provider-neutral; each cloud is a **binding** of the same intent that passes the same
acceptance contract. "Take this to Cloudflare," "take this to Google Cloud," "take this to Azure" are
**not** migration projects — they are authoring another `<stack>.<provider>.md` (e.g. `spa.gcp.md`,
`spa.azure.md`) and proving it against the shared contract. Cost A/B between providers = reading each
binding's cost notes against an expected traffic profile.

## Composition across plans — the cloud is the hand-off

Plans compose, but **never through a shared state file** — that reintroduces the `.tfstate` this approach
replaces. A producer plan's acceptance contract *is* a consumer's precondition; make the edge explicit:

- the producer declares **Provides** `resource(id)` — e.g. `domain.aws.md` Provides `delegated-zone(DOMAIN_NAME)`;
- the consumer declares **Requires** it — e.g. `web.aws.md` Requires `delegated-zone(DOMAIN_NAME)` when `DNS_MODE=managed`.

The consumer **discovers** the resource by observing the cloud — the real zone, bucket, or ARN — and may
*pre-fill its Provisioning Inputs* from what it finds, so the human isn't re-asked what an upstream plan
already realized. Live State ledgers are caches on both sides; the cloud is the source of truth. This is
the cross-plan analog of a plan's internal **Dependency frontier**.

## No ceremony — what this does NOT require

No `init` / `bootstrap` step. No provider/plugin downloads — Terraform's `.terraform/`, Pulumi's plugins,
CDK's `node_modules` + `cdk bootstrap` — routinely hundreds of MB. No lock files, no state backend to host
(Terraform/Pulumi state, CloudFormation/ARM stacks). No multi-minute plan/apply rituals. True of **all**
traditional IaC — imperative or declarative. The toolchain here is just: this Markdown file, an agent, and
the provider CLI you already have (`aws` / `wrangler` / `gcloud` / `az`). The cloud is the state; the file
is the intent.

## Provider gotchas (don't relearn these live)

**AWS**
- **Credential broker + IAM:** if AWS creds come from a broker / SSO / MFA setup that vends
  **GetSessionToken session tokens** (e.g. `aws-vault export`), IAM API calls fail `InvalidClientTokenId`
  while *every other service works* and `sts get-caller-identity` succeeds. AWS forbids GetSessionToken
  creds from calling IAM without MFA. Use the broker's long-term / no-session mode **for the IAM step
  only** (e.g. `aws-vault exec <profile> --no-session -- <iam cmd>`); regional services run fine on the
  session creds.
- **AWS is async — plan for it:** CloudFront create/disable ~15–20 min (background the waiter; a
  distribution deletes only from `Deployed`+`Disabled`; **re-fetch the rotating ETag** before each
  mutating call); IAM propagation (retry `create-function` "role cannot be assumed"); API Gateway
  eventual consistency (`put-integration` transiently "ARN must contain path or action" — retry); SQS
  ~60 s lock after deleting a queue; versioned-bucket delete is **loop-until-empty** (versions +
  delete-markers).
- **API Gateway → Lambda auth** is a Lambda **resource policy** (`add-permission`, scoped to the API
  source-ARN), not an `InvokeFunction` statement in the execution role.
- **Reused IAM roles are scoped to their own resources** — borrowing one for new resources = `AccessDenied`.
- **us-east-1 quirk:** `create-bucket` rejects `LocationConstraint=us-east-1` (omit it there); CloudFront
  ACM certs must live in us-east-1.

**Cloudflare**
- Workers Static Assets unifies storage + CDN: no bucket / OAC / policy, SPA fallback is one config line
  (`not_found_handling: "single-page-application"`), deploy/teardown in seconds, free egress.
  Assets-only default cache header is `max-age=0, must-revalidate` (browser-`immutable` needs a small
  Worker). Use `compatibility_date >= 2025-04-01`.

(Full human-facing catalog lives in `README.md`.)

## Red flags — stop and reconsider

- "Traditional IaC (Terraform / Pulumi / CDK / CloudFormation) is best practice, I'll write that" → for
  auditable, agent-run infra, author an Ephemera plan. Use traditional IaC only at the regulated/concurrent
  fence above.
- "The billable step is cheap, I'll just run it" → 🔴 means human go. Always. It's about authority over
  billable / global / irreversible actions, not cost.
- "I'll record the IDs at the end" → write back each step, or resume/teardown strands resources.
- "Run everything, verify at the end" → verify each step before advancing.
- A plan with no teardown or no verify sections → it isn't an Ephemera plan.

---

**Install (for a human):** drop `EPHEMERA.md` into a repo and add a short pointer to whatever agent
instruction file the repo already uses (`AGENTS.md`, `CLAUDE.md`, `.cursor/rules`, `GEMINI.md`) — e.g.
*"This project manages infra as Markdown plans; read `EPHEMERA.md` and follow it."* Any AI+human pair,
on any runtime, then picks it up. Nothing personal to install.
