# Ephemera — ECS Fargate platform: cluster + ALB (AWS CLI)

> Self-executing Markdown. The shared, long-lived layer of the ECS Fargate demo — the ECS cluster, the
> ALB, and the ALB security group — as one auditable plan. The cloud is the source of
> truth for state; this file is intent + write-back ledger.

> **Provides** `ecs-cluster(${CLUSTER})` and `alb(${ALB_NAME})` — realized cluster ARN/name, ALB ARN,
> ALB DNS name, the :443 (prod) and :9001 (test) listener ARNs, and the ALB security-group id. The
> per-service plan `ecs-service.aws.md` **Requires** these and discovers them by observing the cloud.
>
> **Requires** `vpc()`, `subnets()`, `cert()` from **`network.aws.md`**: a **VPC, public subnets, and a
> valid ACM cert**, read from **SSM Parameter Store** at `/network/vpc`, `/network/subnet/public/1a`,
> `/network/subnet/public/1b`, and `/network/cert/${AWS_REGION}`. If any are absent (or the cert is
> expired), run `network.aws.md` first; it discovers what exists, teaches what's missing, and stands up
> or reissues only what's needed.
>
> ⚠️ **Known blocker (dry-run 2026-06-24):** both `*.example.com` certs **in us-west-2 are EXPIRED**;
> the valid `*.example.com` certs (ISSUED→2027) live in **us-east-1** and are **CloudFront-only — an
> ALB cannot attach a cross-region cert.** So §5's HTTPS listener has no valid cert until `network.aws.md`
> §3 produces one in us-west-2 (ACM DNS-validated **via Cloudflare**, or a Cloudflare Origin cert). VPC +
> 2 public subnets in the default VPC *are* present and wired. example.com DNS is on **Cloudflare**.

---

## 🤖 Director prompt

Observe before acting; verify each step; **stop at 🔴/💥 for human go**; write realized ARNs/IDs back
into Live State each step; ALB create/delete is async — background the waiter; teardown observes-first
and is resumable. Verbs: `verify` (read-only), `apply` (create, gated), `teardown` (destroy, gated).

```
Legend  🟢 create · 🟡 config · 🔴 GATE (human go) · 💥 destructive (human go) · ⏳ wait · ✔ verify
```
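
The "stop at 🔴/💥 for human go" rule can be made mechanical with a small prompt helper. The following
is a minimal sketch, not part of the plan itself: the function name `confirm_gate` and the literal
answer `go` are assumptions chosen for illustration.

```bash
# Hypothetical gate helper (not part of the plan): blocks until a human types "go".
# Any other answer, or end of input, returns non-zero so the caller can stop.
confirm_gate() {
  local label="$1" answer=""
  printf 'GATE: %s. Type "go" to proceed: ' "$label" >&2
  IFS= read -r answer || true
  [ "$answer" = "go" ]
}

# Usage (assumed wiring):
#   confirm_gate "delete ALB ${ALB_NAME}" && aws elbv2 delete-load-balancer --load-balancer-arn "$ALB_ARN"
```

The prompt goes to stderr so that stdout stays clean for command substitution in the surrounding steps.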

## Intent

Stand up the **shared platform** every service in this demo attaches to: one **ECS Fargate cluster**,
one **internet-facing ALB** terminating TLS on **:443** (HTTP :80 disabled — HTTPS-only), plus a **test
listener on :9001** that CodeDeploy uses to validate green before cutover,
and the **ALB security group** (ingress :80/:443 from the internet, egress all). Per-service target
groups, task defs, services, and DNS live in `ecs-service.aws.md`; this layer is created once and
rarely torn down — which is exactly why it is its own plan with its own teardown gate.

## Gotcha that will cost time if forgotten: ALB is async, SGs are referenced everywhere

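ALB create/delete takes minutes. Deleting an ALB also deletes its listeners, so removing listeners
first in teardown keeps each step separately verifiable rather than satisfying a hard requirement; the
real blocker is that the ALB security group cannot be deleted until **every** service SG that
references it is gone. Teardown order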
therefore matters and is the reverse of create. (General AWS broker/IAM note lives in `README.md`; this
plan creates **no IAM**, so the credential-broker `--no-session` dance does not apply here.)

## Live State

```yaml
status:        not-created      # published template - run it to realize state
last_action:   authored — cluster, alb, security_group
last_verified: —
resolved_inputs:                # filled once the Provisioning Inputs interview runs
  env:        dev
  region:     us-west-2
```

| key            | value (filled on apply) |
|----------------|-------------------------|
| AWS_REGION     | `us-west-2` |
| ACCOUNT_ID     | `<discover: sts get-caller-identity>` |
| ENV            | `dev` |
| VPC_ID         | `<discover: ssm get-parameter /network/vpc>` |
| SUBNET_IDS     | `<discover: ssm /network/subnet/public/1a,1b>` |
| CERT_ARN       | `<discover: ssm /network/cert/us-west-2>` |
| ALB_SG_ID      | `—` |
| CLUSTER_ARN    | `—` |
| ALB_ARN        | `—` |
| ALB_DNS        | `—` |
| LISTENER_443   | `—` |
| LISTENER_9001  | `—` |

| ✔ check                          | expected                                  | observed | result |
|----------------------------------|-------------------------------------------|----------|--------|
| upstream VPC/subnets/cert in SSM | all 4 params resolve                       | —        | — |
| ALB security group               | ingress :80/:443 from `0.0.0.0/0`, egress all | —    | — |
| ECS cluster                      | `ACTIVE`, Fargate providers attached       | —        | — |
| ALB                              | `active`, internet-facing                  | —        | — |
| :443 listener                    | protocol HTTPS, cert attached, TLS13 policy | —       | — |
| cert ISSUED & not expired        | ACM status `ISSUED`, `NotAfter` future     | EXPIRED (dry-run) | ❌ |
| :9001 test listener              | present (blue/green validation)            | —        | — |
| :80 listener absent              | `enable_http=false` → no port-80 listener  | —        | — |

## Provisioning Inputs

Resolve once, up front, before any cloud mutation. Accept the default on silence; write `resolved_inputs`
into Live State. Same answers ⇒ same plan.

| # | Question | Options (closed enum) | Default | Sets | Gates |
|---|----------|-----------------------|---------|------|-------|
| 1 | Which environment? | `dev` / `stg` / `uat` / `prod` | `dev` | `ENV` | every resource name (`*-${ENV}`) |
| 2 | Which region? | `us-west-2` / `us-east-1` / … | `us-west-2` | `AWS_REGION` | SSM cert path, all ARNs |
| 3 | Enable an HTTP :80 listener? | `no` / `redirect-to-443` | `no` | `ENABLE_HTTP` | §5 listener set |

```yaml
# → written into Live State once resolved
resolved_inputs:
  env:         dev
  region:      us-west-2
  enable_http: no
  resolved_by: <human who confirmed>
  resolved_at: <timestamp>
```
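
Because the options are closed enums, the answers can be checked before any cloud call. A minimal
sketch, assuming a hypothetical function name `validate_inputs`; the region is only checked for
non-emptiness because its option list in the table is open-ended.

```bash
# Hypothetical validator for the closed enums in the Provisioning Inputs table.
# Region is left unchecked beyond non-empty because its option list is open-ended.
validate_inputs() {
  local env="$1" region="$2" enable_http="$3"
  case "$env" in
    dev|stg|uat|prod) ;;
    *) echo "ENV must be dev, stg, uat or prod (got '${env}')" >&2; return 1 ;;
  esac
  [ -n "$region" ] || { echo "AWS_REGION must not be empty" >&2; return 1; }
  case "$enable_http" in
    no|redirect-to-443) ;;
    *) echo "ENABLE_HTTP must be no or redirect-to-443 (got '${enable_http}')" >&2; return 1 ;;
  esac
}

# Usage: defaults apply on silence, as the table specifies.
#   validate_inputs "${ENV:-dev}" "${AWS_REGION:-us-west-2}" "${ENABLE_HTTP:-no}" || exit 1
```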

## 0. Variables

```bash
export ENV="dev" AWS_REGION="us-west-2"
export ACCOUNT_ID="$(aws sts get-caller-identity --query Account --output text)"
export CLUSTER="${ENV}-cluster"
export ALB_NAME="${ENV}-fargate-alb"
export ALB_SG_NAME="${ENV}-fargate-sg"
export SSL_POLICY="ELBSecurityPolicy-TLS13-1-2-2021-06"
export TEST_LISTENER_PORT="9001"
```

## Dependency frontier

```
upstream SSM (VPC, subnets, cert)   [verified, not created]
  └─> alb-sg (ingress 80/443)
        └─> ALB (needs SG + subnets)
              ├─> :443 listener (needs cert ARN)
              └─> :9001 test listener
cluster (independent; may be created in parallel with alb-sg)

all of the above ──> Provides{cluster, alb, listeners, sg}
```
Non-negotiable edges: **the ALB needs the ALB-SG id and the subnet ids** (SG + network first); **the
:443 listener needs the ACM cert ARN** (discovered from SSM). The cluster has no dependencies and can be
created in parallel with the SG. Teardown is the strict reverse: listeners → ALB → ALB-SG → cluster.

## 1. Discover & verify upstream network (read-only)  ✔

```bash
VPC_ID="$(aws ssm get-parameter --name /network/vpc --query Parameter.Value --output text)"
SUBNET_1A="$(aws ssm get-parameter --name /network/subnet/public/1a --query Parameter.Value --output text)"
SUBNET_1B="$(aws ssm get-parameter --name /network/subnet/public/1b --query Parameter.Value --output text)"
CERT_ARN="$(aws ssm get-parameter --name /network/cert/${AWS_REGION} --query Parameter.Value --output text)"
# all four must resolve to real ids/arns; empty => missing upstream, stop and stand up the network first
```
> → Live State: VPC_ID, SUBNET_IDS, CERT_ARN  (this step is a **Requires** check, creates nothing)
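
The "empty means missing upstream, stop" rule in the comment above can be enforced rather than
eyeballed. A minimal sketch, assuming a hypothetical helper named `require_param`; it wraps the same
`aws ssm get-parameter` call and fails when the value is absent.

```bash
# Hypothetical helper: print one SSM parameter value, or fail when it is absent.
# "None" is what --output text prints for a null query result.
require_param() {
  local name="$1" value=""
  value="$(aws ssm get-parameter --name "$name" \
    --query Parameter.Value --output text 2>/dev/null)" || value=""
  if [ -z "$value" ] || [ "$value" = "None" ]; then
    echo "missing upstream parameter ${name}: run network.aws.md first" >&2
    return 1
  fi
  printf '%s\n' "$value"
}

# Usage:
#   VPC_ID="$(require_param /network/vpc)" || exit 1
#   CERT_ARN="$(require_param "/network/cert/${AWS_REGION}")" || exit 1
```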

## 2. ALB security group  🟢

```bash
ALB_SG_ID="$(aws ec2 create-security-group --group-name "$ALB_SG_NAME" \
  --description "Security group for ALB in ${ENV}" --vpc-id "$VPC_ID" \
  --tag-specifications "ResourceType=security-group,Tags=[{Key=Environment,Value=${ENV}},{Key=Application,Value=fargate}]" \
  --query GroupId --output text)"
for PORT in 80 443; do
  aws ec2 authorize-security-group-ingress --group-id "$ALB_SG_ID" \
    --ip-permissions "IpProtocol=tcp,FromPort=${PORT},ToPort=${PORT},IpRanges=[{CidrIp=0.0.0.0/0}]" >/dev/null
done
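# default egress (all) is created with the SG; leave as-is
# note: :9001 is not opened here; traffic to the test listener needs its own ingress rule (source is an open choice)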
```
```bash
# ✔ verify
aws ec2 describe-security-groups --group-ids "$ALB_SG_ID" \
  --query 'SecurityGroups[0].IpPermissions[].FromPort' --output text   # contains 80 and 443
```
> → Live State: ALB_SG_ID
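
In keeping with "observe before acting", a re-run can look for an existing group before calling
`create-security-group`, which would otherwise fail on a duplicate name in the same VPC. A minimal
sketch, assuming a hypothetical helper named `find_sg_id`.

```bash
# Hypothetical lookup: print the id of an existing SG with this name in this VPC,
# or return non-zero when none exists ("None" is the text form of a null result).
find_sg_id() {
  local name="$1" vpc_id="$2" id=""
  id="$(aws ec2 describe-security-groups \
    --filters "Name=group-name,Values=${name}" "Name=vpc-id,Values=${vpc_id}" \
    --query 'SecurityGroups[0].GroupId' --output text 2>/dev/null)" || id=""
  if [ -z "$id" ] || [ "$id" = "None" ]; then
    return 1
  fi
  printf '%s\n' "$id"
}

# Usage: reuse the id when the group is already there, otherwise fall through to the create call.
#   ALB_SG_ID="$(find_sg_id "$ALB_SG_NAME" "$VPC_ID")" || echo "no existing SG; create it"
```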

## 3. ECS cluster  🟢

```bash
aws ecs create-cluster --cluster-name "$CLUSTER" \
  --capacity-providers FARGATE FARGATE_SPOT \
  --settings name=containerInsights,value=enabled \
  --tags key=Environment,value="$ENV" key=ManagedBy,value=ephemera >/dev/null
CLUSTER_ARN="$(aws ecs describe-clusters --clusters "$CLUSTER" \
  --query 'clusters[0].clusterArn' --output text)"
```
```bash
# ✔ verify
aws ecs describe-clusters --clusters "$CLUSTER" --query 'clusters[0].status' --output text   # ACTIVE
```
> → Live State: CLUSTER_ARN
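
The Live State check table expects "Fargate providers attached", which the status query alone does not
show. A minimal sketch of that second check, assuming a hypothetical function name and assuming that
`describe-clusters` returns `capacityProviders` in its default output.

```bash
# Hypothetical check: succeed only when both FARGATE and FARGATE_SPOT are attached.
# Assumes describe-clusters returns capacityProviders in its default output.
cluster_has_fargate_providers() {
  local cluster="$1" providers=""
  providers="$(aws ecs describe-clusters --clusters "$cluster" \
    --query 'clusters[0].capacityProviders' --output text 2>/dev/null)" || return 1
  # --output text separates list items with tabs; pad with spaces for whole-word matching.
  providers=" $(printf '%s' "$providers" | tr '\t\n' '  ') "
  case "$providers" in *" FARGATE "*) ;; *) return 1 ;; esac
  case "$providers" in *" FARGATE_SPOT "*) ;; *) return 1 ;; esac
}

# Usage:
#   cluster_has_fargate_providers "$CLUSTER" || echo "capacity providers missing" >&2
```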

## 4. ALB  🟢⏳

```bash
ALB_ARN="$(aws elbv2 create-load-balancer --name "$ALB_NAME" \
  --type application --scheme internet-facing \
  --subnets "$SUBNET_1A" "$SUBNET_1B" --security-groups "$ALB_SG_ID" \
  --tags Key=Environment,Value="$ENV" Key=Application,Value=fargate-alb \
  --query 'LoadBalancers[0].LoadBalancerArn' --output text)"
aws elbv2 wait load-balancer-available --load-balancer-arns "$ALB_ARN"   # ⏳ ~2-4 min
ALB_DNS="$(aws elbv2 describe-load-balancers --load-balancer-arns "$ALB_ARN" \
  --query 'LoadBalancers[0].DNSName' --output text)"
```
```bash
# ✔ verify
aws elbv2 describe-load-balancers --load-balancer-arns "$ALB_ARN" \
  --query 'LoadBalancers[0].State.Code' --output text   # active
```
> → Live State: ALB_ARN, ALB_DNS
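
The Director prompt asks for the waiter to be backgrounded. One way to do that in plain bash is
sketched below; the function names and the `ALB_WAITER_PID` variable are assumptions for
illustration, and the waiter itself is the same `load-balancer-available` call used above.

```bash
# Hypothetical helpers: start the ALB waiter in the background and collect its result later.
# ALB_WAITER_PID is an assumed variable name.
start_alb_waiter() {
  aws elbv2 wait load-balancer-available --load-balancer-arns "$1" &
  ALB_WAITER_PID=$!
}

finish_alb_waiter() {
  # `wait` returns the waiter's exit status: 0 once the ALB is available, non-zero otherwise.
  wait "$ALB_WAITER_PID"
}

# Usage:
#   start_alb_waiter "$ALB_ARN"
#   ... do independent work while the ALB provisions ...
#   finish_alb_waiter || { echo "ALB never became available" >&2; exit 1; }
```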

## 5. Listeners — :443 prod + :9001 test  🟢🟡

The default action is a `fixed-response 503` because no service is attached yet; the service plan (or
CodeDeploy) later swaps it to the blue/green target groups.

```bash
# :443 prod listener (HTTPS, ACM cert, TLS13 policy)
LISTENER_443="$(aws elbv2 create-listener --load-balancer-arn "$ALB_ARN" \
  --protocol HTTPS --port 443 --ssl-policy "$SSL_POLICY" \
  --certificates CertificateArn="$CERT_ARN" \
  --default-actions 'Type=fixed-response,FixedResponseConfig={StatusCode=503,ContentType=text/plain,MessageBody=no-service-attached}' \
  --query 'Listeners[0].ListenerArn' --output text)"

# :9001 test listener (blue/green validation; HTTPS, same cert)
LISTENER_9001="$(aws elbv2 create-listener --load-balancer-arn "$ALB_ARN" \
  --protocol HTTPS --port "$TEST_LISTENER_PORT" --ssl-policy "$SSL_POLICY" \
  --certificates CertificateArn="$CERT_ARN" \
  --default-actions 'Type=fixed-response,FixedResponseConfig={StatusCode=503,ContentType=text/plain,MessageBody=no-green-yet}' \
  --query 'Listeners[0].ListenerArn' --output text)"
```
```bash
# ✔ verify — :443 present, :80 absent
aws elbv2 describe-listeners --load-balancer-arn "$ALB_ARN" \
  --query 'Listeners[].Port' --output text   # 443 and 9001 present; 80 ABSENT (enable_http=false)
# ✔ cert assertion: the attached cert must be ISSUED and NOT expired (dry-run found EXPIRED)
aws acm describe-certificate --certificate-arn "$CERT_ARN" \
  --query 'Certificate.[Status,NotAfter]' --output text   # Status MUST be ISSUED and NotAfter in the future; if EXPIRED, stop and run network.aws.md §3
```
> → Live State: LISTENER_443, LISTENER_9001  ·  **Provides** now satisfied for consumers
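
The listener check above is read by eye. A scripted version of the same assertion, including the
"port 80 must be absent" half, might look like the sketch below; `assert_listener_ports` is a
hypothetical name and the port lists are passed in by the caller.

```bash
# Hypothetical assertion: every required port has a listener and every forbidden port has none.
assert_listener_ports() {
  local alb_arn="$1" required="$2" forbidden="$3" ports="" p
  ports="$(aws elbv2 describe-listeners --load-balancer-arn "$alb_arn" \
    --query 'Listeners[].Port' --output text)" || return 1
  # Normalise tab/newline separators to spaces so each port matches as a whole word.
  ports=" $(printf '%s' "$ports" | tr '\t\n' '  ') "
  for p in $required; do
    case "$ports" in *" $p "*) ;; *) echo "listener :$p missing" >&2; return 1 ;; esac
  done
  for p in $forbidden; do
    case "$ports" in *" $p "*) echo "listener :$p must not exist" >&2; return 1 ;; esac
  done
}

# Usage with the default HTTPS-only setting:
#   assert_listener_ports "$ALB_ARN" "443 ${TEST_LISTENER_PORT}" "80"
```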

## Update (idempotent reconcile)  🟡

- Cert rotation → `aws elbv2 modify-listener --listener-arn "$LISTENER_443" --certificates CertificateArn=<new>`.
- Add the :80→:443 redirect (if `ENABLE_HTTP=redirect-to-443`) → `create-listener --protocol HTTP --port 80
  --default-actions 'Type=redirect,RedirectConfig={Protocol=HTTPS,Port=443,StatusCode=HTTP_301}'`.
- SG rule drift → re-run §2's `authorize-security-group-ingress`; a rule that already exists fails with
  `InvalidPermission.Duplicate`, which is benign here, so the re-run is effectively idempotent.
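
For the SG reconcile, a wrapper can swallow only the duplicate-rule error and surface everything
else. A minimal sketch, assuming a hypothetical function name `authorize_ingress_once` and the same
`0.0.0.0/0` source used in §2.

```bash
# Hypothetical wrapper: open one TCP port to 0.0.0.0/0 and treat an already-present rule as success.
authorize_ingress_once() {
  local sg_id="$1" port="$2" err=""
  # Capture stderr only; the successful JSON response on stdout is discarded.
  if err="$(aws ec2 authorize-security-group-ingress --group-id "$sg_id" \
      --ip-permissions "IpProtocol=tcp,FromPort=${port},ToPort=${port},IpRanges=[{CidrIp=0.0.0.0/0}]" \
      2>&1 >/dev/null)"; then
    return 0
  fi
  case "$err" in
    *InvalidPermission.Duplicate*) return 0 ;;   # rule already exists: nothing to do
    *) printf '%s\n' "$err" >&2; return 1 ;;
  esac
}

# Usage:
#   for PORT in 80 443; do authorize_ingress_once "$ALB_SG_ID" "$PORT" || exit 1; done
```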

## Teardown (observe-first, resumable)  💥

> **Precondition:** every `ecs-service.aws.md` consumer must be torn down first. A service SG that
> references `ALB_SG_ID` blocks the security-group delete, and a service still present in the cluster
> blocks `delete-cluster`. Observe, then act, then re-observe.

```bash
# 1. listeners (explicit, so each delete is observable; deleting the ALB would also remove them)  💥
for L in $LISTENER_9001 $LISTENER_443; do aws elbv2 delete-listener --listener-arn "$L"; done
# 2. ALB  💥⏳ (deletes asynchronously)
aws elbv2 delete-load-balancer --load-balancer-arn "$ALB_ARN"
aws elbv2 wait load-balancers-deleted --load-balancer-arns "$ALB_ARN"
# 3. ALB security group (only deletes once no service SG references it)  💥
aws ec2 delete-security-group --group-id "$ALB_SG_ID"
# 4. ECS cluster  💥
aws ecs delete-cluster --cluster "$CLUSTER"
```
```bash
# ✔ teardown verify
aws elbv2 describe-load-balancers --names "$ALB_NAME" 2>&1 | grep -q 'LoadBalancerNotFound' && echo alb-gone
aws ecs describe-clusters --clusters "$CLUSTER" --query 'clusters[0].status' --output text   # INACTIVE or missing
```
> → Live State: set `status: gone`, clear realized ids.
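
The security-group delete is the step most likely to need a second attempt, since it fails with
`DependencyViolation` while anything still references the group (and possibly for a short time after
the ALB is deleted, while its network interfaces are released). A minimal resumable sketch; the
function name, attempt count, and delay are assumptions, not values from this plan.

```bash
# Hypothetical retry wrapper: delete the SG, retrying only while EC2 reports DependencyViolation.
# Attempt count and delay are illustrative defaults, not values from this plan.
delete_sg_with_retry() {
  local sg_id="$1" attempts="${2:-10}" delay="${3:-15}" err="" i=1
  while [ "$i" -le "$attempts" ]; do
    if err="$(aws ec2 delete-security-group --group-id "$sg_id" 2>&1 >/dev/null)"; then
      return 0
    fi
    case "$err" in
      *DependencyViolation*) sleep "$delay" ;;   # still referenced; observe again later
      *InvalidGroup.NotFound*) return 0 ;;       # already gone: a resumed teardown
      *) printf '%s\n' "$err" >&2; return 1 ;;
    esac
    i=$((i + 1))
  done
  echo "security group ${sg_id} still has dependents after ${attempts} attempts" >&2
  return 1
}

# Usage:
#   delete_sg_with_retry "$ALB_SG_ID" || echo "inspect remaining references to ${ALB_SG_ID}" >&2
```

Treating a missing group as success is what makes a re-run after a partial teardown safe.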

## Deliberately not included

- **The VPC, subnets, and ACM certificate** — assumed pre-existing and read from SSM (a named
  **Requires**). This plan is the ECS platform, not the network.
- **An HTTP :80 listener** — `enable_http=false`; HTTPS-only by default. Add via Update if desired.
- **Per-service target groups, task defs, services, CodeDeploy, DNS** — those are the *service*
  lifecycle and live in `ecs-service.aws.md`.
- **WAF, access logs, ALB-level authn** — out of scope for the demo; add as a follow-on plan.
