# Ephemera — Task runner (async pipeline) on AWS (API Gateway + Lambda + SQS + DynamoDB)

> Self-executing Markdown. The **AWS binding** of the *task-runner* intent — an async task API (6
> components) as one auditable plan. Same intent, different binding: [`task-runner.cloudflare.md`](./task-runner.cloudflare.md).
> The cloud is the source of truth for state; this file is intent + write-back ledger.

---

## 🤖 Director prompt

Observe before acting; verify each step; **stop at 🔴/💥 for human go**; write realized ARNs/IDs back
into Live State each step; retry through IAM eventual-consistency; teardown observes-first and is
resumable. Verbs: `verify` (read-only), `apply` (create, gated), `teardown` (destroy, gated).

```
Legend  🟢 create · 🟡 config · 🔴 GATE (human go) · 💥 destructive (human go) · ⏳ wait · ✔ verify
```

## Intent

A snappy async task API: `POST /task` creates a task (writes DynamoDB `pending`, enqueues SQS, returns
`taskId`); an SQS-triggered worker Lambda processes it (`processing` → `completed`); `GET /task/{taskId}`
returns status. Keeps the API fast while work runs out-of-band. Acceptance contract: POST returns
`.taskId`, GET returns `.status` ∈ {pending, processing, completed}.
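
The acceptance contract can be written down as two small checks. This is a minimal sketch, assuming the responses are already parsed JSON; the function names are hypothetical and not part of the plan.

```python
ALLOWED_STATUSES = {"pending", "processing", "completed"}


def check_post_response(body: dict) -> str:
    """Return the taskId if the POST body satisfies the contract, else raise."""
    task_id = body.get("taskId")
    if not isinstance(task_id, str) or not task_id:
        raise ValueError("POST response must carry a non-empty .taskId")
    return task_id


def check_get_response(body: dict) -> str:
    """Return the status if it is one of the three allowed values, else raise."""
    status = body.get("status")
    if not isinstance(status, str) or status not in ALLOWED_STATUSES:
        raise ValueError(f"unexpected .status: {status!r}")
    return status
```

Anything outside the three statuses is treated as a contract violation rather than silently polled on.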

## Design decisions (each named, so it is an explicit choice)

1. **Dead-letter queue + redrive** (`maxReceiveCount=5`) — without a DLQ a poison message retries until
   retention expires; a task runner without one is a latent incident.
2. **API-GW→Lambda auth via resource policy** — API Gateway invokes Lambda through the Lambda's
   **resource policy** (`lambda add-permission`, scoped to the API's source-ARN), not a
   `lambda:InvokeFunction` grant in the execution role.
3. **Runtime python3.12** — a current, supported runtime.
4. **Self-contained handler code** — the handlers are inline (built in §3), so the plan is fully
   reproducible from itself.

## Tested: "do we even need this IAM role?" (no shortcut — documented live)

Two facts, both **verified against AWS** in this session:

1. **Every Lambda requires an execution role** — AWS provides no default. "No role" is impossible.
2. **Reusing an existing role only works if it is broad.** We reused `platform-dev-tasks-consumer`
   (a real least-privilege role already in the account):
   - `create-function` **succeeded** — the role is assumable, so reuse *does* bypass IAM-create.
   - `create-event-source-mapping` **failed**: *"role does not have permissions to call ReceiveMessage
     on SQS"* (AWS pre-validates against our queue).
   - Direct invoke **failed**: `AccessDeniedException … dynamodb:PutItem on TaskTable-dev … no
     identity-based policy allows` — runtime denial on our table.

   A least-privilege role is scoped to *its own* resources; ours are different. A functional pipeline
   therefore needs a role with permissions on *these* resources → creating/modifying IAM is
   unavoidable. The §4 role is already minimal (3 scoped statements) — **not** over-provisioned.
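
Why the reused role failed can be illustrated with a deliberately simplified policy check. This is a sketch, not IAM's real evaluation logic: it ignores `Deny`, conditions, `NotAction`, and case-insensitive action matching. The table name `OtherTable` and the account ID in the usage example are hypothetical.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    """IAM allows a single string or a list for Action/Resource; normalize to a list."""
    return value if isinstance(value, list) else [value]


def allows(policy: dict, action: str, resource: str) -> bool:
    """True if any Allow statement matches both the action and the resource ARN."""
    for stmt in _as_list(policy.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        action_ok = any(fnmatchcase(action, a) for a in _as_list(stmt.get("Action", [])))
        resource_ok = any(fnmatchcase(resource, r) for r in _as_list(stmt.get("Resource", [])))
        if action_ok and resource_ok:
            return True
    return False


# Hypothetical least-privilege policy scoped to a different table.
other_role_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["dynamodb:PutItem"],
        "Resource": "arn:aws:dynamodb:us-west-2:123456789012:table/OtherTable",
    }],
}
print(allows(other_role_policy, "dynamodb:PutItem",
             "arn:aws:dynamodb:us-west-2:123456789012:table/TaskTable-dev"))
```

The action is granted, but only on the other role's own table, which is the same shape as the `PutItem` denial observed above.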

## Gotcha that cost the most time: credential brokers can't call IAM

The IAM step failed for a while with `InvalidClientTokenId` — and *only* IAM, while DynamoDB / SQS /
Lambda / API Gateway and `sts get-caller-identity` all worked. Root cause: the AWS creds came from a
`credential_process` that wraps keys in a **GetSessionToken session**, and AWS forbids GetSessionToken
credentials from calling IAM unless MFA is in the request. Fix: vend the **long-term** keys for the IAM
step only — `aws-vault exec <profile> --no-session -- <iam cmd>` — then run §5–§7 on the normal session
creds. (Generalized for any broker/SSO/MFA setup in the project `README.md`.)
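
The "long-term keys for IAM only" rule can be sketched as a tiny command wrapper. This assumes commands are passed as argv lists and that `aws-vault` is the broker in use; the function name and the profile name `dev` are hypothetical.

```python
def wrap_for_broker(aws_argv: list[str], profile: str) -> list[str]:
    """Prefix IAM calls with `aws-vault exec <profile> --no-session --`.

    Every other service is returned unchanged so it keeps running on the
    normal session credentials.
    """
    is_iam = len(aws_argv) >= 2 and aws_argv[0] == "aws" and aws_argv[1] == "iam"
    if is_iam:
        return ["aws-vault", "exec", profile, "--no-session", "--", *aws_argv]
    return list(aws_argv)


print(wrap_for_broker(["aws", "iam", "get-role", "--role-name", "sqs-lambda-role-dev"], "dev"))
print(wrap_for_broker(["aws", "sqs", "list-queues"], "dev"))
```

Keeping the wrap narrow means the long-term keys are only exposed for the one step that needs them.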

## Live State

```yaml
status:        not-created      # published template: run it to realize state (IDs and check rows below are from the last run)
last_action:   teardown — role deleted via aws-vault --no-session; everything else on normal creds
last_verified: teardown verify — 0 APIs, both Lambdas gone, role gone, table deleting
```

| key            | value |
|----------------|-------|
| AWS_REGION     | `us-west-2` |
| ACCOUNT_ID     | `<AWS_ACCOUNT_ID>` |
| TABLE_ARN      | `<ARN>` |
| QUEUE_URL      | `https://sqs.us-west-2.amazonaws.com/<AWS_ACCOUNT_ID>/sqs-lambda-demo-queue` |
| QUEUE_ARN      | `<ARN>` |
| DLQ_ARN        | `<ARN>` |
| ROLE_ARN       | `<ARN>` |
| TASK_FN_ARN    | `<ARN>` |
| RUNNER_FN_ARN  | `<ARN>` |
| ESM_UUID       | `e833675d-90de-44d1-9203-fea3da2321af` |
| API_ID         | `3c9z02d79e` |
| API_ENDPOINT   | `https://3c9z02d79e.execute-api.us-west-2.amazonaws.com/dev` |

| ✔ check                         | expected                              | observed | result |
|---------------------------------|---------------------------------------|----------|--------|
| dynamodb table                  | `ACTIVE`, TTL on `expiration_time`    | ACTIVE         | ✅ |
| sqs main queue + redrive        | redrive → DLQ, maxReceiveCount 5      | wired          | ✅ |
| role assumable by lambda        | trust = lambda.amazonaws.com          | lambda svc     | ✅ |
| both lambdas                    | `Active` state                        | Active/Active  | ✅ |
| event-source mapping            | `Enabled`, queue→runner               | Enabled        | ✅ |
| `POST /task` → taskId           | `200` + `.taskId`                     | 200 + taskId   | ✅ |
| poll `GET /task/{id}` → status  | reaches `completed`                   | pending→completed | ✅ |
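
The write-back into the key/value table can be sketched as a single-row replacement. This assumes each value cell is one backticked token, as in the table above; the function name is hypothetical.

```python
import re


def write_back(markdown: str, key: str, value: str) -> str:
    """Replace the backticked value in the `| key | value |` row for `key`."""
    pattern = re.compile(
        rf"^(\|[ \t]*{re.escape(key)}[ \t]*\|[ \t]*)`[^`\n]*`([ \t]*\|)[ \t]*$",
        re.MULTILINE,
    )
    # A function replacement avoids backslash surprises if the value contains any.
    updated, count = pattern.subn(lambda m: f"{m.group(1)}`{value}`{m.group(2)}", markdown)
    if count != 1:
        raise KeyError(f"expected exactly one Live State row for {key}, found {count}")
    return updated
```

Failing loudly when the row is missing or duplicated keeps the ledger from drifting silently.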

## 0. Variables

```bash
export AWS_REGION="us-west-2" ACCOUNT_ID="<AWS_ACCOUNT_ID>" ENV="dev"
export QUEUE="sqs-lambda-demo-queue" DLQ="sqs-lambda-demo-dlq"
export TABLE="TaskTable-${ENV}"
export ROLE="sqs-lambda-role-${ENV}"
export TASK_FN="task-lambda-${ENV}" RUNNER_FN="task-runner-lambda-${ENV}"
export API_NAME="sqs-lambda-demo-api" STAGE="${ENV}" RUNTIME="python3.12"
export BUILD="$(mktemp -d)"   # artifacts build dir
```

## Dependency frontier

```
dynamodb ─┐
sqs+dlq ──┼─> role (policies need queue/dlq/table ARNs) ──┐
          │                                                ├─> lambda_task (role+queue_url+table)
          │                                                ├─> lambda_runner (role) ─> event-source-mapping (queue_arn)
          │                                                │
lambda_task ───────────────────────────────────────────> api_gw (POST/GET → task invoke ARN)
                                                              └─> lambda add-permission (needs API ARN) ─> 🔴 deploy stage ─> ✔ acceptance
```
Non-negotiable edges: **role policies need the queue/DLQ/table ARNs** (storage first); **lambdas need
the role ARN** (and must tolerate IAM propagation lag); **ESM needs the queue ARN + runner**; **API
integration needs the task-lambda invoke ARN**; **add-permission needs the API ARN**.
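
The frontier can also be read as a dependency graph and ordered mechanically. This is a sketch using the standard-library `graphlib`; the node names are transcribed from the diagram, and the edge from the event-source mapping to acceptance is an assumption (the poll only reaches `completed` if the runner is wired).

```python
from graphlib import TopologicalSorter

# node -> prerequisites, transcribed from the frontier diagram
DEPENDS_ON = {
    "dynamodb": set(),
    "sqs+dlq": set(),
    "role": {"dynamodb", "sqs+dlq"},
    "lambda_task": {"role"},
    "lambda_runner": {"role"},
    "event-source-mapping": {"lambda_runner", "sqs+dlq"},
    "api_gw": {"lambda_task"},
    "add-permission": {"api_gw"},
    "deploy-stage": {"add-permission"},
    # Assumption: acceptance also needs the mapping, not just the deployed stage.
    "acceptance": {"deploy-stage", "event-source-mapping"},
}


def apply_order() -> list[str]:
    """One valid creation order: every node appears after its prerequisites."""
    return list(TopologicalSorter(DEPENDS_ON).static_order())


def teardown_order() -> list[str]:
    """Teardown is the reverse of the frontier."""
    return list(reversed(apply_order()))


print(apply_order())
```

Several orders are valid (storage steps are independent of each other); the sorter just picks one that respects every edge.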

## 1. DynamoDB status table  🟢

```bash
aws dynamodb create-table --table-name "$TABLE" \
  --attribute-definitions AttributeName=taskId,AttributeType=S \
  --key-schema AttributeName=taskId,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST \
  --tags Key=Environment,Value="$ENV" Key=Team,Value=lambda-tasks >/dev/null
aws dynamodb wait table-exists --table-name "$TABLE"
aws dynamodb update-time-to-live --table-name "$TABLE" \
  --time-to-live-specification "Enabled=true,AttributeName=expiration_time" >/dev/null
TABLE_ARN="$(aws dynamodb describe-table --table-name "$TABLE" --query 'Table.TableArn' --output text)"
```
```bash
# ✔ verify
aws dynamodb describe-table --table-name "$TABLE" --query 'Table.TableStatus' --output text   # ACTIVE
```
> → Live State: TABLE_ARN

## 2. SQS queue + DLQ  🟢

```bash
# DLQ first (redrive target)
DLQ_URL="$(aws sqs create-queue --queue-name "$DLQ" \
  --attributes MessageRetentionPeriod=1209600 --query QueueUrl --output text)"
DLQ_ARN="$(aws sqs get-queue-attributes --queue-url "$DLQ_URL" \
  --attribute-names QueueArn --query 'Attributes.QueueArn' --output text)"

# main queue with redrive → DLQ (attributes via file to avoid escaping hell)
cat > "$BUILD/sqs-attrs.json" <<JSON
{ "VisibilityTimeout": "60",
  "MessageRetentionPeriod": "86400",
  "RedrivePolicy": "{\"deadLetterTargetArn\":\"$DLQ_ARN\",\"maxReceiveCount\":\"5\"}" }
JSON
QUEUE_URL="$(aws sqs create-queue --queue-name "$QUEUE" \
  --attributes file://"$BUILD/sqs-attrs.json" --query QueueUrl --output text)"
QUEUE_ARN="$(aws sqs get-queue-attributes --queue-url "$QUEUE_URL" \
  --attribute-names QueueArn --query 'Attributes.QueueArn' --output text)"
```
```bash
# ✔ verify redrive is wired
aws sqs get-queue-attributes --queue-url "$QUEUE_URL" --attribute-names RedrivePolicy \
  --query 'Attributes.RedrivePolicy' --output text   # contains the DLQ arn + maxReceiveCount 5
```
> → Live State: QUEUE_URL, QUEUE_ARN, DLQ_ARN
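
The "escaping hell" comes from `RedrivePolicy` being a JSON string nested inside the attributes JSON. A minimal sketch of generating the same attributes file content programmatically; the function name is hypothetical and the values mirror the heredoc above.

```python
import json


def queue_attributes(dlq_arn: str, max_receive_count: int = 5) -> dict:
    """Build the main-queue attributes; RedrivePolicy is itself a JSON-encoded string."""
    redrive = {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": str(max_receive_count)}
    return {
        "VisibilityTimeout": "60",
        "MessageRetentionPeriod": "86400",
        "RedrivePolicy": json.dumps(redrive),  # inner JSON, serialized once here
    }


def attributes_file_content(dlq_arn: str) -> str:
    """Serialize the whole map; json.dumps escapes the inner quotes for us."""
    return json.dumps(queue_attributes(dlq_arn), indent=2)
```

Serializing twice with a JSON library removes the hand-written `\"` escapes entirely.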

## 3. Build the Lambda artifacts (self-contained)  🟢

```bash
cat > "$BUILD/lambda_task.py" <<'PY'
import json, os, time, uuid, boto3
ddb = boto3.client("dynamodb"); sqs = boto3.client("sqs")
TABLE = os.environ["DYNAMODB_TABLE_NAME"]; QUEUE_URL = os.environ["SQS_QUEUE_URL"]
def _resp(code, body): return {"statusCode": code, "headers": {"Content-Type": "application/json"}, "body": json.dumps(body)}
def handler(event, context):
    method = event.get("httpMethod")
    if method == "POST":
        task_id = str(uuid.uuid4()); body = event.get("body") or "{}"
        ddb.put_item(TableName=TABLE, Item={
            "taskId": {"S": task_id}, "status": {"S": "pending"}, "payload": {"S": body},
            "expiration_time": {"N": str(int(time.time()) + 86400)}})
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps({"taskId": task_id, "payload": body}))
        return _resp(200, {"taskId": task_id, "status": "pending"})
    if method == "GET":
        task_id = (event.get("pathParameters") or {}).get("taskId")
        if not task_id: return _resp(400, {"error": "taskId required"})
        item = ddb.get_item(TableName=TABLE, Key={"taskId": {"S": task_id}}).get("Item")
        if not item: return _resp(404, {"error": "not found"})
        return _resp(200, {"taskId": task_id, "status": item["status"]["S"]})
    return _resp(405, {"error": "method not allowed"})
PY

cat > "$BUILD/lambda_task_runner.py" <<'PY'
import json, os, time, boto3
ddb = boto3.client("dynamodb"); TABLE = os.environ["DYNAMODB_TABLE_NAME"]
def _set(task_id, status):
    ddb.update_item(TableName=TABLE, Key={"taskId": {"S": task_id}},
        UpdateExpression="SET #s = :s", ExpressionAttributeNames={"#s": "status"},
        ExpressionAttributeValues={":s": {"S": status}})  # 'status' is a DDB reserved word → #s
def handler(event, context):
    for rec in event.get("Records", []):
        task_id = json.loads(rec["body"])["taskId"]
        _set(task_id, "processing"); time.sleep(1); _set(task_id, "completed")
    return {"batchItemFailures": []}
PY

( cd "$BUILD" && zip -q lambda_task.zip lambda_task.py && zip -q lambda_task_runner.zip lambda_task_runner.py )
```
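
The runner's `batchItemFailures` return value is the partial-batch response shape. A sketch of a handler core that actually populates it, with the per-task work injected as a hypothetical `process` callable. Note that Lambda honors this return value only when the event-source mapping is configured with the `ReportBatchItemFailures` function response type; without it, catching the exception would make a failed record look successful.

```python
import json


def handle_batch(event: dict, process) -> dict:
    """Run process(task_id) per SQS record and report the failed messageIds."""
    failures = []
    for rec in event.get("Records", []):
        try:
            process(json.loads(rec["body"])["taskId"])
        except Exception:
            # Only this message is retried; the rest of the batch is deleted.
            failures.append({"itemIdentifier": rec["messageId"]})
    return {"batchItemFailures": failures}
```

With this shape, one poison message heads toward the DLQ on its own instead of dragging the whole batch through redelivery.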

## 4. IAM role (one shared role, least-privilege policy)  🟡 ⏳

```bash
cat > "$BUILD/trust.json" <<'JSON'
{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"lambda.amazonaws.com"},"Action":"sts:AssumeRole"}]}
JSON
aws iam create-role --role-name "$ROLE" --assume-role-policy-document file://"$BUILD/trust.json" \
  --tags Key=Environment,Value="$ENV" >/dev/null
aws iam attach-role-policy --role-name "$ROLE" \
  --policy-arn "<ARN>"   # redacted placeholder: substitute the managed policy ARN

cat > "$BUILD/policy.json" <<JSON
{ "Version":"2012-10-17","Statement":[
  {"Sid":"SqsSend","Effect":"Allow","Action":["sqs:SendMessage"],"Resource":["$QUEUE_ARN","$DLQ_ARN"]},
  {"Sid":"SqsConsume","Effect":"Allow","Action":["sqs:ReceiveMessage","sqs:DeleteMessage","sqs:GetQueueAttributes"],"Resource":"$QUEUE_ARN"},
  {"Sid":"Ddb","Effect":"Allow","Action":["dynamodb:GetItem","dynamodb:PutItem","dynamodb:UpdateItem","dynamodb:Query"],"Resource":["$TABLE_ARN","$TABLE_ARN/index/*"]} ]}
JSON
aws iam put-role-policy --role-name "$ROLE" --policy-name sqs-lambda-demo --policy-document file://"$BUILD/policy.json"
ROLE_ARN="$(aws iam get-role --role-name "$ROLE" --query 'Role.Arn' --output text)"
```
```bash
# ✔ verify trust
aws iam get-role --role-name "$ROLE" \
  --query 'Role.AssumeRolePolicyDocument.Statement[0].Principal.Service' --output text  # lambda.amazonaws.com
```
> → Live State: ROLE_ARN.  ⏳ IAM is eventually-consistent — §5 retries create-function.

## 5. Lambda functions + event-source mapping  🟢 ⏳

```bash
# task lambda (API handler). Retry through IAM propagation ("role cannot be assumed").
for i in 1 2 3 4 5 6; do
  aws lambda create-function --function-name "$TASK_FN" --runtime "$RUNTIME" --role "$ROLE_ARN" \
    --handler lambda_task.handler --zip-file fileb://"$BUILD/lambda_task.zip" \
    --timeout 3 --memory-size 128 \
    --environment "Variables={SQS_QUEUE_URL=$QUEUE_URL,DYNAMODB_TABLE_NAME=$TABLE}" >/dev/null \
    && break || { echo "role not ready, retry $i"; sleep 5; }
done
TASK_FN_ARN="$(aws lambda get-function --function-name "$TASK_FN" --query 'Configuration.FunctionArn' --output text)"

# runner lambda (SQS consumer)
aws lambda create-function --function-name "$RUNNER_FN" --runtime "$RUNTIME" --role "$ROLE_ARN" \
  --handler lambda_task_runner.handler --zip-file fileb://"$BUILD/lambda_task_runner.zip" \
  --timeout 60 --memory-size 128 \
  --environment "Variables={SQS_QUEUE_URL=$QUEUE_URL,DYNAMODB_TABLE_NAME=$TABLE}" >/dev/null
RUNNER_FN_ARN="$(aws lambda get-function --function-name "$RUNNER_FN" --query 'Configuration.FunctionArn' --output text)"
aws lambda wait function-active --function-name "$RUNNER_FN"

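# event-source mapping: SQS → runner (ReportBatchItemFailures makes Lambda honor the handler's batchItemFailures)
ESM_UUID="$(aws lambda create-event-source-mapping --function-name "$RUNNER_FN" \
  --event-source-arn "$QUEUE_ARN" --batch-size 10 \
  --function-response-types ReportBatchItemFailures --query UUID --output text)"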
```
```bash
# ✔ verify
aws lambda get-function --function-name "$TASK_FN" --query 'Configuration.State' --output text   # Active
aws lambda get-event-source-mapping --uuid "$ESM_UUID" --query 'State' --output text             # Enabled/Creating
```
> → Live State: TASK_FN_ARN, RUNNER_FN_ARN, ESM_UUID
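
The bash retry loop above (6 attempts, 5 s apart) follows a generic pattern that can be sketched as a helper. The helper name and parameters are hypothetical; a real version should probably retry only the "role cannot be assumed" error rather than every exception.

```python
import time


def retry(call, attempts: int = 6, delay: float = 5.0, sleep=time.sleep):
    """Call `call()` until it succeeds or `attempts` is exhausted, then re-raise."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except Exception as exc:  # assumption: any failure is treated as retryable
            last_error = exc
            if attempt < attempts:
                sleep(delay)
    raise last_error
```

Unlike the shell loop, this version surfaces the last error when every attempt fails, so a later step cannot run against a function that was never created.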

## 6. API Gateway (REST, AWS_PROXY)  🔴 GATE

> 🔴 **STOP — human go.** `create-deployment` publishes a **public** internet endpoint.

```bash
API_ID="$(aws apigateway create-rest-api --name "$API_NAME" \
  --description "API Gateway for sqs-lambda-demo" --query id --output text)"
ROOT_ID="$(aws apigateway get-resources --rest-api-id "$API_ID" --query 'items[?path==`/`].id' --output text)"
TASK_RES="$(aws apigateway create-resource --rest-api-id "$API_ID" --parent-id "$ROOT_ID" --path-part task --query id --output text)"
TASKID_RES="$(aws apigateway create-resource --rest-api-id "$API_ID" --parent-id "$TASK_RES" --path-part '{taskId}' --query id --output text)"
INTEG_URI="<ARN>$AWS_REGION:lambda:path/2015-03-31/functions/$TASK_FN_ARN/invocations"

# POST /task
aws apigateway put-method --rest-api-id "$API_ID" --resource-id "$TASK_RES" --http-method POST --authorization-type NONE >/dev/null
aws apigateway put-integration --rest-api-id "$API_ID" --resource-id "$TASK_RES" --http-method POST \
  --type AWS_PROXY --integration-http-method POST --uri "$INTEG_URI" >/dev/null
# GET /task/{taskId}
aws apigateway put-method --rest-api-id "$API_ID" --resource-id "$TASKID_RES" --http-method GET \
  --authorization-type NONE --request-parameters method.request.path.taskId=true >/dev/null
aws apigateway put-integration --rest-api-id "$API_ID" --resource-id "$TASKID_RES" --http-method GET \
  --type AWS_PROXY --integration-http-method POST --uri "$INTEG_URI" >/dev/null

# resource-policy permission (the CORRECT api-gw→lambda mechanism)
aws lambda add-permission --function-name "$TASK_FN" --statement-id apigw-invoke \
  --action lambda:InvokeFunction --principal apigateway.amazonaws.com \
  --source-arn "<ARN>$AWS_REGION:$ACCOUNT_ID:$API_ID/*/*" >/dev/null

# 🔴 deploy stage
aws apigateway create-deployment --rest-api-id "$API_ID" --stage-name "$STAGE" >/dev/null
API_ENDPOINT="https://$API_ID.execute-api.$AWS_REGION.amazonaws.com/$STAGE"
echo "API_ENDPOINT=$API_ENDPOINT"
```
> → Live State: API_ID, API_ENDPOINT; status: live

## 7. Acceptance verify  ✔

```bash
TASK_ID="$(curl -s -X POST -H 'Content-Type: application/json' \
  -d '{"key1":"value1","key2":"value2"}' "$API_ENDPOINT/task" | jq -r .taskId)"
echo "taskId=$TASK_ID"
STATUS="pending"
for i in $(seq 1 30); do
  STATUS="$(curl -s "$API_ENDPOINT/task/$TASK_ID" | jq -r .status)"
  echo "  poll $i: $STATUS"; [ "$STATUS" = "completed" ] && break; sleep 2
done
[ "$STATUS" = "completed" ] && echo "✅ pipeline works end-to-end" || echo "❌ stuck at $STATUS"
```
> → Live State: fill verify rows; last_verified.
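
The same acceptance loop can be sketched without `curl` and `jq`, using only the standard library. `http_json` and `accept` are hypothetical names; the payload, poll count, and interval mirror the bash above, and the fetch function is injectable so the loop can be exercised offline.

```python
import json
import time
from urllib.request import Request, urlopen


def http_json(url, payload=None):
    """POST `payload` as JSON when given, otherwise GET; return the parsed body."""
    data = None if payload is None else json.dumps(payload).encode()
    req = Request(url, data=data, headers={"Content-Type": "application/json"},
                  method="GET" if payload is None else "POST")
    with urlopen(req, timeout=10) as resp:
        return json.load(resp)


def accept(endpoint, fetch=http_json, polls=30, interval=2.0, sleep=time.sleep):
    """Create a task, then poll until `completed` or the poll budget runs out."""
    task_id = fetch(f"{endpoint}/task", {"key1": "value1", "key2": "value2"})["taskId"]
    status = "pending"
    for _ in range(polls):
        status = fetch(f"{endpoint}/task/{task_id}")["status"]
        if status == "completed":
            break
        sleep(interval)
    return task_id, status
```

A caller would treat any final status other than `completed` as a failed acceptance run.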

## Teardown  💥 (resumable, reverse of the frontier)

> 💥 Human go. Observe-first. API GW + Lambda + SQS + DDB all delete fast (no CloudFront-style waits).

```bash
# API
API_ID="$(aws apigateway get-rest-apis --query "items[?name=='$API_NAME'].id | [0]" --output text)"
[ -n "$API_ID" ] && [ "$API_ID" != "None" ] && aws apigateway delete-rest-api --rest-api-id "$API_ID" && echo "💥 API deleted"

# event-source mapping (find by function)
for U in $(aws lambda list-event-source-mappings --function-name "$RUNNER_FN" --query 'EventSourceMappings[].UUID' --output text 2>/dev/null); do
  aws lambda delete-event-source-mapping --uuid "$U" >/dev/null && echo "💥 ESM $U deleted"; done

# lambdas
for FN in "$TASK_FN" "$RUNNER_FN"; do
  aws lambda get-function --function-name "$FN" >/dev/null 2>&1 && aws lambda delete-function --function-name "$FN" && echo "💥 $FN deleted"; done

# role (delete inline + detach managed, then role)
if aws iam get-role --role-name "$ROLE" >/dev/null 2>&1; then
  aws iam delete-role-policy --role-name "$ROLE" --policy-name sqs-lambda-demo 2>/dev/null || true
  aws iam detach-role-policy --role-name "$ROLE" --policy-arn "<ARN>" 2>/dev/null || true
  aws iam delete-role --role-name "$ROLE" && echo "💥 role deleted"
fi

# queues + table
for Q in "$QUEUE" "$DLQ"; do
  U="$(aws sqs get-queue-url --queue-name "$Q" --query QueueUrl --output text 2>/dev/null)"
  [ -n "$U" ] && [ "$U" != "None" ] && aws sqs delete-queue --queue-url "$U" && echo "💥 $Q deleted"; done
aws dynamodb describe-table --table-name "$TABLE" >/dev/null 2>&1 && aws dynamodb delete-table --table-name "$TABLE" >/dev/null && echo "💥 table deleted"
```
```bash
# ✔ verify teardown — expect absent
aws apigateway get-rest-apis --query "items[?name=='$API_NAME'] | length(@)"   # 0
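aws lambda get-function --function-name "$TASK_FN" >/dev/null 2>&1 && echo "task: present" || echo "task: gone"
aws lambda get-function --function-name "$RUNNER_FN" >/dev/null 2>&1 && echo "runner: present" || echo "runner: gone"
aws iam get-role --role-name "$ROLE" >/dev/null 2>&1 && echo "role: present" || echo "role: gone"
aws dynamodb describe-table --table-name "$TABLE" >/dev/null 2>&1 && echo "table: present" || echo "table: gone"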
```
> → Live State: status: gone; reset realized IDs.

## Deliberately not included

- **HTTP API instead of REST** — would be cheaper/simpler; this plan uses REST for its method/path
  contract. Swap-in candidate.
- **Per-function least-privilege roles** — kept one shared role for simplicity; split for production.
- **CloudWatch alarms on DLQ depth / Lambda errors**, X-Ray tracing, API throttling/usage plans.
- **VPC, reserved concurrency, provisioned concurrency** — defaults are fine for a demo.
