Document type: Technical architecture
Audience: Engineers, technical reviewers, architecture interviewers
Status: Production
Last updated: May 2026 (post-Verifier agent integration)
Crux AI is an autonomous newsletter system that ships one issue every Thursday, written end-to-end by eight specialized Anthropic Managed Agents coordinating through a shared event log. The system aligns with the patterns described in Anthropic’s “Scaling Managed Agents: Decoupling the brain from the hands” (April 2026) — agents (brains), tools (hands), and session state are independent abstractions that can fail, retry, or be replaced without disturbing each other.
Key design choices:
| Decision | Rationale |
|---|---|
| Multi-agent specialization over single big-prompt | Each agent has one job, one quality bar, one criticality class. Easier to debug, swap, and reason about. |
| Event-sourced session log (Memory Stores JSONL) | Durable shared state. Resume-from-crash for free. Natural audit trail. |
| Per-tool modules (`tools/email.py`, etc.) | Independent failure boundaries. Each tool has its own retry/backoff. “Many hands” pattern. |
| Hallucination grounding via Verifier agent | Every URL and arXiv citation in the approved draft is checked against reality before delivery. Catches fabricated links before they reach subscribers. |
| Citation enforcement in Writer + Critic | Every factual claim must include a Markdown link. Critic rejects drafts that lack citations. Combined with the Verifier, this hardens output trust. |
| GitHub Actions as the orchestration runtime | Vercel Hobby has a 60s function cap; pipelines run 5–15 min. GitHub Actions gives 6 hours, free, no cold starts. |
| Vercel as the web layer only | Subscribe form, /latest, /admin, /unsubscribe. Auto-deploys on every push. |
| Supabase as the subscriber registry | Live source of truth. Orchestrator queries it on every run, so subscribe/unsubscribe takes effect immediately. |
| Structured JSON logs + RunTracker | Per-agent timing, token usage, retry counts. Logs queryable post-hoc. |
Operational profile:
The project’s primary goal is education through production: build a real, end-to-end system on Anthropic Managed Agents to internalize the patterns that matter (and form opinions on which patterns don’t).
A weekly AI newsletter was chosen because it exercises every interesting capability:
If we’d picked a single-agent task (e.g., “summarize this article”), there would be nothing interesting to design.
A successful design should demonstrate:
- Re-running with the same `session_id` picks up at the last completed step.

These guided every architectural choice. They are not invented; they are lifted directly from Anthropic’s Managed Agents writings and standard distributed systems practice.
This is the central insight from Anthropic’s April 2026 article:
> Managed Agents follow [the OS abstraction pattern]. We virtualized the components of an agent: a session (the append-only log of everything that happened), a harness (the loop that calls Claude and routes Claude’s tool calls to the relevant infrastructure), and a sandbox (an execution environment where Claude can run code and edit files). This allows the implementation of each to be swapped without disturbing the others.
In our system:
| Component | What it is | Where it lives |
|---|---|---|
| Brain | A Managed Agent (system prompt + model + tools) | Anthropic platform |
| Hand | A custom tool (`emit_event`, `get_events`, `send_email_smtp`) | `tools/*.py`, executed by the orchestrator |
| Session | The shared event log | Memory Stores JSONL files (/mnt/memory/session_{id}.jsonl) |
| Harness | The orchestrator loop | orchestrator_v2.py running on GitHub Actions |
Each of these can fail, be retried, or be swapped without the others noticing. If the orchestrator dies, a new one wakes up, reads the session, resumes. If a tool times out, the agent gets a tool-error result and the harness can retry. If a brain misbehaves, swap its system prompt in the Console — no code change.
Original sin in agent systems: treating the runtime as a long-lived stateful process. In a “pet” architecture, when the container dies, the conversation is lost; when an agent fails, you have to nurse it back.
We aggressively prevent this:
- `emit_event` appends; repeating it inserts a duplicate (acceptable: the orchestrator filters by event type, not count).

The tradeoff: we pay setup cost on every run. We accept this because the system runs once a week — the setup cost is invisible, the reliability gain is enormous.
When designing multi-agent systems, the temptation is to pass state through Claude’s context window — let the agent “remember” what previous agents did. This breaks under any failure: rate limits, retries, re-runs, prompt engineering changes.
Our orchestrator never relies on context. Every agent’s output is emitted as an event. Downstream agents read events, not prior conversations. This means:
This is the event sourcing pattern, applied to multi-agent coordination.
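To make the pattern concrete, here is a minimal sketch of the two session-log primitives, assuming the JSONL layout and field names shown later in this document (the real `tools/memory_store.py` adds retry and extra filtering options):

```python
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path

MEMORY_DIR = Path("/mnt/memory")  # locally this falls back to ./memory_local/

def emit_event(session_id: str, agent_name: str, event_type: str, data: dict) -> dict:
    """Append one event line to the session's JSONL log (append-only)."""
    event = {
        "id": str(uuid.uuid4()),
        "session_id": session_id,
        "agent_name": agent_name,
        "event_type": event_type,
        "data": data,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
    path = MEMORY_DIR / f"session_{session_id}.jsonl"
    with path.open("a") as f:
        f.write(json.dumps(event) + "\n")
    return event

def get_events(session_id: str, event_type: str | None = None) -> list[dict]:
    """Read the log back; downstream agents filter by event type, not count."""
    path = MEMORY_DIR / f"session_{session_id}.jsonl"
    if not path.exists():
        return []
    events = [json.loads(line) for line in path.read_text().splitlines() if line]
    if event_type:
        events = [e for e in events if e["event_type"] == event_type]
    return events
```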
The Writer/Critic loop (max 3 attempts) is the only mechanism preventing low-quality issues from being sent. We chose a synchronous critic in the loop rather than:
The Critic adds ~30% to the run cost but catches roughly half the bad drafts. This is a deliberate cost/quality trade.
Three concrete habits:
- `delivery` is critical (must succeed); `papers_researched` is optional (zero papers is acceptable).

We do not silently retry forever. We do not silently swallow errors. We do not require humans to discover failures.
┌──────────────────────────────────────┐
│ GitHub Actions Cron │
│ Schedule: 0 9 * * 4 (Thu 9am UTC) │
│ Runtime: ubuntu-latest, 30 min cap │
└──────────────┬───────────────────────┘
│ runs python3 orchestrator_v2.py
▼
┌───────────────────────────────────────────────────────────────┐
│ OrchestratorV2 (host process; stateless) │
│ ────────────────────────────────────────────────────────── │
│ • Generates / resumes shared_session_id │
│ • Drives 6 logical steps in sequence (parallel where useful) │
│ • Routes 4 custom tool types to handler modules │
│ • Maintains StructuredLogger + RunTracker (observability) │
│ • Persists run summary to runs/{session_id}.json │
└─────────────────────────────────────────────┬─────────────────┘
│
┌──────────────────────────────────┐ │ creates fresh session
│ Anthropic Managed Agents │ ◄────────┘ per agent (cattle)
│ ───────────────────────────── │
│ 8 specialized agents │
│ Each: model + system prompt + │
│ tool config + criticality │
└────────────┬─────────────────────┘
│ stream events back; emit custom_tool_use
▼
┌──────────────────────────────────┐ ┌──────────────────────┐
│ Tools (hands; per-tool module) │ │ Memory Stores │
│ ───────────────────────────── │ │ ────────────────── │
│ tools/memory_store.py ──────►│ ──►│ /mnt/memory/ │
│ emit_event, get_events │ │ session_{id} │
│ tools/email.py ──────►│ │ .jsonl │
│ send_email_smtp │ │ (append-only event │
│ tools/subscribers.py ──────►│ │ sourcing log) │
│ get_subscribers (Supabase) │ └──────────────────────┘
│ tools/verifier.py ──────►│
│ verify_links (HEAD + arXiv) │
└────────────┬─────────────────────┘
│
├─► SMTP (Gmail) ──► subscriber inboxes
│
└─► commits latest_issue.html back to repo
│
▼
┌──────────────────────────────────────┐
│ Vercel (web layer; auto-deploy) │
│ ────────────────────────────────── │
│ Flask app (app.py via index.py) │
│ • GET / subscribe │
│ • POST /api/subscribe add (50-cap)│
│ • GET /unsubscribe remove │
│ • GET /latest serve issue │
│ • GET /admin list emails │
│ │
│ Reads/writes Supabase 'subscribers' │
│ Serves latest_issue.html from repo │
└──────────────────────────────────────┘
Each agent is configured in the Claude Console with a model assignment, system prompt, and tool list. The Console is the source of truth for prompts; this repo references them by agent_id.
| # | Agent | Model | Role | Emits | Reads | Criticality |
|---|---|---|---|---|---|---|
| 1 | Memory | Haiku 4.5 | Read prior runs’ coverage so we don’t repeat | `covered_topics` | (none — reads prior runs’ events) | Optional |
| 2 | Research Launches | Opus 4.7 | Find AI ecosystem developments past 7 days | `launches_researched` | `covered_topics` | Critical |
| 3 | Research Papers | Opus 4.7 | Find actionable AI research past 7 days | `papers_researched` | `covered_topics` | Optional |
| 4 | Evaluator | Opus 4.7 | Score & rank with transparent rubric (relevance/depth/novelty 1–10 each) | `items_evaluated` | research events | Critical |
| 5 | Writer | Opus 4.7 | Draft the brief in markdown with mandatory citations on every claim | `draft_written` | `items_evaluated` + (on retry) `critic_rejection` or `verification_failed` | Critical |
| 6 | Critic | Opus 4.7 | Review for quality, banned-words, style, and citation presence | `draft_approved` or `critic_rejection` | latest `draft_written` | Critical |
| 7 | Verifier | Haiku 4.5 | Verify every URL and arXiv citation in the approved draft actually exists | `verification_passed` or `verification_failed` | latest `draft_approved` | Critical |
| 8 | Delivery | Haiku 4.5 | Render & send via SMTP | `email_sent` | `draft_approved` + `verification_passed` | Critical |
Model tiering rationale:
Parallelism:
Agents 2 and 3 (Research Launches + Research Papers) run concurrently via a ThreadPoolExecutor. They share no state during research; they only converge at the Evaluator. This cuts the longest single step’s wall-clock by ~half.
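A sketch of that fan-out, assuming a `run_agent` helper that wraps the real AgentRunner (which also enforces the per-agent timeout and retry policy):

```python
from concurrent.futures import ThreadPoolExecutor

def run_research_step(session_id: str) -> None:
    # The two research agents share no state; each only appends its own
    # events, so they can safely run in parallel and converge at the Evaluator.
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [
            pool.submit(run_agent, "research_launches", session_id),  # hypothetical helper
            pool.submit(run_agent, "research_papers", session_id),
        ]
        for future in futures:
            future.result()  # surface any agent exception in the host process
```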
Retry semantics in the Writer/Critic loop:
The Writer/Critic loop is state-driven, not turn-driven. It counts events:
drafts = count_events("draft_written")
rejections = count_events("critic_rejection")
if drafts == 0 or rejections >= drafts:
→ Writer needs to write (initial pass or addressing rejection)
elif drafts > rejections:
→ Critic needs to review the latest draft
Max 2 rejections (3 total Writer attempts). If the Critic rejects the third draft, we abort before delivery — better to skip a week than send a bad issue. (This has happened zero times in production but is the right default.)
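Expressed as a runnable loop (a sketch: `run_writer` and `run_critic` stand in for the real AgentRunner invocations, and `get_events` is the session-log reader):

```python
MAX_REJECTIONS = 2  # 3 total Writer attempts

def write_critique_step(session_id: str) -> bool:
    """Return True once a draft is approved; False means skip this week."""
    while True:
        if get_events(session_id, "draft_approved"):
            return True
        drafts = len(get_events(session_id, "draft_written"))
        rejections = len(get_events(session_id, "critic_rejection"))
        if rejections > MAX_REJECTIONS:
            return False  # abort before delivery rather than send a bad issue
        if drafts == 0 or rejections >= drafts:
            run_writer(session_id)   # initial pass or addressing a rejection
        else:
            run_critic(session_id)   # review the latest draft
```

Because the counts are derived from the log on every iteration, a crashed or re-run orchestrator lands in exactly the right branch with no extra bookkeeping.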
`orchestrator_v2.py` is a stateless Python process that:

- Generates or resumes the `shared_session_id`.
- Initializes `StructuredLogger` and `RunTracker` for observability.
- Drives the six logical steps (`memory`, `research`, `evaluate`, `write_critique`, `verify`, `deliver`) sequentially.
- Routes `custom_tool_use` events to the appropriate `tools/` module.
- Persists the `RunTracker` summary to `runs/{session_id}.json` on completion.

The orchestrator is single-process but multi-step replayable. If the GitHub Actions runner is killed (e.g., the 30-min timeout is exceeded), a new run with `--session-id <id>` resumes from the last completed step. In practice this never triggers — runs complete in 10–15 min, well under the 30-min cap — but the capability exists.
Why one process, not five: The original design had each step as a separate Vercel function chained via async HTTP. We abandoned that because:
The capability still exists in code (orchestrator_v2.py --step <name>) for environments with hard timeouts, but the production path is the single-process loop.
The session log is a per-run JSONL file:
/mnt/memory/session_newsletter_20260507_090000_a1b2c3d4.jsonl
Each line is one event:
{
"id": "uuid",
"session_id": "newsletter_20260507_090000_a1b2c3d4",
"agent_name": "research_launches",
"event_type": "launches_researched",
"data": { "launches": [...] },
"created_at": "2026-05-07T09:04:12.123456Z"
}
Location resolution:
- `/mnt/memory/` — Anthropic’s workspace-scoped persistent storage.
- `./memory_local/` — for `python3 orchestrator_v2.py` runs from a developer machine.

Both paths use identical code. The orchestrator picks whichever is writable.
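The resolution itself is tiny. A sketch, with the writability probe as an assumption:

```python
import os
from pathlib import Path

def resolve_memory_dir() -> Path:
    """Prefer the platform mount; fall back to a local dir for dev runs."""
    for candidate in (Path("/mnt/memory"), Path("./memory_local")):
        try:
            candidate.mkdir(parents=True, exist_ok=True)
            if os.access(candidate, os.W_OK):
                return candidate
        except OSError:
            continue  # not mountable/writable here; try the next candidate
    raise RuntimeError("no writable memory directory found")
```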
Why JSONL, not SQLite or a database:
- `cat session_*.jsonl | jq '.event_type'` shows the full run.

The trade-off: no native indexes. With ~30 events per run, this is not a problem. If event count grew 100×, we’d add an index file or migrate to SQLite.
Event-type catalog:
| Event type | Emitted by | Means |
|---|---|---|
| `covered_topics` | Memory | Topics covered in recent prior issues |
| `launches_researched` | Research Launches (or orchestrator fallback) | Candidate ecosystem items found |
| `papers_researched` | Research Papers (or orchestrator fallback) | Candidate research papers found |
| `items_evaluated` | Evaluator | Ranked/filtered set with scoring breakdown |
| `draft_written` | Writer | A complete brief in markdown |
| `critic_rejection` | Critic | Specific issues that must be addressed |
| `draft_approved` | Critic | Brief passed quality + citation checks |
| `verification_passed` | Verifier | All URLs and arXiv citations confirmed to exist |
| `verification_failed` | Verifier | One or more URLs were unreachable or fabricated |
| `email_sent` | Delivery | Issue went out; pipeline complete |
The orchestrator’s resume logic depends only on the terminal event of each step (e.g., draft_approved, email_sent). It does not care how many drafts or rejections preceded approval — those exist in the log for debugging.
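A sketch of that resume logic, reusing the `get_events` reader from earlier (the step-to-terminal-event mapping is taken from this document; the code shape is assumed):

```python
# terminal event that marks each logical step as complete
TERMINAL_EVENTS = {
    "memory": "covered_topics",
    "research": "launches_researched",  # papers_researched may be a placeholder
    "evaluate": "items_evaluated",
    "write_critique": "draft_approved",
    "verify": "verification_passed",
    "deliver": "email_sent",
}

def next_incomplete_step(session_id: str) -> str | None:
    """First step whose terminal event is missing: where a resumed run starts."""
    for step, terminal in TERMINAL_EVENTS.items():
        if not get_events(session_id, terminal):
            return step
    return None  # run already complete
```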
Each tool is its own module under tools/. This is the “many hands” pattern — independent failure boundaries, per-tool retry, swappable.
tools/
├── __init__.py # exports
├── memory_store.py # emit_event, get_events — session log access
├── email.py # send_email_smtp — SMTP delivery
├── subscribers.py # get_subscribers — Supabase fetch
└── verifier.py # verify_links — URL + arXiv existence check
tools/memory_store.py:
- `emit_event(session_id, agent_name, event_type, data)` — appends one JSONL line
- `get_events(session_id, agent_name?, event_type?, limit?)` — reads, filters, sorts
- `has_event()`, `count_events()`
- `@retry_with_backoff(max_attempts=3, initial_delay=0.5)` — retries on transient I/O errors

tools/email.py:
- `send_email_smtp(args)` — sends per-recipient HTML emails via SMTP
- Writes the `latest_issue.html/.md/_meta.json` snapshot (served by `/latest`)
- Renders a Gmail-safe TOC (anchor `id` attributes)
- `@retry_with_backoff(max_attempts=2, initial_delay=2.0)` — SMTP retries are slower

tools/subscribers.py:
get_subscribers() — returns the live recipient list with this priority:
1. Supabase `subscribers` table (live source of truth)
2. `RECIPIENT_EMAILS` env var (fallback if Supabase unreachable)

tools/verifier.py:
- `verify_links(args)` — extracts every Markdown link from a document and checks each URL exists
- `HEAD` request with `GET` fallback for servers that reject HEAD (5xx → fail; 405 → fall through to GET)
- Returns `{all_valid, checked, valid: [...], invalid: [{url, reason}]}`
- `@retry_with_backoff(max_attempts=2, initial_delay=1.0)` — slow networks need a second pass

Why one module per tool:
The Anthropic article frames each “hand” as execute(name, input) → string — a uniform interface but a separate execution boundary. Putting each tool in its own file:
The Flask app and the orchestrator both read/write to one Supabase table:
CREATE TABLE subscribers (
id BIGSERIAL PRIMARY KEY,
email TEXT NOT NULL UNIQUE
);
Write path (Flask /api/subscribe):
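A plausible sketch of the handler, assuming the supabase-py client, a form field named `email`, and the 50-subscriber cap described elsewhere in this document (validation details assumed):

```python
import os
from flask import Flask, request, jsonify
from supabase import create_client

app = Flask(__name__)
db = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_ANON_KEY"])
MAX_SUBSCRIBERS = 50  # Hobby-tier sanity cap

@app.post("/api/subscribe")
def subscribe():
    email = (request.form.get("email") or "").strip().lower()
    if "@" not in email:
        return jsonify({"ok": False, "error": "invalid email"}), 400
    # cap enforcement: count first, insert second
    count = db.table("subscribers").select("id", count="exact").execute().count
    if count >= MAX_SUBSCRIBERS:
        return jsonify({"ok": False, "error": "subscriber cap reached"}), 409
    # the UNIQUE constraint rejects duplicates (the error surfaces here)
    db.table("subscribers").insert({"email": email}).execute()
    return jsonify({"ok": True})
```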
Read paths:
- `/admin` — display the current list (read-only HTML page)
- `tools/subscribers.py` — fetch the list at delivery time

Why Supabase, not a flat file or env var:
- Cap enforcement needs an exact subscriber count, which Supabase returns with `count='exact'`.

This is the only piece of external infrastructure in the system besides Anthropic and SMTP. We deliberately did not adopt Supabase for the session log (Memory Stores serves that), so the dependency is contained.
Standard Flask app, deployed via Vercel’s Python runtime (index.py → from app import app). Routes:
| Method | Route | Purpose |
|---|---|---|
| GET | `/` | Subscribe form (single-page) |
| POST | `/api/subscribe` | Add to Supabase, with cap enforcement |
| GET | `/unsubscribe?email=X` | Delete from Supabase, styled confirmation |
| GET | `/latest` | Serve `latest_issue.html` (or 404 page if not yet generated) |
| GET | `/admin` | Plain page listing current subscribers (no auth — relies on URL obscurity for now) |
/latest content lifecycle:
1. The Delivery agent calls `send_email_smtp`, which writes `latest_issue.html` to disk.
2. The GitHub Actions workflow commits `latest_issue.html` back to the repo.
3. Vercel auto-deploys the commit; `/latest` is live within ~60 seconds of email send.

This is a slightly unusual pattern (using git as the deployment channel for runtime-generated content), but for a weekly cadence it’s perfect: free, audited, version-controlled, and rolls back easily.
Observability (observability.py)

Two primitives, both deliberately minimal:
StructuredLogger — emits one JSON object per log line:
{
"ts": "2026-05-07T09:04:12.123456Z",
"level": "INFO",
"session_id": "newsletter_...",
"agent": "research_launches",
"message": "agent_end: research_launches",
"elapsed_sec": 142.7,
"tool_calls": 4,
"input_tokens": 45000,
"output_tokens": 8200,
"cache_read_tokens": 320000
}
Logs go to stdout (where GitHub Actions captures them) and to logs/session_{id}.log (uploaded as workflow artifacts, 30-day retention).
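A minimal sketch of the logger, reconstructed from the fields above (not the exact implementation):

```python
import json
from datetime import datetime, timezone

class StructuredLogger:
    """One JSON object per line, to stdout and a per-session file."""

    def __init__(self, session_id: str, log_path: str):
        self.session_id = session_id
        self.file = open(log_path, "a")

    def log(self, level: str, message: str, agent: str | None = None, **fields):
        line = json.dumps({
            "ts": datetime.now(timezone.utc).isoformat(),
            "level": level,
            "session_id": self.session_id,
            "agent": agent,
            "message": message,
            **fields,  # elapsed_sec, tool_calls, token counts, ...
        })
        print(line)                   # captured by GitHub Actions
        self.file.write(line + "\n")  # uploaded as a workflow artifact
        self.file.flush()
```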
RunTracker — in-memory accumulator that produces a single run summary:
{
"session_id": "...",
"total_elapsed_sec": 487.3,
"agent_timings": { "memory": 12.1, "research_launches": 145.2, ... },
"agent_status": { "memory": "success", "research_launches": "success", ... },
"agent_tokens": { "research_launches": { "input": 45000, "output": 8200, ... } },
"totals": { "input_tokens": 380000, "output_tokens": 95000, "cache_read_tokens": 1200000 },
"errors": [],
"success": true
}
Persisted to runs/{session_id}.json after every run. Recent N can be listed via get_recent_runs() for a future runs-history endpoint.
Why we did not adopt Datadog / Sentry / OpenTelemetry:
We will revisit if the system grows to multiple newsletters or sub-daily runs.
Failure handling (retry.py)

Three mechanisms, each operating at a different scope:
Tool-level retry (@retry_with_backoff):
- Wraps every `tools/*` function
- Retries errors classified as transient by `is_retryable_anthropic_error()`
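A sketch of the decorator’s shape (behavior as described above; `is_retryable` stands in for `is_retryable_anthropic_error()` plus the I/O-error checks):

```python
import functools
import time

def retry_with_backoff(max_attempts: int = 3, initial_delay: float = 0.5):
    """Exponential backoff: initial_delay, then 2x, 4x, ... between attempts."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            delay = initial_delay
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:
                    if attempt == max_attempts or not is_retryable(exc):
                        raise  # non-transient, or out of attempts: surface to caller
                    time.sleep(delay)
                    delay *= 2
        return wrapper
    return decorator
```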
Agent-level timeout & criticality (in orchestrator_v2.py):

AGENT_CRITICALITY = {
    "memory": "optional",
    "research_launches": "critical",
    "research_papers": "optional",
    "evaluator": "critical",
    "writer": "critical",
    "critic": "critical",
    "verifier": "critical",  # per the agent table above
    "delivery": "critical",
}
AGENT_TIMEOUTS = {
    "memory": 120, "research_launches": 300, "research_papers": 300,
    "evaluator": 180, "writer": 300, "critic": 180, "delivery": 120,
    "verifier": 120,  # assumed value; the Verifier typically finishes in ~30s
}
If an agent exceeds its timeout, the AgentRunner breaks the event stream and records a failure. The orchestrator decides what to do based on criticality:
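A sketch of that decision (`PipelineAbort` and the helper shape are assumptions; the dictionaries are the ones above):

```python
class PipelineAbort(Exception):
    """Hypothetical: raised to stop the run before anything ships."""

def handle_agent_failure(agent: str, session_id: str, terminal_event: str) -> None:
    if AGENT_CRITICALITY.get(agent, "critical") == "critical":
        # a critical agent failed: skip the week rather than ship a bad issue
        raise PipelineAbort(f"{agent} failed; aborting this run")
    # optional agent: unblock downstream steps, keep the failure visible in the log
    emit_event(session_id, "orchestrator", terminal_event,
               {"items": [], "auto_inserted": True, "note": f"{agent} failed"})
```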
Pipeline-level fallbacks: The most subtle defense is the post-condition placeholder. If the Research Launches or Research Papers agent silently exits without emitting its terminal event (a real bug observed in 2 of 3 early runs), the orchestrator inserts an empty placeholder event:
{
"event_type": "papers_researched",
"data": { "papers": [], "auto_inserted": True, "note": "..." }
}
This unblocks the Evaluator (which depends on the terminal event existing), preserves the failure in the log, and lets the pipeline ship a (lighter) issue rather than dropping the week.
Why we don’t have a global “alert on failure” yet: GitHub Actions emails the workflow owner on job failure. That’s the alert. We have not added Slack/PagerDuty because the system runs once a week and the GitHub email is sufficient.
The Writer’s markdown output is consumed by the email renderer (email_renderer.py), which parses it with regex to identify section headings and per-item entries. This forms an implicit structural contract between two systems:
- The Writer’s prompt, which defines the shape of the brief.
- email_renderer.py, whose regex assumes that shape.

The contract specifically requires:
- A title of the form `# Crux AI — <theme>`
- The `company_items` and `research_items` sections
- Item headings of the form `### N. Title` (numbered H3, period after the number)

Why it matters: When the Writer’s prompt was updated late in the project to use bold-link headings instead of numbered H3s, the pipeline kept passing every internal check (Writer produced output, Critic approved, Verifier confirmed every link resolved, Delivery sent) but the email arrived empty under the section headers because the renderer’s regex no longer matched.
The defense: A smoke test at scripts/test_render.py (or equivalent) feeds a sample brief through the renderer and asserts the rendered HTML contains the expected items before any prompt change ships. The Critic’s STRUCTURE CHECKS section also enforces the format from the producer side, so a structurally invalid draft is rejected before it reaches Verify or Delivery.
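A sketch of what such a smoke test looks like (`render_issue_html` is a hypothetical entry point; the real renderer’s API may differ):

```python
# scripts/test_render.py: run before shipping any Writer prompt change
from email_renderer import render_issue_html  # hypothetical entry point

SAMPLE_BRIEF = """\
# Crux AI — Sample Theme

### 1. First launch item
Something happened. [Source](https://example.com/a)

### 2. Second launch item
Something else. [Source](https://example.com/b)
"""

def test_renderer_contract():
    html = render_issue_html(SAMPLE_BRIEF)
    # if the regex contract drifts, items vanish silently; assert they survived
    assert "First launch item" in html
    assert "Second launch item" in html
    assert html.count("example.com") == 2

if __name__ == "__main__":
    test_renderer_contract()
    print("renderer contract OK")
```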
The general principle: Anywhere a downstream system parses an LLM’s output, the prompt and the parser form a contract that must be tested explicitly. Prompt iteration breaks consumers silently otherwise.
The risk that a Critic-approved draft could still contain a fabricated URL or a non-existent arXiv paper is real and high-impact: a single hallucinated citation in a research newsletter destroys reader trust permanently. The Critic catches a lot of bad writing but cannot verify factual claims against external reality. The Verifier closes that gap.
The flow looks like this:
1. The Critic approves a draft (`draft_approved` event emitted).
2. The Verifier reads the latest `draft_approved` event and calls `verify_links` on the brief content.
3. `verify_links` extracts every Markdown link, runs HEAD requests on each URL, and queries the arXiv API for any arXiv IDs it finds.
4. `verification_passed` — every URL and citation resolved successfully; the Delivery agent runs.
5. `verification_failed` — one or more URLs are unreachable; the orchestrator loops back to the Writer with the list of bad URLs.
6. On a `verification_failed` event, the Writer reads the `invalid_urls` array and either replaces each bad URL with a valid one from the `items_evaluated` event or rewrites the surrounding sentence to remove the citation entirely. The Critic then re-approves; the Verifier re-checks.

Bounded retry: Up to two verification failures are tolerated (`MAX_VERIFICATION_RETRIES = 2`). Beyond that, the orchestrator aborts the run before delivery — better to skip a week than ship an issue with fabricated citations.
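The per-URL check inside `verify_links`, sketched with the requests library (the arXiv ID lookup and the retry wrapper are omitted; timeout values assumed):

```python
import re
import requests

MD_LINK = re.compile(r"\[[^\]]*\]\((https?://[^)\s]+)\)")

def check_url(url: str, timeout: float = 10.0) -> tuple[bool, str]:
    """HEAD first; fall through to GET for servers that reject HEAD."""
    try:
        resp = requests.head(url, timeout=timeout, allow_redirects=True)
        if resp.status_code == 405:  # method not allowed: retry as GET
            resp = requests.get(url, timeout=timeout, allow_redirects=True, stream=True)
        return (resp.status_code < 400, f"HTTP {resp.status_code}")
    except requests.RequestException as exc:
        return (False, str(exc))

def verify_links(markdown: str) -> dict:
    results = {"valid": [], "invalid": []}
    for url in MD_LINK.findall(markdown):
        ok, reason = check_url(url)
        if ok:
            results["valid"].append(url)
        else:
            results["invalid"].append({"url": url, "reason": reason})
    results["checked"] = len(results["valid"]) + len(results["invalid"])
    results["all_valid"] = not results["invalid"]
    return results
```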
Citation enforcement complements this defense. The Writer’s system prompt requires every factual claim to include a Markdown link, and the Critic explicitly rejects drafts that lack citations. Together, the prompt-level requirement and the runtime check form a two-layer defense against hallucinated facts: the Critic ensures citations exist, the Verifier ensures the citations resolve.
Why a separate agent rather than a tool the Critic calls:
A single Thursday’s run, in time order:
T+0:00 GitHub Actions cron fires (0 9 * * 4)
T+0:01 Ubuntu runner provisioned, repo cloned, deps installed
T+0:30 `python3 orchestrator_v2.py` starts
↓ generates session_id = "newsletter_20260507_090030_a1b2c3d4"
↓ logger + tracker initialized
T+0:31 Step 1: MEMORY AGENT (Haiku, ~30s)
• Reads /mnt/memory/ for events from prior session_ids
• Extracts topics covered in last 12 weeks
• emit_event("covered_topics", {...})
T+1:00 Step 2: RESEARCH (parallel, Opus × 2, ~3 min)
┌─ Research Launches: web search, filter, evaluate
│ emit_event("launches_researched", { launches: [5–7 items] })
└─ Research Papers: web search, filter, evaluate
emit_event("papers_researched", { papers: [2–3 items] })
(post-condition: orchestrator inserts placeholder if either is missing)
T+4:30 Step 3: EVALUATOR (Opus, ~2 min)
• get_events() → reads launches, papers, covered_topics
• Scores each item on relevance/depth/novelty (1–10 each)
• Drops items with total < 18, drops covered duplicates
• emit_event("items_evaluated", { selected_launches, selected_papers, rejected_items, summary })
T+6:30 Step 4: WRITER ↔ CRITIC LOOP (Opus × N, ~3–6 min)
• Writer: get_events("items_evaluated") → draft with mandatory citations →
emit_event("draft_written")
• Critic: get_events("draft_written") → review (quality + citations present) →
◦ If approved: emit_event("draft_approved")
◦ If rejected: emit_event("critic_rejection") → loop back to Writer
(max 2 rejections; 3 total Writer attempts)
T+10:30 Step 5: VERIFY (Haiku, ~30s)
• Verifier: get_events("draft_approved") → call verify_links on brief →
◦ All URLs resolve + arXiv IDs valid: emit_event("verification_passed")
◦ One or more invalid: emit_event("verification_failed") with bad URLs →
loop back to Writer (Writer fixes URLs, Critic re-approves, Verifier re-checks)
(max 2 verification failures before abort)
T+11:00 Step 6: DELIVERY (Haiku, ~1 min)
• get_events("draft_approved") + verification_passed exists → final markdown
• Calls send_email_smtp:
◦ tools/subscribers.get_subscribers() → live list from Supabase
◦ tools/email.handle_send_email_smtp(subject, markdown):
- Per-recipient HTML render (personalized unsubscribe)
- Save snapshot: latest_issue.{html,md,json}
- SMTP send to all subscribers
• emit_event("email_sent", { recipients, subject })
T+12:00 Orchestrator finalizes:
• RunTracker.persist() → writes runs/{session_id}.json
• briefs/{session_id}_log.json written (compact view of run)
T+12:05 GitHub Actions workflow:
• Commits latest_issue.{html,md,json} to repo
• Uploads logs/, runs/, briefs/ as workflow artifacts (30-day retention)
• Pushes to main
T+12:30 Vercel auto-deploys the new commit
• /latest now serves the new issue
• Subscribers see new edition in their inbox
| Failure | Detection | Recovery |
|---|---|---|
| Anthropic API rate-limited | Retry decorator catches 429 | Exponential backoff, 3 attempts |
| Agent silently exits without emitting | Post-condition check after step | Auto-insert placeholder event; pipeline continues |
| Agent exceeds per-agent timeout | AgentRunner timer breaks stream | Marked failed; criticality decides if pipeline aborts |
| Critic rejects 3 times | Loop counter hits max_retries | Pipeline aborts before delivery; week is skipped |
| Verifier reports invalid URLs | `verification_failed` event | Loop back to Writer with bad URLs; Critic re-approves; Verifier re-checks (max 2 retries) |
| Verifier exceeds 2 retries (URLs still bad) | `MAX_VERIFICATION_RETRIES` exhausted | Pipeline aborts before delivery; week is skipped |
| GitHub Actions runner killed mid-run | Workflow shows failure | Re-trigger with `--session-id <id>` resumes from last terminal event |
| SMTP server transient failure | `tools/email.py` retry decorator | 2 attempts with 2s/4s backoff |
| Supabase unreachable for subscriber fetch | `tools/subscribers.py` returns None | Falls back to `RECIPIENT_EMAILS` env var |
| Memory Stores write failure | `emit_event` returns `{ok: false}` | Tool-level retry; if persistent, agent receives error and decides |
| Slow target server during link verification | `verify_links` HEAD timeout | Falls back to GET; if both fail, URL marked invalid |
| Vercel deploy fails after commit | Vercel dashboard shows red | Manual rollback; orchestrator already sent emails — only `/latest` page is stale |
| `latest_issue.html` git push conflict | Workflow exits nonzero | Re-run workflow; idempotent commit |
The system is designed to degrade gracefully, fail loudly, and require zero manual intervention for transient failures.
This is a small, public-facing system. We did not over-engineer security, but we did reason about it.
Threat model (what we worry about):
- Subscriber email addresses: the `/admin` page exposes them without auth. This is acceptable for the current population (friends/family + first ~50 subscribers); it’s not acceptable at scale.
- Credential leakage: `SMTP_PASSWORD` is a Gmail App Password — limited blast radius (only sends mail from that account; cannot read inbox). Stored in GitHub Secrets and Vercel Env Vars.

What we do not protect against:
Improvements we’d make at scale:
- Put `/admin` behind auth (Vercel password protection or Supabase Auth).
- Restrict the anon key to writes on `subscribers`, not bulk selects.

Per-run timing (observed, p50):
Per-run cost (observed range, USD):
Cost dominators:
Operational cost (monthly, 4 runs):
Total project cost (all-in, including prototyping and development):
Brief notes on choices that warrant justification.
| Decision | Alternatives considered | Why |
|---|---|---|
| Memory Stores JSONL for session log | Supabase table; SQLite | JSONL is the native primitive of the platform; no schema; trivially inspectable |
| GitHub Actions for orchestration | Vercel Cron + Functions; AWS Lambda; Modal | Vercel Hobby’s 60s cap is fatal; GH Actions is free with 6h headroom |
| Vercel for web only | Render; Fly.io | Already using for Flask; auto-deploy on git push is the perfect channel for latest_issue.html |
| Per-tool modules under `tools/` | One `tool_handlers.py` file | Independent failure boundaries; per-tool retry policies |
| State-driven Writer/Critic loop | Counter-driven | Counts of `draft_written` vs `critic_rejection` give correct resume behavior automatically |
| Auto-insert placeholder on silent agent failure | Hard-fail; manual restart | Optional agents shouldn’t kill the run; failure stays visible in the log |
| Verifier as a separate agent (not a Critic tool) | Add `verify_links` to Critic’s toolset | Different criticality, timeout, and model tier; keeps Critic’s prompt focused on quality, not HTTP semantics |
| Citation enforcement at prompt + Critic check | Just hope the Writer cites things | Two-layer defense: Writer must include links by prompt; Critic rejects drafts that don’t. Pairs with Verifier for full trust hardening. |
| Verifier on Haiku, not Opus | All-Opus | Verification is mechanical (read draft, call HTTP); no judgment needed. Cuts cost to ~$0.02/run |
| Bounded verification retries (`MAX_VERIFICATION_RETRIES = 2`) | Unlimited retries | If URLs keep failing, the source data is broken; better to skip a week than ship an issue with fabricated citations |
| Writer output format constrained to `### N. Title` H3s | Free-form markdown | Email renderer parses with regex; structural contract is enforced at the prompt and again by the Critic to prevent silent rendering failures |
| Smoke test the renderer on prompt changes | Manually run the pipeline after each prompt edit | A pipeline run costs ~$2; the smoke test costs zero and catches the same class of bug |
| Opus for cognitive agents, Haiku for mechanical | All-Opus; All-Haiku | Tiering cuts ~40% of cost with no observable quality loss on Haiku-assigned tasks |
| Supabase as live subscriber source | Static `RECIPIENT_EMAILS` env | Avoids weekly manual env-var updates after subscribe/unsubscribe |
| Commit `latest_issue.html` to repo | Object storage (S3); database blob | Free; auditable; rolls back via `git revert` |
| Single shared `session_events`-style log across all runs | Per-run isolated logs | Memory Agent needs cross-run history (covered topics); easier than a separate cross-run store |
| 50-subscriber cap | No cap | Sanity for a Hobby-tier system; trivially raised |
| No structured alerting (Slack/PagerDuty) | Slack webhook on failure | GitHub email-on-failure is sufficient for weekly cadence |
This system is intentionally over-engineered for one subscriber. The point is to exercise the patterns, not to optimize the immediate workload.
- Put `/admin` behind auth.

None of these are needed today. They are obvious at the scale that triggers them.
Concrete next moves we have considered but not yet built:
- A `/runs` endpoint reading from `runs/*.json`, showing token cost and timing trends.

| Path | Purpose |
|---|---|
| `orchestrator_v2.py` | Step-based orchestrator + AgentRunner with retry/timeout/observability + Verifier loop |
| `observability.py` | `StructuredLogger`, `RunTracker`, run-history readers |
| `retry.py` | Exponential backoff decorator + Anthropic transient-error classifier |
| `credentials.py` | Credential resolution chain (env → vault placeholder) |
| `tools/__init__.py` | Tool exports (many-hands pattern) |
| `tools/memory_store.py` | `emit_event`, `get_events` (session log) |
| `tools/email.py` | `send_email_smtp` (SMTP + per-recipient render + snapshot) |
| `tools/subscribers.py` | `get_subscribers` (Supabase live fetch + env fallback) |
| `tools/verifier.py` | `verify_links` (HTTP HEAD + arXiv API for hallucination grounding) |
| `email_renderer.py` | Editorial HTML rendering (per-recipient unsubscribe + Gmail-safe TOC) |
| `app.py` | Flask web app: subscribe / unsubscribe / latest / admin |
| `index.py` | Vercel Python runtime entrypoint (`from app import app`) |
| `vercel.json` | Vercel config (web layer only) |
| `.github/workflows/newsletter.yml` | GitHub Actions cron + manual trigger |
| `templates/index.html` | Subscribe page template |
| `requirements.txt` | Python dependencies |
| `briefs/` | Per-run JSON logs (gitignored) |
| `logs/` | Per-session structured JSON log files (gitignored) |
| `runs/` | Per-run summary metrics (gitignored) |
| `memory_local/` | Local Memory Stores fallback for testing (gitignored) |
| `latest_issue.{html,md,json}` | Latest newsletter (committed by GH Actions; served by `/latest`) |
| `architecture_diagram.svg` | One-page system diagram |
GitHub Secrets (orchestrator):
ANTHROPIC_API_KEY # Anthropic platform key
SMTP_USER # Gmail address
SMTP_PASSWORD # Gmail App Password (not regular password)
APP_BASE_URL # https://your-domain.vercel.app
SUPABASE_URL # Supabase project URL
SUPABASE_ANON_KEY # Supabase anon key
RECIPIENT_EMAILS # Fallback list (used only if Supabase unavailable)
SMTP_HOST # Defaults to smtp.gmail.com
SMTP_PORT # Defaults to 587
SMTP_FROM # Defaults to SMTP_USER
Vercel Environment Variables (web layer):
SUPABASE_URL, SUPABASE_ANON_KEY # for subscribe/unsubscribe/admin
APP_BASE_URL # for absolute link generation
# Recent runs (downloaded from GH Actions artifacts):
ls runs/ | sort | tail -5
# Full event stream of a session:
cat memory_local/session_<id>.jsonl | jq '.event_type'
# Run summary:
jq '.' runs/<session_id>.json
# Log lines for a specific agent:
grep '"agent":"writer"' logs/session_<id>.log | jq '.'
| Ritual | Frequency | What |
|---|---|---|
| Inbox check | Weekly (Thursday morning) | Confirm email arrived |
| `/latest` check | Weekly | Confirm new issue served |
| GH Actions run review | Weekly | Skim logs for warnings, auto-inserted placeholders, and verification_failed events |
| Verifier review | Weekly | Check session log for any verification_failed events; spot-check the URLs that triggered them |
| Renderer smoke test | Before any Writer prompt change | Push a sample brief through email_renderer.py; assert the output HTML contains the expected number of items. Costs zero, catches the contract-drift class of bugs |
| Cost review | Monthly | Anthropic dashboard; investigate if > $10/run |
| Subscriber audit | Monthly | /admin page; remove obvious dupes/typos |
| Anthropic key rotation | Quarterly | Generate new key, update Vercel + GH secrets |
| SMTP password rotation | Quarterly | Generate new App Password, update secrets |
End of architecture document.