0 At a glance
Compute is not on EC2 — everything runs on managed serverless: web on Lambda (OpenNext) + CloudFront, workers on 58 Lambda containers + 3 Fargate tasks, DB on Aurora Serverless v2.
Live service health (measured)
GET https://d2bep41lgzej60.cloudfront.net/api/health → ok:true db:ok sqs:configured s3:configured · scrapers active (MIC 26 invocations / 24h).
1 Data pipeline (end-to-end)
Scrape → normalize → AI classify → embed → dedup → store → (web read + research-agent enrichment). A Step Functions Distributed Map orchestrates per source.
| Stage | What it does | Technology |
|---|---|---|
| Schedule | per-source cron triggers | EventBridge ×74 |
| Collect | TLS-spoofing scrapers / browser | Lambda+Fargate |
| Normalize | HTML→structured RFQ, raw to S3 | selectolax |
| Classify | relevance · quality · intent | Haiku 4.5 Batch |
| Embed | 1536-dim halfvec | Gemini / Qwen |
| Dedup | FIFO + advisory lock, ANN | pgvector HNSW |
| Store | idempotent ON CONFLICT insert | Aurora v2 |
| Enrich | email/contact research + verify | Sonnet 4.6 + MV |
| Browse | lead hub · AI search · SSE | Next.js 16 |
AI models (per stage)
| Classification / match coarse | claude-haiku-4-5 |
| Research / match fine | claude-sonnet-4-6 |
| Selector self-heal / retry | claude-opus-4-8 |
| Dedup embedding | gemini-embedding-001 (halfvec 1536) |
| Partner search embed/rerank | Qwen3-Embedding-8B / Qwen3.6 |
Reliability design (from code)
| Classification decoupling | Batch submit/poll split (up to 24h, SFN non-blocking) |
| Idempotent inserts | UNIQUE(source_id,url) + ON CONFLICT |
| Dedup race | SQS FIFO MessageGroupId + advisory lock |
| Circuit breaker | per-source DynamoDB (5 fails → open) |
| Budget killswitch | disable scheduler when daily > $100 |
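The circuit-breaker row above can be illustrated with a minimal sketch. It assumes the count is of consecutive failures and that a success resets it; neither detail is stated in the table. A plain dict stands in for the per-source DynamoDB item, and `SourceBreaker` and the source names are hypothetical.

```python
from dataclasses import dataclass, field

FAILURE_THRESHOLD = 5  # from the table: 5 fails -> open


@dataclass
class SourceBreaker:
    """Per-source breaker state; a dict stands in for the DynamoDB table."""

    failures: dict[str, int] = field(default_factory=dict)

    def is_open(self, source_id: str) -> bool:
        # Open means the source should be skipped until someone resets it.
        return self.failures.get(source_id, 0) >= FAILURE_THRESHOLD

    def record_failure(self, source_id: str) -> None:
        self.failures[source_id] = self.failures.get(source_id, 0) + 1

    def record_success(self, source_id: str) -> None:
        # Assumption: one success clears the consecutive-failure count.
        self.failures.pop(source_id, None)


breaker = SourceBreaker()
for _ in range(4):
    breaker.record_failure("ec21")
still_closed = not breaker.is_open("ec21")  # 4 failures: still closed
breaker.record_failure("ec21")
now_open = breaker.is_open("ec21")          # 5th failure opens the breaker
```

Keeping the state per source means one failing site does not stop scraping of the others.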
2 Compute: "Is it EC2?" code↔reality
Conclusion: not an EC2 deployment. Zero ec2.Instance in the CDK. Every workload runs on Lambda / Fargate / Aurora Serverless.
| Workload | Compute | CDK definition (file:line) | Measured in AWS |
|---|---|---|---|
| Web app | Lambda · OpenNext | web-stack.ts:170 NODEJS_22 arm64 | rfq-web-server-prod Zip/nodejs22.x/arm64/1024MB/60s |
| Web delivery path | CloudFront→Lambda URL | web-stack.ts:489,498 | CF E2FMEFMSZOHFB6 · d2bep41lgzej60 · Deployed |
| 12 scrapers | Lambda container | worker-stack.ts:69–81 DockerImageFunction | rfq-scrape-mic Image/arm64/1024MB/600s |
| ~40 loaders | Lambda container | worker-stack.ts:83–93 | rfq-load-* (many) |
| Classify · research · match · dedup | Lambda container | worker-stack.ts:94–170 | rfq-research-handler Image/arm64/900s |
| Long-running batch (re-enrich/re-classify) | Fargate | worker-stack.ts:611–833 FargateTaskDefinition 2vCPU/4GB | cluster rfq-reenrich-prod · FARGATE 2 RUNNING (family rfq-classify-prod) |
| Database | Aurora Serverless v2 | data-stack.ts:98–113 min0/max16 | rfq-aurora-prod available · PG17.7 · db.serverless |
| Direct EC2 instance | none in CDK | 0 definitions | ⚠️ 1 exists (rfq-verify-oneshot) — see §Drift |
All 58 Lambdas are arm64 (Graviton). Only the web is Zip (OpenNext bundle); workers are all container images. Packages/images are stored in the ECR cdk-hnb659fds-container-assets-* shared asset repo.
3 Network & Database
| Resource | Spec / measured | CDK source | State |
|---|---|---|---|
| VPC | vpc-0b9be81d493ff39b1 · 10.0.0.0/16 · rfq-vpc-prod · 3 AZ | network-stack.ts:26 | active |
| Subnets | Public + Private-with-egress | network-stack.ts:32–37 | active |
| NAT Gateway | 1× nat-0fdfe4b724e908cbd (cost-optimized) | network-stack.ts:30 | available |
| VPC Endpoint | none (NAT path only) | network-stack.ts comment | unused |
| Aurora cluster | rfq-aurora-prod · Serverless v2 · 0–16 ACU · 1h auto-pause | data-stack.ts:98–113 | available |
| Engine version | PostgreSQL 17.7 (code 17.5 → auto minor upgrade) | data-stack.ts:103 | drift |
| RDS Proxy | none — Lambda connects directly (accepts cold start) | data-stack.ts comment | by design |
| Backup / encryption | PITR 7d · storageEncrypted KMS CMK | data-stack.ts:117–120 | on |
| Extensions | pgvector(halfvec/HNSW) · pg_trgm · pgcrypto | migration 0000 | installed |
4 Orchestration & Schedule
| Resource | Measured | CDK source | Notes |
|---|---|---|---|
| Step Functions | rfq-scrape-workflow-prod (STANDARD) | state-machines/scrape-workflow.ts | 26h timeout (Batch 25h + buffer), WAIT_FOR_TASK_TOKEN |
| EventBridge Scheduler | 74 schedules (mostly ENABLED) | worker-stack.ts:1128,1191 (code ~55) | grew with multi-industry expansion |
| Disabled schedule | rfq-research-cron-hourly DISABLED | — | research paused (Serper credits, suspected) |
| Representative crons | go4world 3h · ec21 12h · bizmaps-jp daily · us-fda monthly · classifier-poll 15min · qwen-clean hourly | scheduler group | per-source cadence |
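The WAIT_FOR_TASK_TOKEN pattern combined with the 15-minute classifier-poll schedule can be sketched roughly as follows. This is an illustration under assumptions, not the project's handler: the record shape, the status strings and the `poll_once` name are hypothetical, and the AWS call is replaced by an injected `resume` callback so the sketch runs without credentials.

```python
from typing import Callable

# Hypothetical record shape: one pending batch per paused workflow execution,
# e.g. {"batch_id": "b1", "task_token": "tok-1"}.
PendingBatch = dict


def poll_once(
    pending: list[PendingBatch],
    get_status: Callable[[str], str],
    resume: Callable[[str, str], None],
) -> list[PendingBatch]:
    """Resume workflows whose batch finished; return batches still pending."""
    still_pending = []
    for batch in pending:
        status = get_status(batch["batch_id"])
        if status == "in_progress":
            still_pending.append(batch)
        else:
            # In AWS this would be SendTaskSuccess or SendTaskFailure,
            # called with the task token stored at submit time.
            resume(batch["task_token"], status)
    return still_pending


statuses = {"b1": "ended", "b2": "in_progress"}
resumed: list[tuple[str, str]] = []
remaining = poll_once(
    [
        {"batch_id": "b1", "task_token": "tok-1"},
        {"batch_id": "b2", "task_token": "tok-2"},
    ],
    get_status=statuses.__getitem__,
    resume=lambda token, status: resumed.append((token, status)),
)
```

Because the state machine is paused on a token rather than polling in a loop, no Lambda or workflow step stays busy while the batch runs.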
5 Storage & Messaging
S3 (5) · DynamoDB · ECR
| rfq-corpus-data-prod | raw HTML · corpus |
| rfq-exports-prod | CSV exports (30d expiry) |
| rfq-web-assets-prod | static assets (CF) |
| rfq-web-cache-prod | OpenNext ISR cache |
| rfq-web-cf-logs-prod | CloudFront logs (90d) |
| rfq-web-tag-cache-prod | DynamoDB · OpenNext tag cache |
| cdk-hnb659fds-...assets | ECR · container images |
SQS (9 + DLQ)
| rfq-scrape-request-prod | Standard |
| rfq-dedupe-prod.fifo | FIFO |
| rfq-research-queue-prod | Standard |
| rfq-classifier-batch-prod | Standard |
| rfq-match-candidate-prod | Standard |
| rfq-campaign-events-prod | SES events |
| rfq-sourcing-run-prod | Standard |
| rfq-web-revalidation-prod.fifo | FIFO |
| rfq-lambda-async-dlq-prod | 3,396 backlog |
6 Security · Secrets · Edge
KMS · Secrets Manager
| KMS CMK | 4 (Aurora·S3·Secrets·Logs) auto-rotate |
| Secret | rfq/prod/oauth-google |
| Secret | rfq/prod/db-creds (30d rotation) |
WAFv2 (us-east-1)
4 managed rule sets (Common · BadInputs · IpReputation · AnonIP) + rate limit 1000 req/5min/IP. CloudFront scope.
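To make the rate-limit semantics concrete (1000 requests per 5 minutes per IP), here is a small sliding-window sketch. It is not how WAF evaluates rate-based rules internally, which is a managed and approximate process; it only illustrates the intended effect. The class name and the demo IP are hypothetical.

```python
from collections import defaultdict, deque

LIMIT = 1000          # requests per window, from the rule above
WINDOW_SECONDS = 300  # 5 minutes


class SlidingWindowLimiter:
    """Allow at most `limit` requests per IP in any trailing window."""

    def __init__(self, limit: int = LIMIT, window: float = WINDOW_SECONDS):
        self.limit = limit
        self.window = window
        self.hits: dict[str, deque] = defaultdict(deque)

    def allow(self, ip: str, now: float) -> bool:
        q = self.hits[ip]
        # Drop timestamps that have fallen out of the trailing window.
        while q and q[0] <= now - self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True


# Small limit so the behavior is visible in a few calls.
limiter = SlidingWindowLimiter(limit=3, window=300)
results = [limiter.allow("203.0.113.7", t) for t in (0, 1, 2, 3)]
after_window = limiter.allow("203.0.113.7", 301)
```

The limit is tracked per IP, so one noisy client is blocked without affecting others.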
SSM Parameter Store · /rfq/prod/* (19)
Web and workers lazy-load these parameters from SSM at runtime, so a local environment boots without keys; only the features that need a missing key fail.
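A minimal sketch of that lazy-loading behavior, assuming a fetch-on-first-use cache. `LazyParams`, the `serper-key` and `smtp-key` parameter names and the injected `fetch` function are hypothetical. In AWS the fetcher could wrap boto3's `ssm.get_parameter(Name=path, WithDecryption=True)`, which is left out so the sketch runs offline.

```python
from typing import Callable, Optional


class LazyParams:
    """Fetch each parameter on first use and cache it; nothing runs at boot."""

    def __init__(
        self,
        fetch: Callable[[str], Optional[str]],
        prefix: str = "/rfq/prod/",
    ):
        self._fetch = fetch
        self._prefix = prefix
        self._cache: dict[str, str] = {}

    def get(self, name: str) -> str:
        if name not in self._cache:
            value = self._fetch(self._prefix + name)
            if value is None:
                # Only the feature asking for this key fails, not the process.
                raise RuntimeError(f"parameter {self._prefix}{name} is not configured")
            self._cache[name] = value
        return self._cache[name]


calls: list[str] = []
store = {"/rfq/prod/serper-key": "abc"}  # hypothetical parameter


def fake_fetch(path: str) -> Optional[str]:
    calls.append(path)
    return store.get(path)


params = LazyParams(fake_fetch)
calls_at_boot = list(calls)  # constructing the loader fetched nothing
first = params.get("serper-key")
second = params.get("serper-key")  # served from the cache
```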
7 Observability & Alarm State
Per-function errors/throttles + Step Functions failures + DLQ depth + Aurora ACU + classification null-rate. Mostly OK; 2 rfq alarms in ALARM.
| Alarm | State | Reason (measured) | Verdict |
|---|---|---|---|
| Many per-function errors/throttles | OK | below threshold | normal |
| rfq-dlq-LambdaAsyncDlq-prod | ALARM | DLQ ≥1 (currently 3,396) | real bug → §Findings #1 |
| rfq-aurora-acu-prod | ALARM | ACU 4.1 > threshold 4.0 | minor (burst scale-up) |
| SES-* (Complaint/Quota) | ALARM | not rfq-prefixed | other project on shared account |
8 Code ↔ Reality Drift
| Item | CDK / docs | Actual AWS | Interpretation |
|---|---|---|---|
| EC2 instance | not defined | rfq-verify-oneshot 1× (running 1 month) | outside IaC |
| Aurora engine | 17.5 | 17.7 | auto minor upgrade |
| Scheduler count | ~55 | 74 | multi-industry crons added |
| Web hosting (PLAN doc) | "App Runner" | Lambda (OpenNext) | doc stale — actually Lambda |
| DB version (PLAN doc) | "PG 18" | 17.7 | doc stale |
⚠️ rfq-verify-oneshot — orphaned EC2 outside the code
i-05152be67822f4cc5 · t4g.nano · public IP 3.38.198.6 · IAM rfq-migrate-ec2-role · launched 2026-05-22. Origin: an aws ec2 run-instances command in docs/aws/04-deploy-runbook.md:115, which manually created a one-shot DB migration box. It has since been hardcoded as the in-VPC Aurora SSM tunnel host in scripts/db-tunnel.sh:29 and left running. Net effect: outside CDK management, named "oneshot" yet running 24/7, and exposed on a public IP. Candidate for cleanup (mind the db-tunnel dependency).
9 Findings
#1 rfq-qwen-clean cron failing every hour from schema drift (ongoing)
The hourly schedule fails on every run with column "created_at" does not exist → 3,396 messages piled into the async DLQ. 36 errors in 24h, ~3/hour, latest failure 2026-06-23 00:55 UTC. The next run will fail too.
#2 Orphaned EC2 rfq-verify-oneshot running outside the code
IaC drift + public IP exposure. See §Drift. Cost is negligible but cleanup is advised on security/hygiene grounds.
#3 Local dev doesn't work per README — hardcoded ssl:'require'
apps/web/lib/db/index.ts:20 forces SSL (it assumes Aurora), so the app can't connect to the local Docker pgvector instance, which runs without SSL; the login flow's rate_limit query then returns a 500. Worked around with a localhost branch.
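The localhost branch can be sketched as a small host check. The real file is TypeScript; the logic is shown here in Python only as an illustration. The function name, the set of local hosts, the example URLs and the `"disable"`/`"require"` return values are assumptions rather than the project's actual code.

```python
from urllib.parse import urlparse

# Assumption: these hosts identify a local, SSL-less development database.
LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}


def ssl_mode_for(database_url: str) -> str:
    """Require SSL everywhere except for a database on the local machine."""
    host = urlparse(database_url).hostname or ""
    return "disable" if host in LOCAL_HOSTS else "require"


local_mode = ssl_mode_for("postgres://dev:dev@localhost:5432/rfq")
remote_mode = ssl_mode_for("postgres://app:secret@db.example.internal:5432/rfq")
```

Defaulting to `"require"` keeps any unrecognized host on the safe side.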
#4 Aurora ACU alarm (minor) · research cron disabled
ACU 4.1>4.0 is a normal burst scale-up. research-cron-hourly being DISABLED looks like an intentional pause (Serper credits exhausted per todolater.md).
10 Tech Stack Summary
apps/web (Next.js 16)
Next 16.2 App Router (RSC · Server Actions) · React 19.2 · Drizzle ORM (28 migrations) · postgres-js · Auth.js v5 (email+password, JWT 7d) · TanStack Table/Query · Tailwind 4 · pgvector HNSW · partner API /api/v1/* (Qwen embed + rerank) · SSE NL search.
apps/workers (Python 3.12)
uv · curl_cffi (chrome TLS spoofing) · Camoufox/Patchright · selectolax · asyncpg (direct, shares Drizzle schema) · anthropic/google-genai · pgvector · Pydantic v2 · ~34k LOC. 10 scrapers + ~40 loaders + classify/research/match/dedup.
infra (AWS CDK v2 TS)
6 stacks (network·secrets·data·workers·web·observability) + us-east-1 edge (WAF). Lambda (Zip+Docker arm64) · Fargate · Aurora v2 · SQS · S3 · DynamoDB · Step Functions · EventBridge Scheduler · KMS.
CI/CD (.github/workflows)
web · workers · infra · deploy · security. Schema parity test (Drizzle↔information_schema) · TruffleHog/Semgrep · OpenNext build → CDK deploy (serialized) → DB Migrator Lambda.
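A schema parity check of the kind listed above can be reduced to a set comparison. The sketch below assumes both sides have already been loaded into `{table: {columns}}` mappings. The live side could come from `SELECT table_name, column_name FROM information_schema.columns WHERE table_schema = 'public'`. The function name and the `rfqs` table are hypothetical.

```python
def schema_diff(
    expected: dict[str, set[str]],
    actual: dict[str, set[str]],
) -> dict[str, dict[str, list[str]]]:
    """Return, per table, columns missing from the DB and columns not in code."""
    report: dict[str, dict[str, list[str]]] = {}
    for table in sorted(expected.keys() | actual.keys()):
        want = expected.get(table, set())
        have = actual.get(table, set())
        missing, extra = sorted(want - have), sorted(have - want)
        if missing or extra:
            report[table] = {"missing_in_db": missing, "extra_in_db": extra}
    return report


# Code expects created_at, but the live table lacks it.
expected_schema = {"rfqs": {"id", "url", "created_at"}}
live_schema = {"rfqs": {"id", "url"}}
drift = schema_diff(expected_schema, live_schema)
```

An empty report means the two sides agree, which is the condition a CI job would assert on.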