RFQ Tracker — Infrastructure & Pipeline Audit

Multi-industry (beauty-first) B2B buyer RFQ tracker. Scrapes 7+ marketplaces 24/7 → AI classification, embedding & dedup → research agent enriches contacts → internal Next.js tool for lead management. AWS Seoul (ap-northeast-2) · 100% serverless.

📦 FINGU-GRINDA/rfq-system · ☁️ AWS acct 793175550504 · 🌏 ap-northeast-2 · 🟢 6/6 stacks deployed · 🗓️ Audited 2026-06-23

0. At a glance

Compute is not on EC2: everything runs on managed serverless. Web on Lambda (OpenNext) + CloudFront, workers on 57 Lambda containers + 3 Fargate tasks, DB on Aurora Serverless v2.

6/6 CDK stacks: all UPDATE_COMPLETE
58 Lambda functions: 1 web (Zip) + 57 workers (Image) · arm64
74 EventBridge crons: scrapers · loaders · pollers
0 → 16 Aurora ACU: scale-to-zero, PG 17.7
3 Fargate tasks: re-enrich · re-classify (2 RUNNING)
9 + DLQ SQS queues: incl. 2 FIFO
5 S3 buckets: corpus · exports · web ×3
3,396 DLQ backlog (bug): qwen-clean failing hourly

Live service health (measured)

GET https://d2bep41lgzej60.cloudfront.net/api/health → ok:true db:ok sqs:configured s3:configured · scrapers active (MIC 26 invocations / 24h).

1. Data pipeline (end-to-end)

Scrape → normalize → AI classify → embed → dedup → store → (web read + research-agent enrichment). A Step Functions Distributed Map orchestrates per source.

Schedule: per-source cron triggers (EventBridge ×74)
🕷️ Collect: TLS-spoofing scrapers / browser (Lambda + Fargate)
🧹 Normalize: HTML→structured RFQ, raw to S3 (selectolax)
🏷️ Classify: relevance · quality · intent (Haiku 4.5 Batch)
🧮 Embed: 1536-dim halfvec (Gemini / Qwen)
🔗 Dedup: FIFO + advisory lock, ANN (pgvector HNSW)
🗄️ Store: idempotent ON CONFLICT insert (Aurora v2)
🔬 Enrich: email/contact research + verify (Sonnet 4.6 + MV)
🖥️ Browse: lead hub · AI search · SSE (Next.js 16)
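
The Store stage's idempotent insert can be sketched roughly as follows. This is a minimal illustration, not the project's code: Python's built-in sqlite3 stands in for Aurora PostgreSQL (both accept ON CONFLICT ... DO NOTHING), and the table name `rfq` and the `title` column are hypothetical. Only the UNIQUE(source_id, url) key comes from the audit.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """In-memory SQLite stand-in for the Aurora table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE rfq (
            source_id TEXT NOT NULL,
            url       TEXT NOT NULL,
            title     TEXT,
            UNIQUE (source_id, url)
        )
        """
    )
    return conn


def insert_rfq(conn: sqlite3.Connection, source_id: str, url: str, title: str) -> bool:
    """Insert a row once; a replayed message hits the UNIQUE key and is skipped.

    Returns True when a new row was written, False when it already existed.
    """
    cur = conn.execute(
        "INSERT INTO rfq (source_id, url, title) VALUES (?, ?, ?) "
        "ON CONFLICT (source_id, url) DO NOTHING",
        (source_id, url, title),
    )
    conn.commit()
    return cur.rowcount == 1
```

Because the conflict is resolved inside the database, a scraper retry or a redelivered queue message can call `insert_rfq` again without producing a duplicate row.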

AI models (per stage)

Classification / match coarse: claude-haiku-4-5
Research / match fine: claude-sonnet-4-6
Selector self-heal / retry: claude-opus-4-8
Dedup embedding: gemini-embedding-001 (halfvec 1536)
Partner search embed/rerank: Qwen3-Embedding-8B / Qwen3.6

Reliability design (from code)

Classification decoupling: Batch submit/poll split (up to 24h, SFN non-blocking)
Idempotent inserts: UNIQUE(source_id,url) + ON CONFLICT
Dedup race: SQS FIFO MessageGroupId + advisory lock
Circuit breaker: per-source DynamoDB (5 fails → open)
Budget killswitch: disable scheduler when daily > $100
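
As an illustration of the per-source circuit breaker listed above, here is a minimal sketch. It assumes the counter tracks consecutive failures and that any success resets it; the audit states only the threshold (5 fails → open), so the reset policy and all class and method names are hypothetical. A plain dict stands in for the DynamoDB table.

```python
from dataclasses import dataclass, field

FAILURE_THRESHOLD = 5  # from the audit: 5 fails -> open


@dataclass
class CircuitBreaker:
    """Per-source failure counter; the dict is a stand-in for DynamoDB."""

    threshold: int = FAILURE_THRESHOLD
    failures: dict[str, int] = field(default_factory=dict)

    def is_open(self, source: str) -> bool:
        """An open breaker means the scraper for this source should be skipped."""
        return self.failures.get(source, 0) >= self.threshold

    def record_failure(self, source: str) -> None:
        self.failures[source] = self.failures.get(source, 0) + 1

    def record_success(self, source: str) -> None:
        # Assumption: a single success closes the breaker again.
        self.failures.pop(source, None)
```

Keying the state by source means one blocked marketplace stops only its own scraper while the others keep running.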

2. Compute: "Is it EC2?" code↔reality

Conclusion: not an EC2 deployment. Zero ec2.Instance in the CDK. Every CDK-defined workload runs on Lambda / Fargate / Aurora Serverless; the one manually created EC2 instance is covered in §Drift.

Workload | Compute | CDK definition (file:line) | Measured in AWS
Web app | Lambda · OpenNext | web-stack.ts:170 NODEJS_22 arm64 | rfq-web-server-prod Zip/nodejs22.x/arm64/1024MB/60s
Web delivery path | CloudFront→Lambda URL | web-stack.ts:489,498 | CF E2FMEFMSZOHFB6 · d2bep41lgzej60 · Deployed
12 scrapers | Lambda container | worker-stack.ts:69–81 DockerImageFunction | rfq-scrape-mic Image/arm64/1024MB/600s
~40 loaders | Lambda container | worker-stack.ts:83–93 | rfq-load-* (many)
Classify · research · match · dedup | Lambda container | worker-stack.ts:94–170 | rfq-research-handler Image/arm64/900s
Long-running batch (re-enrich/re-classify) | Fargate | worker-stack.ts:611–833 FargateTaskDefinition 2vCPU/4GB | cluster rfq-reenrich-prod · FARGATE 2 RUNNING (family rfq-classify-prod)
Database | Aurora Serverless v2 | data-stack.ts:98–113 min0/max16 | rfq-aurora-prod available · PG17.7 · db.serverless
Direct EC2 instance | none in CDK | 0 definitions | ⚠️ 1 exists (rfq-verify-oneshot), see §Drift

All 58 Lambdas are arm64 (Graviton). Only the web is Zip (OpenNext bundle); workers are all container images. Packages/images are stored in the ECR cdk-hnb659fds-container-assets-* shared asset repo.

3. Network & Database

Resource | Spec / measured | CDK source | State
VPC | vpc-0b9be81d493ff39b1 · 10.0.0.0/16 · rfq-vpc-prod · 3 AZ | network-stack.ts:26 | active
Subnets | Public + Private-with-egress | network-stack.ts:32–37 | active
NAT Gateway | nat-0fdfe4b724e908cbd (cost-optimized) | network-stack.ts:30 | available
VPC Endpoint | none (NAT path only) | network-stack.ts comment | unused
Aurora cluster | rfq-aurora-prod · Serverless v2 · 0–16 ACU · 1h auto-pause | data-stack.ts:98–113 | available
Engine version | PostgreSQL 17.7 (code 17.5 → auto minor upgrade) | data-stack.ts:103 | drift
RDS Proxy | none; Lambda connects directly (accepts cold start) | data-stack.ts comment | by design
Backup / encryption | PITR 7d · storageEncrypted KMS CMK | data-stack.ts:117–120 | on
Extensions | pgvector(halfvec/HNSW) · pg_trgm · pgcrypto | migration 0000 | installed

4. Orchestration & Schedule

Resource | Measured | CDK source | Notes
Step Functions | rfq-scrape-workflow-prod (STANDARD) | state-machines/scrape-workflow.ts | 26h timeout (Batch 25h + buffer), WAIT_FOR_TASK_TOKEN
EventBridge Scheduler | 74 schedules (mostly ENABLED) | worker-stack.ts:1128,1191 (code ~55) | grew with multi-industry expansion
Disabled schedule | rfq-research-cron-hourly DISABLED | | research paused (Serper credits, suspected)
Representative crons | go4world 3h · ec21 12h · bizmaps-jp daily · us-fda monthly · classifier-poll 15min · qwen-clean hourly | scheduler group | per-source cadence
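
The WAIT_FOR_TASK_TOKEN pattern noted for the scrape workflow pairs naturally with the classifier poller: the state machine pauses, and a later poll resumes it once the batch has finished. The sketch below shows one way that hand-off could look. `send_task_success` and `send_task_failure` are methods of the boto3 Step Functions client, but the function name, the status strings ("in_progress", "ended") and the error name are hypothetical placeholders, and the client is injected so the example runs without AWS access.

```python
import json
from typing import Any, Protocol


class SfnClient(Protocol):
    """The subset of the boto3 Step Functions client this sketch relies on."""

    def send_task_success(self, *, taskToken: str, output: str) -> Any: ...

    def send_task_failure(self, *, taskToken: str, error: str, cause: str) -> Any: ...


def resume_workflow(sfn: SfnClient, task_token: str, batch_status: str, result: dict) -> str:
    """Called on each poll; resumes the paused state only when the batch is finished."""
    if batch_status == "in_progress":  # hypothetical status value
        return "waiting"  # leave the execution paused; the next poll checks again
    if batch_status == "ended":  # hypothetical status value
        sfn.send_task_success(taskToken=task_token, output=json.dumps(result))
        return "resumed"
    # Any other status is treated as a failure in this sketch.
    sfn.send_task_failure(taskToken=task_token, error="BatchFailed", cause=batch_status)
    return "failed"
```

With this split the state machine does not hold a Lambda open while the batch runs, which is why the workflow timeout has to exceed the longest expected batch duration.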

5. Storage & Messaging

S3 (5) · DynamoDB · ECR

rfq-corpus-data-prod: raw HTML · corpus
rfq-exports-prod: CSV exports (30d expiry)
rfq-web-assets-prod: static assets (CF)
rfq-web-cache-prod: OpenNext ISR cache
rfq-web-cf-logs-prod: CloudFront logs (90d)
rfq-web-tag-cache-prod: DynamoDB · OpenNext tag cache
cdk-hnb659fds-...assets: ECR · container images

SQS (9 + DLQ)

rfq-scrape-request-prod: Standard
rfq-dedupe-prod.fifo: FIFO
rfq-research-queue-prod: Standard
rfq-classifier-batch-prod: Standard
rfq-match-candidate-prod: Standard
rfq-campaign-events-prod: SES events
rfq-sourcing-run-prod: Standard
rfq-web-revalidation-prod.fifo: FIFO
rfq-lambda-async-dlq-prod: 3,396 backlog

6. Security · Secrets · Edge

KMS · Secrets Manager

KMS CMK: 4 (Aurora · S3 · Secrets · Logs), auto-rotate
Secret: rfq/prod/oauth-google
Secret: rfq/prod/db-creds (30d rotation)

WAFv2 (us-east-1)

4 managed rule sets (Common · BadInputs · IpReputation · AnonIP) + rate limit 1000 req/5min/IP. CloudFront scope.
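
To make the rate-limit figure concrete, the sketch below models "1000 requests per 5 minutes per IP" as a sliding window. It illustrates the rule's semantics only and is not how AWS WAF evaluates rate-based rules internally; the class and method names are hypothetical.

```python
from collections import defaultdict, deque

LIMIT = 1000          # requests per window, from the audit
WINDOW_SECONDS = 300  # 5 minutes, from the audit


class SlidingWindowLimiter:
    """Tracks request timestamps per IP and rejects once the window is full."""

    def __init__(self, limit: int = LIMIT, window: float = WINDOW_SECONDS) -> None:
        self.limit = limit
        self.window = window
        self.hits: dict[str, deque] = defaultdict(deque)

    def allow(self, ip: str, now: float) -> bool:
        q = self.hits[ip]
        # Drop timestamps that have aged out of the trailing window.
        while q and q[0] <= now - self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True
```

At the audited setting a single IP averages at most about 3.3 requests per second over any 5-minute span before being blocked.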

SSM Parameter Store · /rfq/prod/* (19)

anthropic-api-key · gemini-api-key · serper-api-key · millionverifier-api-key · linkup-api-key · capsolver-api-key · companies-api-key · qwen-embed-base-url · qwen-chat-base-url · searxng-proxies · webshare-api-token · cloudflare-api-token · scrape-dispatch-map · japan-control-password · admin-email · admin-password · auth-secret · auth-url · database-url

Web and workers lazy-load these parameters from SSM at runtime, so a local environment boots without keys; only the features that need a missing key fail.
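
A lazy-loading accessor of this kind might look like the sketch below. It is an assumption-laden illustration, not the project's loader: the class name and the injected `fetch` callable are hypothetical. In AWS the callable could wrap the boto3 SSM client's `get_parameter(Name=..., WithDecryption=True)`, while locally it can return None for anything that isn't configured.

```python
from typing import Callable, Optional


class LazyParams:
    """Resolve /rfq/prod/* parameters on first use and cache them in memory."""

    def __init__(self, fetch: Callable[[str], Optional[str]], prefix: str = "/rfq/prod/") -> None:
        self._fetch = fetch  # hypothetical hook; nothing is fetched at construction
        self._prefix = prefix
        self._cache: dict[str, str] = {}

    def get(self, name: str) -> str:
        if name not in self._cache:
            value = self._fetch(self._prefix + name)
            if value is None:
                # Only the feature that asked for this key fails; startup is unaffected.
                raise LookupError(f"parameter {name} is not configured")
            self._cache[name] = value
        return self._cache[name]
```

Deferring the lookup to first use is what lets the process start with no keys present, at the cost of surfacing a missing key only when the dependent feature is exercised.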

7. Observability & Alarm State

Per-function errors/throttles + Step Functions failures + DLQ depth + Aurora ACU + classification null-rate. Mostly OK; 2 rfq alarms in ALARM.

Alarm | State | Reason (measured) | Verdict
Many per-function errors/throttles | OK | below threshold | normal
rfq-dlq-LambdaAsyncDlq-prod | ALARM | DLQ ≥1 (currently 3,396) | real bug → §Findings #1
rfq-aurora-acu-prod | ALARM | ACU 4.1 > threshold 4.0 | minor (burst scale-up)
SES-* (Complaint/Quota) | ALARM | not rfq-prefixed | other project on shared account

8. Code ↔ Reality Drift

Item | CDK / docs | Actual AWS | Interpretation
EC2 instance | not defined | rfq-verify-oneshot 1× (running 1 month) | outside IaC
Aurora engine | 17.5 | 17.7 | auto minor upgrade
Scheduler count | ~55 | 74 | multi-industry crons added
Web hosting (PLAN doc) | "App Runner" | Lambda (OpenNext) | doc stale, actually Lambda
DB version (PLAN doc) | "PG 18" | 17.7 | doc stale

⚠️ rfq-verify-oneshot — orphaned EC2 outside the code

i-05152be67822f4cc5 · t4g.nano · public IP 3.38.198.6 · IAM rfq-migrate-ec2-role · launched 2026-05-22. Origin: an aws ec2 run-instances command in docs/aws/04-deploy-runbook.md:115 that manually created a one-shot DB migration box. It has since been hardcoded as the in-VPC Aurora SSM tunnel host in scripts/db-tunnel.sh:29 and left running. Net effect: it sits outside CDK management, runs 24/7 despite the "oneshot" name, and exposes a public IP. Candidate for cleanup (mind the db-tunnel dependency).

9. Findings

🔴 #1 rfq-qwen-clean cron failing every hour from schema drift (ongoing)

The hourly schedule fails on every run with column "created_at" does not exist, and 3,396 messages have piled into the async DLQ. 36 errors in 24h, ~3/hour, latest failure 2026-06-23 00:55 UTC. The next run will fail too.

Evidence: async-dlq message body (error) + CloudWatch Errors metric · log group is IA class so direct grep is blocked (matches the constraint noted in todolater.md)
🟠 #2 Orphaned EC2 rfq-verify-oneshot running outside the code

IaC drift + public IP exposure. See §Drift. Cost is negligible but cleanup is advised on security/hygiene grounds.

Evidence: docs/aws/04-deploy-runbook.md:115 · scripts/db-tunnel.sh:29
🟠 #3 Local dev doesn't work per README: hardcoded ssl:'require'

apps/web/lib/db/index.ts:20 forces SSL on the assumption that the database is Aurora, so the app can't connect to the SSL-less local Docker pgvector container (the login flow's rate_limit query returns a 500). Worked around with a localhost branch.

Verified: after the patch all 11 pages render fine · /api/health db:ok
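
The localhost branch amounts to choosing the SSL mode from the database host. The actual patch lives in the TypeScript file named above and is not reproduced in this audit; the Python sketch below only illustrates the idea. The function name, the host set and the "disable"/"require" strings (borrowed from libpq's sslmode naming) are assumptions.

```python
from urllib.parse import urlparse

# Hosts treated as local development databases (assumed list).
LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}


def ssl_mode_for(database_url: str) -> str:
    """Require SSL everywhere except for a local database."""
    host = urlparse(database_url).hostname or ""
    return "disable" if host in LOCAL_HOSTS else "require"
```

Defaulting to "require" keeps the production behavior unchanged for any host that is not explicitly recognized as local.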
🟡 #4 Aurora ACU alarm (minor) · research cron disabled

ACU 4.1>4.0 is a normal burst scale-up. research-cron-hourly being DISABLED looks like an intentional pause (Serper credits exhausted per todolater.md).

Evidence: alarm StateReason · scheduler state · todolater.md

10. Tech Stack Summary

apps/web (Next.js 16)

Next 16.2 App Router (RSC · Server Actions) · React 19.2 · Drizzle ORM (28 migrations) · postgres-js · Auth.js v5 (email+password, JWT 7d) · TanStack Table/Query · Tailwind 4 · pgvector HNSW · partner API /api/v1/* (Qwen embed + rerank) · SSE NL search.

apps/workers (Python 3.12)

uv · curl_cffi (chrome TLS spoofing) · Camoufox/Patchright · selectolax · asyncpg (direct, shares Drizzle schema) · anthropic/google-genai · pgvector · Pydantic v2 · ~34k LOC. 10 scrapers + ~40 loaders + classify/research/match/dedup.

infra (AWS CDK v2 TS)

6 stacks (network·secrets·data·workers·web·observability) + us-east-1 edge (WAF). Lambda (Zip+Docker arm64) · Fargate · Aurora v2 · SQS · S3 · DynamoDB · Step Functions · EventBridge Scheduler · KMS.

CI/CD (.github/workflows)

web · workers · infra · deploy · security. Schema parity test (Drizzle↔information_schema) · TruffleHog/Semgrep · OpenNext build → CDK deploy (serialized) → DB Migrator Lambda.
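
A schema parity check of the kind listed here compares the columns the ORM expects with the columns the database reports. The sketch below shows the comparison step only, under stated assumptions: the function name and report keys are hypothetical, and in practice `actual` would be built from a query such as SELECT table_name, column_name FROM information_schema.columns, with `expected` derived from the Drizzle schema.

```python
def schema_drift(
    expected: dict[str, set[str]],
    actual: dict[str, set[str]],
) -> dict[str, dict[str, set[str]]]:
    """Return, per table, columns missing from the database or unknown to the ORM."""
    report: dict[str, dict[str, set[str]]] = {}
    for table in sorted(expected.keys() | actual.keys()):
        want = expected.get(table, set())
        have = actual.get(table, set())
        missing, extra = want - have, have - want
        if missing or extra:
            report[table] = {"missing_in_db": missing, "extra_in_db": extra}
    return report


# Hypothetical table name; a code path expecting created_at would be flagged here.
print(schema_drift({"rfq_example": {"id", "created_at"}}, {"rfq_example": {"id"}}))
```

An empty report means both sides agree; any entry under "missing_in_db" is the class of mismatch that surfaces at runtime as a "column does not exist" error.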

Legend: healthy / deployed · minor / drift · real bug / cleanup · unused / reference