Case study · delivered
GenAI knowledge platform at Sage.
A multi-agent RAG platform serving Sage's product teams across multi-region AWS: Bedrock-powered semantic search, semantic caching, multilingual support, and a 15-metric LLM evaluation harness gating every release.
- role
- Mobile Product Lead
- client
- Sage UK · via TEKsystems
- period
- Jan 2024 — Apr 2026
- stack
- Bedrock · LangGraph · OpenSearch
The problem
Sage's product teams needed an enterprise-grade GenAI knowledge platform serving multiple products across multi-region AWS infrastructure. Customer queries were taking hours to resolve, content was multilingual and growing faster than humans could curate, and the cost and quality of any LLM-backed answer experience had to hold up under enterprise audit.
The brief was to build a multi-agent RAG system that met the latency, accuracy, and compliance bar of a 50-year-old enterprise software vendor, and to prove it on every release.
Constraints
- Multilingual coverage across 7+ locales. Answer quality cannot regress in any language.
- Multi-region AWS, with data residency, OIDC federation, and zero secret material in CI.
- Conversational latency budget. Token cost and inference time are both treated as first-class metrics.
- Audit-clean by default. No security findings tolerated at any release boundary.
Approach
01 · Ingestion
Distributed nightly ingestion on Step Functions with 10 concurrent maps, embedding 500k+ documents into an OpenSearch hybrid vector index. Multi-region CDK stacks; GitHub Actions CI/CD over OIDC; no long-lived secrets.
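A minimal sketch of what one ingestion worker iteration could look like, assuming a Titan embedding model on Bedrock and the opensearch-py client; the model id, index name, and document fields are illustrative placeholders, not the production values.

```python
import json

import boto3
from opensearchpy import OpenSearch, helpers

# Illustrative placeholders: the real pipeline's model id, index name,
# and document schema are not shown in the case study.
EMBED_MODEL_ID = "amazon.titan-embed-text-v2:0"
INDEX_NAME = "kb-hybrid-v1"

bedrock = boto3.client("bedrock-runtime")


def embed(text: str) -> list[float]:
    """Embed one chunk of text with a Bedrock embedding model."""
    resp = bedrock.invoke_model(
        modelId=EMBED_MODEL_ID,
        body=json.dumps({"inputText": text}),
    )
    return json.loads(resp["body"].read())["embedding"]


def index_batch(client: OpenSearch, docs: list[dict]) -> None:
    """Bulk-index one worker's batch: text fields for BM25, a vector for k-NN."""
    actions = (
        {
            "_index": INDEX_NAME,
            "_id": doc["id"],
            "_source": {
                "title": doc["title"],
                "body": doc["body"],               # lexical side of hybrid search
                "locale": doc["locale"],
                "embedding": embed(doc["body"]),   # vector side of hybrid search
            },
        }
        for doc in docs
    )
    helpers.bulk(client, actions)
```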
02 · Multi-agent RAG
FastAPI + LangGraph orchestration with 6 concurrent agents handling retrieval, ranking, synthesis, and follow-up disambiguation. Semantic caching short-circuits repeat work. Claude on Bedrock is the answer model.
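A hedged sketch of the orchestration shape, assuming LangGraph's StateGraph API: a semantic-cache check that can short-circuit straight to the end, then retrieval, ranking, and synthesis. The production graph runs six agents (including follow-up disambiguation); the node bodies here are stubs.

```python
from typing import TypedDict

from langgraph.graph import END, StateGraph


class AgentState(TypedDict):
    query: str
    cached: bool
    contexts: list[str]
    answer: str


# Node bodies are stubs; the production agents call the semantic cache,
# OpenSearch, and Claude on Bedrock behind these names.
def check_cache(state: AgentState) -> dict:
    return {"cached": False}                 # look up the query in the semantic cache


def retrieve(state: AgentState) -> dict:
    return {"contexts": []}                  # hybrid search against OpenSearch


def rank(state: AgentState) -> dict:
    return {"contexts": state["contexts"]}   # re-rank chunks before synthesis


def synthesize(state: AgentState) -> dict:
    return {"answer": ""}                    # Claude drafts the grounded answer


graph = StateGraph(AgentState)
for name, fn in [("check_cache", check_cache), ("retrieve", retrieve),
                 ("rank", rank), ("synthesize", synthesize)]:
    graph.add_node(name, fn)

graph.set_entry_point("check_cache")
# A semantic-cache hit ends the run early; a miss falls through to retrieval.
graph.add_conditional_edges(
    "check_cache",
    lambda s: "hit" if s["cached"] else "miss",
    {"hit": END, "miss": "retrieve"},
)
graph.add_edge("retrieve", "rank")
graph.add_edge("rank", "synthesize")
graph.add_edge("synthesize", END)

app = graph.compile()
```

Calling `app.invoke({"query": "..."})` then touches retrieval only on a cache miss; the FastAPI layer would sit in front of this compiled graph.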
03 · Evaluation
A 15-metric LLM eval framework covering groundedness, relevance, latency, token cost, and locale-specific accuracy. Every prompt and model change is gated against the suite, with drift alerts on production traffic.
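A simplified illustration of the gating idea rather than the real harness: score every eval case on each metric, aggregate per locale, and fail the release if any metric dips below its threshold. The metric names, thresholds, and the `answer_fn` hook are assumptions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    question: str
    reference: str
    locale: str


# A metric scores one (case, answer) pair on a 0..1 scale.
Metric = Callable[[EvalCase, str], float]


def release_gate(
    cases: list[EvalCase],
    answer_fn: Callable[[EvalCase], str],
    metrics: dict[str, Metric],
    thresholds: dict[str, float],
) -> bool:
    """Pass only if every metric clears its threshold in every locale."""
    scores: dict[tuple[str, str], list[float]] = {}
    for case in cases:
        answer = answer_fn(case)
        for name, metric in metrics.items():
            scores.setdefault((name, case.locale), []).append(metric(case, answer))

    passed = True
    for (name, locale), values in sorted(scores.items()):
        mean = sum(values) / len(values)
        if mean < thresholds[name]:
            print(f"FAIL {name} [{locale}]: {mean:.2f} < {thresholds[name]:.2f}")
            passed = False
    return passed
```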
04 · Observability
OpenTelemetry across the stack: agent traces, retrieval scoring, token spend per route. Audit posture clean enough to land zero security findings on review.
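Roughly how a span around the answer-model call might record per-route latency and token spend with the OpenTelemetry Python SDK; the attribute names and the `call_claude` helper are illustrative, not the production schema.

```python
from opentelemetry import trace

tracer = trace.get_tracer("rag.agents")


def call_claude(prompt: str) -> tuple[str, dict]:
    """Stub for the Bedrock call; returns the answer text and token usage."""
    return "", {"input_tokens": 0, "output_tokens": 0}


def synthesize_with_tracing(route: str, prompt: str) -> str:
    # One span per answer-model call so latency and token spend can be
    # aggregated per route on the collector side.
    with tracer.start_as_current_span("synthesize") as span:
        span.set_attribute("rag.route", route)
        answer, usage = call_claude(prompt)
        span.set_attribute("llm.tokens.input", usage["input_tokens"])
        span.set_attribute("llm.tokens.output", usage["output_tokens"])
        return answer
```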
Outcomes
- customer query time
- 2 hr → 10 min · automated RAG workflows replacing manual research
- inference cost
- −35% · eval-driven prompt & model selection
- nightly ingestion
- 500k+ docs · Step Functions, 10 concurrent maps
- locale coverage
- 7+ locales · per-locale eval gating
- eval coverage
- 15 metrics · groundedness, relevance, latency, token cost
- security audit
- 0 findings · enterprise compliance review
Stack
Bedrock (Claude) · LangGraph · FastAPI · OpenSearch · Step Functions · AWS CDK · GitHub Actions (OIDC) · OpenTelemetry
Reflections
The 15-metric eval harness mattered more than the agent graph or the embedding model. Without it, every prompt change felt like a coin flip. With it, iteration speed compounded across 7 locales.
If I were starting again, I would build the eval harness first and the multi-agent retrieval second. At enterprise scale, an honest answer to "did this change help" is what earns trust, especially when an audit is the gate to production.