Case study · delivered
GenAI knowledge platform at Sage.
A multi-agent RAG platform serving Sage's product teams across multi-region AWS: Bedrock-powered semantic search, semantic caching, multilingual support, and a 15-metric LLM evaluation harness gating every release.
- role
- Mobile Product Lead
- client
- Sage UK · via TEKsystems
- period
- Jan 2024 — Apr 2026
- stack
- Bedrock · LangGraph · OpenSearch
The problem
Sage's product teams needed an enterprise-grade GenAI knowledge platform serving multiple products across multi-region AWS infrastructure. Customer queries were taking hours to resolve, content was multilingual and growing faster than humans could curate, and the cost and quality of any LLM-backed answer experience had to hold up under enterprise audit.
The brief was to build a multi-agent RAG system that met the latency, accuracy, and compliance bar of a 50-year-old enterprise software vendor, and to prove it on every release.
Constraints
- Multilingual coverage across 7+ locales. Answer quality cannot regress in any language.
- Multi-region AWS, with data residency, OIDC federation, and zero secret material in CI.
- Conversational latency budget. Token cost and inference time are both treated as first-class metrics.
- Audit-clean by default. No security findings tolerated at any release boundary.
Approach
01 · Ingestion
Distributed nightly ingestion on Step Functions with 10 concurrent maps, embedding 500k+ documents into an OpenSearch hybrid vector index. Multi-region CDK stacks; GitHub Actions CI/CD over OIDC; no long-lived secrets.
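A minimal sketch of what one ingestion worker iteration could look like, assuming a Titan embedding model on Bedrock and the opensearch-py client; the model id, index name, and document fields are illustrative placeholders, not the production values.

```python
import json

import boto3
from opensearchpy import OpenSearch, helpers

# Illustrative placeholders: the real pipeline's model id, index name,
# and document schema are not shown in the case study.
EMBED_MODEL_ID = "amazon.titan-embed-text-v2:0"
INDEX_NAME = "kb-hybrid-v1"

bedrock = boto3.client("bedrock-runtime")


def embed(text: str) -> list[float]:
    """Embed one chunk of text with a Bedrock embedding model."""
    resp = bedrock.invoke_model(
        modelId=EMBED_MODEL_ID,
        body=json.dumps({"inputText": text}),
    )
    return json.loads(resp["body"].read())["embedding"]


def index_batch(client: OpenSearch, docs: list[dict]) -> None:
    """Bulk-index one worker's batch: text fields for BM25, a vector for k-NN."""
    actions = (
        {
            "_index": INDEX_NAME,
            "_id": doc["id"],
            "_source": {
                "title": doc["title"],
                "body": doc["body"],               # lexical side of hybrid search
                "locale": doc["locale"],
                "embedding": embed(doc["body"]),   # vector side of hybrid search
            },
        }
        for doc in docs
    )
    helpers.bulk(client, actions)
```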
02 · Multi-agent RAG
FastAPI + LangGraph orchestration with 6 concurrent agents handling retrieval, ranking, synthesis, and follow-up disambiguation. Semantic caching short-circuits repeat work. Claude on Bedrock is the answer model.
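A hedged sketch of the orchestration shape, assuming LangGraph's StateGraph API: a semantic-cache check that can short-circuit straight to the end, then retrieval, ranking, and synthesis. The production graph runs six agents (including follow-up disambiguation); the node bodies here are stubs.

```python
from typing import TypedDict

from langgraph.graph import END, StateGraph


class AgentState(TypedDict):
    query: str
    cached: bool
    contexts: list[str]
    answer: str


# Node bodies are stubs; the production agents call the semantic cache,
# OpenSearch, and Claude on Bedrock behind these names.
def check_cache(state: AgentState) -> dict:
    return {"cached": False}                 # look up the query in the semantic cache


def retrieve(state: AgentState) -> dict:
    return {"contexts": []}                  # hybrid search against OpenSearch


def rank(state: AgentState) -> dict:
    return {"contexts": state["contexts"]}   # re-rank chunks before synthesis


def synthesize(state: AgentState) -> dict:
    return {"answer": ""}                    # Claude drafts the grounded answer


graph = StateGraph(AgentState)
for name, fn in [("check_cache", check_cache), ("retrieve", retrieve),
                 ("rank", rank), ("synthesize", synthesize)]:
    graph.add_node(name, fn)

graph.set_entry_point("check_cache")
# A semantic-cache hit ends the run early; a miss falls through to retrieval.
graph.add_conditional_edges(
    "check_cache",
    lambda s: "hit" if s["cached"] else "miss",
    {"hit": END, "miss": "retrieve"},
)
graph.add_edge("retrieve", "rank")
graph.add_edge("rank", "synthesize")
graph.add_edge("synthesize", END)

app = graph.compile()
```

Calling `app.invoke({"query": "..."})` then touches retrieval only on a cache miss; the FastAPI layer would sit in front of this compiled graph.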
03 · Evaluation
A 15-metric LLM eval framework covering groundedness, relevance, latency, token cost, and locale-specific accuracy. Every prompt and model change is gated against the suite, with drift alerts on production traffic.
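A simplified illustration of the gating idea rather than the real harness: score every eval case on each metric, aggregate per locale, and fail the release if any metric dips below its threshold. The metric names, thresholds, and the `answer_fn` hook are assumptions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    question: str
    reference: str
    locale: str


# A metric scores one (case, answer) pair on a 0..1 scale.
Metric = Callable[[EvalCase, str], float]


def release_gate(
    cases: list[EvalCase],
    answer_fn: Callable[[EvalCase], str],
    metrics: dict[str, Metric],
    thresholds: dict[str, float],
) -> bool:
    """Pass only if every metric clears its threshold in every locale."""
    scores: dict[tuple[str, str], list[float]] = {}
    for case in cases:
        answer = answer_fn(case)
        for name, metric in metrics.items():
            scores.setdefault((name, case.locale), []).append(metric(case, answer))

    passed = True
    for (name, locale), values in sorted(scores.items()):
        mean = sum(values) / len(values)
        if mean < thresholds[name]:
            print(f"FAIL {name} [{locale}]: {mean:.2f} < {thresholds[name]:.2f}")
            passed = False
    return passed
```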
04 · Observability
OpenTelemetry across the stack: agent traces, retrieval scoring, token spend per route. Audit posture clean enough to land zero security findings on review.
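Roughly how a span around the answer-model call might record per-route latency and token spend with the OpenTelemetry Python SDK; the attribute names and the `call_claude` helper are illustrative, not the production schema.

```python
from opentelemetry import trace

tracer = trace.get_tracer("rag.agents")


def call_claude(prompt: str) -> tuple[str, dict]:
    """Stub for the Bedrock call; returns the answer text and token usage."""
    return "", {"input_tokens": 0, "output_tokens": 0}


def synthesize_with_tracing(route: str, prompt: str) -> str:
    # One span per answer-model call so latency and token spend can be
    # aggregated per route on the collector side.
    with tracer.start_as_current_span("synthesize") as span:
        span.set_attribute("rag.route", route)
        answer, usage = call_claude(prompt)
        span.set_attribute("llm.tokens.input", usage["input_tokens"])
        span.set_attribute("llm.tokens.output", usage["output_tokens"])
        return answer
```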
Outcomes
- customer query time
- 2 hr → 10 min · automated RAG workflows replacing manual research
- inference cost
- −35% · eval-driven prompt & model selection
- nightly ingestion
- 500k+ docs · Step Functions, 10 concurrent maps
- locale coverage
- 7+ locales · per-locale eval gating
- eval coverage
- 15 metrics · groundedness, relevance, latency, token cost
- security audit
- 0 findings · enterprise compliance review
Stack
Bedrock (Claude) · LangGraph · FastAPI · OpenSearch · Step Functions · AWS CDK · GitHub Actions (OIDC) · OpenTelemetry
Reflections
The 15-metric eval harness mattered more than the agent graph or the embedding model. Without it, every prompt change felt like a coin flip. With it, iteration speed compounded across 7 locales.
If I were starting again, I would build the eval harness first and the multi-agent retrieval second. At enterprise scale, an honest answer to "did this change help" is what earns trust, especially when an audit is the gate to production.