Case study · delivered

GenAI knowledge platform at Sage.

A multi-agent RAG platform serving Sage's product teams across multi-region AWS: Bedrock-powered semantic search, semantic caching, multilingual support, and a 15-metric LLM evaluation harness gating every release.

role
Mobile Product Lead
client
Sage UK · via TEKsystems
period
Jan 2024 — Apr 2026
stack
Bedrock · LangGraph · OpenSearch
sage · rag · semantic-search · architecture

The problem

Sage's product teams needed an enterprise-grade GenAI knowledge platform serving multiple products across multi-region AWS infrastructure. Customer queries were taking hours to resolve, content was multilingual and growing faster than humans could curate, and the cost and quality of any LLM-backed answer experience had to hold up under enterprise audit.

The brief was to build a multi-agent RAG system that met the latency, accuracy, and compliance bar of a 50-year-old enterprise software vendor, and to prove it on every release.

Constraints

  • Multilingual coverage across 7+ locales. Answer quality cannot regress in any language.
  • Multi-region AWS, with data residency, OIDC federation, and zero secret material in CI.
  • Conversational latency budget. Token cost and inference time are both treated as first-class metrics.
  • Audit-clean by default. No security findings tolerated at any release boundary.

Approach

01 · Ingestion

Distributed nightly ingestion on Step Functions with 10 concurrent maps, embedding 500k+ documents into an OpenSearch hybrid vector index. Multi-region CDK stacks; GitHub Actions CI/CD over OIDC; no long-lived secrets.
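
Below is a minimal sketch of the per-batch work one Map iteration does, assuming a Titan embedding model on Bedrock and the opensearch-py client. The model ID, endpoint, index name, and field names are illustrative, not the production values.

```python
# Sketch of one Map-state worker: embed a batch of documents and bulk-index
# them into an OpenSearch hybrid (keyword + knn_vector) index.
# Model ID, host, index, and field names are illustrative placeholders.
import json

import boto3
from opensearchpy import OpenSearch, helpers

bedrock = boto3.client("bedrock-runtime", region_name="eu-west-1")
search = OpenSearch(hosts=[{"host": "vectors.example.internal", "port": 443}],
                    use_ssl=True)

INDEX = "kb-hybrid-v1"                        # BM25 text field + knn_vector field
EMBED_MODEL = "amazon.titan-embed-text-v2:0"  # assumed embedding model


def embed(text: str) -> list[float]:
    """Fetch a single embedding vector from Bedrock."""
    resp = bedrock.invoke_model(modelId=EMBED_MODEL,
                                body=json.dumps({"inputText": text}))
    return json.loads(resp["body"].read())["embedding"]


def index_batch(docs: list[dict]) -> None:
    """Bulk-upsert one batch of documents with their embeddings."""
    actions = ({
        "_op_type": "index",
        "_index": INDEX,
        "_id": doc["id"],
        "text": doc["text"],
        "locale": doc["locale"],
        "embedding": embed(doc["text"]),
    } for doc in docs)
    helpers.bulk(search, actions)
```

Keeping the raw text and the embedding on the same document is what makes the index hybrid: one store serves both lexical (BM25) and vector (k-NN) retrieval.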

02 · Multi-agent RAG

FastAPI + LangGraph orchestration with 6 concurrent agents handling retrieval, ranking, synthesis, and follow-up disambiguation. Semantic caching short-circuits repeat work. Claude on Bedrock is the answer model.
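
The sketch below shows the shape of that orchestration in LangGraph, collapsed to three nodes. The helper functions, node names, and state fields are illustrative stand-ins, not the production graph (which runs six agents plus ranking and disambiguation).

```python
# Reduced LangGraph sketch: semantic-cache check, retrieval, then synthesis.
# Helpers below are stand-ins for the real cache, OpenSearch, and Bedrock calls.
from typing import TypedDict

from langgraph.graph import END, StateGraph


class RagState(TypedDict):
    question: str
    documents: list[str]
    answer: str
    cache_hit: bool


def semantic_cache_lookup(question: str) -> str | None:
    """Stand-in for the embedding-similarity cache; a hit returns a prior answer."""
    return None


def hybrid_search(question: str) -> list[str]:
    """Stand-in for OpenSearch hybrid (BM25 + vector) retrieval."""
    return ["doc-1", "doc-2"]


def claude_answer(question: str, documents: list[str]) -> str:
    """Stand-in for the Claude-on-Bedrock synthesis call."""
    return "grounded answer"


def check_cache(state: RagState) -> dict:
    cached = semantic_cache_lookup(state["question"])
    return {"cache_hit": cached is not None, "answer": cached or ""}


def retrieve(state: RagState) -> dict:
    return {"documents": hybrid_search(state["question"])}


def synthesize(state: RagState) -> dict:
    return {"answer": claude_answer(state["question"], state["documents"])}


graph = StateGraph(RagState)
graph.add_node("check_cache", check_cache)
graph.add_node("retrieve", retrieve)
graph.add_node("synthesize", synthesize)
graph.set_entry_point("check_cache")
graph.add_conditional_edges(
    "check_cache",
    lambda s: "hit" if s["cache_hit"] else "miss",
    {"hit": END, "miss": "retrieve"},   # a cache hit short-circuits retrieval
)
graph.add_edge("retrieve", "synthesize")
graph.add_edge("synthesize", END)
app = graph.compile()

result = app.invoke({"question": "How do I reconcile a bank feed?",
                     "documents": [], "answer": "", "cache_hit": False})
```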

03 · Evaluation

A 15-metric LLM eval framework covering groundedness, relevance, latency, token cost, and locale-specific accuracy. Every prompt and model change is gated against the suite, with drift alerts on production traffic.
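
To make the gating concrete, here is a stripped-down illustration of the idea: score a candidate prompt or model change against per-metric thresholds and fail CI on any regression. The metric names and thresholds are placeholders, not the real suite.

```python
# Illustrative release gate: a prompt or model change only ships if every
# metric clears its threshold. Metrics and thresholds here are placeholders.
from dataclasses import dataclass


@dataclass
class MetricResult:
    name: str
    value: float
    threshold: float
    higher_is_better: bool = True

    def passed(self) -> bool:
        return (self.value >= self.threshold if self.higher_is_better
                else self.value <= self.threshold)


def gate_release(results: list[MetricResult]) -> None:
    failures = [r for r in results if not r.passed()]
    for r in failures:
        print(f"FAIL {r.name}: {r.value:.3f} vs threshold {r.threshold:.3f}")
    if failures:
        raise SystemExit(1)   # non-zero exit blocks the release in CI


if __name__ == "__main__":
    gate_release([
        MetricResult("groundedness", 0.93, 0.90),
        MetricResult("relevance", 0.88, 0.85),
        MetricResult("p95_latency_s", 2.4, 3.0, higher_is_better=False),
        MetricResult("tokens_per_answer", 1800, 2500, higher_is_better=False),
    ])
```

Running the same gate per locale is what stops a prompt tweak that helps one language from quietly degrading another.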

04 · Observability

OpenTelemetry across the stack: agent traces, retrieval scoring, token spend per route. Audit posture clean enough to land zero security findings on review.
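
A minimal sketch of what that instrumentation looks like with the standard OpenTelemetry Python SDK; the span and attribute names are illustrative, and the retrieval and model calls are stubbed.

```python
# Minimal tracing sketch: one span per agent step, with retrieval score and
# token spend recorded as span attributes. Names are illustrative.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("rag.agents")


def answer_question(question: str) -> str:
    with tracer.start_as_current_span("rag.answer") as span:
        span.set_attribute("rag.route", "default")

        with tracer.start_as_current_span("rag.retrieve") as retrieve_span:
            docs, top_score = ["doc-1", "doc-2"], 0.82      # stand-in retrieval
            retrieve_span.set_attribute("rag.retrieval.top_score", top_score)
            retrieve_span.set_attribute("rag.retrieval.doc_count", len(docs))

        with tracer.start_as_current_span("rag.synthesize") as synth_span:
            answer, tokens = "stub answer", 1234             # stand-in model call
            synth_span.set_attribute("llm.tokens.total", tokens)

        return answer


if __name__ == "__main__":
    print(answer_question("How do I reconcile a bank feed?"))
```

In production the console exporter would be swapped for an OTLP exporter pointing at a collector, so token spend can be aggregated per route.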

Outcomes

  • Customer query time: 2 hr → 10 min · automated RAG workflows replacing manual research
  • Inference cost: −35% · eval-driven prompt & model selection
  • Nightly ingestion: 500k+ docs · Step Functions, 10 concurrent maps
  • Locale coverage: 7+ locales · per-locale eval gating
  • Eval coverage: 15 metrics · groundedness, relevance, latency, token cost
  • Security audit: 0 findings · enterprise compliance review

Stack

AWS Bedrock · Claude · FastAPI · LangGraph · OpenSearch · Step Functions · Lambda · EventBridge · CDK · OIDC · OpenTelemetry · RAG

Reflections

The 15-metric eval harness mattered more than the agent graph or the embedding model. Without it, every prompt change felt like a coin flip. With it, iteration speed compounded across 7 locales.

If I were starting again, I would build the eval harness first and the multi-agent retrieval second. At enterprise scale, an honest answer to "did this change help?" is what earns trust, especially when an audit is the gate to production.