# Prathamesh Saraf

> Senior Forward Deployed Engineer · GenAI Architect

I help enterprise teams ship production GenAI: voice agents, agentic workflows, RAG, and the infrastructure to make them stick.

Location: Remote (US)
Email: pratamesh1867@gmail.com
LinkedIn: https://www.linkedin.com/in/sarafpr
X: https://x.com/S1LV3R_J1NX
GitHub: https://github.com/S1LV3RJ1NX
Toptal: https://www.toptal.com/developers/resume/prathamesh-saraf#PzaPn5
Resume: /resume.pdf

---

# Engagements

## CVS Health × TrueFoundry: Senior Forward Deployed Engineer
Feb 2024 to present · Remote (USA)

- Embedded with a Fortune-5 healthcare buyer as the technical owner of their enterprise GenAI platform: voice agents, agentic workflows, and the platform plumbing underneath.
- Led the design and rollout of a voice-agent platform that replaced legacy IVR flows; outbound deployments reached ~95% containment without escalation.
- Core contributor to Cognita, TrueFoundry's open-source RAG framework (4.4k+ GitHub stars).
- Served as forward-deployed solutions architect on enterprise GenAI engagements (large pharma, industrial): scoping, prototyping, and hand-off to customer teams.
- Built internal platforms: a meeting-intelligence and realtime voice product, an MCP registry/guardian for tool governance, and an enterprise RAG framework.

---

## ChatOwl: Technical Lead · Founding Engineer
Dec 2022 to Jan 2024 · Remote (USA)

- Founding tech lead for an AI-augmented therapeutic-sessions platform; owned roadmap, architecture, and the full backend and infra stack from zero to production.
- Shipped weekly releases across a cross-functional team of engineers, designers, and clinical advisors; established the engineering standards and review process.

---

## Indian Institute of Science (IISc): Graduate Researcher, Cloud Systems Lab
Aug 2021 to Apr 2024 · Bangalore, India

- M.Tech (Research) in Computational and Data Science at the Cloud Systems Lab; CGPA 8.1 / 10.0.
- Lead author on CARL: Cost-Optimized Online Container Placement on VMs using Adversarial RL (IEEE Transactions on Cloud Computing). Recast container-to-VM placement as adversarial RL on top of a semi-optimal vector-bin-packing teacher; the agent learns its own reward function for VM cost minimization and ends up out-performing the teacher it imitates.
- Evaluated on realistic Google and Alibaba production cluster traces (5k–10k container requests across 2k–8k VMs): ~16% lower VM cost than classic heuristics and SOTA RL baselines, ~1,900 placement decisions/sec onto ~8,900 candidate VMs, and robust to inference-time workload distribution shift.

---

## Saarthi.ai: Chatbot Developer
Aug 2020 to Aug 2021 · Bangalore, India

- Built multilingual text and IVR chatbots on RASA for BFSI and edtech customers across Hindi, English, and regional Indian languages.
- Authored an automated RASA conversation-testing harness that cut manual QA time by ~50% and made every release reproducible.
- Drove an analytics-driven lead-generation loop (~20% lift in outbound reach) and a serverless containerization migration that trimmed cloud spend by ~15%.

---

# Case studies

## Ferremundo AI: fine-tuned retrieval, image embeddings, and LangGraph ordering for B2B hardware distribution
URL: https://prathameshsaraf.com/case-studies/ferremundo-ai/
Forward Deployed Engineer · Direct client engagement · 2025 to present

An AI product-matching and ordering layer for a Latin American B2B hardware distributor. Fine-tuned multilingual E5 text embeddings, contrastively fine-tuned SigLIP-2 image search, a LangGraph agent with Postgres or Redis checkpointing, and a production stack on AWS RDS, ECS, and Datadog APM.

**Stack:** Python 3.13, FastAPI, LangGraph, PostgreSQL (AWS RDS), Alembic, sentence-transformers, fine-tuned multilingual-E5-base, fine-tuned SigLIP-2 (ONNX), BGE reranker (ONNX), rank-bm25, Polars, RapidFuzz, Hugging Face Hub, boto3, Datadog APM, AWS ECS, AWS CDK, Docker

**Outcomes:**
- Production matching pipeline reaches 64.7% top-1 / 83.7% top-5 on a labeled 374-query benchmark; the agent path holds at 64.2% / 83.2% with about 2.1s extra latency per turn.
- Multi-turn chat fixtures (53 conversations) reach 96.2% follow-up accuracy after synonym redirection and routing fixes.
- Switching to embedding-first retrieval with the fine-tuned E5 raises top-1 from 60.7% to 64.2% and top-5 from 75.7% to 77.3%, and cuts mean latency from 724 ms to 534 ms.
- Fine-tuned multilingual E5-base (published to the Hub) trained with CachedMultipleNegativesRankingLoss on ~46k GPT-4o-mini synthetic Spanish queries, ~1.2k benchmark positives, ~1.2k synonym pairs, and hard negatives mined from catalog embeddings.
- SigLIP-2 image encoder contrastively fine-tuned on 6,251 product images, exported to ONNX at ~99 ms per query on CPU; INT8 quantization collapsed Recall@5 from 64% to 13% for only a 1.3x speedup, so it was rejected and documented.
- LangGraph ReAct agent with JsonPlusSerializer checkpointing across InMemory, AsyncPostgres, and AsyncRedis backends; deterministic bypass for short and SKU-shaped first turns.
- Production worker fits in ~3.2 GB RAM with sentence-transformers, ONNX reranker, ONNX image encoder, and both text and image embedding matrices loaded in process.


## What I own
The retrieval and matching stack end to end: catalog modelling on Postgres, the embedding strategy (text and image), the FastAPI matching surface, the LangGraph ordering agent and its checkpointing model, the incremental warehouse sync that keeps the catalog and embedding matrices fresh, and the production posture (Datadog APM, structured JSON logs, AWS CDK, ECS rolling deploys). On top of matching, the agentic ordering flows that take a reseller from natural-language intent to a structured order without leaving the conversation.

## The retrieval pipeline
A hybrid pipeline with a fine-tuned multilingual E5-base as the first-class encoder. BM25 anchors rare, exact-match tokens, the dense encoder handles paraphrase and Spanish slang, a BGE cross-encoder reranks the merged pool, and a synonym and alias table corrects long-tail miscalls. Embedding-first routing skips the heavier fuzzy and reranker work for queries that confidently resolve at the encoder layer, which is what produced the +3.5 pp top-1 gain (60.7% to 64.2%) and the 190 ms latency reduction (724 ms to 534 ms). A separate Doc2query enrichment pass, which concatenates synthetic queries into catalog text, gave another lift to 81.5% top-5.
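One way to picture embedding-first routing is as a confidence gate in front of the heavier stages. The sketch below is illustrative only: the threshold values, the margin check, and the function names are assumptions, not the production code.

```python
from typing import Callable

# Hypothetical thresholds; real values would be tuned on the labeled benchmark.
MIN_TOP_SCORE = 0.85
MIN_MARGIN = 0.05

Scored = list[tuple[str, float]]  # (sku, similarity), sorted best-first


def is_confident(dense_hits: Scored) -> bool:
    """True when the top dense hit is strong and clearly ahead of the runner-up."""
    if not dense_hits:
        return False
    top = dense_hits[0][1]
    runner_up = dense_hits[1][1] if len(dense_hits) > 1 else 0.0
    return top >= MIN_TOP_SCORE and (top - runner_up) >= MIN_MARGIN


def route(query: str,
          dense_search: Callable[[str], Scored],
          full_pipeline: Callable[[str, Scored], Scored]) -> tuple[str, Scored]:
    """Return (path_taken, results); skip fuzzy and rerank work on confident queries."""
    dense_hits = dense_search(query)
    if is_confident(dense_hits):
        return "embedding_first", dense_hits
    return "full_pipeline", full_pipeline(query, dense_hits)
```

Returning the path taken alongside the results makes it cheap to log how often the fast path fires, which is the number that explains any latency change.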

## Embedding finetune (text)
Backbone `intfloat/multilingual-e5-base`, fine-tuned with CachedMultipleNegativesRankingLoss on a four-source training set: GPT-4o-mini synthetic Spanish queries (~46k), labeled benchmark positives, synonym-derived positive pairs, and hard negatives mined by nearest neighbors over catalog embeddings in Postgres. Best checkpoint chosen by cosine NDCG@10 on a held-out slice. The frozen baseline lands at Acc@1 45.3%, Acc@5 64.0%, NDCG@10 55.0%; the fine-tuned model is what unlocks the production accuracy and latency posture.
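Hard-negative mining by nearest neighbors can be sketched in a few lines. This is an in-memory stand-in that assumes toy 2-d vectors and made-up SKU names; the text above describes the real mining as running over catalog embeddings in Postgres.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mine_hard_negatives(positive_sku: str,
                        catalog: dict[str, list[float]],
                        k: int = 2) -> list[str]:
    """Return the k catalog items closest to the positive, excluding the positive itself."""
    anchor = catalog[positive_sku]
    scored = [(cosine(anchor, vec), sku)
              for sku, vec in catalog.items() if sku != positive_sku]
    scored.sort(reverse=True)
    return [sku for _, sku in scored[:k]]


# Toy "embeddings" with hypothetical SKUs: near-duplicates are the useful negatives.
catalog = {
    "hammer-16oz": [1.0, 0.0],
    "hammer-20oz": [0.95, 0.1],
    "mallet-rubber": [0.8, 0.4],
    "paint-white-1gal": [0.0, 1.0],
}
negatives = mine_hard_negatives("hammer-16oz", catalog)
```

The nearest neighbors of a positive are exactly the items the encoder currently confuses with it, so they carry more training signal than random in-batch negatives.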

## Image search (vision)
A separate pipeline trains a contrastive image-text encoder on top of `google/siglip2-base-patch16-256` using SigLIP's sigmoid loss, with image-to-text Recall@1/5/10 on a deterministic held-out 10% split. Training data is 6,251 active-SKU thumbnails downloaded from the staging API and aligned with synthetic-query text. The fine-tuned weights are exported to ONNX and serve at about 99 ms per query on CPU. INT8 quantization was tried, recorded, and rejected: Recall@5 went from 64% to 13% for only a 1.3x speedup. This is the canonical example I now reach for when someone proposes "just quantize the encoder."
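For readers unfamiliar with the sigmoid loss, here is a dependency-free sketch of the pairwise form: every image-text pair in the batch is treated as an independent binary decision, with matching pairs labeled +1 and all others -1. The temperature and bias values are illustrative defaults, and the embeddings are assumed to be L2-normalized; this is not the training code from the engagement.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def sigmoid_pair_loss(img: list[list[float]], txt: list[list[float]],
                      temperature: float = 10.0, bias: float = -10.0) -> float:
    """Pairwise sigmoid loss over a batch where img[i] matches txt[i]."""
    n = len(img)
    total = 0.0
    for i in range(n):
        for j in range(n):
            logit = temperature * sum(a * b for a, b in zip(img[i], txt[j])) + bias
            label = 1.0 if i == j else -1.0
            total += log_sigmoid(label * logit)
    return -total / n


images = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = sigmoid_pair_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = sigmoid_pair_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

Because each pair is scored independently, the loss needs no batch-wide softmax normalization, which is part of why it behaves well at modest batch sizes.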

## Agentic ordering (LangGraph)
A LangGraph ReAct loop (`llm_call` → conditional tool edges → `tool_node` → back) with two tools: `catalog_search`, which calls the real matching pipeline, and `follow_up_resolve`, which uses a `FollowUpResolver` over the last set of candidates. JsonPlusSerializer whitelists `ProductCard`, `AgentSearchResult`, and `ChatResponse` for checkpoint serialization. The checkpointer is selectable at deploy time across `InMemorySaver`, `AsyncPostgresSaver`, and `AsyncRedisSaver`. Trivial first turns (one token, a SKU-shaped string) bypass the LLM entirely; non-trivial first turns inject the top-2 catalog vocabulary hints (about 20 ms) before the model is allowed to run. The agentic path costs about 2.1s of extra latency and about 0.5 pp of top-1 accuracy (64.2% versus 64.7%), which is an acceptable trade for an ordering surface but the wrong one for a search-only surface; the architecture lets the same service expose both.
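The deterministic bypass can be sketched as a small pre-check that runs before the graph is invoked. The SKU regex and the reason labels below are hypothetical; the real SKU shape and routing targets are not described here.

```python
import re
from typing import Optional

# Hypothetical SKU shape: alphanumeric groups joined by dashes, with at least one digit.
SKU_PATTERN = re.compile(r"^(?=.*\d)[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*$")


def bypass_reason(message: str, is_first_turn: bool) -> Optional[str]:
    """Return why a turn can skip the LLM, or None if the agent should run."""
    text = message.strip()
    if not is_first_turn or not text:
        return None
    if SKU_PATTERN.match(text):
        return "sku_lookup"
    if len(text.split()) == 1:
        return "single_token_search"
    return None
```

A caller would send any non-`None` result straight to the matching pipeline and only build the agent state when the function returns `None`.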

## In-memory caching
At startup, the match service hydrates catalog rows, signals, the alias map, the BM25 index (pickled from Postgres or rebuilt), the text embedding matrix and SKU ordering, optional vocabulary and synonyms, and the ONNX reranker. The image-search service similarly loads the image embedding matrix and the SigLIP ONNX model. Total resident set lands around 3.2 GB per worker. A nightly sync job upserts warehouse data, embeddings, image embeddings, and BM25 artifacts; operators roll or restart API tasks after a sync so workers reload from the new state. Treating "the cache is reloading fat matrices after batch sync" as a first-class concern (not a side effect) is what keeps the agent honest in production.

## Production posture
Multi-stage Docker images, `ddtrace-run` wrapping the entrypoint when Datadog is enabled, AWS RDS Postgres, Alembic migrations on startup with a `SKIP_MIGRATIONS` escape hatch, S3 model sync from a `ferremundo-models` bucket, CircleCI for black/pytest plus ECR build and ECS rolling deploy to a dev cluster, and structured JSON logs with Datadog trace correlation. The health endpoint exposes embedding coverage and image index size, which is what I look at first when something has shifted.

## Lessons
B2B hardware search fails in ways leaderboard models never measure. Customers type one word or a slang synonym, and the right SKU is often buried behind a popular default, so you spend more time on disambiguation signals and alias learning than on raw retrieval tricks. Fine-tuned E5 only pays off once you stop treating embeddings as a side channel: embedding-first routing was the largest accuracy move and it lowered latency by skipping fuzzy work most of the time. Hard negatives from your own catalog matter as much as synthetic positives; otherwise the cross-encoder and the alias table fight you when expansion changes surface forms across query and title. Agents look like a small wrapper, but they need checkpointed state and deterministic bypasses for short inputs, or you burn seconds on LLM routing for every SKU lookup. Image search is not "quantize and forget"; INT8 wrecked recall with almost no speed win, which is the kind of result you only get from measuring on real product photos. The "cache" is mostly reloading fat matrices after batch sync, which is simple, but it means deployments and cron are part of the correctness story, not an afterthought.

This engagement is ongoing; this page will be updated as the system evolves.


**Links:**
- [ferremundo.com.ec (live product)](https://www.ferremundo.com.ec/)

---

## Praxium: source-grounded, narrated courses from any PDF
URL: https://prathameshsaraf.com/case-studies/praxium/
Lead engineer · Toptal engagement · 2025 to present

A SaaS that ingests a PDF and produces a structured, narrated course: personas, ABCD learning objectives, failure-mode analysis, a hierarchical outline, source-grounded per-subsection content with verbatim reference snippets, inline micro-checks (matching, ordering, fill-blank), KaTeX equations, end-of-lesson MCQs with confetti, narrated Synthesia videos, and a retrieval-grounded chat over the source. Driven by a resumable, multi-state instructional-design state machine, with a multi-dimension eval harness that scores coverage, grounding, coherence, persona alignment, and MCQ quality.

**Stack:** React 18, TypeScript, Vite, Tailwind, Radix / shadcn, TanStack Query, react-markdown + remark-math + rehype-katex, KaTeX, canvas-confetti, @xyflow/react + dagre, PostHog, FastAPI, Pydantic v2, SQLAlchemy 2 async, Alembic, PostgreSQL, pgvector, Redis, Celery, Anthropic Claude (instructional pipeline), OpenAI SDK, Cohere Embed v4 via AWS Bedrock, Landing AI ADE Parse Jobs, Synthesia, ElevenLabs, ffmpeg, LibreOffice, Playwright (slide rendering), AWS S3 (presigned URLs), Stripe (Checkout, Billing Portal, webhooks), Clerk (JWT + Svix webhook), Langfuse + OpenTelemetry (Anthropic instrumentation), Sentry

**Outcomes:**
- Live product at getpraxium.ai covering ingestion, parsing, instructional-design generation, source-grounded narrated lessons, payments, and retrieval-grounded explore mode.
- Resumable workflow state machine with explicit wait states for persona pick, instructional-goal approval, objectives review, and failure-mode analysis; checkpoints on disk plus DB-backed resume paths so 30-minute generations never replay Claude from scratch.
- Multi-dimension eval harness scoring coverage (objective keyword hit rate), grounding (verbatim snippet fuzzy-match against chunk text), coherence (LLM judge), persona alignment (LLM judge), and MCQ quality (programmatic trap-option detection, longest-option bias guard, duplicate detection).
- Source-truth grounding via Landing AI ADE chunking with grounding boxes, BM25 plus cosine hybrid retrieval over Cohere Embed v4 embeddings on AWS Bedrock, and Claude prompts that require verbatim reference snippets keyed to a chunk id on every key point.
- Learner engagement surface: inline micro-checks in three variants (matching, ordering, fill-blank), end-of-lesson MCQ pass with dual-corner canvas-confetti, KaTeX equations with `$...$` and `$$...$$` delimiters and code-fence carve-outs, persisted quiz scores, lesson progress streaks.
- Async Synthesia video generation with polling, per-key-point clips, a structured per-course videos lifecycle, and full delete and regenerate flow; ElevenLabs TTS plus ffmpeg slide movies as a multimedia worker path.
- Production observability with Langfuse plus OpenTelemetry's Anthropic instrumentor, Sentry on both Python and the frontend, and Stripe event durability via a dedicated event service.


## What I own
End-to-end product engineering: the agentic instructional-design pipeline, the workflow state machine, the eval harness, the source-truth grounding stack, the inline learner-engagement components, the video generation lifecycle, the multi-tenant data model on async SQLAlchemy and Postgres, and the API contract the frontend consumes. Stripe, Clerk, S3, and the observability wiring (Langfuse, OpenTelemetry, Sentry) are mine too.

## Workflow state machine
The README narrates a short ladder; the orchestrator behind it is materially richer. The workflow state machine exposes a multi-step DAG with explicit wait states (personas, persona-pick wait, duration wait, instructional-goal approval, objectives review, failure modes, outline, content generation, GENERATED, failures) so a 30-minute course generation never replays Claude from scratch on a transient failure. Checkpoints persist to disk next to the per-course S3 metadata, with DB-backed resume paths so the frontend can poll progress while the backend keeps working.
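A minimal sketch of the resume behaviour, assuming a reduced set of states and a JSON file as the checkpoint. The state names are a subset of those listed above, and the "LLM call" is a string stand-in; the real orchestrator has more states and also persists to the database.

```python
import json
import tempfile
from enum import Enum
from pathlib import Path


class Step(str, Enum):
    PERSONAS = "personas"
    PERSONA_PICK_WAIT = "persona_pick_wait"
    OBJECTIVES_REVIEW = "objectives_review"
    OUTLINE = "outline"
    GENERATED = "generated"


ORDER = list(Step)
WAIT_STATES = {Step.PERSONA_PICK_WAIT, Step.OBJECTIVES_REVIEW}


def load_checkpoint(path: Path) -> dict:
    if path.exists():
        return json.loads(path.read_text())
    return {"step": ORDER[0].value, "outputs": {}}


def run(path: Path, user_input: dict) -> dict:
    """Advance until a wait state lacks input or the course is GENERATED."""
    state = load_checkpoint(path)
    while Step(state["step"]) is not Step.GENERATED:
        step = Step(state["step"])
        if step in WAIT_STATES:
            if step.value not in user_input:
                break  # pause here; the caller resumes later with input
            state["outputs"][step.value] = user_input[step.value]
        else:
            state["outputs"][step.value] = f"result of {step.value}"  # stand-in for an LLM call
        state["step"] = ORDER[ORDER.index(step) + 1].value
        path.write_text(json.dumps(state))  # checkpoint after every completed step
    return state


checkpoint = Path(tempfile.mkdtemp()) / "course.json"
first = run(checkpoint, {})                                # stops at the persona pick
second = run(checkpoint, {"persona_pick_wait": "novice"})  # stops at objectives review
final = run(checkpoint, {"objectives_review": "approved"})
```

Each call reloads the checkpoint, so completed steps are never re-executed; only the work after the last saved step runs again.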

## Eval harness
The course-quality evaluator pulls finished courses from a per-course S3 prefix and scores five dimensions. Coverage is the fraction of objective stem words appearing in aggregated subsection text. Grounding is the share of key-point reference snippets that fuzzy-match the cited chunk text at a 0.8 SequenceMatcher threshold, with a substring fast path. Coherence and persona alignment use cached Claude judges driven by rubric YAML prompts. MCQ quality runs programmatic checks (trap options like "all/none of the above," longest-option bias guard, duplicate detection, minimum three options). Judge calls are cached to disk between runs. A separate retrieval study compares keyword, BM25, semantic cosine, hybrid RRF, and optional Cohere rerank against the ground-truth chunk set. Both the evaluator and the retrieval study write their reports into the repo as Markdown.
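The grounding dimension is the easiest one to show in code. The sketch below uses `difflib.SequenceMatcher` with the 0.8 threshold and a substring fast path as described; the whitespace and case normalization, the sliding window, and the `snippet` / `chunk_id` field names are assumptions made for illustration.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.8  # similarity cut-off stated in the text


def normalize(text: str) -> str:
    return " ".join(text.lower().split())


def snippet_is_grounded(snippet: str, chunk_text: str, threshold: float = THRESHOLD) -> bool:
    """Substring fast path first, then a sliding-window fuzzy match."""
    s, c = normalize(snippet), normalize(chunk_text)
    if not s:
        return False
    if s in c:
        return True
    window = len(s)
    for start in range(max(1, len(c) - window + 1)):
        if SequenceMatcher(None, s, c[start:start + window]).ratio() >= threshold:
            return True
    return False


def grounding_score(key_points: list[dict], chunks: dict[str, str]) -> float:
    """Share of key points whose snippet matches the chunk they cite."""
    if not key_points:
        return 0.0
    hits = sum(snippet_is_grounded(kp["snippet"], chunks.get(kp["chunk_id"], ""))
               for kp in key_points)
    return hits / len(key_points)


chunks = {"c1": "Entropy measures the average uncertainty of a random variable."}
key_points = [
    {"chunk_id": "c1", "snippet": "entropy measures the average uncertainty"},  # exact substring
    {"chunk_id": "c1", "snippet": "Entropy measure the average uncertainty"},   # near-verbatim
    {"chunk_id": "c1", "snippet": "Photosynthesis converts light into sugar"},  # ungrounded
]
score = grounding_score(key_points, chunks)
```

The fast path matters in practice because most well-behaved generations quote verbatim, so the quadratic fuzzy comparison only runs on the suspicious minority.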

## Source-truth grounding
PDFs are parsed by Landing AI ADE Parse Jobs into chunk-id-keyed markdown chunks with grounding boxes (page, bbox). At authoring time, a hybrid retriever (BM25 fused with cosine over Cohere Embed v4 embeddings on AWS Bedrock) selects the chunks Claude sees, and the prompt contract requires every key point to attach a verbatim reference snippet keyed to a chunk id. The eval harness then re-verifies those snippets against chunk text, so an ungrounded sentence loses its grounding score and shows up in the generated quality report. Explore mode runs the same retriever at top-k 4 and a 0.3 similarity threshold to ground answers on neighboring chunks.
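A sketch of how the explore-mode retrieval could combine the two signals. It assumes reciprocal rank fusion as the fusion step and applies the 0.3 similarity floor before fusing; the text does not specify either detail for this path, so treat both as assumptions.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several best-first rankings; k=60 is the commonly used constant."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def explore_candidates(cosine_scores: dict[str, float], bm25_ranking: list[str],
                       top_k: int = 4, min_similarity: float = 0.3) -> list[str]:
    """Drop chunks under the similarity floor, fuse with BM25, return the top_k ids."""
    kept = {cid: s for cid, s in cosine_scores.items() if s >= min_similarity}
    dense_ranking = sorted(kept, key=lambda cid: (-kept[cid], cid))
    lexical_ranking = [cid for cid in bm25_ranking if cid in kept]
    return reciprocal_rank_fusion([dense_ranking, lexical_ranking])[:top_k]


cosine_scores = {"a": 0.9, "b": 0.5, "c": 0.2, "d": 0.4, "e": 0.35, "f": 0.31}
bm25_ranking = ["b", "c", "f"]
selected = explore_candidates(cosine_scores, bm25_ranking)
```

Chunks that appear in both rankings accumulate score from each, so a chunk with middling similarity but a strong lexical hit can outrank the top dense result.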

## Equation rendering
The subsection-authoring prompt fixes notation: `$...$` for inline, `$$...$$` for display, backticks for code, and an explicit ban on bare currency symbols to avoid runaway math regions. The frontend imports KaTeX styles globally; the Markdown component wires `remark-math` and `rehype-katex`; a formatted-text helper segments inline text, gates `$...$` bodies with a math-like guard, and renders display blocks with `\displaystyle`. Micro-check leaf components inherit KaTeX typography classes so math renders cleanly inside fill-blank and matching answers.

## Learner engagement
Inline micro-checks come in three variants (matching, ordering, fill-blank), gated to zero, one, or two per lesson by prompt contract. End-of-lesson quizzes trigger a dual-corner `canvas-confetti` burst on a passing score; a progress hook PATCHes the quiz score to the course progress API, and a local streak counter persists viewed lesson IDs. The diagrams surface uses `@xyflow/react` with a `dagre` layout pass. Progress reads and writes are designed so the engagement layer fails independently of the LLM layer.

## Production posture
Clerk JWT middleware plus a Svix-verified webhook syncs users. Stripe checkout, portal, and webhooks sit behind a thin client and a dedicated event service, with durable event handling. Synthesia uses two template IDs and persists a `videos.json` under the course metadata. S3 is laid out as `courses/{course_id}/{source|metadata|content|grounding}/`. Observability initialises Langfuse alongside OpenTelemetry's Anthropic instrumentor and flushes on shutdown; Sentry is wired on both Python and the frontend. The Dockerfile bakes LibreOffice, ffmpeg, and Playwright Chromium for slide and video paths.

## Lessons
Breaking generation into checkpoints with explicit wait states buys you retries and product surfaces (persona pick, approvals) without replaying Claude from scratch, something a single-shot script cannot match when a run lasts tens of minutes. Faithfulness turns out to be orthogonal to readability: lexical grounding checks are inexpensive and objective, coherence and persona judgments need slow, noisy judges, and embedding drift metrics only help once you tolerate false alarms on legitimate paraphrase. Engagement work here is disproportionately React affordances, motion budgets, Markdown math correctness, micro-check UX, celebration timing, and persistence of quiz attempts; almost none of it is "another paragraph from the LM." Keeping those layers separate avoids the trap of asking the tutor model to compensate for unclear progress UI.

This engagement is ongoing; this page will be updated as the product evolves.


**Links:**
- [getpraxium.ai (live product)](https://getpraxium.ai/)

---

## Scalable, cost-effective voice agents: a platform-based blueprint
URL: https://prathameshsaraf.com/case-studies/voice-agents-blueprint-cvs/
Senior Forward Deployed Engineer · Co-author on the public blueprint · 2024 to present

A hierarchical voice-agent platform for a Fortune-5 healthcare buyer handling millions of daily customer interactions. A Master Agent orchestrates specialized SLM- and LLM-powered sub-agents; a tiered model strategy plus high-precision intent classification cut LLM usage by over 90% against a theoretical workload of tens of millions of LLM calls per day. Public co-authored blueprint on the CVS Health Tech Blog, Feb 2026.

**Stack:** Python, LangGraph, BM25 (sparse retrieval), Fine-tuned dense embeddings, ColBERT (late-interaction reranking), SLMs + LLMs, Centralized LLM gateway, Active-active GPU infrastructure across two facilities, Enterprise prompt management platform

**Outcomes:**
- Over 90% reduction in actual LLM usage vs. the theoretical peak of tens of millions of LLM calls per day.
- Intent classification weighted F1 of 0.86 via a four-stage multi-vector pipeline (BM25 sparse + dense + ColBERT late-interaction reranking).
- Auth-time pre-fetching cuts `prescription_status` latency from ~1,200 ms to ~80 ms; 90 to 95% of typical requests served instantly from warm cache.
- Geo-redundant active-active deployment across two facilities for zero-downtime operation.
- Public co-authored blueprint on the CVS Health Tech Blog (Feb 2026).


## What I owned

I was one of the senior engineers on the platform team and a co-author on the public blueprint (seven-author byline). My center of gravity on this work was the intent-classification gatekeeper that the entire cost-optimization strategy hinges on, and the orchestration patterns the Master Agent uses to delegate to specialized sub-agents. The intent slice has its own deep dive in [from IVR to agentic](/case-studies/ivr-to-agentic/); the broader platform (telephony, GPU gateway, prompt management, geo-redundant infrastructure) was the work of the full team listed on the blueprint.

## What shipped

A platform, not a bot. The architecture is a hierarchical agent system:

- A **Master Agent** that receives a classified intent and routes to the most efficient downstream system or sub-agent.
- A **tiered model strategy**: simple, high-volume tasks go to deterministic rules or fast SLM-powered sub-agents; only the most complex multi-turn conversations escalate to LLM-powered sub-agents.
- A **high-precision intent classification gatekeeper** that prevents simple queries from being misrouted into expensive LLM flows.

The composition is the point. No single layer reaches the 90% reduction; the layers stack, and the gatekeeper is the keystone.
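The routing idea can be sketched as a lookup from classified intent to the cheapest capable tier. The routing table, the confidence floor, and the rule that low-confidence or unknown intents escalate to the LLM tier are all assumptions for illustration; only `prescription_status` is an intent name taken from this page.

```python
from enum import Enum


class Tier(Enum):
    RULES = "rules"
    SLM = "slm"
    LLM = "llm"


# Hypothetical routing table.
INTENT_TIER = {
    "store_hours": Tier.RULES,
    "prescription_status": Tier.SLM,
    "complex_multi_turn": Tier.LLM,
}


def route_intent(intent: str, confidence: float, min_confidence: float = 0.7) -> Tier:
    """Send confident, known intents to their cheapest tier; escalate everything else."""
    if confidence < min_confidence or intent not in INTENT_TIER:
        return Tier.LLM
    return INTENT_TIER[intent]


def llm_share(calls: list[tuple[str, float]]) -> float:
    """Fraction of calls that end up on the LLM tier."""
    if not calls:
        return 0.0
    return sum(route_intent(i, c) is Tier.LLM for i, c in calls) / len(calls)
```

The sketch also shows why classifier precision drives cost: every misclassified or low-confidence call lands on the most expensive tier.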

## The cost story

The headline economic claim is the >90% reduction in actual LLM usage. The path:

- **Tiered models.** The Master Agent sends the majority of queries to SLMs or deterministic rules. Most calls never see an LLM.
- **Precision intent classification.** Misroutes are the silent failure mode; an SLM-routable query that gets escalated to an LLM by mistake costs you on both ends. High weighted F1 (0.86) keeps that escalation rate low.
- **Contextual efficiency.** Auth-time pre-fetching and well-engineered prompts collapse the turn count for the queries that do reach an LLM.

The result: tens of millions of theoretical LLM calls per day collapse to a few million actual model interactions, the vast majority of which are cheap.

## Intent classification (the gatekeeper)

A four-stage multi-vector retrieval pipeline does the routing:

1. **BM25** sparse retrieval as the lexical baseline.
2. A **fine-tuned dense encoder** for semantic recall.
3. **ColBERT late-interaction reranking** over the merged top candidates.
4. An LLM at the end as the final classifier and out-of-scope filter, given only the top-K survivors.

Weighted F1 of 0.86 across the production intent set. The LLM is the most expensive node, so it is pushed to the end of the pipeline, where it adjudicates a small candidate set rather than acting as the primary classifier. The deep dive on this slice, including the synthetic-data pipeline and the multi-vector wiring in Qdrant, is in [from IVR to agentic](/case-studies/ivr-to-agentic/).
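The four stages compose into a short function once each stage is treated as a pluggable callable. This is a structural sketch only: the callables, the `"out_of_scope"` label, and the default of five survivors are placeholders, and the production system runs the first three stages inside Qdrant rather than in application code.

```python
from typing import Callable

Candidates = list[tuple[str, float]]  # (example_id, score), best-first


def merge_dedupe(*channels: Candidates) -> list[str]:
    """Union candidate ids across channels, preserving first-seen order."""
    seen: dict[str, None] = {}
    for channel in channels:
        for example_id, _ in channel:
            seen.setdefault(example_id, None)
    return list(seen)


def classify(query: str,
             sparse: Callable[[str], Candidates],
             dense: Callable[[str], Candidates],
             rerank: Callable[[str, list[str]], Candidates],
             final_llm: Callable[[str, list[str]], str],
             top_k: int = 5) -> str:
    pool = merge_dedupe(sparse(query), dense(query))   # stages 1 and 2
    reranked = rerank(query, pool)                     # stage 3
    survivors = [example_id for example_id, _ in reranked[:top_k]]
    return final_llm(query, survivors)                 # stage 4, may return "out_of_scope"
```

Keeping the LLM behind `top_k` means its prompt size is bounded regardless of how many intent examples the retrievers index.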

## Auth-time pre-fetching

Customer authentication is a hard latency floor: the system has to verify identity before answering anything sensitive. The platform turns that window into useful work. While the customer is being authenticated, the system fetches their likely-needed user data and loads it into a high-speed cache. By the time the conversation begins, 90 to 95% of typical requests are answerable from warm cache. The published example is `prescription_status`, which drops from ~1,200 ms (cold) to ~80 ms (warm). The same pattern applies across intents that need user-scoped data.
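The pattern is essentially "run the prefetch concurrently with the wait you cannot avoid." A minimal `asyncio` sketch follows; the sleeps stand in for the identity check and backend lookups, the in-process dict stands in for the high-speed cache, and the eviction on failed authentication is an assumption about how one would keep the pattern safe.

```python
import asyncio

CACHE: dict[tuple[str, str], str] = {}


async def authenticate(caller_id: str) -> bool:
    await asyncio.sleep(0.05)  # stand-in for the identity check
    return True


async def prefetch(caller_id: str, intents: list[str]) -> None:
    """Warm the cache with data the caller is likely to ask about."""
    for intent in intents:
        await asyncio.sleep(0.01)  # stand-in for a backend lookup
        CACHE[(caller_id, intent)] = f"{intent} data for {caller_id}"


async def start_call(caller_id: str) -> bool:
    # Run both concurrently so the prefetch hides inside the authentication wait.
    authed, _ = await asyncio.gather(
        authenticate(caller_id),
        prefetch(caller_id, ["prescription_status"]),
    )
    if not authed:
        # Never keep prefetched data for a caller who failed authentication.
        for key in [k for k in CACHE if k[0] == caller_id]:
            del CACHE[key]
    return authed


def answer(caller_id: str, intent: str) -> tuple[str, str]:
    key = (caller_id, intent)
    if key in CACHE:
        return "warm", CACHE[key]
    return "cold", f"{intent} data for {caller_id}"  # would hit the backend here


authed = asyncio.run(start_call("caller-1"))
```

After `start_call` returns, `answer("caller-1", "prescription_status")` is served from the warm path, while an intent that was not prefetched falls through to the cold path.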

## Reliability posture

Voice agents at this scale are critical infrastructure, not experimental projects. The blueprint emphasizes the unglamorous half of the work:

- **Active-active deployment across two geographically distinct facilities.** No single points of failure on the GPU and gateway layer.
- **Graceful degradation.** SLM-powered and rule-based services keep operating if the LLM tier degrades.
- **Centralized LLM gateway** for budget controls, identity-integrated access, routing, rate limiting, and provider failover.
- **Enterprise prompt management** with versioning, audit logs, approval workflows, and a staging environment that mirrors production.
- **Zero-downtime deployments** via blue-green and canary releases.

## Lessons

- **Voice at enterprise scale is an infrastructure problem first, an AI problem second.** The economics fall apart if you treat the LLM as the default and try to optimize around it; they work when you treat the LLM as the *exception* and let cheap, deterministic, or small-model paths handle the majority.
- **The cost optimizer is high-precision intent classification.** Get it wrong and every other layer is forced to escalate. Get it right and most of your traffic never sees an LLM.
- **The auth window is free latency.** Pre-fetching during a step the customer is already waiting on is the kind of unglamorous engineering that recovers an order of magnitude on the critical path.
- **Platforms beat bots.** The decoupled-layers architecture (telephony / runtime / orchestration / models / gateway) is what makes the blueprint replicable. The win is not a smart bot; it is the boring discipline of letting feature teams spend their time on the conversation, not the plumbing.


**Links:**
- [Read the blueprint on the CVS Health Tech Blog (Feb 2026)](https://medium.com/cvs-health-tech-blog/building-scalable-and-cost-effective-voice-agents-a-platform-based-blueprint-fae6ee5881c9)

---

## From IVR to agentic: multi-vector retrieval for pharmacy intent classification
URL: https://prathameshsaraf.com/case-studies/ivr-to-agentic/
Senior Forward Deployed Engineer · 2024 to present

Replacing a BERT plus LLaMA hybrid intent classifier in CVS Health's pharmacy IVR with a multi-vector retrieval pipeline on Qdrant: BM25 sparse retrieval, a fine-tuned dense encoder, ColBERT late-interaction reranking, and an LLM as the final classifier and out-of-scope filter on the top five candidates. Weighted F1 from 0.58 to 0.86 across roughly 1.5M daily customer interactions and 32 distinct pharmacy intents. The public retrospective is on the CVS Health Tech Blog.

**Stack:** Python, BM25 (sparse retrieval), Fine-tuned dense embedding model, ColBERT (late-interaction reranking), Qdrant (multi-vector store), LLM post-processing (top-5 classification + out-of-scope), Synthetic data pipeline (~100k labeled queries), Telephony + streaming ASR/TTS, Observability (transcript review, quality scoring)

**Outcomes:**
- Weighted F1 from 0.58 to 0.86 on a 32-intent pharmacy taxonomy (`rx_refill`, `store_address`, `drug_availability`, `cancel_vaccine_appointment`, `rx_drug_price`, and 27 more).
- Largest gains on previously unreliable intents: `drug_availability` from 0.18 to 0.82, `available_vaccine` and `rx_drug_price` reaching 1.0.
- Roughly 1.5 million customer interactions per day handled by the new pipeline.
- Three vector representations per item in Qdrant (sparse, dense, late-interaction) with prefetch-plus-rerank wiring to avoid recomputation during ranking.
- 100k synthetic query-to-intent training set generated and a subset used to fine-tune the dense encoder, with cleaner decision boundaries between near-duplicate intents validated on cluster maps.
- Public retrospective published on the CVS Health Tech Blog.


> The full public retrospective is on the CVS Health Tech Blog: [Transforming Customer Interactions: Evolving IVR Systems for Enhanced Experiences](https://medium.com/cvs-health-tech-blog/transforming-customer-interactions-evolving-ivr-systems-for-enhanced-experiences-73b12c7f5aea). Everything below is grounded in that article and the work behind it.

## What shipped
What shipped is the intent classification layer behind CVS's pharmacy IVR, which fields roughly 1.5 million calls per day across 32 intents such as prescription refills, store lookups, drug availability, vaccine appointments, and pricing. We replaced a BERT plus LLaMA hybrid with a multi-vector pipeline backed by Qdrant: BM25 sparse retrieval anchors rare drug terms, a fine-tuned dense encoder captures paraphrase, ColBERT late-interaction reranks the merged candidate set, and an LLM does final classification and out-of-scope rejection on the top five. We trained the dense model on a 100k synthetic query-to-intent dataset and used Qdrant's prefetch plus rerank pattern to keep latency and cost in check. Weighted F1 moved from 0.58 to 0.86, with the largest gains on previously unreliable intents.

## What I owned
The migration path from rules-and-grammars IVR to LLM-driven agentic flows, and the retrieval architecture that made it tractable. That meant designing the agent runtime, the tool-use envelope around it, the multi-vector retrieval and reranking stack, and the seams that let the new system coexist with the old one during cut-over instead of demanding a big-bang rewrite. Operationally, this meant preserving the existing transcript review, analytics, and quality scoring surfaces so the operations team kept their playbooks; only the brain changed.

## Architecture
Three vector representations per query and per intent example live in Qdrant: a sparse BM25 vector, a dense vector from the fine-tuned encoder, and a late-interaction ColBERT vector. At inference, Qdrant prefetches top candidates from the sparse and dense channels, deduplicates them, and reranks the merged pool with ColBERT. The top five candidates then go to the LLM, which does final classification and out-of-scope rejection. The LLM is the most expensive node, so it has to earn its place: pushing it to the end of the pipeline (rather than using it as a primary classifier) was both cheaper and more accurate on near-duplicate intents like "what's the price of amoxicillin" versus "do you have amoxicillin."

## Why each retriever earns its place
Dense embeddings alone miss rare or domain-specific tokens; BM25 is the cheapest fix for that, and keeping it in candidate generation costs almost nothing at inference time. BM25 alone cannot handle paraphrase, so the dense channel is non-negotiable. ColBERT was the part I expected least and relied on most: token-level late interaction recovers context like "refill while traveling abroad" that a bi-encoder flattens into a generic refill intent. The LLM only earned its place at the very end, scoring the top five and rejecting out-of-scope queries.
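Late interaction is easiest to see in the MaxSim scoring rule: each query token keeps only its best-matching document token, and those maxima are summed. The sketch below uses made-up 2-d token embeddings purely to show the mechanics; real ColBERT vectors are higher-dimensional and produced by the model.

```python
Vector = list[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_tokens: list[Vector], doc_tokens: list[Vector]) -> float:
    """Late interaction: each query token keeps only its best-matching document token.

    Assumes doc_tokens is non-empty.
    """
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy token embeddings: axis 0 stands for "refill", axis 1 for "travel".
query = [[1.0, 0.0], [0.0, 1.0]]            # "refill" + "abroad"
generic_refill = [[1.0, 0.0], [0.9, 0.1]]   # intent example with no travel token
travel_refill = [[1.0, 0.0], [0.0, 1.0]]    # intent example that mentions travel

scores = {
    "generic_refill": maxsim(query, generic_refill),
    "travel_refill": maxsim(query, travel_refill),
}
```

The "abroad" token finds almost nothing to match in the generic example, so its contribution collapses there, which is the token-level signal a single pooled vector tends to average away.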

## Lessons
The non-obvious lesson from this work is that intent disambiguation in a pharmacy IVR is not a single-model problem; it is a retrieval problem with a classification step bolted on at the end. A fine-tuned dense encoder is necessary but not sufficient, because rare drug names and specific SKUs collapse into nearby clusters and the model loses the lexical signal that actually carries the intent. BM25 is the cheapest fix for that failure mode, and keeping it in the candidate generation stage costs almost nothing at inference time. ColBERT was the part I expected least and relied on most: token-level late interaction recovers context like "refill while traveling abroad" that bi-encoders flatten into a generic refill intent. The LLM only earned its place at the very end, scoring the top five and rejecting out-of-scope queries; using it as a primary classifier was slower, more expensive, and less accurate on near-duplicate intents. The F1 jump from 0.58 to 0.86 came mostly from the worst-performing intents, not the easy ones, which is the metric I now watch first when evaluating retrieval changes in production conversational systems. Compliance, observability, and the operations team are not blockers; they are the leverage that lets you ship LLM-driven flows in a regulated environment.


**Links:**
- [Read the public write-up (CVS Health Tech Blog)](https://medium.com/cvs-health-tech-blog/transforming-customer-interactions-evolving-ivr-systems-for-enhanced-experiences-73b12c7f5aea)

---

## MCP Gateway Catalog: one catalog, many tools, unified auth
URL: https://prathameshsaraf.com/case-studies/mcp-registry/
Senior FDE · Platform design and delivery · 2025

A live product surface that lets any team browse 47+ Model Context Protocol servers, complete OAuth or API-key handshakes, and try every tool from the browser, without a local agent host.

**Stack:** MCP, Python, FastAPI, React, TypeScript, OAuth2, API-key vault

**Outcomes:**
- Unified discovery, auth, and invocation across 47+ third-party MCP servers (SEO, finance, productivity, research, project management, and more) from a single browser surface.
- Three auth modes (OAuth2, API key, anonymous) routed through one consistent UX so consumers don't need per-provider wiring.
- An interactive 'Try it' surface that lets engineers exercise any registered tool against real providers without standing up a local agent host.


## What I owned
The platform-side design: how dozens of independent MCP servers get registered, normalized, authenticated, and exposed through a single gateway, and the product surface that makes that registry usable from a browser instead of a CLI.

## What shipped
A public, browsable catalog of 47+ MCP servers across SEO (Ahrefs), data (Airtable, Alpha Vantage), research (arXiv), project management (Atlassian), productivity, finance, and more. Each entry carries its auth contract on its face (OAuth2, API key, or anonymous) and exposes its tool surface inline. A consumer can search, filter by auth type, and exercise a tool live from the browser without installing or hosting anything locally. The catalog sits on top of a gateway that does the boring-but-load-bearing work: credential brokering, request signing, transport multiplexing, and uniform error surfacing.
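
A minimal sketch of what "auth contract on its face" could look like as data. The entry names, categories, and tool names below are hypothetical placeholders, and `CatalogEntry`, `AuthMode`, and `search` are illustrative names rather than the gateway's actual types.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class AuthMode(Enum):
    OAUTH2 = "oauth2"
    API_KEY = "api_key"
    ANONYMOUS = "anonymous"


@dataclass(frozen=True)
class CatalogEntry:
    name: str
    category: str
    auth: AuthMode
    tools: Tuple[str, ...] = ()


def search(catalog: List[CatalogEntry], text: str = "",
           auth: Optional[AuthMode] = None) -> List[CatalogEntry]:
    """Filter by free text (name, category, tool names) and optionally by auth mode."""
    needle = text.lower()
    return [
        entry for entry in catalog
        if (auth is None or entry.auth is auth)
        and (needle in entry.name.lower()
             or needle in entry.category.lower()
             or any(needle in tool.lower() for tool in entry.tools))
    ]


# Hypothetical entries, not the real catalog contents.
catalog = [
    CatalogEntry("example-seo", "seo", AuthMode.API_KEY, ("keyword_report",)),
    CatalogEntry("example-papers", "research", AuthMode.ANONYMOUS, ("search_papers",)),
    CatalogEntry("example-tracker", "project management", AuthMode.OAUTH2, ("list_issues",)),
]

print([e.name for e in search(catalog, auth=AuthMode.ANONYMOUS)])
print([e.name for e in search(catalog, text="issues")])
```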

## Lessons
The interesting part of MCP isn't the protocol, it's the **fan-out**. Once you have ten servers, none of your consumers want to learn ten different auth flows, ten different rate-limit shapes, or ten different "is this thing alive" semantics. A gateway with a real catalog UI in front of it is what turns "MCP servers exist" into "MCP servers are useful." Keeping the registry, the auth broker, and the try-it surface on the same product seam is what makes adoption cheap.


**Links:**
- [Live demo](https://mcp-playground-frontend.ml.tfy-eo.truefoundry.cloud/)

---

## MCP-Guardian: putting MCP on a diet
URL: https://prathameshsaraf.com/case-studies/mcp-guardian/
Author · Maintainer · Speaker · 2026

An MCP proxy that replaces hundreds of tool schemas with three meta-tools, cutting 160k+ startup tokens to 456 (a 99.7% reduction), and adds scoping, audit, and OAuth-aware fan-out to any upstream MCP server, with no client changes.

**Stack:** Python, MCP, asyncio, FastAPI, tiktoken, Docker, PostgreSQL MCP, GitHub MCP

**Outcomes:**
- 99.7% reduction in MCP startup token cost (160,143 to 456 tokens) across 248 PostgreSQL + 41 GitHub tools.
- Measured proxy overhead of about -8 ms against real upstream servers, which is within measurement noise (no detectable added latency); 93% hit rate on keyword-based tool search.
- 27/27 scope-enforcement security checks pass; OAuth, bearer, header, and pass-through auth modes all supported per-server.
- Conference talk accepted at the Linux Foundation MCP Dev Summit Bengaluru (9-10 June 2026).
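
The headline reduction can be checked from the numbers in the first bullet. The average per-tool figure below is derived here for illustration and is not a separately reported measurement.

```python
before_tokens = 160_143   # startup tokens with every tool schema loaded
after_tokens = 456        # startup tokens with the three meta-tools
tool_count = 248 + 41     # PostgreSQL MCP + GitHub MCP tools

reduction = 1 - after_tokens / before_tokens
avg_tokens_per_tool = before_tokens / tool_count

print(f"reduction: {reduction:.1%}")
print(f"average schema cost: {avg_tokens_per_tool:.0f} tokens per tool")
```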


## What I owned
The full design: the three-tool API (`search_tools`, `get_schema`, `execute_tool`), the YAML-driven scope model (per-server allow/block lists, per-scope policies), the OAuth-aware auth broker, the audit log, the in-browser dashboard with a live chat-demo for side-by-side token comparison, the benchmark harness against real upstream MCP servers, and the conference talk at the Linux Foundation summit.

## What shipped
An MCP proxy that an AI client can drop in front of any number of upstream MCP servers. The client sees three meta-tools (search, fetch-schema, execute) instead of the union of every server's tool catalog. The proxy enforces scopes, handles OAuth flows, resolves API keys from `.env` or browser keystore, logs everything for audit, and exposes a dashboard that lets you compare token cost with and without the proxy in real time. The benchmark suite proves the savings hold against PostgreSQL MCP (248 tools) and GitHub MCP (41 tools).
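
The three-tool surface can be sketched as a small in-process class. This is an illustration under assumptions, not the MCP-Guardian source: the method names follow the meta-tool names used on this page, but `MetaToolProxy`, `ScopeError`, the allow-list shape, and the registered tools are hypothetical, and no MCP transport or OAuth handling is shown.

```python
from typing import Any, Callable, Dict, List, Set


class ScopeError(PermissionError):
    """Raised when a tool is outside the caller's scope."""


class MetaToolProxy:
    def __init__(self, allowed: Set[str]) -> None:
        self._tools: Dict[str, Dict[str, Any]] = {}
        self._allowed = allowed
        self.audit_log: List[Dict[str, Any]] = []

    def register(self, name: str, description: str,
                 schema: Dict[str, Any], handler: Callable[..., Any]) -> None:
        self._tools[name] = {"description": description,
                             "schema": schema, "handler": handler}

    def _check_scope(self, name: str) -> None:
        if name not in self._tools or name not in self._allowed:
            raise ScopeError(f"tool not available in this scope: {name}")

    def search_tools(self, query: str, limit: int = 5) -> List[Dict[str, str]]:
        """Keyword search over names and descriptions; out-of-scope tools stay hidden."""
        words = query.lower().split()
        hits = []
        for name, tool in self._tools.items():
            haystack = f"{name} {tool['description']}".lower()
            if name in self._allowed and any(w in haystack for w in words):
                hits.append({"name": name, "description": tool["description"]})
        return hits[:limit]

    def get_schema(self, name: str) -> Dict[str, Any]:
        self._check_scope(name)
        return self._tools[name]["schema"]

    def execute_tool(self, name: str, arguments: Dict[str, Any]) -> Any:
        allowed = name in self._tools and name in self._allowed
        self.audit_log.append({"tool": name, "allowed": allowed})  # log every attempt
        self._check_scope(name)
        return self._tools[name]["handler"](**arguments)


# Hypothetical upstream tools standing in for a real server's catalog.
proxy = MetaToolProxy(allowed={"pg.list_tables"})
proxy.register("pg.list_tables", "List tables in a schema",
               {"type": "object", "properties": {"schema": {"type": "string"}}},
               lambda schema="public": [f"{schema}.users"])
proxy.register("pg.drop_table", "Drop a table", {"type": "object"},
               lambda table: None)

print(proxy.search_tools("table"))
print(proxy.execute_tool("pg.list_tables", {"schema": "public"}))
```

The client only ever loads the three method signatures; full schemas are fetched one at a time through `get_schema`, which is where the startup-token saving comes from.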

## Lessons
The MCP spec already recommends progressive discovery; the hard part is making it implementable without rewriting every client. A proxy at the infrastructure layer is the right place to put that recommendation. The break-even point is around 39 tools; beyond that, the math is overwhelming. And the most interesting work isn't the token math; it's the auth broker. Once you have OAuth-aware fan-out across heterogeneous servers, you stop treating MCP as a curiosity and start using it like real infrastructure.


**Links:**
- [Source on GitHub](https://github.com/S1LV3RJ1NX/mcp-guardian)
- [MCP Dev Summit Bengaluru talk listing](https://events.linuxfoundation.org/mcp-dev-summit-bengaluru/)

---

## TrueMem: a model-agnostic memory layer for AI applications
URL: https://prathameshsaraf.com/case-studies/truemem/
Senior Engineer · Architecture, design, build, ship · 2025 to 2026

A persistent, two-tier (short-term + long-term) memory service for LLM applications. Distilled facts replace verbatim history, semantic retrieval surfaces what matters, and the same memory follows users across any model.

**Stack:** Python, FastAPI, Postgres, pgvector, Redis (queues), OpenAI / Anthropic / TrueFoundry AI Gateway

**Outcomes:**
- ~10 ms similarity search and ~45 ms end-to-end context preparation, dominated by embedding latency and parallelized against DB fetches.
- Two-tier memory architecture (STM: running summary + last 20 messages; LTM: vector-stored, importance-scored, semantically deduplicated facts) modelled after working vs. long-term human memory.
- Model-agnostic by design: memory lives outside the model, so consumers can swap GPT, Claude, or a fine-tuned in-house model without losing user context.


## What I owned
The full design and build of TrueMem: the dual-tier memory model, the explicit-vs-automatic memory extraction pipeline, semantic deduplication (≥85% similarity collapses to an update, not a duplicate), importance-scoring (1 to 5) with lifecycle-managed pruning, async summary compression, and the public byline on the TrueFoundry engineering blog.
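
The dedup rule can be sketched with an in-memory list in place of pgvector. Only the 0.85 threshold and the 1 to 5 importance scale come from the description above; the `Fact` shape, keeping the higher importance on update, and pruning by a minimum importance are assumptions made for the example.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence

SIMILARITY_THRESHOLD = 0.85  # at or above this, a new fact updates an old one


@dataclass
class Fact:
    text: str
    embedding: Sequence[float]
    importance: int  # 1 (trivial) to 5 (critical)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def upsert_fact(store: List[Fact], new: Fact) -> str:
    """Collapse near-duplicates into an update instead of a second row."""
    for i, existing in enumerate(store):
        if cosine(existing.embedding, new.embedding) >= SIMILARITY_THRESHOLD:
            # Assumption: the newer wording wins, the higher importance is kept.
            store[i] = Fact(new.text, new.embedding,
                            max(existing.importance, new.importance))
            return "updated"
    store.append(new)
    return "inserted"


def prune(store: List[Fact], min_importance: int) -> List[Fact]:
    """Assumed lifecycle rule: drop facts below an importance floor."""
    return [fact for fact in store if fact.importance >= min_importance]


store: List[Fact] = []
print(upsert_fact(store, Fact("Lives in Pune", [1.0, 0.0], 4)))
print(upsert_fact(store, Fact("Moved within Pune", [0.9, 0.1], 3)))
print(upsert_fact(store, Fact("Prefers email", [0.0, 1.0], 2)))
print([f.text for f in prune(store, min_importance=3)])
```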

## What shipped
A production memory service that any LLM application calls in two HTTP hops: fetch context before the model, log the interaction after. Short-term memory keeps recent turns coherent without blowing the context window; long-term memory stores distilled, deduped facts about each user, scoped per-user, retrievable by cosine similarity in about 10 ms. Heavy work (summarization, extraction) runs asynchronously on Redis-backed workers so the synchronous read path stays under 80 ms total.
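
One way to picture the read path is with `asyncio.gather`: the query embedding and the short-term fetch do not depend on each other, so they can overlap, and only the vector search has to wait for the embedding. The three coroutines below are stubs with placeholder sleeps (not measured latencies), and all names are hypothetical.

```python
import asyncio
from typing import Any, Dict, List


async def embed_query(text: str) -> List[float]:
    await asyncio.sleep(0.03)  # placeholder for an embedding API call
    return [0.1, 0.2]


async def fetch_short_term(user_id: str) -> Dict[str, Any]:
    await asyncio.sleep(0.02)  # placeholder for a summary + recent-turns query
    return {"summary": "stub summary", "recent": ["hi", "hello"]}


async def search_long_term(user_id: str, embedding: List[float]) -> List[str]:
    await asyncio.sleep(0.01)  # placeholder for a vector similarity query
    return ["stub fact"]


async def prepare_context(user_id: str, message: str) -> Dict[str, Any]:
    # Embedding and short-term fetch run concurrently.
    embedding, short_term = await asyncio.gather(
        embed_query(message), fetch_short_term(user_id)
    )
    # The similarity search needs the embedding, so it runs afterwards.
    facts = await search_long_term(user_id, embedding)
    return {"short_term": short_term, "facts": facts}


context = asyncio.run(prepare_context("user-1", "where did we leave off?"))
print(context)
```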

## Lessons
RAG and "bigger context windows" are answering the wrong question for user memory. RAG retrieves *documents*; memory needs to retrieve *relationships*. Context windows pay tokens for amnesia with extra steps. A small dedicated layer (pgvector, two importance-aware tables, parallelized fetch) gets you AI that actually remembers, and it gets you vendor-portability for free. The interesting work is in the data shape (what counts as a fact, how facts age, how duplicates collapse), not the model.


**Links:**
- [TrueFoundry blog: Building TrueMem (byline)](https://www.truefoundry.com/blog/truemem-building-a-model-agnostic-memory-layer-for-ai)

---

## CogenticAI DB Agent: natural-language database queries for an enterprise SaaS
URL: https://prathameshsaraf.com/case-studies/db-agent/
Senior Engineer · Toptal engagement (CogenticAI) · 2025

A production-grade NL-to-SQL agent built with Google ADK and FastAPI: a SQL Generator, a SQL Executor with retry-with-feedback, and a Response Generator orchestrated as a sequential + loop agent. Converts business questions into safe, read-only Postgres queries and natural-language answers.

**Stack:** Python 3.12, Google ADK, FastAPI, PostgreSQL, LLM provider (OpenAI / Anthropic), uv, Pytest

**Outcomes:**
- Multi-agent ADK pipeline: SQL Generator (LlmAgent), SQL Executor (Custom), Response Generator (LlmAgent), orchestrated by SequentialAgent + LoopAgent with bounded retry.
- Safe execution by default: read-only SELECT, statement timeouts, row caps, connection pooling, and schema-aware prompting.
- Session-aware conversation: maintains query history across turns so follow-ups ("and for last month?") resolve against the same context.
- Two reference implementations (Python with ADK, and TypeScript) to fit the client's preferred deployment stack.


## What I owned
The agent design (which agents, what each one decides, what state lives between them), the retry-with-feedback loop that lets the SQL Generator self-correct on execution errors, the schema-context engineering that keeps generated queries grounded, and the production posture: connection pooling, timeouts, structured errors, and a RESTful FastAPI surface the client could embed in their existing product.

## What shipped
A SequentialAgent orchestrating a LoopAgent (Generator, Executor, retry on failure) followed by a Response Generator. Read-only SELECT enforcement at the executor layer, statement timeouts, row limits, and a schema-aware system prompt so the model knows what tables, columns, and join paths exist. A RESTful API for embedding into the client's product, session state for multi-turn follow-ups, and a parallel TypeScript implementation so the client could deploy on whichever runtime their team owned.
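
A rough sketch of an executor-side guard, assuming a simple keyword check plus a wrapping `LIMIT`. The function name, keyword list, and default row cap are illustrative; a keyword filter like this can reject legitimate queries that mention those words in string literals, and in practice it would sit alongside a read-only database role rather than replace one.

```python
import re

# Keywords that should never appear in a read-only query (illustrative list).
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|truncate|create|grant|revoke|copy)\b",
    re.IGNORECASE,
)


def ensure_read_only(sql: str, row_cap: int = 1000) -> str:
    """Reject anything that is not a single SELECT, then cap the row count."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:
        raise ValueError("multiple statements are not allowed")
    if not re.match(r"(?is)^(select|with)\b", statement):
        raise ValueError("only SELECT queries are allowed")
    if FORBIDDEN.search(statement):
        raise ValueError("write or DDL keyword found")
    # Wrapping keeps the cap in place even if the query has its own LIMIT.
    return f"SELECT * FROM ({statement}) AS capped LIMIT {row_cap}"


print(ensure_read_only("SELECT id FROM users;"))
```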

## Lessons
The hard part of NL-to-SQL is not the SQL. It's *recovery*. A single shot will fail on join paths the model didn't see, ambiguous column names, or non-trivial filters. A retry loop with structured executor feedback ("you referenced `users.email` but that column doesn't exist; the closest match is `users.contact_email`") turns a 60%-correct agent into a 95%-correct one without any prompt tuning. Treat the database as the ground truth and let the model converge to it.
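
The recovery loop can be sketched without any agent framework. Here `sqlite3` stands in for Postgres, `stub_generator` stands in for the LLM-backed SQL Generator, and `run_with_feedback` is a hypothetical name; none of this is the Google ADK API.

```python
import sqlite3
from typing import Callable, List, Optional, Tuple

Generator = Callable[[str, Optional[str]], str]


def run_with_feedback(conn: sqlite3.Connection, generate_sql: Generator,
                      question: str, max_attempts: int = 3) -> Tuple[str, List[tuple], int]:
    """Generate, execute, and feed database errors back until a query succeeds."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        sql = generate_sql(question, feedback)
        try:
            rows = conn.execute(sql).fetchall()
            return sql, rows, attempt
        except sqlite3.Error as exc:
            # The database error becomes structured input for the next attempt.
            feedback = f"Query failed: {exc}. Previous SQL: {sql}"
    raise RuntimeError(f"no valid query after {max_attempts} attempts; last feedback: {feedback}")


def stub_generator(question: str, feedback: Optional[str]) -> str:
    # First attempt guesses a column that does not exist; the retry corrects it.
    if feedback is None:
        return "SELECT email FROM users"
    return "SELECT contact_email FROM users"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, contact_email TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'a@example.com')")

sql, rows, attempts = run_with_feedback(conn, stub_generator, "list user emails")
print(sql, rows, attempts)
```

Bounding `max_attempts` matters as much as the feedback itself: without a cap, a generator that keeps producing the same invalid query would loop forever.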


---

## AIME: meeting intelligence and voice agents, end-to-end
URL: https://prathameshsaraf.com/case-studies/aime-meetings/
Author · Maintainer · 2024

An AI meeting and voice-agent platform (capture, transcribe, summarize, retrieve, and run live agents on top) built as a clean separation of a Python backend and a modern web operator console so each side can evolve independently.

**Stack:** Python, FastAPI, LiveKit Agents, LangGraph, Postgres, Google APIs, Vite, React, TypeScript, Tailwind

**Outcomes:**
- Full meeting-intelligence and voice-agent surface: capture, transcript, structured summary, retrieval over past meetings, and live agents on top.
- Clear API contract between backend and operator console so either side can be swapped or extended.
- Public companion repos for ideas explored in the voice and agentic work at scale.


## What I owned
The full architecture: ingestion pipeline for audio, ASR with diarization, the structured summary layer, the storage and retrieval model for searching across past meetings, and the API the frontend talks to. The frontend is a deliberate companion: a clean way to show what the backend can do without leaking implementation details.

## What shipped
A two-repo project (`AIME-backend`, `AIME-frontend`) you can stand up end-to-end. The backend is the system of record; the frontend is one consumer of it. Anything you can do in the UI is reachable from the API.

## Lessons
Meeting intelligence is mostly a *data shape* problem, not an LLM problem. The interesting work is in how transcripts, speaker turns, summaries, and per-meeting facts get represented so that downstream features (search, briefings, follow-up generation) compose cleanly instead of fighting each other.
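
As a hedged illustration of what a "data shape" could mean here, the sketch below models turns, summary, and facts as plain dataclasses. The field names and helpers are assumptions for the example and are not taken from the AIME repositories.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class SpeakerTurn:
    speaker: str
    start_s: float
    end_s: float
    text: str


@dataclass
class Meeting:
    meeting_id: str
    title: str
    turns: List[SpeakerTurn] = field(default_factory=list)
    summary: str = ""
    facts: List[str] = field(default_factory=list)

    def transcript(self) -> str:
        return "\n".join(f"{t.speaker}: {t.text}" for t in self.turns)

    def talk_time(self) -> Dict[str, float]:
        totals: Dict[str, float] = {}
        for t in self.turns:
            totals[t.speaker] = totals.get(t.speaker, 0.0) + (t.end_s - t.start_s)
        return totals


def search_meetings(meetings: List[Meeting], term: str) -> List[str]:
    """Search runs over the same records that summaries and briefings read from."""
    needle = term.lower()
    return [
        m.meeting_id for m in meetings
        if needle in m.transcript().lower()
        or any(needle in fact.lower() for fact in m.facts)
    ]


meeting = Meeting(
    meeting_id="m-001",
    title="Weekly sync",
    turns=[
        SpeakerTurn("Asha", 0.0, 4.0, "Let's review the rollout plan."),
        SpeakerTurn("Ben", 4.0, 10.0, "The pilot starts next week."),
        SpeakerTurn("Asha", 10.0, 12.0, "Good."),
    ],
    facts=["Pilot start: next week"],
)
print(meeting.talk_time())
print(search_meetings([meeting], "pilot"))
```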


**Links:**
- [AIME-backend](https://github.com/S1LV3RJ1NX/AIME-backend)
- [AIME-frontend](https://github.com/S1LV3RJ1NX/AIME-frontend)

---

## Yukti: a workable, end-to-end RAG stack
URL: https://prathameshsaraf.com/case-studies/yukti-rag/
Author · Maintainer · 2024

A complete RAG application built to be understood: ingestion, embeddings, retrieval, an evaluation harness, and a UI. Small enough to read, real enough to deploy.

**Stack:** Python, FastAPI, ARQ, Unstructured.io, Docling, pgvector, Qdrant, LiteLLM, Vite, React, Tailwind, Docker

**Outcomes:**
- End-to-end reference implementation people can clone, read, and ship from.
- Companion frontend and backend repos kept intentionally small so the architecture is the documentation.
- Multi-tenant RAG framework with semantic chunking, multi-LLM eval harness, and a real operator console.


## What I owned
Architecture, code, docs, and review. Yukti is a deliberately small RAG application; its job is to be readable end-to-end so someone can understand what a production RAG system actually looks like (and not just the part the framework demo shows).

## What shipped
A FastAPI-based backend that handles document ingestion, chunking, embedding, retrieval, and answer synthesis; a Vite + React frontend that consumes the same API; and the connective tissue (Docker, env, eval scripts) you'd need to actually run it. The split between `yukti` and `yukti-frontend` keeps each repo focused.
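
The ingestion-to-prompt loop is small enough to show in one piece. This is a toy under stated assumptions, not code from the `yukti` repo: a bag-of-words `Counter` stands in for a real embedding model and vector store, the documents are made up, and the final LLM call is left out so that only the prompt is built.

```python
import math
from collections import Counter
from typing import List, Tuple


def chunk(text: str, size: int = 20) -> List[str]:
    """Fixed-size word windows; a stand-in for semantic chunking."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Bag-of-words 'embedding', used here only to keep the example dependency-free."""
    return Counter(w.strip(".,?!").lower() for w in text.split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)  # Counter returns 0 for missing terms
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(question: str, index: List[Tuple[str, Counter]], k: int = 1) -> List[str]:
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, contexts: List[str]) -> str:
    joined = "\n".join(f"- {c}" for c in contexts)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


documents = [
    "Ingestion jobs parse documents and split them into chunks.",
    "Retrieval ranks chunks by vector similarity to the question.",
]
index = [(c, embed(c)) for doc in documents for c in chunk(doc)]

question = "How does retrieval rank chunks?"
top = retrieve(question, index)
print(build_prompt(question, top))
```

Each function is a seam where a real component could be swapped in one at a time: a proper chunker, an embedding model, a vector database, and finally the model call.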

## Lessons
Most "RAG frameworks" are actually opinions hiding as APIs. The fastest way to teach RAG is to show the whole loop in code small enough to fit in your head, and then let people swap pieces out one at a time as they outgrow the defaults.


**Links:**
- [yukti (backend)](https://github.com/S1LV3RJ1NX/yukti)
- [yukti-frontend](https://github.com/S1LV3RJ1NX/yukti-frontend)

---

# Publications

## CARL: Cost-Optimized Online Container Placement on VMs Using Adversarial Reinforcement Learning
IEEE Transactions on Cloud Computing · 2025
URL: https://ieeexplore.ieee.org/abstract/document/10839070/


Adversarial reinforcement learning formulation for online container-to-VM placement, framed as a multi-dimensional vector-bin-packing problem. A learner agent mimics an offline semi-optimal teacher solver while automatically learning a reward function for VM cost reduction, ending up out-performing the teacher it imitates. Evaluated on Google and Alibaba production cluster traces (5k–10k container requests across 2k–8k VMs): ~16% lower VM cost than classic heuristics and SOTA RL methods, ~1,900 placements per second onto ~8,900 candidate VMs, robust to inference-time workload distribution shift.


---

# Featured writing

## Building Scalable and Cost-Effective Voice Agents: A Platform-Based Blueprint
CVS Health Tech Blog
URL: https://medium.com/cvs-health-tech-blog/building-scalable-and-cost-effective-voice-agents-a-platform-based-blueprint-fae6ee5881c9


A blueprint for treating voice agents as a *platform* rather than a series of point projects: shared runtime, swappable models, layered evaluation, and the operational scaffolding that lets you ship more than one agent without doubling your team.


---

## TrueMem: Building a Model-Agnostic Memory Layer for AI
TrueFoundry engineering blog · byline
URL: https://www.truefoundry.com/blog/truemem-building-a-model-agnostic-memory-layer-for-ai


---

## Transforming Customer Interactions: Evolving IVR Systems for Enhanced Experiences
CVS Health Tech Blog
URL: https://medium.com/cvs-health-tech-blog/transforming-customer-interactions-evolving-ivr-systems-for-enhanced-experiences-73b12c7f5aea


Modernizing IVR without throwing away a decade of integrations: how to wrap existing actions as tools, swap the planner, and measure each path against the legacy one before fully cutting over.


---

## My Adventures with LLMs
Leanpub · book
URL: https://leanpub.com/adventures-with-llms


The book I wish I'd had when I started: a hands-on tour of building LLMs from the Transformer up to modern architectures (including DeepSeek), in PyTorch, from scratch. Code lives in the companion repo [mal-code](https://github.com/S1LV3RJ1NX/mal-code).


---

# Open source

## MCP-Guardian
MCP proxy that replaces hundreds of tool schemas with three meta-tools, a 99.7% startup-token reduction with scoping, audit, and OAuth-aware fan-out. Talk at Linux Foundation MCP Dev Summit Bengaluru, June 2026.
Tech: Python · MCP · FastAPI · asyncio · Docker
URL: https://github.com/S1LV3RJ1NX/mcp-guardian

---

## MAL-Code
Companion code for the book *My Adventures with LLMs*. Transformers to DeepSeek in PyTorch, from scratch.
Tech: Python · PyTorch
URL: https://github.com/S1LV3RJ1NX/mal-code

---

## AIME
AI meeting intelligence and voice agents, end-to-end: capture, transcribe, summarize, retrieve, and run live agents on top, with a self-hosted meeting bot for ingest.
Tech: Python · FastAPI · LiveKit · LangGraph · React
Status: archived

---

## Yukti
A complete, end-to-end RAG stack built to be understood: ingestion, embeddings, retrieval, an eval harness, and an operator console. Small enough to read, real enough to deploy.
Tech: Python · FastAPI · pgvector · Qdrant · React
Status: archived

---

## PaymentTracking
PWA expense and income tracker for freelance sole-proprietors. Claude-powered OCR over invoices and FIRA certificates, live Google Sheets ledger, India-tax calculator (Sec 44ADA, new regime), all on Cloudflare Pages and R2.
Tech: React · Vite · Hono · Cloudflare Pages Functions · R2 · Claude Haiku
URL: https://github.com/S1LV3RJ1NX/PaymentTracking

---

## Cognita
Open-source RAG framework I co-built at TrueFoundry. Production-ready primitives for ingestion, retrieval, and serving.
Tech: Python · LangChain · FastAPI
URL: https://github.com/truefoundry/cognita
Stars: 4.4k+
Status: archived

---