I build real-time conversational AI and agentic systems at planetary scale — from the Meetings AI layer to the Agent Extensibility Platform inside Microsoft Teams, serving 300M+ daily users.
Principal Engineer with 14+ years of distributed systems and platform engineering at planetary scale; the last 3+ focused on real-time conversational AI and agentic systems. At Microsoft Teams I architected the Meetings AI layer — real-time RAG over meeting context (transcript, calendar, task systems), context-window optimization, and streaming LLM integration delivering sub-200ms AI assistance to millions of daily participants. Full-stack delivery from React/TypeScript frontend to distributed Python/Java backend, with deep ML/Eng collaboration patterns. Looking to apply real-time RAG and conversational AI product-engineering depth to a focused conversational-AI company.
A walkthrough of how I designed, built, and shipped real-time conversational AI inside Microsoft Teams — and the calls I made (and a few I'd remake) along the way.
Microsoft Teams serves 300M+ daily active users, and meetings are the highest-bandwidth knowledge exchange in most enterprises. Before this program, that bandwidth was lossy: transcripts existed but were post-hoc artifacts, action items walked out the door with whoever happened to be writing them down, and most of the context of the meeting died with the meeting itself. The system I led has two tightly-coupled pieces.
Real-time conversational AI that runs alongside a live meeting. It listens to the streaming transcript, retrieves relevant context (calendar, prior meetings, task systems, chat history), and offers in-meeting assistance — answering questions, drafting follow-ups, surfacing related decisions — with sub-200ms perceived latency. Under the hood it's a streaming LLM tool-use loop wrapped around a hybrid RAG pipeline reading a corpus that is literally growing as the meeting happens.
A runtime, SDK, and set of APIs that lets internal and external developers build agents that live inside Teams — at-mentionable, multi-agent, with the right primitives for context propagation, turn-taking, human-in-the-loop intervention, and partial-failure isolation. It speaks A2A (agent-to-agent) and MCP (Model Context Protocol) as first-class. At GA it backed agents reaching the full 300M+ DAU surface.
Why it mattered. The difference between "AI that helps you" and "AI that lives where you already work" is enormous in enterprise. The latter compounds — your tools get smarter together — and Teams is the only surface where a meeting-aware AI can be born without forcing users to switch context.
Tech lead and lead architect across both pieces. IC role, not management. Across the program I worked with ~12–15 engineers (plus a separate AI Research org) and was personally on the hook for:
Concretely: I wrote the load-bearing design docs, prototyped the hardest pieces (runtime turn-taking, streaming RAG) myself before handing them off, reviewed PRs across the runtime, and spent a lot of time saying no to features so we could ship a thing that worked.
The fastest way to ship a bad AI system was to over-build. The default move every quarter was to reach for what already existed and spend our build budget on the things only we could build.
Behind a thin internal interface. I never seriously entertained training our own — model quality changes month-to-month; commit to one in your architecture and you pay for it for years. The layer was deliberately thin: swap providers, route by task complexity (tiered routing later cut cost/request 40%), unify telemetry. It explicitly did not try to be a "framework" — those grow tentacles.
Hosting our own index early would have been a multi-quarter distraction for marginal gain. Meeting traffic is bursty and dense; managed gave us cold-start in seconds. We revisited this only when retrieval latency on the long tail started showing up in evals.
ASR is a domain of its own with a dedicated org. I had zero interest in building expertise we didn't need to own.
A build/buy decision in disguise. Teams is big enough that we could have invented our own protocols and forced adoption. I deliberately didn't — developers were already converging on MCP, and "Teams works with the protocol you already speak" is a far stronger pitch. It cost some optimization headroom and gave us a much larger ecosystem on day one.
We used it to learn what multi-agent orchestration should feel like before committing to our own runtime. We discarded it before GA, but designing from a blank page without that grounding would have led to a worse design. The cost of throwing away the prototype was much smaller than the cost of skipping the learning.
Four pieces, in roughly the order we admitted we had to build them.
AutoGen, LangGraph, and the open frameworks at the time were single-process, single-tenant, and batch-shaped. We needed distributed, multi-tenant, low-latency, and meeting-shaped: agents joining and leaving as participants do, turn-taking that respects who has the floor, idempotent retries (participants re-trigger under flaky networks), and partial-failure isolation so one misbehaving 3rd-party agent couldn't take down the meeting. There was no honest "buy" option — so we built one, event-driven, with a turn-taking protocol I wrote up as a small internal spec and a failure model that treated partial degradation as the normal case.
Off-the-shelf RAG assumes a static corpus. We needed retrieval over a corpus that grew as the meeting progressed, where "relevance" depends on speaker, topic shifts, and time — a customer mentioned in minute 3 might be what the agent needs in minute 27. We built a streaming chunker respecting semantic boundaries (utterance + speaker + topic), a hybrid retrieval layer combining embedding similarity with structured filters (calendar links, task IDs, participant identity), and a context-window manager that knew which retrieved items were still load-bearing vs. could be evicted.
Existing SDKs were code-first — written for engineers comfortable instantiating an Agent class. Our 3rd-party population was broader, including PMs and customer-org IT admins. The SDK had to be declarative: a manifest where you specify intent, retrieval surfaces, tool permissions, and HITL defaults, and the runtime fills in correct behavior. When we measured it, reliability went up 3× and support tickets dropped 60% — almost entirely because the right defaults were locked into the manifest shape rather than left as opportunities for misuse.
Model providers' tooling stops at the model. We needed full-loop eval — retrieval quality, end-to-end p95 latency, agent behavior under turn-taking pressure, regression detection when AI Research rolled new prompts or models, and progressive rollout gates that could halt before regressions reached production. We built it — partly because it didn't exist, mostly because owning it made the AI Research / Engineering contract concrete: Research could change anything as long as it cleared the eval harness.
Streaming tokens as the model produced them dropped perceived latency from ~3s to ~200ms. The cost: a UI that now had to handle partial states, mid-stream retraction, and graceful degradation when the stream dropped. In a meeting — where the AI competes with humans speaking — feeling instant matters far more than feeling complete.
// I still think that was right.Routing simple intents to a smaller model and reserving the frontier model for harder calls cut cost/request 40%. The cost: a routing classifier that could itself be wrong, and a test surface that grew with the number of tiers. The mitigation was the eval harness — every routing decision scored offline before it shipped.
// Without evals first, this would have been a bad call.Every agent the SDK produced had a human-in-the-loop confirmation step by default for any action writing to a customer system. It made agents feel slower and chattier — the tradeoff was conservative-by-default for enterprise customers with no patience for AI acting incorrectly on their behalf. The escape hatch (explicit opt-out per action) was available, but you had to ask for it.
// I'd make the same call again.This cost some optimization (an internal protocol could pack data more efficiently) and locked us into a moving target — the standards evolved under us. The win was ecosystem: external developers showed up speaking our protocols on day one. The cost showed up later as compatibility work when the standards shifted.
// Net positive, but with eyes open.The decision I'd defend most strongly and second-guess most often. For: no off-the-shelf framework fit our shape, so we'd have forked one anyway. Against: we now own and maintain a runtime forever. I keep watching the open-source space — if a community-maintained runtime ever catches up to our needs, the right move is to migrate and stop carrying the maintenance load.
// Correct at the time. Still watching.Three things, in increasing order of "I should have known better."
We shipped the orchestration runtime and the first agents to GA before a real evaluation harness was in place, and caught a few embarrassing regressions in production that an eval suite would have flagged on commit. The classic mistake: eval feels like infrastructure investment, and infrastructure loses to features under shipping pressure. I'd now treat the eval harness as a P0 deliverable in the same release as the first real model integration, not a fast-follow.
I had a real bias toward "design for the platform we'll be in two years." The runtime carried complexity — tenant isolation primitives, per-tenant rate limits, multi-tenant config surfaces — before any tenant needed it. If I were doing it again, I'd ship a single-tenant runtime first, get one use case completely solid, and only then generalize. We could have shipped two months earlier and learned more from real usage than from the extra abstraction.
The decision to build was right at the moment we made it — but I made it earlier than I had to. Another quarter of watching what the community shipped might have let us base our runtime on something rather than starting from a blank page. The kind of decision where being right too early is almost as expensive as being wrong.
The pieces of this program I'm proudest of aren't the technologies. They're the contracts: the eval harness as the boundary between AI Research and Engineering, the manifest as the boundary between the platform and 3rd-party developers, the turn-taking protocol as the boundary between agents. Each one took an ambiguous, negotiation-heavy interface and turned it into something a machine could check — and that's mostly what let us ship at scale without the program collapsing under coordination cost. Happy to go deeper on any of this in conversation.
Real-Time RAG · Context Engineering · LLM Integration · Streaming UI · Conversation Workflows · Agent Orchestration · A2A · MCP · HITL
AutoGen · Azure OpenAI · OpenAI API · Anthropic Claude API · LangChain & LlamaIndex (familiarity)
Python (AutoGen, Azure OpenAI SDK, Anthropic SDK, OpenAI API) · TypeScript · C#/.NET · Java (Spring Boot) · Node.js
React · TypeScript · Fluent UI · GraphQL · Streaming UI · Perf (virtualization, memoization, batch updates)
REST · GraphQL · gRPC · WebSockets · Microservices · Event-Driven Architecture · Fault Tolerance · Idempotency · Horizontal Scalability · Partial-Failure Isolation
Azure (Service Bus, Functions, Cosmos DB, OpenAI) · Docker · Apache Kafka · Cosmos DB · MySQL · MongoDB · Oracle