AI Digital Advisor: Multi-Agent System for Insurance Comparison
2025
Designed and delivered a multi-agent advisory system for a major Australian insurance comparison platform (NDA) serving 4.5M+ users. Starting from ethnographic research into expert human advisers, I designed a personality-aware conversational AI experience that replaces static forms, defining the experience design framework, AI personality specification, and human handoff thresholds before writing any code. I took the project from research through POC to production readiness with a five-person squad reporting directly to the CEO.
AI Experience Design · Multi-Agent · Guardrails & Evals · User Research
The Real Problem
75% of visitors to the platform churned at the first sign of a form. The entire revenue model depended on phone-based advisers converting leads, but the industry was shifting to digital self-service. I ran the discovery research to understand why: ~500 survey respondents, 6 in-depth user interviews, and analysis of 40+ recorded calls from the platform's top-performing human advisers. The research revealed that customers weren't leaving because of the content; they were leaving because the experience pushed them into a sales process before they were ready.
75% of traffic left without entering any conversion path: no form started, no call booked
50% form abandonment at the phone number field, driven by fear of follow-up calls. Users described being 'bombarded with follow-up calls' after leaving a number
42% of outbound calls went unanswered and 11% of submitted numbers were fake; the phone-first model was breaking
Online quotes didn't match agent recommendations 75% of the time, destroying trust in the digital channel and driving disintermediation (users research on the platform, then buy direct from providers)
Each lost conversion represented ~$1,075 in missed commission revenue based on average policy values
Defining Success Metrics Upfront
Before writing a single line of code, I defined the performance measurement framework with the client's CEO and CTO. Every design decision mapped to a measurable outcome. This wasn't an afterthought bolted on at the end; it was the foundation that shaped every subsequent decision about architecture, agent design, and experience flow. (A minimal sketch of how such a framework can be encoded follows the list below.)
Primary metric: lead conversion rate (baseline 7.6%, target 12-17%)
Recommendation accuracy: percentage of AI suggestions matching adviser-validated outcomes
Conversation completion rate: percentage of users who reach a recommendation vs drop-off
Handoff quality score: CS rep satisfaction with context transferred during human escalation
Cost per conversation: AI advisory cost vs phone adviser cost per qualified lead
Latency targets: sub-2-second response time for conversational turns
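To make this concrete, here is a minimal sketch of how a metrics framework like this can be encoded, assuming illustrative names beyond the figures quoted above; the completion-rate target shown is hypothetical, not the client's.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SuccessMetric:
    name: str
    baseline: float               # value measured before the AI advisor
    target: float                 # threshold agreed upfront with CEO/CTO
    higher_is_better: bool = True

    def met(self, observed: float) -> bool:
        """True when an observed value clears the agreed target."""
        if self.higher_is_better:
            return observed >= self.target
        return observed <= self.target

# Conversion and latency figures are quoted in the text;
# the completion-rate target is illustrative.
METRICS = [
    SuccessMetric("lead_conversion_rate", baseline=0.076, target=0.12),
    SuccessMetric("conversation_completion_rate", baseline=0.0, target=0.60),
    SuccessMetric("turn_latency_seconds", baseline=0.0, target=2.0,
                  higher_is_better=False),
]
```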
The Solution
Not a chatbot. The platform's competitive edge was its human advisers: phone conversion ran high because humans read customers, adapt tone, and build trust. The design insight came from ethnographic analysis of those advisers: I studied 40+ recorded calls, mapping questioning patterns, trust-building techniques, objection handling, and the adaptive behaviours that separated the best advisers from average ones. The design question wasn't 'can we automate this?'; it was 'what makes a human advisory conversation feel trustworthy, and which of those qualities can an AI system replicate authentically?' I deliberately scoped twelve feet deep rather than twelve feet wide: health insurance for two target segments (young couples and singles), because expert-level depth earns trust. Shallow breadth across every vertical feels like another chatbot.
Needs-based conversational flow designed from adviser call analysis, mirroring the question sequencing, progressive disclosure, and trust-building cadence of top performers
6 ML-driven behavioural mindsets (Thorough Evaluator, Discount Switcher, Knowledge Gap Bridge, etc.) defined through customer research and mapped to ML classification tags that drive real-time adaptation
Dynamic contextual cards that build a customer profile through conversation, replacing form-like data capture with progressive profiling that feels conversational
Three-way human handover designed for context continuity: the AI introduces the CS rep to the customer, with the full conversation history, profile, and recommendation rationale transferred (a minimal sketch of the handover packet follows below)
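A minimal sketch of what that handover context packet might look like, with hypothetical field names; the production schema belongs to the client.

```python
from dataclasses import dataclass, field

@dataclass
class HandoverPacket:
    transcript: list[str]                 # full conversation history
    profile: dict[str, str]               # progressively built customer profile
    mindset_tag: str                      # ML-classified behavioural mindset
    recommendation: str | None            # current best policy match, if any
    rationale: list[str] = field(default_factory=list)  # why it was suggested
    escalation_reason: str = "threshold"  # which handover trigger fired
```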
Experience Design Framework
The architecture defines what the system can do. The experience design framework defines how it should behave, adapt, and know when to hand over to a human. I designed and presented this as a strategic framework to the client's CEO and CPO, bridging the gap between product design intent and technical implementation. This was the document that turned a POC into a production commitment.
AI personality specification: defined configurable parameters across Voice and Tone (pitch variation, speaking rate, rhythm, emphasis), Conversational Behaviours (question types, topic switching, follow-up patterns, response length), and Natural Language Processing (emotional detection, urgency, confidence, intent classification), each mapped to a tunable parameter for iterative rollout
Dynamic contextual cards: designed the progressive profiling pattern, what the consumer says, what the interface displays, and what the AI agent interprets beneath the surface. Coverage type, lifestyle mindset, and extras are built through conversation, not form-like data capture
Mindset-driven personalisation: rather than a static interface for the average user, I defined six behavioural mindsets through customer research (Thorough Evaluator, Cost Savings, Discount-Motivated Switcher, Knowledge Gap Bridge, etc.). Each mindset maps to ML classification tags that inform how the agent adjusts tone, content emphasis, and question sequencing in real time (sketched below)
Human handover thresholds: designed the configurable escalation points at which the AI agent hands over to the Customer Success team. These thresholds are experimentable: the client can tune the balance between digital self-service and human intervention as they learn from real usage
Prototype validation: tested the core advisory flow with participants against the working prototype, validating that the conversational approach outperformed the existing form-based funnel before committing to the full build
Experience design framework: AI personality specification, dynamic contextual cards, mindset-driven personalisation, human handover thresholds, and the advisory interface design.
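To illustrate how mindset classification can drive real-time adaptation, here is a minimal sketch; the tags echo the mindsets named above, but the parameter set and values are hypothetical, not the production tuning.

```python
from dataclasses import dataclass

@dataclass
class ConversationParams:
    tone: str            # e.g. "reassuring" vs "efficient"
    detail_level: str    # how much policy detail to surface per turn
    question_pace: str   # how quickly to advance the needs assessment

# Each ML-classified mindset maps to tunable conversation parameters.
MINDSET_PARAMS = {
    "thorough_evaluator":   ConversationParams("measured", "high", "slow"),
    "discount_switcher":    ConversationParams("efficient", "low", "fast"),
    "knowledge_gap_bridge": ConversationParams("reassuring", "medium", "slow"),
}

def adapt(mindset_tag: str) -> ConversationParams:
    """Fall back to neutral defaults when the classifier is uncertain."""
    return MINDSET_PARAMS.get(
        mindset_tag, ConversationParams("neutral", "medium", "medium")
    )
```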
Guardrails and Safety
Insurance is a regulated product. An AI system making coverage recommendations carries real compliance risk. Guardrails weren't a Phase 4 addition; they were designed into the agent architecture from day one. Every recommendation path had defined boundaries, and the system was built to fail safe rather than fail silent. The guardrail framework operated at multiple layers: input validation, agent-level constraints, output filtering, and system-level monitoring (a minimal sketch of this layering follows the list below).
PII handling: customer data encrypted in transit and at rest, no personal information stored in conversation logs, automated PII detection and redaction in training data
Hallucination prevention: all policy recommendations grounded via RAG in verified product databases and Australian government health data; the system cannot invent policy benefits. No generative responses for policy details; only retrieved, verified information is surfaced to users
Out-of-scope deflection: hard boundaries on what the AI will and won't advise on, with graceful escalation to human advisers for edge cases, emotional distress signals, or complex regulatory questions. Supported vs unsupported use cases defined during discovery with the client
Compliance boundaries: AI framed as informational guidance, not financial advice. Clear disclaimers surfaced at recommendation points. Audit trail for every AI-informed recommendation traceable through the full agent interaction chain
Content filtering: input and output moderation preventing prompt injection, off-topic steering, and adversarial conversation patterns. Red team testing validated guardrail robustness against deliberate manipulation attempts
Confidence thresholds: when model confidence drops below defined thresholds, the system automatically triggers human handoff rather than surfacing uncertain recommendations. Thresholds calibrated through eval data, not set arbitrarily
Behavioural constraints: agent-level rules preventing sycophantic agreement with medically or financially dangerous customer assumptions, off-topic drift, and excessive personalisation that could feel manipulative rather than helpful
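A minimal sketch of the fail-safe decision flow, assuming illustrative names and a hypothetical confidence floor; as noted above, the production threshold was calibrated from eval data.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Action(Enum):
    RESPOND = auto()
    DEFLECT = auto()        # out of scope: graceful refusal, offer a human
    HANDOFF = auto()        # escalate to Customer Success with full context

@dataclass
class Draft:
    text: str
    confidence: float       # recommendation confidence from the engine
    in_scope: bool          # supported use case per discovery boundaries
    grounded: bool          # every policy claim traced to a RAG source

CONFIDENCE_FLOOR = 0.8      # illustrative; calibrated through eval data

def guard(draft: Draft) -> Action:
    if not draft.in_scope:
        return Action.DEFLECT
    if not draft.grounded:
        return Action.HANDOFF      # fail safe, never fail silent
    if draft.confidence < CONFIDENCE_FLOOR:
        return Action.HANDOFF      # uncertain advice goes to a human
    return Action.RESPOND
```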
Architecture
A network of specialised AI agents on Microsoft Azure, each with a defined responsibility and clear interface contracts. The Central Orchestrator coordinates conversation flow, routing to specialist agents based on conversation state (a minimal routing sketch follows the component list below). The architecture was designed for observability: every agent interaction logged, every decision traceable.
Central Orchestrator managing conversation state and agent routing
Personality Assessment Agent (Gemini) adapting tone, pace, and content sequencing in real time
Insurance Policy RAG Analyser grounded in funds provider knowledge bases and AU government health databases
Recommendation Engine with suitability ranking and confidence scoring
Voice layer (ElevenLabs + OpenAI GPT-4) for natural conversational delivery
Mixpanel integration for customer sentiment analysis, event tracking, and conversion attribution
Solution architecture: a network of specialised AI agents coordinated by a Central Orchestrator, built on Microsoft Azure with Gemini, OpenAI GPT-4, and ElevenLabs integrations.
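A minimal sketch of the orchestrator's routing pattern, assuming hypothetical agent names and a dict-based conversation state; the production orchestrator is considerably more involved.

```python
from typing import Protocol

class Agent(Protocol):
    def handle(self, state: dict) -> dict: ...

class Orchestrator:
    """Routes each turn to a specialist agent based on conversation state."""

    def __init__(self, agents: dict[str, Agent]):
        self.agents = agents  # e.g. personality, policy_rag, recommendation

    def route(self, state: dict) -> str:
        if state.get("needs_escalation"):
            return "handoff"            # human handover with full context
        if state.get("profile_complete"):
            return "recommendation"     # rank policies, score confidence
        if not state.get("mindset_tag"):
            return "personality"        # classify mindset, adapt tone/pace
        return "policy_rag"             # grounded policy Q&A

    def step(self, state: dict) -> dict:
        agent = self.agents[self.route(state)]
        return agent.handle(state)      # each interaction logged for traceability
```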
Evals Framework
Eval-driven development: evals defined before the system was built, not bolted on after. I defined eval criteria and success thresholds during discovery, captured baselines during prototyping, and built an automated eval pipeline that ran on every deployment from Phase 2 onwards. The framework followed a layered approach: automated evals in the deployment pipeline, model-graded quality assessment, human-in-the-loop calibration, and production monitoring, each layer catching failures the others miss.
Outcome-based grading: evaluated what the agent produced, not the path it took. A correct recommendation delivered with wrong tone or missing context still failed. Multi-dimensional scoring across accuracy, tone adaptation, information retrieval, mindset-appropriate pacing, and conversation completion
Trace-level evaluation: graded the full multi-agent interaction trace, covering orchestrator routing decisions, personality agent adaptations, RAG retrieval relevance, and handoff trigger accuracy. Each agent's contribution was evaluable independently and as part of the end-to-end conversation
Model-graded evals: used LLM-as-judge scoring to assess conversation quality at scale (tone appropriateness, coherence, information completeness), calibrated against human grader agreement before running autonomously (a minimal judge sketch follows this list)
Scenario-based testing: 200+ test conversations across supported verticals, customer segments, happy paths, edge cases, and adversarial inputs. Regression suite expanded as capability evals graduated to baseline expectations
Recommendation accuracy benchmarked against human adviser outcomes on identical customer profiles, testing production reliability (does it work consistently?), not just capability (can it work at all?)
Edge case and red team libraries: adversarial inputs, prompt injection attempts, ambiguous customer needs, multi-policy scenarios, emotional distress signals, out-of-scope requests, testing guardrail behaviour under pressure
Human-in-the-loop calibration: CS team reviewed random samples weekly, scoring against the same quality rubric used for human advisers. Grader agreement validated model-graded eval accuracy. Quality issues fed back into both eval criteria and the experience design specification
Production monitoring: designed for post-launch distribution drift detection, conversation quality degradation alerts, and conversion funnel tracking via Mixpanel, distinguishing pre-launch validation from ongoing production health
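To show the shape of the model-graded layer, here is a minimal sketch assuming a hypothetical judge_model callable; the rubric dimensions mirror the framework above, while the prompt wording and pass floor are illustrative.

```python
import json

RUBRIC = ["accuracy", "tone_adaptation", "information_completeness",
          "mindset_appropriate_pacing", "conversation_completion"]

JUDGE_PROMPT = """You are grading an insurance advisory conversation.
Score each dimension from 1 (fail) to 5 (excellent) and return JSON only,
with these keys: {dims}

Conversation transcript:
{transcript}
"""

def grade(transcript: str, judge_model) -> dict[str, int]:
    """Outcome-based grading: score the result, not the path taken."""
    prompt = JUDGE_PROMPT.format(dims=json.dumps(RUBRIC), transcript=transcript)
    scores = json.loads(judge_model(prompt))   # judge_model is hypothetical
    assert set(scores) == set(RUBRIC), "judge must score every dimension"
    return scores

def passes(scores: dict[str, int], floor: int = 4) -> bool:
    # A correct recommendation with the wrong tone still fails:
    # every dimension must clear the floor, not just the average.
    return all(v >= floor for v in scores.values())
```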
Delivery
Design-led delivery with engineering one week behind design. I owned the experience specification: every agent interaction, every conversation flow, every handoff threshold was designed and validated before engineering built it. Four phases, each with defined outputs and go/no-go criteria. Originally hired for the POC, I used its results to create C-level buy-in and extend the engagement into full production.
Phase 1. Discovery and research: ~500 survey respondents, 6 in-depth user interviews, 40+ adviser call analyses, competitor and disintermediation research, AU government health database mapping, eval criteria definition, supported use case boundaries
Phase 2. Design and prototype: experience design framework and AI personality specification delivered to CEO/CPO, working POC on AWS with core agents, prototype testing with users, automated eval pipeline running from first deployment
Phase 3. Build, migrate, integrate: full agent network + RAG pipeline built to the experience specification, migration from AWS to the client's data storage, Mixpanel analytics, AI-automated API queries to client systems
Phase 4. Evals, testing, handover: comprehensive scenario testing against both technical and experience quality criteria, regression suite, CS team training on handover thresholds, documentation, production release preparation
Team and Stakeholders
Five-person squad, all present across all phases. I led end-to-end delivery, stakeholder management, experience design, and AI interaction specification. The engagement reported directly to the CEO; this was a C-level AI transformation initiative, not a feature request.
Product and AI Design Lead (me): end-to-end delivery ownership, stakeholder management, experience design, AI interaction specification
Two Full-Stack Engineers: agent development, RAG pipeline, eval framework, API integrations, Mixpanel analytics
Product Designer: UI/UX design, brand design system alignment, Figma prototypes
Client contacts: CEO (product owner), CPO, CTO, Head of Customer Success, Head of Sales
What I Learned
The technology was ready: the system was fully built, integrated, and passing evals. The harder problem was organisational readiness. The Customer Success team needed to champion the shift from phone-first to digital advisory, and that change management work needed to run in parallel with technical delivery from day one, not after it. Performance measurement, evals, and guardrails are not phases; they are baseline infrastructure that shapes every other decision. Define them first, build everything else around them. Depth beats breadth in AI advisory: a narrow, expert-level system earns trust that a shallow generalist never will. And the most important design artefact in an AI product isn't the interface; it's the experience specification that defines how the system should behave. The UI is the surface; the personality spec, handoff thresholds, and mindset definitions are what make the experience feel human. That's the design craft that's invisible to the user but makes everything work.