Cameron Henkes

AI Digital Advisor: From POC to Production for 4.5M Users

2025

Contracted through Pathfindr to design and deliver a multi-agent AI advisory system for a major Australian insurance comparison platform (NDA) serving 4.5M+ users. I ran the discovery research, designed the experience framework and AI personality specification, defined eval criteria and guardrails, and led the squad from POC to a production-ready system. The engagement started as a proof of concept. The speed and quality of the work convinced the CEO to extend into full production.

AI Experience Design, Multi-Agent, Guardrails & Evals, User Research
The Engagement

I was contracted through Pathfindr to work on a project for one of their clients -- a major Australian insurance comparison platform serving 4.5M+ users. The platform's entire revenue model depended on phone-based advisers converting leads, but the industry was shifting to digital self-service and their funnel was breaking.

Five-person squad, all Pathfindr contractors: myself as product and AI design lead, a solution architect, two full-stack engineers, and a product designer. The engagement reported directly to the client's CEO -- this was a C-level AI transformation initiative, not a feature request.

I was originally hired for the proof of concept. We built the POC with a production mindset from day one -- the goal was to demonstrate capability that could be productised immediately, not a throwaway demo. The speed and quality of what we delivered convinced the CEO to extend into full production.

POC agent selection screen showing three AI advisers -- Laura, Henry, and John -- with pink styling to prevent premature productisation
The POC agent selection screen. Deliberately styled in pink to prevent the client from shipping it as-is -- forcing the production design phase.

The Problem

The client's data told a clear story. The phone-first model was breaking and the digital channel wasn't converting. All of these numbers came directly from the client's analytics.

  • 75% of traffic left without entering any conversion path -- no form started, no call booked
  • 50% form abandonment at the phone number field. Users described being 'bombarded with follow-up calls' after leaving a number
  • 42% of outbound calls went unanswered. 11% were fake numbers
  • Online quotes didn't match agent recommendations 75% of the time, destroying trust in the digital channel and driving disintermediation -- users researched on the platform, then bought direct from providers
  • Each lost conversion represented ~$1,075 in missed commission revenue based on average policy values

I ran the discovery research to understand why: ~500 survey respondents, 6 in-depth user interviews, and analysis of 40+ recorded calls from the platform's top-performing human advisers. The research revealed that customers weren't leaving because of the content. They were leaving because the experience pushed them into a sales process before they were ready.

Research Into Human Advisers

The platform's competitive edge was their human advisers -- high phone conversion because humans read customers, adapt tone, and build trust. The design question wasn't 'can we automate this?' It was 'what makes a human advisory conversation feel trustworthy, and which of those qualities can an AI system replicate authentically?'

I studied 40+ recorded calls, mapping questioning patterns, trust-building techniques, objection handling, and the adaptive behaviours that separated the best advisers from average ones. The top performers didn't follow a script. They assessed the customer's mindset within the first few exchanges and adapted everything -- pacing, information density, question sequencing -- to match.

I deliberately scoped the system twelve feet deep rather than twelve feet wide: health insurance for two target segments (young couples and singles). Expert-level depth earns trust. Shallow breadth across every vertical feels like another chatbot.

AI avatar advisory session showing Henry the AI agent in conversation, with dynamic contextual cards displaying policy match, budget, and customer profile
The AI avatar advisory experience: real-time conversation with dynamic contextual cards progressively building the customer profile

Experience Design Framework

The architecture defines what the system can do. The experience design framework defines how it should behave, adapt, and know when to hand over to a human. I designed and presented this as a strategic framework to the client's CEO and CPO, bridging product design intent and technical implementation. This was the document that turned the POC into a production commitment.

  • AI personality specification: configurable parameters across voice and tone (pitch variation, speaking rate, rhythm, emphasis), conversational behaviours (question types, topic switching, follow-up patterns, response length), and natural language processing (emotional detection, urgency, confidence, intent classification) -- each mapped to a tunable parameter for iterative rollout
  • 6 behavioural mindsets (Thorough Evaluator, Cost Savings, Discount-Motivated Switcher, Knowledge Gap Bridge, etc.) defined through customer research and mapped to ML classification tags that drive real-time adaptation of tone, content emphasis, and question sequencing
  • Dynamic contextual cards that progressively build a customer profile through conversation, replacing form-like data capture with progressive profiling that feels conversational
  • Human handover thresholds: configurable escalation points at which the AI agent hands over to Customer Success. These are experimentable -- the client can tune the balance between digital self-service and human intervention as they learn from real usage
  • Prototype validation: tested the core advisory flow with participants against a working prototype, validating that the conversational approach outperformed the existing form-based funnel before committing to the full build
Text and voice-to-chat interface showing the AI agent analysing a user's Bupa policy, with live transcript and current policy breakdown
The text/voice-to-chat mode: the AI agent analyses the user's current policy in real-time, breaking down coverage and identifying savings
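
As an illustration only, the mapping from behavioural mindsets to tunable personality parameters described in the framework above can be sketched as a lookup table. The field names, parameter values, and the `params_for` helper are hypothetical and are not the production specification. Only the four mindsets named in the text are included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PersonalityParams:
    """A small, hypothetical subset of the tunable personality parameters."""
    speaking_rate: float   # relative to a neutral 1.0 baseline (assumed scale)
    response_length: str   # "short", "medium", or "long"
    lead_with: str         # content the agent emphasises first


# Mindset tags as a classifier might emit them. All values are placeholders.
MINDSET_PARAMS = {
    "thorough_evaluator": PersonalityParams(0.9, "long", "coverage_detail"),
    "cost_savings": PersonalityParams(1.0, "short", "price_comparison"),
    "discount_motivated_switcher": PersonalityParams(1.1, "short", "switching_offers"),
    "knowledge_gap_bridge": PersonalityParams(0.9, "medium", "plain_language_explainer"),
}

# Used when the classifier returns a tag with no entry in the table.
DEFAULT_PARAMS = PersonalityParams(1.0, "medium", "needs_discovery")


def params_for(mindset_tag: str) -> PersonalityParams:
    """Return the parameters for a mindset tag, falling back to the default."""
    return MINDSET_PARAMS.get(mindset_tag, DEFAULT_PARAMS)
```

Keeping the parameters in data rather than in prompt text is one way to make them adjustable during an iterative rollout without touching agent code.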

Metrics, Guardrails, and Evals

Before writing any code, I defined the performance measurement framework with the client's CEO and CTO. Every design decision mapped to a measurable outcome: lead conversion rate (baseline 7.6%, target 12-17%), recommendation accuracy benchmarked against human adviser outcomes, conversation completion rate, handoff quality scoring, cost per conversation vs phone advisers, and sub-2-second latency targets.
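
To give the conversion target a sense of scale, here is a rough worked example. It assumes a hypothetical batch of 1,000 leads and values each additional conversion at the ~$1,075 average commission quoted earlier. Both are simplifying assumptions for illustration, not client figures for actual volume.

```python
AVG_COMMISSION = 1075      # dollars per conversion, from the ~$1,075 figure above
BASELINE_RATE = 0.076      # 7.6% baseline lead conversion
TARGET_RATES = (0.12, 0.17)


def extra_commission(leads: int, target_rate: float) -> float:
    """Additional commission if conversion rises from baseline to target_rate."""
    extra_conversions = leads * (target_rate - BASELINE_RATE)
    return extra_conversions * AVG_COMMISSION


# Hypothetical batch size of 1,000 leads.
for rate in TARGET_RATES:
    print(f"{rate:.0%} target: ${extra_commission(1000, rate):,.0f} per 1,000 leads")
```

Under those assumptions the target range works out to roughly $47,300 to $101,050 in additional commission per 1,000 leads.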

Insurance is a regulated product. An AI system making coverage recommendations carries real compliance risk. Guardrails were designed into the agent architecture from day one -- not added in Phase 4. The system was built to fail safe rather than fail silent.

  • All policy recommendations grounded in RAG against verified product databases and AU government health data -- the system cannot invent policy benefits
  • Hard boundaries on what the AI will and won't advise on, with escalation to humans for edge cases, emotional distress, or regulatory complexity
  • AI framed as informational guidance, not financial advice. Disclaimers at recommendation points. Full audit trail for every AI-informed recommendation
  • Content filtering preventing prompt injection, off-topic steering, and adversarial patterns. Red team testing validated robustness
  • Confidence thresholds triggering automatic human handoff rather than surfacing uncertain recommendations -- calibrated through eval data, not set arbitrarily
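
A minimal sketch of the fail-safe routing idea in the list above: anything out of scope, emotionally sensitive, unscored, or below the confidence threshold goes to a human instead of being surfaced. The `Turn` fields, the `route` function, and the 0.8 default are hypothetical. The text states the real thresholds were calibrated from eval data.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Route(Enum):
    SURFACE = "surface_recommendation"
    HANDOFF = "human_handoff"


@dataclass
class Turn:
    confidence: Optional[float]  # None if the scoring step failed
    in_scope: bool               # False if the topic is outside the advice boundary
    distress_detected: bool      # True if emotional distress was flagged


def route(turn: Turn, min_confidence: float = 0.8) -> Route:
    """Decide whether to show a recommendation or hand over to a human."""
    # Fail safe, not silent: a missing score is treated as a reason to escalate.
    if turn.confidence is None:
        return Route.HANDOFF
    # Hard boundaries take priority over any confidence score.
    if not turn.in_scope or turn.distress_detected:
        return Route.HANDOFF
    # Uncertain recommendations are never surfaced.
    if turn.confidence < min_confidence:
        return Route.HANDOFF
    return Route.SURFACE
```

Because `min_confidence` is a plain argument, the escalation point stays tunable as real usage data comes in.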

Evals were defined before the system was built. I defined eval criteria and success thresholds during discovery and captured baselines during prototyping; the team then built an automated eval pipeline that ran on every deployment from Phase 2 onwards. The suite covered 200+ test conversations across customer segments, edge cases, and adversarial inputs. Model-graded evals used LLM-as-judge to score conversation quality at scale, calibrated against human grader agreement. The CS team reviewed random samples weekly against the same quality rubric used for human advisers.
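
One common way to check an LLM judge against human graders is Cohen's kappa, which corrects raw agreement for chance. The sketch below is an assumption about how such a calibration and deployment gate could look, not a description of the pipeline the team built. The function names and both thresholds are hypothetical.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(judge: Sequence[str], human: Sequence[str]) -> float:
    """Chance-corrected agreement between judge labels and human labels."""
    if not judge or len(judge) != len(human):
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge)
    # Observed agreement: share of conversations given the same label.
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Expected agreement if both graders labelled independently at their own rates.
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = set(judge) | set(human)
    expected = sum(judge_counts[l] * human_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both graders used a single identical label throughout
    return (observed - expected) / (1 - expected)


def deploy_allowed(pass_rate: float, kappa: float,
                   min_pass_rate: float = 0.9, min_kappa: float = 0.6) -> bool:
    """Gate a deployment on eval pass rate and judge reliability (placeholder thresholds)."""
    return pass_rate >= min_pass_rate and kappa >= min_kappa
```

For example, with judge labels `["pass", "pass", "fail", "fail"]` and human labels `["pass", "pass", "fail", "pass"]`, raw agreement is 0.75 but kappa is 0.5, which would fail the placeholder gate even with a high pass rate.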

Evals and guardrails are not phases. They are baseline infrastructure that shapes every other decision. Define them first, build everything else around them.

Architecture

The solution architect designed the multi-agent architecture on Microsoft Azure -- a network of specialised AI agents, each with a defined responsibility and clear interface contracts. My role was defining the experience layer that sat on top: what each agent should do, how they should behave, and when they should hand off. The architecture was designed for observability -- every agent interaction logged, every decision traceable.

  • Central Orchestrator managing conversation state and agent routing
  • Personality Assessment Agent (Gemini) adapting tone, pace, and content sequencing in real-time based on the behavioural mindset framework I defined
  • Insurance Policy RAG Analyser grounded in health fund provider knowledge bases and AU government health databases
  • Recommendation Engine with suitability ranking and confidence scoring
  • Voice layer (ElevenLabs + OpenAI GPT-4) for natural conversational delivery. We also explored Tavus as an alternative for AI avatar-based interactions
  • Mixpanel integration for customer sentiment analysis, event tracking, and conversion attribution
Multi-agent system architecture on Microsoft Azure showing the Central Orchestrator, specialised AI agents, and integration points
Solution architecture designed by the solution architect -- a network of specialised agents coordinated by a Central Orchestrator
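
To make the orchestrator pattern concrete, here is a deliberately simplified sketch of a central router that owns conversation state, picks the next agent from that state, and records every call for traceability. The agent functions are stubs and every name here is hypothetical. The real system ran on Azure with model-backed agents, and none of that is reproduced here.

```python
import logging
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

log = logging.getLogger("orchestrator")


@dataclass
class ConversationState:
    mindset: Optional[str] = None
    profile: Dict[str, object] = field(default_factory=dict)
    trace: List[str] = field(default_factory=list)  # which agents ran, in order


def personality_agent(state: ConversationState, message: str) -> None:
    # Stub: a keyword check stands in for a model-based classifier.
    state.mindset = "cost_savings" if "cheaper" in message.lower() else "thorough_evaluator"


def policy_rag_agent(state: ConversationState, message: str) -> None:
    # Stub: a real agent would retrieve from verified product data.
    state.profile["policies"] = ["placeholder_policy"]


def recommendation_agent(state: ConversationState, message: str) -> None:
    # Stub: picks the first retrieved policy instead of ranking by suitability.
    state.profile["recommendation"] = state.profile["policies"][0]


AGENTS: Dict[str, Callable[[ConversationState, str], None]] = {
    "personality": personality_agent,
    "policy_rag": policy_rag_agent,
    "recommendation": recommendation_agent,
}


def next_agent(state: ConversationState) -> Optional[str]:
    """Route on what the state is still missing; None means the turn is done."""
    if state.mindset is None:
        return "personality"
    if "policies" not in state.profile:
        return "policy_rag"
    if "recommendation" not in state.profile:
        return "recommendation"
    return None


def handle_turn(state: ConversationState, message: str, max_steps: int = 10) -> ConversationState:
    """Run agents until nothing is missing, logging each step."""
    for _ in range(max_steps):  # guard against a routing loop
        name = next_agent(state)
        if name is None:
            break
        AGENTS[name](state, message)
        state.trace.append(name)
        log.info("agent=%s mindset=%s", name, state.mindset)
    return state
```

Calling `handle_turn(ConversationState(), "I want something cheaper")` runs the three stub agents in order and leaves the sequence in `state.trace`, which is the kind of per-decision record the observability goal calls for.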

Delivery

Design-led delivery with engineering one week behind design. I owned the experience specification -- every agent interaction, every conversation flow, every handoff threshold was designed and validated before engineering built it. Four phases, each with defined outputs and go/no-go criteria.

  • Phase 1 -- Discovery: ~500 survey respondents, 6 user interviews, 40+ adviser call analyses, competitor research, AU government health database mapping, eval criteria and use case boundaries defined
  • Phase 2 -- Design and prototype: experience framework and AI personality specification delivered to CEO/CPO, working POC with core agents, prototype testing with users, automated eval pipeline running from first deployment
  • Phase 3 -- Build and integrate: full agent network and RAG pipeline built to the experience specification, data migration, Mixpanel analytics, API integrations to client systems
  • Phase 4 -- Evals and handover: comprehensive scenario testing, regression suite, CS team training on handover thresholds, documentation, production release preparation
Activity view showing the customer profile built progressively through conversation -- name, age, coverage type, services used, sports, and family planning
Progressive profiling in action: the customer profile built entirely through conversation, not forms
Activity view showing real-time web scrape of HCF health fund with price comparison between CompareClub and direct purchase
Real-time competitive intelligence: the agent scrapes provider websites to compare pricing during the conversation

What Went Wrong

The technology was ready. The system was fully built, integrated, and passing evals. The harder problem was organisational readiness.

The original brief focused on the unauthenticated product funnel -- the 75% of visitors who left without engaging. But as the project progressed, the client's risk appetite shifted. The product moved to an authenticated experience, meaning the problems it was designed to solve changed. The system we built for anonymous visitors converting through a conversational flow was now being evaluated against a different use case.

The Customer Success team needed to champion the shift from phone-first to digital advisory, and that change management work needed to run in parallel with technical delivery from day one -- not after it. We built the technology before the organisation was ready to deploy it.

The three-way handover screen showing the AI agent handing the conversation to a human CS agent named Sophia, with full context transfer
The human handover: the AI agent introduces the CS rep with full conversation context. The threshold for when this triggers was experimentable.

Current Status

The system was fully built, integrated, and passing evals across 200+ test conversations. The POC-to-production transition was completed. However, the project has not yet launched to the platform's 4.5M users.

The shift from unauthenticated to authenticated experience, combined with the organisational readiness gap described above, means the production launch timeline is now with the client. The technology is ready. The defined targets -- lead conversion from 7.6% to 12-17%, recommendation accuracy benchmarked against human advisers, sub-2-second latency -- remain unmeasured against real traffic.

The system passed evals. It hasn't passed the market yet. Those are different things.

What Carries Forward

The most important design artifact in an AI product isn't the interface -- it's the experience specification that defines how the system should behave. The UI is the surface. The personality spec, handoff thresholds, and mindset definitions are what make the experience feel human. That's the design craft that's invisible to the user but makes everything work.

Depth beats breadth in AI advisory. A narrow, expert-level system earns trust that a shallow generalist never will. And defining metrics, evals, and guardrails before building anything else isn't just good practice -- it's the foundation that shapes every subsequent decision about architecture, agent design, and experience flow.

The POC was deliberately styled in pink -- a visual signal that this was not production-ready. It forced the production design phase to happen rather than allowing the client to ship the POC as-is. When we moved to production, the product designer aligned the interface with the client's brand system. The same functionality, the same experience framework, but now it looked like their product.

Production version of the AI Digital Advisor aligned with the client's brand -- purple colour scheme, branded header, and polished UI
Production: same experience framework, now aligned with the client's brand system
Mobile production design showing the AI agent Henry recommending a Premium Hospital policy with coverage details and voice interaction
Mobile production design: the full advisory experience adapted for phone, with voice interaction
