This post is a deep dive into one component of my larger Gen AI integration guide, here.
One of the most obvious use-cases for LLMs is Q&A-style live assistance. But there’s a big difference between doing it and doing it right.
In this post, I’ll share lessons learned from 4 years of relentless iteration at Keeper (starting with GPT-3). Over the course of that time, we went from a mostly-human ops chat system to an almost entirely LLM-powered one:
Today, the results speak for themselves:
97% inbound message automation (no human touch)
230% higher engagement rate per user (users send more messages)
Higher CSAT (83% → 92%)
Lessons learned
As every company in the world races to implement this new tech, I want to share some lessons learned at Keeper.
Note: Keeper is a tax filing software. We use live assistance both to help users navigate the product, and to answer tax law questions. It’s not just customer support - it’s an ever-present tab in the app / web dashboard that makes our users feel safe as they file their taxes.
#1: Don’t try to replace core UI with chat
I remember March 2023, when GPT-4 was first announced. There were a ton of utopian projections for how it would impact software. Would everything become a chatbot? Is UI dead?
No, UI is not dead. There are massive efficiency benefits to the structure and visual information density that UI creates. Live assistance, whether text or voice, shouldn’t replace most of that for a long time. Example:
So let’s get practical. Where does live assistance actually fit in?
The easiest way to think about it is to imagine a great customer support manager sitting next to you as you use the product.
Here’s what functions live assist is best suited for:
Answers to user-initiated questions. This is the obvious win. Before LLMs, you’d have to wait hours or days for a response, and those responses were often … quite bad. Now you get one instantly, and — assuming the agent is well trained — the quality is very high. This expands the value of assistance in general: topics that would previously be deemed “not worth the wait” can now be asked. As we made our live assist agent better and better, the number of messages sent per user more than doubled. Rather than go straight for the churn flow or close the app, users would clarify whether Keeper offered a particular feature or not. It’s a big win.
Limited actions (deep-links + pre-fills). “Do it for me” is a seductive dream for anyone designing with AI, but I think it’s important to draw the line, at least initially, at deep links and pre-fills. Take a common example: a user asks to cancel their subscription. There’s already a churn flow in the settings tab that shows the user what payment method they’re using, asks a few questions, potentially offers a discount, and tells them how long they have until they lose access. That’s a lot, and it’s probably better suited to UI than freeform text / voice chat (see below). So the best thing for the live assistant to do is offer the user a deep-link to that flow (a rough sketch of this pattern follows this list).
Deep-dives into whatever the user is focused on. Typing is hard. If there’s a graceful way to let users click instead of typing, that’s great. One example is to add a button labeled “explain this” next to a block of text. In a pre-LLM world, this would have popped up a modal with static text. With a good live-assist agent, you can generate that text live, so it’s personalized and the user can easily ask clarifying questions.
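To make the deep-link pattern concrete, here’s a minimal sketch using OpenAI-style function calling. The offer_deep_link tool, the route catalog, and the prompts are all hypothetical (this isn’t Keeper’s actual implementation); the point is that the agent’s only “action” is handing the user a link into an existing UI flow.

```python
# Minimal sketch of the deep-link pattern: the agent never performs
# account actions itself; it can only offer links into existing UI flows.
# Tool name, routes, and prompts are hypothetical.
from openai import OpenAI

client = OpenAI()

# Whitelist of in-app destinations the agent is allowed to link to.
ROUTES = {
    "cancel_subscription": "app://settings/cancel",
    "update_payment": "app://settings/payment",
    "submit_return": "app://filing/review",
}

tools = [{
    "type": "function",
    "function": {
        "name": "offer_deep_link",
        "description": "Offer the user a one-tap link into an existing "
                       "in-app flow instead of acting directly in chat.",
        "parameters": {
            "type": "object",
            "properties": {
                "route": {"type": "string", "enum": list(ROUTES)},
                "prefill": {
                    "type": "object",
                    "description": "Optional fields to pre-populate in the flow.",
                },
            },
            "required": ["route"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a tax-filing support agent. "
         "For account actions, offer a deep link; never act directly."},
        {"role": "user", "content": "I want to cancel my subscription."},
    ],
    tools=tools,
)

for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
    # The client app resolves ROUTES[route] into a tappable button.
```

The enum on route is what keeps this safe: the model can only point at flows that already exist, never invent an action.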
While tempting, it’s dangerous / foolhardy to try to get live assistance to replace UI in the following functions:
Data entry (e.g. “What’s your address?”). Input field UI has lots of quality-of-life features such as error states, warning states, pre-fills, radio-buttons, dropdowns, disabled states, etc. Users are accustomed to these conventions, so it’s much faster to show “SSNs must be 9 digits” as an error state right under the input field than to try to communicate that in a single-threaded assistance experience.
Structured information display (e.g. “Here are the 5 options to choose from”). Voice / text is a very inefficient way to present information; it’s slow. Well-designed UI is much more efficient. The other benefit of UI here is that it’s more grounding: you know where to go to retrieve that information next time (rather than it getting lost in a thread).
Common user actions (e.g. “Submit my return for review”). Tapping / clicking is faster than saying / typing. If a lot of users are going to perform an action, it should have a corresponding button. The other advantage of UI here is that it’s grounding — the user knows where it is and how to come back to it later.
Outbound messaging. If you want users to use your live assist agent, they need to trust it. As soon as you start sending push notifications or any other kind of unsolicited outbound messages, you’re on a very slippery slope to Clippy.
#2: Don’t be Clippy
From Microsoft’s Clippy to Apple’s Siri, consumers have been thoroughly underwhelmed by allegedly smart personified helpers that actually turn out to be a thin skin on top of a decision tree or voice search.
Assuming your model is actually smart, the last thing you want to do is trigger users’ pattern recognition by introducing it like these bad assistants.
Resist the urge to personify it, don’t force it to recite pre-canned responses, and certainly don’t have it interrupt the user while they’re doing something else. Interruptions are the most sure-fire way to destroy trust. If the system is as smart as you say it is, then it should know that the user is busy right now. Instead, find contextual hooks – like one-click buttons that pre-fill questions when chat is the best way to explain something (sketched at the end of this section).
We learned this lesson by doing exactly that. The first version of our LLM-based assistant got very low engagement. The more we dug in and spoke with users, the more we realized that our presentation didn’t pass the “sniff test”. One glance, and users assumed it would be useless.
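Here’s roughly what a contextual hook like the “explain this” button could look like server-side, continuing the sketch style from above. The function name, model choice, and context fields are illustrative assumptions.

```python
# Sketch of an "explain this" contextual hook: the button ships the block
# of text plus user context to the agent, so the user never has to type.
# Function name, model, and context fields are illustrative.
from openai import OpenAI

client = OpenAI()

def explain_this(block_text: str, user_context: dict) -> str:
    """Generate a personalized explanation of a UI element on demand."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "Explain this part of the tax-filing UI to the user in "
                f"2-3 sentences. User context: {user_context}"
            )},
            {"role": "user", "content": block_text},
        ],
    )
    return resp.choices[0].message.content

# Drop the result into the chat thread as the opening message, so any
# follow-up question lands in a conversation that already has context.
print(explain_this(
    "Quarterly estimated tax payments",
    {"filing_status": "single", "has_1099_income": True},
))
```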
#3: Unify your AI live agent with human support
Users don’t know how good (or bad!) your AI live support agent is, so if you force them to choose between the AI and a human … most will never bother with the AI. Perhaps culturally that’ll change in the next 5 years, but for now consumers have been burned enough times that they know to avoid “smart chatbots”.
Instead, treat the AI like a junior customer support agent: it’s the first line of defense, it’s quick to respond, and it knows when to raise its hand and pull in its manager. That’s a win-win!
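In practice, “raising its hand” can be a single extra tool alongside the normal answer path. A minimal sketch, with hypothetical names:

```python
# Sketch of the "junior agent" handoff: the model answers when confident
# and calls one escalation tool when it isn't. Names are hypothetical.
tools = [{
    "type": "function",
    "function": {
        "name": "escalate_to_human",
        "description": "Hand the thread to a human agent when the question "
                       "is out of scope or the answer would be a guess.",
        "parameters": {
            "type": "object",
            "properties": {
                "reason": {"type": "string"},
                "summary": {
                    "type": "string",
                    "description": "One-paragraph handoff summary so the "
                                   "human doesn't start cold.",
                },
            },
            "required": ["reason", "summary"],
        },
    },
}]
# Crucially, the user stays in the same thread either way; they never
# have to choose between "AI" and "human" up front.
```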
#4: Get your metrics right
In my experience, it’s practically impossible to boil down your live assist agent’s performance into a neat single metric. You’ll need to measure a bunch of metrics and check them against each other periodically:
Human accuracy / tone evals (weekly / monthly). BE CAREFUL — humans are biased, and as your agent gets better and better, it becomes hard to know who’s right (the agent or the human evaluator). You can of course mitigate this by increasing training and performance management and seeking consensus, but that starts to get expensive and unwieldy. The human reviewer is only human, and they’re only as good as the review matrix given to them. Don’t overfit on this metric.
AI accuracy / tone evals (daily). Eval agents are effective and, most importantly, cheap and fast. Simply by prompting with a good review matrix, an eval agent will give you a directional sense of what changed when you launch a new model (see the sketch after this list).
Usage metrics. All of the basic metrics still matter: escalation rate, messages per user, clarification frequency, number of new threads started per user, repeat user engagement. Model changes are complicated, and your ML team will need to review all of these metrics regularly to get a full picture of what’s going on.
Eng metrics. Obviously, classic engineering metrics like latency and error monitoring need to be in place. Latency really matters to the user experience: be careful about adding too many layers of agents, because they’ll slow down your answers.
Qualitative surveys. Obviously it would be nice if users just told you what they think of your agent’s responses. But it’s super annoying to get asked for feedback all the time, so you’ll have to find ways to be more subtle. Reactions (like in iMessage) are a good option because they’re unobtrusive and come naturally to most customers. More heavy-handed options like CSAT surveys should be used sparingly.
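For the AI evals above, a minimal LLM-as-judge sketch might look like the following. The review matrix, score schema, and model choice are my assumptions, not Keeper’s actual rubric.

```python
# Minimal LLM-as-judge sketch for daily accuracy / tone evals.
# The review matrix and scoring schema are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

REVIEW_MATRIX = """Score the agent reply from 1-5 on each axis:
- accuracy: is the tax / product information correct?
- tone: warm and concise, no walls of hedging text?
- escalation: did it escalate when (and only when) it should have?
Return JSON: {"accuracy": n, "tone": n, "escalation": n, "notes": "..."}"""

def grade(transcript: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": REVIEW_MATRIX},
            {"role": "user", "content": transcript},
        ],
    )
    return json.loads(resp.choices[0].message.content)

# Grade a daily sample of threads and watch the averages move when you
# ship a new model or prompt. It's directional, not ground truth.
```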
#5: Have a unified escalation process in place
Operations / customer support should be able to clearly see which topics are causing escalations and which of them are outliers. The faster they catch these, the better for the customer experience. Classic examples here are bugs in the product and mistakes in the embeddings. Have your CX team review these metrics at least twice a day.
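One lightweight way to give CX that visibility is to tag every escalation with a coarse topic and count them per review window. A sketch, with a hypothetical topic list:

```python
# Sketch: tag each escalation with a coarse topic so CX can spot spikes
# (e.g. a product bug or a retrieval mistake suddenly dominating).
# The topic list is hypothetical.
from collections import Counter
from openai import OpenAI

client = OpenAI()

TOPICS = ["product_bug", "bad_retrieval", "tax_law_edge_case",
          "billing", "angry_user", "other"]

def tag_escalation(summary: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content":
                f"Classify this escalation into exactly one of: {TOPICS}. "
                "Reply with the label only."},
            {"role": "user", "content": summary},
        ],
    )
    label = resp.choices[0].message.content.strip()
    return label if label in TOPICS else "other"

def escalation_report(summaries: list[str]) -> Counter:
    """Topic counts for one review window (e.g. half a day)."""
    return Counter(tag_escalation(s) for s in summaries)
```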
#6: Be careful with over-prompting / over-tuning
The first mistake everyone seems to make is to micro-manage the agent like crazy. What I mean by that is over-prompting. E.g. “Don’t try to answer questions about payments, avoid any type of tax advice, escalate to a human support agent anytime you’re not sure.” While it’ll make your risk / compliance teams feel better, it will also completely kill your agent’s ability to be useful.
Instead, give it a little bit of maneuverability (see the prompt comparison after this list). The reality is not nearly as scary as it sounds, because:
There is a cost to every escalation. Not just salary cost; more critically, it forces the user to wait. It’s often better to get a decent answer now than a great answer in 3 minutes.
There are much better ways to solve hallucinations: agent-of-agents models. You can read about how we designed ours at Keeper here.
Even without a well-designed RAG architecture and tuning, LLMs are built to say reasonable things. You’ll have a harder time getting one to be opinionated than getting it to give “it depends” platitudes. Its base training alone makes it good at avoiding touchy subjects. Push legal / compliance to compromise here, because the “what ifs” you come up with in a meeting room are very different from the specific examples on the ground.
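To illustrate the difference, here are two hypothetical system prompts side by side. Neither is Keeper’s actual prompt; they just show how “do not” rules pile up versus a right-sized scope.

```python
# Two hypothetical system prompts: over-prompted vs. right-sized.
# Neither is Keeper's actual prompt.

OVER_PROMPTED = """You are a support agent.
Do NOT answer questions about payments.
Do NOT give any tax advice of any kind.
Escalate to a human any time you are not 100% sure.
Never speculate. Never estimate. Never recommend."""
# Result: nearly every real question trips a "do not" rule, so the
# agent deflects or escalates constantly.

RIGHT_SIZED = """You are a support agent for a tax-filing product.
Answer product and general tax questions directly and concisely.
For amended returns or legal disputes, escalate to a human with a
short handoff summary. If unsure, say what you do know and offer
to escalate rather than refusing outright."""
# Result: the agent keeps its base-model judgment on touchy subjects
# and reserves escalation for cases that actually need a human.
```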
Wrapping up
Every company needs an AI live assist agent, and I have yet to find a truly effective SaaS offering. Perhaps a winner will emerge, but in the meantime I hope lessons learned from Keeper are helpful. Cheers!