How to successfully implement continuous monitoring and data labeling for a chatbot with Kili Technology

Alexa, Siri, Cortana: over the last decade, chatbots have invaded our daily lives through the objects and software we use every day, thanks to great improvements in the underlying technologies. And this is not just a fancy add-on for a business: Gartner estimated that by 2021, early adopters of chatbots would see their revenue increase by 30%.

Yet a successful chatbot is not an immutable truth: no matter how well it performs now, its performance will fade as new situations arise. With one customer, we successfully implemented a continuous monitoring pipeline that allowed their chatbot to remain accurate and relevant through a few iterations of data labeling. Let’s dive into the details and share a few valuable tips.

Why chatbot monitoring is the main concern for any company

Our client is a luxury goods company that adapted to new shopping habits by providing a chatbot service. Of course, its reputation for quality, traditionally well handled in stores, should also be visible online, and being able to answer all types of questions precisely and quickly is key.

Yet chatbots are incredibly complicated architectures, which can generally be compared to mapping an enormous decision tree. With deep learning making it possible to build very complex yet efficient architectures, the performance can be very impressive. Still, it is far from a linear transcription of the customer’s thoughts, as humans have very different ways of thinking.

  • Customers talk about symptoms, i.e. they describe the experience they had, whereas the conversational assistant tries to map it to a broader cause. This is also why chatbots are complicated to implement: synthetic and generic data (mostly taken from research datasets) often lacks the context needed to map to the relevant cases. A typical example is a conversation where the chatbot fails to address what the customer really wants and to deliver the value they expected. Yet this is a question we would really like to be able to answer, as any human would understand it.
  • Language relies heavily on ambiguity and context, making each conversation unique from the first message sent by the customer. All following messages can take on a whole different meaning depending on how the conversation started. A typical example is a highly context-dependent conversation where the user assumes, especially after the bot shows it understands the word “voucher”, that it will be able to answer their need.

Basically, a chatbot relies on three types of data to reach a standard level of performance, hence three machine learning tasks.

  • First of all, intent detection. The AI must understand the question at hand, from the context, as quickly and as precisely as possible. Many models perform quite well on this task, provided they are given precise, domain-specific data. We’ll see what data can be used to initialize our chatbot.
  • To help with this classification, entities are isolated and provided to the AI. Names, places and other proper nouns are often well recognized by common frameworks, but what about domain-specific terms, newer products and even new situations, such as the several we have seen in recent months? This named-entity recognition task ensures the bot remains sharp.
  • Ultimately, we want to evaluate the quality of the conversations throughout the chatbot’s life: when did the bot successfully handle the conversation? If it did not, where did it lose track and what should it have understood? The general idea is to bootstrap the bot by providing it with more and more cases and patterns on which it has failed. This data is then reused, and over-weighted so that it becomes a real source of improvement.
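To make the three tasks concrete, here is a minimal sketch of what one labeled conversation could look like once intents, entities and the quality assessment are all attached. The class names, the label names and the `failure_weight` value are assumptions made for illustration; they are not the client’s schema or a Kili format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EntitySpan:
    start: int  # character offset in the message, inclusive
    end: int    # character offset in the message, exclusive
    label: str  # hypothetical entity category, e.g. "PAYMENT_MEANS"


@dataclass
class LabeledMessage:
    text: str
    intent: str  # task 1: intent detection
    entities: List[EntitySpan] = field(default_factory=list)  # task 2: NER


@dataclass
class LabeledConversation:
    messages: List[LabeledMessage]
    handled_by_bot: bool  # task 3: did the bot handle the conversation?


def training_weight(conv: LabeledConversation, failure_weight: float = 3.0) -> float:
    """Over-weight conversations the bot failed on (the weight is an assumption)."""
    return 1.0 if conv.handled_by_bot else failure_weight


message = LabeledMessage(
    text="Can I use my voucher online?",
    intent="voucher_usage",
    entities=[EntitySpan(start=13, end=20, label="PAYMENT_MEANS")],
)
failed = LabeledConversation(messages=[message], handled_by_bot=False)
print(training_weight(failed))
```

Keeping the three layers in one record makes it easy to reuse a failed conversation for intent training, entity training and monitoring at the same time.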

Creating a state of the art chatbot is about continuous labeling and continuous training

What we did for our client was set up a monitoring pipeline. On a weekly basis, conversations were extracted from the database and loaded into Kili. Our setup covered the previously mentioned tasks on production data, in order to keep track of the chatbot’s performance and to collect more data to improve it.
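The weekly extraction step can be sketched roughly as follows. This is only an illustration: the record fields (`id`, `ended_at`, `messages`, `author`, `text`) and the output keys (`external_id`, `content`) are assumed names, and the actual upload to the labeling tool is deliberately left out rather than guessed.

```python
from datetime import datetime, timedelta


def select_weekly_batch(conversations, now, days=7):
    """Keep conversations that ended within the last `days` days before `now`."""
    cutoff = now - timedelta(days=days)
    return [c for c in conversations if cutoff <= c["ended_at"] < now]


def to_text_asset(conversation):
    """Flatten one conversation into a single text asset with a stable id."""
    lines = [f'{m["author"]}: {m["text"]}' for m in conversation["messages"]]
    return {"external_id": conversation["id"], "content": "\n".join(lines)}


# Hypothetical production records, as they might come out of the database.
now = datetime(2021, 2, 1)
conversations = [
    {
        "id": "conv-001",
        "ended_at": datetime(2021, 1, 29, 18, 1),
        "messages": [
            {"author": "customer", "text": "Can I use my voucher online?"},
            {"author": "bot", "text": "Could you rephrase your question?"},
        ],
    },
    {
        "id": "conv-002",
        "ended_at": datetime(2021, 1, 10, 9, 30),
        "messages": [{"author": "customer", "text": "Where is my order?"}],
    },
]

batch = [to_text_asset(c) for c in select_weekly_batch(conversations, now)]
print(batch)
```

Using the conversation id as a stable external id keeps the same conversation from being labeled twice if the weekly job is ever re-run.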

In short, the goal is to grow from a computer program into an autonomous assistant, capable of handling a conversation on its own (high-quality conversations); a sharp assistant, relevant to the customer (an ever-growing vocabulary of technical terms and context, flagged as entities); and ultimately a learning assistant, capable of improving over the long run (by identifying the situations that are difficult for the bot, to guide its progress).

An autonomous conversational assistant: classification annotation

The first task is a general quality assessment of how the conversational assistant handled the conversation, with a focus on its outcome: did an advisor have to step in to take care of the customer? The aim here is to get better at detecting such conversations, so that they can be monitored as they happen in the future: a bot that is able to redirect the customer by itself, so that no customer is left unsatisfied.

In machine learning terms, this is a classification task, with several nested levels to prepare for every eventuality. It also helps in monitoring the success of the project, and is one of the key KPIs during the iteration cycle.
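A nested classification like this can be written down as a small two-level ontology. The sketch below is purely illustrative: the category names are assumptions, not the ontology used with the client, and the check simply verifies that a second-level label belongs to its parent.

```python
# Hypothetical two-level ontology: conversation outcome, then a more precise reason.
ONTOLOGY = {
    "HANDLED_BY_BOT": {"FULLY_ANSWERED", "PARTIALLY_ANSWERED"},
    "ESCALATED_TO_ADVISOR": {"BOT_MISUNDERSTOOD", "OUT_OF_SCOPE", "CUSTOMER_REQUEST"},
    "ABANDONED": set(),  # no second level for this outcome
}


def is_valid_label(outcome, reason=None):
    """Check that `reason` is an allowed child of `outcome` in the ontology."""
    if outcome not in ONTOLOGY:
        return False
    children = ONTOLOGY[outcome]
    if not children:
        return reason is None
    return reason in children


print(is_valid_label("ESCALATED_TO_ADVISOR", "OUT_OF_SCOPE"))
print(is_valid_label("HANDLED_BY_BOT", "OUT_OF_SCOPE"))
```

Starting with a shallow ontology like this one and adding levels only when labelers need them keeps the annotation task quick.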

A sharp artificial intelligence: named-entity recognition annotation

As we go through all the conversations, whether they succeeded or not, what can be improved in our chatbot? Of course, it might have missed a few things, asked the user to rephrase, or lacked a clear answer to a particular question. Each of these is a case to avoid in the future.
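One simple way to turn entity annotations into something actionable is to list the annotated terms the bot does not know yet. The sketch below assumes annotations are stored as character spans with a label; the product name, the labels and the `known_vocabulary` set are all made up for the example.

```python
def span(text, term, label):
    """Build a (start, end, label) character span for the first occurrence of `term`."""
    start = text.index(term)
    return (start, start + len(term), label)


def new_terms(annotations, known_vocabulary):
    """Return annotated entity terms (lowercased) missing from the known vocabulary."""
    found = set()
    for text, spans in annotations:
        for start, end, label in spans:
            term = text[start:end].lower()
            if term not in known_vocabulary:
                found.add((term, label))
    return sorted(found)


# Hypothetical labeled message: "Aurora handbag" is an invented product name.
text = "Is the Aurora handbag covered by the warranty?"
annotations = [
    (text, [span(text, "Aurora handbag", "PRODUCT"), span(text, "warranty", "SERVICE")]),
]
known_vocabulary = {"warranty"}

print(new_terms(annotations, known_vocabulary))
```

The resulting list is a natural input for the next training iteration, since it contains exactly the vocabulary the bot has been missing in production.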

A learning chatbot: continuous training

Machine learning and deep learning are all about improving along the way. Just as training, procedures and problem solving help the customer success desk improve, the chatbot needs feedback on how it performs.

Here is a quick example in Kili of our complete annotation task, with a very handy, intuitive and responsive interface 🚀.

A few useful tips for your project (starting tomorrow)

To help you get started, here is the methodology we used to deliver a chatbot monitoring pipeline.

It is easier to implement a chatbot on an existing support channel staffed by humans than to start from scratch. Why? Because the data already exists (see below for how we used it), because you can implement a cyborg mode where a human takes over the conversation when the chatbot is lost, and because people from the service desk can label the data.

  1. The first step is to retrieve data. We recommend starting with customer success data, as much of it as you can muster. For example, we used the support desk data created over the last 5 years, which already had a first classification within our system. List your data sources: phone call records, emails, chatroom logs, tickets, …
  2. The next step is to refine the ontology. To do so, define with customer experts the ontology you want to implement in the chatbot. Start with something simple, and enrich it over time. To define it, you can reuse existing material such as service desk procedures, real conversation transcripts, emails, etc.
  3. You can then set up your model. We used a custom BERT architecture, the state of the art for language models, fine-tuned on our raw data. It must be as simple as possible. A simple framework to start with is Rasa, which lets you set up a bot quickly.
  4. Don’t forget to monitor everything! The main monitoring will be through the chatbot’s logs, retrieving metadata such as users, sentiment analysis and time.
  5. The chatbot can then be set up and put online as a beta feature for your users. This ensures you don’t remain stuck on biased data, but are instead confronted with real use cases.
  6. Start the iteration process: the following steps are repeated from here onward, until your KPIs reach a satisfactory level. Those highly depend on your use case, but two you might consider adding are the quality of the conversation and the escalation rate, notably to monitor ROI.
    • Set up your annotation tasks, as presented above. Be sure to make them as clear as possible for the labeler, as quality data is the most important factor in AI. If necessary, you can also use quality monitoring features such as consensus to ensure the quality of the annotation.
    • Train your model on the newly created data, and deploy it to production. Again, chatbot monitoring will be key to measuring the overall improvement.
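The two quantities mentioned in the iteration step, the escalation rate and the agreement between labelers, can be approximated with a few lines. This is a deliberately simple sketch: the outcome names are assumptions, and the agreement score below is a plain majority share, not necessarily how a labeling tool computes its consensus metric.

```python
from collections import Counter


def escalation_rate(outcomes):
    """Share of conversations that had to be escalated to a human advisor."""
    if not outcomes:
        return 0.0
    return sum(1 for o in outcomes if o == "ESCALATED") / len(outcomes)


def majority_agreement(labels):
    """Return the most common label and the share of labelers who chose it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


# Hypothetical labeled outcomes for two successive weekly batches.
week_1 = ["ESCALATED", "HANDLED", "HANDLED", "ESCALATED"]
week_2 = ["HANDLED", "HANDLED", "HANDLED", "ESCALATED"]
print(escalation_rate(week_1), escalation_rate(week_2))

# Three labelers annotated the same conversation.
print(majority_agreement(["ESCALATED", "ESCALATED", "HANDLED"]))
```

Tracking the escalation rate batch after batch gives a direct view of whether each retraining actually helped, while a low agreement score is a hint that the labeling instructions need to be clarified.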


This is truer than ever given the year 2020 and the Covid-19 situation, which saw the generalization of remote working and online shopping and, for many companies, drove their digitalization.

In short, whether used inside or outside your organization, bots can have a tremendous impact on your company’s internal processes and onboarding and on your customer success leads, but at a cost: their monitoring and continuous improvement. So many data science projects fail to deliver an impact because of a lack of quality data, a challenge Kili helps organizations tackle. Read more about us here.
