How Kili Technology and AutoML Helped Me Scale Email Classification for Customer Services

Emails are the de facto standard for written communication between companies and customers. They remain the preferred communication channel, with 91% of all US consumers still using email daily. [1] As a result, businesses and individuals face massive volumes of noisy, unprioritized emails.

Customer services are no exception. They keep using email because it is a straightforward way for clients to express themselves in natural language. In that context, being able to prioritize and classify emails is crucial. The customer service we worked with wanted emails to be automatically dispatched to the proper service thanks to machine learning. This low-added-value process

From a data scientist’s perspective, what are some of the lessons learnt from working for eight months with the customer service of a leading bank to streamline their email classification pipeline?

Challenges behind managing emails in machine learning

Email classification is a classic natural language processing (NLP) task. However, you will face some challenges when processing emails:

  • Emails are noisy:
    • They are mainly composed of natural language along with unstructured metadata (sender, recipient, dates and attached files in different formats).
    • Part of the text consists of greetings (“Dear Sir”), closing formulas (“Yours sincerely”) or lengthy signatures, which may or may not add context.
    • Not all attachments can be read, as they come in almost any format.
    • Some emails are spam, sent for commercial purposes by non-customers.
    • Emails keep a trace of their history (previous emails are usually embedded in the current one).
  • Several requests can be made in one email. In customer services in particular, customers tend to batch their requests: “I would like to meet with one of your advisors. Also, can you unsubscribe me from the mailing list?”
  • Emails have implicit content. For instance, the fact that the email was sent by a major customer could indicate a high priority.
  • Customers do not always use the proper vocabulary. They may use everyday words in place of the formal business terms an expert would use, or even address their request to the wrong service, as they are unaware of the underlying business processes.
  • The register and level of formality also vary, depending on whether customers are angry or come from different socio-economic backgrounds with different abilities to write English.
  • Emails are very prone to concept/data drift [2], as vocabulary changes and your business evolves. The model you built six months ago may well be outdated by now, because your company has launched new products that the model has never seen.

To tackle all these issues, you will have to continuously annotate and train a model. No public dataset will help you with training, as it will almost certainly not contain your products and your vocabulary. The need to annotate and train on your very own emails will emerge from the beginning.

Agile iterations for data science projects

In our case, four parties were cooperating for the project’s success:

  • infolinguist project managers (1 senior)
    • frame the project
    • build the annotation plan
    • manage team and deadlines
  • business people from the customer service (1 senior + 4 junior)
    • formalize business needs
    • annotate emails
  • data scientists (1 senior + 1 junior)
    • industrialize a performant development pipeline
    • rationalize technical choices that match the requirements of business (e.g., performance metrics)
  • data engineers (1 senior + 1 junior)
    • industrialize a robust production pipeline
    • take care of software quality and scalability

Do not separate concerns! All four parties must be as agile as possible and cooperate throughout the project. That is why we favoured two-week Scrum iterations over monolithic work. [3]

  • We mixed annotation sprints with the first data science improvements to see what works well and what does not. We could react accordingly in both directions.
  • Similarly, as soon as a first model is ready, it should be put into production to see how it integrates with the IT systems and how it performs on real streams of data.
  • The infolinguist managers synchronize teams and manage delivery.

Development machine learning pipeline

Annotation plan. At the very beginning of the project, the annotation plan has to be formalized. The annotation plan consists of the classification categories along with the written guidelines for annotating emails. Actually annotating 100 random emails will give you a sense of the ambiguities and edge cases in the dataset. Too few categories would introduce high bias, and too many overlapping or ambiguous categories would lead to high variance in the dataset. It is important to co-create the annotation plan, as well as to share and explain it across the team.

Scale annotation. What should you expect from an annotation platform when doing email classification?

  • Emails contain structured data, so it is important to have intuitive interfaces to spot important information at a glance.
  • Kili Technology clarifies the annotation plan for all annotators by embedding targeted explanations at category and email levels.
  • Email annotation is time-consuming, so if you want to scale, you will have to invite more contributors and manage roles on the platform (annotator, reviewer, administrator).

The goal was to have 10,000 emails annotated by four people who were working full-time at the customer service. The customer service used to annotate with Excel. Excel meets none of these expectations, which is why they turned to Kili Technology and increased their productivity by a factor of 2. This saves time:

  • for annotators
    • when labelling
    • when reviewing the annotations
  • for data scientists:
    • when consolidating a high-quality dataset
    • when analyzing the dataset for category confusion and model performance (see below)

Pick the performance measure. As far as the evaluation metric is concerned, accuracy will not work well if classes are strongly imbalanced (which is practically always the case in real-world examples). Recall, on the other hand, only accounts for the coverage of the positive class and ignores false positives. The metric you should aim for is the F1-score, because it balances the concerns of both precision and recall. [4]
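To make the difference between accuracy and F1-score concrete, here is a minimal sketch in plain Python. The category names and the 90/10 class split are hypothetical and only serve the illustration; in practice you would rely on a library implementation. The "model" always predicts the majority class, which is enough to obtain a flattering accuracy.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Return (precision, recall, F1) with one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Share of emails whose predicted category equals the annotated one."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1-scores."""
    labels = sorted(set(y_true))
    scores = [precision_recall_f1(y_true, y_pred, label)[2] for label in labels]
    return sum(scores) / len(scores)


# Hypothetical imbalanced sample: 90 "billing" emails, 10 "complaint" emails.
y_true = ["billing"] * 90 + ["complaint"] * 10
# A lazy model that always predicts the majority class.
y_pred = ["billing"] * 100

print("accuracy:", accuracy(y_true, y_pred))
print("complaint (P, R, F1):", precision_recall_f1(y_true, y_pred, "complaint"))
print("macro F1:", round(macro_f1(y_true, y_pred), 3))
```

The accuracy is 90% although the model never finds a single complaint. The F1-score of the minority class is 0, and the macro-averaged F1-score falls below 50%, which reflects the actual usefulness of the model much better.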

Development pipeline. Machine learning consists in iteratively picking data, a preprocessing scheme and a model, seeing how they perform with regard to your performance metric, and repeating.


We aimed for a fully automatic pipeline in order to be able to repeat experiments and launch new ones very quickly. The idea is to seamlessly recompute business metrics whenever the model or the data changes. Here are some experiments that proved successful, or at least worth exploring, to gain performance:

  • Approach. Will you model your problem as a single-label (one category per email) or a multi-label (several categories per email) classification problem?
  • Labelling. Will you merge the most ambiguous classes?
  • Preprocessing. Will you strip greetings and signatures? Using regular expressions or a model? Will you include attachments or only their names?

Every choice has to be backed by cross-validated metrics, to see how it impacts the final model.
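As an illustration of the regular-expression option for preprocessing, here is a minimal sketch using Python's standard `re` module. The patterns and the `strip_email` helper are hypothetical and deliberately simplistic; real mailboxes need patterns tuned on annotated samples, and the stripped and unstripped variants should be compared with the cross-validated metrics mentioned above.

```python
import re

# Illustrative patterns only (assumptions, not taken from the project).
# Greeting on the first line, e.g. "Dear Sir,".
GREETING = re.compile(r"^\s*(dear|hello|hi)\b[^\n]*\n", re.IGNORECASE)
# Closing formula and everything after it (usually the signature).
CLOSING = re.compile(
    r"\n\s*(yours sincerely|best regards|kind regards)\b.*\Z",
    re.IGNORECASE | re.DOTALL,
)
# Start of the embedded history of previous emails.
QUOTED_HISTORY = re.compile(
    r"\n\s*(on [^\n]+ wrote:|-----\s*original message\s*-----).*\Z",
    re.IGNORECASE | re.DOTALL,
)


def strip_email(body: str) -> str:
    """Remove quoted history, closing formula/signature and greeting."""
    text = QUOTED_HISTORY.sub("", body)
    text = CLOSING.sub("", text)
    text = GREETING.sub("", text)
    return text.strip()


raw = (
    "Dear Sir,\n"
    "I would like to meet with one of your advisors.\n"
    "Yours sincerely,\n"
    "Jane Doe\n"
    "\n"
    "On Monday, the bank wrote:\n"
    "> Thank you for contacting us."
)
print(strip_email(raw))
```

Rules of this kind are transparent and fast, but they are brittle: a signature without a closing formula, for example, goes through untouched. That is one reason why a learned model may be preferable once the patterns become hard to maintain.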

Models. As far as the model is concerned, we focused on easy-to-use packaged frameworks with out-of-the-box solutions and preferably APIs for automated machine learning (AutoML). [5] For reasons of transparency and interpretability towards the business, we began with the most traditional techniques.

  • Scikit-learn along with auto-sklearn (AutoML package on top of scikit-learn)
  • fastText has AutoML included. Its speed allows for rapid prototyping.
  • Hugging Face’s transformers for all mainstream deep learning architectures (e.g., BERT). Transfer learning from existing models speeds up training here.

Do not spend time fine-tuning a model while a change of strategy (approach, labelling, preprocessing) can still bring gains that are an order of magnitude larger (+5-10%). In most cases, fine-tuning a particular model will only gain you a fraction of a point (+0-1%).

Track quality all along. Your model will only perform as well as your dataset allows. To track quality during the annotation phase, you can use:

  • Review: Have a senior business expert review randomly selected emails to check their coherence.
  • Consensus: Let 2 annotators annotate the same email, and measure the inter-annotator agreement. The agreement is a similarity measure, computed as the intersection of the categories chosen by the annotators over their union (IoU).
  • Honeypot: Annotate some emails yourself beforehand, then see how the annotators perform on this mini-batch.

These three checks are an opportunity to verify how relevant your annotation plan is, and whether every member of the team has understood it. That is also why you shouldn’t compare your final model’s performance to some theoretical threshold but to the annotators’ performance. Business managers of machine learning projects always expect top performance from models in production. In our use case, the model wouldn’t go to production if its performance was under 80% (F1-score). For some categories, however, we were able to show that the inter-annotator agreement didn’t even meet this requirement. In other words, Kili Technology showed that even a human was not capable of completing the task with this level of confidence. An important lesson here was to compare the inter-annotator confusion matrix with the model confusion matrix.
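The consensus measure described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Kili Technology's implementation: the function names, the email identifiers and the categories are hypothetical, and each annotation is represented as a set of categories per email.

```python
def iou_agreement(labels_a, labels_b):
    """Intersection over union of the categories chosen by two annotators."""
    a, b = set(labels_a), set(labels_b)
    if not a and not b:
        # Both annotators left the email unlabelled: count it as agreement.
        return 1.0
    return len(a & b) / len(a | b)


def mean_agreement(annotations_a, annotations_b):
    """Average IoU over the emails that both annotators labelled."""
    shared = annotations_a.keys() & annotations_b.keys()
    if not shared:
        raise ValueError("the annotators have no email in common")
    scores = [iou_agreement(annotations_a[e], annotations_b[e]) for e in shared]
    return sum(scores) / len(scores)


# Hypothetical annotations: email id -> set of categories.
annotator_1 = {
    "email_1": {"appointment", "unsubscribe"},
    "email_2": {"complaint"},
    "email_3": {"billing"},
}
annotator_2 = {
    "email_1": {"appointment"},
    "email_2": {"complaint"},
    "email_3": {"complaint"},
}

print(mean_agreement(annotator_1, annotator_2))
```

Here the per-email agreements are 0.5, 1.0 and 0.0, for an average of 0.5. Computing the same average per category is what makes it possible to spot the categories on which even human annotators disagree.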

Having a robust development pipeline will allow you to scale to more use cases within your company. Today, our client delivers a new email classification project in weeks.

Production machine learning pipeline

The production pipeline typically re-uses some of the code written for the development pipeline. For instance, preprocessing steps will often be the same.

The pipeline should be thoroughly tested to guarantee its resilience to all kinds of inputs. Unit tests are particularly relevant to test all separate pieces of a pipeline.

However, unit tests protect you against code regressions, not data regressions. Robust production pipelines also include strong safeguards against data outages and concept drift. We will cover the monitoring of ML models in production in more detail in a future blog post.

In conclusion, using Kili Technology allowed the customer service to increase its productivity by a factor of 2 in the most critical phase: the annotation phase. Kili Technology made it possible to work frictionlessly in an agile setting throughout the project by synchronizing all actors, from the business people to the data science team. Once in production, automating email dispatching with machine learning saves hours of human review every day.

[1] Aufreiter, N., Boudet, J., Weng, V., Why marketers should keep sending you e-mails (2014).

[2] Katakis, I., Tsoumakas, G., Vlahavas, I., Tracking recurring contexts using ensemble classifiers: an application to email filtering (2010).

[3] Sutherland, J., Schwaber, K., SCRUM guides.

[4] Precision and recall, Wikipedia.

[5] He, X., Zhao, K., Chu, X., AutoML: A survey of the state-of-the-art (2020).
