Training Information Extraction Models: the Story of LCL
Banks are amongst the most regulated establishments in the world. Discover how LCL built a powerful ID information extraction algorithm using Kili Technology to classify the IDs of their customers.
Why information extraction? Another issue...
Banking is one of the most regulated sectors in advanced economies. The Basel Committee and the European Central Bank supervise commercial banks, which have the power to create money through credit. With such responsibility, banks are subject to national and international regulations. Regulations from influential jurisdictions such as the United States or the EC also apply to our own institutions through extraterritorial laws (for instance, when using US dollars or complying with embargoes).
One of the pillars of good management for a bank is the well-known KYC: Know Your Customer. A massive amount of data needs to be collected to ensure a suitable KYC, and banks’ customers are perfectly aware of this obligation. One passage obligé (a required step) in building a strong KYC is the collection of identity documents. To comply with international regulations, banking institutions are required to hold a recent copy of an identity document for each of their customers. This copy should be of sufficient quality to be used for verification or control purposes. The issue begins when you need to check this constraint across your entire client base (a few million customers) within a short time frame. This, of course, cannot be done manually.
Another key pillar of good management is Corporate Social Responsibility (CSR). As banks are a major transmission belt in the economy, policies driving ecological change also apply to the finance industry. Because banks are the natural ally of citizens buying real estate, setting conditions on the energy score of a new construction or a renovation is important. The “DPE” (Energy Performance Diagnostic) is now a mandatory document when a real estate loan is signed. It contains data that allow banks to produce the regulatory extra-financial reporting required by the regulators.
Where AI and information extraction come into play
Many formats of ID documents (national ID cards, passports, residence permits) and DPEs, for which there is no unique template, are collected and stored in the LCL Electronic Document Management system. For each document, a dozen text fields must be extracted to ensure that the proper information is registered either in the bank’s CRM or in the appropriate reporting. And this information extraction must be executed successfully millions of times, for our millions of customers (we do not complain!).
To recognize, categorize and extract structured data from these documents, we have two options: build an army of labelers to annotate 100% of the raw data manually or use state-of-the-art recognition algorithms to develop AI models. The former is not feasible in real life, while the latter is more than an option: it is the solution. And LCL is equipped for this challenge: there is already a team in LCL specialized in the business of creating AI products to process images, text files, voice samples and run named entity recognition or natural language processing techniques on legal documents.
Document categorization & natural language processing are particularly well served by using supervised machine learning. The challenge resides in the high expectations from the business units (Compliance Department, CSR Management): we cannot afford any mistakes given the importance of our two missions. But that’s the way our business works!
Even with our choice to use artificial intelligence and information extraction, building a labeling team is still a challenging task. As we said, we did not plan to hire an extensive team of annotators. But when using supervised learning, it is well known that to train our model, we need to obtain an example database, containing thousands of labeled and annotated images. This is where we need software to handle our document extractions and pre-processing to simplify the job of our five dedicated labelers handling the extracted data.
Working with Kili Technology for information extraction: the right solution
To be able to label our existing data, we chose to work with Kili Technology, the labeling platform to build high-quality training data from structured and unstructured data.
Having extracted tens of thousands of images from our Electronic Document Management system, we stored them on our on-premises servers. A bridge with the Kili Technology software, also installed on-premises, allows our remote team of labelers to work on entity identification, classification and global annotation tasks (e.g. the letter from the energy diagnostic, or the expiry date of a national identity card). The Kili Technology SDK allows us to use our custom models to run OCR on areas of interest, extract information, and prepare these documents for manual annotation.
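To make the pre-processing step above more concrete, here is a minimal sketch in Python of how OCR output from areas of interest could be turned into pre-filled fields for labelers to check. It is an illustration only, not our production code and not the Kili Technology SDK: the field names, the regular expressions, the PreAnnotation class and the ocr callable are all hypothetical stand-ins.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

# Hypothetical patterns for two of the fields mentioned above.
# Real documents have many layouts and would need their own rules or models.
FIELD_PATTERNS = {
    "expiry_date": re.compile(r"(?P<value>\b\d{2}[./-]\d{2}[./-]\d{4}\b)"),
    "energy_class": re.compile(r"\bclasse\s+(?P<value>[A-G])\b", re.IGNORECASE),
}


@dataclass
class PreAnnotation:
    """Fields pre-filled by a model, to be confirmed or corrected by a labeler."""
    document_id: str
    fields: Dict[str, Optional[str]] = field(default_factory=dict)

    @property
    def needs_review(self) -> bool:
        # Any field the extraction could not fill is left for manual annotation.
        return any(value is None for value in self.fields.values())


def pre_annotate(
    document_id: str,
    regions: Dict[str, object],
    ocr: Callable[[object], str],
) -> PreAnnotation:
    """Run OCR on each area of interest and extract the expected field from it.

    `regions` maps a field name to an image crop (opaque here);
    `ocr` is whatever OCR model is available (an assumption of this sketch).
    """
    result = PreAnnotation(document_id)
    for name, region in regions.items():
        text = ocr(region)
        match = FIELD_PATTERNS[name].search(text)
        result.fields[name] = match.group("value") if match else None
    return result
```

In this sketch, a field the model cannot read stays empty and the document is marked for review, so labelers spend their time on the hard cases rather than retyping what was already extracted correctly.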
Labelers, but also experts from the business units, enjoy working on Kili Technology because they can focus on the task at hand and finish it on time. This contrasts with our former constant fear of losing work, or of nightmarish file management across a dozen Excel files. From our perspective, not having to bother with data transfer and backups is a great relief: Kili Technology was installed in line with our data safety standards. After some use, both hired and volunteer labelers found Kili Technology very comfortable to work with. People from our business units wanted to be involved in building AI: engaging many people as a workforce helped unite the company around AI.
In the end, Kili Technology provides strong labeling software focused on dataset quality, with an easy-to-use interface, but there’s also a team behind the scenes. Our counterparts at Kili Technology responded very quickly whenever we encountered an issue. Our dedicated customer success manager is very attentive to any pain point that arises and will organize meetings with technical staff whenever we need to customize the application or be trained on certain functionalities.
On one hand, all features needed for labeling are present, and a few of them are often used. On the other hand, we sometimes have a need for a feature that is not (yet) developed but that can be added to the roadmap. As a large corporation collaborating with a start-up, chances are our paces are sometimes different. But even with the differences in the working model, Kili Technology remains very attentive to our challenges.
Back to the topic at hand
For each of our two labeling campaigns, we used Kili Technology’s labeling platform. It allowed us to push, label and retrieve the data that will feed our machine learning algorithms. Deep learning is a big consumer of data if the output model is to meet the business requirement with a very small error rate.
A standard annotation campaign typically covers about 5,000 documents to be annotated in 2 or 3 weeks. Obtaining them requires significant data preparation work upstream. Because our AI infrastructure now includes Kili Technology, people trained on the platform can use the tool for all kinds of projects. Simply by requesting that the relevant raw data be loaded in, LCL teams can drastically accelerate the creation of their training datasets, which means a significant improvement for all the parties involved.
Once our models are ready, they are integrated by our IT teams and pushed to production at scale. See for yourself: more than 13 million documents were run in batches to check whether every single KYC was complete with its readable ID document. The data extracted from every scanned copy fed a compliance tool that allowed the retail network of advisors to update documents with their clients. All of that is done algorithmically, with a training dataset built on Kili Technology.
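As an illustration of the batch check described above, here is a minimal sketch in Python. It assumes each model output has been reduced to a small record per customer; the IdExtraction record, the issue labels and the batch size are hypothetical choices made for this example, not a description of our production pipeline.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Iterable, Iterator, List, Optional


@dataclass
class IdExtraction:
    """Hypothetical summary of what a model extracted from one ID document."""
    customer_id: str
    readable: bool
    expiry_date: Optional[date]


def batched(items: Iterable[IdExtraction], size: int) -> Iterator[List[IdExtraction]]:
    """Yield successive lists of at most `size` records."""
    batch: List[IdExtraction] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def kyc_issue(record: IdExtraction, today: date) -> Optional[str]:
    """Return a short reason if the ID does not satisfy the KYC check, else None."""
    if not record.readable:
        return "unreadable"
    if record.expiry_date is None:
        return "missing_expiry_date"
    if record.expiry_date < today:
        return "expired"
    return None


def flag_incomplete(
    records: Iterable[IdExtraction], today: date, batch_size: int = 1000
) -> Dict[str, str]:
    """Process records batch by batch and collect customers needing a new document."""
    flagged: Dict[str, str] = {}
    for batch in batched(records, batch_size):
        for record in batch:
            issue = kyc_issue(record, today)
            if issue is not None:
                flagged[record.customer_id] = issue
    return flagged
```

The flagged customers and reasons are the kind of output that could then feed a compliance tool, so that advisors know which documents to update with their clients.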
As for the extra-financial reporting, our algorithms will capture the relevant information contained in the DPEs during the credit process. Once again, this cannot be done manually except at great cost. And all these services rendered by AI depend on a labeled database. Without a tool such as Kili Technology, fulfilling the regulators' requirements would take months and make us miss the compliance schedule. Regulators do not wait!
That’s not the end of the story
As you may have experienced, banks collect data from customers all the time: images, e-mails, phone calls, contracts, etc. We can even process voice recordings, where speech-to-text algorithms can apply the power of NLP to live conversations. We will undoubtedly use Kili Technology to annotate additional assets, all with the goal of improving our security and customer satisfaction. Regulators never sleep!
To learn more about information extraction, optical character recognition, natural language processing, automatic annotation and information retrieval, check out our webinars and other articles!