Date: 2023-01-30 15:02
Read: 10 min

Our Guide to Natural Language Processing, an Introduction to NLP

Natural Language Processing (NLP) is an essential component of Machine Learning applications today. In this guide, we'll walk you through the components of an NLP stack and its business applications.

What is Text Annotation in Machine Learning?


Simply put, text annotation in machine learning (ML) is the process of assigning labels to a digital file or document and its content. Labeling a text can consist of assigning tags to text elements such as keywords, sentences, and paragraphs, or simply classifying the whole text based on its content (i.e., text classification).

Annotated text is used to train various NLP technologies, such as neural machine translation (NMT) programs, automatic question-answering (Q&A) platforms, smart chatbots, sentiment analysis, text-to-speech synthesizers, and automatic speech recognition (ASR) tools, among other related projects. These technologies can streamline the activities and transactions of many organizations across different industries.

What are the different types of text annotations?

Text / Document classification

Text and document classification consists of assigning one or more labels to a single text or a full document.

Examples:

  1. Classifying emails as spam or regular emails

  2. Doing sentiment analysis on tweets

  3. Labeling legal documents based on their content (legal notices, agreements, bonds, …)

    Example of sentiment analysis on an Amazon review

    Bill of lading being classified
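As a minimal sketch of how such a classification label could be stored and checked, assuming a hypothetical label set and record layout (this is not Kili's export format):

```python
from dataclasses import dataclass, field

# Hypothetical label set for a sentiment task (an assumption for illustration).
SENTIMENT_LABELS = {"positive", "neutral", "negative"}


@dataclass
class ClassificationAnnotation:
    """One or more class labels attached to a whole text or document."""
    document_id: str
    labels: list[str] = field(default_factory=list)


def validate(annotation, allowed_labels, multi_label=False):
    """Return True if the annotation only uses allowed labels and respects
    the single-label constraint when multi_label is False."""
    if not annotation.labels:
        return False
    if not multi_label and len(annotation.labels) != 1:
        return False
    return all(label in allowed_labels for label in annotation.labels)


review = ClassificationAnnotation("review-001", ["positive"])
print(validate(review, SENTIMENT_LABELS))
```

The `multi_label` flag covers the "one or more labels" case: a legal document could carry several tags, while a spam filter usually expects exactly one.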

Entity linking

Entity linking is the process of connecting entities to each other to better capture the structure of the text and the relationships between entities. Here you can see a table being extracted from a scanned invoice, with item lines linked together to preserve the table's structure.

A table being extracted from a scanned document

Entity being extracted in a KYC use case
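One way to picture linked entities is as a list of entities plus a list of relations between them. The sketch below is only an illustration: the class names, label names, and invoice values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    entity_id: str
    label: str   # e.g. "ITEM", "QUANTITY", "PRICE" (hypothetical label names)
    text: str


@dataclass(frozen=True)
class Relation:
    head_id: str   # entity the relation starts from
    tail_id: str   # entity it points to
    label: str


def group_by_head(entities, relations):
    """Rebuild rows: map each head entity's text to its linked (label, text) pairs."""
    by_id = {e.entity_id: e for e in entities}
    rows = {}
    for rel in relations:
        head, tail = by_id[rel.head_id], by_id[rel.tail_id]
        rows.setdefault(head.text, []).append((tail.label, tail.text))
    return rows


# Made-up invoice line, for illustration only.
entities = [
    Entity("e1", "ITEM", "Pallet jack"),
    Entity("e2", "QUANTITY", "2"),
    Entity("e3", "PRICE", "310.00"),
]
relations = [
    Relation("e1", "e2", "has_quantity"),
    Relation("e1", "e3", "has_price"),
]
print(group_by_head(entities, relations))
```

Keeping relations separate from entities means the same entity can take part in several links, which is what lets a table row be rebuilt from individually labeled cells.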

Layout analysis

Layout analysis consists of labeling document structures so they can be transformed into another format (e.g., JSON).

Document layout being extracted
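To illustrate the transformation to JSON, here is a small sketch that assumes each labeled block is a `(label, page, reading_order, text)` tuple. That tuple shape and the output schema are assumptions made for the example.

```python
import json

# Hypothetical labeled layout blocks: (label, page, reading_order, text).
blocks = [
    ("paragraph", 1, 2, "Text annotation assigns labels to content."),
    ("title", 1, 1, "Invoice"),
    ("table", 2, 1, "Item | Qty | Price"),
]


def layout_to_json(blocks):
    """Group labeled blocks by page, sorted by reading order, and serialize."""
    pages = {}
    for label, page, order, text in sorted(blocks, key=lambda b: (b[1], b[2])):
        pages.setdefault(page, []).append({"type": label, "text": text})
    document = {"pages": [{"page": p, "blocks": b} for p, b in pages.items()]}
    return json.dumps(document, indent=2)


print(layout_to_json(blocks))
```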

How to label texts, PDFs, and Images?

In real life, textual data exists in a wide range of formats: .txt files, PDFs, or even text in images or scanned documents. In this part, we are going to dive deep into the specificities of those data formats and the features that are mandatory to label them efficiently.

Labeling text data

When labeling text data in simple .txt format, the following features are important:

Multilingual Support

When labeling textual data, language and tokenization are key factors to take into account. The labeling tool must allow labeling in all languages. Using Kili Technology, you can natively label any language using our interfaces.

Annotated English-language document

Mandarin-language document being annotated

Arabic-language document being annotated
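Why tokenization matters across languages can be seen in a short sketch. It assumes labels are stored as character-offset spans, which is a common convention but not necessarily how any given tool stores them. The helper name `label_span` is hypothetical.

```python
def label_span(text, surface, label):
    """Return a (start, end, label) span for the first occurrence of `surface`.

    Offsets count Unicode code points, which is how Python indexes str.
    """
    start = text.find(surface)
    if start == -1:
        raise ValueError(f"{surface!r} not found in text")
    return (start, start + len(surface), label)


english = "I work in Beijing"
mandarin = "我在北京工作"

print(english.split())   # whitespace yields word-like tokens
print(mandarin.split())  # no spaces, so the whole sentence is one "token"

start, end, label = label_span(mandarin, "北京", "LOCATION")
print(mandarin[start:end], (start, end, label))
```

Because the span is expressed in characters rather than whitespace tokens, the same structure works for languages that do not separate words with spaces.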

Propagate regex from UI

When labeling long documents, a real time saver is to automatically tag all identical tokens in one click. We call this feature regex propagation. In the example below, the company name RentalCo is labeled once by the labeler and then propagated to every other occurrence.

Propagate regex from UI
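The idea behind regex propagation can be sketched with Python's standard `re` module. This is an illustration of the technique, not Kili's implementation. The function name and the sample sentence are hypothetical.

```python
import re


def propagate_label(text, surface, label):
    """Label every whole-word occurrence of `surface` in `text`.

    Returns a list of (start, end, label) character spans.
    """
    # re.escape keeps characters such as "." or "+" in the name literal.
    # The lookarounds avoid matching inside a longer word (e.g. "RentalCorp").
    pattern = re.compile(r"(?<!\w)" + re.escape(surface) + r"(?!\w)")
    return [(m.start(), m.end(), label) for m in pattern.finditer(text)]


contract = "RentalCo leases the van. The lessee pays RentalCo monthly."
for start, end, label in propagate_label(contract, "RentalCo", "COMPANY"):
    print(start, end, label, contract[start:end])
```

The labeler tags the name once, and every other identical occurrence receives the same label without further clicks.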

Formatting text for ease of labeling

Some text documents can be hard to visualize and annotate if formatting is not applied. For example, a long blog article or a conversation between multiple agents can be tedious to label if not displayed correctly. We have introduced rich text formatting in Kili to help customize text documents to ease the annotation task.

Labeling PDFs

PDF documents can be a lot more complex than simple text documents. PDFs often run from a couple of pages to hundreds, and they contain other data types such as images, tables, and graphs. When choosing a tool to label PDFs, keep the following features in mind to ensure success:

Native PDFs vs. Scanned PDFs

A native PDF is a PDF of a document that was “born digital” because the PDF was created from an electronic version of a document, rather than from print. A scanned PDF, by contrast, is a PDF of a print document, such as when you scan in pages from a print journal and then save this file as a PDF.

When labeling a native PDF, the content can be extracted directly from the underlying text layer (i.e., text can be selected in the PDF). On a scanned PDF, this layer might not be present. In Kili, you can add OCR output to a scanned PDF to label it just like a native one. Now, you just need one interface to label all of your PDF documents!

NER being performed on a scanned document delivered by DHSS

NER being performed on a scanned property register document
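A rough way to tell the two kinds of PDF apart is to check whether each page's text layer yields any text. The sketch below assumes the per-page text has already been extracted with a PDF library of your choice. The function name and the `min_chars` threshold are arbitrary assumptions.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return the 0-based indexes of pages whose text layer looks empty.

    `page_texts` holds the text already extracted from each page's text layer
    (by whatever PDF library you use). `min_chars` is an arbitrary threshold.
    """
    return [
        index
        for index, text in enumerate(page_texts)
        if len((text or "").strip()) < min_chars
    ]


# Illustrative input: page 0 has a real text layer, pages 1 and 2 look like scans.
extracted = ["Bill of lading no. 4471, shipper: RentalCo", "", "  \n "]
print(pages_needing_ocr(extracted))
```

This is only a heuristic: a native PDF page that contains nothing but an image would also be flagged, which is usually the desired outcome since it needs OCR too.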

Page-level classification vs. Document-level classification

Because PDFs can contain multiple pages, it's important to be able to classify at both the page level and the document level.

For example, let's say you want to label the type of each document (claim, invoice, quote, …) and, within each document, classify each page to differentiate the main document from the attachments. The tooling needs to support both levels for this metadata to be extracted.

Document being labeled at the document level and the page level
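The two levels can live side by side in one record. The following sketch assumes a hypothetical structure with one document-level class and a mapping from page number to page-level class.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class DocumentAnnotation:
    document_id: str
    document_type: str  # document-level class, e.g. "claim"
    # Page number (1-based) -> page-level class, e.g. "main" or "attachment".
    page_labels: dict[int, str] = field(default_factory=dict)


def unlabeled_pages(annotation, page_count):
    """Pages (1-based) that still lack a page-level class."""
    return [p for p in range(1, page_count + 1) if p not in annotation.page_labels]


claim = DocumentAnnotation("doc-42", "claim", {1: "main", 2: "main", 3: "attachment"})
print(Counter(claim.page_labels.values()))
print(unlabeled_pages(claim, page_count=4))
```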

Extracting mixed content (images, tables, graphs, stamps, signatures, …)

PDF documents also contain elements other than text, such as images, tables, or signatures. In various use cases such as automatic document parsing and understanding, it is mandatory to be able to extract such elements during the annotation process.

Therefore, your PDF interface needs to offer both text annotation tools, like classification, NER, and transcription, and image annotation tools, like bounding boxes.

Name of contact and signature being extracted from a PDF document

Tables and graphs being extracted from a PDF document
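Mixing text tools and image tools on the same page means an annotation can be either a text span or a bounding box. A minimal sketch, assuming hypothetical class names and boxes stored in coordinates normalized to the page size:

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class TextSpan:
    page: int
    start: int   # character offsets into the page text
    end: int
    label: str


@dataclass(frozen=True)
class BoundingBox:
    page: int
    x: float     # top-left corner and size, normalized to the 0..1 range
    y: float
    width: float
    height: float
    label: str


Annotation = Union[TextSpan, BoundingBox]


def to_pixels(box, page_width, page_height):
    """Convert a normalized box to pixel coordinates (left, top, right, bottom)."""
    return (
        round(box.x * page_width),
        round(box.y * page_height),
        round((box.x + box.width) * page_width),
        round((box.y + box.height) * page_height),
    )


# Illustrative values only.
annotations = [
    TextSpan(page=1, start=120, end=132, label="CONTACT_NAME"),
    BoundingBox(page=1, x=0.6, y=0.8, width=0.25, height=0.1, label="SIGNATURE"),
]
boxes = [a for a in annotations if isinstance(a, BoundingBox)]
print(to_pixels(boxes[0], page_width=1000, page_height=2000))
```

Normalized coordinates keep the box valid whatever resolution the page is rendered at.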

Labeling images with OCR

Lastly, you might also need to perform text annotation to extract text from images, such as extracting license plates from video streams or reading traffic signs.

If this is the case, being able to use an OCR model with the labeling interface will be mandatory to drastically speed up labeling. With Kili Technology, you can use any OCR model you want, like Google Vision API, Tesseract OCR, Azure OCR, Amazon Textract, or your own custom model!

ADR sign being labeled leveraging OCR
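Plugging in "any OCR model" suggests a thin, engine-agnostic layer between the OCR output and the labeling interface. The sketch below models an engine as a plain callable. That signature, the `PreAnnotation` class, and the review threshold are assumptions made for illustration, not the API of Kili or of any engine named above.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# An OCR engine is modeled as any callable taking image bytes and returning
# (text, confidence, (left, top, right, bottom)) tuples. This signature is an
# assumption made for the sketch, not the API of a real OCR engine.
OcrResult = Tuple[str, float, Tuple[int, int, int, int]]
OcrEngine = Callable[[bytes], List[OcrResult]]


@dataclass
class PreAnnotation:
    text: str
    box: Tuple[int, int, int, int]
    needs_review: bool


def pre_annotate(image, engine, review_below=0.8):
    """Turn OCR output into pre-annotations, flagging low-confidence reads."""
    return [
        PreAnnotation(text=text, box=box, needs_review=confidence < review_below)
        for text, confidence, box in engine(image)
    ]


def fake_engine(image):
    """Stand-in engine returning canned results so the sketch runs offline."""
    return [("33", 0.97, (40, 20, 160, 80)), ("1203", 0.62, (40, 90, 160, 150))]


for item in pre_annotate(b"", fake_engine):
    print(item)
```

With this shape, swapping one engine for another only requires a small adapter that maps its output to the tuple format, and labelers can focus on the reads flagged for review.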

Going further

Text annotation is the backbone of AI applications that handle text in all its forms. To scale annotation and produce reliable AI models in production, data needs to be of the highest quality and relevance. Our mission at Kili is to help organizations tackle the bottleneck of training data quality. To do so, we are building a state-of-the-art data labeling platform with high-end labeling interfaces supported by data quality monitoring and automation.

If you want to learn more about data quality and labeling automation, check out our documentation.



An article by Paul Graffan,
Lead Solution Engineer @ Kili Technology,
and by Kili Technology's team.
