2022-02-18 14:58

Understanding Named Entity Recognition & Text Classification

What is Named Entity Recognition?

Named Entity Recognition (NER) is the task of identifying entities in text and assigning labels to them so that they can be used for further analysis or processing. As a simple example, NER may recognize the name ‘Netflix’ in a document and apply the label ‘company’ to it. NER is flexible: many different kinds of entities can be defined, identified, and sorted into categories, including people, places, companies, and periods of time.

NER is a natural language processing (NLP) task that identifies and tags entities according to a defined set of categories. Entity extraction pulls out the tagged entities, which makes it possible to find the most important pieces of information in a document quickly and automatically. As an example, NER can be used with invoices to automatically identify the account ID, the shipping and billing addresses, and invoice amounts. These values can then be passed to a company’s electronic payments system to allow for the automatic processing and payment of invoices, even if the initial invoice is received on paper.
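To make the invoice example concrete, here is a minimal sketch of the input and output of entity extraction. It uses hand-written regular expressions rather than a trained NER model, and the field layout ("Account ID:", "Total:"), the labels, and the function name are assumptions made purely for illustration.

```python
import re

# Hypothetical field patterns; real invoices vary widely in layout.
PATTERNS = {
    "ACCOUNT_ID": re.compile(r"Account ID:\s*([A-Z0-9-]+)"),
    "AMOUNT": re.compile(r"Total:\s*\$?([0-9][0-9,]*\.[0-9]{2})"),
}


def extract_entities(text):
    """Return (label, value, start, end) tuples, ordered by position in text."""
    entities = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            entities.append((label, match.group(1), match.start(1), match.end(1)))
    return sorted(entities, key=lambda entity: entity[2])


invoice = "Account ID: AC-10492\nTotal: $1,250.00"
entities = extract_entities(invoice)
for label, value, start, end in entities:
    print(label, value, start, end)
```

Each result carries a label, the extracted value, and its character offsets, which is the kind of structured record a downstream payments system could consume.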

What is Text Classification?

Text classification has some similarities to NER but fulfills a different role. Where entity recognition extracts specified entities from within a document, text classification looks at the characteristics of the text as a whole to draw conclusions about things such as its sentiment, language, or topic.

Text classification covers several common tasks. These include:

Sentiment Analysis

Sentiment analysis refers to gauging the sentiment of a piece of text, such as whether the author is expressing a positive or negative opinion. It is very popular within social media for understanding how consumers feel about a given topic or brand. Sentiment analysis examines the language used within a body of text to determine its sentiment. The result can be a simple positive or negative label, or a finer-grained set of classes specific to the use case.
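As a rough illustration of the idea, the sketch below scores text against two small word lists. This is a deliberately simple lexicon approach, not a trained model, and the word lists and the three output classes are assumptions chosen for the example.

```python
import re

# Tiny illustrative lexicons; a real system would use a far larger
# vocabulary or a trained classifier.
POSITIVE = {"great", "love", "helpful", "fast", "excellent"}
NEGATIVE = {"slow", "broken", "terrible", "angry", "refund"}


def sentiment(text):
    """Label text as 'positive', 'negative', or 'neutral' by counting lexicon hits."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(sentiment("I love the fast delivery"))
print(sentiment("The app is broken and support is slow"))
```

A use case that needs finer classes could replace the three labels with a numeric score or a larger label set.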

Language Classification

Text classification can be used simply to detect the language of a piece of text. This has many uses, such as automatically routing support queries in different languages to the correct department, or offering automatic translation if a body of text does not match the user’s locale settings.
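One simple way to picture language detection is to count how many common function words of each language appear in the text. The sketch below does exactly that; the three stopword lists are small hand-picked assumptions, and production systems typically rely on character n-gram models instead.

```python
import re

# Small hand-picked stopword lists, for illustration only.
STOPWORDS = {
    "en": {"the", "and", "is", "to", "of", "my", "not"},
    "de": {"der", "die", "und", "ist", "nicht", "das", "mein"},
    "es": {"el", "la", "y", "es", "no", "de", "mi"},
}


def detect_language(text):
    """Return the language code with the most stopword hits, or None if there are none."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(detect_language("Die Rechnung ist nicht korrekt"))
```

The returned code could then be used to pick a support queue or to decide whether to offer a translation.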

Topic Classification

To stick with the example of a support query, topic classification is used to automatically identify the nature of a support query without relying on the customer to correctly tag the request. By identifying certain parameters or phrases used within the support request, text classification can identify whether a customer support query relates to billing or troubleshooting, for example, and then automatically route the support request to the most appropriate department. 
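The sketch below shows one way a topic classifier could learn from labeled examples: a small multinomial naive Bayes model with add-one smoothing, written from scratch. The four training sentences and the two topics are invented for illustration, and a real system would need far more data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def train(examples):
    """Count words per topic from (text, topic) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for text, topic in examples:
        tokens = tokenize(text)
        word_counts[topic].update(tokens)
        doc_counts[topic] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def classify(model, text):
    """Return the topic with the highest log-probability under naive Bayes."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    tokens = [t for t in tokenize(text) if t in vocab]  # skip unseen words
    best_topic, best_score = None, -math.inf
    for topic in doc_counts:
        score = math.log(doc_counts[topic] / total_docs)
        total_words = sum(word_counts[topic].values())
        for token in tokens:
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[topic][token] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic


# Invented training examples, for illustration only.
EXAMPLES = [
    ("I was charged twice on my invoice", "billing"),
    ("Please refund the payment on my card", "billing"),
    ("The app crashes when I log in", "technical"),
    ("I cannot reset my password, error on screen", "technical"),
]
model = train(EXAMPLES)
print(classify(model, "Why was my card charged twice?"))
```

The predicted topic can then be mapped to a department, so the customer never has to tag the request themselves.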

How are both Concepts used to Automate Document Processing?

NER and text classification have distinct roles to play in automating document processing, but they can be used together to analyze, categorize, and process a wide range of documents automatically. Handling customer support queries is a good example of the two technologies working together.

Text classification can be used to determine the language, topic and sentiment of the support query. The language identification is used so that the support query is routed to the correct support team members who speak the same language. Then, topic classification can be used to judge the type of query: is it one about billing or invoices? Is it a technical support request, or a complaint? This helps customer queries to be routed to the correct department automatically. Lastly, sentiment analysis can be used to make a judgement about how the customer feels, such as whether they are angry or upset, and can help assign an urgency to the query.
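The routing logic described above can be sketched as a small function that takes the three classification results and returns a queue and a priority. The queue naming scheme, the priority rule, and the class and function names are assumptions for illustration; the classifiers that would produce these labels are not shown.

```python
from dataclasses import dataclass


@dataclass
class Classification:
    language: str   # e.g. "en", "de"
    topic: str      # e.g. "billing", "technical", "complaint"
    sentiment: str  # e.g. "positive", "neutral", "negative"


def route(result):
    """Pick a support queue from topic and language, and a priority from sentiment."""
    queue = f"{result.topic}-{result.language}"
    # Assumed rule: upset customers and complaints are handled first.
    urgent = result.sentiment == "negative" or result.topic == "complaint"
    return queue, "high" if urgent else "normal"


print(route(Classification(language="de", topic="billing", sentiment="negative")))
```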

Named entity recognition can be used to extract important information from supporting documents related to the query. For example, if a customer is getting in touch with a bank about a loan application, then they may need to supply supporting documentation, such as a proof of address. Named entity recognition can be used to quickly and automatically verify whether the document provided contains an address and whether the name on the document matches the name of the customer within the bank’s systems.
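Assuming an NER step has already produced labeled entities for the document, the verification itself can be quite small. In the sketch below, the `PERSON` and `ADDRESS` labels, the sample values, and the exact-match rule after normalization are all assumptions; a real check would likely need more tolerant name matching.

```python
def normalize(name):
    """Lowercase a name, drop periods, and collapse repeated whitespace."""
    return " ".join(name.lower().replace(".", " ").split())


def verify_proof_of_address(entities, customer_name):
    """Check that the entities include an address and a matching person name.

    `entities` is a list of (label, value) pairs from an upstream NER step.
    """
    has_address = any(label == "ADDRESS" for label, _ in entities)
    name_matches = any(
        label == "PERSON" and normalize(value) == normalize(customer_name)
        for label, value in entities
    )
    return has_address and name_matches


# Hypothetical NER output for a scanned utility bill.
document_entities = [
    ("PERSON", "Jane  DOE"),
    ("ADDRESS", "12 High Street, Leeds"),
]
print(verify_proof_of_address(document_entities, "jane doe"))
```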

Together, these concepts help to reduce the amount of manual work an organization must do in order to support requests at scale, particularly those involving supplementary text that may come in a variety of formats and structures.

How have these Concepts Evolved and what is their Future?

The ability to identify and classify information into discrete categories has been around for a long time and does not strictly require machine learning or deep learning. The earliest forms of NER and text classification relied on hand-written rules, often based on the presence of keywords or set phrases. Data extraction was done by checking the text against a list of defined entities rather than relying on a learned model. As a result, these approaches suffered from relatively low accuracy, and in particular missed many entities. Spelling mistakes, synonyms, or foreign languages would all cause important information to go unextracted, so the output often required human validation.
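A minimal sketch of such a list-based approach shows why it is brittle. The entity list and labels below are invented for illustration.

```python
import re

# A hand-maintained list of known entities (a "gazetteer").
GAZETTEER = {"netflix": "COMPANY", "london": "PLACE"}


def tag_entities(text):
    """Label every word that appears verbatim in the gazetteer."""
    found = []
    for match in re.finditer(r"\w+", text):
        label = GAZETTEER.get(match.group().lower())
        if label:
            found.append((match.group(), label))
    return found


print(tag_entities("Netflix opened an office in London"))
print(tag_entities("Netflx opened an office in Londres"))
```

The first sentence yields both entities, while the second, with one typo and one foreign-language place name, yields none, because neither spelling appears in the list.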

The introduction of deep learning and AI to the field has helped overcome these problems and allows more abstract concepts to be identified within text, rather than only hard-coded words or phrases. As we look towards the future, these technologies will evolve from information and entity extraction to knowledge extraction. Rather than simply understanding the sentiment or topic of a piece of text, these systems will eventually be able to incorporate the knowledge contained within it into their wider system. This could mean the automatic understanding, processing, and resolution of a customer support request without the need for human intervention at any stage.

Another interesting future development for this technology is ontology learning. This means being able not only to identify entities within a piece of text, but also to extract the relationships between those entities within a given domain, without requiring human intervention. This would allow a richer and deeper understanding to be built from natural language text without domain-specific training.
