Natural language processing is as old as computing itself. At Serimag, we use NLP (Natural Language Processing) to understand documents so that we can process them naturally.
Since the beginning of mankind, one thing has distinguished us from the animal world: language. With technological progress came machines, and with them the need to establish a human-machine relationship that has not always been easy or trivial. Our desire to communicate naturally with machines is as old as the computer itself. Natural Language Processing (NLP) is the attempt to overcome these barriers so that understanding is possible.
In the beginning, this communication was inevitably unidirectional: we had the tools for machines to understand us. A button that triggered a response, early Google searches, or even programming languages themselves are examples of this. Over time, bidirectionality has appeared: machines that talk to us, chatbots with which to hold a conversation, and so on. In fact, the use of NLP has evolved and spread, and we use it in our daily routines almost without realizing it: spell-checkers in word processors, predictive keyboards on smartphones, increasingly intelligent Google searches, or even the shopping recommendations Amazon gives us. In all of them, machines understand human language to a greater or lesser extent.
The techniques behind NLP work on the basis of concepts and the relationships established between them. The advance of Artificial Intelligence, and more specifically of Machine Learning (ML), has given a boost to the processing of all this information. The first phases are responsible for digesting it:
- Optical character recognition (OCR), which converts images into plain text.
- Elimination of irrelevant words (stop-word removal)
- Reduction of words to their respective lexemes (stemming or lemmatization)
- Semantic bleaching of information
- Regular expressions that convert words into specific data
- Predictive systems (e.g. N-gram)
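Several of these reduction steps can be sketched in a few lines of Python. This is a deliberately minimal illustration: the stop-word list is a toy one, the stemmer is a naive suffix-stripper (real systems use algorithms such as Porter or Snowball), and the predictive component is just a bigram counter.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "to", "in"}  # toy stop-word list for illustration

def remove_stop_words(tokens):
    """Eliminate words that carry little meaning on their own."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

def stem(token):
    """Naive suffix-stripping stemmer (stand-in for Porter/Snowball)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

DATE_RE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")

def extract_dates(text):
    """Regular expression that converts date-like words into specific data."""
    return [(int(d), int(m), int(y)) for d, m, y in DATE_RE.findall(text)]

def bigram_counts(tokens):
    """Count bigrams: the basis of a simple N-gram predictive model."""
    return Counter(zip(tokens, tokens[1:]))

text = "The deed was signed on 12/05/2021 and registered in the registry"
tokens = remove_stop_words(text.split())
stems = [stem(t) for t in tokens]
print(stems)                 # ['deed', 'was', 'sign', 'on', '12/05/2021', 'and', 'register', 'registry']
print(extract_dates(text))   # [(12, 5, 2021)]
```

Each step discards or normalizes surface variation so that later stages see less, but denser, information.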
These techniques are intended to reduce the amount of information to be processed. But there are also expansion techniques, where layered (pyramid) methods detect entities (nouns, verbs, etc.) and multiply their links by relating them to each other to build meaning.
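The expansion idea can be illustrated with a toy sketch: detect entities with a hand-written lexicon (a real system would use a trained POS tagger or NER model), then relate every detected noun to every other noun through the verbs between them. This is a generic illustration, not the specific pyramid method described above.

```python
import itertools

# Toy lexicon for illustration; real systems use trained taggers/NER models.
LEXICON = {
    "registrar": "NOUN", "property": "NOUN", "bank": "NOUN",
    "seized": "VERB", "owns": "VERB", "sold": "VERB",
}

def detect_entities(tokens):
    """Label each token we recognise with its category (noun, verb, ...)."""
    return [(t, LEXICON[t]) for t in tokens if t in LEXICON]

def link_entities(tagged):
    """Expand: relate every pair of nouns through each detected verb."""
    nouns = [t for t, tag in tagged if tag == "NOUN"]
    verbs = [t for t, tag in tagged if tag == "VERB"]
    return [(a, v, b)
            for (a, b), v in itertools.product(
                itertools.combinations(nouns, 2), verbs)]

tagged = detect_entities("the bank seized the property".split())
print(link_entities(tagged))  # [('bank', 'seized', 'property')]
```

Note how the output grows multiplicatively with the number of entities and verbs: that is the expansion, and a later stage must then score or prune the candidate relations.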
We have had an NLP research team at Serimag for years, working together with the TALP (Centre for Language and Speech Technologies and Applications) research centre at the Universitat Politècnica de Catalunya (UPC). This collaboration has focused on BPO (Business Process Outsourcing) data-recording processes, which have benefited from incorporating what we call an assisted data capture layer. With it, the system uses Artificial Intelligence to locate, order and prioritize the necessary data in documents and highlight them to the user, so that the corresponding operation can be performed much more efficiently. In document processing, our TAAD solution recognizes more than 50 different fields, including personal identification data, postal addresses, mortgage operation data, and more. It also assists debt collection teams with land registry reports. These, despite being common documents with a more or less simple structure, conceal great complexity, since each registrar writes them with their own subtle differences. Answering questions such as “Has the house been seized?”, “By whom?”, “For how much?” is not trivial. And tracing how many square metres a property has requires a thorough understanding of the text, since its definition may be spread throughout the document across its different concepts.
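To give a flavour of what locating fields in such documents involves, here is a minimal sketch of regex-based field capture. The field names and patterns (a Spanish ID number, a euro amount, a surface in m2) are hypothetical examples chosen for illustration; they are not TAAD's actual rules, and real land registry reports require far more than regular expressions.

```python
import re

# Hypothetical field patterns for illustration only, not TAAD's actual rules.
FIELD_PATTERNS = {
    "dni": re.compile(r"\b\d{8}[A-Z]\b"),                     # Spanish ID number
    "amount_eur": re.compile(r"\b\d{1,3}(?:\.\d{3})*,\d{2}\s*€"),
    "surface_m2": re.compile(r"\b(\d+(?:,\d+)?)\s*m2\b"),
}

def capture_fields(text):
    """Locate candidate values for each field so an operator can confirm them."""
    return {name: pat.findall(text) for name, pat in FIELD_PATTERNS.items()}

report = "Finca de 85 m2, embargada por 120.000,00 € a favor del titular 12345678Z"
print(capture_fields(report))
```

In an assisted data capture setting, the point is not that the machine decides alone, but that candidates like these are highlighted for a human operator to confirm or correct.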
So, we can clearly see the use of NLP in our daily lives. In recent times, chatbots have been making great strides and attracting numerous headlines. But NLP goes beyond a conversation between a person and a machine. Its evolution also involves, for example, systems for detecting emotions or making recommendations. At Serimag we are part of this evolution, and we believe that the next step in our document processing should be to find the relationships established between documents. Once we can understand what a document says, the next step is to understand its correlation with other documents, so that we can segment documentation more accurately and naturally, just as a human would.