Scientific text understanding by machines

Regular software works much like a calculator: for the same input, the result is always the same. Machine learning software, by contrast, works more like a human brain: the more data it receives and reads, the better its results become. The applications of Natural Language Processing are endless. NLP makes it possible for computers to read text, hear speech, interpret it, measure sentiment, and determine which parts are important. With NLP we can come close to teaching a machine to read and understand text the way humans do.

What is Natural Language Processing?

Natural Language Processing sits at the intersection of three fields: linguistics, which provides the understanding of human language; computer science, which provides the software and infrastructure; and artificial intelligence as a research field.

NLP enables a computer to understand the content of documents, including the contextual nuances of the language within them, to accurately extract information and insights as well as categorize and organize the documents themselves. A document can be anything from a short Twitter message to a scientific paper or a full book.

NLP covers many applications including speech recognition, natural language understanding and natural language generation. 

A bit of NLP history

From the 1950s all the way up to the 1990s, symbolic NLP was used. Grammar rules had to be written out by hand and fed to the machine, hard-coded alongside dictionary lookups. Every detail had to be spelled out: how to stem a word, how to remove word endings, how grammar works, how to conjugate verbs, and so on. The approach is unscalable and unmaintainable, because every little nuance of the language has to be taught to the machine explicitly.

Then, in the 1990s, statistical NLP came into use. Statistical NLP comprises all quantitative approaches to automated language processing, including probabilistic modeling, information theory, and linear algebra. Instead of following hand-written rules, the machine produced answers based on statistics counted over large text corpora.

Eventually, in the 2010s, neural NLP was introduced, which allowed much more accurate and natural results learned from examples of actual text. Popular methods include word embeddings, which capture the semantic properties of words, and end-to-end learning of higher-level tasks. This type of NLP remains the dominant approach today.

Leading models for language understanding

Currently, two popular engines for text understanding dominate the market: BERT and GPT-3.

BERT (Bidirectional Encoder Representations from Transformers) was introduced by Google in 2018. Its design allows the model to consider the context from both the left and the right side of each word. At its release, BERT obtained new state-of-the-art results on eleven NLP tasks, including question answering, named entity recognition, and other tasks related to general language understanding.

GPT-3 (Generative Pre-trained Transformer, 3rd generation) was introduced in May 2020 by OpenAI. It is a 175-billion-parameter autoregressive language model. GPT-3 can respond to any text a person types with a new piece of text appropriate to the context. Evaluation under few-shot, one-shot, and zero-shot learning shows that GPT-3 achieves promising results and even occasionally outperforms the state of the art achieved by fine-tuned models.

Both are available as pre-trained models, meaning they have already been trained on a large set of documents. That is useful for general tasks, but for specific problems the general corpus is not enough. To make such a model applicable to a new challenge, it has to be fine-tuned or even re-trained from scratch with a big data set and multiple training cycles. That is where it gets expensive.

Scientific Text Understanding

We built a machine for general scientific text understanding. For training, we used a set of 18 million interdisciplinary research articles, and we found that our machine needed to understand a vocabulary of about 200,000 words contextually. That meant we needed an engine both on the software side and on the hardware side, with the architecture and the capacity for really complex word and contextual understanding. The technology powers literature review tools where the user can form a problem statement to find related documents and then narrow that reading list down to a precise document. The input text can be the title and abstract of a research paper, or it can be a self-written problem statement: just a small piece of text, 300–500 words. The machine then extracts the key concepts, the most meaningful terms, from the document and finds contextual synonyms, topic model words, or hypernyms. It builds a fingerprint of that text and uses it for fingerprint matching against the abstracts of millions of research papers.
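The fingerprint-and-match idea above can be sketched in a few lines. Here the fingerprint is just a set of the most frequent non-stopword terms, matched by Jaccard overlap; the real engine also uses contextual synonyms, hypernyms, and embeddings, so `fingerprint`, `match_score`, and the tiny stopword list below are purely illustrative stand-ins:

```python
from collections import Counter

# Toy stopword list for illustration only.
STOPWORDS = {"the", "a", "of", "and", "in", "to", "is", "for", "on", "we", "from"}

def fingerprint(text, size=5):
    """Extract the most frequent non-stopword terms as a toy 'fingerprint'."""
    words = [w.strip(".,:;").lower() for w in text.split()]
    counts = Counter(w for w in words if w and w not in STOPWORDS)
    return {w for w, _ in counts.most_common(size)}

def match_score(fp_a, fp_b):
    """Jaccard overlap between two fingerprints: 0 = disjoint, 1 = identical."""
    union = fp_a | fp_b
    return len(fp_a & fp_b) / len(union) if union else 0.0

query = fingerprint("Neural networks for protein structure prediction: "
                    "predicting protein folding with deep learning.")
abstract = fingerprint("We apply deep learning to the protein folding problem "
                       "and predict protein structure from sequence.")
print(match_score(query, abstract))
```

In the production system the abstracts of millions of papers would each get such a fingerprint, and matching becomes a nearest-neighbor search over them.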

For key information extraction, the engine uses an upgraded version of TF-IDF. TF-IDF is a well-known, well-tested, proven algorithm, and the company develops its continuation. Then, to do the contextual analysis of the input text, the machine uses word embeddings. The goal is to return results that not only contain the keywords but carry the same contextual meaning. In other words, we take every word and turn it into a point in a vector space, with a location, a direction, and a length. As an analogy: imagine an infinite universe where every word is a little planet, and those planets are organized into galaxies. If two points are closer together, the words have more in common contextually; if they are further apart, they are less similar. Except here, the machine works in a hundred-dimensional vector space.
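A minimal sketch of the vector-space idea, using made-up 3-dimensional embeddings in place of the engine's hundred-dimensional ones. The vectors and the cosine measure below are illustrative, not the engine's actual representation:

```python
import math

# Made-up toy embeddings; a real model learns these from text.
embeddings = {
    "neuron":  [0.90, 0.80, 0.10],
    "synapse": [0.85, 0.75, 0.20],
    "galaxy":  [0.10, 0.20, 0.95],
}

def cosine_similarity(u, v):
    """Closeness of two word vectors by angle: 1.0 = same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Contextually related words end up close; unrelated ones far apart.
print(cosine_similarity(embeddings["neuron"], embeddings["synapse"]))
print(cosine_similarity(embeddings["neuron"], embeddings["galaxy"]))
```

The same computation carries over unchanged to 100 dimensions; only the length of the vectors grows.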

To train the machine, the team took that set of about 18 million articles and started with the first 11 words of the first abstract of the first paper. The machine was shown the five words before and the five words after the middle of that 11-word string and asked to guess the word in the middle. Then the window moves one word over and the process repeats. After 18 million papers, the machine has a really good understanding of contextual synonyms: which words are used in place of each other in a given context.
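The sliding-window procedure described above can be sketched as generic CBOW-style data generation. `cbow_windows` is a hypothetical helper, not the team's actual code:

```python
def cbow_windows(tokens, half_window=5):
    """Slide an 11-word window over the text and yield (context, target)
    pairs: the 5 words before and the 5 words after, plus the middle word."""
    width = 2 * half_window + 1
    for i in range(len(tokens) - width + 1):
        window = tokens[i:i + width]
        target = window[half_window]                        # word to guess
        context = window[:half_window] + window[half_window + 1:]
        yield context, target

abstract = ("natural language processing enables computers to read text "
            "hear speech and interpret it accurately").split()
for context, target in cbow_windows(abstract):
    print(target, "<-", context)
```

A neural network trained on billions of such pairs learns to place words that fill the same gaps close together in the embedding space.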

Another technique the engine uses is Neural Topic Models (NTMs). The machine receives a group of papers and is asked to categorize them into hidden topics, clustering them into contextually similar groups. This is unsupervised learning, so the machine gets no instructions or expectations. The machine is then asked to describe each group with ten words. In this way it builds a model of the different categories that make up the world of science. A machine trained on a general corpus of scientific documents produces clearly different topics than one trained only on documents from a single research field.
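A neural topic model itself is beyond a short snippet, but the final "describe each group with ten words" step can be illustrated with a crude frequency-based stand-in. The clustering is assumed to have already happened, and `describe_cluster` is purely illustrative, not the NTM algorithm:

```python
from collections import Counter

corpus = [
    "neural networks learn data representations",
    "gene expression regulates protein synthesis",
    "deep neural networks require training data",
    "protein folding depends on gene sequence",
]

def describe_cluster(docs, background, top_n=10):
    """Describe a group of documents by its most over-represented words
    relative to the whole corpus (a crude stand-in for NTM topic words)."""
    counts = Counter(w for d in docs for w in d.split())
    scores = {w: c / (1 + background[w]) for w, c in counts.items()}
    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]]

background = Counter(w for d in corpus for w in d.split())
# Pretend an unsupervised step already grouped documents 0 and 2 together.
print(describe_cluster([corpus[0], corpus[2]], background))
```

A real NTM learns both the clusters and the describing words jointly, from the documents alone.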

Another approach in use is WISDM (word importance-based similarity of documents metric), which was developed in-house and published in 2017. It is a fast and scalable document similarity metric for the analysis of scientific documents, based on recent advancements in word embeddings. WISDM combines learned word vectors with traditional count-based models for document similarity computation, achieving state-of-the-art performance and precision. The method first selects, from each of two text documents, the words that carry the most information, forming a word set for each document. It then relies on an existing word embeddings model to get the vector representations of the selected words. In the final step, it computes the closeness of the two sets of word vectors, arranged in a matrix, using a correlation coefficient.
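A toy sketch of that pipeline: look up vectors for each document's selected words, fill a pairwise-similarity matrix, and reduce it to one closeness score. The embeddings below are made up, the word-selection step is assumed done, and the mean of the matrix stands in for the paper's correlation-coefficient computation:

```python
import math

# Made-up toy embeddings; WISDM plugs in a trained embeddings model here.
VECTORS = {
    "protein": [0.9, 0.1], "gene": [0.8, 0.3],
    "galaxy":  [0.1, 0.9], "star": [0.2, 0.8],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def similarity_matrix(words_a, words_b):
    """Pairwise word-vector similarities between the two selected word sets."""
    return [[cosine(VECTORS[a], VECTORS[b]) for b in words_b] for a in words_a]

def wisdm_like_score(words_a, words_b):
    """Crude closeness: mean of the similarity matrix (WISDM proper reduces
    the matrix with a correlation coefficient over weighted word vectors)."""
    m = similarity_matrix(words_a, words_b)
    return sum(sum(row) for row in m) / (len(m) * len(m[0]))

print(wisdm_like_score(["protein", "gene"], ["gene", "protein"]))
print(wisdm_like_score(["protein", "gene"], ["galaxy", "star"]))
```

Because only a handful of informative words per document enter the matrix, the comparison stays fast even across millions of papers.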

Key Takeaways

👉 Natural Language Processing enables a computer to understand the content of documents, accurately extract information and insights, and categorize and organize the documents themselves.

👉 Over the last 70 years, three types of NLP have been used: symbolic, statistical, and neural.

👉 Currently, leading models for language understanding are BERT and GPT-3.

👉 The engine is trained on 18 million interdisciplinary research papers.

👉 The technology we use includes TF-IDF, Neural Topic Models, word embeddings, and WISDM.