Tech Deep Dive: Parsing, entity extraction and variable linking – Extraction Part 2

Welcome to the second article in the Tech Deep Dive series! In this series we take a closer look at the technology behind our tools. This article continues our explanation of the Extract tool. The first part, “Extraction of table data (and why it’s difficult)”, covers extracting data from tables. In this post, we walk you through text parsing, entity extraction and variable linking. If you like this series and want to be notified about new articles, subscribe to our newsletter.

Text parsing

Table data extraction is one part of the extraction process; text extraction is the other. Using an external library, we convert the documents from PDF to XML and then to raw text. We then refine and structure the extracted text into information we can work with. The next step is metadata extraction, which captures information about the paper itself. For scientific articles this is the title, authors, publication date and abstract. For patents it is the patent number, filing date, publication date, inventors, etc. We use pattern recognition techniques to locate this information in the extracted text.
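To make the metadata step more concrete, here is a minimal sketch of pattern-based field extraction using regular expressions. It is not our production code: the field labels ("Patent No.", "Filed:", "Date of Patent:"), the patterns and the `extract_metadata` helper are all assumptions chosen for illustration, and real documents vary far more than this sample does.

```python
import re

# Hypothetical patterns for a few patent front-page fields.
# The labels and date format are assumptions made for this example.
PATTERNS = {
    "patent_number": re.compile(r"Patent\s+No\.?:?\s*(US\s?[\d,]+\s?[A-Z]\d?)"),
    "filing_date": re.compile(r"Filed:\s*(\w+\.?\s+\d{1,2},\s+\d{4})"),
    "publication_date": re.compile(r"Date of Patent:\s*(\w+\.?\s+\d{1,2},\s+\d{4})"),
    "inventors": re.compile(r"Inventors?:\s*(.+)"),
}


def extract_metadata(text):
    """Return the first match for each field, or None when a field is absent."""
    result = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(text)
        result[field] = match.group(1).strip() if match else None
    return result


sample = (
    "Patent No.: US 9,876,543 B2\n"
    "Date of Patent: Jan. 23, 2018\n"
    "Filed: Mar. 4, 2015\n"
)
metadata = extract_metadata(sample)
print(metadata)
```

Fields that are not found come back as `None` (here, the inventors), so later steps can tell "missing" apart from "empty".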

We now have the table data with its rows and columns, the text, and the metadata. Next, we focus on extracting the key entities of interest.

Entity and quantity extraction first identifies all entities of interest (e.g. temperature) by modeling the text information profile. It then extracts the quantities, which are numbers with units (e.g. 400 °C), and links each quantity to its entity. We call each linked pair a quantified entity. A quantity can also be expressed as a range or as a value comparison. Besides the name of the entity, we also extract the context around it (e.g. “boiling temperature” or “manufacturing process”) so that we can link the entities to user-defined requirements, which our clients provide in the form of an output data layout.
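The idea of a quantified entity can be illustrated with a small sketch. This is a deliberately simplified stand-in for the real engine: it uses regular expressions instead of a trained model, and the entity vocabulary, the unit list, the `QuantifiedEntity` class and the "link to the nearest preceding entity" rule are all assumptions made for this example.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed toy vocabularies; a real system would use far richer models.
ENTITY_TERMS = ["temperature", "pressure"]
UNITS = r"°C|K|bar|MPa"
NUMBER = r"\d+(?:\.\d+)?"  # non-negative numbers only, for simplicity

# A quantity: optional comparison word, a number or a range, then a unit.
QUANTITY = re.compile(
    rf"(?P<cmp>above|below|at least|at most|<|>)?\s*"
    rf"(?P<low>{NUMBER})(?:\s*(?:-|–|to)\s*(?P<high>{NUMBER}))?"
    rf"\s*(?P<unit>{UNITS})(?!\w)"
)

# An entity: a known term, plus one preceding word (not an article) as context.
NAMES = "|".join(ENTITY_TERMS)
ENTITY = re.compile(
    rf"(?:\b(?P<modifier>(?!(?:the|a|an)\b)\w+)\s+)?\b(?P<name>{NAMES})\b",
    re.IGNORECASE,
)


@dataclass
class QuantifiedEntity:
    name: str
    context: str
    low: float
    high: Optional[float]      # set only when the quantity is a range
    unit: str
    comparison: Optional[str]  # e.g. "above", "<"; None for plain values


def extract_quantified_entities(text):
    """Link each quantity to the nearest entity mentioned before it."""
    entities = list(ENTITY.finditer(text))
    results = []
    for q in QUANTITY.finditer(text):
        preceding = [e for e in entities if e.end() <= q.start()]
        if not preceding:
            continue  # no entity to attach this quantity to
        e = preceding[-1]
        results.append(QuantifiedEntity(
            name=e.group("name").lower(),
            context=e.group(0),
            low=float(q.group("low")),
            high=float(q.group("high")) if q.group("high") else None,
            unit=q.group("unit"),
            comparison=q.group("cmp"),
        ))
    return results


text = "The boiling temperature was 400 °C. The pressure was kept at 2 to 5 bar."
found = extract_quantified_entities(text)
for item in found:
    print(item)
```

In this sketch the context is just the entity name with one preceding word, which is enough to tell "boiling temperature" apart from a bare "temperature".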

Tables go through the same process: the column header is the name of the entity and the value in the cell is the quantity.
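As a rough sketch of the table case, the snippet below turns one column into a list of quantified entities. The assumption that the unit appears in parentheses or brackets in the header (as in "Temperature (°C)") is ours for this example, as is the `column_to_quantified_entities` helper; real headers are often messier.

```python
import re

# Assumed header shape: "<entity name> (<unit>)" or "<entity name> [<unit>]".
HEADER = re.compile(r"^\s*(?P<name>.+?)\s*[\(\[](?P<unit>[^\)\]]+)[\)\]]\s*$")


def column_to_quantified_entities(header, cells):
    """Use the header as the entity name and each numeric cell as a quantity."""
    match = HEADER.match(header)
    name = match.group("name") if match else header.strip()
    unit = match.group("unit").strip() if match else None
    records = []
    for cell in cells:
        try:
            value = float(cell)
        except ValueError:
            continue  # skip empty or non-numeric cells
        records.append({"entity": name, "value": value, "unit": unit})
    return records


rows = column_to_quantified_entities("Temperature (°C)", ["400", "n/a", "425.5"])
print(rows)
```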

Output data layout

To continue with the extraction, our clients need to specify the information they are interested in. The output data layout (ODL) is a simple spreadsheet that lists the entities to be extracted, along with any supplementary requirements such as the units in which each entity should be reported. Such entities could be chemical compounds, process variables, etc. It’s worth mentioning that the engine understands chemical compounds and can find them even within abbreviations. Moreover, the engine can convert units to the one specified in the ODL. For example, if the ODL specifies centimeters, all values in feet, inches, meters and millimeters will be converted to centimeters.
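The unit conversion can be sketched as a lookup of conversion factors. The ODL is represented here as a plain dictionary from entity name to target unit, which is an assumption for illustration (the real ODL is a spreadsheet), and the `TO_CM` table, `convert_length` and `apply_odl` names are hypothetical.

```python
# Length of one unit expressed in centimeters (exact by definition).
TO_CM = {"ft": 30.48, "in": 2.54, "m": 100.0, "mm": 0.1, "cm": 1.0}

# Hypothetical ODL: entity name -> unit the client wants in the output.
ODL = {"pipe length": "cm"}


def convert_length(value, unit, target="cm"):
    """Convert a length between any two units listed in TO_CM."""
    if unit not in TO_CM or target not in TO_CM:
        raise ValueError(f"unsupported unit: {unit!r} -> {target!r}")
    return value * TO_CM[unit] / TO_CM[target]


def apply_odl(record, odl=ODL):
    """Return a copy of the record expressed in the unit the ODL asks for."""
    target = odl[record["entity"]]
    converted = convert_length(record["value"], record["unit"], target)
    return {"entity": record["entity"], "value": converted, "unit": target}


extracted = [
    {"entity": "pipe length", "value": 1, "unit": "ft"},
    {"entity": "pipe length", "value": 2, "unit": "in"},
    {"entity": "pipe length", "value": 250, "unit": "mm"},
]
normalized = [apply_odl(r) for r in extracted]
print(normalized)
```

Going through one base unit keeps the table small: adding a new unit needs a single factor rather than one factor for every pair of units.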

Next Steps

The next step is linking entities to an ontology and reinforcement, which you can read about in the next blog post!