Today we’re coming to you with the next article in our Tech Deep Dive series! In this series we explain the technology behind our tools. In our previous articles we wrote about “Extraction of table data (and why it’s difficult)” and “Parsing, entity extractions and variables linking”. This is the third and last part of unraveling the Extract tool, where we take a closer look at how entities are linked to the Output Data Layout and what other information the client receives once the extraction is done. If you are interested, subscribe to our newsletter to be the first to be notified whenever a new blog post is published.
To summarize: during the extraction process we located the tables in the PDF documents and extracted the table captions and the data inside them. Then we extracted the metadata, entities and quantities from the text. Lastly, we put the data points the client wants to extract into a spreadsheet we call the Output Data Layout (ODL). Now it’s time to put it all together!
Grouping entities into products
The first stage of the process is grouping entities into what we call a product. One row in the ODL is one product. In the steel industry a product would be the material produced, and in pharma it would be the tested drug. The other part of this stage is linking the data in the table rows to the correct products described in the text.
Linking entities to ontology elements
Now we have a bunch of entities: entities from the text and entities from the tables. At this point they are not connected to the Output Data Layout. The first step is to create connections between extracted entities and ODL elements based on known features that help our Assignment Solver Model (ASM) decide which entity should go to which ODL element. Examples of such features are the entity and the element sharing the same units (e.g. °C connected to temperature), being used in a similar context, or appearing close to each other in the text. Each established connection is rated on these characteristics, and the ASM, which is trained on such connections, selects the best option. Once the extraction is done, the choices are stored in a graph database so they can help us improve the ASM going forward.
Unfortunately, it’s not as simple as it seems. Sometimes an entity could be connected to several ODL elements. We call each such connection a candidate, and of course we want to choose the best one. But in the end each entity should be assigned to only one ODL element (and one product), so every assignment we make removes that entity from the candidates of all other elements. The problem quickly becomes quite complex. Therefore we use advanced algorithms on top of the ASM to search for the optimal assignment in this multidimensional problem.
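To make the candidate problem more concrete, here is a minimal Python sketch. It is not the ASM itself: the scores, the entity values and the ODL element names are made up for illustration, and the exhaustive search is a simple stand-in for the more advanced algorithms mentioned above. It shows why picking the best candidate one entity at a time can lose against a search over all one-to-one assignments.

```python
from itertools import permutations

# Hypothetical connection scores: scores[entity][odl_element], higher is better.
# In a real system these would come from features such as matching units or
# shared context; here they are invented numbers.
scores = {
    "650 °C": {"annealing_temperature": 0.9, "tempering_temperature": 0.8},
    "700 °C": {"annealing_temperature": 0.85, "tempering_temperature": 0.3},
}


def greedy_assignment(scores):
    """Give each entity, in order, its best-scoring element that is still free."""
    taken, result = set(), {}
    for entity, candidates in scores.items():
        free = {el: s for el, s in candidates.items() if el not in taken}
        if free:
            best = max(free, key=free.get)
            result[entity] = best
            taken.add(best)
    return result


def optimal_assignment(scores):
    """Try every one-to-one assignment and keep the highest total score.

    Assumes there are at least as many elements as entities. A missing
    entity/element pair counts as score 0. Only feasible for tiny inputs.
    """
    entities = list(scores)
    elements = sorted({el for cands in scores.values() for el in cands})
    best_total, best = -1.0, {}
    for perm in permutations(elements, len(entities)):
        total = sum(scores[ent].get(el, 0.0) for ent, el in zip(entities, perm))
        if total > best_total:
            best_total, best = total, dict(zip(entities, perm))
    return best, best_total


greedy = greedy_assignment(scores)
best, best_total = optimal_assignment(scores)
print("greedy:", greedy)
print("optimal:", best, round(best_total, 2))
```

In this toy example the greedy pass gives “650 °C” its top candidate and leaves “700 °C” with a poor match, while the full search swaps the two and reaches a higher total score. The number of possible assignments grows very quickly with the number of entities and elements, which is why a smarter search is needed in practice.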
The engine can be reinforced on the client’s industrial domain so that it connects the right data more accurately. We teach the machine the language of the industry: the terminology, the format of papers and patents, abbreviations, units, etc. We need very few examples to do that. As few as 5 to 20 are enough, since we use our other tools to find similar documents and grow the dataset (the bigger the corpus, the better the results). In this way the machine becomes better at understanding the specifics of the industry and the connections between entities and the ontology.
The last stage is exporting the results in ODL format (usually a tabular format such as Excel, or another format requested by the client). We also generate debug information that helps clients understand what is going on in the different stages. It is a folder with a few files: PDFs in which the extracted entities and tables are highlighted, a .csv file with the data from the tables in tabular format, and JSON files with information for tracking the results of every stage. Additionally, the result file contains a separate spreadsheet with the confidence for each extracted data point.
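As a rough illustration of what a tabular export with a parallel confidence sheet could look like, here is a small Python sketch. The column names, values and confidence numbers are hypothetical, and CSV text stands in for the Excel sheets; the actual export format may differ.

```python
import csv
import io

# Hypothetical ODL rows: one row per product, one column per ODL element.
odl_rows = [
    {"product": "Alloy A", "annealing_temperature": "700 °C"},
    {"product": "Alloy B", "annealing_temperature": "650 °C"},
]

# Hypothetical confidence sheet with the same layout, holding one number
# per extracted data point instead of the value itself.
confidence_rows = [
    {"product": "Alloy A", "annealing_temperature": 0.93},
    {"product": "Alloy B", "annealing_temperature": 0.71},
]


def to_csv(records):
    """Serialize a list of same-shaped dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0]))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


odl_csv = to_csv(odl_rows)
confidence_csv = to_csv(confidence_rows)
print(odl_csv)
print(confidence_csv)
```

Keeping the confidence sheet in the same shape as the data sheet means a reader can look up the confidence of any cell by its row and column.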
Verifying data accuracy
To verify the accuracy of the results we use precision and recall as well as a self-assessment module. For precision and recall we compare the results annotated by the client with the results Iris.ai provided. Recall measures whether we extracted all the data the client requested, and precision measures how much of the extracted data is correct. The self-assessment module rates how confident the machine is at each stage of the extraction. The confidence is expressed as a number and is included in the ODL.
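Here is a minimal Python sketch of how precision and recall could be computed from such a comparison. It assumes both the client annotations and the extraction results are keyed by a (product, ODL element) pair and that a data point counts as correct only on an exact value match; the function name and the sample data are made up for illustration.

```python
def precision_recall(extracted, annotated):
    """Compare extracted data points against the client's annotations.

    Both arguments map (product, odl_element) -> value.
    precision = correct extractions / all extractions
    recall    = correct extractions / all annotated data points
    """
    correct = sum(
        1 for key, value in extracted.items() if annotated.get(key) == value
    )
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(annotated) if annotated else 0.0
    return precision, recall


# Hypothetical client annotations: four requested data points.
annotated = {
    ("Alloy A", "annealing_temperature"): "700 °C",
    ("Alloy A", "hardness"): "210 HV",
    ("Alloy B", "annealing_temperature"): "650 °C",
    ("Alloy B", "hardness"): "190 HV",
}

# Hypothetical extraction output: three data points, one of them wrong.
extracted = {
    ("Alloy A", "annealing_temperature"): "700 °C",
    ("Alloy A", "hardness"): "210 HV",
    ("Alloy B", "annealing_temperature"): "600 °C",
}

precision, recall = precision_recall(extracted, annotated)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the three extracted values are correct, so precision is 2/3, and two of the four requested values were found correctly, so recall is 1/2.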
👉 The data extraction process is complicated and includes multiple steps: table data extraction, metadata extraction, extracting entities and quantities, and connecting them to ontology elements.
👉 Table data extraction is especially difficult due to PDF formatting and the vast variety of table types.
👉 The client chooses what data points need to be extracted by creating an Output Data Layout.
👉 The Iris.ai machine can be reinforced on a specific industry to provide more accurate results.