Tech Deep Dive: Extraction of table data (and why it’s difficult) – Extraction Part 1
We are starting a new blog series! This is the Tech Deep Dive, where we explain the technology behind our tools. In this first post of the series, we take a closer look at table data extraction, which is part of the Extract tool. The next post in the series will explain the rest of the process in the Extract tool (text linking and grouping of data), so subscribe to our newsletter to be the first to be notified!
Table data extraction is a multistage process, and each stage has its own challenges. At Iris.ai, we split it into three downstream tasks: table location, table caption linking, and table structure and data extraction.
PDFs are human readable
PDF documents are human readable, not machine readable. PDF is a format that lets humans view a document exactly the same way in different programs. This is because the file describes where text and graphics are drawn on the page, so every viewer displays the same picture. The format was created specifically to show information to humans universally. A PDF document typically doesn’t contain information about where exactly the tables are or which cells form the rows and columns. Humans can easily see it, but the machine can’t!
Locating the tables
To locate the tables in PDF documents, Iris.ai uses an object detection model. Object detection is a computer vision technique that identifies and locates objects in an image or video. It’s the same kind of model that tells you there is a dog in a photo. Our model is trained to detect rectangles with regularly spaced data inside. We feed the image generated from the PDF page to the object detection model, and it returns each table’s location paired with a confidence level.
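To make the output of this step concrete, here is a minimal sketch of how detections with a confidence level could be filtered before the next stage. It is an illustration only, not Iris.ai’s code: the `Detection` class, the `keep_confident` function, the 0.8 threshold and the sample boxes are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    # Hypothetical detector output: a box in page-image pixels plus a score.
    x0: float
    y0: float
    x1: float
    y1: float
    confidence: float  # model score in [0, 1]


def keep_confident(detections, threshold=0.8):
    """Return detections at or above the threshold, most confident first."""
    kept = [d for d in detections if d.confidence >= threshold]
    return sorted(kept, key=lambda d: d.confidence, reverse=True)


# Made-up detections for one rendered page (assumed values).
page_detections = [
    Detection(50, 100, 550, 300, 0.97),
    Detection(60, 400, 540, 480, 0.42),
    Detection(50, 520, 550, 700, 0.88),
]
tables = keep_confident(page_detections)
```

With these sample values, the low-scoring box in the middle of the page is dropped and two table candidates remain. Where to put the threshold is a trade-off: a lower value finds more tables but lets more false positives through.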
Table caption extraction
The next step is to extract the table captions. The machine does not know which caption belongs to which table, so we extract all candidate captions and link them to all candidate table locations. We then choose the right pairing based on the probability and the distance to the table. We also match continuation tables with their original tables.
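One way to picture the linking step is as a scoring problem: every caption candidate is scored against every table, and the best pairs win. The sketch below is an assumption-heavy illustration, not the method used in the Extract tool. The `Box` and `Caption` classes, the `link_score` formula (probability discounted by centre-to-centre distance), the `scale` parameter and the greedy assignment are all hypothetical choices.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def center(self):
        return ((self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2)


@dataclass(frozen=True)
class Caption:
    text: str
    box: Box
    probability: float  # assumed score that this text is a table caption


def link_score(caption, table, scale=200.0):
    """Higher is better: caption probability discounted by distance to the table."""
    distance = math.dist(caption.box.center, table.center)
    return caption.probability / (1.0 + distance / scale)


def link_captions(tables, captions):
    """Greedily give each table its best-scoring caption that is still unused."""
    pairs = sorted(
        ((link_score(c, t), ti, ci)
         for ti, t in enumerate(tables)
         for ci, c in enumerate(captions)),
        reverse=True,
    )
    assigned, used = {}, set()
    for _score, ti, ci in pairs:
        if ti not in assigned and ci not in used:
            assigned[ti] = captions[ci]
            used.add(ci)
    return assigned


# Made-up page layout: two tables and three caption candidates.
tables = [Box(50, 100, 550, 300), Box(50, 500, 550, 700)]
captions = [
    Caption("Table 1. Sample composition", Box(50, 70, 400, 90), 0.95),
    Caption("Table 2. Results", Box(50, 470, 300, 490), 0.90),
    Caption("Figure 3. Setup", Box(50, 320, 300, 340), 0.10),
]
links = link_captions(tables, captions)
```

In this toy layout each table ends up with the caption directly above it, and the figure caption is left unassigned because its probability is low. Matching continuation tables to their originals would need extra logic that is not shown here.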
Table data extraction
The last part is the actual extraction of the table data, and it’s more complicated than it looks! To summarize: we have located the tables, so we know where they are and what they are called. Now we need to split the data into rows and columns.
We use several algorithms to solve this problem: an image-based algorithm that uses the visual information to split the cells, a statistical algorithm that fills in missing visual information, and a graph-based approach that extracts the full table structure. One of the algorithms reads the information in the PDF about the lines dividing the cells, and based on that we can divide the table into rows and columns. And that’s it! Easy, right? Well, sometimes the PDF doesn’t contain information about the lines, and sometimes the tables have no lines between the columns, so we need a different algorithm that reconstructs the missing lines. The third algorithm categorizes the headers and labels (row names) and extracts the full table structure, which lets us output the table data structured as it is in the document.
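To give a feel for what reconstructing missing lines can mean, here is a deliberately simple sketch that guesses column separators from horizontal whitespace: wherever no word in the table crosses a wide enough gap, a vertical line is assumed. This is a basic gap heuristic chosen for illustration, not the statistical algorithm used by Iris.ai, and the function names, the `min_gap` parameter and the sample coordinates are all assumptions.

```python
def column_separators(word_extents, min_gap=8.0):
    """Guess x positions of missing vertical lines from gaps between words.

    word_extents: (x0, x1) horizontal extents of every word in the table.
    Returns the midpoint of each gap of at least min_gap that no word crosses.
    """
    spans = sorted(word_extents)
    if not spans:
        return []
    separators = []
    covered_until = spans[0][1]  # right edge of the text seen so far
    for x0, x1 in spans[1:]:
        if x0 - covered_until >= min_gap:
            separators.append((covered_until + x0) / 2)
        covered_until = max(covered_until, x1)
    return separators


def split_row(row_words, separators):
    """Place each (x0, x1, text) word of one row into its column."""
    cells = [[] for _ in range(len(separators) + 1)]
    for x0, x1, text in sorted(row_words):
        center = (x0 + x1) / 2
        column = sum(center > s for s in separators)
        cells[column].append(text)
    return [" ".join(cell) for cell in cells]


# Made-up words of a borderless table, one list per text row.
rows = [
    [(10, 60, "Sample"), (120, 150, "Temp"), (155, 175, "(K)"), (230, 270, "Yield")],
    [(10, 30, "A1"), (120, 145, "300"), (230, 260, "0.82")],
    [(10, 30, "B2"), (120, 145, "350"), (230, 260, "0.91")],
]
extents = [(x0, x1) for row in rows for x0, x1, _ in row]
separators = column_separators(extents)
table = [split_row(row, separators) for row in rows]
```

On this sample, the small gap inside the header "Temp (K)" is ignored while the two wide gaps become column boundaries, giving three columns. Real tables break this heuristic quickly (merged cells, tightly packed columns, multi-line cells), which is part of why several algorithms are combined.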
So why is table data extraction difficult?
Table data is an extremely important part of the data extraction process. Most of the time, the key information is found in the tables.
Table data extraction is a multistage problem, and each stage is difficult on its own. In the first stage, locating the tables in PDFs, the object detection model had to be trained specifically to recognize tables. Training is a laborious and time-consuming process.
Another issue we had to tackle is understanding the data, that is, extracting the rows and columns. There are many approaches to this in the literature (advanced neural networks, graph-based approaches, rule-based approaches, etc.), but very few of them actually produce good results.
Moreover, if you’ve read scientific papers, you have probably noticed that there are many different types of tables. Some tables have lines between rows and columns, some don’t. Some columns are closer together than others. Some tables are bigger and some are smaller. Some tables have more rows and columns and some have fewer. Sometimes there are two different tables next to each other on the same page. Sometimes the table continues on the next page. That diversity makes it more difficult to extract the data.
Lastly, it’s challenging to find the right balance between the size of the model and the performance. Table data extraction models are big and take a long time to run, but at Iris.ai we strive to keep performance at a good level and deliver results fast.
After table data extraction, Iris.ai extracts the text and metadata and then connects them to the ODL (Output Data Layout), which you will read about in our next Tech Deep Dive blog post!