P5: Improving extraction of tabular information from scientific text

Contact

RQ: How can we reliably and efficiently detect tables in scientific publications, both in machine-readable PDFs and in scanned images? How can we correctly detect their column and row boundaries to enable reliable extraction of the table content?

Description:
At Iris.ai, we use machine learning and natural language processing (NLP) to help researchers more easily find the scientific information they need. Since much of the crucial information in scientific publications appears in tabular form, we need to reliably detect and extract tabular data from these documents. This is not an easy task: data in PDF format typically comes without metadata about the tables it contains. Our algorithms must therefore infer the presence of a table, learn the exact location of its cells, and correctly reflow the text within those cells based on formatting cues alone. Moreover, there is no universal convention for how authors present their tabular data.

The PDF documents we need to analyze can often be parsed into an XML tree, so that the formatting cues we need to recognize take the form of the placement of different XML elements. However, PDFs of older publications are often available only as scanned images; in such cases, table detection and data extraction become optical character recognition (OCR) problems. As part of Iris.ai's AI Chemist roadmap, we have combined computer-vision and heuristic techniques to develop algorithms capable of detecting and extracting tabular data. These algorithms will serve as baselines for this project. The goal is to research an alternative machine learning approach that improves on these baselines.
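To make the heuristic idea concrete, here is a toy sketch. Assuming word bounding boxes have already been extracted (via an XML parse or OCR), a simple baseline clusters words into rows and columns by coordinate proximity. The function names, tolerance values, and sample data below are illustrative assumptions, not our actual implementation.

```python
def cluster_1d(values, tol):
    """Group sorted coordinate values whose gaps are below `tol`."""
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= tol:
            groups[-1].append(v)
        else:
            groups.append([v])
    # Represent each cluster by its mean coordinate.
    return [sum(g) / len(g) for g in groups]

def assign(value, centers):
    """Index of the nearest cluster center."""
    return min(range(len(centers)), key=lambda i: abs(centers[i] - value))

def boxes_to_grid(boxes, x_tol=20, y_tol=5):
    """boxes: list of (x, y, text) word positions -> 2D grid of cell text."""
    col_centers = cluster_1d([x for x, _, _ in boxes], x_tol)
    row_centers = cluster_1d([y for _, y, _ in boxes], y_tol)
    grid = [["" for _ in col_centers] for _ in row_centers]
    for x, y, text in boxes:
        r, c = assign(y, row_centers), assign(x, col_centers)
        grid[r][c] = (grid[r][c] + " " + text).strip()
    return grid

# Hypothetical word boxes for a two-column table with a header row.
words = [
    (10, 10, "metric"), (110, 10, "value"),
    (10, 30, "yield"),  (110, 30, "0.82"),
    (10, 50, "purity"), (110, 50, "0.97"),
]
print(boxes_to_grid(words))
# → [['metric', 'value'], ['yield', '0.82'], ['purity', '0.97']]
```

Real documents break such simple thresholds (multi-line cells, spanning headers, ragged columns), which is one motivation for the learned approaches in the references below.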

Tasks:

  1. familiarize yourself with the existing code, models and evaluation data
  2. research the literature on models and algorithms for table location and table data extraction
  3. search for public benchmark datasets or, if needed, devise a methodology for creating a suitably large and diverse benchmark dataset
  4. compare the published models and algorithms to our own
  5. propose and prototype an improved implementation for table location detection and table data extraction
  6. evaluate your prototype and iterate
  7. if successful, polish your implementation such that it can be used in production
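For the evaluation steps above, one simple option is an exact-match F1 score over (row, column, text) cell triples. This metric choice is an illustrative assumption; published table-recognition benchmarks often use adjacency- or tree-based scores instead.

```python
def cells(grid):
    """Set of (row, col, text) triples for all non-empty cells."""
    return {(r, c, t) for r, row in enumerate(grid)
            for c, t in enumerate(row) if t}

def cell_f1(predicted, truth):
    """Exact-match F1 of predicted cells against ground-truth cells."""
    p, t = cells(predicted), cells(truth)
    if not p or not t:
        return 0.0
    tp = len(p & t)  # cells correct in both position and content
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(t)
    return 2 * precision * recall / (precision + recall)

truth = [["metric", "value"], ["yield", "0.82"]]
pred  = [["metric", "value"], ["yield", "0.83"]]  # one wrong cell
print(cell_f1(pred, truth))
# → 0.75
```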

References:
[1] Qasim SR, Mahmood H, Shafait F. Rethinking Table Recognition using Graph Neural Networks. arXiv:1905.13391 [cs].
[2] Schreiber S, Agne S, Wolf I, Dengel A, Ahmed S. DeepDeSRT: Deep Learning for Detection and Structure Recognition of Tables in Document Images. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR). IEEE; 2017:1162-1167. doi:10.1109/ICDAR.2017.192
[3] Paliwal SS, D V, Rahul R, Sharma M, Vig L. TableNet: Deep Learning Model for End-to-end Table Detection and Tabular Data Extraction from Scanned Document Images. In: 2019 International Conference on Document Analysis and Recognition (ICDAR); 2019:128-133. doi:10.1109/ICDAR.2019.00029