How to extract data from patents and papers

It’s hard for humans to extract data from patents and papers. Although it’s a very important task, it takes us a very long time to do; the process is tedious, monotonous and error-prone.


Why is data extraction time-consuming for humans?

Based on what we’ve heard from our customers, it takes about one ‘person month’ to retrieve all the relevant data from 60 patents. The sheer volume of data in patents and their inconsistent formats are two reasons it takes so long. Other reasons include:

  • Tables with no headers
  • Descriptions of tables hidden in the text
  • Data from a single experiment being spread across multiple tables
  • Important data points that are barely mentioned
  • Abbreviations and different terminology between patents

There’s a lot of value in extracted data — but the process of extracting is not valuable in itself. That’s why this process is now being automated.

Data extraction for steel manufacturers, chemical providers and pharmacovigilance

There are numerous examples of companies that apply AI to data extraction:

  1. One of the world’s largest steel manufacturers needed to extract data as part of their competitive landscape analysis and research. They applied AI to pull out experiment data from patents, and used the results to spot R&D opportunities.
  2. A global chemical provider for agricultural uses is developing derivative products. They automated data extraction from patents to quickly identify chemical structures and properties.
  3. A major pharmaceutical company extracted data as part of their pharmacovigilance and post-market activities, including patient baseline, treatment, outcomes and adverse effects.

How do I extract data from patents and papers?

It’s pretty simple! Much like drag-and-drop, you place the patents you’d like to extract data from in a folder, and the machine starts working for you. The basic process has four steps:

  1. Extract entities and data from text
  2. Extract data from tables
  3. Link and match all data to entities
  4. Populate the results into an output data layout, which the client plays a key role in defining

It takes the machine a couple of minutes to complete these steps for hundreds of patents.
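
To make steps 3 and 4 more concrete, here is a minimal sketch in Python. It is not the Iris.ai implementation: the names `Record`, `link_to_entities` and `populate_layout`, the column names and the sample values are all hypothetical, and steps 1 and 2 are assumed to have already produced simple (entity, property, value) triples.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """All values linked to one entity (hypothetical structure)."""
    entity: str
    properties: dict = field(default_factory=dict)


def link_to_entities(text_values, table_rows):
    """Step 3: group values from text and tables under the entity they describe."""
    records = {}
    for entity, prop, value in text_values + table_rows:
        records.setdefault(entity, Record(entity)).properties[prop] = value
    return records


def populate_layout(records, columns):
    """Step 4: write each record into a column layout agreed with the client."""
    return [
        {"entity": r.entity, **{c: r.properties.get(c) for c in columns}}
        for r in records.values()
    ]


# Made-up outputs of steps 1 and 2, as (entity, property, value) triples.
text_values = [("Alloy A", "tensile_strength_mpa", 780)]
table_rows = [
    ("Alloy A", "carbon_wt_pct", 0.12),
    ("Alloy B", "carbon_wt_pct", 0.20),
]

records = link_to_entities(text_values, table_rows)
rows = populate_layout(records, ["carbon_wt_pct", "tensile_strength_mpa"])
```

Each entry in `rows` follows the same column layout, and a value the documents never stated is left as `None` rather than guessed.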

Additional modules include the extraction of data from charts or the machine processing of image-based chemical notation. Several more features are also in the pipeline (e.g. abstractive summarization).

How do we verify the data accuracy?

This is actually a machine learning question, as the machine verifies its own accuracy through two metrics: precision and recall. Precision is the fraction of retrieved instances that are relevant, and recall is the fraction of all relevant instances that were retrieved.
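
As a small illustration of the two definitions, the sketch below computes precision and recall for one made-up document. The function name `precision_recall` and the data point labels are assumptions for the example. It also assumes that a set of known-correct data points is available to compare against.

```python
def precision_recall(extracted, relevant):
    """Return (precision, recall) for a set of extracted data points."""
    extracted, relevant = set(extracted), set(relevant)
    true_positives = len(extracted & relevant)
    # Precision: share of the extracted points that are actually relevant.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: share of the relevant points that were extracted.
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up example: five relevant points, five extracted, four of them correct.
relevant = {"p1", "p2", "p3", "p4", "p5"}
extracted = {"p1", "p2", "p3", "p4", "x9"}

precision, recall = precision_recall(extracted, relevant)
```

Here one extracted point is wrong and one relevant point is missed, so both precision and recall come out at 4 out of 5.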

At Iris.ai, we’re achieving precision and recall in the vicinity of 90%, which means the machine picks up almost all the relevant data points, and almost everything it picks up is relevant. Although these metrics aren’t perfect, the machine has fewer errors than a human would have, and the numbers are improving weekly as the technology matures.

Moreover, the Iris.ai system assigns a confidence score to every data point it extracts, enabling humans to evaluate the results.
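
One way a reviewer might use such scores is to check only the low-confidence data points by hand. The sketch below is an assumption about how that could look, not a description of the Iris.ai tool: the field names, the 0 to 1 score range and the 0.8 threshold are all hypothetical.

```python
def split_by_confidence(data_points, threshold=0.8):
    """Separate data points into accepted ones and ones needing human review."""
    accepted = [p for p in data_points if p["confidence"] >= threshold]
    to_review = [p for p in data_points if p["confidence"] < threshold]
    return accepted, to_review


# Made-up extracted data points with hypothetical confidence scores.
data_points = [
    {"entity": "Alloy A", "property": "carbon_wt_pct", "value": 0.12, "confidence": 0.97},
    {"entity": "Alloy A", "property": "yield_strength_mpa", "value": 510, "confidence": 0.55},
    {"entity": "Alloy B", "property": "carbon_wt_pct", "value": 0.20, "confidence": 0.91},
]

accepted, to_review = split_by_confidence(data_points)
```

With these made-up scores, two data points are accepted and one goes to a human for review.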

Do you want to see how to extract data?

Check out our Extract tool demo!

Do you have any questions about data extraction?