Introducing the AI Chemist

Anita Schjøll Abildgaard June 16, 2020

For the past 18 months, we have been hard at work creating a brand new product offering. We are finally ready to start sharing the details of our upcoming tools!

First, a bit of background. We have spent the last five years building an award-winning AI engine for scientific text understanding. Our engine deals with the similarity, causality and compositionality of highly complex academic and scientific texts, and this understanding creates a variety of opportunities for anyone who needs to process vast amounts of scientific text.

We started applying this engine in what is likely to be the hardest place: with academics. We saw how our engine could help students and researchers in the early phase of a new project, where there is a need to perform what is called a “systematic research landscape mapping”, or a full literature review, if you like. We built a suite of tools around this, in which users first Explore broadly, based on a natural-language problem statement of 300-500 words, and then Focus down to a precise reading list. We have been able to show that the Explore tool yields far better results than old-school keyword-based search engines, and that the Focus tool saves researchers up to 78% in the narrowing-down phase, at the same academic accuracy, compared to a manual review.

This is great for academics who absolutely need systematic thoroughness in their reviews. Not so for most industrial researchers: we were asked time and time again whether we “could just give them a quicker answer”.

To give quicker, more precise answers, though, the engine needs to be specialized. About one year ago, we decided it was ready for that, and we chose chemistry as our first exciting target industry.

As our engine is already generalized, with a training data set of more than 18 million academic articles and a vocabulary of more than 200,000 words, specializing the core engine is easy enough: for each domain we need a collection of about 2,000 articles from within the domain or, for some use cases, a human-made seed ontology to get the process started. Both of these are simple enough to obtain, especially the former, as we’ve built the Explore tool exactly for this purpose.

Once the engine is specialized, it also needs to be built into specific tools that solve clear problems. Here, we have worked closely with the chemical industry to identify three separate use cases that we are turning into three tools. At their core, all three tools understand scientific text, but they operate at three different levels of granularity.

Discover: mapping out the landscape

At the highest overview level, the Discover tool allows researchers diving into a new area to rapidly map out relevant papers, patents and internal documentation related to their problem. The input is a piece of text describing the research focus, and the output is a visual overview of the main topics and the most closely matching documents. The Discover tool can be connected to any text-based scientific content, and as the engine is trained on the specific field, the results are interdisciplinary but highly specific and relevant.
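
To make the input and output concrete, here is a minimal sketch of ranking documents against a problem statement. It is not our engine: it assumes a plain bag-of-words cosine similarity as a stand-in for real text understanding, and the function names and example documents are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(problem_statement, documents):
    """Return (title, score) pairs sorted from best to worst match."""
    query = Counter(tokenize(problem_statement))
    scored = [
        (title, cosine_similarity(query, Counter(tokenize(body))))
        for title, body in documents.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up example data, for illustration only.
documents = {
    "Paper A": "Catalytic synthesis of polymer membranes for water filtration",
    "Patent B": "A method for brewing coffee at low temperature",
    "Internal note C": "Polymer membranes degrade under chlorine exposure",
}
ranking = rank_documents("polymer membranes for filtration", documents)
for title, score in ranking:
    print(f"{title}: {score:.2f}")
```

Word counting only catches literal overlap, which is exactly the limitation of keyword search; a specialized engine is meant to match on meaning rather than on shared words.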

The Discover tool deals with unknown unknowns: where you are still in the early stage of a research project and really need a better overview of what you need to know.

Identify: finding the right bits of knowledge

Identify is a much more advanced tool, in the shape of a conversational AI. This means you provide the tool with some starter information about what you are looking for, and the tool asks clarifying questions to narrow down to the precise bits of information. This could be, for example, finding new application areas for an existing compound, or alternative synthesis procedures. You provide the tool with the chemical name and formula, and the tool builds out a knowledge graph, asking you to fill in more details where they are missing, such as a description of properties or known areas of application. All along you can see what the machine brain is finding and thinking, and the results are presented as highlighted pieces of text from papers, patents and other documents.
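
The "ask where details are missing" loop can be sketched roughly as follows. This is a simplified illustration, not the actual tool: it assumes the knowledge about a compound is a flat dictionary, and the field names and question wording are hypothetical.

```python
# Hypothetical list of details the tool would want for each compound.
REQUIRED_FIELDS = ["name", "formula", "properties", "applications"]


def missing_fields(compound):
    """Return the required fields that are absent or empty."""
    return [field for field in REQUIRED_FIELDS if not compound.get(field)]


def clarifying_questions(compound):
    """Build one clarifying question per missing field."""
    label = compound.get("name") or "the compound"
    return [
        f"Could you describe the {field} of {label}?"
        for field in missing_fields(compound)
    ]


# The user starts with just a chemical name and formula.
compound = {"name": "acetone", "formula": "C3H6O"}
questions = clarifying_questions(compound)
for question in questions:
    print(question)
```

Each answer would be written back into the record, so the list of open questions shrinks as the conversation goes on.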

The Identify tool deals with known unknowns: when you know the answer is “out there somewhere” but it is like looking for a needle in a haystack.

Extract: fetching key data from documents

Extracting key data from a set of documents is tedious, boring and error-prone, but often necessary, for example before going into the lab to recreate experiments done by others. Connecting written descriptions, tables and figures and placing it all in a systematic tabular format is where the Extract tool shines. Two full months of human labor can be done in a matter of minutes, at 90% accuracy. The input is a document (such as a PDF patent) and the output is the data related to the experiment processes, products and results, with all their corresponding numbers and units. The data is in a tabular format and can thus be used with Excel, a database or any other process tool you are using.
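
As an illustration of what "tabular format" means in practice, the sketch below writes a few extracted values to CSV, which Excel and most databases can import. The column names and records are invented for the example and are not the Extract tool's actual output schema.

```python
import csv
import io

# Hypothetical column layout for extracted experiment data.
FIELDS = ["document", "step", "quantity", "value", "unit"]

# Made-up records, standing in for values pulled from a patent.
records = [
    {"document": "patent_example.pdf", "step": "heating",
     "quantity": "temperature", "value": 80, "unit": "C"},
    {"document": "patent_example.pdf", "step": "product",
     "quantity": "yield", "value": 72.5, "unit": "%"},
]


def to_csv(rows):
    """Serialize a list of record dictionaries to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv(records)
print(csv_text)
```

Keeping the number and its unit in separate columns makes the values easy to filter, convert and compare once they are in a spreadsheet or database.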

The Extract tool deals with known knowns: they just need to be extracted in the right format, with the right connections, to be actionable.

We are very excited to see these tools coming to life together with our chemical industry co-development partners and clients, and we would be happy to discuss with potential new partners how we could help!