At the beginning of 2021 we started an exciting research journey when we received research funding from the Norwegian Research Council under the “BIA” grant scheme. Now this project is coming to an end. We’re very grateful for this opportunity. Let’s reflect on what we accomplished over the last 1.5 years.
The main theme of the BIA project is understanding the knowledge graph, which we divided into three collaborative subprojects:
Domain-specific word embeddings
Embeddings evaluation framework
Knowledge graph building
Domain-specific word embeddings
Domain adaptation of embedding models is a proven technique for domains that have insufficient data to train an effective model from scratch. Chemistry is one such domain, where scientific jargon and overloaded terminology inhibit the performance of a general language model. In the past year, we have experimented with spherical embeddings, latent semantic imputation (LSI), and cross-domain knowledge discovery with NLP (a collaboration with CEA Saclay).
We published several research papers on each subproject.
Embeddings evaluation framework

This ongoing project aims to develop a suite of transferable intrinsic and extrinsic tasks for domain-specific word-embedding evaluation, applicable to chemistry-specific evaluations. We combine the ideas of an extrinsic test suite, VecEval, and an intrinsic test suite, the LDT toolkit, to design an automated pipeline for evaluating embeddings on various intrinsic and extrinsic tasks. Our current progress includes implementations of semantic partitioning as one of the intrinsic tasks, and named-entity recognition (NER) and document classification as part of the extrinsic tasks.
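As a minimal illustration of what an intrinsic evaluation task can look like, the sketch below scores word pairs by cosine similarity and checks that related terms score higher than unrelated ones. The three-dimensional vectors and the vocabulary are invented for the example; a real run would load trained chemistry embeddings.

```python
import math

# Hypothetical toy vectors -- a real evaluation would load trained
# domain-specific embeddings (e.g. from a chemistry corpus).
embeddings = {
    "ethanol":  [0.9, 0.1, 0.3],
    "methanol": [0.8, 0.2, 0.3],
    "reactor":  [0.1, 0.9, 0.5],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def similarity_test(pairs):
    """Intrinsic check: score each term pair by cosine similarity."""
    return {(a, b): cosine(embeddings[a], embeddings[b]) for a, b in pairs}

scores = similarity_test([("ethanol", "methanol"), ("ethanol", "reactor")])
```

In a full pipeline, many such scored pairs would be compared against human similarity judgments, alongside extrinsic tasks like NER and document classification.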
Knowledge graph building
To identify novel applications for existing compounds from millions of research papers, it is crucial to build a knowledge graph that helps navigate these publications. To do so, we collaborate (1) with the CORE research group at the Open University (OU) to determine the types of citations used in the literature; and (2) with the KnowLab at University College London (UCL) to understand how to enrich a human-annotated ontology with word embeddings.
Moreover, another paper will be published in December 2022: “Be aware of semantic ‘misuse’: benchmarking distributed representations for triple classification in Chemistry”. You can find our scientific publications at iris.ai/publications/.
“I enjoyed seeing how other research groups operate and bringing in their expertise. It can be quite stimulating and we can get inspiration from different projects. And these collaborations also created opportunities – some of them did turn into research papers and some were extended into new projects. So in general I think it’s great that we maintain those research collaborations with the research groups outside of Iris because collaborations are the key to cross-domain research and stimulations of new ideas,” says Ronin Wu, AI Research Lead and Head of Research Collaborations at Iris.ai.
We’re extremely grateful for this opportunity and excited for upcoming ones.
Oslo, Norway – March 16, 2022 – Iris.ai, developer of AI tools for processing scientific research, launches a new platform: the Researcher Workspace. The purpose of this comprehensive suite of tools is to help researchers in industry and academia, librarians and students follow their own research process. Modules include a visual content-based search, analysis of document sets, extraction and systematization of data points, automatic summarization of multiple documents – and very powerful filters based on context descriptions, the machine’s analysis, or specific data points or entities.
Anita Schjøll Brede, the CEO of Iris.ai, shares her excitement about the launch: “We’re very excited to launch our new platform! It’s the culmination of several years of both working closely with clients – and serious Research and Development efforts. With this new platform, we have focused on flexibility and adaptability – so that each researcher is able to streamline their exact literature review process, and unlock the knowledge they need as swiftly and painlessly as possible! We’re also exceptionally proud of the tools we now offer as part of the Workspace: the Extraction tool and the Abstractive summarization are very unique – not to mention the fact that our research team has found a way to train the entire system on each client’s research field, with no humans involved. I do believe we’re about to see the next chapter of what’s possible in using AI/ML for scientific research and I am so proud of what our team has accomplished.”
The Researcher Workspace is mainly directed towards R&D-heavy industries like chemistry, pharmaceuticals, MedTech, material science, biotech, food safety and engineering. The tool can be reinforced on the customer’s field: the machine can be trained on industry-specific terminology to provide more precise and accurate results.
About Iris.ai
Iris.ai is one of the world’s leading start-ups in the research and development of artificial intelligence (AI) technologies. Founded in 2015, the start-up offers an award-winning AI engine for scientific text understanding. The company uses Natural Language Processing/Machine Learning to review massive collections of research papers or patents: find the right documents, extract all their key data or identify the most precise pieces of knowledge. Applied to literature reviews, data extraction, document summarization, competitive intelligence or any other task involving thousands of documents like papers or patents, R&D professionals and students no longer waste time on tasks the Iris.ai tools can do for them. Iris.ai collaborates both with innovation-oriented universities and corporate customers, and contributes to many joint research projects fostering Open Science (CORE) and innovation.
CORE and Iris.ai are extremely pleased to announce the initiation of a new research collaboration funded by the Norwegian Research Council.
Discovering scientific insights about a specific topic is challenging, particularly in an area like chemistry, one of the top-five most published fields with over 11 million publications and 307,000 patents. The team at Iris.ai has spent the last five years building an award-winning AI engine for scientific text understanding. Their patented algorithms for identifying text similarity, extracting tabular data and creating domain-specific entity representations make them world leaders in this domain.
The AI Chemist project is a collaboration between Iris.ai and The Open University, Oxford University, Trinity College Dublin, and University College London. CORE is a not-for-profit platform delivered by The Open University in cooperation with Jisc that hosts the world’s largest collection of open access scientific articles. As of February 2022, the CORE dataset provides metadata (title, author, abstract, publishing year, etc.) for approximately 210 million articles, and the full text for 29.5 million articles.
Working in partnership with CORE developers and researchers, Iris.ai will now leverage the vast quantities of research papers available in the CORE dataset. This dataset will be employed to improve the quality of text extraction from scientific literature in chemistry-focused domains. The output of this phase will support Iris.ai and The AI Chemist in understanding reasoning and inference across research papers.
Currently, the state of the art in the chemical domain is a combination of direct manual evaluation of text documents, social networks and curated, but incomplete databases. The manual nature of these approaches makes discovery of novel application areas immensely time consuming. The goal is to develop a set of algorithms that can machine read vast amounts of scientific literature and data, discover and detect mentions of entities of interest and their relations (such as chemical products, compounds, properties, processes, applications, etc.) and connect these pieces of information to build an increasingly complex knowledge graph.
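As a rough sketch of this idea, the snippet below indexes subject–relation–object triples, the kind a machine-reading pipeline might emit, into a small graph and queries it for compounds that share an application area. All entity and relation names are illustrative, not output from any actual Iris.ai system.

```python
from collections import defaultdict

# Hypothetical triples as a machine-reading step might emit them.
triples = [
    ("graphene", "has_property", "high conductivity"),
    ("graphene", "used_in", "battery electrodes"),
    ("silicon", "used_in", "battery electrodes"),
]

# Adjacency-style index: subject -> relation -> set of objects.
graph = defaultdict(lambda: defaultdict(set))
for subj, rel, obj in triples:
    graph[subj][rel].add(obj)

def shared_applications(a, b, relation="used_in"):
    """Find application areas two compounds have in common."""
    return graph[a][relation] & graph[b][relation]
```

Each new paper read by the machine adds triples to this index, so the graph grows increasingly dense and supports queries such as “which compounds share an application with mine?”.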
Dr Ronin Wu, Research Lead and Head of Research Collaboration at Iris.ai, said: “Iris.ai are extremely pleased to be partnering with CORE on the AI Chemist project and we’re looking forward to seeing some exciting new developments with our AI models”.
Dr. Petr Knoth, Head of CORE and Senior Research Fellow in Text and Data Mining, said: “This cooperative research project will put CORE at the forefront of the global effort to create open scholarly knowledge graphs. As part of this project we will use state-of-the-art machine learning approaches to address problems including topic/theme extraction, affiliation extraction, deduplication and citation function detection. With the demise of Microsoft Academic Graph at the end of 2021, we see on a daily basis how much this is in demand among CORE users.”
Artificial intelligence is rapidly changing the way we work in companies and industries around the world, including the chemical industry. Organizations are adopting these technologies to accelerate processes, reduce costs, and save employees from tedious, mundane tasks.
Accenture suggests that there are three ways of applying artificial intelligence in research across industries:
Reinventing the process: managing process change, rethinking standardized processes as continuously adaptive, and using AI across multiple processes.
Rethinking human-machine collaboration: building an AI-enabled culture that reskills employees to work in alliance with machines.
Utilizing data: using AI and data to solve previously unsolved problems and reveal hidden patterns.
In this article, we will explain how chemical researchers are applying artificial intelligence.
How chemical researchers are applying AI
There are three categories of chemical research that are affected by AI. The first category is molecule prediction: drawing on known properties to predict new behavior. The second category is synthesis models, which predict how to create certain molecules in fewer steps and with more reliable processes. The third is handling prior knowledge to make sense of what we already know, starting with data mining to find the right information.
1. Case studies on molecule predictions
The pharmaceutical industry is one of the front runners in AI. In February 2020, the model in “A Deep Learning Approach to Antibiotic Discovery” was published: a model that translates molecules into vectors. It starts with every atom being represented by a vector of simple properties. These are combined into a fingerprint of the molecule’s structure, which the neural network learns from.
The model was trained on tests with E. coli to see which molecular structures actually had antibiotic activity. Then it was applied to the Broad Institute’s drug repurposing hub, an open-access library of more than 6,000 molecules with known biological activity. As a result, the researchers discovered a compound called Halicin with impressive antibiotic activity, despite having a chemical structure unlike conventional antibiotics.
Following this success, the team applied their AI technique to a database known as ZINC15, screening over 107 million molecules. Based on the deep learning tool’s predictions, 23 compounds were chosen for further investigation. Two of these compounds showed promise against a range of drug-resistant E. coli.
In March 2020, Münster University published “A Structure-Based Platform for Predicting Chemical Reactivity”. The new tool is based on the assumption that reactivity can be derived directly from a molecule’s structure. It uses an input based on multiple fingerprint features as an overall molecular representation. Organic compounds can be represented as graphs on which simple structural (yes/no) queries can be carried out. Fingerprints are numeric sequences based on a combination of multiple such queries. They were developed to search for structural similarities and have proved well suited for use in computational models. For the most accurate representation of the molecular structure of each compound, a large number of different fingerprints are used.
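To make the fingerprint idea concrete, here is a deliberately crude sketch: each bit of the fingerprint records the answer to one yes/no substructure query. Real fingerprints such as Morgan/ECFP operate on the molecular graph; matching raw SMILES substrings, as done here, is only a toy approximation, and the query list is invented for the example.

```python
# Toy structural fingerprint: one bit per yes/no substructure query.
# Matching SMILES substrings is a simplification of real graph-based
# fingerprints, used here only to show the bit-vector idea.
QUERIES = [
    "C(=O)O",    # carboxylic acid fragment
    "c1ccccc1",  # aromatic six-membered ring
    "N",         # contains nitrogen
    "Cl",        # contains chlorine
]

def fingerprint(smiles: str) -> list[int]:
    """Answer each structural query with 1 (present) or 0 (absent)."""
    return [1 if q in smiles else 0 for q in QUERIES]

aspirin = "CC(=O)Oc1ccccc1C(=O)O"
benzene = "c1ccccc1"
```

Comparing the bit vectors of two molecules (e.g. counting shared set bits) then gives a cheap structural-similarity measure, which is the role fingerprints play in the computational models described above.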
2. Finding the best synthesis method: expert system vs. machine learning
In 2018, the Defense Advanced Research Projects Agency (DARPA), the development agency of the United States Department of Defense, presented a project in which artificial intelligence was used to develop and find the best synthesis methods. The user can input any structure, known or novel, and a machine generates thousands or even millions of reaction sequences that end in the final product. Reactions are ranked and identified based on feasibility, cost, and other factors. DARPA has two ways of doing this. It can apply the expert system, a system based on 60,000 handwritten rules, which is effective but not scalable. Alternatively, it can encode each of the molecules to predict bond changes using machine learning (much as in molecule prediction). The final step is manual filtering of the results to generate a shortlist of top candidates.
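The ranking step described above can be sketched as a weighted score over feasibility, cost and step count. The routes, numbers and weights below are invented for illustration and are not the actual scoring function of either system.

```python
# Illustrative candidate synthesis routes; all values are made up.
routes = [
    {"name": "route_a", "steps": 4, "feasibility": 0.9, "cost": 120.0},
    {"name": "route_b", "steps": 2, "feasibility": 0.6, "cost": 90.0},
    {"name": "route_c", "steps": 3, "feasibility": 0.8, "cost": 60.0},
]

def score(route, w_feas=1.0, w_cost=0.005, w_steps=0.05):
    """Higher feasibility is better; cost and step count are penalties."""
    return (w_feas * route["feasibility"]
            - w_cost * route["cost"]
            - w_steps * route["steps"])

# Best-scoring route first; a human would then vet the shortlist.
ranked = sorted(routes, key=score, reverse=True)
```

This mirrors the workflow in the text: the machine generates and ranks a very large candidate set, and manual review is applied only to the top of the list.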
There are three fundamental problems in using the machine learning approach as opposed to the expert system. The first is data acquisition: information is missing, and reporting is biased because failed experiments are rarely published. And for reaction sequences that can be extracted from patents, not all the information is reported in the same place.
The second problem is data representation: how the data is presented and explained to a machine in a way it can use. The data format needs to be considered and determined, whether formulas, images, features, properties, etc.
The third problem is the exploration space, which is vastly larger than the information we have available. That raises the question of how to teach a chemistry engine to invent new potential molecules and pathways when we have no data on them at all.
There is a model called “Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction” which can predict the outcome of a chemical reaction with much higher accuracy than trained chemists, and it will suggest ways to make complex molecules. However, it needs a lot of data in a very specific text-based format called SMILES (simplified molecular-input line-entry system), data-mined from patents. In the end, the preparation needed for a specific use case might not be worth it from a cost perspective.
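To give a feel for the SMILES format, here is a minimal tokenizer sketch, the kind of preprocessing step a text-based reaction model needs before it can consume a molecule. A full parser would also handle stereochemistry markers, isotopes, and multi-digit ring closures, which this regex deliberately ignores.

```python
import re

# Minimal SMILES tokenizer: two-letter elements first so "Cl" is not
# split, then bracket atoms, single-letter atoms, ring-closure digits,
# and bond/branch symbols. A sketch, not a complete SMILES grammar.
TOKEN = re.compile(r"Cl|Br|\[[^\]]+\]|[BCNOPSFI]|[bcnops]|[0-9]|[()=#+-]")

def tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into atom and bond tokens."""
    return TOKEN.findall(smiles)

# Aspirin: acetyl group, aromatic ring, carboxylic acid.
tokens = tokenize("CC(=O)Oc1ccccc1C(=O)O")
```

Sequence models like the Molecular Transformer operate on token streams of this kind, treating reaction prediction much like machine translation between reactant and product strings.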
3. Organizing knowledge
Artificial intelligence is already used in prior-art research. There are a few existing and future inventions in that area which will change the current process radically. The first and most basic invention already in use is smarter search. Automated literature reviews are the second step, which we have been working on for the past five years at Iris.ai. We have reached semi-automation, meaning the search still needs human-machine collaboration.
The next frontier that we are working on is identifying specific insights from text. The first step is advanced data extraction and linking, which we have developed in our Extract tool. The PDF to be extracted is sent to the Iris.ai system. This PDF can be a patent, a clinical trial report, a research paper or any other relevant type of scientific content, processed one document at a time or in batches of hundreds or thousands. The Iris.ai engine extracts the text and identifies all the domain-specific entities, then locates the tables, extracts the data from rows and columns, and links the data between text and table. Graphs, figures and other elements go through the same process. Then the engine populates a pre-defined output in a machine-readable format: an Excel sheet, an integrated lab tool, a database, or anywhere else your researchers require.
What’s important in this step is the self-assessment module which communicates to the human researchers how confident the machine is in its results, to give the human guidance on where to do the most rigorous manual verifications.
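A hypothetical shape for such a result, with a per-field confidence score that routes uncertain extractions to a human, might look like the following. The class and field names are invented for the sketch and are not the actual Iris.ai API.

```python
from dataclasses import dataclass

# Hypothetical extraction record; field names are illustrative only.
@dataclass
class ExtractedValue:
    entity: str        # e.g. "melting point"
    value: str         # extracted value as text
    source: str        # "text" or "table"
    confidence: float  # machine's self-assessed certainty, 0..1

def needs_review(results, threshold=0.8):
    """Route low-confidence extractions to a human for verification."""
    return [r for r in results if r.confidence < threshold]

results = [
    ExtractedValue("melting point", "176 C", "table", 0.95),
    ExtractedValue("yield", "82%", "text", 0.55),
]
```

The point of the self-assessment is exactly this triage: high-confidence values flow straight into the output, while the rest are flagged so human effort concentrates where the machine is least certain.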
In the long run, we expect to see developments in hypothesis extraction from prior art, knowledge validation based on prior art, and lastly, drawing new conclusions and finding new hypotheses from all of the existing prior art.
Automating manual tasks vs. rethinking the imaginable
There are two very different mindsets when it comes to applying AI in your organization. You can replace a human process and have a machine do the same activity but faster, for example, in extracting the data. Willingness to invest time and resources is needed, but there is clear ROI and known outcome and benefits. The second mindset is about activities that cannot be done by a human. For example, a machine can identify new potential application areas, meaning you need willingness to invest as well as rethink and re-imagine what’s possible (ROI will be unknown until you try).
Interpretability and explainability
One of the emerging fields in AI worth mentioning is interpretability, or explainability. The AI does not just tell you whether something will work; it explains why. For example, in molecule prediction, AI can predict that certain actions will cause an activity or property because of a specific area in the molecule or combination. That gives the chemist an immediate indication of how the molecule could be altered if the reaction is unwanted. This is similar to the data extraction tool Iris.ai is working on, where every row and column comes with a machine-created self-assessment and a percentage of certainty.
At Iris.ai we have spent the last five years researching and developing an AI engine for scientific text understanding. Already successfully deployed in a generalized suite of tools for academic literature reviews, we believed it was time to see how this engine could be reinforced on one specific domain, and how it could be used to find precise, spot-on answers for industry researchers. Chemistry was an interesting place to start, for the reasons outlined below, as well as because it is an industry ripe for digital innovation and essential for the sustainable future of our planet.
The interesting thing about chemistry
In 1776, chemist and mechanical engineer James Watt invented the Watt steam engine, which was fundamental to the changes brought by the Industrial Revolution. Ever since – and potentially even before – an understanding of chemistry has been the foundation for our technological development, and there is no reason to believe that this holds any less true for the future. Whether we need more sustainable materials or biodegradable fuel to reduce our carbon emissions, new materials allowing us to travel to space or terraform Mars, novel ways of ensuring that every person on this planet is properly fed or understanding how we can handle an ocean filled with plastic particles, chemistry is going to be absolutely foundational.
What has enabled such a thorough understanding of chemistry pertains to the field’s formalism – the same as for maths and physics. This means structured approaches to unifying language so that any chemist anywhere can talk about anything from the basic elements, via molecular formulas to complex synthesis procedures in the same way. This structured way of communicating with each other has allowed rapid progress in this scientific field.
However, formalism has its downsides: when you simplify a process or a thought process into a unified language, inevitably there will be a loss of information on the way. Much like a compressed image is easier to share and still shows the same motif, but is pixelated, so can formalist research results be easier to convey, transmitting a general idea of the approach while missing the finer details. Ideas are compressed to formulas, long research papers to abstracts, novel ideas to a 140-character tweet, detailed lab notes to summaries.
In chemical research, this ‘compression’ has been required because of human limitation – but today, it isn’t required anymore. Computers have already allowed a much broader and larger volume of shared knowledge – which in itself makes absolute formalism tricky. And thanks to advances in AI, we are rapidly approaching a new frontier in chemical research (and beyond).
With new AI advances, machines can help researchers find what other researchers have done, ‘translate’ it into the researcher’s current context, and give much higher clarity on how and why the solutions or conclusions were reached, without the information loss built into the current process. The machine will have all the necessary information, as there is no information loss, but will communicate or ‘translate’ only the exact relevant pieces between researchers. This will truly be a new paradigm of chemical research, and we intend to be part of it.
Iris.ai’s first steps into chemistry
We have taken our core engine and reinforced it on chemistry. The interesting thing about that approach is that because the starting point of the general engine is strong, we only need a small collection of research papers in the specialized field, or for some use cases a seed ontology already created by a human, to specialize the tool, which makes it very flexible and re-deployable across many different research fields with similar user needs.
We are already now building out this engine into the first set of tools that will help Chemistry researchers on three different levels:
Discover. When dealing with unknown unknowns, the Discover tool allows interdisciplinary discovery, beyond today’s limiting keyword queries. It fingerprints the description of the researchers’ problem, and maps out all relevant papers and patents they should be reading to get a full overview of the field. The Discover tool is especially helpful in the early phase of a new and interdisciplinary project, where it has proven to give researchers a better overview, find more spot-on papers and draw better conclusions.
Identify. When the researcher knows the answer is ‘out there somewhere’ but it’s like looking for a needle in a haystack, known unknowns can be found through this conversational AI, which guides the researcher through the information found in millions of documents, asking the right questions to narrow down to exactly the bits of knowledge you need. This could be finding new application areas for an existing compound, identifying better synthesis procedures, or simply identifying the right material for your use case.
Extract. In spite of chemistry being such a formalist field, every researcher writes in their own way. That means that when you need to extract key data from a document, for example experiment data before going to the lab to recreate it, it takes a lot of valuable researcher time. Our automatic extraction achieves 90% accuracy and performs two months’ worth of manual labor in a matter of minutes.
At Iris.ai, we are very excited to bring our AI skills together with some very talented chemical researchers, and to see what just might be possible when you bypass the limitations of human formalist language and let an AI understand the context of your words to help advance your research.
The past and the future of chemistry as we know it.
The chemical industry is trying to solve a 21st-century problem with a 20th-century approach
The state of scientific knowledge today is as if we had millions of cities (knowledge nodes) but only small footpaths through the woods to connect them – and no reliable map. However, with recent advances in AI technology, we are now able to build some serious digital highways, connecting all these ideas and people.
Discovering scientific insights about a specific topic is challenging, particularly considering that chemistry is one of the top-five most published fields with over 11 million publications and 307,000 patents. Moreover, the pace at which worldwide scientific knowledge expands is staggering. In 2016 alone, almost 2.2 million scientific articles were published and this output is doubling every nine years. In the process of trying to navigate, extract information from, and understand all this material, simplifications are being made and too much information is lost or missed. This is hampering global progress, and frustrating both the individual researchers trying to wrangle all this information and their R&D managers responsible for the department delivering quality commercially feasible results.
We have reached a point where a researcher will know that the answer they are looking for is likely to be ‘out there somewhere’, but there is no way for them to find it. The reasons for this include:
a) there is not one table in one location where all the information one might need is stored;
b) researchers are no longer able to adhere strictly to the previously helpful formalist rules, as interdisciplinarity and creativity are (or should be) the new norm;
c) no-one documents and disseminates information in the exact same way; and, at the end of the day,
d) there is simply too much knowledge for a human researcher to assimilate.
This causes major challenges in finding the right knowledge, whether the answer is a quick and simple one (“what other applications are there for my compound?”) or way more complex (“if we extract the knowledge from these three papers, these ten patents, and this product sheet… doesn’t that mean we have an entirely novel compound?”).
Chemical companies need to:
a) find ways to utilise their core competencies and existing knowledge to generate new revenue,
b) reduce the risk of lab experiments failing by having as much upfront information as possible, and
c) make their R&D process as cost-efficient as possible without compromising quality.
The best way to tackle the first two challenges is leveraging existing scientific literature, but unfortunately, that is not possible today while also achieving the third.
Chemical companies’ R&D departments today are absolutely vital to the companies’ survival and ongoing success, but they are at the same time seen as a non-revenue generating “burden”: Necessary, but expensive. Very expensive.
R&D managers are under pressure to deliver more results, faster, but because of the overwhelming amount of information, it is becoming increasingly difficult. At the same time, their most valuable R&D assets are their research staff, to whom searching through thousands of documents to try to find answers is just an annoying and tedious burden, far removed from the real fun work that happens in the lab. And the less the researcher reads in advance, the higher the chances are their lab experiments may fail, wasting their time and the company’s valuable budget.
Out of an average of 1,800 hours worked every year, studies show that about 40% of a chemistry R&D researcher’s time is currently spent between finding (19%), reading (11%) and organizing (10%) existing literature. This represents a massive inefficiency.
To remain competitive and grow their market share, chemical companies need to constantly ask themselves the questions listed below, and efficiently find their answers from existing literature (something not fully possible today, based on the current state of affairs, including the productized available technology):
What are new uses for an existing compound?
How can we change the properties of an existing material?
What other synthesis pathways will improve our existing manufacturing process?
What compounds with specific properties can be used as substitutes in an existing application area?
What new chemical substances can we create by combining known compounds, and thus what new markets can we open?
Across all of the above questions, which approaches are the most sustainable, given continued pressure to reduce our environmental footprint?
The only way for industrial chemists to potentially find answers is to rely on limiting keyword-based search engines, summarized findings that ‘everyone’ has access to, and following key researchers on social media to see what they are up to. However the papers are found, researchers still need to manually screen and review the existing chemical literature one paper at a time. But as we have shown above, this is a very challenging task with very slim chances of finding what is needed. And even if a team were diligent and large enough to stay on top of existing literature, it cannot also find the time to crunch the findings, test the knowledge, validate the hypotheses in the lab and then publish the results. This means that, given the current state of knowledge management solutions, there is very little time for actual innovation. Chemists then have no alternative but to rely on their own experience, limited knowledge, rules of thumb, outdated tools and the occasional dumb luck. Moreover, ‘blind’ trial and error leads to repetitive, mundane and time-consuming work with unpredictable results, until hopefully a solution is found, though with low confidence that it was the best solution or a good use of time.
The chemical industry is today trying to solve a 21st-century problem (increased speed-to-market, lower product margins and cut-throat global competition amidst an overload of information) with a 20th-century approach (slow, outmoded and error-prone guesswork).
Chemical companies are coming under increased pressure to get smarter in the current wave of digitization, amidst new technological challenges, shrinking product life cycles and the rush to commoditize products. They simply need to increase the pace of innovation.
Innovate or die must be adopted as the key mantra by the chemical sector if those companies want to remain competitive. This involves embracing innovative ways to research and develop new commercial products.