At the beginning of 2021, we started an exciting research journey when we received research funding from the Norwegian Research Council in the “BIA” grant scheme. Now this project is coming to an end. We’re very grateful for this opportunity. Let’s reflect on what we accomplished in the last 1.5 years.
The main theme of the BIA project is understanding knowledge graphs. We divided the work into three subprojects, each carried out with research collaborators:
- Domain-specific word embeddings
- Embeddings evaluation framework
- Knowledge graph building
Domain-specific word embeddings
Domain adaptation of embedding models is a proven technique for domains that have insufficient data to train an effective model from scratch. Chemistry is one such domain, where scientific jargon and overloaded terminology inhibit the performance of a general language model. In the past year, we have experimented with spherical embeddings, latent semantic imputation (LSI), and cross-domain knowledge discovery with NLP (a collaboration with CEA Saclay).
We published a research paper on each of these projects:
- “Domain-adaptation of spherical embeddings” (November 2021)
- “Leveraging knowledge graphs to update scientific word embeddings using latent semantic imputation” (November 2022) (video)
- “Searching for carriers of the diffuse interstellar bands across disciplines, using Natural Language Processing” (November 2022)
Embeddings evaluation framework
This ongoing project aims to develop a suite of transferable intrinsic and extrinsic tasks for domain-specific word-embedding evaluations which can be applied to chemistry-specific evaluations. We combine the ideas of an extrinsic test suite, VecEval, and an intrinsic test suite, LDT toolkit, to design an automated pipeline for evaluating embeddings using various intrinsic and extrinsic evaluation tasks. Our current progress includes implementations of semantic partitioning as one of the intrinsic tasks, and named-entity-recognition (NER) and document classification as part of the extrinsic tasks.
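To make the idea of an automated pipeline concrete, here is a minimal sketch of how intrinsic and extrinsic tasks could be plugged into a single evaluation loop. It is an illustration under stated assumptions, not our framework and not the API of VecEval or LDT: all function names, the toy vectors, and the two tasks are hypothetical. The intrinsic task is a simple nearest-neighbour category check standing in for tasks such as semantic partitioning, and the extrinsic task is a nearest-centroid document classifier standing in for a trained model.

```python
import math
from typing import Callable, Dict, List, Optional, Tuple

Embeddings = Dict[str, List[float]]
Document = Tuple[List[str], str]  # (tokens, label)


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def category_neighbour_score(emb: Embeddings, categories: Dict[str, List[str]]) -> float:
    """Intrinsic stand-in: share of words whose nearest neighbour has the same category."""
    labelled = [(w, c) for c, words in categories.items() for w in words if w in emb]
    if len(labelled) < 2:
        return 0.0
    hits = 0
    for word, category in labelled:
        others = [item for item in labelled if item[0] != word]
        nearest = max(others, key=lambda item: cosine(emb[word], emb[item[0]]))
        hits += nearest[1] == category
    return hits / len(labelled)


def mean_vector(tokens: List[str], emb: Embeddings) -> Optional[List[float]]:
    """Average the vectors of all in-vocabulary tokens; None if there are none."""
    vecs = [emb[t] for t in tokens if t in emb]
    if not vecs:
        return None
    return [sum(column) / len(vecs) for column in zip(*vecs)]


def doc_classification_accuracy(emb: Embeddings, train: List[Document], test: List[Document]) -> float:
    """Extrinsic stand-in: nearest-centroid classification over averaged word vectors."""
    by_label: Dict[str, List[List[float]]] = {}
    for tokens, label in train:
        vec = mean_vector(tokens, emb)
        if vec is not None:
            by_label.setdefault(label, []).append(vec)
    centroids = {
        label: [sum(column) / len(vecs) for column in zip(*vecs)]
        for label, vecs in by_label.items()
    }
    if not centroids or not test:
        return 0.0
    correct = 0
    for tokens, label in test:
        vec = mean_vector(tokens, emb)
        if vec is None:
            continue  # documents with no known tokens count as errors
        predicted = max(centroids, key=lambda lab: cosine(vec, centroids[lab]))
        correct += predicted == label
    return correct / len(test)


def evaluate(emb: Embeddings, tasks: Dict[str, Callable[[Embeddings], float]]) -> Dict[str, float]:
    """Run every registered task on the same embeddings and collect the scores."""
    return {name: task(emb) for name, task in tasks.items()}


# Toy two-dimensional vectors, invented for illustration only.
toy_embeddings: Embeddings = {
    "acid": [1.0, 0.1],
    "base": [0.9, 0.2],
    "star": [0.1, 1.0],
    "galaxy": [0.2, 0.9],
}
toy_categories = {"chemistry": ["acid", "base"], "astronomy": ["star", "galaxy"]}
toy_train: List[Document] = [(["acid", "base"], "chem"), (["star", "galaxy"], "astro")]
toy_test: List[Document] = [(["acid"], "chem"), (["galaxy"], "astro")]

tasks = {
    "intrinsic/category_neighbours": lambda e: category_neighbour_score(e, toy_categories),
    "extrinsic/doc_classification": lambda e: doc_classification_accuracy(e, toy_train, toy_test),
}
scores = evaluate(toy_embeddings, tasks)
print(scores)
```

Registering each task as a function of the embeddings alone means that the same loop can score a general-purpose model and a domain-adapted one side by side, and new tasks can be added without touching the loop.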
Knowledge graph building
To identify novel applications for existing compounds from millions of research papers, it is crucial to build a knowledge graph that helps navigate these publications. To do so, we collaborate (1) with the CORE research group at the Open University (OU) to determine the types of citations used in the literature; and (2) with the KnowLab at University College London (UCL) to understand how to enrich a human-annotated ontology with word embeddings.
Published research paper:
- “ACT2: A multi-disciplinary semi-structured dataset for importance and purpose classification of citations” (June 2022)
Moreover, another paper will be published in December 2022: “Be aware of semantic ‘misuse’: benchmarking distributed representations for triple classification in Chemistry”. You can find our scientific publications at iris.ai/publications/.
“I enjoyed seeing how other research groups operate and bringing in their expertise. It can be quite stimulating and we can get inspiration from different projects. And these collaborations also created opportunities – some of them did turn into research papers and some were extended into new projects. So in general I think it’s great that we maintain those research collaborations with the research groups outside of Iris because collaborations are the key to cross-domain research and stimulations of new ideas,” says Ronin Wu, AI Research Lead and Head of Research Collaborations at Iris.ai.
We’re extremely grateful for this opportunity and excited for upcoming ones.