Skip to content


The research team works with a combination of in-house research and external libraries.

Here are our own publications.

  1. [Date] Nov 2022
    [Title] Leveraging knowledge graphs to update scientific word embeddings using latent semantic imputation
    [Summary] The most interesting words in scientific texts will often be novel or rare. This presents a challenge for scientific word embedding models to determine quality embedding vectors for useful terms that are infrequent or newly emerging. We demonstrate how LSI can address this problem by imputing embeddings for domain-specific words from up-to-date knowledge graphs while otherwise preserving the original word embedding model. We use the MeSH knowledge graph to impute embedding vectors for biomedical terminology without retraining and evaluate the resulting embedding model on a domain-specific word-pair similarity task. We show that LSI can produce reliable embedding vectors for rare and OOV terms in the biomedical domain.

  2. [Date] Nov 2022
    [Title] Searching for Carriers of the Diffuse Interstellar Bands Across Disciplines, using Natural Language Processing
    [Summary] The explosion of scientific publications overloads researchers with information. This is even more dramatic for interdisciplinary studies, where several fields need to be explored. A tool to help researchers overcome this is Natural Language Processing (NLP): a machine-learning (ML) technique that allows scientists to automatically synthesize information from many articles. As a practical example, we have used NLP to conduct an interdisciplinary search for compounds that could be carriers for Diffuse Interstellar Bands (DIBs), a long-standing open question in astrophysics. We have trained a NLP model on a corpus of 1.5 million cross-domain articles in open access, and fine-tuned this model with a corpus of astrophysical publications about DIBs. Our analysis points us toward several molecules, studied primarily in biology, having transitions at the wavelengths of several DIBs and composed of abundant interstellar atoms. Several of these molecules contain chromophores, small molecular groups responsible for the molecule’s colour, that could be promising candidate carriers. Identifying viable carriers demonstrates the value of using NLP to tackle open scientific questions, in an interdisciplinary manner.

  3. [Date] Oct 2022
    [Title] Benchmark for Research Theme Classification of Scholarly Documents
    [Summary] We present a new gold-standard dataset and a benchmark for the Research Theme Identification task, a sub-task of the Scholarly Knowledge Graph Generation shared task, at the 3rd Workshop on Scholarly Document Processing. The objective of the shared task was to label given research papers with research themes from a total of 36 themes. The benchmark was compiled using data drawn from the largest overall assessment of university research output ever undertaken globally (the Research Excellence Framework – 2014). We provide a performance comparison of a transformer-based ensemble, which obtains multiple predictions for a research paper, given its multiple textual fields (e.g. title, abstract, reference), with traditional machine learning models. The ensemble involves enriching the initial data with additional information from open-access digital libraries and Argumentative Zoning techniques (CITATION). It uses a weighted sum aggregation for the multiple predictions to obtain a final single prediction for the given research paper. Both data and the ensemble are publicly available on and, respectively.

  4. [Date] June 2022
    [Title] ACT2: A multi-disciplinary semi-structured dataset for importance and purpose classification of citations
    [Summary] Classifying citations according to their purpose and importance is a challenging task that has gained considerable interest in recent years. This interest has been primarily driven by the need to create more transparent, efficient, merit-based reward systems in academia; a system that goes beyond simple bibliometric measures and considers the semantics of citations. Such systems that quantify and classify the influence of citations can act as edges that link knowledge nodes to a graph and enable efficient knowledge discovery. While a number of researchers have experimented with a variety of models, these experiments are typically limited to single-domain applications and the resulting models are hardly comparable. Recently, two Citation Context Classification (3C) shared tasks (at WOSP2020 and SDP2021) created the first benchmark enabling direct comparison of citation classification approaches, revealing the crucial impact of supplementary data on the performance of models. Reflecting from the findings of these shared tasks, we are releasing a new multi-disciplinary dataset, ACT2, an extended SDP 3C shared task dataset. This modified corpus has annotations for both citation function and importance classes newly enriched with supplementary contextual and non-contextual feature sets the selection of which follows from the lists of features used by the more successful teams in these shared tasks. Additionally, we include contextual features for cited papers (e.g. Abstract of the cited paper), which most existing datasets lack, but which have a lot of potential to improve results. We describe the methodology used for feature extraction and the challenges involved in the process. The feature enriched ACT2 dataset is available at

  5. [Date] Nov 2021
    [Title] Domain-adaptation of spherical embeddings
    [Summary] The recent spherical embedding model (JoSE) proposed in arXiv:1911.01196 jointly learns word and document embeddings during training on the multi-dimensional unit sphere, which performs well for document classification and word correlation tasks. But, we show a non-convergence caused by global rotations during its training prevents it from domain adaptation.

  6. [Date] May 2018
    [Title] Scithon™ – An evaluation framework for assessing research productivity tools
    [Summary] We develop the novel framework, Scithon™, for performing evaluation tasks on the science discovery tools. The framework, organized around live events, involves a systematic and cross-disciplinary comparison that focuses on productivity gains and takes into account user engagement.

  7. [Date] 15 December 2017
    [Title] Word importance-based similarity of documents metric (WISDM): Fast and scalable document similarity metric for analysis of scientific documents
    [Summary] Word importance-based similarity of documents metric (WISDM) is a fast and scalable novel method for document similarity/distance computation for analysis of scientific documents. It is based on recent advancements in the area of word embeddings and it achieves significant performance speed-up in comparison to state-of-the-art metrics with a very marginal drop in precision.
Untitled design-3
OUR COOKIE POLICYIn order to provide our services and give better more secure experience uses cookies. By continuing to browse the site you are agreeing to our use of cookies.