P1. Building chemistry aware word-embeddings model

August 1, 2019

RQ: What is the best way of introducing specialized, domain specific knowledge into an existing general word embedding model, in order to produce high quality embeddings for domain specific concepts?

Description:
Recent impressive results [1] have shown that specific domain knowledge can be expressed very accurately by word embeddings trained on a highly specialized corpus. However, it is to be expected that the quality and expressivity of more generic terms in such a model are suboptimal due to the variation of meanings in the concepts of a more generic corpus.

This project aims to investigate what is the best way to inject domain knowledge from a highly specialized corpus into a pre-existing general word embedding model trained on a large diverse corpus. The goal is to enrich the existing model with (or improve existing) embeddings for domain specific concepts, while leaving general embeddings untouched.

Compared to training an entirely new model from scratch on a domain specific corpus, this approach is not only expected to yield better quality embeddings both for general and domain specific concepts, but is also more flexible and scalable for training specialized models for several different domains. It improves maintainability (i.e. maintaining one generic model and specializing on the fly) as well as scalability (requires less effort to introduce new domain specific applications on the fly).

One possible approach is to start from a pre-trained general model (that possibly already includes some domain specific terms). Domain specific training then entails feeding only examples for the specialized terms from the specialized domain corpus to train domain specific embeddings, while leaving the general embeddings untouched.

Tasks:

Literature research of existing state of the art methods for enriching word embedding models.
Develop method for injecting domain specific embeddings into existing model.
- Test if it is necessary and determine how to best filter for domain specific terms when
  obtaining synonyms for a given term.
Compare and benchmark against embeddings trained from scratch on the domain specific corpus alone. This
serves as a baseline test.
Evaluation will be done based on a given test set of domain specific terms.

[1] Tshitoyan et al., Nature 571, 95 (2019).