P1. Enriching word embeddings with domain-specific knowledge

RQ: What is the best way to introduce domain-specific knowledge into a general word embedding model in order to produce high-quality embeddings within that domain? How can we balance the information provided by the domain-specific corpus against the generic knowledge? After enriching the embeddings, how can we effectively measure the domain adaptation?

Description:
In recent years, word embedding techniques have achieved remarkable successes in natural language processing. Nonetheless, critical gaps remain in the literature. For example, it is difficult to generate reliable word embeddings from cross-domain corpora when indispensable domain terms are relatively infrequent. Additionally, the semantic relationships between words in a generic corpus can differ significantly from those in a domain-specific one.
Overcoming these challenges would unlock natural language applications, with minimal or no additional adaptation, in domains where jargon is prevalent, such as chemistry, biology, and healthcare.

The scientific community has worked on solutions to these challenges, and some of the proposals are documented in the literature. Among others, a recent paper proposes continuing training on domain-specific corpora, selectively incorporating the new information to stabilize and balance the domain adaptation [1]. Another paper proposes combining external domain knowledge, not necessarily from purely textual data, with graph-based methods to inject additional information into a generic word embedding model without transforming it [2].
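To make the two ideas concrete, a minimal numpy sketch follows. It is a toy illustration under our own simplifications (single-vector updates, precomputed neighbor weights), not the actual algorithms from [1] or [2]; the function names and parameters are illustrative.

```python
import numpy as np

def regularized_update(w, grad, w_generic, lr=0.1, lam=0.5):
    """One SGD step on a domain objective with an added L2 pull toward
    the generic vector, so words stay anchored to their generic meaning
    while adapting to the domain (the spirit of [1])."""
    return w - lr * (grad + lam * (w - w_generic))

def impute_embedding(neighbor_vecs, weights):
    """Embed an out-of-vocabulary domain word as a convex combination of
    the embeddings of its neighbors in an external knowledge graph
    (a one-shot simplification of the imputation idea in [2])."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalize to a convex combination
    return weights @ np.asarray(neighbor_vecs, dtype=float)
```

In practice the continued-training route would run over a full domain corpus (e.g. gensim's `Word2Vec` supports vocabulary expansion and further training), and the imputation route would iterate the weighted averaging to a fixed point over the whole graph.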

During this project we will investigate ways of enriching a general word embedding model with domain-specific knowledge. We will identify a specific domain of interest and collect the data needed to conduct experiments. We will choose the most promising approaches for the task and evaluate their behaviour on the selected domain. Finally, for each method, we will measure how far the resulting embeddings have adapted away from the generic model.

Tasks:

  1. Literature review of existing state-of-the-art methods for enriching word embedding models.
  2. Research and plan methods to evaluate domain adaptation.
  3. Collect generic and domain specific dataset(s).
  4. Implement and extend method(s) for injecting the domain specific information into the generic model.
  5. Evaluation and comparison:
    • Compare the method(s) and benchmark them against embeddings naively trained on the domain-specific and generic corpora (the generic corpora could include the domain-specific one).
    • Measure and assess the domain adaptation.

[1] A Simple Regularization-based Algorithm for Learning Cross-Domain Word Embeddings (2019).
[2] Enhancing Domain Word Embedding via Latent Semantic Imputation (2019).