P2: Enriching a knowledge base with domain-specific word embeddings

RQ: How can an existing knowledge base best be represented in a semantic space so that its entities can be related through contextual data? How can an annotated prototype knowledge base best be completed by injecting semantic relationships between its defined entities, based on their positions in that semantic space? Once the semantic properties of the entities have been identified through word embeddings, what is the best way to integrate these properties back into the knowledge base?

Description:
Knowledge bases, or ontologies, have long been employed in the life sciences to formalize and reason over domain knowledge. In recent years, their applications in machine learning models have been widely researched. In computational linguistics, constructing a knowledge base for a specific domain often requires a significant amount of human annotation and relies heavily on expert knowledge. Automating ontology learning in the semantic space so that it captures the full scope of domain-specific knowledge remains a formidable challenge.

Various approaches have been introduced to tackle this challenge, and those that leverage word embeddings have attracted considerable interest. For example, recent publications [1] [2] show that domain ontologies can be enriched by incorporating new properties from domain-specific word embedding models. Another method jointly learns word and entity embeddings from an existing ontology and text; these embeddings are then used to create contextual relations within the ontology [3].

In this project, we will investigate how to leverage textual information to automatically enrich an existing ontology in chemistry. We will first develop a seed ontology that is annotated by experts and train word embedding models on a domain-specific corpus. We will then implement a method to inject the textual information from the word embeddings into the constructed ontology. Finally, we will compare and evaluate the enriched ontology against baseline ontologies.
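As a rough illustration of the injection step, the sketch below proposes candidate relations for seed-ontology terms from their nearest neighbours in an embedding space. Everything in it is an assumption made for illustration: the function names `cosine` and `propose_relations`, the similarity threshold, the `top_k` cut-off, and the hand-written three-dimensional toy vectors, which stand in for vectors from a model trained on a chemistry corpus. The project itself may well settle on a different injection method.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def propose_relations(ontology_terms, embeddings, threshold=0.8, top_k=2):
    """Return candidate (entity, neighbour, similarity) triples.

    For each ontology term that has a vector, keep at most top_k other
    terms whose cosine similarity reaches the (hypothetical) threshold.
    """
    candidates = []
    for entity in ontology_terms:
        if entity not in embeddings:
            continue  # term never seen in the corpus: nothing to propose
        scored = []
        for term, vector in embeddings.items():
            if term == entity:
                continue
            similarity = cosine(embeddings[entity], vector)
            if similarity >= threshold:
                scored.append((term, similarity))
        scored.sort(key=lambda pair: pair[1], reverse=True)
        for term, similarity in scored[:top_k]:
            candidates.append((entity, term, round(similarity, 3)))
    return candidates

# Toy vectors, invented for this example only (not real embeddings).
toy_embeddings = {
    "ethanol": [0.9, 0.1, 0.0],
    "methanol": [0.85, 0.15, 0.05],
    "benzene": [0.1, 0.9, 0.2],
    "toluene": [0.15, 0.85, 0.25],
}
seed_terms = ["ethanol", "benzene"]

candidates = propose_relations(seed_terms, toy_embeddings)
for entity, neighbour, similarity in candidates:
    print(f"{entity} -> {neighbour} (cosine {similarity})")
```

With these toy vectors, each seed term is paired only with its closest neighbour, since the other terms fall below the threshold. Proximity alone does not say which relation holds between two terms, so such candidates would still need to be typed and validated, for example by expert review, before being added to the ontology.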

Tasks:

  1. Research existing methods for enriching ontologies with textual information.
  2. Collect existing ontologies for comparison (possible source: ChEBI[4]).
  3. Find or create an ontology structure to populate from the embedding model.
  4. Curate a chemistry-specific corpus for training the embedding model (possible sources: BioCreative V.5[5] and SCI[6]).
  5. Train the word embedding model.
  6. Develop a method to inject information from the word embedding into the basic ontology.
  7. Compare and evaluate the newly enriched ontology against the baselines.
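
One way the final comparison could be carried out is to treat each ontology as a set of (subject, relation, object) triples and score the enriched ontology against a reference with precision, recall, and F1. The sketch below assumes that setup; the metric choice, the function name `evaluate_relations`, and the toy triples are illustrative assumptions, not the project's prescribed evaluation protocol.

```python
def evaluate_relations(predicted, reference):
    """Score predicted triples against reference triples.

    Both arguments are iterables of (subject, relation, object) tuples.
    """
    predicted = set(predicted)
    reference = set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0.0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

# Toy triples, invented for this example only.
reference_edges = {
    ("ethanol", "is_a", "alcohol"),
    ("methanol", "is_a", "alcohol"),
    ("benzene", "is_a", "arene"),
}
enriched_edges = {
    ("ethanol", "is_a", "alcohol"),
    ("methanol", "is_a", "alcohol"),
    ("toluene", "is_a", "alcohol"),  # a wrong proposal, counted against precision
}

scores = evaluate_relations(enriched_edges, reference_edges)
print(scores)
```

Exact triple matching is strict: a proposed relation that is plausible but absent from the reference counts as an error, so the scores are best read alongside a manual inspection of the mismatches.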

References:
[1] A Web-scale system for scientific knowledge exploration (2018).
[2] A Word Embedding Analysis towards Ontology Enrichment (2019).
[3] Combining Word and Entity Embeddings for Entity Linking (2017).
[4] ChEBI: a database and ontology for chemical entities of biological interest (2007).
[5] Evaluation of chemical and gene/protein entity recognition systems at BioCreative V.5: the CEMP and GPRO patents tracks (2017).
[6] Chemical names: terminological resources and corpora annotation (2008).