P2. Building a classifier to mark words as chemical elements, properties, processes, and products


RQ: What is the best method to classify a concept occurring in a text into a fixed set of classes (e.g. “chemical element”, “chemical property”, “process”, …), given its position in the text and its context?

Description:
In order to build and enrich good knowledge graphs in the chemical domain using text mining of scientific documents, it is necessary to relate concepts occurring in the text to the classes in the knowledge graph. This is important for identifying occurrences of elements (nodes) of the knowledge graphs in the text, adding new nodes, or establishing/reinforcing links between existing nodes. Such a node/relation structure often exists as outside expert knowledge and the goal is to use text mining of extensive corpora of specific literature to improve and extend such knowledge graphs.
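The node/relation structure referred to above can be sketched as follows. This is a minimal illustration, not a real chemical ontology: all concepts, classes, and relation names are made up for the example.

```python
from typing import Optional

# Nodes: concept -> class, drawn from the fixed, expert-given set of
# classes (illustrative entries only).
NODES = {
    "oxygen":     "chemical element",
    "solubility": "chemical property",
    "oxidation":  "process",
}

# Edges: (subject, relation, object) triples between existing nodes.
EDGES = [
    ("oxygen", "participates_in", "oxidation"),
]

def link_mention(mention: str) -> Optional[str]:
    """Relate a concept mention found by text mining to its class,
    if the concept already exists as a node in the graph."""
    return NODES.get(mention.lower())
```

A mention such as “Oxygen” found in a document is then grounded as an existing “chemical element” node, while mentions with no match are candidates for new nodes.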

As a side effect, such a classification is also important for building disambiguated embeddings for concepts that can play different roles (i.e. belong to different classes, have different meanings) depending on how they appear in the text. This avenue will not be explored in this project, but it is a valuable by-product of a good classifier.

The goal is to establish a reliable and efficient method for concept classification. As input, the algorithm should take a context window (a paragraph, a sentence, 10 words, etc.); as output, it should predict the correct class of every concept in that context. The different classes and the relations between them are given in advance in the form of outside expert knowledge.
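The input/output contract just described could take the following shape. The sketch below is a hypothetical interface with a deliberately trivial baseline: it memorises the most frequent gold class per surface form and ignores the context entirely, which gives a floor for the real prototypes to beat. All function names and data are made up.

```python
from collections import Counter, defaultdict

def train_majority(annotated):
    """annotated: iterable of (surface form, gold class) pairs."""
    counts = defaultdict(Counter)
    for word, cls in annotated:
        counts[word][cls] += 1
    # Most frequent class per surface form.
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def classify_concepts(tokens, concept_positions, model, default="process"):
    """Predict one class per concept position in the context window."""
    return [model.get(tokens[i], default) for i in concept_positions]

model = train_majority([("oxygen", "chemical element"),
                        ("oxidation", "process")])
classify_concepts(["the", "oxygen", "undergoes", "oxidation"], [1, 3], model)
# -> ["chemical element", "process"]
```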

Tasks:

  1. Survey the existing state-of-the-art methods for concept classification (literature research)
  2. Build a corpus of annotated documents from a given ontology of concepts, their classes, and the relations
    between them
  3. Build prototypes of increasing complexity and performance using, amongst others, word embeddings
    (Word2Vec), statistical methods (based e.g. on co-occurrences), and RNN- and attention-based methods (LSTM,
    Transformer, BERT, …)
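One of the statistical prototypes from Task 3 can be sketched with plain co-occurrence counts: represent each concept occurrence by the words inside a fixed window around it, average these vectors per class over the annotated corpus, and classify a new occurrence by cosine similarity to the class centroids. The corpus, classes, and window size below are toy assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

WINDOW = 2  # context window half-width, an arbitrary choice here

def context_vector(tokens, pos, window=WINDOW):
    """Bag of words co-occurring with the concept at `pos`."""
    vec = Counter()
    for j in range(max(0, pos - window), min(len(tokens), pos + window + 1)):
        if j != pos:
            vec[tokens[j]] += 1
    return vec

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def train_cooc(annotated):
    """annotated: (tokens, concept position, gold class) triples.
    Returns one co-occurrence centroid per class."""
    centroids = defaultdict(Counter)
    for tokens, pos, cls in annotated:
        centroids[cls].update(context_vector(tokens, pos))
    return centroids

def classify_by_cooc(tokens, pos, centroids):
    vec = context_vector(tokens, pos)
    return max(centroids, key=lambda c: cosine(vec, centroids[c]))
```

The same train/classify split carries over to the embedding- and transformer-based prototypes; only the context representation changes.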