P4. Explore different measures to define semantic dependencies between two texts


RQ: What is the best way to determine the semantic relationship between two texts? For example, given two texts that discuss similar topics, how do we identify the dependencies between them (e.g. one text being cited in the other) based purely on their semantic properties?

Description: To position documents on a knowledge graph, we must first identify their contextual relationships to other documents in the corpus. Although meta-information, such as citations provided by the authors, may serve this purpose, it inevitably introduces bias into the results. The goal of this project is to explore metrics, based solely on the semantic properties of a document, that can be used to measure such dependencies.

Recent applications of algorithms such as bidirectional long short-term memory (bi-LSTM) networks and bidirectional recurrent neural networks (bi-RNN) have demonstrated great success in determining the temporal relationships between key-phrases in documents. The summarizing concepts extracted for each document can thus be used to derive concept co-occurrence scores [1] relative to other documents in the corpus, similar to the idea adopted in the construction of the Microsoft Academic Graph (MAG) [2]. The meta-information of the documents can then be compared with the constructed semantic relationships. Together with you, we will experiment with how these techniques can be applied to identify hierarchical and causal relationships between texts.
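As a rough illustration of a document-to-document score derived from shared concepts, the sketch below computes the Jaccard overlap between the concept sets of every pair of documents. This is an assumption made for illustration only: it is not the co-occurrence score used in MAG, and the function names and the toy concept sets are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Share of concepts common to both sets; 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_concept_overlap(doc_concepts):
    """Return {(doc_i, doc_j): score} for every unordered pair of documents."""
    return {
        (i, j): jaccard(doc_concepts[i], doc_concepts[j])
        for i, j in combinations(sorted(doc_concepts), 2)
    }


# Hypothetical concepts, assumed to come from an earlier extraction step.
docs = {
    "A": {"lstm", "key-phrase extraction", "knowledge graph"},
    "B": {"lstm", "knowledge graph", "citation"},
    "C": {"protein folding"},
}
scores = pairwise_concept_overlap(docs)
```

A symmetric overlap score of this kind only indicates that two documents are related. Determining the direction of a dependency (which text builds on which) would need additional signals, which is part of what the project is meant to explore.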

The tasks in this project include experimenting with key-phrase extraction and exploring the semantic relationships between the extracted key-phrases using state-of-the-art algorithms. An extensive literature review of recent developments in the extraction of semantic sequential relationships will also be part of the project.
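Before moving to neural extractors, a simple baseline can help to sanity-check the pipeline. The sketch below ranks single terms per document by TF-IDF using only the standard library. It is a hedged baseline and not the bi-LSTM approach mentioned above: it returns single words rather than multi-word key-phrases, and the tokenizer and function names are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def top_terms(docs, k=3):
    """Rank each document's terms by TF-IDF and keep the k best."""
    tokenized = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(
        term for tokens in tokenized.values() for term in set(tokens)
    )
    result = {}
    for name, tokens in tokenized.items():
        tf = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in tf.items()
        }
        # Sort by descending score; break ties alphabetically.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        result[name] = ranked[:k]
    return result


# Toy corpus, invented for illustration.
corpus = {
    "d1": "graph graph citation",
    "d2": "graph protein protein",
}
keywords = top_terms(corpus, k=1)
```

Terms that occur in every document receive a score of zero, so the baseline favours vocabulary that distinguishes one text from the rest of the corpus.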


  1. Review the literature on recent developments in the extraction of semantic sequential relationships.
  2. Understanding the structure of the MAG dataset.
  3. Compile a specific corpus where the causal links of the included texts can be determined with meta-information (MAG data can be useful here).
  4. Experiment with key-phrase extraction for the text units and classify the relationships between text-unit nodes to construct the knowledge graph within the corpus.
  5. Evaluate the graph constructed semantically against the graph constructed with meta-information in step 3.
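One possible way to carry out the final evaluation step is to treat both graphs as sets of directed edges and to report edge-level precision, recall and F1, with the meta-information graph as the reference. The sketch below assumes this framing. The function name and the toy edges are hypothetical, and other graph-comparison measures may turn out to be more suitable.

```python
def edge_scores(predicted, reference):
    """Precision, recall and F1 of predicted directed edges against reference edges."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Each edge is (source, target), e.g. "B depends on A". Toy data for illustration.
semantic_edges = {("B", "A"), ("C", "A"), ("C", "B")}
citation_edges = {("B", "A"), ("C", "B"), ("D", "A")}

precision, recall, f1 = edge_scores(semantic_edges, citation_edges)
```

Since the citations themselves may be biased, low precision does not necessarily mean that a semantically derived edge is wrong. Disagreements between the two graphs are worth inspecting by hand.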

[1] Zhihong Shen, Hao Ma, and Kuansan Wang. 2018. A Web-scale system for scientific knowledge exploration.
[2] Arnab Sinha, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bo-June (Paul) Hsu, and Kuansan Wang. 2015. An Overview of Microsoft Academic Service (MAS) and Applications.