P2. Topic Models in the Spherical Embedding Space

RQ: How can we combine spherical word and document embedding knowledge to create topic models? Would such an approach differ fundamentally from statistical topic modeling, e.g., LDA? What additional information can we learn from embedding topics into the word- and document-vector space? Would it be possible to describe topic vectors analytically, for a more efficient and general application?

Description:
Topic models are statistical tools for discovering the hidden semantic structure in a collection of documents, and they are mostly used to automatically detect topics in textual data. They are applied widely in the natural language processing community, and many extensions have been proposed to enhance them. One recent study combines topic models with the power of word embeddings to achieve higher-quality topics [1].
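
To make this combination concrete: in the Embedded Topic Model of [1], each topic k is itself a vector \alpha_k living in the word embedding space, and its distribution over the vocabulary is a softmax over inner products with the word embedding matrix \rho (we restate the paper's parameterization here from memory):

    \beta_{k,v} = \frac{\exp(\rho_v^\top \alpha_k)}{\sum_{v'} \exp(\rho_{v'}^\top \alpha_k)}

so words whose embeddings align with the topic vector receive high probability. This is the sense in which a topic can be "embedded" alongside words, and it is the idea this project aims to transfer to the spherical space.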

At the same time, spherical text embeddings have been shown to produce enhanced latent representations of words and documents by training directly on the unit sphere [2] and by incorporating and balancing local and global contexts of the textual data [3]. Further research on spherical text embeddings has shown that their word representations form consistent clusters [4].
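
To illustrate the clustering view of [4], here is a minimal sketch; the embeddings below are random stand-ins, whereas in practice one would load pretrained spherical embeddings such as those of [2]. The idea: project vectors onto the unit sphere, run k-means on them (a common approximation of spherical k-means, since Euclidean distance between unit vectors is monotone in cosine similarity), and read off each cluster's nearest words as a topic.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)

    # Stand-ins for a vocabulary and its pretrained word embeddings;
    # in the project these would be loaded from a real model such as [2].
    vocab = [f"word_{i}" for i in range(1000)]
    emb = rng.normal(size=(1000, 50))
    emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # project onto the unit sphere

    # Euclidean k-means on unit vectors approximates spherical (cosine) k-means.
    km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(emb)
    centroids = km.cluster_centers_
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)

    # Each topic is the set of words most similar to its cluster centroid.
    sims = emb @ centroids.T  # (n_words, n_topics) cosine similarities
    for k in range(10):
        top_words = [vocab[i] for i in np.argsort(-sims[:, k])[:10]]
        print(f"topic {k}:", top_words)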

In this project, we want to investigate how to combine the information embedded in the spherical word- and document-vector space in order to generate enhanced topic models in the same space. Based on the constructed hierarchical vector space, which embeds topic, document, and word vectors, we would like to understand whether it is possible to express the topic distributions within that space analytically, for a more efficient and general application.
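
One candidate analytic form, stated here purely as an assumption to be tested in the project, treats each topic as a unit vector t_k and assigns it a von Mises-Fisher-style distribution over word embeddings on the sphere:

    p(w \mid t_k) \propto \exp(\kappa \, \langle t_k, e_w \rangle), \qquad \|t_k\| = \|e_w\| = 1,

where e_w is the spherical embedding of word w and the concentration parameter \kappa controls how sharply the topic focuses on its nearest words. The von Mises-Fisher distribution is the natural exponential-family distribution on the unit sphere, which is what makes this parameterization a plausible first candidate.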

The project will start with a familiarization with the literature on topic modelling and spherical word embeddings. You will lead an extensive analysis of how to embed topic information into the existing word- and document-vector space, and implement it for further testing and evaluation.

Tasks:

  1. Literature review of existing state-of-the-art methods for creating topic models and textual embeddings.
  2. A thorough understanding of spherical embeddings.
  3. Design and implement approaches to create topic models or topic embeddings in the spherical space.
  4. Evaluate and benchmark the aforementioned approaches against state-of-the-art topic models, e.g., via automatic coherence metrics (see the sketch after this list).
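
As one possible starting point for the evaluation in task 4, topic quality is commonly measured with automatic coherence scores. A minimal sketch using gensim's CoherenceModel follows; the corpus and topics below are toy placeholders, and real benchmarks would use standard datasets such as 20 Newsgroups.

    from gensim.corpora import Dictionary
    from gensim.models.coherencemodel import CoherenceModel

    # Toy tokenized corpus and candidate topics, purely for illustration;
    # in the project, topics would come from the spherical topic model.
    texts = [
        ["apple", "banana", "fruit", "sweet"],
        ["car", "engine", "wheel", "road"],
        ["fruit", "apple", "juice", "fresh"],
        ["road", "car", "traffic", "wheel"],
    ]
    topics = [["apple", "banana", "fruit"], ["car", "wheel", "road"]]

    dictionary = Dictionary(texts)
    # NPMI-based coherence, a standard metric for comparing topic models.
    cm = CoherenceModel(topics=topics, texts=texts,
                        dictionary=dictionary, coherence="c_npmi")
    print("NPMI coherence:", cm.get_coherence())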

[1] Topic Modeling in Embedding Spaces (2019).
[2] Spherical Text Embedding (2019).
[3] Unsupervised Word Embedding Learning by Incorporating Local and Global Contexts (2020).
[4] Tired of Topic Models? Clusters of Pretrained Word Embeddings Make for Fast and Good Topics too! (2020).