Master Thesis Topic 6: Vector space of documents

Creating a vector space of documents

RQ: Given a predefined set of documents (or parts of documents) and their word/topic distributions, what is the best way to project them into a vector space such that similar documents are close to each other and dissimilar documents are far apart?
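
To make the distance requirement concrete, here is a minimal baseline sketch, not the method the thesis is expected to arrive at. It assumes a plain TF-IDF weighting of word distributions and cosine similarity as the closeness measure. The helper names (tokenize, tfidf_vectors, cosine_similarity) and the three toy documents are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return (vocab, vectors): one TF-IDF vector per document.

    Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1, and term
    frequency normalized by document length.
    """
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter(w for toks in tokenized for w in set(toks))
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1.0 for w in vocab}

    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1
        vectors.append([counts[w] / total * idf[w] for w in vocab])
    return vocab, vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy corpus: two related documents and one unrelated.
docs = [
    "the cat sat on the mat",
    "a cat lay on the mat",
    "stock prices fell sharply today",
]
vocab, vectors = tfidf_vectors(docs)
sim_close = cosine_similarity(vectors[0], vectors[1])
sim_far = cosine_similarity(vectors[0], vectors[2])
```

In this sketch the first two documents share vocabulary and therefore score higher than the unrelated pair, which shares no words at all. A baseline of this kind only captures word overlap, which is one reason to look at learned representations instead.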

Encoding a text document as a vector has several advantages. For Iris, it can allow us to build better indexes and search trees. It can make classification and topic modelling easier and, finally, it can enable knowledge generalization. We already use a technique for projecting words into a vector space: word2vec. Recent research shows that this can be advanced to paragraph2vec, and what we are looking for is a further generalization that projects a whole document into a vector space.

Sub RQs:

  • How would the length of the text affect the vector’s dimensionality?
  • What are reasonable chunks into which the document should be split?
  • What information is important for forming the vector representation?
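
As a starting point for thinking about these questions, the sketch below shows one possible baseline rather than an answer: split a document into fixed-size word chunks, embed each chunk with a hashed bag-of-words, and average the chunk vectors. Under these assumptions the output dimensionality is fixed by a parameter and does not depend on text length. DIM, the chunk sizes, and all function names are hypothetical.

```python
import zlib

DIM = 16  # hypothetical embedding size, chosen only for illustration


def chunk_words(text, size):
    """Split a text into consecutive chunks of at most `size` words."""
    words = text.lower().split()
    return [words[i:i + size] for i in range(0, len(words), size)]


def embed_chunk(words, dim=DIM):
    """Hashed bag-of-words: each word increments one of `dim` buckets.

    crc32 is used instead of the built-in hash() so that the bucket
    assignment is stable across Python processes.
    """
    vec = [0.0] * dim
    for w in words:
        vec[zlib.crc32(w.encode("utf-8")) % dim] += 1.0
    norm = sum(x * x for x in vec) ** 0.5
    return [x / norm for x in vec] if norm else vec


def embed_document(text, chunk_size=50, dim=DIM):
    """Average the chunk vectors into a single fixed-size document vector."""
    chunks = chunk_words(text, chunk_size)
    if not chunks:
        return [0.0] * dim
    vecs = [embed_chunk(c, dim) for c in chunks]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


# A short document and a much longer one map to vectors of the same size.
short_doc = "vector space models map documents to points"
long_doc = " ".join([short_doc] * 40)
short_vec = embed_document(short_doc, chunk_size=5)
long_vec = embed_document(long_doc, chunk_size=5)
```

Averaging discards the order of chunks and treats every chunk as equally important, so which chunking scheme and which information to keep remain open questions.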

The work requires a literature review to form a view of the state of the art. We also expect an evaluation of different approaches to be carried out on a few data sets from the Iris production environment.

Interested? Get in touch!