Master Thesis Topic 6: Vector space of documents

November 21, 2016

Creating a vector space of documents

RQ: Given a predefined set of documents (or parts of documents) and their word/topic distributions, what is the best way to project them into a vector space such that similar documents lie close to each other and dissimilar documents lie far apart?
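
To make the distance rule concrete, here is a minimal sketch. It assumes that each document is represented by its normalized word frequencies and that cosine similarity is the closeness measure. Both are illustrative choices rather than part of the topic, and the function names and toy documents are hypothetical.

```python
import math
from collections import Counter


def word_distribution(text):
    """Return the normalized word-frequency distribution of a text."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def cosine_similarity(p, q):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(p[w] * q[w] for w in p.keys() & q.keys())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_p * norm_q)


# Toy documents (hypothetical): two about a cat, one about finance.
doc_a = word_distribution("the cat sat on the mat")
doc_b = word_distribution("the cat lay on the rug")
doc_c = word_distribution("stock prices fell sharply today")

sim_ab = cosine_similarity(doc_a, doc_b)  # shared vocabulary, higher score
sim_ac = cosine_similarity(doc_a, doc_c)  # no shared words, score is 0.0
```

A good projection should preserve this ordering: the two cat documents end up closer to each other than either is to the finance document.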

Description:
Encoding a text document as a vector has many advantages. For Iris, it would allow us to build better indexes and search trees. It would also make classification and topic modelling easier, and it could ultimately support knowledge generalization. We already use word2vec, a technique for projecting words into a vector space. Recent research shows that this can be extended to paragraphs (paragraph2vec); what we are looking for is a further generalization that projects a whole document into a vector space.
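
One naive baseline for moving from word vectors to a document vector is to average the vectors of the words in the document. The sketch below assumes a tiny hand-made embedding table; the values are hypothetical stand-ins for trained word2vec vectors, and `document_vector` is an illustrative helper, not an existing Iris component.

```python
# Hypothetical 3-dimensional embeddings; real word2vec vectors
# would be learned from a corpus and have far more dimensions.
TOY_EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.0],
    "stock": [0.0, 0.1, 0.9],
    "market": [0.1, 0.0, 0.8],
}


def document_vector(text, embeddings, dim=3):
    """Average the embeddings of all known words in the text."""
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vectors:
        return [0.0] * dim  # no known words: fall back to the zero vector
    return [sum(component) / len(vectors) for component in zip(*vectors)]


short_doc = document_vector("cat dog", TOY_EMBEDDINGS)
long_doc = document_vector(
    "the stock market and the stock market again", TOY_EMBEDDINGS
)
```

With averaging, the dimensionality of the result is fixed by the word vectors and does not depend on the length of the text. The price is that word order and document structure are discarded, which is one reason to look for better approaches.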

Sub RQs:

How would the length of the text affect the vector’s dimensionality?
What are reasonable chunks into which the document should be split?
What information is important for forming the vector representation?

The work requires a literature review to establish the state of the art. We also expect the different approaches to be evaluated on a few data sets from the Iris production environment.

Interested? Get in touch!