Analysis of state-of-the-art document cluster labeling algorithms
RQ: Given clusters of documents, what is the best way to label each cluster with a sequence of words? (The label does not have to be a sentence; it could be a list of words sorted by a relevance score, frequency, or another topic modelling criterion.)
At Iris AI we cluster documents using topic models. The statistical topic models we use analyse two distributions: the distribution of topics within a document and the distribution of words within a topic. Together these yield both clusters of documents and clusters of words that represent the topics. Each document cluster is then labelled with the corresponding word cluster, sorted by the score of each word. The problem is that the meaning of such labels is often hidden, so when they are presented to humans they may not describe the content of the cluster well.
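To make the current labeling scheme concrete, here is a minimal Python sketch. It is not the Iris AI implementation: the toy distributions, the rule of assigning each document to its highest-probability topic, and the names `cluster_documents` and `label_topic` are all assumptions made for illustration.

```python
# Hypothetical toy distributions; real values would come from a fitted topic model.
topic_word = {
    0: {"protein": 0.30, "cell": 0.25, "gene": 0.20, "model": 0.05},
    1: {"network": 0.35, "training": 0.25, "model": 0.20, "cell": 0.02},
}
doc_topic = {
    "doc_a": [0.9, 0.1],
    "doc_b": [0.2, 0.8],
    "doc_c": [0.7, 0.3],
}


def cluster_documents(doc_topic):
    """Group documents by their most probable topic (an assumed clustering rule)."""
    clusters = {}
    for doc_id, dist in doc_topic.items():
        best = max(range(len(dist)), key=lambda t: dist[t])
        clusters.setdefault(best, []).append(doc_id)
    return clusters


def label_topic(word_scores, k=3):
    """Return the k highest-scoring words of a topic, ties broken alphabetically."""
    ranked = sorted(word_scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]


clusters = cluster_documents(doc_topic)
labels = {topic: label_topic(topic_word[topic]) for topic in clusters}

for topic, docs in clusters.items():
    print(topic, docs, labels[topic])
```

A label such as "network, training, model" illustrates the problem described above: the words are statistically prominent, but a human reader still has to infer what the cluster is about.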
The goal is to find a way to label a cluster of documents, grouped by topic, relevance, or other classification criteria, such that the sequence of words in the label describes the cluster as closely as possible to a human annotation (a sequence of tags).
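One simple way to quantify "as closely as possible to a human annotation" is the overlap between the generated label words and the human tags. The sketch below uses Jaccard similarity; this metric is an assumption chosen for illustration, not one prescribed by the project, and the function name `jaccard` and the example tags are hypothetical.

```python
def jaccard(label_words, human_tags):
    """Jaccard similarity between two word lists, ignoring case and order."""
    a = {w.lower() for w in label_words}
    b = {t.lower() for t in human_tags}
    if not a and not b:
        return 1.0  # two empty labels are treated as identical
    return len(a & b) / len(a | b)


# Hypothetical example: a generated label versus a human annotation.
generated = ["protein", "cell", "gene"]
annotated = ["Gene", "Protein", "Biology"]
score = jaccard(generated, annotated)
print(score)
```

Exact word overlap ignores synonyms and word order, so a study would likely compare it with other measures; it serves here only as a baseline.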
- What methods exist for labeling document clusters?
- What limitations and prior assumptions determine whether a method is successful?
Students are also encouraged to propose new methods on top of what they find in their study.
The work requires a literature review of existing labeling methods and an analysis of their performance on multiple examples provided by Iris.
Interested? Get in touch!