Master Thesis Topic 1: Avoiding biases in training datasets

Avoiding biases when building training datasets for supervised topic modelling algorithms (sLDA, sNTM)

RQ: How can we build a training dataset for a natural language processing engine used for topic modelling that is unbiased with respect to vocabulary (the words used in different areas of research) and sentiment (positive and negative connotation)?

The problem domain is supervised machine learning, more specifically topic modelling within Natural Language Processing.

The selected students will have access to a dataset of a few hundred annotated data points. Their goal will be to analyze the dataset, form smaller subsets that reveal the problem of biases and their impact on the results, and develop a method for selecting data points for a dataset based on minimal-bias criteria. In the process they need to be able to answer the following sub-RQs:

  • What key properties should each data point have in order to be included in the training dataset? (inclusion-exclusion properties)
  • What should the relationship between the data points be, and how can we automate the process of selecting a particular data point for a dataset while minimizing the risk of bias?
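
To make the second sub-RQ more concrete, here is a minimal Python sketch of one possible automated selection strategy. It is an illustration only, not the method the thesis is expected to produce. It assumes each data point carries a research-area label and a sentiment label, and it treats "bias" as the distance of each label distribution from a uniform one. The records in POOL and the names imbalance and select_balanced are hypothetical and do not come from the actual dataset.

```python
from collections import Counter

# Hypothetical toy records: (text, research_area, sentiment).
# The pool is deliberately skewed towards positive biology texts.
POOL = [
    ("enzyme assay shows promising yield", "biology", "positive"),
    ("gene therapy trial succeeds", "biology", "positive"),
    ("cell culture growth improved", "biology", "positive"),
    ("protein model performs well", "biology", "positive"),
    ("vaccine response was strong", "biology", "positive"),
    ("tissue sample was contaminated", "biology", "negative"),
    ("laser setup gives stable output", "physics", "positive"),
    ("detector failed under load", "physics", "negative"),
    ("measurement noise ruined the run", "physics", "negative"),
]


def imbalance(counts, categories):
    """Total variation distance between the observed shares and a uniform target.

    Returns 0.0 for a perfectly balanced subset; larger values mean more skew.
    """
    total = sum(counts[c] for c in categories)
    if total == 0:
        return 0.0
    target = 1.0 / len(categories)
    return 0.5 * sum(abs(counts[c] / total - target) for c in categories)


def select_balanced(pool, k):
    """Greedily pick k records, each time adding the one that keeps the
    combined area and sentiment imbalance lowest (ties go to pool order)."""
    areas = sorted({rec[1] for rec in pool})
    sentiments = sorted({rec[2] for rec in pool})
    area_counts, sent_counts = Counter(), Counter()
    remaining = list(pool)
    chosen = []

    def score(rec):
        # Imbalance of the subset if this record were added.
        a = area_counts.copy()
        s = sent_counts.copy()
        a[rec[1]] += 1
        s[rec[2]] += 1
        return imbalance(a, areas) + imbalance(s, sentiments)

    while remaining and len(chosen) < k:
        best = min(remaining, key=score)
        remaining.remove(best)
        chosen.append(best)
        area_counts[best[1]] += 1
        sent_counts[best[2]] += 1
    return chosen


if __name__ == "__main__":
    for text, area, sentiment in select_balanced(POOL, 4):
        print(area, sentiment, text)
```

Taking the first four records of this toy pool would yield only positive biology texts, whereas the greedy selection mixes both areas and both sentiments. Label balance is only a proxy: a real study would also need to compare the vocabulary of the subsets directly, and a greedy choice is not guaranteed to find the best possible subset.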

Iris AI will support the students by sharing knowledge about Python libraries that could be used for the experiments, providing access to the datasets, and offering additional supervision if needed.

Interested? Get in touch!