P1: Scaling an embedding evaluation framework

Contact

RQ: How can we reliably measure the quality of a word embedding model trained for a new linguistic domain, in particular, for specific scientific domains? How can we get more detailed insights into the strengths and weaknesses of domain-specific embedding models? How can we use this information to iterate faster towards high-quality domain-specific embedding models?

Description:
At Iris.ai, we use machine learning and natural language processing (NLP) to help researchers stay on top of the flood of scientific literature. For this purpose, we need word embedding models that are well adapted to the highly specific language found in scientific literature. To date, there is no satisfactory framework for evaluating how well a word embedding model works for a given scientific domain. Without a solid evaluation methodology, we can neither systematically improve domain-specific word embedding models nor evaluate general strategies for domain adaptation or model training. This project’s ultimate goal is to fill this crucial gap and implement a word embedding evaluation framework that is adaptable to new linguistic domains. This will allow us to iterate faster and with more confidence on domain-specific embedding models, improve the quality of the Iris.ai tools, and thereby enable researchers to find relevant scientific information more quickly and easily.

To build this word embedding evaluation framework, we draw on the existing research literature. Rogers et al. (2018) [1] have demonstrated convincingly that the “quality” or “degree of fit” of a word embedding model is not a one-dimensional metric. Instead, different qualities of a word embedding model become important depending on what the model is to be used for. We therefore design our embedding evaluation framework to produce a multi-dimensional evaluation score by implementing a whole suite of tasks/tests on which the embedding models are evaluated. Such a multi-dimensional word embedding evaluation has been implemented previously by Nayak et al. (2016) [2], who propose to test word embedding models by using them in a suite of downstream NLP tasks (extrinsic evaluation).
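
To make this concrete, the sketch below shows what a single extrinsic evaluation task could look like: the embedding is judged by the cross-validated accuracy of a simple classifier whose only features are averaged word vectors. The helper names, the averaging strategy, and the logistic-regression probe are illustrative assumptions, not the method of Nayak et al. or our actual implementation.

    from typing import Dict, List

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score


    def sentence_vector(tokens: List[str], emb: Dict[str, np.ndarray],
                        dim: int) -> np.ndarray:
        """Average the vectors of all in-vocabulary tokens (zeros if none are known)."""
        vecs = [emb[t] for t in tokens if t in emb]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)


    def downstream_classification_score(emb: Dict[str, np.ndarray], dim: int,
                                        sentences: List[List[str]],
                                        labels: List[int]) -> float:
        """Extrinsic score for one task: cross-validated accuracy of a probe
        classifier that sees nothing but averaged word vectors as features."""
        X = np.stack([sentence_vector(s, emb, dim) for s in sentences])
        y = np.array(labels)
        probe = LogisticRegression(max_iter=1000)
        return float(cross_val_score(probe, X, y, cv=3).mean())

A full suite would run many such tasks and report one score per task rather than collapsing everything into a single number.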

We will improve on the framework of Nayak et al. in two important ways. First, we will also integrate evaluation metrics that do not depend on downstream NLP tasks but only on the geometry of the word embedding space (intrinsic evaluation), since Rogers et al. have demonstrated the utility of these intrinsic metrics. Second, and more importantly, our evaluation framework will be adaptable to different linguistic domains. This allows us, on the one hand, to draw conclusions about the soundness of our evaluation methodology by comparing our results against published results in the domains where such results are available. On the other hand, we will be able to draw conclusions about the performance of embedding models in domains for which no evaluations exist yet. Our vision is to make the evaluation framework robust and performant enough that automated model and model-architecture selection becomes feasible.
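
As an illustration of such an intrinsic metric, the sketch below computes the Spearman rank correlation between cosine similarities in the embedding space and human similarity judgements for a list of word pairs. The function names and the pair format are placeholder assumptions; a real run would use a curated, domain-specific similarity dataset.

    from typing import Dict, List, Tuple

    import numpy as np
    from scipy.stats import spearmanr


    def cosine(u: np.ndarray, v: np.ndarray) -> float:
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


    def similarity_correlation(emb: Dict[str, np.ndarray],
                               pairs: List[Tuple[str, str, float]]) -> float:
        """Spearman correlation between model similarities and human ratings,
        computed over pairs whose words are both in the vocabulary."""
        model_scores, human_scores = [], []
        for w1, w2, human_rating in pairs:
            if w1 in emb and w2 in emb:
                model_scores.append(cosine(emb[w1], emb[w2]))
                human_scores.append(human_rating)
        if len(model_scores) < 2:
            return float("nan")  # too few in-vocabulary pairs to correlate
        rho, _p_value = spearmanr(model_scores, human_scores)
        return float(rho)

Because such a metric only needs the vectors and a small annotated word list, it can be run on a new domain long before any downstream task data exists.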

Tasks:
So far, we have implemented a prototype of the evaluation framework. In this project, you will investigate different evaluation tasks for word embeddings and, in close collaboration with other researchers in the team, add new evaluation tasks to the existing framework. To do this, you will iteratively:

  1. research literature and datasets for a specific embedding evaluation task
  2. prototype an implementation of the evaluation task in Python (a rough sketch of what such a task could look like follows this list)
  3. devise a methodology for creating test datasets
  4. evaluate different embedding models using your prototype
  5. compare your evaluation results to published data
  6. summarize your findings (potentially becoming a contributor to a research paper about the evaluation framework)
  7. polish your code such that it can be merged to the stable evaluation framework
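
As a rough illustration of what steps 2 and 7 might involve, the sketch below wraps a toy “odd one out” test behind a minimal task interface; the class names, the interface, and the triple format are hypothetical and will differ from the actual API of the existing framework.

    from abc import ABC, abstractmethod
    from typing import Dict, List, Tuple

    import numpy as np


    def _cosine(u: np.ndarray, v: np.ndarray) -> float:
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


    class EvaluationTask(ABC):
        """One evaluation task contributes one dimension of the overall score."""

        name: str

        @abstractmethod
        def run(self, embedding: Dict[str, np.ndarray]) -> float:
            """Return a single score for this task on the given embedding."""


    class OddOneOutTask(EvaluationTask):
        """Toy intrinsic task: spot the unrelated word in an (a, b, odd) triple."""

        name = "odd_one_out"

        def __init__(self, triples: List[Tuple[str, str, str]]):
            # The triples are the kind of test dataset step 3 asks you to devise.
            self.triples = triples

        def run(self, embedding: Dict[str, np.ndarray]) -> float:
            correct = 0
            for a, b, odd in self.triples:
                if not all(w in embedding for w in (a, b, odd)):
                    continue  # out-of-vocabulary triples count as failures here
                # The odd word should be the one least similar to the other two.
                totals = {w: sum(_cosine(embedding[w], embedding[x])
                                 for x in (a, b, odd) if x != w)
                          for w in (a, b, odd)}
                correct += min(totals, key=totals.get) == odd
            return correct / len(self.triples) if self.triples else 0.0

Evaluating several embedding models with such a task (step 4) then amounts to instantiating it once with a fixed dataset and calling run() on each model.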

References:
[1] Rogers, A.; Ananthakrishna, S. H.; Rumshisky, A. What’s in Your Embedding, and How It Predicts Task Performance. In Proceedings of the 27th International Conference on Computational Linguistics; 2018; pp 2690–2703.
[2] Nayak, N.; Angeli, G.; Manning, C. D. Evaluating Word Embeddings Using a Representative Suite of Practical Tasks. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP; Association for Computational Linguistics: Berlin, Germany, 2016; pp 19–23. https://doi.org/10.18653/v1/W16-2504.