P2: Factuality and quality evaluations for text generation


RQ: How do we reliably measure the factuality of text generated by a language model? How do we automatically evaluate its writing quality? How do we provide users with a quantitative evaluation of the factuality of the generated text?


At Iris.ai, our summarization tool gives users a quick content overview of the documents relevant to their research question. Our in-house summarization model, based on an encoder-decoder architecture, is capable of generating high-quality abstractive summaries. However, systematically benchmarking the quality of generated texts is a difficult task, as there are many different aspects to consider, such as writing style and factuality.

Inspired by the work presented in the SummEval suite, we have developed an in-house suite that combines a set of heuristic and automatic metrics to assess the writing quality of generated texts. However, one crucial component is still under development: factuality evaluation. Over the past few months, we have been developing a prototype model that automatically generates a knowledge graph from source texts and compares the generated summaries against this knowledge graph. In this project, we will take this further by fine-tuning the knowledge graph model and training the summarization model alongside the generated knowledge graphs to improve the factuality of generated texts.
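To make the idea of comparing a summary against a source knowledge graph concrete, here is a minimal Python sketch. It is not the prototype model described above: it assumes, purely for illustration, that both the source and the summary have already been converted into (subject, relation, object) triples, and it scores the summary by the fraction of its triples that also appear in the source graph. The names `Triple`, `normalize` and `factuality_score` are hypothetical, and exact matching after normalization stands in for whatever comparison the real model performs.

```python
from typing import Iterable, Tuple

# Hypothetical representation: a knowledge graph as (subject, relation, object) triples.
Triple = Tuple[str, str, str]


def normalize(triple: Triple) -> Triple:
    """Lowercase each element and collapse whitespace so trivial variants match."""
    subj, rel, obj = triple

    def clean(text: str) -> str:
        return " ".join(text.lower().split())

    return (clean(subj), clean(rel), clean(obj))


def factuality_score(
    source_triples: Iterable[Triple],
    summary_triples: Iterable[Triple],
) -> float:
    """Fraction of summary triples that are supported by the source graph.

    Assumption: a summary triple counts as supported only if it matches a
    source triple exactly after normalization.
    """
    source = {normalize(t) for t in source_triples}
    summary = {normalize(t) for t in summary_triples}
    if not summary:
        # Convention chosen for this sketch: no extracted claims, no score.
        return 0.0
    supported = summary & source
    return len(supported) / len(summary)


source = [
    ("BERT", "is a", "language model"),
    ("BERT", "uses", "transformers"),
]
summary = [
    ("bert", "is a", "Language  Model"),  # supported after normalization
    ("BERT", "uses", "RNNs"),             # not present in the source graph
]
score = factuality_score(source, summary)
```

A score of this kind is one possible way to give users a single quantitative factuality number; a practical system would likely need softer matching (for example, entity linking or embedding similarity) rather than exact triple overlap.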


In this project, you will investigate how to fine-tune the existing knowledge graph model on scientific texts and how to design an architecture that trains a summarization model alongside the knowledge graph model. You will carry out the following tasks during this project:

  1. Review the literature to gain background knowledge on SummEval, FactCC and FactGraph.
  2. Fine-tune the existing knowledge graph model on scientific publications.
  3. Design and implement an architecture for training the summarization model with the knowledge graph.

References:
    [1] SummEval: Re-evaluating Summarization Evaluation
    [2] Assessing The Factual Accuracy of Generated Text
    [3] Evaluating the Factual Consistency of Abstractive Text Summarization
    [4] FactGraph: Evaluating Factuality in Summarization with Semantic Graph Representations