[TOC]

  1. Title: Improving Machine Translation with Human Feedback: An Exploration of Quality Estimation as a Reward Model (2024)
  2. Author: Zhiwei He et al.
  3. Publish Date: 23 Jan 2024
  4. Review Date: Sun, Jan 28, 2024
  5. url: arXiv:2401.12873v1

Summary of paper

Contribution

In this research, the authors explore using Quality Estimation (QE) models as reward models for improving translation quality through feedback training. They note that while QE scores align well with human evaluations, there is a risk of overoptimization, where translations receive high rewards even as their actual quality declines. The study addresses this by introducing heuristic rules to identify erroneous translations and penalize them, resulting in more stable training. Experimental results show consistent improvements across various setups, validated by human preference studies. The approach is also highly data-efficient: using only a small amount of monolingual data, it outperforms systems trained on much larger parallel corpora.

Some key terms

Preliminary experiment result

(Figure: preliminary experiment showing the reward increasing during training while translation quality declines, illustrating the overoptimization problem.)

Overoptimization

The researchers identified an overoptimization problem in their QE-based training setup: the reward kept increasing while translation quality declined. This issue, observed in preliminary experiments, was termed “overoptimization” and attributed to the imperfect alignment between the reward model and human preferences, reminiscent of Goodhart’s Law.

They found that the reward model sometimes gave high scores to erroneous translations, particularly those involving common machine translation errors like length-ratio discrepancies, off-target errors, and hallucinations. These errors, if rewarded highly, could propagate and disrupt the training process significantly.

To address overoptimization, the researchers proposed monitoring length-ratio and off-target errors during training and penalizing them with negative rewards. They introduced a criterion C(x, y) that flags translations with unacceptable length ratios or in the wrong target language, and assigns a punitive reward to those translations instead of the QE score. This was intended to mitigate overoptimization and improve translation quality during training.
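A rough sketch of this gated reward is below. The length-ratio band, the fixed penalty value, and the `qe_score` / `detect_language` callables are illustrative assumptions, not the paper's exact settings.

```python
# Minimal sketch of a C(x, y)-style gated reward: penalize translations that
# fail heuristic checks, otherwise return the QE model's score.
# Thresholds and callables are hypothetical, not the authors' implementation.

def length_ratio_ok(source: str, translation: str,
                    low: float = 0.5, high: float = 2.0) -> bool:
    """Check that the translation is neither far too short nor far too long.

    Ratio = translation token count / source token count;
    the [low, high] acceptance band is an assumed choice.
    """
    src_len = max(len(source.split()), 1)
    tgt_len = len(translation.split())
    return low <= tgt_len / src_len <= high


def on_target(translation: str, target_lang: str, detect_language) -> bool:
    """Check that the translation is in the requested target language.

    `detect_language` is a hypothetical language-identification callable
    (e.g. a wrapper around an off-the-shelf LangID model).
    """
    return detect_language(translation) == target_lang


def reward(source: str, translation: str, target_lang: str,
           qe_score, detect_language, penalty: float = -1.0) -> float:
    """If the translation violates the length-ratio or off-target criterion,
    return a punitive reward; otherwise return the QE score as the reward."""
    if not length_ratio_ok(source, translation):
        return penalty
    if not on_target(translation, target_lang, detect_language):
        return penalty
    return qe_score(source, translation)
```

The key design choice is that the heuristic checks override the QE model: a degenerate output (empty, over-long, or wrong-language) can never be rescued by a spuriously high QE score, which is exactly the failure mode the paper attributes to overoptimization.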

Summary

A very simple and straightforward solution.