[TOC]

  1. Title: Teaching Models to Express Their Uncertainty in Words
  2. Author: Stephanie Lin et al.
  3. Publish Date: 13 Jun 2022
  4. Review Date: Wed, Feb 28, 2024
  5. url: https://arxiv.org/pdf/2205.14334.pdf

Summary of paper


Motivation

The study demonstrates that a GPT-3 model can articulate uncertainty about its answers in natural language without relying on model logits. It generates both an answer and a confidence level (e.g., “90% confidence” or “high confidence”), which map to well-calibrated probabilities. The model maintains moderate calibration even under distribution shift and shows sensitivity to uncertainty in its answers rather than mimicking human examples.
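The paper's two output formats, verbalized numbers (e.g., "90%") and verbalized words (e.g., "high confidence"), both need to be mapped onto probabilities before calibration can be measured. A minimal sketch of such a parser is below; the word-to-probability buckets are illustrative assumptions, not the paper's exact mapping.

```python
# Assumed mapping from confidence words to probability-bucket midpoints
# (illustrative only; the paper's exact buckets may differ).
WORD_TO_PROB = {
    "lowest": 0.1,
    "low": 0.3,
    "medium": 0.5,
    "high": 0.7,
    "highest": 0.9,
}

def parse_confidence(text: str) -> float:
    """Parse a model-emitted confidence string into a probability in [0, 1]."""
    text = text.strip().lower().rstrip(".")
    if text.endswith("%"):
        # Numeric form, e.g. "90%"
        return float(text.rstrip("%")) / 100.0
    # Word form, e.g. "high confidence"
    return WORD_TO_PROB[text.replace(" confidence", "")]
```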

Contribution

  1. Introduction of CalibratedMath, a test suite comprising elementary mathematics problems where models must provide both numerical answers and confidence levels in their answers. This allows testing of calibration under distribution shifts and presents a challenging assessment due to varying question types and difficulties for GPT-3.
  2. Demonstration that GPT-3 can learn to express calibrated uncertainty using verbalized probabilities, achieved through fine-tuning. The model exhibits reasonable calibration both in- and out-of-distribution, surpassing a strong baseline.
  3. Evidence that the calibration performance of verbalized probability is not solely dependent on learning to output logits, as GPT-3 does not merely replicate uncertainty information from logits. Superficial heuristics, like integer size in arithmetic questions, cannot fully explain the performance of verbalized probability.
  4. Comparison between verbalized probability and finetuning the model logits, showing that both methods generalize calibration under distribution shift.

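Testing calibration on CalibratedMath comes down to comparing the model's stated confidences with its empirical accuracy. A common way to quantify this is a binned calibration error; the sketch below computes a mean-absolute-deviation style estimate (the equal-width binning scheme here is an assumption, not necessarily the paper's exact protocol).

```python
import numpy as np

def calibration_mad(confidences, correct, n_bins=10):
    """Weighted mean absolute deviation between stated confidence and
    empirical accuracy, computed over equal-width probability bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    errs, weights = [], []
    for i in range(n_bins):
        lo, hi = edges[i], edges[i + 1]
        if i == n_bins - 1:
            mask = (confidences >= lo) & (confidences <= hi)
        else:
            mask = (confidences >= lo) & (confidences < hi)
        if mask.any():
            # Gap between average stated confidence and accuracy in this bin
            errs.append(abs(confidences[mask].mean() - correct[mask].mean()))
            weights.append(mask.sum())
    return float(np.average(errs, weights=weights))
```

A perfectly calibrated model (e.g., 50% confidence with 50% accuracy) scores 0; a model that says 90% while always being wrong scores 0.9.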
Finetuning

The supervised finetuning dataset consists of approximately 10k examples.
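Each finetuning target pairs an answer with a confidence label derived from the model's empirical accuracy on similar questions. A minimal sketch of constructing one such example follows; the prompt template and the rounding scheme for labels are assumptions, not the paper's exact recipe.

```python
# Hypothetical construction of one supervised finetuning example: the
# completion carries both the answer and a confidence label derived from
# empirical accuracy (template and rounding are assumptions).
def make_example(question: str, answer: str, empirical_accuracy: float) -> dict:
    # Round the empirical accuracy to the nearest 10% to form the label.
    confidence = f"{round(empirical_accuracy * 10) * 10}%"
    return {
        "prompt": f"Q: {question}\nA:",
        "completion": f" {answer}\nConfidence: {confidence}",
    }

ex = make_example("What is 57 + 29?", "86", 0.87)
print(ex["completion"])
```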

Results

  1. Verbalized probability generalizes well to both eval sets. After finetuning on the Add-subtract training set, verbalized probabilities generalize reasonably well to both the Multiply-divide and Multi-answer evaluation sets, so the model remains moderately calibrated under a substantial distribution shift.

  2. We have already seen that verbalized probability generalizes better than the answer logit on the Multi-answer evaluation.

  3. Notably, the 50-shot approach achieves calibration levels nearly comparable to the finetuned setup.
