Human Feedback Without Reinforcement Learning: Direct Preference Optimization (DPO) fine-tunes pretrained large language models on human preferences without the cumbersome step of reinforcement learning.

Reinforcement learning from human feedback (RLHF) is widely used to fine-tune pretrained models to deliver outputs that align with human preferences. New work aligns pretrained models without the cumbersome step of reinforcement learning.

What’s new: Rafael Rafailov and colleagues at Stanford University and Chan Zuckerberg Biohub Network developed Direct Preference Optimization (DPO) to fine-tune language models on human preferences using a learning style akin to supervised learning.

RLHF basics: Given a model pretrained to complete sentences in a large text database, reinforcement learning from human feedback proceeds in three steps:

  • The model produces pairs of answers to various prompts, and humans rate which of the answers is better.
  • Another model learns to mimic how the humans evaluated the outputs. This becomes a so-called reward model (a minimal sketch of its training loss appears after this list).
  • The generative model uses evaluations from the reward model to learn, via reinforcement learning, to produce desirable outputs (which earn high rewards) while constrained to keep its answers relatively close to the original model’s output.
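
The reward model in the second step is typically trained with a pairwise ranking loss on the human ratings. Here is a minimal sketch, assuming PyTorch and a hypothetical `reward_model` that maps a prompt and a candidate response to a scalar score (the names are illustrative, not from the paper):

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, prompt_ids, preferred_ids, rejected_ids):
    """Pairwise ranking loss: push the preferred response's score
    above the rejected response's score."""
    score_preferred = reward_model(prompt_ids, preferred_ids)  # scalar per example
    score_rejected = reward_model(prompt_ids, rejected_ids)
    # -log sigmoid(score_preferred - score_rejected) is minimized
    # when preferred responses receive higher scores
    return -F.logsigmoid(score_preferred - score_rejected).mean()
```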

Key insight: Instead of training a reward model on human preferences and then fine-tuning the language model on the reward model’s output, the authors used the human preferences to fine-tune a copy of their language model directly. The fine-tuning trained the copy to be (i) more likely than the original model to generate human-preferred outputs and (ii) less likely than the original model to generate non-preferred outputs.
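
Concretely, DPO casts this as a single classification-style loss over preference pairs: it raises the probability ratio (fine-tuned copy versus original model) of the preferred output relative to the ratio of the rejected output. A minimal PyTorch sketch, assuming summed per-response log-probabilities under both models have already been computed (variable names are illustrative):

```python
import torch.nn.functional as F

def dpo_loss(policy_logp_preferred, policy_logp_rejected,
             ref_logp_preferred, ref_logp_rejected, beta=0.1):
    """DPO loss on whole-response log-probabilities.
    beta controls how strongly the copy is kept near the original model."""
    # Log of how much more likely the copy makes each response than the original does
    preferred_logratio = policy_logp_preferred - ref_logp_preferred
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    # Minimized when the copy raises preferred outputs and lowers rejected ones
    return -F.logsigmoid(beta * (preferred_logratio - rejected_logratio)).mean()
```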

How it works: The authors used DPO to fine-tune a pretrained GPT-J to summarize text in the TL;DR dataset.

  • The authors prompted GPT-J to produce pairs of outputs. Given a pair of outputs, humans rated which they preferred.
  • Given an annotated pair, a copy of GPT-J was trained to generate sequences of tokens for preferred outputs with higher probability than that of the original model, and sequences of tokens for the other outputs with lower probability than that of the original model (as sketched after this list).
  • The loss function was constrained to keep the copy from deviating too far from the original model. This step avoided drastic changes that might induce problems such as catastrophic forgetting.
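
The per-response log-probabilities that feed a loss like the one above can be read off a causal language model by summing token-level log-probabilities over the response. A hedged sketch, assuming a Hugging Face-style model whose forward pass returns logits and a `response_mask` that marks response tokens with 1 and prompt tokens with 0 (this interface is an assumption, not taken from the paper):

```python
import torch.nn.functional as F

def sequence_logprob(model, input_ids, response_mask):
    """Summed log-probability the model assigns to the response tokens,
    given prompt + response token ids of shape (batch, seq_len)."""
    logits = model(input_ids).logits                   # (batch, seq_len, vocab)
    logprobs = F.log_softmax(logits[:, :-1], dim=-1)   # position t predicts token t + 1
    targets = input_ids[:, 1:].unsqueeze(-1)
    token_logprobs = logprobs.gather(-1, targets).squeeze(-1)
    # Zero out prompt positions so only the response contributes to the sum
    return (token_logprobs * response_mask[:, 1:]).sum(dim=-1)
```

The same function would be applied to the frozen original model with gradients disabled, so that only the copy is updated.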

Results: The authors used GPT-4 to estimate whether humans would prefer summaries written by GPT-J fine-tuned via either DPO or RLHF over human-written summaries. In fine-tuning GPT-J via DPO and RLHF, they experimented with sampling temperatures (a hyperparameter that controls the randomness in choosing the next token, where higher values increase randomness) between 0 and 1 and used the best-performing value. GPT-4 estimated that humans would prefer summaries generated by GPT-J fine-tuned via DPO 61 percent of the time and summaries generated by GPT-J fine-tuned via RLHF 57 percent of the time. In a separate test, human volunteers judged 272 summaries generated by the two models at their best-performing sampling temperatures; the judges preferred the DPO model’s summaries 58 percent of the time.
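
For readers unfamiliar with temperature: it divides the model’s logits before the softmax that defines the next-token distribution, so lower values concentrate probability on the highest-scoring tokens. A tiny illustration with made-up logits:

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.5])  # made-up scores for three candidate tokens
for temperature in (0.25, 1.0):
    probs = torch.softmax(logits / temperature, dim=-1)
    print(f"temperature={temperature}: {[round(p, 3) for p in probs.tolist()]}")
# Lower temperature concentrates probability on the top token (less randomness)
```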

Why it matters: RLHF is a fundamental technique for making large language models safe for a wide variety of users. Improvements (in this case, a significant boost in efficiency) can help teams build more useful models faster and with fewer resources. It’s inspiring that there’s still room for improvement in core LLM building blocks.

We’re thinking: People often ask whether university labs — which don’t have the massive computational resources of big tech — can still do cutting-edge research on large language models. The answer, to me, is obviously yes! This work is a beautiful example.
