RLHF Response Explorer | Dheeraj Dhillon

Qualitative policy evaluation

Compare the instruction model with its PPO-aligned policy.

Select one of 16 manually reviewed examples from the final 1,024-token policy suite. Reward scores are shown as diagnostics only; the review label records whether each example is an improvement, a failure, or a reward-model mismatch.

Evaluation example

Manual review

Prompt

Evaluation example

Left: Base Qwen2.5-0.5B-Instruct (Base)

Right: PPO-aligned policy (PPO)

Reward scores come from the learned reward model and are not human quality labels.
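
To illustrate why the reward scores and the manual review label are kept separate, here is a minimal Python sketch of how one evaluation example could be represented and summarized. The record layout, the field names (base_reward, ppo_reward, review_label), the label strings, and the numeric values are all assumptions made for illustration; they are not the actual schema or contents of the evaluation artifact.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical label strings mirroring the three review outcomes named above.
REVIEW_LABELS = ("improvement", "failure", "reward_model_mismatch")


@dataclass(frozen=True)
class Example:
    """One side-by-side comparison (hypothetical schema)."""
    prompt: str
    base_response: str
    ppo_response: str
    base_reward: float   # learned reward-model score, diagnostic only
    ppo_reward: float    # learned reward-model score, diagnostic only
    review_label: str    # human judgment, recorded independently of the scores

    def __post_init__(self):
        if self.review_label not in REVIEW_LABELS:
            raise ValueError(f"unknown review label: {self.review_label!r}")

    @property
    def reward_delta(self) -> float:
        """Reward-model score change from the base model to the PPO policy."""
        return self.ppo_reward - self.base_reward


def summarize(examples):
    """Group examples by review label; return (count, mean reward delta)."""
    deltas = defaultdict(list)
    for ex in examples:
        deltas[ex.review_label].append(ex.reward_delta)
    return {label: (len(d), sum(d) / len(d)) for label, d in deltas.items()}


# Made-up records, for illustration only.
demo = [
    Example("p1", "base a", "ppo a", 0.5, 1.5, "improvement"),
    Example("p2", "base b", "ppo b", 0.0, 2.0, "improvement"),
    Example("p3", "base c", "ppo c", 1.0, 0.5, "failure"),
    Example("p4", "base d", "ppo d", 0.25, 1.25, "reward_model_mismatch"),
]

summary = summarize(demo)
for label in REVIEW_LABELS:
    count, mean_delta = summary[label]
    print(f"{label}: n={count}, mean reward delta={mean_delta:+.2f}")
```

In the made-up data, the record labeled reward_model_mismatch has a positive reward delta even though the reviewer did not record it as an improvement, which is the kind of case a score-only view would miss.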