RLHF Response Explorer
Side-by-side Base and PPO generations from the final 2,017-prompt evaluation
Qualitative policy evaluation
Compare the instruction model with its PPO-aligned policy.
Select one of 16 manually reviewed examples from the final 1,024-token policy suite. Reward scores are shown as diagnostics; the review label records whether each example is an improvement, a failure, or a reward-model mismatch.
Manual review
Prompt
Evaluation example
Left
Base Qwen2.5-0.5B-Instruct
Right
PPO-aligned policy
Reward scores come from the learned reward model and are not human quality labels.