RLHF Response Explorer

Side-by-side Base and PPO generations from the final 2,017-prompt evaluation

Qualitative policy evaluation

Compare the base instruction model (Qwen2.5-0.5B-Instruct) with its PPO-aligned policy.

Select one of 16 manually reviewed examples from the final 1,024-token policy suite. Reward scores are shown as diagnostics, while the review label records whether the example is an improvement, failure, or reward-model mismatch.
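As a rough illustration of how a reviewed example might be represented, the sketch below pairs the two generations with their reward scores and a review label. The field names, the label strings, and the sample values are all assumptions made for this example and are not taken from the actual evaluation artifact. It also shows why the reward score is treated only as a diagnostic: a higher PPO reward does not always match the reviewer's label.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical label strings; the page only names the three categories in prose.
REVIEW_LABELS = ("improvement", "failure", "reward_model_mismatch")


@dataclass(frozen=True)
class ReviewedExample:
    """One manually reviewed prompt with both generations (assumed schema)."""
    prompt: str
    base_response: str
    ppo_response: str
    base_reward: float   # learned reward-model score, not a human label
    ppo_reward: float
    review_label: str

    def __post_init__(self) -> None:
        if self.review_label not in REVIEW_LABELS:
            raise ValueError(f"unknown review label: {self.review_label!r}")

    @property
    def reward_delta(self) -> float:
        """PPO reward minus base reward."""
        return self.ppo_reward - self.base_reward


def label_counts(examples):
    """Count examples per review label, including labels with zero examples."""
    counts = Counter(e.review_label for e in examples)
    return {label: counts[label] for label in REVIEW_LABELS}


def reward_disagrees_with_review(examples):
    """Examples where the reward rose but the reviewer did not mark an improvement."""
    return [
        e for e in examples
        if e.reward_delta > 0 and e.review_label != "improvement"
    ]


# Placeholder records with invented values, for illustration only.
examples = [
    ReviewedExample("p1", "base text", "ppo text", 0.10, 0.80, "improvement"),
    ReviewedExample("p2", "base text", "ppo text", 0.40, 0.90, "reward_model_mismatch"),
    ReviewedExample("p3", "base text", "ppo text", 0.50, 0.20, "failure"),
]

print(label_counts(examples))
print([e.prompt for e in reward_disagrees_with_review(examples)])
```

Keeping the review label separate from the reward fields makes it easy to list cases where the two signals point in different directions.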

Manual review

Prompt

Evaluation example

Left: Base Qwen2.5-0.5B-Instruct (Base)

Right: PPO-aligned policy (PPO)

Reward scores come from the learned reward model and are not human quality labels.