Qwen-Image-2.0-RL's Real Lesson Is How Carefully RL Must Be Applied to Diffusion Models
Alibaba's Qwen team released Qwen-Image-2.0-RL, a reinforcement-learning fine-tuned version of their image generation model. It improves benchmark scores, including a 2.61-point gain on Qwen-Image-Bench and higher arena Elo ratings for both text-to-image generation and image editing. The team found that naively applying standard RL reward optimization caused training instability and model degradation. A key finding involved classifier-free guidance (CFG): using it during both rollout and training caused image collapse, while omitting it entirely hurt stylization. The solution was to apply CFG only during rollout sampling and exclude it from the policy optimization step. The team also found that training across all 40 denoising timesteps led to rapid reward hacking, so they restricted updates to a subset focused on the early, high-noise timesteps that govern broad image structure. The paper's broader point is that effective post-training depends not just on choosing the right reward signal, but on carefully controlling where and how that reward is allowed to influence the model.
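The two controls described above can be illustrated with a minimal Python sketch. It is not the Qwen team's implementation: `toy_model`, the function names, the guidance scale, and the `early_fraction` value of 0.25 are all hypothetical, and the summary does not say which or how many of the 40 timesteps were actually trained. The only parts taken from the summary are the 40-step schedule, the use of CFG in rollout but not in the policy update, and the focus on early, high-noise steps. The guidance formula itself is the standard CFG combination.

```python
from typing import Callable, List, Optional, Sequence

NUM_STEPS = 40  # from the summary: 40 denoising timesteps

# A denoiser maps (latent, step index, prompt or None) to a prediction.
Model = Callable[[Sequence[float], int, Optional[str]], List[float]]


def cfg_combine(cond: Sequence[float], uncond: Sequence[float], scale: float) -> List[float]:
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def rollout_prediction(model: Model, x: Sequence[float], t: int, prompt: str, scale: float) -> List[float]:
    """Rollout sampling: CFG is ON, so images are drawn with guidance."""
    return cfg_combine(model(x, t, prompt), model(x, t, None), scale)


def training_prediction(model: Model, x: Sequence[float], t: int, prompt: str) -> List[float]:
    """Policy optimization step: CFG is OFF, only the conditional branch is used."""
    return model(x, t, prompt)


def trainable_steps(num_steps: int = NUM_STEPS, early_fraction: float = 0.25) -> List[int]:
    """Indices of the early, high-noise steps that receive RL updates.

    Assumes step 0 is the noisiest step. early_fraction is a hypothetical
    value chosen for illustration, not a number from the paper.
    """
    k = max(1, int(num_steps * early_fraction))
    return list(range(k))


def toy_model(x: Sequence[float], t: int, prompt: Optional[str]) -> List[float]:
    """Hypothetical stand-in for a denoiser; the prompt only adds a constant shift."""
    bias = 1.0 if prompt is not None else 0.0
    return [0.5 * xi + bias for xi in x]


if __name__ == "__main__":
    steps = trainable_steps()
    print(f"updating {len(steps)} of {NUM_STEPS} steps: {steps}")
    print("rollout :", rollout_prediction(toy_model, [2.0], 0, "a cat", scale=4.0))
    print("training:", training_prediction(toy_model, [2.0], 0, "a cat"))
```

In this sketch the rollout and training predictions differ on purpose: samples are drawn from the guided distribution, while the quantity being optimized comes from the unguided conditional model, and steps outside `trainable_steps()` would simply be skipped when accumulating the policy loss.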
This is an AI-generated summary. ShortSingh links to the original source for the complete article.