Better Rubrics Hurt a Small LLM Judge but Boost Large Ones, Developer Finds
A developer experimenting with LLM-based evaluation judges found that improving the scoring rubric had opposite effects depending on model size. A small local model (Qwen2.5-1.5B) saw its agreement with human votes drop from 67% to 54%, a 13-percentage-point loss, when given a detailed, criteria-rich rubric. In contrast, a large model (DeepSeek-V4-Pro via OpenRouter) improved from 65% to 79% agreement under the same rubric, a 14-percentage-point gain. The pattern held for a second large model, Qwen 32B, which also produced significantly fewer ties with the better rubric. The findings suggest that detailed evaluation instructions sharpen capable models but overwhelm smaller ones, challenging the common assumption that a better rubric is a free, universal improvement.
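The summary does not say how agreement was computed, so the following is only a minimal sketch of one plausible reading: agreement is the share of items where the judge's verdict matches the human vote, and a tie is a verdict carrying a dedicated label. The function names, the "A"/"B"/"tie" labels, and the toy votes are all assumptions made for illustration; only the percentages passed to `point_change` come from the summary above.

```python
from typing import Sequence


def agreement_rate(judge: Sequence[str], human: Sequence[str]) -> float:
    """Fraction of items where the judge's verdict equals the human vote."""
    if len(judge) != len(human):
        raise ValueError("judge and human vote lists must have the same length")
    if not judge:
        raise ValueError("at least one vote is required")
    matches = sum(j == h for j, h in zip(judge, human))
    return matches / len(judge)


def tie_rate(judge: Sequence[str], tie_label: str = "tie") -> float:
    """Fraction of judge verdicts that are ties (the label is an assumption)."""
    if not judge:
        raise ValueError("at least one vote is required")
    return sum(v == tie_label for v in judge) / len(judge)


def point_change(before: float, after: float) -> float:
    """Change between two rates, in percentage points (not percent)."""
    return round((after - before) * 100, 1)


# Toy data, invented for illustration only.
human_votes = ["A", "B", "A", "B"]
judge_votes = ["A", "B", "tie", "A"]

print(agreement_rate(judge_votes, human_votes))  # 2 of 4 verdicts match
print(tie_rate(judge_votes))                     # 1 of 4 verdicts is a tie

# Rates reported in the summary, expressed as fractions.
print(point_change(0.67, 0.54))  # small model
print(point_change(0.65, 0.79))  # large model
```

Under this definition a tie always counts as a disagreement with a human vote for "A" or "B", so a rubric that reduces ties can raise agreement on its own. Whether the original experiment scored ties this way is not stated.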
This is an AI-generated summary. ShortSingh links to the original source for the complete article.