Developer Tests LLM-as-a-Judge Against Human Votes, Finds It Agrees Only 43% of the Time
A developer built a simple LLM-based grading system that uses Qwen2.5-1.5B-Instruct to score chatbot answers on a 1–10 scale, then benchmarked it against real human judgments from the LMSYS Chatbot Arena dataset. The judge proved unstable: it returned slightly different scores for the same answer across repeated runs and rarely ventured outside a narrow 7–8 band, regardless of actual answer quality. Tested on 60 head-to-head answer pairs, it declared a tie on 20 pairs where humans had a clear preference, which suggests it lacks the resolution to distinguish good responses from great ones. On the 40 pairs where it gave a decisive verdict, it matched the human judgment 65% of the time. Counting ties as failures, however, overall agreement with humans drops to about 43% (26 of 60 pairs). The experiment highlights that naive LLM-as-a-judge setups can produce misleading evaluation signals, particularly for questions requiring real-world awareness such as the current date.
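The two agreement figures differ only in their denominator, which a short calculation makes explicit. The following is a minimal sketch, not the developer's code: the function name `agreement_rates` is hypothetical, and the count of 26 matching verdicts is inferred from the reported 65% of 40 decisive pairs rather than stated directly in the summary.

```python
def agreement_rates(total_pairs: int, ties: int, decisive_matches: int) -> tuple[float, float]:
    """Return (agreement on decisive pairs, agreement over all pairs).

    Ties are excluded from the first rate and counted as failures in the second.
    """
    decisive_pairs = total_pairs - ties
    if decisive_pairs <= 0:
        raise ValueError("need at least one decisive pair")
    return decisive_matches / decisive_pairs, decisive_matches / total_pairs


# Figures reported in the summary above.
total_pairs = 60
ties = 20

# Assumption: 65% of the 40 decisive pairs corresponds to 26 matching verdicts.
decisive_matches = round(0.65 * (total_pairs - ties))

decisive_rate, overall_rate = agreement_rates(total_pairs, ties, decisive_matches)
print(f"decisive-only agreement: {decisive_rate:.0%}")  # 65%
print(f"overall agreement:       {overall_rate:.0%}")   # 43%
```

Which figure to report depends on whether a tie is treated as an abstention or as a wrong answer. Since humans expressed a clear preference on the tied pairs, counting ties as failures is arguably the more honest measure here.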
This is an AI-generated summary. ShortSingh links to the original source for the complete article.