Developer Tests LLM-as-a-Judge Against Human Votes, Finds It Agrees Only 43% of the Time


A developer built a simple LLM-based grading system using Qwen2.5-1.5B-Instruct to score chatbot answers on a 1–10 scale and benchmarked it against real human judgments from the LMSYS Chatbot Arena dataset. The judge proved unstable: it returned slightly different scores for the same answer across repeated runs and rarely ventured outside a narrow 7–8 band regardless of actual answer quality. When tested on 60 head-to-head answer pairs, the judge declared a tie on 20 pairs where humans had a clear preference, showing that it lacked the resolution to distinguish good responses from great ones. On the 40 pairs where it gave a decisive verdict, it matched human judgment 65% of the time (26 pairs), but with ties counted as failures, overall agreement with humans dropped to just 43% (26 of 60). The experiment highlights that naive LLM-as-a-judge setups can produce misleading evaluation signals, particularly for questions requiring real-world awareness such as the current date.
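The two agreement figures can be reproduced with a few lines of arithmetic. The sketch below is not the developer's code: the `PairResult` type, the `summarize` function, and the synthetic sample are hypothetical, and the count of 26 matching pairs is inferred from the reported 65% of 40 decisive verdicts rather than stated in the summary.

```typescript
// Hypothetical types for illustration; the original experiment's data format is not shown.
type Verdict = "a" | "b" | "tie";

interface PairResult {
  human: "a" | "b"; // humans had a clear preference on every pair in this sample
  judge: Verdict;   // the LLM judge may decline to pick a winner
}

interface Agreement {
  decisive: number;          // pairs where the judge picked a winner
  ties: number;              // pairs where the judge scored both answers equally
  decisiveAgreement: number; // matches / decisive pairs
  overallAgreement: number;  // matches / all pairs (ties count as failures)
}

function summarize(results: PairResult[]): Agreement {
  const ties = results.filter((r) => r.judge === "tie").length;
  const decisive = results.length - ties;
  const matches = results.filter((r) => r.judge === r.human).length;
  return {
    decisive,
    ties,
    decisiveAgreement: decisive === 0 ? 0 : matches / decisive,
    overallAgreement: results.length === 0 ? 0 : matches / results.length,
  };
}

// Synthetic data shaped like the reported counts (assumed split: 26 matches, 14 misses, 20 ties).
const sample: PairResult[] = [
  ...Array.from({ length: 26 }, (): PairResult => ({ human: "a", judge: "a" })),
  ...Array.from({ length: 14 }, (): PairResult => ({ human: "a", judge: "b" })),
  ...Array.from({ length: 20 }, (): PairResult => ({ human: "b", judge: "tie" })),
];

const summary = summarize(sample);
console.log(summary.decisiveAgreement.toFixed(2), summary.overallAgreement.toFixed(2));
// prints: 0.65 0.43
```

Reporting both numbers side by side makes the cost of ties visible: the same 26 matches look like 65% or 43% depending on the denominator.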

Read the full story at DEV Community

This is an AI-generated summary. ShortSingh links to the original source for the complete article.


Why Mixing State, Derived State, and Effects Breaks Frontend Architecture

Modern frontend development often groups API payloads, computed values, and side-effect results under the broad label of 'state,' blurring critical architectural boundaries. Experts argue that true state should serve as the sole source of truth in a data flow, while derived values — such as filtered lists or form validation statuses — should be computed automatically rather than stored independently. When derived state is promoted to standalone state, developers must manually synchronize it, introducing risks like data drift and timing-dependent bugs. A common React pattern using useEffect to keep a filtered user list in sync illustrates how this approach fragments a simple derivation into three disjointed, fragile parts. The core argument is that Effects are among the most abused mechanisms in frontend development because their flexibility tempts developers to offload data-flow problems into them rather than addressing structural design.
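The distinction between state and derived values can be sketched without any framework. The example below is an illustration, not code from the original article: `User`, `UserListState`, and `selectVisibleUsers` are hypothetical names. It stores only the user list and the search query, and computes the filtered list on demand instead of keeping a second copy that an Effect would have to resynchronize.

```typescript
// Hypothetical shapes for illustration only.
interface User {
  id: number;
  name: string;
}

// The only stored state: raw data plus the user's input.
interface UserListState {
  users: User[];
  query: string;
}

// Derived value: a pure function of state, never stored separately,
// so it cannot drift out of sync with `users` or `query`.
function selectVisibleUsers(state: UserListState): User[] {
  const needle = state.query.trim().toLowerCase();
  return state.users.filter((u) => u.name.toLowerCase().includes(needle));
}

let state: UserListState = {
  users: [
    { id: 1, name: "Alice" },
    { id: 2, name: "Bob" },
    { id: 3, name: "Alina" },
  ],
  query: "",
};

// Updating the source of truth is the only write; the derivation follows automatically.
state = { ...state, query: "al" };
const visible = selectVisibleUsers(state);
console.log(visible.map((u) => u.name).join(", "));
// prints: Alice, Alina
```

In a React component the same idea would typically mean computing the filtered list during render (optionally memoized) rather than copying it into a second piece of state from inside `useEffect`.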

Read more at DEV Community


18-Year-Old Kerala Developer Builds Open-Source Terraform Drift Scanner Before College

Jeffrin, an 18-year-old developer from Kerala, India, built and released SynchroIaC, an open-source tool designed to detect and explain Terraform infrastructure drift in AWS environments. The tool integrates via a single GitHub Action, compares Terraform state against live AWS resources using a read-only IAM role, and surfaces discrepancies on a web dashboard. Each detected drift is automatically classified by risk level and accompanied by an AI-generated explanation, with an option to auto-generate a fix pull request. The project was built in two days using a stack that includes Go, Next.js, Supabase, and OpenRouter AI models, with AWS credentials remaining entirely within the user's own GitHub Actions environment. Jeffrin has published the tool on GitHub and the GitHub Actions Marketplace and is seeking community feedback ahead of starting college in nine months.
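The core comparison step, declared state versus live resources, can be sketched as a simple attribute diff. This is not SynchroIaC's implementation (the summary says the tool is written in Go and does not show its internals); the `Attributes` and `Drift` types, the `detectDrift` function, and the resource IDs are all hypothetical, and real Terraform state is considerably more nested than the flat key-value maps assumed here.

```typescript
// Hypothetical flat view of a resource: attribute name -> value.
type Attributes = Record<string, string>;

interface Drift {
  resourceId: string;
  attribute: string;
  expected: string | undefined; // value recorded in state
  actual: string | undefined;   // value observed on the live resource
}

// Compare every resource recorded in state against its live counterpart.
// Resources that exist live but not in state are ignored in this sketch.
function detectDrift(
  declared: Record<string, Attributes>,
  live: Record<string, Attributes>,
): Drift[] {
  const drifts: Drift[] = [];
  for (const resourceId of Object.keys(declared)) {
    const expectedAttrs = declared[resourceId];
    const actualAttrs: Attributes | undefined = live[resourceId];
    if (actualAttrs === undefined) {
      // The whole resource is missing from the live environment.
      drifts.push({ resourceId, attribute: "(resource)", expected: "present", actual: undefined });
      continue;
    }
    // Union of attribute names from both sides, without duplicates.
    const names = Object.keys(expectedAttrs).concat(
      Object.keys(actualAttrs).filter((k) => !(k in expectedAttrs)),
    );
    for (const attribute of names) {
      if (expectedAttrs[attribute] !== actualAttrs[attribute]) {
        drifts.push({
          resourceId,
          attribute,
          expected: expectedAttrs[attribute],
          actual: actualAttrs[attribute],
        });
      }
    }
  }
  return drifts;
}

// Made-up example data.
const declaredState = {
  "sg-example": { ingress_port: "443" },
  "bucket-example": { acl: "private" },
};
const liveResources = {
  "sg-example": { ingress_port: "22" },
  "bucket-example": { acl: "private" },
};
const drifts = detectDrift(declaredState, liveResources);
console.log(drifts.length, drifts[0].resourceId, drifts[0].attribute);
// prints: 1 sg-example ingress_port
```

Risk classification and explanation would then operate on the resulting `Drift` records; how the actual tool does that is not described in the summary.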

Read more at DEV Community


Three Developers Built a Multi-Agent AI System Overnight Using Strict Code Ownership

A three-person team built a functioning multi-agent AI system with persistent memory and cost-aware routing in a single overnight session. The key to their success was dividing the project into three independent layers — memory, runtime, and UI/agents — with each developer owning separate files to avoid merge conflicts. Before writing any code, the team agreed on shared function signatures that served as contracts between modules, allowing parallel development using placeholder implementations. Several real bugs emerged during the build, including a non-existent dependency version, a reasoning model that exhausted its token budget before responding, and an async event loop conflict inside Streamlit. The team documented these issues and their fixes as lessons for anyone attempting a similar rapid-build approach.
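The "signatures as contracts" approach can be sketched as interfaces agreed up front, with placeholder implementations standing in until each owner finishes the real module. The code below is not the team's code: `MemoryStore`, `ModelRouter`, `handleTurn`, and the length-based routing rule are hypothetical, chosen only to mirror the memory, runtime, and UI/agents split described above.

```typescript
// Contracts agreed before any implementation work (hypothetical names).
interface MemoryStore {
  remember(sessionId: string, text: string): void;
  recall(sessionId: string): string[];
}

interface ModelRouter {
  // Cost-aware routing: decide which model tier should handle a prompt.
  route(prompt: string): "cheap" | "premium";
}

// Placeholder owned by the memory developer; a persistent store can replace it later.
class InMemoryStore implements MemoryStore {
  private readonly notes = new Map<string, string[]>();

  remember(sessionId: string, text: string): void {
    const existing = this.notes.get(sessionId) ?? [];
    this.notes.set(sessionId, [...existing, text]);
  }

  recall(sessionId: string): string[] {
    return this.notes.get(sessionId) ?? [];
  }
}

// Placeholder owned by the runtime developer; the threshold is an arbitrary assumption.
class LengthBasedRouter implements ModelRouter {
  route(prompt: string): "cheap" | "premium" {
    return prompt.length > 200 ? "premium" : "cheap";
  }
}

// The UI/agents layer depends only on the interfaces, so it can be
// written and tested in parallel against the placeholders.
function handleTurn(
  memory: MemoryStore,
  router: ModelRouter,
  sessionId: string,
  prompt: string,
): string {
  const tier = router.route(prompt);
  const history = memory.recall(sessionId);
  memory.remember(sessionId, prompt);
  return `[${tier}] ${history.length} prior message(s)`;
}

const memory = new InMemoryStore();
const router = new LengthBasedRouter();
console.log(handleTurn(memory, router, "s1", "hello"));
console.log(handleTurn(memory, router, "s1", "hello again"));
// prints: [cheap] 0 prior message(s)
// prints: [cheap] 1 prior message(s)
```

Because each class lives in its own file and only the interfaces are shared, swapping a placeholder for the real implementation does not touch the other developers' code.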

Read more at DEV Community


GEO vs SEO: Why AI Answer Engines Demand a Different Content Strategy

Generative Engine Optimization (GEO) is an emerging practice focused on getting brands cited directly inside AI-generated answers from tools like ChatGPT, Perplexity, and Google AI Overviews, rather than ranked on a traditional results page. Unlike SEO, which targets search ranking signals and backlinks for crawler visibility, GEO prioritizes clear, quotable answers, specific data points, and structured content that AI models can easily extract and reference. Marketing teams in the US are increasingly noticing a disconnect where traffic remains steady but leads decline — a gap experts attribute to poor GEO positioning rather than tracking errors. Practical steps recommended for marketers and IT teams include surfacing direct answers in the opening sentences of high-traffic pages, replacing vague claims with concrete figures, and auditing what AI tools currently say about relevant topics before creating new content. As user search behavior shifts toward reading AI-generated answers rather than clicking multiple links, brands that adapt their content for GEO early are expected to gain a sustained citation advantage.

Read more at DEV Community