🧠 How I Used Multiple LLMs to Evaluate Each Other: A Meta-Judgment Experiment in AI Reasoning

Originally published on LinkedIn on August 03, 2025

The Experiment: AI Judging AI

Can artificial intelligence critique itself? I set out to find out by running a meta-experiment: one large language model (LLM) generated a challenging question, several different LLMs answered it, and yet another LLM judged those answers without being told which model wrote which.

The results were far from trivial. Different AI judges valued different qualities: some leaned toward factual precision, while others rewarded creative or philosophical insight. Perhaps most surprising, most models did not rate their own answers highest, hinting at an unexpected “algorithmic humility.”

Key Findings

Beyond curiosity, the experiment opens important questions:

Distinct Judgment Styles: Are unique evaluation patterns emerging in different AI models?
Objectivity Question: Could AI-to-AI evaluation be more objective than human assessment?
Trust in AI Evaluation: Should we trust LLMs to grade reasoning, argumentation, or even code?

Methodology Overview

By combining automated question generation, multi-model answering, blind evaluation, and reasoning analysis, this work offers a glimpse into AI’s evolving ability not just to produce text but to critically reflect on it.
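To make the blind-evaluation step concrete, here is a minimal sketch of one round, not the exact code behind the experiment. It assumes each model can be wrapped as a plain function from prompt to text; the names `Model`, `anonymize`, `run_round`, and the stub models are all hypothetical, and the stubs stand in for real LLM calls.

```python
import random
from typing import Callable, Dict, Tuple

# Hypothetical type: a "model" is any function mapping a prompt to text.
Model = Callable[[str], str]


def anonymize(answers: Dict[str, str], seed: int = 0) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Shuffle answers and relabel them so a judge cannot see authorship.

    Returns (label -> answer text, label -> model name).
    """
    names = list(answers)
    random.Random(seed).shuffle(names)
    labels = [f"Answer {chr(ord('A') + i)}" for i in range(len(names))]
    blinded = {label: answers[name] for label, name in zip(labels, names)}
    key = dict(zip(labels, names))
    return blinded, key


def run_round(question_model: Model, answer_models: Dict[str, Model], seed: int = 0):
    """One round: generate a question, collect answers, blind them for judging."""
    question = question_model("Write one challenging reasoning question.")
    answers = {name: model(question) for name, model in answer_models.items()}
    blinded, key = anonymize(answers, seed)
    # The judge only ever sees neutral labels, never model names.
    judge_prompt = question + "\n\n" + "\n\n".join(
        f"{label}:\n{text}" for label, text in blinded.items()
    )
    return question, judge_prompt, key


# Stub models (assumption: any real LLM client could be wrapped the same way).
stubs: Dict[str, Model] = {
    "model_1": lambda q: "first stub answer",
    "model_2": lambda q: "second stub answer",
    "model_3": lambda q: "third stub answer",
}
question, judge_prompt, key = run_round(
    lambda p: "Is a heap of sand minus one grain still a heap?", stubs
)
print(judge_prompt)
```

The `key` mapping stays outside the judge prompt, so scores can be matched back to models only after judging is done.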

What This Means for AI Development

The implications extend beyond academic curiosity. As AI systems become more sophisticated, understanding how they evaluate quality, reasoning, and creativity becomes crucial for:

Educational Technology: AI tutors that can fairly assess student work
Content Moderation: Systems that can evaluate nuanced content quality
Code Review: AI assistants that can critique and improve programming solutions
Research Validation: Automated peer review systems for academic work

The Surprising Results

The most intriguing finding was the apparent “algorithmic humility”: most models rated at least one other response higher than their own. This challenges the assumption that an AI judge will favor its own output, and suggests these models assess quality in a more nuanced way than simple self-preference.
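One way to check for this kind of self-preference, once blind scores are matched back to their authors, is sketched below. The score table is an illustrative placeholder, not data from the experiment, and the function names are hypothetical. The sketch assumes every judge also submitted an answer and scored at least one other model.

```python
from typing import Dict

# Illustrative placeholder scores only, NOT results from the experiment.
# scores[judge][author] = score the judge gave to the author's answer.
scores: Dict[str, Dict[str, float]] = {
    "model_1": {"model_1": 7.0, "model_2": 8.0, "model_3": 6.0},
    "model_2": {"model_1": 6.0, "model_2": 9.0, "model_3": 7.0},
    "model_3": {"model_1": 8.0, "model_2": 7.0, "model_3": 7.5},
}


def self_preference_gap(table: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """For each judge: its own score minus the mean score it gave to others.

    A positive gap suggests self-preference; zero or negative suggests none.
    """
    gaps = {}
    for judge, row in table.items():
        others = [s for author, s in row.items() if author != judge]
        gaps[judge] = row[judge] - sum(others) / len(others)
    return gaps


def rated_self_highest(table: Dict[str, Dict[str, float]]) -> Dict[str, bool]:
    """For each judge: did its own answer receive its top score (ties count)?"""
    return {judge: row[judge] == max(row.values()) for judge, row in table.items()}


gaps = self_preference_gap(scores)
top = rated_self_highest(scores)
for judge in scores:
    print(judge, "gap:", gaps[judge], "rated self highest:", top[judge])
```

With more than a handful of questions, averaging the gap per judge across rounds would give a steadier picture than any single round.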

Read the full detailed experiment and analysis on LinkedIn →

Looking for more AI experiments and insights? Check out my other field notes on AI reasoning and evaluation methods.

Sanuwar Rashid
