Originally published on LinkedIn on August 03, 2025
The Experiment: AI Judging AI
Can artificial intelligence critique itself? I set out to find out by running a meta-experiment: letting one large language model (LLM) generate a challenging question, having several different LLMs answer it, and then asking yet another LLM to judge those answers—without knowing which model wrote which.
The results were far from trivial. Different AI judges valued different qualities—some leaned toward factual precision, others rewarded creative or philosophical insight. Perhaps the most surprising? Most models did not rate their own answers highest, hinting at an unexpected “algorithmic humility.”
Key Findings
Beyond curiosity, these findings raise important questions:
- Distinct Judgment Styles: Are unique evaluation patterns emerging in different AI models?
- Objectivity Question: Could AI-to-AI evaluation be more objective than human assessment?
- Trust in AI Evaluation: Should we trust LLMs to grade reasoning, argumentation, or even code?
Methodology Overview
The experiment combines four steps: automated question generation, multi-model answering, blind evaluation, and analysis of the judges' reasoning. Together they offer a glimpse into AI’s evolving ability not just to produce text but to critically reflect on it.
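To make the blind-evaluation step concrete, here is a minimal sketch of how such a round could be wired together. It is not the code used in the experiment: the function names (`blind_labels`, `run_round`), the neutral "Response A/B" labels, and the stub models are all hypothetical, and real LLM API calls would replace the lambdas.

```python
import random

def blind_labels(answers, seed=0):
    """Shuffle answers and give them neutral labels so judges cannot see authorship."""
    rng = random.Random(seed)
    models = list(answers)
    rng.shuffle(models)
    # key maps each neutral label back to the model that wrote the answer
    key = {f"Response {chr(65 + i)}": m for i, m in enumerate(models)}
    blinded = {label: answers[m] for label, m in key.items()}
    return blinded, key

def run_round(question, answerers, judges, seed=0):
    """Collect answers, blind them, and let every judge score every answer."""
    answers = {name: fn(question) for name, fn in answerers.items()}
    blinded, key = blind_labels(answers, seed)
    scores = {}
    for judge_name, judge_fn in judges.items():
        by_label = judge_fn(question, blinded)  # judge sees labels only
        # un-blind after scoring: scores[judge][author] = score
        scores[judge_name] = {key[label]: s for label, s in by_label.items()}
    return scores

# Hypothetical stand-ins for real model calls (assumption, for illustration only)
answerers = {
    "model_a": lambda q: "A short answer.",
    "model_b": lambda q: "A longer answer that walks through its reasoning.",
}
judges = {
    # toy judge: scores each blinded answer by its length
    "judge_len": lambda q, blinded: {lbl: len(txt) for lbl, txt in blinded.items()},
}

scores = run_round("What is a fair test of reasoning?", answerers, judges)
```

The design point is that the label-to-model key never reaches the judge; scores are mapped back to authors only after judging is done.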
What This Means for AI Development
The implications extend beyond academic curiosity. As AI systems become more sophisticated, understanding how they evaluate quality, reasoning, and creativity becomes crucial for:
- Educational Technology: AI tutors that can fairly assess student work
- Content Moderation: Systems that can evaluate nuanced content quality
- Code Review: AI assistants that can critique and improve programming solutions
- Research Validation: Automated peer review systems for academic work
The Surprising Results
The most intriguing finding was the apparent “algorithmic humility”: most models did not rate their own answers highest. This challenges the assumption that a model will automatically favor its own output, and it suggests that AI quality assessment may be more nuanced than a simple self-preference bias would predict.
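One way to check a claim like this is to look at each judge's scores for its own answer versus everyone else's. The sketch below assumes a score table of the form `scores[judge][author]`; the function name and the numbers are invented purely for illustration and are not results from the experiment.

```python
def rated_self_highest(scores):
    """For each model that both judged and answered, report whether it
    scored its own answer strictly higher than every other answer."""
    result = {}
    for judge, row in scores.items():
        if judge not in row:
            continue  # this judge did not submit an answer
        others = [s for author, s in row.items() if author != judge]
        result[judge] = bool(others) and row[judge] > max(others)
    return result

# Hypothetical scores (made up for this example): scores[judge][author]
example_scores = {
    "model_a": {"model_a": 7, "model_b": 9, "model_c": 8},
    "model_b": {"model_a": 6, "model_b": 8, "model_c": 7},
    "model_c": {"model_a": 8, "model_b": 9, "model_c": 7},
}

flags = rated_self_highest(example_scores)
humble = [model for model, top in flags.items() if not top]
```

In this made-up table, `humble` contains the models that ranked at least one other answer at or above their own. Ties count as "not highest" here, which is a choice worth stating explicitly in any real analysis.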
Read the full detailed experiment and analysis on LinkedIn →