What happens when you make 4 LLMs debate each other instead of trusting one?

We all use ChatGPT or Claude daily at this point. But there's that moment when you get a confident answer and something feels off… so you open another tab, paste the same question into Gemini, get a completely different answer, and now you're more confused than before.

I got sick of the tab-switching game, so I set up a system where 4 models (GPT, Gemini, DeepSeek, Grok) debate a single question with assigned roles: one proposes, the others critique, and one synthesizes. They run multiple rounds until they converge on a verdict with a confidence score.
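The propose → critique → synthesize loop can be sketched roughly like this. This is a minimal illustration, not the actual implementation: the `ask` helper stands in for whatever API wrapper each provider needs, and `score_consensus` is a hypothetical scoring step (a real system would ask each model for an explicit vote).

```python
# Hypothetical role assignment; the post doesn't specify which model plays which role.
MODELS = {"proposer": "gpt", "critics": ["gemini", "deepseek"], "synthesizer": "grok"}

def ask(model: str, prompt: str) -> str:
    # Placeholder standing in for a real API call to the named provider.
    return f"[{model}] response to: {prompt[:40]}"

def score_consensus(critiques: list[str]) -> float:
    # Stub: fraction of critics signalling agreement. A real system would
    # parse an explicit agree/disagree verdict from each critic.
    votes = sum("agree" in c.lower() for c in critiques)
    return votes / len(critiques) if critiques else 0.0

def debate(question: str, rounds: int = 3, threshold: float = 0.75):
    # One model proposes an initial answer.
    answer = ask(MODELS["proposer"], f"Propose an answer: {question}")
    transcript = [("proposer", answer)]
    confidence = 0.0
    for _ in range(rounds):
        # The critic models attack the current answer.
        critiques = [ask(m, f"Critique this answer: {answer}") for m in MODELS["critics"]]
        transcript += [("critic", c) for c in critiques]
        # The synthesizer folds the critiques back into a revised verdict.
        answer = ask(MODELS["synthesizer"],
                     f"Synthesize a verdict from: {answer} | critiques: {critiques}")
        transcript.append(("synthesis", answer))
        # Stop early once the critics converge above the confidence threshold.
        confidence = score_consensus(critiques)
        if confidence >= threshold:
            break
    return answer, confidence, transcript
```

With the stub `ask`, no critic ever "agrees", so the loop runs all rounds and returns 0.0 confidence; wired to real APIs, the early-exit on `threshold` is what produces the "converge on a verdict with a confidence score" behavior described above.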

First thing I tested: "Which AI is the best overall?" Grok won at 75% consensus. GPT agreed, DeepSeek sided with Grok, and Gemini was the only one pushing for itself. None of them picked Claude or ChatGPT.

The interesting part is that weak arguments die fast when 3 models are actively attacking them. A single model can hallucinate confidently, but it's much harder to maintain a bad argument when critics are poking holes in it every round.

Has anyone else experimented with multi-model debate or consensus systems? Curious what approaches others are taking.

Here's the full debate result: Which AI model is currently the best overall, in terms of…

submitted by /u/Fluffy-4213