About AI WAR
A dynamic benchmark that tests AI reasoning through strategic warfare. No static questions. No memorized answers. Just pure tactical adaptation.
What Is It?
AI WAR is a turn-based wargame benchmark designed to test the "Dynamic Agency" of Large Language Models. Unlike traditional benchmarks that rely on static Q&A, AI WAR forces models to demonstrate real-time strategic thinking.
Economy Management
Each AI player starts with limited resources and must manage their economy across hundreds of turns. Players earn income from controlled territories and must balance spending between units, upgrades, and strategic cards.
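As a rough illustration of the income-and-spending loop described above, here is a minimal sketch. The class name, the flat per-territory income rate, and all numbers are hypothetical and are not taken from AI WAR itself.

```python
from dataclasses import dataclass


@dataclass
class PlayerEconomy:
    balance: int                     # current resources (hypothetical unit)
    territories: int                 # number of controlled territories
    income_per_territory: int = 10   # assumed flat rate, not from AI WAR

    def collect_income(self) -> int:
        """Add one turn of territory income and return the new balance."""
        self.balance += self.territories * self.income_per_territory
        return self.balance

    def spend(self, cost: int) -> bool:
        """Deduct cost if affordable; return whether the purchase went through."""
        if cost > self.balance:
            return False
        self.balance -= cost
        return True


econ = PlayerEconomy(balance=50, territories=3)
econ.collect_income()          # 50 + 3 * 10 = 80
bought_unit = econ.spend(60)   # affordable, leaves 20
bought_card = econ.spend(40)   # rejected, balance stays at 20
```

The point of the sketch is the trade-off: every purchase of a unit, upgrade, or card reduces what is left for the other two.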
Unit Purchasing
Models must decide which units to deploy based on the current battlefield state and opponent composition. Infantry, armor, artillery, and support units each have strengths and weaknesses in a rock-paper-scissors dynamic.
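One way to picture the rock-paper-scissors dynamic is a counter table plus a scoring rule. The sketch below is only an assumption: the actual matchups between infantry, armor, artillery, and support are not stated here, so the `BEATS` table and the `best_counter` helper are invented for illustration.

```python
from collections import Counter

# Hypothetical counter table: each key beats the unit types in its set.
# The real AI WAR matchups may differ.
BEATS = {
    "infantry": {"artillery"},
    "armor": {"infantry"},
    "artillery": {"armor"},
    "support": set(),
}


def best_counter(opponent_units):
    """Pick the unit type that beats the most units in the opponent's army."""
    counts = Counter(opponent_units)

    def score(unit):
        return sum(counts[target] for target in BEATS[unit])

    return max(BEATS, key=score)


choice = best_counter(["armor", "armor", "infantry"])
```

With two armor units and one infantry on the other side, this toy rule favors artillery, because in the assumed table artillery beats armor.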
Card System
Strategic cards provide special abilities: airstrikes, intel gathering, hacking, and more. Cards require proper "sources" to validate actions, testing whether models can distinguish facts from hallucinations.
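The exact source-validation mechanism is not described here. As one possible reading, a card play could cite identifiers of facts the model claims to know, and the game could reject any citation that matches nothing in the real game state. The function name and the identifiers below are hypothetical.

```python
def unsupported_sources(cited_sources, known_facts):
    """Return the cited sources that match no known fact (empty list means valid)."""
    return [source for source in cited_sources if source not in known_facts]


# Hypothetical fact identifiers that the game state actually contains.
known = {"scout_report_12", "radar_ping_7"}

# The second citation does not exist, so this play would be rejected.
missing = unsupported_sources(["scout_report_12", "satellite_photo_3"], known)
```

Under this reading, a model that invents a source fails validation, while one that cites only observed facts passes.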
Combat & Positioning
Battles play out on a spatial grid where terrain, range, and flanking matter. Models must maintain situational awareness and execute tactical maneuvers while following their strategic doctrine.
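To make the role of range on a grid concrete, here is a minimal sketch of a range check. It assumes Manhattan distance and integer grid coordinates. The distance metric AI WAR actually uses is not stated, and terrain and flanking are left out.

```python
def manhattan(a, b):
    """Grid distance counting only horizontal and vertical steps."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def can_attack(attacker_pos, target_pos, attack_range):
    """True if the target lies within the attacker's range (assumed metric)."""
    return manhattan(attacker_pos, target_pos) <= attack_range


near = can_attack((0, 0), (2, 1), attack_range=3)   # distance 3, in range
far = can_attack((0, 0), (3, 3), attack_range=3)    # distance 6, out of range
```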
The Three Pillars of Dynamic Agency
Why @low?
You may notice that reasoning models (like o1, o3-mini, DeepSeek R1) are tested at low reasoning effort. This is intentional.
Each match runs for hundreds of turns, and each turn requires an API call. At high reasoning effort, a single match could cost hundreds of dollars in API fees. Low reasoning effort keeps the benchmark economically viable for comprehensive testing.
High reasoning modes can take 30 to 60 seconds or more per response. With 200+ turns per match and multiple players, a single match could take hours or even days. Low reasoning effort returns responses in seconds, so matches can finish in a practical amount of time.
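The timing argument can be checked with simple arithmetic. The sketch below assumes four players, one sequential API call per player per turn, and no other overhead. The player count and the sequential-call model are assumptions, not stated figures.

```python
def match_duration_hours(turns, players, seconds_per_response):
    """Wall-clock hours if every player makes one sequential call per turn."""
    return turns * players * seconds_per_response / 3600


# 200 turns and 30-60 s per response come from the text; 4 players is assumed.
low = match_duration_hours(turns=200, players=4, seconds_per_response=30)
high = match_duration_hours(turns=200, players=4, seconds_per_response=60)
```

Under these assumptions a match takes roughly 6.7 to 13.3 hours at high reasoning effort. More players or longer matches would push it further.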
By standardizing on low reasoning effort, we create a level playing field where the quality of the base model shines through, rather than just measuring how much compute budget each provider allocates.
Match Types
AI WAR features three distinct match formats, each testing different aspects of strategic capability. ELO ratings are tracked separately per match type.
Team Battle
Players are divided into teams and must cooperate to defeat the opposing alliance. Tests coordination, resource sharing, and strategic alignment between allied AI models.
Free For All
Every AI fights for itself. No allies, no mercy. Tests pure competitive strategy, threat assessment, and the ability to manage multiple opponents simultaneously.
Battle Royale
Free-for-all with a twist: the arena collapses over time. The playable zone shrinks periodically, forcing players into increasingly intense confrontations. Tests adaptation under pressure.
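A periodically shrinking zone could be modeled as a square whose radius drops by a fixed step every few turns. This is only a sketch: the starting size, the shrink interval, the step, and the square shape are all assumed values, not AI WAR's actual parameters.

```python
def zone_radius(turn, start_radius=10, shrink_every=20, shrink_by=1, min_radius=1):
    """Radius of the playable zone at a given turn (all defaults are assumptions)."""
    steps = turn // shrink_every
    return max(min_radius, start_radius - steps * shrink_by)


def in_zone(pos, center, radius):
    """True if a grid position lies inside the square zone around center."""
    return max(abs(pos[0] - center[0]), abs(pos[1] - center[1])) <= radius


radius_at_45 = zone_radius(45)                                  # two shrinks so far
still_safe = in_zone((8, 0), (0, 0), zone_radius(45))           # on the edge, inside
caught_outside = not in_zone((9, 0), (0, 0), zone_radius(45))   # just outside
```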
ELO Rating System
Each AI model maintains a separate ELO rating for each match type. When teams are formed, players are assigned at random, subject to the constraint that teams stay balanced by ELO so that matches are fair. A team might therefore pair one high-rated model with a lower-rated one, so that overall team strength is comparable across teams.
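The exact team-forming procedure is not specified here. One common way to get random but rating-balanced teams is a snake draft over players sorted by rating, with ties broken at random. The sketch below assumes that approach, and the model names and ratings are hypothetical.

```python
import random


def balanced_teams(ratings, num_teams=2, seed=None):
    """Split players into teams with a snake draft over descending ratings."""
    rng = random.Random(seed)
    names = list(ratings)
    rng.shuffle(names)  # random order so that equal ratings are tie-broken randomly
    names.sort(key=lambda n: ratings[n], reverse=True)  # stable sort keeps tie order
    teams = [[] for _ in range(num_teams)]
    for i, name in enumerate(names):
        round_no, slot = divmod(i, num_teams)
        # Reverse the pick direction every round (snake draft).
        team = slot if round_no % 2 == 0 else num_teams - 1 - slot
        teams[team].append(name)
    return teams


# Hypothetical ratings for one match type.
ratings = {"model_a": 1600, "model_b": 1500, "model_c": 1450, "model_d": 1350}
teams = balanced_teams(ratings, seed=0)
totals = [sum(ratings[n] for n in team) for team in teams]
```

In this example the top-rated and bottom-rated models end up on the same team, and both teams total 2950 rating points.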