ELO Arena

gptcgt includes a built-in competitive ranking system for AI models. Every time models compete in Ensemble or Battle mode, their ELO ratings are updated — just like chess rankings. Over time, the system learns which models perform best and routes tasks to them more often.

How ELO Works

The ELO rating system was originally designed for chess. In gptcgt, it works the same way:

  • Every model starts at 1200 ELO
  • When a model wins a head-to-head comparison (Ensemble or Battle), its rating goes up
  • The losing model's rating goes down
  • If a weak model beats a strong model, the point swing is larger (upset bonus)
  • If a strong model beats a weak model, the swing is smaller (expected outcome)
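
The rules above match the standard chess Elo update, which can be sketched as follows. This is a minimal illustration, not gptcgt's actual code: the K-factor of 32 and the 400-point scale are the conventional chess defaults and are assumptions here, and the function names are hypothetical.

```python
from typing import Tuple

K_FACTOR = 32.0  # assumption: conventional chess value, not confirmed for gptcgt


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_pair(winner: float, loser: float, k: float = K_FACTOR) -> Tuple[float, float]:
    """Return the new (winner, loser) ratings after one head-to-head match."""
    # The less expected the win, the larger the delta (the "upset bonus").
    delta = k * (1.0 - expected_score(winner, loser))
    return winner + delta, loser - delta


even = update_pair(1200, 1200)      # two fresh models at the 1200 starting rating
upset = update_pair(1100, 1300)     # weaker model wins: larger swing
expected = update_pair(1300, 1100)  # stronger model wins: smaller swing

print(even)
print(round(upset[0] - 1100, 2), round(expected[0] - 1300, 2))
```

With equal ratings the swing is exactly half of K. The upset case moves more points than that, and the expected outcome moves fewer.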

When Matches Happen

  • Ensemble Mode — 3 models compete. The Arbiter picks the winner. The 2 losers each take an ELO hit.
  • Battle Mode — 2 models compete. You manually select the winner.

Multi-Way Dampening

In Ensemble mode, where one model beats the other two, the winner's ELO gain is dampened to prevent rating inflation. The system divides the winner's delta by a factor tied to the number of losers, which keeps ratings stable over time.
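
One way to picture the dampening, for the three-model Ensemble case described under When Matches Happen, is the sketch below. It rests on several assumptions that the text does not confirm: each loser takes the full pairwise hit, the winner's gain from each pairing is divided by the number of losers, and K is 32. All names are hypothetical.

```python
from typing import Dict, List

K_FACTOR = 32.0  # assumption: conventional chess value


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ensemble(ratings: Dict[str, float], winner: str,
                    losers: List[str], k: float = K_FACTOR) -> Dict[str, float]:
    """Apply one multi-way result, dampening the winner's total gain."""
    updated = dict(ratings)
    for loser in losers:
        # Pairwise delta computed from the pre-match ratings.
        delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
        updated[loser] -= delta                # assumed: full hit per loser
        updated[winner] += delta / len(losers)  # assumed: gain split across losers
    return updated


before = {"model-a": 1200.0, "model-b": 1200.0, "model-c": 1200.0}
after = update_ensemble(before, winner="model-a", losers=["model-b", "model-c"])
print(after)
```

Without the division, the winner in this example would gain 32 points from a single Ensemble round instead of 16, which is the inflation the dampening is meant to avoid.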

How Routing Uses ELO

When the router selects a model for your task, it uses ELO as a tiebreaker:

  1. First, it filters models by your quality tier (Standard, Max, etc.)
  2. Then, it filters by task complexity
  3. Among the remaining candidates, higher ELO models are preferred
  4. Cost is used as a secondary tiebreaker — if two models have similar ELO, the cheaper one wins

This means the more you use gptcgt, the smarter its model selection becomes — tailored to your specific projects and coding style.
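
The four routing steps can be sketched as a filter followed by a two-level tiebreak. This is an illustrative sketch only: the `Model` fields, the numeric complexity scale, and the 25-point margin used to decide that two ELO ratings are "similar" are all assumptions, not documented gptcgt behavior.

```python
from dataclasses import dataclass
from typing import List, Optional

ELO_TIE_MARGIN = 25.0  # assumption: how close two ratings must be to count as "similar"


@dataclass
class Model:
    name: str
    tier: str            # quality tier, e.g. "Standard" or "Max"
    max_complexity: int  # hypothetical: highest task complexity the model handles
    elo: float
    cost: float          # hypothetical: dollars per task


def pick_model(models: List[Model], tier: str, complexity: int) -> Optional[Model]:
    # Steps 1 and 2: filter by quality tier, then by task complexity.
    candidates = [m for m in models
                  if m.tier == tier and m.max_complexity >= complexity]
    if not candidates:
        return None
    # Step 3: keep only the models at or near the top ELO.
    best_elo = max(m.elo for m in candidates)
    near_top = [m for m in candidates if best_elo - m.elo <= ELO_TIE_MARGIN]
    # Step 4: among similar ELO ratings, the cheaper model wins.
    return min(near_top, key=lambda m: (m.cost, -m.elo))


models = [
    Model("model-a", "Standard", 3, 1250.0, 0.020),
    Model("model-b", "Standard", 3, 1240.0, 0.010),
    Model("model-c", "Standard", 3, 1150.0, 0.001),
    Model("model-d", "Max", 5, 1400.0, 0.100),
]
choice = pick_model(models, tier="Standard", complexity=2)
print(choice.name if choice else None)
```

Here model-b is chosen: it is within the margin of the top-rated Standard model and costs less, while model-c is cheaper still but too far behind on ELO to be considered.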

Leaderboard

ELO ratings, match counts, win rates, and total spend per model are stored in a SQLite database at ~/.gptcgt/elo.db. The leaderboard is visible in the application's settings panel.

Data Tracked

Field         Description
ELO Rating    Current competitive rating
Matches Won   Total head-to-head wins
Matches Lost  Total head-to-head losses
Win Rate      Percentage of matches won
Total Spent   Cumulative $ spent on this model
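
As a rough sketch of how these fields could be stored and queried in SQLite, the example below builds a leaderboard from an in-memory database. The table name, column names, and sample rows are hypothetical. The real schema of elo.db is not documented here, and Win Rate is assumed to be derived from the win and loss counts rather than stored.

```python
import sqlite3

# In-memory database so the sketch never touches a real elo.db file.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE model_stats (
        model        TEXT PRIMARY KEY,
        elo          REAL    NOT NULL DEFAULT 1200,
        matches_won  INTEGER NOT NULL DEFAULT 0,
        matches_lost INTEGER NOT NULL DEFAULT 0,
        total_spent  REAL    NOT NULL DEFAULT 0
    )
""")

# Made-up sample rows, for illustration only.
conn.executemany(
    "INSERT INTO model_stats VALUES (?, ?, ?, ?, ?)",
    [
        ("model-a", 1232.0, 6, 2, 0.84),
        ("model-b", 1184.0, 1, 3, 0.12),
        ("model-c", 1200.0, 0, 0, 0.00),
    ],
)

# Win rate is computed on read; models with no matches get NULL.
leaderboard = conn.execute("""
    SELECT model,
           elo,
           matches_won + matches_lost AS matches,
           CASE WHEN matches_won + matches_lost = 0 THEN NULL
                ELSE 100.0 * matches_won / (matches_won + matches_lost)
           END AS win_rate,
           total_spent
    FROM model_stats
    ORDER BY elo DESC
""").fetchall()

for row in leaderboard:
    print(row)
```

Deriving the win rate at query time keeps it from drifting out of sync with the stored win and loss counts.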