RALPH WIGGUM VS ATTRACTORFLOW

38 TRIALS · 6 TASKS · BLINDED JUDGE

CREDITS: 0

── SELECT YOUR FIGHTER ──

RALPH WIGGUM 🧒


Iterative CLI loop. Calls claude -p up to 10x (sketched below the stats). Pure stubborn determination. Tokens? Who counts tokens?

STAT          RATING
SPEED         ⭐⭐⭐⭐⭐
CONSISTENCY   ⭐⭐⭐
TOKEN COST    💀💀💀💀
AVG TOKENS    2,146
AVG ITERS     1.2
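
In sketch form, the Ralph loop is pure brute iteration. A minimal illustration only (the prompt file, test command, and stop condition here are assumptions, not the repo's exact harness):

  import subprocess

  # Ralph Wiggum loop, minimally sketched: re-run `claude -p` on the same
  # prompt until the test suite passes, capped at 10 iterations.
  prompt = open("task_prompt.md").read()          # assumed prompt location
  for attempt in range(10):                       # the 10x cap from above
      subprocess.run(["claude", "-p", prompt])    # one non-interactive call
      if subprocess.run(["python", "-m", "pytest", "-q"]).returncode == 0:
          break                                   # stop once tests pass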

ATTRACTORFLOW 🌀


Lyapunov-guided orchestrator. Phase-space trajectories. Spawns explorer/convergence subagents. Dynamical systems theory applied to code (toy sketch below the stats).

STAT          RATING
SPEED         ⭐⭐⭐
CONSISTENCY   ⭐⭐⭐⭐⭐
TOKEN COST    💀
AVG TOKENS    601
AVG ITERS     3.3
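
What does "Lyapunov-guided" mean in practice? A toy illustration of the idea only (this is not the plugin's code; the solution embedding and the explore/converge rule are invented for exposition):

  import numpy as np

  # Toy Lyapunov-style check: are successive candidate solutions contracting
  # toward an attractor (converge) or still spreading out (explore)?
  def lyapunov_estimate(trajectory):
      pts = [np.asarray(p, dtype=float) for p in trajectory]
      steps = [np.linalg.norm(b - a) for a, b in zip(pts, pts[1:])]
      ratios = [n / p for p, n in zip(steps, steps[1:]) if p > 0]
      return float(np.mean(np.log(ratios))) if ratios else 0.0

  # lam < 0 -> spawn a convergence subagent; lam >= 0 -> spawn an explorer.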
OVERALL: RALPH μ = 9.42 · ATTRACTORFLOW μ = 9.63

── THE ARENA ──

How the battle was judged

TASKS

6 tasks total: 5 coding + 1 analysis. Difficulty tiers: 🟢 Standard (A,C) · 🔴 Hard (H)

REPS

4 repetitions per task per condition. Latin-square ordering controls for carry-over effects.
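
One simple way to build such an ordering (a sketch; the harness's actual counterbalancing scheme may differ) is a cyclic Latin square, where every task appears exactly once in every position across repetitions:

  # Cyclic Latin square: row i gives the task order for repetition i.
  def latin_square(n):
      return [[(i + j) % n for j in range(n)] for i in range(n)]

  TASKS = ["A01", "C02", "C08", "H01", "H02", "H03"]
  orders = [[TASKS[k] for k in row] for row in latin_square(len(TASKS))]
  # orders[0] starts at A01, orders[1] at C02, and so on.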

JUDGE

Claude Sonnet 4.6 · Temperature=0 · Blinded: UUID-named outputs, condition hidden

SCORING

0-10 rubric. Coding tasks: +2 bonus for passing the pytest suite, capped at 10.
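
As code (a direct transcription of the rubric above; the function name is ours):

  # Rubric score 0-10; +2 pytest bonus on coding tasks; hard cap at 10.
  def final_score(rubric, is_coding, pytest_passed):
      bonus = 2.0 if (is_coding and pytest_passed) else 0.0
      return min(10.0, rubric + bonus)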

BLINDING

Each output saved under a random <uuid>.txt filename. Judge never sees the condition label.

STATS

Repeated-measures ANOVA · Bonferroni correction · Bootstrap 95% CIs (n=10,000)
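
Roughly, in code (column names and the subject unit are assumptions; the real implementation lives in benchmark/stats):

  import numpy as np
  import pandas as pd
  import pingouin as pg

  df = pd.read_csv("benchmark/results/scores.csv")  # assumed: trial_id, condition, score

  # Repeated-measures ANOVA: does condition move the score?
  aov = pg.rm_anova(data=df, dv="score", within="condition", subject="trial_id")
  print(aov[["F", "p-unc"]])

  # Bootstrap 95% CI for each condition mean (10,000 resamples)
  rng = np.random.default_rng(0)
  for cond, g in df.groupby("condition"):
      scores = g["score"].to_numpy()
      means = [rng.choice(scores, size=len(scores)).mean() for _ in range(10_000)]
      print(cond, np.percentile(means, [2.5, 97.5]))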

TASK INVENTORY

TIER  ID   TASK                  TYPE
🟢    A01  Complexity Analysis   Analysis
🟢    C02  Rate Limiter          Coding
🟢    C08  Async Retry Client    Coding
🔴    H01  Consistent Hash Ring  Hard
🔴    H02  Pratt Parser          Hard
🔴    H03  LFU Cache             Hard

── QUICK START ──

HOW TO USE THIS IN YOUR OWN BATTLES

01 · CLONE THE REPO
git clone https://github.com/SharathSPhD/strange-wiggum
cd strange-wiggum

02 · SET UP PYTHON ENV
uv venv .venv --python 3.13
source .venv/bin/activate   # or .venv\Scripts\activate on Windows
uv pip install pingouin scipy pandas numpy

03 · INSTALL CLAUDE CODE CLI
npm install -g @anthropic-ai/claude-code
claude login   # authenticate with Anthropic

04 · INSTALL ATTRACTORFLOW PLUGIN
claude plugin install attractor-flow
# Verify: claude plugin list

05 · RUN A SINGLE TRIAL
python -m benchmark.harness --tasks H02 --conditions ralph --reps 1
# Results appear in benchmark/results/scores.csv

06 · GENERATE REPORT
python -m benchmark.stats
python -m benchmark.report
# Open benchmark/results/leaderboard.md

── ROUND BY ROUND ──

TASK  TIER         RALPH μ  AF μ   WINNER
A01   🟢 Analysis   9.00     9.00  🤝 TIE
C02   🟢 Coding     9.33     9.33  🤝 TIE
C08   🟢 Coding     9.67     9.67  🤝 TIE
H01   🔴 Hard      10.00    10.00  🤝 TIE
H02   🔴 Hard       8.75    10.00  ⚡ 🌀 ATTRACTORFLOW
H03   🔴 Hard      10.00     9.67  🧒 RALPH

⚡ KEY BATTLE: H02 (PRATT PARSER)

H02 (Pratt Parser) was the decisive battle. Ralph's Haiku model failed the unary-minus/power precedence test in rep 2 (-2**2 should evaluate to -4, not 4), scoring 5/10. AttractorFlow stayed in CONVERGING mode and nailed all 21 tests.
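
Why is this case tricky? In a Pratt parser, '**' must both bind tighter than prefix '-' and be right-associative. A minimal sketch of the technique (ours, not the benchmark's reference solution):

  import operator
  import re

  LBP = {"+": 1, "-": 1, "*": 3, "/": 3, "**": 7}   # left binding powers
  OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul,
         "/": operator.truediv, "**": operator.pow}

  def tokenize(src):
      return [int(t) if t.isdigit() else t
              for t in re.findall(r"\d+|\*\*|[-+*/()]", src)] + ["<end>"]

  def parse(tokens, rbp=0):
      tok = tokens.pop(0)
      if isinstance(tok, int):
          left = tok
      elif tok == "-":
          left = -parse(tokens, 5)    # prefix '-' binds looser than '**' (7)
      elif tok == "(":
          left = parse(tokens, 0)
          tokens.pop(0)               # consume ')'
      else:
          raise SyntaxError(f"unexpected {tok!r}")
      while LBP.get(tokens[0], 0) > rbp:
          op = tokens.pop(0)
          right = parse(tokens, LBP[op] - (op == "**"))  # '**' right-assoc
          left = OPS[op](left, right)
      return left

  assert parse(tokenize("-2**2")) == -4     # the case Ralph flunked
  assert parse(tokenize("(-2)**2")) == 4
  assert parse(tokenize("2**3**2")) == 512  # right-associative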


── POWER LEVEL CHARTS ──

[Score-distribution charts: per-condition spread]

ATTRACTORFLOW   μ (mean-centered) 0.00 · σ 0.60 · n 19
RALPH WIGGUM    μ (mean-centered) 0.00 · σ 1.22 · n 19

F(1, 18) = 0.520 · p = 0.480 · Cohen's d = 0.17

⚔️ NO KNOCKOUT

NEITHER AGENT DOMINATED

NEITHER FIGHTER WAS KO'D.
ATTRACTORFLOW EDGES OUT RALPH ON QUALITY BY +0.21
WHILE SPENDING 72% FEWER TOKENS.
THE REAL WIN IS CONSISTENCY:
AF σ = 0.60 VS RALPH σ = 1.22.

⚠ KNOWN LIMITATIONS
  1. Model Confound: AF reps 0 and 1 used Sonnet 4.6 (inherited from the parent session); AF rep 2 and the rep-3 validation used Haiku 4.5. Ralph used Haiku 4.5 in every rep, so only reps 2-3 are model-matched.
  2. Small Sample: Only 6 tasks, 19 observations per condition. Underpowered (power ≈ 12%).
  3. Single Judge: Claude Sonnet 4.6 at temperature=0. No inter-rater reliability beyond ICC.
  4. Task Selection: H01-H03 ("Hard") may favor complex orchestration patterns (AF's strength).

RANK   CONDITION      SCORE  GRADE
🥇 #1  ATTRACTORFLOW  9.63   S
🥈 #2  RALPH WIGGUM   9.42   A
🎮 VIEW SOURCE + DATA github.com/SharathSPhD/strange-wiggum


CREDITS

═══════════════════════

BUILT WITH:

  • Claude Code (Anthropic)
  • AttractorFlow Plugin
  • Ralph Wiggum Loop
  • Chart.js · PapaParse

═══════════════════════

BENCHMARK: April 2026

JUDGE: Claude Sonnet 4.6

TRIALS: 38

═══════════════════════