TRIZ Arena

Inventive Problem Solving vs Vanilla LLM — Head-to-Head Benchmark Battle

Participants run on claude -p --model haiku (Haiku 4.5) · LLM-as-judge runs on claude -p --model sonnet (Sonnet 4.6)

TRIZ Engine vs Vanilla Claude

Arenas:
📚 Internal TRIZBENCH
📄 Patent Analysis
🔧 MacGyver
🦄 CresOWLve

📦 Install the Plugin

Three ways to get TRIZ Engine running. The plugin path (recommended) loads commands, agents, and the MCP server into a fresh Claude Code session.

The three paths, in the order they appear below: Claude Code Plugin, Local Clone, and Dev Setup (benchmarks).

Claude Code Plugin
$ git clone https://github.com/sharathsphd/triz-engine.git
$ cd triz-engine

# register the local marketplace
$ claude plugin marketplace add ./

# install the plugin
$ claude plugin install triz-engine@triz-arena

# verify in a fresh session
$ claude
> /plugin list                  # should show triz-engine
> /triz-engine:analyze "How to reduce weight of aircraft wings while maintaining strength?"

After install, a fresh Claude Code session auto-loads the 6 slash commands (/triz-engine:analyze, :principles, :matrix, :ifr, :ariz, :benchmark), the 3 specialized agents, and the triz-knowledge MCP server.

Local Clone

$ git clone https://github.com/sharathsphd/triz-engine.git
$ cd triz-engine/triz-engine
$ python3.11 -m venv .venv && source .venv/bin/activate
$ pip install -e ".[dev]"

# verify
$ python -m pytest -q
$ python -m mcp.triz_server      # starts the FastMCP server on stdio

Use this path to explore the knowledge base, run unit tests, or embed mcp/triz_server.py in another MCP-compatible client.

Dev Setup (benchmarks)

# run a single internal TRIZBENCH problem with trace capture
$ python -m benchmark.runner \
    --problems TB-01 \
    --participants triz-engine vanilla-claude \
    --capture-trace

# run a slice of MacGyver problems
$ python -m benchmark.external.run_external \
    --benchmarks macgyver --limit 8 --capture-trace

# rebuild the dashboard from results/*.json
$ python scripts/build_dashboard.py

Traces captured with --capture-trace include every MCP tool call, agent hand-off, and assistant turn — the same data you see in the Live Demos section below.

🏆 Leaderboard

Elo ratings computed via a Bradley-Terry model across all benchmarks

Per-Problem Score Heatmap

Each cell shows TRIZ score (top) and Vanilla score (bottom). Color encodes the delta — gold favors TRIZ, blue favors Vanilla. Click any cell for the full scoring breakdown.

Color scale, Δ (TRIZ − Vanilla): −50, −10, 0, +10, +50

TRIZ vs Vanilla per Problem (Scatter)

Points above the dashed 45° line are TRIZ wins; points below it are Vanilla wins. Hover for details.

Score Distribution by Benchmark

🎬 Live Demos

Representative problems with real side-by-side output and the complete plugin execution trace (MCP tool calls, agent hand-offs, assistant turns).

💡 Applied — inventive problems outside benchmarks

The plugin's real purpose isn't to win benchmarks; it's to guide Claude through tough, open-ended inventive problems by systematically applying TRIZ. Here are a couple of practical challenges solved live by the installed plugin (via /triz-engine:analyze) alongside plain vanilla Claude. There is no ground truth and no scoring: this is a qualitative side-by-side so you can see how the plugin reshapes the reasoning.

How to try your own
Once the plugin is installed (see Install), just invoke the slash command in any Claude Code session:
$ claude -p "/triz-engine:analyze How do I make a backpack that's waterproof in rain but breathable in sun, with no moving parts?" --model haiku
The plugin identifies the contradiction, queries the 39×39 contradiction matrix via MCP, picks the most relevant inventive principles, and hands off between the three specialist agents to produce a concrete, non-obvious solution.

🧱 Plugin Architecture

Three specialized agents orchestrated through an MCP knowledge server.

User: /triz-engine:analyze "..."
Slash command entry point. Claude Code routes to the plugin-defined handler.

① Contradiction Agent
Parses the problem and extracts the improving vs worsening dimensions.
uses: suggest_parameters, list_parameters

② Solution Agent
Looks up the matrix, applies principles, generates sketches.
uses: lookup_matrix, get_principle, apply_principle

③ Evaluator Agent
Scores vs IFR, ranks, picks the winner, logs the session.
uses: score_solution, evaluate_solution, log_session

MCP Server: triz-knowledge
8 tools grounded in the 40 Inventive Principles + 39×39 Contradiction Matrix:
list_parameters · suggest_parameters · lookup_matrix · get_principle · apply_principle · score_solution · get_separation_principles · log_session_entry

📦 6 Slash Commands
/triz-engine:analyze, :principles, :matrix, :ifr, :ariz, :benchmark; auto-discovered by Claude Code on plugin load.

🤖 3 Specialized Agents
Each agent has its own system prompt, tool allowlist, and hand-off contract. Defined as markdown files under agents/.

🛠 8 MCP Tools
FastMCP server exposes knowledge-base queries so the plugin cannot hallucinate principle numbers, parameter IDs, or matrix entries.

📚 Grounded Knowledge
All 40 Inventive Principles, the 39-parameter × 39-parameter contradiction matrix, and separation principles, shipped as validated JSON.
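
To illustrate what "validated JSON" could mean in practice, here is a minimal sketch of a range check over a knowledge base. The schema (keys `principles`, `matrix`, `improving`, `worsening`) and the function name `validate_knowledge` are assumptions made for this example, not the plugin's actual file layout, and the sample matrix cell holds placeholder values rather than real matrix content.

```python
import json

N_PRINCIPLES = 40   # the 40 Inventive Principles
N_PARAMETERS = 39   # 39 engineering parameters on each matrix axis


def validate_knowledge(kb: dict) -> list[str]:
    """Return a list of problems found; an empty list means the data passed."""
    errors = []
    ids = sorted(p["id"] for p in kb.get("principles", []))
    if ids != list(range(1, N_PRINCIPLES + 1)):
        errors.append("principles must carry ids 1..40 exactly once")
    for cell in kb.get("matrix", []):
        for axis in ("improving", "worsening"):
            if not 1 <= cell[axis] <= N_PARAMETERS:
                errors.append(f"{axis} parameter {cell[axis]} out of range")
        for pid in cell["principles"]:
            if not 1 <= pid <= N_PRINCIPLES:
                errors.append(f"principle {pid} out of range")
    return errors


# Placeholder data in the assumed schema (NOT real matrix entries).
sample = json.loads(
    '{"principles": [], '
    '"matrix": [{"improving": 1, "worsening": 14, "principles": [1, 40]}]}'
)
sample["principles"] = [{"id": i, "name": f"principle-{i}"} for i in range(1, 41)]

print(validate_knowledge(sample))
```

A check of this kind only guards the ID ranges; whether each cell matches the published matrix would need a separate comparison against a trusted source.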

🔍 Benchmark Deep-Dives

Actual problems, solutions, and scoring from every benchmark run

Internal TRIZBENCH
External Patents
MacGyver
CresOWLve

📈 Methodology

How benchmarks are scored and how Elo ratings are computed

Claude Models Used
Participants (both TRIZ Engine plugin and Vanilla Claude baseline): claude -p --model haiku → claude-haiku-4-5 (Claude Haiku 4.5). The same model runs on both sides, so the only variable is the plugin itself.
LLM-as-judge for Solution Novelty (SN) and Contradiction Resolution (CR), plus the MacGyver 4-level rubric: claude -p --model sonnet → claude-sonnet-4-5 (Claude Sonnet 4.6, the stronger reasoning model).
All runs use --dangerously-skip-permissions in headless mode; trace-enabled runs add --output-format stream-json --verbose. See benchmark/runner.py and scorer.py for the exact invocations.
Internal TRIZBENCH Scoring (5 dimensions)
Contradiction Identification (CI) — 25% weight
Principle Selection (PS) — 20% weight
Solution Novelty (SN) — 20% weight, LLM-as-judge
Contradiction Resolution (CR) — 25% weight, LLM-as-judge
Ideal Final Result (IFR) — 10% weight
Final = weighted sum, max 100 points
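
The weighted sum above can be sketched as follows. This assumes each dimension is scored on a 0 to 100 scale before weighting, which the rubric implies but does not state; the function name `final_score` and the example scores are hypothetical.

```python
# Weights taken from the rubric above; they sum to 1.0.
WEIGHTS = {"CI": 0.25, "PS": 0.20, "SN": 0.20, "CR": 0.25, "IFR": 0.10}


def final_score(dimension_scores: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores (each assumed to be 0-100)."""
    missing = WEIGHTS.keys() - dimension_scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[d] * dimension_scores[d] for d in WEIGHTS)


# Hypothetical per-dimension scores for one problem.
example = {"CI": 80, "PS": 60, "SN": 70, "CR": 90, "IFR": 50}
print(final_score(example))
```

For the example scores the contributions are 20 + 12 + 14 + 22.5 + 5, which comes to 73.5 out of 100.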
External TRIZBENCH (Patent Analysis)
Evaluated against real patent contradiction data (236 patents):
Contradiction Type — exact match (20 pts)
Parameter Match — exact (30 pts) or partial (15 pts)
Principle F1 — precision/recall vs ground truth (50 pts)
Derived score = type + params + F1*50, max 100
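
A minimal sketch of the derived score, assuming principles are compared as sets of principle numbers. The function names and the `"exact"` / `"partial"` / `"none"` labels for parameter match are illustrative choices for this example, not identifiers from the benchmark code.

```python
def principle_f1(predicted: set[int], truth: set[int]) -> float:
    """F1 between predicted and ground-truth principle numbers."""
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


def patent_score(type_match: bool, param_match: str, predicted, truth) -> float:
    """type (20) + params (30 exact / 15 partial) + F1 * 50, max 100."""
    type_pts = 20 if type_match else 0
    param_pts = {"exact": 30, "partial": 15, "none": 0}[param_match]
    return type_pts + param_pts + 50 * principle_f1(set(predicted), set(truth))


# Hypothetical prediction: two of three predicted principles are in the truth set.
print(patent_score(True, "partial", [1, 15, 35], [15, 35, 40, 28]))
```

In the hypothetical case, precision is 2/3 and recall is 2/4, so F1 is 4/7 and the score is 20 + 15 + 50 × 4/7, roughly 63.6.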
MacGyver Benchmark
Constrained creative problem solving with limited objects.
Scored by LLM judge on a 4-level scale:
Perfect (1.0) — fully correct, creative
Good (0.75) — workable solution
Partial (0.5) — partially addresses problem
Fail (0.0) — doesn't solve the problem
CresOWLve Benchmark
Lateral/divergent creative reasoning riddles.
Binary scoring: correct or incorrect.
Answer matching uses normalized substring matching with punctuation stripping and case-insensitive comparison.
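
The matching rule can be sketched as below. This is one plausible reading, assuming the normalized gold answer must appear inside the normalized model answer; the helper names `normalize` and `is_correct` are hypothetical, and only ASCII punctuation is stripped here.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    return " ".join(text.lower().translate(_PUNCT_TABLE).split())


def is_correct(answer: str, gold: str) -> bool:
    """Binary score: True if the normalized gold answer is a substring."""
    gold_norm = normalize(gold)
    return bool(gold_norm) and gold_norm in normalize(answer)


print(is_correct("The answer is: Sherlock Holmes!", "sherlock holmes"))
print(is_correct("I don't know.", "Paris"))
```

Substring matching is lenient by design: a long answer that merely mentions the gold string also counts as correct, which is worth keeping in mind when reading the CresOWLve numbers.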
Known domain mismatch: CresOWLve
CresOWLve is a trivia and lateral-thinking retrieval benchmark — most items require a specific proper name, book title, or culturally-specific fact. TRIZ is an inventive-problem methodology; it does not add signal when the answer is retrieval rather than construction. We report CresOWLve for transparency, not because it is a fair fit. Both systems score near zero; after the recent prompt update, TRIZ now abstains on questions it cannot answer confidently instead of hallucinating a plausible-but-wrong answer.
Elo Rating System
Bradley-Terry model adapted for benchmark comparison:
• Initial rating: 1000 for each participant
• K-factor: 32 for first match, 16 thereafter (max of both players)
• Expected score: E = 1 / (1 + 10^((Rb − Ra) / 400))
• Win/Loss/Draw determined by comparing scores per problem
• 95% confidence intervals via 1000-iteration bootstrap
• Matches processed in order: Internal → External → MacGyver → CresOWLve
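
The rules above can be sketched as a sequential rating update plus a bootstrap over problems. This is a sketch under stated assumptions, not the project's implementation: it assumes exactly two participants (so both always share the same K-factor), treats equal scores as a draw, and builds the confidence interval by resampling matches with replacement and taking the 2.5th and 97.5th percentiles. The names `run_elo` and `bootstrap_ci` and the example scores are hypothetical.

```python
import random


def expected(r_a: float, r_b: float) -> float:
    """Expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def run_elo(matches, start=1000.0):
    """matches: ordered list of (triz_score, vanilla_score), one per problem."""
    ratings = {"triz": start, "vanilla": start}
    for i, (s_triz, s_vanilla) in enumerate(matches):
        k = 32 if i == 0 else 16  # 32 for the first match, 16 thereafter
        outcome = 1.0 if s_triz > s_vanilla else 0.0 if s_triz < s_vanilla else 0.5
        delta = k * (outcome - expected(ratings["triz"], ratings["vanilla"]))
        ratings["triz"] += delta      # updates are zero-sum between the two
        ratings["vanilla"] -= delta
    return ratings


def bootstrap_ci(matches, iters=1000, seed=0):
    """95% percentile interval for the TRIZ rating (assumed resampling scheme)."""
    rng = random.Random(seed)
    samples = sorted(
        run_elo(rng.choices(matches, k=len(matches)))["triz"] for _ in range(iters)
    )
    return samples[int(0.025 * iters)], samples[int(0.975 * iters) - 1]


# Hypothetical per-problem scores: (TRIZ, Vanilla).
demo = [(80, 60), (55, 70), (90, 90), (75, 40)]
print(run_elo(demo))
print(bootstrap_ci(demo))
```

With both ratings starting at 1000, the expected score in the first match is 0.5, so a first-match win moves the winner to 1016 and the loser to 984. Because the update is sequential, the final ratings depend on match order, which is why the processing order is fixed.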