TRIZ Arena

📦 Install the Plugin

Three ways to get TRIZ Engine running. The plugin path (recommended) loads commands, agents, and the MCP server into a fresh Claude Code session.

Claude Code Plugin

Local Clone

Dev Setup (benchmarks)

$ git clone https://github.com/sharathsphd/triz-engine.git
$ cd triz-engine

# register the local marketplace
$ claude plugin marketplace add ./

# install the plugin
$ claude plugin install triz-engine@triz-arena

# verify in a fresh session
$ claude
> /plugin list                  # should show triz-engine
> /triz-engine:analyze "How to reduce weight of aircraft wings while maintaining strength?"

After install, a fresh Claude Code session auto-loads the 6 slash commands (/triz-engine:analyze, :principles, :matrix, :ifr, :ariz, :benchmark), the 3 specialized agents, and the triz-knowledge MCP server.

$ git clone https://github.com/sharathsphd/triz-engine.git
$ cd triz-engine/triz-engine
$ python3.11 -m venv .venv && source .venv/bin/activate
$ pip install -e ".[dev]"

# verify
$ python -m pytest -q
$ python -m mcp.triz_server      # starts the FastMCP server on stdio

Use this path to explore the knowledge base, run unit tests, or embed mcp/triz_server.py in another MCP-compatible client.

# run a single internal TRIZBENCH problem with trace capture
$ python -m benchmark.runner \
    --problems TB-01 \
    --participants triz-engine vanilla-claude \
    --capture-trace

# run a slice of MacGyver problems
$ python -m benchmark.external.run_external \
    --benchmarks macgyver --limit 8 --capture-trace

# rebuild the dashboard from results/*.json
$ python scripts/build_dashboard.py

Traces captured with --capture-trace include every MCP tool call, agent hand-off, and assistant turn — the same data you see in the Live Demos section below.

💡 Applied — inventive problems outside benchmarks

The plugin's real purpose is not to win benchmarks but to guide Claude through tough, open-ended inventive problems by applying TRIZ systematically. Below are a couple of practical challenges solved live by the installed plugin (via /triz-engine:analyze) alongside plain vanilla Claude. There is no ground truth and no scoring: this is a qualitative side-by-side that shows how the plugin reshapes the reasoning.

How to try your own

Once the plugin is installed (see Install), just invoke the slash command in any Claude Code session:

$ claude -p "/triz-engine:analyze How do I make a backpack that's waterproof in rain but breathable in sun, with no moving parts?" --model haiku

The plugin identifies the contradiction, queries the 39×39 contradiction matrix via MCP, picks the most relevant inventive principles, and hands off between the three specialist agents to produce a concrete, non-obvious solution.

🧱 Plugin Architecture

Three specialized agents orchestrated through an MCP knowledge server.

📦

6 Slash Commands

/triz-engine:analyze, :principles, :matrix, :ifr, :ariz, :benchmark — auto-discovered by Claude Code on plugin load.

🤖

3 Specialized Agents

Each agent has its own system prompt, tool allowlist, and hand-off contract. Defined as markdown files under agents/.

🛠

8 MCP Tools

FastMCP server exposes knowledge-base queries so the plugin cannot hallucinate principle numbers, parameter IDs, or matrix entries.

📚

Grounded Knowledge

All 40 Inventive Principles, the 39-parameter × 39-parameter contradiction matrix, and separation principles, shipped as validated JSON.
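
As a rough illustration of what "grounded" can mean here, the sketch below range-checks a matrix entry against the sizes stated above (40 principles, 39 parameters). The function name and the 1-based numbering are assumptions for illustration, not the plugin's actual validation code or JSON schema.

```python
NUM_PRINCIPLES = 40   # from the text: 40 Inventive Principles
NUM_PARAMETERS = 39   # from the text: 39 x 39 contradiction matrix


def validate_matrix_entry(improving: int, worsening: int,
                          principles: list[int]) -> list[str]:
    """Return a list of problems; an empty list means every ID is in range.

    Hypothetical helper: assumes parameters and principles are numbered
    from 1, as in the classical TRIZ tables.
    """
    problems = []
    for label, param in (("improving", improving), ("worsening", worsening)):
        if not 1 <= param <= NUM_PARAMETERS:
            problems.append(f"{label} parameter {param} outside 1..{NUM_PARAMETERS}")
    for principle in principles:
        if not 1 <= principle <= NUM_PRINCIPLES:
            problems.append(f"principle {principle} outside 1..{NUM_PRINCIPLES}")
    return problems
```

A check like this catches out-of-range IDs only; whether a given principle really belongs in a given matrix cell still has to come from the shipped JSON.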

📈 Methodology

How benchmarks are scored and Elo ratings computed

Claude Models Used

Participants (both TRIZ Engine plugin and Vanilla Claude baseline): claude -p --model haiku — claude-haiku-4-5 (Claude Haiku 4.5). Same model for both sides so the only variable is the plugin itself.
LLM-as-judge for Solution Novelty (SN) and Contradiction Resolution (CR), plus the MacGyver 4-level rubric: claude -p --model sonnet (claude-sonnet-4-5, Claude Sonnet 4.5, the stronger reasoning model).
All runs use --dangerously-skip-permissions in headless mode; trace-enabled runs add --output-format stream-json --verbose. See benchmark/runner.py and scorer.py for the exact invocations.
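
The flags listed above can be assembled into an argument list roughly as follows. This is a minimal sketch using only the flags named in this section; `build_claude_command` is a hypothetical helper, and the real invocations in benchmark/runner.py and scorer.py may differ.

```python
import shlex


def build_claude_command(prompt: str, model: str = "haiku",
                         capture_trace: bool = False) -> list[str]:
    """Build a headless `claude -p` argument list (hypothetical helper)."""
    cmd = ["claude", "-p", prompt, "--model", model,
           "--dangerously-skip-permissions"]
    if capture_trace:
        # Trace-enabled runs add structured streaming output.
        cmd += ["--output-format", "stream-json", "--verbose"]
    return cmd


def as_shell(cmd: list[str]) -> str:
    """Render the argument list as a copy-pasteable shell line."""
    return shlex.join(cmd)
```

Passing the list to `subprocess.run(cmd, capture_output=True, text=True)` avoids shell quoting problems with prompts that contain quotes or question marks.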

Internal TRIZBENCH Scoring (5 dimensions)

Contradiction Identification (CI) — 25% weight
Principle Selection (PS) — 20% weight
Solution Novelty (SN) — 20% weight, LLM-as-judge
Contradiction Resolution (CR) — 25% weight, LLM-as-judge
Ideal Final Result (IFR) — 10% weight
Final = weighted sum, max 100 points
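
The weighted sum above can be sketched as follows. The weights come from the list; the assumption that each dimension is scored on a 0 to 100 scale (so the weighted sum tops out at 100) is mine, as is the function name.

```python
# Weights as listed above; they sum to 1.0.
WEIGHTS = {"CI": 0.25, "PS": 0.20, "SN": 0.20, "CR": 0.25, "IFR": 0.10}


def final_score(dimension_scores: dict[str, float]) -> float:
    """Weighted sum of the five dimensions.

    Assumes each dimension score is on a 0..100 scale (not confirmed
    by the source), which gives a maximum of 100 points.
    """
    missing = WEIGHTS.keys() - dimension_scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[dim] * dimension_scores[dim] for dim in WEIGHTS)
```

For example, scores of CI=80, PS=60, SN=50, CR=70, IFR=100 give 20 + 12 + 10 + 17.5 + 10 = 69.5 points.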

External TRIZBENCH (Patent Analysis)

Evaluated against real patent contradiction data (236 patents):
Contradiction Type — exact match (20 pts)
Parameter Match — exact (30 pts) or partial (15 pts)
Principle F1 — precision/recall vs ground truth (50 pts)
Derived score = type + params + F1*50, max 100
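
A minimal sketch of the derived score, assuming principles are compared as sets of principle numbers and that the exact/partial parameter decision is made upstream (the source does not say how "partial" is determined). Function names are illustrative.

```python
def principle_f1(predicted: set[int], truth: set[int]) -> float:
    """F1 between predicted and ground-truth principle sets."""
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


def external_score(type_match: bool, param_match: str,
                   predicted: set[int], truth: set[int]) -> float:
    """type (20) + params (30 exact / 15 partial) + F1 * 50, max 100."""
    type_pts = 20 if type_match else 0
    # "none" is an assumed label for no parameter match.
    param_pts = {"exact": 30, "partial": 15, "none": 0}[param_match]
    return type_pts + param_pts + 50 * principle_f1(predicted, truth)
```

With a correct type, a partial parameter match, and one of two predicted principles matching a three-principle ground truth, precision is 0.5, recall is 1/3, F1 is 0.4, and the score is 20 + 15 + 20 = 55.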

MacGyver Benchmark

Constrained creative problem solving with limited objects.
Scored by LLM judge on a 4-level scale:
Perfect (1.0) — fully correct, creative
Good (0.75) — workable solution
Partial (0.5) — partially addresses problem
Fail (0.0) — doesn't solve the problem
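
The four-level scale maps directly onto a lookup table. In the sketch below, averaging the per-problem values into one benchmark score is an assumption; the source only defines the per-problem levels.

```python
# Judge labels and their numeric values, as listed above.
MACGYVER_SCALE = {"perfect": 1.0, "good": 0.75, "partial": 0.5, "fail": 0.0}


def macgyver_mean(labels: list[str]) -> float:
    """Mean score over judge labels (aggregation by mean is assumed)."""
    scores = [MACGYVER_SCALE[label.strip().lower()] for label in labels]
    return sum(scores) / len(scores) if scores else 0.0
```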

CresOWLve Benchmark

Lateral/divergent creative reasoning riddles.
Binary scoring: correct or incorrect.
Answer matching uses normalized substring matching
with punctuation stripping and case-insensitive comparison.
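
One way the matching described above could look, as a sketch: it assumes the reference answer must appear inside the normalized model response, and it strips ASCII punctuation only. The actual scorer may normalize differently.

```python
import string

# Translation table that deletes ASCII punctuation.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Drop punctuation, fold case, and collapse whitespace."""
    return " ".join(text.translate(_STRIP_PUNCT).casefold().split())


def is_correct(response: str, answer: str) -> bool:
    """Binary score: does the normalized answer occur in the response?"""
    target = normalize(answer)
    return bool(target) and target in normalize(response)
```

Note that deleting punctuation joins hyphenated words, so "Moby-Dick" normalizes to "mobydick" and would not match a reference of "Moby Dick"; replacing punctuation with spaces is the alternative if that matters.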

Known domain mismatch: CresOWLve

CresOWLve is a trivia and lateral-thinking retrieval benchmark: most items require a specific proper name, book title, or culturally specific fact. TRIZ is an inventive-problem methodology; it does not add signal when the answer is retrieval rather than construction. We report CresOWLve for transparency, not because it is a fair fit. Both systems score near zero; after the recent prompt update, the TRIZ Engine plugin abstains on questions it cannot answer confidently instead of hallucinating a plausible-but-wrong answer.

Elo Rating System

Bradley-Terry model adapted for benchmark comparison:
• Initial rating: 1000 for each participant
• K-factor: 32 for a participant's first match, 16 thereafter; each match uses the larger of the two participants' K-factors
• Expected score of participant a against b: E_a = 1 / (1 + 10^{(R_b - R_a)/400})
• Win/Loss/Draw determined by comparing scores per problem
• 95% confidence intervals via 1000-iteration bootstrap
• Matches processed in order: Internal → External → MacGyver → CresOWLve
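
The rules above can be sketched in a few lines. This is one reading, not the project's code: the match tuple layout, the function names, and the choice to bootstrap by resampling matches with replacement (which ignores the original processing order) are all assumptions.

```python
import random


def expected(r_a: float, r_b: float) -> float:
    """Expected score of a against b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def run_elo(matches, initial: float = 1000.0) -> dict[str, float]:
    """matches: iterable of (name_a, score_a, name_b, score_b), in order."""
    ratings: dict[str, float] = {}
    played: dict[str, int] = {}
    for a, score_a, b, score_b in matches:
        r_a = ratings.setdefault(a, initial)
        r_b = ratings.setdefault(b, initial)
        # K is 32 for a first match, 16 after; take the larger of the two.
        k = max(32 if played.get(p, 0) == 0 else 16 for p in (a, b))
        outcome = 1.0 if score_a > score_b else 0.0 if score_a < score_b else 0.5
        delta = k * (outcome - expected(r_a, r_b))
        ratings[a], ratings[b] = r_a + delta, r_b - delta
        played[a] = played.get(a, 0) + 1
        played[b] = played.get(b, 0) + 1
    return ratings


def bootstrap_ci(matches, name: str, iterations: int = 1000, seed: int = 0):
    """95% interval for one participant's rating via match resampling."""
    rng = random.Random(seed)
    matches = list(matches)
    samples = sorted(
        run_elo(rng.choices(matches, k=len(matches))).get(name, 1000.0)
        for _ in range(iterations)
    )
    return samples[int(0.025 * iterations)], samples[int(0.975 * iterations) - 1]
```

For a single first match between two participants rated 1000, the expected score is 0.5 and K is 32, so the winner moves to 1016 and the loser to 984.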
