Inventive Problem Solving vs Vanilla LLM — Head-to-Head Benchmark Battle
Participants run on claude -p --model haiku (Haiku 4.5) ·
LLM-as-judge runs on claude -p --model sonnet (Sonnet 4.6)
Three ways to get TRIZ Engine running. The plugin path (recommended) loads commands, agents, and the MCP server into a fresh Claude Code session.
$ git clone https://github.com/sharathsphd/triz-engine.git $ cd triz-engine # register the local marketplace $ claude plugin marketplace add ./ # install the plugin $ claude plugin install triz-engine@triz-arena # verify in a fresh session $ claude > /plugin list # should show triz-engine > /triz-engine:analyze "How to reduce weight of aircraft wings while maintaining strength?"
After install, a fresh Claude Code session auto-loads the 6 slash commands (/triz-engine:analyze, :principles, :matrix, :ifr, :ariz, :benchmark), the 3 specialized agents, and the triz-knowledge MCP server.
$ git clone https://github.com/sharathsphd/triz-engine.git $ cd triz-engine/triz-engine $ python3.11 -m venv .venv && source .venv/bin/activate $ pip install -e ".[dev]" # verify $ python -m pytest -q $ python -m mcp.triz_server # starts the FastMCP server on stdio
Use this path to explore the knowledge base, run unit tests, or embed mcp/triz_server.py in another MCP-compatible client.
# run a single internal TRIZBENCH problem with trace capture
$ python -m benchmark.runner \
--problems TB-01 \
--participants triz-engine vanilla-claude \
--capture-trace
# run a slice of MacGyver problems
$ python -m benchmark.external.run_external \
--benchmarks macgyver --limit 8 --capture-trace
# rebuild the dashboard from results/*.json
$ python scripts/build_dashboard.py
Traces captured with --capture-trace include every MCP tool call, agent hand-off, and assistant turn — the same data you see in the Live Demos section below.
ELO ratings computed via Bradley-Terry model across all benchmarks
Each cell shows TRIZ score (top) and Vanilla score (bottom). Color encodes the delta — gold favors TRIZ, blue favors Vanilla. Click any cell for the full scoring breakdown.
Points above the dashed 45° line indicate TRIZ wins; below indicates Vanilla wins. Hover for details.
Representative problems with real side-by-side output and the complete plugin execution trace (MCP tool calls, agent hand-offs, assistant turns).
The plugin's real purpose isn't to win benchmarks — it's to guide Claude through
tough, open-ended inventive problems by systematically applying TRIZ. Here are a
couple of practical challenges solved live by the installed plugin (via
/triz-engine:analyze) alongside plain vanilla Claude. No ground-truth,
no scoring — this is a qualitative side-by-side so you can see how the plugin
reshapes the reasoning.
$ claude -p "/triz-engine:analyze How do I make a backpack that's waterproof in rain but breathable in sun, with no moving parts?" --model haikuThe plugin identifies the contradiction, queries the 39×39 contradiction matrix via MCP, picks the most relevant inventive principles, and hands off between the three specialist agents to produce a concrete, non-obvious solution.
Three specialized agents orchestrated through an MCP knowledge server.
/triz-engine:analyze, :principles, :matrix, :ifr, :ariz, :benchmark — auto-discovered by Claude Code on plugin load.agents/.Actual problems, solutions, and scoring from every benchmark run
How benchmarks are scored and ELO ratings computed
claude -p --model haiku — claude-haiku-4-5 (Claude Haiku 4.5).
Same model for both sides so the only variable is the plugin itself.claude -p --model sonnet —
claude-sonnet-4-5 (Claude Sonnet 4.6, the stronger reasoning model).--dangerously-skip-permissions in headless mode; trace-enabled runs
add --output-format stream-json --verbose. See
benchmark/runner.py
and scorer.py
for the exact invocations.