Benchmark Methodology

Our evaluation protocol prioritizes statistical precision and technical depth. Each model's score is derived from a multi-replicate execution cycle run inside isolated Luau environments.

Evaluation Tracks

Logic Completion (35%)

Algorithm implementation under zero-dependency constraints, testing raw reasoning and fluency with Luau syntax.

Bug Correction (35%)

Resolving logic and syntax errors in complex Luau routines.

Type System (15%)

Leveraging Luau's gradual type system for static analysis and runtime optimization.

Execution Tests (15%)

Comprehensive unit tests to verify runtime correctness of generated code.

Scoring Schema

1. Multi-Replicate Sampling

Each task is executed three times to account for model variance, and scores are averaged across all passes.
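As a minimal sketch of the averaging step (illustrative names, not the benchmark's actual API), replicate scores for a task can be collapsed like this:

```typescript
// Average a task's score across its replicate runs (hypothetical helper).
function averageReplicates(scores: number[]): number {
  if (scores.length === 0) {
    throw new Error("no replicate scores recorded");
  }
  const total = scores.reduce((sum, s) => sum + s, 0);
  return total / scores.length;
}

// Three replicate passes for one task, averaged into a single task score:
const taskScore = averageReplicates([0.8, 1.0, 0.9]); // ≈ 0.9
```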

2. Execution Verification

A score counts only if the generated code is syntactically correct and passes the predefined unit tests.
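The gating rule can be sketched as follows; the result shape and field names are assumptions for illustration, not the benchmark's real data model:

```typescript
// Hypothetical shape of one execution attempt's outcome.
interface ExecutionResult {
  syntaxOk: boolean;    // code parsed/compiled without errors
  testsPassed: boolean; // all predefined unit tests passed
}

// A raw score only counts when both execution checks succeed;
// otherwise the attempt scores zero.
function gatedScore(raw: number, result: ExecutionResult): number {
  return result.syntaxOk && result.testsPassed ? raw : 0;
}
```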

3. Weighted Aggregation

Final scores are aggregated using the track weights above to produce a single composite metric.
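Using the track weights listed above (35% / 35% / 15% / 15%), the aggregation reduces to a weighted sum. This is a sketch with assumed key names, not the benchmark's actual implementation:

```typescript
// Track weights from the methodology above (sum to 1.0).
const WEIGHTS = {
  logicCompletion: 0.35,
  bugCorrection: 0.35,
  typeSystem: 0.15,
  executionTests: 0.15,
} as const;

type Track = keyof typeof WEIGHTS;
type TrackScores = Record<Track, number>;

// Weighted sum of per-track scores into one composite metric.
function aggregate(scores: TrackScores): number {
  return (Object.keys(WEIGHTS) as Track[]).reduce(
    (acc, track) => acc + WEIGHTS[track] * scores[track],
    0
  );
}

// A model scoring perfectly on every track aggregates to 1.0.
const composite = aggregate({
  logicCompletion: 1,
  bugCorrection: 1,
  typeSystem: 1,
  executionTests: 1,
});
```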

Data Transparency Policy

We prioritize open results. To run the latest benchmark cycle yourself, use the following terminal commands:

PowerShell Environment

# 1. discover new models

$env:OPENROUTER_API_KEY="sk-..."; npm run fetch:models

# 2. execute benchmark suite

$env:OPENROUTER_API_KEY="sk-..."; npm run run:benchmark

GitHub Repository