Luau Benchmark Rankings
Comparative performance and efficiency analysis for leading intelligence models.
Cohort: 61 Models
Spec: v1.0
Updated: 07/03/2026
| # | Model | Model ID | Weighted Score |
|---|---|---|---|
| 1 | OpenAI: GPT-5.2 Chat | `openai/gpt-5.2-chat` | 83.6% |
| 2 | Anthropic: Claude Opus 4.5 | `anthropic/claude-opus-4.5` | 83.6% |
| 3 | Anthropic: Claude Haiku 4.5 | `anthropic/claude-haiku-4.5` | 82.6% |
| 4 | Anthropic: Claude Sonnet 4.5 | `anthropic/claude-sonnet-4.5` | 81.9% |
| 5 | DeepSeek: DeepSeek V3.2 | `deepseek/deepseek-v3.2` | 80.2% |
| 6 | Qwen: Qwen3.5-122B-A10B | `qwen/qwen3.5-122b-a10b` | 78.8% |
| 7 | MoonshotAI: Kimi K2 0711 | `moonshotai/kimi-k2` | 78.4% |
| 8 | Anthropic: Claude Sonnet 4.6 | `anthropic/claude-sonnet-4.6` | 77.7% |
| 9 | Inception: Mercury Coder | `inception/mercury-coder` | 76.4% |
| 10 | Google: Gemini 3 Flash Preview | `google/gemini-3-flash-preview` | 76.3% |
| 11 | Google: Gemini 3.1 Flash Lite Preview | `google/gemini-3.1-flash-lite-preview` | 76.2% |
| 12 | OpenAI: GPT-5.4 | `openai/gpt-5.4` | 76.2% |
| 13 | Mistral: Devstral 2 2512 | `mistralai/devstral-2512` | 75.8% |
| 14 | Anthropic: Claude Opus 4.6 | `anthropic/claude-opus-4.6` | 75.2% |
| 15 | Inception: Mercury | `inception/mercury` | 75.0% |
| 16 | MoonshotAI: Kimi K2 0905 | `moonshotai/kimi-k2-0905` | 74.2% |
| 17 | Qwen: Qwen3.5-Flash | `qwen/qwen3.5-flash-02-23` | 73.9% |
| 18 | OpenAI: GPT-5.2-Codex | `openai/gpt-5.2-codex` | 73.3% |
| 19 | DeepSeek: DeepSeek V3.1 Terminus | `deepseek/deepseek-v3.1-terminus` | 73.2% |
| 20 | OpenAI: GPT-5.3 Chat | `openai/gpt-5.3-chat` | 73.1% |
| 21 | xAI: Grok 4.1 Fast | `x-ai/grok-4.1-fast` | 72.8% |
| 22 | xAI: Grok 4 Fast | `x-ai/grok-4-fast` | 71.2% |
| 23 | Google: Gemini 2.5 Flash Lite Preview 09-2025 | `google/gemini-2.5-flash-lite-preview-09-2025` | 69.4% |
| 24 | xAI: Grok Code Fast 1 | `x-ai/grok-code-fast-1` | 68.9% |
| 25 | xAI: Grok 3 Mini | `x-ai/grok-3-mini` | 67.9% |
| 26 | Cohere: Command A | `cohere/command-a` | 67.9% |
| 27 | DeepSeek: DeepSeek V3.1 | `deepseek/deepseek-chat-v3.1` | 66.0% |
| 28 | Mistral: Mistral Small Creative | `mistralai/mistral-small-creative` | 65.3% |
| 29 | Meta: Llama 4 Maverick | `meta-llama/llama-4-maverick` | 64.3% |
| 30 | Meta: Llama 4 Scout | `meta-llama/llama-4-scout` | 63.8% |
| 31 | Mistral: Ministral 3 8B 2512 | `mistralai/ministral-8b-2512` | 62.5% |
| 32 | MiniMax: MiniMax M2-her | `minimax/minimax-m2-her` | 61.0% |
| 33 | Mistral: Ministral 3 14B 2512 | `mistralai/ministral-14b-2512` | 58.2% |
| 34 | Meta: Llama 3.3 70B Instruct | `meta-llama/llama-3.3-70b-instruct` | 57.3% |
| 35 | Inception: Mercury 2 | `inception/mercury-2` | 55.9% |
| 36 | Cohere: Command R+ (08-2024) | `cohere/command-r-plus-08-2024` | 53.5% |
| 37 | OpenAI: GPT-5.3-Codex | `openai/gpt-5.3-codex` | 52.6% |
| 38 | xAI: Grok 4 | `x-ai/grok-4` | 49.8% |
| 39 | Cohere: Command R (08-2024) | `cohere/command-r-08-2024` | 45.7% |
| 40 | Cohere: Command R7B (12-2024) | `cohere/command-r7b-12-2024` | 37.9% |
| 41 | Qwen: Qwen3.5 Plus 2026-02-15 | `qwen/qwen3.5-plus-02-15` | 37.1% |
| 42 | Qwen: Qwen3.5-35B-A3B | `qwen/qwen3.5-35b-a3b` | 35.5% |
| 43 | Mistral: Ministral 3 3B 2512 | `mistralai/ministral-3b-2512` | 34.7% |
| 44 | Meta: Llama 3.2 3B Instruct | `meta-llama/llama-3.2-3b-instruct` | 32.7% |
| 45 | Meta: Llama 3.2 1B Instruct | `meta-llama/llama-3.2-1b-instruct` | 22.4% |
| 46 | Z.ai: GLM 4.6V | `z-ai/glm-4.6v` | 18.8% |
| 47 | Z.ai: GLM 4.6 | `z-ai/glm-4.6` | 16.9% |
| 48 | DeepSeek: R1 0528 | `deepseek/deepseek-r1-0528` | 15.9% |
| 49 | MoonshotAI: Kimi K2 Thinking | `moonshotai/kimi-k2-thinking` | 14.4% |
| 50 | Z.ai: GLM 4.7 | `z-ai/glm-4.7` | 13.2% |
| 51 | MiniMax: MiniMax M1 | `minimax/minimax-m1` | 12.3% |
| 52 | Z.ai: GLM 5 | `z-ai/glm-5` | 11.7% |
| 53 | DeepSeek: DeepSeek V3.2 Speciale | `deepseek/deepseek-v3.2-speciale` | 8.5% |
| 54 | MoonshotAI: Kimi K2.5 | `moonshotai/kimi-k2.5` | 7.9% |
| 55 | MiniMax: MiniMax M2.5 | `minimax/minimax-m2.5` | 7.6% |
| 56 | MiniMax: MiniMax M2.1 | `minimax/minimax-m2.1` | 5.9% |
| 57 | Google: Gemini 3 Pro Preview | `google/gemini-3-pro-preview` | 5.7% |
| 58 | Google: Gemini 3.1 Pro Preview | `google/gemini-3.1-pro-preview` | 5.0% |
| 59 | MiniMax: MiniMax M2 | `minimax/minimax-m2` | 4.2% |
| 60 | Z.ai: GLM 4.7 Flash | `z-ai/glm-4.7-flash` | 3.5% |
| 61 | Qwen: Qwen3.5-27B | `qwen/qwen3.5-27b` | 3.0% |
Research Methodology
Each model is evaluated on 30 tasks in Luau-specific environments. Scores are generated through automated testing, with three replicates per task to ensure reproducibility.
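The page does not publish the exact aggregation formula behind the Weighted Score column, but one plausible reading of "30 tasks, 3x replicates" can be sketched as follows. The mean-of-replicates step, the per-task weights, and the function name `weighted_score` are all assumptions for illustration, not the benchmark's documented method.

```python
# Hypothetical aggregation sketch: the benchmark's real weighting scheme is
# not published on this page. Task weights and replicate averaging here are
# assumptions, not the official protocol.

def weighted_score(task_results: dict[str, list[float]],
                   task_weights: dict[str, float]) -> float:
    """Average each task's replicate scores (e.g. 3 runs scored in [0, 1]),
    then take the weight-normalized mean across tasks, as a percentage."""
    total_weight = sum(task_weights[task] for task in task_results)
    acc = 0.0
    for task, replicates in task_results.items():
        per_task = sum(replicates) / len(replicates)  # mean over 3x replicates
        acc += task_weights[task] * per_task
    return 100.0 * acc / total_weight  # report as a percentage

# Example: two hypothetical tasks, three replicates each, equal weights.
results = {"table-fmt": [1.0, 1.0, 0.0], "iter-loop": [1.0, 1.0, 1.0]}
weights = {"table-fmt": 1.0, "iter-loop": 1.0}
print(round(weighted_score(results, weights), 1))  # → 83.3
```

Averaging replicates before weighting means a single flaky run moves a task's contribution by at most a third of its weight, which is one way the 3x replication could serve the stated reproducibility goal.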
Open Infrastructure
All prompts, model parameters, and raw logs are publicly auditable. Submit new models or tasks via GitHub.
GitHub Repository