Luau Benchmark Rankings
Comparative performance and efficiency analysis for leading intelligence models.
Cohort: 61 Models
Spec: v1.0
Updated: 07/03/2026
| # | Intelligence Model | Model ID | Weighted Score |
|---|---|---|---|
| 1 | OpenAI: GPT-5.4 | openai/gpt-5.4 | 80.3% |
| 2 | Google: Gemini 3 Flash Preview | google/gemini-3-flash-preview | 77.9% |
| 3 | Anthropic: Claude Opus 4.6 | anthropic/claude-opus-4.6 | 77.5% |
| 4 | Anthropic: Claude Opus 4.5 | anthropic/claude-opus-4.5 | 77.5% |
| 5 | MoonshotAI: Kimi K2 0711 | moonshotai/kimi-k2 | 77.4% |
| 6 | Anthropic: Claude Haiku 4.5 | anthropic/claude-haiku-4.5 | 76.7% |
| 7 | OpenAI: GPT-5.2 Chat | openai/gpt-5.2-chat | 76.7% |
| 8 | xAI: Grok Code Fast 1 | x-ai/grok-code-fast-1 | 76.5% |
| 9 | Anthropic: Claude Sonnet 4.5 | anthropic/claude-sonnet-4.5 | 76.2% |
| 10 | OpenAI: GPT-5.3 Chat | openai/gpt-5.3-chat | 75.8% |
| 11 | Anthropic: Claude Sonnet 4.6 | anthropic/claude-sonnet-4.6 | 75.8% |
| 12 | DeepSeek: DeepSeek V3.1 Terminus | deepseek/deepseek-v3.1-terminus | 75.6% |
| 13 | Google: Gemini 3.1 Flash Lite Preview | google/gemini-3.1-flash-lite-preview | 75.1% |
| 14 | Inception: Mercury Coder | inception/mercury-coder | 74.9% |
| 15 | OpenAI: GPT-5.2-Codex | openai/gpt-5.2-codex | 74.5% |
| 16 | xAI: Grok 4 Fast | x-ai/grok-4-fast | 73.7% |
| 17 | DeepSeek: DeepSeek V3.1 | deepseek/deepseek-chat-v3.1 | 73.2% |
| 18 | Meta: Llama 4 Maverick | meta-llama/llama-4-maverick | 72.0% |
| 19 | Mistral: Mistral Small Creative | mistralai/mistral-small-creative | 71.8% |
| 20 | MoonshotAI: Kimi K2 0905 | moonshotai/kimi-k2-0905 | 71.7% |
| 21 | DeepSeek: DeepSeek V3.2 | deepseek/deepseek-v3.2 | 71.4% |
| 22 | Google: Gemini 2.5 Flash Lite Preview 09-2025 | google/gemini-2.5-flash-lite-preview-09-2025 | 70.8% |
| 23 | xAI: Grok 3 Mini | x-ai/grok-3-mini | 70.5% |
| 24 | Mistral: Devstral 2 2512 | mistralai/devstral-2512 | 69.4% |
| 25 | Inception: Mercury | inception/mercury | 66.9% |
| 26 | Cohere: Command A | cohere/command-a | 65.9% |
| 27 | Qwen: Qwen3.5-Flash | qwen/qwen3.5-flash-02-23 | 64.2% |
| 28 | xAI: Grok 4.1 Fast | x-ai/grok-4.1-fast | 62.2% |
| 29 | Qwen: Qwen3.5-122B-A10B | qwen/qwen3.5-122b-a10b | 61.6% |
| 30 | Meta: Llama 4 Scout | meta-llama/llama-4-scout | 60.9% |
| 31 | Mistral: Ministral 3 14B 2512 | mistralai/ministral-14b-2512 | 60.5% |
| 32 | Meta: Llama 3.3 70B Instruct | meta-llama/llama-3.3-70b-instruct | 59.2% |
| 33 | OpenAI: GPT-5.3-Codex | openai/gpt-5.3-codex | 53.8% |
| 34 | Mistral: Ministral 3 8B 2512 | mistralai/ministral-8b-2512 | 53.0% |
| 35 | Inception: Mercury 2 | inception/mercury-2 | 49.1% |
| 36 | MiniMax: MiniMax M2-her | minimax/minimax-m2-her | 48.7% |
| 37 | Cohere: Command R (08-2024) | cohere/command-r-08-2024 | 47.6% |
| 38 | Qwen: Qwen3.5-27B | qwen/qwen3.5-27b | 45.1% |
| 39 | xAI: Grok 4 | x-ai/grok-4 | 44.5% |
| 40 | Mistral: Ministral 3 3B 2512 | mistralai/ministral-3b-2512 | 43.1% |
| 41 | Cohere: Command R+ (08-2024) | cohere/command-r-plus-08-2024 | 35.8% |
| 42 | Meta: Llama 3.2 3B Instruct | meta-llama/llama-3.2-3b-instruct | 33.7% |
| 43 | Cohere: Command R7B (12-2024) | cohere/command-r7b-12-2024 | 29.9% |
| 44 | Qwen: Qwen3.5-35B-A3B | qwen/qwen3.5-35b-a3b | 25.1% |
| 45 | DeepSeek: R1 0528 | deepseek/deepseek-r1-0528 | 21.4% |
| 46 | Meta: Llama 3.2 1B Instruct | meta-llama/llama-3.2-1b-instruct | 21.2% |
| 47 | MiniMax: MiniMax M2.1 | minimax/minimax-m2.1 | 21.1% |
| 48 | Z.ai: GLM 4.6 | z-ai/glm-4.6 | 21.1% |
| 49 | MoonshotAI: Kimi K2 Thinking | moonshotai/kimi-k2-thinking | 17.5% |
| 50 | MiniMax: MiniMax M1 | minimax/minimax-m1 | 16.6% |
| 51 | Z.ai: GLM 4.6V | z-ai/glm-4.6v | 15.9% |
| 52 | MiniMax: MiniMax M2.5 | minimax/minimax-m2.5 | 11.6% |
| 53 | Z.ai: GLM 4.7 | z-ai/glm-4.7 | 10.0% |
| 54 | Z.ai: GLM 5 | z-ai/glm-5 | 8.5% |
| 55 | Qwen: Qwen3.5 Plus 2026-02-15 | qwen/qwen3.5-plus-02-15 | 6.2% |
| 56 | Google: Gemini 3.1 Pro Preview | google/gemini-3.1-pro-preview | 4.9% |
| 57 | Google: Gemini 3 Pro Preview | google/gemini-3-pro-preview | 4.8% |
| 58 | MiniMax: MiniMax M2 | minimax/minimax-m2 | 4.6% |
| 59 | MoonshotAI: Kimi K2.5 | moonshotai/kimi-k2.5 | 2.7% |
| 60 | DeepSeek: DeepSeek V3.2 Speciale | deepseek/deepseek-v3.2-speciale | 0.7% |
| 61 | Z.ai: GLM 4.7 Flash | z-ai/glm-4.7-flash | 0.0% |
Research Methodology
Each model is evaluated on 30 tasks in Luau-specific environments. Scores are generated through automated testing, with three replicates per task to ensure reproducibility.
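The exact aggregation behind the Weighted Score column is not spelled out here; the sketch below is a minimal illustration, assuming each task's pass rate over its three replicates is combined under per-task weights (uniform by default). The function names and task IDs are hypothetical, not part of the benchmark's published tooling.

```python
# Hypothetical scoring sketch (not the benchmark's published code): assumes the
# weighted score is each task's pass rate over its replicates, combined with
# per-task weights. Task IDs, weights, and function names are illustrative.

def task_pass_rate(replicate_results: list[bool]) -> float:
    """Fraction of a task's replicates that passed (e.g. 2/3 -> 0.667)."""
    return sum(replicate_results) / len(replicate_results)

def weighted_score(results: dict[str, list[bool]],
                   weights: dict[str, float] | None = None) -> float:
    """Combine per-task pass rates into a single percentage.

    `results` maps task ID -> pass/fail outcomes, one per replicate.
    `weights` maps task ID -> weight; uniform weights are assumed if omitted.
    """
    if weights is None:
        weights = {task: 1.0 for task in results}
    total_weight = sum(weights.values())
    weighted_sum = sum(weights[task] * task_pass_rate(outcomes)
                       for task, outcomes in results.items())
    return 100.0 * weighted_sum / total_weight

# Example: 2/3 replicates pass on one task, 3/3 on another -> 83.3%.
example = {"luau-task-01": [True, True, False],
           "luau-task-02": [True, True, True]}
print(f"{weighted_score(example):.1f}%")
```

Under this assumed scheme with 30 uniformly weighted tasks and three replicates each, a single replicate flipping from fail to pass moves the final score by about 1.1 percentage points.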
Open Infrastructure
All prompts, model parameters, and raw logs are publicly auditable. Submit new models or tasks via GitHub.