Grok-2 has been evaluated across several performance benchmarks that measure its capabilities in reasoning, language understanding, mathematics, coding, and multimodal tasks. The key benchmarks include:
- GPQA (Graduate-Level Google-Proof Q&A): Tests advanced scientific reasoning and knowledge. Grok-2 achieved a score of 56.0%, a significant improvement over its predecessor.
- MMLU (Massive Multitask Language Understanding): Evaluates general knowledge across multiple disciplines. Grok-2 scored 87.5%.
- MMLU-Pro: A more challenging version of MMLU with harder reasoning-focused questions. Grok-2 scored 75.5%.
- MATH: Assesses mathematical problem-solving abilities. Grok-2 achieved a score of 76.1%.
- HumanEval: Measures code generation by asking the model to complete Python functions from docstring specifications. Grok-2 scored 88.4%, indicating strong performance in code generation.
- MMMU (Massive Multi-discipline Multimodal Understanding): Tests multimodal reasoning over college-level questions that combine text with images such as charts, diagrams, and photographs. Grok-2 scored 66.1%.
- MathVista: Evaluates mathematical reasoning over visual inputs such as plots and geometric figures. Grok-2 scored 69.0%.
- DocVQA (Document Visual Question Answering): Measures the model's ability to extract and reason over information in document images. Grok-2 scored 93.6%, demonstrating its strength in document-based tasks.
These benchmarks highlight Grok-2's strengths in scientific reasoning, multimodal tasks, and document-based question answering, while also showing competitive performance in general knowledge and coding tasks.
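As context for the HumanEval figure above: code-generation benchmarks are commonly reported with the pass@k metric introduced alongside HumanEval, which estimates the probability that at least one of k sampled completions passes the unit tests. A minimal sketch of the standard unbiased estimator (the source does not state which k Grok-2's 88.4% uses; pass@1 is the usual convention):

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator for one problem.

    n: total completions sampled for the problem
    c: number of those completions that passed the tests
    k: budget of attempts being scored
    """
    if n - c < k:
        # Fewer failures than k draws: some draw must succeed.
        return 1.0
    # 1 - P(all k drawn completions are failures)
    return 1.0 - comb(n - c, k) / comb(n, k)


# A benchmark score averages this over all problems, e.g. for two problems:
score = (pass_at_k(10, 7, 1) + pass_at_k(10, 3, 1)) / 2  # → 0.5
```

Averaging `pass_at_k` across every problem in the suite yields the single percentage that leaderboards report.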