MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval

Shaden Alshammari; Kevin Wen; Abrar Zainal; Mark Hamilton; Navid Safaei; Sultan Albarakati; William T. Freeman; Antonio Torralba

✨ TL;DR

MathNet is a large-scale, multilingual dataset of 30,676 Olympiad-level math problems from 47 countries spanning two decades, designed to benchmark both mathematical reasoning in generative models and mathematical retrieval in embedding systems. The benchmark shows that even state-of-the-art models struggle with these problems, with the best-performing model reaching only 78.4% accuracy, and that retrieval quality strongly affects retrieval-augmented generation performance.

01 · Problem

Existing mathematical reasoning benchmarks are limited in scale, language diversity, and task coverage. Current datasets are too small to adequately test modern large language models, focus predominantly on English, and fail to evaluate critical capabilities such as mathematical retrieval, that is, the ability to find semantically or structurally similar problems. This gap matters because mathematical problem solving is a fundamental test of reasoning ability, and real-world mathematical work often requires both solving problems and retrieving relevant prior work or similar examples.

02 · Approach

The authors constructed MathNet by collecting 30,676 expert-authored Olympiad-level mathematics problems with solutions from 47 countries across 17 languages, spanning two decades of competitions. They designed a comprehensive benchmark supporting three distinct tasks: (i) Problem Solving, where models generate solutions to problems; (ii) Math-Aware Retrieval, where embedding models must retrieve mathematically equivalent or structurally similar problems from a corpus; and (iii) Retrieval-Augmented Problem Solving, which combines retrieval with generation. For the retrieval benchmark, human experts curated pairs of mathematically equivalent and structurally similar problems to enable rigorous evaluation of mathematical understanding in embedding models.
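As an illustration of how the Math-Aware Retrieval task could be scored, the sketch below ranks corpus problems by cosine similarity to a query embedding and reports Recall@k against curated gold pairs. The metric choice, the function names, and the toy three-dimensional vectors are assumptions made for illustration; this summary does not state which metric or embedding pipeline the authors used.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_k(query_vec, corpus, k):
    """Return the ids of the k corpus items most similar to the query."""
    ranked = sorted(corpus, key=lambda pid: cosine(query_vec, corpus[pid]), reverse=True)
    return ranked[:k]

def recall_at_k(queries, corpus, gold, k):
    """Fraction of queries whose gold match appears among the top-k results."""
    hits = sum(1 for qid, qvec in queries.items() if gold[qid] in top_k(qvec, corpus, k))
    return hits / len(queries)

# Toy data (hypothetical): stand-ins for embeddings of problem statements.
corpus = {"p1": [1.0, 0.0, 0.0], "p2": [0.0, 1.0, 0.0], "p3": [0.9, 0.1, 0.0]}
queries = {"q1": [1.0, 0.0, 0.0], "q2": [1.0, 0.3, 0.0]}
# Hypothetical expert-curated pairs: each query has one equivalent corpus problem.
gold = {"q1": "p1", "q2": "p2"}

recall_1 = recall_at_k(queries, corpus, gold, k=1)
recall_3 = recall_at_k(queries, corpus, gold, k=3)
```

In this toy setup the embedding of `q2` sits closer to `p3` than to its curated match `p2`, so the pair is missed at small k. That is the kind of failure the retrieval task is meant to expose: surface similarity in embedding space does not guarantee mathematical equivalence.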

03 · Key insights

What the paper shows.

01 State-of-the-art reasoning models still struggle significantly with Olympiad-level problems, with Gemini-3.1-Pro achieving 78.4% and GPT-5 achieving 69.3%, indicating substantial room for improvement in mathematical reasoning

02 Embedding models demonstrate poor performance on mathematical retrieval tasks, struggling to identify mathematically equivalent or structurally similar problems despite their semantic relationships

03 Retrieval-augmented generation performance is highly sensitive to retrieval quality, with DeepSeek-V3.2-Speciale showing up to 12% improvement when provided with high-quality retrieved examples

04 The dataset's scale, multilingual coverage (17 languages, 47 countries), and expert curation make it the largest high-quality Olympiad dataset available and the first to systematically evaluate mathematical retrieval capabilities
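To make the retrieval-augmented setting concrete, here is a minimal sketch of how retrieved problems and their solutions might be placed in front of a target problem before it is sent to a generative model. The prompt wording, the `build_prompt` helper, and the `max_examples` parameter are hypothetical; the summary does not describe the authors' actual prompt format.

```python
def build_prompt(problem, retrieved, max_examples=2):
    """Assemble a prompt from retrieved (problem, solution) examples plus the target problem.

    `retrieved` is a list of dicts with "problem" and "solution" keys,
    assumed to be ordered from most to least relevant.
    """
    parts = []
    for i, example in enumerate(retrieved[:max_examples], start=1):
        parts.append(
            f"Related problem {i}:\n{example['problem']}\n"
            f"Solution {i}:\n{example['solution']}"
        )
    parts.append(f"Problem to solve:\n{problem}")
    return "\n\n".join(parts)

# Placeholder texts (hypothetical), not taken from the dataset.
retrieved = [
    {"problem": "P-A", "solution": "S-A"},
    {"problem": "P-B", "solution": "S-B"},
    {"problem": "P-C", "solution": "S-C"},
]
prompt = build_prompt("Target", retrieved, max_examples=2)
```

Because the examples are copied into the prompt verbatim, an irrelevant retrieved problem takes up context and may mislead the model, which is consistent with the reported sensitivity to retrieval quality.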

04 · Results

The benchmark evaluation revealed that top-performing models achieve 78.4% accuracy (Gemini-3.1-Pro) and 69.3% accuracy (GPT-5) on the problem-solving task, demonstrating that even state-of-the-art systems are challenged by Olympiad-level mathematics. Embedding models showed poor performance on the mathematical retrieval task, indicating fundamental limitations in capturing mathematical equivalence and structural similarity. In the retrieval-augmented setting, DeepSeek-V3.2-Speciale achieved the highest scores on the benchmark with gains of up to 12% when provided with retrieved examples, though performance was highly dependent on retrieval quality. These results establish strong baselines while highlighting significant gaps in both mathematical reasoning and retrieval capabilities.
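The summary reports gains of "up to 12%" without saying whether this is an absolute difference in accuracy or a relative improvement over the baseline. The sketch below computes both quantities from per-problem correctness flags so the distinction is explicit. The flags are toy values, and the function names are hypothetical.

```python
def accuracy(is_correct):
    """Fraction of problems answered correctly, given a list of booleans."""
    if not is_correct:
        raise ValueError("need at least one graded problem")
    return sum(is_correct) / len(is_correct)

def gains(baseline_acc, augmented_acc):
    """Return (absolute gain, relative gain); relative gain is None if the baseline is 0."""
    absolute = augmented_acc - baseline_acc
    relative = absolute / baseline_acc if baseline_acc > 0 else None
    return absolute, relative

# Toy grading results (hypothetical), one flag per problem.
baseline_flags = [True, False, True, False]   # model alone
augmented_flags = [True, True, True, False]   # model with retrieved examples

baseline_acc = accuracy(baseline_flags)
augmented_acc = accuracy(augmented_flags)
absolute_gain, relative_gain = gains(baseline_acc, augmented_acc)
```

With these toy flags the absolute gain is 25 percentage points while the relative gain is 50%, which shows how far apart the two readings of a single "gain" figure can be.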

05 · Limitations

The benchmark focuses on Olympiad-level problems, which are challenging but represent a specific subset of mathematical reasoning and may not capture all aspects of problem-solving encountered in research or applied settings. The retrieval benchmark relies on human-curated pairs of equivalent and similar problems, which may introduce subjective bias in what counts as mathematical equivalence or structural similarity. The dataset, despite being the largest of its kind, is still limited to competition mathematics from specific countries and time periods, and may miss mathematical traditions or problem types from underrepresented regions. Finally, the evaluation focuses primarily on final-answer accuracy, which may not fully capture the quality of the reasoning or the pedagogical value of a solution.

✨ Generated by Claude · Apr 21, 2026 · Read the PDF for authoritative content.
