VisAidMath

Benchmarking Visual-Aided Mathematical Reasoning

1NLP2CT Lab, Department of Computer and Information Science, University of Macau,
2University of Macau, 3DAMO Academy, Alibaba Group

• Dataset: 1.2K samples | 16 data sources | 7 visual-aid types | 3 difficulty levels | 4 math branches
• Core Value: First benchmark supporting cross-domain visual-aided mathematical reasoning

Overview

Key Features

  • 16 diverse data sources across academic and educational domains
  • 7 types of visual aids supporting mathematical reasoning
  • 4 mathematical branches: plane geometry, solid geometry, analytic geometry, and calculus
  • 3-level difficulty classification for comprehensive evaluation
  • Cross-domain visual-mathematical reasoning assessment (a data-loading sketch follows this list)
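
Taken together, these features suggest one record per problem. The snippet below is a minimal, hypothetical sketch of loading such records; the file name and every field name (branch, difficulty, visual_aid_type) are illustrative assumptions, not the dataset's actual schema.

    # Minimal sketch of iterating over VisAidMath-style samples stored as
    # JSON Lines. All field names below are hypothetical.
    import json

    def load_samples(path):
        """Yield one sample dict per line of a JSONL file."""
        with open(path, encoding="utf-8") as f:
            for line in f:
                yield json.loads(line)

    for sample in load_samples("visaidmath.jsonl"):
        print(sample["branch"],            # e.g. "plane geometry"
              sample["difficulty"],        # one of three levels
              sample["visual_aid_type"])   # one of seven visual-aid types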

Dataset Comparison


Figure 1. Comparison between VisAidMath and other benchmarks. Our work focuses in particular on the utilization of explicit and implicit visual context during the reasoning process.

Model Performance Comparison


Figure 2. Accuracies of all LMMs on the visual-aided mathematical reasoning task across four mathematical branches and six visual-aid types. Even the best-performing model, GPT-4V, does not exceed 60% accuracy.

VisAidMath Dataset

Data Sources & Distribution

Data Sources

Data Source                          Detail
High Textbook                        Chinese high school textbook
Middle Practice                      Chinese middle school practice sheet
AP Easy                              AP Calculus (categorized as Easy)
Middle Simulate                      Chinese middle school simulated examination
AP Middle                            AP Calculus (categorized as Medium)
High Practice                        Chinese high school practice sheet
DSE Final                            HKDSE final examination
High Final                           Chinese high school final examination
High Simulate                        Chinese high school simulated examination
Math Analysis Demidovich Textbook    Demidovich's Problems in Mathematical Analysis
Analytic Geometry Lv Textbook        Analytic geometry textbook by Lingen Lv
CMO Final                            Chinese Mathematical Olympiad
CMO Practice                         Chinese Mathematical Olympiad practice sheet
AIME Final                           American Invitational Mathematics Examination (AIME)
AMC 8 Practice                       American Mathematics Competitions 8 (AMC 8)
AMC 10 Final                         American Mathematics Competitions 10 (AMC 10)

Data Distribution


Figure 3. Distribution of samples across mathematical branches and visual-aid types. The dataset maintains balance across difficulty levels to ensure comprehensive evaluation.

Task Comparison

VisAidMath evaluates models on mathematical reasoning tasks that require visual understanding. Below we compare how different models perform on representative tasks from our benchmark, highlighting their varying ability to integrate visual information with mathematical reasoning.


Figure 5. Comparison of model performance on representative tasks from our benchmark.

Examples

Mathematical Branches

Difficulty Levels

Visual-Aid Types

Annotation Process


Figure 4. Pipeline involving data collection, annotation, and verification.

Experimental Results

Task 1: (p)CQ2(p)VA Visual-Aided Mathematical Reasoning

Following the abbreviations used in Figures 6-7, C and Q denote the given context and question, V and A the generated visual aids and answer, and the optional (p) marks pictorial form.
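
To make the input/output format concrete, here is a minimal hypothetical sample for this task; the field names and prompt structure are illustrative assumptions, not the paper's actual templates.

    # Hypothetical (p)CQ2(p)VA sample; all field names are assumptions.
    task_input = {
        "picture": "diagram.png",  # optional pictorial context, the "(p)"
        "context": "In triangle ABC, AB = AC and D is the midpoint of BC.",  # C
        "question": "Prove that AD is perpendicular to BC.",                 # Q
    }
    expected_output = {
        "visual_aids": "Draw the auxiliary line AD and mark BD = DC.",  # V
        "answer": "AD is perpendicular to BC by isosceles symmetry.",   # A
    }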

Model ALL PLG SDG AYG CAL AXL RTC THC PLG SDG FUG
Heuristics Baselines
Random Answer 24.42 21.54 34.31 21.45 20.07 24.44 20.87 35.16 10.53 32.89 21.50
Frequent Answer 40.83 28.92 50.65 40.36 44.22 32.79 47.25 74.73 20.00 47.73 44.53
Large Language Models (LLMs): Text-Only Input
Llama2-7B 26.83 21.85 34.64 30.55 20.75 26.68 25.23 39.56 11.58 30.26 26.49
Mistral-7B-Instruct-v0.2 27.42 27.38 30.72 27.64 23.81 27.57 28.21 28.57 11.58 27.63 26.87
GPT-3.5 37.58 32.31 42.16 37.45 38.78 37.56 38.30 40.66 13.68 42.11 38.20
GPT-4 51.92 41.54 52.29 50.91 63.95 45.75 54.59 60.44 23.16 53.29 61.23
Large Multimodal Models (LMMs): Text-Only Input
LLaVA-Next-Mistral-7B 23.08 21.23 22.55 25.45 23.47 22.21 23.62 25.27 8.42 26.32 25.34
InternLM-XComposer2-VL 33.17 24.62 44.12 32.36 31.97 30.40 33.03 46.15 10.53 41.45 34.17
Qwen-VL-Plus 34.75 30.15 43.46 33.82 31.63 34.43 34.63 48.35 21.05 44.74 32.63
Gemini-Pro-Vision 38.42 31.08 48.37 31.27 42.86 34.72 37.84 49.45 18.95 51.97 39.54
Claude-3-Sonnet 38.58 31.38 43.46 39.27 40.82 36.66 40.14 46.15 14.74 43.42 42.23
GPT-4V 47.00 35.08 47.06 50.55 56.80 41.43 50.69 48.35 15.79 47.37 55.66
Large Multimodal Models (LMMs): Multimodal Input
LLaVA-Next-Mistral-7B 24.58 22.77 24.18 27.64 24.15 23.55 24.54 29.67 9.47 25.00 25.91
InternLM-XComposer2-VL 29.00 21.54 32.68 31.64 30.95 26.97 30.73 37.36 10.53 35.53 32.05
Qwen-VL-Plus 32.00 28.62 35.95 33.45 30.27 32.34 33.49 32.97 21.05 42.11 32.05
Gemini-Pro-Vision 38.33 28.92 48.69 32.73 43.20 33.68 38.07 50.55 14.74 53.95 39.73
Claude-3-Sonnet 37.08 27.69 41.50 39.27 40.82 33.38 40.60 46.15 14.74 41.45 42.42
GPT-4V 45.33 34.46 42.16 49.45 56.80 39.64 50.00 41.76 13.68 46.71 55.28

In the table, green highlighting marks the best performance and yellow the second-best; the two heuristic baselines (Random Answer and Frequent Answer) are sketched in code after the legend below.

ALL: Overall accuracy

Math branches: PLG: Plane Geometry, SDG: Solid Geometry, AYG: Analytic Geometry, CAL: Calculus and Functions

Visual aid types: AXL: Auxiliary Lines, RTC: Rectangular Coordinates, THC: Three-dimensional Coordinates, PLG: Plane Geometry Figures, SDG: Solid Geometry Figures, FUG: Function Graphs
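
For reference, the two heuristic baselines at the top of the table can be approximated as follows. This is a minimal sketch assuming a flat list of gold answers; the paper's exact protocol (e.g., per-category computation) may differ.

    # Rough sketch of the heuristic baselines. `answers` is a hypothetical
    # list of gold answers, one entry per benchmark sample.
    import random
    from collections import Counter

    def random_answer_accuracy(answers, seed=0):
        """Guess uniformly at random among the distinct gold answers."""
        rng = random.Random(seed)
        pool = sorted(set(answers))
        guesses = [rng.choice(pool) for _ in answers]
        return sum(g == a for g, a in zip(guesses, answers)) / len(answers)

    def frequent_answer_accuracy(answers):
        """Always predict the single most frequent gold answer."""
        _, count = Counter(answers).most_common(1)[0]
        return count / len(answers)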

Visual-aids Construction Performance


Figure 6. N-gram similarity of answers between general reasoning (CQ2A) and visual-aided reasoning (CQ2VA).


Figure 7. N-gram similarity of visual aids between model hypotheses and references (CQ2VA).

Constructing visual aids is a critical step in solving these mathematical problems. The figures above show how well different models construct auxiliary lines, coordinate systems, and other visual elements, a capability that directly affects final reasoning performance. GPT-4V performs best in most cases but still shows notable deficiencies on certain geometry problems; a sketch of the similarity metric follows below.
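
As a point of reference, the sketch below computes one common form of n-gram similarity (clipped n-gram precision) between a model hypothesis and a reference; the paper's exact metric may differ, and this is only meant to illustrate what Figures 6 and 7 measure.

    # Clipped n-gram precision: the fraction of hypothesis n-grams that
    # also appear in the reference, with counts clipped to the reference.
    from collections import Counter

    def ngram_counts(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def ngram_similarity(hyp, ref, n=2):
        h = ngram_counts(hyp.split(), n)
        r = ngram_counts(ref.split(), n)
        overlap = sum(min(count, r[gram]) for gram, count in h.items())
        total = sum(h.values())
        return overlap / total if total else 0.0

    # Example: 2 of the 3 hypothesis bigrams occur in the reference.
    print(ngram_similarity("draw auxiliary line AD", "draw the auxiliary line AD"))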

Quantitative Analysis

Qualitative Analysis

General Reasoning Tendency

Visual Aid Inference Capability

Citation

@article{ma2024visaidmath,
  title   = {VisAidMath: Benchmarking Visual-Aided Mathematical Reasoning},
  author  = {Ma, Jingkun and Zhan, Runzhe and Wong, Derek F. and Li, Yang and Sun, Di and Chan, Hou Pong and Chao, Lidia S.},
  journal = {arXiv preprint arXiv:2410.22995},
  year    = {2024}
}