VisAidMath

Benchmarking Visual-Aided Mathematical Reasoning

1NLP2CT Lab, Department of Computer and Information Science, University of Macau,
2University of Macau, 3DAMO Academy, Alibaba Group

• Dataset: 1.2K samples | 16 data sources | 7 visual-aid types | 3 difficulty levels | 4 math branches
• Core Value: First benchmark supporting cross-domain visual-aided mathematical reasoning

Overview

Key Features

  • 16 diverse data sources across academic and educational domains
  • 7 types of visual aids supporting mathematical reasoning
  • 4 mathematical branches: plane geometry, solid geometry, analytic geometry, and calculus
  • 3-level difficulty classification for comprehensive evaluation
  • Cross-domain visual-mathematical reasoning assessment (a data-loading sketch follows this list)
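
Taken together, these features suggest one record per problem. The snippet below is a minimal, hypothetical sketch of loading such records; the file name and every field name (branch, difficulty, visual_aid_type) are illustrative assumptions, not the dataset's actual schema.

    # Minimal sketch of iterating over VisAidMath-style samples stored as
    # JSON Lines. All field names below are hypothetical.
    import json

    def load_samples(path):
        """Yield one sample dict per line of a JSONL file."""
        with open(path, encoding="utf-8") as f:
            for line in f:
                yield json.loads(line)

    for sample in load_samples("visaidmath.jsonl"):
        print(sample["branch"],            # e.g. "plane geometry"
              sample["difficulty"],        # one of three levels
              sample["visual_aid_type"])   # one of seven visual-aid types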

Dataset Comparison


Figure 1. Comparison between VisAidMath and other benchmarks. Our work focuses in particular on the utilization of explicit and implicit visual context during the reasoning process.

Model Performance Comparison


Figure 2. Accuracies of all LMMs on the visual-aided mathematical reasoning task across four mathematical branches and six visual-aid types. Even the best-performing model, GPT-4V, does not exceed 60% accuracy.

VisAidMath Dataset

Data Sources & Distribution

Data Sources

Data Source                          Detail
High Textbook                        Chinese high school textbook
Middle Practice                      Chinese middle school practice sheet
AP Easy                              AP Calculus (categorized as Easy)
Middle Simulate                      Chinese middle school simulated examination
AP Middle                            AP Calculus (categorized as Medium)
High Practice                        Chinese high school practice sheet
DSE Final                            HKDSE final examination
High Final                           Chinese high school final examination
High Simulate                        Chinese high school simulated examination
Math Analysis Demidovich Textbook    Demidovich's Problems in Mathematical Analysis
Analytic Geometry Lv Textbook        Analytic geometry textbook by Lingen Lv
CMO Final                            Chinese Mathematical Olympiad
CMO Practice                         Chinese Mathematical Olympiad practice sheet
AIME Final                           American Invitational Mathematics Examination (AIME)
AMC 8 Practice                       American Mathematics Competitions 8 (AMC 8)
AMC 10 Final                         American Mathematics Competitions 10 (AMC 10)

Data Distribution


Figure 3. Distribution of samples across mathematical branches and visual-aid types. The dataset maintains balance across difficulty levels to ensure comprehensive evaluation.

Task Comparison

VisAidMath evaluates models on mathematical reasoning tasks that require visual understanding. Below we compare how different models perform on representative tasks from our benchmark, highlighting their varying ability to integrate visual information with mathematical reasoning.


Figure 5. Comparison of model performance on representative tasks from our benchmark.

Examples

Mathematical Branches

Difficulty Levels

Visual-Aid Types

Annotation Process


Figure 4. Pipeline involving data collection, annotation, and verification.

Experimental Results

Task 1: (p)CQ2(p)VA Visual-Aided Mathematical Reasoning

Following the abbreviations used in Figures 6-7, C and Q denote the given context and question, V and A the generated visual aids and answer, and the optional (p) marks pictorial form.
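
To make the input/output format concrete, here is a minimal hypothetical sample for this task; the field names and prompt structure are illustrative assumptions, not the paper's actual templates.

    # Hypothetical (p)CQ2(p)VA sample; all field names are assumptions.
    task_input = {
        "picture": "diagram.png",  # optional pictorial context, the "(p)"
        "context": "In triangle ABC, AB = AC and D is the midpoint of BC.",  # C
        "question": "Prove that AD is perpendicular to BC.",                 # Q
    }
    expected_output = {
        "visual_aids": "Draw the auxiliary line AD and mark BD = DC.",  # V
        "answer": "AD is perpendicular to BC by isosceles symmetry.",   # A
    }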

Model ALL PLG SDG AYG CAL AXL RTC THC PLG SDG FUG
Heuristics Baselines
Random Answer 24.42 21.54 34.31 21.45 20.07 24.44 20.87 35.16 10.53 32.89 21.50
Frequent Answer 40.83 28.92 50.65 40.36 44.22 32.79 47.25 74.73 20.00 47.73 44.53
Large Language Models (LLMs): Text-Only Input
Llama2-7B 26.83 21.85 34.64 30.55 20.75 26.68 25.23 39.56 11.58 30.26 26.49
Mistral-7B-Instruct-v0.2 27.42 27.38 30.72 27.64 23.81 27.57 28.21 28.57 11.58 27.63 26.87
GPT-3.5 37.58 32.31 42.16 37.45 38.78 37.56 38.30 40.66 13.68 42.11 38.20
GPT-4 51.92 41.54 52.29 50.91 63.95 45.75 54.59 60.44 23.16 53.29 61.23
Large Multimodal Models (LMMs): Text-Only Input
LLaVA-Next-Mistral-7B 23.08 21.23 22.55 25.45 23.47 22.21 23.62 25.27 8.42 26.32 25.34
InternLM-XComposer2-VL 33.17 24.62 44.12 32.36 31.97 30.40 33.03 46.15 10.53 41.45 34.17
Qwen-VL-Plus 34.75 30.15 43.46 33.82 31.63 34.43 34.63 48.35 21.05 44.74 32.63
Gemini-Pro-Vision 38.42 31.08 48.37 31.27 42.86 34.72 37.84 49.45 18.95 51.97 39.54
Claude-3-Sonnet 38.58 31.38 43.46 39.27 40.82 36.66 40.14 46.15 14.74 43.42 42.23
GPT-4V 47.00 35.08 47.06 50.55 56.80 41.43 50.69 48.35 15.79 47.37 55.66
Large Multimodal Models (LMMs): Multimodal Input
LLaVA-Next-Mistral-7B 24.58 22.77 24.18 27.64 24.15 23.55 24.54 29.67 9.47 25.00 25.91
InternLM-XComposer2-VL 29.00 21.54 32.68 31.64 30.95 26.97 30.73 37.36 10.53 35.53 32.05
Qwen-VL-Plus 32.00 28.62 35.95 33.45 30.27 32.34 33.49 32.97 21.05 42.11 32.05
Gemini-Pro-Vision 38.33 28.92 48.69 32.73 43.20 33.68 38.07 50.55 14.74 53.95 39.73
Claude-3-Sonnet 37.08 27.69 41.50 39.27 40.82 33.38 40.60 46.15 14.74 41.45 42.42
GPT-4V 45.33 34.46 42.16 49.45 56.80 39.64 50.00 41.76 13.68 46.71 55.28

In the table, green highlighting marks the best performance and yellow the second-best; the two heuristic baselines (Random Answer and Frequent Answer) are sketched in code after the legend below.

ALL: Overall accuracy

Math branches: PLG: Plane Geometry, SDG: Solid Geometry, AYG: Analytic Geometry, CAL: Calculus and Functions

Visual aid types: AXL: Auxiliary Lines, RTC: Rectangular Coordinates, THC: Three-dimensional Coordinates, PLG: Plane Geometry Figures, SDG: Solid Geometry Figures, FUG: Function Graphs
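
For reference, the two heuristic baselines at the top of the table can be approximated as follows. This is a minimal sketch assuming a flat list of gold answers; the paper's exact protocol (e.g., per-category computation) may differ.

    # Rough sketch of the heuristic baselines. `answers` is a hypothetical
    # list of gold answers, one entry per benchmark sample.
    import random
    from collections import Counter

    def random_answer_accuracy(answers, seed=0):
        """Guess uniformly at random among the distinct gold answers."""
        rng = random.Random(seed)
        pool = sorted(set(answers))
        guesses = [rng.choice(pool) for _ in answers]
        return sum(g == a for g, a in zip(guesses, answers)) / len(answers)

    def frequent_answer_accuracy(answers):
        """Always predict the single most frequent gold answer."""
        _, count = Counter(answers).most_common(1)[0]
        return count / len(answers)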

Visual-aids Construction Performance


Figure 6. N-gram similarity of answers between general reasoning (CQ2A) and visual-aided reasoning (CQ2VA).


Figure 7. N-gram similarity of visual aids between model hypotheses and references (CQ2VA).

Constructing visual aids is a critical step in solving these mathematical problems. The figures above show how well different models construct auxiliary lines, coordinate systems, and other visual elements, a capability that directly affects final reasoning performance. GPT-4V performs best in most cases but still shows notable deficiencies on certain geometry problems; a sketch of the similarity metric follows below.
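
As a point of reference, the sketch below computes one common form of n-gram similarity (clipped n-gram precision) between a model hypothesis and a reference; the paper's exact metric may differ, and this is only meant to illustrate what Figures 6 and 7 measure.

    # Clipped n-gram precision: the fraction of hypothesis n-grams that
    # also appear in the reference, with counts clipped to the reference.
    from collections import Counter

    def ngram_counts(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def ngram_similarity(hyp, ref, n=2):
        h = ngram_counts(hyp.split(), n)
        r = ngram_counts(ref.split(), n)
        overlap = sum(min(count, r[gram]) for gram, count in h.items())
        total = sum(h.values())
        return overlap / total if total else 0.0

    # Example: 2 of the 3 hypothesis bigrams occur in the reference.
    print(ngram_similarity("draw auxiliary line AD", "draw the auxiliary line AD"))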

Quantitative Analysis

Qualitative Analysis

General Reasoning Tendency

Visual Aid Inference Capability

Citation

@article{ma2024visaidmath,
  title   = {VisAidMath: Benchmarking Visual-Aided Mathematical Reasoning},
  author  = {Ma, Jingkun and Zhan, Runzhe and Wong, Derek F. and Li, Yang and Sun, Di and Chan, Hou Pong and Chao, Lidia S.},
  journal = {arXiv preprint arXiv:2410.22995},
  year    = {2024}
}