Knowledge-Based Visual Question Answering (KB-VQA) benchmarks have systematic flaws, according to a study by Qian Ma, S M Rayeed, Charles V. Stewart, Qiong Wu, and Yao Ma published on June 30, 2026. The authors argue that these flaws call for immediate reform of evaluation protocols.
Critical Flaws in Existing KB-VQA Protocols
The study finds that current KB-VQA benchmarks rest on assumptions that are often violated. These include that each annotated answer can be derived from the associated knowledge base, and that each question is well posed, with sufficient constraints. The authors report a substantial number of cases in which the annotated answer is missing from the knowledge base or contradicted by it, which makes the reported accuracy metrics misleading.
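The derivability assumption described above can be illustrated with a small check. The following is a minimal sketch, not the authors' audit procedure: it assumes each sample carries the text of its linked knowledge-base entry, and it uses a crude proxy (the normalized answer appearing as a whole-token span in that text). The names `Sample`, `is_derivable`, and `audit` are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Sample:
    # Hypothetical record layout; real benchmarks will differ.
    question: str
    answer: str
    kb_passage: str  # text of the knowledge-base entry linked to the sample


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(text.split())


def is_derivable(sample: Sample) -> bool:
    """Crude proxy: the answer occurs as a whole-token span in the passage."""
    answer = normalize(sample.answer)
    passage = normalize(sample.kb_passage)
    if not answer:
        return False
    # Padding with spaces prevents partial-word matches.
    return f" {answer} " in f" {passage} "


def audit(samples):
    """Return the samples whose answers fail the derivability proxy."""
    return [s for s in samples if not is_derivable(s)]


if __name__ == "__main__":
    data = [
        Sample("Which city is this tower in?", "Paris",
               "The Eiffel Tower is in Paris, France."),
        Sample("When was this tower completed?", "1889",
               "The Eiffel Tower is in Paris, France."),
    ]
    for s in audit(data):
        print("flag for review:", s.question)
```

A string match like this only catches answers that are absent from the passage. Detecting answers that the knowledge base contradicts, or questions that are under-constrained, would need human review or a stronger check.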
Moreover, the benchmarks tend to use visually trivial, single-entity scenes. Such scenes remove the need for complex visual-to-knowledge mapping, which distorts model rankings and inflates assessments of reasoning capability.
Proposed Audit and Repair Protocols
To address these issues, the authors propose a principled audit-and-repair protocol that aims to restore answer derivability and improve question clarity. The work also introduces a controlled multi-entity augmentation protocol that creates visual ambiguity, so that both initial retrieval and grounded reasoning are challenged.
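The idea of controlled multi-entity augmentation can be sketched at the data level. This is an illustrative assumption rather than the authors' protocol: it only builds a scene specification by pairing the target entity with seeded, randomly chosen distractors, and it leaves out image composition entirely. `SceneSpec` and `augment` are hypothetical names.

```python
import random
from dataclasses import dataclass, field


@dataclass
class SceneSpec:
    # Hypothetical description of an augmented scene.
    target: str
    distractors: list = field(default_factory=list)

    @property
    def entities(self):
        """All entities in the scene, target first."""
        return [self.target, *self.distractors]


def augment(target, pool, k=2, seed=0):
    """Pick k distinct distractor entities from pool, excluding the target.

    A fixed seed keeps the augmentation reproducible ("controlled").
    """
    rng = random.Random(seed)
    # Sort so the result does not depend on set iteration order.
    candidates = sorted(set(pool) - {target})
    if len(candidates) < k:
        raise ValueError("not enough distinct distractors in pool")
    return SceneSpec(target=target, distractors=rng.sample(candidates, k))


if __name__ == "__main__":
    spec = augment("red fox", ["red fox", "gray wolf", "coyote", "lynx"], k=2)
    print(spec.entities)
```

With several entities in one scene, a model has to identify which entity the question refers to before retrieving knowledge about it, which is the ambiguity the paragraph above describes.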


