The Office Comprehension Benchmark (OCB) was introduced on May 29, 2026, by a team of researchers led by Firoz Shaik. It is the first public evaluation framework for assessing how well large language models (LLMs) comprehend Word, Excel, and PowerPoint files in their native formats.
Overview of Office Comprehension Benchmark
The OCB consists of two main tracks: File Fidelity Q&A and Domain Q&A. The File Fidelity track tests whether a model understands the structure and visual elements of office documents, including tables, charts, and embedded images. The Domain Q&A track focuses on expert-level reasoning over real-world industry documents across 12 professional domains.
Each reference answer in the benchmark is broken down into atomic, binary-gradable claims, and LLM judges grade a model's response against each claim independently. Notably, even the most advanced LLMs currently achieve only about 59.3% accuracy on Domain Q&A, which indicates that expert-level reasoning over office documents remains difficult.
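The claim-based grading described above can be illustrated with a minimal Python sketch. It is not the OCB's evaluation code. The names `score_response` and `keyword_judge` are hypothetical, the substring-matching judge is only a stand-in for an LLM judge, and averaging the binary verdicts into a single score is an assumption, since the aggregation rule is not specified here.

```python
from typing import Callable, List

# A judge takes (response, claim) and returns a binary verdict.
# In the benchmark this role is played by an LLM judge; the type
# alias here is an illustrative assumption.
Judge = Callable[[str, str], bool]


def keyword_judge(response: str, claim: str) -> bool:
    """Toy stand-in for an LLM judge: case-insensitive substring match."""
    return claim.lower() in response.lower()


def score_response(response: str, claims: List[str], judge: Judge) -> float:
    """Grade each atomic claim independently and return the fraction supported.

    Averaging the verdicts is an assumed aggregation rule, not one
    confirmed by the benchmark's description.
    """
    if not claims:
        raise ValueError("a reference answer needs at least one claim")
    verdicts = [judge(response, claim) for claim in claims]
    return sum(verdicts) / len(verdicts)


# Hypothetical reference answer decomposed into three atomic claims.
claims = [
    "revenue grew 12%",
    "chart is a bar chart",
    "table has 4 rows",
]
response = "Revenue grew 12% and the chart is a bar chart."
score = score_response(response, claims, keyword_judge)
```

Because each claim is judged on its own, a response that gets part of the answer right receives partial credit rather than an all-or-nothing grade. In this example, two of the three claims are supported.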
Key Features of OCB
The OCB pairs its two tracks with claim-level grading, so a model is evaluated both on reading office files faithfully and on reasoning over their contents. It emphasizes multi-step analysis and synthesis across documents, which real-world applications require. The benchmark is released alongside evaluation tools, judge prompts, and a public leaderboard.


