Test-time verification for Text-to-SQL aims to make large language models (LLMs) more reliable at inference time. A recent paper by Mattia Tritto and colleagues, submitted on June 29, 2026, explores Outcome Reward Models (ORMs) as an approach to improving structured reasoning tasks such as Text-to-SQL. The paper was accepted at the SURGeLLM Workshop at ACL 2026 in San Diego, US.
Outcome Reward Models Transform Text-to-SQL
The study highlights the limits of common test-time inference strategies such as Best-of-N sampling and Majority Voting, which rely on heuristic signals like execution success. A query that executes without error can still answer a different question from the one asked, so these signals often lack the semantic discrimination needed to select the best candidate. The authors propose ORMs as learned semantic scoring functions that strengthen verification in Text-to-SQL tasks.
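To make the contrast between these selection strategies concrete, here is a minimal sketch in Python using an in-memory SQLite database. It is not the paper's implementation: the table, the candidate queries, the helper names (`run`, `first_executable`, `majority_vote`, `orm_select`) and especially `toy_score` are hypothetical stand-ins, with `toy_score` taking the place of a trained ORM.

```python
import sqlite3
from collections import Counter

# Hypothetical toy database, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Ana", "Sales", 50), ("Bo", "Sales", 60), ("Cy", "Ops", 55)],
)

def run(conn, sql):
    """Execute sql; return a hashable, order-insensitive result, or None on error."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return tuple(sorted(rows, key=repr))

def first_executable(conn, candidates):
    """Execution-success heuristic: keep the first candidate that runs."""
    for sql in candidates:
        if run(conn, sql) is not None:
            return sql
    return None

def majority_vote(conn, candidates):
    """Pick a candidate whose execution result is the most common one."""
    results = [run(conn, sql) for sql in candidates]
    counts = Counter(r for r in results if r is not None)
    if not counts:
        return None
    top_result, _ = counts.most_common(1)[0]
    return candidates[results.index(top_result)]

def orm_select(question, candidates, score):
    """ORM-style selection: keep the candidate with the highest score."""
    return max(candidates, key=lambda sql: score(question, sql))

def toy_score(question, sql):
    """Assumption: placeholder for a learned ORM, not a real model."""
    return 1.0 if "'Sales'" in sql else 0.0

question = "How many employees work in Sales?"
candidates = [
    "SELECT COUNT(*) FROM employees",                       # runs, wrong answer
    "SELECT COUNT(*) FROM employees WHERE salary > 0",      # runs, wrong answer
    "SELECT COUNT(*) FROM employees WHERE dept = 'Sales'",  # runs, correct
    "SELECT COUNT(*) FROM employee",                        # fails: no such table
]

print(first_executable(conn, candidates))
print(majority_vote(conn, candidates))
print(orm_select(question, candidates, toy_score))
```

In this toy setup, both heuristics settle on a query that runs but ignores the department filter, because two wrong candidates happen to agree on their result. Only the scoring function, which looks at the question and the query together, picks the filtered query. A real ORM would replace `toy_score` with a trained model.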
The researchers introduce GradeSQL, a scalable framework for training task-specific ORMs. It automates candidate generation and execution-based labeling, which reduces the need for manual annotation and makes verifier training more efficient.
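One plausible reading of execution-based labeling is sketched below in Python. It assumes that each training question comes with a reference (gold) query and that a candidate is labeled correct when its execution result matches the gold result. That matching rule, the function names, and the toy data are assumptions made for illustration, not details confirmed by the article.

```python
import sqlite3

def execute(conn, sql):
    """Run sql and return its rows in a canonical order, or None if it fails."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    # Sorting ignores row order, so ORDER BY differences are not detected.
    return sorted(rows, key=repr)

def label_candidates(conn, question, gold_sql, candidates):
    """Label each candidate 1 if its result equals the gold result, else 0."""
    gold_rows = execute(conn, gold_sql)
    if gold_rows is None:
        raise ValueError("gold query failed to execute")
    examples = []
    for sql in candidates:
        label = int(execute(conn, sql) == gold_rows)
        examples.append({"question": question, "sql": sql, "label": label})
    return examples

# Hypothetical toy database and candidates. In practice the candidates
# would be sampled from an LLM rather than written by hand.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Ana", "Sales", 50), ("Bo", "Sales", 60), ("Cy", "Ops", 55)],
)

examples = label_candidates(
    conn,
    question="How many employees work in Sales?",
    gold_sql="SELECT COUNT(*) FROM employees WHERE dept = 'Sales'",
    candidates=[
        "SELECT COUNT(*) FROM employees",                         # wrong result
        "SELECT COUNT(name) FROM employees WHERE dept = 'Sales'", # same result as gold
        "SELECT COUNT(*) FROM employee",                          # execution error
    ],
)
for example in examples:
    print(example["label"], example["sql"])
```

The resulting (question, SQL, label) triples are the kind of data a verifier could be trained on without hand annotation. Comparing results rather than query text lets differently written but equivalent queries receive a positive label.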
Performance Evaluation on BIRD and Spider Benchmarks
ORM-based selection was evaluated on the BIRD and Spider benchmarks across several open-source LLM families. The results show that ORM-based verification consistently outperforms execution-based Best-of-N and Majority Voting, with gains of up to 4.33% on BIRD and 2.10% on Spider.



