SLIM-RL, a reinforcement learning method for diffusion large language models (dLLMs), was introduced by Ruikang Zhao, Zhenting Wang, Han Gao, and Ligong Han in a paper submitted on June 30, 2026. It targets a limitation of trajectory-aware methods such as TraceRL, which must reconstruct the decoding trajectory during training. SLIM-RL instead uses a tau-budget decoder to limit commit risk during rollout, so training does not need trajectory slicing.
Advancements in Reinforcement Learning Techniques
SLIM-RL aims to make dLLM training more efficient through a risk-controlled rollout strategy. The rollout bounds the commit risk at each step, while optimization keeps a trace-free random-masking objective. The method also uses variance-reduction tools, namely sequence-level importance sampling and deterministic quadrature, together with a new per-block mask schedule.
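To make the idea of a per-step risk bound concrete, here is a minimal sketch of one way a budgeted commit rule could work. It is an illustration under stated assumptions, not the paper's algorithm: the function name `tau_budget_commit`, the use of `1 - confidence` as the risk of committing a token, and the rule of always committing the most confident position are all hypothetical choices made for this example.

```python
from typing import List


def tau_budget_commit(confidences: List[float], tau: float) -> List[int]:
    """Pick which masked positions to commit in one decoding step.

    Assumed rule (hypothetical, not taken from the paper): the risk of
    committing a position is 1 - confidence. Positions are visited from
    most to least confident, and selection stops before the summed risk
    would exceed tau. The most confident position is always committed so
    that decoding makes progress even with a very small budget.
    """
    # Indices of masked positions, most confident first.
    order = sorted(range(len(confidences)),
                   key=lambda i: confidences[i], reverse=True)
    chosen: List[int] = []
    spent = 0.0  # risk budget used so far in this step
    for i in order:
        risk = 1.0 - confidences[i]
        if chosen and spent + risk > tau:
            break  # committing this position would exceed the budget
        chosen.append(i)
        spent += risk
    return sorted(chosen)


if __name__ == "__main__":
    # Four masked positions with model confidences, budget tau = 0.1.
    print(tau_budget_commit([0.99, 0.60, 0.95, 0.90], tau=0.1))  # [0, 2]
```

In this toy rule, a larger `tau` commits more tokens per step (faster, riskier decoding), while a smaller `tau` approaches one token per step.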
According to the reported results, SLIM-RL matches TraceRL's best accuracy on the MATH500 dataset while using only 0.46x of TraceRL's training samples at a block size of 16. Under matched dynamic sampling conditions, it achieved a 6.32% improvement on MATH500 and an 11.05% improvement on the GSM8K benchmark.
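For scale, the 0.46x figure can be restated as a sample ratio. The symbols below are introduced here for illustration only, and the calculation assumes the figure is a plain ratio of training-sample counts needed to reach the same MATH500 accuracy:

```latex
\[
N_{\text{SLIM-RL}} = 0.46\, N_{\text{TraceRL}}
\quad\Longrightarrow\quad
\frac{N_{\text{TraceRL}}}{N_{\text{SLIM-RL}}} = \frac{1}{0.46} \approx 2.17,
\qquad
1 - 0.46 = 0.54 .
\]
```

Under that reading, TraceRL needs roughly 2.17 times as many samples, or equivalently SLIM-RL uses about 54% fewer.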
Performance Comparisons with Other Models
At a block size of 4, SLIM-RL outperformed larger dLLMs, including LLaDA-8B and Dream-7B, with a 10.76% gain over LLaDA-8B on MATH500. It also surpassed TraceRL by 4.20% on the MBPP coding benchmark and by 3.65% on HumanEval.
The tau-budget decoder also transfers across architectures, including LLaDA and Dream, which broadens where it can be applied.
Accessing the Research and Source Code
The full paper, "SLIM-RL: Risk-Budgeted Random-Masking RL for Diffusion LLMs Without Trajectory Slicing," is publicly available in PDF format, along with its source code. The work adds to ongoing efforts to make reinforcement learning for large language models more efficient.
- Authors: Ruikang Zhao, Zhenting Wang, Han Gao, Ligong Han
- Submission Date: June 30, 2026
- Improvements: 6.32% on MATH500, 11.05% on GSM8K (matched dynamic sampling)
- Performance: Surpassed LLaDA-8B by 10.76% on MATH500 (block size 4)
🤖 This article was rewritten by Feed and Figures' editorial AI from a report originally published by arXiv NLP. Facts and quotes are preserved from the original; the rewrite focuses on clarity and structure. For the unedited original, see the source link below.