Zhongyao Wang recently addressed a notable asymmetry in machine learning in a paper titled "Why Do Few-Step Text Latents Fail When Image Latents Work? Non-Commitment at Sharp Categorical Readouts." The paper, submitted on June 29, 2026, examines why few-step generation fails in text latents yet succeeds in image latents.
Understanding Few-Step Generation in Text and Image Latents
Few-step generation produces coherent images, but the same approach fails to produce meaningful text. The paper identifies the root cause as geometric rather than a deficiency of training or scale: the author argues that a smooth, regularity-limited deterministic map cannot resolve discrete branch choices before a sharp categorical readout.
The failure is therefore attributed to decoder sharpness rather than transport accuracy. On this basis the paper draws a clear distinction between how image and text latents behave under deterministic generation.
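The non-commitment argument can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the paper's construction: the two token centers at -1 and +1, the softmax-over-distance readout, the `temperature` parameter, and the 0.05 nudge are all hypothetical choices made for illustration. A deterministic map that averages two equally likely branches lands exactly on the decision boundary, where the readout stays at 50/50 however sharp it is, while a tiny nudge off the boundary is amplified into a near-certain token as the readout sharpens.

```python
import math


def softmax_readout(z, centers, temperature):
    """Categorical readout: softmax over negative squared distance to each token center.

    Smaller temperature means a sharper readout (hypothetical toy decoder).
    """
    logits = [-((z - c) ** 2) / temperature for c in centers]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Assumption: two discrete branches with latent codes at -1 and +1, equally likely.
centers = [-1.0, 1.0]
branch_latents = [-1.0, 1.0]

# A smooth deterministic map that cannot pick a branch returns their mean.
z_mean = sum(branch_latents) / len(branch_latents)

for temperature in (1.0, 0.1, 0.01):
    p_at_mean = softmax_readout(z_mean, centers, temperature)
    p_nudged = softmax_readout(z_mean + 0.05, centers, temperature)
    print(f"T={temperature}: at mean {p_at_mean}, nudged by 0.05 {p_nudged}")
```

In this toy, a linear image-like readout of `z_mean` would simply return a blend of the two branches, whereas the categorical readout has no meaningful blend to return: it is either undecided or decided by an arbitrarily small perturbation.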
Key Findings on Decoder Sharpness and Categorical Commitment
In the overlapping regime of real text autoencoders, the study finds that the posterior-mean terminal step flips tokens at a rate that depends on the latent mass lying within an O(s(t)) tube around decision boundaries. The paper also introduces two diagnostic metrics: DABI, which measures readout sharpness, and CCI, which measures categorical commitment.
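The link between flip rate and tube mass can be sketched with a small Monte Carlo experiment. Everything here is an illustrative assumption rather than the paper's setup: the one-dimensional latent, the two overlapping Gaussian modes at -1 and +1, the hard sign readout, and the treatment of `s` as a bounded error scale for the final step. The sketch does not implement DABI or CCI, whose definitions are not given above. With a perturbation bounded by `s`, a token can only flip if the latent already lies within distance `s` of the boundary, so the flip rate never exceeds the tube mass.

```python
import random


def sample_latent(rng, sigma=0.5):
    """Draw z from an equal mixture of N(-1, sigma^2) and N(+1, sigma^2) (toy assumption)."""
    center = rng.choice((-1.0, 1.0))
    return rng.gauss(center, sigma)


def readout(z):
    """Hard categorical readout: token 1 if z >= 0, else token 0."""
    return 1 if z >= 0.0 else 0


def tube_and_flip_rates(s, n=20000, seed=0):
    """Estimate the mass within distance s of the boundary and the token flip rate.

    The terminal-step error is modeled as uniform noise in [-s, s] (an assumption).
    """
    rng = random.Random(seed)
    in_tube = 0
    flips = 0
    for _ in range(n):
        z = sample_latent(rng)
        delta = rng.uniform(-s, s)
        if abs(z) <= s:
            in_tube += 1
        if readout(z + delta) != readout(z):
            flips += 1
    return in_tube / n, flips / n


for s in (0.05, 0.1, 0.2, 0.4):
    tube_mass, flip_rate = tube_and_flip_rates(s)
    print(f"s={s}: tube mass {tube_mass:.4f}, flip rate {flip_rate:.4f}")
```

Because the same seed is reused for every `s`, the latents are identical across runs, which makes the growth of both quantities with `s` easy to compare.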



