Controllable Image Generation with Composed Parallel Token Prediction

Jamie Stirling, Noura Al Moubayed, Chris G. Willcocks and Hubert P. H. Shum
Proceedings of the 2026 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshop (CVPRW), 2026

Abstract

Conditional discrete generative models struggle to faithfully compose multiple input conditions. To address this, we derive a theoretically grounded formulation for composing discrete probabilistic generative processes, with masked generation (absorbing diffusion) as a special case. Our formulation enables precise specification of novel combinations and numbers of input conditions that lie outside the training data, with concept weighting enabling emphasis or negation of individual conditions. In synergy with the richly compositional learned vocabulary of VQ-VAE and VQ-GAN, our method attains a 63.4% relative reduction in error rate compared to the previous state-of-the-art, averaged across three datasets (positional CLEVR, relational CLEVR and FFHQ), while simultaneously lowering FID by an average of 9.58 in absolute terms. Meanwhile, our method offers a 2.3× to 12× real-time speed-up over comparable methods, and is readily applied to an open pre-trained discrete text-to-image model for fine-grained control of text-to-image generation.
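
The abstract does not spell out the composition rule, so the following is only an illustrative sketch of how weighted composition of per-token distributions might look, not the paper's formulation. It assumes the conditions are conditionally independent given the token, in which case Bayes' rule gives log p(x | c_1..c_n) = log p(x) + sum_i w_i (log p(x | c_i) - log p(x)) up to a constant. The function name `compose_token_logprobs`, the toy three-token vocabulary, and the logits are all hypothetical; in a real masked-generation model the logits would come from the network at each masked position.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def compose_token_logprobs(uncond_logits, cond_logits_list, weights):
    """Compose per-condition token distributions for one masked position.

    Assumed rule (a sketch, not taken from the paper):
        log p(x | c_1..c_n) = log p(x)
                              + sum_i w_i * (log p(x | c_i) - log p(x)) + const
    A weight above 1 emphasises a condition; a negative weight negates it.
    """
    base = log_softmax(uncond_logits)
    composed = list(base)
    for cond_logits, w in zip(cond_logits_list, weights):
        cond = log_softmax(cond_logits)
        composed = [acc + w * (c - b) for acc, c, b in zip(composed, cond, base)]
    # Renormalise so the result is a valid log-probability vector.
    return log_softmax(composed)


# Toy example: condition A favours token 0, condition B favours token 1.
uncond = [0.0, 0.0, 0.0]
cond_a = [2.0, 0.0, 0.0]
cond_b = [0.0, 2.0, 0.0]

both = compose_token_logprobs(uncond, [cond_a, cond_b], [1.0, 1.0])
a_not_b = compose_token_logprobs(uncond, [cond_a, cond_b], [1.0, -1.0])
```

In this toy setting, `both` assigns equal probability to tokens 0 and 1, while `a_not_b` ranks token 0 highest and token 1 lowest, which is the kind of emphasis and negation that concept weighting is described as providing.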


Cite This Research

Plain Text

Jamie Stirling, Noura Al Moubayed, Chris G. Willcocks and Hubert P. H. Shum, "Controllable Image Generation with Composed Parallel Token Prediction," in Proceedings of the 2026 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshop, Colorado, USA, IEEE/CVF, 2026.

BibTeX

@inproceedings{stirling26controllable,
 author={Stirling, Jamie and Al Moubayed, Noura and Willcocks, Chris G. and Shum, Hubert P. H.},
 booktitle={Proceedings of the 2026 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshop},
 title={Controllable Image Generation with Composed Parallel Token Prediction},
 year={2026},
 publisher={IEEE/CVF},
 location={Colorado, USA},
}

RIS

TY  - CONF
AU  - Stirling, Jamie
AU  - Al Moubayed, Noura
AU  - Willcocks, Chris G.
AU  - Shum, Hubert P. H.
T2  - Proceedings of the 2026 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshop
TI  - Controllable Image Generation with Composed Parallel Token Prediction
PY  - 2026
PB  - IEEE/CVF
ER  - 


Similar Research

Jamie Stirling, Noura Al Moubayed and Hubert P. H. Shum, "Investigating Permutation-Invariant Discrete Representation Learning for Spatially Aligned Images", Proceedings of the 2026 International Conference on Pattern Recognition (ICPR), 2026
Jiaxu Liu, Li Li, Hubert P. H. Shum and Toby P. Breckon, "TFDM: Time-Variant Frequency-Based Point Cloud Diffusion with State Space Model", Proceedings of the 2026 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshop (CVPRW), 2026
Shuang Chen, Amir Atapour-Abarghouei and Hubert P. H. Shum, "HINT: High-quality INpainting Transformer with Mask-Aware Encoding and Enhanced Attention", IEEE Transactions on Multimedia (TMM), 2024
Shuang Chen, Amir Atapour-Abarghouei, Haozheng Zhang and Hubert P. H. Shum, "MxT: Mamba x Transformer for Image Inpainting", Proceedings of the 2024 British Machine Vision Conference (BMVC), 2024
