RealCustom++: Representing Images as Real-Word for Real-Time Customization

Zhendong Mao¹, Mengqi Huang¹, Fei Ding², Mingcong Liu², Qian He², Yongdong Zhang¹

¹University of Science and Technology of China, ²ByteDance

This work extends our previous CVPR 2024 paper, RealCustom.

RealCustom++ is now applied in Dreamina (JiMeng). Enjoy your wonderful journey!

Acknowledgement: We'd like to thank Yiming Luo for valuable help with data processing.


Our proposed new paradigm, RealCustom++, naturally supports ALL types of customization:
One2One: Given only a single image of a subject in the open domain (any subject you could imagine: humans, cartoon characters, clothes, toys, buildings, etc.), RealCustom++ generates realistic and harmonious images of that subject that consistently adhere to the given text, in real time (without any test-time optimization steps).
One2Many: Given a single image containing multiple subjects, RealCustom++ can decouple and customize each subject within the reference image.
Many2Many: Given multiple reference images, RealCustom++ can combine different subjects from different reference images into a single customization result. The customized native real words are highlighted in color.

Abstract (TL;DR)

Existing text-to-image customization works follow the pseudo-word paradigm: (1) they first represent the given subjects as pseudo-words that do not exist in the textual vocabulary, and (2) they then compose these pseudo-words with the other given text to guide the generation collectively. However, the inherent conflict and entanglement between the pseudo-words and the text result in a dual-optimum paradox, where subject similarity and text controllability cannot be optimal simultaneously.
We introduce a novel real-words paradigm named RealCustom++, which instead represents subjects as non-conflicting real words, thereby disentangling subject similarity from text controllability and allowing both to be optimized simultaneously. The core idea of RealCustom++ is to represent the given subjects as real words (i.e., their super-category words) that can be seamlessly integrated with the given text, and to further leverage the relevance between real words and image regions to disentangle subjects from text.

How does it work?

Illustration of RealCustom++, which employs a novel “train-inference” decoupled framework:
Training Paradigm: RealCustom++ learns the alignment between vision conditions and all real words in the text. This is achieved by the Cross-layer Cross-Scale Projector (CCP) to robustly and finely extract subject features, and a Curriculum Training Recipe (CTR) that adapts the generated subject to diverse poses and sizes. The projected visual conditions are injected into the diffusion models by extending their textual cross-attention with an additional visual cross-attention in each block.
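The injection step described above, a visual cross-attention added alongside the textual cross-attention in a block, can be illustrated with a minimal NumPy sketch. This is not the released implementation: the single-head layout, the additive combination, the `visual_scale` factor, and all function and parameter names are assumptions made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_k, w_v):
    # Single-head cross-attention: image latents attend to a set of condition tokens.
    keys = context @ w_k
    values = context @ w_v
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ values

def dual_cross_attention(queries, text_tokens, visual_tokens, params, visual_scale=1.0):
    # Assumption: the textual and visual branches are combined by simple addition.
    text_out = cross_attention(queries, text_tokens, params["w_k_text"], params["w_v_text"])
    visual_out = cross_attention(queries, visual_tokens, params["w_k_vis"], params["w_v_vis"])
    return text_out + visual_scale * visual_out

# Toy shapes (hypothetical): 16 latent positions, 5 text tokens, 4 projected visual tokens.
rng = np.random.default_rng(0)
dim = 8
params = {name: rng.normal(size=(dim, dim))
          for name in ("w_k_text", "w_v_text", "w_k_vis", "w_v_vis")}
latent_queries = rng.normal(size=(16, dim))
text_tokens = rng.normal(size=(5, dim))
visual_tokens = rng.normal(size=(4, dim))

block_out = dual_cross_attention(latent_queries, text_tokens, visual_tokens, params)
text_only = dual_cross_attention(latent_queries, text_tokens, visual_tokens, params,
                                 visual_scale=0.0)
```

With `visual_scale=0.0` the sketch reduces to ordinary textual cross-attention, which is one way to see that the visual branch is an extension rather than a replacement.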
Inference Paradigm: A novel Adaptive Mask Guidance (AMG) is proposed to customize the generation of the specific target real word ("toy" in this case). This is achieved by incorporating two branches for each generation step: a Guidance Branch that constructs the image guidance mask and a Generation Branch that uses the image guidance mask to keep other subject-irrelevant regions uncontaminated.
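The two-branch idea can be sketched in a few lines of NumPy. The page does not spell out how the guidance mask is built, so the rule used here (keep the latent positions with the highest textual attention to the target word, controlled by a hypothetical `keep_ratio`) is an assumption, as are all function names.

```python
import numpy as np

def guidance_mask(text_attention, target_index, keep_ratio=0.25):
    # Guidance branch (assumed rule): keep the positions that attend most
    # strongly to the target real word, e.g. "toy".
    relevance = text_attention[:, target_index]
    num_keep = max(1, int(round(keep_ratio * relevance.size)))
    threshold = np.sort(relevance)[-num_keep]
    return (relevance >= threshold).astype(relevance.dtype)

def masked_generation(text_out, visual_out, mask):
    # Generation branch: the visual condition only acts inside the mask, so
    # subject-irrelevant positions stay controlled by the text alone.
    return text_out + mask[:, None] * visual_out

# Toy example: 4 latent positions, 3 text tokens, target word at token index 2.
text_attention = np.array([
    [0.6, 0.3, 0.1],
    [0.2, 0.1, 0.7],
    [0.5, 0.3, 0.2],
    [0.05, 0.05, 0.9],
])
mask = guidance_mask(text_attention, target_index=2, keep_ratio=0.5)

text_out = np.ones((4, 2))
visual_out = np.full((4, 2), 2.0)
result = masked_generation(text_out, visual_out, mask)
```

In this toy case only positions 1 and 3 receive the visual contribution; positions 0 and 2 keep the text-only output unchanged.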

Comparison To Current Methods

Qualitative comparison of single-subject customization.

Qualitative comparison of multiple-subject customization.

Visualization of How RealCustom++ Works

Illustration of gradually customizing the target real words into the given subjects for single-subject customization. The customized words are highlighted in red, with their attention maps gradually forming into the given subjects and details being added step by step. This process provides a more accurate image guidance mask for open-domain customization, while the remaining subject-irrelevant parts are completely controlled by the given text.
Illustration of gradually customizing the native real words into the given subjects for multiple-subject customization. The different customized words in each case are highlighted in blue and green, respectively. We show that RealCustom++ provides accurate and decoupled guidance masks for each given subject, thereby achieving high-quality similarity for each subject.