Existing text-to-image customization works
follow the pseudo-word paradigm, i.e.,
(1) representing the given subjects as pseudo-words that do not exist in the textual vocabulary, and
(2) composing them with the other given text to guide the generation collectively.
However, the inherent conflict and entanglement between the pseudo-words and texts result in a dual-optimum paradox,
where subject similarity and text controllability cannot be optimal simultaneously.
We introduce a novel real-words paradigm named RealCustom++,
which instead represents subjects as non-conflicting real words,
thereby disentangling subject similarity from text controllability and allowing both to be optimized simultaneously.
The core idea of RealCustom++ is to represent the given subjects as real words (i.e., their super-category words)
that can be seamlessly integrated with the given texts,
and to further leverage the relevance between real words and image regions to disentangle subjects from texts.
Illustration of RealCustom++, which employs a novel “train-inference” decoupled framework:
Training Paradigm:
RealCustom++ learns the alignment between visual conditions and all real words in the text.
This is achieved by a Cross-layer Cross-Scale Projector (CCP) that robustly and finely extracts subject features,
and a Curriculum Training Recipe (CTR) that adapts the generated subject to diverse poses and sizes.
The projected visual conditions are injected into the diffusion models
by extending their textual cross-attention with an additional visual cross-attention in each block.
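The injection described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the single-head attention, the additive combination of the two attention outputs, and the `visual_scale` parameter are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v):
    # Single-head cross-attention: image positions attend to context tokens.
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def dual_cross_attention(latents, text_tokens, visual_tokens,
                         text_w, visual_w, visual_scale=1.0):
    # Assumption: the visual branch is added to the textual branch.
    text_out = cross_attention(latents, text_tokens, *text_w)
    visual_out = cross_attention(latents, visual_tokens, *visual_w)
    return text_out + visual_scale * visual_out

rng = np.random.default_rng(0)
d = 8
latents = rng.normal(size=(16, d))       # 16 image positions
text_tokens = rng.normal(size=(5, d))    # 5 text tokens
visual_tokens = rng.normal(size=(4, d))  # 4 projected visual tokens
text_w = tuple(rng.normal(size=(d, d)) for _ in range(3))
visual_w = tuple(rng.normal(size=(d, d)) for _ in range(3))

out = dual_cross_attention(latents, text_tokens, visual_tokens, text_w, visual_w)
text_only = dual_cross_attention(latents, text_tokens, visual_tokens,
                                 text_w, visual_w, visual_scale=0.0)
```

In this sketch the visual branch has its own projection weights, so setting `visual_scale` to zero recovers the original text-only cross-attention output.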
Inference Paradigm:
A novel Adaptive Mask Guidance (AMG) is proposed to customize the generation of the specific target real word ("toy" in this case).
This is achieved by incorporating two branches for each generation step: a Guidance Branch that constructs the image guidance mask
and a Generation Branch that uses the image guidance mask to keep other subject-irrelevant regions uncontaminated.
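The two-branch idea can be sketched as follows. This is a hedged illustration only: the top-k selection rule, the `keep_ratio` parameter, the max normalization, and the function names are assumptions and are not taken from the paper. The guidance step turns the target word's attention map into a mask, and the generation step applies the visual condition only where the mask is non-zero.

```python
import numpy as np

def guidance_mask(word_attention, keep_ratio=0.25):
    # Keep the most relevant image positions for the target real word
    # (assumed top-k rule), then scale the kept values so the peak is 1.
    flat = word_attention.ravel()
    k = max(1, int(round(keep_ratio * flat.size)))
    threshold = np.sort(flat)[-k]
    mask = np.where(word_attention >= threshold, word_attention, 0.0)
    peak = mask.max()
    return mask / peak if peak > 0 else mask

def masked_generation(text_out, visual_out, mask):
    # Visual features only influence positions inside the mask;
    # everywhere else the text-driven output is left untouched.
    return text_out + mask[..., None] * visual_out

# Toy 4x4 attention map for the target word (e.g. "toy").
word_attention = np.arange(16, dtype=float).reshape(4, 4) / 15.0
mask = guidance_mask(word_attention, keep_ratio=0.25)

text_out = np.zeros((4, 4, 3))
visual_out = np.ones((4, 4, 3))
out = masked_generation(text_out, visual_out, mask)
```

With this toy map, only the four highest-attention positions (the last row) receive visual features, while the remaining positions keep the text-only output.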
Qualitative comparison of single subject customization.
Qualitative comparison of multiple subject customization.