Abstract

We propose an algorithm to improve multi-concept prompt fidelity in text-to-image diffusion models. We start from a common failure: prompts like “a cat and a clock” sometimes yield images where one concept is missing, appears faint, or collides awkwardly with another. We hypothesize this happens when the model drifts into mixed modes that over-emphasize a single, strongly learned concept pattern while weakening the others. Instead of retraining, we introduce a corrective sampling strategy that gently suppresses regions where the joint prompt behavior overlaps too strongly with any single concept's dominant pattern, steering generation toward “pure” joint modes where all concepts coexist with balanced visual presence. We further show that existing multi-concept guidance schemes can operate in unstable weight regimes that amplify imbalance; we characterize favorable regions and adapt sampling to remain within them. The approach is plug-and-play, requires no model tuning, and complements standard classifier-free guidance. Experiments on diverse multi-concept prompts show consistent gains in concept coverage, prominence balance, and robustness, reducing dropped or distorted concepts compared to standard baselines and prior compositional methods. Results suggest that lightweight corrective guidance can substantially mitigate brittle semantic behavior in modern diffusion systems.

Figure 1: Our mode-overlap hypothesis illustrated with a simple 2D toy example. (a) Two modes of the distribution $p_t(x \mid \textit{“a cat and a dog”})$ (shown as contours) overlap significantly with the modes of the individual concept distributions $p_t(x \mid \textit{“a cat”})$ and $p_t(x \mid \textit{“a dog”})$ (also shown as contours). (b) The proposed corrector distribution $p_t(x \mid \textit{“a cat and a dog”}) / \big(p_t(x \mid \textit{“cat”})\, p_t(x \mid \textit{“dog”})\big)$ suppresses these overlaps, steering generation away from the problematic modes. Arrows indicate the denoising directions.

Problematic Modes and Correction Hypothesis

T2I models like Stable Diffusion sample from the modes (or high-probability regions) of the learned distribution $p(x \mid C)$. While such models generally produce high-resolution images, the results are every so often surprisingly misaligned, even for very simple prompts containing only a few concepts, e.g., $C=\textit{“a cat and a dog”}$. Diagnosing exactly why this behavior emerges intermittently is difficult. It is conceivable that the complex, high-dimensional training process, especially in conjunction with text embeddings, creates problematic modes in $p(x \mid C)$.

We hypothesize that problematic modes in $p(x \mid C)$ arise when they overlap with modes of the individual concept distributions $p(x \mid c_i)$. Such an overlap biases generation toward a single concept $c_i$, reducing the prominence of the others. For instance, among training images of $c_1 = \textit{“cat”}$, a few may contain an inconspicuous or partial $c_2 = \textit{“dog”}$ in the background, and such images may still fall under a mode of $p(x \mid C)$. We attribute this to training instabilities and the relatively sparse coverage of multi-concept prompts $C$, which cause the model to assign high probability even to weakly conforming images. Said differently, an image of a prominent cat and an inconspicuous dog can receive high probability under $p(x \mid C)$, causing semantic misalignment.
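To make the overlap hypothesis concrete, the following is a minimal 2D toy in the spirit of Figure 1. The Gaussian means, covariances, and mixture weights are illustrative assumptions, not values from the paper; the point is only that a degenerate, single-concept-dominated mode can carry high probability under $p(x \mid C)$ yet be strongly suppressed once we divide by the individual concept densities.

```python
# Toy 2D illustration of the overlap hypothesis (in the spirit of Figure 1).
# All means, covariances, and mixture weights below are illustrative assumptions.
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Evaluate a 2D Gaussian density at points x of shape (N, 2)."""
    d = x - mean
    inv = np.linalg.inv(cov)
    norm = 1.0 / (2 * np.pi * np.sqrt(np.linalg.det(cov)))
    return norm * np.exp(-0.5 * np.einsum("ni,ij,nj->n", d, inv, d))

cov = 0.5 * np.eye(2)
p_cat = lambda x: gaussian_pdf(x, np.array([-2.0, 0.0]), cov)   # p(x | "cat")
p_dog = lambda x: gaussian_pdf(x, np.array([ 2.0, 0.0]), cov)   # p(x | "dog")

def p_joint(x):
    """p(x | "a cat and a dog"): one pure joint mode plus two degenerate modes
    that coincide with the single-concept modes (weakly conforming images)."""
    pure = gaussian_pdf(x, np.array([0.0, 3.0]), cov)
    return 0.3 * pure + 0.35 * p_cat(x) + 0.35 * p_dog(x)

# Corrector (unnormalized): p(x | C) / (p(x | c1) * p(x | c2)).
p_corrected = lambda x: p_joint(x) / (p_cat(x) * p_dog(x) + 1e-300)

pure_mode  = np.array([[0.0, 3.0]])   # both concepts clearly present
degen_mode = np.array([[-2.0, 0.0]])  # coincides with the "cat" mode
print("p(x|C)     pure vs degenerate:", p_joint(pure_mode)[0], p_joint(degen_mode)[0])
print("corrected  pure vs degenerate:", p_corrected(pure_mode)[0], p_corrected(degen_mode)[0])
```

Under $p(x \mid C)$ the degenerate mode is at least as probable as the pure joint mode, whereas under the corrected ratio it is suppressed by several orders of magnitude.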

Preventing such problematic modes would require strict, specialized training paradigms; a difficult proposition for models of this scale. “Curing” them after they occur is a more viable approach, and, assuming our hypothesis holds, we propose such a cure. The intuitive idea is to move away from problematic modes and toward modes in which no single concept dominates. To realize this, we design a corrector that generates samples from the following distribution:

$$\tilde{p}(x \mid C) \;\propto\; \frac{p(x \mid C)}{\prod_{i} p(x \mid c_i)} \qquad \text{(Corrected Distribution)}$$

Figure 1 illustrates the intuition behind our proposal. Our corrector distribution $\tilde{p}(x \mid C)$ assigns low probability to regions where $p(x \mid C)$ overlaps with the individual $p(x \mid c_i)$; we regard these as degenerate modes dominated by a single concept. By suppressing these overlaps, the corrector emphasizes pure modes of $p(x \mid C)$ where all concepts coexist without one overwhelming the others. From a probabilistic perspective, this acts as a corrective factor: while $p(x \mid C)$ may assign high probability to weakly conforming images due to training noise or limited multi-concept data, dividing by the marginals removes this bias and sharpens the distribution toward genuine multi-concept samples. As a result, the modes we target are more semantically aligned and less prone to concept suppression or distortion.
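In score terms, dividing by the marginals means subtracting the single-concept scores from the joint score, which at sampling time becomes a guidance-style combination of noise predictions. The sketch below is one plausible per-step instantiation rather than the paper's exact procedure: the callable eps_model, the per-concept weights corrector_weights, and the way the correction is folded into standard classifier-free guidance are all our own assumptions.

```python
def corrected_eps(eps_model, x_t, t, emb_joint, emb_concepts, emb_null,
                  cfg_scale=7.5, corrector_weights=None):
    """Noise prediction for one denoising step with a mode-correction term.

    eps_model(x, t, emb) -> predicted noise under conditioning embedding `emb`
    emb_joint    : embedding of the full prompt C (e.g., "a cat and a dog")
    emb_concepts : embeddings of the individual concepts c_i
    emb_null     : embedding of the empty prompt (for classifier-free guidance)
    """
    if corrector_weights is None:
        corrector_weights = [1.0] * len(emb_concepts)

    eps_joint = eps_model(x_t, t, emb_joint)
    eps_null = eps_model(x_t, t, emb_null)

    # Standard classifier-free guidance toward the joint prompt.
    eps = eps_null + cfg_scale * (eps_joint - eps_null)

    # Correction: since eps is proportional to the negative score, adding
    # w_i * (eps_joint - eps_ci) pushes the sample along
    # grad log p(x|C) - grad log p(x|c_i), i.e., away from regions where a
    # single concept's density dominates.
    for w, emb_ci in zip(corrector_weights, emb_concepts):
        eps_ci = eps_model(x_t, t, emb_ci)
        eps = eps + w * (eps_joint - eps_ci)
    return eps
```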

CO3 – Algorithm

Algorithm 1
Algorithm 2
Algorithm 3
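The algorithm listings above appear as figures in the original. As a complementary hedged sketch, the loop below shows where a corrected noise prediction like corrected_eps from the previous sketch would slot into an otherwise standard sampler; the scheduler interface (set_timesteps / init_noise_sigma / scale_model_input / step) mirrors a diffusers-style API and, like the shape and device defaults, is an assumption rather than part of CO3 itself.

```python
import torch

@torch.no_grad()
def sample_with_correction(guided_eps_fn, scheduler, shape=(1, 4, 64, 64),
                           num_steps=50, device="cuda"):
    """Generic latent-diffusion sampling loop. Only the noise prediction changes:
    guided_eps_fn(x_t, t) should return the corrected prediction, e.g., a closure
    around corrected_eps() from the earlier sketch."""
    scheduler.set_timesteps(num_steps, device=device)
    x_t = torch.randn(shape, device=device) * scheduler.init_noise_sigma

    for t in scheduler.timesteps:
        x_in = scheduler.scale_model_input(x_t, t)      # scheduler-specific input scaling
        eps = guided_eps_fn(x_in, t)                    # corrected noise prediction
        x_t = scheduler.step(eps, t, x_t).prev_sample   # one denoising update
    return x_t  # final latents; decode with the model's VAE as usual
```

For example, guided_eps_fn could be `lambda x, t: corrected_eps(eps_model, x, t, emb_joint, emb_concepts, emb_null)`. Because everything happens at sampling time, the base model and its classifier-free guidance setup remain untouched, which is the sense in which the approach is plug-and-play.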

Results

Figure 3: Quantitative comparison of different methods on compositional generation tasks over different categories of prompts. We evaluate the generated images using two metrics: BLIP-VQA and ImageReward. The top-performing model is highlighted in black and the second best in blue; higher scores are better.
Figure 4: Qualitative comparison of different methods on simpler prompts.
Figure 5: Qualitative comparison of CO3 with competing methods on complex prompts.
Figure 6: Additional qualitative comparison of SDXL and SDXL+CO3 on simple prompts.
Figure 7: Model-agnostic behavior: qualitative comparison of generations from the PixART-$\Sigma$ base diffusion model, PixART-$\Sigma$ + CO3, and PixART-$\Sigma$ + Composable Diffusion.

Code