InstanceControl: Controllable Complex Image Generation without Instance Labeling

Authors: Xiaoyu Liu, Huan Wang, Siming Li, Fan Li, Zhixin Wang, Jiaqi Xu, Wengang Zhou, Houqiang Li

arXiv ID: 2606.31924

Problem: Existing controllable image generation methods like ControlNet struggle with attribute confusion in multi-instance scenes, while approaches that fix this require labor-intensive manual instance labeling.

Key Methodology:

Uses a Vision-Language Model (VLM) to automatically parse instance descriptions from text prompts and predict instance masks directly from visual conditions (e.g., canny edges, depth maps, HED), eliminating the need for manual instance labeling.
Introduces a Shared SEG Token (SST) strategy to aggregate multiple descriptive phrases for the same instance into a unified mask representation, handling cases where a single instance is described across scattered text.
Proposes a Mask Refinement Module (MRM) that adaptively refines predicted masks during generation by fusing confidence scores, attention-based masks, and latent image features, relaxing the mask constraints when the predicted masks are unreliable.
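
To make the last two ideas concrete, here is a minimal toy sketch in plain Python. It is not the paper's implementation: the function names (group_phrases, refine_mask), the example phrases, and the linear blending rule are all assumptions chosen for illustration. The first function only shows the bookkeeping behind collecting scattered phrases under one instance, and the second shows one simple way a confidence score could relax a predicted mask toward an attention-based mask. The latent image features that the MRM also uses are left out entirely.

```python
from collections import defaultdict


def group_phrases(parsed):
    """Collect all phrases that describe the same instance under one key.

    `parsed` is a list of (instance_id, phrase) pairs, standing in for
    what a VLM parser might return (hypothetical format).
    """
    groups = defaultdict(list)
    for instance_id, phrase in parsed:
        groups[instance_id].append(phrase)
    return dict(groups)


def refine_mask(predicted, attention, confidence):
    """Blend two soft masks pixel by pixel (assumed rule, not the paper's).

    refined = confidence * predicted + (1 - confidence) * attention
    High confidence keeps the predicted mask; low confidence leans on
    the attention-based mask instead.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    return [
        [confidence * p + (1.0 - confidence) * a for p, a in zip(p_row, a_row)]
        for p_row, a_row in zip(predicted, attention)
    ]


# Two phrases about the cat are scattered around a phrase about the dog.
parsed = [
    ("cat", "a black cat"),
    ("dog", "a small dog"),
    ("cat", "wearing a red collar"),
]
groups = group_phrases(parsed)

# 2x2 soft masks with values in [0, 1].
predicted = [[1.0, 0.0], [1.0, 0.0]]
attention = [[0.0, 0.0], [1.0, 1.0]]
trusted = refine_mask(predicted, attention, confidence=1.0)
unsure = refine_mask(predicted, attention, confidence=0.5)

print(groups)   # {'cat': ['a black cat', 'wearing a red collar'], 'dog': ['a small dog']}
print(trusted)  # [[1.0, 0.0], [1.0, 0.0]]
print(unsure)   # [[0.5, 0.0], [1.0, 0.5]]
```

With confidence 1.0 the predicted mask passes through unchanged. With confidence 0.5 the pixels where the two masks disagree end up at 0.5, which is the "relaxed constraint" behavior in its simplest form.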

Key Results:

Outperforms all baselines (including instance-labeled methods) across canny, depth, and HED conditions on the MIG-Eval benchmark: achieves 95.34% Spatial Accuracy (vs. 91.97% EliGen, 80.74% DreamRenderer) and 9.93 FID (vs. 14.51 DreamRenderer) on canny.
Without instance labeling, beats FLUX ControlNet by ~12.3% in both Accuracy and Local CLIP Score on canny (93.54% vs. 84.67% Spatial Accuracy).
The mask refinement module improves Accuracy from 87.97% to 90.10%; with interactive correction it reaches 93.19%.
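
To show the size of these gaps in absolute terms, the short snippet below subtracts the figures quoted above. It only restates numbers from this summary; the helper name (point_gap) and the variable names are illustrative.

```python
def point_gap(higher, lower):
    """Absolute difference between two reported scores, rounded to 2 decimals."""
    return round(higher - lower, 2)


# Canny condition, Spatial Accuracy (%), as quoted above.
gap_vs_eligen = point_gap(95.34, 91.97)
gap_vs_dreamrenderer = point_gap(95.34, 80.74)

# Canny condition, FID (lower is better), so the baseline goes first.
fid_reduction = point_gap(14.51, 9.93)

# Ablation: Accuracy (%) without MRM, with MRM, with interactive correction.
mrm_gain = point_gap(90.10, 87.97)
interactive_gain = point_gap(93.19, 90.10)

print(gap_vs_eligen)         # 3.37
print(gap_vs_dreamrenderer)  # 14.6
print(fid_reduction)         # 4.58
print(mrm_gain)              # 2.13
print(interactive_gain)      # 3.09
```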

Applied Context: If you're building image generation pipelines where users describe complex scenes with multiple distinct objects (for example design tools, game asset generation, or storyboarding), InstanceControl gives you fine-grained instance control without requiring users to manually draw boxes or masks for each object.