On Language as an Interface for Controlling Image Generation
Xu, Zipeng
2025-12-09
Abstract
The emergence of vision-language foundation models such as CLIP and DALL·E has significantly advanced the use of natural language for image generation and manipulation. These models learn broad alignments between visual and textual representations, enabling flexible, general-purpose multimodal capabilities. However, using language as an efficient and robust interface for image generation remains challenging, with limitations in controllability, semantic expressiveness, and visual fidelity. This thesis addresses several fundamental challenges in employing language as a control interface for image generation with foundation models. Specifically, it investigates (1) enhancing the controllability and precision of language-guided generation, (2) leveraging foundation models to explore and exploit generative latent spaces, and (3) developing a spectral perspective on CLIP embeddings to better analyze and improve generation quality. To address these aspects, we introduce three complementary approaches. Predict, Prevent, and Evaluate (PPE) enhances the controllability and precision of language-guided image manipulation by modeling and regularizing attribute interactions through natural language. StylerDALLE formulates style transfer as a translation in the latent space of generative models, supervised by CLIP-based reinforcement learning so that style and content are jointly preserved during generation. SpectralCLIP analyzes the frequency spectrum of CLIP embeddings to suppress common artifacts in CLIP-guided generation, improving robustness without compromising semantic alignment. Together, these contributions highlight the potential of natural language as a flexible, high-level interface for visual generation, grounded in the capabilities of vision-language foundation models, and demonstrate how language, when appropriately modeled and guided, can effectively control diverse aspects of the generative process.
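To make the spectral perspective concrete, the sketch below illustrates one plausible reading of frequency-domain filtering applied to a sequence of CLIP token embeddings. It is a minimal illustration only, not the thesis's actual SpectralCLIP layer: the function name spectral_filter, the band indices, the tensor shapes, and the choice to filter along the token axis of intermediate features are all assumptions made for exposition.

```python
# Illustrative sketch (assumed shapes and band choices, not the thesis method):
# suppress a band of frequency components along the token axis of a batch of
# CLIP-like intermediate features using an FFT / inverse-FFT round trip.
import torch


def spectral_filter(tokens: torch.Tensor, stop_lo: int, stop_hi: int) -> torch.Tensor:
    """Zero out frequency bins [stop_lo, stop_hi) along the token axis.

    tokens: (batch, num_tokens, dim) intermediate features.
    """
    freq = torch.fft.rfft(tokens, dim=1)                 # FFT over the token axis
    freq[:, stop_lo:stop_hi, :] = 0                      # suppress the chosen band
    return torch.fft.irfft(freq, n=tokens.size(1), dim=1)  # back to feature space


# Toy usage with random "features" standing in for CLIP tokens.
x = torch.randn(2, 50, 512)                              # hypothetical token grid
y = spectral_filter(x, stop_lo=5, stop_hi=15)
print(y.shape)                                           # torch.Size([2, 50, 512])
```

The intuition behind filtering along the token axis, under these assumptions, is that certain frequency bands capture repetitive spatial regularities that surface as visual artifacts in CLIP-guided generation, so attenuating them can reduce artifacts while leaving the overall semantic content of the embedding largely intact.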



