On Language as an Interface for Controlling Image Generation / Xu, Zipeng. - (2025 Dec 09).

On Language as an Interface for Controlling Image Generation

Xu, Zipeng
2025-12-09

Abstract

The emergence of vision-language foundation models, such as CLIP and DALL·E, has significantly advanced the use of natural language for image generation and manipulation. These models learn broad alignments between visual and textual representations, enabling flexible and general-purpose multimodal capabilities. However, using language as an efficient and robust interface for image generation remains challenging, with limitations in controllability, semantic expressiveness, and visual fidelity. This thesis addresses several fundamental challenges in employing language as a control interface for image generation with foundation models. Specifically, it investigates (1) enhancing the controllability and precision of language-guided generation, (2) leveraging foundation models to explore and exploit generative latent spaces, and (3) developing a spectral perspective on CLIP embeddings to better analyze and improve generation quality. To address these challenges, we introduce three complementary approaches. Predict, Prevent, and Evaluate (PPE) enhances the controllability and precision of language-guided image manipulation by modeling and regularizing attribute interactions through natural language. StylerDALLE explores the latent space of generative models, formulating style transfer as translation between latent representations and supervising it with CLIP-based reinforcement learning to jointly preserve style and content. SpectralCLIP filters the frequency spectrum of CLIP embeddings to suppress common artifacts in CLIP-guided generation, improving robustness without compromising semantic alignment. Together, these contributions highlight the potential of natural language as a flexible, high-level interface for visual generation, grounded in the capabilities of vision-language foundation models. The thesis demonstrates how language, when appropriately modeled and guided, can effectively control diverse aspects of the generative process.
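To make the CLIP-based supervision behind StylerDALLE concrete, the sketch below computes a reward that jointly scores style (image-text similarity against a style prompt) and content (image-image similarity against the source image). This is a minimal sketch, not the thesis implementation: it assumes OpenAI's clip package and the ViT-B/32 checkpoint, and the prompt and weighting (style_prompt, w_style) are illustrative placeholders, not the thesis's exact formulation.

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_reward(stylized: Image.Image, content: Image.Image,
                style_prompt: str = "an oil painting",  # placeholder prompt
                w_style: float = 0.5) -> float:         # placeholder weighting
    """Weighted sum of style similarity (image-text) and content
    similarity (image-image), both measured in CLIP space."""
    with torch.no_grad():
        img = preprocess(stylized).unsqueeze(0).to(device)
        ref = preprocess(content).unsqueeze(0).to(device)
        txt = clip.tokenize([style_prompt]).to(device)

        # Encode and L2-normalize so dot products are cosine similarities.
        img_emb = model.encode_image(img)
        ref_emb = model.encode_image(ref)
        txt_emb = model.encode_text(txt)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        ref_emb = ref_emb / ref_emb.norm(dim=-1, keepdim=True)
        txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)

        style_sim = (img_emb * txt_emb).sum(-1)    # match to the style prompt
        content_sim = (img_emb * ref_emb).sum(-1)  # preservation of source content

    return (w_style * style_sim + (1.0 - w_style) * content_sim).item()

A non-differentiable reward of this form is one natural fit for policy-gradient optimization over a generator's discrete latent codes, which is the role reinforcement learning plays in the StylerDALLE formulation.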
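The spectral perspective behind SpectralCLIP can likewise be sketched in a few lines: treat a sequence of CLIP token embeddings as a signal, move it to the frequency domain with an FFT along the token axis, attenuate selected bands, and transform back. This is a minimal sketch of the general idea only; the function name spectral_filter, the low-pass choice, and the keep_ratio hyperparameter are illustrative assumptions, not the band configuration used in the thesis.

import torch

def spectral_filter(tokens: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Filter a (batch, seq_len, dim) sequence of CLIP token embeddings
    along the token axis, keeping only the lowest frequencies."""
    # Real-valued FFT over the sequence axis: (batch, seq_len // 2 + 1, dim).
    spectrum = torch.fft.rfft(tokens, dim=1)

    # Binary mask keeping the lowest keep_ratio fraction of frequencies;
    # a band-stop mask targeting artifact-related bands works the same way.
    n_freq = spectrum.shape[1]
    cutoff = max(1, int(n_freq * keep_ratio))
    mask = torch.zeros(n_freq, device=tokens.device)
    mask[:cutoff] = 1.0

    # Attenuate the masked-out bands and invert back to the token domain.
    return torch.fft.irfft(spectrum * mask.view(1, -1, 1), n=tokens.shape[1], dim=1)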
Defense date: 9 Dec 2025
Cycle: XXXVII
University: Università degli Studi di Trento
Doctoral programme: Information and Communication Technology
Supervisor: Sebe, Niculae
Language: English
Files in this record:
No files are associated with this record.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/468130
