On Language as an Interface for Controlling Image Generation
Xu, Zipeng
2025-12-09
Abstract
The emergence of vision-language foundation models such as CLIP and DALL·E has significantly advanced the use of natural language for image generation and manipulation. These models learn broad alignments between visual and textual representations, enabling flexible, general-purpose multimodal capabilities. However, using language as an efficient and robust interface for image generation remains challenging, with limitations in controllability, semantic expressiveness, and visual fidelity. This thesis addresses several fundamental challenges in employing language as a control interface for image generation with foundation models. Specifically, it investigates (1) enhancing the controllability and precision of language-guided generation, (2) leveraging foundation models to explore and exploit generative latent spaces, and (3) developing a spectral perspective on CLIP embeddings to better analyze and improve generation quality. To address these aspects, we introduce three complementary approaches. Predict, Prevent, and Evaluate (PPE) enhances the controllability and precision of language-guided image manipulation by modeling and regularizing attribute interactions through natural language. StylerDALLE formulates style transfer as a translation in the latent space of generative models, supervised by CLIP-based reinforcement learning so that style and content are jointly preserved during generation. SpectralCLIP analyzes the frequency spectrum of CLIP embeddings to suppress common artifacts in CLIP-guided generation, improving robustness without compromising semantic alignment. Together, these contributions highlight the potential of natural language as a flexible, high-level interface for visual generation, grounded in the capabilities of vision-language foundation models, and demonstrate how language, when appropriately modeled and guided, can effectively control diverse aspects of the generative process.
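To make the spectral perspective concrete, the sketch below illustrates one plausible reading of frequency-domain filtering applied to a sequence of CLIP token embeddings. It is a minimal illustration only, not the thesis's actual SpectralCLIP layer: the function name spectral_filter, the band indices, the tensor shapes, and the choice to filter along the token axis of intermediate features are all assumptions made for exposition.

```python
# Illustrative sketch (assumed shapes and band choices, not the thesis method):
# suppress a band of frequency components along the token axis of a batch of
# CLIP-like intermediate features using an FFT / inverse-FFT round trip.
import torch


def spectral_filter(tokens: torch.Tensor, stop_lo: int, stop_hi: int) -> torch.Tensor:
    """Zero out frequency bins [stop_lo, stop_hi) along the token axis.

    tokens: (batch, num_tokens, dim) intermediate features.
    """
    freq = torch.fft.rfft(tokens, dim=1)                 # FFT over the token axis
    freq[:, stop_lo:stop_hi, :] = 0                      # suppress the chosen band
    return torch.fft.irfft(freq, n=tokens.size(1), dim=1)  # back to feature space


# Toy usage with random "features" standing in for CLIP tokens.
x = torch.randn(2, 50, 512)                              # hypothetical token grid
y = spectral_filter(x, stop_lo=5, stop_hi=15)
print(y.shape)                                           # torch.Size([2, 50, 512])
```

The intuition behind filtering along the token axis, under these assumptions, is that certain frequency bands capture repetitive spatial regularities that surface as visual artifacts in CLIP-guided generation, so attenuating them can reduce artifacts while leaving the overall semantic content of the embedding largely intact.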



