One Picture and One Thousand Words: Toward integrated multimodal generative models

IRIS

Thanks to independent advances in language and image generation, we could soon have systems that communicate with us by combining language and images in their output, a skill that humans do not possess (we receive images at high speed, but we cannot produce them at that speed). This paper explores some implications of this idea: which kinds of data sets need to be developed to train such systems, in which cases language and images could be most usefully integrated, and which issues could arise in image generation and in the integration of language and images. Story and dialogue illustration could be relatively low-hanging fruit for this technology, and a looped combination of image-to-text (I2T) LLMs and text-to-image (T2I) diffusion models is likely to play a role in solving some of the issues that arise in the design of such systems.

One Picture and One Thousand Words: Toward integrated multimodal generative models / Zamparelli, R. - In: IJCOL. - ISSN 2499-4553. - Electronic. - 10:2(2024), pp. 31-55. [10.17454/ijcol102.03]

Zamparelli, Roberto

2024-01-01

Date of publication: 2024
Journal title: IJCOL
Issue number and part: 2
DOI: https://dx.doi.org/10.17454/ijcol102.03
Reference SSD (scientific-disciplinary sector, valid until 24/06/2024): L-LIN/01 - Glottologia e Linguistica (Glottology and Linguistics)
Reference SSD (scientific-disciplinary sector, valid from 09/05/2024): GLOT-01/A - Glottologia e linguistica (Glottology and Linguistics)
Scopus identifier: 2-s2.0-105005728048
All authors: Zamparelli, Roberto
Appears in types: 03.1 Journal article

Files in this item:

File	Access	Type	License	Size	Format
ijcol-1432.pdf	Open access	Publisher's version (publisher's layout)	Creative Commons	1.04 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/466090
