One Picture and One Thousand Words: Toward integrated multimodal generative models / Zamparelli, Roberto. - In: IJCOL. - ISSN 2499-4553. - ELETTRONICO. - 10:2(2024), pp. 31-55. [10.17454/ijcol102.03]

One Picture and One Thousand Words: Toward integrated multimodal generative models

Zamparelli, Roberto
2024-01-01

Abstract

Thanks to independent advances in language and image generation, we could soon have systems that communicate with us by combining language and images in their output, a skill that humans do not possess (we receive images at high speed, but we cannot produce them at that speed). This paper explores some implications of this idea: which kinds of datasets need to be developed to train such systems, in which cases language and images could be most usefully integrated, and which issues could arise in image generation and in the integration of language and images. Story and dialogue illustration could be relatively low-hanging fruit for this technology, and a looped combination of image-to-text (I2T) LLMs and text-to-image (T2I) diffusion models is likely to play a role in solving some of the issues that arise in the design of such systems.
Sector L-LIN/01 - Glottology and Linguistics
Sector GLOT-01/A - Glottology and Linguistics
Files in this item:
ijcol-1432.pdf

Access: open access
Type: Publisher's version (Publisher's layout)
License: Creative Commons
Size: 1.04 MB
Format: Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/466090