Image Animation Using Deep Learning / Siarohin, Aliaksandr. - (2021 Jun 24), pp. 1-88. [10.15168/11572_310291]

Image Animation Using Deep Learning

Siarohin, Aliaksandr
2021-06-24

Abstract

Media content generation with deep learning, particularly of images and videos, has recently gained a lot of attention in the research community. One of the main reasons is the surge of interaction on social networks, which draws many people without specialized backgrounds into the media industry. This raises interest in tools that simplify the production of media content such as images and videos. Another potential avenue for deep learning methods is simplifying content generation for traditional media, especially the creation of movies, whose visual effects require significant human effort. One of the most promising directions in deep-learning-based content production is image animation, i.e. the generation of videos from a known appearance and a known movement. While GANs and other generative models have shown great progress in generating random images and videos, photorealistic quality in image animation has not yet been achieved. In this work we take a step in this direction by stating the task of image animation and proposing several new methods for solving it. In image animation we are given a single source image and a driving video, and are asked to produce a video in which the object from the source image moves like the object in the driving video. Before adapting deep learning techniques to image animation, we investigate their ability to handle a simpler but related task, namely pose-guided generation. In this task we are given a source image and a target pose, and are asked to generate the person from the source image in the target pose. We identify the main flaw of current image-to-image architectures and propose a new architecture, based on deformable skip connections, to address it. We use these insights to create methods for image animation. To this end we propose an architecture called Monkey-Net, which is based on keypoints learned in an unsupervised way. However, keypoints alone cannot represent all possible variations in pose movement; we therefore propose the First Order Model, which extends the keypoint representation with local affine transformations around the keypoints. Finally, we propose a new, better way of estimating the affine transformations using Principal Component Analysis (PCA).
24 Jun 2021
XXXIII
2019-2020
Ingegneria e Scienza dell'Informaz (cess.4/11/12)
Information and Communication Technology
Sebe, Niculae
no
English
Files in this item:
File: phd_unitn_alisaksandr_siarohin.pdf
Access: open access
Type: Doctoral Thesis
License: Creative Commons
Size: 45.32 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/310291