Peruzzo, E.; Xu, D.; Xu, X.; Shi, H.; Sebe, N. (2025). RagMe: Retrieval Augmented Video Generation for Enhanced Motion Realism. In Proceedings of the 2025 International Conference on Multimedia Retrieval (ICMR 2025), USA, pp. 1081-1090. DOI: 10.1145/3731715.3733417
RagMe: Retrieval Augmented Video Generation for Enhanced Motion Realism
Peruzzo E.; Xu D.; Sebe N.
2025-01-01
Abstract
Video generation is experiencing rapid growth, driven by advances in diffusion models and the development of better and larger datasets. However, producing high-quality videos remains challenging due to the high dimensionality of the data and the complexity of the task. Recent efforts have primarily focused on enhancing visual quality and addressing temporal inconsistencies such as flickering. Despite progress in these areas, generated videos often fall short in terms of motion complexity and physical plausibility, with many outputs either appearing static or exhibiting unrealistic motion. In this work, we propose a framework aimed at improving the realism of motion in generated videos, exploring a direction complementary to much of the existing literature. Specifically, we advocate incorporating a retrieval mechanism during the generation phase. The retrieved videos act as grounding signals, providing the model with demonstrations of how objects move. Our pipeline is designed to apply to any text-to-video diffusion model, conditioning a pre-trained model on the retrieved samples with minimal fine-tuning. We demonstrate the superiority of our approach through established metrics, recently proposed benchmarks, and qualitative results, and we highlight additional applications of the framework. Code available at: https://github.com/helia95/ragme.
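For intuition, the sketch below illustrates the general idea of retrieval-augmented conditioning as described in the abstract: retrieve the database videos most similar to the prompt, then expose their features to the denoiser as extra conditioning tokens alongside the text. All module names, shapes, and the attention-based injection are hypothetical placeholders chosen for illustration, not the actual RagMe architecture; the authors' implementation is in the linked repository.

```
# Minimal, self-contained sketch of retrieval-augmented conditioning for a
# text-to-video diffusion model. Shapes and modules are illustrative only.
import torch
import torch.nn.functional as F


def retrieve(query_emb: torch.Tensor, db_embs: torch.Tensor, k: int = 2) -> torch.Tensor:
    """Return indices of the k database videos most similar to the text query."""
    sims = F.cosine_similarity(query_emb.unsqueeze(0), db_embs, dim=-1)
    return sims.topk(k).indices


class RetrievalConditionedDenoiser(torch.nn.Module):
    """Toy denoiser that attends over text tokens AND retrieved-video tokens."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.proj = torch.nn.Linear(dim, dim)

    def forward(self, noisy_latents, text_tokens, retrieved_tokens):
        # Concatenate text and retrieved-video tokens into one conditioning set,
        # so the retrieved clips act as motion "demonstrations" during denoising.
        cond = torch.cat([text_tokens, retrieved_tokens], dim=1)
        out, _ = self.attn(noisy_latents, cond, cond)
        return self.proj(out)


if __name__ == "__main__":
    dim, frames = 64, 8
    # Hypothetical pre-computed embeddings: one text query, a small video database.
    query_emb = torch.randn(dim)
    db_embs = torch.randn(100, dim)              # 100 candidate video embeddings
    db_tokens = torch.randn(100, frames, dim)    # per-frame features per video

    idx = retrieve(query_emb, db_embs, k=2)      # pick grounding clips
    retrieved_tokens = db_tokens[idx].flatten(0, 1).unsqueeze(0)

    denoiser = RetrievalConditionedDenoiser(dim)
    noisy_latents = torch.randn(1, frames * 16, dim)   # flattened video latents
    text_tokens = torch.randn(1, 12, dim)               # encoded prompt tokens
    pred = denoiser(noisy_latents, text_tokens, retrieved_tokens)
    print(pred.shape)  # torch.Size([1, 128, 64])
```

In this toy version the retrieved features are simply appended to the text conditioning before cross-attention; the paper's pipeline instead conditions a pre-trained text-to-video model on the retrieved samples with minimal fine-tuning, so the injection mechanism above should be read only as one plausible instantiation of that idea.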



