DiffIR2VR-Zero:
Zero-Shot Video Restoration with Diffusion-based Image Restoration Models
- Chang-Han Yeh 1
- Chin-Yang Lin 1
- Zhixiang Wang 2
- Chi-Wei Hsiao 3
- Ting-Hsuan Chen 1
- Yu-Lun Liu 1

1 National Yang Ming Chiao Tung University
2 University of Tokyo
3 MediaTek Inc.
Abstract
This paper introduces a method for zero-shot video restoration using pre-trained image restoration diffusion models. Traditional video restoration methods often require retraining for each new setting and generalize poorly across degradation types and datasets. Our approach uses a hierarchical token merging strategy for keyframes and local frames, combined with a hybrid correspondence mechanism that blends optical flow and feature-based nearest-neighbor matching (latent merging). We show that our method not only achieves top performance in zero-shot video restoration but also significantly surpasses trained models in generalization across diverse datasets and extreme degradations (8× super-resolution and high-standard-deviation video denoising). We present evidence through quantitative metrics and visual comparisons on various challenging datasets. Additionally, our technique works with any 2D restoration diffusion model, offering a versatile and powerful tool for video enhancement tasks without extensive retraining. This research enables more efficient and widely applicable video restoration technologies, supporting advancements in fields that require high-quality video output.
Video super-resolution
(a) Traditional regression-based methods such as FMA-Net are limited to the training data domain and tend to produce blurry results when encountering out-of-domain inputs. (b) Although applying image-based diffusion models such as DiffBIR to individual frames can generate realistic details, these details often lack consistency across frames. (c) Our method leverages an image diffusion model to restore videos, achieving both realistic and consistent results without any additional training.
Our proposed zero-shot video restoration
We process low-quality (LQ) videos in batches using a diffusion model, with a keyframe randomly sampled within each batch. (a) At the beginning of the diffusion denoising process, hierarchical latent warping provides rough shape guidance both globally, through latent warping between keyframes, and locally, by propagating these latents within the batch. (b) Throughout most of the denoising process, tokens are merged before the self-attention layer. For the downsample blocks, optical flow is used to find the correspondence between tokens, and for the upsample blocks, cosine similarity is utilized. This hybrid flow-guided, spatial-aware token merging accurately identifies correspondences between tokens by leveraging both flow and spatial information, thereby enhancing overall consistency at the token level.
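To make the batched design concrete, here is a minimal PyTorch-style sketch of the outer loop. The model interface (`model.encode`, `model.decode`, `model.scheduler`) and the helpers `hierarchical_latent_warp` and `denoise_step` are illustrative placeholders, not the released implementation:

```python
import torch

def restore_video(lq_frames, model, batch_size=8, warp_steps=2):
    """Sketch of batched zero-shot restoration with a random keyframe per batch."""
    outputs = []
    for start in range(0, len(lq_frames), batch_size):
        batch = lq_frames[start:start + batch_size]
        key_idx = torch.randint(len(batch), (1,)).item()   # random keyframe
        latents = model.encode(batch)                      # initial noisy latents
        for i, t in enumerate(model.scheduler.timesteps):
            if i < warp_steps:
                # (a) early steps: hierarchical latent warping gives rough
                # shape guidance between keyframes and within the batch
                latents = hierarchical_latent_warp(latents, batch, key_idx)
            # (b) remaining steps: hybrid token merging runs inside the
            # UNet's self-attention layers (see the sketches below)
            latents = denoise_step(model, latents, t, cond=batch)
        outputs.extend(model.decode(latents))
    return outputs
```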
Hierarchical latent warping
Without requiring any training, hierarchical latent warping provides global and local shape guidance and achieves coherence across frames by enforcing temporal stability in latent space.
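As a rough illustration of the warping primitive behind this guidance, the sketch below backward-warps a keyframe latent toward a target frame with an optical-flow field (assumed to be estimated on the LQ frames and resized to latent resolution beforehand). Occlusion masking and the global/local hierarchy are omitted:

```python
import torch
import torch.nn.functional as F

def warp_latent(key_latent, flow):
    """Backward-warp a keyframe latent toward a target frame.
    key_latent: (1, C, H, W) latent of the keyframe.
    flow: (1, 2, H, W) flow from the target frame to the keyframe,
          in latent-pixel units."""
    _, _, h, w = key_latent.shape
    # Build the base sampling grid, then displace it by the flow
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float().unsqueeze(0)  # (1, H, W, 2)
    grid = grid + flow.permute(0, 2, 3, 1)
    # Normalize to the [-1, 1] range expected by grid_sample
    grid[..., 0] = 2 * grid[..., 0] / (w - 1) - 1
    grid[..., 1] = 2 * grid[..., 1] / (h - 1) - 1
    return F.grid_sample(key_latent, grid, align_corners=True)
```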
Hybrid spatial-aware token merging
Hybrid spatial-aware token merging before the self-attention layer improves temporal consistency by matching similar tokens using optical flow in the downsample blocks and cosine similarity in the upsample blocks of the UNet.
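A simplified stand-in for this merging step is sketched below: each source-frame token is matched to a target-frame token and blended before self-attention. Cosine-similarity matching is shown here; in the downsample blocks the index would instead come from flow-derived correspondences (see the next sketch). The blending weight `alpha` and the omission of the full ToMe-style merge/unmerge bookkeeping are simplifications:

```python
import torch
import torch.nn.functional as F

def match_by_cosine(src_tokens, tgt_tokens):
    """For each source token, the index of its most similar target token.
    src_tokens: (Ns, C), tgt_tokens: (Nt, C)."""
    src = F.normalize(src_tokens, dim=-1)
    tgt = F.normalize(tgt_tokens, dim=-1)
    return (src @ tgt.T).argmax(dim=-1)                    # (Ns,)

def merge_tokens(src_tokens, tgt_tokens, idx, alpha=0.5):
    """Pull each source token toward its matched target token before
    self-attention so that matched tokens attend consistently."""
    return alpha * src_tokens + (1 - alpha) * tgt_tokens[idx]
```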
Token correspondences
Correspondences found by cosine similarity and by optical flow. At the beginning of the denoising process, the latents in the UNet downsample blocks are too noisy for cosine similarity to be effective, while optical flow estimated from LQ frames remains reliable. Because flow and cosine similarity often identify different correspondences, a hybrid approach is more effective.
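For comparison with `match_by_cosine` above, a flow field can be turned into token correspondences directly, as in this sketch (assuming the flow has already been resized to the token grid and converted to token units):

```python
import torch

def match_by_flow(flow, h, w):
    """Token correspondences from optical flow on an h x w token grid.
    flow: (2, h, w) displacement from source to target, in token units.
    Returns, for each source token, the flat index of its target token."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    tx = (xs + flow[0]).round().clamp(0, w - 1).long()
    ty = (ys + flow[1]).round().clamp(0, h - 1).long()
    return (ty * w + tx).flatten()                         # (h*w,)
```

Indices where `match_by_flow` and `match_by_cosine` disagree mark exactly the tokens where the hybrid scheme matters most.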
Qualitative comparisons on video denoising
Baseline method (left) vs. DiffIR2VR-Zero (right).
Qualitative comparisons on video super-resolution
Baseline method (left) vs. DiffIR2VR-Zero (right).
Additional application: consistent video depth
Citation
Acknowledgements
This research was funded by the National Science and Technology Council, Taiwan, under Grant NSTC 112-2222-E-A49-004-MY2. The authors are grateful to Google, NVIDIA, and MediaTek Inc. for their generous donations. Yu-Lun Liu acknowledges the Yushan Young Fellow Program by the MOE in Taiwan.
The website template was borrowed from Michaël Gharbi and Ref-NeRF.