Background

GAN-inversion-based Zero-shot Text-guided Manipulation
- Recently, StyleGAN combined with CLIP (Contrastive Language-Image Pre-training) has become popular thanks to its ability to perform zero-shot image manipulation guided by text prompts.
- StyleCLIP manipulates the latent code guided by CLIP, while StyleGAN-NADA manipulates the generator itself guided by CLIP.
- Directional CLIP loss (StyleGAN-NADA) → robust to mode-collapse issues
- By aligning the direction between the CLIP embeddings of the original and manipulated images with the direction between the reference text and the target text, distinct (non-collapsed) images are encouraged; see the loss sketch after this list.
- To manipulate real images with these methods, GAN inversion, which converts a real image back into the latent space, is required.
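A minimal PyTorch sketch of this directional CLIP loss (not the authors' exact implementation; a CLIP model exposing encode_image / encode_text, preprocessed image tensors, and pre-tokenized prompts are assumed):

```python
import torch
import torch.nn.functional as F

def directional_clip_loss(clip_model, x_ref, x_gen, tok_ref, tok_tar):
    """1 - cosine similarity between the image-embedding direction
    (manipulated - reference image) and the text-embedding direction
    (target - reference prompt), as described for StyleGAN-NADA."""
    # Direction between the two image representations in CLIP space
    delta_img = clip_model.encode_image(x_gen) - clip_model.encode_image(x_ref)
    # Direction between the reference and target text in CLIP space
    delta_txt = clip_model.encode_text(tok_tar) - clip_model.encode_text(tok_ref)
    # Encourage the two directions to align
    return 1.0 - F.cosine_similarity(delta_img, delta_txt, dim=-1).mean()
```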
Limitations in GAN-inversion-based Manipulation

- However, applying these methods to diverse types of real-world images is still tricky due to limited GAN-inversion performance.
- For example, inversion often fails on unusual inputs such as faces partially covered by hands, because such images were rarely seen during the training phase.
- This issue becomes even worse for images from high-variance datasets, such as the church images in LSUN-Church.
Diffusion Models

- Diffusion probabilistic models are latent-variable models consisting of a forward diffusion process and a reverse diffusion process, each defined as a Markov chain (see the formulation after this list).
- Recently, diffusion models have achieved great success in image generation, with the latest works demonstrating image synthesis quality surpassing that of state-of-the-art GANs.
- Furthermore, the denoising diffusion implicit model (DDIM) accelerates the sampling procedure and enables nearly perfect inversion of real images (see the inversion sketch after this list).
- However, this reconstruction capability was only briefly introduced in Appendix F of "Diffusion Models Beat GANs on Image Synthesis", with just qualitative results and formulations.
- How well this inversion capability of diffusion models compares to GAN-inversion methods has not been analyzed in depth before.
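For reference, a standard DDPM/DDIM formulation (notation may differ slightly from the papers): the forward step adds Gaussian noise, the reverse step is learned, and the deterministic DDIM update can be run in the forward time direction to invert a real image almost exactly:

```latex
% Forward (fixed) diffusion step
q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\big)

% Learned reverse step
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 \mathbf{I}\big)

% Deterministic DDIM update (\eta = 0), with \alpha_t the cumulative product of (1 - \beta_s)
x_{t-1} = \sqrt{\alpha_{t-1}} \left( \frac{x_t - \sqrt{1-\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\alpha_t}} \right)
        + \sqrt{1-\alpha_{t-1}}\,\epsilon_\theta(x_t, t)
```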
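And a rough sketch of the corresponding inversion loop, under assumed names (eps_model(x, t) predicts the added noise, alphas_cumprod holds the cumulative noise schedule, timesteps is an increasing list of integer steps starting near 0):

```python
import torch

@torch.no_grad()
def ddim_invert(x0, eps_model, alphas_cumprod, timesteps):
    """Run the deterministic (eta = 0) DDIM update in the forward time
    direction to map a real image x0 to a latent x_T; re-running DDIM
    sampling from that latent reconstructs x0 almost exactly."""
    x = x0
    for t, t_next in zip(timesteps[:-1], timesteps[1:]):        # t < t_next
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
        t_batch = torch.full((x.shape[0],), t, device=x.device, dtype=torch.long)
        eps = eps_model(x, t_batch)                              # predicted noise
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # predicted clean image
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps
    return x
```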
DiffusionCLIP
