Background

GAN-inversion-based Zero-shot Text-guided Manipulation
- Recently, StyleGAN combined with CLIP (Contrastive Language-Image Pre-training) has become popular thanks to its ability to perform zero-shot image manipulation guided by text prompts.
- StyleCLIP manipulates the latent code guided by CLIP, while StyleGAN-NADA manipulates the generator itself guided by CLIP.
- Directional CLIP loss (StyleGAN-NADA) → robust to mode-collapse issues
- By aligning the direction between the CLIP embeddings of the original and manipulated images with the direction between the reference text and the target text, distinct (non-collapsed) images are encouraged; see the loss sketch after this list.
- To manipulate real images with these methods, GAN inversion, which converts a real image back into the latent space, is required.
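A minimal PyTorch sketch of this directional CLIP loss (not the authors' exact implementation; a CLIP model exposing encode_image / encode_text, preprocessed image tensors, and pre-tokenized prompts are assumed):

```python
import torch
import torch.nn.functional as F

def directional_clip_loss(clip_model, x_ref, x_gen, tok_ref, tok_tar):
    """1 - cosine similarity between the image-embedding direction
    (manipulated - reference image) and the text-embedding direction
    (target - reference prompt), as described for StyleGAN-NADA."""
    # Direction between the two image representations in CLIP space
    delta_img = clip_model.encode_image(x_gen) - clip_model.encode_image(x_ref)
    # Direction between the reference and target text in CLIP space
    delta_txt = clip_model.encode_text(tok_tar) - clip_model.encode_text(tok_ref)
    # Encourage the two directions to align
    return 1.0 - F.cosine_similarity(delta_img, delta_txt, dim=-1).mean()
```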
Limitations in GAN-inversion-based Manipulation

- However, applying these methods to diverse types of real-world images is still tricky due to limited GAN-inversion performance.
- For example, inversion often fails on unusual inputs such as faces partially covered by hands, because such images were rarely seen during the training phase.
- This issue becomes even worse for images from high-variance datasets, such as the church images in LSUN-Church.
Diffusion Models

- Diffusion probabilistic models are latent-variable models consisting of a forward diffusion process and a reverse diffusion process, each defined as a Markov chain (see the formulation after this list).
- Recently, diffusion models have achieved great success in image generation, with the latest works demonstrating image synthesis quality surpassing that of state-of-the-art GANs.
- Furthermore, the denoising diffusion implicit model (DDIM) accelerates the sampling procedure and enables nearly perfect inversion of real images (see the inversion sketch after this list).
- However, this reconstruction capability was only briefly introduced in Appendix F of "Diffusion Models Beat GANs on Image Synthesis", with just qualitative results and formulations.
- How well this inversion capability of diffusion models compares to GAN-inversion methods has not been analyzed in depth before.
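For reference, a standard DDPM/DDIM formulation (notation may differ slightly from the papers): the forward step adds Gaussian noise, the reverse step is learned, and the deterministic DDIM update can be run in the forward time direction to invert a real image almost exactly:

```latex
% Forward (fixed) diffusion step
q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\big)

% Learned reverse step
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 \mathbf{I}\big)

% Deterministic DDIM update (\eta = 0), with \alpha_t the cumulative product of (1 - \beta_s)
x_{t-1} = \sqrt{\alpha_{t-1}} \left( \frac{x_t - \sqrt{1-\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\alpha_t}} \right)
        + \sqrt{1-\alpha_{t-1}}\,\epsilon_\theta(x_t, t)
```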
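And a rough sketch of the corresponding inversion loop, under assumed names (eps_model(x, t) predicts the added noise, alphas_cumprod holds the cumulative noise schedule, timesteps is an increasing list of integer steps starting near 0):

```python
import torch

@torch.no_grad()
def ddim_invert(x0, eps_model, alphas_cumprod, timesteps):
    """Run the deterministic (eta = 0) DDIM update in the forward time
    direction to map a real image x0 to a latent x_T; re-running DDIM
    sampling from that latent reconstructs x0 almost exactly."""
    x = x0
    for t, t_next in zip(timesteps[:-1], timesteps[1:]):        # t < t_next
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
        t_batch = torch.full((x.shape[0],), t, device=x.device, dtype=torch.long)
        eps = eps_model(x, t_batch)                              # predicted noise
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # predicted clean image
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps
    return x
```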
DiffusionCLIP
