Upscaling Images with Real-ESRGAN

In a previous article I introduced how to use the diffusers library and Abyss Orange Mix 2 (AOM2) to generate glamour-photography-inspired synthetic images. Unfortunately, these synthetic images tend to be low resolution, such as 512 by 768 pixels. In the era of HD and 4K imagery, such low-resolution images fall short of modern standards and must be upscaled somehow! This is where super resolution comes in: in contrast to classical computer vision upscaling methods such as bicubic interpolation, super resolution imaging harnesses the power of deep learning to create stunning, high-quality, high-resolution images from low-resolution versions.

One of the most popular super resolution models right now is Real-ESRGAN, proposed by Xintao Wang et al. In this article we will explore how to use Real-ESRGAN to upscale the images output by Abyss Orange Mix 2 (AOM2) into high-resolution versions!

Image Degradation in Reality

One of the biggest challenges in super resolution might seem very counter-intuitive: how do we create pairs of low-resolution, low-quality and high-resolution, high-quality images for training the deep learning model?

Can’t we simply use computer vision software such as OpenCV or even GIMP to downscale large images and add some noise and blur to create the low-resolution, low-quality images? The answer is no! The model would simply learn to undo these simple artificial degradations, and nothing else!

Image quality degradation in the real world is an extremely complex process involving multiple steps, from degradations introduced by the photographer, such as motion or focus blur, to digital effects such as image compression by apps.

In order to account for such complex image degradation effects, Real-ESRGAN is trained on images created by a high-order degradation process, which is much closer to what happens in reality. This is an extremely important step in creating realistic training data; as the saying goes, garbage in, garbage out!

In particular, Xintao Wang et al. used two layers of image degradation to create the high-order degradation. Each layer consists of blurring, downsampling, noise and JPEG compression. Additionally, the second layer applies a 2D sinc filter to account for the ringing and overshoot artefacts commonly encountered in reality. A simplified sketch of this pipeline is shown below.
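To make this concrete, here is a minimal sketch of a two-layer degradation pipeline using OpenCV and NumPy. The parameter ranges are illustrative rather than the paper's exact values, and the sinc filter step is omitted; the official BasicSR implementation is considerably more elaborate.

```python
import cv2
import numpy as np

def degrade_once(img: np.ndarray, scale: float) -> np.ndarray:
    """One degradation layer: blur -> downsample -> noise -> JPEG compression."""
    # Gaussian blur with a randomly chosen kernel size and width
    k = int(np.random.choice([3, 5, 7]))
    img = cv2.GaussianBlur(img, (k, k), sigmaX=np.random.uniform(0.2, 3.0))
    # Downsample (the paper randomises the interpolation method too)
    h, w = img.shape[:2]
    img = cv2.resize(img, (int(w / scale), int(h / scale)),
                     interpolation=cv2.INTER_AREA)
    # Additive Gaussian noise with a random standard deviation
    noise = np.random.normal(0, np.random.uniform(1, 10), img.shape)
    img = np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)
    # JPEG compression at a random quality factor
    quality = int(np.random.uniform(30, 95))
    _, buf = cv2.imencode(".jpg", img, [cv2.IMWRITE_JPEG_QUALITY, quality])
    return cv2.imdecode(buf, cv2.IMREAD_COLOR)

def high_order_degradation(img: np.ndarray) -> np.ndarray:
    """Two stacked degradation layers, as in Real-ESRGAN's second-order scheme
    (the paper's second layer also applies a 2D sinc filter, omitted here)."""
    img = degrade_once(img, scale=2.0)
    img = degrade_once(img, scale=2.0)
    return img
```

Applying `high_order_degradation` to a pristine high-resolution image yields the low-quality counterpart of a training pair; the naive single-pass approach criticised above would correspond to calling `degrade_once` just once with fixed parameters.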

The Generator and Discriminator Networks

As the name implies, Real-ESRGAN consists of a generator and a discriminator network (a GAN!). The generator network takes a low-quality image as input and outputs a high-quality image, while the discriminator network attempts to distinguish the synthetic images output by the generator from real images. The sketch below illustrates one adversarial training step.
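Here is a minimal sketch of a single adversarial update, assuming `generator` and `discriminator` are PyTorch modules and `lr_img`/`hr_img` are a paired batch. The loss weights are illustrative, and the perceptual loss used by the actual Real-ESRGAN training recipe is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def gan_training_step(generator, discriminator, g_opt, d_opt, lr_img, hr_img):
    # --- Discriminator step: push real images towards 1, generated ones towards 0 ---
    d_opt.zero_grad()
    fake = generator(lr_img).detach()   # detach so gradients don't flow into G
    real_pred = discriminator(hr_img)
    fake_pred = discriminator(fake)
    d_loss = (F.binary_cross_entropy_with_logits(real_pred, torch.ones_like(real_pred))
              + F.binary_cross_entropy_with_logits(fake_pred, torch.zeros_like(fake_pred)))
    d_loss.backward()
    d_opt.step()

    # --- Generator step: fool the discriminator while staying close to the ground truth ---
    g_opt.zero_grad()
    fake = generator(lr_img)
    pred = discriminator(fake)
    adv = F.binary_cross_entropy_with_logits(pred, torch.ones_like(pred))
    g_loss = F.l1_loss(fake, hr_img) + 0.1 * adv   # loss weights are illustrative
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```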

As can be seen in the diagram above, the generator is trained not only to upscale images; it also restores details in low-quality images. To improve the quality of the output image, the model internally uses residual-in-residual dense blocks (RRDBs), shown in the diagram below, which combine multi-level residual networks and dense connections. More layers and more connections tend to increase model performance! A minimal sketch of one such block follows.
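Below is a simplified PyTorch sketch of an RRDB. The channel counts and the 0.2 residual scaling follow the ESRGAN design, but the official implementation has more internal detail.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Five conv layers with dense connections: each layer sees all earlier outputs."""
    def __init__(self, nf=64, gc=32):
        super().__init__()
        # Layer i takes nf + i*gc input channels; the last layer maps back to nf
        self.convs = nn.ModuleList(
            nn.Conv2d(nf + i * gc, gc if i < 4 else nf, 3, padding=1)
            for i in range(5)
        )
        self.lrelu = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x):
        feats = [x]
        for i, conv in enumerate(self.convs):
            out = conv(torch.cat(feats, dim=1))
            if i < 4:
                out = self.lrelu(out)
                feats.append(out)
        return x + 0.2 * out   # inner residual with scaling

class RRDB(nn.Module):
    """Residual-in-Residual Dense Block: three dense blocks inside an outer residual."""
    def __init__(self, nf=64, gc=32):
        super().__init__()
        self.blocks = nn.Sequential(
            DenseBlock(nf, gc), DenseBlock(nf, gc), DenseBlock(nf, gc))

    def forward(self, x):
        return x + 0.2 * self.blocks(x)   # outer residual with scaling
```

The generator stacks many such blocks before the final upsampling layers; the residuals at both levels let gradients flow easily through the deep stack.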

The discriminator used is essentially a U-Net with skip connections, with spectral normalization applied to increase stability during training. Instead of a single global realness score, the U-Net predicts the realness of each pixel in the input image. This pixel-level discrimination boosts the enhancement of local detail and the suppression of artefacts, and is sketched below.
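Here is a minimal sketch of such a U-Net-style discriminator using torch.nn.utils.spectral_norm. The real network is deeper, but the structure is the same: downsample, upsample with a skip connection, and emit one realness logit per pixel.

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

def sn_conv(in_ch, out_ch, stride=1):
    """Conv layer wrapped with spectral normalization for training stability."""
    return spectral_norm(nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1))

class UNetDiscriminator(nn.Module):
    """Toy U-Net discriminator: per-pixel realness map instead of a global score."""
    def __init__(self, nf=64):
        super().__init__()
        self.enc1 = sn_conv(3, nf)
        self.enc2 = sn_conv(nf, nf * 2, stride=2)   # downsample
        self.dec1 = sn_conv(nf * 2, nf)
        self.out = nn.Conv2d(nf, 1, 3, padding=1)   # one logit per pixel
        self.act = nn.LeakyReLU(0.2, inplace=True)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)

    def forward(self, x):
        e1 = self.act(self.enc1(x))
        e2 = self.act(self.enc2(e1))
        d1 = self.act(self.dec1(self.up(e2))) + e1   # skip connection
        return self.out(d1)                          # shape: (B, 1, H, W)
```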

Once fully trained, the generator network can then be used to create high quality high resolution images from low quality ones!

Accessing Real-ESRGAN on HuggingFace

As with diffusion models, Real-ESRGAN is also available on HuggingFace. Various models are available — some are trained to perform super resolution imaging on general images, while others are trained to do so on anime images.

In particular, for this article we will use the Real-ESRGAN Space created by havas79 to upscale the synthetic image right at the top of this article. As this is a HuggingFace Space, the model can be used directly from the browser!

We use the provided anime model with an upscaling factor of 4. This produced an output image with a resolution of 2048 × 3072, up from 512 × 768, with no loss in image quality; in fact, image quality was restored very nicely!
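If you would rather run the model locally than through the browser, the official GitHub repository ships a Python package. Below is a sketch assuming the realesrgan and basicsr packages are installed and the RealESRGAN_x4plus_anime_6B weights have been downloaded from the repository's releases page; the file names here are hypothetical.

```python
import cv2
from basicsr.archs.rrdbnet_arch import RRDBNet
from realesrgan import RealESRGANer

# The anime model is a compact 6-block RRDBNet (per the Real-ESRGAN repo)
model = RRDBNet(num_in_ch=3, num_out_ch=3, num_feat=64,
                num_block=6, num_grow_ch=32, scale=4)
upsampler = RealESRGANer(
    scale=4,
    model_path="RealESRGAN_x4plus_anime_6B.pth",  # downloaded weights
    model=model,
    tile=256,   # process the image in tiles to bound GPU memory use
)

img = cv2.imread("synthetic_512x768.png")        # hypothetical input file
output, _ = upsampler.enhance(img, outscale=4)   # -> 2048 x 3072
cv2.imwrite("upscaled_2048x3072.png", output)
```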

Super resolution and generative AI are extremely exciting technologies, and I am really looking forward to new developments that further improve my workflow! Thank you for reading!

References

  1. https://arxiv.org/pdf/2107.10833.pdf
  2. https://xinntao.github.io/projects/RealESRGAN_src/RealESRGAN_poster.pdf
  3. https://github.com/xinntao/Real-ESRGAN
  4. https://arxiv.org/pdf/1809.00219.pdf
  5. https://arxiv.org/pdf/1802.05957.pdf