Producing high-quality creative portrait videos is an important and desirable task in computer vision and computer graphics.
Although several effective models for portrait image toonification have been built on the powerful StyleGAN, these image-oriented techniques have clear drawbacks when applied to videos, such as a fixed frame size, the requirement for face alignment, the loss of non-facial details, and temporal inconsistency.
The VToonify framework tackles the challenging task of controllable high-resolution portrait video style transfer.
In this article, we will examine the recent study behind VToonify, including how it works, its limitations, and other aspects.
What is VToonify?
The VToonify framework enables controllable high-resolution portrait video style transfer.
VToonify uses StyleGAN’s mid- and high-resolution layers to render high-quality artistic portraits based on multi-scale content features extracted by an encoder, which preserves frame details.
The resulting fully convolutional architecture accepts non-aligned faces in videos of variable size as input and produces complete face regions with natural motions in the output.
This framework is compatible with current StyleGAN-based image toonification models, allowing them to be extended to video toonification, and inherits attractive characteristics such as adjustable color and intensity customization.
This study introduces two instantiations of VToonify based on Toonify and DualStyleGAN for collection-based and exemplar-based portrait video style transfer, respectively.
Extensive experimental results show that the proposed VToonify framework outperforms existing approaches in producing high-quality, temporally coherent artistic portrait videos with flexible style controls.
The researchers also provide a Google Colab notebook, so you can get your hands dirty with it.
How does it work?
To achieve controllable high-resolution portrait video style transfer, VToonify combines the advantages of the image translation framework with those of the StyleGAN-based framework.
To accommodate varying input sizes, the image translation framework employs fully convolutional networks; training such a network from scratch, however, makes high-resolution and controllable style transfer impractical.
The StyleGAN-based framework uses a pre-trained StyleGAN model for high-resolution and controllable style transfer, although it is limited to a fixed image size and suffers detail losses.
In the hybrid framework, StyleGAN is modified by removing its fixed-size input feature and low-resolution layers, resulting in a fully convolutional encoder-generator architecture similar to that of the image translation framework.
To preserve frame details, an encoder is trained to extract multi-scale content features of the input frame, which serve as an additional content condition on the generator. By building the generator on the StyleGAN model, VToonify distills both its data and its model and inherits StyleGAN’s flexible style control.
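To make this concrete, here is a minimal, hypothetical PyTorch sketch of such a hybrid encoder-generator. The module names (ContentEncoder, ToyStyleLayers, VToonifySketch) and the layer configurations are illustrative assumptions, not the paper's actual code; the point is that a fully convolutional encoder can stand in for StyleGAN's removed fixed-size input, so frames of any size pass through.

```python
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Fully convolutional encoder: extracts multi-scale content
    features from a frame of any size (illustrative sketch)."""
    def __init__(self, channels=(3, 64, 128, 256)):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                nn.LeakyReLU(0.2),
            )
            for c_in, c_out in zip(channels[:-1], channels[1:])
        )

    def forward(self, frame):
        feats, x = [], frame
        for stage in self.stages:
            x = stage(x)
            feats.append(x)          # one feature map per scale
        return feats

class ToyStyleLayers(nn.Module):
    """Toy stand-in for StyleGAN's mid/high-resolution layers:
    upsampling convolutions whose inputs are modulated by a style code."""
    def __init__(self, channels=(256, 128, 64, 3), style_dim=512):
        super().__init__()
        self.blocks, self.mods = nn.ModuleList(), nn.ModuleList()
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            self.mods.append(nn.Linear(style_dim, c_in))
            self.blocks.append(nn.Sequential(
                nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                nn.Conv2d(c_in, c_out, 3, padding=1),
            ))

    def forward(self, x, style):
        for mod, block in zip(self.mods, self.blocks):
            x = x * mod(style)[:, :, None, None]   # style modulation
            x = block(x)
        return x

class VToonifySketch(nn.Module):
    """Hybrid encoder-generator: StyleGAN's fixed-size constant input and
    low-resolution layers are dropped, and the encoder's deepest content
    feature takes their place. (The real model also fuses the shallower
    features via skip connections to preserve frame details.)"""
    def __init__(self):
        super().__init__()
        self.encoder = ContentEncoder()
        self.generator = ToyStyleLayers()

    def forward(self, frame, style_code):
        feats = self.encoder(frame)
        return self.generator(feats[-1], style_code)

# Because every layer is convolutional, any frame size works:
model = VToonifySketch()
frame = torch.randn(1, 3, 320, 480)   # non-aligned, non-square frame
style = torch.randn(1, 512)           # style code from the StyleGAN backbone
print(model(frame, style).shape)      # torch.Size([1, 3, 320, 480])
```

Because no layer depends on a fixed spatial size, the same network handles a 320×480 frame as readily as a 1024×1024 one, which is exactly what the fixed-crop limitation prevents in vanilla StyleGAN.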
Limitations of StyleGAN & the Proposed VToonify
Artistic portraits are common in our daily lives as well as in creative industries such as art, social media avatars, movies, entertainment advertising, and so on.
With the development of deep learning technology, it is now possible to create high-quality artistic portraits from real-life face photos using automated portrait style transfer.
A variety of successful methods have been created for image-based style transfer, many of which are easily accessible to novice users in the form of mobile applications. Meanwhile, video content has swiftly become a mainstay of our social media feeds over the last several years.
The rise of social media and short-form videos has increased the demand for creative video editing, such as portrait video style transfer, to produce effective and engaging videos.
Existing image-oriented techniques have significant disadvantages when applied to videos, limiting their usefulness for automated portrait video stylization.
StyleGAN is a common backbone for building a portrait image style transfer model due to its capacity to create high-quality faces with flexible style control.
A StyleGAN-based system (also known as image toonification) encodes a real face into the StyleGAN latent space and then feeds the resulting style code into another StyleGAN fine-tuned on an artistic portrait dataset to create a stylized version.
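As a rough illustration, that pipeline might be sketched as below; align_and_crop, psp_encoder, and stylegan_toon are hypothetical stand-ins for the alignment step, a GAN-inversion encoder such as pSp, and a StyleGAN fine-tuned on an artistic-portrait dataset.

```python
# Hedged sketch of the StyleGAN-based image toonification pipeline
# described above; all three helpers are hypothetical stand-ins.
def toonify_image(face_image):
    aligned = align_and_crop(face_image)   # StyleGAN expects aligned faces
    w_plus = psp_encoder(aligned)          # invert into the latent space
    return stylegan_toon(w_plus)           # decode with the fine-tuned model
```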
StyleGAN generates images with aligned faces at a fixed size, which does not suit the dynamic faces in real-world footage. Cropping and aligning faces in a video often results in a partial face and unnatural motions. Researchers call this issue StyleGAN’s ‘fixed-crop limitation.’
StyleGAN3 has been proposed to handle unaligned faces; however, it still only supports a fixed image size.
Furthermore, a recent study found that encoding unaligned faces is more challenging than encoding aligned ones. Incorrect face encoding harms portrait style transfer, causing issues such as identity changes and missing components in the reconstructed and stylized frames.
As discussed, an efficient technique for portrait video style transfer must handle the following issues:
- To preserve natural motions, the approach must be able to handle unaligned faces and varied video sizes. A large video size, or a wide field of view, can capture more information while keeping the face from moving out of frame.
- High-resolution video is necessary to match today’s widely used HD devices.
- Flexible style control should be offered so that users can adjust and select styles to their preference in a practical user-interaction system.
To that end, researchers propose VToonify, a novel hybrid framework for video toonification. To overcome the fixed-crop limitation, they first study translation equivariance in StyleGAN.
VToonify combines the benefits of the StyleGAN-based architecture and the image translation framework to achieve controllable high-resolution portrait video style transfer.
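On the translation-equivariance point: equivariance simply means that shifting the input shifts the output by the same amount, i.e. G(shift(x)) = shift(G(x)). The small PyTorch check below illustrates the property on a single zero-padded convolution; this is an illustration of the general property, not the paper's actual analysis, and it holds for the interior pixels of any fully convolutional network, which is what lets VToonify stylize faces at arbitrary positions in a frame.

```python
import torch
import torch.nn as nn

# Equivariance check: shifting the input by (dy, dx) should shift the
# output identically for a fully convolutional network.
net = nn.Conv2d(3, 8, 3, padding=1)
x = torch.randn(1, 3, 64, 64)
dy, dx = 4, 4

shift_then_map = net(torch.roll(x, (dy, dx), dims=(2, 3)))
map_then_shift = torch.roll(net(x), (dy, dx), dims=(2, 3))

# Compare interior pixels only: torch.roll wraps around at the borders,
# which zero-padded convolution does not.
diff = (shift_then_map - map_then_shift)[..., 8:-8, 8:-8]
print(diff.abs().max())   # ~0: the convolution is translation-equivariant
```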
The following are the major contributions:
- Researchers investigate StyleGAN’s fixed-crop limitation and propose a solution based on translation equivariance.
- Researchers present a unique fully convolutional VToonify framework for controlled high-resolution portrait video style transfer that supports unaligned faces and different video sizes.
- Researchers construct VToonify on the backbones of Toonify and DualStyleGAN and condense the backbones in terms of both data and model to enable collection-based and exemplar-based portrait video style transfer.
Comparing VToonify with other state-of-the-art models
Toonify
It serves as the backbone for collection-based style transfer on aligned faces using StyleGAN. Researchers align the faces and crop 256×256 images for pSp to retrieve the style codes, and Toonify then uses these style codes to generate a 1024×1024 stylized result.
Finally, they re-align the result to its original location in the video; the un-stylized area is set to black.
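A hedged sketch of that baseline pipeline, where align_face, psp_encoder, toonify_g, and paste_back are hypothetical helper names rather than the released code:

```python
import torch

# Sketch of the Toonify comparison pipeline described above;
# all four helpers are hypothetical stand-ins.
def toonify_baseline(frame):
    face, transform = align_face(frame, size=256)   # 256x256 aligned crop
    w_plus = psp_encoder(face)                      # style codes via pSp
    stylized = toonify_g(w_plus)                    # 1024x1024 stylized face
    canvas = torch.zeros_like(frame)                # un-stylized area: black
    return paste_back(canvas, stylized, transform)  # back to original position
```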
DualStyleGAN
It is a StyleGAN-based backbone for exemplar-based style transfer. Researchers use the same data pre-processing and post-processing techniques as for Toonify.
Pix2pixHD
It is an image-to-image translation model that is commonly used to condense pre-trained models for high-resolution editing, and it is trained on paired data.
Researchers feed pix2pixHD the extracted parsing map as its additional instance-map input.
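In other words, the parsing map is simply stacked with the frame along the channel dimension; a hypothetical sketch, with pix2pixhd_g as an assumed stand-in for the generator:

```python
import torch

# Sketch of the pix2pixHD input described above: the extracted
# face-parsing map serves as the extra instance-map channels.
def pix2pixhd_forward(frame, parsing_map):
    inp = torch.cat([frame, parsing_map], dim=1)  # (N, 3 + n_classes, H, W)
    return pix2pixhd_g(inp)
```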
First Order Motion
FOM is a typical image animation model. It was trained on 256×256 images and performs poorly at other image sizes. As a result, researchers first resize the video frames to 256×256 for FOM to animate and then resize the results back to their original size.
For a fair comparison, FOM uses the first stylized frame produced by the proposed approach as its reference style image.
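A minimal sketch of that resize-animate-resize workaround, with fom_animate as a hypothetical stand-in for the First Order Motion model:

```python
import torch.nn.functional as F

# FOM only handles 256x256 inputs, so each (N, C, H, W) frame is
# downscaled before animation and the result is upscaled back.
def animate_frame(frame, style_reference):
    h, w = frame.shape[-2:]
    small = F.interpolate(frame, size=(256, 256),
                          mode="bilinear", align_corners=False)
    animated = fom_animate(small, style_reference)   # animate at 256x256
    return F.interpolate(animated, size=(h, w),
                         mode="bilinear", align_corners=False)
```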
DaGAN
It is a 3D face animation model. Researchers use the same data pre-processing and post-processing methods as for FOM.
Advantages
- It can be employed in the arts, social media avatars, movies, entertainment advertising, and so forth.
- VToonify can also be utilized in the metaverse.
Limitations
- This methodology distills both the data and the model from the StyleGAN-based backbones, so it inherits their data and model biases.
- Artifacts are caused mostly by scale differences between the stylized face region and the other regions.
- This strategy is less effective when dealing with objects in the face region.
Conclusion
In summary, VToonify is a framework for style-controllable high-resolution video toonification.
By condensing StyleGAN-based image toonification models in terms of both their synthetic data and network architectures, the framework achieves strong performance on videos and enables broad control over the structural style, color style, and style degree.