Figure 8 - available via license: Creative Commons Attribution 4.0 International
CDF of reconstruction quality across the 2M frames of all videos in our test corpus. Gemino outperforms Bicubic upsampling and SwinIR super-resolution by 0.05 and 0.1 in LPIPS (lower is better) at the median and tail respectively. Gemino also consistently outperforms the keypoint-based model FOMM by nearly 5 dB in SSIM and 10 dB in PSNR on all frames.
Source publication
Video conferencing systems suffer from poor user experience when network conditions deteriorate because current video codecs simply cannot operate at extremely low bitrates. Recently, several neural alternatives have been proposed that reconstruct talking head videos at very low bitrates using sparse representations of each frame such as facial lan...
Context in source publication
Context 1
... the reference frame. The keypoints and warping alone cannot force the neural decoder to synthesize a hand when its encoded features have no hand. To understand whether Gemino's benefits are restricted to a few easily reconstructed frames or extend to all frames of the video, we plot in Fig. 8 a CDF of the visual quality across all 2M frames in our corpus, at around 45 Kbps when upsampling from a 256×256 frame. Gemino outperforms all other baselines across all frames and metrics. Specifically, its synthesized frames are better than FOMM's by nearly 5 dB in SSIM and 10 dB in PSNR throughout. It also outperforms Bicubic and SwinIR by 0.05 and 0.1 in LPIPS at the median and tail respectively. ...
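
The per-frame comparison described above can be illustrated with a minimal sketch. It assumes each scheme's per-frame metric values are available as a plain list; the function names (`empirical_cdf`, `percentile`, `gap_at`) and the choice of the 90th percentile as the "tail" are hypothetical and are not taken from the source publication.

```python
import math


def empirical_cdf(values):
    """Return (value, fraction of frames at or below value) pairs, sorted by value."""
    xs = sorted(values)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    xs = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(xs)))
    return xs[rank - 1]


def gap_at(baseline, candidate, q):
    """Difference between baseline and candidate at percentile q.

    For a lower-is-better metric such as LPIPS, a positive gap means the
    candidate is better at that point of the distribution.
    """
    return percentile(baseline, q) - percentile(candidate, q)
```

Calling `gap_at(baseline_lpips, candidate_lpips, 50)` gives the gap at the median, and a high percentile (here assumed to be 90) gives the gap at the tail for a lower-is-better metric. For higher-is-better metrics such as PSNR, the tail of interest is a low percentile and the sign of the gap flips.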