Riza Alp Guler’s research while affiliated with Imperial College London and other places

Publications (12)


SPAD: Spatially Aware Multi-View Diffusers
  • Conference Paper

June 2024 · 1 Read · 25 Citations
Yash Kant · Aliaksandr Siarohin · Ziyi Wu · [...]


Figure 5. Reposing time comparison between INS and SNARF. We show the time taken by SNARF vs INS for reposing a mesh extracted at 128³ resolution across 125 different target poses. INS performs reposing an order of magnitude faster than SNARF.
Figure 7. Comparison of INS and SNARF models undergoing extreme pose animation from PosePrior dataset.
Figure 10. Visuals showing non-rigid deformations modeled using a second MLP similar to IMAvatar [20].
Ablation Table. We perform an ablation study of INS on a clothed subject 03375 (Table 1, Row 1) from the CAPE dataset.
Invertible Neural Skinning
  • Preprint
  • File available

February 2023 · 54 Reads

Building animatable and editable models of clothed humans from raw 3D scans and poses is a challenging problem. Existing reposing methods suffer from the limited expressiveness of Linear Blend Skinning (LBS), require costly mesh extraction to generate each new pose, and typically do not preserve surface correspondences across different poses. In this work, we introduce Invertible Neural Skinning (INS) to address these shortcomings. To maintain correspondences, we propose a Pose-conditioned Invertible Network (PIN) architecture, which extends the LBS process by learning additional pose-varying deformations. Next, we combine PIN with a differentiable LBS module to build an expressive and end-to-end Invertible Neural Skinning (INS) pipeline. We demonstrate the strong performance of our method, outperforming state-of-the-art reposing techniques on clothed humans while preserving surface correspondences and running an order of magnitude faster. We also perform an ablation study showing the usefulness of our pose-conditioning formulation, and our qualitative results show that INS rectifies artefacts introduced by LBS well. See our webpage for more details: https://yashkant.github.io/invertible-neural-skinning/
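As a rough illustration of the pipeline described in the abstract, the following minimal sketch (hypothetical names and a toy network, not the authors' implementation) combines a pose-conditioned affine coupling layer, which is invertible by construction, with a plain Linear Blend Skinning step:

```python
# Minimal sketch of the INS idea (hypothetical, not the authors' code):
# a pose-conditioned invertible coupling layer adds pose-varying deformations
# on top of standard Linear Blend Skinning, so points can be mapped
# canonical -> posed and back without losing correspondences.
import numpy as np

rng = np.random.default_rng(0)

def lbs(points, skin_weights, bone_transforms):
    """Standard LBS: blend per-bone rigid transforms with skinning weights.
    points: (N, 3), skin_weights: (N, B), bone_transforms: (B, 4, 4)."""
    homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)   # (N, 4)
    per_bone = np.einsum('bij,nj->nbi', bone_transforms, homo)           # (N, B, 4)
    blended = np.einsum('nb,nbi->ni', skin_weights, per_bone)            # (N, 4)
    return blended[:, :3]

class PoseCoupling:
    """Affine coupling layer whose scale/shift depend on the pose vector,
    making the learned deformation invertible by construction."""
    def __init__(self, dim_pose, hidden=32):
        self.W1 = rng.normal(scale=0.1, size=(1 + dim_pose, hidden))
        self.W2 = rng.normal(scale=0.1, size=(hidden, 4))  # 2 log-scales + 2 shifts

    def _nets(self, x_fixed, pose):
        cond = np.broadcast_to(pose, (len(x_fixed), len(pose)))
        h = np.tanh(np.concatenate([x_fixed, cond], axis=1) @ self.W1)
        out = h @ self.W2
        return out[:, :2], out[:, 2:]        # log-scale and shift for the free coords

    def forward(self, x, pose):
        x_fixed, x_free = x[:, :1], x[:, 1:]
        log_s, t = self._nets(x_fixed, pose)
        return np.concatenate([x_fixed, x_free * np.exp(log_s) + t], axis=1)

    def inverse(self, y, pose):
        y_fixed, y_free = y[:, :1], y[:, 1:]
        log_s, t = self._nets(y_fixed, pose)
        return np.concatenate([y_fixed, (y_free - t) * np.exp(-log_s)], axis=1)

# Toy usage: canonical points -> pose-varying deformation -> LBS reposing.
pts = rng.normal(size=(5, 3))
weights = rng.dirichlet(np.ones(2), size=5)          # 5 points, 2 bones
bones = np.stack([np.eye(4), np.eye(4)])
pose = rng.normal(size=(6,))
coupling = PoseCoupling(dim_pose=6)

deformed = coupling.forward(pts, pose)
posed = lbs(deformed, weights, bones)
recovered = coupling.inverse(deformed, pose)         # exact inverse of the deformation
assert np.allclose(recovered, pts)
```

Because the coupling layer can be inverted exactly, correspondences between canonical and posed points are preserved without re-running mesh extraction per pose.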


Context-Self Contrastive Pretraining for Crop Type Semantic Segmentation

January 2022 · 37 Reads · 24 Citations
IEEE Transactions on Geoscience and Remote Sensing

In this article, we propose a fully supervised pretraining scheme based on contrastive learning particularly tailored to dense classification tasks. The proposed context-self contrastive loss (CSCL) learns an embedding space that makes semantic boundaries pop up by using a similarity metric between every location in a training sample and its local context. For crop type semantic segmentation from satellite image time series (SITS), we find performance at parcel boundaries to be a critical bottleneck and explain how CSCL tackles the underlying cause of that problem, improving the state-of-the-art performance in this task. Additionally, using images from the Sentinel-2 (S2) satellite missions we compile the largest, to our knowledge, SITS dataset densely annotated by crop type and parcel identities, which we make publicly available together with the data generation pipeline. Using that data we find CSCL, even with minimal pretraining, to improve all respective baselines, and we present a process for semantic segmentation at greater resolution than that of the input images for obtaining crop classes at a more granular level. The code and instructions to download the data can be found at https://github.com/michaeltrs/DeepSatModels.
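The loss construction described in the abstract can be illustrated with a small, hedged sketch (an assumed simplification, not the released DeepSatModels code): each location's embedding is compared against its local window, and the same-label/different-label relation from the label map supervises the similarity.

```python
# Hedged sketch of the context-self contrastive idea (illustrative only):
# for every location, compare its embedding with the embeddings in a small
# local window; neighbours sharing the centre's label should be similar,
# neighbours across a parcel boundary should not.
import numpy as np

def cscl_loss(emb, labels, wd=3):
    """emb: (H, W, D) per-pixel embeddings, labels: (H, W) class ids,
    wd: window diameter (odd). Returns a scalar contrastive loss."""
    H, W, D = emb.shape
    r = wd // 2
    emb = emb / (np.linalg.norm(emb, axis=-1, keepdims=True) + 1e-8)
    loss, count = 0.0, 0
    for i in range(r, H - r):
        for j in range(r, W - r):
            centre_e, centre_l = emb[i, j], labels[i, j]
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    if di == 0 and dj == 0:
                        continue
                    sim = centre_e @ emb[i + di, j + dj]   # cosine similarity in [-1, 1]
                    p = (sim + 1.0) / 2.0                  # map to (0, 1)
                    same = float(labels[i + di, j + dj] == centre_l)
                    # binary cross-entropy between similarity and same-label target
                    loss -= same * np.log(p + 1e-8) + (1 - same) * np.log(1 - p + 1e-8)
                    count += 1
    return loss / max(count, 1)

# Toy usage on a label map with two parcels separated by a vertical boundary.
rng = np.random.default_rng(0)
labels = np.zeros((8, 8), dtype=int); labels[:, 4:] = 1
emb = rng.normal(size=(8, 8, 16))
print(cscl_loss(emb, labels))
```

Pushing apart embeddings of differently-labelled neighbours is what makes the parcel boundaries "pop up" in the learned embedding space.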


Figure 3: CSCL ground truth generator example on a 2D crop type label map. To generate labels we compare the class at the center location with all locations in the defined neighbourhood. We use windows with parameters (a, b) w_d = 3, w_r = 1; (c) w_d = 3, w_r = 3; (d) w_d = 5, w_r = 1; (e) w_d = 3, w_r = 2; (f) w_d = 5, w_r = 2.
Figure 5: Qualitative comparison of models in Germany (left) and France (right). Left to right: ground truth labels, UNET3Df, UNET3Df-CSCL, UNET2D-CLSTM, UNET2D-CLSTM-CSCL. White "x" indicates false prediction at a particular location.
Figure 6: Model mIoU w.r.t. pretraining performance. Dashed line shows no-pretraining baseline, colormap indicates number of pretraining steps. A direct relationship is observed between pretraining quality and final performance.
Figure 7: Qualitative comparison of super-resolution models from Table 5. From left to right: UNet3Df-×4, UNet3Df-CSCL-×1, UNet3Df-CSCL-×4. White "x" indicates false prediction at a particular location.
Context-self contrastive pretraining for crop type semantic segmentation

April 2021 · 150 Reads

In this paper we propose a fully supervised pretraining scheme based on contrastive learning particularly tailored to dense classification tasks. The proposed Context-Self Contrastive Loss (CSCL) learns an embedding space that makes semantic boundaries pop up by using a similarity metric between every location in a training sample and its local context. For crop type semantic segmentation from satellite images we find performance at parcel boundaries to be a critical bottleneck and explain how CSCL tackles the underlying cause of that problem, improving the state-of-the-art performance in this task. Additionally, using images from the Sentinel-2 (S2) satellite missions we compile the largest, to our knowledge, dataset of satellite image time series densely annotated by crop type and parcel identities, which we make publicly available together with the data generation pipeline. Using that data we find CSCL, even with minimal pretraining, to improve all respective baselines, and we present a process for semantic segmentation at super-resolution for obtaining crop classes at a more granular level. The proposed method is further validated on the task of semantic segmentation on 2D and 3D volumetric images, showing consistent performance improvements upon competitive baselines.


Lifting AutoEncoders: Unsupervised Learning of a Fully-Disentangled 3D Morphable Model using Deep Non-Rigid Structure from Motion

April 2019 · 41 Reads

In this work we introduce Lifting Autoencoders, a generative 3D surface-based model of object categories. We bring together ideas from non-rigid structure from motion, image formation, and morphable models to learn a controllable, geometric model of 3D categories in an entirely unsupervised manner from an unstructured set of images. We exploit the 3D geometric nature of our model and use normal information to disentangle appearance into illumination, shading and albedo. We further use weak supervision to disentangle the non-rigid shape variability of human faces into identity and expression. We combine the 3D representation with a differentiable renderer to generate RGB images and append an adversarially trained refinement network to obtain sharp, photorealistic image reconstruction results. The learned generative model can be controlled in terms of interpretable geometry and appearance factors, allowing us to perform photorealistic image manipulation of identity, expression, 3D pose, and illumination properties.
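The appearance disentanglement mentioned above can be made concrete with a toy Lambertian image-formation step (an assumed simplification for illustration; the paper's differentiable renderer and refinement network are considerably more involved):

```python
# Toy sketch of a Lambertian image-formation step (hypothetical simplification,
# not the paper's renderer): appearance is decomposed into per-pixel albedo and
# shading computed from surface normals and a global light direction.
import numpy as np

def lambertian_shading(normals, light_dir, ambient=0.2):
    """normals: (H, W, 3) unit surface normals, light_dir: (3,) unit vector."""
    diffuse = np.clip(np.einsum('hwc,c->hw', normals, light_dir), 0.0, None)
    return ambient + (1.0 - ambient) * diffuse           # (H, W) shading map

def compose_image(albedo, normals, light_dir):
    """albedo: (H, W, 3) reflectance in [0, 1]; returns the shaded RGB image."""
    shading = lambertian_shading(normals, light_dir)
    return albedo * shading[..., None]

# Toy usage: a flat, camera-facing patch lit from slightly off-axis.
H = W = 4
normals = np.zeros((H, W, 3)); normals[..., 2] = 1.0
albedo = np.full((H, W, 3), 0.6)
light = np.array([0.3, 0.1, 1.0]); light /= np.linalg.norm(light)
img = compose_image(albedo, normals, light)
print(img.shape, img.min(), img.max())
```

Because shading depends only on geometry and lighting while albedo is geometry-independent, such a decomposition allows illumination to be edited without altering identity or texture.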


Learning Image-to-Surface Correspondence

March 2019 · 33 Reads

This thesis addresses the task of establishing a dense correspondence between an image and a 3D object template. We aim to bring vision systems closer to a surface-based 3D understanding of objects by extracting information that is complementary to existing landmark- or part-based representations. We use convolutional neural networks (CNNs) to densely associate pixels with intrinsic coordinates of 3D object templates. Through the established correspondences we effortlessly solve a multitude of visual tasks, such as appearance transfer, landmark localization and semantic segmentation, by transferring solutions from the template to an image. We show that geometric correspondence between an image and a 3D model can be effectively inferred for both the human face and the human body.
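For illustration, here is a hedged sketch of how a task such as semantic segmentation could be "solved by transfer" once per-pixel template coordinates are available (hypothetical helper names, not the thesis code):

```python
# Illustrative sketch (hypothetical): once every pixel is mapped to an
# intrinsic (u, v) template coordinate, labels painted once on the template
# can be looked up per pixel to produce, e.g., a semantic segmentation.
import numpy as np

def transfer_from_template(pixel_uv, template_labels):
    """pixel_uv: (H, W, 2) predicted template coordinates in [0, 1];
    template_labels: (T, T) labels defined once on the template surface."""
    T = template_labels.shape[0]
    ij = np.clip((pixel_uv * (T - 1)).round().astype(int), 0, T - 1)
    return template_labels[ij[..., 1], ij[..., 0]]       # (H, W) transferred labels

# Toy usage: a template split into two semantic regions.
rng = np.random.default_rng(0)
template = np.zeros((16, 16), dtype=int); template[8:, :] = 1
uv = rng.uniform(size=(24, 24, 2))                       # stand-in for CNN output
seg = transfer_from_template(uv, template)
print(seg.shape, np.unique(seg))
```

The same lookup mechanism would apply to appearance transfer or landmark localization: annotate the template once, then transfer through the predicted correspondences.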


Dense Pose Transfer

September 2018 · 1,126 Reads

In this work we integrate ideas from surface-based modeling with neural synthesis: we propose a combination of surface-based pose estimation and deep generative models that allows us to perform accurate pose transfer, i.e. synthesize a new image of a person based on a single image of that person and the image of a pose donor. We use a dense pose estimation system that maps pixels from both images to a common surface-based coordinate system, allowing the two images to be brought into correspondence with each other. We inpaint and refine the source image intensities in the surface coordinate system, prior to warping them onto the target pose. These predictions are fused with those of a convolutional predictive module through a neural synthesis module, allowing the whole pipeline to be trained jointly end-to-end by optimizing a combination of adversarial and perceptual losses. We show that dense pose estimation is a substantially more powerful conditioning input than landmark- or mask-based alternatives, and report systematic improvements over state-of-the-art generators on the DeepFashion and MVC datasets.
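The surface-based warping step can be sketched as follows (illustrative only; the function names and the nearest-neighbour texture lookup are simplifications, not the authors' pipeline): source pixels are scattered into a UV texture using the source image's dense surface coordinates, and that texture is then gathered at the target image's coordinates to render the person in the new pose.

```python
# Hedged sketch of surface-based pose transfer (not the authors' pipeline):
# scatter source colours into a UV texture via the source (u, v) map, then
# gather from the texture at the target (u, v) map to warp appearance.
import numpy as np

def scatter_to_texture(src_rgb, src_uv, src_mask, tex_size=64):
    """src_rgb: (H, W, 3), src_uv: (H, W, 2) in [0, 1], src_mask: (H, W) bool."""
    tex = np.zeros((tex_size, tex_size, 3))
    cnt = np.zeros((tex_size, tex_size, 1))
    ys, xs = np.nonzero(src_mask)
    uv = np.clip((src_uv[ys, xs] * (tex_size - 1)).astype(int), 0, tex_size - 1)
    np.add.at(tex, (uv[:, 1], uv[:, 0]), src_rgb[ys, xs])
    np.add.at(cnt, (uv[:, 1], uv[:, 0]), 1.0)
    return tex / np.maximum(cnt, 1.0)                    # average colours per texel

def gather_from_texture(tex, tgt_uv, tgt_mask):
    """Warp the texture onto the target pose by nearest-neighbour UV lookup."""
    tex_size = tex.shape[0]
    out = np.zeros(tgt_uv.shape[:2] + (3,))
    ys, xs = np.nonzero(tgt_mask)
    uv = np.clip((tgt_uv[ys, xs] * (tex_size - 1)).astype(int), 0, tex_size - 1)
    out[ys, xs] = tex[uv[:, 1], uv[:, 0]]
    return out

# Toy usage with random maps standing in for predicted dense correspondences.
rng = np.random.default_rng(0)
src_rgb = rng.uniform(size=(32, 32, 3))
src_uv, tgt_uv = rng.uniform(size=(2, 32, 32, 2))
mask = np.ones((32, 32), dtype=bool)
texture = scatter_to_texture(src_rgb, src_uv, mask)
warped = gather_from_texture(texture, tgt_uv, mask)
```

In the paper, the texture is additionally inpainted and refined before gathering, and the warped result is fused with a learned predictive branch rather than used directly.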


DensePose: Dense Human Pose Estimation in the Wild

June 2018 · 1,482 Reads · 1,818 Citations

In this work we establish dense correspondences between an RGB image and a surface-based representation of the human body, a task we refer to as dense human pose estimation. We gather dense correspondences for 50K persons appearing in the COCO dataset by introducing an efficient annotation pipeline. We then use our dataset to train CNN-based systems that deliver dense correspondence 'in the wild', namely in the presence of background, occlusions and scale variations. We improve our training set's effectiveness by training an inpainting network that can fill in missing ground truth values, and report improvements with respect to the best results that would be achievable in the past. We experiment with fully-convolutional networks and region-based models and observe a superiority of the latter. We further improve accuracy through cascading, obtaining a system that delivers highly accurate results at multiple frames per second on a single GPU. Supplementary materials, data, code, and videos are provided on the project page http://densepose.org.


Citations (7)


... where e_v is the camera embedding for viewpoint v. To incorporate cross-view dependency into T2I diffusion models, recent approaches, such as MVDream and SPAD [22], directly alter their attention layers. In contrast, MV-Adapter [4] adopts a plug-and-play approach by duplicating self-attention layers to create multiview attention layers. ...

Reference:

Refining Few-Step Text-to-Multiview Diffusion via Reinforcement Learning
SPAD: Spatially Aware Multi-View Diffusers
  • Citing Conference Paper
  • June 2024

... [Flattened contents of Table III, "Works classified by the problem they attempt to solve" (lists of reference numbers per task category: interactive 3D design, image generation, 3D scene generation, 3D reconstruction, 3D segmentation, text-to-3D generation, image-to-3D conversion, dataset augmentation, novel view synthesis, 2D and 3D style transfer, design optimization, graph generation, procedural generation, animation and rigging), omitted.] ... deformations, while GEM3D [52] breaks down the diffusion process into medial abstractions, enabling detailed skeletal synthesis for 3D shapes. These advances emphasize the importance of structural abstraction in creating functional assets. ...

Repurposing Diffusion Inpainters for Novel View Synthesis
  • Citing Conference Paper
  • December 2023

... Still, these require rebuilding the avatar for each pose, leading to topological inconsistencies and time-consuming mesh extraction using Marching Cubes (MC) (Lorensen and Cline 1987). Recent research (Kant et al. 2023) has explored invertible neural networks to address these shortcomings but hasn't yet solved them for textured avatars. Our approach overcomes these challenges by maintaining topological consistency and significantly improving animation speed, making it two orders of magnitude faster than previous works. ...

Invertible Neural Skinning
  • Citing Conference Paper
  • June 2023

... Satellite image time series datasets. Many SITS datasets have been created in recent years for applications to crop-type mapping [67,69,84,31,43,76], wildfire spread prediction [33], tree species identification [6], wilderness mapping [23], change detection [79,81,82], land-cover mapping [29,85], low-to-high resolution knowledge transfer [48], cloud removal [22], and electricity access detection [51]. DAFA-LS is the first open-access dataset for the detection of looted archaeological sites. ...

Context-Self Contrastive Pretraining for Crop Type Semantic Segmentation
  • Citing Article
  • January 2022

IEEE Transactions on Geoscience and Remote Sensing

... Figure 2. Illustration of the training and inference pipelines for applying the method proposed in [54] to virtually try on a loose-fitting garment. (a) During training, the garment synthesis network is trained using degraded DensePose [17] outputs. (b) During inference, a distribution mismatch between training and input data leads to suboptimal garment synthesis results. ...

DensePose: Dense Human Pose Estimation in the Wild
  • Citing Conference Paper
  • June 2018

... However, in the semi-supervised setting, our method performs remarkably better than SA, with a 2.35%p gain at the 10% labeled-data ratio. [Flattened table rows: ... [42] 49.87 / 5.05; DenseReg+MDM [44] 52.19 / 3.67; JMFA [45] 54.85 / 1.00; LAB [19] 58.85 / 0.83; 3FabRec [6] 54.61 / 0.17; HybridMatch (ours) SSL (20%) 60.56 / 0.17.] To investigate the effectiveness of our HybridMatch, we compare the proposed method with other fully-supervised models using FR and AUC metrics. ...

DenseReg: Fully Convolutional Dense Shape Regression In-the-Wild

... While the dominant paradigm to supervise pose estimators has been sparse keypoints and skeletons, a few works have explored dense, continuously varying keypoints for supervision and prediction. DenseReg [1] predicts a dense 2D deformation grid to self-supervise a face regressor. In DensePose [29,73], 2D keypoint estimation is generalized to a continuous representation, where arbitrary surface points can be regression targets, instead of having a fixed set of sparse keypoints throughout the dataset. ...

DenseReg: Fully Convolutional Dense Shape Regression In-the-Wild
  • Citing Conference Paper
  • July 2017