Sameh Khamis's research while affiliated with Google Inc. and other places

Publications (29)

Preprint
We present a large-scale synthetic dataset for novel view synthesis consisting of ~300k images rendered from nearly 2000 complex scenes using high-quality ray tracing at high resolution (1600 × 1600 pixels). The dataset is orders of magnitude larger than existing synthetic datasets for novel view synthesis, thus providing a large unified benchmark...
Preprint
Full-text available
Acquisition and creation of digital human avatars is an important problem with applications to virtual telepresence, gaming, and human modeling. Most contemporary approaches for avatar generation can be viewed either as 3D-based methods, which use multi-view data to learn a 3D representation with appearance (such as a mesh, implicit surface, or vol...
Preprint
Unsupervised generation of high-quality multi-view-consistent images and 3D shapes using only collections of single-view 2D photographs has been a long-standing challenge. Existing 3D GANs are either compute-intensive or make approximations that are not 3D-consistent; the former limits quality and resolution of the generated images and the latter a...
Preprint
Full-text available
The task of shape space learning involves mapping a training set of shapes to and from a latent representation space with good generalization properties. Often, real-world collections of shapes have symmetries, which can be defined as transformations that do not change the essence of the shape. A natural way to incorporate symmetries in shape space le...
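To make the notion concrete: one standard way to formalize symmetry-aware shape space learning (an illustrative formulation, not necessarily this paper's exact definition) is to require the encoder E and decoder D to be equivariant to a symmetry group G acting on shapes X, via a representation ρ of G on the latent space:

$$E(g \cdot X) = \rho(g)\,E(X), \qquad D(\rho(g)\,z) = g \cdot D(z) \quad \text{for all } g \in G.$$

Under these constraints, transforming a shape and then encoding it yields the same latent code as encoding first and transforming the code, so the latent space inherits the symmetry rather than having to learn it from data.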
Preprint
We present HIPNet, a neural implicit pose network trained on multiple subjects across many poses. HIPNet can disentangle subject-specific details from pose-specific details, effectively enabling us to retarget motion from one subject to another or to animate between keyframes through latent space interpolation. To this end, we employ a hierarchical...
Preprint
Full-text available
We present Neural Kernel Fields: a novel method for reconstructing implicit 3D shapes based on a learned kernel ridge regression. Our technique achieves state-of-the-art results when reconstructing 3D objects and large scenes from sparse oriented points, and can reconstruct shape categories outside the training set with almost no drop in accuracy....
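For intuition, kernel ridge regression can fit an implicit surface from oriented points even with a fixed kernel. The sketch below uses a plain Gaussian kernel and hypothetical parameter names; the method's key ingredient, the learned kernel, is deliberately omitted. Off-surface samples offset along the normals supply the nonzero targets that make the zero level set well defined.

import numpy as np

def fit_implicit_krr(points, normals, eps=0.01, lam=1e-6, sigma=0.1):
    # Augment the on-surface points (target 0) with off-surface samples
    # at +/- eps along the normals (targets +/- eps).
    x = np.concatenate([points, points + eps * normals, points - eps * normals])
    y = np.concatenate([np.zeros(len(points)),
                        eps * np.ones(len(points)),
                        -eps * np.ones(len(points))])
    # Gram matrix of a fixed Gaussian kernel between all sample pairs.
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2 * sigma ** 2))
    # Kernel ridge regression: solve (K + lam*I) alpha = y.
    alpha = np.linalg.solve(K + lam * np.eye(len(x)), y)

    def f(q):
        # Evaluate f at query points q; the surface is the zero level set.
        d2q = ((q[:, None, :] - x[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2q / (2 * sigma ** 2)) @ alpha
    return f

The returned f can then be evaluated on a grid and meshed with marching cubes.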
Preprint
We consider the challenging problem of predicting intrinsic object properties from a single image by exploiting differentiable renderers. Many previous learning-based approaches for inverse graphics adopt rasterization-based renderers and assume naive lighting and material models, which often fail to account for non-Lambertian, specular reflections...
Preprint
We propose a method to create plausible geometric and texture style variations of 3D objects in the quest to democratize 3D content creation. Given a pair of textured source and target objects, our method predicts a part-aware affine transformation field that naturally warps the source shape to imitate the overall geometric style of the target. In...
Preprint
We propose a novel efficient and lightweight model for human pose estimation from a single image. Our model is designed to achieve competitive results at a fraction of the number of parameters and computational cost of various state-of-the-art methods. To this end, we explicitly incorporate part-based structural and geometric priors in a hierarchic...
Preprint
We explore total scene capture -- recording, modeling, and rerendering a scene under varying appearance such as season and time of day. Starting from internet photos of a tourist landmark, we apply traditional 3D reconstruction to register the photos and approximate the scene as a point cloud. For each photo, we render the scene points into a deep...
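The "render the scene points into a deep buffer" step can be pictured as a z-buffered splat of per-point features into image space. The following is a minimal sketch under assumed conventions (pinhole camera (K, R, t), one nearest point per pixel, hypothetical names), not the paper's actual renderer.

import numpy as np

def splat_deep_buffer(points, feats, K, R, t, h, w):
    # World -> camera -> pixel projection for all points at once.
    cam = R @ points.T + t[:, None]
    uvz = K @ cam
    u = (uvz[0] / uvz[2]).round().astype(int)
    v = (uvz[1] / uvz[2]).round().astype(int)
    z = uvz[2]
    buf = np.zeros((h, w, feats.shape[1]), dtype=np.float32)
    depth = np.full((h, w), np.inf)
    # Keep only points in front of the camera and inside the image.
    ok = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    for i in np.flatnonzero(ok):
        # Z-buffer test: the nearest point wins the pixel.
        if z[i] < depth[v[i], u[i]]:
            depth[v[i], u[i]] = z[i]
            buf[v[i], u[i]] = feats[i]
    return buf, depth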
Conference Paper
Motivated by augmented and virtual reality applications such as telepresence, there has been a recent focus on real-time performance capture of humans under motion. However, given the real-time constraint, these systems often suffer from artifacts in geometry and texture such as holes and noise in the final rendering, poor lighting, and low-resolut...
Conference Paper
The advent of consumer depth cameras has incited the development of a new cohort of algorithms tackling challenging computer vision problems. The primary reason is that depth provides direct geometric information that is largely invariant to texture and illumination. As such, substantial progress has been made in human and object pose estimation, 3...
Conference Paper
Augmented reality (AR) for smartphones has matured from a technology for early adopters, available only on select high-end phones, to one that is truly available to the general public. One of the key breakthroughs has been in low-compute methods for six degree of freedom (6DoF) tracking on phones using only the existing hardware (camera and inert...
Preprint
Motivated by augmented and virtual reality applications such as telepresence, there has been a recent focus on real-time performance capture of humans under motion. However, given the real-time constraint, these systems often suffer from artifacts in geometry and texture such as holes and noise in the final rendering, poor lighting, and low-resolut...
Chapter
This paper presents StereoNet, the first end-to-end deep architecture for real-time stereo matching that runs at 60 fps on an NVidia Titan X, producing high-quality, edge-preserved, quantization-free disparity maps. A key insight of this paper is that the network achieves a sub-pixel matching precision that is an order of magnitude higher than those of tradit...
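The sub-pixel behavior comes from reading disparity out of a cost volume as an expectation rather than a hard argmin. Below is a minimal numpy sketch of this soft-argmin; the (D, H, W) shape and the lower-cost-is-better convention are assumptions for illustration.

import numpy as np

def soft_argmin_disparity(cost_volume):
    # cost_volume has shape (D, H, W); lower cost means a better match.
    logits = -cost_volume
    # Numerically stable softmax over the disparity axis.
    p = np.exp(logits - logits.max(axis=0, keepdims=True))
    p /= p.sum(axis=0, keepdims=True)
    disparities = np.arange(cost_volume.shape[0]).reshape(-1, 1, 1)
    # The expectation over disparity levels is continuous, hence sub-pixel.
    return (p * disparities).sum(axis=0)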
Chapter
In this paper we present ActiveStereoNet, the first deep learning solution for active stereo systems. Due to the lack of ground truth, our method is fully self-supervised, yet it produces precise depth with a subpixel precision of 1/30th of a pixel; it does not suffer from the common over-smoothing issues; it preserves the edges; and it explicitl...
Conference Paper
Full-text available
In this paper we present ActiveStereoNet, the first deep learning solution for active stereo systems. Due to the lack of ground truth, our method is fully self-supervised, yet it produces precise depth with a subpixel precision of 1/30th of a pixel; it does not suffer from the common over-smoothing issues; it preserves the edges; and it explicitl...
Preprint
This paper presents StereoNet, the first end-to-end deep architecture for real-time stereo matching that runs at 60 fps on an NVidia Titan X, producing high-quality, edge-preserved, quantization-free disparity maps. A key insight of this paper is that the network achieves a sub-pixel matching precision that is an order of magnitude higher than those of tradi...
Preprint
In this paper we present ActiveStereoNet, the first deep learning solution for active stereo systems. Due to the lack of ground truth, our method is fully self-supervised, yet it produces precise depth with a subpixel precision of 1/30th of a pixel; it does not suffer from the common over-smoothing issues; it preserves the edges; and it explicitl...
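Self-supervision in this setting typically means reconstructing one view from the other using the predicted disparity and penalizing the photometric error. The sketch below shows that core warp-and-compare step with a plain L1 loss and hypothetical names; the actual method additionally uses a locally normalized, window-based cost to cope with illumination, which this toy version omits.

import numpy as np

def photometric_loss(left, right, disparity):
    # Reconstruct the left image by sampling the right image at x - d(x).
    h, w = left.shape
    xs = np.tile(np.arange(w, dtype=np.float32), (h, 1))
    src = xs - disparity                       # sample positions in the right image
    x0 = np.clip(np.floor(src).astype(int), 0, w - 2)
    frac = np.clip(src - x0, 0.0, 1.0)
    rows = np.arange(h)[:, None]
    # Linear interpolation keeps the warp differentiable w.r.t. disparity.
    recon = (1 - frac) * right[rows, x0] + frac * right[rows, x0 + 1]
    return np.abs(left - recon).mean()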
Article
Full-text available
We present Motion2Fusion, a state-of-the-art 360° performance capture system that enables real-time reconstruction of arbitrary non-rigid scenes. We provide three major contributions over prior work: 1) a new non-rigid fusion pipeline allowing for far more faithful reconstruction of high-frequency geometric details, avoiding the over-smoothing and...
Conference Paper
Full-text available
We present an end-to-end system for augmented and virtual reality telepresence, called Holoportation. Our system demonstrates high-quality, real-time 3D reconstructions of an entire space, including people, furniture and objects, using a set of new depth cameras. These 3D models can also be transmitted in real-time to remote users. This allows user...
Article
Full-text available
We contribute a new pipeline for live multi-view performance capture, generating temporally coherent high-quality reconstructions in real-time. Our algorithm supports both incremental reconstruction, improving the surface estimation over time, as well as parameterizing the nonrigid scene motion. Our approach is highly robust to both large frame-to-...
Article
Fully articulated hand tracking promises to enable fundamentally new interactions with virtual and augmented worlds, but the limited accuracy and efficiency of current systems has prevented widespread adoption. Today's dominant paradigm uses machine learning for initialization and recovery followed by iterative model-fitting optimization to achieve...
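The two-stage pattern described here, a learned initializer followed by iterative model fitting, is easy to sketch generically. Below is a minimal Gauss-Newton refinement loop with hypothetical residual/Jacobian callbacks standing in for a real hand model; it illustrates the pattern, not the paper's actual optimizer.

import numpy as np

def fit_model(observed, residual_fn, jacobian_fn, init_pose, iters=10):
    # init_pose would come from the learned initializer; Gauss-Newton
    # iterations then refine it against the observed data.
    pose = init_pose.copy()
    for _ in range(iters):
        r = residual_fn(pose, observed)      # model-to-data error vector
        J = jacobian_fn(pose, observed)      # d(residual)/d(pose)
        # Solve the (damped) normal equations for the Gauss-Newton step.
        step = np.linalg.solve(J.T @ J + 1e-6 * np.eye(len(pose)), -J.T @ r)
        pose += step
    return pose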

Citations

... Yin et al. [44] recently proposed an optimization-based geometry and texture stylization approach that uses differentiable rendering, a fundamentally different way of operating on texture map data than the one we explore in this work. ...
... To generate novel-view images from arbitrary viewpoints, these methods need more input images to reconstruct the scene. Some works build 3D scenes by combining a geometric representation with color [39,46], texture [7], light fields [3,50], or neural rendering [1,10,30,33,37,42,48]. The 3D implicit representation based on neural radiance fields (NeRF) [35] greatly improves the quality of novel view generation. ...
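For reference, the NeRF volume rendering quadrature alluded to here composites per-sample densities σᵢ and colors cᵢ along each camera ray (the standard published formulation, restated for convenience):

$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \bigl(1 - e^{-\sigma_i \delta_i}\bigr)\,\mathbf{c}_i, \qquad T_i = \exp\Bigl(-\sum_{j<i} \sigma_j \delta_j\Bigr),$$

where δᵢ is the spacing between adjacent samples along the ray and Tᵢ is the accumulated transmittance.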
... Here we set the minimum reasonable application-specific inference throughput values (IPS_min) to be ∼40 and ∼6 for the hand detection and eye segmentation applications, respectively. The IPS_min values are based on latency metrics estimated in recent studies of both applications [19], [20]. ...
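These thresholds are easy to sanity-check: under a single-stream assumption, the minimum throughput is simply the inverse of the per-frame latency budget. A back-of-the-envelope sketch (the concrete budgets below are illustrative assumptions, not values taken from [19], [20]):

def min_ips(latency_budget_s):
    # Single-stream assumption: each inference must finish before the next starts.
    return 1.0 / latency_budget_s

print(round(min_ips(0.025)))   # ~40 IPS for a ~25 ms budget (hand detection)
print(round(min_ips(1 / 6)))   # ~6 IPS for a ~167 ms budget (eye segmentation)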
... We leverage a sparse set of commodity RGBD cameras to resolve this ambiguity. LookinGood [28] and HVSNet [33] are thus closely related to our approach. Both utilize RGBD camera inputs and achieve great visual quality while generalizing to new people and actions. ...
... More recently, depth maps have found relevant use in augmented reality applications, especially to manage occlusion effects between the real world and virtual content (Valentin et al., 2019). In the Depth API provided by ARCore, depth maps are generated by hybrid optimization of two algorithms: PatchMatch (Bleyer et al., 2011) and HashMatch (Fanello et al., 2017). ...
... This method optimizes a series of sparse hyperplanes and reduces the complexity of the matching cost computation to O(1), but faces difficulties in textureless scenes due to the limitations of its shallow descriptor and local optimization framework. ActiveStereoNet (ASN) [121], as shown in Figure 17, realized an end-to-end, unsupervised deep neural network (DNN) scheme for RSP 3D measurement. A novel loss function was used in ASN to deal with the challenges of active stereo matching (e.g., illumination, high-frequency noise, occlusion). ...
... Stereo depth estimation, which predicts 3D geometry for practical real-world applications such as autonomous driving [2], has been developed through handcrafted methods [13,11,45,15] and through deep stereo models based on supervised learning [25,4,36,17] that leverage the excellent representation power of deep neural networks. In general, since the high performance of deep networks is guaranteed only when test and training data are derived from a similar underlying distribution [6,24,14,22,7], they demand a huge amount of annotated training data to reflect the real-world distribution. ...
... To mitigate the aforementioned issues, an intuitive solution is to fine-tune the stereo model trained on a large-scale synthetic dataset, for which ground truth is easier to collect. However, despite the help of large-scale synthetic datasets, most recent works [37,28,48,46] have pointed out a limitation of fine-tuning: it is impossible to collect sufficient data in advance when running stereo models in the open world. While domain generalization methods [44,34] have shown promising results without real images, they require heavy computation to produce generalized stereo models and often fail to adapt to continuously changing environments. ...
... To obtain the same shape and texture as the real human body, two categories of solutions have emerged. One is non-parametric reconstruction methods [6,7,11,31] based on multi-camera calibration and point cloud fusion; the other is parametric reconstruction methods [3,8,14,16] based on deforming a 3D human model template. Parametric reconstruction methods achieve high performance without relying on expensive computing resources. ...
... In addition to methods for rigid objects, a large amount of work on the tracking of human hands and bodies exists. While deep learning has become highly popular for those tasks [47], [48], [49], many model-based methods still use the original formulation of Lowe to consider the underlying kinematic structure [50], [51], [52]. However, with techniques like keypoint detection and regularization being highly optimized for their respective domains, there is no straightforward application of such algorithms to arbitrary multi-body systems. ...