Conference Paper

S2CMAF: Multi-Method Assessment Fusion for Scan-to-CAD Methods



... However, in most cases, the workflow begins with a point cloud. Various authors have highlighted the capabilities of this technology, particularly in workflows such as Scan-to-CAD [9][10][11] and Scan-to-BIM [12,13], which automate processes and significantly reduce the time required to convert spatial points into precise geometric models. However, given the large volume of data generated in point clouds, it is essential to reduce the data size through segmentation and filtering. ...
Article
Accurate documentation of the geometry of historical buildings presents a considerable challenge, especially when dealing with complex structures like the Metropolitan Cathedral of Valencia. Advanced technologies such as 3D laser scanning have enabled detailed spatial data capture. Still, efficient handling of this data remains challenging due to the volume and complexity of the information. This study explores the application of clustering techniques employing Machine Learning-based algorithms, such as DBSCAN and K-means, to automate the process of point cloud analysis and modelling, focusing on identifying and extracting floor plans. The proposed methodology includes data geo-referencing, culling points to reduce file size, and automated floor plan extraction through filtering and segmentation. This approach aims to streamline the documentation and modelling of historical buildings and enhance the accuracy of historical architectural surveys, significantly contributing to the preservation of cultural heritage by providing a more efficient and accurate method of data analysis.
Conference Paper
Remote control of an autonomous vehicle by a human operator requires low delay video transmission to resolve complex situations and ensure safety. The remote operator perceives the current traffic scenario via video streams from multiple cameras. To provide the operator with the best possible scene understanding while matching the available network resources, the video streams need to be automatically adapted. In this paper, we propose a traffic-aware multi-view video stream adaptation scheme. We estimate the importance of each camera view based on the vehicle's real-time movement in traffic. The resulting prioritization together with the total available transmission rate determines a specific bit-budget for each camera view. We optimize the video quality of each individual video stream for the given bit-budget using a quality-of-experience-driven multi-dimensional adaptation scheme. Additionally, we apply a region-of-interest mask to the rear-facing camera views. The mask removes less important areas from the image which reduces the required bitrate. All modules are implemented to extend the existing TELECARLA framework. We evaluate the proposed traffic-aware adaptation scheme in a user study. We observe a high correlation between the proposed view prioritization module and the subjective ratings obtained in the user study. The region-of-interest masking achieves Bjøntegaard Delta Rate savings of at least 19.8% compared to streaming the full camera view. The overall system improves the VMAF score by 1.86 per camera when considering the importance of the individual camera views as rated by the users. This demonstrates the potential of an individual adaptation for each camera view optimized for the current traffic situation.
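The abstract above says that the view prioritization and the total available transmission rate together determine a bit-budget per camera view, but it does not state the allocation rule. The sketch below assumes a simple proportional split, which is an illustrative choice and not necessarily the scheme used in the paper. The function name `allocate_bit_budget`, the view names, and the kbps unit are all hypothetical.

```python
def allocate_bit_budget(priorities, total_rate_kbps):
    """Split a total rate across camera views in proportion to their priority.

    ASSUMPTION: proportional allocation is used only for illustration;
    the abstract does not specify how priorities map to budgets.
    """
    total_priority = sum(priorities.values())
    if total_priority <= 0:
        raise ValueError("at least one view needs a positive priority")
    return {
        view: total_rate_kbps * weight / total_priority
        for view, weight in priorities.items()
    }


# Hypothetical priorities for three camera views while driving forward.
budgets = allocate_bit_budget({"front": 2, "left": 1, "right": 1}, 8000)
```

Under this assumed rule the budgets always sum to the total rate, so a higher priority for one view directly reduces the rate left for the others.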
Conference Paper
Design and optimization of vibrotactile codecs require precise measurements of the compressed signals' perceptual quality. In this paper, we present two computational approaches for estimating vibrotactile signal quality. First, we propose a novel full-reference vibrotactile quality metric called Spectral Perceptual Quality Index (SPQI), which computes a similarity score based on a computed perceptually weighted error measure. Second, we use the concept of Multi-Method Assessment Fusion (MAF) to predict the subjective quality. MAF uses a Support Vector Machine regressor to fuse multiple elementary metrics into a final quality score, which preserves the strengths of the individual metrics. We evaluate both proposed quality assessment methods on an extended subjective dataset, which we introduce as part of this work. For two of three tested vibrotactile codecs, the MSE between subjective ratings and the SPQI is reduced by 64% and 92%, respectively, compared to the state of the art. With our MAF approach, we obtain the only currently available metric that accurately predicts real human user experiments for all three tested codecs. The MAF estimations reduce the average MSE to the subjective ratings over all three tested codecs by 59% compared to the best performing elementary metric.
Conference Paper
Bounding box regression is the crucial step in object detection. In existing methods, while ℓn-norm loss is widely adopted for bounding box regression, it is not tailored to the evaluation metric, i.e., Intersection over Union (IoU). Recently, IoU loss and generalized IoU (GIoU) loss have been proposed to benefit the IoU metric, but still suffer from the problems of slow convergence and inaccurate regression. In this paper, we propose a Distance-IoU (DIoU) loss by incorporating the normalized distance between the predicted box and the target box, which converges much faster in training than IoU and GIoU losses. Furthermore, this paper summarizes three geometric factors in bounding box regression, i.e., overlap area, central point distance and aspect ratio, based on which a Complete IoU (CIoU) loss is proposed, thereby leading to faster convergence and better performance. By incorporating DIoU and CIoU losses into state-of-the-art object detection algorithms, e.g., YOLO v3, SSD and Faster R-CNN, we achieve notable performance gains in terms of not only IoU metric but also GIoU metric. Moreover, DIoU can be easily adopted into non-maximum suppression (NMS) to act as the criterion, further boosting performance improvement. The source code and trained models are available at https://github.com/Zzh-tju/DIoU
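As a minimal sketch of the DIoU idea described above, the loss can be written as 1 - IoU plus the squared distance between box centers, normalized by the squared diagonal of the smallest box enclosing both. The code below assumes axis-aligned boxes given as (x1, y1, x2, y2) with positive area; the function name `diou_loss` is illustrative and this is not the authors' released implementation.

```python
def diou_loss(box_a, box_b):
    """Distance-IoU loss for two axis-aligned boxes (x1, y1, x2, y2).

    ASSUMPTION: both boxes have positive width and height.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Overlap area (zero when the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union

    # Squared distance between the two box centers.
    center_dist_sq = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
        + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2

    # Squared diagonal of the smallest box enclosing both boxes.
    enclose_w = max(ax2, bx2) - min(ax1, bx1)
    enclose_h = max(ay2, by2) - min(ay1, by1)
    diag_sq = enclose_w ** 2 + enclose_h ** 2

    return 1.0 - iou + center_dist_sq / diag_sq
```

Unlike a plain IoU loss, the distance term stays informative when the boxes do not overlap, which is consistent with the faster convergence the abstract reports.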
Article
We present ShapeNet: a richly-annotated, large-scale repository of shapes represented by 3D CAD models of objects. ShapeNet contains 3D models from a multitude of semantic categories and organizes them under the WordNet taxonomy. It is a collection of datasets providing many semantic annotations for each 3D model such as consistent rigid alignments, parts and bilateral symmetry planes, physical sizes, keywords, as well as other planned annotations. Annotations are made available through a public web-based interface to enable data visualization of object attributes, promote data-driven geometric analysis, and provide a large-scale quantitative benchmark for research in computer graphics and vision. At the time of this technical report, ShapeNet has indexed more than 3,000,000 models, of which 220,000 are classified into 3,135 categories (WordNet synsets). In this report we describe the ShapeNet effort as a whole, provide details for all currently available datasets, and summarize future plans.
Conference Paper
The possibility to provision road vehicles unmanned and on demand will have an important influence on the development of new mobility concepts. We therefore present the teleoperated driving of road vehicles. This paper outlines the basic concepts, including a static multi-camera design, an operator interface with a sensor fusion based display, and a cellular network based video transmission and communication architecture. We also show how we manage to fulfill the system’s technical requirements with our hard- and software design and point out the occurring problems due to communication limitations and lack of situation awareness. Finally, we propose solutions to guarantee driving safety.
Article
This paper proposes a new tree-based ensemble method for supervised classification and regression problems. It essentially consists of randomizing strongly both attribute and cut-point choice while splitting a tree node. In the extreme case, it builds totally randomized trees whose structures are independent of the output values of the learning sample. The strength of the randomization can be tuned to problem specifics by the appropriate choice of a parameter. We evaluate the robustness of the default choice of this parameter, and we also provide insight on how to adjust it in particular situations. Besides accuracy, the main strength of the resulting algorithm is computational efficiency. A bias/variance analysis of the Extra-Trees algorithm is also provided as well as a geometrical and a kernel characterization of the models induced.
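The node-splitting idea described above (randomizing both the attribute and the cut-point) can be sketched for regression as follows. This is a simplified illustration, not the paper's reference implementation: the function name `random_split`, the parameter name `k`, and the use of variance reduction as the score are assumptions made for this example. With `k = 1` the choice is fully random, which corresponds to the "totally randomized" extreme mentioned in the abstract.

```python
import random
from statistics import pvariance


def random_split(X, y, k, rng):
    """Draw up to k random (feature, cut-point) candidates for one node.

    Returns (score, feature_index, cut_point) for the best candidate,
    or None if no candidate separates the samples.
    ASSUMPTION: the score is variance reduction of the targets y.
    """
    n_features = len(X[0])
    # Only non-constant features can separate the samples.
    usable = [j for j in range(n_features)
              if min(row[j] for row in X) < max(row[j] for row in X)]
    best = None
    for j in rng.sample(usable, min(k, len(usable))):
        lo = min(row[j] for row in X)
        hi = max(row[j] for row in X)
        cut = rng.uniform(lo, hi)  # cut-point is drawn at random, not optimized
        left = [t for row, t in zip(X, y) if row[j] < cut]
        right = [t for row, t in zip(X, y) if row[j] >= cut]
        if not left or not right:
            continue
        within = (len(left) * pvariance(left)
                  + len(right) * pvariance(right)) / len(y)
        score = pvariance(y) - within
        if best is None or score > best[0]:
            best = (score, j, cut)
    return best


# Usage: one informative feature and one constant feature.
X = [[0.0, 5.0], [1.0, 5.0], [2.0, 5.0], [3.0, 5.0]]
y = [0.0, 0.0, 1.0, 1.0]
split = random_split(X, y, k=1, rng=random.Random(0))
```

Because no search over cut-points is performed, each split costs only one pass over the node's samples per candidate, in line with the computational efficiency the abstract highlights.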
Article
We address the task of aligning CAD models to a video sequence of a complex scene containing multiple objects. Our method can process arbitrary videos and fully automatically recover the 9 DoF pose for each object appearing in it, thus aligning them in a common 3D coordinate frame. The core idea of our method is to integrate neural network predictions from individual frames with a temporally global, multi-view constraint optimization formulation. This integration process resolves the scale and depth ambiguities in the per-frame predictions, and generally improves the estimate of all pose parameters. By leveraging multi-view constraints, our method also resolves occlusions and handles objects that are out of view in individual frames, thus reconstructing all objects into a single globally consistent CAD representation of the scene. In comparison to the state-of-the-art single-frame method Mask2CAD that we build on, we achieve substantial improvements on the Scan2CAD dataset (from 11.6% to 30.7% class average accuracy).
Chapter
Object recognition has seen significant progress in the image domain, with focus primarily on 2D perception. We propose to leverage existing large-scale datasets of 3D models to understand the underlying 3D structure of objects seen in an image by constructing a CAD-based representation of the objects and their poses. We present Mask2CAD, which jointly detects objects in real-world images and for each detected object, optimizes for the most similar CAD model and its pose. We construct a joint embedding space between the detected regions of an image corresponding to an object and 3D CAD models, enabling retrieval of CAD models for an input RGB image. This produces a clean, lightweight representation of the objects in an image; this CAD-based representation ensures a valid, efficient shape representation for applications such as content creation or interactive scenarios, and makes a step towards understanding the transformation of real-world imagery to a synthetic domain. Experiments on real-world images from Pix3D demonstrate the advantage of our approach in comparison to state of the art. To facilitate future research, we additionally propose a new image-to-3D baseline on ScanNet which features larger shape diversity, real-world occlusions, and challenging image views.
Chapter
We present a novel approach to reconstructing lightweight, CAD-based representations of scanned 3D environments from commodity RGB-D sensors. Our key idea is to jointly optimize for both CAD model alignments as well as layout estimations of the scanned scene, explicitly modeling inter-relationships between objects-to-objects and objects-to-layout. Since object arrangement and scene layout are intrinsically coupled, we show that treating the problem jointly significantly helps to produce globally-consistent representations of a scene. Object CAD models are aligned to the scene by establishing dense correspondences between geometry, and we introduce a hierarchical layout prediction approach to estimate layout planes from corners and edges of the scene. To this end, we propose a message-passing graph neural network to model the inter-relationships between objects and layout, guiding generation of a globally consistent object alignment in a scene. By considering the global scene layout, we achieve significantly improved CAD alignments compared to state-of-the-art methods, improving from 41.83% to 58.41% alignment accuracy on SUNCG and from 50.05% to 61.24% on ScanNet, respectively. The resulting CAD-based representations make our method well-suited for applications in content creation such as augmented or virtual reality.
Conference Paper
Point cloud is a 3D image representation that has recently emerged as a viable approach for advanced content modality in modern communication systems. In view of its wide adoption, quality evaluation metrics are essential. In this paper, we propose and assess a family of statistical dispersion measurements for the prediction of perceptual degradations. The employed features characterize local distributions of point cloud attributes reflecting topology and color. After associating local regions between a reference and a distorted model, the corresponding feature values are compared. The visual quality of a distorted model is then predicted by error pooling across individual quality scores obtained per region. The extracted features aim at capturing local changes, similarly to the well-known Structural Similarity Index. Benchmarking results using available datasets reveal best-performing attributes and features, under different neighborhood sizes. Finally, point cloud voxelization is examined as part of the process, improving the prediction accuracy under certain conditions.
Book
The methodology used to construct tree structured rules is the focus of this monograph. Unlike many other statistical procedures, which moved from pencil and paper to calculators, this text's use of trees was unthinkable before computers. Both the practical and theoretical sides have been developed in the authors' study of tree methods. Classification and Regression Trees reflects these two sides, covering the use of trees as a data analysis method, and in a more mathematical framework, proving some of their fundamental properties.
Conference Paper
Self-driving or autonomous vehicle systems are being designed around the world with increasing success in recent years. In spite of many advances so far, it is unlikely that such systems are going to ever achieve perfect accuracy under all conditions. In particular, occasional failures are anticipated when such vehicles encounter situations not observed before, or conflicting information is available to the system from the environment. Under such infrequent failure scenarios, the research community has so far considered two alternatives -- to return control to the driver in the vehicle, which has its own challenges and limitations, or to attempt to safely "park" the vehicle out of harm's way. In this paper, we argue that a viable third alternative exists -- on failure of the self-driving function in the vehicle, the system could return control to a remote human driver located in response centers distributed across the world. This remote human driver will augment the self-driving system in vehicles, only when failures occur, which may be due to bad weather, malfunction, contradiction in sensory inputs, and other such conditions. Of course, a remote driving extension is fraught with many challenges, including the need for some Quality of Service guarantees, both in latency and throughput, in connectivity between the vehicles on the road and the response center, so that the remote drivers can react efficiently to the road conditions. To understand some of the challenges, we have set up a real-time streaming testbed and evaluated frame latency with different parameter settings under today's LTE and Wi-Fi networks. While additional optimization techniques can be applied to further reduce streaming latency, we recognize that significant new design of the communication infrastructure is both necessary and possible.
Article
In this paper, we introduce the concept of learning latent super-events from activity videos, and present how it benefits activity detection in continuous videos. We define a super-event as a set of multiple events occurring together in videos with a particular temporal organization; it is the opposite concept of sub-events. Real-world videos contain multiple activities and are rarely segmented (e.g., surveillance videos), and learning latent super-events allows the model to capture how the events are temporally related in videos. We design temporal structure filters that enable the model to focus on particular sub-intervals of the videos, and use them together with a soft attention mechanism to learn representations of latent super-events. Super-event representations are combined with per-frame or per-segment CNNs to provide frame-level annotations. Our approach is designed to be fully differentiable, enabling end-to-end learning of latent super-event representations jointly with the activity detector using them. Our experiments with multiple public video datasets confirm that the proposed concept of latent super-event learning significantly benefits activity detection, advancing the state of the art.
Article
A key requirement for leveraging supervised deep learning methods is the availability of large, labeled datasets. Unfortunately, in the context of RGB-D scene understanding, very little data is available -- current datasets cover a small range of scene views and have limited semantic annotations. To address this issue, we introduce ScanNet, an RGB-D video dataset containing 2.5M views in 1513 scenes annotated with 3D camera poses, surface reconstructions, and semantic segmentations. To collect this data, we designed an easy-to-use and scalable RGB-D capture system that includes automated surface reconstruction and crowdsourced semantic annotation. We show that using this data helps achieve state-of-the-art performance on several 3D scene understanding tasks, including 3D object classification, semantic voxel labeling, and CAD model retrieval. The dataset is freely available at http://www.scan-net.org.
Conference Paper
Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and weighted quantile sketch for approximate tree learning. More importantly, we provide insights on cache access patterns, data compression and sharding to build a scalable tree boosting system. By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems.
Article
Given a single photo of a room and a large database of furniture CAD models, our goal is to reconstruct a scene that is as similar as possible to the scene depicted in the photograph, and composed of objects drawn from the database. We present a completely automatic system to address this IM2CAD problem that produces high quality results on challenging imagery from real estate web sites. Our approach iteratively optimizes the placement and scale of objects in the room to best match scene renderings to the input photo, using image comparison metrics trained using deep convolutional neural nets. By operating jointly on the full scene at once, we account for inter-object occlusions.
Article
Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International conference, ***, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.
Article
Parametric correspondence is a technique for matching images to a three dimensional symbolic reference map. An analytic camera model is used to predict the location and appearance of landmarks in the image, generating a projection for an assumed viewpoint. Correspondence is achieved by adjusting the parameters of the camera model until the appearances of the landmarks optimally match a symbolic description extracted from the image. The matching of image and map features is performed rapidly by a new technique, called "chamfer matching", that compares the shapes of two collections of shape fragments, at a cost proportional to linear dimension, rather than area. These two techniques permit the matching of spatially extensive features on the basis of shape, which reduces the risk of ambiguous matches and the dependence on viewing conditions inherent in conventional image-based correlation matching.
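To make the notion of comparing two collections of shape fragments concrete, the sketch below computes a symmetric chamfer distance between two point sets. Note the substitution: this is a brute-force nearest-neighbor version, quadratic in the number of points, and not the efficient matching procedure the abstract refers to. The function names and the choice to average (rather than sum) the nearest-neighbor distances are assumptions made for illustration.

```python
import math


def directed_chamfer(src, dst):
    """Mean distance from each point in src to its nearest point in dst."""
    return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)


def chamfer_distance(a, b):
    """Symmetric chamfer distance: sum of the two directed terms."""
    return directed_chamfer(a, b) + directed_chamfer(b, a)


# Usage: two small 2D point sets that differ in one point.
model_points = [(0.0, 0.0), (1.0, 0.0)]
image_points = [(0.0, 0.0), (1.0, 1.0)]
distance = chamfer_distance(model_points, image_points)
```

A distance of zero means every point in each set coincides with some point in the other, so lower values indicate a better shape match.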