
Abstract and Figures

In this paper, we present a systematic literature review concerning 3D object recognition and classification. We cover articles published between 2006 and 2016 available in three scientific databases (ScienceDirect, IEEE Xplore and ACM), using the methodology for systematic review proposed by Kitchenham. Based on this methodology, we used tags and exclusion criteria to select papers about the topic under study. After selecting the works, we applied a categorization process to group similar object representation types, analyzing the steps applied for object recognition, the tests and evaluations performed and the databases used. Lastly, we condensed all the obtained information into a general overview and presented future prospects for the area. Link for the publication:
Pattern Analysis and Applications (2019) 22:1243–1292
3D object recognition andclassication: asystematic literature review
L.E.Carvalho1,2 · A.vonWangenheim1,2
Received: 3 October 2017 / Accepted: 14 February 2019 / Published online: 27 February 2019
© Springer-Verlag London Ltd., part of Springer Nature 2019
Keywords 3D object recognition · 3D object classification · 3D object representation · Systematic literature review
1 Introduction
The 3D object classification and recognition area has, in the
last few years, experienced growth driven by the popularization
of 3D sensors and the increased availability of 3D object
databases [41]. The methods developed for this purpose are
applied in many domains: robotics, where they assist robot
movement in an environment and object manipulation; security,
where the techniques are employed to detect dangerous objects;
and general 3D object detection, recognizing, for example,
objects, faces and ears.
The seminal works on 3D/multi-view object recognition and pose
estimation date from the 1980s and early 1990s
[22, 85, 173, 188, 203, 251, 293] and are considered by some
authors [48] the foundation of modern object recognition. A
detailed history of 3D object recognition can be found in the
book Computer Vision: Detection, Recognition and Reconstruction
[48]. This book contains not only a history of works aimed at
solving 3D object recognition problems, but also a selection of
articles covering some of the talks and tutorials held during
the first two editions of ICVSS (the International Computer
Vision Summer School) on topics such as recognition,
registration and reconstruction. Each chapter provides an
overview of these challenging topics with key references to the
literature up to 2009.
In order to provide a panorama of the area, identifying 3D
object classification and recognition methods and the rep-
resentation types or descriptions employed by those meth-
ods, we performed a systematic literature review. For this
purpose, we employed Kitchenham’s methodology [140]
for systematic literature review, which is a well-established
The objectives of this paper are to: (1) present the meth-
odology applied for the present systematic literature review
and (2) perform an overall analysis aiming to identify the
3D object representation types, the general structure used in
the analyzed works and how the evaluation and validation
were performed.
The novelty of this review is twofold. First, to the authors’
knowledge, this is the first time that Kitchenham’s methodology
has been applied to 3D object recognition. Second, unlike
previous reviews of 3D object recognition, where the search
was restricted to a specific type of method, we defined a
10-year window as our only search restriction
* L. E. Carvalho
A. von Wangenheim
1 Graduate Program inComputer Science, Federal University
ofSanta Catarina, Florianópolis, Brazil
2 Image Processing andComputer Graphics Lab, National
Brazilian Institute forDigital Convergence, Federal
University ofSanta Catarina, Florianópolis, Brazil
... Classical studies focused on handcrafted 3D object descriptors to capture the geometric essence of objects and generate a compact and uniform representation for a given 3D object [6]. For instance, Wohlkinger and Vincze [7] proposed the Ensemble of Shape Functions (ESF) descriptor to encode geometrical characteristics of an object using an ensemble of ten 64-bin histograms of angle, point-distance, and area shape functions. ...
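To make the idea of a histogram-based shape function concrete, the following is a minimal sketch of a single distance histogram with 64 bins, computed from randomly sampled point pairs. It is not the ESF descriptor itself: the function name `d2_histogram`, the number of sampled pairs, the normalization, and the toy point cloud are all assumptions chosen for illustration.

```python
import math
import random

def d2_histogram(points, n_pairs=2000, n_bins=64, seed=0):
    """Normalized histogram of distances between randomly sampled point pairs.

    Illustrative only: one distance-based shape function, not the full
    ten-histogram ensemble described in the text.
    """
    rng = random.Random(seed)
    dists = []
    for _ in range(n_pairs):
        p, q = rng.sample(points, 2)      # two distinct points
        dists.append(math.dist(p, q))     # Euclidean distance
    d_max = max(dists)
    hist = [0] * n_bins
    for d in dists:
        # Scale to [0, n_bins) and clamp so d == d_max lands in the last bin.
        idx = min(int(d / d_max * n_bins), n_bins - 1) if d_max > 0 else 0
        hist[idx] += 1
    return [count / len(dists) for count in hist]

# Toy point cloud (hypothetical data): 200 points in the unit cube.
_rng = random.Random(1)
cloud = [(_rng.random(), _rng.random(), _rng.random()) for _ in range(200)]
hist = d2_histogram(cloud)
```

Because the histogram is normalized by the largest sampled distance and by the number of pairs, the resulting 64-element vector has a fixed length regardless of how many points the object contains, which is what makes such descriptors compact and uniform.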
... Kasaei et al. [5] introduced an object descriptor called Global Orthographic Object Descriptor (GOOD), built to be robust, descriptive and efficient to compute and use. For a detailed review of various handcrafted object descriptors, we refer the reader to the comprehensive review by Carvalho and Wangenheim [6]. ...
Service robots are integrating more and more into our daily lives to help us with various tasks. In such environments, robots frequently face new objects while working in the environment and need to learn them in an open-ended fashion. Furthermore, such robots must be able to recognize a wide range of object categories. In this paper, we present a lifelong ensemble learning approach based on multiple representations to address the few-shot object recognition problem. In particular, we form ensemble methods based on deep representations and handcrafted 3D shape descriptors. To facilitate lifelong learning, each approach is equipped with a memory unit for storing and retrieving object information instantly. The proposed model is suitable for open-ended learning scenarios where the number of 3D object categories is not fixed and can grow over time. We have performed extensive sets of experiments to assess the performance of the proposed approach in offline and open-ended scenarios. For evaluation purposes, in addition to real object datasets, we generate a large synthetic household objects dataset consisting of 27000 views of 90 objects. Experimental results demonstrate the effectiveness of the proposed method on 3D object recognition tasks, as well as its superior performance over the state-of-the-art approaches. Additionally, we demonstrated the effectiveness of our approach in both simulated and real-robot settings, where the robot rapidly learned new categories from limited examples.
... According to Carvalho et al., the advantage of this method is that laser reception is faster and can be used for real-time modeling. The disadvantage is that Light Detection and Ranging (LiDAR) machines are expensive [16]. Therefore, the method is too costly for modeling narrow corridors, which does not require high accuracy. ...
Current methods for constructing three-dimensional interactive virtual reality include Photogrammetry, Point Cloud Modeling and Building Information Modeling. However, because of space-size limitations, these methods cannot effectively model confined spaces such as corridors. In this paper, a method is proposed to construct three-dimensional interactive virtual reality of corridors. The method, based on computer vision, includes data acquisition, texture stitching, path selection, and model construction. During data acquisition, we first obtain the perspective transformation parameters by pre-establishing a check field. We then correct the acquired data in real time using the perspective transformation. During texture stitching, we use feature-based or direct stitching to generate the texture, depending on the number of feature points in the images. During path selection, we identify the targets in the image and build buffers. The absolute error and root mean square error are used to measure the accuracy of the resulting model. The experimental results show that the absolute error of the model constructed in this paper is about 0.0507 to 0.1691, and the root mean square error is about 0.1203 to 0.1318.
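The two accuracy measures named in this abstract can be sketched as follows. This is a generic illustration, assuming paired lists of measured and reference values; the function names and the sample numbers are hypothetical and are not taken from the cited work.

```python
import math

def absolute_errors(measured, reference):
    """Per-sample absolute error |measured - reference|."""
    return [abs(m - r) for m, r in zip(measured, reference)]

def rmse(measured, reference):
    """Root mean square error over paired samples."""
    squared = [(m - r) ** 2 for m, r in zip(measured, reference)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical model lengths versus ground-truth lengths (same unit).
measured = [1.0, 2.0, 3.0]
reference = [1.0, 2.0, 5.0]
abs_err = absolute_errors(measured, reference)
rms = rmse(measured, reference)
```

The absolute error reports the deviation of each individual measurement, whereas the RMSE summarizes all deviations in a single value that penalizes large errors more heavily.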
... 3D object recognition has become an important research area in the 3D field [1,2], in which deep learning [3,4] and neural networks [5] have been widely used for 3D object recognition with excellent performance [6]. Deep-learning-based 3D object recognition methods can be subdivided into voxel-based [7,8], point-set-based [9,10], and view-based methods [11–19]. ...
Three-dimensional (3D) object recognition based on multiple views has been a popular area of research in recent years. Existing methods based on the grouping mechanism cannot sensibly group the views. Thus, the 3D shape descriptor that is generated by the final fusion is not representative, and the recognition accuracy still requires improvement. This study proposes a double-weighting convolutional neural network method, based on the L2-S grouping mechanism. The designed bidirectional long short-term memory module can learn the relationship between the views in detail and improve the quality of the extracted features. Further, the proposed L2-S grouping mechanism can use the L2 norm property to calculate the discrimination score of views and group views more reasonably. After reasonable grouping, weighted fusion operations are used within and between groups to fuse features to obtain group-level descriptors that better represent each group of views. Finally, compact 3D shape descriptors are generated from equally important group-level descriptors for 3D object recognition. Results of the experiments show that our method can achieve state-of-the-art performance. The source code is available at
... The automated classification of three-dimensional objects is a fundamental research subject that has been widely addressed in recent decades [3]. It allows 3D parts to be labeled by assigning them to well-known categories of parts, and their semantic meaning to be enriched with additional information or even associated with a precise functionality. ...
The work presented here is part of a wider research project aimed at extracting high-level semantic information from 3D product models described by means of their boundary representation (B-rep) and using it in industrial applications. The specific focus of the paper is recognizing, among the components of the CAD model of an assembly, those belonging to certain categories of standard parts widely employed in the mechanical industry. Knowledge of these components is crucial to understanding the structure of mechanical products, as they have specific meanings and functions. Standard parts follow international standards in shape and dimensions, and typical mounting schemes exist for their use in product assemblies. These distinctive features have been exploited as a starting point to develop a multi-step recognition algorithm. It includes a shape-based and a context-based analysis, both relying on the geometric and topological analysis of a CAD model. As Voelcker anticipated, the shape of an object alone is not enough to understand its function. Therefore, context assessment becomes crucial to validate the recognition given by the shape-based step. It allows components in mechanical CAD models to be uniquely recognized by confirming correct results, rejecting false positives, and choosing the correct category when multiple assignments are possible.
Radiography is one of the most widely used imaging techniques in the world. Since its inception, it has continued to evolve, leading to the development of intelligent and automated radiography systems that are able to perceive parts of their environment and respond accordingly. However, such systems do not provide a complete view of the examination space and are therefore unable to detect multiple objects and fully ensure the safety of patients, staff and equipment during the execution of the movement. In this paper, we present a system architecture based on ROS (Robot Operating System) to solve these challenges and integrate an autonomous X-ray device. The architecture retrieves point clouds from range sensors placed at specific locations in the examination room. By integrating different subsystems, the architecture merges the data from the different sensors to map the space. It also implements downsampling and clustering methods to identify objects and later distinguish obstacles. A subsystem generates bounding boxes based on the detected obstacles and feeds them to a motion planning framework (MoveIt!) to enable collision avoidance during motion execution. At the same time, a subsystem implements a deep neural network model (PointNet) to classify the detected obstacles. Finally, the developed system architecture provided promising results after being deployed in a Gazebo simulated examination space and on a use case test platform.
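The abstract above mentions downsampling point clouds before clustering but does not say which method is used. As one common possibility, a voxel-grid downsampling step can be sketched as below; the choice of voxel-grid averaging, the function name `voxel_downsample`, and the sample points are assumptions for illustration only.

```python
from collections import defaultdict

def voxel_downsample(points, voxel_size):
    """Replace all points falling in the same cubic voxel by their centroid."""
    buckets = defaultdict(list)
    for p in points:
        # Integer voxel index along each axis (floor division handles negatives).
        key = tuple(int(c // voxel_size) for c in p)
        buckets[key].append(p)
    # One centroid per occupied voxel, in order of first occurrence.
    return [
        tuple(sum(axis) / len(group) for axis in zip(*group))
        for group in buckets.values()
    ]

# Hypothetical points: two share a voxel, one lies in a neighbouring voxel.
points = [(0.1, 0.1, 0.1), (0.2, 0.2, 0.2), (1.5, 0.1, 0.1)]
reduced = voxel_downsample(points, voxel_size=1.0)
```

A larger `voxel_size` yields fewer output points and faster downstream clustering, at the cost of geometric detail.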
Recognizing objects by touch is a very useful skill for robots to be employed in both structured and unstructured environments. While in some applications it is useful to recognize an object from a single touch, in other scenarios specific robot movements can be used to obtain more information about the object, making recognition easier. In this letter, we show how this can be obtained through the combination of: (i) a recently developed tactile sensor that measures both normal and shear forces on multiple contact points, and (ii) an exploratory procedure that involves dynamic shaking of the gripped object. We compare the recognition accuracy in three conditions: static (i.e. single touch), short dynamic (i.e. using a small fraction of the exploratory procedure), and dynamic (i.e. using the entire exploratory procedure). We report experiments with six different machine learning techniques, and several combinations of tactile features, to recognize ten objects. Overall, our results demonstrate that: (i) the sensor we use is well suited for recognizing grasped objects with high accuracy, and (ii) the dynamic exploratory procedure provides a 38% improvement over single touch recognition. We make our data and code publicly available, to encourage reproduction of our results.
Recognition of an object from a point cloud, image or video is an important task in computer vision which plays a crucial role in many real-world applications. The challenges involved in object recognition, which aims at locating object instances from a large number of predefined categories in collections (images, video or model library), are multi-model, multi-pose, complicated background, occlusion, and depth variations. In the past few years, numerous methods have been developed to tackle these challenges and have reported remarkable progress for 3D objects. However, suitable methods of object recognition are needed to achieve added value in the built environment. Suitable acquisition methods are also necessary to compensate for the impact of darkness, dirt, and occlusion. This chapter provides a comprehensive overview of the recent advances in 3D object recognition of indoor objects using Convolutional Neural Networks (CNN). The methodology for object recognition, approaches for point cloud generation, and test bases are presented. A comparison of the main recognition methods, based on geometric shape descriptors and supervised learning, and their strengths and weaknesses is also included. The focus lies on the specific requirements and constraints in an industrial environment, such as tight assembly, light, dirt, occlusion, or incomplete data sets. Finally, a recommendation for the use of an existing CNN framework for implementing an automatic object recognition procedure is given.
A three-dimensional digital representation, as the Digital Twin of machines and objects within production plants, is becoming increasingly important for efficient planning and documentation of production. During the acquisition of the images, the two areas of accuracy and detail are key. Accuracy describes the precision with which the digital twin represents the real objects, for example, how well the dimensions of a machine within a reconstructed model match those of its counterpart in reality. Detail refers to the level of detail in the digital model: should the door of a machine tool be distinguishable from the machine or not? Two basic technologies have been established for scanning. The photogrammetry method creates a dense point cloud based on images using a sophisticated toolchain. The laser method first measures an accurate model of the environment using a laser scanner. In a second step, the color information is assigned to the measured points via the evaluation of photos. For certain applications, models with the highest accuracy and detail are not necessary; in these cases, speed and practicability of the acquisition are the primary concern. Due to recent advancements in photogrammetry, these methods are best suited to generating a digital twin for production facilities. An overview of the methods available as well as the underlying principles is presented. Practical considerations and examples show the feasibility and results of different photogrammetry approaches, resulting in the presentation of a photogrammetric system well suited to the task of creating a digital twin of a production facility.
Classification of 3D objects, the selection of a category to which each object belongs, is of great interest in the field of machine learning. Numerous researchers use deep neural networks to address this problem, altering the network architecture and representation of the 3D shape used as an input. To investigate the effectiveness of their approaches, we first conduct an extensive survey of existing methods and identify common ideas by which we categorize them into a taxonomy. Second, we evaluate 11 selected classification networks on two 3D object datasets, extending the evaluation to a larger dataset on which most of the selected approaches have not been tested yet. For this, we provide a framework for converting shapes from common 3D mesh formats into formats native to each network, and for training and evaluating different classification approaches on this data. Despite being generally unable to reach the accuracies reported in the original papers, we compare the relative performance of the approaches as well as their performance when changing datasets as the only variable to provide valuable insights into performance on different kinds of data. We make our code available to simplify running training experiments with multiple neural networks with different prerequisites.
This paper presents a novel, fast and highly accurate 3-D registration algorithm. The ICP (Iterative Closest Point) algorithm converges all the 3-D data points of two data sets to the best-matching points with minimum evaluation values. This algorithm is in widespread use because it has good validity for many applications, but it incurs a heavy computational cost and is very sensitive to error. This is because it uses all the data points of the two data sets and least mean square optimization. We previously proposed the M-ICP algorithm, which uses M-estimation to make the original ICP algorithm robust against outlying gross noise. In this paper, we propose a novel algorithm called HM-ICP (Hierarchical M-ICP), which is an extension of M-ICP that selects regions for matching and searches the selected regions hierarchically. This method selects regions by evaluating the variance of distance values in the target region, and homogeneous topological mapping. Some fundamental experiments using real data sets of 3-D measurement demonstrate the effectiveness of the proposed method, achieving a reduction of more than ten thousand times in computational cost. We also confirmed an error of less than 0.1% for the measurement distance.
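The basic ICP loop referred to in this abstract alternates between matching each point to its closest counterpart and solving a least-squares rigid alignment for those matches. The sketch below shows that plain loop only; it is not M-ICP or HM-ICP. To stay short and dependency-free it works in 2-D with brute-force matching, whereas the cited work operates on 3-D data. All function names and the test points are assumptions for illustration.

```python
import math

def closest_point(p, targets):
    """Return the target point nearest to p (brute-force search)."""
    return min(targets, key=lambda t: (t[0] - p[0]) ** 2 + (t[1] - p[1]) ** 2)

def fit_rigid_2d(src, dst):
    """Least-squares rotation angle and translation mapping paired src onto dst."""
    n = len(src)
    sx = sum(p[0] for p in src) / n
    sy = sum(p[1] for p in src) / n
    dx = sum(p[0] for p in dst) / n
    dy = sum(p[1] for p in dst) / n
    dot = cross = 0.0
    for (x1, y1), (x2, y2) in zip(src, dst):
        ax, ay = x1 - sx, y1 - sy          # centred source point
        bx, by = x2 - dx, y2 - dy          # centred destination point
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx
    theta = math.atan2(cross, dot)         # optimal rotation angle
    c, s = math.cos(theta), math.sin(theta)
    return theta, dx - (c * sx - s * sy), dy - (s * sx + c * sy)

def transform(points, theta, tx, ty):
    """Rotate by theta about the origin, then translate by (tx, ty)."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]

def icp_2d(source, target, iterations=10):
    """Plain ICP: match to closest points, fit a rigid motion, apply, repeat."""
    current = list(source)
    for _ in range(iterations):
        matches = [closest_point(p, target) for p in current]
        theta, tx, ty = fit_rigid_2d(current, matches)
        current = transform(current, theta, tx, ty)
    return current

# Hypothetical data: the source is a slightly rotated and shifted copy of the target.
target = [(0.0, 0.0), (2.0, 0.0), (4.0, 1.0), (1.0, 3.0), (3.0, 4.0), (0.0, 5.0)]
source = transform(target, 0.05, 0.1, -0.05)
aligned = icp_2d(source, target)
residual = max(math.dist(a, t) for a, t in zip(aligned, target))
```

Every iteration matches every source point against every target point and gives all matches equal weight in the squared-error fit, which illustrates why the abstract describes the basic algorithm as computationally heavy and sensitive to outliers.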
Dimensionality reduction is the process by which a set of data points in a higher dimensional space is mapped to a lower dimension while maintaining certain properties of these points relative to each other. One important property is the preservation of the three angles formed by a triangle consisting of three neighboring points in the high dimensional space. If this property is maintained for those same points in the lower dimensional embedding then the result is a conformal map. However, many of the commonly used nonlinear dimensionality reduction techniques, such as Locally Linear Embedding (LLE) or Laplacian Eigenmaps (LEM), do not produce conformal maps. Post-processing techniques formulated as instances of semi-definite programming (SDP) problems can be applied to the output of either LLE or LEM to produce a conformal map. However, the effectiveness of this approach is limited by the computational complexity of SDP solvers. This paper will propose an alternative post-processing algorithm that produces a conformal map but does not require a solution to an SDP problem and so is more computationally efficient, thus allowing it to be applied to a wider selection of datasets. Using this alternative solution, the paper will also propose a new algorithm for 3D object classification. An interesting feature of the 3D classification algorithm is that it is invariant to the scale and the orientation of the surface.
As autonomous robots expand into the service domain, new solutions to the challenge of operating in domestic environments must be developed. Widespread adoption of service robots demands high robustness to environmental change and operational wear, and minimal reliance on application specific knowledge. As such, rich sensing modalities such as vision will play a central role in their success. This book takes steps towards the realization of domestic robots by presenting an integrated systems view of computer vision and robotics, covering fundamental topics including optimal sensor design, visual servoing, 3D object modelling and recognition, and multi-cue tracking, with a solid emphasis on robustness throughout. With in-depth treatment of both theory and implementation, extensive experimental results and comprehensive multimedia support including video clips, VRML data, C++ code and lecture slides, this book has wide appeal to both theoretical and practical roboticists and stands as a valuable teaching resource.
Conference Paper
This paper presents a novel approach for detecting affine invariant interest points. Our method can deal with significant affine transformations including large scale changes. Such transformations introduce significant changes in the point location as well as in the scale and the shape of the neighbourhood of an interest point. Our approach allows these problems to be solved simultaneously. It is based on three key ideas: 1) The second moment matrix computed at a point can be used to normalize a region in an affine invariant way (skew and stretch). 2) The scale of the local structure is indicated by local extrema of normalized derivatives over scale. 3) An affine-adapted Harris detector determines the location of interest points. A multi-scale version of this detector is used for initialization. An iterative algorithm then modifies location, scale and neighbourhood of each point and converges to affine invariant points. For matching and recognition, the image is characterized by a set of affine invariant points; the affine transformation associated with each point allows the computation of an affine invariant descriptor which is also invariant to affine illumination changes. A quantitative comparison of our detector with existing ones shows a significant improvement in the presence of large affine deformations. Experimental results for wide baseline matching show an excellent performance in the presence of large perspective transformations including significant scale changes. Results for recognition are very good for a database with more than 5000 images.
Conference Paper
Consistency of image edge filtering is of prime importance for 3D interpretation of image sequences using feature tracking algorithms. To cater for image regions containing texture and isolated features, a combined corner and edge detector based on the local auto-correlation function is utilised, and it is shown to perform with good consistency on natural imagery.
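A detector built on the local auto-correlation function is commonly implemented by summing products of image gradients over a window and scoring each pixel with det(M) - k * trace(M)^2, where M is the resulting 2x2 matrix. The sketch below follows that common formulation under simplifying assumptions: central-difference gradients, an unweighted 3x3 window instead of a smooth weighting, and k = 0.04 as a conventional value. The function name and the synthetic image are hypothetical.

```python
def harris_response(img, k=0.04):
    """Corner/edge response per pixel; positive at corners, negative on edges."""
    h, w = len(img), len(img[0])
    ix = [[0.0] * w for _ in range(h)]
    iy = [[0.0] * w for _ in range(h)]
    # Central-difference gradients; border pixels keep a zero gradient.
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            ix[y][x] = (img[y][x + 1] - img[y][x - 1]) / 2.0
            iy[y][x] = (img[y + 1][x] - img[y - 1][x]) / 2.0
    resp = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            a = b = c = 0.0
            # Accumulate gradient products over an unweighted 3x3 window.
            for v in range(y - 1, y + 2):
                for u in range(x - 1, x + 2):
                    a += ix[v][u] ** 2
                    b += iy[v][u] ** 2
                    c += ix[v][u] * iy[v][u]
            # det(M) - k * trace(M)^2 for M = [[a, c], [c, b]].
            resp[y][x] = (a * b - c * c) - k * (a + b) ** 2
    return resp

# Synthetic 9x9 image: bright lower-right quadrant with its corner at (x=4, y=4).
img = [[1.0 if (x >= 4 and y >= 4) else 0.0 for x in range(9)] for y in range(9)]
resp = harris_response(img)
```

On this synthetic image the response is positive at the quadrant corner `resp[4][4]`, negative along the straight vertical edge at `resp[6][4]`, and zero in the flat region at `resp[1][1]`, which is the corner/edge/flat distinction such a combined detector relies on.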
Image category recognition is important to access visual information on the level of objects and scene types. So far, intensity-based descriptors have been widely used for feature extraction at salient points. To increase illumination invariance and discriminative power, color descriptors have been proposed. Because many different descriptors exist, a structured overview is required of color invariant descriptors in the context of image category recognition. Therefore, this paper studies the invariance properties and the distinctiveness of color descriptors (software to compute the color descriptors from this paper is available from in a structured way. The analytical invariance properties of color descriptors are explored, using a taxonomy based on invariance properties with respect to photometric transformations, and tested experimentally using a data set with known illumination conditions. In addition, the distinctiveness of color descriptors is assessed experimentally using two benchmarks, one from the image domain and one from the video domain. From the theoretical and experimental results, it can be derived that invariance to light intensity changes and light color changes affects category recognition. The results further reveal that, for light intensity shifts, the usefulness of invariance is category-specific. Overall, when choosing a single descriptor and no prior knowledge about the data set and object and scene categories is available, the OpponentSIFT is recommended. Furthermore, a combined set of color descriptors outperforms intensity-based SIFT and improves category recognition by 8 percent on the PASCAL VOC 2007 and by 7 percent on the Mediamill Challenge.