Measurement Science and Technology
Meas. Sci. Technol. 36 (2025) 045010 (14pp) https://doi.org/10.1088/1361-6501/adc1ea
Robust multi-view PPF-based method
for multi-instance pose estimation
Huakai Zhao1,2,3, Yuning Gao1, Mo Wu2, Caibo Hu2, Shitian Zhang4 and Jiansheng Li1,∗
1School of Geospatial Information, Information Engineering University, with Collaborative Innovation
Center of Geo-Information Technology for Smart Central Plains, and with Key Laboratory of
Spatiotemporal Perception and Intelligent Processing, Zhengzhou, Henan 450001, People’s Republic of
China
2Beijing Satellite Navigation Center, Beijing 100094, People’s Republic of China
3State Key Laboratory of Spatial Datum, Xi’an 710054, People’s Republic of China
4China Research Institute of Radio Wave Propagation Qingdao Research Center, Qingdao 266107,
People’s Republic of China
E-mail: ljs2021@vip.henu.edu.cn and zhaohuakaikai@163.com
Received 28 October 2024, revised 8 February 2025
Accepted for publication 18 March 2025
Published 28 March 2025
Abstract
Multi-instance 6D pose estimation is a fundamental task in processing depth images and point
cloud data for industrial robots and automation. This task is often hindered by challenges such
as a high number of pseudo outliers, instance occlusions, and low overlap between instances and
models. The Point Pair Feature (PPF) is a widely recognized concept for addressing
multi-instance pose estimation, characterized by its lack of training requirements, robustness to
occlusion, and exceptional ease of use. However, existing PPF-based methods exhibit relatively
poor performance, particularly when compared to machine learning-based approaches. In this
paper, we propose a robust multi-view PPF-based method that specifically addresses the
challenges of generating multi-view models and enhancing model generalization. Additionally,
we introduce a comprehensive usage framework for multi-view models. This framework
incorporates background removal and scene segmentation for preprocessing, a multi-view
PPF-based approach for primary computation, and a multi-instance spatial structure to eliminate
erroneous results during post-processing. When evaluated on the ITODD datasets from the BOP
Challenge, our method achieves the SOTA performance among traditional methods for 3D point
cloud data, with an average recall of 69.6%. These results demonstrate the following: (1) a
signicant performance improvement of 44.7% compared to the leading conventional method,
and (2) performance that is nearly equivalent to that of the leading machine learning method.
These results underscore the robustness and effectiveness of our method in advancing
multi-instance pose estimation for industrial applications.
Keywords: multi-view PPF, multiple instances, pose estimation, point pair features, PPF
∗ Author to whom any correspondence should be addressed.
Original content from this work may be used under the
terms of the Creative Commons Attribution 4.0 licence. Any
further distribution of this work must maintain attribution to the author(s) and
the title of the work, journal citation and DOI.
© 2025 The Author(s). Published by IOP Publishing Ltd
1. Introduction
1.1. Background
For object detection, it is essential to align point clouds
obtained at different instances. Once the alignment is com-
plete, we can determine a transformation that includes both
3D translation and 3D rotation from one instance to another.
This process, applicable to depth images and point clouds, is
referred to as point cloud alignment or registration, commonly
known as 6D pose estimation.
6D pose estimation is a fundamental challenge in pro-
cessing depth images and 3D point cloud data. This process
involves aligning two point clouds: one serves as the template,
referred to as the model, while the other, which is aligned to
it, is called the scene. A scene may contain multiple models,
which we refer to as instances. If the points in the scene can be accurately aligned with the model, they are classified as inliers; otherwise, they are considered outliers.
Unlike pairwise pose estimation, which primarily focuses
on a single model and a single instance within a scene, multi-
instance pose estimation is more common in real-world scen-
arios and introduces greater challenges for object detection.
This task involves identifying multiple instances of various
models within a scene. As illustrated in figure 1, subfigure (a) depicts a scene containing multiple instances of a single model, while subfigure (b) illustrates a scene with multiple instances of different models.
Multi-instance pose estimation presents a significant challenge in robot grasping tasks and various other similar scenarios. Several key issues arise in this context: (1) Compared to pairwise pose estimation, this task encounters a higher number of outliers [1], which include genuine outliers from the background as well as pseudo outliers from other instances. (2) Occlusion from other instances often obscures parts of objects, leading to missing points and feature confusion, which complicates accurate pose estimation. (3) Low overlap between the model and the scene, a frequent issue in many scenes, diminishes the performance of many existing methods. (4) Identifying the optimal model and determining the number of instances present in the scene require extensive computational resources, making real-time processing particularly challenging.
1.2. Related work
Current methods for addressing the above-mentioned chal-
lenges mainly include RANSAC-based methods, clustering-
based methods, machine learning-based methods, and
template-based methods.
RANSAC-Based Methods: RANSAC [2] is a fundamental robust model-fitting method that has inspired numerous widely
used variants, such as Sequential RANSAC [3], T-Linkage
[4], Graph-Cut RANSAC [5], Progressive-X [6], Progressive-
X+ [7], etc. These methods have been widely used for point
cloud registration. However, they rely heavily on hypothesis
sampling, requiring a high number of iterations to achieve reliable results. As the number of models or the proportion of outliers increases, their efficiency and robustness significantly diminish.
Figure 1. Scenarios for multi-instance pose estimation. (a) Scenario for multi-instance pose estimation of one model. (b) Scenario for multi-instance pose estimation of different models.
Clustering-Based Methods: Considering spatial structure, efficient clustering methods have been proposed for solving pairwise point cloud alignment. Notable works include: In 2020, Yang et al [8] introduced a fast, provable algorithm called TEASER for aligning two sets of 3D points in the presence of a significant number of outliers. This method performs very well in pairwise point cloud alignment. Kluger et al [9] introduced the conditional sample consensus algorithm for fitting multiple parametric models of consistent form to noisy measurements. This method can be trained under both supervision and self-supervision. Recently, Zhang et al [10] introduced a maximal cliques method based on geometric consistency to effectively improve the alignment accuracy. This method relaxes the previous maximum clique constraint and extracts more local consistency information in the graph, thereby enhancing the performance of deep learning algorithms.
Spatial structure is a crucial feature for filtering outliers. Although the aforementioned methods can efficiently and accurately address pairwise point cloud alignment, their performance in multi-instance pose estimation is limited. Challenges such as pseudo outliers, occlusion from other instances, and low overlap remain unresolved.
Machine Learning-Based Methods: Recently, machine learning methods based on the Hough transform [11, 12] and point pair features (PPF) [13] have been developed. For example, Charles et al proposed a 3D object detection model called VoteNet [14], which relies on the Hough voting mechanism. Similarly, an unsupervised learning method named PPF-FoldNet [15] was developed to generate 3D local descriptors through a folding-based auto-encoding of PPFs. Recent advancements in machine learning have integrated 2D images and point cloud data for comprehensive pose estimation, achieving excellent results [16–22]. Although machine
learning-based methods have demonstrated commendable performance, they lack a universal model and often require pre-training for specific tasks. This limitation presents challenges in certain scenarios, particularly in field operations and on edge computing devices. Given these challenges, this article primarily focuses on the performance of template-based methods.
Template-Based Methods: The template-based method is a valid approach for multi-instance pose estimation, where the Hough transform combined with PPF [13] is the first popular method and is still widely used today. This PPF-based approach effectively addresses issues such as a huge number of missing points and a high proportion of outliers, making it inherently suitable for multi-instance scenarios. In the BOP challenge [23–25], up until 2019, PPF-based methods consistently outperformed machine learning-based methods. PPF-based methods are considered the most suitable for multi-instance pose estimation when dealing with non-RGB data or data without training. Recently, Cui et al [26] utilized curvature information from point pairs to enhance PPF, thereby improving point cloud matching accuracy. Ge et al [27] proposed a keypoint PPF (K-PPF) voting approach, which includes an improved algorithm for sample points through a combination of curvature-adaptive and grid ISS [28]. Despite their primary focus on optimizing PPF descriptors for performance enhancement, these methods exhibit substantial limitations when applied to real-world multi-instance scenarios, specifically in overcoming challenges associated with perspective constraints, occlusion, and reflection.
In the domains of industrial robotics and automated grasping, PPF-based methods continue to dominate the field. This predominance can be attributed to two fundamental factors: (1) While PPF methods may demonstrate inferior theoretical performance compared to machine learning approaches, their practical performance in specific applications remains competitive. Notably, PPF methods that incorporate joint optimization of scene parameters and model parameters can yield remarkable results; (2) PPF methods offer inherent advantages in terms of simplicity and efficiency, enabling rapid implementation of matching processes without requiring extensive training procedures. These compelling advantages motivate our continued investigation into PPF-based methods, with the objective of achieving substantial technological breakthroughs in PPF applications.
1.3. PPF voting approach
The previous text introduced PPF as an effective multi-
instance point cloud matching method that does not require
pre-training. To further elaborate, we provide a detailed intro-
duction to the PPF voting approach [13].
Figure 2. Point pair feature definition.
The PPF calculates the characteristics of a pair of points, specifically the features between two oriented points $p_i$ and $p_j$ with normal vectors $n_i$ and $n_j$. As illustrated in figure 2, for these two oriented points $p_i$ and $p_j$, the PPF can be defined as

$$F_{ij} = \left(\|d_{ij}\|, \angle(n_i, d_{ij}), \angle(n_j, d_{ij}), \angle(n_i, n_j)\right), \qquad (1)$$

where $d_{ij} = p_i - p_j$, $\angle(n_i, n_j)$ represents the angle between the two corresponding normal vectors, and $\angle(n_i, d_{ij})$ and $\angle(n_j, d_{ij})$ represent the angles between each normal vector and $d_{ij}$, respectively. From equation (1), we can observe that the PPF can be represented as a feature vector. The first element represents a distance, while the subsequent three elements indicate angles. Any two points can generate a feature vector.
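As a concrete illustration, the following Python sketch computes the feature vector of equation (1) for a single point pair; the function name, the numpy-based implementation, and the assumption of unit-length normals are ours and are not part of the original method description.

```python
import numpy as np

def point_pair_feature(p_i, n_i, p_j, n_j):
    """Sketch of equation (1): F_ij = (||d_ij||, angle(n_i, d_ij),
    angle(n_j, d_ij), angle(n_i, n_j)) for two oriented points."""
    d = p_i - p_j                                   # d_ij = p_i - p_j
    dist = np.linalg.norm(d)
    d_unit = d / dist if dist > 0 else d

    def angle(a, b):
        # numerically safe angle between two (unit) vectors
        return np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))

    return np.array([dist,
                     angle(n_i, d_unit),
                     angle(n_j, d_unit),
                     angle(n_i, n_j)])
```

In practice the four components are discretized (for example with the angle discretization of 15 and the voxel-sized distance step reported in section 3.2) so that point pairs from the scene can be looked up against the model's feature table.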
By establishing the relationship between the feature vectors
derived from the model and the scene, we can infer that points
in the scene correspond to points in the model. This allows us
to obtain point-to-point pose transformations, which can sub-
sequently be compared to identify the optimal transformation.
In detail, if an oriented point $p_i$ in the scene aligns with a specific point in the model, five degrees of freedom (three translations and two rotations) are constrained. The final rotational degree of freedom is determined by a high number of votes on the PPFs of $p_i$ and $p_k$, $k = 1, 2, \cdots, K$. Here, $p_k$ represents points within a specific radius (usually the model's diameter) centered at $p_i$, and $K$ is not a fixed number, with a magnitude on the order of $N^2$. A key assumption is that the alignment accuracy increases with the similarity of the PPF between a scene point with its surroundings and a model point with its surroundings. The number of features for a model $P$ is proportional to $N^2$. The overall computation amount is $O(MN^3)$. Based on the above analysis, it is evident that this method involves a significant amount of computation, and the local computational effort varies. Downsampling and limiting the number of points can help reduce computational effort.
The PPF-based approach ultimately utilizes a comparison-based pose clustering method to determine the final pose,
which is a straightforward technique. This method performs
well when the candidate set is relatively small but loses
effectiveness as the number of candidate poses increases.
Additionally, PPF-based methods rely heavily on the quality
of the model, requiring a well-constructed model to achieve
optimal matching performance.
1.4. Our work
PPF-based methods demonstrate significant practicality in
addressing multi-instance point cloud matching challenges.
However, in real-world applications, the effectiveness of
Figure 3. The differences between a model and its instances. Subfigure (a) represents the complete model. Subfigure (b) illustrates various submodels from different perspectives.
multi-instance pose estimation is frequently limited by incom-
plete point clouds, which may arise from factors such as view-
point restrictions, reflections, and occlusions. Our research
suggests that the performance of PPF methods is closely
related to the sub-models derived from various viewpoints.
Therefore, investigating multi-view models is an effective
strategy to improve the performance of PPF-based methods,
as discussed in section 2.1.
Since multi-view models generate more results and cannot
be directly matched like single models, we designed a frame-
work to effectively integrate multi-view models. This frame-
work integrates a comprehensive pipeline that encompasses
three key components: (1) preprocessing through background
removal and point cloud segmentation techniques, (2) core
computation employing a multi-view PPF-based approach,
and (3) post-processing utilizing a spatial structure mechan-
ism for effective elimination of erroneous poses. As a result,
we achieved considerable performance improvements over the
previous SOTA approach, which is also a PPF-based method.
Specically, on the ITODD depth (point cloud) data from
the BOP Challenge, our method achieved an average recall
of 69.6%, representing a signicant performance increase of
44.7% compared to the previous best conventional method.
The structure of the subsequent sections is outlined as follows: section 2 introduces the multi-view PPF-based method. Section 3 describes the experiments and their results. Section 4 provides a discussion, and section 5 concludes this work and outlines future research directions.
2. Multi-view PPF-based method
This section introduces the multi-view PPF-based method,
detailing the fundamental models and concepts behind it.
It also emphasizes the significance of multi-view models and presents a method for generating multi-view models. Additionally, this section addresses the issue of model generalization.
2.1. Multi-view models and concepts
The model can be denoted as $P = \{p_i \in \mathbb{R}^3 \mid i = 1, 2, \cdots, N\}$, and the scene can be denoted as $Q = \{q_j \in \mathbb{R}^3 \mid j = 1, 2, \cdots, M\}$, where $N$ and $M$ represent the number of points in model $P$ and scene $Q$, respectively. Unless specifically stated, the meanings of all symbols in this paper remain unchanged after their initial definition.
In multi-instance and single-model pose estimation scenarios, assuming there are $R$ instances of a single model, the scene $Q$ can be divided into $R+1$ subsets: $Q_0, Q_1, Q_2, \cdots, Q_R$. Here we have $Q = Q_0 \cup Q_1 \cup Q_2 \cup \cdots \cup Q_R$, where $Q_0$ represents the background points (real outliers) and $Q_r$ ($1 \leqslant r \leqslant R$) represents the desired instance aligned to model $P$.
As for multi-instance and multi-model scenarios, there are multiple models, denoted as $P_m$, $m = 1, 2, \cdots, I$, where $I$ denotes the total number of models. Then we have $Q = \bigcup_{m=1}^{I}\left(\bigcup_{v=1}^{V_m} Q_{mv}\right) \cup Q_0$, where $V_m$ represents the number of instances in the scene for the $m$th model (i.e. $P_m$), and $Q_{mv}$ represents the $v$th instance of model $P_m$.
As previously mentioned, the part of the scene that aligns with model $P_m$ is $Q_{mv}$, and it tends to differ significantly from model $P_m$. As shown in figure 3, we observe a comparison between a model and its instances in a scene. These instances, originating from diverse perspectives, exhibit considerable variations. It is even challenging to determine whether these instances belong to the same model. This highlights the importance of incorporating multiple views to enhance the robustness and accuracy of pose estimation, as relying on a single view may not fully capture the variations introduced by different perspectives.
In detail, the diverse perspectives present a significant common challenge in multi-instance pose estimation. In a captured
scene, a complete model is often unavailable. Instead, multiple
stacked instances are obtained from different viewpoints. As
previously mentioned, the variations between instances can be
considerable, making it impractical to use a complete model to
identify an incomplete instance. Furthermore, when employ-
ing a complete model for pose estimation, the complexity of
certain models can be high due to an excessive number of
points after downsampling, or they may omit important fea-
tures due to a reduced number of points. Given these chal-
lenges, the question arises: what type of model is most suit-
able for describing instances within a scene? The answer lies in
considering the instances themselves as models. Specifically,
segments of the model derived from various perspectives
should be utilized as the model. These incomplete instances,
observed from different perspectives, are often referred to as
multi-view models. When integrated with multi-view models,
PPF-based methods are specifically termed multi-view PPF-
based methods.
Furthermore, while the PPF-based method inherently
addresses missing points, the multi-view PPF-based method
remains crucial for enhancing performance. (1) As previously
mentioned, multi-view models can accurately characterize the
features of the instances in a scene, which facilitates accept-
ing the correct results and rejecting the incorrect ones. (2) For
complicated models, the large number of features can significantly reduce computational efficiency. Downsampling can
serve as a compromise, but it may obscure the crucial details
of the model. Conversely, submodels with fewer points can
preserve detailed local features.
2.2. Generation of multi-view models
In this subsection, we introduce an efficient process for generating multi-view models, with a focus on practical applicability in real-world pose estimation tasks. We propose an innovative approach that combines Physically Based Rendering (PBR) with real-world instance data to generate a robust multi-view model.
Regarding the generation of multi-view models, we initially attempted two approaches. The first method for generating multi-view models in PBR entails projecting a model from six fixed directions, namely upward, downward, leftward, rightward, forward, and backward. This process yields six sets of point cloud models. Although these models can be readily generated, they frequently lack practicality and appear rather clumsy. Another method in PBR for the generation of multi-view models involves rotating the viewpoints at regular intervals. This method gives rise to a substantial number of viewpoint models with extensive coverage. Nevertheless, despite its wider coverage, these models remain inapplicable for practical uses owing to multiple factors. Firstly, real-world instances, which are generally measured from actual physical objects, frequently suffer from missing points as a consequence of reflections, occlusions, or sensor limitations. This stands in stark contrast to the idealized, complete models generated via PBR. Secondly, certain submodels originating from specific viewpoints can be deceptive and might trigger inaccurate matching. For example, when observing a cylinder from the bottom, it may merely present as a circle, failing to encapsulate the full three-dimensional geometry of the model and thereby leading to substantial misinterpretation. Thirdly, some PBR models could prove unfeasible in real-world settings due to stability issues. To illustrate, on a flat surface, a cone can only maintain stability in two orientations: with its base resting on the plane or with its side in contact with the plane. Finally, the generation of identical or geometrically similar submodels from multiple perspectives, a phenomenon referred to as model duplication, can lead to computational redundancy. A representative example is the cylindrical model, where submodels generated through rotational transformations about its central axis exhibit complete indistinguishability.
Addressing these issues, we propose an efficient multi-view model generation method that requires only limited training data and real-world observations. Our method comprises three sequential phases: initial candidate model generation, followed by model refinement, and concluding with result generalization. The following sections detail the implementation and technical aspects of the first two stages.
Our approach is based on two key considerations: (1)
the instances within the scene represent measurements of
the model from a specic perspective; (2) the poses of the
instances in the scene indicate the transformation of the model
within that scene. Consequently, the points in the scene that coincide with the transformed model reflect the specific perspective of the model that we intend to capture. In other words,
the instances within the scene can be directly utilized as sub-
models, and the points of intersection between the transformed
model and the points in the scene can also serve as submod-
els. While these two methods are indistinguishable when the
instances align with the model, practical applications may
yield differing results due to factors such as model representa-
tion and measurement characteristics. Therefore, we prefer to
use the former method. This represents the first step in generating a candidate set of multi-view models.
Figure 4 illustrates the comparison between real-world submodels generated from specific perspectives and the corresponding original model from the same viewpoint. In each subfigure, the right column displays the real-world submodels, while the left column represents the original model. A notable divergence is observed among the submodels. Despite being only a fraction of the comprehensive model, these submodels effectively encapsulate the unique attributes of the model from their respective viewpoints. As a result, the application of multi-view submodels significantly enhances the accuracy of outcomes, particularly in the domain of multi-instance pose estimation. This not only supports our previous assertion that submodels are more effective in representing real-world scenarios across diverse instances but also highlights the efficacy of the aforementioned multi-view modeling approach.
Next, we need to optimize these multi-view models. Based
on the previously mentioned approaches, we can generate
Figure 4. The comparison between the submodels generated from a specific perspective and the original model is illustrated in three subfigures: (a), (b), and (c). In each subfigure, the left column represents the original model, while the right column displays instances from that specific perspective.
numerous models from various perspectives that exist in real-
ity. However, this does not fully resolve the issue of model
duplication. To address this problem, we must eliminate
duplicate and similar models. By comparing the similarity or
overlap between submodels, we can achieve this objective.
In this context, we employ the Iterative Closest Point (ICP)
method to align the models and remove those with excessive
overlap. For instance, models with an overlap exceeding 60%
can be considered identical.
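A rough Python sketch of this de-duplication step is given below: submodels are greedily kept only if their post-ICP overlap with every already retained submodel stays under the 60% threshold. The helper names (align_icp, overlap_ratio) and the distance threshold are our own assumptions; align_icp stands in for any standard ICP routine.

```python
import numpy as np
from scipy.spatial import cKDTree

def overlap_ratio(a, b, dist_thresh):
    """Fraction of points of submodel `a` that have a neighbor in `b`
    within dist_thresh; `a` and `b` are (N, 3) arrays already aligned."""
    d, _ = cKDTree(b).query(a, k=1)
    return float(np.mean(d < dist_thresh))

def deduplicate_submodels(submodels, align_icp, dist_thresh, max_overlap=0.6):
    """Greedily keep submodels whose overlap with every kept one is below
    max_overlap; align_icp(src, dst) is a hypothetical helper returning
    `src` transformed into the frame of `dst` (e.g. via a library ICP)."""
    kept = []
    for sub in submodels:
        is_duplicate = any(
            overlap_ratio(align_icp(sub, ref), ref, dist_thresh) > max_overlap
            for ref in kept)
        if not is_duplicate:
            kept.append(sub)
    return kept
```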
Through the utilization of our generated candidate model
set and subsequent optimization processes, we have estab-
lished relatively effective multi-view models. Nevertheless,
these models demonstrate significant overfitting tendencies
when deployed in real-world scenarios. The following section
addresses this problem and improves the generation method
for multi-view models.
2.3. Model generalization
Within the aforementioned framework, we have developed
a computationally efcient and methodologically straightfor-
ward approach for multi-view model generation. While this
method demonstrates considerable practicality, its perform-
ance in our experiments is unsatisfactory, particularly in terms
of the model’s generalization ability. We illustrate this using
the PPF as a foundational method. This situation is analogous
to machine learning methods that, when overfitted to a specific training set, may exhibit poor generalization capabilities,
resulting in suboptimal performance on other test datasets.
To overcome this challenge, we apply generalization techniques when utilizing multi-view models. Figure 5 presents the comparison of experimental results before and after model generalization. All pose estimation results were obtained using the previously mentioned method from a single-perspective model in figure 5(a). As shown in figure 5(b), the 36 instances within the scene appear highly similar, making it challenging
for humans to distinguish between them. Nevertheless, there
are still subtle differences among these instances. When we
use a specic instance as the model, the pose estimation for
that instance is the most accurate, but the estimations for the
other instances are less precise.
We calculate the pose estimation scores for various models, with the statistical results shown in figure 5(c). Here, the blue diamond line represents the scores for the original non-generalized model, while the red cross line represents the scores for the generalized model. These results are arranged in descending order based on their scores. Each curve includes scores for 37 results in total, with 36 correct results corresponding to the 36 instances in figure 5(b) and one incorrect result. Overall, the values on each curve are close but consistently decline, indicating that, despite the similarity among the instances within the scene, using a single instance as the model results in significant discrepancies in the scores, as previously noted. In terms of the overall differences between the two curves, the small variance in scores of the generalized model indicates its ability to generalize, demonstrating more consistent performance across different instances.
In conclusion, when utilizing multi-view models, it is
essential not only to generate these models using the methods
described in section 2.2 but also to apply generalization tech-
niques. We propose two methods for model generalization:
(1) Smoothing lter for the models. This approach is effective
because the collected point cloud data for surface matching
exactly represents the scene’s surface, and applying smooth-
ing can reduce the noise of the perspective model. (2) Model
fusion. The objective is to integrate multiple multi-view mod-
els to create a generalized model. This approach is utilized in
our work and is explained in detail below.
Figure 5. Comparison of experimental results before and after model generalization. (a) The model. (b) The matching results, with blue representing the scene and red representing the correct matching results that transform the model within the scene. (c) Relative scores for different results before and after generalization, where the first 36 counts represent correct results, and the 37th (the last one) indicates the first incorrect result. The blue diamond represents the results of the original model, while the red cross represents the results of the generalized model.
For a model $Q$ viewed from a specific perspective, its generalization process is outlined as follows. Let $Q_i$ represent the overlapping models with $Q$, where $1 \leqslant i \leqslant n$. For any point $p \in Q$, we search for a corresponding point $p_i$ in $Q_i$ that is identical to (or similar to) $p$. If no such point is found, we set $p_i = 0$. The generalized point $\tilde{p}$ is then defined as follows:

$$\tilde{p} = \frac{\sum_{i=1}^{n} p_i \alpha_i}{\sum_{i=1}^{n} \alpha_i}, \qquad (2)$$

where

$$\alpha_i = \begin{cases} 1, & p_i \neq 0, \\ 0, & p_i = 0. \end{cases} \qquad (3)$$

The generalized model is $\tilde{Q} = \{\tilde{p}\}$. Here, the calculation process is somewhat complex, so we employ the following simplified method for the calculations: first, we perform the calculations for $Q \cup (Q_1 \cup Q_2 \cup \cdots \cup Q_n)$, and then apply voxel downsampling to obtain the result $\tilde{Q}$.
2.4. Multi-view pose estimation framework
In the previous section, we discussed the generation and
generalization methods for multi-view models. This section
introduces the multi-view pose estimation framework. Our
framework consists of three main components: the primary algorithm (the multi-view PPF-based method), data preprocessing, and spatial structure-based optimization. Each component is elaborated upon in the following subsections.
2.4.1. Preprocessing. The scene preprocessing includes
background removal, scene segmentation, and other related
tasks. The purposes of this preprocessing are to: (1) eliminate
irrelevant points, thereby reducing interference in matching
and enhancing accuracy; and (2) decrease computational effort
by minimizing the number of points involved, as the PPF-
based method requires substantial computational resources.
Background Removal: Background removal refers to eliminating irrelevant parts of the scene that are not related to the model. In many datasets, background points can be the ground or a workbench, a container of a specific shape where objects are stacked, or a workspace marked by specific identifiers. Removing these can reduce the size of the scene points, minimize interference, lower computational complexity, and enhance matching efficiency and accuracy. The background removal method should be tailored to the specific characteristics of the scene.
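The paper only states that background removal should be tailored to the scene; as one common tactic, the sketch below fits the dominant plane (e.g. a ground or workbench plane) with a small RANSAC loop and discards points close to it. All parameter values and names here are illustrative assumptions.

```python
import numpy as np

def remove_dominant_plane(points, dist_thresh, iters=500, seed=0):
    """Hedged background-removal sketch: RANSAC-fit the largest plane in an
    (N, 3) point array and return only the points that do not lie on it."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(points), dtype=bool)
    for _ in range(iters):
        sample = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-9:                                   # degenerate triple
            continue
        normal /= norm
        dist = np.abs((points - sample[0]) @ normal)      # point-to-plane distance
        inliers = dist < dist_thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return points[~best_inliers]                          # keep non-background points
```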
Scene Segmentation: Scene segmentation involves splitting the
scene into multiple parts, with each part ideally containing
only one instance of the model. This can effectively reduce
matching complexity and improve matching accuracy. After
scene segmentation, the scene can be divided into many parts.
In this case, pose estimation based on multi-view PPF can
be performed within one part, which significantly reduces
computation and interference. However, it is important to
note that scene segmentation is not always effective, espe-
cially in multi-instance scenarios where it can be particularly
challenging.
2.4.2. Spatial structure-based optimization. After comput-
ing the multi-view PPF, approximately aligned results can be
generated, which require further processing and organization
to achieve the highest level of accuracy. Firstly, precise align-
ment by ICP is required. Since the results obtained by multi-
view PPF-based matching are generally accurate, the ICP can
simply utilize the basic version available in the point cloud
library [29].
Considering that many matching results are erroneous in multi-instance pose estimation due to interference and occlusion between different instances, problems such as model penetration, floating in the air, and inaccurate posture arise. These problems are easily noticeable for humans but challenging for machines and non-composite programs. To address these issues, we implement edge consistency testing, overlap testing, boundary testing, and spatial pose testing, all based on spatial structures.
Figure 6. Overall framework for multi-view PPF-based pose estimation. The overall framework comprises two main components, as
illustrated in the figure. The upper section depicts the multi-view model preparation phase, encompassing model generation and
generalization processes. The lower section represents the real-time pose estimation pipeline, consisting of three sequential stages: scene
preprocessing, multi-view PPF-based pose estimation, and spatial structure-based optimization.
Edge Consistency Testing. Edge consistency testing evaluates how well the edges of the model align with those present in the scene. This evaluation assumes that a correct pose results in a higher degree of edge overlap between the model and the instance when viewed from the same perspective. Consequently, we utilize edge detection to assess the accuracy of the matching results. This test is mainly utilized to process point cloud models with a simple structure. Some existing methods also use the edge-matching approach to align models [30], and this testing method shares a similar conceptual foundation.
Overlap Testing. Overlap testing assesses the intersection
between scene points and transformed model points based on
the acquired pose. When the scene is relatively complete, the
combination of the multi-view PPF-based method and ICP can
theoretically produce matching results with a very high over-
lap ratio. The overlap ratio is one of the most crucial indicators
for evaluating the quality of matching results and serves as an
intuitive metric.
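As an illustration, the overlap test can be sketched as below: the model is transformed by the candidate 6D pose and the fraction of its points with a nearby scene neighbor is compared against the 0.2 threshold later reported in section 3.2. The names and the use of a k-d tree are our own choices.

```python
import numpy as np
from scipy.spatial import cKDTree

def overlap_test(scene, model, pose, dist_thresh, min_ratio=0.2):
    """Transform the (N, 3) model by the 4x4 pose and measure the fraction of
    model points with a scene neighbor within dist_thresh; reject low overlap."""
    R, t = pose[:3, :3], pose[:3, 3]
    transformed = model @ R.T + t                 # model points in the scene frame
    d, _ = cKDTree(scene).query(transformed, k=1)
    ratio = float(np.mean(d < dist_thresh))
    return ratio, ratio >= min_ratio
```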
Boundary Testing. Boundary testing primarily targets mis-
aligned points, unlike overlap testing, which mainly focuses
on correctly aligned points. If a 6D pose is inaccurate or incor-
rect, parts of the model transformed into the scene may occupy
incorrect positions. Boundary testing focuses on counting the
points that are misaligned rather than those that are correctly
positioned.
Spatial Pose Testing. Spatial pose testing primarily focuses
on evaluating the stability of the model’s position within
the scene. This test assesses whether the model is floating,
partially submerged below the ground, or intersecting with
other objects. It is capable of identifying issues that may not
be detected by edge consistency testing, overlap testing, and
boundary testing.
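A minimal sketch of the below-plane part of this test is shown here, using the 10% limit given in section 3.2; checks for floating or interpenetrating instances would follow the same pattern. The names and the plane parameterization are our own assumptions.

```python
import numpy as np

def below_plane_test(transformed_model, plane_point, plane_normal, max_below=0.10):
    """Reject a pose if more than max_below of the transformed model points lie
    below the supporting plane (plane_normal points away from the ground)."""
    signed_dist = (transformed_model - plane_point) @ plane_normal
    below_ratio = float(np.mean(signed_dist < 0.0))
    return below_ratio <= max_below
```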
The overall testing framework, which is based on spatial structure, employs a check-and-reject mechanism to eliminate poses that do not meet the specified criteria. It retains only the poses that pass this validation and subsequently sorts them according to specific parameters. For relatively intact scenes, the multi-view PPF-based method can achieve satisfactory performance. However, in real-world scenarios where issues such as reflections and lack of texture result in deficiencies in the point cloud, it is crucial to utilize the aforementioned robust methods to accurately identify the correct poses. Additionally, multi-view models generate a greater number of potential poses, making it necessary to use the testing framework to further filter the correct poses.
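The check-and-reject mechanism itself reduces to a simple filter, sketched below under the assumption that each test is wrapped as a boolean callable; the structure is our own reading of the description, not code from the paper.

```python
def filter_candidate_poses(candidates, tests):
    """Keep only candidate poses that pass every spatial-structure test
    (edge consistency, overlap, boundary, spatial pose), then sort the
    survivors by score. `candidates` is a list of (pose, score) pairs."""
    kept = [(pose, score) for pose, score in candidates
            if all(test(pose) for test in tests)]
    kept.sort(key=lambda item: item[1], reverse=True)   # best-scoring pose first
    return kept
```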
Figure 6 illustrates the overall framework for multi-view PPF-based pose estimation. Before performing pose estimation, it is essential to prepare multi-view models, which involves model generation and model generalization. Once the models are ready, real-time pose estimation can be executed. For each scene, preprocessing is initially performed to eliminate background and noise points while also calculating normal vectors and conducting segmentation. Following this, multi-view PPF-based pose estimation is carried out, producing multiple pose estimates for various models. Finally, post-processing is required for the numerous matching results obtained, which includes edge consistency testing, overlap testing, boundary testing, and spatial pose testing, to acquire the correct poses.
3. Experiments
3.1. Datasets and evaluation methodology
We utilize the datasets and evaluation methodology from the BOP Challenge [23, 24], a competition organized to reflect the current state in the field of 6D object pose estimation from RGB-D images. In this study, we primarily focus on evaluating the ITODD point cloud data, which is more representative of industrial scenarios.
This dataset contains 28 models and 722 scenarios, with each scenario containing multiple instances of the models. The data was collected through scanning equipment, reflecting real-world data challenges such as reflections and occlusions. This provides an accurate reflection of the issues and complexities associated with industrial data. Therefore, this dataset is considered the most challenging in the BOP challenge and consistently yields the poorest results across various methods. Machine learning methods also perform poorly, which can be attributed to: (1) The scarcity of authentic training images accurately labeled with 6D object poses, as data annotation is an expensive and time-consuming process; and (2) The significant disparity between real and synthetic images.
The performance of our method, along with other approaches, is assessed using the BOP Challenge's performance evaluation methods, specifically focusing on the following indicators:
(1) VSD (Visible Surface Discrepancy) treats indistinguishable poses as equivalent by considering only the visible parts of the object.
(2) MSSD (Maximum Symmetry-Aware Surface Distance) considers a set of pre-identified global object symmetries and measures the surface deviations in 3D space.
(3) MSPD (Maximum Symmetry-Aware Projection Distance) considers the object symmetries and measures the perceivable deviations.
The fraction of annotated object instances for which a correct pose is estimated is referred to as recall. The average recall (AR) is defined as the mean of the recall rates. The overall accuracy of a method on a dataset D is measured by $\mathrm{AR}_D = (\mathrm{AR}_{\mathrm{VSD}} + \mathrm{AR}_{\mathrm{MSSD}} + \mathrm{AR}_{\mathrm{MSPD}})/3$, which is calculated using estimated poses of all objects from D.
3.2. Parameter settings
PPF parameters. The parameters used in PPF-based matching have a significant impact on the matching performance. While optimizing these parameters can further improve the performance, it is not the focus of this work. In our experiment, we set the PPF angle discretization value to 15, the PPF distance discretization value to the voxel size after voxel downsampling, the clustering angle discretization value to 15, and the clustering distance discretization value to 3 times the voxel size. These parameter values were selected to achieve a balance between computational efficiency and matching accuracy.
Testing parameters. During the optimization stage, the edge
consistency needs to be greater than 0.3. The overlap ratio
should be above 0.2. For boundary testing, the proportion of
points that exceed the boundaries must be less than 10%. For
spatial pose testing, instances should have no more than 10%
of their points below the plane. These thresholds help to eliminate erroneous poses and improve overall accuracy.
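For reference, the settings reported above can be collected as follows; the dictionary layout and key names are our own, and the voxel size is a placeholder that depends on the model.

```python
VOXEL_SIZE = 1.0  # placeholder: chosen per model during downsampling

PARAMS = {
    "ppf_angle_step": 15,                 # PPF angle discretization
    "ppf_dist_step": VOXEL_SIZE,          # PPF distance discretization
    "cluster_angle_step": 15,             # pose clustering angle discretization
    "cluster_dist_step": 3 * VOXEL_SIZE,  # pose clustering distance discretization
    "min_edge_consistency": 0.3,          # edge consistency must exceed this
    "min_overlap_ratio": 0.2,             # overlap ratio must exceed this
    "max_out_of_bounds": 0.10,            # max fraction of points beyond boundaries
    "max_below_plane": 0.10,              # max fraction of points below the plane
}
```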
Parameters of Multi-View Models. The generation of multi-
view models is based on the test data, which we adjust accord-
ing to their relative completeness. There is no special treatment
involved; instead, we eliminate models with lower complete-
ness. This is crucial because, in real-world scenarios, some
models may perform inadequately. As illustrated in figure 4,
certain models may fail to accurately capture the features of
the object, leading to inaccuracies in pose estimation. For
generalization, we primarily rely on the model fusion and
model downsampling techniques discussed in section 2.3. This
approach is also quite straightforward and intuitive.
3.3. Performance comparison
The device used in this study is a CPU computing unit, specifically the AMD Ryzen 7 5800X with 8 cores and 16 threads.
No GPU devices were utilized in this experiment. This study
primarily focuses on improving matching accuracy rather
than prioritizing the optimization of computational speed.
Consequently, enhancing the code implementation, leveraging
advanced devices, or employing CUDA for accelerated com-
puting can result in faster computations.
The performance comparison experiments consist of two
parts: (1) a comparison with conventional methods based on
point cloud data, which serves as the primary comparison
for this study. Given that our method significantly outperforms these conventional approaches, we further compared it
with machine learning-based methods; (2) a comparison with
machine learning methods based on RGB-D data.
3.3.1. Comparison with the conventional method. The performance comparison between our method and the latest conventional methods on ITODD point cloud data is presented in table 1. Our method achieves a remarkable 44.7% enhancement over the previous best-performing method 'SFM-TransferTech'. Additionally, the best result on this dataset had consistently remained below 0.5 for an extended period. For the first time, our method achieves a breakthrough performance score of 0.696. Notably, many parameters in our method were not meticulously fine-tuned or tailored for specific models, indicating its potential for further performance enhancement.
The experimental results demonstrate consistent and significant improvements in our method across all metrics, underscoring the reliability of our approach. Additionally, the results
Table 1. Performance Comparison of the Conventional Methods on ITODD Point Cloud Data.
Method AR ARVSD ARMSSD ARMSPD
Our Method 0.696 0.623 0.741 0.724
SFM-TransferTech [13] 0.481 0.437 0.506 0.501
XYZ-SurfaceMatching [13] 0.471 0.431 0.502 0.480
ZTE_PPF [13] 0.470 0.404 0.510 0.495
Drost-CVPR10-3D-Edges [13] 0.462 0.415 0.494 0.478
Vidal-Sensors18 [31] 0.435 0.414 0.458 0.434
Table 2. Performance Comparison of Learning-based Methods on ITODD RGB-D Data.
Data Type Method AR ARVSD ARMSSD ARMSPD
RGB-D GPose2023 [16] 0.704 0.645 0.734 0.734
RGB-D GDRNPP-PBR-RGBD-MModel [17] 0.679 0.622 0.707 0.708
RGB-D GDRNPPDet_PBRReal+GenFlow-MultiHypo [18, 19] 0.647 0.591 0.678 0.673
RGB-D ZebraPoseSAT-EffnetB4_refined (DefaultDetections-2023) [20] 0.618 0.557 0.647 0.651
RGB-D RDPN (cir) [21] 0.575 0.558 0.588 0.579
RGB-D SurfEmb-PBR-RGBD [22] 0.538 0.497 0.558 0.560
D Our Method 0.696 0.623 0.741 0.724
Table 3. Ablation studies were conducted on various parameters. The parameter 'multi-view' refers to the use of multi-view models as opposed to a single model. The parameter 'testing' indicates whether the check is utilized. The parameter 'points' denotes the number of points retained after model downsampling.
AR ARVSD ARMSSD ARMSPD Multi-view Testing Points
1 0.696 0.623 0.741 0.724 Yes Yes 1000
2 0.676 0.601 0.724 0.704 Yes No 1000
3 0.588 0.519 0.638 0.607 No Yes 1000
4 0.585 0.516 0.635 0.603 No No 1000
5 0.673 0.597 0.722 0.699 Yes Yes 500
6 0.664 0.589 0.712 0.689 Yes No 500
7 0.496 0.444 0.536 0.507 No Yes 500
8 0.491 0.438 0.532 0.503 No No 500
validate the superiority of multi-view models in capturing the
characteristics of different scene instances compared to single-
view models.
3.3.2. Comparison with learning-based method for RGB-
D data. Encouraged by the significant improvements
our method offers over conventional pose estimation tech-
niques, we further compared it with machine learning-based
approaches to highlight its advantages.
The performance comparison is illustrated in table 2. The
best-performing machine learning method is GPose [16],
which achieved an AR of 0.704, while our method attained
a comparable AR of 0.696. It is important to note that sev-
eral machine learning methods, such as GPose, rely on addi-
tional prerequisites, including: (1) pre-training on a sub-
stantial amount of data, and (2) the simultaneous use of
point cloud data and RGB data. In contrast, our method
neither relies on RGB data nor requires pre-training. Its
performance, which is comparable to that of machine learning
techniques, is mainly due to the use of multi-view models
that fully leverage the features of the real-world instances,
fundamentally representing a form of post-training feature
extraction.
3.4. Ablation study
To demonstrate the effectiveness of our method, we conducted
additional ablation experiments. The results of these experi-
ments are presented in table 3. The following conclusions can
be drawn:
First, analyzing the ‘multi-view’ parameter clarifies whether multi-view models are utilized. A comparison of rows 1 and 3, as well as rows 5 and 7, demonstrates that the incorporation of multi-view models produces higher values for AR, ARVSD, ARMSSD, and ARMSPD. This indicates that multi-view models can capture more comprehensive information about models, thereby enhancing overall performance.
Figure 7. Grayscale images and their corresponding point clouds are displayed, with the grayscale image shown at the top and the
corresponding point cloud at the bottom. There are three cases: (a) the case where point cloud data is severely incomplete, (b) the case
where the model can only be captured from a single angle, and (c) the case where models obstruct each other.
Next, the inuence of the ‘testing’ parameter is analyzed.
This parameter indicates whether post-optimization testing is
utilized. By comparing rows 1 and 2, as well as rows 5 and
6, it is evident that incorporating testing leads to improved
performance metrics. Specically, when testing is employed,
the values for AR, ARVSD, ARMSSD, and ARMSPD are higher,
reecting enhanced model accuracy, reduced variation, lower
mean squared error. This underscores the importance of incor-
porating testing in the pose estimation framework to achieve
optimal performance.
Finally, the impact of the ‘points’ parameter, which indic-
ates the number of points retained after model downsampling,
is examined. A comparison of rows 1 and 5, as well as rows
2 and 6, reveals that reducing the number of points from 1000
to 500 generally results in a slight decrease in performance
across all metrics. However, this decrease is not substantial,
suggesting that downsampling can be an effective strategy
for reducing computational complexity without significantly
compromising model accuracy. It is important to note that even
with fewer points, the use of multi-view models and checks
still produces competitive performance metrics.
In conclusion, the ablation studies demonstrate that the ‘multi-view’ parameter plays a crucial role in enhancing model performance across various metrics, which validates our method. The ‘testing’ parameter has a relatively lesser impact on improving model performance. While downsampling can be utilized to reduce computational load, it should be approached with caution to prevent significant performance degradation. These findings provide valuable insights for optimizing the model's design and performance in future research endeavors.
4. Discussion
In this section, we discuss the detailed process and parameters of our multi-view model method. The goal is to utilize our method more efficiently and promptly set parameters in specific scenarios. This section consists of three subsections:
real-world data challenges and solutions, computational com-
plexity analysis, and limitations.
4.1. Real-world data challenges and solutions
In the Introduction, we discuss the challenges associated with
real-world multi-instance pose estimation, which include more
outliers, occlusion from other instances, low overlap, and significant computational demands. To assess the effectiveness of
our method in addressing these issues, we utilize the ITODD
point cloud data from the BOP Challenge dataset for testing,
as it closely resembles industrial data. This dataset exhibits
the poorest 6D pose estimation results among the BOP Challenge datasets, primarily due to the aforementioned challenges and reflections in the industrial model, which result in incomplete point cloud data. Consequently, this multi-model, multi-instance matching dataset presents inherent challenges.
Here, we also want to discuss the challenges associated
with ITODD data. As illustrated in figure 7, the upper panel
displays a grayscale image, while the lower panel presents the
corresponding depth images. Some depth images have been
rotated to better reveal their missing components. The key
observations are as follows: (1) As depicted in figure 7(a), for certain simple objects in the industrial sector, the collected point cloud data is severely incomplete due to insufficient color information and the presence of reflections. (2) As shown in figure 7(b), for more complex objects, data can only be captured from a single viewpoint, resulting in severely incomplete instances. (3) As illustrated in figure 7(c), occlusion
between different instances further exacerbates the incom-
pleteness of the point cloud data. In industrial environments,
matching multiple instances of point clouds presents consid-
erable challenges.
PPF voting-based methods are inherently well-suited for
multi-instance matching. Their voting and clustering mech-
anisms effectively select multiple competing results without
compromising the consistency of clustering, a challenge often
encountered with other clustering methods in the presence of
multiple instances.
In response to challenges such as low overlap and reflection, multi-view models can be utilized to identify suitable models for matching in these scenarios. This is also why our approach demonstrates significant performance improvements.
4.2. Computational complexity analysis
The computational complexity of multi-view PPF-based pose
estimation can be analyzed as follows:
(1) For the pose estimation of a given model within a scene, it is essential to generate multi-view submodels, assuming the total number of submodels is $S$.
(2) For each given sub-model, which represents a specific viewpoint, a PPF-based matching process must be conducted. This process involves the following steps: (a) Iterate over each point $q_i$ in the scene, where $i = 1, 2, \cdots, M$. (b) For each point $q_i$, iterate over its surrounding points $q_k$ within a radius equal to the model's diameter, where $k = 1, 2, \cdots, K$. The total number of points $q_k$ is $K$, which is not a fixed number but can be on the order of $N^2$. (c) For each pair of points $q_i$ and $q_k$, calculate their PPFs and select those with identical features from the model features. The number of identical features is denoted by $P$, which varies significantly for different features, assuming an average number of $N$. The computational complexity for this step is $O(MN^3)$. (d) Finally, perform pose clustering. The clustering method has a computational complexity of less than $O(MN)$. Therefore, the overall computational complexity for a model from a single viewpoint is $O(MN^3)$.
(3) Using a model to generate $S$ submodels for pose estimation results in a computational complexity of $O(SMN^3)$.
In general, the value of $S$ ranges from 2 to 20, while the value of $N$ varies depending on the specific model, ranging from 400 to 1600. The value of $M$ is contingent upon the number of points after preprocessing in the scene, typically ranging from 1000 to 10 000 in this study. Based on these parameters, we can observe that when the parameters are at their lower limits, the computational load of the algorithm is relatively low. However, when the parameters reach their upper limits, a significant amount of computation is required. Furthermore,
it is important to note that for the PPF-based method applied
to symmetrical models, the probability of repetition for PPFs
increases, which, to some extent, raises the computational
load.
Experimental results demonstrate that our multi-view mod-
els achieve superior performance while using a lower bound of
parameters. However, to reach the upper bound on parameters
and further enhance performance, as discussed in the Ablation
Study, where the number of points is approximately 1000, it
is essential to implement GPU-based computing acceleration.
Although the method described in this study does not fully
utilize GPUs for computational acceleration, our attempt to
implement PPF-based matching on a GPU can reduce com-
putation time to 40 milliseconds for a scene. A more detailed
and comprehensive implementation will be part of our future
work.
To clarify the impact of the model’s point count on per-
formance and computational complexity, further explanation
is necessary. In practical applications, the PPF series method
typically involves downsampling the model so that the voxel
size is 1/20 of the model’s diameter (length). At this stage,
common model sizes generally range from 300 to 600 points.
The computational efficiency is notably high, with the time
required to compute a single image not exceeding one second.
However, if the model’s point count is maintained around
1000, experimental validation indicates that this can enhance
the performance of pose estimation, although it may reduce
computational efciency. This primarily depends on the spe-
cic requirements of the application.
4.3. Limitation
In this study, several parameters remain constant, including
those related to PPF, downsampling, and testing within the
space structure-based optimization. Theoretically, adjusting
parameters for different models could enhance the accuracy
of pose estimation. However, we have not yet conducted com-
prehensive research in this area.
During the testing of the optimization process, multiple
parameters are evaluated sequentially, and a valid pose must
successfully pass all tests. A significant drawback is that if the threshold is set too stringently, many accurate results may be excluded; conversely, if it is set too leniently, incorrect results may not be filtered out. The criteria for selecting the final results are also predetermined. While these fixed parameters are practical, they may hinder the achievement of optimal
outcomes.
Additionally, our approach indicates that the generation of
multi-view models still requires testing or real data. If a given
model can accurately represent instances within a scene, it
can indeed generate multi-view models without relying on
testing data. However, the given model is typically a CAD
file or another form of digital representation, and in practical
applications, there are often inherent discrepancies between
the given model and the actual instances measured in the
scene. We can observe that generating models based on test-
ing or real data and generalizing them is a process similar to
training in machine learning methods. However, our process
is considerably simpler than the training process. In practical
industrial applications, it may involve only a few adjustments
made by personnel using the software.
5. Conclusions and future work
5.1. Conclusion
In this paper, we propose a multi-view pose estimation method
based on PPF. Our study primarily involves generating multi-
view models, enhancing model generalization, and utilizing
these models for multi-instance pose estimation. Through
comparative experiments, we demonstrate that our method significantly outperforms traditional PPF-based techniques, and
we validate the effectiveness of multi-view models through
ablation studies.
5.2. Future work
In the future, we plan to pursue the following three lines of research:
(1) Optimize multi-view model pose estimation with GPU
acceleration.
(2) Adaptively adjust multi-view model parameters for better
performance.
(3) Explore the integration of multi-view methods with
machine learning techniques.
Data availability statement
The data that support the findings of this study are openly
available at the following URL/DOI: https://bop.felk.cvut.cz.
Acknowledgments
This work was supported in part by the National Natural
Science Foundation of China under Grant 42330113, in
part by the State Key Laboratory of Spatial Datum under
Grant SKLGIE2023-M-2-2, and in part by the National
Key Laboratory of Electromagnetic Environment under Grant
6142403210201.
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
ORCID iDs
Huakai Zhao https://orcid.org/0000-0002-8268-4885
Jiansheng Li https://orcid.org/0000-0002-7761-8103