
Received March 7, 2022, accepted March 22, 2022, date of publication March 25, 2022, date of current version March 31, 2022.

Digital Object Identifier 10.1109/ACCESS.2022.3162200

Orthogonal Single-Target Tracking

YOUJIN KIM AND JUNSEOK KWON, (Member, IEEE)

School of Computer Science and Engineering, Chung-Ang University, Seoul 06974, South Korea

Corresponding author: Junseok Kwon (jskwon@cau.ac.kr)

This work was supported in part by the Chung-Ang University Graduate Research Scholarship Grants in 2021, and in part by the National Research Foundation of Korea (NRF) Grant funded by the Korea Government [Ministry of Science and ICT (MSIT)] under Grant NRF-2020R1C1C1004907.

ABSTRACT In this study, we propose a novel Wasserstein distributional tracking method that can balance

approximation with accuracy in terms of Monte Carlo estimation. To achieve this goal, we present three

different systems: sliced Wasserstein-based (SWT), projected Wasserstein-based (PWT), and orthogonal

coupled Wasserstein-based (OCWT) visual tracking systems. Sliced Wasserstein-based visual trackers can

ﬁnd accurate target conﬁgurations using the optimal transport plan, which minimizes the discrepancy

between appearance distributions described by the estimated and ground truth conﬁgurations. Because this

plan involves a finite number of probability distributions, the computation costs can be considerably reduced.

Projected Wasserstein-based and orthogonal coupled Wasserstein-based visual trackers further enhance the

accuracy of visual trackers using bijective mapping functions and orthogonal Monte Carlo, respectively.

Experimental results demonstrate that our approach can balance computational efﬁciency with accuracy, and

the proposed visual trackers outperform other state-of-the-art visual trackers on several benchmark visual

tracking datasets.

INDEX TERMS Computer vision, distance measurement, probability distribution.

I. INTRODUCTION

Visual tracking is a fundamental technique that can be used

to predict target object (e.g., vehicle) trajectories. Recently, visual tracking performance has been enhanced by defining visual tracking problems in the Wasserstein space. This

Wasserstein space enables the accurate measurement of the

distance between probability distributions. Because it can

handle probability distributions, the Wasserstein distance has

been used in various computer vision applications (e.g.,

classiﬁcation [1], detection [2], visual tracking [3], and 3D

representation [4]) and has been applied to several machine

learning tasks (e.g., semi-supervised learning [5], adversarial

learning [6], meta learning [7], reinforcement learning [8],

and metric learning [9]).

Conventional visual tracking typically adopts matching metrics in the Euclidean space, e.g., the $l_1$ and $l_2$ norms, the Kullback-Leibler divergence, and the Jensen-Shannon divergence, while having several limitations in real-world visual-tracking environments. For example, the $l_1$ and $l_2$ norms cannot accurately measure the discrepancy between distributions. The Kullback-Leibler divergence is asymmetric, whereas the Jensen-Shannon divergence is discontinuous and is not proportional to the discrepancy between the distributions.

(The associate editor coordinating the review of this manuscript and approving it for publication was Wai-Keung Fung.)

Thus, a new matching metric is required in the Wasserstein space, which has been rarely explored in visual tracking. In particular, the

Wasserstein distance can measure the discrepancy between

probability distributions of the reference appearance and the

current target appearance at the estimated state. Because

visual trackers explicitly consider the discrepancy of prob-

ability distributions, they can encode the uncertainty in mea-

suring the distance from the distributional perspective.

However, calculating the Wasserstein distance requires

high computational costs and is intractable in real-world

settings with limited resources. To alleviate this prob-

lem, the following methods attempt to approximate the

Wasserstein distance. For example, Kolouri et al. [10] pro-

jected the Wasserstein distance into one-dimensional

spaces and presented the sliced Wasserstein distance.

Cuturi et al. [11] transformed the optimal transport problems

into maximum-entropy problems to speed up the computation

and introduced the Sinkhorn distance. Genevay et al. [12]

proposed a stochastic optimization method for dealing with

large-scale optimal transport problems. While these methods

have made the distance computation tractable, they inevitably

degrade the Wasserstein distance accuracy.

Thus, it is important to balance the approximation with

accuracy in the computation of the Wasserstein distance.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

FIGURE 1. Framework of the proposed visual tracking system. The proposed visual tracker proposes a new state at each time step and estimates the target appearance. Then, our visual tracker compares the reference target appearance with the estimated target appearance from the distributional perspective using three Wasserstein-based distances: the sliced Wasserstein distance, the projected Wasserstein distance, and the orthogonal coupled Wasserstein distance.

For this purpose, we adopt a variant of the sliced Wasser-

stein distance augmented by orthogonal coupling in the

course of Monte Carlo simulation on the Wasserstein dis-

tance [13], called orthogonal coupled Wasserstein (OCW).

Our OCW method can preserve the distance information in

high-dimensional space, although the method approximates

the Wasserstein distance to reduce computational cost.

In this study, we aim to solve visual tracking problems

using the proposed OCW. The proposed visual tracking

method represents a target appearance vector as a target

appearance distribution to cope with ambiguities in the

appearance representation. Subsequently, the OCW accu-

rately and efﬁciently minimizes the discrepancy between the

estimated and ground-truth target appearance distributions to

obtain an accurate target conﬁguration.

The contributions of the proposed method are as follows:

• We develop a novel sliced Wasserstein-based visual tracking system (SWT), in which two appearance distributions described by estimated configurations and ground-truth configurations become similar via the optimal transport plan. This plan can be conducted using a finite number of probability distributions; thus, the computational costs can be considerably reduced.

• We present a novel projected Wasserstein-based visual tracking system (PWT), in which the discrepancy between the aforementioned sliced Wasserstein distance and the true Wasserstein distance can be minimized using bijective mapping functions.

• We propose a novel orthogonal coupled Wasserstein-based visual tracking system (OCWT), in which the aforementioned projected distance can induce accurate projection directions using orthogonal Monte Carlo.

Figure 1 describes the framework of the proposed visual tracking system.

The remainder of this paper is organized as follows. Section II relates the proposed method to existing methods. Sections III, IV, and V propose visual tracking methods based on the sliced Wasserstein, projected Wasserstein, and orthogonal coupled Wasserstein distances, respectively. Section VI-A describes the experimental settings used in this study. We compare the proposed visual tracker with other state-of-the-art methods using the object tracking benchmark (OTB) and large-scale single object tracking (LaSOT) datasets in Sections VI-C and VI-D, respectively. Section VI-B analyzes our proposed visual trackers in depth. We conclude the study in Section VII.

II. RELATED WORK

While visual tracking has a long history, in this section,

we discuss the methods most relevant to our study, which

can be categorized into three groups: Wasserstein distribu-

tional visual tracking, visual tracking via projection, and deep

learning-based visual tracking.

A. WASSERSTEIN DISTRIBUTIONAL VISUAL TRACKING

Yao et al. [14] transformed visual tracking problems into

transportation problems via linear programming algorithms,

where 1-Wasserstein distances (i.e., earth mover’s dis-

tances) were used as a distance metric. Danu et al. [15]

employed the Wasserstein distance in a particle ﬁlter formu-

lation to compare estimated multi-target states with ground

truths in multi-sensor environments. Zeng et al. [3] mea-

sured the discrepancy between target-speciﬁc features using

the 1-Wasserstein distance to accurately track vehicles.

Danis et al. [16] used the Wasserstein distance to evaluate

Bluetooth data via a sequential Monte Carlo method.

In contrast to these methods that use Wasserstein distances to enhance the visual tracking accuracy, we use

the orthogonal coupled Wasserstein distance to balance the

accuracy with computational efﬁciency.

B. VISUAL TRACKING VIA PROJECTION

Xiao et al. [17] designed random projection matrices to

ﬁnd subspaces that make visual trackers robust to noise.

Zhang et al. [18] transformed visual tracking problems into projection problems, in which a robust target representation model is learned via a projection onto the $\ell_p$ ball.

Zhang et al. [19] proposed a visual tracker based on a struc-

turally random projection for dimensionality reduction of the

template space, in which the original distance was preserved

with an efﬁcient computation. Danelljan et al. [20] projected

color names on an orthonormal basis of a 10-dimensional

subspace to extract sophisticated color features for visual

tracking.

In contrast to these methods that project the Euclidean

space into the subspaces of the target appearance, we project

the Wasserstein space and explicitly guide the projection

direction for accurate visual tracking.

C. DEEP LEARNING-BASED VISUAL TRACKING

Li et al. [21] presented Siamese deep neural architec-

tures combined with region proposal networks, which

aimed to search for candidate regions for target objects.

Valmadre et al. [22] proposed deep neural networks based


on correlation ﬁlters that efﬁciently compared deep features

with reference features. Zhang et al. [23] introduced very

deep neural networks to extract representative features for

accurate visual tracking. Bertinetto et al. [24] made Siamese

networks fully convolutional for accurate and fast matching.

Li et al. [25] applied meta information to deep neural net-

works for fast adaptation in different visual tracking envi-

ronments and changes in target appearances. Zhu et al. [26]

enhanced the discriminative power of deep neural networks

using both negative and positive samples for target objects.

Choi et al. [27] boosted the adaptive representation ability of

deep neural networks using gradient information for visual

tracking. Bhat et al. [28] used discriminative classiﬁers for

deep neural networks, in which classiﬁer weights were gen-

erated via a novel optimization technique. Guo et al. [29]

presented dynamic Siamese network architectures that enable

the update of target appearances online.

In contrast to these methods, we do not use complex

deep neural architectures. Nevertheless, our proposed visual

tracker exhibits state-of-the-art visual tracking performance,

because target appearances are described by Wasserstein dis-

tributions; thus, several variations in target appearances can

be covered during visual tracking.

D. OTHER VISUAL TRACKING

Li et al. [30] proposed a dual-regression framework for visual tracking, which combines a discriminative fully convolutional module (for discriminative ability) and a fine-grained correlation filter (for accurate localization). Fan et al. [31] introduced a novel interactive learning framework for visual tracking, in which multiple convolutional filter models interact with each other and their responses are fused based on confidence scores. Liu et al. developed robust visual

trackers for thermal infrared objects based on multi-level

similarity models under the Siamese framework [32], via the

multi-task framework [33], and using the pretrained convolu-

tional neural networks [34].

Muresan et al. [35] introduced a multi-object tracking method based on an affinity measurement function and a context-aware descriptor for 3D objects. Karunasekera et al. [36]

presented a multi-object visual tracking system using a new

dissimilarity measure that considers object motion, appear-

ance, structure, and size. Brasó and Leal-Taixé [37] proposed fully differentiable message passing networks for multi-object tracking, which is formulated as a network flow problem.

In contrast to these methods, we present a novel mathematical approach based on the Wasserstein distance to boost visual tracking performance. Thus, this approach can be integrated into existing visual trackers to improve their performance. Note that using the Wasserstein distance enables us to exploit many useful mathematical properties.

III. SLICED WASSERSTEIN-BASED VISUAL TRACKING

A. SLICED WASSERSTEIN DISTANCE

The $p$-Wasserstein distance $W_p$ measures the discrepancy between two probability distributions $\mu, \nu \in P(\mathbb{R}^d)$, where $P(\mathbb{R}^d)$ denotes the set of distributions defined on $\mathbb{R}^d$ with a finite $p$-th moment. We then define the $p$-Wasserstein distance as follows:
$$W_p(\mu, \nu) = \left( \inf_{\gamma \in \Gamma(\mu, \nu)} \int_{\mathbb{R}^d \times \mathbb{R}^d} \|x - y\|_2^p \, \gamma(dx, dy) \right)^{1/p}, \tag{1}$$
where $\Gamma(\mu, \nu)$ denotes the set of joint probability distributions defined on $\mathbb{R}^d \times \mathbb{R}^d$ (i.e., $\Gamma(\mu, \nu) \subseteq P(\mathbb{R}^d \times \mathbb{R}^d)$). In (1), we can find the optimal transport plan $\gamma$ between $\mu$ and $\nu$, inducing $W_p$.

The Wasserstein distance in (1) can directly consider probability distributions. However, it is difficult to define the set of joint probability distributions $\Gamma(\mu, \nu)$. Thus, conventional approaches [38] approximate $\nu$ as $\{\nu_m\}_{m=1}^{M}$ and $W_p(\mu, \nu)$ as $\arg\min_{\mu} \sum_{m=1}^{M} w_m W_p(\mu, \nu_m)$, where $w_m$ denotes the $m$-th weight. As an alternative approach, $\mu$ and $\nu$ are assumed to be one-dimensional probability distributions (i.e., $\mu, \nu \in P(\mathbb{R}^1)$). Then, we can find the optimal transport plan $\gamma$ using a finite number of probability distributions, which can considerably reduce the computational costs. This approach induces the sliced Wasserstein distance [13], [39].

To compute the sliced Wasserstein distance, we define the unit sphere $S^{d-1}$ in $\mathbb{R}^d$. Subsequently, for a vector $s \in S^{d-1}$, we define the projection map $\mathrm{proj}$, which transforms $x \in \mathbb{R}^d$ into $\langle s, x \rangle \in \mathbb{R}^1$ (i.e., $\mathrm{proj}_s(x) = \langle s, x \rangle$). We define the projection (pushforward) of the probability distribution $\mu$ as $\mathrm{proj}_s^{\#}(\mu)$. Using $\mathrm{proj}_s^{\#}(\mu)$, we can deal with one-dimensional probability distributions. Then, the sliced Wasserstein distance $W_p^{\mathrm{slice}}$ is defined as follows:
$$W_p^{\mathrm{slice}}(\mu, \nu) = \mathbb{E}_{s \in S^{d-1}}\left[ W_p\left( \mathrm{proj}_s^{\#}(\mu), \mathrm{proj}_s^{\#}(\nu) \right) \right]. \tag{2}$$

In (2), $\mathbb{E}$ is implemented via a Monte Carlo simulation with $N$ samples (i.e., $s_1, \cdots, s_N \in S^{d-1}$) as
$$\widetilde{W}_p^{\mathrm{slice}}(\mu, \nu) = \frac{1}{N} \sum_{n=1}^{N} W_p\left( \mathrm{proj}_{s_n}^{\#}(\mu), \mathrm{proj}_{s_n}^{\#}(\nu) \right). \tag{3}$$
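For one-dimensional empirical distributions with the same number of samples, the optimal transport plan reduces to matching sorted samples, which is what makes the estimator in (3) cheap. The following is a minimal NumPy sketch of (3) under this empirical setting; the function names and the Gaussian-normalization trick for drawing uniform directions on $S^{d-1}$ are our own illustrative choices rather than the authors' implementation.

```python
import numpy as np

def wasserstein_1d(u, v, p=2):
    """1-D p-Wasserstein distance between equal-size empirical samples.
    In one dimension, the optimal transport plan matches sorted samples."""
    u_sorted, v_sorted = np.sort(u), np.sort(v)
    return np.mean(np.abs(u_sorted - v_sorted) ** p) ** (1.0 / p)

def sliced_wasserstein(X, Y, n_projections=100, p=2, rng=None):
    """Monte Carlo estimate of the sliced Wasserstein distance in (3).
    X, Y: (M, d) arrays of samples drawn from mu and nu, respectively."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    total = 0.0
    for _ in range(n_projections):
        s = rng.standard_normal(d)
        s /= np.linalg.norm(s)  # uniform random direction on S^{d-1}
        total += wasserstein_1d(X @ s, Y @ s, p)
    return total / n_projections
```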

B. VISUAL TRACKING

With $\widetilde{W}_p^{\mathrm{slice}}$, we present a novel sliced Wasserstein distance-based visual tracker. In the visual tracking context, $\mu$ and $\nu$ indicate the estimated and ground-truth target appearance distributions, respectively. We adopt empirical distributions for $\mu$ and $\nu$, which are defined as follows:
$$\mu = \frac{1}{M} \sum_{m=1}^{M} I(x_m), \qquad \nu = \frac{1}{M} \sum_{m=1}^{M} I(y_m). \tag{4}$$
In (4), $I(x_m)$ denotes an indicator function (i.e., $I(x_m) = 1$ if $x = x_m$; otherwise, $I(x_m) = 0$). We extract $M$ appearance feature vectors using $M$ moments.
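As a concrete illustration of these moment-based appearance features, the sketch below computes the four moment statistics reported in Section VI-A (mean, variance, skewness, and kurtosis) from a flattened patch; the function name and standardization details are illustrative assumptions.

```python
import numpy as np

def moment_features(patch, M=4):
    """Describe a target appearance by its first M moment statistics
    (mean, variance, skewness, kurtosis); a sketch of the M-moment
    features used in (4) and Section VI-A."""
    x = np.asarray(patch, dtype=np.float64).ravel()
    mean, std = x.mean(), x.std() + 1e-12  # epsilon avoids division by zero
    z = (x - mean) / std
    moments = [mean, std ** 2, np.mean(z ** 3), np.mean(z ** 4)]
    return np.array(moments[:M])
```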

We define a target object configuration at time $t$ as $O_t = \{o_1, o_2, o_3\}$, where $o_1$, $o_2$, and $o_3$ denote the x-axis position, y-axis position, and scale of the target in an image, respectively. Given the best target configuration at time $t-1$, $\hat{O}_{t-1}$, our goal of visual tracking is to find the best target configuration at time $t$, $\hat{O}_t$.


Algorithm 1 Sliced Wasserstein Distance-Based Tracker (SWT)
Input: $\hat{O}_{t-1}$
Output: $\hat{O}_t$
1: $\{O_t^{(c)}\}_{c=1}^{C} \sim \mathcal{N}(\hat{O}_{t-1}, \Sigma^2)$
2: for $c = 1$ to $C$ do
3: $\quad \widetilde{W}_p^{\mathrm{slice}}(\mu^{(c)}, \nu) = \frac{1}{N} \sum_{n=1}^{N} W_p\left( \mathrm{proj}_{s_n}^{\#}(\mu^{(c)}), \mathrm{proj}_{s_n}^{\#}(\nu) \right)$
4: end for
5: $c^* = \arg\min_c \widetilde{W}_p^{\mathrm{slice}}(\mu^{(c)}, \nu)$ for $c = 1, \cdots, C$
6: $\hat{O}_t = O_t^{(c^*)}$

Algorithm 2 Projected Wasserstein Distance-Based Tracker (PWT)
Input: $\hat{O}_{t-1}$
Output: $\hat{O}_t$
1: $\{O_t^{(c)}\}_{c=1}^{C} \sim \mathcal{N}(\hat{O}_{t-1}, \Sigma^2)$
2: for $c = 1$ to $C$ do
3: $\quad W_p^{\mathrm{proj}}(\mu^{(c)}, \nu) = \frac{1}{MN} \sum_{n=1}^{N} \sum_{m=1}^{M} \left\| \mathrm{proj}_{s_n^{\mathrm{new}}}(x_m) - \mathrm{proj}_{s_n^{\mathrm{new}}}(b(y_m)) \right\|_2^p$
4: end for
5: $c^* = \arg\min_c W_p^{\mathrm{proj}}(\mu^{(c)}, \nu)$ for $c = 1, \cdots, C$
6: $\hat{O}_t = O_t^{(c^*)}$

For this purpose, we randomly search for candidate configurations around $\hat{O}_{t-1}$. Thus, our motion model is based on a normal distribution, as follows:
$$\{O_t^{(c)}\}_{c=1}^{C} \sim \mathcal{N}(\hat{O}_{t-1}, \Sigma^2). \tag{5}$$
In (5), $O_t^{(c)}$ denotes the $c$-th candidate configuration that is proposed based on a normal distribution with center $\hat{O}_{t-1}$ and standard deviation $\Sigma$. Subsequently, we measure the sliced Wasserstein distance $\widetilde{W}_p^{\mathrm{slice}}$ between the appearance distributions described by candidate configuration $O_t^{(c)}$ and ground-truth configuration $O_t^{GT}$, which are $\mu^{(c)}$ and $\nu$, respectively. Our objective is to find the best index $c^*$, whose corresponding appearance distribution $\mu^{(c)}$ described by candidate configuration $O_t^{(c)}$ minimizes the distance:
$$c^* = \arg\min_{c} \widetilde{W}_p^{\mathrm{slice}}(\mu^{(c)}, \nu) \quad \text{for } c = 1, \cdots, C. \tag{6}$$
In (6), the best target configuration at time $t$ is $\hat{O}_t = O_t^{(c^*)}$. Algorithm 1 shows the entire pipeline of the proposed visual tracker based on the sliced Wasserstein distance.
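A minimal sketch of one SWT step (Algorithm 1) follows, reusing the `sliced_wasserstein` sketch above. The helper `extract_samples`, which crops the image at a configuration and returns its appearance samples, is a hypothetical callback; the candidate proposal follows the normal motion model in (5), with the default `sigma` taken from the settings in Section VI-A.

```python
import numpy as np

def swt_step(O_prev, extract_samples, ref_samples, C=10,
             sigma=(0.1, 0.1, 0.001), rng=None):
    """One step of Algorithm 1 (SWT). O_prev is the previous best
    configuration (x, y, scale); extract_samples(O) is an assumed helper
    returning (M, d) appearance samples for configuration O; ref_samples
    holds samples from the reference (ground-truth) distribution nu."""
    rng = np.random.default_rng(rng)
    # (5): propose C candidate configurations around the previous one.
    candidates = (np.asarray(O_prev)
                  + rng.standard_normal((C, 3)) * np.asarray(sigma))
    # (6): pick the candidate minimizing the sliced Wasserstein distance.
    dists = [sliced_wasserstein(extract_samples(O), ref_samples)
             for O in candidates]
    return candidates[int(np.argmin(dists))]
```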

IV. PROJECTED WASSERSTEIN-BASED VISUAL TRACKING

A. PROJECTED WASSERSTEIN DISTANCE

Using the sliced Wasserstein distance, we can considerably reduce the computational cost but can obtain erroneous results, because there exists a discrepancy between the sliced Wasserstein distance and the true Wasserstein distance. In particular, depending on $s$ in (3), the projected vector $\mathrm{proj}_s(x)$ can be

Algorithm 3 Orthogonal Coupled Wasserstein Distance-Based Tracker (OCWT)
Input: $\hat{O}_{t-1}$
Output: $\hat{O}_t$
1: $\{O_t^{(c)}\}_{c=1}^{C} \sim \mathcal{N}(\hat{O}_{t-1}, \Sigma^2)$
2: for $c = 1$ to $C$ do
3: $\quad \widetilde{W}_p^{\mathrm{ort}}(\mu^{(c)}, \nu) = \frac{1}{N} \sum_{n=1}^{N} W_p\left( \mathrm{proj}_{s_n^{\mathrm{ort}}}^{\#}(\mu^{(c)}), \mathrm{proj}_{s_n^{\mathrm{ort}}}^{\#}(\nu) \right)$
4: end for
5: $c^* = \arg\min_c \widetilde{W}_p^{\mathrm{ort}}(\mu^{(c)}, \nu)$ for $c = 1, \cdots, C$
6: $\hat{O}_t = O_t^{(c^*)}$

biased [40]. In particular, $\mathrm{proj}_s(x) < \mathrm{proj}_s(x')$ does not imply $\mathrm{proj}_s(y) < \mathrm{proj}_s(y')$. To solve this problem, a bijective mapping has been introduced to measure the sliced Wasserstein distance [13]. The bijective mapping induces
$$\mathrm{proj}_s(b(y)) < \mathrm{proj}_s(b(y')), \quad \text{if } \mathrm{proj}_s(x) < \mathrm{proj}_s(x'). \tag{7}$$
In (7), the bijective mapping $b(\cdot)$ can be implemented by sorting $\{y_m\}_{m=1}^{M}$, which results in $\{y_m^{\mathrm{sort}}\}_{m=1}^{M}$, and selecting $y^{\mathrm{sort}}_{\mathrm{argsort}(x_m)}$ for $x_m$, where $\mathrm{argsort}$ returns the indices that sort $\{x_m\}_{m=1}^{M}$ and $\mathrm{argsort}(x_m)$ returns the index of $x_m$.
Subsequently, the projection is conducted using a new projection vector $s^{\mathrm{new}} \in S^{d-1}$, which is different from $s$ in (7). Using $s^{\mathrm{new}}$, we can prevent the aforementioned projection from being biased. The projected Wasserstein distance is then defined as
$$W_p^{\mathrm{proj}}(\mu, \nu) = \frac{1}{MN} \sum_{n=1}^{N} \sum_{m=1}^{M} \left\| \mathrm{proj}_{s_n^{\mathrm{new}}}(x_m) - \mathrm{proj}_{s_n^{\mathrm{new}}}(b(y_m)) \right\|_2^p, \tag{8}$$
where $x_m \sim \mu$ and $y_m \sim \nu$ as in (4).
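The sorting-based coupling in (7) and the re-projection in (8) can be sketched as follows. Pairing samples by their ranks under the first direction $s$ implements $b(\cdot)$, and a freshly drawn $s^{\mathrm{new}}$ measures the distance between the coupled pairs; drawing both directions independently at random is our simplifying assumption.

```python
import numpy as np

def projected_wasserstein(X, Y, n_projections=100, p=2, rng=None):
    """Sketch of the projected Wasserstein distance in (8). A direction s
    defines the bijective coupling b via sorting, as in (7); a new
    direction s_new then measures the distance between coupled samples."""
    rng = np.random.default_rng(rng)
    M, d = X.shape
    total = 0.0
    for _ in range(n_projections):
        s = rng.standard_normal(d)
        s /= np.linalg.norm(s)
        s_new = rng.standard_normal(d)
        s_new /= np.linalg.norm(s_new)
        # Bijective coupling: match samples by their ranks under proj_s.
        X_c, Y_c = X[np.argsort(X @ s)], Y[np.argsort(Y @ s)]
        total += np.mean(np.abs(X_c @ s_new - Y_c @ s_new) ** p)
    return total / n_projections  # the 1/(MN) factor: mean over M, then N
```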

B. VISUAL TRACKING

Our objective is to find the best index $c^*$, whose corresponding appearance distribution $\mu^{(c)}$ described by candidate configuration $O_t^{(c)}$ minimizes the distance:
$$c^* = \arg\min_{c} W_p^{\mathrm{proj}}(\mu^{(c)}, \nu) \quad \text{for } c = 1, \cdots, C, \tag{9}$$
where the best target configuration at time $t$ is $\hat{O}_t = O_t^{(c^*)}$. Algorithm 2 shows the entire pipeline of the proposed visual tracker based on the projected Wasserstein distance.

V. ORTHOGONAL COUPLED WASSERSTEIN-BASED

TRACKING

A. ORTHOGONAL COUPLED WASSERSTEIN DISTANCE

Using the projected Wasserstein distance, we can reduce the discrepancy between the sliced Wasserstein distance and the true Wasserstein distance. However, the projection direction $s$ in $\mathrm{proj}_s$ is crucial for the success of the projected Wasserstein distance, as mentioned in [41]. In this context, we use


TABLE 1. Quantitative comparison of the proposed methods. The best results are written in boldface.

TABLE 2. Analysis of the proposed OCWT according to different values of N (Monte Carlo samples) in (3). The best results are written in boldface.

orthogonal directions, because orthogonal projection directions guarantee improved (lower) estimator variance for the projected Wasserstein distance, as proven in [13]. To sample mutually orthogonal vectors $s_1^{\mathrm{ort}}, \cdots, s_N^{\mathrm{ort}} \in S^{d-1}$ (i.e., $\langle s_i^{\mathrm{ort}}, s_j^{\mathrm{ort}} \rangle = 0$ for $i \neq j$), we employ the orthogonal Monte Carlo (OMC) technique in [42]. Using OMC, mutually orthogonal vectors can be efficiently obtained from the unit sphere $S^{d-1}$ in $\mathbb{R}^d$.

Let $G$ be a $d$-dimensional Givens rotation [43]. Then, $G$ is an orthogonal matrix parameterized by two indices $i, j \in \{1, \cdots, d\}$ and an angle $\theta \in [0, 2\pi)$:
$$G[i, j, \theta]_{k,l} = \begin{cases} \cos(\theta) & \text{if } k = l \in \{i, j\}, \\ -\sin(\theta) & \text{if } k = i,\ l = j, \\ \sin(\theta) & \text{if } k = j,\ l = i, \\ 1 & \text{if } k = l \notin \{i, j\}, \\ 0 & \text{otherwise}, \end{cases} \tag{10}$$
where all coordinates of $\mathbb{R}^d$ are fixed except $i$ and $j$, and the two-dimensional subspace they span is rotated by $\theta$.

Using $G[i, j, \theta]$, we can sample orthogonal vectors via Kac's random walk on the Markov chain $\{K_t\}_{t=1}^{\infty}$:
$$K_{1:T} = \prod_{t=1}^{T} G[i_t, j_t, \theta_t]. \tag{11}$$
In (11), the sequence $K_t \times s_n^{\mathrm{ort}}$ is a Markov chain on $S^{d-1}$ [44]. Then, the orthogonal coupled Wasserstein distance $W_p^{\mathrm{ort}}$ is defined as follows:

$$W_p^{\mathrm{ort}}(\mu, \nu) = \mathbb{E}_{s \in S^{d-1}}\left[ W_p\left( \mathrm{proj}_{s^{\mathrm{ort}}}^{\#}(\mu), \mathrm{proj}_{s^{\mathrm{ort}}}^{\#}(\nu) \right) \right]. \tag{12}$$
In (12), $\mathbb{E}$ is implemented via a Monte Carlo simulation with $N$ samples (i.e., $s_1^{\mathrm{ort}}, \cdots, s_N^{\mathrm{ort}} \in S^{d-1}$) as
$$\widetilde{W}_p^{\mathrm{ort}}(\mu, \nu) = \frac{1}{N} \sum_{n=1}^{N} W_p\left( \mathrm{proj}_{s_n^{\mathrm{ort}}}^{\#}(\mu), \mathrm{proj}_{s_n^{\mathrm{ort}}}^{\#}(\nu) \right). \tag{13}$$
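To make (10) and (11) concrete, the sketch below applies $T$ random Givens rotations to the first $N$ rows of the identity matrix; since every rotation is orthogonal, the rows remain mutually orthogonal unit vectors throughout the walk. Sampling $(i_t, j_t, \theta_t)$ uniformly at random is our reading of Kac's walk, and the function names are illustrative.

```python
import numpy as np

def apply_givens(V, i, j, theta):
    """Apply the Givens rotation G[i, j, theta] of (10) to each row of V,
    rotating the (i, j) coordinate plane by theta."""
    c, s = np.cos(theta), np.sin(theta)
    vi, vj = V[:, i].copy(), V[:, j].copy()
    V[:, i] = c * vi - s * vj
    V[:, j] = s * vi + c * vj

def orthogonal_directions(N, d, T=90, rng=None):
    """Sample N mutually orthogonal unit vectors on S^{d-1} via Kac's
    random walk (11): T random Givens rotations applied to orthonormal
    starting rows. Orthonormality is preserved by every rotation."""
    assert N <= d, "at most d mutually orthogonal directions exist in R^d"
    rng = np.random.default_rng(rng)
    V = np.eye(d)[:N]  # start from N orthonormal rows
    for _ in range(T):
        i, j = rng.choice(d, size=2, replace=False)
        apply_givens(V, i, j, rng.uniform(0.0, 2.0 * np.pi))
    return V  # rows are s_1^ort, ..., s_N^ort
```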

B. VISUAL TRACKING

TABLE 3. Analysis of the proposed OCWT according to different values of C (candidate configurations) in (5). The best results are written in boldface.

TABLE 4. Analysis of the proposed OCWT according to different values of M (moment statistics) in (4). The best results are written in boldface.

Our objective is to find the best index $c^*$, whose corresponding appearance distribution $\mu^{(c)}$ described by candidate configuration $O_t^{(c)}$ minimizes the distance:
$$c^* = \arg\min_{c} \widetilde{W}_p^{\mathrm{ort}}(\mu^{(c)}, \nu) \quad \text{for } c = 1, \cdots, C, \tag{14}$$
where the best target configuration at time $t$ is $\hat{O}_t = O_t^{(c^*)}$. Algorithm 3 shows the entire pipeline of our visual tracker based on the orthogonal coupled Wasserstein distance.
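Combining the pieces above, a minimal sketch of the estimator in (13) (and hence of step 3 of Algorithm 3) reads as follows; it reuses the `orthogonal_directions` and `wasserstein_1d` sketches. Capping the number of directions at $d$ reflects that at most $d$ mutually orthogonal vectors exist; when $N > d$, one would resample further orthogonal blocks, a detail we omit.

```python
import numpy as np

def orthogonal_coupled_wasserstein(X, Y, N=100, T=90, p=2, rng=None):
    """Sketch of the orthogonal coupled Wasserstein estimate in (13):
    average 1-D Wasserstein distances over mutually orthogonal
    projection directions sampled via Kac's walk."""
    d = X.shape[1]
    S = orthogonal_directions(min(N, d), d, T=T, rng=rng)
    dists = [wasserstein_1d(X @ s, Y @ s, p) for s in S]
    return float(np.mean(dists))
```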

VI. EXPERIMENTS

A. EXPERIMENTAL SETTINGS

1) OTB DATASET

To demonstrate the effectiveness of the proposed methods,

we compared three proposed visual trackers (i.e., SWT,

PWT, and OCWT) with 9 recent deep learning-based visual

trackers (i.e., ECO-HC [45], TADT [46], SiamRPN++ [21],

SINT-op [47], C-COT [48], DAT [49], ECO [45],

SiamDW [23], and SINT [47]) using the OTB dataset [50].

This dataset includes various attributes for visual track-

ing environments, including out-of-view, out-of-plane rota-

tion, deformations, motion blur, scale variation, illumination

change, fast motion, background clutter, in-plane rotation,

low resolution, and occlusions. To evaluate the visual tracking

methods, precision and success plots, and the area under the curve (AUC) were used. The precision plot computes the ratio of frames in which the discrepancy between the estimated and ground-truth configurations of the targets is less than a specific threshold. The success plot computes the percentage of frames in which the intersection over union (IoU) between the estimated and ground-truth bounding boxes is greater than a specific threshold. The AUC is the area under the success plot.
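As an illustration of these metrics, the sketch below computes the success plot and its AUC from per-frame bounding boxes; the (x, y, w, h) box format and the threshold grid are our assumptions.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes."""
    xa = max(box_a[0], box_b[0])
    ya = max(box_a[1], box_b[1])
    xb = min(box_a[0] + box_a[2], box_b[0] + box_b[2])
    yb = min(box_a[1] + box_a[3], box_b[1] + box_b[3])
    inter = max(0.0, xb - xa) * max(0.0, yb - ya)
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def success_auc(pred_boxes, gt_boxes, thresholds=np.linspace(0, 1, 21)):
    """Success plot: fraction of frames whose IoU with the ground truth
    exceeds each overlap threshold; the AUC is its mean over thresholds."""
    ious = np.array([iou(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    success = np.array([(ious > t).mean() for t in thresholds])
    return success, float(success.mean())
```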

2) LaSOT DATASET

We also compared our visual trackers with visual track-

ers (e.g., StructSiam [51], DASiam [26], GlobalTrack [52],

SiamRPN++ [21], ATOM [53], ECO [45], CFNet [22], and

SPLT [54]) including state-of-the-art correlation ﬁlter-based

trackers (e.g., GFSDCF [55], ASRCF [56], STRCF [57],

and BACF [58]) using the LaSOT dataset [59]. This dataset

contains 1,400 test sequences, in which the average length

is greater than 2,512 frames. To evaluate the visual tracking

methods, precision, normalized precision, and area under the

curve were used.

3) VOT DATASET

In addition, we compared visual trackers (e.g., CFCF [60],

LSART [61], CFWCR [62], and ECO [45]) using the


FIGURE 2. Quantitative comparison with non-deep-learning visual trackers using the OTB dataset.

FIGURE 3. Quantitative comparison with deep-learning visual trackers using the OTB dataset.

TABLE 5. Analysis of the proposed OCWT according to different values of T (frames) in (11). The best results are written in boldface.

VOT2017 dataset, which contains 60 videos with diverse

attributes. To evaluate the visual trackers, accuracy and

robustness metrics were used.

4) HYPERPARAMETERS

For the experiments, we used $N = 100$ Monte Carlo samples in (3), $M = 4$ moment statistics (i.e., mean, variance, skewness, and kurtosis) in (4), $C = 10$ candidate configurations in (5), $\Sigma = \{0.1, 0.1, 0.001\}$ in (5), and $T = 90$ frames in (11).

B. ANALYSIS OF THE PROPOSED METHOD

To examine the effectiveness of each proposed technique, in Table 1 we compared the proposed SWT with its extensions, PWT and OCWT. As shown in the table, describing multiple appearances of the target using Wasserstein distributions is helpful for accurate visual tracking; our simple SWT-based visual tracker already outperforms state-of-the-art visual trackers, including GlobalTrack, in terms of normalized precision (as shown in Table 6).

We also examined the robustness of the proposed method against hyperparameter settings. Table 2 shows that the proposed OCWT is not sensitive to different settings for the number of Monte Carlo samples. Although the OCWT exhibited more accurate results with more samples at the cost of computational time, it still shows accurate visual tracking performance even with 50 samples. Table 3 includes the visual tracking results of the proposed OCWT according to different numbers of candidate configurations ($C$ in (5)). If we consider a large number of candidate regions for the target, we have more chances of getting trapped in local minima; thus, the visual tracking accuracy decreased when we used 20 candidate regions. In contrast, if we consider a very small number of candidate regions, the visual tracking accuracy can also decrease because the search areas are not sufficient to find the target. In any case, however, our tracker is not sensitive to the number of candidate configurations.


FIGURE 4. Qualitative evaluation using the OTB dataset. The yellow and red boxes denote the tracking results of ground truths and the proposed

visual tracker, respectively.

Table 4 lists the visual tracking results of the proposed OCWT according to different numbers of moment statistics ($M$ in (4)). As shown in the table, using a single moment statistic to describe the target appearance was not sufficient to accurately track the target. If we use more than four moment statistics, the visual tracking performance converges, and our visual tracker can successfully track the target. Table 5 shows that the proposed OCWT is not sensitive to different settings with respect to the number of frames ($T$ in (11)). Although we could obtain more accurate orthogonal vectors with a large number of frames, the performance improvement was not significant. Even when the orthogonal vectors are not exact, using them is crucial for robust visual tracking. It should be noted that the proposed OCWT with orthogonal vectors considerably outperforms the PWT without orthogonal vectors.

C. COMPARISONS ON THE OTB DATASET

Our method was quantitatively compared with non-deep-

learning visual trackers. As shown in Figure 2, the proposed

method considerably surpassed existing non-deep-learning

visual trackers in all evaluation metrics (i.e., precision plot,

success plot, and AUC). While the second-best methods are

Struck and SCM for the precision and success plots, respec-

tively, the proposed method outperformed these methods by a

large margin. Empirically, we argue that accurate visual track-

ing results of our method are induced by precisely measuring

the discrepancy between two distributions of estimated and

ground-truth appearances via advanced Wasserstein-based

techniques. Our method was also compared with recent

deep-learning visual trackers, as shown in Figure 3. Our method exhibited state-of-the-art performance in all evaluation metrics, even though it adopts no complex deep neural network architecture. In contrast, SiamDW showed the second-best performance in terms of the precision

plex deep neural network architecture. In contrast, SiamDW

showed the second-best performance in terms of the precision

plot, even though it employed a deeper and wider neural

network architecture for visual tracking. Thus, this quantita-

tive comparison veriﬁed the effectiveness of our Wasserstein

distributional tracking, in which the discrepancy between

the two appearance distributions is efﬁciently minimized.


TABLE 6. Quantitative comparison using the LaSOT dataset. The best results are written in boldface.

FIGURE 5. Success plot of visual trackers using the LaSOT dataset.

FIGURE 6. Normalized precision plot of visual trackers using the LaSOT dataset.

It is noteworthy that we present a novel appearance model

for visual tracking based on the Wasserstein distribution; thus

the proposed technique can be plugged into existing visual

trackers to improve their visual tracking accuracy.

Figure 4 shows the qualitative visual tracking results of our

method for the OTB dataset. The test video sequences contain

fast motions (e.g., (a) Biker, (b) Bolt, and (c) Deer sequences),

nonrigid deformation (e.g., (d) Diving, (e) Ironman, and

(f) Jump sequences), background clutter (e.g., (g) Matrix,

(h) MotorRolling, and (i) Shaking sequences), occlusions

(e.g., (g) Matrix and (l) Soccer sequences), illumination

changes (e.g., (e) Ironman, (g) Matrix, (i) Shaking, and

(j) Singer2 sequences), and small objects (e.g., (f) Jump

and (k) Skiing sequences). Although these sequences are

very challenging, our method accurately tracked the targets. This accurate visual tracking performance stems from the modeling of multiple appearances using Wasserstein distributions.


FIGURE 7. Quantitative comparison of visual trackers using the VOT

dataset.

TABLE 7. Comparisons of speed in terms of frames per second (FPS). The

best results are written in boldface.

D. COMPARISONS ON THE LaSOT DATASET

Table 6 shows quantitative comparisons between the pro-

posed OCWT and recent state-of-the-art visual trackers

using the LaSOT dataset. As shown in the table, our

method produces accurate visual tracking results and out-

performs other visual trackers, where GlobalTrack shows

the second-best visual tracking performance. However,

GlobalTrack adopted a complex backbone network (ResNet)

to extract representative features, while the proposed method

used a small backbone network (VGG) to exhibit state-of-

the-art performance with small computational costs. These

experimental results demonstrate that the advantage of using

Wasserstein distributions for the target appearances makes

the proposed visual tracker robust to several variations in

the target appearances, which can be caused by illumination

changes, deformation, and background clutters.

Figures 5 and 6 show the success and normalized precision plots of visual trackers on the LaSOT dataset, respectively. As shown in the figures, the proposed visual tracker, OCWT, is comparable with recent state-of-the-art visual trackers such as DiMP and LTMU, while our method considerably outperforms state-of-the-art correlation filter-based trackers (e.g., GFSDCF [55], ASRCF [56], STRCF [57], and BACF [58]).

E. COMPARISONS ON THE VOT DATASET

Figure 7 demonstrates the effectiveness of the proposed method on the VOT dataset. The proposed visual tracker, OCWT, is the state-of-the-art visual tracker in terms of accuracy, while its robustness is also competitive with other methods.

LSART exhibits the best performance in terms of robustness,

but it inaccurately tracks target objects compared with the

proposed method.

F. COMPARISONS OF SPEED

Table 7 reports speed in terms of FPS. Correlation filter-based visual trackers are fast because their mathematical operations are computationally efficient. The proposed method can also process 79 frames per second, which is faster than other non-correlation filter-based visual trackers. This

indicates that the proposed orthogonal coupled Wasserstein

distribution is useful for improving visual tracking accuracy

with low computational costs.

VII. CONCLUSION

In this study, we propose a novel Wasserstein distri-

butional tracking method that can balance approxima-

tion with accuracy in terms of Monte Carlo estimation.

To achieve this goal, we present three different visual tracking

systems: sliced Wasserstein-based, projected Wasserstein-

based, and orthogonal coupled Wasserstein-based. Sliced

Wasserstein-based visual trackers can ﬁnd accurate target

conﬁgurations using the optimal transport plan, which min-

imizes the discrepancy between appearance distributions

described by the estimated and ground truth conﬁgurations.

Because this plan involves a ﬁnite number of probabil-

ity distributions, the computation costs can be consider-

ably reduced. Projected Wasserstein-based and orthogonal

coupled Wasserstein-based visual trackers further enhance

the accuracy of visual trackers using bijective mapping func-

tions and orthogonal Monte Carlo, respectively. Experimental

results demonstrate that our approach can balance compu-

tational efficiency with accuracy, and the proposed visual

trackers outperform other state-of-the-art visual trackers on

benchmark visual tracking datasets.

REFERENCES

[1] S. Kolouri, Y. Zou, and G. K. Rohde, ‘‘Sliced Wasserstein kernels for

probability distributions,’’ in Proc. CVPR, 2016, pp. 5258–5267.

[2] Y. Han, X. Liu, Z. Sheng, Y. Ren, X. Han, J. You, R. Liu, and Z. Luo,

‘‘Wasserstein loss based deep object detection,’’ in Proc. CVPRW, 2020,

pp. 4299–4305.

[3] Y. Zeng, X. Fu, L. Gao, J. Zhu, H. Li, and Y. Li, ‘‘Robust multivehicle

tracking with Wasserstein association metric in surveillance videos,’’ IEEE

Access, vol. 8, pp. 47863–47876, 2020.

[4] D. W. Shu, S. W. Park, and J. Kwon, ‘‘3D point cloud generative adversarial

network based on tree structured graph convolutions,’’ in Proc. ICCV,

2019, pp. 3859–3868.

[5] J. Solomon, R. Rustamov, L. Guibas, and A. Butscher, ‘‘Wasserstein prop-

agation for semi-supervised learning,’’ in Proc. ICML, 2014, pp. 306–314.

[6] S. W. Park and J. Kwon, ‘‘SphereGAN: Sphere generative adversarial

network based on geometric moment matching and its applications,’’

IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 3, pp. 1566–1580,

Mar. 2020.

[7] V. K. Verma, D. Brahma, and P. Rai, ‘‘Meta-learning for generalized zero-

shot learning,’’ in Proc. AAAI, 2020, pp. 6062–6069.

[8] A. M. Metelli, A. Likmeta, and M. Restelli, ‘‘Propagating uncertainty in

reinforcement learning via Wasserstein barycenters,’’ in Proc. NeurIPS,

2019, pp. 4333–4345.

[9] J. Xu, L. Luo, C. Deng, and H. Huang, ‘‘Multi-level metric learning via

smoothed Wasserstein distance,’’ in Proc. IJCAI, 2018, pp. 2919–2925.

[10] S. Kolouri, K. Nadjahi, U. Simsekli, R. Badeau, and G. Rohde, ‘‘General-

ized sliced Wasserstein distances,’’ in Proc. NeurIPS, 2019, pp. 261–272.

[11] M. Cuturi, ‘‘Sinkhorn distances: Lightspeed computation of optimal trans-

port,’’ in Proc. NIPS, 2013, pp. 2292–2300.

[12] A. Genevay, M. Cuturi, G. Peyré, and F. Bach, ‘‘Stochastic optimization

for large-scale optimal transport,’’ in Proc. NIPS, 2016, pp. 3440–3448.

[13] M. Rowland, J. Hron, Y. Tang, K. Choromanski, T. Sarlos, and A. Weller,

‘‘Orthogonal estimation of Wasserstein distances,’’ in Proc. Mach. Learn.

Res., 2019, pp. 186–195.

[14] G. Yao and A. Dani, ‘‘Visual tracking using sparse coding and earth

mover’s distance,’’ 2018, arXiv:1804.02470.


[15] D. Danu, T. Kirubarajan, and T. Lang, ‘‘Wasserstein distance for the fusion

of multisensor multitarget particle ﬁlter clouds,’’ in Proc. ICIF, 2009,

pp. 25–32.

[16] F. S. Danis and A. T. Cemgil, ‘‘Model-based localization and tracking using

Bluetooth low-energy beacons,’’ Sensors, vol. 17, p. 2484, Nov. 2017.

[17] L. Xiao, H. Wang, and Z. Hu, ‘‘Visual tracking via adaptive random

projection based on sub-regions,’’ IEEE Access, vol. 6, pp. 41955–41965,

2018.

[18] T. Zhang, B. Ghanem, S. Liu, and N. Ahuja, ‘‘Robust visual tracking via

multi-task sparse learning,’’ in Proc. CVPR, 2012, pp. 2042–2049.

[19] S. Zhang, H. Zhou, F. Jiang, and X. Li, ‘‘Robust visual tracking using

structurally random projection and weighted least squares,’’ IEEE Trans.

Circuits Syst. Video Technol., vol. 25, no. 11, pp. 1749–1760, Nov. 2015.

[20] M. Danelljan, F. S. Khan, M. Felsberg, and J. Van De Weijer, ‘‘Adap-

tive color attributes for real-time visual tracking,’’ in Proc. CVPR, 2014,

pp. 1090–1097.

[21] B. Li, W. Wu, Q. Wang, F. Zhang, J. Xing, and J. Yan, ‘‘SiamRPN++:

Evolution of Siamese visual tracking with very deep networks,’’ in Proc.

CVPR, 2018, pp. 4282–4291.

[22] J. Valmadre, L. Bertinetto, J. Henriques, A. Vedaldi, and P. H. S. Torr,

‘‘End-to-end representation learning for correlation ﬁlter based tracking,’’

in Proc. CVPR, 2017, pp. 2805–2813.

[23] Z. Zhang and H. Peng, ‘‘Deeper and wider Siamese networks for real-time

visual tracking,’’ in Proc. CVPR, Jun. 2019, pp. 4591–4600.

[24] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr,

‘‘Fully-convolutional Siamese networks for object tracking,’’ in Proc. Eur.

Conf. Comput. Vis., 2016, pp. 850–865.

[25] P. Li, B. Chen, W. Ouyang, D. Wang, X. Yang, and H. Lu, ‘‘GradNet:

Gradient-guided network for visual object tracking,’’ in Proc. ICCV, 2019,

pp. 6162–6171.

[26] Z. Zhu, Q. Wang, B. Li, W. Wu, J. Yan, and W. Hu, ‘‘Distractor-aware

Siamese networks for visual object tracking,’’ in Proc. ECCV, 2018,

pp. 101–117.

[27] J. Choi, J. Kwon, and K. M. Lee, ‘‘Deep meta learning for real-time target-

aware visual tracking,’’ in Proc. ICCV, 2019, pp. 911–920.

[28] G. Bhat, M. Danelljan, L. V. Gool, and R. Timofte, ‘‘Learning discrimina-

tive model prediction for tracking,’’ in Proc. ICCV, 2019, pp. 6182–6191.

[29] Q. Guo, W. Feng, C. Zhou, R. Huang, L. Wan, and S. Wang, ‘‘Learning

dynamic Siamese network for visual object tracking,’’ in Proc. ICCV, 2017,

pp. 1763–1771.

[30] X. Li, Q. Liu, N. Fan, Z. Zhou, Z. He, and X. Jing, ‘‘Dual-regression model

for visual tracking,’’ Neural Netw., vol. 132, pp. 364–374, Oct. 2020.

[31] N. Fan, Q. Liu, X. Li, Z. Zhou, and Z. He, ‘‘Interactive convolutional

learning for visual tracking,’’ Knowl. Based Syst., vol. 214, Oct. 2021,

Art. no. 106724.

[32] Q. Liu, X. Li, Z. He, N. Fan, D. Yuan, and H. Wang, ‘‘Learning deep

multi-level similarity for thermal infrared object tracking,’’ IEEE Trans.

Multimedia, vol. 23, pp. 2114–2126, 2021.

[33] Q. Liu, X. Li, Z. He, N. Fan, D. Yuan, W. Liu, and Y. Liang, ‘‘Multi-task

driven feature models for thermal infrared tracking,’’ in Proc. AAAI, 2020,

pp. 11604–11611.

[34] Q. Liu, X. Lu, Z. He, C. Zhang, and W. Chen, ‘‘Deep convolutional

neural networks for thermal infrared object tracking,’’ Knowl. Based Syst.,

vol. 134, pp. 189–198, Jun. 2017.

[35] M. P. Muresan and S. Nedevschi, ‘‘Multi-object tracking of 3D cuboids

using aggregated features,’’ in Proc. ICCP, 2019, pp. 11–18.

[36] H. Karunasekera, H. Wang, and H. Zhang, ‘‘Multiple object tracking with

attention to appearance, structure, motion and size,’’ IEEE Access, vol. 7,

pp. 104423–104434, 2019.

[37] G. Brasó and L. Leal-Taixé, ‘‘Learning a neural solver for multiple object

tracking,’’ in Proc. CVPR, 2020, pp. 6247–6257.

[38] M. Staib, S. Claici, J. M. Solomon, and S. Jegelka, ‘‘Parallel streaming

Wasserstein barycenters,’’ in Proc. NIPS, 2017, pp. 2292–2300.

[39] J. Rabin, G. Peyré, J. Delon, and M. Bernot, ‘‘Wasserstein barycenter and

its application to texture mixing,’’ Lect. Notes Comput. Sci., vol. 6667,

pp. 435–446, Mar. 2011.

[40] H. van Hasselt, ‘‘Double Q-learning,’’ in Proc. NIPS, 2010, pp. 2613–2621.

[41] F. Pitié, A. C. Kokaram, and R. Dahyot, ‘‘Automated colour grading using

colour distribution transfer,’’ CVIU, vol. 107, nos. 1–2, pp. 123–137, 2007.

[42] K. Choromanski, M. Rowland, W. Chen, and A. Weller, ‘‘Unifying orthog-

onal Monte Carlo methods,’’ in Proc. ICML, 2019, pp. 1203–1212.

[43] W. Givens, ‘‘Computation of plane unitary rotations transforming a general

matrix to triangular form,’’ J. Soc. Ind. Appl. Math., vol. 6, no. 1, pp. 26–50,

1958.

[44] N. S. Pillai and A. Smith, ‘‘Kac’s walk on n-sphere mixes in n log n

steps,’’ Ann. Appl. Probab., vol. 27, no. 1, pp. 631–650, 2017.

[45] M. Danelljan, G. Bhat, F. Shahbaz Khan, and M. Felsberg, ‘‘ECO: Efficient

convolution operators for tracking,’’ in Proc. CVPR, 2017, pp. 6638–6646.

[46] X. Li, C. Ma, B. Wu, Z. He, and M.-H. Yang, ‘‘Target-aware deep track-

ing,’’ in Proc. CVPR, 2019, pp. 1369–1378.

[47] R. Tao, E. Gavves, and A. W. Smeulders, ‘‘Siamese instance search for

tracking,’’ in Proc. CVPR, 2016, pp. 1420–1429.

[48] M. Danelljan, A. Robinson, F. Khan, and M. Felsberg, ‘‘Beyond correlation

ﬁlters: Learning continuous convolution operators for visual tracking,’’ in

Proc. ECCV, 2016, pp. 472–488.

[49] S. Pu, Y. Song, C. Ma, H. Zhang, and M.-H. Yang, ‘‘Deep attentive tracking

via reciprocative learning,’’ in Proc. NIPS, 2018, pp. 1–8.

[50] Y. Wu, J. Lim, and M.-H. Yang, ‘‘Online object tracking: A benchmark,’’

in Proc. CVPR, 2013, pp. 2411–2418.

[51] Y. Zhang, L. Wang, J. Qi, D. Wang, M. Feng, and H. Lu, ‘‘Structured

Siamese network for real-time visual tracking,’’ in Proc. ECCV, 2018,

pp. 351–366.

[52] L. Huang, X. Zhao, and K. Huang, ‘‘GlobalTrack: A simple and strong

baseline for long-term tracking,’’ in Proc. AAAI, 2019, pp. 351–366.

[53] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg, ‘‘Atom: Accurate

tracking by overlap maximization,’’ in Proc. CVPR, 2019, pp. 4660–4669.

[54] B. Yan, H. Zhao, D. Wang, H. Lu, and X. Yang, ‘‘‘Skimming-perusal’

tracking: A framework for real-time and robust long-term tracking,’’ in

Proc. ICCV, 2019, pp. 2385–2393.

[55] T. Xu, Z. Feng, X. Wu, and J. Kittler, ‘‘Joint group feature selection and

discriminative ﬁlter learning for robust visual object tracking,’’ in Proc.

ICCV, 2019, pp. 7950–7960.

[56] K. Dai, D. Wang, H. Lu, C. Sun, and J. Li, ‘‘Visual tracking via

adaptive spatially-regularized correlation ﬁlters,’’ in Proc. CVPR, 2019,

pp. 4670–4679.

[57] F. Li, C. Tian, W. Zuo, L. Zhang, and M. H. Yang, ‘‘Learning spatial-

temporal regularized correlation ﬁlters for visual tracking,’’ in Proc. CVPR,

2018, pp. 4904–4913.

[58] H. Kiani Galoogahi, A. Fagg, and S. Lucey, ‘‘Learning background-

aware correlation ﬁlters for visual tracking,’’ in Proc. CVPR, 2017,

pp. 1135–1143.

[59] H. Fan, L. Lin, F. Yang, P. Chu, G. Deng, S. Yu, H. Bai, Y. Xu, C. Liao, and

H. Ling, ‘‘LaSOT: A high-quality benchmark for large-scale single object

tracking,’’ 2018, arXiv:2009.03465.

[60] E. Gundogdu and A. A. Alatan, ‘‘Good features to correlate for visual

tracking,’’ 2017, arXiv:1704.06326.

[61] C. Sun, D. Wang, H. Lu, and M.-H. Yang, ‘‘Learning spatial-aware regres-

sions for visual tracking,’’ in Proc. CVPR, 2018, pp. 8962–8970.

[62] Z. He, Y. Fan, J. Zhuang, Y. Dong, and H. Bai, ‘‘Correlation ﬁlters with

weighted convolution responses,’’ in Proc. ICCVW, 2017, pp. 1992–2000.

YOUJIN KIM received the B.S. degree in inte-

grative engineering from Chung-Ang University,

Seoul, South Korea, in 2021, where she is currently

pursuing the M.S. degree in artiﬁcial intelligence.

Her research interests include generative models, graph models, and deep neural networks.

JUNSEOK KWON (Member, IEEE) received

the B.Sc. degree, the M.Sc. degree in the

topic of object tracking (supervised by Prof.

Kyoung Mu Lee), and the Ph.D. degree in elec-

trical engineering and computer science from

Seoul National University, South Korea, in 2006,

2008, and 2013, respectively. He was a Postdoc-

toral Researcher under Prof. Luc Van Gool with

the Computer Vision Laboratory, ETH Zurich,

from 2013 to 2014. He is currently an Asso-

ciate Professor with the School of Computer Science and Engineering,

Chung-Ang University, Seoul, South Korea. He is working in the ﬁeld of

object tracking to capture the dynamics of cities. His research interests

include visual tracking, visual surveillance, and Monte Carlo sampling methods and their variants.
