Available via license: CC BY 4.0
New Improvements in Solving Large LABS Instances Using Massively Parallelizable
Memetic Tabu Search
Zhiwei Zhang,1 Jiayu Shen,1 Niraj Kumar,1,∗ and Marco Pistoia1,†
1 Global Technology Applied Research, JPMorgan Chase, New York, NY 10017, USA
(Dated: April 2, 2025)
The Low Autocorrelation Binary Sequences (LABS) problem is a particularly challenging binary
optimization problem: finding the global optimum quickly becomes intractable for problem sizes
beyond 66. This makes LABS an appealing test-bed for meta-heuristic optimization solvers that
target large problem sizes. In this work, we introduce a massively parallelized implementation
of the memetic tabu search algorithm to tackle the LABS problem for sizes up to 120. By
combining block-level and thread-level parallelism within a single Nvidia A100 GPU, and by
creating highly optimized binary-valued data structures for the shared memory of each thread block,
we showcase up to a 26-fold speedup compared to the analogous 16-core CPU implementation. Our
implementation has also enabled us to find new LABS merit factor values for twelve problem
sizes between 92 and 118. Crucially, we also showcase improved values for two odd problem sizes,
{99, 107}, whose previous best-known results coincided with the provably optimal skew-symmetric
sequences. Consequently, our results highlight the importance of general-purpose
solvers for LABS, since relying on skew-symmetry could lead to sub-optimal solutions.
INTRODUCTION
Meta-heuristic methods, also known as local-neighborhood search methods, have become pivotal
in advancing the solution of combinatorial optimization problems [5]. Methods such as
evolutionary algorithms, simulated annealing, and tabu search offer flexible frameworks
adaptable to a wide range of problems, often providing near-optimal solutions in
reasonable computational time [2,3,20,36]. Among these meta-heuristics, memetic tabu search
has shown remarkable efficacy in navigating large and complex landscapes [34]. This hybrid
approach combines the global search capabilities of memetic evolutionary algorithms with
the local stochastic search of tabu search, which is equipped with adaptive memory, to
enhance solution quality and convergence speed. This technique has been applied with great
success to a variety of problems: graph coloring [14], vehicle routing problems [12],
traveling salesman problems [15,18], quadratic assignment problems [11,15,25], large
optimization problems on hundreds of heterogeneous machines [15,35], and many others.
A literature survey on memetic tabu search can be found in [37].
Meta-heuristics stand out from other optimization techniques for several reasons. First, they
are gradient-free methods that allow moves based on acceptance-rejection criteria, which
helps avoid stopping early at suboptimal solutions, in contrast to gradient-descent-based
methods [19]. Second, they balance exploration and exploitation of candidate solutions,
allowing for a more thorough examination of the search landscape. Finally, parallelizing
these techniques with random initial starts further amplifies their capability, enabling
rapid exploration of the search space. By distributing the search across multiple processors
or cores, parallel meta-heuristics, memetic tabu search in particular, efficiently explore
diverse regions of the solution space, thereby increasing the likelihood of identifying
high-quality solutions [13].
∗ Corresponding Author. Email: niraj.x7.kumar@jpmchase.com
† Principal Investigator. Email: marco.pistoia@jpmchase.com
Like meta-heuristics, global optimization methods such as branch-and-bound can also be
employed to tackle these optimization problems, with rigorous guarantees of finding the
optimal solutions [24,27]. However, for many large-scale problems they are computationally
prohibitive because of their exhaustive search nature. In contrast, owing to their
versatility and ease of parallelization, meta-heuristics often converge rapidly to
high-quality solutions without exhaustive enumeration, outperforming global solvers in
terms of speed and scaling with the problem size.
In this work, we implement a massively parallelized memetic tabu search algorithm to solve
the Low Autocorrelation Binary Sequences (LABS) problem [6]. Characterized by long-range
fourth-order spin interaction terms in its objective function, LABS presents substantial
computational challenges in finding solutions with minimal objective function values, even
for relatively modest problem sizes, due to its non-linear and non-convex nature.
Consequently, it serves as an excellent testbed for evaluating the performance of
optimization solvers, particularly those based on meta-heuristic approaches. The LABS
problem is significant in various fields, including telecommunications and physics, where
it is used to design sequences with minimal autocorrelation [4,6,22,32].
Memetic tabu search is known to exhibit a particularly competitive runtime scaling for
LABS [19]. Our efficient GPU parallelization builds upon this technique to find the
best-known LABS solutions up to problem size 120 using a single Nvidia A100 GPU with 6912
CUDA cores, 80 GB memory, and 164 KB shared memory per streaming multiprocessor, within
48 hours. Crucially, our approach has enabled finding new LABS sequences for twelve problem
sizes between 92 and 118 with better objective values than any previously reported
solutions [8]. Moreover, for the two odd problem sizes among them, our improved objective
values surpass those achieved by the provably optimal skew-symmetric sequences, which were
previously the best reported in the literature [7]. Our findings suggest that relying solely
on the skew-symmetric property of LABS to constrain the search space may result in
sub-optimal solutions. Consequently, this work highlights the importance of enhancing
general-purpose solvers to effectively address this particularly challenging problem.
arXiv:2504.00987v1 [cs.DC] 1 Apr 2025
[Figure 1: plot of merit factor versus sequence length N, showing the "Old Merit Factor" and "New Merit Factor" for each improved N; the values are listed in Table II.]
Figure 1: Improvements in the LABS merit factor (MF) objective value using our massively GPU-parallelized memetic
tabu search for 12 problem sizes between N = 92 and N = 118, improving the MF range from [7.95, 8.84] to
[8.36, 9.08]. The new sequences for odd N = 99, 107 improve the MF over the optimal skew-symmetric sequences.
The remainder of the paper is structured as follows. Section I provides a succinct summary
of our key results. We describe the LABS problem, along with its hardness, underlying
symmetries, and existing methods, in Section II. The memetic tabu methodology and our GPU
parallelization technique are discussed in Sections III and IV, respectively. We detail our
results, including the new MF values, scaling, and deviation from skew-symmetry, in
Section V. We conclude in Section VI.
I. SUMMARY OF RESULTS
The primary contributions of our work are as follows:
1. A careful design of a two-level (block and thread)
parallelism framework that allows an optimized
implementation of memetic tabu search for the LABS
problem, based on an embarrassingly parallel
framework on GPUs.
2. Exploitation of fast shared memory in the Nvidia
A100 GPU by storing the algorithm's binary-valued
data structures compactly, enabling maximum GPU
utilization. This results in a shared memory
footprint of 5 KB, sufficient to run LABS for problem
sizes up to 187.
3. Twelve new LABS merit factor values for both odd
and even sequence lengths N = 92, 98, 99, 104, 106,
107, 108, 110, 112, 114, 116, and 118 (see Figure 1
and Table II).
4. Analysis and evidence that our new sequences deviate
from skew-symmetric sequences and are thus unlikely
to be found by first applying a skew-symmetric search
and then an unrestricted search.
5. A GPU implementation achieving up to a 26-fold
speedup in time-to-solution compared to a 16-core
CPU implementation (Figure 4).
II. PROBLEM DESCRIPTION
We begin by formally describing the LABS problem.
Consider a binary sequence of length $N$, $S = s_1 s_2 \cdots s_N$,
with spin variables $s_i \in \{-1, 1\}$ for $1 \le i \le N$. Define the
aperiodic autocorrelation of the elements in sequence $S$ at
distance $k$ as
$$C_k(S) = \sum_{i=1}^{N-k} s_i s_{i+k}. \tag{1}$$
The objective of LABS is to find the assignment of $S$ that
minimizes the corresponding energy function, expressed as the
sum of the squared autocorrelations,
$$S^* = \arg\min_{S \in \{-1,1\}^N} E_N(S), \qquad E_N(S) = \sum_{k=1}^{N-1} C_k^2(S), \tag{2}$$
or, equivalently, maximizes the merit factor (MF),
$$S^* = \arg\max_{S \in \{-1,1\}^N} F_N(S), \qquad F_N(S) = \frac{N^2}{2 E_N(S)}. \tag{3}$$
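As an illustration of Eqs. (1)-(3), the following C++ sketch evaluates the autocorrelations, energy, and merit factor of a sequence stored as a vector of ±1 values. The function names (autocorrelation, labsEnergy, meritFactor) are illustrative choices and are not taken from the implementation described in this paper; the code uses 0-based indices, whereas the equations are 1-based.

```cpp
#include <cassert>
#include <vector>

// Aperiodic autocorrelation C_k(S) of Eq. (1); seq holds +1/-1 entries.
int autocorrelation(const std::vector<int>& seq, int k) {
    const int n = static_cast<int>(seq.size());
    assert(k >= 1 && k < n);
    int c = 0;
    for (int i = 0; i + k < n; ++i) c += seq[i] * seq[i + k];
    return c;
}

// Energy E_N(S) of Eq. (2): sum of squared autocorrelations over all lags.
long long labsEnergy(const std::vector<int>& seq) {
    const int n = static_cast<int>(seq.size());
    long long e = 0;
    for (int k = 1; k < n; ++k) {
        const long long c = autocorrelation(seq, k);
        e += c * c;
    }
    return e;
}

// Merit factor F_N(S) of Eq. (3). For N >= 2 the energy is at least 1
// because C_{N-1} is always +1 or -1, so the division is safe.
double meritFactor(const std::vector<int>& seq) {
    assert(seq.size() >= 2);
    const double n = static_cast<double>(seq.size());
    return n * n / (2.0 * static_cast<double>(labsEnergy(seq)));
}
```

For the length-13 Barker sequence +++++--++-+-+, this evaluation gives E = 6 and F = 169/12 ≈ 14.08.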
A. Hardness of LABS and Existing Methods
As a binary combinatorial optimization problem,
LABS has a search space of size $2^N$ for a problem of size
$N$. The particular hardness of LABS comes from
the fourth-order dependency among the spin variables
$s_i, s_j, s_k, s_l$ of $S$ in the energy function or, equivalently,
in the MF expression. As highlighted by the authors
of [16], the LABS energy function (and equivalently
the MF) has $O(N^2)$ non-zero quadratic and $O(N^3)$
non-zero fourth-order spin interactions. This makes
the search landscape of LABS equivalent to a dilute
4-spin Ising glass model with exponentially (in $N$)
many local minima, while the global minima are
extremely isolated, deep, and narrow, resembling
"golf holes". We visually depict the spread of local and
global minima for $N = 12, 15, 18$ in Figure 2 to further
illustrate this point. This explains the inability of any
known method to reach the asymptotic merit factor of
$F_N \approx 12.3248$ as $N \to \infty$, as proposed by Golay using
the ergodicity postulate [23] (see footnote 1).
With regard to provably optimal sequences, the
branch-and-bound technique has been employed to find
sequences and prove their optimality for LABS with $N \le 66$
[29]. While this is a promising result, the reported
runtime scaling of this global optimization method is
$O(1.73^N)$, which quickly renders it infeasible for large $N$.
For context, the wall-clock time reported by [29] to
solve $N = 66$ was roughly 55 days on a Linux cluster
comprising a total of 248 (virtual) cores.
1 The largest LABS experiment reported to date is for
$N = 1010$ with $F_N = 6.367$, which is a significant departure from
Golay's postulate for large $N$.
In parallel, a suite of meta-heuristic methods has
also been applied to LABS over the past decades.
While these methods do not provide theoretical guarantees
on the optimality of the solution reached, they usually
reach a given target objective value faster. Among
the known general-purpose meta-heuristics for LABS,
Kernighan–Lin [10] exhibits a runtime scaling of
$O(1.463^N)$, the evolutionary-strategies algorithm [10]
exhibits $O(1.397^N)$, and memetic tabu search [19]
exhibits $O(1.34^N)$. In addition, the quantum approximate
optimization algorithm combined with quantum
minimum finding has been applied to LABS up to
problem size 40 with a scaling of $O(1.21^N)$, providing
evidence of a scaling advantage for the quantum
algorithm [33].
B. Sequence Symmetries of LABS
LABS has two exact symmetries, i.e., operations on a
sequence that preserve the objective function value exactly [7]:
1. complementation: $s_i \mapsto -s_i, \ \forall i \in [N]$
2. reversal: $s_i \mapsto s_{N+1-i}, \ \forall i \in [N]$
Respecting these symmetries during optimization yields
a fourfold reduction of the search space. We note that a
straightforward way to exploit the complementation
symmetry is to fix the first spin variable in the sequence,
$s_1 = 1$, which trivially halves the search space.
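The two symmetries can be checked numerically with a short sketch. The helper names below (energyOf, complemented, reversedCopy, fixFirstSpin) are hypothetical, and energyOf is a direct O(N^2) evaluation of Eq. (2) rather than an optimized routine.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Direct evaluation of the LABS energy, Eq. (2), for a +1/-1 sequence.
long long energyOf(const std::vector<int>& s) {
    long long e = 0;
    for (std::size_t k = 1; k < s.size(); ++k) {
        long long c = 0;
        for (std::size_t i = 0; i + k < s.size(); ++i) c += s[i] * s[i + k];
        e += c * c;
    }
    return e;
}

// Complementation: s_i -> -s_i for every i.
std::vector<int> complemented(std::vector<int> s) {
    for (int& x : s) x = -x;
    return s;
}

// Reversal: s_i -> s_{N+1-i} for every i.
std::vector<int> reversedCopy(std::vector<int> s) {
    std::reverse(s.begin(), s.end());
    return s;
}

// Representative with the first spin fixed to +1, which removes the
// complementation symmetry from the search space.
std::vector<int> fixFirstSpin(std::vector<int> s) {
    assert(!s.empty());
    return (s[0] < 0) ? complemented(s) : s;
}
```

Both operations leave every squared autocorrelation unchanged: complementation multiplies each product s_i s_{i+k} by (-1)^2 = 1, and reversal only reorders the terms of each sum.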
C. Skew-Symmetry of LABS
Another class of restrictions on the LABS sequence
structure is skew-symmetry for problems of odd length,
as introduced by Golay [21]. Formally, for an odd length
$N = 2n + 1$, skew-symmetry is defined as
$$s_{n+i} = (-1)^i s_{n-i}, \qquad i \in [n]. \tag{4}$$
As shown by [17], sequences with this restriction have
aperiodic autocorrelation values $C_k = 0$ for all odd $k$
(for even $k$, $C_k$ is a sum of an odd number of $\pm 1$ terms
and therefore cannot vanish). Restricting the search to
skew-symmetric sequences significantly reduces the
computational complexity of solving LABS because the
skew-symmetric subspace has size $2^{(N+1)/2}$, as opposed
to $2^N$ for the unrestricted search. This has been exploited
to find the optimal skew-symmetric solutions for odd
$N \le 119$ using the branch-and-bound technique with a
complexity of $O(1.34^N)$ [7], as opposed to the $O(1.73^N)$
exhibited by the unrestricted search [29]. Furthermore,
integrating this constraint into meta-heuristic search
methods has significantly enhanced their efficiency for
odd $N$. Specifically, memetic tabu search achieves a
complexity scaling of $O(1.16^N)$, as opposed to $O(1.34^N)$
for an unrestricted search [7,19]. Additionally,
self-avoiding walks demonstrate a scaling of $O(1.15^N)$ [7],
and the stochastic search method xLastovka exhibits a
scaling of $O(1.18^N)$ [9]. Inspired by this idea, efforts have
also been directed towards finding LABS sequences with
improved objective values for even $N \gtrsim 150$, through the
concepts of quasi-skew-symmetry and sequence operators [17],
and through dual-step optimization that combines
skew-symmetric search with unrestricted search [31].
Figure 2: 2-D projections of the solution spaces of the LABS problem for $N = 12, 15, 18$. All $2^N$ sequences are mapped
by the Uniform Manifold Approximation and Projection (UMAP) algorithm [28], approximately depicting the locality
and global structure of the solution space. A sequence is marked as a local optimum if its merit factor is strictly higher
than that of all its $N$ neighbors. The number of local optima grows exponentially [16], surrounding a limited number of
isolated global optima, making LABS quickly intractable as $N$ grows.
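The skew-symmetry condition of Eq. (4) and the resulting halving of the number of free spins can be sketched as follows. This is a minimal illustration under an assumption that should be stated plainly: the sequence is stored with 0-based indices s[0..2n], so that the center element is s[n] and Eq. (4) relates s[n+i] to s[n-i] for i = 1, ..., n. The function names are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Checks Eq. (4) on a 0-based array s[0..2n] whose center element is s[n].
bool isSkewSymmetric(const std::vector<int>& s) {
    const int len = static_cast<int>(s.size());
    assert(len % 2 == 1);
    const int n = len / 2;
    for (int i = 1; i <= n; ++i) {
        const int sign = (i % 2 == 0) ? 1 : -1;
        if (s[n + i] != sign * s[n - i]) return false;
    }
    return true;
}

// Builds the full sequence from its (N+1)/2 free spins half = s[0..n].
std::vector<int> expandSkewSymmetric(const std::vector<int>& half) {
    assert(!half.empty());
    const int n = static_cast<int>(half.size()) - 1;
    std::vector<int> s(2 * n + 1);
    for (int i = 0; i <= n; ++i) s[i] = half[i];
    for (int i = 1; i <= n; ++i) s[n + i] = ((i % 2 == 0) ? 1 : -1) * half[n - i];
    return s;
}

// Aperiodic autocorrelation at lag k, Eq. (1).
int lagCorrelation(const std::vector<int>& s, int k) {
    const int len = static_cast<int>(s.size());
    assert(k >= 1 && k < len);
    int c = 0;
    for (int i = 0; i + k < len; ++i) c += s[i] * s[i + k];
    return c;
}
```

The length-13 Barker sequence is skew-symmetric under this convention, and its autocorrelations vanish at every odd lag, so only the even lags contribute to its energy.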
However, skew-symmetry is not necessarily an exact
symmetry of the optimal solutions for odd $N$, i.e., for some
odd values of $N$, the optimal solutions do not exhibit this
property. This is evident from the fact that, out of the 32 odd
values of $N$ within $3 \le N \le 66$ reported by [29], only
16 values of $N$ have skew-symmetric optimal solutions.
Additionally, 6 values of $N$ have both skew-symmetric and
non-skew-symmetric optimal solutions. Crucially,
10 values of $N$ do not exhibit skew-symmetry in their
optimal solutions. Hence, restricting the search space to
satisfy this constraint does not guarantee the optimal
solution for odd values of $N$.
In this work, we show that, without imposing the
skew-symmetry constraint, we find sequences for two
odd sizes, $N = 99, 107$, with objective values better than
those of the optimal skew-symmetric sequences.
Further, we analyze the results of Ref. [31] to show
that their newly found even-$N$ sequences, obtained using
skew-symmetric search followed by unrestricted search,
are still relatively close to a skew-symmetric sequence.
This is in contrast to the large deviation from skew-symmetry
in our newly found sequences for even $N$ with the
general-purpose memetic tabu solver. This may imply that
starting from neighborhoods of skew-symmetric sequences
makes the search more likely to become trapped, and that a
full-space search may still be useful, especially given the
search landscape of the LABS problem.
III. METHODOLOGY
To make this paper self-contained, we revisit the
methodology of the memetic algorithm integrated with
tabu search for addressing the LABS problem, as pro-
posed in [19]. In this section, we extend this approach
by efficiently implementing the parallelization on GPUs.
This hybrid approach combines the global search capabil-
ities inherent in the memetic evolutionary algorithms [3]
with local stochastic search of tabu search. The memetic
algorithm operates by maintaining a population of binary
sequences that are initialized randomly. During each it-
eration, it generates a new sequence, referred to as a child
sequence, through probabilistic recombination and muta-
tion processes. This child sequence subsequently serves
as the initial point for the tabu search algorithm. Tabu
search equips the greedy local search method with a tabu
list to avoid flipping recently altered bits, thus facilitating
a more thorough exploration of the search space [20]. The
combination of memetic and tabu search allows for effec-
tive exploration and exploitation of the solution space,
balancing between diversification and intensification.
A. Memetic Algorithms
The pseudocode for the memetic tabu algorithm used
in this work is shown in Algorithm 1. The algorithm
starts by initializing a population of $K$ random binary
sequences and creating a bestSeq register that stores the
sequence with the lowest energy among the $K$ sequences.
Subsequently, in each iteration, with probability
$p_{\mathrm{comb}}$, the algorithm selects two parent sequences
and recombines them to produce a child sequence;
otherwise, the child is sampled directly from the population.
Next, each bit of the child sequence is mutated with probability
$p_{\mathrm{mutate}}$. The resulting sequence is then passed to the tabu
local search algorithm for further exploration, which
outputs the new child sequence. If the energy of
this new child sequence is lower than the bestSeq
energy, bestSeq is updated. Finally, a random
individual in the population is replaced by the new
child sequence returned by the local search. For
this work, we set the memetic algorithm parameters to
$K = 100$, $p_{\mathrm{comb}} = 0.9$, and $p_{\mathrm{mutate}} = 1/N$, which is
consistent with [19].
Algorithm 1 Memetic-Tabu(N, targetE)
    population ← K random binary sequences of length N
    bestSeq ← sequence with lowest energy from population
    while E(bestSeq) > targetE do
        if randReal(0,1) < pcomb then
            parent1, parent2 ← selectParents(population)
            child ← Combine(parent1, parent2)
        else
            child ← Sample(population)
        end if
        Mutate(child, pmutate)
        child ← Tabu-Search(child)
        update bestSeq to child if E(bestSeq) > E(child)
        replace a random individual in population by child
    end while
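The child-generation step of Algorithm 1 (recombination or sampling, followed by mutation) can be sketched as below. The text does not specify how selectParents and Combine operate, so this sketch assumes uniformly random parent selection and uniform crossover purely for illustration; these choices, and all function names, are assumptions rather than the operators used in this work or in [19].

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

using Seq = std::vector<int>;  // entries are +1 or -1

// ASSUMPTION: uniform crossover, each spin copied from either parent
// with equal probability. The paper does not state the operator.
Seq combineUniform(const Seq& a, const Seq& b, std::mt19937& rng) {
    assert(a.size() == b.size());
    std::bernoulli_distribution coin(0.5);
    Seq child(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) child[i] = coin(rng) ? a[i] : b[i];
    return child;
}

// Flip each spin independently with probability pMutate (1/N in the text).
void mutate(Seq& s, double pMutate, std::mt19937& rng) {
    std::bernoulli_distribution flip(pMutate);
    for (int& x : s) {
        if (flip(rng)) x = -x;
    }
}

// Produces the child that would be handed to tabu search.
// ASSUMPTION: parents are drawn uniformly at random from the population.
Seq makeChild(const std::vector<Seq>& population, double pComb,
              double pMutate, std::mt19937& rng) {
    assert(!population.empty());
    std::uniform_int_distribution<std::size_t> pick(0, population.size() - 1);
    std::uniform_real_distribution<double> unit(0.0, 1.0);
    Seq child;
    if (unit(rng) < pComb) {
        const std::size_t p1 = pick(rng);
        const std::size_t p2 = pick(rng);
        child = combineUniform(population[p1], population[p2], rng);
    } else {
        child = population[pick(rng)];
    }
    mutate(child, pMutate, rng);
    return child;
}
```

With the parameters quoted above, a call would look like makeChild(population, 0.9, 1.0 / N, rng), and the returned child would then be passed to tabu search.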
B. Tabu Search
Algorithm 2 Tabu-Search(seq)
    initialize tabuList[1, ..., N] = 0 and bestSeqTS = seq
    pivot ← seq
    for t = 1, 2, ..., maxIter do
        /* Evaluate the energy of all N neighbors of pivot; break ties randomly. */
        i ← arg min of E(flip(pivot, i)) over i = 1, ..., N with tabuList[i] < t
        pivot ← flip(pivot, i)
        tabuList[i] = t + randomInt(minTabu, maxTabu)
        update tableC and vectorC
        update bestSeqTS to pivot if E(bestSeqTS) > E(pivot)
    end for
    return bestSeqTS
Tabu search takes a sequence from the memetic algorithm
and runs a greedy local search on it, subject to the
restrictions given by the tabu list. The tabu list maps
each coordinate to the iteration number after which that
coordinate may be flipped again. In each iteration, the
algorithm evaluates the energy of all $N$ neighbors (see
footnote 2) of the current search pivot and greedily selects
the best bit to flip among those not forbidden by the tabu
list in this iteration. After a flip is made, the bit is
forbidden from changing again for a random number of
iterations. The algorithm returns the lowest-energy sequence
visited across all iterations. The details of the tabu search
procedure are shown in Algorithm 2. The hyperparameter
settings from [19] are maxIter = randInt[N/2, 3N/2],
minTabu = ⌊0.1 × maxIter⌋, and maxTabu = ⌊0.12 × maxIter⌋.
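A sequential sketch of the tabu search loop described above is given next. For clarity it evaluates each neighbor's energy directly in O(N^2) instead of using the incremental data structures, and it selects the lowest-energy non-tabu neighbor with random tie-breaking. The names (energyOf, tabuSearch, tabuUntil) are illustrative, and the code is a simplified reading of Algorithm 2, not the implementation used in this work.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <random>
#include <vector>

using Seq = std::vector<int>;  // entries are +1 or -1

// Direct O(N^2) evaluation of the LABS energy, Eq. (2).
long long energyOf(const Seq& s) {
    long long e = 0;
    for (std::size_t k = 1; k < s.size(); ++k) {
        long long c = 0;
        for (std::size_t i = 0; i + k < s.size(); ++i) c += s[i] * s[i + k];
        e += c * c;
    }
    return e;
}

// Simplified tabu search: returns the lowest-energy sequence visited.
Seq tabuSearch(const Seq& start, std::mt19937& rng) {
    const int n = static_cast<int>(start.size());
    assert(n >= 2);
    std::uniform_int_distribution<int> iterDist(n / 2, 3 * n / 2);
    const int maxIter = iterDist(rng);
    const int minTabu = static_cast<int>(0.1 * maxIter);   // floor for non-negative values
    const int maxTabu = static_cast<int>(0.12 * maxIter);
    std::uniform_int_distribution<int> tabuDist(minTabu, maxTabu);

    std::vector<int> tabuUntil(n, 0);  // bit i may be flipped when tabuUntil[i] < t
    Seq pivot = start;
    Seq best = start;
    long long bestE = energyOf(best);

    for (int t = 1; t <= maxIter; ++t) {
        long long moveE = std::numeric_limits<long long>::max();
        std::vector<int> candidates;  // all non-tabu bits reaching moveE
        for (int i = 0; i < n; ++i) {
            if (tabuUntil[i] >= t) continue;
            pivot[i] = -pivot[i];
            const long long e = energyOf(pivot);
            pivot[i] = -pivot[i];
            if (e < moveE) {
                moveE = e;
                candidates.assign(1, i);
            } else if (e == moveE) {
                candidates.push_back(i);
            }
        }
        if (candidates.empty()) continue;  // every bit is currently tabu
        std::uniform_int_distribution<std::size_t> tie(0, candidates.size() - 1);
        const int i = candidates[tie(rng)];
        pivot[i] = -pivot[i];
        tabuUntil[i] = t + tabuDist(rng);
        if (moveE < bestE) {
            bestE = moveE;
            best = pivot;
        }
    }
    return best;
}
```

Because the best sequence is initialized to the input, the returned energy never exceeds the starting energy, even though the pivot itself may move uphill whenever all improving moves are tabu or absent.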
C. Data Structures for Efficient Energy Evaluation
Naively computing the energy of a sequence according
to Eq. 2 takes $O(N^2)$ time, so checking all neighbors
of a pivot would take $O(N^3)$, which can be
computationally prohibitive. Several data structures have
been proposed to reduce each neighbor energy evaluation
to $O(N)$ and checking all neighbors to $O(N^2)$. In our
framework we use the data structures proposed in [19],
named tableC and vectorC, as shown in Table I. By
maintaining these data structures for the pivot sequence
in tabu search, the energy of each neighbor can be
computed in linear time. Constructing the structures
takes $O(N^2)$ time, while updating them after flipping a
bit takes linear time, since only $O(N)$ entries (at most
two per row) need to change.
tableC(S)                    vectorC(S)
s1s2  s2s3  s3s4  s4s5       s1s2 + s2s3 + s3s4 + s4s5
s1s3  s2s4  s3s5             s1s3 + s2s4 + s3s5
s1s4  s2s5                   s1s4 + s2s5
s1s5                         s1s5
Table I: The binary, upper-left triangular matrix tableC
and the integer vector vectorC, shown for N = 5.
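The role of tableC and vectorC in the linear-time neighbor evaluation can be sketched as follows. Flipping spin j negates exactly those products s_i s_{i+k} that contain s_j, which are at most two per lag k, so each C_k changes by minus twice the affected entries. The struct and function names below are illustrative, and the table is stored as plain integers here for readability rather than in the compact bit-vector form discussed later.

```cpp
#include <cassert>
#include <vector>

struct LabsTables {
    // tableC[k][i] = s[i] * s[i + k] for lag k = 1..N-1 (row 0 is unused).
    std::vector<std::vector<int>> tableC;
    // vectorC[k] = C_k, the row sum of tableC[k] (entry 0 is unused).
    std::vector<int> vectorC;
};

// O(N^2) construction from a +1/-1 sequence.
LabsTables buildTables(const std::vector<int>& s) {
    const int n = static_cast<int>(s.size());
    LabsTables t;
    t.tableC.resize(n);
    t.vectorC.assign(n, 0);
    for (int k = 1; k < n; ++k) {
        t.tableC[k].resize(n - k);
        for (int i = 0; i + k < n; ++i) {
            t.tableC[k][i] = s[i] * s[i + k];
            t.vectorC[k] += t.tableC[k][i];
        }
    }
    return t;
}

// Energy of the current sequence, read off vectorC.
long long tableEnergy(const LabsTables& t) {
    long long e = 0;
    for (std::size_t k = 1; k < t.vectorC.size(); ++k) {
        e += static_cast<long long>(t.vectorC[k]) * t.vectorC[k];
    }
    return e;
}

// O(N) energy of the neighbor obtained by flipping spin j (0-based),
// without modifying the tables.
long long flipEnergy(const LabsTables& t, int j) {
    const int n = static_cast<int>(t.vectorC.size());
    assert(j >= 0 && j < n);
    long long e = 0;
    for (int k = 1; k < n; ++k) {
        long long c = t.vectorC[k];
        if (j + k < n) c -= 2 * t.tableC[k][j];        // term s_j * s_{j+k}
        if (j - k >= 0) c -= 2 * t.tableC[k][j - k];   // term s_{j-k} * s_j
        e += c * c;
    }
    return e;
}

// O(N) update of the sequence and both structures after flipping spin j.
void applyFlip(LabsTables& t, std::vector<int>& s, int j) {
    const int n = static_cast<int>(s.size());
    assert(j >= 0 && j < n);
    for (int k = 1; k < n; ++k) {
        if (j + k < n) {
            t.vectorC[k] -= 2 * t.tableC[k][j];
            t.tableC[k][j] = -t.tableC[k][j];
        }
        if (j - k >= 0) {
            t.vectorC[k] -= 2 * t.tableC[k][j - k];
            t.tableC[k][j - k] = -t.tableC[k][j - k];
        }
    }
    s[j] = -s[j];
}
```

In a tabu search iteration, flipEnergy would be called once per candidate bit and applyFlip once for the chosen bit, which gives the O(N^2) cost per iteration quoted above.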
IV. MASSIVE PARALLELISM WITH GPU
Although the memetic tabu algorithm described in
Section III exhibits competitive exponential scaling [19],
previous implementations have been inefficient for two
reasons. First, the time a single run needs to reach the
target value depends heavily on the initialization. To
reach the target value quickly, it is ideal to run many
replicas with different random seeds simultaneously,
which is challenging for CPU implementations because of
the limited number of available cores. Second, previous
implementations have not parallelized any of the
computationally expensive components, such as the
neighbor energy checks and the update of the data
structures.
2 The jth neighbor of a given sequence S is the new sequence
obtained by flipping the jth variable while the other variables
remain the same.
[Figure 3: architecture diagram. The CPU host launches the GPU kernel MemeticTabu⟨#blocks = n, #threads = k⟩. The grid holds thread blocks 1 to n, each with shared memory (tabu list, Table C, population) and threads 1 to k; global memory holds the global termination flag (GLF) and the global best sequence and energy. Per-block procedure: generate a child with the memetic algorithm, run parallel tabu search on it, replace an individual in the population, set GLF if the best-known energy is reached, and terminate if GLF is set. Jobs of a thread: compute the energy of one neighbor of the search pivot, update part of the data structure, initialize part of the population.]
Figure 3: Architecture of our memetic algorithm with tabu search on GPUs. The CPU host launches a
single kernel, MemeticTabu, once, avoiding data transfer and host-device switching. In the kernel, each thread
block runs a replica of the memetic tabu search algorithm. The fast but small shared memory in each block, which can
be accessed by all threads of the block, is fully exploited by storing the data structures in it. In each block, the
computationally heavy steps are parallelized across the block's threads. Early termination is enabled by storing the
global termination flag (GLF in the figure, GTF in the text), as well as the best sequence and its energy, in the
global memory of the GPU.
In this work, we adapt the memetic tabu algorithm to
be massively parallelized and fully executed on GPUs. The
parallelization is implemented at two distinct levels:
1. block-level parallelism, where multiple algorithm
replicas are executed concurrently, each initialized
with a different random seed, across the thread
blocks of a GPU;
2. thread-level parallelism, where computationally
intensive steps of the algorithm within each replica
are executed in parallel by multiple threads within
a single thread block.
We use compact data structures such as bit vectors to
fit all data in the limited but fast shared memory and
exploit its speed advantage over GPU global memory.
A global termination flag is stored in global memory
for early termination when any block reaches the
target value. With this architecture, thousands of thread
blocks can be active simultaneously, quickly exploring
the search space from different initializations. Our
framework for parallelizing the memetic tabu algorithm on a
GPU is illustrated in Figure 3.
A. All-in-GPU: Block-level and Thread-Level
Parallelism
Algorithm 3 Memetic-Tabu-GPU⟨blocks, threadsPerBlock⟩(N, targetE)
    # for each block do in parallel
    population ← K random binary sequences of length N
    bestSeq ← sequence with lowest energy from population
    while E(bestSeq) > targetE do
        if randReal(0,1) < pcomb then
            parent1, parent2 ← selectParents(population)
            child ← Combine(parent1, parent2)
        else
            child ← a random individual from population
        end if
        Mutate(child, pmutate)
        child ← Tabu-Search(child)
        update bestSeq to child if E(bestSeq) > E(child)
        replace a random individual in population by child
        if E(child) ≤ targetE then GTF ← True
        if GTF then return
    end while
Our GPU implementation includes two levels of parallelism.
At the block level, multiple replicas with different
random initializations run in separate thread blocks of a GPU
to increase the probability of reaching the target value. At
the thread level, several key steps of the algorithm are
parallelized across multiple threads within each replica. All the
computational work is done on the GPU, which avoids the
overhead of frequent host-device switching and data
transfer.
Each block of the GPU independently runs a replica of the
memetic tabu algorithm, shown in Algorithm 3, with a
different random seed. Therefore, the large number of blocks
on a GPU can explore a diverse set of regions of the
solution space simultaneously. A global termination flag
(GTF) is placed in the global memory of the GPU, where it
can be accessed by all blocks. When a block reaches the
target value, it sets the GTF so that all blocks can shut
down promptly.
Most of the running time in Algorithm 3, however, is spent
in tabu search, where energy evaluations are needed
frequently. We parallelize tabu search across the
threads of a block, as shown in Algorithm 4. Specifically,
parallelizing the neighbor energy checks and the
data structure updates theoretically reduces the running
time by a factor of threadsPerBlock.
Algorithm 4 Tabu-Search(seq)⟨threads⟩
    initialize tabuList[1, ..., N] = 0 and bestSeqTS = seq
    maxIter ← randomInt(0, N) + ⌊N/2⌋
    pivot ← seq
    for t = 1, 2, ..., maxIter do
        # for each thread do in parallel {
            /* Evaluate the energy of all N neighbors of pivot; break ties randomly. */
            i ← arg min of E(flip(pivot, i)) over i = 1, ..., N with tabuList[i] < t
        }
        pivot ← flip(pivot, i)
        tabuList[i] = t + randomInt(minTabu, maxTabu)
        update bestSeqTS to pivot if E(bestSeqTS) > E(pivot)
    end for
    return bestSeqTS
B. Shared Memory Exploitation and Compact
Data Structures
The shared memory of a GPU, accessible by all threads
of a block, can be more than 100 times faster than global
memory, though it is considerably more limited in size [24].
To exploit the efficiency of shared memory in our design,
all data used by the algorithm, including population,
tabuList, vectorC, and tableC, are stored compactly in
shared memory. Nevertheless, to maximize the number of
blocks active simultaneously, each replica can use only
a limited amount of shared memory. For instance, on
an Nvidia A100 GPU, each block should use no more than
maxSharedMemPerSM / maxBlocksPerSM = 164 KB / 32 ≈ 5
KB of shared memory, which makes fitting all data
structures in shared memory a challenging task.
Given the limited amount of memory, we optimize the
storage of the binary-valued data structures, population
and tableC, by employing compact bit-vector representations
implemented through integer arrays. This approach
results in an eightfold reduction in memory usage
compared to previous implementations, thereby significantly
enhancing the efficiency of data storage and management
within the algorithm. Moreover, the matrix tableC is
upper-left triangular, so only the entries inside the
triangle need to be stored, saving another $\frac{N}{2}$ bits.
In our implementation, 5 KB of shared memory suffices for
$N \le 187$ with population size $K = 100$, while directly
adapting the previous implementation makes 5 KB
insufficient even for $N \le 11$.
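The compact storage idea can be illustrated with a small bit-vector class and a flat index for the triangular tableC. This is a host-side C++ sketch under stated assumptions: 32-bit words, bit value 1 encoding +1 and 0 encoding -1, and row-by-row packing of the triangle. None of these layout details are specified in the text, and the names BitVector and triIndex are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Fixed-size bit vector backed by 32-bit words.
// ASSUMED encoding: bit 1 stands for +1, bit 0 stands for -1.
class BitVector {
public:
    explicit BitVector(std::size_t bits)
        : words_((bits + 31) / 32, 0u), bits_(bits) {}

    bool get(std::size_t i) const {
        assert(i < bits_);
        return ((words_[i >> 5] >> (i & 31)) & 1u) != 0u;
    }

    void set(std::size_t i, bool v) {
        assert(i < bits_);
        const std::uint32_t mask = std::uint32_t{1} << (i & 31);
        if (v) {
            words_[i >> 5] |= mask;
        } else {
            words_[i >> 5] &= ~mask;
        }
    }

    // Storage actually occupied, in bytes.
    std::size_t bytes() const { return words_.size() * sizeof(std::uint32_t); }

private:
    std::vector<std::uint32_t> words_;
    std::size_t bits_;
};

// Flat position of tableC entry (k, i), with lag k = 1..n-1 and
// i = 0..n-k-1, when the triangle is packed row by row.
// Rows 1..k-1 hold (n-1) + (n-2) + ... + (n-k+1) = (k-1)(2n-k)/2 entries.
std::size_t triIndex(std::size_t n, std::size_t k, std::size_t i) {
    assert(k >= 1 && k < n && i < n - k);
    return (k - 1) * (2 * n - k) / 2 + i;
}
```

Under these assumptions, for N = 187 and K = 100 the population occupies 18,700 bits (2,340 bytes after word rounding) and the triangular tableC occupies 187 · 186 / 2 = 17,391 bits (2,176 bytes), about 4.4 KB in total, which leaves the remainder of the 5 KB budget for tabuList and vectorC.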
V. RESULTS
A. Experiment Setup
We implement our approach in C++ and CUDA. The
experiments are conducted on a single Nvidia A100 GPU
with 6912 CUDA cores, 80 GB memory, and 164 KB
shared memory per streaming multiprocessor (SM). We
dynamically assign the number of blocks and the number of
threads per block according to the problem size $N$. As it
is common practice to set threadsPerBlock to a
multiple of the number of threads in a GPU warp (32), we set
threadsPerBlock to the smallest multiple of 32 that is
larger than $N$. The number of blocks is set to
$$\text{blocks} = \#\text{SMs} \times \min\left(\text{maxBlocksPerSM}, \frac{\text{maxThreadsPerSM}}{\text{threadsPerBlock}}\right),$$
to maximize the number of active blocks under the
restrictions of the specific GPU configuration.
The stopping criterion is reaching the previous
best-known result. We also run the implementation
of [19] on an AWS c5.24xlarge CPU node with 48 physical
cores at 3.056 GHz for a comparison with multicore
CPU implementations.
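The launch configuration rule stated above can be written out as a short host-side sketch. The struct GpuLimits and both function names are hypothetical. The A100 figures used in the usage note (108 SMs and 2048 resident threads per SM) are assumptions not given in the text; the limit of 32 blocks per SM and the warp size of 32 are the values quoted in this paper. The integer division of maxThreadsPerSM by threadsPerBlock is also an assumption about how the formula is rounded.

```cpp
#include <algorithm>
#include <cassert>

// Hardware limits that enter the launch configuration.
struct GpuLimits {
    int numSMs;
    int maxBlocksPerSM;
    int maxThreadsPerSM;
    int warpSize;
};

// Smallest multiple of the warp size that is strictly larger than n,
// following the wording "larger than N" in the text.
int threadsPerBlockFor(int n, const GpuLimits& g) {
    assert(n > 0 && g.warpSize > 0);
    return (n / g.warpSize + 1) * g.warpSize;
}

// blocks = #SMs x min(maxBlocksPerSM, maxThreadsPerSM / threadsPerBlock).
int blocksFor(int n, const GpuLimits& g) {
    const int tpb = threadsPerBlockFor(n, g);
    return g.numSMs * std::min(g.maxBlocksPerSM, g.maxThreadsPerSM / tpb);
}
```

With GpuLimits{108, 32, 2048, 32}, a problem of size N = 100 would use 128 threads per block and 108 × min(32, 16) = 1728 blocks under this rule.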
B. Showcasing New MF Values
The optimal solution of the LABS problem for $N > 66$
has remained unknown, primarily due to the
non-scalability of the global branch-and-bound method [29].
Notably, the best-known solutions for problem sizes
$N \le 120$, derived through meta-heuristic approaches,
have not improved since 2020 [8]. However,
our GPU-parallelized memetic tabu search advances the
state of the art for $92 \le N \le 118$ by obtaining sequences
with better merit factors than the previously best-known
results, as detailed in Table II and Figure 1. These
improved results were achieved by executing our
implementation on a single Nvidia A100 GPU, with every
considered problem size completing in less than 48 hours.
For even sequence lengths in the range $N = 92$ to
$N = 118$, our study has yielded improved results for ten
problem sizes, namely $N = 92, 98, 104, 106, 108, 110, 112,
114, 116, 118$. For problem sizes $N = 94, 96, 100, 102$,
the sequences generated by our implementation have MF
values that match the previously reported best-known
results [8]. Additionally, our implementation has identified
two new results for odd sequence lengths, $N = 99$ and
$N = 107$. Notably, while the prior best-known results for
these lengths were optimal skew-symmetric sequences, our
findings indicate that the optimal objective values for these
problems do not coincide with the optimal skew-symmetric
values.
N    Old E  New E  Old MF  New MF  New Sequence S                  d(S)
Even Sequences
92   498    490    8.50    8.64    EE01C0E77667DD34DAE94B5         12
                                   BB5495B2233288618FBC1E0         9
98   545    529    8.81    9.08    993A76393CDB4BCF78FF00AA8       13
104  612    600    8.84    9.01    0F80F3C20AAB1844295B6ED666      12
106  701    645    8.01    8.71    696D27CA748C66071EAFDD1177C     14
108  702    670    8.31    8.70    F2401F3D9FF1CF46D58D96AA959     14
110  723    707    8.37    8.56    33F83762F128DF55064B5B7BE79C    16
112  788    728    7.96    8.62    A0496A493FAECCC8AFC3D50E738F    13
114  817    777    7.95    8.36    B6C3648DB19C8C387A9EAA8578000   15
116  814    802    8.27    8.39    52DF72096B92DCC87407044C8E1D5   14
118  847    807    8.22    8.63    E003FB9F87C674E5CD6D34CCB2AD54  15
Odd Sequences
99   577    553    8.49    8.86    0CF30C003783CBCC8DA92AAD4       15
107  677    673    8.46    8.51    A78785B8D72AEA6D99DECDFD004     17
Table II: New results for the LABS problem, with old and new energies (E), merit factors (MF), sequences in
hexadecimal encoding (zeros are appended to the end of each sequence to make its length a multiple of 4), and
deviation from skew-symmetry (d). All sequences we found have high d(S) (≥ 9), indicating that they are unlikely to
be found by skew-symmetry-based methods. For N = 92, two non-equivalent sequences are discovered.
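To check a Table II entry, the hexadecimal string has to be expanded into a ±1 sequence and its energy evaluated with Eq. (2). The sketch below assumes that each hex digit contributes four bits, most significant bit first, that bit 1 maps to +1 and bit 0 to -1, and that the padding zeros are the trailing bits beyond length N. The caption only states that zeros are appended at the end, so the bit order is an assumption; the choice of which bit value maps to +1 does not affect the energy because of the complementation symmetry. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Expands a hexadecimal string into the first n spins.
// ASSUMED convention: 4 bits per digit, most significant bit first,
// bit 1 -> +1 and bit 0 -> -1; trailing padding bits are dropped.
std::vector<int> decodeHex(const std::string& hex, int n) {
    assert(n > 0 && static_cast<int>(hex.size()) * 4 >= n);
    std::vector<int> s;
    s.reserve(static_cast<std::size_t>(n));
    for (char ch : hex) {
        int v = 0;
        if (ch >= '0' && ch <= '9') {
            v = ch - '0';
        } else if (ch >= 'A' && ch <= 'F') {
            v = ch - 'A' + 10;
        } else if (ch >= 'a' && ch <= 'f') {
            v = ch - 'a' + 10;
        } else {
            assert(false && "not a hexadecimal digit");
        }
        for (int b = 3; b >= 0 && static_cast<int>(s.size()) < n; --b) {
            s.push_back(((v >> b) & 1) ? 1 : -1);
        }
    }
    return s;
}

// Direct evaluation of the LABS energy, Eq. (2).
long long sequenceEnergy(const std::vector<int>& s) {
    long long e = 0;
    for (std::size_t k = 1; k < s.size(); ++k) {
        long long c = 0;
        for (std::size_t i = 0; i + k < s.size(); ++i) c += s[i] * s[i + k];
        e += c * c;
    }
    return e;
}
```

As a sanity check of the convention, the length-13 Barker sequence is encoded as F9A8 (with three padding zeros) and decodes to a sequence of energy 6. Evaluating sequenceEnergy(decodeHex(S, N)) on a Table II row should reproduce its "New E" value if the assumed bit order matches the encoding used by the authors; that agreement has not been verified here.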
C. Scaling and Runtime
Our implementation of the memetic algorithm with
tabu search on a GPU was tested on the LABS problem for
all problem sizes N ≤ 120. As illustrated in Figure 5,
the GPU-based memetic tabu search maintains the same
scaling factor as its analogous CPU implementation.
Notably, Figure 4 demonstrates the constant-factor speedup
achieved by the GPU implementation. Specifically, our
approach delivers an 8- to 26-fold acceleration for
problem sizes ranging from 55 to 83 when utilizing an Nvidia
A100 GPU, compared to the analogous implementation
on a 16-core CPU.
Figure 5 shows the time-to-solution (TTS) scaling of
[Figure 4 plot: acceleration factor (vertical axis, 0 to 25) versus N (horizontal axis, 20 to 85). Series: GPU-A100 Acceleration, CPU-48cores Acceleration, CPU-16cores Baseline.]
Figure 4: Acceleration of our GPU implementation
against the previous version on multiple CPU cores for
time-to-solution (TTS). The most significant speed-up
is 26.5× at N = 68, where the GPU implementation
takes 5.15 seconds and the CPU version needs 136.64
and 45.43 seconds for 16 and 48 cores, respectively.
our GPU implementation of the memetic-tabu algorithm
and compares it with the analogous 16-core and 48-core
CPU implementations. Further, as a reference, we include
the TTS scaling of Gurobi, the global optimization
[Figure 5 plot: time in seconds (vertical axis, logarithmic, $10^{-4}$ to $10^{5}$) versus N (horizontal axis, 0 to 100). Fitted scalings in the legend: Gurobi: $1.706^N$; CPU-16cores: $1.344^N$; CPU-48cores: $1.352^N$; GPU-A100: $1.313^N$.]
Figure 5: Median time to target energies, with the error
band representing the interval from Q1 (25th
percentile) to Q3 (75th percentile). The scalings are
from exponential fits with (1) $N^{\mathrm{fit}}_{\min} = 20$ for Gurobi; (2)
$N^{\mathrm{fit}}_{\min} = 50$ for memetic-tabu GPU-A100, CPU-16cores,
and CPU-48cores.
[Figure 6 plot: scaling exponent (vertical axis, 1.2 to 1.8) versus $N^{\mathrm{fit}}_{\min}$ (horizontal axis, 20 to 70). Series: Gurobi, CPU-16cores, CPU-48cores, GPU-A100.]
Figure 6: The scaling factor $b$ from the $a \times b^N$ fit, with
the error band representing the 95% CI.
solver based on branch-and-bound [1]. The experiments
for Gurobi were conducted with Gurobi version 11.0.3
on 16 threads on an AMD CPU with 48 physical cores
and 96 logical cores. In Gurobi, the LABS problem is
encoded as a quadratically constrained binary optimization
problem with auxiliary variables [26]. Although Gurobi
is a global optimization solver, in our experiments we
set the stopping criterion to reaching the previous
best-known energy, rather than proving global optimality.
For the TTS scaling, we fit the median runtime with the
function $a \times b^N$ for $N$ in intervals $[N^{\mathrm{fit}}_{\min}, N^{\mathrm{fit}}_{\max}]$, where
the starting point of the fit, $N^{\mathrm{fit}}_{\min}$, is treated as a variable. The
end point of the fit, $N^{\mathrm{fit}}_{\max}$, is taken as 40 for Gurobi, 82 for
memetic-tabu CPU-16cores, 85 for CPU-48cores, and 98
for GPU-A100. A linear fit is conducted with ordinary
$N^{\mathrm{fit}}_{\min}$   a (95% CI)                       b (95% CI)
20    2.05e-05 (9.71e-06, 4.34e-05)    1.220 (1.205, 1.235)
25    8.22e-06 (3.86e-06, 1.75e-05)    1.235 (1.221, 1.250)
30    3.20e-06 (1.49e-06, 6.88e-06)    1.252 (1.237, 1.266)
35    1.22e-06 (5.59e-07, 2.65e-06)    1.268 (1.253, 1.283)
40    4.60e-07 (2.06e-07, 1.03e-06)    1.284 (1.269, 1.299)
45    1.56e-07 (7.01e-08, 3.49e-07)    1.302 (1.288, 1.317)
50    8.23e-08 (3.18e-08, 2.13e-07)    1.313 (1.296, 1.330)
55    5.24e-08 (1.61e-08, 1.71e-07)    1.320 (1.300, 1.341)
60    4.61e-08 (9.76e-09, 2.18e-07)    1.322 (1.296, 1.348)
65    9.37e-08 (1.18e-08, 7.42e-07)    1.311 (1.278, 1.345)
70    1.93e-07 (1.05e-08, 3.55e-06)    1.300 (1.255, 1.346)
Table III: Fit results for the median runtime with varying $N^{\mathrm{fit}}_{\min}$,
using the fit function $a \times b^N$ (in seconds). Numbers
in parentheses are 95% confidence intervals.
least squares for the logarithm of the median runtime versus
N, and the result is then converted from logarithms back to
times in seconds. The fit parameters a and b, along with their
95% confidence intervals (CI), for our GPU-parallelized
memetic tabu implementation are reported in
Table III.
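The fitting procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the input data in the usage lines are synthetic rather than measured runtimes, and the 95% confidence intervals reported in Table III are not computed here.

```python
import math


def fit_exponential(ns, times):
    """Fit times ~ a * b**N by ordinary least squares on log(times) versus N.

    Returns (a, b). Confidence intervals are deliberately omitted in this sketch.
    """
    if len(ns) != len(times) or len(ns) < 2:
        raise ValueError("need at least two (N, time) pairs of equal length")
    ys = [math.log(t) for t in times]
    mean_n = sum(ns) / len(ns)
    mean_y = sum(ys) / len(ys)
    sxx = sum((n - mean_n) ** 2 for n in ns)
    sxy = sum((n - mean_n) * (y - mean_y) for n, y in zip(ns, ys))
    slope = sxy / sxx                      # equals log(b)
    intercept = mean_y - slope * mean_n    # equals log(a)
    return math.exp(intercept), math.exp(slope)


def predict_tts(a, b, n):
    """Evaluate the fitted model a * b**n (in seconds)."""
    return a * b ** n


# Synthetic example: data generated exactly from a * b**N should be recovered.
ns = list(range(50, 99))
times = [1e-7 * 1.3 ** n for n in ns]
a_fit, b_fit = fit_exponential(ns, times)
```

Restricting `ns` to values at or above a chosen starting point plays the role of varying $N^{\mathrm{fit}}_{\min}$; a fitted pair (a, b) such as a row of Table III can then be passed to `predict_tts` to evaluate the model at a given N.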
Figure 6 compares the scaling factor, b, for the three
implementations of memetic tabu and the Gurobi solver.
We note that the memetic tabu runtime for
N ≲ 40 is dominated by kernel-launch overhead
and problem-instance construction, so an exponential
fit that includes these points may lead to artificially low scaling
factors. Consequently, we report the TTS scaling of
memetic-tabu with $N^{\mathrm{fit}}_{\min} = 50$ in Figure 5. The scaling
factor of our GPU implementation, as reported in Table III, is
consistent with the scaling of $1.34^N$ reported in Ref. [19] and
$1.35^N$ reported in Refs. [7,33]. Further, the 95% CI of
our Gurobi scaling overlaps with that of the Gurobi TTS
scaling reported in Ref. [33].
D. Deviation from Skew-Symmetry
In Appendix A, we introduce d(S), the deviation from
skew symmetry for a sequence S. The definition of d(S)
applies to both odd and even sequences. We calculate
d(S) for all our newly found sequences and show the
results in Table II. The d(S) of our sequences is at least 9,
which is considerably greater than that of a typical sequence
found by the skew-symmetry-based method of Ref. [31],
giving evidence that our sequences are not likely to be
found by skew-symmetry-based methods. In Ref. [30], all
optimal skew-symmetric sequences for odd N ≤ 119 are
found and proved optimal via branch-and-bound. The
previous best results for N = 99, 107 were given by
optimal skew-symmetric sequences before this work. Our
non-skew-symmetric results for N = 99, 107 further confirm
that optimal skew-symmetric sequences are not globally
optimal for these sizes.
VI. DISCUSSION
In this work, we present a framework for the massive
parallelization of the memetic algorithm with tabu search,
specifically tailored for the LABS problem on a single
Nvidia A100 GPU. Our implementation leverages a carefully
crafted design of block-level and thread-level parallelism,
further optimized by utilizing the fast shared
memory available among blocks. By encoding the
algorithm's data storage in a binary-valued format, we
achieve optimal GPU utilization, thereby enhancing
computational efficiency.
Using LABS as a testbed, we showcase the efficacy
of our implementation in rapidly finding the best-known
solution up to problem size 120. This is illustrated by
up to a 26-fold speedup of our GPU implementation over
the analogous memetic tabu implementation on a 16-core
CPU. Additionally, our implementation exhibits a
considerable scaling advantage over the global branch-and-bound
solver of Gurobi for problem size N ≤ 91. Further,
our optimized implementation has also allowed us
to improve upon the best-known LABS objective value
for twelve different problem sizes in the range 92 ≤ N ≤ 118.
Among the newly identified sequences that exhibit
improvements over previously established LABS objective
values, two belong to odd problem sizes, specifically
N = 99 and N = 107. This is of particular
importance, since odd-sized LABS problems exhibit the
skew-symmetry property, which quadratically reduces the
search space. Historically, leveraging this property, prior
methods have reported optimal skew-symmetric LABS
objective values for these problem sizes, which were also
the best-known LABS objective values. Our improved
results for these problem sizes highlight the potential
limitations of relying solely on the skew-symmetry property
to constrain the search space, since this approach
may inadvertently increase susceptibility to local optima.
This suggests that general-purpose methods performing
an unrestricted search across the entire space remain
advantageous, particularly for the complex search landscape
of LABS.
ACKNOWLEDGMENTS
We thank our colleagues at the Global Technology
Applied Research center of JPMorganChase for support
and helpful feedback. Special thanks to Ruslan Shay-
dulin, Yue Sun, Atithi Acharya, Rudy Raymond, and
Tyler Chen for their valuable discussions regarding the
manuscript, to Jacob Albus for technical support in set-
ting up Gurobi experiments, and to Pragna Subrah-
manya and Sriram Gunja Yechan for technical support
around GPU implementation.
Appendix A: Analysis of deviation from skew
symmetry for solutions
For odd $N = 2n - 1$, skew symmetry is defined as the
condition $s_{n+l} = (-1)^l s_{n-l}$ for all possible pairs, i.e.,
$l = 1, 2, \cdots, n-1$. The number of pairs that do not satisfy
the skew-symmetric condition can be used as a metric for
how distant a sequence is from skew symmetry. Formally,
the number of non-skew-symmetric pairs for a sequence
$S = s_1 s_2 \cdots s_N$ can be calculated as
$$ n_{\text{non-skew}}(S) = \frac{1}{2} \sum_{l=1}^{n-1} \left| s_{n+l} - (-1)^l s_{n-l} \right|. \tag{A1} $$
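Eq. (A1) can be evaluated with a short routine. The sketch below is illustrative only; the function name `n_non_skew` and the representation of a sequence as a Python list of +1/-1 integers are assumptions made for this example.

```python
def n_non_skew(spins):
    """Count pairs violating s_{n+l} = (-1)^l * s_{n-l}, as in Eq. (A1).

    `spins` is a list of +1/-1 values of odd length N = 2n - 1.
    Each violating pair contributes |s_{n+l} - (-1)^l s_{n-l}| / 2 = 1.
    """
    n_len = len(spins)
    if n_len % 2 == 0:
        raise ValueError("Eq. (A1) is defined for odd-length sequences only")
    n = (n_len + 1) // 2  # 1-based index of the central element
    count = 0
    for l in range(1, n):
        right = spins[n + l - 1]  # s_{n+l} in 0-based indexing
        left = spins[n - l - 1]   # s_{n-l} in 0-based indexing
        sign = -1 if l % 2 else 1
        if right != sign * left:
            count += 1
    return count


# The length-13 Barker sequence is skew-symmetric, so it has no violating pairs.
barker13 = [1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1]
barker_violations = n_non_skew(barker13)

# Flipping the last element breaks exactly one pair (l = n - 1).
flipped = barker13[:-1] + [-barker13[-1]]
flipped_violations = n_non_skew(flipped)
```

The central element $s_n$ does not enter the sum, so flipping it leaves the count unchanged.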
For even-length sequences, Ref. [17] introduces sequence
operators that append or delete elements from the
beginning or the end of sequences, so that even-length
solutions can be built based on odd-length skew-symmetric
solutions, and the deviation from skew symmetry of an
even-length sequence can also be analyzed. Moreover,
Ref. [31] uses the sequence operators and also combines
them with the rotation defined as
$$ \mathrm{Rot}(s_1 s_2 \cdots s_N; i) = s_{i+1} s_{i+2} \cdots s_N s_1 \cdots s_i, \tag{A2} $$
which moves the first $i$ elements to the end of the
sequence.
“Compositions of sequence operators (appending and
deleting at the beginning and/or the end) and rotation”
are equivalent to the “compositions of insertion and
deletion at arbitrary locations and rotation”. Therefore, we
define the insertion and deletion operators at arbitrary
locations as
$$ \mathrm{Ins}(s_1 s_2 \cdots s_N; i, \xi) = s_1 s_2 \cdots s_i \, \xi \, s_{i+1} \cdots s_N, \tag{A3} $$
$$ \mathrm{Del}(s_1 s_2 \cdots s_N; i) = s_1 s_2 \cdots s_{i-1} s_{i+1} \cdots s_N, \tag{A4} $$
where $\xi \in \{-1, 1\}$. Allowing a single insertion or
deletion, we define the metric for the deviation from skew
symmetry as
$$ d(S) = \min\left\{ \min_{i,r,\xi} n_{\text{non-skew}}\big(\mathrm{Ins}(\mathrm{Rot}(S; r); i, \xi)\big), \; \min_{i,r} n_{\text{non-skew}}\big(\mathrm{Del}(\mathrm{Rot}(S; r); i)\big) \right\}, \tag{A5} $$
i.e., the minimum number of non-skew-symmetric pairs over
all rotations followed by a single insertion or deletion. This
gives some indication that the sequence may be difficult
to obtain under the operations used by Ref. [31]. In
principle, multiple insertions and deletions can be
allowed, but this would make the evaluation of Eq. (A5)
very expensive, and it eventually becomes equivalent to a
complete search when the number of insertions/deletions
becomes O(N).
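A brute-force evaluation of Eq. (A5) for an even-length sequence can be sketched as follows. This is a minimal illustration, not the implementation used for Table II: the function names are hypothetical, sequences are assumed to be Python lists of +1/-1 integers, and the index ranges (rotations r = 0, ..., N-1, insertion positions i = 0, ..., N, deletion positions i = 1, ..., N) are assumptions, since the text does not state them explicitly. Only even N is handled, because a single insertion or deletion must produce an odd-length sequence for Eq. (A1) to apply.

```python
def n_non_skew(spins):
    """Number of pairs violating s_{n+l} = (-1)^l * s_{n-l} (odd length only)."""
    n_len = len(spins)
    if n_len % 2 == 0:
        raise ValueError("n_non_skew is defined for odd-length sequences only")
    n = (n_len + 1) // 2
    return sum(
        1
        for l in range(1, n)
        if spins[n + l - 1] != (-1 if l % 2 else 1) * spins[n - l - 1]
    )


def rotate(spins, r):
    """Rot(S; r): move the first r elements to the end of the sequence."""
    return spins[r:] + spins[:r]


def deviation_from_skew_symmetry(spins):
    """Brute-force d(S) of Eq. (A5) for an even-length sequence."""
    n_len = len(spins)
    if n_len % 2 != 0:
        raise ValueError("this sketch only handles even-length sequences")
    best = n_len  # upper bound: there are fewer than N pairs in total
    for r in range(n_len):
        rot = rotate(spins, r)
        # Del(Rot(S; r); i): remove the element at 1-based position i.
        for i in range(1, n_len + 1):
            best = min(best, n_non_skew(rot[: i - 1] + rot[i:]))
        # Ins(Rot(S; r); i, xi): insert xi after the first i elements.
        for i in range(n_len + 1):
            for xi in (-1, 1):
                best = min(best, n_non_skew(rot[:i] + [xi] + rot[i:]))
    return best


# Appending one element to the skew-symmetric Barker-13 sequence gives an
# even-length sequence that is one deletion away from skew symmetry.
barker13 = [1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1]
d_example = deviation_from_skew_symmetry(barker13 + [1])
```

The search evaluates on the order of N² candidate sequences, each in O(N) time, which is inexpensive for the problem sizes in Table II.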
We list the values of $n_{\text{non-skew}}$ for odd N and
$n_{\text{non-skew},\,1\text{-ins/del}}$ for even N for our newly found
sequences in Table II. As a comparison, valid sequences
from Ref. [31] have $n_{\text{non-skew}} \le 3$ for all odd N
except N = 463, and $n_{\text{non-skew},\,1\text{-ins/del}} \le 3$ for all even
N. This indicates that it may be difficult for skew-symmetry-based
methods to find better solutions, although we note
that Ref. [31] deals with N ≥ 450, much greater than the
regime we work with in this paper.
[1] Gurobi Optimization, www.gurobi.com.
[2] Thomas Bäck and Hans-Paul Schwefel. An overview
of evolutionary algorithms for parameter optimization.
Evolutionary computation, 1(1):1–23, 1993.
[3] Thomas Bartz-Beielstein, Jürgen Branke, Jörn Mehnen,
and Olaf Mersmann. Evolutionary algorithms. Wiley
Interdisciplinary Reviews: Data Mining and Knowledge
Discovery, 4(3):178–195, 2014.
[4] J. Bernasconi. Low autocorrelation binary sequences :
statistical mechanics and configuration space analysis.
Journal de Physique, 48(4):559–567, 1987.
[5] Christian Blum and Andrea Roli. Metaheuristics in com-
binatorial optimization: Overview and conceptual com-
parison. ACM computing surveys (CSUR), 35(3):268–
308, 2003.
[6] A. Boehmer. Binary pulse compression codes. IEEE
Transactions on Information Theory, 13(2):156–167,
April 1967.
[7] Borko Bošković, Franc Brglez, and Janez Brest. Low-
autocorrelation binary sequences: On improved merit
factors and runtime predictions to achieve them. Applied
Soft Computing, 56:262–285, July 2017.
[8] B. Bošković, F. Brglez, and J. Brest. A GitHub Archive
for Solvers and Solutions of the labs problem. For
updates, see https://github.com/borkob/git_labs,
January 2016.
[9] Janez Brest and Borko Boskovic. A heuristic algo-
rithm for a low autocorrelation binary sequence problem
with odd length and high merit factor. IEEE Access,
6:4127–4134, 2018.
[10] Franc Brglez, Xiao Yu Li, Matthias F. Stallmann, and
Burkhard Militzer. Reliable cost predictions for finding
optimal solutions to labs problem: Evolutionary and al-
ternative algorithms. Proc. of The Fifth Int. Workshop
on Frontiers in Evolutionary Algorithms (FEA’2003)
under JCIS’2003, (18), September 2003.
[11] Jaishankar Chakrapani and Jadranka Skorin-Kapov.
Massively parallel tabu search for the quadratic as-
signment problem. Annals of Operations Research,
41(4):327–341, 1993.
[12] Jean-François Cordeau and Mirko Maischberger. A paral-
lel iterated tabu search heuristic for vehicle routing prob-
lems. Computers & Operations Research, 39(9):2033–
2050, 2012.
[13] Teodor G Crainic, Jean-Yves Potvin, and Michel Gen-
dreau. Parallel tabu search. Université de Montréal,
Centre de recherche sur les transports, 2005.
[14] Jacek Dąbrowski. Parallelization techniques for tabu
search. In Bo Kågström, Erik Elmroth, Jack Don-
garra, and Jerzy Waśniewski, editors, Applied Parallel
Computing. State of the Art in Scientific Computing,
pages 1126–1135, Berlin, Heidelberg, 2007. Springer
Berlin Heidelberg.
[15] I. De Falco, R. Del Balio, E. Tarantino, and R. Vaccaro.
Improving search by incorporating evolution principles in
parallel tabu search. In Proceedings of the First IEEE
Conference on Evolutionary Computation. IEEE World
Congress on Computational Intelligence, pages 823–828
vol.2, 1994.
[16] Viviane M de Oliveira, José F Fontanari, and Pe-
ter F Stadler. Metastable states in short-ranged p-spin
glasses. Journal of Physics A: Mathematical and General,
32(50):8793, 1999.
[17] Miroslav Dimitrov. New classes of binary sequences with
high merit factor. arXiv preprint arXiv:2206.12070, 2022.
[18] C.-N. Fiechter. A parallel tabu search algorithm for
large traveling salesman problems. Discrete Applied
Mathematics, 51(3):243–267, 1994.
[19] José E Gallardo, Carlos Cotta, and Antonio J Fernán-
dez. A memetic algorithm for the low autocorrelation bi-
nary sequence problem. In Proceedings of the 9th annual
conference on genetic and evolutionary computation,
pages 1226–1233, 2007.
[20] Fred Glover. Tabu search: A tutorial. Interfaces,
20(4):74–94, 1990.
[21] M. Golay. A class of finite binary sequences with al-
ternate auto-correlation values equal to zero (corresp.).
IEEE Transactions on Information Theory, 18(3):449–
450, May 1972.
[22] M. Golay. Sieves for low autocorrelation binary se-
quences. IEEE Transactions on Information Theory,
23(1):43–51, January 1977.
[23] M. Golay. The merit factor of long low autocorrela-
tion binary sequences (corresp.). IEEE Transactions on
Information Theory, 28(3):543–549, May 1982.
[24] Design Guide. Cuda c++ programming guide. NVIDIA,
July, 2020.
[25] Tabitha James, Cesar Rego, and Fred Glover. A cooperative
parallel tabu search algorithm for the quadratic
assignment problem. European Journal of Operational
Research, 195(3):810–826, 2009.
[26] Jozef Kratica. A mixed integer quadratic programming
model for the low autocorrelation binary sequence prob-
lem. Serdica Journal of Computing, 6(4):385–400, 2013.
[27] Eugene L Lawler and David E Wood. Branch-and-bound
methods: A survey. Operations research, 14(4):699–719,
1966.
[28] Leland McInnes, John Healy, and James Melville. Umap:
Uniform manifold approximation and projection for di-
mension reduction. arXiv preprint arXiv:1802.03426,
2018.
[29] Tom Packebusch and Stephan Mertens. Low auto-
correlation binary sequences. Journal of Physics A:
Mathematical and Theoretical, 49(16):165001, March
2016.
[30] S. D. Prestwich. Improved branch-and-bound for low
autocorrelation binary sequences. arXiv:1305.6187, 2013.
[31] Blaž Pšeničnik, Rene Mlinarič, Janez Brest, and Borko
Bošković. Dual-step optimization for binary sequences
with high merit factors. arXiv preprint arXiv:2409.07222,
2024.
[32] M. Schroeder. Synthesis of low-peak-factor signals and
binary sequences with low autocorrelation (corresp.).
IEEE Transactions on Information Theory, 16(1):85–89,
January 1970.
[33] Ruslan Shaydulin, Changhao Li, Shouvanik Chakrabarti,
Matthew DeCross, Dylan Herman, Niraj Kumar, Jeffrey
Larson, Danylo Lykov, Pierre Minssen, Yue Sun, Yuri
Alexeev, Joan M. Dreiling, John P. Gaebler, Thomas M.
Gatterman, Justin A. Gerber, Kevin Gilmore, Dan
Gresh, Nathan Hewitt, Chandler V. Horst, Shaohan
Hu, Jacob Johansen, Mitchell Matheny, Tanner Mengle,
Michael Mills, Steven A. Moses, Brian Neyenhuis, Peter
Siegfried, Romina Yalovetzky, and Marco Pistoia. Evi-
dence of scaling advantage for the quantum approximate
optimization algorithm on a classically intractable prob-
lem. Science Advances, 10(22), May 2024.
[34] Allyson Silva, Leandro C Coelho, and Maryam Darvish.
Quadratic assignment problem variants: A survey and
an effective parallel memetic iterated tabu search.
European Journal of Operational Research, 292(3):1066–
1084, 2021.
[35] El-Ghazali Talbi, Zouhir Hafidi, and Jean-Marc Geib.
Parallel Tabu Search for Large Optimization Problems,
pages 345–358. Springer US, Boston, MA, 1999.
[36] Peter JM Van Laarhoven and Emile HL Aarts. Simulated annealing.
Springer, 1987.
[37] Allan J Wilson, DR Pallavi, M Ramachandran, Sathi-
yaraj Chinnasamy, and S Sowmiya. A review on
memetic algorithms and its developments. Electrical and
Automation Engineering, 1(1):7–12, 2022.
DISCLAIMER
This paper was prepared for informational purposes by
the Global Technology Applied Research center of JP-
Morgan Chase & Co. This paper is not a product of the
Research Department of JPMorgan Chase & Co. or its
affiliates. Neither JPMorgan Chase & Co. nor any of its
affiliates makes any explicit or implied representation or
warranty and none of them accept any liability in con-
nection with this paper, including, without limitation,
with respect to the completeness, accuracy, or reliability
of the information contained herein and the potential le-
gal, compliance, tax, or accounting effects thereof. This
document is not intended as investment research or in-
vestment advice, or as a recommendation, offer, or solic-
itation for the purchase or sale of any security, financial
instrument, financial product or service, or to be used in
any way for evaluating the merits of participating in any
transaction.