A Parallel Architecture for High Frame Rate
Stereo using Semi-Global Matching
Akshay Jain
Alexander Fell
Saket Anand
Indraprastha Institute of
Information Technology
Delhi, India
Abstract
Semi-Global Matching (SGM) is a popular algorithm for computing depth maps from stereo
images, offering the best trade-off among accuracy, computational cost and frame rate.
This paper presents two architectural improvements in FPGA implementations of SGM to
achieve high frame rates. First, a highly parallel, pipelined and scalable architecture is
implemented which stores the intermediate values internally in the Block RAMs of the FPGA,
rendering external, off-chip memory unnecessary. The architecture parallelizes the cost
computations over the complete disparity range in every clock cycle. Secondly, a novel SGM
architecture based on multi-clock systems is introduced, which allows the integration of
both disparity-level and row-level parallelism and thus obtains even higher FPS rates.
Results show that the FPS obtained are higher than those of any other FPGA-based SGM
implementation available in the literature. On a Virtex-7 FPGA device, for VGA images
(640 × 480 pixels) and a disparity range of 128, a rate of 475 FPS is achieved. A rate as
high as 70 FPS is realized for Full-HD images (1920 × 1080 pixels).
1 Introduction
Various autonomous and semi-autonomous systems such as Unmanned Aerial Vehicles (UAVs),
Unmanned Ground Vehicles (UGVs) and Advanced Driver Assistance Systems (ADAS) rely on
depth perception to sense their environments effectively by detecting, localizing and
identifying objects around them. Sensors like LIDARs or RADARs are often employed for the
detection and localization of objects. While these sensors provide accurate depth
perception, they prove to be inadequate in many applications. RADARs and 2D LIDARs can
only provide range and azimuth information, but not elevation. 3D LIDARs can yield range,
azimuth and elevation; however, they are prohibitively expensive and have poor spatial
resolution. Moreover, both LIDARs and RADARs fail to capture appearance, making the
identification of objects very difficult.
Stereo cameras serve as an attractive alternative to LIDARs and RADARs due to their cost
effectiveness, small form factor and high spatial resolution. However, the depth estimates
obtained from stereo cameras are much poorer compared to LIDARs and RADARs, particularly
at larger depths. This problem can be mitigated by applying depth super-resolution
techniques, which rely on fusing multiple depth estimates [22]. This fusion is reliable
only if the environment is changing slowly, or equivalently, if the depth estimation can
operate at a high frame rate (frames per second or FPS). In addition, high FPS depth maps
are useful for other applications like real-time projection mapping [14] and camera
tracking [6]. Unfortunately, high resolution images increase the computational complexity
of stereo based depth estimation, and therefore achieving higher FPS is very challenging.
This paper addresses this challenge and proposes a novel, highly parallel and scalable
implementation of a popular technique called Semi-Global Matching (SGM) [8], which
achieves significantly higher FPS.
2017. The copyright of this document resides with its authors.
It may be distributed unchanged freely in print or electronic forms.
Stereo depth maps are generated using dense point correspondences across the image pairs.
Several approaches exist for dense correspondence search and can be categorized into
local, global and semi-global approaches [20]. In local approaches like Sum of Absolute
Differences (SAD) and Census based rank transform [24], a small part of the image, which
is a window centered around a pixel, is used to calculate the disparity independently. The
advantage of the local algorithms is a high throughput at the cost of reduced accuracy. On
the other hand, global approaches such as Graph Cuts [15,20] and Belief Propagation [4]
utilize the complete image, resulting in a higher accuracy at a lower throughput. The
semi-global approach provides a trade-off between speed and accuracy by using a local
method as an initial input followed by the minimization of a global cost function, e.g.
the Semi-Global Matching (SGM) algorithm [8].
Software implementations of these algorithms executed on General Purpose Processors
(GPPs) are too slow to be used for applications requiring high frame rates. This limitation
of GPP implementations is due to the sequential nature of the execution of instructions in
these architectures. Moreover, the limited number of cores and small cache memory further
restrict the processing power of GPPs. Alternatively, GPUs offer a high throughput at the
cost of high power consumption (hundreds of watts), rendering them infeasible in small,
portable applications like UAVs. FPGAs overcome these drawbacks of GPPs and GPUs and offer
a compact, lightweight, fast and low power solution to the problem. However, in order to
develop an FPGA implementation, the algorithm needs to be redesigned to fully exploit the
parallel nature of the target architecture.
The main contribution of this paper is a novel, highly parallel and scalable
implementation of the SGM algorithm. It utilizes internal BlockRAMs as dual-port RAMs and
multi-clock based subsystems to obtain a high throughput resulting in high FPS rates.
First, a single clock based disparity parallel architecture is introduced which calculates
a disparity at every clock cycle. Secondly, a multi-clock architecture is presented to
process multiple rows in parallel and obtain even higher FPS. The key features of the
architecture in comparison to other SGM implementations are:
• Stream based operation with no external memory
• Multi-clock based, highly parallel and scalable architecture
• Row-level parallelism integrated with disparity-level parallelism
• Higher frame rates than the state of the art for images up to Full-HD
This paper is organized as follows: In section 2 related work is discussed, while in
section 3 a brief overview of the SGM algorithm is given. The proposed implementation of
the SGM algorithm on an FPGA is presented in detail in section 4. In section 5 the obtained
results are compared and section 6 concludes the paper.
2 Related Work
Hardware implementations of stereo correspondence algorithms have long attracted the
attention of researchers as a way to achieve high frame rates. SAD implementations with
different modifications can be found in [7,12,16] and [17]. The window based approach in
SAD lends itself to a parallel, stream based architecture to achieve high FPS rates. The
authors in [13] implemented a Census based architecture and achieved 230 FPS for VGA
images (640 × 480 pixels) and 64 distinct disparities.
With the advancement in FPGA technology, even the complex computations required by global
methods can be mapped onto FPGA hardware resources. Dynamic Programming (DP) based designs
were implemented by [11,19]. In [19], 64 FPS were achieved for VGA images with 128
disparities. However, maps generated by DP suffer from a streaking effect as the
minimization function considers only one direction.
An SGM implementation was first introduced in [5]. It achieves 25 FPS for VGA images and
128 disparities. The design utilizes external memory located off-chip to store the
intermediate path costs, which increases computation latencies. In addition, scalability
for higher throughput was limited by the lack of resources of the architecture. The most
frequently cited SGM implementation is given in [2]. The presented architecture is based
on a systolic array with a two-dimensional parallelization concept. The design is able to
achieve FPS rates as high as 103 and 167 for VGA images with 128 and 64 disparities,
respectively. These rates are achieved by computing multiple pixels in parallel (depending
on the number of available image rows) for a single disparity value. However, with every
additional row added to the parallel computation, memory requirements and latency
increase significantly.
In order to deal with the high memory requirements of SGM, [9] implemented an efficient
SGM algorithm (eSGM) reducing the memory requirements at the expense of an overhead of 56%
in compute time. This resulted in a low rate of 33 FPS for VGA images with 64 disparities.
In another implementation based on a combination of FPGA and mobile CPU, [10] implemented
an SGM based design which aimed at low power and lower-end FPGAs. For 752 × 480 pixel
images, a rate of 60 FPS was achieved at a disparity range of only 32 pixels. A recent SGM
implementation with cross-based cost aggregation [23] claims to achieve the highest frame
rates for high definition images. The proposed architecture is able to achieve better
frame rates than [23].
In this paper an implementation is proposed that achieves high FPS rates with 128
disparities utilizing only internal Dual-Port BlockRAM (BRAM) integrated in the FPGA. The
path cost computation has been successfully parallelized, covering the complete disparity
range in every clock cycle. Further, a multi-clock design is used to integrate row-level
parallelism along with the disparity parallel design. Section 4 discusses the architecture
and its implementation in detail.
3 SGM Overview
SGM [8] minimizes the global cost function given by (1) along different one-dimensional
paths:
$$L_r(p,d) = C(p,d) + \min\Big(L_r(p-r,d),\; L_r(p-r,d-1)+P_1,\; L_r(p-r,d+1)+P_1,\; \min_i L_r(p-r,i)+P_2\Big) - \min_k L_r(p-r,k) \qquad (1)$$
Figure 1: This figure illustrates (a) 8 path and (b) 4 path orientations
C(p, d) represents the initial cost for pixel p at disparity level d, calculated by one of
the local methods like SAD. The path cost L_r is calculated for a pixel p by adding C(p, d)
and the penalized minimum path cost of the previous pixel along the path, p - r. There is
no penalty for the path cost at the same disparity, d. To accommodate surfaces with
smoothly varying depths, a small penalty P_1 is imposed on the adjacent disparities
(d ± 1), while a larger penalty P_2 is added to the minimum of the costs of all the
remaining disparity levels. The subtrahend in (1) limits the maximum value, resulting in a
reduced bit-width of L_r(p, d) and lowering the overall memory requirements. The path costs
are aggregated and the disparity level assigned to the pixel p is the one which minimizes
its aggregated cost. For the detailed algorithm refer to [8].
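The recurrence in (1) can be illustrated with a minimal Python sketch for a single pixel and a single path direction. The function name, the list-based representation and the numeric values are illustrative assumptions and are not taken from the paper; the hardware evaluates all disparities in parallel, whereas this sketch loops over them.

```python
def update_path_cost(cost, prev, p1, p2):
    """Compute L_r(p, .) from C(p, .) and L_r(p - r, .) following (1).

    cost: initial costs C(p, d), one entry per disparity d
    prev: path costs L_r(p - r, d) of the previous pixel on the path
    p1, p2: small and large smoothness penalties (hypothetical values below)
    """
    n = len(cost)
    prev_min = min(prev)  # min_k L_r(p - r, k), also used as the subtrahend
    out = []
    for d in range(n):
        candidates = [prev[d], prev_min + p2]
        if d > 0:
            candidates.append(prev[d - 1] + p1)
        if d < n - 1:
            candidates.append(prev[d + 1] + p1)
        out.append(cost[d] + min(candidates) - prev_min)
    return out


# Toy example with four disparity levels (values chosen only for illustration).
print(update_path_cost(cost=[3, 5, 1, 2], prev=[10, 4, 7, 20], p1=2, p2=8))
```

Subtracting the previous minimum keeps every result below max C + P_2, which is the bit-width argument made in the text.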
A total of eight paths oriented at multiples of 45° start from the edge of the image and
end at pixel p(x, y) as shown in fig. 1. Since images are streamed in a raster scan method,
it becomes impossible to calculate the costs of the lower four paths (180°, 225°, 270° and
315°), unless the complete image is first stored and then processed in two passes (forward
and backward pass). It has been observed that a depth map obtained using only the four
paths shown in fig. 1 (0°, 45°, 90° and 135°) is similar to that obtained using eight
paths [2]. As there is no need for separate forward and backward passes, the requirements
for memory space and computation time are reduced at a marginal increase in error of 1-2%.
4 Architecture and Implementation
The complete design is divided into two subsystems as shown in fig. 2. The rectified left
and right image pixels are streamed in a raster scan mode into the SAD Computation Unit.
The initial costs C(p, d) computed by the SAD Unit are sent to the Path Computation Unit
via Row-Demux and FIFO synchronizers. Path costs generated by the Path Computation Unit
are aggregated by the Path Aggregation Units and given to the Disparity Calculation Units.
The disparity calculation follows a winner-takes-all (WTA) approach and outputs the index
of the minimum aggregated cost as the final disparity. These disparities are then
interleaved via FIFO synchronizers and Row-Mux to obtain the final disparities in sequence.
Section 4.1 explains in detail how disparity-level parallelism is achieved, so that a
disparity value is calculated at every clock cycle. Section 4.2 discusses the scheme and
architecture of the overall multi-clock design, which allows the integration of
disparity-level and row-level parallelism. The implementation details are discussed in
section 4.3.
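The aggregation and winner-takes-all step described above can be sketched in a few lines of Python. The function name and the toy cost values are assumptions made for illustration; tie-breaking towards the lowest disparity is also an assumption, since the paper does not state it.

```python
def winner_takes_all(path_costs):
    """Sum the per-path costs for each disparity and pick the arg-min.

    path_costs: one list per path (e.g. L0, L45, L90, L135), each holding
    one cost per disparity level.
    Returns (best_disparity, aggregated_costs).
    """
    aggregated = [sum(values) for values in zip(*path_costs)]
    # min() returns the first minimum, so ties go to the lowest disparity.
    best = min(range(len(aggregated)), key=aggregated.__getitem__)
    return best, aggregated


paths = [
    [5, 5, 3, 7],  # L0
    [4, 6, 2, 9],  # L45
    [6, 3, 3, 8],  # L90
    [5, 4, 4, 6],  # L135
]
disparity, aggregated = winner_takes_all(paths)
print(disparity, aggregated)
```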
4.1 Disparity-Level Parallelism
This section introduces the proposed architecture design that achieves disparity-level
parallelism and is therefore able to compute the final disparity value in every clock
cycle. For simplicity, the architecture for single row computation is explained in the
following subsections.
Figure 2: Overall Architecture: The Path Computation Unit is designed to process multiple
rows along with disparity-level parallelism. The multi-clock design, with two subsystems
operating at different frequencies (n×f and f) and communicating via FIFO synchronizers,
supports the integration of row-level parallelism.
4.1.1 Path Computation Unit
The architecture of the Path Computation Unit is shown in fig. 3. In every clock cycle the
SAD values are taken as input and the corresponding four path costs (for the complete
disparity range) are calculated. Identical Cost Units are used to calculate the path costs
L0, L45, L90 and L135 in parallel. To calculate L_r(p, d), the value of L_r(p - r, d) is
required. For path L0, p - r is the pixel for which new path costs were calculated in the
previous clock cycle. These costs are fed back via a delay unit (refer to fig. 3). However,
for the paths L45, L90 and L135 the corresponding p - r pixels are different and the same
optimization cannot be applied. Thus, their path costs need to be stored and then fetched
separately on demand. This irregular access pattern makes the parallelization of SGM
complex and increases the memory requirements. Therefore, three dual-port BRAMs are
implemented to address this problem and thus eliminate the need for off-chip, external
memory. An up-counter with a maximum count equal to the width of the image is used for the
address generation. Three different logic units generate the required signals (addra,
addrb, wea and web in fig. 3) based on the counter value. More details about the signal
generation are given in section 4.1.3.
Figure 3: Architecture of the Path Computation Unit
4.1.2 Cost Unit
The Cost Unit is inspired by [18], which calculates the path costs for a fixed disparity
range in parallel. The design has been extended to calculate the path costs for the whole
disparity range in parallel. Although this extension results in a higher resource
requirement, a better performance in terms of FPS is achieved. On obtaining the new path
costs, a tree based approach is used to calculate the new minimum over all disparities,
min_d L_r(p, d). This minimum value is concatenated with the calculated path costs and
stored together with them in the BRAM.
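A tree based minimum can be sketched as a pairwise reduction. The paper does not give the structure of the tree, so the binary comparator tree below, the function name and the stage counter are assumptions for illustration only.

```python
def tree_min(values):
    """Reduce a list to its minimum with a binary comparator tree.

    Returns (minimum, number_of_tree_levels). Each loop iteration models
    one level of comparators operating in parallel.
    """
    level = list(values)
    stages = 0
    while len(level) > 1:
        nxt = [min(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # an odd element passes through to the next level
            nxt.append(level[-1])
        level = nxt
        stages += 1
    return level[0], stages


# With 128 disparity levels, a binary tree needs 7 comparator levels.
print(tree_min(list(range(128, 0, -1))))
```

Under this assumption the depth grows with log2 of the disparity range, so doubling the range adds only one comparator level.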
4.1.3 Memory Mapping and Retrieval of Path Costs
Three dual-port memories are used to store and retrieve the three path costs (refer to
fig. 3). Every path cost consists of a total of d values plus one minimum, i.e. d+1 values.
These values are concatenated and stored as one word in the BRAM. The number of words in
each memory is configured according to the width w of the image. A total memory capacity of
w+1, w+1 and w+2 words is required for L45, L90 and L135, respectively. The boundary
conditions are considered for each path while allocating the memory.
Considering a pixel p at position (x, y) in the image, to calculate its path costs
L_r(p, d) the previous path costs L_r(p - r, d) are read. In the next clock cycle
L_r(q, d) for pixel q with position (x+1, y) are calculated, while the results of
L_r(p, d) are stored. The path costs are read from and written to different addresses of
the block RAMs. These addresses can be obtained in accordance with the x coordinate of the
pixel being processed. Fig. 4 summarizes this relation by giving an example for 2 pixels,
p and q at x=4 and x=5, respectively.
Path   Address   p (x=4, y)   q (x=5, y)   any pixel (x, y)
L45    raddr     3            4            x-1
L45    waddr     3            4            x-1
L90    raddr     4            5            x
L90    waddr     3            4            x-1
L135   raddr     5            6            x+1
L135   waddr     3            4            x-1
Figure 4: Access patterns for read and write operations of Path Costs
The access pattern for pixel p at the position x=4 is shown in fig. 4. Path costs L0 are
obtained from the Delay Unit as explained in section 4.1.1. L45, L90 and L135 are read
from the address locations 3, 4 and 5 (x-1, x and x+1) of their corresponding BRAMs,
respectively. Due to the parallel design of the Cost Units, the new path costs are
available in one clock cycle. At x=5, these values need to be written to their required
memory locations. L45 calculated at x=4 will be required for pixel t (fig. 4), thus it
needs to be stored at address location 4 in the next clock cycle. Similarly, L90 and L135,
required for pixels s and r, are stored at address location 4 in their respective
memories.
Since different addresses need to be accessed for reading and writing in a single clock
cycle, dual-port BRAMs are used. The memories are implemented in a read-before-write mode
to prevent any overwriting. Using these conditions, the write enable signals and addresses
for both port A and port B of the dual-port memory are generated by the Signal Generation
Logic Units (fig. 3). Port A is always used for reading, while port B is only used to
write the calculated path costs. This memory mapping makes it possible to achieve high FPS
rates without using an external memory.
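The address relation summarized in fig. 4 can be written down as a small Python sketch. The function name and the dictionary layout are hypothetical, and the boundary handling mentioned in the text (first and last columns, address wrap-around) is deliberately not modelled.

```python
def bram_addresses(x):
    """Read/write addresses of the three path-cost BRAMs for column x.

    Follows the pattern of fig. 4: reads at x-1, x and x+1 for L45, L90 and
    L135, and a write at x-1 for all three (the results of the previous pixel).
    Image borders are not handled in this sketch.
    """
    return {
        "L45": {"raddr": x - 1, "waddr": x - 1},
        "L90": {"raddr": x, "waddr": x - 1},
        "L135": {"raddr": x + 1, "waddr": x - 1},
    }


for x in (4, 5):
    print(x, bram_addresses(x))
```

For L45 the read and write addresses coincide in every cycle, which is consistent with the read-before-write mode described above: the old word is read out before the new one replaces it.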
4.2 Multi-Clock Based Integration of Row-Level Parallelism
With the above approach one pixel is processed every clock cycle and its disparity is
calculated in parallel over the complete disparity range. However, the maximum operating
frequency is limited by the computationally expensive Path Computation Unit. Therefore,
more pixels need to be processed in every clock cycle to achieve higher FPS. Although the
row-level parallelism is similar to [2], the proposed design calculates disparities in
every clock cycle. To allow this integration, the SAD Unit must provide the initial costs
at a higher throughput (n times higher in order to process n pixels in one clock cycle).
Due to its inherently parallel structure and lower complexity, the SAD Unit is able to run
at higher frequencies than the overall SGM design. This advantage of SAD can be leveraged
in a multi-clock architecture.
Fig. 2 shows the top level architecture of the multi-clock based row-level parallelism
integrated with disparity-level parallelism. The design operates on two clocks (n×f and
f). The high frequency subsystem operates at a clock frequency of n×f, where n is the
number of rows which are processed in parallel. Each FIFO takes its read and write clocks
from the subsystems that drive its read enable (re) and write enable (we) signals,
respectively.
As shown in fig. 2, SAD costs are calculated in the high frequency subsystem. These initial
costs have to cross over into the low frequency subsystem utilizing n FIFO synchronizers.
Similarly, the n disparities calculated in parallel by the low frequency subsystem are
read sequentially by the high frequency subsystem via FIFO synchronizers. Row-Demux and
Row-Mux units are used to deinterleave and interleave the initial costs and the final
disparity values, respectively. The final disparities are read out from the high frequency
subsystem.
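The role of the Row-Demux and Row-Mux units can be illustrated with a software analogy. The sketch below assumes a simple round-robin interleaving across n FIFOs; the paper does not specify the actual interleaving order, and the function names and the use of `collections.deque` as a FIFO stand-in are illustrative assumptions.

```python
from collections import deque


def row_demux(stream, n):
    """Distribute a fast input stream over n FIFOs in round-robin order."""
    fifos = [deque() for _ in range(n)]
    for i, item in enumerate(stream):
        fifos[i % n].append(item)
    return fifos


def row_mux(fifos):
    """Read the n FIFOs back in round-robin order into one output stream."""
    out = []
    while any(fifos):  # an empty deque is falsy
        for fifo in fifos:
            if fifo:
                out.append(fifo.popleft())
    return out


# n = 2: the slow subsystem would consume one item from each FIFO per cycle.
fifos = row_demux(range(6), 2)
print([list(f) for f in fifos])
print(row_mux(fifos))
```

In this analogy the producer runs n times faster than each consumer, which mirrors the n×f and f clock domains of the design.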
4.3 Implementation Details
The design is implemented using a combination of Bluespec System Verilog (BSV) [1] and the
Xilinx System Generator (Sysgen) tool. The Verilog code obtained from BSV is integrated
into the Sysgen design. Using Xilinx ISE/Vivado, the Sysgen design is used for logic
synthesis and implementation on a target FPGA. In practice, the image resolutions of
different cameras vary, but the disparity range for depth maps is fixed. The BSV code and
the Sysgen design can be easily scaled to any image resolution by changing the input
parameters for a fixed disparity value. The design implements the same SGM algorithm as
OpenCV Semi Global Block Matching (OCV-SGBM) [3].
5 Results
Different measures are used to compare the results of real-time implementations of stereo
correspondence algorithms. As stated earlier, SGM currently provides the best trade-off
between accuracy and speed. The accuracy is measured by calculating the percentage of
pixels for which the calculated disparity differs from the ground truth by more than a
given threshold. Section 5.1 discusses the accuracy results obtained for a threshold of
1 pixel on the Middlebury [20,21] dataset.
The comparison of the speed of real-time implementations is not straightforward. This is
due to the variations in the evaluation platforms used and the modifications made to the
implemented algorithm. One method is to observe the minimum clock frequency for which
real-time results (30 FPS) are achieved. Section 5.2 compares the results obtained using
this method and clearly shows that the proposed approach outperforms the current
implementations available. The other method is to evaluate the maximum performance, i.e.
million disparity estimates per second (MDE/s), which reflects the FPS rates. Results for
the MDE/s and maximum FPS obtained are discussed in section 5.3.
Table 1: Error rates for all pixels, including the occluded pixels, at a threshold of 1 pixel
Work       Method                                Tsukuba   Venus   Cones    Teddy    Average
[13]       Census                                11.56%    5.27%   17.58%   21.5%    14%
[19]       DP-ML                                 13.58%    5.33%   -        -        >13%
[2] (1)    SGM + Rank transform (4-paths)        6.8%      4.1%    9.5%     13.3%    8.425%
[23]       SGM + cross-based cost aggregation    3.27%     0.89%   7.74%    12.1%    6%
Proposed   SGM + SAD (4-paths)                   6.34%     2.95%   11.46%   11.71%   8.12%
(1) Results available for non-occluded regions.
Figure 5: cones and teddy dataset (a) Original (b) Ground Truths (c) Obtained maps.
5.1 Accuracy
All pixels, including the occluded pixels, have been considered while calculating the
error rates. Table 1 shows the error rates on the Middlebury [20,21] dataset for a
threshold of 1 pixel. The average error rates for [13] and [19] are on the higher side
(around 14%) compared to those of [2], [23] and the proposed work (8.425%, 6% and 8.12%,
respectively). A slight increase in error of the proposed implementation compared to [23]
is expected, as SGM with cross-based aggregation is slightly more accurate than with SAD.
However, SAD costs can be generated at higher speeds than cross-based aggregated costs.
Fig. 5 shows the disparity maps generated, corresponding to the right image of the cones
and teddy dataset.
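The error measure used in Table 1 can be sketched as follows. This is a simplified illustration: the function name and the toy data are assumptions, a pixel is counted as bad when the absolute difference strictly exceeds the threshold, and special handling of pixels without valid ground truth in the Middlebury evaluation is not modelled.

```python
def bad_pixel_rate(disparity, ground_truth, threshold=1.0):
    """Percentage of pixels whose disparity error exceeds the threshold.

    disparity, ground_truth: 2D lists (rows of values) of equal shape.
    """
    total = 0
    bad = 0
    for row_d, row_g in zip(disparity, ground_truth):
        for d, g in zip(row_d, row_g):
            total += 1
            if abs(d - g) > threshold:
                bad += 1
    return 100.0 * bad / total


estimated = [[10, 12, 7], [3, 5, 9]]
truth = [[10, 10, 7], [4, 5, 12]]
print(f"{bad_pixel_rate(estimated, truth):.2f}%")
```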
5.2 Minimum Clock Frequency for 30 FPS
The minimum clock frequency required for 30 FPS is considered a fair estimate to compare
real-time implementations, since it is moderately independent of the evaluation platform
and accounts mostly for the algorithm and the architecture employed. Table 2 shows the
comparison for VGA images. The SGM implementation of [2] met the 30 FPS requirement for
VGA images at a minimum frequency of 38 MHz. The proposed work is able to achieve the same
FPS at a frequency as low as 9.3 MHz.
Table 2: Minimum clock frequency required to achieve 30 FPS for 640 × 480 images
Work                      [13]     [19]      [5]       [2]      Proposed
Disparity Range           64       128       64        128      128
Minimum Clock Frequency   12 MHz   100 MHz   133 MHz   38 MHz   9.3 MHz
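The 9.3 MHz figure for the proposed design can be related to the pixel rate with a short calculation. The sketch assumes that one pixel is consumed per clock cycle, as stated for the single row design, and ignores pipeline latency and any blanking between rows or frames; the function name is hypothetical.

```python
def min_clock_mhz(width, height, fps, pixels_per_cycle=1):
    """Lower bound on the clock frequency (MHz) for a given frame rate."""
    return width * height * fps / pixels_per_cycle / 1e6


# VGA at 30 FPS with one pixel per clock cycle.
print(min_clock_mhz(640, 480, 30))
```

The bound of about 9.2 MHz lies slightly below the reported 9.3 MHz; the difference would be consistent with a small per-frame overhead, although the paper does not break it down.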
Table 3: Maximum performance results for 640 × 480 images on Virtex-5 LX220T
           d=64                                             d=128
Work       Max. freq.   FPS   LUTs    Memory (BRAM)         Max. freq.   FPS   LUTs    Memory (BRAM)
[2]        133 MHz      167   35.3%   2380 kb               133 MHz      107   35.3%   2380 kb
Proposed   74 MHz       240   34%     2808 kb               54 MHz       175   68.7%   5076 kb
5.3 Maximum FPS and Corresponding Operating Frequency
The maximum FPS obtained and the corresponding operating frequency are mainly dependent on
the platform for which the design is targeted. The proposed design is first compared with
[2], which implemented SGM on the same FPGA (Xilinx Virtex-5 LX220T). Table 3 shows the
results obtained for the single row design (due to the limited resources of the Virtex-5,
it does not support the multiple row design). For the same specifications the proposed
work (single row) achieves 40 to 60% higher FPS at 0.4 to 0.55 times the maximum
frequency. This is obtained at the expense of using extra LUTs and built-in BRAMs.
Table 4 compares the results for two architectures: single row (only disparity-level
parallelism) and double row (multi-clock with n=2). It shows the results obtained on
Xilinx Virtex-7 FPGAs for different image resolutions. Table 4 confirms that moving from
the single row to the double row architecture increases the FPGA resource utilization,
i.e. LUTs (look-up tables) and BRAM, significantly. For a similar increment in the FPGA
utilization, extending the architecture to n ≥ 3 improves the FPS only marginally, by
2-3%. Based on preliminary results, the SAD Computation Unit poses a bottleneck due to its
maximum operating frequency limit.
For VGA images a maximum rate of 475 FPS and for Full-HD images a rate of 70 FPS is
achieved. The comparison of million disparities estimated per second (MDE/s) with previous
works is given in table 5. As can be observed, the proposed work gives the highest MDE/s
and is thus able to compute the highest FPS over all image resolutions.
Table 4: Summary of results and resource utilization for different image resolutions on
Virtex-7 690t FPGA with d=128. Columns marked 'Single' and 'Double' respectively refer to
implementations with only disparity-level parallelism and integrated row-level parallelism
with two rows.
Image Size    LUTs                         BRAMs (kb)                  Max. Frequency (1)   FPS
              Single        Double         Single        Double        Single    Double     Single   Double
640 × 480     90123 (21%)   168471 (39%)   5076 (10%)    7452 (14%)    79 MHz    73 MHz     257      475
960 × 540     90532 (21%)   168381 (39%)   5076 (10%)    8208 (16%)    79 MHz    73 MHz     152      277
1280 × 720    90069 (21%)   169080 (39%)   10098 (21%)   16290 (31%)   77 MHz    70 MHz     83       152
1920 × 1080   90871 (21%)   170161 (39%)   10098 (21%)   16290 (31%)   79 MHz    73 MHz     38       70
(1) Maximum operating frequency of the lower frequency subsystem in multi-clock design, i.e. f.
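The frame rates in Table 4 can be approximated from the clock frequency f, the image size and the number of rows n processed in parallel. The model below is an assumption made for illustration (one pixel per row per clock cycle, no latency or blanking) and is not given in the paper; it reproduces most of the table entries after truncation, while the 960 × 540 double row entry (277 FPS reported) comes out a few FPS higher.

```python
def estimated_fps(f_hz, width, height, rows=1):
    """Approximate frame rate when `rows` pixels are processed per cycle of f."""
    return rows * f_hz / (width * height)


# (width, height, f_single_hz, f_double_hz) taken from Table 4.
configs = [
    (640, 480, 79e6, 73e6),
    (960, 540, 79e6, 73e6),
    (1280, 720, 77e6, 70e6),
    (1920, 1080, 79e6, 73e6),
]
for w, h, f1, f2 in configs:
    single = estimated_fps(f1, w, h)
    double = estimated_fps(f2, w, h, rows=2)
    print(f"{w}x{h}: single {single:.1f} FPS, double {double:.1f} FPS")
```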
Table 5: Comparison of FPGA based SGM Implementations
Work       Image Size    Disparity Levels   FPS     MDE/s
[5]        680 × 400     64                 25      435.2
[2]        640 × 480     128                103     4050
[23]       1600 × 1200   128                42.61   10472
Proposed   1920 × 1080   128                70      18580
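The MDE/s values in Table 5 follow from the product of image size, disparity levels and frame rate. The short sketch below recomputes them; the function name is hypothetical, and small deviations from the table are due to rounding of the reported FPS.

```python
def mde_per_second(width, height, disparities, fps):
    """Million disparity estimates per second."""
    return width * height * disparities * fps / 1e6


rows = [
    ("[5]", 680, 400, 64, 25),
    ("[2]", 640, 480, 128, 103),
    ("[23]", 1600, 1200, 128, 42.61),
    ("Proposed", 1920, 1080, 128, 70),
]
for work, w, h, d, fps in rows:
    print(f"{work}: {mde_per_second(w, h, d, fps):.1f} MDE/s")
```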
6 Conclusion
In this paper a new architecture for implementing the SGM algorithm is introduced. By
utilizing dual-port Block RAMs (no external memory) and parallel path cost computation, a
faster scan-aligned SGM implementation is presented. The multi-clock based design allows
the integration of disparity-level and row-level parallelism. Results show that the
proposed design is faster than any previous work reported in the literature. An FPS rate
of 475 for VGA images is achieved. The design is scaled up to Full-HD images, for which a
frame rate of 70 FPS is obtained. Thus, a method to achieve high frame rates and accurate
depth maps for higher resolution stereo images has been presented.
References
[1] Arvind. Bluespec: A Language for Hardware Design, Simulation, Synthesis and Verification Invited Talk. In Proceedings of the First ACM and IEEE International Conference on Formal Methods and Models for Co-Design, MEMOCODE '03, page 249, Washington, DC, USA, 2003. IEEE Computer Society. ISBN 0-7695-1923-7.
[2] C. Banz, S. Hesselbarth, H. Flatt, H. Blume, and P. Pirsch. Real-time stereo vision system using semi-global matching disparity estimation: Architecture and FPGA-implementation. In Embedded Computer Systems (SAMOS), 2010 International Conference on, pages 93–101, July 2010. doi: 10.1109/ICSAMOS.2010.5642077.
[3] Gary Bradski. The OpenCV Library. Dr. Dobb's Journal of Software Tools, November
[4] Pedro F. Felzenszwalb and Daniel P. Huttenlocher. Efficient Belief Propagation for Early Vision. Int. J. Comput. Vision, 70(1):41–54, October 2006. ISSN 0920-5691. doi: 10.1007/s11263-006-7899-4.
[5] Stefan K. Gehrig, Felix Eberli, and Thomas Meyer. A real-time low-power stereo vision engine using semi-global matching. In International Conference on Computer Vision Systems, pages 134–143. Springer, 2009.
[6] Ankur Handa, Richard Newcombe, Adrien Angeli, and Andrew Davison. Real-time camera tracking: When is high frame-rate best? Computer Vision–ECCV 2012, pages 222–235, 2012.
[7] Masanori Hariyama, Yasuhiro Kobayashi, Haruka Sasaki, and Michitaka Kameyama. FPGA implementation of a stereo matching processor based on window-parallel-and-pixel-parallel architecture. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 88(12):3516–3522, 2005.
[8] Heiko Hirschmüller. Stereo processing by semiglobal matching and mutual information. IEEE Transactions on pattern analysis and machine intelligence, 30(2):328–341,
[9] Heiko Hirschmüller, Maximilian Buder, and Ines Ernst. Memory efficient semi-global matching. ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 3:371–376, 2012.
[10] D. Honegger, H. Oleynikova, and M. Pollefeys. Real-time and low latency embedded computer vision hardware based on a combination of FPGA and mobile CPU. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 4930–4935, Sept 2014. doi: 10.1109/IROS.2014.6943263.
[11] Hong Jeong et al. Real-time stereo vision FPGA chip with low error rate. In Multimedia and Ubiquitous Engineering, 2007. MUE'07. International Conference on, pages 751–756. IEEE, 2007.
[12] Yunde Jia, Xiaoxun Zhang, Mingxiang Li, and Luping An. A miniature stereo vision machine (MSVM-III) for dense disparity mapping. In Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, volume 1, pages 728–731. IEEE, 2004.
[13] Seunghun Jin, Junguk Cho, Xuan Dai Pham, Kyoung Mu Lee, Sung-Kee Park, Munsang Kim, and Jae Wook Jeon. FPGA design and implementation of a real-time stereo vision system. IEEE transactions on circuits and systems for video technology, 20(1):15–26, 2010.
[14] Jun Chen, Takashi Yamamoto, Tadayoshi Aoyama, Takeshi Takaki, and Idaku Ishii. Real-Time Projection Mapping Using High-Frame-Rate Structured Light 3D Vision. SICE Journal of Control, Measurement, and System Integration, 8(4):265–272, 2015.
[15] Vladimir Kolmogorov and Ramin Zabih. Computing visual correspondence with occlusions using graph cuts. In Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on, volume 2, pages 508–515. IEEE, 2001.
[16] SungHwan Lee, Jongsu Yi, and JunSeong Kim. Real-time stereo vision on a reconfigurable system. In International Workshop on Embedded Computer Systems, pages 299–307. Springer, 2005.
[17] Stefania Perri, Daniela Colonna, Paolo Zicari, and Pasquale Corsonello. SAD-based stereo matching circuit for FPGAs. In Electronics, Circuits and Systems, 2006. ICECS'06. 13th IEEE International Conference on, pages 846–849. IEEE, 2006.
[18] Mikołaj Roszkowski and Grzegorz Pastuszak. FPGA design of the computation unit
for the semi-global stereo matching algorithm. In Design and Diagnostics of Electronic
Circuits & Systems, 17th International Symposium on, pages 230–233. IEEE, April
[19] Siraj Sabihuddin, Jamin Islam, and W James MacLean. Dynamic programming approach
to high frame-rate stereo correspondence: A pipelined architecture implemented
on a field programmable gate array. In Electrical and Computer Engineering,
2008. CCECE 2008. Canadian Conference on, pages 001461–001466. IEEE, 2008.
[20] Daniel Scharstein and Richard Szeliski. A Taxonomy and Evaluation of Dense Two-
Frame Stereo Correspondence Algorithms. International Journal of Computer Vision,
47(1-3):7–42, April 2002. ISSN 0920-5691. doi: 10.1023/A:1014573219977. URL
[21] Daniel Scharstein and Richard Szeliski. High-accuracy Stereo Depth Maps Using
Structured Light. In Proceedings of the 2003 IEEE Computer Society Conference
on Computer Vision and Pattern Recognition, CVPR’03, pages 195–202, Washington,
DC, USA, 2003. IEEE Computer Society. ISBN 0-7695-1900-8, 978-0-7695-1900-5.
[22] Sebastian Schuon, Christian Theobalt, James Davis, and Sebastian Thrun. LidarBoost:
Depth Superresolution for ToF 3D shape scanning. In Computer Vision and Pattern
Recognition, 2009. CVPR 2009. IEEE Conference on, pages 343–350. IEEE, 2009.
[23] W. Wang, J. Yan, N. Xu, Y. Wang, and F. H. Hsu. Real-Time High-Quality Stereo
Vision System in FPGA. IEEE Transactions on Circuits and Systems for Video Technology,
25(10):1696–1708, Oct 2015. ISSN 1051-8215. doi: 10.1109/TCSVT.2015.
[24] Ramin Zabih and John Woodfill. Non-parametric local transforms for computing visual
correspondence. In European Conference on Computer Vision, pages 151–158. Springer,