Real-Time Low Level Feature Extraction for on-board Robot Vision Systems

Roberto Pirrone, Giuseppe Careri, F. Saverio Fabiano, Antonio Gentile, Member, IEEE, and Salvatore Gaglio, Member, IEEE
Dipartimento di Ingegneria Informatica
Università di Palermo, Viale delle Scienze, 90128 Palermo, Italy
Email: {pirrone, gentile, gaglio}@unipa.it

Abstract—Robot vision systems notoriously require large computing capabilities, rarely available on physical devices. Robots have limited embedded hardware, and almost all sensory computation is delegated to remote machines. Emerging gigascale integration technologies offer the opportunity to explore alternative computing architectures that can deliver a significant boost to on-board computing when implemented in embedded, reconfigurable devices. This paper explores the mapping of low level feature extraction on one such architecture, the Georgia Tech SIMD Pixel Processor (SIMPil). The Fast Boundary Web Extraction (fBWE) algorithm is adapted and mapped on SIMPil as a fixed-point, data parallel implementation. Application components and their mapping details are provided in this contribution along with a detailed analysis of their performance.

I. INTRODUCTION

Typical robot vision routines, such as self-localization or 3D perception via calibrated cameras, require large computing capabilities. Autonomous robot platforms have limited space to dedicate to such high level tasks because on-board computers are busy most of the time with motor control and sensory data acquisition. Even more limited embedded hardware is available on small wheeled robots, for which almost all sensory computation is delegated to remote machines. Even on robots equipped with an on-board computer, most processing focuses on motion control and low level sensory data processing, while heavy computer vision tasks, like image segmentation and object recognition, are performed in the background via fast connections to a host computer. Emerging gigascale integration technologies offer the opportunity to explore alternative approaches to domain specific computing architectures that can deliver a significant boost to on-board computing when implemented in embedded, reconfigurable devices. This paper describes the mapping of low level feature extraction on a reconfigurable platform based on the Georgia Tech SIMD Pixel Processor (SIMPil). In particular, an adaptation of the Boundary Web Extractor (BWE) has been implemented on SIMPil, exploiting the large amount of data parallelism inherent in this application. The BWE [1] is derived from Grossberg's original Boundary Contour System (BCS) and extracts a dense map of isoluminance contours from the input image. This map contains actual edges along with a compact representation of local surface shading, and it is useful for high level vision tasks like Shape-From-Stereo. The Fast Boundary Web Extraction (fBWE) algorithm has been implemented in fixed point as a feed-forward processing pipeline, thus avoiding the BWE feedback loop and achieving a considerable speed-up over the standard algorithm.
Application components and their mapping details are provided in this contribution along with a detailed analysis of their performance. Results show execution times on the order of 170 µsec for a 256000 pixel image. The rest of this paper is organized as follows. Section II introduces the Georgia Tech SIMPil architecture and the implementation efforts on FPGA. Section III provides some remarks on Grossberg's original BCS and its derived BWE model. In Section IV the fBWE system is described and its mapping onto SIMPil is detailed. Section V reports extensive experiments comparing the fBWE with the BWE, while Section VI draws some conclusions.

II. SIMPil FPGA IMPLEMENTATION

The Georgia Tech SIMD Pixel Processor (SIMPil) architecture consists of a mesh of SIMD processors on top of which an array of image sensors is integrated [2] [3]. Each processing element (PE) includes a RISC load/store datapath plus an interface to a 4×4 sensor subarray. A 16-bit datapath has been implemented which includes a 32-bit multiply-accumulate unit, a 16-word register file, and 64 words of local memory (the ISA allows for up to 256 words). The SIMD execution model allows the entire image projected on many PEs to be acquired in a single cycle. Large arrays of SIMPil PEs can be simulated using the SIMPil Simulator, an instruction level simulator. Early prototyping efforts have proved the feasibility of directly coupling a simple processing core with a sensor device [4]. A 16-bit prototype of a SIMPil PE was designed in a 0.8 µm CMOS process and fabricated through MOSIS. A 4096 PE target system has been used in the simulations. This system is capable of delivering a peak throughput of about 1.5 Tops/sec in a monolithic device, enabling image and video processing applications that are currently unapproachable using today's portable DSP technology.

The SIMPil architecture is designed for image and video processing applications. In general, this class of applications is very computationally intensive and requires high throughput to handle the massive data flow in real time. However, these applications are also characterized by a large degree of data parallelism, which is maximally exploited by focal plane processing. Image frames are available simultaneously at each PE in the system, while retaining their spatial correlation. Image streams can therefore be processed at frame rate, with only a nominal amount of memory required at each PE [2].

The performance and efficiency of SIMPil have been tested on a large application suite that spans the target workload. For the SIMPil processing element, an application suite was selected from the DARPA Image Understanding suite [5]. These applications are expressed in SIMPil assembly language and executed using an instruction level simulator, SIMPilSim, which provides various execution statistics. All applications are executed on a simulated 4096 processing element system with 16 pixels mapped to each PE, for an aggregate 256×256 image size. All applications run well within real-time frame rates and exhibit large system utilization figures (90% or more for most applications). Details can be found in [2].

To bring SIMPil performance onto robot platforms, a reconfigurable platform based on FPGA devices is being developed. This platform uses a parameterized SIMPil core (SIMPil-K) described in the VHDL hardware description language. The SIMPil-K platform is an array of PEs and interconnection registers which can be configured to fit any FPGA device at hand. Figure 1 shows the high-level functional schema of a 4×4 SIMPil-K array and its NEWS interconnection network. Each NEWS register supports communication between a particular node (i.e. PE) and its north and west neighbours. By replicating this model, a NEWS (North, East, West, South) network is obtained, with every node connected to its four neighbours.

SIMPil-K receives an instruction stream through a dedicated input port. The instruction stream is then broadcast to each PE. To upload and download image data, SIMPil-K uses a boundary I/O mechanism supported by its boundary nodes (i.e. the PEs lying on its East/West edges): every east-edge node uploads a K-bit data word from its boundary-input port to the general purpose register file; every west-edge node downloads a K-bit data word from its register file to the boundary-output port. An upload/download operation (one word per node) takes only one clock cycle. Both boundary input and output operations are enabled by a single instruction, XFERB. A NEWS transfer instruction likewise needs only one clock cycle to transfer a data word from each node to its neighbour in the specified direction.

The SIMPil-K platform can be reconfigured by varying a number of architectural parameters, as detailed in Table I. This allows for experimentation with a large set of different system configurations, which is instrumental in determining the appropriate system characteristics for each application environment. The AW and RAW parameters set the address space of the register file and of the local memory, respectively. PPE specifies the number of image pixels mapped to each PE. The Influence parameter toggles between a fixed instruction width (24 bits) and a variable one (8+K bits).

The interface of a processing element has two input ports for clock signals, a reset input port, and the instruction stream port. NEWS transfers are carried through three dedicated bidirectional ports (NEWS ports) which drive three NEWS buses, namely the North/West Bus, the East Bus and the South Bus. Boundary data input and output are carried through two dedicated boundary ports. The parameterized architecture of the processing element is described in Figure 2. There are four communication buses shared by the functional units. All functional units can be reconfigured based on the selected datapath width. A single PE can perform integer operations on K-bit words. A dedicated barrel shift unit and a multiply-accumulate unit are instrumental in speeding up most image processing kernels. The Sleep Unit verifies and updates the node activity state, thus allowing execution flow control based on the local data of each PE.

The SIMPil-K system has been simulated and synthesized on FPGA; synthesis statistics about the employed resources have been generated and analyzed. Several 16-bit SIMPil-K versions have been implemented on an eight million gate FPGA: in particular, the 2-by-2, 4-by-4 and 8-by-8 16-bit SIMPil-K arrays use 3.3%, 13.3%, and 53.3% of the available resources, respectively.
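
The reported utilization figures can be checked for how they scale with the array size. The sketch below uses only the three percentages quoted above and divides each by the PE count of the corresponding array; the variable names are illustrative and not part of the SIMPil-K tool flow.

```python
# Resource usage reported for 16-bit SIMPil-K arrays on the eight million gate FPGA.
# Keys are the array side (2 means a 2-by-2 array); values are percentages.
reported_usage = {2: 3.3, 4: 13.3, 8: 53.3}

# Share of the device taken by a single PE in each configuration.
per_pe_usage = {side: pct / (side * side) for side, pct in reported_usage.items()}

for side, share in per_pe_usage.items():
    print(f"{side}x{side} array: {share:.3f}% of the device per PE")
```

Each PE accounts for roughly 0.83% of the device in all three configurations, so the resource count grows almost linearly with the number of PEs.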

![Fig. 1. K-bit 4-by-4 SIMPil-K array](image)

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Function</th>
<th>Values</th>
<th>Constr.</th>
<th>Def.</th>
</tr>
</thead>
<tbody>
<tr>
<td>K</td>
<td>Word Width</td>
<td>[4, 16, 32, 64]</td>
<td>-</td>
<td>16</td>
</tr>
<tr>
<td>X</td>
<td>Array Columns</td>
<td>X ∈ N</td>
<td>X, Y = 2^4</td>
<td>4</td>
</tr>
<tr>
<td>Y</td>
<td>Array Rows</td>
<td>Y ∈ N</td>
<td></td>
<td>4</td>
</tr>
<tr>
<td>AW</td>
<td>Register File Address Width</td>
<td>AW ∈ [1, 16]</td>
<td>I = off ⇒ AW ≤ 4</td>
<td>4</td>
</tr>
<tr>
<td>RAW</td>
<td>Local RAM Address Width</td>
<td>RAW ∈ [1, 64]</td>
<td>RAW ≤ K</td>
<td>4</td>
</tr>
<tr>
<td>PPE</td>
<td>Pixel per Processing Element ratio</td>
<td>PPE ∈ N</td>
<td>PPE = p^x, p ∈ N</td>
<td>8</td>
</tr>
<tr>
<td>Influence</td>
<td>Instructions Format Change Enable</td>
<td>I ∈ {off, on}</td>
<td>-</td>
<td>off</td>
</tr>
</tbody>
</table>

**TABLE I**

SIMPil-K Architectural Parameters
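
As an illustration of how the parameters of Table I interact, the following sketch collects the default values in a small configuration object and derives a few quantities from them. The class `SimpilKConfig` and its property names are hypothetical; the derived values only assume that AW and RAW are address widths (so they address 2^AW registers and 2^RAW memory words) and that the instruction width is 24 bits when Influence is off and 8+K bits otherwise, as stated in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimpilKConfig:
    """Hypothetical container for the Table I parameters (defaults from the table)."""
    k: int = 16              # word width in bits
    x: int = 4               # array columns
    y: int = 4               # array rows
    aw: int = 4              # register file address width
    raw: int = 4             # local RAM address width
    ppe: int = 8             # pixels per processing element
    influence: bool = False  # False: fixed 24-bit instructions, True: 8+K bits

    @property
    def pe_count(self) -> int:
        return self.x * self.y

    @property
    def registers(self) -> int:
        return 2 ** self.aw

    @property
    def ram_words(self) -> int:
        return 2 ** self.raw

    @property
    def instruction_width(self) -> int:
        return 8 + self.k if self.influence else 24

    @property
    def image_pixels(self) -> int:
        return self.pe_count * self.ppe


default = SimpilKConfig()
wide = SimpilKConfig(k=32, influence=True)
print(default.pe_count, default.registers, default.instruction_width)
print(wide.instruction_width)
```

With the default $K = 16$ the two instruction formats coincide at 24 bits; the variable format only differs for other word widths.
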

III. THE BOUNDARY WEBS EXTRACTOR

The original BCS architecture was proposed by Grossberg and Mingolla [6] as a neural model aimed at explaining some psychological findings about the perceptual grouping of contours in vision: it was part of a more complex theory regarding human perception of shapes and colors. In this formulation, the BCS is a multi-layer recurrent network trained using a competitive-cooperative scheme until an equilibrium state is reached. The network takes its input from a gray-level image, and the aim of the first competitive stage is to reduce the diffusion of activation beyond contour endpoints. Activation laws are, in general, differential equations with respect to time, but in the BCS computational model they are computed at equilibrium ($d/dt = 0$). In the case of the Competition I layer the dynamic activation rule is:

$$\frac{dw_{ijk}}{dt} = -w_{ijk} + I + BJ_{ijk} + v_{ijk} - Bw_{ijk}\sum_{(p,q)}J_{pqr}A_{pqij}$$

and the equilibrium activation $w_{ijk}$ for each cell in this stage is computed as:

$$w_{ijk} = \frac{I + BJ_{ijk} + v_{ijk}}{1 + B\sum_{(p,q)}J_{pqr}A_{pqij}}$$

where $v_{ijk}$ is the feedback signal, $A_{pqij}$ are the coefficients of a small kernel with cylindrical shape, and $I$ and $B$ are suitable constants. In the following equations, capital letters without indexes are constant values used to tune the model.
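
The equilibrium activation can be evaluated directly for one cell. The sketch below is a minimal illustration, not the authors' implementation: the function name, the flat lists used for the neighbouring inputs and kernel coefficients, and the constant values $I = 0.1$ and $B = 1$ are all assumptions chosen for the example.

```python
def competition1_equilibrium(j_center, v, j_neighbours, kernel, i_const=0.1, b_const=1.0):
    """Equilibrium activation w of one Competition I cell.

    j_center      -- OC Filter input J at the cell itself
    v             -- feedback signal at the cell
    j_neighbours  -- inputs J at the neighbouring positions (p, q)
    kernel        -- kernel coefficients A for the same positions
    """
    # Shunting inhibition collected from the neighbourhood.
    inhibition = sum(j * a for j, a in zip(j_neighbours, kernel))
    return (i_const + b_const * j_center + v) / (1.0 + b_const * inhibition)


isolated = competition1_equilibrium(0.5, 0.0, [], [])
crowded = competition1_equilibrium(0.5, 0.0, [1.0, 1.0], [0.5, 0.5])
print(isolated, crowded)
```

The same input produces a smaller activation when the neighbourhood is active, which is the divisive effect of the denominator.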

The second competitive stage performs competition among orientations inside the same cell: this is a local contour refinement mechanism which will be enhanced by the cooperative stage. The activation law has the following form:

$$y_{ijk} = \frac{C[w_{ijk} - w_{ijK}]^+}{D + \sum_{m=1}^{n}[w_{ijm} - w_{ijM}]^+}$$

where capital indexes refer to the direction orthogonal to the current one. The cooperative stage performs long range cooperation between cells with the same orientation that are displaced in a wide neighborhood. In this way, the completion of long contours is enabled. Considering the vector $d$ connecting the position $(i, j)$ with a generic neighbor $(p, q)$, the following quantities can be defined: $N_{pqij} = |d|$ and $Q_{pqij} = \angle d$. The cooperative activation law is:

$$z_{ijk} = g\left(\sum_{(p,q,r)}(y_{pqr} - y_{pqR})[G_{pqij}^{(r,k)}]^+\right)$$

where:

$$g(x) = \frac{H[x]^+}{K + [x]^+},$$

$$G_{pqij}^{(r,k)} = \exp(-2(N_{pqij}/P - 1)^2)\,|\cos(Q_{pqij} - r)|^R \cos^T(Q_{pqij} - k)$$
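
The kernel can be evaluated numerically to see where it responds. The sketch below reads it as the product of the radial term $\exp(-2(N/P - 1)^2)$, the factor $|\cos(Q - r)|^R$ and the factor $\cos^T(Q - k)$. The function name, the use of radians for $r$ and $k$, the axis convention of the displacement, and the values $P = 4$, $R = 2$, $T = 3$ are assumptions made for illustration only.

```python
import math


def cooperative_kernel(di, dj, r, k, p_opt=4.0, r_exp=2, t_exp=3):
    """Kernel value for a neighbour displaced by (di, dj) from the cell.

    r, k   -- orientation angles in radians (assumed convention)
    p_opt  -- distance P at which the radial term peaks
    r_exp, t_exp -- integer exponents R and T (hypothetical values)
    """
    distance = math.hypot(di, dj)      # N = |d|
    direction = math.atan2(dj, di)     # Q = direction of d
    radial = math.exp(-2.0 * (distance / p_opt - 1.0) ** 2)
    return radial * abs(math.cos(direction - r)) ** r_exp * math.cos(direction - k) ** t_exp


print(cooperative_kernel(4, 0, 0.0, 0.0))    # on the axis of k, at the optimal distance
print(cooperative_kernel(-4, 0, 0.0, 0.0))   # opposite side of the cell
print(cooperative_kernel(0, 4, 0.0, 0.0))    # orthogonal to k
```

With an odd $T$ the values on the two sides of the cell have opposite sign, so the rectification $[\cdot]^+$ in the activation law keeps only one side for a given kernel.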

This very complex kernel has the form of two elongated blobs aligned with the orientation $k$ and exponentially decreasing towards 0. In particular, $P$ represents the optimal distance from the cooperative cell at which maximum input activation is collected. Finally, feedback is provided from the cooperative stage to the first competitive one, in order to reinforce those activations that are aligned with emergent contours and to decrease spurious ones. The form of the feedback signal is:

$$v_{ijk} = \frac{L[z_{ijk} - M]^+}{1 + L \sum_{p,q} [z_{pqk} - M]^+ W_{pqij}}$$

(6)

where $W_{pqij}$ are the coefficients of a small cylinder-shaped kernel. BCS provides a compact description of image shading at selectable resolution levels: shading, in turn, can be used to perform shape estimation, while boundary webs can be used as low level features for contour extraction, alignment, or stereo matching. Possible uses of the BCS have been explored by some of the authors, resulting in a software implementation of the BCS called Boundary Web Extractor (BWE), which has been used as a low level feature extraction module in different vision systems. In particular, a neural shape estimation model has been proposed [1]. Another approach [7] performs BWE analysis on stereo pairs: input images are analyzed both with a standard correlation operator over pixel intensities and with the BWE as a supplementary feature. The high resolution achievable by the BWE analysis enables dense depth maps. The main objective of the BWE is to perform local brightness estimation and emergent contour alignment. In particular, $N$ pairs of dually oriented Gabor masks have been used as receptive fields to obtain $n$ activation values, discarding, for each pair, the mask providing a negative output. The resulting OC Filter is described by the following equation:

$$J_{ijk} = [U_{ijk}]^+ + [V_{ijk}]^+$$

(7)

where $U_{ijk}$ and $V_{ijk}$ are the outputs of two dual Gabor masks. In our implementation the generic Gabor filter has a width $w$ of 8 pixels, and $2N = 24$. The filter equation is:

$$M_{ijk} = \alpha e^{-\beta(\gamma B^2 + C^2)} \sin(\delta C)$$

$$B = (w - p) \cos(2k\pi/N) - (q - s) \sin(2k\pi/N)$$

$$C = (w - p) \sin(2k\pi/N) + (q - s) \cos(2k\pi/N)$$

(8)

Here $s$ is the application step of the masks; the $\alpha, \ldots, \delta$ parameters have been heuristically tuned. The kernels in eqs. 3 and 6 have been selected with Gaussian shape, the subtractive term in the exponential part of the $G_{pqij}^{(r,k)}$ kernel has been suppressed, and all constant values in the equations have been suitably tuned. To keep the kernel symmetric, its central value has been forced to 0, which prevents the exponential function from giving a positive value when $N_{pqij} = 0$. Finally, we can formulate the BWE structure as a 3D matrix containing, at each location $(i,j)$, $2N$ activation values belonging to a star of vectors.

$$\mathbf{BW} = \{\mathbf{B}_{ij}\} \quad i, j = 1, \ldots, M$$

$$\mathbf{B}_{ij} = \{b_{ijk}\} \quad k = 1, \ldots, 2N$$

(9)

Each vector represents the value of the image contrast along the direction orthogonal to its phase. As a consequence of the modified OC Filter behaviour, the location $\mathbf{B}_{ij}$ of the $\mathbf{BW}$ matrix contains $N$ pairs, each of them having a null vector that corresponds to the negative output of the filter with the same orientation.

$$\mathbf{b}_{ijk} = b_{ijk}e^{\theta_{ijk}} \quad b_{ijk} = \max(|b_{ijk}|, 0) \quad \theta_{ijk} = k\pi/N$$

(10)
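
The structure of one location of the boundary web can be illustrated with a short sketch. It assumes that the dual mask of a pair responds with the sign-flipped value of the first mask and that its activation is stored $N$ positions later in the star; both conventions, like the function name, are assumptions made for the example.

```python
def boundary_web_star(signed_responses):
    """Build the 2N activations of one location from N signed mask responses.

    For each pair, the mask with negative output is discarded (set to zero),
    so at least one of the two entries k and k + N is null.
    """
    first_half = [max(u, 0.0) for u in signed_responses]    # masks with orientation k
    second_half = [max(-u, 0.0) for u in signed_responses]  # dual masks, k + N
    return first_half + second_half


star = boundary_web_star([0.5, -0.3, 0.2])
print(star)
```

For the three signed responses above the star has six entries, three of which are null.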

For computer vision purposes, the average boundary webs are noteworthy because they provide a single estimate of the local image contrast at each spatial location, both in intensity and in direction. The averaging process is computed using a suitable average function $f_{av}$:

$$\mathbf{A}_{BW} = \{a_{ij}\} \quad i, j = 1, \ldots, M$$

$$\forall i, j \quad \mathbf{a}_{ij} = a_{ij}e^{\theta_{ij}} = f_{av}(\mathbf{B}_{ij})$$

(11)

The average function can be selected according to several criteria, such as the maximum value or the vector sum of all the elements at each location. We selected a form of $f_{av}$ that weights each intensity with the cosine of the angle between its phase and a mean phase angle, the latter obtained by weighting each phase with the respective intensity.

$$f_{av} : \quad \theta_{ij} = k_M\pi/N, \quad k_M = \sum_{k=1}^{2N} b_{ijk}k / \sum_{k=1}^{2N} b_{ijk}$$

$$a_{ij} = \frac{\sum_{k=1}^{2N} \text{abs}(\cos(\theta_{ijk} - \theta_{ij})) b_{ijk}}{2N}$$

(12)
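
A minimal sketch of the averaging step for one location follows. It assumes that the amplitude is meant as the cosine-weighted sum of the intensities divided by $2N$ (an interpretation of eq. 12), that $k$ runs from 1 to $2N$, and that a location without activations returns a null vector; the function name is illustrative.

```python
import math


def average_boundary_web(b):
    """Return (amplitude, phase) of the average vector for one location.

    b -- list of the 2N intensities b_k, for k = 1 .. 2N
    """
    two_n = len(b)
    n = two_n // 2
    total = sum(b)
    if total == 0:
        return 0.0, 0.0  # assumption: no activation gives a null vector

    # Mean orientation index: each k is weighted by its intensity.
    k_mean = sum(k * bk for k, bk in enumerate(b, start=1)) / total
    theta = k_mean * math.pi / n

    # Each intensity is weighted by |cos| of its angular distance from theta.
    weighted = sum(abs(math.cos(k * math.pi / n - theta)) * bk
                   for k, bk in enumerate(b, start=1))
    return weighted / two_n, theta


amplitude, phase = average_boundary_web([0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0])
print(amplitude, phase)
```

In the example only the orientation $k = 3$ is active with $N = 4$, so the mean phase is $3\pi/4$ and the amplitude is $2/8$.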

Figure 3 compares the original BCS and the BWE, both for the actual output and for the average one.

IV. THE fBWE SYSTEM

The main idea behind the fBWE implementation is to design a massively parallel algorithm that is robust with respect to noise while producing an output as similar as possible to that of the true BWE architecture. The main performance drawbacks of the BWE network are the presence of a feedback loop, which drives the whole system to a steady state, and the use of floating point calculations. The fBWE system is a feed-forward processing pipeline that is completely implemented using 16-bit integer arithmetic, according to SIMPil-K requirements. The architecture relies on the cascade of the OC Filter and a competitive-cooperative pipeline. The SIMPil-K configuration we used is made of $32 \times 32$ PEs with a PPE equal to 64, that is, each sub-image is 8×8 pixels wide. The whole process has been applied to 256×256 images with $M = 64$, so there is a 4-pixel overlap along each direction between two adjacent neighborhoods.

Gabor masks in the OC Filter have been implemented using equation 8 and have been provided to the PE array as a suitable gray level image. The original floating point values obtained for the weights have been approximated to 8-bit integer values, and the minimum value has been added to each of them to bring the dynamic range into $[0, \ldots, 255]$. The same mask is loaded into all the PEs in one column. Each row of the first image contains only 16 different orientations repeated twice, while the second one contains the last 8 orientations repeated four times. At loading time, the offset is subtracted from each Pixel Register in the PE to correct the weights.

After the input image is loaded, the actual filtering starts. The R15 register of each PE contains the value of the orientation $k$, so that the result of each filtering step is stored in the correct position. Due to the overlap, each mask is used to convolve four neighborhoods, shifting only one half of a sub-image between two PEs at each step, according to the scheme West-North-East-South. Finally, the Gabor masks image is shifted in the West direction by 8 pixels, and the filtering cycle starts again. The same procedure is adopted for the second Gabor masks image, but the filtering cycle is iterated only 8 times. After the filtering phase, each PE contains four adjacent locations, each containing $N$ non-null orientations due to the application of equation 7.

The OC Filter output is quite precise in determining the orientations, but it suffers from its locality. Contours are not perfectly aligned, and they tend to double along a direction due to the activations present in pairs of overlapping regions which intersect the same contour line. The competitive-cooperative pipeline tends to eliminate these problems without the use of a feedback scheme. Here the outputs of the OC Filter are grouped as $N$ orientation images $64 \times 64$ pixels wide. The pipeline is split into two parallel branches: at the first step each orientation image is processed with a $3 \times 3$ high pass filter in the left branch, and with a median filter of the same size in the right one. The left branch is aimed at enriching details and strengthening the contours, while the median filter is a form of blurring intended to force close orientations to align, thus correcting the spurious outputs of the OC Filter. The implementation of these filters in SIMPil-K implies that each PE needs a frame of 12 values surrounding the ones stored in its local memory. A suitable transfer routine has therefore been set up to obtain these values from the 8-neighborhood surrounding the PE. The four filtered values are again stored in the PE's local memory.

The next step in both pipeline branches is the suppression of uniform activation values. When an image region containing the location $(i, j)$ exhibits uniform luminance, without perceivable contrast variation along any direction, the fBWE activations $b_{ijk}$ are almost of the same magnitude and a sort of little star is visualized in the output. To avoid this behaviour, the uniform activations suppression acts according to the following rule:

$$\hat{b}_{ij} - \check{b}_{ij} \leq 0.2\,\hat{b}_{ij} \Rightarrow \forall k \ b_{ijk} \equiv 0$$

$$\hat{b}_{ij} = \max_k (b_{ijk})$$

$$\check{b}_{ij} = \min_k (b_{ijk})$$

Here the threshold value of 0.8, i.e. the minimum ratio between the smallest and the largest activation for a location to be considered uniform (equivalently, a spread of at most 0.2 of the maximum), has been selected through a trial and error process. After the suppression of uniform activations, the maximum values $\bar{b}_{ij}^l$ and $\bar{b}_{ij}^r$ are selected at each location for the left and right branches, thus obtaining two average boundary webs images, using $\max(\cdot)$ in place of the averaging function $f_{av}$. High pass and median filters give rise to extremely different dynamics in the two pipeline branches, so a gain element has been placed in the high pass branch to normalize these ranges. The gain factor has been determined as

$$A_s = \frac{\max_{ij}(\bar{b}_{ij}^r)}{\max_{ij}(\bar{b}_{ij}^l)}$$
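
The suppression rule, the selection of the maximum and the gain factor can be sketched as follows for a single location and for lists of branch amplitudes. The function names and the plain Python lists are illustrative assumptions; the sketch uses ordinary arithmetic and does not reproduce the 16-bit fixed-point code running on the PEs.

```python
def suppress_uniform(b, spread=0.2):
    """Zero every orientation of a location whose activations are nearly uniform."""
    highest, lowest = max(b), min(b)
    if highest - lowest <= spread * highest:
        return [0] * len(b)
    return list(b)


def strongest(b):
    """Return (amplitude, orientation index) of the largest activation."""
    k = max(range(len(b)), key=lambda idx: b[idx])
    return b[k], k


def branch_gain(left_amplitudes, right_amplitudes):
    """Gain for the high pass (left) branch: ratio of the two branch maxima."""
    left_peak = max(left_amplitudes)
    if left_peak == 0:
        raise ValueError("left branch has no activation")
    return max(right_amplitudes) / left_peak


print(suppress_uniform([100, 90, 85]))   # spread 15 <= 20, suppressed
print(suppress_uniform([100, 20, 10]))   # spread 90 > 20, kept
print(strongest([10, 70, 30]))
print(branch_gain([10, 20], [130, 50]))
```

The last call uses made-up amplitudes chosen so that the gain falls in the range observed in the experiments.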

In all our experiments $A_s$ assumed values between 6 and 7. Before the two branches are merged through the pixel-by-pixel union of the left $(W_L)$ and right $(W_R)$ images, a sharp threshold $S$ is applied so that $W_L$ and $W_R$ join exactly. The value of $S$ has been selected as 30% of the maximum activation in $W_L$, and all the values in $W_L$ that are above $S$ are joined with all the values of $W_R$ that are below the same threshold. The joined image $W_J$ can be defined as $W_J = [(W_{J,ij}, k_{ij})]$, where for each location $(i, j)$ the amplitude and the relative orientation value are defined. The last step is the cooperative filtering that generates the fBWE image $W$ and is aimed at reinforcing aligned neighboring activations. An activation is reinforced if its orientation differs only slightly from that of the location at the center of the filter mask; otherwise it is decreased. The generic weight $M_{pq}$ of the filter applied to the location $(i, j)$ is defined as:

$$M_{pq} = \begin{cases} 
1 - \frac{|k_{pq} - k_{ij}|}{N/2} & \frac{|k_{pq} - k_{ij}|}{N/2} < N/2 \\
1 - \frac{|k_{pq} - k_{ij} - N|}{N/2} & \text{otherwise}
\end{cases}$$
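
The thresholded union of the two branches, which produces the joined image that the cooperative filter operates on, can be sketched as below. The sketch assumes that $W_L$ has already been multiplied by the gain factor, that each location stores an (amplitude, orientation) pair, and that a location satisfying neither condition is left empty; the last choice is not specified in the text and is an assumption, as is the function name.

```python
def join_branches(w_left, w_right, fraction=0.3):
    """Join two branch images given as rows of (amplitude, orientation) pairs."""
    # Threshold S: a fraction of the maximum activation in the left image.
    peak = max(amp for row in w_left for amp, _ in row)
    s = fraction * peak

    joined = []
    for row_l, row_r in zip(w_left, w_right):
        out_row = []
        for (amp_l, k_l), (amp_r, k_r) in zip(row_l, row_r):
            if amp_l > s:            # strong response in the high pass branch
                out_row.append((amp_l, k_l))
            elif amp_r < s:          # weak response in the median branch
                out_row.append((amp_r, k_r))
            else:                    # assumption: neither condition holds
                out_row.append((0, None))
        joined.append(out_row)
    return joined


w_l = [[(10, 1), (1, 2), (1, 5)]]
w_r = [[(5, 3), (2, 4), (6, 7)]]
w_j = join_branches(w_l, w_r)
print(w_j)
```

In the example $S = 3$: the first location comes from $W_L$, the second from $W_R$, and the third is left empty.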

As with the high pass and median filters, each PE must obtain 12 values from its eight neighbors to apply the cooperative filter.
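
The count of 12 values can be reproduced with a small calculation. The sketch assumes that the four locations stored in a PE form a 2×2 block and that the filter mask is 3×3, so that one ring of values around the block is needed; the function names are illustrative.

```python
def halo_values(block, kernel):
    """Number of values a PE must fetch to filter a block x block tile."""
    radius = kernel // 2
    return (block + 2 * radius) ** 2 - block ** 2


def halo_by_neighbour(block, kernel):
    """Values supplied by each edge neighbour and by each corner neighbour."""
    radius = kernel // 2
    return {"edge": block * radius, "corner": radius * radius}


total = halo_values(2, 3)
parts = halo_by_neighbour(2, 3)
print(total, parts)
```

Under these assumptions each of the four edge neighbours supplies two values and each of the four corner neighbours supplies one, for a total of 12.
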

V. EXPERIMENTAL RESULTS

Several experiments have been conducted on a set of images with different pictorial features: real images with a lot of shading, highly textured images, high contrast ones, and artificial pictures with both high dynamics (like cartoons) and poor dynamics (Kanizsa figures). Figure 4 reports the BWE and fBWE images along with a diagram of the local orientation differences $d_{ij} = k_{ij}^{(BWE)} - k_{ij}^{(fBWE)}$. The two implementations are perceptually equivalent, and the major differences occur in the uniform brightness regions. In these parts of the image the BWE exhibits some small residual activations due to the feedback-based stabilization process, while the fBWE suppresses them completely. In the case of Kanizsa figures with a few well distinct gray levels (see Figure 5), the OC Filter alone performs better than the fBWE, so it has been selected as the system output. As regards performance, the BWE execution time in our experiments ranges from 14.94 sec. for the Kanizsa figure to 68.54 sec. for the Lena and Tank images, while the fBWE has a constant execution time of 0.168 msec. This is expected because the fBWE is a feed-forward architecture, while the BWE is not, and its convergence to a steady state depends on the brightness structure of the input.
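
The speed-up implied by these execution times can be computed directly. In the sketch below the timings are the ones reported above, while the frame rate of 30 frames per second used to express the fBWE time as a share of a frame period is an assumption introduced only for illustration.

```python
# Reported execution times, in seconds.
bwe_seconds = {"Kanizsa": 14.94, "Lena and Tank": 68.54}
fbwe_seconds = 0.168e-3

# Speed-up of the fBWE over the BWE for each reported case.
speedup = {name: t / fbwe_seconds for name, t in bwe_seconds.items()}

# Share of one frame period used by the fBWE (assumed 30 frames per second).
frame_period_seconds = 1.0 / 30.0
frame_share = fbwe_seconds / frame_period_seconds

for name, factor in speedup.items():
    print(f"{name}: about {factor:,.0f} times faster")
print(f"fBWE uses {100 * frame_share:.2f}% of a frame period")
```

The speed-up ranges from roughly $8.9 \times 10^4$ to $4.1 \times 10^5$, and at the assumed frame rate the fBWE takes about 0.5% of the frame period.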

VI. CONCLUSION

A Fast Boundary Web Extraction (fBWE) algorithm was presented in this paper as a fixed-point, data parallel implementation of the BWE. The fBWE was mapped on the SIMPil-K reconfigurable FPGA-based platform. Application components and their mapping details were provided along with a detailed analysis of their performance. Experimental results illustrate the significant gain achieved over the traditional BWE, with execution times allowing ample room for real-time processing of typical subsequent tasks in a complete robot vision system.

REFERENCES