Computation Core Binding in GTM Mapping on Reconfigurable Computers

Xuejun Liang
Jackson State University
Jackson, MS 39217
xuejun.liang@jsums.edu

ABSTRACT
A combinatorial optimization problem, in which the cost function is the FPGA computation time and the constraints are the FPGA board resources, is formulated as a step in mapping generalized template matching operations onto an FPGA board. The problem is then simplified from the multiple FPGA chip case to the single FPGA chip case. Algorithms are proposed to solve the optimization problem, and experimental results are given to show the efficiency of the proposed algorithms.

Categories and Subject Descriptors

General Terms
Algorithms, Performance, Design, Experimentation

Keywords
FPGA, Reconfigurable Computing, Template Matching, High-Level Synthesis, Combinatorial Optimization

1. INTRODUCTION
The reconfigurable computer addressed in the paper is a host computer with a co-processor board based on field programmable gate arrays (FPGAs). The target FPGA board may contain multiple FPGA chips, each with an array of homogeneous memory banks. Annapolis’s FPGA boards and SRC’s MAP processors are such examples.

The generalized template matching (GTM) operations [1] include image-processing algorithms for template matching, 2D digital filtering, morphological operations, motion estimation, and so on. They all involve moving a "window" (or template) pixel by pixel in scan-line order.

The overall approach of mapping the GTM operations onto reconfigurable computers consists of three steps. The first two steps enumerate, evaluate, and list a sufficient number of basic GTM building blocks, called region functions (RFs). Each RF contains an FPGA buffer and a pipelined functional unit, called a unit function, which evaluates the window computation at one or more consecutive pixel locations. Different RFs have different throughputs, occupy different FPGA areas, and require different numbers of memory ports. The third step, called RF binding, selects one or more RFs for each FPGA chip such that the total FPGA execution time is minimal under the FPGA board resource constraints, such as the number of FPGA chips, the size of the FPGA chips, and the number of memory ports. The RFs on all FPGA chips work independently and in parallel on different image regions and, if there is more than one template, on different templates, under the control of a host program.

2. RF BINDING
In the RF binding process, the selected RFs are assigned to FPGA chips, memory ports, templates, and processing regions (consecutive rows of an image region). The process therefore includes the FPGA chip binding, the memory port binding, the image region partitioning, the processing region binding, and the template binding. For an FPGA chip i (1 ≤ i ≤ N_{FPGA}), the region functions RF_{i,j}, 1 ≤ j ≤ q(i), together form the chip design, which has to satisfy an FPGA area constraint and a memory port constraint. As a result, for the FPGA chip binding and the memory port binding, a combinatorial optimization problem can be formulated as follows.

\[
\text{To minimize} \quad \max \{ \text{Time}(RF_{i,j}) \mid 1 \leq i \leq N_{FPGA} \text{ and } 1 \leq j \leq q(i) \}
\]

Subject to

\[
\sum_{1 \leq j \leq q(i)} \text{Area}(RF_{i,j}) \leq S_{FPGA}, \quad 1 \leq i \leq N_{FPGA}
\]

\[
\sum_{1 \leq j \leq q(i)} \text{Port}(RF_{i,j}) \leq N_{MP}, \quad 1 \leq i \leq N_{FPGA}
\]

In the above formulation, the objective function is the GTM computation time, N_{FPGA} is the number of FPGA chips on the target board, S_{FPGA} is the size (number of slices) of each FPGA chip, and N_{MP} is the number of memory ports connected to each FPGA chip. Time(RF_{i,j}) is the execution time of RF_{i,j}, Area(RF_{i,j}) is its FPGA area, and Port(RF_{i,j}) is the number of memory ports it uses. Because all RF_{i,j} work independently and in parallel, the execution time of the GTM design is the maximum of the individual RF_{i,j} execution times.

To formulate the RF binding problem completely, we define the workload of a GTM operation to be the sum, over all image regions, of the product of the number of rows in the region and the number of templates applied to it.
Similarly, the workload of an RF after the image region partitioning, the processing region binding, and the template binding is defined to be the sum, over its assigned processing regions, of the product of the number of rows in the region and the number of templates assigned for it. The image region partitioning, the processing region binding, and the template binding are therefore a way to partition the GTM workload among the selected RFs.
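The workload definition can be illustrated with a minimal sketch. The function name `workload`, the `(rows, templates)` pair representation, and the numbers are assumptions made for illustration only; they do not come from the paper.

```python
from typing import List, Tuple


def workload(assignments: List[Tuple[int, int]]) -> int:
    """Sum of rows * templates over (rows, num_templates) pairs."""
    return sum(rows * templates for rows, templates in assignments)


# Hypothetical GTM operation with two image regions:
# 480 rows under 3 templates and 240 rows under 2 templates.
gtm_regions = [(480, 3), (240, 2)]
gtm_workload = workload(gtm_regions)

# One possible partition of that workload between two RFs.
# rf_a takes two templates on region 1 and one template on region 2;
# rf_b takes the remaining template on each region.
rf_a = [(480, 2), (240, 1)]
rf_b = [(480, 1), (240, 1)]

# The RF workloads add up to the workload of the whole GTM operation.
assert workload(rf_a) + workload(rf_b) == gtm_workload
```

Any binding that covers every (row, template) pair exactly once partitions the GTM workload in this sense.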

The RF binding problem can then be formulated in terms of the workload concept as follows.

To minimize

\[
\max \{ S(RF_{i,j}) \times WL(RF_{i,j}) \mid 1 \leq i \leq N_{FPGA}, \ \text{and} \ 1 \leq j \leq q(i) \}
\]

subject to

\[
\begin{aligned}
\sum_{1 \leq j \leq q(i)} \text{Area}(RF_{i,j}) & \leq S_{FPGA}, \ 1 \leq i \leq N_{FPGA} \\
\sum_{1 \leq j \leq q(i)} \text{Port}(RF_{i,j}) & \leq N_{MP}, \ 1 \leq i \leq N_{FPGA} \\
\sum_{i=1}^{N_{FPGA}} \sum_{j=1}^{q(i)} WL(RF_{i,j}) &= GTM\_workload
\end{aligned}
\]

(2.2)

In the above formulation, \( S(RF_{i,j}) \) is the computation time of \( RF_{i,j} \) for one image row under one template, \( WL(RF_{i,j}) \) is the workload of \( RF_{i,j} \), and the last constraint expresses the workload partitioning among the selected RFs.

Note that there is a solution to Problem (2.2) in which all FPGA chips contain a common set of RF designs. Therefore, the RF binding problem (2.2) can be simplified to one for a single FPGA chip case as follows.

To minimize

\[
\max \{ S(RF_{j}) \times WL(RF_{j}) \mid 1 \leq j \leq q\}
\]

subject to

\[
\begin{aligned}
\sum_{1 \leq j \leq q} \text{Area}(RF_{j}) & \leq S_{FPGA} \\
\sum_{1 \leq j \leq q} \text{Port}(RF_{j}) & \leq N_{MP} \\
\sum_{j=1}^{q} WL(RF_{j}) &= single\_workload
\end{aligned}
\]

(2.3)

where single\_workload = GTM\_workload / \( N_{FPGA} \).

For each selected \( RF_{j} \), let the workload be defined as

\[
WL(RF_{j}) = \frac{single\_workload}{S(RF_{j}) \times \left( \sum_{1 \leq i \leq q} \frac{1}{S(RF_{i})} \right)}
\]

(2.4)
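As a small numeric check of (2.4), the sketch below splits a workload among three RFs and confirms that every RF then finishes at the same time. The function name `split_workload` and the S values are hypothetical, and exact fractions are used only to avoid rounding noise. Real workloads are whole image rows, so in practice the result would have to be rounded; that step is not shown.

```python
from fractions import Fraction
from typing import List


def split_workload(single_workload: int, s: List[Fraction]) -> List[Fraction]:
    """Apply (2.4): WL_j = single_workload / (S_j * sum_i 1/S_i)."""
    inv_sum = sum(1 / sj for sj in s)
    return [single_workload / (sj * inv_sum) for sj in s]


# Hypothetical per-row times S(RF_j) for three selected RFs.
s_values = [Fraction(2), Fraction(3), Fraction(6)]
wl = split_workload(600, s_values)

# Every RF finishes at the same time S_j * WL_j ...
finish_times = [sj * wj for sj, wj in zip(s_values, wl)]
assert len(set(finish_times)) == 1
# ... and the shares add up to the full single-chip workload.
assert sum(wl) == 600
```

Faster RFs (smaller S) receive proportionally more rows, which is what makes the finish times equal.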

Then, the RF binding problem (2.3) can be further simplified as

To maximize

\[
\sum_{1 \leq j \leq q} 1/S(RF_{j})
\]

subject to

\[
\begin{align*}
\sum_{1 \leq j \leq q} \text{Area}(RF_{j}) & \leq S_{FPGA} \\
\sum_{1 \leq j \leq q} \text{Port}(RF_{j}) & \leq N_{MP}
\end{align*}
\]

(2.5)
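The step from (2.3) to (2.5) can be spelled out as follows; this is a short derivation using only the definition in (2.4). Substituting (2.4) into the objective of (2.3) gives the same value for every selected RF:

\[
S(RF_{j}) \times WL(RF_{j}) = \frac{single\_workload}{\sum_{1 \leq i \leq q} 1/S(RF_{i})}, \quad 1 \leq j \leq q
\]

The workload constraint of (2.3) is then satisfied automatically:

\[
\sum_{j=1}^{q} WL(RF_{j}) = \frac{single\_workload}{\sum_{1 \leq i \leq q} 1/S(RF_{i})} \times \sum_{j=1}^{q} \frac{1}{S(RF_{j})} = single\_workload
\]

Since the maximum in (2.3) equals this common value and single\_workload is fixed, minimizing it amounts to maximizing the denominator \( \sum_{1 \leq j \leq q} 1/S(RF_{j}) \), which leaves only the area and memory port constraints.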

A naive method for solving (2.5) is to enumerate the whole solution space: for each \( q \) (\( 1 \leq q \leq N_{MP} \), since every RF needs at least one memory port), select \( q \) RFs out of the candidate RFs, verify the constraints, and compute the sum of \( 1/S(RF_{j}) \). The feasible selection with the largest sum is the solution.
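A minimal sketch of this exhaustive search is given below. The `RF` record, the function name `naive_search`, and the candidate values are hypothetical. The sketch also assumes that the same design may be instantiated more than once on a chip, hence `combinations_with_replacement`; if each candidate may be used only once, `itertools.combinations` would be the appropriate choice.

```python
from itertools import combinations_with_replacement
from typing import List, NamedTuple, Tuple


class RF(NamedTuple):
    name: str
    s: float     # S(RF): time per image row per template
    area: int    # FPGA slices occupied
    ports: int   # memory ports required (assumed >= 1)


def naive_search(cands: List[RF], s_fpga: int, n_mp: int) -> Tuple[float, Tuple[str, ...]]:
    """Enumerate every selection of 1..n_mp RFs and keep the best feasible one."""
    best_score, best_names = 0.0, ()
    for q in range(1, n_mp + 1):
        for subset in combinations_with_replacement(cands, q):
            if sum(r.area for r in subset) > s_fpga:
                continue  # violates the area constraint
            if sum(r.ports for r in subset) > n_mp:
                continue  # violates the memory port constraint
            score = sum(1.0 / r.s for r in subset)  # objective of (2.5)
            if score > best_score:
                best_score, best_names = score, tuple(r.name for r in subset)
    return best_score, best_names


# Hypothetical candidates: (name, S, area in slices, ports).
candidates = [RF("A", 4.0, 300, 1), RF("B", 2.0, 450, 2), RF("C", 1.0, 1200, 4)]
best_score, best_names = naive_search(candidates, s_fpga=1000, n_mp=4)
print(best_score, best_names)
```

With these numbers the fastest design C does not fit on the chip, and two copies of B give the largest feasible sum.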

In order to reduce the search space, the candidate RF designs can be divided into \( N_{MP} \) groups, denoted by \( \text{Cad}(i) \) (\( i = 1, 2, \ldots, N_{MP} \)), such that each candidate RF design in \( \text{Cad}(i) \) requires \( i \) memory ports. In this way RF designs can be selected from individual groups instead of from the whole set of candidate RF designs provided that it is known which groups RF designs should be chosen from. This can be achieved by enumerating the solution space of the memory port constraint in Problem (2.5) first. Then each solution provides information about which groups to choose from. Based on this idea, Problem (2.5) can be solved by solving the following two problems.

To maximize

\[
\sum_{1 \leq j \leq q} 1/S(RF_{j})
\]

subject to

\[
\sum_{1 \leq j \leq q} \text{Area}(RF_{j}) \leq S_{FPGA}
\]

(2.6)

(2.7)

Problem (2.6) is closely related to the integer partition problem and can be solved efficiently with dynamic programming.
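The exact statement of (2.6) is not reproduced above, so the sketch below follows one plausible reading: list every multiset of per-RF port counts whose total does not exceed \( N_{MP} \). It uses a memoized (top-down dynamic programming) recursion; the function name `port_partitions` is hypothetical, and this is not claimed to be the author's algorithm.

```python
from functools import lru_cache
from typing import Tuple


@lru_cache(maxsize=None)
def port_partitions(budget: int, max_part: int) -> Tuple[Tuple[int, ...], ...]:
    """All non-increasing tuples of parts <= max_part with sum <= budget.

    The empty tuple (no RF selected) is included as the base case.
    """
    result = [()]
    for part in range(min(max_part, budget), 0, -1):
        # Fix the largest part, then partition the remaining port budget
        # using parts no larger than it, so each multiset appears once.
        for rest in port_partitions(budget - part, part):
            result.append((part,) + rest)
    return tuple(result)


n_mp = 4  # hypothetical number of memory ports per chip
# Drop the empty selection: at least one RF must be placed.
solutions = [p for p in port_partitions(n_mp, n_mp) if p]
print(solutions)
```

Each tuple tells which groups \( \text{Cad}(i) \) to draw from: for example, `(2, 1, 1)` means one RF from \( \text{Cad}(2) \) and two from \( \text{Cad}(1) \).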

Problem (2.7) is a bounded knapsack problem, for which many classic approaches exist. This work provides a new algorithm, called Multi-Dimensional Binary Search, to solve it. The algorithm uses the divide-and-conquer method: in each search step, either the search terminates, because a result is found or no result can exist, or the search problem is divided into several smaller problems.
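The Multi-Dimensional Binary Search is not specified in enough detail here to reproduce. As a baseline only, the sketch below solves (2.7) for one port assignment by trying every combination drawn from the relevant groups, which corresponds to plain exhaustive search over groups. The names `RF`, `cad`, and `best_for_partition`, the candidate values, and the assumption that a design may be reused on a chip are all hypothetical.

```python
from collections import Counter
from itertools import combinations_with_replacement, product
from typing import Dict, List, NamedTuple, Tuple


class RF(NamedTuple):
    name: str
    s: float   # S(RF): time per image row per template
    area: int  # FPGA slices occupied


def best_for_partition(cad: Dict[int, List[RF]], partition: Tuple[int, ...],
                       s_fpga: int) -> Tuple[float, Tuple[str, ...]]:
    """Exhaustively pick RFs from the groups named by one port partition."""
    counts = Counter(partition)  # ports -> how many RFs to take from Cad(ports)
    per_group = [combinations_with_replacement(cad.get(p, []), k)
                 for p, k in counts.items()]
    best_score, best_names = 0.0, ()
    for choice in product(*per_group):
        chosen = [rf for group in choice for rf in group]
        if sum(rf.area for rf in chosen) > s_fpga:
            continue  # area constraint of (2.7) violated
        score = sum(1.0 / rf.s for rf in chosen)
        if score > best_score:
            best_score, best_names = score, tuple(rf.name for rf in chosen)
    return best_score, best_names


# Hypothetical groups: Cad(1) holds one-port designs, Cad(2) two-port designs.
cad = {1: [RF("A1", 4.0, 300), RF("A2", 3.0, 500)], 2: [RF("B", 2.0, 450)]}
result = best_for_partition(cad, (2, 1, 1), s_fpga=1200)
print(result)
```

The memory port constraint needs no check here because the port partition already satisfies it; only the area constraint remains.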

3. EXPERIMENTAL RESULTS

As described in Section 2, there are three methods to solve (2.5). The first (called the naïve method) searches the whole design space. The second and third methods solve (2.6) first and then solve (2.7) for each solution of (2.6). In the second method (called the simple grouping method), (2.7) is solved by searching all combinations drawn from the RF groups. In the third method (called the multi-dimensional binary search), (2.7) is solved by the multi-dimensional binary search algorithm. The three methods were implemented and run for different FPGA sizes. The average search space sizes and the average computation times are listed in Table 1.

<table>
<thead>
<tr>
<th>Method</th>
<th>Average Search Space Size</th>
<th>Average Time (seconds)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Naïve</td>
<td>2880952</td>
<td>148.05</td>
</tr>
<tr>
<td>Simple Grouping</td>
<td>9982</td>
<td>0.526</td>
</tr>
<tr>
<td>Multi-Dimensional Binary Search</td>
<td>1138</td>
<td>0.1288</td>
</tr>
</tbody>
</table>

4. REFERENCES