ABSTRACT: In real-time digital signal processing (DSP) architectures using heterogeneous functional units (FUs), it is critical to select the best FU for each task. However, some tasks may not have fixed execution times. This paper models each varied execution time as a probabilistic random variable and solves the heterogeneous assignment with probability (HAP) problem. The solution of the HAP problem assigns a proper FU type to each task such that the total cost is minimized while the timing constraint is satisfied with a guaranteed confidence probability. The solutions to the HAP problem are useful for both hard real-time and soft real-time systems. Two algorithms, one optimal and one heuristic, are proposed to solve the general problem. The experiments show that our algorithms can effectively reduce the total cost with guaranteed confidence probabilities satisfying timing constraints. For example, our algorithms achieve an average reduction of 33.5% on total cost with 90% confidence probability satisfying timing constraints compared with previous work that uses the worst-case scenario.
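To make the HAP formulation concrete, the following is a minimal brute-force sketch, not the optimal or heuristic algorithm of the paper. It assumes a simple chain of tasks executed one after another, independent discrete execution-time distributions, and hypothetical names (`tasks`, `prob_within_deadline`, `cheapest_assignment`) and example numbers chosen only for illustration.

```python
from itertools import product

def prob_within_deadline(dists, deadline):
    """P(sum of independent discrete task times <= deadline)."""
    acc = {0: 1.0}  # distribution of the running total time
    for dist in dists:
        nxt = {}
        for t0, p0 in acc.items():
            for t1, p1 in dist:
                nxt[t0 + t1] = nxt.get(t0 + t1, 0.0) + p0 * p1
        acc = nxt
    return sum(p for t, p in acc.items() if t <= deadline)

def cheapest_assignment(tasks, deadline, confidence):
    """Enumerate every FU choice; keep the cheapest one meeting the confidence.

    tasks[i] is a list of FU options for task i, each option being
    (cost, [(time, probability), ...]). Returns (cost, choice) or None.
    """
    best = None
    for choice in product(*(range(len(opts)) for opts in tasks)):
        cost = sum(tasks[i][k][0] for i, k in enumerate(choice))
        dists = [tasks[i][k][1] for i, k in enumerate(choice)]
        # Small tolerance guards against floating-point rounding.
        if prob_within_deadline(dists, deadline) + 1e-12 >= confidence:
            if best is None or cost < best[0]:
                best = (cost, choice)
    return best

# Hypothetical data: two tasks, each with a fast/expensive and a slow/cheap FU.
tasks = [
    [(10, [(1, 1.0)]), (4, [(2, 0.9), (4, 0.1)])],
    [(8, [(1, 1.0)]), (3, [(2, 0.8), (5, 0.2)])],
]
worst_case = cheapest_assignment(tasks, deadline=5, confidence=1.0)
relaxed = cheapest_assignment(tasks, deadline=5, confidence=0.7)
```

With these made-up numbers, requiring the deadline to hold with probability 1.0 forces a more expensive assignment than accepting a 0.7 confidence, which is the trade-off the abstract describes. The enumeration is exponential in the number of tasks, so it only serves to illustrate the problem statement.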
ABSTRACT: Switching activity and schedule length are the two most important factors that influence the energy consumption of an application executed on a VLIW (Very Long Instruction Word) processor. Considering these two factors together, we propose an instruction-level energy-minimization scheduling technique to reduce the energy consumption of applications on VLIW processors. We first formally prove that this problem is NP-complete. Then three heuristic algorithms, MSAS, MLMSA, and EMSA, are proposed. While switching activity and schedule length are given higher priority in MSAS and MLMSA, respectively, EMSA gives the best result considering both of them. The experimental results show that, on average, EMSA reduces energy compared with the traditional list scheduling approach.
ABSTRACT: Usually the covering problem requires all elements in a system to be covered. In some situations, it is very difficult to find a solution, or it is impossible to cover all given elements because of resource constraints. In this paper, we study the partial covering problem. This problem is also referred to as the robust k-center problem and can be applied to many fields. The partial covering problem becomes even harder when we need to determine the subset of the group of all available elements to share resources. In this paper, several approximation algorithms are proposed to cover as many elements as possible. For some real-time systems, such as the battlefield communication system, the presented algorithm with polynomial-time complexity can be efficiently applied. The algorithm complexity analysis illustrates the improvement made by our algorithms, which are compared with other algorithms for the partial covering problem in the literature. The experimental results show that the performance of our algorithms is much better than that of other existing 3-approximation algorithms for the robust k-center problem.
ABSTRACT: Many computation-intensive or recursive applications commonly found in digital signal processing and image processing applications can be represented by data-flow graphs (DFGs). In our previous work, we proposed a new technique, extended retiming, which can be combined with minimal unfolding to transform a DFG into one which is rate-optimal. The result, however, is a DFG with split nodes, a concise representation for pipelined schedules. This model and the extraction of the pipelined schedule it represents have heretofore not been explored. In this paper, we demonstrate one scheduling algorithm for such graphs, and then discuss a way to reduce the hardware requirements of the resulting schedule. In the process, we state and prove a tight upper bound on the minimum number of processors required to execute the static schedule produced by our algorithms. Finally, we demonstrate our methods on a specific example.
ABSTRACT: Loop fusion is commonly used to improve the instruction-level parallelism of loops for high-performance embedded computing systems. Loop fusion, however, is not always directly applicable because fusion-prevention dependencies may exist among loops. Most of the existing techniques still have limitations in fully exploiting the advantages of loop fusion. In this paper, we present a general loop fusion technique for loops or nested loops based on the loop dependency graph model, retiming, and multi-dimensional retiming concepts. We show that any "J+K" model loop can be legally fused using our legalizing fusion technique. Polynomial-time algorithms are developed to solve the loop fusion problem for "J+K" model loops considering both timing and code size of the final code. Our technique produces the final code and calculates the resultant code size directly from the retiming values. The experimental results show that our loop fusion technique always significantly reduces the schedule length.
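A fusion-prevention dependency can be illustrated with a small one-dimensional sketch. The example below is an assumption-laden illustration, not the legalizing fusion algorithm of the paper: the loop bodies, the array names, and the function names are hypothetical, and the one-iteration shift of the second loop body is only meant to mimic the effect of a retiming value of 1.

```python
def separate_loops(x):
    """Reference semantics: two loops executed one after the other."""
    n = len(x)
    a = [0] * n
    b = [0] * max(n - 1, 0)
    for i in range(n):
        a[i] = 2 * x[i]
    for i in range(n - 1):
        b[i] = a[i + 1] + 1  # reads the value produced by iteration i+1 of loop 1
    return a, b

def naive_fusion(x):
    """Illegal fusion: b[i] reads a[i+1] before the fused loop has written it."""
    n = len(x)
    a = [0] * n
    b = [0] * max(n - 1, 0)
    for i in range(n):
        a[i] = 2 * x[i]
        if i < n - 1:
            b[i] = a[i + 1] + 1  # a[i+1] is still 0 at this point
    return a, b

def shifted_fusion(x):
    """Legal fusion: the second body is delayed by one iteration."""
    n = len(x)
    a = [0] * n
    b = [0] * max(n - 1, 0)
    for i in range(n):
        a[i] = 2 * x[i]
        if i >= 1:
            b[i - 1] = a[i] + 1  # a[i] was written earlier in this iteration
    return a, b

sample = [3, 1, 4, 1, 5]
reference = separate_loops(sample)
```

Here `shifted_fusion(sample)` reproduces `reference`, while `naive_fusion(sample)` does not, because the fused body would consume a value that a later iteration produces.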
ABSTRACT: Since data dependencies greatly decrease instruction-level parallelism, minimizing dependencies becomes a crucial part of the process of parallelizing sequential code. Eliminating all unnecessary hazards leads to more efficient use of resources, fewer processor stalls, and easily maintainable code. Previously we proposed a novel approach for eliminating redundant data dependencies from code. In this paper, we review this method and show how this elimination technique may be combined with unfolding so as to parallelize code even further.
ABSTRACT: Loop fusion is widely used to exploit instruction-level parallelism by transforming separate loops into one loop for applications of embedded systems. Loop fusion, however, is not always applicable because of the existence of fusion-prevention dependencies among loops. Therefore, techniques for eliminating the fusion-prevention dependencies are necessary for fully exploiting the benefits of loop fusion. In this paper we present an efficient loop fusion technique based on the loop dependency graph model and the multi-dimensional retiming concept. Legalizing fusion theorems are derived for loops to be legally fused. Polynomial-time legalizing fusion algorithms are developed to solve the loop fusion problems for 1-level loops and 2-level nested loops. Our loop fusion techniques are carefully designed to consider multiple optimization objectives, such as minimizing the code size and the critical path of the fused loop. The resultant code size can be accurately computed. The experimental results show that our loop fusion technique always significantly reduces the schedule length.
ABSTRACT: Sensor nodes usually work in dynamically changing, hard-to-predict environments and have limited lifetimes. We use a novel adaptive online energy saving (AOES) algorithm to reduce total energy consumption for heterogeneous sensor networks. Due to the uncertainties in the execution time of some tasks and the multiple working modes of each node, this paper models each varied execution time as a probabilistic random variable and saves energy by selecting the best mode assignment for each node, which is called the Mode Assignment with Probability (MAP) problem. We propose an optimal sub-algorithm, MAP Opt, to minimize the total energy consumption while satisfying the timing constraint with a guaranteed confidence probability. The experimental results show that our approach achieves significant energy savings compared with previous work.
ABSTRACT: Many computation-intensive iterative or recursive applications commonly found in digital signal processing and image processing applications can be represented by data-flow graphs (DFGs). The execution of all tasks of a DFG is called an iteration, and the average computation time of an iteration is called the iteration period. A great deal of research has been done attempting to optimize such applications by applying various graph transformation techniques to the DFG in order to minimize this iteration period. Two of the most popular are retiming and unfolding, which can be performed in tandem to achieve an optimal iteration period. However, the result is a transformed graph which is much larger than the original DFG. In our previous work, we proposed a new technique, extended retiming, which can be combined with minimal unfolding to transform a DFG into one whose iteration period matches that of the optimal schedule under a pipelined design. In this paper, we augment our previous work by designing an efficient retiming algorithm which may be applied directly to a DFG instead of the larger unfolded graph.
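The effect of conventional retiming on a DFG can be sketched as follows. This is a minimal illustration of ordinary retiming only, not the extended retiming algorithm of the paper. It assumes the convention that a retiming `r` changes the delay count of an edge u→v to `d + r[v] - r[u]` (other papers use the opposite sign), and the three-node graph, the unit computation times, and the function names are hypothetical.

```python
from functools import lru_cache

def cycle_period(times, edges):
    """Longest total computation time along any path of zero-delay edges.

    times: {node: computation time}; edges: [(u, v, delays)].
    Assumes the zero-delay subgraph is acyclic, as in any legal DFG.
    """
    succ = {v: [] for v in times}
    for u, v, d in edges:
        if d == 0:
            succ[u].append(v)

    @lru_cache(maxsize=None)
    def longest_from(u):
        return times[u] + max((longest_from(v) for v in succ[u]), default=0)

    return max(longest_from(u) for u in times)

def retime(edges, r):
    """Apply retiming r under the assumed convention d' = d + r[v] - r[u]."""
    new_edges = [(u, v, d + r[v] - r[u]) for u, v, d in edges]
    if any(d < 0 for _, _, d in new_edges):
        raise ValueError("illegal retiming: negative delay count")
    return new_edges

# Hypothetical DFG: a single cycle A -> B -> C -> A with two delays on C -> A.
times = {"A": 1, "B": 1, "C": 1}
edges = [("A", "B", 0), ("B", "C", 0), ("C", "A", 2)]
before = cycle_period(times, edges)
retimed = retime(edges, {"A": 0, "B": 0, "C": 1})
after = cycle_period(times, retimed)
```

In this made-up graph the only cycle has total computation time 3 and 2 delays, so its time-to-delay ratio is 3/2. Retiming alone lowers the cycle period from 3 to 2 but cannot reach a fractional value, which is why retiming is usually paired with unfolding, as the abstract notes.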
ABSTRACT: We report two fast and scalable scheduling algorithms that provide exact bandwidth guarantees, low delay bounds, and reasonable jitter in input-queued switches. The two schedulers find a maximum input/output matching in a single iteration. They sustain 100% throughput under both uniform and bursty traffic. They work many times faster than existing scheduling schemes and their speed does not degrade with increased switch size. The SRA and SRA+ algorithms are of O(1) time complexity and can be implemented in simple hardware. SRA tends to impose different delays on flows of different classes of service due to their different subscribed portions of the total bandwidth. SRA+ is a weighted version of SRA. SRA+ improves over SRA in that all flows undergo the same delays regardless of their bandwidth shares. The schedulers operate on queue groups at the crossbar arbiters in a distributed manner.
ABSTRACT: In a large-scale adaptive mobile wireless network, long-distance message transfer can be routed through mobile servers, while nearby mobile units can contact each other through directly constructed communication channels. In order to achieve scalability through direct connection among mobile units and reliability sustained by mobile servers for QoS, a mixed wireless infrastructure incorporating mobile servers and ad hoc communications is investigated. A cluster of mobile units can be modelled as a mobile node. Both mobile nodes and servers are free to move in adaptive mobile wireless networks. In this paper, two graph models are introduced to represent mobile node communication requirements as well as mobile server configurations. The dynamically changing topology of mobile nodes and mobile servers, together with other constraints (e.g., transmission range, bandwidth), makes the assignment of mobile nodes to mobile servers difficult. Such assignment can be formalized as a partition problem, for which finding an optimal solution is proved to be NP-complete. Based on the two modelling graphs, polynomial-time algorithms are developed for the partitioning problem such that the communication among different mobile nodes is successfully switched by mobile servers. The experimental environment simulates the dynamically modified network topology of a wireless network consisting of roaming mobile nodes. The simulation results show that the proposed techniques yield good assignments with performance similar to those produced by exhaustive approaches.
ABSTRACT: We report three fast and scalable scheduling algorithms that provide exact bandwidth guarantees, low delay bounds, and reasonable jitter in input-queued switches. The three schedulers find a maximum input/output matching in a single iteration. They sustain 100% throughput under both uniform and bursty traffic. They work many times faster than existing scheduling schemes and their speed does not degrade with increased switch size. The SRA and SRA+ algorithms are of O(1) time complexity and can be implemented in simple hardware. SRA tends to impose different delays on flows of different classes of service due to their different subscribed portions of the total bandwidth. SRA+, a weighted version of SRA, operates on cells that arrive uniformly and on cells of packets such that the cells of whole packets are switched contiguously. SRA+ improves over SRA in that all flows undergo the same delays regardless of their bandwidth shares. The schedulers operate on queue groups at the crossbar arbiters in a distributed manner.