Dani Voitsechov's research while affiliated with Technion - Israel Institute of Technology and other places

Publications (6)

Article
Throughput architectures such as GPUs require substantial hardware resources to hold the state of a massive number of simultaneously executing threads. While GPU register files are already enormous, reaching capacities of 256KB per streaming multiprocessor (SM), we find that nearly half of the real-world applications we examined are register-bound and...
Article
Traditional von Neumann GPGPUs only allow threads to communicate through memory on a group-to-group basis. In this model, a group of producer threads writes intermediate values to memory, which are read by a group of consumer threads after a barrier synchronization. To alleviate the memory bandwidth pressure imposed by this method of communication, GPGPUs p...
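The group-to-group communication model described above can be illustrated with a small host-side analogy (this is a conceptual sketch in plain Python threads, not GPU code; all names and sizes are illustrative): producers write intermediate values to a shared buffer standing in for memory, every thread synchronizes at a barrier, and consumers then read the produced values.

```python
import threading

N = 4                        # threads per group (illustrative size)
buffer = [None] * N          # stands in for global memory
results = [None] * N
barrier = threading.Barrier(2 * N)   # all producers and consumers meet here

def producer(i):
    buffer[i] = i * i        # write an intermediate value to "memory"
    barrier.wait()           # barrier before any consumer may read

def consumer(i):
    barrier.wait()           # wait until every producer has written
    results[i] = buffer[i] + 1   # read the value produced by thread i

threads = [threading.Thread(target=producer, args=(i,)) for i in range(N)]
threads += [threading.Thread(target=consumer, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)   # [1, 2, 5, 10]
```

Note how every value round-trips through the shared buffer even though each consumer needs only one producer's result; this round-trip is the memory-bandwidth cost that direct inter-thread communication schemes aim to avoid.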
Conference Paper
We propose the hybrid dataflow/von Neumann vector graph instruction word (VGIW) architecture. This data-parallel architecture concurrently executes each basic block's dataflow graph (graph instruction word) for a vector of threads, and schedules the different basic blocks based on von Neumann control flow semantics. The VGIW processor dynamically c...
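The hybrid execution model sketched in the abstract can be illustrated as follows (a conceptual sketch only; block names, the state layout, and the example computation are all illustrative, not taken from the paper): each basic block's dataflow graph is applied to an entire vector of threads at once, while a von Neumann-style program counter selects which block runs next.

```python
# Each "block" applies its dataflow graph to the whole thread vector,
# then returns the name of the successor block (von Neumann control flow).

def block_entry(state):            # BB0: x = tid * 2 for every thread
    state["x"] = [t * 2 for t in state["tid"]]
    return "bb_cond"

def block_cond(state):             # BB1: evaluate the branch x >= 4
    state["mask"] = [x >= 4 for x in state["x"]]
    return "bb_exit"

def block_exit(state):             # BB2: y = x where the branch was taken, else 0
    state["y"] = [x if m else 0 for x, m in zip(state["x"], state["mask"])]
    return None                    # no successor: kernel done

blocks = {"bb_entry": block_entry, "bb_cond": block_cond, "bb_exit": block_exit}

state = {"tid": [0, 1, 2, 3]}      # a vector of four threads
pc = "bb_entry"                    # von Neumann-style program counter
while pc is not None:
    pc = blocks[pc](state)         # run one block's graph for all threads

print(state["y"])   # [0, 0, 4, 6]
```

The key point the sketch captures is the granularity: dataflow semantics apply *within* a basic block (the whole graph fires for the thread vector), while scheduling *between* blocks follows ordinary control flow.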
Article
We present the single-graph multiple-flows (SGMF) architecture that combines coarse-grain reconfigurable computing with dynamic dataflow to deliver massive thread-level parallelism. The CUDA-compatible SGMF architecture is positioned as an energy efficient design alternative for GPGPUs. The architecture maps a compute kernel, represented as a dataf...
Conference Paper
We present the single-graph multiple-flows (SGMF) architecture that combines coarse-grain reconfigurable computing with dynamic dataflow to deliver massive thread-level parallelism. The CUDA-compatible SGMF architecture is positioned as an energy efficient design alternative for GPGPUs. The architecture maps a compute kernel, represented as a dataf...

Citations

... The final category, Multi-threading, explains the techniques proposed in [80,81,82] to solve the problem of accelerating multi-threaded kernels efficiently on CGRAs. ...
... Gilani et al. [7], Esfeden et al. [2], as well as Wang and Zhang [25] investigate optimizations based on narrow integers which are detected at run time, in stark contrast to our static approach which works for all types of narrow data. Voitsechov et al. [23] employ narrow integer-packing based on static analysis, but they do not support floats. ...
... Since the 1990s, a number of influential CGRAs, such as MorphoSys [26], ADRES [27], PACT XPP [28], and REMUS [29], have been developed, targeting applications such as signal processing and multimedia. In recent years, researchers have continued to study the design of CGRAs and have proposed many recent implementations, such as Plasticine [1], CGRA-ME [30], PX-CGRA [31], i-DPs CGRA [32], and dMT-CGRA [33]. However, CGRAs are not yet widely used in industry due to their inconsistent structures and immature programming and compilation tools, which will be explained in detail later on. ...
... One representative example is LAC [48], targeted at matrix factorization. Ordered Parallelism and Synchronization: A conceptually similar work to ours from the GPGPU space is dMT-CGRA [49], which proposes inter-thread communication between SIMT threads [50,51]. Prabhakar et al. [52] develop "nested parallelism," which enables the coupling of datapaths with different nestings of parallel patterns. ...
... Configuration data for the various dataflow graphs can be retained in main memory until it is needed, and is often cached in configuration memory local to the CGRA. By exploiting spatial data parallelism in application kernels and efficiently processing kernels' dataflow graphs [26-30], CGRAs, such as those used in DySER [28] and SGMF [29], considerably improve performance and energy consumption compared to conventional processors; in particular, CGRAs can practically eliminate the large energy overheads of fetching and scheduling instructions in conventional out-of-order processors. Compared to fine-grain reconfigurable accelerators (e.g., FPGAs), CGRAs are less flexible, but this specialization results in higher performance, lower energy consumption, and much smaller configuration data (shorter configuration time) [26,31,32]. ...
... Such CGRAs are specifically proposed as an alternative to out-of-order processors, which dynamically exploit ILP in hardware. CGRAs like [23,30,78,79,80] use a spatial computation model to implement ILP. In this technique, the order of execution and data preparation is determined by static scheduling of a DFG. ...
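The static scheduling mentioned in the last snippet boils down to fixing an execution order at compile time that respects the dependence edges of the dataflow graph, so no dynamic out-of-order hardware is needed at run time. A minimal sketch of that idea (the graph and node names are illustrative) is a topological sort of the DFG:

```python
from graphlib import TopologicalSorter

# A tiny DFG: each node lists the nodes whose results it depends on.
dfg = {
    "load_a": [],
    "load_b": [],
    "mul":    ["load_a", "load_b"],
    "add":    ["mul", "load_a"],
    "store":  ["add"],
}

# The static schedule is any topological order of the graph: every node
# is placed after all of its producers, fixing the execution order at
# compile time.
schedule = list(TopologicalSorter(dfg).static_order())
print(schedule)
```

Any order satisfying the dependence constraints is a valid schedule; a real CGRA compiler would additionally assign each node to a functional unit and a time slot, but the dependence-respecting ordering is the core of the technique.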