John Kim

Korea Advanced Institute of Science and Technology, Seoul, South Korea

Publications (27) · 4.07 Total Impact

  • Source
    ABSTRACT: The scalability trends of modern semiconductor technology lead to increasingly dense multicore chips. Unfortunately, physical limitations in area, power, off-chip bandwidth, and yield constrain single-chip designs to a relatively small number of cores, beyond which scaling becomes impractical. Multi-chip designs overcome these constraints, and can reach scales impossible to realize with conventional single-chip architectures. However, to deliver commensurate performance, multi-chip architectures require a cross-chip interconnect with bandwidth, latency, and energy consumption well beyond the reach of electrical signaling. We propose Galaxy, an architecture that enables the construction of a many-core "virtual chip" by connecting multiple smaller chiplets through optical fibers. The low optical loss of fibers allows the flexible placement of chiplets, and offers simpler packaging, power, and heat requirements. At the same time, the low latency and high bandwidth density of optical signaling maintain the tight coupling of cores, allowing the virtual chip to match the performance of a single chip that is not subject to area, power, and bandwidth limitations. Our results indicate that Galaxy attains a speedup of 2.2x over the best single-chip alternatives with electrical or photonic interconnects (3.4x maximum), and a 2.6x smaller energy-delay product (6.8x maximum). We show that Galaxy scales to 4K cores and attains a 2.5x speedup at 6x lower laser power compared to a Macrochip with silicon waveguides.
    In Proceedings of the ACM International Conference on Supercomputing (ICS); 06/2014
  •
    ABSTRACT: Many-core processors will have many processing cores with a network-on-chip (NoC) that provides access to shared resources such as main memory and on-chip caches. However, locally-fair arbitration in a multi-stage NoC can lead to globally unfair access to shared resources and impact system-level performance depending on where each task is physically placed. In this work, we propose an arbitration scheme that provides equality-of-service (EoS) in the network and supports location-oblivious task placement. We propose using probabilistic arbitration combined with distance-based weights to achieve EoS and overcome the limitation of the round-robin arbiter. However, the complexity of probabilistic arbitration results in high area and long latency, which negatively impacts performance. To reduce the hardware complexity, we propose a hybrid arbiter that switches between a simple arbiter at low load and a complex arbiter at high load. The hybrid arbiter is enabled by the observation that arbitration only impacts overall performance and global fairness at high load. We evaluate our arbitration scheme with synthetic traffic patterns and GPGPU benchmarks. Our results show that a hybrid arbiter combining a round-robin arbiter with probabilistic distance-based arbitration reduces performance variation as task placement is varied and also improves average IPC.
    IEEE Transactions on Computers 06/2014; 63(6):1487-1500. · 1.47 Impact Factor
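    The arbitration idea above can be illustrated with a small sketch. This is a hypothetical model written for illustration, not the authors' design: each contending input's grant probability is weighted (nonlinearly) by the hop count its head packet has already travelled, so distant cores are not starved by locally fair decisions, and a hybrid arbiter falls back to plain round-robin at low load. The class name, load threshold, and weight function are assumptions.
    ```python
    import random

    class HybridArbiter:
        """Toy model: round-robin at low load, distance-weighted probabilistic at high load."""

        def __init__(self, num_ports, load_threshold=0.5):
            self.num_ports = num_ports
            self.load_threshold = load_threshold  # assumed switch point (fraction of ports requesting)
            self.rr_pointer = 0

        def _round_robin(self, requests):
            # Grant the first requesting port at or after the rotating pointer.
            for offset in range(self.num_ports):
                port = (self.rr_pointer + offset) % self.num_ports
                if requests[port] is not None:
                    self.rr_pointer = (port + 1) % self.num_ports
                    return port
            return None

        def _probabilistic(self, requests):
            # Weight contenders by a nonlinear function of the hop count already traversed,
            # approximating globally fair, age-like arbitration.
            contenders = [p for p in range(self.num_ports) if requests[p] is not None]
            weights = [(requests[p] + 1) ** 2 for p in contenders]
            return random.choices(contenders, weights=weights, k=1)[0]

        def grant(self, requests):
            """requests[p] is the hop count of the head packet at port p, or None if idle."""
            load = sum(r is not None for r in requests) / self.num_ports
            chooser = self._round_robin if load <= self.load_threshold else self._probabilistic
            return chooser(requests)

    # Under contention, the packet that has already travelled 6 hops wins most often.
    arb = HybridArbiter(num_ports=4)
    print(arb.grant([1, None, 2, 6]))
    ```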
  • Source
    ABSTRACT: A cost-efficient network-on-chip is needed in scalable many-core systems. Recent multicore processors have leveraged a ring topology, and a hierarchical ring can increase scalability but presents different challenges, including higher hop count and a global ring bottleneck. In this work, we describe a hierarchical ring topology that we refer to as a transportation-network-inspired network-on-chip (tNoC) that leverages principles from transportation network systems. In particular, we propose a novel hybrid flow control for the hierarchical ring topology to scale the topology efficiently. The flow control is hybrid in that the channels are allocated at flit granularity while the buffers are allocated at packet granularity. The hybrid flow control enables a simplified router microarchitecture (to minimize per-hop latency) as router input buffers are minimized and buffers are pushed to the edges, either at the output ports or at the hub routers that interconnect the local rings to the global ring – while still supporting virtual channels to avoid protocol deadlock. We also describe a packet-quota-system (PQS) and a separate credit network that provide congestion management, support prioritized arbitration in the network, and provide support for multi-flit packets. A detailed evaluation of a 64-core CMP shows that the tNoC improves performance by up to 21% compared with a baseline, buffered hierarchical ring topology while reducing NoC energy by 51%.
    The 20th IEEE International Symposium On High Performance Computer Architecture; 02/2014
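    A minimal sketch of the hybrid flow control described above, under assumptions of this edit (names, buffer sizes, and interfaces are illustrative and not taken from the paper): a packet is admitted only after a whole-packet buffer reservation at the edge/hub, while its individual flits then use the ring channel cycle by cycle.
    ```python
    from collections import deque

    class HubRouter:
        """Toy hub: buffers are allocated per packet at the edge, the channel is used per flit."""

        def __init__(self, buffer_flits=16):
            self.free_flits = buffer_flits     # total edge buffer capacity in flits
            self.queues = {}                   # packet id -> flits accepted so far

        def reserve(self, pkt_id, pkt_len):
            # Packet-granularity buffer allocation: all-or-nothing reservation up front,
            # so a packet never stalls mid-route waiting for buffer space.
            if self.free_flits >= pkt_len:
                self.free_flits -= pkt_len
                self.queues[pkt_id] = deque()
                return True
            return False

        def accept_flit(self, pkt_id, flit):
            # Flit-granularity channel use: each flit that wins the ring channel in a cycle
            # lands in its pre-reserved slot, so no per-hop input buffering is needed.
            self.queues[pkt_id].append(flit)

        def drain_packet(self, pkt_id, pkt_len):
            # Ejecting the packet returns its reserved space in one step.
            flits = list(self.queues.pop(pkt_id))
            self.free_flits += pkt_len
            return flits

    hub = HubRouter()
    if hub.reserve(pkt_id=7, pkt_len=4):          # reserve before injection
        for f in range(4):
            hub.accept_flit(7, f"flit{f}")        # flits arrive over several ring cycles
        print(hub.drain_packet(7, 4))
    ```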
  •
    ABSTRACT: We present a novel, linear programming (LP)-based scheduling algorithm that exploits heterogeneous multicore architectures such as CPUs and GPUs to accelerate a wide variety of proximity queries. To represent the complicated performance relationships between heterogeneous architectures and the different computations of proximity queries, we propose a simple yet accurate model that estimates the expected running time of these computations. Based on this model, we formulate an optimization problem that minimizes the largest time spent on any computing resource, and propose a novel, iterative LP-based scheduling algorithm. Since our method is general, we are able to apply it to various proximity queries used in five different applications with different characteristics. Our method achieves an order of magnitude performance improvement by using four different GPUs and two hexa-core CPUs over using a hexa-core CPU only. Unlike prior scheduling methods, our method continues to improve performance as we add more computing resources. Our method also achieves much higher performance improvement compared with prior methods as the heterogeneity of computing resources is increased. Moreover, for one of the tested applications, our method achieves even higher performance than a prior parallel method optimized manually for the application. We also show that our method provides results that are close (e.g., 75 percent) to the performance provided by a conservative upper bound of the ideal throughput. These results demonstrate an efficiency and robustness that have not been achieved by prior methods. In addition, we integrate one of our contributions with a work stealing method. Our version of the work stealing method achieves an 18 percent performance improvement on average over the original work stealing method. This result shows the wide applicability of our approach.
    IEEE Transactions on Visualization and Computer Graphics 09/2013; 19(9):1513-1525.
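    The core optimization above - minimize the largest time spent on any computing resource given per-device expected running times - can be written as a small linear program. The sketch below is a simplified LP relaxation (fractional assignment, made-up timing numbers), not the authors' full iterative algorithm, and it assumes scipy is available.
    ```python
    import numpy as np
    from scipy.optimize import linprog

    # t[i][j]: expected running time of computation i on device j (made-up numbers),
    # e.g. columns = [hexa-core CPU, GPU0, GPU1].
    t = np.array([[4.0, 1.0, 1.2],
                  [6.0, 2.5, 2.0],
                  [3.0, 0.8, 1.0],
                  [5.0, 1.5, 1.4]])
    n_jobs, n_dev = t.shape

    # Variables: x[i, j] in [0, 1] (fraction of job i on device j), plus the makespan T.
    # Minimize T  subject to  sum_i t[i][j] * x[i][j] <= T  for every device j,
    #                         sum_j x[i][j] = 1             for every job i.
    num_x = n_jobs * n_dev
    c = np.zeros(num_x + 1)
    c[-1] = 1.0                                   # objective: minimize T

    A_ub = np.zeros((n_dev, num_x + 1))
    for j in range(n_dev):
        for i in range(n_jobs):
            A_ub[j, i * n_dev + j] = t[i, j]
        A_ub[j, -1] = -1.0                        # ... - T <= 0
    b_ub = np.zeros(n_dev)

    A_eq = np.zeros((n_jobs, num_x + 1))
    for i in range(n_jobs):
        A_eq[i, i * n_dev:(i + 1) * n_dev] = 1.0  # each job is fully assigned
    b_eq = np.ones(n_jobs)

    bounds = [(0, 1)] * num_x + [(0, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
    print("minimum makespan (LP relaxation):", res.fun)
    print("assignment fractions:\n", res.x[:-1].reshape(n_jobs, n_dev).round(2))
    ```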
  •
    ABSTRACT: Bufferless on-chip networks are an alternative on-chip network organization that can improve cost-efficiency by removing router input buffers. However, bufferless on-chip network performance degrades at high load because of increased network contention and a large number of deflected packets. The energy benefit of a bufferless network is also reduced because of the increased deflection. In this work, we propose a novel flow control for bufferless on-chip networks in high-throughput manycore accelerator architectures to reduce the impact of deflection routing. By using clumsy flow control (CFC), instead of the per-hop flow control that is commonly used in buffered on-chip networks, we are able to reduce the amount of deflection by up to 92% on high-throughput workloads. As a result, on average, CFC approximately matches the performance of a baseline buffered router while reducing energy consumption by approximately 39%.
    IEEE Computer Architecture Letters 07/2013; 12(2):47-50. · 1.00 Impact Factor
  • Source
    Dennis Abts, John Kim
    03/2011; Morgan & Claypool Publishers.
  •
    ABSTRACT: The increasing number of integrated components on a single chip has increased the importance of on-chip networks. A significant part of on-chip network routers is the buffer, as it occupies a large area and consumes a significant amount of power. In this work, we propose FlexiBuffer, a microarchitecture in which we minimize buffer leakage power by using fine-grained power gating and adjusting the size of the active buffers adaptively. We propose two microarchitecture techniques to support fine-grained power gating – early credit in credit-based flow control and new buffer organizations to overcome the limitation of circular buffers. Our results show that, with minimal loss in performance, we can reduce the leakage power of on-chip network router buffers by up to 61% and overall router power consumption by up to 39%.
    Proceedings of the 48th Design Automation Conference, DAC 2011, San Diego, California, USA, June 5-10, 2011; 01/2011
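    A toy sketch of the adaptive buffer sizing idea above; the thresholds, hysteresis, and class name are assumptions, not the paper's microarchitecture. Only a small window of buffer entries stays powered, growing ahead of occupancy (so wake-up latency is hidden, in the spirit of the early-credit idea) and shrinking when entries sit idle.
    ```python
    class FlexiBufferModel:
        """Toy occupancy-driven model of powering buffer entries on/off (not the paper's RTL)."""

        def __init__(self, total_entries=8, min_active=2):
            self.total = total_entries
            self.active = min_active      # entries currently powered on
            self.min_active = min_active
            self.occupied = 0

        def on_flit_arrival(self):
            assert self.occupied < self.active, "flow control must prevent overflow"
            self.occupied += 1
            # Grow the active window early so wake-up latency is hidden:
            # keep at least one powered, empty entry ahead of the occupancy.
            if self.occupied == self.active and self.active < self.total:
                self.active += 1          # power on one more entry

        def on_flit_departure(self):
            self.occupied -= 1
            # Shrink lazily (hysteresis of 2 free entries) to avoid thrashing the power gates.
            if self.active - self.occupied > 2 and self.active > self.min_active:
                self.active -= 1          # power-gate one idle entry

    buf = FlexiBufferModel()
    for _ in range(5):
        buf.on_flit_arrival()
    for _ in range(5):
        buf.on_flit_departure()
    print("active entries after burst:", buf.active)
    ```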
  •
    ABSTRACT: Memory controllers in graphics processing units (GPUs) often employ out-of-order scheduling to maximize row access locality. However, this requires complex logic compared with in-order scheduling. To provide low-cost, low-complexity memory scheduling, we propose an alternative approach in which memory scheduling is performed not at the destination (i.e., the memory controller) but at the source (i.e., the cores). We propose two complementary techniques in source-based memory scheduling -- network congestion-aware source throttling and super packets, where multiple request packets are grouped together to create a single super packet. By combining these techniques, performance across a wide range of applications is within 95% of the complex FR-FCFS scheduler on average, at significantly lower cost and complexity.
    2011 International Conference on Parallel Architectures and Compilation Techniques, PACT 2011, Galveston, TX, USA, October 10-14, 2011; 01/2011
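    A rough sketch of the two source-side ideas above, with invented names, group sizes, and thresholds: a core coalesces several requests headed to the same memory controller into one "super packet", and it throttles injection when its count of outstanding requests (a stand-in for network congestion feedback) crosses a limit.
    ```python
    from collections import defaultdict

    class SourceScheduler:
        """Toy source-side memory scheduling: super packets plus congestion-aware throttling."""

        def __init__(self, group_size=4, outstanding_limit=16):
            self.group_size = group_size              # requests coalesced per super packet
            self.outstanding_limit = outstanding_limit
            self.pending = defaultdict(list)          # memory controller id -> queued requests
            self.outstanding = 0                      # requests injected but not yet answered

        def enqueue(self, mc_id, addr):
            self.pending[mc_id].append(addr)

        def inject(self):
            """Return the super packets allowed to enter the network this cycle."""
            super_packets = []
            for mc_id, reqs in self.pending.items():
                while (len(reqs) >= self.group_size and
                       self.outstanding + self.group_size <= self.outstanding_limit):
                    # Grouping consecutive requests to one controller preserves row locality
                    # without reordering at the (now simpler, in-order) memory controller.
                    super_packets.append((mc_id, reqs[:self.group_size]))
                    del reqs[:self.group_size]
                    self.outstanding += self.group_size
            return super_packets

        def on_reply(self, num_requests):
            self.outstanding -= num_requests          # replies free up the injection budget

    sched = SourceScheduler()
    for a in range(8):
        sched.enqueue(mc_id=0, addr=0x1000 + 64 * a)
    print(sched.inject())   # two super packets of four requests each
    ```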
  • Minjeong Shin, John Kim
    ABSTRACT: On-chip networks are becoming more important as the number of on-chip components continues to increase. A 2D mesh is a commonly assumed topology for on-chip networks, but in this work we argue that a 2D torus can provide a more cost-efficient on-chip network since the on-chip network datapath is reduced by 2× while providing the same bisection bandwidth as a mesh network. Our results show that a 2D torus can achieve an improvement of up to 1.9× over a 2D mesh in the performance-per-watt metric. However, routing deadlock can occur in a torus network because of the wrap-around channels, and avoiding it requires additional virtual channels. In this work, we propose deadlock recovery with tokens (DRT) for on-chip networks, which exploits the abundant wires available on-chip while minimizing the need for additional buffers. As a result, deadlocks can be detected exactly, without relying on a timeout mechanism, and the network can recover when needed. We show that DRT results in minimal loss in performance, compared with deadlock avoidance using virtual channels, while reducing on-chip network complexity.
    IEEE 29th International Conference on Computer Design, ICCD 2011, Amherst, MA, USA, October 9-12, 2011; 01/2011
  • Source
    Yan Pan, John Kim, Gokhan Memik
    ABSTRACT: Nanophotonic signaling technology enables efficient global communication and low-diameter networks such as crossbars that are often optically arbitrated. However, existing optical arbitration schemes incur costly overheads (e.g., waveguides, laser power, etc.) to avoid starvation caused by their inherent fixed priority, which limits their applicability in power-bounded future many-core processors. On the other hand, quality-of-service (QoS) support in the on-chip network is becoming necessary due to the increase in the number of components in the network. Most prior work on QoS in on-chip networks has focused on conventional multi-hop electrical networks, where the efficiency of QoS is hindered by the limited capabilities of electrical global communication. In this work, we exploit the benefits of nanophotonics to build a lightweight optical arbitration scheme, FeatherWeight, with QoS support. Leveraging this efficient global communication, we devise a feedback-controlled, adaptive source throttling scheme to asymptotically approach weighted max-min fairness among all the nodes on the chip. By re-using existing datapath components to exchange minimal global information, FeatherWeight provides freedom from starvation while resulting in negligible (
    01/2011;
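    The fairness target above - weighted max-min fair injection rates over a shared capacity - can be computed with a short water-filling routine. This is an illustration of the target allocation the feedback loop converges to, not the paper's hardware mechanism; the demands, weights, and capacity are made up.
    ```python
    def weighted_max_min(demands, weights, capacity):
        """Water-filling: raise every node's rate in proportion to its weight until it is
        either satisfied (rate == demand) or the shared capacity is exhausted."""
        rates = [0.0] * len(demands)
        active = set(range(len(demands)))
        remaining = capacity
        while active and remaining > 1e-12:
            total_w = sum(weights[i] for i in active)
            # Largest uniform "fill level" before some active node hits its demand.
            level = min((demands[i] - rates[i]) / weights[i] for i in active)
            level = min(level, remaining / total_w)
            for i in list(active):
                rates[i] += level * weights[i]
                if rates[i] >= demands[i] - 1e-12:
                    active.discard(i)           # satisfied nodes stop growing
            remaining -= level * total_w
        return rates

    # Example: node 2 has twice the weight of the others and ends up with twice the share
    # of what is left after node 0's small demand is fully satisfied.
    print(weighted_max_min(demands=[0.1, 1.0, 1.0], weights=[1, 1, 2], capacity=0.7))
    ```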
  •
    ABSTRACT: The unique characteristics of prefetch traffic have not been considered in on-chip network design for multicore architectures. Most prefetchers are oblivious to network congestion when generating prefetch requests. In this work, we investigate the interaction between prefetchers and on-chip networks and exploit the synergy of these two components in multi-core architectures. We explore prefetch-aware on-chip networks that differentiate between prefetch and demand traffic by prioritizing demand traffic. In addition, we propose a prefetch control mechanism based on network congestion. Our evaluations show that the combination of the proposed prefetch-aware router architecture and congestion-sensitive prefetch control improves benchmark performance by 11-13% on average, and by up to 30% on some workloads.
    2011 International Conference on Parallel Architectures and Compilation Techniques, PACT 2011, Galveston, TX, USA, October 10-14, 2011; 01/2011
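    A minimal sketch of the two mechanisms above, with assumed names and threshold values: demand packets win arbitration over prefetch packets at a router output port, and the prefetcher is gated when a local congestion estimate exceeds a threshold.
    ```python
    class PrefetchAwarePort:
        """Toy model: demand traffic wins arbitration over prefetch traffic, and the
        prefetcher is throttled when the network looks congested."""

        def __init__(self, congestion_threshold=0.7):
            self.congestion_threshold = congestion_threshold   # assumed value, not from the paper

        def arbitrate(self, candidates):
            # candidates: list of (is_demand, age) tuples competing for the output port.
            # Demand packets are strictly prioritized; age breaks ties within a class.
            return max(candidates, key=lambda c: (c[0], c[1]))

        def allow_prefetch(self, buffer_occupancy):
            # Congestion-sensitive prefetch control: hold prefetch requests when the
            # local router's buffer occupancy exceeds the threshold.
            return buffer_occupancy < self.congestion_threshold

    port = PrefetchAwarePort()
    print(port.arbitrate([(False, 9), (True, 2)]))      # the demand packet wins despite lower age
    print(port.allow_prefetch(buffer_occupancy=0.85))   # prefetches throttled under congestion
    ```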
  • Source
    Workshop on the Interaction between Nanophotonic Devices and Systems (WINDS), co-located with the 43rd International Symposium on Microarchitecture (MICRO); 12/2010;
  • Source
    ABSTRACT: With the number of cores on a chip continuing to increase, proper evaluation of the on-chip network is critical not only for network performance but also for overall system performance. In this paper, we show how a network-only simulation can be limited, as it does not provide an accurate representation of system performance. We evaluate traditionally used open-loop simulations and compare them to closed-loop simulations. Although they use different methodologies, measurements, and metrics, we identify how they can provide very similar results. However, we show that the results of closed-loop simulations do not correlate well with execution-driven simulations. We then add simple extensions to the closed-loop simulation to model the impact of the processor and the memory system and show how the correlation with execution-driven simulations can be improved. The proposed framework/methodology provides fast simulation time while providing better insight into the impact of network parameters on overall system performance.
    Conference on High Performance Computing Networking, Storage and Analysis, SC 2010, New Orleans, LA, USA, November 13-19, 2010; 01/2010
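    A minimal sketch of a closed-loop evaluation in the spirit of the paper above (all parameters and the latency model are stand-ins): each node may have only a bounded number of outstanding requests, a reply must return before further requests are issued, and the metric is the time to complete a fixed amount of work rather than latency at a fixed offered load.
    ```python
    import random

    def closed_loop_sim(num_nodes=16, reqs_per_node=100, max_outstanding=4, mean_latency=20):
        """Very small closed-loop model: a reply must return before a node can issue requests
        beyond its outstanding-request window. Returns total execution time."""
        random.seed(0)
        remaining = [reqs_per_node] * num_nodes
        in_flight = []                      # (completion_time, node)
        outstanding = [0] * num_nodes
        time = 0
        while any(remaining) or in_flight:
            # Issue as many requests as each node's window allows.
            for n in range(num_nodes):
                while remaining[n] > 0 and outstanding[n] < max_outstanding:
                    latency = random.expovariate(1.0 / mean_latency)   # stand-in for the network
                    in_flight.append((time + latency, n))
                    remaining[n] -= 1
                    outstanding[n] += 1
            # Advance time to the next reply; the reply frees a slot in that node's window.
            in_flight.sort()
            done, node = in_flight.pop(0)
            time = done
            outstanding[node] -= 1
        return time

    print("execution time:", round(closed_loop_sim(), 1))
    ```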
  • Source
    ABSTRACT: Emerging many-core chip multiprocessors will integrate dozens of small processing cores with an on-chip interconnect consisting of point-to-point links. The interconnect enables the processing cores not only to communicate, but to share common resources such as main memory resources and I/O controllers. In this work, we propose an arbitration scheme to enable equality of service (EoS) in access to a chip's shared resources. That is, we seek to remove any bias in a core's access to a shared resource based on its location in the CMP. We propose using probabilistic arbitration combined with distance-based weights to achieve EoS and overcome the limitation of a conventional round-robin arbiter. We describe how nonlinear weights need to be used with probabilistic arbiters and propose three different arbitration weight metrics – fixed weight, constantly increasing weight, and variably increasing weight. By only modifying the arbitration of an on-chip router, we do not require any additional buffers or virtual channels and create a simple, low-cost mechanism for achieving EoS. We evaluate our arbitration scheme across a wide range of traffic patterns. In addition to providing EoS, the proposed arbitration has additional benefits, which include providing quality-of-service features (such as differentiated service) and providing fairness in terms of both throughput and latency that approaches the global fairness achieved with age-based arbitration – thus providing a more stable network by achieving high sustained throughput beyond saturation.
    43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2010, 4-8 December 2010, Atlanta, Georgia, USA; 01/2010
  • Source
    Yan Pan, John Kim, Gokhan Memik
    ABSTRACT: The on-chip network is becoming critical to the scalability of future many-core architectures. Recently, nanophotonics has been proposed for on-chip networks because of its low latency and high bandwidth. However, nanophotonics has relatively high static power consumption, which can lead to inefficient architectures. In this work, we propose FlexiShare - a nanophotonic crossbar architecture that minimizes static power consumption by fully sharing a reduced number of channels across the network. To enable efficient global sharing, we decouple the allocation of the channels and the buffers, and introduce a novel photonic token-stream mechanism for channel arbitration and credit distribution. The flexibility of FlexiShare introduces additional router complexity and electrical power consumption. However, with the reduced number of optical channels, the overall power consumption is reduced without loss in performance. Our evaluation shows that the proposed token-stream arbitration applied to a conventional crossbar design improves network throughput by 5.5× under permutation traffic. In addition, FlexiShare achieves similar performance to a token-stream arbitrated conventional crossbar using only half the number of channels under balanced, distributed traffic. With trace traffic extracted from MineBench and SPLASH-2, FlexiShare can further reduce the number of channels by up to 87.5% while still providing better performance - resulting in up to a 72% reduction in power consumption compared to the best alternative.
    16th International Conference on High-Performance Computer Architecture (HPCA-16 2010), 9-14 January 2010, Bangalore, India; 01/2010
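    A much-simplified sketch of token-stream arbitration as described above (not the FlexiShare implementation): tokens, one per slot on a shared data channel, stream past the nodes in order, and a node with a pending flit captures the first free token it sees, arbitrating the channel without any global electrical arbiter.
    ```python
    def token_stream_arbitrate(pending, num_tokens):
        """pending: per-node flit counts, in the order the token stream passes the nodes.
        Returns (grants, leftover): grants[slot] is the node that captured that token."""
        grants = []
        want = list(pending)
        for _ in range(num_tokens):           # each token represents one slot on a shared channel
            winner = None
            for node, count in enumerate(want):
                if count > 0:                 # first downstream node with a pending flit takes it
                    winner = node
                    want[node] -= 1
                    break
            grants.append(winner)             # None means the token passed by unused
        return grants, want

    # Four tokens pass nodes 0..3; node 1 has two flits pending, node 3 has one.
    print(token_stream_arbitrate(pending=[0, 2, 0, 1], num_tokens=4))
    ```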
  • Source
    ABSTRACT: The on-chip network of emerging many-core CMPs enables the sharing of numerous on-chip components. This on-chip network needs to ensure fairness when accessing the shared resources. In this work, we propose providing equality of service (EoS) in the on-chip networks of future many-core CMPs by leveraging distance, or hop count, to approximate the age of packets in the network. We propose probabilistic arbitration combined with distance-based weights to achieve EoS and overcome the limitation of a conventional round-robin arbiter. We describe how nonlinear weights need to be used with probabilistic arbiters and propose three different arbitration weight metrics - fixed weight, constantly increasing weight, and variably increasing weight. By only modifying the arbitration of an on-chip router, we do not require any additional buffers or virtual channels and create a complexity-effective mechanism for achieving EoS.
    19th International Conference on Parallel Architecture and Compilation Techniques (PACT 2010), Vienna, Austria, September 11-15, 2010; 01/2010
  • Source
    ABSTRACT: We present a novel, hybrid parallel continuous collision detection (HPCCD) method that exploits the availability of multi-core CPU and GPU architectures. HPCCD is based on a bounding volume hierarchy (BVH) and selectively performs lazy reconstructions. Our method works with a wide variety of deforming models and supports self-collision detection. HPCCD takes advantage of hybrid multi-core architectures – using the general-purpose CPUs to perform the BVH traversal and culling while GPUs are used to perform elementary tests that reduce to solving cubic equations. We propose a novel task decomposition method that leads to a lock-free parallel algorithm in the main loop of our BVH-based collision detection to create a highly scalable algorithm. By exploiting the availability of hybrid, multi-core CPU and GPU architectures, our proposed method achieves more than an order of magnitude improvement in performance using four CPU cores and two GPUs, compared to using a single CPU core. This improvement results in interactive performance, up to 148 fps, for various deforming benchmarks consisting of tens or hundreds of thousands of triangles.
    Computer Graphics Forum 10/2009; 28:1791-1800. · 1.60 Impact Factor
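    The "elementary tests that reduce to solving cubic equations" mentioned above are the standard continuous collision detection tests: with linearly interpolated vertex motion, the times at which four points become coplanar are the roots of a cubic in t. The sketch below is a generic CPU-side illustration of that reduction using numpy, not the paper's GPU kernel; the helper name and the sampling-based coefficient fit are choices made here.
    ```python
    import numpy as np

    def coplanarity_times(p0, p1):
        """p0/p1: 4x3 start/end positions of the four points (vertex-face or edge-edge test).
        Returns candidate contact times t in [0, 1] where the four points are coplanar,
        i.e. roots of det([x1-x0, x2-x0, x3-x0]) = 0 with x_i(t) = p0_i + t*(p1_i - p0_i)."""
        p0, p1 = np.asarray(p0, float), np.asarray(p1, float)
        vel = p1 - p0
        # Relative positions/velocities of points 1..3 with respect to point 0.
        a = [p0[i] - p0[0] for i in (1, 2, 3)]
        b = [vel[i] - vel[0] for i in (1, 2, 3)]
        # The determinant is cubic in t: sample it at 4 points and fit the exact coefficients.
        ts = np.array([0.0, 1.0 / 3.0, 2.0 / 3.0, 1.0])
        dets = [np.linalg.det(np.column_stack([a[k] + t * b[k] for k in range(3)])) for t in ts]
        coeffs = np.polyfit(ts, dets, 3)            # exact for a degree-3 polynomial
        # Drop numerically-zero leading terms so lower-degree (degenerate) cases stay stable.
        coeffs[np.abs(coeffs) < 1e-12 * np.abs(coeffs).max()] = 0.0
        roots = np.roots(coeffs)
        real = roots[np.isclose(roots.imag, 0.0)].real
        return sorted(t for t in real if -1e-9 <= t <= 1.0 + 1e-9)

    # A vertex falling straight onto a static triangle becomes coplanar at t = 0.5.
    start = [[0, 0, 0], [1, 0, 0], [0, 1, 0], [0.2, 0.2, 1.0]]
    end   = [[0, 0, 0], [1, 0, 0], [0, 1, 0], [0.2, 0.2, -1.0]]
    print(coplanarity_times(start, end))
    ```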
  •
    ABSTRACT: In the near term, Moore's law will continue to provide an increasing number of transistors and therefore an increasing number of on-chip cores. Limited pin bandwidth prevents the integration of a large number of memory controllers on-chip. With many cores and few memory controllers, where to locate the memory controllers in the on-chip interconnection fabric becomes an important and as yet unexplored question. In this paper we show how the location of the memory controllers can reduce contention (hot spots) in the on-chip fabric and lower the variance in reference latency. This in turn provides predictable performance for memory-intensive applications regardless of the processing core on which a thread is scheduled. We explore the design space of on-chip fabrics to find optimal memory controller placement relative to different topologies (i.e., mesh and torus), routing algorithms, and workloads.
    36th International Symposium on Computer Architecture (ISCA 2009), June 20-24, 2009, Austin, TX, USA; 06/2009
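    The placement question above can be made concrete with a tiny experiment: compare how far cores sit from their nearest memory controller, a rough proxy for latency and its variance, for two placements on an 8x8 mesh. Both placements are illustrative and are not necessarily the configurations evaluated in the paper.
    ```python
    import statistics

    def distance_stats(mesh_dim, controllers):
        """Hop count from every core to its nearest memory controller on a mesh_dim x mesh_dim mesh."""
        dists = []
        for x in range(mesh_dim):
            for y in range(mesh_dim):
                dists.append(min(abs(x - cx) + abs(y - cy) for cx, cy in controllers))
        return statistics.mean(dists), statistics.pvariance(dists)

    top_row = [(x, 0) for x in range(8)]                   # all controllers on one edge
    spread  = [(1, 1), (1, 6), (6, 1), (6, 6),
               (3, 3), (3, 4), (4, 3), (4, 4)]             # spread toward the center (illustrative)

    for name, placement in [("top row", top_row), ("spread", spread)]:
        mean, var = distance_stats(8, placement)
        print(f"{name:8s} mean hops = {mean:.2f}, variance = {var:.2f}")
    ```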
  • Source
    ABSTRACT: Recently proposed high-radix interconnection networks [10] require global adaptive routing to achieve optimum performance. Existing direct adaptive routing methods are slow to sense congestion remote from the source router and hence misroute many packets before such congestion is detected. This paper introduces indirect global adaptive routing (IAR) in which the adaptive routing decision uses information that is not directly available at the source router. We describe four IAR routing methods: credit round trip (CRT) [10], progressive adaptive routing (PAR), piggyback routing (PB), and reservation routing (RES). We evaluate each of these methods on the dragonfly topology under both steady-state and transient loads. Our results show that PB, PAR, and CRT all achieve good performance. PB provides the best absolute performance, with 2-7% lower latency on steady-state uniform random traffic at 70% load, while PAR provides the fastest response on transient loads. We also evaluate the implementation costs of the indirect adaptive routing methods and show that PB has the lowest implementation cost requiring
    36th International Symposium on Computer Architecture (ISCA 2009), June 20-24, 2009, Austin, TX, USA; 06/2009
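    A simplified sketch of a piggyback-style routing decision in the spirit of PB above, with invented field names and a simplified rule: the source prefers the minimal path unless the remote congestion bit piggybacked for its global channel is set and a UGAL-style weighted queue comparison also favors the non-minimal (Valiant) path.
    ```python
    def choose_route(min_queue, nonmin_queue, min_hops, nonmin_hops, min_channel_congested):
        """Toy piggyback-style decision: prefer the minimal path unless the remote congestion
        bit for its global channel is set AND the hop-weighted queue comparison also favors
        the non-minimal (Valiant) path. Field names and the exact rule are simplifications."""
        ugal_prefers_nonmin = min_queue * min_hops > nonmin_queue * nonmin_hops
        if min_channel_congested and ugal_prefers_nonmin:
            return "non-minimal"
        return "minimal"

    # Local queues alone already favor the longer path, but without the remote congestion
    # bit the source still routes minimally (avoiding needless misrouting).
    print(choose_route(min_queue=12, nonmin_queue=2, min_hops=3, nonmin_hops=5,
                       min_channel_congested=False))
    print(choose_route(min_queue=12, nonmin_queue=2, min_hops=3, nonmin_hops=5,
                       min_channel_congested=True))
    ```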
  • Source
    ABSTRACT: Future many-core processors will require high-performance yet energy-efficient on-chip networks to provide a communication substrate for the increasing number of cores. Recent advances in silicon nanophotonics create new opportunities for on-chip networks. To efficiently exploit the benefits of nanophotonics, we propose Firefly - a hybrid, hierarchical network architecture. Firefly consists of clusters of nodes that are connected using conventional electrical signaling, while inter-cluster communication is done using nanophotonics - exploiting the benefits of electrical signaling for short, local communication while nanophotonics is used only for global communication to realize an efficient on-chip network. A crossbar architecture is used for inter-cluster communication. However, to avoid global arbitration, the crossbar is partitioned into multiple logical crossbars and their arbitration is localized. Our evaluations show that Firefly improves performance by up to 57% compared to an all-electrical concentrated mesh (CMESH) topology on adversarial traffic patterns and by up to 54% compared to an all-optical crossbar (OP XBAR) on traffic patterns with locality. When the energy-delay product is compared, Firefly improves the efficiency of the on-chip network by up to 51% and 38% compared to CMESH and OP XBAR, respectively.
    36th International Symposium on Computer Architecture (ISCA 2009), June 20-24, 2009, Austin, TX, USA; 01/2009