John Kim

Northwestern University, Evanston, IL, United States


Publications (17)

  • Source
    ABSTRACT: The scalability trends of modern semiconductor technology lead to increasingly dense multicore chips. Unfortunately, physical limitations in area, power, off-chip bandwidth, and yield constrain single-chip designs to a relatively small number of cores, beyond which scaling becomes impractical. Multi-chip designs overcome these constraints, and can reach scales impossible to realize with conventional single-chip architectures. However, to deliver commensurate performance, multi-chip architectures require a cross-chip interconnect with bandwidth, latency, and energy consumption well beyond the reach of electrical signaling. We propose Galaxy, an architecture that enables the construction of a many-core "virtual chip" by connecting multiple smaller chiplets through optical fibers. The low optical loss of fibers allows the flexible placement of chiplets, and offers simpler packaging, power, and heat requirements. At the same time, the low latency and high bandwidth density of optical signaling maintain the tight coupling of cores, allowing the virtual chip to match the performance of a single chip that is not subject to area, power, and bandwidth limitations. Our results indicate that Galaxy attains a speedup of 2.2x over the best single-chip alternatives with electrical or photonic interconnects (3.4x maximum), and a 2.6x smaller energy-delay product (6.8x maximum). We show that Galaxy scales to 4K cores and attains a 2.5x speedup at 6x lower laser power compared to a Macrochip with silicon waveguides.
    In Proceedings of the ACM International Conference on Supercomputing (ICS); 06/2014
  • Source
    Yan Pan, John Kim, Gokhan Memik
    ABSTRACT: Nanophotonic signaling technology enables efficient global communication and low-diameter networks such as crossbars, which are often optically arbitrated. However, existing optical arbitration schemes incur costly overheads (e.g., waveguides, laser power, etc.) to avoid the starvation caused by their inherent fixed priority, which limits their applicability in power-bounded future many-core processors. On the other hand, quality-of-service (QoS) support in the on-chip network is becoming necessary due to the increase in the number of components in the network. Most prior work on QoS in on-chip networks has focused on conventional multi-hop electrical networks, where the efficiency of QoS is hindered by the limited capabilities of electrical global communication. In this work, we exploit the benefits of nanophotonics to build a lightweight optical arbitration scheme, FeatherWeight, with QoS support. Leveraging the efficient global communication, we devise a feedback-controlled, adaptive source-throttling scheme that asymptotically approaches weighted max-min fairness among all the nodes on the chip. By re-using existing datapath components to exchange minimal global information, FeatherWeight provides freedom from starvation while resulting in negligible (
    01/2011;
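The feedback-controlled source throttling described above can be sketched as a simple control loop. This is a minimal illustration only; the function name, step size, and rate bounds are assumptions, not details from the paper.

```python
def adjust_injection_rate(rate, measured_share, weighted_fair_share, step=0.05):
    """One feedback iteration of adaptive source throttling: a node
    compares its measured share of network throughput against its
    weighted max-min fair share and nudges its injection rate toward
    it. Iterated over many intervals across all nodes, the rates
    asymptotically approach the weighted-fair operating point.
    (The additive step and [0, 1] rate bounds are illustrative.)"""
    if measured_share > weighted_fair_share:
        return max(0.0, rate - step)  # over fair share: throttle down
    return min(1.0, rate + step)      # under fair share: ramp up
```

Because the comparison needs only each node's measured share and its fair share, the per-interval exchange of global information stays minimal, matching the abstract's emphasis on re-using existing datapath components.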
  • Source
    Dennis Abts, John Kim
    01/2011; Morgan & Claypool Publishers.
  • Source
    Workshop on the Interaction between Nanophotonic Devices and Systems (WINDS), co-located with the 43rd International Symposium on Microarchitecture (MICRO); 12/2010
  • Source
    ABSTRACT: Emerging many-core chip multiprocessors will integrate dozens of small processing cores with an on-chip interconnect consisting of point-to-point links. The interconnect enables the processing cores not only to communicate, but to share common resources such as main memory and I/O controllers. In this work, we propose an arbitration scheme to enable equality of service (EoS) in access to a chip's shared resources. That is, we seek to remove any bias in a core's access to a shared resource based on its location in the CMP. We propose using probabilistic arbitration combined with distance-based weights to achieve EoS and overcome the limitations of a conventional round-robin arbiter. We describe how nonlinear weights need to be used with probabilistic arbiters and propose three different arbitration weight metrics - fixed weight, constantly increasing weight, and variably increasing weight. By modifying only the arbitration of an on-chip router, we do not require any additional buffers or virtual channels and create a simple, low-cost mechanism for achieving EoS. We evaluate our arbitration scheme across a wide range of traffic patterns. In addition to providing EoS, the proposed arbitration has additional benefits, which include providing quality-of-service features (such as differentiated service) and providing fairness in terms of both throughput and latency that approaches the global fairness achieved with age-based arbitration - thus providing a more stable network by achieving high sustained throughput beyond saturation.
    43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2010, 4-8 December 2010, Atlanta, Georgia, USA; 01/2010
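The core of the scheme above — a probabilistic arbiter whose grant probability grows nonlinearly with hop count — can be sketched as follows. The function and the quadratic weight are a minimal illustration of the idea, not the paper's exact implementation.

```python
import random

def probabilistic_arbiter(requests, exponent=2, rng=random):
    """Grant one of several competing requests with probability
    proportional to hops**exponent. A nonlinear (here quadratic)
    weight favors distant packets enough to offset the extra
    arbitration stages they must win en route, approximating
    equality of service; a linear weight (exponent=1) would
    under-compensate them. `requests` is a list of
    (source_id, hop_count) pairs."""
    weights = [hops ** exponent for _, hops in requests]
    pick = rng.random() * sum(weights)
    for (source, _), weight in zip(requests, weights):
        pick -= weight
        if pick < 0:
            return source
    return requests[-1][0]  # guard against floating-point edge cases
```

With one requester 1 hop away and another 3 hops away, the distant source wins about 9 times in 10 under a quadratic weight, whereas per-hop round-robin arbitration would compound against it.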
  • Source
    Yan Pan, John Kim, Gokhan Memik
    ABSTRACT: The on-chip network is becoming critical to the scalability of future many-core architectures. Recently, nanophotonics has been proposed for on-chip networks because of its low latency and high bandwidth. However, nanophotonics has relatively high static power consumption, which can lead to inefficient architectures. In this work, we propose FlexiShare - a nanophotonic crossbar architecture that minimizes static power consumption by fully sharing a reduced number of channels across the network. To enable efficient global sharing, we decouple the allocation of the channels and the buffers, and introduce a novel photonic token-stream mechanism for channel arbitration and credit distribution. The flexibility of FlexiShare introduces additional router complexity and electrical power consumption. However, with the reduced number of optical channels, the overall power consumption is reduced without loss in performance. Our evaluation shows that the proposed token-stream arbitration applied to a conventional crossbar design improves network throughput by 5.5× under permutation traffic. In addition, FlexiShare achieves performance similar to a token-stream-arbitrated conventional crossbar using only half the number of channels under balanced, distributed traffic. With the extracted trace traffic from MineBench and SPLASH-2, FlexiShare can further reduce the number of channels by up to 87.5% while still providing better performance - resulting in up to 72% reduction in power consumption compared to the best alternative.
    16th International Conference on High-Performance Computer Architecture (HPCA-16 2010), 9-14 January 2010, Bangalore, India; 01/2010
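A toy model of the token-stream idea — free-channel tokens circulating past nodes, with a requesting node absorbing the first token that reaches it — might look like this. It is a behavioral sketch only; the names and the one-slot-per-step timing are assumptions.

```python
def circulate_tokens(num_nodes, token_positions, requesters, steps):
    """Advance each free-channel token one slot per step around a ring
    of nodes. A node with a pending request absorbs the first token
    that reaches it, which grants it a channel (the same stream can
    symmetrically carry buffer credits back). Returns a dict mapping
    each granted node to the step at which it was granted."""
    tokens = list(token_positions)
    waiting = set(requesters)
    grants = {}
    for step in range(1, steps + 1):
        tokens = [(pos + 1) % num_nodes for pos in tokens]
        still_free = []
        for pos in tokens:
            if pos in waiting:          # token absorbed: channel granted
                waiting.discard(pos)
                grants[pos] = step
            else:
                still_free.append(pos)  # token keeps circulating
        tokens = still_free
    return grants
```

Because every node watches the same circulating stream, no centralized arbiter is needed, and shrinking the number of tokens directly models FlexiShare's reduced channel count.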
  • Source
    ABSTRACT: The on-chip network of emerging many-core CMPs enables the sharing of numerous on-chip components. This on-chip network needs to ensure fairness when accessing the shared resources. In this work, we propose providing equality of service (EoS) in the on-chip networks of future many-core CMPs by leveraging distance, or hop count, to approximate the age of packets in the network. We propose probabilistic arbitration combined with distance-based weights to achieve EoS and overcome the limitations of a conventional round-robin arbiter. We describe how nonlinear weights need to be used with probabilistic arbiters and propose three different arbitration weight metrics - fixed weight, constantly increasing weight, and variably increasing weight. By modifying only the arbitration of an on-chip router, we do not require any additional buffers or virtual channels and create a complexity-effective mechanism for achieving EoS.
    19th International Conference on Parallel Architecture and Compilation Techniques (PACT 2010), Vienna, Austria, September 11-15, 2010; 01/2010
  •
    ABSTRACT: In the near term, Moore's law will continue to provide an increasing number of transistors and therefore an increasing number of on-chip cores. Limited pin bandwidth prevents the integration of a large number of memory controllers on-chip. With many cores and few memory controllers, where to locate the memory controllers in the on-chip interconnection fabric becomes an important and as yet unexplored question. In this paper we show how the location of the memory controllers can reduce contention (hot spots) in the on-chip fabric and lower the variance in reference latency. This in turn provides predictable performance for memory-intensive applications regardless of the processing core on which a thread is scheduled. We explore the design space of on-chip fabrics to find optimal memory controller placement relative to different topologies (i.e., mesh and torus), routing algorithms, and workloads.
    36th International Symposium on Computer Architecture (ISCA 2009), June 20-24, 2009, Austin, TX, USA; 06/2009
  •
    ABSTRACT: Evolving technology and increasing pin bandwidth motivate the use of high-radix routers to reduce the diameter, latency, and cost of interconnection networks. This migration from low-radix to high-radix routers is demonstrated by the recent introduction of high-radix routers, which are expected to impact networks used in large-scale systems such as multicomputers and data centers. As a result, a scalable and cost-efficient topology is needed to properly exploit high-radix routers. High-radix networks require longer cables than their low-radix counterparts. Because cables dominate network cost, the number of cables, and particularly the number of long, global cables, should be minimized to realize an efficient network. In this paper, we introduce the dragonfly topology, which uses a group of high-radix routers as a virtual router to increase the effective radix of the network. With this organization, each minimally routed packet traverses at most one global channel. By reducing the number of global channels, a dragonfly reduces cost by 20% compared to a flattened butterfly and by 52% compared to a folded Clos network in configurations with > 16K nodes. The paper also introduces two new variants of global adaptive routing that enable load-balanced routing in the dragonfly. Each router in a dragonfly must make an adaptive routing decision based on the state of a global channel connected to a different router. Because of the indirect nature of this routing decision, conventional adaptive routing algorithms give degraded performance. We introduce the use of selective virtual-channel discrimination and the use of credit round-trip latency to both sense and signal channel congestion. The combination of these two methods gives throughput and latency approaching those of an ideal adaptive routing algorithm.
    IEEE Micro. 01/2009; 29:33-40.
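The "group as virtual router" construction fixes the topology's scale. A sketch of the standard dragonfly parameterization (p terminals, a routers per group, h global channels per router), consistent with the abstract:

```python
def dragonfly_size(p, a, h):
    """Scale of a dragonfly where each router has p terminals, a-1
    local links to the other routers in its group, and h global
    channels (router radix k = p + a - 1 + h). The a routers of a
    group pool their a*h global channels into one virtual high-radix
    router, so up to g = a*h + 1 groups can be fully connected and
    every minimal route uses at most one global channel."""
    groups = a * h + 1
    nodes = a * p * groups
    radix = p + (a - 1) + h
    return radix, groups, nodes
```

For example, radix-15 routers with p = 4, a = 8, h = 4 yield 33 groups and 1056 nodes; choosing a = 2p = 2h keeps the terminal, local, and global channel loads balanced.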
  • Source
    ABSTRACT: Recently proposed high-radix interconnection networks [10] require global adaptive routing to achieve optimum performance. Existing direct adaptive routing methods are slow to sense congestion remote from the source router and hence misroute many packets before such congestion is detected. This paper introduces indirect global adaptive routing (IAR) in which the adaptive routing decision uses information that is not directly available at the source router. We describe four IAR routing methods: credit round trip (CRT) [10], progressive adaptive routing (PAR), piggyback routing (PB), and reservation routing (RES). We evaluate each of these methods on the dragonfly topology under both steady-state and transient loads. Our results show that PB, PAR, and CRT all achieve good performance. PB provides the best absolute performance, with 2-7% lower latency on steady-state uniform random traffic at 70% load, while PAR provides the fastest response on transient loads. We also evaluate the implementation costs of the indirect adaptive routing methods and show that PB has the lowest implementation cost requiring
    36th International Symposium on Computer Architecture (ISCA 2009), June 20-24, 2009, Austin, TX, USA; 01/2009
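The flavor of indirect adaptive routing can be illustrated with a piggyback-style decision: remote routers broadcast a coarse congestion bit for their global channels, and the source combines it with local queue depths. The function, threshold, and weighting below are illustrative assumptions, not the paper's exact mechanism.

```python
def choose_route(min_queue, nonmin_queue, remote_congested, offset=4):
    """Decide between the minimal path and a Valiant-style non-minimal
    path. Local queue depths alone react slowly to congestion far from
    the source, so a piggybacked 1-bit summary of the remote global
    channel's state is folded into the decision. The non-minimal path
    is roughly twice as long, hence the 2x weighting on its queue."""
    if remote_congested or min_queue > 2 * nonmin_queue + offset:
        return "non-minimal"
    return "minimal"
```

The remote bit lets the source divert traffic before its own queues back up, which is exactly the slow-sensing problem direct adaptive routing suffers from.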
  • Source
    ABSTRACT: Future many-core processors will require high-performance yet energy-efficient on-chip networks to provide a communication substrate for the increasing number of cores. Recent advances in silicon nanophotonics create new opportunities for on-chip networks. To efficiently exploit the benefits of nanophotonics, we propose Firefly - a hybrid, hierarchical network architecture. Firefly consists of clusters of nodes that are connected using conventional electrical signaling, while inter-cluster communication is done using nanophotonics - exploiting the benefits of electrical signaling for short, local communication while nanophotonics is used only for global communication to realize an efficient on-chip network. A crossbar architecture is used for inter-cluster communication. However, to avoid global arbitration, the crossbar is partitioned into multiple logical crossbars and their arbitration is localized. Our evaluations show that Firefly improves performance by up to 57% compared to an all-electrical concentrated mesh (CMESH) topology on adversarial traffic patterns, and by up to 54% compared to an all-optical crossbar (OP XBAR) on traffic patterns with locality. When the energy-delay product is compared, Firefly improves the efficiency of the on-chip network by up to 51% and 38% compared to CMESH and OP XBAR, respectively.
    36th International Symposium on Computer Architecture (ISCA 2009), June 20-24, 2009, Austin, TX, USA; 01/2009
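A sketch of the hybrid hierarchy's routing decision. The cluster numbering and the idea that each destination cluster owns one logical crossbar are simplifying assumptions for illustration.

```python
def firefly_path(src_core, dst_core, cluster_size):
    """Route in a hybrid hierarchical network: traffic between cores
    in the same cluster stays on the short-range electrical network,
    while inter-cluster traffic takes one hop on a nanophotonic
    crossbar. Partitioning the crossbar so that each destination
    cluster owns one logical crossbar keeps arbitration local to
    that cluster's receivers, avoiding global arbitration."""
    src_cluster = src_core // cluster_size
    dst_cluster = dst_core // cluster_size
    if src_cluster == dst_cluster:
        return ["electrical"]
    return ["electrical", f"photonic-xbar-{dst_cluster}", "electrical"]
```

The split matches the abstract's rationale: electrical links are cheap over short distances, while photonics amortizes its static cost only on long, global hops.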
  • Source
    ABSTRACT: Sharing on-chip network resources efficiently is critical in the design of a cost-efficient network-on-chip (NoC). Concentration has been proposed for on-chip networks, but the trade-off between concentration implementation and performance has not been well understood. In this paper, we describe cost-efficient implementations of concentration and show how external concentration provides a significant reduction in complexity (47% and 36% reductions in area and energy, respectively) compared to the previously assumed integrated (high-radix) concentration, while degrading overall performance by only 10%. Hybrid implementations of concentration are also presented, which provide an additional trade-off between complexity and performance. To further reduce the cost of NoCs, we describe how channel slicing can be used together with concentration. We propose virtual concentration, which further reduces complexity - saving area and energy by 69% and 32% compared to a baseline mesh, and by 88% and 35% over a baseline concentrated mesh.
    Third International Symposium on Networks-on-Chips, NOCS 2009, May 10-13 2009, La Jolla, CA, USA. Proceedings; 01/2009
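External concentration amounts to a mux in front of the router rather than a wider router. A minimal sketch of the core-to-router mapping in a concentrated mesh; the concentration factor and mesh width are example parameters, not values from the paper.

```python
def concentrated_mesh_router(core_id, concentration=4, mesh_width=4):
    """Map a core to its router in a concentrated mesh: `concentration`
    cores share one injection port through an external mux, so a
    64-core chip needs only a 4x4 mesh of routers instead of 8x8.
    External concentration keeps the router radix low (and hence
    cheap), at the cost of serializing the sharing cores onto one
    port. Returns (router_id, (x, y)) for the core's router."""
    router = core_id // concentration
    return router, (router % mesh_width, router // mesh_width)
```

An integrated (high-radix) alternative would instead give the router one port per core, trading mux serialization for a larger, more expensive crossbar, which is the complexity/performance trade-off the abstract quantifies.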
  • Source
    ABSTRACT: With the trend towards an increasing number of cores in chip multiprocessors, the on-chip interconnect that connects the cores needs to scale efficiently. In this work, we propose the use of high-radix networks in on-chip interconnection networks and describe how the flattened butterfly topology can be mapped to on-chip networks. By using high-radix routers to reduce the diameter of the network, the flattened butterfly offers lower latency and energy consumption than conventional on-chip topologies. In addition, by exploiting the two-dimensional planar VLSI layout, the on-chip flattened butterfly can exploit bypass channels such that non-minimal routing can be used with minimal impact on latency and energy consumption. We evaluate the flattened butterfly and compare it to alternate on-chip topologies using synthetic traffic patterns and traces, and show that the flattened butterfly can increase throughput by up to 50% compared to a concentrated mesh and reduce latency by 28% while reducing power consumption by 38% compared to a mesh network.
    40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007); 01/2008
  • Conference Paper: Flattened butterfly
    ABSTRACT: Increasing integrated-circuit pin bandwidth has motivated a corresponding increase in the degree or radix of interconnection networks and their routers. This paper introduces the flattened butterfly, a cost-efficient topology for high-radix networks. On benign (load-balanced) traffic, the flattened butterfly approaches the cost/performance of a butterfly network and has roughly half the cost of a comparable-performance Clos network. The advantage over the Clos is achieved by eliminating redundant hops when they are not needed for load balance. On adversarial traffic, the flattened butterfly matches the cost/performance of a folded-Clos network and provides an order of magnitude better performance than a conventional butterfly. In this case, global adaptive routing is used to switch the flattened butterfly from minimal to non-minimal routing - using redundant hops only when they are needed. Minimal and non-minimal, oblivious and adaptive routing algorithms are evaluated on the flattened butterfly. We show that load-balancing adversarial traffic requires non-minimal globally-adaptive routing and show that sequential allocators are required to avoid transient load imbalance when using adaptive routing algorithms. We also compare the cost of the flattened butterfly to folded-Clos, hypercube, and butterfly networks with identical capacity and show that the flattened butterfly is more cost-efficient than folded-Clos and hypercube topologies.
    34th International Symposium on Computer Architecture (ISCA 2007), June 9-13, 2007, San Diego, California, USA; 01/2007
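Flattening merges the routers of each row of a conventional butterfly into one high-radix router, leaving routers at the points of a k-ary n-dimensional grid with complete connectivity in each dimension. A sketch of the resulting minimal hop count:

```python
def flattened_butterfly_hops(src, dst, k, n):
    """Minimal hops between routers of a k-ary n-dimensional flattened
    butterfly. Routers sit at coordinates in {0..k-1}^n with an
    all-to-all link set in every dimension, so a minimal route fixes
    each differing coordinate in a single hop (network diameter n,
    versus the n serial stages of the unflattened butterfly)."""
    def coords(router):
        return [(router // k ** i) % k for i in range(n)]
    return sum(a != b for a, b in zip(coords(src), coords(dst)))
```

Eliminating a hop per already-matching coordinate is precisely the "redundant hop" saving the abstract describes; non-minimal routing re-introduces extra hops only when load balance demands it.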
  • Source
    ABSTRACT: Recent increases in the pin bandwidth of integrated circuits have motivated an increase in the degree or radix of interconnection network routers. The folded-Clos network can take advantage of these high-radix routers, and this paper investigates adaptive routing in such networks. We show that adaptive routing, if done properly, outperforms oblivious routing by providing lower latency, lower latency variance, and higher throughput with limited buffering. Adaptive routing is particularly useful in load balancing around nonuniformities caused by deterministically routed traffic or the presence of faults in the network. We evaluate alternative allocation algorithms used in adaptive routing and compare their performance. The use of randomization in the allocation algorithms can simplify the implementation while sacrificing minimal performance. The cost of adaptive routing, in terms of router latency and area, is increased in high-radix routers. We show that the use of imprecise queue information reduces the implementation complexity, and precomputation of the allocations minimizes the impact of adaptive routing on router latency.
    SC 2006 Conference, Proceedings of the ACM/IEEE; 12/2006
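The adaptive up-routing choice in a folded Clos can be sketched as follows; the random tie-break stands in for the paper's observation that randomized allocators sacrifice little performance, and the names are illustrative.

```python
import random

def adaptive_up_port(queue_estimates, rng=random):
    """Pick an uplink in a folded Clos adaptively: any up port leads
    to a common ancestor of source and destination, so choose the
    port with the shortest (possibly stale or imprecise) queue
    estimate and break ties randomly. Imprecise estimates keep the
    comparison logic cheap in a high-radix router; randomization
    keeps the allocator simple."""
    shortest = min(queue_estimates)
    candidates = [i for i, q in enumerate(queue_estimates)
                  if q == shortest]
    return rng.choice(candidates)
```

Once the packet reaches the common ancestor, the down path to the destination is fully determined, so all of the adaptivity, and all of its cost, lives in this up-port selection.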