Citation: O’Connor, O.; Elfouly, T.; Alouani, A. Survey of Novel Architectures for Energy Efficient High-Performance Mobile Computing Platforms. Energies 2023, 16, 6043. https://doi.org/10.3390/en16166043
Academic Editor: Luigi Fortuna
Received: 26 June 2023; Revised: 11 August 2023; Accepted: 16 August 2023; Published: 18 August 2023
Copyright: © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Review
Survey of Novel Architectures for Energy Efficient
High-Performance Mobile Computing Platforms
Owen O’Connor, Tarek Elfouly and Ali Alouani *
Department of Electrical and Computer Engineering, Tennessee Technological University, Cookeville, TN 38505, USA; ococonnor42@tntech.edu (O.O.); telfouly@tntech.edu (T.E.)
* Correspondence: aalouani@tntech.edu; Tel.: +1-931-372-3383
Abstract: There are many real-world applications that require high-performance mobile computing
systems for onboard, real-time processing of gathered data due to latency, reliability, security, or other
application constraints. Unfortunately, most existing high-performance mobile computing systems
require a prohibitively high power consumption in the face of the limited power available from the
batteries typically used in these applications. For high-performance mobile computing to be practical,
alternative hardware designs are needed to increase the computing performance while minimizing
the required power consumption. This article surveys the state-of-the-art in high-efficiency, high-
performance onboard mobile computing, focusing on the latest developments. It was found that
more research is needed to design high-performance mobile computing systems while minimizing
the required power consumption to meet the needs of these applications.
Keywords: power; high-performance mobile computing systems; endpoint computing; onboard computing; embedded computing; energy efficiency; co-processors; acceleration
1. Introduction
In recent years, the capabilities of high-performance computing systems have been growing in leaps and bounds, with individual processors reaching hundreds if not thousands of teraflops [1]. At the cutting edge of performance, Oak Ridge National Lab’s Frontier supercomputer is able to perform more than one quintillion floating point operations per second [2]. This performance comes at a price, both in terms of physical space and the massive power consumption required. While workstations and data centers can accommodate these constraints, they are major obstacles for mobile computing systems.
Current and future voice- and gesture-controlled smart devices, wearable and portable medical monitoring devices, advanced metering infrastructure (AMI) in smart grid and sensors for monitoring power generation such as those outlined in [3], and many other applications requiring local, onboard processing of big data have to use high-performance mobile computing systems. One of the limiting factors affecting the acceptance of smart devices is concern about the users’ privacy [4]. To mitigate these concerns, the devices must be able to execute complex machine learning algorithms on sensor data without offloading data to a central server. For many medical and smart grid applications, large amounts of sensor data must be acted on quickly and reliably in order to make a timely diagnosis or issue a warning about impending medical or stability issues, respectively.
Current mobile computing systems with insufficient onboard computational performance will often offload sensor data to remote servers for processing, but this approach has several disadvantages. First, it requires the system to maintain a constant connection to the remote server in order to transfer data. A constant, high-bandwidth connection comes with its own associated power draw [5]. The data transfer may also introduce unwanted latency, especially for applications with time-sensitive outcomes such as medical or AMI devices. Additionally, data transfer may not be possible in remote areas where access to communication is poor or non-existent. Finally, there have been concerns raised about user privacy when data is transferred to a remote system [4]. Data collected by sensors
often includes sensitive personal information from the users, such as biomedical data. This
data must be protected before it is transmitted from the system, adding to the overall
complexity of the design. Research has been conducted regarding better ways of protecting
user privacy, such as encryption or anonymization. However, the most reliable method
of protecting sensitive data would be to never transfer it off the device if it is possible to
process it locally.
Furthermore, offloading data to central servers for processing introduces issues on the server side as well. In order to process all the incoming data in a reasonable amount of time, expensive, high-bandwidth data centers are needed [6]. In order to alleviate these issues somewhat while keeping the computational load off of the mobile computing system, an alternative technique called edge computing has been proposed. In edge computing, data processing is offloaded to one of many small, localized servers instead of a large centralized data center [7]. In this way, the computational performance and bandwidth required for each of the edge computing servers is greatly reduced, decreasing the overall cost and complexity of the system. This technique has gained traction in recent years, with much research being conducted to find the most efficient way to distribute a workload across an edge computing environment and different network topologies to yield the fastest and most reliable results [8–10]. While edge computing does offer several advantages over the more traditional, centralized data processing, it still does not fully address the concerns outlined in the previous paragraph. Edge computing techniques still require a connection to a server to offload data processing to, and although these connections tend to have a much lower latency [11], this still introduces reliability and security concerns.
In order for these issues to be fully remedied, true onboard or “endpoint” computing is needed. Unfortunately, due to the limited computational performance of the mobile computing systems currently available and the complexity of the tasks they would have to perform, this would be next to impossible [7]. As such, new and much more efficient mobile computing hardware is required. A brief overview of the existing mobile computing hardware and its limitations is provided in Section 2. Section 3 then provides a review of research being conducted into alternative hardware designs that can potentially provide the additional efficiency required for onboard, real-time processing of data in mobile computing systems.
2. Existing Solutions
There are several different approaches to accelerating mathematical operations in conventional, commercially available mobile computing systems. At the most basic level, there are the floating point units (FPUs) seen in architectures such as the Cortex M4 and Cortex M7 [12,13]. These FPUs allow the processors to perform more floating point operations per second compared to similar processors without FPUs, but they are still very limited. The style of FPU used in these architectures is typically only capable of performing a single floating point operation at a time, limiting the overall throughput. This limited throughput, combined with the non-negligible power consumption of the FPU, leads to a mediocre computational efficiency for the overall system, typically less than one billion operations per second per Watt (1 GOPS/W) [14]. In applications where a large amount of computation needs to be performed every second or where high efficiency is required, some alternative solution is needed.
Two of the most direct approaches used to increase the computational performance of mobile computing systems are techniques called single instruction, multiple data (SIMD) and the closely related single instruction, multiple thread (SIMT). With these techniques, the same operation is performed simultaneously on multiple data points, greatly increasing the overall throughput of the processor without an unreasonable increase in hardware cost. This parallel processing approach makes it very easy to scale up an architecture, allowing for extremely high throughput at the cost of increased power consumption. This scalability can be seen in Nvidia’s Jetson Orin line of high-performance mobile computing systems. At the lowest end of their product stack is a system capable of performing 40 trillion operations per second (TOPS) with a maximum power consumption of 15 W [15]. Their flagship system, on the other hand, has been scaled up to be capable of performing 275 TOPS with a maximum power consumption of 60 W [1]. The parallel nature of these processors also grants them a higher efficiency than the simpler FPUs, with modern processors approaching an efficiency of around 5 TOPS/W [1]. The high efficiency of SIMD/SIMT architectures has been pushed even further without sacrificing computational power through even more highly specialized processors. By focusing on a very small set of operations that the processor can perform, designers are able to greatly optimize both the physical size and power consumption of the mobile computing system. One such example is AMD’s Versal AI Edge processors [16]. Because these processors are designed specifically to accelerate AI operations and nothing else, they need a much lower data precision and, as a result, a much smaller ALU with a much lower power consumption. These optimizations give specialized processors a significant increase in efficiency when compared to contemporary, non-specialized processors [17].
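The SIMD idea above can be sketched in software. NumPy’s vectorized operations dispatch one logical instruction across a whole array (and its kernels are often compiled with SIMD support), which loosely mirrors what SIMD/SIMT hardware does. This is only an illustrative analogy, not a model of any specific accelerator; the array size and timings are arbitrary.

```python
import time

import numpy as np

# SIMD concept: one instruction applied to many data elements at once.
n = 1_000_000
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)

# Scalar baseline: one explicit multiply per element, like a simple FPU.
t0 = time.perf_counter()
out_scalar = [a[i] * b[i] for i in range(n)]
t_scalar = time.perf_counter() - t0

# Vectorized: the same multiply issued across the entire array at once.
t0 = time.perf_counter()
out_vec = a * b
t_vec = time.perf_counter() - t0

print(f"scalar loop: {t_scalar:.3f} s, vectorized: {t_vec:.3f} s")
```

The two results are numerically identical; only the throughput differs, which is the same trade SIMD/SIMT hardware exploits.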
At the extreme end of application specialization are field programmable gate arrays (FPGAs) and application specific integrated circuits (ASICs). ASICs are designed specifically to perform one task and one task only, often eschewing a traditional Harvard or von Neumann architecture for one purpose-built for the application. As a result, ASICs are capable of delivering extremely fast, extremely efficient data processing but require huge amounts of investment in research and development to implement. Because of this, ASICs are usually confined to large-volume applications, such as networking and communication equipment, where the additional development costs can be spread across thousands or even millions of products [18,19]. For smaller-volume applications, commercially available FPGAs can be used to implement custom logic and custom architectures without the need to design a complete chip from scratch. However, FPGAs still present additional design complexities over a conventional processor and offer lower overall efficiencies and speeds than ASICs.
Conventional portable computing accelerators capable of onboard processing of large amounts of data often have extremely high power requirements [1,16]. As a result of these high power requirements, large, heavy, and expensive batteries are required for the system to have any reasonable battery life. Similarly, if the requirements for the device demand a small form factor and long battery life, the amount of data processing that the device is capable of will be limited. In the applications discussed above, where large amounts of onboard data processing are required while maintaining low power consumption, these trade-offs are not acceptable. As a result, new and more efficient hardware designs for high-performance mobile computing systems are required for these applications. Section 3 discusses research being conducted into alternative hardware designs for high-performance mobile computing systems in more detail.
3. Literature Survey
To enable real-time onboard processing of data, new architectural paradigms are needed. In 1961, Landauer established a theoretical upper bound on the efficiency of a non-reversible computing system using thermodynamic analysis [20]. While it is unlikely that a computing system will reach this perfect efficiency anytime in the near future, the conventional approaches in use are still multiple orders of magnitude away from the limit. According to Landauer’s work, the theoretical minimum energy that must be consumed when a bit is flipped is 2.6 × 10⁻²¹ J [20], but even current state-of-the-art computing systems require something on the order of 1.0 × 10⁻¹³ J [21].
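Landauer’s bound is k·T·ln 2 joules per irreversible bit operation. The quick check below assumes the 2.6 × 10⁻²¹ J figure corresponds to a temperature near 273 K (an assumption; at 300 K the bound is about 2.9 × 10⁻²¹ J) and uses the 1.0 × 10⁻¹³ J per-operation estimate quoted above to gauge how far current hardware sits from the limit.

```python
import math

# Landauer's principle: erasing one bit dissipates at least k*T*ln(2).
k_B = 1.380649e-23            # Boltzmann constant, J/K
T = 273.0                     # kelvin (assumed; matches the 2.6e-21 J figure)
landauer_limit = k_B * T * math.log(2)   # joules per bit

# Approximate energy per operation of current hardware, from the text.
current_energy_per_op = 1.0e-13
gap = current_energy_per_op / landauer_limit

print(f"Landauer limit at {T:.0f} K: {landauer_limit:.2e} J")
print(f"Gap to current hardware: ~{math.log10(gap):.1f} orders of magnitude")
```

The result, roughly 7 to 8 orders of magnitude, is the headroom the survey refers to.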
Researchers have identified several inefficiencies in the architectural design of a computing system, in the way data is stored and processed, and in the technology used to implement the system. These inefficiencies include complications from modeling a physical system in binary logic [22], memory bottlenecks in the von Neumann architecture [23], wasted energy dissipated within logic gates [24] (pp. 181–185), and inefficiencies arising from the physical properties of transistors themselves [25]. This paper provides a survey of several related high-performance computing systems focusing on power efficiency, while also discussing total computational performance and the architectural design of each system.
Power consumption of a processor is typically divided into two parts: dynamic and static power consumption. Dynamic power is consumed when internal signals are changed, e.g., when data is being processed or transferred, and is primarily dependent on the size and number of transistors, the supply voltage, how frequently the signals change, and the operating frequency of the system [24] (pp. 185–194). In a conventional CMOS circuit, the primary cause of dynamic power consumption is the gate capacitance of the transistors being repeatedly charged from the source voltage and then discharged directly to ground [24] (pp. 181–185). Dynamic power consumption can be further divided into two parts: the power consumed to store and transfer data and the power used to operate on that data. In a general purpose processor, the power required to store and transmit data can be close to or even above 50% of the total power consumed by the processor [26]. Static power, on the other hand, is consumed both when the processor is actively processing data and when it is idle. In a conventional CMOS circuit, static power consumption is primarily caused by internal leakage of the transistors. Static power consumption is primarily dependent on the source voltage, transistor size, and number of transistors [24] (pp. 194–200). As the transistors used in modern processors have gotten progressively smaller, they have allowed more and more current leakage. This increase in leakage current has led to a corresponding increase in wasted energy from static power consumption.
Optimization techniques for conventional CMOS designs focus on decreasing one of these sources of power consumption. For example, the specialized circuits discussed earlier, such as TPUs and ASICs, reduce dynamic power consumption by decreasing the number of transistors required to perform any given task while simultaneously cutting back on internal data transfer [17]. Dynamic power consumption can also be reduced at the cost of computational power by limiting the operating frequency. When the operating frequency of a processor is lowered, the voltage required for the processor to function properly is also reduced. By dynamically varying the operating frequency and source voltage simultaneously, not only are the dynamic and static power consumption reduced, but the computational efficiency is increased [27]. Idle power consumption can be reduced even further with techniques called power gating and clock gating, where power sources and clock signals are completely shut off from portions of a circuit when they are not being used. However, this does not have any significant effect on efficiency when the processor is under full load, and it even shows mixed results for extremely low power circuits [28].
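The first-order CMOS power model behind these techniques, dynamic power α·C·V²·f plus a static V·I_leak term, can be sketched as follows. All numeric values here are hypothetical, chosen only to illustrate why lowering voltage and frequency together cuts power faster than it cuts performance.

```python
# First-order CMOS power model (illustrative parameter values only):
#   P = alpha * C * V^2 * f  +  V * I_leak
#       (switching activity, switched capacitance, supply voltage,
#        clock frequency, leakage current)
def total_power(alpha, C, V, f, I_leak):
    return alpha * C * V**2 * f + V * I_leak

# Hypothetical baseline operating point: 1.0 V at 2 GHz.
p_base = total_power(alpha=0.2, C=1e-9, V=1.0, f=2e9, I_leak=0.05)

# DVFS: halving frequency permits a lower supply voltage; because the
# dynamic term scales with V^2 * f, power falls by far more than half.
p_dvfs = total_power(alpha=0.2, C=1e-9, V=0.7, f=1e9, I_leak=0.05)

print(f"baseline: {p_base:.3f} W, voltage/frequency scaled: {p_dvfs:.3f} W")
```

With these example numbers, performance drops 2x while power drops over 3x, so operations per joule improve, which is the effect reported in [27].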
The rest of this section will cover alternative techniques that are being researched to
avoid or mitigate these inefficiencies, grouped into general sections based on what basic
approach the researchers took.
3.1. Analog Computing
One major limitation of the mobile computing systems currently in use is the need to translate complex physical systems or even simple mathematical operations into discrete binary operations. One alternative that has been proposed to this approach is analog computing, which uses the physical properties of circuit components to create analogous models of physical systems or mathematical equations directly in hardware. In an analog computer, a variable is represented by a continuous quantity such as voltage, current, or resistance instead of a vector of bits. This approach greatly decreases the bus width required to transfer data and the number of transistors required to process it, but it does come with its own disadvantages. Fabrication inconsistencies in the photolithography process mean that no two analog components are guaranteed to have the same parameters. Due to the nature of analog circuits, these small inconsistencies accumulate and create inaccuracies in the output. Additionally, any time data needs to be translated between the analog and digital domains, digital-to-analog converters (DACs) or analog-to-digital converters (ADCs) are needed, increasing circuit complexity and power draw.
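As a rough illustration of the principle, the sketch below digitally simulates the dynamics of an RC integrator, whose voltage obeys dV/dt = -V/(RC), the same form as the equation dx/dt = -x. A real analog computer "solves" this continuously in hardware with no stepping; the component values here are arbitrary.

```python
import math

# Analog computing idea: map an equation onto circuit dynamics.
# An RC integrator's voltage decays as dV/dt = -V / (R*C), so letting
# the circuit evolve in time solves dx/dt = -x "for free".
R, C = 1e3, 1e-3          # 1 kΩ, 1 mF -> time constant R*C = 1 s (illustrative)
tau = R * C
dt = 1e-4                 # Euler time step for this digital stand-in
v = 1.0                   # initial condition x(0) = 1

for _ in range(10_000):   # simulate one time constant (1 s)
    v += dt * (-v / tau)  # one Euler step of the circuit's dynamics

print(f"simulated: {v:.4f}, analytic e^-1: {math.exp(-1):.4f}")
```

After one time constant the simulated voltage matches the analytic solution e⁻¹ ≈ 0.368, which the physical circuit would reach with no digital arithmetic at all.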
Similarly to digital circuits, analog circuits can be designed to be general purpose or optimized for a specific problem. As expected, the application-specific analog circuits are often highly efficient and can take up very little die space. One example is presented in [29], where the authors propose a hybrid analog approach for highly efficient detection of voice keywords. Because they were able to operate directly on analog data captured from the microphone, they were also able to avoid expensive conversions between the digital and analog domains. The authors of [30] applied a similar approach to data pre-processing of EEG signals.
One approach that has been taken for more general purpose analog computing is to design an analog computer with configurable parameters to model a family of mathematical systems. Two popular targets, due to their applicability to many real-world problems, are linear algebra systems and partial differential equations (PDEs). The authors of [31] note some limitations of existing analog PDE solvers, primarily the size required, and propose a potential solution that uses an improved, iterative algorithm. Using this algorithm, the authors were able to demonstrate a four order of magnitude improvement in the die size required for a given computational performance. An alternative approach is presented in [32], where the authors choose to focus solely on non-linear PDEs. This restriction meant that the authors were able to design a very simple, parallelizable circuit to find the solution to the PDE at any given point. In [22], the authors designed an analog computer to model and solve linear algebra systems using ordinary differential equations with extremely high efficiency. The authors extrapolated their results to a 600 mm² chip, a similar size to high-end data center accelerators, and predicted that even with such a large die their design would still only draw about 1 W of power.
Another option for more general purpose analog computing is a field programmable analog array, or FPAA [33]. An FPAA is essentially the analog equivalent of an FPGA, where numerous small computation blocks are joined together with a network of configurable interconnects. Similar to an FPGA, this configurability gives FPAAs a large amount of flexibility at the cost of an increased die size and lower maximum performance. The analog nature of an FPAA, however, introduces a new set of design complexities. In an FPGA, the configurable interconnects do little more than create a small delay, but in an FPAA they can also have a small effect on the value of the signal being passed. Because of manufacturing inconsistencies, this small effect can vary from one interconnect to another, creating uncertainty in how any given configuration will perform. The authors of [34] propose a modification to the basic design of an FPAA along with a calibration method to minimize the effects of these inconsistencies, leading to a more reliable system overall. FPAAs have also shown good results in digital/analog hybrid architectures, as demonstrated in [35]. The authors of this article integrated an FPAA with a traditional digital microprocessor, allowing them to simultaneously take advantage of the benefits of both analog and digital computing while mitigating some of the disadvantages of each.
3.2. In-Memory Computing
The von Neumann and Harvard architectures are capable of extremely efficient operation in applications where only small amounts of data need to be processed, but they begin to suffer as more and more data must be processed. One of the major factors contributing to this loss of efficiency, known as the von Neumann bottleneck or the “memory wall”, was first recognized multiple decades ago [23]. In the von Neumann architecture, any data that needs to be processed must be transferred from memory to some small number of register files located within the processor core, have the necessary operations performed on it, and then be transferred back to memory. Due to the mismatch between the maximum bandwidth of memory and the number of operations the processor core is capable of performing per second, this setup often leaves the processor core starved for data. The addition of local high-speed cache memory and out-of-order or speculative instruction execution within the processor core has alleviated this problem somewhat, but both come with their own disadvantages. Cache memory takes up a large amount of die space and is extremely power intensive [26], and various implementations of speculative or out-of-order execution have introduced major hardware-level security vulnerabilities to processors [36]. In addition to limiting overall throughput, this data transfer to and from the processing core is a major contributing factor in the power draw of a von Neumann system [26].
One solution that has been proposed to bypass the von Neumann bottleneck is to integrate numerous small computing elements into memory, a technique called in-memory computing, or IMC. These computing elements operate on the data directly in place in memory, often in parallel with each other, removing the need to transfer data back and forth between the memory and the central processor. In order to save on die space, designs usually opt to implement a very small number of functions in the in-memory circuitry. One popular operation to implement, due to its usefulness for data processing, signal processing, and neural networks, is the multiply-and-accumulate (MAC) operation.
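Functionally, the MAC primitive is just acc += w·x repeated over stored values. The sketch below models one IMC column in plain software; it is a behavioral illustration only, not a hardware model, and the weights and inputs are made-up example numbers.

```python
# Behavioral model of the multiply-and-accumulate (MAC) primitive.
# In an IMC design each memory cell multiplies its stored weight by the
# broadcast input in place, and partial sums are accumulated along a
# column; this sequential loop reproduces only the arithmetic.
def mac_column(weights, inputs):
    acc = 0.0
    for w, x in zip(weights, inputs):
        acc += w * x          # one MAC; done in parallel in hardware
    return acc

# A dot product, the core of digital filtering and neural-network
# layers, is exactly one column of MACs (example values):
weights = [0.5, -1.0, 2.0, 0.25]
inputs = [4.0, 3.0, 1.0, 8.0]
print(mac_column(weights, inputs))   # 0.5*4 - 1.0*3 + 2.0*1 + 0.25*8 = 3.0
```

Keeping this loop inside the memory array is what removes the data movement that dominates the power budget of a von Neumann design [26].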
Many optimizations have been made to the basic IMC MAC architecture over the years. In [37], in-memory MAC operations are added to the base layer of a high-bandwidth, 3D-stacked memory solution. This approach greatly increases the number of data points that can be operated on in place at any given time, further reducing the amount of data transfer required. Another optimization, more tailor-fit to a specific problem, is discussed in [38]. In this paper, the authors apply an optimization technique already common in convolutional neural networks to perform MAC operations on large bins of data simultaneously. This binning approach greatly reduces the total number of operations that need to be performed, leading to a decrease in overall power consumption.
Another method that has been used to further reduce the amount of die space required to perform MAC operations in memory is stochastic computing [39]. In stochastic computing, values are represented by the proportion of high and low bits in a signal instead of the position of the bits as in a standard binary value. This approach leads to a lower maximum precision for any given bit-width, but has the advantage of being able to perform arithmetic operations directly with simple logic gates. This not only decreases the die size required to perform MAC operations, but also greatly decreases the power draw. Unfortunately, the translation between traditional binary representations of numbers and their stochastic equivalents can draw a lot of power. The authors of [40] worked around this issue by implementing an IMC architecture that stores and operates on stochastic values natively, only converting them to binary when needed by the main processor.
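A minimal software sketch of this encoding, assuming independent unipolar bitstreams: a value in [0, 1] becomes the fraction of 1s in a random stream, and a single AND gate per bit pair then implements multiplication. The stream length and example values are arbitrary.

```python
import random

# Stochastic computing sketch: encode a value in [0, 1] as the
# proportion of 1s in a random bitstream.
def encode(value, n_bits, rng):
    return [1 if rng.random() < value else 0 for _ in range(n_bits)]

def decode(stream):
    return sum(stream) / len(stream)

rng = random.Random(42)       # fixed seed for reproducibility
n = 100_000                   # longer streams -> better precision
a = encode(0.5, n, rng)
b = encode(0.6, n, rng)       # independent of a: separate random draws

# For independent unipolar streams, a per-bit AND gate multiplies:
# P(a_i AND b_i = 1) = P(a_i = 1) * P(b_i = 1).
product_stream = [x & y for x, y in zip(a, b)]
print(f"0.5 * 0.6 ~ {decode(product_stream):.3f}")
```

The decoded product converges on 0.30 as the stream grows, with no multiplier circuit anywhere, which is the die-size and power advantage the text describes.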
Hybrid designs that combine analog and in-memory computing have also been proposed [41]. Similarly to stochastic computing, using analog computing in memory leads to significant improvements in efficiency and die size, but requires expensive conversion circuits in order to be used in tandem with digital circuitry. To combat this, the authors set up their design so that multiple analog elements share a single ADC. This reduces the overall conversion bandwidth but also decreases the die size and power draw required, increasing the overall efficiency of the computing system.
3.3. Adiabatic Computing
In traditional CMOS designs, large amounts of energy are lost through the repeated charging and discharging of the internal capacitance of the transistors [24]. Adiabatic computing, also known as charge recovery computing, improves the overall efficiency of a system by recovering a portion of that charge at the end of every clock cycle. In order to recover the charge, additional transistors are required, increasing the physical size of the circuit. Additionally, there are extra design rules that need to be followed when implementing an adiabatic circuit, complicating the design process. One of the major requirements is that all operations must be reversible, which can introduce otherwise unnecessary signals into a design. Furthermore, even though adiabatic logic can greatly decrease the dynamic power draw of a circuit, static power dissipation is still a major problem that must be considered [42].
Because of these limitations, the majority of research has focused on implementing the most power-hungry components of a processor in adiabatic logic. These components can then be used as part of a hybrid design, either with a conventional CMOS processor or with one of the other techniques discussed in this paper, to increase the overall efficiency of the system. The authors of [43] compare the performance and efficiency of four possible implementations of an adiabatic adder. For their tests, the authors simulated a four-bit adder in each of the four adiabatic logic families and compared the adders based on their physical size in the layout and their power consumption at various frequencies. As noted earlier, memory and data transfer are among the primary sources of power dissipation in a traditional digital computer [26]. Because of this, the authors of [44] chose to demonstrate the feasibility of using charge recovery computing for memory circuits. Their adiabatic implementation of a 5T SRAM cell showed impressive savings in power dissipation, though it did come at the cost of increased write times [44]. Adiabatic logic can also be used to implement other sequential logic elements, as demonstrated in [45]. In this paper, the authors used adiabatic computing to implement data and toggle flip-flops, and then characterized their power consumption in comparison to a traditional DFF implemented in conventional CMOS logic. At a frequency of 25 MHz, they were able to demonstrate an impressive efficiency improvement of 164 times over the conventional CMOS design, though this dropped off to about 7 times when the frequency was increased to 100 MHz.
Adiabatic computing principles have also been used in the design of analog processor components. The authors of [46,47] propose two possible adiabatic implementations of a comparator with promising results. In [46], the authors demonstrate an adiabatic implementation of the StrongARM comparator that decreases energy dissipation by 55% when compared to a conventional CMOS implementation. The authors of [47], on the other hand, primarily focus on increasing the maximum operational frequency of an existing adiabatic comparator design, though they are still also able to demonstrate significant energy savings. Studies have also shown great success in using adiabatic logic as an interface between the analog and digital domains. As discussed in Section 3.1, converting between the analog and digital representations of a value is a major limiting factor for the efficiency of an analog computer system. The authors of [48] mitigate this conversion cost with a high-speed, high-precision ADC using adiabatic logic with a respectable efficiency of 4.4 fJ per conversion step. The authors of [49], on the other hand, implemented a flash ADC with a three-bit resolution that drew about 90% less power compared to a conventional CMOS design when operated at 1 MHz.
One of the first implementations of a fully adiabatic processor was published in 1995 in [50]. In this thesis, the author discusses the various issues that are encountered when designing an adiabatic processor and where future work could be performed to mitigate these issues and improve the overall design. One of the largest issues presented is dealing with the extraneous data required to meet the reversibility requirements of an adiabatic design. This extraneous data must be generated as part of the computations, but is otherwise unneeded for the processor to function and increases the overall complexity of the processor [50].
In recent years, fully adiabatic systems have seen great interest for their potential applications in medical sensors and encryption. Because of their use of charge recovery, adiabatic circuits not only have increased efficiency but also present much more uniform power draw profiles. The authors of [51] take advantage of this second fact to design a high-efficiency cryptographic accelerator that is much more resistant to power-based side-channel attacks than similar designs implemented in conventional logic. The authors of [52] take this concept one step further, implementing the entire medical device in adiabatic logic. By using only adiabatic logic, the authors were able to power the device entirely from energy harvested from radio frequency signals, removing the need for a physical power source. The authors note that the ability for the device to be powered wirelessly would be an incredible advantage for implanted medical devices.
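The power budget available to such a wirelessly powered device can be bounded with the classic Friis free-space link equation. The sketch below is generic link-budget arithmetic, not the design of [52]; the transmit power, frequency, distance, and unity antenna gains are all assumed, illustrative values:

```python
import math

def friis_received_power_w(p_tx_w: float, g_tx: float, g_rx: float,
                           freq_hz: float, dist_m: float) -> float:
    """Free-space received power: P_r = P_t * G_t * G_r * (lambda / (4*pi*d))**2."""
    wavelength = 3e8 / freq_hz
    return p_tx_w * g_tx * g_rx * (wavelength / (4 * math.pi * dist_m)) ** 2

# Illustrative: 1 W transmitter at 915 MHz, unity-gain antennas, 2 m away
p_r = friis_received_power_w(1.0, 1.0, 1.0, 915e6, 2.0)
print(f"{p_r * 1e6:.0f} uW available before rectifier/conversion losses")
```

Even before conversion losses, the harvested budget is only on the order of hundreds of microwatts, which is why the uniformly low power draw of adiabatic logic matters for this use case.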
As with analog computing, adiabatic computing can be combined with other techniques for even greater improvements in efficiency. One example of such a hybrid design is seen in [53], which combines elements of in-memory, analog, and adiabatic computing into a single design. As is common with other IMC designs, the authors chose to accelerate MAC operations for use in linear algebra. One interesting feature of their design is that it can be operated either as a conventional CMOS circuit or as a resonant adiabatic circuit for even greater energy efficiency. The design is capable of performing 640 million MAC operations every second with a computational efficiency of 1.1 TMACS/mW when operated in adiabatic mode. This is around three orders of magnitude more efficient than NVIDIA's new H100 data center accelerator, demonstrating the incredible gains in efficiency that can be achieved with these techniques.
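The orders-of-magnitude comparison can be sanity-checked with rough arithmetic. The accelerator figures below are assumed ballpark values for a modern data-center GPU (roughly 2 × 10^15 low-precision ops/s at about 700 W), not numbers taken from [53]:

```python
# Adiabatic array processor efficiency: 1.1 TMACS/mW.
# Counting one MAC as two operations (multiply + add) gives 2.2 TOPS/mW.
adiabatic_ops_per_w = 2.2e12 * 1e3          # ops/s per watt

# Assumed data-center accelerator: ~2e15 ops/s at ~700 W
accelerator_ops_per_w = 2e15 / 700

ratio = adiabatic_ops_per_w / accelerator_ops_per_w
print(f"~{ratio:.0f}x more efficient")      # on the order of 10**3
```

Under these assumptions the ratio works out to several hundred times, i.e., roughly three orders of magnitude, consistent with the claim above.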
3.4. Alternate Technologies
All of the techniques mentioned in the previous sections use traditional circuit elements such as transistors, resistors, and capacitors. However, research has also been conducted using more exotic circuit components to perform computation, often with some combination of analog or IMC techniques. These approaches typically attempt to solve some limitation of transistor-based systems, such as static power consumption or the high power draw of transistor-based SRAM.
One alternative to transistor-based memory circuits is resistive RAM, also known as ReRAM or RRAM. In traditional memory circuits, logic states are represented by the charge level of a capacitor or the voltage on a feedback network, both of which come with their own complications and sources of power dissipation. RRAM, however, represents logic states with the resistance of circuit components such as memristors or MLC flash cells. These components are usually nonvolatile and do not require a power source to maintain their values, decreasing the overall power draw of the memory subsystem. The authors of [54] discuss the use of MLC-based RRAM in an analog IMC linear algebra accelerator and various techniques that can be used to compensate for manufacturing inconsistencies. One main drawback of MLC-based RRAM is limited operational life, since the MLC flash used naturally degrades every time a new value is written. Memristor-based RRAM solves this issue, though memristors have the major disadvantage of being a much less mature technology. The authors of [55] use a memristor-based RRAM system to implement an analog linear algebra accelerator. The authors of [56], on the other hand, use RRAM to implement an alternative, highly efficient digital logic family based on the logical implication operation. Several previous papers had implemented a similar logic family, though they suffered from reliability or speed issues. The authors of [56] solved these issues by adding an additional amplifier to the design.
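The appeal of implication-based logic is that material implication (IMPLY) together with a constant FALSE is functionally complete, which is what lets a memristive array realize arbitrary digital logic. A small truth-table sketch of this property (generic Boolean modeling, not the specific circuit of [56]):

```python
def imply(p: int, q: int) -> int:
    """Material implication: p -> q is false only when p = 1 and q = 0."""
    return 1 if (p == 0 or q == 1) else 0

def nand(p: int, q: int) -> int:
    """NAND built from IMPLY and the constant 0:
    q -> 0 is NOT q, and p -> (NOT q) is NOT (p AND q)."""
    return imply(p, imply(q, 0))

# Verify against the NAND truth table
for p in (0, 1):
    for q in (0, 1):
        assert nand(p, q) == (0 if (p and q) else 1)
print("IMPLY + FALSE reproduces NAND, a universal gate")
```

Since NAND alone is universal, any Boolean function can in principle be composed from implication steps, at the cost of the sequencing overhead that earlier implementations struggled with.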
Another alternative memory system that has been proposed to improve efficiency is spin transfer torque magnetic RAM (STT-MRAM). STT-MRAM utilizes an effect called tunneling magnetoresistance that occurs when two conductive magnetic bodies are separated by an extremely thin insulating layer. When the two magnetic bodies have aligned polarities, electrons are more likely to tunnel through the insulating layer than if the magnetic fields are opposed [57]. This can be used to store and retrieve data by measuring the minuscule leakage current that tunnels through the insulating layer. In [58], STT-MRAM serves as the data storage for a general purpose digital IMC device, reducing the power overhead of the memory itself. STT-MRAM can also be used to store continuous analog values: by varying the alignment of the magnetic fields, the amount of leakage current that passes through the insulating layer can be controlled. This technique is used in [59,60] to implement low power in-memory analog computing circuitry. The authors of [59] use extra circuitry to improve the ratio between the high-state leakage current and the low-state leakage current, allowing for higher efficiency while performing in-memory linear algebra operations. The authors of [60] take advantage of the low voltages required for STT-MRAM to implement an extremely low power consumption accelerator for neural networks.
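The read mechanism described above is commonly quantified by the tunneling-magnetoresistance (TMR) ratio, TMR = (R_AP − R_P)/R_P, between the antiparallel and parallel resistance states; a larger ratio gives a larger read margin, which is exactly the quantity the extra circuitry in [59] boosts. A minimal sketch with illustrative, assumed resistance values:

```python
def tmr_ratio(r_parallel: float, r_antiparallel: float) -> float:
    """TMR = (R_AP - R_P) / R_P, the relative resistance change between
    antiparallel and parallel magnetizations of the tunnel junction."""
    return (r_antiparallel - r_parallel) / r_parallel

# Illustrative magnetic tunnel junction: 2 kOhm parallel, 5 kOhm antiparallel
print(f"TMR = {tmr_ratio(2e3, 5e3) * 100:.0f}%")  # 150%
```

A higher TMR means the high-state and low-state read currents are easier to distinguish, so less sensing energy is spent per access.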
A completely different fabrication technique is demonstrated in [25], where calculations are performed by mechanical logic gates implemented using micro-electromechanical systems (MEMS). The design is compatible with adiabatic logic, and the authors claim that since the design is contact-free it does not suffer from the leakage seen in transistor-based implementations of adiabatic circuits. For simplicity, the authors implemented the design in a relatively large MEMS technology, which affected the overall efficiency, but they point out that it should scale very well to smaller, state-of-the-art MEMS.
4. Discussion and Future Directions
An overview of the strengths and weaknesses of the various optimizations discussed in this survey is given in Table 1, along with the limitations of conventional computer systems that they are intended to solve. A more detailed comparison of the operational efficiency of the various systems discussed in this paper can be seen in Figure 1, where the power consumption of various commercially available mobile computing systems and alternative systems under development is plotted against their respective performance in operations per second. Each approach is represented in Figure 1 with a unique symbol, and guide lines showing constant computational efficiency are given to aid interpretation. It should be noted that the performance metrics for each architecture depend on the hardware implementation and the task being performed. In general, as the performance of a system increases, so does its power consumption. From Figure 1, it can be seen that systems using analog computing techniques offer the highest efficiency [29], while IMC offers the most computational performance [61]. Meanwhile, adiabatic techniques are currently most used in extremely low power draw but low-performance systems [53,62], potentially due to the limitations discussed in Section 3.3. Alternative technologies such as STT-MRAM and memristors do not currently excel in any particular category, but still offer a good balance between power consumption and performance while showing much room for future improvement.
Since the majority of past research and commercial implementations of mobile processors have focused on conventional, transistor-based designs, alternative technologies such as those using memristors or STT-MRAM are relatively immature. While there have been some commercial products utilizing these technologies [63,64], there is still much room for research to improve their applicability for wide-scale deployment. Research into fabrication techniques could improve aspects such as yields, reliability, or electrical characteristics. There is also room for further research into the architectural aspects of these new technologies. Because these alternative technologies are so different from conventional transistor-based circuits, the design procedures and conventions used in conventional designs will need to be adapted to them. In some cases, it is even possible that the low-level logical operations that processors are built upon can be modified and improved [56].
Although processor designs using just one of the techniques presented above show improvements in efficiency over conventional processors, hybrid designs that employ multiple techniques have shown the highest efficiencies [35,41,46,53,65]. By using multiple techniques, researchers are able to take advantage of the strengths of each individual technique while mitigating its individual weaknesses. These hybrid designs seem to be the most promising field of future research for further increases in processing efficiency. Some combinations of techniques in particular, such as using adiabatic charge recovery in an analog circuit [66], have seen relatively little research up to this point but are showing very promising initial results. It is also possible that hybrid designs could incorporate new, innovative techniques or more conventional techniques that were not discussed in this survey, such as approximate computing [67]. Finally, there are several exotic alternative techniques, such as quantum computing [68–70] or adiabatic computing using superconductive Josephson junctions [71–73], that have shown exceptionally high computing performance and efficiency. Unfortunately, these techniques currently require cryogenic cooling, making them impractical for use in high-performance mobile computing systems. If these cooling requirements can be removed, these techniques could become very attractive for future high-performance mobile computing systems, but there is still much research needed to reach that goal.
Table 1. Comparison of different optimization techniques.

Technique | Conventional | Analog | IMC | Adiabatic | Alternative Technologies
Optimization Method | N/A | Computational model | Bottleneck mitigation | Energy recovery | Hardware implementation
Operational Frequency | 50 MHz [74]–1.3 GHz [1] | 10 kHz [29]–690 MHz [75] | 10 MHz [40]–2.2 GHz [76] | 20 kHz [53]–1 GHz [77] | 2.5 kHz [25]–3.3 GHz [78]
Computational Efficiency | 560 MOPS/W [14]–5.4 TOPS/W [16] | 21 GOPS/W [31]–410 POPS/W [29] | 46 GOPS/W [79]–1.6 POPS/W [40] | 150 TOPS/W [62]–110 POPS/W [53] | 25 TOPS/W [80]–710 TOPS/W [81]
Typical Use Cases | General purpose | Matrix operations and PDEs | Matrix operations | General purpose | General purpose
Primary Limitations | Power consumption, limited efficiency | Reduced precision, manufacturing inconsistencies | Increased memory size, requires SIMD-friendly problems | Increased design complexity, extraneous signals, additional design constraints | Immature technologies, reliability
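Because the efficiency figures in Table 1 are quoted in mixed units (MOPS/W through POPS/W), a direct comparison benefits from normalizing to one scale. The sketch below simply converts the peak efficiencies listed in the table to TOPS/W:

```python
PREFIX = {"M": 1e6, "G": 1e9, "T": 1e12, "P": 1e15}

def to_tops_per_w(value: float, prefix: str) -> float:
    """Convert an efficiency quoted as <value> <prefix>OPS/W into TOPS/W."""
    return value * PREFIX[prefix] / 1e12

# Peak computational efficiencies listed in Table 1, per technique
peaks = {
    "Conventional": to_tops_per_w(5.4, "T"),
    "Analog":       to_tops_per_w(410, "P"),
    "IMC":          to_tops_per_w(1.6, "P"),
    "Adiabatic":    to_tops_per_w(110, "P"),
    "Alternative":  to_tops_per_w(710, "T"),
}
for name, eff in sorted(peaks.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} {eff:>10.1f} TOPS/W")
```

Normalized this way, the ordering matches the discussion above: analog systems report the highest peak efficiency, followed by adiabatic designs, with conventional processors several orders of magnitude behind.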
Figure 1. Comparison of system efficiencies (Conventional [1,14–16,82], Adiabatic [53,62], Alternative [55,81,83,84], Analog [31,75,85–89], IMC [40,61,79,90,91]). The figure plots operations per second (10^6 to 10^15) against power consumed (1 nW to 1 kW), with a unique symbol for each approach and diagonal guide lines marking constant efficiencies from 1 TOPS/MW to 1 TOPS/µW.
5. Conclusions
HPC systems have become increasingly powerful in recent years, but the immense
power draw required by conventional HPC hardware makes them impractical for use
in high-performance mobile computing systems. While it is possible to offload the data
from the mobile computing system to a central server for processing, multiple applications
would benefit from having energy-efficient, onboard data processing capabilities. For these
applications, new hardware designs are needed. Multiple different hardware designs are
presented in this survey, each of which aims to solve at least one issue with conventional,
transistor-based von Neumann systems. Some of them, such as in-memory or analog
computing, target inefficiencies in the architecture itself to eliminate processing bottlenecks.
Adiabatic computing and the alternative technologies on the other hand seek to decrease
the waste power dissipated by the circuit components themselves, increasing the overall
efficiency of the system. While any of these approaches used individually showed improvements in efficiency, papers that combined multiple techniques in a single system showed
the biggest improvements. These hybrid architectures present a promising field for future
research.
Author Contributions: Conceptualization, O.O., T.E. and A.A.; methodology, O.O., T.E. and A.A.; investigation, O.O.; writing—original draft preparation, O.O.; writing—review and editing, O.O., T.E. and A.A.; project administration, T.E. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Data Availability Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Karumbunathan, L.S. NVIDIA Jetson AGX Orin Series; Technical Report 10749-001; Nvidia: Santa Clara, CA, USA, 2022.
2.
ORNL’s Frontier First to Break the Exaflop Ceiling. 2021. Available online: https://www.top500.org/news/ornls-frontier-first-
to-break-the-exaflop-ceiling/ (accessed on 25 June 2023).
3. Fortuna, L.; Buscarino, A. Sustainable Energy Systems. Energies 2022,15, 9227. [CrossRef]
4.
Yang, H.; Lee, H.; Zo, H. User acceptance of smart home services: An extension of the theory of planned behavior. Ind. Manag.
Data Syst. 2017,117, 68–89. [CrossRef]
5.
Vilar, B.M.J.C.; Luiz, S.O.D.; Perkusich, A.; Santos, D.R. Dynamic power management for network interfaces. In Proceedings of
the 2015 IEEE 5th International Conference on Consumer Electronics—Berlin (ICCE-Berlin), Berlin, Germany, 6–9 September 2015;
pp. 383–387. [CrossRef]
6. Shi, W.; Dustdar, S. The Promise of Edge Computing. Computer 2016,49, 78–81. [CrossRef]
7.
Qin, M.; Chen, L.; Zhao, N.; Chen, Y.; Yu, F.R.; Wei, G. Power-Constrained Edge Computing with Maximum Processing Capacity
for IoT Networks. IEEE Internet Things J. 2019,6, 4330–4343. [CrossRef]
8.
Hua, H.; Li, Y.; Wang, T.; Dong, N.; Li, W.; Cao, J. Edge Computing with Artificial Intelligence: A Machine Learning Perspective.
ACM Comput. Surv. 2023,55, 1–35. [CrossRef]
9.
Wang, X.; Han, Y.; Leung, V.C.M.; Niyato, D.; Yan, X.; Chen, X. Convergence of Edge Computing and Deep Learning: A
Comprehensive Survey. IEEE Commun. Surv. Tutorials 2020,22, 869–904. [CrossRef]
10.
Campolo, C.; Iera, A.; Molinaro, A. Network for Distributed Intelligence: A Survey and Future Perspectives. IEEE Access
2023
,
11, 52840–52861. [CrossRef]
11. Zhou, F.; Chai, Y. Near-sensor and in-sensor computing. Nat. Electron. 2020,3, 664–671. [CrossRef]
12. ARM. Cortex-M4 Devices Generic User Guide; Technical Report dui0553; ARM: Cambridge, UK, 2011.
13. ARM. Arm Cortex-M7 Devices Generic User Guide; Technical Report dui0646; ARM: Cambridge, UK, 2018.
14.
STMicroelectronics. STM32F427xx STM32F429xx Datasheet—Production Data; Technical Report 024030; STMicroelectronics:
Geneva, Switzerland, 2019.
15.
Karumbunathan, L.S. Solving Entry-Level Edge AI Challenges with NVIDIA Jetson Orin Nano. 2022. Available online:
https://developer.nvidia.com/blog/solving-entry-level-edge-ai-challenges-with-nvidia-jetson-orin- nano/ (accessed on 25
June 2023).
16. Xilinx. ACAP at the Edge with the Versal AI Edge Series; Technical Report WP518; Xilinx: San Jose, CA, USA, 2021.
17.
Fuchs, A.; Wentzlaff, D. The Accelerator Wall: Limits of Chip Specialization. In Proceedings of the 2019 IEEE International
Symposium on High Performance Computer Architecture (HPCA), Washington, DC, USA, 16–20 February 2019; pp. 1–14.
[CrossRef]
Energies 2023,16, 6043 12 of 15
18.
Le, S.T.; Drenski, T.; Hills, A.; King, M.; Kim, K.; Matsui, Y.; Sizer, T. 100 Gbps DMT ASIC for Hybrid LTE-5G Mobile Fronthaul
Networks. J. Light. Technol. 2021,39, 801–812. [CrossRef]
19.
Corrˇea, M.; Neto, L.; Palomino, D.; Corrˇea, G.; Agostini, L. ASIC Solution for the Directional Intra Prediction of the AV1 Encoder
Targeting UHD 4K Videos. In Proceedings of the 2020 IEEE International Symposium on Circuits and Systems (ISCAS), Seville,
Spain, 12–14 October 2020; pp. 1–5. [CrossRef]
20. Landauer, R. Irreversibility and Heat Generation in the Computing Process. IBM J. Res. Dev. 1961,5, 183–191. [CrossRef]
21.
Jeon, W.; Park, J.H.; Kim, Y.; Koo, G.; Ro, W.W. Hi-End: Hierarchical, Endurance-Aware STT-MRAM-Based Register File for
Energy-Efficient GPUs. IEEE Access 2020,8, 127768–127780. [CrossRef]
22.
Huang, Y.; Guo, N.; Seok, M.; Tsividis, Y.; Sethumadhavan, S. Analog Computing in a Modern Context: A Linear Algebra
Accelerator Case Study. IEEE Micro 2017,37, 30–38. [CrossRef]
23. Kogge, P.M. Function-based computing and parallelism: A review. Parallel Comput. 1985,2, 243–253. [CrossRef]
24. Weste, N.; Harris, D. CMOS VLSI Design: A Circuits and Systems Perspective, 4th ed.; Addison Wesley: Boston, MA, USA, 2011.
25.
Perrin, Y.; Galisultanov, A.; Hutin, L.; Basset, P.; Fanet, H.; Pillonnet, G. Contact-Free MEMS Devices for Reliable and Low-Power
Logic Operations. IEEE Trans. Electron Devices 2021,68, 2938–2943. [CrossRef]
26.
Horowitz, M. 1.1 Computing’s energy problem (and what we can do about it). In Proceedings of the 2014 IEEE International
Solid-State Circuits Conference Digest of Technical Papers (ISSCC), San Francisco, CA, USA, 9–13 February 2014; pp. 10–14.
[CrossRef]
27.
Simevski, A.; Schrape, O.; Benito, C. Comparative Analyses of Low-Power IC Design Techniques based on Chip Measurements.
In Proceedings of the 2018 16th Biennial Baltic Electronics Conference (BEC), Tallinn, Estonia, 8–10 October 2018; pp. 1–6.
[CrossRef]
28.
Bartík, M. External Power Gating Technique—An Inappropriate Solution for Low Power Devices. In Proceedings of the 2020
11th IEEE Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), Vancouver, BC,
Canada, 4–7 November 2020; pp. 241–245. [CrossRef]
29.
Seok, M.; Yang, M.; Jiang, Z.; Lazar, A.A.; Seo, J.S. Cases for Analog Mixed Signal Computing Integrated Circuits for Deep Neural
Networks. In Proceedings of the 2019 International Symposium on VLSI Design, Automation and Test (VLSI-DAT), Hsinchu,
Taiwan, 22–25 April 2019; pp. 1–2. [CrossRef]
30.
Rodriguez-Perez, A.; Ruiz-Amaya, J.; Delgado-Restituto, M.; Rodrigeuz-Vazquez, A. A Low-Power Programmable Neural
Spike Detection Channel with Embedded Calibration and Data Compression. IEEE Trans. Biomed. Circuits Syst.
2012
,6, 87–100.
[CrossRef] [PubMed]
31.
Chen, T.; Botimer, J.; Chou, T.; Zhang, Z. A 1.87-mm
2
56.9-GOPS Accelerator for Solving Partial Differential Equations. IEEE J.
Solid-State Circuits 2020,55, 1709–1718. [CrossRef]
32.
Malavipathirana, H.; Hariharan, S.I.; Udayanga, N.; Mandal, S.; Madanayake, A. A Fast and Fully Parallel Analog CMOS Solver
for Nonlinear PDEs. IEEE Trans. Circuits Syst. I Regul. Pap. 2021,68, 3363–3376. [CrossRef]
33. Hasler, J. Large-Scale Field-Programmable Analog Arrays. Proc. IEEE 2020,108, 1283–1302. [CrossRef]
34.
Kim, S.; Shah, S.; Hasler, J. Calibration of Floating-Gate SoC FPAA System. IEEE Trans. Very Large Scale Integr. Syst.
2017
,
25, 2649–2657. [CrossRef]
35.
Schlottmann, C.R.; Shapero, S.; Nease, S.; Hasler, P. A Digitally Enhanced Dynamically Reconfigurable Analog Platform for
Low-Power Signal Processing. IEEE J. Solid-State Circuits 2012,47, 2174–2184. [CrossRef]
36.
Johnson, A.; Davies, R. Speculative Execution Attack Methodologies (SEAM): An overview and component modelling of Spectre,
Meltdown and Foreshadow attack methods. In Proceedings of the 2019 7th International Symposium on Digital Forensics and
Security (ISDFS), Barcelos, Portugal, 10–12 June 2019; pp. 1–6. [CrossRef]
37.
Jeon, D.I.; Park, K.B.; Chung, K.S. HMC-MAC: Processing-in Memory Architecture for Multiply-Accumulate Operations with
Hybrid Memory Cube. IEEE Comput. Archit. Lett. 2018,17, 5–8. [CrossRef]
38.
Garland, J.; Gregg, D. Low Complexity Multiply Accumulate Unit for Weight-Sharing Convolutional Neural Networks. IEEE
Comput. Archit. Lett. 2017,16, 132–135. [CrossRef]
39.
Shanbhag, N.R.; Abdallah, R.A.; Kumar, R.; Jones, D.L. Stochastic computation. In Proceedings of the Design Automation
Conference, Anaheim, CA, USA, 13–18 June 2010; pp. 859–864.
40.
Zhang, X.; Wang, Y.; Zhang, Y.; Song, J.; Zhang, Z.; Cheng, K.; Wang, R.; Huang, R. Memory System Designed for Multiply-
Accumulate (MAC) Engine Based on Stochastic Computing. In Proceedings of the 2019 International Conference on IC Design
and Technology (ICICDT), Suzhou, China, 17–19 June 2019; pp. 1–4. [CrossRef]
41.
Chung, S.; Wang, J. Tightly Coupled Machine Learning Coprocessor Architecture with Analog In-Memory Computing for
Instruction-Level Acceleration. IEEE J. Emerg. Sel. Top. Circuits Syst. 2019,9, 544–561. [CrossRef]
42.
Sarma, T.; Parikh, C.D. Effect of Leakage Currents in Adiabatic Logic Circuits at Lower Technology Nodes. In Proceedings of the
2019 IEEE Conference on Modeling of Systems Circuits and Devices (MOS-AK India), Hyderabad, India, 25–27 February 2019;
pp. 79–81. [CrossRef]
43.
Srilakshmi, K.; Tilak, A.V.N.; Rao, K.S. Performance of FinFET based adiabatic logic circuits. In Proceedings of the 2016 IEEE
Region 10 Conference (TENCON), Singapore, 22–25 November 2016; pp. 2377–2382. [CrossRef]
44.
Samson, M.; Mandavalli, S. Adiabatic 5T SRAM. In Proceedings of the 2011 International Symposium on Electronic System
Design, Kochi, India, 19–21 December 2011; pp. 267–272. [CrossRef]
Energies 2023,16, 6043 13 of 15
45.
Samanta, S. Sequential adiabatic logic for ultra low power applications. In Proceedings of the 2017 Devices for Integrated Circuit
(DevIC), Kalyani, India, 23–24 March 2017; pp. 821–824. [CrossRef]
46.
Filippini, L.; Taskin, B. The Adiabatically Driven StrongARM Comparator. IEEE Trans. Circuits Syst. II Express Briefs
2019
,
66, 1957–1961. [CrossRef]
47.
Filippini, L.; Taskin, B. A 900 MHz Charge Recovery Comparator with 40 fJ per Conversion. In Proceedings of the 2018 IEEE
International Symposium on Circuits and Systems (ISCAS), Florence, Italy, 27–30 May 2018; pp. 1–5. [CrossRef]
48.
Van Elzakker, M.; van Tuijl, E.; Geraedts, P.; Schinkel, D.; Klumperink, E.A.M.; Nauta, B. A 10-bit Charge-Redistribution ADC
Consuming 1.9 µW at 1 MS/s. IEEE J. Solid-State Circuits 2010,45, 1007–1015. [CrossRef]
49.
Moyal, V.; Tripathi, N. Adiabatic Threshold Inverter Quantizer for a 3-bit Flash ADC. In Proceedings of the 2016 International
Conference on Wireless Communications, Signal Processing and Networking (WiSPNET), Chennai, India, 23–25 March 2016;
pp. 1543–1546. [CrossRef]
50.
Vieri, C.J. Pendulum: A Reversible Computer Architecture. Master’s Thesis, Massachusetts Institute of Technology, Cambridge,
MA, USA, 1995.
51.
Degada, A.; Thapliyal, H. Single-Rail Adiabatic Logic for Energy-Efficient and CPA-Resistant Cryptographic Circuit in Low-
Frequency Medical Devices. IEEE Open J. Nanotechnol. 2022,3, 1–14. [CrossRef]
52.
Dhananjay, K.; Salman, E. SEAL-RF: SEcure Adiabatic Logic for Wirelessly-Powered IoT Devices. IEEE Internet Things J.
2022
,10,
1112–1123. [CrossRef]
53.
Karakiewicz, R.; Genov, R.; Cauwenberghs, G. 1.1 TMACS/mW Fine-Grained Stochastic Resonant Charge-Recycling Array
Processor. IEEE Sens. J. 2012,12, 785–792. [CrossRef]
54.
Milo, V.; Glukhov, A.; Pérez, E.; Zambelli, C.; Lepri, N.; Mahadevaiah, M.K.; Quesada, E.P.B.; Olivo, P.; Wenger, C.; Ielmini, D.
Accurate Program/Verify Schemes of Resistive Switching Memory (RRAM) for In-Memory Neural Network Circuits. IEEE Trans.
Electron Devices 2021,68, 3832–3837. [CrossRef]
55.
Bavandpour, M.; Mahmoodi, M.R.; Strukov, D.B. aCortex: An Energy-Efficient Multipurpose Mixed-Signal Inference Accelerator.
IEEE J. Explor. Solid-State Comput. Devices Circuits 2020,6, 98–106. [CrossRef]
56.
Zanotti, T.; Puglisi, F.M.; Pavan, P. Smart Logic-in-Memory Architecture for Low-Power Non-Von Neumann Computing. IEEE J.
Electron Devices Soc. 2020,8, 757–764. [CrossRef]
57. Julliere, M. Tunneling between ferromagnetic films. Phys. Lett. A 1975,54, 225–226. [CrossRef]
58.
Jain, S.; Ranjan, A.; Roy, K.; Raghunathan, A. Computing in Memory with Spin-Transfer Torque Magnetic RAM. IEEE Trans. Very
Large Scale Integr. Syst. 2018,26, 470–483. [CrossRef]
59.
Cai, H.; Guo, Y.; Liu, B.; Zhou, M.; Chen, J.; Liu, X.; Yang, J. Proposal of Analog In-Memory Computing with Magnified Tunnel
Magnetoresistance Ratio and Universal STT-MRAM Cell. IEEE Trans. Circuits Syst. I Regul. Pap.
2022
,69, 1519–1531. [CrossRef]
60.
Fan, D.; Shim, Y.; Raghunathan, A.; Roy, K. STT-SNN: A Spin-Transfer-Torque Based Soft-Limiting Non-Linear Neuron for
Low-Power Artificial Neural Networks. IEEE Trans. Nanotechnol. 2015,14, 1013–1023. [CrossRef]
61.
Snelgrove, M.; Beachler, R. speedAI240: A 2-Petaflop, 30-Teraflops/W At-Memory Inference Acceleration Device with 1456
RISC-V Cores. IEEE Micro 2023,43, 58–63. [CrossRef]
62.
Sanni, K.; Andreou, A. A Mixed-Signal Successive Approximation Architecture for Energy-Efficient Fixed-Point Arithmetic in
16 nm FinFET. In Proceedings of the 2019 IEEE International Symposium on Circuits and Systems (ISCAS), Sapporo, Japan, 26–29
May 2019; pp. 1–5. [CrossRef]
63.
Everspin Technologies, Inc. Everspin Announces Commercial Availability of Its EMxxLX STT-MRAM Devices, 2022. Available
online: https://www.everspin.com/news/everspin-announces-commercial-availability-its-emxxlx-stt-mram-devices (accessed
on 25 June 2023).
64.
Choe, J. Recent Technology Insights on STT-MRAM: Structure, Materials, and Process Integration. In Proceedings of the 2023
IEEE International Memory Workshop (IMW), Monterey, CA, USA, 21–24 May 2023; pp. 1–4. [CrossRef]
65.
Zhang, H.; Liu, J.; Bai, J.; Li, S.; Luo, L.; Wei, S.; Wu, J.; Kang, W. HD-CIM: Hybrid-Device Computing-In-Memory Structure
Based on MRAM and SRAM to Reduce Weight Loading Energy of Neural Networks. IEEE Trans. Circuits Syst. I Regul. Pap.
2022
,
69, 4465–4474. [CrossRef]
66.
Filippini, L.; Khuon, L.; Taskin, B. Charge recovery implementation of an analog comparator: Initial results. In Proceedings of the
2017 IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS), Boston, MA, USA, 6–9 August 2017;
pp. 1505–1508. [CrossRef]
67.
Jiang, H.; Santiago, F.J.H.; Mo, H.; Liu, L.; Han, J. Approximate Arithmetic Circuits: A Survey, Characterization, and Recent
Applications. Proc. IEEE 2020,108, 2108–2135. [CrossRef]
68.
Kuppusamy, P.; Yaswanth Kumar, N.; Dontireddy, J.; Iwendi, C. Quantum Computing and Quantum Machine Learning
Classification—A Survey. In Proceedings of the 2022 IEEE 4th International Conference on Cybernetics, Cognition and Machine
Learning Applications (ICCCMLA), Goa, India, 8–9 October 2022; pp. 200–204. [CrossRef]
69.
Yang, Z.; Zolanvari, M.; Jain, R. A Survey of Important Issues in Quantum Computing and Communications. IEEE Commun.
Surv. Tutorials 2023,25, 1059–1094. [CrossRef]
70.
Upama, P.B.; Faruk, M.J.H.; Nazim, M.; Masum, M.; Shahriar, H.; Uddin, G.; Barzanjeh, S.; Ahamed, S.I.; Rahman, A. Evolution
of Quantum Computing: A Systematic Survey on the Use of Quantum Computing Tools. In Proceedings of the 2022 IEEE
Energies 2023,16, 6043 14 of 15
46th Annual Computers, Software, and Applications Conference (COMPSAC), Los Alamitos, CA, USA, 27 June–1 July 2022;
pp. 520–529. [CrossRef]
71.
Ayala, C.L.; Tanaka, T.; Saito, R.; Nozoe, M.; Takeuchi, N.; Yoshikawa, N. MANA: A Monolithic Adiabatic iNtegration
Architecture Microprocessor Using 1.4-zJ/op Unshunted Superconductor Josephson Junction Devices. IEEE J. Solid-State Circuits
2021,56, 1152–1165. [CrossRef]
72.
Yamauchi, T.; San, H.; Yoshikawa, N.; Chen, O. A Study on the Efficient Design of Adders Using Adiabatic Quantum-Flux-
Parametron Circuits. In Proceedings of the 2022 IEEE 11th Global Conference on Consumer Electronics (GCCE), Osaka, Japan,
18–21 October 2022; pp. 114–116. [CrossRef]
73.
Takahashi, D.; Takeuchi, N.; Yamanashi, Y.; Yoshikawa, N. Design and Demonstration of a Superconducting Field-Programmable
Gate Array Using Adiabatic Quantum-Flux-Parametron Logic and Memory. IEEE Trans. Appl. Supercond.
2022
,32, 1–7. [CrossRef]
74.
Chou, C.C.; Chen, T.Y.; Fang, W.C. FPGA implementation of EEG system-on-chip with automatic artifacts removal based on
BSS-CCA method. In Proceedings of the 2016 IEEE Biomedical Circuits and Systems Conference (BioCAS), Shanghai, China,
17–19 October 2016; pp. 224–227.
75.
Zhang, B.; Saikia, J.; Meng, J.; Wang, D.; Kwon, S.; Myung, S.; Kim, H.; Kim, S.J.; Seo, J.s.; Seok, M. A 177 TOPS/W, Capacitor-
based In-Memory Computing SRAM Macro with Stepwise-Charging/Discharging DACs and Sparsity-Optimized Bitcells for
4-Bit Deep Convolutional Neural Networks. In Proceedings of the 2022 IEEE Custom Integrated Circuits Conference (CICC),
Newport Beach, CA, USA, 24–27 April 2022; pp. 1–2. [CrossRef]
76.
Simon, W.A.; Qureshi, Y.M.; Rios, M.; Levisse, A.; Zapater, M.; Atienza, D. BLADE: An in-Cache Computing Architecture for
Edge Devices. IEEE Trans. Comput. 2020,69, 1349–1363. [CrossRef]
77.
Sathe, V.S.; Chueh, J.Y.; Papaefthymiou, M.C. Energy-Efficient GHz-Class Charge-Recovery Logic. IEEE J. Solid-State Circuits
2007,42, 38–47. [CrossRef]
78.
Wu, B.; Zhu, H.; Reis, D.; Wang, Z.; Wang, Y.; Chen, K.; Liu, W.; Lombardi, F.; Hu, X.S. An Energy-Efficient Computing-in-Memory
(CiM) Scheme Using Field-Free Spin-Orbit Torque (SOT) Magnetic RAMs. IEEE Trans. Emerg. Top. Comput.
2023
,11, 331–342.
[CrossRef]
79.
Villemur, M.; Tognetti, G.; Julian, P. Memory based computation core for nonlinear neural operations. In Proceedings of the 2019
Argentine Conference on Electronics (CAE), Mar del Plata, Argentina, 14–15 March 2019; pp. 98–102. [CrossRef]
80.
Lu, L.; Mani, A.; Do, A.T. A 129.83 TOPS/W Area Efficient Digital SOT/STT MRAM-Based Computing-In-Memory for Advanced
Edge AI Chips. In Proceedings of the 2023 IEEE International Symposium on Circuits and Systems (ISCAS), Monterey, CA, USA,
21–25 May 2023; pp. 1–5. [CrossRef]
81.
Xie, W.; Sang, H.; Kwon, B.; Im, D.; Kim, S.; Kim, S.; Yoo, H.J. A 709.3 TOPS/W Event-Driven Smart Vision SoC with High-
Linearity and Reconfigurable MRAM PIM. In Proceedings of the 2023 IEEE Symposium on VLSI Technology and Circuits (VLSI
Technology and Circuits), Kyoto, Japan, 11–16 June 2023; pp. 1–2. [CrossRef]
82.
Cavalcante, M.; Schuiki, F.; Zaruba, F.; Schaffner, M.; Benini, L. Ara: A 1-GHz+ Scalable and Energy-Efficient RISC-V Vector
Processor with Multiprecision Floating-Point Support in 22-nm FD-SOI. IEEE Trans. Very Large Scale Integr. Syst.
2020
,28, 530–543.
[CrossRef]
83.
Han, L.; Huang, P.; Wang, Y.; Zhou, Z.; Zhang, Y.; Liu, X.; Kang, J. Efficient Discrete Temporal Coding Spike-Driven In-Memory
Computing Macro for Deep Neural Network Based on Nonvolatile Memory. IEEE Trans. Circuits Syst. I Regul. Pap.
2022
,
69, 4487–4498. [CrossRef]
84.
Chiu, Y.C.; Khwa, W.S.; Li, C.Y.; Hsieh, F.L.; Chien, Y.A.; Lin, G.Y.; Chen, P.J.; Pan, T.H.; You, D.Q.; Chen, F.Y.; et al. A 22 nm 8 Mb
STT-MRAM Near-Memory-Computing Macro with 8b-Precision and 46.4-160.1TOPS/W for Edge-AI Devices. In Proceedings of
the 2023 IEEE International Solid- State Circuits Conference (ISSCC), San Francisco, CA, USA, 19–23 February 2023; pp. 496–498.
[CrossRef]
85.
Cowan, G.; Melville, R.; Tsividis, Y. A VLSI analog computer/digital computer accelerator. IEEE J. Solid-State Circuits
2006
,
41, 42–53. [CrossRef]
86.
Dong, Q.; Sinangil, M.E.; Erbagci, B.; Sun, D.; Khwa, W.S.; Liao, H.J.; Wang, Y.; Chang, J. 15.3 a 351 TOPS/W and 372.4 GOPS
compute-in-memory SRAM macro in 7 nm FINFET CMOS for machine-learning applications. In Proceedings of the 2020 IEEE
International Solid- State Circuits Conference—(ISSCC), San Francisco, CA, USA, 16–20 February 2020. [CrossRef]
87.
Kneip, A.; Lefebvre, M.; Verecken, J.; Bol, D. IMPACT: A 1-to-4b 813-TOPS/W 22-nm FD-SOI Compute-in-Memory CNN
Accelerator Featuring a 4.2-POPS/W 146-TOPS/mm
2
CIM-SRAM with Multi-Bit Analog Batch-Normalization. IEEE J. Solid-State
Circuits 2023,58, 1871–1884. [CrossRef]
88. Kneip, A.; Lefebvre, M.; Verecken, J.; Bol, D. A 1-to-4b 16.8-POPS/W 473-TOPS/mm² 6T-based In-Memory Computing SRAM in 22 nm FD-SOI with Multi-Bit Analog Batch-Normalization. In Proceedings of the ESSCIRC 2022—IEEE 48th European Solid State Circuits Conference (ESSCIRC), Milan, Italy, 19–22 September 2022; pp. 157–160. [CrossRef]
89. Kim, S.; Kim, S.; Um, S.; Kim, S.; Kim, K.; Yoo, H.J. Neuro-CIM: A 310.4 TOPS/W Neuromorphic Computing-in-Memory Processor with Low WL/BL Activity and Digital-Analog Mixed-Mode Neuron Firing. In Proceedings of the 2022 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits), Honolulu, HI, USA, 12–17 June 2022; pp. 38–39. [CrossRef]
90. Kim, H.; Yoo, T.; Kim, T.T.H.; Kim, B. Colonnade: A Reconfigurable SRAM-Based Digital Bit-Serial Compute-In-Memory Macro for Processing Neural Networks. IEEE J. Solid-State Circuits 2021, 56, 2221–2233. [CrossRef]
91. Zang, Q.; Goh, W.L.; Lu, L.; Yu, C.; Mu, J.; Kim, T.T.H.; Kim, B.; Li, D.; Do, A.T. 282-to-607 TOPS/W, 7T-SRAM Based CiM with Reconfigurable Column SAR ADC for Neural Network Processing. In Proceedings of the 2023 IEEE International Symposium on Circuits and Systems (ISCAS), Monterey, CA, USA, 21–25 May 2023; pp. 1–5. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.