
Why neural processing units (NPUs) are the Next Big Thing in AI


Abstract

Fun exploration of signals and how they connect reality and thought.
Jean Louis Van Belle, Drs, MAEc, BAEc, BPhil
17 August 2024
Contents
Introduction
Signal processing
Digital and/or image signal processing (ISP)
Mono-, duo-, tri- and tetra-chromatic machine vision
Logic and logical words
What is an NPU, and what are TOPS?
Post scriptum: artificial versus biological brains
Introduction
This paper is a sequel to my paper on the quantum hype in computing, in which I wrote that "adding a 'don't know' state does not add anything to the logical description of a system: plain n-state logic is what has driven logic since Aristotle and Plato invented logic about 2,375 years ago."[1] In that paper, I argue that the von Neumann architecture for a computer covers all logic (which – let us be clear on this – is n-state logic: no fuzziness here!), and that the quantum hype in computing is what it is: a hype.
However, I want to retract those words somewhat: the concept of a neural processing unit – and its physical implementation – is currently revolutionizing the Internet of Things and, in this paper, I want to show why and how.[2]
Signal processing
The workings of our eye, or that of a bird (illustrated below[3]), rely on the use of two, three or four photopigments that convert light into biochemical energy. That biochemical energy is – quite simply – the motion of an electron or proton charge from one place to another: energy (expressed in joule) is a force over a distance (1 J = 1 N·m).
[1] Plato's death is dated 348 BC or 406 AUC.
[2] I pride myself on being a non-mainstream researcher. At the same time, I am probably more mainstream than mainstreamers themselves.
[3] We readily acknowledge the source of the graph: Wikipedia.
As you can see, the psychological perception of a yellow photon – or any photon with a wavelength between 570 and 580 nm – is that it does not elicit any response from the S-cones in our eye, but that it does elicit a (non-linear) response from the M (green or G) and L (red or R) cones, respectively.
The rather nice thing about the graph above is that it shows how a fourth channel can or could extend the range of what we see. However, there is nothing inherently inferior about two, three or four channels of vision: to consistently define the electromagnetic (EM) signature of an object, we only need to know:
1. What wavelengths or energies it absorbs and reflects; and
2. With what intensity it absorbs or reflects light.
The latter is related to a property of the system and is, therefore, not relevant in this discussion.[4] Indeed, two, three or four detectors can do what the logic of vision dictates they should do, and that is to detect which photon energies are being reflected and absorbed by a particular object.
The question which arises here is this: would an eye with monochromatic vision (one type of
photon detector only) do the job?
Digital and/or image signal processing (ISP)
Pure monochromacy would imply that the biological or computer eye can only measure the intensity of the light: how many photons per mm² and per second? Monochromatic vision has no idea about the individual or average energy of each photon. No biological, physical, or mathematical system or model would be able to work effectively without an upper and lower limit: 0 for no photons, and 1 for full reflection or absorption. This yields a simple inverse Lorentz factor graph for an ideal photon detector:
All photons at the resonant frequency would be detected. Photons that are not at the resonant frequency would be detected and identified as having a frequency between 1 ± 1 times the base frequency (i.e., anywhere between 0 and 2 times that frequency), as illustrated below.
[4] The absorption of thermal energy, or how thermal energy radiates back out of some black body, has to do with the degrees of freedom that come with complicated molecular or crystal structures in real-life materials. While it is probably the single biggest area of interest for engineers, we must leave that topic aside as (self-appointed) researchers in fundamental physics.
What if the threshold of 1 is breached? We could laugh about this question and say that it will burn down the circuit, and then we cannot meaningfully talk about signal processing anymore. But that does not reflect how Nature works and how we think and operate as human beings: a sinusoidal or logistic growth (aka sigmoid) curve (shown below) makes more sense. Frankly, we do not know what function captures Nature itself: a sinusoid, Lorentz's simple inverse of the equation for a circle (shown above), the logistic or sigmoid function, or... Well... What would you suggest?
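Since we keep the choice of response function open here, a minimal numerical sketch may help. The three candidate shapes below – the circle-like inverse Lorentz factor, a half-sine, and the logistic (sigmoid) function – are illustrative guesses on my part, normalized to a [0, 1] response; nothing here is meant as the definitive model of a photon detector.

```python
import numpy as np

def circle_response(f_ratio):
    """Inverse-Lorentz-factor / circle response sqrt(1 - x^2), with
    x the deviation of f/f0 from 1. Zero outside the band 0 < f/f0 < 2."""
    x = f_ratio - 1.0
    return np.sqrt(np.clip(1.0 - x**2, 0.0, None))

def sinusoid_response(f_ratio):
    """Half-sine response over the same band, zero elsewhere."""
    inside = (f_ratio > 0) & (f_ratio < 2)
    return np.where(inside, np.sin(np.pi * f_ratio / 2.0), 0.0)

def sigmoid_response(intensity, k=10.0, midpoint=0.5):
    """Logistic (sigmoid) response: saturates smoothly at 1 instead of
    'burning down the circuit' above the threshold."""
    return 1.0 / (1.0 + np.exp(-k * (intensity - midpoint)))

f = np.linspace(0.0, 2.0, 9)
print(np.round(circle_response(f), 3))    # peaks at f = f0, zero at 0 and 2*f0
print(np.round(sinusoid_response(f), 3))  # same band, different shape
print(np.round(sigmoid_response(np.linspace(0, 1, 5)), 3))
```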
Let us get back to basics. Regardless of your worries about the appropriate model for modeling non-linear responses to inputs, you will want to do two things:
1. Compress the signal into something with less noise and, surely, fewer requirements in terms of data storage.[5]
2. Re-translate the images into meaningful stories about how objects move in these stories.
That is where neural processing units (NPUs) come into play. I will explain the why and how of this new kind of embedded logic in a future update of this paper. I notice, right now, that I have not talked about Fourier transforms yet: they are a wonderful way of compressing data. I will get to that. We first need to understand better how RGB – or three-valued logic – works.
[5] See Annex I to my previous paper on this topic: video nowadays usually streams in the standard high-definition (HD) format: 1920 by 1080 pixels, a refresh rate of 60 Hz (60 frames per second), and 3 × 8 = 24 bits for the color of each pixel. That amounts to a stream of 3 Gbps or about 375 MB per second. Even with a 250× or 1000× compression factor, the question remains: how do you store all that? :-/
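To make the compression remark concrete before we get to Fourier transforms properly, here is a minimal sketch of Fourier-based compression: keep only the strongest frequency components of a signal and drop the rest. The keep-5% threshold is an arbitrary illustrative choice; real video codecs such as AVC/HEVC are far more sophisticated.

```python
import numpy as np

def fourier_compress(signal, keep_fraction=0.05):
    """Compress a 1-D signal by keeping only the largest-magnitude
    Fourier coefficients; everything else is zeroed out as noise."""
    spectrum = np.fft.rfft(signal)
    n_keep = max(1, int(len(spectrum) * keep_fraction))
    top = np.argsort(np.abs(spectrum))[-n_keep:]  # strongest coefficients
    compressed = np.zeros_like(spectrum)
    compressed[top] = spectrum[top]
    return compressed

def fourier_decompress(compressed, n_samples):
    """Reconstruct the (lossy) signal from the kept coefficients."""
    return np.fft.irfft(compressed, n=n_samples)

# A noisy two-tone test signal: most of its energy sits in 2 frequencies.
t = np.linspace(0, 1, 1000, endpoint=False)
signal = np.sin(2*np.pi*5*t) + 0.5*np.sin(2*np.pi*40*t) + 0.1*np.random.randn(1000)
restored = fourier_decompress(fourier_compress(signal), len(signal))
print("rms error:", np.sqrt(np.mean((signal - restored)**2)))  # ~ the noise level
```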
Mono-, duo-, tri- and tetra-chromatic machine vision
The illustration below – which shows the spectral sensitivity of the human L (red), M (green), and S (blue) cone cells – shows that thinking of colors in terms of two, three, or four dimensions is not all that useful. Scientists say our dogs suffer from red-green blindness because their retina is duochrome only, but I am not so sure: what is the logical – and, therefore, psychological or neural – difference between a mix of two versus three or four signals?
White is full intensity: it translates to a value of 255 in black-and-white, RGB or four-color schemes; in other words, white is not a color. Black is not a color either: white and black refer to the intensity of light. That is all: nothing more, nothing less. So, there is no reason to assume setting up one, two, three or four detectors would inherently be a better way to distinguish colors (as shown below) or – what matters when talking ISP (image signal processing) or NPUs (neural processing units) – the EM signature of a moving object: yellow is light with a frequency of 510–530 THz, a wavelength of 565–590 nm, and – therefore – a photon energy that is equal to 2.10–2.19 eV (E = hf = hc/λ).
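Those numbers are easy to verify, using the standard values of h and c:

```python
# Back-of-envelope check of the yellow-light numbers quoted above.
h = 4.135667e-15   # Planck constant in eV*s
c = 2.99792458e8   # speed of light in m/s

for wavelength_nm in (565, 590):
    wavelength = wavelength_nm * 1e-9   # m
    f = c / wavelength                  # Hz
    E = h * f                           # eV, E = hf = hc/lambda
    print(f"{wavelength_nm} nm -> {f/1e12:.0f} THz, {E:.2f} eV")
# 565 nm -> 531 THz, 2.19 eV
# 590 nm -> 508 THz, 2.10 eV
```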
There is nothing more to it: describing yellow as hex #FFFF00 (100% red, 100% green, and 0% blue) or as a 0% cyan, 0% magenta, 100% yellow and 0% black thing amounts to the same!
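A minimal sketch of why the two descriptions amount to the same thing, using the naive textbook RGB-to-CMYK conversion (real print workflows use color profiles, but the arithmetic suffices here):

```python
def rgb_to_cmyk(r, g, b):
    """Naive RGB (0-255) to CMYK (0-1) conversion."""
    r, g, b = r / 255.0, g / 255.0, b / 255.0
    k = 1.0 - max(r, g, b)        # black = 1 - brightest channel
    if k == 1.0:                  # pure black: avoid division by zero
        return 0.0, 0.0, 0.0, 1.0
    c = (1.0 - r - k) / (1.0 - k)
    m = (1.0 - g - k) / (1.0 - k)
    y = (1.0 - b - k) / (1.0 - k)
    return c, m, y, k

# Yellow, hex #FFFF00: 100% red, 100% green, 0% blue ...
print(rgb_to_cmyk(255, 255, 0))   # -> (0.0, 0.0, 1.0, 0.0)
# ... i.e. 0% cyan, 0% magenta, 100% yellow, 0% black: the same color.
```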
Indeed, it is just like the discussion on whether RISC or CISC instruction sets work better, and why. All programmers know the true answer: it depends on what you want to do with the logic, doesn't it? That being said, all of the current logic and investment in sensors – and in rendering what they read – is based on RGB logic, so that is what the future will revolve around.
Logic and logical words
The title of this paper says this paper is about neural processing units, and I have said nothing about them till now. The concept of an NPU is a logical concept: it still runs on a CPU and a GPU. At the same time, I think the concept is going to revolutionize how mankind will build a system-on-a-chip (SoC) in the future. Let me give you some numbers illustrating the problem that NPUs can solve:
- Video nowadays usually streams in the standard high-definition (HD) format: 1920 by 1080 pixels, a refresh rate of 60 Hz (60 frames per second), and 3 × 8 = 24 bits for the color of each pixel.[6] That amounts to a stream of 3 Gbps or about 375 MB per second.
- The bandwidth of a regular Ethernet cable (Cat6) is only 1 Gbps. Also, regular video recorders (VRs) would, for example, only have 24 or 36 TB of storage capacity[7]: 375 MBps amounts to 1.35 TB per hour. Hence, if there were no video compression codec[8], one camera would exhaust all storage space in a single day (see the back-of-envelope sketch after this list).
- Codecs such as AVC (H.264) or HEVC (H.265) compress video by 150 or 250 times, or even more.[9]
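The arithmetic behind these bullets is simple enough to script. The 24 TB recorder and the 150×/250× compression factors below are the illustrative values quoted above:

```python
# Raw HD video bandwidth and storage arithmetic from the bullets above.
width, height = 1920, 1080
bits_per_pixel = 3 * 8          # 24-bit RGB color
fps = 60

raw_bps = width * height * bits_per_pixel * fps
print(f"raw stream: {raw_bps/1e9:.2f} Gbps = {raw_bps/8/1e6:.0f} MB/s")
# -> ~2.99 Gbps, ~373 MB/s (the "3 Gbps / 375 MB per second" above)

tb_per_hour = raw_bps / 8 * 3600 / 1e12
print(f"raw storage: {tb_per_hour:.2f} TB/hour")   # -> ~1.34 TB/hour

for compression in (150, 250):   # AVC/HEVC-style compression factors
    hours = 24 / (tb_per_hour / compression)       # hours until a 24 TB VR is full
    print(f"{compression}x compression: 24 TB lasts ~{hours/24:.0f} days")
```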
This is why IP cameras come with powerful processing units: codec software is complicated, and – see the numbers above – it must run on very capable hardware, just like your smartphone. Indeed, the system-on-a-chip (SoC) and/or single-board computer (SBC) inside of your phone, a compact camera[10], or an IP surveillance camera is very similar: it will combine a central processing unit (CPU) as well as a graphics processing unit (GPU) working together to do what they are supposed to do. Let us compare the SoC capabilities of my old Samsung J5 with more recent technology, with some notes and numbers.
SoC of Samsung J5 smartphone (2015):
- CPU: Snapdragon 410 – quad-core (4 cores) ARM Cortex-A53 (2012)[11]
- Clock speed: up to 1.2 GHz[12]
- 28 nm technology
- On-chip memory: 128 KB
- GPU: Adreno 615 (14 nm technology)
- Memory bandwidth: 4.2 GB/s
- Memory: 1.5 GB RAM + 8 or 16 GB internal

Qualcomm QCS605 SoC (2018):
- CPU: Hexagon 685 – 8 Kryo 300 64-bit cores: 2 × 2.5 GHz + 6 × 1.7 GHz clock speeds
- 10 nm technology (Samsung 10LPE)
- GPU: Adreno 615 (14 nm technology)
- Memory controller: LPDDR4X-3732, 6.951 GiB/s bandwidth
- AI engine: 1 TOPS

[6] Those of us who are old enough will remember the old VGA standard (320 × 200 pixels and only one byte (8 bits) per pixel, which makes for 2⁸ = 256 colors), which was introduced by IBM with the PS/2 in 1987 and used until HD (1920 × 1080 pixels = 2 MP) and true color (2²⁴ = 16,777,216 color variations) came out.
[7] See, for example, Bosch equipment: the DIVAR 5000 recording unit usually comes with 4 × 6 = 24 TB HDDs, while the integrated DIVAR IP all-in-one 4000 comes with 2 × 18 = 36 TB.
[8] In case you wonder: codec is, apparently, a portmanteau of encoder and decoder.
[9] Some sources mention a factor as high as 1000. Even then, you can see that video storage is a headache: a week is 168 hours only, so even with compression your HDDs fill up quickly. Hence, when implementing a video management system (VMS), you must define sensible data storage and retention policies.
[10] This is no surprise, of course: people buy smartphones to communicate, but a very decisive factor when selecting a smartphone model is the video and camera on it, and the quality of the images you can shoot with it: just remember how small compact cameras gave way to smartphones, and you will appreciate how embedded imaging technologies have converged.
[11] The Technical Reference Manual for this CPU can be found, surprisingly, on the Arduino website. It was launched in 2012, and it is still widely sold and used.
[12] Most CPUs work with a variable frequency: they throttle up and down, so to speak, based upon the load being placed on them. This is to reduce energy consumption and also to reduce CPU temperature. Clock speed is an indicator of a CPU's capacity to quickly execute instructions, but MIPS (millions of instructions per second) depend on many factors, including the instruction set (32/64-bit RISC) that is used, the number of cores and how they work together for parallel processing, the organization of cache, branching, etcetera. To get a good idea of the complexity of calculating MIPS, see this Chips and Cheese assessment.
At rst, the dierence does not appear to be all that important. In terms of generations, the
relevant dierence in years is that between the introduction of the Cortex 53 CPU (2012) and the
year of introduction of the QCS605 (2018). Conceptually, there is a new performance measure
here: TOPS (trillions of operations per second). Qualcomm’s SoC eectively uses the same
Adreno 615 GPU with a newer 8-core CPU, but all are designed to work together to boost the
performance of what is now referred to as a neural processing unit (NPU). Let us explore this
concept.
What is an NPU, and what are TOPS?
As you will have inferred from the above, processing power and performance depend very much on the task at hand. A common performance measure was – and still is – FLOPS: floating-point operations per second. However, floating-point operations can be single-precision (32-bit) or double-precision (64-bit). The latter has become the standard: already in 2008, AMD Radeon GPUs in the HD 4000 series reached 1 TFLOPS (tera = 10¹²) performance (single-precision only).
For the specialized tasks of image signal processing, especially when combining it with the artificial intelligence that is needed to encode/decode images, a new concept was needed, and this new concept is what is now being referred to as hardware acceleration, neural processing, deep learning and various other terms, such as an AI accelerator, neural processing unit (NPU) or a deep learning processor (DLP). The important thing is that processing power for these specialized computing tasks is now being measured in a uniform way, and that is TOPS.
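Peak TOPS itself is just a product of a few numbers. A minimal sketch, assuming the common (but not universal) vendor convention of counting a multiply-accumulate (MAC) as two operations; the 1024-MAC, 500 MHz configuration is a hypothetical example, not the actual QCS605 layout:

```python
def tops(mac_units, clock_hz, ops_per_mac=2):
    """Peak trillions of operations per second for an array of MAC units,
    counting each multiply-accumulate as two operations."""
    return mac_units * clock_hz * ops_per_mac / 1e12

# A hypothetical NPU: 1024 INT8 MAC units at 500 MHz.
print(f"{tops(1024, 500e6):.2f} TOPS")   # -> 1.02 TOPS, in the ballpark of the '1 TOPS' above
```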
The numbers go up quickly: Qualcomm launched its QCS7230 SoC two years ago (March 2022), and it is labeled as "a purpose-built chip for enterprise and commercial IoT applications", with a performance of up to 7 TOPS for AI purposes. Its architecture is referred to as being heterogeneous, but it is still based – basically – on a CPU and a GPU: one Kryo 585 CPU with one 2.8 GHz core, three 2.42 GHz cores, and two 1.8 GHz cores (hexa-core instead of octa-core), and one Adreno 650 GPU.
What is interesting about it is that – just like the even newer QCS8250, which is produced using 7 nm technology – it boasts a dedicated NPU 230 with "always-on Neural Network (NN) use cases". This is the new AI which makes, or will make, IP cameras much smarter in the future. We refer to the high-level Qualcomm documentation of how all of the logical components of this SoC work together, and copy the key diagram hereunder. It is now for you to further explore the wonderful world of NPUs or – more generally – AI use cases and image signal processing (ISP).
Post scriptum: articial versus biological brains
I re-read this paper and hesitated to take it off the web. It talks about how things work nowadays, but it does not talk a lot about the history of AI and artificial neural networks: it is absolutely fascinating, and informs a lot of what goes on now. A few brief points may be made here:
1. As usual, the original idea for emulating the logic and workings of the neural networks in human and animal brains came from military research during the Second World War: while Howard Aiken was building the first computer (together with von Neumann, who programmed it), a logician (Walter Pitts) and a neurophysiologist (Warren McCulloch) conceived the idea for what is now referred to as the McCulloch-Pitts neuron. It was implemented only in 1957, through emulation (software), as well as, later on (in the 1960s), as newly wired hardware. It apparently earned Frank Rosenblatt – the inventor of this new 'Perceptron' – the title of "father of deep learning" in the long history of AI.
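For the curious reader: a McCulloch-Pitts neuron is almost trivially small in code, namely a weighted sum of binary inputs followed by a hard threshold. The weights and threshold below are chosen to implement an AND gate, purely as an illustration:

```python
def mcculloch_pitts(inputs, weights, threshold):
    """Fire (output 1) iff the weighted sum of binary inputs reaches the threshold."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= threshold else 0

# Two-input AND gate: both inputs must be on for the neuron to fire.
for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", mcculloch_pitts((a, b), weights=(1, 1), threshold=2))
```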
By accident, today's Nature Briefing (my only paid subscription to news outlets nowadays) talks about two other – but very much related – things:
2. An experiment by a University of Reading team that shows how hydrogel – yes, just plain dead hydrogel (we are not talking about jellyfish or other living tissues here) – can learn how to play Pong. It, too, is absolutely fascinating. See this 22 August Guardian article with the key findings and a wonderful video showing how it was done.[13]
3. A talk by Wei-Chung Allen Lee from Harvard Medical School on how real brains work suggests artificial neural networks – although amazing – are still a far cry from the complexity of how various types of neurons (about 86 billion of them, plus or minus a few, I guess, with 100 trillion connections – also much more complicated than a typical computer bus, I'd think) actually work together in a human or animal brain. He calls his research "connectomics" rather than neuroscience. It made me think that, in light of the complexity of our brain as compared to a smart device, it is rather amazing that our smart devices are so smart!
We surely live in interesting times!
Post scriptum to the post scriptum:
What I write above about the complexity of biological brains does not reduce my admiration for the technology emulating our brain. Qualcomm's next-gen SoCs (systems-on-a-chip) now come not only with incredibly powerful CPUs and GPUs, but also with a neural processing unit (NPU), which combines the power of CPUs and GPUs with an AI learning engine and ready-to-use AI applications, and which brings edge AI onto the chip itself. It makes me think of the 1974 Bachman-Turner Overdrive hit: we ain't seen nothing yet!
The experiment with the "blob of jelly" playing Pong is truly amazing. As Ms. Flora Graham (Nature Briefing's editor) puts it, what we are seeing here is this: the hydrogel (polymer chains mixed with water, basically) developed a sort of 'muscle memory' while it was being 'trained' with simple electric currents in an equally simple feedback loop. That shows that "even very simple materials can exhibit complex, adaptive behaviors typically associated with living systems or sophisticated AI," according to the researchers.
Doesn't that trigger a 'wooow!' reaction in you too? :-) Frankly, I would not have believed this if someone had told me about it, but it comes from a reliable science journal. The mechanics and the math one needs to explain this must be mind-boggling. It must be easy to reproduce this experiment, and I hope to read more about it in the future.
Brussels, 23 August 2024
[13] In case you wonder: the link was also in my Nature Briefing feed, which is currently my only paid subscription to news outlets.