Tensor Algebra on an Optoelectronic Microchip

Tensor algebra lies at the core of computational science and machine learning. Due to its high usage, entire libraries exist dedicated to improving its performance. Conventional tensor algebra performance boosts focus on algorithmic optimizations, which in turn lead to incremental improvements. In this paper, we describe a method to accelerate tensor algebra a different way: by outsourcing operations to an optical microchip. We outline a numerical programming language developed to perform tensor algebra computations that is designed to leverage our optical hardware's full potential. We introduce the language's current grammar and go over the compiler design. We then show a new way to store sparse rank-n tensors in RAM that outperforms conventional array storage (used by C++, Java, etc.). This method is more memory-efficient than Compressed Sparse Fiber (CSF) format and is specifically tuned for our optical hardware. Finally, we show how the scalar-tensor product, rank-$n$ Kronecker product, tensor dot product, Khatri-Rao product, face-splitting product, and vector cross product can be compiled into operations native to our optical microchip through various tensor decompositions.
arXiv:2208.06749v1 [cs.PL] 13 Aug 2022
Sathvik Redrouthu
Procyon Photonics
Ashburn, Virginia
Rishi Athavale
Procyon Photonics
Ashburn, Virginia
Keywords—Data analytics, machine learning, optical comput-
ing, scientific computing, tensor algebra
A. Tensor Algebra
Tensor algebra has numerous applications in scientific disciplines.
For example, widely used multiphysics simulation software (e.g.,
COMSOL Multiphysics, Ansys Lumerical, etc.) must perform large-scale
numerical computations to solve problems in numerous fields such as
fluid dynamics, structural mechanics, heat transfer, and
electromagnetics [1–3]. Many
of these computations are streamlined through chained tensor
algebra expressions [4]. In addition, advances in machine
learning (ML) due to large neural networks (e.g., DALL-
E 2, GPT-3, PaLM, etc.) also make use of massive tensor
algebra computations [5]. Optimizing tensor algebra becomes
exceedingly important when ML models must meet time
constraints (e.g., high-frequency stock trading bots) [6].
Tensors themselves can be thought of as n-dimensional nu-
merical arrays for the purposes of this paper. Each dimension
of a tensor is referred to as a mode. A tensor’s rank is the
number of modes it has and therefore the number indices
needed to access a specific value [7]. Rank-0 tensors, having 0
modes, require no indices to access values and thus represent a
single number, or a scalar. Similarly, rank-1 tensors are simply
vectors and rank-2 tensors are matrices.
Tensors of rank n > 0 are very useful in representing indexed data.
For example, a search engine tracking page URLs, keywords, and
backlinks can store collected data in a rank-3 tensor. Typically,
however, not every element in this tensor is useful. A given website
rarely contains every keyword and backlink ever indexed by the search
engine. In the frequent scenario where a page URL does not map to a
specific keyword-backlink combination, a 0 can simply be placed at
tensor[URL][keyword][backlink]. This results in most of the tensor’s
entries becoming 0; such a tensor is referred to as a sparse tensor
[8]. We discuss efficient storage methods for sparse tensors in
Sec. V.
Fig. 1. Example of a 5 × 3 × 2 tensor. The tensor is referred to as “sparse” as most of the entries are 0. Most tensors encountered are sparse.
Of course, search giants such as Google collect much more
information than described in the example. Other companies
are in the same boat; in fact, according to [9], a specific rank-3
Facebook tensor has dimensions 1591 × 63891 × 63890.
Huge computations are performed constantly on tensors like
these; such is the case for most large-scale graph applications
[10]. Even after numerous algorithmic optimizations, however,
such computation is far too slow to keep up with increasing
demands [11]. For example, animation firms like Pixar can
take up to 39 hours of computing time to render a single frame
[12]. It is therefore apparent that some form of optimization
sustainable throughout the future is necessary.
B. The Photonic Advantage
Many highly optimized tensor algebra libraries currently
exist (e.g., Eigen, MATLAB Tensor Toolbox, and SPLATT)
[13–15]. However, as Moore’s Law and Dennard Scaling
reach their limits and the demand for tensor algebra increases,
running tensor algebra on classical hardware will no longer be
viable and these libraries must adapt [11].
An alternative to classical hardware involves optical com-
puting (the use of photons to perform computations), which
offers a significant speed increase and surmounts most of the
energy challenges posed in conventional computer engineering
[16]. Moreover, its lack of dependence on the conventional
transistor leads it to be independent from the decline of
Moore’s Law. Recognizing this, some of us at Procyon Photon-
ics have designed an optical microchip able to perform high-
speed matrix-vector multiplication (MVM). The chip (named
Tachyon 1) maintains a compact form and is inherently analog,
indicating its potential in computational fields [17].
Performing tensor algebra on such a microchip would offer
a significant speed increase while simultaneously sidestepping
the decline of Moore’s Law. In this paper, we describe a
method by which this is possible.
C. Apollo
To our knowledge, no programming language has been
invented that can leverage an optical microchip’s full potential
and link it to fields that can be influenced by its capabilities.
For these reasons, we introduce Apollo, a computing language
designed specifically for Tachyon 1. Apollo supports important
tensor algebra operations that are mapped onto the correspond-
ing units on the host computer and optical chip. The language
will be extended to support operations and algorithms that
are not related solely to tensor algebra but still important for
computationally expensive tasks, such as deep neural network
(DNN) training/inference.
We begin by going through preliminary notation and definitions in
Sec. II. Next, we cover the language’s grammar and supported
operations in Sec. III. In Sec. IV, we go over the workflow, compiler
front-end, and virtual machine (VM). It is here that we introduce the
most important VM instruction, around which Sec. VI revolves.
Next, we illustrate a new method to store large, sparse
tensors in Sec. V, which we found to surpass the conventional
array storage method from a memory viewpoint. In addition,
we show how our method is more efficient than CSF format for
our optical hardware. Finally, since Tachyon 1 is engineered
to perform matrix-vector multiplication in a single instruction,
we focus on decomposing complex tensor algebra expressions
into sequences of matrix-vector products in Sec. VI. Efficient
tensor decompositions would allow entire tensor algebra ex-
pressions to be run at an incredible speed.
A. Notation
Tensors of rank $n > 2$ are denoted in scripted letters (e.g.,
$\mathcal{X}$). Matrices are denoted in uppercase boldface and vectors
are denoted in lowercase boldface ($\mathbf{M}$ and $\mathbf{v}$
respectively). The identity matrix is denoted as $\mathbf{I}$.
B. Definitions
We use multiple tensor operations in Apollo, some of which
are modifications of existing definitions. In this section, we
define each operation the way it is used within the language.
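Definition II.1 (Scalar-tensor product). Given a scalar $\lambda$ and a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_n}$, the scalar-tensor product $\lambda\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_n}$ is given by:
$$(\lambda\mathcal{X})_{i_1 i_2 \ldots i_n} = \lambda\, x_{i_1 i_2 \ldots i_n}$$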
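Definition II.2 (Rank-$n$ Kronecker product). Given two tensors $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_n}$ and $\mathcal{Y} \in \mathbb{R}^{J_1 \times J_2 \times \cdots \times J_n}$, the rank-$n$ Kronecker product $\mathcal{X} \otimes \mathcal{Y} \in \mathbb{R}^{I_1 J_1 \times I_2 J_2 \times \cdots \times I_n J_n}$ is given by:
$$(\mathcal{X} \otimes \mathcal{Y})_{i_1 i_2 \ldots i_n} = (x_{j_1 j_2 \ldots j_n})\,\mathcal{Y}$$
Each index $i_1 i_2 \ldots i_n$ is a corresponding index in a block tensor.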
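Definition II.3 (Tensor inner product). Given two tensors $\mathcal{X}, \mathcal{Y} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_n}$, the inner product $\langle \mathcal{X}, \mathcal{Y} \rangle \in \mathbb{R}$ is given by:
$$\langle \mathcal{X}, \mathcal{Y} \rangle = \sum_{i_1=1}^{I_1} \sum_{i_2=1}^{I_2} \cdots \sum_{i_n=1}^{I_n} x_{i_1 i_2 \ldots i_n}\, y_{i_1 i_2 \ldots i_n}$$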
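Definition II.4 (Tensor dot product). Given two tensors $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_m}$ and $\mathcal{Y} \in \mathbb{R}^{J_1 \times J_2 \times \cdots \times J_{n-1} \times J_n}$, the tensor dot product $\mathcal{X} \cdot \mathcal{Y} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_{m-1} \times J_1 \times J_2 \times \cdots \times J_{n-2} \times J_n}$ is given by:
$$(\mathcal{X} \cdot \mathcal{Y})_{i_1 i_2 \ldots i_{m-1} j_1 j_2 \ldots j_{n-2} j_n} = \sum_{k=1}^{I_m} x_{i_1 i_2 \ldots i_{m-1} k}\, y_{j_1 j_2 \ldots j_{n-2} k j_n}$$
where $I_m = J_{n-1}$.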
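Definition II.5 (Khatri-Rao product). Given two matrices $\mathbf{A} \in \mathbb{R}^{I \times K}$ and $\mathbf{B} \in \mathbb{R}^{J \times K}$, the Khatri-Rao product $\mathbf{A} \odot \mathbf{B} \in \mathbb{R}^{I \cdot J \times K}$ is given by:
$$\mathbf{A} \odot \mathbf{B} = \begin{bmatrix} \mathbf{a}_1 \otimes \mathbf{b}_1 & \mathbf{a}_2 \otimes \mathbf{b}_2 & \cdots & \mathbf{a}_K \otimes \mathbf{b}_K \end{bmatrix}$$
This can be thought of as a column-wise Kronecker product.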
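Definition II.6 (Face-splitting product). Given two matrices $\mathbf{A} \in \mathbb{R}^{K \times I}$ and $\mathbf{B} \in \mathbb{R}^{K \times J}$, the face-splitting product $\mathbf{A} \bullet \mathbf{B} \in \mathbb{R}^{K \times I \cdot J}$ is given by:
$$\mathbf{A} \bullet \mathbf{B} = \begin{bmatrix} \mathbf{a}_1 \otimes \mathbf{b}_1 \\ \mathbf{a}_2 \otimes \mathbf{b}_2 \\ \vdots \\ \mathbf{a}_K \otimes \mathbf{b}_K \end{bmatrix}$$
where $\mathbf{a}_k$ and $\mathbf{b}_k$ here denote the $k$-th rows of $\mathbf{A}$ and $\mathbf{B}$. This can be thought of as a row-wise Kronecker product.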
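Definition II.7 (Vector cross product). Given two vectors $\mathbf{u} \in \mathbb{R}^3$ and $\mathbf{v} \in \mathbb{R}^3$, the vector cross product $\mathbf{u} \times \mathbf{v} \in \mathbb{R}^3$ is given by:
$$\mathbf{u} \times \mathbf{v} = \begin{bmatrix} u_2 v_3 - u_3 v_2 \\ u_3 v_1 - u_1 v_3 \\ u_1 v_2 - u_2 v_1 \end{bmatrix}$$
We discuss how to run each of these operations on our
optical hardware in Sec. VI.¹
¹ We refer to Def. II.3 in the general case to provide a complete definition,
but only discuss implementation in the vector case.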
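As a rough illustration of Defs. II.5 and II.6, the following pure-Python sketch builds both products from a vector Kronecker product. Matrices are plain nested lists, and the helper names (kron_vec, transpose, face_splitting, khatri_rao) are chosen here for illustration only; they are not part of Apollo.

```python
def kron_vec(a, b):
    """Kronecker product of two vectors given as flat lists."""
    return [x * y for x in a for y in b]


def transpose(m):
    """Transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def face_splitting(a, b):
    """Row-wise Kronecker product (Def. II.6): K x I and K x J -> K x (I*J)."""
    if len(a) != len(b):
        raise ValueError("operands must agree on the shared dimension K")
    return [kron_vec(row_a, row_b) for row_a, row_b in zip(a, b)]


def khatri_rao(a, b):
    """Column-wise Kronecker product (Def. II.5): I x K and J x K -> (I*J) x K."""
    # Columns of a and b are rows of their transposes, so the column-wise
    # product is the transpose of a row-wise product.
    return transpose(face_splitting(transpose(a), transpose(b)))


A = [[1, 2], [3, 4]]
B = [[0, 5], [6, 7]]
print(khatri_rao(A, B))      # 4 x 2 result
print(face_splitting(A, B))  # 2 x 4 result
```

Expressing the Khatri-Rao product through the face-splitting product of the transposes keeps the sketch short and makes the column-wise versus row-wise relationship explicit.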
⟨lower⟩ ::= a | b | c | … | z
⟨upper⟩ ::= A | B | C | … | Z
⟨digit⟩ ::= 0 | 1 | 2 | … | 9
⟨character⟩ ::= ⟨lower⟩ | ⟨upper⟩
⟨integer⟩ ::= [ + | - ] ⟨digit⟩ { ⟨digit⟩ }
⟨floating-point⟩ ::= [ ⟨integer⟩ ] . { ⟨integer⟩ }-
⟨tensor⟩ ::= { [ ⟨tensor⟩ ] { , ⟨tensor⟩ } }
⟨identifier⟩ ::= ⟨character⟩ { ⟨character⟩ | ⟨digit⟩ | _ }
⟨primary⟩ ::= ⟨integer⟩ | ⟨floating-point⟩ | ⟨identifier⟩ | ⟨tensor⟩ | [ - ] ⟨term⟩
⟨factor⟩ ::= ( ⟨expr⟩ ) | ⟨primary⟩
⟨term⟩ ::= ⟨factor⟩ { ( '*' | '/' | '@' | '&' | '%' | '#' ) ⟨factor⟩ }
⟨expr⟩ ::= ⟨term⟩ { ( '+' | '-' ) ⟨term⟩ }
⟨type⟩ ::= int | float | tensor
⟨statement⟩ ::= let ⟨type⟩ ⟨identifier⟩ = ⟨expr⟩ ;
⟨program⟩ ::= { ⟨statement⟩ }
Fig. 2. Apollo’s grammar shown in EBNF. The base case in the recursive tensor structure is a list of comma-separated integers and/or floating-point numbers.
A. Grammar
The language ideally would have a grammar that is intuitive,
vast, and requires minimal coding on the user’s side. Since this
is a prototype, however, the grammar is limited and technical.
This minimizes the number of compiler tricks needed, which
we found was a good avenue to take to focus on compiling
tensor algebra expressions. The current grammar is described
in Fig. 2 in Extended Backus-Naur Form (EBNF) notation.
Notable emphasis is placed on expressions, as they are the
focus for our application of optical computing. Currently, the only
statement form is a declaration, as shown in Fig. 2. However, all
operations we discuss can be performed with solely this grammar,
which we plan to expand in the future.
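To make the grammar concrete, the sketch below is a small recursive descent recognizer for the scalar subset of Fig. 2 (tensor literals are left out for brevity). It is an illustrative reading of the EBNF, not the Apollo front-end: the class name Parser, the tuple-based AST, and the node tags "let", "var", and "neg" are all assumptions made here.

```python
import re

# Numbers, identifiers, and the single-character operators of Fig. 2.
TOKEN = re.compile(r"\s*(\d+\.\d*|\.\d+|\d+|[A-Za-z][A-Za-z0-9_]*|[-+*/@&%#()=;])")


def tokenize(src):
    src = src.rstrip()
    tokens, pos = [], 0
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if not m:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens, self.i = tokens, 0

    def peek(self):
        return self.tokens[self.i] if self.i < len(self.tokens) else None

    def take(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.i += 1
        return tok

    def statement(self):              # let <type> <identifier> = <expr> ;
        self.take("let")
        typ = self.take()
        if typ not in ("int", "float", "tensor"):
            raise SyntaxError(f"unknown type {typ!r}")
        name = self.take()
        self.take("=")
        value = self.expr()
        self.take(";")
        return ("let", typ, name, value)

    def expr(self):                   # <term> { (+|-) <term> }
        node = self.term()
        while self.peek() in ("+", "-"):
            node = (self.take(), node, self.term())
        return node

    def term(self):                   # <factor> { (*|/|@|&|%|#) <factor> }
        node = self.factor()
        while self.peek() in ("*", "/", "@", "&", "%", "#"):
            node = (self.take(), node, self.factor())
        return node

    def factor(self):                 # ( <expr> ) | <primary>
        if self.peek() == "(":
            self.take("(")
            node = self.expr()
            self.take(")")
            return node
        return self.primary()

    def primary(self):                # number | identifier | - <term>
        tok = self.take()
        if tok == "-":
            return ("neg", self.term())
        if tok[0].isdigit() or tok[0] == ".":
            return float(tok) if "." in tok else int(tok)
        if tok[0].isalpha():
            return ("var", tok)
        raise SyntaxError(f"unexpected token {tok!r}")


def parse(src):
    return Parser(tokenize(src)).statement()


print(parse("let int x = (1 + 2) * 3;"))
```

Each grammar rule maps to one method, which is the main reason a hand-written recursive descent parser stays easy to extend as new operators are added.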
B. Supported Operators
The standard PEMDAS order is supported for scalars. For
tensors of rank n ≥ 1, the order of operations should be
defined with parentheses. We show the operators supported
in this Apollo prototype in Tables I and II.
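Table I
Name                          Operator   Usage
Addition                      +          $s_1 + s_2$
Subtraction                   -          $s_1 - s_2$
Multiplication, dot product   *          $s_1 s_2$, $s_1 \cdot \mathcal{T}_1$, $\mathcal{T}_1 \cdot s_1$, $\mathcal{T}_1 \cdot \mathcal{T}_2$
Division                      /          $x \div y$
Kronecker product             @          $\mathcal{T}_1 \otimes \mathcal{T}_2$
Khatri-Rao product            &          $\mathcal{T}_1 \odot \mathcal{T}_2$
Face-splitting product        %          $\mathcal{T}_1 \bullet \mathcal{T}_2$
Cross product                 #          $\mathbf{x} \times \mathbf{y}$

Table II
Name       Operator   Usage
Negation   -          $-x$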
A. Workflow
The workflow we decided on is shown in Fig. 3. Note that
the Apollo compiler is 2-stage.
Fig. 3. Native Apollo code gets compiled into Apollo Virtual Machine (AVM) instructions by the compiler front-end. AVM generates standard assembly instructions for regular operations and compiles tensor algebra to t1926 instructions. Respective assemblers target the host CPU and Tachyon 1. This chosen workflow enables tensor algebra to be outsourced to Tachyon 1. Note that the scope of this paper is limited to AVM instruction generation.
The standard assembler targets the host CPU, whereas the
t1926 assembler targets Tachyon 1. Such a setup is used
because Tachyon 1 is geared towards certain types of com-
putations only.
B. Compiler Front-end
We use a hand-coded compiler front-end (lexer, parser, and code
generator) because we have found that parser generators do not
cooperate well with tensor algebra and our storage choice. We use a
recursive descent parser, which works well for performance. The
in-compiler tensor storage we describe in Sec. V-B is more easily
implemented with such a parser.
It is noteworthy that there are many instances in the language where
operators are overloaded. For example, consider the multiplication
operator, *. If A*B is called, four cases are possible: 1) A is a
scalar and B is a tensor of rank n > 0, 2) A is a tensor of rank
n > 0 and B is a scalar, 3) A and B are both scalars, or 4) A and B
are both tensors of rank n > 0. The parser considers these cases and
generates abstract syntax tree (AST) nodes of the correct type (e.g.,
variable nodes, scalar nodes, tensor nodes, etc.).
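The four cases for the overloaded * operator can be sketched as a small dispatch on operand rank. This is only an illustration: operands are modeled as Python numbers or nested lists, and the function names and the returned labels are hypothetical, not Apollo's actual AST node types.

```python
def rank(value):
    """Rank of a nested-list tensor; plain numbers have rank 0."""
    r = 0
    while isinstance(value, list):
        r += 1
        if not value:          # empty list: stop descending
            break
        value = value[0]
    return r


def classify_multiplication(a, b):
    """Pick the operation that A*B should be lowered to."""
    rank_a, rank_b = rank(a), rank(b)
    if rank_a == 0 and rank_b == 0:
        return "scalar_product"            # case 3: both scalars
    if rank_a == 0 or rank_b == 0:
        return "scalar_tensor_product"     # cases 1 and 2: one scalar operand
    return "tensor_dot_product"            # case 4: both tensors of rank > 0


print(classify_multiplication(2, [[1, 2], [3, 4]]))
```

Cases 1 and 2 collapse to the same label because the scalar-tensor product is commutative; only the operand order handed to the code generator differs.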
The AST is traversed in pre-order by the code genera-
tor, sequentially producing the appropriate VM instructions.
Standard procedures are followed for variable handling. In
the case of more exotic AST nodes (e.g., tensor nodes) the
code generator calls special functions (discussed in Sec. VI)
to generate the correct code. The VM instruction set is outlined
in Sec. IV-C.
C. Virtual Machine
Apollo’s VM is stack-based. It provides 4 memory segments (namely,
the constant, global, pointer, and this segments), shown in Fig. 4.²
Fig. 4. The constant (abstract), global, pointer, and this virtual segments respectively.
Each of these segments is anchored to a specific location in RAM at
compile time.³ They are fixed in their locations, except for the this
segment, which we use for tensors. Index 0 in the pointer segment
contains the base address of the this segment, so if the value at
index 0 changes, the this segment gets anchored to a different RAM
location, similar to [18]. As the language expands, we may add
additional memory segments that can dynamically change location
during run-time; if we take this route, we will allocate more RAM and
add more values to the pointer segment.
The constant segment is used to push and pop constants to and from
the stack, as in [18]. Note that despite showing solely integers, the
segment supports integer and floating-point values. The global segment
is used in conjunction with the symbol table to store variable values,
which can be accessed throughout the lifetime of the program. Values
in the global segment can also be references to tensors.⁴ See Sec. V
for more information regarding tensor storage.
² The RAM referred to throughout this section is a simplified virtual
abstraction. Hence, we freely interact with it using numbers in the
decimal system. The actual RAM is referred to when discussing
compilation to target architectures, which will be done in a future
paper.
³ Exact RAM indices are not included.

Table III
Operation   Compiles to   Description
neg         Host ASM      Negates the value at the top of stack.
add         Host ASM      Pops stack into b. Pops stack into a. Pushes a + b to stack.
sub         Host ASM      Pops stack into b. Pops stack into a. Pushes a − b to stack.
mult        Host ASM      Pops stack into b. Pops stack into a. Pushes ab to stack.
div         Host ASM      Pops stack into b. Pops stack into a. Pushes a ÷ b to stack.
mvmul       Host ASM      Pops stack into b. Pops stack into A. Pushes Ab to stack.

Table IV
Name     Args       Description
malloc   int size   Finds an unused RAM segment of length size and pushes a pointer to the first index of that segment.
The memory access commands are push [segment] i and
pop [segment] i. The push instruction pushes the value at index i of
memory segment [segment] onto the stack. The pop instruction pops the
value on top of the stack onto index i of memory segment [segment]
[18]. The rest of the AVM instruction set (composed of arithmetic
instructions and built-in subroutines) is given in Tables III and IV.
Note that each arithmetic instruction can be done in a single
instruction by the corresponding processor.
Subroutines are handled with the instruction call
[fname] [nArgs]. The first [nArgs] values are treated
as arguments, so the virtual machine would pop the stack
[nArgs] times if the call command is generated.
Since malloc has 1 argument, a possible code fragment for it looks
like:
push constant 3
call malloc 1
This would 1) push 3 onto the stack, 2) pop 3 off the stack
and pass it into malloc, 3) find an unused RAM segment of
size 3, and 4) push a pointer to the first index of that segment
to the RAM. Its behavior mimics Memory.alloc in [18].
⁴ Apollo does not yet support user-defined subroutines, so a local segment
is not required.
Fig. 5. CSF representation of the tensor in Fig. 1, as proposed by [15].
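The behavior of the memory segments, arithmetic instructions, and the malloc subroutine can be mimicked with a toy interpreter. The sketch below is a simplified model written for illustration, not the AVM itself: the class name AVM, the RAM size, the heap base address, the bump allocator, the choice to place the returned pointer on the stack, and the choice to put matrices directly on the stack as nested lists are all assumptions.

```python
class AVM:
    def __init__(self, ram_size=64, heap_base=32):
        self.stack = []
        self.ram = [0] * ram_size
        self.globals = [0] * 8
        self.pointer = [heap_base]    # pointer[0] anchors the `this` segment
        self.next_free = heap_base    # bump allocator (assumption of this sketch)

    def push(self, segment, i):
        if segment == "constant":
            self.stack.append(i)
        elif segment == "global":
            self.stack.append(self.globals[i])
        elif segment == "pointer":
            self.stack.append(self.pointer[i])
        elif segment == "this":
            self.stack.append(self.ram[self.pointer[0] + i])
        else:
            raise ValueError(f"unknown segment {segment!r}")

    def pop(self, segment, i):
        value = self.stack.pop()
        if segment == "constant":
            return                    # value is simply discarded (assumption)
        if segment == "global":
            self.globals[i] = value
        elif segment == "pointer":
            self.pointer[i] = value
        elif segment == "this":
            self.ram[self.pointer[0] + i] = value
        else:
            raise ValueError(f"unknown segment {segment!r}")

    def arithmetic(self, op):
        if op == "neg":
            self.stack.append(-self.stack.pop())
            return
        b = self.stack.pop()          # second operand is popped first
        a = self.stack.pop()
        if op == "add":
            self.stack.append(a + b)
        elif op == "sub":
            self.stack.append(a - b)
        elif op == "mult":
            self.stack.append(a * b)
        elif op == "div":
            self.stack.append(a / b)
        elif op == "mvmul":           # a is a matrix, b is a vector
            self.stack.append([sum(x * y for x, y in zip(row, b)) for row in a])
        else:
            raise ValueError(f"unknown operation {op!r}")

    def call(self, fname, n_args):
        args = [self.stack.pop() for _ in range(n_args)][::-1]
        if fname != "malloc":
            raise ValueError(f"unknown subroutine {fname!r}")
        (size,) = args
        base = self.next_free
        if base + size > len(self.ram):
            raise MemoryError("out of virtual RAM")
        self.next_free += size
        self.stack.append(base)       # pointer to the first index of the block


vm = AVM()
vm.push("constant", 3)
vm.call("malloc", 1)
vm.pop("pointer", 0)                  # re-anchor `this` at the new block
vm.push("constant", 2.5)
vm.pop("this", 1)                     # write into the block at offset 1
```

Popping the allocated address into pointer 0 is what re-anchors the this segment, mirroring the mechanism described for tensors above.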
A. Current Methods
Tensor components are conventionally represented as nested arrays in
standard programming languages. In C++, the components are stored as
one contiguous array in row-major order. To access the element at
index (i, j) of an array with N columns, the element at offset
base + i·N + j in the flattened block is indexed (where base is the
base address of the array) [19]. In Java, each array of dimension
n + 1 contains pointers to each sub-array of dimension n. If n = 0,
the (n + 1)-dimensional array simply stores scalar values [20].
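As a small worked example of contiguous storage, the row-major offset of a multi-index can be computed by folding over the modes. For a two-dimensional array with N columns this reduces to i·N + j. The function name flat_index is chosen here for illustration.

```python
def flat_index(indices, shape):
    """Row-major offset of a multi-index within a flattened array."""
    if len(indices) != len(shape):
        raise ValueError("one index per mode is required")
    offset = 0
    for i, extent in zip(indices, shape):
        if not 0 <= i < extent:
            raise IndexError("index out of range")
        offset = offset * extent + i   # shift by this mode's extent, then add
    return offset


shape = (3, 4)
flat = list(range(12))                 # a 3 x 4 array stored contiguously
print(flat[flat_index((1, 2), shape)]) # element at row 1, column 2
```

The offset is measured in elements from the start of the block; a real implementation would add the base address and scale by the element size.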
Since tensors are often sparse, however, these conventional methods
often end up storing excess zeros, making them suboptimal. The
Facebook tensor discussed in Sec. I-A has only 737,934 nonzero values
and is therefore over 99.999% zeros. It is apparent that tensor
storage optimizations must be considered. Compressed Sparse Fiber
(CSF) format is a better method that stores a tensor in a tree
structure, where the indices and values for only non-zero components
are contained, as shown in Fig. 5. CSF performs significantly better
than conventional approaches for applications involving highly sparse
tensor algebra.
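The sparsity of the Facebook tensor can be checked directly from the dimensions and nonzero count quoted in the text. The 8-byte element size used for the dense storage estimate is an assumption made here for illustration.

```python
shape = (1591, 63891, 63890)
nonzeros = 737_934

total = 1
for extent in shape:
    total *= extent                    # number of entries in the dense tensor

density = nonzeros / total             # fraction of entries that are nonzero
zero_share = 1 - density
dense_bytes = total * 8                # assuming 8-byte floating-point entries

print(f"{total:,} entries, density {density:.3e}, zeros {zero_share:.7%}")
print(f"dense storage: about {dense_bytes / 1e12:.0f} TB")
```

The dense form holds roughly 6.5 trillion entries, so only about one entry in ten million is nonzero, and storing every entry would take on the order of tens of terabytes.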
However, CSF requires storing pointers to each child node,
likely integrated to enable fast indexing [15]. Such an opti-
mization would typically be incredibly important; however,
since our optical hardware can do an MVM in a single
instruction, it is not necessary that we are able to access indices
efficiently in intermediate computations. Rather, it is important
that we return an entire row of indices as fast as possible. Sec.
VI provides insight into why this is the case.
B. Binary Sparse Tensor Tree Format
To save memory and return sub-tensors quickly, we store
the tensor in Fig. 1 as shown in Fig. 6.
Fig. 6. Our representation of the tensor in Fig. 1, which we refer to as Binary
Sparse Tensor Tree (BSTT) format. Each non-leaf node contains an index. The
left child is always the root of a sub-tensor belonging to the current tensor.
The right child is always the root of the next sub-tensor belonging to the
parent tensor shared by the current tensor.
0 0 0 0.1 1 0 ··· 4 1.3
Fig. 7. Pre-order traversal of BSTT format results in the following array,
which is then stored on the heap (we plan to make tensors mutable in future
Apollo versions). Values are always assumed to be floating point numbers, a
safe assumption due to the large number of non-integer values encountered
in the targeted fields [7–9, 11, 15]. Indices are always integers. This allows us
to determine the leaf nodes and “reconstruct” the tree when needed.
We only use this format for intermediate computations. It is
slower to index into a specific value, but this is irrelevant as
such indexing is not necessary for Apollo-supported interme-
diate computations on Tachyon 1. Again, however, we must
be able to access a full row of rank-n indices easily. This is
efficient with our format as we can simply return a pointer to
the first index. Hence, our method is more useful than CSF
format for our purposes.
As stated in earlier sections, the most powerful tensor
algebra operation supported by Tachyon 1 that can be done
in a single instruction is matrix-vector multiplication (MVM).
Therefore, it is the compiler’s job to translate more complex
operations into sequences of MVMs when applicable, thereby
accelerating computation of the whole expression. For clarity,
note that Tachyon 1 multiplies matrices and vectors in the
order Ax = b. Also note that decomposition into sub-tensors
whose sizes are supported by Tachyon 1 is not covered in
this paper.
A. Scalar-tensor Product
The scalar-tensor product as defined in Def. II.1 is a commutative
operation that multiplies each element in a tensor $\mathcal{X}$ by a
scalar $\lambda$.⁵ The product is very easy to compile; simply iterate
through each vector $\mathbf{x}_{i_1 i_2 \ldots i_{n-1}}$ in the tensor
and generate the mvmul instruction to multiply it by the matrix
$$\lambda\mathbf{I} = \begin{bmatrix} \lambda & & 0 \\ & \ddots & \\ 0 & & \lambda \end{bmatrix}.$$
Note that the compiler reorients the product to generate the matrix
before the vector if the user calls it in the opposite order. In other
words, it ensures that running the generated code results in a product
in the order $\lambda\mathbf{I}\mathbf{x}_{i_1 i_2 \ldots i_{n-1}}$.
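The lowering described above can be sketched in a few lines. The code below walks a nested-list tensor down to its innermost vectors and multiplies each by the scaled identity matrix, using a software stand-in for the mvmul instruction. The function names are illustrative, and the tensor is assumed to be non-empty with rank at least 1.

```python
def mvmul(matrix, vector):
    """Software stand-in for the mvmul instruction: returns matrix @ vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def scaled_identity(lam, size):
    """The size x size matrix lam * I."""
    return [[lam if r == c else 0 for c in range(size)] for r in range(size)]


def scalar_tensor_product(lam, tensor):
    """Scale a nested-list tensor by lam using only mvmul on its vectors."""
    if not isinstance(tensor[0], list):        # reached an innermost vector
        return mvmul(scaled_identity(lam, len(tensor)), tensor)
    return [scalar_tensor_product(lam, sub) for sub in tensor]


print(scalar_tensor_product(2, [[1, 2], [3, 4]]))
```

One mvmul is issued per innermost vector, so a tensor with dimensions I1 × ... × In needs I1 · ... · I(n-1) instructions under this scheme.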
B. Rank-nKronecker Product
The Kronecker product is useful in signal and image pro-
cessing [21]. Through the Khatri-Rao product, it is useful
in neural networks (through minimization of convolution and
tensor sketch operations) and natural language processing [22].
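Refer to Def. II.2 for the definition of the Kronecker product. For
clarity, each element in the result is simply the element at
$x_{i_1 i_2 \ldots i_n}$ multiplied by $\mathcal{Y}$ for two tensors
$\mathcal{X}$ and $\mathcal{Y}$. The product can be represented
compactly between two matrices as
$$\mathbf{A} \otimes \mathbf{B} = \begin{bmatrix} a_{11}\mathbf{B} & a_{12}\mathbf{B} & \cdots & a_{1n}\mathbf{B} \\ a_{21}\mathbf{B} & a_{22}\mathbf{B} & \cdots & a_{2n}\mathbf{B} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1}\mathbf{B} & a_{m2}\mathbf{B} & \cdots & a_{mn}\mathbf{B} \end{bmatrix}$$
Therefore, the compiler can compute the scalar-tensor product
for each element in the resultant block tensor through the
method outlined in Section VI-A.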
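For the matrix case, the block structure translates into code directly: each entry of the first operand scales a full copy of the second. In the sketch below the helper scale stands in for the scalar-tensor product of Sec. VI-A; the function names are illustrative and the matrices are plain nested lists.

```python
def scale(lam, matrix):
    """Scalar-matrix product; stands in for the lowering of Sec. VI-A."""
    return [[lam * x for x in row] for row in matrix]


def kron(a, b):
    """Kronecker product of two matrices, assembled block by block."""
    result = []
    for a_row in a:
        blocks = [scale(a_ij, b) for a_ij in a_row]   # a_i1*B, a_i2*B, ...
        for r in range(len(b)):
            # Row r of the block row is the concatenation of row r of each block.
            result.append([x for block in blocks for x in block[r]])
    return result


print(kron([[1, 2], [3, 4]], [[0, 1], [1, 0]]))
```

An m × n matrix times a p × q matrix yields mn scaled blocks, which is the number of scalar-tensor products the compiler would need to emit under this scheme.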
C. Tensor Dot Product
Many fields, including machine learning and physics, de-
mand the ability to compute the dot product efficiently
[23, 24]. To allow Tachyon 1 to meet this demand, we must
also provide a way for the Apollo compiler to transform
this operation into a sequence of MVMs. The tensor dot
product is an operation between two rank-n tensors, A and B.
Some possibly familiar tensor dot products include the rank-0,
rank-1, and rank-2 dot products (scalar product, vector dot
product, and matrix multiplication respectively). The compiler
considers the dot product operation over the component arrays.
A few cases are possible:
1) A and B are both scalars
2) Either A or B is a scalar, but not both
3) A is a vector/matrix, whereas B is a vector
4) A is a vector/matrix, whereas B is a rank-n tensor with n > 2
5) A is a rank-n tensor where n > 2, whereas B is a vector
6) A is a rank-n tensor and B is a rank-m tensor, where n > 1, m > 1, and n ≠ m
7) A and B are both rank-n tensors
⁵ $\mathcal{X}$ is assumed to be a tensor of rank n > 0, since the parser would map
the scalar case to scalar multiplication.
Cases 1 and 2 are irrelevant since the parser maps Case 1 to
the scalar product and Case 2 to the scalar-tensor product (Def.
II.1; discussed in Sec. VI-A). In Case 3, A is always treated as
a matrix and the mvmul command is simply generated (this
accounts for Def. II.3 if A is a vector).
From this point on, we define a function $f_n$ that refers to
Case n (e.g., $f_3$ generates an MVM instruction). Continuing,
in Case 4, A is also treated as a matrix. B is decomposed
into a chain of vectors and a series of references to Case 3
($f_3(A, b_{i_1 i_2 \ldots i_{n-1}})$) are made. In Case 5, A is decomposed
into a chain of matrices and a series of references to Case 3
($f_3(A_{i_1 i_2 \ldots i_{n-2}}, B)$) are again made.
In Cases 6 and 7, we consider Def. II.4. In Case 6, the tensor
of lower rank is first decomposed. Case 4 is then referenced
for each matrix if A was decomposed (always into a matrix
chain, resulting in calls to $f_4(A_{i_1 i_2 \ldots i_{n-2}}, B)$) and Case 5 is
referenced if B was decomposed (always into a vector chain,
resulting in calls to $f_5(A, b_{i_1 i_2 \ldots i_{n-1}})$). Finally, in Case 7, A
is decomposed and Case 4 is referenced ($f_4(A_{i_1 i_2 \ldots i_{n-2}}, B)$).
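Two of the decompositions can be illustrated with a software stand-in for mvmul. The sketch below shows a rank-3 tensor dotted with a vector (Case 5 with n = 3, one MVM per matrix slice) and a matrix-matrix product split into one MVM per column of the right operand. The column-by-column split is an illustration of the general idea chosen here; the compiler's own case analysis may decompose the operands differently. All function names are hypothetical.

```python
def mvmul(matrix, vector):
    """Software stand-in for the mvmul instruction: returns matrix @ vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def transpose(m):
    return [list(col) for col in zip(*m)]


def rank3_dot_vector(a, b):
    """Rank-3 tensor times vector: one mvmul per matrix slice a[i]."""
    return [mvmul(slice_i, b) for slice_i in a]


def matmul_via_mvm(a, b):
    """Matrix times matrix: one mvmul per column of b."""
    result_columns = [mvmul(a, column) for column in transpose(b)]
    return transpose(result_columns)


A3 = [[[1, 0], [0, 1]], [[2, 0], [0, 2]]]   # two 2 x 2 slices
print(rank3_dot_vector(A3, [3, 4]))
print(matmul_via_mvm([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

In both functions the contraction runs over the last mode of the left operand, matching the summation index in Def. II.4.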
D. Vector Cross Product
The cross product is an operation that appears frequently in
computational geometry/computer graphics. A common task
is to generate a third vector orthogonal to two other vectors
(or a plane formed by 3 points) [25]. The cross product can
also be used to calculate the distance between two lines and
calculate if they are parallel. It also appears in a multitude of
physics simulations.
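For most applications, cross products are in $\mathbb{R}^3$ and between
two vectors.⁶ We consider the cross product in a positively
oriented orthonormal basis. The cross product of two vectors
in $\mathbb{R}^3$ as defined in Def. II.7 is also given by the antisymmetric
matrix-vector product
$$\mathbf{u} \times \mathbf{v} = \begin{bmatrix} 0 & -u_3 & u_2 \\ u_3 & 0 & -u_1 \\ -u_2 & u_1 & 0 \end{bmatrix} \begin{bmatrix} v_1 \\ v_2 \\ v_3 \end{bmatrix}$$
The mvmul command can simply be generated from here.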
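A short sketch of this lowering: build the antisymmetric (skew-symmetric) matrix from the first vector and apply a single matrix-vector product. The function names are illustrative, and mvmul is a software stand-in for the instruction.

```python
def mvmul(matrix, vector):
    """Software stand-in for the mvmul instruction: returns matrix @ vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def skew(u):
    """Antisymmetric matrix [u]x such that [u]x v equals u x v."""
    u1, u2, u3 = u
    return [[0, -u3, u2],
            [u3, 0, -u1],
            [-u2, u1, 0]]


def cross(u, v):
    """Cross product in R^3 as a single matrix-vector product."""
    return mvmul(skew(u), v)


print(cross([1, 2, 3], [4, 5, 6]))
```

Because the matrix depends only on the first operand, it can be built once and reused when the same vector is crossed with many others.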
E. Other Tensor Products
The Khatri-Rao product is useful in variances in statistics,
multi-way models, linear matrix equations, and signal
processing [26–30]. The face-splitting product is useful in
convolutional layers in neural networks and digital signal
processing in a digital antenna array [31, 32].
The code generation method shown in Sec. VI-B can be
easily extended to support the Khatri-Rao and face-splitting
products given in Defs. II.5 and II.6 respectively. An
mvmul command can be generated for each index $i$ on the
operands $\mathbf{a}_i$ and $\mathbf{b}_i$.
⁶ Higher rank cross products can be defined using the Levi-Civita symbol
$\epsilon_{ijk}$, which we omit due to relatively few applications.
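One possible way to phrase each column of a Khatri-Rao product as a single MVM uses the identity a ⊗ b = (a ⊗ I) b, where I is the identity matrix whose size matches b. The text does not state which matrix the compiler actually builds, so the construction below is an assumption made for illustration, as are the function names. Operands are given as lists of columns.

```python
def mvmul(matrix, vector):
    """Software stand-in for the mvmul instruction: returns matrix @ vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def kron_with_identity(a, size):
    """The matrix a (x) I_size for a vector a, of shape (len(a)*size) x size."""
    return [[x if r == c else 0 for c in range(size)]
            for x in a for r in range(size)]


def khatri_rao_via_mvm(a_cols, b_cols):
    """Khatri-Rao product, one mvmul per column pair; returns result columns."""
    if len(a_cols) != len(b_cols):
        raise ValueError("operands must have the same number of columns")
    return [mvmul(kron_with_identity(a, len(b)), b)
            for a, b in zip(a_cols, b_cols)]


a_cols = [[1, 3], [2, 4]]   # columns of A = [[1, 2], [3, 4]]
b_cols = [[0, 6], [5, 7]]   # columns of B = [[0, 5], [6, 7]]
print(khatri_rao_via_mvm(a_cols, b_cols))
```

The same construction applied to rows instead of columns gives the face-splitting product.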
F. Compilation of Expressions
Chaining multiple operations into expressions is supported.
The code generator traverses the AST with the tensor algebra
operator precedence discussed in Sec. III-A, and each code
generation command is called sequentially as outlined in Sec.
VI. However, as a prototype, the Apollo compiler assumes the
arguments are valid and performs no expression-related error
checking.
There are still additions that will need to be made to the
Apollo language in order to fully optimize optical compu-
tations. Most importantly, we will need to use our tensor
storage algorithm only for highly sparse tensors involved in
intermediate computations; we currently implement it for all
tensors. We plan to also extend Apollo to generate t1926 and
host instructions, integrate t1926 instructions with Tachyon
1, and develop the methodology by which Tachyon 1 would
interact with the host CPU. Neural network activation func-
tions, such as ReLU, sigmoid, and softmax, are planned to be
hard-wired into the microchip; we will extend the language
to support neural networks when this occurs. We also plan
to add more useful tensor algebra operations based on the
foundation discussed in this paper, such as the Matricized
Tensor Times Khatri-Rao Product (MTTKRP). These and
other extensions would help Apollo become a more robust
and efficient language.
Future research should explore tensor storage methods that
will be able to more efficiently represent sparse tensors while
still making them easy to index into. In order to optimize
for speed, it will also be crucial to investigate how to best
minimize required communication between Tachyon 1 and the
host CPU, as converting between optical and electrical signals
takes a significant amount of time. We plan to conduct this
research ourselves, but at the same time encourage others to
look into it as well.
In the future, we plan on extending the advances made in
developing the Apollo language to build APIs for high-level
languages (e.g., Python, Java, C++, etc.) so that they will be
able to utilize Tachyon 1. This will allow users of conventional
languages to be able to harness the speed of optical computing
for applications such as physics simulations and ML. We
specifically plan on building libraries able to integrate with
the TensorFlow and PyTorch APIs so that users will be able
to run ML models made with these APIs on Tachyon 1.
Our current design framework for Apollo leads the way for
more powerful calculations to be performed faster on a new
generation of hardware. With future advancements and opti-
mizations, Apollo has the potential to impact numerous fields
in engineering, computer science, and the natural sciences by
allowing for significantly faster tensor algebra computations.
In this paper, we show how to perform tensor algebra
computations on an optoelectronic microchip through Apollo,
a domain specific language designed for this purpose. We
then go over the language, compiler, and virtual machine
designs. Next, we show a new way to store tensors that
outperforms both conventional storage and CSF format from
a memory viewpoint while still being compatible with our
optical hardware. We then go over the compilation of tensor
algebra expressions into matrix-vector multiplications, which
are native to our microchip, and illustrate how complex tensor
algebra expressions can be run quickly and efficiently through
our methods. Finally, we discuss the impact of our research,
provide suggestions for future research avenues, and outline
how we plan to extend the Apollo language.
We thank Dhruv Anurag for Apollo-related discussion and
testing. We thank Jagadeepram Maddipatla for creating test
cases. We thank Dr. Jonathan Osborne for mathematical dis-
cussion and advice. We thank Mr. Emil Jurj for supporting
this project. We thank Shihao Cao for support and useful
discussion about the project’s future. Finally, we thank our
families for extended support and patience.
[1] “Comsol multiphysics® software - understand,
predict, and optimize.” [Online]. Available:
[2] “Engineering simulation software ansys products.” [Online].
[3] “Multiphysics modeling.” [Online]. Available:
[4] D. E. Keyes, L. C. McInnes, C. Woodward, W. Gropp, E. Myra,
M. Pernice, J. Bell, J. Brown, A. Clo, J. Connors, E. Constantinescu,
D. Estep, K. Evans, C. Farhat, A. Hakim, G. Hammond, G. Hansen,
J. Hill, T. Isaac, X. Jiao, K. Jordan, D. Kaushik, E. Kaxiras, A. Koniges,
K. Lee, A. Lott, Q. Lu, J. Magerlein, R. Maxwell, M. McCourt,
M. Mehl, R. Pawlowski, A. P. Randles, D. Reynolds, B. Rivière,
U. Rüde, T. Scheibe, J. Shadid, B. Sheehan, M. Shephard, A. Siegel,
B. Smith, X. Tang, C. Wilson, and B. Wohlmuth, “Multiphysics
simulations: Challenges and opportunities,” The International Journal
of High Performance Computing Applications, vol. 27, no. 1, pp. 4–83,
2013. [Online]. Available:
[5] D. Blalock and J. Guttag, “Multiplying matrices without multiplying,”
in Proceedings of the 38th International Conference on Machine
Learning, ser. Proceedings of Machine Learning Research, M. Meila
and T. Zhang, Eds., vol. 139. PMLR, 18–24 Jul 2021, pp. 992–1004.
[Online]. Available:
[6] A. Briola, J. D. Turiel, R. Marcaccioli, and T. Aste, “Deep reinforcement
learning for active high frequency trading,” CoRR, vol. abs/2101.07107,
2021. [Online]. Available:
[7] P. A. Tew, “An investigation of sparse tensor formats for
tensor libraries,” M.Eng. Thesis, Massachusetts Institute of
Technology, Cambridge, MA, Jun 2016. [Online]. Available: sparse.pdf
[8] H. Xu, K. Kostopoulou, A. Dutta, X. Li, A. Ntoulas, and P. Kalnis,
“Deepreduce: A sparse-tensor communication framework for federated
deep learning,” in Advances in Neural Information Processing Systems,
M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan,
Eds., vol. 34. Curran Associates, Inc., 2021, pp. 21 150–21 163.
[Online]. Available:
[9] F. Kjolstad, S. Kamil, S. Chou, D. Lugato, and S. Amarasinghe,
“The tensor algebra compiler,” Proc. ACM Program. Lang., vol. 1,
no. OOPSLA, pp. 77:1–77:29, Oct. 2017. [Online]. Available:
[10] D. M. Dunlavy, T. G. Kolda, and W. P. Kegelmeyer, 7. Multilinear Algebra
for Analyzing Data with Multiple Linkages, pp. 85–114. [Online].
[11] N. K. Srivastava, “Design and generation of efficient hardware
accelerators for sparse and dense tensor computations,” 2020. [Online].
[12] J. Lehrer, “1,084 days: How Toy Story 3 was made,” Jun 2010. [Online].
Available: 3- was-made
[13] P. Peltzer, J. Lotz, and U. Naumann, “Eigen-ad: Algorithmic
differentiation of the eigen library,” CoRR, vol. abs/1911.12604, 2019.
[Online]. Available:
[14] T. Kolda, B. W. Bader, E. N. Acar Ataman, D. Dunlavy, R. Bassett,
C. J. Battaglino, T. Plantenga, E. Chi, S. Hansen, and USDOE,
“Tensor toolbox for MATLAB v. 3.0,” 3 2017. [Online]. Available:
[15] S. Smith, N. Ravindran, N. D. Sidiropoulos, and G. Karypis, “Splatt:
Efficient and parallel sparse tensor-matrix multiplication,” in 2015 IEEE
International Parallel and Distributed Processing Symposium, 2015, pp.
[16] C. Cole, “Optical and electrical programmable computing
energy use comparison,” Opt. Express, vol. 29,
no. 9, pp. 13 153–13 170, Apr 2021. [Online]. Available: 9-13153
[17] S. Garg, J. Lou, A. Jain, and M. A. Nahmias, “Dynamic precision
analog computing for neural networks,” CoRR, vol. abs/2102.06365,
2021. [Online]. Available:
[18] N. Nisan and S. Schocken, The Elements of Computing Systems:
Building a modern computer from first principles. The MIT Press,
[19] Corob-Msft, “Arrays (c++).” [Online]. Available:
[20] “Arrays.” [Online]. Available:
[21] C. F. Van Loan, “The ubiquitous Kronecker product,” Journal of Computational
and Applied Mathematics, vol. 123, no. 1, pp. 85–100, 2000,
Numerical Analysis 2000. Vol. III: Linear Algebra. [Online]. Available:
[22] A. D. Jagtap, Y. Shin, K. Kawaguchi, and G. E. Karniadakis, “Deep
Kronecker neural networks: A general framework for neural networks
with adaptive activation functions,” CoRR, vol. abs/2105.09513, 2021.
[Online]. Available:
[23] S. Rabanser, O. Shchur, and S. Günnemann, “Introduction to tensor
decompositions and their applications in machine learning,” 2017.
[Online]. Available:
[24] G. Dahl, J. M. Leinaas, J. Myrheim, and E. Ovrum, “A tensor product
matrix approximation problem in quantum physics,” Linear Algebra and
its Applications, vol. 420, no. 2, pp. 711–725, 2007. [Online]. Available:
[25] R. Eisele, “3d cross product.” [Online]. Available: cross-product/
[26] C. A. Sims, J. H. Stock, and M. W. Watson, “Inference in linear time
series models with some unit roots,” Econometrica, vol. 58, no. 1, p.
113, Jan. 1990.
[27] R. L. Chambers, A. H. Dorfman, and S. Wang, “Limited information
likelihood analysis of survey data,” J. R. Stat. Soc. Series B Stat.
Methodol., vol. 60, no. 2, pp. 397–411, 1998.
[28] R. Bro, Multi-way Analysis in the Food Industry. Models. Algorithms
and Applications.
[29] H. Lev-Ari, Efficient solution of linear matrix equations with applications
to multistatic.
[30] R. S. Budampati and N. D. Sidiropoulos, “Khatri-Rao space-time codes
with maximum diversity gains over frequency-selective channels,” in
Sensor Array and Multichannel Signal Processing Workshop Proceedings,
2002. IEEE, 2003.
[31] V. Slyusar, “New matrix operations for dsp,” 11 1999.
[32] D. Ha, A. M. Dai, and Q. V. Le, “Hypernetworks,”
CoRR, vol. abs/1609.09106, 2016. [Online]. Available: