Available via license: CC BY 4.0


arXiv:2208.06749v1 [cs.PL] 13 Aug 2022

Tensor Algebra on an Optoelectronic Microchip

Sathvik Redrouthu

Procyon Photonics

Ashburn, Virginia

Email: 2024sredrout@tjhsst.edu

Rishi Athavale

Procyon Photonics

Ashburn, Virginia

Email: rishi.athavale1@gmail.com

Abstract—Tensor algebra lies at the core of computational sci-

ence and machine learning. Due to its high usage, entire libraries

exist dedicated to improving its performance. Conventional tensor

algebra performance boosts focus on algorithmic optimizations,

which in turn lead to incremental improvements. In this paper,

we describe a method to accelerate tensor algebra a different

way: by outsourcing operations to an optical microchip. We

outline a numerical programming language developed to perform

tensor algebra computations that is designed to leverage our

optical hardware’s full potential. We introduce the language’s

current grammar and go over the compiler design. We then

show a new way to store sparse rank-n tensors in RAM that

outperforms conventional array storage (used by C++, Java, etc.).

This method is more memory-efﬁcient than Compressed Sparse

Fiber (CSF) format and is speciﬁcally tuned for our optical

hardware. Finally, we show how the scalar-tensor product, rank-n

Kronecker product, tensor dot product, Khatri-Rao product,

face-splitting product, and vector cross product can be compiled

into operations native to our optical microchip through various

tensor decompositions.

Keywords—Data analytics, machine learning, optical comput-

ing, scientiﬁc computing, tensor algebra

I. INTRODUCTION

A. Tensor Algebra

Tensor algebra has numerous applications in scientiﬁc dis-

ciplines. For example, widely used multiphysics simulation

software (e.g. COMSOL Multiphysics, Ansys Lumerical, etc.)

must perform large-scale numerical computations to solve

problems in numerous ﬁelds such as ﬂuid dynamics, structural

mechanics, heat transfer and electromagnetics [1–3]. Many

of these computations are streamlined through chained tensor

algebra expressions [4]. In addition, advances in machine

learning (ML) due to large neural networks (e.g., DALL-

E 2, GPT-3, PaLM, etc.) also make use of massive tensor

algebra computations [5]. Optimizing tensor algebra becomes

exceedingly important when ML models must meet time

constraints (e.g., high-frequency stock trading bots) [6].

Tensors themselves can be thought of as n-dimensional nu-

merical arrays for the purposes of this paper. Each dimension

of a tensor is referred to as a mode. A tensor’s rank is the

number of modes it has and therefore the number of indices

needed to access a speciﬁc value [7]. Rank-0 tensors, having 0

modes, require no indices to access values and thus represent a

single number, or a scalar. Similarly, rank-1 tensors are simply

vectors and rank-2 tensors are matrices.

[Figure 1: a three-dimensional array drawn as two 3×5 slices stacked along the z axis, with axes labeled x, y, and z. The slices, in the order they appear in the figure, are:

3.1  0    0    0.9  0
0    0    2.3  0    1.3
0    0    0    0    0

0.1  0    0    0    0
8.0  9.9  4.4  0    0
0    0    0    0    0]

Fig. 1. Example of a 5×3×2 tensor. The tensor is referred to as "sparse" as most of the entries are 0. Most tensors encountered are sparse.

Tensors of rank n > 0 are very useful in representing indexed data. For example, a search engine tracking page URLs, keywords, and backlinks can store collected data in

a rank-3 tensor. Typically, however, not every element in this

tensor is useful. It is often not the case that any given website

contains each keyword and backlink ever indexed by the search

engine. In the frequent scenario where a page URL does not

map to a speciﬁc keyword-backlink combination, a 0 can sim-

ply be placed at tensor[URL][keyword][backlink].

This results in most of the tensor’s entries becoming 0; such a

tensor is referred to as a sparse tensor [8]. We discuss efﬁcient

storage methods for sparse tensors in Sec. V.
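As a minimal illustration of the search-engine example, the following Python sketch stores only the nonzero entries of a rank-3 tensor and treats every other coordinate as 0. The shape, the ids, and the helper name `get` are hypothetical and chosen only for illustration; this is not the storage format Apollo uses (see Sec. V).

```python
# Hypothetical toy data: the URL, keyword, and backlink ids are made up.
shape = (4, 5, 6)  # (number of URLs, keywords, backlinks)

# Only nonzero entries are recorded; every other coordinate is implicitly 0.
nonzeros = {
    (0, 1, 2): 3.0,
    (2, 4, 0): 1.0,
    (3, 0, 5): 2.0,
}

def get(url, keyword, backlink):
    """Return tensor[url][keyword][backlink], defaulting to 0."""
    return nonzeros.get((url, keyword, backlink), 0)

total = shape[0] * shape[1] * shape[2]   # number of entries in the dense tensor
sparsity = 1 - len(nonzeros) / total     # fraction of entries that are 0
```

Here 3 of the 120 entries are nonzero, so a dense array would spend 97.5% of its cells on zeros.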

Of course, search giants such as Google collect much more

information than described in the example. Other companies

are in the same boat; in fact, according to [9], a specific rank-3

Facebook tensor has dimensions 1591 × 63891 × 63890.

Huge computations are performed constantly on tensors like

these; such is the case for most large-scale graph applications

[10]. Even after numerous algorithmic optimizations, however,

such computation is far too slow to keep up with increasing

demands [11]. For example, animation ﬁrms like Pixar can

take up to 39 hours of computing time to render a single frame

[12]. It is therefore apparent that some form of optimization

sustainable throughout the future is necessary.

B. The Photonic Advantage

Many highly optimized tensor algebra libraries currently

exist (e.g., Eigen, MATLAB Tensor Toolbox, and SPLATT)

[13–15]. However, as Moore’s Law and Dennard Scaling

reach their limits and the demand for tensor algebra increases,

running tensor algebra on classical hardware will no longer be

viable and these libraries must adapt [11].

An alternative to classical hardware involves optical com-

puting (the use of photons to perform computations), which

offers a signiﬁcant speed increase and surmounts most of the

energy challenges posed in conventional computer engineering

[16]. Moreover, its lack of dependence on the conventional

transistor leads it to be independent from the decline of

Moore’s Law. Recognizing this, some of us at Procyon Photon-

ics have designed an optical microchip able to perform high-

speed matrix-vector multiplication (MVM). The chip (named

Tachyon 1) maintains a compact form and is inherently analog,

indicating its potential in computational ﬁelds [17].

Performing tensor algebra on such a microchip would offer

a signiﬁcant speed increase while simultaneously sidestepping

the decline of Moore’s Law. In this paper, we describe a

method that makes this possible.

C. Apollo

To our knowledge, no programming language has been

invented that can leverage an optical microchip’s full potential

and link it to ﬁelds that can be inﬂuenced by its capabilities.

For these reasons, we introduce Apollo, a computing language

designed speciﬁcally for Tachyon 1. Apollo supports important

tensor algebra operations that are mapped onto the correspond-

ing units on the host computer and optical chip. The language

will be extended to support operations and algorithms that

are not related solely to tensor algebra but still important for

computationally expensive tasks, such as deep neural network

(DNN) training/inference.

We begin by going through preliminary notation and definitions

in Sec. II. Next, we cover the language’s grammar and

supported operations in Sec. III. In Sec. IV, we go over the

workflow, compiler front-end, and virtual machine (VM), where

we introduce mvmul, the VM instruction around which

Sec. VI revolves.

Next, we illustrate a new method to store large, sparse

tensors in Sec. V, which we found to surpass the conventional

array storage method from a memory viewpoint. In addition,

we show how our method is more efﬁcient than CSF format for

our optical hardware. Finally, since Tachyon 1 is engineered

to perform matrix-vector multiplication in a single instruction,

we focus on decomposing complex tensor algebra expressions

into sequences of matrix-vector products in Sec. VI. Efficient

tensor decompositions allow entire tensor algebra expressions

to be executed as sequences of single-instruction MVMs.

II. PRELIMINARIES

A. Notation

Tensors of rank $n > 2$ are denoted in script letters (e.g., $\mathcal{X}$). Matrices are denoted in uppercase boldface and vectors in lowercase boldface ($\mathbf{M}$ and $\mathbf{v}$ respectively). The identity matrix is denoted as $\mathbf{I}$.

B. Deﬁnitions

We use multiple tensor operations in Apollo, some of which

are modiﬁcations of existing deﬁnitions. In this section, we

deﬁne each operation the way it is used within the language.

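Definition II.1 (Scalar-tensor product). Given a scalar $\lambda$ and a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_n}$, the scalar-tensor product $\lambda\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_n}$ is given by:

$$(\lambda\mathcal{X})_{i_1 i_2 \ldots i_n} = \lambda\, x_{i_1 i_2 \ldots i_n}$$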

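Definition II.2 (Rank-n Kronecker product). Given two tensors $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_n}$ and $\mathcal{Y} \in \mathbb{R}^{J_1 \times J_2 \times \cdots \times J_n}$, the rank-n Kronecker product $\mathcal{X} \otimes \mathcal{Y} \in \mathbb{R}^{I_1 J_1 \times I_2 J_2 \times \cdots \times I_n J_n}$ is given by:

$$(\mathcal{X} \otimes \mathcal{Y})_{i_1 i_2 \ldots i_n} = x_{i_1 i_2 \ldots i_n}\, \mathcal{Y}$$

Each index $i_1 i_2 \ldots i_n$ identifies the corresponding block in the resulting block tensor.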

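Definition II.3 (Tensor inner product). Given two tensors $\mathcal{X}, \mathcal{Y} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_n}$, the inner product $\langle \mathcal{X}, \mathcal{Y} \rangle \in \mathbb{R}$ is given by:

$$\langle \mathcal{X}, \mathcal{Y} \rangle = \sum_{i_1} \sum_{i_2} \cdots \sum_{i_n} x_{i_1 i_2 \ldots i_n}\, y_{i_1 i_2 \ldots i_n}$$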

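Definition II.4 (Tensor dot product). Given two tensors $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_m}$ and $\mathcal{Y} \in \mathbb{R}^{J_1 \times J_2 \times \cdots \times J_{n-1} \times J_n}$ with $I_m = J_{n-1}$, the tensor dot product $\mathcal{X} \cdot \mathcal{Y} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_{m-1} \times J_1 \times J_2 \times \cdots \times J_{n-2} \times J_n}$ is given by:

$$(\mathcal{X} \cdot \mathcal{Y})_{i_1 \ldots i_{m-1}\, j_1 \ldots j_{n-2}\, j_n} = \sum_{k=1}^{I_m} x_{i_1 \ldots i_{m-1} k}\; y_{j_1 \ldots j_{n-2}\, k\, j_n}$$

where the summation index $k$ runs over the shared mode, taking the place of $i_m$ in $\mathcal{X}$ and of $j_{n-1}$ in $\mathcal{Y}$.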

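Definition II.5 (Khatri-Rao product). Given two matrices $\mathbf{A} \in \mathbb{R}^{I \times K}$ and $\mathbf{B} \in \mathbb{R}^{J \times K}$, the Khatri-Rao product $\mathbf{A} \odot \mathbf{B} \in \mathbb{R}^{IJ \times K}$ is given by:

$$\mathbf{A} \odot \mathbf{B} = \begin{bmatrix} \mathbf{a}_1 \otimes \mathbf{b}_1 & \mathbf{a}_2 \otimes \mathbf{b}_2 & \cdots & \mathbf{a}_K \otimes \mathbf{b}_K \end{bmatrix}$$

where $\mathbf{a}_k$ and $\mathbf{b}_k$ are the $k$-th columns of $\mathbf{A}$ and $\mathbf{B}$. This can be thought of as a column-wise Kronecker product.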

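Definition II.6 (Face-splitting product). Given two matrices $\mathbf{A} \in \mathbb{R}^{K \times I}$ and $\mathbf{B} \in \mathbb{R}^{K \times J}$, the face-splitting product $\mathbf{A} \bullet \mathbf{B} \in \mathbb{R}^{K \times IJ}$ is given by:

$$\mathbf{A} \bullet \mathbf{B} = \begin{bmatrix} \mathbf{a}_1 \otimes \mathbf{b}_1 \\ \mathbf{a}_2 \otimes \mathbf{b}_2 \\ \vdots \\ \mathbf{a}_K \otimes \mathbf{b}_K \end{bmatrix}$$

where $\mathbf{a}_k$ and $\mathbf{b}_k$ are the $k$-th rows of $\mathbf{A}$ and $\mathbf{B}$. This can be thought of as a row-wise Kronecker product.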
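To make Definitions II.5 and II.6 concrete, the sketch below computes both products on plain nested Python lists. The function names are illustrative and are not part of Apollo; the Khatri-Rao product is obtained here by transposing, applying the row-wise product, and transposing back, which is one convenient way to express the column-wise definition.

```python
def kron_vec(a, b):
    """Kronecker product of two vectors: [a0*b0, a0*b1, ..., a1*b0, ...]."""
    return [x * y for x in a for y in b]

def transpose(m):
    """Swap the rows and columns of a nested-list matrix."""
    return [list(col) for col in zip(*m)]

def face_splitting(a, b):
    """Row-wise Kronecker product of a (K x I) and b (K x J)."""
    assert len(a) == len(b), "both matrices need the same number of rows"
    return [kron_vec(row_a, row_b) for row_a, row_b in zip(a, b)]

def khatri_rao(a, b):
    """Column-wise Kronecker product of a (I x K) and b (J x K)."""
    return transpose(face_splitting(transpose(a), transpose(b)))

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

rowwise = face_splitting(A, B)   # [[5, 6, 10, 12], [21, 24, 28, 32]]
colwise = khatri_rao(A, B)       # [[5, 12], [7, 16], [15, 24], [21, 32]]
```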

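Definition II.7 (Vector cross product). Given two vectors $\mathbf{u} \in \mathbb{R}^3$ and $\mathbf{v} \in \mathbb{R}^3$, the vector cross product $\mathbf{u} \times \mathbf{v} \in \mathbb{R}^3$ is given by:

$$\mathbf{u} \times \mathbf{v} = \begin{vmatrix} \mathbf{e}_1 & \mathbf{e}_2 & \mathbf{e}_3 \\ a_1 & a_2 & a_3 \\ b_1 & b_2 & b_3 \end{vmatrix}$$

where $a_i$ and $b_i$ are the components of $\mathbf{u}$ and $\mathbf{v}$ respectively, and $\mathbf{e}_i$ are the standard basis vectors.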

We discuss how to run each of these operations on our optical hardware in Sec. VI.¹

¹ We refer to Def. II.3 in the general case to provide a complete definition, but only discuss implementation in the vector case.

⟨lower⟩ ⇒ 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' | 'j' | 'k' | 'l' | 'm' | 'n' | 'o' | 'p' | 'q' | 'r' | 's' | 't' | 'u' | 'v' | 'w' | 'x' | 'y' | 'z'
⟨upper⟩ ⇒ 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'G' | 'H' | 'I' | 'J' | 'K' | 'L' | 'M' | 'N' | 'O' | 'P' | 'Q' | 'R' | 'S' | 'T' | 'U' | 'V' | 'W' | 'X' | 'Y' | 'Z'
⟨digit⟩ ⇒ '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
⟨character⟩ ⇒ ⟨lower⟩ | ⟨upper⟩
⟨integer⟩ ⇒ ['+' | '-'] ⟨digit⟩ {⟨digit⟩}
⟨floating-point⟩ ⇒ [⟨integer⟩] '.' {⟨integer⟩}-
⟨tensor⟩ ⇒ '{' [⟨tensor⟩] {',' ⟨tensor⟩} '}'
⟨identifier⟩ ⇒ ⟨character⟩ {⟨character⟩ | ⟨digit⟩ | '_'}
⟨primary⟩ ⇒ ⟨integer⟩ | ⟨floating-point⟩ | ⟨identifier⟩ | ⟨tensor⟩ | ['-'] ⟨term⟩
⟨factor⟩ ⇒ '(' ⟨expr⟩ ')' | ⟨primary⟩
⟨term⟩ ⇒ ⟨factor⟩ {('*' | '/' | '@' | '&' | '%' | '#') ⟨factor⟩}
⟨expr⟩ ⇒ ⟨term⟩ {('+' | '-') ⟨term⟩}
⟨type⟩ ⇒ 'int' | 'float' | 'tensor'
⟨statement⟩ ⇒ 'let' ⟨type⟩ ⟨identifier⟩ '=' ⟨expr⟩ ';'
⟨program⟩ ⇒ {⟨statement⟩}

Fig. 2. Apollo’s grammar shown in EBNF. The base case in the recursive tensor structure is a list of comma-separated integers and/or floating-point values.

III. LANGUAGE DETAILS

A. Grammar

The language would ideally have a grammar that is intuitive and

broad in scope and that requires minimal coding on the user’s side. Since this

is a prototype, however, the grammar is limited and technical.

This minimizes the number of compiler tricks needed, which

we found was a good avenue to take to focus on compiling

tensor algebra expressions. The current grammar is described

in Fig. 2 in Extended Backus-Naur Form (EBNF) notation.

Notable emphasis is placed on expressions, as they are the

focus for our application of optical computing. Currently, it is only possible to write declaration statements, as shown in Fig. 2. However, all operations we discuss can be performed with this grammar alone, which we plan to expand in the future.

B. Supported Operators

The standard PEMDAS order is supported for scalars. For tensors of rank $n \geq 1$, the order of operations should be defined with parentheses. We show the operators supported in this Apollo prototype in Tables I and II.

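TABLE I
BINARY OPERATORS

Name                          Operator   Usage
Addition                      +          $s_1 + s_2$
Subtraction                   -          $s_1 - s_2$
Multiplication, dot product   *          $s_1 s_2$, $s_1 \cdot \mathcal{T}_1$, $\mathcal{T}_1 \cdot s_1$, $\mathcal{T}_1 \cdot \mathcal{T}_2$
Division                      /          $x \div y$
Kronecker product             @          $\mathcal{T}_1 \otimes \mathcal{T}_2$
Khatri-Rao product            &          $\mathcal{T}_1 \odot \mathcal{T}_2$
Face-splitting product        %          $\mathcal{T}_1 \bullet \mathcal{T}_2$
Cross product                 #          $x \times y$

TABLE II
UNARY OPERATORS

Name       Operator   Usage
Negation   -          $-x$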

IV. COMPILER DESIGN

A. Workﬂow

The workﬂow we decided on is shown in Fig. 3. Note that

the Apollo compiler is two-stage.

[Figure 3: flow diagram. Native code → AVM instructions; AVM instructions → Host ASM → Host CPU, and AVM instructions → t1926 → Tachyon 1.]

Fig. 3. Native Apollo code gets compiled into Apollo Virtual Machine (AVM)

instructions by the compiler front-end. AVM generates standard assembly

instructions for regular operations and compiles tensor algebra to t1926

instructions. Respective assemblers target the host CPU and Tachyon 1. This

chosen workﬂow enables tensor algebra to be outsourced to Tachyon 1. Note

that the scope of this paper is limited to AVM instruction generation.

The standard assembler targets the host CPU, whereas the

t1926 assembler targets Tachyon 1. Such a setup is used

because Tachyon 1 is geared towards certain types of com-

putations only.

B. Compiler Front-end

We use a hand-coded compiler front-end (lexer, parser, and

code generator). This is because we have found that parser

generators do not cooperate well with tensor algebra and

our storage choice. We use a recursive descent parser, which

works well for performance. The in-compiler tensor storage we

describe in Sec. V-B is more easily implemented with such a

parser.
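The expression rules in Fig. 2 map directly onto a recursive descent parser, with one method per nonterminal. The Python sketch below illustrates the technique for the ⟨expr⟩, ⟨term⟩, and ⟨factor⟩ rules only. It is not the Apollo front-end: the class name, the tuple-based AST, and the token regex are assumptions made for illustration, tensor literals are omitted, and unary minus is simplified to apply to a primary.

```python
import re

# One token per match: floats, integers, identifiers, operators, parentheses.
TOKEN = re.compile(r"\s*(\d+\.\d*|\.\d+|\d+|[A-Za-z][A-Za-z0-9_]*|[-+*/@&%#()])")
NOT_PRIMARY = set("+*/@&%#()")

def tokenize(src):
    tokens, pos = [], 0
    src = src.strip()
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if not m:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens, self.i = tokens, 0

    def peek(self):
        return self.tokens[self.i] if self.i < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        self.i += 1
        return tok

    def expr(self):      # <expr> => <term> {('+'|'-') <term>}
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.take()
            node = (op, node, self.term())
        return node

    def term(self):      # <term> => <factor> {('*'|'/'|'@'|'&'|'%'|'#') <factor>}
        node = self.factor()
        while self.peek() in ("*", "/", "@", "&", "%", "#"):
            op = self.take()
            node = (op, node, self.factor())
        return node

    def factor(self):    # <factor> => '(' <expr> ')' | <primary>
        if self.peek() == "(":
            self.take()
            node = self.expr()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return node
        return self.primary()

    def primary(self):   # numbers, identifiers, and unary minus only
        tok = self.take()
        if tok == "-":
            return ("neg", self.primary())
        if tok is None or tok in NOT_PRIMARY:
            raise SyntaxError(f"unexpected token {tok!r}")
        return tok

ast = Parser(tokenize("a * (b + 2) @ c")).expr()
# ast == ("@", ("*", "a", ("+", "b", "2")), "c")
```

Because all six term-level operators share one loop, they associate left to right at equal precedence, which is why the grammar relies on parentheses to order tensor operations.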

It is noteworthy that there are many instances in the lan-

guage where operators are overloaded. For example, consider

the multiplication operator, *. If A * B is called, four cases

are possible: 1) A is a scalar and B is a tensor of rank n > 0,

2) A is a tensor of rank n > 0 and B is a scalar, 3) A and B are

both scalars, or 4) A and B are both tensors of rank n > 0. The

parser considers these cases and generates abstract syntax tree

(AST) nodes of the correct type (e.g., variable nodes, scalar

nodes, tensor nodes, etc.).

The AST is traversed in pre-order by the code genera-

tor, sequentially producing the appropriate VM instructions.

Standard procedures are followed for variable handling. In

the case of more exotic AST nodes (e.g., tensor nodes) the

code generator calls special functions (discussed in Sec. VI)

to generate the correct code. The VM instruction set is outlined

in Sec. IV-C.

C. Virtual Machine

Apollo’s VM is stack-based. It provides 4 memory segments (namely, the constant, global, pointer, and this segments), shown in Fig. 4.²

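[Figure 4: four columns of memory cells. The constant, global, and this segments are indexed 0, 1, 2, ...; the pointer segment has a single cell at index 0.]

Fig. 4. The constant (abstract), global, pointer, and this virtual segments respectively.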

Each one of these segments is anchored to a specific

location in RAM at compile time.³ They are fixed in their

locations, except for the this segment, which we use for

tensors. Index 0 in the pointer segment contains the base

address of the this segment, so if the value at index 0

changes, the this segment gets anchored to a different RAM

location, similar to [18]. As the language expands, we may

add additional memory segments that can dynamically change

location during run-time; if we take this route, we will allocate

more RAM and add more values to the pointer segment.

² The RAM referred to throughout this section is a simplified virtual abstraction. Hence, we freely interact with it using numbers in the decimal system. The actual RAM is referred to when discussing compilation to target architectures, which will be done in a future paper.

³ Exact RAM indices are not included.

TABLE III
AVM ARITHMETIC INSTRUCTION SET

Operation   Compiles to   Description
neg         Host ASM      Negates the value at the top of the stack.
add         Host ASM      Pops stack into b. Pops stack into a. Pushes a + b to stack.
sub         Host ASM      Pops stack into b. Pops stack into a. Pushes a − b to stack.
mult        Host ASM      Pops stack into b. Pops stack into a. Pushes ab to stack.
div         Host ASM      Pops stack into b. Pops stack into a. Pushes a ÷ b to stack.
mvmul       t1926         Pops stack into b. Pops stack into A. Pushes Ab to stack.

TABLE IV
AVM SUBROUTINE INSTRUCTION SET

Name     Args       Description
malloc   int size   Finds an unused RAM segment of length size and pushes a pointer to the first index of that segment onto the stack.

The constant segment is used to push and pop constants to and from the stack, as in [18]. Note that despite showing solely integers, the segment supports integer and floating-point

values. The global segment is used in conjunction with the

symbol table to store variable values, which can be accessed

throughout the lifetime of the program. Values in the global

segment can also be references to tensors⁴. See Sec. V for

more information regarding tensor storage.

The memory access commands are push [segment] i

and pop [segment] i. The push instruction pushes the

value at index i of memory segment [segment] onto the

stack. The pop instruction pops the value on top of the

stack into index i of memory segment [segment] [18].

The rest of the AVM instruction set (composed of arithmetic

instructions and built-in subroutines) is given in Tables III and

IV.

Note that each arithmetic operation can be executed as a single

instruction by the corresponding processor.

Subroutines are handled with the instruction call [fname] [nArgs]. The top [nArgs] values on the stack are treated as arguments, so the virtual machine pops the stack [nArgs] times when the call command is executed.

⁴ Apollo does not yet support user-defined subroutines, so a local segment is not required.

[Figure 5: tree rooted at T with one level of index nodes per mode and the values at the leaves:

T
  0
    0
      0: 0.1
    1
      0: 8.0
      1: 9.9
      2: 4.4
  1
    0
      0: 3.1
      3: 0.9
    1
      4: 1.3]

Fig. 5. CSF representation of the tensor in Fig. 1, as proposed by [15].

Since malloc has 1 argument, a possible code fragment for it looks like:

    push constant 3
    call malloc 1

This would 1) push 3 onto the stack, 2) pop 3 off the stack and pass it into malloc, 3) find an unused RAM segment of size 3, and 4) push a pointer to the first index of that segment onto the stack. Its behavior mimics Memory.alloc in [18].
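The interplay of push, pop, call, and the pointer segment can be modeled in a few lines of Python. The sketch below is a toy model under stated assumptions, not the AVM implementation: the class name, the RAM size, the heap start address, and the bump allocator standing in for malloc are all hypothetical.

```python
class ToyAVM:
    """Toy model of the stack machine; RAM layout and sizes are assumptions."""

    def __init__(self, ram_size=64, heap_start=16):
        self.ram = [0] * ram_size
        self.stack = []
        self.segments = {"global": {}, "pointer": {0: 0}}
        self.next_free = heap_start      # bump allocator: never frees memory

    def push(self, segment, i):
        if segment == "constant":
            self.stack.append(i)         # abstract segment: the index is the value
        elif segment == "this":
            base = self.segments["pointer"][0]
            self.stack.append(self.ram[base + i])
        else:
            self.stack.append(self.segments[segment].get(i, 0))

    def pop(self, segment, i):
        value = self.stack.pop()
        if segment == "this":
            base = self.segments["pointer"][0]
            self.ram[base + i] = value
        else:
            self.segments[segment][i] = value

    def call(self, fname, n_args):
        # Pop n_args values; reverse so args are in the order they were pushed.
        args = [self.stack.pop() for _ in range(n_args)][::-1]
        if fname != "malloc":
            raise ValueError(f"unknown subroutine {fname}")
        (size,) = args
        base = self.next_free
        self.next_free += size
        self.stack.append(base)          # pointer to the first index of the block

vm = ToyAVM()
vm.push("constant", 3)
vm.call("malloc", 1)     # stack now holds the base address of a 3-cell block
vm.pop("pointer", 0)     # anchor the `this` segment at that block
vm.push("constant", 7)
vm.pop("this", 2)        # write 7 into the third cell of the block
```

Storing the malloc result at index 0 of the pointer segment is what re-anchors the this segment, so subsequent push this i and pop this i commands address the newly allocated block.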

V. SPARSE TENSOR STORAGE

A. Current Methods

Tensor components are conventionally represented as nested

arrays in standard programming languages. In C++, the components are stored as one contiguous row-major array. To access the element at index (i, j) of a two-dimensional array with N columns, the element at offset base + i·N + j in the flattened block is read (where base is the base address of the array) [19]. In Java, each array of dimension n + 1 contains pointers to each sub-array of dimension n. If n = 0, the (n + 1)-dimensional array simply stores scalar values [20].
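For a contiguous row-major block, the offset of an element generalizes to any rank by folding the indices one mode at a time. The Python sketch below illustrates that arithmetic; the function name `flat_index` is ours, and offsets are counted in elements relative to the base address.

```python
def flat_index(index, shape):
    """Offset of `index` inside a row-major (C/C++ style) contiguous block."""
    offset = 0
    for i, extent in zip(index, shape):
        if not 0 <= i < extent:
            raise IndexError("index out of range")
        offset = offset * extent + i   # shift by this mode's extent, then add
    return offset

shape = (2, 3, 5)
offset = flat_index((1, 2, 4), shape)  # (1 * 3 + 2) * 5 + 4 = 29
```

In two dimensions this reduces to i·N + j, where N is the number of columns.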

Since tensors are often sparse, however, these conventional

methods often end up storing excess zeros, making them suboptimal. The Facebook tensor discussed in Sec. I-A has only 737,934 nonzero values, so more than 99.9999% of its entries are zeros. It is apparent that tensor storage optimizations must be considered. Compressed Sparse Fiber (CSF) format is a better method that stores a tensor in a tree structure, where the indices and values for only nonzero components are contained, as shown in Fig. 5. CSF performs significantly better than conventional approaches for applications involving highly sparse tensor algebra.
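The sparsity of the Facebook tensor follows directly from the dimensions and nonzero count quoted above. The short calculation below reproduces it; the 8 bytes per entry used for the dense size is our assumption (one double-precision float per entry), not a figure from [9].

```python
dims = (1591, 63891, 63890)
nonzeros = 737_934

total = dims[0] * dims[1] * dims[2]   # entries in the dense tensor
density = nonzeros / total            # fraction of entries that are nonzero
zero_fraction = 1 - density           # fraction of entries that are zero

# Assumption: 8 bytes per entry if every entry were stored densely.
dense_bytes = total * 8
```

The dense tensor has about 6.5 × 10^12 entries, of which roughly 1.1 × 10^-7 are nonzero; under the 8-byte assumption, dense storage would need about 52 TB.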

However, CSF requires storing pointers to each child node,

likely integrated to enable fast indexing [15]. Such an opti-

mization would typically be incredibly important; however,

since our optical hardware can do an MVM in a single

instruction, it is not necessary that we are able to access indices

efﬁciently in intermediate computations. Rather, it is important

that we return an entire row of indices as fast as possible. Sec.

VI provides insight into why this is the case.

B. Binary Sparse Tensor Tree Format

To save memory and return sub-tensors quickly, we store

the tensor in Fig. 1 as shown in Fig. 6.

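[Figure 6: binary tree. Reading each left child as the first sub-tensor of a node and each right child as the node's next sibling, the tree encodes the following hierarchy of indices and values:

0
  0
    0: 0.1
  1
    0: 8.0
    1: 9.9
    2: 4.4
1
  0
    0: 3.1
    3: 0.9
  1
    4: 1.3]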

Fig. 6. Our representation of the tensor in Fig. 1, which we refer to as Binary

Sparse Tensor Tree (BSTT) format. Each non-leaf node contains an index. The

left child is always the root of a sub-tensor belonging to the current tensor.

The right child is always the root of the next sub-tensor belonging to the

parent tensor shared by the current tensor.

[Figure 7: flat array 0, 0, 0, 0.1, 1, 0, ..., 4, 1.3 (size = 20)]

Fig. 7. Pre-order traversal of BSTT format results in the following array,

which is then stored on the heap (we plan to make tensors mutable in future

Apollo versions). Values are always assumed to be ﬂoating point numbers, a

safe assumption due to the large number of non-integer values encountered

in the targeted ﬁelds [7–9, 11, 15]. Indices are always integers. This allows us

to determine the leaf nodes and "reconstruct" the tree when needed.

We only use this format for intermediate computations. It is

slower to index into a speciﬁc value, but this is irrelevant as

such indexing is not necessary for Apollo-supported interme-

diate computations on Tachyon 1. Again, however, we must

be able to access a full row of rank-n indices easily. This is

efﬁcient with our format as we can simply return a pointer to

the ﬁrst index. Hence, our method is more useful than CSF

format for our purposes.
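The pre-order flattening of Fig. 7 can be sketched in Python as follows. The nested-dictionary input, the function names, and in particular the `decode` routine are our assumptions about one way the integer/float distinction could be used to rebuild coordinates when the rank is known; they are not taken from the Apollo compiler.

```python
# The tensor of Fig. 6 as nested dicts: index -> sub-tensor, or index -> value.
tree = {
    0: {0: {0: 0.1}, 1: {0: 8.0, 1: 9.9, 2: 4.4}},
    1: {0: {0: 3.1, 3: 0.9}, 1: {4: 1.3}},
}

def flatten(node):
    """Pre-order traversal: each index is followed by its sub-tensor or value."""
    out = []
    for index, child in node.items():
        out.append(index)                 # non-leaf node: an integer index
        if isinstance(child, dict):
            out.extend(flatten(child))    # descend into the sub-tensor
        else:
            out.append(float(child))      # leaf node: a floating-point value
    return out

def decode(flat, rank):
    """Recover (coordinate, value) pairs; ints are indices, floats are values."""
    entries, prefix, run = [], [], []
    for item in flat:
        if isinstance(item, int):
            run.append(item)
        else:
            # A run of k indices before a value replaces the last k coordinates.
            prefix = prefix[: rank - len(run)] + run
            entries.append((tuple(prefix), item))
            run = []
    return entries

flat = flatten(tree)
entries = decode(flat, rank=3)
```

The flattened array has 20 elements, matching the size shown in Fig. 7, and decoding it recovers the seven nonzero coordinates.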

VI. COMPILING TENSOR ALGEBRA EXPRESSIONS

As stated in earlier sections, the most powerful tensor

algebra operation supported by Tachyon 1 that can be done

in a single instruction is matrix-vector multiplication (MVM).

Therefore, it is the compiler’s job to translate more complex

operations into sequences of MVMs when applicable, thereby

accelerating computation of the whole expression. For clarity,

note that Tachyon 1 multiplies matrices and vectors in the order $\mathbf{A}\mathbf{x} = \mathbf{b}$. Also note that decomposition into sub-tensors whose sizes are supported by Tachyon 1 is not covered in this paper.

A. Scalar-tensor Product

The scalar-tensor product as defined in Def. II.1 is a commutative operation that multiplies each element in a tensor $\mathcal{X}$ by a scalar $\lambda$.⁵ The product is very easy to compile; simply iterate through each vector $\mathbf{x}_{i_1 i_2 \ldots i_{n-1}}$ in the tensor and generate the mvmul instruction to multiply it by the matrix

$$\lambda\mathbf{I} = \begin{bmatrix} \lambda & 0 \\ 0 & \lambda \end{bmatrix}$$

(shown here for vectors of length 2). Note that the compiler reorients the product to generate the matrix before the vector if the user calls it in the opposite order. In other words, it ensures that running the generated code results in a product in the order $\lambda\mathbf{I}\,\mathbf{x}_{i_1 i_2 \ldots i_{n-1}}$.
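The following Python sketch mirrors this compilation strategy on nested lists: it walks down to each innermost vector and multiplies it by λI. The `mvmul` function is only a software stand-in for the AVM instruction of the same name, and the other helper names are illustrative.

```python
def mvmul(matrix, vector):
    """Software stand-in for the AVM mvmul instruction: matrix times vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def scaled_identity(lam, n):
    """The n x n matrix lam * I."""
    return [[lam if r == c else 0 for c in range(n)] for r in range(n)]

def scalar_tensor(lam, tensor):
    """Multiply every innermost vector of a nested-list tensor by lam * I."""
    if not isinstance(tensor[0], list):        # reached a vector
        return mvmul(scaled_identity(lam, len(tensor)), tensor)
    return [scalar_tensor(lam, sub) for sub in tensor]

X = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
Y = scalar_tensor(2, X)   # [[[2, 4], [6, 8]], [[10, 12], [14, 16]]]
```

One mvmul is issued per innermost vector, so a rank-3 tensor of shape 2 × 2 × 2 needs four of them.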

B. Rank-n Kronecker Product

The Kronecker product is useful in signal and image pro-

cessing [21]. Through the Khatri-Rao product, it is useful

in neural networks (through minimization of convolution and

tensor sketch operations) and natural language processing [22].

Refer to Def. II.2 for the deﬁnition of the Kronecker

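product. For clarity, each block in the result is simply the element $x_{i_1 i_2 \ldots i_n}$ multiplied by $\mathcal{Y}$ for two tensors $\mathcal{X}$ and $\mathcal{Y}$. The product can be represented compactly between two matrices as

$$\mathbf{A} \otimes \mathbf{B} = \begin{bmatrix} a_{11}\mathbf{B} & a_{12}\mathbf{B} & \cdots & a_{1n}\mathbf{B} \\ a_{21}\mathbf{B} & a_{22}\mathbf{B} & \cdots & a_{2n}\mathbf{B} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1}\mathbf{B} & a_{m2}\mathbf{B} & \cdots & a_{mn}\mathbf{B} \end{bmatrix}$$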

Therefore, the compiler can compute the scalar-tensor product

for each element in the resultant block tensor through the

method outlined in Section VI-A.
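For the matrix case, the block structure above can be sketched in Python as one scalar-matrix product per element of A, followed by stitching the blocks together. The function names are illustrative; `scale` stands in for the scalar-tensor product of Sec. VI-A, which Apollo would compile to mvmul instructions.

```python
def scale(lam, matrix):
    """Scalar-matrix product; stands in for the method of Sec. VI-A."""
    return [[lam * x for x in row] for row in matrix]

def kron(a, b):
    """Kronecker product of matrices a (m x n) and b (p x q)."""
    result = []
    for a_row in a:
        blocks = [scale(a_ij, b) for a_ij in a_row]   # a_i1 * B, a_i2 * B, ...
        for r in range(len(b)):
            # Concatenate row r of every block to form one row of the result.
            result.append([x for block in blocks for x in block[r]])
    return result

A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
K = kron(A, B)
# K == [[0, 1, 0, 2], [1, 0, 2, 0], [0, 3, 0, 4], [3, 0, 4, 0]]
```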

C. Tensor Dot Product

Many ﬁelds, including machine learning and physics, de-

mand the ability to compute the dot product efﬁciently

[23, 24]. To allow Tachyon 1 to meet this demand, we must

also provide a way for the Apollo compiler to transform

this operation into a sequence of MVMs. The tensor dot

product is an operation between two tensors, $\mathcal{A}$ and $\mathcal{B}$. Some possibly familiar tensor dot products include the rank-0, rank-1, and rank-2 dot products (scalar product, vector dot product, and matrix multiplication respectively). The compiler considers the dot product operation over the component arrays. A few cases are possible:

1) $\mathcal{A}$ and $\mathcal{B}$ are both scalars
2) Either $\mathcal{A}$ or $\mathcal{B}$ is a scalar, but not both
3) $\mathcal{A}$ is a vector/matrix, whereas $\mathcal{B}$ is a vector
4) $\mathcal{A}$ is a vector/matrix, whereas $\mathcal{B}$ is a rank-n tensor with $n > 2$
5) $\mathcal{A}$ is a rank-n tensor where $n > 2$, whereas $\mathcal{B}$ is a vector
6) $\mathcal{A}$ is a rank-n tensor and $\mathcal{B}$ is a rank-m tensor, where $n > 1$, $m > 1$, and $n \neq m$
7) $\mathcal{A}$ and $\mathcal{B}$ are both rank-n tensors

⁵ $\mathcal{X}$ is assumed to be a tensor of rank $n > 0$, since the parser would map the scalar case to scalar multiplication.

Cases 1 and 2 are irrelevant since the parser maps Case 1 to the scalar product and Case 2 to the scalar-tensor product (Def. II.1; discussed in Sec. VI-A). In Case 3, $\mathcal{A}$ is always treated as a matrix and the mvmul command is simply generated (this accounts for Def. II.3 if $\mathcal{A}$ is a vector).

From this point on, we define a function $f_n$ that refers to Case $n$ (e.g., $f_3$ generates an MVM instruction). Continuing, in Case 4, $\mathcal{A}$ is also treated as a matrix. $\mathcal{B}$ is decomposed into a chain of vectors and a series of references to Case 3 ($f_3(\mathcal{A}, \mathbf{b}_{i_1 i_2 \ldots i_{n-1}})$) are made. In Case 5, $\mathcal{A}$ is decomposed into a chain of matrices and a series of references to Case 3 ($f_3(\mathbf{A}_{i_1 i_2 \ldots i_{n-2}}, \mathcal{B})$) are again made.

In Cases 6 and 7, we consider Def. II.4. In Case 6, the tensor of lower rank is first decomposed. Case 4 is then referenced for each matrix if $\mathcal{A}$ was decomposed (always into a matrix chain, resulting in calls to $f_4(\mathbf{A}_{i_1 i_2 \ldots i_{n-2}}, \mathcal{B})$) and Case 5 is referenced if $\mathcal{B}$ was decomposed (always into a vector chain, resulting in calls to $f_5(\mathcal{A}, \mathbf{b}_{i_1 i_2 \ldots i_{n-1}})$). Finally, in Case 7, $\mathcal{A}$ is decomposed and Case 4 is referenced ($f_4(\mathbf{A}_{i_1 i_2 \ldots i_{n-2}}, \mathcal{B})$).
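As a rough illustration of Case 4, the Python sketch below multiplies a matrix by a rank-3 tensor using only matrix-vector products. It assumes the contraction of Def. II.4 (last mode of the first operand against the second-to-last mode of the second); the way fibers are enumerated and the function names are our assumptions, and `mvmul` is again a software stand-in for the instruction.

```python
def mvmul(matrix, vector):
    """Software stand-in for the single-instruction MVM on Tachyon 1."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def dot_matrix_rank3(a, b):
    """Case 4 sketch: a is (P x K), b is (J1 x K x J3); result is (P x J1 x J3)."""
    p, j1_n, k_n, j3_n = len(a), len(b), len(b[0]), len(b[0][0])
    result = [[[0] * j3_n for _ in range(j1_n)] for _ in range(p)]
    for j1 in range(j1_n):
        for j3 in range(j3_n):
            # Vector of b along the contracted (second-to-last) mode.
            fiber = [b[j1][k][j3] for k in range(k_n)]
            column = mvmul(a, fiber)          # one MVM, i.e. one Case 3 call
            for i in range(p):
                result[i][j1][j3] = column[i]
    return result

A = [[1, 0], [0, 2]]
B = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
C = dot_matrix_rank3(A, B)
# C == [[[1, 2], [5, 6]], [[6, 8], [14, 16]]]
```

In this example the 2 × 2 × 2 operand yields four fibers, so the whole product is issued as four MVMs.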

D. Vector Cross Product

The cross product is an operation that appears frequently in

computational geometry/computer graphics. A common task

is to generate a third vector orthogonal to two other vectors

(or a plane formed by 3 points) [25]. The cross product can

also be used to calculate the distance between two lines and

determine whether they are parallel. It also appears in a multitude of

physics simulations.

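For most applications, cross products are in $\mathbb{R}^3$ and between two vectors.⁶ We consider the cross product in a positively oriented orthonormal basis. The cross product of two vectors in $\mathbb{R}^3$ as defined in Def. II.7 is also given by the antisymmetric matrix-vector product

$$\mathbf{a} \times \mathbf{b} = \begin{bmatrix} 0 & -a_3 & a_2 \\ a_3 & 0 & -a_1 \\ -a_2 & a_1 & 0 \end{bmatrix} \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix}$$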

The mvmul command can simply be generated from here.
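The antisymmetric-matrix form can be checked with a few lines of Python. The sketch below builds the matrix from the first operand and applies a single matrix-vector product; `mvmul` is a software stand-in for the instruction, and `skew` and `cross` are illustrative names.

```python
def mvmul(matrix, vector):
    """Software stand-in for the AVM mvmul instruction."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def skew(a):
    """Antisymmetric matrix whose product with b equals a x b."""
    a1, a2, a3 = a
    return [[0, -a3, a2],
            [a3, 0, -a1],
            [-a2, a1, 0]]

def cross(a, b):
    """Cross product of two 3-vectors as a single matrix-vector product."""
    return mvmul(skew(a), b)

c = cross([1, 2, 3], [4, 5, 6])   # [-3, 6, -3]
```

The result is orthogonal to both inputs, as expected of a cross product.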

E. Other Tensor Products

The Khatri-Rao product is useful in variances in statis-

tics, multi-way models, linear matrix equations, and signal

processing [26–30]. The face-splitting product is useful in

convolutional layers in neural networks and digital signal

processing in a digital antenna array [31, 32].

The code generation method shown in Sec. VI-B can be

easily extended to support the Khatri-Rao and face-splitting

products given in Definitions II.5 and II.6 respectively. An

mvmul command can be generated for each index $i$ on the

operands $\mathbf{a}_i$ and $\mathbf{b}_i$.

⁶ Higher-rank cross products can be defined using the Levi-Civita symbol $\epsilon_{ijk}$, which we omit due to relatively few applications.

F. Compilation of Expressions

Chaining multiple operations into expressions is supported.

The code generator traverses the AST with the tensor algebra

operator precedence discussed in Sec. III-B, and each code

generation command is called sequentially as outlined in Sec.

VI. However, as a prototype, the Apollo compiler assumes the

arguments are valid and performs no expression-related error

handling.

VII. DISCUSSION

There are still additions that will need to be made to the

Apollo language in order to fully optimize optical compu-

tations. Most importantly, we will need to use our tensor

storage algorithm only for highly sparse tensors involved in

intermediate computations; we currently implement it for all

tensors. We plan to also extend Apollo to generate t1926 and

host instructions, integrate t1926 instructions with Tachyon

1, and develop the methodology by which Tachyon 1 would

interact with the host CPU. Neural network activation func-

tions, such as ReLU, sigmoid, and softmax, are planned to be

hard-wired into the microchip; we will extend the language

to support neural networks when this occurs. We also plan

to add more useful tensor algebra operations based on the

foundation discussed in this paper, such as the Matricized

Tensor Times Khatri-Rao Product (MTTKRP). These and

other extensions would help Apollo become a more robust

and efﬁcient language.

Future research should explore tensor storage methods that

will be able to more efﬁciently represent sparse tensors while

still making them easy to index into. In order to optimize

for speed, it will also be crucial to investigate how to best

minimize required communication between Tachyon 1 and the

host CPU, as converting between optical and electrical signals

takes a signiﬁcant amount of time. We plan to conduct this

research ourselves, but at the same time encourage others to

look into it as well.

In the future, we plan on extending the advances made in

developing the Apollo language to build APIs for high-level

languages (e.g., Python, Java, C++, etc.) so that they will be

able to utilize Tachyon 1. This will allow users of conventional

languages to be able to harness the speed of optical computing

for applications such as physics simulations and ML. We

speciﬁcally plan on building libraries able to integrate with

the TensorFlow and PyTorch APIs so that users will be able

to run ML models made with these APIs on Tachyon 1.

Our current design framework for Apollo leads the way for

more powerful calculations to be performed faster on a new

generation of hardware. With future advancements and opti-

mizations, Apollo has the potential to impact numerous ﬁelds

in engineering, computer science, and the natural sciences by

allowing for signiﬁcantly faster tensor algebra computations.

VIII. CONCLUSION

In this paper, we show how to perform tensor algebra

computations on an optoelectronic microchip through Apollo,

a domain-specific language designed for this purpose. We

then go over the language, compiler, and virtual machine

designs. Next, we show a new way to store tensors that

outperforms both conventional storage and CSF format from

a memory viewpoint while still being compatible with our

optical hardware. Finally, we go over the compilation of tensor

algebra expressions into matrix-vector multiplications, which

are native to our microchip. We illustrate how complex tensor

algebra expressions can be run quickly and efﬁciently through

our methods. We close by discussing the impact of our research,

suggesting avenues for future research, and outlining

how we plan to extend the Apollo language.

Acknowledgements

We thank Dhruv Anurag for Apollo-related discussion and

testing. We thank Jagadeepram Maddipatla for creating test

cases. We thank Dr. Jonathan Osborne for mathematical dis-

cussion and advice. We thank Mr. Emil Jurj for supporting

this project. We thank Shihao Cao for support and useful

discussion about the project’s future. Finally, we thank our

families for extended support and patience.

REFERENCES

[1] “Comsol multiphysics® software - understand,

predict, and optimize.” [Online]. Available:

https://www.comsol.com/comsol-multiphysics

[2] “Engineering simulation software — ansys products.” [Online].

Available: https://www.ansys.com/products

[3] “Multiphysics modeling.” [Online]. Available:

https://www.veryst.com/services/simulation-analysis/multiphysics-modeling

[4] D. E. Keyes, L. C. McInnes, C. Woodward, W. Gropp, E. Myra,

M. Pernice, J. Bell, J. Brown, A. Clo, J. Connors, E. Constantinescu,

D. Estep, K. Evans, C. Farhat, A. Hakim, G. Hammond, G. Hansen,

J. Hill, T. Isaac, X. Jiao, K. Jordan, D. Kaushik, E. Kaxiras, A. Koniges,

K. Lee, A. Lott, Q. Lu, J. Magerlein, R. Maxwell, M. McCourt,

M. Mehl, R. Pawlowski, A. P. Randles, D. Reynolds, B. Rivi`ere,

U. R¨ude, T. Scheibe, J. Shadid, B. Sheehan, M. Shephard, A. Siegel,

B. Smith, X. Tang, C. Wilson, and B. Wohlmuth, “Multiphysics

simulations: Challenges and opportunities,” The International Journal

of High Performance Computing Applications, vol. 27, no. 1, pp. 4–83,

2013. [Online]. Available: https://doi.org/10.1177/1094342012468181

[5] D. Blalock and J. Guttag, “Multiplying matrices without multiplying,”

in Proceedings of the 38th International Conference on Machine

Learning, ser. Proceedings of Machine Learning Research, M. Meila

and T. Zhang, Eds., vol. 139. PMLR, 18–24 Jul 2021, pp. 992–1004.

[Online]. Available: https://proceedings.mlr.press/v139/blalock21a.html

[6] A. Briola, J. D. Turiel, R. Marcaccioli, and T. Aste, “Deep reinforcement

learning for active high frequency trading,” CoRR, vol. abs/2101.07107,

2021. [Online]. Available: https://arxiv.org/abs/2101.07107

[7] P. A. Tew, “An investigation of sparse tensor formats for

tensor libraries,” M.Eng. Thesis, Massachusetts Institute of

Technology, Cambridge, MA, Jun 2016. [Online]. Available:

http://tensor-compiler.org/files/tew-meng-thesis-sparse.pdf

[8] H. Xu, K. Kostopoulou, A. Dutta, X. Li, A. Ntoulas, and P. Kalnis,

“Deepreduce: A sparse-tensor communication framework for federated

deep learning,” in Advances in Neural Information Processing Systems,

M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan,

Eds., vol. 34. Curran Associates, Inc., 2021, pp. 21 150–21 163.

[Online]. Available: https://tinyurl.com/3vrf7crn

[9] F. Kjolstad, S. Kamil, S. Chou, D. Lugato, and S. Amarasinghe,

“The tensor algebra compiler,” Proc. ACM Program. Lang., vol. 1,

no. OOPSLA, pp. 77:1–77:29, Oct. 2017. [Online]. Available:

http://doi.acm.org/10.1145/3133901

[10] D. M. Dunlavy, T. G. Kolda, and W. P. Kegelmeyer, 7. Multilinear

Algebra for Analyzing Data with Multiple Linkages, pp. 85–114. [Online].

Available: https://epubs.siam.org/doi/abs/10.1137/1.9780898719918.ch7

[11] N. K. Srivastava, “Design and generation of efficient hardware

accelerators for sparse and dense tensor computations,” 2020. [Online].

Available: https://hdl.handle.net/1813/70449

[12] J. Lehrer, “1,084 days: How toy story 3 was made,” Jun 2010. [Online].

Available: https://www.wired.co.uk/article/how-toy-story-3-was-made

[13] P. Peltzer, J. Lotz, and U. Naumann, “Eigen-ad: Algorithmic

differentiation of the eigen library,” CoRR, vol. abs/1911.12604, 2019.

[Online]. Available: http://arxiv.org/abs/1911.12604

[14] T. Kolda, B. W. Bader, E. N. Acar Ataman, D. Dunlavy, R. Bassett,

C. J. Battaglino, T. Plantenga, E. Chi, S. Hansen, and USDOE,

“Tensor toolbox for MATLAB v. 3.0,” Mar. 2017. [Online]. Available:

https://www.osti.gov/biblio/1349514

[15] S. Smith, N. Ravindran, N. D. Sidiropoulos, and G. Karypis, “Splatt:

Efficient and parallel sparse tensor-matrix multiplication,” in 2015 IEEE

International Parallel and Distributed Processing Symposium, 2015, pp.

61–70.

[16] C. Cole, “Optical and electrical programmable

computing energy use comparison,” Opt. Express, vol. 29,

no. 9, pp. 13 153–13 170, Apr 2021. [Online]. Available:

http://opg.optica.org/oe/abstract.cfm?URI=oe-29-9-13153

[17] S. Garg, J. Lou, A. Jain, and M. A. Nahmias, “Dynamic precision

analog computing for neural networks,” CoRR, vol. abs/2102.06365,

2021. [Online]. Available: https://arxiv.org/abs/2102.06365

[18] N. Nisan and S. Schocken, The Elements of Computing Systems:

Building a modern computer from first principles. The MIT Press,

2021.

[19] Corob-Msft, “Arrays (c++).” [Online]. Available:

https://docs.microsoft.com/en-us/cpp/cpp/arrays-cpp?view=msvc-170

[20] “Arrays.” [Online]. Available:

https://docs.oracle.com/javase/tutorial/java/nutsandbolts/arrays.html

[21] C. F. Van Loan, “The ubiquitous Kronecker product,” Journal of

Computational and Applied Mathematics, vol. 123, no. 1, pp. 85–100, 2000,

Numerical Analysis 2000. Vol. III: Linear Algebra. [Online]. Available:

https://www.sciencedirect.com/science/article/pii/S0377042700003939

[22] A. D. Jagtap, Y. Shin, K. Kawaguchi, and G. E. Karniadakis, “Deep

kronecker neural networks: A general framework for neural networks

with adaptive activation functions,” CoRR, vol. abs/2105.09513, 2021.

[Online]. Available: https://arxiv.org/abs/2105.09513

[23] S. Rabanser, O. Shchur, and S. Günnemann, “Introduction to tensor

decompositions and their applications in machine learning,” 2017.

[Online]. Available: https://arxiv.org/abs/1711.10781

[24] G. Dahl, J. M. Leinaas, J. Myrheim, and E. Ovrum, “A tensor product

matrix approximation problem in quantum physics,” Linear Algebra and

its Applications, vol. 420, no. 2, pp. 711–725, 2007. [Online]. Available:

https://www.sciencedirect.com/science/article/pii/S0024379506004149

[25] R. Eisele, “3d cross product.” [Online]. Available:

https://www.xarg.org/book/linear-algebra/3d-cross-product/

[26] C. A. Sims, J. H. Stock, and M. W. Watson, “Inference in linear time

series models with some unit roots,” Econometrica, vol. 58, no. 1, p.

113, Jan. 1990.

[27] R. L. Chambers, A. H. Dorfman, and S. Wang, “Limited information

likelihood analysis of survey data,” J. R. Stat. Soc. Series B Stat.

Methodol., vol. 60, no. 2, pp. 397–411, 1998.

[28] R. Bro, Multi-way Analysis in the Food Industry. Models. Algorithms

and Applications.

[29] H. Lev-Ari, Efficient solution of linear matrix equations with

applications to multistatic.

[30] R. S. Budampati and N. D. Sidiropoulos, “Khatri-Rao space-time codes

with maximum diversity gains over frequency-selective channels,” in

Sensor Array and Multichannel Signal Processing Workshop

Proceedings, 2002. IEEE, 2003.

[31] V. Slyusar, “New matrix operations for DSP,” Nov. 1999.

[32] D. Ha, A. M. Dai, and Q. V. Le, “Hypernetworks,” CoRR,

vol. abs/1609.09106, 2016. [Online]. Available:

http://arxiv.org/abs/1609.09106