Boosting Curative Surgery Success Rates using
FPGAs
Ahmed Sanaullah, Chen Yang, *Yuri Alexeev, *Kazutomo Yoshii, Martin C. Herbordt
Boston University, Boston, MA; *Argonne National Lab, Lemont, IL
Abstract—FPGAs deliver high-performance, energy-efficient, cost-effective, and scalable solutions for complex, and often critical, modern scientific challenges. In previous work, we have shown that FPGA-based custom designs outperform CPUs and GPUs in applications such as machine learning, drug design, genomics, molecular dynamics, and space exploration. For machine learning using Deep Neural Networks (DNNs) in particular, FPGAs can outperform even ASICs due to their higher potential for application-specific optimizations. In this paper, we present an FPGA-based DNN inference processor that increases the success rate of curative surgery for neural tissues. Differentiation between tumors and normal cells is done in real time by analyzing the smoke produced by an ionizing current, so that tumor edges are identified with high precision. Our design achieves an 820x speedup over GPUs and a 110x speedup over the Google Tensor Processing Unit (ASIC) for the classification task. These results indicate that FPGAs can play a significant role in improving the efficiency and success of computer-assisted procedures.
I. INTRODUCTION
Distinguishing between tumors and healthy tissue during
neurosurgery can be a matter of life and death for patients.
Localization of the tumor using MRI scans lacks the precision required to reliably distinguish the two during the procedure. Incomplete removal of cancerous cells can lead to regrowth that requires a second surgery, while cutting healthy tissue is potentially fatal. Biopsies can identify tumor edges, but they take too long to complete to be feasible during surgery. A more reliable, computer-aided method
for identifying tumor cells during surgery is mass spectrom-
etry. Devices such as the iKnife [1] utilize an ionizing current to
burn through tissue and analyze the smoke (Rapid Evaporative
Ionization Mass Spectrometry (REIMS) [2]) to determine
structural lipid profiles [3]. The measured chemical composition, consisting of thousands of parameters, is then compared against a reference dataset to perform identification.
This classification stage of mass spectrometry-based analysis is memory bound due to the large number of look-up
operations and thus does not scale. An alternative approach
is to use neural networks to automate detection. Not only
do neural networks have higher computational intensity and
better scaling potential, but the algorithm also provides op-
portunities for reducing memory bounds through data reuse
and application-specific optimizations. In our work, we employ feed-forward Multi-Layer Perceptrons (MLPs). The MLP is one of the most commonly deployed types of Deep Neural Network, representing 61% of the workload in Google data centers [4]. To the best of our knowledge, the Google Tensor Processing Unit (TPU) [4], the state-of-the-art MLP inference implementation, achieves only ≈10% of its 92 TeraOps/s peak. The TPU addresses the memory bound by
processing multiple test vectors simultaneously to increase
operations per weight byte loaded from DRAM. In our application, however, waiting to accumulate enough input vectors to reach good performance is not feasible, since healthy tissue could be cut in the meantime.
In this work, we have designed a quantized TeraOps/s Reconfigurable Inference Processor for MLPs (TRIP) on FPGAs that alleviates the memory bound by storing all weights on-chip and ensures that performance is invariant to the input batch size. Through this approach, we not only guarantee stall-free data availability to the pipelines during steady state, but also significantly reduce power consumption by reducing DRAM accesses. For large networks whose weights cannot fit directly on-chip, Deep Compression relaxes the memory footprint requirements with no loss of accuracy [5]. TRIP can be deployed as a stand-alone device directly connected to the mass spectrometer, as a co-processor where input vectors are supplied through OpenCL wrappers from the host machine, or in a cluster configuration where on-chip transceivers communicate between FPGAs [6]. By comparison, the TPU can only be used as a co-processor.
II. SYSTEM DESIGN
MLP-based neural networks typically have asymmetric logical configurations, with the average number of neurons per layer being orders of magnitude larger than the number of layers. Consequently, while relatively small delays for inter-layer function evaluations have negligible impact, addressing compute bottlenecks through high compute capability and sustained throughput for intra-layer computations is performance-critical. This motivated the TRIP architecture presented in Figure 1.
Fig. 1. TRIP Architecture Overview
1) Processing Core: The Processing Core implements up to 8192 8-bit integer multipliers in an M×N 2D array. We employ both DSP and ALM multipliers to ensure sufficient resources on mid- to high-end FPGAs. Each slice of M multipliers has an associated adder tree to enable scalar-product evaluation. TRIP provides the flexibility of choosing array configurations that maximize Processing Core utilization. The resulting N signed 32-bit partial sums are accumulated into output buffers, which are preloaded with bias vectors.
2) Activation Pipeline: The Activation Pipeline performs non-linear operations on layer result vectors and re-quantizes them from 32 bits to a lower bit width for the next layer. TRIP currently supports ReLU activation (x = max(0, x)), implemented using an array of N MUXes. A functional sketch of the Processing Core and Activation Pipeline datapath is given after this list.
3) Interface Logic: The Interface Logic structure depends
on the TRIP deployment configuration. It operates con-
currently with the compute hardware to mask the latency
of fetching test vectors from DAQ devices, DRAM or
transceivers.
4) Control Module: The Control Module coordinates the flow of data based on the MLP model. Parameters for execution and computation patterns are initialized on-chip when the device is programmed, removing the need to stream instructions from the host machine and enabling stand-alone/cluster operation.
III. TPU COMPARISON
As we did not have access to a TPU, we instead approximately quantify the inference latency bound, i.e., how large an input batch the TPU requires in order to outperform our design (Figure 2). The first-generation TPU has 64K MACs, 30 GB/s of off-chip bandwidth, and a 700 MHz operating frequency. TRIP, on the other hand, is deployed with 8192 multipliers for the stand-alone/cluster designs and 4096 multipliers for the co-processor design; the reduction in the latter is due to resource usage by the OpenCL wrapper. The operating frequency is 200 MHz. Input data fetch latency is assumed to be negligible. From the figure, we estimate that TRIP outperforms the TPU for input batch sizes of fewer than 53 test vectors.
Fig. 2. TRIP-TPU Inference Latency Bound Comparison
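A rough back-of-the-envelope model illustrates where this crossover comes from. The sketch below assumes 8-bit weights for the Arcene MLP described in Section IV-A, that the TPU's latency is dominated by streaming all weights from DRAM once per batch, and that TRIP is purely compute bound; these are our simplifying assumptions, not the exact model behind Figure 2.

```python
# Back-of-the-envelope crossover estimate between the TPU and TRIP.
# Assumptions: 8-bit weights, TPU latency = max(weight-streaming time, compute time),
# TRIP latency purely compute bound with all weights on-chip.

WEIGHTS = 10_000 * 32 + 32 * 32 + 32 * 2   # Arcene MLP weight count (~321K, 1 byte each)

def tpu_latency(batch, macs=64 * 1024, freq=700e6, dram_bw=30e9):
    load = WEIGHTS / dram_bw                   # stream all weight bytes from DRAM once
    compute = batch * WEIGHTS / (macs * freq)  # one MAC per weight per input vector
    return max(load, compute)

def trip_latency(batch, mults=8192, freq=200e6):
    return batch * WEIGHTS / (mults * freq)    # weights stay on-chip; no batching benefit

# Smallest batch for which the TPU's amortized weight loading wins.
crossover = next(b for b in range(1, 200) if tpu_latency(b) < trip_latency(b))
print(crossover)   # ~55 under these assumptions, consistent with the ~53 vectors in Fig. 2
```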
IV. RESULTS
A. Benchmark
To the best of our knowledge, there are no neurosurgery
benchmarks available to test our design. We therefore utilize
the Arcene benchmark [7], which is based on similar objectives. The benchmark contains mass spectrometric data of samples from patients with ovarian and prostate cancer, as well as from healthy individuals. The classification result predicts the presence of a tumor. Our MLP model has 10,000 input neurons and 2 output neurons, as well as two hidden layers with 32 neurons each. Training is done using TensorFlow.
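For reference, the stated topology can be expressed as the following Keras sketch. The output activation, loss, optimizer, and any quantization/export steps are not specified in the paper, so those choices below are illustrative assumptions rather than the authors' exact training setup.

```python
# Keras sketch of the stated 10,000-32-32-2 MLP topology (illustrative assumptions
# for activations, loss, and optimizer; only the layer sizes come from the paper).
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(10_000,)),  # hidden layer 1
    tf.keras.layers.Dense(32, activation="relu"),                          # hidden layer 2
    tf.keras.layers.Dense(2, activation="softmax"),                        # tumor vs. healthy
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(x_train, y_train, ...)  # trained on the Arcene split from [7]
```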
B. Device Specifications
We implement our design on a Nallatech 385A development board, which hosts an Altera Arria 10 10AX115N3F40E2SG FPGA. The device has 427,200 ALMs, 1,518 DSP blocks (two 18x18 integer multipliers per block), and 54,260 Kb of on-chip SRAM. The GPU design is implemented on an NVIDIA Tesla K80m, which has 4,992 CUDA cores and 480 GB/s of global memory bandwidth. CUDA matrix operations are performed using the cuBLAS library and compiled with CUDA 8.0.
C. FPGA Parameters
Table I lists the resource usage and operating frequency for our TRIP deployment configurations. Resource usage for the transceivers is estimated based on the Novo-G# [6] router design.
TABLE I
TRIP DEPLOYMENT-BASED FPGA PARAMETERS
Configuration ALM DSP BRAM Freq.(MHz)
Stand-Alone 313K 1,280 0.3MB 207.49
Co-Processor 251K 1,280 0.6MB 201.2
Cluster 315K 1,280 0.3MB 206
D. Performance
Table 2 gives the performance comparison for GPU, TPU
and TRIP for batch-less evaluation. Projections of TRIP on
Stratix 10 are based on 4x more ALM multipliers, 4 extra
DSP scalar product slices and 2x operating frequency. From
the results, we see that TRIP is orders of magnitude faster
than both GPUs and TPU and has better resource utilization.
For GPUs in particular, the inability to use available compute
cores efficiently due to model dimensions has a significant
negative impact on performance .
TABLE II
ARCENE PERFORMANCE COMPARISON FOR A SINGLE INPUT VECTOR
Architecture M,N Useful Ops Performance Speedup
NVIDIA K80 - - 0.004 TOps/s 1x
TPU 256,256 47% 0.03 TOps/s 7.4x
TRIP Arria 10 CoProc 256,16 97% 1.59 TOps/s 398x
TRIP Arria 10 Cluster 256,32 96% 3.28 TOps/s 820x
TRIP Stratix 10 CoProc 256,86 71% 12.56 TOps/s 3140x
TRIP Stratix 10 Cluster 256,102 60% 12.58 TOps/s 3145x
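As a quick sanity check on the Arria 10 rows of Table II, the reported throughput is close to the raw peak implied by the multiplier count and clock rate from Table I when each multiply and its accompanying add are counted as two operations. The short script below is our own arithmetic under that assumption, not the authors' methodology.

```python
# Peak throughput implied by multiplier count x clock for the Arria 10 configurations,
# counting multiply + add as 2 ops/cycle (our assumption, used only as a sanity check).
configs = {
    "Arria 10 CoProc":  (4096, 201.2e6),   # multipliers, clock from Table I
    "Arria 10 Cluster": (8192, 206.0e6),
}
for name, (mults, freq) in configs.items():
    peak_tops = 2 * mults * freq / 1e12
    print(f"{name}: {peak_tops:.2f} TOps/s peak")
# -> ~1.65 and ~3.38 TOps/s, close to the 1.59 and 3.28 TOps/s reported in Table II.
```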
E. Power
TRIP consumes 34W on Arria 10, of which 31W is the
static power
V. IMPACT
To the best of our knowledge, TRIP is the only TeraOps/s MLP inference engine for single inputs. Its deployment versatility and high processing speed make it an ideal candidate for classifying mass spectrometer results to increase the success rate of tumor removal surgeries and reduce procedure timeframes. The use of OpenCL reduces the co-processor integration effort for legacy codes that can pre-process spectrometer data. The cluster configuration enables larger models to be evaluated by distributing layers across multiple devices. TRIP's reconfigurability allows the hardware to adapt to the application, maximizing utilization of the available compute resources and minimizing high-latency memory accesses. TRIP is not constrained to a fixed number of quantization bits: based on the application, the size of the weights can be reduced to further increase the size of the Processing Core without significant impact on accuracy. Since TRIP is implemented using off-the-shelf FPGAs, adopting newer technology simply requires changing design parameters and recompiling for the new device (as opposed to spinning new silicon for ASICs).
REFERENCES
[1] E. R. Council. (2013) A smart knife to fight cancer, crime and contamination. [Online]. Available: https://erc.europa.eu/projects-figures/stories/'smart'-knife-fight-cancer-crime-and-contamination
[2] J. Balog, L. Sasi-Szabó, J. Kinross, M. R. Lewis, L. J. Muirhead, K. Veselkov, R. Mirnezami, B. Dezső, L. Damjanovich, A. Darzi et al., "Intraoperative tissue identification using rapid evaporative ionization mass spectrometry," Science Translational Medicine, vol. 5, no. 194, pp. 194ra93–194ra93, 2013.
[3] E. R. St John, J. Balog, J. S. McKenzie, M. Rossi, A. Covington,
L. Muirhead, Z. Bodai, F. Rosini, A. V. Speller, S. Shousha et al., “Rapid
evaporative ionisation mass spectrometry of electrosurgical vapours for
the identification of breast pathology: towards an intelligent knife for
breast cancer surgery,” Breast Cancer Research, vol. 19, no. 1, p. 59,
2017.
[4] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Ba-
jwa, S. Bates, S. Bhatia, N. Boden, A. Borchers et al., “In-datacenter
performance analysis of a tensor processing unit,” arXiv preprint
arXiv:1704.04760, 2017.
[5] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, "EIE: Efficient inference engine on compressed deep neural network," in Proceedings of the 43rd International Symposium on Computer Architecture. IEEE Press, 2016, pp. 243–254.
[6] A. D. George, M. C. Herbordt, H. Lam, A. G. Lawande, J. Sheng, and C. Yang, "Novo-G#: Large-scale reconfigurable computing with direct and programmable interconnects," in High Performance Extreme Computing Conference (HPEC), 2016 IEEE. IEEE, 2016, pp. 1–7.
[7] UCI Machine Learning Repository. Arcene data set. [Online]. Available: https://archive.ics.uci.edu/ml/datasets/Arcene