Boosting Curative Surgery Success Rates using
FPGAs
Ahmed Sanaullah, Chen Yang, *Yuri Alexeev, *Kazutomo Yoshii, Martin C. Herbordt
Boston University, Boston, MA    *Argonne National Lab, Lemont, IL
Abstract—FPGAs deliver high-performance, energy-efficient, cost-effective, and scalable solutions for complex, and often critical, modern scientific challenges. In previous work, we have shown that FPGA-based custom designs outperform CPUs and GPUs in applications such as machine learning, drug design, genomics, molecular dynamics, and space exploration. For machine learning with Deep Neural Networks (DNNs) in particular, FPGAs can outperform even ASICs due to their higher potential for application-specific optimization. In this paper, we present an FPGA-based DNN inference processor that increases the success rate of curative surgery for neural tissues. Tumors and normal cells are differentiated in real time by analyzing the smoke produced by an ionizing current, so that tumor edges are identified with high precision. Our design achieves an 820x speedup over GPUs and a 110x speedup over the Google Tensor Processing Unit (an ASIC) on the classification task. These results indicate that FPGAs can play a significant role in improving the efficiency and success of computer-assisted procedures.
I. INTRODUCTION
Distinguishing between tumors and healthy tissue during neurosurgery can be a matter of life and death for patients. Localizing the tumor with MRI scans lacks the precision required to make this distinction reliably during the procedure. Incomplete removal of cancerous cells can lead to regrowth that requires a second surgery, while cutting healthy tissue is potentially fatal. Biopsies can identify tumor edges but take too long to complete to be feasible intraoperatively. A more reliable, computer-aided method for identifying tumor cells during surgery is mass spectrometry. Devices such as the iKnife [1] use an ionizing current to burn through tissue and analyze the resulting smoke (Rapid Evaporative Ionization Mass Spectrometry (REIMS) [2]) to determine structural lipid profiles [3]. The measured chemical composition, consisting of thousands of parameters, is then compared to a reference dataset to perform identification.
This classification stage of mass-spectrometry-based analysis is memory bound due to the large number of look-up operations and thus does not scale. An alternative approach is to use neural networks to automate detection. Not only do neural networks have higher computational intensity and better scaling potential, but the algorithm also provides opportunities for reducing memory bounds through data reuse and application-specific optimizations. In our work, we employ feed-forward Multi-Layer Perceptrons (MLPs), one of the most commonly deployed classes of Deep Neural Networks, representing 61% of the workload in Google data centers [4]. To the best of our knowledge, the Google Tensor Processing Unit (TPU) [4], the state-of-the-art implementation of MLP inference, achieves 10% of its 92 TeraOps/s peak. The TPU addresses the memory bound by processing multiple test vectors simultaneously to increase the number of operations per weight byte loaded from DRAM. In our application, however, waiting for enough input vectors to accumulate is not feasible, since healthy tissue could be cut in the meantime.
In this work, we have designed a quantized TeraOps/s Reconfigurable Inference Processor for MLPs (TRIP) on FPGAs that alleviates the memory bound by storing all weights on-chip and makes performance invariant to input batch size. Through this approach, we not only guarantee stall-free data availability to the pipelines during steady state, but also significantly reduce power consumption by avoiding DRAM accesses. For large models whose weights cannot fit directly on-chip, Deep Compression relaxes the memory footprint requirement with no effect on accuracy [5]. TRIP can be deployed as a stand-alone device connected directly to the mass spectrometer, as a co-processor where input vectors are supplied through OpenCL wrappers from the host machine, or in a cluster configuration where on-chip transceivers communicate between FPGAs [6]. By comparison, the TPU can only be used in a co-processor configuration.
II. SYSTEM DESIGN
MLP-based neural networks typically have asymmetric logical configurations: the average number of neurons per layer is orders of magnitude larger than the number of layers. Consequently, while relatively small delays in inter-layer function evaluation have negligible impact, achieving high compute capability and sustained throughput for intra-layer computations is performance critical. This motivated the TRIP architecture presented in Figure 1.
Fig. 1. TRIP Architecture Overview
1) Processing Core: The Processing Core implements up to 8192 8-bit integer multipliers in an M×N 2D array. We employ both DSP and ALM multipliers to ensure sufficient resources on medium-to-high-end FPGAs. Each slice of M multipliers has an associated adder tree to enable scalar-product evaluation. TRIP provides the flexibility of choosing array configurations that maximize Processing Core utilization. The resulting N (signed) 32-bit partial sums are accumulated into output buffers that are preloaded with bias vectors (the first sketch following this list models this dataflow).
2) Activation Pipeline: The Activation Pipeline performs non-linear operations on layer result vectors and requantizes them from 32 bits to a lower bit width for the next layer. TRIP currently supports ReLU activation (x = max(0, x)), implemented using an array of N MUXs (see the second sketch following this list).
3) Interface Logic: The Interface Logic structure depends
on the TRIP deployment configuration. It operates con-
currently with the compute hardware to mask the latency
of fetching test vectors from DAQ devices, DRAM or
transceivers.
4) Control Module: The Control Module coordinates the flow of data based on the MLP model. Parameters for execution and computation patterns are initialized on-chip when the device is programmed, removing the need to stream instructions from the host machine and enabling stand-alone/cluster operation.
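To make the Processing Core dataflow concrete, the following Python/NumPy sketch models one MLP layer on the M×N array. The function name and the defaults M=256, N=16 are ours (chosen to mirror the co-processor configuration in Table II), and the exact hardware scheduling may differ; this is a behavioral model, not the RTL.

    import numpy as np

    def processing_core_layer(x_q, W_q, bias, M=256, N=16):
        """Behavioral model of TRIP's M x N multiplier array for one layer.

        x_q  : (n_in,)        int8 quantized input vector
        W_q  : (n_out, n_in)  int8 quantized weights, held entirely on-chip
        bias : (n_out,)       int32 biases preloaded into the output buffers
        """
        n_out, n_in = W_q.shape
        acc = bias.astype(np.int64)                 # output buffers start at bias
        for j0 in range(0, n_in, M):                # walk the input in M-wide chunks
            x_chunk = x_q[j0:j0 + M].astype(np.int64)
            for i0 in range(0, n_out, N):           # N scalar-product slices in parallel
                W_chunk = W_q[i0:i0 + N, j0:j0 + M].astype(np.int64)
                # multiplier array + adder trees: N partial sums per pass
                acc[i0:i0 + N] += W_chunk @ x_chunk
        # model the 32-bit accumulators
        return np.clip(acc, -2**31, 2**31 - 1).astype(np.int32)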
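Similarly, a minimal sketch of the Activation Pipeline. The text specifies only ReLU followed by requantization from 32 bits; the power-of-two (shift-based) rescaling below is our assumption.

    import numpy as np

    def activation_pipeline(acc32, shift, out_bits=8):
        """ReLU (x = max(0, x)) followed by requantization to out_bits.

        acc32 : (N,) int32 layer results from the Processing Core
        shift : right-shift standing in for the requantization scale (assumed)
        """
        relu = np.maximum(acc32, 0)            # the N-wide MUX array
        requant = relu >> shift                # rescale into the narrower range
        hi = 2 ** (out_bits - 1) - 1
        return np.clip(requant, 0, hi).astype(np.int8)   # saturate and narrow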
III. TPU COMPARISON
As we did not have access to TPU, we approximately
quantify the inference latency bound that allows TPU to have
sufficiently large input batch sizes required to outperform
our design (Figure 2). The first generation of TPU has 64K
MACs, 30GB/s off-chip bandwidth, and 700MHz operating
frequency. On the other hand, TRIP is deployed with 8192
multipliers for stand-alone/cluster, and 4096 multipliers for
co-processor designs. The latter is due to resource usage by
the OpenCL wrapper. Operating frequency is 200MHz. Input
data fetch latency is assumed to be negligible. From the figure,
we estimate that TRIP outperforms TPU for input batch sizes
of less than 53 test vectors.
Fig. 2. TRIP-TPU Inference Latency Bound Comparison
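The shape of this bound can be reproduced from first principles. The sketch below is our reconstruction using only the figures quoted above and the Arcene model size (roughly 321 KB of 8-bit weights, introduced in Section IV); the exact model behind Figure 2 may differ, but this version yields a crossover close to the reported 53 vectors.

    # First-order latency model behind the TRIP-TPU comparison (our
    # reconstruction; all constants are from the text or the Arcene model).
    WEIGHT_BYTES = 10000 * 32 + 32 * 32 + 32 * 2   # ~321 KB of 8-bit weights
    MACS_PER_VEC = WEIGHT_BYTES                    # one MAC per weight per vector

    def tpu_latency(batch):
        # TPU streams weights from DRAM once per batch at 30 GB/s, then
        # computes with 64K MACs at 700 MHz.
        load = WEIGHT_BYTES / 30e9
        compute = batch * MACS_PER_VEC / (65536 * 700e6)
        return load + compute

    def trip_latency(batch, multipliers=8192, freq=200e6):
        # TRIP holds all weights on-chip, so latency scales only with batch.
        return batch * MACS_PER_VEC / (multipliers * freq)

    # smallest batch at which the TPU pulls ahead of TRIP
    crossover = next(b for b in range(1, 1000) if tpu_latency(b) < trip_latency(b))
    print(crossover)   # 57 under these assumptions, near the 53 estimated above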
IV. RESULTS
A. Benchmark
To the best of our knowledge, there are no neurosurgery benchmarks available to test our design. We therefore use the Arcene benchmark [7], which is based on similar objectives. The benchmark contains mass spectrometric data of samples from patients with ovarian and prostate cancer, as well as from healthy individuals. The classification result predicts the presence of tumors. Our MLP model has 10,000 input neurons and 2 output neurons, as well as two hidden layers with 32 neurons each. Training is done using TensorFlow.
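A minimal TensorFlow/Keras sketch of this model is given below. The topology matches the text; the optimizer, loss, and layer defaults are our assumptions, as the paper does not specify training hyperparameters.

    import tensorflow as tf

    # Arcene MLP: 10,000 inputs -> two hidden layers of 32 (ReLU) -> 2 outputs
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=(10000,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(2),          # logits: tumor vs. normal
    ])
    model.compile(
        optimizer="adam",                  # assumed; not specified in the text
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )
    # model.fit(x_train, y_train, ...)     # trained weights are then quantized
    # to 8 bits and loaded into TRIP's on-chip buffers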
B. Device Specifications
We implement our design on a Nallatech 385A development board, which hosts an Altera Arria 10 10AX115N3F40E2SG FPGA. The device has 427,200 ALMs, 1,518 DSP blocks (two 18x18 integer multipliers per block), and 54,260 Kb of on-chip SRAM. The GPU design is implemented on an NVIDIA Tesla K80m, which has 4,992 CUDA cores and 480 GB/s of global memory bandwidth. CUDA matrix operations are performed using the cuBLAS library and compiled with CUDA 8.0.
C. FPGA Parameters
Table I lists the resource usage and operating frequency for our TRIP deployment configurations. Resource usage for the transceivers is estimated from the Novo-G# [6] router design.
TABLE I
TRIP DEPLOYMENT-BASED FPGA PARAMETERS

Configuration  | ALMs | DSPs  | BRAM   | Freq. (MHz)
Stand-Alone    | 313K | 1,280 | 0.3 MB | 207.49
Co-Processor   | 251K | 1,280 | 0.6 MB | 201.2
Cluster        | 315K | 1,280 | 0.3 MB | 206
D. Performance
Table II gives the performance comparison of the GPU, TPU, and TRIP for batch-less evaluation. Projections of TRIP on Stratix 10 are based on 4x more ALM multipliers, 4 extra DSP scalar-product slices, and 2x the operating frequency. From the results, we see that TRIP is orders of magnitude faster than both the GPU and the TPU and has better resource utilization. For the GPU in particular, the inability to use the available compute cores efficiently, due to the model dimensions, has a significant negative impact on performance.
TABLE II
ARCENE PERFORMANCE COMPARISON FOR A SINGLE INPUT VECTOR

Architecture            | M, N    | Useful Ops | Performance   | Speedup
NVIDIA K80              | -       | -          | 0.004 TOps/s  | 1x
TPU                     | 256,256 | 47%        | 0.03 TOps/s   | 7.4x
TRIP Arria 10 CoProc    | 256,16  | 97%        | 1.59 TOps/s   | 398x
TRIP Arria 10 Cluster   | 256,32  | 96%        | 3.28 TOps/s   | 820x
TRIP Stratix 10 CoProc  | 256,86  | 71%        | 12.56 TOps/s  | 3140x
TRIP Stratix 10 Cluster | 256,102 | 60%        | 12.58 TOps/s  | 3145x
E. Power
TRIP consumes 34 W on the Arria 10, of which 31 W is static power.
V. IMPACT
To the best of our knowledge, TRIP is the only TeraOps/s MLP inference engine for single inputs. Its deployment versatility and high processing speed make it an ideal candidate for classifying mass spectrometer results to increase the success rate of tumor removal surgeries and reduce procedure timeframes. The use of OpenCL reduces the effort of integrating the co-processor into legacy codes that can pre-process spectrometer data. The cluster configuration enables larger models to be evaluated by distributing layers across multiple devices. TRIP's reconfigurability allows the hardware to adapt to the application, maximizing utilization of the available compute resources and minimizing high-latency memory accesses. TRIP is not constrained to a particular number of quantization bits: based on the application, the size of the weights can be reduced to further increase the size of the Processing Core without significant impact on accuracy. Since TRIP is implemented using off-the-shelf FPGAs, adopting newer technology is simply a matter of changing design parameters and compiling for the new device (as opposed to spinning new silicon for ASICs).
REFERENCES
[1] European Research Council. (2013) A smart knife to fight cancer, crime and contamination. [Online]. Available: https://erc.europa.eu/projects-figures/stories/'smart'-knife-fight-cancer-crime-and-contamination
[2] J. Balog, L. Sasi-Szabó, J. Kinross, M. R. Lewis, L. J. Muirhead, K. Veselkov, R. Mirnezami, B. Dezső, L. Damjanovich, A. Darzi et al., "Intraoperative tissue identification using rapid evaporative ionization mass spectrometry," Science Translational Medicine, vol. 5, no. 194, pp. 194ra93–194ra93, 2013.
[3] E. R. St John, J. Balog, J. S. McKenzie, M. Rossi, A. Covington, L. Muirhead, Z. Bodai, F. Rosini, A. V. Speller, S. Shousha et al., "Rapid evaporative ionisation mass spectrometry of electrosurgical vapours for the identification of breast pathology: towards an intelligent knife for breast cancer surgery," Breast Cancer Research, vol. 19, no. 1, p. 59, 2017.
[4] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers et al., "In-datacenter performance analysis of a tensor processing unit," arXiv preprint arXiv:1704.04760, 2017.
[5] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, "EIE: Efficient inference engine on compressed deep neural network," in Proceedings of the 43rd International Symposium on Computer Architecture. IEEE Press, 2016, pp. 243–254.
[6] A. D. George, M. C. Herbordt, H. Lam, A. G. Lawande, J. Sheng, and C. Yang, "Novo-G#: Large-scale reconfigurable computing with direct and programmable interconnects," in High Performance Extreme Computing Conference (HPEC), 2016 IEEE. IEEE, 2016, pp. 1–7.
[7] UCI Machine Learning Repository. Arcene data set. [Online]. Available: https://archive.ics.uci.edu/ml/datasets/Arcene