Big Data, Simulations and HPC Convergence
Geoffrey Fox1, Judy Qiu, Shantenu Jha2, Saliya Ekanayake1, and
Supun Kamburugamuve1
1School of Informatics and Computing, Indiana University,
Bloomington, IN 47408, USA
2RADICAL, Rutgers University, Piscataway, NJ 08854, USA
Abstract. Two major trends in computing systems are the growth in high performance computing (HPC), with in particular an international exascale initiative, and big data, with an accompanying cloud infrastructure of dramatic and increasing size and sophistication. In this
paper, we study an approach to convergence for software and applica-
tions/algorithms and show what hardware architectures it suggests. We
start by dividing applications into data plus model components and clas-
sifying each component (whether from Big Data or Big Compute) in the
same way. This leads to 64 properties divided into 4 views, which are
Problem Architecture (Macro pattern); Execution Features (Micro pat-
terns); Data Source and Style; and finally the Processing (runtime) View.
We discuss convergence software built around HPC-ABDS (High Perfor-
mance Computing enhanced Apache Big Data Stack) and show how one
can merge Big Data and HPC (Big Simulation) concepts into a single
stack and discuss appropriate hardware.
Keywords: Big Data, HPC, Simulations
1 Introduction
Two major trends in computing systems are the growth in high performance
computing (HPC) with an international exascale initiative, and the big data phenomenon with an accompanying cloud infrastructure of well-publicized, dramatic and increasing size and sophistication. There has been substantial discussion
of the convergence of big data analytics, simulations and HPC [1, 11–13, 29, 30]
highlighted by the Presidential National Strategic Computing Initiative [5]. In
studying and linking these trends and their convergence, one needs to consider
multiple aspects: hardware, software, applications/algorithms and even broader
issues like business model and education. Here we focus on software and appli-
cations/algorithms and make comments on the other aspects. We discuss appli-
cations/algorithms in section 2, software in section 3 and link them and other
aspects in section 4.
2 Applications and Algorithms
We extend the analysis given by us [18, 21], which used ideas in earlier parallel computing studies [8, 9, 31] to build a set of Big Data application characteristics with 50 features – called facets – divided into 4 views. As it incorporated the approach of the Berkeley dwarfs [8] and included features from the NRC Massive Data Analysis Report's Computational Giants [27], we termed these characteristics Ogres. Here we generalize the approach to integrate Big Data and Simulation applications into a single classification that we call convergence diamonds, with a total of 64 facets split between the same 4 views. The four views are Problem Architecture (Macro patterns, abbreviated PA); Execution Features (Micro patterns, abbreviated EF); Data Source and Style (abbreviated DV); and finally the Processing (runtime, abbreviated Pr) View.
The central idea is that any problem – whether Big Data or Simulation, and whether HPC or cloud-based – can be broken up into Data plus Model. The DDDAS approach is an example where this idea is explicit [3]. In a Big Data problem, the Data is large and needs to be collected, stored, managed and accessed. One then uses Data Analytics to compare some Model with this data. The Model could be small, such as the coordinates of a few clusters, or large, as in a deep learning network; almost by definition the Data is large. On the other hand, for simulations the model is nearly always big, as in the values of fields on a large space-time mesh. The Data could be small: it is essentially zero for Quantum Chromodynamics simulations and corresponds to the typically small boundary conditions of many simulations; however, climate and weather simulations can absorb large amounts of assimilated data. Remember that Big Data has a model, so there are model diamonds for big data; they describe the analytics.
The diamonds and their facets are given in a table in the appendix and are summarized in Figure 1.
Directly comparing Big Data and simulations is not so clear; however, comparing the model in simulations with the model in Big Data is straightforward, while the data in both cases can be treated similarly. This simple idea lies at the heart of our approach to Big Data - Simulation convergence. In the convergence diamonds given in the table presented in the Appendix, the facets are divided into three types:
1. Facet n (without D or M) refers to a facet of the system including both data and model – 16 in total.
2. Facet nD is a Data-only facet – 16 in total.
3. Facet nM is a Model-only facet – 32 in total.
The increase in total facets and the large number of model facets correspond mainly to adding Simulation facets to the Processing View of the Diamonds. Note that we have included characteristics (facets) present in the Berkeley Dwarfs and NAS Parallel Benchmarks as well as the NRC Massive Data Analysis Computational Giants. For some facets there are separate data and model facets. A good example in the Convergence Diamond Micropatterns or Execution Features view is that EF-4D is Data Volume and EF-4M is Model Size.
The views Problem Architecture, Execution Features, Data Source and Style, and Processing (runtime) consist respectively of mainly system facets; a mix of system, model and data facets; mainly data facets; and, for the final view, entirely model facets. The facets tell us how to compare diamonds (instances of big data and simulation applications), see which system architectures are needed to support each diamond, and identify which architectures can serve multiple diamonds, including those from both the simulation and big data areas.
Fig. 1. Summary of the 64 facets in the Convergence Diamonds
In several papers [17, 33, 34] we have looked at the model in big data problems and studied the model performance on both cloud and HPC systems. We have shown similarities and differences between models in the simulation and big data areas. In particular, the latter often need HPC hardware and software enhancements to get good performance. There are special features of each class; for example, simulations often have local connections between model points, corresponding either to the discretization of a differential operator or to a short-range force. Big data problems sometimes involve fully connected sets of points, and these formulations have similarities to long-range force problems in simulation. In both regimes we often see linear algebra kernels, but the sparseness structure is rather different. Graph data structures are present in both cases, but in simulations they tend to have more structure. The linkage between people in the Facebook social network is less structured than the linkage between molecules in a complex biochemistry simulation; however, both are graphs with some long-range but many short-range interactions. Simulations nearly always involve a mix of point-to-point messaging and collective operations like broadcast, gather, scatter and reduction. Big data problems are sometimes dominated by collectives as opposed to point-to-point messaging, and this motivates the map-collective problem architecture facet PA-3 above. In both simulations and big data one sees a similar BSP (loosely synchronous, PA-8), SPMD (PA-7), iterative (EF-11M) structure, and this motivates the Spark [32], Flink [7] and Twister [15, 16] approach. Note that a pleasingly parallel (PA-1), local (Pr-2M) structure is often seen in both simulations and big data.
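To make this shared map-collective, BSP-style pattern concrete, the sketch below (illustrative only, not code from the paper or from any of the cited systems) writes one K-means style superstep as a map phase over points followed by a collective model update; the iterative outer loop corresponds to the EF-11M facet.

```python
import numpy as np

def kmeans_superstep(points, centroids):
    """One BSP superstep (PA-8): a map phase over the points followed by a
    reduction that rebuilds the global model (the centroids)."""
    k, dim = centroids.shape
    sums, counts = np.zeros((k, dim)), np.zeros(k)
    # Map phase (PA-3): each point contributes to its nearest centroid.
    for p in points:
        nearest = np.argmin(np.linalg.norm(centroids - p, axis=1))
        sums[nearest] += p
        counts[nearest] += 1
    # Collective phase: in a distributed run these partial sums/counts would be
    # combined across workers by a reduction (an MPI Allreduce, a Spark reduce,
    # or a similar collective) before the model is updated.
    return sums / np.maximum(counts, 1)[:, None]

points = np.random.rand(1000, 2)     # toy data standing in for a large dataset
centroids = np.random.rand(3, 2)     # small model: coordinates of a few clusters
for _ in range(10):                  # EF-11M: iterate until the model converges
    centroids = kmeans_superstep(points, centroids)
```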
In [33, 34] we introduce Harp as a plug-in to Hadoop with scientific data abstractions, support for iterations and high-quality communication primitives. This runs with good performance on several important data analytics problems, including Latent Dirichlet Allocation (LDA), clustering and dimension reduction. Note that LDA has a non-trivial sparse structure coming from the underlying bag-of-words model for documents. In [17], we look at performance in great detail, showing excellent data analytics speedup on an Infiniband-connected HPC cluster using MPI. Deep Learning [14, 24] has clearly shown the importance of HPC and uses many ideas originally developed for simulations.
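As a hedged illustration of using a simulation technology (MPI) for the model-synchronization step of data analytics, the sketch below uses mpi4py collectives to combine per-rank partial results from a local map phase such as the one above; it is a minimal sketch of the pattern, not the SPIDAL or Harp code of [17, 33, 34].

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
k, dim = 3, 2

# Stand-ins for the per-rank output of a local map phase (e.g. partial K-means sums).
local_sums = np.random.rand(k, dim)
local_counts = np.random.rand(k) * 100

global_sums = np.empty_like(local_sums)
global_counts = np.empty_like(local_counts)
# The collective: every rank receives the summed model contributions.
comm.Allreduce(local_sums, global_sums, op=MPI.SUM)
comm.Allreduce(local_counts, global_counts, op=MPI.SUM)

new_centroids = global_sums / np.maximum(global_counts, 1e-9)[:, None]
if comm.Get_rank() == 0:
    print("updated model (centroids):\n", new_centroids)
```

Run with, for example, `mpirun -n 4 python allreduce_update.py` (a hypothetical script name); the same collective pattern is what libraries such as Harp expose to the Hadoop ecosystem.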
Above we discuss models in the big data and simulation regimes; what about the data? Here we see the issue as perhaps less clear-cut, but convergence does not seem difficult technically. Given that models can be executed on HPC systems when needed, it appears reasonable to use a different architecture for the data, with the big data approach of hosting data on clouds quite attractive. HPC has tended not to use big data management tools but rather to host data on shared file systems like Lustre. We expect this to change, with object stores and HDFS approaches gaining popularity in the HPC community. It is not clear if HDFS will run on HPC systems or instead on co-located clouds supporting the rich object, SQL, NoSQL and NewSQL paradigms. This co-location strategy can also work for streaming data within the traditional Apache Storm-Kafka map streaming model (PA-5), buffering data with Kafka on a cloud and feeding that data to Apache Storm, which may need HPC hardware for complex analytics (running in Storm bolts). In this regard we have introduced HPC enhancements to Storm [26].
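The buffer-then-analyze pattern can be sketched in Python as below; this is illustrative only, using the kafka-python client rather than the Storm bolt API of [26], and the topic name, broker address and analytics function are placeholder assumptions.

```python
from kafka import KafkaConsumer  # kafka-python client
import json

def complex_analytics(event):
    # Stand-in for the heavy model computation that may need HPC hardware;
    # in the Storm version this work would run inside a bolt.
    return sum(event.get("values", []))

consumer = KafkaConsumer(
    "sensor-events",                        # hypothetical topic buffered on the cloud side
    bootstrap_servers="cloud-broker:9092",  # hypothetical Kafka broker address
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

for message in consumer:
    result = complex_analytics(message.value)
    print(result)
```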
We believe there is an immediate need to investigate the overlap of application characteristics and classifications from the high-end computing and big data ends of the spectrum. Here we have shown how initial work [21] to classify big data applications can be extended to include traditional high-performance applications. Can traditional classifications for high-performance applications [8] be extended in the opposite direction to incorporate big data applications? And if so, is the end result similar to, overlapping with, or very distinct from the preliminary classification proposed here? Such understanding is critical in order eventually to have a common set of benchmark applications and suites [10] that will guide the development of future systems, which must have a design point that provides balanced performance.
Note that applications are instances of Convergence Diamonds. Each instance will exhibit some but not all of the facets of Fig. 1. We can give an example: the NAS Parallel Benchmark [9] LU (Lower-Upper symmetric Gauss-Seidel) using MPI. This would be a diamond with facets PA-4, 7, 8 and Pr-3M, 16M, with its size specified in EF-4M. PA-4 would be replaced by PA-2 if one used (unwisely) MapReduce for this problem. Further, if one read the initial data from MongoDB, the data facet DV-1D would be added. Many other examples are given in section 3 of [18]. For example, non-vector clustering in Table 1 of that section is a nice data analytics example. It exhibits Problem Architecture view facets PA-3, PA-7 and PA-8; Execution Features EF-9D (Static), EF-10D (Regular), EF-11M (Iterative), EF-12M (Bag of items), EF-13D (Non-metric), EF-13M (Non-metric) and EF-14M (O(N²) algorithm); and Processing view facets Pr-3M, Pr-9M (Machine Learning and Expectation Maximization) and Pr-12M (Full matrix, Conjugate Gradient).
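As an illustration (our own encoding, not one prescribed by the paper), the facet lists of the two instances just described can be recorded as simple data and compared view by view, which is how one would look for system architectures shared across diamonds:

```python
# Facet tags for the two Convergence Diamond instances discussed above.
NAS_LU_MPI = {
    "PA": {"PA-4", "PA-7", "PA-8"},
    "EF": {"EF-4M"},                     # model size is given per instance
    "Pr": {"Pr-3M", "Pr-16M"},
}
NON_VECTOR_CLUSTERING = {
    "PA": {"PA-3", "PA-7", "PA-8"},
    "EF": {"EF-9D", "EF-10D", "EF-11M", "EF-12M", "EF-13D", "EF-13M", "EF-14M"},
    "Pr": {"Pr-3M", "Pr-9M", "Pr-12M"},
}

def shared_facets(a, b):
    """Facets common to two diamond instances, view by view."""
    return {view: sorted(a[view] & b[view]) for view in a}

print(shared_facets(NAS_LU_MPI, NON_VECTOR_CLUSTERING))
# Both need SPMD/BSP support (PA-7, PA-8) and global iterative processing (Pr-3M).
```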
3 HPC-ABDS Convergence Software
In previous papers [20, 25, 28], we introduced the software stack HPC-ABDS
(High Performance Computing enhanced Apache Big Data Stack) shown on-
line [4] and in Figures 2 and 3. These were combined with the big data application
analysis [6, 19, 21] in terms of Ogres that motivated the extended convergence
diamonds in section 2. We also use Ogres and HPC-ABDS to suggest a system-
atic approach to benchmarking [18, 22]. In [23] we described the software model
of Figure 2 while further details of the stack can be found in an online course [2]
that includes a section with about one slide (and associated lecture video) for
each entry in Figure 2.
Figure 2 collects together much existing relevant systems software coming
from either HPC or commodity sources. The software is broken up into layers so
software systems are grouped by functionality. The layers where there is special opportunity to integrate HPC and ABDS are colored green in Figure 2.
Kaleidoscope of (Apache) Big Data Stack (ABDS) and HPC Technologies (21 layers, over 350 software packages; January 29, 2016)
Cross-Cutting Functions:
1) Message and Data Protocols: Avro, Thrift, Protobuf
2) Distributed Coordination: Google Chubby, Zookeeper, Giraffe, JGroups
3) Security & Privacy: InCommon, Eduroam, OpenStack Keystone, LDAP, Sentry, Sqrrl, OpenID, SAML, OAuth
4) Monitoring: Ambari, Ganglia, Nagios, Inca
17) Workflow-Orchestration: ODE, ActiveBPEL, Airavata, Pegasus, Kepler, Swift, Taverna, Triana, Trident, BioKepler, Galaxy, IPython, Dryad, Naiad, Oozie, Tez, Google FlumeJava, Crunch, Cascading, Scalding, e-Science Central, Azure Data Factory, Google Cloud Dataflow, NiFi (NSA), Jitterbit, Talend, Pentaho, Apatar, Docker Compose, KeystoneML
16) Libraries and Applications: Mahout, MLlib, MLbase, DataFu, R, pbdR, Bioconductor, ImageJ, OpenCV, Scalapack, PetSc, PLASMA, MAGMA, Azure Machine Learning, Google Prediction API & Translation API, mlpy, scikit-learn, PyBrain, CompLearn, DAAL (Intel), Caffe, Torch, Theano, DL4j, H2O, IBM Watson, Oracle PGX, GraphLab, GraphX, IBM System G, GraphBuilder (Intel), TinkerPop, Parasol, Dream:Lab, Google Fusion Tables, CINET, NWB, Elasticsearch, Kibana, Logstash, Graylog, Splunk, Tableau, D3.js, three.js, Potree, DC.js, TensorFlow, CNTK
15B) Application Hosting Frameworks: Google App Engine, AppScale, Red Hat OpenShift, Heroku, Aerobatic, AWS Elastic Beanstalk, Azure, Cloud Foundry, Pivotal, IBM BlueMix, Ninefold, Jelastic, Stackato, appfog, CloudBees, Engine Yard, CloudControl, dotCloud, Dokku, OSGi, HUBzero, OODT, Agave, Atmosphere
15A) High level Programming: Kite, Hive, HCatalog, Tajo, Shark, Phoenix, Impala, MRQL, SAP HANA, HadoopDB, PolyBase, Pivotal HD/Hawq, Presto, Google Dremel, Google BigQuery, Amazon Redshift, Drill, Kyoto Cabinet, Pig, Sawzall, Google Cloud DataFlow, Summingbird
14B) Streams: Storm, S4, Samza, Granules, Neptune, Google MillWheel, Amazon Kinesis, LinkedIn Databus, Twitter Heron, Facebook Puma/Ptail/Scribe/ODS, Azure Stream Analytics, Floe, Spark Streaming, Flink Streaming, DataTurbine
14A) Basic Programming model and runtime, SPMD, MapReduce: Hadoop, Spark, Twister, MR-MPI, Stratosphere (Apache Flink), Reef, Disco, Hama, Giraph, Pregel, Pegasus, Ligra, GraphChi, Galois, Medusa-GPU, MapGraph, Totem
13) Inter-process communication (collectives, point-to-point, publish-subscribe): MPI, HPX-5, Argo BEAST, HPX-5 BEAST, PULSAR, Harp, Netty, ZeroMQ, ActiveMQ, RabbitMQ, NaradaBrokering, QPid, Kafka, Kestrel, JMS, AMQP, Stomp, MQTT, Marionette Collective; Public Cloud: Amazon SNS, Lambda, Google Pub Sub, Azure Queues, Event Hubs
12) In-memory databases/caches: Gora (general object from NoSQL), Memcached, Redis, LMDB (key value), Hazelcast, Ehcache, Infinispan, VoltDB, H-Store
12) Object-relational mapping: Hibernate, OpenJPA, EclipseLink, DataNucleus, ODBC/JDBC
12) Extraction Tools: UIMA, Tika
11C) SQL (NewSQL): Oracle, DB2, SQL Server, SQLite, MySQL, PostgreSQL, CUBRID, Galera Cluster, SciDB, Rasdaman, Apache Derby, Pivotal Greenplum, Google Cloud SQL, Azure SQL, Amazon RDS, Google F1, IBM dashDB, N1QL, BlinkDB, Spark SQL
11B) NoSQL: Lucene, Solr, Solandra, Voldemort, Riak, ZHT, Berkeley DB, Kyoto/Tokyo Cabinet, Tycoon, Tyrant, MongoDB, Espresso, CouchDB, Couchbase, IBM Cloudant, Pivotal Gemfire, HBase, Google Bigtable, LevelDB, Megastore and Spanner, Accumulo, Cassandra, RYA, Sqrrl, Neo4J, graphdb, Yarcdata, AllegroGraph, Blazegraph, Facebook Tao, Titan:db, Jena, Sesame; Public Cloud: Azure Table, Amazon Dynamo, Google DataStore
11A) File management: iRODS, NetCDF, CDF, HDF, OPeNDAP, FITS, RCFile, ORC, Parquet
10) Data Transport: BitTorrent, HTTP, FTP, SSH, Globus Online (GridFTP), Flume, Sqoop, Pivotal GPLOAD/GPFDIST
9) Cluster Resource Management: Mesos, Yarn, Helix, Llama, Google Omega, Facebook Corona, Celery, HTCondor, SGE, OpenPBS, Moab, Slurm, Torque, Globus Tools, Pilot Jobs
8) File systems: HDFS, Swift, Haystack, f4, Cinder, Ceph, FUSE, Gluster, Lustre, GPFS, GFFS; Public Cloud: Amazon S3, Azure Blob, Google Cloud Storage
7) Interoperability: Libvirt, Libcloud, JClouds, TOSCA, OCCI, CDMI, Whirr, Saga, Genesis
6) DevOps: Docker (Machine, Swarm), Puppet, Chef, Ansible, SaltStack, Boto, Cobbler, Xcat, Razor, CloudMesh, Juju, Foreman, OpenStack Heat, Sahara, Rocks, Cisco Intelligent Automation for Cloud, Ubuntu MaaS, Facebook Tupperware, AWS OpsWorks, OpenStack Ironic, Google Kubernetes, Buildstep, Gitreceive, OpenTOSCA, Winery, CloudML, Blueprints, Terraform, DevOpSlang, Any2Api
5) IaaS Management from HPC to hypervisors: Xen, KVM, QEMU, Hyper-V, VirtualBox, OpenVZ, LXC, Linux-Vserver, OpenStack, OpenNebula, Eucalyptus, Nimbus, CloudStack, CoreOS, rkt, VMware ESXi, vSphere and vCloud, Amazon, Azure, Google and other public Clouds
Networking: Google Cloud DNS, Amazon Route 53
Fig. 2. Big Data and HPC Software subsystems arranged in 21 layers. Green layers
have a significant HPC integration
Fig. 3. Comparison of Big Data and HPC Simulation Software Stacks
The resulting integrated stack is termed HPC-ABDS (High Performance Computing enhanced Apache Big Data Stack), as many critical core components of the commodity stack (such as Spark and HBase) come from open source projects, while HPC is needed to bring performance and other parallel computing capabilities [23]. Note that Apache is the largest but not the only source of open source software; we believe that the Apache Foundation is a critical leader in the Big Data open source software movement and use it to designate the full big data software ecosystem. The figure also includes proprietary systems, as they illustrate key capabilities and often motivate open source equivalents. We built this picture for big data problems, but it also applies to big simulation, with the caveat that we need to add more high-level software at the library level as well as high-level tools like Global Arrays. This will become clearer in the next section when we discuss Figure 2 in more detail.
The essential idea of our Big Data HPC convergence for software is to make
use of ABDS software where possible as it offers richness in functionality, a com-
pelling open-source community sustainability model and typically attractive user
interfaces. ABDS has a good reputation for scale but often does not give good
performance. We suggest augmenting ABDS with HPC ideas especially in the
green layers of Figure 2. We have illustrated this with Hadoop [33,34], Storm [26]
and the basic Java environment [17]. We suggest using the resultant HPC-ABDS
for both big data and big simulation applications. In the language of Figure 3, we use the stack on the left enhanced by the high-performance ideas and libraries of the classic HPC stack on the right. As one example, we recommend using enhanced MapReduce (Hadoop, Spark, Flink) for parallel programming for simulations and big data, where it is the model (data analytics) that has requirements similar to simulations. We have shown how to integrate HPC technologies into MapReduce to get the performance expected in HPC [34] and, on the other hand, that if the user interface is not critical, one can use a simulation technology (MPI) to drive excellent data analytics performance [17]. A byproduct of these studies is that classic HPC clusters make excellent data analytics engines. One can use the convergence diamonds to quantify this result. These define properties of applications spanning both data and simulations and allow one to specify hardware and software requirements uniformly over these two classes of applications.
4 Convergence Systems
Fig. 4. Dual Convergence Architecture
Figure 3 contrasts the modern ABDS and HPC stacks, illustrating most of the 21 layers and labelled on the left with the layer numbers used in Figure 2. The layers of Figure 2 omitted in Figure 3 are Interoperability, DevOps, Monitoring and Security (layers 7, 6, 4, 3), which are all important and clearly applicable to both HPC and ABDS. We also add in Figure 3 an extra layer corresponding to the programming language, a feature not discussed in Figure 2. Our suggested approach is to build around the stacks of Figure 2, taking the best approach at each layer, which may require merging ideas from ABDS and HPC. This converged stack is still emerging, but we have described some of its features in the previous section. This stack would then support both big data and big simulation, as well as the data aspects (store, manage, access) of the data in the data plus model framework. Although the stack architecture is uniform, it will have different emphases in hardware and software that will be optimized using the convergence diamond facets. In particular, the data management will usually have a different optimization from the model computation.
Thus we propose a canonical dual system architecture, sketched in Figure 4, with data management on the left side and model computation on the right. As drawn, the two systems are the same size, but this of course need not be true. Further, we depict data-rich nodes on the left to support HDFS, but that also might not be correct: maybe both systems are disk-rich, or maybe we have a classic Lustre-style system on the model side to mimic current HPC practice. Finally, the systems may in fact be coincident, with data management and model computation on the same nodes. The latter is perhaps the canonical big data approach, but we see many big data cases where the model will require hardware optimized for performance, with for example high-speed internal networking or GPU-enhanced nodes. In this case the data may be more effectively handled by a separate cloud-like cluster. This depends on properties recorded in the facets of the Convergence Diamonds for application suites. These ideas are built on substantial experimentation but still need significant testing, as they have not been examined systematically.
We suggested using the same software stack for both systems in the dual Convergence system. That means we pick and choose from HPC-ABDS on both machines, but we need not make the same choices on both systems; obviously the data management system would stress software in layers 10 and 11 of Figure 2, while the model computation would need libraries (layer 16) and programming plus communication (layers 13-15).
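A minimal sketch of that "same stack, different choices" idea follows; the layer numbers are those of Figure 2, while the particular package selections are illustrative assumptions, not recommendations from the paper.

```python
# Hypothetical HPC-ABDS selections for the two halves of the dual system.
DUAL_CONVERGENCE_SYSTEM = {
    "data_management": {            # left side of Figure 4
        10: ["Globus Online (GridFTP)"],   # Data Transport
        11: ["HDFS", "HBase", "MongoDB"],  # File management / NoSQL
    },
    "model_computation": {          # right side of Figure 4
        13: ["MPI", "Harp"],               # Inter-process communication / collectives
        14: ["Hadoop", "Spark", "Flink"],  # Basic programming model and runtime
        16: ["MLlib", "DAAL (Intel)"],     # Libraries
    },
}

for subsystem, layers in DUAL_CONVERGENCE_SYSTEM.items():
    print(subsystem, "->", {layer: pkgs for layer, pkgs in sorted(layers.items())})
```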
Acknowledgments. This work was partially supported by NSF CIF21 DIBBS
1443054, NSF OCI 1149432 CAREER, and AFOSR FA9550-13-1-0225 awards.
We thank Dennis Gannon for comments on an early draft.
References
1. Big Data and Extreme-scale Computing (BDEC), http://www.exascale.org/bdec/,
Accessed: 2016 Jan 29
2. Data Science Curriculum: Indiana University Online Class: Big Data Open Source
Software and Projects. 2014, http://bigdataopensourceprojects.soic.indiana.edu/,
accessed Dec 11 2014
3. DDDAS Dynamic Data-Driven Applications System Showcase, http://www.1dddas.org/, Accessed July 22 2015
4. HPC-ABDS Kaleidoscope of over 350 Apache Big Data Stack and HPC Technolo-
gies, http://hpc-abds.org/kaleidoscope/
5. NSCI: Executive Order – Creating a National Strategic Comput-
ing Initiative, https://www.whitehouse.gov/the-press-office/2015/07/29/executive-order-creating-national-strategic-computing-initiative, July 29 2015
6. NIST Big Data Use Case & Requirements. V1.0 Final Version 2015 (Jan 2016),
http://bigdatawg.nist.gov/V1_output_docs.php
7. Apache Software Foundation: Apache Flink open source platform for distributed
stream and batch data processing, https://flink.apache.org/, Accessed: Jan 16 2016
8. Asanovic, K., Bodik, R., Catanzaro, B.C., Gebis, J.J., Husbands, P., Keutzer, K.,
Patterson, D.A., Plishker, W.L., Shalf, J., Williams, S.W., et al.: The landscape of
parallel computing research: A view from berkeley. Tech. rep., UCB/EECS-2006-
183, EECS Department, University of California, Berkeley (2006), Available from
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
9. Bailey, D.H., Barszcz, E., Barton, J.T., Browning, D.S., Carter, R.L., Dagum,
L., Fatoohi, R.A., Frederickson, P.O., Lasinski, T.A., Schreiber, R.S., et al.: The
NAS parallel benchmarks. International Journal of High Performance Computing
Applications 5(3), 63–73 (1991)
10. Baru, C., Rabl, T.: Tutorial 4 ”Big Data Benchmarking” at 2014 IEEE Interna-
tional Conference on Big Data. 2014 [accessed 2015 January 2]
11. Baru, C.: BigData Top100 List, http://www.bigdatatop100.org/, Accessed: 2016
Jan
12. Bryant, R.E.: Data-Intensive Supercomputing: The case for DISC. http://www.cs.cmu.edu/bryant/pubdir/cmu-cs-07-128.pdf, CMU-CS-07-128, May 10 2007
13. Bryant, R.E.: Supercomputing & Big Data: A Convergence. https://www.nitrd.gov/nitrdgroups/images/5/5e/SC15panel_RandalBryant.pdf, Supercomputing (SC) 15 Panel: Supercomputing and Big Data: From Collision to Convergence, Nov 18 2015, Austin, Texas. https://www.nitrd.gov/apps/hecportal/index.php?title=Events#Supercomputing_.28SC.29_15_Panel
14. Coates, A., Huval, B., Wang, T., Wu, D., Catanzaro, B., Andrew, N.: Deep learning
with COTS HPC systems. In: Proceedings of the 30th international conference on
machine learning. pp. 1337–1345 (2013)
15. Ekanayake, J., Li, H., Zhang, B., Gunarathne, T., Bae, S.H., Qiu, J., Fox, G.:
Twister: a runtime for iterative mapreduce. In: Proceedings of the 19th ACM
International Symposium on High Performance Distributed Computing. pp. 810–
818. ACM (2010)
16. Ekanayake, J., Pallickara, S., Fox, G.: Mapreduce for data intensive scientific anal-
yses. In: eScience, 2008. eScience’08. IEEE Fourth International Conference on.
pp. 277–284. IEEE (2008)
17. Ekanayake, S., Kamburugamuve, S., Fox, G.: SPIDAL: High Performance Data
Analytics with Java and MPI on Large Multicore HPC Clusters. http://dsc.soic.indiana.edu/publications/hpc2016-spidal-high-performance-submit-18-public.pdf (Jan 2016), Technical Report
18. Fox, G., Jha, S., Qiu, J., Ekanazake, S., Luckow, A.: Towards
a Comprehensive Set of Big Data Benchmarks. Big Data and
High Performance Computing 26, 47 (Feb 2015), Available from
http://grids.ucs.indiana.edu/ptliupages/publications/OgreFacetsv9.pdf
19. Fox, G., Chang, W.: Big data use cases and requirements. In: 1st Big Data Inter-
operability Framework Workshop: Building Robust Big Data Ecosystem ISO/IEC
JTC 1 Study Group on Big Data. pp. 18–21 (2014)
20. Fox, G., Qiu, J., Jha, S.: High Performance High Functionality Big Data
Software Stack. Big Data and Extreme-scale Computing (BDEC) (2014),
Available from http://www.exascale.org/bdec/sites/www.exascale.org.bdec/files/
whitepapers/fox.pdf
21. Fox, G.C., Jha, S., Qiu, J., Luckow, A.: Towards an Understanding of Facets
and Exemplars of Big Data Applications, in 20 Years of Beowulf: Work-
shop to Honor Thomas Sterling’s 65th Birthday October 14, 2014. Annapolis
http://dx.doi.org/10.1145/2737909.2737912
22. Fox, G.C., Jha, S., Qiu, J., Luckow, A.: Ogres: A Systematic Approach to Big Data
Benchmarks. Big Data and Extreme-scale Computing (BDEC) pp. 29–30 (2015)
23. Fox, G.C., Qiu, J., Kamburugamuve, S., Jha, S., Luckow, A.: HPC-ABDS High
Performance Computing Enhanced Apache Big Data Stack. In: Cluster, Cloud and
Grid Computing (CCGrid), 2015 15th IEEE/ACM International Symposium on.
pp. 1057–1066. IEEE (2015)
24. Iandola, F.N., Ashraf, K., Moskewicz, M.W., Keutzer, K.: FireCaffe: near-linear
acceleration of deep neural network training on compute clusters. arXiv preprint
arXiv:1511.00175 (2015)
25. Jha, S., Qiu, J., Luckow, A., Mantha, P., Fox, G.C.: A tale of two data-intensive
paradigms: Applications, abstractions, and architectures. In: 2014 IEEE Interna-
tional Congress on Big Data (BigData Congress). pp. 645–652. IEEE (2014)
26. Kamburugamuve, S., Ekanayake, S., Pathirage, M., Fox, G.: Towards High Per-
formance Processing of Streaming Data in Large Data Centers. http://dsc.soic.indiana.edu/publications/high_performance_processing_stream.pdf (2016), Technical Report
27. National Research Council: Frontiers in Massive Data Analysis. The National
Academies Press, Washington, DC (2013)
28. Qiu, J., Jha, S., Luckow, A., Fox, G.C.: Towards HPC-ABDS: An Initial High-
Performance Big Data Stack. Building Robust Big Data Ecosystem ISO/IEC JTC
1 Study Group on Big Data pp. 18–21 (2014), Available from http://grids.ucs.indiana.edu/ptliupages/publications/nist-hpc-abds.pdf
29. Reed, D.A., Dongarra, J.: Exascale computing and big data. Communications of
the ACM 58(7), 56–68 (2015)
30. Trader, T.: Toward a Converged Exascale-Big Data Software Stack,
http://www.hpcwire.com/2016/01/28/toward-a-converged-software-stack-for-extreme-scale-computing-and-big-data/, January 28 2016
31. Van der Wijngaart, R.F., Sridharan, S., Lee, V.W.: Extending the BT NAS parallel
benchmark to exascale computing. In: Proceedings of the International Conference
on High Performance Computing, Networking, Storage and Analysis. p. 94. IEEE
Computer Society Press (2012)
32. Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster
computing with working sets. In: Proceedings of the 2nd USENIX conference on
Hot topics in cloud computing. vol. 10, p. 10 (2010)
33. Zhang, B., Peng, B., Qiu, J.: Parallel LDA Through Synchronized Communica-
tion Optimizations. http://dsc.soic.indiana.edu/publications/LDA_optimization_paper.pdf (2015), Technical Report
34. Zhang, B., Ruan, Y., Qiu, J.: Harp: Collective communication on hadoop. In: IEEE
International Conference on Cloud Engineering (IC2E) conference (2014)
Appendix: Convergence Diamonds with 64 Facets
These are discussed in Section 2 and summarized in Figure 1
Table 1: Convergence Diamonds and their Facets.
Facet and View Comments
PA: Problem Architecture View of Diamonds
(Meta or MacroPatterns)
Nearly all are the system of Data and Model
PA-1 Pleasingly Parallel As in BLAST, protein docking. Includes local analytics or machine learning (ML) or filtering done in a pleasingly parallel fashion, as in bio-imagery or radar images (pleasingly parallel but with sophisticated local analytics)
PA-2 Classic MapReduce Search, Index and Query and Classification al-
gorithms like collaborative filtering.
PA-3 Map-Collective Iterative maps + communication dominated by
collective operations as in reduction, broadcast,
gather, scatter. Common datamining pattern
but also seen in simulations
PA-4 Map Point-to-Point Iterative maps + communication dominated by
many small point to point messages as in graph
algorithms and simulations
PA-5 Map Streaming Describes streaming, steering and assimilation
problems
PA-6 Shared memory (as opposed to distributed parallel algorithm)
Corresponds to problems where shared memory implementations are important. These tend to be dynamic and asynchronous
PA-7 SPMD Single Program Multiple Data, common paral-
lel programming feature
PA-8
Bulk
Synchronous
Processing (BSP)
Well-defined compute-communication phases
PA-9 Fusion Full applications often involve fusion of multiple methods. Only present for composite Diamonds
PA-10 Dataflow Important application features often occurring
in composite Diamonds
PA-11M Agents Modelling technique used in areas like epidemi-
ology (swarm approaches)
PA-12 Orchestration
(workflow) All applications often involve orchestration
(workflow) of multiple components
EF: Diamond Micropatterns or Execution Features
EF-1 Performance
Metrics Result of Benchmark
EF-2 Flops/byte (Memory or I/O), Flops/watt (power), I/O
Not needed for pure in-memory benchmark.
EF-3 Execution
Environment Core libraries needed: matrix-matrix/vector al-
gebra, conjugate gradient, reduction, broad-
cast; Cloud, HPC, threads, message passing
etc. Could include details of machine used for
benchmarking here
EF-4D Data Volume Property of a Diamond Instance. Benchmark
measure
EF-4M Model Size
EF-5D Data Velocity Associated with streaming facet but value de-
pends on particular problem. Not applicable to
model
EF-6D Data Variety Most useful for composite Diamonds. Applies
separately for model and data
EF-6M Model Variety
EF-7 Veracity Most problems would not discuss but poten-
tially important
EF-8M Communication
Structure Interconnect requirements; Is communication
BSP, Asynchronous, Pub-Sub, Collective, Point
to Point? Distribution and Synch
EF-9D D=Dynamic or S=Static Data
EF-9M D=Dynamic or S=Static Model
EF-10D R=Regular or I=Irregular Data
EF-10M R=Regular or I=Irregular Model
Clear qualitative properties. Their importance is familiar from parallel computing, and they apply separately to data and model
EF-11M Iterative
or not? Clear qualitative property of Model. High-
lighted by Iterative MapReduce and always
present in classic parallel computing
EF-12D Data
Abstraction e.g. key-value, pixel, graph, vector, bags of
words or items. Clear quantitative property al-
though important data abstractions not agreed
upon. All should be supported by Programming
model and run time
EF-12M Model
Abstraction e.g. mesh points, finite element, Convolutional
Network.
EF-13D
Data in
Metric Space
or not?
Important property of data.
EF-13M
Model in
Metric Space
or not?
Often driven by data but model and data can
be different here
EF-14M O(N²) or O(N) Complexity? Property of Model algorithm
DV: Data Source and Style View of Diamonds
(No model involvement except in DV-9)
DV-1D SQL/NoSQL/
NewSQL? Can add NoSQL sub-categories such as key-
value, graph, document, column, triple store
DV-2D Enterprise
data model e.g. warehouses. Property of data model high-
lighted in database community / industry
benchmarks
DV-3D Files or Objects? Clear qualitative property of data model where
files important in Science; objects in industry
DV-4D File or Object System HDFS/Lustre/GPFS. Note HDFS important in
Apache stack but not much used in science
DV-5D
Archived
or Batched
or Streaming
Streaming is incremental update of datasets
with new algorithms to achieve real-time
response; Before data gets to compute system,
there is often an initial data gathering phase
which is characterized by a block size and
timing. Block size varies from month (Remote
Sensing, Seismic) to day (genomic) to seconds
or lower (Real time control, streaming)
Streaming Category S1: Set of independent events where precise time sequencing is unimportant.
Streaming Category S2: Time series of connected small events where time ordering is important.
Streaming Category S3: Set of independent large events where each event needs parallel processing with time sequencing not critical.
Streaming Category S4: Set of connected large events where each event needs parallel processing with time sequencing critical.
Streaming Category S5: Stream of connected small or large events to be integrated in a complex way.
DV-6D
Shared and/or
Dedicated and/or
Transient and/or
Permanent
Clear qualitative property of data whose
importance is not well studied. Other
characteristics maybe needed for auxiliary
datasets and these could be interdisciplinary,
implying nontrivial data movement/replication
DV-7D Metadata
and Provenance Clear qualitative property but not for kernels
as important aspect of data collection process
DV-8D Internet of Things Dominant source of commodity data in future.
24 to 50 Billion devices on Internet by 2020
DV-9 HPC Simulations
generate Data Important in science research especially at ex-
ascale
DV-10D
Geographic
Information
Systems
Geographical Information Systems provide at-
tractive access to geospatial data
Pr: Processing (runtime) View of Diamonds
Useful for Big data and Big simulation
Pr-1M Micro-benchmarks Important subset of small kernels
Pr-2M
Local Analytics
or Informatics
or Simulation
Executes on a single core or perhaps node and
overlaps Pleasingly Parallel
Pr-3M
Global Analytics
or Informatics
or simulation
Requiring iterative programming models across
multiple nodes of a parallel system
Pr-12M Linear Algebra Kernels Important property of some analytics; many important subclasses:
Conjugate Gradient, Krylov, Arnoldi iterative subspace methods
Full Matrix
Structured and unstructured sparse matrix methods
Pr-13M Graph Algorithms Clear important class of algorithms often hard
especially in parallel
Pr-14M Visualization Clearly important aspect of analysis in simula-
tions and big data analyses
Pr-15M Core Libraries Functions of general value such as Sorting,
Math functions, Hashing
Big Data Processing Diamonds
Pr-4M Base Data
Statistics Describes simple statistical averages needing
simple MapReduce in problem architecture
Pr-5M Recommender Engine Clear type of big data machine learning of es-
pecial importance commercially
Pr-6M Data Search/
Query/Index Clear important class of algorithms especially
in commercial applications.
Pr-7M Data Classification Clear important class of big data algorithms
Pr-8M Learning Includes deep learning as category
Pr-9M Optimization
Methodology Includes Machine Learning, Nonlinear
Optimization, Least Squares, expectation
maximization, Dynamic Programming, Lin-
ear/Quadratic Programming, Combinatorial
Optimization
Pr-10M Streaming Data
Algorithms Clear important class of algorithms associated
with Internet of Things. Can be called DDDAS
Dynamic Data-Driven Application Systems
Pr-11M Data Alignment Clear important class of algorithms as in
BLAST to align genomic sequences
Simulation (Exascale) Processing Diamonds
Pr-16M Iterative
PDE Solvers Jacobi, Gauss Seidel etc.
Pr-17M Multiscale Method? Multigrid and other variable resolution ap-
proaches
Pr-18M Spectral Methods Fast Fourier Transform
Pr-19M N-body Methods Fast multipole, Barnes-Hut
Pr-20M Particles and Fields Particle in Cell
Pr-21M Evolution of
Discrete Systems Electrical Grids, Chips, Biological Systems,
Epidemiology. Needs ODE solvers
Pr-22M Nature of
Mesh if used Structured, Unstructured, Adaptive