
Digital Science Center III



This poster covers the streaming workshops held, IoTCloud for cloud control of robots, the SPIDAL project, HPC-ABDS, WebPlotViz visualization and stock market data, and scientific paper impact analysis for XSEDE.
Geoffrey C. Fox, David Crandall, Judy Qiu, Gregor von Laszewski, Fugang Wang, Badi' Abdul-Wahid,
Saliya Ekanayake, Supun Kamburugamuve, Jerome Mitchell, Bingjing Zhang, Pulasthi Wickramasinghe, Hyungro Lee, Andrew Younge
School of Informatics and Computing, Indiana University
Scientific Impact Metrics
We developed a software framework and process to evaluate scientific
impact for XSEDE. We calculate and track various Scientific Impact
Metrics of XSEDE, BlueWaters, and NCAR. Recently we conducted
an updated peer comparison analysis with newly added and updated
data, which shows results consistent with the previous one. During this
process we retrieved and processed millions of data entries from
multiple sources in various formats to obtain the result.
Rank -
Rank -
# Citation
# Citation -
XSEDE 5,081 59 63 28 12
Peers 356k 49 48 15 5
Figures tracking various impact metrics for XSEDE (#pubs; #citations; H-Index; G-Index;
Table comparing XSEDE publication citation metrics with peers
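
Two of the metrics named in the figure caption, the H-Index and G-Index, can be computed directly from a list of per-publication citation counts. The sketch below uses the standard textbook definitions and a made-up citation list; it is only an illustration and is not taken from the framework described above. The function names `h_index` and `g_index` are hypothetical.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def g_index(citations):
    """Largest g such that the top g papers have at least g*g citations in total.

    This simple variant caps g at the number of papers.
    """
    ranked = sorted(citations, reverse=True)
    total, g = 0, 0
    for rank, count in enumerate(ranked, start=1):
        total += count
        if total >= rank * rank:
            g = rank
    return g


# Hypothetical citation counts for five publications (not XSEDE data).
sample = [10, 8, 5, 4, 3]
print(h_index(sample), g_index(sample))
```

Sorting once and scanning by rank keeps both metrics at O(n log n), which matters when the input is millions of citation records.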
Visualization of Stock market as a
high dimensional time series
In this project we visualize stocks by projecting the correlations between
stocks through time into 3D using MDS. Years of historical daily stock
data are segmented using a sliding window approach to create
continuous visualizations through time.
Trajectories of stocks through time
Stock visualization of one time frame
Data: Daily stock values were obtained from the Center for Research in
Security Prices (CRSP) database through the Wharton Research
Data Services (WRDS) web interface.
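
The sliding-window projection described above can be sketched as follows. This is a minimal illustration under several assumptions that do not come from the poster: the correlation distance is taken as sqrt(2(1 - r)), the MDS step is a plain unweighted SMACOF iteration (which may differ from the MDS variant actually used), each frame is seeded with the previous frame's layout to keep the trajectories continuous, and the input is synthetic random data rather than CRSP prices. All function names are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def window_distances(series, start, width):
    """Pairwise correlation distances sqrt(2 * (1 - r)) inside one time window."""
    cut = [s[start:start + width] for s in series]
    n = len(cut)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r = pearson(cut[i], cut[j])
            d[i][j] = d[j][i] = math.sqrt(max(0.0, 2.0 * (1.0 - r)))
    return d


def stress(d, pts):
    """Sum of squared differences between target and embedded distances."""
    n = len(d)
    return sum((d[i][j] - math.dist(pts[i], pts[j])) ** 2
               for i in range(n) for j in range(i + 1, n))


def smacof_step(d, pts):
    """One unweighted SMACOF update (Guttman transform)."""
    n, dim = len(pts), len(pts[0])
    new = [[0.0] * dim for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dist = math.dist(pts[i], pts[j])
            if dist > 1e-12:
                ratio = d[i][j] / dist
                for k in range(dim):
                    new[i][k] += ratio * (pts[i][k] - pts[j][k])
    return [[v / n for v in row] for row in new]


def embed_windows(series, width, step, iterations=50, seed=0):
    """Embed each sliding window in 3D, seeding each frame with the previous one."""
    rng = random.Random(seed)
    pts = [[rng.uniform(-1, 1) for _ in range(3)] for _ in series]
    frames = []
    for start in range(0, len(series[0]) - width + 1, step):
        d = window_distances(series, start, width)
        for _ in range(iterations):
            pts = smacof_step(d, pts)
        frames.append(pts)
    return frames


# Synthetic daily returns for 6 hypothetical stocks over 60 days.
rng = random.Random(1)
returns = [[rng.gauss(0, 1) for _ in range(60)] for _ in range(6)]
frames = embed_windows(returns, width=20, step=10)
print(len(frames), "frames of", len(frames[0]), "points in 3D")
```

Reusing the previous layout as the starting point for the next window is one simple way to avoid arbitrary rotations between consecutive frames; other alignment strategies are possible.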
Experiments available:
Data analytics for IoT devices in Cloud
We developed a framework to bring data from IoT devices to a cloud
environment for real-time data analysis. The framework consists of
data collection nodes near the devices, publish-subscribe brokers to
bring data to the cloud, and Apache Storm coupled with other batch
processing engines for data processing in the cloud. Our data pipeline is
Robot → Gateway → Message Brokers → Apache Storm.
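
The flow of a reading through this pipeline can be mimicked in a few lines. The sketch below is an in-process toy, not the framework itself: `Broker`, `Gateway`, and `RunningMean` are hypothetical stand-ins for a real message broker, a data collection node, and a stream-processing stage, and the topic name and readings are invented.

```python
from collections import defaultdict


class Broker:
    """In-process stand-in for a publish-subscribe message broker."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Deliver the message to every subscriber of the topic.
        for callback in self._subscribers[topic]:
            callback(message)


class Gateway:
    """Data collection node near the device; forwards readings to the broker."""

    def __init__(self, broker, topic):
        self.broker = broker
        self.topic = topic

    def on_reading(self, device_id, value):
        self.broker.publish(self.topic, {"device": device_id, "value": value})


class RunningMean:
    """Stand-in for a stream-processing stage: per-device running mean."""

    def __init__(self):
        self.count = defaultdict(int)
        self.total = defaultdict(float)

    def __call__(self, message):
        self.count[message["device"]] += 1
        self.total[message["device"]] += message["value"]

    def mean(self, device_id):
        return self.total[device_id] / self.count[device_id]


# Robot -> Gateway -> Broker -> processing stage, with made-up readings.
broker = Broker()
processor = RunningMean()
broker.subscribe("robot/range", processor)
gateway = Gateway(broker, "robot/range")
for reading in (1.0, 2.0, 3.0):
    gateway.on_reading("robot-1", reading)
print(processor.mean("robot-1"))
```

Because the gateway only knows the broker and a topic name, the processing stage can be swapped or scaled without touching the device side, which is the main appeal of placing brokers between devices and the cloud.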
WebPlotViz – Browser Visualization
of High Dimensional Data
WebPlotViz is a 2D/3D data point browser that can visualize very large
volumes of 2D or 3D data as points in a virtual space and enable users
to explore the virtual space interactively. WebPlotViz also includes
support for time series data plots.
446K sequences and ~100 clusters visualized in WebPlotViz
Kaleidoscope of (Apache) Big Data Stack (ABDS) and HPC Technologies
1) Message
and Data
Avro, Thrift,
2) Distributed
: Google
3) Security &
LDAP, Sentry,
Sqrrl, OpenID,
Nagios, Inca
17) Workflow-Orchestration: ODE, ActiveBPEL, Airavata, Pegasus, Kepler, Swift, Taverna, Triana, Trident, BioKepler, Galaxy, IPython, Dryad,
Naiad, Oozie, Tez, Google FlumeJava, Crunch, Cascading, Scalding, e-Science Central, Azure Data Factory, Google Cloud Dataflow, NiFi (NSA),
Jitterbit, Talend, Pentaho, Apatar, Docker Compose, KeystoneML
16) Application and Analytics: Mahout, MLlib, MLbase, DataFu, R, pbdR, Bioconductor, ImageJ, OpenCV, ScaLAPACK, PETSc, PLASMA, MAGMA,
Azure Machine Learning, Google Prediction API & Translation API, mlpy, scikit-learn, PyBrain, CompLearn, DAAL(Intel), Caffe, Torch, Theano, DL4j,
H2O, IBM Watson, Oracle PGX, GraphLab, GraphX, IBM System G, GraphBuilder(Intel), TinkerPop, Parasol, Dream:Lab, Google Fusion Tables,
CINET, NWB, Elasticsearch, Kibana, Logstash, Graylog, Splunk, Tableau, D3.js, three.js, Potree, DC.js, TensorFlow, CNTK
15B) Application Hosting Frameworks: Google App Engine, AppScale, Red Hat OpenShift, Heroku, Aerobatic, AWS Elastic Beanstalk, Azure, Cloud
Foundry, Pivotal, IBM BlueMix, Ninefold, Jelastic, Stackato, appfog, CloudBees, Engine Yard, CloudControl, dotCloud, Dokku, OSGi, HUBzero, OODT,
Agave, Atmosphere
15A) High level Programming: Kite, Hive, HCatalog, Tajo, Shark, Phoenix, Impala, MRQL, SAP HANA, HadoopDB, PolyBase, Pivotal HD/Hawq,
Presto, Google Dremel, Google BigQuery, Amazon Redshift, Drill, Kyoto Cabinet, Pig, Sawzall, Google Cloud DataFlow, Summingbird
14B) Streams: Storm, S4, Samza, Granules, Neptune, Google MillWheel, Amazon Kinesis, LinkedIn, Twitter Heron, Databus, Facebook
Puma/Ptail/Scribe/ODS, Azure Stream Analytics, Floe, Spark Streaming, Flink Streaming, DataTurbine
14A) Basic Programming model and runtime, SPMD, MapReduce: Hadoop, Spark, Twister, MR-MPI, Stratosphere (Apache Flink), Reef, Disco,
Hama, Giraph, Pregel, Pegasus, Ligra, GraphChi, Galois, Medusa-GPU, MapGraph, Totem
13) Inter process communication Collectives, point-to-point, publish-subscribe: MPI, HPX-5, Argo BEAST HPX-5 BEAST PULSAR, Harp, Netty,
ZeroMQ, ActiveMQ, RabbitMQ, NaradaBrokering, QPid, Kafka, Kestrel, JMS, AMQP, Stomp, MQTT, Marionette Collective, Public Cloud: Amazon
SNS, Lambda, Google Pub Sub, Azure Queues, Event Hubs
12) In-memory databases/caches: Gora (general object from NoSQL), Memcached, Redis, LMDB (key value), Hazelcast, Ehcache, Infinispan, VoltDB,
12) Object-relational mapping: Hibernate, OpenJPA, EclipseLink, DataNucleus, ODBC/JDBC
12) Extraction Tools: UIMA, Tika
11C) SQL(NewSQL): Oracle, DB2, SQL Server, SQLite, MySQL, PostgreSQL, CUBRID, Galera Cluster, SciDB, Rasdaman, Apache Derby, Pivotal
Greenplum, Google Cloud SQL, Azure SQL, Amazon RDS, Google F1, IBM dashDB, N1QL, BlinkDB, Spark SQL
11B) NoSQL: Lucene, Solr, Solandra, Voldemort, Riak, ZHT, Berkeley DB, Kyoto/Tokyo Cabinet, Tycoon, Tyrant, MongoDB, Espresso, CouchDB,
Couchbase, IBM Cloudant, Pivotal Gemfire, HBase, Google Bigtable, LevelDB, Megastore and Spanner, Accumulo, Cassandra, RYA, Sqrrl, Neo4J,
graphdb, Yarcdata, AllegroGraph, Blazegraph, Facebook Tao, Titan:db, Jena, Sesame
Public Cloud: Azure Table, Amazon Dynamo, Google DataStore
11A) File management: iRODS, NetCDF, CDF, HDF, OPeNDAP, FITS, RCFile, ORC, Parquet
10) Data Transport: BitTorrent, HTTP, FTP, SSH, Globus Online (GridFTP), Flume, Sqoop, Pivotal GPLOAD/GPFDIST
9) Cluster Resource Management: Mesos, Yarn, Helix, Llama, Google Omega, Facebook Corona, Celery, HTCondor, SGE, OpenPBS, Moab, Slurm,
Torque, Globus Tools, Pilot Jobs
8) File systems: HDFS, Swift, Haystack, f4, Cinder, Ceph, FUSE, Gluster, Lustre, GPFS, GFFS
Public Cloud: Amazon S3, Azure Blob, Google Cloud Storage
7) Interoperability: Libvirt, Libcloud, JClouds, TOSCA, OCCI, CDMI, Whirr, Saga, Genesis
6) DevOps: Docker (Machine, Swarm), Puppet, Chef, Ansible, SaltStack, Boto, Cobbler, Xcat, Razor, CloudMesh, Juju, Foreman, OpenStack Heat,
Sahara, Rocks, Cisco Intelligent Automation for Cloud, Ubuntu MaaS, Facebook Tupperware, AWS OpsWorks, OpenStack Ironic, Google Kubernetes,
Buildstep, Gitreceive, OpenTOSCA, Winery, CloudML, Blueprints, Terraform, DevOpSlang, Any2Api
5) IaaS Management from HPC to hypervisors: Xen, KVM, QEMU, Hyper-V, VirtualBox, OpenVZ, LXC, Linux-Vserver, OpenStack, OpenNebula,
Eucalyptus, Nimbus, CloudStack, CoreOS, rkt, VMware ESXi, vSphere and vCloud, Amazon, Azure, Google and other public Clouds
Networking: Google Cloud DNS, Amazon Route 53
HPC-ABDS Apache Big Data Stack
Simultaneous Localization and Mapping (SLAM) is an example
application built on top of our framework, where we exploit parallel
data processing to speed up the expensive SLAM computation.
Design and Build Scalable High Performance Data Analytics Library
SPIDAL (Scalable Parallel Interoperable Data Analytics Library): Scalable
Analytics for:
Domain specific data analytics libraries mainly from project.
Add Core Machine learning libraries mainly from community.
Performance of Java and MIDAS Inter- and Intra-node
NIST Big Data Application Analysis: features of data intensive Applications
deriving 50 Ogres and 64 Convergence Diamonds. Application Nexus.
HPC-ABDS: Cloud-HPC interoperable software with the performance of HPC (High
Performance Computing) and the rich functionality of the commodity Apache Big
Data Stack. Software Nexus
MIDAS: Integrating Middleware from project.
Applications: Biomolecular Simulations, Network and Computational Social
Science, Epidemiology, Computer Vision, Geographical Information Systems,
Remote Sensing for Polar Science and Pathology Informatics, Streaming for
robotics, streaming stock analytics
Implementations: HPC as well as clouds (OpenStack, Docker). Convergence with
common DevOps tool. Hardware Nexus
Main Components of SPIDAL Project
Classification of Application
Initial investigation of application characteristics to
define/develop classification
Event size, synchronicity, time & length scales...
Need to enhance with industry/research use case
comparison: industry has many small events; research events are often
large (as are those of self-driving cars)
Current software solutions
Impressive commercial solutions for commercial
applications: applicability to science and Government (e.g.
DoE) unclear.
Plethora of “local point” solutions (see report for detailed
listing) but few end-to-end general streaming
infrastructures outside open sourced big data systems
(Apache Spark, Flink, Storm, Samza).
Opens up issues in distributed computing, e.g.,
performance, fault-tolerance, dynamic resource
NSF, DoE and AFOSR funding
October 27-28, 2015, Indianapolis, STREAM2015
March 22-23, 2016, Washington DC, STREAM2016
Summary of Streaming Workshops