Digital Science Center III

Abstract

This poster covers the streaming workshops held, IoTCloud for cloud control of robots, the SPIDAL project, HPC-ABDS, WebPlotViz visualization and stock market data, and scientific paper impact analysis for XSEDE.
Geoffrey C. Fox, David Crandall, Judy Qiu, Gregor von Laszewski, Fugang Wang, Badi' Abdul-Wahid,
Saliya Ekanayake, Supun Kamburugamuva, Jerome Mitchell, Bingjing Zhang, Pulasthi Wickramasinghe, Hyungro Lee, Andrew Younge
School of Informatics and Computing, Indiana University
Scientific Impact Metrics
We developed a software framework and process to evaluate scientific
impact for XSEDE. We calculate and track various Scientific Impact
Metrics of XSEDE, Blue Waters, and NCAR. Recently we conducted
an updated peer comparison analysis with newly added and updated
data, which shows results consistent with the previous one. During this
process we retrieved and processed millions of data entries from
multiple sources in various formats to obtain the result.
Table comparing XSEDE publication citation metrics with peers

         # Publications   Rank (Average)   Rank (Median)   # Citations (Average)   # Citations (Median)
XSEDE    5,081            59               63              28                      12
Peers    356k             49               48              15                      5

Figures tracking various impact metrics for XSEDE (#pubs; #citations; H-Index; G-Index; etc.)
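
As an illustration of two of the metrics named here, the H-Index and G-Index can be computed from a list of per-publication citation counts. This is a minimal sketch using the standard textbook definitions, not the framework developed for XSEDE; the function names are hypothetical, and the G-Index here is capped at the number of publications (one common convention).

```python
def h_index(citations):
    """Largest h such that h publications each have at least h citations."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def g_index(citations):
    """Largest g such that the top g publications total at least g*g citations.

    Assumption: g is capped at the number of publications.
    """
    ranked = sorted(citations, reverse=True)
    total = 0
    g = 0
    for rank, count in enumerate(ranked, start=1):
        total += count
        if total >= rank * rank:
            g = rank
    return g


if __name__ == "__main__":
    sample = [10, 8, 5, 4, 3]  # made-up citation counts for illustration
    print(h_index(sample), g_index(sample))
```

The G-Index gives more weight to highly cited publications than the H-Index, which is why both are often tracked side by side.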
Visualization of Stock Market as a High Dimensional Time Series
In this project we visualize stocks by projecting the correlations between
stocks through time into 3D using MDS. Years of historical daily stock
data are segmented using a sliding window approach to create a
continuous visualization through time.
Figures: trajectories of stocks through time; stock visualization of one time frame.
Data: Daily stock values were obtained from the Center for Research in
Security Prices (CRSP) database through the Wharton Research
Data Services (WRDS) web interface.
Experiments available:
https://spidal-gw.dsc.soic.indiana.edu/groupdashboard/Stocks
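
The sliding-window segmentation described in this section can be sketched as follows. This is an illustration only, not the project's code: the function names, the use of Pearson correlation on raw daily values, and the distance d = sqrt(2(1 - r)) are all assumptions made for the example. Each window yields one distance matrix, which would then be handed to an MDS routine to obtain the 3D coordinates for that time frame.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # constant series: treated as uncorrelated (assumption)
    return cov / math.sqrt(vx * vy)


def sliding_windows(length, window, step):
    """Yield (start, end) index pairs covering a series of the given length."""
    for start in range(0, length - window + 1, step):
        yield start, start + window


def distance_matrices(prices, window, step):
    """Build one correlation-based distance matrix per time window.

    prices maps a ticker to its list of daily values (all the same length).
    Returns the sorted ticker list and a list of ((start, end), matrix) pairs.
    """
    tickers = sorted(prices)
    length = len(prices[tickers[0]])
    frames = []
    for start, end in sliding_windows(length, window, step):
        segments = [prices[t][start:end] for t in tickers]
        dist = [[0.0] * len(tickers) for _ in tickers]
        for i in range(len(tickers)):
            for j in range(i + 1, len(tickers)):
                r = pearson(segments[i], segments[j])
                # Assumed mapping from correlation to distance.
                d = math.sqrt(max(0.0, 2.0 * (1.0 - r)))
                dist[i][j] = dist[j][i] = d
        frames.append(((start, end), dist))
    return tickers, frames
```

Overlapping windows (a step smaller than the window length) make consecutive distance matrices similar, which is what lets the per-frame projections read as continuous trajectories.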
Data analytics for IoT devices in Cloud
We developed a framework to bring data from IoT devices to a cloud
environment for real-time data analysis. The framework consists of
data collection nodes near the devices, publish-subscribe brokers to
bring data to the cloud, and Apache Storm coupled with batch
processing engines for data processing in the cloud. Our data pipeline is
Robot -> Gateway -> Message Brokers -> Apache Storm.
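
The shape of this pipeline can be sketched with a small in-process stand-in. Everything in the example is hypothetical: the `Broker`, `Gateway`, and `RunningMean` classes are illustrative substitutes for a real message broker, a data collection node, and a stream-processing step, and they do not use or imitate the Apache Storm API.

```python
from collections import defaultdict


class Broker:
    """In-process stand-in for a publish-subscribe message broker."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Deliver the message to every subscriber of the topic.
        for callback in self._subscribers[topic]:
            callback(message)


class Gateway:
    """Collects readings near the device and forwards them to the broker."""

    def __init__(self, broker, topic):
        self.broker = broker
        self.topic = topic

    def forward(self, device_id, reading):
        self.broker.publish(self.topic, {"device": device_id, "value": reading})


class RunningMean:
    """Streaming processing step: keeps a per-device running mean."""

    def __init__(self):
        self.count = defaultdict(int)
        self.mean = defaultdict(float)

    def __call__(self, message):
        dev = message["device"]
        self.count[dev] += 1
        # Incremental mean update, so no history needs to be stored.
        self.mean[dev] += (message["value"] - self.mean[dev]) / self.count[dev]


if __name__ == "__main__":
    broker = Broker()
    processor = RunningMean()
    broker.subscribe("sensors", processor)
    gateway = Gateway(broker, "sensors")
    for value in (1.0, 2.0, 3.0):
        gateway.forward("robot-1", value)
    print(processor.mean["robot-1"])
```

Decoupling the gateway from the processing step through topics is what allows the cloud side to scale or change independently of the devices.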
WebPlotViz – Browser Visualization of High Dimensional Data
WebPlotViz is a 2D/3D data point browser that can visualize very large
volumes of 2D or 3D data as points in a virtual space and enables users
to explore the virtual space interactively. WebPlotViz also includes
support for time series data plots.
Figure: 446K sequences and ~100 clusters visualized in WebPlotViz
Kaleidoscope of (Apache) Big Data Stack (ABDS) and HPC Technologies
Cross-Cutting Functions
1) Message and Data Protocols: Avro, Thrift, Protobuf
2) Distributed Coordination: Google Chubby, Zookeeper, Giraffe, JGroups
3) Security & Privacy: InCommon, Eduroam, OpenStack Keystone, LDAP, Sentry, Sqrrl, OpenID, SAML, OAuth
4) Monitoring: Ambari, Ganglia, Nagios, Inca
17) Workflow-Orchestration: ODE, ActiveBPEL, Airavata, Pegasus, Kepler, Swift, Taverna, Triana, Trident, BioKepler, Galaxy, IPython, Dryad,
Naiad, Oozie, Tez, Google FlumeJava, Crunch, Cascading, Scalding, e-Science Central, Azure Data Factory, Google Cloud Dataflow, NiFi (NSA),
Jitterbit, Talend, Pentaho, Apatar, Docker Compose, KeystoneML
16) Application and Analytics: Mahout, MLlib, MLbase, DataFu, R, pbdR, Bioconductor, ImageJ, OpenCV, Scalapack, PetSc, PLASMA, MAGMA,
Azure Machine Learning, Google Prediction API & Translation API, mlpy, scikit-learn, PyBrain, CompLearn, DAAL(Intel), Caffe, Torch, Theano, DL4j,
H2O, IBM Watson, Oracle PGX, GraphLab, GraphX, IBM System G, GraphBuilder(Intel), TinkerPop, Parasol, Dream:Lab, Google Fusion Tables,
CINET, NWB, Elasticsearch, Kibana, Logstash, Graylog, Splunk, Tableau, D3.js, three.js, Potree, DC.js, TensorFlow, CNTK
15B) Application Hosting Frameworks: Google App Engine, AppScale, Red Hat OpenShift, Heroku, Aerobatic, AWS Elastic Beanstalk, Azure, Cloud
Foundry, Pivotal, IBM BlueMix, Ninefold, Jelastic, Stackato, appfog, CloudBees, Engine Yard, CloudControl, dotCloud, Dokku, OSGi, HUBzero, OODT,
Agave, Atmosphere
15A) High level Programming: Kite, Hive, HCatalog, Tajo, Shark, Phoenix, Impala, MRQL, SAP HANA, HadoopDB, PolyBase, Pivotal HD/Hawq,
Presto, Google Dremel, Google BigQuery, Amazon Redshift, Drill, Kyoto Cabinet, Pig, Sawzall, Google Cloud DataFlow, Summingbird
14B) Streams: Storm, S4, Samza, Granules, Neptune, Google MillWheel, Amazon Kinesis, LinkedIn, Twitter Heron, Databus, Facebook
Puma/Ptail/Scribe/ODS, Azure Stream Analytics, Floe, Spark Streaming, Flink Streaming, DataTurbine
14A) Basic Programming model and runtime, SPMD, MapReduce: Hadoop, Spark, Twister, MR-MPI, Stratosphere (Apache Flink), Reef, Disco,
Hama, Giraph, Pregel, Pegasus, Ligra, GraphChi, Galois, Medusa-GPU, MapGraph, Totem
13) Inter process communication Collectives, point-to-point, publish-subscribe: MPI, HPX-5, Argo BEAST HPX-5 BEAST PULSAR, Harp, Netty,
ZeroMQ, ActiveMQ, RabbitMQ, NaradaBrokering, QPid, Kafka, Kestrel, JMS, AMQP, Stomp, MQTT, Marionette Collective, Public Cloud: Amazon
SNS, Lambda, Google Pub Sub, Azure Queues, Event Hubs
12) In-memory databases/caches: Gora (general object from NoSQL), Memcached, Redis, LMDB (key value), Hazelcast, Ehcache, Infinispan, VoltDB,
H-Store
12) Object-relational mapping: Hibernate, OpenJPA, EclipseLink, DataNucleus, ODBC/JDBC
12) Extraction Tools: UIMA, Tika
11C) SQL(NewSQL): Oracle, DB2, SQL Server, SQLite, MySQL, PostgreSQL, CUBRID, Galera Cluster, SciDB, Rasdaman, Apache Derby, Pivotal
Greenplum, Google Cloud SQL, Azure SQL, Amazon RDS, Google F1, IBM dashDB, N1QL, BlinkDB, Spark SQL
11B) NoSQL: Lucene, Solr, Solandra, Voldemort, Riak, ZHT, Berkeley DB, Kyoto/Tokyo Cabinet, Tycoon, Tyrant, MongoDB, Espresso, CouchDB,
Couchbase, IBM Cloudant, Pivotal Gemfire, HBase, Google Bigtable, LevelDB, Megastore and Spanner, Accumulo, Cassandra, RYA, Sqrrl, Neo4J,
graphdb, Yarcdata, AllegroGraph, Blazegraph, Facebook Tao, Titan:db, Jena, Sesame
Public Cloud: Azure Table, Amazon Dynamo, Google DataStore
11A) File management: iRODS, NetCDF, CDF, HDF, OPeNDAP, FITS, RCFile, ORC, Parquet
10) Data Transport: BitTorrent, HTTP, FTP, SSH, Globus Online (GridFTP), Flume, Sqoop, Pivotal GPLOAD/GPFDIST
9) Cluster Resource Management: Mesos, Yarn, Helix, Llama, Google Omega, Facebook Corona, Celery, HTCondor, SGE, OpenPBS, Moab, Slurm,
Torque, Globus Tools, Pilot Jobs
8) File systems: HDFS, Swift, Haystack, f4, Cinder, Ceph, FUSE, Gluster, Lustre, GPFS, GFFS
Public Cloud: Amazon S3, Azure Blob, Google Cloud Storage
7) Interoperability: Libvirt, Libcloud, JClouds, TOSCA, OCCI, CDMI, Whirr, Saga, Genesis
6) DevOps: Docker (Machine, Swarm), Puppet, Chef, Ansible, SaltStack, Boto, Cobbler, Xcat, Razor, CloudMesh, Juju, Foreman, OpenStack Heat,
Sahara, Rocks, Cisco Intelligent Automation for Cloud, Ubuntu MaaS, Facebook Tupperware, AWS OpsWorks, OpenStack Ironic, Google Kubernetes,
Buildstep, Gitreceive, OpenTOSCA, Winery, CloudML, Blueprints, Terraform, DevOpSlang, Any2Api
5) IaaS Management from HPC to hypervisors: Xen, KVM, QEMU, Hyper-V, VirtualBox, OpenVZ, LXC, Linux-Vserver, OpenStack, OpenNebula,
Eucalyptus, Nimbus, CloudStack, CoreOS, rkt, VMware ESXi, vSphere and vCloud, Amazon, Azure, Google and other public Clouds
Networking: Google Cloud DNS, Amazon Route 53
21 layers, over 350 software packages (January 29, 2016)
HPC-ABDS Apache Big Data Stack
Simultaneous Localization and Mapping (SLAM) is an example
application built on top of our framework, where we exploit parallel
data processing to speed up the expensive SLAM computation.
Main Components of SPIDAL Project
Classification of Application
Initial investigation of application characteristics to
define/develop a classification.
Event size, synchronicity, time and length scales.
Need to enhance with an industry/research use case
comparison: industry has many small events; research events are often
large (as are those of self-driving cars).
Current software solutions
Impressive commercial solutions for commercial
applications: applicability to science and government (e.g.
DoE) unclear.
Plethora of “local point” solutions (see report for detailed
listing) but few end-to-end general streaming
infrastructures outside open source big data systems
(Apache Spark, Flink, Storm, Samza).
Opens up issues in distributed computing, e.g.,
performance, fault tolerance, dynamic resource
management.
NSF, DoE and AFOSR funding
October 27-28, 2015, Indianapolis: STREAM2015
March 22-23, 2016, Washington DC: STREAM2016
Summary of Streaming Workshops