About
187 Publications · 51,802 Reads · 8,795 Citations
Publications (187)
Ensuring the success of big graph processing for the next decade and beyond.
Pattern matching is a fundamental tool for answering complex graph queries. Unfortunately, existing solutions have limited capabilities: They do not scale to process large graphs and/or support only a restricted set of search templates or usage scenarios. Moreover, the algorithms at the core of the existing techniques are not suitable for today’s g...
Graphs are by nature unifying abstractions that can leverage interconnectedness to represent, explore, predict, and explain real- and digital-world phenomena. Although real users and consumers of graph instances and graph workloads understand these abstractions, future problems will require new abstractions and systems. What needs to happen in the...
Pattern matching is a fundamental tool for answering complex graph queries. Unfortunately, existing solutions have limited capabilities: they do not scale to process large graphs and/or support only a restricted set of search templates or usage scenarios. We present an algorithmic pipeline that bases pattern matching on constraint checking. The key...
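The core idea of basing pattern matching on constraint checking can be illustrated with a small sketch (hypothetical graphs and labels; a minimal local-consistency prune, not the full pipeline described above): each data-graph vertex keeps the set of template vertices it could still match, and a candidate is dropped when one of its template neighbors can no longer be covered by any neighbor of the data vertex.

```python
# Minimal sketch of constraint-checking-based pattern matching pruning.
# Hypothetical toy graphs; not the paper's exact algorithmic pipeline.

def prune_candidates(data_adj, data_label, tmpl_adj, tmpl_label):
    # Initial candidates: template vertices with the same label.
    cand = {v: {t for t in tmpl_label if tmpl_label[t] == data_label[v]}
            for v in data_adj}
    changed = True
    while changed:
        changed = False
        for v in data_adj:
            for t in list(cand[v]):
                # t survives only if every template neighbor of t can be
                # covered by some neighbor of v (local constraint check).
                ok = all(any(tn in cand[u] for u in data_adj[v])
                         for tn in tmpl_adj[t])
                if not ok:
                    cand[v].discard(t)
                    changed = True
    return cand

# Toy example: a labeled triangle template A-B-C searched in a small graph.
tmpl_adj = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B"}}
tmpl_label = {"A": "a", "B": "b", "C": "c"}
data_adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}, 4: {3}}
data_label = {1: "a", 2: "b", 3: "c", 4: "a"}
print(prune_candidates(data_adj, data_label, tmpl_adj, tmpl_label))
```

Vertex 4 is eliminated because its single neighbor cannot cover the template's adjacency constraints, while the triangle 1-2-3 survives as a candidate match.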
In the face of large-scale automated social engineering attacks on large online services, fast detection and remediation of compromised accounts are crucial to limit the spread of the attack and to mitigate the overall damage to users, companies, and the public at large. We advocate a fully automated approach based on machine learning: we develop a...
Detectable but Uncorrectable Errors (DUEs) in the memory subsystem are becoming increasingly frequent. Today, upon encountering a DUE, applications crash, and the recovery methods used incur significant performance, storage, and energy overheads. To mitigate the impact of these errors, we start from two high-level observations that apply to some cl...
In the face of large-scale automated social engineering attacks on large online services, fast detection and remediation of compromised accounts are crucial to limit the spread of new attacks and to mitigate the overall damage to users, companies, and the public at large. We advocate a fully automated approach based on machine learning: we develop...
In the face of large-scale automated cyber-attacks on large online services, fast detection and remediation of compromised accounts are crucial to limit the spread of new attacks and to mitigate the overall damage to users, companies, and the public at large. We advocate a fully automated approach based on machine learning to enable large-scale onl...
This study characterizes the NVIDIA Jetson TK1 and TX1 Platforms, both built on a NVIDIA Tegra System on Chip and combining a quad-core ARM CPU and an NVIDIA GPU. Their heterogeneous nature, as well as their wide operating frequency range, make it hard for application developers to reason about performance and determine which optimizations are wort...
Requirements for reliability, low power consumption, and performance place complex and conflicting demands on the design of high-performance computing (HPC) systems. Fault-tolerance techniques such as checkpoint/restart (C/R) protect HPC applications against hardware faults. These techniques, however, have non-negligible overheads, particularly when...
Interferometric Synthetic Aperture Radar (InSAR) is a remote sensing technology used for estimating the displacement of an object on the ground or the earth's surface itself. Persistent Scatterer-InSAR (PS-InSAR) is a category of time series algorithms enabling high resolution monitoring. PS-InSAR relies on successful selection of points that appea...
This paper proposes using file system custom metadata as a bidirectional communication channel between applications and the storage middleware. This channel can be used to pass hints that enable cross-layer optimizations, an option hindered today by the ossified file-system interface. We study this approach in the context of storage system support...
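As a rough illustration of the custom-metadata hint channel, the sketch below uses Linux extended attributes to attach an access-pattern hint to a file and read it back on the storage side. The attribute name and hint values are illustrative assumptions, not an interface defined by the paper, and the code assumes a file system with extended-attribute support.

```python
# Sketch: file metadata as a hint channel between application and storage.
import os

path = "intermediate_output.dat"
open(path, "wb").close()

# Application side: annotate the file with its expected access pattern.
os.setxattr(path, "user.workflow.hint", b"write-once-read-many")

# Storage-middleware side: read the hint and choose a policy accordingly.
hint = os.getxattr(path, "user.workflow.hint").decode()
replication = 3 if hint == "write-once-read-many" else 1
print(f"hint={hint!r} -> replication factor {replication}")
```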
The wide adoption of graphics processing units (GPUs) as accelerators for general-purpose applications makes the end-to-end reliability implications of their use increasingly significant. Fault injection is a widely adopted method to evaluate the resilience of applications. However, building a fault injector for general-purpose GPU applications is...
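Fault injection of this kind is often approximated by flipping a single bit in a value the program produces and comparing the outcome against a fault-free run. The sketch below shows that idea on a plain CPU-side value; it is illustrative only, not the GPU injector described above.

```python
# Minimal single-bit-flip fault injection sketch (illustrative stand-in for
# the register/memory faults a GPU fault injector would emulate).
import random
import struct

def flip_one_bit(x: float) -> float:
    """Flip one randomly chosen bit of a double's IEEE-754 representation."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    bits ^= 1 << random.randrange(64)
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

random.seed(1)
golden = sum(i * 0.5 for i in range(10))   # fault-free "golden" result
faulty = flip_one_bit(golden)              # result after the injected fault
# Comparing faulty vs. golden outputs classifies the outcome:
# identical -> benign, different -> silent data corruption, NaN/crash -> detected.
print(golden, faulty, "SDC" if faulty != golden else "benign")
```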
The orthodox paradigm to defend against automated social-engineering attacks in large-scale socio-technical systems is reactive and victim-agnostic. Defenses generally focus on identifying the attacks/attackers (e.g., phishing emails, social-bot infiltrations, malware offered for download). To change the status quo, we propose to identify, even if...
Detecting fake accounts in online social networks (OSNs) protects both OSN operators and their users from various malicious activities. Most detection mechanisms attempt to classify user accounts as real (i.e., benign, honest) or fake (i.e., malicious, Sybil) by analyzing either user-level activities or graph-level structures. These mechanisms, how...
Traditional defense mechanisms for fighting against automated fake accounts in online social networks are victim-agnostic. Even though victims of fake accounts play an important role in the viability of subsequent attacks, there is no work on utilizing this insight to improve the status quo. In this position paper, we take the first step and propos...
Tagging is a popular feature that supports several collaborative tasks, including search, as tags produced by one user can help others find relevant content. However, task performance depends on the existence of 'good' tags. A first step towards creating incentives for users to produce 'good' tags is the quantification of their value in the firs...
The Big Data challenge consists of managing, storing, analyzing, and visualizing these huge and ever-growing data sets to extract sense and knowledge. As the volume of data grows exponentially, the management of these data becomes proportionally more complex. A key point is to handle the complexity of the data life cycle, i.e. the various operations...
Large scale-free graphs are famously difficult to process efficiently: the highly skewed vertex degree distribution makes it difficult to obtain balanced workload partitions for parallel processing. Our research instead aims to take advantage of vertex degree heterogeneity by partitioning the workload to match the strength of the individual computi...
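A minimal sketch of that intuition, with an assumed degree threshold and an assumed CPU/GPU assignment (not the exact policy of this work): route the few high-degree vertices to the processor better suited to irregular, latency-bound work, and the many low-degree vertices to the massively parallel unit.

```python
# Sketch: degree-aware workload partitioning for a hybrid CPU+GPU system.
# Threshold and assignment are illustrative assumptions.

def partition_by_degree(adjacency, degree_threshold=64):
    cpu_part, gpu_part = [], []
    for vertex, neighbors in adjacency.items():
        (cpu_part if len(neighbors) > degree_threshold else gpu_part).append(vertex)
    return cpu_part, gpu_part

# Toy scale-free-ish graph: one hub plus many low-degree vertices.
adjacency = {0: set(range(1, 101))}
adjacency.update({v: {0} for v in range(1, 101)})
cpu_part, gpu_part = partition_by_degree(adjacency)
print(len(cpu_part), "vertices on CPU,", len(gpu_part), "on GPU")
```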
Multimedia content is central to our experience on the Web. Specifically, users frequently search and watch videos online. The textual features that accompany such content (e.g., title, description, and tags) can generally be optimized to attract more search traffic and ultimately to increase the advertisement-generated revenue. This study investig...
As workflow-based data-intensive applications have become increasingly popular, the lack of support tools to aid resource provisioning decisions, to estimate the energy cost of running such applications, or simply to support configuration choices has become increasingly evident. Our goal is to design techniques to predict the energy consumption of...
Developing a distributed system is a complex and error-prone task. Properly handling the interaction of a potentially large number of distributed components while keeping resource usage low and performance high is challenging. The state-of-the-practice on performance evaluation focuses on employing profilers to detect and fix potential performance...
Graphs are widespread data structures used to model a wide variety of problems. The sheer amount of data to be processed has prompted the creation of a myriad of systems that help us cope with massive scale graphs. The pressure to deliver fast responses to queries on the graph is higher than ever before, as it is demanded by many applications (e.g....
A large portion of the audience of video content items on the web currently comes from keyword-based search and/or tag-based navigation. Thus, the textual features of this content (e.g., the title, description, and tags) can directly impact the view count of a particular content item, and ultimately the advertisement-generated revenue. More importa...
Infrastructure-as-a-Service (IaaS) clouds are an appealing resource for scientific computing. However, the bare-bones presentation of raw Linux virtual machines leaves much to the application developer. For many cloud applications, effective data handling is critical to efficient application execution. This paper investigates the capabilities of a...
System provisioning, resource allocation, and system configuration decisions for I/O-intensive workflow applications are complex even for expert users. Users face choices at multiple levels: allocating resources to individual sub-systems (e.g., the application layer, the storage layer) and configuring each of these optimally (e.g., replication leve...
Deduplication is a commonly-used technique on disk-based storage pools. However, deduplication has not been used for tape-based pools: tape characteristics, such as high mount and seek times, combined with data fragmentation resulting from deduplication create a toxic combination that leads to unacceptably high retrieval times. This work proposes...
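For reference, a minimal content-addressed deduplication sketch (fixed-size chunks, SHA-256 fingerprints) shows where the fragmentation comes from: each unique chunk is stored only once, so a file's chunks end up scattered across the shared pool. This is illustrative only, not the tape-aware scheme proposed in this work.

```python
# Minimal fixed-size-chunk deduplication sketch.
import hashlib

CHUNK_SIZE = 4096
chunk_store = {}          # fingerprint -> chunk bytes (the shared pool)

def deduplicate(data: bytes):
    """Split data into chunks, store unique ones, return the file's recipe."""
    recipe = []
    for off in range(0, len(data), CHUNK_SIZE):
        chunk = data[off:off + CHUNK_SIZE]
        fp = hashlib.sha256(chunk).hexdigest()
        chunk_store.setdefault(fp, chunk)
        recipe.append(fp)
    return recipe

def restore(recipe):
    return b"".join(chunk_store[fp] for fp in recipe)

data = b"A" * 8192 + b"B" * 4096 + b"A" * 4096      # repeated content
recipe = deduplicate(data)
print(len(recipe), "chunks referenced,", len(chunk_store), "stored uniquely")
assert restore(recipe) == data
```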
As a consequence of increasing hardware fault rates, HPC systems face significant challenges in terms of reliability. Evaluating the error resilience of HPC applications is an essential step for building efficient fault-tolerant mechanisms for these applications. In this paper, we propose a methodology to characterize the resilience of OpenMP progr...
Online gaming is a multi-billion dollar industry that entertains a large, global population. One unfortunate phenomenon, however, poisons the competition and spoils the fun: cheating. The costs of cheating span from industry-supported expenditures to detect and limit it, to victims’ monetary losses due to cyber crime. This article studies cheaters...
The Silences of the Archives, the Renown of the Story.
The Martin Guerre affair has been told many times since Jean de Coras and Guillaume Lesueur published their stories in 1561. It is in many ways a perfect intrigue with uncanny resemblance, persuasive deception and a surprising end when the two Martins stood face to face, memory to memory, befor...
While graphics processing units (GPUs) have gained wide adoption as accelerators for general-purpose applications (GPGPU), the end-to-end reliability implications of their use have not been quantified. Fault injection is a widely used method for evaluating the reliability of applications. However, building a fault injector for GPGPU applications is...
Video content abounds on the Web. Although viewers may reach items via referrals, a large portion of the audience comes from keyword-based search. Consequently, the textual features of multimedia content (e.g., title, description, tags) will directly impact the view count of a particular item, and ultimately the advertisement-generated revenue. This...
The increasing scale and wealth of inter-connected data, such as those accrued by social network applications, demand the design of new techniques and platforms to efficiently derive actionable knowledge from large-scale graphs. However, real-world graphs are famously difficult to process efficiently. Not only do they have a large memory footprint, bu...
Data-intensive science offers new opportunities for innovation and discoveries, provided that large datasets can be handled efficiently. Data management for data-intensive science applications is challenging, requiring support for complex data life cycles, coordination across multiple sites, fault tolerance, and scalability to support tens of sites...
This paper investigates the power, energy, and performance characteristics of large-scale graph processing on hybrid (i.e., CPU and GPU) single-node systems. Graph processing can be accelerated on hybrid systems by properly mapping the graph-layout to processing units, such that the algorithmic tasks exercise each of the units where they perform be...
This paper proposes COntribution-based Incentive Design (COIN) as a general guideline for developing applications in the context of mobile crowd-sensing. As a case study, we apply the key ideas of crowd-sensing to design a smart parking system. The system encourages contribution by differentiating service and assigning tasks proactively to maintain...
Sybil attacks in social and information systems have serious security implications. Out of many defence schemes, Graph-based Sybil Detection (GSD) has received the greatest attention from both academia and industry. Even though many GSD algorithms exist, there is no analytical framework to reason about their design, especially as they make different assumptio...
Graph processing has gained renewed attention. The increasingly large scale and wealth of connected data, such as those accrued by social network applications, demand the design of new techniques and platforms to efficiently derive actionable information from large scale graphs. Hybrid systems that host processing units optimized for both fast sequen...
Configuring a storage system to better serve an application is a challenging task complicated by a multidimensional, discrete configuration space and the high cost of space exploration (e.g., by running the application with different storage configurations). To enable selecting the best configuration in a reasonable time, we design an end-to-end pe...
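One way to read the problem statement: with a performance predictor, selecting a configuration becomes a cheap search over a small discrete space instead of a series of costly real runs. The knobs and the toy predictor below are placeholders, not the system described above.

```python
# Sketch: pick a storage configuration using a performance predictor
# instead of running the application under every configuration.
from itertools import product

def predicted_runtime(config):
    """Stand-in for a learned or analytic performance model."""
    chunk_kb, replication, num_servers = config
    io_cost = 1000 / (num_servers * min(chunk_kb, 256) / 256)
    return io_cost + 5 * replication          # toy trade-off

search_space = product([64, 256, 1024],       # chunk size (KB)
                       [1, 2, 3],             # replication level
                       [2, 4, 8])             # storage servers
best = min(search_space, key=predicted_runtime)
print("best (chunk KB, replication, servers):", best)
```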
Online Social Networks (OSNs) have attracted millions of active users and have become an integral part of today’s web ecosystem. Unfortunately, in the wrong hands, OSNs can be used to harvest private user data, distribute malware, control botnets, perform surveillance, spread misinformation, and even influence algorithmic trading. Usually, an adver...
User-generated content is shaping the dynamics of the World Wide Web. Indeed, an increasingly large number of systems provide mechanisms to support the growing demand for content creation, sharing, and management. Tagging systems are a particular class of these systems where users share and collaboratively annotate content such as photos and URLs....
This paper proposes using file system custom metadata as a bidirectional communication channel between applications and the storage system. This channel can be used to pass hints that enable cross-layer optimizations, an option hindered today by the ossified file-system interface. We study this approach in the context of storage system support for larg...
An increasing number of mobile applications aim to enable "smart cities" by harnessing contributions from citizens armed with mobile devices that have sensing ability. However, there are few generally recognized guidelines for developing and deploying crowdsourcing-based solutions in mobile environments. This paper considers the design of a crowdso...
We present a preliminary evaluation of error-resilience of GPGPU applications. We find that, compared to CPUs, these platforms lead to a higher rate of silent data corruption, a major concern since these errors are not flagged at runtime and often remain latent. We also find that out-of-bound memory accesses are the most critical cause of crashes....
GPUs were originally designed for error-resilient workloads. Today, GPUs are used in error-sensitive applications, e.g. General Purpose GPU (GPGPU) applications. The goal of this project is to investigate the error resilience of GPGPU applications and understand their reliability characteristics. To this end, we employ fault injection on real G...
Crowdsourcing has inspired a variety of novel mobile applications. However, identifying common practices across different applications is still challenging. In this paper, we use smart parking as a case study to investigate features of crowdsourcing that may apply to other mobile applications. Based on this we derive principles for efficiently harn...
This paper presents an optimization mechanism to increase the performance of cloud services that transfer groups of deduplicated virtual machine (VM) images. This is necessary because the naive data transfer approach for groups of deduplicated VM images is extremely inefficient: it generates a highly random disk access pattern. The optimization mechanis...
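A hedged sketch of the underlying intuition: read every needed chunk exactly once and in on-disk order, instead of following each image's recipe independently (which revisits shared chunks and produces random I/O). The recipes and the chunk index below are hypothetical.

```python
# Sketch: plan a sequential read order for the union of chunks needed by a
# group of deduplicated VM images (illustrative data, not the paper's system).

# recipe per VM image: ordered list of chunk fingerprints
recipes = {
    "web.img": ["c3", "c1", "c7", "c1"],
    "db.img":  ["c3", "c9", "c7"],
}
disk_offset = {"c1": 4096, "c3": 0, "c7": 8192, "c9": 12288}   # chunk index

needed = {fp for recipe in recipes.values() for fp in recipe}
sequential_plan = sorted(needed, key=disk_offset.get)
print("read order:", sequential_plan)       # each chunk once, ascending offset
```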
Large, real-world graphs are famously difficult to process efficiently. Not only do they have a large memory footprint, but most graph processing algorithms entail memory access patterns with poor locality, data-dependent parallelism, and a low compute-to-memory access ratio. Additionally, most real-world graphs have a low diameter and a highly hetero...
This paper evaluates the potential gains a workflow-aware storage system can bring. Two observations make us believe such a storage system is crucial to efficiently support workflow-based applications: First, workflows generate irregular and application-dependent data access patterns. These patterns render existing storage systems unable to harness a...
The ease with which we adopt online personas and relationships has created a soft spot that cyber criminals are willing to exploit. Advances in artificial intelligence make it feasible to design bots that sense, think and act cooperatively in social settings just like human beings. In the wrong hands, these bots can be used to infiltrate online com...
Online gaming is a multi-billion dollar industry that entertains a large, global population. One unfortunate phenomenon, however, poisons the competition and the fun: cheating. The costs of cheating span from industry-supported expenditures to detect and limit cheating, to victims' monetary losses due to cyber crime. This paper studies cheaters in...
Massively multicore processors, such as Graphics Processing Units (GPUs), provide, at a comparable price, a one order of magnitude higher peak performance than traditional CPUs. This drop in the cost of computation, as any order-of-magnitude drop in the cost per unit of performance for a class of system components, triggers the opportunity to redes...
Massively multicore processors, such as Graphics Processing Units (GPUs), provide, at a comparable price, a one order of magnitude higher peak performance than traditional CPUs. This drop in the cost of computation, as any order-of-magnitude drop in the cost per unit of performance for a class of system components, triggers the opportunity to redes...
Cloud-based backup and archival services use large tape libraries as a cost-effective cold tier in their online storage hierarchy today. These services leverage deduplication to reduce the disk storage capacity required by their customer data sets, but they usually re-duplicate the data when moving it from disk to tape.
Online gaming is a multi-billion dollar industry that entertains a large, global population. One unfortunate phenomenon, however, poisons the competition and the fun: cheating. The costs of cheating span from industry-supported expenditures to detect and limit cheating, to victims' monetary losses due to cyber crime. This paper studies cheaters in...
Online Social Networks (OSNs) have become an integral part of today's Web. Politicians, celebrities, revolutionists, and others use OSNs as a podium to deliver their message to millions of active web users. Unfortunately, in the wrong hands, OSNs can be used to run astroturf campaigns to spread misinformation and propaganda. Such campaigns usually...
Many-Task Computing (MTC) is a new application category that encompasses increasingly popular applications in biology, economics, and statistics. The high inter-task parallelism and data-intensive processing capabilities of these applications pose new challenges to existing supercomputer hardware-software stacks. These challenges include resource p...
This paper discusses the use of many-task computing tools for multiscale modeling. It defines multiscale modeling and places different examples of it on a coupling spectrum, discusses the Swift parallel scripting language, describes three multiscale modeling applications that could use Swift, and then talks about how the Swift model is being extend...
The energy costs of running computer systems are a growing concern: for large data centers, recent estimates put these costs higher than the cost of hardware itself. As a consequence, energy efficiency has become a pervasive theme for designing, deploying, and operating computer systems. This paper evaluates the energy trade-offs brought by data de...
Web caches, content distribution networks, peer-to-peer file-sharing networks, distributed file systems, and data grids all have in common that they involve a community of users who use shared data. In each case, overall system performance can be improved significantly by first identifying and then exploiting the structure of the community's data acces...
This paper explores the feasibility of a storage architecture that offers the reliability and access performance characteristics of a high-end system, yet is cost-efficient. We propose ThriftStore, a storage architecture that integrates two types of components: volatile, aggregated storage and dedicated, yet low-bandwidth durable storage. On the on...
System logs are an important tool in studying the conditions (e.g., environment misconfigurations, resource status, erroneous user input) that cause failures. However, production system logs are complex, verbose, and lack structural stability over time. These traits make them hard to use, and make solutions that rely on them susceptible to high mai...
This paper presents VMFlockMS, a migration service optimized for cross-datacenter transfer and instantiation of groups of virtual machine (VM) images that comprise an application-level solution (e.g., a three-tier web application). We dub these groups of related VM images VMFlocks. VMFlockMS employs two main techniques: first, data deduplication wi...
The retrieval and analysis of malicious content is an essential task for security researchers. At the same time, the distributors of malicious files deploy countermeasures to evade the scrutiny of security researchers. This paper investigates two techniques used by malware download centers: frequently updating the malicious payload, and blacklisti...
As distributed applications increase in size and complexity, traditional authorization architectures based on a dedicated authorization server become increasingly fragile because this decision point represents a single point of failure and a performance bottleneck. Authorization caching, which enables the reuse of previous authorization decisions,...
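A minimal sketch of authorization caching under stated assumptions (a fixed TTL and a stubbed remote check; not the paper's protocol): cache the allow/deny decision per (subject, resource, action) and serve repeated requests locally until the entry expires, so the central authorization server is consulted less often.

```python
# Sketch: reuse previous authorization decisions with a time-to-live cache.
import time

CACHE_TTL = 30.0                    # seconds a cached decision stays valid
_cache = {}                         # (subject, resource, action) -> (decision, expiry)

def remote_authorize(subject, resource, action):
    """Stand-in for a round trip to the dedicated authorization server."""
    return subject == "alice" and action == "read"

def is_authorized(subject, resource, action):
    key = (subject, resource, action)
    entry = _cache.get(key)
    if entry and entry[1] > time.monotonic():
        return entry[0]                               # cache hit
    decision = remote_authorize(subject, resource, action)
    _cache[key] = (decision, time.monotonic() + CACHE_TTL)
    return decision

print(is_authorized("alice", "/report.pdf", "read"))   # miss, asks the server
print(is_authorized("alice", "/report.pdf", "read"))   # hit, served from cache
```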
Data structures that map well are required when porting applications to hybrid architectures such as graphics processing unit (GPU) based platforms. Hybrid platforms that use GPUs can deliver higher peak computational rates and memory bandwidth. A GPU is used for the sequence alignment problem, which aims to find all occurrences of ea...
Versatile storage systems aim to maximize storage resource utilization by supporting the ability to `morph' the storage system to best match the application's demands. To this end, versatile storage systems significantly extend the deployment- or run-time configurability of the storage system. This flexibility, however, introduces a new problem: a...
GPUs offer drastically different performance characteristics compared to traditional multicore architectures. To explore the tradeoffs exposed by this difference, we refactor MUMmer, a widely-used, highly-engineered bioinformatics application which has both CPU- and GPU-based implementations. We synthesize our experience as three high-level guideli...
Assessing the value of individual users' contributions in peer-production systems is paramount to the design of mechanisms that support collaboration and improve users' experience. For instance, to incentivize contributions, file-sharing systems based on the BitTorrent protocol equate value with volume of contributed content and use a prioritizatio...