Anwitaman Datta's research while affiliated with Nanyang Technological University and other places

Publications (220)

Article
COVID-19, which was first detected in late 2019 in Wuhan, China, has spread to the rest of the world and is currently deemed a global pandemic. A flux of events triggered by a wide ranging set of factors such as virus mutations and waves of infections, imperfect medical and policy interventions, and vested interest driven political posturing all ha...
Conference Paper
Full-text available
COVID-19, which was first detected in late 2019 in Wuhan, China, has spread to the rest of the world and is currently deemed a global pandemic. A flux of events triggered by a wide ranging set of factors such as virus mutations and waves of infections, imperfect medical and policy interventions, and vested interest driven political posturing all ha...
Article
Full-text available
The structure of many complex networks includes edge directionality and weights on top of their topology. Network analysis techniques that can seamlessly consider a combination of these properties are desirable. In this paper, we study two important such network analysis techniques, namely, centrality and clustering. An information-flow based model is adopted f...
Article
Full-text available
For over a decade, erasure codes have become an integral part of large-scale data storage solutions and data-centers. However, in commercial systems, they are, so far, used predominantly for static data. In the meanwhile, there has also been almost a decade and a half of research on mutable erasure coded data, looking at various associated issues,...
Article
Full-text available
We propose a two-step methodology for exploring the temporal characteristics of a network. First, we construct a graph time series, where each snapshot is the result of a temporal whole-graph embedding. The embedding is carried out using the degree, Katz and betweenness centralities to characterize first and higher order proximities among vertices....
Preprint
The structure of many complex networks includes edge directionality and weights on top of their topology. Network analysis techniques that can seamlessly consider a combination of these properties are desirable. In this paper, we study two important such network analysis techniques, namely, centrality and clustering. An information-flow based model is adopted f...
Article
We consider the design and analysis of quorum systems over erasure coded warm data (with low frequency of writes and accesses in general) to guarantee sequential consistency under a fail-stop model while supporting atomic read-modify-write operations by multiple clients. We propose a definition of asymmetric quorum systems that suit the framework o...
Preprint
Full-text available
In this work, we collect a moderate-sized representative corpus of tweets (approximately 200,000) pertaining to Covid-19 vaccination, spanning a period of seven months (September 2020 - March 2021). Following a Transfer Learning approach, we utilize the pre-trained Transformer-based XLNet model to classify tweets as Misleading or Non-Misleading and vali...
Article
Full-text available
In this paper we study the problem of consistency in distributed storage systems relying on erasure coding for storage efficient fault-tolerance. We propose QLOC, a flexible framework for supporting the storage of warm data, i.e., data which, while not being very frequently in use, nevertheless continues to be accessed for reads or writes regularly....
Article
Full-text available
This article explores a graph clustering method that is derived from an information theoretic method that clusters points in R^n relying on Rényi entropy, which involves computing the usual Euclidean distance between these points. Two viewpoints are adopted: (1) the graph to be clustered is first embedded into R^d for some dimension d so a...
Article
Full-text available
We consider the problem of designing grid quorum systems for maximum distance separable (MDS) erasure code based distributed storage systems. Quorums are used as a mechanism to maintain consistency in replication based storage systems, for which grid quorums have been shown to produce optimal load characteristics. This motivates the study of grid q...
Article
This work explores how to enhance pseudonymous whistleblower submission systems, specifically by supporting protocol level unlinkability, while also making the system resilient against (distributed) denial of service attacks. To that end, we propose a blind signature based protocol which facilitates assignment of trust to anonymous posters in a man...
Chapter
In this paper we carry out a survey of vision & white papers and reports from industry as well as private industry actors, along with academic literature, to understand how blockchain is being used for digital government and public services. The purpose of this survey is to explore which fundamental properties of blockchain technology are being har...
Article
Full-text available
We consider a particular instance of user interactions in the Bitcoin network, that of interactions among wallet addresses belonging to scammers. Aggregation of multiple inputs and change addresses are common heuristics used to establish relationships among addresses and analyze transaction amounts in the Bitcoin network. We propose a flow centric...
Article
Distinct transactions among different and unrelated users are combined together to create a single Bitcoin transaction (mixing transaction) to obfuscate the relationships among the actual participants (more specifically, the wallet addresses used for the transactions). We consider multi‐input multi‐output transactions with at least two inputs and t...
Preprint
Research in blockchain systems has mainly focused on improving security and bridging the performance gaps between blockchains and databases. Despite many promising results, we observe a worrying trend that the blockchain landscape is fragmented in which many systems exist in silos. Apart from a handful of general-purpose blockchains, such as Ethere...
Article
Full-text available
The notion of entropic centrality measures how central a node is in terms of how uncertain the destination of a flow starting at this node is: the more uncertain the destination, the more well connected and thus central the node is deemed. This implicitly assumes that the flow is indivisible, and at every node, the flow is transferred from one edge...
Preprint
Fuelled by the success (and hype) around cryptocurrencies, distributed ledger technologies (DLT), particularly blockchains, have gained a lot of attention from a wide spectrum of audience who perceive blockchains as a key to carry out business processes that have hitherto been cumbersome in a cost and time effective manner. Governments across the g...
Article
A large volume of data is generated by traffic surveillance devices such as cameras and sensors integrated into an intelligent transportation system (ITS), a subfield of the Internet of Things (IoT). We argue that network coding can be applied to leverage on an emerging fog architecture that relies on edge resources, to achieve higher throughput, s...
Preprint
Full-text available
The old mantra of decentralizing the Internet is coming again with fanfare, this time around the blockchain technology hype. We have already seen a technology supposed to change the nature of the Internet: peer-to-peer. The reality is that peer-to-peer naming systems failed, peer-to-peer social networks failed, and yes, peer-to-peer storage failed...
Article
We propose a trusted third party free protocol for secure (in terms of content access, manipulation, and confidentiality) data storage and multi-user collaboration over an infrastructure of untrusted storage servers. It is achieved by the application of data dispersal, encryption as well as two-factor (knowledge and possession) based authentication...
Conference Paper
Full-text available
Instagram is a significant platform for users to share media reflecting their interests. It is used by marketers and brands to reach their potential audience for advertisement. The number of likes on posts serves as a proxy for social reputation of the users, and in some cases, social media influencers with an extensive reach are compensated by ma...
Article
Full-text available
Even as data and analytics-driven applications are becoming increasingly popular, retrieving data from shared databases poses a threat to the privacy of their users. For example, investors/patients retrieve records about stocks/diseases they are interested in from a stock/medical database. Knowledge of such interest is sensitive information that th...
Conference Paper
P2P-based social networking services are severely challenged by churn and the lack of reliable service providers, especially considering the high frequency of posts and profile updates of their users. Improved consistency and data availability shall facilitate better acceptance, which in turn will enhance privacy, an inherent benefit of this class...
Article
A large volume of data is generated by traffic surveillance devices such as cameras and sensors integrated in an intelligent transportation system (ITS). To deal with the extreme volume and the massively geographically distributed sources of data, we advocate a tiered storage and processing architecture, using edge nodes to augment a centralized back...
Article
Full-text available
We leverage on authenticated data structures to guarantee correctness and completeness of query results over encrypted data. Our contribution is in bridging two independent lines of work (searchable encryption, and provable data possession) resulting in a general purpose technique, which does so without increasing the client storage overhead, while...
Article
Full-text available
Data outsourcing is plagued with several security and privacy concerns. Oblivious RAM (ORAM) can be used to address one of the many concerns, specifically to protect the privacy of data access pattern from outsourced cloud storage. This is achieved by simulating each original read or write operation with some read and write operations on both real...
Article
To successfully complete a complex project, agents (companies or individuals) must form a team with the required competencies and resources. A team can be formed either by the project issuer based on individual agents' offers (centralized formation) or by the agents themselves (decentralized formation) bidding for a project as a consortium. The aut...
Article
Full-text available
In this paper we study the problem of storing reliably an archive of versioned data. Specifically, we focus on systems where the differences (deltas) between subsequent versions rather than the whole objects are stored—a typical model for storing versioned data. For reliability, we propose erasure encoding techniques that exploit the sparsity of in...
Chapter
In this paper, we study the problem of storing an archive of versioned data in a reliable and efficient manner. The proposed technique is relevant in cloud settings, where, because of the huge volume of data to be stored, distributed (scale-out) storage systems deploying erasure codes for fault tolerance are typical. However, existing erasure coding...
Article
Association rule mining and frequent itemset mining are two popular and widely studied data analysis techniques for a range of applications. In this paper, we focus on privacy-preserving mining on vertically partitioned databases. In such a scenario, data owners wish to learn the association rules or frequent itemsets from a collective data set and...
Article
We propose a differential versioning based data storage (DiVers) architecture for distributed storage systems, which relies on a novel erasure coding technique that exploits sparsity across versions. The emphasis of this work is to demonstrate how sparsity exploiting codes (SEC), originally designed for I/O optimization, can be extended to signific...
Conference Paper
In this paper we propose a protocol that allows end-users in a decentralized setup (without requiring any trusted third party) to protect data shipped to remote servers using two factors: knowledge (passwords) and possession (time-based one-time password generation for authentication) that is portable. The protocol also supports revocation and rec...
Chapter
An abundance of data generated from a multitude of sources, and intelligence derived by analyzing the same, has become an important asset across many walks of life. Simultaneously, it raises serious concerns about privacy. Differential privacy has become a popular way to reason about the amount of information about individual entries of a dataset t...
Article
Full-text available
In many aspects of human activity, there has been a continuous struggle between the forces of centralization and decentralization. Computing exhibits the same phenomenon; we have gone from mainframes to PCs and local networks in the past, and over the last decade we have seen a centralization and consolidation of services and applications in data c...
Article
In a decentralized storage system, agents replicate each other’s data to increase availability. Compared to organizationally centralized solutions, such as cloud storage, a decentralized storage system requires less trust in the provider and may result in smaller monetary costs. Our system is based on reciprocal storage contracts that allow the age...
Article
Auditability is crucial for data outsourcing, facilitating accountability and identifying data loss or corruption incidents in a timely manner, reducing in turn the risks from such losses. In recent years, in sync with the growing trend of outsourcing, a lot of progress has been made in designing probabilistic (for efficiency) provable data posses...
Article
Erasure coding has become an integral part of the storage infrastructure in data-centers and cloud backends—since it provides significantly higher fault tolerance for substantially lower storage overhead compared to a naive approach like n-way replication. Fault tolerance refers to the ability to achieve very high availability despite (temporary) f...
Article
Full-text available
In this paper, we study the problem of storing an archive of versioned data in a reliable and efficient manner in distributed storage systems. We propose a new storage technique called differential erasure coding (DEC) where the differences (deltas) between subsequent versions are stored rather than the whole objects, akin to a typical delta encodi...
Patent
Full-text available
In an embodiment, a data encoding method may be provided. The data encoding method may include: inputting data to be encoded; determining a polynomial so that an evaluation of the polynomial at a sum of a first supporting point of the polynomial and a second supporting point of the polynomial corresponds to the sum of an evaluation of the polynomia...
Article
Networked distributed storage systems (NDSS) use erasure codes in lieu of replication for realising data redundancy. An interesting research challenge is to derive the largest advantage from the trade-offs between storage overhead and reliability that erasure codes provide, while optimising them to satisfy specific storage needs like repairability an...
Article
In order to accomplish complex tasks, it is often necessary to compose a team consisting of experts with diverse competencies. However, for proper functioning, it is also preferable that a team be socially cohesive. A team recommendation system, which facilitates the search for potential team members can be of great help both for (i) individuals wh...
Article
In this paper we study the problem of storing reliably an archive of versioned data. Specifically, we focus on systems where the differences (deltas) between subsequent versions rather than the whole objects are stored - a typical model for storing versioned data. For reliability, we propose erasure encoding techniques that exploit the sparsity of...
Article
Full-text available
The Kurdish language is an Indo-European language spoken in Kurdistan, a large geographical region in the Middle East. Despite having a large number of speakers, Kurdish is among the less-resourced languages and has not seen much attention from the IR and NLP research communities. This article reports on the outcomes of a project aimed at providing...
Article
Full-text available
To successfully complete a complex project, be it a construction of an airport or of a backbone IT system or crowd-sourced projects, agents (companies or individuals) must form a team (a coalition) having required competences and resources. A team can be formed either by the project issuer based on individual agents' offers (centralized formation);...
Article
Networked distributed data storage systems are essential to deal with the needs of storing massive volumes of data. Dependability of such a system relies on its fault tolerance (data should be available in case of node failures) as well as its maintainability (its ability to repair lost data to ensure redundancy replenishment over time). Erasure co...
Conference Paper
In this paper, we introduce InterCloud RAIDer, which realizes a multi-cloud private data backup system by composing (i) a data deduplication technique to reduce the overall storage overhead, (ii) erasure coding to achieve redundancy at low overhead, which is dispersed across multiple cloud services to realize fault-tolerance against individual serv...
Article
Full-text available
There are different ways to realize Reed Solomon (RS) codes. While in the storage community, using the generator matrices to implement RS codes is more popular, in the coding theory community the generator polynomials are typically used to realize RS codes. Prominent exceptions include HDFS-RAID, which uses generator polynomial based erasure codes,...
Conference Paper
The advent of cloud computing is driving a paradigm shift in the computing landscape. An increasing number of businesses and individuals are moving their data and computation to the cloud. While the benefits of cloud computing are numerous, security remains one of the biggest concerns as data and computation are outsourced to untrusted third partie...
Conference Paper
Full-text available
Erasure codes are an integral part of many distributed storage systems aimed at Big Data, since they provide high fault-tolerance for low overheads. However, traditional erasure codes are inefficient on replenishing lost data (vital for long term resilience) and on reading stored data in degraded environments (when nodes might be unavailable). Cons...
Conference Paper
Full-text available
In the past few years erasure codes have been increasingly embraced by distributed storage systems as an alternative to replication, since they provide high fault-tolerance for low overheads. Erasure codes, however, have a few shortcomings that need to be addressed to make them a complete solution for networked storage systems. Lack of support for e...
Conference Paper
Often one needs to form teams in order to perform a complex collaborative task. Therefore, it is interesting and useful to assess how well constituents of a team have performed, and leverage this knowledge to guide future team formation. In this work we propose a model for assessing the reputation of participants in collaborative teams. The model t...
Conference Paper
Erasure coding provides a mechanism to store data redundantly for fault-tolerance in a cost-effective manner. Recently, there has been a renewed interest in designing new erasure coding techniques with different desirable properties, including good repairability and degraded read performance, or efficient redundancy generation processes. Very often...
Article
The semantic knowledge of Wikipedia has proved to be useful for many tasks, for example, named entity disambiguation. Among these applications, the task of identifying the word sense based on Wikipedia is a crucial component because the output of this component is often used in subsequent tasks. In this article, we present a two-stage framework (ca...
Article
As a tremendous amount of data is being generated every day from human activity and from devices equipped with sensing capabilities, cloud computing emerges as a scalable and cost-effective platform to store and manage the data. While benefits of cloud computing are numerous, security concerns arising when data and computation are outsourced to a third p...
Article
In this paper we propose GoDisco++, a gossip based approach for information dissemination in online social community networks. GoDisco++ uses local information available to nodes, that is, information associated with a node and its neighbors. The algorithm exploits multiple relations which may exist between nodes, and applies social principles and b...
Article
Full-text available
Erasure codes are an integral part of many distributed storage systems aimed at Big Data, since they provide high fault-tolerance for low overheads. However, traditional erasure codes are inefficient on reading stored data in degraded environments (when nodes might be unavailable), and on replenishing lost data (vital for long term resilience). Con...
Conference Paper
Given the vast volume of data that needs to be stored reliably, many data-centers and large-scale file systems have started using erasure codes to achieve reliable storage while keeping the storage overhead low. This has invigorated the research on erasure codes tailor made to achieve different desirable storage system properties such as efficient...
Article
Computational trust representations are used by Trust Management (TM) systems to elicit information from users about the behavior of others. In most practically used TM systems, simple computational trust representations dominate, such as the three-valued discrete scale of “negative”, “neutral” and “positive” used in reputation systems of Internet...
Conference Paper
Distributed storage systems usually achieve fault tolerance by replicating data across different nodes. However, redundancy schemes based on erasure codes can provide a storage-efficient alternative to replication. This is particularly suited for data archival since archived data is rarely accessed. Typically, the migration to erasure-encoded stor...
Article
The proliferation of mobile devices coupled with Internet access is generating a tremendous amount of highly personal and sensitive data. Applications such as location-based services and quantified self harness such data to bring meaningful context to users’ behavior. As social applications are becoming prevalent, there is a trend for users to shar...
Conference Paper
Team recommendation aids decision support, by not only identifying individuals who are experts for various aspects of a complex task, but also determining various properties of the team as a group. Several aspects such as cohesion and repetition of teams have been identified as important indicators, besides individuals' expertise, in determining ho...
Conference Paper
Full-text available
Event detection from tweets is an important task to understand the current events/topics attracting a large number of common users. However, the unique characteristics of tweets (e.g. short and noisy content, diverse and fast changing topics, and large data volume) make event detection a challenging task. Most existing techniques proposed for well...
Article
There is an increasing trend for businesses to migrate their systems towards the cloud. Security concerns that arise when outsourcing data and computation to the cloud include data confidentiality and privacy. Given that a tremendous amount of data is being generated every day from a plethora of devices equipped with sensing capabilities, we focus on...
Article
Peer-to-peer index structures distributed and managed over the planet, commonly known as structured overlays (e.g., distributed hash tables) have been touted to play the role of a fundamental building block for internet-scale distributed systems. Traditional designs consider incremental or possibly even parallelized construction of a single overlay...
Article
Full-text available
Many private and/or public organizations have been reported to create and monitor targeted Twitter streams to collect and understand users' opinions about the organizations. A targeted Twitter stream is usually constructed by filtering tweets with user-defined selection criteria, e.g., tweets published by users from a selected region, or tweets that ma...
Conference Paper
Modeling and understanding social network structure has interested researchers from many backgrounds including social science, computer science, theoretical physics and graph theory. Notable models include [1] and [2] achieving graphs with power-law degree distribution using preferential attachment and small-world characteristics using randomized r...
Article
To achieve reliability in distributed storage systems, data has usually been replicated across different nodes. However the increasing volume of data to be stored has motivated the introduction of erasure codes, a storage efficient alternative to replication, particularly suited for archival in data centers, where old datasets (rarely accessed) can...
Article
The problem of replenishing redundancy in erasure code based fault-tolerant storage has received a great deal of attention recently, leading to the design of several new coding techniques [3], aiming at a better repairability. In this paper, we adopt a different point of view, by proposing to code across different already encoded objects to allevia...
Article
Erasure coding techniques are getting integrated in networked distributed storage systems as a way to provide fault-tolerance at the cost of less storage overhead than traditional replication. Redundancy is maintained over time through repair mechanisms, which may entail large network resource overheads. In recent years, several novel codes tailor-...
Article
An increasing number of businesses are replacing their data storage and computation infrastructure with cloud services. Likewise, there is an increased emphasis on performing analytics based on multiple datasets obtained from different data sources. While ensuring security of data and computation outsourced to a third party cloud is in itself chall...
Article
Full-text available
In this work we describe the PriSM framework for decentralized deployment of a federation of autonomous social networks (ASN). The individual ASNs are centrally managed by organizations according to their institutional needs, while cross-ASN interactions are facilitated subject to security and confidentiality requirements specified by administrator...