Aniket Chakrabarti's research while affiliated with Microsoft and other places

Publications (16)

Chapter
Efficiently finding small samples with high diversity from large graphs has many practical applications such as community detection and online survey. This paper proposes a novel scalable node sampling algorithm for large graphs that can achieve better spread or diversity across communities intrinsic to the graph without requiring any costly pre-pr...
Article
Routing questions in Community Question Answer services (CQAs) such as Stack Exchange sites is a well-studied problem. Yet cold-start, a phenomenon observed when a new question is posted, is not well addressed by existing approaches. Additionally, cold questions posted by new askers present significant challenges to state-of-the-art approaches. We...
Conference Paper
A number of real world problems in many domains (e.g. sociology, biology, political science and communication networks) can be modeled as dynamic networks with nodes representing entities of interest and edges representing interactions among the entities at different points in time. A common representation for such models is the snapshot model - wh...
Article
This paper studies change point detection on networks with community structures. It proposes a framework that can detect both local and global changes in networks efficiently. Importantly, it can clearly distinguish the two types of changes. The framework design is generic and as such several state-of-the-art change point detection algorithms can f...
Article
A number of real world problems in many domains (e.g. sociology, biology, political science and communication networks) can be modeled as dynamic networks with nodes representing entities of interest and edges representing interactions among the entities at different points in time. A common representation for such models is the snapshot model - wh...
Article
The emergence of applications that must efficiently handle growing amounts of data has stimulated the development of new computing architectures with several Processing Units (PUs), such as CPU cores, graphics processing units (GPUs) and Intel Xeon Phi (MIC). Aiming to better exploit these architectures, recent works focus on proposing novel r...
Conference Paper
Large scale sensor networks are ubiquitous nowadays. An important objective of deploying sensors is to detect anomalies in the monitored system or infrastructure, which allows remedial measures to be taken to prevent failures, inefficiencies, and security breaches. Most existing sensor anomaly detection methods are local, i.e., they do not capture...
Conference Paper
We propose a novel, scalable, and principled graph sketching technique based on minwise hashing of local neighborhoods. For an n-node graph with e edges (e >> n), we incrementally maintain in real time a minwise neighbor-sampled subgraph using k hash functions in O(n × k) memory, with the limit user-configurable through the parameter k. Symmetrization and s...
Conference Paper
We present a novel data embedding that significantly reduces the estimation error of locality sensitive hashing (LSH) technique when used in reproducing kernel Hilbert space (RKHS). Efficient and accurate kernel approximation techniques either involve the kernel principal component analysis (KPCA) approach or the Nyström approximation method. In th...
Article
Given a collection of objects and an associated similarity measure, the all-pairs similarity search problem asks us to find all pairs of objects with similarity greater than a certain user-specified threshold. In order to reduce the number of candidates to search, locality-sensitive hashing (LSH) based indexing methods are very effective. However,...
Conference Paper
All pairs similarity search is a problem where a set of data objects is given and the task is to find all pairs of objects that have similarity above a certain threshold for a given similarity measure-of-interest. When the number of points or dimensionality is high, standard solutions fail to scale gracefully. Approximate solutions such as Locality...
Article
All pairs similarity search is a problem where a set of data objects is given and the task is to find all pairs of objects that have similarity above a certain threshold for a given similarity measure-of-interest. When the number of points or dimensionality is high, standard solutions fail to scale gracefully. Approximate solutions such as Locality...
Article
Internet services access networked storage many times while processing a request. Just a few slow storage accesses per request can raise response times a lot, making the whole service less usable and hurting profits. This paper presents Zoolander, a key value store that meets strict, low latency service level objectives (SLOs). Zoolander scales...
Conference Paper
NoSQL stores expose narrow APIs for data access, e.g., get(key) or put(key, val). While these APIs often give up strong consistency and transactions, they can scale throughput under intense workloads. Widely used stores, e.g., Apache Zookeeper, Cassandra, and Memcached, have been shown to achieve 10¹⁰ accesses per day in the face of workload shifts...

Citations

... As an extension of the current study, we plan to apply our model to other applications such as community detection in dynamic networks (Wang et al. 2018) and exception-tolerant abduction (Zhang, Mathew, and Juba 2017) in attributed networks (Liang et al. 2018). We would also like to address the problem of routing newly posted questions (item cold-start) to newly registered users (user cold-start) in CQAs, in the hope of increasing the expertise of the entire community. ...
... Expert recommendation based on link analysis [8][9][10][11][12][13][14] constructs a question-answer relationship directed graph based on the historical interaction behavior between users in the community and then performs link analysis on the directed graph and calculates the authority of each user. Neural-network-based expert recommendation [15][16][17][18][19][20] encodes higher-level questions and feature representations of expert texts with the help of word2vec and graphs and then extracts features by convolutional neural networks and recurrent neural networks. ...
... Chakraborti et al. [9] consider the effect of heterogeneous workload distribution on bi-objective optimization of data analytics applications by simulating heterogeneity on homogeneous clusters. The performance is represented by a linear function of problem size and the total energy is predicted using historical data tables. ...
... often forming a large portion of the network. Methods such as [6,24] track a fixed set of nodes sampled from the initial time step and are thus limited to detecting changes happening within the initial set of nodes. Other approaches, such as [12,13], summarize each snapshot with a vector dependent on the size of the snapshot. ...
... The results illustrate that when experiencing clustering events, there is a transition in the time scale (from slow to fast) and direction (from hierarchical to distributed) of information transfer in the network. Wang et al. [48] expressed the evolution of the temporal network as a Markov network and detected change points through estimating and comparing the joint edge (dyad) distribution. Experiments on the Senate cosponsorship network show that the method is more efficient than the other approaches in the same period while ensuring a good detection effect. ...
... In literature [11], a compressed binary tree corresponds to the streaming graph data for lossy summarization. In literature [12], hash functions maintain a minimum neighborhood sample subgraph in real-time. GSS [13] first generates a sketch of the streaming graph using hash functions, then uses a novel data structure to store it, achieving lossy summarization supporting various queries. ...
... We note that hypothesis testing has also been used to decide the number of hashes for LSH in the context of similarity search [10,52]. Our technique differs from [10,52] in three ways: (1) ours is based on a random process of sampling dimensions of a transformed vector, while [10,52] are based on sampling hash functions, which entails significantly different hypothesis tests; (2) ours targets the Euclidean distance function, while [10,52] target similarity functions such as the Jaccard and cosine similarity measures (it remains non-trivial to adapt the latter to the Euclidean space); and (3) ours is guaranteed to be no worse than evaluating exact distances (in our case, FDScanning), because it obtains exact distances once it has sampled all the dimensions, while [10,52] have no such guarantee (if they have sampled all the hash functions and still cannot produce a firm result, they must re-evaluate exact similarities from scratch). ...
... Sensors are often used to monitor various environmental and location parameters in many real-world applications. Anomalies in sensor data refer to sensor faults or unexpected events (such as intrusions) (Rajasegarar et al., 2008; Hayes and Capretz, 2014; Rabatel et al., 2011; Chakrabarti et al., 2016). Sensor data can be binary, discrete, continuous, audio, video, etc. ...
... However, it is a computationally expensive operation when working with big data [60]. In fact, the efficiency and effectiveness of big data queries and analysis algorithms are greatly affected by the data partitioning scheme [60][61][62][63][64][65][66]. On Hadoop clusters, data partitioning is basically the responsibility of HDFS [10]. ...
... At the network level, prior work has shown the benefits of duplicating flows (or specific packets of a flow) [41-43, 45, 49, 70, 74]. Similarly, other systems have shown the efficacy of duplication for storage (e.g., [40,64,68]) and distributed job execution frameworks [17-19, 65, 76]. ...