Vivek Sharma

Verified
Vivek verified their affiliation via an institutional email.
Massachusetts Institute of Technology | MIT · MIT Media Laboratory

PhD

About

73
Publications
28,642
Reads
2,046
Citations
Introduction
My research interests cover computer vision, applied machine learning for computer vision, and my favourite: multi-/hyperspectral imaging.
Additional affiliations
October 2019 - present
Harvard Medical School
Position
  • PhD Student
February 2019 - present
Massachusetts Institute of Technology
Position
  • Researcher
February 2019 - present
Harvard University
Position
  • Researcher
Education
September 2012 - June 2014
Karlsruhe Institute of Technology; University of Jean Monnet; University of Granada; Norwegian University of Science and Technology, Campus Gjøvik University College
Field of study
  • Computer Science
August 2007 - June 2011
BK Birla Institute of Engineering and Technology, Pilani
Field of study
  • Computer Science and Engineering

Publications (73)
Conference Paper
Full-text available
Convolutional neural networks rely on image texture and structure to serve as discriminative features to classify the image content. Image enhancement techniques can be used as preprocessing steps to help improve the overall image quality and in turn improve the overall effectiveness of a CNN. Existing image enhancement methods, however, are design...
Conference Paper
Full-text available
The CNN-encoding of features from entire videos for the representation of human actions has rarely been addressed. Instead, CNN work has focused on approaches to fuse spatial and temporal networks, but these were typically limited to processing shorter sequences. We present a new video representation, called temporal linear encoding (TLE) and embed...
Article
Full-text available
Image enhancement using visible (RGB) and near-infrared (NIR) image data has been shown to enhance useful details of the image. While the enhanced images are commonly evaluated by observers' perception, in the present work, we rather evaluate it by quantitative feature evaluation. The proposed algorithm presents a new method to enhance the visible...
Article
Full-text available
The work in this paper is driven by the question of how to exploit the temporal cues available in videos for their accurate classification, and for human action recognition in particular. Thus far, the vision community has focused on spatio-temporal approaches with fixed temporal convolution kernel depths. We introduce a new temporal layer that models...
Article
Full-text available
Deep generative adversarial networks (GANs) have recently been shown to be promising for various computer vision applications, such as image editing, synthesizing high-resolution images, and generating videos. These networks and the corresponding learning scheme can handle various visual space mappings. We approach GANs with a novel training met...
Preprint
Full-text available
Progress in AI is driven largely by the scale and quality of training data. Despite this, there is a deficit of empirical analysis examining the attributes of well-established datasets beyond text. In this work we conduct the largest and first-of-its-kind longitudinal audit across modalities--popular text, speech, and video datasets--from their det...
Preprint
Full-text available
Growing privacy concerns and regulations like GDPR and CCPA necessitate pseudonymization techniques that protect identity in image datasets. However, retaining utility is also essential. Traditional methods like masking and blurring degrade quality and obscure critical context, especially in human-centric images. We introduce Rendering-Refined Stab...
Preprint
Full-text available
Privacy-preserving computer vision is an important emerging problem in machine learning and artificial intelligence. The prevalent methods tackling this problem use differential privacy or anonymization and obfuscation techniques to protect the privacy of individuals. In both cases, the utility of the trained model is sacrificed heavily in this pro...
Preprint
Full-text available
General-purpose artificial intelligence (AI) systems are built on massive swathes of public web data, assembled into corpora such as C4, RefinedWeb, and Dolma. To our knowledge, we conduct the first, large-scale, longitudinal audit of the consent protocols for the web domains underlying AI training corpora. Our audit of 14,000 web domains provides...
Conference Paper
Cloud-based machine learning inference is an emerging paradigm where users query by sending their data through a service provider who runs an ML model on that data and returns back the answer. Due to increased concerns over data privacy, recent works have proposed Collaborative Inference (CI) to learn a privacy-preserving encoding of sensit...
Chapter
We propose sanitizer, a framework for secure and task-agnostic data release. While releasing datasets continues to make a big impact in various applications of computer vision, its impact is mostly realized when data sharing is not inhibited by privacy concerns. We alleviate these concerns by sanitizing datasets in a two-stage process. First, we in...
Chapter
Point clouds are an increasingly ubiquitous input modality and the raw signal can be efficiently processed with recent progress in deep learning. This signal may, often inadvertently, capture sensitive information that can leak semantic and geometric properties of the scene which the data owner does not want to share. The goal of this work is to pr...
Preprint
Full-text available
Government agencies collect and manage a wide range of ever-growing datasets. While such data has the potential to support research and evidence-based policy making, there are concerns that the dissemination of such data could infringe upon the privacy of the individuals (or organizations) from whom such data was collected. To appraise the current...
Preprint
Full-text available
State-of-the-art methods in generative representation learning yield semantic disentanglement, but typically do not consider physical scene parameters, such as geometry, albedo, lighting, or camera. We posit that inverse rendering, a way to reverse the rendering process to recover scene parameters from an image, can also be used to learn physically...
Preprint
Full-text available
Point clouds are an increasingly ubiquitous input modality and the raw signal can be efficiently processed with recent progress in deep learning. This signal may, often inadvertently, capture sensitive information that can leak semantic and geometric properties of the scene which the data owner does not want to share. The goal of this work is to pr...
Preprint
Full-text available
We propose sanitizer, a framework for secure and task-agnostic data release. While releasing datasets continues to make a big impact in various applications of computer vision, its impact is mostly realized when data sharing is not inhibited by privacy concerns. We alleviate these concerns by sanitizing datasets in a two-stage process. First, we in...
Preprint
Full-text available
Distributed deep learning frameworks like federated learning (FL) and its variants are enabling personalized experiences across a wide range of web clients and mobile/IoT devices. However, FL-based frameworks are constrained by computational resources at clients due to the exploding growth of model parameters (e.g., billion-parameter models). Split le...
Preprint
Full-text available
The Coronavirus 2019 (Covid-19) pandemic caused by the SARS-CoV-2 virus represents an unprecedented crisis for our planet. It is a bane of the uber connected world that we live in that this virus has affected almost all countries and caused mortality and economic upheaval at a scale whose effects are going to be felt for generations to come. While...
Conference Paper
Full-text available
Action segmentation refers to inferring boundaries of semantically consistent visual concepts in videos and is an important requirement for many video understanding tasks. For this and other video understanding tasks, supervised approaches have achieved encouraging performance but require a high volume of detailed frame-level annotations. We presen...
Preprint
Full-text available
COVID-19 testing, the cornerstone for effective screening and identification of COVID-19 cases, remains paramount as an intervention tool to curb the spread of COVID-19 both at local and national levels. However, the speed at which the pandemic struck and the response was rolled out, the widespread impact on healthcare infrastructure, the lack of s...
Preprint
Full-text available
Recent deep learning models have shown remarkable performance in image classification. While these deep learning systems are getting closer to practical deployment, the common assumption made about data is that it does not carry any sensitive information. This assumption may not hold for many practical cases, especially in the domain where an indiv...
Conference Paper
Full-text available
Increased deployment of deep learning services on remote cloud instances requires users to share their sensitive information with untrusted parties. Correspondingly, we focus on private inference in these distributed learning setups via sharing of intermediate activations instead of raw inputs. Specifically, we design a dynamic pruning strategy to...
Preprint
Full-text available
The COVID-19 Pandemic has left a devastating trail all over the world, in terms of loss of lives, economic decline, travel restrictions, trade deficit, and collapsing economy including real-estate, job loss, loss of health benefits, the decline in quality of access to care and services and overall quality of life. Immunization from the anticipated...
Preprint
Full-text available
As several COVID-19 vaccine candidates approach approval for human use, governments around the world are preparing comprehensive standards for vaccine distribution and monitoring to avoid long-term consequences that may result from rush-to-market. In this early draft article, we identify challenges for vaccine distribution in four core areas - logi...
Research Proposal
Full-text available
Covid, Risk stratification, Contact Tracing, Risk, Testing
Preprint
Manual contact tracing is a top-down solution that starts with contact tracers at the public health level, who identify the contacts of infected individuals, interview them to get additional context about the exposure, and also monitor their symptoms and support them until the incubation period is past. On the other hand, digital conta...
Conference Paper
Full-text available
Cross-domain fashion item retrieval naturally arises when unconstrained consumer images are used to query for fashion items in a collection of high-quality photographs provided by retailers. To perform this task, approaches typically leverage both consumer and shop domains from a given dataset to learn a domain invariant representation, allowing t...
Chapter
Video recognition has been advanced in recent years by benchmarks with rich annotations. However, research is still mainly limited to human action or sports recognition - focusing on a highly specific video understanding task and thus leaving a significant gap towards describing the overall content of a video. We fill this gap by presenting a large...
Preprint
Full-text available
Manual contact tracing is a top-down solution that starts with contact tracers at the public health level, who identify the contacts of infected individuals, interview them to get additional context about the exposure, and also monitor their symptoms and support them until the incubation period is passed. On the other hand, digital contact tracing...
Preprint
Full-text available
In this work, we introduce SplitNN-driven Vertical Partitioning, a configuration of a distributed deep learning method called SplitNN to facilitate learning from vertically distributed features. SplitNN does not share raw data or model details with collaborating institutions. The proposed configuration allows training among institutions holding div...
Preprint
Full-text available
We propose a low-cost and effective way to combine free simulation software and free CAD models for modeling human-object interaction in order to improve human & object segmentation performance. It is intended for research scenarios related to safe human-robot collaboration (SHRC) and interaction (SHRI) in the industrial domain. The task of...
Thesis
Full-text available
In this thesis, we have focused on self-supervised face representation learning, wherein we proposed methods to automatically generate pseudo-labels for training a neural network. Specifically, we show that with our proposed new techniques to generate weak-labels based on sorting distances (i.e. ranking), clustering algorithm and video constraints,...
Preprint
Full-text available
True understanding of videos comes from a joint analysis of all its modalities: the video frames, the audio track, and any accompanying text such as closed captions. We present a way to learn a compact multimodal feature representation that encodes all these modalities. Our model parameters are learned through a proxy task of inferring the temporal...
Preprint
Full-text available
A good clustering algorithm can discover natural groupings in data. These groupings, if used wisely, provide a form of weak supervision for learning representations. In this work, we present Clustering-based Contrastive Learning (CCL), a new clustering-based representation learning approach that uses labels obtained from clustering along with video...
Article
Characters are a key component of understanding the story conveyed in TV series and movies. With the rise of advanced deep face models, identifying face images may seem like a solved problem. However, as face detectors get better, clustering and identification need to be revisited to address increasing diversity in facial appearance. In this paper,...
Conference Paper
Recently, there has been the development of Split Learning, a framework for distributed computation where model components are split between the client and server (Vepakomma et al., 2018b). As Split Learning scales to include many different model components, there needs to be a method of matching client-side model components with the best server-s...
Preprint
Full-text available
Recently, there has been the development of Split Learning, a framework for distributed computation where model components are split between the client and server (Vepakomma et al., 2018b). As Split Learning scales to include many different model components, there needs to be a method of matching client-side model components with the best server-si...
Preprint
Full-text available
In this work we introduce ExpertMatcher, a method for automating deep learning model selection using autoencoders. Specifically, we are interested in performing inference on data sources that are distributed across many clients using pretrained expert ML networks on a centralized server. The ExpertMatcher assigns the most relevant model(s) in the c...
Conference Paper
Full-text available
In this work we introduce ExpertMatcher, a method for automating deep learning model selection using autoencoders. Specifically, we are interested in performing inference on data sources that are distributed across many clients using pretrained expert ML networks on a centralized server. The ExpertMatcher assigns the most relevant model(s) in the c...
Conference Paper
Full-text available
We propose a novel self-supervised method for fine-tuning deep face representations called Face-Grouping on Graphs. We apply our method to automatic face grouping, where characters are to be separated based on their identity. To solve this problem, a graph structure with positive and negative edges over a set of face-tracks based on their temporal...
Preprint
Full-text available
In this paper, we are interested in self-supervised learning of motion cues in videos using dynamic motion filters, to obtain a better motion representation and ultimately boost human action recognition. Thus far, the vision community has focused on spatio-temporal approaches using standard filters; here we instead propose dynamic filters that a...
Preprint
Full-text available
Action recognition has been advanced in recent years by benchmarks with rich annotations. However, research is still mainly limited to human action or sports recognition - focusing on a highly specific video understanding task and thus leaving a significant gap towards describing the overall content of a video. We fill in this gap by presenting a l...
Conference Paper
Full-text available
We present a new clustering method in the form of a single clustering equation that is able to directly discover groupings in the data. The main proposition is that the first neighbor of each sample is all one needs to discover large chains and find the groups in the data. In contrast to most existing clustering algorithms our method does not re...
Preprint
Full-text available
Analyzing the story behind TV series and movies often requires understanding who the characters are and what they are doing. With improving deep face models, this may seem like a solved problem. However, as face detectors get better, clustering/identification needs to be revisited to address increasing diversity in facial appearance. In this paper,...
Preprint
Full-text available
We present a new clustering method in the form of a single clustering equation that is able to directly discover groupings in the data. The main proposition is that the first neighbor of each sample is all one needs to discover large chains and find the groups in the data. In contrast to most existing clustering algorithms our method does not re...
Preprint
Full-text available
Laparoscopic surgery has a limited field of view. Laser ablation in laparoscopic surgery causes smoke, which inevitably influences the surgeon's visibility. Therefore, it is of vital importance to remove the smoke, such that a clear visualization is possible. In order to employ a desmoking technique, one needs to know beforehand if the image conta...
Conference Paper
Full-text available
Laparoscopic surgery has a limited field of view. Laser ablation in laparoscopic surgery causes smoke, which inevitably influences the surgeon's visibility. Therefore, it is of vital importance to remove the smoke, such that a clear visualization is possible. In order to employ a desmoking technique, one needs to know beforehand if the image conta...
Chapter
Full-text available
The work in this paper is driven by the question of whether spatio-temporal correlations are enough for 3D convolutional neural networks (CNNs). Most of the traditional 3D networks use local spatio-temporal features. We introduce a new block that models correlations between channels of a 3D CNN with respect to temporal and spatial features. This new block c...
Preprint
Full-text available
The work in this paper is driven by the question of whether spatio-temporal correlations are enough for 3D convolutional neural networks (CNNs). Most of the traditional 3D networks use local spatio-temporal features. We introduce a new block that models correlations between channels of a 3D CNN with respect to temporal and spatial features. This new block c...
Preprint
Full-text available
Convolutional neural networks rely on image texture and structure to serve as discriminative features to classify the image content. Image enhancement techniques can be used as preprocessing steps to help improve the overall image quality and in turn improve the overall effectiveness of a CNN. Existing image enhancement methods, however, are design...
Conference Paper
Full-text available
Object detection is a challenging task in the visual understanding domain, and even more so if the supervision is to be weak. Recently, a few efforts to handle the task without expensive human annotations have been established using promising deep neural networks. A new architecture of cascaded networks is proposed to learn a convolutional neural network (CNN) unde...
Conference Paper
Full-text available
In this paper, we present a simple aggregation of frame-level CNN features in a face track to produce a track-level feature representation for face clustering in movies or videos. The approach is invariant of the image sequence and the number of frames the track has. We demonstrate the effectiveness of this strategy on three challenging benchmark v...
Preprint
Full-text available
Object detection is a challenging task in the visual understanding domain, and even more so if the supervision is to be weak. Recently, a few efforts to handle the task without expensive human annotations have been established using promising deep neural networks. A new architecture of cascaded networks is proposed to learn a convolutional neural network (CNN) unde...
Preprint
Full-text available
The CNN-encoding of features from entire videos for the representation of human actions has rarely been addressed. Instead, CNN work has focused on approaches to fuse spatial and temporal networks, but these were typically limited to processing shorter sequences. We present a new video representation, called temporal linear encoding (TLE) and embed...
Article
Full-text available
Image enhancement using visible (V) and near-infrared (NIR) image data usually enhances useful image details. The enhanced images are evaluated by observers' perception, instead of quantitative feature evaluation. Thus, can we say that these enhanced images using NIR information have better features in comparison to the computed features in the Red, Green,...
Conference Paper
Full-text available
We propose a low-cost and effective way to combine free simulation software and free CAD models for modeling human-object interaction in order to improve human & object segmentation. It is intended for research scenarios related to safe human-robot collaboration (SHRC) and interaction (SHRI) in the industrial domain. The task of human and object...
Article
Full-text available
In this paper, we proposed a novel pipeline for image-level classification in hyperspectral images. By doing this, we show that the discriminative spectral information in image-level features leads to significantly improved performance in a face recognition task. We also explored the potential of traditional feature descriptors in the hyperspect...
Conference Paper
Full-text available
This paper is an extension of our work related to a generic classification approach for low-level human body-parts segmentation in RGB-D data. In this paper, we discuss the impact of decision tree parameters, number of training frames and pixel count per object-class during a random forests classifier training. From the evaluation, we observed th...
Conference Paper
Full-text available
Problem statement: In the industrial scenario, humans and robots often share the same workspace, which poses many threats to human safety. We focus on: intuitive and natural human-robot interaction; safety considerations and measures in a shared work environment; the realization of cooperative processes; and workflow optimization. We use a...