Zhuowen Tu's research while affiliated with University of California, San Diego and other places

Publications (15)

Preprint
We study semi-supervised learning (SSL) for vision transformers (ViT), an under-explored topic despite the wide adoption of the ViT architectures to different tasks. To tackle this problem, we propose a new SSL pipeline, consisting of first un/self-supervised pre-training, followed by supervised fine-tuning, and finally semi-supervised fine-tuning....
Preprint
Negative flips are errors introduced in a classification system when a legacy model is replaced with a new one. Existing methods to reduce the negative flip rate (NFR) either do so at the expense of overall accuracy using model distillation, or use ensembles, which multiply inference cost prohibitively. We present a method to train a classification...
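As an illustration of the quantity described above, the following is a minimal sketch of how a negative flip rate could be computed. It assumes the common reading that a negative flip is a sample the legacy model classified correctly and the new model classifies incorrectly, and that the rate is normalised by the total number of samples. The function name `negative_flip_rate` and the toy predictions are hypothetical and do not come from the paper.

```python
from typing import Sequence


def negative_flip_rate(
    labels: Sequence[int],
    old_preds: Sequence[int],
    new_preds: Sequence[int],
) -> float:
    """Fraction of samples the old model got right and the new model gets wrong.

    Assumption: the rate is normalised by the total number of samples.
    """
    if not (len(labels) == len(old_preds) == len(new_preds)):
        raise ValueError("labels and predictions must have the same length")
    if len(labels) == 0:
        raise ValueError("at least one sample is required")
    flips = sum(
        1
        for y, old, new in zip(labels, old_preds, new_preds)
        if old == y and new != y
    )
    return flips / len(labels)


# Toy data (hypothetical): the new model is more accurate overall
# (4/5 correct vs. 3/5) but still breaks one previously correct sample.
labels = [0, 1, 1, 0, 2]
old_preds = [0, 1, 0, 0, 1]
new_preds = [0, 0, 1, 0, 2]

nfr = negative_flip_rate(labels, old_preds, new_preds)
print(f"negative flip rate: {nfr:.2f}")
```

The toy data shows why the metric is tracked separately from accuracy: an upgrade can raise overall accuracy while still introducing errors on samples the legacy model handled correctly.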
Preprint
In this paper, we study the challenging instance-wise vision-language tasks, where the free-form language is required to align with the objects instead of the whole image. To address these tasks, we propose X-DETR, whose architecture has three major components: an object detector, a language encoder, and vision-language alignment. The vision and la...
Preprint
We propose an online tracking algorithm that performs the object detection and data association under a common framework, capable of linking objects after a long time span. This is realized by preserving a large spatio-temporal memory to store the identity embeddings of the tracked objects, and by adaptively referencing and aggregating useful infor...
Preprint
We present Contrastive Neighborhood Alignment (CNA), a manifold learning approach to maintain the topology of learned features whereby data points that are mapped to nearby representations by the source (teacher) model are also mapped to neighbors by the target (student) model. The target model aims to mimic the local structure of the source repres...
Preprint
In this paper, we present Long Short-term TRansformer (LSTR), a new temporal modeling algorithm for online action detection, by employing a long- and short-term memories mechanism that is able to model prolonged sequence data. It consists of an LSTR encoder that is capable of dynamically exploiting coarse-scale historical information from an extens...
Preprint
We tackle the problem of visual search under resource constraints. Existing systems use the same embedding model to compute representations (embeddings) for the query and gallery images. Such systems inherently face a hard accuracy-efficiency trade-off: the embedding model needs to be large enough to ensure high accuracy, yet small enough to enable...
Preprint
Computer vision applications such as visual relationship detection and human-object interaction can be formulated as a composite (structured) set detection problem in which both the parts (subject, object, and predicate) and the sum (triplet as a whole) are to be detected in a hierarchical fashion. In this paper, we present a new approach, denoted...
Preprint
We present a plug-in replacement for batch normalization (BN) called exponential moving average normalization (EMAN), which improves the performance of existing student-teacher based self- and semi-supervised learning techniques. Unlike the standard BN, where the statistics are computed within each batch, EMAN, used in the teacher, updates its stat...
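For readers unfamiliar with the term, the following is a minimal sketch of a generic exponential moving average update applied to a scalar normalization statistic. It is not the EMAN implementation: the abstract is truncated before it says which statistics are averaged, so the function names `ema_update` and `ema_track`, the `momentum` parameter, and the sample values are all assumptions made for illustration.

```python
from typing import Iterable


def ema_update(running: float, observed: float, momentum: float = 0.9) -> float:
    """One exponential-moving-average step.

    Assumption: `momentum` weights the previous running value and
    (1 - momentum) weights the newly observed value.
    """
    return momentum * running + (1.0 - momentum) * observed


def ema_track(values: Iterable[float], momentum: float = 0.9, init: float = 0.0) -> float:
    """Fold a stream of observed statistics into one running estimate."""
    running = init
    for value in values:
        running = ema_update(running, value, momentum)
    return running


# Hypothetical per-batch means; the running estimate changes smoothly
# instead of jumping to each batch's value.
batch_means = [0.8, 1.2, 0.9, 1.1]
running_mean = ema_track(batch_means, momentum=0.9, init=1.0)
print(f"running mean after 4 batches: {running_mean:.4f}")
```

A higher momentum makes the running statistic less sensitive to any single batch, which is the usual motivation for using a moving average instead of per-batch statistics.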
Article
We describe a system to automatically detect clinically significant findings from computerized tomography (CT) head scans, operating at performance levels exceeding that of practicing radiologists. Our system, named DeepRadiologyNet, builds on top of deep convolutional neural networks (CNNs) trained using approximately 3.5 million CT head images ga...
Conference Paper
In this work, we present a novel 3D-Convolutional Neural Network (CNN) architecture called I2I-3D that predicts boundary location in volumetric data. Our fine-to-fine, deeply supervised framework addresses three critical issues to 3D boundary detection: (1) efficient, holistic, end-to-end volumetric label training and prediction (2) precise voxel-l...
Conference Paper
Computational simulations provide detailed hemodynamics and physiological data that can assist in clinical decision-making. However, accurate cardiovascular simulations require complete 3D models constructed from image data. Though edge localization is a key aspect in pinpointing vessel walls in many segmentation tools, the edge detection algorithm...

Citations

... In order to exploit the rich global context, some works regard each image as a whole and adopt the fully connected graph [38], [39], the chained graph [37], and the tree-structured graph [40] to model the contexts among objects. 2) One-stage SGG: They use the fully convolutional network or Transformer to detect the objects and relations from image features directly [42], [43]. In this paper, we propose a model-agnostic debiasing method that can be used in any SGG model. ...
... EMA/Momentum has been studied deeply for smoothing the original sequence signal [33,34,35]. It becomes the widely used technique in practices for most of fields ranging from optimization [36,37,38], reinforcement learning [39,40,41], knowledge distillation [30,42], recent semi-supervised learning frameworks [22,43,44,45], and self-supervised learning methods [19,10,20,46]. Momentum is interpreted as an average of consecutive q-functions in reinforcement learning [39], or is used in SSL frameworks preventing model collapse [20,19,47]. ...
... proposed a framework, namely "Feature Lenses", to encourage image representations transformation-invariant. To balance the trade-off between performance and efficiency, Duggal et al. (2021) designed a compatibility-aware neural architecture search scheme to improve the compatibility of models with different sizes. However, since existing compatible algorithms for image retrieval have not investigated the application of hot-refresh model upgrades, the problem of model regression has been overlooked. ...
... A convolutional neural network application demonstrated performance on par with all tested board-certified experts in classifying skin cancer [43]. In radiology, it is expensive and time consuming to train radiologists, so radiographic image recognition is held to be one of the areas of AI 'witness to the greatest gains'. In one study at Cornell University, Ithaca, New York, USA, a deep convolutional neural network was capable of automatically filtering CT head images and reporting with an error rate well below that for board-certified radiologists. ...
... SimVascular is an actively maintained open source project, with additional enhancements and new features in preparation. For image segmentation, deep learning-based 2D segmentation has been explored with the goal to speed up the image-based anatomic modeling process and make it possible for large scale application of cardiovascular simulation on clinical studies [78,79]. These tools will be added to the GUI in future releases. ...
... A highly imbalanced data poses great difficulty in training DL model and makes model accuracy misleading, for example, in a patient data, where the disease is relatively rare and occurs only in 10% of patients screened. The overall designed model accuracy would be high as most of the patients do not have the disease and will reach local minima [88,89]. The problem of class imbalance can be solved by (a) oversampling the data; the amount of oversampling depends on the extent of imbalance in the dataset. ...
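The point about misleading accuracy can be checked with a small worked example. This is a minimal sketch using the 10% prevalence mentioned in the excerpt; the cohort of 100 patients, the always-negative classifier, and the function name `accuracy_and_recall` are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def accuracy_and_recall(
    labels: Sequence[int], preds: Sequence[int], positive: int = 1
) -> Tuple[float, float]:
    """Return (accuracy, recall on the positive class)."""
    if len(labels) != len(preds) or len(labels) == 0:
        raise ValueError("labels and preds must be non-empty and equal in length")
    correct = sum(1 for y, p in zip(labels, preds) if y == p)
    true_pos = sum(1 for y, p in zip(labels, preds) if y == positive and p == positive)
    actual_pos = sum(1 for y in labels if y == positive)
    accuracy = correct / len(labels)
    recall = true_pos / actual_pos if actual_pos else 0.0
    return accuracy, recall


# Hypothetical screening cohort: 10 of 100 patients have the disease (label 1).
labels = [1] * 10 + [0] * 90
# A degenerate classifier that predicts "no disease" for everyone.
preds = [0] * 100

accuracy, recall = accuracy_and_recall(labels, preds)
print(f"accuracy: {accuracy:.2f}, recall: {recall:.2f}")
```

The degenerate classifier scores 90% accuracy while detecting none of the diseased patients, which is why a per-class metric such as recall is usually reported alongside accuracy on imbalanced data.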