Joseph JaJa’s research while affiliated with University of Maryland, College Park and other places

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping.

Publications (132)


Fig. A1: Sample mammograms for different BIRADS scores
Distribution of BIRADS scores and attributes in the filtered EMBED and RSNA datasets.
Detecting and Monitoring Bias for Subgroups in Breast Cancer Detection AI
  • Preprint
  • File available

February 2025 · 16 Reads

Amit Kumar Kundu · [...] · Joseph Jaja

Automated mammography screening plays an important role in early breast cancer detection. However, current machine learning models, developed on some training datasets, may exhibit performance degradation and bias when deployed in real-world settings. In this paper, we analyze the performance of high-performing AI models on two mammography datasets: the Emory Breast Imaging Dataset (EMBED) and the RSNA 2022 challenge dataset. Specifically, we evaluate how these models perform across different subgroups, defined by six attributes, to detect potential biases using a range of classification metrics. Our analysis identifies certain subgroups that demonstrate notable underperformance, highlighting the need for ongoing monitoring of their performance. To address this, we adopt a monitoring method designed to detect performance drifts over time. Upon identifying a drift, this method issues an alert, enabling timely interventions. This approach not only provides a tool for tracking performance but also helps ensure that AI models continue to perform effectively across diverse populations.
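The monitoring idea described above can be sketched as a simple rolling-window drift check on a per-subgroup metric. The function name, window size, and tolerance below are illustrative assumptions, not details from the paper:

```python
def detect_drift(scores, baseline, window=5, tolerance=0.08):
    """Flag indices where the rolling mean of a subgroup's metric
    (e.g., AUC per evaluation round) drops more than `tolerance`
    below its baseline -- each flag would trigger an alert."""
    alerts = []
    for i in range(window, len(scores) + 1):
        rolling_mean = sum(scores[i - window:i]) / window
        if rolling_mean < baseline - tolerance:
            alerts.append(i - 1)  # last round covered by the window
    return alerts

# Stable performance followed by a drift for one subgroup:
history = [0.90, 0.91, 0.89, 0.90, 0.88, 0.80, 0.78, 0.77, 0.76, 0.75]
alerts = detect_drift(history, baseline=0.90)
```

A windowed mean smooths out single-round noise, so an alert fires only when degradation persists across several rounds.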


HoloCamera: Advanced Volumetric Capture for Cinematic-Quality VR Applications

April 2024 · 19 Reads · 5 Citations

IEEE Transactions on Visualization and Computer Graphics

High-precision virtual environments are increasingly important for various education, simulation, training, performance, and entertainment applications. We present HoloCamera, an innovative volumetric capture instrument to rapidly acquire, process, and create cinematic-quality virtual avatars and scenarios. The HoloCamera consists of a custom-designed free-standing structure with 300 high-resolution RGB cameras mounted with uniform spacing spanning the four sides and the ceiling of a room-sized studio. The light field acquired from these cameras is streamed through a distributed array of GPUs that interleave the processing and transmission of 4K resolution images. The distributed compute infrastructure that powers these RGB cameras consists of 50 Jetson AGX Xavier boards, with each processing unit dedicated to driving and processing imagery from six cameras. A high-speed Gigabit Ethernet network fabric seamlessly interconnects all computing boards. In this systems paper, we provide an in-depth description of the steps involved and lessons learned in constructing such a cutting-edge volumetric capture facility that can be generalized to other such facilities. We delve into the techniques employed to achieve precise frame synchronization and spatial calibration of cameras, careful determination of angled camera mounts, image processing from the camera sensors, and the need for a resilient and robust network infrastructure. To advance the field of volumetric capture, we are releasing a high-fidelity static light-field dataset, which will serve as a benchmark for further research and applications of cinematic-quality volumetric light fields.
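The scale of the rig motivates the distributed design with back-of-envelope arithmetic. The camera and board counts come from the description above; the frame rate and bytes per pixel are assumptions for illustration, not figures from the paper:

```python
# 300 cameras, six per Jetson AGX Xavier board (from the description):
cameras = 300
cameras_per_board = 6
boards = cameras // cameras_per_board        # 50 processing boards

# Assumed capture parameters (not stated in the abstract):
width, height = 3840, 2160                   # 4K frame
bytes_per_pixel = 3                          # 8-bit RGB
fps = 30

raw_bytes_per_sec = width * height * bytes_per_pixel * fps
# ~0.75 GB/s of raw pixels per camera, far beyond a 1 Gb/s
# (~0.125 GB/s) link -- hence per-board processing before transmission.
```

Under these assumptions, each board's six cameras produce several gigabytes of raw pixels per second, which explains why image processing is interleaved on the GPUs before anything reaches the Ethernet fabric.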


Figure 2. Output of the prototypical network with embedding dim m = 2 when the input is real pairs of data from the 3DShapes dataset which differ in a single factor of variation. Each color corresponds to a unique factor which differs in value amongst the pair. The network clusters the changes correctly on the pairs from the original dataset. This suggests that the prototypical network is clustering pairs of images based on the changed factor of variation. Left: λ = 10. Right: λ = 5.
Figure 3. A comparison of latent traversals in latent space for the 3DShapes and Dsprites dataset. Left: 3DShapes, Right: Dsprites. ProtoVAE produces smooth, disentangled latent representations. Row 1 and 2 are some sample original images, and their reconstructions generated by our model, respectively. Rows 3 downward are the traversals for each latent element, as detailed below. For 3DShapes, we actually see a near-perfect traversal across all of the known factors of variation.
Figure 4. Latent traversals on the MPI3D real world disentanglement dataset. The data is collected via a camera that observes a jointed arm with known changed ground truth factors of variation. From top to bottom: original data, reconstruction, arm angle left/right, arm angle top/bottom, background height, arm end color, size. The KL values represent the amount of information encoded by that dimension of the representation.
Figure 5. Latent traversals on the CelebA dataset. ProtoVAE successfully captures ground-truth factors of variation on real-world data. From top to bottom: background color, hairstyle, head angle, age, hairstyle, hair color, skin color, face profile.
ProtoVAE: Prototypical Networks for Unsupervised Disentanglement

May 2023 · 184 Reads

Generative modeling and self-supervised learning have in recent years made great strides towards learning from data in a completely unsupervised way. However, guiding a neural network to encode the data into representations that are interpretable or explainable remains an open area of investigation. The problem of unsupervised disentanglement is of particular importance as it proposes to discover the different latent factors of variation or semantic concepts from the data alone, without labeled examples, and encode them into structurally disjoint latent representations. Without additional constraints or inductive biases placed in the network, a generative model may learn the data distribution and encode the factors, but not necessarily in a disentangled way. Here, we introduce a novel deep generative VAE-based model, ProtoVAE, that leverages a deep metric learning prototypical network trained using self-supervision to impose these constraints. The prototypical network constrains the mapping from the representation space to the data space to ensure that controlled changes in the representation space are mapped to changes in the factors of variation in the data space. Our model is completely unsupervised and requires no a priori knowledge of the dataset, including the number of factors. We evaluate our proposed model on the benchmark dSprites, 3DShapes, and MPI3D disentanglement datasets, showing state-of-the-art results against previous methods via qualitative traversals in the latent space, as well as quantitative disentanglement metrics. We further qualitatively demonstrate the effectiveness of our model on the real-world CelebA dataset.
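The core prototypical-network step (as in Figure 2's clustering of changed factors) can be sketched in a few lines: a prototype is the mean embedding of each class's support examples, and queries are assigned to the nearest prototype. This is a generic numpy sketch, not the paper's implementation:

```python
import numpy as np

def class_prototypes(embeddings, labels):
    """Prototype = mean embedding of each class's support examples."""
    labels = np.asarray(labels)
    classes = sorted(set(labels.tolist()))
    protos = np.stack([embeddings[labels == c].mean(axis=0) for c in classes])
    return classes, protos

def nearest_prototype(queries, protos):
    """Assign each query embedding to its closest prototype (Euclidean)."""
    dists = np.linalg.norm(queries[:, None, :] - protos[None, :, :], axis=-1)
    return dists.argmin(axis=1)

# Two well-separated clusters of pair-embeddings (embedding dim m = 2):
support = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
classes, protos = class_prototypes(support, [0, 0, 1, 1])
pred = nearest_prototype(np.array([[0.2, 0.3], [4.9, 5.1]]), protos)
```

In ProtoVAE's setting the "classes" are the changed factors of variation between image pairs, so clustering pair-embeddings around prototypes is what enforces the constraint described in the abstract.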



Fig. 1: The full architecture of Dot-VAE. Input data is fed into an encoder, which projects the data into two disjoint latent spaces, a disentangled c and an entangled z. Interventions change the value t of a single disentangled latent at index k to a new value t′ while keeping the other latents unchanged, yielding the intervened representation (c′_k, z). The adversarial network ensures that the distributions of the real representations c and the intervened representations c′_k are close to each other. The decoder maps the intervened latents to generated data x̂_k. This new generated data is passed back through the encoder to train the generator to make distinct and noticeable changes.
Fig. 2: Latent traversals of the disentangled code c for the 3DShapes (left) and dSprites (right) datasets. Our method disentangles the informative factors and encodes them in the different dimensions of c. The top two rows are the original and reconstructed images, respectively. Each row below the second corresponds to a traversal. 3DShapes (left): each row corresponds to a distinct independent factor. Rows three through ten correspond to orientation, wall hue, size, scale, object hue (two rows), and floor color (two rows), respectively, covering all of the exact factors of variation. In the last row, we can see the model discerned it had discovered all the factors of variation and need not encode anything. dSprites (right): similarly, the model discerns the factors in rows 3 through 7: x coordinate, y coordinate, size, shape, and orientation.
Fig. 3: Traversals over latent space c for the CelebA real-world celebrity dataset, which consists of cropped headshots of various celebrities. Given a seed image, we perform a traversal across various latents. Our model successfully produces reconstructions that correspond to identifiable factors. As with the synthetic data, some of the latents remain unused; shown are exemplar latents that encode factors of variation. The model learns many more factors than previous methods, such as hair color, skin tone, and lighting.
DOT-VAE: Disentangling One Factor at a Time

October 2022 · 82 Reads

As we enter the era of machine learning characterized by an overabundance of data, discovery, organization, and interpretation of the data in an unsupervised manner becomes a critical need. One promising approach to this endeavour is the problem of disentanglement, which aims at learning the underlying generative latent factors, called the factors of variation, of the data and encoding them in disjoint latent representations. Recent advances have made efforts to solve this problem for synthetic datasets generated by a fixed set of independent factors of variation. Here, we propose to extend this to real-world datasets with a countable number of factors of variation. We propose a novel framework which augments the latent space of a Variational Autoencoder with a disentangled space and is trained using a Wake-Sleep-inspired two-step algorithm for unsupervised disentanglement. Our network learns to disentangle interpretable, independent factors from the data "one at a time" and encode them in different dimensions of the disentangled latent space, while making no prior assumptions about the number of factors or their joint distribution. We demonstrate its quantitative and qualitative effectiveness by evaluating the latent representations learned on two synthetic benchmark datasets, dSprites and 3DShapes, and on the real-world dataset CelebA.
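The "intervention" at the heart of the architecture (change one disentangled latent, hold the rest fixed) is simple to sketch; stacking interventions over a range of values is exactly what renders a latent-traversal row. Function names here are illustrative, not from the paper:

```python
import numpy as np

def intervene(c, k, new_value):
    """Replace disentangled latent dimension k with a new value,
    leaving all other dimensions of the code unchanged."""
    c_k = np.array(c, dtype=float)
    c_k[..., k] = new_value
    return c_k

def traverse(c, k, values):
    """Stack of intervened codes -- one row per value, as used to
    render a latent-traversal figure for dimension k."""
    return np.stack([intervene(c, k, v) for v in values])

c = np.array([0.1, -0.4, 2.0])           # a sample disentangled code
row = traverse(c, k=1, values=np.linspace(-2, 2, 5))
```

In the full model each intervened code would be decoded to an image and passed through the adversarial network, so that changing dimension k produces a distinct, noticeable change in the output.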


FedNet2Net: Saving Communication and Computations in Federated Learning with Model Growing

September 2022 · 4 Reads · 1 Citation

Lecture Notes in Computer Science

Federated learning (FL) is a recently developed area of machine learning, in which the private data of a large number of distributed clients is used to develop a global model under the coordination of a central server without explicitly exposing the data. The standard FL strategy has a number of significant bottlenecks, including large communication requirements and high impact on the clients' resources. Several strategies have been described in the literature trying to address these issues. In this paper, a novel scheme based on the notion of "model growing" is proposed. Initially, the server deploys a small model of low complexity, which is trained to capture the data complexity during the initial set of rounds. When the performance of such a model saturates, the server switches to a larger model with the help of function-preserving transformations. The model complexity increases as more data is processed by the clients, and the overall process continues until the desired performance is achieved. Therefore, the most complex model is broadcast only at the final stage in our approach, resulting in substantial reduction in communication cost and client computational requirements. The proposed approach is tested extensively on three standard benchmarks and is shown to achieve substantial reduction in communication and client computation while achieving comparable accuracy when compared to the current most effective strategies.

Keywords: Communication efficiency, Federated learning, Function preserving transformation
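The function-preserving switch to a larger model can be illustrated with the classic Net2WiderNet-style transformation: duplicate a hidden unit and halve its outgoing weights, so the widened network computes exactly the same function. This is a minimal numpy sketch for a two-layer ReLU MLP, not the paper's training code:

```python
import numpy as np

def net2wider(W1, b1, W2, unit):
    """Widen the hidden layer by duplicating `unit` and splitting its
    outgoing weights in half, preserving the network's function."""
    W1w = np.concatenate([W1, W1[:, unit:unit + 1]], axis=1)
    b1w = np.append(b1, b1[unit])
    W2w = np.concatenate([W2, W2[unit:unit + 1, :]], axis=0)
    W2w[unit, :] *= 0.5   # original copy
    W2w[-1, :] *= 0.5     # duplicate copy
    return W1w, b1w, W2w

def mlp(x, W1, b1, W2):
    return np.maximum(x @ W1 + b1, 0.0) @ W2   # ReLU hidden layer

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
W1, b1, W2 = rng.normal(size=(3, 5)), rng.normal(size=5), rng.normal(size=(5, 2))
W1w, b1w, W2w = net2wider(W1, b1, W2, unit=2)
# mlp(x, W1w, b1w, W2w) matches mlp(x, W1, b1, W2) to numerical precision.
```

Because the transformation preserves the function, clients can keep training from the switch point without losing the accuracy already achieved by the smaller model.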


DOT-VAE: Disentangling One Factor at a Time

September 2022 · 62 Reads

Lecture Notes in Computer Science

As we enter the era of machine learning characterized by an overabundance of data, discovery, organization, and interpretation of the data in an unsupervised manner becomes a critical need. One promising approach to this endeavour is the problem of disentanglement, which aims at learning the underlying generative latent factors, called the factors of variation, of the data and encoding them in disjoint latent representations. Recent advances have made efforts to solve this problem for synthetic datasets generated by a fixed set of independent factors of variation. Here, we propose to extend this to real-world datasets with a countable number of factors of variation. We propose a novel framework which augments the latent space of a Variational Autoencoder with a disentangled space and is trained using a Wake-Sleep-inspired two-step algorithm for unsupervised disentanglement. Our network learns to disentangle interpretable, independent factors from the data “one at a time” and encode them in different dimensions of the disentangled latent space, while making no prior assumptions about the number of factors or their joint distribution. We demonstrate its quantitative and qualitative effectiveness by evaluating the latent representations learned on two synthetic benchmark datasets, dSprites and 3DShapes, and on the real-world dataset CelebA.

Keywords: Deep learning, Representation learning, Unsupervised disentanglement


Fig. 1: Overview of the FedNet2Net training.
Fig. 2: Model switching in FedNet2Net training using the EMNIST dataset
FedNet2Net: Saving Communication and Computations in Federated Learning with Model Growing

July 2022 · 50 Reads · 1 Citation

Federated learning (FL) is a recently developed area of machine learning, in which the private data of a large number of distributed clients is used to develop a global model under the coordination of a central server without explicitly exposing the data. The standard FL strategy has a number of significant bottlenecks including large communication requirements and high impact on the clients' resources. Several strategies have been described in the literature trying to address these issues. In this paper, a novel scheme based on the notion of "model growing" is proposed. Initially, the server deploys a small model of low complexity, which is trained to capture the data complexity during the initial set of rounds. When the performance of such a model saturates, the server switches to a larger model with the help of function-preserving transformations. The model complexity increases as more data is processed by the clients, and the overall process continues until the desired performance is achieved. Therefore, the most complex model is broadcast only at the final stage in our approach resulting in substantial reduction in communication cost and client computational requirements. The proposed approach is tested extensively on three standard benchmarks and is shown to achieve substantial reduction in communication and client computation while achieving comparable accuracy when compared to the current most effective strategies.


Class-Similarity Based Label Smoothing for Confidence Calibration

September 2021 · 31 Reads · 5 Citations

Lecture Notes in Computer Science

Generating confidence calibrated outputs is of utmost importance for the applications of deep neural networks in safety-critical decision-making systems. The output of a neural network is a probability distribution where the scores are estimated confidences of the input belonging to the corresponding classes, and hence they represent a complete estimate of the output likelihood relative to all classes. In this paper, we propose a novel form of label smoothing to improve confidence calibration. Since different classes are of different intrinsic similarities, more similar classes should result in closer probability values in the final output. This motivates the development of a new smooth label where the label values are based on similarities with the reference class. We adopt different similarity measurements, including those that capture feature-based similarities or semantic similarity. We demonstrate through extensive experiments, on various datasets and network architectures, that our approach consistently outperforms state-of-the-art calibration techniques including uniform label smoothing.
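One way to form such similarity-based soft labels is to keep mass 1 − ε on the true class and spread ε over the other classes in proportion to their similarity to it. The exact construction in the paper may differ; this is an illustrative numpy sketch:

```python
import numpy as np

def similarity_smooth(target, S, eps=0.1):
    """Soft label keeping 1 - eps on the true class and distributing
    eps over the remaining classes in proportion to their similarity
    to the true class (row `target` of similarity matrix S)."""
    sims = np.array(S[target], dtype=float)
    sims[target] = 0.0                    # exclude self-similarity
    soft = eps * sims / sims.sum()
    soft[target] = 1.0 - eps
    return soft

# Classes 0 and 1 similar (e.g., two dog breeds), class 2 dissimilar:
S = np.array([[1.0, 0.8, 0.2],
              [0.8, 1.0, 0.2],
              [0.2, 0.2, 1.0]])
label = similarity_smooth(0, S)
```

Compared with uniform label smoothing, the similar class receives a larger share of the smoothing mass, so the target distribution reflects the intrinsic class structure.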


Learning brain dynamics for decoding and predicting individual differences

September 2021 · 157 Reads · 11 Citations

Insights from functional Magnetic Resonance Imaging (fMRI), as well as recordings of large numbers of neurons, reveal that many cognitive, emotional, and motor functions depend on the multivariate interactions of brain signals. To decode brain dynamics, we propose an architecture based on recurrent neural networks to uncover distributed spatiotemporal signatures. We demonstrate the potential of the approach using human fMRI data from movie watching and a continuous experimental paradigm. The model was able to learn spatiotemporal patterns that supported 15-way movie-clip classification (∼90%) at the level of brain regions, and binary classification of experimental conditions (∼60%) at the level of voxels. The model was also able to learn individual differences in measures of fluid intelligence and verbal IQ at levels comparable to that of existing techniques. We propose a dimensionality reduction approach that uncovers low-dimensional trajectories and captures essential informational (i.e., classification related) properties of brain dynamics. Finally, saliency maps and lesion analysis were employed to characterize brain-region/voxel importance, and uncovered how dynamic but consistent changes in fMRI activation influenced decoding performance. When applied at the level of voxels, our framework implements a dynamic version of multivariate pattern analysis. Our approach provides a framework for visualizing, analyzing, and discovering dynamic spatially distributed brain representations during naturalistic conditions.


Citations (69)


... Previously also, scholars explored the narrative framework of screenwriting for cine-VR (Alves et al., 2023), the idea of the spatialised screenplay (Ross & Munt, 2018), a narrative structure for open-world cine-VR (Mazarei, 2023) and writing for space instead of screen (Reyes, 2022). Likewise, the other filmmaking aspects are also explored, such as the mode of production (Chan, 2023;Zhang & Weber, 2023), the use of the camera (Heagerty et al., 2024), editing and transitions (Marañes et al., 2023;Medlar et al., 2024;Zhang et al., 2024). ...

Reference:

Spatial Sound Design for Cinematic Virtual Reality—A Bibliometric Analysis
HoloCamera: Advanced Volumetric Capture for Cinematic-Quality VR Applications
  • Citing Article
  • April 2024

IEEE Transactions on Visualization and Computer Graphics

... Therefore, web archives should be organized to support search and information exploration within a temporal context. In this ongoing work [17], we consider the mechanisms needed to enable information exploration, search, and access of archived web contents within a temporal context while effectively handling the highly unstructured, complex contents and their linking structures, all typically at very large scale. ...

Search and Access Strategies for Web Archives
  • Citing Article
  • January 2009

Archiving Conference

... This is possible with the message passing that exchange information between neighboring nodes Wang et al. [2022]. The operator commonly include convolution, pooling, attention, heterogeneous and point cloud functions Jin and JaJa [2022], Lai et al. [2020]. The GNN ability to capture both local and global structural information that help to understand basic building components and overview level working simultaneously Papp and Wattenhofer [2022]. ...

Improving Graph Neural Network with Learnable Permutation Pooling
  • Citing Conference Paper
  • November 2022

... The result shows that the modified SET Algorithm succeeded in maximizing the learning performance and minimizing the communication cost by encoding only two hyperparameters. Zhu and Jin [54] proposed FedNet2Net, a novel scheme based on a model growing with a modified training scheme. This approach used two transformations, which are called Net2Widernet and Net2DeeperNet. ...

FedNet2Net: Saving Communication and Computations in Federated Learning with Model Growing

... Since the seminal work by Fujishima from 1999 [146], most chord recognition systems applied a knowledge-driven approach [267], involving the extraction of acoustic features, such as chroma [259] or Tonnentz [202], followed by classification or template matching techniques, such as HMMs [21], Dynamic Bayesian Networks (DBNs) [259], or Conditional Random Fields (CRFs) [228]. [250], authors propose an impactful method for generating more reliable soft labels that explicitly consider the relationships among various categories. Similarly, in [245], a novel approach known as label relaxation is introduced, which involves replacing a degenerate probability distribution associated with an observed class label, not by a single smoothed distribution but rather by a larger set of candidate distributions. ...

Class-Similarity Based Label Smoothing for Confidence Calibration
  • Citing Chapter
  • September 2021

Lecture Notes in Computer Science

... If neural representations are organized across multiple scales of variance, then individual differences in these representations might similarly span many latent dimensions. Alternatively, although stimulus-related variance is embedded in high-dimensional subspaces, it is possible that only a low-dimensional subspace might be relevant for individual differences in visual experience (13)(14)(15). ...

Learning brain dynamics for decoding and predicting individual differences

... For example, ABM can simulate human interactions in social networks with the ability to control network sizes, type, and frequency of interaction, and observe individual and group outcomes from weeks to years. At the time of writing, current vocal health-related ABMs have focused on studying cellular and molecular behavior in laryngeal systems [68][69][70][71][72][73][74][75][76] . ABM or other computer simulation methods (e.g., system dynamics) related to social stigmatization or de-stigmatization are barely reported 77,78 . ...

High-Performance Host-Device Scheduling and Data-Transfer Minimization Techniques for Visualization of 3D Agent-Based Wound Healing Applications

... Foveated light-field optics have been proposed [37] and these can be integrated with algorithms that foveate which portions of the scene to render at high resolution to reduce rendering resource consumption. Algorithms include perceptually guided foveation [38], [39] and hardwareoptimized rendering [40]. Unlike our depth sensor, these use passive displays and cameras to optimize bandwidth, storage, and compute. ...

3D-Kernel Foveated Rendering for Light Fields
  • Citing Article
  • February 2020

IEEE Transactions on Visualization and Computer Graphics