Michael Blumenstein’s research while affiliated with University of Technology Sydney and other places


Publications (393)


Fig. 1: Overview of the proposed method. Left: Predicted locations for a new person in the scene. Middle: Estimated scale at each predicted location. Right: Final human pose estimated after scaling and deformation at each predicted location.
Fig. 3: An illustration of the proposed architecture. The workflow is divided into four subnetworks to estimate the probable location o*, pose template class y*, scaling parameters s*, and linear deformations d* of a potential target pose. Every subnetwork exclusively uses the proposed Mutual Cross-Modal Attention (MCMA) block to encode global and local scene contexts, as shown in Fig. 2.
Exploring Mutual Cross-Modal Attention for Context-Aware Human Affordance Generation
  • Preprint

February 2025 · [...] · Michael Blumenstein

Human affordance learning investigates contextually relevant novel pose prediction, such that the estimated pose represents a valid human action within the scene. While the task is fundamental to machine perception and automated interactive navigation agents, the exponentially large number of probable pose and action variations makes the problem challenging. Moreover, existing datasets and methods for human affordance prediction in 2D scenes are significantly limited. In this paper, we propose a novel cross-attention mechanism that encodes the scene context for affordance prediction by mutually attending spatial feature maps from two different modalities. The proposed method is disentangled into individual subtasks to reduce the problem complexity. First, we sample a probable location for a person within the scene using a variational autoencoder (VAE) conditioned on the global scene context encoding. Next, we predict a potential pose template from a set of existing human pose candidates using a classifier on the local context encoding around the predicted location. In the subsequent steps, we use two VAEs to sample the scale and deformation parameters for the predicted pose template, conditioning on the local context and the template class. Our experiments show significant improvements over the previous baseline of human affordance injection into complex 2D scenes.
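As a rough illustration of this disentangled design, the sketch below wires a location sampler, a template classifier, and two conditional samplers together in PyTorch. All module names, context encoders, layer sizes, and the template and joint counts are assumptions made for the sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalSampler(nn.Module):
    """VAE-style conditional sampler: draws a target vector given a context
    vector (the training-time encoder and KL terms are omitted for brevity)."""
    def __init__(self, out_dim, ctx_dim, latent_dim=32):
        super().__init__()
        self.latent_dim = latent_dim
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + ctx_dim, 128), nn.ReLU(),
            nn.Linear(128, out_dim))

    def sample(self, ctx):
        z = torch.randn(ctx.size(0), self.latent_dim, device=ctx.device)
        return self.decoder(torch.cat([z, ctx], dim=-1))

class AffordancePipeline(nn.Module):
    """Four disentangled subtasks: location o*, template class y*,
    scale s*, and deformation d*."""
    def __init__(self, ctx_dim=256, num_templates=10, num_joints=17):
        super().__init__()
        self.num_templates = num_templates
        self.location_vae = ConditionalSampler(2, ctx_dim)               # o*
        self.template_cls = nn.Linear(ctx_dim, num_templates)            # y*
        self.scale_vae = ConditionalSampler(2, ctx_dim + num_templates)  # s*
        self.deform_vae = ConditionalSampler(2 * num_joints,
                                             ctx_dim + num_templates)    # d*

    def forward(self, global_ctx, local_ctx_of):
        loc = self.location_vae.sample(global_ctx)       # where a person fits
        local_ctx = local_ctx_of(loc)                    # context around o*
        y = self.template_cls(local_ctx).argmax(dim=-1)  # pose template class
        cond = torch.cat(
            [local_ctx, F.one_hot(y, self.num_templates).float()], dim=-1)
        return loc, y, self.scale_vae.sample(cond), self.deform_vae.sample(cond)
```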


d-Sketch: Improving Visual Fidelity of Sketch-to-Image Translation with Pretrained Latent Diffusion Models without Retraining

February 2025

Structural guidance in image-to-image translation allows intricate control over the shapes of synthesized images. Generating high-quality realistic images from user-specified rough hand-drawn sketches is one such task, aiming to impose a structural constraint on the conditional generation process. While the premise is intriguing for numerous use cases in content creation and academic research, the problem becomes fundamentally challenging due to substantial ambiguities in freehand sketches. Furthermore, balancing the trade-off between shape consistency and realistic generation adds complexity to the process. Existing approaches based on Generative Adversarial Networks (GANs) generally utilize conditional GANs or GAN inversion, often requiring application-specific data and optimization objectives. The recent introduction of Denoising Diffusion Probabilistic Models (DDPMs) has brought a generational leap in general image synthesis, particularly for low-level visual attributes. However, directly retraining a large-scale diffusion model on a domain-specific subtask is often extremely difficult due to demanding computation costs and insufficient data. In this paper, we introduce a technique for sketch-to-image translation that exploits the feature generalization capabilities of a large-scale diffusion model without retraining. In particular, we use a learnable lightweight mapping network to achieve latent feature translation from the source to the target domain. Experimental results demonstrate that the proposed method outperforms existing techniques on qualitative and quantitative benchmarks, allowing high-resolution realistic image synthesis from rough hand-drawn sketches.
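The core mechanism (a small learnable mapper translating sketch-domain latents into the latent space of a frozen pretrained diffusion model) can be sketched as below. The residual convolutional design, channel counts, and names are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class LatentMapper(nn.Module):
    """Lightweight mapping network: translates a sketch-domain latent into
    the latent space expected by a frozen, pretrained diffusion model."""
    def __init__(self, latent_channels=4, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(latent_channels, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, latent_channels, 3, padding=1))

    def forward(self, sketch_latent):
        # Residual formulation keeps the translation close to identity.
        return sketch_latent + self.net(sketch_latent)

# Only the mapper is optimized; the large diffusion model stays frozen,
# which is what makes the approach cheap relative to full retraining.
mapper = LatentMapper()
optimizer = torch.optim.Adam(mapper.parameters(), lr=1e-4)
```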


SwinInsSeg: An Improved SOLOv2 Model Using the Swin Transformer and a Multi-Kernel Attention Module for Ship Instance Segmentation

January 2025

Maritime surveillance is essential for ensuring security in the complex marine environment. This study presents SwinInsSeg, an instance segmentation model that combines the Swin Transformer with a lightweight multi-kernel attention (MKA) module to segment ships accurately and efficiently for maritime surveillance. Current models have limitations in segmenting multiscale ships and producing accurate segmentation boundaries. SwinInsSeg addresses these limitations by identifying ships of various sizes, both small and large, and capturing finer details through the MKA module, which emphasizes important information at different processing stages. Performance evaluations on the MariBoats and ShipInsSeg datasets show that SwinInsSeg outperforms YOLACT, SOLO, and SOLOv2, achieving mask average precision scores of 50.6% and 52.0%, respectively. These results demonstrate SwinInsSeg's superior capability in segmenting ship instances with improved accuracy.
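The abstract does not specify the MKA module's internals, but a generic multi-kernel attention block along the following lines conveys the idea; the kernel sizes, additive fusion, and channel gate are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class MultiKernelAttention(nn.Module):
    """Hypothetical multi-kernel attention (MKA) block: parallel convolutions
    with different receptive fields, fused and used to reweight channels."""
    def __init__(self, channels, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2)
            for k in kernel_sizes)
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),          # global channel statistics
            nn.Conv2d(channels, channels, 1),
            nn.Sigmoid())

    def forward(self, x):
        multi_scale = sum(branch(x) for branch in self.branches)
        return x * self.gate(multi_scale)     # emphasize informative channels
```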


MMC: Multi-modal colorization of images using textual description

December 2024

Signal, Image and Video Processing

Handling various objects with different colours is a significant challenge for image colourisation techniques. Thus, for complex real-world scenes, existing image colourisation algorithms often fail to maintain colour consistency. In this work, we integrate textual descriptions as an auxiliary condition, alongside the greyscale image to be colourised, to improve the fidelity of the colourisation process. To do so, we propose a deep network that takes two inputs (a greyscale image and the corresponding encoded text description) and predicts the relevant colour components. We also segment each object in the image and colourise it with its individual description, incorporating object-specific attributes into the colourisation process. A fusion model then combines all the object segments to generate the final colourised image. As the textual descriptions carry colour information about the objects in the image, text encoding helps improve the overall quality of the predicted colours. The proposed method outperforms existing colourisation techniques in terms of the LPIPS, PSNR, and SSIM metrics.
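A minimal sketch of such a two-input network, assuming Lab colour space, a precomputed text embedding (from any sentence encoder), and illustrative layer sizes, none of which are taken from the paper:

```python
import torch
import torch.nn as nn

class TextGuidedColorizer(nn.Module):
    """Two-input colourisation sketch: a greyscale (L) channel plus an
    encoded text description, predicting the two chroma (ab) channels."""
    def __init__(self, text_dim=512, feat=64):
        super().__init__()
        self.image_enc = nn.Sequential(
            nn.Conv2d(1, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU())
        self.text_proj = nn.Linear(text_dim, feat)
        self.decoder = nn.Conv2d(feat, 2, 3, padding=1)  # predict a, b

    def forward(self, gray, text_emb):
        f = self.image_enc(gray)                        # (B, feat, H, W)
        t = self.text_proj(text_emb)[:, :, None, None]  # broadcast over space
        return torch.tanh(self.decoder(f + t))          # chroma in [-1, 1]
```

In the paper's per-object scheme, a network like this would be applied once per segmented object with that object's own description, and a separate fusion model would then merge the colourised segments.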








Citations (55)


... The latter category for ITS (non-recurrent architectures paired with autoregressive decoding) has recently gained traction due to the unprecedented "zero-shot" ability of autoregressive LLMs (Mirchandani et al., 2023; Huang et al., 2022; Qi et al., 2023; Kojima et al., 2022; Shen et al., 2024; Nate Gruver & Wilson, 2023; Kwon et al., 2023; Anonymous, 2024). An emerging line of research attempts to formalize "autoregressive learnability", i.e., AR-learnability (Malach, 2023; Xiao & Liu, 2024). ...

Reference:

Model Successor Functions
Retrieval-Augmented Retrieval: Large Language Models are Strong Zero-Shot Retriever
  • Citing Conference Paper
  • January 2024

... This framework demonstrates strong robustness in handling variations in ship scales and changes in visual shapes, making it highly adaptable to diverse scenarios. Sharma et al. [19] utilized an attention mechanism to strengthen the acquisition of multiscale features across multiple dimensions. By focusing on critical regions within the feature space, this method improves the model's ability to capture intricate details and contextual relationships. ...

MASSNet: Multiscale Attention for Single-Stage Ship Instance Segmentation
  • Citing Article
  • May 2024

Neurocomputing

... In recent years, many scholars have explored the application of fire detection to UAVs or edge computing devices [31,32]. UAV-mounted cameras have also been used for target detection in forest fire images and videos, but problems remain with video processing and feature extraction [33,34]. YOLO series algorithms show better results in target detection tasks, especially YOLOv5, whose introduction brought a qualitative leap to the YOLO series. ...

A Locally Weighted Linear Regression Based Approach for Arbitrary Moving Shaky and Non-Shaky Video Classification
  • Citing Article
  • December 2023

International Journal of Pattern Recognition and Artificial Intelligence

... Yang et al. pioneered multi-dimensional scene text identification in 2018. Long et al. [40] proposed TextSnake, a novel technique for text line representation in which the text region (TR) is decomposed into a sequence of ordered, overlapping disks. ...

A New Transformer-Based Approach for Text Detection in Shaky and Non-shaky Day-Night Video
  • Citing Chapter
  • November 2023

Lecture Notes in Computer Science
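For readers unfamiliar with the TextSnake geometry referenced in the citation context above, the representation can be captured by a small data structure: an ordered sequence of overlapping disks, each with a centre on the text centre line, a radius, and a local orientation. The field names and the stepping heuristic below are illustrative assumptions, not the original implementation.

```python
from dataclasses import dataclass
import math

@dataclass
class TextDisk:
    """One disk in a TextSnake-style text region: centre, radius
    (roughly half the local text height), and local orientation."""
    cx: float
    cy: float
    radius: float
    theta: float  # orientation in radians

def next_centre(disk: TextDisk, step: float = 0.5):
    """Walk along the centre line to the next disk centre; a step
    fraction below 1 keeps consecutive disks overlapping."""
    stride = step * disk.radius
    return (disk.cx + stride * math.cos(disk.theta),
            disk.cy + stride * math.sin(disk.theta))
```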

... However, it is unknown to what degree generative AIs are trained for the optimization of statistical color parameters, especially as the image training database and specifics of the AI algorithm may not be disclosed, or certain aspects of images may be more heavily weighted, such as style [10]. Empirical study of AI's performance has revealed deviations from human color processing: color distortions were observed when performing AI colorization tasks on grayscale images [11,12], varying the color distributions within training data reduced object identification [13], and changing pixel colors by the smallest magnitude (1/256 for an 8-bit image) resulted in misclassification of animals within images (e.g., misclassifying a panda as a gibbon) [14]. On the other hand, deep neural networks (DNNs) have been found to exhibit emergent humanlike color properties, such as color constancy and categorical perception [15], even though neural networks typically learn and generate images in RGB rather than color spaces tuned for human perception [16]. ...

TIC: text-guided image colorization using conditional generative model
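The panda-to-gibbon failure mentioned in the citation context above is the classic one-step adversarial perturbation. A minimal sketch, assuming a differentiable PyTorch classifier with inputs normalised to [0, 1] (where one quantization level of an 8-bit image is roughly 1/255):

```python
import torch
import torch.nn.functional as F

def minimal_perturbation(model, image, label, eps=1 / 255):
    """FGSM-style one-step attack: nudge every pixel by the smallest
    magnitude in the direction that increases the classification loss."""
    image = image.clone().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # eps ~ one intensity level of an 8-bit image after [0, 1] scaling
    return (image + eps * image.grad.sign()).clamp(0, 1).detach()
```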

... Liu et al. [11] propose a feature extraction method for contraband detection, which uses the Canny operator edge extractor to emphasize the edge information of occluded objects. Liu et al. [12] design a feature aggregation method for contraband detection tasks, which uses an edge detection module to aggregate multi-level features to generate robust edge features. Xiang et al. [13] propose a material-aware coordinate attention module, which encodes features along the horizontal and vertical coordinate directions, focusing on material features to alleviate the impact of overlapping occlusion. ...

A New Few-Shot Learning-Based Model for Prohibited Objects Detection in Cluttered Baggage X-Ray Images Through Edge Detection and Reverse Validation
  • Citing Article
  • January 2023

IEEE Signal Processing Letters

... However, the development of marine surveillance systems using computer vision has largely replaced manual monitoring and is now widely used for managing maritime traffic and conducting surveillance [2][3][4][5]. Ship detection algorithms, as explored by Nalamati et al. (2020, 2022), Park et al. (2022), and Xing et al. (2023) [6][7][8][9], utilize bounding boxes to identify various features. Still, they often struggle to provide accurate details of ship edges due to the inclusion of unnecessary background pixels. ...

Exploring Transformers for Intruder Detection in Complex Maritime Environment
  • Citing Chapter
  • March 2022

Lecture Notes in Computer Science

... Unlike the conventional human pose estimation techniques [27], [28], [29], [30], directly inferring the valid pose of a non-existent person is difficult [31] due to the unavailability of an actual human body for supervision. Thus, after sampling a probable location o*(x, y) within the scene where a person can be centered, we select a potential candidate from an existing set of m valid human poses as the initial guess (template) at that position. ...

Scene Aware Person Image Generation through Global Contextual Conditioning
  • Citing Conference Paper
  • August 2022
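The template-selection step quoted in the context above reduces to a classification over a fixed bank of m stored poses; a toy sketch in which the classifier, the template tensor, and all shapes are assumptions:

```python
import torch

def select_template(local_ctx, classifier, templates):
    """Pick the initial pose guess at a sampled location o*(x, y):
    classify the local scene context into one of m pose templates."""
    logits = classifier(local_ctx)   # (B, m) scores over the pose bank
    idx = logits.argmax(dim=-1)      # most probable template class y*
    return templates[idx]            # (B, num_joints, 2) initial pose
```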

... He et al. [12] introduced a unique method titled "Image Copy-Move Forgery Detection via Deep Cross-Scale Patch-Match", which addresses the practical limitations of existing deep learning algorithms. Combining traditional and deep learning techniques [20,21], an end-to-end deep cross-scale patch-match scheme is introduced, designed specifically for copy-move forgery detection (CMFD). The method stresses explicit point-to-point matching using features from high-resolution scales, boosting generalizability to different copy-move contents. ...

A new robust approach for altered handwritten text detection