Andrew Zisserman’s research while affiliated with Google Inc. and other places

Publications (948)


Fig. 3. Model Learning. The regression model is learnt from pairs of aligned DXA and MRI scans. The regression targets are the DXA curve and the sagittal curve (projected from the 3D MRI spine). The alignment of the sagittal curve to the DXA is obtained from the alignment of the coronal projection of the 3D MRI to the DXA. Six curves are regressed: the centerline of the spine as well as the left and right boundaries of the segmentation, for both the coronal and sagittal views.
Fig. 4. Image-Based Regression of Coronal and Sagittal Spine Curves. We use a ResNet50, pre-trained on ImageNet-21k, with a transformer layer to regress the spine curves (x^{(1,2,3)}(z), y^{(1,2,3)}(z)), z ∈ [1, 209], for the left, center and right curves. The feature map extracted from ResNet50 has resolution 7 × 7 × 2048; each of the 49 feature vectors (49 × 2048) is fed as a token into a transformer layer. The model regresses the 6 curves (209 × 6), i.e. 6 output spine curve vectors, each of dimension 209. Detailed architecture in Appendix A, Figure 7.
Fig. 6. Mean Average Precision of Predicted Spine Masks for Different IoU Thresholds over the Test Samples. We measure the performance of the 3D segmentation using 3D IoU. IoU thresholds range from 0.1 to 0.9, in steps of 0.1.
Fig. 7. Full Architecture of our ResNet50 with Transformer Layer.
3D Spine Shape Estimation from Single 2D DXA
  • Preprint
  • File available

December 2024

Emmanuelle Bourigault · Amir Jamaludin · Andrew Zisserman

Scoliosis is traditionally assessed based solely on 2D lateral deviations, but recent studies have also revealed the importance of other imaging planes in understanding the deformation of the spine. Consequently, extracting the spinal geometry in 3D would help quantify these spinal deformations and aid diagnosis. In this study, we propose an automated general framework to estimate the 3D spine shape from 2D DXA scans. We achieve this by explicitly predicting the sagittal view of the spine from the DXA scan. Using these two orthogonal projections of the spine (coronal in DXA, and sagittal from the prediction), we are able to describe the 3D shape of the spine. The prediction is learnt from over 30k paired images of DXA and MRI scans. We assess the performance of the method on a held-out test set, and achieve high accuracy.
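As a rough illustration of the curve-regression head described in Fig. 4 (ResNet50 features reshaped into 49 tokens of dimension 2048, a single transformer layer, and 6 output curves of 209 points each), here is a minimal PyTorch sketch. The module structure, the mean pooling, the transformer hyperparameters, and the final stacking of a coronal and a sagittal curve into a 3D centerline are illustrative assumptions, not the authors' implementation (the paper's backbone is pre-trained on ImageNet-21k; the torchvision weights used here are ImageNet-1k).

```python
import torch
import torch.nn as nn
import torchvision


class SpineCurveRegressor(nn.Module):
    """Sketch of a ResNet50 + transformer-layer curve regressor (cf. Fig. 4)."""

    def __init__(self, n_points: int = 209, n_curves: int = 6, d_model: int = 2048):
        super().__init__()
        backbone = torchvision.models.resnet50(weights="IMAGENET1K_V2")
        # Drop the average pool and classifier to keep the 7x7x2048 feature map.
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.encoder = nn.TransformerEncoderLayer(d_model=d_model, nhead=8,
                                                  batch_first=True)
        self.head = nn.Linear(d_model, n_points * n_curves)
        self.n_points, self.n_curves = n_points, n_curves

    def forward(self, dxa: torch.Tensor) -> torch.Tensor:
        f = self.features(dxa)                      # (B, 2048, 7, 7)
        tokens = f.flatten(2).transpose(1, 2)       # (B, 49, 2048) -- 49 feature vectors
        tokens = self.encoder(tokens)               # single transformer layer
        pooled = tokens.mean(dim=1)                 # (B, 2048)
        curves = self.head(pooled)                  # (B, 209 * 6)
        return curves.view(-1, self.n_curves, self.n_points)


# Combining a coronal (x) and a sagittal (y) curve with the vertical coordinate z
# gives a 3D spine curve; the curve ordering below is hypothetical.
model = SpineCurveRegressor()
pred = model(torch.randn(1, 3, 224, 224))           # (1, 6, 209)
z = torch.arange(1, 210, dtype=torch.float32)
centerline_3d = torch.stack([pred[0, 1], pred[0, 4], z], dim=-1)   # (209, 3)
```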

Perception Test 2024: Challenge Summary and a Novel Hour-Long VideoQA Benchmark

November 2024

Joseph Heyward · [...]

Following the successful 2023 edition, we organised the Second Perception Test challenge as a half-day workshop alongside the IEEE/CVF European Conference on Computer Vision (ECCV) 2024, with the goal of benchmarking state-of-the-art video models and measuring the progress since last year using the Perception Test benchmark. This year, the challenge had seven tracks (up from six last year) and covered low-level and high-level tasks, with language and non-language interfaces, across video, audio, and text modalities; the additional track covered hour-long video understanding and introduced a novel video QA benchmark 1h-walk VQA. Overall, the tasks in the different tracks were: object tracking, point tracking, temporal action localisation, temporal sound localisation, multiple-choice video question-answering, grounded video question-answering, and hour-long video question-answering. We summarise in this report the challenge tasks and results, and introduce in detail the novel hour-long video QA benchmark 1h-walk VQA.




A Short Note on Evaluating RepNet for Temporal Repetition Counting in Videos

November 2024

We discuss some consistent issues with how RepNet has been evaluated in various papers. To mitigate these issues, we report RepNet performance results on different datasets, and release the evaluation code and the RepNet checkpoint used to obtain these results. Code URL: https://github.com/google-research/google-research/blob/master/repnet/
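For context, temporal repetition-counting results of this kind are usually reported with off-by-one (OBO) accuracy and a mean absolute error, often normalised by the ground-truth count. The note above points to its own released evaluation code; the NumPy sketch below is only a generic illustration of those metrics (with both MAE variants), not that code.

```python
import numpy as np


def repetition_counting_metrics(pred_counts, gt_counts):
    """Generic repetition-counting metrics: OBO accuracy and MAE.

    Whether MAE is normalised by the ground-truth count varies between
    papers, so both variants are returned.
    """
    pred = np.asarray(pred_counts, dtype=float)
    gt = np.asarray(gt_counts, dtype=float)
    obo = float(np.mean(np.abs(pred - gt) <= 1.0))            # within one repetition
    mae = float(np.mean(np.abs(pred - gt)))                   # raw absolute error
    mae_norm = float(np.mean(np.abs(pred - gt) / np.maximum(gt, 1.0)))
    return {"OBO": obo, "MAE": mae, "MAE_norm": mae_norm}


print(repetition_counting_metrics(pred_counts=[4, 10, 7], gt_counts=[5, 10, 3]))
# {'OBO': 0.666..., 'MAE': 1.666..., 'MAE_norm': 0.511...}
```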



Automated Spinal MRI Labelling from Reports Using a Large Language Model

October 2024

We propose a general pipeline to automate the extraction of labels from radiology reports using large language models, which we validate on spinal MRI reports. The efficacy of our labelling method is measured on five distinct conditions: spinal cancer, stenosis, spondylolisthesis, cauda equina compression and herniation. Using open-source models, our method equals or surpasses GPT-4 on a held-out set of reports. Furthermore, we show that the extracted labels can be used to train imaging models to classify the identified conditions in the accompanying MR scans. All classifiers trained using automated labels achieve comparable performance to models trained using scans manually annotated by clinicians. Code can be found at https://github.com/robinyjpark/AutoLabelClassifier.
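To make the pipeline concrete, here is a minimal sketch of prompt-based label extraction for the five conditions named in the abstract. The `query_llm` helper, the prompt wording, and the JSON output format are hypothetical placeholders; the actual pipeline is in the linked repository.

```python
import json

CONDITIONS = ["spinal cancer", "stenosis", "spondylolisthesis",
              "cauda equina compression", "herniation"]


def query_llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to an open-source LLM; replace with
    a real client (e.g. a local inference server)."""
    raise NotImplementedError


def extract_labels(report_text: str) -> dict:
    """Ask the model whether each condition is present in a radiology report."""
    prompt = (
        "You are labelling a spinal MRI radiology report.\n"
        f"Report:\n{report_text}\n\n"
        "Return a JSON object mapping each of the following conditions to "
        f"true or false: {', '.join(CONDITIONS)}."
    )
    raw = query_llm(prompt)
    labels = json.loads(raw)              # assumes the model returns valid JSON
    return {c: bool(labels.get(c, False)) for c in CONDITIONS}
```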


It's Just Another Day: Unique Video Captioning by Discriminative Prompting

October 2024

Long videos contain many repeating actions, events and shots. These repetitions are frequently given identical captions, which makes it difficult to retrieve the exact desired clip using a text search. In this paper, we formulate the problem of unique captioning: given multiple clips with the same caption, we generate a new caption for each clip that uniquely identifies it. We propose Captioning by Discriminative Prompting (CDP), which predicts a property that can separate identically captioned clips and uses it to generate unique captions. We introduce two benchmarks for unique captioning, based on egocentric footage and timeloop movies, where repeating actions are common. We demonstrate that captions generated by CDP improve text-to-video R@1 by 15% for egocentric videos and by 10% for timeloop movies.
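A conceptual sketch of the discriminative-prompting idea, reading the abstract loosely: ask property questions of all identically captioned clips, keep the property whose answers best separate them, and append the answer to the shared caption. The `captioner` callable, the prompt list, and the selection criterion are illustrative assumptions, not the CDP implementation.

```python
def unique_captions(clips, shared_caption, captioner, property_prompts):
    """Generate a distinguishing caption for each clip in a set of clips
    that currently share the same caption.

    captioner(clip, prompt) -> str is a hypothetical VLM call;
    property_prompts are questions about properties that might separate
    the clips (e.g. "What is the person holding?").
    """
    # Ask every property question of every clip.
    answers = {p: [captioner(c, p) for c in clips] for p in property_prompts}

    # Keep the property whose answers are most discriminative, i.e. take
    # the largest number of distinct values across the clips.
    best = max(property_prompts, key=lambda p: len(set(answers[p])))

    # Append the clip-specific answer to the shared caption.
    return [f"{shared_caption} ({best}: {a})" for a in answers[best]]
```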


Character-aware audio-visual subtitling in context

October 2024

This paper presents an improved framework for character-aware audio-visual subtitling in TV shows. Our approach integrates speech recognition, speaker diarisation, and character recognition, utilising both audio and visual cues. This holistic solution addresses what is said, when it is said, and who is speaking, providing more comprehensive and accurate character-aware subtitling for TV shows. Our approach brings improvements on two fronts: first, we show that audio-visual synchronisation can be used to pick out the talking face amongst others present in a video clip, and assign an identity to the corresponding speech segment. This audio-visual approach improves recognition accuracy and yield over current methods. Second, we show that the speaker of short segments can be determined by using the temporal context of the dialogue within a scene. We propose an approach using local voice embeddings of the audio and large language model reasoning on the text transcription. This overcomes a limitation of existing methods, which are unable to accurately assign speakers to short temporal segments. We validate the method on a dataset with 12 TV shows, demonstrating superior performance in speaker diarisation and character recognition accuracy compared to existing approaches. Project page: https://www.robots.ox.ac.uk/~vgg/research/llr-context/
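As a simplified illustration of the short-segment assignment step, the sketch below scores a segment's voice embedding against per-character banks of reference embeddings by cosine similarity. The actual method also uses large language model reasoning over the scene's dialogue; the function here is an assumption-laden reduction of just the embedding part.

```python
import numpy as np


def assign_speaker(segment_embedding, character_banks):
    """Assign a short speech segment to a character by cosine similarity.

    character_banks maps a character name to an array of reference voice
    embeddings, shape (n_refs, dim), e.g. gathered from longer segments
    where the talking face was identified audio-visually.
    """
    seg = segment_embedding / np.linalg.norm(segment_embedding)
    scores = {}
    for name, bank in character_banks.items():
        bank = bank / np.linalg.norm(bank, axis=1, keepdims=True)
        scores[name] = float(np.max(bank @ seg))     # best match within the bank
    return max(scores, key=scores.get), scores


# Usage (embeddings from any speaker-embedding model):
# speaker, scores = assign_speaker(seg_emb, {"Alice": bank_a, "Bob": bank_b})
```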



Citations (41)


... Recent work has focused on improving video LLMs for long video understanding. Given the challenge posed by the large number of tokens in long videos, recent approaches [10,29,49,68,70,73,88] adopt Q-former [32] or state space models [18] to aggregate the information before feeding visual tokens into the LLMs. Others extend the context window of LLMs [21,71,88] or design sequence parallelism from a systems perspective [79] to process the long video sequence. ...

Reference:

AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning
Text-Conditioned Resampler For Long Form Video Understanding
  • Citing Chapter
  • October 2024

... Comics represent a highly complex medium for computational analysis, yet they are easily understood by humans, except for individuals within the Blind or Low Vision community. Recent studies [15,19,23] have addressed this gap by developing dialog generation tasks to assist People with Visual Impairments (PVI). These tasks aim to transcribe all spoken text, sorted by appearance, and associate it with the corresponding character's name. ...

The Manga Whisperer: Automatically Generating Transcriptions for Comics
  • Citing Conference Paper
  • June 2024

... While Zhang et al. (2024b) introduce a three-step method consisting of connecting, collapsing, and corrupting to bridge the modality gap, Zhang et al. (2024a) learn to generalize to unseen modality interactions by modality-agnostic feature fusion and pseudo supervision. Huang et al. (2023) learn audio-video representations by self-supervision, and Chalk et al. (2024) explicitly model the temporal extents of audio and visual events for more effective fusion. Akbari et al. (2023) integrate multimodal inputs by alternating gradient descent and mixture-of-experts. ...

TIM: A Time Interval Machine for Audio-Visual Action Recognition
  • Citing Conference Paper
  • June 2024

... • We introduce DistinctAD, which incorporates a Contextual EMA module and a distinctive word prediction loss, significantly enhancing the generation of distinctive ADs from consecutive visual clips with similar contexts. • Comprehensive evaluations on MAD-Eval [20], CMD-AD [22], and TV-AD [81] highlight DistinctAD's superiority. Our outstanding performance in Recall@k/N demonstrates its effectiveness in generating high-quality ADs with both distinctiveness and technical excellence. ...

AutoAD III: The Prequel - Back to the Pixels
  • Citing Conference Paper
  • June 2024

... This is combined with a grounding mechanism, enabling audio-driven segmentation of the image. Similarly, Hamilton et al.'s DenseAV [17] identifies image regions linked to sounds without supervision. Using a DINO image backbone [1] and the HuBERT audio transformer [19], it employs multi-head feature aggregation for contrastive learning on video-audio pairs. ...

Separating the “Chirp” from the “Chat”: Self-supervised Visual Grounding of Sound and Language
  • Citing Conference Paper
  • June 2024

... This capability, known as amodal completion [11], allows systems to infer and generate the occluded parts of objects, providing users with full representations of partially visible items. Amodal appearance completion is critical in applications such as AR [7,19,25], 3D reconstruction [17,37], and content creation [2,31,39], where intuitive, language-based interaction allows users to specify objects directly [3,8,22]. Traditional amodal appearance completion methods, however, are typically constrained to fixed sets of object categories [35], or necessitate extensive training, limiting their applicability in diverse and changing environments [23]. ...

Amodal Ground Truth and Completion in the Wild
  • Citing Conference Paper
  • June 2024

... Then, a neural network-based module predicts all speakers' corresponding voice activities. This two-stage framework demonstrates highly accurate performance on popular benchmarks such as DIHARD-III [14] and VoxSRC21-23 [15]-[18]. However, the diarization systems mentioned above are natively designed to process pre-recorded audio offline, which means they cannot satisfy scenarios with low-latency demands (e.g., real-time meeting transcription) [1]. ...

The VoxCeleb Speaker Recognition Challenge: A Retrospective
  • Citing Article
  • January 2024

IEEE/ACM Transactions on Audio Speech and Language Processing

... SpineNetV2 [21] was used to automatically detect the vertebrae and extract the intervertebral discs (IVDs) in the T2 sagittal scans. Each IVD is of dimension slice × height × width (9 × 112 × 224). ...

Automated detection, labelling and radiological grading of clinical spinal MRIs

... LLMs, trained on vast text data, have a deep understanding of language, which we exploit for their knowledge of sound semantics. Our method, adapted from [19], involves three steps. First, we provide a general description of the task. ...

A Sound Approach: Using Large Language Models to Generate Audio Descriptions for Egocentric Text-Audio Retrieval
  • Citing Conference Paper
  • April 2024