
Andrew Senior · Burton Hospitals NHS Foundation Trust · ENT
150 Publications · 127,914 Reads · 69,907 Citations
Publications (150)
Building a large-scale quantum computer requires effective strategies to correct errors that inevitably arise in physical quantum systems¹. Quantum error-correction codes² present a way to reach this goal by encoding logical information redundantly into many physical qubits. A key challenge in implementing such codes is accurately decoding noisy sy...
Computational design of protein-binding proteins is a fundamental capability with broad utility in biomedical research and biotechnology. Recent methods have made strides against some target proteins, but on-demand creation of high-affinity binders without multiple rounds of experimental testing remains an unsolved challenge. This technical report...
The vast majority of missense variants observed in the human genome are of unknown clinical significance. We present AlphaMissense, an adaptation of AlphaFold fine-tuned on human and primate variant population frequency databases to predict missense variant pathogenicity. By combining structural context and evolutionary conservation, our model achi...
Ancient languages preserve the cultures and histories of the past. However, their study is fraught with difficulties, and experts must tackle a range of challenging text-based tasks, from deciphering lost languages to restoring damaged inscriptions, to determining the authorship of works of literature. Technological aids have long supported the stu...
While the vast majority of well-structured single protein chains can now be predicted to high accuracy due to the recent AlphaFold [1] model, the prediction of multi-chain protein complexes remains a challenge in many cases. In this work, we demonstrate that an AlphaFold model trained specifically for multimeric inputs of known stoichiometry, which...
We describe the operation and improvement of AlphaFold*, the system that was entered by the team AlphaFold2 to the “human” category in the 14th Critical Assessment of Protein Structure Prediction (CASP14). The AlphaFold system entered in CASP14 is entirely different to the one entered in CASP13. It used a novel end-to-end deep neural network traine...
Protein structures can provide invaluable information, both for reasoning about biological processes and for enabling interventions such as structure-based drug development or targeted mutagenesis. After decades of effort, 17% of the total residues in human protein sequences are covered by an experimentally-determined structure¹. Here we dramatical...
Proteins are essential to life, and understanding their structure can facilitate a mechanistic understanding of their function. Through an enormous experimental effort¹⁻⁴, the structures of around 100,000 unique proteins have been determined⁵, but this represents a small fraction of the billions of known protein sequences⁶,⁷. Structural cover...
The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio. Unlike previous works that have focussed on recognising a limited number of words or phrases, we tackle lip reading as an open-world problem -- unconstrained natural language sentences, and in the wild videos. Our key contributions...
CASP13 extended abstract describing DeepMind AlphaFold protein structure prediction system.
The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio. Unlike previous works that have focussed on recognising a limited number of words or phrases, we tackle lip reading as an open-world problem - unconstrained natural language sentences, and in the wild videos. Our key contributions...
This work presents a scalable solution to open-vocabulary visual speech recognition. To achieve this, we constructed the largest existing visual speech recognition dataset, consisting of pairs of text and video clips of faces speaking (3,886 hours of video). In tandem, we designed and trained an integrated lipreading system, consisting of a video p...
Multichannel ASR systems commonly separate speech enhancement, including localization, beamforming and postfiltering, from acoustic modeling. In this chapter, we perform multichannel enhancement jointly with acoustic modeling in a deep neural network framework. Inspired by beamforming, which leverages differences in the fine time structure of the s...
Neural networks augmented with external memory have the ability to learn algorithmic solutions to complex tasks. These models appear promising for applications such as language modeling and machine translation. However, they scale poorly in both space and time as the amount of memory grows, limiting their applicability to real-world domains. Her...
This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio. W...
Many language generation tasks require the production of text conditioned on both structured and unstructured inputs. We present a novel neural network architecture which generates an output sequence conditioned on an arbitrary number of input functions. Crucially, our approach allows both the choice of conditioning context and the granularity of g...
Surgical training is constantly evolving and junior trainees now perform fewer operations than ever. To maintain standards, greater emphasis is placed on simulation, often requiring attendance at expensive courses. While virtual reality systems have been developed recently, cadaveric dissection remains the cornerstone of training. To allow effectiv...
Cerebellopontine angle (CPA) tumours are the most common neoplasms in the posterior fossa, accounting for 5-10% of intracranial tumours. Most CPA tumours are benign, with most being vestibular schwannomas. Meningiomas arising from the jugular foramen are among the rarest of all with very few being described in the literature. Treatment options vary...
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can...
Automatic speech recognition (ASR) systems are used daily by millions of people worldwide to dictate messages, control devices, initiate searches or to facilitate data input in small devices. The user experience in these scenarios depends on the quality of the speech transcriptions and on the responsiveness of the system. For multilingual users, a...
Hidden Markov models (HMMs) and Gaussian mixture models (GMMs) are the two most common types of acoustic models used in statistical parametric approaches for generating low-level speech waveforms from high-level symbolic inputs via intermediate acoustic feature sequences. However, these models have their limitations in representing complex, nonline...
The present invention relates to the measurement of human activities through video, particularly in retail environments. A method for measuring retail display effectiveness in accordance with an embodiment of the present invention includes: detecting a moving object in a field of view of an imaging device, the imaging device obtaining image data of...
Gastric volvulus is a rare condition with two forms of presentation, either acute or chronic. Since its discovery, there have been no cases of acute on chronic volvulus discussed in the literature. Its vague presentation makes diagnosis and subsequent management difficult. The diagnosis of acute gastric volvulus is made on clinical grounds via Borc...
Previous work presented a proof of concept for sequence training of deep neural networks (DNNs) using asynchronous stochastic optimization, mainly focusing on a small-scale task. The approach offers the potential to leverage both the efficiency of stochastic gradient descent and the scalability of parallel computation. This study presents results...
We propose an algorithm that allows online training of a context-dependent DNN model. It designs a state inventory based on DNN features and jointly optimizes the DNN parameters and alignment of the training data. The process allows flat starting a model from scratch and avoids any dependency on a GMM/HMM model to bootstrap the training process. A...
The system and method obscures descriptive image information about one or more images. The system comprises a selector for selecting the descriptive image information from one or more of the images, a transformer that transforms the descriptive information into a transformed state, and an authorizer that provides authorization criteria with the ima...
This paper explores asynchronous stochastic optimization for sequence training of deep neural networks. Sequence training requires more computation than frame-level training using pre-computed frame data. This leads to several complications for stochastic optimization, arising from significant asynchrony in model updates under massive paralleliza...
We propose providing additional utterance-level features as inputs to a deep neural network (DNN) to facilitate speaker, channel and background normalization. Modifications of the basic algorithm are developed which result in significant reductions in word error rates (WERs). The algorithms are shown to combine well with speaker adaptation by backp...
While deep neural networks (DNNs) have become the dominant acoustic model (AM) for speech recognition systems, they are still dependent on Gaussian mixture models (GMMs) for alignments both for supervised training and for context dependent (CD) tree building. Here we explore bootstrapping DNN AM training without GMM AMs and show that CD trees can b...
We investigate the use of large state inventories and the softplus nonlinearity for on-device neural network based mobile speech recognition. Large state inventories are achieved by less aggressive context-dependent state tying, and made possible by using a bottleneck layer to contain the number of parameters. We investigate alternative approaches...
Statistical parametric speech synthesis (SPSS) using deep neural networks (DNNs) has shown its potential to produce naturally-sounding synthesized speech. However, there are limitations in the current implementation of DNN-based acoustic modeling for speech synthesis, such as the unimodal nature of its objective function and its lack of ability to...
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and...
While deep neural networks (DNNs) have become the dominant acoustic model (AM) for speech recognition systems, they are still dependent on Gaussian mixture models (GMMs) for alignments both for supervised training and for context dependent (CD) tree building. Here we explore bootstrapping DNN AM training without GMM AMs and show that CD trees can...
Long Short-Term Memory (LSTM) is a specific recurrent neural network (RNN) architecture that was designed to model temporal sequences and their long-range dependencies more accurately than conventional RNNs. In this paper, we explore LSTM RNN architectures for large scale acoustic modeling in speech recognition. We recently showed that LSTM RNNs ar...
We recently showed that Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform state-of-the-art deep neural networks (DNNs) for large scale acoustic modeling where the models were trained with the cross-entropy (CE) criterion. It has also been shown that sequence discriminative training of DNNs initially trained with the CE crite...
YouTube is a highly visited video sharing website where over one billion people watch six billion hours of video every month. Improving accessibility to these videos for the hearing impaired and for search and indexing purposes is an excellent application of automatic speech recognition. However, YouTube videos are extremely challenging for autom...
Multiple event types are monitored for events, and surveillance data is stored for each event. Surveillance data for a primary event of one event type can be presented to a user, and surveillance data for a set of related events corresponding to another event type can be presented based on a set of relatedness criteria and the surveillance data for...
Today's speech recognition technology is mature enough to be useful for many practical applications. In this context, it is of paramount importance to train accurate acoustic models for many languages within given resource constraints such as data, processing power, and time. Multilingual training has the potential to solve the data issue and close...
Conventional approaches to statistical parametric speech synthesis typically use decision tree-clustered context-dependent hidden Markov models (HMMs) to represent probability densities of speech parameters given texts. Speech parameters are generated from the probability densities to maximize their output probabilities, then a speech waveform is r...
There is provided an apparatus for the certification of privacy compliance. The apparatus includes a registry of at least one of enrolled video surveillance operators, approved surveillance hardware devices, approved surveillance software programs, approved surveillance system installers, and approved entities that manage surveillance systems. The...
Deep neural networks have recently become the gold standard for acoustic modeling in speech recognition systems. The key computational unit of a deep network is a linear projection followed by a point-wise non-linearity, which is typically a logistic function. In this work, we show that we can improve generalization and make training of deep networ...
Recent deep neural network systems for large vocabulary speech recognition are trained with minibatch stochastic gradient descent but use a variety of learning rate scheduling schemes. We investigate several of these schemes, particularly AdaGrad. Based on our analysis of its limitations, we propose a new variant 'AdaDec' that decouples long-term l...
This paper explores a novel large margin approach to learning a linear transform for dimensionality reduction in speech recognition. The method assumes a trained Gaussian mixture model for each class to be discriminated and trains a dimensionality-reducing linear transform with respect to the fixed model, optimizing a hinge loss on the difference b...
Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feed-forward...
Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models (GMMs) to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feed-f...
Recent work in unsupervised feature learning and deep learning has shown that being able to train large models can dramatically improve performance. In this paper, we consider the problem of training a deep network with billions of parameters using tens of thousands of CPU cores. We have developed a software framework called DistBelief that can ut...
Mixture of Gaussians-based background subtraction (BGS) has been widely used for detecting moving objects in surveillance videos. It is very efficient and can update the background model with slow lighting changes; however, it suffers from a number of limitations in complex surveillance conditions such as quick lighting variations, heavy occlusion,...
Optical character recognition is carried out using techniques borrowed from statistical machine translation. In particular, the use of multiple simple feature functions in linear combination, along with minimum-error-rate training, integrated decoding, and N-gram language modeling is found to be remarkably effective, across several scripts and lang...
Myocardial infarction (MI) is a leading cause of death in the UK. A good clinical outcome depends on rapid treatment following the onset of symptoms. A person's knowledge of typical symptoms determines how quickly they present to the medical services.
To investigate knowledge of MI symptoms among the general population and the relationship between...
In video surveillance and long term scene monitoring applications, it is a challenging problem to handle slow-moving or stopped objects for motion analysis and tracking. We present a new framework by using two feedback mechanisms which allow interactions between tracking and background subtraction (BGS) to improve tracking accuracy, particularly in...
Face recognition has long been a goal of computer vision, but only in recent years has reliable automated face recognition become a realistic target of biometrics research. New algorithms, and developments spurred by falling costs of cameras and by the increasing availability of processing power, have led to practical face recognition systems. These sy...
In this chapter, we describe the privacy issues surrounding the proliferation of digital imagery, particularly of faces, in surveillance video, online photo-sharing, medical records and online navigable street imagery. We highlight the growing capacity for computer systems to process, recognize, and index face images and outline some of the techniq...
Recent advances in deep learning have made the use of large, deep neural networks with tens of millions of parameters suitable for a number of applications that require real-time processing. The sheer size of these networks can represent a challenging computational burden, even for modern CPUs. For this reason, GPUs are routinely used instead to tr...
This chapter presents mechanisms for privacy protection in a distributed, multicamera surveillance system. The design choices and alternatives for providing privacy protection while delivering meaningful surveillance data for security and retail environments are described, followed by performance metrics to evaluate the effectiveness of privacy prot...
We present a brief summary of the elements in an automatic video surveillance system, from imaging system to metadata. Surveillance system architectures are described, followed by the steps in video analysis, from preprocessing to object detection, tracking, classification and behaviour analysis.
Video surveillance automation is used in two key modes: watching for known threats in real-time and searching for events of interest after the fact. Typically, real-time alerting is a localized function, for example, an airport security center receives and reacts to a "perimeter breach alert," while investigations often tend to encompass a large nu...
This paper presents mechanisms for privacy protection in a distributed, multicamera surveillance system. The design choices and alternatives for providing privacy protection while delivering meaningful surveillance data for security and retail environments are described, followed by performance metrics to evaluate the effectiveness of privacy prote...
The increasing need for sophisticated surveillance systems and the move to a digital infrastructure has transformed surveillance into a large scale data analysis and management challenge. Smart surveillance systems use automatic image understanding techniques to extract information from the surveillance data. While the majority of the research and...
Surveillance video is used in two key modes: watching for known threats in real-time and searching for events of interest after the fact. Typically, real-time alerting is a localized function, e.g. an airport security center receives and reacts to a "perimeter breach alert", while investigations often tend to encompass a large number of geographically...
We describe a set of tools for retail analytics based on a combination of video understanding and transaction-log. Tools are provided for loss prevention (returns fraud and cashier fraud), store operations (customer counting) and merchandising (display effectiveness). Results are presented on returns fraud and customer counting.
Fingerprint classification is an important indexing method for any large-scale fingerprint recognition system or database, as a method for reducing the number of fingerprints that need to be searched when looking for a matching print. Fingerprints are generally classified into broad categories based on global characteristics. This paper describes n...
The paper introduces a novel detection and tracking system that provides both frame-view and world-coordinate human location information, based on video from multiple synchronized and calibrated cameras with overlapping fields of view. The system is developed and evaluated for the specific scenario of a seminar lecturer presenting in front of...
Pervasive sensor based systems are transforming Information Technology systems from being transactional in nature to being observational in nature. Observational systems are inherently distributed and capture information at a much finer grain of space and time. Enabling and building such systems also poses many technology challenges, extracting inf...
Objects in the world exhibit complex interactions. When captured in a video sequence, some interactions manifest themselves as occlusions. A visual tracking system must be able to track objects that are partially or even fully occluded. In this paper we present a method of tracking objects through occlusions using appearance models. These models...
We present the IBM Smart Surveillance System that uses a distributed architecture to manage a heterogeneous network of active cameras. This system consists of a distributed network of cameras, each with local processing that interprets video to detect and track moving objects. The system performs multi-camera tracking as objects pass through the el...
The cumulative match curve (CMC) is used as a measure of 1:m identification system performance. It judges the ranking capabilities of an identification system. The receiver operating characteristic curve (ROC curve) of a verification system, on the other hand, expresses the quality of a 1:1 matcher. The ROC plots the false accept rate (FAR) of a 1...
Visual detection and tracking of humans in complex scenes is a challenging problem with a wide range of applications, for example surveillance and human-computer interaction. In many such applications, time-synchronous views from multiple calibrated cameras are available, and both frame-view and space-level human location information is desired....
As smart surveillance technology becomes a critical component in security infrastructures, the system architecture assumes a critical importance. This paper considers the example of smart surveillance in an airport environment. We start with a threat model for airports and use this to derive the security requirements. These requirements are used to...
In recent years, closed-circuit television (CCTV) cameras have gained widespread use worldwide. Human operators monitor CCTV systems, unobtrusive or deliberately hidden cameras allow spying and voyeurism, and video surveillance can make CCTV a tool for state control and oppression. The use of surveillance is spreading as the hardware becomes mo...