Conference Paper

Embedded Convolutional Face Finder.

France Télécom, Lutetia Parisorum, Île-de-France, France
DOI: 10.1109/ICME.2006.262454
Conference: Proceedings of the 2006 IEEE International Conference on Multimedia and Expo, ICME 2006, July 9-12 2006, Toronto, Ontario, Canada
Source: DBLP

ABSTRACT

In this paper, a high-level optimization methodology is applied to the implementation of the well-known Convolutional Face Finder (CFF) algorithm for real-time applications on cellular phones, such as teleconferencing, advanced user interfaces, picture indexing and security access control. This face detector is based on a feature extraction and classification technique that consists of a pipeline of convolution and subsampling operations. The design of embedded systems must find a good trade-off between performance and code size because of the limited resources available. We propose a methodology to cope with the main drawbacks of the original CFF implementation, such as floating-point computation and memory allocation, to allow parallelism exploitation, and to perform algorithm optimizations. Results show that our embedded face detection system can accurately locate faces with lower computational load and memory cost. It runs on a 275 MHz Starcore DSP at 9 QCIF images/s with state-of-the-art detection rates and very low false alarm rates.
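
The abstract describes a pipeline of convolution and subsampling operations and names floating-point computation as a drawback of the original implementation. The sketch below illustrates, under stated assumptions, what one fixed-point convolution stage followed by a 2x2 averaging subsampling stage could look like in C. The Q8 number format, the function names (to_q8, conv2d_q8, subsample2x2) and the omission of trainable subsampling coefficients and activation functions are all assumptions made for illustration; the abstract does not specify the format or code used by the authors.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed fixed-point format: 8 fractional bits (Q8). Not taken from the paper. */
#define Q_FRAC_BITS 8
#define Q_ONE (1 << Q_FRAC_BITS)

/* Convert a floating-point weight to Q8 once, offline, rounding to nearest. */
static int16_t to_q8(float w) {
    float scaled = w * (float)Q_ONE;
    return (int16_t)(scaled >= 0.0f ? scaled + 0.5f : scaled - 0.5f);
}

/* "Valid" 2D convolution (correlation form) of an 8-bit image with a k x k
 * kernel of Q8 weights. The output has (in_w - k + 1) x (in_h - k + 1)
 * samples, scaled back to integer pixel units. */
static void conv2d_q8(const uint8_t *in, int in_w, int in_h,
                      const int16_t *kernel, int k, int32_t bias_q8,
                      int32_t *out) {
    int out_w = in_w - k + 1;
    int out_h = in_h - k + 1;
    assert(k > 0 && out_w > 0 && out_h > 0);

    for (int y = 0; y < out_h; y++) {
        for (int x = 0; x < out_w; x++) {
            int32_t acc = bias_q8;  /* 32-bit accumulator, integer-only */
            for (int j = 0; j < k; j++) {
                for (int i = 0; i < k; i++) {
                    acc += (int32_t)in[(y + j) * in_w + (x + i)]
                         * (int32_t)kernel[j * k + i];
                }
            }
            /* Drop the fractional bits; division truncates toward zero. */
            out[y * out_w + x] = acc / Q_ONE;
        }
    }
}

/* 2x2 subsampling by averaging; the output is (in_w / 2) x (in_h / 2).
 * A trailing odd row or column is ignored. */
static void subsample2x2(const int32_t *in, int in_w, int in_h, int32_t *out) {
    int out_w = in_w / 2;
    int out_h = in_h / 2;
    for (int y = 0; y < out_h; y++) {
        for (int x = 0; x < out_w; x++) {
            const int32_t *p = &in[(2 * y) * in_w + 2 * x];
            out[y * out_w + x] = (p[0] + p[1] + p[in_w] + p[in_w + 1]) / 4;
        }
    }
}
```

In this sketch the weights are quantized once with to_q8, so the inner loops use only integer multiply-accumulate operations, which is the kind of arithmetic a DSP without a floating-point unit handles efficiently. A real implementation would also need to choose the number of fractional bits against the dynamic range of the trained weights, which is not covered here.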

Available from: Roux Sébastien, Sep 05, 2014
  • Source
    • "In this paper, we will only consider the core of the face localization process as depicted in Figure 1. The convolutional neural network used to implement the face detector has been previously optimised in [3], and consists of a set of two different kinds of layers. (i) CSi layers are called convolutional layers and contain a certain number of planes. "
    ABSTRACT: We describe a High-Level Synthesis implementation of a parallel architecture for face detection. The chosen face detection method is the well-known Convolutional Face Finder (CFF) algorithm, which consists of a pipeline of convolution operations. We rely on dataflow modelling of the algorithm and we use a high-level synthesis tool in order to specify the local dataflows of our Processing Element (PE), by describing in C language inter-PE communication, fine scheduling of the successive convolutions, and memory distribution and bandwidth. Using this approach, we explore several implementation alternatives in order to find a compromise between processing speed and area of the PE. We then build a parallel architecture composed of a PE ring and a FIFO memory, which constitutes a generic architecture capable of processing images of different sizes. A ring of 25 PEs running at 80 MHz is able to process 127 QVGA images per second or 35 VGA images per second.
    Full-text · Article · Jan 2009 · EURASIP Journal on Embedded Systems
  • Source
    • "In this paper, we will only consider the core of the face localization process as depicted in Figure 1. The convolutional neural network used to implement the face detector has been previously optimised in [3], and consists of a set of two different kinds of layers. (i) CSi layers are called convolutional layers and contain a certain number of planes. "

    Full-text · Article · Jan 2008
  • Source
    • "Previous work [3] presented an analysis of the algorithm's data dependencies in order to optimize memory usage on embedded processors, leading to line-by-line processing. We generalize this approach by dividing the input image into blocks of width K, and present the corresponding model in Figure 3."
    ABSTRACT: In this paper we present a methodology for exploring the parallelism of a face detection algorithm, together with its implementation on an FPGA. The chosen algorithm is the Convolutional Face Finder (CFF), which consists of a cascade of 2D convolutions and subsampling operations. Our goal is to define a parallel architecture that implements this algorithm efficiently. We present an algorithm-architecture matching (AAA, "adéquation algorithme architecture") methodology using the SynDEx tool, in order to find a good trade-off between computing power, the functionality of each Processing Element (PE), and parallelization efficiency. We then describe a first implementation of a PE on a Virtex 4 FPGA, using the dedicated DSP48 signal processing blocks. This PE runs at a maximum frequency of 350 MHz and occupies only 2% of a Virtex 4 SX 35 FPGA.
    Full-text · Article · Jan 2007