Article

Dlib-ml: A Machine Learning Toolkit

Authors:
Davis E. King

Abstract

There are many excellent toolkits which provide support for developing machine learning software in Python, R, Matlab, and similar environments. Dlib-ml is an open source library, targeted at both engineers and research scientists, which aims to provide a similarly rich environment for developing machine learning software in the C++ language. Towards this end, dlib-ml contains an extensible linear algebra toolkit with built-in BLAS support. It also houses implementations of algorithms for performing inference in Bayesian networks and kernel-based methods for classification, regression, clustering, anomaly detection, and feature ranking. To enable easy use of these tools, the entire library has been developed with contract programming, which provides complete and precise documentation as well as powerful debugging tools.


... The first stage of our method focuses on isolating relevant frames from the video sequence where potential inconsistencies in mouth movement are more likely to occur. As shown in Fig. 3, given an input video, we first apply a face detector, such as Dlib [42], to crop and align the face in each frame. This ensures that the facial features, particularly the mouth, are centered in each frame, reducing the influence of irrelevant facial movements or camera angles. ...
... In the proposed architecture, we use the Dlib toolkit [42] to extract 64 × 144 mouth regions from the input video. We set the local frame number L as 5 and the global frame number G as 3 such that the model can detect the required number of global frames even if the video is only a few seconds long. ...
... Mouth region landmarks detected by Dlib[42]. Orange colors denote the landmarks for mouth openness measurement and matching. ...
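The excerpts above describe a typical dlib-based mouth-cropping step. Below is a minimal sketch of that kind of pipeline; the predictor file name, padding, and 144×64 output size are illustrative assumptions rather than the cited paper's exact settings.

```python
# Minimal sketch of dlib-based mouth-region cropping for lip-sync analysis.
# The predictor path, padding, and output size are illustrative assumptions.
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()            # HOG + SVM face detector
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def crop_mouth(frame_bgr, out_size=(144, 64), pad=0.15):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    # Landmarks 48-67 cover the outer and inner lip contours.
    pts = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)],
                   dtype=np.int32)
    x, y, w, h = cv2.boundingRect(pts)
    dx, dy = int(pad * w), int(pad * h)
    crop = frame_bgr[max(0, y - dy):y + h + dy, max(0, x - dx):x + w + dx]
    return cv2.resize(crop, out_size)
```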
Preprint
Deepfakes are AI-generated media in which the original content is digitally altered to create convincing but manipulated images, videos, or audio. Among the various types of deepfakes, lip-syncing deepfakes are one of the most challenging deepfakes to detect. In these videos, a person's lip movements are synthesized to match altered or entirely new audio using AI models. Therefore, unlike other types of deepfakes, the artifacts in lip-syncing deepfakes are confined to the mouth region, making them more subtle and, thus harder to discern. In this paper, we propose LIPINC-V2, a novel detection framework that leverages a combination of vision temporal transformer with multihead cross-attention to detect lip-syncing deepfakes by identifying spatiotemporal inconsistencies in the mouth region. These inconsistencies appear across adjacent frames and persist throughout the video. Our model can successfully capture both short-term and long-term variations in mouth movement, enhancing its ability to detect these inconsistencies. Additionally, we created a new lip-syncing deepfake dataset, LipSyncTIMIT, which was generated using five state-of-the-art lip-syncing models to simulate real-world scenarios. Extensive experiments on our proposed LipSyncTIMIT dataset and two other benchmark deepfake datasets demonstrate that our model achieves state-of-the-art performance. The code and the dataset are available at https://github.com/skrantidatta/LIPINC-V2 .
... Specifically, we first obtain head pose information (pitch, roll, and yaw) 2 and use them to get frontal view face images; We extract 16,336 images out of 70,000 images, which are used as our fine-tuning data. Next, we use 68 landmarks per image from dlib [19] to obtain the information about facial composition. For example, 'the gap between eyes' and 'width of mouth'. ...
... In this setting, attribute values come from the pretrained GANs' latent space rather than the actual facial attributes, making direct comparisons between input c and predicted ĉ (from the generated image) infeasible. Instead, we generate 200 paired images per attribute (e.g., closed vs. open eyes) and compare estimated attributes ĉ_neg and ĉ_pos using dlib [19]. A higher score indicates a broader control range. ...
... Baselines (Absolute Control). Baselines in this setting take a real unpaired condition c, allowing direct performance evaluation by comparing input c with estimated c from the generated image using dlib [19]. The lower error indicates better alignment between the generated images and the control attribute values. ...
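As a rough illustration of deriving such composition measures from dlib's 68 landmarks, the sketch below computes an eye gap and mouth width normalized by face width. The landmark indices follow the standard 68-point layout; the normalization choice is an assumption for the example, not the cited paper's definition.

```python
# Illustrative facial-composition measures (eye gap, mouth width) from dlib's
# 68 landmarks. Normalization by face width is an assumption for the example.
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def facial_measures(gray_image):
    faces = detector(gray_image, 1)
    if not faces:
        return None
    s = predictor(gray_image, faces[0])
    p = lambda i: np.array([s.part(i).x, s.part(i).y], dtype=float)
    inter_ocular = np.linalg.norm(p(39) - p(42))   # inner eye corners
    mouth_width = np.linalg.norm(p(48) - p(54))    # mouth corners
    face_width = np.linalg.norm(p(0) - p(16))      # jaw extremes, for normalization
    return {"eye_gap": inter_ocular / face_width,
            "mouth_width": mouth_width / face_width}
```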
Preprint
Full-text available
Text-to-Image (T2I) Diffusion Models have achieved remarkable performance in generating high quality images. However, enabling precise control of continuous attributes, especially multiple attributes simultaneously, in a new domain (e.g., numeric values like eye openness or car width) with text-only guidance remains a significant challenge. To address this, we introduce the Attribute (Att) Adapter, a novel plug-and-play module designed to enable fine-grained, multi-attributes control in pretrained diffusion models. Our approach learns a single control adapter from a set of sample images that can be unpaired and contain multiple visual attributes. The Att-Adapter leverages the decoupled cross attention module to naturally harmonize the multiple domain attributes with text conditioning. We further introduce Conditional Variational Autoencoder (CVAE) to the Att-Adapter to mitigate overfitting, matching the diverse nature of the visual world. Evaluations on two public datasets show that Att-Adapter outperforms all LoRA-based baselines in controlling continuous attributes. Additionally, our method enables a broader control range and also improves disentanglement across multiple attributes, surpassing StyleGAN-based techniques. Notably, Att-Adapter is flexible, requiring no paired synthetic data for training, and is easily scalable to multiple attributes within a single model.
... It integrates the DeepFace Library [66], a Python library that includes state-of-the-art face detection and recognition models. In our experiments, we evaluated five detection methods: MTCNN [67], Dlib [68], OpenCV [69], SSD [70], and RetinaFace [71]. Additionally, FairDeFace provides scripts for connecting to the Face++ API [72], a third-party system offering services like face detection, comparison, search, and attribute analysis. ...
... In FairDeFace, verification attacks are implemented using both DeepFace and Face++. DeepFace includes several well-known face recognition methods, such as OpenFace [73], DeepFace [74], Dlib [68], ArcFace [75], VGG-Face, FaceNet [76], and DeepID [77]. These systems produce distance scores as output, which are used to compute metrics like Obfuscation Success Rates (OSRs) for obfuscation or True Positive Rates (TPRs) for face recognition. ...
... Table 4 presents the results, with ArcFace achieving the highest scores across all three metrics. However, the overall findings were unexpected, as they did not align with the performance reported in previous studies [68], [73], [74], [75]. To investigate the potential impact of testing datasets on the results, we repeated the experiments on the LFW dataset, which is the most commonly used dataset for evaluating these methods. ...
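For context, a hedged sketch of how the DeepFace library exposes these detectors and recognition models by name; the image paths below are placeholders, and the model and backend strings are examples of the options the excerpts mention.

```python
# Hedged sketch of DeepFace-style verification with a selectable detector
# backend and recognition model. Paths are placeholders.
from deepface import DeepFace

result = DeepFace.verify(
    img1_path="person_a.jpg",
    img2_path="person_b.jpg",
    model_name="Dlib",          # e.g. "ArcFace", "Facenet", "VGG-Face", "Dlib"
    detector_backend="dlib",    # e.g. "mtcnn", "opencv", "ssd", "retinaface"
)
print(result["verified"], result["distance"])
```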
Preprint
Full-text available
The lack of a common platform and benchmark datasets for evaluating face obfuscation methods has been a challenge, with every method being tested using arbitrary experiments, datasets, and metrics. While prior work has demonstrated that face recognition systems exhibit bias against some demographic groups, there exists a substantial gap in our understanding regarding the fairness of face obfuscation methods. Providing fair face obfuscation methods can ensure equitable protection across diverse demographic groups, especially since they can be used to preserve the privacy of vulnerable populations. To address these gaps, this paper introduces a comprehensive framework, named FairDeFace, designed to assess the adversarial robustness and fairness of face obfuscation methods. The framework introduces a set of modules encompassing data benchmarks, face detection and recognition algorithms, adversarial models, utility detection models, and fairness metrics. FairDeFace serves as a versatile platform where any face obfuscation method can be integrated, allowing for rigorous testing and comparison with other state-of-the-art methods. In its current implementation, FairDeFace incorporates 6 attacks, and several privacy, utility and fairness metrics. Using FairDeFace, and by conducting more than 500 experiments, we evaluated and compared the adversarial robustness of seven face obfuscation methods. This extensive analysis led to many interesting findings both in terms of the degree of robustness of existing methods and their biases against some gender or racial groups. FairDeFace also uses visualization of focused areas for both obfuscation and verification attacks to show not only which areas are mostly changed in the obfuscation process for some demographics, but also why they failed through focus area comparison of obfuscation and verification.
... For training SelfMAD, we use only the training subset, which contains 25,000 bona fide images. Raw images are preprocessed by detecting faces in each image using Dlib [31]. The detected square face regions are enlarged by a randomly selected margin between 4% and 20%, then cropped and resized to 384×384 pixels. ...
... While Greedy-DiM is derived from FRLL, MorCode, and Morph-PIPE morphs both originate from FRGC, and fulfill the quality constraints laid down by the International Civil Aviation Organization (ICAO). Selected samples representing bona fide images and different morphing attacks are presented in Fig. 3. Faces in all testing images were detected by Dlib [31] and cropped out with a fixed margin of 12.5%. ...
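A minimal sketch of this kind of preprocessing, assuming dlib's frontal face detector and OpenCV for cropping and resizing; the random margin range and 384×384 size follow the text above, while the squaring and boundary-clamping logic is an illustrative assumption.

```python
# Sketch: detect a face with dlib, enlarge the square box by a random margin,
# crop, and resize. Margin range and output size follow the cited description.
import random
import cv2
import dlib

detector = dlib.get_frontal_face_detector()

def crop_face(image_bgr, min_margin=0.04, max_margin=0.20, size=384):
    faces = detector(cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY), 1)
    if not faces:
        return None
    f = faces[0]
    side = max(f.width(), f.height())
    margin = random.uniform(min_margin, max_margin)
    half = int(side * (1 + margin) / 2)
    cx, cy = (f.left() + f.right()) // 2, (f.top() + f.bottom()) // 2
    h, w = image_bgr.shape[:2]
    x0, y0 = max(0, cx - half), max(0, cy - half)
    x1, y1 = min(w, cx + half), min(h, cy + half)
    return cv2.resize(image_bgr[y0:y1, x0:x1], (size, size))
```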
Preprint
Full-text available
With the continuous advancement of generative models, face morphing attacks have become a significant challenge for existing face verification systems due to their potential use in identity fraud and other malicious activities. Contemporary Morphing Attack Detection (MAD) approaches frequently rely on supervised, discriminative models trained on examples of bona fide and morphed images. These models typically perform well with morphs generated with techniques seen during training, but often lead to sub-optimal performance when subjected to novel unseen morphing techniques. While unsupervised models have been shown to perform better in terms of generalizability, they typically result in higher error rates, as they struggle to effectively capture features of subtle artifacts. To address these shortcomings, we present SelfMAD, a novel self-supervised approach that simulates general morphing attack artifacts, allowing classifiers to learn generic and robust decision boundaries without overfitting to the specific artifacts induced by particular face morphing methods. Through extensive experiments on widely used datasets, we demonstrate that SelfMAD significantly outperforms current state-of-the-art MADs, reducing the detection error by more than 64% in terms of EER when compared to the strongest unsupervised competitor, and by more than 66%, when compared to the best performing discriminative MAD model, tested in cross-morph settings. The source code for SelfMAD is available at https://github.com/LeonTodorov/SelfMAD.
... This project employs the DLIB facial landmark detection model [41] for identifying and tracking human head landmarks, consisting of two main components: face detection and face alignment. The face detection algorithm leverages Histogram of Oriented Gradients (HOG) for feature extraction combined with a Support Vector Machine (SVM) for classification. ...
... • Use the HOG-based cascaded classifier to extract all feature vectors, including HOG features, from patient images; • Input the extracted feature vectors into the SVM model inherited from the CPP-DLIB library [41] to classify and extract features around the facial region, thereby identifying and annotating the location of the face in the image; • Pass the annotated facial region as input to the 68-point alignment model to achieve real-time detection of 68 facial landmarks. The alignment standard, shown in Figure 3, includes key facial regions such as the contours, eyes, eyebrows, nasal triangle, and mouth. ...
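As a rough sketch of the two-stage pipeline described above (dlib's HOG+SVM face detector followed by the 68-point alignment model), assuming the publicly distributed shape_predictor_68_face_landmarks.dat model file and a webcam source; this is not the cited system's code.

```python
# Illustrative real-time pipeline: HOG+SVM face detection, then 68-point
# landmark alignment, drawing the detected landmarks on each frame.
import cv2
import dlib

detector = dlib.get_frontal_face_detector()                  # HOG features + SVM classifier
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

cap = cv2.VideoCapture(0)                                    # assumed webcam source
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for face in detector(gray, 0):
        shape = predictor(gray, face)
        for i in range(68):                                  # draw all 68 landmarks
            cv2.circle(frame, (shape.part(i).x, shape.part(i).y), 2, (0, 255, 0), -1)
    cv2.imshow("landmarks", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```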
Article
Full-text available
Positron emission tomography (PET) is one of the most advanced imaging diagnostic devices in the medical field, playing a crucial role in tumor diagnosis and treatment. However, patient motion during scanning can lead to motion artifacts, which affect diagnostic accuracy. This study aims to develop a head motion monitoring system to identify and select images with excessive motion and corresponding periods. The system, based on an RGB-D structured-light camera, implements facial feature point detection, 3D information acquisition, and head motion monitoring, along with a user interaction software. Through phantom experiments and volunteer experiments, the system’s performance was tested under various conditions, including stillness, pitch movement, yaw movement, and comprehensive movement. Experimental results show that the system’s translational error is less than 2.5 mm, rotational error is less than 2.0°, and it can output motion monitoring results within 10 s after the PET scanning, meeting clinical accuracy requirements and showing significant potential for clinical application.
... Afterwards, the videos were read for cropping the speakers using the OpenCV programming library [32] and were cropped to the faces through the Python module Face Recognition [33]. The extraction of the lip image sections themselves was implemented using the toolkits Dlib [34] and Imutils [35]. The division of the dataset into training, validation or test sets as well as the conversion of the videos into color spaces was performed using the Scikit-learn programming library [36]. ...
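A hedged sketch of such a preprocessing chain, combining OpenCV, face_recognition, dlib/imutils, and scikit-learn; the video path, crop size, and split ratio are placeholders rather than the cited study's settings.

```python
# Sketch: read frames with OpenCV, locate the face with face_recognition,
# crop the lip region via dlib/imutils landmarks, and split with scikit-learn.
import cv2
import dlib
import face_recognition
from imutils import face_utils
from sklearn.model_selection import train_test_split

predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def lip_crops(video_path, size=(64, 64)):
    crops = []
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        locations = face_recognition.face_locations(rgb)      # (top, right, bottom, left)
        if not locations:
            continue
        top, right, bottom, left = locations[0]
        shape = predictor(rgb, dlib.rectangle(left, top, right, bottom))
        pts = face_utils.shape_to_np(shape)[48:68].astype("int32")   # lip landmarks
        x, y, w, h = cv2.boundingRect(pts)
        crops.append(cv2.resize(frame[y:y + h, x:x + w], size))
    cap.release()
    return crops

# Example split into training and validation sets (placeholder data):
# X_train, X_val, y_train, y_val = train_test_split(clips, labels, test_size=0.2)
```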
Preprint
Full-text available
When reading lips, many people benefit from additional visual information from the lip movements of the speaker, which is, however, very error prone. Algorithms for lip reading with artificial intelligence based on artificial neural networks significantly improve word recognition but are not available for the German language. A total of 1806 video clips with only one German-speaking person each were selected, split into word segments, and assigned to word classes using speech-recognition software. In 38,391 video segments with 32 speakers, 18 polysyllabic, visually distinguishable words were used to train and validate a neural network. The 3D Convolutional Neural Network and Gated Recurrent Units models and a combination of both models (GRUConv) were compared, as were different image sections and color spaces of the videos. The accuracy was determined in 5000 training epochs. Comparison of the color spaces did not reveal any relevant different correct classification rates in the range from 69% to 72%. With a cut to the lips, a significantly higher accuracy of 70% was achieved than when cut to the entire speaker's face (34%). With the GRUConv model, the maximum accuracies were 87% with known speakers and 63% in the validation with unknown speakers. The neural network for lip reading, which was first developed for the German language, shows a very high level of accuracy, comparable to English-language algorithms. It works with unknown speakers as well and can be generalized with more word classes.
... FaceCraft4D: Animated 3D Facial Avatar Generation from a Single ImageFollowing[1], we crop head regions for GAN inversion. Specifically, we use dlib[18] to detect 68 facial keypoints. The keypoints are then aligned to ensure the face is centered in the image. ...
Preprint
We present a novel framework for generating high-quality, animatable 4D avatar from a single image. While recent advances have shown promising results in 4D avatar creation, existing methods either require extensive multiview data or struggle with shape accuracy and identity consistency. To address these limitations, we propose a comprehensive system that leverages shape, image, and video priors to create full-view, animatable avatars. Our approach first obtains initial coarse shape through 3D-GAN inversion. Then, it enhances multiview textures using depth-guided warping signals for cross-view consistency with the help of the image diffusion model. To handle expression animation, we incorporate a video prior with synchronized driving signals across viewpoints. We further introduce a Consistent-Inconsistent training to effectively handle data inconsistencies during 4D reconstruction. Experimental results demonstrate that our method achieves superior quality compared to the prior art, while maintaining consistency across different viewpoints and expressions.
... Therefore, we opted to utilize Facial Landmarks (FL). To detect Facial Landmarks we utilize the Dlib library [15] which predicts 68 landmarks on a human face. We first calculate the aspect ratios of certain facial features following [24]. ...
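Below is a small sketch of aspect-ratio features computed from the 68 landmarks, using the widely used eye-aspect-ratio formulation; the exact feature set and indices in the cited work may differ.

```python
# Aspect-ratio features from a (68, 2) landmark array produced by dlib's
# shape predictor. Index groups follow the standard 68-point layout.
import numpy as np

def aspect_ratio(pts):
    # pts: six (x, y) points ordered around the eye contour.
    v1 = np.linalg.norm(pts[1] - pts[5])
    v2 = np.linalg.norm(pts[2] - pts[4])
    h = np.linalg.norm(pts[0] - pts[3])
    return (v1 + v2) / (2.0 * h)

def eye_mouth_ratios(landmarks):
    right_eye = aspect_ratio(landmarks[36:42])
    left_eye = aspect_ratio(landmarks[42:48])
    # Mouth openness: vertical lip gap over mouth width.
    mouth = np.linalg.norm(landmarks[51] - landmarks[57]) / \
            np.linalg.norm(landmarks[48] - landmarks[54])
    return left_eye, right_eye, mouth
```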
Preprint
Full-text available
We propose a method to transfer pose and expression between face images. Given a source and target face portrait, the model produces an output image in which the pose and expression of the source face image are transferred onto the target identity. The architecture consists of two encoders and a mapping network that projects the two inputs into the latent space of StyleGAN2, which finally generates the output. The training is self-supervised from video sequences of many individuals. Manual labeling is not required. Our model enables the synthesis of random identities with controllable pose and expression. Close-to-real-time performance is achieved.
... The powerful capabilities of 2D FR systems, if misused for tracking and surveillance, can lead to severe privacy concerns. To demonstrate the effectiveness of our privacy-preserving method, we evaluated its resistance against seven state-of-the-art (SOTA) 2D face recognition systems (Parkhi, Vedaldi, and Zisserman 2015; Schroff, Kalenichenko, and Philbin 2015; Deng et al. 2019; King 2009; Zhong et al. 2021; Alansari et al. 2023), which typically represent faces as vectors, using metrics like cosine similarity (Figure 7) to measure the similarity between images of the same person. The test results in Table 2 included original sensitive images (Original), pro- [Figure 8 caption: We reconstruct geometric details without rendering realistic sensitive images (row 1).] ...
Article
While 3D head reconstruction is widely used for modeling, existing neural reconstruction approaches rely on high-resolution multi-view images, posing notable privacy issues. Individuals are particularly sensitive to facial features, and facial image leakage can lead to many malicious activities, such as unauthorized tracking and deepfake. In contrast, geometric data is less susceptible to misuse due to its complex processing requirements, and absence of facial texture features. In this paper, we propose a novel two-stage 3D facial reconstruction method aimed at avoiding exposure to sensitive facial information while preserving detailed geometric accuracy. Our approach first uses non-sensitive rear-head images for initial geometry and then refines this geometry using processed privacy-removed gradient images. Extensive experiments show that the resulting geometry is comparable to methods using full images, while the process is resistant to DeepFake applications and facial recognition (FR) systems, thereby proving its effectiveness in privacy protection.
... Identity features are extracted by FaceNet (Schroff, Kalenichenko, and Philbin 2015) trained on CASIA (Yi et al. 2014), FaceNet trained on VGGFace2 (Cao et al. 2018), and SphereFace (Liu et al. 2017), which are not used in the training procedure. Other utilities: following previous methods Cai et al. 2024), we adopt the Dlib (King 2009) and L2CS-Net (Abdelrahman et al. 2023) to evaluate the landmark detection and gaze estimation performances. Reversibility: we compare our method against the previous reversible methods, in terms of ID similarity, medical results, and visual quality of the reconstructed original image. ...
Article
Face de-identification (DeID) has been widely studied for common scenes, but remains under-researched for medical scenes, mostly due to the lack of large-scale patient face datasets. In this paper, we release MeMa, consisting of over 40,000 photo-realistic patient faces. MeMa is re-generated from massive real patient photos. By carefully modulating the generation and data-filtering procedures, MeMa avoids breaching real patient privacy, while ensuring rich and plausible medical manifestations. We recruit expert clinicians to annotate MeMa with both coarse- and fine-grained labels, building the first medical-scene DeID benchmark. Additionally, we propose a baseline approach for this new medical-aware DeID task, by integrating data-driven medical semantic priors into the DeID procedure. Despite its conciseness and simplicity, our approach substantially outperforms previous ones.
... However, the effectiveness of the proposed method needs to be tested on combined datasets. The 68-point visage model from the dlib library was released in 2009 [7] and quickly gained popularity as a standard in the industry. This model provided a reliable tool for the FER task due to its accuracy in critical areas around the mouth, nose, eyes, and chin. ...
Article
Full-text available
Autonomous face emotion recognition (FER) with landmarks has become an important field of research for human-computer interaction. A significant achievement has been achieved through deep learning algorithms in recent times. Recognizing faces can be done using an end-to-end approach with deep learning techniques, which learns a mapping from raw pixels to the target label. In the field of emotional classification, the research community has extensively utilized 98 and 68 facial landmarks. In particular, pre-trained convolutional neural networks such as the residual network 50-layer network with the random sampler, Visual Geometry Group 16-layer network, and MobileNet including their ensemble versions of deep learning models are popular among researchers due to their ability to handle complex data. Researchers have mostly evaluated the model on a single dataset. A single dataset poses a challenge in developing a generalized model capable of capturing the full versatility of emotions. The key challenge in the dataset is that a single emotion is represented in multiple facial expressions with low-resolution images. This research study uses a combined dataset (CK+, KDEF, and FER-2013), which is more challenging than a single dataset. This research study offers a comprehensive analysis involving 68 and 98 landmarks with different FER deep models, examining how landmarking and different network architectures contribute to emotion recognition accuracy. This research study also considers the overfitting and class imbalance of the proposed ensemble model, which improves its performance by batch-wise feature extraction. Results show 78% accuracy with 98 landmarks and 75% with 68 landmarks. Overall, the model significantly reduces the gap between training and testing accuracy for both single and combined datasets.
... In this module, face detection and lip extraction are the very first tasks. Before the popularity of deep learning-based landmark prediction models, traditional approaches often used color information or structural information for lip detection [7], but pre-trained deep models, such as Dlib [62] and RetinaFace [63], have made this process faster, more accurate, and easier to integrate in any VSR pipeline. The lip region is generally selected as the input to the VSR system [64], however, several studies have demonstrated that their changes are not the only visual signal helping to decode speech [3]. ...
Article
Full-text available
Automatic lip reading has experienced significant advancements driven by deep learning techniques and the availability of large-scale datasets. Traditionally focused on enhancing Audio Speech Recognition (ASR) systems, Visual Speech Recognition (VSR) now demonstrates promising applications in biometric identification, silent speech interfaces, multimodal verification systems, and forensic video analysis. This paper presents a comprehensive survey of the current state-of-the-art deep learning-based VSR research, highlighting key data challenges, task-specific complications, and their solutions. We thoroughly review the essential components of VSR pipelines, including feature extraction, model architectures, and evaluation metrics. Additionally, we examine the most influential datasets and the obstacles they present. Our survey also delves into promising future research directions, such as developing lightweight and fast VSR models, leveraging weakly supervised and few-shot learning techniques, and integrating pre-training and fine-tuning strategies to improve model performance. By addressing the performance gap between lip reading and other computer vision applications, this paper aims to facilitate the practical deployment of VSR systems in real-world scenarios, ultimately bridging the gap between theory and practice.
... Foram selecionados para avaliação métodos com diferentes abordagens e complexidades, prontamente disponíveis para uso em bibliotecas de software (em particular, na linguagem Python, utilizada no desenvolvimento deste trabalho). Em específico, foram comparados o algoritmo Haar-Cascaded [Viola e Jones 2001], utilizando diferentes versões, por meio da biblioteca OpenCV [Bradski 2000]; Histogram of Oriented Gradients (HOG) [Dalal e Triggs 2005], por meio da biblioteca Dlib [King 2009]; YOLO v8 (You Only Look Once) [Redmon et al. 2016]; Multi-task Cascaded Convolutional Networks [Zhang et al. 2016] e; o modelo de rede neural profunda disponível no OpenCV para detecção de faces (módulo dnn -deep neural networks), pré-treinada com a base de dados Caffe [Jia et al. 2014]. ...
Conference Paper
Este artigo compara métodos de detecção facial sob variações de iluminação, destacando o impacto da iluminação na precisão dos algoritmos de reconhecimento facial. Foram analisados diferentes algoritmos, incluindo métodos baseados em Haar-Cascaded, baseados em Redes Neurais Artificiais e também Histograma de Gradientes Orientados. Os resultados indicam que, embora alguns métodos apresentem bom desempenho em condições de iluminação variadas, outros mostram quedas significativas de precisão em ambientes com pouca luz. A pesquisa contribui para o entendimento das limitações e capacidades dos métodos de detecção facial em diferentes condições de iluminação, sendo relevante para o desenvolvimento de sistemas mais robustos.
... These features are often represented as reference points, commonly known as landmarks, which serve as a framework for analyzing facial expressions. Among the most commonly used standards in this field is the 68-point landmark model (King, 2009), which provides a comprehensive mapping of the face by marking critical areas such as the eyes, eyebrows, nose, and mouth. Figure 3 shows the indexes of the 68 landmark coordinates visualized on the image. ...
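For reference, the index ranges of the standard 68-point layout (the dlib / iBUG 300-W convention) and a sketch of selecting a reduced subset; the selection shown is only illustrative, not the 24-point model proposed in the paper.

```python
# Index ranges of the 68-point landmark convention, and a helper for keeping
# only selected regions (e.g. eyes and mouth).
LANDMARK_GROUPS = {
    "jaw":           range(0, 17),
    "right_eyebrow": range(17, 22),
    "left_eyebrow":  range(22, 27),
    "nose":          range(27, 36),
    "right_eye":     range(36, 42),
    "left_eye":      range(42, 48),
    "mouth":         range(48, 68),
}

def select_points(landmarks, groups=("right_eye", "left_eye", "mouth")):
    # landmarks: sequence of 68 (x, y) pairs; returns only the chosen regions.
    keep = [i for g in groups for i in LANDMARK_GROUPS[g]]
    return [landmarks[i] for i in keep]
```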
Article
Full-text available
Facial expressions play a crucial role in human emotion recognition and social interaction. Prior research has highlighted the significance of the eyes and mouth in identifying emotions; however, limited studies have validated these claims using robust biometric evidence. This study investigates the prioritization of facial features during emotion recognition and introduces an optimized approach to landmark-based analysis, enhancing efficiency without compromising accuracy. A total of 30 participants were recruited to evaluate images depicting six emotions: anger, disgust, fear, neutrality, sadness, and happiness. Eye-tracking technology was utilized to record gaze patterns, identifying the specific facial regions participants focused on during emotion recognition. The collected data informed the development of a streamlined facial landmark model, reducing the complexity of traditional approaches while preserving essential information. The findings confirmed a consistent prioritization of the eyes and mouth, with minimal attention allocated to other facial areas. Leveraging these insights, we designed a reduced landmark model that minimizes the conventional 68-point structure to just 24 critical points, maintaining recognition accuracy while significantly improving processing speed.
... The first step in our pre-processing pipeline is dedicated to face detection. For this purpose, we used the facial landmark detection method proposed by Kazemi and Sullivan [18], which is known for its reliability and is implemented in Dlib [19]. This method is based on a combination of machine learning techniques, including both traditional shape predictors and deep learning models, designed to accurately identify 68 facial landmark points, encompassing facial features such as the eyes, nose, mouth, and various facial contours. ...
Article
Drowsiness significantly impairs human concentration and reflexes, leading to a heightened risk of accidents. Despite this, many drivers fail to recognize their drowsiness in time, often with serious consequences. Traditional detection systems based on vehicle movement and steering angles are inadequate in preventing such incidents. Existing vision-based systems, while promising, are typically limited to eye movement analysis, require extensive parameter tuning, and often struggle under varying conditions. To address these challenges, we propose a novel approach for Driver Drowsiness Detection that leverages facial features. Our method utilizes Local Binary Patterns on Three Orthogonal Planes for feature extraction and employs Support Vector Machines for classification. Experiments conducted on two benchmark drowsiness datasets, UTA-RLDD and DROZY, demonstrate our system's efficacy, achieving accuracy rates of 82% and 90%, respectively. These results indicate the potential for a more reliable and non-invasive drowsiness detection system.
... where the superscript c indicates computed quantities to differentiate them from the measurements. 8. Compute the values Āz(t1), Ēl(t1), Āz(t3), Ēl(t3) such that Δ = 0 or to minimize J = ΔᵀΔ. The solution to the set of nonlinear equations or the minimization problem is obtained with the Broyden-Fletcher-Goldfarb-Shanno algorithm implemented in the dlib library (King, 2009). However, while this algorithm can provide an IOD solution, in practice it is not useful as it fails to provide the entire set of orbits compatible with the measurements and their uncertainty, i.e. the orbit set. ...
Preprint
Full-text available
With debris larger than 1 cm in size estimated to be over one million, precise cataloging efforts are essential to ensure space operations' safety. Compounding this challenge is the oversubscribed problem, where the sheer volume of space objects surpasses ground-based observatories' observational capacity. This results in sparse, brief observations and extended intervals before image acquisition. LeoLabs' network of phased-array radars addresses this need by reliably tracking 10 cm objects and larger in low Earth orbit with 10 independent radars across six sites. While LeoLabs tracklets are extremely short, they hold much more information than typical radar observations. Furthermore, two tracklets are generally available, separated by a couple of minutes. Thus, this paper develops a tailored approach to initialize state and uncertainty from a single or pair of tracklets. Through differential algebra, the initial orbit determination provides the state space compatible with the available measurements, namely an orbit set. This practice, widely used in previous research, allows for efficient data association of different tracklets, thus enabling the addition of accurate tracks to the catalog following their independent initialization. The algorithm's efficacy is tested using real measurements, evaluating the IOD solution's accuracy and ability to predict the next passage from a single or a pair of tracklets.
... Implementation Details For each input image, Dlib (King 2009) is used for aligning facial images. Each image is cropped and resized to 256 × 256. ...
Preprint
With the increasing need for facial behavior analysis, semi-supervised AU intensity estimation using only keyframe annotations has emerged as a practical and effective solution to relieve the burden of annotation. However, the lack of annotations makes the spurious correlation problem caused by AU co-occurrences and subject variation much more prominent, leading to non-robust intensity estimation that is entangled among AUs and biased among subjects. We observe that trend information inherent in keyframe annotations could act as extra supervision and raising the awareness of AU-specific facial appearance changing trends during training is the key to learning invariant AU-specific features. To this end, we propose \textbf{T}rend-\textbf{A}ware \textbf{S}upervision (TAS), which pursues three kinds of trend awareness, including intra-trend ranking awareness, intra-trend speed awareness, and inter-trend subject awareness. TAS alleviates the spurious correlation problem by raising trend awareness during training to learn AU-specific features that represent the corresponding facial appearance changes, to achieve intensity estimation invariance. Experiments conducted on two commonly used AU benchmark datasets, BP4D and DISFA, show the effectiveness of each kind of awareness. And under trend-aware supervision, the performance can be improved without extra computational or storage costs during inference.
... We applied the dlib tool(King, 2009) to each video clip in order to detect the 68 facial keypoints. Subsequently, we employ an affine transformation to align each video frame with a standard reference face frame. ...
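A minimal sketch of such keypoint-based alignment, assuming a stored reference landmark array and OpenCV's partial affine estimation; the reference file name and output size are placeholders, not values from the cited work.

```python
# Align each frame to a reference face layout: detect 68 landmarks with dlib,
# estimate a similarity/affine transform to the reference points, and warp.
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")
reference = np.load("reference_68_landmarks.npy").astype(np.float32)   # (68, 2), assumed

def align_frame(frame_bgr, out_size=(224, 224)):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    pts = np.array([(shape.part(i).x, shape.part(i).y) for i in range(68)],
                   dtype=np.float32)
    M, _ = cv2.estimateAffinePartial2D(pts, reference)
    if M is None:
        return None
    return cv2.warpAffine(frame_bgr, M, out_size)
```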
Preprint
Full-text available
Video-to-speech (V2S) synthesis, the task of generating speech directly from silent video input, is inherently more challenging than other speech synthesis tasks due to the need to accurately reconstruct both speech content and speaker characteristics from visual cues alone. Recently, audio-visual pre-training has eliminated the need for additional acoustic hints in V2S, which previous methods often relied on to ensure training convergence. However, even with pre-training, existing methods continue to face challenges in achieving a balance between acoustic intelligibility and the preservation of speaker-specific characteristics. We analyzed this limitation and were motivated to introduce DiVISe (Direct Visual-Input Speech Synthesis), an end-to-end V2S model that predicts Mel-spectrograms directly from video frames alone. Despite not taking any acoustic hints, DiVISe effectively preserves speaker characteristics in the generated audio, and achieves superior performance on both objective and subjective metrics across the LRS2 and LRS3 datasets. Our results demonstrate that DiVISe not only outperforms existing V2S models in acoustic intelligibility but also scales more effectively with increased data and model parameters. Code and weights can be found at https://github.com/PussyCat0700/DiVISe.
... This study utilizes Dlib [36] to obtain the coordinates of both eyes on the face. From these coordinates, a similarity transformation is carried out to align the two eyes horizontally at a fixed distance by rotating and scaling the image. ...
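A small sketch of eye-based similarity alignment along these lines; the target eye distance, vertical placement, and output size are illustrative assumptions rather than the cited study's parameters.

```python
# Rotate and scale the image so the two eye centers become horizontal and a
# fixed distance apart, then shift the eye midpoint to a canonical position.
import cv2
import numpy as np

def align_by_eyes(image_bgr, left_eye, right_eye,
                  out_size=(256, 256), target_dist=0.35, target_y=0.40):
    # left_eye / right_eye: (x, y) centers, e.g. averaged dlib eye landmarks.
    left_eye, right_eye = np.array(left_eye, float), np.array(right_eye, float)
    dx, dy = right_eye - left_eye
    angle = np.degrees(np.arctan2(dy, dx))                   # tilt of the eye line
    scale = (target_dist * out_size[0]) / np.hypot(dx, dy)   # enforce fixed eye distance
    mid = (left_eye + right_eye) / 2.0
    center = (float(mid[0]), float(mid[1]))
    M = cv2.getRotationMatrix2D(center, angle, scale)
    # Shift the eye midpoint to its canonical position in the output image.
    M[0, 2] += out_size[0] / 2.0 - center[0]
    M[1, 2] += target_y * out_size[1] - center[1]
    return cv2.warpAffine(image_bgr, M, out_size)
```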
Article
Full-text available
This work explores whether facial sketches can be used to predict personality traits, marking, to our knowledge, a first attempt in the literature. Unlike traditional RGB facial images, which capture detailed features, sketch-based images emphasize the structure and movement of facial expression muscles, offering a novel approach to personality prediction. The key contributions are threefold: (1) Expression muscles are proposed to weight the extracted features, thereby improving prediction accuracy; (2) Intermediate sketches generated through the 25-Step Sketching Approach are used for data augmentation to address the issue of data scarcity; (3) Experimental results on a dataset of 12,320 individuals, along with ablation studies, demonstrate that, compared to image-based models, our sketch-based model can predict certain personality traits with similarly high accuracy. Moreover, both the expression muscle weighting and data augmentation strategies positively contribute to performance. Our findings, along with the constructed sketch datasets, provide valuable insights for researchers and practitioners in the field.
Article
Full-text available
In institutions such as universities, corporate offices, and restricted-access areas, enforcing ID card compliance is critical for ensuring security, tracking attendance, and maintaining discipline. Manual enforcement is often inefficient and prone to oversight. To address this, we propose an automated ID Card Detection and Penalty Mechanism that leverages deep learning models for object detection and facial recognition. The system utilizes YOLOv5 for real-time identification of ID cards worn by individuals in front of a camera. If the system fails to detect an ID card, it automatically initiates a secondary process that uses facial recognition to identify the person, predicts their roll number, and triggers an alert mechanism. This includes sending an automated email notification to a predefined recipient, reporting the incident along with the identified individual's details. The system is trained specifically on a dataset comprising known faces and ID card positions to ensure high accuracy in controlled environments. It includes a user-friendly interface where users can start the camera, initiate detection, and send email notifications directly through the GUI. The model is effective in both detecting the presence of ID cards and in handling non-compliance scenarios by linking the individual's identity to the infraction. Experimental evaluations show that the system performs reliably across various lighting conditions and backgrounds, with minimal false detections. The proposed solution offers a scalable and efficient method to automate ID enforcement, enhance security monitoring, and reduce dependency on manual supervision.
INTRODUCTION
In today's technologically advanced world, automated surveillance and identity verification systems have become increasingly important across various sectors, including educational institutions, corporate offices, research labs, and secure government facilities. One fundamental component of such security frameworks is the enforcement of visible identification cards (ID cards) worn by employees, students, or visitors. ID cards serve not only as authentication tools but also as key enablers for access control, attendance monitoring, and accountability. However, ensuring consistent compliance with ID-wearing policies remains a challenge when done manually. Relying on security personnel or administrative staff to monitor ID card usage is time-consuming, resource-intensive, and susceptible to human error. To address this issue, there is a growing need for automated systems that can detect whether individuals are wearing their ID cards and take corrective actions if non-compliance is observed. In this context, computer vision and deep learning techniques offer powerful tools for real-time monitoring and decision-making. Object detection models such as YOLO (You Only Look Once), combined with face recognition and identity prediction algorithms, enable systems to detect ID cards, recognize faces, and link individuals to a known database. These technologies allow institutions to build intelligent surveillance systems that can proactively enforce policies without requiring continuous human intervention. This paper presents an integrated ID Card Detection and Penalty Mechanism system that automates the process of identifying individuals who are not wearing their ID cards and subsequently triggering a disciplinary or notification process.
The system uses the YOLOv5 object detection model to identify the presence or absence of an ID card in live camera feeds. If no card is detected, the system uses facial recognition to predict the identity or roll number of the person. Once the individual is identified, the system allows an administrator or supervisor to send a warning message to a designated email address directly from the application interface. The proposed system is particularly useful in educational campuses where students are required to wear ID cards as part of institutional discipline. In such environments, the model can be trained on a dataset containing students' facial images and sample ID card images. The system interface includes real-time camera access, detection initiation, identity display, and email alert generation, making it a complete solution for daily compliance enforcement. Additionally, the model is designed to be lightweight, fast, and easy to deploy on any machine with a webcam. It achieves high detection accuracy under various lighting conditions and works effectively in real-time, thus meeting the practical requirements of a surveillance-grade system. In summary, this research contributes an end-to-end automated framework that enforces ID-wearing compliance using deep learning. By eliminating manual checking and incorporating intelligent alert mechanisms, the system significantly enhances institutional security, operational efficiency, and rule enforcement.
Article
This paper addresses a critical security challenge in the field of automated face recognition, i.e., morphing attack. The paper introduces a novel differential morphing attack detection (D-MAD) system called ACIdA, which is specifically designed to overcome the limitations of existing D-MAD approaches. Traditional methods are effective when the morphed image and live capture are distinct, but they falter when the morphed image closely resembles the accomplice. This is a critical gap because detecting accomplice involvement in addition to the criminal one is essential for robust security. ACIdA’s impact is underscored by its innovative approach, which consists of three modules: One for classifying the type of attempt (bona fide, criminal, or accomplice verification attempt), and two others dedicated to analyzing identity and artifacts. This multi-faceted approach enables ACIdA to excel in scenarios where the morphed image does not equally represent both contributing subjects–a common and challenging situation in real-world applications. The paper’s extensive cross-dataset experimental evaluation demonstrates that ACIdA achieves state-of-the-art results in detecting accomplices, a crucial advancement for enhancing the security of face recognition systems. Furthermore, it maintains strong performance in identifying criminals, thereby addressing a significant vulnerability in current D-MAD methods and marking a substantial contribution to the field of facial recognition security.
Article
Full-text available
Introduction Rapid advancements in artificial intelligence and generative artificial intelligence have enabled the creation of fake images and videos that appear highly realistic. According to a report published in 2022, approximately 71% of people rely on fake videos and become victims of blackmail. Moreover, these fake videos and images are used to tarnish the reputation of popular public figures. This has increased the demand for deepfake detection techniques. The accuracy of the techniques proposed in the literature so far varies with changes in fake content generation techniques. Additionally, these techniques are computationally intensive. The techniques discussed in the literature are based on convolutional neural networks, Linformer models, or transformer models for deepfake detection, each with its advantages and disadvantages. Methods In this manuscript, a hybrid architecture combining transformer and Linformer models is proposed for deepfake detection. This architecture converts an image into patches and performs position encoding to retain spatial relationships between patches. Its encoder captures the contextual information from the input patches, and Gaussian Error Linear Unit resolves the vanishing gradient problem. Results The Linformer component reduces the size of the attention matrix. Thus, it reduces the execution time to half without compromising accuracy. Moreover, it utilizes the unique features of transformer and Linformer models to enhance the robustness and generalization of deepfake detection techniques. The low computational requirement and high accuracy of 98.9% increase the real-time applicability of the model, preventing blackmail and other losses to the public. Discussion The proposed hybrid model utilizes the strength of the transformer model in capturing complex patterns in data. It uses the self-attention potential of the Linformer model and reduces the computation time without compromising the accuracy. Moreover, the models were implemented on patch sizes of 6 and 11. It is evident from the obtained results that increasing the patch size improves the performance of the model. This allows the model to capture fine-grained features and learn more effectively from the same set of videos. The larger patch size also enables the model to better preserve spatial details, which contributes to improved feature extraction.
Preprint
Full-text available
The growing threat posed by deepfake videos, capable of manipulating realities and disseminating misinformation, drives the urgent need for effective detection methods. This work investigates and compares different approaches for identifying deepfakes, focusing on the GenConViT model and its performance relative to other architectures present in the DeepfakeBenchmark. To contextualize the research, the social and legal impacts of deepfakes are addressed, as well as the technical fundamentals of their creation and detection, including digital image processing, machine learning, and artificial neural networks, with emphasis on Convolutional Neural Networks (CNNs), Generative Adversarial Networks (GANs), and Transformers. The performance evaluation of the models was conducted using relevant metrics and new datasets established in the literature, such as WildDeepfake and DeepSpeak, aiming to identify the most effective tools in the battle against misinformation and media manipulation. The obtained results indicated that GenConViT, after fine-tuning, exhibited superior performance in terms of accuracy (93.82%) and generalization capacity, surpassing other architectures in the DeepfakeBenchmark on the DeepSpeak dataset. This study contributes to the advancement of deepfake detection techniques, offering contributions to the development of more robust and effective solutions against the dissemination of false information.
Article
A large number of road accidents occur across the world, and a significant share of them are caused by driver fatigue and drowsiness. A system that reliably detects drowsy behavior and signs of fatigue can help prevent such accidents by monitoring driver behavior and alerting the driver when fatigue is detected. This paper presents a machine learning approach that processes real-time images, extracting cues such as eye closure and facial expressions and computing open- and closed-eye aspect ratios with different machine learning algorithms, and generates alerts when drowsy driver behavior is detected. Experimental results show that our system is effective and reliable in detecting drivers' drowsy behavior and generates timely alerts, helping to keep the driver safe.
Article
Identity verification is essential in both an individual's personal and professional life. It confirms a person's identity for various services and establishes their legitimacy as an employee within an organization. As cybercrime evolves and becomes more sophisticated, ensuring robust and secure personal authentication methods has become a critical challenge. Existing face-based authentication systems typically employ deep learning models for user verification. However, these systems are susceptible to various attacks, such as presentation attacks, 3D mask attacks, and adversarial attacks that exploit and deceive the models by manipulating digital representations of human faces. Although various liveness detection techniques have been proposed to combat face spoofing in face-based authentication systems, these systems remain vulnerable and can be exploited by sophisticated techniques. To counteract face spoofing in a face-based authentication system, we have proposed an advanced liveness detection technique using Visual Speech Recognition (VSR). The proposed VSR model is designed to integrate seamlessly with face-based authentication systems, forming a dual authentication framework for enhanced liveness detection. The VSR model decodes silently pronounced speech from video by analyzing unique, unforgeable lip motion patterns into textual representation. To achieve effective liveness detection using VSR, we need to enhance the accuracy of the VSR system. The proposed work employs an encoder-decoder technique to extract more robust features from lip motion. The encoder employs a three-dimensional convolution neural network (3D-CNN) combined with a fusion of bi-directional gated recurrent units and long short-term memory (BiGRU-BiLSTM) to effectively capture spatial-temporal patterns from lip movement. The decoder integrates Multi-Head Attention (MHA) with BiGRU-BiLSTM to effectively focus on relevant features and enhance contextual understanding for more accurate text prediction. The proposed VSR system achieved a word error rate (WER) of 0.79%, demonstrating a significant reduction in error rate and outperforming existing VSR models.
Preprint
Full-text available
The relationship between muscle activity and resulting facial expressions is crucial for various fields, including psychology, medicine, and entertainment. The synchronous recording of facial mimicry and muscular activity via surface electromyography (sEMG) provides a unique window into these complex dynamics. Unfortunately, existing methods for facial analysis cannot handle electrode occlusion, rendering them ineffective. Even with occlusion-free reference images of the same person, variations in expression intensity and execution are unmatchable. Our electromyography-informed facial expression reconstruction (EIFER) approach is a novel method to restore faces under sEMG occlusion faithfully in an adversarial manner. We decouple facial geometry and visual appearance (e.g., skin texture, lighting, electrodes) by combining a 3D Morphable Model (3DMM) with neural unpaired image-to-image translation via reference recordings. Then, EIFER learns a bidirectional mapping between 3DMM expression parameters and muscle activity, establishing correspondence between the two domains. We validate the effectiveness of our approach through experiments on a dataset of synchronized sEMG recordings and facial mimicry, demonstrating faithful geometry and appearance reconstruction. Further, we synthesize expressions based on muscle activity and how observed expressions can predict dynamic muscle activity. Consequently, EIFER introduces a new paradigm for facial electromyography, which could be extended to other forms of multi-modal face recordings.
Article
Full-text available
While classical kernel-based learning algorithms are based on a single kernel, in practice it is often desirable to use multiple kernels. Lanckriet et al. (2004) considered conic combinations of kernel matrices for classification, leading to a convex quadratically constrained quadratic program. We show that it can be rewritten as a semi-infinite linear program that can be efficiently solved by recycling the standard SVM implementations. Moreover, we generalize the formulation and our method to a larger class of problems, including regression and one-class classification. Experimental results show that the proposed algorithm works for hundred thousands of examples or hundreds of kernels to be combined, and helps for automatic model selection, improving the interpretability of the learning result. In a second part we discuss general speed up mechanism for SVMs, especially when used with sparse feature maps as appear for string kernels, allowing us to train a string kernel SVM on a 10 million real-world splice data set from computational biology. We integrated multiple kernel learning in our machine learning toolbox SHOGUN for which the source code is publicly available at http://www.fml.tuebingen.mpg.de/raetsch/projects/shogun.
Article
Full-text available
We present a nonlinear version of the recursive least squares (RLS) algorithm. Our algorithm performs linear regression in a high-dimensional feature space induced by a Mercer kernel and can therefore be used to recursively construct minimum mean-squared-error solutions to nonlinear least-squares problems that are frequently encountered in signal processing applications. In order to regularize solutions and keep the complexity of the algorithm bounded, we use a sequential sparsification process that admits into the kernel representation a new input sample only if its feature space image cannot be sufficiently well approximated by combining the images of previously admitted samples. This sparsification procedure allows the algorithm to operate online, often in real time. We analyze the behavior of the algorithm, compare its scaling properties to those of support vector machines, and demonstrate its utility in solving two signal processing problems-time-series prediction and channel equalization.
Article
Full-text available
The output of a classifier should be a calibrated posterior probability to enable post-processing. Standard SVMs do not provide such probabilities. One method to create probabilities is to directly train a kernel classifier with a logit link function and a regularized maximum likelihood score. However, training with a maximum likelihood score will produce non-sparse kernel machines. Instead, we train an SVM, then train the parameters of an additional sigmoid function to map the SVM outputs into probabilities. This chapter compares classification error rate and likelihood scores for an SVM plus sigmoid versus a kernel method trained with a regularized likelihood error function. These methods are tested on three data-mining-style data sets. The SVM+sigmoid yields probabilities of comparable quality to the regularized maximum likelihood kernel method, while still retaining the sparseness of the SVM.
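As a toy illustration of this two-stage idea (train an SVM, then fit a sigmoid on its decision values), here is a scikit-learn sketch with placeholder data; scikit-learn's CalibratedClassifierCV with method="sigmoid" packages the same Platt-style calibration in one step.

```python
# Stage 1: train a sparse SVM. Stage 2: fit a sigmoid (here, a logistic
# regression on held-out decision values) to obtain calibrated probabilities.
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = np.random.randn(400, 5), np.random.randint(0, 2, 400)   # toy data
X_fit, X_cal, y_fit, y_cal = train_test_split(X, y, test_size=0.3, random_state=0)

svm = SVC(kernel="rbf", C=1.0).fit(X_fit, y_fit)                # stage 1: SVM
scores = svm.decision_function(X_cal).reshape(-1, 1)            # raw margins
sigmoid = LogisticRegression().fit(scores, y_cal)               # stage 2: fit sigmoid

# Calibrated posterior probability P(y = 1 | x) for new inputs:
probs = sigmoid.predict_proba(svm.decision_function(X_cal).reshape(-1, 1))[:, 1]
```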
Conference Paper
The ‘sparse Bayesian ’ modelling approach, as exemplified by the ‘relevance vector machine’, enables sparse classification and regression functions to be obtained by linearly-weighting a small number of fixed basis functions from a large dictionary of potential candidates. Such a model conveys a number of advantages over the related and very popular ‘support vector machine’, but the necessary ‘training ’ procedure — optimisation of the marginal likelihood function — is typically much slower. We describe a new and highly accelerated algorithm which exploits recently-elucidated properties of the marginal likelihood function to enable maximisation via a principled and efficient sequential addition and deletion of candidate basis functions. 1
Article
We present a non-linear kernel-based version of the Recursive Least Squares (RLS) algorithm. Our Kernel-RLS algorithm performs linear regression in the feature space induced by a Mercer kernel, and can therefore be used to recursively construct the minimum meansquared -error regressor. Sparsity (and therefore regularization) of the solution is achieved by an explicit greedy sparsification process that admits into the kernel representation a new input sample only if its feature space image is linearly independent of the images of previously admitted samples. Most importantly, this sparsification procedure allows the algorithm to operate online. We demonstrate the performance and scaling properties of the Kernel-RLS algorithm as compared to a state-of-the-art Support Vector Regression algorithm, on both synthetic and real data.
Article
LIBSVM is a library for support vector machines (SVM). Its goal is to help users to easily use SVM as a tool. In this document, we present all its implementation details. For the use of LIBSVM, the README file included in the package and the LIBSVM FAQ provide the information.
Conference Paper
We describe and analyze a simple and effective iterative algorithm for solving the optimization problem cast by Support Vector Machines (SVM). Our method alternates between stochastic gradient descent steps and projection steps. We prove that the number of iterations required to obtain a solution of accuracy ε is Õ(1/ε). In contrast, previous analyses of stochastic gradient descent methods require Ω(1/ε²) iterations. As in previously devised SVM solvers, the number of iterations also scales linearly with 1/λ, where λ is the regularization parameter of SVM. For a linear kernel, the total run-time of our method is Õ(d/(λε)), where d is a bound on the number of non-zero features in each example. Since the run-time does not depend directly on the size of the training set, the resulting algorithm is especially suited for learning from large datasets. Our approach can seamlessly be adapted to employ non-linear kernels while working solely on the primal objective function. We demonstrate the efficiency and applicability of our approach by conducting experiments on large text classification problems, comparing our solver to existing state-of-the-art SVM solvers. For example, it takes less than 5 seconds for our solver to converge when solving a text classification problem from Reuters Corpus Volume 1 (RCV1) with 800,000 training examples.
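A toy NumPy sketch of a Pegasos-style solver as summarized above, with stochastic subgradient steps of size 1/(λt) and an optional projection onto the ball of radius 1/√λ; the data, λ, and iteration count are placeholders.

```python
# Pegasos-style primal SVM solver (basic variant): one random example per
# step, hinge-loss subgradient update, then projection onto the feasible ball.
import numpy as np

def pegasos(X, y, lam=0.1, iterations=10000, seed=0):
    # X: (n, d) features; y: labels in {-1, +1}.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, iterations + 1):
        i = rng.integers(n)                       # pick one random example
        eta = 1.0 / (lam * t)                     # step size schedule
        margin = y[i] * (w @ X[i])
        w *= (1.0 - eta * lam)                    # gradient of the regularizer
        if margin < 1.0:                          # hinge-loss subgradient step
            w += eta * y[i] * X[i]
        norm = np.linalg.norm(w)
        if norm > 0:                              # optional projection step
            w *= min(1.0, (1.0 / np.sqrt(lam)) / norm)
    return w
```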
Conference Paper
Trained support vector machines (SVMs) have a slow run-time classification speed if the classification problem is noisy and the sample data set is large. Approximating the SVM by a more sparse function has been proposed to solve to this problem. In this study, different variants of approximation algorithms are empirically compared. It is shown that gradient descent using the improved Rprop algorithm increases the robustness of the method compared to fixed-point iteration. Three different heuristics for selecting the support vectors to be used in the construction of the sparse approximation are proposed. It turns out that none is superior to random selection. The effect of a finishing gradient descent on all parameters of the sparse approximation is studied.
Article
Support Vector Machines (SVMs) for regression problems are trained by solving a quadratic optimization problem which needs on the order of l² memory and time resources to solve, where l is the number of training examples. In this paper, we propose a decomposition algorithm, SVMTorch, which is similar to SVM-Light proposed by Joachims (1999) for classification problems, but adapted to regression problems. With this algorithm, one can now efficiently solve large-scale regression problems (more than 20000 examples). Comparisons with Nodelib, another publicly available SVM algorithm for large-scale regression problems from Flake and Lawrence (2000), yielded significant time improvements. Finally, based on a recent paper from Lin (2000), we show that a convergence proof exists for our algorithm.
1. Introduction
Vapnik (1995) has proposed a method to solve regression problems using support vector machines. It has yielded excellent performance on many regression and time ser...
Christian Igel, Tobias Glasmachers, and Verena Heidrich-Meisner. Shark. Journal of Machine Learning Research, 9:993–996, 2008.
Todd Veldhuizen and Kumaraswamy Ponnambalam. Linear algebra with C++ template metaprograms. Dr. Dobb's Journal of Software Tools.