
Machine learning for computer vision: a case study in man machine interaction

Authors: Boris Lenseigne, Julia Cohen, Adrien Dorise, Edouard Villain

Abstract

This communication will address these questions in the practical case of man machine interaction. A consumer-grade camera is used to perform eye and face tracking, with the intent of using this information to drive the computer’s mouse in the most intuitive way possible. In this scenario, we will address the main issues of building AI-based vision systems: the choice of network topology, acquisition of the learning dataset, pre-processing and labelling of the data, and learning and evaluation of the model. Along the way, we will comment on the pitfalls we encountered and the strategies we used to overcome them. Finally, we compare this approach to the more traditional one, with the goal of providing insights into the pros and cons of the ubiquitous use of machine learning when building a computer vision system.
Organized by Penta projects:
2020005 Mantis Vision
2021004 Imagination
Machine Learning for Computer Vision, a
Case Study in Man Machine Interaction
Workshop ePicture this
TU Delft,
Dr. Ing. Boris Lenseigne
blenseigne@lrtechnologies.fr
Dr. Ing. Julia Cohen
jcohen@lrtechnologies.fr
Dr. Ing. Adrien Dorise
adorise@lrtechnologies.fr
Dr. Ing. Edouard Villain
evillain@lrtechnologies.fr
June 21st 2023
Summary
Introduction
Collecting data
Experiments
Results
Challenges
Introduction
Human Machine Interaction and assistive technologies
Enable Paraplegic users to use a computer
Consumer-grade hardware
Focus on vision-based interaction: face and gaze tracking
Control the mouse with face and eyes movements
Challenge: can we predict where the user is looking on the screen?
Using the video stream from a webcam only (no light, no IR).
A difficult problem with “traditional” image processing
Can an AI/Machine learning approach solve the problem?
How should we do it?
Nicole M. Bakker, Boris Lenseigne, Sander Schutte, Elsbeth B. M. Geukers, Pieter P. Jonker, Frans C. T. van der Helm, Huib J. Simonsz: Accurate Gaze Direction Measurements With Free Head Movement for Strabismus Angle Estimation. IEEE Trans. Biomed. Eng. 60(11): 3028-3035 (2013)
The computer must learn how it is used, rather than the user learning how to use the computer
Introduction
Traditional approach to gaze tracking
Camera calibration
Eyes and face detection
3D pose estimation
Eyes optical axis estimation
...
AI approach to gaze tracking
Gather data
Choose algorithm
Choose meta-parameters
Perform learning
Let the magic happen
Expected benefits of AI
Task easily solved by humans
Unknown visual cues, difficult to model
AI finds regularities in the learning data
What is Artificial Intelligence?
Artificial Intelligence (AI)
Techniques that enable a machine to reproduce traits of
human intelligence.
Machine Learning (ML)
The set of techniques from AI that enable a machine to
produce a result without explicit programming.
Deep Learning (DL)
The set of techniques from ML in which the model is an artificial
neural network.
Data for Machine learning
1) Learning data set
2) Validation data set (during learning)
3) Testing data set
Diagram: nested sets, DL inside ML inside AI
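The three data sets listed above have to be kept independent of each other. One possible way to do this for video data, sketched below, is to assign whole videos to each set so that near-identical consecutive frames never end up on both sides of a split. This is an illustration only: the function name, the fractions and the video identifiers are assumptions, not the protocol used by the authors.

```python
import random


def split_by_video(video_ids, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Assign whole videos to train/validation/test (fractions are illustrative)."""
    ids = sorted(set(video_ids))
    random.Random(seed).shuffle(ids)
    n_test = max(1, round(len(ids) * test_fraction))
    n_val = max(1, round(len(ids) * val_fraction))
    test = set(ids[:n_test])
    val = set(ids[n_test:n_test + n_val])
    train = set(ids[n_test + n_val:])
    return train, val, test


# Hypothetical video identifiers, used only to exercise the function.
videos = [f"video_{i:02d}" for i in range(10)]
train_ids, val_ids, test_ids = split_by_video(videos)
print(len(train_ids), len(val_ids), len(test_ids))
```

The validation videos are used during learning (for example to pick hyperparameters), while the test videos are only touched for the final evaluation.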
State of the art
Work on AI and face
Face tracking
A. Rabhi, A. Sadiq and A. Mouloudi, "Face tracking: State of the art," 2015
Facial expression
Song Zhenjie, "Facial Expression Emotion Recognition Model Integrating Philosophy and Machine Learning Theory", 2021
Smyle mouse
Commercial mouse control via head & gesture software, monthly subscription
Windows OS-only
US patents
Why develop a new project on AI and face ?
Open-source software and consumer-grade hardware
All HMI methods in a unique application
Precise/micro-movements: not solved
Collecting data
Inputs
Video stream from the webcam
Facial key features extracted by a face
mesh detector
Model
Targets
Cursor position on the screen
How to split between train/validation/test?
How to synchronize inputs and outputs?
How to ensure clean data points?
What type of data can we acquire/produce?
What scenario to follow?
How many samples are required?
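The synchronization question above can be illustrated with a small sketch. It assumes that both the webcam frames and the cursor positions are logged with timestamps from the same clock, and pairs each frame with the nearest cursor sample, dropping frames that have no sample close enough. The function name, the data layout and the `max_gap` threshold are hypothetical choices made for the example.

```python
from bisect import bisect_left


def pair_frames_with_cursor(frame_times, cursor_samples, max_gap=0.05):
    """Pair each frame time with the nearest cursor sample.

    frame_times: sorted list of timestamps in seconds.
    cursor_samples: list of (timestamp, x, y), sorted by timestamp.
    max_gap: frames with no cursor sample within this many seconds are dropped.
    """
    cursor_times = [t for t, _, _ in cursor_samples]
    pairs = []
    for ft in frame_times:
        i = bisect_left(cursor_times, ft)
        # Only the samples just before and just after the frame can be nearest.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(cursor_times)]
        if not candidates:
            continue
        best = min(candidates, key=lambda j: abs(cursor_times[j] - ft))
        if abs(cursor_times[best] - ft) <= max_gap:
            _, x, y = cursor_samples[best]
            pairs.append((ft, x, y))
    return pairs


frames = [0.00, 0.04, 0.08, 0.50]
cursor = [(0.01, 100, 200), (0.05, 110, 205), (0.09, 120, 210)]
pairs = pair_frames_with_cursor(frames, cursor)
print(pairs)
```

In this toy run the last frame is discarded because the closest cursor sample is more than `max_gap` seconds away, which is one simple way to keep the data points clean.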
Facetracker
Use of an existing ML-based face tracker (Mediapipe FaceMesh detector)
Possibility to select the number of (x,y,z) points to use
Already a difficult problem with traditional image processing
Figure: face mesh with 478 points, 69 points and 13 points
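Selecting a subset of (x,y,z) points and turning it into a model input can be sketched as follows. The landmark values are synthetic and the index subset is hypothetical, since the slides do not state which points were kept. With 69 points the flattened vector has 69 x 3 = 207 values, which would be consistent with the 1x207 input layer reported for the fully connected network.

```python
def landmarks_to_features(landmarks, selected_indices):
    """Flatten the selected (x, y, z) landmarks into one feature vector."""
    features = []
    for idx in selected_indices:
        x, y, z = landmarks[idx]
        features.extend((x, y, z))
    return features


# Synthetic stand-in for one frame of face-mesh output: 478 (x, y, z) points.
frame_landmarks = [(i / 478, i / 956, 0.0) for i in range(478)]

# Hypothetical subset: which 69 (or 13) indices the authors kept is not stated.
subset_69 = list(range(69))
features = landmarks_to_features(frame_landmarks, subset_69)
print(len(features))  # 207
```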
Kartynnik, Y., Ablavatski, A., Grishchenko, I., & Grundmann, M. Real-time facial surface geometry from monocular video on mobile GPUs. 2019
Experiments
Input = facial points
Machine Learning
Tweedie, Bayesian-Regressor, SGD, SVM, K-NN, Decision-Tree, Random-Forest, AdaBoost, GBoost
Fully Connected Neural Networks
Input = image
Convolutional neural networks
Input = image sequence
LSTM neural networks
ConvLSTM neural networks
How to select a model, optimization method and hyperparameters?
How to define the sequence's length?
Results: baseline
Direct mapping between the face and mouse
No artificial intelligence
Model: face tracker using the point between the eyes as mouse controller
No calibration
Direct mapping between the face movements and the mouse movements on the screen
Mean Absolute Error on test set: 19.5% (of the size of the screen)
with large variations between the videos
with large error at the beginning of every new video, reducing with time (as the system auto-calibrates)
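The baseline above can be illustrated with a short sketch. The slides do not describe how the direct mapping or the auto-calibration were implemented, so `RunningRangeMapper` is a hypothetical stand-in: it rescales the tracked point by the range of positions seen so far, which gives poor outputs at the start of a video and better ones as the range fills in. `mae_percent` shows one plausible reading of "Mean Absolute Error in % of the size of the screen", where each axis is normalized by the corresponding screen dimension and the two axes are averaged; that averaging is also an assumption.

```python
def mae_percent(predictions, targets, screen_size):
    """Mean absolute error as a percentage of the screen size, averaged over both axes."""
    width, height = screen_size
    total = 0.0
    for (px, py), (tx, ty) in zip(predictions, targets):
        total += abs(px - tx) / width + abs(py - ty) / height
    return 100.0 * total / (2 * len(targets))


class RunningRangeMapper:
    """Map a tracked point to screen pixels using the range of positions seen so far."""

    def __init__(self, screen_size):
        self.screen_size = screen_size
        self.lo = [None, None]
        self.hi = [None, None]

    def update(self, point):
        out = []
        for axis, value in enumerate(point):
            if self.lo[axis] is None:
                self.lo[axis] = self.hi[axis] = value
            self.lo[axis] = min(self.lo[axis], value)
            self.hi[axis] = max(self.hi[axis], value)
            span = self.hi[axis] - self.lo[axis]
            # With no observed range yet, fall back to the centre of the screen.
            ratio = 0.5 if span == 0 else (value - self.lo[axis]) / span
            out.append(ratio * self.screen_size[axis])
        return tuple(out)


mapper = RunningRangeMapper((1920, 1080))
first = mapper.update((0.40, 0.50))   # no range known yet
mapper.update((0.60, 0.70))
later = mapper.update((0.50, 0.60))   # midway through the observed range
error = mae_percent([(960, 540)], [(1152, 540)], (1920, 1080))
print(first, later, error)
```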
Results: Machine Learning
Grid search performed on various methods to select their hyperparameters
Best performance was given by Random Forest and AdaBoost
Prediction close to ground truth with one video
Face mesh features are relevant to predict cursor position
No noticeable difference when using a different number of features
Precision drops when trying to generalise on multiple videos
Breiman, L. Random Forests. Machine Learning 45, 5-32 (2001)
Figure: target vs. prediction
Training and testing on same video: Error 2%. Overfit: the information is there!
Training on 18 videos: Error 18%
How to get the most important features?
Multiple output not common in ML
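The grid search strategy mentioned on this slide can be sketched in a few lines. The slides report Random Forest and AdaBoost as the best performers; the example below swaps in a small k-NN regressor (also in the list of methods tried) only because it fits in plain Python, and it predicts both cursor coordinates at once to show one way of handling a two-valued output. The toy data, the parameter grid and all function names are assumptions made for illustration.

```python
from itertools import product


def knn_predict(train_x, train_y, query, k, weighted):
    """Predict (x, y) as the (optionally distance-weighted) mean of the k nearest samples."""
    nearest = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)) ** 0.5, y)
        for x, y in zip(train_x, train_y)
    )[:k]
    weights = [1.0 / (d + 1e-9) if weighted else 1.0 for d, _ in nearest]
    total = sum(weights)
    return tuple(
        sum(w * y[axis] for w, (_, y) in zip(weights, nearest)) / total
        for axis in range(2)
    )


def mean_abs_error(preds, targets):
    """Mean absolute error averaged over samples and over the two output axes."""
    return sum(abs(p[a] - t[a]) for p, t in zip(preds, targets) for a in range(2)) / (2 * len(targets))


def grid_search(train_x, train_y, val_x, val_y, grid):
    """Try every combination in `grid` and keep the one with the lowest validation error."""
    best_params, best_err = None, float("inf")
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        preds = [knn_predict(train_x, train_y, q, **params) for q in val_x]
        err = mean_abs_error(preds, val_y)
        if err < best_err:
            best_params, best_err = params, err
    return best_params, best_err


# Toy data: one feature, target (x, 1 - x); even samples train, odd samples validate.
xs = [i / 20 for i in range(21)]
train_x = [[x] for x in xs[0::2]]
train_y = [(x, 1 - x) for x in xs[0::2]]
val_x = [[x] for x in xs[1::2]]
val_y = [(x, 1 - x) for x in xs[1::2]]

best_params, best_err = grid_search(
    train_x, train_y, val_x, val_y, {"k": [1, 2, 5], "weighted": [False, True]}
)
print(best_params, best_err)
```

The selection is made on validation data only, so a separate test set is still needed to report the final error.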
Fully connected model and training from scratch with minimum amount of data
For one user, 3'30'' of data is enough to overfit FCN model but not enough to predict new data
Results: Fully Connected Network (FCN)
Testing set: same user (2'30'') -> Error 13.8%
(binomial distribution)
Training set: one video (3'30'') -> Error 4%
Input Layer: 1x207
Hidden layers: 1x512 + 1x128
Output Layer: 1x2
Bigger network does not mean better results
Beware of the constant prediction local minimum
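Two quick checks can accompany the network described above. The first counts the trainable parameters of the 207-512-128-2 layout, assuming plain dense layers with biases (the slides do not say whether biases were used). The second is a hypothetical guard against the constant prediction local minimum: it compares a model's error against the best constant predictor and flags outputs that barely move. All function names and the tolerance are illustrative.

```python
from statistics import median


def dense_parameter_count(layer_sizes, use_bias=True):
    """Count the weights (and optionally biases) of a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + (n_out if use_bias else 0)
    return total


def constant_baseline_mae(targets):
    """MAE of the best constant predictor (the per-axis median of the targets)."""
    n_axes = len(targets[0])
    medians = [median(t[a] for t in targets) for a in range(n_axes)]
    return sum(abs(t[a] - medians[a]) for t in targets for a in range(n_axes)) / (n_axes * len(targets))


def looks_constant(predictions, tolerance=1e-3):
    """True when every prediction stays within `tolerance` of the first one."""
    first = predictions[0]
    return all(abs(p[a] - first[a]) <= tolerance for p in predictions for a in range(len(first)))


print(dense_parameter_count([207, 512, 128, 2]))  # 172418
```

A model whose error is no better than `constant_baseline_mae` on the same targets, or whose outputs satisfy `looks_constant`, has probably collapsed to a single output point rather than learned the mapping.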
Results: Deep Learning CNN (ResNet-like)
Convolutional architecture using the first layers of ResNet18
Weights initialization with pretrained ResNet18 on ImageNet
Training set: 19 videos of 3 different users (30428 frames)
Testing set: 12 videos of the same 3 users (32423 frames)
Mean Absolute Error on test set:
Input is the full image (224x224px): 13.4%
Input is the cropped face (224x224px): 11.8%
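The cropped-face input can be illustrated with a small geometric sketch. The slides do not say how the crop was computed, so the approach below is an assumption: take a square box around the detected face landmarks, enlarge it by a margin, and clamp it to the image. Resizing the crop to 224x224 pixels would then be done with an image library and is left out here.

```python
def square_face_crop(landmarks_px, image_size, margin=0.2):
    """Return (left, top, right, bottom) of a square box around the landmarks, clamped to the image."""
    width, height = image_size
    xs = [x for x, _ in landmarks_px]
    ys = [y for _, y in landmarks_px]
    # Square side: largest landmark extent plus a relative margin, capped by the image.
    side = max(max(xs) - min(xs), max(ys) - min(ys)) * (1 + margin)
    side = min(side, width, height)
    cx = (min(xs) + max(xs)) / 2
    cy = (min(ys) + max(ys)) / 2
    left = min(max(cx - side / 2, 0), width - side)
    top = min(max(cy - side / 2, 0), height - side)
    return round(left), round(top), round(left + side), round(top + side)


# Hypothetical landmark positions in pixels for a 640x480 frame.
box = square_face_crop([(300, 200), (400, 210), (350, 330)], (640, 480))
print(box)
```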
Kaiming He and Xiangyu Zhang and Shaoqing Ren and Jian Sun, "Deep Residual Learning for Image Recognition", 2015
Pre-training, fine-tuning or training from scratch?
Results: Deep Learning CNN (VGG-like)
For one user, 3'30'' of data is enough to fit the CNN model but not enough to predict new data
Training set : one video (3'30'')
Error 0.8% (Mean Absolute Error)
Testing set : same user (2'30'')
Error 11.5% (Mean Absolute Error)
Creating VGG-like model and training from scratch with minimum amount of data
Sequential model designed with 4 blocks of 2 convolutions and max pooling layers followed by 4 linear layers
Input: 224x224 images with 3 channels. Output: position (x, y) on the screen
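The size of the feature map handed to the linear layers follows from the block structure described above. The sketch below does the arithmetic under stated assumptions: padded 3x3 convolutions that keep the spatial size, 2x2 max pooling with stride 2, and hypothetical channel widths (32, 64, 128, 256), none of which are given in the slides.

```python
def vgg_like_feature_shape(input_size=224, channels=(32, 64, 128, 256)):
    """Track (channels, height, width) through blocks of two 3x3 convolutions and one 2x2 max pool."""
    size = input_size
    shapes = []
    for out_channels in channels:
        # Two padded 3x3 convolutions keep the spatial size; the pool halves it.
        size //= 2
        shapes.append((out_channels, size, size))
    return shapes


shapes = vgg_like_feature_shape()
c, h, w = shapes[-1]
flattened = c * h * w  # input width of the first linear layer
print(shapes, flattened)
```

With these assumptions a 224x224 input shrinks to 14x14 after four blocks, so the first linear layer dominates the parameter count, which is one reason a small dataset is easy to fit and hard to generalise from.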
Karen Simonyan and Andrew Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition", 2015
Define minimum amount of data
Multiuser scenario
Results: Temporal models
Long Short-Term Memory (LSTM) networks are used as temporal models
The input is now a time sequence
Face mesh detector used as features
Combination of CNN and LSTM for image features is a work in progress
First results do not show a sign of improvement compared to previous
models
Training error = 5.2%
Test error = 18.4%
Data collected were not designed for temporal models!
Work in progress
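Feeding a temporal model requires turning per-frame samples into fixed-length sequences. A minimal sketch, assuming each video is stored as an ordered list of (features, cursor target) pairs, is to slide a window over each video separately so that no sequence mixes frames from two recordings. The data layout, the function name and the choice of predicting the target of the last frame are assumptions.

```python
def make_sequences(frames_by_video, seq_len):
    """Build (window of features, target of last frame) pairs without crossing video boundaries."""
    sequences = []
    for video_id, frames in frames_by_video.items():
        for end in range(seq_len, len(frames) + 1):
            window = frames[end - seq_len:end]
            features = [f for f, _ in window]
            target = window[-1][1]
            sequences.append((video_id, features, target))
    return sequences


# Toy data: each frame is ([feature], (cursor_x, cursor_y)).
data = {
    "video_a": [([0.1 * i], (i, i)) for i in range(5)],
    "video_b": [([0.2 * i], (i, 2 * i)) for i in range(3)],
}
seqs = make_sequences(data, seq_len=3)
print(len(seqs))  # 4
```

A video of N frames yields N - seq_len + 1 sequences, so longer sequences reduce the number of training examples, which is one aspect of the data requirement noted on this slide.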
Hochreiter Sepp and Schmidhuber Jürgen, "Long Short-Term Memory", 1997
Training a temporal model is complex and requires more data
Thousands of epochs needed
Results: Comparison
All models are able to track the cursor on the training set
All models have difficulty generalising
CNN outperforms other models
Models based on the face mesh detector can find relevant features for cursor prediction
We believe that this method can help generalise across users
Challenges
Dataset-related
How do we select the framerate, image resolution?
How do we normalize the acquired data?
Do we consider single samples or sequences?
Is our dataset balanced enough? Diverse?
Do we have enough data?
Can we model the relationship between images and mouse?
Does it depend on the configuration (relative position of user vs. screen)?
Model-related
How to select a model/architecture?
How to generalise across setup/user?
What learning rate to adopt to avoid local minimum and constant output?
Current work
InceptionNet models
Non-sequential Convolutional Neural Network
Better generalization than VGG-like models
Same benchmark :
One user, 3'30'' for training and 2'30'' for testing
Combine 2 models
One CNN predicts (x, y) per frame, then one FCN smooths the last 5 predictions
Model          Train loss   Test loss
VGG            ~1%          ~11%
InceptionNet   ~2%          ~8%

Model          Train loss   Test loss
CNN only       ~2%          ~11%
CNN + FCN      ~0.5%        ~1.5%
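The two-model combination on this slide keeps a buffer of the last 5 per-frame predictions and passes it to a second model. The sketch below shows only the buffering logic. The learned FCN is replaced by a plain average, which is a stand-in chosen for illustration and not the authors' smoother; the class name is hypothetical.

```python
from collections import deque


class LastNSmoother:
    """Keep the last `n` per-frame predictions and return a smoothed (x, y)."""

    def __init__(self, n=5):
        self.buffer = deque(maxlen=n)

    def update(self, prediction):
        self.buffer.append(prediction)
        count = len(self.buffer)
        # Stand-in for the learned FCN: a plain average of the buffered predictions.
        return (
            sum(p[0] for p in self.buffer) / count,
            sum(p[1] for p in self.buffer) / count,
        )


smoother = LastNSmoother(n=5)
outputs = [smoother.update((float(i), 10.0)) for i in range(7)]
print(outputs[-1])
```

In a learned version, the 5 buffered (x, y) pairs would form a 10-value input to the FCN, which can then weight recent frames differently instead of averaging them uniformly.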
Current work
Image Super Resolution
Increase resolution of input data
Map with clickable region of interest
Correct outputs with this map
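Correcting outputs with a map of clickable regions could look like the sketch below. It assumes the map is available as a list of screen rectangles and snaps a predicted point to the centre of the nearest rectangle when it is close enough. The slides do not describe the actual correction rule, so the function name, the rectangle format and the `max_distance` threshold are all hypothetical.

```python
def snap_to_clickable(point, regions, max_distance=50):
    """Move `point` to the centre of the nearest clickable rectangle if it is close enough.

    regions: list of (left, top, right, bottom) rectangles in screen pixels.
    """
    px, py = point
    best, best_dist = None, float("inf")
    for left, top, right, bottom in regions:
        # Distance from the point to the rectangle (0 when the point is inside it).
        dx = max(left - px, 0, px - right)
        dy = max(top - py, 0, py - bottom)
        dist = (dx * dx + dy * dy) ** 0.5
        if dist < best_dist:
            best, best_dist = (left, top, right, bottom), dist
    if best is None or best_dist > max_distance:
        return point
    return ((best[0] + best[2]) / 2, (best[1] + best[3]) / 2)


# Hypothetical clickable regions, e.g. two buttons.
buttons = [(100, 100, 200, 140), (400, 300, 480, 340)]
near = snap_to_clickable((210, 150), buttons)
far = snap_to_clickable((300, 600), buttons)
print(near, far)
```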
Multimodal models
Multiuser scenarios
Transfer learning
Calibration
Distraction detection
Exclude frames where the user is not looking at the screen during training
Input data more accurate
Resources management
Disable mouse position prediction when user is not looking at the screen
Discussion
Data type
Points extracted with the face tracker do not seem sufficient to predict where the user is looking
Difficult to fit a model even on the training set
Using images provides our best results
Whole images consume more memory
Need to use more complex methods (deep learning approach)
Does cropping around the user's eyes reduce memory consumption without degrading performance?
Data acquisition
Acquire more data: record colleagues during their work time (work in progress)
Determine when the user is looking at the mouse
User can move the mouse without looking
User is sometimes looking after the mouse
User is sometimes looking before the mouse
Use cases
Laboratory test cases: can we move the mouse without using it?
Not there yet!
Real cases: can a paraplegic patient move the mouse?