
GIUPlayer: A Gaze Immersive YouTube Player Enabling Eye Control and Attention Analysis



Ramin Hedeshy
University of Koblenz, Germany
Chandan Kumar
University of Koblenz, Germany
Raphael Menges
University of Koblenz, Germany
Steen Staab
University of Stuttgart, Germany
We developed a gaze immersive YouTube player, called GIUPlayer,
with two objectives: first, to enable eye-controlled interaction with
video content in support of people with motor disabilities; second,
to enable quantifying attention while users view video content,
which can be used to estimate natural viewing behaviour. In this
paper, we illustrate the functionality and design of GIUPlayer and
the visualization of video viewing patterns. In the long term, this
work could lead to eye control and attention-based recommendations
in online video platforms and smart TV applications that record eye
tracking data.
CCS Concepts: Human-centered computing → Interactive systems and tools
Keywords: Video player, eye tracking, Web accessibility
ACM Reference Format:
Ramin Hedeshy, Chandan Kumar, Raphael Menges, and Steffen Staab. 2020.
GIUPlayer: A Gaze Immersive YouTube Player Enabling Eye Control and
Attention Analysis. In Symposium on Eye Tracking Research and Applications
(ETRA ’20 Adjunct), June 2–5, 2020, Stuttgart, Germany. ACM, New York,
NY, USA, 3 pages.
Watching entertainment programs and movies is one of the most
common and time-consuming activities of our everyday life. In
recent years, YouTube and other online services have gained immense
popularity and preference over conventional TV broadcasts. As per
2019 statistics, over 500 hours of video are uploaded to YouTube
every minute, and more than 5 billion videos are watched on YouTube
every single day [Statista, Inc. 2020]. Given the significance of these
online platforms in our modern digital world, it is imperative that
they are accessible to all people. However, several individuals
with motor impairments, who have lost the ability to operate mouse
and keyboard, are not able to interact with computing devices that
provide digital content in the way standard interfaces require. More
specifically, this excludes those individuals from entertainment and
educational content on online video platforms that could enrich
their social environment and general quality of life.
Author note: Steffen Staab is also with the University of Southampton, UK.
In this regard, eye tracking as a novel interaction technique
enables hands-free interaction using eye gaze as input control to
support people with severe motor disabilities [Jacob and Stellmach
2016; Kumar et al. 2016]. To access online video platforms, the
generic method for eye gaze control at the operating system level
(emulation approaches [Lankford 2000; Sweetland 2016]) could be
employed. However, emulation approaches do not have access to the
underlying data structures and interaction possibilities of most
applications or websites [Menges et al. 2019]. Thus, a more
appropriate solution is to design a special graphical user interface
adapted for eye gaze input. There have been several approaches to
adapt native applications for eye gaze, e.g., eye-controlled gaming
[Isokoski et al. 2009; Menges et al. 2017b], drawing [van der Kamp
and Sundstedt 2011], and browsing [Menges et al. 2017a]. More
specifically, in the context of video interaction, control over the
watched content, including pausing, resuming, forwarding, and
adjusting the volume, is crucial for the comfort and pleasure of the
user. In this work, we demonstrate GIUPlayer, which retrieves
YouTube data and adapts interface elements for eye gaze control to
provide intuitive video interaction functionality.
Besides providing input control, another major application of eye
tracking has been in the field of attention analysis, i.e., to understand
user behavior. In the context of video interaction, eye gaze attention
can provide an implicit understanding of user interest while users are
watching videos. This implicit feedback has been investigated in
different analytical scenarios, such as to judge video quality
[Sawahata et al. 2008], to summarize and compress videos [Gitman et al.
2014; Katti et al. 2011], to characterize users [Tanisaro et al. 2015], or
to assess the relevance of advertisement banners [Tangmanee 2016].
Therefore, to support such analysis, GIUPlayer records and
visualizes the gaze data while a video is being played.
Figure 1: Areas in the GIUPlayer watch page interface
GIUPlayer's content layout is inspired by the YouTube website, since
its functionality is well known and thus intuitive for end users.
However, several interface adaptations follow user-centered design
principles for eye interaction [Kumar et al. 2016; Menges et al.
2019]. Interface properties such as size, shape, appearance, and
feedback are customized, since this is crucial for enhancing eye
tracking accuracy for input control. For example, buttons are
enlarged, and their placement is based on element usage frequency
and application control, so that natural viewing behavior does not
interfere. Several alternatives for eye gaze input, such as blinking,
gestures, and pupil dilation [Jacob and Stellmach 2016], were
explored; however, dwell-time-based selection was found most
suitable, as it is the most natural and widely accepted mode of input.
Visual feedback is embedded to guide the selection process.
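To make the dwell-time mechanism concrete, the following is a minimal sketch of how a selection could be triggered once gaze rests on a button long enough. It is written in Python for illustration only and is not GIUPlayer's actual code. The names `Button` and `DwellSelector`, the rectangular hit test, and the 1000 ms default threshold are all assumptions, since the paper does not state a dwell duration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Button:
    """Hypothetical rectangular interface element, in screen pixels."""
    name: str
    x: float
    y: float
    width: float
    height: float

    def contains(self, gx: float, gy: float) -> bool:
        return (self.x <= gx <= self.x + self.width
                and self.y <= gy <= self.y + self.height)


class DwellSelector:
    """Reports a selection once gaze has rested on one button long enough."""

    def __init__(self, buttons: List[Button], dwell_ms: float = 1000.0):
        self.buttons = buttons
        self.dwell_ms = dwell_ms  # assumed threshold, not taken from the paper
        self._target: Optional[Button] = None
        self._entered_at = 0.0
        self._fired = False

    def progress(self, t_ms: float) -> float:
        """Fraction of the dwell completed; could drive visual feedback."""
        if self._target is None:
            return 0.0
        return min(1.0, (t_ms - self._entered_at) / self.dwell_ms)

    def update(self, gx: float, gy: float, t_ms: float) -> Optional[str]:
        """Feed one gaze sample; returns a button name when it is selected."""
        hit = next((b for b in self.buttons if b.contains(gx, gy)), None)
        if hit is not self._target:
            # Gaze moved to another button or to empty space: restart timer.
            self._target = hit
            self._entered_at = t_ms
            self._fired = False
            return None
        if (hit is not None and not self._fired
                and t_ms - self._entered_at >= self.dwell_ms):
            self._fired = True  # select only once per continuous dwell
            return hit.name
        return None
```

In this sketch, `progress` returns a value between 0 and 1 that an interface could map to a filling ring or bar. Resetting the timer whenever gaze leaves the element is one simple way to reduce unintentional selections during natural viewing.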
GIUPlayer consists of three content-related pages: the watch page,
a home page, and a search result page.
Figure 1 shows the watch page of GIUPlayer. This page appears
whenever a user selects a video, and the video can then be played.
The video display takes up the largest part of the page on the
right side, with controls below (marked as footer in Figure 1) and
suggested videos on the left side (marked as sidebar). The
suggestions are obtained using the YouTube API. Dwelling on one of
these suggestions (on the image thumbnail part) changes the played
video. The text description part is not clickable, so the user can
comfortably read without unintentional selections. Looking at the
arrow buttons above and below the suggestions changes the set of
proposed videos. The footer area provides the video controls. If
the user activates fullscreen mode, the video is maximized and the
footer remains visible by default to provide control over the video.
In fullscreen mode, however, a small gray rectangle with an arrow
inside appears on the right side over the footer. By looking at
this rectangle, a user can hide the bottom navigation bar to get a
complete fullscreen view without distraction from the control
buttons.
The home page includes a tiled list of videos, with each row
representing a different category such as top rated, last watched,
and recommended videos. As with the suggestions on the watch
page, the tiled lists can be scrolled with corresponding buttons. The
search result page contains the list of videos matching the search
query. Additionally, the user settings and profile information can
be accessed from the button at the top right of the page.
The entire interface was developed as a Web application using
JavaScript, CSS, Ajax, and Bootstrap. The YouTube API [Google
LLC 2020] is used to access video information and other metadata.
A MySQL database is used to store user credentials and the gaze
data stream of fixations on the videos. As we implemented
GIUPlayer as a Web application, an eye-controlled Web browser like
GazeTheWeb [Menges et al. 2017a] can be used to display it. The
integrated dwell-time-based keyboard of GazeTheWeb is then used
to enter search queries. Moreover, the system works with a variety
of eye tracking peripherals from SMI, Tobii, and Visual Interaction.
Figure 2: Displaying fixations of five users on a video
Whenever a user plays a video, the recording of gaze data
starts automatically. The gaze data is processed on the fly to
generate a list of fixation points. After each video, these fixation
points are stored in a database in relation to the user ID and the
video ID.
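The paper does not name the algorithm that turns raw gaze samples into fixation points. As an illustration only, the sketch below groups consecutive samples with a simple dispersion threshold and keeps groups lasting at least 200 ms, the minimum fixation duration stated in this paper. The function name `detect_fixations`, the `(x, y, timestamp)` sample layout, and the 50 px dispersion limit are assumptions, and Python stands in for the JavaScript implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed sample layout: (x in px, y in px, timestamp in ms)
GazeSample = Tuple[float, float, float]


@dataclass
class Fixation:
    x: float
    y: float
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def _dispersion(window: List[GazeSample]) -> float:
    """Horizontal plus vertical extent of the samples in the window."""
    xs = [s[0] for s in window]
    ys = [s[1] for s in window]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def _flush(window: List[GazeSample], out: List[Fixation],
           min_duration_ms: float) -> None:
    """Store the window as a fixation if it lasted long enough."""
    if len(window) >= 2 and window[-1][2] - window[0][2] >= min_duration_ms:
        cx = sum(s[0] for s in window) / len(window)
        cy = sum(s[1] for s in window) / len(window)
        out.append(Fixation(cx, cy, window[0][2], window[-1][2]))


def detect_fixations(samples: List[GazeSample],
                     max_dispersion_px: float = 50.0,
                     min_duration_ms: float = 200.0) -> List[Fixation]:
    """Group time-ordered gaze samples into fixations."""
    fixations: List[Fixation] = []
    window: List[GazeSample] = []
    for sample in samples:
        window.append(sample)
        if _dispersion(window) > max_dispersion_px:
            # The newest sample left the cluster: close the previous group
            # and start a new one from the newest sample.
            _flush(window[:-1], fixations, min_duration_ms)
            window = [sample]
    _flush(window, fixations, min_duration_ms)
    return fixations
```

Because samples are consumed one at a time, the same loop could run while the video plays. Each resulting `Fixation` carries the position and the start and end timestamps that would be stored per user and video.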
The analysis mode allows the visual comparison of fixation
points. A user can activate the analysis mode in order to visualize
the individual fixation points. To analyze and compare viewing
behaviour, the user can choose to display the data of the last five
persons with their different fixation points for each video. A unique
color represents the fixations of each user. See Figure 2 for a
screenshot of the analysis mode. The visualization is realized by
multiple canvases on top of the video. For each user, a new layer
with a canvas is added. The fixations are visualized as points
around the central point of the fixation with a 50 px radius.
Fixations were defined with a minimum duration of 200 ms [Blascheck
et al. 2014; Nyström and Holmqvist 2010]. A fixation has a spatial
position (x- and y-coordinate) and a start and end time in
milliseconds. With these timestamps, the duration of the fixation is
calculated and displayed accordingly. We have also implemented a
function to calculate the similarity between the gaze data streams
of different users on a particular video, i.e., to showcase the most
similar users based on viewing pattern. The similarity function
incorporates common fixations (temporal and spatial overlap), the
common number of fixations, and the common duration of fixations.
The proposed GIUPlayer provides intuitive hands-free interaction
with the YouTube platform. We have performed in-lab usability testing
of the interface and plan to conduct an evaluation with end users.
For attention analysis, we currently provide only fixation
visualization and a user similarity function; however, the system offers
numerous possibilities to exploit attention data for interaction
optimization and personalized user experience. For example, user
recommendations and filtering on current online platforms are
based merely on explicit feedback from users, e.g., what users
have searched and clicked. A platform with eye tracking can build a
network of users providing eye gaze data. This implicit feedback can
be used to characterize user interests for personalized ranking and
for collaborative filtering based on similar gaze patterns of users.
We acknowledge the work by students from the University of Koblenz:
Denise Dünnebier, Mariya Chkalova, Min Ke, Yessika Legat, Arsenii
Smyrnov, Benjamin Becker, Daniyal Akbari, Matthias Greber,
Steven Schürstedt, and Jannis Eisenmenger, who were involved in
a project that contributed towards the results reported in this paper.
Tanja Blascheck, Kuno Kurzhals, Michael Raschke, Michael Burch, Daniel Weiskopf,
and Thomas Ertl. 2014. State-of-the-Art of Visualization for Eye Tracking Data. In
EuroVis (STARs).
Yury Gitman, Mikhail Erofeev, Dmitriy Vatolin, Bolshakov Andrey, and Fedorov Alexey.
2014. Semiautomatic visual-attention modeling and its application to video
compression. In 2014 IEEE International Conference on Image Processing (ICIP). IEEE,
Google LLC. 2020 (accessed March 5, 2020). YouTube API.
Poika Isokoski, Markus Joos, Oleg Spakov, and Benoît Martin. 2009. Gaze Controlled
Games. Univers. Access Inf. Soc. 8, 4 (Oct. 2009), 323–337.
s10209-009-0146-3
Rob Jacob and Sophie Stellmach. 2016. What You Look at is What You Get: Gaze-based
User Interfaces. interactions 23, 5 (Aug. 2016), 62–65.
Harish Katti, Karthik Yadati, Mohan Kankanhalli, and Chua Tat-Seng. 2011. Affective
video summarization and story board generation using pupillary dilation and eye
gaze. In 2011 IEEE International Symposium on Multimedia. IEEE, 319–326.
Chandan Kumar, Raphael Menges, and Steffen Staab. 2016. Eye-Controlled Interfaces
for Multimedia Interaction. IEEE MultiMedia 23, 4 (Oct 2016), 6–13.
Chris Lankford. 2000. Effective Eye-gaze Input into Windows. In Proceedings of the
2000 Symposium on Eye Tracking Research & Applications (ETRA ’00). ACM, New
York, NY, USA, 23–27.
Raphael Menges, Chandan Kumar, Daniel Müller, and Korok Sengupta. 2017a.
GazeTheWeb: A Gaze-Controlled Web Browser. In Proceedings of the 14th Web
for All Conference (W4A ’17). ACM.
Raphael Menges, Chandan Kumar, and Steffen Staab. 2019. Improving User Experience
of Eye Tracking-based Interaction: Introspecting and Adapting Interfaces. ACM
Trans. Comput.-Hum. Interact. (2019). Accepted May 2019.
Raphael Menges, Chandan Kumar, Ulrich Wechselberger, Christoph Schaefer, Tina
Walber, and Steffen Staab. 2017b. Schau genau! A Gaze-Controlled 3D Game for
Entertainment and Education. In Journal of Eye Movement Research, Vol. 10. 220.
Marcus Nyström and Kenneth Holmqvist. 2010. An adaptive algorithm for fixation,
saccade, and glissade detection in eyetracking data. Behavior research methods 42,
1 (2010), 188–204.
Yasuhito Sawahata, Rajiv Khosla, Kazuteru Komine, Nobuyuki Hiruma, Takayuki Itou,
Seiji Watanabe, Yuji Suzuki, Yumiko Hara, and Nobuo Issiki. 2008. Determining
comprehension and quality of TV programs using eye-gaze tracking. Pattern
Recognition 41, 5 (2008), 1610–1626.
Statista, Inc. 2020 (accessed March 5, 2020). STATISTA.
Julius Sweetland. 2016. Optikey: Type, Click, Speak.
Chatpong Tangmanee. 2016. Fixation and recall of YouTube ad banners: An
eye-tracking study. International Journal of Electronic Commerce Studies 7, 1 (2016),
Pattreeya Tanisaro, Julius Schöning, Kuno Kurzhals, Gunther Heidemann, and Daniel
Weiskopf. 2015. Visual analytics for video applications. it-Information Technology
57, 1 (2015), 30–36.
Jan van der Kamp and Veronica Sundstedt. 2011. Gaze and Voice Controlled Drawing.
In Proceedings of the 1st Conference on Novel Gaze-Controlled Applications (NGCA
’11). ACM, New York, NY, USA, Article 9, 8 pages.