Multimedia Information and Mobile-Learning
Duc Phuong Nguyen, Martin Guggisberg, Helmar Burkhart
University of Basel, Switzerland
Phuong.Nguyen@unibas.ch
Abstract
The number of applications that use mobile devices
in learning activities is increasing. The limitations of mobile devices, such as small screens and inconvenient input facilities, can be reduced by using multimedia content. For example, students can submit questions by taking pictures of their notes; field trip experiences can be shared via video taken with mobile phones; and students can access long text information via voice instead of using SMS, which allows only 160 characters
per message. However, the use of multimedia
information poses a new challenge for system
developers: how can students search multimedia
information? In this paper, we introduce a mobile-
learning system that addresses this issue by applying
several techniques such as Automatic Speech
Recognition (ASR), Optical Character Recognition
(OCR), and Text to Speech (TTS). The system is based
on the CoMobile framework, our open source software
framework for collaborative work using mobile
devices.
1. Introduction
Mobile-learning refers to the use of mobile devices
such as mobile phones and Personal Digital Assistants (PDAs) to support learning activities [1-3]. A large number of
mobile-learning systems use Short Message Service
(SMS) as the communication channel. Examples are
concept maps [2], announcements [3], reminders [4], and language learning applications [5]. The information
exchanged is often short text because an SMS can
contain only 160 characters. This limitation implies that SMS cannot be used to deliver large amounts of information without splitting the text into several messages. Recent generations of mobile phones are often equipped with digital cameras, so rich multimedia content such as pictures and video clips can be submitted and accessed from mobile phones.
In this paper, we introduce our mobile-learning
application, which addresses the issues of sharing, accessing, and searching multimedia content with mobile phones. The system allows students to share and access learning resources. The term resource here refers to learning materials in electronic form, such as e-books, lecture notes, recorded lectures, and discussions on learning topics. The resources come in different formats, for example text, images, audio data, and video files. The users interact with the system either with web browsers or with mobile phones (often when they are on the move or have little time).
The support of various content types (text, images, audio, and video) offers several benefits for mobile users:
- When students want to ask a short question or answer
in brief, they can simply compose an SMS and send it
to the system.
- When the information is too long to send via SMS,
the users can take pictures of their notes and send them
to the system.
- Interesting experiences such as live demonstrations and excursions can be recorded with the mobile phone's camera as video clips and shared with other students.
- If the users consider typing SMS to be too slow, they
can call a special telephone number and record their
voice messages (e.g.: questions, answers, discussions).
The recorded audio data can later be accessed by mobile users via telephone calls and by web users by clicking on a link to an audio file.
While submitting information as images or audio data overcomes the limited input facilities of mobile phones, a new challenge is posed: how can students find the information they need? The system can do full-text search on textual information, but for other data formats such as images, audio, or video, some meta-data is required. Our system addresses this issue by using
several techniques to extract text from audio, images as
well as video clips. The system leverages the use of
open source software projects and is based on
CoMobile - our software framework for collaborative
work with mobile phones.
2. Mobile-Learning with CoMobile
Our mobile-learning application allows students to
share, search and access information by using either
web browsers or mobile phones.
Figure 1 shows the data flow of the CoMobile
system. An example of the workflow in our mobile-
learning system is as follows:
Submitting questions via mobile phones:
- Step 1: Student Alice calls the system by dialing a
short number 5678, and then leaves a voice message by
pressing key 1: “I am doing a software project on web
technologies using java servlet, mysql database. If you
are also interested in the topic, please contact me by
email: alice@uni.org”
- Step 2: The PBX module saves the speech as a wav
file and informs the CoMobile core module that a new
voice message is stored.
- Step 3: The CoMobile core module saves this
information into a database so that web clients can
access the wav files. It also calls the ASR module to extract text from the audio and saves the summarized text (e.g.: "web technologies", "java", "project"). The summarized text will later be used for searching audio content. Similar steps are done for images and
video files.
The web clients use HTML forms to submit text and
they can send images, audio, or video files as
attachments.
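As an illustration of Steps 2 and 3, the following sketch shows how the core module could record a new voice message and request a transcription from the ASR module. The class, method, and collaborator names are assumptions made for this sketch; they are not part of the published CoMobile API.

import java.io.File;

/*
 * Illustrative sketch: the PBX has stored the caller's speech as a wav
 * file; the core module records it in the database and asks the ASR
 * module for a transcription whose keywords are stored for searching.
 */
public class VoiceMessageHandler {

    private final PostStore store;   // assumed database wrapper (JDBC behind it)
    private final AsrClient asr;     // assumed HTTP client for the ASR module

    public VoiceMessageHandler(PostStore store, AsrClient asr) {
        this.store = store;
        this.asr = asr;
    }

    public void onNewVoiceMessage(File wavFile, String callerNumber) {
        long postId = store.insertAudioPost(wavFile, callerNumber); // web clients can now link to the wav
        String summary = asr.transcribe(wavFile);                   // e.g. "web technologies java project"
        store.saveSummary(postId, summary);                         // later used for keyword search
    }

    /** Assumed collaborators, shown only as interfaces for the sketch. */
    public interface PostStore {
        long insertAudioPost(File wavFile, String callerNumber);
        void saveSummary(long postId, String summary);
    }

    public interface AsrClient {
        String transcribe(File wavFile);
    }
}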
Searching on multimedia contents:
- Step 1: Student Bob composes an SMS “search java,
project” and sends it to a special number.
- Step 2: The system processes the incoming SMS and
sends back an SMS: "There are 2 matched entries, please call 5678 and press 3 to browse the messages".
- Step 3: Bob calls the system and listens to the voice
messages. The first message is Alice's question from the previous example. The second message is another discussion, which is synthesized speech from a long text that could not be sent to Bob as an SMS message.
For users who use web browsers, the searching
process is quite simple: the user types “java, project”
into the search textbox and clicks the submit button.
The response is a web page with two results. The first result is a link to the wav file (Alice's recorded question) together with the summarized text extracted by the ASR module. The second result is the long text of another discussion (which was posted to the system using a web browser). The only difference here is that the long text entry does not need to be converted into speech, because the web client can read the text conveniently on the computer screen.
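For illustration, the search over the extracted summaries could be a simple keyword match in the database. The table and column names below are assumptions for the sketch; the paper does not describe the actual schema.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

/*
 * Hypothetical search step.  The table "posts" with columns id and summary
 * (the text extracted by the ASR/OCR modules) is an assumption.
 */
public class PostSearch {

    public List<Long> findPosts(Connection db, String[] keywords) throws SQLException {
        // Match posts whose extracted summary contains every keyword.
        StringBuilder sql = new StringBuilder("SELECT id FROM posts WHERE 1=1");
        for (int i = 0; i < keywords.length; i++) {
            sql.append(" AND summary LIKE ?");
        }
        List<Long> ids = new ArrayList<Long>();
        try (PreparedStatement ps = db.prepareStatement(sql.toString())) {
            for (int i = 0; i < keywords.length; i++) {
                ps.setString(i + 1, "%" + keywords[i] + "%");
            }
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    ids.add(rs.getLong("id"));
                }
            }
        }
        return ids;
    }
}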
In addition, the current version of our system provides the following features: Audio information system - students call the system and listen to Internet radio and recorded lectures. Sharing multimedia content - students can send images, audio, and video clips to the system, and these contents can be accessed by web clients and mobile clients.
Figure 1: Data flow of the CoMobile system. [Diagram: mobile clients and web clients exchange text, voice, images, and video with the CoMobile system through the SMS gateway (Kannel), the PBX (Asterisk), MMS, and the Internet; the CoMobile system is connected to the ASR module (Sphinx), the TTS module (Festival), and the OCR module (GOCR).] Multimedia contents (text, pictures, voice, and video) are sent from mobile clients and web clients to the CoMobile system through various communication channels (SMS, telephone line, MMS, Internet). The CoMobile system forwards the multimedia contents to other modules (ASR, OCR) to extract text and then saves the text in the database. Long texts are synthesized to speech by the TTS module, and mobile users can access the audio data by dialling a special telephone number.
3. The CoMobile framework
CoMobile is a framework that supports
collaborative work with mobile devices. The purpose
of the framework is to leverage the use of mobile
phones in different areas such as Mobile-Learning,
Tourist Information Systems, and Virtual Communities.
In the context of E-Learning, software developers
often have to consider several issues when they want to
use mobile phones for learning activities:
- Is it possible to integrate mobile phones with their existing E-Learning systems without major changes and reimplementation?
- For common web applications such as Learning Management Systems, which open source project should be used or modified so that mobile clients are also supported? How can the system be kept compatible with future updates of the open source project after mobile clients are integrated?
While designing the CoMobile framework, we made
the following decisions:
- From the web application point of view, mobile clients are treated the same as web clients. Anything specific to mobile phones is managed by the CoMobile system. In this way, integration with other systems is easy because these systems do not have to change their core implementation. Furthermore, compatibility with future releases of such systems is assured.
- Communication between internal modules as well as with external systems is done via HTTP, so that cross-language and cross-platform use is supported.
The CoMobile system follows the Model-View-
Controller (MVC) design pattern [6]. The Controller component is based on Java technologies; the other components of the system (ASR, PBX, OCR, TTS, and SMS gateway) are modules that perform separate tasks.
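As a sketch of the HTTP-based design, each module can be wrapped by a small servlet so that it is reachable from any language. The servlet name and parameters below are illustrative assumptions, not the actual CoMobile interface.

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

/*
 * Illustrative sketch: expose an internal module over HTTP so that other
 * modules and external systems, regardless of language, can call it.
 */
public class ModuleServlet extends HttpServlet {

    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        String action = req.getParameter("action"); // e.g. "transcribe", "ocr", "synthesize"
        String file = req.getParameter("file");     // path of the uploaded media file

        String result = dispatch(action, file);      // delegate to the wrapped open source tool

        resp.setContentType("text/plain;charset=UTF-8");
        resp.getWriter().write(result);
    }

    private String dispatch(String action, String file) {
        // In the real system this would invoke Sphinx, GOCR, or Festival.
        return "";
    }
}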
3.1. Core component
The CoMobile core component is composed of several Java servlets, Java classes, and Perl scripts. It coordinates the activities of the other modules. For example, the SMS gateway receives SMS messages from mobile clients and informs the core component, which then forwards the data (SMS text, mobile phone number, status, ID) to the web application. At a later time, when replies for that post are submitted, the web application calls the CoMobile core component to send notifications to mobile clients.
The CoMobile framework defines a set of APIs that
other systems can call in order to support mobile
features (e.g.: receive video posts from mobile phones).
The APIs also contain several call-back services that other systems have to implement when they want to integrate with CoMobile. These call-back services are simple, and many of them are already available, such as routines to insert posts into the database and to get back the ID of a post and its related replies. Therefore, the main task for the system integrator is to make these routines accessible via HTTP and conformant to the CoMobile APIs.
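A minimal sketch of what such a call-back contract could look like on the Java side is given below; the interface and method names are hypothetical and only illustrate the kind of routines (insert a post, look up replies) that an integrating system exposes over HTTP.

import java.util.List;

/*
 * Hypothetical call-back contract for a system that integrates with
 * CoMobile.  The real CoMobile APIs are not reproduced here; this interface
 * only illustrates the kind of routines mentioned in the text.
 */
public interface CoMobileCallback {

    /** Insert a post that arrived from a mobile client and return its ID. */
    long insertPost(String author, String text, String attachmentPath);

    /** Return the IDs of the replies to a given post, so that CoMobile can
     *  notify the mobile client that submitted it. */
    List<Long> getReplyIds(long postId);
}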
The next section discusses several plug-in modules that deal with multimedia content.
Figure 2: Existing systems can be easily integrated with the CoMobile framework. Two content management platforms, Drupal written in PHP (left) and mvnForum written in Java (right), support mobile clients (e.g.: allowing posts to be sent to and received from mobile phones) by implementing small plug-in modules that communicate with the CoMobile system.
3.2. Plug-in modules
These modules are based on open source projects and are controlled by several Java classes that communicate with the CoMobile core component. Therefore, plug-in modules can easily be upgraded or replaced by better-performing components with little modification of the Java classes.
3.2.1. Automatic Speech Recognition (ASR)
The CoMobile system allows students to post questions and answers in audio form. To search these audio data, speech recognition techniques are used. In this section, we give a brief overview of ASR systems and of the module our system uses to extract text from audio data as well as from video clips.
In general, the goal of an ASR system is to analyze
the speech signal and determine the words uttered by
the speaker. A comprehensive collection of literature on ASR can be found in [7]. In sum, ASR systems can be classified in several ways. They can be divided into single-speaker and speaker-independent systems. In the former case, the system can only recognize the speech of the speaker who trained it. In the latter case, the system can recognize words uttered by any speaker; however, such systems obtain lower accuracy rates than single-speaker systems.
In terms of speaking style, ASR systems can be
classified into isolated words, connected words, and
continuous speech systems. The first type requires the speaker to make clear pauses between words (e.g.: pauses of a few seconds) so that the start and end of each word can be detected. Connected words systems require the speakers to make short pauses between the words. In continuous speech recognition systems, the speakers can speak normally, as if they were talking to another person. Another kind of ASR system is keyword spotting. These systems only detect a specific set of words (e.g.: command keywords like "open" and "close"); words that are not in the set are considered irrelevant.
In our mobile-learning system, we are interested in speaker-independent, continuous speech ASR systems because of their flexibility.
In ASR systems, each basic unit of sound (often a word or phoneme) is characterized by some acoustic parameters. In an ideal case, if each unit could be mapped to a unique set of parameters, then speech recognition would be a trivial task. However, this does not hold true for many reasons:
- People do not produce the same sound for the same
word (dialect; male/female). A person might speak a
word differently every time (length, energy).
- The pronunciation of a phoneme might depend on
surrounding phonemes.
- Some words sound similar (e.g.: “read” and “red”).
- Background noise changes the characteristics of the
speech signal.
Therefore, ASR systems use statistical models to deal with these uncertainty factors. The recognition process searches for the sequence of words with the highest probability of fitting the input speech signal. A language model (grammar and dictionary) is used to select common words and eliminate word sequences that are not valid. As a result, the number of hypotheses is reduced and the processing time is shortened. Current advanced ASR systems are based on the Hidden Markov Model (HMM), a type of statistical model [8].
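In the standard formulation (general ASR background, not specific to our system), the recognizer searches for the word sequence W with the highest posterior probability given the acoustic observations X; by Bayes' rule the constant P(X) can be dropped, leaving the product of the acoustic model and the language model:

\hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} P(X \mid W)\, P(W)

where P(X | W) is computed by the acoustic (HMM) models and P(W) by the language model.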
Commonly used open source ASR tools are HTK
[9] and Sphinx [10]. HTK is a toolkit for building
Hidden Markov Models. It consists of a set of libraries
and tools, available as C source code. Sphinx (version 4) is a framework for speech recognition written in Java (previous versions of Sphinx are written in C). Several free acoustic models and language models are distributed with Sphinx, such as AN4 (30 words), RM1 (1,000 words), and Hub4 (64,000 words). We decided to use Sphinx in our prototype because the freely available acoustic and language models reduce the development time significantly. In addition, Sphinx-4 is written in Java, our preferred programming language.
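A minimal transcription sketch with Sphinx-4 is shown below. It follows the pattern of the Sphinx-4 demo applications; the configuration file (which selects the acoustic model, language model, and front end) is assumed to exist, and its name is illustrative.

import java.io.File;
import edu.cmu.sphinx.frontend.util.AudioFileDataSource;
import edu.cmu.sphinx.recognizer.Recognizer;
import edu.cmu.sphinx.result.Result;
import edu.cmu.sphinx.util.props.ConfigurationManager;

/*
 * Sketch of batch transcription with Sphinx-4, assuming a Sphinx-4 XML
 * configuration file ("asr-config.xml", illustrative name) that defines the
 * "recognizer" and "audioFileDataSource" components.
 */
public class WavTranscriber {

    public static String transcribe(File wavFile) throws Exception {
        ConfigurationManager cm =
                new ConfigurationManager(new File("asr-config.xml").toURI().toURL());

        Recognizer recognizer = (Recognizer) cm.lookup("recognizer");
        recognizer.allocate();

        AudioFileDataSource source =
                (AudioFileDataSource) cm.lookup("audioFileDataSource");
        source.setAudioFile(wavFile.toURI().toURL(), null);

        StringBuilder text = new StringBuilder();
        Result result;
        while ((result = recognizer.recognize()) != null) {
            // Best hypothesis without filler tokens such as silence markers.
            text.append(result.getBestFinalResultNoFiller()).append(' ');
        }
        recognizer.deallocate();
        return text.toString().trim();
    }
}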
Extracting text from speech in video: In our system, there are video clips taken in different contexts, such as during a lecture, on a field trip, or at a presentation. There might already be some meta-data associated with these video clips (e.g.: the subject of the email to which the video clip is attached). To provide more information that facilitates searching, it is useful to extract the speech from the video clips (if there is any) and use the ASR module to extract text from the resulting audio files.
Extracting the audio data from the video is done using the open source software MPlayer [11]. The tool can read several video formats and extract the audio track from the video. The resulting audio files are then sent to the Sphinx module, which performs the speech recognition. Finally, the recognized texts are processed by the CoMobile system, and the relation between the video and the text is stored in the system's database.
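The extraction step can be scripted by calling MPlayer from the core component, roughly as sketched below. The exact MPlayer options (and the resampling to a rate the acoustic model expects) may differ between MPlayer versions, so this is an illustrative invocation rather than the exact CoMobile code.

import java.io.File;
import java.io.IOException;

/*
 * Sketch: dump the audio track of a video clip to a WAV file with MPlayer,
 * downmixed to mono and resampled to 16 kHz for the recognizer.  Option
 * names follow common MPlayer usage and may need adjustment.
 */
public class AudioExtractor {

    public static File extractAudio(File video, File wavOut)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(
                "mplayer",
                "-really-quiet",
                "-vo", "null",                         // no video output
                "-af", "resample=16000,channels=1",    // 16 kHz mono for ASR
                "-ao", "pcm:file=" + wavOut.getPath(), // write decoded audio as WAV
                video.getPath());
        pb.redirectErrorStream(true);
        Process p = pb.start();
        if (p.waitFor() != 0) {
            throw new IOException("MPlayer failed for " + video);
        }
        return wavOut;
    }
}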
3.2.2. Optical Character Recognition (OCR)
Optical character recognition software is used to
translate images of text (either printed or handwritten) into computer-editable text. Comprehensive reviews of this field can be found in [12-13]. Current OCR systems are often based on pattern recognition techniques. A language model can be used in a post-processing step to improve the recognition rates. For clearly segmented printed text, the word recognition rates are quite high (>90%). However, writer-independent recognition of handwritten text is still an open challenge.
So far, open source OCR programs are not competitive with commercial software in terms of performance. However, we decided to use an open source OCR package because it is freely available and its performance is expected to improve over time. During the implementation of our system, we evaluated several common open source OCR systems (OCRAD [14], GOCR [15], and ISRI OCR [16]). The recognition rates for printed text in large font sizes (>14pt) are quite good for all packages (80%-100%). However, when tested with handwritten text, the recognition rates are much lower (from 10% to 50%).
After the evaluation, GOCR was chosen because it has the highest recognition rate on average. The OCR module can be invoked via HTTP requests sent to the CoMobile Controller component. In this way, the module is seen as a service, and programs written in many programming languages can use it easily.
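A sketch of how the OCR module could wrap GOCR is shown below. GOCR works best on PNM-style input, so the uploaded photo is first converted with ImageMagick's convert tool; the command line usage is an assumption for this sketch rather than the exact CoMobile implementation.

import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;

/*
 * Sketch of the OCR step: convert the uploaded image to PNM and run GOCR
 * on it.  Paths and option usage are illustrative.
 */
public class OcrRunner {

    public static String recognize(File image) throws IOException, InterruptedException {
        File pnm = File.createTempFile("ocr", ".pnm");
        run("convert", image.getPath(), pnm.getPath());   // ImageMagick conversion

        // gocr prints the recognized text on standard output.
        Process gocr = new ProcessBuilder("gocr", "-i", pnm.getPath()).start();
        StringBuilder text = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(gocr.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                text.append(line).append('\n');
            }
        }
        gocr.waitFor();
        pnm.delete();
        return text.toString();
    }

    private static void run(String... cmd) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        if (p.waitFor() != 0) {
            throw new IOException("Command failed: " + String.join(" ", cmd));
        }
    }
}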
Figure 3: Sample handwritten text input for OCR. GOCR recognized the input image as: "Thìs ìs a _andwr _tten eyample Wrìte as good as y_u can.". Ocrad recognized the input image as: "Thìs is a haNdLJ_,'_+e_ eyowp(e rfrì_e _s good _5 _orf CQN.".
3.2.3. Text to Speech (TTS)
TTS synthesizer software is used to transform text input (sentences in text form) into speech. The linguistic module and the digital signal processing module are the two main modules of a TTS system. The first module analyses the input text and transforms it into a sequence of basic units together with prosody marks that describe how to pronounce them in terms of volume, intonation, etc. The second module transforms the output of the first module into the synthesized voice. The linguistic module often uses diphones or triphones as basic units (other possibilities are words or syllables). A diphone is a small segment of speech that contains the stationary part of a phoneme, the transition to the following phoneme, and the stationary part of that following phoneme. A triphone also takes the surrounding sounds into account, i.e. the immediate left and right phonemes.
One of the widely used, open source speech
synthesis systems is the Festival [17] speech synthesis
system. Festival is a diphone-based TTS synthesizer whose diphone database can be replaced by other databases such as MBROLA [18] to support languages other than English. Festival can be used from a shell command or from programming languages.
There are alternatives to Festival, such as Flite [19] and FreeTTS [20]. Flite is a lightweight, fast runtime speech synthesis engine written in C and designed for embedded systems such as PDAs. FreeTTS is the Java version of Flite. We decided to use Festival because of its high-quality synthesized voices and its support for languages other than English, such as German and Spanish.
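Synthesis of long texts into audio files can be scripted with Festival's text2wave utility, roughly as sketched below; the call and file handling are an illustrative assumption rather than the exact CoMobile code.

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;

/*
 * Sketch of the TTS step: write the long text to a temporary file and let
 * Festival's text2wave script synthesize it into a WAV file that the PBX
 * can play back over the telephone line.
 */
public class SpeechSynthesizer {

    public static File synthesize(String text, File wavOut)
            throws IOException, InterruptedException {
        File txt = File.createTempFile("tts", ".txt");
        try (Writer w = new FileWriter(txt)) {
            w.write(text);
        }
        Process p = new ProcessBuilder(
                "text2wave", txt.getPath(), "-o", wavOut.getPath()).start();
        if (p.waitFor() != 0) {
            throw new IOException("Festival text2wave failed");
        }
        txt.delete();
        return wavOut;
    }
}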
4. Summary
In general, images taken with mobile phone cameras often have lower quality than scanned images (background noise, lower resolution), which makes OCR on such images harder than on scanned documents.
In our current prototype, the accuracy rates of the ASR module are 10% to 20% lower than the results of the Sphinx system reported in [21]. There might be several reasons:
- The voice can only be recorded at 8 kHz over the telephone line.
- There are many options in Sphinx to trade off accuracy against speed; we need to experiment more intensively with these options.
Speech recognition is a computationally intensive process. It commonly takes 5 to 10 times real time (RT) to process a recording; that is, for an audio recording of one minute, the recognition process would take 5 to 10 minutes. It would be useful to distribute the task to many high performance servers by splitting the audio into smaller chunks.
We are planning to deploy the system in the coming semester and gather feedback from the students.
5. References
[1] Roschelle, J. (2003),"Unlocking the learning value of
wireless mobile devices", Journal of Computer Assisted
Learning, 19(3), 260-272.
[2] Christian Wattinger et al.,"Problem-Based Learning
Using Mobile Devices", In the 6th IEEE International
Conference on Advanced Learning Technologies, 2006.
[3] J. Roschelle, R. Rosas, M. Nussbaum, "Towards a design
framework for mobile computer-supported collaborative
learning", Proceedings of the 2005 conference on Computer
support for collaborative learning: learning 2005: the next 10
years, 2005.
[2] Pasi Silander, Erkki Sutinen, Jorma Tarhio, "Mobile Collaborative Concept Mapping - Combining Classroom Activity with Simultaneous Field Exploration", Proc. of the 2nd IEEE International Workshop on Wireless and Mobile Technologies in Education, 2004.
[3] Andy Stone, Jonathan Briggs, Craig Smith, “SMS and
Interactivity- Some Results from the Field, and its
Implications on Effective Uses of Mobile Technologies in
Education”, Proc. of IEEE International workshop on
Wireless and Mobile Technologies in Education, 2002.
[4] Andy Stone, "Mobile Scaffolding: An Experiment in Using SMS Text Messaging to Support First Year University Students", Proc. of IEEE International Conference on Advanced Learning Technologies, 2004.
[5] P. Thornton, C. Houser, “Learning on the Move: Foreign
language vocabulary via SMS”, Proc. of ED-Media, 2001.
[6] Erich Gamma et al, "Design Patterns: Elements of
Reusable Object-Oriented Software", Addison-Wesley
Professional, ISBN: 0201633612, 1995.
[7] Alex Waibel and Kai-Fu Lee, "Readings in speech
recognition", Morgan Kaufmann Publishers, ISBN: 1-55860-
124-4
[8] Lawrence R. Rabiner, "A Tutorial on Hidden Markov
Models and Selected Applications in Speech Recognition".
Proceedings of the IEEE, 77 (2), p. 257–286, February 1989.
[9] T. Hain, P.C. Woodland, G. Evermann, D. Povey, "The CU-HTK March 2000 Hub5E Transcription System", Proc. Speech Transcription Workshop, 2000.
[10] W. Walker et al., "Sphinx-4: A Flexible Open Source
Framework for Speech Recognition ", Sphinx Whitepaper,
Sun Microsystems INC, 2004.
[11] Kyle Rankin, "Linux Multimedia Hacks", O'Reilly,
ISBN: 0-596-10076-0, 2005.
[12] H. Bunke, "Recognition of Cursive Roman Handwriting - Past, Present and Future", Proc. 7th Int. Conference on Document Analysis and Recognition, pp. 448-459, 2003.
[13] Rejean Plamondon, Sargur N. Srihari, "On-Line and
Off-Line Handwriting Recognition: A Comprehensive
Survey", IEEE Transactions on Pattern Analysis and
Machine Intelligence, vol. 22, no. 1, pp. 63-84, Jan., 2000.
[14] GNU Ocrad, open source Optical Character Recognition
software. Available online at:
http://www.gnu.org/software/ocrad/
[15] Daniel Lopresti, "Document engineering (DE):
Performance evaluation for text processing of noisy inputs",
Proceedings of the 2005 ACM symposium on Applied
computing SAC, 2005.
[16] ISRI OCR: http://www.isri.unlv.edu/ISRI/Software
[17] Paul A. Taylor, Alan Black, and Richard Caley, "The Architecture of the Festival Speech Synthesis System", The Third ESCA Workshop on Speech Synthesis, pages 147-151, Jenolan Caves, Australia, 1998.
[18] T. Dutoit, V. Pagel, N. Pierret, F. Bataille, O. Van der
Vrecken , "The MBROLA project: Towards a Set of High
Quality Speech Synthesizers", In Proc. of the Fourth
International Conference on Spoken Language Processing,
1996.
[19] Black, A.W. and K.A. Lenzo, "Flite: a small fast run-
time synthesis engine", In The 4th ISCA Workshop on
Speech Synthesis. 2001. Perthshire, Scotland.
[20] Willie Walker, Paul Lamere and Philip Kwok, "FreeTTS
- A Performance Case Study", Technical report, Sun
Microsystems, 2002.
[21] Tony Ayres, Brian Nolan, "JSAPI speech recognition
with Sphinx4 and SAPI5", In Proceedings of the 4th
international symposium on Information and communication
technologies WISICT, 2005.