Conference Paper

Towards language preservation: Design and collection of graphemically balanced and parallel speech corpora of Indonesian ethnic languages

Authors:

Abstract

Various intangible cultural expressions in Indonesia, such as oral traditions and literature, are fragile and easily lost. Of the country's 726 languages, 146 are currently endangered. Although several cultural preservation projects have been initiated, technology that could support communication within indigenous communities, as well as with people outside them, is still very rare in Indonesia. Speech-to-speech translation enables communication among people speaking different languages, and it is therefore significant for helping indigenous communities preserve their languages and overcome language barriers. This paper presents an early step in the long-term development of a speech-to-speech translation system from Indonesian ethnic languages to other languages (i.e., English/Indonesian), namely the design and collection of graphemically balanced and parallel speech corpora of four major Indonesian ethnic languages: Javanese, Sundanese, Balinese, and Batak.


... To minimize the impact of the downstream model on the overall measured performance, a simple two-layer Transformer-based decoder is used. ML-SUPERB was presented as a challenge at ASRU 2023, attracting 12 model submissions and 8 new language submissions [15][16][17][18][19][20][21][22][23][24][25]. ...
... For the CTC framework, ML-SUPERB 2.0 investigates the performance of both the frozen pre-trained encoder using a Transformer-based downstream model and partial fine-tuning of the pre-trained encoder. The experimental setup is similar to that for the CTC framework described in Sections 3.2 and 3.3, with the exception of fine-tuning only the top layers of the encoder (i.e., layers 19-24) to limit the number of updated parameters to 100 million. In the CTC-ATT framework, we do not add additional downstream models. ...
Preprint
ML-SUPERB evaluates self-supervised learning (SSL) models on the tasks of language identification and automatic speech recognition (ASR). This benchmark treats the models as feature extractors and uses a single shallow downstream model, which can be fine-tuned for a downstream task. However, real-world use cases may require different configurations. This paper presents ML-SUPERB 2.0, which is a new benchmark for evaluating pre-trained SSL and supervised speech models across downstream models, fine-tuning setups, and efficient model adaptation approaches. We find performance improvements over the setup of ML-SUPERB. However, performance depends on the downstream model design. Also, we find large performance differences between languages and datasets, suggesting the need for more targeted approaches to improve multilingual ASR performance.
... Parts of our dataset were taken from existing Indonesian speech datasets [9,11] obtained in previous research projects [12,13,14]. These include read speech recorded from news script reading and phone dialogue, referred to as LVCSR News and LVCSR Teldialog, respectively. ...
Preprint
An ideal speech recognition model has the capability to transcribe speech accurately under various characteristics of speech signals, such as speaking style (read and spontaneous), speech context (formal and informal), and background noise conditions (clean and moderate). Building such a model requires a significant amount of training data with diverse speech characteristics. Currently, Indonesian data is dominated by read, formal, and clean speech, leading to a scarcity of Indonesian data with other speech variabilities. To develop Indonesian automatic speech recognition (ASR), we present our research on state-of-the-art speech recognition models, namely Massively Multilingual Speech (MMS) and Whisper, and compile a dataset comprising Indonesian speech with variabilities to facilitate our study. We further investigate the models' predictive ability to transcribe Indonesian speech data across different variability groups. The best results were achieved by the Whisper fine-tuned model across datasets with various characteristics, as indicated by the decrease in word error rate (WER) and character error rate (CER). Moreover, we found that speaking style variability affected model performance the most.
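The two metrics named in this abstract can be illustrated with a minimal sketch. It assumes the common definition of WER and CER as the Levenshtein (edit) distance between the reference and the hypothesis, divided by the reference length in words or characters respectively. The function names and the example sentences are hypothetical and are not taken from the paper, whose exact scoring and text normalization are not described here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    # prev[j] holds the distance between the first i-1 items of ref
    # and the first j items of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


if __name__ == "__main__":
    # Hypothetical example: one of four reference words is missing.
    print(wer("saya makan nasi goreng", "saya makan nasi"))
    # Hypothetical example: one of four reference characters is substituted.
    print(cer("nasi", "nasu"))
```

Both rates can exceed 1.0 when the hypothesis contains many insertions, which is why they are reported as error rates and not as accuracies.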
... The combination of teaching and language corpora is increasingly becoming a convergence proving to be valuable in the field of language education (McEnery & Xiao, 2011). Additionally, it serves as a cultural preservation tool, capturing the oral traditions and linguistic practices of the Minangkabau people for future generations (Handoko et al., 2024; Nelisa et al., 2021; Sakti & Nakamura, 2013). ...
Article
This paper discusses the design for developing the Minangkabau language corpus, especially regarding the opportunities and challenges. The corpus development of Minangkabau is a crucial project to document, preserve, and revive the treasure trove of culture within the language. The availability of a Minangkabau language corpus can open opportunities for more intensive research on the Minangkabau language with a more modern and data-based approach. It can also encourage the development of Minangkabau corpus-based teaching materials. The corpus is manually assembled from various sources through comprehensive data collection, annotation, and curation pipelines. The sources may be manuscripts, books, newspapers, or other written texts, as well as spontaneous speech such as interviews or public speeches. Multimedia resources, such as television and radio broadcasts, audio-video recordings, and social media content, also add to the diversity of data gathered. The availability of accessible digital sources, such as online videos, online radio programs, and ebooks, can make data collection easier. However, several challenges may appear in developing the Minangkabau language corpus, such as limited technology accessibility, dialect variations, and the involvement of highly skilled human resources. This paper explains some opportunities for developing the Minangkabau language corpus and increasing the role of the corpus in revitalizing and documenting the Minangkabau language. Furthermore, the availability of the Minangkabau language corpus can also be a starting point for developing linguistic technology, such as voice recognition, text-to-speech, and natural language processing.
Article
Code-mixing has been extensively researched in educational communication, but few studies have explored the practice of code-switching in preaching, which belongs to the domain of religious communication. The study aims to describe the practice of code-mixing in sermons delivered by pastors who use a traditional language (Simalungun) as the medium of delivery. The study used a descriptive qualitative approach with a survey design, using Google Forms as the research instrument and involving sixty priests who deliver sermons in the Simalungun language as participants. The results reveal that the motivation to use code-mixing is to clarify and facilitate the delivery and understanding of the content of the sermon. The results also reveal the reasons for code-mixing, the dominant preaching genre in which it is used, the languages used, the problems faced in practicing it, and the congregation's perception of the practice as carried out by pastors. These results provide a complete description of the practice and perception of code-mixing in sermons delivered in the Simalungun language. The study suggests further research on the content of Sunday sermons using different methods of analysis.
Chapter
Namibia is a multilingual nation and, like other African countries, comprises numerous indigenous languages besides its official language, which is English. The national language policy has always recognized the importance of local languages. Despite the fact that all these languages are used in daily life, only a handful of them have produced written materials such as dictionaries and publications for people from indigenous communities. The few resources that are available were created before the digital era, and are therefore neither easily accessible nor up to date. Realizing the importance of indigenous languages in Namibia, we intend to develop an online language platform as a basis for language development and preservation. Our proposed solution is a collaborative online open-content dictionary for Namibian indigenous languages, which will be collectively maintained by the people of Namibia. Different design techniques and similar existing systems were evaluated together with a group of participants.
Article
In this paper, we report a survey of language resources in Indonesia, primarily of indigenous languages. We look at the official Indonesian language (Bahasa Indonesia) and 726 regional languages of Indonesia (Bahasa Nusantara) and list all the available lexical resources (LRs) that we could gather. This paper suggests that the smaller regional languages may remain relatively unstudied and unknown, but they are still worthy of our attention. Various LRs of these endangered languages are being built and collected by regional language centers for their study and preservation. We also briefly report their presence on the Internet.
Article
The paper gives an overview and evaluation of language resources of Asian languages, in particular of Indonesian official and local languages that are currently used on the Internet. We have collected over 100 million Asian web pages downloaded from 43 Asian country domains and analyzed their language properties. The presence of a language is measured primarily by the number of pages written in it. The survey reveals that the digital language divide exists at a serious level in the region, and it analyzes the state of multilingualism and the dominating presence of cross-border languages, English in particular. The survey also shows the diversity of Indonesian official and local languages on the Internet.
Article
The dominant discourse in accommodating the ethnic Chinese in Indonesia during Suharto's regime was one of assimilation, which forcefully aimed to absorb this minority into the national body. However, continuous official discrimination towards the Chinese placed them in a paradoxical position that made them an easy target of racial and class hostility. The May 1998 anti-Chinese riots proved the failure of the assimilationist policy. The process of democratization has given rise to a proliferation of identity politics in post-Suharto Indonesia. The policy of multiculturalism has been endorsed by Indonesia's current power holders as a preferred approach to rebuilding the nation, consistent with the national motto: 'Unity in Diversity'. This paper critically considers the politics of multiculturalism and its efficacy in managing cultural diversity and differences. It deploys the concept of hybridity to describe as well as analyze the complex identity politics of the ethnic Chinese in contemporary Indonesia.
Book
The rapid endangerment and death of many minority languages across the world is a matter of widespread concern, not only among linguists and anthropologists but among all concerned with issues of cultural identity in an increasingly globalized culture. By some counts, only 600 of the 6000 or so languages in the world are 'safe' from the threat of extinction. A leading commentator and popular writer on language issues, David Crystal asks the fundamental question, 'Why is language death so important?', reviews the reasons for the current crisis, and investigates what is being done to reduce its impact. The book contains not only intelligent argument, but moving descriptions of the decline and demise of particular languages, and practical advice for anyone interested in pursuing the subject further.
Article
Thesis (Ph. D.)--Ateneo De Manila University-Philippine Normal College Consortium, 1978. Vita. Includes bibliographical references (leaves 464-486). Photocopy.
Article
Japanese women have been found to have higher pitches than Dutch women. This finding has been explained in the past by assuming that Japanese women raise their pitch in order to project a vocal image associated with feminine attributes of powerlessness. In the present study three hypotheses underlying such an assumption were tested experimentally: (1) the association of high pitch with attributes of physical and psychological powerlessness (short, weak, dependent, modest) in the Dutch and Japanese cultures, (2) a stronger differentiation between the ideal woman and man, in terms of powerlessness/power, in Japan than in the Netherlands and (3) a preference for high pitch in women in Japan and for medium or low pitch in women in the Netherlands. All three hypotheses were confirmed. However, results also suggest a strong emphasis in Japan on masculinity in men, possibly leading to a lowering of pitch.