Estimated cost per language family/script, relative to English. The language families are abbreviated as follows: IE: Indo-European, ST: Sino-Tibetan, AC: Atlantic-Congo, AA: Afro-Asiatic, DR: Dravidian, KA: Kartvelian.

Source publication
Preprint
Language models have graduated from being research prototypes to commercialized products offered as web APIs, and recent works have highlighted the multilingual capabilities of these products. The API vendors charge their users based on usage, more specifically on the number of "tokens" processed or generated by the underlying language models. Wh...

Contexts in source publication

Context 1
... we make a reasonable assumption that its training dataset has a similar proportion of languages as the publicly available large corpus CC100 (Wenzek et al., 2020). If we sort the languages shown in Figure 2 by their data size in CC100 (see Figure 14 in the Appendix), low-resourced languages of Latin script appear to be less fragmented than mid-resourced languages of non-Latin scripts. In Figure 15 in the Appendix, we present a similar analysis for BLOOMZ's tokenizer. ...
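Fragmentation here refers to how many tokens a tokenizer needs to encode the same content in different languages. A minimal sketch of such a comparison, using the openly available tiktoken library as a stand-in for the API's tokenizer; the parallel sentences are approximate, illustrative translations, not the corpus analyzed in the paper:

```python
# Sketch: compare tokenizer fragmentation across languages/scripts by
# counting tokens per parallel sentence. tiktoken stands in for the
# commercial API's tokenizer; sentences are approximate translations
# chosen for illustration only.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Hypothetical parallel sentences (same meaning across languages).
parallel = {
    "eng_Latn": "The weather is nice today.",
    "swh_Latn": "Hali ya hewa ni nzuri leo.",  # Swahili, Latin script
    "hin_Deva": "आज मौसम अच्छा है।",            # Hindi, Devanagari script
}

base = len(enc.encode(parallel["eng_Latn"]))
for lang, sent in parallel.items():
    n = len(enc.encode(sent))
    print(f"{lang}: {n} tokens ({n / base:.2f}x English)")
```

The same loop works with a Hugging Face tokenizer (e.g., AutoTokenizer.from_pretrained for BLOOMZ) to reproduce the BLOOMZ-side comparison.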
Context 2
... the same information is expressed using different numbers of tokens in different languages, we aim to investigate the disparity in what users pay to use the API for different languages. From the results of our analysis in §4.1, we compute the estimated cost of API use per language as a function of the average sequence length derived in Figure 2. We report this for a subset of languages in Figure 16 in the Appendix and present a granular analysis of languages that share a family and script in Figure 4. ...
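Under per-token pricing, a language's cost relative to English reduces to the ratio of its average sequence length to English's. A minimal sketch of this computation, with made-up average token counts and an example price (the paper's actual per-language averages come from its Figure 2):

```python
# Sketch: estimated API cost per language as a function of average
# sequence length. Token counts and the price are hypothetical; the
# paper derives per-language averages from its tokenization analysis.
PRICE_PER_1K_TOKENS = 0.002  # example USD rate, not a real vendor quote

avg_tokens = {  # hypothetical average tokens per parallel sentence
    "eng_Latn": 25.0,
    "deu_Latn": 31.0,
    "hin_Deva": 68.0,
    "mya_Mymr": 142.0,
}

for lang, n in avg_tokens.items():
    rel = n / avg_tokens["eng_Latn"]        # cost relative to English
    usd = n / 1000.0 * PRICE_PER_1K_TOKENS  # absolute cost per sentence
    print(f"{lang}: {rel:.2f}x English (~${usd:.5f} per sentence)")
```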

Similar publications

Conference Paper
Significant progress has been made on text generation by pre-trained language models (PLMs), yet distinguishing between human- and machine-generated text poses an escalating challenge. This paper offers an in-depth evaluation of three distinct methods used to address this task: traditional shallow learning, Language Model (LM) fine-tuning, and Multi...