Article

Towards building a Urdu Language Corpus using Common Crawl


Abstract

Urdu is the most popular language in Pakistan and is spoken by millions of people across the globe. While English is considered the dominant web content language, the characteristics of Urdu-language web content remain largely unknown. In this paper, we study the World-Wide-Web (WWW) by focusing on content written in the Perso-Arabic script. Leveraging the Common Crawl Corpus, the largest publicly available collection of web content, comprising 2.87 billion documents for December 2016, we examine different aspects of Urdu web content. We use the Compact Language Detector (CLD2) for language detection. We find that Urdu web content accounts for 0.04% of the global WWW in terms of document frequency. 70.9% of Urdu domains fall under the .com, .org, and .info top-level domains, and urdulughat is the most dominant second-level domain. 40% of the domains are hosted in the United States, while only 0.33% are hosted within Pakistan. Moreover, 25.68% of web pages have Urdu as their primary language, and only 11.78% of web pages are exclusively in Urdu. Our Urdu corpus consists of 1.25 billion total and 18.14 million unique tokens. Furthermore, the corpus follows Zipf's law. This Urdu corpus can be used for text summarization, text classification, and cross-lingual information retrieval.
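
The language-detection step described above can be sketched in a few lines of Python. This assumes the pycld2 binding to CLD2; the helper name, the sample string, and the way the Urdu share is read out of the detector output are illustrative, not the paper's code.

import pycld2  # Python binding to the Compact Language Detector (CLD2)


def classify_page(text):
    """Return (is_reliable, primary_language_code, urdu_share_percent)."""
    is_reliable, _, details = pycld2.detect(text)
    # details holds up to three (name, code, percent, score) guesses
    _, primary_code, _, _ = details[0]
    urdu_percent = next((p for _, code, p, _ in details if code == "ur"), 0)
    return is_reliable, primary_code, urdu_percent


if __name__ == "__main__":
    sample = "یہ ایک مثال ہے"  # "This is an example" in Urdu
    print(classify_page(sample))

A page can then be counted as having Urdu as its primary language when the primary code is "ur", or as exclusively Urdu when the Urdu share reported by the detector covers essentially the whole document.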


... For instance, corpora of Arabic, Persian, Kurdish, Japanese, Bangla, Nepali, and multilingual content have been developed by crawling digital and social media content [15][16][17][18]. For the Urdu language, we previously built corpora of 1.28 and 8.0 million Urdu webpages [19,20]. With the growing number of online social media users, it is imperative to build a high-quality Urdu language corpus, as tools developed from web content do not extend to social media because of its short, abbreviated, and noisy text. ...
... Finally, in the context of Urdu language corpora, we have built two large-scale repositories of Urdu webpages. First, we processed 2.87 billion webpages from Common Crawl to develop a repository containing 1.28 million Urdu webpages [19,33]. Next, we used these webpages as seeds and crawled the WWW for three years (2017-2020) to build UrduWeb20, a high-quality corpus containing 8.0 million Urdu webpages [20]. ...
... In addition, we filter the Urdu words in order to examine the vocabulary richness of the dataset. We use the code range of Urdu characters to identify and remove words that contain even a single character outside this range [19]. Table 3 shows the total and unique numbers of Urdu uni-, bi-, and tri-grams in the dataset. ...
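
A rough sketch of that filtering and n-gram counting step is given below. The exact character range is not stated here, so the Arabic Unicode blocks and the whitespace tokenizer are assumptions.

from collections import Counter
import re

# Assumed code-point ranges covering Urdu (Perso-Arabic) characters
ARABIC_BLOCKS = (
    (0x0600, 0x06FF),  # Arabic
    (0x0750, 0x077F),  # Arabic Supplement
    (0xFB50, 0xFDFF),  # Arabic Presentation Forms-A
    (0xFE70, 0xFEFF),  # Arabic Presentation Forms-B
)


def is_urdu_token(token):
    # Reject a word if even a single character falls outside the ranges
    return all(any(lo <= ord(ch) <= hi for lo, hi in ARABIC_BLOCKS) for ch in token)


def ngram_counts(text, n):
    tokens = [t for t in re.split(r"\s+", text) if t and is_urdu_token(t)]
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
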
Article
The confluence of high-performance computing algorithms and large-scale, high-quality data has led to the availability of cutting-edge tools in computational linguistics. However, these state-of-the-art tools are available only for the major languages of the world. The preparation of large-scale, high-quality corpora for a low-resource language such as Urdu is a challenging task, as it requires huge computational and human resources. In this paper, we build and analyze Anbar, a large-scale Urdu-language Twitter corpus. For this purpose, we collect 106.9 million Urdu tweets posted by 1.69 million users during one year (September 2018 to August 2019). Our corpus consists of tweets with a rich vocabulary of 3.8 million unique tokens along with 58K hashtags and 62K URLs. Moreover, it contains 75.9 million (71.0%) retweets and 847K geotagged tweets. Furthermore, we examine Anbar using a variety of metrics such as temporal tweet frequency, vocabulary size, geo-location, user characteristics, and entity distribution. To the best of our knowledge, this is the largest repository of Urdu-language tweets for the NLP research community, and it can be used for Natural Language Understanding (NLU), social analytics, and fake-news detection.
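
The corpus statistics listed above (unique tokens, hashtags, URLs, retweets, geotagged tweets) can be tallied in a single pass over the data. The sketch below assumes tweets stored as one JSON object per line with the public Twitter v1.1 field names; the function and file layout are illustrative, not the authors' pipeline.

import json
from collections import Counter


def corpus_stats(path):
    tokens, hashtags, urls = Counter(), Counter(), Counter()
    total = retweets = geotagged = 0
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            tweet = json.loads(line)
            total += 1
            retweets += 1 if tweet.get("retweeted_status") else 0
            geotagged += 1 if tweet.get("coordinates") else 0
            tokens.update(tweet.get("full_text", tweet.get("text", "")).split())
            entities = tweet.get("entities", {})
            hashtags.update(h["text"] for h in entities.get("hashtags", []))
            urls.update(u["expanded_url"] for u in entities.get("urls", []))
    return total, retweets, geotagged, len(tokens), len(hashtags), len(urls)
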
... For instance, only 2.48-12.83% of Asian-language webpages contain content in a single language [40], [41]. Including multilingual content results in low-quality corpora for a target language. ...
... In this work, we build upon our previous approach [40], in which we developed a dataset of 1.28 million Urdu webpages from the CCC 2016 dataset. Our analysis of that dataset revealed that 84% of the webpages were noisy, which motivated us to design a framework for building high-quality LRL corpora. ...
... The Swiss-AL corpus contains 8 million texts and 1.55 billion tokens. Similarly, we built an Urdu-language corpus of 1.28 million Urdu webpages from the CC corpus of 2.87 billion webpages [40], [50]. ...
Article
Full-text available
The rapid proliferation of artificial intelligence has led to the development of sophisticated cutting-edge systems in the natural language processing and computational linguistics domains. These systems rely heavily on high-quality datasets/corpora for training deep-learning algorithms to build precise models. Preparing a high-quality gold-standard corpus for natural language processing at large scale is a challenging task due to the need for huge computational resources, accurate language-identification models, and precise content-parsing tools. This task is further exacerbated for regional languages due to the scarcity of web content. In this article, we propose Corpulyzer, a novel framework for building low-resource-language corpora, consisting of a corpus generation module and a corpus analyzer module. We demonstrate the efficacy of our framework by creating a high-quality, large-scale corpus for the Urdu language as a case study. Leveraging the Common Crawl Corpus (CCC), we first prepare a list of seed URLs by filtering the Urdu-language webpages. Next, we use Corpulyzer to crawl the World-Wide-Web (WWW) over a period of four years (2016-2020). We build the Urdu web corpus "UrduWeb20", which consists of 8.0 million Urdu webpages crawled from 6,590 websites. In addition, we propose a Low-Resource Language (LRL) website scoring algorithm and a content-size filter for language-focused crawling to achieve optimal use of computational resources. Moreover, we analyze UrduWeb20 using a variety of traditional metrics such as web-traffic rank, URL depth, duplicate documents, and vocabulary distribution, along with our newly defined content-richness metrics. Furthermore, we compare different characteristics of our corpus with three datasets of the CCC. In general, we observe that, unlike the CCC, which crawls a limited number of webpages from highly ranked Urdu websites, Corpulyzer performs an in-depth crawl of Urdu content-rich websites. Finally, we make the Corpulyzer framework available to the research community for corpus building.
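
The abstract names an LRL website scoring algorithm and a content-size filter without spelling them out; the sketch below is only an illustrative heuristic (score = fraction of target-language pages on a site, weighted by their average content size, with thin pages skipped), and every name and threshold in it is an assumption rather than the authors' algorithm.

from dataclasses import dataclass

MIN_CONTENT_BYTES = 2_000  # assumed content-size filter threshold


@dataclass
class SiteStats:
    pages_seen: int = 0
    target_pages: int = 0
    target_bytes: int = 0

    def update(self, is_target_language, content_bytes):
        if content_bytes < MIN_CONTENT_BYTES:
            return  # content-size filter: ignore thin pages
        self.pages_seen += 1
        if is_target_language:
            self.target_pages += 1
            self.target_bytes += content_bytes

    def score(self):
        # Sites with a higher score would be crawled more deeply
        if self.pages_seen == 0:
            return 0.0
        avg_richness = self.target_bytes / max(self.target_pages, 1)
        return (self.target_pages / self.pages_seen) * avg_richness
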
... Urdu is a major language spoken by more than 160 million people across the world. According to one study, about 0.04% of global web content is in the Urdu language [4]. It is written in the Arabic script, has a free word order, and exhibits rich inflectional morphology. ...
Article
Full-text available
Detecting the communicative intent behind user queries is critical for search engines to understand a user's search goal and retrieve the desired results. Due to increased web searching in local languages, there is an emerging need to support language understanding for languages other than English. This article presents a distinctive capsule neural network architecture for intent detection from search queries in Urdu, a widely spoken South Asian language. The proposed two-tiered capsule network utilizes LSTM cells and an iterative routing mechanism between the capsules to effectively discriminate diversely expressed search intents. Since no Urdu query dataset was available, a benchmark intent-annotated dataset of 11,751 queries was developed, covering 11 query domains and annotated with Broder's intent taxonomy (i.e., navigational, transactional, and informational intents). Through rigorous experimentation, the proposed model attained state-of-the-art accuracy of 91.12%, significantly improving upon several alternative classification techniques and strong baselines. An error analysis revealed systematic error patterns owing to class imbalance and large lexical variability in Urdu web queries.
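
The iterative routing mechanism mentioned above follows the routing-by-agreement idea of capsule networks. Below is a minimal NumPy sketch of that routing step, assuming the prediction vectors u_hat from lower- to upper-level capsules have already been computed; the dimensions are illustrative and this is not the authors' exact two-tiered architecture.

import numpy as np


def squash(s, eps=1e-8):
    # Shrink vectors to length < 1 while preserving direction
    norm_sq = np.sum(s * s, axis=-1, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + eps)


def dynamic_routing(u_hat, iterations=3):
    # u_hat: (num_lower, num_upper, dim_upper) prediction vectors
    num_lower, num_upper, _ = u_hat.shape
    b = np.zeros((num_lower, num_upper))                  # routing logits
    for _ in range(iterations):
        e = np.exp(b - b.max(axis=1, keepdims=True))
        c = e / e.sum(axis=1, keepdims=True)              # coupling coefficients
        s = np.einsum("ij,ijk->jk", c, u_hat)             # weighted sum per upper capsule
        v = squash(s)                                     # upper-capsule outputs
        b += np.einsum("ijk,jk->ij", u_hat, v)            # agreement update
    return v


v = dynamic_routing(np.random.randn(8, 11, 16))           # e.g. 11 intent capsules
print(v.shape)  # (11, 16)
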
... In terms of the data interface, the McBSP serial port is connected to the AIC23 through six pins: CLKX, CLKR, FSX, FSR, DR, and DX. The DSP and AIC23 transmit digital voice signals through the DR and DX pins [9]. ...
Article
Full-text available
Existing intelligent evaluation systems for spoken English suffer from low accuracy and poor evaluation quality. To improve the accuracy and speed of spoken-English evaluation, this paper presents the design of an intelligent spoken-English evaluation system based on the DTW (dynamic time warping) algorithm. The DTW algorithm is applied to recognize spoken English speech, and a new intelligent evaluation system is designed around it. The hardware units are a DSP chip selection unit, a spoken-English audio acquisition unit, and its external memory unit; the software modules are a speech preprocessing module, a speech recognition module, and an intelligent evaluation module. Through the design of the hardware units and software modules, the spoken-English intelligent evaluation system is realized. The experimental results show that the oral evaluation accuracy of this system is 65.63%-76.58% and the response time is 8.23-13.57 ms, demonstrating high accuracy and evaluation efficiency and improving the effect of intelligent spoken-English evaluation.
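
For reference, the DTW algorithm mentioned above computes an alignment cost between two feature sequences with a simple dynamic program. The sketch below is the textbook recurrence over, say, MFCC frames, not the tuned implementation described in the paper.

import numpy as np


def dtw_distance(x, y):
    # x: (n, d) and y: (m, d) feature matrices (e.g. MFCC frames)
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(x[i - 1] - y[j - 1])   # local frame distance
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]
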
Article
Full-text available
In this article, we introduce the first Kurdish text corpus for the Central Kurdish (Sorani) branch, called the AsoSoft text corpus. The Kurdish language, which is spoken by more than 30 million people, has various dialects. As one of the two main branches of Kurdish, Central Kurdish is the formal dialect of Kurdish literature. The AsoSoft text corpus contains 188 million tokens and has been collected mostly from websites, published books, and magazines. The corpus has been normalized and converted into the Text Encoding Initiative XML format. In both collecting and processing the text, we faced several challenges and propose solutions to them. About 22% of the corpus is topic-annotated with six topic tags, and a topic-identification task was carried out to evaluate the correctness of the annotation. Computational experiments on Central Kurdish text processing are also presented, with supporting statistics. For the first time, the validity of Zipf's law for Central Kurdish is presented, and the perplexity of the language is calculated using standard N-gram language models. The perplexity of Central Kurdish is 276 for a tri-gram language model.
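
The trigram perplexity reported above can be reproduced in spirit with NLTK's language-model module. The sketch below uses Laplace smoothing and toy sentences; the abstract only says "standard N-gram language models", so the smoothing choice and the data are assumptions.

from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
from nltk.util import ngrams

order = 3
train_sents = [["hello", "world"], ["hello", "there"]]   # toy tokenized sentences
heldout_sent = ["hello", "world"]

train_data, vocab = padded_everygram_pipeline(order, train_sents)
lm = Laplace(order)
lm.fit(train_data, vocab)

test_trigrams = list(ngrams(pad_both_ends(heldout_sent, n=order), order))
print(lm.perplexity(test_trigrams))   # lower is better; 276 is the corpus-level figure above
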
Conference Paper
Full-text available
Large Web corpora containing full documents with permissive licenses are crucial for many NLP tasks. In this article we present the construction of a 12-million-page Web corpus (over 10 billion tokens) licensed under the Creative Commons license family in 50+ languages, extracted from CommonCrawl, the largest publicly available general Web crawl to date with about 2 billion crawled URLs. Our highly scalable Hadoop-based framework is able to process the full CommonCrawl corpus on a 2000+ CPU cluster on the Amazon Elastic Map/Reduce infrastructure. The processing pipeline includes license identification, state-of-the-art boilerplate removal, exact-duplicate and near-duplicate document removal, and language detection. The construction of the corpus is highly configurable and fully reproducible, and we provide both the framework (DKPro C4CorpusTools) and the resulting data (C4Corpus) to the research community.
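
As an illustration of the exact-duplicate removal stage of such a pipeline, documents can be fingerprinted by hashing their normalized text and keeping only the first occurrence; the near-duplicate stage (not shown) typically uses shingling or SimHash. This is a generic sketch, not the DKPro C4CorpusTools implementation.

import hashlib


def dedup(docs):
    """Yield each document once, dropping exact duplicates of the normalized text."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha1(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc
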
Article
Full-text available
Despite being a paradigm of quantitative linguistics, Zipf's law for words suffers from three main problems: its formulation is ambiguous, its validity has not been tested rigorously from a statistical point of view, and it has not been confronted with a representatively large number of texts. So, the current support for Zipf's law in texts can be summarized as anecdotal. We try to solve these issues by studying three different versions of Zipf's law and fitting them to all available English texts in the Project Gutenberg database (consisting of more than 30,000 texts). To do so, we use state-of-the-art tools for fitting and goodness-of-fit testing, carefully tailored to the peculiarities of text statistics. Remarkably, one of the three versions of Zipf's law, consisting of a pure power-law form in the complementary cumulative distribution function of word frequencies, is able to fit more than 40% of the texts in the database (at the 0.05 significance level), for the whole domain of frequencies (from 1 to the maximum value) and with only one free parameter (the exponent).
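
The one-free-parameter fit described above can be illustrated with the standard maximum-likelihood estimate of a power-law exponent over word frequencies (the Clauset-style approximation for discrete data). The paper's full procedure additionally runs goodness-of-fit tests, which this sketch omits.

import math
from collections import Counter


def zipf_exponent(tokens, f_min=1):
    # MLE approximation for a discrete power law over frequencies >= f_min
    freqs = [f for f in Counter(tokens).values() if f >= f_min]
    return 1.0 + len(freqs) / sum(math.log(f / (f_min - 0.5)) for f in freqs)


text = "the quick brown fox jumps over the lazy dog the fox".split()
print(zipf_exponent(text))   # toy example; real estimates need full texts
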
Conference Paper
Full-text available
Arabic natural language processing (ANLP) has gained increasing interest over the last decade. However, the development of ANLP tools depends on the availability of large corpora. Unfortunately, the scientific community has a deficit of large and varied Arabic corpora, especially freely accessible ones. With the Internet continuing its exponential growth, Arabic Internet content has followed the trend, yielding large amounts of textual data available through different Arabic websites. This paper describes the TALAA corpus, a voluminous general Arabic corpus built from daily Arabic newspaper websites. The corpus is a collection of more than 14 million words and 15,891,729 tokens contained in 57,827 different articles. Part of the TALAA corpus has been tagged to construct an annotated Arabic corpus of about 7,000 tokens, using a POS tagger with a set of 58 detailed tags. The annotated corpus was manually checked by two human experts. The methodology used to construct TALAA is presented, and various metrics are applied to it, showing the usefulness of the corpus. The corpus can be made available to the scientific community upon authorisation.
Conference Paper
Full-text available
For linguistics-related research on a language, there is always a need for a large database that covers the language's features such as grammatical information, writing style, and syntax. A corpus provides a platform for the investigation of a natural language. Compared to other languages, very limited research has been done on the Urdu language due to its segmentation dilemma and difficult character shapes. Very little editable printed text data is available for Urdu; most of the data is available in graphical or picture format. To increase Natural Language Processing research on the Urdu language, there is a need for a large database that contains a range of variation in annotated Urdu handwritten as well as printed text. In this work, we propose a large database of Urdu text including 1,000 handwritten text images written by 500 different writers. Each image will contain four to six lines of Urdu text with 60-80 words per line, so the estimated number of words is around 0.35 million. Words will be selected from six different categories so that the maximum number of distinct words can be included. The corpus will be annotated for line as well as word segmentation, where a word may be an individual character or a component. The corpus will serve as a benchmark for the quantitative analysis of handwritten text recognition techniques for the Urdu language, such as text-line extraction, word segmentation, and character recognition, and for linguistic research on part of speech, writer identification, dictionaries, etc.
Conference Paper
Full-text available
The World Wide Web has grown so big, in such an anarchic fashion, that it is difficult to describe. One of the evident intrinsic characteristics of the World Wide Web is its multilinguality. Here, we present a technique for estimating the size of a language-specific corpus given the frequency of commonly occurring words in the corpus. We apply this technique to estimating the number of words available through Web browsers for given languages. Comparing data from 1996 to data from 1999 and 2000, we calculate the growth of a number of European languages on the Web. As expected, non-English languages are growing at a faster pace than English, though the position of English is still dominant.
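
The estimation technique can be made concrete with a small worked example: if a word's relative frequency f_w is known from a reference corpus and a search index reports N_w occurrences of it, the indexed corpus holds roughly N_w / f_w words, and averaging over several common words smooths the estimate. All numbers below are illustrative assumptions.

# Assumed relative frequencies of common Urdu words in a reference corpus
reference_freq = {"اور": 0.031, "کے": 0.028, "میں": 0.024}
# Assumed occurrence counts reported by a search index for the same words
index_hits = {"اور": 1_450_000, "کے": 1_300_000, "میں": 1_150_000}

estimates = [index_hits[w] / reference_freq[w] for w in reference_freq]
print(sum(estimates) / len(estimates))   # rough estimate of total words indexed
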
Conference Paper
The World Wide Web has emerged as the most important and essential tool for society. Today, people rely heavily on the rich resources available on the web for communication, business, maps, and social networking. In addition, people seek web content in their preferred regional language besides English. The global statistics of the World Wide Web are well known; however, its regional context is poorly understood. This paper presents a large-scale web study using the Common Crawl Corpus of December 2016. We examine 200+ terabytes of data with Amazon's Elastic MapReduce infrastructure. We analyze 2.87 billion web documents with respect to content type, domains, and content language. Furthermore, we explore multilingual web pages for European and Asian languages. Our results show that 97.8% of the web documents in our data are "text/html". In addition, 57.2% of web documents contain content in the English language, and web content in the Russian language has a 5.7% share, more than any other European language. Furthermore, we find that 60.6% of web documents have content exclusively in the English language. Finally, we find that Japanese and traditional Chinese content dominate the Asian web pages, with 1.89% and 1.23% shares, respectively. To the best of our knowledge, this is the first large-scale web study to explore the language mix present in web documents.
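
At a small scale, the per-document tallying described above can be sketched over a single Common Crawl WARC file. The sketch assumes the warcio and pycld2 packages and a placeholder file name; the December 2016 study itself ran on Elastic MapReduce rather than a single machine.

from collections import Counter
from warcio.archiveiterator import ArchiveIterator
import pycld2

content_types, languages = Counter(), Counter()
with open("CC-MAIN-example.warc.gz", "rb") as stream:      # placeholder file name
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        content_types[record.http_headers.get_header("Content-Type")] += 1
        payload = record.content_stream().read()
        try:
            _, _, details = pycld2.detect(payload.decode("utf-8", errors="replace"))
            languages[details[0][1]] += 1                   # primary language code
        except pycld2.error:
            pass

print(content_types.most_common(5))
print(languages.most_common(5))
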
Conference Paper
Web crawls provide valuable snapshots of the Web which enable a wide variety of research, be it distributional analysis to characterize Web properties or use of language, content analysis in social science, or Information Retrieval (IR) research to develop and evaluate effective search algorithms. While many English-centric Web crawls exist, existing public Arabic Web crawls are quite limited, which constrains research and development. To remedy this, we present ArabicWeb16, a new public Web crawl of roughly 150M Arabic Web pages with significant coverage of dialectal Arabic as well as Modern Standard Arabic. For IR researchers, we expect ArabicWeb16 to support various research areas: ad-hoc search, question answering, filtering, cross-dialect search, dialect detection, entity search, blog search, and spam detection. Combined use with a separate Arabic Twitter dataset we are also collecting may provide further value.
Toward Kurdish language processing: Experiments in collecting and processing the AsoSoft text corpus
  • H Veisi
  • M M Amini
  • H Hosseini
H. Veisi, M.M. Amini and H. Hosseini, Toward Kurdish language processing: Experiments in collecting and processing the AsoSoft text corpus, Digital Scholarship in the Humanities, 2020.
Ethnologue: Languages of the World
  • SIL International
SIL International. Ethnologue: Languages of the World. https://www.ethnologue.com/language/urd, June 2019.
Understanding regional context of World Wide Web using Common Crawl Corpus
  • M A Mehmood
  • H M Shafiq
  • A Waheed
M.A. Mehmood, H.M. Shafiq and A. Waheed, Understanding regional context of World Wide Web using Common Crawl Corpus, in Proceedings of the 13th Malaysia International Conference on Communications (MICC), 2017.
Elastic compute cloud (EC2)
  • Amazon
Amazon. Elastic compute cloud (EC2). https://aws.amazon.com/ec2/, June 2019.
Elastic map reduce (EMR)
  • Amazon
Amazon. Elastic map reduce (EMR). https://aws.amazon.com/emr/, June 2019.
Exploratory analysis of a terabyte scale web corpus
  • V Kolias
  • I Anagnostopoulos
  • E Kayafas
V. Kolias, I. Anagnostopoulos and E. Kayafas, Exploratory analysis of a terabyte scale web corpus. arXiv preprint arXiv:1409.5443, 2014.
Statistics of common crawl corpus 2012
  • Sebastian Spiegler
Sebastian Spiegler. Statistics of common crawl corpus 2012. https://commoncrawl.org/2013/08/a-look-inside-common-crawls-210tb-2012-web-corpus/, June 2013.
langdetect: language detection library ported to Python
  • M Danilak
M. Danilak, langdetect: language detection library ported to Python. https://github.com/Mimino666/langdetect, June 2019.
Language-detection at wiki
  • N Shuyo
N. Shuyo. Language-detection at wiki. https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md.