Corpus Linguistics - Science topic

Explore the latest questions and answers in Corpus Linguistics, and find Corpus Linguistics experts.
Questions related to Corpus Linguistics
  • asked a question related to Corpus Linguistics
Question
3 answers
Thanks
Relevant answer
Answer
Tools such as GATE (General Architecture for Text Engineering) and NLTK (Natural Language Toolkit) can be used to annotate and analyze text corpora, allowing for deeper semantic analysis, and they can be adapted to work across languages depending on the data available. FrameNet, a linguistic resource that provides a structured inventory of meanings for words and sentences in different languages, may also be useful, as may EuroWordNet for European languages.
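As a concrete, purely illustrative starting point for the NLTK route, here is a minimal Python sketch of a first annotation pass: tokenization, POS tagging, and a WordNet synset lookup as a rough semantic inventory. It assumes the relevant NLTK data packages have been downloaded, and the sample sentence is invented:

    import nltk
    from nltk.corpus import wordnet as wn

    # One-time downloads (uncomment on first run):
    # nltk.download('punkt'); nltk.download('averaged_perceptron_tagger'); nltk.download('wordnet')

    text = "The committee annotated the corpus for semantic roles."
    tokens = nltk.word_tokenize(text)   # split the sentence into word tokens
    tagged = nltk.pos_tag(tokens)       # add POS tags as a first annotation layer
    print(tagged)

    # WordNet synsets give a rough semantic inventory for a word, comparable in
    # spirit to what FrameNet or EuroWordNet offer for their respective languages.
    print(wn.synsets("corpus"))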
  • asked a question related to Corpus Linguistics
Question
4 answers
Hello Everyone,
Can anyone guide me to find Corpus/ Training data for readability difficulty of English texts?
Thanks in advance
Udaysimha Nerella
Relevant answer
Answer
Rafal Rzepka Hi there! This link is down, do you know how to access it now? Thank you so much!
  • asked a question related to Corpus Linguistics
Question
2 answers
Hello everyone,
I'm compiling a list of all the Arabic L1 EFL learner corpora. If you know of any, please let me know.
Relevant answer
Answer
Hello, I am also curious about the corpora in Sketch Engine. I believe we can upload data from our own language to that platform, and if other users have created data in our language there, we can use it to make comparisons with other languages.
  • asked a question related to Corpus Linguistics
Question
7 answers
Hello ResearchGate Members,
I hope this message finds you well. I am currently exploring different tools for visualizing frequency collocations extracted from the AntConc program using network graphs. While I have tried VOSviewer and KH Coder, I've encountered challenges as they don't seem to generate graphs based on simple frequency.
I would greatly appreciate any recommendations or insights from the community on alternative tools or methods that effectively visualize frequency-based collocations through network graphs. Your expertise and suggestions will be invaluable in enhancing my research visualization.
Thank you in advance for your assistance, and I look forward to learning from your experiences.
Relevant answer
Answer
Thank you very much, Illia, for sending me the figure. I guess the frequency of the collocate is given via the saturation of the colour of the spot, whereas the distance from the node indicates the real distance of the collocate from the node in a text (your span is -5/+5).
  • asked a question related to Corpus Linguistics
Question
2 answers
For instance, I am interested in finding out how Christian identity is represented in a single text using corpus linguistics methods. How is it possible to do a qualitative analysis of the text without any other approach?
Relevant answer
Answer
If I were you, I would go for collocation analysis, choosing such lemmas as “Christian”, “faith”, etc., as collocation heads. It is also useful to utilize a large corpus, which usually yields more interesting results.
  • asked a question related to Corpus Linguistics
Question
10 answers
  1. Observational Analysis: Researchers observe and record conversations to study how individuals from diverse linguistic backgrounds interact.
  2. Transcription: Spoken language is transcribed into written form for detailed analysis, including pronunciation, intonation, and pauses.
  3. Coding and Categorization: Linguistic patterns and sociolinguistic variables are identified in transcripts, such as code-switching and language choices.
  4. Quantitative Analysis: Statistical techniques may quantify sociolinguistic phenomena like code-switching frequency or linguistic feature distribution.
  5. Qualitative Analysis: Researchers explore the meaning and context behind linguistic behaviors and language choices.
  6. Questionnaires and Surveys: Self-reported data from participants, including language preferences and attitudes toward languages, can be collected.
  7. Corpus Linguistics: Large collections of texts or spoken data are analyzed to uncover linguistic patterns.
  8. Experimental Studies: Researchers design experiments to manipulate variables related to peer interactions and sociolinguistic competence.
  9. Interviews: Semi-structured interviews provide insights into participants' experiences and perceptions.
  10. Audio and Video Recordings: Recordings capture spoken and nonverbal aspects of communication, such as gestures and facial expressions.
Relevant answer
Answer
I intend to use open-ended questions and target linguistics students, although I haven't yet begun to formulate the questions.
  • asked a question related to Corpus Linguistics
Question
2 answers
I am currently working on a project, part of which is for presentation at JK30 this year in March hosted at SFU, and I have been extensively searching for a part of speech (POS) segmenter/tagger capable of handling Korean text.
The one I currently have access to, and could get running, is relatively outdated and requires many modifications before it will run on the data.
I do not have a strong background in Python, have zero background in Java, and my operating system is Windows.
I wonder if anyone could recommend the best way to go about segmenting Korean text data so that I can examine collocates with the aim of determining semantic prosody, and/or point me in the direction of a suitable program or piece of software.
Relevant answer
Answer
Kerry Sluchinski You might try the following user-friendly POS taggers/segmenters for Korean language data:
1. KoNLPy: KoNLPy is a Python module for Korean natural language processing. It features a POS tagger as well as numerous tools for Korean language processing. KoNLPy is straightforward and well-documented.
2. KOMORAN: KOMORAN is an open-source Korean morphological analyzer and POS tagger. It is available as a command-line utility and as a Java library. For testing purposes, KOMORAN offers a user-friendly online interface.
3. Hannanum is a Korean morphological analyzer and POS tagger. It is a Java library that is built on a dictionary-based approach. Hannanum is simple to use and provides a user-friendly online interface for testing.
4. Korean Parser: Korean Parser is a dependency parser and part-of-speech tagger for Korean. It is written in Python and may be used as either a command-line utility or a Python library. Korean Parser is straightforward and well-documented.
5. Lingua-STS: Lingua-STS is a web-based tool for processing Korean language. It features a POS tagger as well as numerous tools for Korean language processing. Lingua-STS is simple to use and features an intuitive online interface.
These tools are all simple to use and can be used to segment Korean text data and perform POS tagging.
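As a purely illustrative sketch of the KoNLPy route mentioned above (it assumes KoNLPy and a Java runtime are installed; the Korean sentence is an invented example):

    from konlpy.tag import Komoran   # other backends: Hannanum, Kkma, Okt

    komoran = Komoran()
    sentence = "나는 책을 읽었다"        # illustrative sentence ("I read a book")

    print(komoran.morphs(sentence))  # morpheme segmentation
    print(komoran.pos(sentence))     # (morpheme, POS tag) pairs for collocate extraction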
  • asked a question related to Corpus Linguistics
Question
7 answers
Please let me know the name or URL of any comprehensive Bangla corpus data for SA or ER.
  • asked a question related to Corpus Linguistics
Question
2 answers
Hello everyone,
I am looking for a repository of corpora for sentiment analysis in the Bangla/Bengali language.
Thank you for sharing.
  • asked a question related to Corpus Linguistics
Question
3 answers
I'm using the Mann-Whitney test in a linguistic study to compare the frequencies of a linguistic feature in two collections of texts. One collection includes far more (x10) texts than the other. I've read that Mann-Whitney can be used to compare groups of unequal size, but the examples usually given are something like 224 vs 260, not 224 vs 2240.
Can I still use this test? Does it make sense to thin the bigger sample to match the smaller one? They're both random samples representative of a certain genre, so conceptually I think downsampling is possible.
Relevant answer
Answer
There is no need to thin the bigger sample. The test can be conducted on samples of different sizes.
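For illustration only, here is a small SciPy sketch; the two simulated groups are invented stand-ins for per-text frequency counts, and the sizes mirror the 224 vs 2240 case from the question:

    import numpy as np
    from scipy.stats import mannwhitneyu

    rng = np.random.default_rng(42)
    small = rng.poisson(3.0, size=224)    # per-text frequencies, smaller collection
    large = rng.poisson(3.2, size=2240)   # the ten-times-larger collection

    # mannwhitneyu handles unequal group sizes directly; no downsampling is needed.
    u_stat, p_value = mannwhitneyu(small, large, alternative="two-sided")
    print(u_stat, p_value)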
  • asked a question related to Corpus Linguistics
Question
4 answers
L2, Grice's maxims, speech acts, and spoken language approximate my area of interest.
Relevant answer
Answer
Check my paper here on creative writing assisted by corpora.
  • asked a question related to Corpus Linguistics
Question
4 answers
I'm doing topic modelling with a collection of technical documents related to device repair. The reports are extracted from different software packages at different repair shops. I need to do proper cleaning so the model focuses on the key words; specifically, I want to automatically remove useless words like:
* Additional findings
* External appearance
* Incoming condition, etc
These "fil-in / template word" are found in almost every document and there are even more others, the documents are collected from different sources and consolidated in one database from which I do the extractions.I already tried segregating by repair shop using tfidf, term frequency, bm25 and segregating by software.
Relevant answer
Answer
I am building an app to help with some of these problems; for example, we added a cleaner (somewhat inaccurately called 'remove custom stopwords') where you can input the words you want removed. You can try the app here, it's still in beta, and I would love your feedback: https://sagetextipocapp.azurewebsites.net/
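Independently of any particular app, the template-phrase cleaning described in the question can be sketched in a few lines before topic modelling. The phrase list, the two toy documents and the choice of two topics below are all invented for illustration; this uses scikit-learn:

    import re
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    # Template headings seen in the repair reports (illustrative list; extend as needed).
    template_phrases = ["additional findings", "external appearance", "incoming condition"]

    docs = [
        "Incoming condition: unit powers on. Additional findings: cracked housing.",
        "External appearance: scratched. Replaced the battery and cleaned the contacts.",
    ]

    # Strip whole template phrases first, then let CountVectorizer drop generic stopwords.
    pattern = re.compile("|".join(map(re.escape, template_phrases)), flags=re.IGNORECASE)
    cleaned = [pattern.sub(" ", d) for d in docs]

    vectorizer = CountVectorizer(stop_words="english", min_df=1)
    dtm = vectorizer.fit_transform(cleaned)
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)
    print(vectorizer.get_feature_names_out())   # template words no longer in the vocabulary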
  • asked a question related to Corpus Linguistics
Question
20 answers
I was trying to determine whether there are differences in the frequencies of words (lemmas) in a given language corpus starting with the letter K and starting with the letter M: some 50,000 words starting with K and 54,000 words starting with M altogether. I first tried using the chi-square test, but the comments below revealed that this was an error.
Relevant answer
Answer
Did you try Python word count?
  • asked a question related to Corpus Linguistics
Question
1 answer
I have a question as to whether collocations in corpus linguistics can be used to indicate diversity. I have a corpus of media news articles, and I managed to find the frequency of my target word and its collocates. I then used a regression analysis to find out whether demographic variables predicted the frequency of the target word, and another regression model with collocates as the outcome. I simply took the number of collocates for each article as the dependent variable, on the understanding that a greater number of collocates meant more diversity in the media representation. Consequently, I conclude that demographic variables predict this diversity, i.e., a country's cultural values significantly predict higher diversity in the media, as indicated by the number of collocates. The research aim is to explore whether national values predict diversity in the way a particular issue is presented in the media. Please tell me if this reasoning is sound, as I have very little background knowledge on this.
Relevant answer
Answer
Of course, the way an issue is presented by the media strongly interacts with the cultural value people attribute to it within a particular section of the population. Pre-labelled political parties or groups may take on diametrically opposite values depending on whether they are classified as "right" or "left", for instance, as the recent debate on the 6th January "assault" on Capitol Hill has shown.
The work of fringe actors, or an expression of politically organized right-wing groups?
  • asked a question related to Corpus Linguistics
Question
6 answers
I noticed that some scholars mentioned corpus-assisted method in Cognitive Translation Studies (CTS) or Cognitive Translation and Interpreting Studies (CTIS). However, the dominant method designs in CTIS are eye tracking-based or verbal report-based. I want to know more about how to utilize corpus tools in CTIS but I have not found any comprehensive introduction.
I only read some calls for corpus-assisted cognitive translation studies in Chinese and English academia. Only recently, I read a book chapter by Lang & Li (2020) about the cognitive processing routes of culture-specific linguistic metaphors in simultaneous interpreting. They have discussed many cognitive models but not enough for me as a layman to have a better picture of the whole area.
Thus, I am wondering whether there are any references I could use to help me go further in this regard. I have read some works in Cognitive Linguistics. Yes, some have used a corpus-driven approach to discuss cognitive linguistic issues, but the explanations do not seem very clear.
However, I am still curious about the views of translation scholars in CTIS. Do CTIS scholars actually believe that corpora can be used to analyze cognitive aspects of translation, given that this is not the dominant tool for this group?
Whether yes or no, I am also interested in the reasons.
Thanks for noticing and answering this question :)
Relevant answer
Answer
Dear Yufeng,
Translating is not only problem solving, and cognitive experience is definitely not only problem solving. Some product-analysis results may lead to very modest conclusions that can safely be assumed to point unequivocally to cognitive processing and phenomena. By analogy, consider what you can learn about what goes on in the mind of a shoemaker or a luthier by studying his shoes or her violins.
  • asked a question related to Corpus Linguistics
Question
4 answers
Hi, I'm searching for a corpus that contains doctor-patient speech/dialogue, or at least patient talk only. How can I find one? Any suggestions?
Relevant answer
Answer
You may check Staples (2015), the title is: The Discourse of Nurse-Patient Interactions: Contrasting the Communicative Styles of U.S. and International Nurses
  • asked a question related to Corpus Linguistics
Question
13 answers
I am working on a Natural Language Processing task and want to create my own corpus from some documents. Each document has approximately 500-600 words.
Can anyone suggest how to create a corpus, as I am new to this area of NLP?
Relevant answer
Answer
Interesting question: would you please elaborate on your NLP task so I can provide more suggestions?
In general, you need to think about (1) a corpus size that is large enough, (2) the representativeness of the corpus design, and (3) using a machine-readable format.
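As a practical starting point once the documents are saved as plain-text files, here is a minimal sketch using NLTK's corpus reader; the folder name and file pattern are hypothetical, and sentence splitting assumes the 'punkt' data package has been downloaded:

    from nltk.corpus.reader.plaintext import PlaintextCorpusReader

    corpus_root = "my_corpus"   # hypothetical folder holding one UTF-8 .txt file per document
    corpus = PlaintextCorpusReader(corpus_root, r".*\.txt", encoding="utf-8")

    print(corpus.fileids())                       # the documents found
    print(len(corpus.words()))                    # total token count across the corpus
    print(corpus.sents(corpus.fileids()[0])[:2])  # first two sentences of the first document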
  • asked a question related to Corpus Linguistics
Question
5 answers
Actually, I need semantic relations to agent nominals as well.
E.g. I need the verb 'grave' (Eng.: (to) dig), which has semantic relations to 'jord' (Eng.: dirt) and 'skovl' (Eng.: shovel), and of course a lot of other obvious relations.
I need the verbs in order to test how organizational resources (knowledge, money, stuff, which are all nominals) can be combined with verbs into tasks, e.g. "grav i jorden med skovlen" (Eng.: dig into the dirt with the shovel).
Relevant answer
Answer
I would suggest using Stanford CoreNLP to annotate your texts (corpus) with POS tags; I believe this package has models for several languages. Then extract the words with verb tags. Let me know if you have any questions.
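If a Python route is easier, here is an alternative sketch using Stanza (the Stanford NLP group's Python library) rather than the Java CoreNLP distribution, since Stanza ships a Danish model; the sample sentence comes from the question above, and everything else is illustrative:

    import stanza

    # One-time model download (uncomment on first run; assumes Danish is the target language):
    # stanza.download("da")
    nlp = stanza.Pipeline("da", processors="tokenize,pos")

    doc = nlp("Grav i jorden med skovlen.")
    verbs = [w.text for sent in doc.sentences for w in sent.words if w.upos == "VERB"]
    print(verbs)   # the extracted verbs can then be paired with co-occurring nominals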
  • asked a question related to Corpus Linguistics
Question
26 answers
I need to learn about authentic, specific uses of English to serve my ESP course designs, as I teach ESP courses in an EFL setting, which makes it even harder to reach such genuine language uses for specific purposes. I plan to make use of a concordancer for pedagogical purposes as well. I would be glad if you could suggest a few online concordancing tools that you have found effective.
Thank you in advance.
Relevant answer
Answer
Maybe Laurence Anthony's AntConc. You can google it and download it easily.
  • asked a question related to Corpus Linguistics
Question
2 answers
“The state of moral education and citizenship education within the schools of Kurdistan” this is my new research title!
I was wondering: what are the anticipated problems and challenges that I would face throughout my paper, and what resolutions would you propose?
Relevant answer
Answer
You can read my two papers on citizenship education. I may be able to help you with tools for quantitative research, perhaps something comparative.
  • asked a question related to Corpus Linguistics
Question
11 answers
I know the formula for calculating normalised frequency, but I want to know whether there is existing software to aid in calculating normalised frequencies.
Relevant answer
Answer
Dear Peter,
While there are some data normalisation/standardisation software packages, as suggested by dear Milos, I believe normalisation of frequency counts can be done manually through a very basic mathematical/statistical formula. Personally, in order to control for length variation and make the comparison between my datasets of different sizes possible, I normalise the occurrence counts of the finalized identified features following Biber et al. (1998): (raw frequency count / number of words in the text) x 1,000 = normalized frequency count.
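The Biber et al. (1998) formula quoted above is trivial to script; a minimal sketch with invented example numbers:

    def normalized_frequency(raw_count, text_length, basis=1000):
        """Biber et al. (1998): (raw count / number of words in the text) * basis."""
        return raw_count / text_length * basis

    # Example: 37 occurrences in a 5,230-word text, normalized per 1,000 words.
    print(normalized_frequency(37, 5230))   # ~7.07 occurrences per 1,000 words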
  • asked a question related to Corpus Linguistics
Question
3 answers
Could you please tell me what the best available Arabic speech corpora for a TTS system are? Please include even non-free options.
Relevant answer
Answer
Text to speech system, Shatha
  • asked a question related to Corpus Linguistics
Question
2 answers
If you build your own corpus to address specific research questions, which method do you use to make sure it is saturated? I'm interested in methods because I work on digital data, and I wonder which method is more efficient and less time-consuming.
Relevant answer
Answer
In Corpus design, the "saturation corpus" is associated with the concept of "representativeness", developed by Douglas Biber: <http://otipl.philol.msu.ru/media/biber930.pdf>.
Here are some other sources, from Lancaster University, that might interest you, as well as a short paper from the University of Birmingham on a quantitative approach to corpus representativeness: <http://www.lexytrad.es/assets/cl24_0.pdf>
  • asked a question related to Corpus Linguistics
Question
3 answers
I'm willing to collaborate on any research in the field of corpus linguistics, data-driven learning or materials design, as long as the final work is published in a peer-reviewed journal. My master's viva is due next month, and you can read part of my thesis on my profile.
Relevant answer
Answer
This is really interesting.
  • asked a question related to Corpus Linguistics
Question
12 answers
Good day! I need some topic suggestions for my Language and Linguistic Research class. Can you please help me with a researchable topic? I prefer applied, corpus, or sociolinguistics. Thank you!
Relevant answer
Answer
Applied linguistics includes language teaching (the first language and the second language), different kinds of written and spoken texts (corpus linguistics), style, sociolinguistics, compiling dictionaries, machine translation, language and the brain (neurolinguistics), and language disorders.
Good luck
  • asked a question related to Corpus Linguistics
Question
4 answers
I am looking for something similar to the OpenText.org project that has developed annotated Greek texts.
There is the University of Maryland Parallel Corpus Project that is annotated in conformance with the Corpus Encoding Standard and that also includes English. Unfortunately though, I haven't found any syntactically annotated version of the English text yet.
Relevant answer
Answer
Hi,
you might want to take a look at the following link. It is a King James Version New Testament, parsed with the Charniak parser into Penn Treebank format:
Please note the caveats in the accompanying README. In particular, no cleanup or hand-correction has been made to the data.
Hope this helps.
Best wishes,
Dr. Ulrik Sandborg-Petersen
Denmark
  • asked a question related to Corpus Linguistics
Question
3 answers
“Philosophical discussion in the absence of a theory is no criterion of the validity of evidence.”
-- A. N. Whitehead. Adventure of ideas. (1933:221)
In the case of an investigation, or in a disciplinary technology, empirically (irrationally speaking, i.e., speaking in a strictly non-Cartesian way), data/corpora are the raw material (ephemeral 'arbitrary signifiers' in the case of linguistics) used to build up a theory following the inductive method.
Why, then, is mere 'corpus' tagged onto linguistics, an epistemological disciplinary technology?
‘Corpus’ is not tagged with Physics, Geology, Psychology, Sociology etc (e.g., Corpus Physics or Corpus Sociology), though they are also dealing with data!
Collecting data and arranging (typing?) them in a digital machine does not involve any knowledge or wis(h)dom, but a special skill that needs clerical precision. Documentation, no doubt, is a tiresome job. Utilizing a tool (a digital machine) as a repertoire does not necessarily entail the birth of a discipline.
Ascribing static ("thetic...", Kristeva, 1974) meaning to those entries does require epistemology, but that can be handled by well-established theory-based disciplines: lexicology, semantics, pragmatics, etc. If we have such levels of linguistic analysis, do we need such a dubious coinage as "Corpus Linguistics"?
And each empirical discipline needs data for further observation, experimentation and inductive generalization (one may raise Popper’s [1934, 2009] points for refuting Inductivism here), i.e., data is an initial part of the whole, but neither a theory nor a praxis.
However, it is a celebrated discipline now! Why is that so? What is the purpose of such a discipline?
My friend says, “We, the residents of the so-called third world, are part of the data-collection team—don’t you understand that? How dare you? You cannot be allowed to perform theoretical plays.” (Galtung, 1980)
Relevant answer
Answer
Perhaps you are referring to the way that corpus linguistics tools enable us to track the "typical" ways in which words are combined? This sheds a lot of light on "priming", for example, which would probably link up with a (fairly) behaviouristic notion of human language, even though, as we all know, these "habits" are only part of the story. The language we produce is not just the outcome of habit!!
  • asked a question related to Corpus Linguistics
Question
9 answers
I recently had an article published for which I researched two Kazakh proverbs using ethnographic as well as corpus linguistic methods. As I consider expanding this project, I am interested in reading about comparable projects.
Relevant answer
Answer
Hi Judit,
Thanks so much for your feedback. This was a conference paper that I worked on a bit and had published. Now, I am looking to do a larger project along the same lines.
  • asked a question related to Corpus Linguistics
Question
13 answers
I'm aware of some projects in sociolinguistics and historical linguistics that share their data either in an open access format, without any substantial restrictions or delays, or without any "application" process as long as the work is for non-profit purposes. The idea is that everything that goes beyond a simple "Safeguard" letter hinders the maximal exploitation of limited and valuable resources.
These best practice examples, which make (often publicly funded) data collections available to the public, deserve recognition. While I can think of many historical data collections, such as the Helsinki Corpora Family or the BYU corpora, the more contemporary the data get, the fewer resources are publicly accessible. On the more contemporary end, I can think of, as exceptions,
* the Linguistic Atlas Project (http://www.lap.uga.edu)
and our own
* J. K. Chambers Dialect Topography database (http://dialect.topography.chass.utoronto.ca)
* Dictionary of Canadianisms on Historical Principles (www.dchp.ca/dchp2).
Which other projects of active data sharing do you know?
I'd appreciate your input for a list of Best Practice Data Collections that I'm preparing.
Best wishes,
Stefan D.
Relevant answer
Answer
MICASE (the Michigan Corpus of Academic Spoken English) has been open access since completion. No registration required!
  • asked a question related to Corpus Linguistics
Question
16 answers
I mean that 'this' will be a very frequent item in the corpus compared with terms for emotions such as 'anger', so I wonder whether there is any qualitative way of investigating very frequent items. I will appreciate all your suggestions. Thanks in advance.
Relevant answer
Answer
In my opinion, analysing determiners in English corpora can be tricky.
One problem is that you have classes of determiners which have identical pronominal forms (possessive determiners/pronouns, demonstrative determiners/pronouns). You can minimize this problem by using part-of-speech tagged corpora. Many corpora today come in a part-of-speech tagged version already. By using part-of-speech tags, you can specify whether you are looking for a pronominal form, a determiner form, or both, when looking for a word form such as 'this'.
Second, English has noun phrases where an overt determiner is required, and others where the determiner is either optional or not permitted (as per the standard). NPs without an overt article form (sometimes called zero article) can only be found automatically when the corpus is annotated for syntactic structure. A corpus that is annotated for syntactic structure is referred to as a 'parsed' corpus, and they are quite rare (due to the high error rate involved in automatic parsing). However, using tools such as the UAM CorpusTool, you can parse a corpus yourself.
Finally, you mentioned that you are looking for the use of determiners in the context of emotions such as 'anger'. I fully agree with the suggestions above: You can look for determiners that contain certain words in their immediate context. In corpus linguistics, this is often called 'collocation'. Using a collocation approach, you could compile a list of nouns (or verbs, adjectives) that express the notion of anger, and then ask the corpus software to only give you those instances of 'this' which has one of the words from your lists in its immediate context (e.g. within a range of 9 words to the right or the left). Some corpora are even tagged for synonyms (e.g. the corpora on the popular corpus.byu.edu platform), so that you could search for 'this' and any synonym of 'anger' in its immediate context.
I apologize for the somewhat lengthy response, and hope it was helpful (I explained some corpus-linguistic terminology because I don't know your background). Feel free to ask a follow-up.
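To make the windowed-collocation idea concrete, here is a small, purely illustrative Python sketch (invented token list, invented anger lexicon, and a 9-word span as in the example above); in practice you would run this over the tokenized corpus or simply use the corpus tool's own collocation function:

    anger_words = {"anger", "angry", "fury", "furious", "rage", "outrage"}  # illustrative lexicon
    span = 9   # words to the left and to the right of the node

    tokens = ["this", "decision", "provoked", "outrage", "and", "anger", "among", "voters"]
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == "this":
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            if any(w.lower() in anger_words for w in window):
                hits.append((i, " ".join(window)))
    print(hits)   # positions of 'this' with an anger word in its context window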
  • asked a question related to Corpus Linguistics
Question
4 answers
Hi,
I'm looking for accessible/online corpora and tools to help me calculate the phonetic and phonological complexity of words in German (e.g. Jakielski's Index of Phonetic Complexity, IPC and the like) -- as well as any pointers to what useful measures of phonological complexity that have been identified experimentally.
Thanks very much in advance!
Relevant answer
Answer
Dear Gina,
Here you have a comprehensive list of Speech analysis and transcription tools for several languages.
I hope this helps.
Kind regards,
Begoña
  • asked a question related to Corpus Linguistics
Question
2 answers
Does anyone know of any open-access multi-million word bilingual (French/English) corpora that are both aligned (paragraph or sentence level) and tagged for POS and Lemma? I'm familiar with OPUS, but regrettably the alignment and the tags are not very reliable.
Relevant answer
Answer
Dear Adrian,
That sounds quite interesting.  I've had a look at your link, and I'll definitely be contacting you soon with some questions about the corpus structure.
Thanks,
Daniel
  • asked a question related to Corpus Linguistics
Question
15 answers
I am currently doing my MA dissertation and am required to code my data, but I don't have other coders to ensure inter-rater reliability (due to time constraints). As Mackey and Gass (2005) suggest, I repeated the data coding in two different periods (Time 1 and Time 2) for intra-rater reliability; however, the results at Time 1 and Time 2 were slightly different. If this happened in the case of multiple coders, they could discuss the disagreements in their coding and decide on one definitive set of coded materials. As I am the only researcher, in a situation in which negotiation with other coders isn't possible, how can I decide which coding to use in my research? Thank you.
Additional info: I am doing research on (corpus) linguistics, specifically how writers express doubts in their research papers by looking at how many times, for example, the modal verb "may" appears in their texts. Since "may" can have multiple meanings other than expressing doubts (e.g. to express permission as in "You may go now"), I need to exclude those which do not function to reflect uncertainty. I have tried converting them into categorical data (e.g. 1 for expressions of doubts and 0 for non-expression of doubts) and I am thinking of using Cohen's Kappa for reliability test of my coding in Time 1 and Time 2. And perhaps I can try to resolve the little difference in both times by asking other people to help me judge/decide the definite sets of data to use.
Relevant answer
Answer
I agree that you should go beyond saying that your results are "slightly different." In particular, I would recommend calculating an inter-rater reliability index such as Krippendorff's alpha.
You might also consider whether doing this kind of re-rating is really necessary for your work. In some fields, such as communication studies, inter-rater reliability is almost a requirement if you are doing content analysis on media, but in other fields where qualitative research is more interpretive, it is not considered to be useful.
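If you do go down the agreement-index route, Cohen's kappa between your Time 1 and Time 2 codings takes only a couple of lines; a sketch with made-up binary codes (1 = "may" expresses doubt, 0 = it does not):

    from sklearn.metrics import cohen_kappa_score

    time1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]   # illustrative Time 1 codes
    time2 = [1, 1, 0, 1, 1, 1, 1, 0, 1, 0]   # illustrative Time 2 codes

    kappa = cohen_kappa_score(time1, time2)  # intra-rater agreement between the two rounds
    print(round(kappa, 3))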
  • asked a question related to Corpus Linguistics
Question
6 answers
I want to build a corpus to test a language identification system. Can you suggest links to collect textual data for these languages transcribed in Arabic characters :
Arabic
Pashto
Balochi
Kashmiri
Kurdish
Punjabi
Persian
Uighur
Urdu
Sindhi
Malay
Relevant answer
Answer
Dear Sadik,
Sketch Engine contains corpora for Arabic, Punjabi, Persian, Urdu and Malay.  You can prepare corpora for the remaining languages yourself using the WebBootCAT functionality in Sketch Engine. In case of any questions, just ask the support team of Sketch Engine at support@sketchengine.co.uk (they could also do that job for you, but there would be some costs involved).
Best regards,
Milos Jakubicek
  • asked a question related to Corpus Linguistics
Question
5 answers
Consider a grammar G having certain semantic rules provided for the list of productions P, where intermediate code needs to be generated and I follow the DAG method to represent it.
In that regard, what are the other variants of the syntax tree, apart from the DAG, for the same purpose?
Relevant answer
Answer
Hi Rebeka, a DAG is a variant (form) of a syntax tree which gives direction to it. There is no such thing as other variants. This is as much as I can suggest, since from your question it is not clear what exactly you are looking for.
  • asked a question related to Corpus Linguistics
Question
4 answers
I am looking out for parallel corpora either for English-Hindi translation or English-Marathi translation.
Relevant answer
Answer
http://opus.lingfil.uu.se/ is a nice resource of parallel corpora for many language pairs.
You have 13.4M tokens for en->hi. Not much for Marathi, though (en->mr: 6.2M).
  • asked a question related to Corpus Linguistics
Question
2 answers
I found two applications on GitHub for PHP,
but their POS-tagger results were not correct (all words were tagged as NN). Any help will be appreciated.
If there is a link to any other free parser that supports Arabic and can be integrated with PHP, that would also be appreciated.
Relevant answer
Answer
Hi Sara,
You may want to consider using "arabicFactored" as the Stanford parser model for POS tagging.
HTH.
Samer
  • asked a question related to Corpus Linguistics
Question
14 answers
Hello
For my research, I'm looking for a rich corpus of English literature. Does anybody know of a good one?
Relevant answer
Answer
(My answer assumes that you are referring to computer corpora of literary texts. I hope I have not misunderstood you.)
I am not familiar with specific literary corpora: I know they exist, but I do not know their size or composition. However, you might do well to bear in mind that a number of the main general corpora, including the OEC, the BNC and the COCA have a literature subcorpus.
Whether you would be better off with a smaller specific literature corpus or a larger general corpus which includes literary texts will depend on the nature of your research. If you are looking above all for a large amount of texts, then you might be better off with a mega corpus such as the OEC or the Coca: even though literature is not the mainstay of their composition, the sheer overall size of these two corpora (around 2 billion words in the case of the OEC and around half a billion in the case of Coca) will ensure that you are not short of literary texts. However, you would need to examine the composition of the literature subcorpus in these corpora, to see if it fits your purposes. If you require more information than that supplied with the corpus, you could always try contacting the corpus compilers.
The BNC is available on both SketchEngine and the Brigham Young University platform. You would need to check if it's the same version on both platforms, as it wasn't at one time. The OEC is also available on Sketch Engine, if you obtain prior permission to use it from the publishers (Oxford University Press.) Coca is available on the Brigham Young University platform.
  • asked a question related to Corpus Linguistics
Question
9 answers
I recently was going over some data compiled by the Ethnologue and soon discovered that the number of speakers listed exceeded the population for certain countries/regions.  Due to diasporas and communities of expats throughout the world, this did not altogether surprise me, but knowing that many countries are inhabited with people speaking many different languages, it made me wonder what percentage of inhabitants within each country spoke the official language of the country.
In the process, I have found that tracking down these percentages is proving to be a bit difficult.  Thus far, I've resorted to cobbling them together from a number of different sources and even making an educated guess for some of them.  This is not the ideal solution for collecting these figures, especially since I've already seen some pretty wide discrepancies.  For example, one source told me that only 72 percent of Spain's population speaks Spanish, while another told me that this percentage is 98.8 percent.
Relevant answer
Answer
No, and so definitely that it isn't worth trying to find one.
As for Ethnologue specifically, they are very strongly biased toward treating what many people would consider dialects as distinct languages. Ethnologue is also part of SIL, so if you're looking for information on languages spoken in different places, SIL will give you the link to Ethnologue.
And the CIA Factbook lists Spanish as the only language spoken in Colombia; that's enough to render it unreliable. Then there's the United States: even if you want to accept what it says, which is not a good idea, data at the level of "other Indo-European languages" and "Asian and Pacific Island" isn't what you seem to be looking for.
Also, no one knows how many languages (and we are pretending that we have a definition of "language" right now) are spoken in the regions of the world that are geographically characterized by very dense vegetation or mountains, or both. These would include, at a minimum, the Amazon-Orinoco-Ucayali-Xingu river system in South America, the Congo River system with its tributaries and the mountainous area in the Great Lakes region adjacent to it, the inner region of Southeast Asia, and New Guinea and the Solomon Islands.
  • asked a question related to Corpus Linguistics
Question
4 answers
Hi there,
I'm facing a problem in my quantitative analysis. The keywords list cannot be generated without uploading a reference corpus, and I cannot decide the appropriate reference corpus for my data.
My corpus is one million words in size; the genre is online news (UK), the period is 2013 to 2015, and the language is English.
I have read that genre and diachrony are more important factors to consider than others when choosing a reference corpus, especially in that differences in these two factors, unlike those in other factors such as corpus size and variety, bring about a statistically significant difference in the number of keywords.
Thus, BNC and Brown corpus, I think, would not be suitable due to the time gap in relation to my study.
Hope I could find an answer,
Thanks in advance for your help.
Relevant answer
Answer
Dear Amaal,
I would argue that "the bigger the better" is one of the key factors for the reference corpus when using it to extract keywords. In Sketch Engine you can try a number of big English corpora as reference corpora for keyword extraction, among others also the English Feed Corpus, which is a diachronic corpus of news feeds. You can also choose a part of this corpus (e.g. a single year) to be used as the reference data.
Best regards,
Milos Jakubicek
  • asked a question related to Corpus Linguistics
Question
6 answers
Hi everyone! I need to search for English lexical frequencies for a list of 60 words, and I'm looking for an online database that accepts a text file containing the list (or copy-pasting) as input. I have already tried COCA, but I can't find a way to submit the complete list; I can only do the search word by word. Any useful advice? Thank you!
Relevant answer
Answer
For COCA simply separate your list items by vertical bars as in 'this|is|my|list' and copy-paste it into the form.
  • asked a question related to Corpus Linguistics
Question
4 answers
I am trying to apply corpus linguistic methodologies to the study of language maintenance and/or shift, instead of using traditional methods such as questionnaires, interviews, etc.
Relevant answer
Answer
Studying language shift through documented language use is obviously better than questionnaires etc, but it presupposes some language tools that may not be available for all languages. I am speaking of general, multipurpose language tools here.
Ideally, you need (1) a corpus covering the period you wish to study (from before to after), (2) a digitised speller, i.e. some kind of full form generator documenting the standard. You can then run the full form generator against different time sections of the corpus and look at what you get, specify etc.
There are a number of difficulties which will need to be dealt with, the first one being the size of the corpus. For Norwegian we have both the full form generator and a corpus of 100 million + tokens, covering the period 1866-2015. But the full form generator would be useless on the text from before 1940, because the orthography before 1940 is too variable, and the text mass too small.
Then you should ideally have comparable text selections from the different time periods. But our corpus has virtually no newspaper text from before 1998, while newspaper text dominates after 1998. This is because early newspaper text has to be keyed in manually, and keying costs money.
We have run the sort of comparison I talked about earlier on the post-1940 part of the corpus. We were faced with some results which I believe would turn up for most languages, i.e.:
* half the possible word forms from the full form generator were not found in the corpus (Norwegian is a medium-inflected, compounding language, with eight possible noun forms);
* half of the word forms from the full form generator that occurred in the corpus occurred only once;
* half the tokens of the corpus occurred once only;
* ca. 350 tokens equalled 50% of the corpus token mass (prepositions, conjunctions etc.).
After this possibly discouraging comment, I would still encourage collecting actual language as a corrective to questionnaires etc. Ideally, such materials should be formatted and saved as part of a larger text corpus.
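The comparison described above, i.e. running the full-form generator against a time slice of the corpus, can be sketched very simply; the word forms and tokens below are invented placeholders rather than real Norwegian data:

    from collections import Counter

    generated_forms = {"hus", "huset", "husene", "husa"}   # hypothetical generator output
    section_tokens = ["huset", "ligger", "ved", "veien", "huset", "er", "gammelt"]

    counts = Counter(t.lower() for t in section_tokens)
    found = {f for f in generated_forms if counts[f] > 0}   # forms attested in this time slice
    hapax = {f for f in found if counts[f] == 1}            # attested forms occurring only once

    print(len(found) / len(generated_forms))   # coverage of the generator in this slice
    print(len(hapax) / max(len(found), 1))     # proportion of attested forms that are hapaxes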
  • asked a question related to Corpus Linguistics
Question
3 answers
Dear Colleagues, I hope you are as right as rain. We're working on a project titled "A comparative study of lexical bundles in spoken and written registers in politics". I wonder if you could please help me find proper corpora. Which texts and articles can be regarded as political and which as apolitical? Clarity and consistency in this definition and its application in compiling corpora is an important issue. Please help me tackle this hurdle. How about spoken corpora: how can I find them, and from which resources? More explanation of the procedure and sharing of your experience is warmly welcome.
Relevant answer
Answer
 For UK politics you could try using Hansard. Also some articles by Professor Stef Slembrouck (early 1990s) might be of interest.
  • asked a question related to Corpus Linguistics
Question
3 answers
As shown in the attached pictures, I get different results from the POS tagger and the parser, even though they come from the same producer (Stanford).
For example, look at the POS tag for the word "على" in the Statment3 file:
in the tagger the result is CD;
in the parser the result is NNP.
I am working on research about statistical Arabic grammar analysis, applying Naive Bayes classification and then optimizing the results using genetic algorithms.
I am also searching for efficient Arabic NLP tools that give me the features that specify the grammar analysis (E'arab), but I really haven't found any. If anyone has an idea or an interest in this research field, please share your experience and knowledge.
Relevant answer
Answer
The really strange thing is that the word you are discussing is neither a proper noun (NNP) nor a cardinal number (CD). It's a closed class preposition, which neither the parser nor the tagger should be getting wrong.
Actually, I believe you have a different problem: From your image I see the model you have selected for the parser is arabicFactoredBuckwalter.ser.gz. This model is probably expecting Arabic in Buckwalter transliteration using Latin characters, see:
The input you are giving to the parser is in actual Arabic characters, which probably never appear in the data it is trained on. I think you need to convert your data to romanized Buckwalter style transliteration before you can use the parser. Alternatively I think there is also a model for native Arabic characters, called arabicFactored.ser.gz (see this guide: http://nlp.stanford.edu/software/parser-arabic-faq.shtml).
  • asked a question related to Corpus Linguistics
Question
5 answers
Most studies have looked at main/subordinate clauses to infer the clausal architecture of a given language and how it has evolved over time. One of the most discussed topics is the pragmatic domain and the interplay between syntactic structure and information structure, e.g. topicalization, clefting, etc. Would you agree or disagree that infinitival clauses may be a better source for looking at word order change, since they are reduced clauses and we are not "distracted" by stylistic variation? Looking forward to your opinions! Thank you!
Relevant answer
Answer
I would differentiate between grammar and semantics. On the one hand, I doubt that infinitival clauses or word order, as syntactic elements, play an important role in variation and change, since grammatical rules are very specific to each concrete language.
On the other hand, I would study their semantic meaning to get rid of language-dependence. For instance, in the emotion domain, repetition of words must be studied to localize "hot spots". Below is an abstract from my PhD thesis (p. 21):
[Leech & Svartvik, 2003] describe grammatical means to express emotions (the code of a mean referred to hereafter is designated in brackets): interjections (299), e.g. Oh, what a beautiful present!; exclamations (300a), e.g. What a wonderful time we’ve had!; emphatic so and such (300b), e.g. I’m so afraid they’ll get lost!; repetitions (300c), e.g. This house is ‘far, ‘far too expensive!; intensifying adverbs and modifiers (301), e.g. We are utterly powerless!; emphasis (302), e.g. How ever did they escape?; intensifications of negative sentences (303a), e.g. She didn’t speak to us at all; negative noun phrases beginning with not a (303b), e.g. We arrived not a moment too soon; fronted negations (303c), e.g. Never have I seen such a crowd of people!; exclamatory and rhetorical questions (304, 305), e.g. Hasn’t she grown! and What difference does it make?. Note that this thesis uses findings by [Leech & Svartvik, 2003] for appraising emotions.
Similarly, Infinitival Clauses or word order markers could be used as identification anchors to localize variation and change since they identify conspicuous points through the syntactic structure. I would assume there are also more grammatical elements in a language to be investigated comprehensively. Not only Infinitival Clauses that do not "distract" by some stylistic variation.
  • asked a question related to Corpus Linguistics
Question
5 answers
I have already got the data from their reflection journals. I used WordSmith Tools.
The participants are immigrant workers who learn EFL. Thank you.
Relevant answer
I agree with Joel Walters that using context is the best way to teach conjunctions, besides asking learners to use them in real dialogues inside or outside class to be sure that they use them appropriately, and giving feedback about that to their colleagues.
  • asked a question related to Corpus Linguistics
Question
5 answers
I want to link up my research on address patterns with corpus linguistics. Although I am not very conversant with corpus analysis/stylistics, I want to extend my research area to corpus linguistics. Please advise on what aspects of address can be researched in corpus linguistics.
Relevant answer
Answer
Corpus linguistics is a methodology/approach rather than a theory. A corpus can support or provide evidence for your theories on address forms.
Depending on the data in the corpus you are using, you can look for frequency patterns in the use of terms of address according to the variables of age, gender, institutional role, language proficiency, formality/informality of discourse situation, etc. However, the corpus you are using (or constructing) has to have this information available to you in the form of metadata, otherwise you will have no way of linking the tokens you find in the corpus to such factors.
  • asked a question related to Corpus Linguistics
Question
10 answers
Is there anything, for instance, like the findings about preference organisation in conversation analysis, or hypercorrect patterns in sociolinguistics, or semantic prosody in corpus linguistics?
Relevant answer
Answer
Multimodality can be used to build inventories of the semiotic resources, organizing principles, and cultural references that modes make available to people in particular places and times: the actions, materials and artifacts people communicate with. This has included contributions to mapping the semiotic resources of visual communication and colour, gesture and movement, gaze, voice and music, to name a few.
Multimodal studies have also been conducted that set out to understand how semiotic resources are used to articulate discourses across a variety of contexts and media for instance school, workplaces, online environments, textbooks and advertisements. The relationships across and between modes in multimodal texts and interaction are a central area of multimodal research.
Multimodal research makes a significant contribution to research methods for the collection and analysis of digital data and environments within social research. It provides novel methods for the collection and analysis of types of visual data, video data and innovative methods of multimodal transcription and digital data management.
Four core concepts are common across multimodal research: mode, semiotic resource, modal affordance and inter-semiotic relations. Within social semiotics, a mode is understood as an outcome of the cultural shaping of a material through its use in the daily social interaction of people. The semiotic resources of a mode come to display regularities through the ways in which people use them and can be thought of as the connection between representational resources and what people do with them. The term modal affordance refers to the material and the cultural aspects of modes: what it is possible to express and represent easily with a mode. It is a concept connected to both the material as well as the cultural and social historical use of a mode. Modal affordance raises the question of what a mode is ‘best’ for what. This raises the concept of inter-semiotic relationships, and how modes are configured in particular contexts. These four concepts provide the starting point for multimodal analysis.
  • asked a question related to Corpus Linguistics
Question
5 answers
All according to classical definition, no more explanation needed.
Relevant answer
Answer
No; and one of the (big) problems is that the notion of 'semantics' in the context of formal systems does not have an unambiguous semantics [sic!].
Best,
Hans
  • asked a question related to Corpus Linguistics
Question
1 answer
Does anyone know of a study that includes a full text sample--minimum 800 words, but the longer the better--with the lexical chains marked up? I am looking to demonstrate a visual method of identifying lexical chains, and would like to compare the analysis that can be done using the visual method against a manually (or computationally) completed analysis. If there is a gold standard, that would be great, but otherwise, any full-text example will do! Thanks in advance for your help.
  • asked a question related to Corpus Linguistics
Question
3 answers
How can I analyse the characteristics of attributive clauses used by Senior Three Students based on a corpus?
Relevant answer
Biber, D. (1990). Methodological issues regarding corpus-based analyses of linguistic variation. Literary and linguistic computing, 5(4), 257-269.
For the third article, you may request full text from the author.  Here's the link.
  • asked a question related to Corpus Linguistics
Question
5 answers
I have been searching for a while for a freely available and reliable term extraction tool for Arabic (especially for single-word terms). Any suggestions will be highly appreciated.
Relevant answer
Answer
I've started working on AntConc only recently; it helps you to view the files. However, it is not very good for extracting words or searching for collocations. Laurence Anthony (the designer) promised to develop a new version (4.0)  that can help you process Arabic texts without messing things up. Meanwhile, I'm working on Wordsmith, which is not free unfortunately. All the best!
  • asked a question related to Corpus Linguistics
Question
3 answers
I would like to see how other researchers approached similar data sets. I have taken a couple of paths through it already, but I would be curious about other approaches to consider.
Relevant answer
Answer
I am quite conversant with content analysis and have been able to use it to analyse newspaper advertisements and universities' marketing communication materials (websites and prospectuses).
Hope you find them useful, let me know if you want to discuss further.
  • asked a question related to Corpus Linguistics
Question
4 answers
The application of the lexical approach to teaching lexical collocations in writing:
does an intensive programme for teaching lexical collocations guarantee the acquisition of operation on the idiom principle in writing?
Relevant answer
Answer
 I think Chris Turner has given some excellent pointers above. You might also find the following article interesting: Bahns, J. and Eldaw, M. (1993). Should we teach EFL students collocations? System 21(1): 101-114.
  • asked a question related to Corpus Linguistics
Question
9 answers
How can corpus linguistics and discourse analysis be applied to gender discourse in fictional texts?
Relevant answer
Answer
Critical discourse analysis often focuses on gender-see works by Lazar or Wodak. For applications of corpus research within critical discourse analysis see works by Mautner. However, critical discourse analysts have sometimes been accused of adopting a priorist stances regarding power abuses in language, which may adulterate their linguistic analyses,  and blind them to interpretations which do not fit with their point of departure. For a discussion of this problem, related to CDA per se rather than to gender analysis within this school, see Section 3 of Controversies in Applied Linguistics, edited by Seidlhofer.
Moving outside discourse analysis per se, there may be some mileage in using sentiment analysis to discover differences in the attitudes portrayed by male and female writers of fiction, and/or in the way that men and women are portrayed in fiction. You could explore the literature available on the use of corpora within sentiment analysis. I am not familiar with this area, but a quick Google search suggests that there are a number of studies available.
It might also be worth your while to carry out an analysis of fictional dialogues within Conversational Analysis. For one possible framework of Conversation Analysis, see  English Conversation by Amy Tsui.
As you are no doubt aware, many large corpora such as the Oxford English Corpus or the BNC allow you to carry out searches within the specific fiction subcorpus (called "imaginative" in the version of the BNC available on Sketch Engine). Both these corpora will tell you if the text has been written by a man or a woman. With this information, you could compile one specific fiction subcorpus containing texts written by male authors and another by female writers. Use the following link to help you create subcorpora: https://www.sketchengine.co.uk/xdocumentation/wiki/SkE/Help/HowTo/CreateSubcorpus.
Once you have your subcorpora, you can use many functions to establish possible differences between male and female writers, such as collocations, word sketches and key words. It might also be worth your while to explore possible differences in the way that men and women writers realize key speech functions (requesting, ordering etc) in fictional dialogues. This could be done by searching for specific strings which you feel are likely to occur in dialogues written by one particular sex or by establishing the frequency of strings that you have observed during corpus searches.
  • asked a question related to Corpus Linguistics
Question
20 answers
I am looking for colleagues who have experience in linguistic corpus analysis. What is your research about? Anyone doing analysis in German Corpora? Italian or French? Thanks a million for your input and experiences!
Relevant answer
Answer
Hi Birgit,
If you're interested in L2 German and comparable native data, I can recommend the Falko corpus:
I think this is the largest freely available learner corpus of German, and it contains essays and summaries from German learners of a variety of backgrounds, as well as comparable native German texts using exactly the same prompts (same essay topics etc.). There are also extensive annotations, including Target Hypotheses giving different versions of what native annotators would have written in cases where errors occur. These can be very helpful for studying specific types of errors.
Hope this helps,
Amir
  • asked a question related to Corpus Linguistics
Question
5 answers
The Ethnologue and Wikipedia give very little details about this language.
Relevant answer
Answer
As far as I can find out, only the Max Planck Institute for Evolutionary Anthropology provides some information. You can find it in OLAC (but I guess you have already been there):
  • asked a question related to Corpus Linguistics
Question
3 answers
According to the Glossary of Corpus Linguistics (Baker, Hardie and McEnery 2006), there is one composed of recordings from Dallas Fort Worth, Logan International and Washington National airports. Access is paid only and is available through the Linguistic Data Consortium, but since I am not really interested in the data but in the corpus structure and design, I would like to know whether there is another one available online, or any papers about them. Thanks!
Relevant answer
Answer
Psycholinguist Rainer Dietrich of the Humboldt University (Berlin) has worked on the language of ATC. You might find references to corpora in his work.
  • asked a question related to Corpus Linguistics
Question
4 answers
I am carrying out research into "some" and "any" using the OEC, which I access via Sketch Engine. Many of my searches into specific patterns with "some" and "any" produce far too many results for one researcher working alone. To overcome this, I have been using the Sample Size Calculator from Survey System (www.surveysystem.com/sscalc.htm): I set the Confidence Interval (CI) at 4 and the Confidence Level (CL) at 95% and enter the total number of hits for the pattern in the Population Size box. So, for example, 5804 total examples at CI 4 and CL 95% gives a random sample size of 544 examples, while 15,000 examples gives a random sample size of 577. Does this seem a sensible way of calculating random sample size? Does anyone have any better ideas?
Relevant answer
Answer
Perhaps you could use AntConc to search for and identify the pattern in both your randomized samples and the whole corpus, and compare the results to see whether the randomized samples are representative.
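For what it is worth, the figures quoted in the question can be reproduced with the standard sample-size formula for a proportion plus the finite-population correction, which is presumably what the Survey System calculator uses (z = 1.96 for 95% confidence and worst-case p = 0.5 are assumptions on my part):

def sample_size(population, margin_pct=4.0, z=1.96, p=0.5):
    # margin_pct is the confidence interval in percentage points (here +/- 4).
    e = margin_pct / 100.0
    ss = (z ** 2) * p * (1 - p) / (e ** 2)          # infinite-population sample size (~600 here)
    return round(ss / (1 + (ss - 1) / population))  # finite-population correction

print(sample_size(5804))   # -> 544, matching the calculator
print(sample_size(15000))  # -> 577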
  • asked a question related to Corpus Linguistics
Question
21 answers
Maybe a tool that would also let me annotate parallel texts?
Hi everyone! I'm a linguist with basic computer skills, so I have only vague notions of Java, Python and other programming languages. I'm interested in annotating a small parallel corpus for discourse relations and connectives, so I need to be able to define several criteria in my analysis (arguments, connectives, explicitness/implicitness, etc.). I would welcome any suggestions... Thanks!
Relevant answer
Answer
Hi Sorina,
I am using SALT for Spanish and English (http://www.saltsoftware.com/). I don't know which languages you need to manage. It is a very user-friendly tool. You can transcribe, define your own lists of words (concordances) and declare your own tags ([tag]).
You can also check the CHILDES project tools (http://childes.psy.cmu.edu/).
Hope it helps.
Good luck!
  • asked a question related to Corpus Linguistics
Question
10 answers
For most of my projects I use R to manage my big data and run statistical analyses on the results. My domain of research is linguistics (computational linguistics, corpus linguistics, variational linguistics), and in this case I'm working with big heaps of corpus data. However, R feels sluggish, and my system isn't the bottleneck: when I look at the task manager, R only uses 70% of one CPU core (4 cores available) and 1 GB of RAM (8 GB available). It doesn't seem to use the resources available. I have looked at multicore packages for R, but they seem highly restricted to certain functions and I don't think I'd benefit from them. Thus, I'm looking for another tool.
I am looking at Python. Together with pandas it ought to be able to replace R. It should be able to crunch data (extract data from files, transform it into a data table, mutate it, do a lot of find-and-replace and so on) *and* run statistical analyses on the resulting final (crunched) data.
Are there any other tools that are worth checking out to make the life of a corpus linguist easier? Note that I am looking for an all-in-one package: data crunching and data analysis.
Relevant answer
Answer
Hi Bram,
I'm an experienced Python user but don't really know much about that specific scientific discipline. In any case you have a Python library (Natural Language Toolkit) that should be able to help you a lot:
Besides this, for numpy-based data (arrays) you may want to look at (besides pandas) scikit-learn for machine learning algorithms:
Matplotlib for plotting:
http://matplotlib.org/ (although there are plenty of others if you need more performance in plots)
I'm not sure what format your raw data comes in, but there are a lot of libraries for dealing with CSV and Excel formats, among others...
I would advise you to install the 64-bit version of everything. Python has official 64-bit releases but some libraries don't, so this link might help you in that regard:
I would also advise you to install a more complete package set instead of one library at a time. WinPython (for Windows) has an excellent installation manager (besides already coming with a lot of stuff):
Anaconda is another (64-bit) distribution, also good for Mac or Linux:
Hope it helps.
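To make that concrete, here is a minimal sketch of the crunch-then-analyse pipeline the question describes, using NLTK for tokenization and pandas for the resulting table (the corpus/ folder is a hypothetical directory of plain-text files, and nltk.download('punkt') is needed once for the tokenizer):

from pathlib import Path
from collections import Counter
import nltk
import pandas as pd

rows = []
for path in Path("corpus").glob("*.txt"):
    tokens = nltk.word_tokenize(path.read_text(encoding="utf-8").lower())
    counts = Counter(t for t in tokens if t.isalpha())
    rows.extend({"file": path.name, "word": w, "freq": f} for w, f in counts.items())

df = pd.DataFrame(rows)
# e.g. the 20 most frequent words across the whole corpus:
print(df.groupby("word")["freq"].sum().sort_values(ascending=False).head(20))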
  • asked a question related to Corpus Linguistics
Question
19 answers
With the exception of Language Explorer (FLEx), which is well-known.
It would be very helpful if you could mention a programme that is not heavy and can be easily used.
Relevant answer
Answer
I absolutely agree with Amy! I once used AntConc 3.2.4. It's convenient and easy to use!
  • asked a question related to Corpus Linguistics
Question
4 answers
I have a draft article analyzing a Kazakh trickster tale that was appropriated to present the idea of the "New Kazakh". In the article, I discuss how the folktale was adapted and then consider feedback from Kazakhs who met with me in focus groups. I would like to analyze this using the three levels of discourse as provided by Johnston (2002): representative, frame-aligning, and general. The article would explore the appropriated tale as frame-aligning discourse and the focus group discussions as general discourse, and then consider how the text and talk interface with all three aspects of frame discourse. What are other examples of a comparable investigation? Which articles and books would you recommend? Would you have any suggestions as to the steps in an effective process?
Relevant answer
Answer
Larisa,
Thanks so much for your reply. I have downloaded the book and look forward to reading it.
All the best,
Erik
  • asked a question related to Corpus Linguistics
Question
3 answers
I would like to know how big a corpus can be built using LingSync, and for what goals. I would also like to know to what extent such a corpus can be converted to a stand-alone online corpus.
Relevant answer
Answer
It is free; I just don't know how versatile it is or what other alternatives there are.
  • asked a question related to Corpus Linguistics
Question
4 answers
We have a high diversity of terms in our text corpus and want to filter all social humanities terms through thesaurus construction. Does anybody have experience with this and want to share/cooperate with us? Best, Veslava
Relevant answer
Answer
Dear Veslava,
As to the construction of a thesaurus of social humanities terms, all you need to do is first delimit the semantic field that includes these terms, and then look for their paradigmatic relations in context. To do so, you'd better start with the nucleus of the field. As semantic fields are gnoseological rather than ontological categories, it is up to you and your team what criteria and indicators you'll be applying in determining the focal or nuclear elements. Afterwards, you need to excerpt all the lexical items that match these criteria and analyze them at the level of word combinations, sentences and sentence sequences. The delicacy of analysis depends on what your goals and objectives in constructing the thesaurus are.
After you've delimited the zones of semantic density within the field, you need to start looking for less clearly related terms which can be found in other semantic fields as well. For example, "discrete system" refers to language but also to mathematics. Such terms require great care and deliberation, because you need to go through a variety of contexts in order to decide how salient, or rather how un-salient, they are with respect to the other terms in the field. Also, in constructing the thesaurus you need to decide whether you'll be including terms from neighbouring fields that have overlapping boundaries with social humanities terms. You might find it useful to check the Visual Thesaurus of the English Language; I think parts of it are available online, and it can give you interesting clues. Once the semantic field with all the criteria has been designed, you can proceed with the associative relations between the terms (connotation, stylistic variants, etc.). Then comes the time for paradigmatic relations. These are best identified on the basis of syntagmatic ones, e.g. which terms can replace/are opposite to/include/exclude, etc., the term X in the sentence XeP.
I'll be more than happy to share all my experience with your team. I've done the same for terms from the conceptual domain WATER/LIQUIDS and it was quite an adventure.
  • asked a question related to Corpus Linguistics
Question
4 answers
I'm drawing (choropleth) maps visualizing language use in big text corpora, e.g. words which are attributed to places. Currently I'm doing that using R and the cshapes package (http://nils.weidmann.ws/projects/cshapes/r-package). I'm also experimenting with Nolan's and Lang's R package to produce interactive SVG graphs (http://www.omegahat.org/SVGAnnotation/SVGAnnotationPaper/SVGAnnotationPaper.html) to show tooltips on the map. As you can see here (http://www.bubenhofer.com/sprechtakel/2013/08/06/geocollocations-die-welt-der-zeit/), it works in general, but the resulting SVG (and also PDF) files are huge. There is also the problem that all text in SVG files produced by R is converted to vector graphics, which again increases the complexity of the plot. This seems to be a known problem of SVG in R (http://stackoverflow.com/questions/17555331/how-to-preserve-text-when-saving-ggplot2-as-svg).
What are better means to produce interactive maps showing a lot of data?
Relevant answer
Answer
I suggest head/tail breaks, if your data are heavy-tailed.
Jiang B. (2015), Head/tail breaks for visualization of city structure and dynamics, Cities, 43, 69-77.
Jiang B. (2013), Head/tail breaks: A new classification scheme for data with a heavy-tailed distribution, The Professional Geographer, 65 (3), 482 – 494.
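For readers who want to try this on their own data, the head/tail breaks idea, as I read it from Jiang (2013), is to repeatedly split the values around their mean and keep only the "head" (values above the mean) while that head remains a small minority. The Python below is my own illustration of that idea, not code from the papers:

import random

def head_tail_breaks(values, head_limit=0.4):
    data, breaks = list(values), []
    while len(data) > 1:
        mean = sum(data) / len(data)
        head = [v for v in data if v > mean]
        if not head or len(head) / len(data) >= head_limit:
            break
        breaks.append(mean)   # the mean becomes a class boundary
        data = head           # continue with the head only
    return breaks

sample = [random.paretovariate(1.5) for _ in range(10_000)]  # heavy-tailed toy data
print(head_tail_breaks(sample))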
  • asked a question related to Corpus Linguistics
Question
8 answers
I am seeking information on corpus building:
1. How big does it have to be to be defined as a corpus?
2. What specific methodologies could be used to build a corpus?
3. Examples of empirical studies that report on corpus development?
Links to any online sources would be much appreciated.
Relevant answer
Answer
You may like to have a look at the above link. The book is written by a few defining figures of corpus linguistics, including Sinclair. Size is no longer the paramount consideration, especially in the building of a specialised corpus, whereas representativeness (whether the text collection can represent the domains/fields/registers/genres of language use you are investigating) is crucial.
  • asked a question related to Corpus Linguistics
Question
50 answers
What were the main weaknesses of the generative semanticists' claim that "a grammar starts with a description of the meaning of a sentence and then generates the syntactic structure through the introduction of syntactic rules and lexical rules"?
Relevant answer
Answer
A good question. I think the best book to read about this is The Linguistics Wars by Randy Allen Harris (1993). He gives an enormous amount of detail about the arguments for and against generative semantics (GS) and comes to the conclusion on p. 241 that GS promised too much and failed to deliver: it claimed not just to handle semantics and syntax, but also pragmatics, fuzziness, logic, ... . As a result, representations were becoming more and more unwieldy and the main practitioners just seemed to abandon the ideas they had put forward, very often turning their back on generative grammar and founding cognitive grammar/linguistics, various functional approaches, etc. I don't think that their conception of syntax/semantics was ever really shown to be wrong.
  • asked a question related to Corpus Linguistics
Question
5 answers
Have you ever looked into the above topic? If so, could you possibly share your findings, or provide references? I've been asked to submit a paper about "soundscapes" in a couple of months and I would like to focus on sound symbolism in journalistic English, with a specific view to economic and financial terminology. Any suggestions and/or comments are very welcome. Many thanks, Antonio.
Relevant answer
Answer
I'm not sure if this is imagery or not but on the podcast Lexicon Valley, they talked about the term "fiscal cliff" (I think the episode was called "down is up" or "up is good" or something like that). They mostly talked about metaphor and language, but there were some comments about how the word sounded, with the repeated "ih" vowel and all the fricatives. 
  • asked a question related to Corpus Linguistics
Question
3 answers
A standard corpus is necessary to evaluate the performance of any retrieval or text analysis activity/experiment. Is there any standard corpus available, whether free or paid?
Relevant answer
Answer
EMILLE (Enabling Minority Language Engineering). The beta version of the corpus consists of:
  • 30 million words of monolingual written data (Gujarati, Tamil, Hindi, Punjabi);
  • 600,000 words of monolingual spoken data (Hindi, Urdu, Punjabi, Bengali, Gujarati);
  • 120,000 words of parallel data in each of English, Hindi, Urdu, Punjabi, Bengali and Gujarati.
Website: http://www.emille.lancs.ac.uk/
  • asked a question related to Corpus Linguistics
Question
3 answers
I'm interested in the syntactic annotation of English texts.
Thanks.
Relevant answer
Answer
Thanks for your answers. I want to compile my own corpus and study a syntactic phenomenon so I need to annotate it first. Thanks again!
  • asked a question related to Corpus Linguistics
Question
1 answer
What is the corpus used to train the OpenNLP English models such as the POS tagger, tokenizer and sentence detector? I am aware that the chunker is trained on the Wall Street Journal corpus; however, I am still not sure about the POS tagger, tokenizer and sentence detector.
Relevant answer
Answer
I don't have a specific answer to your question, but a paragraph in the OpenNLP wiki documentation may be an indication. The instructions to train and run a simple POS tagger for English specify the following list of training resources:
  • ERG from DELPH-IN
  • Maxent from OpenNLP v1.5
  • Perceptron from OpenNLP 1.5
  • NLTK Files from NLTK
  • CDT Files from Copenhagen Treebank
  • Penn Treebank 3 from LDC
  • asked a question related to Corpus Linguistics
Question
8 answers
In my linguistic data I have categorical predictors and a binomial response. However, the data set is too small (2,100 tokens) to include all of the predictors. I am running into an issue where adding or removing one predictor changes the significance of another. Intuitively, I could see some pragmatic factors in my data possibly interacting with semantic factors. I tried to use pairs() in R to look for interactions, but interactions between categorical variables are hard to interpret that way. Do you have any suggestions on how to construct the best model?
Relevant answer
Answer
I'm not into linguistics, so I can't give specific advice on methods, just some general observations based on general statistical principles.
How many predictors are you trying out?
What validation steps are you following?
With a binomial response you have a very high risk of overfitting, and what you are describing sounds like a classic case of it. Do you use any data reduction methods to reduce the number of input variables while retaining maximal explanatory information? In datasets with properties similar to what you describe, I would use a multi-layered validation procedure: start off with a subset of the data (75-80%), doing cross-validation to identify the most promising predictors, then use a validation set (the remainder) to test the performance of those predictors. Ideally you would then test that model on a newly collected set of data.
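If it helps, a minimal sketch of that split-and-validate idea in Python with scikit-learn (the file name and column names below are hypothetical placeholders for the 2,100-token data set, and one-hot coding of the categorical predictors is an assumption on my part):

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

df = pd.read_csv("tokens.csv")  # hypothetical file with one row per token
X = pd.get_dummies(df[["semantic_class", "pragmatic_context", "register"]])  # hypothetical predictors
y = df["variant"]  # the binomial response

# Hold out ~25% for a final check; explore predictors on the rest via cross-validation.
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
model = LogisticRegression(max_iter=1000)
print("cross-validated accuracy:", cross_val_score(model, X_dev, y_dev, cv=5).mean())
print("held-out accuracy:", model.fit(X_dev, y_dev).score(X_test, y_test))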
  • asked a question related to Corpus Linguistics
Question
6 answers
For my questionnaire, I have 7 groups of words, 36 words altogether. Each group contains 5-6 words that are semantically similar, or have very close "yield". They are in English. I wonder if there is a computerised way to come up with one word that would describe the overall meaning or bias of the whole group. Something like a multiple-words-in-one thesaurus.
many thanks
Relevant answer
Answer
Hello Matej,
You can use WordNet to get all the synonyms of a word. R, Python and GATE have packages that call WordNet services.
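For instance, here is a minimal sketch of finding one covering concept for a word group with NLTK's WordNet interface; the word group is hypothetical, and taking the first noun sense of each word is a crude simplification:

from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet') once

group = ["car", "bus", "train", "bicycle", "truck"]       # hypothetical group
synsets = [wn.synsets(w, pos=wn.NOUN)[0] for w in group]  # first noun sense of each word

common = synsets[0]
for s in synsets[1:]:
    shared = common.lowest_common_hypernyms(s)            # narrow to a shared hypernym
    if shared:
        common = shared[0]

print(common.lemma_names()[0], "-", common.definition())  # one word covering the whole group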
  • asked a question related to Corpus Linguistics
Question
9 answers
I am familiar with TLG and the Perseus Digital Project. I want to do corpus linguistics on Hellenistic Greek. Some of the things I need to do are search by POS, search by lemma, search by morphological element (reduplication, particular morpheme, stem formation, etc.) and search for collocates.
I am not sure that either of the above will do all of that. I am considering developing my own corpora, using a tagger that applies all of this to the corpora, and a search engine that will recognize what I have tagged.
Do I need to do this, or is there already a selection of tools that will get the job done?
Relevant answer
Answer
Do you know the PROIEL treebank? http://foni.uio.no:3000/ It has a considerable amount of Greek text from several periods (the core is the New Testament, but there is also Herodotus and some Byzantine chronicles). You can download fully lemmatised and tagged texts there and use them to train a morphological tagger. 
  • asked a question related to Corpus Linguistics
Question
10 answers
Note: So far I have experimented with an untrained TreeTagger, but (unsurprisingly) only with mediocre results :-/ Any hints on existing training data are also appreciated
The results so far can be viewed here: http://dh.wappdesign.net/post/583 (lemmatized version is displayed in the second text column)
Relevant answer
Answer
Hi Manuel,
there are indeed some options for lemmatizing German. In case you are already happy with a stemmer, you might want to have a look at this part of NLTK: http://www.nltk.org/api/nltk.stem.html
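For example, a minimal sketch of the stemmer route (stemming only, i.e. it chops off endings rather than returning dictionary lemmas):

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("german")
for word in ["Häuser", "gegangen", "schönsten"]:
    print(word, "->", stemmer.stem(word))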
If you need full lemmatization, you will probably find something useful here if you are familiar with Python:
If you prefer Java, you might want to look at Stanford's parser:
That is, to my knowledge, also able to parse and lemmatize German.
Python as well as Java have APIs that allow you to scrape Facebook; just google for it, it is easy to find. I hope this helps you.
Cheers, Markus
  • asked a question related to Corpus Linguistics
Question
38 answers
There are plenty of debates in the literature about which statistical practice is better. Both approaches have many advantages but also some shortcomings. Could you suggest any references that describe which approach to choose and when? Thank you for your valuable help!
Relevant answer
Answer
There are lots of papers on this, which will be a better way to inform your opinion than a small number of brief responses. Maybe we should list examples of these. I'll start with Efron at http://statweb.stanford.edu/~ckirby/brad/papers/2005BayesFreqSci.pdf, which I think provides a fairly direct answer to your question from someone whose opinions about statistics are much better to listen to than mine!
  • asked a question related to Corpus Linguistics
Question
5 answers
Thanks.
Relevant answer
Answer
Depending on what you're looking for, you might also consider PhonBank, which contains orthography and IPA transcriptions of a child's target and actual utterances. You can find a link to this on the CHILDES website (http://childes.psy.cmu.edu/phon/). These can be used with Phon (https://www.phon.ca/phontrac/wiki/Downloads), a software created for child language research and particularly useful for phonological analysis.
  • asked a question related to Corpus Linguistics
Question
3 answers
I am analyzing infinitival clauses in Latin and Old French. Could you suggest any research/studies of such clauses in general and/or in Indo-European languages? Thank you!
Relevant answer
Answer
Thank you for sharing these references!
  • asked a question related to Corpus Linguistics
Question
11 answers
I'm talking about things like themes, but also co-occurrence counts, etc. I can't seem to find any literature.
Relevant answer
Answer
I'm from the speech area rather than text, so I don't know this in much detail. Google might help you more. If you don't find anything, I'd suggest you take your text corpus with tagged anxiety/non-anxiety sentences and do some simple statistics/clustering of the parts of speech that tend to appear in those two classes.
It's about automatic, linguistics-based detection of deception attempts in text, but it might be helpful for detecting anxiety as well.
  • asked a question related to Corpus Linguistics
Question
3 answers
I'm very new to the field of big data analysis and I strongly believe there is know-how there that could also be beneficial in the field of corpus linguistics. Has anybody ever tried to merge corpus linguistics and big data methodologies?
Relevant answer
Answer
Big data: the key question is whether there exists a schema (type) or not. If there is a schema, we are very close to databases, except for the huge extension of "big". If not, then information retrieval comes into play, with its indexing techniques.
H. Wedekind
  • asked a question related to Corpus Linguistics
Question
16 answers
Besides common methodological limitations (e.g. limited historical data, chronological gaps, lack of annotated data), we still need to justify our choice, e.g. 500/1,000+ tokens per text, or per century, or per genre, etc. Thank you in advance for your valuable input.
Relevant answer
Answer
Hi,
just thought I'd add a couple more points to this very interesting discussion. I think that for a quantitative analysis some thought should be given to the degrees of freedom. If the quantitative analysis will look at, say, 8 features per sentence, then you'll need more data than if you are simply counting observations of the sentence type per time period. For my PhD thesis on English diachronic syntax (available from my profile), I used all the data I could suck out of the YCOE & Penn corpora, since I didn't know exactly how many features would be included in the final analysis.
Also, since the topic deals with corpora and Latin diachronic grammar: you might want to check out Barbara McGillivray's 2013 book Methods in Latin Computational Linguistics (http://www.brill.com/methods-latin-computational-linguistics), which has a chapter discussing a corpus-based approach to diachronic changes in Latin argument structure.
  • asked a question related to Corpus Linguistics
Question
15 answers
Our large SMS corpus in French (88milSMS) is available. User conditions and downloads can be accessed here: http://88milsms.huma-num.fr/
Is there a website that lists all corpora available to the NLP and text-mining communities?
Relevant answer
Answer
Hello,
Thanks Ali for the pointer. We can indeed help you share it with the HLT community and give it some further visibility at ELRA/ELDA (http://www.elra.info and http://www.elda.org). You can have a look at our ELRA Catalogue (http://catalog.elra.info/) and the Universal Catalogue (http://universal.elra.info/) and get in touch with us for any further information (http://www.elda.org/article.php?id_article=68). We'll be happy to help! Kind regards, Victoria.
  • asked a question related to Corpus Linguistics
Question
4 answers
I am currently building one using Lingsync and would like to see similar ones.
Relevant answer
Answer
These researchers seem to be working on something like that http://aclweb.org/anthology/D09-1150 Sounds like a very interesting project!
  • asked a question related to Corpus Linguistics
Question
13 answers
E.g. business reports and emails from a mix of domains.
Relevant answer
Answer
How about this one? 
"The PERC Corpus (formerly called the "Corpus of Professional English (CPE)") is a 17-million-word corpus of copyright-cleared English academic journal texts in science, engineering, technology and other fields. It was compiled as a part of the project of the Professional English Research Consortium (PERC) and is intended to be used for research in the field of Professional English. " Source: http://scn.jkn21.com/~percinfo/
  • asked a question related to Corpus Linguistics
Question
4 answers
Except Google Translate?
Relevant answer
Answer
The following are translation services that support the Albanian language:
Yandex even exposes a free API with which you can build an automated translation tool (unlike Google Translate).
Cheers!
  • asked a question related to Corpus Linguistics
Question
3 answers
I'm working on move analysis and would appreciate your views on the inter-rater process in analyzing text. The text analyzed is from the engineering discipline, so a 2nd rater with engineering expertise is needed to ensure that the 1st rater, who is a linguist, interprets the text correctly.
Is it sufficient for the 2nd rater to rate only half of the sample scripts instead of all of them? The reason given is to establish consistency in the 1st rater's analysis, enabling the 1st rater to rate the rest of the scripts independently.
It is understood that the inter-rater process is usually used to derive a similarity index between two raters for reliability purposes, and in the medical area especially, inter-rating involves the whole sample. I would appreciate your views on using selective samples for the inter-rater process in establishing consistency in the 1st rater's independent rating.
The idea of double-rating only a small number of the scripts instead of all of them comes from reflecting on the practice of using a 2nd rater for exam script marking. There, the 2nd rating is done on a small number of the scripts rather than on the whole set, as the purpose is to establish consistency in the 1st marker's rating rather than to check the similarity between the two markers.
Relevant answer
Answer
I agree with Mohammad that you get the most accurate inter-rater reliability score if all judgments are included. That being said, I have seen studies in which randomly chosen subsets of ratings or codes were used to calculate reliability. It is crucial that, if a sample is used, it represents every possible type of decision, preferably more than once; so if the number of categories and items being categorized is small, it will not work. It is also important that during training one rater does not "indoctrinate" the other rater(s) too much and create a sort of "group-think" whereby legitimate differences in how to code the data are not allowed to become a possible way to refine the coding categories. I believe this comment has been made before, though I cannot say exactly by whom (Lois Bloom maybe), so I cannot take credit for it.
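If you do go with a double-rated subset, the agreement on it can be quantified with Cohen's kappa; a minimal sketch in Python (the move labels and ratings below are invented purely for illustration):

from sklearn.metrics import cohen_kappa_score

rater1 = ["Move1", "Move2", "Move2", "Move3", "Move1", "Move3", "Move2", "Move1"]
rater2 = ["Move1", "Move2", "Move3", "Move3", "Move1", "Move3", "Move2", "Move2"]
print("Cohen's kappa:", round(cohen_kappa_score(rater1, rater2), 2))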
  • asked a question related to Corpus Linguistics
Question
3 answers
I am working on an ontology of language to build a lexical resource. I want to know what has already been done in this domain.
Relevant answer
Answer
You might also be interested in BabelNet:
  • asked a question related to Corpus Linguistics
Question
4 answers
The standards for morphological dictionaries.
I'm planning to make a morphological dictionary (finite-state transducers) for the Kazakh language that will analyze a word and find its stem. Are there any standard dictionary formats apart from Apertium's?
Is Apertium good for this task?
Relevant answer
Answer
Something of a de facto standard for finite-state morphologies, especially for morphologically more complex languages, has been lexc, from Koskenniemi's 1983 two-level morphology to Beesley & Karttunen's Xerox Finite-State Morphology. See http://code.google.org/p/foma and http://hfst.sf.net for the open-source clones. Notably, many Apertium languages and pairs use this too, Kazakh at apertium-kaz included. Apertium's code structure is steered towards RBMT; if you are envisioning other applications, we have a free/open-source repository at http://giellatekno.uit.no/ for building analysers, spell-checkers and other things in the traditional finite-state morphology style.
  • asked a question related to Corpus Linguistics
Question
4 answers
Does anyone know of a source where I can download a precompiled distributional thesaurus for German, or of an easy-to-use tool to generate one myself from raw text?
Relevant answer
Answer
Yes, you can find a German thesaurus from news corpora here:
The program to construct it is also there: http://sourceforge.net/projects/jobimtext/
Let me know if you need something bigger or different,
Chris
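If you want to roll your own from raw text, below is a rough sketch of the general idea behind a distributional thesaurus (co-occurrence vectors within a window, neighbours ranked by cosine similarity). This is only the textbook idea, not JoBimText's actual method, and the corpus file and query word are hypothetical:

from collections import Counter, defaultdict
from math import sqrt

def cooccurrences(tokens, window=2):
    vectors = defaultdict(Counter)
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                vectors[w][tokens[j]] += 1
    return vectors

def cosine(a, b):
    num = sum(a[k] * b[k] for k in set(a) & set(b))
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

tokens = open("corpus_de.txt", encoding="utf-8").read().lower().split()  # hypothetical corpus file
vectors = cooccurrences(tokens)
target = "haus"                                                          # hypothetical query word
neighbours = sorted(((cosine(vectors[target], v), w) for w, v in vectors.items() if w != target), reverse=True)
print(neighbours[:10])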
  • asked a question related to Corpus Linguistics
Question
14 answers
I am particularly interested in constructions that are not identifiable via regular expressions (see, e.g., Dufter 2009, "Clefting and Discourse Organization", where he could use regular expressions because he could look for co-occurring closed sets of words). I am interested in constructions such as Spanish Clitic Left-Dislocation, for which the only constant part is the clitic, while the dislocated phrases form an infinite set.
Relevant answer
Answer
Hi,
you'd have to preprocess your data so that you can recognize the construction of interest. If your construction is covered by POS patterns, then POS-tagging is enough; if you need something more complex, you'd have to parse (dependency or constituency) and define your extraction on the parse structures.
If the construction is not correctly recognized by the parser, it does not matter: You can give the parser an example of what you are looking for, and then search your collection for parse structures you get from that. As long as the parser error is consistent, you'll find what you're looking for.
In any case, fishing for a specific construction usually requires a lot of data. See http://sourceforge.net/projects/webcorpus/ for a project that can be used to perform data extraction, annotation and counting of arbitrary annotation structures on web crawls.
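As a concrete illustration of the "POS patterns are enough" case, here is a minimal sketch using spaCy's rule-based Matcher over a tagged Spanish sentence. spaCy is my own swapped-in choice (it is not mentioned above), and the pattern is only a toy approximation of a clitic left-dislocation: an optional determiner and a noun, a comma, then a clitic pronoun before a verb.

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("es_core_news_sm")  # assumes the Spanish model has been installed
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "DET", "OP": "?"}, {"POS": "NOUN"}, {"ORTH": ","},
           {"POS": "PRON"}, {"POS": "VERB"}]
matcher.add("CLLD_LIKE", [pattern])

doc = nlp("El libro, lo compré ayer en la feria.")
for _, start, end in matcher(doc):
    print(doc[start:end].text)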
  • asked a question related to Corpus Linguistics
Question
7 answers
When we transcribe a spoken corpus, can we describe the corpus we obtain as written?
Relevant answer
Answer
In corpus linguistics, “any language whose original presentation was in oral form” is considered as ‘spoken language’.
  • asked a question related to Corpus Linguistics
Question
4 answers
Corpus of Counseling/Therapy sessions
Relevant answer
Answer
Maybe you can read Irvin Yalom's books; he often includes pieces of transcriptions, and he also has a book with letters from a patient. I cannot remember its exact name right now, but it is something like "Therapy at two voices".
  • asked a question related to Corpus Linguistics
Question
11 answers
I've been working on a system which evaluates products based on consumers' comments, but I've had some issues detecting patterns in the responses. I was considering applying some statistical test. Which test should I use?
Relevant answer
Answer
You might want to try some simple measures of lexical association, like the t-test and the log-likelihood ratio. This will help you find bigrams (and perhaps longer n-grams) that occur more often than expected by chance, which can perhaps be useful. Various implementations already exist, including those in the Ngram Statistics Package: http://ngram.sourceforge.net
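To make the suggestion concrete, here is a minimal sketch with NLTK's collocation finder, which already implements both measures (the reviews.txt file is a hypothetical stand-in for your own comment corpus):

import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

tokens = nltk.word_tokenize(open("reviews.txt", encoding="utf-8").read().lower())
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(5)  # ignore very rare bigrams

measures = BigramAssocMeasures()
print(finder.nbest(measures.likelihood_ratio, 20))  # top bigrams by log-likelihood ratio
print(finder.nbest(measures.student_t, 20))         # top bigrams by t-score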
  • asked a question related to Corpus Linguistics
Question
5 answers
Larger corpora for Telugu, or a tagged Telugu corpus
Relevant answer
Answer
Hi! Great to know that you are working on corpus building for Telugu. Unfortunately, there is not enough web content for the different dialects of Telugu. You can find large amounts of corpora for Standard Telugu, which you can crawl and clean using the Natural Language Toolkit (http://nltk.org/).
  • asked a question related to Corpus Linguistics
Question
8 answers
I have tried to extract such a list from big corpora (ItWac, ItTenTen); the list reached up to 100,000 words, but there is too much noise, because the POS tagging in these corpora is not very precise.
Relevant answer
Answer
The "Wortschatz Leipzig" corpus might be a good source. It has been compiled from newspaper texts and can be downloaded in different sizes and formats at http://corpora.informatik.uni-leipzig.de/download.html. The plain text files at least contain (counted) wordlists.
  • asked a question related to Corpus Linguistics
Question
10 answers
For English there are quite a few reasonably well-written applications. With German and French I'm rather lost. I haven't encountered any monolingual extractors for these languages, yet. And I haven't found any reliable + affordable bilingual term extractors, either. As a conference interpreter I'd love to extract difficult terms from (parallel) texts in the nick of time.
Relevant answer
Answer
We have developed a method called Likey (Language-Independent KEYphrase extraction) based on the use of reference corpora. Likey has very light-weight preprocessing, and no language-specific resources are needed in addition to the reference corpus. Thus, the method is not restricted to any single language or language family. We have been very pleased with the results, which are based on experiments with more than ten languages, and we have used the method in various applications. The method is presented in a Coling paper that you can find at
The method and its use are discussed in some detail in Mari-Sanna Paukkeri's recent PhD thesis (Section 4.1):
We have applied the Likey method, for instance, in assessing the user-specific difficulty of documents:
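For readers curious about the underlying idea, here is a rough sketch of rank-ratio scoring in the spirit of Likey as I understand it, simplified to single words rather than phrases: rank each word by frequency in the document and in the reference corpus, and score it by document rank divided by reference rank (the smaller the ratio, the more "key" the word). The file names are hypothetical and the details differ from the published method:

from collections import Counter

def ranks(tokens):
    return {w: r for r, (w, _) in enumerate(Counter(tokens).most_common(), start=1)}

doc_tokens = open("document.txt", encoding="utf-8").read().lower().split()           # hypothetical
ref_tokens = open("reference_corpus.txt", encoding="utf-8").read().lower().split()   # hypothetical

doc_rank, ref_rank = ranks(doc_tokens), ranks(ref_tokens)
worst_ref = len(ref_rank) + 1  # words unseen in the reference get the worst reference rank
scores = {w: doc_rank[w] / ref_rank.get(w, worst_ref) for w in doc_rank}
print(sorted(scores, key=scores.get)[:15])  # candidate keywords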
  • asked a question related to Corpus Linguistics
Question
6 answers
Specifically, I'm looking for a program that would be able to take a corpus of Wikipedia articles and assign values to the themes and rhemes... even if it's as basic as "positive" or "negative." Does anybody know anything about this?
Relevant answer
Answer
I know some programs for analyzing processes and corpora; is that what you need?