Questions related to Corpus Linguistics
I'm looking for software that would let me measure lexical density as Halliday understood it (i.e., the number of lexical items per ranking clause). Everything I've been able to find only measures lexical density as the percentage of lexical items in an entire text. Automating this would save me a huge amount of time.
I’m using the Mann-Whitney test in a linguistic study to compare the frequencies of a linguistic feature in two collections of texts. One collection includes a lot more (x10) texts than the other. I've read that Mann-Whitney can be used to compare groups of unequal size, but the examples usually given are something like 224 vs 260, not 224 vs 2240.
Can I still use this test? Does it make sense to thin the bigger sample to match the smaller one? They’re both random samples representative of a certain genre, so conceptually I think downsampling is possible.
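For anyone who wants to try this on their own counts, here is a minimal sketch of the Mann-Whitney U test in plain Python. It assumes one frequency value per text, uses midranks for ties and the tie-corrected normal approximation without a continuity correction, and the function name `mann_whitney_u` and the synthetic data are made up for illustration. The test itself does not require equal group sizes, so the sketch runs on 224 vs 2240 values directly.

```python
import math
import random


def mann_whitney_u(x, y):
    """Return (U for x, z, two-sided p) using midranks and the
    tie-corrected normal approximation (no continuity correction)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Pool both groups, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])

    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        # Values at positions i..j-1 are tied and share the average rank.
        midrank = (i + 1 + j) / 2.0
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j

    u_x = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:
        # All values identical: no evidence of a difference.
        return u_x, 0.0, 1.0
    z = (u_x - mean_u) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2))
    return u_x, z, p


# Synthetic per-text frequencies, invented for illustration only.
rng = random.Random(0)
small = [rng.gauss(10.0, 3.0) for _ in range(224)]
large = [rng.gauss(10.5, 3.0) for _ in range(2240)]
u, z, p = mann_whitney_u(small, large)
print(f"U = {u:.1f}, z = {z:.2f}, p = {p:.4f}")
```

Thinning the larger sample would throw away data, so running the test on the full unequal groups is usually the simpler option; a statistics package would be preferable to this sketch for real analyses.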
I’m doing topic modelling with a collection of technical documents related to the repair of devices. The reports are extracted from different software packages used by different repair shops. I need to clean the text properly so that the model focuses on the key words; specifically, I want to automatically remove useless words like:
* Additional findings
* External appearance
* Incoming condition, etc
These "fill-in / template words" are found in almost every document, and there are many others. The documents are collected from different sources and consolidated in one database, from which I do the extractions. I already tried segregating by repair shop using TF-IDF, term frequency and BM25, and segregating by software.
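One possible approach, sketched below, is to treat any word n-gram that occurs in nearly every report as template text and drop it before topic modelling. This is only a sketch under assumptions: the function names, the 0.8 document-share threshold, the whitespace tokenisation and the three example reports are all hypothetical, and the threshold would need tuning per source.

```python
from collections import Counter


def find_template_phrases(docs, n=2, min_doc_share=0.8):
    """Return word n-grams that occur in at least min_doc_share of the documents."""
    doc_freq = Counter()
    for text in docs:
        tokens = text.lower().split()
        # Use a set so each n-gram counts once per document (document frequency).
        ngrams = {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        doc_freq.update(ngrams)
    cutoff = min_doc_share * len(docs)
    return {gram for gram, df in doc_freq.items() if df >= cutoff}


def strip_template_phrases(text, phrases, n=2):
    """Remove every token that is part of a template n-gram."""
    tokens = text.lower().split()
    drop = set()
    for i in range(len(tokens) - n + 1):
        if " ".join(tokens[i:i + n]) in phrases:
            drop.update(range(i, i + n))
    return " ".join(tok for k, tok in enumerate(tokens) if k not in drop)


# Hypothetical reports, invented for illustration only.
reports = [
    "incoming condition scratched housing additional findings battery swollen",
    "incoming condition cracked screen additional findings none",
    "incoming condition water damage additional findings corroded connector",
]
templates = find_template_phrases(reports)
cleaned = [strip_template_phrases(r, templates) for r in reports]
print(sorted(templates))
print(cleaned[0])
```

Running the detection separately per software source may work better than one global pass, since each system presumably has its own template headings.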
I was trying to determine whether there are differences in the frequencies of words (lemmas) starting with the letter K and with the letter M in a given language corpus: some 50,000 words starting with K and 54,000 words starting with M altogether. I first tried using the chi-square test, but the comments below revealed that this was an error.
I have a question as to whether collocations in corpus linguistics can be used to indicate diversity. I have a corpus of media news articles, and I managed to find the frequency of my target word and its collocates. I then used a regression analysis to find out whether demographic variables predicted the frequency of the target word, and another regression model with collocates as the outcome. I simply took the number of collocates for each article as the dependent variable, with the understanding that a larger number of collocates meant more diversity in the media representation. Consequently, I conclude that demographic variables predict this diversity, i.e., a country's cultural values significantly predict higher diversity in the media, as indicated by the number of collocates. The research aim is to explore whether national values predict diversity in the way a particular issue is presented in the media. Please tell me if this reasoning is sound, as I have very little background knowledge on this.
I noticed that some scholars mentioned corpus-assisted method in Cognitive Translation Studies (CTS) or Cognitive Translation and Interpreting Studies (CTIS). However, the dominant method designs in CTIS are eye tracking-based or verbal report-based. I want to know more about how to utilize corpus tools in CTIS but I have not found any comprehensive introduction.
I only read some calls for corpus-assisted cognitive translation studies in Chinese and English academia. Only recently, I read a book chapter by Lang & Li (2020) about the cognitive processing routes of culture-specific linguistic metaphors in simultaneous interpreting. They have discussed many cognitive models but not enough for me as a layman to have a better picture of the whole area.
Thus, I am wondering whether there are any references that could help me go further in this regard. I have read some works in Cognitive Linguistics. Some have indeed used a corpus-driven approach to discuss cognitive linguistic issues, but the explanations are not very clear.
However, I am still curious about the views of translation scholars in CTIS. Do CTIS scholars actually believe that corpora can be used to analyze cognitive aspects of translation, given that this is not the dominant tool for this group?
Whether yes or no, I am also interested in the reasons.
Thanks for noticing and answering this question :)
I am working on a Natural Language Processing task and want to create my own corpus from some documents. Each document has approximately 500-600 words.
Can anyone suggest how to create a corpus? I am new to NLP.
Actually, I need semantic relations to agent nominals as well.
E.g., I need the verb 'grave' (eng: (to) dig), which has semantic relations to 'jord' (eng: dirt) and 'skovl' (eng: shovel) and, of course, a lot of other obvious relations.
I need the verbs in order to test how organizational resources (knowledge, money, stuff, which are all nominals) can be combined with verbs into tasks, e.g. "grav i jorden med skovlen" (eng: dig into the dirt with the shovel).
I need to learn about authentic, specific uses of language in English to inform my ESP course designs, as I teach ESP courses in an EFL setting, which makes it even harder to reach such genuine language uses for specific purposes. I plan to make use of a concordancer for pedagogical purposes as well. I would be glad if you could suggest a few online concordancing tools that you have found effective.
Thank you in advance.
“The state of moral education and citizenship education within the schools of Kurdistan”: this is my new research title!
I was wondering: what are the anticipated problems and challenges, and their proposed resolutions, that I would face throughout my paper?
If you build your own corpus to address specific research questions, which method do you use to make sure it is saturated? I'm interested in methods because I work on digital data, and I wonder which method is more efficient and less time-consuming.
I'm willing to collaborate on any research in the field of corpus linguistics, data-driven learning or material design, as long as the final work will be published in a peer-reviewed journal. My master's viva is due next month, and you can read part of my thesis in my profile.
Good day! I need some topic suggestions for my Language and Linguistic Research class. Can you please help me with a researchable topic? I prefer applied, corpus, or sociolinguistics. Thank you!
I am looking for something similar to the OpenText.org project that has developed annotated Greek texts.
There is the University of Maryland Parallel Corpus Project that is annotated in conformance with the Corpus Encoding Standard and that also includes English. Unfortunately though, I haven't found any syntactically annotated version of the English text yet.
“Philosophical discussion in the absence of a theory is no criterion of the validity of evidence.”
-- A. N. Whitehead. Adventures of Ideas. (1933:221)
In the case of an investigation or a disciplinary technology, empirically speaking (irrationally speaking, i.e., speaking in a strict non-Cartesian way), data/corpora are the raw material (ephemeral ‘arbitrary signifiers’ in the case of linguistics) used to build up a theory following the inductive method.
Why, then, is mere ‘corpus’ tagged with linguistics, an epistemological disciplinary technology?
‘Corpus’ is not tagged with Physics, Geology, Psychology, Sociology etc (e.g., Corpus Physics or Corpus Sociology), though they are also dealing with data!
Collection of data and arranging them (typing?) in a digital machine do not involve any knowledge or wis(h)dom but a special skill that needs clerical precision. Documentation, no doubt, is a tiresome job. Utilizing a tool (a digital machine) as a repertoire, does not necessarily entail the birth of discipline.
Ascribing static (“thetic...”, Kristeva, 1974) meaning to those entries, though, needs epistemology, and that can be handled by well-established theory-based disciplines: Lexicology, Semantics, Pragmatics etc. If we have such levels of linguistic analysis, do we need such a dubious coinage as “Corpus Linguistics”?
And each empirical discipline needs data for further observation, experimentation and inductive generalization (one may raise Popper’s [1934, 2009] points for refuting Inductivism here), i.e., data is an initial part of the whole, but neither a theory nor a praxis.
However, it is a celebrated discipline now! Why is it so? What is the purpose of such a discipline?
My friend says, “We, the residents of the so-called third world, are part of the data-collection team—don’t you understand that? How dare you? You cannot be allowed to perform theoretical plays.” (Galtung, 1980)
I recently had an article published for which I researched two Kazakh proverbs using ethnographic as well as corpus linguistic methods. As I consider expanding this project, I am interested in reading about comparable projects.
I'm aware of some projects in sociolinguistics and historical linguistics that share their data either in an open access format, without any substantial restrictions or delays, or without any "application" process as long as the work is for non-profit purposes. The idea is that everything that goes beyond a simple "Safeguard" letter hinders the maximal exploitation of limited and valuable resources.
These best practice examples, which make (often publicly-funded) data collections available to the public, deserve recognition. While I can think of many historical data collections, such as the Helsinki Corpora Family or the BYU corpora, the more contemporary the data get, the fewer resources are publicly accessible. On the more contemporary end, I can think of, as exceptions,
* the Linguistic Atlas Project (http://www.lap.uga.edu)
and our own
* J. K. Chambers Dialect Topography database (http://dialect.topography.chass.utoronto.ca)
* Dictionary of Canadianisms on Historical Principles (www.dchp.ca/dchp2).
Which other projects of active data sharing do you know?
I'd appreciate your input for a list of Best Practice Data Collections that I'm preparing.
I mean, 'this' will be a very frequent item in the corpus compared with terms for emotions such as 'anger', so I wonder whether there is any qualitative way of investigating very frequent items. I will appreciate all your suggestions. Thanks in advance.
I'm looking for accessible/online corpora and tools to help me calculate the phonetic and phonological complexity of words in German (e.g. Jakielski's Index of Phonetic Complexity, IPC, and the like), as well as any pointers to useful measures of phonological complexity that have been identified experimentally.
Thanks very much in advance!
Does anyone know of any open-access multi-million word bilingual (French/English) corpora that are both aligned (paragraph or sentence level) and tagged for POS and Lemma? I'm familiar with OPUS, but regrettably the alignment and the tags are not very reliable.
I am currently doing my MA dissertation and am required to code my data, but I don't have other coders to ensure inter-rater reliability (due to time constraints). As Mackey and Gass (2005) suggest, I repeated the data coding at two different times (Time 1 and Time 2) for intra-rater reliability; however, the results at Time 1 and Time 2 were slightly different. If this happened with multiple coders, they could discuss the disagreements in their coding and decide on one definitive set of coded materials. As I am the only researcher, and negotiation with other coders isn't possible, how can I decide which coding to use in my research? Thank you.
Additional info: I am doing research on (corpus) linguistics, specifically how writers express doubts in their research papers by looking at how many times, for example, the modal verb "may" appears in their texts. Since "may" can have multiple meanings other than expressing doubts (e.g. to express permission as in "You may go now"), I need to exclude those which do not function to reflect uncertainty. I have tried converting them into categorical data (e.g. 1 for expressions of doubts and 0 for non-expression of doubts) and I am thinking of using Cohen's Kappa for reliability test of my coding in Time 1 and Time 2. And perhaps I can try to resolve the little difference in both times by asking other people to help me judge/decide the definite sets of data to use.
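As a rough illustration of the Cohen's Kappa idea mentioned above, the sketch below computes kappa for two coding passes and lists the items where they disagree, so that only those need a second opinion. The function name `cohens_kappa` and the ten example codes are invented for illustration; 1 stands for "expresses doubt" and 0 for "does not".

```python
from collections import Counter


def cohens_kappa(codes_a, codes_b):
    """Cohen's kappa for two equally long sequences of categorical codes."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("both codings must be non-empty and of equal length")
    n = len(codes_a)
    # Observed agreement: share of items coded identically in both passes.
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Chance agreement, from the marginal distribution of each pass.
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both passes used a single identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical codes for ten occurrences of "may" (1 = doubt, 0 = other use).
time1 = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
time2 = [1, 1, 0, 0, 0, 0, 1, 1, 1, 1]

kappa = cohens_kappa(time1, time2)
disagreements = [i for i, (a, b) in enumerate(zip(time1, time2)) if a != b]
print(f"kappa = {kappa:.3f}; items to re-examine: {disagreements}")
```

The list of disagreeing items is the part that could be handed to other people for a judgement, which keeps their workload small.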
I want to build a corpus to test a language identification system. Can you suggest links to collect textual data for these languages transcribed in Arabic characters :
Consider a grammar 'G' with semantic rules provided for the list of productions 'P'. Suppose intermediate code needs to be generated and I follow the DAG method to represent it.
In that case, what other variants of the syntax tree, apart from the DAG, can be used for the same purpose?
I know the formula for calculating normalised frequency, but I want to know whether there is existing software to help with calculating normalised frequencies.
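In case no dedicated tool is at hand, the formula is short enough to script. The sketch below assumes the usual definition (raw count divided by corpus size, multiplied by a base such as one million words); the function name, the counts and the corpus size are invented for illustration.

```python
def normalised_frequency(raw_count, corpus_size, base=1_000_000):
    """Frequency per `base` tokens (per million words by default)."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count * base / corpus_size


# Hypothetical raw counts from a corpus of 500,000 tokens.
corpus_size = 500_000
raw_counts = {"may": 150, "might": 45, "perhaps": 20}

per_million = {
    word: normalised_frequency(count, corpus_size)
    for word, count in raw_counts.items()
}
for word, value in per_million.items():
    print(f"{word}: {value:.1f} per million words")
```

Passing a different `base` (for example 10_000) gives frequencies per ten thousand words, which can be easier to read for small corpora.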
I found two applications on GitHub for PHP,
but their POS tagger results are not correct (all words were tagged as NN). Any help will be appreciated.
A link to any other free parser that supports Arabic and can be integrated with PHP would also be appreciated.
I recently was going over some data compiled by the Ethnologue and soon discovered that the number of speakers listed exceeded the population for certain countries/regions. Due to diasporas and communities of expats throughout the world, this did not altogether surprise me, but knowing that many countries are inhabited with people speaking many different languages, it made me wonder what percentage of inhabitants within each country spoke the official language of the country.
In the process, I have found that tracking down these percentages is proving to be a bit difficult. Thus far, I've resorted to cobbling them together from a number of different sources and even making an educated guess for some of them. This is not the ideal solution for collecting these figures, especially since I've already seen some pretty wide discrepancies. For example, one source told me that only 72 percent of Spain's population speaks Spanish, while another told me that this percentage is 98.8 percent.
I'm facing a problem in my quantitative analysis. The keyword list cannot be generated without uploading a reference corpus, and I cannot decide on the appropriate reference corpus for my data.
My corpus is one million words in size; the genre is online news (UK), the period is 2013 to 2015, and the language is English.
I have read that genre and diachrony are more important factors to consider than others when choosing a reference corpus, especially because differences in these two factors, unlike differences in factors such as corpus size and variety, bring about a statistically significant difference in the number of keywords.
Thus, the BNC and the Brown corpus, I think, would not be suitable due to the time gap in relation to my study.
I hope I can find an answer.
Thanks in advance for your help.
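To make concrete why the reference corpus matters, here is a small sketch of the log-likelihood keyness statistic, one measure commonly used by keyword tools. It is a sketch under assumptions: the question does not say which statistic the software uses, and the function name and all frequencies below are invented for illustration.

```python
import math


def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood keyness of one word in a study corpus vs a reference corpus."""
    total_size = size_study + size_ref
    combined = freq_study + freq_ref
    # Expected frequencies if the word were equally common in both corpora.
    expected_study = size_study * combined / total_size
    expected_ref = size_ref * combined / total_size
    ll = 0.0
    if freq_study > 0:
        ll += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * ll


# Hypothetical counts for one word in a one-million-word news corpus,
# compared against two different one-million-word reference corpora.
study_freq, study_size = 100, 1_000_000
older_reference = log_likelihood(study_freq, study_size, 10, 1_000_000)
recent_reference = log_likelihood(study_freq, study_size, 90, 1_000_000)
print(f"vs older reference: {older_reference:.2f}")
print(f"vs recent reference: {recent_reference:.2f}")
```

The same word scores very differently depending on how often it occurs in the reference corpus, which is why a reference corpus close in period and genre tends to give a more meaningful keyword list.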
Hi everyone! I need to look up English lexical frequencies for a list of 60 words, and I'm looking for an online database that accepts a text file containing the list (or copy-pasting) as input. I already tried COCA, but I can't find a way to submit the complete list; I can only search word by word. Any useful advice? Thank you!
I am trying to apply corpus linguistic methodologies to the study of language maintenance and/or shift instead of using traditional methods such as questionnaires, interviews, etc.
Dear Colleagues, I hope you are as right as rain. We're working on a project titled "A comparative study of lexical bundles in spoken and written registers in politics". I wonder if you could please help me find suitable corpora. Which texts and articles can be regarded as political and which as apolitical? Clarity and consistency in the definition and its application in compiling corpora is an important issue. Please help me tackle this hurdle. How about spoken corpora? How can I find them, and in which resources? More explanation of the procedure and your own experience would be warmly welcome.
As shown in the attached pictures, I get different results from the POS tagger and the parser, even though they come from the same producer (Stanford).
For example, look at the POS tag for the word "على" in the Statment3 file:
in the tagger the result is CD
in the parser the result is NNP
I am working on research into statistical Arabic grammar analysis, applying naive Bayes classification and then optimizing the results using genetic algorithms.
I am searching for efficient Arabic NLP tools that give me the features that specify the grammatical analysis (E'arab), but I haven't found any. If anyone has ideas or an interest in this research field, please share your experience and knowledge.
Most of the studies looked at main/subordinate clauses to infer the clausal architecture of a given language and how it has evolved over time. One of the most discussed topics is the pragmatic domain and the interplay between syntactic structure and information structure, e.g. topicalization, clefting etc. Would you agree or disagree that infinitival clauses may be a better source for looking at word order change, as they are reduced clauses and we are not "distracted" by stylistic variation? Looking forward to your opinions! Thank you!
I have already collected the data from their reflection journals. I used WordSmith Tools.
The participants are immigrant workers who are learning EFL. Thank you.
I want to link up my research in address patterns with corpus linguistics. Although I am not too conversant with corpus analysis/stylistics, I want to extend my research area to corpus linguistics. Please advise on what aspects of addressing can be researched in corpus linguistics.
Is there anything, for instance, like the findings about preference organisation in conversation analysis, or hypercorrect patterns in sociolinguistics, or semantic prosody in corpus linguistics?
Does anyone know of a study that includes a full text sample--minimum 800 words, but the longer the better--with the lexical chains marked up? I am looking to demonstrate a visual method of identifying lexical chains, and would like to compare the analysis that can be done using the visual method against a manually (or computationally) completed analysis. If there is a gold standard, that would be great, but otherwise, any full-text example will do! Thanks in advance for your help.
How can I analyse the characteristics of attributive clauses used by Senior Three Students based on a corpus?
I have been searching for a while for a freely available and reliable term extraction tool for Arabic (especially for single-word terms). Any suggestions will be highly appreciated.
I would like to see how other researchers approached similar data sets. I have taken a couple of paths through it already, but I would be curious about other approaches to consider.
The application of the lexical approach to teach lexical collocations in writing.
Does an intensive program for teaching lexical collocations guarantee the acquisition of operation on the idiom principle in writing?
I am looking for colleagues who have experience in linguistic corpus analysis. What is your research about? Anyone doing analysis in German Corpora? Italian or French? Thanks a million for your input and experiences!
According to the Glossary of Corpus Linguistics (Baker, Hardie and McEnery 2006), there is one composed of recordings from Dallas Fort Worth, Logan International and Washington National airports. Access is paid only, through the Linguistic Data Consortium, but since I am not really interested in the data but in the corpus structure and design, I would like to know if there is another one available online or any papers about them. Thanks!
I am carrying out research into "some" and "any" using the OEC, which I access via Sketch Engine. Many of my searches into specific patterns with "some" and "any" produce far too many results for one researcher working alone. To overcome this, I have been using the Sample Size Calculator from Survey System (www.surveysystem.com/sscalc.htm) : I set the Confidence Interval (CI) at 4 and the Confidence Level (CL) at 95% and write the total number of searches for the pattern in the Population Size box. So, for example 5804 total examples at CI 4 and CL 95% gives a random size of 544 examples, while 15,000 examples gives a random size of 577. Does this seem a sensible way of calculating random sample size? Does anyone have any better ideas?
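For anyone wanting to reproduce these figures offline, the sketch below uses the standard sample size formula for a proportion with a finite population correction. That this is what the online calculator does is an assumption, although it reproduces both numbers quoted above. The calculator's "confidence interval" of 4 is treated as a margin of error of plus or minus 4 percentage points, and the function names are made up for illustration.

```python
import random

# z-scores for common confidence levels.
Z_SCORES = {90: 1.645, 95: 1.96, 99: 2.576}


def sample_size(population, margin_pct=4.0, confidence=95, p=0.5):
    """Sample size for estimating a proportion, with finite population correction."""
    z = Z_SCORES[confidence]
    e = margin_pct / 100.0
    n0 = z * z * p * (1 - p) / (e * e)       # size for an infinite population
    n = n0 / (1 + (n0 - 1) / population)     # finite population correction
    return round(n)


def random_hit_indices(population, seed=0, **kwargs):
    """Pick which concordance lines (0-based) to analyse, without replacement."""
    n = sample_size(population, **kwargs)
    return sorted(random.Random(seed).sample(range(population), n))


print(sample_size(5804))    # 544
print(sample_size(15000))   # 577
chosen = random_hit_indices(5804)
print(len(chosen), chosen[:5])
```

Note that `p=0.5` is the most conservative choice, and that this formula targets estimating a single proportion; whether it suits a given research question about "some" and "any" is a separate judgement.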
Maybe a tool that would also let me annotate parallel texts?
Hi everyone! I'm a linguist having basic computer skills, so I have only some vague notions about Java, Python or other programming languages. I'm interested in annotating a small parallel corpus for discourse relations and connectives, so I need to be able to define several criteria in my analysis (arguments, connectives, explicitness/implicitness, etc.). I would welcome any suggestions... Thanks!
For most of my projects I use R to manage my big data and run statistical analyses on the results. My domain of research is linguistics (computational linguistics, corpus linguistics, variational linguistics), and in this case I'm concerned with big heaps of corpus data. However, R feels sluggish, and my system isn't the bottleneck: when I take a look at the task manager, R only uses 70% of one CPU core (4 cores available) and 1 GB of RAM (8 GB available). It doesn't seem to use the resources available. I have taken a look at multicore packages for R, but they seem highly restricted to some functions and I don't think I'd benefit from them. Thus, I'm looking for another tool.
I am looking at Python. Together with Pandas it ought to be able to replace R. It should be able to crunch data (extract data from files, transform into data table, mutate, a lot of find-and-replace and so on) *and* to make statistical analyses based on the resulting final (crunched) data.
Are there any other tools that are worth checking out to make the life of a corpus linguist easier? Note that I am looking for an all-in-one package: data crunching and data analysis.
With the exception of Language Explorer (FLEx), which is well known.
It would be very helpful if you could mention a programme that is lightweight and easy to use.
I have a draft article analyzing a Kazakh trickster tale that was appropriated to present the idea of the "New Kazakh". In the article, I discuss how the folktale was adapted and then consider feedback from Kazakhs who met with me in focus groups. I would like to analyze this using the three levels of discourse as provided by Johnston (2002): representative, frame-aligning, and general. The article would explore the appropriated tale as frame-aligning discourse and the focus group discussions as general discourse and then consider how the text and talk interface with all three aspects of frame discourse. What are other examples of a comparable investigation? Which articles and books would you recommend? Would you have any suggestions as to steps in an effective process?
I would like to know how big a corpus can be built using LingSync, and for what goals. I would also like to know to what extent such a corpus can be converted to a stand-alone online corpus.
We have high diversity of terms in text corpus and want to filter all social humanity terms through thesaurus construction. Does somebody have experience and want to share/cooperate with us? Best, Veslava
I'm drawing (choropleth) maps visualizing language use in big text corpora, e.g. words which are attributed to places. Currently I'm doing that using R and the cshapes Package (http://nils.weidmann.ws/projects/cshapes/r-package). I'm also experimenting with Nolan's and Lang's R package to produce interactive SVG graphs (http://www.omegahat.org/SVGAnnotation/SVGAnnotationPaper/SVGAnnotationPaper.html) to show tooltips on the map. As you can see here (http://www.bubenhofer.com/sprechtakel/2013/08/06/geocollocations-die-welt-der-zeit/) it works generally, but the resulting SVG (and also PDF) files are huge. There is also the problem of the SVG files produced in R, that all text is converted to vector graphics which again increases the complexity of the plot. This seems to be a known problem of SVG in R (http://stackoverflow.com/questions/17555331/how-to-preserve-text-when-saving-ggplot2-as-svg).
What are better means to produce interactive maps showing a lot of data?
I am seeking information on corpus building:
1. How big does it have to be to be defined as a corpus?
2. What specific methodologies could be used to build a corpus?
3. Examples of empirical studies that report on corpus development?
Links to any online sources would be much appreciated.
What were the main weaknesses of the claim by adherents of generative semantics that "a grammar starts with a description of the meaning of the sentence and then generates syntactical rules through introduction of syntactical rules and lexical rules"?
Have you ever looked into the above topic? If so, could you possibly share your findings, or provide references? I've been asked to submit a paper about "soundscapes" in a couple of months and I would like to focus on sound symbolism in journalistic English, with a specific view to economic and financial terminology. Any suggestions and/or comments are very welcome. Many thanks, Antonio.
A standard corpus is necessary to evaluate the performance of any retrieval or text analysis activity/ experiment. Is there any standard free/payment based corpus available?
What corpus was used to train the OpenNLP English models such as the POS tagger, tokenizer and sentence detector? I am aware that the chunker is trained on the Wall Street Journal corpus; however, I am still not sure about the POS tagger, tokenizer, and sentence detector.
In my linguistic data, I have categorical predictors and a binomial response variable. However, the data set is too small (2100 tokens) to include all of the predictors. I am running into an issue where adding or removing one predictor changes the significance of another. Intuitively, I can see some pragmatic factors in my data possibly interacting with semantic factors. I tried to use pairs in R to look for interactions, but interactions between categorical variables are hard to interpret. Do you have any suggestions on how to construct the best model?
For my questionnaire, I have 7 groups of words, 36 words altogether. Each group contains 5-6 words that are semantically similar, or have very close "yield". They are in English. I wonder if there is a computerised way to come up with one word that would describe overall meaning or bias of the whole group. Something like a multiple-words-in-one thesaurus.
I am familiar with TLG and the Perseus Digital Project. I want to do corpus linguistics on Hellenistic Greek. Some of the things I need to do are search by POS, search by lemma, search by morphological element (reduplication, particular morpheme, stem formation, etc.) and search for collocates.
I am not sure either of the above will do all of that. I am considering developing my own corpora and using a tagger that does all of this to the corpora, as well as a search engine that will recognize what I tagged.
Do I need to do this, or is there already a selection of tools that will get the job done?
Note: So far I have experimented with an untrained TreeTagger, but (unsurprisingly) only with mediocre results :-/ Any hints on existing training data are also appreciated
The results so far can be viewed here: http://dh.wappdesign.net/post/583 (lemmatized version is displayed in the second text column)
There is plenty of debate in the literature about which statistical practice is better. Both approaches have many advantages but also some shortcomings. Could you suggest any references that describe which approach to choose and when? Thank you for your valuable help!
I am analyzing infinitival clauses in Latin and Old French. Could you suggest any research/study of such clauses in general and/or in Indo-European Languages? Thank you!
I'm talking about things like themes, but also co-occurrence counts etc. I can't seem to find any literature.