Science topic

Corpora - Science topic

Explore the latest questions and answers in Corpora, and find Corpora experts.
Questions related to Corpora
  • asked a question related to Corpora
Question
1 answer
Hi everyone, I am new to LancsBox#. I tried to obtain a keyword list by comparing two specialised corpora (political speeches), but in the results the dispersion is zero for both corpora across the whole keyword list. Can someone please explain whether I have missed anything?
Relevant answer
Answer
I think you may have uploaded your corpus as a single document. Dispersion measures how values vary across the different texts in a corpus; if a corpus is made up of a single document, dispersion comes out at 0. Could that be the issue?
  • asked a question related to Corpora
Question
20 answers
I was trying to determine whether there are differences in the frequencies of words (lemmas) in a given language corpus starting with the letter K versus starting with the letter M: some 50,000 words starting with K and 54,000 words starting with M altogether. I first tried using the chi-square test, but the comments below revealed that this was an error.
Relevant answer
Answer
Did you try a Python word count?
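For what it's worth, a plain count by initial letter takes only a few lines of Python. A minimal sketch (the file name and the one-lemma-per-line format are assumptions):

    # Count lemmas starting with K vs. M in a plain-text lemma list,
    # one lemma per line.
    from collections import Counter

    counts = Counter()
    with open("lemmas.txt", encoding="utf-8") as fh:
        for line in fh:
            lemma = line.strip().lower()
            if lemma.startswith(("k", "m")):
                counts[lemma[0]] += 1

    print(counts)  # e.g. Counter({'m': 54000, 'k': 50000})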
  • asked a question related to Corpora
Question
14 answers
An end-of-year gift for all data scientists: a list of many valuable resources.
Please share widely; it is really helpful.
Relevant answer
Answer
Thanks for sharing.
  • asked a question related to Corpora
Question
2 answers
I am a bachelor's student preparing my bachelor's thesis. Currently, I am looking for a corpus or a dictionary of medical abbreviations (preferably in German) that I can access and include in machine learning classifiers. Are there any, or should I build one myself?
Relevant answer
Answer
I have no knowledge of this.
  • asked a question related to Corpora
Question
4 answers
Hi, I'm searching for a corpus that contains doctor-patient speech/dialogue, or at least patient talk only. How can I find one? Any suggestions?
Relevant answer
Answer
You may check Staples (2015): The Discourse of Nurse-Patient Interactions: Contrasting the Communicative Styles of U.S. and International Nurses.
  • asked a question related to Corpora
Question
4 answers
Hi, I am quite a newbie with Python, and I need to run some text-mining analysis on 100+ literary texts in German, which I have stored as individual txt files in a folder. They are named with the scheme author_title_date (for example "schnitzler_else_1924.txt").
I was thinking of using the Python packages nltk and/or spaCy, and maybe the Stanford NER, as I need to analyse the sentiments in the different texts and to identify specific locations as well as the sentiments relating to those locations.
I am stuck on a very preliminary step, though: how do I import all the text files from the folder into a single corpus/vector corpus that retains the metadata in the file names? I could produce that relatively easily in R with tm, but I can't find a way to do it in Python. Thanks!
Relevant answer
Answer
Assuming each text file has the same columns, you can read each one into Python using pandas. See https://stackoverflow.com/questions/20906474/import-multiple-csv-files-into-pandas-and-concatenate-into-one-dataframe
Then the single file becomes your corpus.
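Since the files in this case are plain text rather than CSV, a variant of that approach is to read each file whole and parse the metadata out of the file name. A minimal sketch along those lines (the folder path is an assumption; the naming scheme is the one from the question):

    # Read every .txt file in the folder into one pandas DataFrame,
    # keeping the author/title/date metadata encoded in the file names.
    import glob
    import os
    import pandas as pd

    rows = []
    for path in sorted(glob.glob("corpus_folder/*.txt")):
        name = os.path.splitext(os.path.basename(path))[0]
        author, title, date = name.split("_")  # e.g. schnitzler_else_1924
        with open(path, encoding="utf-8") as fh:
            rows.append({"author": author, "title": title,
                         "date": date, "text": fh.read()})

    corpus = pd.DataFrame(rows)
    print(corpus[["author", "title", "date"]].head())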
  • asked a question related to Corpora
Question
2 answers
Hello,
In my PhD thesis on referent introductions in Chinese and French, both as L1s and L2s, I'm writing a section about the tendency (more or less strong) generally observed across languages to avoid preverbal indefinite subjects.
I've consulted some corpus-based studies, as, for instance, Francis et al. (1999) on English, Cappeau (2008) on French, Sornicola (1995) on Italian and Hasselgård (2018) on Norwegian.
Now I'm looking for a study about preverbal indefinite subject in Chinese on the basis of corpora results (both spoken and written corpora, even if I'm mainly interested in the spoken register).
Is anyone working on this subject, and/or could you recommend a reference?
Thank you a lot!
All the best
Relevant answer
Answer
Hongyin Tao thank you!
  • asked a question related to Corpora
Question
2 answers
I'm looking for Arabic speech corpora, in any dialect, that have the associated text. If you know of such corpora, I'll be more than happy if you could direct me to them, even if they are not free.
Relevant answer
Answer
Depending on what you want to achieve, you could take a look at Mozilla's Common Voice (https://voice.mozilla.org/en/datasets). They currently have 7 hours of validated Arabic sentences with audio and text.
Another resource that takes a little work is the free audio books on LibriVox (https://librivox.org/search?primary_key=9&search_category=language&search_page=1&search_form=get_results).
  • asked a question related to Corpora
Question
5 answers
I am working on a project that aims to test the viability of training an NMT system on a language-specific corpus. Any recommendations/suggestions? (Language pair: Arabic/English)
Relevant answer
Answer
How is the latent variable defined for a non-autoregressive model?
Any suggestion is appreciated.
  • asked a question related to Corpora
Question
6 answers
Hello,
I'm looking for French electronic corpora that mostly contain different newspaper texts online. Does anyone know where to find that kind of corpus in French? I am looking for other corpora than FrWac or Orféo.
Kind regards,
Henrik Ruotsalainen
Relevant answer
Answer
Hello Henrik,
You could check out our corpus of the French regional newspaper L'Est Républicain (https://wiki.korpus.cz/doku.php/en:cnk:lestrepublicain), available from our corpus manager KonText (https://kontext.korpus.cz/first_form?corpname=lestrepublicain&usesubcorp=).
For any further information do not hesitate to contact me.
Best regards,
Adrian
  • asked a question related to Corpora
Question
7 answers
For my bachelor thesis I am working on a speech assistant that can be used in bank consulting sessions. For example, the consultant can ask the assistant to provide the history of a specific share's price in a chart.
The focus of my thesis lies on improving the speech-to-text results that we receive from our ASR engine. For this task I need a German corpus from the field of bank consulting conversations.
Since, to my knowledge, such a corpus does not exist, I need to create my own. As I have no experience in creating corpora, do any of you have suggestions as to where to start, what tools to use, or how to proceed in general? I was thinking that maybe there are transcriptions of television shows debating share prices, or interviews in newspapers, that could be used?
Thanks in advance!
Relevant answer
Answer
I'm not personally familiar with any web tools for creating corpora. Perhaps someone else here can help you.
  • asked a question related to Corpora
Question
2 answers
If you build your own corpus to address specific research questions, which method do you use to make sure it is saturated? I'm interested in methods because I work on digital data, and I wonder which method is more efficient and less time-consuming.
Relevant answer
Answer
In corpus design, the "saturated corpus" is associated with the concept of "representativeness", developed by Douglas Biber: <http://otipl.philol.msu.ru/media/biber930.pdf>.
Some other sources, from Lancaster University among others, might also interest you. On methods, see e.g. a short paper from the University of Birmingham, and this quantitative approach to corpus representativeness: <http://www.lexytrad.es/assets/cl24_0.pdf>.
  • asked a question related to Corpora
Question
5 answers
I am looking for open speech databases which contain information on speakers' ages and in which the age groups are (more or less) equally represented.
Relevant answer
Answer
Magdalena, I would direct you to Roy Patterson, he may be able to answer.
  • asked a question related to Corpora
Question
14 answers
In my current research I use corpora data, data elicited from native speakers and lexicographic sources (e.g. definitions from dictionaries).
I operate in this way because corpora may not include tokens for certain constructions (they may be vanishingly rare), but speakers may judge possible construction examples quickly and efficiently. The two methods can complement one another nicely, I believe.
I also use dictionaries (and grammars) to guide the design of possible constructions. For instance, I was testing the polysemy of prepositions in Italian and French by verifying whether possible definitions can be used to describe their senses in corpora data.
I was thus wondering whether the joint use of these methods could be considered a form of methodological triangulation of the kind found in the social sciences.
I would like to thank in advance any colleagues who provide answers, after showing enough patience to read this not-so-clear description (!).
Francesco
Relevant answer
Answer
Sure: you may triangulate the position of a mobile phone by measuring the distance between the phone and two different antennas. Likewise, you might determine whether the use of a noun in a certain text is rare both by comparing it against a corpus of similar texts from the same text type and by checking its value in a reliable word frequency list. That is triangulation: the two pieces of data are either both necessary to determine a value (the phone case), or redundant and able to fix distortions or problems in each way of measuring (the noun case).
On the other hand, you may want to describe relevant characteristics of a bilingual population in a study, and decide to apply a psych test and also run a survey to determine their actual, daily language use. In this case, both pieces of data are relevant to your goal, but they do not focus on the same factor; rather, they contribute to a larger picture of the subjects. Here the methods are often quite different (test vs. survey), and such strategies are thus multi-method.
As usual, the border between triangulation and multi-method designs is fuzzy. What matters is being aware of whether you are focusing on one factor or many, and whether both interpretation and statistical analyses can be applied in similar ways.
  • asked a question related to Corpora
Question
3 answers
Dear members of the project,
I'm really interested in joining your project.
I'm an ESL teacher at the Education School at the Catholic University of Valencia, and my PhD was on metadiscourse features in English essays written by Spanish speakers.
I'm currently working on a national metadiscourse project analysing metadiscourse features in research articles, together with Marisa Carrió from the Polytechnic University of Valencia.
My main research interest is analysing interactional metadiscourse features using learner corpora. I'd be deeply interested in carrying out a contrastive cross-cultural study on the use of interactional metadiscourse features by ESL writers with different L1s.
I'm looking forward to receiving your feedback on my proposal.
Kind regards,
Chiara Tasso
Relevant answer
Answer
Dear Chiara Tasso,
This project is for the 2nd Metadiscourse Across Genres Conference, to be held by the CERLIS research group in Italy in 2019. You are more than welcome to join our conference. If you intend to share your research with the MAG audience, I would like to remind you that the deadline for proposal submission is approaching. You can also follow us on Facebook: https://www.facebook.com/metadiscourseacrossgenres/
  • asked a question related to Corpora
Question
3 answers
Here's an open-ended question relating to copyright, ethics, power relations in academia, and corpus linguistics:
What is the situation in your country/university with respect to the intellectual property rights of corpora/data collected and constituted by a PhD student during the preparation of their thesis?
All other considerations aside (i.e. suppose that the data is original, with no prior copyright holders, and that they have been duly collected with the consent of participants):
(1) Does the PhD student retain the intellectual property rights to such data? Or do they automatically become the intellectual property of the university, by means of an employment contract or another legal document (e.g. one that PhD students may be forced to sign in order to be authorised to defend their thesis)?
(2) What happens if the PhD student wishes to share/publish their data/corpora under an Open Access license (e.g. Creative Commons) after their defence or even before it? Do they need the permission of their supervisor, of a higher-level university body, of their funding agency, of all of the above? Has it ever happened in your university? Have there been cases where the researcher wanted to share data under an Open Access license and were prevented from doing so by another level of the hierarchy?
(3) If the data does become the intellectual property of the university, is there any obligation for the university afterwards (e.g. are they obliged to make them available through an institutional repository)? If the data becomes part of an institutional repository, does the PhD student have any say on the type of license under which they will be distributed? (for example, do they get to choose "non-commercial")?
(4) After the defence, is it possible for the university (or even an individual supervisor) to formally ask their former student (now Dr) to refrain from using the data/corpus they had collected during their thesis? Note that, in theory, if the corpus automatically becomes the intellectual property of the university, this is entirely possible. Do you know any cases of universities sending formal "cease and desist" letters against their former PhD students?
I would like to collect information about current practice and law in different countries with respect to this issue. For example, some countries limit these practices (considered an abusive utilisation of copyright); some Codes of Conduct in Dutch universities explicitly state that, unlike other productions, the copyright of a PhD thesis is retained by the PhD holder; in "business-friendly" Belgium, the issue is dealt under labour law (therefore a PhD student is just another employee and everything they produce belongs to their employer).
Researchers are becoming increasingly aware that the current situation is not really conducive to early-career researchers sharing their corpora under Open Access licenses.
Legal experts will provide data and analyses, as these matters can get complicated. But I would also like to hear some experiences and the opinions of corpus linguistics practitioners. Any pointer to your country's laws, university's code of conduct, case law, cases reported in the media, stories and anecdotes or even personal experiences (if you don't mind sharing them) are welcome.
Thank you very much for participating in the discussion and thank you for your help!
Relevant answer
Answer
Dear George,
Awareness of the concept of intellectual property rights in Africa, specifically in Nigeria and now in Uganda, is gaining momentum, in the sense that whereas plagiarism testing used to be overlooked, publications now have to go through arrays of tests to certify their authenticity and originality.
However, whether you retain full rights to your PhD work even after it has been declared authentic by the various examining bodies of a particular school depends on the extant laws of that school, as each university or college has laws establishing it and laws governing the publication and publicising of such research.
From my brief experience, and based on advice I have offered on different occasions, it is expedient that schools hold on to the finished research for some number of years (published via institutional repositories as a read-only document), after which the research is released to its owner.
I have further opined that the researcher also MAY decide to allow the institution to have full authority over the work.
These agreements should be in written form.
  • asked a question related to Corpora
Question
3 answers
Hi everyone! I have recently started focusing on NLP research, am also interested in discourse parsing, and am relatively new to this field. I want to analyse different sentences (arguments) to check the relationships between them: whether they are contradictory or supporting statements, or whether one is the background of the other. I am looking for an existing annotation tool with which I can analyse the sentences, with at least partial (if not full) automatic analysis, so that I can derive a structure tree showing the relations between the arguments.
If the tool offers online testing of my corpora, that would be the best option for me. I am actually looking for a tool like the Penn Discourse TreeBank, where I can analyse multiple arguments and their relations. The online version is not working, so I was looking for an alternative; if anyone can tell me how to parse arguments using PDTB's online tool, I would also be grateful.
I would welcome any suggestions, and it would be a big help if you could advise on how to approach analysing argument structures so that I can eventually build the relationships between them. Any publication in a similar research area would also be helpful. Thank you everyone in advance for your time!
  • asked a question related to Corpora
Question
5 answers
I intend to use only the BNC spoken component, which consists of approximately 10 million words, as the reference corpus. The study corpus is 10 times larger (i.e. 100 million words) and is made up mainly of informal language. Will it be all right to calculate normalised frequencies to address the size discrepancy between the two corpora?
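For reference, per-million normalisation scales each raw count by the size of its own corpus, which makes counts from corpora of different sizes directly comparable. A minimal sketch (the counts are invented for illustration):

    # Normalised frequency = raw count / corpus size * 1,000,000.
    def per_million(raw_count, corpus_size):
        return raw_count / corpus_size * 1_000_000

    print(per_million(2_500, 10_000_000))    # 250.0 in the 10M reference corpus
    print(per_million(24_000, 100_000_000))  # 240.0 in the 100M study corpus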
  • asked a question related to Corpora
Question
1 answer
I'm looking for publicly available speech perception EEG databases with large corpora (preferably at least 10-20 words) or articles that share their data. Can anyone help me find some? Your help would be greatly appreciated!!
Relevant answer
Answer
I hope that the references of the manuscript I posted in the attachment will be useful to you.
  • asked a question related to Corpora
Question
3 answers
I'm looking to train and test NLP algorithms, and would like to use large corpora formatted as PDF files. Any help to find/access will be appreciated.
Relevant answer
Answer
Hi Chris,
The Internet Archive's Open Library (https://openlibrary.org/) has many books written in natural/conventional language that you may download as pdf. Project Gutenberg (http://www.gutenberg.org/) also has a huge collection of books. You may open a book as html or txt page in a browser and then "print" or "export" it as pdf.
Hope this helps,
Saif
  • asked a question related to Corpora
Question
15 answers
I mean that 'this' will be a very frequent item in the corpus compared with terms for emotions such as 'anger', so I wonder whether there is any qualitative way of investigating very frequent items. I would appreciate all your suggestions. Thanks in advance.
Relevant answer
Answer
In my opinion, analysing determiners in English corpora can be tricky.
One problem is that you have classes of determiners which have identical pronominal forms (possessive determiners/pronouns, demonstrative determiners/pronouns). You can minimize this problem by using part-of-speech tagged corpora. Many corpora today come in a part-of-speech tagged version already. By using part-of-speech tags, you can specify whether you are looking for a pronominal form, a determiner form, or both, when looking for a word form such as 'this'.
Second, English has noun phrases where an overt determiner is required, and others where the determiner is either optional or not permitted (as per the standard). NPs without an overt article form (sometimes called zero article) can only be found automatically when the corpus is annotated for syntactic structure. A corpus that is annotated for syntactic structure is referred to as a 'parsed' corpus, and they are quite rare (due to the high error rate involved in automatic parsing). However, using tools such as the UAM CorpusTool, you can parse a corpus yourself.
Finally, you mentioned that you are looking for the use of determiners in the context of emotions such as 'anger'. I fully agree with the suggestions above: You can look for determiners that contain certain words in their immediate context. In corpus linguistics, this is often called 'collocation'. Using a collocation approach, you could compile a list of nouns (or verbs, adjectives) that express the notion of anger, and then ask the corpus software to only give you those instances of 'this' which has one of the words from your lists in its immediate context (e.g. within a range of 9 words to the right or the left). Some corpora are even tagged for synonyms (e.g. the corpora on the popular corpus.byu.edu platform), so that you could search for 'this' and any synonym of 'anger' in its immediate context.
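A minimal sketch of the tag-plus-window approach just described, using NLTK's default tokeniser and tagger (the emotion word list, window size and example text are assumptions, and the NLTK models must first be fetched with nltk.download):

    # Find determiner uses of "this" that have an emotion word
    # within a +/-4-token window.
    import nltk

    emotion_words = {"anger", "rage", "fury"}
    text = "I could not contain this anger. This is fine."

    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    for i, (word, tag) in enumerate(tagged):
        if word.lower() == "this" and tag == "DT":
            window = {w.lower() for w, _ in tagged[max(0, i - 4):i + 5]}
            if window & emotion_words:
                print("match at token", i)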
I apologize for the somewhat lengthy response, and hope it was helpful (I explained some corpus-linguistic terminology because I don't know your background). Feel free to ask a follow-up.
  • asked a question related to Corpora
Question
4 answers
I'm looking for a corpus or a collection of texts (freely available online) that represents the Dutch language from the earliest attestations onwards, in the form of correspondence. I have already tried querying the DBNL database and Nederlab, but without success.
Relevant answer
Answer
Dear Cefas,
They might be helpful, thank you. Many of the corpora I use can be downloaded from the internet, which is an advantage if the aim is to quantify things and you want control over the data and the way you process them. If you hear of such corpora for Middle Dutch or Early Modern Dutch (charters, tracts, official or personal correspondence), please let me know.
BTW, my first interest is language contact, and I think French could be the epicenter for the diffusion of complex prepositions in English and Dutch. Glad to see people interested in language contact react to my questions!
Best regards,
Christophe Béchet.
  • asked a question related to Corpora
Question
1 answer
Hi,
We're looking at the frequencies, forms, and functions of adjectives found in child-directed speech in different types of corpora, e.g. shared book reading and toy play. We'd like to conduct a power analysis to find the minimum number of data points required. I've previously used G*Power to calculate the sample sizes required to detect a particular effect size in experimental data, but I'm unsure how to do this for corpus data, given that we're more interested in the number of data points required than in the number of participants.
Any help would be much appreciated!
Thanks,
Jamie
Relevant answer
Answer
Hi.
I am not sure if this is what you are looking for, but I know Raven's Eye uses corpora for its analytics; they have corpora for 65 different languages. I do not know whether they have corpora built from child-directed speech. Maybe check out their website.
  • asked a question related to Corpora
Question
2 answers
I have been going through several papers and I am a bit lost. Different methods are used with various corpora and lexicons, and with only slightly altered deep-learning approaches. If someone could recommend a method or a paper with a good (and preferably recent) overview, I'd be delighted (it can be written in Chinese).
Relevant answer
Answer
I would like to respond, but I do not specialize in this topic. I will also read the answers of fellow researchers, to learn from them. Greetings.
  • asked a question related to Corpora
Question
4 answers
Hi,
I'm looking for accessible/online corpora and tools to help me calculate the phonetic and phonological complexity of words in German (e.g. Jakielski's Index of Phonetic Complexity (IPC) and the like), as well as any pointers to useful measures of phonological complexity that have been identified experimentally.
Thanks very much in advance!
Relevant answer
Answer
Dear Gina,
Here you have a comprehensive list of Speech analysis and transcription tools for several languages.
I hope this helps.
Kind regards,
Begoña
  • asked a question related to Corpora
Question
6 answers
I am working on the topic "Utility Enhancement for Textual Document Redaction and Sanitization". I have noted in the literature on de-identification of medical documents that privacy models perform unnecessary sanitization by sanitizing negated assertions ("AIDS negative"). I want to exclude negated assertions before sanitizing a medical document, which will improve the utility of the document, and I want to know which dataset would be appropriate for this work. I tried to use the 2010 i2b2 dataset, but I could not find its metadata. The 2014 i2b2 de-identification challenge (Task 1) consists of 1,304 medical records for 296 patients, of which 790 records (178 patients) are used for training and the remaining 514 records (118 patients) for testing. The medical records are a fully annotated gold-standard set of clinical narratives. The PHI categories are grouped into seven main categories with 25 associated sub-categories, and the distributions of PHI categories in the training and test corpora are known (e.g. in the test corpora: 764 ages, 4,980 dates, 875 hospitals, etc.). I would like to know the same information for the 2010 i2b2 dataset, which I have not been able to find yet.
Thank you.
Relevant answer
Answer
Dibakar Pal, from there we can download the dataset, which I have already done, but I need information about and a description of the dataset.
  • asked a question related to Corpora
Question
2 answers
Does anyone know of any open-access multi-million word bilingual (French/English) corpora that are both aligned (paragraph or sentence level) and tagged for POS and Lemma? I'm familiar with OPUS, but regrettably the alignment and the tags are not very reliable.
Relevant answer
Answer
Dear Adrian,
That sounds quite interesting.  I've had a look at your link, and I'll definitely be contacting you soon with some questions about the corpus structure.
Thanks,
Daniel
  • asked a question related to Corpora
Question
3 answers
What do we need parallel corpora for? Especially if we are humanists or linguists doing work or research?
Relevant answer
Answer
Dear Krzysztof Walk,
Generally speaking, in linguistic research in general, and in contrastive linguistic research in particular, the use of parallel corpora is indispensable. Over the past few decades, there has been significant use of computer corpora in linguistic research addressing various domains such as discourse analysis and socio-pragmatic contrastive analysis. The reason is that parallel corpora are a valuable source of data that has given rise to a wealth of insightful research in contrastive linguistics. As such, we can say that parallel corpora serve a number of purposes in linguistic and humanistic studies:
a) They increase the level of rigour in comparative analysis of the two targeted languages.
b) They provide us with an in-depth understanding of the micro/macro linguistic objectives which serve as the tertium comparationis in contrastive research.
c) They can shed light on the delicate structural and cultural differences between the languages under investigation.
d) They can serve a wide range of pedagogical and professional objectives, such as lexicography, researching the nature of language teaching and learning, and translation.
Best regards,
R. Biria
  • asked a question related to Corpora
Question
1 answer
In general, are the algorithms implemented in NLTK applicable to analysing the morphology of languages such as Arabic, or must I define customized analyzers?
Relevant answer
Answer
Dear Shady Bajary,
Very interesting work. I think that the master's thesis by Shouhani Rabiee on Arabic, available at the following link, can be helpful.
Best of luck,
R. Biria
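For what it's worth, NLTK does ship one Arabic-specific component, the ISRI root stemmer, though fuller morphological analysis generally requires dedicated Arabic tools. A minimal sketch:

    # Light-stem an Arabic word with NLTK's ISRI stemmer.
    from nltk.stem.isri import ISRIStemmer

    stemmer = ISRIStemmer()
    print(stemmer.stem("يكتبون"))  # reduces the surface form towards its root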
  • asked a question related to Corpora
Question
6 answers
I want to build a corpus to test a language identification system. Can you suggest links for collecting textual data for the following languages, transcribed in Arabic characters:
Arabic
Pashto
Balochi
Kashmiri
Kurdish
Punjabi
Persian
Uighur
Urdu
Sindhi
Malay
Relevant answer
Answer
Dear Sadik,
Sketch Engine contains corpora for Arabic, Punjabi, Persian, Urdu and Malay.  You can prepare corpora for the remaining languages yourself using the WebBootCAT functionality in Sketch Engine. In case of any questions, just ask the support team of Sketch Engine at support@sketchengine.co.uk (they could also do that job for you, but there would be some costs involved).
Best regards,
Milos Jakubicek
  • asked a question related to Corpora
Question
3 answers
The content is from existing technology manuals and blogs, written both by paid technology writers and external contributors. We intend to build an FAQ corpus out of these. Thanks.
Relevant answer
Answer
Pretend that you are lecturing the corpus as a course and that you have to test an employee on every sentence in the corpus. Manually, all you do is convert each sentence into a question; the question can seek to establish the subject, verb or object of the sentence, as pleases you. The research problem is how to do this automatically with NLP. I attach the beginning of your reading list, and a link to ArikIturri: An Automatic Question Generator Based on Corpora and NLP Techniques. Have fun!
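As a toy starting point for the automatic version, a dependency parse yields the subject and verb of each sentence, which can then be slotted into a question template. A sketch with spaCy (the model name and the template are assumptions; real sentences need far more care):

    # Naive question generation: extract the nominal subject and its
    # head verb, then fill a "What does X Y?" template.
    import spacy

    nlp = spacy.load("en_core_web_sm")

    def naive_question(sentence):
        doc = nlp(sentence)
        for tok in doc:
            if tok.dep_ == "nsubj":
                return f"What does {tok.text} {tok.head.lemma_}?"
        return None

    print(naive_question("The manual describes the installation steps."))
    # -> What does manual describe?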
  • asked a question related to Corpora
Question
3 answers
Hi everyone,
I need to perform a topic analysis on various corpora of documents and I need a procedure that can be applied to all of these corpora independently in a standard way. 
These are the characteristics of the corpora:
  • the number of documents in each corpus will hardly be more than 500, and most of the time it is around 50;
  • documents are generally very short (from 20 to 200 words most of the time);
  • each corpus is independent and analyses will never be done merging corpora, but only performed within each corpus;
  • the language of documents will be homogeneous within each corpus, but it may vary between corpora;
  • the number of topics is unknown a priori, and topics will be different in every corpus.
 Specifically, I’m looking for a procedure that:
  • automatically detects the best number of recurrent topics in each corpus, but is also able to take into account that some documents may have "peculiar" topics that are not represented in any other document. These are not of interest and may be seen as a kind of "residual". If these peculiar, single-document topics are identified as further topics by the model, that is fine too;
  • gives, for every document, a % for each of the identified recurrent topics, plus a % that is "residual" from them. Alternatively, the single-document topics have to be identified and scored in each document as well.
If I understand LDA models correctly, they don't allow this "residual" part, and the sum of the %-scores of the topics is always 1. Moreover, they are not good at identifying single-document topics, and the result for these "outcast" documents is somehow a uniform score across all the topics, even though none of them is truly present in the document.
Are there other topic analysis models that fit my task better, or have I misunderstood LDA models?
Thank you very much!
Massimiliano
Relevant answer
Answer
I do not think topic analysis on a collection of 50 documents would give robust and stable results, since LDA is generally an ill-posed task which has many solutions. Why not perform some soft clustering to detect "outliers" with peculiar topics? I am not an expert in topic modelling, but the authors of this work suggest a general model that embraces LDA and PLSA (though I do not know whether it is used in practice). If I understand them properly, you could regularize the model to force the topics to be as diverse as possible, but that is by no means a "black-box" procedure.
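For completeness, a plain LDA baseline in gensim looks like the sketch below; the regularised, diversity-enforcing variant mentioned above would replace the plain LdaModel (the toy documents and the topic count are assumptions):

    # Minimal LDA baseline with gensim; tiny collections of short
    # documents will give unstable topics, as noted above.
    from gensim import corpora, models

    texts = [["bank", "loan", "credit"],
             ["loan", "rate", "credit"],
             ["garden", "flower", "tree"]]  # toy tokenised documents

    dictionary = corpora.Dictionary(texts)
    bow = [dictionary.doc2bow(t) for t in texts]
    lda = models.LdaModel(bow, num_topics=2, id2word=dictionary,
                          random_state=0)
    for doc in bow:
        print(lda.get_document_topics(doc))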
  • asked a question related to Corpora
Question
3 answers
I'm mostly interested in telephone speech. Now I'm working with KALAKA3, and recordings from voxforge.org.
Relevant answer
Answer
Maybe it's too late to answer, but Euronews should not be too difficult to crawl; see for example http://www.lrec-conf.org/proceedings/lrec2014/pdf/695_Paper.pdf
  • asked a question related to Corpora
Question
2 answers
I'm working with the Kullback-Leibler and Jensen-Shannon divergences applied to pairs of texts (e.g. A and B). In some cases A and B do not have the same vocabulary, so I need to assign a value to the unseen n-grams. Until now I was setting a small probability for those unseen cases myself; however, in some cases this value is not small enough, and I can get negative divergences.
I am therefore looking for a smoothing method that does not require a training corpus: I must use the probability values from the texts themselves, not from a training corpus. Also, if possible, I would prefer a smoothing method that does not combine different n-gram sizes, because several of my n-grams are skip n-grams.
P.S. For the moment I have been testing the Good-Turing smoothing.
Relevant answer
Answer
I agree with Caitlin's answer: use additive smoothing. You can start with the plus-one version, but there is a more general plus-delta expression to test with your data. Here are the details:
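For illustration, here is a minimal sketch of plus-delta smoothing over the union vocabulary of the two texts; every n-gram receives a pseudo-count, so all probabilities stay strictly positive and the divergences remain well defined (the function name and the toy bigrams are assumptions):

    # Additive (plus-delta) smoothing: delta=1 gives the plus-one version.
    from collections import Counter

    def smoothed_probs(ngrams, vocab, delta=1.0):
        counts = Counter(ngrams)
        total = len(ngrams) + delta * len(vocab)
        return {g: (counts[g] + delta) / total for g in vocab}

    a = ["the cat", "cat sat", "sat down"]  # toy bigram lists
    b = ["the dog", "dog sat", "sat down"]
    vocab = set(a) | set(b)
    p, q = smoothed_probs(a, vocab), smoothed_probs(b, vocab)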
  • asked a question related to Corpora
Question
4 answers
Standard corpora exist in various domains; however, I cannot find a corpus containing large amounts of technical documentation.
The only corpus I've heard of is the "Scania Corpus" from the PLUG project (1998), but I cannot find any resources.
Does anybody know of another corpus, or have access to the Scania documents?
Thank you in advance
Best regards
-Sebastian
Relevant answer
Answer
Hi,
I'm not sure if software documentation qualifies as technical documentation, but the OPUS project has parallel corpora for the PHP, GNOME, KDE and Ubuntu manuals:
Hope this helps.
  • asked a question related to Corpora
Question
21 answers
Maybe a tool that would also let me annotate parallel texts?
Hi everyone! I'm a linguist with basic computer skills, so I have only some vague notions of Java, Python and other programming languages. I'm interested in annotating a small parallel corpus for discourse relations and connectives, so I need to be able to define several criteria in my analysis (arguments, connectives, explicitness/implicitness, etc.). I would welcome any suggestions... Thanks!
Relevant answer
Answer
Hi Sorina,
I am using SALT for Spanish and English (http://www.saltsoftware.com/); I don't know which languages you need to manage. It is a very user-friendly tool: you can transcribe, define your own lists of words (concordances), and declare your own tags ([tag]).
You can also check the CHILDES project tools (http://childes.psy.cmu.edu/).
Hope it helps.
Good luck!
  • asked a question related to Corpora
Question
13 answers
There are a few computational models of CIT for concept invention out there (e.g. Pereira, 2007; Li, Zook, Davis & Riedl, 2012). I was wondering whether this idea could be turned on its head and repurposed to streamline information extraction from corpora. Any suggestions on how one could go about it?
Relevant answer
Answer
@ Marc Le Goc
Abstraction is a part of the blending process for sure. Especially during the construction of the Generic Space. I haven't come across the term "Knowledge Engineering" before. It sounds like a pretty interesting field. :)
@ Ignacio Arroyo
I haven't got into annotation schemes yet. But since CIT has a strong 'evolutionary' undercurrent running through it, they'll need to reflect that somehow. A simple static semantic tag won't work. I'm thinking of something more along the lines of vectors and graphs.
  • asked a question related to Corpora
Question
1 answer
There is a paper titled 'Stochastic Language Generation for Spoken Dialogue Systems' by Alice H. Oh and Alexander I. Rudnicky which describes two corpora (a CMU corpus and an SRI corpus), but I'm not able to find those corpora (I'm not sure whether they are text corpora, but the language model is trained on them).
Relevant answer
Answer
The CMU corpus you may be referring to is one that we collected in a travel agency, with agents doing their usual work. This corpus was later used to create the language model for the Oh&Rudnicky study (the reasoning being that we would get language most like that of professional agents). If you are interested in it do send me an email at my cmu address (just search me and you will find it). I'm sure we have SRI corpus around as well, but I'm not completely sure if we can just send it to you.
  • asked a question related to Corpora
Question
3 answers
Are there any free Arabic morphologically tagged corpora?
Relevant answer
Answer
Hi Ibrahim,
I don't believe so. Please check out the following paper for the issues involved in developing such corpora: http://archimedes.fas.harvard.edu/mdh/arabic/NAACL.pdf.
  • asked a question related to Corpora
Question
3 answers
My motivation is to somehow (blindly) learn (negative) patterns with plain text corpora.
For a bag of words {this, is, a, book}, once a corpus tells us for sure that there is no usage of "book is this a", and so on and so forth, then hopefully by negation one may find some hidden rules that promote the bag-of-words model to something similar to LDA.
Relevant answer
Answer
If I understand your question correctly, it is more about the syntactic possibilities offered by a language, especially an SVO one. If that is the case, you could try to define the term via the function the specific word cluster fills, i.e. 'determiner' or 'premodification'. You can find more information on this terminology in Quirk et al. (1985), 'A Comprehensive Grammar of the English Language'.
  • asked a question related to Corpora
Question
7 answers
learner corpora
Relevant answer
Answer
Dear Anju,
Please read my article on this topic; it could provide some reflections and findings.
Good luck with your research,
a.
  • asked a question related to Corpora
Question
9 answers
This is for an identification task involving EFL learners
Relevant answer
Answer
Thanks a lot, Rahimi.
  • asked a question related to Corpora
Question
2 answers
Could anybody suggest a speech corpus of read speech suitable for recognition with grammar? That means a simple linguistic content that could be easily described with a BNF-style grammar.
an4 is such a corpus, but its content is still a bit complex for a simple grammar. Maybe there is some 'dialing numbers' database?
Relevant answer
Answer
You can try Aurora2.
  • asked a question related to Corpora
Question
6 answers
I am able to access the transcripts but I am unable to access the audio files even on free online corpora webpages. Could anyone tell me how to access both transcripts as well as audio files together?
Relevant answer
Answer
Sir, you can write to John M. Swales, who was instrumental in developing MICASE; he responds to queries. Generally we get access to the transcripts only; the audio databases are not shared. There is also Dr Claudia from Dresden, Germany, who collected a lot of samples from Indian users of English. Her contact is also useful.
  • asked a question related to Corpora
Question
3 answers
Hi everybody. 
Do you know of, or have, corpora for chat summarization in English? The corpora should pair each document with its human-written summary.
  • asked a question related to Corpora
Question
6 answers
Hello friends, I'm looking for criteria for the minimum number of words in keyword and collocate testing. I have seen work addressing potential problems with very large corpora (for both log-likelihood and chi-squared tests), but I can't seem to find anything on corpus-size minimums. Specifically, I'm interested in log-likelihood and MI-score. Thanks for your help.
Relevant answer
Answer
Perhaps pp. 979-980 of this paper might help: Brysbaert, M., & New, B. (2009). Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41(4), 977–990. Cheers.
  • asked a question related to Corpora
Question
5 answers
I also need an Arabic tagger, ideally one written in Perl.
Relevant answer
Answer
There are some Arabic corpora that are available through the web, such as the AQMAR Arabic Wikipedia Dependency Corpus, which is available at:
Some other corpora are annotated with named entities, such as:
  • asked a question related to Corpora
Question
9 answers
I am familiar with TLG and the Perseus Digital Library. I want to do corpus linguistics on Hellenistic Greek. Some of the things I need to do are to search by POS, by lemma, by morphological element (reduplication, particular morpheme, stem formation, etc.), and for collocates.
I am not sure either of the above will do all of that, so I am considering developing my own corpora, applying a tagger that handles all of this, and using a search engine that recognizes what I tagged.
Do I need to do this, or is there already a selection of tools that will get the job done?
Relevant answer
Answer
Do you know the PROIEL treebank? http://foni.uio.no:3000/ It has a considerable amount of Greek text from several periods (the core is the New Testament, but there is also Herodotus and some Byzantine chronicles). You can download fully lemmatised and tagged texts there and use them to train a morphological tagger. 
  • asked a question related to Corpora
Question
15 answers
Our large SMS corpus in French (88milSMS) is available. User conditions and downloads can be accessed here: http://88milsms.huma-num.fr/
Is there a website that lists all the corpora available to the NLP and text-mining communities?
Relevant answer
Answer
Hello,
Thanks Ali for the pointer. We can indeed help you share it with the HLT community and give it some further visibility at ELRA/ELDA (http://www.elra.info and http://www.elda.org). You can have a look at our ELRA Catalogue (http://catalog.elra.info/) and the Universal Catalogue (http://universal.elra.info/) and get in touch with us for any further information (http://www.elda.org/article.php?id_article=68). We'll be happy to help! Kind regards, Victoria.
  • asked a question related to Corpora
Question
9 answers
I want to construct a corpus (of comments) from social networks, and I am searching for an existing tool dedicated to this task.
Relevant answer
Answer
This very much depends on the privacy / data-sharing rules and the volume of the particular social network you want to use (and also on your ability / willingness to write code).
scraperwiki.com is a great website to get basic data from Twitter.
NodeXL is a nice software to both extract data from Twitter (as well as from Facebook) and analyze social networks.
R's TwitteR package is also pretty decent - http://cran.r-project.org/web/packages/twitteR/index.html
If the social network does not have an API that you can use to retrieve data from, you can always go 'old school'. I have a colleague who practically went through the forums and copied/pasted comments to a Word file.
  • asked a question related to Corpora
Question
10 answers
For English there are quite a few reasonably well-written applications. With German and French I'm rather lost. I haven't encountered any monolingual extractors for these languages, yet. And I haven't found any reliable + affordable bilingual term extractors, either. As a conference interpreter I'd love to extract difficult terms from (parallel) texts in the nick of time.
Relevant answer
Answer
We have developed a method called Likey (Language-Independent KEYphrase extraction) based on the use of reference corpora. Likey has very light-weight preprocessing, and no language-specific resources are needed other than the reference corpus. Thus, the method is not restricted to any single language or language family. We have been very pleased with the results, based on experiments with more than ten languages, and have used the method in various applications. The method is presented in a Coling paper that you can find at
The method and its use are discussed in some detail in Mari-Sanna Paukkeri's recent PhD thesis (Section 4.1):
We have applied the Likey method, for instance, in assessing user-specific difficulty of documents: