Computational Linguistics - Science topic
Computational linguistics is an interdisciplinary field dealing with the statistical or rule-based modeling of natural language from a computational perspective.
Questions related to Computational Linguistics
Digital literary communication is a reformulation of the text into multiple codes, a process that spans a plurality of disciplinary sectors (SSD) with specific roles and competences. I, Ritamaria Bucciarelli, have pursued these areas with the scientific support of leading specialists in each sector, which justified the choices: humanae litterae, quantum physics, mathematics, computational linguistics, and implementations.
The reference model is quantum-musicological. The objective is the transfer of data of the textual typology from human intelligence (IU) to AI: two codes and a multiplicity of linguistic and phonic mechanisms, as well as graphs to be produced, automata, transformational analyses in linguistic environments, further automata, and finally implementations. I have succeeded in reproducing the emotional verse of the literary text of the Divine Comedy in a quantum calculation, explained on the Fano plane and finally resolved in the quantum foam. I invite you to respond to my appeal. For these competences, I believe there are no referees qualified to judge. Thank you.
I am currently working on a project, part of which is for presentation at JK30 this year in March hosted at SFU, and I have been extensively searching for a part of speech (POS) segmenter/tagger capable of handling Korean text.
The one I currently have access to and could get to execute is relatively outdated and requires many modifications to run on the data.
I do not have a strong background in Python and have zero background in Java and my operating system is Windows.
I wonder if anyone could recommend the best way to go about segmenting Korean text data so that I can examine collocates with the aim of determining semantic prosody, and/or point me in the direction of a suitable program or software.
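In case it helps: toolkits such as KoNLPy wrap several Korean morphological analyzers behind a Python interface (though it does need a Java runtime under the hood). Once the text is segmented, the collocate counting itself needs nothing beyond the standard library. A minimal sketch, assuming already-segmented tokens (the romanised toy tokens below are purely illustrative):

```python
from collections import Counter

def collocates(tokens, node, window=4):
    """Count words co-occurring with `node` within +/- `window` tokens."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return counts

# Toy example with pre-segmented (romanised) tokens:
toks = "na neun hakgyo e ga da na neun jip e ga da".split()
print(collocates(toks, "ga").most_common(3))
```

From the raw counts you can then move to association measures (MI, log-likelihood) for semantic prosody work; the counting loop stays the same, only the scoring changes.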
I would like to do part-of-speech tagging in an unsupervised manner. What are the potential solutions?
There are high-status conferences such as NeurIPS, ICSE, and ACL. Sometimes they accept more than 1000 papers each year.
On the other hand, there are several Q1 journals (with high impact factors) in each category.
Based on your experience, what would be the pros and cons of each one for you as a researcher? How well are they received when you are applying for a position?
Most of the time, data for analysis in language studies are elicited from media-related tools and subjected to some form of computational-linguistic discourse analysis. How can one apply the linguistic tools needed for the explication of such data?
What program is best for the computer-assisted phonetic comparison of dialects? We would like to compare several phonetically quite close dialects of a more or less well-documented language (with the respective protoforms available in case they're required for comparison). The aim of the comparison is to see how close the dialects are to each other and if maybe one stands out against the others, as well as to possibly get input for solving the questions of how the language and / or its speakers spread across the area where the dialects are currently spoken (within the possibilities, of course).
Computational linguistics is fundamental to the future of all languages, considering that electronic processing now controls whether production continues or not, especially since language is no longer only a means of communication but has become a means of production. Yet the Arab world has paid little attention to it, despite attempts at research articles and writing, and despite discussion of it in various media.
It is common that a source domain is conceptually linked to multiple target domains. According to the neural theory of metaphor, once a concept as a source domain is activated, signals will spread through neural circuits/mappings. That is, multiple target domains should be activated simultaneously. Is this true? Or does context play a moderating role in this process? Any terms or articles to recommend, please?
Can I inhibit the processing of several other mappings by making one mapping more accessible? (accessibility theory).
Actually, I need semantic relations to agent nominals as well.
E.g., I need the verb 'grave' (Eng: (to) dig), which has semantic relations to 'jord' (Eng: dirt) and 'skovl' (Eng: shovel), and of course a lot of other obvious relations.
I need the verbs in order to test how organizational resources (knowledge, money, stuff, which are all nominals) can be combined with verbs into tasks, e.g. "grav i jorden med skovlen" (Eng: dig into the dirt with the shovel).
Actually, there are some popular tools I've been working with for quite a while, but I'm interested in a specialized tool for this purpose.
I need a dataset of writings by people with schizophrenia for natural language processing. There is some work on social-media content obtained through self-disclosure, but I want clinical data in English.
Any help will be appreciated.
I was not able to find a comprehensive survey about this on the Internet. Going through some books I came across semantic nets and conceptual dependency theory but are there any others? Any web resources or survey papers would be most helpful.
What are the available benchmarks for evaluating semantic textual similarity approaches?
I am aware of the following:
- SemEval STS
- Microsoft Research Paraphrase Corpus
- Quora Question Pairs
Do you use others besides these in your research?
Apart from the PER metric, what existing performance metrics can be used to compare two different recognizers in speech recognition?
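Besides PER, the usual metrics are WER (word error rate) and CER (character error rate), both computed from the Levenshtein distance between the reference and the recognizer output, plus sentence error rate and real-time factor for speed. A minimal WER sketch:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance over token sequences
    (substitutions, insertions, deletions)."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[m][n]

def wer(reference, hypothesis):
    """WER = (S + I + D) / number of reference words."""
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / len(ref)

print(wer("the cat sat on the mat", "the cat sit on mat"))
```

CER is the same computation over characters instead of words, which is the more meaningful unit for languages without whitespace word boundaries.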
What free or open-source Arabic morphological analyzers can we download from the Internet?
Please provide the links.
Can anyone guide me to a corpus or training data for the readability difficulty of English texts?
Thanks in advance
I'm looking for accessible/online corpora and tools to help me calculate the phonetic and phonological complexity of words in German (e.g. Jakielski's Index of Phonetic Complexity, IPC and the like) -- as well as any pointers to what useful measures of phonological complexity that have been identified experimentally.
Thanks very much in advance!
I want to know about the best Arabic named-entity recognition tools available and how to use them.
Thanks in advance
Given a text, how do you extract its introduction, its development, and its conclusion? Which Computational Linguistics technique can serve to identify the beginning and end of the introduction, development, and the conclusion? Which article addresses these questions?
I am trying to extract data from SEC EDGAR filings; however, I find that building parsers for each form is quite exhausting. In addition, not all those filings have the same format, even when they come from the same form, e.g. a 10-K filing.
I would be grateful if someone could point me in the right direction.
I have a dictionary whose values are matrices and whose keys are the most frequent words in the training file. For each line of a test file, I have to check which words are in the dictionary (the keys), get their values, add them together, and then divide by the number of words in that line that match the keys. The answer is one matrix. I tried "sum(val)", but it doesn't add the matrices together. How can I fix the code (the end part), which I've enclosed?
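Without seeing the enclosed code, the usual cause is that the matrices are stored as nested lists: the built-in `sum()` starts from the integer 0, and `0 + [[...]]` fails (and `+` on lists concatenates rather than adds element-wise). A sketch of one way to do the per-line averaging, assuming the matrices are lists of lists:

```python
def average_matrices(line, word_matrices):
    """Average the matrices of the words in `line` that
    appear in the dictionary; None if no word matches."""
    matches = [word_matrices[w] for w in line.split() if w in word_matrices]
    if not matches:
        return None
    rows, cols = len(matches[0]), len(matches[0][0])
    total = [[0.0] * cols for _ in range(rows)]
    for m in matches:                       # element-wise accumulation
        for i in range(rows):
            for j in range(cols):
                total[i][j] += m[i][j]
    k = len(matches)
    return [[v / k for v in row] for row in total]

word_matrices = {"cat": [[1, 2], [3, 4]], "dog": [[5, 6], [7, 8]]}
print(average_matrices("the cat saw a dog", word_matrices))
```

If the values are NumPy arrays instead, the whole loop collapses to `np.mean(np.stack(matches), axis=0)`, and the plain `sum(matches)` would also work because array addition is element-wise.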
I need English proofreading for my Arabic computational linguistics research, about 80 pages; it is English text with Arabic linguistic terms. This is a paid job, not for free.
I want a probabilistic way to model affix distribution. Does anyone know an algorithm or technique for achieving this?
Hi, I'm doing a multidimensional analysis following the work of Douglas Biber, on two corpora (one learner data, one professional texts). I have the following dimensions following exploratory factor analysis, but am having trouble working out how to define and characterise these dimensions according to function (e.g. involved vs. informational discourse, context (in)dependent discourse, etc.).
Here are the 5 dimensions. In EACH CASE, the z-scores are HIGHER in the learner texts than the professional texts except where a * is seen after the linguistic feature.
VBD – Past tense verbs
PRT – Present tense verbs*
NN - Other nouns not nominalisations or gerunds*
NOMZ – Nominalisation
POMD – Possibility modals
VB – Verbs in base form
TO1 – Infinitives
JJ – Adjectives*
PRMD – Predictive Modals
PIN – Total prepositional phrases
DT – Determiners
VBN – Verbs with past participle
FPP1 – First person pronouns
SPP2 – Second person pronouns
QUPR – Quantifier pronouns
TPP3 – Third person pronouns
IN – Preposition or subordinate conjunction.
I hope that anyone who has done their own MDA might want to provide some pointers here. Many thanks in advance!
Lemma lists represent a necessary tool in NLP. Despite lengthy investigation, I could not locate an Arabic lemma list that would be freely available, and the complexity of Arabic inflections means that the creation of one from scratch is no easy task and should only be undertaken once it is ascertained that none is already available.
I am looking for a corpus containing documents for extractive summarization. The sentences of the documents should be labelled as "positive" if that sentence is included in the summary, "negative" otherwise. The sentences will be fed as training data for the summarizer I am currently working with.
I'm writing my Master's thesis about mobile learning and I'm lost with the terminology.
We are developing a mobile application to practice Spanish conjugation. The system is not really socially oriented, since it is more a behavioural activity in which the user writes the answer and the device gives feedback. Do you think it can still be considered MALL (mobile-assisted language learning)?
Thank you in advance.
I am looking for a stylometry dataset using all or part of lexical, syntactic, and structural features, in the form of CSV, ARFF, or a database.
I would really appreciate it if you could provide me with part of your dataset or suggest a link to one.
I intend to read about the criticism leveled at divergence-time estimation of languages based on both lexicostatistics (glottochronology) and methods of comparative linguistics such as maximum parsimony. Could you point me to some critical papers?
I wish to find all ditransitive constructions in a corpus like BNC or COCA, e.g. "verb + noun + noun" and "verb + noun + preposition + noun", so that I can see which words can be used in a ditransitive construction.
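One low-tech route, if you can export POS-tagged sentences from the corpus, is to search the tag sequence with a regular expression and map matches back to the words. A crude sketch; the CLAWS-like tags and the `find_double_object` helper are my own illustrative assumptions, and a real query would also need to exclude object-complement and time-adjunct readings:

```python
import re

def find_double_object(tagged):
    """Find a crude verb + (det) noun + (det) noun sequence in a
    word_TAG-formatted sentence; returns the matched words or None."""
    words = [t.split("_")[0] for t in tagged.split()]
    tags = " ".join(t.split("_")[1] for t in tagged.split())
    # verb + optional determiner + noun + optional determiner + noun
    m = re.search(r"VV\w*( AT0)? NN\w*( AT0)? NN\w*", tags)
    if not m:
        return None
    # map the matched tag span back to word indices via space counts
    start = tags[:m.start()].count(" ")
    end = start + tags[m.start():m.end()].count(" ") + 1
    return words[start:end]

print(find_double_object("she_PNP gave_VVD the_AT0 boy_NN1 a_AT0 book_NN1 ._PUN"))
```

For BNC/COCA specifically, the hosted interfaces already support tag-pattern queries (e.g. a verb followed by noun-phrase tags), so the same pattern idea can often be expressed directly in the corpus query syntax without downloading the corpus.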
I did my master's research in sentiment mining and worked on multi-aspect sentiment analysis. I want to continue working in this area; can you help me choose a direction?
I need a tool, dictionary, or algorithm that, given an Arabic verb, returns the corresponding noun, and vice versa.
Is there any database or program which links Arabic nouns to their derivatives (for example, linking "استخراج" to "استخرج")?
I do not need a root extractor, i.e. linking استخراج to خرج.
Do the formal languages of logic share so many properties with natural languages that it would be nonsense to separate them in more advanced investigations or, on the contrary, are formal languages a sort of ‘crystalline form’ of natural languages so that any further logical investigation into their structure is useless? On the other hand, is it true that humans think in natural languages or rather in a kind of internal ‘language’ (code)? In either of these cases, is it possible to model the processing of natural language information using formal languages or is such modelling useless and we should instead wait until the plausible internal ‘language’ (code) is confirmed and its nature revealed?
The above questions concern therefore the following possibly triangular relationship: (1) formal (symbolic) language vs. natural language, (2) natural language vs. internal ‘language’ (code) and (3) internal ‘language’ (code) vs. formal (symbolic) language. There are different opinions regarding these questions. Let me quote three of them: (1) for some linguists, for whom “language is thought”, there should probably be no room for the hypothesis of two different languages such as the internal ‘language’ (code) and the natural language, (2) for some logicians, natural languages are, in fact, “as formal languages”, (3) for some neurologists, there should exist a “code” in the human brain but we do not yet know what its nature is.
I have seen these two terms used interchangeably in the literature. I am wondering what the main distinguishing factors between these two systems are.
Please refer me to any website, paper, tutorial, or link where text-mining analysis is used together with conclusion formulation.
Are there any corpora of easy-to-read medical text freely available?
I have been searching for a while for a freely available and reliable term-extraction tool for Arabic (especially for single-word terms). Any suggestions will be highly appreciated.
Automatic indexing: given a text document, extract terms that describe the main topics or concepts covered in the document. It is a task done before inverted-index construction in the development of information retrieval systems. Terms may be keywords, keyphrases, noun phrases, noun groups, entities, etc.
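As a baseline before anything fancier, candidate index terms can be scored by raw frequency after stopword removal; TF-IDF over the whole collection, or noun-phrase chunking, would be the usual next steps. A minimal sketch (the tiny stopword list is illustrative):

```python
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "in", "is", "it", "and", "to", "for"}

def index_terms(document, k=5):
    """Score candidate index terms by raw frequency after stopword
    removal (a baseline; TF-IDF would weight by collection rarity)."""
    tokens = [t.strip(".,;:!?").lower() for t in document.split()]
    terms = [t for t in tokens if t and t not in STOPWORDS]
    return [w for w, _ in Counter(terms).most_common(k)]

doc = ("Information retrieval systems build an inverted index. "
       "The index maps terms to documents.")
print(index_terms(doc, 3))
```

The same frequency table is exactly what the inverted-index builder consumes afterwards: each surviving term becomes a posting-list key pointing at the documents it occurs in.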
I have asked this in the statistics area. I am interested in identifying the probabilistic and statistical distributions of Mandarin tones [either in general or in specific corpora].
I have developed some very general data, e.g. Tone 1 occurs around 18% of the time, Tones 2 and 3 slightly more often than Tone 1, Tone 4 occurs more than 40% of the time, and the neutral tone is relatively rare. But I'd like to obtain more detailed data and also theories as to how experts view tones in terms of probability [if this can even be accomplished]. Would Bayesian probabilities not be appropriate?
A sentence of a tonal language presents critical lexical information in the tones whereas a nontonal language such as English does not. What might be a way of developing useful statistics that measure and show this difference? In other words, how much of the information content is in the tones?
- E.g., let us say English is 100% nontonal and Mandarin can be shown to be 60% nontonal and 40% tonal [I do not really know what the statistics would be].
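One concrete way to start quantifying this is to estimate the tone distribution from a tone-annotated (e.g. numbered-pinyin) corpus and compute its Shannon entropy: the entropy in bits per syllable is an upper bound on the lexical information that tone identity alone can carry, which can then be compared with the entropy of the segmental material. A sketch with illustrative, not measured, counts:

```python
import math
from collections import Counter

# Toy syllable stream in pinyin with tone numbers (5 = neutral tone);
# these counts are illustrative, not taken from a real corpus.
syllables = "ni3 hao3 ma5 wo3 men2 qu4 xue2 xiao4 le5 ta1 shi4 shei2".split()

tones = Counter(s[-1] for s in syllables)      # last character is the tone digit
total = sum(tones.values())
dist = {t: c / total for t, c in sorted(tones.items())}
print(dist)

# Shannon entropy of the tone distribution, in bits per syllable;
# the maximum for 5 tone categories is log2(5) ~ 2.32 bits.
entropy = -sum(p * math.log2(p) for p in dist.values())
print(round(entropy, 3))
```

A Bayesian treatment fits naturally on top of this: treat the tone of the next syllable as a categorical variable with a Dirichlet prior and update it from corpus counts, which also lets you condition on context (previous tone, syllable identity) rather than using the marginal distribution alone.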
I am making a research to create indexes that will contain names and other keywords. My resource texts are written in Greek polytonic characters. I think that it would be very useful to find a way to make them editable and searchable. Furthermore, in order to summarize and classify the information mined, I believe that a software with stylometry function is needed. For the above reasons I am looking for: a) OCR software, b) stylometry software.
Any kind of help will be greatly appreciated! Thank you!
Hello, my research is about sign language recognition. Many researchers choose the sign as the base unit of modeling, while others attempt to use a structure similar to phonemes to create models. What is the better approach for modeling the sign?
Until now, the traditional n-gram approach has been used for word-prediction systems.
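For reference, the traditional approach can be sketched in a few lines: train bigram counts and predict the most frequent continuations. A minimal n = 2 sketch (real systems add smoothing and back-off to longer histories):

```python
from collections import Counter, defaultdict

class BigramPredictor:
    """Predict the next word from bigram counts (a minimal n = 2 sketch)."""

    def __init__(self):
        self.counts = defaultdict(Counter)   # prev word -> Counter of next words

    def train(self, text):
        tokens = text.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            self.counts[prev][nxt] += 1

    def predict(self, word, k=3):
        """Return up to k most likely continuations of `word`."""
        return [w for w, _ in self.counts[word.lower()].most_common(k)]

model = BigramPredictor()
model.train("the cat sat on the mat and the cat ran")
print(model.predict("the"))
```

Extending `counts` to be keyed on the previous n-1 words gives the general n-gram model; the unseen-history problem that motivates smoothing appears as soon as `predict` is called with a word absent from training.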
I need a hint on how to choose features from a connected alphabet. If you are not familiar with this language, just think of English handwriting, where every letter in a word is connected.
Has it been calculated mathematically or logically?
Which Neural Network techniques are used in computational linguistic applications?
Thanks in advance for your replies.
I am searching for a way to interpret logical representations of natural language, more or less as in Latent Semantics.
The corpus must contain documents (texts) with keywords hand-annotated by human experts.
Is there a need to build a new standard corpus for Arabic Information Retrieval? Is it possible in the current state of the art?
I am trying to find the best method for information extraction in distant languages ("distant" meaning languages with little similarity). Korean, Japanese, and Myanmar (Burmese) have similar sentence structures. Finally, I want to present a brief summary of the extracted information in Myanmar. Thanks a lot for your attention.
Extracting causal relationships from texts is far from trivial, but there are quite a few intriguing pieces in the recent literature that discuss how this could be done. E.g. http://www.hindawi.com/journals/tswj/2014/650147/. The 'technology readiness level' of this work seems significantly below that of things like entity, sentiment, event, etc extraction. But at least some progress seems to have been made.
Given the availability of so many large full-text academic databases, it would be of course fantastic to be able to 'extract' all of the causal hypotheses that have been formulated over the years in various disciplines. But so does anybody know of any existing textmining tools that can already do this - even if it's just for English?
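I'm not aware of an off-the-shelf tool that does this at scale, but the pattern-based core that much of this literature builds on is easy to sketch: match surface cue phrases and read off the cause and effect slots. The two patterns below are a tiny illustrative subset; real systems combine many patterns with parsing and classifiers:

```python
import re

# A tiny illustrative subset of surface cue patterns; real systems
# use many more cues plus syntactic analysis and classifiers.
CAUSAL_PATTERNS = [
    re.compile(r"(?P<cause>[\w\s]+?)\s+(?:causes|leads to|results in)\s+(?P<effect>[\w\s]+)", re.I),
    re.compile(r"(?P<effect>[\w\s]+?)\s+(?:is caused by|results from)\s+(?P<cause>[\w\s]+)", re.I),
]

def extract_causal(sentence):
    """Return a (cause, effect) pair if a cue pattern matches, else None."""
    for pat in CAUSAL_PATTERNS:
        m = pat.search(sentence)
        if m:
            return m.group("cause").strip(), m.group("effect").strip()
    return None

print(extract_causal("Smoking causes lung cancer"))
print(extract_causal("Flooding results from heavy rainfall"))
```

The hard cases, and the reason the technology-readiness level lags entity and sentiment extraction, are implicit causation ("the bridge collapsed after the flood"), hedged hypotheses, and negation, none of which surface patterns capture.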
Textmining tools are becoming ever more useful, but it remains difficult to find good tools for CJK languages. If anybody knows of good tools for - especially - Chinese, I'd be grateful for a link.
I'm looking more specifically for studies that used Word Sketch to analyze collocations in/across different academic disciplines.
For most of my projects I use R to manage my big data and run statistical analyses on the results. My domain of research is linguistics (computational linguistics, corpus linguistics, variational linguistics), and in this case I'm concerned with big heaps of corpus data. However, R feels sluggishly slow, and my system isn't the bottleneck: when I take a look at the task manager, R only uses 70% of one CPU core (4 cores available) and 1 GB of RAM (8 GB available). It doesn't seem to use the resources available. I have taken a look at multicore packages for R, but they seem highly restricted to some functions and I don't think I'd benefit from them. Thus, I'm looking for another tool.
I am looking at Python. Together with Pandas it ought to be able to replace R. It should be able to crunch data (extract data from files, transform into data table, mutate, a lot of find-and-replace and so on) *and* to make statistical analyses based on the resulting final (crunched) data.
Are there any other tools that are worth checking out to make the life of a corpus linguist easier? Note that I am looking for an all-in-one package: data crunching and data analysis.
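For what it's worth, the pandas side of the crunch-then-analyze workflow looks roughly like this (toy data; in practice the frame would be built by parsing the corpus files), and pandas pushes the grouping and arithmetic into compiled code rather than interpreted loops:

```python
import pandas as pd

# Toy corpus hits: one row per token occurrence. In practice this frame
# would be built by reading and parsing the corpus files.
hits = pd.DataFrame({
    "lemma":   ["go", "go", "walk", "go", "walk", "run"],
    "variety": ["BE", "AE", "BE", "AE", "AE", "BE"],
})

# Crunch: frequency table of lemma x variety
freq = hits.groupby(["lemma", "variety"]).size().unstack(fill_value=0)
print(freq)

# Analyze: relative frequencies per variety (columns sum to 1)
print(freq / freq.sum())
```

From a frame like `freq` you can feed statistical tests directly (SciPy, statsmodels), so the crunching and the analysis stay in one environment, which is the all-in-one property being asked about.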
Hello, I am a student working on an NLP project. Given the top-ranked words under each topic number obtained from LDA (Latent Dirichlet Allocation), I am trying to assign a topic name to each topic number using Wikipedia as a knowledge base.
The Wikipedia category graph contains links between categories that have some relationship but do not form a hierarchical structure. From this graph I removed the non-hierarchical links to get a DAG (directed acyclic graph); as a consequence, a given category can have one or more parents. After this I applied a BFS-like algorithm to get a taxonomy, but this misses relevant hierarchical links.
Is there any factor I should consider to get a more accurate and meaningful taxonomy?
Thank you in advance
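One thing worth checking is whether the BFS keeps only the first-discovered parent of each category: recording every parent while still tracking BFS depth retains the hierarchical links that a plain BFS spanning tree drops. A sketch over a hypothetical category DAG (the category names are illustrative):

```python
from collections import deque

# Hypothetical category DAG: child -> parents. After removing
# non-hierarchical links, a node may still keep several parents.
parents = {
    "Neural networks": ["Machine learning", "Computational neuroscience"],
    "Machine learning": ["Artificial intelligence"],
    "Computational neuroscience": ["Neuroscience"],
    "Artificial intelligence": ["Computer science"],
    "Neuroscience": [],
    "Computer science": [],
}

def category_depths(category):
    """BFS upward through ALL parents, recording the shallowest depth of
    each ancestor; a BFS tree that keeps one parent per node would lose
    links such as Neural networks -> Computational neuroscience."""
    depths = {category: 0}
    queue = deque([category])
    while queue:
        node = queue.popleft()
        for p in parents.get(node, []):
            if p not in depths:
                depths[p] = depths[node] + 1
                queue.append(p)
    return depths

print(category_depths("Neural networks"))
```

Weighting ancestors by depth (closer categories score higher) is then one simple way to pick a label shared by several of a topic's top-ranked words.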
I implemented a study using a pseudoword as a prime and real words as targets. When looking for relevant literature quite a while ago, I found nothing. Now I have the results and found a clear N400 component and a quite strong P600 effect. I would really like to be able to cite some similar work, but so far I just haven't found anything. References about word pairs and the N400 and P600 would also do.
What is the best method for extracting information in distant languages (e.g. Myanmar, Japanese, Korean)?
How can the information in these languages be summarized?
I want to know if there are systems that generate programs from natural-language descriptions, and what this is called in scientific journals. Broadly speaking, I mean something like this: we describe, e.g., a web app in natural words, like "I need a catalog web app with users and admins; users register on the site and an admin approves them," etc. The system then uses components from structured frameworks (RoR or Django) and creates the program by itself. Is there any research in this field?
What is the most suitable tool for sequence classification/mining with the following features: language modeling, HMM, CRF, multi-attribute nodes?
Are there any free Arabic morphologically tagged corpora?
Can anyone help me by suggesting papers on hybrid approaches in NER (named entity recognition), and can anyone suggest useful techniques to use?
I am searching for a study which examined the number of annotators for creation of a reliable corpus for a text classification task evaluation.
Snow et al. argue that on average four non-expert raters are required for annotation tasks, but the tasks described are not classification tasks (only the study on affectual data might be considered a classification task). I'm rather searching for a statement about topic-based classification.
Often, three annotators are used and majority voting is done, but without real evidence that this is a sufficient number...
Thank you very much in advance for your answers!
 Rion Snow, Brendan O'Connor, Daniel Jurafsky, and Andrew Y. Ng. 2008. Cheap and fast---but is it good?: evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '08). Association for Computational Linguistics, Stroudsburg, PA, USA, 254-263.
I'm reading about context-free grammars, and I understand how to eliminate left recursion, but I did not find out what the problem with left recursion actually is. Can anyone explain?
Thanks in advance
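The practical problem is that a left-recursive rule defeats top-down (recursive-descent) parsing: the parser calls itself at the same input position before consuming anything, so it recurses forever. The standard fix rewrites A → Aα | β as A → βA′, A′ → αA′ | ε, which a recursive-descent parser can implement as a loop. A sketch for Expr → Expr '+' Term | Term:

```python
# Grammar: Expr -> Expr '+' Term | Term
# A naive recursive-descent parser for the left-recursive rule loops forever:
#   def expr(i): return term_after(expr(i))   # recurses before consuming input
#
# After eliminating left recursion:
#   Expr  -> Term Expr'
#   Expr' -> '+' Term Expr' | epsilon
# every recursive step first consumes at least one token.

def parse_expr(tokens, i=0):
    """Parse Expr starting at position i; return position after it."""
    i = parse_term(tokens, i)
    while i < len(tokens) and tokens[i] == "+":   # Expr' realised as a loop
        i = parse_term(tokens, i + 1)
    return i

def parse_term(tokens, i):
    if i < len(tokens) and tokens[i].isdigit():
        return i + 1
    raise SyntaxError(f"expected number at position {i}")

tokens = "1 + 2 + 3".split()
assert parse_expr(tokens) == len(tokens)   # the whole input is consumed
print("parsed OK")
```

Note that the transformation changes the shape of the parse (the rewritten grammar is right-branching), which matters for left-associative operators; parser generators and bottom-up parsers handle left recursion directly for exactly this reason.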
I have considered 3 datasets and 4 classifiers & used the Weka Experimenter for running all the classifiers on the 3 datasets in one go.
When I analyze the results, considering, say, classifier (1) as the base classifier, the results that I see are:
Dataset          (1) functions.Linea | (2) functions.SM    (3) meta.Additiv    (4) meta.Additiv
'err_all'  (100) 65.53 (9.84)        | 66.14 (9.63)        65.53 (9.84) *      66.14 (9.63)
'err_less' (100) 55.24 (12.54)       | 62.07 (18.12) v     55.24 (12.54) v     62.08 (18.11) v
'err_more' (100) 73.17 (20.13)       | 76.47 (16.01)       73.17 (20.13) *     76.47 (16.02)
(v/ /*)                              | (1/2/0)             (1/0/2)             (1/2/0)
As far as I know:
v - indicates that the result is significantly more/better than base classifier
* - indicates that the result is significantly less/worse than base classifier
Running multiple classifiers on a single dataset is easy to interpret, but now, for multiple datasets, I am not able to tell which is better or worse, as the values indicated do not seem to match the interpretation.
Can someone please help interpret the above result, as I wish to find which classifier performs best and for which dataset.
Also what does (100) next to each dataset indicate?
'err_all' (100), 'err_less' (100), 'err_more' (100)
I'm just curious about what other Natural Language Processing instructors might be using as introductory NLP packages, esp. to non-experienced programmers. Any thoughts? I'd much appreciate your advice and commentary on level of difficulty, effectiveness, etc.
I would like to know how big a corpus can be built using LingSync, and for what goals. I would also like to know to what extent such a corpus can be converted to a stand-alone online corpus.
I am doing my final-year project on "Classification of Tonal and Non-Tonal Languages" using neural networks. The system takes pitch contour and energy as parameters. Using only the pitch contour as a parameter yields an accuracy of 66%, whereas adding short-term energy increases it to above 80%.
Much of the standard literature also considers energy a characteristic feature of the language, but provides no explanation.
N-grams are well suited to NLP and to sequential data in general. I am wondering if anybody is working on n-grams for software engineering research. Please share your experience.
What model or models should be used for speaker recognition under automatic speech recognition (ASR)?
Other than finding open datasets and privacy issues, are there any other challenges that might be faced by sentiment analysis applications in smart-city contexts?
I am able to access the transcripts but I am unable to access the audio files even on free online corpora webpages. Could anyone tell me how to access both transcripts as well as audio files together?
Hello to all. I can't find any contributions on machine translation using pregroup grammars or the Lambek calculus on the net. I am working on this and would like to know if there is any literature.
If I am using a dataset, say a movie dataset, is it necessary to have reviews of the same movies? Or can I take the dataset as it is for my research project? If not, how should I select the dataset? If some reviews give opinions on movie A and some on movie B, can I give a generalized output for the movie reviews as a whole (without classifying which movie each review belongs to), just reporting the overall result for the movie reviews? I am really confused about the dataset. Please help me.