Arabic NLP - Science topic

Explore the latest questions and answers in Arabic NLP, and find Arabic NLP experts.
Questions related to Arabic NLP
  • asked a question related to Arabic NLP
Question
1 answer
I created a lecture series for this; please suggest any improvements.
Relevant answer
Answer
Yes, absolutely! Building an AI chatbot from scratch is an excellent way to dive deep into the world of Natural Language Processing (NLP). This hands-on experience will provide you with a comprehensive understanding of various NLP techniques and their practical applications.
Here's why it's a good starting point:
  1. Deep Understanding of NLP Concepts:
  • Tokenization: breaking text down into smaller units (tokens) such as words or subwords.
  • Stemming and Lemmatization: reducing words to their root forms.
  • Part-of-Speech Tagging: identifying the grammatical role of words.
  • Named Entity Recognition (NER): recognizing entities such as names, locations, and organizations.
  • Sentiment Analysis: determining the emotional tone of text.
  • Intent Classification: identifying the user's goal or purpose.
  • Dialogue Management: managing the flow of conversation and generating appropriate responses.
  2. Practical Experience with Libraries and Tools (a minimal pipeline sketch follows this list):
  • NLTK (Natural Language Toolkit): a versatile library for various NLP tasks.
  • spaCy: a powerful library for advanced NLP, known for its speed and accuracy.
  • TensorFlow and PyTorch: deep learning frameworks for building complex language models.
  • Hugging Face Transformers: a library for state-of-the-art language models such as BERT and GPT-2.
  3. Problem-Solving and Debugging Skills: You'll encounter challenges like ambiguous queries, context-dependent responses, and handling out-of-scope inputs. This will force you to think critically, experiment with different approaches, and refine your models.
  4. Building a Strong Foundation for Future Projects: The knowledge and skills gained from building a chatbot can be applied to other NLP tasks, such as text summarization, machine translation, and question answering.
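As a concrete illustration of points 1 and 2, here is a minimal pipeline sketch, assuming Python with spaCy and its small English model installed (python -m spacy download en_core_web_sm):

import spacy

# Load a small English pipeline: tokenizer, POS tagger, lemmatizer, NER.
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

# Tokenization, part-of-speech tagging, and lemmatization in one pass.
for token in doc:
    print(token.text, token.pos_, token.lemma_)

# Named Entity Recognition: labeled spans such as ORG, GPE, MONEY.
for ent in doc.ents:
    print(ent.text, ent.label_)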
Key Considerations for Building Your Chatbot:
  • Data Quality: A high-quality dataset is crucial for training your model.
  • Model Architecture: Choose an appropriate architecture based on the complexity of your task.
  • Evaluation Metrics: Use relevant metrics to assess your model's performance.
  • Continuous Improvement: Regularly evaluate and refine your model to improve its accuracy and user experience.
By embarking on this journey, you'll gain valuable insights into the intricacies of NLP and position yourself for future advancements in the field.
  • asked a question related to Arabic NLP
Question
1 answer
For word segmentation. Thank you very much!
Relevant answer
Answer
  1. Accuracy: Evaluate the accuracy of both tools in segmenting Arabic text. This involves comparing their performance in correctly identifying word boundaries, handling punctuation marks, and tokenizing complex linguistic constructs common in Arabic text.
  2. Robustness: Assess the robustness of each tool across different types of Arabic text, including formal and informal language, dialectal variations, and domain-specific terminology. A robust segmenter should perform consistently well across diverse text sources.
  3. Speed and Efficiency: Measure the processing speed and efficiency of each tool, considering factors such as runtime performance, memory usage, and scalability to handle large volumes of text data.
  4. Language Support: Consider the breadth of language support offered by each tool, including support for different Arabic dialects, regional variations, and language-specific features or conventions.
  5. Customization and Fine-tuning: Evaluate the extent to which each tool allows for customization and fine-tuning to adapt to specific linguistic requirements or domain-specific challenges in Arabic text processing.
  6. Community Support and Documentation: Assess the availability of community support, documentation, and resources for each tool, including tutorials, forums, and user guides that facilitate integration, troubleshooting, and usage.
To conduct a comparative evaluation, you may need to design experiments and benchmarks tailored to your specific use case and evaluation criteria. Additionally, consider consulting academic research papers, user reviews, and developer documentation to gather insights and perspectives on the performance of Stanford CoreNLP and the Elasticsearch default segmenter for Arabic text segmentation. (A minimal boundary-level evaluation sketch follows this answer.)
Please follow me if it's helpful. All the very best. Regards, Safiul
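As a starting point for criterion 1, here is a tool-agnostic Python sketch of boundary-level evaluation: each segmenter's output is reduced to the character offsets where tokens end, and precision/recall/F1 are computed against a gold segmentation (the example pair is hypothetical):

def boundaries(tokens):
    # Return the set of character offsets at which a token ends.
    ends, pos = set(), 0
    for tok in tokens:
        pos += len(tok)
        ends.add(pos)
    return ends

def boundary_f1(gold_tokens, pred_tokens):
    gold, pred = boundaries(gold_tokens), boundaries(pred_tokens)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if tp else 0.0

# Hypothetical example: gold tokens vs. a tool that missed one split.
gold = ["ذهب", "الولد", "إلى", "المدرسة"]
pred = ["ذهب", "الولد", "إلىالمدرسة"]
print(round(boundary_f1(gold, pred), 3))  # 0.857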
  • asked a question related to Arabic NLP
Question
11 answers
I created my own large dataset from different sites and labeled it for an NLP task. How can I publish it as a paper or article, and where?
Relevant answer
Answer
Publishing your own created labeled corpus can be done through various avenues depending on your goals and the field you're working in. If you wish to contribute to the academic community and share your research findings, publishing it in the form of an article or paper in relevant journals or conference proceedings would be appropriate. This allows you to provide a detailed description of your corpus creation process, its applications, and potential insights derived from it. Alternatively, you could explore open-access platforms or repositories specific to linguistic resources, such as the Linguistic Data Consortium (LDC), where researchers can deposit and share their corpora. Additionally, if your corpus is of significant value and relevance, you may consider reaching out to organizations or institutions involved in language processing or research, as they may be interested in hosting and making it accessible to others in the field.
  • asked a question related to Arabic NLP
Question
4 answers
We are an active Arabic Natural Language Processing (NLP) and AI research group doing research in deep learning, machine learning, and social network analysis for Arabic NLP.
We are looking for an RA who can work remotely on a number of NLP/deep learning/machine learning projects. Where can we find such candidates?
Responsibilities:
Data cleaning, analysis and visualization using various approaches.
Ability to conduct literature reviews and summarize them in a coherent way.
Ability to implement different ML/DL approaches using different datasets to serve specific NLP problems.
Ability to fine-tune BERT/AraBERT and their different variants to serve specific NLP tasks.
Ability to communicate experiments and results in clear English.
Required Minimum Qualifications:
Master/PhD in computer science.
Experience in Python (including NumPy, SciPy, pandas, Matplotlib)
Excellent working knowledge of Deep learning/Machine Learning.
Experience with word embeddings, BERT, etc.
Ability to clearly communicate technical ideas in English.
Motivated, independent self-learner with the ability to work in a diverse team.
Excellent verbal and written communication skills are required.
Relevant answer
Answer
That is interesting. I am fully occupied myself, but I will share this with my colleagues and students if you are still looking.
  • asked a question related to Arabic NLP
Question
17 answers
Because human thought is interconnected with language, what do you think about the integration of Natural Language Processing (NLP) with Deep Learning (DL)? I think that it is the main way to build Artificial General Intelligence.
What approaches are used in the integration of NLP with DL? What are the trends in this area?
Relevant answer
Answer
Dear Amin Honarmandi Shandiz , thank you for your contribution. It is a very interesting paper. On the other hand, the integration of vision and language processing is only one part of the path toward implementing the understanding of meaning in AI.
  • asked a question related to Arabic NLP
Question
10 answers
I am looking for an Arabic dataset, specifically for chat.
Thanks!
Relevant answer
Answer
You can find it at: https://metatext.io/datasets
  • asked a question related to Arabic NLP
Question
2 answers
I am wondering if there is a dataset or online database for clinical reports, electronic health records or discharge summaries written in Arabic.
Relevant answer
Answer
Antonio Moreno Sandoval and his team created an Arabic Medical Corpus. I'd suggest contacting them to ask whether this resource is available to other researchers.
  • asked a question related to Arabic NLP
Question
16 answers
By mathematical pattern, I mean a mathematical pattern in the textual structure of the resulting language.
Relevant answer
Answer
Many thanks for your reply, Edmond Furter.
* "Mappings are arbitrary, anything can be said." But limited by context, situation, convention, intelligence, ultimately archetype.
Yes, but there is no direct (algorithmic) relationship between the environment and what can be said; that lens (your 'limitation') is provided by your interpretation, the 'world' you construct.
Daniel Chandler asks whether language surrounds ideas (a cloak) or supports ideas (a mould). My software works on the principle that words are ideas, depending on that lens: the 'lens' is constructed with words (noting that, as Ogden and Richards point out, words don't 'mean' anything in themselves!).
There is one caveat: we extend belief to the speaker until they say something negative, something we don't understand, or something contrary to our beliefs, at which point we need to express something ourselves: "now hold on/sorry, can you repeat that". This is known as 'felicity' by Austin, or 'adequacy' in the Ogden and Richards triangle, and it directs the flow of conversation (which is how I model conversation in Enguage). I'm thinking of encoding it, "yes, ...", "ok, ...", "sorry, ..." etc., but it becomes stilted.
But I think this simple 'truth' is what Human Life Programming Language should really be looking for as a mathematical pattern.
  • asked a question related to Arabic NLP
Question
3 answers
Dear everybody!
I'm doing a hobby project: creating a character-level seq2seq LSTM.
In my task, I give a text as input (max 40 characters) and the LSTM generates an output that rhymes with the input.
I created very large databases of rhyming lines.
At the beginning I trained my model with the following parameters:
batch_size = 200, epochs = 250, latent_dim = 300, num_samples = 10000
With these parameters my model converged to 0.4 after 75 epochs, but I waited out all 250 epochs and tested that model.
The result wasn't so bad, but I wanted more.
After that I tried very large batch sizes, with more than 200k training samples (almost all possible parameters), and every run led to overfitting, meaning the model returned the same sentence for every input. BUT(!) after trying the 250-epoch model, I used checkpoint saving and tested only the best model once it stopped converging. It usually stops at 0.29 accuracy.
I know a character-level LSTM has its own limitations for this task, but can it really handle only 10k training samples?
Is it possible that convergence doesn't matter in this case and the model just needs more epochs?
Is the database too big, with too many stopwords, so that I need to do word-frequency-based filtering on the training data?
I know the word-level method could be more effective, but I'm afraid I have misunderstood something, and I don't want to waste more time waiting for training results while I don't know what I'm doing wrong.
What should I do?
Thank you all.
Relevant answer
Answer
There are two methods which can help with overfitting. One is regularization. Is your output sigmoid? If so, activity regularization on the previous layer is helpful. Otherwise, kernel regularization might be a better bet.
The other common method is to add noise: Dropout to a binary layer, or GaussianDropout to a layer with continuous values. If you are using Keras, there is an option on the LSTM class to specify some dropout. The most commonly used value is 0.5, but I often use less.
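A minimal sketch of both methods, assuming Keras (TensorFlow 2); the layer sizes are toy values chosen to match the question:

from tensorflow import keras
from tensorflow.keras import layers, regularizers

num_chars, latent_dim, seq_len = 80, 300, 40

model = keras.Sequential([
    keras.Input(shape=(seq_len, num_chars)),
    layers.LSTM(
        latent_dim,
        return_sequences=True,
        dropout=0.3,                # dropout on the inputs of each step
        recurrent_dropout=0.3,      # dropout on the recurrent connections
        kernel_regularizer=regularizers.l2(1e-5),
    ),
    # Per-timestep character distribution; activity regularization on the output.
    layers.Dense(num_chars, activation="softmax",
                 activity_regularizer=regularizers.l2(1e-5)),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()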
  • asked a question related to Arabic NLP
Question
12 answers
Hi, I am trying to handle an imbalanced dataset using SMOTE in text classification, while using TfidfTransformer and k-fold cross-validation. I want to solve this problem using Python code. It has actually taken me over two weeks, and I couldn't find any clear and easy way to solve this problem.
Do you have any suggestions on where exactly to look?
After applying SMOTE, is it normal to get different accuracy results on the dataset?
Relevant answer
Answer
You need to fix the seed number so that you can replicate the result each time you perform the task.
HTH.
Dr. Samer Sarsam
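Building on that, here is a minimal sketch assuming scikit-learn and imbalanced-learn: fixing random_state makes runs repeatable, and imblearn's Pipeline ensures SMOTE resamples only the training folds during cross-validation (the toy data is hypothetical):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # allows resampling steps, unlike sklearn's

# Hypothetical imbalanced toy data: 30 majority vs. 10 minority texts.
texts = [f"good product number {i}" for i in range(30)] + \
        [f"terrible awful service {i}" for i in range(10)]
labels = [1] * 30 + [0] * 10

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("smote", SMOTE(random_state=42)),  # applied only to training folds
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="f1")
print(scores.mean())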
  • asked a question related to Arabic NLP
Question
6 answers
Is there any tool or algorithm that finds the pattern of a given Arabic word?
For example: extracting the pattern of the Arabic word "ملعب", which is "مفعل".
Relevant answer
Answer
I will double-check.
Regards,
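In the meantime, here is a naive illustrative Python sketch (not a real morphological analyzer): when the triliteral root is known and its letters appear contiguously in order, substituting the root consonants with ف, ع, ل recovers the pattern for simple cases like the one in the question:

def word_pattern(word: str, root: str) -> str:
    # Map the 1st, 2nd, 3rd root consonant to ف, ع, ل respectively;
    # all other letters (prefixes, infixes, suffixes) are kept as-is.
    placeholders = "فعل"
    pattern, r = [], 0
    for ch in word:
        if r < len(root) and ch == root[r]:
            pattern.append(placeholders[r])
            r += 1
        else:
            pattern.append(ch)
    return "".join(pattern)

print(word_pattern("ملعب", "لعب"))  # -> مفعل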
  • asked a question related to Arabic NLP
Question
4 answers
What is the best word embedding evaluation method?
Relevant answer
Answer
Refer to this paper for various word embedding evaluation methods.
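As one concrete example, a widely used intrinsic method is word-similarity correlation: rank word pairs by the cosine similarity of their embeddings and correlate the ranking with human judgments. A minimal sketch with toy, hypothetical vectors and ratings:

import numpy as np
from scipy.stats import spearmanr

# Hypothetical 3-dimensional embeddings.
embeddings = {
    "king":  np.array([0.90, 0.10, 0.30]),
    "queen": np.array([0.85, 0.20, 0.35]),
    "apple": np.array([0.10, 0.90, 0.20]),
    "fruit": np.array([0.20, 0.80, 0.25]),
}
# (word1, word2, hypothetical human similarity rating)
gold = [("king", "queen", 8.5), ("apple", "fruit", 7.0), ("king", "apple", 1.2)]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

model_scores = [cosine(embeddings[w1], embeddings[w2]) for w1, w2, _ in gold]
human_scores = [s for _, _, s in gold]
rho, _ = spearmanr(model_scores, human_scores)
print(f"Spearman correlation: {rho:.3f}")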
  • asked a question related to Arabic NLP
Question
14 answers
Apart from TALLIP, I do not know of any journals that specialize in Arabic NLP and information retrieval.
Can anyone suggest other journals?
Relevant answer
Answer
For journals, apart from TALLIP, there is no dedicated one for Arabic NLP. There were some attempts years ago to propose one, but they didn't succeed. You can, however, check these two journals, where papers and even special issues on Arabic NLP are very common:
- Journal of King Saud University - Computer and Information Sciences
- Arabian Journal for Science and Engineering
For conferences, there has been one dedicated regular conference on Arabic NLP since 2006, the "International Conference on Arabic Language Processing" (ICALP); its next edition will take place in Nancy in October 2019. A second, more recent one is ACLing, which takes place in Dubai. Finally, there is a regular workshop, WANLP, held in conjunction with major conferences such as EMNLP and ACL; its next edition will take place in Florence in July 2019.
  • asked a question related to Arabic NLP
Question
3 answers
Hello,
I'm working on text-to-speech (TTS) research for the Arabic language. One TTS component that I have noticed is under-researched for Arabic is G2P (grapheme-to-phoneme) conversion, especially G2P using neural networks or AI.
In your opinion, why is this area (G2P for Arabic) under-researched? Why are there no (or few) papers on using AI and neural networks for Arabic G2P? Is it not worth working on? Do you think this is a good idea to research?
Thank you.
Relevant answer
Answer
Dear Mousa, there are around 6,000 spoken languages in the world, and most researchers do NLP research on their local or national language. My point here is that it is the responsibility of local researchers like you to take the lead and start serving your nation and the whole of humanity.
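For readers new to the task, here is a toy sketch of the rule-based baseline that a neural G2P model would aim to improve on: a direct letter-to-phoneme lookup over fully diacritized text. The tiny mapping table below is an illustrative assumption, not a complete inventory:

# Toy grapheme-to-phoneme table for fully diacritized Arabic (illustrative only).
G2P = {
    "س": "s", "ل": "l", "ا": "aa", "م": "m",
    "َ": "a", "ُ": "u", "ِ": "i", "ْ": "",  # short vowels and sukun
}

def graphemes_to_phonemes(word: str) -> str:
    phones = [G2P.get(ch, "?") for ch in word]
    return " ".join(p for p in phones if p != "")

print(graphemes_to_phonemes("سَلَام"))  # -> s a l a aa m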
  • asked a question related to Arabic NLP
Question
3 answers
I want to know about the best Arabic named entity recognition (NER) tools available and how to use them.
Thanks in advance
Relevant answer
Answer
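One common route today is a pretrained transformer served through the Hugging Face transformers pipeline; a minimal sketch (the checkpoint name is an assumption; substitute any Arabic NER model from the Hub):

from transformers import pipeline

ner = pipeline(
    "ner",
    model="CAMeL-Lab/bert-base-arabic-camelbert-msa-ner",  # assumed checkpoint
    aggregation_strategy="simple",  # merge word pieces into whole entity spans
)

# "Mohammed was born in Cairo."
for ent in ner("ولد محمد في القاهرة"):
    print(ent["entity_group"], ent["word"], round(float(ent["score"]), 3))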
  • asked a question related to Arabic NLP
Question
3 answers
I am working on Arabic NLP and I was using Badaro's dataset, ArSenL.
I used it to enrich my own lexicon dataset, but when it was manually inspected I found that it had many misclassified words [words classified as positive with high confidence while they very clearly have negative polarity; e.g., 'murder' was classified as positive].
So if someone has used it for the same purpose, can they tell me whether I can rely on it blindly, or will I need to inspect it manually?
Relevant answer
Answer
قتل ('murder') is classified as positive with 100% confidence; this is the one I remember. I may give you some others when I get to the office.
  • asked a question related to Arabic NLP
Question
2 answers
Hi,
I know the GATE library has some support for non-English ontologies, such as Arabic. I am wondering if there is another library package for Arabic ontologies?
Arabic plugin
How do I create RDF with GATE? documentation
Thanks
Relevant answer
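While GATE's own RDF export goes through its Java plugins (see the documentation linked above), here is a language-neutral illustration of constructing and serializing RDF triples, sketched with Python's rdflib rather than GATE's API (the namespace and resources are hypothetical):

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/arabic-ontology#")  # hypothetical namespace

g = Graph()
g.bind("ex", EX)
g.add((EX.Cairo, RDF.type, EX.City))
g.add((EX.Cairo, RDFS.label, Literal("القاهرة", lang="ar")))

print(g.serialize(format="turtle"))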
  • asked a question related to Arabic NLP
Question
3 answers
I use the tool TextDirectoryToArff.java from the WEKA web site, as in the link below,
but the result, as in Figure 1, is not recognized by WEKA.
I need each word to have its features in a single row, not multiple rows as in Figure 1.
I tried coding this without any success.
I need help arranging the file to be readable by WEKA.
If anyone knows a tool or can guide me to a solution, thanks.
Relevant answer
Answer
You should look at the example ARFF files that ship with WEKA; they are in the directory where WEKA is installed.
An ARFF file contains two main parts:
1. A header that declares the attributes (each feature's name and type)
2. A data section with the feature values, one instance per row
To generate an ARFF file that WEKA can read, you should:
1. Define the attributes (feature name along with its type)
2. Write code that generates the value of each feature and writes it into a .arff file
Hope this helps.
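A minimal Python sketch of step 2, writing a tiny ARFF file by hand (the features and labels are hypothetical):

# Hypothetical rows: one word per instance, with a numeric feature and a class.
rows = [
    ("word1", 0.75, "positive"),
    ("word2", 0.10, "negative"),
]

with open("words.arff", "w", encoding="utf-8") as f:
    f.write("@relation words\n\n")
    f.write("@attribute word string\n")
    f.write("@attribute tfidf numeric\n")
    f.write("@attribute class {positive,negative}\n\n")
    f.write("@data\n")
    for word, tfidf, label in rows:
        f.write(f"'{word}',{tfidf},{label}\n")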
  • asked a question related to Arabic NLP
Question
5 answers
I want to do a very simple job: given a string containing pronouns, I want to resolve them.
For example, I want to turn the sentence "Mary has a little lamb. She is cute." into "Mary has a little lamb. Mary is cute.".
I use Java and the Stanford coreference system, which is part of Stanford CoreNLP. I have managed to write some of the code, but I am unable to complete it and finish the job. Below is the code I have used so far. Any help and advice will be appreciated.
import java.util.Map;
import java.util.Properties;
import edu.stanford.nlp.dcoref.CorefChain;
import edu.stanford.nlp.dcoref.CorefCoreAnnotations.CorefChainAnnotation;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

String file = "Mary has a little lamb. She is cute.";
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation document = new Annotation(file);
pipeline.annotate(document);

// Coreference chains are stored once per document, not per sentence,
// so there is no need to loop over the sentences here.
Map<Integer, CorefChain> graph = document.get(CorefChainAnnotation.class);
for (Map.Entry<Integer, CorefChain> entry : graph.entrySet()) {
    CorefChain c = entry.getValue();
    // The representative mention ("Mary") is the best name for the chain;
    // replacing the chain's other mentions with it resolves the pronouns.
    CorefChain.CorefMention cm = c.getRepresentativeMention();
    System.out.println(c);
    System.out.println(cm);
}
Relevant answer
Answer
A plausible semantic solution to your problem, for the example given, could be as follows:
Keep three bounded queues, one per type of subject: {male, female, object}. Have an extensible mapping knowledge base to identify which category a given subject falls under. E.g., names like Mary fall under the female category, names like Mark under male, and the rest are objects (places, animals, and other things).
You may keep the queue size for each category at two. E.g., in "Mark was talking to Anthony. He told him that...", 'he' would refer to Mark and 'him' to Anthony. For simplicity you may keep the queue size at one, given the simple case illustrated in your example.
Each time a pronoun is encountered, replace it with the name at the head of the respective queue.
Each time a new subject is encountered, enqueue it into the queue matching its category (male, female, or object).
The above solution would certainly work for simple cases like the one described in the question. Accuracy on complex forms may vary; complex forms would require more use of deductive systems. You may also use Turing Machines to process such tasks.
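A toy Python sketch of this queue-based heuristic (the gender lexicon and pronoun table are tiny illustrative assumptions, not a real coreference resolver):

from collections import deque
import re

GENDER = {"mary": "female", "mark": "male", "anthony": "male"}
PRONOUNS = {"she": "female", "her": "female", "he": "male", "him": "male", "it": "object"}

def resolve(text: str, queue_size: int = 1) -> str:
    queues = {c: deque(maxlen=queue_size) for c in ("male", "female", "object")}
    out = []
    for token in re.findall(r"\w+|[^\w\s]", text):
        category = PRONOUNS.get(token.lower())
        if category and queues[category]:
            out.append(queues[category][-1])  # replace pronoun with last subject
        else:
            if token[0].isupper() and token.lower() in GENDER:
                queues[GENDER[token.lower()]].append(token)  # enqueue new subject
            out.append(token)
    return re.sub(r"\s+([^\w\s])", r"\1", " ".join(out))

print(resolve("Mary has a little lamb. She is cute."))
# -> Mary has a little lamb. Mary is cute.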