Science topic
Information Extraction - Science topic
Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents. In most cases, this activity concerns processing human-language texts by means of natural language processing (NLP).
Questions related to Information Extraction
I need an annotated dataset of English news reports on natural disasters. Please help me with this information.
I have a dataset of roughly 3,000 records, each containing a free-text field with notes from the doctor. I need to extract specific information from all of them, for example the doctor's final decision and the classification of the patient. What is the most appropriate way to analyse these texts: should I use information retrieval, information extraction, or would a question-answering system be fine?
I am looking for competitions/benchmarks in the field of e-discovery. My objective is to understand the state of the art in this field.
I found TREC (https://trec.nist.gov/) but their last legal track dates back to 2011.
Any ideas? Thanks.
As we know, most researchers use manual validation by experts for unlabelled user reviews in a specific domain, but is there a newer way? I am working with a large dataset, so relying on experts will be difficult.
If anyone has used a new performance measure or a new validation approach, please let me know.
Thanks in advance.
As we know, most researchers use manual validation by domain experts, but is there a newer way, or any benchmark? If anyone has a benchmark dataset for this task in any domain, please share it with me if possible. Thanks in advance.
For instance, the layout of the sentences (e.g. knowing that a specific sentence is a bullet point and that it is correlated with another sentence stating the scope of the bullets). Moreover, many NLP parsers break if the extraction mechanism delivers broken sentences (e.g. when a regex cannot tell whether a line break in the source document is a genuine sentence boundary and therefore cannot clean the text). Such broken sentences may also hurt the quality of embeddings, since embedding models take neighbouring words into account.
I need EHR datasets to test my algorithm on semantic interoperability and conflict resolution of different EHR systems.
I am trying to use Stanford TokensRegex; however, I am getting an error at line 11. Please do your best to help me. Below is my code:
1 String file="A store has many branches. A manager may manage at most 2 branches.";
2 Properties props = new Properties();
3 props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
4 StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
5 Annotation document = new Annotation(file);
6 pipeline.annotate(document);
7 List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
8 for(CoreMap sentence: sentences)
9 {
10 TokenSequencePattern pattern = TokenSequencePattern.compile("[]");
11 TokenSequenceMatcher matcher = pattern.getMatcher(sentence);
12 while( matcher.find()){
13 JOptionPane.showMessageDialog(rootPane, "It has been found");
14 }
15 }
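For reference, one likely cause of an error at line 11 (the exact message is not included above) is that TokenSequencePattern.getMatcher expects a list of tokens rather than a CoreMap sentence. A minimal sketch of that fix, keeping the same pipeline setup and assuming java.util.List and edu.stanford.nlp.ling.CoreLabel are imported, would replace line 11 with:

TokenSequenceMatcher matcher = pattern.getMatcher(sentence.get(CoreAnnotations.TokensAnnotation.class)); // match against this sentence's tokens

This is only a sketch of the probable type mismatch, not a confirmed diagnosis of the missing error message.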
Hi. I have a query regarding text classification. I have a list of words with the following attributes: word, weight, class. The class can be positive or negative, and the weight is between -1 and 1. How can I train a classifier such as an SVM using this word list to classify unseen documents? An example in any tool is welcome.
1. For my research I need to create an ontology that is both domain-specific and in English. Is Protege the best option? What criteria should be kept in mind while creating ontologies?
2. A voluminous text file is given as input and the delimiter is a full stop ("."), i.e. the analysis has to be done at sentence level. What would be the best way to keep track of the word order within a sentence?
3. Is there any repository of unstructured text data (in English) which can be used for testing? Thanks in advance.
What are the best tools or algorithms for extracting information from different kinds of papers?
Example:
Extracting the Author names, Journal Name, Year of publication from different journal papers.
Thank you
My aim is to obtain valuable information from prospectuses (documents that describe a financial security), i.e. I need to build a metadata repository about financial securities by extracting information from the documents that describe them.
I have used ClausIE, which returns subject-verb-object triples from a sentence, but this does not work when the text is short and not even a complete sentence. I just want a library or another approach that returns just the subject and the verb from short text. Example short text: "Proposal 32 accepted". It should use dependencies or rules to identify that the term "Proposal" is the subject and "accepted" is the verb.
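One option, sketched below as a minimal example rather than a definitive solution, is to run a dependency parser (here Stanford CoreNLP, which is used elsewhere on this page) and read off the nsubj/nsubjpass edge; note that very short fragments can still confuse the parser, so the output should be checked. The example text is the one from the question.

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.semgraph.SemanticGraphEdge;
import edu.stanford.nlp.util.CoreMap;
import java.util.Properties;

public class SubjectVerbSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma, parse");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        Annotation document = new Annotation("Proposal 32 accepted");
        pipeline.annotate(document);
        for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
            SemanticGraph deps = sentence.get(
                    SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation.class);
            for (SemanticGraphEdge edge : deps.edgeIterable()) {
                // nsubj / nsubjpass link a verb (governor) to its subject (dependent)
                if (edge.getRelation().getShortName().startsWith("nsubj")) {
                    System.out.println("subject = " + edge.getDependent().word()
                            + ", verb = " + edge.getGovernor().word());
                }
            }
        }
    }
}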
I am working in the area of diagram understanding, currently on techniques for associating text with objects (arrows, data points, etc.). Is there any existing research on information extraction from vector graphics with significant object-association techniques?
Hi, I am testing my method on the Foursquare dataset: https://archive.org/details/201309_foursquare_dataset_umn. I do not know why in some cases there are several ratings for one item, i.e. one person has several different ratings for the same item. For example, person 'a' rated item 'b' three times, and sometimes these three ratings differ from each other. How should I handle these ratings? Should I take the average of the ratings as that person's rating for the item?
Thanks in advance.
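Averaging duplicates is a common first choice (keeping only the most recent rating is another). A minimal sketch of per-(user, item) averaging is below; the hard-coded triples are hypothetical placeholders standing in for rows read from the dataset file.

import java.util.HashMap;
import java.util.Map;

public class RatingAggregator {
    public static void main(String[] args) {
        // Hypothetical (user, item, rating) triples; in practice read them from the dataset.
        String[][] rows = { {"a", "b", "3"}, {"a", "b", "5"}, {"a", "b", "4"}, {"c", "b", "2"} };
        Map<String, double[]> acc = new HashMap<>(); // key -> {sum, count}
        for (String[] r : rows) {
            String key = r[0] + "\t" + r[1];
            double[] sc = acc.computeIfAbsent(key, k -> new double[2]);
            sc[0] += Double.parseDouble(r[2]);
            sc[1] += 1;
        }
        for (Map.Entry<String, double[]> e : acc.entrySet()) {
            System.out.println(e.getKey() + " -> " + (e.getValue()[0] / e.getValue()[1]));
        }
    }
}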
Dear All,
I want to group my dataset using a clustering technique. I applied k-means and used the Dunn index for validation. Now I want to know what the optimal number of clusters should be based on the Dunn index. For your reference, I am uploading the DI graph.
Please suggest which number of clusters I should choose based on this plot.
Thanks

I have a project where I need to classify both documents that are news articles and the short messages/blog comments posted about those articles.
I have tried the following with the same sentences and get different confidence levels. So now I am a bit worried as to which service to use best. I have tried Alchemy API, MonkeyLearn, Algorithmia and Aylien with the following sample:
"While I fully agree that the pilots, like any other worker, have the right to know the state of affairs in their company, I think it is very stupid to start thinking about industrial action at this stage. If they are not careful there may not be an airline left anyway. Please do not shoot yourselves in the foot and think carefully before you take any 'action'."
The answers were:
- Alchemy: negative 0.473512
- MonkeyLearn: negative 0.484
- Algorithmia: Score of 1 (0 is very negative - 2 is neutral and 4 is very positive)
- Aylien: negative 0.62843 (in tweet mode), positive 0.9867 (in document mode).
I tried different comments as input and obtained mixed results; for example, the closeness of the Alchemy and MonkeyLearn scores above is not often repeated.
My question is: which service shall I choose? :-)
Dear All,
I need the following dataset, and/or any dataset that has one or more of the following features. The data relates to interaction among humans (social or otherwise):
initial friendship (with timestamps), interactions (with timestamps), subsequent friendship (with timestamps), and other static profile information (e.g. interests, locations).
An example would be Instagram data, but any other dataset would be highly appreciated.
Thank you in advance.
I want to implement an index as a tree with a Bloom filter, using locality-sensitive hashing.
Hello all,
I would like to know which extractive methods are suitable for Myanmar text, whose structure is SOV (subject, object, verb).
Before extraction we have to carry out several preprocessing stages. What are they, and how do I extract important words or phrases from news text?
I am going to propose a CRF method for word extraction, but I am having trouble with it.
So please kindly advise me on how I should approach it.
Thanks to all.
A newly developed information retrieval system requires testing against existing solutions. Besides evaluation metrics, comprehensive datasets are often required so that the system can be tested and evaluated. You are therefore requested to share your ideas and suggestions on developing datasets that can make the evaluation process easier, more accurate, and more precise.
Thanks in advance
I'm currently working on feature extraction for cyberbullying, but I am having difficulty finding available datasets.
How can we build our own corpus of tweets that includes tweets written by people who are suspected of having mental illnesses?
Relevant papers and ideas will be welcome.
I want to run an experiment with an information display matrix (IDM). Is there a program that is functional and easily accessible?
Thanks for the help.
Great attention should be paid to methods of searching for and selecting sources, in order to establish their credibility and the value of the information they provide.
Natural Language Processing, World Knowledge, WordNet, Natural Language Understanding, Semantics, Lexical-Semantic Relations, Latent Semantic Analysis, Information Extraction, Extraction
Standard corpora exist in various domains; however, I cannot find a corpus containing large amounts of technical documentation.
The only corpus I have heard of is the "Scania Corpus" from the PLUG project (1998), but I cannot find any resources for it.
Does anybody know of another corpus, or have access to the Scania documents?
Thank you in advance
Best regards
-Sebastian
Which are the best data mining classification algorithms if my dataset is healthcare data and accuracy is the priority?
I am trying to develop software that suggests suitable attributes for entity names depending on the entity type.
For example, entities such as doctor, nurse, employee, customer, patient, lecturer, donor, user, developer, designer, driver, passenger and technician will all have attributes such as name, sex, date of birth, email address, home address and telephone number, because all of them are people.
As a second example, words such as university, college, hospital, hotel and supermarket can share attributes such as name, address and telephone number, because all of them could be organizations.
Are there any natural language processing tools or software that could help me achieve my goal? I need to identify the entity type as person or organization and then attach suitable attributes according to that type.
I have looked at named entity recognition (NER) tools such as the Stanford Named Entity Recognizer, which can extract entities such as Person, Location, Organization, Money, Time, Date and Percent, but it was not really useful here.
I could do it by building my own gazetteer; however, I would prefer not to take that option unless I fail to do it automatically.
Any helps, suggestions and ideas will be appreciated.
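One avenue worth exploring, sketched below as an assumption-laden example rather than a ready-made solution, is to look up each entity word in WordNet and walk its hypernym chain to see whether it reaches a "person" or "organization" synset, then attach a hand-defined attribute template. The sketch uses the MIT JWI library; the dictionary path, the attribute lists and the stopping lemmas are placeholders, and taking only the first noun sense is a simplification.

import edu.mit.jwi.Dictionary;
import edu.mit.jwi.IDictionary;
import edu.mit.jwi.item.*;
import java.io.File;
import java.util.*;

public class EntityTypeGuesser {
    // Walk the hypernym chain of the first noun sense and report whether it
    // reaches a "person" or an "organization" synset.
    static String guessType(IDictionary dict, String noun) {
        IIndexWord idx = dict.getIndexWord(noun, POS.NOUN);
        if (idx == null) return "unknown";
        Deque<ISynset> queue = new ArrayDeque<>();
        queue.add(dict.getWord(idx.getWordIDs().get(0)).getSynset());
        while (!queue.isEmpty()) {
            ISynset current = queue.poll();
            for (IWord w : current.getWords()) {
                if (w.getLemma().equals("person")) return "person";
                if (w.getLemma().equals("organization")) return "organization";
            }
            for (ISynsetID id : current.getRelatedSynsets(Pointer.HYPERNYM)) {
                queue.add(dict.getSynset(id));
            }
        }
        return "unknown";
    }

    public static void main(String[] args) throws Exception {
        IDictionary dict = new Dictionary(new File("/path/to/WordNet/dict")); // placeholder path
        dict.open();
        Map<String, List<String>> attributes = new HashMap<>();
        attributes.put("person", Arrays.asList("name", "sex", "date_of_birth",
                "email_address", "home_address", "telephone_number"));
        attributes.put("organization", Arrays.asList("name", "address", "telephone_number"));
        for (String entity : new String[]{"doctor", "university", "driver", "hospital"}) {
            String type = guessType(dict, entity);
            System.out.println(entity + " -> " + type + " -> "
                    + attributes.getOrDefault(type, Collections.emptyList()));
        }
    }
}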
The corpus must contain documents (texts) with keywords hand-annotated by human experts.
Greetings,
I am currently working on an application which is aimed at measuring and storing maximum rotation speeds of the device attached to an object (in rotation).
Unfortunately, I think I have finally come across a problem: my software converts gyro values (angular velocity) into rotational speed (in cycles per minute), but my Samsung Note II reaches only 167-168 rot/min.
Can somebody advise where I can find the maximum value that such a "budget" gyro is able to measure?
Do you know of any method to extend that value?
Kind regards,
Mariusz.
There are a few computational models of CIT for concept invention out there (e.g. Pereira, 2007; Li, Zook, Davis & Riedl, 2012). I was wondering whether this idea could be turned on its head and repurposed to streamline information extraction from corpora. Any suggestions on how one could go about it?
In all papers on rough set theory (RST), authors point out that RST, proposed by Prof. Pawlak, is a tool for dealing with incomplete data.
But in some papers authors say that RST is based on an equivalence relation and is unable to deal with incomplete information systems (with either "do not care" * or lost ? values), so they propose other relations such as tolerance or dominance.
Can someone elaborate on these two terms, "incomplete data" and "incomplete information systems"?
The Large Scale Information Extraction (LaSIE) project led to the creation of a base IE system designed by Prof. R. Gaizauskas, which has served as the basis for later projects. I have spent a great deal of time trying to figure out how to download LaSIE and use it in my own application, but all my attempts have failed. It would be appreciated if any member of ResearchGate could send me some information about how to download and use it. I would like to obtain the result of the discourse analysis stage of LaSIE and then use this result to build my application.
Can the depth be controlled by the complexity of the object (e.g. faces, written characters, and the like) in a deep learning network for image processing?
So far, I have been able to find these papers:
1. Prototype a Knowledge Discovery Infrastructure by Implementing Relational Grid Monitoring Architecture (R-GMA) on European Data Grid (EDG) by Frank Wang, Na Helian, Yike Guo, Steve Thompson, John Gordon.
2. Knowledge grid-based problem-solving platform by Lu Zhen, Zuhua Jiang,Jun Liang.
Thank you in advance for any help.
At the moment, it looks like the texture feature is computed for a single band only?
Hi,
I have a set of ontologies related to the Cultural Heritage domain created by technical experts, and a textual corpus written by archaeological experts. My problem is that the ontologies need to be populated with archaeological knowledge (about which I do not know much), so I will use the archaeological texts to try to extract the information needed.
I need your recommendations on information extraction methods.
And for the ontologies, are there any heuristics for populating an ontology automatically? (I have the T-Box and I need to generate the A-Box.)
Thank you for your interest,
Best regards.
I would like to explore methods for re-ranking result sets retrieved by a term-based query against a database of bibliographic records. I believe that this additional layer of processing could improve a user's information-seeking experience by helping them find articles relevant to their need more easily.
An alternative implementation is to exclude from the result set records which, although they contain the search term, fail to meet other criteria.
In either case, I am looking for existing literature that could help me identify a suitable method of analysis for comparing one set of ranked results to another. I have found studies in which a subject matter expert codes each individual record returned in a result set as relevant or not, in order to compute precision and recall. This may be one strategy, but I am not sure it alone can describe the differences between two result sets, or the differences in how they are ranked (at least beyond some arbitrary number of results: it could become unfeasible for a human to evaluate thousands of results, for example). I am also considering the value of a mixed-methods approach, in which I integrate more qualitative assessments of user satisfaction with what users feel to be the quality of the results retrieved.
I would appreciate any suggestions for literature or methods to consider for this type of research. Thank you!
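If relevance judgments are unavailable, two rankings over the same record set can still be compared directly. The sketch below (with hypothetical document ids) computes overlap at k and a simple Kendall-tau over the documents both lists share; it is an assumption-laden starting point rather than a complete evaluation methodology.

import java.util.*;

public class RankingComparison {
    // Overlap of the top-k results of two rankings; a simple first-pass measure.
    static double overlapAtK(List<String> a, List<String> b, int k) {
        Set<String> topA = new HashSet<>(a.subList(0, Math.min(k, a.size())));
        Set<String> topB = new HashSet<>(b.subList(0, Math.min(k, b.size())));
        topA.retainAll(topB);
        return (double) topA.size() / k;
    }

    // Kendall tau over documents appearing in both rankings (no relevance judgments needed).
    static double kendallTau(List<String> a, List<String> b) {
        List<String> common = new ArrayList<>(a);
        common.retainAll(new HashSet<>(b));
        int concordant = 0, discordant = 0;
        for (int i = 0; i < common.size(); i++) {
            for (int j = i + 1; j < common.size(); j++) {
                int d = Integer.compare(b.indexOf(common.get(i)), b.indexOf(common.get(j)));
                if (d < 0) concordant++; else if (d > 0) discordant++;
            }
        }
        int pairs = concordant + discordant;
        return pairs == 0 ? 0.0 : (concordant - discordant) / (double) pairs;
    }

    public static void main(String[] args) {
        List<String> baseline = Arrays.asList("d1", "d2", "d3", "d4", "d5");
        List<String> reranked = Arrays.asList("d3", "d1", "d2", "d5", "d4");
        System.out.println("overlap@3 = " + overlapAtK(baseline, reranked, 3));
        System.out.println("kendall tau = " + kendallTau(baseline, reranked));
    }
}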
Rank aggregation algorithms, etc., for the recommendation process.
Is there any generalized tool for information extraction from multimedia data?
Using Color, edge and texture features to extract robust silhouettes
"If carrier is single, we are able send only one vector of symbols on the channel by using the carrier.
In the case of OFDM systems, we are going to pass multiple carriers through a single channel and each carrier carries vector of array. "
Is the above statement correct. because I have understand the OFDM as above?
Please clarify my doubt.
Thanking you
In opinion mining, feature extraction plays a very important role in summarizing reviews. There are several research techniques for extracting features from online reviews. What are the most successful techniques for explicit/implicit feature extraction?
It would be appreciated if I could have examples with code, a tutorial, or any other useful resource.
I am trying to use Stanford TokensRegex to design patterns. I am attempting to match "A manager may manage at most 2 branches", which is mentioned once in the text, but I fail to get a match. Below is my code:
String file="A store has many branches. Each branch must be managed by at most 1 manager. A manager may manage at most 2 branches. The branch sells many products. Product is sold by many branches. Branch employs many workers. The labour may process at most 10 sales. It can involve many products. Each Product includes product_code, product_name, size, unit_cost and shelf_no. A branch is uniquely identified by branch_number. Branch has name, address and phone_number. Sale includes sale_number, date, time and total_amount. Each labour has name, address and telephone. Worker is identified by id’.";
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
// read some text in the text variable
// create an empty Annotation just with the given text
Annotation document = new Annotation(file);
// run all Annotators on this text
pipeline.annotate(document);
// these are all the sentences in this document
// a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
for(CoreMap sentence: sentences)
{
TokenSequencePattern pattern = TokenSequencePattern.compile("A manager may manage at most 2 branches");
String sentence1=sentence.toString();
String[] tokens = sentence1.split(" ");
TokenSequenceMatcher matcher = pattern.getMatcher(document.get (CoreAnnotations.SentencesAnnotation.class));
while( matcher.find()){
JOptionPane.showMessageDialog(rootPane, "It has been found");
}
}
Please suggest any books, articles which could help me in learning to design patterns in Stanford TokensRegex within Stanford CoreNLP.
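A likely reason the code above never matches is that getMatcher is given the list of sentences rather than a list of tokens, so the pattern is compared against whole-sentence "tokens". A minimal sketch of the fix under that assumption (assuming edu.stanford.nlp.ling.CoreLabel is imported; the sentence1/tokens split lines then become unnecessary):

TokenSequenceMatcher matcher = pattern.getMatcher(sentence.get(CoreAnnotations.TokensAnnotation.class)); // this sentence's tokens

For the pattern syntax itself, the TokensRegex documentation on the Stanford NLP website and the TokenSequencePattern Javadoc are the main references.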
MaxEnt is often said to be better than SVM for natural language processing tasks such as text classification, machine translation and named entity extraction. However, in my experience, having trained MaxEnt with different parameters, I find that SVM always outperforms MaxEnt.
The k-means clustering algorithm is iterative in nature: a particular step has to be carried out repeatedly until the convergence criterion is met. One map and one reduce call do the work of one iteration, but I need to call map-reduce, then map-reduce again, and so on until a certain condition is satisfied. Can someone help me in this regard? How do I call MapReduce in a loop, where the output of the first reduce becomes the input of the second map, the output of the second reduce becomes the input of the third map, and so on?
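One common pattern is a driver program that submits one Hadoop job per iteration, pointing each iteration's input at the previous iteration's output directory until a convergence check passes. The sketch below assumes the org.apache.hadoop.mapreduce API; AssignMapper, RecomputeReducer and converged(...) are hypothetical placeholders for the actual k-means logic. (For k-means specifically, many implementations instead keep the original point file as the input every round and pass the current centroids as side data, e.g. via the configuration or the distributed cache.)

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeKMeansDriver {

    // Placeholder mapper: a real k-means mapper would assign each point to its
    // nearest centroid and emit (clusterId, point).
    public static class AssignMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(new IntWritable(0), value);
        }
    }

    // Placeholder reducer: a real k-means reducer would recompute the centroid
    // of each cluster from the points assigned to it.
    public static class RecomputeReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
        @Override
        protected void reduce(IntWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text v : values) {
                context.write(key, v);
            }
        }
    }

    // Hypothetical convergence test, e.g. compare the centroids written in the
    // previous and current output directories and return true once they are stable.
    static boolean converged(Configuration conf, Path previous, Path current) {
        return false;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path("data/iteration-0"); // initial input directory
        int iteration = 0;
        boolean done = false;
        while (!done && iteration < 50) { // hard cap as a safety net
            Path output = new Path("data/iteration-" + (iteration + 1));
            Job job = Job.getInstance(conf, "kmeans-iteration-" + iteration);
            job.setJarByClass(IterativeKMeansDriver.class);
            job.setMapperClass(AssignMapper.class);
            job.setReducerClass(RecomputeReducer.class);
            job.setOutputKeyClass(IntWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, input);
            FileOutputFormat.setOutputPath(job, output);
            if (!job.waitForCompletion(true)) {
                throw new IllegalStateException("Iteration " + iteration + " failed");
            }
            done = converged(conf, input, output); // check convergence
            input = output;                        // this reduce output feeds the next map
            iteration++;
        }
    }
}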
Most previous research focuses on the city level (e.g. Beijing or Guangzhou), using remote sensing images to obtain land information and landscape metrics to analyse the patterns of change. Some previous studies focus on tracing the change process in different regions and then comparing them; others focus on methodological innovation, such as information extraction or proposing new landscape metrics.
However, research at the megalopolis level is scarcer than at the city level. What are the differences in scientific problems, methods and theory between research at the city level and research at the megalopolis level? When doing research at the megalopolis level, which aspects should we pay more attention to than at the city level?
Could anyone give me some tips?
In airborne hyperspectral imagery (1 m), can signatures of target objects be extracted from training pixels instead of a field campaign? This would reduce cost, but does it affect information extraction performance? The application is tree species identification in a mixed urban and rural environment.
I have:
- Polarity words.
Example:
- Good: Pol 5.
- Bad: Pol -5.
My assignment:
Determine whether a document is negative or positive. How should I do this? Please advise; I am a newbie in NLP (sentiment analysis).
I want to use polarity to do this, not Naive Bayes, so please point me to an algorithm based on polarity words.
Thanks for your time.
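The simplest lexicon-based approach is to sum the polarity of every word that appears in the document and threshold the total. The sketch below is a minimal, assumption-laden illustration: the two-entry lexicon and the example document are placeholders, and refinements such as negation handling and normalising by document length are left out.

import java.util.HashMap;
import java.util.Map;

public class LexiconSentiment {
    public static void main(String[] args) {
        // Hypothetical polarity lexicon; in practice load it from your word list file.
        Map<String, Double> polarity = new HashMap<>();
        polarity.put("good", 5.0);
        polarity.put("bad", -5.0);

        String document = "The movie was good but the ending was bad, really bad.";
        double score = 0.0;
        for (String token : document.toLowerCase().split("\\W+")) {
            score += polarity.getOrDefault(token, 0.0);   // unknown words contribute nothing
        }
        String label = score > 0 ? "positive" : (score < 0 ? "negative" : "neutral");
        System.out.println(label + " (score = " + score + ")");
    }
}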
I would like to integrate WordNet into the Alignment API in order to obtain linguistic ontology alignment. I have looked at the compiled Alignment API and found many of its functions, but I am a bit confused about where to integrate the WordNet API code in order to get linguistic ontology alignment results. Thank you.
1. co-occurrence of two words
2. co-occurrence of document and words
Do algorithms work on these two concepts?
In the context of a supervised learning approach, I would like to know the most appropriate method for extracting pertinent information from annotated text, notably identifying relations between named entities within a sentence, which can later be used for SVM classification, with segmentation involved as well.
I've gone through some scenarios, like hospital process mining and restaurant process mining, but I want to find a scenario that is not only new but whose log data is also accessible.
I am interested in doing some work in area of semantic web crawling/scraping and using that semantic data to do some discovery.
Hello everyone!
Can you advise me what is most worth learning in Java, in your opinion?
I am interested in using machine learning to recognize social interaction patterns such as disagreements, and potentially use those patterns to generate new simulated interactions. I've been working with crowd sourced descriptions of social interactions, but these are more narrative and less action driven.
Are you aware of publicly available datasets of annotated social interactions?
Types of data that might be good candidates are annotated movie scripts or forum threads. Skeletal/gesture data could also be interesting.
We have a huge volume of atmospheric data. Would you suggest how to perform mining on it and what types of analysis are possible?
I need to extract all words after the pattern "/[Ee]ach/ ([tag:NN]|[tag:NNS]) /has|have/ /\\w|[ ]|[,]/" up to the end of the sentence, but I am getting unexpected output:
in the second sentence I get "Each campus has a", where the right output is "Each campus has a different name, address, distance to the city center and the only bus running to the campus";
in the third sentence I get "Each faculty has a", where the right output is "Each faculty has a name, dean and building";
in the fourth sentence the pattern fails to match the right output, which is "each problem has solution, God walling".
It would be appreciated if you could help me solve this problem; I think my pattern has not been written correctly. Below is my code:
String file="ABC University is a large institution with several campuses. Each campus has a different name, address, distance to the city center and the only bus running to the campus. Each faculty has a name, dean and building. this just for test each problem has soluation, God walling.";
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation document = new Annotation(file);
pipeline.annotate(document);
List<CoreLabel> tokens = new ArrayList<CoreLabel>();
List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
for(CoreMap sentence: sentences)
{
for (CoreLabel token: sentence.get(CoreAnnotations.TokensAnnotation.class))
tokens.add(token);
TokenSequencePattern pattern = TokenSequencePattern.compile("/[Ee]ach/ ([tag:NN]|[tag:NNS]) /has|have/ /\\w|[ ]|[,]/");
TokenSequenceMatcher matcher = pattern.getMatcher(tokens);
while( matcher.find()){
JOptionPane.showMessageDialog(rootPane, matcher.group());
}
tokens.removeAll(tokens);
}
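A plausible explanation for the truncated matches is that the final token pattern /\\w|[ ]|[,]/ matches exactly one token, so the match stops right after "a"; to run to the end of the sentence, the tail needs a repeated element. Under that assumption only the pattern line needs to change, and the rest of the loop (building the per-sentence token list and calling getMatcher(tokens)) can stay as it is:

TokenSequencePattern pattern =
        TokenSequencePattern.compile("/[Ee]ach/ ([tag:NN]|[tag:NNS]) /has|have/ []*"); // "[]*" repeats any token

Here "[]" matches any single token and "[]*" greedily repeats it to the end of the sentence's token list, so the trailing full stop may need to be trimmed from matcher.group(). This is a sketch of one possible fix, not a verified solution.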
The tolerance rough set model is well suited to dealing with missing values in data sets. But how can I use the tolerance rough set model to classify data with a conventional classifier such as KNN?
Can someone help me find a data repository or valid datasets on records of patients with history of mental/psychiatric disorders?
I am preferably looking for datasets in R, but other formats will do to begin with.
Any electronic resources, including books, examples and tutorials, would be appreciated.
I am working on pattern detection in dermatological images and I would like to know how to extract and match the patterns.
I am working on validating a given method and I need a dermatological image database.
The rotation forest algorithm requires randomly eliminating a subset of classes from the data. Afterwards, a bootstrap sample (I guess without replacement) of 75% of the remaining data has to be generated to perform PCA. How, and how many, classes should be eliminated? Does a new random subset have to be selected in every iteration? What if it is a two-class data set? In order to perform PCA the data has to be zero-mean (for covariance-based PCA) or normalized (for correlation-based PCA). I might not have understood it correctly, but does it make sense to select a bootstrap sample, centre the data to do PCA, and then generate scores by applying the rearranged rotation matrix to the whole data set? The algorithm presented in the paper by Rodriguez and Kuncheva, "Rotation Forest: A New Classifier Ensemble Method", IEEE, 2006, explains that overlapping features (random selection with repetition) can be used, but it is not shown how the principal components are merged. Can someone clarify these issues?
Applications of different types/theories of entropy, and the best entropy methods used in image processing.
"The large photo album has extra charges on delivery."
The adjective "large" may indicate an attribute, size, of the photo album.
"Red car"
The adjective "red" may indicate an attribute, colour, of the car.
How can I write a Java program to implement the above task? (How do we know that the attribute for "large" or "big" is size, and how do we know the attribute for a colour term?)
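WordNet encodes exactly this link: descriptive adjectives carry an "attribute" pointer to the noun whose value they express (e.g. large/big -> size). The sketch below uses the MIT JWI library as one possible way to read that pointer; the dictionary path is a placeholder, and colour adjectives such as "red" may only reach the attribute "color" indirectly, via the SIMILAR_TO link to the head adjective of their cluster.

import edu.mit.jwi.Dictionary;
import edu.mit.jwi.IDictionary;
import edu.mit.jwi.item.*;
import java.io.File;

public class AdjectiveAttribute {
    public static void main(String[] args) throws Exception {
        IDictionary dict = new Dictionary(new File("/path/to/WordNet/dict")); // placeholder path
        dict.open();
        for (String adjective : new String[]{"large", "big"}) {
            IIndexWord idx = dict.getIndexWord(adjective, POS.ADJECTIVE);
            if (idx == null) continue;
            for (IWordID wid : idx.getWordIDs()) {
                ISynset synset = dict.getWord(wid).getSynset();
                // The ATTRIBUTE pointer links an adjective synset to the noun attribute it expresses.
                for (ISynsetID sid : synset.getRelatedSynsets(Pointer.ATTRIBUTE)) {
                    for (IWord noun : dict.getSynset(sid).getWords()) {
                        System.out.println(adjective + " -> " + noun.getLemma());
                    }
                }
            }
        }
    }
}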
I am doing a project related to natural language processing. I am using Stanford CoreNLP v3.3.1 to analyse text. I would like to know how I could extract adjectives from a sentence using Stanford CoreNLP.
Sometimes this tree view is very useful.
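A minimal way to extract adjectives, sketched below, is to run just the tokenizer, sentence splitter and POS tagger and keep the tokens whose Penn Treebank tag starts with "JJ" (JJ, JJR, JJS); the example sentence is only illustrative.

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
import java.util.Properties;

public class AdjectiveExtractor {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        Annotation document = new Annotation("The large photo album has extra charges on speedy delivery.");
        pipeline.annotate(document);
        for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                String tag = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
                if (tag.startsWith("JJ")) {          // JJ, JJR, JJS are the adjective tags
                    System.out.println(token.word() + " (" + tag + ")");
                }
            }
        }
    }
}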

For a given candidate character extracted by MSER, I believe there are only a limited number of features for detecting text in scene images. Given this limitation, what alternative features of the extracted candidates could help with text detection in scene images? Please comment, to help generalize a solution.
I am looking for code to run Stanford coreference resolution in Java (NetBeans).
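A minimal sketch with the deterministic dcoref system is below; the pipeline setup matches the snippets earlier on this page, and the input sentence is only an example. After annotation, the coreference chains are available from the document-level CorefChainAnnotation.

import edu.stanford.nlp.dcoref.CorefChain;
import edu.stanford.nlp.dcoref.CorefCoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.Map;
import java.util.Properties;

public class CorefDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        Annotation document = new Annotation("John bought a car. He drives it every day.");
        pipeline.annotate(document);
        // Each CorefChain groups all mentions that refer to the same entity.
        Map<Integer, CorefChain> chains =
                document.get(CorefCoreAnnotations.CorefChainAnnotation.class);
        for (CorefChain chain : chains.values()) {
            System.out.println(chain);
        }
    }
}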