Science topic

Information Extraction - Science topic

Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents. In most cases this activity concerns processing human language texts by means of natural language processing (NLP).
Questions related to Information Extraction
  • asked a question related to Information Extraction
Question
7 answers
I have a dataset containing a free-text field for more than 3,000 records, all of which contain notes from the doctor. I need to extract specific information from each of them, for example the doctor's final decision and the classification of the patient. What is the most appropriate way to analyze these texts? Should I use information retrieval, information extraction, or would a question-answering system be suitable?
Relevant answer
Answer
Dear Matiam Essa,
This text mining task focuses on identifying and extracting entities, attributes, and their relationships from semi-structured or unstructured texts. The extracted information is then stored in a database for future access and retrieval. The main techniques are:
Information Extraction (IE)
Information Retrieval (IR)
Natural Language Processing
Clustering
Categorization
Visualization
With the increasing amount of text data, effective techniques need to be employed to examine the data and extract relevant information from it. Various text mining techniques are used to efficiently uncover interesting information from multiple sources of textual data, and they are continually refined to improve the text mining process.
Good luck
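For the clinical-notes case in the question, a minimal rule-based information extraction pass is often a reasonable first step before heavier NLP. The sketch below assumes the notes contain labeled sections such as "Decision:" and "Classification:" — those field names are hypothetical and would need adapting to the actual note format.

```python
import re

# Hypothetical field labels for illustration; adapt to the real note layout.
FIELDS = {
    "decision": re.compile(r"Decision:\s*(.+)", re.IGNORECASE),
    "classification": re.compile(r"Classification:\s*(.+)", re.IGNORECASE),
}

def extract_fields(note: str) -> dict:
    """Return the first match for each field, or None if absent."""
    out = {}
    for name, pattern in FIELDS.items():
        m = pattern.search(note)
        out[name] = m.group(1).strip() if m else None
    return out

note = "Patient stable.\nDecision: discharge\nClassification: low risk"
print(extract_fields(note))
```

If the notes are not this regular, the same per-field extraction idea can be reimplemented with an NER model instead of regexes.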
  • asked a question related to Information Extraction
Question
4 answers
I am looking for competitions/benchmarks in the field of e-discovery. My objective is to understand the state of the art in this field.
I found TREC (https://trec.nist.gov/) but their last legal track dates back to 2011.
Any idea? Thanks
Relevant answer
Nice Dear Muhammad Ali
  • asked a question related to Information Extraction
Question
1 answer
As we know, most researchers use manual validation by domain experts for unlabeled user reviews in a specific domain, but is there a newer way? I am working with a large dataset, so relying on experts would be difficult.
If anyone has used a new performance measure or a new validation approach, please let me know.
Thanks in advance.
Relevant answer
Answer
Thanks Gopi Battineni , I will check the paper.
  • asked a question related to Information Extraction
Question
10 answers
As we know, most researchers use manual validation by domain experts, but is there a newer way, or any benchmark? If anyone has a benchmark dataset for this task in any domain, please share it with me if possible. Thanks in advance.
Relevant answer
Answer
Very interesting regarding the knowledge of ITES - Following .
  • asked a question related to Information Extraction
Question
3 answers
For instance, the layout of the sentences (i.e. knowing that a specific sentence is a bullet point and that it is correlated with another sentence stating the scope of the bullets). Moreover, many NLP parsers break if the extraction mechanism delivers broken sentences (i.e. when a regex cannot tell whether there is a genuine line break in the source document and is therefore unable to clean the text). Such broken sentences can also distort embeddings, since these take neighboring words into account.
Relevant answer
Answer
Hi,
In relation to the text cleaning part of your question, we can make a list containing your example and Arturo's:
1. Line feeds ("\n") can be invisible and break the parsing of a sentence.
2. Different types of brackets: ( ), [ ], { }; of course it depends on the text.
3. Encoding: sometimes there are characters inside the text whose encoding should be set beforehand; even a single comma (,) can stop a whole system.
4. Some languages (e.g. Java, PHP) are sometimes confused by single quotes. It is better to handle them with a regex-like style.
5. A mixture of the above points is also possible, for example a foreign name inside an English text with both an encoding issue and a single quote in it.
6. My mentor does not agree that an overly long sentence can confuse the parser, but I have seen it happen :) The open question, which I cannot answer, is how long "really long" is.
So, to start the text extraction, the following could be beneficial:
1. Visual inspection of a sample of the text to see if there are unwanted items such as comments in brackets and the like.
2. Setting the encoding correctly or neutralizing the text, if possible, e.g. by converting it to plain text in Notepad, Gedit, Nano, etc. (we are not talking about big data).
3. Step #2 does not remove line feeds; if the system needs to process a single sentence at a time, line feeds may remain and must be handled.
4. Replacing single quotes, etc. with their regex equivalents.
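The cleaning steps above can be sketched in a few lines. This is a minimal, non-exhaustive example; the heuristic that a newline not preceded by sentence-ending punctuation is a soft wrap is an assumption that will not hold for every document.

```python
import re

def clean_text(raw: str) -> str:
    """Normalize common extraction artifacts before parsing (a sketch)."""
    text = raw.replace("\r\n", "\n")
    # Join lines broken mid-sentence: treat a newline that does not follow
    # sentence-ending punctuation as a soft wrap (heuristic, see lead-in).
    text = re.sub(r"(?<![.!?:])\n", " ", text)
    # Normalize curly quotes to plain ASCII quotes.
    text = text.replace("\u2019", "'").replace("\u201c", '"').replace("\u201d", '"')
    # Collapse runs of spaces and tabs.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()

print(clean_text("The store has\nmany branches.\nIt is big."))
```

Visual inspection of a sample (step 1 above) remains the way to decide which of these rules actually apply to a given corpus.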
  • asked a question related to Information Extraction
Question
8 answers
I need EHR datasets to test my algorithm on semantic interoperability and conflict resolution of different EHR systems.
Relevant answer
Answer
You should look at MIMIC.
  • asked a question related to Information Extraction
Question
4 answers
I am trying to use Stanford TokensRegex; however, I am getting an error at line number (11). It says that (). Please do your best to help me. Below is my code:
1 String file="A store has many branches. A manager may manage at most 2 branches.";
2 Properties props = new Properties();
3 props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
4 StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
5 Annotation document = new Annotation(file);
6 pipeline.annotate(document);
7 List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
8 for (CoreMap sentence: sentences)
9 {
10 TokenSequencePattern pattern = TokenSequencePattern.compile("[]");
11 TokenSequenceMatcher matcher = pattern.getMatcher(sentence);
12 while( matcher.find()){
13 JOptionPane.showMessageDialog(rootPane, "It has been found");
14 }
15 }
Relevant answer
Answer
Java's native regex engine is a horror for every Java developer; why it is included in the release is a great puzzle.
There is a regex engine from Apache that I have used for NLP processing: https://opennlp.apache.org/docs/.
My own favorite, though, is the regex engine in Python, which simply works. However, you have to instantiate the Python engine first before you can use regexes: http://www.jython.org/jythonbook/en/1.0/JythonAndJavaIntegration.html
  • asked a question related to Information Extraction
Question
9 answers
Hi. I have a query regarding text classification. I have a list of words with the following attributes: word, weight, class. The class can be positive or negative; the weight is between -1 and 1. How can I train a classifier such as an SVM using this word list to classify unseen documents? An example in any tool is welcome.
Relevant answer
Answer
Weka, RapidMiner, and the scikit-learn Python library are all easy to use for classification.
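One way to use such a weighted word list with an SVM in scikit-learn is to turn each document into lexicon-derived features and train on a small labeled sample. The lexicon and documents below are made-up toy examples, not a recommended feature design.

```python
# Toy word list and documents, hypothetical and for illustration only.
from sklearn.svm import LinearSVC

LEXICON = {  # word -> (weight, class)
    "good": (0.8, "positive"), "excellent": (0.9, "positive"),
    "bad": (-0.7, "negative"), "terrible": (-0.9, "negative"),
}

def featurize(doc: str):
    """Two features: total positive and total negative lexicon weight."""
    pos = neg = 0.0
    for w in doc.lower().split():
        if w in LEXICON:
            weight, cls = LEXICON[w]
            if cls == "positive":
                pos += weight
            else:
                neg += weight
    return [pos, neg]

train_docs = ["good excellent service", "terrible bad experience",
              "excellent good", "bad terrible"]
train_labels = ["positive", "negative", "positive", "negative"]
clf = LinearSVC().fit([featurize(d) for d in train_docs], train_labels)
print(clf.predict([featurize("a good day")]))
```

With no labeled documents at all, the lexicon weights can instead be summed directly and thresholded, skipping the classifier.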
  • asked a question related to Information Extraction
Question
4 answers
1. My research requires creating an ontology, both domain-specific and in English. Is Protege the best option? What criteria should be kept in mind while creating ontologies?
2. A voluminous text file is given as input and the delimiter is a full stop ("."), i.e. the analysis has to be done at sentence level. What would be the best way to keep track of the word order within a sentence?
3. Is there any repository of unstructured text data (in English) which can be used for testing? Thanks in advance.
Relevant answer
Answer
Thank you sir for your valuable answers
  • asked a question related to Information Extraction
Question
14 answers
What are the best tools or algorithms for extracting information from different kinds of papers?
Example:
Extracting the Author names, Journal Name, Year of publication from different journal papers.
Thank you
Relevant answer
Answer
Dear Mohammed Chalouli,
There are many software packages that can help you, for example the "Citavi Picker".
Regards,
Javad.
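For bibliographic fields like author names, journal name, and year, a pattern-based pass can handle regularly formatted reference strings; dedicated tools such as GROBID handle the messy variation in real PDFs far better. The sketch below assumes a simple "Authors. Title. Journal, Year." layout, which is an illustrative simplification.

```python
import re

# Assumes the hypothetical layout "Authors. Title. Journal, Year." —
# real references vary widely; see GROBID for a robust alternative.
REF = re.compile(
    r"^(?P<authors>[^.]+)\.\s+(?P<title>[^.]+)\.\s+(?P<journal>[^,]+),\s+(?P<year>\d{4})\."
)

def parse_reference(ref: str):
    """Return a dict of fields, or None if the string does not match."""
    m = REF.match(ref)
    return m.groupdict() if m else None

ref = "Smith J, Doe A. A study of extraction. Journal of Examples, 2015."
print(parse_reference(ref))
```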
  • asked a question related to Information Extraction
Question
4 answers
My aim is to obtain valuable info from Prospectus (document that describes a financial security). I.e., I need to build a metadata repository about financial securities by extracting info from documents that describe them. 
Relevant answer
Answer
Vicente: 
If you want to start with Information Extraction, I recommend you to take a look at "The Wiley's Handbook of Computational Linguistics and Natural Language Processing - Chapter 18: Information Extraction". 
Best, Farshad 
  • asked a question related to Information Extraction
Question
1 answer
I have used ClausIE, and it returns subject-verb-object triples from a sentence. But this does not work when the text is short and not even a complete sentence. I just want a library (or another approach) that can return just the subject and the verb from short text. Example short text: "Proposal 32 accepted". It should use some dependency information, or perhaps rules, to identify that the term "Proposal" is the subject and "accepted" is the verb.
Relevant answer
Answer
First, tag the sentence using any tagger (e.g. the OpenNLP kit). An English sentence can be active or passive and follows an SVO structure. Identify the verb from the tags returned by the tagger; the subject will typically appear before the verb, and the object after it. You can also look at one of my papers on semantic analysis based on sentence types.
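The rule described above (first noun before the verb is the subject) can be sketched directly. The tagged example is hard-coded in place of a real tagger's output, so this shows only the rule, not the tagging step.

```python
def subject_and_verb(tagged):
    """tagged: list of (token, tag) pairs with Penn Treebank-style tags.
    Returns (subject, verb): first NN* token before the first VB* token."""
    verb_idx = next((i for i, (_, t) in enumerate(tagged) if t.startswith("VB")), None)
    if verb_idx is None:
        return None, None
    subject = next((tok for tok, t in tagged[:verb_idx] if t.startswith("NN")), None)
    return subject, tagged[verb_idx][0]

# Hard-coded tagger output for the short text "Proposal 32 accepted".
tagged = [("Proposal", "NN"), ("32", "CD"), ("accepted", "VBN")]
print(subject_and_verb(tagged))  # → ('Proposal', 'accepted')
```

For fragments with no verb at all, the function falls through to (None, None), which a caller should handle.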
  • asked a question related to Information Extraction
Question
2 answers
I am working on the area of diagram understanding. Currently working on text to object(arrow, data points, ...etc.) association techniques. Is there any exist research on information extraction from vector graphics with significant object association techniques?
Relevant answer
Answer
I have used many text association techniques for understanding mathematical diagrams.
There are generic methods, such as machine learning clustering using a flagged image dataset, or pre-clustering text labels based on patterns, but I am not aware of any generic text association work that successfully understands diagrams.
Generic text association techniques are not enough to extract knowledge from diagrams such as mathematical diagrams (Euler diagrams, coordinate graphs) stored in generic image formats such as JPEG or SVG. For successful text association, optimized techniques have to be used that exploit domain knowledge of the diagram type.
I have successfully tested some methods for understanding several types of mathematical drawings. Text association techniques for diagrams are bound to the domain knowledge of the specific diagram type.
  • asked a question related to Information Extraction
Question
4 answers
Hi, I am testing my method on the Foursquare dataset: https://archive.org/details/201309_foursquare_dataset_umn. I do not know why in some cases there are several ratings for one item; that is, a person has several different ratings for one item. For example, person 'a' rated item 'b' three times, and sometimes these three ratings differ from each other. How should I handle these ratings? Should I take the average of these ratings to obtain that person's rating for the item?
Thanks in advance
Relevant answer
Answer
Hi Maryam,
It depends on the idea behind collecting the data. A different rating at each time may mean that each one reflects a particular instance or context. In general, we do not recommend using the mean in such a case; instead, we recommend treating them as independent data points, all of which are required.
HTH.
Samer 
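Both options discussed above (averaging the repeats vs. keeping them as separate observations, or keeping only the most recent) are one-liners in pandas. The column names below are hypothetical; the Foursquare dump uses its own schema.

```python
import pandas as pd

# Hypothetical schema: user, item, rating, timestamp.
df = pd.DataFrame({
    "user":   ["a", "a", "a", "b"],
    "item":   ["b", "b", "b", "c"],
    "rating": [3, 4, 5, 2],
    "ts":     [1, 2, 3, 1],
})

# Option 1: average the repeated ratings per (user, item).
avg = df.groupby(["user", "item"], as_index=False)["rating"].mean()

# Option 2 (closer to the advice above): keep every rating as an
# independent observation, or keep only the most recent one.
latest = df.sort_values("ts").drop_duplicates(["user", "item"], keep="last")

print(avg)
print(latest)
```

Which option is right depends on whether the repeated ratings reflect changing opinions (keep the latest) or noisy measurements of one opinion (average).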
  • asked a question related to Information Extraction
Question
6 answers
Dear All,
I want to group my dataset using a clustering technique. I applied k-means and used the Dunn index for validation. Now I want to know what the optimal number of clusters should be based on the Dunn index. For your reference, I am uploading the DI graph.
Please suggest what cluster count I should consider for this plot.
Thanks
Relevant answer
Answer
Each cluster validity method varies in its evaluation: some methods calculate intra-cluster similarity and some evaluate inter-cluster distances or centroids. Choosing the right index method depends on your data. In the case of the Silhouette method, it is always better to choose the number of clusters whose Silhouette coefficient is nearest to 1, which indicates a good choice.
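The Dunn index is not built into scikit-learn, so this sketch runs the same model-selection loop with the silhouette coefficient mentioned above: cluster for a range of k and pick the k with the best validity score. The data are synthetic, well-separated blobs, an assumption that makes the choice unambiguous.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 3 well-separated clusters (an assumption for the demo).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # closer to 1 is better

best_k = max(scores, key=scores.get)
print(best_k)
```

With the Dunn index, the loop is identical; only the scoring function changes (and higher Dunn values are likewise better).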
  • asked a question related to Information Extraction
Question
3 answers
I have a project where I need to classify both documents that are news articles and short messages/blog comments for the articles.
I have tried the following with the same sentences and get different confidence levels. So now I am a bit worried as to which service to use best. I have tried Alchemy API, MonkeyLearn, Algorithmia and Aylien with the following sample:
"While I fully agree that the pilots, like any other worker, have the right to know the state of affairs in their company, I think it is very stupid to start thinking about industrial action at this stage. If they are not careful there may not be an airline left anyway. Please do not shoot yourselves in the foot and think carefully before you take any 'action'."
The answers where:
  • Alchemy: negative 0.473512
  • MonkeyLearn: negative 0.484
  • Algorithmia: Score of 1 (0 is very negative - 2 is neutral and 4 is very positive)
  • Aylien: negative 0.62843 (in tweet mode), positive 0.9867 (in document mode).
I tried different comments as text and have obtained mixed results, as for example the nearness of the Alchemy and MonkeyLearn is not repeated often.
My question is: which service shall I choose? :-)
Relevant answer
Answer
Also, not all the APIs take the same features into account to determine the polarity of a text. For example, some APIs make use of punctuation or emoticons, while others do not. Maybe these articles could give you some clues about which system you should choose:
  • asked a question related to Information Extraction
Question
7 answers
Dear All,
I need the following dataset, and/or any dataset that has one or more of the following features. The data relate to interactions among humans (either social or otherwise):
Initial friendship (with time), interactions (with time), subsequent friendship (with time), and other static profile attributes (e.g. interests, locations).
An example would be Instagram data or any similar dataset. It would be highly appreciated.
Thank you in advance.
Relevant answer
Answer
Check the link below for datasets from Facebook, Google+, Twitter, and other social networks.
  • asked a question related to Information Extraction
Question
3 answers
I want to implement an index as a tree with a Bloom filter, using locality-sensitive hashing.
Relevant answer
Answer
MG4J uses Bloom Filter: http://mg4j.di.unimi.it/
From the documentation: "Optionally, an index cluster may provide Bloom filters to reduce useless access to local indices that do not contain a term."
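The role a Bloom filter plays in such an index (cheaply ruling out local indices that cannot contain a term) can be shown with a minimal implementation. This is a teaching sketch, not the data structure MG4J actually uses; parameters m and k are arbitrary here.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash positions in an m-bit array.
    Membership tests can yield false positives but never false negatives."""
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive k positions by salting a cryptographic hash (simple, not fast).
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        return all(self.bits >> p & 1 for p in self._positions(item))

bf = BloomFilter()
bf.add("term")
print("term" in bf)    # always True once added
print("absent" in bf)  # usually False (rarely a false positive)
```

In a tree of local indices, each node would carry one such filter over its terms, and a lookup descends only into subtrees whose filter reports a possible hit.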
  • asked a question related to Information Extraction
Question
3 answers
Hello all,
I would like to know what extractive methods are suitable for Myanmar text, whose structure is similar to SOV (subject, object, verb).
Before extraction, we have to carry out preprocessing in many stages. What are those stages, and how do I extract important words or phrases from news text?
I am going to propose a CRF method for word extraction, but I am currently stuck.
So, please kindly advise me on how I should approach it.
Thanks to all.
Relevant answer
Answer
If I understand correctly, for word extraction you can use simple TF-IDF. Also try latent semantic analysis (LSA) and latent Dirichlet allocation (LDA) for topic modeling. But if you are going to do named entity recognition (NER), then I think you need to develop and train your own model, probably based on hidden Markov models.
  • asked a question related to Information Extraction
Question
6 answers
A newly developed information retrieval system requires testing against existing solutions. Besides evaluation metrics, comprehensive datasets are often required so that the system can be tested and evaluated. You are therefore requested to share your ideas and suggestions on developing datasets that can make the evaluation process easier, more accurate, and more precise.
Thanks in advance
Relevant answer
Answer
Irfan,
Take a look at :
A Large Benchmark Dataset for Web Document Clustering by  Mark P. Sinka and David W. Corne
One of the challenges in deriving a dataset for information retrieval is that the initial selection of documents depends on search engines. The best thing to do is to extract samples from as many search engines as possible to avoid bias.
Regards
  • asked a question related to Information Extraction
Question
8 answers
I'm currently working on feature extraction for cyberbullying detection, but I am having difficulty finding available datasets.
Relevant answer
Answer
Hi
Maybe it is too late to tell you, but you can also find datasets on this page if you like.
Good luck
  • asked a question related to Information Extraction
Question
8 answers
How can we build our own corpus of tweets that includes tweets written by people that are suspected as people with mental diseases?
Relevant papers and ideas will be welcome.
Relevant answer
Answer
The problem is that identifying persons with mental illness requires common sense. For example, if a person says "tomorrow I'll eat a hundred whales for breakfast", a computer program is not, at least not now, able to tell whether that is a joke or a sign of insanity.
Articles:
Regards,
Joachim
  • asked a question related to Information Extraction
Question
4 answers
I want to run an experiment with an information display matrix (IDM). Is there a program that is functional and easily accessible?
Thanks for the help.
Relevant answer
Answer
Phillipa:
In the mid 1980's we used a software product called "Mouselab", a computerized Information Display Board system run on micro-computers. Since then I imagine there are versions that can be used ONLINE over the internet.
  • asked a question related to Information Extraction
Question
8 answers
Great attention should be paid to methods of search and selection of sources to establish their credibility and value of information sources.
Relevant answer
Answer
On using the internet for educational research:
Since the internet is a public domain, no one guarantees or is responsible for what is written or spread there. It may be obsolete knowledge or information, partially tested information, or incomplete information. Therefore, it is riskier to use internet information as a basis for academic writing, especially information spread over anonymous sites.
Although there are anonymous and incomplete sites, we can also find millions of reputable journals, economic magazines, and bulletins, such as the New York Post, the Straits Times, the Washington Post, university libraries, etc.
Reference
  • asked a question related to Information Extraction
Question
3 answers
Suppose I need to extract the code for just the voting portion of a webpage. Is there any tool for doing this?
Relevant answer
Answer
We use a Python-based scraper: http://www.crummy.com/software/BeautifulSoup/. After scraping, we perform some manual editing to get things done per the specific requirement.
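BeautifulSoup makes this a one-liner, but even the standard library can isolate one section of a page once you know a marker for it. The sketch below assumes the voting section sits in an element with id="voting" — that id is hypothetical and would have to be found by inspecting the actual page.

```python
from html.parser import HTMLParser

class SectionExtractor(HTMLParser):
    """Collect the text inside the element whose id matches target_id."""
    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0   # nesting depth inside the target element
        self.text = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.text.append(data.strip())

html = '<body><div id="voting"><p>Yes: 10</p><p>No: 3</p></div><p>footer</p></body>'
p = SectionExtractor("voting")
p.feed(html)
print(" ".join(t for t in p.text if t))
```

This sketch ignores void tags like <br>; for anything beyond a demo, BeautifulSoup's `find(id=...)` is the more robust route.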
  • asked a question related to Information Extraction
Question
6 answers
Natural Languange Processing, World Knowledge, WordNet, Natural Language Understanding, Semantic, Lexical-Semantic Relation, Latent Semantic Analysis, Information Extraction, Extraction
Relevant answer
Answer
The term-document matrix {X} is decomposed using SVD as follows:
{X}={W}{S}{P}'
After this, the decomposed matrices are reduced, and the product of the reduced matrices is calculated as {X'}. Then the similarity between two terms is calculated using the correlation between the term vectors of this matrix.
I have a question: can we calculate term similarity from the {W} matrix alone after reduction? Does the {W} matrix give the same kind of relation between terms as {X}?
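A small numerical check of this question (a sketch using numpy's SVD, not any particular LSA toolkit): since {X'}{X'}' = {W}{S}²{W}', the inner products, and hence cosine similarities, between term rows of {X'} equal those between rows of {W}{S}. So {W} alone is not enough; its columns must be scaled by the singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 5))                    # toy term-document matrix
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2                                      # reduced rank
Xk = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]    # {X'}: rank-k reconstruction
WS = U[:, :k] * s[:k]                      # rows of {W}{S}

# Term-term inner products agree exactly: X'X'^T = (WS)(WS)^T.
print(np.allclose(Xk @ Xk.T, WS @ WS.T))   # True
```

Note this holds for inner products and cosine similarity; Pearson correlation between term vectors involves subtracting row means and is not exactly preserved by the change of basis.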
  • asked a question related to Information Extraction
Question
4 answers
Standard corpora exist in various domains, however i can not find a corpus containing large amounts of technical documentation. 
The only corpus I've heard of is the "Scania Corpus" from the PLUG project 1998. However i can not find any resources.
Does anybody know of another corpus or has access to the Scania documents?
Thank you in advance
Best regards
-Sebastian
Relevant answer
Answer
Hi,
I'm not sure if software documentation qualifies for technical documentation, but Opus project has parallel corpora for PhP, Gnome, Kde and Ubuntu manuals:
Hope this helps.
  • asked a question related to Information Extraction
Question
38 answers
Which are best data mining algorithms in classification if my data set is of healthcare and accuracy is priority?
Relevant answer
Answer
  • asked a question related to Information Extraction
Question
6 answers
I am trying to develop software that suggests suitable attributes for entity names depending on the entity type.
For example, if I have entities such as doctor, nurse, employee, customer, patient, lecturer, donor, user, developer, designer, driver, passenger, and technician, they will all have attributes such as name, sex, date of birth, email address, home address, and telephone number, because all of them are people.
As a second example, words such as university, college, hospital, hotel, and supermarket can share attributes such as name, address, and telephone number, because all of them are organizations.
Are there any natural language processing tools or software that could help me achieve my goal? I need to identify the entity type as person or organization, and then attach suitable attributes according to that type.
I have looked at named entity recognition (NER) tools such as the Stanford Named Entity Recognizer, which can extract entities such as Person, Location, Organization, Money, Time, Date, and Percent, but it was not really useful here.
I could do it by building my own gazetteer; however, I would prefer not to go down that route unless I fail to do it automatically.
Any helps, suggestions and ideas will be appreciated.  
Relevant answer
Answer
Mussa,
This might not be a very helpful answer, but from my understanding NLP techniques often rely on context to understand what is being discussed.  So a single word like "doctor" is very difficult to understand unless it is in some kind of context like "a doctor treats sick people".  From the sentence, an NLP machine might recognize that doctor is a noun and might infer something about relating to people.  Without this context, it will be tough to discern the categorical differences between single words. 
It might be less complicated (although more time-consuming) to create a predefined list of terms that you would like to classify and then simply match words to those lists in order to create your associated list of features for a given entity.
Hope that helps.
Sean
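Sean's predefined-list suggestion can be sketched as a gazetteer lookup feeding attribute templates. The word lists below are small hypothetical samples; in practice they could be expanded from WordNet hypernyms (person vs. organization) rather than typed by hand.

```python
# Hypothetical gazetteers, for illustration only.
PERSON_WORDS = {"doctor", "nurse", "employee", "customer", "patient", "driver"}
ORG_WORDS = {"university", "college", "hospital", "hotel", "supermarket"}

TEMPLATES = {
    "PERSON": ["name", "sex", "date of birth", "email address",
               "home address", "telephone number"],
    "ORGANIZATION": ["name", "address", "telephone number"],
}

def attributes_for(entity: str):
    """Map an entity word to its attribute template via the gazetteers."""
    word = entity.lower()
    if word in PERSON_WORDS:
        return TEMPLATES["PERSON"]
    if word in ORG_WORDS:
        return TEMPLATES["ORGANIZATION"]
    return []  # unknown: fall back to NER on surrounding context

print(attributes_for("doctor"))
print(attributes_for("hospital"))
```

The empty-list fallback is where an NER tool or hypernym lookup would slot in, keeping the hand-built lists small.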
  • asked a question related to Information Extraction
Question
3 answers
Corpus must contain documents (texts) with hand annotated keywords by human experts.
Relevant answer
Answer
In the European project MANTRA we developed a parallel corpus based on EMEA and MedLine. It comprises a large silver-standard corpus (tagged by several different systems with a majority vote) and a manually crafted 550-document gold corpus (EMEA only) used to calculate the precision and recall of the silver-standard corpus. The downloads are at https://sites.google.com/site/mantraeu/project-output and project information can be found at https://sites.google.com/site/mantraeu/
  • asked a question related to Information Extraction
Question
3 answers
Greetings,
I am currently working on an application aimed at measuring and storing the maximum rotation speed of a device attached to a rotating object.
Unfortunately, I think I have finally come across a problem: my software recalculates gyro values (angular velocity) into rotational speed (in cycles per minute), but my Samsung Note II reaches only 167-168 rot/min.
Can somebody advise where I can find the maximum measurable value that such a "budget" gyro is able to reach?
Do you know of any method to extend that value?
Kind regards,
Mariusz.
Relevant answer
Answer
I finally managed to sum up the system we created for measuring the rotational speed of an object. Some initial assumptions:
1. The system must be mobile; to achieve this, we combined our proprietary sensor with a smartphone.
2. The application is used for recording and evaluating the received rotational speed values.
3. The application selects the maximum value from a 2-second time window, which is used for data aggregation.
Unfortunately, we were not able to produce proper results with the built-in Android sensors. We needed to develop a dedicated sensor solution, which was then integrated with the smartphone via Bluetooth.
I will attach some dev specs to show how.
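The conversion and aggregation described above are simple to sketch: gyroscope output in rad/s converts to rotations per minute, and the maximum is taken per 2-second window. The sample timestamps and values are made up (17.5 rad/s works out to about 167 rpm, close to the ceiling reported in the question).

```python
import math

def rad_per_s_to_rpm(omega: float) -> float:
    """Angular velocity in rad/s -> rotations per minute."""
    return omega * 60.0 / (2.0 * math.pi)

def max_rpm_per_window(samples, window=2.0):
    """samples: list of (timestamp_seconds, omega_rad_per_s).
    Returns {window_index: max rpm seen in that window}."""
    windows = {}
    for t, omega in samples:
        key = int(t // window)
        windows[key] = max(windows.get(key, 0.0), rad_per_s_to_rpm(omega))
    return windows

samples = [(0.1, 10.0), (0.9, 17.5), (1.5, 12.0), (2.2, 18.0)]
print(max_rpm_per_window(samples))
```

If the sensor itself saturates, as in the question, no amount of post-processing recovers the true peak; that is why a dedicated sensor was needed here.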
  • asked a question related to Information Extraction
Question
13 answers
There are a few computational models of CIT for concept invention out there (e.g. Pereira, 2007; Li, Zook, Davis & Riedl, 2012). I was wondering whether this idea could be turned on its head and repurposed to streamline information extraction from corpora. Any suggestions on how one could go about it?
Relevant answer
Answer
@ Marc Le Goc
Abstraction is a part of the blending process for sure. Especially during the construction of the Generic Space. I haven't come across the term "Knowledge Engineering" before. It sounds like a pretty interesting field. :)
@ Ignacio Arroyo
I haven't got into annotation schemes yet. But since CIT has a strong 'evolutionary' undercurrent running through it, they'll need to reflect that somehow. A simple static semantic tag won't work; I'm thinking of something more along the lines of vectors and graphs.
  • asked a question related to Information Extraction
Question
4 answers
In papers on rough set theory (RST), authors always point out that RST, proposed by Prof. Pawlak, is a tool for dealing with incomplete data.
But in some papers, authors say that RST is based on an equivalence relation and is unable to deal with incomplete information systems (with either "don't care" (*) or lost (?) values), so they propose other relations, such as tolerance or dominance relations.
Can someone elaborate on the two phrases "incomplete data" and "incomplete information systems"?
Relevant answer
Answer
Information which is incomplete is not the same as missing data. In simple terms, "incomplete information" means that we do not have enough *information* (attributes) to describe the underlying concept well enough to say with certainty which objects belong to it. An equivalence relation is used to build information 'granules', which can then be used to measure indiscernibility between data objects as a way of trying to distinguish between them. Missing data is just that: data that is not available or has been lost or corrupted in the dataset for whatever reason. RST can still be used to do data imputation based on indiscernibility, but that is a rather different task.
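The distinction above can be made concrete: the equivalence relation partitions objects into indiscernibility granules over the available attributes, and objects in the same granule cannot be told apart even though no value is missing. The tiny information system below is hypothetical.

```python
from collections import defaultdict

# Hypothetical information system: object -> attribute values (no missing data).
objects = {
    "o1": {"color": "red", "size": "big"},
    "o2": {"color": "red", "size": "big"},    # indiscernible from o1
    "o3": {"color": "blue", "size": "small"},
}

def granules(objs, attrs):
    """Partition objects into equivalence classes over the given attributes."""
    part = defaultdict(list)
    for name, vals in objs.items():
        key = tuple(vals[a] for a in attrs)
        part[key].append(name)
    return list(part.values())

print(granules(objects, ["color", "size"]))  # [['o1', 'o2'], ['o3']]
```

Here o1 and o2 fall into one granule: the information is "incomplete" in the RST sense (the attributes cannot separate them) even though the table has no missing entries, which is exactly the case tolerance or dominance relations do not address.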
  • asked a question related to Information Extraction
Question
1 answer
As the Large Scale Information Extraction (LaSIE) project led to the creation of a base IE system designed by Prof. R. Gaizauskas and has served as the basis for later projects, I have spent a long time trying to figure out how to download LaSIE and use it in my own application, but all my attempts have failed. I would appreciate it if any member of ResearchGate could send me some information about how to download and use it. I would like to obtain the result of the discourse analysis stage of LaSIE and then use that result to build my application.
Relevant answer
Answer
Sorry, I do not have the expertise to answer this question.
Regards
  • asked a question related to Information Extraction
Question
8 answers
Can the depth be controlled by the complexity of the object (e.g. faces, written characters, and the like) in a deep learning network for image processing?
Relevant answer
Answer
Dear Lohweg,
The authors of the following paper:
that I tried to summarize and review:
choose to increase the model width instead of depth, because deeper models have only diminishing improvement or even degradation on accuracy. 
I believe there might be some theoretical lower bound that determines the minimum number of layers given the raw features, beyond which increasing the number of layers gives only minimal advantages or even degradation. But I don't know whether such a lower bound exists.
  • asked a question related to Information Extraction
Question
1 answer
At the moment, I was able to find these papers:
1. Prototype a Knowledge Discovery Infrastructure by Implementing Relational Grid Monitoring Architecture (R-GMA) on European Data Grid (EDG) by Frank Wang, Na Helian, Yike Guo, Steve Thompson, John Gordon.
2. Knowledge grid-based problem-solving platform by Lu Zhen, Zuhua Jiang,Jun Liang.
Thank you in advance for any help.
Relevant answer
Answer
Hello Pawel,
Look into the attached paper, and this link for the application.
The paper gives you a well-structured idea of the application and extension of Grid technology to knowledge discovery in Grid databases.
If you are working on larger datasets, I'm certain OLAP could help you and provide better results than the alternatives.
Furthermore, you also need to work on the performance results and usability of such applications.
Regards,
Manish
  • asked a question related to Information Extraction
Question
2 answers
Now, it looks like the texture feature is only for a single band?
Relevant answer
Answer
Hi Chandra Prakash, I mean I would like to analyze the texture features of segmented objects, but this will generate many texture features for each band, because there are a lot of bands. I don't know whether it is feasible to integrate several bands, and if so, how to carry it out.
  • asked a question related to Information Extraction
Question
7 answers
I would like to explore methods for re-ranking result sets retrieved using a term-based query against a database of bibliographic records. I believe that this additional layer of processing could improve a user's information-seeking experience by helping them to more easily find articles relevant to their needs.
An alternative implementation is to exclude records from the result set which, although contain the search term, fail to meet other criteria.
In either case, I am looking for existing literature which could help me identify a suitable method of analysis for comparing one set of ranked results to another. I have found studies in which a subject matter expert codes each individual record returned in a result set as relevant or not, in order to compute precision and recall. This may be one strategy, but I am not sure whether this alone can really describe and express the differences between two result sets, or the differences in how they are ranked (at least beyond some arbitrary number of results returned; it could become unfeasible for a human to evaluate thousands of results, for example). I am also considering the value of a mixed-methods approach, in which I integrate qualitative assessments of user satisfaction with what they feel to be the quality of the results retrieved.
I would appreciate any suggestions for literature or methods to consider for this type of research. Thank you!
Relevant answer
Answer
If you are interested in comparing ranks, then look at measures like MRR (Mean Reciprocal Rank) or rank correlations such as Kendall's tau or Spearman's rho. These measures will help you quantify the rank correlation if you take a ranking produced by a state-of-the-art method as the baseline and then re-rank using your approach. An alternative is to use a generative method, such as language models, to see how likely it is that the given query would be generated from a retrieved set of documents; this hints at the relevance of certain documents when you do not have relevance assessments for your test collection. If your experimental collection already comes with relevance assessments, then the standard TREC evaluation measures are the preferred choice. Moreover, if the relevance judgments are graded, I suggest Normalized Discounted Cumulative Gain (NDCG), etc.
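NDCG is short enough to implement directly for comparing two rankings of the same documents under graded judgments. The relevance grades below are hypothetical.

```python
import math

def dcg(gains):
    """Discounted cumulative gain of a ranked list of relevance grades."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(gains):
    """DCG normalized by the ideal (descending-sorted) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0

baseline = [3, 2, 0, 1]   # graded relevance in the baseline rank order
reranked = [3, 2, 1, 0]   # the same documents after re-ranking

print(round(ndcg(baseline), 4), round(ndcg(reranked), 4))
```

Here the re-ranking reaches the ideal ordering (NDCG = 1), while the baseline scores slightly lower because the grade-1 document sits below a grade-0 one; this kind of pairwise comparison scales to thousands of results without per-query human coding, as long as judgments exist.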
  • asked a question related to Information Extraction
Question
3 answers
Hi,
I have a set of ontologies related to the Cultural Heritage domain created by technical experts, and a textual corpus written by archaeological experts. My problem is that the ontologies need to be filled with archaeological knowledge (of which I know very little), so I'll use the archaeological texts to try to extract the information needed.
I need your recommendations about methods of information extraction.      
And for ontologies, is there any heuristic to fill an ontology automatically? (I have the T-Box and have to generate the A-Box.)
Thank you for your interest,
Best regards.
Relevant answer
Answer
  • asked a question related to Information Extraction
Question
11 answers
rank aggregation algorithm etc. for recommendation process.
Relevant answer
Answer
Hi, you can assign a weight to each approach. You can simply add up the weighted scores of all approaches, then re-rank and take the best solution. So, first you need to specify an appropriate weight for each approach.
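A minimal sketch of the weighted-sum aggregation described above; the approach names and scores are illustrative:

```python
def aggregate(score_lists, weights):
    """Weighted sum of per-approach scores, re-ranked best-first."""
    combined = {}
    for scores, w in zip(score_lists, weights):
        for item, s in scores.items():
            combined[item] = combined.get(item, 0.0) + w * s
    return sorted(combined, key=combined.get, reverse=True)

content = {"a": 0.9, "b": 0.4, "c": 0.7}  # scores from approach 1
collab = {"a": 0.2, "b": 0.8, "c": 0.6}   # scores from approach 2
print(aggregate([content, collab], weights=[0.5, 0.5]))  # ['c', 'b', 'a']
```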
  • asked a question related to Information Extraction
Question
3 answers
Is there any generalize tool for information extraction from multimedia data ?
Relevant answer
Answer
You can try Brand24, a well-known tool for monitoring the Internet and social media.
  • asked a question related to Information Extraction
Question
8 answers
"If the carrier is single, we are able to send only one vector of symbols on the channel by using that carrier.
In the case of OFDM systems, we pass multiple carriers through a single channel, and each carrier carries a vector of symbols."
Is the above statement correct? That is how I have understood OFDM.
Please clarify my doubt.
Thanking you
Relevant answer
Answer
Dear Sir,
Yes, that is correct.
The goal of OFDM is to enlarge the symbol duration T. In order not to lower the data rate, multiple carriers are then used. E.g. suppose symbol duration T, then set
w = 2PI/T, then the carriers used are
exp(jkwt), k = 0 ... N
You can see these carriers are orthogonal, meaning that if you multiply any 2 of these carriers, and integrate the result over T, then the result is always zero.
To transfer information, these carriers are modulated. Typically, QPSK, QAM16, QAM64, QAM256 modulation is used.
Note that not all carriers need to be modulated. Sometimes, pilots are sent on some to help in synchronization.
Note the main advantage of OFDM is immunity to multipath distortion (dispersion). Due to dispersion/multipath, there is a variable delay in the received signal, with a maximum time difference of delta-t between the first received signal and the last. Now, in order to avoid intersymbol interference, the transmit symbol duration is made at least delta-t larger than T. In this case, it is possible in the receiver to recover the received symbols without intersymbol interference. There is still a need to compensate for the frequency-dependent fading, but this is rather easy.
You can easily see that, to have efficient transmission in the case of a rather long delta-t, you need a long T, and then, in order to have sufficient data throughput, you need many carriers (a rather large N).
Best Regards,
Henri.
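The orthogonality claim above can be checked numerically; this small Python sketch approximates the integral over one symbol duration T with a Riemann sum:

```python
import cmath

T = 1.0                 # symbol duration
w = 2 * cmath.pi / T    # carrier frequency spacing, w = 2*PI/T
N = 10000               # Riemann-sum steps
dt = T / N

def inner_product(k, l):
    # integral over [0, T) of exp(j*k*w*t) * conj(exp(j*l*w*t)) dt
    return sum(cmath.exp(1j * (k - l) * w * n * dt) for n in range(N)) * dt

print(abs(inner_product(3, 5)))  # ~0: distinct carriers are orthogonal
print(abs(inner_product(4, 4)))  # ~T: a carrier against itself
```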
  • asked a question related to Information Extraction
Question
11 answers
In opinion mining, feature extraction plays a very important role in summarizing reviews. There are some research techniques for extracting features from the online reviews. What are the most successful techniques for explicit/implicit feature extraction?
Relevant answer
Answer
Dear Noah,
Here are some papers with a short summary of their best feature sets for opinion mining + their references.
Pang et al. (2002) reported that unigrams outperform bigrams for the sentiment classification of movie reviews.
Dave et al. (2003) show that bigrams and trigrams worked better than unigrams for product-review polarity classification.
Pak and Paroubek (2010) showed that their classifier is able to determine positive, negative and neutral sentiments of documents. Their classifier is based on the multinomial Naïve Bayes classifier and uses N-grams and POS-tags as features.
Full references:
Kushal Dave, Steve Lawrence, and David M. Pennock. 2003. Mining the peanut gallery: opinion extraction and semantic classification of product reviews. In WWW '03: Proceedings of the 12th International Conference on World Wide Web, pages 519–528, New York, NY, USA. ACM.
Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 79–86.
Pak, A., & Paroubek, P. (2010, May). Twitter as a Corpus for Sentiment Analysis and Opinion Mining. In LREC (Vol. 10, pp. 1320–1326).
Best regards,
Yaakov
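For readers who want to try these feature sets, here is a minimal, library-free Python sketch of unigram/bigram extraction (the review text is made up); in the cited papers such n-gram lists are then turned into count vectors for a classifier:

```python
def ngrams(tokens, n):
    """All contiguous n-token sequences, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

review = "the movie was surprisingly good".split()
print(ngrams(review, 1))  # unigram features (as in Pang et al.)
print(ngrams(review, 2))  # bigram features (as in Dave et al.)
```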
  • asked a question related to Information Extraction
Question
1 answer
It will be appreciated if I could have examples with code, tutorial or any other useful resource.
Relevant answer
  • asked a question related to Information Extraction
Question
4 answers
I am trying to use Stanford TokensRegex to design patterns. I am attempting to match "A manager may manage at most 2 branches", which is mentioned once in the text; however, I failed to get it. Below is my code:
String file="A store has many branches. Each branch must be managed by at most 1 manager. A manager may manage at most 2 branches. The branch sells many products. Product is sold by many branches. Branch employs many workers. The labour may process at most 10 sales. It can involve many products. Each Product includes product_code, product_name, size, unit_cost and shelf_no. A branch is uniquely identified by branch_number. Branch has name, address and phone_number. Sale includes sale_number, date, time and total_amount. Each labour has name, address and telephone. Worker is identified by id’.";
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
// read some text in the text variable
// create an empty Annotation just with the given text
Annotation document = new Annotation(file);
// run all Annotators on this text
pipeline.annotate(document);
// these are all the sentences in this document
// a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
for(CoreMap sentence: sentences)
{
TokenSequencePattern pattern = TokenSequencePattern.compile("A manager may manage at most 2 branches");
String sentence1=sentence.toString();
String[] tokens = sentence1.split(" ");
TokenSequenceMatcher matcher = pattern.getMatcher(document.get (CoreAnnotations.SentencesAnnotation.class));
while( matcher.find()){
JOptionPane.showMessageDialog(rootPane, "It has been found");
}
}
Please suggest any books, articles which could help me in learning to design patterns in Stanford TokensRegex within Stanford CoreNLP.
Relevant answer
Answer
Consider the following code.
Binding of variables for use in compiling patterns:
Use Env env = TokenSequencePattern.getNewEnv() to create a new environment for bindings.
Bind a string to an attribute key (Class) lookup: env.bind("numtype", CoreAnnotations.NumericTypeAnnotation.class);
Bind patterns / strings for compiling patterns:
// Bind a string for later compilation using: compile("/it/ /was/ $RELDAY");
env.bind("$RELDAY", "/today|yesterday|tomorrow|tonight|tonite/");
// Bind a pre-compiled pattern for later compilation using: compile("/it/ /was/ $RELDAY");
env.bind("$RELDAY", TokenSequencePattern.compile(env, "/today|yesterday|tomorrow|tonight|tonite/"));
Also note that in your code the matcher is created over the list of sentences (document.get(CoreAnnotations.SentencesAnnotation.class)) rather than over the tokens of the current sentence; try pattern.getMatcher(sentence.get(CoreAnnotations.TokensAnnotation.class)) instead.
  • asked a question related to Information Extraction
Question
5 answers
In my experience MaxEnt is always better than SVM for natural language processing tasks, like text classification, machine translation, and named entity extraction. I've tried to train MaxEnt with different parameters and I find that SVM always outperforms MaxEnt.
Relevant answer
Answer
Your question is very confusing.  First you say that MaxEnt > SVM for NLP but then you say that SVM > MaxEnt.
Is there a clarification you could add to your question?
Also, if you are using linear classifiers, I would strongly recommend checking out L1 regularized logistic regression.
If you are using kernel methods to get non-linear classifiers, check out a recent implementation of neural networks with drop-out regularization, rectified linear units and accelerated learning like Adagrad. You should get superior results versus your SVM or MaxEnt result.
  • asked a question related to Information Extraction
Question
13 answers
The k-means clustering algorithm is iterative in nature: a particular computation has to be carried out repeatedly until the convergence criterion is met. One Map and one Reduce call do the work of one iteration, but I need to call MapReduce again and again until a certain condition is satisfied. Can someone help me in this regard? How do I call MapReduce in a loop? The output of the first Reduce becomes the input to the second Map, the output of the second Reduce becomes the input to the third Map, and so on.
Relevant answer
Answer
Speaking as a Mahout committer, I would currently recommend several other options for k-means due to the passage of time,
1) check out H2O.  The guys at 0xdata have implemented a blazingly fast fork-join framework that allows very nice implementations of iterative algorithms like k-means.  They have a very nice implementation, in fact.
2) check out Spark.  Not nearly as advanced as H2O for k-means, but provides a very nice programming environment for parallel computing, Implementing k-means is relatively easy and Spark is now included in the major Hadoop distributions like MapR or Cloudera.
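Independently of the framework chosen, the iterate-until-convergence driver the question asks about can be sketched in plain Python; each loop body plays the role of one map (assignment) plus one reduce (re-centering) pass, with the reducer output feeding the next round (the 1-D data here is illustrative):

```python
import random

def kmeans(points, k, tol=1e-6, seed=0):
    centers = random.Random(seed).sample(points, k)
    while True:
        # "map" step: assign each point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: (p - centers[i]) ** 2)
            clusters[nearest].append(p)
        # "reduce" step: recompute each center as the mean of its cluster
        new = [sum(c) / len(c) if c else centers[i]
               for i, c in enumerate(clusters)]
        if max(abs(a - b) for a, b in zip(new, centers)) < tol:
            return new  # convergence criterion met -> stop iterating
        centers = new   # reducer output becomes the next mapper input

print(sorted(kmeans([1.0, 1.1, 0.9, 10.0, 10.2, 9.8], k=2)))
```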
  • asked a question related to Information Extraction
Question
4 answers
      Most previous research focuses on the city level (e.g. Beijing or Guangzhou), using remote sensing images to obtain land information and landscape metrics to analyse the changing patterns. Some previous research focuses on discovering the change process in different regions and then comparing them. Some focuses on methodological innovation, such as information extraction or proposing new landscape metrics.
      However, research at the megalopolis level is rarer than at the city level. What are the differences in scientific problems, methods and theory between research at the city level and at the megalopolis level? When doing research at the megalopolis level, what (or which aspects) should we pay more attention to than at the city level?
      Could anyone give me some tips?
Relevant answer
Answer
Thank you Fakhri again. Your advice is that we should divide the whole megalopolis into many subregions and then study the patterns within these subregions. Yes, that's a good idea! However, another question we may have to face is how to divide the megalopolis. If we cannot provide strong evidence for how to divide it, our research results may be questioned; someone could say that we got the results merely by accident. Do you have any ideas about that?
Thanks so much!
  • asked a question related to Information Extraction
Question
4 answers
In airborne hyperspectral imagery (1 m), can signatures of target objects be extracted from training pixels instead of a field campaign? This would reduce cost, but does it affect information extraction performance? The application is tree species identification in a mixed urban and rural environment.
Relevant answer
Answer
Unsupervised classifications are tacitly built this way.
In case you intend to develop your own algorithm, try also selecting windows of more than one pixel, for example 3x3 (it depends on the crown spread), and compare the results with the "one-pixel" training.
[Reason: the spatial resolution of 1 m is too close to the target size, so the identification results may be affected if, for example, a major part of some training pixels is not covered by the supposed trees; the idea is to size the training window so that it is mostly covered by the crown of the tree to be identified.]
  • asked a question related to Information Extraction
Question
12 answers
I have:
- Polarity words.
Example:
- Good: Pol 5.
- Bad: Pol -5.
My assignment:
Determine whether a document is negative or positive. How should I do this? Please tell me about that; I'm a newbie in NLP (sentiment analysis).
I want to use polarity words to do that, not Naive Bayes. So please tell me about algorithms based on polarity words.
Thanks for your time.
Relevant answer
Answer
Hi Phuong, you can try many algorithms.
You can also try many tools, such as Weka, Matlab, RapidMiner, etc.
Note that instead of using only 2 classes (positive and negative) in sentiment analysis, you can consider employing a neutral label as a 3rd class.
This article and paper could be your further reference:
Good Luck
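A minimal sketch of the lexicon-based idea discussed in this thread, including the neutral third class suggested above; the lexicon entries and threshold are illustrative:

```python
POLARITY = {"good": 5, "great": 4, "bad": -5, "awful": -4}  # illustrative lexicon

def classify(document, neutral_band=1):
    """Sum word polarities; totals near zero fall into the neutral class."""
    score = sum(POLARITY.get(w.strip(".,!?"), 0)
                for w in document.lower().split())
    if score > neutral_band:
        return "positive"
    if score < -neutral_band:
        return "negative"
    return "neutral"

print(classify("The food was good and the service was great"))  # positive
print(classify("An awful, bad experience"))                     # negative
```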
  • asked a question related to Information Extraction
Question
3 answers
I would like to integrate WordNet into the Alignment API in order to have linguistic ontology alignment. I looked at the Alignment API compiled files and found many of its functions, but I am a bit confused as to where to integrate the WordNet API code in order to get linguistic ontology alignment results. Thank you.
Relevant answer
Answer
Hello Muhammad, 
I believe this quick tutorial should help you. The 2nd link I've included has further instructions on how to compile WordNet not included in the 1st link.
  • asked a question related to Information Extraction
Question
3 answers
1. co-occurrence of two words
2. co-occurrence of document and words
Do algorithms work on these two concepts?
Relevant answer
Answer
Co-occurrence of words with documents can also be used as a measure for semantic similarity. This was done, for example, in Explicit Semantic Analysis (ESA), http://www.cs.technion.ac.il/~gabr/resources/code/esa/esa.html
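The document-word co-occurrence idea can be sketched in a few lines of Python: each word becomes a vector of its counts across documents, and semantic similarity is the cosine between those vectors (the toy corpus is illustrative, and real ESA uses TF-IDF-weighted Wikipedia concepts rather than raw counts):

```python
import math

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs are pets"]

def word_vector(word):
    # One dimension per document: how often the word occurs in it.
    return [doc.split().count(word) for doc in docs]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

print(cosine(word_vector("cat"), word_vector("dog")))  # 0: no shared documents
print(cosine(word_vector("sat"), word_vector("on")))   # ~1: identical usage
```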
  • asked a question related to Information Extraction
Question
8 answers
In the context of a supervised learning approach, I would like to know the most appropriate method for extracting pertinent information from annotated text, notably the identification of relations between named entities within a sentence.
Relevant answer
Answer
Dear Ines,
Though information extraction is highly data-dependent, kernel-based information extraction is a more advanced approach for such cases. It certainly gives better results compared to feature-based information extraction.
  • asked a question related to Information Extraction
Question
7 answers
Which later can be used for svm classification with segmentation being involved as well.
Relevant answer
Answer
(Unfortunately I could not attach the doc file containing the following text. Therefore I just pasted the text here.)
• In applications like face recognition, in which a vector (or a matrix) is extracted from the whole sample image:
1. Extract histogram (first order or second order histogram) of the image:
- First order histogram (MATLAB):
      im = imread('tire.tif');
  nf = 20; featureVec = imhist(im, nf); % with any arbitrary value of nf
  % OR
  featureVec = imhist(im); % with default nf=256
- Second order histogram, here GLCM (MATLAB):
  im = imread('tire.tif');
  glcm = graycomatrix(im);
  temp = graycoprops(glcm);
  featureVec(1) = temp.Contrast;
  featureVec(2) = temp.Correlation;
  featureVec(3) = temp.Energy;
  featureVec(4) = temp.Homogeneity;
2. This feature vector can be used in any classification/segmentation framework.
• In applications like image (pixel) classification, in which a vector (or a matrix) is extracted for each pixel in the sample image:
1. Extract histogram (first order or second order histogram) of a neighborhood of each pixel (e.g. a square window around the pixel):
- First order histogram (MATLAB):
  im = imread('tire.tif');
  [Nx, Ny] = size(im); % suppose the image is a single-band image
  w = 3; % neighborhood window --> (2w+1)-by-(2w+1)
  nf = 50;
  extIm = padarray(im, [w, w], 'symmetric');
  featureVec = zeros(Nx, Ny, nf); % memory allocation
  for x = 1+w:Nx+w
      for y = 1+w:Ny+w
         WIN = extIm(x-w:x+w, y-w:y+w);
         featureVec(x-w, y-w, :) = imhist(WIN, nf);
     end
  end
- Second order histogram, here GLCM (MATLAB):
   im = imread('tire.tif');
   [Nx, Ny] = size(im); % suppose the image is a single-band image
   w = 3; % neighborhood window --> (2w+1)-by-(2w+1)
   extIm = padarray(im, [w, w], 'symmetric');
   nf = 4; % graycoprops gives 4 features
   featureVec = zeros(Nx, Ny, nf);
   for x = 1+w:Nx+w
      for y = 1+w:Ny+w
         WIN = extIm(x-w:x+w, y-w:y+w);
         glcm = graycomatrix(WIN);
         temp = graycoprops(glcm);
         featureVec(x-w, y-w, 1) = temp.Contrast;
         featureVec(x-w, y-w, 2) = temp.Correlation;
         featureVec(x-w, y-w, 3) = temp.Energy;
         featureVec(x-w, y-w, 4) = temp.Homogeneity;
      end
   end
2. These feature vectors can be used in any classification/segmentation framework.
  • asked a question related to Information Extraction
Question
5 answers
I've gone through some scenarios, like hospital data process mining and restaurant process mining, but want to find a scenario that is not only new but whose log data is also accessible.
Relevant answer
Answer
Hi Ayesha, I think you should first decide which kind of industry you are interested in. One of the main challenges in process mining is getting access to the event log data. You can use publicly available logs as suggested by Marta, but it depends on what you want to do with that data and what kind of research question you intend to answer. Getting access to real event log data usually requires the cooperation of a partner from industry, and setting up this cooperation is usually hard work. I guess that you will probably not be able to get easy access to such data without a partner from practice. From my perspective, searching for publicly available event log data is not a good approach to identifying a new application scenario for process mining. I would start with the identification of a research problem, maybe in a specific industry, and then look for a partner from this industry that is willing to provide the necessary data.
  • asked a question related to Information Extraction
Question
8 answers
I am interested in doing some work in area of semantic web crawling/scraping and using that semantic data to do some discovery.
Relevant answer
Answer
Hi,
Another type of ontology is a knowledge graph such as Freebase (https://www.freebase.com/), which allows users to download the weekly data dumps or use an API to access the information.
best regards,
  • asked a question related to Information Extraction
Question
9 answers
Hello everyone!
Can you advise me which Java edition is better to learn, in your opinion?
Relevant answer
Answer
Generally, in academic work, the core programming language is used, not specific software technologies.
But the decision between web-based (J2EE) and desktop-application-based (J2SE) development should be made through careful consideration of customer requirements, if you want to implement a package for classification and information extraction.
  • asked a question related to Information Extraction
Question
7 answers
I am interested in using machine learning to recognize social interaction patterns such as disagreements, and potentially use those patterns to generate new simulated interactions. I've been working with crowd sourced descriptions of social interactions, but these are more narrative and less action driven.
Are you aware of publicly available datasets of annotated social interactions?
Types of data that might be good candidates are annotated movie scripts or forum threads. Skeletal/gesture data could also be interesting.
Relevant answer
Answer
I suppose it is no longer relevant, but as you mentioned forum threads, I believe this work uses social networks of forum-like environments annotated for agreement/disagreement:
The data sets should be available at:
  • asked a question related to Information Extraction
Question
10 answers
We have a huge volume of atmospheric data. Would you suggest how to perform mining and what type of analysis is possible?
Relevant answer
Answer
  • asked a question related to Information Extraction
Question
3 answers
I need to extract all words after the following pattern "/[Ee]ach/ ([tag:NN]|[tag:NNS]) /has|have/ /\\w|[ ]|[,]/" until the end of the sentence, but I am getting unexpected output:
In the second sentence I am getting "Each campus has a", where the right output is "Each campus has a different name, address, distance to the city center and the only bus running to the campus".
In the third sentence I am getting "Each faculty has a", where the right output is "Each faculty has a name, dean and building".
In the fourth sentence the pattern is unable to match the right output, which is "each problem has soluation, God walling".
It would be appreciated if you could help me solve this problem; I think my pattern has not been written correctly. Below is my code:
String file="ABC University is a large institution with several campuses. Each campus has a different name, address, distance to the city center and the only bus running to the campus.  Each faculty has a name, dean and building. this just for test each problem has soluation, God walling.";
  Properties props = new Properties();
  props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
  StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
  Annotation document = new Annotation(file);
  pipeline.annotate(document);
  List<CoreLabel> tokens = new ArrayList<CoreLabel>();
  List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
  for(CoreMap sentence: sentences)
   {          
    for (CoreLabel token: sentence.get(CoreAnnotations.TokensAnnotation.class))
            tokens.add(token);
    TokenSequencePattern pattern = TokenSequencePattern.compile("/[Ee]ach/ ([tag:NN]|[tag:NNS]) /has|have/ /\\w|[ ]|[,]/");
    TokenSequenceMatcher matcher = pattern.getMatcher(tokens);
    while( matcher.find()){
        JOptionPane.showMessageDialog(rootPane, matcher.group());
     }
     tokens.removeAll(tokens);
   }
Relevant answer
Answer
You can also forget the Stanford tokens and try something like "[Ee]ach .* ha(s|ve).*\." — that will return the three matching sentences.
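A Python version of this regex idea, applied to the question's text; note that non-greedy quantifiers (`.*?`) are used here so each match stops at the first sentence-ending period, whereas the greedy `.*` above can span sentence boundaries:

```python
import re

text = ("ABC University is a large institution with several campuses. "
        "Each campus has a different name, address, distance to the city center "
        "and the only bus running to the campus. "
        "Each faculty has a name, dean and building. "
        "this just for test each problem has soluation, God walling.")

# (?:...) is non-capturing, so findall returns whole matches, not the group.
pattern = re.compile(r"[Ee]ach .*? ha(?:s|ve) .*?\.")
for match in pattern.findall(text):
    print(match)
```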
  • asked a question related to Information Extraction
Question
9 answers
The tolerance rough set model is a perfect model for dealing with missing values in data sets. But how can I use the tolerance rough set model to classify data using a conventional classifier such as KNN?
Relevant answer
Answer
Hi Ahmed,
In addition to imputation methods such as KNN, regression and MLP for missing data, other approaches can be applied. Ensemble classifiers such as AdaBoost and Bayesian networks can classify without imputation.
Please find the review paper as an attachment.
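For completeness, the KNN imputation mentioned above can be sketched in plain Python: a missing entry (None) is replaced by the mean of that feature over the k rows nearest on the jointly observed features (the data and k here are illustrative):

```python
import math

def knn_impute(rows, k=2):
    def dist(a, b):
        # Euclidean distance over features observed in both rows
        d = [(x - y) ** 2 for x, y in zip(a, b)
             if x is not None and y is not None]
        return math.sqrt(sum(d) / len(d)) if d else float("inf")

    out = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, v in enumerate(row):
            if v is None:
                # k nearest rows that actually observe feature j
                neigh = sorted((r for r in rows
                                if r is not row and r[j] is not None),
                               key=lambda r: dist(row, r))[:k]
                out[i][j] = sum(r[j] for r in neigh) / len(neigh)
    return out

data = [[1.0, 2.0], [1.2, None], [1.1, 2.2], [9.0, 9.5]]
print(knn_impute(data, k=2))  # missing value filled from the two nearest rows
```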
  • asked a question related to Information Extraction
Question
7 answers
Can someone help me find a data repository or valid datasets on records of patients with history of mental/psychiatric disorders?
I am preferably looking for data sets in R. But even other formats will do to begin with.
Relevant answer
Answer
  • asked a question related to Information Extraction
Question
7 answers
Any electronic resources include books, example, tutorial are appreciated.
Relevant answer
Answer
A simplified definition of a token in NLP is as follows: a token is a string of contiguous characters between two spaces, or between a space and punctuation marks. A token can also be an integer, a real number, or a number with a colon (a time, for example: 2:00). All other symbols are tokens themselves, except apostrophes and quotation marks within a word (with no space), which in many cases signal acronyms or citations. A token can represent a single word or a group of words (in morphologically rich languages such as Hebrew), as in the token "ולאחי" (VeLeAhi), which includes 4 words: "And to my brother".
A string, as written by one of the previous researchers who responded, is a concept taken from programming languages.
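Under this definition, a rough tokenizer can be sketched with a single regular expression (illustrative and English-only; it keeps times like 2:00 and word-internal apostrophes intact, and makes every other symbol its own token):

```python
import re

# Alternatives are tried left to right: a time-like number first, then a
# word (optionally with an internal apostrophe), then any lone symbol.
TOKEN_RE = re.compile(r"\d+:\d+|\w+(?:'\w+)?|[^\w\s]")

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize("The meeting is at 2:00, isn't it?"))
```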
  • asked a question related to Information Extraction
Question
3 answers
I am working on pattern detection of dermatological images and I would like to know how to extract and match them.
Relevant answer
Answer
I think the previous two references provided by @Christos P Loizou and @Lucia Ballerini are new and excellent to start with.
  • asked a question related to Information Extraction
Question
5 answers
I work on validating a given method and I need a dermatological image database.
Relevant answer
Answer
No problem; maybe it can be used as a benchmark to validate the developed methods or for training.
  • asked a question related to Information Extraction
Question
5 answers
The rotation forest algorithm requires randomly eliminating a subset of classes from the data. Afterwards, a bootstrap (I guess without replacement) of 75% of the remaining data has to be generated to perform PCA. How, and how many, classes should be eliminated? Does a new random subset have to be selected in every iteration? What if it is a two-class data set? In order to perform PCA the data has to be zero-mean (for covariance-PCA) or normalized (for correlation-PCA). I might not have understood it correctly, but does it make sense to select a bootstrap, center the data to do PCA, and then generate scores using the rearranged rotation matrix on the whole data? The algorithm presented in the paper by Rodriguez and Kuncheva, "Rotation Forest: A new classifier ensemble method", IEEE, 2006, explains that overlapping features (random selection with repetition) can be used, but it is not shown how the principal components are merged. Can someone clarify these issues?
Relevant answer
Answer
I did not carefully read the whole paper, but maybe I can answer some of your questions.
How and how many classes should be eliminated?
I did not find explanations from the paper on this issue. A possible solution is to fix the number of classes to be selected, say between 50% and 75% of the whole classes. Then perform random class selection. The authors mentioned that "running PCA on a subset of classes instead on the whole set is done in a bid to avoid identical coefficients if the same feature subset is chosen for different classifiers." So basically, if you can ensure that you can avoid occasionally identical selection of features across the L classifiers, then you do not need to perform class selection, i.e. you can use the whole classes. Actually, even if you could not avoid identical feature selection, I personally do not think it is a big issue.
In every iteration a new random subset has to be selected?
The answer is yes, but as discussed above, I do not think it is a must.
What if it is a two-class data set?
Then obviously, you do not need to (cannot) perform class selection.
Does it make sense to select a bootstrap, centering the data to do PCA and then to generate scores using the rearranged rotation matrix on the whole data?
Yes, I think it makes sense. The idea is quite similar to random forest in the sense that both select a subset of features (feature dimensions if my understanding is correct) to introduce variation and thus avoid correlation between the L classifiers. The difference lies in the fact that rotation forest casts an additional PCA to the selected features. If we can consider feature selection itself as a simple feature projection process, then both random forest and rotation forest project the original feature onto a feature subspace, where rotation forest also translates (covariance-PCA), scales (correlation-PCA) and rotates the subspace. Rotation forest separates the features into subsets, and performs translation, scaling, and rotation to the subset features for both training and testing instances. It is similar to classification based on multiple types of features, where we perform PCA to each individual feature type, concatenate the projected feature vectors to obtain a single vector, and then use the single vector for training and testing. So yes, it makes sense to me.
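The rotation-matrix construction discussed above can be sketched in plain Python; to keep it library-free, this illustration uses feature subsets of size 2, for which PCA reduces to a closed-form 2x2 eigenproblem (this is a sketch of the idea, not the authors' reference implementation, and the class-elimination step is omitted):

```python
import math
import random

def pca_2d(rows):
    """Principal axes of 2-D points as a 2x2 rotation matrix (closed form)."""
    n = len(rows)
    mx = sum(r[0] for r in rows) / n
    my = sum(r[1] for r in rows) / n
    a = sum((r[0] - mx) ** 2 for r in rows) / n           # var(x)
    c = sum((r[1] - my) ** 2 for r in rows) / n           # var(y)
    b = sum((r[0] - mx) * (r[1] - my) for r in rows) / n  # cov(x, y)
    theta = 0.5 * math.atan2(2 * b, a - c)  # angle of first principal axis
    ct, st = math.cos(theta), math.sin(theta)
    return [[ct, -st], [st, ct]]

def rotation_matrix(data, n_features, seed=0):
    """Block-diagonal rotation built from disjoint random feature pairs.

    Assumes an even number of features, for this illustration only.
    """
    rng = random.Random(seed)
    feats = list(range(n_features))
    rng.shuffle(feats)  # random disjoint feature subsets (here: pairs)
    R = [[0.0] * n_features for _ in range(n_features)]
    for i in range(0, n_features, 2):
        p0, p1 = feats[i], feats[i + 1]
        # 75% bootstrap sample, as in the algorithm described above
        boot = [rng.choice(data) for _ in range(int(0.75 * len(data)))]
        axes = pca_2d([[row[p0], row[p1]] for row in boot])
        for r in range(2):
            for c in range(2):
                R[(p0, p1)[r]][(p0, p1)[c]] = axes[r][c]
    return R

data = [[1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0],
        [1.5, 2.5, 3.5, 4.5], [0.5, 1.5, 2.5, 3.5],
        [2.2, 2.1, 3.2, 3.1], [1.1, 1.9, 3.1, 3.9],
        [0.9, 2.2, 2.9, 4.2], [1.3, 1.7, 3.3, 3.7]]
R = rotation_matrix(data, 4)  # instances are then projected as x' = x . R
```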
  • asked a question related to Information Extraction
Question
6 answers
Applications of different types / theories and best entropy method used in image processing.
Relevant answer
Answer
1872 – Ludwig Boltzmann presents his H-theorem, and with it the formula Σpi log pi for the entropy of a single gas particle.
1878 – J. Willard Gibbs defines the Gibbs entropy: the probabilities in the entropy formula are now taken as probabilities of the state of the whole system.
1924 – Harry Nyquist discusses quantifying "intelligence" and the speed at which it can be transmitted by a communication system.
1927 – John von Neumann defines the von Neumann entropy, extending the Gibbs entropy to quantum mechanics.
1928 – Ralph Hartley introduces Hartley information as the logarithm of the number of possible messages, with information being communicated when the receiver can distinguish one sequence of symbols from any other (regardless of any associated meaning).
1929 – Leó Szilárd analyses Maxwell's Demon, showing how a Szilard engine can sometimes transform information into the extraction of useful work.
1940 – Alan Turing introduces the deciban as a measure of information inferred about the German Enigma machine cypher settings by the Banburismus process.
1944 – Claude Shannon's theory of information is substantially complete.
1947 – Richard W. Hamming invents Hamming codes for error detection and correction. For patent reasons, the result is not published until 1950.
1948 – Claude E. Shannon publishes A Mathematical Theory of Communication.
1949 – Claude E. Shannon publishes Communication in the Presence of Noise – Nyquist–Shannon sampling theorem and Shannon–Hartley law.
1949 – Claude E. Shannon's Communication Theory of Secrecy Systems is declassified.
1949 – Robert M. Fano publishes Transmission of Information. M.I.T. Press, Cambridge, Mass. – Shannon–Fano coding.
1949 – Leon G. Kraft discovers Kraft's inequality, which shows the limits of prefix codes.
1949 – Marcel J. E. Golay introduces Golay codes for forward error correction.
1951 – Solomon Kullback and Richard Leibler introduce the Kullback–Leibler divergence.
1951 – David A. Huffman invents Huffman encoding, a method of finding optimal prefix codes for lossless data compression.
1953 – August Albert Sardinas and George W. Patterson devise the Sardinas–Patterson algorithm, a procedure to decide whether a given variable-length code is uniquely decodable.
1954 – Irving S. Reed and David E. Muller propose Reed–Muller codes.
1955 – Peter Elias introduces convolutional codes.
1957 – Eugene Prange first discusses cyclic codes.
1959 – Alexis Hocquenghem, and independently the next year Raj Chandra Bose and Dwijendra Kumar Ray-Chaudhuri, discover BCH codes.
1960 – Irving S. Reed and Gustave Solomon propose Reed–Solomon codes.
1962 – Robert G. Gallager proposes low-density parity-check codes; they are unused for 30 years due to technical limitations.
1965 – Dave Forney discusses concatenated codes.
1967 – Andrew Viterbi reveals the Viterbi algorithm, making decoding of convolutional codes practicable.
1968 – Elwyn Berlekamp invents the Berlekamp–Massey algorithm; its application to decoding BCH and Reed–Solomon codes is pointed out by James L. Massey the following year.
1968 – Chris Wallace and David M. Boulton publish the first of many papers on Minimum Message Length (MML) statistical and inductive inference.
1970 – Valerii Denisovich Goppa introduces Goppa codes.
1972 – J. Justesen proposes Justesen codes, an improvement on Reed–Solomon codes.
1973 – David Slepian and Jack Wolf discover and prove the Slepian–Wolf coding limits for distributed source coding.[1]
1974 – George H. Walther and Harold F. O'Neil, Jr., conduct the first empirical study of satisfaction factors in the user-computer interface.[2]
1976 – Gottfried Ungerboeck gives the first paper on trellis modulation; a more detailed exposition in 1982 leads to a raising of analogue modem POTS speeds from 9.6 kbit/s to 33.6 kbit/s.
1976 – R. Pasco and Jorma J. Rissanen develop effective arithmetic coding techniques.
1977 – Abraham Lempel and Jacob Ziv develop Lempel–Ziv compression (LZ77).
1989 – Phil Katz publishes the .zip format including DEFLATE (LZ77 + Huffman coding); it later becomes the most widely used archive container and most widely used lossless compression algorithm.
1993 – Claude Berrou, Alain Glavieux and Punya Thitimajshima introduce Turbo codes.
1994 – Michael Burrows and David Wheeler publish the Burrows–Wheeler transform, later to find use in bzip2.
1995 – Benjamin Schumacher coins the term qubit and proves the quantum noiseless coding theorem.
2001 – Sam Kwong and Yu Fan Ho propose Statistical Lempel–Ziv.
2008 – Erdal Arıkan introduces polar codes, the first practical construction of codes that achieves capacity for a wide array of channels.
  • asked a question related to Information Extraction
Question
8 answers
The large photo album has extra charges on delivery.
The adjective large may indicate an attribute size of the photo album.
Red car
The adjective red may indicate an attribute colour of the car.
How can I write a Java program to implement the above task? (And how do we know that the attribute underlying large or big is size, and likewise which attribute underlies a colour word?)
Relevant answer
Answer
Mussa,
ANSWER TO “How can I extract the adjective?”
--------------------------------------------------------------
You can extract adjectives from sentences by applying a parser or, more precisely, a part-of-speech tagger. Let me use your two sentences as examples.
EXAMPLE 1. For the sentence:
“The large photo album has extra charges on delivery.”
1. You apply a parser in order to get the part of speech tags (nouns, verbs, adjectives, etc.) for each word. Suppose that we are applying the Stanford Parser to the above sentence. The parser will output the following:
The/DT large/JJ photo/NN album/NN has/VBZ extra/JJ charges/NNS on/IN delivery/NN
WHERE: JJ stands for adjective, NN stands for noun, VBZ stands for verb, and so on.
2. You extract all the words which have the JJ part of speech tag. For this sentence you extract:
- large/JJ
- extra/JJ
EXAMPLE 2: For the sentence/expression:
“Red car”
1. You apply the parser and you get the following output:
Red/JJ car/NN
2. You extract the words which have the JJ part of speech tag:
- Red/JJ
PARSERS FOR JAVA:
---------------------------
In Java you can use the following free available parsers for the English language:
- The Stanford Parser: A statistical parser
- The OpenNLP library
(This library includes several tools for natural language processing. However you just need the “Part-of-Speech Tagger” for extracting adjectives).
- The TreeTagger
This tagger is not implemented in Java, so to use it from Java you need a wrapper. tt4j is one such Java wrapper for the TreeTagger: https://code.google.com/p/tt4j/
All the above links have all the information you need.
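To make step 2 concrete, here is a minimal, self-contained Java sketch (class and method names are my own) that filters adjectives out of a tagger's word/TAG output. In practice the tagged string would come from the Stanford Parser or another POS tagger; matching any tag starting with "JJ" also catches the comparative (JJR) and superlative (JJS) forms.

```java
import java.util.ArrayList;
import java.util.List;

public class AdjectiveExtractor {

    // Given "word/TAG" tokens separated by whitespace, collect every
    // word whose Penn Treebank tag starts with "JJ" (JJ, JJR, JJS).
    public static List<String> extractAdjectives(String taggedSentence) {
        List<String> adjectives = new ArrayList<>();
        for (String token : taggedSentence.trim().split("\\s+")) {
            int slash = token.lastIndexOf('/');
            if (slash > 0 && token.substring(slash + 1).startsWith("JJ")) {
                adjectives.add(token.substring(0, slash));
            }
        }
        return adjectives;
    }

    public static void main(String[] args) {
        String tagged = "The/DT large/JJ photo/NN album/NN has/VBZ "
                + "extra/JJ charges/NNS on/IN delivery/NN";
        System.out.println(extractAdjectives(tagged)); // [large, extra]
    }
}
```

The same call on "Red/JJ car/NN" returns just [Red], matching example 2 above.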
  • asked a question related to Information Extraction
Question
2 answers
I am doing a project related to Natural Language Processing. I am using Stanford CoreNLP v 3.3.1 to analyse text. I would like to know how could I extract adjectives from a sentence by using Stanford CoreNLP.
Relevant answer
Answer
  • asked a question related to Information Extraction
Question
1 answer
Sometimes this tree view is very useful.
Relevant answer
Answer
Dear Hamed, you can do that using WordNet (the OWL version).
Use Protégé as a visualisation tool, since it supports OWL ontologies.
  • asked a question related to Information Extraction
Question
2 answers
For a given candidate character extracted by MSER, I believe there are only a limited number of features available for detecting text in scene images. Given this limitation, what alternative features of the extracted candidates could help with text detection in scene images? Please comment, so we can work towards a general solution.
Relevant answer
Answer
Thanks, Hradis, for your response and the reference.
My point is that each MSER-based candidate character can be treated like a window. Each window has some features, such as height and width. What other features might help in detecting text?
Regards
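Beyond height and width, simple window-level geometric features can be derived directly from a candidate's binary mask. The sketch below (class name and feature choice are my own; aspect ratio and occupancy are common cues in MSER-based text detection, since text strokes rarely have extreme aspect ratios or near-zero fill) assumes a non-empty region mask:

```java
// Geometric features for an MSER candidate region, given its binary
// mask (true = pixel belongs to the region). Assumes at least one
// true pixel.
public class RegionFeatures {
    public final double aspectRatio; // bounding-box width / height
    public final double occupancy;   // region pixels / bounding-box area

    public RegionFeatures(boolean[][] mask) {
        int minR = Integer.MAX_VALUE, maxR = -1;
        int minC = Integer.MAX_VALUE, maxC = -1;
        int pixels = 0;
        for (int r = 0; r < mask.length; r++) {
            for (int c = 0; c < mask[r].length; c++) {
                if (mask[r][c]) {
                    pixels++;
                    if (r < minR) minR = r;
                    if (r > maxR) maxR = r;
                    if (c < minC) minC = c;
                    if (c > maxC) maxC = c;
                }
            }
        }
        int h = maxR - minR + 1;
        int w = maxC - minC + 1;
        aspectRatio = (double) w / h;
        occupancy = (double) pixels / (w * h);
    }

    public static void main(String[] args) {
        // A tall, thin vertical stroke: aspect ratio < 1, occupancy 1.0
        boolean[][] mask = {
            {false, true, false},
            {false, true, false},
            {false, true, false}
        };
        RegionFeatures f = new RegionFeatures(mask);
        System.out.println("aspect=" + f.aspectRatio
                + " occupancy=" + f.occupancy);
    }
}
```

Feature vectors like this can then be fed to any classifier to prune non-text candidates before grouping.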
  • asked a question related to Information Extraction
Question
5 answers
I am looking for code to run Stanford coreference resolution in Java (NetBeans).
Relevant answer
Answer
You have a java syntax error here:
import edu.stanford.nlp.dcoref;
As 'dcoref' is a package, you should replace this line by:
import edu.stanford.nlp.dcoref.*;
To get the output of coreference resolution:
// Get the coreference chains:
Map<Integer, CorefChain> graph = annotation.get(CorefChainAnnotation.class);
// Then iterate over them:
for (Map.Entry<Integer, CorefChain> entry : graph.entrySet()) {
    System.out.println("ClusterId: " + entry.getKey());
    System.out.println("CHAIN: " + entry.getValue());
    CorefChain c = entry.getValue();
    CorefMention representativeMention = c.getRepresentativeMention();
    System.out.println("Representative mention: " + representativeMention);
}
The javadoc should help you do whatever else you want with these objects.
Also, if you have other questions, I would suggest checking, and asking on, the dedicated mailing list: