
Data Integration - Science topic

Explore the latest questions and answers in Data Integration, and find Data Integration experts.
Questions related to Data Integration
  • asked a question related to Data Integration
Question
1 answer
I want to integrate 5 GEO datasets and analyze them together. However, I don't want to include all the samples in all the GEO datasets, but only a specific phenotype along with the control group. Are there any tools available to do this, such as R packages, web tools, Linux-based applications, etc.? And if so, what are the steps for the analysis?
Relevant answer
Answer
To integrate and analyze multiple microarray datasets, including specific phenotypes and control groups, there are several tools and approaches available. Here is a general outline of the steps you can follow using R packages:
Data preprocessing and normalization:
-Download the microarray datasets from the GEO database.
-Import the datasets into R using packages such as GEOquery or limma.
-Preprocess the raw data by performing quality control checks, background correction, and normalization using methods like RMA (Robust Multi-array Average) or quantile normalization.
Sample selection:
-Identify the specific phenotypes you want to include in your analysis and select the corresponding samples from each dataset.
-Ensure that the control groups are properly matched or selected from the datasets.
Data integration:
-Merge the selected samples from each dataset into a single integrated dataset.
-Perform batch effect correction if necessary to address any systematic variations introduced by different datasets using methods like ComBat or SVA.
Differential expression analysis:
-Use statistical methods such as linear models or empirical Bayes methods to identify genes that are differentially expressed between the selected phenotypes and control groups.
-Perform appropriate statistical tests, such as t-tests, ANOVA, or regression models, depending on the experimental design and research question.
Functional enrichment analysis:
-Once you have a list of differentially expressed genes, perform functional enrichment analysis using tools like clusterProfiler, enrichR, or DAVID to identify overrepresented gene ontology (GO) terms, pathways, or biological functions associated with the gene list.
Visualization and interpretation:
-Generate informative visualizations, such as heatmaps, volcano plots, or pathway enrichment plots, to visualize the results.
-Interpret the findings based on the biological knowledge and existing literature.
Some R packages that can be useful for these steps include GEOquery, limma, sva, clusterProfiler, enrichR, and ggplot2, among others. A minimal sketch of the workflow is given below.
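The sketch below assumes two hypothetical GEO series accessions (GSE0001, GSE0002) and a phenotype label stored in the characteristics_ch1 column of the sample annotation; the accessions, column name, and group labels are placeholders you would replace with your own.

# Sketch only: integrate two placeholder GEO series, keep one phenotype plus controls,
# correct for batch effects, then test for differential expression with limma.
library(GEOquery)
library(limma)
library(sva)

accessions <- c("GSE0001", "GSE0002")                      # placeholder accessions
esets <- lapply(accessions, function(acc) getGEO(acc, GSEMatrix = TRUE)[[1]])

# Keep only the samples annotated with the phenotype of interest or as controls;
# the annotation column and group labels are assumptions.
keep <- c("phenotype_of_interest", "control")
esets <- lapply(esets, function(es) es[, pData(es)$characteristics_ch1 %in% keep])

# Merge on the features common to all series and record the batch (series) of each sample.
common <- Reduce(intersect, lapply(esets, featureNames))
expr   <- do.call(cbind, lapply(esets, function(es) exprs(es)[common, ]))
group  <- factor(unlist(lapply(esets, function(es) as.character(pData(es)$characteristics_ch1))))
batch  <- factor(rep(accessions, times = sapply(esets, ncol)))

# Batch-effect correction (ComBat) followed by a linear model / empirical Bayes test.
expr_bc <- ComBat(dat = expr, batch = batch, mod = model.matrix(~ group))
design  <- model.matrix(~ group)
fit     <- eBayes(lmFit(expr_bc, design))
topTable(fit, coef = 2, number = 20)                       # top differentially expressed genes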
  • asked a question related to Data Integration
Question
3 answers
Hello, I am interested in data integration approaches. As far as I have discovered, there are two main approaches: materialized and virtual (mediator-wrapper).
I want to combine both (hence hybrid) as part of my solution, but I can't find a well-informed process on how to do so.
Relevant answer
  • asked a question related to Data Integration
Question
3 answers
I have transcriptomic and proteomics data from prokaryotes. What is the best way to integrate this data? Is there any bioinformatic tool for this data integration?
Relevant answer
Answer
Galaxy-P (http://galaxyp.org/) might be useful.
  • asked a question related to Data Integration
Question
4 answers
Data Integration
Merging
Relevant answer
Answer
There are various ways to integrate different datasets. You may extract common features from each dataset and integrate them in much the same way as is done in relational or object-oriented databases. You would need to be more specific about the datasets involved.
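A toy illustration of what such key-based merging looks like in R is sketched below; the two data frames and their column names (sample_id, expression, phenotype) are invented purely for the example.

# Toy example of relational-style integration on a shared key.
dataset_a <- data.frame(sample_id  = c("S1", "S2", "S3"),
                        expression = c(2.3, 1.7, 3.1))
dataset_b <- data.frame(sample_id  = c("S2", "S3", "S4"),
                        phenotype  = c("case", "control", "case"))

# An inner join keeps only the samples present in both sources.
merged <- merge(dataset_a, dataset_b, by = "sample_id")
print(merged)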
  • asked a question related to Data Integration
Question
8 answers
Is data fusion one stage of data integration? Is data fusion a reduction or a replacement technique?
Please do let me know, thanks
Relevant answer
Answer
Dear Mahdis Dezfouli,
01. Data Fusion :
Data fusion is the process of getting data from multiple sources in order to build more sophisticated models and understand more about a project. It often means getting combined data on a single subject and combining it for central analysis.
or
Data fusion frequently involves “fusing” data at different abstraction levels and differing levels of uncertainty to support a more narrow set of application workloads.
or
Data fusion is the process of integrating multiple data sources to produce more consistent, accurate, and useful information than that provided by any individual data source.
Various types of data fusion work in different ways:
Low-, intermediate- and high-level data fusion are commonly distinguished, as are geospatial types of data fusion from other types. Another specific type of data fusion is called “sensor fusion”, where data from diverse sensors are combined into one data-rich image or analysis.
Data fusion is broadly applied to technologies, for instance, in a research project, scientists might use data fusion to combine physical tracking data with environmental data, or in a customer dashboard, marketers might combine client identifier data with purchase history and other data collected at brick-and-mortar store locations to build a better profile.
Data fusion also has a more concrete definition from the Joint Directors of Laboratories (JDL) Data Fusion Group, which defines six levels for its data fusion information group model:
  1. Source preprocessing
  2. Object assessment
  3. Situation assessment
  4. Impact assessment
  5. Process refinement
  6. User refinement
02. Data integration :
Data integration is a process in which heterogeneous data is retrieved and combined as an incorporated form and structure. Data integration allows different data types (such as data sets, documents and tables) to be merged by users, organizations and applications, for use as personal or business processes and/or functions.
or
Data integration is the combination of technical and business processes used to combine data from disparate sources into meaningful and valuable information. A complete data integration solution delivers trusted data from various sources.
or
Data integration involves combining data from several disparate sources, which are stored using various technologies, to provide a unified view of the data. Data integration becomes increasingly important when merging the systems of two companies or consolidating applications within one company to provide a unified view of the company's data assets. The latter initiative is often called a data warehouse.
or
Data integration in the purest sense is about carefully and methodically blending data from different sources, making it more useful and valuable than it was before. IBM provides a strong definition, stating “Data integration is the combination of technical and business processes used to combine data from disparate sources into meaningful and valuable information.”
An example of data integration in a smaller paradigm is spreadsheet integration in a Microsoft Word document.
Data integration is a term covering several distinct sub-areas such as:
  1. Data warehousing
  2. Data migration
  3. Enterprise application/information integration
  4. Master data management
I hope I have answered your question.
With Best Wishes,
Samir G. Pandya.
  • asked a question related to Data Integration
Question
3 answers
Relevant answer
Answer
Sorry, I am not familiar with, or experienced in, the identification of similar individuals in OBDI. Maybe someone else here could help you.
Best,
Yanshi
  • asked a question related to Data Integration
Question
5 answers
The BCS in the UK has developed a Blueprint for Cyber Security in Health and Care, and aims to bring key stakeholders together to improve security within the NHS. But are our health and social care systems really fit for the 21st century, and are they citizen-focused?
If you have an opinion, please consider submitting a paper via the link provided with this question.
Call for Papers
The development of new electronic services within health and social care provides an opportunity for citizen-focused health care, especially in sharing information across traditional organisational boundaries. With the move to enhanced services, there are also increasing risks, including data breaches and the hacking of medical devices.
As health care data often contains sensitive information, there are many risks around the trustworthiness of the security infrastructures used within health and social care and around the methods that can be applied to share information across domains. New regulations, too, such as the GDPR (General Data Protection Regulation), focus on an integrated security approach for incident response, encryption, and pseudonymisation, and on providing citizens with more control over their data. This will require new approaches to the design of data architectures and the services used within health and social care infrastructures.
This special issue focuses on the latest research within health and social care for cyber security, including the application of new methods to integrate data from each part of the patient journey. It will also focus on the integration of policies and data protection methods to protect against data breaches, while allowing data to be used to improve patient safety and reduce the costs of health care provision.
Potential topics include but are not limited to the following:
- Data breaches and risk models within health and social care
- Detecting and responding to health care data breaches
- New design principles for GDPR requirements within health and social care data
- Models for risk analysis of cyber security threats
- Citizen-focused health and social care systems
- Information sharing and secure architectures within health and social care
- Trust, governance, and consent around interagency approaches
- Data sharing models for interagency approaches
- Creating consensus models for health and social care
- Cloud-based architectures for integrated health and social care
- Cryptography for health and social care data
- Integration of cryptography for trust and governance, including methods for anonymization and secure data processing
- Policy integration for data access
- Anonymization and sanitisation of health care data
- Application of blockchain methods within a health care environment
- Attacks on health care devices
- Vulnerabilities in health care devices
Relevant answer
Answer
It is very important to maintain a high level of cyber security in computerized health and social care institutions as well.
Unfortunately, despite the assurances of the companies that run social media portals, the information contained on these websites is not always fully secured against the activities of cybercriminals.
In addition, there is the issue of large companies downloading data from social media portals into Big Data database systems in order to process it for marketing purposes.
The issue of privacy in social media is very important and is related to the security of personal information. Privacy is at risk in terms of the information posted on social media portals.
I invite you to the discussion.
  • asked a question related to Data Integration
Question
5 answers
Hello everyone:
Many countries around the world are looking for updated and integrated data and information related to renewable energy sources to solve practical problems in this area. The question is: do you think that building a Knowledge Graph on Renewable Energy Sources would be a suitable research project?
Do you know about any research project related to this topic?
I would like to hear your comments and insight about that.
Best regards.
Relevant answer
Answer
Why not? It is a good idea.
  • asked a question related to Data Integration
Question
3 answers
Dear Researchers,
I would like to know what the technical and practical challenges were in integrating various country-, regional- and provider-level health data, and how you overcame them (i.e. from problem to solution).
Relevant answer
Answer
As prescription drugs produce more severe side effects, the low reimbursement for acupuncturists makes these alternative treatments even less viable.
When there are more symptoms and more severe diseases, patients need more needles and/or longer treatment times. Because no acupuncturist sits on any insurance company's board, the insurers' low payments to acupuncture clinics force an acupuncturist to treat multiple patients in an hour. That destroys the effectiveness of acupuncture treatment.
  • asked a question related to Data Integration
Question
1 answer
ALCOA is a compliance standard used in life sciences and pharmaceuticals for maintaining the integrity of data. I am thinking of using this standard in the IT sector for maintaining data integrity. Can anyone please suggest whether it is used there, or whether there is another standard available for data integrity in the IT sector?
Relevant answer
Answer
It depends on what the data is being used for! If it is in the pharmaceutical or biological arenas, it must be compliant with 21 CFR Part 11 (or the EU equivalent, depending on where you are selling). Many places also look at GAMP 5.
  • asked a question related to Data Integration
Question
3 answers
I want to compare with what I currently have.
Relevant answer
Answer
I recommend https://opendatakit.org if you plan to collect information.
  • asked a question related to Data Integration
Question
4 answers
I have a list in the following format. I want to sort my list in decreasing order according to the length of each list.
mylist:
[[1]]
[[1]][[1]]
[[1]][[1]][[1]]
+ 6/9453 vertices, named:
[1] VEGFA EPHB2 GRIN2B AP2M1 KCNJ11 ABCC8
[[1]][[2]]
[[1]][[2]][[1]]
+ 4/9453 vertices, named:
[1] VEGFA VTN PRKCA ADCY5
[[1]][[3]]
[[1]][[3]][[1]]
+ 0/9453 vertices, named:
[[1]][[4]]
[[1]][[4]][[1]]
+ 4/9453 vertices, named:
[1] VEGFA KDR GRB2 ADRB1
[[1]][[5]]
[[1]][[5]][[1]]
+ 3/9453 vertices, named:
[1] VEGFA AKT1 AKT2
[[1]][[6]]
[[1]][[6]][[1]]
+ 4/9453 vertices, named:
[1] VEGFA CTGF AP3D1 AP3S2
[[2]]
[[2]][[1]]
[[2]][[1]][[1]]
+ 6/9453 vertices, named:
[1] HHEX EFEMP2 TP53 ARIH2 ENSA ABCC8
[[2]][[2]]
[[2]][[2]][[1]]
+ 5/9453 vertices, named:
[1] HHEX TLE1 POLB PRKCA ADCY5
[[2]][[3]]
[[2]][[3]][[1]]
+ 0/9453 vertices, named:
[[2]][[4]]
[[2]][[4]][[1]]
+ 5/9453 vertices, named:
[1] HHEX TLE1 ATN1 MAGI2 ADRB1
[[2]][[5]]
[[2]][[5]][[1]]
+ 4/9453 vertices, named:
[1] HHEX JUN ESR1 AKT2
[[2]][[6]]
[[2]][[6]][[1]]
+ 6/9453 vertices, named:
[1] HHEX TLE1 CDK1 BUB1 AP3B1 AP3S2
[[3]]
[[3]][[1]]
[[3]][[1]][[1]]
+ 7/9453 vertices, named:
[1] PPP1R3A RPS6KA1 MAPK1 TP53 ARIH2 ENSA ABCC8
[[3]][[2]]
[[3]][[2]][[1]]
+ 4/9453 vertices, named:
[1] PPP1R3A PLN PRKACA ADCY5
[[3]][[3]]
[[3]][[3]][[1]]
+ 0/9453 vertices, named:
[[3]][[4]]
[[3]][[4]][[1]]
+ 4/9453 vertices, named:
[1] PPP1R3A RPS6KA1 GRB2 ADRB1
[[3]][[5]]
[[3]][[5]][[1]]
+ 4/9453 vertices, named:
[1] PPP1R3A RPS6KA1 PDPK1 AKT2
[[3]][[6]]
[[3]][[6]][[1]]
+ 6/9453 vertices, named:
[1] PPP1R3A RPS6KA1 MAPK1 IRS1 AP3S1 AP3S2
where "+ 6/9453" indicates the length of that list. For component 1 there are six lists of different lengths, so I want to sort all the component lists in decreasing order. Zero-length elements are not to be considered.
I used one command, but I don't know whether it is the right way to do it:
mylist[sort(order(mylist))]
Error: unexpected ']' in "mylist[sort(order(mylist))]"
Thanks.
Relevant answer
Answer
Hi, it is not a recursive sort. In my example, list [[1]] contains 6 sub-lists: [[1]][[1]], [[1]][[2]], ..., [[1]][[6]], and each has a different length, e.g. 4, 3, 6, and so on. So I want to sort these sub-lists [[1]][[1]], [[1]][[2]], ... by their lengths, which are 6 for [[1]][[1]], 4 for [[1]][[2]], etc.; sub-lists of length 0 should not be considered. Thanks.
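One possible way to do this in R is sketched below, assuming (as in the printed output) that each element mylist[[i]][[j]] is itself a list whose first element is the igraph vertex sequence reported as "+ n/9453".

# Sketch: within each component, drop zero-length sub-lists and sort the rest
# by decreasing length of their first element.
sort_component <- function(component) {
  lens      <- sapply(component, function(x) length(x[[1]]))
  component <- component[lens > 0]                 # discard zero-length entries
  lens      <- lens[lens > 0]
  component[order(lens, decreasing = TRUE)]        # longest sub-lists first
}

sorted_list <- lapply(mylist, sort_component)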
  • asked a question related to Data Integration
Question
1 answer
I have a data set where I have collected the following scatter parameters:
FSC-A, SSC-A, SSC-H, and SSC-W.
(I don't have FSC-H and FSC-W in this set.)
Is there a pair that I can use to isolate singlets? Thanks.
Relevant answer
Answer
Hi Steve,
you can use SSC-H vs SSC-W.
  • asked a question related to Data Integration
Question
1 answer
I know about three panel co-integration tests, i.e. the Pedroni, Kao, and Johansen Fisher tests. If all the tests give significant results, then which test should be used for interpreting the results?
Which test is more valid?
Please provide a reference too.
  • asked a question related to Data Integration
Question
8 answers
When you read scientific research related to climate change, it is apparent that many types of disparate data are integrated when simulations of the future are made. How does the educated social scientist (with no climatology training) evaluate the quality of the research and how it is integrated to produce different scenarios with broader or narrower ranges of values? Are many researchers focusing on the replication of research?
How would you rate the state of climate research: nascent, developing, developed in specific areas, mature? What are the biggest current gaps in our knowledge of ocean and atmospheric systems? Are there also gaps in modeling the interactions between systems that contribute to less certainty? Does the climate change community readily admit to those gaps? What are some significant recent anomalies? Have they been accounted for sufficiently?
How do you judge climate scientists that have gradually evolved into advocates?  
Relevant answer
Answer
As a modeller, I find that data on soil carbon, its changes over time, and soil bulk density are lacking.
We usually get help from long-term fertilizer trial datasets; unfortunately these are inadequate and heavily skewed towards crop data.
Soil data are few and far between.
Regards
  • asked a question related to Data Integration
Question
5 answers
How many observations will suffice to conduct a panel co-integration test? I mean, how many groups and time spans are needed for this test?
Relevant answer
Answer
Thank you Kotosz. 
  • asked a question related to Data Integration
Question
3 answers
Hi All,
I am doing some analyses on a population of adolescents using measurements of some clinical variables collected over 3 time points and broken down by 5 groups of puberty stages. I was wondering which is the most appropriate strategy of analysis if I add another level of complexity by adding a variable "year", indicating a repetition of the measurements in the same experimental setting but one year later (so the puberty stage of the population under analysis has changed by 1-2 units). Another element of complexity is that the sets of participants in the 2 consecutive years are different, with some overlap between the 2 sets but also disjoint subsets.
This is a demo of my data set:
idx  id stage_of_p clinical_var time_point year
1 1 1 11 1 2003
2 1 1 10 2 2003
3 1 1 11 3 2003
4 2 2 14 1 2003
5 2 2 13 2 2003
6 2 2 10 3 2003
7 3 3 15 1 2003
8 3 3 13 2 2003
9 3 3 10 3 2003
10 4 4 10 1 2003
11 4 4 11 2 2003
12 4 4 10 3 2003
13 5 5 13 1 2003
14 5 5 15 2 2003
15 5 5 17 3 2003
16 6 1 11 1 2004
17 6 1 12 2 2004
18 6 1 12 3 2004
19 7 2 11 1 2004
20 7 2 14 2 2004
21 7 2 13 3 2004
22 1 1 12 1 2004
23 1 1 11 2 2004
24 1 1 15 3 2004
25 2 2 11 1 2004
26 2 2 12 2 2004
27 2 2 11 3 2004
Thanks,
Maria
Relevant answer
Answer
This seems to me to be suitable for a multilevel model, where you have periods nested within individuals, nested within groups. I am not sure why you need the year if you have periods. Also, I am not clear on what sort of post-hoc test you are looking for: what is it that you want to analyze as a post-hoc analysis?
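As a rough illustration of such a multilevel model in R, a sketch with lme4 is below; it assumes a data frame dat with the columns from the demo above (id, stage_of_p, clinical_var, time_point, year), and the fixed/random structure shown is only one of several defensible choices.

# Sketch of a mixed-effects model: repeated time points nested within participants,
# with puberty stage and year as fixed effects and a random intercept per participant
# (which also accommodates the partial overlap of participants across years).
library(lme4)

dat$id   <- factor(dat$id)
dat$year <- factor(dat$year)

m <- lmer(clinical_var ~ time_point + factor(stage_of_p) + year + (1 | id), data = dat)
summary(m)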
  • asked a question related to Data Integration
Question
9 answers
There are many papers associated with my research direction in different regions around the world, so I want to synthesize the data from papers in different regions. However, it is difficult to synthesize the data because of differences in experiments and statistical methods. I don't know whether there is a software package or method to solve my problem. I look forward to your reply.
Relevant answer
Answer
Let me take a different approach to your question on synthesizing. Synthesis is about writing, putting together themes to establish an interconnected and organized point of view using information from multiple sources, which can include a meta-analysis. By synthesizing, you are trying to create a new understanding, and I do agree that the best way to do this is through a systematic review, which can be quantitative or qualitative. It depends upon your question. You need to be able to write effectively to do this. Here are some sources that may help:
  • asked a question related to Data Integration
Question
1 answer
In schema matching, there are various ambiguities. I would like to know, in real scenarios, what the various differences are that exist.
Relevant answer
Answer
There is very good research on matching patterns by Scharffe and Euzenat. I have also done some work on cataloguing and defining typical ambiguities/mismatches (see attached papers).
  • asked a question related to Data Integration
Question
1 answer
When working in MAUD with 2D data that is integrated in, say, 10-degree slices around the azimuthal direction, you get data sets that cover different 2theta ranges. I want to cut several such ranges off before the edge of the detector, as it works badly there. Is there any way to actually do this in MAUD?
As far as I can tell, you can only set one cut-off range that applies to all sections and you can also only make excluded regions that apply to all ranges.
Relevant answer
Answer
You can treat your datasets as several independent datasets (possibly grouping them by eta values), then declare your ranges independently for each dataset.
  • asked a question related to Data Integration
Question
10 answers
I manage a research participant registry. I have just come on board, and the data from participant surveys has been input by volunteers. I have checked the first 100 surveys for errors and have found around a 60% error rate (around 60% of the surveys have at least one entry error). I plan to double-enter all of the current surveys at 100%. However, outside of more extensive volunteer training, I am looking for measures to ensure data integrity for future surveys.
Relevant answer
Answer
One error on 60% of surveys does not constitute a 60% error rate -- the rate would depend on the number of responses entered. Actually, only 1 error on 60 out of 100 surveys doesn't sound unusual. All manual data entry is subject to error. The best way is to double-enter, as you are doing. If you could load the survey onto telephone survey software, you would reduce your error rate and lessen the number of hours spent on data entry.
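If helpful, a small R sketch of the double-entry check is below: it compares two independently keyed copies of the same surveys and lists every field where they disagree. The file names and the survey_id key column are assumptions.

# Compare two independently entered copies of the survey data.
entry1 <- read.csv("entry_round1.csv", stringsAsFactors = FALSE)
entry2 <- read.csv("entry_round2.csv", stringsAsFactors = FALSE)

entry1 <- entry1[order(entry1$survey_id), ]
entry2 <- entry2[order(entry2$survey_id), ]

# Logical matrix of disagreements, then a readable list of survey/field pairs to recheck.
diffs <- which(entry1 != entry2, arr.ind = TRUE)
data.frame(survey_id = entry1$survey_id[diffs[, "row"]],
           field     = colnames(entry1)[diffs[, "col"]],
           entry_1   = entry1[diffs],
           entry_2   = entry2[diffs])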
  • asked a question related to Data Integration
Question
1 answer
Deconvolution (commonly called integration) of 13C NMR spectra of compost samples seems to be difficult in free software (such as DmFit). Is there another way to do it rapidly without manually fitting the spectrum to the one given by the instrument, which is a Bruker UltraShield 400?
Relevant answer
Answer
Use Origin 8.5; it may be useful to you. For this you should have the raw data (Excel sheet).
  • asked a question related to Data Integration
Question
9 answers
I am conducting a study evaluating a document database (MongoDB) and a relational database. For both databases, I created a simple application in web2py with only CRUD functions so they can be evaluated fairly. However, I cannot find a good way to evaluate the two. I am mostly focused on their data integrity, ease of use, cost (in-memory), and database structure.
I need to find a reference on how to measure the 4 criteria. For instance, how can I measure the data integrity of a document database?
Relevant answer
Answer
It's also good to work in the direction of the TPC (http://www.tpc.org/). It is a widely known, accepted benchmarking practice with very robust parts.
In the paper recommended by Neamat El-Tazi, "OLTP-Bench: An Extensible Testbed for Benchmarking Relational Databases", they also consider a part of the TPC called TPC-C.
P.S. Many thanks to Neamat El-Tazi for recommending "OLTP-Bench: An Extensible Testbed for Benchmarking Relational Databases".
  • asked a question related to Data Integration
Question
10 answers
Please, could someone show me some good material to study and understand the methods of data integration?
My project involves the use of microarrays and HPLC to generate the data. I'm having trouble finding good information about the new methods employed today.
Relevant answer
Answer
3Omics is also good. But even though one uses the same material for both studies, it would be hard to compare them, for obvious reasons. A transcriptome can easily come close to showing expression of 99% of the coding/expressed regions of the genome, while the identifiable metabolome covers only 5-10% of the genome with the best available tools, annotations, and mass spectrometers; thus the limitations are "annotation" and the "diverse chemistry" of the metabolome in general. Proteomes lie somewhere in the middle. Combining both may help you focus on a few pathways where most of your metabolites concentrate (instrumental and chemical bias being the major reasons!), but it would still be hard to explain why, in a metabolic pathway, one molecule upstream is upregulated, the one below it is downregulated, while many show no change at all. Statistics alone will not help in deciphering the 'missing dots' in metabolomic datasets.
  • asked a question related to Data Integration
Question
5 answers
How do I interpret mass spectra of DNA from raw data? We have conjugated DNA with organic molecules; after conjugation we wanted to characterize our product by MALDI-MS analysis and obtained spectra in m/z. These spectra do not show the actual peak of the product; the observed mass is about 300 Da less than the product.
Relevant answer
Answer
Thank you, sir. Right now this facility is not available at our institute; we sent the sample to another institute, and they analyzed it according to our instructions. They operated the instrument in negative mode.
  • asked a question related to Data Integration
Question
8 answers
All of us, with our skills, expertise and know-how, could muster the intellectual power to provide a solution to whatever problem we face in science.
Relevant answer
Answer
Four pictures, just to share with you the basic idea of the evolution of social relations within a virtual environment [but it also works within real organizations].
—g
  • asked a question related to Data Integration
Question
14 answers
Instead of doing a simulation, I need a dataset with hundreds or thousands of records distributed over multiple data sources. I need it for data integration purposes where the same entity may have multiple records in different data sources.
Relevant answer
Answer
Hi Cliff, a problem with making such a simulated dataset is that not much is known about how you should introduce "noise" in a realistic fashion. Most researchers agree that the size of a cluster of duplicates is Zipf-distributed, and some have proposed models for introducing typographical errors, but apart from that, it would be guesswork what a realistic error model would be. For example, abbreviations, multi-valued attributes, subjectiveness (e.g. the musical genre of a CD)... And of course, if there is no realistic error model, simulating a dataset tends to be biased towards working well on the algorithm you want to test.
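In case it helps as a starting point, a rough R sketch of generating such a dataset under a crude, explicitly assumed error model is below; the base table, the duplicate-count distribution, and the 30% typo rate are all arbitrary choices, reflecting exactly the "guessing" problem described above.

# Crude simulation of duplicate records spread over two sources, with simple typos.
set.seed(1)
base <- data.frame(entity_id = 1:100,
                   name = replicate(100, paste(sample(letters, 8, TRUE), collapse = "")),
                   stringsAsFactors = FALSE)

add_typo <- function(s) {                                 # replace one random character
  pos <- sample(nchar(s), 1)
  substr(s, pos, pos) <- sample(letters, 1)
  s
}

n_dups  <- pmax(1, rpois(nrow(base), lambda = 1))         # skewed duplicate counts (not truly Zipf)
dups    <- base[rep(seq_len(nrow(base)), n_dups), ]
corrupt <- runif(nrow(dups)) < 0.3                        # 30% of the copies get a typo
dups$name[corrupt] <- vapply(dups$name[corrupt], add_typo, character(1))
dups$source <- sample(c("A", "B"), nrow(dups), replace = TRUE)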
  • asked a question related to Data Integration
Question
18 answers
I have read a couple of articles which are trying to sell the idea that an organization should basically choose between either implementing Hadoop (which is a powerful tool when it comes to unstructured and complex datasets) or implementing a data warehouse (which is a powerful tool when it comes to structured datasets). But my question is, can't they actually go together, since Big Data is about both structured and unstructured data?
Relevant answer
Answer
It's very hard to answer this question in general without taking into considerations what your specific needs are. Also, "Data Warehouse" is a pretty general term which basically can mean any kind of technology where you put in your data for later analysis. It can be a classical SQL database, Hadoop (yes, Hadoop can be a Data Warehouse, too), or anything else. Hadoop is a general Map Reduce framework you can also use for a lot of different tasks, including Data Warehousing, but also many other things. You also have to bear in mind that Hadoop itself is a piece of infrastructure which will require a significant amount of coding on your part to do anything useful. You might want to look into projects like Pig or Hive which build on Hadoop and provide a higher level query language to actually do something with your data.
Ultimately you have to ask yourself what existing infrastructure is already in place, how much data you have, what kinds of questions you want to answer from your data, and so on, and then use something which fits your needs.
  • asked a question related to Data Integration
Question
1 answer
This is basically integrating web query interfaces.
Relevant answer
Answer
I would go for semi-supervised learning algorithms to help with the data integration. Also, clustering techniques can probably work for the visualization.
  • asked a question related to Data Integration
Question
8 answers
I have implemented an algorithm which integrates two geographical datasets. Each record of both datasets must define its geographical coordinates (latitude and longitude) and a label (e.g. the name of the record). I would like to compare my algorithm with existing algorithms in literature.
Can anyone suggest any algorithm which integrates two geographical datasets? Thanks in advance!
Relevant answer
Answer
I've worked with geographical datasets for many years and have worked on similar problems. It helps a little to think of the problem as a nearest neighbor problem in 2D space. That brings together possible matches, but the crux of the problem is how to measure the similarity between two geographic features, that is, how to compute a distance function for each pair. For text similarity, there is the Levenshtein edit distance. I have usually improved on the basic algorithm. For comparing the geometry of 2 features, I have used dynamic programming algorithms. Levenshtein edit distance is itself computed by a dynamic programming algorithm, and the algorithm I have used for geometries is somewhat analogous. But those only deal with some of the complexity of integrating two geographic datasets. It is not simple.
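As a concrete (and deliberately simplistic) illustration of combining the two signals, an R sketch is below: it scores a candidate pair of records by a normalized Levenshtein distance on the labels (via adist) plus a great-circle distance on the coordinates. The field names, the haversine helper, and the equal weighting are assumptions, not a recommendation.

# Toy pairwise score for two point records: 0 = identical, 1 = very different.
hav_km <- function(lat1, lon1, lat2, lon2) {              # great-circle distance in km
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 + cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  6371 * 2 * asin(sqrt(a))
}

match_score <- function(rec_a, rec_b, max_km = 1) {
  name_dist <- adist(rec_a$label, rec_b$label)[1, 1] /
               max(nchar(rec_a$label), nchar(rec_b$label))
  geo_dist  <- hav_km(rec_a$lat, rec_a$lon, rec_b$lat, rec_b$lon)
  0.5 * name_dist + 0.5 * min(geo_dist / max_km, 1)       # arbitrary equal weights
}

a <- list(label = "Lake Victoria", lat = -1.00, lon = 33.00)
b <- list(label = "Lake Viktoria", lat = -1.02, lon = 33.05)
match_score(a, b)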
  • asked a question related to Data Integration
Question
19 answers
In systems biology, most of the time it is necessary to integrate several layers of information (e.g. genomics, proteomics and transcriptomics). There is software for some levels, for example transcriptome-to-genome integration. What are the mathematical basics? Does anyone have good introductory references?
Relevant answer
Answer
Abstract:
Thanks for the great papers that have been introduced.
1- There is horizontal and vertical integration for each type of OMICs data, e.g. breast cancer:
a) integration of Breast cancer microarrays (horizontal),
b) integration of Breast cancer Protemoics (horizontal),
c) integration of Breast cancer GWAS data (horizontal),
finally:
d) integrating integrated data of proteomics, genomics and transcriptomics (vertical)
Meta-analysis methods for horizontal and vertical combining of data:
1) Vote counting, 2) Combining ranks, 3) Combining p-values, and 4) Combining effect sizes. I need more detailed papers about these analytic methods. Are the methods the same for a, b, c, and d (above)? (A small worked example of combining p-values is sketched at the end of this answer.)
Then I have learned that there are several ways to analyze pathways including:
1) Overrepresentation methods (EASE, GOStat, DAVID, MetaCore)
2) Gene set enrichment methods (GO, GSEA, SAFE)
3) Set-based methods (PINK)
4) Modeling methods
5) Network-based methods (KEGG, GSEA, PANTHER, DAVID, GO, MSIGDB)
6) Text mining methods
Right?
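A small illustration of one of the meta-analysis options listed above (combining p-values with Fisher's method) is sketched below in R; the per-study p-values are made-up numbers for a single gene.

# Fisher's method: combine k independent p-values into one.
fisher_combine <- function(p) {
  stat <- -2 * sum(log(p))                          # ~ chi-squared with 2k df under H0
  pchisq(stat, df = 2 * length(p), lower.tail = FALSE)
}

p_per_study <- c(0.04, 0.20, 0.01)                  # hypothetical per-study p-values for one gene
fisher_combine(p_per_study)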
  • asked a question related to Data Integration
Question
3 answers
I tried searching Pathway Commons, SBML.org, and the Cytoscape App Store, trying PC2Path, web services such as PSICQUICUniversalClient, and several Cytoscape plug-ins... The merging results? Different macro-pathways for the same query, duplicate nodes, and new graphs, often with missing connections between existing nodes of curated databases such as Reactome... I think the problem is often due to different languages... Anyway, any advice?
Relevant answer
Answer
Have you tried semanticSBML (http://semanticsbml.org/semanticSBML/simple/index)? The problem you mentioned, merging systems biology models, is a tough one and a mostly unsolved one, so it will most likely include lots of manual work.
  • asked a question related to Data Integration
Question
11 answers
Also I would like to know if there is any good software for automating the process of metabolic network reconstruction for Mycobacterium smegmatis.
Relevant answer
Answer
It depends on the organism that you are looking at. Generally the databases with the most reliable curation are single-organism databases, for example SGD (http://www.yeastgenome.org/) for Saccharomyces, FlyBase (http://flybase.org/) for Drosophila, WormBase (http://www.wormbase.org/) for C. elegans, TAIR (http://www.arabidopsis.org/) for Arabidopsis, and the Genome Reference Consortium at NCBI (http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/) for human, mouse and zebrafish. Beyond these databases with intensive manual data curation, another good resource for genomic databases is Ensembl Genomes (http://www.ensemblgenomes.org/), but the quality of each genome depends on the consortium that produced the data. Also interesting are databases such as SGN (plants, Solanaceae), where you can find a community curation model based on assigning a Locus Editor role to relevant publication authors (http://solgenomics.net/).
About the software to build metabolic networks, I like Pathway Tools (http://bioinformatics.ai.sri.com/ptools/) associated with MetaCyc database, but there are other options such Blast2GO for annotation (http://www.blast2go.com/b2ghome) and KEGG pathways (http://www.genome.jp/kegg/pathway.html) or Mercator (http://mapman.gabipd.org/web/guest/mercator) for annotation and MapMan (http://mapman.gabipd.org/web/guest/mapman).
  • asked a question related to Data Integration
Question
10 answers
Many top publications present results from "advanced" pathway analysis tools such as IPA or MetaCore, as shown on vendors' websites.
While this is a convenient, and pretty, way to summarise a mechanism, I wonder whether it has ever proved essential in the discovery process.
On the one hand biologists usually know their pathways of interest which are mainly public knowledge. On the other, no fancy tool is required to perform a test for enrichment.
I am therefore curious to find publications where a novel, validated finding was achieved with such a tool that could not have been done with a simpler, cheaper approach.
Relevant answer
Answer
Yes, it was. I attest to that statement with [quite] a few publications for evidence.
"Biologists know their pathways" is a dangerous proposition. First, it's not about them, biologists, and us, presumably something else: one can't handle an advanced tool effectively without being a biologist. Second, some researchers tend to focus on a single gene, single hypothesis, single pathway, which may or may not turn out to be essential. This focal approach is a common recipe for successfully completing the study, making a publication, and missing the discovery. It can be balanced by the broader view provided by those expensive advanced tools with all their conveniences, bells and whistles. You don't need them unless you are going to spend hours clicking around trying to figure out the way things happen in the experiment, the meaning of expression levels changing in all directions, and patterns in chaotic shadows. Simple things like grabbing Cytoscape, routine GO tables and enrichment tests are perfect if you do your job 9 to 5, perform your duties as a highly qualified specialist, and let the clients, biologists, and doctors worry about discoveries. This is how core facilities usually operate. I concur with your point that all those "advanced" tools are not as advanced under the hood as one may think from reading their sales booklets. They also don't solve the problems or make discoveries by themselves. However, these fancy tools give you a better fighting chance if you know how to use them.
  • asked a question related to Data Integration
Question
2 answers
Does anyone know the exact formula(s) for convex objective functions whose solution is the SVD decomposition of a matrix X?
Relevant answer
Answer
I suppose you refer to the optimisation problem of finding a linear combination of p vectors of length n that has maximum variance (or sum of squares, when linking it directly with the SVD), and then adding an uncorrelatedness constraint for the next linear combination (having maximum variance but being uncorrelated with the first one). This usually leads to the eigen-decomposition of Xt(X) (with your notation).
You may also have a look at the bilinear optimisation, equation (5) in the following paper, expressed using a tensor formalism.
Leibovici, D.G. (2010) "Spatio-temporal Multiway Decomposition using Principal Tensor Analysis on k-modes: the R package PTAk." Journal of Statistical Software, 34(10), 1-34. I think the paper is uploaded on my ResearchGate.
didier
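For reference, one standard way to write the optimisation problem described in the answer above, whose solutions are the singular vectors of X, is (this is the textbook formulation, not necessarily the exact one the questioner had in mind):

\[
\max_{u \in \mathbb{R}^n,\; v \in \mathbb{R}^p} \; u^{\top} X v
\quad \text{subject to} \quad \|u\|_2 = \|v\|_2 = 1,
\]

or, equivalently for the right singular vectors, \(\max_{v} v^{\top} X^{\top} X v\) subject to \(\|v\|_2 = 1\). The maximiser is the leading singular pair \((u_1, v_1)\) with value \(\sigma_1\), and subsequent pairs maximise the same objective subject to orthogonality with the pairs already found. Strictly speaking this is a bilinear problem over a sphere rather than a convex one.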