Cancer Informatics

Published by Libertas Academica
Evaluation of the 2D and 3D structures of the m11βHSD2 model. (A) Homology-aligned sequences of h11βHSD1 (PDB code: 3HFG; black) and m11βHSD2 (magenta). Red line: α-helix. Blue line: turn. Yellow line: β-sheet. The alignments reveal that the cofactor-related GXXXGXG and catalytic YXXXK domains are conserved in m11βHSD2. The catalytic domain is often in the vicinity of a conserved S, and this is also the case for m11βHSD2. (B) The secondary structure of the m11βHSD2 model exhibits a central 6-stranded all-parallel β-sheet sandwich-like structure, flanked on both sides by 3-helices. (C) The constructed m11βHSD2 model. The model exhibits a 3D structure similar to the structures of h11βHSD1 and 2. 18,33 
Ramachandran plots for the m11βHSD2 model. For m11βHSD2, 87.1% of the residues are in the favored region, 11.8% are in the allowed region and only 1.1% are in the disfavored region. The residues in the disfavored regions are located far away from the residues in the LBSs. Green: favored region. Light brown: allowed region. 
(A) Contact energy profiles of m11βHSD2. The positions of the amino acid residues are shown on the x-axis, while the contact energies are shown on the y-axis. The trends in the variation of the contact energy in most parts of the m11βHSD2 model are in good agreement with those of the structures of h11βHSD1 and 2, 18 indicating that the h11βHSD2 and m11βHSD2 models exhibit similar contact energy profiles. (B) The MEP map for the m11βHSD2 model at the LBS. Arrows: LBS. Deep blue: most positive potential. Deep red: most negative potential. The model exhibits a surface MEP map at the LBS that is almost identical to that of h11βHSD2, 18 sharing common features such as a positively charged surface. 
Ligand-receptor interaction between corticosterone and m11βHSD2. (A) Corticosterone successfully binds to the LBS in m11βHSD2. The similarity between the present docked corticosterone-m11βHSD2 pose and the h11βHSD2 model 16 suggests that the present methods are capable of generating a corticosterone-m11βHSD2 model similar to the reported h11βHSD2 complex. (B) The ligand-receptor interaction plots for the corticosterone-m11βHSD2 model. Ten residues are identified as important residues in the m11βHSD2 model. The bound conformation of corticosterone present in the LBS suggests that it can form a strong hydrogen bond with Tyr226. (C) Ligand-residue interaction energies for the corticosterone-m11βHSD2 model. A residue with a negative value attracts the ligand, while a residue with a positive value repels the ligand. Among the 10 identified residues (Fig. 4B), Ser219, Ala221, Cys228, Phe265 and Trp276 appear to attract the ligand, while Tyr226, Leu229, Tyr232, Lys266 and Leu282 could repel the ligand. The identified 10 residues, including the catalytic triad, would contribute to the stable binding of corticosterone to m11βHSD2. 
Ligand-receptor interaction between GA and m11βHSD2. (A) Corticosterone (Fig. 4A) and GA have a similar binding orientation to the LBS in m11βHSD2. (B) The ligand-receptor interaction plots for the GA-m11βHSD2 model. Eleven residues are identified as important residues in the m11βHSD2 model. The bound conformation of GA present in the LBS suggests that GA can form strong hydrogen bonds with Trp276. (C) Ligand-residue interaction energies for the GA-m11βHSD2 model. Residues with an interaction energy of less than -4 kcal/mol or more than 6 kcal/mol are labeled with their interaction energy values, since these values fall outside the plotted range. Among the 11 identified residues (Fig. 5B), Cys228, Tyr232, Phe265, Asn272, Trp276 and Gln281 would attract the ligand, while Tyr226, Leu229, Lys266, Leu282 and Leu283 could repel the ligand. Phe265 and Trp276 are the ligand-attracting residues with rather higher negative energy values in the corticosterone-m11βHSD2 (Fig. 4C) and GA-m11βHSD2 models. 
Mouse (m) 11β-hydroxysteroid dehydrogenase type 2 (11βHSD2) was homology-modeled, and its structure and ligand-receptor interaction were analyzed. The modeled m11βHSD2 showed significant 3D similarities to the human (h) 11βHSD1 and 2 structures. The contact energy profiles of the m11βHSD2 model were in good agreement with those of the h11βHSD1 and 2 structures. The secondary structure of the m11βHSD2 model exhibited a central 6-stranded all-parallel β-sheet sandwich-like structure, flanked on both sides by 3-helices. Ramachandran plots revealed that only 1.1% of the amino acid residues were in the disfavored region for m11βHSD2. Further, the molecular surface and electrostatic analyses of the m11βHSD2 model at the ligand-binding site showed that the model was almost identical to the h11βHSD2 model. Furthermore, docking simulation and ligand-receptor interaction analyses revealed the similarity of the ligand-receptor bound conformation between the m11βHSD2 and h11βHSD2 models. These results indicate that the m11βHSD2 model was successfully evaluated and analyzed. To the best of our knowledge, this is the first report of an m11βHSD2 model with detailed analyses, and our data support the use of the mouse model as a basis for targeting human 11βHSD2 in the development of anticancer drugs.
Preprocessing: We discretized the expression and methylation data to determine whether a gene is under-, over-, or moderately (non-extremely) expressed/methylated in each sample. To turn a gene (or miRNA or methylation site) into a discrete vector over all tumor samples, we evaluated the mean value and standard deviation of its expression over all samples. Then, if the gene had a value greater than the mean plus one standard deviation in a sample, we represented it as 1 (over-expressed or hypermethylated). If the gene had a value lower than the mean minus one standard deviation in a sample, we represented it as -1 (under-expressed or hypomethylated). Otherwise, the gene was represented as 0 in the sample. We used the discrete vector representation of each miRNA's expression and methylation over all samples (either GBM or ovarian) from the preprocessing step. Step 1: We evaluated the Pearson correlation between all pairs of miRNAs and methylation sites in glioblastoma and ovarian cancer. Then, we ranked all pairwise correlations in descending order, as shown. Step 2: We kept the top-ranked miRNAs for ovarian and GBM. A condition was that the Bonferroni-corrected p-value, derived from a two-tailed t-test that evaluated the Pearson correlation, should be less than 0.01. Step 3: We found the miRNAs appearing in the top ranks in both ovarian and GBM and selected the best miRNA as representative. Step 4: Using the best miRNA as representative, we found the top correlated methylation sites and genes in GBM and ovarian. We refer to the resulting sets as M_gbm, M_ovarian (methylation sites) and R_gbm, R_ovarian (gene expression).  
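The preprocessing and Step 1 described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `discretize` and `pearson`, the toy values, and the use of the sample (n-1) standard deviation are all assumptions, since the caption does not specify them.

```python
from statistics import mean, stdev


def discretize(values):
    """Map each sample to 1 (> mean + SD), -1 (< mean - SD), or 0 otherwise."""
    mu = mean(values)
    sd = stdev(values)  # assumption: sample SD; the caption does not say which
    out = []
    for v in values:
        if v > mu + sd:
            out.append(1)
        elif v < mu - sd:
            out.append(-1)
        else:
            out.append(0)
    return out


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


# Hypothetical toy values for one miRNA and one methylation site over 8 samples.
mirna = [1.0, 1.2, 0.9, 1.1, 5.0, 1.0, -3.0, 1.1]
site = [0.5, 0.4, 0.5, 0.6, 0.9, 0.5, 0.1, 0.5]

d_mirna = discretize(mirna)
d_site = discretize(site)
r = pearson(d_mirna, d_site)
print(d_mirna, d_site, r)
```

In a full analysis, `pearson` would be evaluated for every miRNA and methylation-site pair and the results sorted in descending order. The Bonferroni-corrected significance filter of Step 2 is not shown here.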
Left: Histogram of the miRNA-gene expression correlations for cancer types that had miRNA and gene expression data (RNASeq and miRNASeq) available. We matched the miRNA-gene expression data on the same samples. Since we did not find a significant negative miRNA-gene correlation, the left graph shows just positive values. Right: the miRNA-methylation correlations for GBM and ovarian cancer (the other TCGA cancer types lacked integrated miRNA-methylation data). We matched the miRNA-methylation data on the same samples. We plotted the miRNAs that appear in all cancer types, which resulted in 680 miRNAs and 421 miRNAs, respectively. For each miRNA we included the correlation values for all 17,814 genes or 27,578 methylation sites. We averaged the correlations over all cancer types to determine if a correlation remains consistently high in all cancers. As shown, the miRNA hsa-mir-142 is highly correlated with a larger set of genes or methylation sites than other miRNAs.  
Results of Step 4 of our analysis method (see Methods section). Left: The methylation sites that are most correlated with hsa-mir-142, either positively or negatively, in glioblastoma (M_gbm). Right: M_ovarian in ovarian cancer. We divide the signature M into the methylation sites having positive correlation with hsa-mir-142 (M+) and those with negative correlation (M-). The overlap of methylation sites between M+_gbm (236) and M+_ovarian (471) is 76, while the overlap between M-_gbm (259) and M-_ovarian (126) is 63.  
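The overlap counts in this caption can be turned into pooled overlap fractions with a few lines of arithmetic. The sketch below is illustrative only: the caption does not say which denominator is used when an overlap percentage is reported, so both choices are computed, and the helper name `pooled_overlap_fraction` is hypothetical.

```python
# Set sizes and overlaps as quoted in the caption.
sets = {
    "M+": {"gbm": 236, "ovarian": 471, "overlap": 76},
    "M-": {"gbm": 259, "ovarian": 126, "overlap": 63},
}


def pooled_overlap_fraction(sets, reference):
    """Pooled overlap divided by the pooled size of the reference cancer's sets."""
    overlap = sum(s["overlap"] for s in sets.values())
    size = sum(s[reference] for s in sets.values())
    return overlap / size


frac_gbm = pooled_overlap_fraction(sets, "gbm")          # 139 / 495
frac_ovarian = pooled_overlap_fraction(sets, "ovarian")  # 139 / 597
print(f"relative to GBM: {frac_gbm:.1%}; relative to ovarian: {frac_ovarian:.1%}")
```

Taking the GBM sets as the reference gives a value close to 28%, while taking the ovarian sets gives a smaller fraction.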
The known functions of the top-ranked genes in R.
The top three functional annotation term clusters associated with the list of 671 R genes that overlap between R_ovarian and R_gbm.
Gene expression profiling has provided insights into different cancer types and revealed tissue-specific expression signatures. Alterations in microRNA expression contribute to the pathogenesis of many types of human diseases. Few studies have integrated all levels of gene expression, miRNA and methylation to uncover correlations between these data types. We performed an integrated profiling to discover instances of miRNAs associated with a gene expression and DNA methylation signature across multiple cancer types. Using data from The Cancer Genome Atlas (TCGA), we revealed a concordant gene expression and methylation signature associated with the microRNA hsa-miR-142 across the same samples. In all cancer types examined, we found a signature of co-expression of a gene set R and methylated sites M, which correlate positively (M+) or negatively (M-) with the expression of hsa-miR-142. The set R consistently contains many genes, such as TRAF3IP3, NCKAP1L, CD53, LAPTM5, PTPRC, EVI2B, DOCK2, LCP2, CYBB and FYB. The signature is preserved across glioblastoma, ovarian, breast, colon, kidney, lung, uterine and rectum cancer. There is 28% overlap of methylation sites in M between glioblastoma (GBM) and ovarian cancer. There is 60% overlap of genes in R between GBM and ovarian (P = 1.3e(-11)). Most of the genes in R are known to be expressed in lymphocytes and haematopoietic stem cells, while M reflects membrane proteins involved in cell-cell adhesion functions. We speculate that the hsa-miR-142 associated signature may signal haematopoietic-specific processes and an accumulation of methylation events triggering a progressive loss of cell-cell adhesion. We also observed that GBM samples belonging to the proneural subtype tend to have underexpressed hsa-miR-142 and R genes, hypomethylated M+ and hypermethylated M-, while the mesenchymal samples have the opposite profile.
Silencing of several genes on chromosome 20q sensitizes HeLa cells to Kinesin-5 inhibitor. HeLa cells were transfected with siRNA pools (3 siRNAs per gene) to each of ∼3500 individual genes, including 378 genes on chromosome 20q, in the presence (Y-axis) or absence (X-axis) of 30 nM Kinesin-5i. Cell survival was measured 72 hours post-transfection by Alamar blue assay. Each dot indicates survival of cells transfected with siRNAs targeting a single gene. Blue dots indicate genes whose silencing affects viability in response to Kinesin-5i (2 SD from the mean of the population). Red dots indicate Kinesin-5i enhancers composed of siRNA pools targeting genes on chromosome 20q. 
Silencing of AURKA enhances cell killing by Kinesin-5 inhibitor in a resistant colon cancer cell line. siRNAs targeting AURKA or negative control luciferase were transfected into SW480 or HCT116 colon cancer cells at the indicated doses. Cells were then grown in the absence (dotted lines) or presence (solid lines) of Kinesin-5i. Cell survival was measured by Alamar blue assay 72 hours post-transfection. 
We identified gene expression signatures predicting responsiveness to a Kinesin-5 (KIF11) inhibitor (Kinesin-5i) in cultured colon tumor cell lines. Genes predicting resistance to Kinesin-5i were enriched for those from chromosome 20q, a region of frequent amplification in a number of tumor types. siRNAs targeting genes in this chromosomal region identified AURKA, TPX2 and MYBL2 as genes whose disruption enhances response to Kinesin-5i. Taken together, our results show functional interaction between these genes, and suggest that their overexpression is involved in resistance to Kinesin-5i. Furthermore, our results suggest that patients whose tumors overexpress AURKA due to amplification of 20q will more likely resist treatment with Kinesin-5 inhibitor, and that inactivation of AURKA may sensitize these patients to treatment.
Overlap among the characteristic genes of three groups. Characteristic genes were selected by Welch's t-value of 3.5. 
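For readers unfamiliar with the statistic named in this caption, the following is a minimal sketch of Welch's t-value for one gene, computed between two groups with unequal variances. The function name, the toy expression values, and the use of the absolute value against the 3.5 cutoff are assumptions; the caption does not state how the threshold was applied.

```python
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two independent samples with unequal variances."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / se


# Hypothetical log-expression values of one gene in two groups of samples.
treated = [5.1, 5.3, 4.9, 5.2]
control = [3.0, 3.2, 2.9, 3.1, 3.3]

t = welch_t(treated, control)
is_characteristic = abs(t) >= 3.5  # assumed use of the cutoff quoted in the caption
print(round(t, 2), is_characteristic)
```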
(A) and (B) show time-course gene expression profiles of the top 10 genes in Group 1 (Table 3) in male rat livers treated with thioacetamide (C35) and lithocholic acid (NC19). With increasing treatment duration, nine of the characteristic genes were upregulated and one was downregulated, and their expression changes reached a plateau 3 days after administration of thioacetamide (Fig. 3A). On the other hand, such gene expression changes were not observed in the specimens treated with lithocholic acid. 
A) Connectivity map of the responses in the characteristic genes of carcinogens clustered to group I by Ingenuity Pathway assistant analysis. B) Explanation of the symbols, the edges, and their labels. 
Top 10 characteristic genes of group 2 carcinogens at 28th day. 
Top 10 characteristic genes of group 1 carcinogens at 28th day. 
This study aimed at discriminating carcinogens on the basis of hepatic transcript profiling in rats administered a variety of carcinogens and non-carcinogens. We conducted 28-day toxicity tests in male F344 rats with 47 carcinogens and 26 non-carcinogens, and then periodically investigated the hepatic gene expression profiles using custom microarrays. By hierarchical cluster analysis based on significantly altered genes, carcinogens were clustered into three major groups (Group 1 to 3). The formation of these groups was not affected by the gene sets used or by the administration period, indicating that the grouping of carcinogens was universal and independent of the conditions of both statistical analysis and toxicity testing. The seventeen carcinogens belonging to Group 1 were mainly rat hepatocarcinogens, most of them mutagenic. Group 2 was formed by three subgroups, which were composed of 23 carcinogens exhibiting distinct properties in terms of genotoxicity and target tissues, namely nonmutagenic hepatocarcinogens, and mutagenic and nonmutagenic carcinogens that both target other tissues. Group 3 contained 6 carcinogens including 4 estrogenic substances, implying that it is a group of estrogenic carcinogens. Gene network analyses revealed that the significantly altered genes in Group 1 included Bax, Tnfrsf6, Btg2, Mgmt and Abcb1b, suggesting that the p53-mediated signaling pathway is involved in early pathologic alterations associated with preceding mutagenic carcinogenesis. Thus, the common transcriptional signatures for each group might reflect the early molecular events of carcinogenesis and hence would enable us to identify the biomarker genes, and then to develop a new assay for carcinogenesis prediction.
Number of chemicals in each subgroup. 
Gene symbol, gene name, RefSeq ID and expression direction of the predictive genes. 
We have previously shown the hepatic gene expression profiles of carcinogens in 28-day toxicity tests were clustered into three major groups (Group-1 to 3). Here, we developed a new prediction method for Group-1 carcinogens which consist mainly of genotoxic rat hepatocarcinogens. The prediction formula was generated by a support vector machine using 5 selected genes as the predictive genes and predictive score was introduced to judge carcinogenicity. It correctly predicted the carcinogenicity of all 17 Group-1 chemicals and 22 of 24 non-carcinogens regardless of genotoxicity. In the dose-response study, the prediction score was altered from negative to positive as the dose increased, indicating that the characteristic gene expression profile emerged over a range of carcinogen-specific doses. We conclude that the prediction formula can quantitatively predict the carcinogenicity of Group-1 carcinogens. The same method may be applied to other groups of carcinogens to build a total system for prediction of carcinogenicity.
The present paper aims at demonstrating clinically oriented applications of the multiscale four dimensional in vivo tumor growth simulation model previously developed by our research group. To this end the effect of weekend radiotherapy treatment gaps and p53 gene status on two virtual glioblastoma tumors differing only in p53 gene status is investigated in silico. Tumor response predictions concerning two rather extreme dose fractionation schedules (daily dose of 4.5 Gy administered in 3 equal fractions) namely HART (Hyperfractionated Accelerated Radiotherapy weekend less) 54 Gy and CHART (Continuous HART) 54 Gy are presented and compared. The model predictions suggest that, for the same p53 status, HART 54 Gy and CHART 54 Gy have almost the same long term effects on locoregional tumor control. However, no data have been located in the literature concerning a comparison of HART and CHART radiotherapy schedules for glioblastoma. As non small cell lung carcinoma (NSCLC) may also be a fast growing and radiosensitive tumor, a comparison of the model predictions with the outcome of clinical studies concerning the response of NSCLC to HART 54 Gy and CHART 54 Gy is made. The model predictions are in accordance with corresponding clinical observations, thus strengthening the potential of the model.
Datasets and patients included in the study.
Kaplan–Meier plots of the 5q14 and TNFR2 groups in basal-like tumors. The training set consisted of 280 basal-like tumors (A and E) and 959 nonbasal-like tumors (B and F). The test set contained 41 basal-like tumors (C and G) and 159 nonbasal-like tumors (D and H). The score is calculated for each sample as the mean standardized expression values for all genes in the gene set. The low score group was defined as the third of samples with the lowest score for the indicated gene set, and the high score group was defined as the third of the samples with the highest score. The middle third was excluded from the analysis. The P-values were calculated using the log-rank test. Abbreviation: TNFR, tumor necrosis factor receptor. 
Kaplan–Meier plots of 5q33 and IL-12 groups in ERBB2 tumors. The training set consisted of 196 tumors with high ERBB2 expression (A and E) and 1,043 tumors with low ERBB2 expression (B and F). The test set contained 70 tumors with the highest ERBB2 expression (C and G) and 277 tumors with the lowest ERBB2 expression (D and H). The score is calculated for each sample as the mean standardized expression values for all genes in the gene set. The low score group was defined as the third of samples with the lowest score for the indicated gene set, and the high score group was defined as the third of the samples with the highest score. The middle third was excluded from the analysis. The P-values were calculated using the log-rank test. Abbreviation: IL, interleukin. 
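The scoring and grouping rule described in the two captions above can be sketched as follows. This is an illustration under stated assumptions: the function names, the toy expression matrix, the use of the population standard deviation for standardization, and the handling of sample counts not divisible by three are not taken from the source. The log-rank test itself is not shown.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score one gene across samples (assumption: population SD)."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]


def gene_set_scores(expr):
    """expr maps gene -> expression values per sample; returns one score per sample."""
    z = [standardize(vals) for vals in expr.values()]
    n_samples = len(z[0])
    return [mean([gene[i] for gene in z]) for i in range(n_samples)]


def tertile_groups(scores):
    """Return sample indices of the lowest and highest third; the middle is dropped."""
    k = len(scores) // 3
    if k == 0:
        raise ValueError("need at least three samples")
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return order[:k], order[-k:]


# Hypothetical expression values for a two-gene set over six samples.
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "geneB": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],
}
scores = gene_set_scores(expr)
low, high = tertile_groups(scores)
print(low, high)
```

The two index lists would then define the low-score and high-score groups whose survival curves are compared.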
Figure S1. Distribution of ERBB2 and GRB7 expression in test set. Notes: For validation of metastatic mechanisms in ERBB2 positive tumors, a test set including TRANSBIC-S and Mainz was used resulting in a total of 347 samples. Instead of using single sample prediction, we classified tumor ERBB2 status as 20% of tumors with highest ERBB2 expression. 
Figure S2. 5q14, 321 basale.
Breast tumors have been described by molecular subtypes characterized by pervasively different gene expression profiles. The subtypes are associated with different clinical parameters and origin of precursor cells. However, the biological pathways and chromosomal aberrations that differ between the subgroups are less well characterized. The molecular subtypes are associated with different risk of metastatic recurrence of the disease. Nevertheless, the performance of these overall patterns to predict outcome is far from optimal, suggesting that biological mechanisms that extend beyond the subgroups impact metastasis. We have scrutinized publicly available gene expression datasets and identified molecular subtypes in 1,394 breast tumors with outcome data. By analysis of chromosomal regions and pathways using "Gene set enrichment analysis" followed by a meta-analysis, we identified comprehensive mechanistic differences between the subgroups. Furthermore, the same approach was used to investigate mechanisms related to metastasis within the subgroups. A striking finding is that the molecular subtypes account for the majority of biological mechanisms associated with metastasis. However, some mechanisms, aside from the subtypes, were identified in a training set of 1,239 tumors and confirmed by survival analysis in two independent validation datasets from the same type of platform and consisting of very comparable node-negative patients that did not receive adjuvant medical therapy. The results show that high expression of 5q14 genes and low levels of TNFR2 pathway genes were associated with poor survival in basal-like cancers. Furthermore, low expression of 5q33 genes and interleukin-12 pathway genes were associated with poor outcome exclusively in ERBB2-like tumors. The identified regions, genes, and pathways may be potential drug targets in future individualized treatment strategies.
DLBCL splits into sub-groups independent of signatures. Optimal bipartitions of patients are calculated by ISIS based on optimal bipartition subsets of genes (50). Every column of the x-axis represents a patient. On the bottom, the DLBCL type of the patient is labelled. On the y-axis every row shows the bipartitions ranked in increasing score of separation quality. The three best bipartitions show a very consistent and clear signal separating the ABC from the GCB patients. The unsupervised method ISIS reveals the ABC-GCB classification independent of proliferation signatures. No evidence for a previously suggested third group "Type 3" was found. Only a few patients are falsely assigned if compared to the DLBCL gene signature assignment. 
Aiming to find key genes and events, we analyze a large data set on diffuse large B-cell lymphoma (DLBCL) gene-expression (248 patients, 12196 spots). Applying the loess normalization method on these raw data yields improved survival predictions, in particular for the clinically important group of patients with medium survival time. Furthermore, we identify a simplified prognosis predictor, which stratifies different risk groups as well as complex signatures do. We identify specific genes that distinguish activated B cell-like (ABC) from germinal center B cell-like (GCB) DLBCL. These include early (e.g. CDKN3) and late (e.g. CDKN2C) cell cycle genes. Independently from previous classification by marker genes we confirm a clear binary class distinction between the ABC and GCB subgroups. An earlier suggested third entity is not supported. A key regulatory network, distinguishing marked over-expression in ABC from that in GCB, is built by: ASB13, BCL2, BCL6, BCL7A, CCND2, COL3A1, CTGF, FN1, FOXP1, IGHM, IRF4, LMO2, LRMP, MAPK10, MME, MYBL1, NEIL1 and SH3BP5. It predicts and supports the aggressive behaviour of the ABC subgroup. These results help to understand target interactions and to improve subgroup diagnosis, risk prognosis and therapy in the ABC and GCB DLBCL subgroups.
Heatmap showing fold change patterns of most altered genes.  
Top up-regulated genes (common to all datasets). 
Top down-regulated genes (common to all datasets). 
Top molecular and cellular functions that are associated with commonly dysregulated genes. 
Mechanism of hedgehog signaling.  
Lung cancer is the second most commonly occurring non-cutaneous cancer in the United States with the highest mortality rate among both men and women. In this study, we utilized three lung cancer microarray datasets generated by previous researchers to identify differentially expressed genes and altered signaling pathways, and to assess the involvement of the Hedgehog (Hh) pathway. The three datasets contain the expression levels of tens of thousands of genes in normal lung tissues and squamous cell lung carcinoma. The datasets were combined and analyzed. The dysregulated genes and altered signaling pathways were identified using statistical methods. We then performed Fisher’s exact test on the significance of the association of Hh pathway downstream genes and squamous cell lung carcinoma. 395 genes were found commonly differentially expressed in squamous cell lung carcinoma. The genes encoding fibrous structural protein keratins and cell cycle dependent genes encoding cyclin-dependent kinases were significantly up-regulated while the ones encoding LIM domains were down-regulated. Over 100 signaling pathways were implicated in squamous cell lung carcinoma, including cell cycle regulation pathway, p53 tumor-suppressor pathway, IL-8 signaling, Wnt-β-catenin pathway, mTOR signaling and EGF signaling. In addition, 37 out of 223 downstream molecules of Hh pathway were altered. The P-value from the Fisher’s exact test indicates that Hh signaling is implicated in squamous cell lung carcinoma. Numerous genes were altered and multiple pathways were dysfunctional in squamous cell lung carcinoma. Many of the altered genes have been implicated in different types of carcinoma while some are organ-specific. Hh signaling is implicated in squamous cell lung cancer, opening the door for exploring new cancer therapeutic treatment using the GLI antagonist GANT 61.
Calculation of the "volume" statistic for chromosomal arm 2p amplifications in GSE7230 (Neuroblastoma). A) The height matrix H (raw data) of 2p, where each element (m, s) on 2p is the log2 ratio of aCGH marker m in sample s. Each row corresponds to a marker, and each column corresponds to a sample. For presentation only, values are truncated to [−1, 1]. B) The amplifications matrix A, where each element (m, s) on chromosome 2p that is amplified in sample s is marked by 1, otherwise 0. C) The length matrix L of 2p, where each element (m, s) on chromosome 2p for which A_ms = 1 is replaced by the length of the sequence of 1s to which it belongs on sample s. Maximal represented length is K = 5. Non-amplified markers are white. D) X, the matrix created by multiplying elements of H, A and L. Non-amplified markers are white. E) Averaging the rows of X gives the volume statistic. The red line is the value of the volume statistic above which it is significantly amplified (corresponding to FDR of 0.05). F) The markers of the only region on chromosome 2p that passes this threshold, the MYCN region, marked in A-E by red asterisks. For presentation only, values are truncated to [−1, 1].
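Panels A to E of this caption can be followed in a small sketch that builds A and L from H and averages the rows of X. Several details are assumptions made for illustration: the fixed log2-ratio threshold used to call a marker amplified, the treatment of K = 5 as a cap on run length, the function names, and the toy matrix. The caption does not state how amplification is called.

```python
def amplification_mask(H, threshold):
    """A[m][s] = 1 if marker m is called amplified in sample s (assumed rule: H > threshold)."""
    return [[1 if h > threshold else 0 for h in row] for row in H]


def run_lengths(A, K=5):
    """L[m][s] = length of the run of consecutive 1s (along markers) containing m, capped at K."""
    n_markers, n_samples = len(A), len(A[0])
    L = [[0] * n_samples for _ in range(n_markers)]
    for s in range(n_samples):
        m = 0
        while m < n_markers:
            if A[m][s] == 1:
                end = m
                while end < n_markers and A[end][s] == 1:
                    end += 1
                for i in range(m, end):
                    L[i][s] = min(end - m, K)
                m = end
            else:
                m += 1
    return L


def volume(H, A, L):
    """Row means of X, where X is the element-wise product of H, A and L."""
    return [
        sum(h * a * l for h, a, l in zip(h_row, a_row, l_row)) / len(h_row)
        for h_row, a_row, l_row in zip(H, A, L)
    ]


# Hypothetical log2 ratios: 4 markers (rows) by 3 samples (columns).
H = [
    [0.1, 0.8, 0.0],
    [0.9, 0.7, 0.1],
    [0.8, 0.0, 0.2],
    [0.0, 0.1, 0.9],
]
A = amplification_mask(H, threshold=0.5)
L = run_lengths(A)
V = volume(H, A, L)
print(V)
```

In this toy input, the second marker scores highest because it is amplified in two samples and each amplification belongs to a run of two markers.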
Aberrations common to both Neuroblastoma datasets.
A) Chromosomal status of dataset GSE8634. Each row corresponds to a chromosomal arm. Due to space limitation, only every second arm is labelled. Since some chromosomes are telocentric (with short p arm), there is a change from p to q. Values are color coded according to the mean log2 ratio of the markers on each chromosomal arm. B) Discussed aberrations in Medulloblastoma dataset GSE8634. Each column corresponds to a sample. Samples are manually ordered according to known and new clinicogenetic subgroups, as the bar below shows. Each row corresponds to an aberration discussed in the text, and the label indicates the gene associated with it. Values are color coded according to the mean log2 ratio of the markers on each aberration. In all subfigures, for presentation only, values are truncated to the range [−1, 1], rising from blue to red.
Chromosomal status of datasets GSE5784 (A) and GSE7230 (B), and the aberrations common to both of Neuroblastoma datasets, shown for the patients of GSE5784 (C) and GSE7230 (D). Each column corresponds to a sample. Samples are manually ordered according to known and new clinicogenetic subgroups, as the bar below shows. In A and B, each row corresponds to a chromosomal arm. Due to space limitation, only every second arm is labelled. Since some chromosomes are telocentric (with short p arm), there is a change from p to q. Values are color coded according to the mean log2 ratio of the markers on each chromosomal arm. In C and D, each row corresponds to a common aberration, and the label indicates the chromosome on which the aberration resides. Values are color coded according to the mean log2 ratio of the markers on each aberration. In all subfigures, for presentation only, values are truncated to the range [−1, 1], rising from blue to red.
Many types of tumors exhibit characteristic chromosomal losses or gains, as well as local amplifications and deletions. Within any given tumor type, sample-specific amplifications and deletions are also observed. Typically, a region that is aberrant in more tumors, or whose copy number change is stronger, would be considered a more promising candidate to be biologically relevant to cancer. We sought an intuitive method to define such aberrations and prioritize them. We define V, the "volume" associated with an aberration, as the product of three factors: (a) fraction of patients with the aberration, (b) the aberration's length and (c) its amplitude. Our algorithm compares the values of V derived from the real data to a null distribution obtained by permutations, and yields the statistical significance (p-value) of the measured value of V. We detected genetic locations that were significantly aberrant, and combined them with chromosomal arm status (gain/loss) to create a succinct fingerprint of the tumor genome. This genomic fingerprint is used to visualize the tumors, highlighting events that are co-occurring or mutually exclusive. We apply the method to three different public array CGH datasets of Medulloblastoma and Neuroblastoma, and demonstrate its ability to detect chromosomal regions that were known to be altered in the tested cancer types, as well as to suggest new genomic locations to be tested. We identified a potential new subtype of Medulloblastoma, which is analogous to Neuroblastoma type 1.
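The definition of V and its comparison against a permutation null can be illustrated with a short sketch. The function names, the example numbers, and the add-one correction in the empirical p-value are assumptions. The abstract does not describe how the permutations are generated or how the p-value is computed from them, so the null values below are simply supplied by hand.

```python
def volume_product(fraction_of_patients, length, amplitude):
    """V as the product of the three factors named in the abstract."""
    return fraction_of_patients * length * amplitude


def empirical_p_value(observed, null_values):
    """Share of null values at least as large as the observed one (add-one corrected)."""
    exceed = sum(1 for v in null_values if v >= observed)
    return (1 + exceed) / (1 + len(null_values))


# Hypothetical aberration: present in 40% of patients, 5 markers long, mean log2 ratio 0.8.
v_obs = volume_product(0.4, 5, 0.8)

# Hypothetical V values obtained from permuted data.
null_v = [0.2, 0.5, 0.1, 0.9, 0.3, 0.4, 0.6, 0.2, 0.7]

p = empirical_p_value(v_obs, null_v)
print(v_obs, p)
```

With more permutations the smallest attainable p-value shrinks, which matters when many regions are tested and a false discovery rate threshold is applied.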
This paper concerns a new method for identifying aberrant signal transduction pathways (STPs) in cancer using case/control gene expression-level datasets, and applying that method and an existing method to an ovarian carcinoma dataset. Both methods identify STPs that are plausibly linked to all cancers based on current knowledge. Thus, the paper is most appropriate for the cancer informatics community. Our hypothesis is that STPs that are altered in tumorous tissue can be identified by applying a new Bayesian network (BN)-based method (causal analysis of STP aberration (CASA)) and an existing method (signaling pathway impact analysis (SPIA)) to the cancer genome atlas (TCGA) gene expression-level datasets. To test this hypothesis, we analyzed 20 cancer-related STPs and 6 randomly chosen STPs using the 591 cases in the TCGA ovarian carcinoma dataset, and the 102 controls in all 5 TCGA cancer datasets. We identified all the genes related to each of the 26 pathways, and developed separate gene expression datasets for each pathway. The results of the two methods were highly correlated. Furthermore, many of the STPs that ranked highest according to both methods are plausibly linked to all cancers based on current knowledge. Finally, CASA ranked the cancer-related STPs over the randomly selected STPs at a significance level below 0.05 (P = 0.047), but SPIA did not (P = 0.083).
Clinico-pathological data for 70 primary CRC cases.
Numbers of genes in each gene set associated with key clinico-pathological factors.
Expression comparison of each gene in gene set III. Expression differences are shown for each gene, compared by each group. Gene set III-T A (T1 and T2 vs. T3 and T4): A) UGT2B28, B) LOC440995, C) CXCL6, D) SULT1B1; gene set III-T B (T1, T2 and T3 vs. T4): E) CXCL3, F) RALBP1, G) TYMS, H) RAB12; gene set III-N (N0 vs. N1 and N2): I) RNMT; gene set III-M (M0 vs. M1): J) ARHGDIB, K) S100A2, L) ABHD2, M) OIT1; gene set III-Re (recurrence vs. non-recurrence): N) ABHD12. Notes: *P < 0.05; **P < 0.01; ***P < 0.001.
Colorectal cancer (CRC) is one of the most frequently occurring cancers in Japan, and thus a wide range of methods have been deployed to study the molecular mechanisms of CRC. In this study, we performed a comprehensive analysis of CRC, incorporating copy number aberration (CNA) and gene expression data. For the last four years, we have been collecting data from CRC cases and organizing the information as an "omics" study by integrating many kinds of analysis into a single comprehensive investigation. In our previous studies, we had experienced difficulty in finding genes related to CRC, as we observed higher noise levels in the expression data than in the data for other cancers. Because chromosomal aberrations are often observed in CRC, here we have combined CNA analysis and expression analysis in order to identify new genes responsible for CRC. This study was performed as part of the Clinical Omics Database Project at Tokyo Medical and Dental University. The purpose of this study was to investigate the mechanism of genetic instability in CRC by this combination of expression analysis and CNA analysis, and to establish a new method for the diagnosis and treatment of CRC. Comprehensive gene expression analysis was performed on 79 CRC cases using an Affymetrix Gene Chip, and comprehensive CNA analysis was performed using an Affymetrix DNA Sty array. To avoid the contamination of cancer tissue with normal cells, laser micro-dissection was performed before DNA/RNA extraction. Data analysis was performed using original software written in the R language. We observed a high percentage of CNA in colorectal cancer, including copy number gains at 7, 8q, 13 and 20q, and copy number losses at 8p, 17p and 18. Gene expression analysis provided many candidates for CRC-related genes, but their association with CRC did not reach the level of statistical significance.
The combination of CNA and gene expression analysis, together with the clinical information, suggested UGT2B28, LOC440995, CXCL6, SULT1B1, RALBP1, TYMS, RAB12, RNMT, ARHGDIB, S100A2, ABHD2, OIT3 and ABHD12 as genes that are possibly associated with CRC. Some of these genes have already been reported as being related to CRC. TYMS has been reported as being associated with resistance to the anti-cancer drug 5-fluorouracil, and we observed a copy number increase for this gene. RALBP1, ARHGDIB and S100A2 have been reported as oncogenes, and we observed copy number increases in each. ARHGDIB has been reported as a metastasis-related gene, and our data also showed copy number increases of this gene in cases with metastasis. The combination of CNA analysis and gene expression analysis was a more effective method for finding genes associated with the clinicopathological classification of CRC than either analysis alone. Using this combination of methods, we were able to detect genes that have already been associated with CRC. We also identified additional candidate genes that may be new markers or targets for this form of cancer.
The copy number aberrations (CNAs) in association with metastasis. 
Numbers of statistically significant probes for the liver cancer study. The three types of DNA copy number measurements (smoothed log2-ratio, gain/loss call, and raw log2-ratio) are compared in terms of identification of statistically significant probes.
Array-based comparative genomic hybridization (aCGH) allows DNA copy number to be measured at the whole-genome scale. In cancer studies, one may be interested in identifying DNA copy number aberrations (CNAs) associated with certain clinicopathological characteristics such as cancer metastasis. We proposed to define test regions based on copy number pattern profiles across multiple samples, using either smoothed log(2)-ratios or discrete copy number gain/loss calls. Association tests performed on the refined test regions instead of on individual probes have improved power because fewer tests are performed. We also compared three types of copy number measurement (normalized log(2)-ratio, smoothed log(2)-ratio, and copy number gain or loss calls) in statistical hypothesis testing. The relative strengths and weaknesses of the proposed method were demonstrated using both simulation studies and real data analysis of a liver cancer study.
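One possible reading of "test regions based on copy number pattern profiles" for the discrete-call case is sketched below in Python: adjacent probes whose gain/loss calls are identical across all samples are merged into a single region, so that one association test is run per region rather than per probe. The abstract does not give the exact rule, so this merging criterion and the function name `define_test_regions` are assumptions; chromosome boundaries are ignored for brevity.

```python
def define_test_regions(calls):
    """Merge adjacent probes that share the same call pattern across samples.

    calls: list ordered by genomic position; each element is a sequence of
           per-sample calls for one probe (-1 = loss, 0 = normal, 1 = gain).
    Returns a list of (start, end_exclusive, pattern) tuples.
    """
    regions = []
    start = 0
    for i in range(1, len(calls) + 1):
        # Close the current region at the end of the data or when the
        # across-sample pattern changes.
        if i == len(calls) or tuple(calls[i]) != tuple(calls[start]):
            regions.append((start, i, tuple(calls[start])))
            start = i
    return regions
```

With six probes collapsing into three regions, for example, a Bonferroni-style correction would divide the significance level by three rather than six, which is the source of the power gain described above.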
Copy number profiles of a lung cancer sequencing sample and matched patient normal signal. 12 Panel (A) shows all aberrations in the tumor sample. X-axis represents the bins ordered according to their chromosomal location. Y-axis represents the log2 ratio (right side). The red line indicates the segmented values as obtained using circular binary segmentation in CGHcall. 11 Panel (B) shows chromosomes 3 (left) and 10 (right) both for patient normal and tumor sample. The gray arrow in the left panels indicates a focal CNV present in both tumor and matched patient normal sample. Somatic focal CNAs on chromosome 10 are only present in the tumor and not in the matched patient normal sample. Focal CNAs and CNVs were detected using focalCall().
Frequency plots of the GBM dataset of all aberrations (top) and focal aberrations and CNVs (bottom) as generated by focalCall functions freqPlot() and FreqPlotFocal(). Red indicates a gain and blue indicates a loss. In the frequency plot of focal aberrations (bottom), the somatic focal aberrations are indicated in red for gains and blue for losses. CNVs are indicated in gray, both for gains and losses.
In order to identify somatic focal copy number aberrations (CNAs) in cancer specimens and to distinguish them from germ-line copy number variations (CNVs), we developed the software package FocalCall. FocalCall enables user-defined size cutoffs to recognize focal aberrations and builds on established array comparative genomic hybridization segmentation and calling algorithms. To distinguish CNAs from CNVs, the algorithm uses matched patient normal signals as references or, if this is not available, a list with known CNVs in a population. Furthermore, FocalCall differentiates between homozygous and heterozygous deletions as well as between gains and amplifications and is applicable to high-resolution array and sequencing data. AVAILABILITY AND IMPLEMENTATION FocalCall is available as an R-package from: . The R-package will be available in as of release 3.0.
Motivation: Existing methods for estimating copy number variations in array comparative genomic hybridization (aCGH) data are limited to estimations of the gain/loss of chromosome regions for single sample analysis. We propose the linear-median method for estimating shared copy numbers in DNA sequences across multiple samples, demonstrate its operating characteristics through simulations and applications to real cancer data, and compare it to two existing methods. Results: Our proposed linear-median method has the power to estimate common changes that appear at isolated single probe positions or very short regions. Such changes are hard to detect by current methods. This new method shows a higher rate of true positives and a lower rate of false positives. The linear-median method is non-parametric and hence is more robust in estimating copy number. Additionally the linear-median method is easily computable for practical aCGH data sets compared to other copy number estimation methods.
Trace plots of the sample values versus iteration for the parameters β10 (a), β20 (b), β30 (c), β40 (d), β50 (e), β60 (f), β70 (g), β11 (h), β21 (i), β31 (j), β41 (k), β51 (l), β61 (m), β71 (n), β12 (o), β22 (p), β32 (q), β42 (r), β52 (s), β62 (t) and β72 (u).
In this paper we develop a Bayesian analysis to estimate the disease prevalence and the sensitivity and specificity of three cervical cancer screening tests (cervical cytology, visual inspection with acetic acid and Hybrid Capture II) in the presence of a covariate and in the absence of a gold standard. We use the Metropolis-Hastings algorithm to obtain the posterior summaries of interest. The estimated prevalence of cervical lesions was 6.4% (95% credible interval [95% CI]: 3.9, 9.3). The sensitivity of cervical cytology (with a result of ≥ ASC-US) was 53.6% (95% CI: 42.1, 65.0) compared with 52.9% (95% CI: 43.5, 62.5) for visual inspection with acetic acid and 90.3% (95% CI: 76.2, 98.7) for Hybrid Capture II (with a result of >1 relative light units). The specificity of cervical cytology was 97.0% (95% CI: 95.5, 98.4) and the specificities for visual inspection with acetic acid and Hybrid Capture II were 93.0% (95% CI: 91.0, 94.7) and 88.7% (95% CI: 85.9, 91.4), respectively. The Bayesian model with covariates suggests that the sensitivity and the specificity of visual inspection with acetic acid tend to increase as the age of the women increases. The Bayesian method proposed here is a useful alternative for estimating measures of performance of diagnostic tests in the presence of covariates and when a gold standard is not available. An advantage of the method is that the number of parameters to be estimated is not limited by the number of observations, as happens with several frequentist approaches. However, it is important to point out that the Bayesian analysis requires informative priors in order for the parameters to be identifiable. The method can be easily extended to the analysis of other medical data sets.
A schematic representation showing the construction of the MCAM database.
Number of entries from different database sources associated with GO terms.
In the post-genomic era, computational identification of cell adhesion molecules (CAMs) becomes important in defining new targets for diagnosis and treatment of various diseases including cancer. Lack of a comprehensive CAM-specific database restricts our ability to identify and characterize novel CAMs. Therefore, we developed a comprehensive mammalian cell adhesion molecule (MCAM) database. The current version is an interactive Web-based database, which provides the resources needed to search mouse, human and rat-specific CAMs and their sequence information and characteristics such as gene functions and virtual gene expression patterns in normal and tumor tissues as well as cell lines. Moreover, the MCAM database can be used for various bioinformatics and biological analyses including identifying CAMs involved in cell-cell interactions and homing of lymphocytes, hematopoietic stem cells and malignant cells to specific organs using data from high-throughput experiments. Furthermore, the database can also be used for training and testing existing transmembrane (TM) topology prediction methods specifically for CAM sequences. The database is freely available online at
Execution time of CARAT-GxG and CPU implementation in two-way testing (unit: second).
Results of CARAT-GxG performance assessment. (A) The dotted line indicates the theoretical acceleration folds by adding a graphics card. The solid line indicates measured acceleration folds in two against one (green) and three against one (blue) graphics cards. (B) Execution time between CARAT-GxG and CPU implementations in a single SNP test. (C) Execution time of CARAT-GxG according to the number of threads and blocks with the dataset including 1,000 samples with 500 SNPs.
In genome-wide association studies (GWAS), regression analysis has been most commonly used to establish an association between a phenotype and genetic variants, such as single nucleotide polymorphisms (SNPs). However, most applications of regression analysis have been restricted to the investigation of single markers because of the large computational burden. Thus, there have been limited applications of regression analysis to multiple SNPs, including gene-gene interaction (GGI), in large-scale GWAS data. To overcome this limitation, we propose CARAT-GxG, a GPU computing system-oriented toolkit for performing regression analysis with GGI using CUDA (compute unified device architecture). Compared to other methods, CARAT-GxG achieved an almost 700-fold execution speed-up and delivered highly reliable results through our GPU-specific optimization techniques. In addition, it was possible to achieve almost-linear speed acceleration with the application of a GPU computing system, which is implemented by the TORQUE Resource Manager. We expect that CARAT-GxG will enable large-scale regression analysis with GGI for GWAS data.
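To make the computational burden concrete, the following CPU-only Python sketch fits a single two-SNP regression with an interaction term, y = b0 + b1*s1 + b2*s2 + b3*s1*s2. It is not CARAT-GxG code: the abstract does not state the regression type, so ordinary least squares is assumed here, and the names `solve` and `fit_interaction` are hypothetical. A genome-wide GGI scan would repeat such a fit for every SNP pair, which is the workload a GPU implementation parallelizes.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination
    with partial pivoting (a is a square list of lists)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_interaction(snp1, snp2, y):
    """Least-squares fit of y ~ 1 + snp1 + snp2 + snp1*snp2.

    snp1, snp2: genotype codes (e.g. 0, 1, 2 minor-allele counts).
    Returns [b0, b1, b2, b3]; b3 is the gene-gene interaction effect.
    """
    rows = [[1.0, a, b, a * b] for a, b in zip(snp1, snp2)]
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(4)] for i in range(4)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(4)]
    return solve(xtx, xty)
```

For N SNPs there are N(N-1)/2 such pairwise fits, each independent of the others, so the problem is naturally data-parallel.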
Cancer surveillance network architecture.
Examples of GATE Monte Carlo simulations for radiotherapy treatment.
HOPE (Hospital Platform for E-health) web platform.
The main problem for health professionals and patients in accessing information is that this information is very often distributed over many medical records and locations. This problem is particularly acute in cancerology because patients may be treated for many years and undergo a variety of examinations. Recent advances in technology make it feasible to gain access to medical records anywhere and anytime, allowing the physician or the patient to gather information from an "ephemeral electronic patient record". However, this easy access to data is accompanied by the requirement for improved security (confidentiality, traceability, integrity, ...) and this issue needs to be addressed. In this paper we propose and discuss a decentralised approach based on recent advances in information sharing and protection: Grid technologies and watermarking methodologies. The potential impact of these technologies for oncology is illustrated by the examples of two experimental cases: a cancer surveillance network and a radiotherapy treatment plan. It is expected that the proposed approach will constitute the basis of a future secure "google-like" access to medical records.
Errors and increments for discriminating AO from other gliomas.
Power curves for different model parameters. Red: r = 0.03, green: r = 0.05, black: r = 0.07.  
Average good features for different model parameters. Red: r = 0.03, green: r = 0.05, black: r = 0.07.
Error difference curves. Green: true error, red: estimated error, black: error difference.
Power curves for σ_µ^2 = 0.5, 1.0, 2.0, respectively. Red: σ_µ^2 = 0.5, green: σ_µ^2 = 1.0, black: σ_µ^2 = 2.0.
When confronted with a small sample, feature-selection algorithms often fail to find good feature sets, a problem exacerbated for high-dimensional data and large feature sets. The problem is compounded by the fact that, if one obtains a feature set with a low error estimate, the estimate is unreliable because training-data-based error estimators typically perform poorly on small samples, exhibiting optimistic bias or high variance. One way around the problem is to limit the number of features being considered, restrict feature sets to sizes such that all feature sets can be examined by exhaustive search, and report a list of the best performing feature sets. If the list is short, then it greatly restricts the possible feature sets to be considered as candidates; however, one can expect the lowest error estimates obtained to be optimistically biased so that there may not be a close-to-optimal feature set on the list. This paper provides a power analysis of this methodology; in particular, it examines the kind of results one should expect to obtain relative to the length of the list and the number of discriminating features among those considered. Two measures are employed. The first is the probability that there is at least one feature set on the list whose true classification error is within some given tolerance of the best feature set, and the second is the expected number of feature sets on the list whose true errors are within the given tolerance of the best feature set. These values are plotted as functions of the list length to generate power curves. The results show that, if the number of discriminating features is not too small (that is, the prior biological knowledge is not too poor), then one should expect, with high probability, to find good feature sets. Availability: companion website at
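The two power measures can be estimated from simulated trials as in the Python sketch below. It assumes that each trial supplies, for every candidate feature set, an estimated error (used for ranking) and a true error (used for evaluation); how those errors are simulated is outside the sketch, and the names `list_hits` and `power_measures` are hypothetical.

```python
def list_hits(estimated, true, list_length, tolerance):
    """Count feature sets on the reported list whose true error is within
    `tolerance` of the best true error among all candidates.

    estimated, true: parallel lists, one entry per candidate feature set.
    The list consists of the `list_length` sets with lowest estimated error.
    """
    best_true = min(true)
    ranked = sorted(range(len(estimated)), key=lambda i: estimated[i])
    on_list = ranked[:list_length]
    return sum(1 for i in on_list if true[i] - best_true <= tolerance)


def power_measures(trials, list_length, tolerance):
    """Average over trials to estimate
    (1) P(at least one good feature set is on the list) and
    (2) E[number of good feature sets on the list].

    trials: iterable of (estimated_errors, true_errors) pairs.
    """
    hits = [list_hits(est, tru, list_length, tolerance) for est, tru in trials]
    prob_at_least_one = sum(1 for h in hits if h > 0) / len(hits)
    expected_good = sum(hits) / len(hits)
    return prob_at_least_one, expected_good
```

Evaluating `power_measures` over a range of `list_length` values gives one point per length, which is how a power curve of the kind described above would be traced.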
Output from ADaCGH. Partial output from ADaCGH showing: (Top) the bottom of the main output screen with the thumbnails for the segmented plots; (Bottom) Genome View for one of the arrays, obtained by clicking on the uppermost thumbnail in (Top); (Center) Chromosome View for the first chromosome (obtained by clicking on the region for the first chromosome in (Bottom)), with some data-points showing their ID; (Right) the results from IDClight obtained by clicking on the ID for one of the highlighted points in (Center).
The analysis of expression and CGH arrays plays a central role in the study of complex diseases, especially cancer, including finding markers for early diagnosis and prognosis, choosing an optimal therapy, or increasing our understanding of cancer development and metastasis. Asterias is an integrated collection of freely-accessible web tools for the analysis of gene expression and aCGH data. Most of the tools use parallel computing (via MPI) and run on a server with 60 CPUs for computation; compared to a desktop or server-based but not parallelized application, parallelization provides speed-ups of up to a factor of 50. Most of our applications allow the user to obtain additional information for user-selected genes (chromosomal location, PubMed ids, Gene Ontology terms, etc.) by using clickable links in tables and/or figures. Our tools include: normalization of expression and aCGH data (DNMAD); converting between different types of gene/clone and protein identifiers (IDconverter/IDClight); filtering and imputation (preP); finding differentially expressed genes related to patient class and survival data (Pomelo II); searching for models of class prediction (Tnasas); using random forests to search for minimal models for class prediction or for large subsets of genes with predictive capacity (GeneSrF); searching for molecular signatures and predictive genes with survival data (SignS); detecting regions of genomic DNA gain or loss (ADaCGH). The capability to send results between different applications, access to additional functional information, and parallelized computation make our suite unique and exploit features only available to web-based applications.
An example of aCGHViewer input data format for categorized data 
The aCGHViewer genomic view. This figure shows the graphical representation of human aCGH data (Rossi et al 2005). Each panel contains the data for one chromosome and each point represents data from one target (BAC). The horizontal axis represents the base pair position along the chromosome while the vertical axis represents the measured log2 signal ratio (-2 to +5) value for each BAC. The position of the centromere is indicated by a vertical black line. Two tabs are visible at the top of the main window in the figure, indicating that two data sets have been loaded for visualization.
The detailed chromosome view. (a): a detailed view launched by selecting chromosome 7 shown in Fig. 1. In this window, the user may zoom in on a portion of the graph by drawing a zooming rectangle surrounding the region of interest. (b): a zoom window for a selected region of panel (a) covered by the rectangular box. A tooltip, containing target ID, cytoband, and value, appears when the mouse hovers over a data point of interest. When the data point is selected and clicked, a query is launched against the UCSC or NCBI genome browser. (c): the resulting UCSC web page is shown. Note that the target name from the tooltip matches the highlighted target on the webpage. (d): a hypothetical breakpoint region within chromosome 7 being selected for exploration. Shift-selecting the suspect region highlights it in yellow and launches a query to the UCSC or NCBI genome browser using the horizontal base pair coordinate range as query parameter. The resulting web page is similar to Fig. 2(c) but showing a larger genomic region. 
Genome plot of categorized data. Data treated by circular binary segmentation (Olshen et al 2004) are displayed in aCGHViewer. Targets representing amplified regions are colored red, normal regions are in black, and deleted regions are colored green. The X and Y chromosomes in this data set were excluded from analysis because they were utilized as sex-mismatch hybridization controls.
Array-Comparative Genomic Hybridization (aCGH) is a powerful high throughput technology for detecting chromosomal copy number aberrations (CNAs) in cancer, aiming at identifying related critical genes from the affected genomic regions. However, advancing from a dataset with thousands of tabular lines to a few candidate genes can be an onerous and time-consuming process. To expedite the aCGH data analysis process, we have developed a user-friendly aCGH data viewer (aCGHViewer) as a conduit between the aCGH data tables and a genome browser. The data from a given aCGH analysis are displayed in a genomic view composed of individual chromosome panels which can be rapidly scanned for interesting features. A chromosome panel containing a feature of interest can be selected to launch a detail window for that single chromosome. Selecting a data point of interest in the detail window launches a query to the UCSC or NCBI genome browser to allow the user to explore the gene content in the chromosomal region. Additionally, aCGHViewer can display aCGH and expression array data concurrently to visually correlate the two. aCGHViewer is a stand-alone Java visualization application that should be used in conjunction with separate statistical programs. It operates on all major computer platforms and is freely available at
Pre-processing results of the oral squamous cell carcinoma sample X3482 from the data set of Snijders et al. (2005). 
Array comparative genomic hybridization (aCGH) is a high-throughput lab technique to measure genome-wide chromosomal copy numbers. Data from aCGH experiments require extensive pre-processing, which consists of three steps: normalization, segmentation and calling. Each of these pre-processing steps yields a different data set: normalized data, segmented data, and called data. Publications using aCGH base their findings on data from all stages of the pre-processing. Hence, there is no consensus on which should be used for further down-stream analysis. Such a consensus is, however, important for correct reporting of findings and for comparison of results from different studies. We discuss several issues that should be taken into account when deciding on which data are to be used. We express the belief that called data are best used, but would welcome opposing views.
The graphical user interface for the Overlay Tool ©. This interface allows the choice of overlay type and the array platforms to be selected, as well as user-defined threshold values. The various sections as outlined (dotted lines) are designed for user input of the following information: (A) Overlay Type, (B) Data Source for array data, (C) Array Platform type, (D) User-defined threshold and Data Center, (E) Primary Display Type, (F) Option for users to add Custom Array Types to the Overlay Tool ©, (G) Calculation options for aCGH data, (H) Calculation options for SNP array data, (I) Weighting scheme based on probe set suffixes for the Affymetrix gene expression arrays and (J) Calculation options for Affymetrix gene expression arrays.
Summary of different array identifiers for the EGFR gene.
Values obtained from two high throughput platforms for the EGFR gene. 
The aCGHViewer © genomic view of the RPCI 6K BAC Array (A) showing the graphical representation of aCGH data from a malignant glioblastoma. Notice amplification of the EGFR locus (circle) on chromosome 7, a homozygous deletion around the PTEN locus on chromosome 10 (double circle) and a single copy loss of chromosome 10, which are common cytogenetic events associated with glioblastoma. In (B), the aCGHViewer © genomic view of the SLR values from the Affymetrix U133Plus2 expression array of the same glioblastoma sample is shown. Due to the intra data noise levels, it is difficult to establish the relationship between the two datasets (A and B) based on visual inspection alone. In (C), the aCGHViewer © genomic view of the LOH p-values of the Affymetrix Mapping 100K SNP Array of the same glioblastoma sample is shown. Examples of regions showing high confidence of LOH based on pooled normal allelic frequencies are highlighted by arrows (not all areas are highlighted to avoid clutter). Note that the Y-axis scale differs for each chromosome depending on the range of -log p-values, which is important for interpretation of LOH (see text).
The Overlay Tool has been developed to combine high throughput data derived from various microarray platforms. This tool analyzes high-resolution correlations between gene expression changes and either copy number abnormalities (CNAs) or loss of heterozygosity events detected using array comparative genomic hybridization (aCGH). Using an overlay analysis which is designed to be performed using data from multiple microarray platforms on a single biological sample, the Overlay Tool identifies potentially important genes whose expression profiles are changed as a result of losses, gains and amplifications in the cancer genome. In addition, the Overlay Tool will incorporate loss of heterozygosity (LOH) probability data into this overlay procedure. To facilitate this analysis, we developed an application which computationally combines two or more high throughput datasets (e.g. aCGH/expression) into a single categorized dataset for visualization and interrogation using a gene-centric approach. As such, data from virtually any microarray platform can be incorporated without the need to remap entire datasets individually. The resultant categorized (overlay) data set can be conveniently viewed using our in-house visualization tool, aCGHViewer (Shankar et al. 2006), which serves as a conduit to public databases such as UCSC and NCBI, to rapidly investigate genes of interest.
Peptide profiles generated using SELDI/MALDI time of flight mass spectrometry provide a promising source of patient-specific information with high potential impact on the early detection and classification of cancer and other diseases. The new profiling technology comes, however, with numerous challenges and concerns. Particularly important are concerns of reproducibility of classification results and their significance. In this work we describe a computational validation framework, called PACE (Permutation-Achieved Classification Error), that lets us assess, for a given classification model, the significance of the Achieved Classification Error (ACE) on the profile data. The framework compares the performance statistic of the classifier on true data samples and checks if these are consistent with the behavior of the classifier on the same data with randomly reassigned class labels. A statistically significant ACE increases our belief that a discriminative signal was found in the data. The advantage of PACE analysis is that it can be easily combined with any classification model and is relatively easy to interpret. PACE analysis does not protect researchers against confounding in the experimental design, or other sources of systematic or random error. We use PACE analysis to assess significance of classification results we have achieved on a number of published data sets. The results show that many of these datasets indeed possess a signal that leads to a statistically significant ACE.
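The permutation idea behind PACE can be sketched in Python as follows. The framework is described as classifier-agnostic, so the error function is passed in as a parameter; the leave-one-out nearest-mean classifier on a single feature is only a stand-in chosen to keep the example self-contained, and both function names are hypothetical.

```python
import random


def loo_nearest_mean_error(values, labels):
    """Leave-one-out error of a nearest-class-mean classifier on one feature."""
    errors = 0
    for i, (v, y) in enumerate(zip(values, labels)):
        means = {}
        for c in set(labels):
            members = [x for j, (x, l) in enumerate(zip(values, labels))
                       if l == c and j != i]
            if members:
                means[c] = sum(members) / len(members)
        predicted = min(means, key=lambda c: abs(v - means[c]))
        errors += predicted != y
    return errors / len(values)


def pace_p_value(values, labels, error_fn, n_perm=500, seed=0):
    """Return (ACE, p-value): the achieved classification error on the true
    labels and the share of label permutations that do at least as well."""
    rng = random.Random(seed)
    ace = error_fn(values, labels)
    count = 0
    for _ in range(n_perm):
        permuted = list(labels)
        rng.shuffle(permuted)  # randomly reassign class labels
        if error_fn(values, permuted) <= ace:
            count += 1
    # Add-one correction keeps the estimated p-value above zero.
    return ace, (count + 1) / (n_perm + 1)
```

A small p-value indicates that the achieved error is unlikely under random labels; as the abstract notes, it says nothing about confounding in the experimental design.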
Selecting subregions of size 64 × 64 pixels.
Variation of accuracy. 
ROC curve.
OBJECTIVE To explore the advantages of using artificial neural networks (ANNs) to recognize patterns in colposcopy images and classify them. PURPOSE Transversal, descriptive, and analytical study of a quantitative approach with an emphasis on diagnosis. The training, test, and validation sets were composed of images collected from patients who underwent colposcopy. These images were provided by a gynecology clinic located in the city of Criciúma (Brazil). The image database (n = 170) was divided; 48 images were used for the training process, 58 images were used for the tests, and 64 images were used for the validation. A hybrid neural network based on Kohonen self-organizing maps and multilayer perceptron (MLP) networks was used. RESULTS After 126 cycles, the validation was performed. The best results reached an accuracy of 72.15%, a sensitivity of 69.78%, and a specificity of 68%. CONCLUSION Although the preliminary results still exhibit an average efficiency, the present approach is an innovative and promising technique that should be explored in depth in the context of the present study.
The antitumor drug paclitaxel stabilizes microtubules and reduces their dynamicity, promoting mitotic arrest and eventually apoptosis. Upon assembly of the alpha/beta-tubulin heterodimer, GTP becomes bound to both the alpha and beta-tubulin monomers. During microtubule assembly, the GTP bound to beta-tubulin is hydrolyzed to GDP, eventually reaching steady-state equilibrium between free tubulin dimers and those polymerized into microtubules. Tubulin-binding drugs such as paclitaxel interact with beta-tubulin, resulting in the disruption of this equilibrium. In spite of several crystal structures of tubulin, there is little biochemical insight into the mechanism by which anti-tubulin drugs target microtubules and alter their normal behavior. The mechanism of drug action is further complicated because altered beta-tubulin isotype expression and/or mutations in tubulin genes may lead to drug resistance, as has been described in the literature. Because of the relationship between beta-tubulin isotype expression and mutations within beta-tubulin, both leading to resistance, we examined the properties of altered residues within the taxane, colchicine and Vinca binding sites. The amount of data now available allows us to investigate common patterns that lead to microtubule disruption and may provide a guide to the rational design of novel compounds that can inhibit microtubule dynamics for specific tubulin isotypes or, indeed, resistant cell lines. Because of the vast amount of data published to date, we will only provide a broad overview of the mutational results and how these correlate with differences between tubulin isotypes. We also note that clinical studies describe a number of predictive factors for the response to anti-tubulin drugs and attempt to develop an understanding of the features within tubulin that may help explain how they may affect both microtubule assembly and stability.
Reproducibility of spectrum intensity. These graphs contain plots of the mean intensity across 432 aligned spectra (top), the point-wise standard deviation (SD; center), and the point-wise coefficient of variation (CV; bottom) as functions of the time of flight.
Variance of the log-transformed peak heights before and after normalization.  
results of a peak-by-peak decomposition of variance components before normalization. each panel shows the percentage of variance of the log-transformed peak heights, as a function of the time-of-flight, for one of the factors (top left: residual; top right: array; lower left: laboratory; lower right: week).  
This is an open access article. Unrestricted non-commercial use is permitted provided the original work is properly cited. The reproducibility of mass spectrometry (MS) data collected using surface enhanced laser desorption/ionization-time of flight (SELDI-TOF) has been questioned. This investigation was designed to test the reproducibility of SELDI data collected over time by multiple users and instruments. Five laboratories prepared arrays once every week for six weeks. Spectra were collected on separate instruments in the individual laboratories. Additionally, all of the arrays produced each week were rescanned on a single instrument in one laboratory. Lab-to-lab and array-to-array variability in alignment parameters was larger than the variability attributable to running samples during different weeks. The coefficient of variation (CV) in spectrum intensity ranged from 25% at baseline, to 80% in the matrix noise region, to about 50% during the exponential drop from the maximum matrix noise. Before normalization, the median CV of the peak heights was 72%; it was reduced to about 20% after normalization. Additionally, for the spectra from a common instrument, the CV ranged from 5% at baseline, to 50% in the matrix noise region, to 20% during the drop from the maximum matrix noise. Normalization reduced the variability in peak heights to about 18%. With proper processing methods, SELDI instruments produce spectra containing large numbers of reproducibly located peaks, with consistent heights.
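The point-wise statistics described for the aligned spectra (mean, SD and CV at each time-of-flight point) can be illustrated with a minimal sketch. The function name `pointwise_cv` and the toy intensities are hypothetical and are not taken from the study; the sketch assumes the spectra are already aligned so that index k refers to the same time of flight in every spectrum.

```python
from statistics import mean, stdev

def pointwise_cv(spectra):
    """Return (means, sds, cvs) lists, one entry per time-of-flight point.

    `spectra` is a list of equal-length intensity lists (already aligned).
    CV is reported as a percentage: 100 * SD / mean.
    """
    means, sds, cvs = [], [], []
    for column in zip(*spectra):  # one column = same time point in every spectrum
        m = mean(column)
        s = stdev(column)  # sample standard deviation (n - 1 denominator)
        means.append(m)
        sds.append(s)
        cvs.append(100.0 * s / m if m != 0 else float("nan"))
    return means, sds, cvs

# Toy data: three spectra, three time points (illustrative values only).
spectra = [[10.0, 20.0, 5.0], [12.0, 18.0, 5.0], [8.0, 22.0, 5.0]]
means, sds, cvs = pointwise_cv(spectra)
print(means, sds, cvs)
```

Whether the original analysis used the sample or the population standard deviation is not stated; the sample form is assumed here.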
The aberrantly expressed signal transducer and activator of transcription 3 (STAT3) predicts poor prognosis, primarily in estrogen receptor positive (ER(+)) breast cancers. Activated STAT3 is overexpressed in luminal A subtype cells. The mechanisms contributing to the prognosis and/or subtype relevant features of STAT3 in ER(+) breast cancers are through multiple interacting regulatory pathways, including STAT3-MYC, STAT3-ERα, and STAT3-MYC-ERα interactions, as well as the direct action of activated STAT3. These data predict malignant events, treatment responses and a novel enhancer of tamoxifen resistance. The inferred crosstalk between ERα and STAT3 in regulating their shared target gene, METAP2, is partially validated in the luminal B breast cancer cell line MCF7. Taken together, we identify a poor prognosis relevant gene set within the STAT3 network and a robust one in a subset of patients. VEGFA, ABL1, LYN, IGF2R and STAT3 are suggested therapeutic targets for further study based upon the degree of differential expression in our model.
Node centrality ranking of actor and semiotic nodes in Ge′.
Exploring the actor-semiotic network of HCC. (1) The network G was reduced to a smaller network G′ by excluding extraneous Protein nodes and Gene Ontology nodes that did not map to the Gene nodes in the co-expression network. (2) G′ was transformed to Gd to expose any nested discrete clusters. (3) The largest connected component Ge′ was extracted from Ge which is the largest cluster in Gd. Node centrality signature vectors of Ge′ were constructed before biological inference.
Network topology of Gd. Gene co-expression in the normal hepatocyte is represented by green-coloured edges whereas co-expression in the hepatocellular carcinoma is represented by dark red-coloured edges. Gene nodes are coloured dark red. GO nodes are coloured yellow. MIRNA nodes are coloured blue.
Network topology of Ge′. The colour coding used for nodes and edges is the same as in Figure 2. The rank scores for the seven centrality types in each bar chart are arranged (from left to right) in this order: Degree, Closeness, Current Flow-Betweenness, Current Flow-Closeness, Eccentricity, HITS-Authority, and Shortest Path-Betweenness. A lower rank score means a higher node ranking for a particular centrality type.
Scatterplot generated by projecting graph signatures of Ge′ to the 2D space using Kruskal’s multi-dimensional scaling.
Primary hepatocellular carcinoma (HCC) is currently the fifth most common malignancy and the third most common cause of cancer mortality worldwide. Because of its high prevalence in developing nations, there have been numerous efforts made in the molecular characterization of primary HCC. However, a better understanding of the pathology of HCC requires software-assisted network modeling and analysis. In this paper, the author presents his first attempt at exploring the biological implication of gene co-expression in HCC using actor-semiotic network modeling and analysis. The network was first constructed by integrating inter-actor relationships, e.g. gene co-expression, microRNA-to-gene, and protein interactions, with semiotic relationships, e.g. gene-to-Gene Ontology Process. Topological features that are highly discriminative of the HCC phenotype were identified by visual inspection. Finally, the author devised a graph signature-based analysis method to supplement the network exploration.
Summary of protein microarray technologies. Information adapted from refs (68-73).
A) cDNA synthesis, labeling and hybridization to oligonucleotide array slides. B) Correlation coefficient analysis of gene expression data, showing, in red, probes with fluorescent intensities above the threshold of detection, and in yellow, absent fluorescence. C) Scatter plot analysis of gene expression data, showing the correlation between two of the samples that clustered together, where most probes have similar expression levels, with some probes differentially expressed between these samples. D) Hierarchical clustering of microarray data; in this analysis, samples with similar gene expression profiles are grouped together, the cluster of genes is shown on the Y-axis and the dendrogram or cluster of samples is shown on the X-axis.
A ChIP-on-Chip workflow. Each step of the chromatin immunoprecipitation stage is optimized for every cell and tissue type. Enrichment analysis to determine successful immunoprecipitation is performed using quantitative real time PCR using primers against target DNA sequences known to be bound by the protein of interest. Large scale genome binding analyses are dependent on the array platform used in the study—these can include promoter arrays, whole genome tiling arrays, or custom made targeted tiling arrays.
Microarray technology is a powerful tool, which has been applied to further the understanding of gene expression changes in disease. Array technology has been applied to the diagnosis and prognosis of Acute Myelogenous Leukemia (AML). Arrays have also been used extensively in elucidating the mechanism of and predicting therapeutic response in AML, as well as to further define the mechanism of AML pathogenesis. In this review, we discuss the major paradigms of gene expression array analysis, and provide insights into the use of software tools to annotate the array dataset and elucidate deregulated pathways and gene interaction networks. We present the application of gene expression array technology to questions in acute myelogenous leukemia; specifically, disease diagnosis, treatment and prognosis, and disease pathogenesis. Finally, we discuss several new and emerging array technologies, and how they can be further utilized to improve our understanding of AML.
The 18,352 pancreatic ductal adenocarcinoma (PDAC) cases from the Surveillance Epidemiology and End Results (SEER) database were analyzed using the Kaplan-Meier method for the following variables: race, gender, marital status, year of diagnosis, age at diagnosis, pancreatic subsite, T-stage, N-stage, M-stage, tumor size, tumor grade, performed surgery, and radiation therapy. Because the T-stage variable did not satisfy the proportional hazards assumption, the cases were divided into cases with T1- and T2-stages (localized tumor) and cases with T3- and T4-stages (extended tumor). For estimating survival and conditional survival probabilities in each group, a multivariate Cox regression model adjusted for the remaining covariates was developed. Testing the reproducibility of model parameters and generalizability of these models showed that the models are well calibrated and have concordance indexes equal to 0.702 and 0.712, respectively. Based on these models, a prognostic estimator of survival for patients diagnosed with PDAC was developed and implemented as a computerized web-based tool.
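The conditional survival probabilities mentioned above follow from a survival curve by a simple ratio: the probability of surviving t more months, given survival to month s, is S(s + t) / S(s). The sketch below illustrates only that ratio. The exponential curve and its rate are hypothetical placeholders, not the fitted Cox model from the study.

```python
import math

def conditional_survival(surv, s, t):
    """Probability of surviving t more time units given survival to time s.

    `surv` is any callable returning the survival probability S(time).
    """
    return surv(s + t) / surv(s)

def toy_survival(months, rate=0.1):
    """Hypothetical exponential survival curve, used only for illustration."""
    return math.exp(-rate * months)

# Chance of surviving 12 more months for a patient already alive at 12 months.
cs = conditional_survival(toy_survival, 12, 12)
print(round(cs, 4))
```

For an exponential curve the conditional probability does not depend on s; with a fitted model the value generally changes as the patient survives longer, which is what makes the conditional estimate informative.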
In this study, we introduce and use Efficiency Analysis to compare differences in the apparent internal and external consistency of competing normalization methods and tests for identifying differentially expressed genes. Using publicly available data, two lung adenocarcinoma datasets were analyzed using caGEDA to measure the degree of differential expression of genes existing between two populations. The datasets were randomly split into at least two subsets, each analyzed for differentially expressed genes between the two sample groups, and the gene lists compared for overlapping genes. Efficiency Analysis is an intuitive method that compares the differences in the percentage of overlap of genes from two or more data subsets, found by the same test over a range of testing methods. Tests that yield consistent gene lists across independently analyzed splits are preferred to those that yield less consistent inferences. For example, a method that exhibits 50% overlap in the 100 top genes from two studies should be preferred to a method that exhibits 5% overlap in the top 100 genes. The same procedure was performed using all normalization and transformation methods that are available through caGEDA. The 'best' test was then further evaluated using internal cross-validation to estimate generalizable sample classification errors using a Naïve Bayes classification algorithm. A novel test, termed D1 (a derivative of the J5 test), was found to be the most consistent, and to exhibit the lowest overall classification error and the highest sensitivity and specificity. The D1 test relaxes the assumption that few genes are differentially expressed. Efficiency Analysis can be misleading if the tests exhibit a bias in any particular dimension (e.g. expression intensity); we therefore explored intensity-scaled and segmented J5 tests using data in which all genes are scaled to share the same intensity distribution range.
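The core quantity in Efficiency Analysis, the percentage of overlap between top-ranked gene lists from independently analyzed splits, can be sketched as follows. This is a minimal illustration under stated assumptions: the gene names and scores are invented, and the scoring test itself (J5, D1, or any other) is represented only by a dictionary of per-gene scores.

```python
def top_genes(scores, n):
    """Return the set of the n genes with the highest scores."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:n])

def percent_overlap(scores_a, scores_b, n):
    """Percentage of the top-n genes shared between two data splits."""
    shared = top_genes(scores_a, n) & top_genes(scores_b, n)
    return 100.0 * len(shared) / n

# Hypothetical test scores for the same genes in two random splits.
split_a = {"g1": 5.0, "g2": 4.0, "g3": 3.0, "g4": 1.0}
split_b = {"g1": 4.5, "g2": 0.5, "g3": 3.2, "g4": 2.0}

overlap = percent_overlap(split_a, split_b, 2)
print(overlap)
```

Repeating this for each candidate test and normalization method, and preferring the combination with the highest overlap, mirrors the comparison described in the text.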
Efficiency Analysis correctly predicted the 'best' test and normalization method using the Beer dataset and also performed well with the Bhattacharjee dataset based on both efficiency and classification accuracy criteria.
Disease outcome by A) our multivariable model, with low risk patients showing an exceptional survival benefit compared to intermediate (log rank p = 0.005) and high risk patients (log rank p < 0.001); intermediate risk patients also have a better outcome than high risk patients (log rank p < 0.001); and B) the AJCC TNM staging system (log rank p (IA vs. IB) = 0.125, p (IA vs. IIA) = 0.056, p (IA vs. IIB) = 0.010, p (IB vs. IIA) = 0.718, p (IB vs. IIB) = 0.315 and p (IIA vs. IIB) = 0.311).
Multivariable analysis results of Cox proportional hazards regression model. 
The accurate prognosis for patients with resectable pancreatic adenocarcinomas requires the incorporation of more factors than those included in AJCC TNM system. We identified 218 patients diagnosed with stage I and II pancreatic adenocarcinoma at NewYork-Presbyterian Hospital/Columbia University Medical Center (1999 to 2009). Tumor and clinical characteristics were retrieved and associations with survival were assessed by univariate Cox analysis. A multivariable model was constructed and a prognostic score was calculated; the prognostic strength of our model was assessed with the concordance index. Our cohort had a median age of 67 years and consisted of 49% men; the median follow-up time was 14.3 months and the 5-year survival 3.6%. Age, tumor differentiation and size, alkaline phosphatase, albumin and CA 19-9 were the independent factors of the final multivariable model; patients were thus classified into low (n = 14, median survival = 53.7 months), intermediate (n = 124, median survival = 19.7 months) and high risk groups (n = 80, median survival = 12.3 months). The prognostic classification of our model remained significant after adjusting for adjuvant chemotherapy and the concordance index was 0.73 compared to 0.59 of the TNM system. Our prognostic model was accurate in stratifying patients by risk and could be incorporated into clinical decisions.
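The concordance index used to compare the prognostic model with the TNM system can be illustrated with a small sketch of Harrell's C for right-censored data. The data below are invented for illustration, and the function is a plain O(n²) implementation under the usual definition (a pair is comparable when the subject with the shorter time had an observed event); it is not the authors' code.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's concordance index for right-censored survival data.

    times       -- follow-up time per subject
    events      -- 1 if the event (death) was observed, 0 if censored
    risk_scores -- higher score means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Comparable pair: subject i had an observed event before time j.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0  # higher risk died earlier: concordant
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    return concordant / comparable

# Hypothetical cohort of four patients (illustrative values only).
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
risk = [0.9, 0.4, 0.5, 0.1]

c_index = concordance_index(times, events, risk)
print(c_index)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which gives context for the reported 0.73 versus 0.59.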
Historically, breast cancer classification has relied on prognostic subtypes. Thus, unlike hematopoietic cancers, breast tumor classification lacks phylogenetic rationale. The feasibility of phylogenetic classification of breast tumors has recently been demonstrated based on estrogen receptor (ER), androgen receptor (AR), vitamin D receptor (VDR) and Keratin 5 expression. Four hormonal states (HR0-3) comprising 11 cellular subtypes of breast cells have been proposed. This classification scheme has been shown to have relevance to clinical prognosis. We examine the implications of such phylogenetic classification on DNA methylation of both breast tumors and normal breast tissues by applying recently developed deconvolution algorithms to three DNA methylation data sets archived on Gene Expression Omnibus. We propose that breast tumors arising from a particular cell-of-origin essentially magnify the epigenetic state of their original cell type. We demonstrate that DNA methylation of tumors manifests patterns consistent with cell-specific epigenetic states, that these states correspond roughly to previously posited normal breast cell types, and that estimates of proportions of the underlying cell types are predictive of tumor phenotypes. Taken together, these findings suggest that the epigenetics of breast tumors is ultimately based on the underlying phylogeny of normal breast tissue.
Reverse phase protein arrays (RPPA) measure the relative expression levels of a protein in many samples simultaneously. Observed signal from these arrays is a combination of true signal, additive background, and multiplicative spatial effects. Background subtraction alone is not sufficient to remove all nonbiological trends from the data. We developed a surface adjustment that uses information from positive control spots to correct for spatial trends on the array beyond additive background. This method uses a generalized additive model to estimate a smoothed surface from positive controls. When positive controls are printed in a dilution series, a nested surface adjustment performs an intensity-based correction. When applicable, surface adjustment is able to remove spatial trends and increase within slide replicate agreement better than background subtraction alone as demonstrated on two sets of arrays. This work demonstrates the importance of including positive control spots on the array.
Top 40 most significant probes from a cluster analysis of 489, distinguishing lymphoid and myeloid BCR/ABL1 positive genomes. Notes: SAM analysis listed a total of 489 significant probes that discriminate between lymphoid and myeloid BCR/ABL1 positive genomes. The top 20% of these probes were associated with the TCR @chr7:38,287,976-38,315,044 and the IgH region @chr14:105,405,310-105,518,122. Arrow points to homozygous deletions of the IgH probes (bright green) seen exclusively in Ph(+) samples with an early B cell lymphoid phenotype.
Identification of probes discriminating between Ph positive acute lymphoblastic leukemia and CML lymphoid blast transformation. Notes: Heat map of the SAM data showing gains (red) and losses (green) for the 40 probes most influential in discriminating between Ph(+)ALL and BC/L CML samples. Altogether 16 of these probes (arrowed) cover the region of the PTPRD gene (protein tyrosine phosphatase, receptor type, D) on 9p24.1-p23. A subgroup of 5 Ph(+)ALL samples shows gains of chromosome 9p loci (red arrow), whilst the same region in 5 BC/L CML samples is deleted (green arrow) in agreement with their chromosome status.
The 40 most significant probes differentiating between lymphoid and myeloid lineages.
The 40 most significant probes differentiating between Ph(+)ALL and lymphoid blast crisis excluding chromosome 9 probes.
Philadelphia positive malignant disorders are a clinically divergent group of leukemias. These include chronic myeloid leukemia (CML) and de novo acute Philadelphia positive (Ph(+)) leukemia of both myeloid and lymphoid origin. Recent whole genome screening of Ph(+)ALL in both children and adults identified an almost obligatory cryptic loss of Ikaros, which is required for normal B cell maturation. Although similar losses were found in lymphoid blast crisis, the genetic background of the transformation in CML is still poorly defined. We used Significance Analysis of Microarrays (SAM) to analyze array comparative genomic hybridization (aCGH) data from 30 CML (10 each of chronic phase, myeloid and lymphoid blast stage), 10 Ph(+)ALL adult patients and 10 disease free controls and were able to: (a) discriminate between the genomes of lymphoid and myeloid blast cells and (b) identify differences in the genome profile of de novo Ph(+)ALL and lymphoid blast transformation of CML (BC/L). Furthermore, we were able to distinguish a subgroup of Ph(+)ALL characterized by gains in chromosome 9 and recurrent losses at several other genome sites, offering genetic evidence for the clinical heterogeneity. The significance of these results is that they not only offer clues regarding the pathogenesis of Ph(+) disorders and highlight the potential clinical implications of a set of probes but also demonstrate what SAM can offer for the analysis of genome data.
Integrative cancer biology research relies on a variety of data-driven computational modeling and simulation methods and techniques geared towards gaining new insights into the complexity of biological processes that are of critical importance for cancer research. These include the dynamics of gene-protein interaction networks, the percolation of sub-cellular perturbations across scales and the impact they may have on tumorigenesis in both experiments and clinics. Such innovative 'systems' research will greatly benefit from enabling Information Technology that is currently under development, including an online collaborative environment, a Semantic Web based computing platform that hosts data and model repositories as well as high-performance computing access. Here, we present one of the National Cancer Institute's recently established Integrative Cancer Biology Programs, i.e. the Center for the Development of a Virtual Tumor, CViT, which is charged with building a cancer modeling community, developing the aforementioned enabling technologies and fostering multi-scale cancer modeling and simulation.
Label-free analysis of relative protein levels in seven-protein mixtures.
MS1 scan number threshold. Scatter-plots used to calculate MS1 scan number threshold comparisons to (A) the logarithm of peptide ratio error for each peptide; (B) their standard deviations (SD) from (A); (C) number of peptides; and (D) number of proteins.
Comparison between label-free and ICAT methods. The three different concentrations of the seven-protein mixtures were pair-wise compared and the differential expression gauged by four different quantification methods (spectral counting, PICA, XPRESS, and ASAPRatio). Bars with asterisks represent MSE for sample 1 vs. 3 after removing 2 proteins at 4 fold-differences.
Comparison of label-free analysis of 3-protein standards spiked into a Francisella novicida lysate. Bovine serum albumin (a), bovine catalase (b) and chicken ovalbumin (c) were spiked into a Francisella novicida lysate. Each data point represents either the actual (blue) spiked in value for the standard proteins or a value calculated four times from single technical LCMS analysis using either PICA (red) or spectral counting (yellow). Note: *The expected "actual" ratio from the 3 samples, a, b, and c.
Signal to background noise (S/N) threshold. Scatter-plots used to calculate signal to background noise (S/N) threshold comparisons to (A) the logarithm of peptide ratio error for each peptide; (B) their corresponding standard deviations (SD) from (A); (C) number of peptides; and (D) number of proteins.  
Recently, several research groups have published methods for the determination of proteomic expression profiling by mass spectrometry without the use of exogenously added stable isotopes or stable isotope dilution theory. These so-called label-free methods have the advantage of allowing data on each sample to be acquired independently from all other samples to which they can later be compared in silico for the purpose of measuring changes in protein expression between various biological states. We developed label-free software based on direct measurement of peptide ion current area (PICA) and compared it to two other methods, a simpler label-free method known as spectral counting and the isotope coded affinity tag (ICAT) method. Data analysis by these methods of a standard mixture containing proteins of known, but varying, concentrations showed that they performed similarly with a mean squared error of 0.09. Additionally, complex bacterial protein mixtures spiked with known concentrations of standard proteins were analyzed using the PICA label-free method. These results indicated that the PICA method detected all levels of standard spiked proteins at the 90% confidence level in this complex biological sample. This finding confirms that label-free methods, based on direct measurement of the area under a single ion current trace, performed as well as the standard ICAT method. Given that the label-free methods provide ease in experimental design well beyond pair-wise comparison, label-free methods such as our PICA method are well suited for proteomic expression profiling of large numbers of samples as is needed in clinical analysis.
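A mean squared error between measured and expected protein ratios can be computed as in the sketch below. The abstract does not state the scale on which the error was computed; the sketch assumes the error is taken on log10 ratios, which is an assumption made here for illustration, and the ratios shown are invented.

```python
import math

def mse_log_ratio(observed, expected):
    """Mean squared error between observed and expected ratios on a log10 scale.

    Assumption: errors are differences of log10 ratios, so a two-fold
    over-estimate and a two-fold under-estimate are penalized equally.
    """
    errors = [math.log10(o) - math.log10(e) for o, e in zip(observed, expected)]
    return sum(err ** 2 for err in errors) / len(errors)

# Hypothetical protein ratios between two mixtures (illustrative only).
expected = [1.0, 1.0]
observed = [10.0, 1.0]

mse = mse_log_ratio(observed, expected)
print(mse)
```

Working on the log scale keeps the measure symmetric for up- and down-regulation, which is a common choice for ratio data.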
AlamarBlue™ Cytotoxicity Assay: Human fibroblasts treated with cisplatin and transplatin. Graphs indicate the % Cell Survival at (a) 8 and (b) 24 hours for 0.1-100 µM cisplatin and transplatin. The data points shown are for drug concentrations of 0.1, 0.3, 1.0, 10, 25 and 100 µM, and have been averaged across three replicate values. Error bars represent the standard error of the mean. (Note: Scale of x-axis is in logarithmic format).
MA-plot of normalised data for DMF (control) versus 1 µM cisplatin treatment comparison. M (log2(expression ratio)) and A (relative intensity) values were averaged across 4 replicate microarrays. M values greater than zero indicate transcripts which responded to cisplatin with decreased expression levels, while M values less than zero indicate transcripts with increased expression levels after cisplatin treatment.
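The M and A values plotted in an MA-plot can be computed per probe as in the sketch below. Two assumptions are made here and are not stated in the caption: the expression ratio is taken as control over treated (which matches the caption's statement that M greater than zero means decreased expression after cisplatin), and A is taken as the conventional average log2 intensity of the two channels. The intensities are invented.

```python
import math
from statistics import mean

def ma_values(control, treated):
    """Return (M, A) for one probe from two channel intensities.

    M = log2(control / treated)       # assumed ratio direction
    A = 0.5 * log2(control * treated) # assumed definition of intensity
    """
    m = math.log2(control / treated)
    a = 0.5 * math.log2(control * treated)
    return m, a

# One probe measured on four replicate arrays (hypothetical intensities).
replicates = [(256.0, 64.0), (256.0, 64.0), (512.0, 128.0), (128.0, 32.0)]
pairs = [ma_values(c, t) for c, t in replicates]
mean_m = mean(m for m, _ in pairs)
mean_a = mean(a for _, a in pairs)
print(mean_m, mean_a)
```

Here the averaged M of 2 would indicate a four-fold lower signal in the treated channel under the assumed ratio direction.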
Cisplatin is a DNA-damaging anti-cancer agent that is widely used to treat a range of tumour types. Despite its clinical success, cisplatin treatment is still associated with a number of dose-limiting toxic side effects. The purpose of this study was to clarify the molecular events that are important in the anti-tumour activity of cisplatin, using gene expression profiling techniques. Currently, our incomplete understanding of this drug's mechanism of action hinders the development of more efficient and less harmful cisplatin-based chemotherapeutics. In this study the effect of cisplatin on gene expression in human foreskin fibroblasts has been investigated using human 19K oligonucleotide microarrays. Its clinically inactive isomer, transplatin, was also tested. Dual-fluor microarray experiments comparing treated and untreated cells were performed in quadruplicate. Cisplatin treatment was shown to significantly up- or down-regulate a consistent subset of genes. Many of these genes responded similarly to treatment with transplatin. However, a smaller proportion of these transcripts underwent differential expression changes in response to the two isomers. Some of these genes may constitute part of the DNA damage response induced by cisplatin that is critical for its anti-tumour activity. Ultimately, the identification of gene expression responses unique to clinically active compounds, like cisplatin, could greatly benefit the design and development of improved chemotherapeutics.
A computational approach for estimating the overall, population, and individual cancer hazard rates was developed. The population rates characterize a risk of getting cancer of a specific site/type, occurring within an age-specific group of individuals from a specified population during a distinct time period. The individual rates characterize an analogous risk but only for the individuals susceptible to cancer. The approach uses a novel regularization and anchoring technique to solve an identifiability problem that occurs while determining the age, period, and cohort (APC) effects. These effects are used to estimate the overall rate, and to estimate the population and individual cancer hazard rates. To estimate the APC effects, as well as the population and individual rates, a new web-based computing tool, called the CancerHazard@Age, was developed. The tool uses data on the past and current history of cancer incidences collected during a long time period from the surveillance databases. The utility of the tool was demonstrated using data on the female lung cancers diagnosed during 1975-2009 in nine geographic areas within the USA. The developed tool can be applied equally well to process data on other cancer sites. The data obtained by this tool can be used to develop novel carcinogenic models and strategies for cancer prevention and treatment, as well as to project future cancer burden.
The Weibull-like curve fitted to the estimates (data) of the age-specific LC hazard function for white men. The estimates of the LC hazards are anchored to the 2000–2004 time period and to the 1925–1929 birth cohort.
The Weibull-like curve fitted to the estimates (data) of the age-specific LC hazard function for white women. The estimates of the LC hazards are anchored to the 2000–2004 time period and to the 1925–1929 birth cohort.
Mathematical modeling of cancer development is aimed at assessing the risk factors leading to cancer. Aging is a common risk factor for all adult cancers. The risk of getting cancer in aging is represented by a hazard function that can be estimated from the observed incidence rates collected in cancer registries. Recent analyses of the SEER database show that the cancer hazard function initially increases with age, and then turns over and falls at the end of the lifetime. Such behavior of the hazard function is poorly modeled by the exponential or compound exponential-linear functions mainly utilized for the modeling. In this work, for mathematical modeling of cancer hazards, we propose to use the Weibull-like function, derived from the Armitage-Doll multistage concept of carcinogenesis and an assumption that the number of clones at age t developed from mutated cells follows the Poisson distribution. This function is characterized by three parameters, two of which (r and λ) are the conventional parameters of the Weibull probability distribution function, and an additional parameter (C(0)) that adjusts the model to the observational data. The biological meanings of these parameters are: r, the number of stages in carcinogenesis; λ, the average number of clones developed from the mutated cells during the first year of carcinogenesis; and C(0), a data adjustment parameter that characterizes the fraction of the age-specific population that will get this cancer in their lifetime. To test the validity of the proposed model, nonlinear regression analysis was performed for the lung cancer (LC) data collected in the SEER 9 database for white men and women during 1975-2004. The results suggest that: (i) modeling can be improved by the use of another parameter, A, the age at the beginning of carcinogenesis; and (ii) in white men and women, the processes of LC carcinogenesis vary by A and C(0), while the corresponding values of r and λ are nearly the same.
Overall, the proposed Weibull-like model provides an excellent fit of the estimates of the LC hazard function in aging. It is expected that the Weibull-like model can be applicable to fit estimates of hazard functions of other adult cancers as well.
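The abstract does not give the closed form of the Weibull-like function, so the sketch below rests on an assumption: that the hazard is C(0) multiplied by a Weibull density with shape r and rate λ, shifted by the onset age A. The functional form, the helper names and the parameter values are all hypothetical illustrations and are not the fitted values from the study. The sketch shows only that such a function rises with age, peaks, and then falls, which is the qualitative behavior described.

```python
import math

def weibull_like_hazard(t, r, lam, c0, a=0.0):
    """Assumed form: c0 * lam * r * x**(r-1) * exp(-lam * x**r), with x = t - a.

    Returns 0 before the assumed onset age `a`.
    """
    x = t - a
    if x <= 0:
        return 0.0
    return c0 * lam * r * x ** (r - 1) * math.exp(-lam * x ** r)

def peak_age(r, lam, a=0.0):
    """Age at which the assumed hazard is maximal (requires r > 1).

    Setting the derivative of the log-hazard to zero gives
    x**r = (r - 1) / (r * lam).
    """
    return a + ((r - 1) / (r * lam)) ** (1.0 / r)

# Hypothetical parameter values chosen only to give a peak at a plausible age.
r, lam, c0, onset = 5, 1e-9, 0.1, 15.0
t_peak = peak_age(r, lam, onset)
h_peak = weibull_like_hazard(t_peak, r, lam, c0, onset)
print(round(t_peak, 1), h_peak)
```

In a real analysis the parameters would be estimated by nonlinear regression against registry-derived hazard estimates, as the abstract describes.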
A screenshot of the section of the tool allowing users to choose from pre-made lists of authors, grouped by program.
Searching PubMed for citations related to a specific cancer center or group of authors can be labor-intensive. We have created a tool, PubMed QUEST, to aid in the rapid searching of PubMed for publications of interest. It was designed by taking into account the needs of entire cancer centers as well as individual investigators. The experience of using the tool by our institution's cancer center administration and investigators has been favorable and we believe it could easily be adapted to other institutions. Use of the tool has identified limitations of automated searches for publications based on an author's name, especially for common names. These limitations could likely be solved if the PubMed database assigned a unique identifier to each author.
Overview of data model for image, CAD and quantitative data in a clinical trials setting.
(a) Original CT image of the right lung. (b) Result of attenuation thresholding in the lung field, with ROIs corresponding to blood vessels and pulmonary nodule in gray. (c) Automatically detected nodule (gray with arrow) following classification step.
3D rendering of a pulmonary nodule and blood vessels adjacent to the pleural surface.
Report from CAD measurement system showing diameter and volume measurements and percentage changes from baseline. From these changes disease stability or progression is determined.
Computed tomography (CT) imaging plays an important role in cancer detection and quantitative assessment in clinical trials. High-resolution imaging studies on large cohorts of patients generate vast data sets, which are infeasible to analyze through manual interpretation. In this article we describe a comprehensive architecture for computer-aided detection (CAD) and surveillance of lung nodules in CT images. Central to this architecture are the analytic components: an automated nodule detection system, nodule tracking capabilities and volume measurement, which are integrated within a data management system that includes mechanisms for receiving and archiving images, a database for storing quantitative nodule measurements, and visualization and reporting tools. We describe two studies to evaluate CAD technology within this architecture, and the potential application in large clinical trials. The first study involves performance assessment of an automated nodule detection system and its ability to increase radiologist sensitivity when used to provide a second opinion. The second study investigates nodule volume measurements on CT made using a semi-automated technique and shows that volumetric analysis yields significantly different tumor response classifications than a 2D diameter approach. These studies demonstrate the potential of automated CAD tools to assist in quantitative image analysis for clinical trials.
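One reason a volumetric analysis can classify response differently from a diameter approach is that volume scales with the cube of the diameter. The sketch below illustrates this with percentage change from baseline. It assumes an idealized spherical nodule purely for illustration; a CAD system would measure volume from the segmented nodule rather than derive it from a diameter, and the measurements shown are invented.

```python
import math

def percent_change(baseline, follow_up):
    """Percentage change of a measurement relative to its baseline value."""
    return 100.0 * (follow_up - baseline) / baseline

def sphere_volume(diameter):
    """Volume of a sphere with the given diameter (idealized nodule)."""
    return math.pi * diameter ** 3 / 6.0

# Hypothetical nodule: 10 mm at baseline, 12 mm at follow-up.
d0, d1 = 10.0, 12.0
diameter_change = percent_change(d0, d1)
volume_change = percent_change(sphere_volume(d0), sphere_volume(d1))
print(round(diameter_change, 1), round(volume_change, 1))
```

A 20% increase in diameter corresponds to roughly a 73% increase in volume for a sphere, so fixed percentage thresholds applied to the two measures need not agree.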
Example of a malignant mass ROI (A) and its corresponding segmentation mask (B).
Summary of computed image features for our mass detection scheme.
Example of an FP ROI (A) and its corresponding segmentation mask (B).
A flow diagram of our mass detection scheme. 
Average AUC values and the corresponding standard deviations for the five compared classifiers, computed using the 10-fold cross-validation experiments.
In the field of computer-aided mammographic mass detection, many different features and classifiers have been tested. Frequently, the relevant features and optimal topology for artificial neural network (ANN)-based approaches at the classification stage are unknown and are therefore determined by trial-and-error experiments. In this study, we analyzed a classifier that evolves ANNs using genetic algorithms (GAs), combining feature selection with the learning task. The classifier, named "Phased Searching with NEAT in a Time-Scaled Framework", was analyzed using a dataset of 800 malignant and 800 normal tissue regions in a 10-fold cross-validation framework. The classification performance, measured by the area under the receiver operating characteristic (ROC) curve, was 0.856 ± 0.029. The result was also compared with four other well-established classifiers: fixed-topology ANNs, support vector machines (SVMs), linear discriminant analysis (LDA), and bagged decision trees. The results show that Phased Searching outperformed the LDA and bagged decision tree classifiers and was significantly outperformed only by the SVM. Furthermore, the Phased Searching method required fewer features and discarded superfluous structure or topology, which reduced the time needed for feature computation, training, and validation. Analyses of the network complexities evolved by Phased Searching indicate that it can evolve optimal network topologies through its complexification and simplification parameter selection process. The study also concluded that three classifiers (SVM, fixed-topology ANN, and Phased Searching with NeuroEvolution of Augmenting Topologies (NEAT) in a Time-Scaled Framework) perform comparably well in our mammographic mass detection scheme.
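The evaluation protocol named in this abstract, 10-fold cross-validation summarized as a mean AUC with its standard deviation, can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the mean-difference scorer is a deliberately simple stand-in for the classifiers in the study, the data are synthetic, and the abstract does not say whether a sample or population standard deviation was reported (the sketch uses the sample form).

```python
import random
import statistics


def auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney statistic (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_folds(labels, k, rng):
    """Return k index lists, each with roughly the same class balance."""
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds


def fit_mean_difference(X, y):
    """Stand-in classifier: weights are the difference of the class means."""
    n_feat = len(X[0])
    means = {}
    for cls in (0, 1):
        rows = [x for x, lab in zip(X, y) if lab == cls]
        means[cls] = [sum(r[j] for r in rows) / len(rows) for j in range(n_feat)]
    return [means[1][j] - means[0][j] for j in range(n_feat)]


def cross_validated_auc(X, y, k=10, seed=0):
    """Train on k-1 folds, score the held-out fold, return (mean, sample std) of AUC."""
    rng = random.Random(seed)
    fold_aucs = []
    for test_idx in stratified_folds(y, k, rng):
        held_out = set(test_idx)
        train_X = [x for i, x in enumerate(X) if i not in held_out]
        train_y = [lab for i, lab in enumerate(y) if i not in held_out]
        w = fit_mean_difference(train_X, train_y)
        scores = [sum(wj * xj for wj, xj in zip(w, X[i])) for i in test_idx]
        fold_aucs.append(auc(scores, [y[i] for i in test_idx]))
    return statistics.mean(fold_aucs), statistics.stdev(fold_aucs)


# Synthetic two-feature data standing in for ROI feature vectors
# (hypothetical; not the dataset described in the abstract).
data_rng = random.Random(42)
y = [1] * 200 + [0] * 200
X = [[data_rng.gauss(1.0 if lab else 0.0, 1.0), data_rng.gauss(0.0, 1.0)] for lab in y]

mean_auc, sd_auc = cross_validated_auc(X, y)
print(f"AUC = {mean_auc:.3f} ± {sd_auc:.3f}")
```

Stratifying the folds keeps the class balance of each held-out set close to that of the full dataset, which matters for a balanced design such as the 800 malignant and 800 normal regions reported here.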
Ideogram depicting the "hot spots" discovered on chromosomes 2, 9 and 19 that demonstrated statistical significance (loci marked "0" denote genes that have not yet been investigated for identity).
The computational aspects of the problem in this paper involve, firstly, selective mapping of methylated DNA clones according to methylation level and, secondly, extracting motif information from all the mapped elements in the absence of a prior probability distribution. Our novel implementation of algorithms to map and maximize expectation in this setting has generated data that appear to be distinct for each lymphoma subtype examined. A "clone" represents a polymerase chain reaction (PCR) product (on average ~500 bp) that belongs to a microarray of 8544 such sequences preserving CpG-rich islands (CGIs) [1]. Accumulating evidence indicates that cancers, including lymphomas, demonstrate hypermethylation of CGIs, "silencing" an increasing number of tumor suppressor (TS) genes, which can lead to tumorigenesis. Availability: Algorithms are available on request from the authors. Supplementary Information: Available on page 453.
Top-cited authors
David Scott Wishart
  • University of Alberta
Richard Simon
  • National Cancer Institute
Natasa Przulj
  • University College London
Kevin Robert Coombes
  • The Ohio State University
Yu Shyr
  • Vanderbilt University