Article

Expression of Conjoined Genes: Another Mechanism for Gene Regulation in Eukaryotes

University of Texas Arlington, United States of America
PLoS ONE (Impact Factor: 3.53). 10/2010; 5(10):e13284. DOI: 10.1371/journal.pone.0013284
Source: PubMed

ABSTRACT The ENCODE project has shown that almost every base of the human genome is transcribed. One class of the resulting transcripts arises from conjoined genes, which are formed by combining the exons of two or more distinct (parent) genes lying on the same strand of a chromosome. Only a very limited number of such genes are known, and the definitions and terminology used for them vary widely across the public databases. In this work, we have computationally identified and manually curated 751 conjoined genes (CGs) in the human genome that are supported by at least one mRNA or EST sequence available in the NCBI database. 353 representative CGs were subjected to experimental validation using RT-PCR and sequencing, and 291 (82%) of them could be confirmed. We speculate that these genes arise from novel functional requirements and are not merely artifacts of transcription, since more than 70% of them are conserved in other vertebrate genomes. The unique splicing patterns exhibited by CGs suggest possible roles in protein evolution or gene regulation. Novel CGs, for which no transcript is available, could be identified in 80% of randomly selected potential CG-forming regions, indicating that their formation is a routine process. Formation of CGs is not limited to humans, as we have also identified 270 CGs in mouse and 227 in Drosophila using our approach. Additionally, we propose a novel mechanism for the formation of CGs. Finally, we developed a database, ConjoinG, which contains detailed information about all the CGs (800 in total) identified in the human genome. In summary, our findings provide new insights into the functionality of CGs as another possible mechanism for gene regulation and genomic evolution, and into the mechanism leading to their formation.

Available from: Vineet K Sharma, Apr 18, 2014
  • Source
    ABSTRACT: Metastatic cancer of unknown primary (CUP) accounts for up to 5% of all new cancer cases, with a 5-year survival rate of only 10%. Accurate identification of tissue of origin would allow for directed, personalized therapies to improve clinical outcomes. Our objective was to use transcriptome sequencing (RNA-Seq) to identify lineage-specific biomarker signatures for the cancer types that most commonly metastasize as CUP (colorectum, kidney, liver, lung, ovary, pancreas, prostate, and stomach). RNA-Seq data of 17,471 transcripts from a total of 3,244 cancer samples across 26 different tissue types were compiled from in-house sequencing data and publicly available International Cancer Genome Consortium and The Cancer Genome Atlas datasets. Robust cancer biomarker signatures were extracted using a 10-fold cross-validation method of log transformation, quantile normalization, transcript ranking by area under the receiver operating characteristic curve, and stepwise logistic regression. The entire algorithm was then repeated with a new set of randomly generated training and test sets, yielding highly concordant biomarker signatures. External validation of the cancer-specific signatures yielded high sensitivity (92.0% ± 3.15%; mean ± standard deviation) and specificity (97.7% ± 2.99%) for each cancer biomarker signature. The overall performance of this RNA-Seq biomarker-generating algorithm yielded an accuracy of 90.5%. In conclusion, we demonstrate a computational model for producing highly sensitive and specific cancer biomarker signatures from RNA-Seq data, generating signatures for the top eight cancer types responsible for CUP to accurately identify tumor origin.
    Neoplasia (New York, N.Y.) 11/2014; 16(11). DOI:10.1016/j.neo.2014.09.007 · 5.40 Impact Factor
  • Source
    ABSTRACT: Lung cancer causes more deaths, worldwide, than any other cancer. Several histologic subtypes exist. Currently, there is a dearth of targeted therapies for treating one of the main subtypes: squamous cell carcinoma (SCC). As for many cancers, lung SCC karyotypes are often highly anomalous owing to large somatic structural variants, some of which are seen repeatedly in lung SCC, indicating a potential causal association for genes therein. We chose to characterize a lung SCC genome to unprecedented detail and integrate our findings with the concurrently characterized transcriptome. We aimed to ascertain how somatic structural changes affected gene expression within the cell in ways that could confer a pathogenic phenotype. We sequenced the genomes of a lung SCC cell line (LUDLU-1) and its matched lymphocyte cell line (AGLCL) to more than 50x coverage. We also sequenced the transcriptomes of LUDLU-1 and a normal bronchial epithelium cell line (LIMM-NBE1), resulting in more than 600 million aligned reads per sample, including both coding and non-coding RNA (ncRNA), in a strand-directional manner. We also captured small RNA (<30 bp). We discovered significant, but weak, correlations between copy number and expression for protein-coding genes, antisense transcripts, long intergenic ncRNA, and microRNA (miRNA). We found that miRNA undergo the largest change in overall expression pattern between the normal bronchial epithelium and the tumor cell line. We found evidence of transcription across the novel genomic sequence created from six somatic structural variants. For each part of our integrated analysis, we highlight candidate genes that have undergone the largest expression changes.
    Neoplasia (New York, N.Y.) 11/2012; 14(11):1075-86. DOI:10.1593/neo.121380 · 5.40 Impact Factor
  • Source
    ABSTRACT: Chimeric read-through RNAs are transcripts originating from two directly adjacent genes (<10 kb) on the same DNA strand. Although they are found regularly in next-generation whole transcriptome sequencing (RNA-Seq) data, they have rarely been investigated further. Therefore, their expression patterns or functions in general, and in oncogenesis in particular, are poorly understood. We used paired-end RNA-Seq and a specifically designed computational data analysis pipeline (FusionSeq) to nominate read-through events in a small discovery set of renal cell carcinomas (RCC) and confirmed them in a larger validation cohort. 324 read-through events were called overall; 22/27 (81%) selected nominees passed validation with conventional PCR and were sequenced at the junction region. We frequently identified various isoforms of a given read-through event. 2/22 read-throughs were up-regulated: BC039389-GATM was more highly expressed in RCC than in benign adjacent kidney; KLK4-KRSP1 was expressed in 46/169 (27%) RCCs, but rarely in normal tissue. KLK4-KRSP1 expression was associated with worse clinical outcome in the patient cohort. In cell lines, both read-throughs influenced molecular mechanisms (i.e. target gene expression or migration/invasion) in a way that counteracted the effect of the respective parent transcript GATM or KLK4. Our data suggest that the up-regulation of read-through RNA chimeras in tumors is not random but causes regulatory effects on cellular mechanisms and may impact patient survival.
    BMC Genomics 03/2015; 16(1):247. DOI:10.1186/s12864-015-1446-z · 4.04 Impact Factor