Previous studies revealed that Igf2 and Mpr/Igf2r are imprinted in eutherian mammals and marsupials but not in monotremes or birds. Igf2 lies in a large imprinted cluster in eutherians, and its imprinting is regulated by long-range mechanisms. As a step to understand how the imprinted cluster evolved, we have determined a 490-kb chicken sequence containing the orthologs of mammalian Ascl2/Mash2, Ins2 and Igf2. We found that most of the genes in this region are conserved between chickens and mammals, maintaining the same transcriptional polarities and exon-intron structures. However, H19, an imprinted noncoding transcript, was absent from the chicken sequence. Chicken ASCL2/CASH4 and INS, the orthologs of the imprinted mammalian genes, showed biallelic expression, further supporting the notion that imprinting evolved after the divergence of mammals and birds. The H19 imprinting center and many of the local regulatory elements identified in mammals were not found in chickens. Also, a large segment of tandem repeats and retroelements identified between the two imprinted subdomains in mice was not found in chickens. Our findings show that the imprinted genes were clustered before the emergence of imprinting and that the elements associated with imprinting probably evolved after the divergence of mammals and birds.
Bacterial artificial chromosome (BAC) clones are effective mapping and sequencing reagents. The 1.1-Mb α/δ T-cell receptor locus of humans was mapped and partially sequenced with BAC clones. Seventeen BAC clones covered the 1.1-Mb α/δ locus, with the exception of one small gap that was expected from the coverage that a 3.7-fold BAC library is likely to provide. The end sequences of the BAC inserts could be obtained directly from the BAC DNA by sequencing with the chain terminator chemistry. Five complete BAC inserts were sequenced directly by the shotgun approach. The ends of the 17 BAC inserts were distributed evenly across the locus. By several independent criteria, the BAC clones faithfully represented the genomic DNA, with the exception of a single clone with a 68-kb deletion. These BAC features led to the proposal of a new approach to sequence the human genome.
[The sequenced BAC clones, BAC956, BAC810, BAC480, BAC378, and BAC129, have been submitted to GenBank under accession nos. U85199, U85198, U85197, U85196, and U85195, respectively.]
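The statement that one small gap was to be expected from a 3.7-fold library can be illustrated with the standard Poisson approximation, in which the fraction of a target left uncovered at coverage c is about e^(-c). This is only a minimal sketch: the function names are hypothetical, and the model assumes clones are placed randomly and independently, which the abstract does not state.

```python
import math


def uncovered_fraction(coverage: float) -> float:
    """Poisson estimate of the fraction of a region hit by no clone."""
    return math.exp(-coverage)


def expected_uncovered_kb(region_kb: float, coverage: float) -> float:
    """Expected amount of sequence (kb) left uncovered in a region."""
    return region_kb * uncovered_fraction(coverage)


# Figures taken from the abstract: 3.7-fold library, 1.1-Mb (1100-kb) locus.
frac = uncovered_fraction(3.7)
gap_kb = expected_uncovered_kb(1100, 3.7)
print(f"uncovered fraction: {frac:.3f}")
print(f"expected uncovered sequence: {gap_kb:.0f} kb")
```

Under these assumptions roughly 2.5% of the locus, or a few tens of kilobases, would be expected to lack coverage, which is consistent with a single small gap.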
The advent of systems biology necessitates the cloning of nearly entire sets of protein-encoding open reading frames (ORFs), or ORFeomes, to allow functional studies of the corresponding proteomes. Here, we describe the generation of a first version of the human ORFeome using a newly improved Gateway recombinational cloning approach. Using the Mammalian Gene Collection (MGC) resource as a starting point, we report the successful cloning of 8076 human ORFs, representing at least 7263 human genes, as mini-pools of PCR-amplified products. These were assembled into the human ORFeome version 1.1 (hORFeome v1.1) collection. After assessing the overall quality of this version, we describe the use of hORFeome v1.1 for heterologous protein expression in two different expression systems at proteome scale. The hORFeome v1.1 represents a central resource for the cloning of large sets of human ORFs in various settings for functional proteomics of many types, and will serve as the foundation for subsequent improved versions of the human ORFeome.
The bacteria of the Brucella genus are responsible for a worldwide zoonosis called brucellosis. They belong to the alpha-proteobacteria group, as do many other bacteria that live in close association with a eukaryotic host. Importantly, the Brucellae are mainly intracellular pathogens, and the molecular mechanisms of their virulence are still poorly understood. Using the complete genome sequence of Brucella melitensis, we generated a database of protein-coding open reading frames (ORFs) and constructed an ORFeome library of 3091 Gateway Entry clones, each containing a defined ORF. This first version of the Brucella ORFeome (v1.1) provides the coding sequences in a user-friendly format amenable to high-throughput functional genomic and proteomic experiments, as the ORFs are conveniently transferable from the Entry clones to various Expression vectors by recombinational cloning. The cloning of the Brucella ORFeome v1.1 should help to provide a better understanding of the molecular mechanisms of virulence, including the identification of bacterial protein-protein interactions as well as interactions between bacterial effectors and their host targets.
Most of the yeast artificial chromosomes (YACs) isolated from the Xp11.23-22 region have shown instability and chimerism and are not a reliable resource for determining physical distances. We therefore constructed a long-range pulsed-field gel electrophoresis map that encompasses approximately 3.5 Mb of genomic DNA between the loci TIMP and DXS146, including a CpG-rich region around the WASP and TFE-3 gene loci. A combined YAC-cosmid contig was constructed along the genomic map and was used for fine-mapping of 15 polymorphic microsatellites and 30 expressed sequence tags (ESTs) or sequence-tagged sites (STSs), revealing the following order: tel-(SYN-TIMP)-(DXS426-ELK1)-ZNF(CA)n-L1-DXS1367-ZNF81-ZNF21-DXS6616-(HB3-OATL1pseudogenes-DXS6950)-DXS6949-DXS6941-DXS7464E(MG61)-GW1E(EBP)-DXS7927E(MG81)-RBM-DXS722-DXS7467E(MG21)-DXS1011E-WASP-DXS6940-DXS7466E(MG44)-GF1-DXS226-DXS1126-DXS1240-HB1-DXS7469E-(DXS6665-DXS1470)-TFE3-DXS7468E-SYP-DXS1208-HB2E-DXS573-DXS1331-DXS6666-DXS1039-DXS1426-DXS1416-DXS7647-DXS8222-DXS6850-DXS255-CIC-5-DXS146-cen. A sequence-ready map was constructed for an 1100-kb gene-rich interval flanked by the markers HB3 and DXS1039, from which six novel ESTs/STSs were isolated, thus increasing the number of markers used in this interval to thirty. This precise ordering is a prerequisite for the construction of a transcription map of this region, which contains numerous disease loci, including those for several forms of retinal degeneration and mental retardation. In addition, the map provides the basis for delineating the corresponding syntenic region in the mouse, where the mutants scurfy and tattered are localized.
In the process of positionally cloning a candidate gene responsible for hereditary hemochromatosis (HH), we constructed a 1.1-Mb transcript map of the region of human chromosome 6p that lies 4.5 Mb telomeric to HLA-A. A combination of three gene-finding techniques (direct cDNA selection, exon trapping, and sample sequencing) was used initially for a saturation screening of the 1.1-Mb region for expressed sequence fragments. As genetic analysis further narrowed the HH candidate locus, we completely sequenced 0.25 Mb of genomic DNA as a final measure to identify all genes. Besides the novel MHC class 1-like HH candidate gene HLA-H, we identified a family of five butyrophilin-related sequences, two genes with structural similarity to a type 1 sodium phosphate transporter, 12 novel histone genes, and a gene we named RoRet based on its strong similarity to the 52-kD Ro/SSA lupus and Sjogren’s syndrome auto-antigen and the RET finger protein. Several members of the butyrophilin family and the RoRet gene share an exon of common evolutionary origin called B30-2. The B30-2 exon was originally isolated from the HLA class 1 region, yet has apparently “shuffled” into several genes along the chromosome telomeric to the MHC. The conservation of the B30-2 exon in several novel genes and the previously described amino acid homology of HLA-H to MHC class 1 molecules provide further support that this gene-rich region of 6p21.3 is related to the MHC. Finally, we performed an analysis of the four approaches for gene finding and conclude that direct selection provides the most effective probes for cDNA screening, and that as much as 30% of ESTs in this 1.1-Mb region may be derived from noncoding genomic DNA.
[The sequence data described in this paper have been submitted to GenBank under accession nos. U90543–U90548, U90550–U90552, and U91328.]
Lurcher (Lc) is a semidominant mouse mutant that displays a characteristic ataxia in the heterozygous state beginning in the third postnatal week. This symptom results from a neurodegenerative event in the cerebellum: There is a catastrophic loss of Purkinje cells in the heterozygote animal between postnatal days 10 and 15. In an effort to identify the genetic lesion borne by Lc mice, we initiated a cloning project based on the position of the Lc mutation on mouse chromosome 6. We have extended our previous analysis of the genomic segment containing the Lc locus by isolating a set of stable and manipulable genomic clones called bacterial artificial chromosomes (BACs) that cover this region of mouse chromosome 6. These clones provided a good substrate for the isolation of markers that were used to refine the physical map of the locus. Furthermore, 20 of these markers were mapped onto our (B6CBACa-Aw-J/A-Lc × CAST/Ei)F1 × B6CBACa-Aw-J/A backcross, refining the genetic map and identifying two nonrecombinant markers (D6Rck354 and D6Rck355). These two markers, in conjunction with the closest flanking markers, were used to identify a 110-kb genomic segment that contains all four markers and hence contains the Lc locus. This small genomic segment, covered by multiple BACs, sets the stage for the final effort of this project: the identification of transcripts and of the mutation within the Lc locus.
[The Lt1 sequence has been submitted to GenBank as two ESTs; the accession numbers are U89356 and U89357.]
Duplication and deletion of the 1.4-Mb region in 17p12 that is delimited by two 24-kb low copy number repeats (CMT1A-REPs) represent frequent genomic rearrangements resulting in two common inherited peripheral neuropathies, Charcot-Marie-Tooth disease type 1A (CMT1A) and hereditary neuropathy with liability to pressure palsy (HNPP). CMT1A and HNPP exemplify a paradigm for genomic disorders wherein unique genome architectural features result in susceptibility to DNA rearrangements that cause disease. A gene within the 1.4-Mb region, PMP22, is responsible for these disorders through a gene-dosage effect in the heterozygous duplication or deletion. However, the genomic structure of the 1.4-Mb region, including other genes contained within the rearranged genomic segment, remains essentially uncharacterized. To delineate genomic structural features, investigate higher-order genomic architecture, and identify genes in this region, we constructed PAC and BAC contigs and determined the complete nucleotide sequence. This CMT1A/HNPP genomic segment contains 1,421,129 bp of DNA. A low copy number repeat (LCR) was identified, with one copy inside and two copies outside of the 1.4-Mb region. Comparison between physical and genetic maps revealed a striking difference in recombination rates between the sexes with a lower recombination frequency in males (0.67 cM/Mb) versus females (5.5 cM/Mb). Hypothetically, this low recombination frequency in males may enable a chromosomal misalignment at proximal and distal CMT1A-REPs and promote unequal crossing over, which occurs 10 times more frequently in male meiosis. In addition to three previously described genes, five new genes (TEKT3, HS3ST3B1, NPD008/CGI-148, CDRT1, and CDRT15) and 13 predicted genes were identified. Most of these predicted genes are expressed only in embryonic stages. 
Analyses of the genomic region adjacent to proximal CMT1A-REP indicated an evolutionary mechanism for the formation of proximal CMT1A-REP and the creation of novel genes by DNA rearrangement during primate speciation.
The nucleotide sequence of 1.5 Mb of genomic DNA from Mycobacterium leprae was determined using computer-assisted multiplex sequencing technology. This brings the 2.8-Mb M. leprae genome sequence to ∼66% completion. The sequences, derived from 43 recombinant cosmids, contain 1046 putative protein-coding genes, 44 repetitive regions, 3 rRNAs, and 15 tRNAs. The gene density of one per 1.4 kb is slightly lower than that of Mycoplasma (1.2 kb). Of the protein coding genes, 44% have significant matches to genes with well-defined functions. Comparison of 1157 M. leprae and 1564 Mycobacterium tuberculosis proteins shows a complex mosaic of homologous genomic blocks with up to 22 adjacent proteins in conserved map order. Matches to known enzymatic, antigenic, membrane, cell wall, cell division, multidrug resistance, and virulence proteins suggest therapeutic and vaccine targets. Unusual features of the M. leprae genome include large polyketide synthase (pks) operons, inteins, and highly fragmented pseudogenes.
[The sequence data described in this paper have been submitted to GenBank under accession nos. L78811–L78829, U00010–U00023, U15180–U15184, U15186, U15187, L01095, L01536, L04666, and L01263. On-line supplementary information for Table 1 is available at http://www.cshl.org/gr.]
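The gene density quoted for M. leprae (one gene per 1.4 kb) follows directly from the figures in the abstract, as the short check below shows. The function name is hypothetical, and the 1.5-Mb input is the rounded value given in the text, so the result is approximate.

```python
def mean_gene_spacing_kb(sequence_bp: int, gene_count: int) -> float:
    """Average amount of sequence (kb) per predicted protein-coding gene."""
    if gene_count <= 0:
        raise ValueError("gene_count must be positive")
    return sequence_bp / gene_count / 1000


# 1.5 Mb of sequence and 1046 putative protein-coding genes (from the abstract).
spacing = mean_gene_spacing_kb(1_500_000, 1046)
print(f"one gene per {spacing:.1f} kb")
```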
A contig of 21 nonchimeric yeast artificial chromosomes (YACs) has been assembled across 1.5 Mb of the multidrug resistance (MDR) gene region located at 7q21, and formatted with four previously reported probes, six newly isolated probes, and three sequence-tagged sites (STSs) from internal and end fragments of YACs. A physical map of rare cutter restriction enzyme sites across the region was also constructed by pulsed-field gel electrophoretic (PFGE) analysis of four overlapping YAC clones. The amplification unit of this region in different cell lines was then determined by Southern blot analysis on the basis of the physical map and probes. Amplified DNA was located in extrachromosomal elements in human MDR cell lines studied here, and the size of the amplification unit was determined to be discrete in one MDR amplification but variable in others.
The Down syndrome (DS) region has been defined by analyses of partial trisomy 21. The 2.5-Mb region between D21S17 and ERG is reportedly responsible for the main features of DS. Within this 2.5-Mb region, we focused previously on a distal 1.6-Mb region from an analysis of Japanese DS patients with partial trisomy 21. Previously we also performed exon-trapping and direct cDNA library screening of a fetal brain cDNA library and identified a novel gene TPRD. Further screening of a fetal heart cDNA library was performed and a total of 44 possible exons and 97 cDNA clones were obtained and mapped on a BamHI map. By rescreening other cDNA libraries and a RACE reaction, we isolated nearly full-length cDNAs of three additional genes [holocarboxylase synthetase (HCS), G protein-coupled inward rectifier potassium channel 2 (GIRK2), and a human homolog of Drosophila minibrain gene (MNB)] and a coding sequence of a novel inward rectifier potassium channel-like gene (IRKK). The gene distribution and direction of transcription were determined by mapping both ends of the cDNA sequences. We found that these genes, except IRKK, are expressed ubiquitously and are relatively large, extending from 100 kb to 300 kb on the genome. These nearly full-length cDNA sequences should facilitate understanding of the detailed genome structure of the DS region and help to elucidate the role of these genes in the etiology of DS.
Large-scale genetic studies are highly dependent on efficient and scalable multiplex SNP assays. In this study, we report the development of Molecular Inversion Probe technology with four-color, single array detection, applied to large-scale genotyping of up to 12,000 SNPs per reaction. While generating 38,429 SNP assays using this technology in a population of 30 trios from the Centre d'Etude Polymorphisme Humain family panel as part of the International HapMap project, we established SNP conversion rates of approximately 90% with concordance rates >99.6% and completeness levels >98% for assays multiplexed up to 12,000plex levels. Furthermore, these individual metrics can be "traded off" and, by sacrificing a small fraction of the conversion rate, the accuracy can be increased to very high levels. No loss of performance is seen when scaling from 6,000plex to 12,000plex assays, strongly validating the ability of the technology to suppress cross-reactivity at high multiplex levels. The results of this study demonstrate the suitability of this technology for comprehensive association studies that use targeted SNPs in indirect linkage disequilibrium studies or that directly screen for causative mutations.
The analysis of single nucleotide polymorphisms (SNPs) is increasingly utilized to investigate the genetic causes of complex human diseases. Here we present a high-throughput genotyping platform that uses a one-primer assay to genotype over 10,000 SNPs per individual on a single oligonucleotide array. This approach uses restriction digestion to fractionate the genome, followed by amplification of a specific fractionated subset of the genome. The resulting reduction in genome complexity enables allele-specific hybridization to the array. The selection of SNPs was primarily determined by computer-predicted lengths of restriction fragments containing the SNPs, and was further driven by strict empirical measurements of accuracy, reproducibility, and average call rate, which we estimate to be >99.5%, >99.9%, and >95%, respectively. With average heterozygosity of 0.38 and genome scan resolution of 0.31 cM, the SNP array is a viable alternative to panels of microsatellites (STRs). As a demonstration of the utility of the genotyping platform in whole-genome scans, we have replicated and refined a linkage region on chromosome 2p for chronic mucocutaneous candidiasis and thyroid disease, previously identified using a panel of microsatellite (STR) markers.
Sequencing of full-insert clones from full-length cDNA libraries from both Xenopus laevis and Xenopus tropicalis has been ongoing as part of the Xenopus Gene Collection Initiative. Here we present 10,967 full ORF verified cDNA clones (8049 from X. laevis and 2918 from X. tropicalis) as a community resource. Because the genome of X. laevis, but not X. tropicalis, has undergone allotetraploidization, comparison of coding sequences from these two clawed (pipid) frogs provides a unique angle for exploring the molecular evolution of duplicate genes. Within our clone set, we have identified 445 gene trios, each comprising an allotetraploidization-derived X. laevis gene pair and their shared X. tropicalis ortholog. Pairwise dN/dS comparisons within trios show strong evidence for purifying selection acting on all three members. However, dN/dS ratios between X. laevis gene pairs are elevated relative to their X. tropicalis ortholog. This difference is highly significant and indicates an overall relaxation of selective pressures on duplicated gene pairs. We have found that the paralogs that have been lost since the tetraploidization event are enriched for several molecular functions, but have found no such enrichment in the extant paralogs. Approximately 14% of the paralogous pairs analyzed here also show differential expression indicative of subfunctionalization.
A medium-density linkage map of the ovine genome has been developed. Marker data for 550 new loci were generated and merged with the previous sheep linkage map. The new map comprises 1093 markers representing 1062 unique loci (941 anonymous loci, 121 genes) and spans 3500 cM (sex-averaged) for the autosomes and 132 cM (female) on the X chromosome. There is an average spacing of 3.4 cM between autosomal loci and 8.3 cM between highly polymorphic [polymorphic information content (PIC) ≥ 0.7] autosomal loci. The largest gap between markers is 32.5 cM, and the number of gaps of >20 cM between loci, or regions where loci are missing from chromosome ends, has been reduced from 40 in the previous map to 6. Five hundred and seventy-three of the loci can be ordered on a framework map with odds of >1000:1. The sheep linkage map contains strong links to both the cattle and goat maps. Five hundred and seventy-two of the loci positioned on the sheep linkage map have also been mapped by linkage analysis in cattle, and 209 of the loci mapped on the sheep linkage map have also been placed on the goat linkage map. Inspection of ruminant linkage maps indicates that the genomic coverage by the current sheep linkage map is comparable to that of the available cattle maps. The sheep map provides a valuable resource to the international sheep, cattle, and goat gene mapping community.
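For readers unfamiliar with the PIC threshold of 0.7 used to define highly polymorphic loci in the sheep map, the sketch below computes polymorphic information content from allele frequencies using the commonly cited definition, PIC = 1 - sum(p_i^2) - sum over i<j of 2 p_i^2 p_j^2. The abstract does not say how PIC was calculated, so the use of this formula is an assumption, and the function name and example frequencies are illustrative only.

```python
from itertools import combinations


def pic(allele_freqs):
    """Polymorphic information content for one locus.

    allele_freqs: allele frequencies for the locus, summing to 1.
    """
    if abs(sum(allele_freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    homozygosity = sum(p * p for p in allele_freqs)
    # Correction for matings that are uninformative despite heterozygosity.
    cross_term = sum(2 * p * p * q * q for p, q in combinations(allele_freqs, 2))
    return 1.0 - homozygosity - cross_term


two_allele = pic([0.5, 0.5])       # a biallelic marker at its most informative
four_allele = pic([0.25] * 4)      # four equally frequent alleles
print(two_allele, four_allele)
```

With this definition a biallelic marker cannot exceed 0.375, while four equally frequent alleles give about 0.703, so a cutoff of 0.7 effectively selects multiallelic markers such as microsatellites.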
A high-throughput genotyping system for scoring single nucleotide polymorphisms (SNPs) has been developed. With this system, >1000 SNPs can be analyzed in a single assay, with a sensitivity that allows the use of single haploid cells as starting material. In the multiplex polymorphic sequence amplification step, instead of attaching universal sequences to the amplicons, primers that are unlikely to have nonspecific and productive interactions are used. Genotypes of SNPs are then determined by using the widely accessible microarray technology and the simple single-base extension assay. Three SNP panels, each consisting of >1000 SNPs, were incorporated into this system. The system was used to analyze 24 human genomic DNA samples. With 5 ng of human genomic DNA, the average detection rate was 98.22% when single probes were used, and 96.71% could be detected by dual probes in different directions. When single sperm cells were used, 91.88% of the SNPs were detectable, which is comparable to the level that was reached when very few genetic markers were used. By using a dual-probe assay, the average genotyping accuracy was 99.96% for 5 ng of human genomic DNA and 99.95% for single sperm. This system may significantly facilitate large-scale genetic analysis even when the DNA template is very limited in amount or highly degraded, as is the case for DNA obtained from paraffin-embedded cancer specimens, and may make many otherwise impractical research projects realistic and affordable.
RNA-guided engineered nucleases (RGENs) derived from the prokaryotic adaptive immune system known as CRISPR (clustered, regularly interspaced, short palindromic repeat)/Cas (CRISPR-associated) enable genome editing in human cell lines, animals, and plants but are limited by off-target effects and unwanted integration of DNA segments derived from plasmids encoding Cas9 and guide RNA at both on-target and off-target sites in the genome. Here, we deliver purified recombinant Cas9 protein and guide RNA into cultured human cells including hard-to-transfect fibroblasts and pluripotent stem cells. RGEN ribonucleoproteins (RNPs) induce site-specific mutations at frequencies of up to 79%, while reducing off-target mutations associated with plasmid transfection at off-target sites that differ by one or two nucleotides from on-target sites. RGEN RNPs cleave chromosomal DNA almost immediately after delivery and are degraded rapidly in cells, reducing off-target effects. Furthermore, RNP delivery is less stressful to human embryonic stem cells, producing at least two-fold more colonies than does plasmid transfection.
The filamentous fungus Aspergillus niger exhibits great diversity in its phenotype. It is found globally, both as marine and terrestrial strains, produces both organic acids and hydrolytic enzymes in high amounts, and some isolates exhibit pathogenicity. Although the genome of an industrial enzyme-producing A. niger strain (CBS 513.88) has already been sequenced, the versatility and diversity of this species compel additional exploration. We therefore undertook whole-genome sequencing of the acidogenic A. niger wild-type strain (ATCC 1015) and produced a genome sequence of very high quality. Only 15 gaps are present in the sequence, and half the telomeric regions have been elucidated. Moreover, sequence information from ATCC 1015 was used to improve the genome sequence of CBS 513.88. Chromosome-level comparisons uncovered several genome rearrangements, deletions, a clear case of strain-specific horizontal gene transfer, and identification of 0.8 Mb of novel sequence. Single nucleotide polymorphisms per kilobase (SNPs/kb) between the two strains were found to be exceptionally high (average: 7.8, maximum: 160 SNPs/kb). High variation within the species was confirmed with exo-metabolite profiling and phylogenetics. Detailed lists of alleles were generated, and genotypic differences were observed to accumulate in metabolic pathways essential to acid production and protein synthesis. A transcriptome analysis supported up-regulation of genes associated with biosynthesis of amino acids that are abundant in glucoamylase A, tRNA-synthases, and protein transporters in the protein producing CBS 513.88 strain. Our results and data sets from this integrative systems biology analysis resulted in a snapshot of fungal evolution and will support further optimization of cell factories based on filamentous fungi.
Large-scale analyses of expression data of eukaryotic organisms are now becoming increasingly routine. The data sets are revealing interesting and novel patterns of genomic organization, which provide insight both into molecular evolution and how structure and function of a genome interrelate. Our study investigates, for the first time, how genome organization affects expression of a gene in the Arabidopsis genome. The analyses show that neighboring genes are coexpressed. This pattern has been found for all eukaryotic genomes studied so far, but as yet, it remains unclear whether it is due to selective or nonselective influences. We have investigated reasons for coexpression of neighboring genes in Arabidopsis, and our evidence suggests that orientation of gene pairs plays a significant role, with potential sharing of regulatory elements in divergently transcribed genes. Using the data available in the KEGG database, we find evidence that genes in the same pathway are coexpressed, although this is not a major cause for the coexpression of neighboring genes.
Genome size varies greatly across angiosperms. It is well documented that, in addition to polyploidization, retrotransposon amplification has been a major cause of genome expansion. The lack of evidence for counterbalancing mechanisms that curtail unlimited genome growth has made many of us wonder whether angiosperms have a "one-way ticket to genomic obesity." We have therefore investigated an angiosperm with a well-characterized and notably small genome, Arabidopsis thaliana, for evidence of genomic DNA loss. Our results indicate that illegitimate recombination is the driving force behind genome size decrease in Arabidopsis, removing at least fivefold more DNA than unequal homologous recombination. The presence of highly degraded retroelements also suggests that retrotransposon amplification has not been confined to the last 4 million years, as is indicated by the dating of intact retroelements.
Chromatin structure is central for the regulation of gene expression, but its genome-wide organization is only beginning to be understood. Here, we examine the connection between patterns of nucleosome occupancy and the capacity to modulate gene expression upon changing conditions, i.e., transcriptional plasticity. By analyzing genome-wide data of nucleosome positioning in yeast, we find that the presence of nucleosomes close to the transcription start site is associated with high transcriptional plasticity, while nucleosomes at more distant upstream positions are negatively correlated with transcriptional plasticity. Based on this, we identify two typical promoter structures associated with low or high plasticity, respectively. The first class is characterized by a relatively large nucleosome-free region close to the start site coupled with well-positioned nucleosomes further upstream, whereas the second class displays a more evenly distributed and dynamic nucleosome positioning, with high occupancy close to the start site. The two classes are further distinguished by multiple promoter features, including histone turnover, binding site locations, H2A.Z occupancy, expression noise, and expression diversity. Analysis of nucleosome positioning in human promoters reproduces the main observations. Our results suggest two distinct strategies for gene regulation by chromatin, which are selectively employed by different genes.
Contiguous finished sequence from highly duplicated pericentromeric regions of human chromosomes is needed if we are to understand the role of pericentromeric instability in disease, and in gene and karyotype evolution. Here, we have constructed a BAC contig spanning the transition from pericentromeric satellites to genes on the short arm of human chromosome 10, and used this to generate 1.4 Mb of finished genomic sequence. Combining RT-PCR, in silico gene prediction, and paralogy analysis, we can identify two domains within the sequence. The proximal 600 kb consists of satellite-rich pericentromerically duplicated DNA which is transcript poor, containing only three unspliced transcripts. In contrast, the distal 850 kb contains four known genes (ZNF248, ZNF25, ZNF33A, and ZNF37A) and up to 32 additional transcripts of unknown function. This distal region also contains seven out of the eight intrachromosomal duplications within the sequence, including the p arm copy of the approximately 250-kb duplication which gave rise to ZNF33A and ZNF33B. By sequencing orthologs of the duplicated ZNF33 genes we have established that ZNF33A has diverged significantly at residues critical for DNA binding but ZNF33B has not, indicating that ZNF33B has remained constrained by selection for ancestral gene function. These results provide further evidence of gene formation within intrachromosomal duplications, but indicate that recent interchromosomal duplications at this centromere have involved transcriptionally inert, satellite rich DNA, which is likely to be heterochromatic. This suggests that any novel gene structures formed by these interchromosomal events would require relocation to a more open chromatin environment to be expressed.
We have previously localized the core centromere protein-binding domain of a 10q25.2-derived neocentromere to an 80-kb genomic region. Detailed analysis has indicated that the 80-kb neocentromere (NC) DNA has a similar overall organization to the corresponding region on a normal chromosome 10 (HC) DNA, derived from a genetically unrelated CEPH individual. Here we report sequencing of the HC DNA and its comparison to the NC sequence. Single-base differences were observed at a maximum rate of 4.6 per kb; however, no deletions, insertions, or other structural rearrangements were detected. To investigate whether the observed changes, or subsets of these, might be de novo mutations involved in neocentromerization (i.e., in committing a region of a chromosome to neocentromere formation), the progenitor DNA (PnC) from which the NC DNA descended, was cloned and sequenced. Direct comparison of the PnC and NC sequences revealed 100% identity, suggesting that the differences between NC and HC DNA are single nucleotide polymorphisms (SNPs) and that formation of the 10q25.2 NC did not involve a change in DNA sequence in the core centromere protein-binding NC region. This is the first study in which a cloned NC DNA has been compared directly with its inactive progenitor DNA at the primary sequence level. The results form the basis for future sequence comparison outside the core protein-binding domain, and provide direct support for the involvement of an epigenetic mechanism in neocentromerization.
[The sequences in this paper have been submitted to GenBank under accession nos. AF222855 (not yet available) for HC; AF042484 for NCI; AF222854 (not yet available) for NCII; and AF222856 (not yet available) for PnC.]
The genome of the halophilic archaeon Halobacterium sp. NRC-1 and predicted proteome have been analyzed by computational methods and reveal characteristics relevant to life in an extreme environment distinguished by hypersalinity and high solar radiation: (1) The proteome is highly acidic, with a median pI of 4.9 and mostly lacking basic proteins. This characteristic correlates with high surface negative charge, determined through homology modeling, as the major adaptive mechanism of halophilic proteins to function in nearly saturating salinity. (2) Codon usage displays the expected GC bias in the wobble position and is consistent with a highly acidic proteome. (3) Distinct genomic domains of NRC-1 with bacterial character are apparent by whole proteome BLAST analysis, including two gene clusters coding for a bacterial-type aerobic respiratory chain. This result indicates that the capacity of halophiles for aerobic respiration may have been acquired through lateral gene transfer. (4) Two regions of the large chromosome were found with relatively lower GC composition and overrepresentation of IS elements, similar to the minichromosomes. These IS-element-rich regions of the genome may serve to exchange DNA between the three replicons and promote genome evolution. (5) GC-skew analysis showed evidence for the existence of two replication origins in the large chromosome. This finding and the occurrence of multiple chromosomes indicate a dynamic genome organization with eukaryotic character.
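The GC-skew analysis mentioned in point (5) can be sketched as follows. GC skew is conventionally computed as (G - C)/(G + C) in windows along the sequence, and turning points in the cumulative skew are commonly read as candidate replication origins or termini. The abstract gives no window size or implementation details, so everything below (function name, window length, and the synthetic toy sequence) is an assumption for illustration, not the authors' procedure and not Halobacterium data.

```python
from itertools import accumulate


def gc_skew(seq: str, window: int):
    """Return (G - C) / (G + C) for consecutive non-overlapping windows."""
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        # Windows with no G or C are assigned a skew of zero.
        skews.append((g - c) / (g + c) if (g + c) else 0.0)
    return skews


# Synthetic sequence: a G-rich half followed by a C-rich half.
toy = "GGGC" * 5 + "CCCG" * 5
skews = gc_skew(toy, window=4)
cumulative = list(accumulate(skews))

# The window at which the cumulative skew peaks marks the switch in strand bias.
turning_point = cumulative.index(max(cumulative))
print(skews)
print(cumulative)
print(turning_point)
```

On a real chromosome one would use windows of several kilobases and look for multiple extrema; more than one pair of turning points is the kind of pattern that would suggest more than one origin.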
Recent genetic analyses in worms, flies, and mammals illustrate the importance of bioactive peptides in controlling numerous complex behaviors, such as feeding and circadian locomotion. To pursue a comprehensive genetic analysis of bioactive peptide signaling, we have scanned the recently completed Drosophila genome sequence for G protein-coupled receptors sensitive to bioactive peptides (peptide GPCRs). Here we describe 44 genes that represent the vast majority, and perhaps all, of the peptide GPCRs encoded in the fly genome. We also scanned for genes encoding potential ligands and describe 22 bioactive peptide precursors. At least 32 Drosophila peptide receptors appear to have evolved from common ancestors of 15 monophyletic vertebrate GPCR subgroups (e.g., the ancestral gastrin/cholecystokinin receptor). Six pairs of receptors are paralogs, representing recent gene duplications. Together, these findings shed light on the evolutionary history of peptide GPCRs, and they provide a template for physiological and genetic analyses of peptide signaling in Drosophila.
We have tested 80 expressed sequence-tagged site (eSTS) markers assigned to human chromosome 11 by the Genexpress program on a panel of somatic cell hybrids containing parts of this chromosome, characterized by cytogenetic data, reference markers, and with respect to the Généthon microsatellite genetic map. Sixty-eight new gene transcripts have been assigned to 25 subregions, one of which was newly defined by five of the eSTS markers. The markers are distributed on the short and long arms in agreement with their physical length. The genic map thus obtained has been integrated with the cytogenetic, genetic, and disease maps. Two eSTS markers have been further mapped with respect to a yeast artificial chromosome (YAC) contig close to the brain-derived neurotrophic factor (BDNF) gene and thus provide potential candidate genes for the mental retardation phenotype of WAGR (Wilms' tumor, aniridia, genitourinary abnormalities and mental retardation) syndrome. Altogether, the 68 new gene transcripts localized here represent more than a threefold increase in the number of unknown regionalized genes that could reveal potential candidate genes for the numerous orphan pathologies associated with chromosome 11.
A total of 57.8 Mb of publicly available rice (Oryza sativa L.) DNA sequence was searched to determine the frequency and distribution of different simple sequence repeats (SSRs) in the genome. SSR loci were categorized into two groups based on the length of the repeat motif. Class I, or hypervariable markers, consisted of SSRs ≥20 bp, and Class II, or potentially variable markers, consisted of SSRs ≥12 bp and <20 bp. The occurrence of Class I SSRs in end-sequences of EcoRI- and HindIII-digested BAC clones was one SSR per 40 kb, whereas in continuous genomic sequence (represented by 27 fully sequenced BAC and PAC clones), the frequency was one SSR every 16 kb. Class II SSRs were estimated to occur every 3.7 kb in BAC ends and every 1.9 kb in fully sequenced BAC and PAC clones. GC-rich trinucleotide repeats (TNRs) were most abundant in protein-coding portions of ESTs and in fully sequenced BACs and PACs, whereas AT-rich TNRs showed no such preference, and di- and tetranucleotide repeats were most frequently found in noncoding, intergenic regions of the rice genome. Microsatellites with poly(AT)n repeats represented the most abundant and polymorphic class of SSRs but were frequently associated with the Micropon family of miniature inverted-repeat transposable elements (MITEs) and were difficult to amplify. A set of 200 Class I SSR markers was developed and integrated into the existing microsatellite map of rice, providing immediate links between the genetic, physical, and sequence-based maps. This contribution brings the number of microsatellite markers that have been rigorously evaluated for amplification, map position, and allelic diversity in Oryza spp. to a total of 500.
[Clone sequences for 199 markers (RM1–RM88, RM200–RM345) developed in this lab are available as GenBank accessions AF343840–AF343869 and AF344003–AF344169.]
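The Class I / Class II length thresholds described in the rice SSR abstract can be sketched as follows. This is an illustrative example, not the search procedure used in the study: restricting the scan to perfect repeats of 2 to 4 bp motifs, the regular expression, and the function names are all assumptions.

```python
import re

# Perfect tandem repeats of a 2-4 bp motif (assumed scope; the study's
# actual search tool and motif range are not given in the abstract).
SSR_PATTERN = re.compile(r"([ACGT]{2,4}?)\1+")


def classify_ssr(length_bp):
    """Apply the abstract's thresholds: Class I >= 20 bp, Class II 12-19 bp."""
    if length_bp >= 20:
        return "Class I"
    if length_bp >= 12:
        return "Class II"
    return None


def find_ssrs(seq):
    """Return (start, motif, length_bp, class) for each qualifying repeat."""
    hits = []
    for m in SSR_PATTERN.finditer(seq.upper()):
        motif = m.group(1)
        if len(set(motif)) == 1:
            continue  # skip homopolymer runs such as AAAA
        ssr_class = classify_ssr(len(m.group(0)))
        if ssr_class:
            hits.append((m.start(), motif, len(m.group(0)), ssr_class))
    return hits
```

For example, a run of ten AT units (20 bp) would be reported as Class I, whereas four CGG units (12 bp) would be reported as Class II.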
The genetic factors involved in type II diabetes are still unknown. To address this problem, we are creating a 10 to 15 cM genetic map on 444 individuals from 32 Mexican American families ascertained on a type II diabetic proband. Using highly polymorphic microsatellite markers and a multipoint variance components method, we found evidence for linkage of plasma glucose concentration 2 hr after oral glucose administration to two regions on chromosome 11: beta-hemoglobin (HBB) and markers D11S899/D11S1324 near the sulfonylurea receptor (SUR) gene. Lod scores at these two loci were 2.77 and 3.37, respectively. The SUR gene region accounted for 44.7% of the phenotypic variance. Evidence for linkage to fasting glucose concentration was also observed for two loci on chromosome 6, one of which is identical to a proposed susceptibility locus for type I diabetes (D6S290). When diabetics were excluded from the analyses, all lod scores became zero, suggesting that the observed linkages were with the trait diabetes rather than with normal variation in glucose levels. Results were similar whether all diabetics were included in the analyses or only those who were not under treatment with oral antidiabetic agents or insulin.
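For readers unfamiliar with lod scores, a lod score is the base-10 logarithm of the ratio between the likelihood of the data under linkage and the likelihood under no linkage. The short sketch below only illustrates that definition; the function names are hypothetical, and it does not reproduce the variance components method used in the study.

```python
import math


def lod_from_likelihoods(likelihood_linked, likelihood_unlinked):
    """lod = log10( L(linkage) / L(no linkage) )."""
    return math.log10(likelihood_linked / likelihood_unlinked)


def odds_from_lod(lod):
    """Convert a lod score back to the likelihood ratio it represents."""
    return 10 ** lod


# The two lod scores reported in the abstract, expressed as likelihood ratios.
for locus, lod in [("HBB", 2.77), ("D11S899/D11S1324", 3.37)]:
    print(f"{locus}: lod {lod} corresponds to odds of about {odds_from_lod(lod):.0f}:1")
```

Read this way, a lod score of 3.37 means the data are more than two thousand times as likely under linkage as under no linkage, and a lod score of zero means the two hypotheses are equally supported.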
The ATP-binding cassette (ABC) transporter superfamily contains membrane proteins that translocate a variety of substrates across extra- and intra-cellular membranes. Genetic variation in these genes is the cause of or contributor to a wide variety of human disorders with Mendelian and complex inheritance, including cystic fibrosis, neurological disease, retinal degeneration, cholesterol and bile transport defects, anemia, and drug response. Conservation of the ATP-binding domains of these genes has allowed the identification of new members of the superfamily based on nucleotide and protein sequence homology. Phylogenetic analysis is used to divide all 48 known ABC transporters into seven distinct subfamilies of proteins. For each gene, the precise map location on human chromosomes, expression data, and localization within the superfamily has been determined. These data allow predictions to be made as to potential functions or disease phenotypes associated with each protein. In this paper, we review the current state of knowledge on all human ABC genes in inherited disease and drug resistance. In addition, the availability of the complete Drosophila genome sequence allows the comparison of the known human ABC genes with those in the fly genome. The combined data enable an evolutionary analysis of the superfamily. Complete characterization of all ABC from the human genome and from model organisms will lead to important insights into the physiology and the molecular basis of many human disorders.
Estimates of genetic population structure (F(ST)) were constructed from all autosomes in two large SNP data sets. The Perlegen data set contains genotypes on approximately 1 million SNPs segregating in all three samples of Americans of African, Asian, and European descent; and the Phase I HapMap data set contains genotypes on approximately 0.6 million SNPs segregating in all four samples from specific Caucasian, Chinese, Japanese, and Yoruba populations. Substantial heterogeneity of F(ST) values was found between segments within chromosomes, although there was similarity between the two data sets. There was also substantial heterogeneity among population-specific F(ST) values, with the relative sizes of these values often changing along each chromosome. Population-structure estimates are often used as indicators of natural selection, but the analyses presented here show that individual-marker estimates are too variable to be useful. There is inherent variation in these statistics because of variation in genealogy even among neutral loci, and values at pairs of loci are correlated to an extent that reflects the linkage disequilibrium between them. Furthermore, it may be that the best indications of selection will come from population-specific F(ST) values rather than the usually reported population-average values.
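To make the quantity concrete, the following is a minimal sketch of a single-SNP F(ST) computed from population allele frequencies as (H_T - H_S) / H_T. The abstract does not specify the estimator used, so this textbook heterozygosity form, the equal weighting of populations, and the function name are assumptions made for illustration only.

```python
def fst_biallelic(freqs):
    """Single-marker F_ST from per-population frequencies of one allele.

    H_T is the expected heterozygosity at the mean allele frequency;
    H_S is the mean of the within-population expected heterozygosities.
    Populations are weighted equally (an assumption of this sketch).
    """
    n = len(freqs)
    p_bar = sum(freqs) / n
    h_t = 2 * p_bar * (1 - p_bar)
    h_s = sum(2 * p * (1 - p) for p in freqs) / n
    if h_t == 0:
        return 0.0  # marker is monomorphic across all populations
    return (h_t - h_s) / h_t
```

Because the value at any one SNP depends on only a handful of frequencies, it varies widely from marker to marker, which is consistent with the abstract's caution against relying on individual-marker estimates.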
Elevated galactose levels can be caused by several enzyme defects, one of which is galactokinase deficiency. Galactokinase deficiency causes congenital cataracts during infancy and presenile cataracts in the adult population. We have isolated the mouse cDNA for galactokinase, which shares extensive amino acid sequence homology, 88% identity, with a recently cloned human galactokinase. It is expressed in all tissues examined. In an interspecific backcross analysis galactokinase maps to the distal region of mouse chromosome 11, a region that is homologous to human chromosome 17q22-25. The availability of the mouse gene provides an opportunity to make a knockout model for galactokinase deficiency.
With the human genome sequence approaching completion, a major challenge is to identify the locations and encoded protein sequences of all human genes. To address this problem we have developed a new gene identification algorithm, GenomeScan, which combines exon-intron and splice signal models with similarity to known protein sequences in an integrated model. Extensive testing shows that GenomeScan can accurately identify the exon-intron structures of genes in finished or draft human genome sequence with a low rate of false-positives. Application of GenomeScan to 2.7 billion bases of human genomic DNA identified at least 20,000-25,000 human genes out of an estimated 30,000-40,000 present in the genome. The results show an accurate and efficient automated approach for identifying genes in higher eukaryotic genomes and provide a first-level annotation of the draft human genome.
No single experimental method can discover all connections in the interactome. A computational approach can help by integrating data from multiple, often unrelated, proteomics and genomics pipelines. Reconstructing global networks of functional coupling (FC) faces the challenges of scale and heterogeneity: how to efficiently integrate huge amounts of diverse data from multiple organisms while ensuring high accuracy. We developed FunCoup, an optimized Bayesian framework, to resolve these issues. Because interactomes comprise functional coupling of many types, FunCoup annotates network edges with confidence scores in support of different kinds of interactions: physical interaction, protein complex member, metabolic, or signaling link. This capability boosted overall accuracy. On the whole, the constructed framework was comprehensively tested to optimize the overall confidence and ensure seamless, automated incorporation of new data sets of heterogeneous types. Using over 50 data sets in seven organisms and extensively transferring information between orthologs, FunCoup predicted global networks in eight eukaryotes. For the Ciona intestinalis network, only orthologous information was used, and it recovered a significant number of experimental facts. FunCoup predictions were validated on independent cancer mutation data. We show how FunCoup can be used for discovering candidate members of the Parkinson and Alzheimer pathways. Cross-species pathway conservation analysis provided further support to these observations.
The gastrointestinal microbiome undergoes shifts in species and strain abundances, yet dynamics involving closely related microorganisms remain largely unknown because most methods cannot resolve them. We developed new metagenomic methods and utilized them to track species and strain level variations in microbial communities in 11 fecal samples collected from a premature infant during the first month of life. 96 % of the sequencing reads were assembled into scaffolds of >500 bp length that could be assigned to organisms at the strain level. Six essentially complete (~99 %) and two near-complete genomes were assembled for bacteria that comprised as little as 1 % of the community, as well as nine partial genomes of bacteria representing as little as 0.05 %. In addition, three viral genomes were assembled and assigned to their hosts. The relative abundance of three Staphylococcus epidermidis strains, as well as three phage that infect them, changed dramatically over time. Genes possibly related to these shifts include those for resistance to antibiotics, heavy metals and phage. At the species level we observed the decline of an early-colonizing Propionibacterium acnes strain similar to SK137 and the proliferation of novel Propionibacterium and Peptoniphilus species late in colonization. The Propionibacterium species differed in their ability to metabolize carbon compounds such as inositol and sialic acid, indicating that shifts in species composition likely impact the metabolic potential of the community. These results highlight the benefit of reconstructing complete genomes from metagenomic data and demonstrate methods for achieving this goal.
Balanced chromosome rearrangements (BCRs) can cause genetic diseases by disrupting or inactivating specific genes, and the characterization of breakpoints in disease-associated BCRs has been instrumental in the molecular elucidation of a wide variety of genetic disorders. However, mapping chromosome breakpoints using traditional methods, such as in situ hybridization with fluorescent dye-labeled bacterial artificial chromosome clones (BAC-FISH), is rather laborious and time-consuming. In addition, the resolution of BAC-FISH is often insufficient to unequivocally identify the disrupted gene. To overcome these limitations, we have performed shotgun sequencing of flow-sorted derivative chromosomes using "next-generation" (Illumina/Solexa) multiplex sequencing-by-synthesis technology. As shown here for three different disease-associated BCRs, the coverage attained by this platform is sufficient to bridge the breakpoints by PCR amplification, and this procedure allows the determination of their exact nucleotide positions within a few weeks. Its implementation will greatly facilitate large-scale breakpoint mapping and gene finding in patients with disease-associated balanced translocations.
The somatic mutation burden in healthy white blood cells (WBCs) is not well known. Based on deep whole-genome sequencing, we estimate that approximately 450 somatic mutations accumulated in the nonrepetitive genome within the healthy blood compartment of a 115-yr-old woman. The detected mutations appear to have been harmless passenger mutations: They were enriched in noncoding, AT-rich regions that are not evolutionarily conserved, and they were depleted for genomic elements where mutations might have favorable or adverse effects on cellular fitness, such as regions with actively transcribed genes. The distribution of variant allele frequencies of these mutations suggests that the majority of the peripheral white blood cells were offspring of two related hematopoietic stem cell (HSC) clones. Moreover, telomere lengths of the WBCs were significantly shorter than telomere lengths from other tissues. Together, this suggests that the finite lifespan of HSCs, rather than somatic mutation effects, may lead to hematopoietic clonal evolution at extreme ages.
Many cancers are characterized by chromosomal aberrations that may be predictive of disease outcome. Human neuroblastomas are characterized by somatically acquired copy number changes, including loss of heterozygosity (LOH) at multiple chromosomal loci, and these aberrations are strongly associated with clinical phenotype including patient outcome. We developed a method to assess region-specific LOH by genotyping multiple SNPs simultaneously in DNA from tumor tissues. We identified informative SNPs at an average 293-kb density across nine regions of recurrent LOH in human neuroblastomas. We also identified SNPs in two copy number neutral regions, as well as two regions of copy number gain. SNPs were PCR-amplified in 12-plex reactions and used in solution-phase single-nucleotide extension incorporating tagged dideoxynucleotides. Each extension primer had 5' complementarity to one of 2000 oligonucleotides on a commercially available tag-array platform allowing for solid-phase sorting and identification of individual SNPs. This approach allowed for simultaneous detection of multiple regions of LOH in six human neuroblastoma-derived cell lines, and, more importantly, 14 human neuroblastoma primary tumors. Concordance with conventional genotyping was nearly absolute. Detection of LOH in this assay may not require comparison to matched normal DNAs because of the redundancy of informative SNPs in each region. The customized tag-array system for LOH detection described here is rapid, results in parallel assessment of multiple genomic alterations, and may speed identification of and/or assaying prognostically relevant DNA copy number alterations in many human cancers.
Over 100 distinct disease-associated mutations have been identified in the breast-ovarian cancer susceptibility gene BRCA1. Loss of the wild-type allele in >90% of tumors from patients with inherited BRCA1 mutations indicates tumor suppressive function. The low incidence of somatic mutations suggests that BRCA1 inactivation in sporadic tumors occurs by alternative mechanisms, such as interstitial chromosomal deletion or reduced transcription. To identify possible features of the BRCA1 genomic region that may contribute to chromosomal instability as well as potential transcriptional regulatory elements, a 117,143-bp DNA sequence encompassing BRCA1 was obtained by random sequencing of four cosmids identified from a human chromosome 17 specific library. The 24 exons of BRCA1 span an 81-kb region that has an unusually high density of Alu repetitive DNA (41.5%), but relatively low density (4.8%) of other repetitive sequences. BRCA1 intron lengths range in size from 403 bp to 9.2 kb and contain the intragenic microsatellite markers D17S1323, D17S1322, and D17S855, which localize to introns 12, 19, and 20, respectively. In addition to BRCA1, the contig contains two complete genes: Rho7, a member of the rho family of GTP binding proteins, and VAT1, an abundant membrane protein of cholinergic synaptic vesicles. Partial sequences of the 1A1-3B B-box protein pseudogene and IFP 35, an interferon induced leucine zipper protein, reside within the contig. An L21 ribosomal protein pseudogene is embedded in BRCA1 intron 13. The order of genes on the chromosome is: centromere-IFP 35-VAT1-Rho7-BRCA1-1A1-3B-telomere.
Chromatin immunoprecipitation coupled with high-throughput sequencing (ChIP-seq) has become the dominant technique for mapping transcription factor (TF) binding regions genome-wide. We performed an integrative analysis centered around 457 ChIP-seq data sets on 119 human TFs generated by the ENCODE Consortium. We identified highly enriched sequence motifs in most data sets, revealing new motifs and validating known ones. The motif sites (TF binding sites) are highly conserved evolutionarily and show distinct footprints upon DNase I digestion. We frequently detected secondary motifs in addition to the canonical motifs of the TFs, indicating tethered binding and cobinding between multiple TFs. We observed significant position and orientation preferences between many cobinding TFs. Genes specifically expressed in a cell line are often associated with a greater occurrence of nearby TF binding in that cell line. We observed cell-line-specific secondary motifs that mediate the binding of the histone deacetylase HDAC2 and the enhancer-binding protein EP300. TF binding sites are located in GC-rich, nucleosome-depleted, and DNase I sensitive regions, flanked by well-positioned nucleosomes, and many of these features show cell type specificity. The GC-richness may be beneficial for regulating TF binding because, when unoccupied by a TF, these regions are occupied by nucleosomes in vivo. We present the results of our analysis in a TF-centric web repository Factorbook (http://factorbook.org) and will continually update this repository as more ENCODE data are generated.
A large set of mRNA and encoded protein sequences, from orthologous murine and human genes, was compiled to analyze statistical, biological, and evolutionary properties of coding and noncoding transcribed sequences. Protein sequence conservation varied between 36% and 100% identity, with an average value of 85%. The average degree of nucleotide sequence identity for the corresponding coding sequences was also approximately 85%, whereas 5' and 3' untranslated regions (UTRs) were less conserved, with aligned identities of 67% and 69%, respectively. For some mouse and human genes, nucleotide sequences are more highly conserved than the encoded protein sequences. A subset of 32 sequences, consisting of only mouse/human protein pairs for which the human sequence represents a positionally cloned disease gene, had properties very similar to the larger data set, suggesting that our data are representative of the genome as a whole. With respect to sequence conservation, two interesting outliers are the breast cancer (BRCA1) gene product and the testis-determining factor (SRY), both of which display among the lowest degrees of sequence identity. The occurrence of both introns and repetitive elements (e.g., Alu, B1) in 5' and 3' UTRs was also studied. These results provide one benchmark for the "comparative genomics" of mice and humans, with practical implications for the cross-referencing of transcript maps. Also, they should prove useful in estimating the additional sampling diversity provided by mouse EST sequencing projects designed to complement the existing human cDNA collection.
The region p13 of the short arm of human chromosome 11 has been studied intensely during the search for genes involved in the etiology of the Wilms' tumor, aniridia, genitourinary abnormalities, mental retardation (WAGR) syndrome, and related conditions. The gene map for this region is far from being complete, however, strengthening the need for additional gene identification efforts. We describe the extension of an existing contig map with P1-derived artificial chromosomes (PACs) to cover 7.5 Mb of 11p13-14.1. The extended sequence-ready contig was established by end probe walking and fingerprinting and consists of 201 PAC clones. Utilizing bins defined by overlapping PACs, we generated a detailed gene map containing 20 genes as well as 22 anonymous ESTs which have been identified by searching the RH databases. RH maps and our established gene map show global correlation, but the limits of resolution of the current RH panels are evident at this scale. Initial expression studies on the novel genes have been performed by Northern blot analyses. To extend these expression profiles, corresponding mouse cDNA clones were identified by database search and employed for Northern blot analyses and RNA in situ hybridizations to mouse embryo sections. Genomic sequencing of clones along a minimal tiling path through the contig is currently under way and will facilitate these expression studies by in silico gene identification approaches.
The Usher syndrome type 1C (USH1C) and familial hyperinsulinism (HI) loci have been assigned to chromosome 11p14-15.1, within the interval D11S419-D11S1310. We have constructed a yeast artificial chromosome (YAC) contig, extending from D11S926 to D11S899, which encompasses the critical regions for both USH1C and HI and spans an estimated genetic distance of approximately 4 cM. A minimal set of six YAC clones constitute the contig, with another 22 YACs confirming the order of sequence-tagged sites (STSs) and position of YACs on the contig. A total of 40 STSs, including 10 new STSs generated from YAC insert-end sequences and inter-Alu PCR products, were used to order the clones within the contig. This physical map provides a resource for identification of gene transcripts associated with USH1C, HI, and other genetic disorders that map to the D11S926-D11S899 interval.
A major barrier to conceptual advances in understanding the mechanisms and regulation of imprinting of a genomic region is our relatively poor understanding of the overall organization of genes and of the potentially important cis-acting regulatory sequences that lie in the nonexonic segments that make up 97% of the genome. Interspecies sequence comparison offers an effective approach to identify sequence from conserved functional elements. In this article we describe the successful use of this approach in comparing an approximately 1-Mb imprinted genomic domain on mouse chromosome 7 to its orthologous region on human 11p15.5. Within the region, we identified 112 exons of known genes as well as a novel gene identified uniquely in the mouse region, termed Msuit, that was found to be imprinted. In addition to these coding elements, we identified 33 CpG islands and 49 orthologous nonexonic, nonisland sequences that met our criteria as being conserved, together making up 4.1% of the total sequence. These conserved noncoding sequence elements were generally clustered near imprinted genes and the majority were between Igf2 and H19 or within Kvlqt1. Finally, the location of CpG islands provided evidence that suggested a two-island rule for imprinted genes. This study provides the first global view of the architecture of an entire imprinted domain and provides candidate sequence elements for subsequent functional analyses.
Jacobsen syndrome is a haploinsufficiency disorder caused most frequently by terminal deletion of part of the long arm of chromosome 11, with breakpoints in 11q23.3-11q24.2. Inheritance of an expanded p(CCG)n trinucleotide repeat at the folate-sensitive fragile site FRA11B has been implicated in the generation of the chromosome breakpoint in several Jacobsen syndrome patients. The majority of such breakpoints, however, map distal to this fragile site and are not linked with its expression. To characterize these distal breakpoints and ultimately to further investigate the mechanisms of chromosome breakage, a 40-Mb YAC contig covering the distal long arm of chromosome 11 was assembled. The utility of the YAC contig was demonstrated in three ways: (1) by rapidly mapping the breakpoints from two new Jacobsen syndrome patients using FISH; (2) by demonstrating conversion to high resolution PAC contigs after direct screening of PAC library filters with a YAC clone containing a Jacobsen syndrome breakpoint; and (3) by placing 23 Jacobsen syndrome breakpoints on the physical map. This analysis has suggested the existence of at least two new Jacobsen syndrome breakpoint cluster regions in distal chromosome 11.
Best’s vitelliform macular dystrophy is an autosomal dominant disorder of unknown causes. To identify the underlying gene defect the disease locus has been mapped to an ∼1.4-Mb region on chromosome 11q12–q13.1. As a prerequisite for its positional cloning we have assembled a high coverage PAC contig of the candidate region. Here, we report the construction of a primary transcript map that places a total of 19 genes within the Best’s disease region. This includes 14 transcripts of as yet unknown function obtained by EST mapping and/or cDNA selection and five genes mapped previously to the interval (CD5, PGA, DDB1, FEN1, and FTH1). Northern blot analyses were performed to determine the expression profiles in various human tissues. At least three genes appear to be good candidates for Best’s disease based on their abundant expression in retina or retinal pigment epithelium. Additional information on the functional properties of these genes, as well as mutation analyses in Best’s disease patients, have to await their further characterization.
[The GenBank/EMBL accession numbers and details of the isolation, localization, and characterization of ESTs and selected cDNAs are available as online supplements in Online Tables 1–3 at http://www.genome.org .]
We have combined genetic, radiation-reduced somatic cell hybrid (RRH), fluorescent in situ hybridization (FISH), and physical mapping methods to generate a contig of overlapping YAC, PAC, and cosmid clones corresponding to >3 continuous Mb in 11q13. A total of 15 STSs [7 genes (GSTP1, ACTN, PC, MLK3, FRA1, SEA, HNP36), 4 polymorphic loci (D11S807, D11S987, GSTP1, D11S913), 3 ESTs (D11S1956E, D11S951E, and WI-12191), and 1 anonymous STS (D11S703)], mapping to three independent RRH segregation groups, identified 26 YAC, 7 PAC, and 16 cosmid clones from the CGM, Roswell Park, CEPH Mark I, and CEPH MegaYAC YAC libraries, a 5 genome equivalent PAC library, and a chromosome 11-specific cosmid library. Thirty-six Alu–PCR products derived from 10 anonymous bacteriophage λ clones, a cosmid containing the polymorphic marker D11S460, or STS-positive YAC or cosmid clones were identified and used to screen selected libraries by hybridization, resulting in the identification of 19 additional clones. The integrity and relative position of a subset of clones were confirmed by FISH and found to be consistent with the physical and RRH mapping results. The combination of STS and Alu–PCR-based approaches has proven to be successful in attaining contiguous cloned coverage in this very GC-rich region, thereby establishing for the first time the absolute order and distance between the markers: CEN–MLK3–(D11S1956E/D11S951E/WI-12191)–FRA1–D11S460–SEA–HNP36/D11S913–ACTN–PC–D11S703–GSTP1–D11S987–TEL.
[On-line supplementary material concerning screening materials and clones referred to in the text as Table 1 is available at http://genome.wustl.edu/gerhard/gerhard.html or http://www.cshl.org/gr . The sequence data described in this paper have been submitted to the GenBank data library under accession no. AF009361 .]
We have localized the human homolog of the rabbit vasopressin-activated calcium-mobilizing receptor VACM-1 to a region close to the gene for ataxia telangiectasia ATM on chromosome 11q22-23. We have determined the complete amino acid sequence of the human Hs-VACM-1 protein, which is 780 amino acids long. The human and rabbit sequences are highly conserved, differing at only seven amino acids. Northern analysis of the human gene showed expression in a wide range of human tissues. The Hs-VACM-1 gene has homology with the Caenorhabditis elegans gene Ce-cul-5, a member of a family of cullin genes that are involved in cell cycle regulation and that might, when mutated, contribute to tumor progression.
Del(13)Svea36H (Del36H) is a deletion of approximately 20% of mouse chromosome 13 showing conserved synteny with human chromosome 6p22.1-6p22.3/6p25. The human region is lost in some deletion syndromes and is the site of several disease loci. Heterozygous Del36H mice show numerous phenotypes and may model aspects of human genetic disease. We describe 12.7 Mb of finished, annotated sequence from Del36H. Del36H has a higher gene density than the draft mouse genome, reflecting high local densities of three gene families (vomeronasal receptors, serpins, and prolactins) which are greatly expanded relative to human. Transposable elements are concentrated near these gene families. We therefore suggest that their neighborhoods are gene factories, regions of frequent recombination in which gene duplication is more frequent. The gene families show different proportions of pseudogenes, likely reflecting different strengths of purifying selection and/or gene conversion. They are also associated with relatively low simple sequence concentrations, which vary across the region with a periodicity of approximately 5 Mb. Del36H contains numerous evolutionarily conserved regions (ECRs). Many lie in noncoding regions, are detectable in species as distant as Ciona intestinalis, and therefore are candidate regulatory sequences. This analysis will facilitate functional genomic analysis of Del36H and provides insights into mouse genome evolution.
An essential step in Serial Analysis of Gene Expression (SAGE) is tag mapping, which refers to the unambiguous determination of the gene represented by a SAGE tag. Current resources for tag mapping are incomplete, and thus do not allow assessment of the efficacy of SAGE in transcript identification. A method of tag mapping is described here and applied to the Drosophila melanogaster and Caenorhabditis elegans genomes; it permits detailed assessment of SAGE and provides tag-mapping resources that were previously unavailable for these organisms. In our method, a conceptual transcriptome is constructed from genomic sequence and annotation by extending predicted coding regions to include UTRs on the basis of EST and cDNA alignments, UTR length distributions, and polyadenylation signals. Analysis of extracted tags suggests that, using the standard SAGE procedure, expression of 8% of D. melanogaster and 15% of C. elegans genes cannot be detected unambiguously by SAGE because of shared tag sequence or the lack of a site for the anchoring enzyme NlaIII. Both increasing tag length by 2-3 bp and using Sau3A instead of NlaIII as the anchoring enzyme increase the potential for transcript detection. This work identifies and quantifies genes not amenable to SAGE analysis, in addition to providing tag-to-gene mappings for two model organisms.
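The tag-extraction step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names (extract_tag, map_tags) and the toy transcripts are hypothetical. It assumes the standard SAGE convention that a tag is the 10 bp immediately downstream of the 3'-most NlaIII site (CATG), and it treats a transcript whose last site lies too close to the 3' end to yield a full-length tag as undetectable, which is a simplification. Passing anchor="GATC" (the Sau3A site) or a larger tag_len gives a rough way to explore the alternatives mentioned in the abstract.

```python
from collections import defaultdict


def extract_tag(transcript, anchor="CATG", tag_len=10):
    """Return the tag downstream of the 3'-most anchoring site, or None.

    None means the transcript has no anchoring site, or the site is too
    close to the 3' end to give a full-length tag (a simplification).
    """
    seq = transcript.upper()
    pos = seq.rfind(anchor)  # 3'-most occurrence of the anchoring site
    if pos == -1:
        return None
    start = pos + len(anchor)
    tag = seq[start:start + tag_len]
    return tag if len(tag) == tag_len else None


def map_tags(transcripts, anchor="CATG", tag_len=10):
    """Classify genes as uniquely tagged, ambiguous, or undetectable.

    `transcripts` maps a gene name to one transcript sequence
    (hypothetical input format chosen for this sketch).
    """
    tag_to_genes = defaultdict(set)
    undetectable = set()
    for gene, seq in transcripts.items():
        tag = extract_tag(seq, anchor, tag_len)
        if tag is None:
            undetectable.add(gene)
        else:
            tag_to_genes[tag].add(gene)

    # A tag shared by several genes cannot identify any of them.
    unique = {}
    ambiguous = set()
    for tag, genes in tag_to_genes.items():
        if len(genes) == 1:
            unique[tag] = next(iter(genes))
        else:
            ambiguous.update(genes)
    return unique, ambiguous, undetectable


# Toy data, invented for illustration only.
toy_transcripts = {
    "geneA": "AAACATGACGTACGTAGCTTT",
    "geneB": "GGCATGACGTACGTAGAA",    # shares its tag with geneA
    "geneC": "TTTTGGGGAAAA",          # no CATG site
    "geneD": "CCCATGTTTTTTTTTTGG",
}
unique, ambiguous, undetectable = map_tags(toy_transcripts)
```

In this toy set only geneD receives a unique tag; geneA and geneB share a tag, and geneC lacks an anchoring site, mirroring the two failure modes named in the abstract.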
A regional analysis of nucleotide substitution rates along human genes and their flanking regions allows us to quantify the effect of mutational mechanisms associated with transcription in germ line cells. Our analysis reveals three distinct patterns of substitution rates. The first is a sharp decline in the deamination rate of methylated CpG dinucleotides in the vicinity of the 5' end of genes. The second is a strand asymmetry in complementary substitution rates, associated with transcription-coupled repair, which extends from the 5' end to 1 kbp downstream of the 3' end. The third is a localized strand asymmetry: an excess of C→T over G→A substitutions in the nontemplate strand, confined to the first 1-2 kbp downstream of the 5' end of genes. We hypothesize that higher exposure of the nontemplate strand near the 5' end of genes leads to a higher cytosine deamination rate. Until now, only the somatic hypermutation (SHM) pathway has been known to mediate localized and strand-specific mutagenic processes associated with transcription in mammals. The mutational patterns in SHM are induced by a cytosine deaminase that targets only single-stranded DNA. This DNA conformation is induced by R-loops, which preferentially occur at the 5' ends of genes. We predict that R-loops are extensively formed at the beginning of transcribed regions in germ line cells.
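The notion of a strand asymmetry between complementary substitutions can be made concrete with a small sketch. The measure used here, (n_fwd - n_rev) / (n_fwd + n_rev), is an assumed illustrative statistic, not necessarily the one used in the study, and the function names and toy sequences are hypothetical. The sketch works on raw counts from a pair of aligned nontemplate-strand sequences and does not normalize by base composition, which a real rate estimate would require.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def count_substitutions(ancestral, derived):
    """Count single-base substitutions between two aligned sequences.

    Both sequences are assumed to be given on the nontemplate strand,
    gap-free, and of equal length (assumptions of this sketch).
    """
    if len(ancestral) != len(derived):
        raise ValueError("sequences must be aligned and of equal length")
    counts = Counter()
    for a, d in zip(ancestral.upper(), derived.upper()):
        if a != d and a in COMPLEMENT and d in COMPLEMENT:
            counts[(a, d)] += 1
    return counts


def strand_asymmetry(counts, sub=("C", "T")):
    """Asymmetry between a substitution and its complement, in [-1, 1].

    For sub = ("C", "T") the complementary substitution is G->A.
    Returns None when neither substitution was observed.
    """
    comp = (COMPLEMENT[sub[0]], COMPLEMENT[sub[1]])
    n_fwd, n_rev = counts[sub], counts[comp]
    total = n_fwd + n_rev
    if total == 0:
        return None
    return (n_fwd - n_rev) / total


# Toy alignment, invented for illustration: three C->T and one G->A.
toy_counts = count_substitutions("CCCCGGGGAA", "TTTCGAGGAA")
toy_asymmetry = strand_asymmetry(toy_counts)  # (3 - 1) / (3 + 1) = 0.5
```

A positive value indicates an excess of C→T over G→A on the nontemplate strand; computing it in windows along a gene would be one way to look for the localized 5' pattern described above.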