Genome Research

Published by Cold Spring Harbor Laboratory Press
Online ISSN: 1088-9051
Previous studies revealed that Igf2 and Mpr/Igf2r are imprinted in eutherian mammals and marsupials but not in monotremes or birds. Igf2 lies in a large imprinted cluster in eutherians, and its imprinting is regulated by long-range mechanisms. As a step to understand how the imprinted cluster evolved, we have determined a 490-kb chicken sequence containing the orthologs of mammalian Ascl2/Mash2, Ins2 and Igf2. We found that most of the genes in this region are conserved between chickens and mammals, maintaining the same transcriptional polarities and exon-intron structures. However, H19, an imprinted noncoding transcript, was absent from the chicken sequence. Chicken ASCL2/CASH4 and INS, the orthologs of the imprinted mammalian genes, showed biallelic expression, further supporting the notion that imprinting evolved after the divergence of mammals and birds. The H19 imprinting center and many of the local regulatory elements identified in mammals were not found in chickens. Also, a large segment of tandem repeats and retroelements identified between the two imprinted subdomains in mice was not found in chickens. Our findings show that the imprinted genes were clustered before the emergence of imprinting and that the elements associated with imprinting probably evolved after the divergence of mammals and birds.
Bacterial artificial chromosome (BAC) clones are effective mapping and sequencing reagents. The 1.1-Mb α/δ T-cell receptor locus of humans was mapped and partially sequenced with BAC clones. Seventeen BAC clones covered the 1.1-Mb α/δ locus, with the exception of one small gap that was expected from the coverage that a 3.7-fold BAC library is likely to provide. The end sequences of the BAC inserts could be obtained directly from the BAC DNA by sequencing with chain-terminator chemistry. Five complete BAC inserts were sequenced directly by the shotgun approach. The ends of the 17 BAC inserts were distributed evenly across the locus. By several independent criteria, the BAC clones faithfully represented the genomic DNA, with the exception of a single clone with a 68-kb deletion. These BAC features led to the proposal of a new approach to sequence the human genome. [The sequenced BAC clones, BAC956, BAC810, BAC480, BAC378, and BAC129, have been submitted to GenBank under accession nos. U85199, U85198, U85197, U85196, and U85195, respectively.]
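The remark that a small gap is expected at 3.7-fold coverage can be illustrated with a simple Poisson model of random clone placement. This is only a sketch under that assumption; the abstract does not say which model the authors used, and the function names below are hypothetical.

```python
import math


def uncovered_fraction(coverage):
    """Poisson estimate of the fraction of a region covered by no clone.

    Assumes clones are placed uniformly at random, so the number of
    clones over any base is Poisson-distributed with mean `coverage`.
    """
    return math.exp(-coverage)


def expected_uncovered_kb(coverage, region_kb):
    """Expected amount of uncovered sequence (kb) in a region."""
    return region_kb * uncovered_fraction(coverage)


# 3.7-fold library, 1.1-Mb (1100-kb) locus
print(round(uncovered_fraction(3.7), 4))
print(round(expected_uncovered_kb(3.7, 1100), 1))
```

Under this model roughly 2.5% of the locus, on the order of a few tens of kilobases, would be missed by a 3.7-fold library, which is in line with a single small gap.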
(A) Outline of the Gateway recombination reaction used for generation of hORFeome v1.1. PCR amplification of human ORFs (blue boxes) was performed on isolated MGC cDNA clones. In the depiction of the gene-specific primers, yellow nucleotides represent the altered attB recombination sites (attB1.1 and attB2.1), and blue nucleotides represent the coding sequence in the ORF. PCR-amplified ORFs are cloned by a unidirectional recombinational cloning reaction via their flanking attB1.1 and attB2.1 recombination sites into the pDONR223 Gateway Donor vector. On the Donor vector, the universal Fwd and Rev sequencing primers, the origin of replication (ORI), and the spectinomycin (Spc) selectable marker are indicated. (B) Defining a first version of the human ORFeome starting from the MGC (Mammalian Gene Collection) Resource. The MGC contained 12,710 available full-length human cDNA clones at the time this project was begun. We designed pairs of primers for the PCR amplification of 10,154 distinct full-length ORFs. Asterisks indicate sequence polymorphisms. (C) Scheme for the generation of the hORFeome v1.1 resource. A total of 10,154 pairs of ORF-specific primers were designed to PCR amplify all nonredundant ORFs present in the MGC collection as of March 2004. Amplified ORFs were cloned into the pDONR223 Donor vector by BP recombinational cloning. PCR amplification of cloned inserts for sequence verification was then done using the Fwd and Rev primers that flank the cloning site. Sequencing was performed on the 5′ end with the Fwd primer to generate ORF Sequence Tags (OSTs), and ORFs were then identified by BLAST analysis against the complete MGC set of 10,154 nonredundant clones. Successfully cloned ORFs were consolidated onto new plates, generating the hORFeome v1.1 resource.
Comparison of the Gateway Cloning System for the Current Human ORFeome Project to the Previous C. elegans ORFeome Project (Reboul et al. 2003) (columns: C. elegans ORFeome, Human ORFeome)
Summary of the Proteome-Scale Protein Expression Data 
(A) Recombinant protein expression in E. coli. The 282 Entry constructs were transferred into a His6 N-terminal fusion vector (pDEST17). Representative samples from the 282 small, medium, and large ORFs are shown. (Lanes 1-12) Plate 11006, Row F01-12, small ORFs (1) BC007838; (2) BC007355; (3) BC027953; (4) BC027899; (5) BC021719; (6) BC038108; (7) BC002826; (8) BC030277; (9) BC005991; (10) BC036723; (11) BC025403; (12) BC025760; (lanes 13-24) Plate 11023, Row A1-12, medium ORFs (13) BC001061; (14) BC000453; (15) BC000017; (16) BC001167; (17) BC001150; (18) BC000480; (19) BC001221; (20) BC001142; (21) BC002472; (22) BC000723; (23) BC000770; (24) BC001665; (lanes 25-36) Plate 11025, Row C1-12, large ORFs (25) BC037313; (26) BC035818; (27) BC007670; (28) BC007897; (29) BC012177; (30) BC037491; (31) BC005033; (32) BC032597; (33) BC036216; (34) BC001571; (35) BC012064; (36) BC034237. The positions of molecular weight markers (31-98 kDa) are indicated. All visible proteins migrate at the expected size. (B) Recombinant protein expression in mammalian cells. The 282 Entry constructs were transferred into a GFP C-terminal fusion vector (pcDNA-DEST47). Expression of GFP fusion proteins was assessed in transiently transfected 293T cells. Shown are 10 representative images of GFP fusion protein expression, with GFP images (left) showing the distribution of the fusion proteins, DAPI images (middle) indicating nuclei, and the GFP/DAPI merged images (right). MGC clone numbers are indicated to the left of each panel.
The advent of systems biology necessitates the cloning of nearly entire sets of protein-encoding open reading frames (ORFs), or ORFeomes, to allow functional studies of the corresponding proteomes. Here, we describe the generation of a first version of the human ORFeome using a newly improved Gateway recombinational cloning approach. Using the Mammalian Gene Collection (MGC) resource as a starting point, we report the successful cloning of 8076 human ORFs, representing at least 7263 human genes, as mini-pools of PCR-amplified products. These were assembled into the human ORFeome version 1.1 (hORFeome v1.1) collection. After assessing the overall quality of this version, we describe the use of hORFeome v1.1 for heterologous protein expression in two different expression systems at proteome scale. The hORFeome v1.1 represents a central resource for the cloning of large sets of human ORFs in various settings for functional proteomics of many types, and will serve as the foundation for subsequent improved versions of the human ORFeome.
The bacteria of the Brucella genus are responsible for a worldwide zoonosis called brucellosis. They belong to the alpha-proteobacteria group, like many other bacteria that live in close association with a eukaryotic host. Importantly, the Brucellae are mainly intracellular pathogens, and the molecular mechanisms of their virulence are still poorly understood. Using the complete genome sequence of Brucella melitensis, we generated a database of protein-coding open reading frames (ORFs) and constructed an ORFeome library of 3091 Gateway Entry clones, each containing a defined ORF. This first version of the Brucella ORFeome (v1.1) provides the coding sequences in a user-friendly format amenable to high-throughput functional genomic and proteomic experiments, as the ORFs are conveniently transferable from the Entry clones to various Expression vectors by recombinational cloning. The cloning of the Brucella ORFeome v1.1 should help to provide a better understanding of the molecular mechanisms of virulence, including the identification of bacterial protein-protein interactions as well as interactions between bacterial effectors and their host targets.
Long-range map of the Xp11.23-22 region. STSs or ESTs that were mapped by hybridization on genomic DNA and to YACs are shown above (probes with * only against YACs), markers mapped by a PCR-based approach on YACs or cosmids below. (R) NruI; (B) BssHII; (M) MluI; (N) NotI. YACs shown in the center line were obtained from either the ICRF library (Lehrach et al. 1990) or from an X chromosome-specific library (Lee et al. 1992). YACs yWXDF14D4/A39E7 were isolated from the St. Louis library (Nagaraja et al. 1994). Additional restriction sites on YACs not seen in genomic DNA are given. The 1100-kb sequence-ready cosmid/PAC contig is shown in detail in Fig. 4. YACs ICRFy900A0120/A1220 are chimeric (ch); YAC ICRFy900F051 is chimeric and unstable (u).
Fragment Sizes and Probes (columns: Designation, NotI, BssHII, NruI, MluI, Marker type)
PFGE analysis connecting MG44 with T54. Both probes identify an identical partially digested NotI fragment, but different NotI end fragments and MluI fragments. Hybridization was performed under stringent conditions. Sizes are indicated on the left.
PFGE analysis in Xp11.22. (a) High-resolution PFGE connecting HB-2 with DXS255. The probe HB2 and the VNTR DXS255 hybridize with identical MluI and NotI fragments (>3 Mb), but with two different NruI fragments. (b) Identification of two putative CpG islands by hybridization with HB2. Partially digested BssHII fragments were not observed with other probes except for HB4.
Entire cDNA sequence of T54 (GenBank accession no. U66359) encoding a predicted novel protein of 378 amino acids. (a) The initiation codon is underlined and the stop codon is indicated by an asterisk. The poly(A) site is in bold letters; a putative nuclear localization signal is doubly underlined. (b) Hydrophilicity profile of T54 using a window of 19 residues (Kyte and Doolittle 1982).
Most of the yeast artificial chromosomes (YACs) isolated from the Xp11.23-22 region have shown instability and chimerism and are not a reliable resource for determining physical distances. We therefore constructed a long-range pulsed-field gel electrophoresis map that encompasses approximately 3.5 Mb of genomic DNA between the loci TIMP and DXS146, including a CpG-rich region around the WASP and TFE-3 gene loci. A combined YAC-cosmid contig was constructed along the genomic map and was used for fine-mapping of 15 polymorphic microsatellites and 30 expressed sequence tags (ESTs) or sequence-tagged sites (STSs), revealing the following order: tel-(SYN-TIMP)-(DXS426-ELK1)-ZNF(CA)n-L1-DXS1367-ZNF81-ZNF21-DXS6616-(HB3-OATL1pseudogenes-DXS6950)-DXS6949-DXS6941-DXS7464E(MG61)-GW1E(EBP)-DXS7927E(MG81)-RBM-DXS722-DXS7467E(MG21)-DXS1011E-WASP-DXS6940++ +-DXS7466E(MG44)-GF1-DXS226-DXS1126-DXS1240-HB1-DXS7469E-(DXS6665-DXS1470)-TFE3-DXS7468E-+ ++SYP-DXS1208-HB2E-DXS573-DXS1331-DXS6666-DXS1039-DXS1426-DXS1416-DXS7647-DXS8222-DXS6850-DXS255++ +-CIC-5-DXS146-cen. A sequence-ready map was constructed for an 1100-kb gene-rich interval flanked by the markers HB3 and DXS1039, from which six novel ESTs/STSs were isolated, thus increasing the number of markers used in this interval to thirty. This precise ordering is a prerequisite for the construction of a transcription map of this region that contains numerous disease loci, including those for several forms of retinal degeneration and mental retardation. In addition, the map provides a basis for delineating the corresponding syntenic region in the mouse, where the mutants scurfy and tattered are localized.
In the process of positionally cloning a candidate gene responsible for hereditary hemochromatosis (HH), we constructed a 1.1-Mb transcript map of the region of human chromosome 6p that lies 4.5 Mb telomeric to HLA-A. A combination of three gene-finding techniques, direct cDNA selection, exon trapping, and sample sequencing, was used initially for a saturation screening of the 1.1-Mb region for expressed sequence fragments. As genetic analysis further narrowed the HH candidate locus, we sequenced completely 0.25 Mb of genomic DNA as a final measure to identify all genes. Besides the novel MHC class 1-like HH candidate gene HLA-H, we identified a family of five butyrophilin-related sequences, two genes with structural similarity to a type 1 sodium phosphate transporter, 12 novel histone genes, and a gene we named RoRet based on its strong similarity to the 52-kD Ro/SSA lupus and Sjögren’s syndrome auto-antigen and the RET finger protein. Several members of the butyrophilin family and the RoRet gene share an exon of common evolutionary origin called B30-2. The B30-2 exon was originally isolated from the HLA class 1 region, yet has apparently “shuffled” into several genes along the chromosome telomeric to the MHC. The conservation of the B30-2 exon in several novel genes and the previously described amino acid homology of HLA-H to MHC class 1 molecules provide further support that this gene-rich region of 6p21.3 is related to the MHC. Finally, we performed an analysis of the four approaches for gene finding and conclude that direct selection provides the most effective probes for cDNA screening, and that as much as 30% of ESTs in this 1.1-Mb region may be derived from noncoding genomic DNA. [The sequence data described in this paper have been submitted to GenBank under accession nos. U90543–U90548, U90550–U90552, and U91328.]
A composite physical map of the Lc locus on mouse chromosome 6. The markers presented here are a selection of markers from Table 1 that illustrate the redundancy of the YAC and BAC contigs. (Top) A diagram of the chromosomal segment under study showing the relative position of the markers on the chromosome. The YAC contig is shown in the center; individual YACs are labeled with their respective names. The BAC contig is found at the bottom; individual BACs are labeled with their abbreviated names. The position of each marker is indicated by a bold vertical line on the chromosome, YACs and BACs; markers that are end clones are indicated with a ● on the YAC or BAC from which they were isolated. The BAC contig stretches from D6Rck342 to D6Rck329, covering ∼1.2 Mb. The Lc mutation is contained within the genomic segment defined by markers D6Rck353 and D6Rck357. Two genes were mapped to positions within the BAC contig; a short line describes the general position of these genes. More specifically, Lt1 maps to marker D6Rck329 and Atoh1 to D6Rck368.
Lurcher (Lc) is a semidominant mouse mutant that displays a characteristic ataxia in the heterozygous state beginning in the third postnatal week. This symptom results from a neurodegenerative event in the cerebellum: There is a catastrophic loss of Purkinje cells in the heterozygote animal between postnatal days 10 and 15. In an effort to identify the genetic lesion borne by Lc mice, we initiated a cloning project based on the position of the Lc mutation on mouse chromosome 6. We have extended our previous analysis of the genomic segment containing the Lc locus by isolating a set of stable and manipulable genomic clones called bacterial artificial chromosomes (BACs) that cover this region of mouse chromosome 6. These clones provided a good substrate for the isolation of markers that were used to refine the physical map of the locus. Furthermore, 20 of these markers were mapped onto our (B6CBACa-Aw-J/A-Lc × CAST/Ei)F1 × B6CBACa-Aw-J/A backcross, refining the genetic map and identifying two nonrecombinant markers (D6Rck354 and D6Rck355). These two markers, in conjunction with the closest flanking markers, were used to identify a 110-kb genomic segment that contains all four markers and hence contains the Lc locus. This small genomic segment, covered by multiple BACs, sets the stage for the final effort of this project: the identification of transcripts and of the mutation within the Lc locus. [The Lt1 sequence has been submitted to GenBank as two ESTs; the accession numbers are U89356 and U89357.]
Duplication and deletion of the 1.4-Mb region in 17p12 that is delimited by two 24-kb low copy number repeats (CMT1A-REPs) represent frequent genomic rearrangements resulting in two common inherited peripheral neuropathies, Charcot-Marie-Tooth disease type 1A (CMT1A) and hereditary neuropathy with liability to pressure palsy (HNPP). CMT1A and HNPP exemplify a paradigm for genomic disorders wherein unique genome architectural features result in susceptibility to DNA rearrangements that cause disease. A gene within the 1.4-Mb region, PMP22, is responsible for these disorders through a gene-dosage effect in the heterozygous duplication or deletion. However, the genomic structure of the 1.4-Mb region, including other genes contained within the rearranged genomic segment, remains essentially uncharacterized. To delineate genomic structural features, investigate higher-order genomic architecture, and identify genes in this region, we constructed PAC and BAC contigs and determined the complete nucleotide sequence. This CMT1A/HNPP genomic segment contains 1,421,129 bp of DNA. A low copy number repeat (LCR) was identified, with one copy inside and two copies outside of the 1.4-Mb region. Comparison between physical and genetic maps revealed a striking difference in recombination rates between the sexes with a lower recombination frequency in males (0.67 cM/Mb) versus females (5.5 cM/Mb). Hypothetically, this low recombination frequency in males may enable a chromosomal misalignment at proximal and distal CMT1A-REPs and promote unequal crossing over, which occurs 10 times more frequently in male meiosis. In addition to three previously described genes, five new genes (TEKT3, HS3ST3B1, NPD008/CGI-148, CDRT1, and CDRT15) and 13 predicted genes were identified. Most of these predicted genes are expressed only in embryonic stages. 
Analyses of the genomic region adjacent to proximal CMT1A-REP indicated an evolutionary mechanism for the formation of proximal CMT1A-REP and the creation of novel genes by DNA rearrangement during primate speciation.
The nucleotide sequence of 1.5 Mb of genomic DNA from Mycobacterium leprae was determined using computer-assisted multiplex sequencing technology. This brings the 2.8-Mb M. leprae genome sequence to ∼66% completion. The sequences, derived from 43 recombinant cosmids, contain 1046 putative protein-coding genes, 44 repetitive regions, 3 rRNAs, and 15 tRNAs. The gene density of one per 1.4 kb is slightly lower than that of Mycoplasma (1.2 kb). Of the protein-coding genes, 44% have significant matches to genes with well-defined functions. Comparison of 1157 M. leprae and 1564 Mycobacterium tuberculosis proteins shows a complex mosaic of homologous genomic blocks with up to 22 adjacent proteins in conserved map order. Matches to known enzymatic, antigenic, membrane, cell wall, cell division, multidrug resistance, and virulence proteins suggest therapeutic and vaccine targets. Unusual features of the M. leprae genome include large polyketide synthase (pks) operons, inteins, and highly fragmented pseudogenes. [The sequence data described in this paper have been submitted to GenBank under accession nos. L78811–L78829, U00010–U00023, U15180–U15184, U15186, U15187, L01095, L01536, L04666, and L01263. On-line supplementary information for Table 1 is available at .]
List of YAC clones used in this study 
A contig of 21 nonchimeric yeast artificial chromosomes (YACs) has been assembled across 1.5 Mb of the multidrug resistance (MDR) gene region located at 7q21, and formatted with four previously reported probes, six newly isolated probes, and three sequence-tagged sites (STSs) from internal and end fragments of YACs. A physical map of rare cutter restriction enzyme sites across the region was also constructed by pulsed-field gel electrophoretic (PFGE) analysis of four overlapping YAC clones. The amplification unit of this region in different cell lines was then determined by Southern blot analysis on the basis of the physical map and probes. Amplified DNA was located in extrachromosomal elements in human MDR cell lines studied here, and the size of the amplification unit was determined to be discrete in one MDR amplification but variable in others.
The Down syndrome (DS) region has been defined by analyses of partial trisomy 21. The 2.5-Mb region between D21S17 and ERG is reportedly responsible for the main features of DS. Within this 2.5-Mb region, we focused previously on a distal 1.6-Mb region from an analysis of Japanese DS patients with partial trisomy 21. Previously we also performed exon-trapping and direct cDNA library screening of a fetal brain cDNA library and identified a novel gene TPRD. Further screening of a fetal heart cDNA library was performed and a total of 44 possible exons and 97 cDNA clones were obtained and mapped on a BamHI map. By rescreening other cDNA libraries and a RACE reaction, we isolated nearly full-length cDNAs of three additional genes [holocarboxylase synthetase (HCS), G protein-coupled inward rectifier potassium channel 2 (GIRK2), and a human homolog of Drosophila minibrain gene (MNB)] and a coding sequence of a novel inward rectifier potassium channel-like gene (IRKK). The gene distribution and direction of transcription were determined by mapping both ends of the cDNA sequences. We found that these genes, except IRKK, are expressed ubiquitously and are relatively large, extending from 100 kb to 300 kb on the genome. These nearly full-length cDNA sequences should facilitate understanding of the detailed genome structure of the DS region and help to elucidate their role in the etiology of DS.
Calling genotypes based on cluster analysis of raw data. Each SNP in a multiplex assay results in four fluorescent signal values: two for the two expected allele channels and two in background channels. Plotting the signal channels against each other (left) results in the formation of three clusters. The plot on the left shows 50,000 data points across several thousand markers. In order to decouple the overall signal of the particular data point from the contrast between the different allele signals, it is helpful to transform the data into a different space in which the sum of the signals in both channels (S) is plotted on the y-axis and the projection of the individual data point onto the line of constant S (the contrast value C) is plotted on the x-axis. The values of C range from −1 to 1 such that a value of −1 or 1 means signal in only one of the two channels while a value of 0 means equal signal in each channel. A one-dimensional E-M algorithm can then be used to find the clusters of homozygous and heterozygous calls. The colors have been automatically added by the cluster calling algorithm, which has identified the three clusters.
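The transformation and clustering described in this caption can be sketched as follows. This is an illustration only: the caption does not give the exact formula for the contrast, so C = (A − B)/(A + B) is assumed here because it has the stated range and meaning, and the three-component Gaussian mixture with fixed starting means is a generic one-dimensional E-M, not the authors' calling algorithm. All function names are hypothetical.

```python
import math


def to_contrast_sum(a, b):
    """Map two allele-channel signals to (contrast C, sum S).

    Assumed definition: C = (a - b) / (a + b), so C is -1 or 1 when only
    one channel has signal and 0 when both channels are equal.
    """
    s = a + b
    if s <= 0:
        raise ValueError("total signal must be positive")
    return (a - b) / s, s


def em_three_clusters(contrasts, n_iter=50):
    """One-dimensional E-M for a three-component Gaussian mixture on C."""
    means = [-1.0, 0.0, 1.0]          # hom B, het, hom A starting guesses
    variances = [0.05, 0.05, 0.05]
    weights = [1 / 3, 1 / 3, 1 / 3]
    min_var = 1e-4                    # floor to keep components from collapsing
    resp = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        resp = []
        for c in contrasts:
            dens = [
                w * math.exp(-((c - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)
                for w, m, v in zip(weights, means, variances)
            ]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and variance of each component
        for k in range(3):
            nk = sum(r[k] for r in resp)
            if nk < 1e-9:
                continue
            means[k] = sum(r[k] * c for r, c in zip(resp, contrasts)) / nk
            var = sum(r[k] * (c - means[k]) ** 2 for r, c in zip(resp, contrasts)) / nk
            variances[k] = max(min_var, var)
            weights[k] = nk / len(contrasts)
    labels = [max(range(3), key=lambda k: r[k]) for r in resp]
    return means, labels


# Made-up signal pairs: three hom-A-like, three het-like, three hom-B-like
signals = [(1000, 50), (900, 40), (1100, 60),
           (500, 520), (480, 470), (510, 500),
           (45, 950), (60, 1000), (50, 1050)]
contrasts = [to_contrast_sum(a, b)[0] for a, b in signals]
means, labels = em_three_clusters(contrasts)
print(labels)
```

Clustering on C alone ignores S; a real caller would presumably also use S to reject low-signal points, which is not modeled here.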
A summary of the genotyping for two representative batches of the HapMap study
Schematic of the MIP assay process. MIP reactions are set up by adding an enzyme mix and genomic DNA to the probe pool. This mix is then split into four tubes, each receiving a distinct nucleotide species. After gap-filling and probe inversion, inverted probes are amplified using common PCR primers. These amplicons are labeled using one of two labeling processes. In the two-color labeling scheme (top), the A and C reactions are labeled with one fluorophore while the G and T reactions are labeled with a spectrally distinct fluorophore. The A and G reactions are then pooled and hybridized to one tag array while the C and T reactions are pooled and hybridized to a second array. Both arrays are then scanned using a GeneChip array scanner in two spectral channels to generate four fluorescent signals for each tag. In the four-color labeling scheme (bottom), each of the four reactions is labeled with a spectrally distinct fluorophore. All four reactions are then pooled and hybridized to a single tag array, which is scanned using a GeneChip AT CCD imager in four spectral bands. In both cases four images are generated containing the four allele signals for each SNP marker.
The effect of clustering parameters on performance metrics. In this plot, the markers for Batch 6 are ordered along the x-axis such that the marker with the highest call rate is at the origin, while the worst performing of the ∼12,000 markers is at the right. The y-axis shows the call rate for each of these markers across 95 individuals. The markers that exhibit poor call rates are called nonconverted and are shown in the gray area. The red curve shows a choice of cluster calling parameters that emphasizes high completeness by accepting calls on the periphery of clusters. More markers show very high call rates and the amount of missing data shown by the red shaded region is minimal (99.2% completeness). The overall accuracy as measured by trio concordance shows that a small number of erroneous calls are being made (99.64% concordance). If one wishes to eliminate these incorrect calls, the base caller can be tuned to be more stringent. This choice allows very high accuracy (∼99.9% trio concordance) while causing more missing data (blue shaded region). The cluster calling parameters should thus be chosen according to the intended use of the data.
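The two metrics traded off in this caption, call rate and trio concordance, can be sketched as below. This assumes that trio concordance means Mendelian consistency of a child's genotype with its parents' genotypes, which the caption does not spell out; the genotype encoding (two-letter strings, `None` for a no-call) and the function names are hypothetical.

```python
def call_rate(genotypes):
    """Fraction of samples with a genotype call (None marks a no-call)."""
    called = [g for g in genotypes if g is not None]
    return len(called) / len(genotypes)


def mendel_consistent(father, mother, child):
    """True if the child could inherit one allele from each parent.

    Genotypes are two-character strings such as "AG"; allele order
    within a genotype is ignored.
    """
    return any(sorted(f + m) == sorted(child) for f in father for m in mother)


def trio_concordance(trios):
    """Fraction of fully called trios that are Mendelian-consistent."""
    complete = [t for t in trios if None not in t]
    if not complete:
        return None
    consistent = sum(mendel_consistent(*t) for t in complete)
    return consistent / len(complete)


# Made-up example: (father, mother, child) genotypes for one marker
trios = [("AA", "GG", "AG"),   # consistent
         ("AA", "AA", "AG"),   # inconsistent: no parent carries G
         ("AG", None, "AA")]   # skipped: mother is a no-call
print(call_rate(["AA", None, "AG", "GG"]))
print(trio_concordance(trios))
```

A stricter caller would turn more peripheral points into `None`, lowering `call_rate` while removing some of the inconsistent trios, which is the trade-off the figure illustrates.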
The effect of inaccurate genotypes (A) and incomplete genotyping (B) on the number of patients required to have 80% power to find a genetic association. A genetic model has been assumed in which the relative risk of the causative allele (GRR) is two. The effect is assumed to be multiplicative. The causative allele frequency is plotted on the x-axis. The largest loss of power comes with making inaccurate calls for markers with low frequency. By contrast, incomplete data result in smaller loss of power, which is felt across the allele frequency spectrum. The data from the MIP assay are accurate enough to be used for the investigation of rare alleles without significant loss of power.
Large-scale genetic studies are highly dependent on efficient and scalable multiplex SNP assays. In this study, we report the development of Molecular Inversion Probe technology with four-color, single array detection, applied to large-scale genotyping of up to 12,000 SNPs per reaction. While generating 38,429 SNP assays using this technology in a population of 30 trios from the Centre d'Etude du Polymorphisme Humain family panel as part of the International HapMap project, we established SNP conversion rates of approximately 90% with concordance rates >99.6% and completeness levels >98% for assays multiplexed up to 12,000plex levels. Furthermore, these individual metrics can be "traded off" and, by sacrificing a small fraction of the conversion rate, the accuracy can be increased to very high levels. No loss of performance is seen when scaling from 6,000plex to 12,000plex assays, strongly validating the ability of the technology to suppress cross-reactivity at high multiplex levels. The results of this study demonstrate the suitability of this technology for comprehensive association studies that use targeted SNPs in indirect linkage disequilibrium studies or that directly screen for causative mutations.
Genotyping Accuracy
Replication of Linkage Region in a Family With a Combination of Chronic Mucocutaneous-Candidiasis and Thyroid Disease, Using SNPs
The analysis of single nucleotide polymorphisms (SNPs) is increasingly utilized to investigate the genetic causes of complex human diseases. Here we present a high-throughput genotyping platform that uses a one-primer assay to genotype over 10,000 SNPs per individual on a single oligonucleotide array. This approach uses restriction digestion to fractionate the genome, followed by amplification of a specific fractionated subset of the genome. The resulting reduction in genome complexity enables allele-specific hybridization to the array. The selection of SNPs was primarily determined by computer-predicted lengths of restriction fragments containing the SNPs, and was further driven by strict empirical measurements of accuracy, reproducibility, and average call rate, which we estimate to be >99.5%, >99.9%, and >95%, respectively [corrected]. With average heterozygosity of 0.38 and genome scan resolution of 0.31 cM, the SNP array is a viable alternative to panels of microsatellites (STRs). As a demonstration of the utility of the genotyping platform in whole-genome scans, we have replicated and refined a linkage region on chromosome 2p for chronic mucocutaneous candidiasis and thyroid disease, previously identified using a panel of microsatellite (STR) markers.
Frequencies of dN/dS ratio (ω) for pairwise comparisons between X. laevis and X. tropicalis genes. The distribution of dN/dS from pairwise comparisons of genes within gene trios is shown. ω values from X. laevis paralog pairs (shown in red) indicate a weaker selective constraint than the ω obtained from the comparisons of X. laevis paralogs with their X. tropicalis ortholog (shown in gray and blue). The ω values from both paralog–ortholog pairs follow a similar distribution with a lower median than the ω obtained from the paralogs in each trio (P = 2.184 × 10⁻⁷).
Distribution of dS for pairwise comparisons between paralogs and orthologs. Distribution of dS from pairwise comparison between X. laevis paralogs (red) and from pairwise comparisons between each X. laevis paralog from a trio with its X. tropicalis ortholog (blue and gray). The small number
Clustered Image Map of genes with no paralog versus GO categories for categories with significant enrichment. Thumbnail clustered image map (CIM) of genes (top) versus categories (right) for categories with a false discovery rate (FDR) ≤ 0.10. Very large generic categories have been removed to improve visualization. Clustering was performed with the Genesis Client (Sturn et al. 2002; GenesisCenter.html). Three major clusters can be seen. Processes involved in general metabolism (far left) include “cofactor catabolism,” “acetyl-CoA catabolism,” “aerobic respiration,” “cellular respiration,” and “tricarboxylic acid cycle.” Processes involved in nucleic acid processing (bottom right) include “RNA metabolism,” “transcription,” “nucleobase metabolism,” “DNA replication,” and “DNA metabolism.” The third cluster contains GO categories involved in nucleoside metabolism such as “nucleobase metabolism,” “pyrimidine base biosynthesis,” and “nucleobase biosynthesis.” The full-size CIM in which all genes are displayed is available as Supplemental Figure S6.
Sequencing of full-insert clones from full-length cDNA libraries from both Xenopus laevis and Xenopus tropicalis has been ongoing as part of the Xenopus Gene Collection Initiative. Here we present 10,967 full ORF verified cDNA clones (8049 from X. laevis and 2918 from X. tropicalis) as a community resource. Because the genome of X. laevis, but not X. tropicalis, has undergone allotetraploidization, comparison of coding sequences from these two clawed (pipid) frogs provides a unique angle for exploring the molecular evolution of duplicate genes. Within our clone set, we have identified 445 gene trios, each comprised of an allotetraploidization-derived X. laevis gene pair and their shared X. tropicalis ortholog. Pairwise dN/dS comparisons within trios show strong evidence for purifying selection acting on all three members. However, dN/dS ratios between X. laevis gene pairs are elevated relative to their X. tropicalis ortholog. This difference is highly significant and indicates an overall relaxation of selective pressures on duplicated gene pairs. We have found that the paralogs that have been lost since the tetraploidization event are enriched for several molecular functions, but have found no such enrichment in the extant paralogs. Approximately 14% of the paralogous pairs analyzed here also show differential expression indicative of subfunctionalization.
Characteristics of the Ovine Linkage Map 
Comparison of the Number of Loci Shared by the Various Sheep and Cattle Maps 
Loci that Map to Nonhomologous Chromosomes on Sheep, Cattle, and Goat Linkage Maps 
A medium-density linkage map of the ovine genome has been developed. Marker data for 550 new loci were generated and merged with the previous sheep linkage map. The new map comprises 1093 markers representing 1062 unique loci (941 anonymous loci, 121 genes) and spans 3500 cM (sex-averaged) for the autosomes and 132 cM (female) on the X chromosome. There is an average spacing of 3.4 cM between autosomal loci and 8.3 cM between highly polymorphic [polymorphic information content (PIC) ≥ 0.7] autosomal loci. The largest gap between markers is 32.5 cM, and the number of gaps of >20 cM between loci, or regions where loci are missing from chromosome ends, has been reduced from 40 in the previous map to 6. Five hundred and seventy-three of the loci can be ordered on a framework map with odds of >1000:1. The sheep linkage map contains strong links to both the cattle and goat maps. Five hundred and seventy-two of the loci positioned on the sheep linkage map have also been mapped by linkage analysis in cattle, and 209 of the loci mapped on the sheep linkage map have also been placed on the goat linkage map. Inspection of ruminant linkage maps indicates that the genomic coverage by the current sheep linkage map is comparable to that of the available cattle maps. The sheep map provides a valuable resource to the international sheep, cattle, and goat gene mapping community.
Schematic illustration of the multiplex genotyping procedure. Only one SNP is shown. Primers and probes are shown as arrowed lines. Microarray spots are indicated as ellipsoids. (A) Amplification of the polymorphic sequence. Two allelic sequences use the same set of primers, P1 and P2. (B) Generation of ssDNA by using the primer-probes in both directions in separate tubes. Only the two allelic template strands in each reaction are shown. (C) ssDNA generated from B. (D) Addition of the ssDNA to the respective microarrays containing probes in different directions. (E) ssDNA templates hybridized to their probes on the microarrays. (F) Labeling probes by incorporating fluorescently labeled ddNTPs. (G) Labeled probes after washing off all other reagents.
(A) A microarray image from genotyping one individual with Group II SNPs. Each probe was printed twice and shown as neighboring spots. Spots in red and green, homozygous; yellow, heterozygous; white, pink, and light green, spots with strong signal that have exceeded the linear range; and dark, low signal, which does not necessarily mean no signal or a signal too low for genotype calls. (B) Scatter plot based on the color intensities from the microarray image shown in A. Two horizontal lines are the cutoffs (natural logarithms of the ratios [Cy3/Cy5] at 2 and −2) to divide the spots into three genotype groups. (C) A plot simply based on the two color intensities for the 24 samples (two spots for each sample) of an SNP. Values of the signal intensities indicated on the axes should be multiplied by 1000. Note that since different parameters are used, the color orientations are different in B and C.
Results from genotyping 24 samples with the three groups of SNPs
A microarray image from the analysis of single sperm with Group II SNPs. Each probe was printed twice as neighboring spots on the microarray. Spots in red and green, homozygous; yellow, heterozygous; white, pink, and light green, spots with strong signal that have exceeded the linear range; and dark, low signal, which does not necessarily mean no signal or a signal too low for genotype calls. Yellow spots are either from SNPs that were not real because of the presence of a small portion of SNPs consisting of paralogous sequence variants in the databases (Cheung et al. 2003; Fredman et al. 2004), or from a low level (∼5%) of contamination as demonstrated in the previous studies (Cui et al. 1989; Goradia et al. 1991), which has been shown to be from oligonucleotides synthesized by the current hemi-open-oligonucleotide synthesis system. Note that heterozygous SNPs are treated as uninformative in genetic analyses with single sperm.
A high-throughput genotyping system for scoring single nucleotide polymorphisms (SNPs) has been developed. With this system, >1000 SNPs can be analyzed in a single assay, with a sensitivity that allows the use of single haploid cells as starting material. In the multiplex polymorphic sequence amplification step, instead of attaching universal sequences to the amplicons, primers that are unlikely to have nonspecific and productive interactions are used. Genotypes of SNPs are then determined by using the widely accessible microarray technology and the simple single-base extension assay. Three SNP panels, each consisting of >1000 SNPs, were incorporated into this system. The system was used to analyze 24 human genomic DNA samples. With 5 ng of human genomic DNA, the average detection rate was 98.22% when single probes were used, and 96.71% could be detected by dual probes in different directions. When single sperm cells were used, 91.88% of the SNPs were detectable, which is comparable to the level that was reached when very few genetic markers were used. By using a dual-probe assay, the average genotyping accuracy was 99.96% for 5 ng of human genomic DNA and 99.95% for single sperm. This system may significantly facilitate large-scale genetic analysis even when the DNA template is very limited in amount or highly degraded, as is the case for DNA obtained from paraffin-embedded cancer specimens, and may make many otherwise impractical research projects realistic and affordable.
Targeted mutagenesis in human K562 cells via direct delivery of RGEN RNPs. (A) CCR5-specific RGEN RNP-mediated mutations measured by the T7E1 assay. (B) Mutant DNA sequences at the CCR5 locus. The 20-bp target sequence is underlined and shown in bold. The PAM sequence is shown in red. (C) RGEN RNP-mediated mutagenesis at several endogenous loci. A mixture of Cas9 protein (15 μg) and sgRNA (20 μg) was transfected into 2 × 10^5 K562 cells. PCR amplicons around RGEN target sites were subjected to the T7E1 assay. Representative data from at least three independent experiments are shown.
Genome editing in BJ fibroblasts and H9 hES cell lines via direct delivery of RGEN RNPs. (A) CCR5-specific RGEN-driven mutations detected by the T7E1 assay in H9 and BJ cells. (B) RGEN-driven mutations in H9 ES cells detected by the T7E1 assay. A mixture of Cas9 protein (75 μg) and sgRNA (100 μg) was transfected into 1 × 10^6 H9 cells. (C) Cytotoxicity of RGEN RNPs vs. RGEN plasmid in H9 ES cells. (**) P < 0.01, (*) P < 0.05. (D) No apparent changes in the physiology of ES cells after RGEN RNP treatment. Untransfected, RNP-, and plasmid-transfected ES cell colonies were subjected to AP staining.
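The scatter-plot caption for this system describes dividing spots into three genotype groups using cutoffs on the natural logarithm of the Cy3/Cy5 intensity ratio at 2 and −2. A minimal Python sketch of that thresholding step follows. The function name, the genotype labels, the handling of non-positive intensities, and the treatment of values exactly at a cutoff are assumptions for illustration, and which dye marks which allele is not stated in the text.

```python
import math


def call_genotype(cy3, cy5, cutoff=2.0):
    """Classify one spot from its two dye intensities.

    The +/-2 cutoff on ln(Cy3/Cy5) comes from the figure caption;
    everything else here (names, labels, edge cases) is hypothetical.
    """
    # A log ratio is undefined for zero or negative intensities.
    if cy3 <= 0 or cy5 <= 0:
        return "no call"
    log_ratio = math.log(cy3 / cy5)  # natural logarithm
    if log_ratio > cutoff:
        return "homozygous (Cy3 allele)"
    if log_ratio < -cutoff:
        return "homozygous (Cy5 allele)"
    return "heterozygous"


if __name__ == "__main__":
    for cy3, cy5 in [(10000, 500), (500, 10000), (3000, 2500)]:
        print(cy3, cy5, call_genotype(cy3, cy5))
```

A real pipeline would also need a minimum-signal filter and a rule for saturated spots, which the caption mentions but does not quantify.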
Off-target mutations caused by RGEN RNPs vs. RGEN plasmids. RGEN RNPs or plasmids that encode Cas9 and sgRNA were electroporated into K562 cells. Mutations were detected using the T7E1 assay (left) and deep sequencing (right). The PAM sequence is shown in blue. Mismatched bases are shown in red. 
Time-course analyses of RGEN-mediated genome editing via RNP delivery or plasmid transfection. (A, top) Mutation frequencies were determined by the T7E1 assay. (Bottom) Western blot analysis of K562 cells transfected with the CCR5-specific RGEN via RNP or plasmid DNA delivery. (B,C) Line graphs showing the results of the T7E1 (B) and Western blot analysis (C). Note that only the relative abundance of Cas9 in each experiment is shown.
RNA-guided engineered nucleases (RGENs) derived from the prokaryotic adaptive immune system known as CRISPR (clustered, regularly interspaced, short palindromic repeat)/Cas (CRISPR-associated) enable genome editing in human cell lines, animals, and plants but are limited by off-target effects and unwanted integration of DNA segments derived from plasmids encoding Cas9 and guide RNA at both on-target and off-target sites in the genome. Here, we deliver purified recombinant Cas9 protein and guide RNA into cultured human cells including hard-to-transfect fibroblasts and pluripotent stem cells. RGEN ribonucleoproteins (RNPs) induce site-specific mutations at frequencies of up to 79%, while reducing off-target mutations associated with plasmid transfection at off-target sites that differ by one or two nucleotides from on-target sites. RGEN RNPs cleave chromosomal DNA almost immediately after delivery and are degraded rapidly in cells, reducing off-target effects. Furthermore, RNP delivery is less stressful to human embryonic stem cells, producing at least two-fold more colonies than does plasmid transfection.
Horizontal gene transfer of alpha-amylase genes from A. oryzae to A. niger CBS 513.88. (A) Unmatched region identified for the left arm of chromosome III spanning 65 kb for ATCC 1015 encoding 30 predicted genes and 85 kb for CBS 513.88 encoding 24 predicted genes. The unmatched region is flanked on one side by a small local inversion of three predicted genes (in green). (B) The 12.4-kb HGT region is part of an identical 12.7-kb region present in A. oryzae RIB40 supercontigs SC113 and SC023. In A. niger CBS 513.88, the transferred region is enclosed by a 203-bp inverted repeat (red arrow). An12g07000 (blue) is a putative transposase and identical to the A. oryzae transposon Aot1 tnpA gene. Similarly, the A. niger CBS 513.88 alpha-amylase-encoding gene An12g06930 (orange) is identical to A. oryzae annotated genes AO090120000196 (SC113) and AO090023000944 (SC023). (C) Proposed duplication-recombination event between supercontig 12 and supercontig 05 of the 12-kb HGT region. Breakpoints are indicated with dotted lines. The fragment encoding alpha-amylase An05g02200 is identical to An12g06930 (orange). The breakpoints are flanked with additional copies of the 203-bp repeat region (red arrow). The region encoding genes An12g06940 to An12g06970 is identical to the region encoding An05g02210 to An05g02130. The downstream breakpoint occurred in the predicted gene coding region of An12g06970, thereby deleting the original start codon and upstream region.
(A) Phylogenetic relationship of several black Aspergilli based on partial sequencing of beta-tubulin. The tree was rooted to Aspergillus aculeatus CBS 172.66. (B) Phylogenetic relationship of seven strains of A. niger based on sequencing of 1-kb variable regions from four chromosomes. The tree was rooted to the sequence obtained from Aspergillus carbonarius IMI 388653. Clades based on the exo-metabolomic groupings of Table 2 are shown. For both trees, bootstrap values above 80% of the 1000 performed reiterations are shown.
The filamentous fungus Aspergillus niger exhibits great diversity in its phenotype. It is found globally, both as marine and terrestrial strains, produces both organic acids and hydrolytic enzymes in high amounts, and some isolates exhibit pathogenicity. Although the genome of an industrial enzyme-producing A. niger strain (CBS 513.88) has already been sequenced, the versatility and diversity of this species compel additional exploration. We therefore undertook whole-genome sequencing of the acidogenic A. niger wild-type strain (ATCC 1015) and produced a genome sequence of very high quality. Only 15 gaps are present in the sequence, and half the telomeric regions have been elucidated. Moreover, sequence information from ATCC 1015 was used to improve the genome sequence of CBS 513.88. Chromosome-level comparisons uncovered several genome rearrangements, deletions, a clear case of strain-specific horizontal gene transfer, and identification of 0.8 Mb of novel sequence. Single nucleotide polymorphisms per kilobase (SNPs/kb) between the two strains were found to be exceptionally high (average: 7.8, maximum: 160 SNPs/kb). High variation within the species was confirmed with exo-metabolite profiling and phylogenetics. Detailed lists of alleles were generated, and genotypic differences were observed to accumulate in metabolic pathways essential to acid production and protein synthesis. A transcriptome analysis supported up-regulation of genes associated with biosynthesis of amino acids that are abundant in glucoamylase A, tRNA-synthases, and protein transporters in the protein producing CBS 513.88 strain. Our results and data sets from this integrative systems biology analysis resulted in a snapshot of fungal evolution and will support further optimization of cell factories based on filamentous fungi.
Degree of Coexpression of Genes Within the Same Pathway as Defined by the KEGG Database
Descriptive Statistics for Pairwise Comparison of Neighboring Genes According to Orientation of Transcription
Large-scale analyses of expression data of eukaryotic organisms are now becoming increasingly routine. The data sets are revealing interesting and novel patterns of genomic organization, which provide insight both into molecular evolution and how structure and function of a genome interrelate. Our study investigates, for the first time, how genome organization affects expression of a gene in the Arabidopsis genome. The analyses show that neighboring genes are coexpressed. This pattern has been found for all eukaryotic genomes studied so far, but as yet, it remains unclear whether it is due to selective or nonselective influences. We have investigated reasons for coexpression of neighboring genes in Arabidopsis, and our evidence suggests that orientation of gene pairs plays a significant role, with potential sharing of regulatory elements in divergently transcribed genes. Using the data available in the KEGG database, we find evidence that genes in the same pathway are coexpressed, although this is not a major cause for the coexpression of neighboring genes.
Genome size varies greatly across angiosperms. It is well documented that, in addition to polyploidization, retrotransposon amplification has been a major cause of genome expansion. The lack of evidence for counterbalancing mechanisms that curtail unlimited genome growth has made many of us wonder whether angiosperms have a "one-way ticket to genomic obesity." We have therefore investigated an angiosperm with a well-characterized and notably small genome, Arabidopsis thaliana, for evidence of genomic DNA loss. Our results indicate that illegitimate recombination is the driving force behind genome size decrease in Arabidopsis, removing at least fivefold more DNA than unequal homologous recombination. The presence of highly degraded retroelements also suggests that retrotransposon amplification has not been confined to the last 4 million years, as is indicated by the dating of intact retroelements.
Chromatin structure is central for the regulation of gene expression, but its genome-wide organization is only beginning to be understood. Here, we examine the connection between patterns of nucleosome occupancy and the capacity to modulate gene expression upon changing conditions, i.e., transcriptional plasticity. By analyzing genome-wide data of nucleosome positioning in yeast, we find that the presence of nucleosomes close to the transcription start site is associated with high transcriptional plasticity, while nucleosomes at more distant upstream positions are negatively correlated with transcriptional plasticity. Based on this, we identify two typical promoter structures associated with low or high plasticity, respectively. The first class is characterized by a relatively large nucleosome-free region close to the start site coupled with well-positioned nucleosomes further upstream, whereas the second class displays a more evenly distributed and dynamic nucleosome positioning, with high occupancy close to the start site. The two classes are further distinguished by multiple promoter features, including histone turnover, binding site locations, H2A.Z occupancy, expression noise, and expression diversity. Analysis of nucleosome positioning in human promoters reproduces the main observations. Our results suggest two distinct strategies for gene regulation by chromatin, which are selectively employed by different genes.
Contiguous finished sequence from highly duplicated pericentromeric regions of human chromosomes is needed if we are to understand the role of pericentromeric instability in disease, and in gene and karyotype evolution. Here, we have constructed a BAC contig spanning the transition from pericentromeric satellites to genes on the short arm of human chromosome 10, and used this to generate 1.4 Mb of finished genomic sequence. Combining RT-PCR, in silico gene prediction, and paralogy analysis, we can identify two domains within the sequence. The proximal 600 kb consists of satellite-rich pericentromerically duplicated DNA which is transcript poor, containing only three unspliced transcripts. In contrast, the distal 850 kb contains four known genes (ZNF248, ZNF25, ZNF33A, and ZNF37A) and up to 32 additional transcripts of unknown function. This distal region also contains seven out of the eight intrachromosomal duplications within the sequence, including the p arm copy of the approximately 250-kb duplication which gave rise to ZNF33A and ZNF33B. By sequencing orthologs of the duplicated ZNF33 genes we have established that ZNF33A has diverged significantly at residues critical for DNA binding but ZNF33B has not, indicating that ZNF33B has remained constrained by selection for ancestral gene function. These results provide further evidence of gene formation within intrachromosomal duplications, but indicate that recent interchromosomal duplications at this centromere have involved transcriptionally inert, satellite rich DNA, which is likely to be heterochromatic. This suggests that any novel gene structures formed by these interchromosomal events would require relocation to a more open chromatin environment to be expressed.
TAR cloning and sequencing of PnC DNA. The shaded area represents the region corresponding to the ∼80-kb 10q25.5 NC DNA (du Sart et al. 1997). (A) Sequenced regions of the HC DNA (derived from a CEPH library YAC clone) and NC DNA [derived from the mardel(10) neocentromere]. Total number of nucleotides sequenced is shown in brackets. (B) Structure of the HC/NC region and flanking DNA. Solid boxes represent STSs used in the identification and cloning of the DNA. AFM259xg5 is a (CA)n microsatellite located ∼150 kb (represented by the broken line) from the core region (Cancilla et al. 1998). AT28 (Barry et al. 1999) is a polymorphic VNTR used to identify the progenitor allele. C3-F2 is a 1.4-kb EcoRI fragment that served as the specific TAR “hook” (Cancilla et al. 1998). Small arrows indicate oligonucleotides used in PCR of the STSs. p′ and q′ refer to the short and long arms of mardel(10), respectively. (C) Radial TAR strategy using the vector pVC39-Alu/C3-F2(+) for the direct cloning of the progenitor DNA from the total genomic DNA of CE. The hatched box indicates the position of the Alu consensus sequence hook. Crosses denote the sites of recombination between the TAR vector pVC39-Alu/C3-F2(+) and CE genomic DNA at the C3-F2 and Alu hooks during cloning. The resulting circular YAC, CE-4–27, was shown by the AT28 polymorphism (see Fig. 3) to contain the PnC DNA from the progenitor chromosome 10. (D) The ∼69-kb sequenced portion of PnC DNA, represented by the bar.
We have previously localized the core centromere protein-binding domain of a 10q25.2-derived neocentromere to an 80-kb genomic region. Detailed analysis has indicated that the 80-kb neocentromere (NC) DNA has a similar overall organization to the corresponding region on a normal chromosome 10 (HC) DNA, derived from a genetically unrelated CEPH individual. Here we report sequencing of the HC DNA and its comparison to the NC sequence. Single-base differences were observed at a maximum rate of 4.6 per kb; however, no deletions, insertions, or other structural rearrangements were detected. To investigate whether the observed changes, or subsets of these, might be de novo mutations involved in neocentromerization (i.e., in committing a region of a chromosome to neocentromere formation), the progenitor DNA (PnC) from which the NC DNA descended, was cloned and sequenced. Direct comparison of the PnC and NC sequences revealed 100% identity, suggesting that the differences between NC and HC DNA are single nucleotide polymorphisms (SNPs) and that formation of the 10q25.2 NC did not involve a change in DNA sequence in the core centromere protein-binding NC region. This is the first study in which a cloned NC DNA has been compared directly with its inactive progenitor DNA at the primary sequence level. The results form the basis for future sequence comparison outside the core protein-binding domain, and provide direct support for the involvement of an epigenetic mechanism in neocentromerization. [The sequences in this paper have been submitted to GenBank under accession nos. AF222855 (not yet available) for HC; AF042484 for NCI; AF222854 (not yet available) for NCII; and AF222856 (not yet available) for PnC.]
The genome of the halophilic archaeon Halobacterium sp. NRC-1 and predicted proteome have been analyzed by computational methods and reveal characteristics relevant to life in an extreme environment distinguished by hypersalinity and high solar radiation: (1) The proteome is highly acidic, with a median pI of 4.9 and mostly lacking basic proteins. This characteristic correlates with high surface negative charge, determined through homology modeling, as the major adaptive mechanism of halophilic proteins to function in nearly saturating salinity. (2) Codon usage displays the expected GC bias in the wobble position and is consistent with a highly acidic proteome. (3) Distinct genomic domains of NRC-1 with bacterial character are apparent by whole proteome BLAST analysis, including two gene clusters coding for a bacterial-type aerobic respiratory chain. This result indicates that the capacity of halophiles for aerobic respiration may have been acquired through lateral gene transfer. (4) Two regions of the large chromosome were found with relatively lower GC composition and overrepresentation of IS elements, similar to the minichromosomes. These IS-element-rich regions of the genome may serve to exchange DNA between the three replicons and promote genome evolution. (5) GC-skew analysis showed evidence for the existence of two replication origins in the large chromosome. This finding and the occurrence of multiple chromosomes indicate a dynamic genome organization with eukaryotic character.
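Point (5) above relies on GC-skew analysis. As a minimal sketch of the conventional calculation, skew is taken as (G − C)/(G + C) in windows along the sequence, and sign changes in the windowed or cumulative skew are commonly read as candidate replication origins or termini. The function names, window size, and step below are assumptions; the abstract does not state the parameters or software actually used.

```python
def gc_skew(seq):
    """Return (G - C) / (G + C) for one sequence; 0.0 if it has no G or C."""
    s = seq.upper()
    g, c = s.count("G"), s.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def windowed_gc_skew(seq, window, step):
    """GC skew in fixed-size windows; returns (window_start, skew) pairs."""
    return [
        (start, gc_skew(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


def cumulative_skew(values):
    """Running sum of per-window skews; extrema mark candidate skew shifts."""
    total, out = 0.0, []
    for v in values:
        total += v
        out.append(total)
    return out


if __name__ == "__main__":
    toy = "GGGGCCCC"  # toy sequence, not real genome data
    windows = windowed_gc_skew(toy, window=4, step=4)
    print(windows)
    print(cumulative_skew([skew for _, skew in windows]))
```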
Cloned and Candidate Drosophila Neuropeptide Receptors
Neighbor-joining phylogenetic trees for the Family A, Group V receptors. (A) Rooted tree for the opioid, somatostatin, galanin, and allatostatin receptors. (B) Unrooted tree for the gonadotropin releasing hormone (GnRH), vasopressin, and oxytocin receptors. The likely midpoint of the tree is indicated with an "X." (C) Rooted tree for the glycoprotein hormone receptors and related leucine-rich repeat containing receptors (LGRs). Bootstrap scores, omitted branches, and Drosophila GPCRs are indicated as in Fig. 1. (ALGR) Anthopleura elegantissima (sea anemone) LGR; (FSHR) follicle-stimulating hormone receptor; (GALR) galanin receptor type 1; (GALS) galanin receptor type 2; (GALT) galanin receptor type 3; (GPR24 and GPR54) mammalian orphan GPCRs; (GRHR) GnRH receptor; (ITR) isotocin receptor; (LGR4-7) LGR types 4-7; (LSCPR and LSCPR2) Lymnaea stagnalis conopressin receptor types 1 and 2; (LSHR) lutropin-choriogonadotropic hormone receptor; (MTR) mesotocin receptor; (NLGR) C. elegans LGR; (ORPH4) Lymnaea stagnalis orphan GPCR; (OPRD) delta-type opioid receptor; (OPRK) kappa-type opioid receptor; (OPRM) mu-type opioid receptor; (OPRX) nociceptin/orphanin FQ receptor; (OXYR) oxytocin receptor; (SLGR) L. stagnalis GRL101; (SSR1-SSR5) somatostatin receptor types 1-5; (TSHR) thyrotropin receptor; (V1AR and V1BR) vasopressin V1A and V1B receptors; (V2R) vasopressin V2 receptor; (VTR) vasotocin receptor. The remaining non-Drosophila sequences are orphan GPCRs from C. elegans.
Classification of Cloned and Candidate Drosophila Neuropeptide Receptors by BLAST and Phylogenetic Analysis
Locations and Phasing of Introns among Genes Encoding Drosophila Family A Peptide GPCRs
Unrooted neighbor-joining tree for the Family B receptors. The location of the tree midpoint is ambiguous and is therefore not indicated. Bootstrap scores, omitted branches, and Drosophila GPCRs are indicated as in Fig. 1. The four groups of Family B receptors are indicated with vertical bars. (BAI) brain-specific angiogenesis inhibitors 1–3; (CALR) calcitonin receptor; (CAR1) cyclic AMP receptor 1; (CD97) leucocyte antigen CD97; (CGRR) calcitonin gene-related peptide type 1 receptor; (CRF2) corticotropin releasing factor (CRF) receptor 2; (CRFR) CRF receptor 1; (DIHR) diuretic hormone receptor; (EMR1) cell surface glycoprotein EMR1; (GIPR) gastric inhibitory polypeptide receptor; (GLP2R) glucagon-like peptide 2 receptor; (GLPR) glucagon-like peptide 1 receptor; (GLR) glucagon receptor; (GRFR) growth hormone releasing hormone receptor; (HE6) G protein-coupled receptor HE6; (LRP1–3) calcium-independent alpha-latrotoxin receptors (latrophilins) 1–3; (MEGF2) seven-pass transmembrane proteins CELSR1–2 and MEGF2; (PACR) pituitary adenylate cyclase activating polypeptide (PACAP) type I receptor; (PTR2) parathyroid hormone receptor; (PTRR) parathyroid hormone/parathyroid hormone-related peptide receptor; (SCRC) secretin receptor; (TM7XM1) human EGF-TM7 like protein; (VIPR) vasoactive intestinal polypeptide (VIP) receptor 1; (VIPS) VIP receptor 2. The remaining non-Drosophila sequences are orphan GPCRs from Caenorhabditis elegans.
Recent genetic analyses in worms, flies, and mammals illustrate the importance of bioactive peptides in controlling numerous complex behaviors, such as feeding and circadian locomotion. To pursue a comprehensive genetic analysis of bioactive peptide signaling, we have scanned the recently completed Drosophila genome sequence for G protein-coupled receptors sensitive to bioactive peptides (peptide GPCRs). Here we describe 44 genes that represent the vast majority, and perhaps all, of the peptide GPCRs encoded in the fly genome. We also scanned for genes encoding potential ligands and describe 22 bioactive peptide precursors. At least 32 Drosophila peptide receptors appear to have evolved from common ancestors of 15 monophyletic vertebrate GPCR subgroups (e.g., the ancestral gastrin/cholecystokinin receptor). Six pairs of receptors are paralogs, representing recent gene duplications. Together, these findings shed light on the evolutionary history of peptide GPCRs, and they provide a template for physiological and genetic analyses of peptide signaling in Drosophila.
List of eSTS markers used in this study 
We have tested 80 expressed sequence-tagged site (eSTS) markers assigned to human chromosome 11 by the Genexpress program on a panel of somatic cell hybrids containing parts of this chromosome, characterized by cytogenetic data, reference markers, and with respect to the Généthon microsatellite genetic map. Sixty-eight new gene transcripts have been assigned to 25 subregions, one of which was newly defined by five of the eSTS markers. The markers are distributed on the short and long arms in agreement with their physical length. The genic map thus obtained has been integrated with the cytogenetic, genetic, and disease maps. Two eSTS markers have been further mapped with respect to a yeast artificial chromosome (YAC) contig close to the brain-derived neurotrophic factor (BDNF) gene and thus provide potential candidate genes for the mental retardation phenotype of WAGR (Wilms' tumor, aniridia, genitourinary abnormalities and mental retardation) syndrome. Altogether, the 68 new gene transcripts localized here represent more than a threefold increase in the number of unknown regionalized genes that could reveal potential candidate genes for the numerous orphan pathologies associated with chromosome 11.
Relative frequency of Class I microsatellites with different simple sequence repeat motifs in three sets of DNA sequence data.  
Characteristics of Rice Microsatellites and Efficiency of SSR Marker Development 
Molecular linkage map of rice. The framework is based on the IR64/Azucena doubled haploid (DH) population. Short arms of chromosomes are at the top. Approximate positions  
A total of 57.8 Mb of publicly available rice (Oryza sativa L.) DNA sequence was searched to determine the frequency and distribution of different simple sequence repeats (SSRs) in the genome. SSR loci were categorized into two groups based on the length of the repeat motif. Class I, or hypervariable markers, consisted of SSRs ≥20 bp, and Class II, or potentially variable markers, consisted of SSRs ≥12 bp but <20 bp. The occurrence of Class I SSRs in end-sequences of EcoRI- and HindIII-digested BAC clones was one SSR per 40 kb, whereas in continuous genomic sequence (represented by 27 fully sequenced BAC and PAC clones), the frequency was one SSR every 16 kb. Class II SSRs were estimated to occur every 3.7 kb in BAC ends and every 1.9 kb in fully sequenced BAC and PAC clones. GC-rich trinucleotide repeats (TNRs) were most abundant in protein-coding portions of ESTs and in fully sequenced BACs and PACs, whereas AT-rich TNRs showed no such preference, and di- and tetranucleotide repeats were most frequently found in noncoding, intergenic regions of the rice genome. Microsatellites with poly(AT)n repeats represented the most abundant and polymorphic class of SSRs but were frequently associated with the Micropon family of miniature inverted-repeat transposable elements (MITEs) and were difficult to amplify. A set of 200 Class I SSR markers was developed and integrated into the existing microsatellite map of rice, providing immediate links between the genetic, physical, and sequence-based maps. This contribution brings the number of microsatellite markers that have been rigorously evaluated for amplification, map position, and allelic diversity in Oryza spp. to a total of 500. [Clone sequences for 199 markers (RM1–RM88, RM200–RM345) developed in this lab are available as GenBank accessions AF343840–AF343869 and AF344003–AF344169.]
The genetic factors involved in type II diabetes are still unknown. To address this problem, we are creating a 10 to 15 cM genetic map on 444 individuals from 32 Mexican American families ascertained on a type II diabetic proband. Using highly polymorphic microsatellite markers and a multipoint variance components method, we found evidence for linkage of plasma glucose concentration 2 hr after oral glucose administration to two regions on chromosome 11: beta-hemoglobin (HBB) and markers D11S899/D11S1324 near the sulfonylurea receptor (SUR) gene. Lod scores at these two loci were 2.77 and 3.37, respectively. The SUR gene region accounted for 44.7% of the phenotypic variance. Evidence for linkage to fasting glucose concentration was also observed for two loci on chromosome 6, one of which is identical to a proposed susceptibility locus for type I diabetes (D6S290). When diabetics were excluded from the analyses, all lod scores became zero, suggesting that the observed linkages were with the trait diabetes rather than with normal variation in glucose levels. Results were similar whether all diabetics were included in the analyses or only those who were not under treatment with oral antidiabetic agents or insulin.
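The two SSR classes defined above (Class I at 20 bp or longer, Class II at 12 bp or longer but under 20 bp) can be sketched in Python as follows. This is an illustration only: the regex-based search, the restriction to perfect di-, tri-, and tetranucleotide repeats, and all function names are assumptions, not the search tool used in the study.

```python
import re


def classify_ssr(length_bp):
    """Apply the length thresholds stated in the abstract."""
    if length_bp >= 20:
        return "Class I"
    if length_bp >= 12:
        return "Class II"
    return None  # too short to count as either class


def is_primitive(motif):
    """False when the motif is itself a repeat of a shorter unit (e.g. ATAT)."""
    k = len(motif)
    return not any(
        k % d == 0 and motif[:d] * (k // d) == motif for d in range(1, k)
    )


def find_ssrs(seq, motif_sizes=(2, 3, 4)):
    """Find perfect tandem repeats; returns (start, motif, length_bp, class)."""
    seq = seq.upper()
    hits = []
    for k in motif_sizes:
        # One k-base motif followed by at least one exact copy of itself.
        pattern = re.compile(r"([ACGT]{%d})\1+" % k)
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if not is_primitive(motif):
                continue  # already reported under its shorter motif
            ssr_class = classify_ssr(len(m.group(0)))
            if ssr_class:
                hits.append((m.start(), motif, len(m.group(0)), ssr_class))
    return hits


if __name__ == "__main__":
    toy = "GGC" + "AT" * 10 + "CCG" + "GAA" * 4 + "TTC"  # toy sequence
    for hit in find_ssrs(toy):
        print(hit)
```

Imperfect or compound repeats, which real SSR surveys often include, are deliberately left out of this sketch.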
ABC Transporters Involved in Drug Resistance
Drosophila ABC Genes
ABC Gene Subfamilies in Characterized Eukaryotes
The ATP-binding cassette (ABC) transporter superfamily contains membrane proteins that translocate a variety of substrates across extra- and intra-cellular membranes. Genetic variation in these genes is the cause of or contributor to a wide variety of human disorders with Mendelian and complex inheritance, including cystic fibrosis, neurological disease, retinal degeneration, cholesterol and bile transport defects, anemia, and drug response. Conservation of the ATP-binding domains of these genes has allowed the identification of new members of the superfamily based on nucleotide and protein sequence homology. Phylogenetic analysis is used to divide all 48 known ABC transporters into seven distinct subfamilies of proteins. For each gene, the precise map location on human chromosomes, expression data, and localization within the superfamily have been determined. These data allow predictions to be made as to potential functions or disease phenotypes associated with each protein. In this paper, we review the current state of knowledge on all human ABC genes in inherited disease and drug resistance. In addition, the availability of the complete Drosophila genome sequence allows the comparison of the known human ABC genes with those in the fly genome. The combined data enable an evolutionary analysis of the superfamily. Complete characterization of all ABC genes from the human genome and from model organisms will lead to important insights into the physiology and the molecular basis of many human disorders.
Estimates of genetic population structure (F(ST)) were constructed from all autosomes in two large SNP data sets. The Perlegen data set contains genotypes on approximately 1 million SNPs segregating in all three samples of Americans of African, Asian, and European descent; and the Phase I HapMap data set contains genotypes on approximately 0.6 million SNPs segregating in all four samples from specific Caucasian, Chinese, Japanese, and Yoruba populations. Substantial heterogeneity of F(ST) values was found between segments within chromosomes, although there was similarity between the two data sets. There was also substantial heterogeneity among population-specific F(ST) values, with the relative sizes of these values often changing along each chromosome. Population-structure estimates are often used as indicators of natural selection, but the analyses presented here show that individual-marker estimates are too variable to be useful. There is inherent variation in these statistics because of variation in genealogy even among neutral loci, and values at pairs of loci are correlated to an extent that reflects the linkage disequilibrium between them. Furthermore, it may be that the best indications of selection will come from population-specific F(ST) values rather than the usually reported population-average values.
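As an illustration of the quantity discussed above, the following is a minimal sketch of Wright's F(ST) for a single biallelic SNP, computed from per-population allele frequencies with equal population weights. It is not the estimator used in the study, which is not specified in this abstract; the function name `fst_biallelic` and the example frequencies are hypothetical.

```python
def fst_biallelic(freqs):
    """Wright's F_ST for one biallelic SNP.

    freqs: allele frequency of the same allele in each population.
    Populations are weighted equally (an assumption of this sketch;
    no sample-size correction is applied).
    """
    if not freqs:
        raise ValueError("at least one population frequency is required")
    n = len(freqs)
    p_bar = sum(freqs) / n
    # Expected heterozygosity of the pooled (total) population.
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    if h_total == 0.0:
        # Monomorphic marker: no variation to partition.
        return 0.0
    # Mean expected heterozygosity within populations.
    h_within = sum(2.0 * p * (1.0 - p) for p in freqs) / n
    return (h_total - h_within) / h_total


if __name__ == "__main__":
    # Hypothetical frequencies in three populations.
    print(fst_biallelic([0.2, 0.5, 0.8]))
```

Because a single-marker value like this depends on only a handful of frequencies, it fluctuates strongly from SNP to SNP, which is consistent with the abstract's caution about individual-marker estimates.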
Elevated galactose levels can be caused by several enzyme defects, one of which is galactokinase deficiency. Galactokinase deficiency causes congenital cataracts during infancy and presenile cataracts in the adult population. We have isolated the mouse cDNA for galactokinase, which shares extensive amino acid sequence homology, 88% identity, with a recently cloned human galactokinase. It is expressed in all tissues examined. In an interspecific backcross analysis galactokinase maps to the distal region of mouse chromosome 11, a region that is homologous to human chromosome 17q22-25. The availability of the mouse gene provides an opportunity to make a knockout model for galactokinase deficiency.
With the human genome sequence approaching completion, a major challenge is to identify the locations and encoded protein sequences of all human genes. To address this problem we have developed a new gene identification algorithm, GenomeScan, which combines exon-intron and splice signal models with similarity to known protein sequences in an integrated model. Extensive testing shows that GenomeScan can accurately identify the exon-intron structures of genes in finished or draft human genome sequence with a low rate of false-positives. Application of GenomeScan to 2.7 billion bases of human genomic DNA identified at least 20,000-25,000 human genes out of an estimated 30,000-40,000 present in the genome. The results show an accurate and efficient automated approach for identifying genes in higher eukaryotic genomes and provide a first-level annotation of the draft human genome.
Outline of the FunCoup network reconstruction process. Amounts of input data and sizes of training sets are shown for each species in FunCoup version 1.0. Input data are as follows: MEX, mRNA coexpression; PHP, phylogenetic profile similarity; PPI, protein–protein interactions; SCL, subcellular colocalization; TFB, shared transcription factor binding; PEX, protein coexpression; MIR, miRNA targeting of transcripts; and DOM, domain associations. Training sets are as follows: ML, links between proteins from the same metabolic pathways; SL, links between proteins from the same signaling pathways; PI, experimentally observed protein–protein interactions; and CM, pairs of protein-members of the same complex. The Bayesian framework processes the input data using the training sets. The input datapoints are converted into raw interaction scores, which are grouped into discrete regions. Each such bin is assigned an FC score using the training sets. The “cards” illustrate the results of this process, showing the raw interaction score along the horizontal axis. For each training set, or functional class, the resulting bins are shown as colored rectangles: (green) positive evidence of FC; (white) either close to neutral or insignificant; (red) negative evidence of FC. Finally, the FC scores are calculated for all possible gene pairs in each species. For brevity, the predicted links of different functional classes have been combined into one network per species.
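The binning step described in this caption can be sketched as a naive-Bayes-style log-likelihood ratio per bin. This is an illustrative assumption, not FunCoup's actual scoring code: the bin edges, the pseudocount, and the names `bin_index` and `bin_llr` are hypothetical, and the caption does not state how FC scores are computed in detail.

```python
import math
from bisect import bisect_right


def bin_index(score, edges):
    """Return the index of the bin that a raw score falls into.

    edges: sorted list of bin boundaries; n edges define n + 1 bins.
    """
    return bisect_right(edges, score)


def bin_llr(positive_scores, background_scores, edges, pseudocount=1.0):
    """Log-likelihood ratio for each bin of a raw interaction score.

    positive_scores: raw scores of gene pairs in a training set
                     (for example, pairs from the same complex).
    background_scores: raw scores of background gene pairs.
    A pseudocount keeps empty bins from producing log(0).
    Positive values indicate evidence for coupling, negative against.
    """
    n_bins = len(edges) + 1
    pos = [pseudocount] * n_bins
    bg = [pseudocount] * n_bins
    for s in positive_scores:
        pos[bin_index(s, edges)] += 1
    for s in background_scores:
        bg[bin_index(s, edges)] += 1
    pos_total = sum(pos)
    bg_total = sum(bg)
    return [
        math.log((p / pos_total) / (b / bg_total))
        for p, b in zip(pos, bg)
    ]


if __name__ == "__main__":
    # Hypothetical coexpression-like scores split into two bins at 0.5.
    print(bin_llr([0.9, 0.8, 0.7], [0.1, 0.2, 0.3], edges=[0.5]))
```

In this reading, a bin with a clearly positive ratio would correspond to a green rectangle on the "cards", a ratio near zero to white, and a clearly negative ratio to red.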
Scenarios for interaction inheritance. (A) Interacting genes A and B in an ancestral species are duplicated into two and three genes in species 1, while B is duplicated into two genes in species 2. These duplicated genes are clusters of inparalogs in relation to the other species, meaning that they are co-orthologous to the corresponding A and B genes in that species. When transferring functional coupling (arrow marked FC) between orthologs, one can consider the interaction either to be valid for all inparalogs in a cluster, or to be specific for a particular inparalog pair (e.g., the seed orthologs, i.e., the two most similar ones, dotted arrows). The transfer of interaction information thus can either be done from an arbitrary inparalog in a cluster, which maximizes the coverage, or only be transferred between, e.g., seed orthologs. Benchmarking showed that transferring the best interaction in a set of inparalogs to all inparalogs in the other species yields the best results (Supplemental Material). (B) Real example of this situation: An interaction in fly ptc–babo (Shyamala and Bhat 2002) can either be considered to be valid for all six pairs of mouse inparalogs {Ptch1, Ptch2} vs. {Tgfbr1, Acvr1b, Acvr1c} or to be specific for a particular inparalog pair. Seed orthologs in this case are ptc/Ptch1 and babo/Tgfbr1. The two fly genes are connected with solid arrows; the others are mouse genes. The arrow denotes the known protein–protein interaction between ptc and babo. The leaves of the gene trees present protein domain architectures.
No single experimental method can discover all connections in the interactome. A computational approach can help by integrating data from multiple, often unrelated, proteomics and genomics pipelines. Reconstructing global networks of functional coupling (FC) faces the challenges of scale and heterogeneity--how to efficiently integrate huge amounts of diverse data from multiple organisms, yet ensuring high accuracy. We developed FunCoup, an optimized Bayesian framework, to resolve these issues. Because interactomes comprise functional coupling of many types, FunCoup annotates network edges with confidence scores in support of different kinds of interactions: physical interaction, protein complex member, metabolic, or signaling link. This capability boosted overall accuracy. On the whole, the constructed framework was comprehensively tested to optimize the overall confidence and ensure seamless, automated incorporation of new data sets of heterogeneous types. Using over 50 data sets in seven organisms and extensively transferring information between orthologs, FunCoup predicted global networks in eight eukaryotes. For the Ciona intestinalis network, only orthologous information was used, and it recovered a significant number of experimental facts. FunCoup predictions were validated on independent cancer mutation data. We show how FunCoup can be used for discovering candidate members of the Parkinson and Alzheimer pathways. Cross-species pathway conservation analysis provided further support to these observations.
The gastrointestinal microbiome undergoes shifts in species and strain abundances, yet dynamics involving closely related microorganisms remain largely unknown because most methods cannot resolve them. We developed new metagenomic methods and utilized them to track species and strain level variations in microbial communities in 11 fecal samples collected from a premature infant during the first month of life. 96 % of the sequencing reads were assembled into scaffolds of >500 bp length that could be assigned to organisms at the strain level. Six essentially complete (~99 %) and two near-complete genomes were assembled for bacteria that comprised as little as 1 % of the community, as well as nine partial genomes of bacteria representing as little as 0.05 %. In addition, three viral genomes were assembled and assigned to their hosts. The relative abundance of three Staphylococcus epidermidis strains, as well as three phage that infect them, changed dramatically over time. Genes possibly related to these shifts include those for resistance to antibiotics, heavy metals and phage. At the species level we observed the decline of an early-colonizing Propionibacterium acnes strain similar to SK137 and the proliferation of novel Propionibacterium and Peptoniphilus species late in colonization. The Propionibacterium species differed in their ability to metabolize carbon compounds such as inositol and sialic acid, indicating that shifts in species composition likely impact the metabolic potential of the community. These results highlight the benefit of reconstructing complete genomes from metagenomic data and demonstrate methods for achieving this goal.
Solexa sequencing profile of derivative chromosome 9 from patient 1. 1-Mb intervals around the breakpoints on chromosome 7 (A) and 9 (B) are shown. 199,421 and 1,047,649 reads derived from the der (9) were mapped to unique positions on normal chromosomes 7 and 9, respectively. The number of reads was then binned into nonoverlapping 1-kb segments and plotted against the chromosome coordinates. (Arrows) Breakpoints.
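The binning described in this caption can be sketched as follows: mapped read start positions are counted in nonoverlapping 1-kb windows along a chromosome. This is a minimal sketch under stated assumptions (0-based coordinates, one start position per uniquely mapped read); the function name `bin_read_counts` and the example positions are hypothetical and not taken from the study.

```python
from collections import Counter


def bin_read_counts(positions, bin_size=1000):
    """Count reads in nonoverlapping bins along one chromosome.

    positions: iterable of 0-based mapped read start coordinates.
    Returns a dict mapping each occupied bin's start coordinate to
    the number of reads starting in that bin, in coordinate order.
    Bins with no reads are omitted.
    """
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    counts = Counter(pos // bin_size for pos in positions)
    return {b * bin_size: n for b, n in sorted(counts.items())}


if __name__ == "__main__":
    # Hypothetical read starts on one chromosome.
    reads = [0, 999, 1000, 2500, 2999]
    for start, n in bin_read_counts(reads).items():
        print(start, n)
```

Plotting such counts against the bin coordinates gives a profile of the kind shown in the figure, where an abrupt change in read density would be expected at a breakpoint on a flow-sorted derivative chromosome.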
Balanced chromosome rearrangements (BCRs) can cause genetic diseases by disrupting or inactivating specific genes, and the characterization of breakpoints in disease-associated BCRs has been instrumental in the molecular elucidation of a wide variety of genetic disorders. However, mapping chromosome breakpoints using traditional methods, such as in situ hybridization with fluorescent dye-labeled bacterial artificial chromosome clones (BAC-FISH), is rather laborious and time-consuming. In addition, the resolution of BAC-FISH is often insufficient to unequivocally identify the disrupted gene. To overcome these limitations, we have performed shotgun sequencing of flow-sorted derivative chromosomes using "next-generation" (Illumina/Solexa) multiplex sequencing-by-synthesis technology. As shown here for three different disease-associated BCRs, the coverage attained by this platform is sufficient to bridge the breakpoints by PCR amplification, and this procedure allows the determination of their exact nucleotide positions within a few weeks. Its implementation will greatly facilitate large-scale breakpoint mapping and gene finding in patients with disease-associated balanced translocations.
Validation of variants detected in blood but not in brain (A) Somatic SNV detection in blood 
Features of somatic mutations compared to SNPs and disease-associated mutations 
Presence of single nucleotide mutations detected in blood in other W115 tissues. Box plot of the VAF values for the 214 confirmed somatic mutations detected in blood for a variety of other tissues. On each box, the central mark is the median VAF; the edges of the box are the 25th and 75th percentiles, the whiskers extend to the most extreme data points not considered outliers, and outliers are plotted individually as red crosses. 
The somatic mutation burden in healthy white blood cells (WBCs) is not well known. Based on deep whole-genome sequencing, we estimate that approximately 450 somatic mutations accumulated in the nonrepetitive genome within the healthy blood compartment of a 115-yr-old woman. The detected mutations appear to have been harmless passenger mutations: They were enriched in noncoding, AT-rich regions that are not evolutionarily conserved, and they were depleted for genomic elements where mutations might have favorable or adverse effects on cellular fitness, such as regions with actively transcribed genes. The distribution of variant allele frequencies of these mutations suggests that the majority of the peripheral white blood cells were offspring of two related hematopoietic stem cell (HSC) clones. Moreover, telomere lengths of the WBCs were significantly shorter than telomere lengths from other tissues. Together, this suggests that the finite lifespan of HSCs, rather than somatic mutation effects, may lead to hematopoietic clonal evolution at extreme ages.
Quantitative allele frequency estimation based on analyses of mixed templates. (Upper panel) The number of heterozygous SNPs detected in mixtures containing varying amounts of the constitutional and tumor DNAs is shown with the number of heterozygous SNPs indicated in each of the four regions. (Bottom panel, bar-graph) The total number of heterozygous SNPs increases above that seen in tumor DNA (only two) when tumor DNA is contaminated with 15% normal (blood) DNA. The number increases linearly as the percentage of contaminating normal (blood) is increased to 45%.
Many cancers are characterized by chromosomal aberrations that may be predictive of disease outcome. Human neuroblastomas are characterized by somatically acquired copy number changes, including loss of heterozygosity (LOH) at multiple chromosomal loci, and these aberrations are strongly associated with clinical phenotype including patient outcome. We developed a method to assess region-specific LOH by genotyping multiple SNPs simultaneously in DNA from tumor tissues. We identified informative SNPs at an average 293-kb density across nine regions of recurrent LOH in human neuroblastomas. We also identified SNPs in two copy number neutral regions, as well as two regions of copy number gain. SNPs were PCR-amplified in 12-plex reactions and used in solution-phase single-nucleotide extension incorporating tagged dideoxynucleotides. Each extension primer had 5' complementarity to one of 2000 oligonucleotides on a commercially available tag-array platform allowing for solid-phase sorting and identification of individual SNPs. This approach allowed for simultaneous detection of multiple regions of LOH in six human neuroblastoma-derived cell lines, and, more importantly, 14 human neuroblastoma primary tumors. Concordance with conventional genotyping was nearly absolute. Detection of LOH in this assay may not require comparison to matched normal DNAs because of the redundancy of informative SNPs in each region. The customized tag-array system for LOH detection described here is rapid, results in parallel assessment of multiple genomic alterations, and may speed identification of and/or assaying prognostically relevant DNA copy number alterations in many human cancers.
Over 100 distinct disease-associated mutations have been identified in the breast-ovarian cancer susceptibility gene BRCA1. Loss of the wild-type allele in > 90% of tumors from patients with inherited BRCA1 mutations indicates tumor suppressive function. The low incidence of somatic mutations suggests that BRCA1 inactivation in sporadic tumors occurs by alternative mechanisms, such as interstitial chromosomal deletion or reduced transcription. To identify possible features of the BRCA1 genomic region that may contribute to chromosomal instability as well as potential transcriptional regulatory elements, a 117,143-bp DNA sequence encompassing BRCA1 was obtained by random sequencing of four cosmids identified from a human chromosome 17 specific library. The 24 exons of BRCA1 span an 81-kb region that has an unusually high density of Alu repetitive DNA (41.5%), but relatively low density (4.8%) of other repetitive sequences. BRCA1 intron lengths range in size from 403 bp to 9.2 kb and contain the intragenic microsatellite markers D17S1323, D17S1322, and D17S855, which localize to introns 12, 19, and 20, respectively. In addition to BRCA1, the contig contains two complete genes: Rho7, a member of the rho family of GTP binding proteins, and VAT1, an abundant membrane protein of cholinergic synaptic vesicles. Partial sequences of the 1A1-3B B-box protein pseudogene and IFP 35, an interferon induced leucine zipper protein, reside within the contig. An L21 ribosomal protein pseudogene is embedded in BRCA1 intron 13. The order of genes on the chromosome is: centromere-IFP 35-VAT1-Rho7-BRCA1-1A1-3B-telomere.
Binding sites of certain TFs or TF pairs are enriched in repeats. (A) Enrichment of TF binding sites in repetitive elements. The redness of each grid point is proportional to the negative logarithm of enrichment P-value. Repetitive elements are color-coded by family. (B) Enrichment of motif pairs that strongly prefer a narrow distance range in various repetitive elements (Fig. 2C).
Chromatin structure and GC content around TF binding regions. (A,B) Nucleosome occupancy profiles anchored on the summits of TSS-proximal (A) and TSS-distal (B) peaks of YY1 grouped by ChIP-seq signal strength: top (green), middle (red), and bottom (blue) third peaks in terms of ChIP-seq signal. Nucleosome depletion for the top third peaks is shown as D in each panel. (C) Distribution of nucleosome depletion "D" across all tested TFs, with peaks stratified according to TSS proximity (proximal or distal) and ChIP-seq signal strength (top, middle, or bottom third). P-values for pairwise comparisons based on paired Wilcoxon rank-sum tests are: P1 = 8.2 × 10^-17, P2 = 7.6 × 10^-21, P3 = 3.8 × 10^-23, P4 = 8.8 × 10^-10, P5 = 1.1 × 10^-9, P6 = 1.1 × 10^-11, and P7 = 6.6 × 10^-22. (D) TF binding is correlated with significantly more nucleosome depletion than TSS. Wilcoxon rank-sum test P-values are shown separately for GM12878 and K562 cells. For the box plots in C and D, only those subcategories with 200 or more peaks are included, and whiskers represent the 1.5 inter-quartile range. (E) Nucleosome occupancy genome-wide is correlated with GC%. The smoothed density scatter plot contains 40,000 data points; each data point is a randomly chosen 250-bp region of the human genome. (Black dots) Those regions that overlap with ChIP-seq peaks. (Black line) Least square fit. Pearson correlation coefficient = 0.62; P-value < 2.2 × 10^-16. (F) Comparison of in vivo (green) and in vitro (black) nucleosome occupancy profiles around peak summits of YY1. GC% profile around the same summits is plotted in orange. Note elevated GC% at summit coincides with high in vitro nucleosome occupancy and low in vivo nucleosome occupancy.
Chromatin immunoprecipitation coupled with high-throughput sequencing (ChIP-seq) has become the dominant technique for mapping transcription factor (TF) binding regions genome-wide. We performed an integrative analysis centered around 457 ChIP-seq data sets on 119 human TFs generated by the ENCODE Consortium. We identified highly enriched sequence motifs in most data sets, revealing new motifs and validating known ones. The motif sites (TF binding sites) are highly conserved evolutionarily and show distinct footprints upon DNase I digestion. We frequently detected secondary motifs in addition to the canonical motifs of the TFs, indicating tethered binding and cobinding between multiple TFs. We observed significant position and orientation preferences between many cobinding TFs. Genes specifically expressed in a cell line are often associated with a greater occurrence of nearby TF binding in that cell line. We observed cell-line-specific secondary motifs that mediate the binding of the histone deacetylase HDAC2 and the enhancer-binding protein EP300. TF binding sites are located in GC-rich, nucleosome-depleted, and DNase I sensitive regions, flanked by well-positioned nucleosomes, and many of these features show cell type specificity. The GC-richness may be beneficial for regulating TF binding because, when unoccupied by a TF, these regions are occupied by nucleosomes in vivo. We present the results of our analysis in a TF-centric web repository, Factorbook, and will continually update this repository as more ENCODE data are generated.
Human and Mouse Proteins with Identical Amino Acid Sequences 
The 25 Most Divergent Human and Mouse Proteins 
Box plots of aligned sequence identity distributions. For each category, the central box depicts the middle half of the data between the 25th and 75th percentiles; the black dots indicate the medians of each distribution. Extreme values in each distribution are indicated by the circles that fall outside the main body of data.
Distributions of sequence conservation in aligned mRNA coding sequences and the proteins they encode. 
A large set of mRNA and encoded protein sequences, from orthologous murine and human genes, was compiled to analyze statistical, biological, and evolutionary properties of coding and noncoding transcribed sequences. Protein sequence conservation varied between 36% and 100% identity, with an average value of 85%. The average degree of nucleotide sequence identity for the corresponding coding sequences was also approximately 85%, whereas 5' and 3' untranslated regions (UTRs) were less conserved, with aligned identities of 67% and 69%, respectively. For some mouse and human genes, nucleotide sequences are more highly conserved than the encoded protein sequences. A subset of 32 sequences, consisting of only mouse/human protein pairs for which the human sequence represents a positionally cloned disease gene, had properties very similar to the larger data set, suggesting that our data are representative of the genome as a whole. With respect to sequence conservation, two interesting outliers are the breast cancer (BRCA1) gene product and the testis-determining factor (SRY), both of which display among the lowest degrees of sequence identity. The occurrence of both introns and repetitive elements (e.g., Alu, B1) in 5' and 3' UTRs was also studied. These results provide one benchmark for the "comparative genomics" of mice and humans, with practical implications for the cross-referencing of transcript maps. Also, they should prove useful in estimating the additional sampling diversity provided by mouse EST sequencing projects designed to complement the existing human cDNA collection.
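The identity figures quoted above are percentages over aligned positions. A minimal sketch of such a calculation for two pre-aligned sequences follows. The treatment of gaps here (columns with a gap in either sequence are skipped) is an assumption of this sketch; the abstract does not state how the study handled gaps. The function name `percent_identity` and the example sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity between two sequences from one pairwise alignment.

    Both inputs must have equal length, with gaps written as `gap`.
    Columns containing a gap in either sequence are ignored, so the
    result is identity over aligned (ungapped) columns only.
    Comparison is case-insensitive.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue
        compared += 1
        if x.upper() == y.upper():
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped aligned columns to compare")
    return 100.0 * matches / compared


if __name__ == "__main__":
    # Hypothetical aligned coding fragments.
    print(percent_identity("ATGGCC-TTA", "ATGGCTATTA"))
```

The same function applies to aligned protein sequences, which makes it possible to compare nucleotide and protein identity for the same gene pair, as the abstract does.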
The region p13 of the short arm of human chromosome 11 has been studied intensely during the search for genes involved in the etiology of the Wilms' tumor, aniridia, genitourinary abnormalities, mental retardation (WAGR) syndrome, and related conditions. The gene map for this region is far from being complete, however, strengthening the need for additional gene identification efforts. We describe the extension of an existing contig map with P1-derived artificial chromosomes (PACs) to cover 7.5 Mb of 11p13-14.1. The extended sequence-ready contig was established by end probe walking and fingerprinting and consists of 201 PAC clones. Utilizing bins defined by overlapping PACs, we generated a detailed gene map containing 20 genes as well as 22 anonymous ESTs which have been identified by searching the RH databases. RH maps and our established gene map show global correlation, but the limits of resolution of the current RH panels are evident at this scale. Initial expression studies on the novel genes have been performed by Northern blot analyses. To extend these expression profiles, corresponding mouse cDNA clones were identified by database search and employed for Northern blot analyses and RNA in situ hybridizations to mouse embryo sections. Genomic sequencing of clones along a minimal tiling path through the contig is currently under way and will facilitate these expression studies by in silico gene identification approaches.
The Usher syndrome type 1C (USH1C) and familial hyperinsulinism (HI) loci have been assigned to chromosome 11p14-15.1, within the interval D11S419-D11S1310. We have constructed a yeast artificial chromosome (YAC) contig, extending from D11S926 to D11S899, which encompasses the critical regions for both USH1C and HI and spans an estimated genetic distance of approximately 4 cM. A minimal set of six YAC clones constitutes the contig, with another 22 YACs confirming the order of sequence-tagged sites (STSs) and position of YACs on the contig. A total of 40 STSs, including 10 new STSs generated from YAC insert-end sequences and inter-Alu PCR products, were used to order the clones within the contig. This physical map provides a resource for identification of gene transcripts associated with USH1C, HI, and other genetic disorders that map to the D11S926-D11S899 interval.
A major barrier to conceptual advances in understanding the mechanisms and regulation of imprinting of a genomic region is our relatively poor understanding of the overall organization of genes and of the potentially important cis-acting regulatory sequences that lie in the nonexonic segments that make up 97% of the genome. Interspecies sequence comparison offers an effective approach to identify sequence from conserved functional elements. In this article we describe the successful use of this approach in comparing an approximately 1-Mb imprinted genomic domain on mouse chromosome 7 to its orthologous region on human 11p15.5. Within the region, we identified 112 exons of known genes as well as a novel gene identified uniquely in the mouse region, termed Msuit, that was found to be imprinted. In addition to these coding elements, we identified 33 CpG islands and 49 orthologous nonexonic, nonisland sequences that met our criteria as being conserved, together making up 4.1% of the total sequence. These conserved noncoding sequence elements were generally clustered near imprinted genes and the majority were between Igf2 and H19 or within Kvlqt1. Finally, the location of CpG islands provided evidence that suggested a two-island rule for imprinted genes. This study provides the first global view of the architecture of an entire imprinted domain and provides candidate sequence elements for subsequent functional analyses.
Jacobsen syndrome is a haploinsufficiency disorder caused most frequently by terminal deletion of part of the long arm of chromosome 11, with breakpoints in 11q23.3-11q24.2. Inheritance of an expanded p(CCG)n trinucleotide repeat at the folate-sensitive fragile site FRA11B has been implicated in the generation of the chromosome breakpoint in several Jacobsen syndrome patients. The majority of such breakpoints, however, map distal to this fragile site and are not linked with its expression. To characterize these distal breakpoints and ultimately to further investigate the mechanisms of chromosome breakage, a 40-Mb YAC contig covering the distal long arm of chromosome 11 was assembled. The utility of the YAC contig was demonstrated in three ways: (1) by rapidly mapping the breakpoints from two new Jacobsen syndrome patients using FISH; (2) by demonstrating conversion to high resolution PAC contigs after direct screening of PAC library filters with a YAC clone containing a Jacobsen syndrome breakpoint; and (3) by placing 23 Jacobsen syndrome breakpoints on the physical map. This analysis has suggested the existence of at least two new Jacobsen syndrome breakpoint cluster regions in distal chromosome 11.
Northern blot analyses of cDNA fragments corresponding to the transcription units (TUs) as indicated (see Fig. 1). Examples are given to demonstrate the various expression profiles in lung (lane 1), cerebellum (lane 2), retina (lane 3), retinal pigment epithelium cell line ARPE-19 (lane 4), heart (lane 5), brain (lane 6), placenta (lane 7), liver (lane 8), skeletal muscle (lane 9), kidney (lane 10), and pancreas (lane 11). The transcript sizes are indicated at left in kilobases.
Best’s vitelliform macular dystrophy is an autosomal dominant disorder of unknown causes. To identify the underlying gene defect the disease locus has been mapped to an ∼1.4-Mb region on chromosome 11q12–q13.1. As a prerequisite for its positional cloning we have assembled a high coverage PAC contig of the candidate region. Here, we report the construction of a primary transcript map that places a total of 19 genes within the Best’s disease region. This includes 14 transcripts of as yet unknown function obtained by EST mapping and/or cDNA selection and five genes mapped previously to the interval (CD5, PGA, DDB1, FEN1, and FTH1). Northern blot analyses were performed to determine the expression profiles in various human tissues. At least three genes appear to be good candidates for Best’s disease based on their abundant expression in retina or retinal pigment epithelium. Additional information on the functional properties of these genes, as well as mutation analyses in Best’s disease patients, have to await their further characterization. [The GenBank/EMBL accession numbers and details of the isolation, localization, and characterization of ESTs and selected cDNAs are available as online supplements in Online Tables 1–3 at .]
A >3-Mb contig of overlapping YAC, PAC, and cosmid clones. STS and Alu-PCR product screening reagents are listed above the black line, and positions are identified by vertical lines. Genetic markers are indicated in blue, genes in green, STSs in red, and bacteriophage Alu products in purple. (Phage cluster) Corresponding phage clone-derived Alu-PCR products listed in Table 1. (*ESTs) D11S951E, D11S1956E, and WI-12191. (a1) Alu-PCR products obtained using 263 primers; (a2) PCR products obtained using S/J primers; (a3) PCR products obtained using 278 primers (see Methods). Clones are represented by horizontal lines: red lines represent PAC clones, blue lines represent cosmid clones, and black lines represent YAC clones. Corresponding STS/Alu-PCR content is shown by vertical lines. PAC size is estimated at 125 kb and cosmid size at 40 kb. An open circle identifies a probable internal deletion. Parentheses indicate unstable clones (i.e., strains that have different insert sizes in two separate isolations). To simplify the diagram, not all of the screening reagents/clones used for the establishment of the contig are depicted if the information was redundant. Table 1 lists the screening materials and clones used in the development of the contig depicted here and can be accessed at or http://www.cshl.or/gr. 
We have combined genetic, radiation-reduced somatic cell hybrid (RRH), fluorescent in situ hybridization (FISH), and physical mapping methods to generate a contig of overlapping YAC, PAC, and cosmid clones corresponding to >3 continuous Mb in 11q13. A total of 15 STSs [7 genes (GSTP1, ACTN, PC, MLK3, FRA1, SEA, HNP36), 4 polymorphic loci (D11S807, D11S987, GSTP1, D11S913), 3 ESTs (D11S1956E, D11S951E, and WI-12191), and 1 anonymous STS (D11S703)], mapping to three independent RRH segregation groups, identified 26 YAC, 7 PAC, and 16 cosmid clones from the CGM, Roswell Park, CEPH Mark I, and CEPH MegaYAC YAC libraries, a 5 genome equivalent PAC library, and a chromosome 11-specific cosmid library. Thirty-six Alu–PCR products derived from 10 anonymous bacteriophage λ clones, a cosmid containing the polymorphic marker D11S460, or STS-positive YAC or cosmid clones were identified and used to screen selected libraries by hybridization, resulting in the identification of 19 additional clones. The integrity and relative position of a subset of clones were confirmed by FISH and found to be consistent with the physical and RRH mapping results. The combination of STS and Alu–PCR-based approaches has proven to be successful in attaining contiguous cloned coverage in this very GC-rich region, thereby establishing for the first time the absolute order and distance between the markers: CEN–MLK3–(D11S1956E/D11S951E/WI-12191)–FRA1–D11S460–SEA–HNP36/D11S913–ACTN–PC–D11S703–GSTP1–D11S987–TEL. [On-line supplementary material concerning screening materials and clones referred to in the text as Table 1 is available at or . The sequence data described in this paper have been submitted to the GenBank data library under accession no. AF009361.]
Analysis of Hs-VACM-1 mRNA transcription in human tissues. (A,B) Nylon membranes containing electrophoretically separated poly(A)+ RNA (2 μg) from the indicated tissues that were hybridized with [α-32P]dCTP-labeled Hs-VACM-1 cDNA. C and D are the same membranes that were used in A and B, respectively, but hybridized with radiolabeled β-actin DNA; the orders of the tissues are the same as shown in A and B. Size markers shown at the left of A and B are in kilobases.
We have localized the human homolog of the rabbit vasopressin-activated calcium-mobilizing receptor VACM-1 to a region close to the ataxia telangiectasia gene (ATM) on chromosome 11q22-23. We have determined the complete amino acid sequence of the human Hs-VACM-1 protein, which is 780 amino acids long. The human and rabbit sequences are highly conserved, differing at only seven amino acids. Northern analysis of the human gene showed expression in a wide range of human tissues. The Hs-VACM-1 gene has homology with the Caenorhabditis elegans gene Ce-cul-5, a member of a family of cullin genes that are involved in cell cycle regulation and that might, when mutated, contribute to tumor progression.
Mutation location. A line diagram of the Del(13)Svea36H region showing human synteny, the position of gene clusters and markers used in the recombination mapping (black) relative to relevant genes in the region (red). The critical interval for each mapped line is shown by the blue line beneath the map. Line 91 carries a mutation in the Sox4 gene. 
The Sox4 missense mutation. (a and b) Sequencing of Sox4 in line 91 heterozygotes reveals a T-to-C transition (arrow) at nucleotide 869. (c–h) Magnetic resonance imaging of the heart in wild-type and line 91 hemizygote mutant embryos at 14.5 dpc. (c) Transverse section through a wild-type embryo at the level of the mitral valve (MV). The left (LV) and right (RV) ventricles are separated by the interventricular septum (IVS) and the left (LA) and right (RA) atria by the primary atrial septum (PAS). The systemic venous sinus (SVS) draining into the RA is indicated. (d) A corresponding section
Del(13)Svea36H (Del36H) is a deletion of approximately 20% of mouse chromosome 13 showing conserved synteny with human chromosome 6p22.1-6p22.3/6p25. The human region is lost in some deletion syndromes and is the site of several disease loci. Heterozygous Del36H mice show numerous phenotypes and may model aspects of human genetic disease. We describe 12.7 Mb of finished, annotated sequence from Del36H. Del36H has a higher gene density than the draft mouse genome, reflecting high local densities of three gene families (vomeronasal receptors, serpins, and prolactins) which are greatly expanded relative to human. Transposable elements are concentrated near these gene families. We therefore suggest that their neighborhoods are gene factories, regions of frequent recombination in which gene duplication is more frequent. The gene families show different proportions of pseudogenes, likely reflecting different strengths of purifying selection and/or gene conversion. They are also associated with relatively low simple sequence concentrations, which vary across the region with a periodicity of approximately 5 Mb. Del36H contains numerous evolutionarily conserved regions (ECRs). Many lie in noncoding regions, are detectable in species as distant as Ciona intestinalis, and therefore are candidate regulatory sequences. This analysis will facilitate functional genomic analysis of Del36H and provides insights into mouse genome evolution.
An essential step in Serial Analysis of Gene Expression (SAGE) is tag mapping, the unambiguous determination of the gene represented by a SAGE tag. Current resources for tag mapping are incomplete and thus do not allow assessment of the efficacy of SAGE in transcript identification. A method of tag mapping is described here and applied to the Drosophila melanogaster and Caenorhabditis elegans genomes; it permits detailed SAGE assessment and provides tag-mapping resources that were previously unavailable for these organisms. In our method, a conceptual transcriptome is constructed from genomic sequence and annotation by extending predicted coding regions to include UTRs on the basis of EST and cDNA alignments, UTR length distributions, and polyadenylation signals. Analysis of extracted tags suggests that, using the standard SAGE procedure, expression of 8% of D. melanogaster and 15% of C. elegans genes cannot be detected unambiguously by SAGE because of shared sequence or lack of sites for the anchoring enzyme NlaIII. Both increasing tag length by 2-3 bp and using Sau3A instead of NlaIII as the anchoring enzyme increase the potential for transcript detection. This work identifies and quantifies genes not amenable to SAGE analysis, in addition to providing tag-to-gene mappings for two model organisms.
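The tag-extraction step can be illustrated with a minimal sketch. It assumes the standard SAGE convention that a tag is the 10 bp immediately 3' of the 3'-most anchoring-enzyme site (CATG for NlaIII, GATC for Sau3A) and that each transcript is given as a plain string without its poly(A) tail. The function names and the toy sequences are hypothetical and are not taken from the authors' pipeline.

```python
NLAIII_SITE = "CATG"   # NlaIII recognition site
SAU3A_SITE = "GATC"    # Sau3A recognition site
TAG_LENGTH = 10        # standard SAGE tag length in bp


def extract_sage_tag(transcript, anchor=NLAIII_SITE, tag_length=TAG_LENGTH):
    """Return the tag 3' of the 3'-most anchoring site, or None if no full tag exists."""
    seq = transcript.upper()
    pos = seq.rfind(anchor)          # 3'-most occurrence of the anchoring site
    if pos == -1:
        return None                  # no site: transcript cannot yield a tag
    start = pos + len(anchor)
    tag = seq[start:start + tag_length]
    return tag if len(tag) == tag_length else None


def map_tags(transcripts, anchor=NLAIII_SITE, tag_length=TAG_LENGTH):
    """Group gene names by tag and list genes that yield no tag.

    A tag shared by more than one gene is ambiguous.
    """
    tag_to_genes = {}
    no_tag = []
    for gene, seq in transcripts.items():
        tag = extract_sage_tag(seq, anchor, tag_length)
        if tag is None:
            no_tag.append(gene)
        else:
            tag_to_genes.setdefault(tag, []).append(gene)
    return tag_to_genes, no_tag


# Toy transcripts (hypothetical sequences, for illustration only)
toy = {
    "geneA": "AAACATGTTTTTCATGACGTACGTACGG",
    "geneB": "GGGGCATGACGTACGTACTT",
    "geneC": "GGGATCAAAAACCCCCTT",
}
tag_to_genes, no_tag = map_tags(toy)
ambiguous = {t: g for t, g in tag_to_genes.items() if len(g) > 1}
```

In this toy set, geneA and geneB share a tag and geneC has no NlaIII site, which mirrors the two failure modes named in the abstract. Passing `anchor=SAU3A_SITE` or a larger `tag_length` to `map_tags` shows how the alternative enzyme or longer tags change which genes are detectable.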
A regional analysis of nucleotide substitution rates along human genes and their flanking regions allows us to quantify the effect of mutational mechanisms associated with transcription in germ line cells. Our analysis reveals three distinct patterns of substitution rates. First, a sharp decline in the deamination rate of methylated CpG dinucleotides is observed in the vicinity of the 5' end of genes. Second, a strand asymmetry in complementary substitution rates, associated with transcription-coupled repair, extends from the 5' end to 1 kbp downstream from the 3' end. Finally, a localized strand asymmetry, an excess of C→T over G→A substitutions in the nontemplate strand, is confined to the first 1-2 kbp downstream of the 5' end of genes. We hypothesize that higher exposure of the nontemplate strand near the 5' end of genes leads to a higher cytosine deamination rate. Up to now, only the somatic hypermutation (SHM) pathway has been known to mediate localized and strand-specific mutagenic processes associated with transcription in mammals. The mutational patterns in SHM are induced by cytosine deaminase, which targets only single-stranded DNA. This DNA conformation is induced by R-loops, which preferentially occur at the 5' ends of genes. We predict that R-loops are extensively formed at the beginning of transcribed regions in germ line cells.
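One way to express an excess of C→T over G→A substitutions is a normalized difference of the two rates. The sketch below is illustrative only: the abstract does not state which statistic the authors used, and the formula, function names, and counts here are assumptions.

```python
import math


def substitution_rate(n_substitutions, n_sites):
    """Substitutions per available site of the ancestral base (0.0 if no sites)."""
    return n_substitutions / n_sites if n_sites else 0.0


def strand_asymmetry(rate_c_to_t, rate_g_to_a):
    """Normalized asymmetry in [-1, 1]; positive values mean an excess of C->T.

    This skew statistic is an assumed illustration, not the authors' measure.
    """
    total = rate_c_to_t + rate_g_to_a
    if total == 0:
        return 0.0
    return (rate_c_to_t - rate_g_to_a) / total


# Hypothetical counts for one window on the nontemplate strand:
# 30 C->T changes among 1000 ancestral C sites, 10 G->A among 1000 G sites.
r_ct = substitution_rate(30, 1000)
r_ga = substitution_rate(10, 1000)
asym = strand_asymmetry(r_ct, r_ga)
```

Computing this value in successive windows downstream of the 5' end of a gene would show where the asymmetry is concentrated, under the assumptions stated above.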