Publications (99)
The genome of budding yeast (Saccharomyces cerevisiae) contains approximately 5,800 protein-encoding genes, the majority of which are associated with some known biological function. Yet the extent of amino acid sequence conservation of these genes over all phyla has only been partially examined. Here we provide a more comprehensive overview and vis...
The macronuclear genome of the ciliate Oxytricha trifallax displays an extreme and unique eukaryotic genome architecture with extensive genomic variation. During sexual genome development, the expressed, somatic macronuclear genome is whittled down to the genic portion of a small fraction (∼5%) of its precursor "silent" germline micronuclear genome...
Genome meta-assembly method. The meta-assembly started by reassembling the contigs produced from three different assemblers (see Materials and Methods for parameters used) with the CAP3 assembler. Two cycles of contig extension and re-assembly were performed before splitting of potentially chimeric contigs and trimming back at the sites of potentia...
Nanochromosomal variant frequencies in relation to sequence coverage. Variant frequency distribution over all positions with detected variants for low (≥20× to <40×; blue) and high sequence coverage (≥40×; green); variant frequencies ≥40 bp from either nanochromosome end were counted to avoid possible incorrect variant calling resulting from telome...
Association between nanochromosome copy number and number of telomeric reads. (A) Hexagonal binning plot of relative nanochromosome copy number measured in reads/bp versus copy number measured in number of telomeric reads for nonalternatively fragmented nanochromosomes (nanochromosomes without strongly supported alternative fragmentation sites). (B...
Genes per contig or nanochromosome. Nanochromosomes are defined as contigs with TASs no more than 100 bp away from both ends of the contig (14,388 in total). Alternatively fragmented nanochromosomes are those that are strongly supported by Illumina telomeric reads (≥10 reads per site), and nonalternatively fragmented nanochromosomes are all the rem...
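The nanochromosome criterion in this caption (a contig with telomere addition sites, TASs, no more than 100 bp from both ends) can be illustrated with a minimal sketch. The function name, the 0-based coordinate convention, and the input format are assumptions made for illustration only; they are not taken from the publication's pipeline.

```python
from typing import Iterable


def is_nanochromosome(contig_len: int,
                      tas_positions: Iterable[int],
                      max_dist: int = 100) -> bool:
    """Return True if TASs lie within max_dist bp of BOTH contig ends.

    tas_positions are hypothetical 0-based offsets from the 5' end of
    the contig; max_dist=100 mirrors the threshold in the caption.
    """
    positions = list(tas_positions)
    # A TAS close to the 5' end of the contig.
    near_5p = any(p <= max_dist for p in positions)
    # A TAS close to the 3' end of the contig.
    near_3p = any(contig_len - p <= max_dist for p in positions)
    return near_5p and near_3p


print(is_nanochromosome(3000, [40, 2950]))
print(is_nanochromosome(3000, [40, 1500]))
```

A contig with a TAS near only one end fails the check, which corresponds to an incomplete contig rather than a complete nanochromosome.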
Heat map of extended contigs verified with 454-telomeric end reads. Axes indicate percent difference of the 100 bp end matches. The first number in each cell indicates the fraction of complete nanochromosomes with paired matches to within 50 bases of each end of the nanochromosome; the second number indicates the total number of extended nanochromo...
Genome assembly redundancy analysis. Distribution of matching contigs for non-self BLAT matches (≥100 bp long) within the Oxytricha macronuclear genome assemblies. The number of matching contigs, not the number of contig matches, is counted. The graphs (A–D) represent the ≥90%, ≥99%, ≥90% to <99%, and ≥95% match identity thresholds, respectively.
(...
Missing pentose phosphate pathway (PPP) enzymes in ciliates. Enzymes that are confirmed to be absent/present are highlighted in color, with enzymes that are present in Paramecium, Tetrahymena and Ichthyophthirius, but not Oxytricha highlighted in pink, and a single enzyme missing in Paramecium but present in Oxytricha, Tetrahymena, and Ichthyophthi...
Linear regressions of relative estimates of nanochromosome copy number. Squares (red) are total telomeric reads for each contig; triangles (green) are 5′ telomeric reads for each contig; and diamonds (blue) are total reads/nanochromosome length (bp). The x-axis units are values obtained from qPCR (see Table S3). Linear regressions were determined w...
Validation of nanochromosomes from the final assembly. Nanochromosomes were validated by 454 telomeric end reads, Sanger reads/mate pairs, or both. (A) Length distribution of nanochromosomes validated by either 454 telomeric end reads (green) or Sanger read/mate pairs (cyan) or both (purple), or not validated by either method (pink) (see Material...
Length distribution of nanochromosomes validated by Sanger mate pairs. Nanochromosomes were validated according to the method illustrated in Figure S11.
(TIFF)
Intra-CDS alternative nanochromosome fragmentation. Nanochromosomes are indicated by black bars in descending order of length, with gene annotations below them. Predicted genes are indicated by green arrows and predicted CDSs by yellow arrows. Red arrows indicate alternative fragmentation sites and point in the direction that the alternative nanoch...
Nonalternatively fragmented tRNA nanochromosomes. Nanochromosomes are indicated by black bars in descending order of length, with gene annotations below them. Where multiple allelic versions of nanochromosomes are present, we have selected just a single representative nanochromosome. Predicted genes are indicated by green arrows, predicted CDSs by...
TAS sequence logos. Sequence logos showing nucleotide frequencies (generated with WebLogo [130]) for method 1 are for contig-derived sequences, while the logos for method 2 are for read-derived sequences. Sequence logos show base frequencies.
(TIFF)
Examples of overamplification of ribosomal protein-encoding nanochromosome isoforms (red) relative to the isoforms only containing nonribosomal genes (blue). High peaks and deep troughs indicate subtelomeric sequence biases (see Materials and Methods).
(TIFF)
Base compositional biases surrounding TASs. Contig consensus sequences surrounding strongly supported site TASs (≥10 supporting Illumina telomeric reads) were extracted for (A–C) (see Text S1: Determination of sequences surrounding telomere addition sites). The telomere position is 0. We only illustrate base composition biases for one end of the na...
Nanochromosome subtelomeric base composition of Stylonychia compared to that of Oxytricha. Oxytricha base compositions are indicated by dots behind the Stylonychia base composition lines.
(TIFF)
Intron length distribution. The green histogram is for all introns predicted by AUGUSTUS, including those with experimental support from RNA-seq data; the blue histogram is for all introns determined from RNA-seq data that were used as hints for AUGUSTUS during the gene prediction. The inset shows the size distribution over a longer length scale (w...
Sequence logos of experimentally determined and predicted intron donor sites. Sequence logos generated by WebLogo show base frequencies. Experimentally determined introns were obtained from RNA-seq data (see Text S1: Determination of sequences surrounding telomere addition sites). Predicted introns are all the introns predicted by AUGUSTUS, includi...
Subtelomeric DNA capture method for 454 subtelomeric sequencing. Adaptor ends with a 5′-phosphate are shown in bold; otherwise 5′-phosphate is absent. The biotinylated thymine residue in the internal adaptor is indicated in green.
(TIFF)
Southern blot analysis of Contig14329.0. Total macronuclear DNA was run on an electrophoretic gel. Two probes were created to investigate alternative fragmentation of this contig (“gene 1 probe” and “gene 2 probe”). For the gene 1 probe, the forward and reverse primers, 257_F and 1264_R, are CAGGCCCACAACATCTTCCTTCTTTG and CCATCTAGCACTACTCCATTAAGCAC...
pN/pS values for matchless nanochromosomes. pN/pS values were calculated by PAML (see Text S1: Determination of pN/pS values). A cut-off of pN/pS = 0.6 is shown by the dashed red line.
(TIFF)
Positional variation of TASs. TASs are contig-derived (see Text S1: Determination of sequences surrounding telomere addition sites). TASs within a 200 bp window surrounding and centered on strongly supported, alternatively fragmented, and nonalternatively fragmented sites were counted. The frequency distributions of the TASs for alternatively fragmented sites...
Length distributions of untranscribed (UTS) and untranslated (UTR) regions. Length distributions are for single-gene, nonalternatively fragmented nanochromosomes. (A) 5′ UTS length from the transcription start site to telomere (determined from 5′-RLM RACE Sanger reads). (B) 3′ UTS length from polyadenylation site to telomere (determined from RNA-se...
Assessment of potential paralogy in model ciliate genomes. UCLUST from the USEARCH suite (version 5.1.221) [131] was used for clustering at increasing global sequence alignment identity clustering thresholds, with the query and target alignment fractions both set to 80% coverage (i.e., number of letters in the query that are aligned to letters in t...
End-to-end validation by Sanger mate pair reads. Paired and SE reads are shown with gray arrows. First, “outer spans” between the ends of paired-end reads or consisting of the entire SE read are found. Next, we attempt to greedily find a path through the spans, so that there are ≥100 bp overlaps between the spans comprising the path. If we find suc...
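The greedy search described in this caption can be sketched as follows, under stated assumptions. Spans are treated as half-open `(start, end)` intervals in contig coordinates, overlap between consecutive spans is measured as the current reach minus the next span's start, and the `end_slack` parameter is hypothetical. The caption is truncated, so the acceptance rule at the end (the chain must reach both contig ends) is an assumption, not the authors' stated criterion.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open [start, end) in contig coordinates


def spans_cover_contig(spans: List[Span],
                       contig_len: int,
                       min_overlap: int = 100,
                       end_slack: int = 0) -> bool:
    """Greedily chain spans from the left contig end to the right end.

    Consecutive spans in the chain must overlap by at least min_overlap
    bp (100 bp in the caption). end_slack is a hypothetical tolerance
    for how close the chain must come to each contig end.
    """
    # Start from the span that touches the left end and reaches farthest.
    reach = max((e for s, e in spans if s <= end_slack), default=None)
    if reach is None:
        return False
    while reach < contig_len - end_slack:
        # Among spans overlapping the chain by >= min_overlap and
        # extending it, take the one that reaches farthest right.
        best = max((e for s, e in spans
                    if s <= reach - min_overlap and e > reach),
                   default=None)
        if best is None:
            return False  # no span extends the chain far enough
        reach = best
    return True


print(spans_cover_contig([(0, 500), (350, 900), (800, 1200)], 1200))
print(spans_cover_contig([(0, 500), (450, 1200)], 1200))
```

Taking the farthest-reaching candidate at each step is the standard greedy choice for interval chaining: a larger reach never excludes a span that a smaller reach would have admitted.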
Nucleic-acid-associated protein domains found in both Paramecium and Tetrahymena but not Oxytricha. Domains marked with * are present in translated ORFs, but were not originally detected as AUGUSTUS failed to predict them. Protein IDs are given for Tetrahymena.
(RTF)
Genomic and RNA libraries Sanger sequenced on ABI3730 sequencers.
(RTF)
Meta-contig statistics after first CAP3 assembly before extension. “Single” refers to an SE being complete (≥1 5′ or 3′ telomeres). “Both” refers to one or more telomeres on both ends of the contig (≥1 5′ and ≥1 3′ ends). “Multiple” refers to two or more telomeres on either end of the contig (≥2 5′ or ≥2 3′ ends). All lengths are given in bp.
(RTF)
Meta-contig statistics after CAP3 reassembly of extended contigs. “Single” refers to an SE being complete (≥1 5′ or 3′ telomeres). “Both” refers to one or more telomeres on both ends of the contig (≥1 5′ and ≥1 3′ ends). “Multiple” refers to two or more telomeres on either end of the contig (≥2 5′ or ≥2 3′ ends). All lengths are given in bp.
(RTF)
Meta-contig statistics after second extension. “Single” refers to an SE being complete (≥1 5′ or 3′ telomeres). “Both” refers to one or more telomeres on both ends of the contig (≥1 5′ and ≥1 3′ ends). “Multiple” refers to two or more telomeres on either end of the contig (≥2 5′ or ≥2 3′ ends). All lengths are given in bp.
(RTF)
Alternative nanochromosome fragmentation in a predicted intron-containing region. Gene predictions for Contig17419.0 are shown. Predicted genes are indicated by green arrows and predicted CDSs by yellow arrows; predicted introns are indicated by white arrows, and those introns that are supported by RNA-seq evidence have two white arrows; neon green...
Sequence logo of Euplotes crassus subtelomeric regions. Sequence logos show base frequencies. Note that some of the motifs may be slightly misaligned (usually by 1 base), and hence the motif centered on position −20 would be even more prominent if they were correctly aligned.
(TIFF)
Intron length distribution for Tetrahymena thermophila gene predictions. Intron lengths determined from 2008 Tetrahymena gene predictions (downloaded from http://www.ciliate.org/system/downloads/oct2008_release.gff).
(TIFF)
Sequence logos of experimentally determined and predicted intron acceptor sites. Sequence logos show base frequencies. Experimentally determined introns were obtained from RNA-seq data (see Text S1: Gene prediction). Predicted introns are all the introns predicted by AUGUSTUS, including those that have supporting RNA-seq evidence. Sequence logos we...
Location of alternative fragmentation sites relative to inter- and intracoding sequence regions for two-gene nanochromosomes. Alternative fragmentation sites with decreasing numbers of supporting telomeric reads are shown in three successive columns. To exclude conventional TASs, only alternative fragmentation sites at least 100 bp away from either...
Oxytricha nucleic-acid-associated protein domains not found in Paramecium and Tetrahymena.
(a) Judging from multiple sequence alignments, the domain appears to exist in Paramecium (GSPATP00020413001) and Tetrahymena (TTHERM_00721450) but was not detected by hmmscan (HMMER3). (b) Independent E-value greater than the threshold (0.001), but the domain exists (e.g.,...
Small ribosomal proteins. Gene identifiers are given as contig identifiers with a gene suffix beginning with “g” followed by a number (which is arbitrary in this context). Only proteins ≤100 aa with domains found in Pfam 26.0 with an E-value<0.01 and with some homologs in UniProt that are ≤120 aa (to ensure that they are genuine small proteins) are...
Small nonribosomal proteins. Gene identifiers are given as contig identifiers with a gene suffix beginning with “g” followed by a number (which is arbitrary in this context). All nanochromosomes in this table longer than 1 kb are predicted to be multigene nanochromosomes. Proteins 50–100 aa long with domains found in Pfam 26.0 (independent E-value<...
RNA-seq counts for transcription initiation factor II domain protein genes. RNA expression values are given in normalized read counts for vegetative (“Fed”) cells and cells developing during conjugation (see Text S1: RNA-seq mapping and read counting).
(RTF)
RNA-seq counts for poly-adenylate binding protein domain protein genes. RNA expression values are given in normalized read counts for vegetative (“Fed”) cells and cells developing during conjugation (see Text S1: RNA-seq mapping and read counting).
(RTF)
RNA-seq counts for replication protein A domain protein genes. RNA expression values are given in normalized read counts for vegetative (“Fed”) cells and cells developing during conjugation (see Text S1: RNA-seq mapping and read counting).
(RTF)
Distribution of supporting telomeric reads per alternative fragmentation site versus reads per contig length. Alternative fragmentation sites >400 bp from contig ends are hexagonally binned. The x-axis units are an estimate of relative nanochromosome copy number based on the total number of mapped Illumina reads, both telomeric and nontelomeric, pe...
Location of alternative fragmentation sites relative to coding and noncoding sequence regions for single-gene nanochromosomes. Alternative fragmentation sites with decreasing numbers of supporting telomeric reads are shown in three successive columns. To exclude conventional TASs, only alternative fragmentation sites at least 100 bp away from eithe...
Large predicted proteins. The 20 longest nanochromosomes, excluding cases that appear to be redundant (i.e., quasi-alleles), are shown. None of these nanochromosomes is alternatively fragmented. Nanochromosome lengths include telomeres. Protein domain names are abbreviations from Pfam-A (version 26). Semicolons separate predicted protein lengths an...
Data sources for genome assemblies. Data for the genome assemblies incorporated in the final meta-assembly may be downloaded from http://dx.doi.org/10.5061/dryad.d1013 [132].
(RTF)
Meta-contig statistics for the final CAP3 assembly. “Single” refers to an SE being complete (≥1 5′ or 3′ telomeres). “Both” refers to one or more telomeres on both ends of the contig (≥1 5′ and ≥1 3′ ends). “Multiple” refers to two or more telomeres on either end of the contig (≥2 5′ or ≥2 3′ ends). All lengths are given in bp.
(RTF)
RNA-seq counts for homeodomain protein genes. RNA expression values are given in normalized read counts for vegetative (“Fed”) cells and cells developing during conjugation (see Text S1: RNA-seq mapping and read counting).
(RTF)
Total RNA sources for poly(A)-selected mRNA.
(a) RiboMinus Eukaryote Kit (Invitrogen, Carlsbad, CA).
(RTF)
Estimates of nanochromosome copy number. Only the rRNA nanochromosome in this table is alternatively fragmented (with a site at 634 bp supported by 11 reads and two sites at 1,253 bp and 6,077 bp supported by a single read). 5′- and 3′-telomeric reads refer to reads that are mapped either to the 5′ or 3′ end as it is oriented in the genome assembly...
Oxytricha putative nucleic-acid-associated protein domains (not annotated in pfam2go) not found in Paramecium and Tetrahymena. Domains in this table are considered to have putative nucleic-acid-related functions based on Pfam descriptions and literature cited for these domains. (a) Proteins encoded on contigs with no telomeric repeats. (b) Protein is tru...
454 genomic DNA libraries. Short read archive data can be downloaded from http://www.ncbi.nlm.nih.gov/sra.
(RTF)
Meta-contig statistics after first extension. “Single” refers to an SE being complete (≥1 5′ or 3′ telomeres). “Both” refers to one or more telomeres on both ends of the contig (≥1 5′ and ≥1 3′ ends). “Multiple” refers to two or more telomeres on either end of the contig (≥2 5′ or ≥2 3′ ends). All lengths are given in bp.
(RTF)
Meta-contig statistics after CAP3 reassembly of second round of extended contigs. “Single” refers to an SE being complete (≥1 5′ or 3′ telomeres). “Both” refers to one or more telomeres on both ends of the contig (≥1 5′ and ≥1 3′ ends). “Multiple” refers to two or more telomeres on either end of the contig (≥2 5′ or ≥2 3′ ends). All lengths are giv...
Missing Moco biosynthesis enzymes in ciliates.
(RTF)
Top 40 elevated domain counts in Oxytricha relative to Tetrahymena. The “enrichment” column measures the number of UCLUST clustered proteins in Oxytricha relative to proteins in Tetrahymena. The columns after “enrichment” count the number of proteins in which the Pfam domains are found for Oxytricha (Oxy), Tetrahymena (Tet), Paramecium (Par), and P...
Zinc finger protein domain counts in Oxytricha. The “enrichment” column measures the number of UCLUST clustered proteins in Oxytricha relative to proteins in Tetrahymena. The columns after “enrichment” count the number of proteins in which the Pfam domains are found for Oxytricha (Oxy), Tetrahymena (Tet), Paramecium (Par), Ichthyophthirius (Ich), a...
Meta-contig statistics after chimera splitting and end trimming. “Single” refers to an SE being complete (≥1 5′ or 3′ telomeres). “Both” refers to one or more telomeres on both ends of the contig (≥1 5′ and ≥1 3′ ends). “Multiple” refers to two or more telomeres on either end of the contig (≥2 5′ or ≥2 3′ ends). All lengths are given in bp.
(RTF)
Properties of intergenic regions. Prediction features were obtained for complete nanochromosomes (14,388 in total) only. Intergenic regions are between start and stop codons, including UTRs. Alternative fragmentation sites are those that are strongly supported by Illumina telomeric reads. %GC estimates exclude telomeric bases. Intergenic regions ar...
Supporting Results, Materials and Methods.
Contents of file:
Supporting Results, p. 3.
Macronuclear genome validation, p. 3.
Analysis of low frequency variants, p. 4.
Pfam domains detected for CEGs missing in Oxytricha, p. 5.
Investigation of alternative fragmentation sites in relation to mapped RNA-seq data, p. 6.
Frequent colocation of ncRNA- an...
We studied the relationship between growth rate and genome-wide gene expression, cell cycle progression, and glucose metabolism in 36 steady-state continuous cultures limited by one of six different nutrients (glucose, ammonium, sulfate, phosphate, uracil, or leucine). The expression of more than one quarter of all yeast genes is linearly correlate...
The Stanford Microarray Database (SMD; http://smd.stanford.edu/) is a research tool and archive that allows hundreds of researchers worldwide to store, annotate, analyze and share data generated by microarray technology. SMD supports most major microarray platforms, is MIAME-supportive, and can export or import MAGE-ML. The primary mission of SM...
Variable phenotypes have been identified for Entamoeba species. Entamoeba histolytica is invasive and causes colitis and liver abscesses but only in approximately 10% of infected individuals; 90% remain asymptomatically colonized. Entamoeba dispar, a closely related species, is avirulent. To determine the extent of genetic diversity among Entamoeba...
The Stanford Microarray Database (SMD) (http://smd.stanford.edu) is a research tool for hundreds of Stanford researchers and their collaborators. In addition, SMD functions as a resource for the entire biological research community by providing unrestricted access to microarray data published by SMD users and by disseminating its source code. In ad...
When publishing large-scale microarray datasets, it is of great value to create supplemental websites where either the full data, or selected subsets corresponding to figures within the paper, can be browsed. We set out to create a CGI application containing many of the features of some of the existing standalone software for the visualization of c...
The Microarray Gene Expression Data Society believes that the time is right for journals to require that microarray data be deposited in public repositories as a condition for publication.
The Stanford Microarray Database (SMD; http://genome-www.stanford.edu/microarray/) serves as a microarray research database for Stanford investigators and their collaborators. In addition, SMD functions as a resource for the entire scientific community, by making freely available all of its source code and providing full public access to data publi...
The explosion in the number of functional genomic datasets generated with tools such as DNA microarrays has created a critical need for resources that facilitate the interpretation of large-scale biological data. SOURCE is a web-based database that brings together information from a broad range of resources, and provides it in a manner particularly u...
Androgens are required for both normal prostate development and prostate carcinogenesis. We used DNA microarrays, representing approximately 18,000 genes, to examine the temporal program of gene expression following treatment of the human prostate cancer cell line LNCaP with a synthetic androgen. We observed statistically significant changes in lev...
The genome-wide program of gene expression during the cell division cycle in a human cancer cell line (HeLa) was characterized using cDNA microarrays. Transcripts of >850 genes showed periodic variation during the cell cycle. Hierarchical clustering of the expression patterns revealed coexpressed groups of previously well-characterized genes involv...
Full gene names and GenBank accession numbers for Figure 2
Full gene names and GenBank accession numbers for Figure 4
Full gene names and GenBank accession numbers for Figure 3
Full gene names and GenBank accession numbers for Figure 1
Full gene names and GenBank accession numbers for Figure 5
Microarray analysis has become a widely used tool for the generation of gene expression data on a genomic scale. Although many significant results have been derived from microarray studies, one limitation has been the lack of standards for presenting and exchanging such data. Here we present a proposal, the Minimum Information About a Microarray Ex...