OMICS A Journal of Integrative Biology
Volume 9, Number 4, 2005
© Mary Ann Liebert, Inc.
Proteogenomic Approaches for the Molecular
Characterization of Natural Microbial Communities
JILLIAN F. BANFIELD,1 NATHAN C. VERBERKMOES,2 ROBERT L. HETTICH,2
and MICHAEL P. THELEN3
1Departments of Earth and Planetary Science, and Environmental Science, Policy, and Management, University of
California, Berkeley, California.
2Organic and Biological Mass Spectrometry, Chemical Sciences Division, Oak Ridge National Laboratory, Oak Ridge,
Tennessee.
3Biosciences Directorate, Lawrence Livermore National Laboratory, Livermore, California.
At the present time we know little about how microbial communities function in their natural
habitats. For example, how do microorganisms interact with each other and their physical
and chemical surroundings and respond to environmental perturbations? We might begin
to answer these questions if we could monitor the ways in which metabolic roles are
partitioned amongst members as microbial communities assemble, determine how resources
such as carbon, nitrogen, and energy are allocated into metabolic pathways, and understand
the mechanisms by which organisms and communities respond to changes in their surroundings.
Because many organisms cannot be cultivated, and given that the metabolisms
of those growing in monoculture are likely to differ from those of organisms growing as part
of consortia, it is vital to develop methods to study microbial communities in situ. Chemoautotrophic
biofilms growing in mine tunnels hundreds of meters underground drive pyrite
(FeS2) dissolution and acid and metal release, creating habitats that select for a small number
of organism types. The geochemical and microbial simplicity of these systems, the significant
biomass, and clearly defined biological-inorganic feedbacks make these ecosystem
microcosms ideal for development of methods for the study of uncultivated microbial consortia.
Our approach begins with the acquisition of genomic data from biofilms that are
sampled over time and in different growth conditions. We have demonstrated that it is possible
to assemble shotgun sequence data to reveal the gene complement of the dominant community
members and to use these data to confidently identify a significant fraction of proteins
from the dominant organisms by mass spectrometry (MS)–based proteomics. However,
there are technical obstacles currently restricting this type of “proteogenomic” analysis.
Composite genomic sequences assembled from environmental data from natural microbial
communities do not capture the full range of genetic potential of the associated populations.
Thus, it is necessary to develop bioinformatics approaches to generate relatively comprehensive
gene inventories for each organism type. These inventories are critical for expression
and functional analyses. In proteomic studies, for example, peptides that differ from
those predicted from gene sequences can be measured, but they generally cannot be identi-
fied by database matching, even if the difference is only a single amino acid residue. Fur-
thermore, many of the identified proteins have no known function. We propose that these
challenges can be addressed by development of proteogenomic, biochemical, and geochem-
ical methods that will be initially deployed in a simple, natural model ecosystem. The re-
sulting approach should be broadly applicable and will enhance the utility and significance
of genomic data from isolates and consortia for study of organisms in many habitats. Solu-
tions draining pyrite-rich deposits are referred to as acid mine drainage (AMD). AMD is a
very prevalent, international environmental problem associated with energy and metal re-
sources. The biological-mineralogical interactions that define these systems can be harnessed
for energy-efficient metal recovery and removal of sulfur from coal. The detailed under-
standing of microbial ecology and ecosystem dynamics resulting from the proposed work
will provide a scientific foundation for dealing with the environmental challenges and technological
opportunities, and yield new methods for analysis of more complex natural communities.
The challenge of studying microbial communities in situ
THE OVERALL OBJECTIVE of our work is to develop a comprehensive understanding of how the molecular
machinery of individual microbial cells is harnessed in the context of a fully functional microbial
community. Specifically, our goals are to understand how natural microbial communities are structured and
the ways in which metabolic activity levels and partitioning of function are affected by community makeup
and the physical and chemical surroundings. Major obstacles to development of detailed biogeochemical
models to describe such processes have been the inability to (i) access the genetic potential of uncultivated
organisms and their natural populations, (ii) assay function directly in the environmental context, and (iii)
construct models for metabolic networks because of the large number of genes for which function cannot
be assigned. The objectives of our research are to tackle all three of these challenges in an integrative
manner in order to provide the information necessary to develop understanding of the structure and function of
a self-sustaining model microbial ecosystem.
Our approach consists of the following five steps:
1. Choose a system that is self-contained and simple enough for relatively complete characterization and
2. Obtain comprehensive genomic information to evaluate genetic potential.
3. Identify abundant proteins expressed by the dominant community members, and thereby evaluate the
partitioning of functions amongst community members.
4. Screen the extensive proteomic datasets to target key proteins of unknown function for rapid functional
prediction and testing.
5. Develop methods to tackle identified challenges related to strain heterogeneity and protein abundance
quantification, allowing more comprehensive analysis and extension of the approach to more complex
Why study acid mine drainage ecosystems?
Extremely acidic environments form when rocks enriched in metal sulfides, predominantly pyrite (FeS2),
are exposed to reaction with air and water. When this exposure occurs as the result of natural weathering
and erosion, the phenomenon is referred to as acid rock drainage (Nordstrom and Southam, 1997; Nord-
strom and Alpers, 1999). Often, however, rocks enriched in pyrite also contain minerals such as sphalerite
BANFIELD ET AL.
(ZnS), chalcopyrite (CuFeS2), galena (PbS), uraninite (UO2), arsenopyrite (FeAsS), coal, gold, silver, and
other valuable resources. Mining of these ore deposits greatly increases the exposed surface area of sulfide
minerals and accelerates chemical weathering. The resulting solutions are referred to as acid mine drainage
(AMD). AMD generation can occur in open pit mines and waste rock piles at the surface, and underground,
as in Figure 1.
The pH values of AMD systems are typically 1–4, but values may be even lower. pH values of 0.5–1.0
are encountered in regions of active AMD formation in pyrite-rich environments, and values below zero
occur locally due to evaporative concentration (Nordstrom et al., 2000). In addition to acidity, pyrite oxi-
dation is so highly exothermic that pyrite dissolution easily accounts for underground temperatures that can
exceed 50°C (Druschel et al., 2004). Solutions typically contain sub-molar concentrations of FeSO4 and
mM levels of Zn, Cu, As, and other metals (Druschel et al., 2004) (Table 1). The resulting environment is
challenging to most life forms.
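The pH figures above are easier to compare once converted to hydrogen-ion activity, using the definition pH = -log10 a(H+). The following Python sketch is illustrative only: the function name is hypothetical, and activity is treated as a plain number, which glosses over the non-ideal behavior of such concentrated sulfate solutions.

```python
def hydrogen_ion_activity(ph: float) -> float:
    """Hydrogen-ion activity implied by a pH value, from pH = -log10(a_H+)."""
    return 10.0 ** (-ph)


# pH values spanning the range quoted in the text, including a negative one.
for ph in (4.0, 1.0, 0.5, 0.0, -1.0):
    print(f"pH {ph:4.1f} -> a(H+) = {hydrogen_ion_activity(ph):.3g}")
```

Each unit decrease in pH is a tenfold increase in activity, so a negative pH simply corresponds to an activity greater than 1.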
Despite the apparently hostile habitat, AMD can host quite productive ecosystems. Bacteria, archaea, and
eukaryotes from diverse lineages have developed methods to deal with challenges such as metal toxicity and
high proton gradients (Baker and Banfield, 2003). In underground systems microorganisms assemble com-
munities based solely on energy derived from iron and sulfur oxidation. Because they are largely separated
from external sources of fixed carbon and nitrogen, these essential elements must be fixed in situ. Some
community members provide biofilm architecture and remove waste products. The result is an ecosystem
microcosm. The self-contained nature of these communities distinguishes AMD biofilms from some other
biological systems (such as soil or ocean water) available for study.
Microorganisms do not simply exist in AMD systems. Rather, their metabolism is directly linked to pyrite
dissolution. Overall, the reaction for pyrite oxidation is:
$$\mathrm{FeS_2 + 3.5\,O_2 + H_2O \rightarrow Fe^{2+} + 2\,SO_4^{2-} + 2\,H^+} \qquad (1)$$
PROTEOGENOMIC APPROACHES TO NATURAL MICROBIAL COMMUNITIES
FIG. 1. The Richmond Mine, Iron Mountain in Northern California. Photograph of the underground regions of the
mine, taken in a tunnel (the C drift; see map in Fig. 2) within the ore deposit. Visible in this photograph are the
tunnel walls, which consist mostly of host rock (rhyolite) and pyrite, a channel of green solution enriched in iron, other
metals, and sulfuric acid, and pink biofilms on the solution surface and attached to the channel walls.
The predominant source of oxygen is air. Geochemical studies have established that oxygen is a less
effective oxidant of sulfide ions on pyrite surfaces than ferric iron (Moses et al., 1987). Thus, the dominant
pathway for pyrite dissolution involves oxidation of ferrous iron by oxygen:
$$\mathrm{14\,Fe^{2+} + 3.5\,O_2 + 14\,H^+ \rightarrow 14\,Fe^{3+} + 7\,H_2O} \qquad (2)$$
followed by reduction of ferric iron by sulfide:
$$\mathrm{FeS_2 + 14\,Fe^{3+} + 8\,H_2O \rightarrow 15\,Fe^{2+} + 2\,SO_4^{2-} + 16\,H^+} \qquad (3)$$
Note that the sum of reactions (2) and (3), required to describe the sustainable process, yields reaction (1).
At low pH, the rate of reaction (2) is slow. However, acidophilic microorganisms can use iron oxidation to
generate metabolic energy. Microbial activity greatly accelerates the rate of iron oxidation (by a factor of
10^6 in culture-based experiments) (Singer and Stumm, 1970). Although there are both inorganic and biological
components (Silverman and Ehrlich, 1964; Boogerd et al., 1991), it is often stated that microorganisms
are largely responsible for environmental pollution associated with metal and coal resources. The same bi-
ological-geochemical processes also enable technologies that harness microbial activity for metal recovery
and to remove sulfur from coal.
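The statement that reactions (2) and (3) sum to reaction (1) can be checked mechanically. The Python sketch below assumes the stoichiometry FeS2 + 3.5 O2 + H2O -> Fe2+ + 2 SO4(2-) + 2 H+ for (1), 14 Fe2+ + 3.5 O2 + 14 H+ -> 14 Fe3+ + 7 H2O for (2), and FeS2 + 14 Fe3+ + 8 H2O -> 15 Fe2+ + 2 SO4(2-) + 16 H+ for (3). The dictionary encoding (negative coefficients for reactants) and all names are illustrative choices, not part of the original work.

```python
from fractions import Fraction

# Reactions as {species: coefficient}; reactants negative, products positive.
REACTION_1 = {"FeS2": -1, "O2": Fraction(-7, 2), "H2O": -1,
              "Fe2+": 1, "SO4^2-": 2, "H+": 2}
REACTION_2 = {"Fe2+": -14, "O2": Fraction(-7, 2), "H+": -14,
              "Fe3+": 14, "H2O": 7}
REACTION_3 = {"FeS2": -1, "Fe3+": -14, "H2O": -8,
              "Fe2+": 15, "SO4^2-": 2, "H+": 16}

# Ionic charges, used for a simple charge-balance check (neutral species omitted).
CHARGE = {"Fe2+": 2, "Fe3+": 3, "H+": 1, "SO4^2-": -2}


def net_reaction(*reactions):
    """Add reactions together and drop species whose coefficients cancel."""
    total = {}
    for reaction in reactions:
        for species, coefficient in reaction.items():
            total[species] = total.get(species, 0) + Fraction(coefficient)
    return {s: c for s, c in total.items() if c != 0}


def net_charge(reaction):
    """Charge of products minus charge of reactants; zero when balanced."""
    return sum(CHARGE.get(s, 0) * c for s, c in reaction.items())


combined = net_reaction(REACTION_2, REACTION_3)
print("(2) + (3) equals (1):", combined == REACTION_1)
print("charge balance:", [net_charge(r) for r in (REACTION_1, REACTION_2, REACTION_3)])
```

The ferric iron consumed in (3) is regenerated in (2), so Fe3+ cancels from the sum and only oxygen is consumed overall.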
Relevance of AMD systems to energy resource management
As coal forms from organic detritus it always contains some, and in some cases considerable, sulfur.
Upon burial, the sulfur is converted to metal sulfide minerals, typically FeS2. For example, a significant
fraction of coal mined from the Appalachian Basin has a sulfur content of ?3 wt%
(http://pubs.usgs.gov/of/1998/of98-763/). Most Brazilian coal contains 5% sulfur, typically as pyrite (Castilhos, 2003).
Upon surface exposure, these sulfide minerals oxidize, leading to generation of AMD. This problem is
prominent in deposits that contain ?1% sulfur. Thus, AMD is very often associated with coal mines. The
severity of the problem varies with sulfide abundance and position of the mine relative to the water table
(Demchak et al., 2004; Hawkins, 2004). The processes and microorganisms involved in AMD formation in
coal mines are closely related to those involved in pyrite oxidation associated with metal sulfide mineral
deposits (Brofft et al., 2002; Johnson et al., 2002; Baker and Banfield, 2003).
Mining of coal and other resources occurs worldwide on a massive scale. Estimates for the amount of
solid wastes produced each year from mining range from 15,000 Mt (million tons, metric; lower limit of
Lottermoser, 2003) to 24,000 Mt (Young, 1992). The area of the Earth’s surface covered by mine wastes
is on the order of 100 million hectares (Lottermoser, 2003). The amount of rock, soil, and sediment moved
by mining is estimated to be comparable to (Lottermoser, 2003), or twice, the quantity moved by natural
geological processes (Young, 1992).
TABLE 1. A SELECTION OF SOLUTION CHEMICAL DATA FOR DIFFERENT SITES WITHIN THE RICHMOND MINE
For site information and other details (and more analyses), see Nordstrom et al. (2000), Alpers et al. (2003), and
Druschel et al. (2004). Concentrations given in mM.
AMD has significant environmental impact. For example, in a recent study of the Clinch-Powell River
system of Virginia and Tennessee (one of the most biologically diverse ecosystems in the world), researchers
attributed ecosystem declines primarily to coal mining–related AMD (e.g., acute toxicity due to metals, aluminum
toxicity, and problems related to ferrihydrite precipitates) (Soucek et al., 2003). The number of fish
killed by mining activities is generally poorly documented, but estimates have ranged into the millions per
year from mining in the United States (Nordstrom and Alpers, 1999). Kleinmann (1989) estimated that
about 19,300 km of rivers and more than 180,000 acres of lakes and reservoirs in the continental United
States have been seriously damaged by AMD. Losses of livestock, benthic organisms, and crops have also
been reported (Nordstrom and Alpers, 1999; Kelly, 1988). Anthropogenic inputs of metals to atmospheric,
terrestrial, and aquatic ecosystems as a result of mining have been estimated to be several million kilograms
per year (Nriagu and Pacyna, 1988; Smith and Huyck, 1999).
The largest Superfund site, or complex of sites, is the Anaconda/Butte/Clark Fork covering about 50,000
acres and 140 miles of the Clark Fork River in Montana. Contaminants derive from mine drainage, waste
rock piles, tailings piles, slag piles, and flue dust piles rich in arsenic. Metals have affected the air, soils,
surface waters, and ground waters (Housman and Hoffman, 1992; Moore and Luoma, 1990). The Iron
Mountain Superfund site, the site selected for research described in this proposal, is also a problem of significant
magnitude, in part due to its predicted long time scale (2000–3000 years) and extremely low pH ef-
Uranium is also an important energy resource. Uranium is mined from deposits that form through ig-
neous (plutonic and volcanic), metamorphic, and sedimentary processes (Plant et al., 1999). Many of these
deposits are enriched in sulfide minerals. For example, in the massive Olympic Dam deposit in Australia
(7 km × 3 km), uraninite (UO2) is associated with Cu and Fe sulfides throughout, and there is a significant
co-location of high uranium and Cu sulfide minerals. The high-grade Arizona Strip deposits (with up to
0.85% U) are capped with 3–15-m-thick FeS2 deposits (Plant et al., 1999). Thus, AMD is a problem typical
of many uranium mines (Suzuki et al., 2003). In addition to impacting the environment through oxidation
of pyrite associated with these deposits (forming AMD), iron-oxidizing acidophiles (such as those studied
here) directly influence the rate of uranium release. This is because the ferric iron metabolic byproduct
reacts with UO2, oxidizing the U4+ and causing uraninite to dissolve (Francis et al., 1991; Abdelouas et
al., 1999). Thus, information about acidophilic microbial communities underpinned by iron oxidation may
be relevant to many systems characterized by uranium contamination.
Study of AMD ecosystems may yield information that will assist in design of new remediation methods.
This is a pressing problem. There may be as many as 500,000 inactive or abandoned mine sites in the USA
(Lyon et al., 1993), of which a few thousand are undergoing or in need of remediation. Understanding of
the biological component to AMD formation may assist in at least three potential strategies. First, under-
standing of microbial community structure and function could reveal metabolic characteristics that provide
opportunities for biological suppression. Second, a potential solution to the long-term nature of AMD prob-
lems is to promote AMD formation in situ, reducing the duration of the problem and also increasing the
feasibility of cost-effective metal recovery from effluent. Clearly, understanding how microbial communi-
ties function is central to this effort. Finally, there has been considerable progress in Japan (e.g., the Mat-
suo Neutralization Plant on the Akagawa River, Japan) and in Europe in the development of strategies to
promote biological iron oxidation and iron oxyhydroxide precipitation in treatment facilities, enabling biore-
mediation of metal-contaminated waters (Johnson et al., 2002). Community structure-function information
could also prove to be useful for this application. For a review of considerations related to AMD treatment
and management, see Nordstrom et al. (1990, 1999) and Nordstrom and Alpers (1995).
In addition to the relevance of biological community studies to AMD management and treatment, there
are a number of important technologies that harness biological mineral dissolution to solubilize metals from
ore deposits (Rawlings et al., 1999; Sand et al., 1992). In the case of uranium, mining occurs at some sites
via in situ leaching (Bain et al., 2001). Bioleaching is potentially an energy-efficient and relatively envi-
ronmentally friendly technology for metal recovery that is also deployed in controlled environments. Bi-
oleaching is currently receiving the most attention outside of the United States, where mining of metals re-
mains a major economic activity (e.g., South Africa, Indonesia, Australia, South America).
Burning of sulfur-rich coal leads to release of contaminants such as mercury and to the formation of acid
rain, both of which cause considerable environmental damage. For this reason, a variety of approaches are
used to minimize release of sulfur and mercury to the atmosphere (e.g., burning of low S coal, sulfur re-
moval from emissions, and sulfur removal prior to combustion). Of potential technological significance are
biological methods for low-temperature, energy-efficient coal desulfurization. Such strategies are based on
the same methods used to solubilize minerals for metal recovery.
AMD is an enormous environmental problem, especially in countries where mining is more prevalent
than in the United States. However, worldwide, much mining is carried out by United States–based com-
panies or international companies in which there is significant U.S. investment. Ultimately, the technolog-
ical opportunities, as well as the liability for clean-up of mining-related environmental damage will have a
financial impact on the United States.
The Richmond Mine Site
Our research is carried out primarily at the Richmond Mine near Redding in northern California. The
Richmond Mine is the dominant source of very acidic (pH ? 1.0) solutions that drain from the Iron Moun-
tain Mines complex. The processes occurring within the Richmond system are broadly typical of AMD-
generating processes that occur at innumerable other sites in the United States and around the world.
It has been estimated that the Iron Mountain ore deposits were exposed by erosion approximately 700,000
to 1 million years ago (based on magnetic dating of the hematite deposits) (Alpers et al., 1999). Mining be-
gan in the 1860s and ceased in 1962. Current discharge from the Richmond tunnels averages ?165,000
liters per day, representing a loss of ?5 metric tons of Fe per day in solution. Despite this seemingly large
flux, AMD production will necessitate remediation at the site for the next 2,000–3,000 years. For other site
details, see Druschel et al. (2004); and for a review of remediation options, see Alpers et al. (2003).
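As a consistency check, the discharge and iron-loss figures quoted above imply an average dissolved iron concentration. The sketch below divides the daily iron mass by the daily volume, taking the quoted values as round numbers (165,000 liters per day and 5 metric tons of Fe per day). The function name is hypothetical, and the molar mass of iron (55.845 g/mol) is a standard value that is not given in the text.

```python
FE_MOLAR_MASS_G_PER_MOL = 55.845  # standard atomic weight of iron (not from the text)
GRAMS_PER_METRIC_TON = 1_000_000


def implied_fe_molarity(discharge_l_per_day: float, fe_tons_per_day: float) -> float:
    """Average dissolved Fe concentration (mol/L) implied by a flow and a mass flux."""
    grams_per_liter = fe_tons_per_day * GRAMS_PER_METRIC_TON / discharge_l_per_day
    return grams_per_liter / FE_MOLAR_MASS_G_PER_MOL


molarity = implied_fe_molarity(discharge_l_per_day=165_000, fe_tons_per_day=5)
print(f"Implied Fe concentration: {molarity:.2f} mol/L")
```

The result, roughly 0.5 mol/L, is in line with the sub-molar iron sulfate concentrations described earlier for these solutions.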
Prior studies at the Richmond Mine site have documented the nature of the AMD problem (Alpers et al.,
2003), described the extremes of pH at the site (Nordstrom et al., 2000), investigated the sulfate salt min-
eralogy (Nordstrom and Alpers, 1999) and seasonal fluxes in solution chemistry (Alpers et al., 1994; Nord-
strom and Alpers, 1995), and examined the impact of AMD on surrounding water bodies (Nordstrom et al.,
1999). In addition, there has been over a decade of microbial and geochemical research conducted within
the Richmond Mine by the Banfield group. A subset of work focused on the surface chemistry of pyrite
dissolution (McGuire et al., 2001), AMD chemistry (Druschel et al., 2004), and rates of colonization of
pyrite surfaces and of microbially-mediated dissolution (Edwards et al., 1998, 1999, 2001).
In the mid-1990s, the distribution of microorganisms at the site was determined via construction and se-
quencing of 16S rRNA gene libraries and by labeling of ribosomes by fluorescence in situ hybridization of
oligonucleotide probes (FISH) with species- and genus-level specificity (Schrenk et al., 1998; Edwards et
al., 1999; Bond et al., 2000a,b, 2001). This work documented the variation in species makeup over time
and as a function of geochemical conditions (Edwards et al., 1999), demonstrated the species-level sim-
plicity of the AMD communities (typically four to six dominant prokaryote species) relative to all other
habitats considered (Baker and Banfield, 2003), and attempted to quantify the metabolism-environmental
chemistry feedbacks (Edwards et al., 1999; Druschel et al., 2004).
A list of organisms found across all the mine habitats and a summary of associated information is pro-
vided in Table 2. The species profile of these communities is strikingly similar to those of most other AMD
systems (coal and metal deposits) and bioleaching systems worldwide (Johnson, 1998; Goebel and Stacke-
brandt, 1994; Baker and Banfield, 2002; Coram and Rawlings, 2002; Johnson et al., 2002). At the Rich-
mond Mine, subaerial biofilms and some subaqueous biofilms tend to be dominated by archaea (genera
known only from culture-independent analyses and referred to as the alphabet plasmas, including strains
related to Ferroplasma acidarmanus fer1) (Edwards et al., 2000) whereas biofilms at the air-water inter-
face are usually dominated by bacteria of the Leptospirillum genus (Hippe, 2000), mostly related to Lep-
tospirillum group II and Leptospirillum group III (Bond et al., 2000).
The Richmond Mine (Fig. 2) is accessed by a 400-m-long horizontal tunnel, which is maintained as part
of the onsite remediation activities. Essentially all AMD formed within the mine system collects at the junc-
tion of five tunnels and is piped out along the access tunnel to a treatment site. A range of sampling sites
within the exposed ore deposit (several hundred meters below the surface) can be accessed via horizontal
tunnels (primarily the A, AB, B, and C drifts, Fig. 2). AMD solutions reach the sampling locations after
flowing through the mine workings along a reaction path of unknown and probably highly variable length.
Transit times for rain from the surface to the sampling sites have been estimated in the range of months to
decades (Druschel et al., 2004). However, solution-pyrite interaction times are sufficient that the pH val-
ues are essentially always ?1.2. Biofilms are visible at all sampling locations at most times of the year
(they are periodically removed by high flow) and microbial cells are ubiquitous in the fine-grained pyrite
that accumulates on the tunnel floors. For these reasons, and given the significant distance from the sur-
face, we infer that essentially no surface-derived fixed carbon and nitrogen is supplied to the sampling sites.
The Richmond Mine sampling locations are at sites that are separated by hundreds of meters from each
other, and likely receive solutions that drain from widely separated regions of the ore deposit. Thus, some
degree of geographically-controlled strain diversification is possible (Whitaker et al., 2003). The sites pro-
vide a selection of sample types. Because very high rainfall early in the year drives massive groundwater
movement, flushing of biofilms occurs at least annually. Consequently, it is possible to sample biofilms in
their earliest stages of formation, and in subsequent growth stages, thus to investigate colonization and de-
velopment. Photographs of biofilms in different growth stages are shown in Figure 3.
In late October 2004, the site maintenance team removed tens of cubic meters of pyrite sediment that
had accumulated at the confluence of the AB and B drifts. After work ceased, a shallow pool (?10 cm
deep) of AMD accumulated over the remaining pyrite bed. Thirteen days later a very thin, evenly textured
biofilm had formed over the surface of the entire pool. Based on measurement of biomass collected from
defined areas, we estimate that the biofilm was ~60 µm thick. Using an initial inoculum of 1–10^6 cells/cm^2
(the latter is based on planktonic cell counts), we infer a doubling time of 8–13 h in situ. This value is close
to that estimated for pure cultures of iron-oxidizing microorganisms grown under optimal laboratory con-
ditions (5–20 h) (Nordstrom and Southam, 1997). These field-based estimates illustrate two important points:
(i) it is practical to study biofilm formation in situ over weekly to monthly time periods, and (ii) abundant
biomass is available for characterization (several 50-mL tubes of biomass can be removed from most sites,
if necessary, and many small, spatially resolved samples could be collected, with sufficient biomass for
both genomic and proteomic analyses).
TABLE 2. MICROORGANISMS DETECTED AT THE RICHMOND MINE
Phylogeny    Genomic data    Isolate    Reference
Bacteria from Iron Mountain
L. ferrooxidans I
L. ferrooxidans II
L. ferrooxidans III
Low G+C gram-
Bond et al. (2000)
Tyson et al. (in review)
Edwards et al. (2000)
Bond et al. (2000) Fragment
?-Proteobacteria sp. SI?80%
Acidobacterium sp. SI?95%
Bond et al. (2000)
Bond et al. (2000)
Edwards et al. (2000)
Baker et al. (unpublished
Archaea from Iron Mountain
Thermoplasmales sp. SI?93%
(A to Gplasma)
Edwards et al. (2000)
Bond et al. (2000)
Baker and Banfield (2002)
Baker et al. (unpublished
? Deep branching Fragmentary
Eukaryotes from Iron Mountain
Baker et al. (2003)
Baker et al. (2003)
Baker et al. (2003)
Baker et al. (2003)
Availability of genomic data, an isolate from the mine representative of the group, and reference information are noted.
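The doubling-time inference in the preceding paragraph rests on the standard exponential-growth relation N = N0 · 2^(t/td). The Python sketch below shows the two directions of that calculation. It is an illustration under stated assumptions: the function names are hypothetical, growth is treated as purely exponential with no cell loss, and the cell counts in the last example are made-up round numbers rather than measurements from the site.

```python
import math


def doubling_time_hours(elapsed_hours: float, initial_cells: float, final_cells: float) -> float:
    """Doubling time implied by growth from initial_cells to final_cells."""
    if not 0 < initial_cells < final_cells:
        raise ValueError("need 0 < initial_cells < final_cells")
    generations = math.log2(final_cells / initial_cells)
    return elapsed_hours / generations


def generations_elapsed(elapsed_hours: float, doubling_time: float) -> float:
    """Number of doublings that fit into elapsed_hours."""
    return elapsed_hours / doubling_time


ELAPSED_HOURS = 13 * 24  # the 13-day interval described in the text

# Doubling times of 8 and 13 h (the range quoted in the text) correspond to:
for td in (8, 13):
    print(f"td = {td} h -> {generations_elapsed(ELAPSED_HOURS, td):.0f} doublings")

# Hypothetical counts per cm^2, chosen only to exercise the function.
print(doubling_time_hours(ELAPSED_HOURS, initial_cells=1e6, final_cells=1e12))
```

A 13-day interval thus spans 24 to 39 doublings for doubling times of 13 and 8 h, respectively.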
The Richmond Mine as a site for studies of microbial communities
Comprehensive, high-resolution analysis of a real natural community may provide the means to answer
fundamental questions about community structure, function, and dynamics. Our approach is to choose a
tractable system for the first such analysis. No matter which system is chosen, questions will be raised about
the relevance of the discoveries to other systems. Our view is that the general importance of AMD as a ma-
jor and long-term environmental challenge (as well as associated technological opportunities) makes the
Richmond Mine biofilms an important target for study. Although the extreme habitat may prompt concerns
about the broad applicability of our findings, it is our view that “extreme” is a relative term. The organ-
isms colonizing this habitat are highly adapted to it. We suspect that there are no reasons why fundamen-
tally different principles of ecology, adaptation, and evolution should apply, although the details will be en-
vironment specific. Microbial mats, layered microbial communities, other biofilms, and some environments
with high cell densities are likely to share similar functional and structural organization.
The Richmond Mine biofilms are ideal model systems for development of community proteogenomic
methods because they are dominated by a few species, contain high biomass, and represent essentially self-
contained chemoautotrophically based ecosystems that can be sampled over space and time. The geochemical
conditions vary primarily in response to hydrologic flow, which depends directly on rainfall. As a result,
most sites undergo an annual cycle in temperature, ionic strength, and flow rate. This relatively systematic
fluctuation enables study of colonization, biofilm development, and community response to perturbation.
The strong and clearly defined geochemical-microbiological feedbacks make it possible to analyze the in-
terplay between organism activity and environmental conditions. Based on similarities with other AMD
sites, we deduce that a relatively defined group of organisms has evolved to cohabitate over a very long
evolutionary time period. Thus, we expect well-established organism-organism interactions. The low species
richness is especially important, as it ensures that we can extensively sample the genomes of the dominant
organism types and assign metabolic functions to species (Tyson et al., 2004).
Species and populations
In 16S rRNA gene surveys and related studies, “species” are often roughly defined as organisms that
share ?97% identity at the 16S rRNA gene level. In our work at the Richmond Mine site, this definition
has been refined as the result of genomic analyses. We use a working definition of a species that stems
from genome assembly. The conceptual basis for such a definition derives from the idea that organism di-
vergence to form new species occurs by multiple parallel mechanisms, including formation of single nu-
cleotide polymorphisms and genome rearrangements, and genome assembly is affected by both of these. In
the case of archaea, we have found that reasonably stringent assembly criteria unite organisms that are fre-
quently exchanging DNA and separate groups that are only rarely undergoing homologous recombination
(Tyson et al., 2004). For example, in the case of Ferroplasma type I and type II, ?1% divergence at the
16S rRNA gene level corresponds to an average of ?22% at the nucleotide level (genome wide). These
genomes assembled separately and we consider types I and II to be different species. In contrast, there is
an average of ?2% divergence within each of the Ferroplasma populations, and data from individuals in
each was largely co-assembled into a composite genome sequence.
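The contrast between a 16S rRNA identity cutoff and genome-wide divergence can be made concrete with a small calculation. The sketch below computes percent identity over two pre-aligned sequences and applies a 97% cutoff. The function names, the treatment of the cutoff as inclusive, the gap handling, and the toy sequences are all assumptions for illustration, and the code does not reproduce the assembly-based criterion described above.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over ungapped columns of two pre-aligned sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to equal length")
    compared = matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" or y == "-":
            continue  # skip columns containing a gap
        compared += 1
        matches += x == y
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


def same_species_by_16s(aligned_a: str, aligned_b: str, threshold: float = 97.0) -> bool:
    """Apply a marker-gene identity cutoff (assumed inclusive here)."""
    return percent_identity(aligned_a, aligned_b) >= threshold


reference = "ACGT" * 25                      # toy 100-nt sequence
close_relative = "TT" + reference[2:]        # 2 mismatches
distant_relative = "TTTA" + reference[4:]    # 4 mismatches

for name, seq in (("close", close_relative), ("distant", distant_relative)):
    print(name, percent_identity(reference, seq), same_species_by_16s(reference, seq))
```

A marker-gene cutoff of this kind says nothing about divergence elsewhere in the genome, which is why the Ferroplasma example above relies on genome-wide comparison and assembly behavior instead.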
SCIENTIFIC OPPORTUNITIES AND HYPOTHESES
Key remaining questions relate to the roles of uncharacterized organisms in natural ecosystems, how
function in consortia differs from function in laboratory cultures or co-cultures, and the form of the inter-
play between genomes and the environment that leads to adaptation and speciation. We anticipate that strain
populations growing as part of multi-organism natural communities will exhibit metabolic activities that are
fundamentally different from those observed when organisms grow in pure cultures. For example, we expect
to see partitioning of functions among community members that will not be evident in pure culture and pos-
sibly laboratory co-cultures. We predict that environmental perturbations preserve strain-level protein het-
erogeneity, implying that the dominant protein variant will vary with geochemical conditions and commu-
We predict that the important investments of resources made by members of communities will include
proteins and protein complexes that play critical roles in environmental adaptation. Furthermore, we expect
that many of the highly expressed lineage-specific genes currently lacking functional annotation evolved in
response to specific challenges of the geochemical environment. However, we anticipate that a significant
fraction of lineage-specific genes were acquired via lateral gene transfer (especially via phage insertion),
and predict that only a small subset of their products will be detected in the proteomic datasets.
FIG. 2. Map of the Richmond Mine showing the tunnel system within the ore deposit. Labels in figure: AMD flow
direction; Tyson et al. 2004.
FIG. 3. Sampling of biofilms. Photographs are from the Richmond mine site. (Top left) Biofilm used in initial
proteomic studies. (Bottom left) 6 months later, biofilm from the same general area is much thicker. (Top right) In a
13-day-old biofilm, a 10" × 10" area has been cut for sampling. (Bottom right) A similar 10" × 10" area after sampling.
The pink biofilm color is due to the high concentration of heme-bearing cytochromes; green solution color is due to
the high metals concentration.
Figure 4 diagrams an important objective of our work. It illustrates the genetic potential of four organ-
isms, comprising a simple community, and shows a snapshot of gene expression. Each organism’s genome
is shown as a ring of open reading frames (ORFs); color indicates level of expression (red is high). The
ring structure represents population structure: the single ring indicates a clonal population; the double ring
indicates a species with two dominant clonal variants. Rings with one to three variants per ORF indicate
mosaic genome structure (combinatorial variants), as appears typical of archaeal populations (Tyson et al.,
2004; Papke et al., 2004; Whitaker et al., 2005; Allen et al., 2005). At this time, we do not know whether
strains in the same sample express their gene variants at the same time or different times (or in different
places), because analyses to date have used only the strain consensus genome sequence. We do not know
how functions such as carbon fixation, polymer production, etc. are distributed over time, space, or growth
stage because we have only studied a single sample using a single genomic dataset. Similarly, we do not
know how the community would respond if an aspect of the system (e.g., metal content, pH, temperature)
were perturbed, or how the community re-establishes its stability and controls its physical and chemical
environment. Furthermore, we do not know what ~40% of the genome-encoded genes do. Our goal is to
understand how to render this cartoon in the form of a real model that addresses all major parameters that
define the ecosystem and its dynamics.
OVERVIEW OF THE APPROACH: ENVIRONMENTAL PROTEOGENOMICS
Samples will be collected from a single site with three goals: (i) to document the biofilm population struc-
ture and gene expression as a function of growth stage; (ii) to evaluate changes in gene expression due to
environmental perturbation (in some cases, deliberately induced in the field); and (iii) to document spatial
heterogeneity (with smaller scales of strain detection achieved via PCR-based methods). Samples will be
collected from different sites (Fig. 2) in order to explore strain structure and gene expression in communi-
ties with different membership and under a wider range of geochemical conditions (e.g., the temperature
range anticipated is 30–50°C, pH range 0.5–1.2, differences in aqueous Fe2+/Fe3+ from ferrous to ferric
iron–dominated, and ionic strength variations of a factor of 2–4, with associated variations in metal con-
centrations). This will enable identification of proteins that are abundant over all growth conditions vs. those
that dominate only in specific growth stages or environmental conditions.
We will characterize the chemical composition of acid mine waters within the Richmond Mine as com-
prehensively as possible for documentation of the environment of biofilm samples. Considerable challenges
are expected in some of the analytical determinations because (i) these waters contain extraordinary solute
concentrations, some of the highest ever documented (Nordstrom and Alpers, 1999; Nordstrom et al., 2000)
and (ii) nutrient concentrations are needed for this study and they are normally not determined in such so-
lutions because routine methods would have substantial interferences. Consequently, in addition to standard
characterization, research is needed to develop techniques that can reliably measure low nutrient concen-
trations (nitrate, nitrite, ammonium, nitrogen gas, dissolved organic carbon, and phosphate) in the presence
of extremely high concentrations of sulfate and metals.
The species makeup of each biofilm sampled will be determined by epifluorescence (including confocal)
microscopy using fluorescence in situ hybridization (FISH) with strain-, species-, group-, and domain-level probes. This method
has proven extremely effective for profiling AMD biofilms because there is very little interference from aut-
ofluorescence and the persistence of dead cells is minimal (Bond et al., 2000; Baker and Banfield, 2003). Sam-
ples will be compared to establish the degree of heterogeneity and to select the most representative biofilms
for subsequent analysis. Microscopic characterization will be used to estimate total cell concentrations and the
relative abundances of the dominant organism types. The extracellular polymeric compounds that comprise
the biofilm will be characterized, primarily via NMR and mass spectrometry. It is likely that the formation of
the exopolysaccharide matrix is advantageous to at least some organisms. Correlating matrix information with
protein production, community membership, and geochemical conditions will likely indicate who is making
the biofilm and what the signal(s) for matrix formation are. The nature of the matrix is also likely to be a use-
ful way to distinguish different types of biofilms, and deduce their growth stages.
DNA will be extracted from biofilm samples and sheared into ~3-kb fragments. Ends will be repaired and the DNA
ligated into a vector, screened, and transformed into Escherichia coli. The quality of the libraries will be assessed
by PCR of the inserts from ~100 clones and sequencing of these products. The preliminary sequence
data will also provide information about the strain makeup (~700-bp reads will be compared against existing
genomic data). We anticipate constructing and screening at least twice as many libraries as will be
sequenced to significant depth. Fractions of the samples used for library construction, as well as many other
samples, will be processed for proteomic analysis.
The proteome is defined here as all of the proteins present in an organism at a particular time. Analysis
at the proteome level is a step towards understanding the many biochemical functions associated with a
community. Mass spectrometry (MS)–based (proteomic) methods for identification of proteins generally
rely upon availability of genome sequence data. Our approach, in its most simple rendering, is to obtain
the genomic information needed to characterize the protein complement of natural biofilms. This involves
extraction of proteins from natural communities, fractionation based on their cellular association (e.g., membrane
vs. extracellular vs. cytoplasmic), mass, and isoelectric points, digestion of the proteins into peptides,
separation of the peptides chromatographically, measurement of the mass of the peptides and their fragmented
products (via a two-step MS measurement), and assignment of the identified peptides to a predicted
protein through database searching algorithms.
Figure 4. Population structure and activity in the acid mine drainage (AMD) system. Illustration depicts a snapshot of
the AMD biofilm community at the molecular level. Color on genes (circles) indicates detection of a protein product
(red is high and white is no expression). Some pathways may be active in some organisms or strains, but not in others,
indicating functional partitioning. Note the connection between iron oxidation and pyrite dissolution. (Adapted from
Tyson et al.)
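The last step of this workflow, assignment of identified peptides to predicted proteins, can be illustrated with a minimal Python sketch. It assumes trypsin as the digestion enzyme (the text does not name one), no missed cleavages, and exact sequence matching. The function names and toy sequences are hypothetical and are not the search algorithms used in the work described here.

```python
from collections import defaultdict


def tryptic_digest(protein, min_len=6):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    # Very short peptides are rarely informative, so drop them.
    return [p for p in peptides if len(p) >= min_len]


def build_peptide_index(predicted_proteins):
    """Map each predicted peptide to the IDs of the proteins that contain it."""
    index = defaultdict(set)
    for prot_id, seq in predicted_proteins.items():
        for pep in tryptic_digest(seq):
            index[pep].add(prot_id)
    return index


def assign_peptides(identified_peptides, index):
    """Return {protein_id: matched peptides}; unmatched peptides are dropped."""
    hits = defaultdict(set)
    for pep in identified_peptides:
        for prot_id in index.get(pep, ()):
            hits[prot_id].add(pep)
    return dict(hits)


# Hypothetical predicted proteins from a community genomic dataset.
proteins = {
    "orfA": "MK" + "TAYIAK" + "QRPLEDGSWK" + "AAGGR",
    "orfB": "MSTNPKPQR" + "LLVVGGEEK",
}
index = build_peptide_index(proteins)
hits = assign_peptides(["TAYIAK", "LLVVGGEEK", "NOTINDB"], index)
```

Any peptide absent from the index, such as the third entry above, is silently unassigned, which is the behavior that makes strain variation a problem for database searching.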
Initial proteomic studies have relied upon a protein database predicted from existing community genomic
sequence (Tyson et al., 2004). However, a single amino acid substitution is likely to prevent matching of
a peptide to a protein. This significantly limits the utility of sequence data for proteomic analyses of sam-
ples that differ in strain makeup from those characterized by genomics. Thus, an important challenge for
this project, and for any project that seeks to use genome information from an isolate (or community) to
characterize a new sample, is to develop methods to deal with protein-level strain variability. Although con-
siderable information can be obtained if the strain mixture is similar (but not identical) to that for which
genome sequence is on hand, the completeness of such studies is compromised by the inability to detect
some abundant proteins and inaccuracies in measures of protein abundance due to “missing peptides” (MS
spectra that were collected but could not be assigned). Thus, ways to solve the “missing peptide” (and pro-
tein) problem are an important target for our future work.
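The effect of a single amino acid substitution on peptide matching can be made concrete with a short Python sketch. It uses standard monoisotopic residue masses; the peptide sequences and the 3-Da parent-mass tolerance are illustrative assumptions, not values taken from the study.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per linear peptide


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def parent_mass_matches(observed_mass, candidate, tol_da=3.0):
    """True if the candidate peptide lies within the (assumed) tolerance."""
    return abs(observed_mass - peptide_mass(candidate)) <= tol_da


database_peptide = "LLVVGGEEK"   # hypothetical peptide predicted from the database
sample_peptide = "LLVAGGEEK"     # hypothetical strain variant, V -> A at position 4

observed = peptide_mass(sample_peptide)
shift = peptide_mass(database_peptide) - observed
found_in_database = parent_mass_matches(observed, database_peptide)
```

Here the valine-to-alanine change shifts the parent mass by about 28 Da, so the database peptide is never considered as a candidate. Some substitutions, such as leucine for isoleucine, are mass-neutral and would not be detected this way.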
In one approach to this problem, we will use population genomic sequence data to create an inventory
of predicted protein variants for the dominant strain populations. This will involve sequence from multiple
new biofilm samples collected from sites characterized by different geochemical conditions, and biofilms
in different growth stages. The genomic and proteomic data will be integrated with results from cultures to
allow us to correlate strain distribution with steps in the colonization process, community structure, and
Methods will also be developed to estimate relative abundance levels of proteins to allow comparison of
function and activity among samples. This, in combination with strain variant analysis approaches, will pro-
vide general tools needed to employ information from isolate genomes in new environments. Ultimately,
our proteogenomic approach will be deployed at other AMD sites to test how well the genomic dataset and
proteogenomic approach can be extended to more distantly related communities.
An important outcome of proteomic analyses is the validation of novel hypothetical or conserved hypothetical
proteins, which typically comprise ~40% of the predicted proteins. Validation is an important first step.
However, biological insights will be greatly restricted unless we can determine the function of these presum-
ably important molecules. Although the magnitude of the problem is immense, a method to tackle it must be
found if we are to approach the desired level of understanding of microorganisms in their environments. We
will develop methods for functional assignment that include new ways to purify proteins and protein complexes
from natural biofilm samples and methods to target specific proteins of interest. One specific goal is to eluci-
date the electron transfer mechanisms central to energy generation and AMD formation. Enriched or purified
proteins and protein complexes recovered from environmental samples will be subjected to detailed MS char-
acterization. Purified proteins will be evaluated with biochemical assays to test functions predicted from operon
structure, localization (in protein fractions), and protein structural modeling. These proteins and protein com-
plexes will be analyzed by both peptide and intact protein MS methods to determine post-translational modifi-
cations, N-terminal and C-terminal truncations and extensions, as well as protein variants. Proteins purified from
strain variants will be compared to evaluate the basis for biochemical adaptation. As many of the novel proteins
have orthologs in multiple community members and, in some cases, in organisms from other lineages, the
impact of each new functional assignment will be significant.
GENOMICS, PROTEOMICS, AND BIOCHEMISTRY
An introduction to community genomics
Understanding of function in microbial communities requires methods to describe the types of organisms
present, their strain (population) diversity, and their biochemical pathways and activity levels. Organism
detection via amplification of ribosomal and conserved protein-encoding genes (Pace et al., 1985; Rappe
and Giovannoni, 2003; Baker et al., 2003) has enabled cultivation-independent surveys of diversity in a
wide variety of environments. However, this approach provides limited physiological and evolutionary insights.
More recently, large-scale recovery of genes and genomes directly from environmental samples
has become practical (Tyson et al., 2004; Venter et al., 2004). This approach provides an enormous amount
of information about genetic potential. Sequencing of DNA fragments derived from populations rather than
isolates also provides the information necessary to define community structure and genome heterogeneity and
to investigate selection. These databases open the way for functional analyses that rely upon gene sequence
information, such as gene expression arrays (Dennis et al., 2003; Sabat et al., 2003; Wu et al., 2001) and
proteomics (Ram et al., 2005; Florens et al., 2002).
Cultivation-independent genomic sequencing has been applied to a variety of environments, with differ-
ent types of results. In studies of complex and transient assemblies of organisms, important findings in-
clude catalogs of gene types, estimates of diversity (Venter et al., 2004), and discovery of new pathways
(Beja et al., 2000). In our own work (Tyson et al., 2004), we chose to focus on a chemoautotrophic biofilm
from the Richmond Mine that comprises an essentially self-sustaining ecosystem. An overview of the re-
sults of this work is given below.
AMD community genomics: initial results
The low species diversity of Richmond Mine biofilms suggested that relatively complete genome re-
construction would be possible if the genomes in each “species” population had relatively consistent gene
order and sequence divergence was not high. In 2002–2003, we obtained 100 Mb of sequence data from two
small-insert libraries from a single biofilm sample. Using the initial 76 Mb, it was possible to reconstruct
the genomes of the dominant bacterium (Leptospirillum group II) and archaeon (Ferroplasma type II), and
partially (~90%) reconstruct the genomes of Leptospirillum group III (bacterial), Ferroplasma type I (archaeal,
very closely related to F. acidarmanus fer1 [Edwards et al., 2000], for which complete genome sequence
data exist), and Gplasma (archaeal). Genome reconstruction, metabolic analyses, and some initial
population structure analyses have been published (Tyson et al., 2004).
Genome fragments (scaffolds) in the community genomic dataset were initially assigned to organisms
using GC content and read depth. Scaffold binning was refined using tetranucleotide frequency analysis
and via manual reassembly (unpublished work). Currently, the composite genome of Leptospirillum group
II is in ~50 fragments (with some gaps). The majority of the genome of the dominant Leptospirillum group
II population was near clonal. In contrast, the archaeal populations show significant strain heterogeneity,
both at the sequence level and in terms of gene content. The composite Ferroplasma type II genome is now
in ~14 fragments, with a tentative ordering of these scaffolds based on the largely syntenous F. acidarmanus
fer1 genome.
Both through comparison of genomic data for the fer1 isolate and its associated strain population,
fer1(env), and by analysis of the Ferroplasma type II scaffolds, it has been possible to quantify different
modes of strain and species divergence (Allen et al., 2005). For example, approximately 50 genes are present
in fer1 that are not present in fer1(env), and vice versa (excluding those absent due to ~15% lack of
coverage for fer1[env]). The fer1 and fer1(env) genomes are almost completely syntenous, whereas there
are at least 44 genome rearrangements (defined at ≥2 genes) that distinguish gene order in fer1 from that
in Ferroplasma type II (~70% of fer1 genes have orthologs in Ferroplasma type II). Notably, many genes
that occur in a subset of strains or only in one of the two species are encoded in blocks that are associated
with integrases and other genes that suggest a likely plasmid or phage origin. At the amino acid level, there
is an average of about 2% divergence (for orthologs) within the fer1 and Ferroplasma type II populations,
and an average of ~20% divergence between these populations (and ~1% divergence at the 16S rRNA
level). This information is important in that it relates to the full scope of genetic potential within the com-
munity and variability available for selection and species divergence.
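As a rough illustration of how amino acid–level divergence between orthologs might be tallied, the following Python sketch computes the percentage of mismatched positions in pre-aligned sequence pairs. The treatment of gaps (gapped columns are skipped) and the toy sequences are assumptions made for the example; the text does not specify how the reported averages were calculated.

```python
def percent_divergence(aligned_a, aligned_b, gap="-"):
    """Percent of aligned, ungapped positions at which two proteins differ."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = mismatches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # assumption: gapped columns are not counted
        compared += 1
        mismatches += x != y
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return 100.0 * mismatches / compared


def mean_divergence(aligned_pairs):
    """Average percent divergence over a list of aligned ortholog pairs."""
    values = [percent_divergence(a, b) for a, b in aligned_pairs]
    return sum(values) / len(values)


# Hypothetical aligned ortholog pairs.
pairs = [
    ("MKTAYIAKQR", "MKTAYLAKQR"),  # one substitution in ten residues
    ("MK-AY", "MKTAY"),            # one gap, no substitutions
]
average = mean_divergence(pairs)
```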
Based on functional annotations, we deduced that the only abundant organism capable of nitrogen fixa-
tion in the AMD communities is Leptospirillum group III. This insight enabled design of a cultivation strat-
egy that led to the isolation of the first member of this group (L. ferrodiazotrophum) (Tyson et al., 2005),
and established its role in iron oxidation. Therefore, this organism is now available for proteomic studies
of isolates and mixed cultures.
In future work, we consider it important to develop methods for efficient and effective assembly of ge-
nomic information from real populations. We also highly prioritize the need to develop ways to communi-
cate information about the form of strain heterogeneity and means by which to create databases of strain-
specific protein variants. This reflects a movement away from composite (species) genomes and toward full
genomic representations for populations. This information is also critical for analysis of selection and pop-
The outputs of different genome assembly programs will be compared so as to optimize the resulting
strain composite genomes. If necessary, the assembly programs will be customized. Because current assembly
programs are designed for mammalian genome projects (Batzoglou, 2005), they do not readily address
problems that arise due to differences in gene content and sequence divergence. Thus, manual analysis
of the resulting scaffolds is necessary. A typical example of scaffold termination due to strain differences
is shown in Figure 5. In this case, two transposases (TN) and one other novel gene are present in one vari-
ant, but not another. Based on progress to date, we anticipate that near complete assembly of genomes from
community genomic data from multiple biofilms will be possible, but this will require manual analysis. The
manual approach, once optimized, may be automated.
Assigning scaffolds (and genes) to organisms
The first challenge of any community (environmental) genomics study that aims to extract ecological in-
sights is to correctly assign genome fragments to organism types. In our experience, the most effective ap-
proach relies heavily upon assembly. Once large scaffolds are generated, they are assigned to organisms
(“binned”) based upon the phylogeny of the genes they encode (e.g., 16S rRNA, other), GC content (in our
system, this effectively separates most archaeal from bacterial scaffolds), depth of sequence coverage (num-
ber of reads per unit length of DNA sequence), and di-, tri-, and tetranucleotide frequencies (Teeling et al.,
2004). Most of these methods are unreliable if the scaffolds are small. Thus, construction of composite
genomes for specific organisms will rely heavily upon the assembly of sequence data from the samples.
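Three of the binning signals named above (GC content, depth of coverage, and tetranucleotide frequency) are simple to compute, as the Python sketch below shows. It is a simplification: it compares raw tetranucleotide frequencies by Euclidean distance, whereas the cited method of Teeling et al. (2004) uses a more elaborate statistical treatment. All function names and sequences are hypothetical.

```python
import math
from collections import Counter
from itertools import product

TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def gc_content(seq):
    """Fraction of unambiguous bases that are G or C."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def reads_per_kb(n_reads, scaffold_length):
    """Depth expressed as the number of reads per kilobase of scaffold."""
    return n_reads / (scaffold_length / 1000.0)


def tetra_freq(seq):
    """Frequency vector over all 256 tetranucleotides (overlapping windows)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[t] for t in TETRAMERS)
    return [counts[t] / total if total else 0.0 for t in TETRAMERS]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_bin(scaffold, references):
    """Assign a scaffold to the reference bin with the closest profile."""
    profile = tetra_freq(scaffold)
    return min(
        references,
        key=lambda name: euclidean(profile, tetra_freq(references[name])),
    )


# Toy reference sequences standing in for two organism bins.
references = {"at_rich": "AT" * 50, "gc_rich": "GC" * 50}
assigned = nearest_bin("AT" * 20, references)
```

Composition profiles of this kind become noisy on short sequences, which is consistent with the observation that most of these methods are unreliable for small scaffolds.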
Figure 5. Assembly of community genomics data. Illustration is a typical case where differences in gene content within
the strain population can be resolved to achieve more complete assembly. Identical genes on the two scaffolds are the
same shade of gray. Scaffold 5 contains three genes not present on scaffold 50 (and scaffold 50 has one gene not present
on scaffold 5).
In addition to problems at scaffold ends, assembly of data from strain populations generates many small
scaffolds. This is illustrated in Figure 6, where scaffold 960 maps onto scaffold 1, but was separated due
to differences in gene content and sequence dissimilarity. We are developing methods to rapidly identify
regions of small scaffolds that share high sequence identity, gene content, and gene order with larger scaffolds
(considering the number of shared vs. novel genes and allowing some degree of sequence divergence).
We will not develop methods to simply bypass assembly problems due to differences in gene content because
the form of strain heterogeneity is a primary product of our analyses. In cases where assembly does
not bin small scaffolds, phylogeny of the encoded genes will be used for tentative binning. Other approaches
may utilize methods for accurately building phylogenetic trees that take into account the special characteristics
of data from community sequencing projects (e.g., partial sequences).
Figure 6. Differences in gene content between fer1 and Ferroplasma type II. Note that two Ferroplasma type II scaffolds
sample the same genomic region, but were separated during assembly.
Beyond composite genomes to fully analyze strain heterogeneity
Community genomic datasets provide direct insights into the structure of strain populations. Comparison
amongst reads covering specific genomic regions will reveal single nucleotide polymorphisms (SNPs), which
we anticipate will represent a major form of strain heterogeneity. Based on prior analyses that used an average
sequence depth of 4 to 20 reads, we anticipate only a small number of sequence types at each locus and an
average amino acid–level divergence for orthologs of up to around 8% (Tyson et al., 2004; Allen et al., 2005).
Based on prior work, we anticipate that for genomes with good coverage, we will be able to reconstruct
gene variants for a significant number of loci for which variation occurs. We have been able to track vari-
ants distinguished by nucleotide polymorphism patterns across at least 10 kb of sequence, using both se-
quence similarity and mate pair information to span regions with no variation. Identification of nucleotide
polymorphism pattern types is required in order to identify reads from individuals that have undergone ho-
mologous recombination between variant types (Tyson et al., 2004; Allen et al., 2005). Manual tracking is
far too time consuming to be done comprehensively. Consequently, one of our objectives is to develop an
efficient method to analyze nucleotide polymorphism pattern data to enable creation of a database of genes
and predicted proteins for each population.
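One simple way to begin automating the identification of nucleotide polymorphism patterns is sketched below in Python. It assumes that reads have already been aligned to common coordinates and trimmed to the same region, and it requires each allele to appear in at least two reads so that isolated sequencing errors are not counted; both are assumptions of this example, not the method under development. Recombinant reads and mate-pair linkage are not handled.

```python
from collections import Counter, defaultdict


def polymorphic_sites(reads, min_reads_per_allele=2):
    """Columns where at least two bases are each supported by enough reads."""
    sites = []
    for col in range(len(reads[0])):
        counts = Counter(r[col] for r in reads if r[col] in "ACGT")
        supported = [b for b, n in counts.items() if n >= min_reads_per_allele]
        if len(supported) >= 2:
            sites.append(col)
    return sites


def variant_types(reads, min_reads_per_allele=2):
    """Group read indices by their bases at the polymorphic columns."""
    sites = polymorphic_sites(reads, min_reads_per_allele)
    groups = defaultdict(list)
    for idx, read in enumerate(reads):
        pattern = "".join(read[c] for c in sites)
        groups[pattern].append(idx)
    return sites, dict(groups)


# Hypothetical aligned reads covering the same eight positions.
reads = [
    "ACGTACGT",
    "ACGTACGT",
    "ACGAACTT",
    "ACGAACTT",
    "ACGTACGT",
]
sites, groups = variant_types(reads)
```

In this toy case, two linked polymorphic sites separate the reads into two sequence types, each of which could then be translated to give a predicted protein variant for the database.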
We suspect that even subtle protein variants are subjected to significant selective pressures. At this time
we do not know why specific protein variants are maintained in populations, but we anticipate that at least
a subset reflects optimization to different environmental conditions (or to predation pressure from phage
or eukaryotes) (Weinbauer et al., 2004; Thingstad et al., 2000; Huws et al., 2005; Boenigk et al., 2004). It
is vital that we develop methods to generate strain variant databases to enable both population genetic stud-
ies of selection and also to monitor expression of protein variants as a function of the organisms present,
growth stage, and geochemical conditions.
In our experience, gene order within species is almost completely preserved (Allen et al., 2005; J.F. Ban-
field, unpublished data) so that it is feasible to develop a representation of differences in gene content and
sequence with a diagram such as shown in Figure 7. We refer to this as a population genome map. This
map will indicate locations of insertions, deletions, and genome rearrangements, where necessary. The pro-
teomic data will also be linked to genes through this interface.
Introduction to proteogenomics of natural communities
Extensive analyses of the protein complements of natural microbial consortia are possible if peptides can
be assigned to proteins based on predictions made from genomic sequence information. Because only a
very small number of communities have been sampled by cultivation-independent environmental genomic
methods, the potential for such analyses has been quite limited. As a result, there has been little develop-
ment of the necessary methods for characterization of the proteome of multi-species assemblages. Further-
more, most proteomic studies have focused on clonal cultures, and so the challenges that arise due to strain
variation have not been fully investigated. Similarly, the challenges of protein abundance quantification for
multi-organism mixtures sampled directly from the environment have not been adequately addressed. Meth-
ods for proteomic characterization of complex natural mixtures of organisms may find application in many
fields of biology where genome sequence for one organism is to be used for characterization of closely re-
lated, but not identical, organisms (including medicine).
Initial proteogenomic results
The community genomic database (composite genome sequences) for the AMD biofilms currently in-
cludes 12,148 genes associated with five organisms. Using this genomic database, we were able to identify
2,036 proteins (two peptides required per protein for a match) that were extracted from a biofilm collected
in 2004 from a site in the AB drift (Fig. 2). Forty-eight percent of the detected proteins were assigned to
Leptospirillum group II, as expected given its high abundance in the sample. When these proteins were as-
signed to COG categories, we noted that predicted novel (hypothetical) proteins were the most abundant
category detected, but the rate of detection was lower than expected, based on the frequency of novel genes
in the genomic dataset (Ram et al., 2005). The abundance of novel proteins suggests that they play key
roles in adaptation. Proteins involved in energy generation, defense against oxygen and hydroxyl radical
species, protein refolding at low pH, and protection against toxic metals were highly detected, providing
insights into the challenges of life in the AMD system (Ram et al., 2005).
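The two-peptide criterion and the per-organism tally described above can be expressed compactly. The Python sketch below is illustrative only: the input structures, protein identifiers, and peptide strings are hypothetical, and the real analysis involved scoring and filtering steps that are not reproduced here.

```python
from collections import Counter


def identified_proteins(peptide_hits, min_peptides=2):
    """Keep proteins supported by at least `min_peptides` distinct peptides."""
    return {
        prot for prot, peptides in peptide_hits.items()
        if len(set(peptides)) >= min_peptides
    }


def organism_fractions(proteins, protein_to_organism):
    """Fraction of identified proteins assigned to each organism."""
    counts = Counter(protein_to_organism[p] for p in proteins)
    total = sum(counts.values())
    return {org: n / total for org, n in counts.items()}


# Hypothetical peptide evidence per predicted protein.
peptide_hits = {
    "p1": {"TAYIAK", "QRPLEDGSWK"},
    "p2": {"LLVVGGEEK"},                   # one peptide only: not counted
    "p3": {"AAGGSR", "VVLLEK", "GGPPTK"},
    "p4": {"EEDDLK", "SSTTVR"},
}
protein_to_organism = {
    "p1": "Leptospirillum group II",
    "p2": "Ferroplasma type II",
    "p3": "Leptospirillum group II",
    "p4": "Ferroplasma type II",
}
detected = identified_proteins(peptide_hits)
fractions = organism_fractions(detected, protein_to_organism)
```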
The proteomic analyses generated many mass spectra that could not be assigned to any protein. There
are several reasons for this, two of which relate to problems with the genomic database. As noted above,
peptides from proteins that differ significantly from those encoded in the community genomic dataset are
unlikely to be matched. If there are substitutions throughout the protein, matching of all peptides (or all but
one peptide) may be precluded. In this case, the protein is unidentifiable. Alternatively, a subset of pep-
tides may be found, resulting in (i) assignment of the peptides to a protein that is not identical to the dom-
inant form in the community and (ii) incorrect evaluation of the relative abundance of that protein. For these
reasons, proteomic analyses must move away from composite genome sequence-based analyses to utilize
the full database of gene types detected in each community. A second reason for unassigned spectra is the
presence of organisms in the community that were not sampled in the genomic dataset (due to different or-
ganism membership or incomplete genome coverage).
The problem of the “missing peptides” due to strain variation is apparent through analysis of the way in
which peptides recovered using genome sequence from one strain map onto the protein from a closely re-
lated strain. Regions of the protein that are not reconstructed may be signal peptides or peptides that con-
tain one or more amino acid substitutions. For example, in the case of a novel protein now referred to as
cytochrome 579, we amplified the gene sequence from DNA in the biofilm sample used for proteomic
analyses and detected 12 nucleotide polymorphisms relative to the community genomic database sequence.
When we included the true strain variant sequence into the database, peptides were recovered for the en-
tire mature protein. Certainly many of the high-quality mass spectra collected in the initial study could be
matched if (i) a database of strain variant sequences was available or if (ii) there was a method for further
characterization of unassigned peptides.
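The gain in sequence coverage obtained by adding a strain variant to the database can be illustrated with a small Python sketch. The sequences below are invented for the example and are not those of cytochrome 579; coverage is computed by exact substring matching, which is an assumption of this sketch.

```python
def covered_positions(protein, peptides):
    """Set of residue positions covered by exact matches of any peptide."""
    covered = set()
    for pep in peptides:
        start = protein.find(pep)
        while start != -1:
            covered.update(range(start, start + len(pep)))
            start = protein.find(pep, start + 1)
    return covered


def coverage(protein, peptides):
    """Fraction of the protein sequence covered by the peptides."""
    return len(covered_positions(protein, peptides)) / len(protein)


# Hypothetical consensus (database) sequence and the variant in the sample.
consensus = "MK" + "TAYIAK" + "QRPLEDGSWK" + "LLVVGGEEK"
variant = "MK" + "TAYIAK" + "QRPLEDGSWK" + "LLVAGGEEK"

# Peptides actually present in the sample derive from the variant.
sample_peptides = ["TAYIAK", "QRPLEDGSWK", "LLVAGGEEK"]

coverage_consensus = coverage(consensus, sample_peptides)
coverage_variant = coverage(variant, sample_peptides)
```

With the consensus sequence alone, the peptide carrying the substitution goes unmatched and coverage is 16 of 27 residues; with the variant sequence in the database, coverage rises to 25 of 27.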
It is probable that incomplete proteomic analysis of typically lower abundance organisms (<10% of the
biofilm) will be a problem. In order to improve resolution of these proteomes, we will utilize physical enrichment
methods, such as selective lysis. For example, archaea are prone to lyse when the pH is raised,
and this can be used to selectively reduce their input to protein separations. Alternatively, size separations
can be used to select for smaller cells (often archaea).
Figure 7. Map of population genome data. Cartoon shows an example map in a browser designed to display comprehensive
information about gene types (shown as horizontal bars). Multiple gene types can occur at each locus. Inclined
lines indicate a gene inserted in one strain; arrow on a gene indicates a rearrangement in some individuals. Reconstructed
sequences for gene variants, their annotation, dN/dS values, and expression information (e.g., as a function of
geochemical conditions) will be displayed in zoom in mode (black box). (Figure legend entries: single sequence type;
gene variants at locus; novel gene insertion; rearrangement at locus. Map positions are labeled 228 to 239.)
Proteomic analysis of communities: going wider, digging deeper
The characterization of the proteome dataset of a complex natural microbial community will require a
systematic experimental plan to push the capabilities of current mass spectrometry techniques, as well as
development of the next generation of MS technology. In particular, we will apply state-of-the-art MS in-
strumentation for deeper and more comprehensive characterization and quantification of the proteins and
protein complexes important for community structure. The experimental and computational technologies
developed in this project will help pave the way for the characterization of more complex microbial com-
Based on the data-depth enhancement (approximately three times) achieved by moving from the con-
ventional quadrupole ion trap (LCQ) instrumentation to the newer generation linear trapping quadrupole
(LTQ) technology (Schwartz, 2002), we anticipate that higher performance MS approaches will provide
critical measurement capabilities. Despite our success in monitoring the complex proteome of the natural
microbial community, we are still primarily detecting proteins from the most abundant organisms. To dig
deeper and more comprehensively into the proteome of the community, we need to combine high-throughput
MS/MS with high mass accuracy, high dynamic range measurements. The recently commercially available
hybrid MS instruments, such as the LTQ-FTICR-MS (Syka, 2004) and LTQ-FT-Orbitrap instruments
(Hardman, 2003; Hu, 2005), appear to be well suited for this task. The complexity of the microbial com-
munity is undoubtedly the perfect challenge for such an instrument. We expect to be able to utilize such
instrumental platforms to enhance the total number of proteins detected by another factor of two to five
times, while simultaneously greatly increasing the confidence of identification (with false positive rates of
less than 1%). More important, however, is the ability to use such an instrument to conduct high-resolution
MS/MS and MS3-type experiments (Olsen, 2004) to help unravel strain variation. Although strain-resolved
genomic datasets can be achieved for the AMD community, in the longer term we expect that it will
be possible to accurately identify proteins from consortia without the need for exact genomic sequence for
the organisms present.
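The link between mass accuracy and identification confidence can be seen in a short Python sketch that filters candidate peptides by mass error in parts per million (ppm). The m/z values and tolerances are invented for illustration and do not describe the performance of any particular instrument.

```python
def ppm_error(observed_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def candidates_within(observed_mz, theoretical, tol_ppm):
    """Names of candidate peptides whose m/z lies within the tolerance."""
    return [
        name for name, mz in theoretical.items()
        if abs(ppm_error(observed_mz, mz)) <= tol_ppm
    ]


# Hypothetical candidate peptides with closely spaced theoretical m/z values.
theoretical = {"pepA": 1000.0000, "pepB": 1000.0200, "pepC": 1000.4000}
observed = 1000.0010

loose = candidates_within(observed, theoretical, tol_ppm=500)  # low accuracy
tight = candidates_within(observed, theoretical, tol_ppm=3)    # high accuracy
```

At the loose tolerance all three candidates remain and must be distinguished by fragmentation data alone; at a tolerance of a few ppm only one survives, which matters most when the search database is large.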
In addition to requirements for high mass accuracy, high dynamic range, and the need for methods to
deal with strain variations, it will be essential to develop methods to quantify protein abundances. Quanti-
tative methods will allow exploration of how relative and absolute protein abundances change from the ini-
tial colonization event through biofilm development. These data should reveal how functions and activity
levels redistribute between pathways and organism types, as well as the details of the ways in which or-
ganisms and communities respond to changes in their physical and chemical surroundings.
Methods for comprehensive proteomic characterization of natural microbial communities
It is important to expand the analytical capabilities of the MS-based proteomics methodology for identifying
and obtaining detailed biological information for every protein in an AMD sample. This requires identification
of a very large number of proteins and their post-translational modifications, truncation products,
and associated strain variability. The greatest challenges for analysis of a microbial community are the large
number and relatively low concentrations of many proteins within the system.
Fractionated proteomes or whole proteomes are more complex than enriched proteins and protein
complexes, and thus require more complicated separation methods. For all initial studies, protein separation
will be accomplished on integrated two-dimensional nano-columns equipped with nanospray MS
(Washburn, 2001; McDonald, 2002; Peng, 2003). For the initial phase of this project, all samples will be analyzed
on a linear ion trap (LTQ) mass spectrometer. For the later work on this project, a new hybrid
LTQ-FT-Orbitrap (Thermo Electron) will be integrated into the pipeline and used for all sample characterization.
High mass accuracy, high sensitivity and high dynamic range capabilities are features of this new instru-
ment, which can be coupled with a linear ion trap mass spectrometer capable of rapid data-dependent MS/MS
and MS3. This instrument can easily be coupled with multidimensional chromatography. The high mass
accuracy will allow for much more confident identification of MS/MS spectra because of the accuracy with
which the parent mass of the peptide can be determined (within 3 ppm). Furthermore, if desired, MS/MS
spectra can also be analyzed in the Orbitrap, allowing for mass accuracies within 3 ppm on fragment ions in
MS/MS or MS3 experiments. This is critical in community samples, where databases are larger and sample
variability will be much greater. More comprehensive proteomic analysis will be achieved due to the high
dynamic range of the instrument, because data-dependent MS/MS events will enable measurement of peptides
that cannot be detected in full scan mode on the linear ion trap. The linear ion trap is capable of MS3
experiments, which, along with high mass accuracy of the intact peptide and MS/MS spectra, will greatly
increase the accuracy of de novo sequencing techniques outlined below.
To interpret the LC-MS/MS data from the samples, proteome bioinformatics tools, in particular the Sequest
(Eng, 1994) and DBDigger (Tabb, 2004) search engines, will be employed, and web-based data repositories
will be created. One major goal is to develop new informatics tools for data analysis, data mining, and data
display that are critical for the challenges associated with these studies. New computational techniques will
immediately be implemented into the proteome informatics pipeline.
Unraveling microbial strain variant diversity
Global characterization of strain variants may be accomplished by combining MS3 and high mass accuracy
peptide measurements with a de novo sequencing algorithm to detect and verify peptides, and thus
proteins, from the AMD community. High mass accuracy measurement of parent peptide ions and fragment
ions is necessary to validate predicted sequences from the de novo sequencing algorithms. The high
mass accuracy will be accomplished with the LTQ-FT-Orbitrap instrument. This instrument allows for
accurate measurement of peptides within 3 ppm from LC-MS analyses across the dynamic range of
measurements. This level of mass accuracy allows for good discrimination between potential peptide sequences
predicted from a de novo sequencing attempt. If the mass of the predicted peptide is not within 5 ppm of
the measured mass, then the candidate sequence can be rejected and others can be considered. If all major
ions can be assigned, then the predicted sequence is most likely correct. While MS3 experiments are
theoretically possible on ion traps and FTMS instruments, they have found no real application in “shotgun”
proteomics because older versions of these instruments did not trap enough ions in the MS/MS experiment to
make MS3 a viable option in real-time LC-MS experiments. The linear ion trap has much greater ion
storage capacity, making MS3 experiments in real-time LC-MS a possibility. Our goal will be to optimize this
methodology for the analyses of large numbers of peptides from protein standard mixtures and a
representative AMD proteome sample. The basic experimental plan for an MS3 experiment follows three major
steps: (1) parent peptide mass measurement (at high mass accuracy on the LTQ-FTMS), (2) parent ion
isolation and fragmentation, and (3) isolation and fragmentation of the top three fragment ions from the
MS/MS spectrum, resulting in three MS3 experiments (Fig. 8). The high mass accuracy of the parent ion as
well as the three MS3 experiments can each be used to independently verify the peptide sequence. The
experimental focus will be to develop the methods for targeted and comprehensive MS3 experiments on
tryptic peptides from standard protein mixtures (to verify that the methodology is working) and representative
biofilm samples. These datasets will be used to help develop and test the de novo sequencing algorithms.
Proteins that are characterized as having a number of strain variations spatially and/or temporally can be
targeted for further purification and biochemical testing. By functional testing of protein variants, we can
begin to understand how adaptation occurs in a microbial community.
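The parent-mass check described above for de novo candidates can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the example peptides, and the use of neutral monoisotopic masses without modifications are ours and are not part of the planned pipeline; the 5 ppm tolerance is the figure quoted in the text.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # H2O added to the residue sum for a free, unmodified peptide


def peptide_mass(sequence):
    """Neutral monoisotopic mass of an unmodified peptide (assumption: no PTMs)."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def ppm_error(theoretical, measured):
    """Signed mass error in parts per million."""
    return (measured - theoretical) / theoretical * 1e6


def filter_candidates(candidates, measured_mass, tolerance_ppm=5.0):
    """Keep only candidate sequences whose mass lies within the tolerance."""
    return [
        seq for seq in candidates
        if abs(ppm_error(peptide_mass(seq), measured_mass)) <= tolerance_ppm
    ]


# Hypothetical example: K and Q differ by about 0.036 Da (roughly 39 ppm here),
# so the Q variant is rejected; I and L are isobaric and cannot be separated
# by parent mass alone.
measured = peptide_mass("PEPTIDEK")
kept = filter_candidates(["PEPTIDEK", "PEPTIDEQ", "PEPTLDEK"], measured)
```

The surviving I/L pair shows why a parent-mass filter can only narrow the candidate list; the fragment-ion evidence from MS/MS and MS3 is still needed to verify a sequence.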
Quantification of protein abundances
Qualitative measures of protein abundance such as peptide sequence coverage, mass spectra counts, and/or
relative intensity have provided base level information about which proteins are present in biofilm samples.
However, quantitative MS measurements of protein abundances are needed in order to provide the information
required to obtain biological insights into community function. Recent developments in stable isotope
labeling methodologies, combined with high performance mass spectrometry, have allowed for accurate,
relative quantitation of proteins (Gygi, 1999). In these experiments, a given protein can be compared with its
counterpart from a different growth condition to obtain a relative expression level of up- or down-regulation.
This provides a measurement of relative protein abundance differences between two growth states of
a sample. One of the most accurate approaches for quantification in microbial systems is metabolic labeling,
in which the stable isotope is introduced during the growth process. However, as this is impractical
with the AMD community sample, we will have to rely on chemical isotope labeling after the samples are
collected.
FIG. 8. MS3 experiments for verification of de novo sequencing with high-resolution tandem mass
spectrometry (MS/MS). The basic concept is that an intact peptide full MS spectrum (top left) will be followed
by a data-dependent MS/MS event (top right). This will be followed by three MS3 experiments (bottom) on the
three most abundant fragment ions from the MS/MS experiment. The MS3 data will be used to verify the
predicted sequence from de novo sequencing of the MS/MS data.
Stable isotope labeling approaches will be employed to compare AMD biofilm samples acquired at dif-
ferent locations and/or different times, to investigate protein abundance changes as a function of growth
conditions or age. Because protein abundances are expected to be key markers of community function, this
information will provide a much deeper view of the metabolic activities and changes of this system. While
even the qualitative proteomics experiments will reveal information about those proteins changing dramat-
ically between different samples, the chemical isotope labeling method should be able to measure much
At present, absolute quantification of protein abundance cannot be conducted on entire proteome sam-
ples without the use of peptide standards (Gerber, 2003). Thus, this approach is usually employed in a tar-
geted fashion, in which a few selected proteins are quantified. For this process, synthetic peptides contain-
ing a stable isotope label will be prepared for several previously identified proteolytic peptides obtained
from an AMD sample. By knowing the exact quantity of the peptide standard, and then measuring the rel-
ative ratio of the proteolytic peptide to the synthetic peptide, it should be possible to obtain absolute quan-
tification. Because this approach is labor-intensive and slow, it may be best employed for a few key pro-
teins that are functionally important with regard to metabolic changes in different samples.
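The arithmetic behind this standard-based approach is simple enough to sketch. The function below is a minimal illustration, assuming that peak intensities for the native proteolytic peptide and the co-eluting isotope-labeled synthetic peptide have already been extracted; the function name, the units (fmol), and the example numbers are hypothetical.

```python
def absolute_amount(native_intensity, standard_intensity, standard_amount_fmol):
    """Estimate the amount of a native peptide from a spiked labeled standard.

    The native/standard intensity ratio is multiplied by the known quantity
    of the synthetic standard. Assumes both peptides respond identically in
    the mass spectrometer, which is the premise of isotope-labeled standards.
    """
    if standard_intensity <= 0:
        raise ValueError("standard intensity must be positive")
    ratio = native_intensity / standard_intensity
    return ratio * standard_amount_fmol


# Hypothetical example: the native peptide signal is twice that of a
# 50 fmol spike, giving an estimate of 100 fmol.
estimate = absolute_amount(2.0e6, 1.0e6, 50.0)
```

Because one synthetic peptide is required per target, the cost of the method scales with the number of proteins, which is why it is reserved for a few key proteins.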
Introduction to functional biochemistry
Current knowledge of protein structure and function is limited to only a tiny fraction of proteins in the
biosphere. Nearly half of the proteins inferred from microbial genome sequences are hypothetical and have
not been analyzed. Furthermore, only a very small subset of all organisms has been sampled to date. The
rapid accumulation in sequence databases of genes encoding hypothetical proteins has outpaced any
attempts to characterize them. As of January 2005, the NCBI Entrez collection of databases
(www.ncbi.nlm.nih.gov/gquery/gquery.fcgi) lists roughly 350,000 gene entries that are described by
“hypothetical protein.” Swiss-Prot (http://au.expasy.org/sprot), a database that focuses on functional annotation of proteins
(including domain structure and variants) with a minimal level of redundancy, currently contains over 22,000
hypothetical proteins per se. Some of the predicted proteins are unique in the protein databases, or found
only infrequently, so there are no counterparts for which function has been ascribed. In other cases, part or
all of a protein sequence can be assigned to a family or superfamily of similar proteins, but it is difficult
to distinguish any specific function. Moreover, a large fraction of functional assignments are neither cer-
tain nor experimentally tested, so these too remain unresolved.
Unfortunately, there are no high-throughput methods (as exist for sequencing genomes and analyzing
proteomes) to deal with the legion of proteins encoded by gene sequences but not yet confirmed and assigned
functions. Approaches for functional assignment for novel genes revealed through genomic sequence analy-
sis are desperately needed. However, predicted proteins can be easily cross-correlated by source organisms,
gene sequence and genome context, and protein sequence. In this category, many database predicted pro-
tein entries are conserved between species and genera (orthologous groups) (Tatusov et al., 1997). Where
putative functions are assigned, functions could be tested biochemically if the protein were isolated. Evo-
lutionary selection of structural features within proteins can often be visualized by protein sequence align-
ments that clearly indicate a conservation of amino acid sequences between similar proteins in disparate or-
ganisms. Highly conserved domain structure and arrangement of multiple domains are criteria that can be
used to predict that some hypothetical proteins are not simply sequences discarded during evolution, but
have important functions (Chothia et al., 2003).
The need to characterize conserved hypothetical proteins has been emphasized recently, using criteria
that include wide phyletic spread and reasonable confidence in functional predictions (Galperin and Koonin,
2004). Even with those proteins for which there is no prediction of biochemical activity, some general bi-
ological clues may exist, such as the cellular location of the encoded protein, or the gene expression pro-
file. Transcript information is useful, particularly when these genes can be correlated with predicted operon
structure, an indicator of coordinated expression of several genes needed for a common function. In many
cases, one or more of the genes associated with an operon have been characterized or annotated. Apart from
evolutionary conservation and genome context, the detection of transcripts encoding hypothetical proteins
can indicate a biological function if linked with natural events such as cell growth, cell death or the coop-
erative action of multiple microorganisms (e.g., swarming behavior). NCBI’s Unigene database, which or-
ganizes sequences by unique (nonredundant) gene-oriented clusters of transcripts, currently cites 23,500
transcripts for hypothetical proteins from animals and plants. This transcript database uses data from vari-
ous genome analyses and from the massive amount of sequence information from the experimentally de-
termined expressed sequence tags (dbEST); however, experimental information pertaining to bacterial or
archaeal transcripts is limited.
In the face of this onrush of unknown protein sequences, the majority of studies related to hypotheticals
has continued to focus on individual proteins that can be overexpressed in bacteria or other surrogate
systems. Evidence for this is in the cited literature, where only ~300 journal articles can be correlated with
“hypothetical protein.” In the year 2004 alone, there are ~100 citations using this search query, describing
either structural determination (~40 articles) or biochemical characterization (~50 separate articles).
Although these studies have increased about twofold over the previous year, the detailed understanding of the
immense number of hypothetical proteins rendered by whole genome studies will take many years to ac-
complish on a gene by gene basis. We maintain that to begin interpreting the functional significance of the
majority of hypothetical genes from a single or composite genome, an integrated approach is needed that
can aggressively target and characterize the novel proteins produced by organisms with well described
genome and physiology. This is one of the major goals for our studies. Central to our approach is the removal
from initial consideration of those predicted proteins that are not detected or identified under any growth
condition.
Initial biochemical results guided by proteogenomics
Overall, the single largest category of proteins in the biofilm is the hypothetical proteins, defined here
as those lacking a significant BLAST match (<e−10) to a protein with a functional assignment. Since many
functionally annotated proteins have not been biochemically characterized, this approach is likely to un-
derestimate the number of truly novel proteins. Predicted proteins with no significant similarity to any known
protein are referred to as “unique.” Those with similarity to predicted proteins but no close similarity to
characterized proteins are described as “conserved.” Unique and conserved novel proteins represented 15%
and 2% of the abundant proteins detected in our first AMD biofilm proteomic analysis, respectively. Over-
all, 407 unique and 216 conserved hypothetical proteins were validated by MS analyses (Ram et al., 2005).
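The three-way distinction drawn above (annotated, conserved hypothetical, unique hypothetical) can be written as a small decision rule. The sketch below is illustrative only: the `Hit` record, the function name, and the inclusive comparison against an E-value cutoff of 1e-10 are assumptions made for the example, not a description of the annotation pipeline actually used.

```python
from dataclasses import dataclass
from typing import List

E_VALUE_CUTOFF = 1e-10  # assumed significance threshold for a BLAST match


@dataclass
class Hit:
    """One database match for a predicted protein (hypothetical record)."""
    e_value: float
    has_functional_assignment: bool


def classify(hits: List[Hit]) -> str:
    """Classify a predicted protein from its list of BLAST hits."""
    significant = [h for h in hits if h.e_value <= E_VALUE_CUTOFF]
    if any(h.has_functional_assignment for h in significant):
        return "annotated"
    if significant:
        # Similar only to other predicted proteins of unknown function.
        return "conserved hypothetical"
    # No significant similarity to any known protein.
    return "unique hypothetical"


# Hypothetical examples of each class.
examples = {
    "annotated": classify([Hit(1e-50, True), Hit(1e-20, False)]),
    "conserved": classify([Hit(1e-30, False)]),
    "unique": classify([Hit(0.5, True)]),
}
```

As the text notes, a rule of this kind treats any functionally annotated match as characterized, so it will tend to underestimate the number of truly novel proteins.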
The collection of proteins specifically enriched in the extracellular fraction was dominated by unique
novel proteins (64%), but contained only ~1% conserved novel proteins. The presumably metal- and
acid-tolerant unique proteins are future research targets, as they likely play key roles in adaptation. We detected
all predicted proteins for 11 putative operons composed only of hypothetical genes. For example, one operon
encodes five Leptospirillum-specific proteins and another encodes three Leptospirillum group II-specific
proteins, all of which were detected in the membrane and extracellular fractions.
Of the proteins enriched in the extracellular fraction, the one with the highest MS coverage was initially
identified as a hypothetical protein from Leptospirillum group II and is now known as cytochrome 579. It
is only weakly similar to previously studied c-type cytochromes and to Fe/Pb permeases. This, and the pres-
ence of a heme-binding consensus sequence (CXXCH), suggested a role in electron transport. We verified
that the predicted sequence of this protein matches that of an abundant, reddish-yellow heme-staining pro-
tein identified by SDS-PAGE analysis of the extracellular fraction. The cleavage of a transit sequence from
the N-terminus was evident from N-terminal sequencing, and indicates that the mature protein is exported
across the cytoplasmic membrane. The partitioning of this protein into the extracellular fraction of Lep-
tospirillum group II corroborates its localization in the acid-exposed periplasm. Based on its distribution,
its abundance, and its ability to oxidize iron, we conclude that cytochrome 579 is central to iron oxidation
in this and closely related bacteria (Ram et al., 2005).
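Screening predicted protein sequences for the heme-binding consensus (CXXCH) mentioned above is a simple pattern search. The following sketch assumes one-letter amino acid sequences in upper case; the function name and the test sequence are hypothetical and do not correspond to cytochrome 579.

```python
import re

# C, any two residues, C, H. The lookahead lets overlapping motifs be reported.
HEME_MOTIF = re.compile(r"(?=(C[A-Z]{2}CH))")


def find_heme_motifs(sequence):
    """Return (0-based position, matched text) for every CXXCH occurrence."""
    return [(m.start(1), m.group(1)) for m in HEME_MOTIF.finditer(sequence)]


# Hypothetical sequence containing a single motif starting at index 3.
motifs = find_heme_motifs("MKACAACHGG")
```

A motif hit is only suggestive; as in the case described above, heme staining and biochemical assays are needed to confirm a role in electron transport.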
Eight other membrane and periplasmic c-type cytochromes, and components of NADH dehydrogenase,
succinate dehydrogenase, and the cytochrome bc complex were also detected. In addition, three hypothet-
ical proteins with heme-binding motifs were detected (Ram et al., 2005). Using this information, we de-
veloped a working model for the iron oxidation pathway in Leptospirillum group II in which cyt579 is the
first step. Elucidating the roles of the other c-type cytochromes in this electron transport chain is an
important objective for further study, as iron oxidation is central to energy generation in the AMD ecosystem.
Targeting novel proteins for functional analysis
We predict that the AMD community genomic dataset will contain a maximum of ~24,000 genes. Based
on our genomic data, we anticipate that ~7,440 genes will encode hypothetical proteins and ~2,880 will encode
conserved hypothetical proteins. Our primary objective is to define functions for orthologs, that is, genes
with sequences that are sufficiently closely related to imply that they share a similar function. From
sequence comparisons we extrapolate that ~60% of these are unique (i.e., ~6,200 orthologous groups). Based
on preliminary data, ~15% of these are likely to be abundant in the proteome. Thus, we anticipate that
there will be ~930 protein targets for functional analysis.
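The estimate above can be reproduced step by step. The sketch below assumes, as the quoted figures suggest, that the 60% uniqueness factor is applied to the combined count of hypothetical and conserved hypothetical genes; that reading and the function name are ours.

```python
def estimate_targets(hypothetical, conserved_hypothetical,
                     unique_fraction, abundant_fraction):
    """Work through the target-count estimate one factor at a time."""
    novel_genes = hypothetical + conserved_hypothetical
    orthologous_groups = novel_genes * unique_fraction
    targets = orthologous_groups * abundant_fraction
    return {
        "novel_genes": novel_genes,
        "orthologous_groups": orthologous_groups,
        "targets": targets,
    }


# Figures quoted in the text: 7,440 + 2,880 genes, 60% unique, 15% abundant.
result = estimate_targets(7_440, 2_880, 0.60, 0.15)
```

With these inputs the intermediate count is 10,320 novel genes, giving about 6,192 orthologous groups and about 929 targets, consistent with the rounded values of 6,200 and 930 given in the text.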
Genes expressing unique and conserved hypothetical proteins and identified by proteomics are mapped
onto genome data. Clues to the function of some of these can be inferred based on operon structure and
surrounding genes. The potential value of gene context emphasizes one of the motivations for reconstruc-
tion of genomes that are as complete as possible for each population. For example, in the soluble fraction
we detected the products of two conserved hypothetical genes that are membrane-associated and that oc-
cur in an operon with four proteasome subunits. In another example, two unique hypothetical proteins oc-
cur in a four-gene operon with two putative nitrogen regulatory proteins, suggesting roles in nitrogen me-
tabolism. Along the same lines, four detected unique novel proteins, possibly involved in motility, are
encoded in an operon of fifteen genes that includes at least eight flagellar genes. Observations such as these
provide starting points for selection of biochemical assays to determine precise functions for important novel
proteins.
As mentioned above, novel proteins are concentrated in the extracellular and membrane fractions, sug-
gesting functions in environmental adaptation. Characteristics of the extracellular proteins may be deduced
from genome sequences by predicting the isoelectric points, which for the total collection of proteins
identified in the proteome exhibit a bimodal distribution (Ram et al., 2005). The Leptospirillum group II
proteome resembles that of the community, with clusters of isoelectric points around 6–6.9 and 9–9.9. In
contrast, the distribution of isoelectric points for the 35 proteins concentrated in the extracellular fraction
has a single maximum in the range of 9–10.9. As this includes many hypothetical proteins and several cy-
tochromes, isoelectric point-based separation methods have been useful in purification of some of the acid
Recently developed computational aids have provided additional tools for understanding protein structure
and function, including those for signal/transit sequences (Bendtsen et al., 2004), transmembrane helices
(Kall et al., 2004), protein–protein interactions, phylogenetic relationships, and metabolic pathway maps (e.g.,
KEGG; www.genome.jp/kegg). New algorithms for rapid protein sequence comparisons, such as MUSCLE
(Edgar, 2004; www.drive5.com/muscle/), are available for detecting distant structural homology from pri-
mary sequence. Those programs that extract information from multiple alignments and use this to perform
iterative database searching have proven especially helpful. Approaches such as the hidden Markov model
(Neuwald and Liu, 2004; Soding, 2004), PSI-BLAST (Altschul et al., 1997; www.ncbi.nlm.nih.gov/BLAST),
fold recognition (Fischer and Eisenberg, 1995; Skolnick et al., 2004), and other programs have greatly en-
hanced the ability to detect similarities between evolutionarily distant proteins that are nevertheless closely
related in structure (Aravind and Koonin, 1999). Active sites of proteins can be detected in some cases us-
ing an appropriate computational approach (Fetrow and Skolnick, 1998). Orthologous groups of proteins
with unknown function can be linked to functionally associated partner proteins; in many cases, those part-
ners are uncharacterized themselves (giving clues to newly identified modules), but in others, the assign-
ment of pathways, cellular processes, or physical complexes for orthologous groups can encompass thou-
sands of previously uncharacterized proteins (Doerks et al., 2004).
Fully understanding a protein’s function requires knowledge of its three-dimensional structure. Even
though there has been a recent, sharp increase in the experimental determination of protein structures, meth-
ods involved in this process are inherently time consuming and costly. Current techniques and resources
limit the experimental determination of structures to several hundred unique proteins each year. Bridging
the several log gap between the rate of sequencing and the rate of structure determination remains a sig-
nificant problem. In many cases, however, comparative modeling techniques can provide useful structural
models. It has been estimated that of ~2,300,000 protein sequences, about half are related to the already
known protein structures (Chothia et al., 2003). Thus, considering there are approximately 26,000 known
protein structures in the Protein Data Bank (PDB), the number of proteins for which it is possible to ob-
tain relatively accurate models could be extended by more than an order of magnitude. Furthermore, it has
been estimated there are approximately 1,100 different protein families (Chothia et al., 2003), a third of
which have already been structurally characterized. Identification of the fold family can assist in the search
for biochemical function (Anantharaman et al., 2003). Assuming the current growth rate in the number of
experimentally determined structures, the structure of at least a representative member of each fold family
could be determined within a several year period (Holm and Sander, 1996), making the application of com-
parative modeling an even more viable option.
Inferring and integrating metabolic pathway information from the predicted proteome of an organism can
offer additional clues and constraints on the function of hypothetical proteins. Metabolic pathway analysis
has become particularly feasible with the emergence of several well-curated databases, such as KEGG,
BRENDA, and WIT, which connect experimental data on enzymes that carry out metabolic reactions with
their sequences. The information in these databases facilitates the construction of a hierarchy of reaction
pathways for a given organism from known genes encoded in its genome. Metabolic pathway reconstruc-
tion can thereby be useful in identifying “missing” steps known or inferred to be essential to an organism,
providing candidate functions for hypothetical proteins. Furthermore, genes encoding subsequent steps in
metabolic pathways are occasionally encoded in tandem bicistrons or in multigene operons, so information
about the context of a hypothetical gene can be useful in elucidating protein function if at least one of the
genes in the operon has been characterized. This approach has been useful elsewhere in elucidating
hypothetical gene function in a number of essential biosynthetic pathways, such as those for acetyl-CoA, riboflavin,
and isoprenoids (Osterman and Overbeek, 2003), as well as in understanding pathway differences that arise
during organismal divergence (Baliga et al., 2004).
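The two ideas in this paragraph, finding "missing" pathway steps and using operon context to nominate hypothetical genes for them, can be sketched as follows. The data layout (operons as lists of gene and function pairs, with `None` marking a hypothetical gene), the function names, and the placeholder step names are all assumptions made for illustration; no real pathway database format is implied.

```python
def missing_steps(pathway_steps, annotated_functions):
    """Pathway steps with no annotated gene, in pathway order."""
    return [step for step in pathway_steps if step not in annotated_functions]


def candidate_genes(operons, pathway_steps):
    """Hypothetical genes that share an operon with at least one pathway gene."""
    steps = set(pathway_steps)
    candidates = []
    for operon in operons:
        if any(function in steps for _, function in operon):
            candidates.extend(gene for gene, function in operon if function is None)
    return candidates


# Placeholder data: a four-step pathway and two operons.
pathway = ["step_A", "step_B", "step_C", "step_D"]
operons = [
    [("gene1", "step_A"), ("gene2", None), ("gene3", "step_C")],
    [("gene4", None), ("gene5", "unrelated")],
]
annotated = {function for operon in operons
             for _, function in operon if function is not None}

gaps = missing_steps(pathway, annotated)
candidates = candidate_genes(operons, pathway)
```

Here `gene2` becomes a candidate for one of the unfilled steps because it sits between two pathway genes, whereas `gene4` does not, since its operon contains no pathway gene. Such candidates are hypotheses for biochemical testing, not assignments.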
Recovery of natural proteins and protein complexes for biochemical analysis
Given that most biofilm organisms have not been cultivated and in view of our goal to study function in
situ, our approach to biochemical function relies on the development of strategies for sample processing
and fractionation. This will provide novel proteins and protein complexes for use in biochemical charac-
terization studies (Fig. 9). Our intention is to confirm and establish the functions for a significant fraction
of the abundant hypothetical proteins via analysis of protein complexes recovered directly from
environmental samples.
In many cases, the localization and arrangement of proteins within complexes containing hypothetical
proteins are keys to discovering function. It is clear from our studies that secreted proteins and those
associated with membranes and cell walls are tolerant to the unusually low pH and high metal concentrations
of the AMD environment. Using protein biochemistry suited for acidophilic proteins, we are developing
techniques to rapidly isolate high molecular mass, multiprotein complexes that are most exposed to the ex-
tracellular environment in the biofilm community, and determine the exact composition, interactions and
biochemical function of such complexes. For fine discrimination of subcellular location, we will use den-
sity gradient centrifugation techniques to separate archaeal membranes from bacterial membranes, and also
the inner and outer bacterial membranes, into fractions that can be interrogated for protein complexes and
super-assemblies of proteins. Cell wall and membrane embedded complexes will be isolated using meth-
ods to fractionate exopolysaccharides and their associated proteins. Surface exposed proteins in these var-
ious fractions will be labeled to identify the outer peptide loops (and associated complexes) using proteol-
ysis and mass spectrometry (MS) characterization. Levels of novel proteins identified here will be correlated
with microbial gene expression information to examine possible protein complex formation and pathway
Mapping proteins to subcellular locations and further separating them into subsets of either multiprotein
complexes or proteins with similar physical or biochemical characteristics facilitates interpretation of pro-
teomic data (Strader et al., 2004). Of special significance in microbiology are those proteins located on the
cell surface, because these are in direct contact with the environment, and are perhaps most likely to facil-
itate the unique specialization of an organism for its ecological niche.
Screening for biochemical function
Given the large number of novel proteins with unassigned function, there is a pressing need for well de-
signed, high throughput or multiplexed tests for function. These should employ conventional enzymatic,
ligand binding, protein interaction or other assays, as well as novel methods for determining biochemical
activity. The ability to use spectrophotometric and fluorometric instruments with microtiter plates
containing 96, 384, or higher numbers of samples will enable screening of several different biochemical functions
in hundreds of chromatographic fractions in a single, rapid assay.
Theoretical predictions of function can be assessed using enriched or purified proteins and the appropri-
ate biochemical assays. Once a protein function is inferred, testable hypotheses will be established. This is
simply illustrated in Figure 10, in which several hypothetical proteins detected from a biofilm extract are
nested within genes required for biotin metabolism. Tests for biotin affinity capture of proteins that are in
this pathway can be followed by identification of these proteins using either MS analyses or N-terminal
sequencing.
There are also anticipated functions based on biological observation. Since these biofilms are
self-sustaining, we anticipate proteins involved in carbon fixation and nitrogen fixation and other aspects of carbon
cycling, including biosynthesis and biodegradation of stable extracellular organic compounds. Extracellu-
lar polysaccharides such as cellulose are likely crucial to the establishment and function of microbial
biofilms. Therefore, assays for the incorporation of glucose from the activated substrate, UDP-glucose, into
insoluble polymer (cellulose or other glucans) are straightforward (Hill et al., 2001) and can be used with
the appropriate protein fractions. In pathways such as carbon fixation that are well characterized in other
biological systems, some expected genes are missing from the biofilm community genome data. These could
be identified in the course of this project. Additionally, we anticipate novel acid-tolerant proteins,
including extracellular hydrolytic enzymes such as cellulases and peptidases that may have functionality that can
be harnessed for technological and energy applications. Such proteins can be incubated with the
appropriate substrates and the relevant hydrolytic products identified.
FIG. 9. Scheme for biofilm protein fractionation. Laboratory experiments begin with a biofilm sample that
was frozen on dry ice at the field site; all steps shown can be carried out in 1 day. A typical sample is
processed from ~8 mL of biofilm, extracted with 0.2 M H2SO4, and extracellular proteins precipitated (with
either trichloroacetic acid or ammonium sulfate). Membranes were isolated in either acidic or neutral
solutions to test for acid-stable membrane proteins. Flowchart labels: biofilm suspension (pH 1); crude
lysate (B); cells & “debris.”
FUNCTIONAL ANALYSES OF NATURAL MICROBIAL COMMUNITIES
Organisms in pure culture are different from those in consortia
An important question motivating our research relates to how an individual growing as part of a com-
munity differs from an individual grown in a pure culture. We anticipate significant differences. We plan
to test this through the conduct of proteomic studies on pure cultures of organisms isolated from the Rich-
BANFIELD ET AL.
pleiotropic regulatory protein
sensory transduction histidine kinase
two-component response regulator CheY
sensory box/EAL domain/GGDEF domain
thiamine phosphate synthase ThiE
*acetyl-CoA carboxylase/biotin carrier
translation elongation factor
biofilm fraction are encoded within operons in the Leptospirillum Group II genome (left panel). *Proteins that can be
mapped onto the biotin metabolic pathway (right panel). Both unique and conserved hypothetical genes are grouped
together with the biotin genes.
Proteins identified in a biofilm proteome relevant to biotin metabolism. A group of proteins detected in a
mond Mine and on the same organisms growing in multi-species consortia. We will compare protein pro-
duction in laboratory cultures with protein production in situ, as observed through multiple growth stages
and under different geochemical conditions. As the result of prior work, Leptospirillum group II, group III,
and Ferroplasma acidarmanus fer1 are available for culture-based work that will include one, two, or three
organism types. Research will make use of metabolic insights from several recent studies of Ferroplasma
and related species (Okibe and Johnson, 2004; Dobson et al., 2004; Baumler et al., 2005; Macalady et al.,
2004) and Leptospirillum (Tyson et al., 2005). For example, Leptospirillum group III growing in media de-
void of fixed nitrogen will be supplemented by addition of Leptospirillum group II and then fer1 in order
to mimic community assembly. We will monitor protein profiles during active biofilm growth and also
when a stable community is achieved (protein abundances in relatively steady state). Such analyses require
both field observations and laboratory-based measurements, including estimates of cell numbers and total
protein concentrations. It also will be important to identify molecules that are induced specifically in mixed
species biofilms but not in monocultures. A subset of these may be proteins involved in communication
and competition between organisms.
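The comparison described above can be framed as a set operation on lists of detected proteins. The following is a minimal sketch in Python, assuming that each sample has already been reduced to a set of protein identifiers; the function name and the identifiers are hypothetical and are used only for illustration.

```python
def consortium_specific(monoculture_proteins, consortium_proteins):
    """Return proteins detected in the mixed-species sample but in no monoculture.

    monoculture_proteins: dict mapping a culture name to a set of protein IDs
    consortium_proteins:  set of protein IDs detected in the mixed biofilm
    """
    # Pool every protein seen in any of the pure cultures.
    detected_in_any_monoculture = set().union(*monoculture_proteins.values())
    # Keep only proteins that appear exclusively in the consortium.
    return set(consortium_proteins) - detected_in_any_monoculture


# Hypothetical identifiers, for illustration only.
monocultures = {
    "Leptospirillum group II": {"protA", "protB"},
    "Ferroplasma fer1": {"protC"},
}
mixed_biofilm = {"protA", "protC", "protD"}
candidates = consortium_specific(monocultures, mixed_biofilm)
```

Absence from a monoculture proteome may reflect detection limits rather than lack of expression, so candidates flagged in this way would still need confirmation with quantitative data.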
Proteomic-based analyses of cultures will help to identify those hypothetical proteins that are produced
in significant concentrations by members of the AMD community and to determine when they are pro-
duced. Many of the highly expressed lineage-specific genes may have arisen in response to particular chal-
lenges of the geochemical environment. This hypothesis is supported by our preliminary data, in which the
unique, novel proteins from one biofilm in its early growth stage are in higher proportion in the extracellular
and membrane fractions than in the cytoplasmic fraction (Ram et al., 2005). Other novel proteins may be
involved in signaling, competition between organisms, or in phage defense.
How do communities assemble and function?
We anticipate that the first colonist in the AMD system will be Leptospirillum group III because this organism
type is most capable of synthesizing all that it needs from air, water, and dissolving minerals. Especially
important is its ability to fix nitrogen. We predict that Leptospirillum group II will be recruited as
the biofilms grow, and may supply as yet unknown complementary functions. Archaea may rapidly join
the assemblage, attracted by waste organic carbon (Norris and Kelly, 1980) that probably cannot be me-
tabolized by Leptospirillum species (and which is probably toxic to them). Leptospirillum group III may
have the profile of a keystone species, with time becoming less successful (abundant) compared to organ-
isms that do not need to carry out the energy-demanding task of nitrogen fixation. It is likely that functions
will become increasingly strongly partitioned amongst organism types as the community becomes estab-
lished. It is possible to evaluate hypotheses such as these in the AMD biofilm system.
The first proteomic dataset indicates high expression of the key thiamine biosynthetic proteins in fer1-like
organisms (only the most abundant proteins were detected from this organism due to its low
abundance in the biofilm studied). The ortholog for one of these key proteins was not detected in the Lep-
tospirillum group II protein complement (despite detection of 48% of the predicted proteins for this or-
ganism). Thus, we suspect that Ferroplasma could be an important source of thiamine to the community.
We also suspect that, early in biofilm formation, Leptospirillum group III is the predominant producer of
extracellular polymers. However, we expect that the polymer type (based on appearance) will change as
the biofilm matures, with biosynthesis falling to other community members. Proteogenomic data, in com-
bination with information about the makeup of communities, can be used to test these predictions, and to
search for evidence of other unanticipated interdependencies. Such analyses rely upon determination of the
levels of metabolic activity over time (i.e., quantitative protein abundance data), starting with the very ear-
liest, extremely thin biofilms that develop on the surface of AMD pools.
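As a rough illustration of how abundance could be followed across growth stages, the sketch below converts per-protein counts into fractions of each sample total and then tracks one protein through a time-ordered series. It assumes that abundance is approximated by a simple count per protein (for example, the number of spectra assigned to it); that measure, the function names, and the sample data are assumptions made for this example and are not taken from the study.

```python
def relative_abundance(counts):
    """Convert raw per-protein counts for one sample into fractions of the total."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("sample contains no counts")
    return {protein: count / total for protein, count in counts.items()}


def abundance_over_time(samples, protein):
    """Track one protein across samples.

    samples: list of (label, counts) tuples ordered by biofilm age
    Returns a list of (label, fraction); proteins not detected get 0.0.
    """
    return [
        (label, relative_abundance(counts).get(protein, 0.0))
        for label, counts in samples
    ]


# Hypothetical counts for two growth stages.
series = [
    ("early", {"p1": 3, "p2": 1}),
    ("late", {"p1": 1, "p2": 3}),
]
trend = abundance_over_time(series, "p1")
```

Normalizing within each sample makes growth stages with different total protein yields comparable, although it cannot by itself distinguish a change in expression from a change in the proportions of community members.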
Survival and adaptation: is strain heterogeneity important?
Based on our initial results, we anticipate that organisms invest considerable resources in dealing with
the challenges inherent in their habitats. In the case of the AMD system, these are specifically the high pro-
ton gradient across membranes, the high abundance of reactive oxygen and hydroxyl (radical) species, the
relatively inefficient energy source, and the very high concentrations of toxic metals. This picture is currently
incomplete because such a large fraction of the genes and detected proteins have no assigned function.
We anticipate that new information obtained via functional analysis of organism- and lineage-specific
novel proteins will expand this picture. We plan to quantify the diversity and abundance of proteins carrying
out all functions using information for proteins with confidently assigned functions, augmenting this
with information about the functions of novel proteins as it becomes available.
At a higher level of resolution, we anticipate that community structure will change significantly over
space and time. We hypothesize that fluctuations in environmental conditions (solution chemistry, temper-
ature, other organisms and phage) lead to selective pressures that maintain strain-level diversity, ensuring
the ability of the community to adapt to perturbation. We will determine whether certain gene variants be-
come prominent as geochemical conditions change and as the biofilm grows. This analysis will also utilize
information about abundances of protein variants that will be made available by strain-resolved gene se-
quence datasets and de novo sequencing. Results of this research should reveal the level of gene sequence
variation that is discriminated by selection. We predict that strains (and not species) are the relevant eco-
logical level at which understanding of these communities is required.
Where there are coexisting protein variants (strain level), we anticipate that patterns of abundance of
these protein types will provide clues to pathways involved in adaptation to the prevailing conditions. This
analysis will rely upon strain-resolved protein abundance patterns. We anticipate that gene variants with
characteristics indicating that they are under positive selection will be those involved in environmental
adaptation, and that a subset of these will be highly expressed under predictable geochemical conditions (e.g.,
high temperature, high ionic strength). Selection can be evaluated via calculation of ratios of synonymous
to non-synonymous substitutions at the gene and sub-gene level. Such analyses can be conducted at close
to the whole genome scale.
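To make the substitution comparison concrete, the sketch below classifies codon differences between two aligned coding sequences as synonymous or non-synonymous. It is a simplified illustration, not a full selection analysis: it assumes the sequences are already aligned in frame, uses the standard codon-to-amino-acid mapping, and counts differing codons rather than normalizing by the number of synonymous and non-synonymous sites as established estimators do. The function name and example sequences are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_substitutions(seq_a, seq_b):
    """Count codons that differ synonymously and non-synonymously.

    Both sequences must be in-frame, of equal length, and a multiple of 3 long.
    Codons containing gaps or ambiguous bases are skipped.
    Returns (synonymous, non_synonymous).
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned, in frame, and equal in length")
    synonymous = non_synonymous = 0
    for i in range(0, len(seq_a), 3):
        codon_a = seq_a[i:i + 3].upper()
        codon_b = seq_b[i:i + 3].upper()
        if codon_a == codon_b:
            continue
        if codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            continue  # gap or ambiguous base: no call made
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1
        else:
            non_synonymous += 1
    return synonymous, non_synonymous


# Hypothetical variants: GCT->GCC keeps Ala, AAA->GAA changes Lys to Glu.
result = count_substitutions("ATGGCTAAA", "ATGGCCGAA")
```

Applying the same function to sub-gene windows instead of whole genes would give the sub-gene view mentioned above, at the cost of noisier counts for short windows.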
A special case of genome heterogeneity involves blocks of genes that encode hypothetical proteins of
possible phage origin (often with unusual GC content, codon usage, and associated with integrases). Pre-
liminary observations suggest that the complement of novel genes (particularly blocks of novel genes) can
vary significantly between individuals within the strain populations. An important question relates to the
extent to which these genes are expressed, and their likely roles. We will assess the form and distribution
of blocks of hypothetical proteins within populations and determine the fraction of these that are expressed.
Special attention will be given to those for which there are indications that the genes were introduced to
genomes after speciation, probably via lateral transfer.
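One simple screen for gene blocks with unusual GC content is a sliding-window comparison against the genome-wide average. The sketch below is a minimal example under stated assumptions: the window size, step, and deviation threshold are arbitrary illustrative values, the function names are hypothetical, and a real analysis would combine this with codon usage and the presence of integrases, as noted above.

```python
def gc_fraction(seq):
    """Fraction of G and C among unambiguous bases (0.0 if there are none)."""
    seq = seq.upper()
    unambiguous = sum(seq.count(base) for base in "ACGT")
    if unambiguous == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / unambiguous


def atypical_gc_windows(genome, window=1000, step=500, threshold=0.10):
    """Return (start, gc_fraction) for windows deviating from the genome average.

    A window is flagged when its GC fraction differs from the whole-sequence
    GC fraction by more than `threshold`.
    """
    background = gc_fraction(genome)
    flagged = []
    last_start = max(len(genome) - window, 0)
    for start in range(0, last_start + 1, step):
        fraction = gc_fraction(genome[start:start + window])
        if abs(fraction - background) > threshold:
            flagged.append((start, fraction))
    return flagged


# Toy sequence: an AT-rich background with one GC-rich insert at position 100.
toy_genome = "AT" * 50 + "GC" * 10 + "AT" * 50
hits = atypical_gc_windows(toy_genome, window=20, step=20, threshold=0.5)
```

Flagged windows are only candidates; a short window can deviate by chance, so the window size should be chosen relative to the length of the gene blocks of interest.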
It is important to study communities of uncultivated organisms in the context of their natural surround-
ings because, in nature, organisms exist in assemblages of populations (not as clonal types) that are selected
for by environmental and biological factors. Genomic, proteomic, and biochemical approaches can be in-
tegrated to expand the understanding of natural genetic diversity and to place metabolic capabilities into
organismal and environmental context.
Amongst the many positive attributes that make the AMD system ideal for development of this approach
is the abundant biomass available at the field site. Up to 100 g of biofilm, containing cells at high density,
can be obtained multiple times each year. This enables investigation of function during different growth
stages at a single location, and comparisons amongst sites within the same system. Under different geochemical
conditions the biofilms adopt different architectures, including significant changes in proportions
of the individual microorganisms. The high density of cells in these samples means that protein separation
into extracellular, membrane and cytoplasmic fractions can be achieved with high enough yield for protein
purification and characterization. Moreover, with more comprehensive community genomics datasets, pro-
teins from strain variants can be identified after detection by MS analyses. Our initial results have con-
firmed that this is feasible, and that many important novel proteins can be confidently identified.
Knowledge about how communities that underpin AMD formation function is key to development of strate-
gies for remediation of acidic, metal-contaminated systems, understanding of biogeochemical cycles of sulfur
and metals, and new strategies for energy-efficient resource processing and cleaner energy production. Because
our system of choice is a chemoautotrophically-based subsurface microbial ecosystem, our work is also
appropriate for development of ecological models for carbon sequestration and carbon cycling.
Our analyses to date suggest that strain populations are under active biological and geochemical selec-
tion. Thus, strains rather than species appear to be the relevant biological units in the ecosystem. In the
AMD system it may be possible to link genotype and phenotype of strains to learn how subtle changes at
the DNA level result in functional modifications to yield adaptive advantage. We will identify and assign
function to as much as possible of the protein complement of natural microbial communities to enable pre-
diction of phenotypic characteristics and comprehensive elucidation of the metabolic network, largely at
the strain level.
An important goal of our work is the development of methods for environmental genomic analysis. We
feel that methods development is best accomplished through work on very simple natural communities because
the extent of characterization of such systems possible now can serve as a proxy for the characterization
that could be achieved for communities with much higher species richness in 5 years, due to growth in
sequencing and protein characterization capacity. New bioinformatics methods for comprehensive community
genomic studies, methods for high-throughput proteomics and protein strain variant detection, and rapid
functional analysis of proteins and complexes should be generally applicable to more complex communities
of all types. Through such studies we hope to learn about the ecological functions of uncultured microorganisms,
how communities assemble and respond to environmental change, and the genetic basis of
functional stability and adaptation.
For more information about this project, please see the following:
• Introduction to AMD research: http://seismo.berkeley.edu/~jill/amd/AMDresearch.html
• AMD proteome website analysis page: http://compbio.ornl.gov/biofilm_amd
• Genomics: GTL project website: http://quicksilver.espm.berkeley.edu
ACKNOWLEDGMENTS
The development of the research ideas herein benefited from the input of members of our research team.
We especially acknowledge the contributions of Eric Allen, Brett Baker, Chris Belnap, Kevin Chen, John
Eppley, Christopher Jeans, Ian Lo, D. Kirk Nordstrom, Lior Pachter, Jason Raymond, Rachna Ram, Manesh
Shah, Gene Tyson, and Rachel Whitaker. Funding support for preliminary work on which this project is
based derived from the Department of Energy, National Science Foundation, NASA NAI program, and the
LDRD programs of Lawrence Livermore and Oak Ridge National Laboratories.
REFERENCES
ABDELOUAS, A., LUTZ, W., and NUTTALL, H.E. (1999). Uranium contamination in the subsurface: characteriza-
tion and remediation. In Reviews in Mineralogy. Volume 38. Uranium: Mineralogy, Geochemistry and the Environ-
ment. P.C. Burns and R. Finch, eds. (Mineralogical Society of America, Chantilly, VA), pp. 433–474.
ALLEN, E.E., TYSON, G.W., DETTER, J.C., et al. (2005). Recent evolutionary modes deduced by isolate vs. envi-
ronmental strain population comparative genomics. Proc Natl Acad Sci USA (in review).
ALPERS, C.N., NORDSTROM, D.K., and THOMPSON, J.M. (1994). Seasonal variations in copper and zinc con-
centrations from Iron Mountain Mine. In Environmental Geochemistry of Sulfide Oxidation, C.N. Alpers and D.W.
Blowes, eds. (Am. Chem. Soc.) pp. 324–344. Washington, DC.
ALPERS, C.N., NORDSTROM, D.K., VEROSUB, K.L., et al. (1999). Paleomagnetic reversal in Iron Mountain gos-
san provides limits on long-term, pre-mining metal flux rates. Presented at the Geol. Soc. Am. Cordilleran Section
ALPERS, C.N., NORDSTROM, D.K., and SPITZLEY, J. (2003). Extreme acid mine drainage from a pyritic massive
sulfide deposit: the Iron Mountain end-member. In Environmental Aspects of Mine Wastes (Mineralogical Associa-
tion of Canada), J.L. Jambor, D.W. Blowes, and A.I.M. Ritchie, eds. pp. 407–430. Ottawa, Ontario.
ALTSCHUL, S.F., MADDEN, T.L., SCHAFFER, A.A., et al. (1997). Gapped BLAST and PSI-BLAST: a new gen-
eration of protein database search programs. Nucleic Acids Res 25, 3389–3402.
ANANTHARAMAN, V., ARAVIND, L., and KOONIN, E.V. (2003). Emergence of diverse biochemical activities in
evolutionarily conserved structural scaffolds of proteins. Curr Opin Chem Biol 7, 12–20.
ARAVIND, L., and KOONIN, E.V. (1999). Gleaning non-trivial structural, functional and evolutionary information
about proteins by iterative database searches. J Mol Biol 287, 1023–1040.
BAESEMAN, J.L. (2004). Denitrification in acid-impacted mountain stream sediments [Doctoral dissertation]. Boul-
der: University of Colorado.
BAIN, J.G., MAYER, K.U., BLOWES, D.W., et al. (2001). Modeling the closure-related geochemical evolution of
groundwater at a former uranium mine. J Contaminant Hydrol 52, 109–135.
BAKER, B.J., and BANFIELD, J.F. (2003). Microbial communities associated with acid mine drainage. FEMS Microbiol
Ecol 44, 139–152.
BAKER, B.J., MOSER, D.P., MACGREGOR, B.J., et al. (2003). Related assemblages of sulphate-reducing bacteria
associated with ultradeep gold mines of South Africa and deep basalt aquifers of Washington State. Environ Micro-
biol 5, 267–277.
BALIGA, N.S., BONNEAU, R., FACCIOTTI, M.T., et al. (2004). Genome sequence of Haloarcula marismortui: a
halophilic archaeon from the Dead Sea. Genome Res 14, 2221–2234.
BAUMLER, D.J., JEONG, K.-C., FOX, B.G., et al. (2005). Sulfate requirement for heterotrophic growth of “Ferroplasma
acidarmanus” strain fer1. Res Microbiol 156, 492–498.
BEJA, O., ARAVIND, L., KOONIN, E.V., et al. (2000). Bacterial rhodopsin: evidence for a new type of phototrophy
in the sea. Science 289, 1902–1906.
BENDTSEN, J.D., NIELSEN, H., VON HEIJNE, G., et al. (2004). Improved prediction of signal peptides: SignalP
3.0. J Mol Biol 340, 783–795.
BOENIGK, J., STADLER, P., WIEDLROITHER, A., et al. (2004). Strain-specific differences in the grazing sensitiv-
ities of closely related ultramicrobacteria affiliated with the polynucleobacter cluster. Appl Environ Microbiol 70,
BOND, P.L., and BANFIELD, J.F. (2001). Design and performance of rRNA targeted oligonucleotide probes for in
situ detection and phylogenetic identification of microorganisms inhabiting acid mine drainage environments. Mi-
crobial Ecol 41, 149–161.
BOND, P.L., DRUSCHEL, G.K., and BANFIELD, J.F. (2000a). Comparison of acid mine drainage microbial com-
munities in physically and geochemically distinct ecosystems. Appl Environ Microbiol 66, 4962–4971.
BOND, P.L., SMRIGA, S.P., and BANFIELD, J.F. (2000b). Phylogeny of microorganisms populating a thick, sub-
aerial, lithotrophic biofilm at an extreme acid mine drainage site. Appl Environ Microbiol 66, 3842–3849.
BOOGERD, F.C., VANDENBEEMD, C., STOELWINDER, T., et al. (1991). Relative contributions of biological and
chemical-reactions to the overall rate of pyrite oxidation at temperatures between 30°C and 70°C. Biotechnol Bio-
eng 38, 109–115.
BROFFT, J.E., MCARTHUR, J.V., and SHIMKETS, L.J. (2002). Recovery of novel bacterial diversity from a forested
wetland impacted by reject coal. Environ Microbiol 4, 764–769.
CASTILHOS, Z.C. (2003). Evaluation of human health risks associated to coal mining in Brazil. J Phys 107, 271–274.
CHOTHIA, C., GOUGH, J., VOGEL, C., et al. (2003). Evolution of the protein repertoire. Science 300, 1701–1703.
CORAM, N.J., and RAWLINGS, D.E. (2002). Molecular relationship between two groups of the genus Leptospirillum
and the finding that Leptospirillum ferriphilum sp. nov. dominates South African commercial biooxidation tanks that
operate at 40°C. Appl Environ Microbiol 68, 838–845.
DEMCHAK, J., SKOUSEN, J., and MCDONALD, L.M. (2004). Longevity of acid discharges from underground mines
located above the regional water table. J Environ Quality 33, 656–668.
DENNIS, P., EDWARDS, E.A., LISS, S.N., et al. (2003). Monitoring gene expression in mixed microbial communi-
ties by using DNA microarrays. Appl Environ Microbiol 69, 769–778.
DOERKS, T., VON MERING, C., and BORK, P. (2004). Functional clues for hypothetical proteins based on genomic
context analysis in prokaryotes. Nucleic Acids Res 32, 6321–6326.
DRUSCHEL, G.K., BAKER, B.J., GIHRING, T.H., et al. (2004). Acid mine drainage biogeochemistry at Iron Moun-
tain, California. Geochem Trans 5, 13–32.
EDGAR, R.C. (2004). MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids
Res 32, 1792–1797.
EDWARDS, K.J., SCHRENK, M.O., HAMERS, R.J., et al. (1998). Microbial oxidation of pyrite: experiments using
microorganisms from an extreme acidic environment. Am Miner 83, 1444–1453.
EDWARDS, K.J., GOEBEL, B., RODGERS, T. M., et al. (1999a). Environmental conditions, microbial populations,
and the role of attachment in oxidative dissolution of pyrite at Iron Mountain. Geomicrobiol J 16, 155–179.
EDWARDS, K.J., GIHRING, T.M., and BANFIELD, J.F. (1999b). Seasonal variations in microbial populations and
environmental conditions in an extreme acid mine environment. Appl Environ Microbiol 65, 3627–3632.
EDWARDS, K.J., BOND, P.L., GIHRING, T.M., et al. (2000a). An archaeal iron-oxidizing extreme acidophile im-
portant in acid mine drainage. Science 287, 1796–1799.
EDWARDS, K.J., BOND, P.L., and BANFIELD, J.F. (2000b). Characteristics of attachment and growth of Thiobacil-
lus caldus on sulphide minerals: a chemotactic response to sulphur minerals? Environ Microbiol 2, 324–332.
EDWARDS, K.J., BOND, P.L., DRUSCHEL, G.K., et al. (2000c). Geochemical and biological aspects of sulfide min-
eral dissolution: lessons from Iron Mountain, California. Chem Geol 169, 383–397.
EDWARDS, K.J., HU, B., HAMERS, R.J., et al. (2001). A new look at microbial leaching patterns on sulfide minerals.
FEMS Microbiol Ecol 34, 197–206.
ENG, J.K., MCCORMACK, A.L., and YATES, J.R., 3RD (1994). An approach to correlate tandem mass spectral data
of peptides with amino acid sequences in a protein database. J Am Soc Mass Spectrom 5, 976.
FETROW, J.S., and SKOLNICK, J. (1998). Method for prediction of protein function from sequence using the se-
quence-to-structure-to-function paradigm with application to glutaredoxins/thioredoxins and T1 ribonucleases. J Mol
Biol 281, 949–968.
FISCHER, D., and EISENBERG, D. (1995). Protein fold recognition using sequence-derived predictions. Protein Sci
FLORENS, L., WASHBURN, M.P., RAINE, J.D., et al. (2002). A proteomic view of the Plasmodium falciparum life
cycle. Nature 419, 520–526.
FRANCIS, A.J., DODGE, C.J., GILLOW, J.B., et al. (1991). Microbial transformations of uranium in wastes. Ra-
diochim Acta 52, 311–316.
GALPERIN, M.Y., and KOONIN, E.V. (2004). “Conserved hypothetical” proteins: prioritization of targets for exper-
imental study. Nucleic Acids Res 32, 5452–5463.
GERBER, S.A., RUSH, J., STEMMAN, O., et al. (2003). Absolute quantification of proteins and phosphoproteins from
cell lysates by tandem MS. Proc Natl Acad Sci USA 100, 6940–6945.
GOEBEL, B.M., and STACKEBRANDT, E. (1994). Cultural and phylogenetic analysis of mixed microbial popula-
tions found in natural and commercial bioleaching environments. Appl Environ Microbiol 60, 1614–1621.
GYGI, S.P., RIST, B., GERBER, S.A., et al. (1999). Quantitative analysis of complex protein mixtures using isotope-
coded affinity tags. Nat Biotechnol 17, 994–999.
HARDMAN, M., and MAKAROV, A.A. (2003). Interfacing the orbitrap mass analyzer to an electrospray ion source.
Anal Chem 75, 1699–1705.
HAWKINS, J.W. (2004). Predictability of surface mine spoil hydrologic properties in the Appalachian plateau. Ground
Water 42, 119–125.
HIPPE, H. (2000). Leptospirillum gen. nov. (ex Markosyan 1972), nom. rev., including Leptospirillum ferrooxidans sp.
nov. (ex Markosyan 1972), nom. rev. and Leptospirillum. Int J Syst Evol Microbiol 2, 501–503.
HOUSMAN, V.E., and HOFFMAN, S.D. (1992). Mining sites on Superfund’s National Priority List—Past and cur-
rent mining practices. In Risk Assessment/Management Issues in the Environmental Planning of Mines. D. Van Zyl,
M. Koval, and T.M. Li, eds. (Society for Mining, Metallurgy, Exploration, Littleton, CO), pp. 55–62.
HU, Q., NOLL, R.J., LI, H., MAKAROV, A.A., HARDMAN, M., and COOKS, R.G. (2005). The orbitrap: a new mass
spectrometer. J Mass Spectrom 40, 430–443.
HUWS, S.A., MCBAIN, A.J., and GILBERT, P. (2005). Protozoan grazing and its impact upon population dynamics
in biofilm communities. J Appl Microbiol 98, 238–244.
JOHNSON, D.B. (1998). Biodiversity and ecology of acidophilic microorganisms. FEMS Microbiol Ecol 27, 307–317.
JOHNSON, D.B., DZIURLA, M.A., KOLMERT, A., et al. (2002). The microbiology of acid mine drainage: genesis
and biotreatment. S Afr J Sci 98, 249–255.
KALL, L., KROGH, A., and SONNHAMMER, E.L. (2004). A combined transmembrane topology and signal peptide
prediction method. J Mol Biol 338, 1027–1036.
KELLY, M. (1988). Mining and the Freshwater Environment (Elsevier, New York).
KLEINMANN, R.L.P. (1989). Acid mine drainage in the United States: controlling the impact on streams and rivers.
Presented at the 4th World Congress on the Conservation of Built and Natural Environments, University of Toronto.
LOTTERMOSER, B. (2003). Mine Wastes: Characterization, Treatment and Environmental Impacts (Springer, New
MACALADY, J.L., VESTLING, M.M., BAUMLER, D., et al. (2004). Tetraether-linked membrane monolayers in Fer-
roplasma spp: a key to survival in acid. Extremophiles 8, 411–419.
MCDONALD, W.H., OHI, R., MIYAMOTO, D., et al. (2002). Comparison of three directly coupled HPLC MS/MS
strategies for identification of proteins from complex mixtures: single-dimensional LC-MS/MS, 2-phase MudPIT,
and 3-phase MudPIT. Int J Mass Spectrom 219, 245.
MCGUIRE, M.M., EDWARDS, K.J., BANFIELD, J.F., et al. (2001). Kinetics, surface chemistry, and structural evolution
of microbially mediated sulfide mineral dissolution. Geochim Cosmochim Acta 65, 1243–1258.
MOORE, J.N., and LUOMA, S.N. (1990). Hazardous wastes from large-scale metal extraction, Environ Sci Tech 24,
MOSES, C.O., NORDSTROM, D.K., HERMAN, J.S., et al. (1987). Aqueous pyrite oxidation by dissolved oxygen and
by ferric iron. Geochim Cosmochim Acta 51, 1561–1571.
NEUWALD, A.F., and LIU, J.S. (2004). Gapped alignment of protein sequence motifs through Monte Carlo opti-
mization of a hidden Markov model. BMC Bioinform 5, 157.
NORDSTROM, D.K., and ALPERS, C.N. (1995). Remedial investigations, decisions, and geochemical consequences
at Iron Mountain Mine, California. Presented at the Sudbury ‘95, Conference on Mining and the Environment, Sud-
NORDSTROM, D.K., and ALPERS, C.N. (1999a). Geochemistry of acid mine waters. In Reviews in Economic Geol-
ogy. Vol. 6A. The Environmental Geochemistry of Mineral Deposits. Part A. Processes, Methods and Health Issues.
G.S. Plumlee and M.J. Logsdon, eds. (Soc. Econ. Geol., Littleton, CO), pp. 133–160.
NORDSTROM, D.K., and ALPERS, C. N. (1999b). Negative pH efflorescent mineralogy, and consequences for envi-
ronmental restoration at the Iron Mountain Superfund site, California. Proc Natl Acad Sci USA 96, 3455–3462.
NORDSTROM, D.K., and SOUTHAM, G. (1997). Geomicrobiology of sulfide mineral oxidation. In Geomicrobiol-
ogy: Interactions between Microbes and Minerals. Vol. 35. Reviews in Mineralogy. J.F. Banfield and K.H. Nealson,
eds. (Mineralogical Society of America, Washington, DC), pp. 361–390.
NORDSTROM, D.K., BURCHARD, J.M., and ALPERS, C.N. (1990). The production and variability of acid mine
drainage at Iron Mountain, California: A Superfund Site undergoing rehabilitation, In Acid Mine Drainage—De-
signing for Closure. J.W. Gadsy, J.W. Malick, and S.J. Day, eds. (BiTech Publishers, Vancouver, B.C.), pp. 23–33.
NORDSTROM, D.K., ALPERS, C.N., COSTON, J.A., et al. (1999). Geochemistry, toxicity, and sorption properties
of contaminated sediments and pore waters in two reservoirs receiving acid mine drainage from Iron Mountain, Cal-
ifornia. In Proc. U.S. Geol. Survey Toxic Substances Hydrology Program. D.W. Morganwalp and H.T. Buxton, eds.
(U.S. Geol. Survey Water-Resources Invest. Report 99-4018A), pp. 289–296.
NORDSTROM, D.K., ALPERS, C.N., PTACEK, C.J., et al. (2000). Negative pH and extremely acidic mine waters
from Iron Mountain, California. Environ Sci Tech 34, 254–258.
NORRIS, P.R., and KELLY, D.P. (1980). Dissolution of pyrite (FeS2) by pure and mixed cultures of some acidophilic
bacteria. FEMS Microbiol Lett 4, 143–146.
NRIAGU, J.O., and PACYNA, J.M. (1988). Quantitative assessment of world-wide contamination of air, water and
soils by trace metals. Nature 333, 134–139.
OLSEN, J.V., and MANN, M. (2004). Improved peptide identification in proteomics by two consecutive stages of mass
spectrometric fragmentation. Proc Natl Acad Sci USA 101, 13417–13422.
OKIBE, N., and JOHNSON, D.B. (2004). Biooxidation of pyrite by defined mixed cultures of moderately thermophilic
acidophiles in pH-controlled bioreactors: significance of microbial interactions. Biotechnol Bioeng 87, 574–583.
OSTERMAN, A., and OVERBEEK, R. (2003). Missing genes in metabolic pathways: a comparative genomics ap-
proach. Curr Opin Chem Biol 7, 238–251.
PACE, N.R., STAHL, D.A., LANE, D.J., et al. (1985). Analyzing natural microbial populations by ribosomal RNA
sequences. ASM News 51, 4–12.
PAPKE, R.T., KOENIG, J.E., RODRIGUEZ-VALERA, F., et al. (2004). Frequent recombination in a saltern popula-
tion of Halorubrum. Science 306, 1928–1929.
PENG, J., ELIAS, J.E., THOREEN, C.C., et al. (2003). Evaluation of multidimensional chromatography coupled with
tandem mass spectrometry (LC/LC-MS/MS) for large-scale protein analysis: the yeast proteome. J Proteome Res 2,
PLANT, J., SIMPSON, P.R., SMITH, B., et al. (1999). Uranium ore deposits: products of the radioactive earth. In Re-
views in Mineralogy. Volume 38. Uranium: Mineralogy, Geochemistry and the Environment. P.C. Burns and R. Finch,
eds. (Mineralogical Society of America, Chantilly, VA), pp. 255–320.
RAM, R.J., VERBERKMOES, N.C., THELEN, M.P., et al. (2005). Community proteomics of a natural microbial
biofilm. Science 308, 1915–1920.
RAPPE, M.S., and GIOVANNONI, S.J. (2003). The uncultured microbial majority. Annu Rev Microbiol 57, 369–394.
RAWLINGS, D.E., TRIBUTSCH, H., and HANSFORD, G.S. (1999). Reasons why ‘Leptospirillum’-like species rather
than Thiobacillus ferrooxidans are the dominant iron-oxidizing bacteria in many commercial processes for the bioox-
idation of pyrite and related ores. Microbiology 145, 5–13.
SAND, W., ROHDE, K., SOBOTKE, B., et al. (1992). Evaluation of Leptospirillum ferrooxidans for leaching. Appl
Environ Microbiol 58, 85–92.
SCHRENK, M.O., EDWARDS, K.J., GOODMAN, R.M., et al. (1998). Distribution of Thiobacillus ferrooxidans and
Leptospirillum ferrooxidans: implications for generation of acid mine drainage. Science 279, 1519–1522.
SCHWARTZ, J.C., and SENKO, M.W. (2002). A two-dimensional quadrupole ion trap mass spectrometer. J Am Soc
Mass Spectrom 13, 659.
SILVERMAN, M.P., and EHRLICH, H.L. (1964). Microbial formation and degradation of minerals. In Advances in
Applied Microbiology Vol. 6. W.W. Umbreit, ed. (Academic Press, New York), pp. 153–206.
SINGER, P.C., and STUMM, W. (1970). Acidic mine drainage: the rate determining step. Science 167, 1121–1123.
SKOLNICK, J., KIHARA, D., and ZHANG, Y. (2004). Development and large scale benchmark testing of the
PROSPECTOR_3 threading algorithm. Proteins 56, 502–518.
SMITH, K.S., and HUYCK, H.L.O. (1999). An overview of the abundance, relative mobility, bioavailability, and hu-
man toxicity of metals. In Reviews in Economic Geology. Vol. 6A. The Environmental Geochemistry of Mineral De-
posits. Part A. Processes, Methods and Health Issues. G.S. Plumlee and M.J. Logsdon, eds. (Soc. Econ. Geol., Lit-
tleton, CO), pp. 29–70.
SODING, J. (2004). Protein homology detection by HMM-HMM comparison. Bioinformatics 21, 951–960.
STRADER, M.B., VERBERKMOES, N.C., TABB, D.L., et al. (2004). Characterization of the 70S Ribosome from
Rhodopseudomonas palustris using an integrated “top-down” and “bottom-up” mass spectrometric approach. J Pro-
teome Res 3, 965–978.
SUZUKI, Y., KELLY, S.D., KEMNER, K.M., et al. (2003). Microbial populations stimulated for U(VI) reduction in
uranium mine sediment. Appl Environ Microbiol 69, 1337–1346.
SYKA, J.E., MARTO, J.A., BAI, D.L., et al. (2004). Novel linear quadrupole ion trap/FT mass spectrometer: perfor-
mance characterization and use in the comparative analysis of histone H3 post-translational modifications. J Pro-
teome Res 3, 621–626.
TABB, D.L., NARASIMHAN, C., STRADER, M.B., et al. (2005). DBDigger: reorganized proteomic database identi-
fication improves flexibility and speed. Anal Chem 77, 2464–2474.
TATUSOV, R.L., KOONIN, E.V., and LIPMAN, D.J. (1997). A genomic perspective on protein families. Science 278,
TEELING, H., WALDMANN, J., LOMBARDOT, T., et al. (2004). TETRA: a web-service and stand-alone program
for the analysis and comparison of tetranucleotide usage patterns in DNA sequence. BMC Bioinform 5, 163.
THINGSTAD, T. (2000). Elements of a theory for the mechanisms controlling abundance, diversity, and biogeochem-
ical role of lytic bacterial viruses in aquatic systems. Limnol Oceanogr 45, 1320–1328.
TYSON, G.W., CHAPMAN, J., HUGENHOLTZ, P., et al. (2004). Community structure and metabolism through re-
construction of microbial genomes from the environment. Nature 428, 37–43.
TYSON, G.W., LO, I., BAKER, B.J., et al. (2005). Genome-directed isolation of the key nitrogen fixer Leptospirillum
ferrodiazotrophum sp. nov., from an acidophilic microbial community. Appl Environ Microbiol (in review).
VENTER, J.C., REMINGTON, K., HEIDELBERG, J.F., et al. (2004). Environmental shotgun sequencing of the Sar-
gasso Sea. Science 304, 66–74.
WASHBURN, M.P., WOLTERS, D., and YATES, J.R., 3RD (2001). Large-scale analysis of the yeast proteome by
multidimensional protein identification technology. Nat Biotechnol 19, 242–247.
WEINBAUER, M.G. (2004). Ecology of prokaryotic viruses. FEMS Microbiol Rev 28, 127–181.
WHITAKER, R.J., GROGAN, D.W., and TAYLOR, J.W. (2003). Geographical barriers isolate endemic populations
of hyperthermophilic archaea. Science 301, 976–978.
WU, L., THOMPSON, D.K., LI, G., et al. (2001). Development and evaluation of functional gene arrays for detection
of selected genes in the environment. Appl Environ Microbiol 67, 5780–5790.
YOUNG, J.E. (1992). Mining the Earth. Worldwatch Paper 109. (Worldwatch Institute). Washington, DC.
Address reprint requests to:
Dr. Jillian F. Banfield
Department of Earth and Planetary Science
University of California
Berkeley, CA 94720