ABSTRACT: Biomedical literature, e.g., MEDLINE, contains a wealth of knowledge regarding functions of proteins. Major recurring biological concepts within such text corpora represent the domains of this body of knowledge. The goal of this research is to identify the major biological topics/concepts from a corpus of protein-related MEDLINE titles and abstracts by applying a probabilistic topic model.
The latent Dirichlet allocation (LDA) model was applied to the corpus. Based on Bayesian model selection, 300 major topics were extracted from the corpus. The majority of the identified topics/concepts were found to be semantically coherent, and most represented biological objects or concepts. The identified topics/concepts were further mapped to the controlled vocabulary of the Gene Ontology (GO) terms based on mutual information.
The major and recurring biological concepts within a collection of MEDLINE documents can be extracted by the LDA model. The identified topics/concepts provide a parsimonious and semantically enriched representation of the texts in a semantic space with reduced dimensionality and can be used to index text.
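As a rough illustration of the generative assumptions behind the LDA model mentioned in this abstract, the sketch below draws one bag-of-words document: per-document topic proportions are sampled from a Dirichlet prior, then each word is produced by first picking a topic and then picking a word from that topic's word distribution. The toy vocabulary, the number of topics, and the hyperparameter values are all hypothetical and chosen only for illustration; none of them comes from the study, and this is not the authors' implementation.

```python
import numpy as np

def generate_document(n_words, alpha, phi, rng):
    """Draw one bag-of-words document under the standard LDA generative assumptions."""
    # Per-document topic proportions, drawn from a Dirichlet prior.
    theta = rng.dirichlet(alpha)
    # One topic assignment per word position.
    topics = rng.choice(len(alpha), size=n_words, p=theta)
    # Each word is drawn from the word distribution of its assigned topic.
    words = np.array([rng.choice(phi.shape[1], p=phi[z]) for z in topics])
    return theta, topics, words

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and settings (not taken from the study).
vocab = ["kinase", "phosphorylation", "receptor",
         "binding", "transcription", "promoter"]
n_topics = 3
alpha = np.full(n_topics, 0.5)
# Topic-word distributions: one row per topic, each row sums to 1.
phi = rng.dirichlet(np.full(len(vocab), 0.1), size=n_topics)

theta, topics, words = generate_document(20, alpha, phi, rng)
print("topic proportions:", np.round(theta, 3))
print("document:", [vocab[w] for w in words])
```

Fitting a topic model runs this process in reverse: given only the observed words, inference estimates the topic-word distributions and the per-document topic proportions.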
ABSTRACT: Although there are numerous ethnic groups in Sierra Leone, the Mende and Temne together account for approximately 60% of the total population. To see if genetic differences could be observed among ethnic groups in Sierra Leone, the nucleotide sequence of the hypervariable 1 (HV1) region of mitochondrial DNA (mtDNA) was determined from samples of the two major ethnic groups, the Mende (n=59) and Temne (n=121), and of two minor ethnic groups, the Loko (n=29) and Limba (n=67). Among these 276 HV1 sequences, 164 individual haplotypes were observed. An analysis of molecular variance indicated that the distribution of these haplotypes within the Limba sample was significantly different from that of the other ethnic groups. No significant genetic variation was seen between the Mende, Temne, and Loko. These results indicate that distinguishing genetic differences can be observed among ethnic groups residing in historically close proximity to one another. Furthermore, we observed some mitochondrial DNA haplotypes that are common among the Sierra Leone ethnic groups but that have not been observed in other published studies of West African ethnic groups. Therefore, we may have evidence for mtDNA lineages that are unique to this region of West Africa.
American Journal of Physical Anthropology 10/2005; 128(1):156-63. · 2.48 Impact Factor
ABSTRACT: To better understand the population substructure of African Americans living in coastal South Carolina, we used restriction site polymorphisms and an insertion/deletion in mitochondrial DNA (mtDNA) to construct seven-position haplotypes across 1,395 individuals from Sierra Leone, Africa, from U.S. European Americans, and from the New World African-derived populations of Jamaica, Gullah-speaking African Americans of the South Carolina Sea Islands (Gullahs), African Americans living in Charleston, South Carolina, and West Coast African Americans. Analyses showed a high degree of similarity within the New World African-derived populations, where haplotype frequencies and diversities were similar. Phi-statistics indicated that very little genetic differentiation has occurred within New World African-derived populations, but that there has been significant differentiation of these populations from Sierra Leoneans. Genetic distance estimates indicated a close relationship of Gullahs and Jamaicans with Sierra Leoneans, while African Americans living in Charleston and the West Coast were progressively more distantly related to the Sierra Leoneans. We observed low maternal European American admixture in the Jamaican and Gullah samples (m = 0.020 and 0.064, respectively) that increased sharply in a clinal pattern from Charleston African Americans to West Coast African Americans (m = 0.099 and 0.205, respectively). The appreciably reduced maternal European American admixture noted in the Gullah indicates that the Gullah may be uniquely situated to allow genetic epidemiology studies of complex diseases in African Americans with low European American admixture.
American Journal of Physical Anthropology 09/2005; 127(4):427-38. · 2.48 Impact Factor
ABSTRACT: The Marine Genomics project is a functional genomics initiative developed to provide a pipeline for the curation of Expressed Sequence Tags (ESTs) and gene expression microarray data for marine organisms. It provides a unique clearing-house for marine specific EST and microarray data and is currently available at http://www.marinegenomics.org.
The Marine Genomics pipeline automates the processing, maintenance, storage and analysis of EST and microarray data for an increasing number of marine species. It currently contains 19 species databases (over 46,000 EST sequences) that are maintained by registered users from local and remote locations in Europe and South America in addition to the USA. A collection of analysis tools is implemented. These include a pipeline upload tool for EST FASTA files, sequence trace files and microarray data, an annotative text search, automated sequence trimming, sequence quality control (QA/QC) editing, sequence BLAST capabilities and a tool for interactive submission to GenBank. Another feature of this resource is the integration with a scientific computing analysis environment implemented in MATLAB.
The conglomeration of multiple marine organisms with integrated analysis tools enables users to focus on the comprehensive descriptions of transcriptomic responses to typical marine stresses. This cross-species data comparison and integration enables users to contain their research within a marine-oriented data management and analysis environment.
ABSTRACT: To develop informative tools for the study of population affinities in African Americans, we sequenced the hypervariable segments I and II (HVS I and HVS II) of mitochondrial DNA (mtDNA) from 96 Sierra Leoneans; European Americans; rural, Gullah-speaking African Americans; urban African Americans living in Charleston, South Carolina; and Jamaicans. We identified single nucleotide polymorphisms (SNPs) exhibiting ethnic affinities, and developed restriction endonuclease tools to screen these SNPs. Here we show that three HVS restriction site polymorphisms (RSPs), EcoRV, FokI, and MfeI, exhibit appreciable differences in frequency (average delta = 0.4165) between putative African American parental populations (i.e., extant Africans living in Sierra Leone and European Americans). Estimates of European American mtDNA admixture, calculated from haplotypes composed of these three novel RSPs, show a cline of increasing admixture from Gullah-speaking African American (m = 0.0300) to urban Charleston African American (m = 0.0689) to West Coast African American (m = 0.1769) populations. This haplotype admixture in the Gullahs is the lowest recorded to date among African Americans, consistent with previous studies using autosomal markers. These RSPs may become valuable new tools in the study of ancestral affinities and admixture dynamics of African Americans.
Human Biology 05/2003; 75(2):147-61. · 1.52 Impact Factor
ABSTRACT: Knowledge of the functions of proteins serves as a cornerstone of modern biomedical knowledge. Identifying the major domains of such knowledge and the connections among them not only provides a concise overview of our knowledge in the field but also facilitates knowledge discovery. In this paper, we report using a probabilistic topic model (1,2) to identify the major biological topics within a corpus of MEDLINE titles and abstracts that describe protein functions. The model is a Bayesian hierarchical generative model that is based on the "bag of words" assumption and simulates the generation of a text document as mixing words from different topics with the following stochastic procedures: