Many motif discovery algorithms require a fixed window size to be defined in advance. Because of the fixed size, these approaches often deliver a number of similar motifs that are simply shifted by some bases or include mismatches. To confront the mismatched motifs problem, we use the super-rule concept to construct a Super-Rule-Tree (SRT) by a modified Hybrid Hierarchical K-means (HHK) clustering algorithm, which requires no parameter set-up to identify the similarities and dissimilarities between the motifs. Analysis of the motifs generated by our approach shows that they are significant not only in sequence but also in secondary structure similarity.
International Journal of Data Mining and Bioinformatics 06/2010; 4(3):316-30. DOI:10.1504/IJDMB.2010.033523 · 0.50 Impact Factor
Protein sequence motifs have the potential to determine the conformation, function and activities of proteins. In order to obtain protein sequence motifs which are universally conserved across protein family boundaries, our input dataset is extremely large, unlike that of most popular motif discovery algorithms. As a result, an efficient technique is required. We create two granular computing models (FIK and FGK) to efficiently generate protein motif information which transcends protein family boundaries. We have performed a comprehensive comparison between the two models. In addition, we combine the results from the FIK and FGK models to generate our best sequence motif information.
International Journal of Computational Biology and Drug Design 01/2009; 2(2):168-86. DOI:10.1504/IJCBDD.2009.028822
The public computer architecture shows promise as a platform for solving fundamental problems in bioinformatics such as global gene sequence alignment and data mining with tools such as the basic local alignment search tool (BLAST). Our implementation of these two problems on the Berkeley open infrastructure for network computing (BOINC) platform demonstrates a runtime reduction factor of 1.15 for sequence alignment and 16.76 for BLAST. While the runtime reduction factor of the global gene sequence alignment application is modest, this value is based on a theoretical sequential runtime extrapolated from the calculation of a smaller problem. Because this runtime is extrapolated from running the calculation in memory, the theoretical sequential runtime would require 37.3 GB of memory on a single system. With this in mind, the BOINC implementation not only offers the reduced runtime, but also the aggregation of the available memory of all participant nodes. If an actual sequential run of the problem were compared, a more drastic reduction in the runtime would be seen due to an additional secondary storage I/O overhead for a practical system. Despite the limitations of the public computer architecture, most notably in communication overhead, it represents a practical platform for grid- and cluster-scale bioinformatics computations today and shows great potential for future implementations.
Protein sequence motifs are gathering more and more attention in the sequence analysis area. These recurring regions have the potential to determine a protein's conformation, function and activities. In our previous work, we tried to obtain protein sequence motifs which are universally conserved across protein family boundaries. Therefore, unlike most popular motif discovery algorithms, our input dataset is extremely large. In order to deal with large input datasets, we provided two granular computing models (the FIK and FGK models) to efficiently generate protein motif information. In this article, we develop a new method which combines the concept of granular computing and the power of ranking SVM to further extract protein sequence motif information. There are two reasons to eliminate redundant data. First, the information we try to generate is about sequence motifs, but the original input data are derived from whole protein sequences by a sliding window technique. Second, fuzzy c-means clustering can assign one segment to more than one information granule, yet not all data segments have a direct relation to the granule they are assigned to. The quality of motif information increases dramatically in all three evaluation measures when this new feature elimination model is applied. Compared with traditional methods which shrink cluster size to obtain a more compact cluster, our approach shows improved results.
International Journal of Functional Informatics and Personalised Medicine 01/2008; 1:8-25. DOI:10.1504/IJFIPM.2008.018290
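The sliding window step mentioned in the abstract above can be illustrated with a minimal sketch. The window length of 9 and the toy sequence are assumptions chosen for illustration only; the abstract does not state either, and the function name `sliding_windows` is hypothetical.

```python
def sliding_windows(sequence, window=9):
    """Return every contiguous segment of length `window` from `sequence`.

    The default window of 9 is an assumed value, not one given in the abstract.
    A sequence shorter than the window yields an empty list.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    return [sequence[i:i + window] for i in range(len(sequence) - window + 1)]


# Hypothetical toy sequence, not taken from any dataset used in the paper.
segments = sliding_windows("MKTAYIAKQRQISFVK", window=9)
```

Each whole sequence of length n contributes n - window + 1 overlapping segments, which is one reason the segment set becomes very large and contains redundant data.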
Protein sequence motifs are gathering more and more attention in the sequence analysis area. These recurring regions have the potential to determine a protein's conformation, function and activities. In our previous work, we tried to obtain protein sequence motifs which are universally conserved across protein family boundaries. Therefore, unlike most popular motif discovery algorithms, our input dataset is extremely large. In order to deal with large input datasets, we provided two granular computing models (the FIK and FGK models) to efficiently generate protein motif information, and the Super GSVM-FE model to perform feature elimination and improve the quality of motif information. In this article, we further improve our SVM feature elimination model to achieve three goals: reduce execution time by half, further improve motif information quality, and add the ability to adjust the number of filtered segments. Compared with the latest results, our new approach shows great improvements.
Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering, BIBE 2007, October 14-17, 2007, Harvard Medical School, Boston, MA, USA; 01/2007
Public computing can potentially supply not only computational power but also memory and short-term storage resources to grid- and cluster-scale problems. Gene sequence alignment is a fundamental computational challenge in bioinformatics with attributes such as moderate computational requirements, extensive memory requirements, and highly interdependent tasks. This study examines the performance of calculating the alignment for two 100,000-base sequences on a public computing platform utilizing the BOINC framework. When compared to the theoretical, optimal sequential implementation, the parallel implementation achieves a speedup factor of 1.4 at the point of maximum parallelism and ends with a speedup of 1.2. This speedup factor is based on extrapolation of the sequential performance of a segment of the problem. This extrapolation would require a theoretical sequential machine with approximately 37.3 GB of working memory, or it would suffer performance degradation from use of secondary storage during the calculation.
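The memory figure of approximately 37.3 GB is consistent with one back-of-envelope reading: a full dynamic-programming matrix for two 100,000-base sequences with 4-byte cells, measured in units of 2^30 bytes. Both the full-matrix layout and the 4-byte cell size are assumptions made here for illustration; the abstract does not state how the figure was derived.

```python
def full_matrix_gib(len_a, len_b, bytes_per_cell=4):
    """Memory for a full (len_a + 1) x (len_b + 1) alignment matrix, in GiB.

    bytes_per_cell=4 is an assumed cell size, not a detail from the abstract.
    """
    cells = (len_a + 1) * (len_b + 1)
    return cells * bytes_per_cell / 2**30


# Two sequences of 100,000 bases each, as in the abstract.
memory_gib = full_matrix_gib(100_000, 100_000)
```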
Public computing shows promise as an architecture offering tremendous computational power to complex scientific problems. Calculation of the joint probability distribution function for avalanche photodiode gain and impulse response is one such problem. The BOINC architecture is a public computing platform used in this experiment to examine the performance offered by the platform for a specific scientific computation featuring intense computation, task independence, and heavy communication overhead. The large output data of this computation is shown to have a significant impact on the execution of this application, resulting in a speedup of between 4.5 and 5 times over the sequential version. Workload distribution favoring a participant node with rich communication resources in the public computing platform suggests that greater performance gains are available from the platform if communication resources are increased.
Distributed computing on heterogeneous nodes, or grid computing, provides a substantial increase in computational power available for many applications. This paper reports our experience of calculating cryptographic hashes on a small grid test bed using a software package called BOINC. The computation power on the grid allows for searching the input space of a cryptographic hash to find a matching hash value. In particular, we show an implementation of searching possible 5-character passwords hashed with the MD4 algorithm on the grid. The resulting performance shows individual searches of sections of the password space returning a near-linear decrease in calculation time based on individual participant node performance. Due to the overhead involved in scheduling these sections of the password space and processing the results, the overall performance gain is slightly less than linear, but still reasonably good. We plan to design new scheduling algorithms and perform more testing to enhance BOINC's capability in our future research.
Grid and Cooperative Computing - GCC 2004: Third International Conference, Wuhan, China, October 21-24, 2004. Proceedings; 01/2004
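The idea of splitting a password space into sections that can be searched independently can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's implementation: it uses SHA-256 as a stand-in because MD4 is not reliably available through Python's `hashlib`, it assumes a lowercase-only alphabet, it searches 3-character passwords to stay fast, and it runs the sections serially rather than distributing them through BOINC. All function names are hypothetical.

```python
import hashlib
import string

ALPHABET = string.ascii_lowercase  # assumed character set, not from the paper


def index_to_password(index, length, alphabet=ALPHABET):
    """Map an integer in [0, len(alphabet)**length) to a fixed-length password."""
    chars = []
    for _ in range(length):
        index, rem = divmod(index, len(alphabet))
        chars.append(alphabet[rem])
    return "".join(reversed(chars))


def digest(password):
    """Hash a candidate. SHA-256 is a stand-in for MD4 in this sketch."""
    return hashlib.sha256(password.encode("ascii")).hexdigest()


def make_sections(length, n_sections, alphabet=ALPHABET):
    """Split the password space into contiguous (start, stop) index ranges."""
    total = len(alphabet) ** length
    step = -(-total // n_sections)  # ceiling division
    return [(start, min(start + step, total)) for start in range(0, total, step)]


def search_section(target, start, stop, length):
    """Search one section; each section could be handed to a separate node."""
    for index in range(start, stop):
        candidate = index_to_password(index, length)
        if digest(candidate) == target:
            return candidate
    return None


# Toy run: 3-character space in 4 sections, processed one after another.
target = digest("cab")
found = None
for start, stop in make_sections(3, 4):
    found = search_section(target, start, stop, 3)
    if found is not None:
        break
```

Because sections share no state, the time to search each one depends only on the node that runs it; the scheduling and result-processing overhead described in the abstract is what this serial sketch leaves out.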