期刊名称:International Journal of Computer Science Issues
印刷版ISSN:1694-0784
电子版ISSN:1694-0814
出版年度:2012
卷号:9
期号:6
出版社:IJCSI Press
摘要:Discovering protein sequence motif information is one of the most crucial tasks in bioinformatics research. In this work, we try to obtain protein recurring patterns which are universally conserved across protein family boundaries. In order to generate higher quality protein sequence motif information from Protein Sequence Culling Server (PISCES) dataset, we tried several different advanced clustering algorithms, such as hierarchical clustering, Self-Organizing Maps (SOM) etc. However, since the dataset itself contains more than 6, 60,000 segments where each segment contains 180 dimensions, any clustering algorithm required more than O(n) complexity is not applicable. Therefore, the very first step of our research is trying to reduce segments. The results suggest that the Singular Value Decomposition (SVD) computing technique is more suits for reducing segments. After that the reduced segments are followed by applying Rough K-Means clustering algorithm. Our experiments indicate that the Rough K-Means algorithm satisfactorily increases the percentage of sequence segments belonging to clusters with high structural similarity than K-Means. The experimental results suggest that the SVD with Rough K-Means algorithm may be applied to other areas of bioinformatics research in order to explore the underlying relationships between data samples more effectively.
关键词:Clustering; Motif; Protein Sequence; SVD; HSSP; DSSP; HSSP;BLOSUM62.