Abstract: Motivation: Predicting functional sites in kinases is an important problem in biology. Both the functional sites themselves and the relationships among the amino acids within those sites need to be understood. We develop an algorithm for predicting kinase functional sites from amino acid sequence data, based on hierarchical stochastic language (HSL) modelling. Results: We validate the method using two complementary approaches. First, the functional sites predicted by the HSL model were compared with experimentally verified functional sites, including the patterns in PROSITE, the contacting sites in the Protein Data Bank (PDB), and the domains in Pfam. Against the PROSITE patterns, the overall average recall/precision of the HSL model was 83.5% / 23.0%; against the PDB contacting sites, it was 66.1% / 79.9%. Against Pfam, 90% of the predicted functional sites fell within domains whose names contain the substring “kinase”. Second, 10-fold cross-validation was used to assess how accurately the HSL model predicts kinase function. The HSL model achieved high sensitivity (94.7%) and high specificity (94.0%), compared with 94.5% and 85.8%, respectively, for MEME. The HSL model also detected kinase sub-families automatically, and the identified sub-families were consistent with known phylogenetic trees of the kinase sequences. The HSL model is therefore applicable to kinase sequences that form heterogeneous subsets sharing the same catalytic function. Availability and Supplementary information: The software and supplementary materials are available at http://www.math.pku.edu.cn/teachers/dengmh/HSL
Keywords: kinase; functional sites; hierarchical stochastic language (HSL)