Article Information

Title: Significance of Data Selection in Deep Learning for Reliable Binding Mode Prediction of Ligands in the Active Site of CYP3A4
Authors: Atsuko Sato; Naoki Tanimura; Teruki Honma et al.
Journal: Chemical and Pharmaceutical Bulletin
Print ISSN: 0009-2363
Electronic ISSN: 1347-5223
Publication year: 2019
Volume: 67
Issue: 11
Pages: 1183-1190
DOI: 10.1248/cpb.c19-00443
Publisher: The Pharmaceutical Society of Japan
Abstract:

For rational drug design, it is essential to predict the binding mode of protein–ligand complexes. Although various machine learning-based models have been reported that use convolutional neural networks (deep learning) to predict binding modes from three-dimensional structures, there are few detailed reports on how best to construct and use datasets. Here, we examined how different datasets affected the prediction of the binding mode of CYP3A4 by a three-dimensional neural network when the number of crystal structures for the target protein was limited. We used four different training datasets: one large, general dataset containing various protein complexes and three smaller, more specific datasets containing complexes with CYP3A4-like pockets, complexes with CYP3A4-binding ligands, and complexes with CYP protein family members. We then trained models with different combinations of datasets with or without subsequent fine-tuning and evaluated the binding mode prediction performance of each model. The best model with respect to area under the receiver operating characteristic curve (ROC AUC) was obtained by training with a combination of the general protein and CYP family datasets. However, the model that balanced ROC AUC and recall was obtained by training with this combination of datasets followed by fine-tuning with the CYP3A4-binding ligands dataset. Our results suggest that datasets that balance protein functionality and data size are important for optimizing binding mode prediction performance. In addition, datasets with large median binding pocket sizes may be important for the binding mode prediction specifically of CYP3A4.
Keywords: computational drug design; deep learning; CYP3A4; binding mode prediction; data selection