Journal: Indian Journal of Computer Science and Engineering
Print ISSN: 2231-3850
Electronic ISSN: 0976-5166
Year of publication: 2021
Volume: 12
Issue: 3
Pages: 641-652
DOI: 10.21817/indjcse/2021/v12i3/211203126
Publisher: Engg Journals Publications
Abstract: DNA barcoding (a technique that uses short DNA sequences) has become a fast, economical and accurate method for discovering and identifying organisms of the three main kingdoms of eukaryotes. In plants, a few coding and non-coding regions of chloroplast genomes have been tested for their ability to identify species, while other regions of the genome remain to be explored for their suitability as DNA barcodes. The present study identifies potential DNA barcodes from chloroplast DNA (cpDNA) sequences and assesses their ability to discriminate 133 plant species belonging to the family Solanaceae, using different machine learning classification algorithms in WEKA and a distance-based method in SPIDER. Thirty-three hyper-variable regions were identified based on nucleotide diversity (π) using sliding window analysis of the aligned sequences of these species. These regions, along with the well-established markers matK and rbcL, were assessed for their discriminating potential at genus level. A sequence richness regime was followed for six hyper-variable regions (‘ycf1’, ‘cemA, cemA-petA’, ‘rps12-clpP, clpP / rps12-psbB’, ‘petA, petA-psbJ, psbJ, psbJ-psbL’, ‘trnL-trnF, trnF, trnF-ndhJ’ and ‘ndhF, ndhF-rpl32, rpl32, rpl32-trnL’) using BLASTN, along with matK and rbcL, and these regions were tested for their discrimination potential at genus and species levels. The distance-based method SPIDER and the machine learning algorithm SMO performed best when compared with the other classification methods. The study observed that the percentage of correct identifications increases with the number of sequences available from a particular species. All hyper-variable regions achieved the maximum correct identification rate (100%) at genus level. However, the region ‘ndhF, ndhF-rpl32, rpl32, rpl32-trnL’ achieved the highest discrimination rate at species level (69%), which was even better than matK and rbcL. The low identification rates at species level compared with genus level were attributed to ambiguity within species for these regions. This study will provide a valuable resource for the development of DNA barcodes for the Solanaceae family.