Basic Article Information

Title: A Compression Algorithm for Nucleotide Data Based on Differential Direct Coding and Variable Length Look up Table (LUT)
Authors: Govind Prasad Arya; R. K. Bharti
Journal: International Journal of Computer Science and Information Technologies
Electronic ISSN: 0975-9646
Publication year: 2012
Volume: 3
Issue: 3
Pages: 4411-4416
Publisher: TechScience Publications
Abstract: The ongoing exponential increase of genomic data, together with full diploid human genomes, creates new challenges not only for understanding genomic structure, function and development, but also for the storage, navigation and privacy of genomic data. In this paper, we propose a modified Differential Direct Coding algorithm. It is a general-purpose nucleotide compression algorithm based on a variable-length lookup table (LUT). The method identifies repeat regions in an individual sequence and stores them in the LUT. The algorithm compresses both repeat and non-repeat sequences. It also handles non-base characters and can compress any nucleotide sequence. It gives better results than the existing algorithm. The original Differential Direct Coding algorithm used a fixed-size lookup table containing the 64 possible triplets formed from the four characters A, G, T and C. Because the available ASCII characters were not fully utilized, we make this table variable in length by adding further entries whose sizes are multiples of a triplet (6, 9, 12, ...). Our algorithm is based on longest common substitution (LCS): it searches for the longest common sequence whose length is a multiple of 3 and substitutes an ASCII value for that sequence to generate the variable-length LUT. The compression ratio obtained by the previous algorithms was smaller than that of the variable-length LUT compression algorithm, a difference that becomes substantial when the algorithm is applied to large genomic repositories. In addition, our algorithm uses the maximum number of available ASCII characters, thus increasing efficiency.
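The idea of extending a fixed 64-triplet table with longer repeat entries can be illustrated with a minimal Python sketch. This is not the authors' implementation: the code assignments (0 to 63 for triplets, 64 upward for repeats), the ranking heuristic used to pick repeats, the 256-entry limit, and the function names `build_lut`, `encode` and `decode` are all assumptions made for illustration. The sketch also accepts only A, C, G, T sequences whose length is a multiple of 3, and leaves out the non-base character handling that the abstract mentions.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
# Fixed part of the table: the 64 possible triplets (code assignment is illustrative).
TRIPLETS = ["".join(p) for p in product(BASES, repeat=3)]


def build_lut(seq, max_entries=256, lengths=(12, 9, 6)):
    """Return a list where the index is the byte code and the value is a nucleotide string."""
    lut = list(TRIPLETS)
    counts = Counter()
    for n in lengths:
        # Count triplet-aligned chunks whose length is a multiple of 3.
        for i in range(0, len(seq) - n + 1, 3):
            counts[seq[i:i + n]] += 1
    # Keep only chunks seen more than once, ranked by an assumed "bases saved" score.
    repeats = [(s, c) for s, c in counts.items() if c > 1]
    repeats.sort(key=lambda sc: (-(len(sc[0]) - 3) * sc[1], sc[0]))
    for s, _ in repeats[: max_entries - len(lut)]:
        lut.append(s)
    return lut


def encode(seq, lut):
    """Greedily replace the longest matching table entry with its one-byte code."""
    if len(seq) % 3 or set(seq) - set(BASES):
        raise ValueError("sketch only handles A/C/G/T input with length divisible by 3")
    code_of = {s: i for i, s in enumerate(lut)}
    lengths = sorted({len(s) for s in lut}, reverse=True)
    out = bytearray()
    i = 0
    while i < len(seq):
        for n in lengths:
            chunk = seq[i:i + n]
            code = code_of.get(chunk)
            if code is not None:
                out.append(code)
                i += len(chunk)
                break
    return bytes(out)


def decode(data, lut):
    """Expand each byte code back into its nucleotide string."""
    return "".join(lut[b] for b in data)


if __name__ == "__main__":
    sequence = "ACGTACGTACGT" * 3 + "TTTGGG"
    table = build_lut(sequence)
    packed = encode(sequence, table)
    assert decode(packed, table) == sequence
    print(len(sequence), "bases ->", len(packed), "bytes")
```

In this sketch the table must travel with the compressed data so that the decoder can expand the codes; how the published algorithm stores or transmits its table is not stated in the abstract.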