Journal: International Journal of Hybrid Information Technology
Print ISSN: 1738-9968
Publication year: 2015
Volume: 8
Issue: 11
Pages: 323-332
DOI: 10.14257/ijhit.2015.8.11.28
Publisher: SERSC
Abstract: Genomic repositories steadily accumulate individual and reference sequences, which share long identical and near-identical strings of nucleotides. This paper proposes a lossless DNA data compression technique, Optimized Base Repeat Length DNA Compression (OBRLDNAComp), based on the redundancy of DNA sequences. Compression is essential for easier storage, shorter retrieval time, and finding similarity within and between sequences. OBRLDNAComp searches for long identical and near-identical strings of nucleotides that other DNA-specific compression algorithms overlook. The technique is designed to maximize the compression-ratio benefit of the longest possible exact repeats. It scans a sequence horizontally from left to right to gather repeat statistics, then applies a substitution technique to compress those repeats. The algorithm is straightforward and needs no external reference file; it scans only the individual file for compression and decompression. The achieved compression ratio of 1.673 bpb outperforms many non-reference-based compression methods.
Keywords: Redundancy; Reference genome; Longest Exact Repeats; Non-repeat; LZ77; Compression Ratio
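
Illustrative sketch (not taken from the paper): the abstract describes a left-to-right scan that finds exact repeats and then substitutes them, without any external reference file. The Python below is a minimal LZ77-style approximation of that idea, not OBRLDNAComp itself. The function names `compress` and `decompress`, the `min_len` threshold, and the token format (literal strings and `(start, length)` back-references) are all assumptions made for illustration, and near-identical repeats are not handled.

```python
def compress(seq, min_len=4):
    """Greedy left-to-right scan (hypothetical sketch, not the paper's method).

    Returns a list of tokens: literal strings, and (start, length) tuples
    that point back to an earlier exact repeat in the same sequence.
    """
    tokens, literal, seen = [], [], {}  # seen: k-mer -> earliest start index
    i, n = 0, len(seq)
    while i < n:
        # Look up the k-mer at the current position, if a full one fits.
        start = seen.get(seq[i:i + min_len]) if i + min_len <= n else None
        if start is None:
            # No earlier occurrence: keep this base as a literal.
            length = 1
            literal.append(seq[i])
        else:
            # Extend the exact repeat as far as it keeps matching.
            length = min_len
            while i + length < n and seq[start + length] == seq[i + length]:
                length += 1
            if literal:
                tokens.append("".join(literal))
                literal = []
            tokens.append((start, length))
        # Index every k-mer that starts inside the span just consumed.
        for p in range(i, i + length):
            if p + min_len <= n:
                seen.setdefault(seq[p:p + min_len], p)
        i += length
    if literal:
        tokens.append("".join(literal))
    return tokens


def decompress(tokens):
    """Rebuild the sequence; copying base by base allows overlapping repeats."""
    out = []
    for tok in tokens:
        if isinstance(tok, str):
            out.extend(tok)
        else:
            start, length = tok
            for k in range(length):
                out.append(out[start + k])
    return "".join(out)


if __name__ == "__main__":
    dna = "ACGTACGTAC"
    packed = compress(dna)
    print(packed)
    print(decompress(packed) == dna)
```

Because every back-reference points into the sequence being processed, decoding needs only the token stream, which mirrors the abstract's point that no reference file is required. A real encoder would also need a compact bit-level encoding of the tokens, which this sketch omits.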