I have considered the following codes for the alphabet {A, T, G, C, N}.
I am writing all the strings into a single binary file without using newlines.
I am adding a $ symbol to mark the end of each string, which identifies where the next string in the file begins.
The coding convention is as follows: {A -> 10, T -> 11, C -> 01, G -> 001, N -> 0000, $ -> 0001}. If a string's bits do not fill out whole bytes, I add trailing zeroes after the $ to pad up to the next byte boundary.
For example, if the input contains the two strings ATGCNN and GCNNNNCGT, the corresponding encoding into a single binary file is 10110010 10000000 00001000 00101000 00000000 00000010 01110001 (spaces shown here only to mark byte boundaries; the file itself is a continuous bit stream).
Here, the first 21 bits are the direct coding of the first string under the convention, including the $ at its end. To round up to whole bytes, three trailing zero bits are added. After that, the encoding of the second string starts. For the second string, no trailing bits are needed: after including the $, its bits count to 32, which is already a whole number of bytes to write to the file.
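To make the convention concrete, here is a minimal sketch of an encoder that follows it (put_bits and encode_string are illustrative names, not my actual code; the output buffer must start out zeroed, since bits are OR-ed in and the padding relies on the zeroes already being there):

    #include <stdint.h>
    #include <stddef.h>

    /* Append nbits bits (MSB first, taken from the low bits of code)
       to a zero-initialized output buffer. */
    static void put_bits(uint8_t *buf, size_t *bitpos, uint32_t code, int nbits)
    {
        for (int i = nbits - 1; i >= 0; i--) {
            if ((code >> i) & 1)
                buf[*bitpos / 8] |= (uint8_t)(0x80 >> (*bitpos % 8));
            (*bitpos)++;
        }
    }

    /* Encode one string of A/T/C/G/N, append the $ terminator, and
       zero-pad to the next byte boundary. Returns the new bit position. */
    static size_t encode_string(const char *s, uint8_t *buf, size_t bitpos)
    {
        for (; *s; s++) {
            switch (*s) {
            case 'A': put_bits(buf, &bitpos, 0x2, 2); break; /* 10   */
            case 'T': put_bits(buf, &bitpos, 0x3, 2); break; /* 11   */
            case 'C': put_bits(buf, &bitpos, 0x1, 2); break; /* 01   */
            case 'G': put_bits(buf, &bitpos, 0x1, 3); break; /* 001  */
            case 'N': put_bits(buf, &bitpos, 0x0, 4); break; /* 0000 */
            }
        }
        put_bits(buf, &bitpos, 0x1, 4);     /* $ -> 0001 */
        return (bitpos + 7) & ~(size_t)7;   /* pad with zeroes to a byte edge */
    }

On a zeroed buffer, encode_string("ATGCNN", buf, 0) returns 24 and encode_string("GCNNNNCGT", buf, 24) returns 56, reproducing the seven bytes above.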
As of now, I am using character-based comparisons to decode the data. Can anybody suggest a bit-operations-based implementation that decodes the data faster than character comparisons?
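To make the question concrete, this is roughly the kind of character-based decoding I mean (a simplified sketch, not my exact code): every byte is expanded into '0'/'1' characters and each code word is matched with strncmp.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Expand every byte into '0'/'1' characters, then match each code
       word by string comparison; one decoded string per output line.
       (Error checks omitted for brevity.) */
    static void decode_chars(const uint8_t *buf, size_t nbytes, FILE *out)
    {
        size_t nbits = 8 * nbytes;
        char *bits = malloc(nbits + 1);
        for (size_t i = 0; i < nbytes; i++)
            for (int b = 0; b < 8; b++)
                bits[8 * i + b] = (buf[i] & (0x80u >> b)) ? '1' : '0';
        bits[nbits] = '\0';

        size_t pos = 0;
        while (pos < nbits) {
            if      (strncmp(bits + pos, "10",   2) == 0) { fputc('A', out); pos += 2; }
            else if (strncmp(bits + pos, "11",   2) == 0) { fputc('T', out); pos += 2; }
            else if (strncmp(bits + pos, "01",   2) == 0) { fputc('C', out); pos += 2; }
            else if (strncmp(bits + pos, "001",  3) == 0) { fputc('G', out); pos += 3; }
            else if (strncmp(bits + pos, "0000", 4) == 0) { fputc('N', out); pos += 4; }
            else { /* "0001" = $: end of string, skip padding to the byte edge */
                fputc('\n', out);
                pos = (pos + 4 + 7) & ~(size_t)7;
            }
        }
        free(bits);
    }

Because the code is prefix-free and complete, the final else branch can only match the $ code on well-formed input. I would like to replace the character expansion and the strncmp calls with direct shifts and masks (or a lookup table) on the packed bytes.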
Thanks!