diff options
Diffstat (limited to 'content/posts/2019-01-03-encoding-binary-data-into-dna-sequence.md')
| -rw-r--r-- | content/posts/2019-01-03-encoding-binary-data-into-dna-sequence.md | 410 |
1 files changed, 0 insertions, 410 deletions
diff --git a/content/posts/2019-01-03-encoding-binary-data-into-dna-sequence.md b/content/posts/2019-01-03-encoding-binary-data-into-dna-sequence.md deleted file mode 100644 index f0343ae..0000000 --- a/content/posts/2019-01-03-encoding-binary-data-into-dna-sequence.md +++ /dev/null | |||
| @@ -1,410 +0,0 @@ | |||
| 1 | --- | ||
| 2 | title: Encoding binary data into DNA sequence | ||
| 3 | url: encoding-binary-data-into-dna-sequence.html | ||
| 4 | date: 2019-01-03T12:00:00+02:00 | ||
| 5 | draft: false | ||
| 6 | --- | ||
| 7 | |||
| 8 | ## Initial thoughts | ||
| 9 | |||
| 10 | Imagine a world where you could go outside and take a leaf from a tree and put | ||
| 11 | it through your personal DNA sequencer and get data like music, videos or | ||
| 12 | computer programs from it. Well, this is all possible now. It was not done on a | ||
| 13 | large scale because it is quite expensive to create DNA strands but it's | ||
| 14 | possible. | ||
| 15 | |||
| 16 | Encoding data into DNA sequence is relatively simple process once you understand | ||
| 17 | the relationship between binary data and nucleotides and scientists have been | ||
| 18 | making large leaps in this field in order to provide viable long-term storage | ||
| 19 | solution for our data that would potentially survive our specie if case of | ||
| 20 | global disaster. We could imprint all the world's knowledge into plants and | ||
| 21 | ensure the survival of our knowledge. | ||
| 22 | |||
| 23 | More optimistic usage for this technology would be easier storage of ever | ||
| 24 | growing data we produce every day. Once machines for sequencing DNA become fast | ||
| 25 | enough and cheaper this could mean the next evolution of storing data and | ||
| 26 | abandoning classical hard and solid state drives in data warehouses. | ||
| 27 | |||
| 28 | As we currently stand this is still not viable but it is quite an amazing and | ||
| 29 | cool technology. | ||
| 30 | |||
| 31 | My interests in this field are purely in encoding processes and experimental | ||
| 32 | testing mainly because I don't have the access to this expensive machines. My | ||
| 33 | initial goal was to create a toolkit that can be used by everybody to encode | ||
| 34 | their data into a proper DNA sequence. | ||
| 35 | |||
| 36 | ## Glossary | ||
| 37 | |||
| 38 | **deoxyribose** A five-carbon sugar molecule with a hydrogen atom rather than a | ||
| 39 | hydroxyl group in the 2′ position; the sugar component of DNA nucleotides. | ||
| 40 | |||
| 41 | **double helix** The molecular shape of DNA in which two strands of nucleotides | ||
| 42 | wind around each other in a spiral shape. | ||
| 43 | |||
| 44 | **nitrogenous base** A nitrogen-containing molecule that acts as a base; often | ||
| 45 | referring to one of the purine or pyrimidine components of nucleic acids. | ||
| 46 | |||
| 47 | **phosphate group** A molecular group consisting of a central phosphorus atom | ||
| 48 | bound to four oxygen atoms. | ||
| 49 | |||
| 50 | **RGB** The RGB color model is an additive color model in which red, green and | ||
| 51 | blue light are added together in various ways to reproduce a broad array of | ||
| 52 | colors. | ||
| 53 | |||
| 54 | **GCC** The GNU Compiler Collection is a compiler system produced by the GNU | ||
| 55 | Project supporting various programming languages. | ||
| 56 | |||
| 57 | ## Data encoding | ||
| 58 | |||
| 59 | **TL;DR:** Encoding involves the use of a code to change original data into a | ||
| 60 | form that can be used by an external process. | ||
| 61 | |||
| 62 | Encoding is the process of converting data into a format required for a number | ||
| 63 | of information processing needs, including: | ||
| 64 | |||
| 65 | - Program compiling and execution | ||
| 66 | - Data transmission, storage and compression/decompression | ||
| 67 | - Application data processing, such as file conversion | ||
| 68 | |||
| 69 | Encoding can have two meanings: | ||
| 70 | |||
| 71 | - In computer technology, encoding is the process of applying a specific code, | ||
| 72 | such as letters, symbols and numbers, to data for conversion into an | ||
| 73 | equivalent cipher. | ||
| 74 | - In electronics, encoding refers to analog to digital conversion. | ||
| 75 | |||
| 76 | ## Quick history of DNA | ||
| 77 | |||
| 78 | - **1869** - Friedrich Miescher identifies "nuclein". | ||
| 79 | - **1900s** - The Eugenics Movement. | ||
| 80 | - **1900** – Mendel's theories are rediscovered by researchers. | ||
| 81 | - **1944** - Oswald Avery identifies DNA as the 'transforming principle'. | ||
| 82 | - **1952** - Rosalind Franklin photographs crystallized DNA fibres. | ||
| 83 | - **1953** - James Watson and Francis Crick discover the double helix structure of DNA. | ||
| 84 | - **1965** - Marshall Nirenberg is the first person to sequence the bases in each codon. | ||
| 85 | - **1983** - Huntington's disease is the first mapped genetic disease. | ||
| 86 | - **1990** - The Human Genome Project begins. | ||
| 87 | - **1995** - Haemophilus Influenzae is the first bacterium genome sequenced. | ||
| 88 | - **1996** - Dolly the sheep is cloned. | ||
| 89 | - **1999** - First human chromosome is decoded. | ||
| 90 | - **2000** – Genetic code of the fruit fly is decoded. | ||
| 91 | - **2002** – Mouse is the first mammal to have its genome decoded. | ||
| 92 | - **2003** – The Human Genome Project is completed. | ||
| 93 | - **2013** – DNA Worldwide and Eurofins Forensic discover identical twins have differences in their genetic makeup. | ||
| 94 | |||
| 95 | ## What is DNA? | ||
| 96 | |||
| 97 | Deoxyribonucleic acid, a self-replicating material which is **present in nearly | ||
| 98 | all living organisms** as the main constituent of chromosomes. It is the | ||
| 99 | **carrier of genetic information**. | ||
| 100 | |||
| 101 | > The nitrogen in our DNA, the calcium in our teeth, the iron in our blood, | ||
| 102 | > the carbon in our apple pies were made in the interiors of collapsing stars. | ||
| 103 | > We are made of starstuff. | ||
| 104 | > **-- Carl Sagan, Cosmos** | ||
| 105 | |||
| 106 | The nucleotide in DNA consists of a sugar (deoxyribose), one of four bases | ||
| 107 | (cytosine (C), thymine (T), adenine (A), guanine (G)), and a phosphate. | ||
| 108 | Cytosine and thymine are pyrimidine bases, while adenine and guanine are purine | ||
| 109 | bases. The sugar and the base together are called a nucleoside. | ||
| 110 | |||
| 111 |  | ||
| 112 | |||
| 113 | *DNA (a) forms a double stranded helix, and (b) adenine pairs with thymine and | ||
| 114 | cytosine pairs with guanine. (credit a: modification of work by Jerome Walker, | ||
| 115 | Dennis Myts)* | ||
| 116 | |||
| 117 | ## Encode binary data into DNA sequence | ||
| 118 | |||
| 119 | As an input file you can use any file you want: | ||
| 120 | |||
| 121 | - ASCII files, | ||
| 122 | - Compiled programs, | ||
| 123 | - Multimedia files (MP3, MP4, MVK, etc), | ||
| 124 | - Images, | ||
| 125 | - Database files, | ||
| 126 | - etc. | ||
| 127 | |||
| 128 | Note: If you would copy all the bytes from RAM to file or pipe data to file you | ||
| 129 | could encode also this data as long as you provide file pointer to the encoder. | ||
| 130 | |||
| 131 | ### Basic Encoding | ||
| 132 | |||
| 133 | As already mentioned, the Basic Encoding is based on a simple mapping. Since DNA | ||
| 134 | is composed of 4 nucleotides (Adenine, Cytosine, Guanine, Thymine; usually | ||
| 135 | referred using the first letter). Using this technique we can encode | ||
| 136 | |||
| 137 | $$ log_2(4) = log_2(2^2) = 2 bits $$ | ||
| 138 | |||
| 139 | using a single nucleotide. In this way, we are able to use the 4 bases that | ||
| 140 | compose the DNA strand to encode each byte of data. | ||
| 141 | |||
| 142 | | Two bits | Nucleotides | | ||
| 143 | | -------- | ---------------- | | ||
| 144 | | 00 | **A** (Adenine) | | ||
| 145 | | 10 | **G** (Guanine) | | ||
| 146 | | 01 | **C** (Cytosine) | | ||
| 147 | | 11 | **T** (Thymine) | | ||
| 148 | |||
| 149 | With this in mind we can simply encode any data by using two-bit to Nucleotides | ||
| 150 | conversion. | ||
| 151 | |||
| 152 | ```python | ||
| 153 | { Algorithm 1: Naive byte array to DNA encode } | ||
| 154 | procedure EncodeToDNASequence(f) string | ||
| 155 | begin | ||
| 156 | enc string | ||
| 157 | while not eof(f) do | ||
| 158 | c byte := buffer[0] { Read 1 byte from buffer } | ||
| 159 | bin integer := sprintf('08b', c) { Convert to string binary } | ||
| 160 | for e in range[0, 2, 4, 6] do | ||
| 161 | if e[0] == 48 and e[1] == 48 then { 0x00 - A (Adenine) } | ||
| 162 | enc += 'A' | ||
| 163 | else if e[0] == 48 and e[1] == 49 then { 0x01 - G (Guanine) } | ||
| 164 | enc += 'G' | ||
| 165 | else if e[0] == 49 and e[1] == 48 then { 0x10 - C (Cytosine) } | ||
| 166 | enc += 'C' | ||
| 167 | else if e[0] == 49 and e[1] == 49 then { 0x11 - T (Thymine) } | ||
| 168 | enc += 'T' | ||
| 169 | return enc { Return DNA sequence } | ||
| 170 | end | ||
| 171 | ``` | ||
| 172 | |||
| 173 | Another encoding would be **Goldman encoding**. Using this encoding helps with | ||
| 174 | Nonsense mutation (amino acids replaced by a stop codon) that occurs and is the | ||
| 175 | most problematic during translation because it leads to truncated amino acid | ||
| 176 | sequences, which in turn results in truncated proteins. | ||
| 177 | |||
| 178 | [Where to store big data? In DNA: Nick Goldman at TEDxPrague](https://www.youtube.com/watch?v=a4PiGWNsIEU) | ||
| 179 | |||
| 180 | ### FASTA file format | ||
| 181 | |||
| 182 | In bioinformatics, FASTA format is a text-based format for representing either | ||
| 183 | nucleotide sequences or peptide sequences, in which nucleotides or amino acids | ||
| 184 | are represented using single-letter codes. The format also allows for sequence | ||
| 185 | names and comments to precede the sequences. The format originates from the | ||
| 186 | FASTA software package, but has now become a standard in the field of | ||
| 187 | bioinformatics. | ||
| 188 | |||
| 189 | The first line in a FASTA file started either with a ">" (greater-than) symbol | ||
| 190 | or, less frequently, a ";" (semicolon) was taken as a comment. Subsequent lines | ||
| 191 | starting with a semicolon would be ignored by software. Since the only comment | ||
| 192 | used was the first, it quickly became used to hold a summary description of the | ||
| 193 | sequence, often starting with a unique library accession number, and with time | ||
| 194 | it has become commonplace to always use ">" for the first line and to not use | ||
| 195 | ";" comments (which would otherwise be ignored). | ||
| 196 | |||
| 197 | ``` | ||
| 198 | ;LCBO - Prolactin precursor - Bovine | ||
| 199 | ; a sample sequence in FASTA format | ||
| 200 | MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFDRAVMVSHYIHDLSS | ||
| 201 | EMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKEQAQQTHHEVLMSLILGLLRSWNDPLYHL | ||
| 202 | VTEVRGMKGAPDAILSRAIEIEEENKRLLEGMEMIFGQVIPGAKETEPYPVWSGLPSLQTKDED | ||
| 203 | ARYSAFYNLLHCLRRDSSKIDTYLKLLNCRIIYNNNC* | ||
| 204 | |||
| 205 | >MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken | ||
| 206 | ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID | ||
| 207 | FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA | ||
| 208 | DIDGDGQVNYEEFVQMMTAK* | ||
| 209 | |||
| 210 | >gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus] | ||
| 211 | LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV | ||
| 212 | EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG | ||
| 213 | LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL | ||
| 214 | GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX | ||
| 215 | IENY | ||
| 216 | ``` | ||
| 217 | |||
| 218 | FASTA format was extended by [FASTQ](https://en.wikipedia.org/wiki/FASTQ_format) | ||
| 219 | format from the [Sanger Centre](https://www.sanger.ac.uk/) in Cambridge. | ||
| 220 | |||
| 221 | ### PNG encoded DNA sequence | ||
| 222 | |||
| 223 | | Nucleotides | RGB | Color name | | ||
| 224 | | ------------ | ----------- | ---------- | | ||
| 225 | | A ➞ Adenine | (0,0,255) | Blue | | ||
| 226 | | G ➞ Guanine | (0,100,0) | Green | | ||
| 227 | | C ➞ Cytosine | (255,0,0) | Red | | ||
| 228 | | T ➞ Thymine | (255,255,0) | Yellow | | ||
| 229 | |||
| 230 | With this in mind we can create a simple algorithm to create PNG representation | ||
| 231 | of a DNA sequence. | ||
| 232 | |||
| 233 | ```python | ||
| 234 | { Algorithm 2: Naive DNA to PNG encode from FASTA file } | ||
| 235 | procedure EncodeDNASequenceToPNG(f) | ||
| 236 | begin | ||
| 237 | i image | ||
| 238 | while not eof(f) do | ||
| 239 | c char := buffer[0] { Read 1 char from buffer } | ||
| 240 | case c of | ||
| 241 | 'A': color := RGB(0, 0, 255) { Blue } | ||
| 242 | 'G': color := RGB(0, 100, 0) { Green } | ||
| 243 | 'C': color := RGB(255, 0, 0) { Red } | ||
| 244 | 'T': color := RGB(255, 255, 0) { Yellow } | ||
| 245 | drawRect(i, [x, y], color) | ||
| 246 | save(i) { Save PNG image } | ||
| 247 | end | ||
| 248 | ``` | ||
| 249 | |||
| 250 | ## Encoding text file in practice | ||
| 251 | |||
| 252 | In this example we will take a simple text file as our input stream for | ||
| 253 | encoding. This file will have a quote from Niels Bohr and saved as txt file. | ||
| 254 | |||
| 255 | > How wonderful that we have met with a paradox. Now we have some hope of | ||
| 256 | > making progress. | ||
| 257 | > ― Niels Bohr | ||
| 258 | |||
| 259 | First we encode text file into FASTA file. | ||
| 260 | |||
| 261 | ```bash | ||
| 262 | ./dnae-encode -i quote.txt -o quote.fa | ||
| 263 | 2019/01/10 00:38:29 Gathering input file stats | ||
| 264 | 2019/01/10 00:38:29 Starting encoding ... | ||
| 265 | 106 B / 106 B [==================================] 100.00% 0s | ||
| 266 | 2019/01/10 00:38:29 Saving to FASTA file ... | ||
| 267 | 2019/01/10 00:38:29 Output FASTA file length is 438 B | ||
| 268 | 2019/01/10 00:38:29 Process took 987.263µs | ||
| 269 | 2019/01/10 00:38:29 Done ... | ||
| 270 | ``` | ||
| 271 | |||
| 272 | Output of `quote.fa` file contains the encoded DNA sequence in ASCII format. | ||
| 273 | |||
| 274 | ``` | ||
| 275 | >SEQ1 | ||
| 276 | GACAGCTTGTGTACAAGTGTGCTTGCTCGCGAGCGGGTACGCGCGTGGGCTAACAAGTGA | ||
| 277 | GCCAGCAGGTGAACAAGTGTGCGGACAAGCCAGCAGGTGCGCGGACAAGCTGGCGGGTGA | ||
| 278 | ACAAGTGTGCCGGTGAGCCAACAAGCAGACAAGTAAGCAGGTACGCAGGCGAGCTTGTCA | ||
| 279 | ACTCACAAGATCGCTTGTGTACAAGTGTGCGGACAAGCCAGCAGGTGCGCGGACAAGTAT | ||
| 280 | GCTTGCTGGCGGACAAGCCAGCTTGTAAGCGGACAAGCTTGCGCACAAGCTGGCAGGCCT | ||
| 281 | GCCGGCTCGCGTACAAATTCACAAGTAAGTACGCTTGCGTGTACGCGGGTATGTATACTC | ||
| 282 | AACCTCACCAAACGGGACAAGATCGCCGGCGGGCTAGTATACAAGAACGCTTGCCAGTAC | ||
| 283 | AACC | ||
| 284 | ``` | ||
| 285 | |||
| 286 | Then we encode FASTA file from previous operation to encode this data into PNG. | ||
| 287 | |||
| 288 | ```bash | ||
| 289 | ./dnae-png -i quote.fa -o quote.png | ||
| 290 | 2019/01/10 00:40:09 Gathering input file stats ... | ||
| 291 | 2019/01/10 00:40:09 Deconstructing FASTA file ... | ||
| 292 | 2019/01/10 00:40:09 Compositing image file ... | ||
| 293 | 424 / 424 [==================================] 100.00% 0s | ||
| 294 | 2019/01/10 00:40:09 Saving output file ... | ||
| 295 | 2019/01/10 00:40:09 Output image file length is 1.1 kB | ||
| 296 | 2019/01/10 00:40:09 Process took 19.036117ms | ||
| 297 | 2019/01/10 00:40:09 Done ... | ||
| 298 | ``` | ||
| 299 | |||
| 300 | After encoding into PNG format this file looks like this. | ||
| 301 | |||
| 302 |  | ||
| 303 | |||
| 304 | The larger the input stream is the larger the PNG file would be. | ||
| 305 | |||
| 306 | Compiled basic Hello World C program with | ||
| 307 | [GCC](https://www.gnu.org/software/gcc/) would [look | ||
| 308 | like](/assets/dna-sequence/sample.png). | ||
| 309 | |||
| 310 | ```c | ||
| 311 | // gcc -O3 -o sample sample.c | ||
| 312 | #include <stdio.h> | ||
| 313 | |||
| 314 | main() { | ||
| 315 | printf("Hello, world!\n"); | ||
| 316 | return 0; | ||
| 317 | } | ||
| 318 | ``` | ||
| 319 | |||
| 320 | ## Toolkit for encoding data | ||
| 321 | |||
| 322 | I have created a toolkit with two main programs: | ||
| 323 | |||
| 324 | - dnae-encode (encodes file into FASTA file) | ||
| 325 | - dnae-png (encodes FASTA file into PNG) | ||
| 326 | |||
| 327 | Toolkit with full source code is available on | ||
| 328 | [github.com/mitjafelicijan/dna-encoding](https://github.com/mitjafelicijan/dna-encoding). | ||
| 329 | |||
| 330 | ### dnae-encode | ||
| 331 | |||
| 332 | ```bash | ||
| 333 | > ./dnae-encode --help | ||
| 334 | usage: dnae-encode --input=INPUT [<flags>] | ||
| 335 | |||
| 336 | A command-line application that encodes file into DNA sequence. | ||
| 337 | |||
| 338 | Flags: | ||
| 339 | --help Show context-sensitive help (also try --help-long and --help-man). | ||
| 340 | -i, --input=INPUT Input file (ASCII or binary) which will be encoded into DNA sequence. | ||
| 341 | -o, --output="out.fa" Output file which stores DNA sequence in FASTA format. | ||
| 342 | -s, --sequence=SEQ1 The description line (defline) or header/identifier line, gives a name and/or a unique identifier for the sequence. | ||
| 343 | -c, --columns=60 Row characters length (no more than 120 characters). Devices preallocate fixed line sizes in software. | ||
| 344 | --version Show application version. | ||
| 345 | ``` | ||
| 346 | |||
| 347 | ### dnae-png | ||
| 348 | |||
| 349 | ```bash | ||
| 350 | > ./dnae-png --help | ||
| 351 | usage: dnae-png --input=INPUT [<flags>] | ||
| 352 | |||
| 353 | A command-line application that encodes FASTA file into PNG image. | ||
| 354 | |||
| 355 | Flags: | ||
| 356 | --help Show context-sensitive help (also try --help-long and --help-man). | ||
| 357 | -i, --input=INPUT Input FASTA file which will be encoded into PNG image. | ||
| 358 | -o, --output="out.png" Output file in PNG format that represents DNA sequence in graphical way. | ||
| 359 | -s, --size=10 Size of pairings of DNA bases on image in pixels (lower resolution lower file size). | ||
| 360 | --version Show application version. | ||
| 361 | ``` | ||
| 362 | |||
| 363 | ## Benchmarks | ||
| 364 | |||
| 365 | First we generate some binary sample data with dd. | ||
| 366 | |||
| 367 | ```bash | ||
| 368 | dd if=<(openssl enc -aes-256-ctr -pass pass:"$(dd if=/dev/urandom bs=128 count=1 2>/dev/null | base64)" -nosalt < /dev/zero) of=1KB.bin bs=1KB count=1 iflag=fullblock | ||
| 369 | ``` | ||
| 370 | |||
| 371 | Our freshly generated 1KB file looks something like this (its full of garbage | ||
| 372 | data as intended). | ||
| 373 | |||
| 374 |  | ||
| 375 | |||
| 376 | We create following binary files: | ||
| 377 | |||
| 378 | - 1KB.bin | ||
| 379 | - 10KB.bin | ||
| 380 | - 100KB.bin | ||
| 381 | - 1MB.bin | ||
| 382 | - 10MB.bin | ||
| 383 | - 100MB.bin | ||
| 384 | |||
| 385 | After this we create FASTA files for all the binary files by encoding them | ||
| 386 | into DNA sequence. | ||
| 387 | |||
| 388 | ```bash | ||
| 389 | ./dnae-encode -i 100MB.bin -o 100MB.fa | ||
| 390 | ``` | ||
| 391 | |||
| 392 | Then we GZIP all the FASTA files to see how much the can be compressed. | ||
| 393 | |||
| 394 | ```bash | ||
| 395 | gzip -9 < 10MB.fa > 10MB.fa.gz | ||
| 396 | ``` | ||
| 397 | |||
| 398 | [Download ODS file with benchmarks](/dna-sequence/benchmarks.ods). | ||
| 399 | |||
| 400 |  | ||
| 401 | |||
| 402 |  | ||
| 403 | |||
| 404 | ## References | ||
| 405 | |||
| 406 | - https://www.techopedia.com/definition/948/encoding | ||
| 407 | - https://www.dna-worldwide.com/resource/160/history-dna-timeline | ||
| 408 | - https://opentextbc.ca/biology/chapter/9-1-the-structure-of-dna/ | ||
| 409 | - https://arxiv.org/abs/1801.04774 | ||
| 410 | - https://en.wikipedia.org/wiki/FASTA_format | ||
