diff --git a/content/posts/2019-01-03-encoding-binary-data-into-dna-sequence.md b/content/posts/2019-01-03-encoding-binary-data-into-dna-sequence.md
deleted file mode 100644
index f0343ae..0000000
--- a/content/posts/2019-01-03-encoding-binary-data-into-dna-sequence.md
+++ /dev/null
@@ -1,410 +0,0 @@
---
title: Encoding binary data into DNA sequence
url: encoding-binary-data-into-dna-sequence.html
date: 2019-01-03T12:00:00+02:00
draft: false
---

## Initial thoughts

Imagine a world where you could go outside, take a leaf from a tree, put it
through your personal DNA sequencer and get data like music, videos or
computer programs from it. Well, this is all possible now. It has not been done
on a large scale because creating DNA strands is still quite expensive, but
it's possible.

Encoding data into a DNA sequence is a relatively simple process once you
understand the relationship between binary data and nucleotides. Scientists
have been making large leaps in this field in order to provide a viable
long-term storage solution for our data, one that could potentially survive our
species in case of a global disaster. We could imprint all the world's
knowledge into plants and ensure the survival of our knowledge.

A more optimistic use for this technology would be easier storage of the
ever-growing data we produce every day. Once machines for sequencing DNA become
fast enough and cheaper, this could mean the next evolution of data storage and
the abandonment of classical hard and solid state drives in data warehouses.

As we currently stand this is still not viable, but it is quite an amazing and
cool technology.

My interests in this field are purely in the encoding processes and
experimental testing, mainly because I don't have access to these expensive
machines. My initial goal was to create a toolkit that anybody can use to
encode their data into a proper DNA sequence.

## Glossary

**deoxyribose** A five-carbon sugar molecule with a hydrogen atom rather than a
hydroxyl group in the 2′ position; the sugar component of DNA nucleotides.

**double helix** The molecular shape of DNA in which two strands of nucleotides
wind around each other in a spiral shape.

**nitrogenous base** A nitrogen-containing molecule that acts as a base; often
referring to one of the purine or pyrimidine components of nucleic acids.

**phosphate group** A molecular group consisting of a central phosphorus atom
bound to four oxygen atoms.

**RGB** The RGB color model is an additive color model in which red, green and
blue light are added together in various ways to reproduce a broad array of
colors.

**GCC** The GNU Compiler Collection is a compiler system produced by the GNU
Project supporting various programming languages.

## Data encoding

**TL;DR:** Encoding involves the use of a code to change original data into a
form that can be used by an external process.

Encoding is the process of converting data into a format required for a number
of information processing needs, including:

- Program compiling and execution
- Data transmission, storage and compression/decompression
- Application data processing, such as file conversion

Encoding can have two meanings:

- In computer technology, encoding is the process of applying a specific code,
  such as letters, symbols and numbers, to data for conversion into an
  equivalent cipher.
- In electronics, encoding refers to analog to digital conversion.

## Quick history of DNA

- **1869** - Friedrich Miescher identifies "nuclein".
- **1900s** - The Eugenics Movement.
- **1900** - Mendel's theories are rediscovered by researchers.
- **1944** - Oswald Avery identifies DNA as the "transforming principle".
- **1952** - Rosalind Franklin photographs crystallized DNA fibres.
- **1953** - James Watson and Francis Crick discover the double helix structure of DNA.
- **1965** - Marshall Nirenberg is the first person to sequence the bases in each codon.
- **1983** - Huntington's disease is the first mapped genetic disease.
- **1990** - The Human Genome Project begins.
- **1995** - Haemophilus influenzae is the first bacterium genome sequenced.
- **1996** - Dolly the sheep is cloned.
- **1999** - First human chromosome is decoded.
- **2000** - Genetic code of the fruit fly is decoded.
- **2002** - Mouse is the first mammal to have its genome decoded.
- **2003** - The Human Genome Project is completed.
- **2013** - DNA Worldwide and Eurofins Forensic discover identical twins have differences in their genetic makeup.

## What is DNA?

Deoxyribonucleic acid (DNA) is a self-replicating material which is **present
in nearly all living organisms** as the main constituent of chromosomes. It is
the **carrier of genetic information**.

> The nitrogen in our DNA, the calcium in our teeth, the iron in our blood,
> the carbon in our apple pies were made in the interiors of collapsing stars.
> We are made of starstuff.
> **-- Carl Sagan, Cosmos**

A nucleotide in DNA consists of a sugar (deoxyribose), one of four bases
(cytosine (C), thymine (T), adenine (A), guanine (G)), and a phosphate.
Cytosine and thymine are pyrimidine bases, while adenine and guanine are purine
bases. The sugar and the base together are called a nucleoside.

![DNA](/assets/dna-sequence/dna-basics.jpg)

*DNA (a) forms a double stranded helix, and (b) adenine pairs with thymine and
cytosine pairs with guanine. (credit a: modification of work by Jerome Walker,
Dennis Myts)*

## Encode binary data into DNA sequence

As an input file you can use any file you want:

- ASCII files,
- Compiled programs,
- Multimedia files (MP3, MP4, MKV, etc),
- Images,
- Database files,
- etc.

Note: If you copy all the bytes from RAM to a file, or pipe data to a file, you
can encode that data as well, as long as you provide a file pointer to the
encoder.

### Basic Encoding

The Basic Encoding is based on a simple mapping. DNA is composed of 4
nucleotides (Adenine, Cytosine, Guanine, Thymine; usually referred to by their
first letter), so a single nucleotide can encode

$$ \log_2(4) = \log_2(2^2) = 2 \text{ bits} $$

of data. In this way, the 4 bases that compose a DNA strand encode each byte of
data as four nucleotides.

| Two bits | Nucleotides      |
| -------- | ---------------- |
| 00       | **A** (Adenine)  |
| 01       | **G** (Guanine)  |
| 10       | **C** (Cytosine) |
| 11       | **T** (Thymine)  |

With this in mind we can encode any data with a simple two-bit to nucleotide
conversion.

```python
# Algorithm 1: Naive byte array to DNA encode
BITS_TO_NUCLEOTIDE = {
    "00": "A",  # Adenine
    "01": "G",  # Guanine
    "10": "C",  # Cytosine
    "11": "T",  # Thymine
}


def encode_to_dna_sequence(f):
    """Encode a binary file object into a DNA sequence string."""
    enc = []
    while chunk := f.read(1):  # Read 1 byte at a time until EOF
        bits = format(chunk[0], "08b")  # Convert to an 8-character binary string
        for i in (0, 2, 4, 6):  # Walk the byte two bits at a time
            enc.append(BITS_TO_NUCLEOTIDE[bits[i:i + 2]])
    return "".join(enc)  # Return DNA sequence
```
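Since the mapping is one-to-one, it can be reversed to get the original bytes back. The following is a minimal sketch, not the toolkit's implementation: the `encode` and `decode` names are illustrative, and the mapping used is the one from Algorithm 1 (00 to A, 01 to G, 10 to C, 11 to T).

```python
# Two-bit groups mapped to nucleotides, as in Algorithm 1.
BITS_TO_BASE = {"00": "A", "01": "G", "10": "C", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Encode bytes as a DNA string, four nucleotides per byte."""
    out = []
    for byte in data:
        bits = format(byte, "08b")
        out.extend(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, 8, 2))
    return "".join(out)


def decode(sequence: str) -> bytes:
    """Decode a DNA string produced by encode() back into bytes."""
    if len(sequence) % 4 != 0:
        raise ValueError("sequence length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(sequence), 4):
        bits = "".join(BASE_TO_BITS[base] for base in sequence[i:i + 4])
        out.append(int(bits, 2))
    return bytes(out)


if __name__ == "__main__":
    dna = encode(b"How")
    print(dna)          # GACAGCTTGTGT
    print(decode(dna))  # b'How'
```

The letter `H` is `01001000` in binary, which splits into `01 00 10 00` and therefore becomes `GACA`.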

Another option is **Goldman encoding**. It converts the data to base 3 and
picks each nucleotide from the three that differ from the previous one, so the
same base never appears twice in a row. An encoding like this also helps with
nonsense mutations (an amino acid codon replaced by a stop codon), which are
the most problematic during translation because they lead to truncated amino
acid sequences, which in turn results in truncated proteins.

[Where to store big data? In DNA: Nick Goldman at TEDxPrague](https://www.youtube.com/watch?v=a4PiGWNsIEU)

### FASTA file format

In bioinformatics, FASTA format is a text-based format for representing either
nucleotide sequences or peptide sequences, in which nucleotides or amino acids
are represented using single-letter codes. The format also allows for sequence
names and comments to precede the sequences. The format originates from the
FASTA software package, but has now become a standard in the field of
bioinformatics.

Originally, the first line in a FASTA file started with either a ">"
(greater-than) symbol or, less frequently, a ";" (semicolon), and was taken as
a comment. Subsequent lines starting with a semicolon would be ignored by
software. Since the only comment used was the first, it quickly became used to
hold a summary description of the sequence, often starting with a unique
library accession number, and with time it has become commonplace to always use
">" for the first line and to not use ";" comments (which would otherwise be
ignored).

```
;LCBO - Prolactin precursor - Bovine
; a sample sequence in FASTA format
MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFDRAVMVSHYIHDLSS
EMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKEQAQQTHHEVLMSLILGLLRSWNDPLYHL
VTEVRGMKGAPDAILSRAIEIEEENKRLLEGMEMIFGQVIPGAKETEPYPVWSGLPSLQTKDED
ARYSAFYNLLHCLRRDSSKIDTYLKLLNCRIIYNNNC*

>MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
DIDGDGQVNYEEFVQMMTAK*

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY
```

FASTA format was extended by [FASTQ](https://en.wikipedia.org/wiki/FASTQ_format)
format from the [Sanger Centre](https://www.sanger.ac.uk/) in Cambridge.
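
To make the format concrete, here is a minimal sketch of writing and reading FASTA text in Python. It is an illustration under stated assumptions rather than a full parser: `write_fasta` and `read_fasta` are hypothetical helper names, only the modern convention is handled (a ">" header line, with ";" lines skipped), and the 60-column default is just a common line width.

```python
def write_fasta(name: str, sequence: str, columns: int = 60) -> str:
    """Return a FASTA record: a '>' header line, then wrapped sequence lines."""
    lines = [sequence[i:i + columns] for i in range(0, len(sequence), columns)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


def read_fasta(text: str) -> dict:
    """Parse FASTA text into {header: sequence}, skipping ';' comment lines."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";"):
            continue  # Blank lines and old-style comments carry no sequence data
        if line.startswith(">"):
            name = line[1:]  # Header without the leading '>'
            records[name] = ""
        elif name is not None:
            records[name] += line  # Sequence lines are simply concatenated
    return records


if __name__ == "__main__":
    record = write_fasta("SEQ1", "GACA" * 20)
    print(record, end="")
    print(read_fasta(record))
```

Records that use only the legacy ";" header, like the first sample above, are skipped by this sketch because they never open a ">" record.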

### PNG encoded DNA sequence

| Nucleotides  | RGB         | Color name |
| ------------ | ----------- | ---------- |
| A ➞ Adenine  | (0,0,255)   | Blue       |
| G ➞ Guanine  | (0,100,0)   | Green      |
| C ➞ Cytosine | (255,0,0)   | Red        |
| T ➞ Thymine  | (255,255,0) | Yellow     |

With this in mind we can write a simple algorithm that creates a PNG
representation of a DNA sequence.

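```python
# Algorithm 2: Naive DNA to PNG encode from FASTA file
# Assumes Pillow is installed; the grid width and square size are illustrative.
from PIL import Image, ImageDraw

COLORS = {
    "A": (0, 0, 255),    # Blue
    "G": (0, 100, 0),    # Green
    "C": (255, 0, 0),    # Red
    "T": (255, 255, 0),  # Yellow
}


def encode_dna_sequence_to_png(f, out_path, columns=60, size=10):
    """Draw one colored square per nucleotide read from a FASTA file object."""
    # Skip header and comment lines, keep only known nucleotide letters
    bases = [c for line in f if not line.startswith((">", ";"))
             for c in line.strip() if c in COLORS]
    rows = max(1, -(-len(bases) // columns))  # Ceiling division
    image = Image.new("RGB", (columns * size, rows * size), "white")
    draw = ImageDraw.Draw(image)
    for n, base in enumerate(bases):
        x, y = (n % columns) * size, (n // columns) * size
        draw.rectangle([x, y, x + size - 1, y + size - 1], fill=COLORS[base])
    image.save(out_path, "PNG")  # Save PNG image
```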

## Encoding text file in practice

In this example we will take a simple text file as our input stream for
encoding. The file contains a quote from Niels Bohr, saved as a txt file.

> How wonderful that we have met with a paradox. Now we have some hope of
> making progress.
> ― Niels Bohr

First we encode the text file into a FASTA file.

```bash
./dnae-encode -i quote.txt -o quote.fa
2019/01/10 00:38:29 Gathering input file stats
2019/01/10 00:38:29 Starting encoding ...
 106 B / 106 B [==================================] 100.00% 0s
2019/01/10 00:38:29 Saving to FASTA file ...
2019/01/10 00:38:29 Output FASTA file length is 438 B
2019/01/10 00:38:29 Process took 987.263µs
2019/01/10 00:38:29 Done ...
```

The output file `quote.fa` contains the encoded DNA sequence in ASCII format.

```
>SEQ1
GACAGCTTGTGTACAAGTGTGCTTGCTCGCGAGCGGGTACGCGCGTGGGCTAACAAGTGA
GCCAGCAGGTGAACAAGTGTGCGGACAAGCCAGCAGGTGCGCGGACAAGCTGGCGGGTGA
ACAAGTGTGCCGGTGAGCCAACAAGCAGACAAGTAAGCAGGTACGCAGGCGAGCTTGTCA
ACTCACAAGATCGCTTGTGTACAAGTGTGCGGACAAGCCAGCAGGTGCGCGGACAAGTAT
GCTTGCTGGCGGACAAGCCAGCTTGTAAGCGGACAAGCTTGCGCACAAGCTGGCAGGCCT
GCCGGCTCGCGTACAAATTCACAAGTAAGTACGCTTGCGTGTACGCGGGTATGTATACTC
AACCTCACCAAACGGGACAAGATCGCCGGCGGGCTAGTATACAAGAACGCTTGCCAGTAC
AACC
```
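
The sizes reported in the log can be checked by hand: 106 input bytes at four nucleotides per byte give 424 bases, which wrap into 8 lines of at most 60 characters. The sketch below reproduces that arithmetic. `fasta_size` is a hypothetical helper, and it assumes one-byte newlines, a newline after the last sequence line, and the `>SEQ1` header seen above.

```python
import math


def fasta_size(input_bytes: int, header: str = ">SEQ1", columns: int = 60) -> int:
    """Estimate the FASTA file size in bytes for the naive 2-bit encoding."""
    nucleotides = input_bytes * 4             # 8 bits per byte / 2 bits per base
    lines = math.ceil(nucleotides / columns)  # Number of wrapped sequence lines
    # Header plus its newline, one byte per base, one newline per sequence line
    return len(header) + 1 + nucleotides + lines


print(fasta_size(106))  # 438
```

The result matches the 438 B reported by `dnae-encode` for the 106 B quote file.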

Then we take the FASTA file from the previous step and encode it into a PNG.

```bash
./dnae-png -i quote.fa -o quote.png
2019/01/10 00:40:09 Gathering input file stats ...
2019/01/10 00:40:09 Deconstructing FASTA file ...
2019/01/10 00:40:09 Compositing image file ...
 424 / 424 [==================================] 100.00% 0s
2019/01/10 00:40:09 Saving output file ...
2019/01/10 00:40:09 Output image file length is 1.1 kB
2019/01/10 00:40:09 Process took 19.036117ms
2019/01/10 00:40:09 Done ...
```

After encoding into PNG format the file looks like this.

![Encoded Quote in PNG format](/assets/dna-sequence/quote.png)

The larger the input stream, the larger the PNG file will be.

A basic Hello World C program compiled with
[GCC](https://www.gnu.org/software/gcc/) would [look
like this](/assets/dna-sequence/sample.png).

```c
// gcc -O3 -o sample sample.c
#include <stdio.h>

int main(void) {
    printf("Hello, world!\n");
    return 0;
}
```

## Toolkit for encoding data

I have created a toolkit with two main programs:

- dnae-encode (encodes file into FASTA file)
- dnae-png (encodes FASTA file into PNG)

Toolkit with full source code is available on
[github.com/mitjafelicijan/dna-encoding](https://github.com/mitjafelicijan/dna-encoding).

### dnae-encode

```bash
> ./dnae-encode --help
usage: dnae-encode --input=INPUT [<flags>]

A command-line application that encodes file into DNA sequence.

Flags:
      --help             Show context-sensitive help (also try --help-long and --help-man).
  -i, --input=INPUT      Input file (ASCII or binary) which will be encoded into DNA sequence.
  -o, --output="out.fa"  Output file which stores DNA sequence in FASTA format.
  -s, --sequence=SEQ1    The description line (defline) or header/identifier line, gives a name and/or a unique identifier for the sequence.
  -c, --columns=60       Row characters length (no more than 120 characters). Devices preallocate fixed line sizes in software.
      --version          Show application version.
```

### dnae-png

```bash
> ./dnae-png --help
usage: dnae-png --input=INPUT [<flags>]

A command-line application that encodes FASTA file into PNG image.

Flags:
      --help              Show context-sensitive help (also try --help-long and --help-man).
  -i, --input=INPUT       Input FASTA file which will be encoded into PNG image.
  -o, --output="out.png"  Output file in PNG format that represents DNA sequence in graphical way.
  -s, --size=10           Size of pairings of DNA bases on image in pixels (lower resolution lower file size).
      --version           Show application version.
```

## Benchmarks

First we generate some binary sample data with dd.

```bash
dd if=<(openssl enc -aes-256-ctr -pass pass:"$(dd if=/dev/urandom bs=128 count=1 2>/dev/null | base64)" -nosalt < /dev/zero) of=1KB.bin bs=1KB count=1 iflag=fullblock
```

Our freshly generated 1KB file looks something like this (it's full of garbage
data, as intended).

![Sample binary file 1KB](/assets/dna-sequence/sample-binary-file.png)

We create the following binary files:

- 1KB.bin
- 10KB.bin
- 100KB.bin
- 1MB.bin
- 10MB.bin
- 100MB.bin

After this we create FASTA files for all the binary files by encoding them
into DNA sequences.

```bash
./dnae-encode -i 100MB.bin -o 100MB.fa
```

Then we GZIP all the FASTA files to see how much they can be compressed.

```bash
gzip -9 < 10MB.fa > 10MB.fa.gz
```
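
The same comparison can be tried on a small scale without the toolkit. This is a rough sketch under stated assumptions, not a reproduction of the benchmark: the `encode` helper repeats the two-bit mapping from Algorithm 1, the random sample stands in for the `dd` output, and the seed and 10,000-byte size are arbitrary. Because every nucleotide carries 2 bits but is stored as an 8-bit character, the DNA text should compress far better than the random bytes it came from.

```python
import gzip
import random

BITS_TO_BASE = {"00": "A", "01": "G", "10": "C", "11": "T"}


def encode(data: bytes) -> str:
    """Naive 2-bit encoding: four nucleotides per input byte."""
    return "".join(
        BITS_TO_BASE[format(byte, "08b")[i:i + 2]]
        for byte in data
        for i in range(0, 8, 2)
    )


def gzip_ratio(payload: bytes) -> float:
    """Compressed size divided by original size at the highest gzip level."""
    return len(gzip.compress(payload, compresslevel=9)) / len(payload)


rng = random.Random(0)  # Fixed seed so the run is repeatable
sample = bytes(rng.getrandbits(8) for _ in range(10_000))
dna_text = encode(sample).encode("ascii")

print(f"raw binary: {gzip_ratio(sample):.2f}")
print(f"DNA text:   {gzip_ratio(dna_text):.2f}")
```

A ratio above 1.0 for the raw binary simply means gzip could not shrink random data and added its own header overhead.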

[Download ODS file with benchmarks](/dna-sequence/benchmarks.ods).

![Benchmark chart 1](/assets/dna-sequence/chart-1.png)

![Benchmark chart 2](/assets/dna-sequence/chart-2.png)

## References

- https://www.techopedia.com/definition/948/encoding
- https://www.dna-worldwide.com/resource/160/history-dna-timeline
- https://opentextbc.ca/biology/chapter/9-1-the-structure-of-dna/
- https://arxiv.org/abs/1801.04774
- https://en.wikipedia.org/wiki/FASTA_format