| author | Mitja Felicijan <m@mitjafelicijan.com> | 2023-07-08 23:25:41 +0200 |
|---|---|---|
| committer | Mitja Felicijan <m@mitjafelicijan.com> | 2023-07-08 23:25:41 +0200 |
| commit | cd6644ea4ddc78597934ab0ef5ba50e3c3daa927 (patch) | |
| tree | 03de331a8db6386dfd6fa75155bfbcea6b4feaf3 /public/encoding-binary-data-into-dna-sequence.html | |
| parent | 84ed124529ffeee1590295b8de3a8faf51848680 (diff) | |
| download | mitjafelicijan.com-cd6644ea4ddc78597934ab0ef5ba50e3c3daa927.tar.gz | |
Moved to a simpler SSG
Diffstat (limited to 'public/encoding-binary-data-into-dna-sequence.html')
| -rwxr-xr-x | public/encoding-binary-data-into-dna-sequence.html | 186 |
1 files changed, 186 insertions, 0 deletions
diff --git a/public/encoding-binary-data-into-dna-sequence.html b/public/encoding-binary-data-into-dna-sequence.html new file mode 100755 index 0000000..8da76da --- /dev/null +++ b/public/encoding-binary-data-into-dna-sequence.html | |||
| @@ -0,0 +1,186 @@ | |||
| 1 | <!doctype html><html lang=en-us><meta charset=utf-8><meta name=viewport content="width=device-width,initial-scale=1"><link href="data:image/x-icon;base64,AAABAAEAEBAAAAEAIABoBAAAFgAAACgAAAAQAAAAIAAAAAEAIAAAAAAAAAQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAL69vf8AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAv76+/8LBwQkAAAAAAAAAAAAAAAC+vb3/AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAL+9vf/Bv78JAAAAAAAAAAAAAAAAu7q6/wAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC7ubr/vr29CAAAAAAAAAAAy8nJAZ6foP8AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAnqGj/6GipAoAAAAAHLjU/xcXHf/BwsL/I8XY/yPK3v8XGiD/IbjL/yPF2f8XGiD/Fxkf/yLF2f8gnK3/Fxog/62ztv8fwNf/FRcd/x271v8mz93/GRsi/xkXHf8p097/GiIp/xobIv8p0t3/KdPe/xocIv8fYmr/KNPe/xoZH/8aHCL/J87c/xy81/8VFxz/IsPZ/8zS0/8XGiD/Ir/R/yPH2/8XGiD/Fxkf/yPH2/8dd4T/GBog/yPJ3f8jyNr/uru9/xcUGv8cudb/EhITDKi5vRKlvMP/RUpOERwcHRAdOj4QHTk8EBwdHRAdNTgQHTo/EBwcHRAcHB0QSGduEKW4vf+koqQfHzg+EBqz0ewSFRv7EyMr/xq51vsTERb7ExUb+xq41fsau9j7ExUb+xiPp/sZudb7ExUb+xMVG/sZuNX/GKvI/BIUGfMdvdn/IrfL/xcaIP8n1eb/J9Dh/xkcIf8ZGR7/J8/f/xxCSv8ZGyH/J9Dg/ybQ4P8ZHCL/FSQs/yPK3/8UExj/GE1b/ybS5P8ZGB7/Ghwj/ynW5P8p2Ob/Ghwi/yWrtv8p1eH/Ghwi/xocIv8p1uT/J8XT/xkcIv8m1un/Hb7d/xUYH/8hzOr/HtHu/xcaIf8XGB//I8vi/xgxOv8XGSD/I8rg/yPK4P8XGiD/GUFL/yPP6f8SERj/Fhkh/x3A4f8AAAAAJ2f9/ydr//8mZPH/AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAlYu38J2v//ydo/f8AAAAAAAAAAAd8/fkFqf//Iob8sAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAMY39awWr//8FfP3/AAAAAAAAAAAFm/7/SfD//wR+/f8AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAOB/f9B7v//BaX+/wAAAAAAAAAAQ878SAyZ/v9n1v4KAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADu9v8DDJb+/z3N/XgAAAAA3/sAAN/7AADf+wAA3/sAAAAAAAAAAAAAAAAAAN/7AAAAAAAAAAAAAAAAAAAAAAAAj/EAAI/5AACP8QAA3/sAAA==" rel=icon type=image/x-icon><title>Encoding binary data into DNA sequence</title><meta name=description content="Initial thoughtsImagine a world where you could go outside and take a leaf from a tree and putit through your personal DNA sequencer and get data like music, videos orcomputer programs from 
it."><link rel=alternate type=application/rss+xml title="Mitja Felicijan's posts" href=https://mitjafelicijan.com/index.xml><link rel=alternate type=application/rss+xml title="Mitja Felicijan's notes" href=https://mitjafelicijan.com/notes.xml><style>body{padding:1rem;max-width:760px;background:#fff;font-family:times new roman,Times,serif;line-height:1.35rem}hr{margin-block-start:1.5rem}h1,h2,h3{line-height:initial}footer{margin-block-start:3rem}table{max-width:100%;border-collapse:separate;border-spacing:2px;border:1px solid #000;border-left:1px solid #999;border-top:1px solid #999}blockquote{font-style:italic}table thead{background:#eee}td,th{border:1px solid #000;padding:4px;border-right:1px solid #999;border-bottom:1px solid #999;text-align:left}pre{text-wrap:nowrap;overflow-x:auto;margin-block-start:1.5rem;margin-block-end:1.5rem;padding:.5rem 0;border-top:1px solid #000;border-bottom:1px solid #000}pre code{line-height:1.3em}pre,code,pre *,code *{font-family:monospace;font-size:initial!important}img,video,audio{max-width:100%}header{display:flex;flex-direction:row;gap:3rem}nav{display:flex;gap:.75rem}.pstatus-orange{background:gold}.pstatus-green{background:#9acd32}.pstatus-red{background:#cd5c5c}@media only screen and (max-width:600px){header{flex-direction:column;gap:1rem}a{word-wrap:break-word}}</style><header><nav class=main><a href=/>Home</a> | ||
| 2 | <a href=https://git.mitjafelicijan.com/ target=_blank>Git</a> | ||
| 3 | <a href=https://files.mitjafelicijan.com/ target=_blank>Files</a> | ||
| 4 | <a href=/mitjafelicijan.pgp.pub.txt target=_blank>PGP</a> | ||
| 5 | <a href=/curriculum-vitae.html>CV</a> | ||
| 6 | <a href=/index.xml target=_blank>RSS</a></nav></header><main><div><h1>Encoding binary data into DNA sequence</h1><p>Jan 3, 2019<div><h2 id=initial-thoughts>Initial thoughts</h2><p>Imagine a world where you could go outside and take a leaf from a tree and put | ||
| 7 | it through your personal DNA sequencer and get data like music, videos or | ||
| 8 | computer programs from it. Well, this is all possible now. It has not been done on a | ||
| 9 | large scale because it is quite expensive to create DNA strands, but it is | ||
| 10 | possible.<p>Encoding data into a DNA sequence is a relatively simple process once you understand | ||
| 11 | the relationship between binary data and nucleotides. Scientists have been | ||
| 12 | making large leaps in this field in order to provide a viable long-term storage | ||
| 13 | solution for our data that could potentially survive our species in case of a | ||
| 14 | global disaster. We could imprint all the world's knowledge into plants and | ||
| 15 | ensure the survival of our knowledge.<p>A more optimistic use of this technology would be easier storage of the | ||
| 16 | ever-growing data we produce every day. Once machines for sequencing DNA become fast | ||
| 17 | enough and cheap enough, this could mean the next evolution of data storage and | ||
| 18 | the end of classical hard and solid-state drives in data warehouses.<p>As things currently stand this is still not viable, but it is quite an amazing and | ||
| 19 | cool technology.<p>My interest in this field lies purely in encoding processes and experimental | ||
| 20 | testing, mainly because I don't have access to these expensive machines. My | ||
| 21 | initial goal was to create a toolkit that can be used by everybody to encode | ||
| 22 | their data into a proper DNA sequence.<h2 id=glossary>Glossary</h2><p><strong>deoxyribose</strong> A five-carbon sugar molecule with a hydrogen atom rather than a | ||
| 23 | hydroxyl group in the 2′ position; the sugar component of DNA nucleotides.<p><strong>double helix</strong> The molecular shape of DNA in which two strands of nucleotides | ||
| 24 | wind around each other in a spiral shape.<p><strong>nitrogenous base</strong> A nitrogen-containing molecule that acts as a base; often | ||
| 25 | referring to one of the purine or pyrimidine components of nucleic acids.<p><strong>phosphate group</strong> A molecular group consisting of a central phosphorus atom | ||
| 26 | bound to four oxygen atoms.<p><strong>RGB</strong> The RGB color model is an additive color model in which red, green and | ||
| 27 | blue light are added together in various ways to reproduce a broad array of | ||
| 28 | colors.<p><strong>GCC</strong> The GNU Compiler Collection is a compiler system produced by the GNU | ||
| 29 | Project supporting various programming languages.<h2 id=data-encoding>Data encoding</h2><p><strong>TL;DR:</strong> Encoding involves the use of a code to change original data into a | ||
| 30 | form that can be used by an external process.<p>Encoding is the process of converting data into a format required for a number | ||
| 31 | of information processing needs, including:<ul><li>Program compiling and execution<li>Data transmission, storage and compression/decompression<li>Application data processing, such as file conversion</ul><p>Encoding can have two meanings:<ul><li>In computer technology, encoding is the process of applying a specific code, | ||
| 32 | such as letters, symbols and numbers, to data for conversion into an | ||
| 33 | equivalent cipher.<li>In electronics, encoding refers to analog-to-digital conversion.</ul><h2 id=quick-history-of-dna>Quick history of DNA</h2><ul><li><strong>1869</strong> - Friedrich Miescher identifies "nuclein".<li><strong>1900s</strong> - The Eugenics Movement.<li><strong>1900</strong> - Mendel's theories are rediscovered by researchers.<li><strong>1944</strong> - Oswald Avery identifies DNA as the 'transforming principle'.<li><strong>1952</strong> - Rosalind Franklin photographs crystallized DNA fibres.<li><strong>1953</strong> - James Watson and Francis Crick discover the double helix structure of DNA.<li><strong>1965</strong> - Marshall Nirenberg is the first person to sequence the bases in each codon.<li><strong>1983</strong> - Huntington's disease is the first mapped genetic disease.<li><strong>1990</strong> - The Human Genome Project begins.<li><strong>1995</strong> - Haemophilus influenzae is the first bacterium genome sequenced.<li><strong>1996</strong> - Dolly the sheep is cloned.<li><strong>1999</strong> - First human chromosome is decoded.<li><strong>2000</strong> - Genetic code of the fruit fly is decoded.<li><strong>2002</strong> - Mouse is the first mammal to have its genome decoded.<li><strong>2003</strong> - The Human Genome Project is completed.<li><strong>2013</strong> - DNA Worldwide and Eurofins Forensic discover identical twins have differences in their genetic makeup.</ul><h2 id=what-is-dna>What is DNA?</h2><p>Deoxyribonucleic acid (DNA) is a self-replicating material which is <strong>present in nearly | ||
| 34 | all living organisms</strong> as the main constituent of chromosomes. It is the | ||
| 35 | <strong>carrier of genetic information</strong>.<blockquote><p>The nitrogen in our DNA, the calcium in our teeth, the iron in our blood, | ||
| 36 | the carbon in our apple pies were made in the interiors of collapsing stars. | ||
| 37 | We are made of starstuff. | ||
| 38 | <strong>-- Carl Sagan, Cosmos</strong></blockquote><p>The nucleotide in DNA consists of a sugar (deoxyribose), one of four bases | ||
| 39 | (cytosine (C), thymine (T), adenine (A), guanine (G)), and a phosphate. | ||
| 40 | Cytosine and thymine are pyrimidine bases, while adenine and guanine are purine | ||
| 41 | bases. The sugar and the base together are called a nucleoside.<p><img src=/assets/dna-sequence/dna-basics.jpg alt=DNA><p><em>DNA (a) forms a double stranded helix, and (b) adenine pairs with thymine and | ||
| 42 | cytosine pairs with guanine. (credit a: modification of work by Jerome Walker, | ||
| 43 | Dennis Myts)</em><h2 id=encode-binary-data-into-dna-sequence>Encode binary data into DNA sequence</h2><p>As an input file you can use any file you want:<ul><li>ASCII files,<li>Compiled programs,<li>Multimedia files (MP3, MP4, MKV, etc.),<li>Images,<li>Database files,<li>etc.</ul><p>Note: If you copy bytes from RAM to a file, or pipe data into a file, you | ||
| 44 | can encode this data as well, as long as you provide a file pointer to the encoder.<h3 id=basic-encoding>Basic Encoding</h3><p>The Basic Encoding is based on a simple mapping. DNA | ||
| 45 | is composed of 4 nucleotides (Adenine, Cytosine, Guanine, Thymine; usually | ||
| 46 | referred to by their first letter), so a single nucleotide can encode<p>$$ \log_2(4) = \log_2(2^2) = 2 \text{ bits} $$<p>of data. In this way, we can use the 4 bases that | ||
| 47 | compose the DNA strand to encode each byte of data as four nucleotides.<table><thead><tr><th>Two bits<th>Nucleotide<tbody><tr><td>00<td><strong>A</strong> (Adenine)<tr><td>01<td><strong>G</strong> (Guanine)<tr><td>10<td><strong>C</strong> (Cytosine)<tr><td>11<td><strong>T</strong> (Thymine)</table><p>With this in mind we can encode any data with a simple two-bit to nucleotide | ||
| 48 | conversion.<pre tabindex=0 style=background-color:#fff><code><span style=display:flex><span>{ Algorithm 1: Naive byte array to DNA encode } | ||
| 49 | </span></span><span style=display:flex><span>procedure EncodeToDNASequence(f) string | ||
| 50 | </span></span><span style=display:flex><span>begin | ||
| 51 | </span></span><span style=display:flex><span>  enc string := <span style=color:#a31515>''</span> | ||
| 52 | </span></span><span style=display:flex><span>  <span style=color:#00f>while</span> <span style=color:#00f>not</span> eof(f) do | ||
| 53 | </span></span><span style=display:flex><span>    c byte := read(f)                          { Read 1 byte <span style=color:#00f>from</span> file } | ||
| 54 | </span></span><span style=display:flex><span>    bin string := sprintf(<span style=color:#a31515>'%08b'</span>, c)          { Convert to 8-character binary string } | ||
| 55 | </span></span><span style=display:flex><span>    <span style=color:#00f>for</span> e <span style=color:#00f>in</span> [bin[0..1], bin[2..3], bin[4..5], bin[6..7]] do  { Two bits at a time } | ||
| 56 | </span></span><span style=display:flex><span>      <span style=color:#00f>if</span> e == <span style=color:#a31515>'00'</span> then                       { 00 - A (Adenine) } | ||
| 57 | </span></span><span style=display:flex><span>        enc += <span style=color:#a31515>'A'</span> | ||
| 58 | </span></span><span style=display:flex><span>      <span style=color:#00f>else</span> <span style=color:#00f>if</span> e == <span style=color:#a31515>'01'</span> then                  { 01 - G (Guanine) } | ||
| 59 | </span></span><span style=display:flex><span>        enc += <span style=color:#a31515>'G'</span> | ||
| 60 | </span></span><span style=display:flex><span>      <span style=color:#00f>else</span> <span style=color:#00f>if</span> e == <span style=color:#a31515>'10'</span> then                  { 10 - C (Cytosine) } | ||
| 61 | </span></span><span style=display:flex><span>        enc += <span style=color:#a31515>'C'</span> | ||
| 62 | </span></span><span style=display:flex><span>      <span style=color:#00f>else</span> <span style=color:#00f>if</span> e == <span style=color:#a31515>'11'</span> then                  { 11 - T (Thymine) } | ||
| 63 | </span></span><span style=display:flex><span>        enc += <span style=color:#a31515>'T'</span> | ||
| 64 | </span></span><span style=display:flex><span>  <span style=color:#00f>return</span> enc                                 { Return DNA sequence } | ||
| 65 | </span></span><span style=display:flex><span>end | ||
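| | </span></span><span style=display:flex><span> | ||
| | </span></span><span style=display:flex><span>{ Worked example, assuming the bit pairs tested in the if/else chain above: } | ||
| | </span></span><span style=display:flex><span>{   'H' = 01001000 = 01 00 10 00, which encodes as GACA } | ||
| | </span></span><span style=display:flex><span>{   'o' = 01101111 = 01 10 11 11, which encodes as GCTT } | ||
| | </span></span><span style=display:flex><span> | ||
| | </span></span><span style=display:flex><span>{ Sketch only, not taken from the toolkit: the inverse mapping. } | ||
| | </span></span><span style=display:flex><span>{ DecodeFromDNASequence, parseBinary and the bytes type are hypothetical names. } | ||
| | </span></span><span style=display:flex><span>procedure DecodeFromDNASequence(seq) bytes | ||
| | </span></span><span style=display:flex><span>begin | ||
| | </span></span><span style=display:flex><span>  out bytes | ||
| | </span></span><span style=display:flex><span>  bits string := <span style=color:#a31515>''</span> | ||
| | </span></span><span style=display:flex><span>  <span style=color:#00f>for</span> n <span style=color:#00f>in</span> seq do | ||
| | </span></span><span style=display:flex><span>    case n of | ||
| | </span></span><span style=display:flex><span>      <span style=color:#a31515>'A'</span>: bits += <span style=color:#a31515>'00'</span> | ||
| | </span></span><span style=display:flex><span>      <span style=color:#a31515>'G'</span>: bits += <span style=color:#a31515>'01'</span> | ||
| | </span></span><span style=display:flex><span>      <span style=color:#a31515>'C'</span>: bits += <span style=color:#a31515>'10'</span> | ||
| | </span></span><span style=display:flex><span>      <span style=color:#a31515>'T'</span>: bits += <span style=color:#a31515>'11'</span> | ||
| | </span></span><span style=display:flex><span>    <span style=color:#00f>if</span> length(bits) == 8 then | ||
| | </span></span><span style=display:flex><span>      out += parseBinary(bits)                  { Eight bits back into one byte } | ||
| | </span></span><span style=display:flex><span>      bits := <span style=color:#a31515>''</span> | ||
| | </span></span><span style=display:flex><span>  <span style=color:#00f>return</span> out | ||
| | </span></span><span style=display:flex><span>end | ||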
| 66 | </span></span></code></pre><p>Another encoding would be <strong>Goldman encoding</strong>. It converts the data to | ||
| 67 | base 3 and picks each nucleotide so that it differs from the previous one. This | ||
| 68 | avoids runs of identical bases (homopolymers), which are a common source of | ||
| 69 | errors when DNA is synthesized and sequenced.<p><a href="https://www.youtube.com/watch?v=a4PiGWNsIEU">Where to store big data? In DNA: Nick Goldman at TEDxPrague</a><h3 id=fasta-file-format>FASTA file format</h3><p>In bioinformatics, FASTA format is a text-based format for representing either | ||
| 70 | nucleotide sequences or peptide sequences, in which nucleotides or amino acids | ||
| 71 | are represented using single-letter codes. The format also allows for sequence | ||
| 72 | names and comments to precede the sequences. The format originates from the | ||
| 73 | FASTA software package, but has now become a standard in the field of | ||
| 74 | bioinformatics.<p>The first line in a FASTA file started either with a ">" (greater-than) symbol | ||
| 75 | or, less frequently, a ";" (semicolon) was taken as a comment. Subsequent lines | ||
| 76 | starting with a semicolon would be ignored by software. Since the only comment | ||
| 77 | used was the first, it quickly became used to hold a summary description of the | ||
| 78 | sequence, often starting with a unique library accession number, and with time | ||
| 79 | it has become commonplace to always use ">" for the first line and to not use | ||
| 80 | ";" comments (which would otherwise be ignored).<pre><code>;LCBO - Prolactin precursor - Bovine | ||
| 81 | ; a sample sequence in FASTA format | ||
| 82 | MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFDRAVMVSHYIHDLSS | ||
| 83 | EMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKEQAQQTHHEVLMSLILGLLRSWNDPLYHL | ||
| 84 | VTEVRGMKGAPDAILSRAIEIEEENKRLLEGMEMIFGQVIPGAKETEPYPVWSGLPSLQTKDED | ||
| 85 | ARYSAFYNLLHCLRRDSSKIDTYLKLLNCRIIYNNNC* | ||
| 86 | |||
| 87 | >MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken | ||
| 88 | ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID | ||
| 89 | FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA | ||
| 90 | DIDGDGQVNYEEFVQMMTAK* | ||
| 91 | |||
| 92 | >gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus] | ||
| 93 | LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV | ||
| 94 | EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG | ||
| 95 | LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL | ||
| 96 | GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX | ||
| 97 | IENY | ||
| 98 | </code></pre><p>FASTA format was extended by <a href=https://en.wikipedia.org/wiki/FASTQ_format>FASTQ</a> | ||
| 99 | format from the <a href=https://www.sanger.ac.uk/>Sanger Centre</a> in Cambridge.<h3 id=png-encoded-dna-sequence>PNG encoded DNA sequence</h3><p>Each nucleotide is mapped to a color:<table><thead><tr><th>Nucleotides<th>RGB<th>Color name<tbody><tr><td>A ➞ Adenine<td>(0,0,255)<td>Blue<tr><td>G ➞ Guanine<td>(0,100,0)<td>Green<tr><td>C ➞ Cytosine<td>(255,0,0)<td>Red<tr><td>T ➞ Thymine<td>(255,255,0)<td>Yellow</table><p>With this in mind we can write a simple algorithm that creates a PNG representation | ||
| 100 | of a DNA sequence.<pre tabindex=0 style=background-color:#fff><code><span style=display:flex><span>{ Algorithm 2: Naive DNA to PNG encode <span style=color:#00f>from</span> FASTA file } | ||
| 101 | </span></span><span style=display:flex><span>procedure EncodeDNASequenceToPNG(f) | ||
| 102 | </span></span><span style=display:flex><span>begin | ||
| 103 | </span></span><span style=display:flex><span> i image | ||
| 104 | </span></span><span style=display:flex><span> <span style=color:#00f>while</span> <span style=color:#00f>not</span> eof(f) do | ||
| 105 | </span></span><span style=display:flex><span> c char := buffer[0] { Read 1 char <span style=color:#00f>from</span> buffer } | ||
| 106 | </span></span><span style=display:flex><span> case c of | ||
| 107 | </span></span><span style=display:flex><span> <span style=color:#a31515>'A'</span>: color := RGB(0, 0, 255) { Blue } | ||
| 108 | </span></span><span style=display:flex><span> <span style=color:#a31515>'G'</span>: color := RGB(0, 100, 0) { Green } | ||
| 109 | </span></span><span style=display:flex><span> <span style=color:#a31515>'C'</span>: color := RGB(255, 0, 0) { Red } | ||
| 110 | </span></span><span style=display:flex><span> <span style=color:#a31515>'T'</span>: color := RGB(255, 255, 0) { Yellow } | ||
| 111 | </span></span><span style=display:flex><span> drawRect(i, [x, y], color) | ||
| 112 | </span></span><span style=display:flex><span> save(i) { Save PNG image } | ||
| 113 | </span></span><span style=display:flex><span>end | ||
| 114 | </span></span></code></pre><h2 id=encoding-text-file-in-practice>Encoding text file in practice</h2><p>In this example we will take a simple text file as our input stream for | ||
| 115 | encoding. The file contains a quote from Niels Bohr and is saved as a txt file.<blockquote><p>How wonderful that we have met with a paradox. Now we have some hope of | ||
| 116 | making progress. | ||
| 117 | ― Niels Bohr</blockquote><p>First we encode the text file into a FASTA file.<pre tabindex=0 style=background-color:#fff><code><span style=display:flex><span>./dnae-encode -i quote.txt -o quote.fa | ||
| 118 | </span></span><span style=display:flex><span>2019/01/10 00:38:29 Gathering input file stats | ||
| 119 | </span></span><span style=display:flex><span>2019/01/10 00:38:29 Starting encoding ... | ||
| 120 | </span></span><span style=display:flex><span> 106 B / 106 B [==================================] 100.00% 0s | ||
| 121 | </span></span><span style=display:flex><span>2019/01/10 00:38:29 Saving to FASTA file ... | ||
| 122 | </span></span><span style=display:flex><span>2019/01/10 00:38:29 Output FASTA file length is 438 B | ||
| 123 | </span></span><span style=display:flex><span>2019/01/10 00:38:29 Process took 987.263µs | ||
| 124 | </span></span><span style=display:flex><span>2019/01/10 00:38:29 Done ... | ||
| 125 | </span></span></code></pre><p>The output file <code>quote.fa</code> contains the encoded DNA sequence in ASCII format.<pre><code>>SEQ1 | ||
| 126 | GACAGCTTGTGTACAAGTGTGCTTGCTCGCGAGCGGGTACGCGCGTGGGCTAACAAGTGA | ||
| 127 | GCCAGCAGGTGAACAAGTGTGCGGACAAGCCAGCAGGTGCGCGGACAAGCTGGCGGGTGA | ||
| 128 | ACAAGTGTGCCGGTGAGCCAACAAGCAGACAAGTAAGCAGGTACGCAGGCGAGCTTGTCA | ||
| 129 | ACTCACAAGATCGCTTGTGTACAAGTGTGCGGACAAGCCAGCAGGTGCGCGGACAAGTAT | ||
| 130 | GCTTGCTGGCGGACAAGCCAGCTTGTAAGCGGACAAGCTTGCGCACAAGCTGGCAGGCCT | ||
| 131 | GCCGGCTCGCGTACAAATTCACAAGTAAGTACGCTTGCGTGTACGCGGGTATGTATACTC | ||
| 132 | AACCTCACCAAACGGGACAAGATCGCCGGCGGGCTAGTATACAAGAACGCTTGCCAGTAC | ||
| 133 | AACC | ||
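| | </code></pre><p>The reported output size can be checked by hand. This is a rough sketch that assumes one header line, 60 nucleotides per line (the default of <code>--columns</code>) and a single newline byte after every line:<pre><code>106 bytes * 4 nucleotides per byte = 424 nucleotides | ||
| | ceil(424 / 60) = 8 sequence lines, so 8 newline bytes | ||
| | header line SEQ1 with its marker and newline = 6 bytes | ||
| | 424 + 8 + 6 = 438 bytes | ||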
| 134 | </code></pre><p>Then we take the FASTA file from the previous step and encode it into a PNG.<pre tabindex=0 style=background-color:#fff><code><span style=display:flex><span>./dnae-png -i quote.fa -o quote.png | ||
| 135 | </span></span><span style=display:flex><span>2019/01/10 00:40:09 Gathering input file stats ... | ||
| 136 | </span></span><span style=display:flex><span>2019/01/10 00:40:09 Deconstructing FASTA file ... | ||
| 137 | </span></span><span style=display:flex><span>2019/01/10 00:40:09 Compositing image file ... | ||
| 138 | </span></span><span style=display:flex><span> 424 / 424 [==================================] 100.00% 0s | ||
| 139 | </span></span><span style=display:flex><span>2019/01/10 00:40:09 Saving output file ... | ||
| 140 | </span></span><span style=display:flex><span>2019/01/10 00:40:09 Output image file length is 1.1 kB | ||
| 141 | </span></span><span style=display:flex><span>2019/01/10 00:40:09 Process took 19.036117ms | ||
| 142 | </span></span><span style=display:flex><span>2019/01/10 00:40:09 Done ... | ||
| 143 | </span></span></code></pre><p>After encoding into PNG format the file looks like this.<p><img src=/assets/dna-sequence/quote.png alt="Encoded Quote in PNG format"><p>The larger the input stream, the larger the PNG file.<p>A basic Hello World C program compiled with | ||
| 144 | <a href=https://www.gnu.org/software/gcc/>GCC</a> would <a href=/assets/dna-sequence/sample.png>look | ||
| 145 | like this</a>.<pre tabindex=0 style=background-color:#fff><code><span style=display:flex><span><span style=color:green>// gcc -O3 -o sample sample.c | ||
| 146 | </span></span></span><span style=display:flex><span><span style=color:green></span><span style=color:#00f>#include</span> <span style=color:#00f><stdio.h></span><span style=color:#00f> | ||
| 147 | </span></span></span><span style=display:flex><span><span style=color:#00f></span> | ||
| 148 | </span></span><span style=display:flex><span><span style=color:#00f>int</span> main(<span style=color:#00f>void</span>) { | ||
| 149 | </span></span><span style=display:flex><span> printf(<span style=color:#a31515>"Hello, world!</span><span style=color:#a31515>\n</span><span style=color:#a31515>"</span>); | ||
| 150 | </span></span><span style=display:flex><span> <span style=color:#00f>return</span> 0; | ||
| 151 | </span></span><span style=display:flex><span>} | ||
| 152 | </span></span></code></pre><h2 id=toolkit-for-encoding-data>Toolkit for encoding data</h2><p>I have created a toolkit with two main programs:<ul><li>dnae-encode (encodes a file into a FASTA file)<li>dnae-png (encodes a FASTA file into a PNG)</ul><p>The toolkit with full source code is available on | ||
| 153 | <a href=https://github.com/mitjafelicijan/dna-encoding>github.com/mitjafelicijan/dna-encoding</a>.<h3 id=dnae-encode>dnae-encode</h3><pre tabindex=0 style=background-color:#fff><code><span style=display:flex><span>> ./dnae-encode --help | ||
| 154 | </span></span><span style=display:flex><span>usage: dnae-encode --input=INPUT [<flags>] | ||
| 155 | </span></span><span style=display:flex><span> | ||
| 156 | </span></span><span style=display:flex><span>A command-line application that encodes file into DNA sequence. | ||
| 157 | </span></span><span style=display:flex><span> | ||
| 158 | </span></span><span style=display:flex><span>Flags: | ||
| 159 | </span></span><span style=display:flex><span> --help Show context-sensitive help (also try --help-long and --help-man). | ||
| 160 | </span></span><span style=display:flex><span> -i, --input=INPUT Input file (ASCII or binary) which will be encoded into DNA sequence. | ||
| 161 | </span></span><span style=display:flex><span> -o, --output=<span style=color:#a31515>"out.fa"</span> Output file which stores DNA sequence in FASTA format. | ||
| 162 | </span></span><span style=display:flex><span> -s, --sequence=SEQ1 The description line (defline) or header/identifier line, gives a name and/or a unique identifier <span style=color:#00f>for</span> the sequence. | ||
| 163 | </span></span><span style=display:flex><span> -c, --columns=60 Row characters length (no more than 120 characters). Devices preallocate fixed line sizes in software. | ||
| 164 | </span></span><span style=display:flex><span> --version Show application version. | ||
| 165 | </span></span></code></pre><h3 id=dnae-png>dnae-png</h3><pre tabindex=0 style=background-color:#fff><code><span style=display:flex><span>> ./dnae-png --help | ||
| 166 | </span></span><span style=display:flex><span>usage: dnae-png --input=INPUT [<flags>] | ||
| 167 | </span></span><span style=display:flex><span> | ||
| 168 | </span></span><span style=display:flex><span>A command-line application that encodes FASTA file into PNG image. | ||
| 169 | </span></span><span style=display:flex><span> | ||
| 170 | </span></span><span style=display:flex><span>Flags: | ||
| 171 | </span></span><span style=display:flex><span> --help Show context-sensitive help (also try --help-long and --help-man). | ||
| 172 | </span></span><span style=display:flex><span> -i, --input=INPUT Input FASTA file which will be encoded into PNG image. | ||
| 173 | </span></span><span style=display:flex><span> -o, --output=<span style=color:#a31515>"out.png"</span> Output file in PNG format that represents DNA sequence in graphical way. | ||
| 174 | </span></span><span style=display:flex><span> -s, --size=10 Size of pairings of DNA bases on image in pixels (lower resolution lower file size). | ||
| 175 | </span></span><span style=display:flex><span> --version Show application version. | ||
| 176 | </span></span></code></pre><h2 id=benchmarks>Benchmarks</h2><p>First we generate some binary sample data with dd.<pre tabindex=0 style=background-color:#fff><code><span style=display:flex><span>dd <span style=color:#00f>if</span>=<(openssl enc -aes-256-ctr -pass pass:<span style=color:#a31515>"</span><span style=color:#00f>$(</span>dd <span style=color:#00f>if</span>=/dev/urandom bs=128 count=1 2>/dev/null | base64<span style=color:#00f>)</span><span style=color:#a31515>"</span> -nosalt < /dev/zero) of=1KB.bin bs=1KB count=1 iflag=fullblock | ||
| 177 | </span></span></code></pre><p>Our freshly generated 1KB file looks something like this (it's full of garbage | ||
| 178 | data as intended).<p><img src=/assets/dna-sequence/sample-binary-file.png alt="Sample binary file 1KB"><p>We create the following binary files:<ul><li>1KB.bin<li>10KB.bin<li>100KB.bin<li>1MB.bin<li>10MB.bin<li>100MB.bin</ul><p>After this we create FASTA files for all the binary files by encoding them | ||
| 179 | into DNA sequence.<pre tabindex=0 style=background-color:#fff><code><span style=display:flex><span>./dnae-encode -i 100MB.bin -o 100MB.fa | ||
| 180 | </span></span></code></pre><p>Then we GZIP all the FASTA files to see how much they can be compressed.<pre tabindex=0 style=background-color:#fff><code><span style=display:flex><span>gzip -9 < 10MB.fa > 10MB.fa.gz | ||
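| | </span></span></code></pre><p>As a back-of-the-envelope bound (an estimate, not a measurement): each nucleotide carries 2 bits of information but is stored as an 8-bit ASCII character. For random input, a lossless compressor can therefore shrink the sequence data to roughly a quarter of its size at best, which is about the size of the original binary file.<pre tabindex=0 style=background-color:#fff><code><span style=display:flex><span>information per nucleotide   = 2 bits | ||
| | </span></span><span style=display:flex><span>storage per ASCII character  = 8 bits | ||
| | </span></span><span style=display:flex><span>best-case compression ratio  = 8 / 2 = 4 | ||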
| 181 | </span></span></code></pre><p><a href=/dna-sequence/benchmarks.ods>Download ODS file with benchmarks</a>.<p><img src=/assets/dna-sequence/chart-1.png alt="Sample binary file 1KB"><p><img src=/assets/dna-sequence/chart-2.png alt="Sample binary file 1KB"><h2 id=references>References</h2><ul><li><a href=https://www.techopedia.com/definition/948/encoding>https://www.techopedia.com/definition/948/encoding</a><li><a href=https://www.dna-worldwide.com/resource/160/history-dna-timeline>https://www.dna-worldwide.com/resource/160/history-dna-timeline</a><li><a href=https://opentextbc.ca/biology/chapter/9-1-the-structure-of-dna/>https://opentextbc.ca/biology/chapter/9-1-the-structure-of-dna/</a><li><a href=https://arxiv.org/abs/1801.04774>https://arxiv.org/abs/1801.04774</a><li><a href=https://en.wikipedia.org/wiki/FASTA_format>https://en.wikipedia.org/wiki/FASTA_format</a></ul></div></div></main><footer><hr><div><h3>Want to comment or have something to add?</h3>You can write me an email at | ||
| 182 | <a href=mailto:m@mitjafelicijan.com>m@mitjafelicijan.com</a> or catch up | ||
| 183 | with me | ||
| 184 | <a href=https://telegram.me/mitjafelicijan target=_blank>on Telegram</a>.</div><hr><p>This website does not track you. Content is made available under | ||
| 185 | the <a href=https://creativecommons.org/licenses/by/4.0/ target=_blank rel=noreferrer>CC BY 4.0 license</a> unless specified | ||
| 186 | otherwise. Blog feed is available as <a href=/index.xml target=_blank>RSS feed</a>.</footer><script src=https://cdn.usefathom.com/script.js data-site=XHQARKXP defer></script> \ No newline at end of file | ||
