Moved to a simpler SSG

author: Mitja Felicijan <m@mitjafelicijan.com> 2023-07-08 23:25:41 +0200
committer: Mitja Felicijan <m@mitjafelicijan.com> 2023-07-08 23:25:41 +0200
commit: cd6644ea4ddc78597934ab0ef5ba50e3c3daa927 (patch)
tree: 03de331a8db6386dfd6fa75155bfbcea6b4feaf3 /public/encoding-binary-data-into-dna-sequence.html
parent: 84ed124529ffeee1590295b8de3a8faf51848680 (diff)
download: mitjafelicijan.com-cd6644ea4ddc78597934ab0ef5ba50e3c3daa927.tar.gz
1 files changed, 186 insertions, 0 deletions
diff --git a/public/encoding-binary-data-into-dna-sequence.html b/public/encoding-binary-data-into-dna-sequence.html
new file mode 100755
index 0000000..8da76da
--- /dev/null
+++ b/public/encoding-binary-data-into-dna-sequence.html
@@ -0,0 +1,186 @@
+<!doctype html><html lang=en-us><meta charset=utf-8><meta name=viewport content="width=device-width,initial-scale=1"><link href="data:image/x-icon;base64,AAABAAEAEBAAAAEAIABoBAAAFgAAACgAAAAQAAAAIAAAAAEAIAAAAAAAAAQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAL69vf8AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAv76+/8LBwQkAAAAAAAAAAAAAAAC+vb3/AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAL+9vf/Bv78JAAAAAAAAAAAAAAAAu7q6/wAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC7ubr/vr29CAAAAAAAAAAAy8nJAZ6foP8AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAnqGj/6GipAoAAAAAHLjU/xcXHf/BwsL/I8XY/yPK3v8XGiD/IbjL/yPF2f8XGiD/Fxkf/yLF2f8gnK3/Fxog/62ztv8fwNf/FRcd/x271v8mz93/GRsi/xkXHf8p097/GiIp/xobIv8p0t3/KdPe/xocIv8fYmr/KNPe/xoZH/8aHCL/J87c/xy81/8VFxz/IsPZ/8zS0/8XGiD/Ir/R/yPH2/8XGiD/Fxkf/yPH2/8dd4T/GBog/yPJ3f8jyNr/uru9/xcUGv8cudb/EhITDKi5vRKlvMP/RUpOERwcHRAdOj4QHTk8EBwdHRAdNTgQHTo/EBwcHRAcHB0QSGduEKW4vf+koqQfHzg+EBqz0ewSFRv7EyMr/xq51vsTERb7ExUb+xq41fsau9j7ExUb+xiPp/sZudb7ExUb+xMVG/sZuNX/GKvI/BIUGfMdvdn/IrfL/xcaIP8n1eb/J9Dh/xkcIf8ZGR7/J8/f/xxCSv8ZGyH/J9Dg/ybQ4P8ZHCL/FSQs/yPK3/8UExj/GE1b/ybS5P8ZGB7/Ghwj/ynW5P8p2Ob/Ghwi/yWrtv8p1eH/Ghwi/xocIv8p1uT/J8XT/xkcIv8m1un/Hb7d/xUYH/8hzOr/HtHu/xcaIf8XGB//I8vi/xgxOv8XGSD/I8rg/yPK4P8XGiD/GUFL/yPP6f8SERj/Fhkh/x3A4f8AAAAAJ2f9/ydr//8mZPH/AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAlYu38J2v//ydo/f8AAAAAAAAAAAd8/fkFqf//Iob8sAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAMY39awWr//8FfP3/AAAAAAAAAAAFm/7/SfD//wR+/f8AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAOB/f9B7v//BaX+/wAAAAAAAAAAQ878SAyZ/v9n1v4KAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADu9v8DDJb+/z3N/XgAAAAA3/sAAN/7AADf+wAA3/sAAAAAAAAAAAAAAAAAAN/7AAAAAAAAAAAAAAAAAAAAAAAAj/EAAI/5AACP8QAA3/sAAA==" rel=icon type=image/x-icon><title>Encoding binary data into DNA sequence</title><meta name=description content="Initial thoughtsImagine a world where you could go outside and take a leaf from a tree and putit through your personal DNA sequencer and get data like music, videos orcomputer programs from 
it."><link rel=alternate type=application/rss+xml title="Mitja Felicijan's posts" href=https://mitjafelicijan.com/index.xml><link rel=alternate type=application/rss+xml title="Mitja Felicijan's notes" href=https://mitjafelicijan.com/notes.xml><style>body{padding:1rem;max-width:760px;background:#fff;font-family:times new roman,Times,serif;line-height:1.35rem}hr{margin-block-start:1.5rem}h1,h2,h3{line-height:initial}footer{margin-block-start:3rem}table{max-width:100%;border-collapse:separate;border-spacing:2px;border:1px solid #000;border-left:1px solid #999;border-top:1px solid #999}blockquote{font-style:italic}table thead{background:#eee}td,th{border:1px solid #000;padding:4px;border-right:1px solid #999;border-bottom:1px solid #999;text-align:left}pre{text-wrap:nowrap;overflow-x:auto;margin-block-start:1.5rem;margin-block-end:1.5rem;padding:.5rem 0;border-top:1px solid #000;border-bottom:1px solid #000}pre code{line-height:1.3em}pre,code,pre *,code *{font-family:monospace;font-size:initial!important}img,video,audio{max-width:100%}header{display:flex;flex-direction:row;gap:3rem}nav{display:flex;gap:.75rem}.pstatus-orange{background:gold}.pstatus-green{background:#9acd32}.pstatus-red{background:#cd5c5c}@media only screen and (max-width:600px){header{flex-direction:column;gap:1rem}a{word-wrap:break-word}}</style><header><nav class=main><a href=/>Home</a>
+<a href=https://git.mitjafelicijan.com/ target=_blank>Git</a>
+<a href=https://files.mitjafelicijan.com/ target=_blank>Files</a>
+<a href=/mitjafelicijan.pgp.pub.txt target=_blank>PGP</a>
+<a href=/curriculum-vitae.html>CV</a>
+<a href=/index.xml target=_blank>RSS</a></nav></header><main><div><h1>Encoding binary data into DNA sequence</h1><p>Jan 3, 2019<div><h2 id=initial-thoughts>Initial thoughts</h2><p>Imagine a world where you could go outside and take a leaf from a tree and put
+it through your personal DNA sequencer and get data like music, videos or
+computer programs from it. Well, this is all possible now. It has not been done
+on a large scale because creating DNA strands is still quite expensive, but it
+is possible.<p>Encoding data into a DNA sequence is a relatively simple process once you
+understand the relationship between binary data and nucleotides. Scientists have
+been making large leaps in this field in order to provide a viable long-term
+storage solution for our data that could potentially survive our species in
+case of a global disaster. We could imprint all the world's knowledge into plants
+and ensure the survival of our knowledge.<p>A more optimistic use for this technology would be easier storage of the
+ever-growing data we produce every day. Once machines for sequencing DNA become
+fast and cheap enough, this could mean the next evolution of storing data and
+abandoning classical hard and solid state drives in data warehouses.<p>As things currently stand this is still not viable, but it is quite an amazing
+and cool technology.<p>My interest in this field is purely in encoding processes and experimental
+testing, mainly because I don't have access to these expensive machines. My
+initial goal was to create a toolkit that can be used by everybody to encode
+their data into a proper DNA sequence.<h2 id=glossary>Glossary</h2><p><strong>deoxyribose</strong> A five-carbon sugar molecule with a hydrogen atom rather than a
+hydroxyl group in the 2′ position; the sugar component of DNA nucleotides.<p><strong>double helix</strong> The molecular shape of DNA in which two strands of nucleotides
+wind around each other in a spiral shape.<p><strong>nitrogenous base</strong> A nitrogen-containing molecule that acts as a base; often
+referring to one of the purine or pyrimidine components of nucleic acids.<p><strong>phosphate group</strong> A molecular group consisting of a central phosphorus atom
+bound to four oxygen atoms.<p><strong>RGB</strong> The RGB color model is an additive color model in which red, green and
+blue light are added together in various ways to reproduce a broad array of
+colors.<p><strong>GCC</strong> The GNU Compiler Collection is a compiler system produced by the GNU
+Project supporting various programming languages.<h2 id=data-encoding>Data encoding</h2><p><strong>TL;DR:</strong> Encoding involves the use of a code to change original data into a
+form that can be used by an external process.<p>Encoding is the process of converting data into a format required for a number
+of information processing needs, including:<ul><li>Program compiling and execution<li>Data transmission, storage and compression/decompression<li>Application data processing, such as file conversion</ul><p>Encoding can have two meanings:<ul><li>In computer technology, encoding is the process of applying a specific code,
+such as letters, symbols and numbers, to data for conversion into an
+equivalent cipher.<li>In electronics, encoding refers to analog-to-digital conversion.</ul><h2 id=quick-history-of-dna>Quick history of DNA</h2><ul><li><strong>1869</strong> - Friedrich Miescher identifies "nuclein".<li><strong>1900s</strong> - The Eugenics Movement.<li><strong>1900</strong> - Mendel's theories are rediscovered by researchers.<li><strong>1944</strong> - Oswald Avery identifies DNA as the 'transforming principle'.<li><strong>1952</strong> - Rosalind Franklin photographs crystallized DNA fibres.<li><strong>1953</strong> - James Watson and Francis Crick discover the double helix structure of DNA.<li><strong>1965</strong> - Marshall Nirenberg is the first person to sequence the bases in each codon.<li><strong>1983</strong> - Huntington's disease is the first mapped genetic disease.<li><strong>1990</strong> - The Human Genome Project begins.<li><strong>1995</strong> - Haemophilus influenzae is the first bacterium genome sequenced.<li><strong>1996</strong> - Dolly the sheep is cloned.<li><strong>1999</strong> - First human chromosome is decoded.<li><strong>2000</strong> - Genetic code of the fruit fly is decoded.<li><strong>2002</strong> - Mouse is the first mammal to have its genome decoded.<li><strong>2003</strong> - The Human Genome Project is completed.<li><strong>2013</strong> - DNA Worldwide and Eurofins Forensic discover identical twins have differences in their genetic makeup.</ul><h2 id=what-is-dna>What is DNA?</h2><p>Deoxyribonucleic acid (DNA) is a self-replicating material which is <strong>present in nearly
+all living organisms</strong> as the main constituent of chromosomes. It is the
+<strong>carrier of genetic information</strong>.<blockquote><p>The nitrogen in our DNA, the calcium in our teeth, the iron in our blood,
+the carbon in our apple pies were made in the interiors of collapsing stars.
+We are made of starstuff.
+<strong>-- Carl Sagan, Cosmos</strong></blockquote><p>The nucleotide in DNA consists of a sugar (deoxyribose), one of four bases
+(cytosine (C), thymine (T), adenine (A), guanine (G)), and a phosphate.
+Cytosine and thymine are pyrimidine bases, while adenine and guanine are purine
+bases. The sugar and the base together are called a nucleoside.<p><img src=/assets/dna-sequence/dna-basics.jpg alt=DNA><p><em>DNA (a) forms a double stranded helix, and (b) adenine pairs with thymine and
+cytosine pairs with guanine. (credit a: modification of work by Jerome Walker,
+Dennis Myts)</em><h2 id=encode-binary-data-into-dna-sequence>Encode binary data into DNA sequence</h2><p>As an input file you can use any file you want:<ul><li>ASCII files,<li>Compiled programs,<li>Multimedia files (MP3, MP4, MKV, etc.),<li>Images,<li>Database files,<li>etc.</ul><p>Note: If you copy bytes from RAM to a file, or pipe data into a file, you
+can encode this data as well, as long as you provide a file pointer to the encoder.<h3 id=basic-encoding>Basic Encoding</h3><p>The Basic Encoding is based on a simple mapping. DNA
+is composed of 4 nucleotides (Adenine, Cytosine, Guanine, Thymine; usually
+referred to by their first letter), so a single nucleotide can encode<p>$$ \log_2(4) = \log_2(2^2) = 2 \text{ bits} $$<p>of data. In this way, four nucleotides of a DNA strand are enough
+to encode each byte of data.<table><thead><tr><th>Two bits<th>Nucleotide<tbody><tr><td>00<td><strong>A</strong> (Adenine)<tr><td>01<td><strong>G</strong> (Guanine)<tr><td>10<td><strong>C</strong> (Cytosine)<tr><td>11<td><strong>T</strong> (Thymine)</table><p>With this in mind we can encode any data with a simple two-bit to nucleotide
+conversion.<pre tabindex=0 style=background-color:#fff><code><span style=display:flex><span>{ Algorithm 1: Naive byte array to DNA encode }
+</span></span><span style=display:flex><span>procedure EncodeToDNASequence(f) string
+</span></span><span style=display:flex><span>begin
+</span></span><span style=display:flex><span>  enc string
+</span></span><span style=display:flex><span>  <span style=color:#00f>while</span> <span style=color:#00f>not</span> eof(f) do
+</span></span><span style=display:flex><span>    c byte := read(f)                               { Read 1 byte <span style=color:#00f>from</span> file }
+</span></span><span style=display:flex><span>    bin string := sprintf(<span style=color:#a31515>&#39;%08b&#39;</span>, c)                { Convert to 8-character binary string }
+</span></span><span style=display:flex><span>    <span style=color:#00f>for</span> e <span style=color:#00f>in</span> [0, 2, 4, 6] do                        { Offset of each two-bit pair; 48 and 49 are ASCII 0 and 1 }
+</span></span><span style=display:flex><span>      <span style=color:#00f>if</span> bin[e] == 48 <span style=color:#00f>and</span> bin[e+1] == 48 then       { 00 - A (Adenine) }
+</span></span><span style=display:flex><span>        enc += <span style=color:#a31515>&#39;A&#39;</span>
+</span></span><span style=display:flex><span>      <span style=color:#00f>else</span> <span style=color:#00f>if</span> bin[e] == 48 <span style=color:#00f>and</span> bin[e+1] == 49 then  { 01 - G (Guanine) }
+</span></span><span style=display:flex><span>        enc += <span style=color:#a31515>&#39;G&#39;</span>
+</span></span><span style=display:flex><span>      <span style=color:#00f>else</span> <span style=color:#00f>if</span> bin[e] == 49 <span style=color:#00f>and</span> bin[e+1] == 48 then  { 10 - C (Cytosine) }
+</span></span><span style=display:flex><span>        enc += <span style=color:#a31515>&#39;C&#39;</span>
+</span></span><span style=display:flex><span>      <span style=color:#00f>else</span> <span style=color:#00f>if</span> bin[e] == 49 <span style=color:#00f>and</span> bin[e+1] == 49 then  { 11 - T (Thymine) }
+</span></span><span style=display:flex><span>        enc += <span style=color:#a31515>&#39;T&#39;</span>
+</span></span><span style=display:flex><span>  <span style=color:#00f>return</span> enc                                        { Return DNA sequence }
+</span></span><span style=display:flex><span>end
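+</span></span><span style=display:flex><span>
+</span></span><span style=display:flex><span>{ Algorithm 1b: Naive DNA to byte array decode }
+</span></span><span style=display:flex><span>{ Illustrative sketch only, not taken from the toolkit: it assumes the mapping of }
+</span></span><span style=display:flex><span>{ Algorithm 1 and a sequence s whose length is a multiple of 4 }
+</span></span><span style=display:flex><span>procedure DecodeFromDNASequence(s) bytes
+</span></span><span style=display:flex><span>begin
+</span></span><span style=display:flex><span>  out bytes
+</span></span><span style=display:flex><span>  <span style=color:#00f>for</span> i <span style=color:#00f>in</span> [0, 4, 8, ..., length(s) - 4] do
+</span></span><span style=display:flex><span>    b byte := 0
+</span></span><span style=display:flex><span>    <span style=color:#00f>for</span> n <span style=color:#00f>in</span> s[i .. i + 3] do                       { 4 nucleotides = 1 byte }
+</span></span><span style=display:flex><span>      case n of
+</span></span><span style=display:flex><span>        <span style=color:#a31515>&#39;A&#39;</span>: v := 0                                  { 00 }
+</span></span><span style=display:flex><span>        <span style=color:#a31515>&#39;G&#39;</span>: v := 1                                  { 01 }
+</span></span><span style=display:flex><span>        <span style=color:#a31515>&#39;C&#39;</span>: v := 2                                  { 10 }
+</span></span><span style=display:flex><span>        <span style=color:#a31515>&#39;T&#39;</span>: v := 3                                  { 11 }
+</span></span><span style=display:flex><span>      b := b * 4 + v                                { Append two bits }
+</span></span><span style=display:flex><span>    out += b                                        { Example: GACA gives 01 00 10 00 = 72, the letter H }
+</span></span><span style=display:flex><span>  <span style=color:#00f>return</span> out                                        { Return decoded bytes }
+</span></span><span style=display:flex><span>end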
+</span></span></code></pre><p>Another encoding is <strong>Goldman encoding</strong>. Using this encoding helps with
+nonsense mutations (an amino acid codon replaced by a stop codon), which are the
+most problematic during translation because they lead to truncated amino acid
+sequences, which in turn result in truncated proteins.<p><a href="https://www.youtube.com/watch?v=a4PiGWNsIEU">Where to store big data? In DNA: Nick Goldman at TEDxPrague</a><h3 id=fasta-file-format>FASTA file format</h3><p>In bioinformatics, FASTA format is a text-based format for representing either
+nucleotide sequences or peptide sequences, in which nucleotides or amino acids
+are represented using single-letter codes. The format also allows for sequence
+names and comments to precede the sequences. The format originates from the
+FASTA software package, but has now become a standard in the field of
+bioinformatics.<p>The first line in a FASTA file started either with a ">" (greater-than) symbol
+or, less frequently, a ";" (semicolon) was taken as a comment. Subsequent lines
+starting with a semicolon would be ignored by software. Since the only comment
+used was the first, it quickly became used to hold a summary description of the
+sequence, often starting with a unique library accession number, and with time
+it has become commonplace to always use ">" for the first line and to not use
+";" comments (which would otherwise be ignored).<pre><code>;LCBO - Prolactin precursor - Bovine
+; a sample sequence in FASTA format
+MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFDRAVMVSHYIHDLSS
+EMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKEQAQQTHHEVLMSLILGLLRSWNDPLYHL
+VTEVRGMKGAPDAILSRAIEIEEENKRLLEGMEMIFGQVIPGAKETEPYPVWSGLPSLQTKDED
+ARYSAFYNLLHCLRRDSSKIDTYLKLLNCRIIYNNNC*
+&gt;MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
+ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
+FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
+DIDGDGQVNYEEFVQMMTAK*
+&gt;gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
+LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
+EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
+LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
+GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
+IENY
+</code></pre><p>FASTA format was extended by <a href=https://en.wikipedia.org/wiki/FASTQ_format>FASTQ</a>
+format from the <a href=https://www.sanger.ac.uk/>Sanger Centre</a> in Cambridge.<h3 id=png-encoded-dna-sequence>PNG encoded DNA sequence</h3><p>Each nucleotide is represented by a fixed color:<table><thead><tr><th>Nucleotide<th>RGB<th>Color name<tbody><tr><td>A ➞ Adenine<td>(0,0,255)<td>Blue<tr><td>G ➞ Guanine<td>(0,100,0)<td>Green<tr><td>C ➞ Cytosine<td>(255,0,0)<td>Red<tr><td>T ➞ Thymine<td>(255,255,0)<td>Yellow</table><p>With this in mind we can write a simple algorithm that creates a PNG representation
+of a DNA sequence.<pre tabindex=0 style=background-color:#fff><code><span style=display:flex><span>{ Algorithm 2: Naive DNA to PNG encode <span style=color:#00f>from</span> FASTA file }
+</span></span><span style=display:flex><span>procedure EncodeDNASequenceToPNG(f)
+</span></span><span style=display:flex><span>begin
+</span></span><span style=display:flex><span>  i image
+</span></span><span style=display:flex><span>  <span style=color:#00f>while</span> <span style=color:#00f>not</span> eof(f) do
+</span></span><span style=display:flex><span>    c char := read(f)                               { Read 1 char <span style=color:#00f>from</span> file }
+</span></span><span style=display:flex><span>    case c of
+</span></span><span style=display:flex><span>      <span style=color:#a31515>&#39;A&#39;</span>: color := RGB(0, 0, 255)                  { Blue }
+</span></span><span style=display:flex><span>      <span style=color:#a31515>&#39;G&#39;</span>: color := RGB(0, 100, 0)                  { Green }
+</span></span><span style=display:flex><span>      <span style=color:#a31515>&#39;C&#39;</span>: color := RGB(255, 0, 0)                  { Red }
+</span></span><span style=display:flex><span>      <span style=color:#a31515>&#39;T&#39;</span>: color := RGB(255, 255, 0)                { Yellow }
+</span></span><span style=display:flex><span>    drawRect(i, [x, y], color)                      { Draw square at the next free position [x, y] }
+</span></span><span style=display:flex><span>  save(i)                                           { Save PNG image }
+</span></span><span style=display:flex><span>end
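+</span></span><span style=display:flex><span>
+</span></span><span style=display:flex><span>{ Sketch (an assumption, not taken from the toolkit): one way to choose [x, y] }
+</span></span><span style=display:flex><span>{ n = index of the base, cols = squares per row, s = square size in pixels }
+</span></span><span style=display:flex><span>procedure SquarePosition(n, cols, s) [x, y]
+</span></span><span style=display:flex><span>begin
+</span></span><span style=display:flex><span>  <span style=color:#00f>return</span> [(n mod cols) * s, (n div cols) * s]       { Fill rows left to right, top to bottom }
+</span></span><span style=display:flex><span>end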
+</span></span></code></pre><h2 id=encoding-text-file-in-practice>Encoding text file in practice</h2><p>In this example we will take a simple text file as our input stream for
+encoding. The file contains a quote from Niels Bohr and is saved as a txt file.<blockquote><p>How wonderful that we have met with a paradox. Now we have some hope of
+making progress.
+― Niels Bohr</blockquote><p>First we encode the text file into a FASTA file.<pre tabindex=0 style=background-color:#fff><code><span style=display:flex><span>./dnae-encode -i quote.txt -o quote.fa
+</span></span><span style=display:flex><span>2019/01/10 00:38:29 Gathering input file stats
+</span></span><span style=display:flex><span>2019/01/10 00:38:29 Starting encoding ...
+</span></span><span style=display:flex><span> 106 B / 106 B [==================================] 100.00% 0s
+</span></span><span style=display:flex><span>2019/01/10 00:38:29 Saving to FASTA file ...
+</span></span><span style=display:flex><span>2019/01/10 00:38:29 Output FASTA file length is 438 B
+</span></span><span style=display:flex><span>2019/01/10 00:38:29 Process took 987.263µs
+</span></span><span style=display:flex><span>2019/01/10 00:38:29 Done ...
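+</span></span></code></pre><p>The output file <code>quote.fa</code> contains the encoded DNA sequence in ASCII format. The 106 input bytes become 424 nucleotides (four per byte); the rest of the 438 B file is the header line and line breaks.<pre><code>&gt;SEQ1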
+GACAGCTTGTGTACAAGTGTGCTTGCTCGCGAGCGGGTACGCGCGTGGGCTAACAAGTGA
+GCCAGCAGGTGAACAAGTGTGCGGACAAGCCAGCAGGTGCGCGGACAAGCTGGCGGGTGA
+ACAAGTGTGCCGGTGAGCCAACAAGCAGACAAGTAAGCAGGTACGCAGGCGAGCTTGTCA
+ACTCACAAGATCGCTTGTGTACAAGTGTGCGGACAAGCCAGCAGGTGCGCGGACAAGTAT
+GCTTGCTGGCGGACAAGCCAGCTTGTAAGCGGACAAGCTTGCGCACAAGCTGGCAGGCCT
+GCCGGCTCGCGTACAAATTCACAAGTAAGTACGCTTGCGTGTACGCGGGTATGTATACTC
+AACCTCACCAAACGGGACAAGATCGCCGGCGGGCTAGTATACAAGAACGCTTGCCAGTAC
+AACC
+</code></pre><p>Then we take the FASTA file from the previous step and encode it into a PNG.<pre tabindex=0 style=background-color:#fff><code><span style=display:flex><span>./dnae-png -i quote.fa -o quote.png
+</span></span><span style=display:flex><span>2019/01/10 00:40:09 Gathering input file stats ...
+</span></span><span style=display:flex><span>2019/01/10 00:40:09 Deconstructing FASTA file ...
+</span></span><span style=display:flex><span>2019/01/10 00:40:09 Compositing image file ...
+</span></span><span style=display:flex><span> 424 / 424 [==================================] 100.00% 0s
+</span></span><span style=display:flex><span>2019/01/10 00:40:09 Saving output file ...
+</span></span><span style=display:flex><span>2019/01/10 00:40:09 Output image file length is 1.1 kB
+</span></span><span style=display:flex><span>2019/01/10 00:40:09 Process took 19.036117ms
+</span></span><span style=display:flex><span>2019/01/10 00:40:09 Done ...
+</span></span></code></pre><p>After encoding into PNG format the file looks like this.<p><img src=/assets/dna-sequence/quote.png alt="Encoded Quote in PNG format"><p>The larger the input stream, the larger the PNG file.<p>A basic Hello World C program compiled with
+<a href=https://www.gnu.org/software/gcc/>GCC</a> would <a href=/assets/dna-sequence/sample.png>look
+like this</a>.<pre tabindex=0 style=background-color:#fff><code><span style=display:flex><span><span style=color:green>// gcc -O3 -o sample sample.c
+</span></span></span><span style=display:flex><span><span style=color:green></span><span style=color:#00f>#include</span> <span style=color:#00f>&lt;stdio.h&gt;</span><span style=color:#00f>
+</span></span></span><span style=display:flex><span><span style=color:#00f></span>
+</span></span><span style=display:flex><span>int main(void) {
+</span></span><span style=display:flex><span>  printf(<span style=color:#a31515>&#34;Hello, world!</span><span style=color:#a31515>\n</span><span style=color:#a31515>&#34;</span>);
+</span></span><span style=display:flex><span>  <span style=color:#00f>return</span> 0;
+</span></span><span style=display:flex><span>}
+</span></span></code></pre><h2 id=toolkit-for-encoding-data>Toolkit for encoding data</h2><p>I have created a toolkit with two main programs:<ul><li>dnae-encode (encodes a file into a FASTA file)<li>dnae-png (encodes a FASTA file into a PNG)</ul><p>The toolkit with full source code is available on
+<a href=https://github.com/mitjafelicijan/dna-encoding>github.com/mitjafelicijan/dna-encoding</a>.<h3 id=dnae-encode>dnae-encode</h3><pre tabindex=0 style=background-color:#fff><code><span style=display:flex><span>&gt; ./dnae-encode --help
+</span></span><span style=display:flex><span>usage: dnae-encode --input=INPUT [&lt;flags&gt;]
+</span></span><span style=display:flex><span>
+</span></span><span style=display:flex><span>A command-line application that encodes file into DNA sequence.
+</span></span><span style=display:flex><span>
+</span></span><span style=display:flex><span>Flags:
+</span></span><span style=display:flex><span>      --help             Show context-sensitive help (also try --help-long and --help-man).
+</span></span><span style=display:flex><span>  -i, --input=INPUT      Input file (ASCII or binary) which will be encoded into DNA sequence.
+</span></span><span style=display:flex><span>  -o, --output=<span style=color:#a31515>&#34;out.fa&#34;</span>  Output file which stores DNA sequence in FASTA format.
+</span></span><span style=display:flex><span>  -s, --sequence=SEQ1    The description line (defline) or header/identifier line, gives a name and/or a unique identifier <span style=color:#00f>for</span> the sequence.
+</span></span><span style=display:flex><span>  -c, --columns=60       Row characters length (no more than 120 characters). Devices preallocate fixed line sizes in software.
+</span></span><span style=display:flex><span>      --version          Show application version.
+</span></span></code></pre><h3 id=dnae-png>dnae-png</h3><pre tabindex=0 style=background-color:#fff><code><span style=display:flex><span>&gt; ./dnae-png --help
+</span></span><span style=display:flex><span>usage: dnae-png --input=INPUT [&lt;flags&gt;]
+</span></span><span style=display:flex><span>
+</span></span><span style=display:flex><span>A command-line application that encodes FASTA file into PNG image.
+</span></span><span style=display:flex><span>
+</span></span><span style=display:flex><span>Flags:
+</span></span><span style=display:flex><span>      --help              Show context-sensitive help (also try --help-long and --help-man).
+</span></span><span style=display:flex><span>  -i, --input=INPUT       Input FASTA file which will be encoded into PNG image.
+</span></span><span style=display:flex><span>  -o, --output=<span style=color:#a31515>&#34;out.png&#34;</span>  Output file in PNG format that represents DNA sequence in graphical way.
+</span></span><span style=display:flex><span>  -s, --size=10           Size of pairings of DNA bases on image in pixels (lower resolution lower file size).
+</span></span><span style=display:flex><span>      --version           Show application version.
+</span></span></code></pre><h2 id=benchmarks>Benchmarks</h2><p>First we generate some binary sample data with dd.<pre tabindex=0 style=background-color:#fff><code><span style=display:flex><span>dd <span style=color:#00f>if</span>=&lt;(openssl enc -aes-256-ctr  -pass pass:<span style=color:#a31515>&#34;</span><span style=color:#00f>$(</span>dd <span style=color:#00f>if</span>=/dev/urandom bs=128 count=1 2&gt;/dev/null | base64<span style=color:#00f>)</span><span style=color:#a31515>&#34;</span> -nosalt &lt; /dev/zero) of=1KB.bin bs=1KB count=1 iflag=fullblock
+</span></span></code></pre><p>Our freshly generated 1KB file looks something like this (it's full of garbage
+data, as intended).<p><img src=/assets/dna-sequence/sample-binary-file.png alt="Sample binary file 1KB"><p>We create the following binary files:<ul><li>1KB.bin<li>10KB.bin<li>100KB.bin<li>1MB.bin<li>10MB.bin<li>100MB.bin</ul><p>After this we create FASTA files for all the binary files by encoding them
+into DNA sequence.<pre tabindex=0 style=background-color:#fff><code><span style=display:flex><span>./dnae-encode -i 100MB.bin -o 100MB.fa
+</span></span></code></pre><p>Then we GZIP all the FASTA files to see how much they can be compressed.<pre tabindex=0 style=background-color:#fff><code><span style=display:flex><span>gzip -9 &lt; 10MB.fa &gt; 10MB.fa.gz
+</span></span></code></pre><p><a href=/dna-sequence/benchmarks.ods>Download ODS file with benchmarks</a>.<p><img src=/assets/dna-sequence/chart-1.png alt="Benchmark chart 1"><p><img src=/assets/dna-sequence/chart-2.png alt="Benchmark chart 2"><h2 id=references>References</h2><ul><li><a href=https://www.techopedia.com/definition/948/encoding>https://www.techopedia.com/definition/948/encoding</a><li><a href=https://www.dna-worldwide.com/resource/160/history-dna-timeline>https://www.dna-worldwide.com/resource/160/history-dna-timeline</a><li><a href=https://opentextbc.ca/biology/chapter/9-1-the-structure-of-dna/>https://opentextbc.ca/biology/chapter/9-1-the-structure-of-dna/</a><li><a href=https://arxiv.org/abs/1801.04774>https://arxiv.org/abs/1801.04774</a><li><a href=https://en.wikipedia.org/wiki/FASTA_format>https://en.wikipedia.org/wiki/FASTA_format</a></ul></div></div></main><footer><hr><div><h3>Want to comment or have something to add?</h3>You can write me an email at
+<a href=mailto:m@mitjafelicijan.com>m@mitjafelicijan.com</a> or catch up
+with me
+<a href=https://telegram.me/mitjafelicijan target=_blank>on Telegram</a>.</div><hr><p>This website does not track you. Content is made available under
+the <a href=https://creativecommons.org/licenses/by/4.0/ target=_blank rel=noreferrer>CC BY 4.0 license</a> unless specified
+otherwise. Blog feed is available as <a href=/index.xml target=_blank>RSS feed</a>.</footer><script src=https://cdn.usefathom.com/script.js data-site=XHQARKXP defer></script>
+\ No newline at end of file