diff --git a/_drafts/2019-01-03-encoding-binary-data-into-dna-sequence.md b/_drafts/2019-01-03-encoding-binary-data-into-dna-sequence.md
new file mode 100644
index 0000000..14e5efe
--- /dev/null
+++ b/_drafts/2019-01-03-encoding-binary-data-into-dna-sequence.md
@@ -0,0 +1,144 @@
---

layout: post
title: Encoding binary data into DNA sequence
description: ok

---

**Table of contents**

1. [Initial thoughts](#initial-thoughts)
2. [Data encoding](#data-encoding)
3. [Glossary](#glossary)
4. [Bit of theory and history](#bit-of-theory-and-history)
   1. [Quick history of DNA](#quick-history-of-dna)
   2. [What is DNA?](#what-is-dna)
5. [Encode binary data into DNA sequence](#encode-binary-data-into-dna-sequence)
   1. [Basic Encoding](#basic-encoding)
   2. [FASTA file format](#fasta-file-format)
   3. [PNG encoded DNA sequence](#png-encoded-dna-sequence)
6. [References](#references)

## Initial thoughts

Todo ...

## Data encoding

**TL;DR:** Encoding uses a code to change original data into a form that can be used by an external process [^1].

Encoding is the process of converting data into the format required by a number of information processing needs, including:

- Program compiling and execution
- Data transmission, storage and compression/decompression
- Application data processing, such as file conversion

Encoding can have two meanings [^1]:

- In computer technology, encoding is the process of applying a specific code, such as letters, symbols and numbers, to data for conversion into an equivalent cipher.
- In electronics, encoding refers to analog-to-digital conversion.
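
As a small illustration of the first meaning, the sketch below applies the ASCII code to a short string and shows the resulting bytes as bit strings. The sample text `"DNA"` and the variable names are assumptions chosen for this example; they do not come from the references.

```python
# Apply a specific code (ASCII) to text and look at the result as raw bits.
text = "DNA"
encoded = text.encode("ascii")  # str -> bytes

# Render every byte as an 8-character bit string.
bits = " ".join(format(byte, "08b") for byte in encoded)

print(list(encoded))            # [68, 78, 65]
print(bits)                     # 01000100 01001110 01000001
print(encoded.decode("ascii"))  # DNA
```

Decoding with the same code recovers the original text, which is the property the DNA mapping later in this post relies on as well.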

## Glossary

**deoxyribose**
: A five-carbon sugar molecule with a hydrogen atom rather than a hydroxyl group in the 2′ position; the sugar component of DNA nucleotides.

**double helix**
: The molecular shape of DNA in which two strands of nucleotides wind around each other in a spiral shape.

**nitrogenous base**
: A nitrogen-containing molecule that acts as a base; often referring to one of the purine or pyrimidine components of nucleic acids.

**phosphate group**
: A molecular group consisting of a central phosphorus atom bound to four oxygen atoms.

## Bit of theory and history

A short history of DNA, followed by an explanation of what it is and where it is used.

### Quick history of DNA

- **1869** - Friedrich Miescher identifies "nuclein".
- **1900s** - The Eugenics Movement.
- **1900** - Mendel's theories are rediscovered by researchers.
- **1902** - Sir Archibald Edward Garrod is the first to associate Mendel's theories with a human disease.
- **1944** - Oswald Avery identifies DNA as the "transforming principle".
- **1950** - Erwin Chargaff discovers that DNA composition is species specific.
- **1952** - Rosalind Franklin photographs crystallized DNA fibres.
- **1953** - James Watson and Francis Crick discover the double helix structure of DNA.
- **1953** - George Gamow and the "RNA Tie Club".
- **1959** - An additional copy of chromosome 21 is linked to Down's syndrome.
- **1965** - Marshall Nirenberg is the first person to sequence the bases in each codon.
- **1977** - Frederick Sanger develops rapid DNA sequencing techniques.
- **1983** - Huntington's disease is the first mapped genetic disease.
- **1990** - The first gene found to be associated with increased susceptibility to familial breast and ovarian cancer is identified.
- **1990** - The Human Genome Project begins.
- **1995** - *Haemophilus influenzae* is the first bacterium to have its genome sequenced.
- **1996** - Dolly the sheep is cloned.
- **1996** - The "Bermuda Principles" are established.
- **1999** - The first human chromosome is decoded.
- **2000** - The genetic code of the fruit fly is decoded.
- **2002** - The mouse is the first mammal to have its genome decoded.
- **2003** - The Human Genome Project is completed.
- **2013** - DNA Worldwide and Eurofins Forensic discover that identical twins have differences in their genetic makeup [^2].

### What is DNA?

Deoxyribonucleic acid (DNA) is a self-replicating material that is **present in nearly all living organisms** as the main constituent of chromosomes. It is the **carrier of genetic information**.

> The nitrogen in our DNA, the calcium in our teeth, the iron in our blood, the carbon in our apple pies were made in the interiors of collapsing stars. We are made of starstuff.
>
> **-- Carl Sagan, Cosmos**

Each nucleotide in DNA consists of a sugar (deoxyribose), one of four bases (cytosine (C), thymine (T), adenine (A) or guanine (G)), and a phosphate group. Cytosine and thymine are pyrimidine bases, while adenine and guanine are purine bases. The sugar and the base together are called a nucleoside.

![DNA](/files/dna-sequence/dna-basics.jpg#center)

*DNA (a) forms a double stranded helix, and (b) adenine pairs with thymine and cytosine pairs with guanine. (credit a: modification of work by Jerome Walker, Dennis Myts) [^3]*

## Encode binary data into DNA sequence

Todo ...

Any file can be used as input:

- ASCII files,
- Compiled programs,
- Multimedia files (MP3, MP4, MKV, etc.),
- Images,
- Database files,
- etc.

Note: If you dump bytes from RAM to a file, or pipe data into a file, you can encode that data as well, as long as you provide a file pointer to the encoder.
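
One way to treat regular files, pipes and in-memory dumps alike is to read from any binary file-like object in fixed-size chunks. The sketch below assumes exactly that; the helper name `iter_chunks` and the default chunk size are hypothetical and are not part of any existing encoder.

```python
import io
from typing import BinaryIO, Iterator


def iter_chunks(stream: BinaryIO, chunk_size: int = 4096) -> Iterator[bytes]:
    """Yield successive byte chunks from any binary file-like object."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty read means end of stream
            break
        yield chunk


# An in-memory buffer stands in for bytes copied out of RAM.
memory_source = io.BytesIO(b"\x00\x01\x02\x03\x04")
chunks = list(iter_chunks(memory_source, chunk_size=2))
print(chunks)  # [b'\x00\x01', b'\x02\x03', b'\x04']
```

The same generator should also accept a file opened with `open(path, "rb")`, or `sys.stdin.buffer` when data is piped in.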

### Basic Encoding

Basic Encoding is based on a simple mapping. DNA is composed of four nucleotides (adenine, cytosine, guanine and thymine, usually referred to by their first letters), so a single nucleotide can encode log<sub>2</sub>(4) = log<sub>2</sub>(2<sup>2</sup>) = 2 bits. One byte therefore takes four nucleotides, and the four bases that make up a DNA strand are enough to encode every byte of data. [^4]
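
The arithmetic above can be checked with a few lines of Python. This is only a sketch: `nucleotides_needed` is a hypothetical helper introduced here, and it assumes 8-bit bytes and no extra overhead such as headers or error correction.

```python
import math

ALPHABET_SIZE = 4  # A, C, G, T
BITS_PER_NUCLEOTIDE = int(math.log2(ALPHABET_SIZE))  # log2(4) = 2


def nucleotides_needed(num_bytes: int) -> int:
    """Number of nucleotides required to store num_bytes 8-bit bytes."""
    return num_bytes * 8 // BITS_PER_NUCLEOTIDE


print(nucleotides_needed(1))     # 4
print(nucleotides_needed(1024))  # 4096
```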

| Two bits | Nucleotide       |
| -------- | ---------------- |
| 00       | **A** (Adenine)  |
| 01       | **C** (Cytosine) |
| 10       | **G** (Guanine)  |
| 11       | **T** (Thymine)  |

With this mapping in mind, we can encode any data by converting each pair of bits into a nucleotide.

![DNA](/files/dna-sequence/algorithm-binary-to-dna.png#center)
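
A minimal sketch of this conversion in Python could look like the following. The bit-to-base mapping is taken from the table above; the function names `encode` and `decode` and the choice to read each byte from its most significant bit pair to its least significant are assumptions made for this example, so the diagram above may use a different order.

```python
# Mapping from the table above: each pair of bits becomes one nucleotide.
BITS_TO_BASE = {0b00: "A", 0b10: "G", 0b01: "C", 0b11: "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Turn bytes into a nucleotide string, four bases per byte."""
    bases = []
    for byte in data:
        # Assumed order: most significant bit pair first.
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
    return "".join(bases)


def decode(sequence: str) -> bytes:
    """Inverse of encode(); the length must be a multiple of four."""
    if len(sequence) % 4 != 0:
        raise ValueError("sequence length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(sequence), 4):
        byte = 0
        for base in sequence[i:i + 4]:
            byte = (byte << 2) | BASE_TO_BITS[base]
        out.append(byte)
    return bytes(out)


dna = encode(b"Hi")
print(dna)          # CAGACGGC
print(decode(dna))  # b'Hi'
```

Here `H` is `01001000`, which splits into `01 00 10 00` and maps to `CAGA`; `i` is `01101001`, which maps to `CGGC`.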

### FASTA file format

Todo ...

### PNG encoded DNA sequence

Todo ...

## References

[^1]: https://www.techopedia.com/definition/948/encoding
[^2]: https://www.dna-worldwide.com/resource/160/history-dna-timeline
[^3]: https://opentextbc.ca/biology/chapter/9-1-the-structure-of-dna/
[^4]: https://arxiv.org/abs/1801.04774