| author | Mitja Felicijan <mitja.felicijan@gmail.com> | 2023-06-27 14:50:20 +0200 |
|---|---|---|
| committer | Mitja Felicijan <mitja.felicijan@gmail.com> | 2023-06-27 14:50:20 +0200 |
| commit | 8697555125c57ae64a0c9b78514b4aac4fd523de (patch) | |
| tree | a699df53a7c35a4425f30bca86982c4341f6de40 /content/posts/2019-01-03-encoding-binary-data-into-dna-sequence.md | |
| parent | 33b2615a5038bc85036081e8b5e0da8584d88097 (diff) | |
Massive formatting and added figcaption
Diffstat (limited to 'content/posts/2019-01-03-encoding-binary-data-into-dna-sequence.md')
| -rw-r--r-- | content/posts/2019-01-03-encoding-binary-data-into-dna-sequence.md | 143 |
1 files changed, 70 insertions, 73 deletions
diff --git a/content/posts/2019-01-03-encoding-binary-data-into-dna-sequence.md b/content/posts/2019-01-03-encoding-binary-data-into-dna-sequence.md
index a57201f..f0343ae 100644
--- a/content/posts/2019-01-03-encoding-binary-data-into-dna-sequence.md
+++ b/content/posts/2019-01-03-encoding-binary-data-into-dna-sequence.md
| @@ -7,64 +7,59 @@ draft: false | |||
| 7 | 7 | ||
| 8 | ## Initial thoughts | 8 | ## Initial thoughts |
| 9 | 9 | ||
| 10 | Imagine a world where you could go outside and take a leaf from a tree and put | 10 | Imagine a world where you could go outside and take a leaf from a tree and put |
| 11 | it through your personal DNA sequencer and get data like music, videos or | 11 | it through your personal DNA sequencer and get data like music, videos or |
| 12 | computer programs from it. Well, this is all possible now. It was not done | 12 | computer programs from it. Well, this is all possible now. It was not done on a |
| 13 | on a large scale because it is quite expensive to create DNA strands but it's | 13 | large scale because it is quite expensive to create DNA strands but it's |
| 14 | possible. | 14 | possible. |
| 15 | 15 | ||
| 16 | Encoding data into a DNA sequence is a relatively simple process once you understand | 16 | Encoding data into a DNA sequence is a relatively simple process once you understand |
| 17 | the relationship between binary data and nucleotides and scientists have been | 17 | the relationship between binary data and nucleotides and scientists have been |
| 18 | making large leaps in this field in order to provide viable long-term storage | 18 | making large leaps in this field in order to provide viable long-term storage |
| 19 | solution for our data that would potentially survive our species in case of | 19 | solution for our data that would potentially survive our species in case of |
| 20 | global disaster. We could imprint all the world's knowledge into plants and | 20 | global disaster. We could imprint all the world's knowledge into plants and |
| 21 | ensure the survival of our knowledge. | 21 | ensure the survival of our knowledge. |
| 22 | 22 | ||
| 23 | More optimistic usage for this technology would be easier storage of ever | 23 | More optimistic usage for this technology would be easier storage of ever |
| 24 | growing data we produce every day. Once machines for sequencing DNA become fast | 24 | growing data we produce every day. Once machines for sequencing DNA become fast |
| 25 | enough and cheaper this could mean the next evolution of storing data and | 25 | enough and cheaper this could mean the next evolution of storing data and |
| 26 | abandoning classical hard and solid state drives in data warehouses. | 26 | abandoning classical hard and solid state drives in data warehouses. |
| 27 | 27 | ||
| 28 | As we currently stand this is still not viable but it is quite an amazing and | 28 | As we currently stand this is still not viable but it is quite an amazing and |
| 29 | cool technology. | 29 | cool technology. |
| 30 | 30 | ||
| 31 | My interests in this field are purely in encoding processes and experimental | 31 | My interests in this field are purely in encoding processes and experimental |
| 32 | testing, mainly because I don't have access to these expensive machines. My | 32 | testing, mainly because I don't have access to these expensive machines. My |
| 33 | initial goal was to create a toolkit that can be used by everybody to encode | 33 | initial goal was to create a toolkit that can be used by everybody to encode |
| 34 | their data into a proper DNA sequence. | 34 | their data into a proper DNA sequence. |
| 35 | 35 | ||
| 36 | ## Glossary | 36 | ## Glossary |
| 37 | 37 | ||
| 38 | **deoxyribose** | 38 | **deoxyribose** A five-carbon sugar molecule with a hydrogen atom rather than a |
| 39 | A five-carbon sugar molecule with a hydrogen atom rather than a hydroxyl group | 39 | hydroxyl group in the 2′ position; the sugar component of DNA nucleotides. |
| 40 | in the 2′ position; the sugar component of DNA nucleotides. | ||
| 41 | 40 | ||
| 42 | **double helix** | 41 | **double helix** The molecular shape of DNA in which two strands of nucleotides |
| 43 | The molecular shape of DNA in which two strands of nucleotides wind around | 42 | wind around each other in a spiral shape. |
| 44 | each other in a spiral shape. | ||
| 45 | 43 | ||
| 46 | **nitrogenous base** | 44 | **nitrogenous base** A nitrogen-containing molecule that acts as a base; often |
| 47 | A nitrogen-containing molecule that acts as a base; often referring to one of | 45 | referring to one of the purine or pyrimidine components of nucleic acids. |
| 48 | the purine or pyrimidine components of nucleic acids. | ||
| 49 | 46 | ||
| 50 | **phosphate group** | 47 | **phosphate group** A molecular group consisting of a central phosphorus atom |
| 51 | A molecular group consisting of a central phosphorus atom bound to four oxygen | 48 | bound to four oxygen atoms. |
| 52 | atoms. | ||
| 53 | 49 | ||
| 54 | **RGB** | 50 | **RGB** The RGB color model is an additive color model in which red, green and |
| 55 | The RGB color model is an additive color model in which red, green and blue | 51 | blue light are added together in various ways to reproduce a broad array of |
| 56 | light are added together in various ways to reproduce a broad array of colors. | 52 | colors. |
| 57 | 53 | ||
| 58 | **GCC** | 54 | **GCC** The GNU Compiler Collection is a compiler system produced by the GNU |
| 59 | The GNU Compiler Collection is a compiler system produced by the GNU Project | 55 | Project supporting various programming languages. |
| 60 | supporting various programming languages. | ||
| 61 | 56 | ||
| 62 | ## Data encoding | 57 | ## Data encoding |
| 63 | 58 | ||
| 64 | **TL;DR:** Encoding involves the use of a code to change original data into a | 59 | **TL;DR:** Encoding involves the use of a code to change original data into a |
| 65 | form that can be used by an external process. | 60 | form that can be used by an external process. |
| 66 | 61 | ||
| 67 | Encoding is the process of converting data into a format required for a number | 62 | Encoding is the process of converting data into a format required for a number |
| 68 | of information processing needs, including: | 63 | of information processing needs, including: |
| 69 | 64 | ||
| 70 | - Program compiling and execution | 65 | - Program compiling and execution |
| @@ -74,7 +69,7 @@ of information processing needs, including: | |||
| 74 | Encoding can have two meanings: | 69 | Encoding can have two meanings: |
| 75 | 70 | ||
| 76 | - In computer technology, encoding is the process of applying a specific code, | 71 | - In computer technology, encoding is the process of applying a specific code, |
| 77 | such as letters, symbols and numbers, to data for conversion into an | 72 | such as letters, symbols and numbers, to data for conversion into an |
| 78 | equivalent cipher. | 73 | equivalent cipher. |
| 79 | - In electronics, encoding refers to analog to digital conversion. | 74 | - In electronics, encoding refers to analog to digital conversion. |
| 80 | 75 | ||
| @@ -99,23 +94,25 @@ Encoding can have two meanings: | |||
| 99 | 94 | ||
| 100 | ## What is DNA? | 95 | ## What is DNA? |
| 101 | 96 | ||
| 102 | Deoxyribonucleic acid, a self-replicating material which is | 97 | Deoxyribonucleic acid, a self-replicating material which is **present in nearly |
| 103 | **present in nearly all living organisms** as the main constituent of | 98 | all living organisms** as the main constituent of chromosomes. It is the |
| 104 | chromosomes. It is the **carrier of genetic information**. | 99 | **carrier of genetic information**. |
| 105 | 100 | ||
| 106 | > The nitrogen in our DNA, the calcium in our teeth, the iron in our blood, | 101 | > The nitrogen in our DNA, the calcium in our teeth, the iron in our blood, |
| 107 | > the carbon in our apple pies were made in the interiors of collapsing stars. | 102 | > the carbon in our apple pies were made in the interiors of collapsing stars. |
| 108 | > We are made of starstuff. | 103 | > We are made of starstuff. |
| 109 | > **-- Carl Sagan, Cosmos** | 104 | > **-- Carl Sagan, Cosmos** |
| 110 | 105 | ||
| 111 | The nucleotide in DNA consists of a sugar (deoxyribose), one of four bases | 106 | The nucleotide in DNA consists of a sugar (deoxyribose), one of four bases |
| 112 | (cytosine (C), thymine (T), adenine (A), guanine (G)), and a phosphate. | 107 | (cytosine (C), thymine (T), adenine (A), guanine (G)), and a phosphate. |
| 113 | Cytosine and thymine are pyrimidine bases, while adenine and guanine are purine | 108 | Cytosine and thymine are pyrimidine bases, while adenine and guanine are purine |
| 114 | bases. The sugar and the base together are called a nucleoside. | 109 | bases. The sugar and the base together are called a nucleoside. |
| 115 | 110 | ||
| 116 |  | 111 |  |
| 117 | 112 | ||
| 118 | *DNA (a) forms a double stranded helix, and (b) adenine pairs with thymine and cytosine pairs with guanine. (credit a: modification of work by Jerome Walker, Dennis Myts)* | 113 | *DNA (a) forms a double stranded helix, and (b) adenine pairs with thymine and |
| 114 | cytosine pairs with guanine. (credit a: modification of work by Jerome Walker, | ||
| 115 | Dennis Myts)* | ||
| 119 | 116 | ||
| 120 | ## Encode binary data into DNA sequence | 117 | ## Encode binary data into DNA sequence |
| 121 | 118 | ||
| @@ -128,18 +125,18 @@ As an input file you can use any file you want: | |||
| 128 | - Database files, | 125 | - Database files, |
| 129 | - etc. | 126 | - etc. |
| 130 | 127 | ||
| 131 | Note: If you would copy all the bytes from RAM to file or pipe data to file you | 128 | Note: If you would copy all the bytes from RAM to file or pipe data to file you |
| 132 | could encode also this data as long as you provide file pointer to the encoder. | 129 | could encode also this data as long as you provide file pointer to the encoder. |
| 133 | 130 | ||
| 134 | ### Basic Encoding | 131 | ### Basic Encoding |
| 135 | 132 | ||
| 136 | As already mentioned, the Basic Encoding is based on a simple mapping. Since | 133 | As already mentioned, the Basic Encoding is based on a simple mapping. Since DNA |
| 137 | DNA is composed of 4 nucleotides (Adenine, Cytosine, Guanine, Thymine; usually | 134 | is composed of 4 nucleotides (Adenine, Cytosine, Guanine, Thymine; usually |
| 138 | referred using the first letter). Using this technique we can encode | 135 | referred using the first letter). Using this technique we can encode |
| 139 | 136 | ||
| 140 | $$ \log_2(4) = \log_2(2^2) = 2 \text{ bits} $$ | 137 | $$ \log_2(4) = \log_2(2^2) = 2 \text{ bits} $$ |
| 141 | 138 | ||
| 142 | using a single nucleotide. In this way, we are able to use the 4 bases that | 139 | using a single nucleotide. In this way, we are able to use the 4 bases that |
| 143 | compose the DNA strand to encode each byte of data. | 140 | compose the DNA strand to encode each byte of data. |
| 144 | 141 | ||
| 145 | | Two bits | Nucleotides | | 142 | | Two bits | Nucleotides | |
| @@ -149,8 +146,8 @@ compose the DNA strand to encode each byte of data. | |||
| 149 | | 01 | **C** (Cytosine) | | 146 | | 01 | **C** (Cytosine) | |
| 150 | | 11 | **T** (Thymine) | | 147 | | 11 | **T** (Thymine) | |
| 151 | 148 | ||
| 152 | With this in mind we can simply encode any data by using two-bit to | 149 | With this in mind we can simply encode any data by using two-bit to Nucleotides |
| 153 | Nucleotides conversion. | 150 | conversion. |
| 154 | 151 | ||
| 155 | ```python | 152 | ```python |
| 156 | { Algorithm 1: Naive byte array to DNA encode } | 153 | { Algorithm 1: Naive byte array to DNA encode } |
| @@ -173,29 +170,29 @@ begin | |||
| 173 | end | 170 | end |
| 174 | ``` | 171 | ``` |
| 175 | 172 | ||
| 176 | Another encoding would be **Goldman encoding**. Using this encoding helps with | 173 | Another encoding would be **Goldman encoding**. Using this encoding helps with |
| 177 | Nonsense mutation (amino acids replaced by a stop codon) that occurs and is | 174 | Nonsense mutation (amino acids replaced by a stop codon) that occurs and is the |
| 178 | the most problematic during translation because it leads to truncated amino | 175 | most problematic during translation because it leads to truncated amino acid |
| 179 | acid sequences, which in turn results in truncated proteins. | 176 | sequences, which in turn results in truncated proteins. |
| 180 | 177 | ||
| 181 | [Where to store big data? In DNA: Nick Goldman at TEDxPrague](https://www.youtube.com/watch?v=a4PiGWNsIEU) | 178 | [Where to store big data? In DNA: Nick Goldman at TEDxPrague](https://www.youtube.com/watch?v=a4PiGWNsIEU) |
| 182 | 179 | ||
| 183 | ### FASTA file format | 180 | ### FASTA file format |
| 184 | 181 | ||
| 185 | In bioinformatics, FASTA format is a text-based format for representing either | 182 | In bioinformatics, FASTA format is a text-based format for representing either |
| 186 | nucleotide sequences or peptide sequences, in which nucleotides or amino acids | 183 | nucleotide sequences or peptide sequences, in which nucleotides or amino acids |
| 187 | are represented using single-letter codes. The format also allows for sequence | 184 | are represented using single-letter codes. The format also allows for sequence |
| 188 | names and comments to precede the sequences. The format originates from the | 185 | names and comments to precede the sequences. The format originates from the |
| 189 | FASTA software package, but has now become a standard in the field of | 186 | FASTA software package, but has now become a standard in the field of |
| 190 | bioinformatics. | 187 | bioinformatics. |
| 191 | 188 | ||
| 192 | The first line in a FASTA file started either with a ">" (greater-than) symbol | 189 | The first line in a FASTA file started either with a ">" (greater-than) symbol |
| 193 | or, less frequently, a ";" (semicolon) was taken as a comment. Subsequent | 190 | or, less frequently, a ";" (semicolon) was taken as a comment. Subsequent lines |
| 194 | lines starting with a semicolon would be ignored by software. Since the only | 191 | starting with a semicolon would be ignored by software. Since the only comment |
| 195 | comment used was the first, it quickly became used to hold a summary | 192 | used was the first, it quickly became used to hold a summary description of the |
| 196 | description of the sequence, often starting with a unique library accession | 193 | sequence, often starting with a unique library accession number, and with time |
| 197 | number, and with time it has become commonplace to always use ">" for the first | 194 | it has become commonplace to always use ">" for the first line and to not use |
| 198 | line and to not use ";" comments (which would otherwise be ignored). | 195 | ";" comments (which would otherwise be ignored). |
| 199 | 196 | ||
| 200 | ``` | 197 | ``` |
| 201 | ;LCBO - Prolactin precursor - Bovine | 198 | ;LCBO - Prolactin precursor - Bovine |
| @@ -218,7 +215,7 @@ GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX | |||
| 218 | IENY | 215 | IENY |
| 219 | ``` | 216 | ``` |
| 220 | 217 | ||
| 221 | FASTA format was extended by [FASTQ](https://en.wikipedia.org/wiki/FASTQ_format) | 218 | FASTA format was extended by [FASTQ](https://en.wikipedia.org/wiki/FASTQ_format) |
| 222 | format from the [Sanger Centre](https://www.sanger.ac.uk/) in Cambridge. | 219 | format from the [Sanger Centre](https://www.sanger.ac.uk/) in Cambridge. |
| 223 | 220 | ||
| 224 | ### PNG encoded DNA sequence | 221 | ### PNG encoded DNA sequence |
| @@ -230,7 +227,7 @@ format from the [Sanger Centre](https://www.sanger.ac.uk/) in Cambridge. | |||
| 230 | | C ➞ Cytosine | (255,0,0) | Red | | 227 | | C ➞ Cytosine | (255,0,0) | Red | |
| 231 | | T ➞ Thymine | (255,255,0) | Yellow | | 228 | | T ➞ Thymine | (255,255,0) | Yellow | |
| 232 | 229 | ||
| 233 | With this in mind we can create a simple algorithm to create PNG representation | 230 | With this in mind we can create a simple algorithm to create PNG representation |
| 234 | of a DNA sequence. | 231 | of a DNA sequence. |
| 235 | 232 | ||
| 236 | ```python | 233 | ```python |
| @@ -252,7 +249,7 @@ end | |||
| 252 | 249 | ||
| 253 | ## Encoding text file in practice | 250 | ## Encoding text file in practice |
| 254 | 251 | ||
| 255 | In this example we will take a simple text file as our input stream for | 252 | In this example we will take a simple text file as our input stream for |
| 256 | encoding. This file contains a quote from Niels Bohr and is saved as a txt file. | 253 | encoding. This file contains a quote from Niels Bohr and is saved as a txt file. |
| 257 | 254 | ||
| 258 | > How wonderful that we have met with a paradox. Now we have some hope of | 255 | > How wonderful that we have met with a paradox. Now we have some hope of |
| @@ -306,8 +303,9 @@ After encoding into PNG format this file looks like this. | |||
| 306 | 303 | ||
| 307 | The larger the input stream is the larger the PNG file would be. | 304 | The larger the input stream is the larger the PNG file would be. |
| 308 | 305 | ||
| 309 | Compiled basic Hello World C program with [GCC](https://www.gnu.org/software/gcc/) | 306 | Compiled basic Hello World C program with |
| 310 | would [look like](/assets/dna-sequence/sample.png). | 307 | [GCC](https://www.gnu.org/software/gcc/) would [look |
| 308 | like](/assets/dna-sequence/sample.png). | ||
| 311 | 309 | ||
| 312 | ```c | 310 | ```c |
| 313 | // gcc -O3 -o sample sample.c | 311 | // gcc -O3 -o sample sample.c |
| @@ -370,7 +368,7 @@ First we generate some binary sample data with dd. | |||
| 370 | dd if=<(openssl enc -aes-256-ctr -pass pass:"$(dd if=/dev/urandom bs=128 count=1 2>/dev/null | base64)" -nosalt < /dev/zero) of=1KB.bin bs=1KB count=1 iflag=fullblock | 368 | dd if=<(openssl enc -aes-256-ctr -pass pass:"$(dd if=/dev/urandom bs=128 count=1 2>/dev/null | base64)" -nosalt < /dev/zero) of=1KB.bin bs=1KB count=1 iflag=fullblock |
| 371 | ``` | 369 | ``` |
| 372 | 370 | ||
| 373 | Our freshly generated 1KB file looks something like this (it's full of garbage | 371 | Our freshly generated 1KB file looks something like this (it's full of garbage |
| 374 | data as intended). | 372 | data as intended). |
| 375 | 373 | ||
| 376 |  | 374 |  |
| @@ -410,4 +408,3 @@ gzip -9 < 10MB.fa > 10MB.fa.gz | |||
| 410 | - https://opentextbc.ca/biology/chapter/9-1-the-structure-of-dna/ | 408 | - https://opentextbc.ca/biology/chapter/9-1-the-structure-of-dna/ |
| 411 | - https://arxiv.org/abs/1801.04774 | 409 | - https://arxiv.org/abs/1801.04774 |
| 412 | - https://en.wikipedia.org/wiki/FASTA_format | 410 | - https://en.wikipedia.org/wiki/FASTA_format |
| 413 | |||
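
For readers of this diff, the two-bit mapping described in the Basic Encoding hunk can be sketched in a few lines of Python. This is an illustrative sketch, not the post's own implementation. The hunk only shows the rows `01` → C and `11` → T, so the pairs `00` → A and `10` → G are assumptions, as are the function names `encode` and `decode` and the choice to read each byte from its most significant bit pair first.

```python
# Sketch only. The diff shows 01 -> C and 11 -> T; the entries
# 00 -> A and 10 -> G are assumed to complete the table.
BITS_TO_BASE = {0b00: "A", 0b01: "C", 0b10: "G", 0b11: "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Map every byte to four nucleotides, most significant bit pair first."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            # Isolate one pair of bits and look up its nucleotide.
            bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
    return "".join(bases)


def decode(sequence: str) -> bytes:
    """Inverse of encode: pack every four nucleotides back into one byte."""
    if len(sequence) % 4 != 0:
        raise ValueError("sequence length must be a multiple of 4")
    out = bytearray()
    for start in range(0, len(sequence), 4):
        byte = 0
        for base in sequence[start:start + 4]:
            # Shift the bits collected so far and append the next pair.
            byte = (byte << 2) | BASE_TO_BITS[base]
        out.append(byte)
    return bytes(out)


if __name__ == "__main__":
    sample = b"Hi"
    dna = encode(sample)
    print(dna)
    print(decode(dna) == sample)
```

Because each nucleotide carries two bits, the encoded sequence always has four times as many symbols as the input has bytes, which is consistent with the post's remark that a larger input stream produces a larger output.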
