diff options
| -rw-r--r-- | _layouts/base.html | 14 | ||||
| -rw-r--r-- | _posts/2024-02-11-k-mer.md | 126 |
2 files changed, 138 insertions, 2 deletions
diff --git a/_layouts/base.html b/_layouts/base.html index aa4be41..12202d9 100644 --- a/_layouts/base.html +++ b/_layouts/base.html | |||
| @@ -4,7 +4,7 @@ | |||
| 4 | <meta charset="utf-8"> | 4 | <meta charset="utf-8"> |
| 5 | <meta name="viewport" content="width=device-width,initial-scale=1"> | 5 | <meta name="viewport" content="width=device-width,initial-scale=1"> |
| 6 | 6 | ||
| 7 | <link href="data:image/x-icon;base64,AAABAAEAEBAAAAEAIABoBAAAFgAAACgAAAAQAAAAIAAAAAEAIAAAAAAAAAQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAL69vf8AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAv76+/8LBwQkAAAAAAAAAAAAAAAC+vb3/AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAL+9vf/Bv78JAAAAAAAAAAAAAAAAu7q6/wAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC7ubr/vr29CAAAAAAAAAAAy8nJAZ6foP8AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAnqGj/6GipAoAAAAAHLjU/xcXHf/BwsL/I8XY/yPK3v8XGiD/IbjL/yPF2f8XGiD/Fxkf/yLF2f8gnK3/Fxog/62ztv8fwNf/FRcd/x271v8mz93/GRsi/xkXHf8p097/GiIp/xobIv8p0t3/KdPe/xocIv8fYmr/KNPe/xoZH/8aHCL/J87c/xy81/8VFxz/IsPZ/8zS0/8XGiD/Ir/R/yPH2/8XGiD/Fxkf/yPH2/8dd4T/GBog/yPJ3f8jyNr/uru9/xcUGv8cudb/EhITDKi5vRKlvMP/RUpOERwcHRAdOj4QHTk8EBwdHRAdNTgQHTo/EBwcHRAcHB0QSGduEKW4vf+koqQfHzg+EBqz0ewSFRv7EyMr/xq51vsTERb7ExUb+xq41fsau9j7ExUb+xiPp/sZudb7ExUb+xMVG/sZuNX/GKvI/BIUGfMdvdn/IrfL/xcaIP8n1eb/J9Dh/xkcIf8ZGR7/J8/f/xxCSv8ZGyH/J9Dg/ybQ4P8ZHCL/FSQs/yPK3/8UExj/GE1b/ybS5P8ZGB7/Ghwj/ynW5P8p2Ob/Ghwi/yWrtv8p1eH/Ghwi/xocIv8p1uT/J8XT/xkcIv8m1un/Hb7d/xUYH/8hzOr/HtHu/xcaIf8XGB//I8vi/xgxOv8XGSD/I8rg/yPK4P8XGiD/GUFL/yPP6f8SERj/Fhkh/x3A4f8AAAAAJ2f9/ydr//8mZPH/AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAlYu38J2v//ydo/f8AAAAAAAAAAAd8/fkFqf//Iob8sAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAMY39awWr//8FfP3/AAAAAAAAAAAFm/7/SfD//wR+/f8AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAOB/f9B7v//BaX+/wAAAAAAAAAAQ878SAyZ/v9n1v4KAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADu9v8DDJb+/z3N/XgAAAAA3/sAAN/7AADf+wAA3/sAAAAAAAAAAAAAAAAAAN/7AAAAAAAAAAAAAAAAAAAAAAAAj/EAAI/5AACP8QAA3/sAAA==" rel="icon" type="image/x-icon" /> | 7 | <link href="data:image/x-icon;base64,AAABAAEAEBAAAAEAIABoBAAAFgAAACgAAAAQAAAAIAAAAAEAIAAAAAAAAAQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAL69vf8AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAv76+/8LBwQkAAAAAAAAAAAAAAAC+vb3/AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAL+9vf/Bv78JAAAAAAAAAAAAAAAAu7q6/wAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC7ubr/vr29CAAAAAAAAAAAy8nJAZ6foP8AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAnqGj/6GipAoAAAAAHLjU/xcXHf/BwsL/I8XY/yPK3v8XGiD/IbjL/yPF2f8XGiD/Fxkf/yLF2f8gnK3/Fxog/62ztv8fwNf/FRcd/x271v8mz93/GRsi/xkXHf8p097/GiIp/xobIv8p0t3/KdPe/xocIv8fYmr/KNPe/xoZH/8aHCL/J87c/xy81/8VFxz/IsPZ/8zS0/8XGiD/Ir/R/yPH2/8XGiD/Fxkf/yPH2/8dd4T/GBog/yPJ3f8jyNr/uru9/xcUGv8cudb/EhITDKi5vRKlvMP/RUpOERwcHRAdOj4QHTk8EBwdHRAdNTgQHTo/EBwcHRAcHB0QSGduEKW4vf+koqQfHzg+EBqz0ewSFRv7EyMr/xq51vsTERb7ExUb+xq41fsau9j7ExUb+xiPp/sZudb7ExUb+xMVG/sZuNX/GKvI/BIUGfMdvdn/IrfL/xcaIP8n1eb/J9Dh/xkcIf8ZGR7/J8/f/xxCSv8ZGyH/J9Dg/ybQ4P8ZHCL/FSQs/yPK3/8UExj/GE1b/ybS5P8ZGB7/Ghwj/ynW5P8p2Ob/Ghwi/yWrtv8p1eH/Ghwi/xocIv8p1uT/J8XT/xkcIv8m1un/Hb7d/xUYH/8hzOr/HtHu/xcaIf8XGB//I8vi/xgxOv8XGSD/I8rg/yPK4P8XGiD/GUFL/yPP6f8SERj/Fhkh/x3A4f8AAAAAJ2f9/ydr//8mZPH/AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAlYu38J2v//ydo/f8AAAAAAAAAAAd8/fkFqf//Iob8sAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAMY39awWr//8FfP3/AAAAAAAAAAAFm/7/SfD//wR+/f8AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAOB/f9B7v//BaX+/wAAAAAAAAAAQ878SAyZ/v9n1v4KAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADu9v8DDJb+/z3N/XgAAAAA3/sAAN/7AADf+wAA3/sAAAAAAAAAAAAAAAAAAN/7AAAAAAAAAAAAAAAAAAAAAAAAj/EAAI/5AACP8QAA3/sAAA==" rel="icon" type="image/x-icon" /> |
| 8 | 8 | ||
| 9 | <title> | 9 | <title> |
| 10 | {% if page.title %}{{ page.title | escape }} - {{ site.title | escape }} | 10 | {% if page.title %}{{ page.title | escape }} - {{ site.title | escape }} |
| @@ -17,7 +17,7 @@ | |||
| 17 | 17 | ||
| 18 | <style> | 18 | <style> |
| 19 | :root { | 19 | :root { |
| 20 | --body-max-width: 1600px; | 20 | --body-max-width: 1200px; |
| 21 | --border-color: gainsboro; | 21 | --border-color: gainsboro; |
| 22 | --border-size: 1px; | 22 | --border-size: 1px; |
| 23 | --border-style: solid; | 23 | --border-style: solid; |
| @@ -61,6 +61,10 @@ | |||
| 61 | text-decoration: none; | 61 | text-decoration: none; |
| 62 | } | 62 | } |
| 63 | 63 | ||
| 64 | a:hover { | ||
| 65 | text-decoration: underline; | ||
| 66 | } | ||
| 67 | |||
| 64 | h1, h2, h3 { | 68 | h1, h2, h3 { |
| 65 | line-height: initial; | 69 | line-height: initial; |
| 66 | } | 70 | } |
| @@ -247,5 +251,11 @@ | |||
| 247 | <hr> | 251 | <hr> |
| 248 | <p><small>This page's most recent build occurred on {{ site.time | date: "%A, %-d %B, %Y" }} and is also available as <a href="/feed.xml" target="_blank">RSS feed</a>.</small></p> | 252 | <p><small>This page's most recent build occurred on {{ site.time | date: "%A, %-d %B, %Y" }} and is also available as <a href="/feed.xml" target="_blank">RSS feed</a>.</small></p> |
| 249 | </footer> | 253 | </footer> |
| 254 | |||
| 255 | {% if page.mathjax %} | ||
| 256 | <script src="https://polyfill.io/v3/polyfill.min.js?features=es6"></script> | ||
| 257 | <script src="https://cdn.jsdelivr.net/npm/mathjax@3.0.1/es5/tex-mml-chtml.js" async></script> | ||
| 258 | {% endif %} | ||
| 259 | |||
| 250 | </body> | 260 | </body> |
| 251 | </html> | 261 | </html> |
diff --git a/_posts/2024-02-11-k-mer.md b/_posts/2024-02-11-k-mer.md new file mode 100644 index 0000000..f8f6e17 --- /dev/null +++ b/_posts/2024-02-11-k-mer.md | |||
| @@ -0,0 +1,126 @@ | |||
| 1 | --- | ||
| 2 | title: "Navigating the genome using k-mers for DNA analysis and visualization" | ||
| 3 | permalink: /navigating-the-genome-using-k-mers-for-dna-analysis-and-visualization.html | ||
| 4 | date: 2024-02-11T01:04:28+02:00 | ||
| 5 | layout: post | ||
| 6 | type: post | ||
| 7 | mathjax: yes | ||
| 8 | draft: true | ||
| 9 | --- | ||
| 10 | |||
| 11 | ## Brief introduction to K-mer | ||
| 12 | |||
| 13 | A "k-mer" refers to all the possible substrings of length \\(k\\) contained in a | ||
| 14 | string, which is commonly used in computational biology and bioinformatics. In | ||
| 15 | the context of DNA, RNA, or protein sequences, a k-mer is a sequence of \\(k\\) | ||
| 16 | nucleotides (for DNA and RNA) or amino acids (for proteins). | ||
| 17 | |||
| 18 | The concept of k-mers is fundamental in various bioinformatics applications, | ||
| 19 | including genome assembly, sequence alignment, and identification of repeat | ||
| 20 | sequences. By analyzing the frequency and distribution of k-mers within a | ||
| 21 | sequence or set of sequences, researchers can infer structural characteristics, | ||
| 22 | identify genetic variants, and compare genomic or proteomic compositions between | ||
| 23 | different organisms or conditions. | ||
| 24 | |||
| 25 | For example, in genome assembly, k-mers are used to reconstruct the sequence of | ||
| 26 | a genome from a collection of short sequencing reads. By finding overlaps | ||
| 27 | between the k-mers derived from these reads, assembly algorithms can piece | ||
| 28 | together contiguous sequences (contigs), which represent longer sections of the | ||
| 29 | genome. | ||
| 30 | |||
| 31 | The choice of \\(k\\) (the length of the k-mer) is crucial and depends on the | ||
| 32 | specific application. A larger \\(k\\) provides more specificity (useful for | ||
| 33 | distinguishing between closely related sequences), while a smaller \\(k\\) | ||
| 34 | offers greater sensitivity (useful for detecting repeats or low-complexity | ||
| 35 | regions). However, the computational resources required increase with \\(k\\), | ||
| 36 | as there are \\(4^k\\) possible k-mers for nucleotide sequences (due to the four | ||
| 37 | types of nucleotides: A, T, C, G) and \\(20^k\\) for amino acid sequences (due | ||
| 38 | to the twenty standard amino acids). | ||
| 39 | |||
| 40 | ## K-mer counting | ||
| 41 | |||
| 42 | K-mer counting is a fundamental process in bioinformatics used for analyzing the | ||
| 43 | frequency of k-mers (subsequences of length \\(k\\)) in DNA, RNA, or protein | ||
| 44 | sequences. Efficient k-mer counting is crucial for various applications such as | ||
| 45 | genome assembly, metagenomics, and sequence comparison. The implementation | ||
| 46 | typically involves parsing a sequence into all possible k-mers and then counting | ||
| 47 | the occurrences of each unique k-mer. Here's a general approach to implementing | ||
| 48 | k-mer counting: | ||
| 49 | |||
| 50 | ### Reading the Sequences | ||
| 51 | |||
| 52 | The first step involves reading the genetic or protein sequences from files, | ||
| 53 | which are often in formats like FASTA or FASTQ. These files contain one or | ||
| 54 | multiple sequences that will be processed to extract k-mers. | ||
| 55 | |||
| 56 | ### Generating K-mers | ||
| 57 | |||
| 58 | For each sequence, generate all possible subsequences of length \\(k\\). This is | ||
| 59 | done by sliding a window of size \\(k\\) across the sequence, one nucleotide (or | ||
| 60 | amino acid) at a time, and extracting the subsequence within this window. | ||
| 61 | |||
| 62 | ### Counting K-mers | ||
| 63 | |||
| 64 | The extracted k-mers are then counted. This can be achieved using various data | ||
| 65 | structures: | ||
| 66 | |||
| 67 | - **Hash Tables (Dictionaries)**: They offer an efficient way to keep track of | ||
| 68 | k-mer counts, with k-mers as keys and their frequencies as values. This | ||
| 69 | approach is straightforward but can become memory-intensive with large | ||
| 70 | datasets or large values of \\(k\\). | ||
| 71 | - **Suffix Trees or Arrays**: These data structures are more space-efficient for | ||
| 72 | k-mer counting, especially for large datasets. They allow for efficient | ||
| 73 | retrieval of k-mer occurrences but are more complex to implement. | ||
| 74 | - **Bloom Filters and Count-Min Sketch**: For very large datasets, probabilistic | ||
| 75 | data structures like Bloom filters or Count-Min Sketch can estimate k-mer | ||
| 76 | counts using significantly less memory, at the cost of a controlled error | ||
| 77 | rate. | ||
| 78 | |||
| 79 | ### Handling Memory and Performance Issues | ||
| 80 | |||
| 81 | K-mer counting can be memory-intensive, especially for large values of \\(k\\) or | ||
| 82 | large datasets. Optimizations include: | ||
| 83 | |||
| 84 | - **Compressing K-mers**: Representing k-mers using a binary format rather than | ||
| 85 | strings can save memory. | ||
| 86 | - **Parallel Processing**: Distributing the k-mer counting task across multiple | ||
| 87 | processors or machines can significantly speed up the process. | ||
| 88 | - **Minimizing I/O Operations**: Efficiently reading and processing sequences | ||
| 89 | from files in chunks reduces I/O overhead. | ||
| 90 | |||
| 91 | ### Post-processing | ||
| 92 | |||
| 93 | After counting, the k-mer frequencies can be used directly for analyses or can | ||
| 94 | undergo further processing, such as filtering rare k-mers, which are often | ||
| 95 | errors, or normalizing counts for comparative analysis. | ||
| 96 | |||
| 97 | ### Implementation Example | ||
| 98 | |||
| 99 | Here's a simple Python example using a dictionary for k-mer counting: | ||
| 100 | |||
| 101 | ```python | ||
| 102 | def count_kmers(sequence, k): | ||
| 103 | kmer_counts = {} | ||
| 104 | for i in range(len(sequence) - k + 1): | ||
| 105 | kmer = sequence[i:i+k] | ||
| 106 | if kmer in kmer_counts: | ||
| 107 | kmer_counts[kmer] += 1 | ||
| 108 | else: | ||
| 109 | kmer_counts[kmer] = 1 | ||
| 110 | return kmer_counts | ||
| 111 | |||
| 112 | # Example usage | ||
| 113 | sequence = "ATGCGATGATCTGATG" | ||
| 114 | k = 3 | ||
| 115 | kmer_counts = count_kmers(sequence, k) | ||
| 116 | print(kmer_counts) | ||
| 117 | ``` | ||
| 118 | |||
| 119 | This code snippet counts the occurrences of each 3-mer in a given sequence. | ||
| 120 | |||
| 121 | For real-world applications, especially those involving large datasets, consider | ||
| 122 | using specialized bioinformatics tools like Jellyfish, KMC, or khmer, which are | ||
| 123 | optimized for efficiency and scalability. | ||
| 124 | |||
| 125 | Now that we have the basics out of the way we can start implementing basic k-mer | ||
| 126 | counter in C. | ||
