aboutsummaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorMitja Felicijan <mitja.felicijan@gmail.com>2024-02-11 22:41:14 +0100
committerMitja Felicijan <mitja.felicijan@gmail.com>2024-02-11 22:41:14 +0100
commit81a1359f60d209f4a1df86b3f5ae51d3efbb4d82 (patch)
tree72fdb89ad091428801044851637b200a7b85d031
parent18c372f43037af48c3e724fa84379bdca7943d4a (diff)
downloadmitjafelicijan.com-81a1359f60d209f4a1df86b3f5ae51d3efbb4d82.tar.gz
Added DNA k-mer post draft
-rw-r--r--_layouts/base.html14
-rw-r--r--_posts/2024-02-11-k-mer.md126
2 files changed, 138 insertions, 2 deletions
diff --git a/_layouts/base.html b/_layouts/base.html
index aa4be41..12202d9 100644
--- a/_layouts/base.html
+++ b/_layouts/base.html
@@ -4,7 +4,7 @@
4 <meta charset="utf-8"> 4 <meta charset="utf-8">
5 <meta name="viewport" content="width=device-width,initial-scale=1"> 5 <meta name="viewport" content="width=device-width,initial-scale=1">
6 6
7 <link href="data:image/x-icon;base64,AAABAAEAEBAAAAEAIABoBAAAFgAAACgAAAAQAAAAIAAAAAEAIAAAAAAAAAQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAL69vf8AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAv76+/8LBwQkAAAAAAAAAAAAAAAC+vb3/AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAL+9vf/Bv78JAAAAAAAAAAAAAAAAu7q6/wAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC7ubr/vr29CAAAAAAAAAAAy8nJAZ6foP8AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAnqGj/6GipAoAAAAAHLjU/xcXHf/BwsL/I8XY/yPK3v8XGiD/IbjL/yPF2f8XGiD/Fxkf/yLF2f8gnK3/Fxog/62ztv8fwNf/FRcd/x271v8mz93/GRsi/xkXHf8p097/GiIp/xobIv8p0t3/KdPe/xocIv8fYmr/KNPe/xoZH/8aHCL/J87c/xy81/8VFxz/IsPZ/8zS0/8XGiD/Ir/R/yPH2/8XGiD/Fxkf/yPH2/8dd4T/GBog/yPJ3f8jyNr/uru9/xcUGv8cudb/EhITDKi5vRKlvMP/RUpOERwcHRAdOj4QHTk8EBwdHRAdNTgQHTo/EBwcHRAcHB0QSGduEKW4vf+koqQfHzg+EBqz0ewSFRv7EyMr/xq51vsTERb7ExUb+xq41fsau9j7ExUb+xiPp/sZudb7ExUb+xMVG/sZuNX/GKvI/BIUGfMdvdn/IrfL/xcaIP8n1eb/J9Dh/xkcIf8ZGR7/J8/f/xxCSv8ZGyH/J9Dg/ybQ4P8ZHCL/FSQs/yPK3/8UExj/GE1b/ybS5P8ZGB7/Ghwj/ynW5P8p2Ob/Ghwi/yWrtv8p1eH/Ghwi/xocIv8p1uT/J8XT/xkcIv8m1un/Hb7d/xUYH/8hzOr/HtHu/xcaIf8XGB//I8vi/xgxOv8XGSD/I8rg/yPK4P8XGiD/GUFL/yPP6f8SERj/Fhkh/x3A4f8AAAAAJ2f9/ydr//8mZPH/AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAlYu38J2v//ydo/f8AAAAAAAAAAAd8/fkFqf//Iob8sAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAMY39awWr//8FfP3/AAAAAAAAAAAFm/7/SfD//wR+/f8AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAOB/f9B7v//BaX+/wAAAAAAAAAAQ878SAyZ/v9n1v4KAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADu9v8DDJb+/z3N/XgAAAAA3/sAAN/7AADf+wAA3/sAAAAAAAAAAAAAAAAAAN/7AAAAAAAAAAAAAAAAAAAAAAAAj/EAAI/5AACP8QAA3/sAAA==" rel="icon" type="image/x-icon" /> 7 <link href="data:image/x-icon;base64,AAABAAEAEBAAAAEAIABoBAAAFgAAACgAAAAQAAAAIAAAAAEAIAAAAAAAAAQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAL69vf8AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAv76+/8LBwQkAAAAAAAAAAAAAAAC+vb3/AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAL+9vf/Bv78JAAAAAAAAAAAAAAAAu7q6/wAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC7ubr/vr29CAAAAAAAAAAAy8nJAZ6foP8AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAnqGj/6GipAoAAAAAHLjU/xcXHf/BwsL/I8XY/yPK3v8XGiD/IbjL/yPF2f8XGiD/Fxkf/yLF2f8gnK3/Fxog/62ztv8fwNf/FRcd/x271v8mz93/GRsi/xkXHf8p097/GiIp/xobIv8p0t3/KdPe/xocIv8fYmr/KNPe/xoZH/8aHCL/J87c/xy81/8VFxz/IsPZ/8zS0/8XGiD/Ir/R/yPH2/8XGiD/Fxkf/yPH2/8dd4T/GBog/yPJ3f8jyNr/uru9/xcUGv8cudb/EhITDKi5vRKlvMP/RUpOERwcHRAdOj4QHTk8EBwdHRAdNTgQHTo/EBwcHRAcHB0QSGduEKW4vf+koqQfHzg+EBqz0ewSFRv7EyMr/xq51vsTERb7ExUb+xq41fsau9j7ExUb+xiPp/sZudb7ExUb+xMVG/sZuNX/GKvI/BIUGfMdvdn/IrfL/xcaIP8n1eb/J9Dh/xkcIf8ZGR7/J8/f/xxCSv8ZGyH/J9Dg/ybQ4P8ZHCL/FSQs/yPK3/8UExj/GE1b/ybS5P8ZGB7/Ghwj/ynW5P8p2Ob/Ghwi/yWrtv8p1eH/Ghwi/xocIv8p1uT/J8XT/xkcIv8m1un/Hb7d/xUYH/8hzOr/HtHu/xcaIf8XGB//I8vi/xgxOv8XGSD/I8rg/yPK4P8XGiD/GUFL/yPP6f8SERj/Fhkh/x3A4f8AAAAAJ2f9/ydr//8mZPH/AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAlYu38J2v//ydo/f8AAAAAAAAAAAd8/fkFqf//Iob8sAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAMY39awWr//8FfP3/AAAAAAAAAAAFm/7/SfD//wR+/f8AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAOB/f9B7v//BaX+/wAAAAAAAAAAQ878SAyZ/v9n1v4KAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADu9v8DDJb+/z3N/XgAAAAA3/sAAN/7AADf+wAA3/sAAAAAAAAAAAAAAAAAAN/7AAAAAAAAAAAAAAAAAAAAAAAAj/EAAI/5AACP8QAA3/sAAA==" rel="icon" type="image/x-icon" />
8 8
9 <title> 9 <title>
10 {% if page.title %}{{ page.title | escape }} - {{ site.title | escape }} 10 {% if page.title %}{{ page.title | escape }} - {{ site.title | escape }}
@@ -17,7 +17,7 @@
17 17
18 <style> 18 <style>
19 :root { 19 :root {
20 --body-max-width: 1600px; 20 --body-max-width: 1200px;
21 --border-color: gainsboro; 21 --border-color: gainsboro;
22 --border-size: 1px; 22 --border-size: 1px;
23 --border-style: solid; 23 --border-style: solid;
@@ -61,6 +61,10 @@
61 text-decoration: none; 61 text-decoration: none;
62 } 62 }
63 63
64 a:hover {
65 text-decoration: underline;
66 }
67
64 h1, h2, h3 { 68 h1, h2, h3 {
65 line-height: initial; 69 line-height: initial;
66 } 70 }
@@ -247,5 +251,11 @@
247 <hr> 251 <hr>
248 <p><small>This page's most recent build occurred on {{ site.time | date: "%A, %-d %B, %Y" }} and is also available as <a href="/feed.xml" target="_blank">RSS feed</a>.</small></p> 252 <p><small>This page's most recent build occurred on {{ site.time | date: "%A, %-d %B, %Y" }} and is also available as <a href="/feed.xml" target="_blank">RSS feed</a>.</small></p>
249 </footer> 253 </footer>
254
255 {% if page.mathjax %}
256 <script src="https://polyfill.io/v3/polyfill.min.js?features=es6"></script>
257 <script src="https://cdn.jsdelivr.net/npm/mathjax@3.0.1/es5/tex-mml-chtml.js" async></script>
258 {% endif %}
259
250 </body> 260 </body>
251</html> 261</html>
diff --git a/_posts/2024-02-11-k-mer.md b/_posts/2024-02-11-k-mer.md
new file mode 100644
index 0000000..f8f6e17
--- /dev/null
+++ b/_posts/2024-02-11-k-mer.md
@@ -0,0 +1,126 @@
1---
2title: "Navigating the genome using k-mers for DNA analysis and visualization"
3permalink: /navigating-the-genome-using-k-mers-for-dna-analysis-and-visualization.html
4date: 2024-02-11T01:04:28+02:00
5layout: post
6type: post
7mathjax: yes
8draft: true
9---
10
11## Brief introduction to K-mer
12
13A "k-mer" refers to all the possible substrings of length \\(k\\) contained in a
14string, which is commonly used in computational biology and bioinformatics. In
15the context of DNA, RNA, or protein sequences, a k-mer is a sequence of \\(k\\)
16nucleotides (for DNA and RNA) or amino acids (for proteins).
17
18The concept of k-mers is fundamental in various bioinformatics applications,
19including genome assembly, sequence alignment, and identification of repeat
20sequences. By analyzing the frequency and distribution of k-mers within a
21sequence or set of sequences, researchers can infer structural characteristics,
22identify genetic variants, and compare genomic or proteomic compositions between
23different organisms or conditions.
24
25For example, in genome assembly, k-mers are used to reconstruct the sequence of
26a genome from a collection of short sequencing reads. By finding overlaps
27between the k-mers derived from these reads, assembly algorithms can piece
28together contiguous sequences (contigs), which represent longer sections of the
29genome.
30
31The choice of \\(k\\) (the length of the k-mer) is crucial and depends on the
32specific application. A larger \\(k\\) provides more specificity (useful for
33distinguishing between closely related sequences), while a smaller \\(k\\)
34offers greater sensitivity (useful for detecting repeats or low-complexity
35regions). However, the computational resources required increase with \\(k\\),
36as there are \\(4^k\\) possible k-mers for nucleotide sequences (due to the four
37types of nucleotides: A, T, C, G) and \\(20^k\\) for amino acid sequences (due
38to the twenty standard amino acids).
39
40## K-mer counting
41
42K-mer counting is a fundamental process in bioinformatics used for analyzing the
43frequency of k-mers (subsequences of length \\(k\\)) in DNA, RNA, or protein
44sequences. Efficient k-mer counting is crucial for various applications such as
45genome assembly, metagenomics, and sequence comparison. The implementation
46typically involves parsing a sequence into all possible k-mers and then counting
47the occurrences of each unique k-mer. Here's a general approach to implementing
48k-mer counting:
49
50### Reading the Sequences
51
52The first step involves reading the genetic or protein sequences from files,
53which are often in formats like FASTA or FASTQ. These files contain one or
54multiple sequences that will be processed to extract k-mers.
55
56### Generating K-mers
57
58For each sequence, generate all possible subsequences of length \\(k\\). This is
59done by sliding a window of size \\(k\\) across the sequence, one nucleotide (or
60amino acid) at a time, and extracting the subsequence within this window.
61
62### Counting K-mers
63
64The extracted k-mers are then counted. This can be achieved using various data
65structures:
66
67- **Hash Tables (Dictionaries)**: They offer an efficient way to keep track of
68 k-mer counts, with k-mers as keys and their frequencies as values. This
69 approach is straightforward but can become memory-intensive with large
70 datasets or large values of \\(k\\).
71- **Suffix Trees or Arrays**: These data structures are more space-efficient for
72 k-mer counting, especially for large datasets. They allow for efficient
73 retrieval of k-mer occurrences but are more complex to implement.
74- **Bloom Filters and Count-Min Sketch**: For very large datasets, probabilistic
75 data structures like Bloom filters or Count-Min Sketch can estimate k-mer
76 counts using significantly less memory, at the cost of a controlled error
77 rate.
78
79### Handling Memory and Performance Issues
80
81K-mer counting can be memory-intensive, especially for large values of \\(k\\) or
82large datasets. Optimizations include:
83
84- **Compressing K-mers**: Representing k-mers using a binary format rather than
85 strings can save memory.
86- **Parallel Processing**: Distributing the k-mer counting task across multiple
87 processors or machines can significantly speed up the process.
88- **Minimizing I/O Operations**: Efficiently reading and processing sequences
89 from files in chunks reduces I/O overhead.
90
91### Post-processing
92
93After counting, the k-mer frequencies can be used directly for analyses or can
94undergo further processing, such as filtering rare k-mers, which are often
95errors, or normalizing counts for comparative analysis.
96
97### Implementation Example
98
99Here's a simple Python example using a dictionary for k-mer counting:
100
101```python
102def count_kmers(sequence, k):
103 kmer_counts = {}
104 for i in range(len(sequence) - k + 1):
105 kmer = sequence[i:i+k]
106 if kmer in kmer_counts:
107 kmer_counts[kmer] += 1
108 else:
109 kmer_counts[kmer] = 1
110 return kmer_counts
111
112# Example usage
113sequence = "ATGCGATGATCTGATG"
114k = 3
115kmer_counts = count_kmers(sequence, k)
116print(kmer_counts)
117```
118
119This code snippet counts the occurrences of each 3-mer in a given sequence.
120
121For real-world applications, especially those involving large datasets, consider
122using specialized bioinformatics tools like Jellyfish, KMC, or khmer, which are
123optimized for efficiency and scalability.
124
125Now that we have the basics out of the way we can start implementing basic k-mer
126counter in C.