aboutsummaryrefslogtreecommitdiff
path: root/_posts/posts/2024-02-11-k-mer.md
diff options
context:
space:
mode:
authorMitja Felicijan <mitja.felicijan@gmail.com>2024-03-10 14:59:14 +0100
committerMitja Felicijan <mitja.felicijan@gmail.com>2024-03-10 14:59:14 +0100
commit1100562e29f6476448b656dbddd4cf22505523f6 (patch)
tree442eec492199104bd49dfd74474ce89ade8fcac9 /_posts/posts/2024-02-11-k-mer.md
parenta40d80be378e46a6c490e1b99b0d8f4acd968503 (diff)
downloadmitjafelicijan.com-1100562e29f6476448b656dbddd4cf22505523f6.tar.gz
Move back to JBMAFP
Diffstat (limited to '_posts/posts/2024-02-11-k-mer.md')
-rw-r--r--_posts/posts/2024-02-11-k-mer.md141
1 files changed, 0 insertions, 141 deletions
diff --git a/_posts/posts/2024-02-11-k-mer.md b/_posts/posts/2024-02-11-k-mer.md
deleted file mode 100644
index 254b5df..0000000
--- a/_posts/posts/2024-02-11-k-mer.md
+++ /dev/null
@@ -1,141 +0,0 @@
1---
2title: "Navigating the genome using k-mers for DNA analysis and visualization"
3permalink: /navigating-the-genome-using-k-mers-for-dna-analysis-and-visualization.html
4date: 2024-02-11T01:04:28+02:00
5layout: post
6type: post
7mathjax: yes
8draft: true
9published: false
10---
11
12## Brief introduction to K-mer
13
14A "k-mer" refers to all the possible substrings of length \\(k\\) contained in a
15string, which is commonly used in computational biology and bioinformatics. In
16the context of DNA, RNA, or protein sequences, a k-mer is a sequence of \\(k\\)
17nucleotides (for DNA and RNA) or amino acids (for proteins).
18
19The concept of k-mers is fundamental in various bioinformatics applications,
20including genome assembly, sequence alignment, and identification of repeat
21sequences. By analyzing the frequency and distribution of k-mers within a
22sequence or set of sequences, researchers can infer structural characteristics,
23identify genetic variants, and compare genomic or proteomic compositions between
24different organisms or conditions.
25
26For example, in genome assembly, k-mers are used to reconstruct the sequence of
27a genome from a collection of short sequencing reads. By finding overlaps
28between the k-mers derived from these reads, assembly algorithms can piece
29together contiguous sequences (contigs), which represent longer sections of the
30genome.
31
32The choice of \\(k\\) (the length of the k-mer) is crucial and depends on the
33specific application. A larger \\(k\\) provides more specificity (useful for
34distinguishing between closely related sequences), while a smaller \\(k\\)
35offers greater sensitivity (useful for detecting repeats or low-complexity
36regions). However, the computational resources required increase with \\(k\\),
37as there are \\(4^k\\) possible k-mers for nucleotide sequences (due to the four
38types of nucleotides: A, T, C, G) and \\(20^k\\) for amino acid sequences (due
39to the twenty standard amino acids).
40
41## K-mer counting
42
43K-mer counting is a fundamental process in bioinformatics used for analyzing the
44frequency of k-mers (subsequences of length \\(k\\)) in DNA, RNA, or protein
45sequences. Efficient k-mer counting is crucial for various applications such as
46genome assembly, metagenomics, and sequence comparison. The implementation
47typically involves parsing a sequence into all possible k-mers and then counting
48the occurrences of each unique k-mer. Here's a general approach to implementing
49k-mer counting:
50
51### Reading the Sequences
52
53The first step involves reading the genetic or protein sequences from files,
54which are often in formats like FASTA or FASTQ. These files contain one or
55multiple sequences that will be processed to extract k-mers.
56
57### Generating K-mers
58
59For each sequence, generate all possible subsequences of length \\(k\\). This is
60done by sliding a window of size \\(k\\) across the sequence, one nucleotide (or
61amino acid) at a time, and extracting the subsequence within this window.
62
63### Counting K-mers
64
65The extracted k-mers are then counted. This can be achieved using various data
66structures:
67
68- **Hash Tables (Dictionaries)**: They offer an efficient way to keep track of
69 k-mer counts, with k-mers as keys and their frequencies as values. This
70 approach is straightforward but can become memory-intensive with large
71 datasets or large values of \\(k\\).
72- **Suffix Trees or Arrays**: These data structures are more space-efficient for
73 k-mer counting, especially for large datasets. They allow for efficient
74 retrieval of k-mer occurrences but are more complex to implement.
75- **Bloom Filters and Count-Min Sketch**: For very large datasets, probabilistic
76 data structures like Bloom filters or Count-Min Sketch can estimate k-mer
77 counts using significantly less memory, at the cost of a controlled error
78 rate.
79
80### Handling Memory and Performance Issues
81
82K-mer counting can be memory-intensive, especially for large values of \\(k\\) or
83large datasets. Optimizations include:
84
85- **Compressing K-mers**: Representing k-mers using a binary format rather than
86 strings can save memory.
87- **Parallel Processing**: Distributing the k-mer counting task across multiple
88 processors or machines can significantly speed up the process.
89- **Minimizing I/O Operations**: Efficiently reading and processing sequences
90 from files in chunks reduces I/O overhead.
91
92### Post-processing
93
94After counting, the k-mer frequencies can be used directly for analyses or can
95undergo further processing, such as filtering rare k-mers, which are often
96errors, or normalizing counts for comparative analysis.
97
98### Implementation Example
99
100Here's a simple Python example using a dictionary for k-mer counting:
101
102```python
103def count_kmers(sequence, k):
104 kmer_counts = {}
105 for i in range(len(sequence) - k + 1):
106 kmer = sequence[i:i+k]
107 if kmer in kmer_counts:
108 kmer_counts[kmer] += 1
109 else:
110 kmer_counts[kmer] = 1
111 return kmer_counts
112
113# Example usage
114sequence = "ATGCGATGATCTGATG"
115k = 3
116kmer_counts = count_kmers(sequence, k)
117print(kmer_counts)
118```
119
120This code snippet counts the occurrences of each 3-mer in a given sequence.
121
122For real-world applications, especially those involving large datasets, consider
123using specialized bioinformatics tools like Jellyfish, KMC, or khmer, which are
124optimized for efficiency and scalability.
125
126Now that we have the basics out of the way we can start implementing basic k-mer
127counter in C.
128
129## Implementing sequence reading in C
130
131## Additional reading material
132
133- [2101.08385](https://arxiv.org/pdf/2101.08385.pdf) - Motif Identification using CNN-based Pairwise
134- [2112.15107](https://arxiv.org/pdf/2112.15107.pdf) - Probabilistic Models of k-mer Frequencies
135- [2205.13915](https://arxiv.org/pdf/2205.13915.pdf) - DiMA: Sequence Diversity Dynamics Analyser for Viruses
136- [2209.09242](https://arxiv.org/pdf/2209.09242.pdf) - Computing Phylo-k-mers
137- [2305.07545](https://arxiv.org/pdf/2305.07545.pdf) - KmerCo: A lightweight K-mer counting technique with a tiny memory footprint
138- [2308.01920](https://arxiv.org/pdf/2308.01920.pdf) - Sequence-Based Nanobody-Antigen Binding
139- [2310.10321](https://arxiv.org/pdf/2310.10321.pdf) - Hamming Encoder: Mining Discriminative k-mers for Discrete Sequence Classification
140- [2312.03865](https://arxiv.org/pdf/2312.03865.pdf) - Learning Genomic Sequence Representations using Graph Neural Networks over De Bruijn Graphs
141- [2401.14025](https://arxiv.org/pdf/2401.14025.pdf) - DNA Sequence Classification with Compressors