From 81a1359f60d209f4a1df86b3f5ae51d3efbb4d82 Mon Sep 17 00:00:00 2001
From: Mitja Felicijan <mitja.felicijan@gmail.com>
Date: Sun, 11 Feb 2024 22:41:14 +0100
Subject: Added DNA k-mer post draft

---
 _posts/2024-02-11-k-mer.md | 126 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 126 insertions(+)
 create mode 100644 _posts/2024-02-11-k-mer.md

(limited to '_posts')

diff --git a/_posts/2024-02-11-k-mer.md b/_posts/2024-02-11-k-mer.md
new file mode 100644
index 0000000..f8f6e17
--- /dev/null
+++ b/_posts/2024-02-11-k-mer.md
@@ -0,0 +1,126 @@
+---
+title: "Navigating the genome using k-mers for DNA analysis and visualization"
+permalink: /navigating-the-genome-using-k-mers-for-dna-analysis-and-visualization.html
+date: 2024-02-11T01:04:28+02:00
+layout: post
+type: post
+mathjax: yes
+draft: true
+---
+
+## Brief introduction to K-mer
+
+A "k-mer" refers to all the possible substrings of length \\(k\\) contained in a
+string, which is commonly used in computational biology and bioinformatics. In
+the context of DNA, RNA, or protein sequences, a k-mer is a sequence of \\(k\\)
+nucleotides (for DNA and RNA) or amino acids (for proteins).
+
+The concept of k-mers is fundamental in various bioinformatics applications,
+including genome assembly, sequence alignment, and identification of repeat
+sequences. By analyzing the frequency and distribution of k-mers within a
+sequence or set of sequences, researchers can infer structural characteristics,
+identify genetic variants, and compare genomic or proteomic compositions between
+different organisms or conditions.
+
+For example, in genome assembly, k-mers are used to reconstruct the sequence of
+a genome from a collection of short sequencing reads. By finding overlaps
+between the k-mers derived from these reads, assembly algorithms can piece
+together contiguous sequences (contigs), which represent longer sections of the
+genome.
+
+The choice of \\(k\\) (the length of the k-mer) is crucial and depends on the
+specific application. A larger \\(k\\) provides more specificity (useful for
+distinguishing between closely related sequences), while a smaller \\(k\\)
+offers greater sensitivity (useful for detecting repeats or low-complexity
+regions). However, the computational resources required increase with \\(k\\),
+as there are \\(4^k\\) possible k-mers for nucleotide sequences (due to the four
+types of nucleotides: A, T, C, G) and \\(20^k\\) for amino acid sequences (due
+to the twenty standard amino acids).
+
+## K-mer counting
+
+K-mer counting is a fundamental process in bioinformatics used for analyzing the
+frequency of k-mers (subsequences of length \\(k\\)) in DNA, RNA, or protein
+sequences. Efficient k-mer counting is crucial for various applications such as
+genome assembly, metagenomics, and sequence comparison. The implementation
+typically involves parsing a sequence into all possible k-mers and then counting
+the occurrences of each unique k-mer. Here's a general approach to implementing
+k-mer counting:
+
+### Reading the Sequences
+
+The first step involves reading the genetic or protein sequences from files,
+which are often in formats like FASTA or FASTQ. These files contain one or
+multiple sequences that will be processed to extract k-mers.
+
+### Generating K-mers
+
+For each sequence, generate all possible subsequences of length \\(k\\). This is
+done by sliding a window of size \\(k\\) across the sequence, one nucleotide (or
+amino acid) at a time, and extracting the subsequence within this window.
+
+### Counting K-mers
+
+The extracted k-mers are then counted. This can be achieved using various data
+structures:
+
+- **Hash Tables (Dictionaries)**: They offer an efficient way to keep track of
+  k-mer counts, with k-mers as keys and their frequencies as values. This
+  approach is straightforward but can become memory-intensive with large
+  datasets or large values of \\(k\\).
+- **Suffix Trees or Arrays**: These data structures are more space-efficient for
+  k-mer counting, especially for large datasets. They allow for efficient
+  retrieval of k-mer occurrences but are more complex to implement.
+- **Bloom Filters and Count-Min Sketch**: For very large datasets, probabilistic
+  data structures like Bloom filters or Count-Min Sketch can estimate k-mer
+  counts using significantly less memory, at the cost of a controlled error
+  rate.
+
+### Handling Memory and Performance Issues
+
+K-mer counting can be memory-intensive, especially for large values of \\(k\\) or
+large datasets. Optimizations include:
+
+- **Compressing K-mers**: Representing k-mers using a binary format rather than
+  strings can save memory.
+- **Parallel Processing**: Distributing the k-mer counting task across multiple
+  processors or machines can significantly speed up the process.
+- **Minimizing I/O Operations**: Efficiently reading and processing sequences
+  from files in chunks reduces I/O overhead.
+
+### Post-processing
+
+After counting, the k-mer frequencies can be used directly for analyses or can
+undergo further processing, such as filtering rare k-mers, which are often
+errors, or normalizing counts for comparative analysis.
+
+### Implementation Example
+
+Here's a simple Python example using a dictionary for k-mer counting:
+
+```python
+def count_kmers(sequence, k):
+    kmer_counts = {}
+    for i in range(len(sequence) - k + 1):
+        kmer = sequence[i:i+k]
+        if kmer in kmer_counts:
+            kmer_counts[kmer] += 1
+        else:
+            kmer_counts[kmer] = 1
+    return kmer_counts
+
+# Example usage
+sequence = "ATGCGATGATCTGATG"
+k = 3
+kmer_counts = count_kmers(sequence, k)
+print(kmer_counts)
+```
+
+This code snippet counts the occurrences of each 3-mer in a given sequence.
+
+For real-world applications, especially those involving large datasets, consider
+using specialized bioinformatics tools like Jellyfish, KMC, or khmer, which are
+optimized for efficiency and scalability.
+
+Now that we have the basics out of the way we can start implementing basic k-mer
+counter in C.
-- 
cgit v1.2.3