From 4abcce013c9ee3053badf2abda77190233066676 Mon Sep 17 00:00:00 2001 From: Mitja Felicijan Date: Fri, 23 Feb 2024 10:35:22 +0100 Subject: Testing thoughts page --- _posts/2024-02-11-k-mer.md | 140 --------------------------------------------- 1 file changed, 140 deletions(-) delete mode 100644 _posts/2024-02-11-k-mer.md (limited to '_posts/2024-02-11-k-mer.md') diff --git a/_posts/2024-02-11-k-mer.md b/_posts/2024-02-11-k-mer.md deleted file mode 100644 index c3e4a17..0000000 --- a/_posts/2024-02-11-k-mer.md +++ /dev/null @@ -1,140 +0,0 @@ ---- -title: "Navigating the genome using k-mers for DNA analysis and visualization" -permalink: /navigating-the-genome-using-k-mers-for-dna-analysis-and-visualization.html -date: 2024-02-11T01:04:28+02:00 -layout: post -type: post -mathjax: yes -draft: true ---- - -## Brief introduction to K-mer - -A "k-mer" refers to all the possible substrings of length \\(k\\) contained in a -string, which is commonly used in computational biology and bioinformatics. In -the context of DNA, RNA, or protein sequences, a k-mer is a sequence of \\(k\\) -nucleotides (for DNA and RNA) or amino acids (for proteins). - -The concept of k-mers is fundamental in various bioinformatics applications, -including genome assembly, sequence alignment, and identification of repeat -sequences. By analyzing the frequency and distribution of k-mers within a -sequence or set of sequences, researchers can infer structural characteristics, -identify genetic variants, and compare genomic or proteomic compositions between -different organisms or conditions. - -For example, in genome assembly, k-mers are used to reconstruct the sequence of -a genome from a collection of short sequencing reads. By finding overlaps -between the k-mers derived from these reads, assembly algorithms can piece -together contiguous sequences (contigs), which represent longer sections of the -genome. - -The choice of \\(k\\) (the length of the k-mer) is crucial and depends on the -specific application. A larger \\(k\\) provides more specificity (useful for -distinguishing between closely related sequences), while a smaller \\(k\\) -offers greater sensitivity (useful for detecting repeats or low-complexity -regions). However, the computational resources required increase with \\(k\\), -as there are \\(4^k\\) possible k-mers for nucleotide sequences (due to the four -types of nucleotides: A, T, C, G) and \\(20^k\\) for amino acid sequences (due -to the twenty standard amino acids). - -## K-mer counting - -K-mer counting is a fundamental process in bioinformatics used for analyzing the -frequency of k-mers (subsequences of length \\(k\\)) in DNA, RNA, or protein -sequences. Efficient k-mer counting is crucial for various applications such as -genome assembly, metagenomics, and sequence comparison. The implementation -typically involves parsing a sequence into all possible k-mers and then counting -the occurrences of each unique k-mer. Here's a general approach to implementing -k-mer counting: - -### Reading the Sequences - -The first step involves reading the genetic or protein sequences from files, -which are often in formats like FASTA or FASTQ. These files contain one or -multiple sequences that will be processed to extract k-mers. - -### Generating K-mers - -For each sequence, generate all possible subsequences of length \\(k\\). This is -done by sliding a window of size \\(k\\) across the sequence, one nucleotide (or -amino acid) at a time, and extracting the subsequence within this window. - -### Counting K-mers - -The extracted k-mers are then counted. This can be achieved using various data -structures: - -- **Hash Tables (Dictionaries)**: They offer an efficient way to keep track of - k-mer counts, with k-mers as keys and their frequencies as values. This - approach is straightforward but can become memory-intensive with large - datasets or large values of \\(k\\). -- **Suffix Trees or Arrays**: These data structures are more space-efficient for - k-mer counting, especially for large datasets. They allow for efficient - retrieval of k-mer occurrences but are more complex to implement. -- **Bloom Filters and Count-Min Sketch**: For very large datasets, probabilistic - data structures like Bloom filters or Count-Min Sketch can estimate k-mer - counts using significantly less memory, at the cost of a controlled error - rate. - -### Handling Memory and Performance Issues - -K-mer counting can be memory-intensive, especially for large values of \\(k\\) or -large datasets. Optimizations include: - -- **Compressing K-mers**: Representing k-mers using a binary format rather than - strings can save memory. -- **Parallel Processing**: Distributing the k-mer counting task across multiple - processors or machines can significantly speed up the process. -- **Minimizing I/O Operations**: Efficiently reading and processing sequences - from files in chunks reduces I/O overhead. - -### Post-processing - -After counting, the k-mer frequencies can be used directly for analyses or can -undergo further processing, such as filtering rare k-mers, which are often -errors, or normalizing counts for comparative analysis. - -### Implementation Example - -Here's a simple Python example using a dictionary for k-mer counting: - -```python -def count_kmers(sequence, k): - kmer_counts = {} - for i in range(len(sequence) - k + 1): - kmer = sequence[i:i+k] - if kmer in kmer_counts: - kmer_counts[kmer] += 1 - else: - kmer_counts[kmer] = 1 - return kmer_counts - -# Example usage -sequence = "ATGCGATGATCTGATG" -k = 3 -kmer_counts = count_kmers(sequence, k) -print(kmer_counts) -``` - -This code snippet counts the occurrences of each 3-mer in a given sequence. - -For real-world applications, especially those involving large datasets, consider -using specialized bioinformatics tools like Jellyfish, KMC, or khmer, which are -optimized for efficiency and scalability. - -Now that we have the basics out of the way we can start implementing basic k-mer -counter in C. - -## Implementing sequence reading in C - -## Additional reading material - -- [2101.08385](https://arxiv.org/pdf/2101.08385.pdf) - Motif Identification using CNN-based Pairwise -- [2112.15107](https://arxiv.org/pdf/2112.15107.pdf) - Probabilistic Models of k-mer Frequencies -- [2205.13915](https://arxiv.org/pdf/2205.13915.pdf) - DiMA: Sequence Diversity Dynamics Analyser for Viruses -- [2209.09242](https://arxiv.org/pdf/2209.09242.pdf) - Computing Phylo-k-mers -- [2305.07545](https://arxiv.org/pdf/2305.07545.pdf) - KmerCo: A lightweight K-mer counting technique with a tiny memory footprint -- [2308.01920](https://arxiv.org/pdf/2308.01920.pdf) - Sequence-Based Nanobody-Antigen Binding -- [2310.10321](https://arxiv.org/pdf/2310.10321.pdf) - Hamming Encoder: Mining Discriminative k-mers for Discrete Sequence Classification -- [2312.03865](https://arxiv.org/pdf/2312.03865.pdf) - Learning Genomic Sequence Representations using Graph Neural Networks over De Bruijn Graphs -- [2401.14025](https://arxiv.org/pdf/2401.14025.pdf) - DNA Sequence Classification with Compressors -- cgit v1.2.3