From 4abcce013c9ee3053badf2abda77190233066676 Mon Sep 17 00:00:00 2001
From: Mitja Felicijan <mitja.felicijan@gmail.com>
Date: Fri, 23 Feb 2024 10:35:22 +0100
Subject: Testing thoughts page

---
 _posts/2024-02-11-k-mer.md | 140 ---------------------------------------------
 1 file changed, 140 deletions(-)
 delete mode 100644 _posts/2024-02-11-k-mer.md

(limited to '_posts/2024-02-11-k-mer.md')

diff --git a/_posts/2024-02-11-k-mer.md b/_posts/2024-02-11-k-mer.md
deleted file mode 100644
index c3e4a17..0000000
--- a/_posts/2024-02-11-k-mer.md
+++ /dev/null
@@ -1,140 +0,0 @@
----
-title: "Navigating the genome using k-mers for DNA analysis and visualization"
-permalink: /navigating-the-genome-using-k-mers-for-dna-analysis-and-visualization.html
-date: 2024-02-11T01:04:28+02:00
-layout: post
-type: post
-mathjax: yes
-draft: true
----
-
-## Brief introduction to K-mer
-
-A "k-mer" refers to all the possible substrings of length \\(k\\) contained in a
-string, which is commonly used in computational biology and bioinformatics. In
-the context of DNA, RNA, or protein sequences, a k-mer is a sequence of \\(k\\)
-nucleotides (for DNA and RNA) or amino acids (for proteins).
-
-The concept of k-mers is fundamental in various bioinformatics applications,
-including genome assembly, sequence alignment, and identification of repeat
-sequences. By analyzing the frequency and distribution of k-mers within a
-sequence or set of sequences, researchers can infer structural characteristics,
-identify genetic variants, and compare genomic or proteomic compositions between
-different organisms or conditions.
-
-For example, in genome assembly, k-mers are used to reconstruct the sequence of
-a genome from a collection of short sequencing reads. By finding overlaps
-between the k-mers derived from these reads, assembly algorithms can piece
-together contiguous sequences (contigs), which represent longer sections of the
-genome.
-
-The choice of \\(k\\) (the length of the k-mer) is crucial and depends on the
-specific application. A larger \\(k\\) provides more specificity (useful for
-distinguishing between closely related sequences), while a smaller \\(k\\)
-offers greater sensitivity (useful for detecting repeats or low-complexity
-regions). However, the computational resources required increase with \\(k\\),
-as there are \\(4^k\\) possible k-mers for nucleotide sequences (due to the four
-types of nucleotides: A, T, C, G) and \\(20^k\\) for amino acid sequences (due
-to the twenty standard amino acids).
-
-## K-mer counting
-
-K-mer counting is a fundamental process in bioinformatics used for analyzing the
-frequency of k-mers (subsequences of length \\(k\\)) in DNA, RNA, or protein
-sequences. Efficient k-mer counting is crucial for various applications such as
-genome assembly, metagenomics, and sequence comparison. The implementation
-typically involves parsing a sequence into all possible k-mers and then counting
-the occurrences of each unique k-mer. Here's a general approach to implementing
-k-mer counting:
-
-### Reading the Sequences
-
-The first step involves reading the genetic or protein sequences from files,
-which are often in formats like FASTA or FASTQ. These files contain one or
-multiple sequences that will be processed to extract k-mers.
-
-### Generating K-mers
-
-For each sequence, generate all possible subsequences of length \\(k\\). This is
-done by sliding a window of size \\(k\\) across the sequence, one nucleotide (or
-amino acid) at a time, and extracting the subsequence within this window.
-
-### Counting K-mers
-
-The extracted k-mers are then counted. This can be achieved using various data
-structures:
-
-- **Hash Tables (Dictionaries)**: They offer an efficient way to keep track of
-  k-mer counts, with k-mers as keys and their frequencies as values. This
-  approach is straightforward but can become memory-intensive with large
-  datasets or large values of \\(k\\).
-- **Suffix Trees or Arrays**: These data structures are more space-efficient for
-  k-mer counting, especially for large datasets. They allow for efficient
-  retrieval of k-mer occurrences but are more complex to implement.
-- **Bloom Filters and Count-Min Sketch**: For very large datasets, probabilistic
-  data structures like Bloom filters or Count-Min Sketch can estimate k-mer
-  counts using significantly less memory, at the cost of a controlled error
-  rate.
-
-### Handling Memory and Performance Issues
-
-K-mer counting can be memory-intensive, especially for large values of \\(k\\) or
-large datasets. Optimizations include:
-
-- **Compressing K-mers**: Representing k-mers using a binary format rather than
-  strings can save memory.
-- **Parallel Processing**: Distributing the k-mer counting task across multiple
-  processors or machines can significantly speed up the process.
-- **Minimizing I/O Operations**: Efficiently reading and processing sequences
-  from files in chunks reduces I/O overhead.
-
-### Post-processing
-
-After counting, the k-mer frequencies can be used directly for analyses or can
-undergo further processing, such as filtering rare k-mers, which are often
-errors, or normalizing counts for comparative analysis.
-
-### Implementation Example
-
-Here's a simple Python example using a dictionary for k-mer counting:
-
-```python
-def count_kmers(sequence, k):
-    kmer_counts = {}
-    for i in range(len(sequence) - k + 1):
-        kmer = sequence[i:i+k]
-        if kmer in kmer_counts:
-            kmer_counts[kmer] += 1
-        else:
-            kmer_counts[kmer] = 1
-    return kmer_counts
-
-# Example usage
-sequence = "ATGCGATGATCTGATG"
-k = 3
-kmer_counts = count_kmers(sequence, k)
-print(kmer_counts)
-```
-
-This code snippet counts the occurrences of each 3-mer in a given sequence.
-
-For real-world applications, especially those involving large datasets, consider
-using specialized bioinformatics tools like Jellyfish, KMC, or khmer, which are
-optimized for efficiency and scalability.
-
-Now that we have the basics out of the way we can start implementing basic k-mer
-counter in C.
-
-## Implementing sequence reading in C
-
-## Additional reading material
-
-- [2101.08385](https://arxiv.org/pdf/2101.08385.pdf) - Motif Identification using CNN-based Pairwise
-- [2112.15107](https://arxiv.org/pdf/2112.15107.pdf) - Probabilistic Models of k-mer Frequencies
-- [2205.13915](https://arxiv.org/pdf/2205.13915.pdf) - DiMA: Sequence Diversity Dynamics Analyser for Viruses
-- [2209.09242](https://arxiv.org/pdf/2209.09242.pdf) - Computing Phylo-k-mers
-- [2305.07545](https://arxiv.org/pdf/2305.07545.pdf) - KmerCo: A lightweight K-mer counting technique with a tiny memory footprint
-- [2308.01920](https://arxiv.org/pdf/2308.01920.pdf) - Sequence-Based Nanobody-Antigen Binding
-- [2310.10321](https://arxiv.org/pdf/2310.10321.pdf) - Hamming Encoder: Mining Discriminative k-mers for Discrete Sequence Classification
-- [2312.03865](https://arxiv.org/pdf/2312.03865.pdf) - Learning Genomic Sequence Representations using Graph Neural Networks over De Bruijn Graphs
-- [2401.14025](https://arxiv.org/pdf/2401.14025.pdf) - DNA Sequence Classification with Compressors
-- 
cgit v1.2.3