Move back to JBMAFP

author: Mitja Felicijan <mitja.felicijan@gmail.com> 2024-03-10 14:59:14 +0100
committer: Mitja Felicijan <mitja.felicijan@gmail.com> 2024-03-10 14:59:14 +0100
commit: 1100562e29f6476448b656dbddd4cf22505523f6 (patch)
tree: 442eec492199104bd49dfd74474ce89ade8fcac9 /_posts/posts/2024-02-11-k-mer.md
parent: a40d80be378e46a6c490e1b99b0d8f4acd968503 (diff)
download: mitjafelicijan.com-1100562e29f6476448b656dbddd4cf22505523f6.tar.gz
1 files changed, 0 insertions, 141 deletions
diff --git a/_posts/posts/2024-02-11-k-mer.md b/_posts/posts/2024-02-11-k-mer.md
deleted file mode 100644
index 254b5df..0000000
--- a/_posts/posts/2024-02-11-k-mer.md
+++ /dev/null
@@ -1,141 +0,0 @@
---
-title: "Navigating the genome using k-mers for DNA analysis and visualization"
-permalink: /navigating-the-genome-using-k-mers-for-dna-analysis-and-visualization.html
-date: 2024-02-11T01:04:28+02:00
-layout: post
-type: post
-mathjax: yes
-draft: true
-published: false
---
-## Brief introduction to K-mer
-A "k-mer" refers to all the possible substrings of length \\(k\\) contained in a
-string, which is commonly used in computational biology and bioinformatics. In
-the context of DNA, RNA, or protein sequences, a k-mer is a sequence of \\(k\\)
-nucleotides (for DNA and RNA) or amino acids (for proteins).
-The concept of k-mers is fundamental in various bioinformatics applications,
-including genome assembly, sequence alignment, and identification of repeat
-sequences. By analyzing the frequency and distribution of k-mers within a
-sequence or set of sequences, researchers can infer structural characteristics,
-identify genetic variants, and compare genomic or proteomic compositions between
-different organisms or conditions.
-For example, in genome assembly, k-mers are used to reconstruct the sequence of
-a genome from a collection of short sequencing reads. By finding overlaps
-between the k-mers derived from these reads, assembly algorithms can piece
-together contiguous sequences (contigs), which represent longer sections of the
-genome.
-The choice of \\(k\\) (the length of the k-mer) is crucial and depends on the
-specific application. A larger \\(k\\) provides more specificity (useful for
-distinguishing between closely related sequences), while a smaller \\(k\\)
-offers greater sensitivity (useful for detecting repeats or low-complexity
-regions). However, the computational resources required increase with \\(k\\),
-as there are \\(4^k\\) possible k-mers for nucleotide sequences (due to the four
-types of nucleotides: A, T, C, G) and \\(20^k\\) for amino acid sequences (due
-to the twenty standard amino acids).
-## K-mer counting
-K-mer counting is a fundamental process in bioinformatics used for analyzing the
-frequency of k-mers (subsequences of length \\(k\\)) in DNA, RNA, or protein
-sequences. Efficient k-mer counting is crucial for various applications such as
-genome assembly, metagenomics, and sequence comparison. The implementation
-typically involves parsing a sequence into all possible k-mers and then counting
-the occurrences of each unique k-mer. Here's a general approach to implementing
-k-mer counting:
-### Reading the Sequences
-The first step involves reading the genetic or protein sequences from files,
-which are often in formats like FASTA or FASTQ. These files contain one or
-multiple sequences that will be processed to extract k-mers.
-### Generating K-mers
-For each sequence, generate all possible subsequences of length \\(k\\). This is
-done by sliding a window of size \\(k\\) across the sequence, one nucleotide (or
-amino acid) at a time, and extracting the subsequence within this window.
-### Counting K-mers
-The extracted k-mers are then counted. This can be achieved using various data
-structures:
- **Hash Tables (Dictionaries)**: They offer an efficient way to keep track of
-  k-mer counts, with k-mers as keys and their frequencies as values. This
-  approach is straightforward but can become memory-intensive with large
-  datasets or large values of \\(k\\).
- **Suffix Trees or Arrays**: These data structures are more space-efficient for
-  k-mer counting, especially for large datasets. They allow for efficient
-  retrieval of k-mer occurrences but are more complex to implement.
- **Bloom Filters and Count-Min Sketch**: For very large datasets, probabilistic
-  data structures like Bloom filters or Count-Min Sketch can estimate k-mer
-  counts using significantly less memory, at the cost of a controlled error
-  rate.
-### Handling Memory and Performance Issues
-K-mer counting can be memory-intensive, especially for large values of \\(k\\) or
-large datasets. Optimizations include:
- **Compressing K-mers**: Representing k-mers using a binary format rather than
-  strings can save memory.
- **Parallel Processing**: Distributing the k-mer counting task across multiple
-  processors or machines can significantly speed up the process.
- **Minimizing I/O Operations**: Efficiently reading and processing sequences
-  from files in chunks reduces I/O overhead.
-### Post-processing
-After counting, the k-mer frequencies can be used directly for analyses or can
-undergo further processing, such as filtering rare k-mers, which are often
-errors, or normalizing counts for comparative analysis.
-### Implementation Example
-Here's a simple Python example using a dictionary for k-mer counting:
-```python
-def count_kmers(sequence, k):
-    kmer_counts = {}
-    for i in range(len(sequence) - k + 1):
-        kmer = sequence[i:i+k]
-        if kmer in kmer_counts:
-            kmer_counts[kmer] += 1
-        else:
-            kmer_counts[kmer] = 1
-    return kmer_counts
-# Example usage
-sequence = "ATGCGATGATCTGATG"
-k = 3
-kmer_counts = count_kmers(sequence, k)
-print(kmer_counts)
-```
-This code snippet counts the occurrences of each 3-mer in a given sequence.
-For real-world applications, especially those involving large datasets, consider
-using specialized bioinformatics tools like Jellyfish, KMC, or khmer, which are
-optimized for efficiency and scalability.
-Now that we have the basics out of the way we can start implementing basic k-mer
-counter in C.
-## Implementing sequence reading in C
-## Additional reading material
- [2101.08385](https://arxiv.org/pdf/2101.08385.pdf) - Motif Identification using CNN-based Pairwise
- [2112.15107](https://arxiv.org/pdf/2112.15107.pdf) - Probabilistic Models of k-mer Frequencies
- [2205.13915](https://arxiv.org/pdf/2205.13915.pdf) - DiMA: Sequence Diversity Dynamics Analyser for Viruses
- [2209.09242](https://arxiv.org/pdf/2209.09242.pdf) - Computing Phylo-k-mers
- [2305.07545](https://arxiv.org/pdf/2305.07545.pdf) - KmerCo: A lightweight K-mer counting technique with a tiny memory footprint
- [2308.01920](https://arxiv.org/pdf/2308.01920.pdf) - Sequence-Based Nanobody-Antigen Binding
- [2310.10321](https://arxiv.org/pdf/2310.10321.pdf) - Hamming Encoder: Mining Discriminative k-mers for Discrete Sequence Classification
- [2312.03865](https://arxiv.org/pdf/2312.03865.pdf) - Learning Genomic Sequence Representations using Graph Neural Networks over De Bruijn Graphs
- [2401.14025](https://arxiv.org/pdf/2401.14025.pdf) - DNA Sequence Classification with Compressors
author	Mitja Felicijan <mitja.felicijan@gmail.com>	2024-03-10 14:59:14 +0100
committer	Mitja Felicijan <mitja.felicijan@gmail.com>	2024-03-10 14:59:14 +0100
commit	1100562e29f6476448b656dbddd4cf22505523f6 (patch)
tree	442eec492199104bd49dfd74474ce89ade8fcac9 /_posts/posts/2024-02-11-k-mer.md
parent	a40d80be378e46a6c490e1b99b0d8f4acd968503 (diff)
download	mitjafelicijan.com-1100562e29f6476448b656dbddd4cf22505523f6.tar.gz

diff --git a/_posts/posts/2024-02-11-k-mer.md b/_posts/posts/2024-02-11-k-mer.md deleted file mode 100644 index 254b5df..0000000 --- a/_posts/posts/2024-02-11-k-mer.md +++ /dev/null
@@ -1,141 +0,0 @@
1	---
2	title: "Navigating the genome using k-mers for DNA analysis and visualization"
3	permalink: /navigating-the-genome-using-k-mers-for-dna-analysis-and-visualization.html
4	date: 2024-02-11T01:04:28+02:00
5	layout: post
6	type: post
7	mathjax: yes
8	draft: true
9	published: false
10	---
11
12	## Brief introduction to K-mer
13
14	A "k-mer" refers to all the possible substrings of length \\(k\\) contained in a
15	string, which is commonly used in computational biology and bioinformatics. In
16	the context of DNA, RNA, or protein sequences, a k-mer is a sequence of \\(k\\)
17	nucleotides (for DNA and RNA) or amino acids (for proteins).
18
19	The concept of k-mers is fundamental in various bioinformatics applications,
20	including genome assembly, sequence alignment, and identification of repeat
21	sequences. By analyzing the frequency and distribution of k-mers within a
22	sequence or set of sequences, researchers can infer structural characteristics,
23	identify genetic variants, and compare genomic or proteomic compositions between
24	different organisms or conditions.
25
26	For example, in genome assembly, k-mers are used to reconstruct the sequence of
27	a genome from a collection of short sequencing reads. By finding overlaps
28	between the k-mers derived from these reads, assembly algorithms can piece
29	together contiguous sequences (contigs), which represent longer sections of the
30	genome.
31
32	The choice of \\(k\\) (the length of the k-mer) is crucial and depends on the
33	specific application. A larger \\(k\\) provides more specificity (useful for
34	distinguishing between closely related sequences), while a smaller \\(k\\)
35	offers greater sensitivity (useful for detecting repeats or low-complexity
36	regions). However, the computational resources required increase with \\(k\\),
37	as there are \\(4^k\\) possible k-mers for nucleotide sequences (due to the four
38	types of nucleotides: A, T, C, G) and \\(20^k\\) for amino acid sequences (due
39	to the twenty standard amino acids).
40
41	## K-mer counting
42
43	K-mer counting is a fundamental process in bioinformatics used for analyzing the
44	frequency of k-mers (subsequences of length \\(k\\)) in DNA, RNA, or protein
45	sequences. Efficient k-mer counting is crucial for various applications such as
46	genome assembly, metagenomics, and sequence comparison. The implementation
47	typically involves parsing a sequence into all possible k-mers and then counting
48	the occurrences of each unique k-mer. Here's a general approach to implementing
49	k-mer counting:
50
51	### Reading the Sequences
52
53	The first step involves reading the genetic or protein sequences from files,
54	which are often in formats like FASTA or FASTQ. These files contain one or
55	multiple sequences that will be processed to extract k-mers.
56
57	### Generating K-mers
58
59	For each sequence, generate all possible subsequences of length \\(k\\). This is
60	done by sliding a window of size \\(k\\) across the sequence, one nucleotide (or
61	amino acid) at a time, and extracting the subsequence within this window.
62
63	### Counting K-mers
64
65	The extracted k-mers are then counted. This can be achieved using various data
66	structures:
67
68	- Hash Tables (Dictionaries): They offer an efficient way to keep track of
69	k-mer counts, with k-mers as keys and their frequencies as values. This
70	approach is straightforward but can become memory-intensive with large
71	datasets or large values of \\(k\\).
72	- Suffix Trees or Arrays: These data structures are more space-efficient for
73	k-mer counting, especially for large datasets. They allow for efficient
74	retrieval of k-mer occurrences but are more complex to implement.
75	- Bloom Filters and Count-Min Sketch: For very large datasets, probabilistic
76	data structures like Bloom filters or Count-Min Sketch can estimate k-mer
77	counts using significantly less memory, at the cost of a controlled error
78	rate.
79
80	### Handling Memory and Performance Issues
81
82	K-mer counting can be memory-intensive, especially for large values of \\(k\\) or
83	large datasets. Optimizations include:
84
85	- Compressing K-mers: Representing k-mers using a binary format rather than
86	strings can save memory.
87	- Parallel Processing: Distributing the k-mer counting task across multiple
88	processors or machines can significantly speed up the process.
89	- Minimizing I/O Operations: Efficiently reading and processing sequences
90	from files in chunks reduces I/O overhead.
91
92	### Post-processing
93
94	After counting, the k-mer frequencies can be used directly for analyses or can
95	undergo further processing, such as filtering rare k-mers, which are often
96	errors, or normalizing counts for comparative analysis.
97
98	### Implementation Example
99
100	Here's a simple Python example using a dictionary for k-mer counting:
101
102	```python
103	def count_kmers(sequence, k):
104	kmer_counts = {}
105	for i in range(len(sequence) - k + 1):
106	kmer = sequence[i:i+k]
107	if kmer in kmer_counts:
108	kmer_counts[kmer] += 1
109	else:
110	kmer_counts[kmer] = 1
111	return kmer_counts
112
113	# Example usage
114	sequence = "ATGCGATGATCTGATG"
115	k = 3
116	kmer_counts = count_kmers(sequence, k)
117	print(kmer_counts)
118	```
119
120	This code snippet counts the occurrences of each 3-mer in a given sequence.
121
122	For real-world applications, especially those involving large datasets, consider
123	using specialized bioinformatics tools like Jellyfish, KMC, or khmer, which are
124	optimized for efficiency and scalability.
125
126	Now that we have the basics out of the way we can start implementing basic k-mer
127	counter in C.
128
129	## Implementing sequence reading in C
130
131	## Additional reading material
132
133	- [2101.08385](https://arxiv.org/pdf/2101.08385.pdf) - Motif Identification using CNN-based Pairwise
134	- [2112.15107](https://arxiv.org/pdf/2112.15107.pdf) - Probabilistic Models of k-mer Frequencies
135	- [2205.13915](https://arxiv.org/pdf/2205.13915.pdf) - DiMA: Sequence Diversity Dynamics Analyser for Viruses
136	- [2209.09242](https://arxiv.org/pdf/2209.09242.pdf) - Computing Phylo-k-mers
137	- [2305.07545](https://arxiv.org/pdf/2305.07545.pdf) - KmerCo: A lightweight K-mer counting technique with a tiny memory footprint
138	- [2308.01920](https://arxiv.org/pdf/2308.01920.pdf) - Sequence-Based Nanobody-Antigen Binding
139	- [2310.10321](https://arxiv.org/pdf/2310.10321.pdf) - Hamming Encoder: Mining Discriminative k-mers for Discrete Sequence Classification
140	- [2312.03865](https://arxiv.org/pdf/2312.03865.pdf) - Learning Genomic Sequence Representations using Graph Neural Networks over De Bruijn Graphs
141	- [2401.14025](https://arxiv.org/pdf/2401.14025.pdf) - DNA Sequence Classification with Compressors