A comprehensive collection of examples demonstrating RustKmer's capabilities through both CLI commands and Python API, with automated consistency verification.
This directory contains practical examples for genomic k-mer analysis using the RustKmer library. Each operation is demonstrated using both the command-line interface and Python API, with automated verification that both approaches produce identical results.
examples/
├── data/                            # Demo datasets
│   └── demo_rice_genome.fa.gz       # Rice genome sample (490KB uncompressed)
├── bash/                            # CLI examples
│   ├── 01_counting.sh               # k-mer counting operations
│   ├── 02_database_ops.sh           # database creation, stats, export
│   ├── 03_querying.sh               # single and batch querying
│   ├── 04_fuzzy_search.sh           # wildcard patterns and mutations
│   └── 05_benchmarking.sh           # performance testing
├── python/                          # Python API examples
│   ├── 01_counting.py               # k-mer counting operations
│   ├── 02_database_ops.py           # database operations
│   ├── 03_querying.py               # querying operations
│   ├── 04_fuzzy_search.py           # fuzzy search operations
│   ├── 05_benchmarking.py           # performance benchmarking
│   └── utils/                       # shared utilities
│       ├── result_validator.py      # CLI vs Python result comparison
│       └── performance_profiler.py  # timing and memory profiling
├── utils/                           # Shared verification utilities
│   └── verify_consistency.sh        # master verification script
├── marimo/                          # Interactive notebooks
│   ├── rustkmer_analysis.py
│   └── kmer_analysis.py
└── README.md                        # This file
- RustKmer CLI Installation:
  cargo build --release
  # or install from crates.io
  cargo install rustkmer
- Python API Installation:
  pip install rustkmer
- Verify Installation:
  cd examples
  ./utils/verify_consistency.sh --help
CLI Example:
cd examples/bash
./01_counting.sh
Python Example:
cd examples/python
python3 01_counting.py
Run All Examples with Verification:
cd examples
./utils/verify_consistency.sh
Purpose: Count k-mers in genomic sequences with different configurations.
Features Demonstrated:
- Basic k-mer counting with k=7
- Multi-threading optimization (1, 2, 4 threads)
- Canonical vs non-canonical k-mer processing
- Database export vs counting-only
- Performance monitoring and statistics
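To make the canonical vs non-canonical distinction in the list above concrete, here is a minimal pure-Python sketch. It does not use the RustKmer API. It assumes the common convention that a canonical k-mer is the lexicographically smaller of a k-mer and its reverse complement, and it skips windows containing N; whether RustKmer follows either convention is an assumption. The names `reverse_complement` and `count_kmers` are illustrative only.

```python
from collections import Counter

# Complement table for DNA bases; N maps to itself.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(kmer):
    """Return the reverse complement of a DNA string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def count_kmers(sequence, k, canonical=False):
    """Count k-mers in `sequence`; windows containing N are skipped (assumption)."""
    counts = Counter()
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if "N" in kmer:
            continue
        if canonical:
            # Canonical form (assumed convention): the lexicographically
            # smaller of the k-mer and its reverse complement.
            kmer = min(kmer, reverse_complement(kmer))
        counts[kmer] += 1
    return counts


if __name__ == "__main__":
    demo = "AAATTT"
    print(count_kmers(demo, 3))
    print(count_kmers(demo, 3, canonical=True))
```

For the sequence AAATTT with k=3, the four windows AAA, AAT, ATT, TTT are distinct in non-canonical mode, while canonical mode merges them into AAA and AAT with a count of 2 each.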
CLI Usage:
./01_counting.sh
# Demonstrates:
# - Single-threaded counting
# - Multi-threaded optimization
# - Canonical k-mer processing
# - Performance comparison
Python Usage:
python3 01_counting.py
# Demonstrates:
# - KmerCounter class usage
# - File processing and database creation
# - Statistics collection and analysis
# - Performance benchmarking
Purpose: Create, inspect, export, and compare k-mer databases.
Features Demonstrated:
- Database creation and validation
- Database statistics and metadata
- Content export to text format
- Database comparison and analysis
- File size and performance analysis
CLI Usage:
./02_database_ops.sh
# Creates databases and demonstrates:
# - Database statistics
# - Content export
# - Metadata extraction
Python Usage:
python3 02_database_ops.py
# Shows Python API for:
# - Database creation
# - Statistics retrieval
# - Content export and analysis
Purpose: Efficient k-mer lookup and query operations.
Features Demonstrated:
- Single k-mer queries
- Batch query operations
- Query performance analysis
- Result validation and formatting
- Performance comparison (single vs batch)
CLI Usage:
./03_querying.sh
# Demonstrates:
# - Individual k-mer lookups
# - File-based batch queries
# - Performance timing
Python Usage:
python3 03_querying.py
# Shows:
# - Database class usage
# - Query result processing
# - Batch query optimization
Purpose: Pattern matching with wildcards and mutations.
Features Demonstrated:
- Wildcard pattern expansion (N → A,T,C,G)
- Mutation tolerance (Hamming distance)
- Variant generation and filtering
- Result ranking and export
- Performance analysis
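The wildcard expansion listed above (N → A,T,C,G) can be sketched in a few lines of plain Python. This is an illustration of the idea, not the FuzzyQuery implementation; the function name `expand_wildcards` is hypothetical, and the base order simply follows the order given in the list.

```python
from itertools import product

BASES = "ATCG"  # same order as the "N → A,T,C,G" description


def expand_wildcards(pattern):
    """Expand every N in `pattern` into all four bases."""
    # Each position contributes either its fixed base or all four bases.
    choices = [BASES if ch == "N" else ch for ch in pattern.upper()]
    return ["".join(combo) for combo in product(*choices)]


if __name__ == "__main__":
    print(expand_wildcards("ACGTN"))
    print(len(expand_wildcards("ANAN")))
```

A pattern with n wildcards yields 4^n concrete k-mers, so one N gives 4 candidates and two give 16.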
CLI Usage:
./04_fuzzy_search.sh
# Examples of:
# - Pattern: ACGTN → expands to ACGTA, ACGTT, ACGTC, ACGTG
# - Pattern: ANAN → 16 combinations with 2 wildcards
Python Usage:
python3 04_fuzzy_search.py
# Demonstrates:
# - FuzzyQuery class usage
# - Pattern expansion
# - Mutation tolerance searches
Purpose: Comprehensive performance analysis and optimization.
Features Demonstrated:
- Database creation performance
- Multi-threading scalability
- Memory usage analysis
- Query speed benchmarks
- Performance report generation
CLI Usage:
./05_benchmarking.sh
# Generates:
# - Performance metrics
# - Scalability analysis
# - Memory usage reports
Python Usage:
python3 05_benchmarking.py
# Provides:
# - Detailed performance profiling
# - Memory monitoring
# - Optimization recommendations
The utils/verify_consistency.sh script automatically runs all examples and verifies consistency between CLI and Python API:
# Run full verification (includes benchmarking)
./utils/verify_consistency.sh
# Quick verification (skip benchmarking)
./utils/verify_consistency.sh --quick
# Verbose output
./utils/verify_consistency.sh --verbose
# Generate reports only
./utils/verify_consistency.sh --report-only
The python/utils/result_validator.py module provides comprehensive validation:
from utils.result_validator import ResultValidator
validator = ResultValidator(
cli_path="rustkmer",
data_path="demo_rice_genome.fa.gz",
output_dir="output"
)
# Compare database creation
validator.compare_counting_results(k=7, threads=4)
# Compare query results
validator.compare_query_results(database_path, query_list)
# Validate fuzzy search
validator.compare_fuzzy_results(patterns=["ACGTN", "ANC", "CNN"])
- Speed: ~1-5 MB/sec depending on k-mer size and threading
- Memory: ~50-200MB for typical genome datasets
- Scalability: Excellent multi-threading performance (2-4 threads optimal)
- Format: Efficient binary RKDB format
- Single queries: ~1-5ms per k-mer
- Batch queries: ~200-1000 queries/second
- Memory usage: Minimal with memory-mapped files
- Database size: Compact binary format with fast access
- Wildcard expansion: Linear in number of combinations (4^n for n wildcards)
- Mutation tolerance: Quadratic in k-mer size and mutation level
- Optimization: Early termination and result caching
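Mutation tolerance is described above in terms of Hamming distance. As a rough illustration of what a mutation-tolerant search has to consider, the sketch below enumerates every sequence within a given number of substitutions of a k-mer. It is a brute-force illustration in plain Python, not RustKmer's algorithm (which, per the list above, also uses early termination and caching); `hamming_variants` is a hypothetical name.

```python
from itertools import combinations, product

BASES = "ACGT"


def hamming_variants(kmer, max_mismatches):
    """Return all sequences within `max_mismatches` substitutions of `kmer`."""
    kmer = kmer.upper()
    variants = {kmer}
    for d in range(1, max_mismatches + 1):
        # Choose which d positions are mutated.
        for positions in combinations(range(len(kmer)), d):
            # At each chosen position, try every base that differs from the original.
            alternatives = [[b for b in BASES if b != kmer[p]] for p in positions]
            for replacement in product(*alternatives):
                chars = list(kmer)
                for p, b in zip(positions, replacement):
                    chars[p] = b
                variants.add("".join(chars))
    return variants


if __name__ == "__main__":
    print(len(hamming_variants("ACGTA", 1)))
    print(len(hamming_variants("ACGTA", 2)))
```

For a 5-mer, allowing one mismatch gives 1 + 5×3 = 16 candidates and allowing two gives 106, which is why the candidate set is usually filtered against the database rather than exported in full.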
# Set custom output directory
export RUSTKMER_OUTPUT_DIR="/path/to/output"
# Override default thread count
export RUSTKMER_THREADS=8
# Enable verbose logging
export RUSTKMER_VERBOSE=1
To use your own data:
- Replace the demo data:
  cp your_genome.fa.gz examples/data/
  # Update DATA_PATH in scripts
- Adjust k-mer size:
  # Edit KMER_SIZE=7 in scripts to your preferred value
  # Recommended: 5-31 depending on genome size
- UTF-8 Validation Error with CLI:
  Error: Invalid UTF-8 sequence in input file
  Solution: The CLI has strict UTF-8 validation. Use Python API for files with 'N' characters or preprocess the file.
- Memory Issues:
  Error: Out of memory
  Solution: Reduce thread count or k-mer size. Monitor memory usage with htop or Activity Monitor.
- Python Import Error:
  ModuleNotFoundError: No module named 'rustkmer'
  Solution: Install with pip install rustkmer or build from source.
- Permission Denied:
  Permission denied: ./01_counting.sh
  Solution: Make scripts executable with chmod +x examples/bash/*.sh examples/utils/*.sh
- Multi-threading: Use 2-4 threads for optimal performance
- K-mer size: Smaller k-mers (5-11) are faster, larger k-mers (21-31) are more specific
- Storage: Use SSD storage for better I/O performance
- Memory: Ensure sufficient RAM for k-mer size × dataset size
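The trade-off between small and large k in the tips above follows from how fast the k-mer space grows. The sketch below is plain arithmetic, independent of RustKmer: it computes the number of possible DNA k-mers (4^k) and an upper bound on how many distinct k-mers a sequence of a given length can contain. The function names are illustrative.

```python
def kmer_space(k):
    """Number of distinct DNA k-mers over the alphabet A, C, G, T."""
    return 4 ** k


def max_distinct_kmers(sequence_length, k):
    """Upper bound on the distinct k-mers a sequence of this length can contain."""
    # A sequence of length L has at most L - k + 1 windows of size k.
    windows = max(sequence_length - k + 1, 0)
    return min(kmer_space(k), windows)


if __name__ == "__main__":
    for k in (5, 7, 11, 21, 31):
        print(f"k={k:>2}: 4^k = {kmer_space(k):,}")
```

For small k the k-mer space itself is the limit (16,384 possible 7-mers), whereas for large k the number of windows in the input becomes the limiting factor.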
Running ./01_counting.sh produces:
=== RustKmer CLI K-mer Counting Demo ===
Data: examples/data/demo_rice_genome.fa.gz
K-mer size: 7
=== Creating Test Database ===
✓ Created database: count_test_k7_1thread.rkdb (2.1MB, 1.2s)
✓ Created database: count_test_k7_4threads.rkdb (2.1MB, 0.4s)
=== Performance Comparison ===
Configuration  Time(s)  K-mers  Memory  Efficiency
1-thread       1.234    45,678  85MB    100%
4-threads      0.432    45,678  120MB   71%
=== Database Statistics ===
Database: count_test_k7_4threads.rkdb
K-mer size: 7
Total k-mers: 45,678
Unique k-mers: 12,345
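The Efficiency column in the sample output above is consistent with the usual definition of parallel efficiency, speedup divided by thread count. Whether the script computes it exactly this way is an assumption; the sketch below only reproduces the arithmetic with the timings shown.

```python
def speedup(baseline_seconds, parallel_seconds):
    """How many times faster the parallel run is than the baseline."""
    return baseline_seconds / parallel_seconds


def parallel_efficiency(baseline_seconds, parallel_seconds, threads):
    """Speedup divided by thread count; 1.0 means perfect scaling."""
    return speedup(baseline_seconds, parallel_seconds) / threads


if __name__ == "__main__":
    # Timings taken from the sample output: 1.234 s on 1 thread, 0.432 s on 4 threads.
    eff = parallel_efficiency(1.234, 0.432, threads=4)
    print(f"speedup: {speedup(1.234, 0.432):.2f}x, efficiency: {eff:.0%}")
```

With these numbers the speedup is about 2.86x on 4 threads, which rounds to the 71% shown in the table.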
Each example creates output files in examples/output/:
- count_test_k7_*.rkdb: K-mer counting databases
- query_test_k7.rkdb: Database for query testing
- database_export_k7.txt: Database content export
- query_results_k7.txt: Query results
- fuzzy_search_results_k5.txt: Fuzzy search results
- *_performance_report.md: Performance analysis reports
- Create paired examples: One bash script and one Python script
- Use consistent patterns: Follow existing naming and structure conventions
- Include validation: Ensure results are verifiable between CLI and Python
- Add documentation: Include comprehensive comments and usage examples
- Test thoroughly: Run verification script to ensure compatibility
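For the "Include validation" guideline above, one simple way to check that a CLI run and a Python run agree is to compare their exported k-mer counts. The sketch below is a hypothetical helper, not part of result_validator.py, and it assumes a text export with one k-mer and its count per line separated by whitespace; the real export format may differ.

```python
def parse_counts(text):
    """Parse '<kmer> <count>' lines into a dict (assumed export format)."""
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        kmer, count = line.split()[:2]
        counts[kmer] = int(count)
    return counts


def diff_counts(cli_counts, python_counts):
    """Return {kmer: (cli_count, python_count)} for every k-mer whose counts differ."""
    all_kmers = sorted(set(cli_counts) | set(python_counts))
    return {
        kmer: (cli_counts.get(kmer, 0), python_counts.get(kmer, 0))
        for kmer in all_kmers
        if cli_counts.get(kmer, 0) != python_counts.get(kmer, 0)
    }


if __name__ == "__main__":
    cli = parse_counts("ACG\t3\nCGT\t1\n")
    py = parse_counts("ACG 3\nCGT 2\nGTA 1\n")
    print(diff_counts(cli, py))
```

An empty result means both runs produced identical counts; any entry shows the CLI count first and the Python count second, with 0 standing in for a k-mer missing from one side.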
# Run quick tests during development
./utils/verify_consistency.sh --quick --verbose
# Full test suite before submitting
./utils/verify_consistency.sh
- Custom K-mer Definitions: Implement specialized k-mer counting
- Stream Processing: Handle large files incrementally
- Parallel Processing: Optimize for HPC environments
- Integration: Combine with other bioinformatics tools
These examples are provided under the same license as RustKmer. See the main project license for details.
For questions or issues:
- Check troubleshooting: Review the troubleshooting section above
- Run verification: Use ./utils/verify_consistency.sh --verbose for diagnostics
- GitHub Issues: Report bugs or request features on the main repository
- Documentation: Consult the main RustKmer documentation
Happy k-mer analyzing! 🧬✨