This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
CNVkit is a command-line toolkit and Python library for detecting copy number variants and alterations genome-wide from high-throughput sequencing data. It provides both a CLI interface and Python API for genomic analysis workflows.
Supported Python versions: 3.10+ (tested on 3.10-3.14). Minimum dependency versions are aligned with Ubuntu 25.04 (Plucky).
- Bug fixes and new features: Write a failing test first, then implement.
- Edge cases: Before finishing, verify behavior for empty inputs, NaN/missing values, and single-element arrays.
- NaN weight safety: `np.average(values, weights=w)` and `numpy.sum(w)` propagate NaN; use `np.nansum` for weight sums and filter with `~np.isnan(w)` before `np.average`. Note that `pandas.Series.sum()` skips NaN by default; prefer explicit `np.nansum` for clarity.
- User-facing changes: Update the relevant docs in `doc/*.rst`.
- Clinical impact: When reviewing changes, consider whether the changeset alters numerical output or output file formats (`.cnr`, `.cns`, `.cnn`, SEG, VCF). Flag any such changes explicitly, as downstream clinical pipelines may depend on exact output stability.
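The NaN-weight pitfall above can be sketched with illustrative values (not taken from the codebase):

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, np.nan, 0.5])

# np.average propagates NaN from the weights into the result
assert np.isnan(np.average(values, weights=w))

# np.nansum skips NaN, so it is safe for weight totals
assert np.nansum(w) == 1.0

# Filter NaN weights before averaging
ok = ~np.isnan(w)
safe_mean = np.average(values[ok], weights=w[ok])
assert safe_mean == 2.0  # mean of 1.0 and 3.0 with equal weights
```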
Use `bd` for task tracking.
Conda (recommended):

```shell
conda env create -f environment-dev.yml  # All deps, R, testing tools
conda activate cnvkit
```

Note: The conda env must be activated before running pytest, mypy, or other dev tools; they are not installed globally. The env includes R with DNAcopy for segmentation. Use `conda activate cnvkit && <command>` in scripts; `conda run` is unreliable here.
Run tests directly with pytest (not tox):

```shell
pytest test/                                            # Run all tests
pytest test/test_cnvlib.py                              # Run specific test file
pytest test/test_commands.py::CommandTests::test_batch  # Run specific test
pytest test/test_commands.py -k "batch or coverage"     # Run tests matching patterns
```

Comprehensive testing:

- `tox` - Full matrix: Python 3.10-3.14, linting, security, coverage, docs
- `cd test/ && make mini` - Integration tests with real genomic data (used in CI)
```shell
mypy                  # Check both cnvlib and skgenome
mypy cnvlib/batch.py  # Check a single file
```

mypy configuration (`pyproject.toml` `[tool.mypy]`):

- `check_untyped_defs = true`, `warn_unreachable = true`, `warn_return_any = true`
- `enable_error_code = ["ignore-without-code"]` - all `# type: ignore` comments must include error codes
Common type patterns in this codebase:
- `tabio.read()` returns union types -- use `# type: ignore[return-value]` at call sites
- Pandas operations often return `Any` -- use `# type: ignore[no-any-return]`
- Closures don't narrow types in mypy -- use `# type: ignore[index]` or local asserts
- Parameters typed `param: None = None` cause unreachable blocks -- use `Optional[Type] = None` instead
- Generator functions must use `-> Generator[YieldType, SendType, ReturnType]`, not `-> Iterator` or `-> None`
- When a variable changes type (e.g. `str` → `list[int]`), rename it to avoid shadowing (e.g. `copies` → `copy_strs`)
- Use `assert x is not None` to narrow `Optional` types when control flow guarantees non-None
- `numpy.bool_` is not assignable to `bool` in mypy -- use `# type: ignore[assignment]` or widen parameters to `bool | bool_ | None`
Pre-commit hooks run automatically on `git commit` (ruff, bandit, whitespace, YAML/TOML checks). Install with `pre-commit install`.
Manual commands:
- `make lint` / `make format` - Ruff linting and formatting
- `make security` - Safety + Bandit scans
- `tox -e lint` / `tox -e security` / `tox -e coverage` / `tox -e doc`
- `make help` - Show all Makefile targets
- `cnvlib/` - Main Python package
  - `commands.py` - CLI definitions and API functions (`_cmd_*` for arg parsing, `do_*` for logic)
  - `cnvkit.py` - CLI entry point that routes to commands
  - `core.py` - Core data structures and utilities
  - `segmentation/` - Segmentation algorithms (CBS, HMM, etc.)
  - `batch.py`, `segment.py`, etc. - Individual command implementations
  - `cmdutil.py`, `params.py` - Utility functions
  - `plots.py`, `diagram.py`, `heatmap.py`, `scatter.py` - Visualization
  - `importers.py`, `export.py` - Data import/export
  - `cnary.py` - CopyNumArray (extends GenomicArray with log2 ratios and gene names)
  - `vary.py` - VariantArray (extends GenomicArray with variant allele data)
- `skgenome/` - Genomic data handling library (part of CNVkit but decoupled)
  - `gary.py` - GenomicArray class for genomic interval data (wraps pandas DataFrame)
  - `tabio/` - File I/O for BED, GFF, VCF, SEG, Picard, CNVkit formats (`.cnn`/`.cnr`/`.cns`), bedGraph
GenomicArray uses a TypeVar pattern (`_GA = TypeVar("_GA", bound="GenomicArray")`) so that methods like `.copy()`, `.concat()`, and `.as_dataframe()` preserve the subclass type through type checking.
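A minimal sketch of that TypeVar pattern, using simplified stand-in classes rather than the real implementations:

```python
from typing import TypeVar

import pandas as pd

_GA = TypeVar("_GA", bound="GenomicArray")


class GenomicArray:
    """Toy version: wraps a pandas DataFrame of genomic intervals."""

    def __init__(self, data: pd.DataFrame) -> None:
        self.data = data

    def copy(self: _GA) -> _GA:
        # self.__class__ plus the _GA-typed `self` preserves the subclass type
        return self.__class__(self.data.copy())


class CopyNumArray(GenomicArray):
    pass


cna = CopyNumArray(pd.DataFrame({"log2": [0.1, -0.3]}))
assert type(cna.copy()) is CopyNumArray  # not plain GenomicArray
```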
- `.cnn` - Coverage/reference data
- `.cnr` - Copy number ratio data
- `.cns` - Segmented copy number data
Core dependencies are declared in `requirements/core.txt`; `min.txt` pins exact minimums for compatibility testing.
- Type annotations use PEP 604 union syntax: `X | Y` and `X | None`, not `Union[X, Y]` or `Optional[X]`
- Match/case (PEP 634) is used for dispatch on string literals where it improves clarity
- `removeprefix()` / `removesuffix()` (PEP 616) preferred over manual slicing for prefix/suffix removal
- Dict `|=` (PEP 584) preferred over `.update()` for merging dict literals
- All `zip()` calls use explicit `strict=True` or `strict=False` (PEP 618)
`test/test_commands.py` and `test/test_cnvlib.py` each have a top-level `from cnvlib import (...)` block that serves as both a smoke test and the shared import set. Add new `cnvlib` submodule imports there rather than as local imports inside individual test methods.
- The codebase uses `bam_fname` or `sample_fname` for file paths that can be either BAM or bedGraph files
- Parameter names in function signatures often use generic terms (e.g., `bam_fname`) even when they accept multiple formats
A Serena MCP server provides LSP-backed code intelligence tools (configured via `claude mcp add` with `--context claude-code`, which exposes only LSP tools, not file I/O or shell).
When to use Serena vs. built-in tools:
- Exploring unfamiliar code: use `get_symbols_overview` to see a file's structure without reading the whole file, then `find_symbol` with `include_body=True` to read only the methods you need
- Tracing call graphs: use `find_referencing_symbols` to find all callers/users of a symbol across the codebase
- Refactoring: use `replace_symbol_body` / `insert_after_symbol` for precise symbolic edits
- Simple lookups: use Grep/Glob for known string patterns; Serena is better for semantic queries (e.g. "all methods of CopyNumArray" or "all callers of `do_segmentation`")
Key tools:
- `find_symbol` - Find by name path (e.g. `CopyNumArray/squash_genes`); use `depth=1` to list methods, `include_body=True` to read implementations
- `get_symbols_overview` - List all top-level symbols in a file
- `find_referencing_symbols` - Find all references to a symbol with surrounding code context
- `search_for_pattern` - Regex search with file filtering (for non-symbol searches)
- `replace_symbol_body` / `insert_after_symbol` / `insert_before_symbol` - Symbolic edits
The analytical methods implemented in CNVkit are described in the publication: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004873
When implementing or modifying analytical methods, look up the primary literature to understand the underlying algorithms. Use Google Scholar and Europe PMC to find and read the original papers for methods referenced in the code (e.g. segmentation algorithms, statistical tests, normalization approaches).