CLAUDE.md - PyPath Development Guide

Project Overview

PyPath is a Python library for molecular biology and biomedical prior knowledge processing. It aggregates data from ~200 biological databases into unified structures for network biology, annotations, protein complexes, and enzyme-substrate relationships.

Key facts:

12+ years of development history with heterogeneous code quality
Python 3.9+ (dropped Python 2 in 2021)
Poetry for dependency management
GPLv3 license
Ongoing modularization into separate packages

Project Structure

pypath/
├── core/          # Database classes (network, complex, annot, enz_sub, intercell)
├── inputs/        # Data acquisition: 189 modules for ~150 data sources
├── share/         # Shared utilities (curl, session/logging, settings, cache)
├── utils/         # Standalone utilities (mapping, taxonomy, orthology, GO, sequences)
├── resources/     # Data resources (urls.py with 1000+ endpoints, descriptions)
├── formats/       # Format parsers (sqlite, xml, sdf, obo, sqldump)
├── internals/     # Internal infrastructure (input_formats, resource definitions)
├── omnipath/      # High-level database management, web server, export
├── legacy/        # Deprecated code (do not use)
└── obsolete/      # Old code (do not use)

External companion packages:

pypath-common - Shared utilities across pypath projects
download-manager - Generic download infrastructure
cachemanager - Cache management

Code Style and Conventions

Naming

Classes: PascalCase. Resource names as single words: ProteinatlasAnnotation
Functions/methods/variables: snake_case
Resource names in functions: Single word, no underscores: proteinatlas_annotations() not protein_atlas_annotations()
- Underscores separate primary/secondary sources: PhosphoSite_ProtMapper

Input Function Naming Pattern

{resource}_{suffix}

Standard suffixes:

_interactions - Protein-protein interactions
_enz_sub - Enzyme-substrate relationships
_complexes - Protein complex membership
_annotations - Functional annotations
_raw - Near-raw data dumps
_mapping - ID cross-references
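
As an illustration of the {resource}_{suffix} pattern, a name check could look like the sketch below. It assumes only the suffix list above and the rule that the resource part is a single lowercase word. SUFFIXES and split_input_name are hypothetical names introduced here, not part of PyPath.

```python
from __future__ import annotations

import re

# Suffixes from the list above; hypothetical constant, not a PyPath name
SUFFIXES = (
    'interactions',
    'enz_sub',
    'complexes',
    'annotations',
    'raw',
    'mapping',
)

# Resource part: one lowercase word (letters and digits), no underscores
_RE_INPUT_NAME = re.compile(
    r'^([a-z][a-z0-9]*)_(%s)$' % '|'.join(SUFFIXES)
)


def split_input_name(name: str) -> tuple[str, str] | None:
    """
    Splits an input function name into resource and suffix.

    Returns:
        A (resource, suffix) tuple, or None if the name does not
        follow the pattern.
    """

    match = _RE_INPUT_NAME.match(name)

    return (match.group(1), match.group(2)) if match else None
```

Under these assumptions proteinatlas_annotations is accepted, while protein_atlas_annotations is rejected because the resource part contains an underscore.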

Formatting

Two blank lines between methods and classes
One blank line before/after logic blocks, between logical segments
Single quotes preferred unless reason for double quotes
Operators surrounded by spaces: a = a * 4 + 3

Argument lists:

def function(
        arg1: str,
        arg2: int,
        **kwargs,
    ) -> list:

New line after opening parenthesis
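Each element on own line, indented one level; in function definitions two levels, as in the example above
Trailing comma after each element
Closing parenthesis on own line at original indentation; in function definitions one level deeper, as in the example above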

Long comprehensions:

result = [
    transform(item)
    for item in items
    if condition(item)
]

Docstrings

Google (Napoleon) style with type hints:

def function(a: int) -> list[int]:
    """
    Brief description of function.

    Args:
        a: A number. Description continues
            on next line with indent.

    Returns:
        A list of integers.
    """

Imports

Order (with blank lines between sections):

from __future__ import annotations
Standard library
Typing imports
Third-party packages
PyPath internal imports

from __future__ import annotations

import collections
import re

from typing import Iterable, Optional

import pandas as pd

import pypath.share.curl as curl
import pypath.utils.mapping as mapping

Avoid in new code (legacy compatibility):

# DON'T use these in new code:
from future.utils import iteritems  # Use .items() directly
from past.builtins import xrange    # Use range() directly

Logging

Classes inherit from pypath.share.session.Logger:

import pypath.share.session as session_mod


class MyClass(session_mod.Logger):
    def __init__(self):
        session_mod.Logger.__init__(self, name='mymodule')
        self._log('Initialized')

Or module-level logger:

import pypath.share.session as session_mod
_logger = session_mod.Logger(name='inputs.mymodule')
_logger._log('Processing data...')

The `inputs` Module

Structure

189 Python files for ~150 data sources
16 sub-packages for complex resources (chembl, brenda, bindingdb, ramp, reactome, rhea, disgenet, hmdb, lipidmaps, etc.)
Simple resources: single file inputs/{resource}.py
Complex resources: sub-package inputs/{resource}/

Standard Data Flow

Download via pypath.share.curl.Curl
Parse to Python structures (lists, dicts, namedtuples)
Map IDs to UniProt using pypath.utils.mapping
Return standardized format

Basic Input Module Pattern

"""
Client and parser for ResourceName database.
"""

from __future__ import annotations

import collections

import pypath.share.curl as curl
import pypath.resources.urls as urls
import pypath.utils.mapping as mapping


ResourceRecord = collections.namedtuple(
    'ResourceRecord',
    ['source', 'target', 'effect'],
)


def resource_interactions() -> list[ResourceRecord]:
    """
    Retrieves interactions from ResourceName.

    Returns:
        List of interaction records.
    """

    url = urls.urls['resource']['url']
    c = curl.Curl(url, silent=False, large=True)

    result = []

    # With large=True, c.result is read line by line, not as one string
    lines = iter(c.result)
    next(lines, None)  # Skip header

    for line in lines:

        fields = line.strip().split('\t')
        if len(fields) < 3:
            continue

        result.append(
            ResourceRecord(
                source=fields[0],
                target=fields[1],
                effect=fields[2],
            )
        )

    return result
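
The pattern above covers download and parsing but not the ID mapping step of the standard data flow. Below is a minimal sketch of that step, assuming that one input ID can translate to several target IDs. To run without network access it uses a hypothetical in-memory table (GENESYMBOL_TO_UNIPROT) in place of pypath.utils.mapping. The helper map_records is also hypothetical.

```python
from __future__ import annotations

import collections

ResourceRecord = collections.namedtuple(
    'ResourceRecord',
    ['source', 'target', 'effect'],
)

# Hypothetical stand-in for an ID translation table; in PyPath this
# lookup would go through pypath.utils.mapping instead
GENESYMBOL_TO_UNIPROT = {
    'EGFR': {'P00533'},
    'GRB2': {'P62993'},
}


def map_records(
        records: list[ResourceRecord],
        table: dict[str, set[str]],
    ) -> list[ResourceRecord]:
    """
    Translates source and target IDs, dropping unmapped records.

    One output record is created for each combination of mapped
    source and target IDs.
    """

    result = []

    for rec in records:

        for source in sorted(table.get(rec.source, ())):

            for target in sorted(table.get(rec.target, ())):

                result.append(
                    ResourceRecord(
                        source=source,
                        target=target,
                        effect=rec.effect,
                    )
                )

    return result


raw = [
    ResourceRecord('EGFR', 'GRB2', 'stimulation'),
    ResourceRecord('EGFR', 'UNKNOWN1', 'inhibition'),
]
mapped = map_records(raw, GENESYMBOL_TO_UNIPROT)
```

The second raw record is dropped because its target has no translation. Whether to drop or keep such records is a per-resource decision.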

Complex Resource Sub-Package Pattern

inputs/resource/
├── __init__.py      # Public API exports
├── _common.py       # Shared utilities and logger
├── _sqlite.py       # SQLite database access (if applicable)
├── _molecules.py    # Specific data extractors
├── _targets.py
└── _raw.py          # Raw data access

SQLite Database Pattern

import pypath.formats.sqlite as _sqlite
import pypath.share.curl as curl
import pypath.resources.urls as urls


def resource_sqlite(version: int = 1, connect: bool = True):
    """
    Downloads and connects to ResourceName SQLite database.
    """

    url = urls.urls['resource']['sqlite'] % version

    def _download() -> curl.Curl:
        return curl.Curl(url, large=True, silent=False)

    def _extractor(curl_obj: curl.Curl) -> str:
        return curl_obj.result

    return _sqlite.download_sqlite(
        download_callback=_download,
        extractor=_extractor,
        database='ResourceName',
        version=str(version),
        connect=connect,
    )

Common Utilities

ID Mapping:

import pypath.utils.mapping as mapping

# Map multiple IDs
uniprot_ids = mapping.map_names(gene_symbols, 'genesymbol', 'uniprot')

# Map single ID
uniprot = mapping.map_name('EGFR', 'genesymbol', 'uniprot')

Taxonomy:

import pypath.utils.taxonomy as taxonomy

name = taxonomy.taxids[9606]  # 'Homo sapiens'

pypath_common utilities:

from pypath_common import _misc, _constants as _const

first_item = _misc.first(iterable)

Glom for JSON processing:

import glom
from pypath_common import _constants as _const

spec = {'field': 'nested.path.to.field'}
result = glom.glom(record, spec, default=_const.GLOM_ERROR)

Cache and Downloads

Cache location: ~/.pypath/cache/ (hash-named files)
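
To make "hash-named" concrete, the sketch below derives a cache path from a URL. It assumes the file name is the MD5 hex digest of the URL. The actual scheme in pypath.share.curl may differ, for example by including POST data or a file name suffix. cache_path is a hypothetical helper.

```python
import hashlib
import pathlib

# Cache directory as stated above
CACHE_DIR = pathlib.Path.home() / '.pypath' / 'cache'


def cache_path(
        url: str,
        cache_dir: pathlib.Path = CACHE_DIR,
    ) -> pathlib.Path:
    """
    Builds a cache file path from a URL.

    Assumption: the file name is the MD5 hex digest of the URL.
    """

    digest = hashlib.md5(url.encode('utf-8')).hexdigest()

    return cache_dir / digest


path = cache_path('https://example.org/data.tsv')
```

With a scheme like this, the same URL always maps to the same file, which is why a corrupted cache file keeps being reused until it is deleted.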

Cache control context managers:

import pypath.share.curl as curl

# Force fresh download
with curl.cache_delete_on():
    data = fetch_data()

# Enable caching
with curl.cache_on():
    data = fetch_data()

# Debug mode
with curl.debug_on():
    data = fetch_data()

Testing Input Modules

# Test specific module
python input_module_maintenance/test_input_modules.py --module signor

# Test specific function
python input_module_maintenance/test_input_modules.py --function signor.signor_interactions

# List all modules
python input_module_maintenance/test_input_modules.py --list-modules

Quick smoke test:

from pypath.inputs import signor
ints = signor.signor_interactions()
print(len(ints), ints[0])
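
Beyond printing the first record, a smoke test can check a few basic properties of the output. The sketch below assumes the input function returns a list of tuples or namedtuples. smoke_check is a hypothetical helper, and the records are illustrative rather than real SIGNOR output.

```python
from __future__ import annotations

import collections


def smoke_check(records: list, min_records: int = 1) -> dict:
    """
    Summarizes a list of records for a quick sanity check.
    """

    # Number of fields in each record; more than one distinct value
    # suggests that some lines were parsed incorrectly
    n_fields = collections.Counter(len(rec) for rec in records)

    return {
        'n_records': len(records),
        'enough_records': len(records) >= min_records,
        'consistent_width': len(n_fields) <= 1,
    }


Record = collections.namedtuple('Record', ['source', 'target', 'effect'])

report = smoke_check([
    Record('P00533', 'P62993', 'stimulation'),
    Record('P62993', 'Q07889', 'stimulation'),
])
```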

Adding a New Resource

Create module in pypath/inputs/ (file or sub-package)
Add URL to pypath/resources/urls.py
Add metadata to pypath/resources/descriptions.py
Update resources.json with license information
Create core database input definitions if needed:
- network: Format definition in resources/data_formats.py
- annot: Resource class in core/annot.py
- complex: Resource class in core/complex.py
- enz_sub: Definition in resources.json
Write tests
Update documentation
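
For the URL step, the lookups used elsewhere in this guide (urls.urls['resource']['url'] and urls.urls['resource']['sqlite'] % version) suggest a nested dict with optional format placeholders. The sketch below shows an entry of that shape. The example.org addresses and the label key are assumptions for illustration, so check existing entries in pypath/resources/urls.py for the actual conventions.

```python
# Hypothetical entry, shaped for the lookups used in this guide
urls = {
    'resource': {
        'label': 'Short description of ResourceName',
        'url': 'https://example.org/resource/interactions.tsv',
        # %d is filled in with the version number by the caller
        'sqlite': 'https://example.org/resource/v%d/resource.sqlite',
    },
}

plain_url = urls['resource']['url']
sqlite_url = urls['resource']['sqlite'] % 2
```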

Common Gotchas

Network Issues

SSL certificate errors are common; run the download inside the curl.debug_on() context to inspect them
Clear a corrupted cache entry by re-running the download inside curl.cache_delete_on()
Some resources have rate limiting or require retries
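
Where a resource needs retries, a generic retry wrapper with exponential backoff is one option. The sketch below is not a PyPath API. retry and its parameters are hypothetical, and the sleep argument exists only so that the example runs without actually waiting.

```python
from __future__ import annotations

import time

from typing import Callable


def retry(
        func: Callable,
        attempts: int = 3,
        delay: float = 1.0,
        backoff: float = 2.0,
        sleep: Callable[[float], None] = time.sleep,
    ):
    """
    Calls ``func`` until it succeeds or attempts run out.

    Args:
        func: A callable without arguments, e.g. a download.
        attempts: Maximum number of calls.
        delay: Seconds to wait after the first failure.
        backoff: Multiplier applied to the delay after each failure.
        sleep: Waiting function, replaceable for testing.
    """

    for i in range(attempts):

        try:
            return func()

        except Exception:

            # Out of attempts: let the last error propagate
            if i == attempts - 1:
                raise

            sleep(delay)
            delay *= backoff


calls = []
waits = []


def flaky():
    """Fails twice, then succeeds; stands in for a rate-limited server."""

    calls.append(1)

    if len(calls) < 3:
        raise ConnectionError('temporary failure')

    return 'data'


result = retry(flaky, sleep=waits.append)
```

In this run the function is called three times, with waits of 1 and 2 seconds recorded instead of slept.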

ID Mapping

Species parameter is critical: ncbi_tax_id=9606 for human
Old/deleted UniProt IDs may fail to map
Always validate mapping results
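
One way to validate mapping results is to count the IDs that did not map or mapped ambiguously. The sketch below assumes the results have been collected into a dict with one set of target IDs per input ID. mapping_report is a hypothetical helper, and apart from EGFR the IDs are placeholders.

```python
from __future__ import annotations


def mapping_report(mapped: dict[str, set[str]]) -> dict[str, list[str]]:
    """
    Sorts input IDs by mapping outcome.

    Args:
        mapped: Translation results, one set of target IDs for each
            input ID.

    Returns:
        Input IDs without any target and with more than one target.
    """

    return {
        'unmapped': sorted(k for k, v in mapped.items() if not v),
        'ambiguous': sorted(k for k, v in mapped.items() if len(v) > 1),
    }


# Placeholder results for illustration, not real mapping output
report = mapping_report({
    'EGFR': {'P00533'},
    'EGFRR': set(),
    'AMBIG1': {'X00001', 'X00002'},
})
```

A sudden rise in the number of unmapped IDs often points to a wrong ID type or organism rather than to missing data.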

Format Drift

Data sources change formats without warning
Compare cached vs fresh downloads when debugging
Manual browser verification needed for failed tests
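
When comparing cached and fresh downloads, the header line is a good first place to look. The sketch below assumes a tab-separated table with column names in the first line. header_fields and header_diff are hypothetical helpers, and the two tables are made up.

```python
from __future__ import annotations


def header_fields(text: str, sep: str = '\t') -> list[str]:
    """
    Returns the column names from the first line of a table.
    """

    lines = text.splitlines()

    return lines[0].split(sep) if lines else []


def header_diff(cached: str, fresh: str, sep: str = '\t') -> dict:
    """
    Compares the header lines of two versions of a table.
    """

    old = header_fields(cached, sep)
    new = header_fields(fresh, sep)

    return {
        'removed': [col for col in old if col not in new],
        'added': [col for col in new if col not in old],
        'reordered': old != new and sorted(old) == sorted(new),
    }


cached = 'source\ttarget\teffect\nA\tB\t1\n'
fresh = 'source\ttarget\tmechanism\teffect\nA\tB\tx\t1\n'
diff = header_diff(cached, fresh)
```

Here a new mechanism column was inserted before effect, so a parser reading the effect from the third field would silently pick up the wrong column.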

Code Quality Variance

~30% of codebase is older "messy" style from pre-2019
Follow modern conventions in new code
Don't fix unrelated style issues in the same PR

Large Datasets

Some resources time out during download
These may need a longer timeout setting
Use large=True parameter in Curl for big files
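
With big files it helps to process one line at a time instead of holding the whole download in memory. The sketch below shows lazy parsing with a generator. io.StringIO stands in for a large download, and iter_rows is a hypothetical helper, not a PyPath function.

```python
from __future__ import annotations

import io

from typing import Iterable, Iterator


def iter_rows(
        lines: Iterable[str],
        min_fields: int = 3,
        header: bool = True,
    ) -> Iterator[list[str]]:
    """
    Yields tab-separated fields line by line.

    Args:
        lines: Any iterable of lines, e.g. an open file.
        min_fields: Lines with fewer fields are skipped.
        header: Whether to skip the first line.
    """

    lines = iter(lines)

    if header:
        next(lines, None)

    for line in lines:

        fields = line.rstrip('\r\n').split('\t')

        if len(fields) >= min_fields:
            yield fields


stream = io.StringIO(
    'source\ttarget\teffect\n'
    'A\tB\t1\n'
    'broken line\n'
    'C\tD\t-1\n'
)
rows = list(iter_rows(stream))
```

The list() call is only there to show the result; in real use, looping over iter_rows directly keeps memory use flat.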

Key Files Reference

pypath/resources/urls.py - All 1000+ data source URLs
pypath/resources/descriptions.py - Resource metadata
pypath/resources/data_formats.py - Input format definitions
pypath/share/curl.py - Download infrastructure
pypath/utils/mapping.py - ID translation
CONTRIBUTING.md - Detailed contribution guide
input_module_maintenance/ - Testing and maintenance tools
