PyPath is a Python library for molecular biology and biomedical prior knowledge processing. It aggregates data from ~200 biological databases into unified structures for network biology, annotations, protein complexes, and enzyme-substrate relationships.
Key facts:
- 12+ years of development history with heterogeneous code quality
- Python 3.9+ (dropped Python 2 in 2021)
- Poetry for dependency management
- GPLv3 license
- Ongoing modularization into separate packages
pypath/
├── core/ # Database classes (network, complex, annot, enz_sub, intercell)
├── inputs/ # Data acquisition: 189 modules for ~150 data sources
├── share/ # Shared utilities (curl, session/logging, settings, cache)
├── utils/ # Standalone utilities (mapping, taxonomy, orthology, GO, sequences)
├── resources/ # Data resources (urls.py with 1000+ endpoints, descriptions)
├── formats/ # Format parsers (sqlite, xml, sdf, obo, sqldump)
├── internals/ # Internal infrastructure (input_formats, resource definitions)
├── omnipath/ # High-level database management, web server, export
├── legacy/ # Deprecated code (do not use)
└── obsolete/ # Old code (do not use)
External companion packages:
- pypath-common - Shared utilities across pypath projects
- download-manager - Generic download infrastructure
- cachemanager - Cache management

Naming conventions:

- Classes: PascalCase. Resource names as single words: ProteinatlasAnnotation
- Functions/methods/variables: snake_case
- Resource names in functions: Single word, no underscores: proteinatlas_annotations() not protein_atlas_annotations()
- Underscores separate primary/secondary sources: PhosphoSite_ProtMapper
- Input function names follow the pattern: {resource}_{suffix}

Standard suffixes:

- _interactions - Protein-protein interactions
- _enz_sub - Enzyme-substrate relationships
- _complexes - Protein complex membership
- _annotations - Functional annotations
- _raw - Near-raw data dumps
- _mapping - ID cross-references

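The naming pattern can be checked mechanically. The following is a minimal sketch, assuming resource names are lowercase, single-word and alphanumeric; `split_input_name` and `STANDARD_SUFFIXES` are hypothetical names used for illustration and are not part of pypath.

```python
from __future__ import annotations

import re

# Suffixes listed above; the helper itself is hypothetical, not pypath API.
STANDARD_SUFFIXES = (
    'interactions',
    'enz_sub',
    'complexes',
    'annotations',
    'raw',
    'mapping',
)

_NAME_PATTERN = re.compile(
    r'^(?P<resource>[a-z0-9]+)_(?P<suffix>%s)$' % '|'.join(STANDARD_SUFFIXES)
)


def split_input_name(name: str) -> tuple[str, str] | None:
    """
    Splits an input function name into (resource, suffix).

    Returns:
        The two parts, or None if the name does not follow
        `{resource}_{suffix}` with a single-word resource name.
    """

    match = _NAME_PATTERN.match(name)

    if match is None:
        return None

    return match.group('resource'), match.group('suffix')
```

With this sketch, `split_input_name('proteinatlas_annotations')` yields the resource and suffix, while `split_input_name('protein_atlas_annotations')` yields `None` because the resource part contains an underscore.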
- Two blank lines between methods and classes
- One blank line before/after logic blocks, between logical segments
- Single quotes preferred unless reason for double quotes
- Operators surrounded by spaces: a = a * 4 + 3

Argument lists:

def function(
    arg1: str,
    arg2: int,
    **kwargs,
) -> list:

- New line after opening parenthesis
- Each element on own line, indented one level
- Trailing comma after each element
- Closing parenthesis on own line at original indentation

Long comprehensions:

result = [
    transform(item)
    for item in items
    if condition(item)
]

Docstrings use Google (Napoleon) style with type hints:

def function(a: int) -> list[int]:
    """
    Brief description of function.

    Args:
        a: A number. Description continues
            on next line with indent.

    Returns:
        A list of integers.
    """

Import order (with blank lines between sections):

- from __future__ import annotations
- Standard library
- Typing imports
- Third-party packages
- PyPath internal imports

from __future__ import annotations

import collections
import re

from typing import Iterable, Optional

import pandas as pd

import pypath.share.curl as curl
import pypath.utils.mapping as mapping

Avoid in new code (legacy compatibility):

# DON'T use these in new code:
from future.utils import iteritems  # Use .items() directly
from past.builtins import xrange  # Use range() directly

Classes inherit from pypath.share.session.Logger:

import pypath.share.session as session_mod


class MyClass(session_mod.Logger):

    def __init__(self):

        session_mod.Logger.__init__(self, name='mymodule')
        self._log('Initialized')

Or module-level logger:

import pypath.share.session as session_mod

_logger = session_mod.Logger(name='inputs.mymodule')
_logger._log('Processing data...')

Input modules in pypath/inputs/:

- 189 Python files for ~150 data sources
- 16 sub-packages for complex resources (chembl, brenda, bindingdb, ramp, reactome, rhea, disgenet, hmdb, lipidmaps, etc.)
- Simple resources: single file inputs/{resource}.py
- Complex resources: sub-package inputs/{resource}/

Typical workflow of an input function:

- Download via pypath.share.curl.Curl
- Parse to Python structures (lists, dicts, namedtuples)
- Map IDs to UniProt using pypath.utils.mapping
- Return standardized format

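The four steps above can be sketched without network access. This is an illustration under stated assumptions, not pypath's implementation: the in-memory text stands in for a `Curl` download, the lookup table stands in for `pypath.utils.mapping`, and `example_interactions`, `ExampleInteraction`, `RAW_TSV` and `TOY_GENESYMBOL_TO_UNIPROT` are hypothetical names.

```python
from __future__ import annotations

import collections

# Stand-in for a downloaded file; real modules fetch this via curl.Curl.
RAW_TSV = (
    'source\ttarget\teffect\n'
    'EGFR\tGRB2\tstimulation\n'
    'TP53\tMDM2\tinhibition\n'
    'UNKNOWN1\tGRB2\tstimulation\n'
)

# Toy lookup table (assumption); real modules call pypath.utils.mapping.
TOY_GENESYMBOL_TO_UNIPROT = {
    'EGFR': {'P00533'},
    'GRB2': {'P62993'},
    'TP53': {'P04637'},
    'MDM2': {'Q00987'},
}

ExampleInteraction = collections.namedtuple(
    'ExampleInteraction',
    ['source', 'target', 'effect'],
)


def example_interactions(raw: str = RAW_TSV) -> list[ExampleInteraction]:
    """
    Parses tab-separated text and maps gene symbols to UniProt IDs.

    Rows with an unmappable partner are dropped; one-to-many mappings
    yield one record per ID combination.
    """

    result = []

    for line in raw.splitlines()[1:]:  # Skip header

        fields = line.strip().split('\t')

        if len(fields) < 3:
            continue

        sources = TOY_GENESYMBOL_TO_UNIPROT.get(fields[0], set())
        targets = TOY_GENESYMBOL_TO_UNIPROT.get(fields[1], set())

        result.extend(
            ExampleInteraction(source=s, target=t, effect=fields[2])
            for s in sorted(sources)
            for t in sorted(targets)
        )

    return result
```

The single-file template below has the same shape but uses the real download helper and leaves ID mapping to the caller.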
"""
Client and parser for ResourceName database.
"""

from __future__ import annotations

import collections

import pypath.share.curl as curl
import pypath.resources.urls as urls
import pypath.utils.mapping as mapping

ResourceRecord = collections.namedtuple(
    'ResourceRecord',
    ['source', 'target', 'effect'],
)


def resource_interactions() -> list[ResourceRecord]:
    """
    Retrieves interactions from ResourceName.

    Returns:
        List of interaction records.
    """

    url = urls.urls['resource']['url']
    c = curl.Curl(url, silent=False, large=True)
    result = []
    # Assumes that with large=True, c.result is an iterator over lines
    # rather than one string held in memory
    lines = iter(c.result)
    next(lines, None)  # Skip header

    for line in lines:

        fields = line.strip().split('\t')

        if len(fields) < 3:
            continue

        result.append(
            ResourceRecord(
                source=fields[0],
                target=fields[1],
                effect=fields[2],
            )
        )

    return result

Sub-package layout for complex resources:

inputs/resource/
├── __init__.py # Public API exports
├── _common.py # Shared utilities and logger
├── _sqlite.py # SQLite database access (if applicable)
├── _molecules.py # Specific data extractors
├── _targets.py
└── _raw.py # Raw data access

SQLite database access:

import pypath.formats.sqlite as _sqlite
import pypath.share.curl as curl
import pypath.resources.urls as urls


def resource_sqlite(version: int = 1, connect: bool = True):
    """
    Downloads and connects to ResourceName SQLite database.
    """

    url = urls.urls['resource']['sqlite'] % version

    def _download() -> curl.Curl:

        return curl.Curl(url, large=True, silent=False)

    def _extractor(curl_obj: curl.Curl) -> str:

        return curl_obj.result

    return _sqlite.download_sqlite(
        download_callback=_download,
        extractor=_extractor,
        database='ResourceName',
        version=str(version),
        connect=connect,
    )

ID Mapping:

import pypath.utils.mapping as mapping

gene_symbols = ['EGFR', 'TP53']  # Example input

# Map multiple IDs
uniprot_ids = mapping.map_names(gene_symbols, 'genesymbol', 'uniprot')

# Map single ID (returns a set of UniProt IDs)
uniprot = mapping.map_name('EGFR', 'genesymbol', 'uniprot')

Taxonomy:

import pypath.utils.taxonomy as taxonomy

name = taxonomy.taxids[9606] # 'Homo sapiens'

pypath_common utilities:

from pypath_common import _misc, _constants as _const

iterable = ['a', 'b', 'c']  # Example input
first_item = _misc.first(iterable)

Glom for JSON processing:

import glom
from pypath_common import _constants as _const

record = {'nested': {'path': {'to': {'field': 'value'}}}}  # Example input
spec = {'field': 'nested.path.to.field'}
result = glom.glom(record, spec, default=_const.GLOM_ERROR)

Cache location: ~/.pypath/cache/ (hash-named files)

Cache control context managers (fetch_data is a placeholder for any input function call):

import pypath.share.curl as curl

# Force fresh download
with curl.cache_delete_on():
    data = fetch_data()

# Enable caching
with curl.cache_on():
    data = fetch_data()

# Debug mode
with curl.debug_on():
    data = fetch_data()

Testing input modules from the command line:

# Test specific module
python input_module_maintenance/test_input_modules.py --module signor

# Test specific function
python input_module_maintenance/test_input_modules.py --function signor.signor_interactions

# List all modules
python input_module_maintenance/test_input_modules.py --list-modules

Quick smoke test:

from pypath.inputs import signor

ints = signor.signor_interactions()
print(len(ints), ints[0])

Adding a new resource:

- Create module in pypath/inputs/ (file or sub-package)
- Add URL to pypath/resources/urls.py
- Add metadata to pypath/resources/descriptions.py
- Update resources.json with license information
- Create core database input definitions if needed:
  - network: Format definition in resources/data_formats.py
  - annot: Resource class in core/annot.py
  - complex: Resource class in core/complex.py
  - enz_sub: Definition in resources.json
- Write tests
- Update documentation

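For the URL step, the lookups used elsewhere in this guide (`urls.urls['resource']['url']` and `urls.urls['resource']['sqlite'] % version`) suggest a nested dictionary keyed by resource name. Below is a minimal sketch of such an entry, assuming that layout; the `label` key, the example.org addresses and the `resource_url` helper are hypothetical placeholders, not real endpoints or pypath API.

```python
from __future__ import annotations

# Hypothetical entry mimicking the layout of pypath/resources/urls.py;
# the hosts below are placeholders, not real endpoints.
urls = {
    'resource': {
        'label': 'ResourceName main download',
        'url': 'https://example.org/resource/interactions.tsv',
        # %u is filled in with the release number at call time
        'sqlite': 'https://example.org/resource/release_%u.sqlite',
    },
}


def resource_url(key: str = 'url', version: int | None = None) -> str:
    """
    Looks up a URL for the example resource.

    Args:
        key: Which endpoint of the resource to use.
        version: Release number; only for templates with a placeholder.

    Returns:
        The URL, with the version filled in if one was given.
    """

    template = urls['resource'][key]

    if version is None:
        return template

    return template % version
```

Keeping versioned endpoints as format templates lets the input function choose the release without hard-coding it in the URL table.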
Common issues:

- SSL certificate errors are common - use the curl.debug_on() context
- Clear corrupted cache with curl.cache_delete_on()
- Some resources have rate limiting or require retries
- Species parameter is critical: ncbi_tax_id=9606 for human
- Old/deleted UniProt IDs may fail to map
- Always validate mapping results
- Data sources change formats without warning
- Compare cached vs fresh downloads when debugging
- Manual browser verification needed for failed tests
- ~30% of codebase is older "messy" style from pre-2019
- Follow modern conventions in new code
- Don't fix unrelated style issues in the same PR
- Some resources timeout during download
- May need extended timeout configuration
- Use large=True parameter in Curl for big files

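The advice to validate mapping results can be made concrete by separating identifiers that mapped from those that did not. This is a minimal sketch, assuming a mapper that returns a (possibly empty) set per identifier; `partition_mapped`, `toy_mapper` and the small lookup table are hypothetical and only stand in for a real mapping call.

```python
from __future__ import annotations

from typing import Callable, Iterable


def partition_mapped(
    ids: Iterable[str],
    mapper: Callable[[str], set[str]],
) -> tuple[dict[str, set[str]], list[str]]:
    """
    Applies `mapper` to each ID and separates hits from misses.

    Returns:
        A dict of successfully mapped IDs and a list of IDs for which
        the mapper returned nothing.
    """

    mapped = {}
    unmapped = []

    for _id in ids:

        targets = mapper(_id)

        if targets:
            mapped[_id] = set(targets)

        else:
            unmapped.append(_id)

    return mapped, unmapped


# Toy mapper (assumption); in pypath this role would be played by a call
# such as mapping.map_name(_id, 'genesymbol', 'uniprot', ncbi_tax_id=9606).
_TOY_TABLE = {'EGFR': {'P00533'}, 'TP53': {'P04637'}}


def toy_mapper(_id: str) -> set[str]:

    return _TOY_TABLE.get(_id, set())


mapped, unmapped = partition_mapped(['EGFR', 'TP53', 'NOTAGENE'], toy_mapper)
```

Logging or reporting the `unmapped` list makes silent losses visible, for example when a source still uses old or deleted identifiers.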
Key files:

- pypath/resources/urls.py - All 1000+ data source URLs
- pypath/resources/descriptions.py - Resource metadata
- pypath/resources/data_formats.py - Input format definitions
- pypath/share/curl.py - Download infrastructure
- pypath/utils/mapping.py - ID translation
- CONTRIBUTING.md - Detailed contribution guide
- input_module_maintenance/ - Testing and maintenance tools