This mini-toolkit lets you download protein or gene information for a given organism (taxonomy ID) from two major biological databases.
| Script | Purpose |
|---|---|
| `uniprot.py` | Fetches protein-centric annotations from UniProt REST API (TSV/CSV + name list). |
| `ncbi.py` | Fetches gene-centric metadata from NCBI (ESearch / ESummary) into a CSV file. |
| `ncbi_protein.py` | Fetches protein-centric data from NCBI (ESearch / ESummary / EFetch) and writes a TSV with sequences and gene mappings. |

Make sure these three Python files are in the same directory (this repo’s root). No other project files are required.
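
To give a feel for the kind of request a UniProt download involves, here is a minimal sketch that only builds a UniProtKB search URL for one organism. It is not taken from `uniprot.py`: the helper name `build_uniprot_url`, the chosen return fields, and the page size are assumptions for illustration, and the script itself may query the API differently.

```python
from urllib.parse import urlencode

# Public UniProtKB search endpoint of the UniProt REST API.
UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"


def build_uniprot_url(taxid, fields=("accession", "protein_name", "gene_names"), size=500):
    """Return a search URL for all UniProtKB entries of one organism.

    `build_uniprot_url`, the default fields, and the page size are
    illustrative assumptions, not the behaviour of uniprot.py.
    """
    params = {
        "query": f"organism_id:{taxid}",  # restrict results to one taxonomy ID
        "format": "tsv",                  # tab-separated output
        "fields": ",".join(fields),       # columns to return
        "size": size,                     # entries per page
    }
    return f"{UNIPROT_SEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    # Pseudomonas putida (160488), the organism used in the usage example.
    print(build_uniprot_url(160488))
```

The printed URL can be opened in a browser or passed to `requests.get` to inspect the first page of results.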
- Python 3.8+
- `requests` and `pandas` (install with `pip install requests pandas`)
Optional but recommended:
- An NCBI API key (free – increases rate-limit to 10 req/s).
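
The rate limit matters because a client that calls the NCBI E-utilities too quickly gets throttled. The sketch below shows one way to pace requests, assuming the documented limits of 3 requests per second without a key and 10 with one. The names `min_interval`, `Throttle`, and `eutils_params` are hypothetical and are not part of the scripts in this repo; `email` and `api_key` are the standard E-utilities request parameters.

```python
import time


def min_interval(api_key=None):
    """Seconds to leave between requests: 10 req/s with a key, 3 req/s without."""
    return 1.0 / (10 if api_key else 3)


class Throttle:
    """Block in wait() until at least `interval` seconds have passed since the last call."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock   # injectable so the class can be tested without real delays
        self._sleep = sleep
        self._last = None     # time of the previous request, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def eutils_params(base, email=None, api_key=None):
    """Copy `base` and add the optional email / api_key request parameters."""
    params = dict(base)
    if email:
        params["email"] = email
    if api_key:
        params["api_key"] = api_key
    return params
```

Calling `Throttle(min_interval(api_key)).wait()` before each request keeps a loop under the applicable limit.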
```bash
# activate your virtualenv / conda env first

# 3.1 UniProt – download protein table for an organism
python uniprot.py <organism_taxid>
# Example: Pseudomonas putida (160488)
python uniprot.py 160488
# → outputs:
#   - protein_only_<taxid>.csv (full table)
#   - protein_names_only_<taxid>.txt (unique names list)

# 3.2 NCBI – gene metadata for an organism
python ncbi.py <taxonomy_id>
# Example: Human (9606)
python ncbi.py 9606
# → outputs gene_records_<taxid>.csv

# 3.3 NCBI – protein + gene mapping + sequences (advanced)
python ncbi_protein.py --taxid <taxonomy_id> [--email you@lab.org] [--api_key YOUR_NCBI_KEY] [--out output.tsv]
# Example: E. coli (562)
python ncbi_protein.py --taxid 562 --email you@lab.org --api_key ABC123 --out ecoli_proteins.tsv
```

Arguments for `ncbi_protein.py`:

- `--taxid`: NCBI taxonomy ID (required)
- `--email`: your contact email (optional but recommended, as a courtesy to NCBI)
- `--api_key`: NCBI API key to raise the rate limit (optional)
- `--out`: output TSV filename (default `proteins_by_taxon.tsv`)

Output TSV columns:

- `accession`: protein accession (with version)
- `protein_name`: description / defline
- `gene_id`: linked NCBI Gene ID
- `gene_symbol`: gene symbol (if available)
- `locus_tag`: locus tag (if available)
- `protein_sequence`: amino-acid sequence

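
As a quick check on a finished download, the sketch below reads a TSV with the columns listed above and counts how many proteins have no gene symbol. It assumes that missing values are written as empty cells, which may not match what `ncbi_protein.py` actually emits. The function name `summarize` and the sample rows are made up for illustration, and the standard-library `csv` module is used so the example runs without `pandas`.

```python
import csv
import io

# Column names as listed in this README.
COLUMNS = ["accession", "protein_name", "gene_id",
           "gene_symbol", "locus_tag", "protein_sequence"]


def summarize(handle):
    """Return (total_rows, rows_without_gene_symbol) for an open TSV handle."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing expected columns: {missing}")
    total = 0
    without_symbol = 0
    for row in reader:
        total += 1
        # Assumption: an unavailable gene symbol is stored as an empty cell.
        if not row["gene_symbol"]:
            without_symbol += 1
    return total, without_symbol


# Made-up rows for demonstration only; these are not real accessions.
SAMPLE = (
    "\t".join(COLUMNS) + "\n"
    "XX_000001.1\texample protein A\t1\tabcA\tLT_0001\tMKT\n"
    "XX_000002.1\texample protein B\t2\t\tLT_0002\tMSL\n"
)

if __name__ == "__main__":
    print(summarize(io.StringIO(SAMPLE)))
```

For a real file, open it with `open("proteins_by_taxon.tsv", newline="")` and pass the handle to `summarize`.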
- NCBI occasionally returns HTTP 500 for very large queries. `ncbi_protein.py` now uses ESearch history + paginated ESummary POST requests to avoid this.
- If you hit rate limits, narrow the `--taxid` scope (e.g., subspecies) or supply an API key.

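
The history-plus-pagination approach mentioned above can be sketched roughly as follows. ESearch is called once with `usehistory=y`, so the result set stays on NCBI's history server, and ESummary is then requested page by page via POST using the returned `WebEnv` and `query_key`. This is an illustration under stated assumptions, not the code in `ncbi_protein.py`: the function names, the `protein` database, the `txid<taxid>[Organism:exp]` search term, and the page size of 500 are all choices made for the example, and it uses `urllib` so that it runs without extra packages.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_params(taxid, db="protein"):
    """Parameters for one ESearch call that stores its hits on the history server."""
    return {
        "db": db,
        "term": f"txid{taxid}[Organism:exp]",  # assumed search term for the taxon
        "usehistory": "y",                     # keep the result set server-side
        "retmax": 0,                           # only the count and history keys are needed
        "retmode": "json",
    }


def esummary_pages(webenv, query_key, count, db="protein", page_size=500):
    """Yield one ESummary parameter dict per page of the stored result set."""
    for start in range(0, count, page_size):
        yield {
            "db": db,
            "WebEnv": webenv,
            "query_key": query_key,
            "retstart": start,
            "retmax": page_size,
            "retmode": "json",
        }


def post_json(endpoint, params):
    """POST form-encoded parameters to an E-utilities endpoint and decode the JSON reply."""
    request = Request(f"{EUTILS}/{endpoint}", data=urlencode(params).encode())
    with urlopen(request, timeout=60) as response:
        return json.load(response)


def fetch_summaries(taxid):
    """Yield the `result` object of each ESummary page (needs internet access)."""
    found = post_json("esearch.fcgi", esearch_params(taxid))["esearchresult"]
    pages = esummary_pages(found["webenv"], found["querykey"], int(found["count"]))
    for page in pages:
        yield post_json("esummary.fcgi", page)["result"]
```

Because the ID list never travels in the request URL, each ESummary call stays small regardless of how many records the taxon has. A real client would also pause between pages to respect the rate limit.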
A minimal smoke test (requires internet):

```bash
python -m unittest tests/test_ncbi_scripts.py
```

See `tests/test_ncbi_scripts.py` for simple assertions that the scripts run and write non-empty tables for small model organisms (E. coli 562).

Feel free to open issues or PRs for improvements.
