UniProt / NCBI Retrieval Toolkit

This mini-toolkit downloads protein or gene information for a given organism (identified by its NCBI taxonomy ID) from two major biological databases: UniProt and NCBI.

1. Required Files

Script	Purpose
`uniprot.py`	Fetches protein-centric annotations from the UniProt REST API (CSV table + name list).
`ncbi.py`	Fetches gene-centric metadata from NCBI (ESearch / ESummary) into a CSV file.
`ncbi_protein.py`	Fetches protein-centric data from NCBI (ESearch / ESummary / EFetch) and writes a TSV with sequences and gene mappings.

Make sure these three Python files are in the same directory (this repo’s root). No other project files are required.

2. Prerequisites

Python 3.8+
requests and pandas (install with pip install requests pandas)

Optional but recommended:

An NCBI API key (free – raises the E-utilities rate limit from 3 to 10 requests per second).

3. Usage

# activate your virtualenv / conda env first

# 3.1 UniProt – download protein table for an organism
python uniprot.py <organism_taxid>
# Example: Pseudomonas putida (160488)
python uniprot.py 160488
# → outputs:
#   - protein_only_<taxid>.csv         (full table)
#   - protein_names_only_<taxid>.txt   (unique names list)

# 3.2 NCBI – gene metadata for an organism
python ncbi.py <taxonomy_id>
# Example: Human (9606)
python ncbi.py 9606
# → outputs gene_records_<taxid>.csv

# 3.3 NCBI – protein + gene mapping + sequences (advanced)
python ncbi_protein.py --taxid <taxonomy_id> [--email you@lab.org] [--api_key YOUR_NCBI_KEY] [--out output.tsv]
# Example: E. coli (562)
python ncbi_protein.py --taxid 562 --email you@lab.org --api_key ABC123 --out ecoli_proteins.tsv
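
To make step 3.1 more concrete, here is a minimal sketch of how a per-organism query against the UniProt REST API can be built. It is not taken from `uniprot.py`: the function name `build_uniprot_url`, the chosen `fields`, and the page size are assumptions for illustration, and the script itself may request different columns or use a different endpoint.

```python
from urllib.parse import urlencode

# Public UniProtKB search endpoint.
UNIPROT_SEARCH_URL = "https://rest.uniprot.org/uniprotkb/search"


def build_uniprot_url(taxid, fields=("accession", "protein_name", "gene_names"),
                      page_size=500):
    """Return a UniProtKB search URL for all entries of one organism, as TSV.

    Hypothetical helper: the field list and page size are illustrative only.
    """
    params = {
        "query": f"organism_id:{taxid}",  # restrict to one taxonomy ID
        "format": "tsv",                  # tab-separated output
        "fields": ",".join(fields),       # columns to return
        "size": page_size,                # entries per page
    }
    return f"{UNIPROT_SEARCH_URL}?{urlencode(params)}"


# Example: Pseudomonas putida (160488)
print(build_uniprot_url(160488))
```

The search endpoint returns one page at a time; a full download would need to follow the `Link` response header to the next page until none is left.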

Flags Explained (ncbi_protein.py)

--taxid NCBI taxonomy ID (required)
--email Your contact email (optional but recommended – NCBI courtesy)
--api_key NCBI API key to raise rate-limit (optional)
--out Output TSV filename (default proteins_by_taxon.tsv)
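
The flags above map naturally onto a standard `argparse` parser. The sketch below is an assumption about how such a parser could look, not a copy of the one in `ncbi_protein.py`; in particular, parsing `--taxid` as an integer and the name `build_parser` are choices made here for illustration.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (illustrative sketch)."""
    parser = argparse.ArgumentParser(
        description="Fetch NCBI protein records for a taxonomy ID."
    )
    parser.add_argument("--taxid", type=int, required=True,
                        help="NCBI taxonomy ID (required)")
    parser.add_argument("--email", default=None,
                        help="Contact email sent to NCBI (optional)")
    parser.add_argument("--api_key", default=None,
                        help="NCBI API key to raise the rate limit (optional)")
    parser.add_argument("--out", default="proteins_by_taxon.tsv",
                        help="Output TSV filename")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--taxid", "562"])
print(args.taxid, args.out)  # → 562 proteins_by_taxon.tsv
```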

Output Columns (ncbi_protein.py)

accession – protein accession (with version)
protein_name – description / defline
gene_id – linked NCBI Gene ID
gene_symbol – gene symbol (if available)
locus_tag – locus tag (if available)
protein_sequence – amino-acid sequence
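
Once the TSV is written, a quick sanity check is to load it and confirm the columns listed above are present. The sketch below assumes the file starts with a header row using exactly these column names, which the source does not state explicitly; `read_protein_table` is a hypothetical helper, and the two sample rows are made-up placeholders, not real records.

```python
import csv
import io

EXPECTED_COLUMNS = [
    "accession", "protein_name", "gene_id",
    "gene_symbol", "locus_tag", "protein_sequence",
]


def read_protein_table(handle):
    """Read a tab-separated protein table and verify its header row."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


# Placeholder data standing in for an output file (not real accessions).
sample = io.StringIO(
    "accession\tprotein_name\tgene_id\tgene_symbol\tlocus_tag\tprotein_sequence\n"
    "XX_000001.1\texample protein A\t12345\tabcA\tLT_0001\tMKTAYIAK\n"
    "XX_000002.1\thypothetical protein\t\t\t\tMSLQ\n"
)
rows = read_protein_table(sample)

# Proteins without a linked Gene ID have an empty gene_id field.
unmapped = [r["accession"] for r in rows if not r["gene_id"]]
print(len(rows), unmapped)  # → 2 ['XX_000002.1']
```

For a real run, replace `sample` with `open("proteins_by_taxon.tsv", newline="")`.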

4. Notes / Troubleshooting

NCBI occasionally returns HTTP 500 for very large queries. To avoid this, ncbi_protein.py uses the ESearch history server plus paginated ESummary POST requests.
If you hit rate limits, supply an API key or pass a narrower --taxid (e.g., a subspecies or strain instead of a whole species).
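
The history-plus-pagination approach mentioned above can be sketched as follows. This is an illustration under stated assumptions rather than the code in `ncbi_protein.py`: it uses the standard library's `urllib` instead of `requests`, the helper names (`page_offsets`, `with_identity`, `post_json`, `fetch_protein_summaries`) are hypothetical, and the page size and sleep intervals are simply chosen to stay within the 3 and 10 requests-per-second limits.

```python
import json
import time
from urllib.parse import urlencode
from urllib.request import urlopen

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def page_offsets(total, page_size):
    """Return the retstart offsets needed to cover `total` records."""
    return range(0, total, page_size)


def with_identity(params, email=None, api_key=None):
    """Copy `params`, adding email / api_key only when they are provided."""
    out = dict(params)
    if email:
        out["email"] = email
    if api_key:
        out["api_key"] = api_key
    return out


def post_json(endpoint, params):
    """POST form-encoded parameters to an E-utility and decode the JSON reply."""
    data = urlencode(params).encode("utf-8")
    with urlopen(f"{EUTILS}/{endpoint}", data=data, timeout=60) as resp:
        return json.load(resp)


def fetch_protein_summaries(taxid, email=None, api_key=None, page_size=200):
    """Yield ESummary records for every protein under a taxonomy ID."""
    # Step 1: ESearch with usehistory=y stores the result set on the server.
    search = post_json("esearch.fcgi", with_identity({
        "db": "protein",
        "term": f"txid{taxid}[Organism:exp]",
        "usehistory": "y",
        "retmax": 0,
        "retmode": "json",
    }, email, api_key))["esearchresult"]
    total = int(search["count"])
    delay = 0.1 if api_key else 0.34  # stay under 10 or 3 requests per second

    # Step 2: page through the stored result set with ESummary.
    for start in page_offsets(total, page_size):
        page = post_json("esummary.fcgi", with_identity({
            "db": "protein",
            "query_key": search["querykey"],
            "WebEnv": search["webenv"],
            "retstart": start,
            "retmax": page_size,
            "retmode": "json",
        }, email, api_key))
        result = page["result"]
        for uid in result["uids"]:
            yield result[uid]
        time.sleep(delay)
```

Because the ID list stays on the history server, each request carries only a `WebEnv` / `query_key` pair and an offset, so request size does not grow with the number of records. Calling `fetch_protein_summaries(562, api_key="...")` requires network access.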

5. Testing

A minimal smoke-test (requires internet):

python -m unittest tests/test_ncbi_scripts.py

See tests/test_ncbi_scripts.py for simple assertions that the scripts run and write non-empty tables for small model organisms (E. coli 562).

Feel free to open issues or PRs for improvements.
