owid-catalog

A Pythonic library for working with OWID data.

The owid-catalog library is the foundation of Our World in Data's data management system. It provides:

  1. Data APIs: Access OWID's published data through unified client interfaces
  2. Data Structures: Enhanced pandas DataFrames with rich metadata support

Installation

pip install owid-catalog

Quick Examples

Accessing OWID Data

from owid.catalog import fetch, search

# Search for charts (default)
charts = search("population")
tb = charts[0].fetch()

# Fetch data from OWID Chart at ourworldindata.org/grapher/life-expectancy
tb = fetch("life-expectancy")

# Search for tables
tables = search("population", kind="table", namespace="un")
tb = tables[0].fetch()

# Search indicators (using semantic search)
search("renewable energy", kind="indicator")
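
The example above passes the slug `life-expectancy`, which is the last path segment of the chart URL mentioned in the comment. As a minimal sketch of that relationship, the standard-library snippet below extracts a slug from a grapher URL. `slug_from_url` is a hypothetical helper written for illustration; it is not the library's own `parse_chart_slug()` and may not handle every URL form the library accepts.

```python
from urllib.parse import urlparse


def slug_from_url(url: str) -> str:
    """Hypothetical helper: return the path segment that follows /grapher/."""
    # Tolerate URLs written without a scheme, as in the comment above.
    if "://" not in url:
        url = "https://" + url
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 2 or parts[-2] != "grapher":
        raise ValueError(f"not a grapher chart URL: {url!r}")
    return parts[-1]


# Query parameters are ignored because only the URL path is inspected.
slug = slug_from_url("ourworldindata.org/grapher/life-expectancy?tab=chart")
print(slug)
```

The resulting string is the kind of value the `fetch("life-expectancy")` call above expects.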

Working with Data Structures

from owid.catalog import Table
from owid.catalog import processing as pr

# Tables are pandas DataFrames with metadata
tb = Table(df, metadata={"short_name": "population"})

# Metadata propagates through operations
tb_filtered = tb[tb["year"] > 2000]  # Keeps metadata
tb_merged = pr.merge(tb1, tb2, on="country")  # Merges metadata
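
To make "metadata propagates through operations" concrete without requiring the library, here is a toy sketch in plain Python. `MiniTable` and its `filter` method are hypothetical stand-ins invented for this example; they only illustrate the idea that an operation returns a new object carrying the same metadata, and say nothing about how `Table` is implemented.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class MiniTable:
    """Toy stand-in for a table: rows plus a metadata dict (not owid.catalog.Table)."""

    rows: list[dict[str, Any]]
    metadata: dict[str, Any] = field(default_factory=dict)

    def filter(self, keep: Callable[[dict[str, Any]], bool]) -> "MiniTable":
        # Build a new table from the kept rows and carry a copy of the metadata along.
        return MiniTable([r for r in self.rows if keep(r)], dict(self.metadata))


tb = MiniTable(
    rows=[{"country": "Chile", "year": 1990}, {"country": "Chile", "year": 2010}],
    metadata={"short_name": "population"},
)
tb_filtered = tb.filter(lambda row: row["year"] > 2000)
print(tb_filtered.metadata["short_name"], len(tb_filtered.rows))
```

Copying the metadata dict, rather than sharing it, keeps later edits to the filtered table from leaking back into the original.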

Documentation

For detailed documentation, see:

Architecture

graph TB
etl -->|reads| snapshot[upstream datasets]
etl -->|generates| s3[data catalog]
catalog[owid-catalog] -->|queries| s3

This library is part of OWID's ETL project, which contains recipes for all datasets we publish.

Development

You need Python 3.10+, uv, and make installed. Clone the repo, then run:

# run all unit tests and CI checks
make test

# watch for changes, then run all checks
make watch

Changelog

v1.0.1

  • ResponseSet ergonomics
    • Remove deprecated ResponseSet.results property (use .items instead)
    • Add .to_dict() method for serializing results to plain dicts (useful for AI/LLM context windows)
    • Add all_fields parameter to .to_frame() to temporarily override display mode without mutating instance state

v1.0.0

  • New unified Client API
    • owid.catalog.Client as single entry point with ChartsAPI, IndicatorsAPI, TablesAPI
    • Quick access via search() and fetch() convenience functions
    • Rich result types: ChartResult, IndicatorResult, TableResult with ResponseSet container
  • Charts API
    • Fetch chart data by slug, URL, or slug with query params
    • Parse chart slugs from grapher/explorer URLs via parse_chart_slug()
    • Explorer best-effort fetching with graceful error handling
    • set_ui_advanced() / set_ui_basic() for display configuration
  • Tables API
    • Search catalog by table, namespace, version, dataset, and channel
    • Fetch tables directly by catalog path
    • Embedded catalog index with local caching
  • Indicators API
    • Semantic search via search.owid.io vector embeddings
    • Sort by relevance (similarity + popularity blend) or similarity only
    • fetch() for single-column indicator or fetch_table() for the full table
  • Search & discovery
    • Fuzzy, exact, contains, and regex matching modes
    • .latest() filtering to keep only newest versions
    • Popularity scores (0.0-1.0) from analytics views, results sorted by popularity
    • refresh_index parameter to force catalog index reload
  • Data structures integration
    • All fetch() methods return owid.catalog.Table with full metadata
    • CatalogPath helper for parsing catalog paths
    • Lazy loading with load_data=False for deferred data access
  • Library reorganization
    • Restructured into owid.catalog.core (data structures) and owid.catalog.api (remote access)
    • catalog.find() deprecated in favor of Client().tables.search() (backwards compat maintained)
    • Legacy code moved to owid.catalog.api.legacy
    • New dependencies: pydantic v2.0+
  • Private data support
    • Private datasets served from separate R2 bucket
    • API can fetch private data from private bucket
  • Performance
    • Vectorized operations replacing iterrows() in TablesAPI
    • Embedded catalog index loading (removed ETLCatalog dependency)
    • Modularized search into helper methods
  • Other
    • Thumbnail display in ResponseSet for chart results
    • JSON output format support
    • Comprehensive exception handling: ChartNotFoundError, LicenseError
    • API URLs immutable with Pydantic Field(frozen=True)
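
The v1.0.0 notes mention a helper for parsing catalog paths and `.latest()` filtering that keeps only the newest versions. The sketch below shows one way these two ideas could fit together. It assumes a five-segment path layout of `channel/namespace/version/dataset/table` and ISO-style date versions; both are assumptions made for this example, and `PathParts`, `parse_path`, and `latest` are hypothetical names, not the library's `CatalogPath` or `.latest()`.

```python
from typing import NamedTuple


class PathParts(NamedTuple):
    channel: str
    namespace: str
    version: str
    dataset: str
    table: str


def parse_path(path: str) -> PathParts:
    """Split an assumed 'channel/namespace/version/dataset/table' path."""
    parts = path.strip("/").split("/")
    if len(parts) != 5:
        raise ValueError(f"expected 5 segments, got {len(parts)}: {path!r}")
    return PathParts(*parts)


def latest(paths: list[str]) -> list[str]:
    """Keep the highest version of each (channel, namespace, dataset, table)."""
    newest: dict[tuple[str, str, str, str], PathParts] = {}
    for parts in map(parse_path, paths):
        key = (parts.channel, parts.namespace, parts.dataset, parts.table)
        # ISO-style dates such as 2024-01-01 order correctly as plain strings.
        if key not in newest or parts.version > newest[key].version:
            newest[key] = parts
    return ["/".join(p) for p in newest.values()]


# Made-up example paths, for illustration only.
paths = [
    "garden/demo/2022-01-01/example/population",
    "garden/demo/2024-01-01/example/population",
    "garden/demo/2023-01-01/other/other",
]
print(latest(paths))
```

String comparison only works here because of the assumed date format; versions in another format would need a proper parser.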

Previous versions

v0.4.5

  • Allow both table and dataset parameters in find() (they can now be used together)
  • Migrate from pyright to ty type checker for improved type checking

v0.4.4

  • Enhanced find() with better search capabilities:
    • Case-insensitive search by default (use case=True for case-sensitive)
    • Regex support enabled by default for table and dataset parameters
    • New fuzzy search with fuzzy=True - typo-tolerant matching sorted by relevance
    • Configurable fuzzy threshold (0-100) to control match strictness
  • New dependency: rapidfuzz for fuzzy string matching
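
For readers unfamiliar with threshold-based fuzzy matching, the sketch below illustrates the general idea using the standard library's `difflib` as a stand-in. It does not use rapidfuzz and is not the scoring used by `find()`; `fuzzy_find` and the default threshold of 70 are assumptions chosen for this example.

```python
from difflib import SequenceMatcher


def fuzzy_find(query: str, names: list[str], threshold: float = 70.0) -> list[tuple[str, float]]:
    """Return (name, score) pairs scoring at least `threshold`, best first."""
    scored = []
    for name in names:
        # ratio() is in [0, 1]; scale to 0-100 to mirror the threshold range above.
        score = 100 * SequenceMatcher(None, query.lower(), name.lower()).ratio()
        if score >= threshold:
            scored.append((name, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


names = ["population", "population_density", "life_expectancy"]
# The query contains a typo on purpose.
for name, score in fuzzy_find("populaton", names):
    print(f"{name}: {score:.0f}")
```

Raising the threshold makes matching stricter; lowering it admits looser matches at the cost of more noise.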

v0.4.3

  • Fixed minor bugs

v0.4.0

  • Highlights
    • Support for Python 3.10-3.13 (was 3.11-3.13)
    • Drop support for Python 3.9 (breaking change)
  • Others
    • Deprecate Walden.
    • Dependencies: Change rdata for pyreadr.
    • Support: indicator dimensions.
    • Support: MDIMs.
    • Switched from Poetry to UV package manager.
    • New decorator @keep_metadata to propagate metadata in pandas functions.
  • Fixes: Table.apply, groupby.apply, metadata propagation, type hinting, etc.

v0.3.11

  • Add support for Python 3.12 in pyproject.toml

v0.3.10

  • Add experimental chart data API in owid.catalog.charts

v0.3.9

  • Switch from isort, black, and flake8 to ruff

v0.3.8

  • Pin dataclasses-json==0.5.8 to fix error with python3.9

v0.3.7

  • Fix bugs.
  • Improve metadata propagation.
  • Improve metadata YAML file handling, to have common definitions.
  • Remove DatasetMeta.origins.

v0.3.6

  • Fix many bugs
  • Add processing.py module with pandas-like functions that propagate metadata
  • Add support for dynamic YAML files
  • Add support for R2 alongside S3

v0.3.5

  • Remove catalog.frames; use owid-repack package instead
  • Relax dependency constraints
  • Add optional channel argument to DatasetMeta
  • Stop supporting metadata in Parquet format, load JSON sidecar instead
  • Fix errors when creating new Table columns

v0.3.4

  • Bump pyarrow dependency to enable Python 3.11 support

v0.3.3

  • Add more arguments to Table.__init__ that are often used in ETL
  • Add Dataset.update_metadata function for updating metadata from YAML file
  • Python 3.11 support via update of pyarrow dependency

v0.3.2

  • Fix a bug in Catalog.__getitem__()
  • Replace mypy type checker by pyright

v0.3.1

  • Sort imports with isort
  • Change black line length to 120
  • Add grapher channel
  • Support path-based indexing into catalogs

v0.3.0

  • Update OWID_CATALOG_VERSION to 3
  • Support multiple formats per table
  • Support reading and writing parquet files with embedded metadata
  • Optional repack argument when adding tables to dataset
  • Underscore |
  • Get version field from DatasetMeta init
  • Resolve collisions of underscore_table function
  • Convert version to str and load json dimensions

v0.2.9

  • Allow multiple channels in catalog.find function

v0.2.8

  • Update OWID_CATALOG_VERSION to 2

v0.2.7

  • Split datasets into channels (garden, meadow, open_numbers, ...) and make garden default one
  • Add .find_latest method to Catalog

v0.2.6

  • Add flag is_public for public/private datasets
  • Enforce snake_case for table, dataset and variable short names
  • Add fields published_by and published_at to Source
  • Add a list of supported and unsupported operations on columns
  • Update pyarrow
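
As an illustration of what enforcing snake_case on short names can look like, here is a minimal sketch. The exact rule the library applies is not stated here, so the pattern below (lowercase letters, digits, and single underscores, starting with a letter) is an assumption, and `is_snake_case` and `to_snake_case` are hypothetical helpers.

```python
import re

# Assumed rule: lowercase letters, digits and underscores, starting with a letter.
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")


def is_snake_case(name: str) -> bool:
    """Return True when the whole name matches the assumed snake_case pattern."""
    return SNAKE_CASE.fullmatch(name) is not None


def to_snake_case(name: str) -> str:
    """Hypothetical normaliser: lowercase, then collapse other characters to '_'."""
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")


print(is_snake_case("life_expectancy"), is_snake_case("Life Expectancy"))
print(to_snake_case("Life Expectancy (years)"))
```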

v0.2.5

  • Fix ability to load remote CSV tables

v0.2.4

  • Update the default catalog URL to use a CDN

v0.2.3

  • Fix methods for finding and loading data from a LocalCatalog

v0.2.2

  • Repack frames to compact dtypes on Table.to_feather()

v0.2.1

  • Fix key typo used in version check

v0.2.0

  • Copy dataset metadata into tables, to make tables more traceable
  • Add API versioning, and a requirement to update if your version of this library is too old

v0.1.1

  • Add support for Python 3.8

v0.1.0

  • Initial release, including searching and fetching data from a remote catalog