Support Parquet schema auto-extraction #65

@ohaibbq

Overview

Implement automatic schema extraction from Parquet file metadata, eliminating the need for explicit schema parameters.

Current Behavior

The emulator requires an explicit schema parameter when loading Parquet files.

Location: server/handler.go:1073-1096

case "PARQUET":
    reader := parquet.NewReader(bytes.NewReader(b))
    // Requires schema to be provided externally

Expected Behavior

BigQuery automatically extracts the schema from "self-describing formats" such as Parquet. Neither the --autodetect flag nor an explicit schema is needed; the schema is read directly from the Parquet file metadata.
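
For context, the "file metadata" lives in the Parquet footer: a file starts and ends with the 4-byte magic PAR1, and the 4 bytes before the trailing magic hold the little-endian length of the serialized FileMetaData, which carries the schema. A minimal sketch of locating that footer, assuming an unencrypted file held fully in memory (the function name footerMetadata is hypothetical and not part of the emulator or of any Parquet library; in practice the Parquet reader already in use would decode the footer):

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
)

// parquetMagic marks both the start and the end of a plain Parquet file.
var parquetMagic = []byte("PAR1")

// footerMetadata (hypothetical helper) returns the raw, still Thrift-encoded
// FileMetaData bytes from the footer of an in-memory Parquet file.
// It only locates the footer; it does not decode the schema.
func footerMetadata(b []byte) ([]byte, error) {
	// Smallest possible layout: leading magic + footer length + trailing magic.
	if len(b) < 12 {
		return nil, errors.New("too short to be a Parquet file")
	}
	if !bytes.HasPrefix(b, parquetMagic) || !bytes.HasSuffix(b, parquetMagic) {
		return nil, errors.New("missing PAR1 magic bytes")
	}
	// The footer ends where the 4-byte length field begins.
	end := len(b) - 8
	n := binary.LittleEndian.Uint32(b[end : end+4])
	// The footer must fit between the leading magic and the length field.
	if uint64(n) > uint64(end-4) {
		return nil, errors.New("footer length exceeds file size")
	}
	return b[end-int(n) : end], nil
}
```

Because the schema sits in this footer, it can be recovered without any caller-supplied schema and without scanning the data pages.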

Implementation Requirements

  1. Extract schema from Parquet file metadata when no explicit schema is provided
  2. Map Parquet types to BigQuery types:
    • Primitive types (INT32, INT64, FLOAT, DOUBLE, BOOLEAN, BYTE_ARRAY)
    • Complex types (nested structs → RECORD, arrays → REPEATED fields)
  3. Handle nested structures appropriately
  4. Preserve field names, nullability, and repetition information
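
The requirements above could be sketched roughly as follows. This is an illustration under stated assumptions, not the emulator's implementation: ParquetField and BQField are hypothetical, library-independent types (a real implementation would walk the schema exposed by the Parquet reader), and the type names follow my reading of BigQuery's documented Parquet conversions (INT32/INT64 → INTEGER, FLOAT/DOUBLE → FLOAT, BYTE_ARRAY → BYTES, or STRING when annotated as a string). Other logical annotations (DECIMAL, DATE, TIMESTAMP, LIST, and so on) are deliberately left out.

```go
package main

import "fmt"

// ParquetField is a hypothetical, simplified view of one node in a Parquet schema.
type ParquetField struct {
	Name       string
	Physical   string // "BOOLEAN", "INT32", "INT64", "FLOAT", "DOUBLE", "BYTE_ARRAY"; empty for groups
	Logical    string // optional annotation; only "STRING" is handled in this sketch
	Repetition string // "REQUIRED", "OPTIONAL" or "REPEATED"
	Children   []ParquetField
}

// BQField is a hypothetical type mirroring the shape of a BigQuery field schema.
type BQField struct {
	Name   string
	Type   string
	Mode   string
	Fields []BQField
}

// bqMode maps Parquet repetition to a BigQuery mode, preserving nullability.
func bqMode(repetition string) (string, error) {
	switch repetition {
	case "REQUIRED":
		return "REQUIRED", nil
	case "OPTIONAL":
		return "NULLABLE", nil
	case "REPEATED":
		return "REPEATED", nil
	}
	return "", fmt.Errorf("unknown repetition %q", repetition)
}

// bqScalarType maps a primitive Parquet type to a BigQuery type name.
// Annotations other than STRING are ignored here (assumption, see lead-in).
func bqScalarType(physical, logical string) (string, error) {
	switch physical {
	case "BOOLEAN":
		return "BOOLEAN", nil
	case "INT32", "INT64":
		return "INTEGER", nil
	case "FLOAT", "DOUBLE":
		return "FLOAT", nil
	case "BYTE_ARRAY":
		if logical == "STRING" {
			return "STRING", nil
		}
		return "BYTES", nil
	}
	return "", fmt.Errorf("unsupported physical type %q", physical)
}

// toBQField converts one Parquet field, recursing into groups as RECORDs.
func toBQField(f ParquetField) (BQField, error) {
	mode, err := bqMode(f.Repetition)
	if err != nil {
		return BQField{}, fmt.Errorf("field %q: %w", f.Name, err)
	}
	out := BQField{Name: f.Name, Mode: mode}
	if len(f.Children) == 0 {
		out.Type, err = bqScalarType(f.Physical, f.Logical)
		if err != nil {
			return BQField{}, fmt.Errorf("field %q: %w", f.Name, err)
		}
		return out, nil
	}
	out.Type = "RECORD"
	for _, c := range f.Children {
		child, err := toBQField(c)
		if err != nil {
			return BQField{}, fmt.Errorf("field %q: %w", f.Name, err)
		}
		out.Fields = append(out.Fields, child)
	}
	return out, nil
}
```

Returning an error for unknown types, rather than silently defaulting to STRING or BYTES, keeps mismatches with real BigQuery behavior visible when comparing generated schemas.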

Test Cases

  • Load Parquet file without schema parameter → should auto-detect from file metadata
  • Parquet with complex types (nested structs, arrays) → should create appropriate RECORD fields
  • Parquet with all primitive types → should map correctly to BigQuery types
  • Verify schema matches what BigQuery would generate

Documentation Reference
