Error in writing panda data frame with pyarrow engine to DeltaTable #3315

@topagarwal

Environment

Delta-rs version:
0.17.0

Binding: Python

Environment:

  • OS: Linux, under python3 shell.

Bug

Details with steps to reproduce
I'm using deltalake (delta-rs) v0.17.0.
Here are the steps performed:

  1. Read the DeltaTable from an existing S3 location. dt = DeltaTable("s3://mylocation/")
  2. Converted it to a pyarrow table. arrow_table = dt.to_pyarrow_table()
  3. Filtered the arrow table and selected specific columns of interest.
  4. Converted the arrow table to a pandas DataFrame. df = arrow_table.to_pandas()
  5. Wrote the pandas DataFrame back to an existing, newly created Delta table. The table is empty at this point.
  6. write_deltalake("s3://test_sample_process/", df, mode="overwrite"). I also tried it with schema_mode="overwrite".

To troubleshoot it, I looked into deltalake/writer.py. The exception is raised from line 351. The code sorts the data schema and the table schema and then compares them. It is using the pyarrow engine.

On visual inspection, the two schemas appear to match.

Actual Error message printed

python/3.12.7/lib/python3.12/site-packages/deltalake/writer.py", line 351, in write_deltalake
    raise ValueError(
ValueError: Schema of data does not match table schema

Data schema:
namespace: string
ki_record_name: string
work_center: string
kt_config: string
kt_parameters: string
mi_updated_at: timestamp[us, tz=UTC]
mi_updated_by: string

Table Schema:

namespace: string
ki_record_name: string
work_center: string not null
kt_config: string
kt_parameters: string
mi_updated_at: timestamp[us, tz=UTC] not null
  -- field metadata --
  comment: '"The time this record was updated"'
mi_updated_by: string not null
  -- field metadata --
  comment: '"The process that updated this record"'

I would appreciate any help in figuring out why my table and data schemas are considered a mismatch by the code when they seem to be the same.

I couldn't isolate any difference between the two schemas other than that the table schema has comments and not null constraints defined. The field names and data types are the same for both. I'm wondering what I am missing here.
