duckdb_loader.py silently corrupts the last column on CRLF input files #541

@turbomam

Context

Surfaced in Copilot's 2026-04-16 review of #531.

kg_microbe/query_utils/duckdb_loader.py sets lineterminator="\n" on the pandas reader and strips \r from the header line — but not from the rest of the file. On a file with CRLF (\r\n) line endings this leaves a trailing \r in the last field of every data row, because \r is no longer treated as part of the newline.

Effects:

  • string values are silently altered ("foo" becomes "foo\r")
  • equality filtering and indexing break in ways that are hard to spot — most viewers render the \r invisibly
  • downstream joins on those columns silently drop rows
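
The effects above can be reproduced with a small in-memory example. This is a sketch, not the loader's actual code: the `read_csv` call (tab separator, `lineterminator="\n"`), the sample data, and the `lookup` table are assumptions for illustration. The `rename` step stands in for the loader stripping `\r` from the header only.

```python
import io

import pandas as pd

# Two data rows with CRLF line endings, as a Windows tool would write them.
crlf_text = "id\tname\r\n1\tfoo\r\n2\tbar\r\n"

# Assumed shape of the loader's call: a custom one-character terminator
# means "\r" is no longer treated as part of the newline.
broken = pd.read_csv(io.StringIO(crlf_text), sep="\t", lineterminator="\n")

# The last field of every row (and of the header) keeps its trailing "\r".
last_col = broken.columns[-1]
print(repr(last_col))
print(broken[last_col].tolist())

# Equality filtering on the clean value finds nothing.
matches = broken[broken[last_col] == "foo"]
print(len(matches))

# Mimic the header-only cleanup, then join against a clean lookup table
# (hypothetical data): the inner join drops every row.
lookup = pd.DataFrame({"name": ["foo", "bar"], "label": ["F", "B"]})
joined = broken.rename(columns={last_col: "name"}).merge(lookup, on="name")
print(len(joined))
```

Printing with `repr` matters here, because a plain `print` of the values will usually hide the `\r`.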

Suggested fix

Either:

  • drop the custom lineterminator and let pandas handle CRLF normally, or
  • normalize the file contents on load (strip \r from every line, not only the header).

The first option is the simpler fix unless there's a concrete reason the custom terminator was introduced.
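
Both options could look roughly like the following. This is a minimal sketch under assumptions: the tab separator and sample data are invented, and `read_normalized` is a hypothetical helper, not a function in `duckdb_loader.py`. It replaces `\r\n` with `\n` across the whole text, which is a variant of stripping `\r` line by line.

```python
import io

import pandas as pd

crlf_text = "id\tname\r\n1\tfoo\r\n2\tbar\r\n"

# Option 1: omit lineterminator so pandas treats "\r\n" as a single newline.
default_df = pd.read_csv(io.StringIO(crlf_text), sep="\t")


def read_normalized(text: str) -> pd.DataFrame:
    """Hypothetical helper for option 2: normalize line endings, then parse."""
    unix_text = text.replace("\r\n", "\n")
    # The custom terminator is now harmless because no "\r\n" pairs remain.
    return pd.read_csv(io.StringIO(unix_text), sep="\t", lineterminator="\n")


normalized_df = read_normalized(crlf_text)

print(default_df["name"].tolist())
print(normalized_df["name"].tolist())
```

One caveat with the second approach: a blanket replace also rewrites any `\r\n` inside quoted multi-line fields, so it is only safe if the input never relies on embedded CRLF sequences.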

File involved

  • kg_microbe/query_utils/duckdb_loader.py

References

  • PR #531
  • Copilot review at commit 1de973d, 2026-04-16T23:15Z
