Skip to content

SQLAlchemy destination + duckdb_engine fails with "Schema with name "{name}" already exists!" #3907

@rkennedy-argus

Description

@rkennedy-argus

dlt version

1.26.0

Describe the problem

When using the SQLAlchemy destination to write to DuckDB (via duckdb_engine), pipeline runs fail with the error:

<class 'dlt.destinations.exceptions.DatabaseTransientException'>
(_duckdb.CatalogException) Catalog Error: Schema with name "myschema" already exists!
[SQL: CREATE SCHEMA myschema]

This appears to be due to SqlalchemyClient.has_dataset not realizing that duckdb_engine returns the list of schemas namespaced by the corresponding database holding the schema (duckdb_engine.Dialect.get_schema_names). i.e. if my pipeline name is mydata and my dataset name is myschema, duckdb_engine returns mydata.myschema in the list of returned schemas. dlt, however, is looking only for myschema. As a result, dlt believes the schema is missing and it attempts to create a schema that already exists, resulting in the above error.

Expected behavior

dlt should detect that the schema already exists and not attempt to create it again, allowing the pipeline to complete successfully.

Steps to reproduce

The following Python program reproduces the problem:

import os
import shutil

import dlt
from sqlalchemy import create_engine

# ENGINE = "sqlite"
ENGINE = "duckdb"
PIPELINE = "mydata"
DATASET = "myschema"

if __name__ == '__main__':
    dbpath = f"{PIPELINE}.db"
    if os.path.exists(dbpath):
        os.remove(dbpath)
    if os.path.isdir(PIPELINE):
        shutil.rmtree(PIPELINE)

    engine = create_engine(f"{ENGINE}:///{dbpath}")

    pipeline = dlt.pipeline(
        destination=dlt.destinations.sqlalchemy(engine),
        pipeline_name=PIPELINE,
        dataset_name=DATASET,
        pipelines_dir=".",
    )

    info = pipeline.run([
        {"id": 1},
        {"id": 2},
        {"id": 3},
    ], table_name="numbers")

    print(info)

Here's the contents of my pyproject.toml:

[project]
name = "minimal_dlt_schema_problem_repro"
version = "0.1.0"
description = "Add your description here"
requires-python = ">=3.14"
dependencies = [
    "dlt[duckdb,sqlalchemy]>=1.26.0",
    "duckdb-engine>=0.17.0",
]

Operating system

macOS

Runtime environment

Local

Python version

3.13

dlt data source

It doesn't matter. I can reproduce this with a custom data source as well as with a static list of Python maps.

dlt destination

No response

Other deployment details

No response

Additional information

While we could sidestep this problem by using the DuckDB destination, we're using SQLAlchemy because the DuckDB destination does not support encrypted DuckDB files. In order to support encrypted DuckDB files, our best option is to use SQLAlchemy, which will allow us to execute the necessary SQL for all new connections in order to ATTACH an encrypted database (this functionality isn't available when utilizing the DuckDB destination).

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingdestinationIssue with a specific destination

    Type

    No type

    Projects

    Status

    Todo

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions