dlt version
1.26.0
Describe the problem
When using the SQLAlchemy destination to write to DuckDB (via duckdb_engine), pipeline runs fail with the error:
<class 'dlt.destinations.exceptions.DatabaseTransientException'>
(_duckdb.CatalogException) Catalog Error: Schema with name "myschema" already exists!
[SQL: CREATE SCHEMA myschema]
This appears to be because SqlalchemyClient.has_dataset does not account for duckdb_engine returning schema names namespaced by the database that holds them (duckdb_engine.Dialect.get_schema_names). For example, if my pipeline name is mydata and my dataset name is myschema, duckdb_engine returns mydata.myschema in the list of schemas, while dlt looks only for myschema. As a result, dlt believes the schema is missing and attempts to create a schema that already exists, producing the error above.
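To illustrate the mismatch outside of dlt, here is a minimal sketch (assuming the namespaced behavior of duckdb_engine described above; the file and schema names are just the ones from this report, and duckdb-engine must be installed so the "duckdb" dialect is registered):

import sqlalchemy as sa

engine = sa.create_engine("duckdb:///mydata.db")
with engine.connect() as conn:
    conn.execute(sa.text("CREATE SCHEMA IF NOT EXISTS myschema"))
    conn.commit()

# duckdb_engine reports schemas as "<database>.<schema>", e.g. "mydata.myschema",
# whereas dlt's SqlalchemyClient.has_dataset looks for the bare "myschema".
print(sa.inspect(engine).get_schema_names())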
Expected behavior
dlt should detect that the schema already exists and not attempt to create it again, allowing the pipeline to complete successfully.
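For what it's worth, here is a hedged sketch of the comparison I would expect has_dataset to make (this is illustrative logic only, not dlt's actual implementation):

def schema_exists(schema_names: list[str], dataset: str) -> bool:
    # Treat a database-qualified name such as "mydata.myschema" as a match
    # for "myschema", in addition to an exact match.
    return any(name == dataset or name.split(".")[-1] == dataset for name in schema_names)

assert schema_exists(["mydata.main", "mydata.myschema"], "myschema")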
Steps to reproduce
The following Python program reproduces the problem:
import os
import shutil

import dlt
from sqlalchemy import create_engine

# ENGINE = "sqlite"
ENGINE = "duckdb"
PIPELINE = "mydata"
DATASET = "myschema"

if __name__ == '__main__':
    dbpath = f"{PIPELINE}.db"
    if os.path.exists(dbpath):
        os.remove(dbpath)
    if os.path.isdir(PIPELINE):
        shutil.rmtree(PIPELINE)

    engine = create_engine(f"{ENGINE}:///{dbpath}")
    pipeline = dlt.pipeline(
        destination=dlt.destinations.sqlalchemy(engine),
        pipeline_name=PIPELINE,
        dataset_name=DATASET,
        pipelines_dir=".",
    )
    info = pipeline.run([
        {"id": 1},
        {"id": 2},
        {"id": 3},
    ], table_name="numbers")
    print(info)
Here's the contents of my pyproject.toml:
[project]
name = "minimal_dlt_schema_problem_repro"
version = "0.1.0"
description = "Add your description here"
requires-python = ">=3.14"
dependencies = [
    "dlt[duckdb,sqlalchemy]>=1.26.0",
    "duckdb-engine>=0.17.0",
]
Operating system
macOS
Runtime environment
Local
Python version
3.13
dlt data source
It doesn't matter. I can reproduce this with a custom data source as well as with a static list of Python dicts.
dlt destination
No response
Other deployment details
No response
Additional information
While we could sidestep this problem by using the DuckDB destination, we need encrypted DuckDB files, which the DuckDB destination does not support. Our best option is the SQLAlchemy destination, which lets us run the SQL needed to ATTACH an encrypted database on every new connection (something that isn't possible with the DuckDB destination).
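For context, this is roughly how we attach the encrypted database today via SQLAlchemy (a sketch only; the file name, key, and exact ATTACH / ENCRYPTION_KEY syntax are illustrative and depend on the DuckDB version in use):

import sqlalchemy as sa
from sqlalchemy import event

engine = sa.create_engine("duckdb:///mydata.db")

@event.listens_for(engine, "connect")
def attach_encrypted(dbapi_connection, connection_record):
    # Runs for every new DBAPI connection created by the pool.
    cur = dbapi_connection.cursor()
    cur.execute("ATTACH 'secrets.duckdb' AS secrets (ENCRYPTION_KEY 'my-secret-key')")
    cur.close()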