feat: add DuckDB, Trino, Dremio & Spark support to CI and CLI#2135

Merged
haritamar merged 98 commits into master from devin/ELE-5266-1772368864-code-ci
Mar 3, 2026

Conversation


@devin-ai-integration devin-ai-integration bot commented Mar 1, 2026

Summary

Adds full CI infrastructure and CLI support for DuckDB, Trino, Dremio, and Spark adapters to elementary-cli. CI runs all 11 warehouse tests (including these 4) plus code-quality.

Key changes:

  • pyproject.toml: Added dbt-duckdb and dbt-dremio as optional dependencies
  • profiles.yml.j2: Added profile targets for all 4 adapters (DuckDB file-based, Trino with Iceberg, Dremio with Nessie enterprise catalog, Spark Thrift Server with Delta)
  • docker-compose.yml: Added Docker service stacks for Trino (Iceberg + Hive Metastore + MinIO), Dremio (Nessie + MinIO), Spark (Thrift Server + Delta Lake)
  • CI workflows: Docker startup steps, seed caching for Postgres/ClickHouse (Docker volumes) + DuckDB (db file), external seeders for Dremio and Spark
  • E2E dbt project: Renamed reserved-word columns (min → min_val, max → max_val, sum → sum_val, one → col1), updated data_type mappings, excluded schema_changes for Spark, added adapter-specific schema drop dispatches
  • CLI adapter code: Added transient error patterns and get_adapter_unique_id dispatch for DuckDB

Dremio-specific workarounds (dbt-dremio adapter limitations):

  • --exclude tag:ephemeral_model on dbt test — dbt-dremio doesn't support ephemeral model CTEs
  • Elementary profile uses enterprise_catalog_folder instead of schema to avoid duplicate-key errors from dbt-dremio's alias chain
  • Seeds placed in same Nessie namespace as models (VDS validator can't resolve cross-namespace references)
  • Single-threaded execution (--threads 1) to avoid Nessie catalog race conditions
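Put together, a Dremio test step honoring these constraints could be assembled roughly as follows (a sketch only; the target name and exact selection syntax are assumptions, not the CI workflow's actual values):

```shell
# Build the dbt test invocation with the Dremio workarounds above:
# skip ephemeral-model tests and serialize execution against Nessie.
DBT_FLAGS="--target dremio --threads 1 --exclude tag:ephemeral_model"
echo "dbt test $DBT_FLAGS"
```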

Updates since last revision (CodeRabbit follow-ups / CI robustness)

  • Dremio startup safety in CI: replaced docker compose up --exit-code-from dremio-setup ... with a two-step pattern:
    1. docker compose up -d --wait dremio dremio-minio nessie
    2. docker compose run --rm dremio-setup
  • Bounded Dremio readiness loop in dremio-setup.sh (max attempts + curl timeout + hard exit) to avoid CI hangs.
  • Seed cache scope aligned: seed-cache key/restore now only runs for adapters that actually save a cache (Postgres/ClickHouse/DuckDB), and the save condition is simplified accordingly.
  • Dremio external seeder identifier quoting: added a _quote_ident() helper to escape embedded quotes in table/column identifiers.
  • Spark external seeder dependency clarity: wraps pyhive import with a clear runtime error message (CI installs Spark deps via dbt-spark[PyHive]).
  • Removed Dremio macro overrides: dateadd.sql, cast_column.sql, and dremio_target_database.sql are now handled by dbt-data-reliability (PR #957 and PR #952, both merged). Only the ref.sql override remains (dbt-core 1.11 cross-package ref workaround for dbt-dremio's broken ref() macro).
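The bounded readiness loop can be sketched as a small shell helper (illustrative only; the URL, attempt count, and timeouts in the real dremio-setup.sh may differ):

```shell
# Poll a URL until it responds or a hard attempt limit is reached.
# --max-time bounds each probe so a hung endpoint cannot stall the loop.
wait_for_service() {
  url="$1"; max_attempts="${2:-60}"
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if curl -sf --max-time 5 "$url" > /dev/null 2>&1; then
      echo "ready"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep 1
  done
  # Hard exit path: CI fails fast instead of hanging indefinitely.
  echo "timed out waiting for $url" >&2
  return 1
}
```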

Review & Testing Checklist for Human

  • Dremio override removal: Verify that the dbt-data-reliability master branch (used by default) includes the dremio__edr_dateadd, dremio__edr_cast_as_timestamp, and dremio__target_database macros from PRs #957 and #952. If CI uses an older version, Dremio tests will fail.
  • Column rename consistency: Confirm all references to renamed columns (min_val, max_val, sum_val, col1) are updated consistently across models, tests, and seed files.
  • CI orchestration: Confirm Dremio services stay up after dremio-setup completes (two-step compose flow) and subsequent dbt steps can connect reliably.
  • Seed cache behavior: Verify caching is only applied to Postgres/ClickHouse/DuckDB as intended and doesn't regress runtime/correctness for other adapters.
  • Spark external seeder runtime deps: Ensure the Spark CI job continues to install dbt-spark[PyHive] and that local failure mode is acceptable (actionable error message).
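One way to spot-check the first item is to grep the installed package directory for the expected macro names; a hedged sketch (the helper name and the idea of grepping a package checkout are assumptions, not part of this PR):

```shell
# Check that a dbt package directory defines the expected Dremio macros
# (macro names taken from the checklist above).
check_macros() {
  pkg_dir="$1"
  for m in dremio__edr_dateadd dremio__edr_cast_as_timestamp dremio__target_database; do
    grep -rq "$m" "$pkg_dir" || { echo "missing macro: $m" >&2; return 1; }
  done
  echo "all dremio macros present"
}
```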

Suggested Test Plan

  1. Re-run CI once (or trigger "Re-run all jobs") and confirm all warehouse jobs + code-quality pass.
  2. (Optional) Re-run CI a second time to validate seed-cache hits for Postgres/ClickHouse/DuckDB behave as expected.
  3. If testing locally: bring up Dremio stack (docker compose up -d --wait dremio dremio-minio nessie + docker compose run --rm dremio-setup) and run the Dremio dbt steps.

Notes

  • All Docker credentials (e.g. minio123, dremio123) are local-only test credentials, not real secrets.
  • Spark seeding uses pyhive; in CI this is provided via dbt-spark[PyHive].
  • Dremio ref() override fix: PR #2137 adds a project-level ref() macro that properly handles dbt-core 1.11's two-arg cross-package syntax. See dremio/dbt-dremio#232 for upstream issue.

Requested by: @haritamar
Link to Devin Session: https://app.devin.ai/sessions/6c12a19d456541568bcd7974623f3557

Summary by CodeRabbit

  • New Features

    • Added support for DuckDB, Trino, Dremio, and Spark warehouses; optional DuckDB/Dremio adapters
    • External seed loaders and CLI to load seeds into Dremio and Spark
    • Deterministic seed data generation for stable CI artifacts
  • Tests

    • CI expanded to run and cache seeds across new warehouses; extended Docker infra for Trino, Dremio, Spark
  • Models

    • Renamed several seed/model columns (min → min_val, one → col1)
  • Bug Fixes

    • Improved Dremio timestamp and dateadd handling for compatibility

devin-ai-integration bot and others added 7 commits March 1, 2026 14:31
- pyproject.toml: add dbt-duckdb and dbt-dremio as optional dependencies
- Docker config files for Trino and Spark (non-credential files)
- test-all-warehouses.yml: add duckdb, trino, dremio, spark to CI matrix
- schema.yml: update data_type expressions for new adapter type mappings
- test_alerts_union.sql: exclude schema_changes for Spark (like Databricks)
- drop_test_schemas.sql: add dispatched edr_drop_schema for all new adapters
- transient_errors.py: add spark and duckdb entries to _ADAPTER_PATTERNS
- get_adapter_type_and_unique_id.sql: add duckdb dispatch (uses target.path)

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
@devin-ai-integration

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.



linear bot commented Mar 1, 2026


github-actions bot commented Mar 1, 2026

👋 @devin-ai-integration[bot]
Thank you for raising your pull request.
Please make sure to add tests and document all user-facing changes.
You can do this by editing the docs files in this pull request.


coderabbitai bot commented Mar 1, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds DuckDB, Trino, Dremio, and Spark support across CI matrices and workflows, Docker compose services, dbt profiles/macros, external seed loaders (Dremio/Spark), deterministic seed data and column renames, seed cache save/restore utilities, and related configuration and dependency updates.

Changes

  • CI Workflows (.github/workflows/test-all-warehouses.yml, .github/workflows/test-warehouse.yml): Adds duckdb/trino/dremio/spark to the CI matrix; implements per-warehouse seed cache compute/restore/save and conditional startup/tooling steps.
  • Docker & Compose (tests/e2e_dbt_project/docker-compose.yml, tests/e2e_dbt_project/docker/..., tests/e2e_dbt_project/docker/trino/..., tests/e2e_dbt_project/docker/spark/...): Adds Trino (Iceberg), MinIO/Nessie, the Dremio stack, Spark Thrift + Hive metastore, new volumes/networks, healthchecks, and orchestration for the new services.
  • External Seeders & CLI (tests/e2e_dbt_project/external_seeders/__init__.py, .../base.py, .../dremio.py, .../spark.py, tests/e2e_dbt_project/load_seeds_external.py): New ExternalSeeder base and adapter implementations for Dremio (MinIO, Dremio REST/Nessie, Iceberg/COPY) and Spark (PyHive/Thrift, Delta); CLI dispatcher to invoke them.
  • Seed Cache Utilities (tests/e2e_dbt_project/ci/restore_seed_cache.sh, tests/e2e_dbt_project/ci/save_seed_cache.sh): Scripts to archive and restore Docker volumes to /tmp seed-cache- archives, with container stop/start and readiness polling.
  • Profiles & Dependencies (tests/profiles/profiles.yml.j2, pyproject.toml, elementary/monitor/dbt_project/dbt_project.yml, tests/e2e_dbt_project/dbt_project.yml): Adds trino/dremio/duckdb/spark profile targets and anchors; optional deps dbt-duckdb/dbt-dremio and extras; dispatch rules and a conditional delta file_format for Spark.
  • Adapter Logic & Macros (elementary/clients/dbt/transient_errors.py, elementary/monitor/dbt_project/macros/..., tests/e2e_dbt_project/macros/system/*): Adds Spark transient errors and a DuckDB placeholder; package-qualified refs; adapter-specific macros (duckdb/trino/dremio/spark), Dremio overrides (cast/dateadd), generate_schema_name and edr_drop_schema implementations.
  • Monitoring Macros & SQL Safety (elementary/monitor/dbt_project/macros/..., elementary/monitor/dbt_project/models/alerts/*.sql): Replaces wildcard refs with package-qualified refs, explicit column lists, explicit JOINs, and escapes reserved keywords across multiple macros.
  • SQL Models, Tests & Seed Data (tests/e2e_dbt_project/models/*, tests/e2e_dbt_project/models/schema.yml, tests/e2e_dbt_project/tests/*.sql, tests/e2e_dbt_project/generate_data.py): Renames numeric columns (min → min_val, max → max_val, sum → sum_val), updates aliases and tests, adds deterministic seed generation (RANDOM_SEED=42), and includes Spark in target-aware type logic.
  • Trino Configs & Spark Files (tests/e2e_dbt_project/docker/trino/catalog/*, tests/e2e_dbt_project/docker/trino/etc/*, tests/e2e_dbt_project/docker/spark/*): Adds Iceberg/memory catalogs and Trino JVM/node configs; Spark Dockerfile, entrypoint, hive-site.xml, and spark-defaults.conf.
  • Utilities / Setup Scripts (tests/e2e_dbt_project/docker/dremio/dremio-setup.sh, tests/e2e_dbt_project/docker/spark/entrypoint.sh): Adds a Dremio setup script to create the Nessie source and a Spark entrypoint with wait logic for service readiness.
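The archive/restore half of the seed cache scripts follows a plain tar pattern; a minimal sketch where a local directory stands in for the Docker volume (the real scripts additionally stop/start containers and poll for readiness):

```shell
# Archive a data directory (standing in for a Docker volume) to a
# compressed cache file, and restore it back on a later run.
save_cache() {
  data_dir="$1"; cache_file="$2"
  tar -czf "$cache_file" -C "$data_dir" .
}

restore_cache() {
  cache_file="$1"; data_dir="$2"
  mkdir -p "$data_dir"
  tar -xzf "$cache_file" -C "$data_dir"
}
```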

Sequence Diagram(s)

sequenceDiagram
    participant CI as GitHub Actions
    participant Cache as Seed Cache
    participant Docker as Docker Compose
    participant dbt as dbt / CLI
    participant Service as Warehouse Service

    rect rgba(100, 150, 200, 0.5)
    Note over CI,Cache: Seed cache decision
    CI->>Cache: compute seed-hash & check cache
    alt cache hit
        Cache-->>Docker: restore tar.gz into volumes
        Docker->>Service: start service(s) with mounted cached data
    else cache miss
        CI->>dbt: generate seeds / run external seeders
        dbt->>Service: load seed data (external or local)
    end
    end

    rect rgba(150, 200, 100, 0.5)
    Note over Docker,dbt: Test execution
    Docker->>Service: start & healthcheck services
    dbt->>Service: run dbt (seed/run/test)
    Service-->>dbt: return results/artifacts
    end

    rect rgba(200, 150, 100, 0.5)
    Note over Docker,Cache: Save cache
    Docker->>Cache: archive volumes to tar.gz
    Cache-->>CI: upload/store cache
    end
sequenceDiagram
    participant CLI as load_seeds_external
    participant Seeder as ExternalSeeder (Dremio/Spark)
    participant MinIO as MinIO
    participant Dremio as Dremio API
    participant Spark as Spark Thrift

    rect rgba(100, 200, 150, 0.5)
    Note over CLI,Seeder: Dremio flow
    CLI->>Seeder: invoke DremioExternalSeeder.load()
    Seeder->>MinIO: upload CSVs (mc via docker)
    Seeder->>Dremio: create/update S3 source (API)
    Seeder->>Dremio: create Nessie namespace
    loop per CSV
        Seeder->>Dremio: CREATE ICEBERG TABLE & COPY INTO from S3
    end
    Dremio-->>Seeder: job completion / metadata propagation
    end

    rect rgba(150, 100, 200, 0.5)
    Note over CLI,Seeder: Spark flow
    CLI->>Seeder: invoke SparkExternalSeeder.load()
    Seeder->>Spark: connect via Thrift (PyHive)
    loop per CSV
        Seeder->>Spark: create schema/table and populate via temporary view
    end
    Spark-->>Seeder: tables available
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Poem

🐰 I hopped through configs, jars, and streams,
DuckDB, Trino, Dremio, Spark joined the teams.
Seeds packed, cached, and carried in tar,
Docker dances, profiles set to spar.
A carrot of CI — tests run near and far!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 56.52%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Title Check: ✅ Passed. The title clearly and concisely describes the main feature addition: support for four new warehouse adapters (DuckDB, Trino, Dremio, Spark) in both CI and CLI infrastructure.
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (2)
tests/e2e_dbt_project/docker/spark/Dockerfile (1)

1-30: Run the Spark container as a non-root user.

The image currently runs as root; adding a dedicated runtime user improves baseline container hardening.

Suggested hardening diff
 COPY entrypoint.sh /scripts/
-RUN chmod +x /scripts/entrypoint.sh
+RUN chmod +x /scripts/entrypoint.sh && \
+    groupadd -r spark && useradd -r -g spark -m spark && \
+    chown -R spark:spark /usr/spark /scripts
 
+USER spark
 ENTRYPOINT ["/scripts/entrypoint.sh"]
 CMD ["--help"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e_dbt_project/docker/spark/Dockerfile` around lines 1 - 30, The
Dockerfile runs the Spark image as root; create and switch to a non-root runtime
user to improve container hardening by adding a dedicated user/group (e.g.,
spark:spark) after installing packages and before COPY/ENTRYPOINT, ensure
ownership of SPARK_HOME (/usr/spark) and /scripts is changed to that user (chown
-R spark:spark /usr/spark /scripts) and then add a USER spark line so the
ENTRYPOINT ("/scripts/entrypoint.sh") and CMD run unprivileged; make sure any
later runtime actions in entrypoint.sh do not require root or handle privilege
escalation explicitly.
tests/e2e_dbt_project/docker-compose.yml (1)

106-106: Pin new container images to explicit versions/digests.

Using latest (or no tag) makes CI non-deterministic and harder to debug when upstream images change.

🔧 Suggested direction
-    image: projectnessie/nessie:latest
+    image: projectnessie/nessie:<pinned-version-or-digest>

-    image: minio/minio:latest
+    image: minio/minio:<pinned-version-or-digest>

-    image: minio/mc
+    image: minio/mc:<pinned-version-or-digest>

-    image: dremio/dremio-oss:latest
+    image: dremio/dremio-oss:<pinned-version-or-digest>

-    image: alpine:latest
+    image: alpine:<pinned-version-or-digest>

Also applies to: 114-114, 132-132, 150-150, 175-175

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e_dbt_project/docker-compose.yml` at line 106, Replace unconstrained
image tags using "latest" with explicit version tags or digests to ensure
deterministic CI; update the image entry "image: projectnessie/nessie:latest" in
the docker-compose.yml to a specific version or sha256 digest, and apply the
same change for the other image entries referenced (lines 114, 132, 150, 175) so
all container images are pinned to fixed tags/digests.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/test-warehouse.yml:
- Around line 109-113: The readiness loops for Trino, Dremio and Spark currently
break on success but never fail on timeout; update each curl loop (the
Trino/Dremio/Spark readiness checks) to detect if the loop completed without
success by adding a post-loop conditional that checks whether the last curl
succeeded and, if not, prints an error and exits non‑zero (e.g., after the seq
loop, if the curl never returned success then echo "Timed out waiting for
<Service>" and exit 1). Ensure you apply this pattern to the Trino readiness
loop and the analogous Dremio and Spark loops so the job fails fast when
dependencies never become healthy.
- Line 216: The DuckDB target is using an in-memory database (path ":memory:")
which cannot persist across workflow steps; change the DuckDB configuration used
in CI so it uses a persistent file path (for example set path to
"/tmp/elementary_test.db") or exclude DuckDB from the multi-step workflow when
in-memory testing is intended; update the workflow logic that references
inputs.warehouse-type (and any DuckDB-specific job/step configs) to use the
persistent file path or add a conditional to skip DuckDB when
inputs.warehouse-type == 'duckdb' and in-memory mode is active.

In `@tests/e2e_dbt_project/docker-compose.yml`:
- Around line 95-100: The trino-mc-job entrypoint uses a fixed sleep which can
fail if MinIO starts slower; update the trino-mc-job entrypoint to replace
"sleep 5" with a readiness loop that waits for minio:9000 to accept connections
(e.g., a small loop using curl or nc that retries until success) before running
the mc commands, then proceed with "/usr/bin/mc config ...", "/usr/bin/mc mb
..."; modify the entrypoint block in the trino-mc-job service so it polls minio
(hostname "minio", port 9000) and exits the loop only when reachable.

In `@tests/e2e_dbt_project/docker/dremio/dremio-setup.sh`:
- Around line 15-33: The script currently treats AUTH_TOKEN and the S3 source
creation as always successful; update the AUTH_TOKEN acquisition and the
subsequent curl call that creates S3Source to fail fast: after the login curl
that sets AUTH_TOKEN (variable AUTH_TOKEN), check for empty string OR the
literal "null" and exit non‑zero with an error message if found; for both curl
calls (the login POST and the PUT to create S3Source) capture HTTP status (e.g.,
via curl -w/%{http_code}% and/or -f) and the response body, verify the status is
a 2xx before proceeding, and on non‑2xx log the response body and exit 1 so the
script does not print "Obtained Dremio auth token" or "S3 Source created in
Dremio" on failure.

In `@tests/e2e_dbt_project/macros/system/drop_test_schemas.sql`:
- Around line 43-53: The DROP macros interpolate schema_name unquoted which
breaks on special chars/reserved words; update dremio__edr_drop_schema,
duckdb__edr_drop_schema, and spark__edr_drop_schema to use
adapter.quote(schema_name) (same pattern as clickhouse__edr_drop_schema) when
building the raw DROP SQL so the identifier is safely quoted (e.g., replace
plain schema_name concatenation with adapter.quote(schema_name) in the run_query
argument).

In `@tests/profiles/profiles.yml.j2`:
- Around line 49-53: The DuckDB profile currently uses the process-local
in-memory path ":memory:", which breaks across SubprocessDbtRunner subprocesses;
update the profiles template (the duckdb &duckdb block) to use a persistent or
shared DB URI instead of ":memory:"—either switch to a file-backed path (e.g., a
temp DB file under a test-specific directory) or use a shared-memory URI like
the SQLite-style shared cache (e.g., file:...?mode=memory&cache=shared) so
multiple dbt subprocesses can see the same database; keep the schema
interpolation ({{ schema_name }}) and threads setting unchanged.

---

Nitpick comments:
In `@tests/e2e_dbt_project/docker-compose.yml`:
- Line 106: Replace unconstrained image tags using "latest" with explicit
version tags or digests to ensure deterministic CI; update the image entry
"image: projectnessie/nessie:latest" in the docker-compose.yml to a specific
version or sha256 digest, and apply the same change for the other image entries
referenced (lines 114, 132, 150, 175) so all container images are pinned to
fixed tags/digests.

In `@tests/e2e_dbt_project/docker/spark/Dockerfile`:
- Around line 1-30: The Dockerfile runs the Spark image as root; create and
switch to a non-root runtime user to improve container hardening by adding a
dedicated user/group (e.g., spark:spark) after installing packages and before
COPY/ENTRYPOINT, ensure ownership of SPARK_HOME (/usr/spark) and /scripts is
changed to that user (chown -R spark:spark /usr/spark /scripts) and then add a
USER spark line so the ENTRYPOINT ("/scripts/entrypoint.sh") and CMD run
unprivileged; make sure any later runtime actions in entrypoint.sh do not
require root or handle privilege escalation explicitly.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 746ce41 and 0836bbf.

📒 Files selected for processing (20)
  • .github/workflows/test-all-warehouses.yml
  • .github/workflows/test-warehouse.yml
  • elementary/clients/dbt/transient_errors.py
  • elementary/monitor/dbt_project/macros/get_adapter_type_and_unique_id.sql
  • pyproject.toml
  • tests/e2e_dbt_project/docker-compose.yml
  • tests/e2e_dbt_project/docker/dremio/dremio-setup.sh
  • tests/e2e_dbt_project/docker/spark/Dockerfile
  • tests/e2e_dbt_project/docker/spark/entrypoint.sh
  • tests/e2e_dbt_project/docker/spark/hive-site.xml
  • tests/e2e_dbt_project/docker/spark/spark-defaults.conf
  • tests/e2e_dbt_project/docker/trino/catalog/iceberg.properties
  • tests/e2e_dbt_project/docker/trino/catalog/memory.properties
  • tests/e2e_dbt_project/docker/trino/etc/config.properties
  • tests/e2e_dbt_project/docker/trino/etc/jvm.config
  • tests/e2e_dbt_project/docker/trino/etc/node.properties
  • tests/e2e_dbt_project/macros/system/drop_test_schemas.sql
  • tests/e2e_dbt_project/models/schema.yml
  • tests/e2e_dbt_project/models/test_alerts_union.sql
  • tests/profiles/profiles.yml.j2

- DuckDB: use file-backed DB path instead of :memory: to persist across
  subprocess calls, and reduce threads to 1 to avoid concurrent commit errors
- Spark: install dbt-spark[PyHive] extras required for thrift connection method
- Dremio: add dremio__target_database() dispatch override in e2e project
  to return target.database (upstream elementary package lacks this dispatch)
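A file-backed DuckDB target of the kind described might look like the following (a sketch with illustrative values, not the actual profiles.yml.j2 contents; the /tmp path echoes the one suggested in the review comments):

```yaml
# Hypothetical dbt profile target for DuckDB: a file path (not ":memory:")
# so state persists across dbt subprocess invocations, and a single thread
# to avoid concurrent commit errors.
elementary_tests:
  target: duckdb
  outputs:
    duckdb:
      type: duckdb
      path: /tmp/elementary_test.db
      schema: test_schema
      threads: 1
```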

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (1)
.github/workflows/test-warehouse.yml (1)

109-113: ⚠️ Potential issue | 🟠 Major

Fail fast when readiness polling times out.

Line 109, Line 124, and Line 141 loops still continue even if services never become ready. This can allow later steps to run against unhealthy dependencies and fail non-deterministically.

Suggested fix pattern (apply to all three loops)
-          for i in $(seq 1 60); do
-            curl -sf http://localhost:8086/v1/info && break
+          ready=0
+          for i in $(seq 1 60); do
+            if curl -sf http://localhost:8086/v1/info > /dev/null; then
+              ready=1
+              break
+            fi
             echo "Waiting for Trino... ($i/60)"
             sleep 5
           done
+          if [ "$ready" -ne 1 ]; then
+            echo "Timed out waiting for Trino"
+            exit 1
+          fi

Also applies to: 124-128, 141-145

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/test-warehouse.yml around lines 109 - 113, The readiness
polling loops that use "for i in $(seq 1 60); do curl -sf ... && break; ...
done" should fail fast when the max attempts are reached; update each such loop
(the Trino/other service readiness loops) to check after sleeping whether i
equals the max (e.g., 60) and, if so, print an error to stderr and exit 1 so the
workflow stops rather than continuing with unhealthy dependencies. Ensure the
same pattern is applied to all three polling loops to provide deterministic
failure on timeout.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In @.github/workflows/test-warehouse.yml:
- Around line 109-113: The readiness polling loops that use "for i in $(seq 1
60); do curl -sf ... && break; ... done" should fail fast when the max attempts
are reached; update each such loop (the Trino/other service readiness loops) to
check after sleeping whether i equals the max (e.g., 60) and, if so, print an
error to stderr and exit 1 so the workflow stops rather than continuing with
unhealthy dependencies. Ensure the same pattern is applied to all three polling
loops to provide deterministic failure on timeout.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 0836bbf and 40b2c74.

📒 Files selected for processing (3)
  • .github/workflows/test-warehouse.yml
  • tests/e2e_dbt_project/macros/system/dremio_target_database.sql
  • tests/profiles/profiles.yml.j2
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/profiles/profiles.yml.j2

Plain S3 sources in Dremio do not support CREATE TABLE (needed for dbt seed).
Switch to Nessie catalog source which supports table creation via Iceberg.

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
tests/e2e_dbt_project/docker/dremio/dremio-setup.sh (1)

15-31: ⚠️ Potential issue | 🟠 Major

Fail fast on login/source API errors and reject "null" tokens.

At Line 20, checking only -z misses the literal "null" token case; and Lines 15-31 do not validate HTTP status for either API call, so setup can report success on non-2xx responses.

🔧 Suggested patch
-# Log in to Dremio to get the auth token
-AUTH_TOKEN=$(curl -s -X POST "http://dremio:9047/apiv2/login" \
-  -H "Content-Type: application/json" \
-  --data "{\"userName\":\"dremio\", \"password\":\"dremio123\"}" | jq -r .token)
+# Log in to Dremio to get the auth token
+LOGIN_RESPONSE="$(mktemp)"
+LOGIN_HTTP_CODE=$(curl -sS -o "$LOGIN_RESPONSE" -w "%{http_code}" -X POST "http://dremio:9047/apiv2/login" \
+  -H "Content-Type: application/json" \
+  --data "{\"userName\":\"dremio\", \"password\":\"dremio123\"}")
+if [ "$LOGIN_HTTP_CODE" -lt 200 ] || [ "$LOGIN_HTTP_CODE" -ge 300 ]; then
+  echo "Failed to log in to Dremio (HTTP $LOGIN_HTTP_CODE)"
+  cat "$LOGIN_RESPONSE"
+  exit 1
+fi
+AUTH_TOKEN=$(jq -r '.token // empty' "$LOGIN_RESPONSE")
 
 # Check if AUTH_TOKEN is not empty
-if [ -z "$AUTH_TOKEN" ]; then
+if [ -z "$AUTH_TOKEN" ] || [ "$AUTH_TOKEN" = "null" ]; then
   echo "Failed to obtain Dremio auth token"
   exit 1
 fi
 
 echo "Obtained Dremio auth token"
 
 # Create a Nessie catalog source in Dremio (supports CREATE TABLE for dbt seed)
-curl -s -X PUT "http://dremio:9047/apiv2/source/NessieSource" \
+SOURCE_RESPONSE="$(mktemp)"
+SOURCE_HTTP_CODE=$(curl -sS -o "$SOURCE_RESPONSE" -w "%{http_code}" -X PUT "http://dremio:9047/apiv2/source/NessieSource" \
   -H "Content-Type: application/json" \
   -H "Authorization: _dremio$AUTH_TOKEN" \
-  --data "{\"name\":\"NessieSource\",\"config\":{\"nessieEndpoint\":\"http://catalog:19120/api/v2\",\"nessieAuthType\":\"NONE\",\"credentialType\":\"ACCESS_KEY\",\"awsAccessKey\":\"admin\",\"awsAccessSecret\":\"password\",\"awsRootPath\":\"datalake\",\"secure\":false,\"propertyList\":[{\"name\":\"fs.s3a.path.style.access\",\"value\":\"true\"},{\"name\":\"fs.s3a.endpoint\",\"value\":\"dremio-storage:9000\"},{\"name\":\"dremio.s3.compat\",\"value\":\"true\"}]},\"type\":\"NESSIE\",\"metadataPolicy\":{\"deleteUnavailableDatasets\":true,\"autoPromoteDatasets\":false,\"namesRefreshMillis\":3600000,\"datasetDefinitionRefreshAfterMillis\":3600000,\"datasetDefinitionExpireAfterMillis\":10800000,\"authTTLMillis\":86400000,\"updateMode\":\"PREFETCH_QUERIED\"}}"
+  --data "{\"name\":\"NessieSource\",\"config\":{\"nessieEndpoint\":\"http://catalog:19120/api/v2\",\"nessieAuthType\":\"NONE\",\"credentialType\":\"ACCESS_KEY\",\"awsAccessKey\":\"admin\",\"awsAccessSecret\":\"password\",\"awsRootPath\":\"datalake\",\"secure\":false,\"propertyList\":[{\"name\":\"fs.s3a.path.style.access\",\"value\":\"true\"},{\"name\":\"fs.s3a.endpoint\",\"value\":\"dremio-storage:9000\"},{\"name\":\"dremio.s3.compat\",\"value\":\"true\"}]},\"type\":\"NESSIE\",\"metadataPolicy\":{\"deleteUnavailableDatasets\":true,\"autoPromoteDatasets\":false,\"namesRefreshMillis\":3600000,\"datasetDefinitionRefreshAfterMillis\":3600000,\"datasetDefinitionExpireAfterMillis\":10800000,\"authTTLMillis\":86400000,\"updateMode\":\"PREFETCH_QUERIED\"}}")
+if [ "$SOURCE_HTTP_CODE" -lt 200 ] || [ "$SOURCE_HTTP_CODE" -ge 300 ]; then
+  echo "Failed to create Nessie Source in Dremio (HTTP $SOURCE_HTTP_CODE)"
+  cat "$SOURCE_RESPONSE"
+  exit 1
+fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e_dbt_project/docker/dremio/dremio-setup.sh` around lines 15 - 31,
The login and source API calls must validate HTTP status and reject a literal
"null" token; modify the AUTH_TOKEN retrieval block to capture both response
body and HTTP status from the login curl (use curl -w to produce HTTP code and
capture body separately), then extract token with jq -r .token and fail if the
HTTP status is not 2xx or if AUTH_TOKEN is empty or equals "null" (print the
response body/error and exit 1); do the same for the PUT creating NessieSource:
capture response body and HTTP status, and exit non-zero with the response on
any non-2xx status. Ensure you reference the existing AUTH_TOKEN variable, the
login curl to "apiv2/login", and the PUT to "apiv2/source/NessieSource" when
making these checks.
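The prompt above can be sketched as a small standalone check; `sed` stands in for `jq` only to keep the sketch dependency-free, and the `{"token": "..."}` response shape is an assumption:

```shell
# Hypothetical helper mirroring the suggested fix: reject non-2xx status and
# empty/"null" tokens. In the real script, body and status would come from one
# call: HTTP_CODE=$(curl -s -o "$RESPONSE_FILE" -w '%{http_code}' ...).
validate_login() {
  local body="$1" http_code="$2" token
  if [ "$http_code" -lt 200 ] || [ "$http_code" -ge 300 ]; then
    echo "Login failed (HTTP $http_code): $body" >&2
    return 1
  fi
  # equivalent of jq -r .token, via sed so the sketch has no jq dependency
  token=$(printf '%s' "$body" | sed -n 's/.*"token"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')
  if [ -z "$token" ] || [ "$token" = "null" ]; then
    echo "Login returned no usable token: $body" >&2
    return 1
  fi
  printf '%s\n' "$token"
}
```

The same body/status capture would then be repeated for the `PUT` to `apiv2/source/NessieSource`.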
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/e2e_dbt_project/docker/dremio/dremio-setup.sh`:
- Around line 7-10: The readiness loop that uses "until curl -s
http://dremio:9047; do ... done" must be bounded: add a MAX_ATTEMPTS variable
and a counter, set curl timeouts (e.g., --connect-timeout and/or --max-time) so
each try fails fast, increment the counter inside the loop, and when attempts >=
MAX_ATTEMPTS log an error and exit non-zero; update the existing until/curl loop
code in the script to use these variables and checks so CI will fail instead of
hanging indefinitely.
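A reusable sketch of the bounded loop the comment asks for; the probe command is passed as arguments, and the sleep interval defaults to 0 only so the sketch runs instantly (a real CI loop would use ~5 seconds plus curl timeouts):

```shell
# Hypothetical bounded readiness loop: fails fast with a non-zero exit instead
# of hanging forever.
wait_for_service() {
  local max_attempts="$1"; shift
  local interval="${WAIT_INTERVAL:-0}"  # illustrative; CI would use ~5s
  local attempt
  for attempt in $(seq 1 "$max_attempts"); do
    if "$@" > /dev/null 2>&1; then
      return 0
    fi
    echo "Not ready yet (attempt $attempt/$max_attempts)" >&2
    sleep "$interval"
  done
  echo "Service did not become ready after $max_attempts attempts" >&2
  return 1
}

# Usage sketch:
# wait_for_service 60 curl -sf --connect-timeout 2 --max-time 5 http://dremio:9047 || exit 1
```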


ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 40b2c74 and bac36f9.

📒 Files selected for processing (2)
  • tests/e2e_dbt_project/docker/dremio/dremio-setup.sh
  • tests/profiles/profiles.yml.j2
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/profiles/profiles.yml.j2

- Make generate_data.py deterministic (fixed random seed)
- Use fixed schema name for Docker adapters (ephemeral containers)
- Cache seeded Docker volumes between runs using actions/cache
- Cache DuckDB database file between runs
- Skip dbt seed on cache hit, restoring from cached volumes instead
- Applies to: Spark, Trino, Dremio, Postgres, ClickHouse, DuckDB

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
.github/workflows/test-warehouse.yml (1)

109-113: ⚠️ Potential issue | 🟠 Major

Readiness loops still don’t fail on timeout.

These checks can still silently continue after 60 attempts if the service never becomes healthy, which allows downstream steps to run against broken dependencies.

Also applies to: 124-128, 141-145

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/test-warehouse.yml around lines 109 - 113, The readiness
loops using "for i in $(seq 1 60); do curl -sf http://localhost:8086/v1/info &&
break; ... done" do not fail when the service never becomes healthy; update each
such loop (the curl-based readiness loops at the three occurrences) to detect
timeout and exit non-zero if the loop completes without success — e.g., after
the done add a check like "if [ $i -eq 60 ] && ! curl -sf
http://localhost:8086/v1/info; then echo 'Service did not become ready' >&2;
exit 1; fi" (or equivalent) so the workflow fails instead of continuing.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/test-warehouse.yml:
- Around line 249-264: The "Restore cached seed data into Docker volumes"
workflow step currently restores tarballs into volumes after the warehouse
services have started (uses CACHE_DIR, docker volume create, docker run ... tar
xzf), which can corrupt live DB files; move this step to run before the job step
that starts warehouse containers (the service startup block) so volumes are
populated prior to container creation, or alternatively modify the step to stop
the warehouse containers before restore and then restart and run a readiness
check (e.g., wait-for/postgres healthcheck) after extraction to ensure services
reinitialize cleanly.



📥 Commits

Reviewing files that changed from the base of the PR and between bac36f9 and 3682f1c.

📒 Files selected for processing (2)
  • .github/workflows/test-warehouse.yml
  • tests/e2e_dbt_project/generate_data.py

Addresses CodeRabbit review: restoring cached tarballs into Docker
volumes while containers are already running risks data corruption.
Now the cache key computation and volume restore happen before any
Docker services are started, so containers initialise with the
pre-seeded data.

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
.github/workflows/test-warehouse.yml (1)

160-164: ⚠️ Potential issue | 🟠 Major

Readiness loops still don’t fail the job on timeout.

On lines 160-164, 175-179, and 192-196, the loops stop on success but never exit 1 when retries are exhausted. The workflow can continue with unhealthy dependencies and fail later with misleading errors.

Fail-fast pattern to apply to each loop
-          for i in $(seq 1 60); do
-            curl -sf http://localhost:8086/v1/info && break
-            echo "Waiting for Trino... ($i/60)"
-            sleep 5
-          done
+          ready=0
+          for i in $(seq 1 60); do
+            if curl -sf http://localhost:8086/v1/info > /dev/null; then
+              ready=1
+              break
+            fi
+            echo "Waiting for Trino... ($i/60)"
+            sleep 5
+          done
+          if [ "$ready" -ne 1 ]; then
+            echo "Timed out waiting for Trino"
+            exit 1
+          fi

Use the same timeout-fail pattern for Dremio and Spark checks.

Also applies to: 175-179, 192-196

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/test-warehouse.yml around lines 160 - 164, The readiness
loops that poll services (the Trino loop shown and the similar Dremio and Spark
loops) currently break on success but do nothing when retries are exhausted;
update each loop so that if the curl never succeeds within the retries it fails
the job (exit 1). Concretely, after each for/seq loop (the Trino, Dremio, and
Spark readiness checks), check whether the last curl succeeded and, if not,
print an error and run exit 1 (or set a failure flag inside the loop and exit
after the loop) so the workflow stops immediately on timeout.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/test-warehouse.yml:
- Around line 96-103: The SEED_HASH calculation (the SEED_HASH env var block)
currently hashes generate_data.py, dbt_project.yml and the profiles.yml.j2
template but omits the Docker stack/compose definition, allowing cached DB
volume tarballs to be restored against incompatible images or volume layouts;
update the SEED_HASH command to also include the Docker stack/compose file(s)
used to bring up the test DB (e.g., docker-compose.yml,
docker-compose.override.yml or any stack definition referenced by the workflow)
so changes in images/layouts change the cache key and prevent restoring
incompatible tarballs.
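One way to sketch the broader key, assuming the hash is built from file contents (paths are illustrative; the actual workflow may use GitHub's `hashFiles()` expression instead):

```shell
# Hypothetical cache-key helper: hash every input that shapes the seeded
# volumes, so a compose/image change invalidates the cached tarballs.
seed_hash() {
  cat "$@" | sha256sum | cut -d' ' -f1
}

# Illustrative usage:
# SEED_HASH=$(seed_hash tests/e2e_dbt_project/generate_data.py \
#   tests/e2e_dbt_project/dbt_project.yml \
#   tests/profiles/profiles.yml.j2 \
#   tests/e2e_dbt_project/docker/docker-compose.yml)
```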



📥 Commits

Reviewing files that changed from the base of the PR and between 3682f1c and 7969949.

📒 Files selected for processing (1)
  • .github/workflows/test-warehouse.yml

devin-ai-integration bot and others added 2 commits March 1, 2026 15:37
… loops

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…ching

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
devin-ai-integration bot and others added 5 commits March 1, 2026 15:46
…upport

Points to dbt-data-reliability#948 which adds:
- trino__full_name_split (1-based array indexing)
- trino__edr_get_create_table_as_sql (bypass model.config issue)
- spark__edr_get_create_table_as_sql

TODO: revert after dbt-data-reliability#948 is merged
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
Prevents tar race condition where ClickHouse temporary merge files
disappear during archiving, causing 'No such file or directory' errors.

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…ache

After stopping containers for volume archiving and restarting them,
services like Trino need time to reinitialize. Added per-adapter
health checks to wait for readiness before proceeding.

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
The /v1/info endpoint returns HTTP 200 even when Trino is still
initializing. Check for '"starting":false' in the response body
to ensure Trino is fully ready before proceeding.

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…rino

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
where {{ elementary.edr_datediff(elementary.edr_cast_as_timestamp('generated_at'), elementary.edr_current_timestamp(), 'day') }} < {{ days_back }}
)
-select count(*) as count
+select count(*) as {% if target.type == 'dremio' %}"count"{% else %}count{% endif %}
Collaborator

do you think we can use elementary.escape_reserved_keywords? It was created for this purpose
(its code is in the dbt-data-reliability repo)

Contributor Author

Good call! Refactored all manual {% if target.type == 'dremio' %} quoting to use elementary.escape_reserved_keywords() instead. This covers:

  • row_number alias (2 files)
  • count alias (1 file)
  • filter column reference (2 files)

Also created dbt-data-reliability#955 to add row_number and snapshots to the reserved keywords list. CI is temporarily pinned to that branch.

devin-ai-integration bot and others added 5 commits March 3, 2026 01:35
…ed words

Replace manual {% if target.type == 'dremio' %} quoting with the existing
elementary.escape_reserved_keywords() utility from dbt-data-reliability.

Files updated:
- get_models_latest_invocation.sql: row_number alias
- get_models_latest_invocations_data.sql: row_number alias
- can_upload_source_freshness.sql: count alias
- source_freshness_alerts.sql: filter column reference
- get_source_freshness_results.sql: filter column reference

Also temporarily pins dbt-data-reliability to branch with row_number
and snapshots added to the reserved keywords list (PR #955).

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
Replace USING (unique_id) with explicit ON clause and select specific
columns instead of SELECT * to avoid ambiguous column references in
Dremio's SQL engine, which doesn't deduplicate join columns with USING.

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…N join

The switch from USING to ON for Dremio compatibility requires qualifying
column references since ON doesn't deduplicate join columns like USING does.

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/workflows/test-warehouse.yml (1)

292-295: ⚠️ Potential issue | 🟠 Major

Remove continue-on-error: true from dbt test step.

The "Test e2e dbt project" step at line 294 has continue-on-error: true, making dbt test failures non-blocking. This is inconsistent with the Python e2e tests at line 385, which properly fail the workflow when tests fail. Critical validation steps should not be allowed to fail silently.

Suggested fix
      - name: Test e2e dbt project
        working-directory: ${{ env.E2E_DBT_PROJECT_DIR }}
-       continue-on-error: true
        run: |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/test-warehouse.yml around lines 292 - 295, The "Test e2e
dbt project" workflow step (named "Test e2e dbt project" and using
working-directory ${{ env.E2E_DBT_PROJECT_DIR }}) is marked with
continue-on-error: true which allows dbt failures to be ignored; remove the
continue-on-error: true key from that step so dbt test failures will fail the
workflow (making this step consistent with the Python e2e tests).
🧹 Nitpick comments (5)
elementary/monitor/dbt_project/macros/overrides/dateadd.sql (2)

14-14: Document the purpose of removing 'order by 1' from the interval.

The comment block explains the two bugs being fixed (.replace() on integer and select wrapper), but doesn't explain why 'order by 1' needs to be stripped from the interval. This appears to be a workaround for some upstream dbt-dremio behavior or specific use case.

Consider adding a brief inline comment or expanding the header comment to explain when/why an interval parameter would contain 'order by 1'.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@elementary/monitor/dbt_project/macros/overrides/dateadd.sql` at line 14, Add
a brief explanation for stripping 'order by 1' where interval is normalized:
document why interval may include "order by 1" (e.g., upstream dbt-dremio or a
wrapped SELECT that appends an ORDER BY) and note that the line {% set interval
= interval.replace('order by 1', '') %} removes this artifact to prevent SQL
syntax errors or incorrect interval parsing; place this as an inline comment
immediately above that line and/or expand the header comment to mention the
specific upstream behavior and the conditions when the workaround is required.

29-30: Add explicit 'day' case for clarity.

The day datepart falls through to the else branch, which works correctly but is implicit. This could confuse future maintainers who might think day is unsupported or that the else branch handles some error case.

💡 Proposed fix to add explicit day case
     {% elif datepart == 'second' %}
         TIMESTAMPADD(SQL_TSI_SECOND, CAST({{interval}} as int), CAST({{from_date_or_timestamp}} as TIMESTAMP))
+    {% elif datepart == 'day' %}
+        TIMESTAMPADD(SQL_TSI_DAY, CAST({{interval}} as int), CAST({{from_date_or_timestamp}} as TIMESTAMP))
     {% else %}
+        {# Fallback for unrecognized dateparts - treat as day #}
         TIMESTAMPADD(SQL_TSI_DAY, CAST({{interval}} as int), CAST({{from_date_or_timestamp}} as TIMESTAMP))
     {% endif %}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@elementary/monitor/dbt_project/macros/overrides/dateadd.sql` around lines 29
- 30, The template implicitly handles the 'day' datepart in the final else
branch; add an explicit branch for when datepart equals 'day' in the dateadd
macro to map it to TIMESTAMPADD(SQL_TSI_DAY, CAST({{interval}} as int),
CAST({{from_date_or_timestamp}} as TIMESTAMP)) — locate the conditional block in
the dateadd macro (the same block that currently emits TIMESTAMPADD for other
dateparts) and insert an elif/else-if for 'day' that returns the exact
TIMESTAMPADD(SQL_TSI_DAY, ...) expression so the behavior is explicit and
readable.
elementary/monitor/dbt_project/macros/overrides/cast_column.sql (1)

9-10: Only Z suffix is handled; other UTC offsets are not normalized.

The regex 'Z$' only strips the literal Z timezone suffix. ISO 8601 timestamps can also have explicit offsets like +00:00, +05:30, or -08:00. If such timestamps are encountered, they will pass through unmodified and may cause Gandiva parsing failures.

If the input data is guaranteed to only contain Z-suffixed timestamps, this is fine. Otherwise, consider extending the pattern to handle numeric offsets.

💡 Optional: Extended regex to also strip numeric offsets
-    cast(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE({{ timestamp_field }}, '(\d)T(\d)', '$1 $2'), '(\.\d{3})\d+', '$1'), 'Z$', '') as {{ elementary.edr_type_timestamp() }})
+    cast(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE({{ timestamp_field }}, '(\d)T(\d)', '$1 $2'), '(\.\d{3})\d+', '$1'), 'Z$', ''), '[+-]\d{2}:\d{2}$', '') as {{ elementary.edr_type_timestamp() }})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@elementary/monitor/dbt_project/macros/overrides/cast_column.sql` around lines
9 - 10, The macro dremio__edr_cast_as_timestamp currently only strips a trailing
'Z' before casting, which misses numeric UTC offsets; update the final
REGEXP_REPLACE pattern in dremio__edr_cast_as_timestamp so it also matches and
removes timezone offsets like +HH:MM, -HH:MM, +HHMM or -HHMM (in addition to
'Z') prior to casting to elementary.edr_type_timestamp(), using an anchored
regex that accepts optional colon between hours and minutes and both +/− signs.
tests/e2e_dbt_project/macros/system/generate_schema_name.sql (1)

23-23: Minor style inconsistency: {{ return() }} vs {% do return() %}.

This line uses {{ return(...) }} while the rest of the file uses {% do return(...) %}. Both work, but consistency would improve readability.

♻️ Optional: align return style
-      {{ return(dremio__generate_schema_name(custom_schema_name, node)) }}
+      {% do return(dremio__generate_schema_name(custom_schema_name, node)) %}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e_dbt_project/macros/system/generate_schema_name.sql` at line 23,
Replace the inline expression-style return with the statement-style return for
consistency: change the usage of the template expression "{{
return(dremio__generate_schema_name(custom_schema_name, node)) }}" to the
do-statement form "{% do return(dremio__generate_schema_name(custom_schema_name,
node)) %}" so the return in the macro dremio__generate_schema_name matches the
rest of the file's style.
tests/e2e_dbt_project/models/schema.yml (1)

350-350: Consider extracting the data_type logic into a macro for maintainability.

The chained conditionals for target-specific data types are becoming lengthy. As you add more adapters, this pattern will become harder to maintain.

💡 Optional: Create a reusable macro

You could create a macro in the project that centralizes the string type mapping:

{# macros/get_string_type.sql #}
{% macro get_string_type() %}
  {% if target.type in ['bigquery', 'databricks', 'athena', 'spark'] %}
    STRING
  {% elif target.type in ['trino', 'dremio', 'duckdb'] %}
    varchar
  {% elif target.type == 'redshift' %}
    character varying
  {% else %}
    TEXT
  {% endif %}
{% endmacro %}

Then use it in schema.yml:

data_type: "{{ get_string_type() }}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e_dbt_project/models/schema.yml` at line 350, Extract the inline
Jinja conditional in the data_type field into a reusable macro (e.g.,
get_string_type) that encapsulates the adapter-to-string-type mapping and then
call that macro from the schema.yml data_type line; specifically, create a macro
(name it get_string_type) that checks target.type and returns the appropriate
literal for bigquery/databricks/athena/spark, trino/dremio/duckdb, redshift, and
a default, and replace the existing inline expression in the data_type field
with "{{ get_string_type() }}" so the mapping logic lives in one place and is
easier to maintain.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/test-warehouse.yml:
- Around line 269-275: The script currently builds EXTRA_ARGS as a plain string
and expands it unquoted in the dbt run invocation, which is brittle and triggers
SC2086; change EXTRA_ARGS into a bash array (e.g., EXTRA_ARGS=() and
EXTRA_ARGS+=(--threads 1) when testing inputs.warehouse-type == "dremio") and
then invoke dbt run --target "${{ inputs.warehouse-type }}" "${EXTRA_ARGS[@]}"
|| true so each argument is passed safely without word-splitting or globbing.

In
`@elementary/monitor/dbt_project/macros/base_queries/current_tests_run_results_query.sql`:
- Line 3: Occurrences of ref(...) use the unsupported keyword argument package=;
update each ref call to the positional form with the package name first (e.g.
change ref('elementary_test_results', package='elementary') to the positional
form with 'elementary' as the first argument) for the following identifiers:
ref('elementary_test_results', package='elementary'), ref('dbt_run_results',
package='elementary'), ref('dbt_tests', package='elementary'), ref('dbt_models',
package='elementary'), and ref('dbt_sources', package='elementary').

In `@tests/e2e_dbt_project/ci/save_seed_cache.sh`:
- Around line 32-41: The readiness loops for ClickHouse (curl /ping) and
Postgres (pg_isready) currently break on success but silently continue on
exhaustion; update both loops in save_seed_cache.sh to detect when the retry
loop finishes without a successful check and immediately fail the script (exit
non-zero) with a clear error message; specifically, after the for ...; do ...;
done blocks that use curl -sf http://localhost:8123/ping and pg_isready -h
localhost -p 5432, add logic to check whether the loop exited via break or
exhausted all attempts and call exit 1 (or return non-zero) and log an error so
the CI job fails fast when services are not ready.

In `@tests/e2e_dbt_project/external_seeders/base.py`:
- Around line 24-27: The run helper currently uses subprocess.run(...,
shell=True) which opens a command-injection vector; change run(cmd: str, ...) to
accept either a sequence or build cmd as List[str] and call
subprocess.run(cmd_list, shell=False, check=check, **kw) so commands are
executed without a shell, and update callers (notably _upload_csvs_to_minio and
the mc_cmds construction in dremio.py) to pass a list of argv tokens instead of
interpolated shell strings; if you must keep composing shell strings, escape all
environment-derived values (minio_access_key, minio_secret_key, network, etc.)
with shlex.quote() before interpolation, but the preferred fix is refactoring
_upload_csvs_to_minio to build and pass list-based commands to run to eliminate
shell and chaining via "&&".
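A sketch of the list-based `run` helper the comment proposes (the real seeder's signature and call sites may differ):

```python
# Hypothetical replacement for the shell=True helper: execute argv tokens
# directly, with no shell, so values like keys or network names cannot be
# interpreted as shell syntax.
import subprocess
from typing import List


def run(cmd: List[str], check: bool = True) -> subprocess.CompletedProcess:
    return subprocess.run(cmd, shell=False, check=check,
                          capture_output=True, text=True)
```

Callers would then pass argv lists and issue one `run()` per command instead of chaining with `&&`.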

In `@tests/e2e_dbt_project/external_seeders/dremio.py`:
- Around line 336-341: The current try/except blocks around calls to self._sql
(e.g., where you run create_sql, COPY INTO and refresh operations) swallow
exceptions and only print warnings so the seeder can exit successfully with a
partial state; change the behavior to collect failures and re-raise after
attempting all tables: capture exceptions (including table_name and exception
message) into a list (e.g., failed_tables or errors) inside each except instead
of just printing, continue attempting remaining tables, and at the end of the
loading routine check the collection and raise a single aggregated Exception (or
RuntimeError) summarizing all table failures; reference the existing self._sql
calls, created_tables list and table_name variables to locate the blocks to
update, and apply the same pattern to the other try/except blocks noted for COPY
INTO and refresh.
- Around line 281-282: The retry exhaustion currently only prints "  ERROR:
Failed to create SeedFiles source after 3 attempts" and continues; change this
to raise an exception so CI fails fast: replace that print with raising a
descriptive exception (e.g., raise RuntimeError("Failed to create SeedFiles
source after 3 attempts")) at the same location where the S3/SeedFiles creation
retry loop ends (the code that prints that exact message), ensuring the
exception bubbles out of the loader so the test run fails.
- Around line 99-125: The HTTP calls currently have no timeouts and can hang;
update every requests call to include an explicit timeout (use timeout=(5, 30)
or similar) — specifically add the timeout param to requests.post in _get_token,
to both requests.post and the periodic requests.get in _sql (so each looped poll
cannot block past the deadline), and to the requests.post, requests.get, and
both requests.put calls inside _create_s3_source; keep the same headers/json
args and only append timeout=(5, 30) (or configurable constants) to each call.

In `@tests/e2e_dbt_project/external_seeders/spark.py`:
- Around line 51-55: The try/except around cursor.execute currently prints
errors and continues, which swallows failures; change the logic in the failing
blocks (the try/except handling the cursor.execute calls at the spots shown and
the similar blocks around lines handling other seed executions) to aggregate
failures and return a non-zero exit or re-raise after processing so the process
fails: collect exceptions into a failure list (e.g., seed_errors), append the
error and the SQL/seed info inside each except, and after the loop if
seed_errors is non-empty call sys.exit(1) or raise a RuntimeError summarizing
the failures so the test run exits non-zero instead of reporting success. Ensure
you update all occurrences referenced (the try/except around cursor.execute and
the other two similar blocks).
- Around line 45-49: The SQL-building uses unescaped dynamic identifiers
(`seed_schema`, `table_name`, `tmp_view`, and column names from `cols`), so
implement and use a helper like quote_identifier(name) that escapes any
backticks in name (replace ` with ``) and then wraps the result in backticks;
apply this helper when building `col_defs` (the generator that builds f"`{c}`
STRING"), when interpolating `{table_name}` and `{seed_schema}` in the CREATE
TABLE SQL, and wherever `tmp_view` is used (also in the other block around lines
58-70) so all dynamic identifiers are safely quoted before embedding in Spark
SQL.
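A sketch of the quoting helper, assuming Spark's backtick rules (embedded backticks are doubled); `build_create_table` is an illustrative stand-in for the seeder's SQL assembly:

```python
# Hypothetical identifier quoting for Spark SQL: escape embedded backticks,
# then wrap in backticks, before interpolating into DDL.
def quote_identifier(name: str) -> str:
    return "`" + name.replace("`", "``") + "`"


def build_create_table(seed_schema: str, table_name: str, cols: list) -> str:
    col_defs = ", ".join(f"{quote_identifier(c)} STRING" for c in cols)
    return (
        f"CREATE TABLE {quote_identifier(seed_schema)}."
        f"{quote_identifier(table_name)} ({col_defs})"
    )
```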

In `@tests/e2e_dbt_project/load_seeds_external.py`:
- Around line 29-44: The CLI docstring for main currently says SCHEMA_NAME is
the target schema but for adapter "spark" that argument is ignored; update main
to either document this behavior and/or enforce it: inside main, detect if
adapter == "spark" and set schema_name = "test_seeds" (or call out and
raise/echo a warning) before creating seeder_cls = SEEDERS[adapter] and
instantiating seeder = seeder_cls(...); also update the function docstring to
explicitly state that SCHEMA_NAME is ignored for Spark (or mention that it will
be overridden to "test_seeds") so callers understand the behavior.


ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 7969949 and d48a335.

⛔ Files ignored due to path filters (2)
  • tests/e2e_dbt_project/data/training/numeric_column_anomalies_training.csv is excluded by !**/*.csv
  • tests/e2e_dbt_project/data/validation/numeric_column_anomalies_validation.csv is excluded by !**/*.csv
📒 Files selected for processing (54)
  • .github/workflows/test-warehouse.yml
  • elementary/monitor/dbt_project/dbt_project.yml
  • elementary/monitor/dbt_project/macros/alerts/population/model_alerts.sql
  • elementary/monitor/dbt_project/macros/alerts/population/source_freshness_alerts.sql
  • elementary/monitor/dbt_project/macros/alerts/population/test_alerts.sql
  • elementary/monitor/dbt_project/macros/base_queries/current_tests_run_results_query.sql
  • elementary/monitor/dbt_project/macros/base_queries/owners.sql
  • elementary/monitor/dbt_project/macros/base_queries/resources.sql
  • elementary/monitor/dbt_project/macros/base_queries/tags.sql
  • elementary/monitor/dbt_project/macros/can_upload_source_freshness.sql
  • elementary/monitor/dbt_project/macros/get_dbt_models_test_coverage.sql
  • elementary/monitor/dbt_project/macros/get_exposures.sql
  • elementary/monitor/dbt_project/macros/get_models.sql
  • elementary/monitor/dbt_project/macros/get_models_latest_invocation.sql
  • elementary/monitor/dbt_project/macros/get_models_latest_invocations_data.sql
  • elementary/monitor/dbt_project/macros/get_models_runs.sql
  • elementary/monitor/dbt_project/macros/get_nodes_depends_on_nodes.sql
  • elementary/monitor/dbt_project/macros/get_result_rows_agate.sql
  • elementary/monitor/dbt_project/macros/get_seeds.sql
  • elementary/monitor/dbt_project/macros/get_singular_tests.sql
  • elementary/monitor/dbt_project/macros/get_snapshots.sql
  • elementary/monitor/dbt_project/macros/get_source_freshness_results.sql
  • elementary/monitor/dbt_project/macros/get_sources.sql
  • elementary/monitor/dbt_project/macros/get_test_last_invocation.sql
  • elementary/monitor/dbt_project/macros/get_test_results.sql
  • elementary/monitor/dbt_project/macros/get_tests.sql
  • elementary/monitor/dbt_project/macros/overrides/cast_column.sql
  • elementary/monitor/dbt_project/macros/overrides/dateadd.sql
  • elementary/monitor/dbt_project/models/alerts/alerts_v2.sql
  • elementary/monitor/dbt_project/models/alerts/deprecated/alerts.sql
  • tests/e2e/report/fixtures/elementary_output.json
  • tests/e2e_dbt_project/ci/restore_seed_cache.sh
  • tests/e2e_dbt_project/ci/save_seed_cache.sh
  • tests/e2e_dbt_project/dbt_project.yml
  • tests/e2e_dbt_project/docker-compose.yml
  • tests/e2e_dbt_project/docker/dremio/dremio-setup.sh
  • tests/e2e_dbt_project/docker/spark/Dockerfile
  • tests/e2e_dbt_project/external_seeders/__init__.py
  • tests/e2e_dbt_project/external_seeders/base.py
  • tests/e2e_dbt_project/external_seeders/dremio.py
  • tests/e2e_dbt_project/external_seeders/spark.py
  • tests/e2e_dbt_project/generate_data.py
  • tests/e2e_dbt_project/load_seeds_external.py
  • tests/e2e_dbt_project/macros/system/drop_test_schemas.sql
  • tests/e2e_dbt_project/macros/system/generate_schema_name.sql
  • tests/e2e_dbt_project/models/nested/models/tree/nested.sql
  • tests/e2e_dbt_project/models/numeric_column_anomalies.sql
  • tests/e2e_dbt_project/models/one.sql
  • tests/e2e_dbt_project/models/schema.yml
  • tests/e2e_dbt_project/tests/singular_test_with_no_ref.sql
  • tests/e2e_dbt_project/tests/singular_test_with_one_ref.sql
  • tests/e2e_dbt_project/tests/singular_test_with_source_ref.sql
  • tests/e2e_dbt_project/tests/singular_test_with_two_refs.sql
  • tests/profiles/profiles.yml.j2
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/e2e_dbt_project/docker/dremio/dremio-setup.sh

devin-ai-integration bot and others added 3 commits March 3, 2026 11:11
- Revert ref() syntax from package= keyword to positional form in 20 monitor macros
- Add HTTP_TIMEOUT constant and apply to all 7 requests calls in dremio.py
- Raise RuntimeError on S3 source creation failure instead of silent print
- Aggregate and raise failures in dremio.py and spark.py load() methods
- Fix shell=True injection: convert base.py run() to list-based subprocess
- Quote MinIO credentials with shlex.quote() in dremio.py
- Add backtick-escaping helper _q() for Spark SQL identifiers
- Fail fast on readiness timeout in save_seed_cache.sh
- Convert EXTRA_ARGS to bash array in test-warehouse.yml (SC2086)
- Remove continue-on-error from dbt test step
- Add explicit day case in dateadd.sql override
- Document Spark schema_name limitation in load_seeds_external.py

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (6)
elementary/monitor/dbt_project/macros/overrides/dateadd.sql (1)

34-36: Silent fallback to day for unrecognized dateparts may mask bugs.

If a caller passes an invalid or misspelled datepart (e.g., 'years' instead of 'year'), this silently defaults to day without any indication. Consider raising an error or logging a warning to surface misuse early.

🛡️ Proposed fix to raise an error for unrecognized dateparts
     {% elif datepart == 'day' %}
         TIMESTAMPADD(SQL_TSI_DAY, CAST({{interval}} as int), CAST({{from_date_or_timestamp}} as TIMESTAMP))
     {% else %}
-        {# Fallback for unrecognized dateparts — default to day #}
-        TIMESTAMPADD(SQL_TSI_DAY, CAST({{interval}} as int), CAST({{from_date_or_timestamp}} as TIMESTAMP))
+        {{ exceptions.raise_compiler_error("dremio__dateadd: unrecognized datepart '" ~ datepart ~ "'. Supported values: year, quarter, month, week, day, hour, minute, second.") }}
     {% endif %}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@elementary/monitor/dbt_project/macros/overrides/dateadd.sql` around lines 34
- 36, The current else branch silently falls back to TIMESTAMPADD(SQL_TSI_DAY,
...) for unrecognized dateparts; replace that silent fallback with a clear
failure by invoking dbt's compiler error helper
(exceptions.raise_compiler_error) and include the invalid datepart value in the
message so callers see the misuse; update the macro (dateadd.sql) to call
exceptions.raise_compiler_error with a descriptive message instead of returning
TIMESTAMPADD for the else branch (keep TIMESTAMPADD only for the valid "day"
branch).
tests/e2e_dbt_project/external_seeders/base.py (1)

58-58: Use os.path.splitext for table name derivation.

Line 58 can strip internal .csv fragments (e.g. a.csv.backup.csv → a.backup). splitext is safer and explicit.

♻️ Proposed change
-                table_name = fname.replace(".csv", "")
+                table_name, _ = os.path.splitext(fname)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e_dbt_project/external_seeders/base.py` at line 58, Replace the
fragile string replace used to derive table_name from fname (table_name =
fname.replace(".csv", "")) with a splitext-based extraction: use
os.path.splitext(fname)[0] so filenames like "a.csv.backup.csv" correctly yield
"a.csv.backup" (update imports to include os if missing and adjust any
references to table_name accordingly).
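The difference between the two approaches can be seen in plain Python:

```python
import os

fname = "a.csv.backup.csv"

# Fragile: str.replace removes every ".csv" fragment, not just the extension.
print(fname.replace(".csv", ""))   # a.backup

# Safe: splitext strips only the final extension.
print(os.path.splitext(fname)[0])  # a.csv.backup
```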
tests/e2e_dbt_project/external_seeders/spark.py (1)

38-40: Guarantee cursor/connection cleanup on early exceptions.

If execution fails before Line 84, resource cleanup is skipped. Wrap DB work in try/finally (or context managers).

♻️ Proposed structure
         conn = hive.Connection(host=host, port=port, username="dbt")
         cursor = conn.cursor()
-
-        print(f"Creating schema '{seed_schema}'...")
-        cursor.execute(f"CREATE DATABASE IF NOT EXISTS `{seed_schema}`")
-
-        for subdir, csv_path, table_name in self.iter_seed_csvs():
-            ...
-
-        cursor.close()
-        conn.close()
+        try:
+            print(f"Creating schema '{seed_schema}'...")
+            cursor.execute(f"CREATE DATABASE IF NOT EXISTS `{seed_schema}`")
+
+            for subdir, csv_path, table_name in self.iter_seed_csvs():
+                ...
+        finally:
+            cursor.close()
+            conn.close()

Also applies to: 84-85

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e_dbt_project/external_seeders/spark.py` around lines 38 - 40, Wrap
the Hive connection and cursor usage in a try/finally (or use context managers)
so they are always cleaned up on early exceptions: create conn via
hive.Connection and cursor = conn.cursor() inside a try block, perform DB
operations there, and in the finally ensure cursor.close() and conn.close() are
called (apply the same pattern to the later block around the code at lines
84-85). Reference the variables conn and cursor and the DB operation blocks so
you ensure both the initial connection/cursor and the later one are always
closed even if an exception is raised.
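Besides try/finally, the cleanup guarantee can be had with contextlib.closing. A self-contained sketch, where FakeConn/FakeCursor are hypothetical stand-ins for pyhive's hive.Connection and its cursor:

```python
from contextlib import closing

class FakeCursor:
    def __init__(self):
        self.closed = False
    def execute(self, sql):
        # Simulate a failure mid-load to show cleanup still runs.
        raise RuntimeError("simulated mid-load failure")
    def close(self):
        self.closed = True

class FakeConn:
    def __init__(self):
        self.closed = False
        self._cursor = FakeCursor()
    def cursor(self):
        return self._cursor
    def close(self):
        self.closed = True

conn = FakeConn()
try:
    # closing() calls .close() on both objects even when execute() raises.
    with closing(conn) as c, closing(c.cursor()) as cur:
        cur.execute("CREATE DATABASE IF NOT EXISTS seeds")
except RuntimeError:
    pass

print(conn.closed, conn._cursor.closed)  # True True
```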
tests/e2e_dbt_project/external_seeders/dremio.py (1)

337-351: Escape quoted identifiers for schema/table/column names.

seed_schema, table_name, and CSV header values are embedded into quoted identifiers without escaping internal " characters.

♻️ Suggested helper and usage
 class DremioExternalSeeder(ExternalSeeder):
@@
+    `@staticmethod`
+    def _q_ident(name: str) -> str:
+        return '"' + name.replace('"', '""') + '"'
@@
-        nessie_ns = f'NessieSource."{seed_schema}"'
+        q = self._q_ident
+        nessie_ns = f"NessieSource.{q(seed_schema)}"
@@
-            fqn = f'{nessie_ns}."{table_name}"'
+            fqn = f"{nessie_ns}.{q(table_name)}"
@@
-            col_defs = ", ".join(f'"{c}" VARCHAR' for c in cols)
+            col_defs = ", ".join(f"{q(c)} VARCHAR" for c in cols)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e_dbt_project/external_seeders/dremio.py` around lines 337 - 351, The
code builds quoted identifiers using seed_schema, table_name and CSV header
values without escaping embedded double-quotes (e.g., nessie_ns, fqn, col_defs),
which can produce invalid SQL or injection; add a small helper (e.g.,
quote_identifier(name: str) -> str) that returns the name with internal "
characters doubled and wrapped in double quotes, then use it when composing
nessie_ns, fqn and each column name produced by csv_columns() (replace
f'{...}"{table_name}"' and f'"{c}" VARCHAR' with calls to the helper) so all
schema/table/column identifiers are safely escaped before building create_sql.
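A minimal runnable sketch of the proposed helper (the name quote_identifier follows the prompt; the real seeder may integrate it as a method, e.g. _q_ident, as in the diff above):

```python
def quote_identifier(name: str) -> str:
    # Double any embedded double-quote, then wrap the whole name in
    # double quotes: the standard SQL rule for quoted identifiers.
    return '"' + name.replace('"', '""') + '"'

print(quote_identifier("plain"))   # "plain"
print(quote_identifier('we"ird'))  # "we""ird"
```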
.github/workflows/test-warehouse.yml (2)

255-255: Simplify redundant cache-save condition.

Line 255 already restricts to (postgres || clickhouse), so the extra != duckdb/trino/spark/dremio checks are unnecessary and make maintenance harder.

Minimal cleanup
-        if: steps.seed-cache.outputs.cache-hit != 'true' && inputs.warehouse-type != 'duckdb' && inputs.warehouse-type != 'trino' && inputs.warehouse-type != 'spark' && inputs.warehouse-type != 'dremio' && (inputs.warehouse-type == 'postgres' || inputs.warehouse-type == 'clickhouse')
+        if: steps.seed-cache.outputs.cache-hit != 'true' && (inputs.warehouse-type == 'postgres' || inputs.warehouse-type == 'clickhouse')
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/test-warehouse.yml at line 255, The conditional on line
containing "if: steps.seed-cache.outputs.cache-hit != 'true' &&
inputs.warehouse-type != 'duckdb' && inputs.warehouse-type != 'trino' &&
inputs.warehouse-type != 'spark' && inputs.warehouse-type != 'dremio' &&
(inputs.warehouse-type == 'postgres' || inputs.warehouse-type == 'clickhouse')"
is redundant; remove the unnecessary "!= duckdb/trino/spark/dremio" checks and
simplify it to check only that the cache missed and the warehouse type is
postgres or clickhouse, e.g. replace the whole condition with "if:
steps.seed-cache.outputs.cache-hit != 'true' && (inputs.warehouse-type ==
'postgres' || inputs.warehouse-type == 'clickhouse')" so future maintenance is
easier.

93-95: Align cache-enabled adapter lists between restore and save paths.

Lines 93-95 allow cache compute/restore for trino, dremio, and spark, but Lines 254-260 never save cache for those adapters in this workflow. This creates ongoing cache misses and extra cache API work. Either add save support for them or remove them from restore eligibility.

Also applies to: 254-260

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/test-warehouse.yml around lines 93 - 95, The restore
conditional that checks inputs.warehouse-type includes adapters (trino, dremio,
spark, duckdb) that are not included in the cache save path, causing cache
misses; either remove those adapters from the restore condition (the if:
inputs.warehouse-type == ... expression used before the run block) or add them
to the cache save step (the workflow’s cache save step / actions/cache usage
later in the file that currently limits saved adapters) so the lists match
exactly; update the same adapter names in both the restore and save/cache steps
to keep symmetry.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/test-warehouse.yml:
- Around line 145-153: The workflow currently uses "docker compose up
--exit-code-from dremio-setup" which implicitly aborts all services when the
dremio-setup container exits; instead bring the stack up detached, verify
readiness, then run the setup container separately and fail the job if it
returns non-zero: replace the single "docker compose up --exit-code-from
dremio-setup dremio-setup" step with a two-step sequence that (1) runs "docker
compose up -d" to start Dremio and MinIO, waits/health-checks until the services
are ready (e.g., polling Dremio/MinIO endpoints or using docker compose
ps/healthchecks), and (2) runs the setup as an isolated run (e.g., "docker
compose run --rm dremio-setup") capturing its exit code and failing the job when
that exit code is non-zero so successful setup does not stop the background
services.

In `@tests/e2e_dbt_project/external_seeders/dremio.py`:
- Around line 251-276: The 409-handling branch currently returns unconditionally
even when GET (get_resp) or PUT (put_resp) fail, preventing retries and leaving
SeedFiles potentially broken; change the logic in the 409 branch so you only
return when the PUT succeeds (put_resp.status_code == 200) and otherwise do not
return—log the warning for non-200 GET/PUT responses but let the function
continue (so callers/retries can handle/fix the failure) by removing the
premature `return` after the GET/PUT failure paths and keeping the
successful-return only inside the `if put_resp.status_code == 200` block.

In `@tests/e2e_dbt_project/external_seeders/spark.py`:
- Line 32: The runtime import "from pyhive import hive" introduces a missing
dependency; add "pyhive" (and optional transitive packages such as "thrift-sasl"
if SASL auth is used) to the project's dependency manifests—e.g., add pyhive to
the main dependencies in pyproject.toml (or the relevant extras group used for
e2e/CI) and also to dev-requirements.txt so CI/e2e environments install it;
ensure the versions are compatible with existing thrift libraries and update
lockfiles if present.

---

Nitpick comments:
In @.github/workflows/test-warehouse.yml:
- Line 255: The conditional on line containing "if:
steps.seed-cache.outputs.cache-hit != 'true' && inputs.warehouse-type !=
'duckdb' && inputs.warehouse-type != 'trino' && inputs.warehouse-type != 'spark'
&& inputs.warehouse-type != 'dremio' && (inputs.warehouse-type == 'postgres' ||
inputs.warehouse-type == 'clickhouse')" is redundant; remove the unnecessary "!=
duckdb/trino/spark/dremio" checks and simplify it to check only that the cache
missed and the warehouse type is postgres or clickhouse, e.g. replace the whole
condition with "if: steps.seed-cache.outputs.cache-hit != 'true' &&
(inputs.warehouse-type == 'postgres' || inputs.warehouse-type == 'clickhouse')"
so future maintenance is easier.
- Around line 93-95: The restore conditional that checks inputs.warehouse-type
includes adapters (trino, dremio, spark, duckdb) that are not included in the
cache save path, causing cache misses; either remove those adapters from the
restore condition (the if: inputs.warehouse-type == ... expression used before
the run block) or add them to the cache save step (the workflow’s cache save
step / actions/cache usage later in the file that currently limits saved
adapters) so the lists match exactly; update the same adapter names in both the
restore and save/cache steps to keep symmetry.

In `@elementary/monitor/dbt_project/macros/overrides/dateadd.sql`:
- Around line 34-36: The current else branch silently falls back to
TIMESTAMPADD(SQL_TSI_DAY, ...) for unrecognized dateparts; replace that silent
fallback with a clear failure by invoking dbt's compiler error helper
(exceptions.raise_compiler_error) and include the invalid datepart value in the
message so callers see the misuse; update the macro (dateadd.sql) to call
exceptions.raise_compiler_error with a descriptive message instead of returning
TIMESTAMPADD for the else branch (keep TIMESTAMPADD only for the valid "day"
branch).

In `@tests/e2e_dbt_project/external_seeders/base.py`:
- Line 58: Replace the fragile string replace used to derive table_name from
fname (table_name = fname.replace(".csv", "")) with a splitext-based extraction:
use os.path.splitext(fname)[0] so filenames like "a.csv.backup.csv" correctly
yield "a.csv.backup" (update imports to include os if missing and adjust any
references to table_name accordingly).

In `@tests/e2e_dbt_project/external_seeders/dremio.py`:
- Around line 337-351: The code builds quoted identifiers using seed_schema,
table_name and CSV header values without escaping embedded double-quotes (e.g.,
nessie_ns, fqn, col_defs), which can produce invalid SQL or injection; add a
small helper (e.g., quote_identifier(name: str) -> str) that returns the name
with internal " characters doubled and wrapped in double quotes, then use it
when composing nessie_ns, fqn and each column name produced by csv_columns()
(replace f'{...}"{table_name}"' and f'"{c}" VARCHAR' with calls to the helper)
so all schema/table/column identifiers are safely escaped before building
create_sql.

In `@tests/e2e_dbt_project/external_seeders/spark.py`:
- Around line 38-40: Wrap the Hive connection and cursor usage in a try/finally
(or use context managers) so they are always cleaned up on early exceptions:
create conn via hive.Connection and cursor = conn.cursor() inside a try block,
perform DB operations there, and in the finally ensure cursor.close() and
conn.close() are called (apply the same pattern to the later block around the
code at lines 84-85). Reference the variables conn and cursor and the DB
operation blocks so you ensure both the initial connection/cursor and the later
one are always closed even if an exception is raised.


📥 Commits

Reviewing files that changed from the base of the PR and between d48a335 and 19652db.

📒 Files selected for processing (10)
  • .github/workflows/test-warehouse.yml
  • elementary/monitor/dbt_project/macros/alerts/population/model_alerts.sql
  • elementary/monitor/dbt_project/macros/alerts/population/source_freshness_alerts.sql
  • elementary/monitor/dbt_project/macros/get_source_freshness_results.sql
  • elementary/monitor/dbt_project/macros/overrides/dateadd.sql
  • tests/e2e_dbt_project/ci/save_seed_cache.sh
  • tests/e2e_dbt_project/external_seeders/base.py
  • tests/e2e_dbt_project/external_seeders/dremio.py
  • tests/e2e_dbt_project/external_seeders/spark.py
  • tests/e2e_dbt_project/load_seeds_external.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • tests/e2e_dbt_project/ci/save_seed_cache.sh
  • elementary/monitor/dbt_project/macros/alerts/population/model_alerts.sql
  • elementary/monitor/dbt_project/macros/alerts/population/source_freshness_alerts.sql

f"(schema={seed_schema}) ==="
)

from pyhive import hive

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Checking line 32 in spark.py =="
if [ -f "tests/e2e_dbt_project/external_seeders/spark.py" ]; then
  sed -n '25,40p' "tests/e2e_dbt_project/external_seeders/spark.py"
else
  echo "File not found, listing directory contents:"
  find tests -name "spark.py" -type f 2>/dev/null | head -5
fi

echo
echo "== All pyhive imports in e2e seeder code =="
rg -n 'pyhive' tests/e2e_dbt_project -C2

echo
echo "== Searching for dependency manifests =="
fd -t f '(pyproject\.toml|setup\.py|setup\.cfg|requirements.*\.txt|Pipfile|poetry\.lock)' --max-results 20

echo
echo "== Checking for pyhive in dependency files =="
fd -t f '(pyproject\.toml|setup\.py|setup\.cfg|requirements.*\.txt)' | while read -r f; do
  if rg -q 'pyhive|PyHive' "$f" 2>/dev/null; then
    echo "Found in: $f"
    rg -n 'pyhive|PyHive' "$f"
  fi
done

Repository: elementary-data/elementary

Length of output: 1138


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Content of pyproject.toml (looking for pyhive/thrift) =="
if [ -f "pyproject.toml" ]; then
  cat pyproject.toml | grep -i -A5 -B5 'pyhive\|thrift\|test\|dev' || echo "No matches found, showing full file structure:"
  echo "---"
  grep -E '^\[.*\]|pyhive|thrift' pyproject.toml || true
fi

echo
echo "== Content of dev-requirements.txt =="
if [ -f "dev-requirements.txt" ]; then
  cat dev-requirements.txt
fi

echo
echo "== Full context around line 32 in spark.py (check for conditionals) =="
sed -n '1,40p' "tests/e2e_dbt_project/external_seeders/spark.py"

echo
echo "== Looking for all requirements files =="
find . -name '*requirement*.txt' -type f 2>/dev/null | grep -v '.git'

Repository: elementary-data/elementary

Length of output: 2655


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Full pyproject.toml [tool.poetry.dependencies] section =="
sed -n '/^\[tool\.poetry\.dependencies\]/,/^\[/p' pyproject.toml | head -50

echo
echo "== Check for optional dependencies (extras) =="
sed -n '/^\[tool\.poetry\.extras\]/,/^\[/p' pyproject.toml

echo
echo "== All occurrences of 'pyhive' or 'thrift' in repository =="
rg -i 'pyhive|thrift_sasl|sasl' . --type py --type toml --type txt 2>/dev/null || echo "No matches found"

Repository: elementary-data/elementary

Length of output: 2185


Add pyhive to project dependencies.

Line 32 introduces a hard runtime dependency on pyhive that is missing from all dependency manifests (pyproject.toml, dev-requirements.txt). If the Spark seeding path is executed in CI/e2e tests, the import will fail at runtime. Add pyhive with any required transitive dependencies (e.g., thrift-sasl for SASL authentication) to the appropriate dependency group.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e_dbt_project/external_seeders/spark.py` at line 32, The runtime
import "from pyhive import hive" introduces a missing dependency; add "pyhive"
(and optional transitive packages such as "thrift-sasl" if SASL auth is used) to
the project's dependency manifests—e.g., add pyhive to the main dependencies in
pyproject.toml (or the relevant extras group used for e2e/CI) and also to
dev-requirements.txt so CI/e2e environments install it; ensure the versions are
compatible with existing thrift libraries and update lockfiles if present.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
tests/e2e_dbt_project/external_seeders/dremio.py (1)

257-283: ⚠️ Potential issue | 🟠 Major

Do not return from the 409 branch unless update succeeds.

Line [282] exits even when GET by-path or PUT update fails, which skips the v2 fallback/retry path and can leave SeedFiles broken.

🔧 Minimal fix
             if resp.status_code == 409:
                 # Source already exists – update it with PUT
                 print("  SeedFiles source exists, updating...")
                 # Get the existing source id + tag for the update
                 get_resp = requests.get(
                     f"http://{self.dremio_host}:{self.dremio_port}/api/v3/catalog/by-path/SeedFiles",
                     headers=headers,
                     timeout=self.HTTP_TIMEOUT,
                 )
                 if get_resp.status_code == 200:
                     existing = get_resp.json()
                     payload["id"] = existing["id"]
                     payload["tag"] = existing.get("tag", "")
                     put_resp = requests.put(
                         f"http://{self.dremio_host}:{self.dremio_port}/api/v3/catalog/{existing['id']}",
                         headers=headers,
                         json=payload,
                         timeout=self.HTTP_TIMEOUT,
                     )
                     if put_resp.status_code == 200:
                         print("  SeedFiles S3 source updated")
                         return
                     print(
                         f"  Warning: update returned {put_resp.status_code}: {put_resp.text}"
                     )
-                return
+                else:
+                    print(
+                        f"  Warning: failed fetching existing source "
+                        f"({get_resp.status_code}: {get_resp.text[:200]})"
+                    )
+                # Continue to v2 fallback / retry.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e_dbt_project/external_seeders/dremio.py` around lines 257 - 283, The
409 branch currently returns early even when the GET by-path or PUT update
fails; update the logic in the block handling resp.status_code == 409
(referencing payload, get_resp, existing, put_resp) so that you only return
immediately when the PUT succeeds (put_resp.status_code == 200); if
get_resp.status_code != 200 or the PUT does not succeed, do not return—log or
print the warning but let execution continue so the v2 fallback/retry path can
run and attempt to recover SeedFiles.
🧹 Nitpick comments (1)
tests/e2e_dbt_project/external_seeders/dremio.py (1)

23-39: Avoid silent parse failures in _docker_defaults.

Lines [38-39] and [54-55] swallow all exceptions, which makes credential-default parsing failures hard to diagnose in CI.

🔧 Suggested tweak
-    except Exception:
-        pass
+    except Exception as e:
+        print(f"  Warning: failed reading defaults from {compose_path}: {e}")
@@
-    except Exception:
-        pass
+    except Exception as e:
+        print(f"  Warning: failed reading defaults from {setup_path}: {e}")

Also applies to: 45-55

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e_dbt_project/external_seeders/dremio.py` around lines 23 - 39, The
try/except in _docker_defaults currently swallows all exceptions during parsing
of compose_path and environment items (around the block that opens compose_path,
iterates services -> "dremio-minio" -> "environment", and sets
defaults["MINIO_ACCESS_KEY"/"MINIO_SECRET_KEY"]) which hides CI failures; change
the broad except to either narrow exceptions you expect (e.g.,
FileNotFoundError, yaml.YAMLError, ValueError) or log the exception details and
context before continuing (use logging.exception or raise a wrapped error) so
failures are visible — apply the same change to the second try/except that
handles the other parse block so neither silent-swallowing try/except hides
parsing errors.
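A simplified sketch of the narrow-except pattern the comment suggests. read_defaults and its line-based parsing are hypothetical stand-ins for _docker_defaults, not its actual YAML logic:

```python
import logging

def read_defaults(path: str) -> dict:
    defaults = {}
    try:
        with open(path) as f:
            for line in f:
                if "=" in line:
                    key, value = line.strip().split("=", 1)
                    defaults[key] = value
    except (FileNotFoundError, ValueError) as e:
        # Narrow except: only the expected failure modes are caught and
        # logged; anything unexpected still propagates for CI to surface.
        logging.warning("failed reading defaults from %s: %s", path, e)
    return defaults

print(read_defaults("/nonexistent/compose.yml"))  # {} (warning logged)
```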
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/e2e_dbt_project/external_seeders/spark.py`:
- Around line 54-62: The current logic creates empty tables with "CREATE TABLE
IF NOT EXISTS" which preserves prior data; change the empty-seed path in
spark.py to ensure a clean table by using "CREATE OR REPLACE TABLE" (or by
issuing "DROP TABLE IF EXISTS {q(seed_schema)}.{q(table_name)}" then a plain
CREATE TABLE) so header-only CSVs do not retain old rows; update the sql string
built for the empty-CSV branch (the same variable passed to cursor.execute)
accordingly and keep the existing try/except that appends to failures to
preserve error handling.

---

Duplicate comments:
In `@tests/e2e_dbt_project/external_seeders/dremio.py`:
- Around line 257-283: The 409 branch currently returns early even when the GET
by-path or PUT update fails; update the logic in the block handling
resp.status_code == 409 (referencing payload, get_resp, existing, put_resp) so
that you only return immediately when the PUT succeeds (put_resp.status_code ==
200); if get_resp.status_code != 200 or the PUT does not succeed, do not
return—log or print the warning but let execution continue so the v2
fallback/retry path can run and attempt to recover SeedFiles.

---

Nitpick comments:
In `@tests/e2e_dbt_project/external_seeders/dremio.py`:
- Around line 23-39: The try/except in _docker_defaults currently swallows all
exceptions during parsing of compose_path and environment items (around the
block that opens compose_path, iterates services -> "dremio-minio" ->
"environment", and sets defaults["MINIO_ACCESS_KEY"/"MINIO_SECRET_KEY"]) which
hides CI failures; change the broad except to either narrow exceptions you
expect (e.g., FileNotFoundError, yaml.YAMLError, ValueError) or log the
exception details and context before continuing (use logging.exception or raise
a wrapped error) so failures are visible — apply the same change to the second
try/except that handles the other parse block so neither silent-swallowing
try/except hides parsing errors.


📥 Commits

Reviewing files that changed from the base of the PR and between 19652db and a63ce6b.

📒 Files selected for processing (2)
  • tests/e2e_dbt_project/external_seeders/dremio.py
  • tests/e2e_dbt_project/external_seeders/spark.py

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (3)
tests/e2e_dbt_project/external_seeders/spark.py (1)

42-42: Use _q() for schema identifier quoting.

Line 42 uses raw backtick quoting for seed_schema while the rest of the file consistently uses the _q() helper for identifier escaping. For consistency and to maintain the quoting invariant established by the helper:

♻️ Proposed fix
         print(f"Creating schema '{seed_schema}'...")
-        cursor.execute(f"CREATE DATABASE IF NOT EXISTS `{seed_schema}`")
+        cursor.execute(f"CREATE DATABASE IF NOT EXISTS {_q(seed_schema)}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e_dbt_project/external_seeders/spark.py` at line 42, The CREATE
DATABASE statement uses raw backticks around seed_schema instead of the
project's identifier-quoting helper; update the cursor.execute call that builds
the SQL (the line with cursor.execute(f"CREATE DATABASE IF NOT EXISTS
`{seed_schema}`")) to call the _q() helper for escaping (e.g. use
_q(seed_schema) inside the CREATE DATABASE IF NOT EXISTS expression) so it
follows the same quoting invariant as the rest of the file.
tests/e2e_dbt_project/external_seeders/dremio.py (2)

360-387: Consider idempotent table recreation to prevent duplicate data on reruns.

The current pattern uses CREATE TABLE IF NOT EXISTS (line 362) followed by COPY INTO (line 385). If the table already exists from a previous run with data, COPY INTO will append rows, causing duplicates.

Unlike the Spark seeder which drops before creating, this could lead to unexpected state during local development reruns.

♻️ Proposed fix for idempotent table loading
             # Create empty Iceberg table with VARCHAR columns
             col_defs = ", ".join(f'"{c}" VARCHAR' for c in cols)
-            create_sql = f"CREATE TABLE IF NOT EXISTS {fqn} ({col_defs})"
+            # Drop and recreate to ensure idempotent reruns
+            drop_sql = f"DROP TABLE IF EXISTS {fqn}"
+            create_sql = f"CREATE TABLE {fqn} ({col_defs})"
             try:
+                self._sql(token, drop_sql, timeout=60)
                 self._sql(token, create_sql, timeout=60)
                 created_tables.append(table_name)
             except Exception as e:

38-39: Silent exception handling obscures configuration issues.

The try-except-pass blocks when reading docker-compose.yml and dremio-setup.sh silently swallow all exceptions. While failures here aren't fatal (defaults are optional), logging at debug level would help diagnose misconfiguration.

♻️ Optional: Add debug logging
+import logging
+
+logger = logging.getLogger(__name__)
+
 def _docker_defaults() -> dict[str, str]:
     ...
     try:
         with open(compose_path) as fh:
             cfg = yaml.safe_load(fh)
         ...
     except Exception:
-        pass
+        logger.debug("Could not read docker-compose.yml for defaults", exc_info=True)

     ...
     try:
         with open(setup_path) as fh:
             content = fh.read()
         ...
     except Exception:
-        pass
+        logger.debug("Could not read dremio-setup.sh for defaults", exc_info=True)

Also applies to: 54-55

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e_dbt_project/external_seeders/dremio.py` around lines 38 - 39,
Replace the silent "except Exception: pass" blocks in dremio.py (the ones
wrapping reading docker-compose.yml and dremio-setup.sh) with debug-level
logging that records the exception and contextual message; ensure a logger is
created (e.g., logger = logging.getLogger(__name__)) and call
logger.debug("Failed to read docker-compose.yml, using defaults", exc_info=True)
and similarly for the dremio-setup.sh block so configuration failures are
visible in debug logs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@tests/e2e_dbt_project/external_seeders/dremio.py`:
- Around line 38-39: Replace the silent "except Exception: pass" blocks in
dremio.py (the ones wrapping reading docker-compose.yml and dremio-setup.sh)
with debug-level logging that records the exception and contextual message;
ensure a logger is created (e.g., logger = logging.getLogger(__name__)) and call
logger.debug("Failed to read docker-compose.yml, using defaults", exc_info=True)
and similarly for the dremio-setup.sh block so configuration failures are
visible in debug logs.

In `@tests/e2e_dbt_project/external_seeders/spark.py`:
- Line 42: The CREATE DATABASE statement uses raw backticks around seed_schema
instead of the project's identifier-quoting helper; update the cursor.execute
call that builds the SQL (the line with cursor.execute(f"CREATE DATABASE IF NOT
EXISTS `{seed_schema}`")) to call the _q() helper for escaping (e.g. use
_q(seed_schema) inside the CREATE DATABASE IF NOT EXISTS expression) so it
follows the same quoting invariant as the rest of the file.


📥 Commits

Reviewing files that changed from the base of the PR and between a63ce6b and 2263199.

📒 Files selected for processing (2)
  • tests/e2e_dbt_project/external_seeders/dremio.py
  • tests/e2e_dbt_project/external_seeders/spark.py

devin-ai-integration bot and others added 2 commits March 3, 2026 11:50
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
@devin-ai-integration
Contributor Author

Follow-up on CodeRabbit comment about pyhive in SparkExternalSeeder: CI installs Spark deps via dbt-spark[PyHive], which brings in pyhive. I also added an explicit ImportError -> RuntimeError with a clear install hint so local runs fail with an actionable message.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (5)
.github/workflows/test-warehouse.yml (1)

189-196: Prefer a case statement for adapter membership.

The loop works, but a case statement is shorter and easier to maintain as the adapter list evolves.

♻️ Optional simplification
-          DOCKER_ADAPTERS="postgres clickhouse trino dremio duckdb spark"
-          IS_DOCKER=false
-          for adapter in $DOCKER_ADAPTERS; do
-            if [ "$adapter" = "${{ inputs.warehouse-type }}" ]; then
-              IS_DOCKER=true
-              break
-            fi
-          done
+          IS_DOCKER=false
+          case "${{ inputs.warehouse-type }}" in
+            postgres|clickhouse|trino|dremio|duckdb|spark) IS_DOCKER=true ;;
+          esac
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/test-warehouse.yml around lines 189 - 196, Replace the
manual for-loop membership check (iterating DOCKER_ADAPTERS and setting
IS_DOCKER) with a shell case statement that matches "${{ inputs.warehouse-type
}}" against the adapter patterns; update the block referencing DOCKER_ADAPTERS,
IS_DOCKER, adapter and inputs.warehouse-type to use case ... esac to set
IS_DOCKER=true on a match and IS_DOCKER=false otherwise, keeping the same
behavior but with a shorter, easier-to-maintain structure.
elementary/monitor/dbt_project/macros/overrides/dateadd.sql (1)

18-36: Consider adding case-insensitive datepart matching for robustness.

The datepart comparisons are case-sensitive. Although dbt convention is to pass lowercase datepart values, adding a .lower() normalization would make this macro defensive against non-lowercase inputs from upstream code. This aligns with dbt best practices for cross-warehouse compatibility.

♻️ Suggested fix
 {% macro dremio__dateadd(datepart, interval, from_date_or_timestamp) %}
+    {% set datepart = datepart | lower %}
     {% set interval = interval | string %}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@elementary/monitor/dbt_project/macros/overrides/dateadd.sql` around lines 18
- 36, Normalize the datepart variable to lowercase before the comparisons in the
dateadd macro (use the Jinja filter for lower on the datepart used in the
if/elif checks) so all branches (year, quarter, month, week, day, hour, minute,
second) match case-insensitively; update the exceptions.raise_compiler_error
call to reference the normalized value or include the original input for clarity
if desired. Ensure you modify the conditional checks that reference datepart and
the error message generation in the dateadd macro to use the lower-cased form.
tests/e2e_dbt_project/external_seeders/spark.py (1)

43-45: Harden connection cleanup for partial initialization failures.

If cursor creation fails after opening the connection, cleanup is not guaranteed. Initialize conn/cursor defensively and close conditionally in finally.

♻️ Proposed refactor
-        conn = hive.Connection(host=host, port=port, username="dbt")
-        cursor = conn.cursor()
-        try:
+        conn = None
+        cursor = None
+        try:
+            conn = hive.Connection(host=host, port=port, username="dbt")
+            cursor = conn.cursor()
             print(f"Creating schema '{seed_schema}'...")
             cursor.execute(f"CREATE DATABASE IF NOT EXISTS `{seed_schema}`")
@@
-        finally:
-            cursor.close()
-            conn.close()
+        finally:
+            if cursor is not None:
+                cursor.close()
+            if conn is not None:
+                conn.close()

Also applies to: 90-92

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e_dbt_project/external_seeders/spark.py` around lines 43 - 45, The
connection/cursor cleanup is unsafe if cursor creation fails; initialize conn =
None and cursor = None before opening/creating them in the spark seeder and wrap
the connection/cursor logic in try/finally so you conditionally close
cursor.close() if cursor is not None and conn.close() if conn is not None; apply
the same defensive pattern to the second occurrence around the other
hive.Connection/cursor creation (the block currently at lines with
hive.Connection(host=host, port=port, username="dbt") and subsequent
conn.cursor()) to ensure partial initialization failures are cleaned up.
tests/e2e_dbt_project/external_seeders/dremio.py (2)

38-39: Consider logging non-file parsing failures in _docker_defaults.

The broad except ...: pass blocks hide malformed YAML/regex issues and make auth failures harder to diagnose.

♻️ Proposed refinement
-    except Exception:
-        pass
+    except FileNotFoundError:
+        pass
+    except Exception as e:
+        print(f"  Warning: failed parsing docker defaults from {compose_path}: {e}")
@@
-    except Exception:
-        pass
+    except FileNotFoundError:
+        pass
+    except Exception as e:
+        print(f"  Warning: failed parsing dremio defaults from {setup_path}: {e}")

Also applies to: 54-55

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e_dbt_project/external_seeders/dremio.py` around lines 38 - 39, The
broad "except: pass" in _docker_defaults (and the second similar block)
suppresses errors — change each to "except Exception as e:" and log the
exception details (e.g., using logger.exception or logger.error with the
exception) including context about which YAML/regex or auth step failed; only
swallow the exception if it is specifically a FileNotFoundError (or other
expected file-read errors), otherwise log and re-raise or return a clear failure
so malformed YAML/regex/auth errors are visible during test runs.

9-10: Resolve YAML library alignment: use declared ruamel.yaml API instead of PyYAML.

This file imports PyYAML (import yaml, yaml.safe_load() at lines 24–26), but the project declares ruamel.yaml as its YAML dependency in pyproject.toml. PyYAML is not explicitly declared, creating an implicit dependency assumption.

Switch to the ruamel.yaml API for consistency with the project's declared dependencies.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e_dbt_project/external_seeders/dremio.py` around lines 9 - 10, The
file imports PyYAML and calls yaml.safe_load, but the project uses ruamel.yaml;
replace the PyYAML usage by importing ruamel.yaml's API (e.g., from ruamel.yaml
import YAML), create a YAML() instance (e.g., yaml = YAML()), and replace any
yaml.safe_load(...) calls with yaml.load(...) (or yaml.load(stream) for file
streams) so the module uses ruamel.yaml consistently; update the top-level
import and all references to the symbol yaml.safe_load to the ruamel YAML
instance in the ExternalSeeder-related code.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/test-warehouse.yml:
- Around line 96-105: The SEED_HASH calculation currently concatenates
generate_data.py, dbt_project.yml, docker-compose.yml and profiles.yml.j2 but
omits actual seed files, so include the seed data files (e.g., the data/**
CSVs/parquets) in the hash computation; update the SEED_HASH generation (the
block that defines SEED_HASH) to also read or list the seed files (for example
by catting matching files or appending the output of git ls-files data/**) so
changes to seed content will change the hash—refer to the SEED_HASH variable and
the listed files generate_data.py, dbt_project.yml, docker-compose.yml, and
profiles.yml.j2 when editing.
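The workflow computes the hash in shell, but the idea of folding seed-file contents into the digest can be sketched in Python; the file names and directory layout here are assumptions, not the project's actual paths.

```python
import hashlib
from pathlib import Path

def seed_hash(config_files: list[str], seed_dir: str) -> str:
    """Hash config files plus every seed data file, so editing a seed CSV
    invalidates the cache. Paths and glob pattern are illustrative."""
    h = hashlib.sha256()
    for path in config_files:
        h.update(Path(path).read_bytes())
    # Sort for a deterministic order; include names so renames also bust the cache.
    for path in sorted(Path(seed_dir).rglob("*.csv")):
        h.update(path.name.encode())
        h.update(path.read_bytes())
    return h.hexdigest()
```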

In `@elementary/monitor/dbt_project/macros/overrides/dateadd.sql`:
- Around line 12-37: The macro dremio__dateadd uses JDBC constants like
SQL_TSI_YEAR in TIMESTAMPADD calls which Dremio doesn't support; update each
TIMESTAMPADD call in the dremio__dateadd macro (the branches handling datepart
== 'year','quarter','month','week','day','hour','minute','second') to use bare
unit keywords YEAR, QUARTER, MONTH, WEEK, DAY, HOUR, MINUTE, SECOND respectively
(e.g., replace TIMESTAMPADD(SQL_TSI_YEAR, ...) with TIMESTAMPADD(YEAR, ...))
while keeping the CASTs and interval handling unchanged.

In `@tests/e2e_dbt_project/external_seeders/dremio.py`:
- Around line 367-369: The seeder currently uses "CREATE TABLE IF NOT EXISTS"
(variables col_defs, create_sql, fqn, qi) and then COPY INTO which leaves prior
data intact; change the flow to make seeding idempotent by clearing the table
before loading: if the table exists drop it or run a TRUNCATE/DELETE against fqn
(use the same fqn/qi references) and then CREATE TABLE (or CREATE TABLE IF NOT
EXISTS followed by TRUNCATE) before executing COPY INTO; apply the same change
to the other occurrence around the COPY INTO block at the second location
referenced (the block at ~391-394) so each run resets the table state prior to
loading.
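The drop-then-create sequence suggested here can be sketched as a small statement builder; `fqn` and the VARCHAR column list mirror the seeder's variables, but the statement text is illustrative.

```python
def idempotent_load_statements(fqn: str, cols: list[str]) -> list[str]:
    """Build the SQL run before COPY INTO so reruns start from an empty table."""
    col_defs = ", ".join(f'"{c}" VARCHAR' for c in cols)
    return [
        f"DROP TABLE IF EXISTS {fqn}",       # reset any prior run's data
        f"CREATE TABLE {fqn} ({col_defs})",  # recreate empty before COPY INTO
    ]
```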

In `@tests/e2e_dbt_project/external_seeders/spark.py`:
- Around line 71-80: The SQL embeds container_path directly into a single-quoted
literal inside cursor.execute for creating the temp view (tmp_view), which will
break if container_path contains single quotes; change the code that builds the
SQL for cursor.execute (the CREATE OR REPLACE TEMPORARY VIEW ... OPTIONS (path
'...')) to escape single quotes in container_path (e.g., replace each ' with ''
before formatting) or otherwise use a safe parameterization; ensure you apply
the escaping where container_path is used so the generated SQL string remains
valid.
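A minimal sketch of the escaping being asked for, assuming a CSV-backed temp view; the view name and OPTIONS shown are illustrative, not the seeder's exact statement.

```python
def temp_view_sql(view_name: str, container_path: str) -> str:
    """Escape single quotes before embedding a path in a quoted SQL literal."""
    escaped = container_path.replace("'", "''")  # double any embedded quotes
    return (
        f"CREATE OR REPLACE TEMPORARY VIEW {view_name} "
        f"USING csv OPTIONS (path '{escaped}', header 'true')"
    )
```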

---

Nitpick comments:
In @.github/workflows/test-warehouse.yml:
- Around line 189-196: Replace the manual for-loop membership check (iterating
DOCKER_ADAPTERS and setting IS_DOCKER) with a shell case statement that matches
"${{ inputs.warehouse-type }}" against the adapter patterns; update the block
referencing DOCKER_ADAPTERS, IS_DOCKER, adapter and inputs.warehouse-type to use
case ... esac to set IS_DOCKER=true on a match and IS_DOCKER=false otherwise,
keeping the same behavior but with a shorter, easier-to-maintain structure.

In `@elementary/monitor/dbt_project/macros/overrides/dateadd.sql`:
- Around line 18-36: Normalize the datepart variable to lowercase before the
comparisons in the dateadd macro (use the Jinja filter for lower on the datepart
used in the if/elif checks) so all branches (year, quarter, month, week, day,
hour, minute, second) match case-insensitively; update the
exceptions.raise_compiler_error call to reference the normalized value or
include the original input for clarity if desired. Ensure you modify the
conditional checks that reference datepart and the error message generation in
the dateadd macro to use the lower-cased form.

In `@tests/e2e_dbt_project/external_seeders/dremio.py`:
- Around line 38-39: The broad "except: pass" in _docker_defaults (and the
second similar block) suppresses errors — change each to "except Exception as
e:" and log the exception details (e.g., using logger.exception or logger.error
with the exception) including context about which YAML/regex or auth step
failed; only swallow the exception if it is specifically a FileNotFoundError (or
other expected file-read errors), otherwise log and re-raise or return a clear
failure so malformed YAML/regex/auth errors are visible during test runs.
- Around line 9-10: The file imports PyYAML and calls yaml.safe_load, but the
project uses ruamel.yaml; replace the PyYAML usage by importing ruamel.yaml's
API (e.g., from ruamel.yaml import YAML), create a YAML() instance (e.g., yaml =
YAML()), and replace any yaml.safe_load(...) calls with yaml.load(...) (or
yaml.load(stream) for file streams) so the module uses ruamel.yaml consistently;
update the top-level import and all references to the symbol yaml.safe_load to
the ruamel YAML instance in the ExternalSeeder-related code.

In `@tests/e2e_dbt_project/external_seeders/spark.py`:
- Around line 43-45: The connection/cursor cleanup is unsafe if cursor creation
fails; initialize conn = None and cursor = None before opening/creating them in
the spark seeder and wrap the connection/cursor logic in try/finally so you
conditionally close cursor.close() if cursor is not None and conn.close() if
conn is not None; apply the same defensive pattern to the second occurrence
around the other hive.Connection/cursor creation (the block currently at lines
with hive.Connection(host=host, port=port, username="dbt") and subsequent
conn.cursor()) to ensure partial initialization failures are cleaned up.


📥 Commits

Reviewing files that changed from the base of the PR and between 2263199 and a5dcef9.

📒 Files selected for processing (6)
  • .github/workflows/test-warehouse.yml
  • elementary/monitor/dbt_project/macros/overrides/dateadd.sql
  • tests/e2e_dbt_project/docker/dremio/dremio-setup.sh
  • tests/e2e_dbt_project/external_seeders/base.py
  • tests/e2e_dbt_project/external_seeders/dremio.py
  • tests/e2e_dbt_project/external_seeders/spark.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/e2e_dbt_project/docker/dremio/dremio-setup.sh

devin-ai-integration bot and others added 3 commits March 3, 2026 12:02
- test-warehouse.yml: replace for-loop with case statement for Docker adapter check
- dateadd.sql: use bare TIMESTAMPADD keywords instead of SQL_TSI_* constants, add case-insensitive datepart matching
- spark.py: harden connection cleanup with None-init + conditional close, escape single quotes in container_path
- dremio.py: switch from PyYAML to ruamel.yaml for project consistency, log non-file parsing failures, make seeding idempotent with DROP TABLE before CREATE TABLE

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…esigned to fail)

The e2e project has tests tagged error_test and should_fail that are
intentionally designed to fail. The dbt test step needs continue-on-error
so these expected failures don't block the CI job. The edr monitoring
steps that follow validate the expected outcomes.

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
This override normalizes ISO 8601 format to 'YYYY-MM-DD HH:MM:SS.sss'.
#}

{%- macro dremio__edr_cast_as_timestamp(timestamp_field) -%}
Collaborator

can we open a PR to dbt-data-reliability to include this instead?

Contributor Author

Created dbt-data-reliability PR #957 which updates the existing dremio__edr_cast_as_timestamp with the full ISO 8601 normalization (removing T separator and Z suffix). Once merged, I'll remove this override from the elementary CLI project.

- Outputs just TIMESTAMPADD(...) as an expression (no "select" prefix)
#}

{% macro dremio__dateadd(datepart, interval, from_date_or_timestamp) %}
Collaborator

can we maybe open a PR to dbt-data-reliability to instead override edr_dateadd for dremio with this logic?
(can be the same PR as the other comment I mentioned)

Contributor Author

Created dbt-data-reliability PR #957 to move this logic there as dremio__edr_dateadd (using adapter.dispatch). Once merged, I'll remove this override from the elementary CLI project.

devin-ai-integration bot and others added 3 commits March 3, 2026 17:22
…bt-data-reliability

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…eliability

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
@haritamar haritamar merged commit e8dabbf into master Mar 3, 2026
20 checks passed
@haritamar haritamar deleted the devin/ELE-5266-1772368864-code-ci branch March 3, 2026 22:33