feat: add DuckDB, Trino, Dremio & Spark support to CI and CLI#2135

Merged
haritamar merged 98 commits into master from devin/ELE-5266-1772368864-code-ci
Mar 3, 2026

Conversation


@devin-ai-integration devin-ai-integration bot commented Mar 1, 2026

Summary

Adds full CI infrastructure and CLI support for DuckDB, Trino, Dremio, and Spark adapters to elementary-cli. CI runs all 11 warehouse tests (including these 4) plus code-quality.

Key changes:

  • pyproject.toml: Added dbt-duckdb and dbt-dremio as optional dependencies
  • profiles.yml.j2: Added profile targets for all 4 adapters (DuckDB file-based, Trino with Iceberg, Dremio with Nessie enterprise catalog, Spark Thrift Server with Delta)
  • docker-compose.yml: Added Docker service stacks for Trino (Iceberg + Hive Metastore + MinIO), Dremio (Nessie + MinIO), Spark (Thrift Server + Delta Lake)
  • CI workflows: Docker startup steps, seed caching for Postgres/ClickHouse (Docker volumes) + DuckDB (db file), external seeders for Dremio and Spark
  • E2E dbt project: Renamed reserved-word columns (min → min_val, max → max_val, sum → sum_val, one → col1), updated data_type mappings, excluded schema_changes for Spark, added adapter-specific schema drop dispatches
  • CLI adapter code: Added transient error patterns and get_adapter_unique_id dispatch for DuckDB

Dremio-specific workarounds (dbt-dremio adapter limitations):

  • --exclude tag:ephemeral_model on dbt test — dbt-dremio doesn't support ephemeral model CTEs
  • Elementary profile uses enterprise_catalog_folder instead of schema to avoid duplicate-key errors from dbt-dremio's alias chain
  • Seeds placed in same Nessie namespace as models (VDS validator can't resolve cross-namespace references)
  • Single-threaded execution (--threads 1) to avoid Nessie catalog race conditions
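Put together, a Dremio test step honoring these constraints could be assembled roughly as follows (a sketch only; the target name and exact selection syntax are assumptions, not the CI workflow's actual values):

```shell
# Build the dbt test invocation with the Dremio workarounds above:
# skip ephemeral-model tests and serialize execution against Nessie.
DBT_FLAGS="--target dremio --threads 1 --exclude tag:ephemeral_model"
echo "dbt test $DBT_FLAGS"
```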

Updates since last revision (CodeRabbit follow-ups / CI robustness)

  • Dremio startup safety in CI: replaced docker compose up --exit-code-from dremio-setup ... with a two-step pattern:
    1. docker compose up -d --wait dremio dremio-minio nessie
    2. docker compose run --rm dremio-setup
  • Bounded Dremio readiness loop in dremio-setup.sh (max attempts + curl timeout + hard exit) to avoid CI hangs.
  • Seed cache scope aligned: seed-cache key/restore now only runs for adapters that actually save a cache (Postgres/ClickHouse/DuckDB), and the save condition is simplified accordingly.
  • Dremio external seeder identifier quoting: added a _quote_ident() helper to escape embedded quotes in table/column identifiers.
  • Spark external seeder dependency clarity: wraps pyhive import with a clear runtime error message (CI installs Spark deps via dbt-spark[PyHive]).
  • Removed Dremio macro overrides: dateadd.sql, cast_column.sql, and dremio_target_database.sql are now handled by dbt-data-reliability (PR #957 and PR #952, both merged). Only the ref.sql override remains (dbt-core 1.11 cross-package ref workaround for dbt-dremio's broken ref() macro).
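The bounded readiness loop can be sketched as a small shell helper (illustrative only; the URL, attempt count, and timeouts in the real dremio-setup.sh may differ):

```shell
# Poll a URL until it responds or a hard attempt limit is reached.
# --max-time bounds each probe so a hung endpoint cannot stall the loop.
wait_for_service() {
  url="$1"; max_attempts="${2:-60}"
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if curl -sf --max-time 5 "$url" > /dev/null 2>&1; then
      echo "ready"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep 1
  done
  # Hard exit path: CI fails fast instead of hanging indefinitely.
  echo "timed out waiting for $url" >&2
  return 1
}
```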

Review & Testing Checklist for Human

  • Dremio override removal: Verify that the dbt-data-reliability master branch (used by default) includes the dremio__edr_dateadd, dremio__edr_cast_as_timestamp, and dremio__target_database macros from PRs #957 and #952. If CI uses an older version, Dremio tests will fail.
  • Column rename consistency: Confirm all references to renamed columns (min_val, max_val, sum_val, col1) are updated consistently across models, tests, and seed files.
  • CI orchestration: Confirm Dremio services stay up after dremio-setup completes (two-step compose flow) and subsequent dbt steps can connect reliably.
  • Seed cache behavior: Verify caching is only applied to Postgres/ClickHouse/DuckDB as intended and doesn't regress runtime/correctness for other adapters.
  • Spark external seeder runtime deps: Ensure the Spark CI job continues to install dbt-spark[PyHive] and that local failure mode is acceptable (actionable error message).
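One way to spot-check the first item is to grep the installed package directory for the expected macro names; a hedged sketch (the helper name and the idea of grepping a package checkout are assumptions, not part of this PR):

```shell
# Check that a dbt package directory defines the expected Dremio macros
# (macro names taken from the checklist above).
check_macros() {
  pkg_dir="$1"
  for m in dremio__edr_dateadd dremio__edr_cast_as_timestamp dremio__target_database; do
    grep -rq "$m" "$pkg_dir" || { echo "missing macro: $m" >&2; return 1; }
  done
  echo "all dremio macros present"
}
```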

Suggested Test Plan

  1. Re-run CI once (or trigger "Re-run all jobs") and confirm all warehouse jobs + code-quality pass.
  2. (Optional) Re-run CI a second time to validate seed-cache hits for Postgres/ClickHouse/DuckDB behave as expected.
  3. If testing locally: bring up Dremio stack (docker compose up -d --wait dremio dremio-minio nessie + docker compose run --rm dremio-setup) and run the Dremio dbt steps.

Notes

  • All Docker credentials (e.g. minio123, dremio123) are local-only test credentials, not real secrets.
  • Spark seeding uses pyhive; in CI this is provided via dbt-spark[PyHive].
  • Dremio ref() override fix: PR #2137 adds a project-level ref() macro that properly handles dbt-core 1.11's two-arg cross-package syntax. See dremio/dbt-dremio#232 for upstream issue.

Requested by: @haritamar
Link to Devin Session: https://app.devin.ai/sessions/6c12a19d456541568bcd7974623f3557

Summary by CodeRabbit

  • New Features

    • Added support for DuckDB, Trino, Dremio, and Spark warehouses; optional DuckDB/Dremio adapters
    • External seed loaders and CLI to load seeds into Dremio and Spark
    • Deterministic seed data generation for stable CI artifacts
  • Tests

    • CI expanded to run and cache seeds across new warehouses; extended Docker infra for Trino, Dremio, Spark
  • Models

    • Renamed several seed/model columns (min → min_val, one → col1)
  • Bug Fixes

    • Improved Dremio timestamp and dateadd handling for compatibility

devin-ai-integration bot and others added 7 commits March 1, 2026 14:31
- pyproject.toml: add dbt-duckdb and dbt-dremio as optional dependencies
- Docker config files for Trino and Spark (non-credential files)
- test-all-warehouses.yml: add duckdb, trino, dremio, spark to CI matrix
- schema.yml: update data_type expressions for new adapter type mappings
- test_alerts_union.sql: exclude schema_changes for Spark (like Databricks)
- drop_test_schemas.sql: add dispatched edr_drop_schema for all new adapters
- transient_errors.py: add spark and duckdb entries to _ADAPTER_PATTERNS
- get_adapter_type_and_unique_id.sql: add duckdb dispatch (uses target.path)

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
@devin-ai-integration

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.



linear bot commented Mar 1, 2026


github-actions bot commented Mar 1, 2026

👋 @devin-ai-integration[bot]
Thank you for raising your pull request.
Please make sure to add tests and document all user-facing changes.
You can do this by editing the docs files in this pull request.


coderabbitai bot commented Mar 1, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds DuckDB, Trino, Dremio, and Spark support across CI matrices and workflows, Docker compose services, dbt profiles/macros, external seed loaders (Dremio/Spark), deterministic seed data and column renames, seed cache save/restore utilities, and related configuration and dependency updates.

Changes

  • CI Workflows (.github/workflows/test-all-warehouses.yml, .github/workflows/test-warehouse.yml): Adds duckdb/trino/dremio/spark to the CI matrix; implements per-warehouse seed cache compute/restore/save and conditional startup/tooling steps.
  • Docker & Compose (tests/e2e_dbt_project/docker-compose.yml, tests/e2e_dbt_project/docker/..., tests/e2e_dbt_project/docker/trino/..., tests/e2e_dbt_project/docker/spark/...): Adds Trino (Iceberg), MinIO/Nessie, the Dremio stack, Spark Thrift + Hive metastore, new volumes/networks, healthchecks, and orchestration for the new services.
  • External Seeders & CLI (tests/e2e_dbt_project/external_seeders/__init__.py, .../base.py, .../dremio.py, .../spark.py, tests/e2e_dbt_project/load_seeds_external.py): New ExternalSeeder base and adapter implementations for Dremio (MinIO, Dremio REST/Nessie, Iceberg/COPY) and Spark (PyHive/Thrift, Delta); CLI dispatcher to invoke them.
  • Seed Cache Utilities (tests/e2e_dbt_project/ci/restore_seed_cache.sh, tests/e2e_dbt_project/ci/save_seed_cache.sh): Scripts to archive and restore Docker volumes to /tmp seed-cache- archives, with container stop/start and readiness polling.
  • Profiles & Dependencies (tests/profiles/profiles.yml.j2, pyproject.toml, elementary/monitor/dbt_project/dbt_project.yml, tests/e2e_dbt_project/dbt_project.yml): Adds trino/dremio/duckdb/spark profile targets and anchors; optional deps dbt-duckdb/dbt-dremio and extras; dispatch rules and a conditional delta file_format for Spark.
  • Adapter Logic & Macros (elementary/clients/dbt/transient_errors.py, elementary/monitor/dbt_project/macros/..., tests/e2e_dbt_project/macros/system/*): Adds Spark transient errors and a DuckDB placeholder; package-qualified refs; adapter-specific macros (duckdb/trino/dremio/spark), Dremio overrides (cast/dateadd), generate_schema_name and edr_drop_schema implementations.
  • Monitoring Macros & SQL Safety (elementary/monitor/dbt_project/macros/..., elementary/monitor/dbt_project/models/alerts/*.sql): Replaces wildcard refs with package-qualified refs, explicit column lists, explicit JOINs, and escapes reserved keywords across multiple macros.
  • SQL Models, Tests & Seed Data (tests/e2e_dbt_project/models/*, tests/e2e_dbt_project/models/schema.yml, tests/e2e_dbt_project/tests/*.sql, tests/e2e_dbt_project/generate_data.py): Renames numeric columns (min → min_val, max → max_val, sum → sum_val), updates aliases and tests, adds deterministic seed generation (RANDOM_SEED=42), and includes Spark in target-aware type logic.
  • Trino Configs & Spark Files (tests/e2e_dbt_project/docker/trino/catalog/*, tests/e2e_dbt_project/docker/trino/etc/*, tests/e2e_dbt_project/docker/spark/*): Adds Iceberg/memory catalogs and Trino JVM/node configs; Spark Dockerfile, entrypoint, hive-site.xml, and spark-defaults.conf.
  • Utilities / Setup Scripts (tests/e2e_dbt_project/docker/dremio/dremio-setup.sh, tests/e2e_dbt_project/docker/spark/entrypoint.sh): Adds a Dremio setup script to create the Nessie source and a Spark entrypoint with wait logic for service readiness.
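The archive/restore half of the seed cache scripts follows a plain tar pattern; a minimal sketch where a local directory stands in for the Docker volume (the real scripts additionally stop/start containers and poll for readiness):

```shell
# Archive a data directory (standing in for a Docker volume) to a
# compressed cache file, and restore it back on a later run.
save_cache() {
  data_dir="$1"; cache_file="$2"
  tar -czf "$cache_file" -C "$data_dir" .
}

restore_cache() {
  cache_file="$1"; data_dir="$2"
  mkdir -p "$data_dir"
  tar -xzf "$cache_file" -C "$data_dir"
}
```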

Sequence Diagram(s)

sequenceDiagram
    participant CI as GitHub Actions
    participant Cache as Seed Cache
    participant Docker as Docker Compose
    participant dbt as dbt / CLI
    participant Service as Warehouse Service

    rect rgba(100, 150, 200, 0.5)
    Note over CI,Cache: Seed cache decision
    CI->>Cache: compute seed-hash & check cache
    alt cache hit
        Cache-->>Docker: restore tar.gz into volumes
        Docker->>Service: start service(s) with mounted cached data
    else cache miss
        CI->>dbt: generate seeds / run external seeders
        dbt->>Service: load seed data (external or local)
    end
    end

    rect rgba(150, 200, 100, 0.5)
    Note over Docker,dbt: Test execution
    Docker->>Service: start & healthcheck services
    dbt->>Service: run dbt (seed/run/test)
    Service-->>dbt: return results/artifacts
    end

    rect rgba(200, 150, 100, 0.5)
    Note over Docker,Cache: Save cache
    Docker->>Cache: archive volumes to tar.gz
    Cache-->>CI: upload/store cache
    end
sequenceDiagram
    participant CLI as load_seeds_external
    participant Seeder as ExternalSeeder (Dremio/Spark)
    participant MinIO as MinIO
    participant Dremio as Dremio API
    participant Spark as Spark Thrift

    rect rgba(100, 200, 150, 0.5)
    Note over CLI,Seeder: Dremio flow
    CLI->>Seeder: invoke DremioExternalSeeder.load()
    Seeder->>MinIO: upload CSVs (mc via docker)
    Seeder->>Dremio: create/update S3 source (API)
    Seeder->>Dremio: create Nessie namespace
    loop per CSV
        Seeder->>Dremio: CREATE ICEBERG TABLE & COPY INTO from S3
    end
    Dremio-->>Seeder: job completion / metadata propagation
    end

    rect rgba(150, 100, 200, 0.5)
    Note over CLI,Seeder: Spark flow
    CLI->>Seeder: invoke SparkExternalSeeder.load()
    Seeder->>Spark: connect via Thrift (PyHive)
    loop per CSV
        Seeder->>Spark: create schema/table and populate via temporary view
    end
    Spark-->>Seeder: tables available
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Poem

🐰 I hopped through configs, jars, and streams,
DuckDB, Trino, Dremio, Spark joined the teams.
Seeds packed, cached, and carried in tar,
Docker dances, profiles set to spar.
A carrot of CI — tests run near and far!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 56.52%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Title Check: ✅ Passed. The title clearly and concisely describes the main feature addition: support for four new warehouse adapters (DuckDB, Trino, Dremio, Spark) in both CI and CLI infrastructure.
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (2)
tests/e2e_dbt_project/docker/spark/Dockerfile (1)

1-30: Run the Spark container as a non-root user.

The image currently runs as root; adding a dedicated runtime user improves baseline container hardening.

Suggested hardening diff
 COPY entrypoint.sh /scripts/
-RUN chmod +x /scripts/entrypoint.sh
+RUN chmod +x /scripts/entrypoint.sh && \
+    groupadd -r spark && useradd -r -g spark -m spark && \
+    chown -R spark:spark /usr/spark /scripts
 
+USER spark
 ENTRYPOINT ["/scripts/entrypoint.sh"]
 CMD ["--help"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e_dbt_project/docker/spark/Dockerfile` around lines 1 - 30, The
Dockerfile runs the Spark image as root; create and switch to a non-root runtime
user to improve container hardening by adding a dedicated user/group (e.g.,
spark:spark) after installing packages and before COPY/ENTRYPOINT, ensure
ownership of SPARK_HOME (/usr/spark) and /scripts is changed to that user (chown
-R spark:spark /usr/spark /scripts) and then add a USER spark line so the
ENTRYPOINT ("/scripts/entrypoint.sh") and CMD run unprivileged; make sure any
later runtime actions in entrypoint.sh do not require root or handle privilege
escalation explicitly.
tests/e2e_dbt_project/docker-compose.yml (1)

106-106: Pin new container images to explicit versions/digests.

Using latest (or no tag) makes CI non-deterministic and harder to debug when upstream images change.

🔧 Suggested direction
-    image: projectnessie/nessie:latest
+    image: projectnessie/nessie:<pinned-version-or-digest>

-    image: minio/minio:latest
+    image: minio/minio:<pinned-version-or-digest>

-    image: minio/mc
+    image: minio/mc:<pinned-version-or-digest>

-    image: dremio/dremio-oss:latest
+    image: dremio/dremio-oss:<pinned-version-or-digest>

-    image: alpine:latest
+    image: alpine:<pinned-version-or-digest>

Also applies to: 114-114, 132-132, 150-150, 175-175

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e_dbt_project/docker-compose.yml` at line 106, Replace unconstrained
image tags using "latest" with explicit version tags or digests to ensure
deterministic CI; update the image entry "image: projectnessie/nessie:latest" in
the docker-compose.yml to a specific version or sha256 digest, and apply the
same change for the other image entries referenced (lines 114, 132, 150, 175) so
all container images are pinned to fixed tags/digests.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/test-warehouse.yml:
- Around line 109-113: The readiness loops for Trino, Dremio and Spark currently
break on success but never fail on timeout; update each curl loop (the
Trino/Dremio/Spark readiness checks) to detect if the loop completed without
success by adding a post-loop conditional that checks whether the last curl
succeeded and, if not, prints an error and exits non‑zero (e.g., after the seq
loop, if the curl never returned success then echo "Timed out waiting for
<Service>" and exit 1). Ensure you apply this pattern to the Trino readiness
loop and the analogous Dremio and Spark loops so the job fails fast when
dependencies never become healthy.
- Line 216: The DuckDB target is using an in-memory database (path ":memory:")
which cannot persist across workflow steps; change the DuckDB configuration used
in CI so it uses a persistent file path (for example set path to
"/tmp/elementary_test.db") or exclude DuckDB from the multi-step workflow when
in-memory testing is intended; update the workflow logic that references
inputs.warehouse-type (and any DuckDB-specific job/step configs) to use the
persistent file path or add a conditional to skip DuckDB when
inputs.warehouse-type == 'duckdb' and in-memory mode is active.

In `@tests/e2e_dbt_project/docker-compose.yml`:
- Around line 95-100: The trino-mc-job entrypoint uses a fixed sleep which can
fail if MinIO starts slower; update the trino-mc-job entrypoint to replace
"sleep 5" with a readiness loop that waits for minio:9000 to accept connections
(e.g., a small loop using curl or nc that retries until success) before running
the mc commands, then proceed with "/usr/bin/mc config ...", "/usr/bin/mc mb
..."; modify the entrypoint block in the trino-mc-job service so it polls minio
(hostname "minio", port 9000) and exits the loop only when reachable.

In `@tests/e2e_dbt_project/docker/dremio/dremio-setup.sh`:
- Around line 15-33: The script currently treats AUTH_TOKEN and the S3 source
creation as always successful; update the AUTH_TOKEN acquisition and the
subsequent curl call that creates S3Source to fail fast: after the login curl
that sets AUTH_TOKEN (variable AUTH_TOKEN), check for empty string OR the
literal "null" and exit non‑zero with an error message if found; for both curl
calls (the login POST and the PUT to create S3Source) capture HTTP status (e.g.,
via curl -w/%{http_code}% and/or -f) and the response body, verify the status is
a 2xx before proceeding, and on non‑2xx log the response body and exit 1 so the
script does not print "Obtained Dremio auth token" or "S3 Source created in
Dremio" on failure.

In `@tests/e2e_dbt_project/macros/system/drop_test_schemas.sql`:
- Around line 43-53: The DROP macros interpolate schema_name unquoted which
breaks on special chars/reserved words; update dremio__edr_drop_schema,
duckdb__edr_drop_schema, and spark__edr_drop_schema to use
adapter.quote(schema_name) (same pattern as clickhouse__edr_drop_schema) when
building the raw DROP SQL so the identifier is safely quoted (e.g., replace
plain schema_name concatenation with adapter.quote(schema_name) in the run_query
argument).

In `@tests/profiles/profiles.yml.j2`:
- Around line 49-53: The DuckDB profile currently uses the process-local
in-memory path ":memory:", which breaks across SubprocessDbtRunner subprocesses;
update the profiles template (the duckdb &duckdb block) to use a persistent or
shared DB URI instead of ":memory:"—either switch to a file-backed path (e.g., a
temp DB file under a test-specific directory) or use a shared-memory URI like
the SQLite-style shared cache (e.g., file:...?mode=memory&cache=shared) so
multiple dbt subprocesses can see the same database; keep the schema
interpolation ({{ schema_name }}) and threads setting unchanged.

---

Nitpick comments:
In `@tests/e2e_dbt_project/docker-compose.yml`:
- Line 106: Replace unconstrained image tags using "latest" with explicit
version tags or digests to ensure deterministic CI; update the image entry
"image: projectnessie/nessie:latest" in the docker-compose.yml to a specific
version or sha256 digest, and apply the same change for the other image entries
referenced (lines 114, 132, 150, 175) so all container images are pinned to
fixed tags/digests.

In `@tests/e2e_dbt_project/docker/spark/Dockerfile`:
- Around line 1-30: The Dockerfile runs the Spark image as root; create and
switch to a non-root runtime user to improve container hardening by adding a
dedicated user/group (e.g., spark:spark) after installing packages and before
COPY/ENTRYPOINT, ensure ownership of SPARK_HOME (/usr/spark) and /scripts is
changed to that user (chown -R spark:spark /usr/spark /scripts) and then add a
USER spark line so the ENTRYPOINT ("/scripts/entrypoint.sh") and CMD run
unprivileged; make sure any later runtime actions in entrypoint.sh do not
require root or handle privilege escalation explicitly.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 746ce41 and 0836bbf.

📒 Files selected for processing (20)
  • .github/workflows/test-all-warehouses.yml
  • .github/workflows/test-warehouse.yml
  • elementary/clients/dbt/transient_errors.py
  • elementary/monitor/dbt_project/macros/get_adapter_type_and_unique_id.sql
  • pyproject.toml
  • tests/e2e_dbt_project/docker-compose.yml
  • tests/e2e_dbt_project/docker/dremio/dremio-setup.sh
  • tests/e2e_dbt_project/docker/spark/Dockerfile
  • tests/e2e_dbt_project/docker/spark/entrypoint.sh
  • tests/e2e_dbt_project/docker/spark/hive-site.xml
  • tests/e2e_dbt_project/docker/spark/spark-defaults.conf
  • tests/e2e_dbt_project/docker/trino/catalog/iceberg.properties
  • tests/e2e_dbt_project/docker/trino/catalog/memory.properties
  • tests/e2e_dbt_project/docker/trino/etc/config.properties
  • tests/e2e_dbt_project/docker/trino/etc/jvm.config
  • tests/e2e_dbt_project/docker/trino/etc/node.properties
  • tests/e2e_dbt_project/macros/system/drop_test_schemas.sql
  • tests/e2e_dbt_project/models/schema.yml
  • tests/e2e_dbt_project/models/test_alerts_union.sql
  • tests/profiles/profiles.yml.j2

- DuckDB: use file-backed DB path instead of :memory: to persist across
  subprocess calls, and reduce threads to 1 to avoid concurrent commit errors
- Spark: install dbt-spark[PyHive] extras required for thrift connection method
- Dremio: add dremio__target_database() dispatch override in e2e project
  to return target.database (upstream elementary package lacks this dispatch)
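A file-backed DuckDB target of the kind described might look like the following (a sketch with illustrative values, not the actual profiles.yml.j2 contents; the /tmp path echoes the one suggested in the review comments):

```yaml
# Hypothetical dbt profile target for DuckDB: a file path (not ":memory:")
# so state persists across dbt subprocess invocations, and a single thread
# to avoid concurrent commit errors.
elementary_tests:
  target: duckdb
  outputs:
    duckdb:
      type: duckdb
      path: /tmp/elementary_test.db
      schema: test_schema
      threads: 1
```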

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (1)
.github/workflows/test-warehouse.yml (1)

109-113: ⚠️ Potential issue | 🟠 Major

Fail fast when readiness polling times out.

Line 109, Line 124, and Line 141 loops still continue even if services never become ready. This can allow later steps to run against unhealthy dependencies and fail non-deterministically.

Suggested fix pattern (apply to all three loops)
-          for i in $(seq 1 60); do
-            curl -sf http://localhost:8086/v1/info && break
+          ready=0
+          for i in $(seq 1 60); do
+            if curl -sf http://localhost:8086/v1/info > /dev/null; then
+              ready=1
+              break
+            fi
             echo "Waiting for Trino... ($i/60)"
             sleep 5
           done
+          if [ "$ready" -ne 1 ]; then
+            echo "Timed out waiting for Trino"
+            exit 1
+          fi

Also applies to: 124-128, 141-145

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/test-warehouse.yml around lines 109 - 113, The readiness
polling loops that use "for i in $(seq 1 60); do curl -sf ... && break; ...
done" should fail fast when the max attempts are reached; update each such loop
(the Trino/other service readiness loops) to check after sleeping whether i
equals the max (e.g., 60) and, if so, print an error to stderr and exit 1 so the
workflow stops rather than continuing with unhealthy dependencies. Ensure the
same pattern is applied to all three polling loops to provide deterministic
failure on timeout.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In @.github/workflows/test-warehouse.yml:
- Around line 109-113: The readiness polling loops that use "for i in $(seq 1
60); do curl -sf ... && break; ... done" should fail fast when the max attempts
are reached; update each such loop (the Trino/other service readiness loops) to
check after sleeping whether i equals the max (e.g., 60) and, if so, print an
error to stderr and exit 1 so the workflow stops rather than continuing with
unhealthy dependencies. Ensure the same pattern is applied to all three polling
loops to provide deterministic failure on timeout.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 0836bbf and 40b2c74.

📒 Files selected for processing (3)
  • .github/workflows/test-warehouse.yml
  • tests/e2e_dbt_project/macros/system/dremio_target_database.sql
  • tests/profiles/profiles.yml.j2
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/profiles/profiles.yml.j2

Plain S3 sources in Dremio do not support CREATE TABLE (needed for dbt seed).
Switch to Nessie catalog source which supports table creation via Iceberg.

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
tests/e2e_dbt_project/docker/dremio/dremio-setup.sh (1)

15-31: ⚠️ Potential issue | 🟠 Major

Fail fast on login/source API errors and reject "null" tokens.

At Line 20, checking only -z misses the literal "null" token case; and Lines 15-31 do not validate HTTP status for either API call, so setup can report success on non-2xx responses.

🔧 Suggested patch
-# Log in to Dremio to get the auth token
-AUTH_TOKEN=$(curl -s -X POST "http://dremio:9047/apiv2/login" \
-  -H "Content-Type: application/json" \
-  --data "{\"userName\":\"dremio\", \"password\":\"dremio123\"}" | jq -r .token)
+# Log in to Dremio to get the auth token
+LOGIN_RESPONSE="$(mktemp)"
+LOGIN_HTTP_CODE=$(curl -sS -o "$LOGIN_RESPONSE" -w "%{http_code}" -X POST "http://dremio:9047/apiv2/login" \
+  -H "Content-Type: application/json" \
+  --data "{\"userName\":\"dremio\", \"password\":\"dremio123\"}")
+if [ "$LOGIN_HTTP_CODE" -lt 200 ] || [ "$LOGIN_HTTP_CODE" -ge 300 ]; then
+  echo "Failed to log in to Dremio (HTTP $LOGIN_HTTP_CODE)"
+  cat "$LOGIN_RESPONSE"
+  exit 1
+fi
+AUTH_TOKEN=$(jq -r '.token // empty' "$LOGIN_RESPONSE")
 
 # Check if AUTH_TOKEN is not empty
-if [ -z "$AUTH_TOKEN" ]; then
+if [ -z "$AUTH_TOKEN" ] || [ "$AUTH_TOKEN" = "null" ]; then
   echo "Failed to obtain Dremio auth token"
   exit 1
 fi
 
 echo "Obtained Dremio auth token"
 
 # Create a Nessie catalog source in Dremio (supports CREATE TABLE for dbt seed)
-curl -s -X PUT "http://dremio:9047/apiv2/source/NessieSource" \
+SOURCE_RESPONSE="$(mktemp)"
+SOURCE_HTTP_CODE=$(curl -sS -o "$SOURCE_RESPONSE" -w "%{http_code}" -X PUT "http://dremio:9047/apiv2/source/NessieSource" \
   -H "Content-Type: application/json" \
   -H "Authorization: _dremio$AUTH_TOKEN" \
-  --data "{\"name\":\"NessieSource\",\"config\":{\"nessieEndpoint\":\"http://catalog:19120/api/v2\",\"nessieAuthType\":\"NONE\",\"credentialType\":\"ACCESS_KEY\",\"awsAccessKey\":\"admin\",\"awsAccessSecret\":\"password\",\"awsRootPath\":\"datalake\",\"secure\":false,\"propertyList\":[{\"name\":\"fs.s3a.path.style.access\",\"value\":\"true\"},{\"name\":\"fs.s3a.endpoint\",\"value\":\"dremio-storage:9000\"},{\"name\":\"dremio.s3.compat\",\"value\":\"true\"}]},\"type\":\"NESSIE\",\"metadataPolicy\":{\"deleteUnavailableDatasets\":true,\"autoPromoteDatasets\":false,\"namesRefreshMillis\":3600000,\"datasetDefinitionRefreshAfterMillis\":3600000,\"datasetDefinitionExpireAfterMillis\":10800000,\"authTTLMillis\":86400000,\"updateMode\":\"PREFETCH_QUERIED\"}}"
+  --data "{\"name\":\"NessieSource\",\"config\":{\"nessieEndpoint\":\"http://catalog:19120/api/v2\",\"nessieAuthType\":\"NONE\",\"credentialType\":\"ACCESS_KEY\",\"awsAccessKey\":\"admin\",\"awsAccessSecret\":\"password\",\"awsRootPath\":\"datalake\",\"secure\":false,\"propertyList\":[{\"name\":\"fs.s3a.path.style.access\",\"value\":\"true\"},{\"name\":\"fs.s3a.endpoint\",\"value\":\"dremio-storage:9000\"},{\"name\":\"dremio.s3.compat\",\"value\":\"true\"}]},\"type\":\"NESSIE\",\"metadataPolicy\":{\"deleteUnavailableDatasets\":true,\"autoPromoteDatasets\":false,\"namesRefreshMillis\":3600000,\"datasetDefinitionRefreshAfterMillis\":3600000,\"datasetDefinitionExpireAfterMillis\":10800000,\"authTTLMillis\":86400000,\"updateMode\":\"PREFETCH_QUERIED\"}}")
+if [ "$SOURCE_HTTP_CODE" -lt 200 ] || [ "$SOURCE_HTTP_CODE" -ge 300 ]; then
+  echo "Failed to create Nessie Source in Dremio (HTTP $SOURCE_HTTP_CODE)"
+  cat "$SOURCE_RESPONSE"
+  exit 1
+fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e_dbt_project/docker/dremio/dremio-setup.sh` around lines 15 - 31,
The login and source API calls must validate HTTP status and reject a literal
"null" token; modify the AUTH_TOKEN retrieval block to capture both response
body and HTTP status from the login curl (use curl -w to produce HTTP code and
capture body separately), then extract token with jq -r .token and fail if the
HTTP status is not 2xx or if AUTH_TOKEN is empty or equals "null" (print the
response body/error and exit 1); do the same for the PUT creating NessieSource:
capture response body and HTTP status, and exit non-zero with the response on
any non-2xx status. Ensure you reference the existing AUTH_TOKEN variable, the
login curl to "apiv2/login", and the PUT to "apiv2/source/NessieSource" when
making these checks.
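The prompt above can be sketched as a small standalone check; `sed` stands in for `jq` only to keep the sketch dependency-free, and the `{"token": "..."}` response shape is an assumption:

```shell
# Hypothetical helper mirroring the suggested fix: reject non-2xx status and
# empty/"null" tokens. In the real script, body and status would come from one
# call: HTTP_CODE=$(curl -s -o "$RESPONSE_FILE" -w '%{http_code}' ...).
validate_login() {
  local body="$1" http_code="$2" token
  if [ "$http_code" -lt 200 ] || [ "$http_code" -ge 300 ]; then
    echo "Login failed (HTTP $http_code): $body" >&2
    return 1
  fi
  # equivalent of jq -r .token, via sed so the sketch has no jq dependency
  token=$(printf '%s' "$body" | sed -n 's/.*"token"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')
  if [ -z "$token" ] || [ "$token" = "null" ]; then
    echo "Login returned no usable token: $body" >&2
    return 1
  fi
  printf '%s\n' "$token"
}
```

The same body/status capture would then be repeated for the `PUT` to `apiv2/source/NessieSource`.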
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/e2e_dbt_project/docker/dremio/dremio-setup.sh`:
- Around line 7-10: The readiness loop that uses "until curl -s
http://dremio:9047; do ... done" must be bounded: add a MAX_ATTEMPTS variable
and a counter, set curl timeouts (e.g., --connect-timeout and/or --max-time) so
each try fails fast, increment the counter inside the loop, and when attempts >=
MAX_ATTEMPTS log an error and exit non-zero; update the existing until/curl loop
code in the script to use these variables and checks so CI will fail instead of
hanging indefinitely.
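A reusable sketch of the bounded loop the comment asks for; the probe command is passed as arguments, and the sleep interval defaults to 0 only so the sketch runs instantly (a real CI loop would use ~5 seconds plus curl timeouts):

```shell
# Hypothetical bounded readiness loop: fails fast with a non-zero exit instead
# of hanging forever.
wait_for_service() {
  local max_attempts="$1"; shift
  local interval="${WAIT_INTERVAL:-0}"  # illustrative; CI would use ~5s
  local attempt
  for attempt in $(seq 1 "$max_attempts"); do
    if "$@" > /dev/null 2>&1; then
      return 0
    fi
    echo "Not ready yet (attempt $attempt/$max_attempts)" >&2
    sleep "$interval"
  done
  echo "Service did not become ready after $max_attempts attempts" >&2
  return 1
}

# Usage sketch:
# wait_for_service 60 curl -sf --connect-timeout 2 --max-time 5 http://dremio:9047 || exit 1
```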


ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 40b2c74 and bac36f9.

📒 Files selected for processing (2)
  • tests/e2e_dbt_project/docker/dremio/dremio-setup.sh
  • tests/profiles/profiles.yml.j2
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/profiles/profiles.yml.j2

- Make generate_data.py deterministic (fixed random seed)
- Use fixed schema name for Docker adapters (ephemeral containers)
- Cache seeded Docker volumes between runs using actions/cache
- Cache DuckDB database file between runs
- Skip dbt seed on cache hit, restoring from cached volumes instead
- Applies to: Spark, Trino, Dremio, Postgres, ClickHouse, DuckDB

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
.github/workflows/test-warehouse.yml (1)

109-113: ⚠️ Potential issue | 🟠 Major

Readiness loops still don’t fail on timeout.

These checks can still silently continue after 60 attempts if the service never becomes healthy, which allows downstream steps to run against broken dependencies.

Also applies to: 124-128, 141-145

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/test-warehouse.yml around lines 109 - 113, The readiness
loops using "for i in $(seq 1 60); do curl -sf http://localhost:8086/v1/info &&
break; ... done" do not fail when the service never becomes healthy; update each
such loop (the curl-based readiness loops at the three occurrences) to detect
timeout and exit non-zero if the loop completes without success — e.g., after
the done add a check like "if [ $i -eq 60 ] && ! curl -sf
http://localhost:8086/v1/info; then echo 'Service did not become ready' >&2;
exit 1; fi" (or equivalent) so the workflow fails instead of continuing.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/test-warehouse.yml:
- Around line 249-264: The "Restore cached seed data into Docker volumes"
workflow step currently restores tarballs into volumes after the warehouse
services have started (uses CACHE_DIR, docker volume create, docker run ... tar
xzf), which can corrupt live DB files; move this step to run before the job step
that starts warehouse containers (the service startup block) so volumes are
populated prior to container creation, or alternatively modify the step to stop
the warehouse containers before restore and then restart and run a readiness
check (e.g., wait-for/postgres healthcheck) after extraction to ensure services
reinitialize cleanly.



📥 Commits

Reviewing files that changed from the base of the PR and between bac36f9 and 3682f1c.

📒 Files selected for processing (2)
  • .github/workflows/test-warehouse.yml
  • tests/e2e_dbt_project/generate_data.py

Addresses CodeRabbit review: restoring cached tarballs into Docker
volumes while containers are already running risks data corruption.
Now the cache key computation and volume restore happen before any
Docker services are started, so containers initialise with the
pre-seeded data.

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
.github/workflows/test-warehouse.yml (1)

160-164: ⚠️ Potential issue | 🟠 Major

Readiness loops still don’t fail the job on timeout.

On lines 160-164, 175-179, and 192-196, the loops stop on success but never exit 1 when retries are exhausted. The workflow can continue with unhealthy dependencies and fail later with misleading errors.

Fail-fast pattern to apply to each loop
-          for i in $(seq 1 60); do
-            curl -sf http://localhost:8086/v1/info && break
-            echo "Waiting for Trino... ($i/60)"
-            sleep 5
-          done
+          ready=0
+          for i in $(seq 1 60); do
+            if curl -sf http://localhost:8086/v1/info > /dev/null; then
+              ready=1
+              break
+            fi
+            echo "Waiting for Trino... ($i/60)"
+            sleep 5
+          done
+          if [ "$ready" -ne 1 ]; then
+            echo "Timed out waiting for Trino"
+            exit 1
+          fi

Use the same timeout-fail pattern for Dremio and Spark checks.

Also applies to: 175-179, 192-196

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/test-warehouse.yml around lines 160 - 164, The readiness
loops that poll services (the Trino loop shown and the similar Dremio and Spark
loops) currently break on success but do nothing when retries are exhausted;
update each loop so that if the curl never succeeds within the retries it fails
the job (exit 1). Concretely, after each for/seq loop (the Trino, Dremio, and
Spark readiness checks), check whether the last curl succeeded and, if not,
print an error and run exit 1 (or set a failure flag inside the loop and exit
after the loop) so the workflow stops immediately on timeout.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/test-warehouse.yml:
- Around line 96-103: The SEED_HASH calculation (the SEED_HASH env var block)
currently hashes generate_data.py, dbt_project.yml and the profiles.yml.j2
template but omits the Docker stack/compose definition, allowing cached DB
volume tarballs to be restored against incompatible images or volume layouts;
update the SEED_HASH command to also include the Docker stack/compose file(s)
used to bring up the test DB (e.g., docker-compose.yml,
docker-compose.override.yml or any stack definition referenced by the workflow)
so changes in images/layouts change the cache key and prevent restoring
incompatible tarballs.
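One way to sketch the broader key, assuming the hash is built from file contents (paths are illustrative; the actual workflow may use GitHub's `hashFiles()` expression instead):

```shell
# Hypothetical cache-key helper: hash every input that shapes the seeded
# volumes, so a compose/image change invalidates the cached tarballs.
seed_hash() {
  cat "$@" | sha256sum | cut -d' ' -f1
}

# Illustrative usage:
# SEED_HASH=$(seed_hash tests/e2e_dbt_project/generate_data.py \
#   tests/e2e_dbt_project/dbt_project.yml \
#   tests/profiles/profiles.yml.j2 \
#   tests/e2e_dbt_project/docker/docker-compose.yml)
```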



📥 Commits

Reviewing files that changed from the base of the PR and between 3682f1c and 7969949.

📒 Files selected for processing (1)
  • .github/workflows/test-warehouse.yml

devin-ai-integration bot and others added 2 commits March 1, 2026 15:37
… loops

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…ching

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
devin-ai-integration bot and others added 5 commits March 1, 2026 15:46
…upport

Points to dbt-data-reliability#948 which adds:
- trino__full_name_split (1-based array indexing)
- trino__edr_get_create_table_as_sql (bypass model.config issue)
- spark__edr_get_create_table_as_sql

TODO: revert after dbt-data-reliability#948 is merged
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
Prevents tar race condition where ClickHouse temporary merge files
disappear during archiving, causing 'No such file or directory' errors.

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…ache

After stopping containers for volume archiving and restarting them,
services like Trino need time to reinitialize. Added per-adapter
health checks to wait for readiness before proceeding.

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
The /v1/info endpoint returns HTTP 200 even when Trino is still
initializing. Check for '"starting":false' in the response body
to ensure Trino is fully ready before proceeding.

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…rino

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
where {{ elementary.edr_datediff(elementary.edr_cast_as_timestamp('generated_at'), elementary.edr_current_timestamp(), 'day') }} < {{ days_back }}
)
-select count(*) as count
+select count(*) as {% if target.type == 'dremio' %}"count"{% else %}count{% endif %}
Collaborator

do you think we can use elementary.escape_reserved_keywords? It was created for this purpose
(its code is in the dbt-data-reliability repo)

Contributor Author

Good call! Refactored all manual {% if target.type == 'dremio' %} quoting to use elementary.escape_reserved_keywords() instead. This covers:

  • row_number alias (2 files)
  • count alias (1 file)
  • filter column reference (2 files)

Also created dbt-data-reliability#955 to add row_number and snapshots to the reserved keywords list. CI is temporarily pinned to that branch.

devin-ai-integration bot and others added 5 commits March 3, 2026 01:35
…ed words

Replace manual {% if target.type == 'dremio' %} quoting with the existing
elementary.escape_reserved_keywords() utility from dbt-data-reliability.

Files updated:
- get_models_latest_invocation.sql: row_number alias
- get_models_latest_invocations_data.sql: row_number alias
- can_upload_source_freshness.sql: count alias
- source_freshness_alerts.sql: filter column reference
- get_source_freshness_results.sql: filter column reference

Also temporarily pins dbt-data-reliability to branch with row_number
and snapshots added to the reserved keywords list (PR #955).

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
Replace USING (unique_id) with explicit ON clause and select specific
columns instead of SELECT * to avoid ambiguous column references in
Dremio's SQL engine, which doesn't deduplicate join columns with USING.

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…N join

The switch from USING to ON for Dremio compatibility requires qualifying
column references since ON doesn't deduplicate join columns like USING does.

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/workflows/test-warehouse.yml (1)

292-295: ⚠️ Potential issue | 🟠 Major

Remove continue-on-error: true from dbt test step.

The "Test e2e dbt project" step at line 294 has continue-on-error: true, making dbt test failures non-blocking. This is inconsistent with the Python e2e tests at line 385, which properly fail the workflow when tests fail. Critical validation steps should not be allowed to fail silently.

Suggested fix
      - name: Test e2e dbt project
        working-directory: ${{ env.E2E_DBT_PROJECT_DIR }}
-       continue-on-error: true
        run: |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/test-warehouse.yml around lines 292 - 295, The "Test e2e
dbt project" workflow step (named "Test e2e dbt project" and using
working-directory ${{ env.E2E_DBT_PROJECT_DIR }}) is marked with
continue-on-error: true which allows dbt failures to be ignored; remove the
continue-on-error: true key from that step so dbt test failures will fail the
workflow (making this step consistent with the Python e2e tests).
🧹 Nitpick comments (5)
elementary/monitor/dbt_project/macros/overrides/dateadd.sql (2)

14-14: Document the purpose of removing 'order by 1' from the interval.

The comment block explains the two bugs being fixed (.replace() on integer and select wrapper), but doesn't explain why 'order by 1' needs to be stripped from the interval. This appears to be a workaround for some upstream dbt-dremio behavior or specific use case.

Consider adding a brief inline comment or expanding the header comment to explain when/why an interval parameter would contain 'order by 1'.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@elementary/monitor/dbt_project/macros/overrides/dateadd.sql` at line 14, Add
a brief explanation for stripping 'order by 1' where interval is normalized:
document why interval may include "order by 1" (e.g., upstream dbt-dremio or a
wrapped SELECT that appends an ORDER BY) and note that the line {% set interval
= interval.replace('order by 1', '') %} removes this artifact to prevent SQL
syntax errors or incorrect interval parsing; place this as an inline comment
immediately above that line and/or expand the header comment to mention the
specific upstream behavior and the conditions when the workaround is required.

29-30: Add explicit 'day' case for clarity.

The day datepart falls through to the else branch, which works correctly but is implicit. This could confuse future maintainers who might think day is unsupported or that the else branch handles some error case.

💡 Proposed fix to add explicit day case
     {% elif datepart == 'second' %}
         TIMESTAMPADD(SQL_TSI_SECOND, CAST({{interval}} as int), CAST({{from_date_or_timestamp}} as TIMESTAMP))
+    {% elif datepart == 'day' %}
+        TIMESTAMPADD(SQL_TSI_DAY, CAST({{interval}} as int), CAST({{from_date_or_timestamp}} as TIMESTAMP))
     {% else %}
+        {# Fallback for unrecognized dateparts - treat as day #}
         TIMESTAMPADD(SQL_TSI_DAY, CAST({{interval}} as int), CAST({{from_date_or_timestamp}} as TIMESTAMP))
     {% endif %}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@elementary/monitor/dbt_project/macros/overrides/dateadd.sql` around lines 29
- 30, The template implicitly handles the 'day' datepart in the final else
branch; add an explicit branch for when datepart equals 'day' in the dateadd
macro to map it to TIMESTAMPADD(SQL_TSI_DAY, CAST({{interval}} as int),
CAST({{from_date_or_timestamp}} as TIMESTAMP)) — locate the conditional block in
the dateadd macro (the same block that currently emits TIMESTAMPADD for other
dateparts) and insert an elif/else-if for 'day' that returns the exact
TIMESTAMPADD(SQL_TSI_DAY, ...) expression so the behavior is explicit and
readable.
elementary/monitor/dbt_project/macros/overrides/cast_column.sql (1)

9-10: Only Z suffix is handled; other UTC offsets are not normalized.

The regex 'Z$' only strips the literal Z timezone suffix. ISO 8601 timestamps can also have explicit offsets like +00:00, +05:30, or -08:00. If such timestamps are encountered, they will pass through unmodified and may cause Gandiva parsing failures.

If the input data is guaranteed to only contain Z-suffixed timestamps, this is fine. Otherwise, consider extending the pattern to handle numeric offsets.

💡 Optional: Extended regex to also strip numeric offsets
-    cast(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE({{ timestamp_field }}, '(\d)T(\d)', '$1 $2'), '(\.\d{3})\d+', '$1'), 'Z$', '') as {{ elementary.edr_type_timestamp() }})
+    cast(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE({{ timestamp_field }}, '(\d)T(\d)', '$1 $2'), '(\.\d{3})\d+', '$1'), 'Z$', ''), '[+-]\d{2}:\d{2}$', '') as {{ elementary.edr_type_timestamp() }})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@elementary/monitor/dbt_project/macros/overrides/cast_column.sql` around lines
9 - 10, The macro dremio__edr_cast_as_timestamp currently only strips a trailing
'Z' before casting, which misses numeric UTC offsets; update the final
REGEXP_REPLACE pattern in dremio__edr_cast_as_timestamp so it also matches and
removes timezone offsets like +HH:MM, -HH:MM, +HHMM or -HHMM (in addition to
'Z') prior to casting to elementary.edr_type_timestamp(), using an anchored
regex that accepts optional colon between hours and minutes and both +/− signs.
tests/e2e_dbt_project/macros/system/generate_schema_name.sql (1)

23-23: Minor style inconsistency: {{ return() }} vs {% do return() %}.

This line uses {{ return(...) }} while the rest of the file uses {% do return(...) %}. Both work, but consistency would improve readability.

♻️ Optional: align return style
-      {{ return(dremio__generate_schema_name(custom_schema_name, node)) }}
+      {% do return(dremio__generate_schema_name(custom_schema_name, node)) %}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e_dbt_project/macros/system/generate_schema_name.sql` at line 23,
Replace the inline expression-style return with the statement-style return for
consistency: change the usage of the template expression "{{
return(dremio__generate_schema_name(custom_schema_name, node)) }}" to the
do-statement form "{% do return(dremio__generate_schema_name(custom_schema_name,
node)) %}" so the return in the macro dremio__generate_schema_name matches the
rest of the file's style.
tests/e2e_dbt_project/models/schema.yml (1)

350-350: Consider extracting the data_type logic into a macro for maintainability.

The chained conditionals for target-specific data types are becoming lengthy. As you add more adapters, this pattern will become harder to maintain.

💡 Optional: Create a reusable macro

You could create a macro in the project that centralizes the string type mapping:

{# macros/get_string_type.sql #}
{% macro get_string_type() %}
  {% if target.type in ['bigquery', 'databricks', 'athena', 'spark'] %}
    STRING
  {% elif target.type in ['trino', 'dremio', 'duckdb'] %}
    varchar
  {% elif target.type == 'redshift' %}
    character varying
  {% else %}
    TEXT
  {% endif %}
{% endmacro %}

Then use it in schema.yml:

data_type: "{{ get_string_type() }}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e_dbt_project/models/schema.yml` at line 350, Extract the inline
Jinja conditional in the data_type field into a reusable macro (e.g.,
get_string_type) that encapsulates the adapter-to-string-type mapping and then
call that macro from the schema.yml data_type line; specifically, create a macro
(name it get_string_type) that checks target.type and returns the appropriate
literal for bigquery/databricks/athena/spark, trino/dremio/duckdb, redshift, and
a default, and replace the existing inline expression in the data_type field
with "{{ get_string_type() }}" so the mapping logic lives in one place and is
easier to maintain.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/test-warehouse.yml:
- Around line 269-275: The script currently builds EXTRA_ARGS as a plain string
and expands it unquoted in the dbt run invocation, which is brittle and triggers
SC2086; change EXTRA_ARGS into a bash array (e.g., EXTRA_ARGS=() and
EXTRA_ARGS+=(--threads 1) when testing inputs.warehouse-type == "dremio") and
then invoke dbt run --target "${{ inputs.warehouse-type }}" "${EXTRA_ARGS[@]}"
|| true so each argument is passed safely without word-splitting or globbing.

In
`@elementary/monitor/dbt_project/macros/base_queries/current_tests_run_results_query.sql`:
- Line 3: Occurrences of ref(...) use the unsupported keyword argument package=;
update each ref call to the positional form with the package name first (e.g.
change ref('elementary_test_results', package='elementary') to the positional
form with 'elementary' as the first argument) for the following identifiers:
ref('elementary_test_results', package='elementary'), ref('dbt_run_results',
package='elementary'), ref('dbt_tests', package='elementary'), ref('dbt_models',
package='elementary'), and ref('dbt_sources', package='elementary').

In `@tests/e2e_dbt_project/ci/save_seed_cache.sh`:
- Around line 32-41: The readiness loops for ClickHouse (curl /ping) and
Postgres (pg_isready) currently break on success but silently continue on
exhaustion; update both loops in save_seed_cache.sh to detect when the retry
loop finishes without a successful check and immediately fail the script (exit
non-zero) with a clear error message; specifically, after the for ...; do ...;
done blocks that use curl -sf http://localhost:8123/ping and pg_isready -h
localhost -p 5432, add logic to check whether the loop exited via break or
exhausted all attempts and call exit 1 (or return non-zero) and log an error so
the CI job fails fast when services are not ready.

In `@tests/e2e_dbt_project/external_seeders/base.py`:
- Around line 24-27: The run helper currently uses subprocess.run(...,
shell=True) which opens a command-injection vector; change run(cmd: str, ...) to
accept either a sequence or build cmd as List[str] and call
subprocess.run(cmd_list, shell=False, check=check, **kw) so commands are
executed without a shell, and update callers (notably _upload_csvs_to_minio and
the mc_cmds construction in dremio.py) to pass a list of argv tokens instead of
interpolated shell strings; if you must keep composing shell strings, escape all
environment-derived values (minio_access_key, minio_secret_key, network, etc.)
with shlex.quote() before interpolation, but the preferred fix is refactoring
_upload_csvs_to_minio to build and pass list-based commands to run to eliminate
shell and chaining via "&&".
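A sketch of the list-based `run` helper the comment proposes (the real seeder's signature and call sites may differ):

```python
# Hypothetical replacement for the shell=True helper: execute argv tokens
# directly, with no shell, so values like keys or network names cannot be
# interpreted as shell syntax.
import subprocess
from typing import List


def run(cmd: List[str], check: bool = True) -> subprocess.CompletedProcess:
    return subprocess.run(cmd, shell=False, check=check,
                          capture_output=True, text=True)
```

Callers would then pass argv lists and issue one `run()` per command instead of chaining with `&&`.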

In `@tests/e2e_dbt_project/external_seeders/dremio.py`:
- Around line 336-341: The current try/except blocks around calls to self._sql
(e.g., where you run create_sql, COPY INTO and refresh operations) swallow
exceptions and only print warnings so the seeder can exit successfully with a
partial state; change the behavior to collect failures and re-raise after
attempting all tables: capture exceptions (including table_name and exception
message) into a list (e.g., failed_tables or errors) inside each except instead
of just printing, continue attempting remaining tables, and at the end of the
loading routine check the collection and raise a single aggregated Exception (or
RuntimeError) summarizing all table failures; reference the existing self._sql
calls, created_tables list and table_name variables to locate the blocks to
update, and apply the same pattern to the other try/except blocks noted for COPY
INTO and refresh.
- Around line 281-282: The retry exhaustion currently only prints "  ERROR:
Failed to create SeedFiles source after 3 attempts" and continues; change this
to raise an exception so CI fails fast: replace that print with raising a
descriptive exception (e.g., raise RuntimeError("Failed to create SeedFiles
source after 3 attempts")) at the same location where the S3/SeedFiles creation
retry loop ends (the code that prints that exact message), ensuring the
exception bubbles out of the loader so the test run fails.
- Around line 99-125: The HTTP calls currently have no timeouts and can hang;
update every requests call to include an explicit timeout (use timeout=(5, 30)
or similar) — specifically add the timeout param to requests.post in _get_token,
to both requests.post and the periodic requests.get in _sql (so each looped poll
cannot block past the deadline), and to the requests.post, requests.get, and
both requests.put calls inside _create_s3_source; keep the same headers/json
args and only append timeout=(5, 30) (or configurable constants) to each call.

In `@tests/e2e_dbt_project/external_seeders/spark.py`:
- Around line 51-55: The try/except around cursor.execute currently prints
errors and continues, which swallows failures; change the logic in the failing
blocks (the try/except handling the cursor.execute calls at the spots shown and
the similar blocks around lines handling other seed executions) to aggregate
failures and return a non-zero exit or re-raise after processing so the process
fails: collect exceptions into a failure list (e.g., seed_errors), append the
error and the SQL/seed info inside each except, and after the loop if
seed_errors is non-empty call sys.exit(1) or raise a RuntimeError summarizing
the failures so the test run exits non-zero instead of reporting success. Ensure
you update all occurrences referenced (the try/except around cursor.execute and
the other two similar blocks).
- Around line 45-49: The SQL-building uses unescaped dynamic identifiers
(`seed_schema`, `table_name`, `tmp_view`, and column names from `cols`), so
implement and use a helper like quote_identifier(name) that escapes any
backticks in name (replace ` with ``) and then wraps the result in backticks;
apply this helper when building `col_defs` (the generator that builds f"`{c}`
STRING"), when interpolating `{table_name}` and `{seed_schema}` in the CREATE
TABLE SQL, and wherever `tmp_view` is used (also in the other block around lines
58-70) so all dynamic identifiers are safely quoted before embedding in Spark
SQL.
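A sketch of the quoting helper, assuming Spark's backtick rules (embedded backticks are doubled); `build_create_table` is an illustrative stand-in for the seeder's SQL assembly:

```python
# Hypothetical identifier quoting for Spark SQL: escape embedded backticks,
# then wrap in backticks, before interpolating into DDL.
def quote_identifier(name: str) -> str:
    return "`" + name.replace("`", "``") + "`"


def build_create_table(seed_schema: str, table_name: str, cols: list) -> str:
    col_defs = ", ".join(f"{quote_identifier(c)} STRING" for c in cols)
    return (
        f"CREATE TABLE {quote_identifier(seed_schema)}."
        f"{quote_identifier(table_name)} ({col_defs})"
    )
```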

In `@tests/e2e_dbt_project/load_seeds_external.py`:
- Around line 29-44: The CLI docstring for main currently says SCHEMA_NAME is
the target schema but for adapter "spark" that argument is ignored; update main
to either document this behavior and/or enforce it: inside main, detect if
adapter == "spark" and set schema_name = "test_seeds" (or call out and
raise/echo a warning) before creating seeder_cls = SEEDERS[adapter] and
instantiating seeder = seeder_cls(...); also update the function docstring to
explicitly state that SCHEMA_NAME is ignored for Spark (or mention that it will
be overridden to "test_seeds") so callers understand the behavior.


ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 7969949 and d48a335.

⛔ Files ignored due to path filters (2)
  • tests/e2e_dbt_project/data/training/numeric_column_anomalies_training.csv is excluded by !**/*.csv
  • tests/e2e_dbt_project/data/validation/numeric_column_anomalies_validation.csv is excluded by !**/*.csv
📒 Files selected for processing (54)
  • .github/workflows/test-warehouse.yml
  • elementary/monitor/dbt_project/dbt_project.yml
  • elementary/monitor/dbt_project/macros/alerts/population/model_alerts.sql
  • elementary/monitor/dbt_project/macros/alerts/population/source_freshness_alerts.sql
  • elementary/monitor/dbt_project/macros/alerts/population/test_alerts.sql
  • elementary/monitor/dbt_project/macros/base_queries/current_tests_run_results_query.sql
  • elementary/monitor/dbt_project/macros/base_queries/owners.sql
  • elementary/monitor/dbt_project/macros/base_queries/resources.sql
  • elementary/monitor/dbt_project/macros/base_queries/tags.sql
  • elementary/monitor/dbt_project/macros/can_upload_source_freshness.sql
  • elementary/monitor/dbt_project/macros/get_dbt_models_test_coverage.sql
  • elementary/monitor/dbt_project/macros/get_exposures.sql
  • elementary/monitor/dbt_project/macros/get_models.sql
  • elementary/monitor/dbt_project/macros/get_models_latest_invocation.sql
  • elementary/monitor/dbt_project/macros/get_models_latest_invocations_data.sql
  • elementary/monitor/dbt_project/macros/get_models_runs.sql
  • elementary/monitor/dbt_project/macros/get_nodes_depends_on_nodes.sql
  • elementary/monitor/dbt_project/macros/get_result_rows_agate.sql
  • elementary/monitor/dbt_project/macros/get_seeds.sql
  • elementary/monitor/dbt_project/macros/get_singular_tests.sql
  • elementary/monitor/dbt_project/macros/get_snapshots.sql
  • elementary/monitor/dbt_project/macros/get_source_freshness_results.sql
  • elementary/monitor/dbt_project/macros/get_sources.sql
  • elementary/monitor/dbt_project/macros/get_test_last_invocation.sql
  • elementary/monitor/dbt_project/macros/get_test_results.sql
  • elementary/monitor/dbt_project/macros/get_tests.sql
  • elementary/monitor/dbt_project/macros/overrides/cast_column.sql
  • elementary/monitor/dbt_project/macros/overrides/dateadd.sql
  • elementary/monitor/dbt_project/models/alerts/alerts_v2.sql
  • elementary/monitor/dbt_project/models/alerts/deprecated/alerts.sql
  • tests/e2e/report/fixtures/elementary_output.json
  • tests/e2e_dbt_project/ci/restore_seed_cache.sh
  • tests/e2e_dbt_project/ci/save_seed_cache.sh
  • tests/e2e_dbt_project/dbt_project.yml
  • tests/e2e_dbt_project/docker-compose.yml
  • tests/e2e_dbt_project/docker/dremio/dremio-setup.sh
  • tests/e2e_dbt_project/docker/spark/Dockerfile
  • tests/e2e_dbt_project/external_seeders/__init__.py
  • tests/e2e_dbt_project/external_seeders/base.py
  • tests/e2e_dbt_project/external_seeders/dremio.py
  • tests/e2e_dbt_project/external_seeders/spark.py
  • tests/e2e_dbt_project/generate_data.py
  • tests/e2e_dbt_project/load_seeds_external.py
  • tests/e2e_dbt_project/macros/system/drop_test_schemas.sql
  • tests/e2e_dbt_project/macros/system/generate_schema_name.sql
  • tests/e2e_dbt_project/models/nested/models/tree/nested.sql
  • tests/e2e_dbt_project/models/numeric_column_anomalies.sql
  • tests/e2e_dbt_project/models/one.sql
  • tests/e2e_dbt_project/models/schema.yml
  • tests/e2e_dbt_project/tests/singular_test_with_no_ref.sql
  • tests/e2e_dbt_project/tests/singular_test_with_one_ref.sql
  • tests/e2e_dbt_project/tests/singular_test_with_source_ref.sql
  • tests/e2e_dbt_project/tests/singular_test_with_two_refs.sql
  • tests/profiles/profiles.yml.j2
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/e2e_dbt_project/docker/dremio/dremio-setup.sh

devin-ai-integration bot and others added 3 commits March 3, 2026 11:11
- Revert ref() syntax from package= keyword to positional form in 20 monitor macros
- Add HTTP_TIMEOUT constant and apply to all 7 requests calls in dremio.py
- Raise RuntimeError on S3 source creation failure instead of silent print
- Aggregate and raise failures in dremio.py and spark.py load() methods
- Fix shell=True injection: convert base.py run() to list-based subprocess
- Quote MinIO credentials with shlex.quote() in dremio.py
- Add backtick-escaping helper _q() for Spark SQL identifiers
- Fail fast on readiness timeout in save_seed_cache.sh
- Convert EXTRA_ARGS to bash array in test-warehouse.yml (SC2086)
- Remove continue-on-error from dbt test step
- Add explicit day case in dateadd.sql override
- Document Spark schema_name limitation in load_seeds_external.py

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (6)
elementary/monitor/dbt_project/macros/overrides/dateadd.sql (1)

34-36: Silent fallback to day for unrecognized dateparts may mask bugs.

If a caller passes an invalid or misspelled datepart (e.g., 'years' instead of 'year'), this silently defaults to day without any indication. Consider raising an error or logging a warning to surface misuse early.

🛡️ Proposed fix to raise an error for unrecognized dateparts
     {% elif datepart == 'day' %}
         TIMESTAMPADD(SQL_TSI_DAY, CAST({{interval}} as int), CAST({{from_date_or_timestamp}} as TIMESTAMP))
     {% else %}
-        {# Fallback for unrecognized dateparts — default to day #}
-        TIMESTAMPADD(SQL_TSI_DAY, CAST({{interval}} as int), CAST({{from_date_or_timestamp}} as TIMESTAMP))
+        {{ exceptions.raise_compiler_error("dremio__dateadd: unrecognized datepart '" ~ datepart ~ "'. Supported values: year, quarter, month, week, day, hour, minute, second.") }}
     {% endif %}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@elementary/monitor/dbt_project/macros/overrides/dateadd.sql` around lines 34
- 36, The current else branch silently falls back to TIMESTAMPADD(SQL_TSI_DAY,
...) for unrecognized dateparts; replace that silent fallback with a clear
failure by invoking dbt's compiler error helper
(exceptions.raise_compiler_error) and include the invalid datepart value in the
message so callers see the misuse; update the macro (dateadd.sql) to call
exceptions.raise_compiler_error with a descriptive message instead of returning
TIMESTAMPADD for the else branch (keep TIMESTAMPADD only for the valid "day"
branch).
tests/e2e_dbt_project/external_seeders/base.py (1)

58-58: Use os.path.splitext for table name derivation.

Line 58 can strip internal .csv fragments (e.g. a.csv.backup.csv → a.backup). splitext is safer and explicit.

♻️ Proposed change
-                table_name = fname.replace(".csv", "")
+                table_name, _ = os.path.splitext(fname)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e_dbt_project/external_seeders/base.py` at line 58, Replace the
fragile string replace used to derive table_name from fname (table_name =
fname.replace(".csv", "")) with a splitext-based extraction: use
os.path.splitext(fname)[0] so filenames like "a.csv.backup.csv" correctly yield
"a.csv.backup" (update imports to include os if missing and adjust any
references to table_name accordingly).
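The difference between the two approaches can be seen in plain Python:

```python
import os

fname = "a.csv.backup.csv"

# Fragile: str.replace removes every ".csv" fragment, not just the extension.
print(fname.replace(".csv", ""))   # a.backup

# Safe: splitext strips only the final extension.
print(os.path.splitext(fname)[0])  # a.csv.backup
```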
tests/e2e_dbt_project/external_seeders/spark.py (1)

38-40: Guarantee cursor/connection cleanup on early exceptions.

If execution fails before Line 84, resource cleanup is skipped. Wrap DB work in try/finally (or context managers).

♻️ Proposed structure
         conn = hive.Connection(host=host, port=port, username="dbt")
         cursor = conn.cursor()
-
-        print(f"Creating schema '{seed_schema}'...")
-        cursor.execute(f"CREATE DATABASE IF NOT EXISTS `{seed_schema}`")
-
-        for subdir, csv_path, table_name in self.iter_seed_csvs():
-            ...
-
-        cursor.close()
-        conn.close()
+        try:
+            print(f"Creating schema '{seed_schema}'...")
+            cursor.execute(f"CREATE DATABASE IF NOT EXISTS `{seed_schema}`")
+
+            for subdir, csv_path, table_name in self.iter_seed_csvs():
+                ...
+        finally:
+            cursor.close()
+            conn.close()

Also applies to: 84-85

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e_dbt_project/external_seeders/spark.py` around lines 38 - 40, Wrap
the Hive connection and cursor usage in a try/finally (or use context managers)
so they are always cleaned up on early exceptions: create conn via
hive.Connection and cursor = conn.cursor() inside a try block, perform DB
operations there, and in the finally ensure cursor.close() and conn.close() are
called (apply the same pattern to the later block around the code at lines
84-85). Reference the variables conn and cursor and the DB operation blocks so
you ensure both the initial connection/cursor and the later one are always
closed even if an exception is raised.
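Besides try/finally, the cleanup guarantee can be had with contextlib.closing. A self-contained sketch, where FakeConn/FakeCursor are hypothetical stand-ins for pyhive's hive.Connection and its cursor:

```python
from contextlib import closing

class FakeCursor:
    def __init__(self):
        self.closed = False
    def execute(self, sql):
        # Simulate a failure mid-load to show cleanup still runs.
        raise RuntimeError("simulated mid-load failure")
    def close(self):
        self.closed = True

class FakeConn:
    def __init__(self):
        self.closed = False
        self._cursor = FakeCursor()
    def cursor(self):
        return self._cursor
    def close(self):
        self.closed = True

conn = FakeConn()
try:
    # closing() calls .close() on both objects even when execute() raises.
    with closing(conn) as c, closing(c.cursor()) as cur:
        cur.execute("CREATE DATABASE IF NOT EXISTS seeds")
except RuntimeError:
    pass

print(conn.closed, conn._cursor.closed)  # True True
```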
tests/e2e_dbt_project/external_seeders/dremio.py (1)

337-351: Escape quoted identifiers for schema/table/column names.

seed_schema, table_name, and CSV header values are embedded into quoted identifiers without escaping internal " characters.

♻️ Suggested helper and usage
 class DremioExternalSeeder(ExternalSeeder):
@@
+    `@staticmethod`
+    def _q_ident(name: str) -> str:
+        return '"' + name.replace('"', '""') + '"'
@@
-        nessie_ns = f'NessieSource."{seed_schema}"'
+        q = self._q_ident
+        nessie_ns = f"NessieSource.{q(seed_schema)}"
@@
-            fqn = f'{nessie_ns}."{table_name}"'
+            fqn = f"{nessie_ns}.{q(table_name)}"
@@
-            col_defs = ", ".join(f'"{c}" VARCHAR' for c in cols)
+            col_defs = ", ".join(f"{q(c)} VARCHAR" for c in cols)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e_dbt_project/external_seeders/dremio.py` around lines 337 - 351, The
code builds quoted identifiers using seed_schema, table_name and CSV header
values without escaping embedded double-quotes (e.g., nessie_ns, fqn, col_defs),
which can produce invalid SQL or injection; add a small helper (e.g.,
quote_identifier(name: str) -> str) that returns the name with internal "
characters doubled and wrapped in double quotes, then use it when composing
nessie_ns, fqn and each column name produced by csv_columns() (replace
f'{...}"{table_name}"' and f'"{c}" VARCHAR' with calls to the helper) so all
schema/table/column identifiers are safely escaped before building create_sql.
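A minimal runnable sketch of the proposed helper (the name quote_identifier follows the prompt; the real seeder may integrate it as a method, e.g. _q_ident, as in the diff above):

```python
def quote_identifier(name: str) -> str:
    # Double any embedded double-quote, then wrap the whole name in
    # double quotes: the standard SQL rule for quoted identifiers.
    return '"' + name.replace('"', '""') + '"'

print(quote_identifier("plain"))   # "plain"
print(quote_identifier('we"ird'))  # "we""ird"
```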
.github/workflows/test-warehouse.yml (2)

255-255: Simplify redundant cache-save condition.

Line 255 already restricts to (postgres || clickhouse), so the extra != duckdb/trino/spark/dremio checks are unnecessary and make maintenance harder.

Minimal cleanup
-        if: steps.seed-cache.outputs.cache-hit != 'true' && inputs.warehouse-type != 'duckdb' && inputs.warehouse-type != 'trino' && inputs.warehouse-type != 'spark' && inputs.warehouse-type != 'dremio' && (inputs.warehouse-type == 'postgres' || inputs.warehouse-type == 'clickhouse')
+        if: steps.seed-cache.outputs.cache-hit != 'true' && (inputs.warehouse-type == 'postgres' || inputs.warehouse-type == 'clickhouse')
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/test-warehouse.yml at line 255, The conditional on line
containing "if: steps.seed-cache.outputs.cache-hit != 'true' &&
inputs.warehouse-type != 'duckdb' && inputs.warehouse-type != 'trino' &&
inputs.warehouse-type != 'spark' && inputs.warehouse-type != 'dremio' &&
(inputs.warehouse-type == 'postgres' || inputs.warehouse-type == 'clickhouse')"
is redundant; remove the unnecessary "!= duckdb/trino/spark/dremio" checks and
simplify it to check only that the cache missed and the warehouse type is
postgres or clickhouse, e.g. replace the whole condition with "if:
steps.seed-cache.outputs.cache-hit != 'true' && (inputs.warehouse-type ==
'postgres' || inputs.warehouse-type == 'clickhouse')" so future maintenance is
easier.

93-95: Align cache-enabled adapter lists between restore and save paths.

Lines 93-95 allow cache compute/restore for trino, dremio, and spark, but Lines 254-260 never save cache for those adapters in this workflow. This creates ongoing cache misses and extra cache API work. Either add save support for them or remove them from restore eligibility.

Also applies to: 254-260

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/test-warehouse.yml around lines 93 - 95, The restore
conditional that checks inputs.warehouse-type includes adapters (trino, dremio,
spark, duckdb) that are not included in the cache save path, causing cache
misses; either remove those adapters from the restore condition (the if:
inputs.warehouse-type == ... expression used before the run block) or add them
to the cache save step (the workflow’s cache save step / actions/cache usage
later in the file that currently limits saved adapters) so the lists match
exactly; update the same adapter names in both the restore and save/cache steps
to keep symmetry.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/test-warehouse.yml:
- Around line 145-153: The workflow currently uses "docker compose up
--exit-code-from dremio-setup" which implicitly aborts all services when the
dremio-setup container exits; instead bring the stack up detached, verify
readiness, then run the setup container separately and fail the job if it
returns non-zero: replace the single "docker compose up --exit-code-from
dremio-setup dremio-setup" step with a two-step sequence that (1) runs "docker
compose up -d" to start Dremio and MinIO, waits/health-checks until the services
are ready (e.g., polling Dremio/MinIO endpoints or using docker compose
ps/healthchecks), and (2) runs the setup as an isolated run (e.g., "docker
compose run --rm dremio-setup") capturing its exit code and failing the job when
that exit code is non-zero so successful setup does not stop the background
services.

In `@tests/e2e_dbt_project/external_seeders/dremio.py`:
- Around line 251-276: The 409-handling branch currently returns unconditionally
even when GET (get_resp) or PUT (put_resp) fail, preventing retries and leaving
SeedFiles potentially broken; change the logic in the 409 branch so you only
return when the PUT succeeds (put_resp.status_code == 200) and otherwise do not
return—log the warning for non-200 GET/PUT responses but let the function
continue (so callers/retries can handle/fix the failure) by removing the
premature `return` after the GET/PUT failure paths and keeping the
successful-return only inside the `if put_resp.status_code == 200` block.

In `@tests/e2e_dbt_project/external_seeders/spark.py`:
- Line 32: The runtime import "from pyhive import hive" introduces a missing
dependency; add "pyhive" (and optional transitive packages such as "thrift-sasl"
if SASL auth is used) to the project's dependency manifests—e.g., add pyhive to
the main dependencies in pyproject.toml (or the relevant extras group used for
e2e/CI) and also to dev-requirements.txt so CI/e2e environments install it;
ensure the versions are compatible with existing thrift libraries and update
lockfiles if present.

---

Nitpick comments:
In @.github/workflows/test-warehouse.yml:
- Line 255: The conditional on line containing "if:
steps.seed-cache.outputs.cache-hit != 'true' && inputs.warehouse-type !=
'duckdb' && inputs.warehouse-type != 'trino' && inputs.warehouse-type != 'spark'
&& inputs.warehouse-type != 'dremio' && (inputs.warehouse-type == 'postgres' ||
inputs.warehouse-type == 'clickhouse')" is redundant; remove the unnecessary "!=
duckdb/trino/spark/dremio" checks and simplify it to check only that the cache
missed and the warehouse type is postgres or clickhouse, e.g. replace the whole
condition with "if: steps.seed-cache.outputs.cache-hit != 'true' &&
(inputs.warehouse-type == 'postgres' || inputs.warehouse-type == 'clickhouse')"
so future maintenance is easier.
- Around line 93-95: The restore conditional that checks inputs.warehouse-type
includes adapters (trino, dremio, spark, duckdb) that are not included in the
cache save path, causing cache misses; either remove those adapters from the
restore condition (the if: inputs.warehouse-type == ... expression used before
the run block) or add them to the cache save step (the workflow’s cache save
step / actions/cache usage later in the file that currently limits saved
adapters) so the lists match exactly; update the same adapter names in both the
restore and save/cache steps to keep symmetry.

In `@elementary/monitor/dbt_project/macros/overrides/dateadd.sql`:
- Around line 34-36: The current else branch silently falls back to
TIMESTAMPADD(SQL_TSI_DAY, ...) for unrecognized dateparts; replace that silent
fallback with a clear failure by invoking dbt's compiler error helper
(exceptions.raise_compiler_error) and include the invalid datepart value in the
message so callers see the misuse; update the macro (dateadd.sql) to call
exceptions.raise_compiler_error with a descriptive message instead of returning
TIMESTAMPADD for the else branch (keep TIMESTAMPADD only for the valid "day"
branch).

In `@tests/e2e_dbt_project/external_seeders/base.py`:
- Line 58: Replace the fragile string replace used to derive table_name from
fname (table_name = fname.replace(".csv", "")) with a splitext-based extraction:
use os.path.splitext(fname)[0] so filenames like "a.csv.backup.csv" correctly
yield "a.csv.backup" (update imports to include os if missing and adjust any
references to table_name accordingly).

In `@tests/e2e_dbt_project/external_seeders/dremio.py`:
- Around line 337-351: The code builds quoted identifiers using seed_schema,
table_name and CSV header values without escaping embedded double-quotes (e.g.,
nessie_ns, fqn, col_defs), which can produce invalid SQL or injection; add a
small helper (e.g., quote_identifier(name: str) -> str) that returns the name
with internal " characters doubled and wrapped in double quotes, then use it
when composing nessie_ns, fqn and each column name produced by csv_columns()
(replace f'{...}"{table_name}"' and f'"{c}" VARCHAR' with calls to the helper)
so all schema/table/column identifiers are safely escaped before building
create_sql.

In `@tests/e2e_dbt_project/external_seeders/spark.py`:
- Around line 38-40: Wrap the Hive connection and cursor usage in a try/finally
(or use context managers) so they are always cleaned up on early exceptions:
create conn via hive.Connection and cursor = conn.cursor() inside a try block,
perform DB operations there, and in the finally ensure cursor.close() and
conn.close() are called (apply the same pattern to the later block around the
code at lines 84-85). Reference the variables conn and cursor and the DB
operation blocks so you ensure both the initial connection/cursor and the later
one are always closed even if an exception is raised.


📥 Commits

Reviewing files that changed from the base of the PR and between d48a335 and 19652db.

📒 Files selected for processing (10)
  • .github/workflows/test-warehouse.yml
  • elementary/monitor/dbt_project/macros/alerts/population/model_alerts.sql
  • elementary/monitor/dbt_project/macros/alerts/population/source_freshness_alerts.sql
  • elementary/monitor/dbt_project/macros/get_source_freshness_results.sql
  • elementary/monitor/dbt_project/macros/overrides/dateadd.sql
  • tests/e2e_dbt_project/ci/save_seed_cache.sh
  • tests/e2e_dbt_project/external_seeders/base.py
  • tests/e2e_dbt_project/external_seeders/dremio.py
  • tests/e2e_dbt_project/external_seeders/spark.py
  • tests/e2e_dbt_project/load_seeds_external.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • tests/e2e_dbt_project/ci/save_seed_cache.sh
  • elementary/monitor/dbt_project/macros/alerts/population/model_alerts.sql
  • elementary/monitor/dbt_project/macros/alerts/population/source_freshness_alerts.sql

f"(schema={seed_schema}) ==="
)

from pyhive import hive

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Checking line 32 in spark.py =="
if [ -f "tests/e2e_dbt_project/external_seeders/spark.py" ]; then
  sed -n '25,40p' "tests/e2e_dbt_project/external_seeders/spark.py"
else
  echo "File not found, listing directory contents:"
  find tests -name "spark.py" -type f 2>/dev/null | head -5
fi

echo
echo "== All pyhive imports in e2e seeder code =="
rg -n 'pyhive' tests/e2e_dbt_project -C2

echo
echo "== Searching for dependency manifests =="
fd -t f '(pyproject\.toml|setup\.py|setup\.cfg|requirements.*\.txt|Pipfile|poetry\.lock)' --max-results 20

echo
echo "== Checking for pyhive in dependency files =="
fd -t f '(pyproject\.toml|setup\.py|setup\.cfg|requirements.*\.txt)' | while read -r f; do
  if rg -q 'pyhive|PyHive' "$f" 2>/dev/null; then
    echo "Found in: $f"
    rg -n 'pyhive|PyHive' "$f"
  fi
done

Repository: elementary-data/elementary

Length of output: 1138


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Content of pyproject.toml (looking for pyhive/thrift) =="
if [ -f "pyproject.toml" ]; then
  cat pyproject.toml | grep -i -A5 -B5 'pyhive\|thrift\|test\|dev' || echo "No matches found, showing full file structure:"
  echo "---"
  grep -E '^\[.*\]|pyhive|thrift' pyproject.toml || true
fi

echo
echo "== Content of dev-requirements.txt =="
if [ -f "dev-requirements.txt" ]; then
  cat dev-requirements.txt
fi

echo
echo "== Full context around line 32 in spark.py (check for conditionals) =="
sed -n '1,40p' "tests/e2e_dbt_project/external_seeders/spark.py"

echo
echo "== Looking for all requirements files =="
find . -name '*requirement*.txt' -type f 2>/dev/null | grep -v '.git'

Repository: elementary-data/elementary

Length of output: 2655


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Full pyproject.toml [tool.poetry.dependencies] section =="
sed -n '/^\[tool\.poetry\.dependencies\]/,/^\[/p' pyproject.toml | head -50

echo
echo "== Check for optional dependencies (extras) =="
sed -n '/^\[tool\.poetry\.extras\]/,/^\[/p' pyproject.toml

echo
echo "== All occurrences of 'pyhive' or 'thrift' in repository =="
rg -i 'pyhive|thrift_sasl|sasl' . --type py --type toml --type txt 2>/dev/null || echo "No matches found"

Repository: elementary-data/elementary

Length of output: 2185


Add pyhive to project dependencies.

Line 32 introduces a hard runtime dependency on pyhive that is missing from all dependency manifests (pyproject.toml, dev-requirements.txt). If the Spark seeding path is executed in CI/e2e tests, the import will fail at runtime. Add pyhive with any required transitive dependencies (e.g., thrift-sasl for SASL authentication) to the appropriate dependency group.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e_dbt_project/external_seeders/spark.py` at line 32, The runtime
import "from pyhive import hive" introduces a missing dependency; add "pyhive"
(and optional transitive packages such as "thrift-sasl" if SASL auth is used) to
the project's dependency manifests—e.g., add pyhive to the main dependencies in
pyproject.toml (or the relevant extras group used for e2e/CI) and also to
dev-requirements.txt so CI/e2e environments install it; ensure the versions are
compatible with existing thrift libraries and update lockfiles if present.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
tests/e2e_dbt_project/external_seeders/dremio.py (1)

257-283: ⚠️ Potential issue | 🟠 Major

Do not return from the 409 branch unless update succeeds.

Line [282] exits even when GET by-path or PUT update fails, which skips the v2 fallback/retry path and can leave SeedFiles broken.

🔧 Minimal fix
             if resp.status_code == 409:
                 # Source already exists – update it with PUT
                 print("  SeedFiles source exists, updating...")
                 # Get the existing source id + tag for the update
                 get_resp = requests.get(
                     f"http://{self.dremio_host}:{self.dremio_port}/api/v3/catalog/by-path/SeedFiles",
                     headers=headers,
                     timeout=self.HTTP_TIMEOUT,
                 )
                 if get_resp.status_code == 200:
                     existing = get_resp.json()
                     payload["id"] = existing["id"]
                     payload["tag"] = existing.get("tag", "")
                     put_resp = requests.put(
                         f"http://{self.dremio_host}:{self.dremio_port}/api/v3/catalog/{existing['id']}",
                         headers=headers,
                         json=payload,
                         timeout=self.HTTP_TIMEOUT,
                     )
                     if put_resp.status_code == 200:
                         print("  SeedFiles S3 source updated")
                         return
                     print(
                         f"  Warning: update returned {put_resp.status_code}: {put_resp.text}"
                     )
-                return
+                else:
+                    print(
+                        f"  Warning: failed fetching existing source "
+                        f"({get_resp.status_code}: {get_resp.text[:200]})"
+                    )
+                # Continue to v2 fallback / retry.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e_dbt_project/external_seeders/dremio.py` around lines 257 - 283, The
409 branch currently returns early even when the GET by-path or PUT update
fails; update the logic in the block handling resp.status_code == 409
(referencing payload, get_resp, existing, put_resp) so that you only return
immediately when the PUT succeeds (put_resp.status_code == 200); if
get_resp.status_code != 200 or the PUT does not succeed, do not return—log or
print the warning but let execution continue so the v2 fallback/retry path can
run and attempt to recover SeedFiles.
🧹 Nitpick comments (1)
tests/e2e_dbt_project/external_seeders/dremio.py (1)

23-39: Avoid silent parse failures in _docker_defaults.

Lines [38-39] and [54-55] swallow all exceptions, which makes credential-default parsing failures hard to diagnose in CI.

🔧 Suggested tweak
-    except Exception:
-        pass
+    except Exception as e:
+        print(f"  Warning: failed reading defaults from {compose_path}: {e}")
@@
-    except Exception:
-        pass
+    except Exception as e:
+        print(f"  Warning: failed reading defaults from {setup_path}: {e}")

Also applies to: 45-55

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e_dbt_project/external_seeders/dremio.py` around lines 23 - 39, The
try/except in _docker_defaults currently swallows all exceptions during parsing
of compose_path and environment items (around the block that opens compose_path,
iterates services -> "dremio-minio" -> "environment", and sets
defaults["MINIO_ACCESS_KEY"/"MINIO_SECRET_KEY"]) which hides CI failures; change
the broad except to either narrow exceptions you expect (e.g.,
FileNotFoundError, yaml.YAMLError, ValueError) or log the exception details and
context before continuing (use logging.exception or raise a wrapped error) so
failures are visible — apply the same change to the second try/except that
handles the other parse block so neither silent-swallowing try/except hides
parsing errors.
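A simplified sketch of the narrow-except pattern the comment suggests. read_defaults and its line-based parsing are hypothetical stand-ins for _docker_defaults, not its actual YAML logic:

```python
import logging

def read_defaults(path: str) -> dict:
    defaults = {}
    try:
        with open(path) as f:
            for line in f:
                if "=" in line:
                    key, value = line.strip().split("=", 1)
                    defaults[key] = value
    except (FileNotFoundError, ValueError) as e:
        # Narrow except: only the expected failure modes are caught and
        # logged; anything unexpected still propagates for CI to surface.
        logging.warning("failed reading defaults from %s: %s", path, e)
    return defaults

print(read_defaults("/nonexistent/compose.yml"))  # {} (warning logged)
```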
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/e2e_dbt_project/external_seeders/spark.py`:
- Around line 54-62: The current logic creates empty tables with "CREATE TABLE
IF NOT EXISTS" which preserves prior data; change the empty-seed path in
spark.py to ensure a clean table by using "CREATE OR REPLACE TABLE" (or by
issuing "DROP TABLE IF EXISTS {q(seed_schema)}.{q(table_name)}" then a plain
CREATE TABLE) so header-only CSVs do not retain old rows; update the sql string
built for the empty-CSV branch (the same variable passed to cursor.execute)
accordingly and keep the existing try/except that appends to failures to
preserve error handling.

---

Duplicate comments:
In `@tests/e2e_dbt_project/external_seeders/dremio.py`:
- Around line 257-283: The 409 branch currently returns early even when the GET
by-path or PUT update fails; update the logic in the block handling
resp.status_code == 409 (referencing payload, get_resp, existing, put_resp) so
that you only return immediately when the PUT succeeds (put_resp.status_code ==
200); if get_resp.status_code != 200 or the PUT does not succeed, do not
return—log or print the warning but let execution continue so the v2
fallback/retry path can run and attempt to recover SeedFiles.

---

Nitpick comments:
In `@tests/e2e_dbt_project/external_seeders/dremio.py`:
- Around line 23-39: The try/except in _docker_defaults currently swallows all
exceptions during parsing of compose_path and environment items (around the
block that opens compose_path, iterates services -> "dremio-minio" ->
"environment", and sets defaults["MINIO_ACCESS_KEY"/"MINIO_SECRET_KEY"]) which
hides CI failures; change the broad except to either narrow exceptions you
expect (e.g., FileNotFoundError, yaml.YAMLError, ValueError) or log the
exception details and context before continuing (use logging.exception or raise
a wrapped error) so failures are visible — apply the same change to the second
try/except that handles the other parse block so neither silent-swallowing
try/except hides parsing errors.


📥 Commits

Reviewing files that changed from the base of the PR and between 19652db and a63ce6b.

📒 Files selected for processing (2)
  • tests/e2e_dbt_project/external_seeders/dremio.py
  • tests/e2e_dbt_project/external_seeders/spark.py

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (3)
tests/e2e_dbt_project/external_seeders/spark.py (1)

42-42: Use _q() for schema identifier quoting.

Line 42 uses raw backtick quoting for seed_schema while the rest of the file consistently uses the _q() helper for identifier escaping. For consistency and to maintain the quoting invariant established by the helper:

♻️ Proposed fix
         print(f"Creating schema '{seed_schema}'...")
-        cursor.execute(f"CREATE DATABASE IF NOT EXISTS `{seed_schema}`")
+        cursor.execute(f"CREATE DATABASE IF NOT EXISTS {_q(seed_schema)}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e_dbt_project/external_seeders/spark.py` at line 42, The CREATE
DATABASE statement uses raw backticks around seed_schema instead of the
project's identifier-quoting helper; update the cursor.execute call that builds
the SQL (the line with cursor.execute(f"CREATE DATABASE IF NOT EXISTS
`{seed_schema}`")) to call the _q() helper for escaping (e.g. use
_q(seed_schema) inside the CREATE DATABASE IF NOT EXISTS expression) so it
follows the same quoting invariant as the rest of the file.
tests/e2e_dbt_project/external_seeders/dremio.py (2)

360-387: Consider idempotent table recreation to prevent duplicate data on reruns.

The current pattern uses CREATE TABLE IF NOT EXISTS (line 362) followed by COPY INTO (line 385). If the table already exists from a previous run with data, COPY INTO will append rows, causing duplicates.

Unlike the Spark seeder which drops before creating, this could lead to unexpected state during local development reruns.

♻️ Proposed fix for idempotent table loading
             # Create empty Iceberg table with VARCHAR columns
             col_defs = ", ".join(f'"{c}" VARCHAR' for c in cols)
-            create_sql = f"CREATE TABLE IF NOT EXISTS {fqn} ({col_defs})"
+            # Drop and recreate to ensure idempotent reruns
+            drop_sql = f"DROP TABLE IF EXISTS {fqn}"
+            create_sql = f"CREATE TABLE {fqn} ({col_defs})"
             try:
+                self._sql(token, drop_sql, timeout=60)
                 self._sql(token, create_sql, timeout=60)
                 created_tables.append(table_name)
             except Exception as e:

38-39: Silent exception handling obscures configuration issues.

The try-except-pass blocks when reading docker-compose.yml and dremio-setup.sh silently swallow all exceptions. While failures here aren't fatal (defaults are optional), logging at debug level would help diagnose misconfiguration.

♻️ Optional: Add debug logging
+import logging
+
+logger = logging.getLogger(__name__)
+
 def _docker_defaults() -> dict[str, str]:
     ...
     try:
         with open(compose_path) as fh:
             cfg = yaml.safe_load(fh)
         ...
     except Exception:
-        pass
+        logger.debug("Could not read docker-compose.yml for defaults", exc_info=True)

     ...
     try:
         with open(setup_path) as fh:
             content = fh.read()
         ...
     except Exception:
-        pass
+        logger.debug("Could not read dremio-setup.sh for defaults", exc_info=True)

Also applies to: 54-55

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e_dbt_project/external_seeders/dremio.py` around lines 38 - 39,
Replace the silent "except Exception: pass" blocks in dremio.py (the ones
wrapping reading docker-compose.yml and dremio-setup.sh) with debug-level
logging that records the exception and contextual message; ensure a logger is
created (e.g., logger = logging.getLogger(__name__)) and call
logger.debug("Failed to read docker-compose.yml, using defaults", exc_info=True)
and similarly for the dremio-setup.sh block so configuration failures are
visible in debug logs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@tests/e2e_dbt_project/external_seeders/dremio.py`:
- Around line 38-39: Replace the silent "except Exception: pass" blocks in
dremio.py (the ones wrapping reading docker-compose.yml and dremio-setup.sh)
with debug-level logging that records the exception and contextual message;
ensure a logger is created (e.g., logger = logging.getLogger(__name__)) and call
logger.debug("Failed to read docker-compose.yml, using defaults", exc_info=True)
and similarly for the dremio-setup.sh block so configuration failures are
visible in debug logs.

In `@tests/e2e_dbt_project/external_seeders/spark.py`:
- Line 42: The CREATE DATABASE statement uses raw backticks around seed_schema
instead of the project's identifier-quoting helper; update the cursor.execute
call that builds the SQL (the line with cursor.execute(f"CREATE DATABASE IF NOT
EXISTS `{seed_schema}`")) to call the _q() helper for escaping (e.g. use
_q(seed_schema) inside the CREATE DATABASE IF NOT EXISTS expression) so it
follows the same quoting invariant as the rest of the file.


📥 Commits

Reviewing files that changed from the base of the PR and between a63ce6b and 2263199.

📒 Files selected for processing (2)
  • tests/e2e_dbt_project/external_seeders/dremio.py
  • tests/e2e_dbt_project/external_seeders/spark.py

devin-ai-integration bot and others added 2 commits March 3, 2026 11:50
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
@devin-ai-integration
Contributor Author

Follow-up on CodeRabbit comment about pyhive in SparkExternalSeeder: CI installs Spark deps via dbt-spark[PyHive], which brings in pyhive. I also added an explicit ImportError -> RuntimeError with a clear install hint so local runs fail with an actionable message.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (5)
.github/workflows/test-warehouse.yml (1)

189-196: Prefer a case statement for adapter membership.

The loop works, but a case statement is shorter and easier to maintain as the adapter list evolves.

♻️ Optional simplification
-          DOCKER_ADAPTERS="postgres clickhouse trino dremio duckdb spark"
-          IS_DOCKER=false
-          for adapter in $DOCKER_ADAPTERS; do
-            if [ "$adapter" = "${{ inputs.warehouse-type }}" ]; then
-              IS_DOCKER=true
-              break
-            fi
-          done
+          IS_DOCKER=false
+          case "${{ inputs.warehouse-type }}" in
+            postgres|clickhouse|trino|dremio|duckdb|spark) IS_DOCKER=true ;;
+          esac
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/test-warehouse.yml around lines 189 - 196, Replace the
manual for-loop membership check (iterating DOCKER_ADAPTERS and setting
IS_DOCKER) with a shell case statement that matches "${{ inputs.warehouse-type
}}" against the adapter patterns; update the block referencing DOCKER_ADAPTERS,
IS_DOCKER, adapter and inputs.warehouse-type to use case ... esac to set
IS_DOCKER=true on a match and IS_DOCKER=false otherwise, keeping the same
behavior but with a shorter, easier-to-maintain structure.
elementary/monitor/dbt_project/macros/overrides/dateadd.sql (1)

18-36: Consider adding case-insensitive datepart matching for robustness.

The datepart comparisons are case-sensitive. Although dbt convention is to pass lowercase datepart values, adding a .lower() normalization would make this macro defensive against non-lowercase inputs from upstream code. This aligns with dbt best practices for cross-warehouse compatibility.

♻️ Suggested fix
 {% macro dremio__dateadd(datepart, interval, from_date_or_timestamp) %}
+    {% set datepart = datepart | lower %}
     {% set interval = interval | string %}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@elementary/monitor/dbt_project/macros/overrides/dateadd.sql` around lines 18
- 36, Normalize the datepart variable to lowercase before the comparisons in the
dateadd macro (use the Jinja filter for lower on the datepart used in the
if/elif checks) so all branches (year, quarter, month, week, day, hour, minute,
second) match case-insensitively; update the exceptions.raise_compiler_error
call to reference the normalized value or include the original input for clarity
if desired. Ensure you modify the conditional checks that reference datepart and
the error message generation in the dateadd macro to use the lower-cased form.
tests/e2e_dbt_project/external_seeders/spark.py (1)

43-45: Harden connection cleanup for partial initialization failures.

If cursor creation fails after opening the connection, cleanup is not guaranteed. Initialize conn/cursor defensively and close conditionally in finally.

♻️ Proposed refactor
-        conn = hive.Connection(host=host, port=port, username="dbt")
-        cursor = conn.cursor()
-        try:
+        conn = None
+        cursor = None
+        try:
+            conn = hive.Connection(host=host, port=port, username="dbt")
+            cursor = conn.cursor()
             print(f"Creating schema '{seed_schema}'...")
             cursor.execute(f"CREATE DATABASE IF NOT EXISTS `{seed_schema}`")
@@
-        finally:
-            cursor.close()
-            conn.close()
+        finally:
+            if cursor is not None:
+                cursor.close()
+            if conn is not None:
+                conn.close()

Also applies to: 90-92

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e_dbt_project/external_seeders/spark.py` around lines 43 - 45, The
connection/cursor cleanup is unsafe if cursor creation fails; initialize conn =
None and cursor = None before opening/creating them in the spark seeder and wrap
the connection/cursor logic in try/finally so you conditionally close
cursor.close() if cursor is not None and conn.close() if conn is not None; apply
the same defensive pattern to the second occurrence around the other
hive.Connection/cursor creation (the block currently at lines with
hive.Connection(host=host, port=port, username="dbt") and subsequent
conn.cursor()) to ensure partial initialization failures are cleaned up.
tests/e2e_dbt_project/external_seeders/dremio.py (2)

38-39: Consider logging non-file parsing failures in _docker_defaults.

The broad except ...: pass blocks hide malformed YAML/regex issues and make auth failures harder to diagnose.

♻️ Proposed refinement
-    except Exception:
-        pass
+    except FileNotFoundError:
+        pass
+    except Exception as e:
+        print(f"  Warning: failed parsing docker defaults from {compose_path}: {e}")
@@
-    except Exception:
-        pass
+    except FileNotFoundError:
+        pass
+    except Exception as e:
+        print(f"  Warning: failed parsing dremio defaults from {setup_path}: {e}")

Also applies to: 54-55

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e_dbt_project/external_seeders/dremio.py` around lines 38 - 39, The
broad "except: pass" in _docker_defaults (and the second similar block)
suppresses errors — change each to "except Exception as e:" and log the
exception details (e.g., using logger.exception or logger.error with the
exception) including context about which YAML/regex or auth step failed; only
swallow the exception if it is specifically a FileNotFoundError (or other
expected file-read errors), otherwise log and re-raise or return a clear failure
so malformed YAML/regex/auth errors are visible during test runs.

9-10: Resolve YAML library alignment: use declared ruamel.yaml API instead of PyYAML.

This file imports PyYAML (import yaml, yaml.safe_load() at lines 24–26), but the project declares ruamel.yaml as its YAML dependency in pyproject.toml. PyYAML is not explicitly declared, creating an implicit dependency assumption.

Switch to the ruamel.yaml API for consistency with the project's declared dependencies.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e_dbt_project/external_seeders/dremio.py` around lines 9 - 10, The
file imports PyYAML and calls yaml.safe_load, but the project uses ruamel.yaml;
replace the PyYAML usage by importing ruamel.yaml's API (e.g., from ruamel.yaml
import YAML), create a YAML() instance (e.g., yaml = YAML()), and replace any
yaml.safe_load(...) calls with yaml.load(...) (or yaml.load(stream) for file
streams) so the module uses ruamel.yaml consistently; update the top-level
import and all references to the symbol yaml.safe_load to the ruamel YAML
instance in the ExternalSeeder-related code.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/test-warehouse.yml:
- Around line 96-105: The SEED_HASH calculation currently concatenates
generate_data.py, dbt_project.yml, docker-compose.yml and profiles.yml.j2 but
omits actual seed files, so include the seed data files (e.g., the data/**
CSVs/parquets) in the hash computation; update the SEED_HASH generation (the
block that defines SEED_HASH) to also read or list the seed files (for example
by catting matching files or appending the output of git ls-files data/**) so
changes to seed content will change the hash—refer to the SEED_HASH variable and
the listed files generate_data.py, dbt_project.yml, docker-compose.yml, and
profiles.yml.j2 when editing.
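The workflow computes the hash in shell, but the idea of folding seed-file contents into the digest can be sketched in Python; the file names and directory layout here are assumptions, not the project's actual paths.

```python
import hashlib
from pathlib import Path

def seed_hash(config_files: list[str], seed_dir: str) -> str:
    """Hash config files plus every seed data file, so editing a seed CSV
    invalidates the cache. Paths and glob pattern are illustrative."""
    h = hashlib.sha256()
    for path in config_files:
        h.update(Path(path).read_bytes())
    # Sort for a deterministic order; include names so renames also bust the cache.
    for path in sorted(Path(seed_dir).rglob("*.csv")):
        h.update(path.name.encode())
        h.update(path.read_bytes())
    return h.hexdigest()
```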

In `@elementary/monitor/dbt_project/macros/overrides/dateadd.sql`:
- Around line 12-37: The macro dremio__dateadd uses JDBC constants like
SQL_TSI_YEAR in TIMESTAMPADD calls which Dremio doesn't support; update each
TIMESTAMPADD call in the dremio__dateadd macro (the branches handling datepart
== 'year','quarter','month','week','day','hour','minute','second') to use bare
unit keywords YEAR, QUARTER, MONTH, WEEK, DAY, HOUR, MINUTE, SECOND respectively
(e.g., replace TIMESTAMPADD(SQL_TSI_YEAR, ...) with TIMESTAMPADD(YEAR, ...))
while keeping the CASTs and interval handling unchanged.

In `@tests/e2e_dbt_project/external_seeders/dremio.py`:
- Around line 367-369: The seeder currently uses "CREATE TABLE IF NOT EXISTS"
(variables col_defs, create_sql, fqn, qi) and then COPY INTO which leaves prior
data intact; change the flow to make seeding idempotent by clearing the table
before loading: if the table exists drop it or run a TRUNCATE/DELETE against fqn
(use the same fqn/qi references) and then CREATE TABLE (or CREATE TABLE IF NOT
EXISTS followed by TRUNCATE) before executing COPY INTO; apply the same change
to the other occurrence around the COPY INTO block at the second location
referenced (the block at ~391-394) so each run resets the table state prior to
loading.
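The drop-then-create sequence suggested here can be sketched as a small statement builder; `fqn` and the VARCHAR column list mirror the seeder's variables, but the statement text is illustrative.

```python
def idempotent_load_statements(fqn: str, cols: list[str]) -> list[str]:
    """Build the SQL run before COPY INTO so reruns start from an empty table."""
    col_defs = ", ".join(f'"{c}" VARCHAR' for c in cols)
    return [
        f"DROP TABLE IF EXISTS {fqn}",       # reset any prior run's data
        f"CREATE TABLE {fqn} ({col_defs})",  # recreate empty before COPY INTO
    ]
```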

In `@tests/e2e_dbt_project/external_seeders/spark.py`:
- Around line 71-80: The SQL embeds container_path directly into a single-quoted
literal inside cursor.execute for creating the temp view (tmp_view), which will
break if container_path contains single quotes; change the code that builds the
SQL for cursor.execute (the CREATE OR REPLACE TEMPORARY VIEW ... OPTIONS (path
'...')) to escape single quotes in container_path (e.g., replace each ' with ''
before formatting) or otherwise use a safe parameterization; ensure you apply
the escaping where container_path is used so the generated SQL string remains
valid.
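A minimal sketch of the escaping being asked for, assuming a CSV-backed temp view; the view name and OPTIONS shown are illustrative, not the seeder's exact statement.

```python
def temp_view_sql(view_name: str, container_path: str) -> str:
    """Escape single quotes before embedding a path in a quoted SQL literal."""
    escaped = container_path.replace("'", "''")  # double any embedded quotes
    return (
        f"CREATE OR REPLACE TEMPORARY VIEW {view_name} "
        f"USING csv OPTIONS (path '{escaped}', header 'true')"
    )
```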

---

Nitpick comments:
In @.github/workflows/test-warehouse.yml:
- Around line 189-196: Replace the manual for-loop membership check (iterating
DOCKER_ADAPTERS and setting IS_DOCKER) with a shell case statement that matches
"${{ inputs.warehouse-type }}" against the adapter patterns; update the block
referencing DOCKER_ADAPTERS, IS_DOCKER, adapter and inputs.warehouse-type to use
case ... esac to set IS_DOCKER=true on a match and IS_DOCKER=false otherwise,
keeping the same behavior but with a shorter, easier-to-maintain structure.

In `@elementary/monitor/dbt_project/macros/overrides/dateadd.sql`:
- Around line 18-36: Normalize the datepart variable to lowercase before the
comparisons in the dateadd macro (use the Jinja filter for lower on the datepart
used in the if/elif checks) so all branches (year, quarter, month, week, day,
hour, minute, second) match case-insensitively; update the
exceptions.raise_compiler_error call to reference the normalized value or
include the original input for clarity if desired. Ensure you modify the
conditional checks that reference datepart and the error message generation in
the dateadd macro to use the lower-cased form.

In `@tests/e2e_dbt_project/external_seeders/dremio.py`:
- Around line 38-39: The broad "except: pass" in _docker_defaults (and the
second similar block) suppresses errors — change each to "except Exception as
e:" and log the exception details (e.g., using logger.exception or logger.error
with the exception) including context about which YAML/regex or auth step
failed; only swallow the exception if it is specifically a FileNotFoundError (or
other expected file-read errors), otherwise log and re-raise or return a clear
failure so malformed YAML/regex/auth errors are visible during test runs.
- Around line 9-10: The file imports PyYAML and calls yaml.safe_load, but the
project uses ruamel.yaml; replace the PyYAML usage by importing ruamel.yaml's
API (e.g., from ruamel.yaml import YAML), create a YAML() instance (e.g., yaml =
YAML()), and replace any yaml.safe_load(...) calls with yaml.load(...) (or
yaml.load(stream) for file streams) so the module uses ruamel.yaml consistently;
update the top-level import and all references to the symbol yaml.safe_load to
the ruamel YAML instance in the ExternalSeeder-related code.

In `@tests/e2e_dbt_project/external_seeders/spark.py`:
- Around line 43-45: The connection/cursor cleanup is unsafe if cursor creation
fails; initialize conn = None and cursor = None before opening/creating them in
the spark seeder and wrap the connection/cursor logic in try/finally so you
conditionally close cursor.close() if cursor is not None and conn.close() if
conn is not None; apply the same defensive pattern to the second occurrence
around the other hive.Connection/cursor creation (the block currently at lines
with hive.Connection(host=host, port=port, username="dbt") and subsequent
conn.cursor()) to ensure partial initialization failures are cleaned up.


📥 Commits

Reviewing files that changed from the base of the PR and between 2263199 and a5dcef9.

📒 Files selected for processing (6)
  • .github/workflows/test-warehouse.yml
  • elementary/monitor/dbt_project/macros/overrides/dateadd.sql
  • tests/e2e_dbt_project/docker/dremio/dremio-setup.sh
  • tests/e2e_dbt_project/external_seeders/base.py
  • tests/e2e_dbt_project/external_seeders/dremio.py
  • tests/e2e_dbt_project/external_seeders/spark.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/e2e_dbt_project/docker/dremio/dremio-setup.sh

devin-ai-integration bot and others added 3 commits March 3, 2026 12:02
- test-warehouse.yml: replace for-loop with case statement for Docker adapter check
- dateadd.sql: use bare TIMESTAMPADD keywords instead of SQL_TSI_* constants, add case-insensitive datepart matching
- spark.py: harden connection cleanup with None-init + conditional close, escape single quotes in container_path
- dremio.py: switch from PyYAML to ruamel.yaml for project consistency, log non-file parsing failures, make seeding idempotent with DROP TABLE before CREATE TABLE

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…esigned to fail)

The e2e project has tests tagged error_test and should_fail that are
intentionally designed to fail. The dbt test step needs continue-on-error
so these expected failures don't block the CI job. The edr monitoring
steps that follow validate the expected outcomes.

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
This override normalizes ISO 8601 format to 'YYYY-MM-DD HH:MM:SS.sss'.
#}

{%- macro dremio__edr_cast_as_timestamp(timestamp_field) -%}
Collaborator

can we open a PR to dbt-data-reliability to include this instead?

Contributor Author

Created dbt-data-reliability PR #957 which updates the existing dremio__edr_cast_as_timestamp with the full ISO 8601 normalization (removing T separator and Z suffix). Once merged, I'll remove this override from the elementary CLI project.

- Outputs just TIMESTAMPADD(...) as an expression (no "select" prefix)
#}

{% macro dremio__dateadd(datepart, interval, from_date_or_timestamp) %}
Collaborator

can we maybe open a PR to dbt-data-reliability to instead override edr_dateadd for dremio with this logic?
(can be the same PR as the other comment I mentioned)

Contributor Author

Created dbt-data-reliability PR #957 to move this logic there as dremio__edr_dateadd (using adapter.dispatch). Once merged, I'll remove this override from the elementary CLI project.

devin-ai-integration bot and others added 3 commits March 3, 2026 17:22
…bt-data-reliability

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…eliability

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
@haritamar haritamar merged commit e8dabbf into master Mar 3, 2026
20 checks passed
@haritamar haritamar deleted the devin/ELE-5266-1772368864-code-ci branch March 3, 2026 22:33